Vacuum, Freezing, and Wraparound
The mechanics of VACUUM and freezing in PostgreSQL: VACUUM's three jobs, its phase pipeline (heap scan, index vacuum, heap vacuum, cleanup, truncate), the difference between VACUUM, VACUUM (FULL), CLUSTER, and pg_repack, the 32-bit transaction id wraparound danger and the autovacuum_freeze_max_age / vacuum_failsafe_age / shutdown safety stops, how freezing actually marks a tuple with HEAP_XMIN_FROZEN, the cooperation of vacuum_freeze_min_age and vacuum_freeze_table_age, the parallel multixact wraparound, the diagnostic queries on pg_database and pg_class, and the operational playbook for a high-age incident.
Learning outcomes
This is the page that closes the concurrency-and-maintenance loop. Every earlier page in this track set up a debt: mvcc-and-tuple-visibility showed why dead tuples are created, heap-pages-and-toast showed where they live, autovacuum-tuning showed how the daemon decides when to clean them, and table-bloat-and-hot-updates showed the day-to-day waste they cause. VACUUM is the worker that pays the debt, and freezing is the silent second job that keeps the database from shutting itself down. If autovacuum tuning is the policy, this page is the mechanics underneath it.
After studying this page, you can:
- State the three jobs of VACUUM and explain why none of them are optional at scale.
- Walk a single VACUUM through its phases, and read
pg_stat_progress_vacuumto tell which phase is stuck. - Explain the 32-bit transaction id wraparound problem precisely, and the safety stops PostgreSQL puts between you and it.
- Choose between
VACUUM,VACUUM (FULL),VACUUM (FREEZE),CLUSTER, andpg_repack, and say what each costs. - Diagnose a “database age” problem with
pg_databaseandpg_class, and run the right playbook before PostgreSQL reaches the shutdown threshold. - Recognise the recurring failure modes: long transactions blocking freeze, hot standby feedback holding the primary back, manual VACUUM scripts fighting autovacuum, and the truncate phase quietly grabbing AccessExclusiveLock.
Before we dive in
You should be comfortable with the building blocks the rest of this track established. A row in PostgreSQL is a stack of tuple versions, each stamped with an xmin (the transaction that wrote it) and an xmax (the transaction that ended it). A snapshot is the rule that decides which versions are visible. A dead tuple is a version that no current or future snapshot will ever need. Bloat is the on-disk space dead tuples occupy. The xmin horizon is the oldest transaction id any live snapshot still depends on. We rely on each of these without re-teaching them; the mvcc-and-tuple-visibility, heap-pages-and-toast, autovacuum-tuning, and table-bloat-and-hot-updates pages cover the details.
A few new terms you will need on this page. A transaction id (xid) is the 32-bit integer PostgreSQL hands out to writing transactions. Two markers matter: relfrozenxid on every table and datfrozenxid on every database are the oldest unfrozen xid still present in that table or database. The function age(x) returns how many transactions have happened since x, which is how you measure distance to wraparound. A frozen tuple is a tuple PostgreSQL has marked as visible to every future transaction, so its xmin no longer needs to be compared. The visibility map is the per-page bitmap that says “this page holds only all-visible tuples” and, separately, “this page holds only all-frozen tuples”. The free space map records how much free space each page has so inserts can find a home. Hold those. Everything below is built from them.
Mental Model
The wrong model, and the one almost everyone starts with, is that VACUUM is a single thing whose only job is to delete dead rows. Under that model, more VACUUM means a smaller table, and if you never had bloat you would never need to vacuum. So an append-only table that never updates needs no vacuuming, and a smaller n_dead_tup means a healthier database.
That model misses the most dangerous half of what VACUUM does. PostgreSQL recycles 32-bit transaction ids. Every tuple ever written carries the xid of the transaction that wrote it. If a tuple’s xid ever falls more than about two billion transactions in the past, PostgreSQL can no longer tell whether that xid is in the past or the future, and the row would silently disappear from view. The only thing that saves you is freezing: marking very old visible tuples as “valid for everyone, forever” before their xid ages out. An append-only table that has never had a dead tuple still desperately needs vacuuming, because every insert started a freezing clock that VACUUM is the only tool that can stop.
The better model is two clocks, not one. The bloat clock ticks with every UPDATE and DELETE, and you can let it run a little behind; the worst case is a slow scan. The wraparound clock ticks with every transaction, full stop, and if you let it reach the failsafe line PostgreSQL stops the world. VACUUM is the one process that resets both clocks. Autovacuum is your scheduler; this page is the engine.
Breaking it down
1. The three jobs of VACUUM
VACUUM is one command that does three different jobs at once. Most operational mistakes happen because someone is thinking about one of the three and ignoring the other two. Name them all and they stop being mysterious.
The first job is reclaiming space within pages. VACUUM scans the heap, finds tuples whose xmax committed before the xmin horizon, and marks the item pointers that referenced them as free. Future inserts and updates can reuse that space on the same page. Plain VACUUM does not return space to the operating system; the file stays the same size on disk. It just makes the holes inside the file reusable. This is the job people associate with “running VACUUM”.
The second job is maintaining the free space map and visibility map. The free space map records, per page, roughly how much free space is available, so the next INSERT can find a page with room. The visibility map records two bits per page: all-visible (every tuple on this page is visible to every current snapshot, so index-only scans can answer from the index alone) and all-frozen (every tuple on this page is frozen, so a future freeze pass can skip it). Without VACUUM keeping these maps current, inserts hunt blindly for room, index-only scans stop helping, and the next anti-wraparound vacuum cannot skip the pages it already finished.
The third job is freezing old tuples to prevent transaction id wraparound. VACUUM looks for tuples whose xmin is older than vacuum_freeze_min_age and marks them frozen. On a freeze-aggressive scan, it freezes every visible tuple it can. This is the job that prevents the database-shutdown outage at the bottom of rung 4. It is silent, it has no symptom you can see in n_dead_tup, and it is what gets people fired.
| Job | What it does | Knob you usually tune |
|---|---|---|
| Reclaim dead-tuple space | Marks dead item pointers free, reusable on the same page | autovacuum_vacuum_scale_factor (when it runs) |
| Maintain FSM and VM | Updates per-page free space and visibility/freeze bitmaps | autovacuum_vacuum_insert_scale_factor for inserts |
| Freeze old tuples | Marks old visible tuples as permanent, advances relfrozenxid | vacuum_freeze_min_age, autovacuum_freeze_max_age |
State this three-way job out loud whenever you reason about VACUUM. A page can be “well vacuumed for dead tuples” and “catastrophically behind on freezing” at the same time. They are not the same metric.
2. The phases of one VACUUM, end to end
VACUUM is not one scan. It is a small pipeline of phases, and each phase has its own cost and its own contention profile. The view pg_stat_progress_vacuum exposes which phase a running vacuum is in, and once you know the phases, that view stops feeling cryptic.
The phases run roughly in this order:
initializing. The worker attaches to the table and sets up. Almost instantaneous.scanning heap. VACUUM walks the heap pages, skipping pages the visibility map marks all-visible (unless this is a freeze pass that must visit them) and collecting the tuple ids (tid) of dead tuples into an in-memory list sized bymaintenance_work_memorautovacuum_work_mem. Most of a vacuum’s wall time lives here.vacuuming indexes. VACUUM scans each index in turn, looking up every collected deadtidand removing the matching index entries. This is one full pass per index over the dead-tid list. If the list fills before the heap scan is done, the worker drains it through every index, empties it, and resumes the heap scan, which is why a too-smallmaintenance_work_memmakes vacuum scan indexes many times instead of once.vacuuming heap. With the index entries gone, VACUUM walks the heap again and marks the dead item pointers free so the space is reusable on the page.cleaning up indexes. Per-index final cleanup (page compaction, internal bookkeeping). Cheap for most index types.truncating heap. If the trailing pages of the table are now fully empty, VACUUM may truncate them off the end and return that space to the operating system. To do this it must take anAccessExclusiveLockon the table briefly. This is the surprise contention source on busy systems: a vacuum that has been invisible for hours suddenly blocks every reader and writer for the truncate window.
flowchart LR
A["initializing"] --> B["scanning heap"]
B --> C["vacuuming indexes (one pass per index)"]
C --> D["vacuuming heap (mark item pointers free)"]
D --> E["cleaning up indexes"]
E --> F["truncating heap (AccessExclusiveLock if trailing pages empty)"]
B -. mem buffer full .-> C
C -. resume heap scan .-> BWatching the phase is the most direct diagnostic vacuum offers. A query like the one below tells you which phase, on which table, and how far through the heap scan you are.
SELECT p.pid,
c.relname,
p.phase,
p.heap_blks_total,
p.heap_blks_scanned,
round(100.0 * p.heap_blks_scanned / nullif(p.heap_blks_total, 0), 1) AS pct,
p.num_dead_item_ids,
p.index_vacuum_count
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid;
The two columns that betray problems are index_vacuum_count and phase. An index_vacuum_count of 3 or more on a single VACUUM means the dead-tid buffer filled mid-scan three times, so each index was scanned three times: raise maintenance_work_mem (or autovacuum_work_mem). A vacuum stuck in truncating heap for minutes is holding AccessExclusiveLock; if this happens repeatedly, set vacuum_truncate = off on the table to suppress the truncate phase.
3. VACUUM FULL and when it earns its place
Plain VACUUM never moves the file. The holes get reused, but a table that was once 200 GB and is now 40 percent dead stays a 200 GB file. To return that space to the operating system you need a rewrite. VACUUM (FULL) is the in-tree command that does it, and it has a very narrow place.
VACUUM (FULL) rewrites the entire table into a new file, copying only live tuples, and swaps the new file in for the old one. The on-disk size shrinks to the live size, the indexes are rebuilt, and the operating system gets the freed space back. The cost is steep:
- It takes an
AccessExclusiveLockon the table for the whole rewrite, so every reader and writer of that table is blocked for as long as it runs. On a 500 GB table that is hours. - It writes a complete copy of the live data, so it needs disk headroom equal to the live size before it gives any space back.
- It is fully logged (unless you tweak settings you should not), so it generates a write-ahead-log volume proportional to the rewrite.
CLUSTER is the same idea with one extra: it rewrites the table in the order of a chosen index, which can improve sequential-scan locality on a heavy time-range query. VACUUM (FULL) is essentially CLUSTER without the ordering.
The third option, pg_repack, is an out-of-tree extension that does the same logical job with online semantics: it builds a shadow copy under a small lock window and swaps it in under a brief exclusive lock at the end. For any production system that cannot afford an AccessExclusiveLock for hours, pg_repack is what you reach for, with VACUUM (FULL) reserved for true emergencies and tables you can take offline.
| Tool | Returns space to OS | Lock | Online? | When to use |
|---|---|---|---|---|
VACUUM | No (only reusable holes) | SHARE UPDATE EXCLUSIVE | Yes | Routine maintenance, the default |
VACUUM (FULL) | Yes (rewrites file) | ACCESS EXCLUSIVE whole run | No | Emergency reclamation on a table you can lock |
CLUSTER | Yes (rewrites in index order) | ACCESS EXCLUSIVE whole run | No | Rewrite to physical key order, rarely |
pg_repack | Yes (shadow-copy and swap) | Brief ACCESS EXCLUSIVE at swap | Mostly yes | Production reclamation without long lock |
So the rule of thumb: if a table has gone past prevention into “the disk is full and I need bytes back today”, use pg_repack first; use VACUUM (FULL) only when you also need to take the table offline anyway. Plain VACUUM run more aggressively (rung 2 of the autovacuum-tuning page) is what keeps you from ever needing either.
4. Why freezing exists: the 32-bit transaction id
This is the rung where the wraparound danger becomes precise, and once you see the arithmetic it stops being scary and becomes a number you watch on a dashboard.
PostgreSQL hands out a transaction id (xid) to every writing transaction. The xid is a 32-bit integer, so there are about 4.29 billion possible values. PostgreSQL treats them as a circular space and uses modular comparison: any xid can be interpreted as either two billion in the past or two billion in the future of the current one, whichever direction is closer on the circle. This split, two billion in each direction, is what gives you a usable visibility test on a finite counter.
So at any moment, of the 4.29 billion xid values, roughly two billion are “in the past” relative to the current xid and two billion are “in the future”. A tuple is visible to you if its xmin is “in the past” of your snapshot’s xid. The danger is that if a tuple’s xmin ever falls more than two billion transactions behind the current xid, the modular comparison flips: an xid that should mean “two billion years ago” starts to look like “two billion years in the future”. Your tuple silently disappears from view. To anyone querying, the row is just gone.
The mitigation has been there since the very first releases of MVCC and is called freezing. Before a tuple’s xmin can age out, VACUUM rewrites its visibility so the comparison no longer depends on the original xid at all. After freezing, the tuple is treated as visible to every current and future transaction, period. The xid slot the tuple used to occupy is now safe to recycle.
Now the safety stops PostgreSQL puts between you and the cliff:
autovacuum_freeze_max_age(default 200 million). When any table’srelfrozenxidfalls this far behind the currentxid, autovacuum launches an anti-wraparound vacuum on it. That vacuum is not optional. It runs even if autovacuum is disabled. It cannot be cancelled lightly, and a cancellation just gets reissued by the launcher.vacuum_failsafe_age(default 1.6 billion on PostgreSQL 14 and later). Whenrelfrozenxidage reaches this threshold, the running vacuum drops its cost-based throttling and stops respecting most cancellation interrupts, so it can finish as fast as the disk allows. This is “VACUUM with the brakes off”.- Shutdown to prevent wraparound (~2 billion). If age ever gets within roughly 1 million transactions of the wraparound point, PostgreSQL refuses to accept new write transactions. You see the message
database is shutting down to prevent transaction ID wraparound data loss. The only fix at this point is to start the server in single-user mode and run VACUUM to advance the horizon. Sentry, Mailchimp, and Joyent have all written public postmortems about this exact outage.
stateDiagram-v2
[*] --> Safe
Safe: Safe (age < 200M, normal autovacuum)
Forced: Forced (age >= autovacuum_freeze_max_age, anti-wraparound vacuum)
Failsafe: Failsafe (age >= vacuum_failsafe_age, throttle off)
Shutdown: Shutdown (age near 2B, refuses writes)
Safe --> Forced: age reaches 200M (default)
Forced --> Failsafe: age reaches 1.6B (PG14 default)
Failsafe --> Shutdown: age within ~1M of 2B
Failsafe --> Safe: vacuum finishes, relfrozenxid advances
Forced --> Safe: vacuum finishes, relfrozenxid advancesThe slider below makes the bands tangible. Drag the age across the four zones and watch where you fall.
Once you have the picture, the worst thing you can do is what well-meaning operators sometimes do: set autovacuum_freeze_max_age very high to “give us more runway”. All that does is push you closer to the shutdown line without any extra cleanup, so when the forced vacuum finally fires it has billions of transactions of freezing to do at once, while the failsafe and shutdown lines stay where they are.
5. How freezing actually marks a tuple
The mechanism for “marking a tuple frozen” has evolved, and the modern version matters when you read tooling output or postmortems. The summary you want in your head is: freezing flips a hint bit in the tuple header, not the xid field itself, so the original xmin survives for debugging.
The original mechanism (PostgreSQL 9.3 and earlier) wrote a literal constant called FrozenTransactionId (the value 2) into the tuple’s xmin slot. Once the slot said “FrozenTransactionId”, the visibility check short-circuited and treated the tuple as visible to everyone. The cost was that the original creating xid was lost the moment the tuple was frozen; forensics on “who created this row” stopped after a freeze.
Modern PostgreSQL (9.4 and later) keeps the original xmin and instead sets a flag bit, HEAP_XMIN_FROZEN, inside the tuple’s t_infomask and t_infomask2. Visibility checks see the flag first and treat the tuple as frozen without looking at the xid. This was a small change with big benefits: the original xmin is still readable, page-level “all-frozen” bookkeeping is exact, and the visibility map can track all-frozen pages separately so a later freeze pass can skip them entirely. When you select xmin from a frozen tuple today you see the original number, not the literal 2.
Two consequences of the modern mechanism are worth carrying. First, the visibility map now has a per-page all-frozen bit in addition to the older all-visible bit. A page marked all-frozen can be skipped by every future freeze scan, which is what makes incremental freezing affordable on huge tables. Second, an aggressive freeze pass is the only thing that flips the all-frozen bit on a page; lazy freezing leaves it off. That is why a server that has never run an aggressive freeze on a giant cold table will, the very first time it has to, do an enormous amount of work even though the data has not changed.
6. The freezing knobs and how they cooperate
Three parameters set how aggressively VACUUM freezes, and they are easier to reason about as a ladder rather than independent dials.
vacuum_freeze_min_age(default 50 million). The minimum age a tuple’sxminmust reach before a normal VACUUM bothers to freeze it. Below this age, VACUUM looks at a tuple but does not freeze. Set lower and VACUUM freezes more aggressively per pass, doing more work but spreading freezing earlier; set higher and VACUUM freezes less, deferring work to later passes. Production large-table tuning often lowers this so freezing keeps up incrementally.vacuum_freeze_table_age(default 150 million). Above this age, VACUUM scans the entire table instead of skipping pages marked all-visible in the visibility map. Below this age, VACUUM uses the visibility map to skip the easy pages. This is the difference between an “incremental” VACUUM and a “whole-table” VACUUM. The whole-table scan is what actually advancesrelfrozenxid, so on a giant table this is the parameter that decides how often you pay the big scan.autovacuum_freeze_max_age(default 200 million). The hard line above which autovacuum forces an anti-wraparound VACUUM on the table. This is the safety stop, not the tuning knob. You can raise it (some shops go to one billion) but only after you have made sure the lower two knobs keeprelfrozenxidadvancing far below it.
The ordering is the key insight: vacuum_freeze_min_age < vacuum_freeze_table_age < autovacuum_freeze_max_age. A normal VACUUM that runs above vacuum_freeze_table_age becomes a whole-table scan that freezes everything older than vacuum_freeze_min_age. A forced anti-wraparound VACUUM (above autovacuum_freeze_max_age) does the same scan but cannot be cancelled. So the only difference between a “gentle whole-table freeze” and “the emergency one” is who started it; the work itself is the same.
-- Make freezing happen earlier and more incrementally on a hot table.
ALTER TABLE events SET (
autovacuum_freeze_min_age = 10000000,
autovacuum_freeze_table_age = 100000000
);
-- Raise the emergency line on a table where you trust the incremental schedule.
ALTER TABLE events SET (autovacuum_freeze_max_age = 1000000000);
A useful manual command for big-table maintenance windows is VACUUM (FREEZE), which sets the effective freeze age to zero and freezes every visible tuple it can. This is how you proactively reset relfrozenxid on a quiet weekend instead of letting autovacuum hit autovacuum_freeze_max_age on a Monday.
7. Multixacts have their own wraparound
Here is the second wraparound problem, the one most operators do not learn about until it bites them. When more than one transaction holds a row-level lock on the same row at the same time (for example two SELECT ... FOR SHARE statements), PostgreSQL cannot squeeze multiple xid values into the single xmax field. So it allocates a multixact id (mxid), records the list of member xid values in a side data structure, and stores the mxid in the row’s xmax instead.
Multixacts are also a 32-bit counter, so they have their own wraparound. The relevant knobs mirror the xid ones:
autovacuum_multixact_freeze_max_age(default 400 million). Whendatminmxidage reaches this, autovacuum forces an anti-wraparound vacuum on the table to freeze multixacts.vacuum_multixact_freeze_min_age(default 5 million) andvacuum_multixact_freeze_table_age(default 150 million) play the same roles as thexidversions, but formxid.
The dangerous part is the workload that produces them silently. Long-running SELECT ... FOR SHARE or SELECT ... FOR KEY SHARE patterns, common under heavy use of foreign keys, can burn through multixact ids far faster than ordinary xid values. The standard age() query everyone runs looks only at relfrozenxid, so it shows green while datminmxid quietly approaches its own line. A multixact wraparound shutdown has exactly the same effect as an xid one: writes refused, the cluster comes down.
Watch both at once. The query in rung 8 explicitly trends age(datfrozenxid) and age(datminmxid) side by side for this reason.
8. Diagnostics: distance to wraparound
This is the section to bookmark. Three queries answer “how close are we?” at three different granularities.
First, database-wide distance for both xid and mxid. Run this on a schedule and alert when any line crosses, say, half of autovacuum_freeze_max_age.
SELECT datname,
age(datfrozenxid) AS xid_age,
mxid_age(datminmxid) AS mxid_age,
current_setting('autovacuum_freeze_max_age')::bigint AS xid_limit,
current_setting('autovacuum_multixact_freeze_max_age')::bigint AS mxid_limit
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
Second, the per-table picture: which tables are nearest to wraparound. The database age is bounded below by the maximum table age in it, so finding the offending table is the same as finding the source of the database-level number.
SELECT n.nspname || '.' || c.relname AS table_name,
age(c.relfrozenxid) AS xid_age,
mxid_age(c.relminmxid) AS mxid_age,
pg_size_pretty(pg_table_size(c.oid)) AS size,
s.last_autovacuum,
s.last_vacuum
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_stat_user_tables s ON s.relid = c.oid
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 20;
Third, the live phase of whatever vacuum is running right now, which tells you whether the freezing work is in flight or stalled. The query in rung 2 covers the columns; the key one to watch during a freeze incident is phase (is it scanning heap and advancing?) and heap_blks_scanned / heap_blks_total (is the percentage moving?).
SELECT p.pid,
c.relname,
p.phase,
p.heap_blks_total,
p.heap_blks_scanned,
p.index_vacuum_count,
a.state,
now() - a.xact_start AS txn_age
FROM pg_stat_progress_vacuum p
JOIN pg_class c ON c.oid = p.relid
JOIN pg_stat_activity a ON a.pid = p.pid;
The signal you want from these three is one of two answers. If both age numbers are moving down toward zero, the system is healthy. If both age numbers are moving up while a vacuum sits in scanning heap not advancing heap_blks_scanned, something is holding the horizon back and freezing cannot progress, which is the playbook in rung 9.
9. The operational playbook for a high-age incident
When you wake up to “table X is at age 1.5 billion and climbing”, the temptation is to do everything at once. The right order is precise. Run it top to bottom.
-
Confirm the horizon is not pinned. A vacuum can only freeze tuples whose creating transaction predates every running snapshot. A single long-open or idle-in-transaction connection blocks the entire freeze, no matter how aggressively vacuum is running. Find it first:
SELECT pid, datname, usename, state, age(backend_xmin) AS xmin_age, now() - xact_start AS txn_age, left(query, 80) AS query FROM pg_stat_activity WHERE backend_xmin IS NOT NULL ORDER BY age(backend_xmin) DESC LIMIT 10;If any row shows a
xmin_agenear the table’s age, that is your bottleneck. Coordinate the end of that transaction (orpg_terminate_backend(pid)) before changing any VACUUM setting; otherwise the running vacuum will keep failing to advance. -
Raise the work memory so each pass collects more. Larger
maintenance_work_memmeans fewer index passes per vacuum. On a giant table this can shorten a vacuum from days to hours.ALTER SYSTEM SET maintenance_work_mem = '4GB'; ALTER SYSTEM SET autovacuum_work_mem = '2GB'; SELECT pg_reload_conf();On PostgreSQL 16 and earlier, anything above about 1 GB per pass is wasted by the old dead-tid storage; on 17 and later the new structure uses the larger value fully.
-
Lift the cost throttle for the duration. Anti-wraparound vacuums respect the cost throttle until they cross
vacuum_failsafe_age. Below that line, you can make the running vacuum go faster by raising the limit cluster-wide:ALTER SYSTEM SET autovacuum_vacuum_cost_limit = 5000; ALTER SYSTEM SET autovacuum_vacuum_cost_delay = 2; SELECT pg_reload_conf(); -
Let the anti-wraparound vacuum finish. Once the horizon is unpinned and the throttle is lifted, the running vacuum will reach the end and advance
relfrozenxid. Do not cancel it; cancellation only delays the same work and the launcher restarts it. -
Do not paper over by raising
autovacuum_freeze_max_age. A common mistake under stress is to bump the parameter from 200 million to 1 billion to “buy time”. This does the opposite. The forced vacuum stops, age keeps climbing toward the failsafe line (still 1.6 billion), and you have less margin, not more. Raise it only as a steady-state policy on a table whose normal vacuum is known to keep up. -
After recovery, fix the cause. Almost every wraparound incident traces back to one of three things: a long transaction left open by the application, an autovacuum cost throttle too low for fast disks, or a giant cold table whose
vacuum_freeze_table_ageandvacuum_freeze_min_agewere never tuned so it never freezes incrementally. Fix the one that fired, and add an alert at half the limit so you get a week’s warning next time.
10. Replication, long transactions, and the failure modes that recur
Two extra mechanisms reach into freezing in ways that surprise people. Both are covered in their own pages; here we name the freezing-side effect.
The first is hot_standby_feedback. When a replica streams from the primary with hot_standby_feedback = on, it sends back the oldest xmin of its running queries. The primary then holds its own horizon at that point so the replica’s snapshots stay valid. The cost is that a long-running query on the replica is just as effective at blocking freezing on the primary as a long transaction on the primary itself. The streaming-and-synchronous-replication page covers the full mechanism; for VACUUM purposes the rule is: if your replica runs hours-long reports and you have it on, expect the primary’s freezing to lag and its bloat to climb. Trade the freshness of replica reads against the freeze lag explicitly.
The second is the long transaction on the primary. The mvcc-and-tuple-visibility page covers the xmin-horizon trap end to end; the freezing-specific version is sharper. A normal autovacuum that hits a held horizon simply does less and tries again later. An anti-wraparound autovacuum that hits a held horizon keeps running, gets nowhere, and the wraparound clock keeps ticking. So a long transaction in normal operation is a bloat problem; a long transaction during an anti-wraparound vacuum is an outage in progress.
The recurring failure modes are the ones to memorise so you spot them by reflex:
- The shutdown-to-prevent-wraparound outage. Cluster refuses writes; logs say so. Cause is always one of: a long transaction blocked freezing for days,
autovacuum_freeze_max_agewas raised without making sure freezing kept up, orautovacuumwas disabled by mistake. The Sentry, Mailchimp, and Joyent postmortems are public; they all read the same way. - The truncate-phase stall. Every reader and writer of a table suddenly blocks for minutes while a vacuum finishes. The vacuum is in
truncating heap, holdingAccessExclusiveLock. Setvacuum_truncate = offon tables where this hurts. - The manual-VACUUM versus autovacuum race. An operator writes a cron job that runs
VACUUMon the busy tables nightly. Autovacuum is also running. The two queue against each other on the same tables, each cancelling the other’s progress, and total throughput drops. Either turn off autovacuum on those tables (and own the schedule) or trust autovacuum and delete the cron job. Do not run both blindly. VACUUM (FULL)during peak. Someone reads “VACUUM does not reclaim space” online and runsVACUUM (FULL)on a 200 GB production table at noon. TheAccessExclusiveLocktakes the table offline for hours. Usepg_repackand a maintenance window instead.- Multixact wraparound on a foreign-key-heavy workload. Standard
xidage looks fine;mxidage is approaching its line. The shutdown that follows is identical to thexidone. Watch both.
flowchart TB
A["Long transaction on primary"] --> B["xmin horizon pinned"]
A2["hot_standby_feedback from replica"] --> B
B --> C["Normal VACUUM cannot freeze old tuples"]
C --> D["age(relfrozenxid) climbs across tables"]
D --> E["Anti-wraparound vacuum forced, still cannot progress"]
E --> F["Failsafe lifts throttle, race continues"]
F --> G["Shutdown to prevent wraparound: writes refused"]The point of the chain is that every failure mode here has one early intervention: end the long transaction, or stop pinning the horizon from the replica. Everything else downstream is a symptom.
Mastery Questions
-
Your overnight on-call alert reads: “
eventstable hasage(relfrozenxid) = 1.4 billionand an autovacuum is running butheap_blks_scannedhas not moved in two hours.” A teammate suggests raisingautovacuum_freeze_max_ageto one billion so the running vacuum can stop. Walk through what you would actually do, in order, and explain why their suggestion is wrong.Answer. Their suggestion is exactly backwards.
autovacuum_freeze_max_ageis the line that starts the forced vacuum; thevacuum_failsafe_age(default 1.6 billion on PostgreSQL 14 and later) and the ~2-billion shutdown line are not relative to it. Raising it to a billion does not save you any margin; you are already at 1.4 billion. The shutdown is now about 600 million transactions away no matter whatautovacuum_freeze_max_agesays. The real order is: first, find the held xmin horizon. The fact thatheap_blks_scannedis not moving with a vacuum running means the vacuum cannot freeze past some transaction id, which means a backend is still alive with a snapshot from that point. Querypg_stat_activityfor the row with the largestage(backend_xmin), confirm it (idle in transaction, long query, or a replica withhot_standby_feedbackon holding it), and end it. Second, raisemaintenance_work_memandautovacuum_work_memto 2 to 4 GB so the running vacuum collects more dead and old tids per pass, scanning indexes fewer times. Third, raiseautovacuum_vacuum_cost_limitto 5000 or so to let the vacuum use the disk; below the failsafe line the vacuum is still throttled. Fourth, let the vacuum finish; do not cancel it. After recovery, set an alert at half ofautovacuum_freeze_max_ageso the next time this happens you have a week of warning, not two hours, and review whether the long-transaction pattern that caused this is in application code or in a replica usinghot_standby_feedback. -
A colleague says, “Our
audit_logtable is append-only and has zero dead tuples, so we have setautovacuum_enabled = offon it to save I/O.” Explain why this is a wraparound bug waiting to fire, and what mechanism specifically protects them today only by accident.Answer. Turning off autovacuum on an append-only table disables every job VACUUM does, not just the dead-tuple reclamation. The two that matter here are visibility-map maintenance and freezing. The freezing job is the one that is about to bite them. Every inserted tuple carries the
xidof the inserting transaction, and thatxidis a 32-bit counter. Once a tuple’sxminis more than roughly two billion transactions in the past, the modular comparison flips and the row silently disappears from view; the safety stop that prevents this is freezing the tuple before that happens. Withautovacuum_enabled = off, no normal VACUUM ever runs on the table, no freezing happens, andrelfrozenxidslides further behind with every transaction. What protects them today is the anti-wraparound vacuum: whenage(relfrozenxid)crossesautovacuum_freeze_max_age(default 200 million), autovacuum forces a vacuum on the table even though they “disabled” autovacuum on it, because the alternative is data loss. So their “saving” is illusory: they have replaced a stream of small, throttled, incremental vacuums with one giant emergency vacuum that ignores the off switch and scans the entire table at once, exactly when the table is at its largest. The fix is to leave autovacuum on, and instead useautovacuum_vacuum_insert_scale_factor(PostgreSQL 13 and later) to drive vacuuming on insert volume so the visibility map stays current and freezing happens incrementally. -
A high-traffic app uses
SELECT ... FOR KEY SHAREheavily because of foreign-key enforcement. Routine monitoring ofage(datfrozenxid)shows healthy numbers, yet the cluster shuts down with “database is not accepting commands to avoid wraparound data loss.” How did this happen, and what was missing from the monitoring?Answer. They were watching only the transaction-id wraparound clock. PostgreSQL has a second 32-bit clock: the multixact id counter. When more than one transaction holds a row-level lock on the same row at the same time, PostgreSQL allocates a multixact id (
mxid) and stores it in the row’sxmaxinstead of a singlexid. Foreign-key enforcement under load is one of the easiest ways to generate multixacts in volume, becauseSELECT ... FOR KEY SHAREis exactly the lock the foreign-key check uses. Themxidcounter has its own wraparound, its own forced-vacuum threshold (autovacuum_multixact_freeze_max_age, default 400 million), and its own shutdown line at roughly two billion. Theirage(datfrozenxid)monitoring was healthy because ordinaryxidconsumption was fine; the cluster shut down becauseage(datminmxid)had reached its line, and nothing was watching it. The monitoring fix is to trend bothage(datfrozenxid)andmxid_age(datminmxid)for every database, alert at half of each limit, and identify the tables with the highestrelminmxidage viapg_classso the source workload can be tuned. The architectural fix is to look at whether the foreign-key checks really need to take that many shared locks, and whether the relationship can be enforced with cheaper patterns.
Sources & evidence14 claims · 4 cited
Parameter names, defaults, phase names, the 32-bit xid wraparound, the cost-based delay accounting, and the multixact freeze thresholds are grounded in the PostgreSQL documentation for Routine Vacuuming, Automatic Vacuuming runtime parameters, MVCC and visibility, and Database Physical Storage. Facts that those sources do not state explicitly (the HEAP_XMIN_FROZEN hint-bit mechanism replacing literal FrozenTransactionId in 9.4+, the vacuum_failsafe_age PG14 default of 1.6B, the named production outages, and the multixact-wraparound shutdown specifics) are treated as stable common knowledge.
- VACUUM does three jobs in one command: it reclaims dead-tuple space within heap pages (making it reusable but not returning it to the operating system), it maintains the free space map and the visibility map (the per-page bitmap with all-visible and all-frozen bits), and it freezes old tuples so transaction ids can be safely recycled.verified
- A single VACUUM proceeds through phases visible in pg_stat_progress_vacuum: initializing, scanning heap, vacuuming indexes (one full pass per index over the collected dead-tid list), vacuuming heap (marking item pointers free), cleaning up indexes, and truncating heap (which requires an AccessExclusiveLock and only runs if trailing pages are fully empty).verified
- If the in-memory dead-tid buffer (sized by maintenance_work_mem or autovacuum_work_mem) fills before the heap scan finishes, VACUUM must drain through every index, empty the buffer, and resume; index_vacuum_count in pg_stat_progress_vacuum at 3 or more on a single vacuum means each index was scanned that many times instead of once.verified
- VACUUM (FULL) rewrites the entire table into a new file copying only live tuples, takes an AccessExclusiveLock for the whole run, and needs disk headroom equal to the live size before it returns any space; CLUSTER is the same rewrite in index order, and pg_repack achieves the same logical result online with only a brief AccessExclusiveLock at swap time.verified
- PostgreSQL's transaction id is a 32-bit integer (about 4.29 billion values) compared modularly with a roughly two-billion past and two-billion future window; if a tuple's xmin ever falls more than ~2 billion transactions behind the current xid, the comparison flips and the row becomes invisible, which is why freezing replaces the visibility check on old tuples before that happens.verified
- Three safety stops sit between normal operation and data loss: autovacuum_freeze_max_age (default 200 million) launches a forced anti-wraparound vacuum that cannot be skipped, vacuum_failsafe_age (PostgreSQL 14 and later default 1.6 billion) drops the cost-based throttle so the running vacuum is no longer rate-limited, and at roughly 2 billion the cluster refuses new write transactions with the 'database is shutting down to prevent transaction ID wraparound data loss' message.stable common knowledge
- PostgreSQL 9.3 and earlier froze a tuple by overwriting its xmin with the literal FrozenTransactionId constant (2), losing the original creator id; PostgreSQL 9.4 and later instead set the HEAP_XMIN_FROZEN hint bit in the tuple's t_infomask so the original xmin is preserved and the visibility map gained a per-page all-frozen bit that lets later freeze scans skip whole pages.stable common knowledge
- Three parameters cooperate as a ladder: vacuum_freeze_min_age (default 50 million) is the minimum age before a normal VACUUM freezes a tuple, vacuum_freeze_table_age (default 150 million) is the age above which VACUUM scans the whole table instead of skipping all-visible pages so relfrozenxid actually advances, and autovacuum_freeze_max_age (default 200 million) is the hard line that forces an anti-wraparound VACUUM.verified
- VACUUM (FREEZE) forces the effective freeze age to zero and freezes every visible tuple it can, which is the proactive manual command for resetting relfrozenxid in a maintenance window instead of waiting for an anti-wraparound vacuum to do it.verified
- When more than one transaction holds a row-level lock on the same row at once (for example concurrent SELECT FOR SHARE or foreign-key-driven SELECT FOR KEY SHARE), PostgreSQL allocates a 32-bit multixact id and stores it in xmax; multixact ids have their own wraparound with autovacuum_multixact_freeze_max_age (default 400 million) and produce the same shutdown-to-prevent-wraparound outage when their counter approaches its limit.stable common knowledge
- Distance to wraparound is monitored per database with SELECT datname, age(datfrozenxid), mxid_age(datminmxid) FROM pg_database and per table with age(relfrozenxid) and mxid_age(relminmxid) on pg_class, and the live phase of a running vacuum is seen in pg_stat_progress_vacuum joined to pg_class.verified
- A stalled anti-wraparound vacuum (heap_blks_scanned not advancing) is almost always caused by a held xmin horizon: an idle-in-transaction connection or a long query holds back backend_xmin so VACUUM cannot freeze past it, and the fix is to find that backend via pg_stat_activity and end it before tuning any vacuum parameter.verified
- Raising autovacuum_freeze_max_age to delay a forced vacuum does not increase wraparound safety margin, because the failsafe (1.6B) and shutdown (~2B) lines are absolute distances from the wraparound point; the operational response under stress is to unpin the horizon, raise maintenance_work_mem and autovacuum_work_mem, raise autovacuum_vacuum_cost_limit, and let the anti-wraparound vacuum run to completion without cancellation.stable common knowledge
- When hot_standby_feedback is on, a replica sends its oldest running-query xmin to the primary so the primary holds its own horizon back to preserve replica snapshots, which means a long-running query on the replica blocks freezing on the primary just as effectively as a long transaction on the primary itself.stable common knowledge
Cited sources
- PostgreSQL Documentation: Routine Vacuuming · PostgreSQL Global Development Group
- PostgreSQL Documentation: Database Physical Storage (Page Layout and TOAST) · PostgreSQL Global Development Group
- PostgreSQL Documentation: Automatic Vacuuming Parameters · PostgreSQL Global Development Group
- PostgreSQL Documentation: Concurrency Control (MVCC) · PostgreSQL Global Development Group