Table Bloat and HOT Updates
How table bloat actually grows, how to measure it with pgstattuple and the canonical bloat-estimating query, why the Heap-Only Tuple (HOT) update path is the single biggest defense against it, how fillfactor and indexed-column updates control HOT eligibility, how to diagnose HOT with pg_stat_user_tables, the lock implications of VACUUM FULL, CLUSTER, pg_repack, and pg_squeeze, REINDEX CONCURRENTLY for index bloat, and why the xmin horizon is the real root cause of most cleanup failures.
Learning outcomes
The mvcc-and-tuple-visibility page ended with a quiet bill: every UPDATE and DELETE leaves a dead tuple, and someone has to clean it up. This page is what happens when that cleanup falls behind, plus the single most important trick PostgreSQL plays to avoid the cost in the first place: Heap-Only Tuple updates, known as HOT. By the end you can look at a 500 GB OLTP table that is “mostly empty” and know exactly what to do.
After studying this page, you can:
- Define table bloat in terms of dead tuples and unreused free space, and quantify it with
pgstattupleand the canonical bloat-estimating query. - Explain why bloat slows queries beyond the obvious storage cost, including the buffer-cache and index angles.
- Describe the HOT mechanism in detail: the two conditions, what the heap looks like after a HOT update, and why no index entry is written.
- Tune
fillfactorfor hot-updated tables and read then_tup_hot_updsignal inpg_stat_user_tables. - Choose between
VACUUM FULL,CLUSTER,pg_repack, andpg_squeezefor an online or offline rewrite, with the lock implications of each. - Diagnose the long-transaction trap as the actual root cause of most “vacuum cannot keep up” tickets.
Before we dive in
You should already be comfortable with the model from the mvcc-and-tuple-visibility page: a row is a stack of physical versions, an UPDATE writes a new version and tombstones the old one with xmax, and VACUUM reclaims dead versions whose death is older than every running snapshot. You should know what a heap page is at the level the heap-pages-and-toast page covers (an 8 KB block with a header, a line-pointer array, and tuples packed from the end), and that a B-tree index entry stores a key plus a ctid pointer into the heap, as the b-tree-index-internals page lays out.
Two new words for this page. Bloat is the gap between the size of the live data in a table and the size the table actually occupies on disk. HOT stands for Heap-Only Tuple, a special kind of update that PostgreSQL performs whenever it can, which avoids most of the bloat cost. Hold those two. Every section is about either measuring bloat, preventing it, or reclaiming it.
Mental Model
The wrong model, and it is the one most engineers carry over from MySQL or Oracle with in-place updates, is that UPDATE cost is proportional to the size of the row being changed. Under that model an update of one boolean column is cheap because only one byte changed. So if a table grows out of control, the problem must be raw write volume.
PostgreSQL does not work that way. The right model is that an UPDATE writes a brand-new tuple every time, and on top of that, by default, writes a brand-new index entry into every index on the table. A table with eight indexes pays for nine writes on each UPDATE, not one. The reason a tiny status flip on a wide row can bring a database to its knees is not the row size; it is the index storm and the dead tuples piling up behind it.
HOT is the carve-out that fixes both at once. When the update touches no indexed column and there is free space on the same page, PostgreSQL writes the new tuple in place on the page, links the old tuple to the new one with t_ctid, and skips every index write. The indexes still point at the old tuple, but a reader following the pointer walks the chain on the page and finds the live version. Keep that picture: a HOT update is a single-page write that the indexes never even know about, and your whole bloat story turns on whether your updates qualify.
Breaking it down
1. What bloat is, and why a table can be 5x its data
Start with the size on disk. A heap relation is a sequence of 8 KB pages, and a page holds two kinds of “stuff that takes space”: live tuples (versions visible to some current or future snapshot) and dead tuples (versions whose xmax committed before the oldest possible snapshot). Bloat is everything that is not live data. It comes in two flavors that you have to keep separate.
Dead-tuple bloat is space held by tuples that are still physically present but logically gone. Until VACUUM runs, that space cannot be reused. A row updated a thousand times has, at least briefly, a thousand dead versions sitting on its pages.
Free-but-fragmented bloat is space that VACUUM has already marked as reusable, but no new insert has filled it yet, and the table will not return it to the operating system. PostgreSQL only shrinks a heap file from the end: if pages 0 through 999 are half-empty but page 1000 is full, the file stays 8 MB. This is why “I deleted half the rows and the table is still the same size” is not a bug.
Add the two together with a heavily updated workload, and a 100 GB logical dataset routinely occupies 300 to 500 GB on disk. A 3x to 5x ratio is common on tables that combine high update volume with even one indexed column that updates touch (we get to why that matters in rung 6).
flowchart LR
A[Live data: 100 GB] --> T[Table on disk: 400 GB]
B[Dead tuples awaiting vacuum] --> T
C[Vacuumed-but-unreused free space] --> T
D[Pages mostly empty after a delete wave] --> TRead that picture from the right. The disk number you see in pg_total_relation_size is the sum of four things, and only the first one is the data the application cares about. Every later rung on this page is a tool for shrinking one of the other three.
2. Measuring bloat with pgstattuple
You cannot fix what you cannot measure, and crude tricks like “compare row count times average row size to file size” lie on tables with TOASTed columns or wide variable-length data. Two tools matter.
The pgstattuple extension is the ground-truth answer. It walks every page of a table and counts live, dead, and free bytes precisely. The cost is that it reads the whole table, so on a 500 GB heap it is expensive; in production you usually point it at a single relation during an incident, not at every table on a schedule.
create extension if not exists pgstattuple;
select * from pgstattuple('orders');
The output you read carefully:
table_len | tuple_count | tuple_len | tuple_percent
dead_tuple_count | dead_tuple_len | dead_tuple_percent
free_space | free_percent
-----------+-------------+-------------+---------------
429496729600 | 812000000 | 178640000000 | 41.6
310000000 | 68200000000 | 15.9
181440000000 | 42.3
Reading that row: the file is 400 GB. About 42 percent of it is live tuples, 16 percent is dead tuples waiting on vacuum, and 42 percent is free space that vacuum has already reclaimed but no insert has filled. Logical data is 167 GB; the table is 2.4x its data. That is a bloat problem.
For dashboards across thousands of tables you cannot afford a full scan, so the community-canonical bloat-check query estimates bloat from pg_class, pg_stats, and the per-column average widths. It is a few hundred lines of SQL maintained in the open (“bloat_check.sql”, “ioguix/pgsql-bloat-estimation”, and similar), and it is what pgwatch, check_postgres, and most monitoring systems run. The estimate is good to within a few percent on most tables and a lot wider on tables with skewed widths or many TOASTed columns, so use it for alerting and pgstattuple for confirmation.
-- Cheap, indicative: estimate from catalog stats.
-- (Real query is long; the shape is: per table, compute
-- expected_bytes from avg row width times row estimate,
-- compare to relpages * 8192, the gap is bloat.)
select relname,
pg_size_pretty(pg_relation_size(oid)) as on_disk,
pg_size_pretty((reltuples * 100)::bigint) as approx_live
from pg_class
where relkind = 'r'
order by pg_relation_size(oid) desc
limit 10;
A practical rule on a busy OLTP system: if pgstattuple says dead_tuple_percent is above 20 and rising, autovacuum is losing; if free_percent is above 40, a rewrite is overdue.
3. Why bloat hurts beyond wasted disk
Wasted gigabytes are the smallest part of the problem. Three downstream costs do most of the damage.
Cache pollution. PostgreSQL caches heap pages in shared_buffers. A 200 GB table with 60 percent dead tuples puts 120 GB of dead weight into competition for cache slots, evicting pages of other relations that would have been hot. The hit ratio drops, and queries that used to be in-memory now hit disk.
Scan amplification. A sequential scan visits every page, whether the live density on that page is 100 percent or 5 percent. A bloated 100 GB table that holds only 20 GB of live data scans 5x more pages than it should, and so do the index scans that touch many heap rows (a bitmap-heap scan in particular).
Index bloat. Every non-HOT update writes a new index entry while leaving the old one in place; VACUUM cleans those up later, but for a window the indexes carry dead entries too. The b-tree-index-internals page describes how a B-tree page that loses many entries does not collapse: the tree stays the same shape, and a scan of an index range walks pages that are mostly empty. So your indexes can be 3x their useful size and force extra I/O on every lookup.
The combined effect is non-linear. A table at 50 percent bloat is not 2x slower; it is often 4x or 5x slower on point lookups and 10x slower on sequential reports, because cache, scan, and index costs compound. That is why catching bloat early matters more than the storage cost suggests.
4. HOT updates: same page, no new index entries
Now the mechanism that prevents most of the cost. A HOT update is an UPDATE that PostgreSQL is able to perform without touching any index. It works when two conditions both hold.
- The
UPDATEdoes not change any column that is part of any index on the table (counting unique constraints and primary keys as indexes too). - There is enough free space on the same heap page as the old tuple to fit the new version.
When both hold, PostgreSQL writes the new tuple on the same page, sets the old tuple’s t_ctid to point at the new one, and marks the old tuple’s header with the HOT-updated flag. The chain is called a HOT chain. Every index entry that pointed at the old tuple still says “row lives at page X, line pointer Y”, and a reader who follows that pointer arrives at the old tuple, sees the HOT flag, and walks the on-page chain to find the live version.
flowchart TB
subgraph IndexBefore [Index entries]
I1["key=42, ctid=(7,3)"]
end
subgraph PageAfter [Heap page 7 after HOT update]
T1["Item 3: old tuple, xmax set, HOT-updated, t_ctid=(7,5)"]
T2["Item 5: new tuple, live, xmin=you"]
end
I1 --> T1
T1 -. on-page chain .-> T2The win is large. A normal update on a table with eight indexes does nine writes: one heap tuple plus eight index entries. A HOT update on the same table does one write, the heap tuple, and the indexes are not touched. Multiply by your update rate and the difference dominates write throughput.
The win compounds at cleanup time. A normal update leaves both a dead heap tuple and eight dead index entries, all of which VACUUM must later find and reclaim. A HOT chain can be pruned by a much cheaper mechanism: any backend that reads the page can perform HOT pruning on the spot, walking the chain and freeing the dead links without taking a vacuum-style heavy lock. Autovacuum still runs, but the per-update cleanup cost is paid lazily as a side effect of normal reads.
5. Fillfactor: reserving room for HOT
Condition 2 of HOT (room on the same page) is something you can engineer. By default PostgreSQL packs heap pages full on INSERT, leaving no room for a new version of any tuple on that page. The first update of any row then fails the room-on-page test, falls back to a normal update, and the table starts to bloat.
The lever is fillfactor, a per-table storage parameter that tells PostgreSQL what fraction of a page to fill on INSERT. The default is 100 (fill completely). Set it to 80, and INSERT leaves 20 percent of each page free, reserving room for future HOT versions.
-- For a hot-updated table, reserve 20 percent of each page for HOT versions.
alter table orders set (fillfactor = 80);
-- New inserts honor the new fillfactor immediately.
-- Existing rows do not move; the free space appears as they are updated and the
-- table is gradually rewritten, or after a CLUSTER / pg_repack.
The trade-off is explicit. A lower fillfactor wastes space on tables that do not actually update, because you have reserved room nobody uses. So tune it where it matters: tables whose updates are eligible for HOT and whose update rate is high enough that the bloat cost dwarfs the wasted-space cost. Common values:
100(default) for write-once tables and append-only logs.90for tables with moderate update rates on non-indexed columns.80for hot OLTP tables where most updates are status flips, counters, or timestamps.70only on extreme cases, like a queue table whose every row is updated many times.
There is no fillfactor below 10, and values under 60 are almost never the right answer; if you need more room than that, your access pattern is wrong and you should be looking at a separate queue table or partition.
Read the bands as direction, not as a formula: the slider does not compute a HOT rate (real HOT success depends on row width, update frequency, and indexed-column changes), but the qualitative direction is solid. The pgstattuple-driven check is whether n_tup_hot_upd over n_tup_upd in pg_stat_user_tables actually rose after the change.
6. When HOT breaks: the indexed-column trap
Condition 1 is the silent killer. HOT requires that no indexed column changes in the UPDATE. The instant the changed column is part of any index, every index on the table gets a new entry, and the new tuple goes on a different page (or at least is treated as a regular update for indexing purposes). One indexed column killed HOT for the whole row.
Two patterns trip experienced engineers.
Adding an index quietly turns cheap updates expensive. An orders table updates status a hundred times per second; updates are HOT, indexes are quiet, life is good. A reporting team asks for a query “where status equals ‘pending’”, a junior engineer adds CREATE INDEX ON orders (status), and overnight your write throughput craters. Every status update now writes a new index entry, drops out of HOT, and the table starts to bloat. The index helped a few queries and broke a thousand writes. Always ask “do any frequent updates touch this column?” before adding an index.
Adding a column to an index does the same thing. A covering index added with INCLUDE for an index-only scan looks innocent; if any column listed in INCLUDE is updated, HOT is broken. PostgreSQL 16 onward improved this case for some BRIN scenarios, but for B-tree the rule still holds: every column in the index, key or include, gates HOT.
-- HOT-friendly: status is not indexed, this update stays HOT.
update orders set status = 'shipped', updated_at = now() where id = 12345;
-- HOT-breaking: status is indexed, every such update writes new index entries.
create index on orders (status);
update orders set status = 'shipped', updated_at = now() where id = 12345;
The only safe way to verify this in your own system is to measure n_tup_hot_upd before and after a schema change, which is the subject of the next rung.
7. Diagnosing HOT in the statistics tables
The view pg_stat_user_tables gives you the running tally: n_tup_upd is the total updates a table has seen, and n_tup_hot_upd is how many of them were HOT. Their ratio is the single most useful number on this page.
select relname,
n_tup_upd,
n_tup_hot_upd,
round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) as hot_pct,
n_dead_tup,
last_autovacuum
from pg_stat_user_tables
where schemaname = 'public'
order by n_tup_upd desc
limit 20;
Read the hot_pct column. A healthy hot-updated table sits at 95 percent or above. The 5 percent that miss are usually the first update on a freshly inserted page (no room yet) or the rare update that happens to touch an indexed column. A table at 60 percent is leaving most of the win on the table: investigate which columns its updates touch and which of them are indexed. A table at zero with high n_tup_upd is either using fillfactor 100 on rows that have no room, or every update is changing an indexed column. The dead-tuple count climbing alongside zero HOT is the strongest possible signal.
relname | n_tup_upd | n_tup_hot_upd | hot_pct | n_dead_tup | last_autovacuum
----------+-----------+---------------+---------+------------+-----------------
orders | 184320000 | 176122000 | 95.6 | 220000 | 11 minutes ago
sessions | 8412000 | 3360000 | 39.9 | 3140000 | 2 hours ago
audit | 92000000 | 0 | 0.0 | 180000 | 6 minutes ago
orders is healthy. sessions is losing more than half its updates to non-HOT; somebody indexed a column updated on every page load. audit is append-only, so HOT does not apply; the dead tuples there come from a different source (likely the INSERT...ON CONFLICT DO UPDATE pattern, which is still an update underneath).
8. Remediation: rewriting a bloated table online and off
When a table is already badly bloated, autovacuum can prevent future growth but not shrink the existing file. You need a rewrite: read the live tuples, write them to a fresh file, swap the two, drop the old. Four tools, very different lock behaviors.
VACUUM FULL is the built-in option. It takes ACCESS EXCLUSIVE on the table for the entire run, which blocks every read and every write. On a 500 GB OLTP table that is hours of downtime. It is the right tool for offline maintenance windows and small tables, and the wrong tool for anything else.
-- Offline: takes ACCESS EXCLUSIVE, blocks reads and writes for the whole run.
vacuum full orders;
CLUSTER rewrites the table while physically ordering it by an index. Same lock as VACUUM FULL, plus the benefit that range scans on the cluster index become sequential afterwards. Same downside: hours of blocked traffic on a large table.
-- Offline rewrite, physically ordered by orders_pkey.
cluster orders using orders_pkey;
pg_repack is an external extension that performs an online rewrite. It creates a shadow table, copies live tuples to it, captures concurrent changes in a log table via triggers, replays them, and at the end takes ACCESS EXCLUSIVE for only the millisecond-scale swap. Throughput on the table drops while it runs (the trigger and the shadow writes cost something), but reads and writes keep working. This is the production tool for a bloated 500 GB OLTP table.
# Online: shadow-table copy, trigger-captured concurrent writes,
# brief ACCESS EXCLUSIVE only at the final swap.
pg_repack --no-superuser-check --jobs=4 --table=public.orders dbname
pg_squeeze is the logical-decoding-based alternative. Instead of triggers, it tails the WAL through a logical replication slot to capture concurrent writes. The advantage is no trigger overhead on the source table; the disadvantage is the WAL volume and the slot management. It is a newer tool, well-suited to write-heavy tables where the trigger cost of pg_repack would be noticeable.
A quick comparison.
| Tool | Lock held | Online? | When to pick |
|---|---|---|---|
VACUUM FULL | ACCESS EXCLUSIVE for the whole run | No | Small tables, scheduled maintenance windows |
CLUSTER | ACCESS EXCLUSIVE for the whole run | No | Same, when physical order by an index also helps |
pg_repack | ACCESS EXCLUSIVE only for the final swap (ms) | Yes | The default for production OLTP rewrites |
pg_squeeze | Brief at the end, no triggers on source | Yes | Write-heavy tables where trigger overhead matters |
Index bloat is its own job and a much friendlier one. REINDEX CONCURRENTLY (PostgreSQL 12 and later) rebuilds an index alongside the old one and swaps them at the end, taking only short locks. On a replica there is a gotcha: a REINDEX CONCURRENTLY on the primary still propagates a brief catalog change that conflicts with max_standby_streaming_delay on hot standbys, so on a tightly tuned replica it can cancel a long-running query. Schedule index rebuilds when replica reporting traffic is quiet.
-- Online index rebuild: short locks, safe on a busy table.
reindex index concurrently orders_status_idx;
-- Or rebuild every index on a table.
reindex table concurrently orders;
9. The xmin horizon, the real root cause
Here is the rung that closes the loop with the mvcc-and-tuple-visibility page, and it is the one most production incidents come down to.
VACUUM cannot remove a dead tuple whose xmax is more recent than the xmin horizon, the oldest transaction id any running snapshot still depends on. A single transaction left open for hours, even one that is just idle after running one tiny SELECT, pins the horizon at the moment it started. For as long as it lives, autovacuum across the entire database is forbidden to clean up any tuple that died after that point. Dead tuples pile up everywhere. Fillfactor is right. The autovacuum-tuning page’s scale-factor tricks are right. None of it matters. The bill is held open by one connection.
This is why most “vacuum cannot keep up” tickets are not vacuum problems. The signature is unmistakable: dead tuples rise on every table at once, not just one, and the rise stops the moment someone kills the held-open transaction. You diagnose it with pg_stat_activity and the backend_xmin column.
-- Who is holding the xmin horizon back, and for how long?
select pid,
state,
usename,
application_name,
age(backend_xmin) as xmin_age_xids,
now() - xact_start as xact_duration,
query
from pg_stat_activity
where backend_xmin is not null
order by age(backend_xmin) desc
limit 10;
The row at the top of that result, with a large xmin_age_xids and an idle in transaction state, is your culprit. Kill it (pg_terminate_backend(pid)), watch the horizon advance, watch autovacuum catch up over the next several minutes. The durable fix is in the application: set idle_in_transaction_session_timeout to something like 60 seconds at the cluster or role level, and audit ORMs and background jobs for transactions held open across slow external calls.
The reason this rung is here and not on the autovacuum-tuning page is that the symptom shows up as bloat, not as a vacuum complaint. You read the disk-usage chart, you reach for fillfactor and a pg_repack, and the held-open transaction is still there underneath, recreating the bloat the moment you finish. Always check the horizon first.
Mastery Questions
-
A coworker adds an index on a frequently updated column and write latency on the whole table doubles overnight. They argue the index cannot be the cause because the index itself is small. What is happening, and how would you prove it from the statistics views?
Answer. They are watching HOT collapse. Before the new index, updates on that column qualified for HOT: the new tuple went on the same page as the old one, the t_ctid chain linked them, and no index was touched. After the new index, every such update touches an indexed column, so HOT is disabled for those updates. PostgreSQL now writes a new heap tuple plus a new entry in every index on the table (the new one plus all the pre-existing ones), and the previously cheap update becomes proportional to the number of indexes. To prove it, query
pg_stat_user_tablesfor that table and computen_tup_hot_updovern_tup_upd. The ratio will have crashed at the moment the index was created, andn_dead_tupwill be climbing at a much higher rate than before. The fix is to drop the index if it is not earning its keep, or to keep it but stop updating the column unnecessarily (for example, only updateupdated_atwhen something actually changed), so HOT can return. -
A 500 GB
orderstable on a busy OLTP system showsdead_tuple_percentof 12 inpgstattupleandn_dead_tupof about 80 million. Application owners cannot tolerate any downtime. Walk through a remediation plan, in order, with the lock implications at each step.Answer. First, check the xmin horizon, because no other step matters if it is held open. Query
pg_stat_activityfor the oldestbackend_xmin; if a session has held the horizon for hours, end it and wait several autovacuum cycles before deciding anything else, because the dead-tuple count may drop on its own. Second, tune autovacuum aggressively for this one table withALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_cost_limit = 2000), as the autovacuum-tuning page covers, so future bloat is caught earlier. Third, set a lower fillfactor withALTER TABLE orders SET (fillfactor = 80)so new pages reserve room for HOT, and verify HOT recovers by watchingn_tup_hot_updovern_tup_updclimb. Fourth, audit indexes: if any frequently updated column is indexed and the index is not pulling its weight, drop it. Finally, to reclaim the 60 GB of existing dead-tuple space without downtime, runpg_repack --table=public.ordersin off-peak hours; it copies live tuples to a shadow table with triggers capturing concurrent changes, and takesACCESS EXCLUSIVEonly for the final swap (typically milliseconds).VACUUM FULLandCLUSTERwould each block the table for hours and are off the table given the no-downtime constraint. After the rewrite, rebuild the indexes online withREINDEX TABLE CONCURRENTLY ordersto shrink any index bloat. -
Disk usage on the cluster grows by 40 GB per night, every night, on a workload that has not changed in months. Restarting the application makes the growth stop until the next morning. What is the most likely root cause, what is the wrong fix, and what is the right fix?
Answer. The signature, growth that stops on restart, points squarely at a connection that is holding a transaction open, pinning the xmin horizon and blocking autovacuum across the entire database. As the mvcc-and-tuple-visibility page covers,
VACUUMmay not reclaim any tuple a running snapshot might still need, so dead tuples accumulate on every table that sees writes, even tables the held-open transaction never touched. The wrong fix is to add storage, raiseautovacuum_max_workers, lower scale factors, or runVACUUM FULLon the largest tables; every one of those leaves the held-open transaction in place, so the bloat comes back the next night. The right fix has two parts. Immediately, find the offending session with a query onpg_stat_activitythat orders byage(backend_xmin)and look for a row inidle in transactionwith a multi-hourxact_start; terminate it withpg_terminate_backend(pid)and watch dead-tuple counts fall as autovacuum catches up. Durably, setidle_in_transaction_session_timeouton the database or role to a few minutes so a forgotten transaction is killed automatically, and trace the application code path that opened the transaction (a missing commit after a slow external HTTP call is the classic culprit) so it never opens again.
Sources & evidence12 claims · 3 cited
Vacuum behavior, autovacuum triggers, fillfactor, and indexing parameters are grounded in the PostgreSQL documentation for Routine Vacuuming, Database Physical Storage, Indexes (CREATE INDEX storage parameters), and the Automatic Vacuuming runtime parameters. The HOT mechanism, the t_ctid chain and HOT pruning specifics, the pgstattuple output shape, pg_repack's shadow-table-plus-triggers design, and pg_squeeze's logical-decoding approach are treated as stable common knowledge from PostgreSQL internals and well-known community extensions where the listed documentation sections do not state them directly. The 3x to 5x bloat ratio and the worked pgstattuple output are illustrative of common production observations, not measurements from a specific live system.
- Table bloat is the sum of two distinct things: dead tuples whose xmax committed before the oldest possible snapshot but which VACUUM has not yet reclaimed, and free space that VACUUM has marked reusable but no insert has filled, and PostgreSQL only shrinks a heap file by truncating empty trailing pages, so deleting half a table's rows does not return space to the operating system.verified
- The pgstattuple extension reports exact per-relation tuple_count, tuple_percent, dead_tuple_count, dead_tuple_percent, free_space, and free_percent by walking every page of the relation, which makes it accurate but expensive on large tables, so it is typically run on a single relation during diagnosis rather than across every table on a schedule.stable common knowledge
- Bloat hurts beyond wasted disk in three compounding ways: dead tuples occupy slots in shared_buffers and evict useful pages, sequential and bitmap-heap scans read pages whose live density may be 5 to 20 percent so they amplify I/O, and a B-tree index whose pages have lost many entries does not collapse, so index scans walk pages that are mostly empty.verified
- A Heap-Only Tuple (HOT) update is performed when the UPDATE changes no column that appears in any index on the table and there is enough free space on the same heap page to hold the new tuple; PostgreSQL writes the new tuple on the same page, sets the old tuple's t_ctid to point at the new one, marks the old tuple as HOT-updated, and writes no new index entries, so a reader following an existing index pointer walks the on-page chain to find the live version.stable common knowledge
- The dead links in a HOT chain can be pruned opportunistically by any backend that reads the page (single-page HOT pruning), without taking a vacuum-style heavy lock, so the cleanup cost of a HOT update is paid lazily during normal reads and does not require a full VACUUM pass over the table's indexes.stable common knowledge
- The per-table fillfactor storage parameter controls what percentage of each heap page INSERT fills, defaulting to 100; lowering it (commonly to 80 on hot-updated OLTP tables) reserves room for future versions on the same page so subsequent UPDATEs qualify for HOT, with the trade-off that disk space is wasted if those updates never happen.verified
- If an UPDATE changes even one column that is part of any index on the table (including primary keys, unique constraints, and INCLUDE columns of covering indexes), HOT is disabled for that update and PostgreSQL writes a new entry in every index on the table, so adding an index on a frequently updated column can quietly turn cheap HOT updates into expensive multi-index writes.stable common knowledge
- pg_stat_user_tables exposes n_tup_upd (total updates) and n_tup_hot_upd (updates performed as HOT) per table, and the ratio n_tup_hot_upd over n_tup_upd is the single most useful HOT health signal: a healthy hot-updated table typically sits at 95 percent or above, and a sudden drop after a schema change points at a new or extended index on an updated column.verified
- VACUUM FULL and CLUSTER both rewrite a table while holding ACCESS EXCLUSIVE for the entire run, which blocks all reads and writes; pg_repack performs an online rewrite by copying live tuples to a shadow table while triggers capture concurrent changes, taking ACCESS EXCLUSIVE only for the final millisecond-scale swap; pg_squeeze achieves the same online rewrite using a logical replication slot to tail the WAL instead of triggers, avoiding source-table trigger overhead.stable common knowledge
- REINDEX CONCURRENTLY (available from PostgreSQL 12) rebuilds an index alongside the old one and swaps them at the end with only short locks, but on a hot standby the catalog change can conflict with max_standby_streaming_delay and cancel long-running replica queries, so index rebuilds should be scheduled when replica reporting traffic is quiet.verified
- VACUUM may only remove a dead tuple whose xmax is older than the xmin horizon, so a single long-running or idle-in-transaction session pins that horizon at its start and forbids cleanup of any tuple that died after that point across the entire database; the diagnostic is to query pg_stat_activity for the oldest backend_xmin and look for an idle-in-transaction state with a long xact_start, and the durable fix is to set idle_in_transaction_session_timeout in the application or role.verified
- Bloat that rises on every table at once and stops the moment the application is restarted is a signature of the xmin-horizon trap rather than of autovacuum misconfiguration, so tuning autovacuum scale factors, adding workers, or running VACUUM FULL will fail to fix it as long as the held-open transaction is recreated each day.verified
Cited sources
- PostgreSQL Documentation: Routine Vacuuming · PostgreSQL Global Development Group
- PostgreSQL Documentation: Database Physical Storage (Page Layout and TOAST) · PostgreSQL Global Development Group
- PostgreSQL Documentation: Indexes · PostgreSQL Global Development Group