MVCC and Tuple Visibility
How PostgreSQL's multiversion concurrency control works and why it underpins every later optimization: row versions stamped with xmin and xmax, snapshot-based visibility, what INSERT, UPDATE, and DELETE do physically, the dead-tuple bloat they create, the xmin horizon and the long-transaction trap, and the system columns that let you see it all.
Learning outcomes
Multiversion concurrency control is the single idea the rest of PostgreSQL performance rests on. Almost every tuning decision you make later, from why VACUUM exists to why a long-running report can bloat a table you never wrote to, traces back to how PostgreSQL keeps old versions of rows alive. This page makes that machinery concrete.
After studying this page, you can:
- Explain why PostgreSQL lets readers and writers run at the same time without blocking, and what it pays for that.
- Describe a row version (tuple) in terms of its
xminandxmax, and decide whether a given transaction can see it. - Predict how an
INSERT,UPDATE, orDELETEchanges the physical tuples on a page. - Connect MVCC to the symptoms it causes downstream: dead tuples, bloat, and the need to vacuum.
- Diagnose how one long-running transaction holds back cleanup for the whole database, and read the system columns that prove it.
Before we dive in
You should be comfortable with basic SQL and the idea of a transaction: a group of statements that commits or rolls back as a unit. You do not need to know how PostgreSQL stores rows on disk yet. We touch the physical page layout only lightly here and go deep on it in the heap-pages-and-toast page.
A few words you will need, defined as we use them. A transaction id (often shortened to xid or txid) is a number PostgreSQL hands out to each writing transaction, increasing over time. A tuple is one physical version of a row: a single row you think of as “one record” can exist on disk as several tuples at once. A snapshot is the rule a transaction uses to decide which tuples are visible to it. Hold onto those three. Everything below is built from them.
Mental Model
The wrong model, and it is a tempting one, is that an UPDATE finds your row and changes it in place, the way assigning to a variable overwrites the old value. Under that model a reader and a writer touching the same row must take turns, because there is only one copy and they would corrupt each other.
PostgreSQL does not work that way. The better model is a stack of sticky notes. When you update a row, PostgreSQL does not erase the old note. It writes a new note with the new values and marks the old one with “obsolete as of transaction 1234.” Both notes physically exist for a while. Which note you read depends on when your transaction started, not on who is writing right now. A reader picks up the note that was current when its snapshot was taken and ignores the rest, so it never waits for a writer and never sees a half-finished change.
Keep this picture. “Update” means “add a new version and tombstone the old one,” never “overwrite.” Once that clicks, dead tuples, bloat, vacuum, and even replication conflicts stop being surprising and start being obvious consequences of the same rule.
Breaking it down
1. Why readers and writers must not block each other
Start with the problem MVCC solves, because the design only makes sense against it. Imagine a reporting query that scans a million-row accounts table for thirty seconds. While it runs, a hundred users are updating their balances. You want two things at once, and they fight.
You want the report to see a consistent picture: every balance as it stood at one instant, not a smear of values from across the thirty seconds. And you want those hundred writers to keep working, not freeze for half a minute waiting for the report to finish.
A naive database picks one. Lock-based systems make readers and writers take turns on each row, so either the report blocks the writers or the writers block the report. Both are bad at scale: the first kills write throughput, the second makes long reads impossible on a busy system.
PostgreSQL refuses the trade. Its rule is that readers do not block writers and writers do not block readers. A SELECT never waits for an UPDATE of the same row, and the reverse holds too. The way it pulls this off is to keep more than one version of a row around and let each transaction read the version that was current when it began.
sequenceDiagram
participant R as Report (started T1)
participant W as Writer (T2)
participant Row as accounts row #42
R->>Row: read balance (sees version from T1)
W->>Row: UPDATE balance, creates new version
Note over Row: both versions now exist
R->>Row: read again (still sees T1 version)
W-->>R: never blocked, never blocksThe cost, and there is always a cost, is that those extra versions pile up and someone has to clean them later. Hold that thought; rung 5 is about the bill.
2. The core idea: a row is a stack of versions
Here is the mechanism behind the sticky-note picture. Every tuple carries two hidden system columns that PostgreSQL maintains for you: xmin and xmax.
xmin is the id of the transaction that created this version. xmax is the id of the transaction that deleted or superseded it, or zero if the version is still current. So a tuple’s lifespan is the half-open range of transaction ids from xmin (inclusive) up to xmax (exclusive). A live, never-touched row has an xmax of 0, meaning “still valid, no end yet.”
That is the entire data structure. A logical row is the set of tuples that share a place in a t_ctid chain, and at most one of them is visible to any given transaction. The diagram below shows one logical row across three transactions: created by transaction 100, updated by 105, and what each leaves behind.
stateDiagram-v2
[*] --> V1: T100 inserts
V1: Version 1 (xmin=100, xmax=105)
V2: Version 2 (xmin=105, xmax=0)
V1 --> V2: T105 updates
note right of V1: dead after T105 commits
note right of V2: live, current versionNotice what the UPDATE by transaction 105 did. It did not edit version 1. It stamped version 1 with xmax = 105 (this version ends when 105’s change takes effect) and wrote version 2 with xmin = 105, xmax = 0. Two physical tuples, one logical row.
3. How a snapshot decides what you see
Now the question that makes MVCC actually work: given several versions of a row, which one does your transaction read? The answer is the snapshot.
When your statement (or transaction) starts, PostgreSQL takes a snapshot that records, in effect, “which transactions had already committed at this instant.” A snapshot is three things: the lowest still-running transaction id, the next id not yet assigned, and the list of ids that were in progress in between. From those, PostgreSQL can answer one yes-or-no question about any transaction id: was it committed and visible as of my snapshot?
A tuple is visible to you when both of these hold:
- Its
xminis a transaction that committed before your snapshot. The version has been born, as far as you are concerned. - Its
xmaxis either zero, or a transaction that had not committed before your snapshot. The version has not yet died, as far as you are concerned.
Walk the two-version row above from the viewpoint of a report whose snapshot was taken when only transaction 100 had committed. Version 1 has xmin = 100 (committed, visible) and xmax = 105 (not committed yet in my snapshot), so version 1 is visible. Version 2 has xmin = 105 (not committed yet), so it is invisible. The report reads version 1, the old balance, even as transaction 105 commits the new one a millisecond later. That is consistency without locking.
This is why two SELECTs of the same row, issued by two transactions at the same wall-clock moment, can correctly return different values. They are reading through different snapshots.
4. What insert, update, and delete really do
With visibility understood, the three write operations are simple to state precisely. Think of each in terms of the tuples it creates and the xmax it stamps.
INSERTwrites one new tuple withxminset to your transaction andxmax = 0. Nothing old is touched.DELETEdoes not remove anything physically. It sets the target tuple’sxmaxto your transaction id. The tuple stays on the page, now a tombstone, invisible to snapshots taken after you commit but still visible to older snapshots that should see it.UPDATEis a delete and an insert fused together: it stamps the old tuple’sxmaxwith your id and writes a brand-new tuple with the new values. The new tuple’sxminis your transaction.
The table makes the surprising part stand out: every one of these is an append at the physical level, and only INSERT avoids leaving a dead tuple behind.
| Operation | New tuple written? | Old tuple changed? | Dead tuple left behind? |
|---|---|---|---|
INSERT | yes (xmin = you, xmax = 0) | no | no |
DELETE | no | xmax set to you | yes (after you commit) |
UPDATE | yes (the new version) | old xmax set to you | yes (the old version) |
This is the fact that trips up engineers coming from databases that update in place: in PostgreSQL, a table that is only ever updated, never inserted into, still grows. Each UPDATE adds a tuple. A row updated a thousand times has, at least briefly, up to a thousand and one physical versions waiting to be cleaned up.
5. The bill comes due: dead tuples and bloat
You have now met the cost MVCC defers. Every UPDATE and DELETE leaves a dead tuple: a physical version that no current or future snapshot will ever need again, once the last transaction old enough to see it has finished. Those dead tuples occupy space on their pages. They are not reused automatically the instant they die.
The accumulation of dead tuples is called bloat. A table that is heavily updated can end up several times larger on disk than the live data it holds, with dead tuples scattered through its pages. Bloat hurts in two ways: it wastes storage, and worse, it makes scans slower because PostgreSQL must read pages that are mostly dead weight to find the live rows among them.
The cleanup mechanism is VACUUM. Its core job is to find dead tuples whose xmax committed before the oldest snapshot any running transaction could hold, and mark their space free for reuse. Autovacuum runs it for you in the background. We treat vacuum tuning, the visibility map, and HOT updates (a clever optimization that avoids some of this cost) in their own pages; what matters here is the causal chain.
flowchart LR
A[UPDATE or DELETE] --> B[dead tuple created]
B --> C[bloat: table grows, scans slow]
C --> D[VACUUM reclaims dead space]
D --> E[space reusable for new tuples]Read that chain in both directions. Forward, it says writes create cleanup debt. Backward, it explains a question that puzzles people daily: “why is my table 4 GB when the data is 800 MB?” The answer is almost always dead tuples that have not been vacuumed, or cannot be vacuumed yet. Which brings us to the most important operational trap in all of MVCC.
6. The xmin horizon and the long-transaction trap
Here is the rung that separates someone who knows the MVCC definition from someone who can debug a production incident with it.
VACUUM may only remove a dead tuple if no transaction still alive could legitimately need to see it. PostgreSQL tracks this with the xmin horizon: the oldest transaction id that any current snapshot still depends on. A dead tuple is removable only if its death is older than that horizon. Anything newer might still be visible to some running transaction, so vacuum must leave it.
Now the trap. A single transaction that stays open, even one that is just idle after running one tiny SELECT, holds the horizon back at the moment it started. For as long as it lives, vacuum across the entire database cannot clean up any tuple that died after that point. Dead tuples pile up everywhere, in tables that transaction never touched, and bloat grows without bound.
This is the mechanism behind a classic outage: an application leaves a transaction open (“idle in transaction”), or an analyst runs a twelve-hour query, and overnight the database balloons and queries crawl. Nothing is “broken.” MVCC is doing exactly what it promised, keeping versions alive for a snapshot that still exists. The fix is to find and end the offending transaction, not to add storage.
You monitor this directly. The view pg_stat_activity shows each connection’s state and xact_start; a row stuck in idle in transaction with an old xact_start is the smoking gun. The function age(backend_xmin) and the column backend_xmin tell you how far back a backend is pinning the horizon.
7. Seeing it for yourself with system columns
MVCC stops being abstract the moment you query the hidden columns directly. Every table exposes xmin, xmax, and ctid as system columns you can select even though they are not in your CREATE TABLE. The ctid is the physical address of the tuple: a pair of (page number, item number) that tells you exactly where the version sits.
Run this in one session to watch an update fork a row into two physical tuples.
create table demo (id int primary key, balance int);
insert into demo values (1, 100);
-- One physical tuple. Note its ctid and the transaction that made it.
select ctid, xmin, xmax, balance from demo where id = 1;
-- ctid | xmin | xmax | balance
-- (0,1) | 742 | 0 | 100
update demo set balance = 90 where id = 1;
-- The live version moved to a new ctid; the old one is dead, awaiting vacuum.
select ctid, xmin, xmax, balance from demo where id = 1;
-- ctid | xmin | xmax | balance
-- (0,2) | 743 | 0 | 90
The live tuple is now at (0,2) with a new xmin, while the dead version still sits at (0,1) until vacuum reclaims it. Two more tools make the lifecycle measurable. The age() function turns a transaction id into “how many transactions ago,” which is how you measure both horizon lag and freezing pressure. And pg_stat_user_tables reports n_live_tup and n_dead_tup per table, the running count of live versus dead tuples that autovacuum itself watches to decide when to act.
-- How many dead tuples are waiting, and when did autovacuum last run?
select relname, n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname = 'demo';
When n_dead_tup climbs and stays high, you are watching the bill from rung 5 accumulate, and the cause is almost always either a write-heavy workload outpacing autovacuum or the long-transaction trap from rung 6 freezing the horizon.
Mastery Questions
-
A colleague says, “We never delete from the
eventstable, we onlyUPDATEa status column, so it should never need vacuuming.” Why are they wrong, and what would you expect to see?Answer. They have the in-place-edit model in their head. In PostgreSQL an
UPDATEdoes not overwrite; it writes a new tuple and tombstones the old one by setting itsxmax. So a table that is only ever updated still produces one dead tuple per update and will bloat exactly like a delete-heavy table. You would expectn_dead_tupinpg_stat_user_tablesto climb, the table’s on-disk size to outgrow its live data, and sequential scans to slow as they wade through dead versions. The fix is normal vacuuming (autovacuum, possibly tuned more aggressively for this table), and if updates touch only non-indexed columns, ensuringHOTupdates can kick in to reduce the damage. The point to internalize: “we don’t delete” does not mean “we don’t create dead tuples.” -
Two transactions both
SELECT balance FROM accounts WHERE id = 7at the same instant and get different numbers. No error occurred and the data is not corrupt. How is this possible, and is it a bug?Answer. It is not a bug; it is MVCC working correctly. The two transactions are reading through different snapshots. If transaction A began before a writer committed a new version of row 7, and transaction B began after, then A’s snapshot makes the new version’s
xmininvisible (so A reads the old version) while B’s snapshot makes it visible (so B reads the new one). Each is internally consistent: it sees the database as it stood when its snapshot was taken. This is the whole purpose of snapshot isolation, and it is why a thirty-second report and a flood of concurrent writers can coexist without either blocking the other. The deeper lesson is that “the current value of a row” is not a well-defined notion in a concurrent system; “the value visible to this snapshot” is. -
Your monitoring shows total database size growing steadily overnight even though application write volume is flat, and it stops the moment you restart the application. What is the most probable cause at the MVCC level, and how would you confirm it?
Answer. The signature points straight at the long-transaction trap. An application connection is almost certainly sitting
idle in transaction, holding the xmin horizon back at the time it opened. Because vacuum may not remove any tuple that a still-running snapshot could need, dead-tuple cleanup stalls across the entire database, so size grows even on tables that connection never touched. The restart ends the stuck transaction, the horizon advances, autovacuum catches up, and growth stops. You confirm it by queryingpg_stat_activityfor rows whosestateisidle in transactionwith an oldxact_start, and by checkingbackend_xminandage(backend_xmin)to see which backend pins the oldest horizon. The durable fix is in the application: set anidle_in_transaction_session_timeout, and make sure transactions are not left open across slow external calls.
Sources & evidence9 claims · 5 cited
Every MVCC mechanism (xmin/xmax visibility, snapshot rules, what UPDATE/DELETE do to tuples, the xmin horizon, dead-tuple bloat, and the system columns) is grounded in the official PostgreSQL documentation: Concurrency Control (MVCC), Database Physical Storage, Routine Vacuuming, and Transaction Isolation. The specific transaction-id and ctid values in the worked examples are illustrative, not captured from a live system.
- PostgreSQL uses multiversion concurrency control so that readers do not block writers and writers do not block readers: each statement sees a snapshot of the data as it was at some point in time.verified
- Each row version (tuple) carries hidden columns xmin (the id of the transaction that created it) and xmax (the id of the transaction that deleted or superseded it, or zero if still current); visibility is decided by comparing these against the snapshot.verified
- A tuple is visible to a transaction when its xmin committed before the transaction's snapshot and its xmax is either zero or belongs to a transaction that had not committed as of that snapshot.verified
- An UPDATE writes a new tuple version and marks the old one as deleted by setting its xmax rather than overwriting in place, and a DELETE only sets xmax, so an update-only table still accumulates dead tuples.verified
- Read Committed takes a fresh snapshot at the start of each statement, while Repeatable Read and Serializable use a single snapshot for the whole transaction.verified
- Dead tuples left by updates and deletes cause table bloat, and VACUUM (run automatically by autovacuum) reclaims the space they occupy for reuse.verified
- VACUUM may remove a dead tuple only when no running transaction's snapshot could still need it, so a single long-running or idle-in-transaction transaction holds back the xmin horizon and stalls dead-tuple cleanup across the whole database.verified
- A heap tuple has a 23-byte header holding xmin, xmax, and t_ctid among other fields, and the ctid system column is the tuple's physical address as a (page, item) pair.verified
- pg_stat_user_tables reports n_live_tup and n_dead_tup per table, and autovacuum triggers a table when its dead tuples exceed autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times the estimated row count.verified
Cited sources
- PostgreSQL Documentation: Concurrency Control (MVCC) · PostgreSQL Global Development Group
- PostgreSQL Documentation: Database Physical Storage (Page Layout and TOAST) · PostgreSQL Global Development Group
- PostgreSQL Documentation: Transaction Isolation · PostgreSQL Global Development Group
- PostgreSQL Documentation: Routine Vacuuming · PostgreSQL Global Development Group
- PostgreSQL Documentation: Automatic Vacuuming Parameters · PostgreSQL Global Development Group