MVCC and Tuple Visibility

Learning outcomes

Multiversion concurrency control is the single idea the rest of PostgreSQL performance rests on. Almost every tuning decision you make later, from why VACUUM exists to why a long-running report can bloat a table you never wrote to, traces back to how PostgreSQL keeps old versions of rows alive. This page makes that machinery concrete.

After studying this page, you can:

Explain why PostgreSQL lets readers and writers run at the same time without blocking, and what it pays for that.
Describe a row version (tuple) in terms of its xmin and xmax, and decide whether a given transaction can see it.
Predict how an INSERT, UPDATE, or DELETE changes the physical tuples on a page.
Connect MVCC to the symptoms it causes downstream: dead tuples, bloat, and the need to vacuum.
Diagnose how one long-running transaction holds back cleanup for the whole database, and read the system columns that prove it.

Before we dive in

You should be comfortable with basic SQL and the idea of a transaction: a group of statements that commits or rolls back as a unit. You do not need to know how PostgreSQL stores rows on disk yet. We touch the physical page layout only lightly here and go deep on it in the heap-pages-and-toast page.

A few words you will need, defined as we use them. A transaction id (often shortened to xid or txid) is a number PostgreSQL hands out to each writing transaction, increasing over time. A tuple is one physical version of a row: a single row you think of as “one record” can exist on disk as several tuples at once. A snapshot is the rule a transaction uses to decide which tuples are visible to it. Hold onto those three. Everything below is built from them.

Mental Model

The wrong model, and it is a tempting one, is that an UPDATE finds your row and changes it in place, the way assigning to a variable overwrites the old value. Under that model a reader and a writer touching the same row must take turns, because there is only one copy and they would corrupt each other.

PostgreSQL does not work that way. The better model is a stack of sticky notes. When you update a row, PostgreSQL does not erase the old note. It writes a new note with the new values and marks the old one with “obsolete as of transaction 1234.” Both notes physically exist for a while. Which note you read depends on when your transaction started, not on who is writing right now. A reader picks up the note that was current when its snapshot was taken and ignores the rest, so it never waits for a writer and never sees a half-finished change.

Keep this picture. “Update” means “add a new version and tombstone the old one,” never “overwrite.” Once that clicks, dead tuples, bloat, vacuum, and even replication conflicts stop being surprising and start being obvious consequences of the same rule.

Breaking it down

1. Why readers and writers must not block each other

Start with the problem MVCC solves, because the design only makes sense against it. Imagine a reporting query that scans a million-row accounts table for thirty seconds. While it runs, a hundred users are updating their balances. You want two things at once, and they fight.

You want the report to see a consistent picture: every balance as it stood at one instant, not a smear of values from across the thirty seconds. And you want those hundred writers to keep working, not freeze for half a minute waiting for the report to finish.

A naive database picks one. Lock-based systems make readers and writers take turns on each row, so either the report blocks the writers or the writers block the report. Both are bad at scale: the first kills write throughput, the second makes long reads impossible on a busy system.

PostgreSQL refuses the trade. Its rule is that readers do not block writers and writers do not block readers. A SELECT never waits for an UPDATE of the same row, and the reverse holds too. The way it pulls this off is to keep more than one version of a row around and let each transaction read the version that was current when it began.

sequenceDiagram
    participant R as Report (started T1)
    participant W as Writer (T2)
    participant Row as accounts row #42
    R->>Row: read balance (sees version from T1)
    W->>Row: UPDATE balance, creates new version
    Note over Row: both versions now exist
    R->>Row: read again (still sees T1 version)
    W-->>R: never blocked, never blocks

The cost, and there is always a cost, is that those extra versions pile up and someone has to clean them later. Hold that thought; rung 5 is about the bill.

2. The core idea: a row is a stack of versions

Here is the mechanism behind the sticky-note picture. Every tuple carries two hidden system columns that PostgreSQL maintains for you: xmin and xmax.

xmin is the id of the transaction that created this version. xmax is the id of the transaction that deleted or superseded it, or zero if the version is still current. So a tuple’s lifespan is the half-open range of transaction ids from xmin (inclusive) up to xmax (exclusive). A live, never-touched row has an xmax of 0, meaning “still valid, no end yet.”

That is the entire data structure. A logical row is the set of tuples that share a place in a t_ctid chain, and at most one of them is visible to any given transaction. The diagram below shows one logical row across three transactions: created by transaction 100, updated by 105, and what each leaves behind.

stateDiagram-v2
    [*] --> V1: T100 inserts
    V1: Version 1 (xmin=100, xmax=105)
    V2: Version 2 (xmin=105, xmax=0)
    V1 --> V2: T105 updates
    note right of V1: dead after T105 commits
    note right of V2: live, current version

Notice what the UPDATE by transaction 105 did. It did not edit version 1. It stamped version 1 with xmax = 105 (this version ends when 105’s change takes effect) and wrote version 2 with xmin = 105, xmax = 0. Two physical tuples, one logical row.

3. How a snapshot decides what you see

Now the question that makes MVCC actually work: given several versions of a row, which one does your transaction read? The answer is the snapshot.

When your statement (or transaction) starts, PostgreSQL takes a snapshot that records, in effect, “which transactions had already committed at this instant.” A snapshot is three things: the lowest still-running transaction id, the next id not yet assigned, and the list of ids that were in progress in between. From those, PostgreSQL can answer one yes-or-no question about any transaction id: was it committed and visible as of my snapshot?

A tuple is visible to you when both of these hold:

Its xmin is a transaction that committed before your snapshot. The version has been born, as far as you are concerned.
Its xmax is either zero, or a transaction that had not committed before your snapshot. The version has not yet died, as far as you are concerned.

Walk the two-version row above from the viewpoint of a report whose snapshot was taken when only transaction 100 had committed. Version 1 has xmin = 100 (committed, visible) and xmax = 105 (not committed yet in my snapshot), so version 1 is visible. Version 2 has xmin = 105 (not committed yet), so it is invisible. The report reads version 1, the old balance, even as transaction 105 commits the new one a millisecond later. That is consistency without locking.

Resolving a row at snapshot time

The two versionsVersion 1: xmin=100, xmax=105. Version 2: xmin=105, xmax=0. Same logical row, two physical tuples on the page.

Step 1 of 5

This is why two SELECTs of the same row, issued by two transactions at the same wall-clock moment, can correctly return different values. They are reading through different snapshots.

4. What insert, update, and delete really do

With visibility understood, the three write operations are simple to state precisely. Think of each in terms of the tuples it creates and the xmax it stamps.

INSERT writes one new tuple with xmin set to your transaction and xmax = 0. Nothing old is touched.
DELETE does not remove anything physically. It sets the target tuple’s xmax to your transaction id. The tuple stays on the page, now a tombstone, invisible to snapshots taken after you commit but still visible to older snapshots that should see it.
UPDATE is a delete and an insert fused together: it stamps the old tuple’s xmax with your id and writes a brand-new tuple with the new values. The new tuple’s xmin is your transaction.

The table makes the surprising part stand out: every one of these is an append at the physical level, and only INSERT avoids leaving a dead tuple behind.

Operation	New tuple written?	Old tuple changed?	Dead tuple left behind?
`INSERT`	yes (xmin = you, xmax = 0)	no	no
`DELETE`	no	xmax set to you	yes (after you commit)
`UPDATE`	yes (the new version)	old xmax set to you	yes (the old version)

This is the fact that trips up engineers coming from databases that update in place: in PostgreSQL, a table that is only ever updated, never inserted into, still grows. Each UPDATE adds a tuple. A row updated a thousand times has, at least briefly, up to a thousand and one physical versions waiting to be cleaned up.

UPDATE on a single logical row

The old tuple (balance 100) gets xmax = your txid and stays on the page as a dead version. A new tuple (balance 90, xmin = your txid) is appended. Old snapshots still read 100; new ones read 90; VACUUM removes the dead 100 later.

5. The bill comes due: dead tuples and bloat

You have now met the cost MVCC defers. Every UPDATE and DELETE leaves a dead tuple: a physical version that no current or future snapshot will ever need again, once the last transaction old enough to see it has finished. Those dead tuples occupy space on their pages. They are not reused automatically the instant they die.

The accumulation of dead tuples is called bloat. A table that is heavily updated can end up several times larger on disk than the live data it holds, with dead tuples scattered through its pages. Bloat hurts in two ways: it wastes storage, and worse, it makes scans slower because PostgreSQL must read pages that are mostly dead weight to find the live rows among them.

The cleanup mechanism is VACUUM. Its core job is to find dead tuples whose xmax committed before the oldest snapshot any running transaction could hold, and mark their space free for reuse. Autovacuum runs it for you in the background. We treat vacuum tuning, the visibility map, and HOT updates (a clever optimization that avoids some of this cost) in their own pages; what matters here is the causal chain.

flowchart LR
    A[UPDATE or DELETE] --> B[dead tuple created]
    B --> C[bloat: table grows, scans slow]
    C --> D[VACUUM reclaims dead space]
    D --> E[space reusable for new tuples]

Read that chain in both directions. Forward, it says writes create cleanup debt. Backward, it explains a question that puzzles people daily: “why is my table 4 GB when the data is 800 MB?” The answer is almost always dead tuples that have not been vacuumed, or cannot be vacuumed yet. Which brings us to the most important operational trap in all of MVCC.

6. The xmin horizon and the long-transaction trap

Here is the rung that separates someone who knows the MVCC definition from someone who can debug a production incident with it.

VACUUM may only remove a dead tuple if no transaction still alive could legitimately need to see it. PostgreSQL tracks this with the xmin horizon: the oldest transaction id that any current snapshot still depends on. A dead tuple is removable only if its death is older than that horizon. Anything newer might still be visible to some running transaction, so vacuum must leave it.

Now the trap. A single transaction that stays open, even one that is just idle after running one tiny SELECT, holds the horizon back at the moment it started. For as long as it lives, vacuum across the entire database cannot clean up any tuple that died after that point. Dead tuples pile up everywhere, in tables that transaction never touched, and bloat grows without bound.

This is the mechanism behind a classic outage: an application leaves a transaction open (“idle in transaction”), or an analyst runs a twelve-hour query, and overnight the database balloons and queries crawl. Nothing is “broken.” MVCC is doing exactly what it promised, keeping versions alive for a snapshot that still exists. The fix is to find and end the offending transaction, not to add storage.

Check yourself

An idle-in-transaction connection has been open for six hours. What is the most likely database-wide symptom?

You monitor this directly. The view pg_stat_activity shows each connection’s state and xact_start; a row stuck in idle in transaction with an old xact_start is the smoking gun. The function age(backend_xmin) and the column backend_xmin tell you how far back a backend is pinning the horizon.

7. Seeing it for yourself with system columns

MVCC stops being abstract the moment you query the hidden columns directly. Every table exposes xmin, xmax, and ctid as system columns you can select even though they are not in your CREATE TABLE. The ctid is the physical address of the tuple: a pair of (page number, item number) that tells you exactly where the version sits.

Run this in one session to watch an update fork a row into two physical tuples.

create table demo (id int primary key, balance int);
insert into demo values (1, 100);

-- One physical tuple. Note its ctid and the transaction that made it.
select ctid, xmin, xmax, balance from demo where id = 1;
--  ctid  | xmin | xmax | balance
-- (0,1)  |  742 |    0 |     100

update demo set balance = 90 where id = 1;

-- The live version moved to a new ctid; the old one is dead, awaiting vacuum.
select ctid, xmin, xmax, balance from demo where id = 1;
--  ctid  | xmin | xmax | balance
-- (0,2)  |  743 |    0 |      90

The live tuple is now at (0,2) with a new xmin, while the dead version still sits at (0,1) until vacuum reclaims it. Two more tools make the lifecycle measurable. The age() function turns a transaction id into “how many transactions ago,” which is how you measure both horizon lag and freezing pressure. And pg_stat_user_tables reports n_live_tup and n_dead_tup per table, the running count of live versus dead tuples that autovacuum itself watches to decide when to act.

-- How many dead tuples are waiting, and when did autovacuum last run?
select relname, n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname = 'demo';

When n_dead_tup climbs and stays high, you are watching the bill from rung 5 accumulate, and the cause is almost always either a write-heavy workload outpacing autovacuum or the long-transaction trap from rung 6 freezing the horizon.

Mastery Questions

A colleague says, “We never delete from the events table, we only UPDATE a status column, so it should never need vacuuming.” Why are they wrong, and what would you expect to see?

Answer. They have the in-place-edit model in their head. In PostgreSQL an UPDATE does not overwrite; it writes a new tuple and tombstones the old one by setting its xmax. So a table that is only ever updated still produces one dead tuple per update and will bloat exactly like a delete-heavy table. You would expect n_dead_tup in pg_stat_user_tables to climb, the table’s on-disk size to outgrow its live data, and sequential scans to slow as they wade through dead versions. The fix is normal vacuuming (autovacuum, possibly tuned more aggressively for this table), and if updates touch only non-indexed columns, ensuring HOT updates can kick in to reduce the damage. The point to internalize: “we don’t delete” does not mean “we don’t create dead tuples.”
Two transactions both SELECT balance FROM accounts WHERE id = 7 at the same instant and get different numbers. No error occurred and the data is not corrupt. How is this possible, and is it a bug?

Answer. It is not a bug; it is MVCC working correctly. The two transactions are reading through different snapshots. If transaction A began before a writer committed a new version of row 7, and transaction B began after, then A’s snapshot makes the new version’s xmin invisible (so A reads the old version) while B’s snapshot makes it visible (so B reads the new one). Each is internally consistent: it sees the database as it stood when its snapshot was taken. This is the whole purpose of snapshot isolation, and it is why a thirty-second report and a flood of concurrent writers can coexist without either blocking the other. The deeper lesson is that “the current value of a row” is not a well-defined notion in a concurrent system; “the value visible to this snapshot” is.
Your monitoring shows total database size growing steadily overnight even though application write volume is flat, and it stops the moment you restart the application. What is the most probable cause at the MVCC level, and how would you confirm it?

Answer. The signature points straight at the long-transaction trap. An application connection is almost certainly sitting idle in transaction, holding the xmin horizon back at the time it opened. Because vacuum may not remove any tuple that a still-running snapshot could need, dead-tuple cleanup stalls across the entire database, so size grows even on tables that connection never touched. The restart ends the stuck transaction, the horizon advances, autovacuum catches up, and growth stops. You confirm it by querying pg_stat_activity for rows whose state is idle in transaction with an old xact_start, and by checking backend_xmin and age(backend_xmin) to see which backend pins the oldest horizon. The durable fix is in the application: set an idle_in_transaction_session_timeout, and make sure transactions are not left open across slow external calls.