WAL, Checkpoints, and Durability

Learning outcomes

Durability is the promise that a committed transaction survives a crash. PostgreSQL keeps that promise with the write-ahead log, the same mechanism that powers replication and point-in-time recovery. Almost every durability knob you will ever touch, from synchronous_commit to max_wal_size, is a dial on the machinery this page describes. Get the model right and those knobs stop being folklore.

After studying this page, you can:

Explain why a database cannot flush random data pages on every commit, and how a sequential log fixes it.
State the write-ahead rule precisely and decide, for a given moment, whether a transaction is durable.
Trace a write from a buffer through the WAL buffer to an fsync, and read the current log position with pg_current_wal_lsn().
Predict how checkpoint_timeout, max_wal_size, and checkpoint_completion_target shape both recovery time and WAL volume.
Connect frequent checkpoints to full-page-write amplification, and pick the right synchronous_commit level for a given durability budget.
Diagnose checkpoint storms, runaway WAL retention, and the data-loss risk of turning durability off.

Before we dive in

You should be comfortable with basic SQL, transactions, and the idea that PostgreSQL holds recently used data pages in a shared memory cache called shared buffers. You do not need to know the on-disk page layout in detail; we touch it only where it matters, and the heap-pages-and-toast page goes deeper. It also helps to remember from the mvcc-and-tuple-visibility page that a write does not overwrite in place; it produces new tuple versions on a page.

A few terms, defined as we use them. A data page (or heap page) is the 8 KB block PostgreSQL reads and writes as a unit. A dirty page is a page in shared buffers that has been modified in memory but not yet written to its data file on disk. The WAL (write-ahead log) is a separate, append-only stream of records describing every change, stored under pg_wal/. An fsync is the system call that forces the operating system to push a file’s buffered writes all the way down to durable storage; until it returns, a “written” file may still live only in volatile OS cache. A checkpoint is a point at which PostgreSQL guarantees all changes up to a certain log position are safely in the data files. Hold these five. Everything below is built from them.

Mental Model

The wrong model, and it is the intuitive one, is that committing a transaction means writing your changed rows to their place in the table file and making sure they land on disk. Under that picture, durability is just “save the data,” and a commit waits for the table to be updated on disk.

PostgreSQL does almost the opposite. The better model is a journal kept beside a ledger. When you make a change, you do not rewrite the ledger page right away. You append one line to a running journal: “at position X, row 42 on page 9 changed to this.” You flush that journal line to disk, and only then is the change durable. The ledger pages themselves get rewritten later, lazily, in the background, whenever it is convenient. If the machine crashes, you reopen the journal, replay every line since the last known-good point, and the ledger is reconstructed exactly.

Keep this picture. A commit is durable the instant its journal line (its WAL record) is flushed, not when its data page is saved. Data pages are written behind the scenes. Once that clicks, checkpoints, full-page writes, recovery time, and the synchronous_commit levels all fall out of the same rule instead of being a grab bag of settings.

Breaking it down

1. Why you cannot fsync data pages on every commit

Start with the problem, because the whole design exists to dodge it. Imagine the naive durability scheme: on every commit, find each data page the transaction touched, and fsync it to disk before returning success. It sounds correct, and it is correct. It is also unusably slow.

Here is why. The pages a transaction touches are scattered. One commit might dirty the heap page holding a row, two or three index pages, and a free-space map page, sitting at unrelated offsets in unrelated files. Flushing them means random I/O: the storage device must seek to each location. On a spinning disk a random write is roughly a hundred times slower than a sequential one, and even on an SSD random small writes carry real overhead and wear. Worse, two transactions that both touch the same hot page would each have to flush that whole 8 KB page, so a page touched a thousand times per second would be flushed a thousand times.

flowchart TB
    C[Commit touches 4 scattered pages]
    C --> P1[Heap page in table file]
    C --> P2[Index page A]
    C --> P3[Index page B]
    C --> P4[Free space map page]
    P1 --> D[Random fsync to disk: seek, seek, seek]
    P2 --> D
    P3 --> D
    P4 --> D
    D --> S[Slow: random I/O dominates commit latency]

The insight that breaks the bind: you do not need the data pages on disk to be durable. You only need a durable record of what changed. And a record of changes can be written as one thing, in one place, in commit order. That turns many scattered random writes into a single sequential append, which is the fastest thing a storage device does. That single append is the WAL.

2. The write-ahead rule: log first, page later

The rule that makes this safe has a name, and it is worth stating exactly. The write-ahead rule: the log record describing a change must be flushed to disk before the data page it describes is allowed to be written to disk. Log first, page later, always in that order.

Here is why that order is non-negotiable. Suppose PostgreSQL wrote a dirty data page to disk but had not yet flushed the WAL record for that change, and then the machine crashed. On restart, the data file holds a possibly half-applied change with no journal entry explaining it, and recovery cannot reason about it. By forcing the WAL out first, PostgreSQL guarantees that any change visible in a data file is also described in the durable log, so recovery can always finish the story.

Now the part that surprises people. A commit is durable the moment its WAL record is fsync’d, full stop. The data pages it changed may still be sitting dirty in shared buffers, not yet on disk, and that is fine. PostgreSQL does not write those pages at commit time at all. They are written later, lazily, by two background actors: the background writer, which trickles dirty pages out to keep clean buffers available, and the checkpointer, which flushes them in bulk at a checkpoint. The commit path itself only appends to the log and flushes the log.
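The commit path described above can be sketched as a toy model. This is an illustration only, not PostgreSQL's implementation: the class `ToyServer` and all of its fields are hypothetical names, and it simplifies by flushing WAL only at commit, whereas a real server may also flush records of uncommitted transactions.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """Toy model: durable log and data files, plus volatile memory."""
    durable_wal: list = field(default_factory=list)     # survives a crash
    data_files: dict = field(default_factory=dict)      # survives a crash
    wal_buffers: list = field(default_factory=list)     # lost on a crash
    shared_buffers: dict = field(default_factory=dict)  # lost on a crash

    def change(self, page, value):
        # Modify the page in memory and queue a WAL record describing it.
        self.shared_buffers[page] = value
        self.wal_buffers.append((page, value))

    def commit(self):
        # The commit path flushes the log only; data pages stay dirty.
        self.durable_wal.extend(self.wal_buffers)
        self.wal_buffers.clear()

    def crash(self):
        # Volatile state disappears; durable state remains.
        self.wal_buffers.clear()
        self.shared_buffers.clear()

    def recover(self):
        # Replay the durable log onto the data files.
        for page, value in self.durable_wal:
            self.data_files[page] = value


server = ToyServer()
server.change("page 9", "row 42 = new")
server.commit()                            # durable from this point on
server.change("page 7", "row 5 = draft")   # never committed
server.crash()
assert server.data_files == {}             # no data page was ever written
server.recover()
print(server.data_files)                   # {'page 9': 'row 42 = new'}
```

The committed change comes back even though its data page never reached disk, and the uncommitted one is gone, which is the behavior the write-ahead rule promises.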

What a commit actually waits for

COMMIT appends the transaction's commit record to the WAL and fsyncs the WAL up to that record, which also covers every earlier record the transaction wrote. The data pages stay dirty in shared buffers; the background writer and the next checkpoint write them later. Durability is a property of the log, not of the table file at commit time.

This is the whole trick. Durability becomes a property of one sequential, append-only file, and the expensive scattered writes are deferred and batched. The cost you defer is that dirty pages accumulate in memory and the log grows, so eventually something must flush the pages and let the old log be reused. That something is the checkpoint, rung 4. First, the log itself.

3. WAL mechanics: records, LSNs, and segments

Here is what the log is made of. Every change generates one or more WAL records: compact descriptions like “on page 9 of relation 16384, insert this tuple at this slot,” or “set the commit status of transaction 742.” Records are appended to the log in the order changes happen, and the log is a single logical stream that never seeks backward.

Every byte position in that stream has an address called the LSN (log sequence number). An LSN is just an offset into the ever-growing log, printed as two hex halves like 3A/1C5F08. Because it only ever increases, an LSN is a clean way to say “this point in history.” If WAL record A has a smaller LSN than record B, A happened first. You can read the current write position any time (pg_current_wal_insert_lsn() reports the insert position, which can run slightly ahead of it):

-- Where are we in the WAL right now?
select pg_current_wal_lsn();
--  pg_current_wal_lsn
-- --------------------
--  3A/1C5F08
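
To make the "two hex halves" concrete, here is a small sketch that turns the text form of an LSN into a plain byte offset and subtracts two of them. The helper names `lsn_to_int` and `lsn_diff` are made up for this example; on a real server, pg_wal_lsn_diff does the subtraction for you.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert 'hi/lo' hex text into a single 64-bit byte offset."""
    hi, lo = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_diff(later: str, earlier: str) -> int:
    """Byte distance between two LSNs (positive when `later` is ahead)."""
    return lsn_to_int(later) - lsn_to_int(earlier)


print(lsn_to_int("3A/1C5F08"))               # 249109962504
print(lsn_diff("3A/9F0000", "3A/1C5F08"))    # 8560888
```

Because the value is just an offset, comparing or subtracting LSNs is ordinary integer arithmetic.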

The stream is stored on disk as a series of fixed-size files called WAL segments, living in the pg_wal/ directory. Each segment is 16 MB by default. PostgreSQL fills one segment, moves to the next, and names them by the LSN range they cover. Splitting the log into fixed-size files is what lets old WAL be recycled: once a segment is no longer needed for recovery or replication, PostgreSQL renames and reuses the file rather than deleting and reallocating it, which avoids filesystem churn.
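As a rough illustration of how a byte position maps to a segment file, the sketch below derives the 24-hex-digit file name for the segment that contains a given LSN. It assumes the default 16 MB segment size and timeline 1, and `wal_segment_name` is a hypothetical helper; the server's own pg_walfile_name function is the authoritative answer and treats an LSN sitting exactly on a segment boundary differently.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size in bytes


def wal_segment_name(lsn: str, timeline: int = 1,
                     seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the segment file containing this LSN (assumed naming scheme)."""
    hi, lo = lsn.split("/")
    position = (int(hi, 16) << 32) | int(lo, 16)
    segno = position // seg_size                # segments since the start of the log
    segs_per_4gb = 0x1_0000_0000 // seg_size    # 256 with 16 MB segments
    # Three 8-digit hex fields: timeline, high part, low part of the segment number.
    return f"{timeline:08X}{segno // segs_per_4gb:08X}{segno % segs_per_4gb:08X}"


print(wal_segment_name("3A/1C5F08"))   # 000000010000003A00000000
print(wal_segment_name("0/1000000"))   # 000000010000000000000001
```

Both example LSNs used on this page (3A/1C5F08 and 3A/9F0000) fall in the same 16 MB segment, since their low halves are both below 0x1000000.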

The write path is a small pipeline, and naming its stages makes the tuning later make sense.

flowchart LR
    A[Backend changes a page in shared buffers] --> B[WAL record built]
    B --> C[Appended to WAL buffers in shared memory]
    C --> D[On commit: write WAL buffers to pg_wal segment]
    D --> E[fsync the segment]
    E --> F[Commit returns: change is durable]

A backend modifies a page in shared buffers and, in the same breath, builds the WAL record for that change and appends it to the WAL buffers, a small ring of shared memory. The records sit there cheaply until a commit (or full WAL buffers, or the background WAL writer process) forces them out: PostgreSQL writes the WAL buffers to the current segment file and fsyncs it. Only after that fsync returns does the commit report success. The data page, remember, is still dirty in shared buffers this whole time; nothing wrote it.

4. Checkpoints: bounding recovery and recycling the log

You have deferred two things: dirty pages pile up in memory, and the WAL grows without bound. A checkpoint is the periodic event that settles both debts. At a checkpoint, the checkpointer flushes every dirty shared buffer to its data file and fsyncs those files, so that all changes up to a known LSN are now safely in the data pages, not just in the log.

That known LSN is the checkpoint’s REDO point: the log position from which recovery would have to start if the server crashed right after this checkpoint. Two payoffs follow directly. First, any WAL segment entirely before the REDO point is no longer needed for crash recovery, so it can be recycled. That is what keeps pg_wal/ from growing forever. Second, recovery now has a bounded starting line: it never has to replay from further back than the last checkpoint, so recovery time is bounded by how much WAL accumulates between checkpoints.

Checkpoints fire on whichever of two triggers comes first:

checkpoint_timeout, default 5 minutes: a time-based cap on the interval.
max_wal_size, default 1 GB: a volume-based cap. When the WAL written since the last checkpoint approaches this, a checkpoint is forced to allow recycling. Note that max_wal_size is a soft target for the WAL kept between checkpoints, not a hard ceiling on pg_wal/ size.

A checkpoint that flushed all its dirty pages at once would slam the disk with a write spike and stall everyone. So checkpoint_completion_target, default 0.9, spreads the flush across a fraction of the interval to the next checkpoint. At 0.9, PostgreSQL aims to finish writing the dirty pages over the first 90 percent of the way to the next checkpoint, smearing the I/O into a gentle slope instead of a cliff.
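The interplay of the two triggers and the completion target can be sketched with simple arithmetic. This is a deliberately simplified model with a hypothetical function name: it assumes a constant WAL rate and pretends the volume trigger fires exactly when max_wal_size is reached, whereas the real server starts volume-driven checkpoints somewhat earlier so that total WAL stays near the target.

```python
def next_checkpoint(wal_bytes_per_sec, checkpoint_timeout_s=300,
                    max_wal_size_bytes=1024**3, completion_target=0.9):
    """Return (trigger, interval_s, write_window_s) under a constant WAL rate."""
    if wal_bytes_per_sec > 0:
        seconds_to_fill = max_wal_size_bytes / wal_bytes_per_sec
    else:
        seconds_to_fill = float("inf")
    # Whichever limit is hit first starts the checkpoint.
    if seconds_to_fill < checkpoint_timeout_s:
        trigger, interval = "max_wal_size", seconds_to_fill
    else:
        trigger, interval = "checkpoint_timeout", float(checkpoint_timeout_s)
    # The dirty-page writes are spread over this share of the interval.
    return trigger, interval, interval * completion_target


MB = 1024**2
print(next_checkpoint(1 * MB))    # quiet system: the timeout paces checkpoints
print(next_checkpoint(32 * MB))   # busy system: volume paces checkpoints
```

At 1 MB/s the timeout wins and the flush is spread over about 270 of the 300 seconds; at 32 MB/s the default 1 GB fills in about 32 seconds, so checkpoints are volume-driven and the smoothing window shrinks with them.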

The life of a checkpoint

Trigger. Either 5 minutes pass (checkpoint_timeout) or about 1 GB of WAL has been written since the last checkpoint (max_wal_size), whichever comes first. A checkpoint begins.

When checkpoints fire too often, PostgreSQL warns you in the server log:

LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".

This message means max_wal_size filled before checkpoint_timeout elapsed, so volume, not time, is driving your checkpoints, and they are closer together than the timeout intends. The warning appears when WAL-triggered checkpoints land closer together than checkpoint_warning, 30 seconds by default. It is almost never benign. The next rung explains why frequent checkpoints do not just cost flush I/O; they inflate the WAL itself.

5. Full-page writes: why frequent checkpoints cost you

This rung explains the single most counterintuitive cost in the whole system, so slow down here. The setting is full_page_writes, on by default, and the problem it solves is the torn page.

A PostgreSQL page is 8 KB, but the operating system and storage write in smaller units, often 4 KB sectors. If the power fails mid-write, the disk can end up with half of an 8 KB page from the new version and half from the old. That is a torn page: internally inconsistent, and a plain WAL record that says “change byte 40 on this page” cannot fix it, because it assumes the rest of the page is intact. It is not.

The fix: the first time a page is modified after a checkpoint, PostgreSQL writes a full-page image of that entire 8 KB page into the WAL, not just the delta. During recovery, replay starts by stamping that whole known-good page image down, which overwrites any torn state, and then applies the later deltas on top. Subsequent modifications of the same page within the same checkpoint interval are normal small records again; only the first one after each checkpoint pays the full-page cost.

Now connect it to checkpoint spacing, because this is the link people miss. Full-page images are charged once per page per checkpoint interval. So the more often checkpoints happen, the more often each hot page pays its full 8 KB toll instead of a tiny delta. Halve the checkpoint interval and you roughly double the full-page-write volume in the WAL. This is precisely why a too-small max_wal_size is expensive twice over: it triggers frequent checkpoints (flush spikes), and those frequent checkpoints multiply full-page writes, so the WAL volume itself balloons, which forces checkpoints even sooner. That feedback loop is the checkpoint storm.

flowchart TB
    A[max_wal_size too small] --> B[Checkpoints fire often]
    B --> C[Each page pays a full-page image more often]
    C --> D[WAL volume balloons]
    D --> B
    B --> E[Frequent flush I/O spikes]
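
The once-per-page-per-interval charge can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the workload figures are invented, the function name is hypothetical, and it assumes every hot page is modified at least once in every checkpoint interval and that images are uncompressed.

```python
PAGE_SIZE = 8192  # bytes in one uncompressed full-page image


def hourly_wal_bytes(hot_pages, updates_per_hour, delta_bytes, checkpoints_per_hour):
    """Return (full_page_image_bytes, delta_record_bytes) for one hour of writes."""
    # Assumption: each hot page pays exactly one full-page image per checkpoint.
    full_page_images = hot_pages * PAGE_SIZE * checkpoints_per_hour
    # Every update also writes its ordinary small record.
    deltas = updates_per_hour * delta_bytes
    return full_page_images, deltas


# Hypothetical workload: 100,000 hot pages, 3.6 million updates/hour, 100-byte records.
for per_hour in (12, 24):  # a checkpoint every 5 minutes, then every 2.5 minutes
    fpi, deltas = hourly_wal_bytes(100_000, 3_600_000, 100, per_hour)
    print(per_hour, fpi, deltas)
# 12 9830400000 360000000
# 24 19660800000 360000000
```

In this made-up workload the delta records stay at about 0.36 GB per hour either way, while the full-page images go from roughly 9.8 GB to 19.7 GB when the interval is halved, so the images dominate the WAL volume.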

You can shrink the full-page-image cost without changing checkpoint spacing by compressing those images. Setting wal_compression (try lz4 on PostgreSQL 15 and later, or on for the older pglz) compresses full-page images before they go into the WAL, trading a little CPU for less WAL volume and less WAL I/O. It is often a clear win on write-heavy systems.

# postgresql.conf
full_page_writes = on        # leave ON unless your storage guarantees atomic 8 KB writes
wal_compression = lz4        # compress full-page images; cheap CPU for less WAL
                             # lz4 needs PostgreSQL 15+ built with lz4 support; otherwise use: on (pglz)

Three full-page-write facts catch experienced engineers off guard. First, turning full_page_writes off does shrink WAL, but it removes torn-page protection: unless your storage truly guarantees atomic 8 KB writes, a crash mid-write can silently corrupt a page that no later WAL record can repair, so the savings come with a data-corruption risk that is almost never worth taking. Second, a large UPDATE or bulk load that straddles a checkpoint writes far more WAL than the row data alone, because right after a checkpoint nearly every page it touches is being modified for the first time this interval and so emits a full 8 KB image. Third, wal_compression pays off most on pages with free space or repetitive structure, where lz4 can cut full-page-image volume substantially for a few percent of CPU; measure your WAL generation rate before and after to confirm the win on your data.

6. Crash recovery: replaying from the REDO point

Now you can see why checkpoints exist at all, from the recovery side. When PostgreSQL starts after a crash, it reads pg_control to find the REDO point of the last completed checkpoint. Then it does redo: it walks the WAL forward from that REDO point and re-applies every change record to the data pages, in order, until it reaches the end of the valid WAL. Replaying a record is idempotent in effect because each page carries the LSN of the last change applied to it, so recovery skips any record already reflected on its page and applies the rest.

This is where full-page writes earn their keep. The first record for each page in the replay stream is that page’s full-page image (if one was written), so recovery stamps a known-good 8 KB page down first, erasing any torn state, then applies the deltas. By the end of redo, every committed change that made it into the WAL before the crash is back in the data pages, and every uncommitted change is simply not marked committed, so MVCC visibility ignores it. The database is consistent.

sequenceDiagram
    participant Boot as Startup
    participant Ctl as pg_control
    participant WAL as pg_wal segments
    participant Pages as Data pages
    Boot->>Ctl: read last checkpoint REDO point
    Boot->>WAL: open WAL at REDO point
    loop each record to end of WAL
        WAL->>Pages: apply change if page LSN is older
    end
    Pages-->>Boot: data pages now reflect all committed WAL
    Boot->>Boot: open database, accept connections
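
The "apply change if page LSN is older" step is what makes replay safe to repeat. Here is a minimal sketch of that check, using plain integers for LSNs and a dictionary for data pages. The function `redo` and its data shapes are invented for illustration, and the model ignores full-page images entirely.

```python
def redo(pages, wal, redo_point):
    """Replay WAL onto pages; return how many records were actually applied.

    pages: {page_id: (page_lsn, value)}
    wal:   list of (lsn, page_id, value), in LSN order
    """
    applied = 0
    for lsn, page_id, value in wal:
        if lsn < redo_point:
            continue  # before the REDO point: already in the data files
        page_lsn, _ = pages.get(page_id, (0, None))
        if page_lsn >= lsn:
            continue  # the page already reflects this record
        pages[page_id] = (lsn, value)  # apply the change and stamp the page LSN
        applied += 1
    return applied


wal = [(100, "A", "a1"), (200, "B", "b1"), (300, "A", "a2"), (400, "B", "b2")]
# State on disk at the crash: page A as of the checkpoint, page B written out later.
pages = {"A": (100, "a1"), "B": (400, "b2")}

print(redo(pages, wal, redo_point=200))   # 1  (only the record at LSN 300 was missing)
print(redo(pages, wal, redo_point=200))   # 0  (a second replay changes nothing)
print(pages)                              # {'A': (300, 'a2'), 'B': (400, 'b2')}
```

Running the replay twice gives the same pages, which is the idempotence the page LSN provides.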

The cost of recovery is the amount of WAL between the REDO point and the crash, because that is what must be replayed. This is the whole reason checkpoints exist and the reason their spacing is a real trade. Frequent checkpoints mean short recovery (little WAL to replay) but heavy steady-state cost (flush spikes plus full-page-write amplification). Infrequent checkpoints mean cheap steady state but a longer worst-case recovery. Tuning max_wal_size and checkpoint_timeout is choosing where on that line you want to sit.

7. synchronous_commit: trading durability for latency

Until now, “durable” meant the commit’s WAL record is fsync’d to local disk before the commit returns. That is the default, and synchronous_commit is the knob that lets you move that bar, trading durability guarantees for commit latency. It is per-transaction, so you can keep the strict default globally and relax it only where the data can tolerate it.

The levels, from strongest to weakest:

Level	Commit returns after	Risk on crash
remote_apply	WAL is durable locally and applied on a sync standby	Strongest; a read on the standby sees the commit. Highest latency.
on (default)	WAL fsync’d locally (and durable on a sync standby if one is configured)	None on a single node: a returned commit survives a crash.
remote_write	WAL fsync’d locally and written (not yet fsync’d) on a sync standby	Standby OS crash could lose it before its disk flush; local node still safe.
local	WAL fsync’d locally, ignoring any standby	Standby may lag; local durability intact.
off	Nothing: commit returns immediately, with its WAL still in the WAL buffers, to be written and fsync’d shortly after by the WAL writer	The last small window of committed transactions can be lost on a crash. Never corrupts.

Two levels deserve a hard look. synchronous_commit = off is the tempting one for throughput: the commit returns as soon as its WAL record is in the WAL buffers, without waiting for the record to be written out or for the fsync to complete. The WAL writer process flushes it shortly after. The reward is much lower commit latency and higher throughput, because commits no longer wait on the disk flush. The risk, stated precisely, is bounded: a crash can lose the last fraction of a second of committed transactions, the ones whose WAL had not yet been flushed (at most about three times wal_writer_delay, which defaults to 200 ms). It cannot corrupt the database, because the write-ahead rule still holds for the data pages; you only lose recent commits, you never get a torn state. That distinction is the whole point. Losing the tail is acceptable for, say, high-volume event ingestion you can replay. It is unacceptable for a financial ledger, where a commit that returned success must never vanish.

The reason the strict default stays affordable under load is group commit. When many transactions commit at once, PostgreSQL batches their WAL flushes: one fsync can make a whole group of commits durable, amortizing the flush cost. The WAL writer process also periodically flushes WAL in the background. With synchronous_commit = off, commits do not wait for these flushes at all; with on, the batching still helps, because under load your fsync often piggybacks on a flush another transaction already triggered. Settings like commit_delay can deliberately wait a tiny interval to let more commits join a group, raising throughput at a small latency cost.
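
Group commit can be illustrated with a toy queue model. The sketch assumes one flush runs at a time, each flush takes a fixed time, and every commit that has arrived by the moment a flush starts rides along with it. The function name and the numbers are invented, and the real scheduling inside PostgreSQL is more involved.

```python
def count_flushes(arrivals_ms, flush_ms):
    """Count fsyncs needed when waiting commits share the next flush."""
    arrivals = sorted(arrivals_ms)
    flushes = 0
    i = 0
    disk_free_at = 0.0
    while i < len(arrivals):
        # The next flush starts once a commit is waiting and the disk is idle.
        start = max(arrivals[i], disk_free_at)
        # Every commit that has arrived by then is covered by this one flush.
        while i < len(arrivals) and arrivals[i] <= start:
            i += 1
        flushes += 1
        disk_free_at = start + flush_ms
    return flushes


# 100 commits, each flush takes 5 ms.
print(count_flushes(range(100), 5))           # 21   (one commit per ms: flushes are shared)
print(count_flushes(range(0, 1000, 10), 5))   # 100  (one commit per 10 ms: nothing to share)
```

Under the dense arrival pattern, 100 commits need only 21 flushes, which is why batching matters most exactly when the system is busiest.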

-- Relax durability for one bulk job that you can safely re-run, leaving the default strict.
begin;
set local synchronous_commit = off;
-- ... high-volume inserts that are cheap to replay ...
commit;

One pairing is genuinely dangerous and worth saying plainly. Setting both fsync = off and full_page_writes = off removes the last safety nets: fsync = off lets the OS reorder and defer all writes with no flush barrier, and full_page_writes = off removes torn-page protection. Together, a crash can leave the data files arbitrarily inconsistent in a way no recovery can repair. These exist only for throwaway scratch databases you can rebuild from scratch. On anything you care about, leave both on.

8. Diagnostics, tuning hooks, and failure modes

You tune all of this by measurement, not by guesswork, and PostgreSQL exposes the right counters. On PostgreSQL 17 and later, checkpoint statistics live in pg_stat_checkpointer; on 16 and earlier they are in pg_stat_bgwriter. The deep how-to lives in the configuration track’s WAL and checkpoint tuning page; here is the foundation.

-- PostgreSQL 17+: are checkpoints driven by time or by volume?
select num_timed, num_requested, write_time, sync_time
from pg_stat_checkpointer;

num_timed counts checkpoints triggered by checkpoint_timeout; num_requested counts those forced by max_wal_size (or a manual CHECKPOINT). If num_requested dominates, your max_wal_size is too small for your write rate and you are in the checkpoint-storm regime from rung 5. write_time and sync_time, both in milliseconds, tell you how much wall time checkpoints spend writing and flushing, which is your spike budget. On PostgreSQL 16 and earlier, the equivalent columns in pg_stat_bgwriter are checkpoints_timed, checkpoints_req, checkpoint_write_time, and checkpoint_sync_time.

To measure WAL volume directly, sample the LSN at two moments and subtract. pg_wal_lsn_diff returns the byte distance between two LSNs:

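-- Sample the LSN twice, a representative interval apart, then diff the two samples.
select pg_current_wal_lsn();        -- first sample, e.g. 3A/1C5F08
-- ... let the workload run for, say, 60 seconds ...
select pg_current_wal_lsn();        -- second sample, e.g. 3A/9F0000
-- Later LSN first, earlier LSN second: the result is bytes of WAL written in between.
select pg_wal_lsn_diff('3A/9F0000', '3A/1C5F08') as bytes_written;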

That byte count per interval is your WAL generation rate, the number that tells you whether max_wal_size gives you the checkpoint spacing you want. The symptom of an under-sized max_wal_size is the trio: the “checkpoints occurring too frequently” log warning, num_requested climbing far faster than num_timed, and a WAL generation rate that fills max_wal_size well inside checkpoint_timeout.

Check yourself

pg_stat_checkpointer shows num_requested rising fast while num_timed barely moves, and the log says checkpoints are occurring too frequently. What is happening and what is the first fix?

WAL is also the foundation of everything beyond a single node. Streaming and synchronous replication ship these same WAL records to standbys and replay them there; the replication-and-ha track covers that in depth. Point-in-time recovery works by archiving WAL segments and replaying them up to a chosen target time or LSN. The same log that gives you crash recovery on one machine gives you replicas and time travel across many.

That power creates the last failure mode, and it is one that pages people at 3 a.m. PostgreSQL must keep WAL segments that a consumer still needs: an archive command that has not yet succeeded, or a replication slot for a standby that has fallen behind or disconnected. If the archiver is stuck or a slot’s standby is gone, WAL accumulates in pg_wal/ with no upper bound, and it will fill the disk. A full pg_wal/ is a hard stop: the database cannot write WAL, so it cannot commit, so it effectively halts.
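
To see which consumer is pinning WAL, the useful number is how far each replication slot's restart_lsn trails the current LSN. The sketch below does that arithmetic on hand-written sample rows. The slot names and LSNs are invented, and the helper names are hypothetical; the tuples mirror the slot_name, active, and restart_lsn columns of pg_replication_slots.

```python
def lsn_to_int(lsn):
    """Convert 'hi/lo' hex text into a 64-bit byte offset."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def retained_by_slot(slots, current_lsn):
    """Return (slot_name, active, retained_bytes), worst offender first.

    slots: iterable of (slot_name, active, restart_lsn); restart_lsn may be None.
    """
    now = lsn_to_int(current_lsn)
    report = [
        (name, active, now - lsn_to_int(restart_lsn))
        for name, active, restart_lsn in slots
        if restart_lsn is not None  # a slot that reserves no WAL holds nothing back
    ]
    return sorted(report, key=lambda row: row[2], reverse=True)


# Invented sample rows: one healthy standby, one slot whose standby is long gone.
slots = [("standby_a", True, "3A/9E0000"), ("old_standby", False, "12/0")]
for name, active, retained in retained_by_slot(slots, "3A/9F0000"):
    print(name, active, retained)
# old_standby False 171809112064
# standby_a True 65536
```

An inactive slot holding back on the order of 160 GB, as in this made-up sample, is the pattern to look for; a healthy slot trails by kilobytes or megabytes.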

Mastery Questions

A teammate proposes setting max_wal_size = 256MB on a busy OLTP database “to keep pg_wal small and save disk.” Throughput drops and the server log fills with “checkpoints are occurring too frequently.” Explain the chain of cause and effect, and what you would do instead.

Answer. A small max_wal_size caps how much WAL may accumulate between checkpoints, so on a busy system the volume trigger fires long before checkpoint_timeout, and checkpoints happen very often, which is exactly what the warning reports. Frequent checkpoints hurt twice. First, each checkpoint flushes dirty buffers, so you get repeated I/O spikes even with checkpoint_completion_target smoothing them. Second, and the part people miss, every checkpoint re-arms full-page writes: the first modification of each page after a checkpoint writes a full 8 KB image into the WAL. With checkpoints close together, hot pages pay that 8 KB toll constantly instead of emitting tiny deltas, so WAL volume balloons, which makes max_wal_size fill even faster, a feedback loop. The fix is the opposite of the proposal: measure your WAL generation rate by sampling pg_current_wal_lsn() over an interval and diffing, then raise max_wal_size so that checkpoint_timeout (not volume) paces checkpoints in steady state. Optionally enable wal_compression = lz4 to shrink the full-page images. pg_wal being a bit larger is cheap; checkpoint storms are not.

Your application commits financial transactions. A senior engineer suggests synchronous_commit = off globally because benchmarks show a big throughput gain. Is the database at risk of corruption? What is actually at risk, and how would you get most of the benefit safely?

Answer. No, synchronous_commit = off cannot corrupt the database. The write-ahead rule still holds, so data pages are never written ahead of their WAL, and recovery always produces a consistent state. What is at risk is durability of recently committed transactions: with off, a commit returns as soon as its WAL record is in the WAL buffers, before it is written out and fsync’d, so a crash can lose the last small window of transactions that returned success but had not been flushed. For a financial ledger that is unacceptable: a commit that told the client “done” must never disappear, both for correctness and for audit. So you do not set it off globally. You keep the strict default on for the money path, and relax synchronous_commit only per-transaction (set local synchronous_commit = off) on workloads where loss is recoverable, such as bulk imports you can re-run or analytics staging. You still get most of the throughput from group commit, which batches WAL flushes under load so a single fsync makes many commits durable, and you can nudge it with commit_delay. Strict where it matters, relaxed only where loss is safe.

Monitoring shows pg_wal growing steadily and approaching the disk limit, even though max_wal_size is 1 GB and the write rate is normal. What are the likely causes, how do you tell them apart, and why is this an emergency?

Answer. max_wal_size is only a soft target for WAL kept between checkpoints; it does not cap WAL that some consumer still needs. Two consumers commonly pin WAL. First, archiving for point-in-time recovery: if archive_command is failing or slow, PostgreSQL keeps every unarchived segment, and they pile up. Second, a replication slot: if a standby has disconnected or fallen far behind, its slot holds back every segment the standby has not yet consumed, with no bound unless you set one. You tell them apart by looking: check the archiver state and last archive failure (pg_stat_archiver), and check replication slots for a stale or inactive one with a low restart_lsn far behind the current LSN (pg_replication_slots). It is an emergency because a full pg_wal is a hard stop: with no room to write WAL, PostgreSQL cannot commit anything, so the database halts. The durable fix depends on cause: repair or unblock the archive command, or drop the dead slot and bound future risk with max_slot_wal_keep_size so a failed standby can never fill the primary’s disk again.