WAL and Checkpoint Tuning
How to tune PostgreSQL's write-ahead log and checkpoints for write throughput: sizing max_wal_size and checkpoint_timeout, cutting WAL with wal_compression and wal_level, choosing synchronous_commit levels and group commit, tuning the background writer, and reading the diagnostics, with the four classic failure modes.
Learning outcomes
This is the tuning page. The wal-checkpoints-and-durability page already taught the machinery: the write-ahead rule, LSNs and segments, what a checkpoint does, why frequent checkpoints inflate WAL through full-page writes. This page assumes that model and spends its time on the knobs. You will leave knowing which setting to reach for first, what number to put in it, and what the cost is on the other side of the trade.
After studying this page, you can:
- Size
max_wal_sizefrom a measured WAL generation rate, and state the recovery-time cost of raising it. - Pace checkpoints with
checkpoint_timeoutandcheckpoint_completion_target, and explain how the two triggers interact. - Cut WAL volume with
wal_compressionand the rightwal_level, and price the CPU and feature trade-offs. - Choose a
synchronous_commitlevel per workload, and usecommit_delayto harvest group commit on a commit storm. - Tune
wal_buffers, the WAL writer, and the background writer so checkpoints and commits stop spiking your I/O. - Read
pg_stat_checkpointerand a two-point LSN diff to drive every change above by measurement, and recognize the four classic failure modes.
Before we dive in
You should already know, from the wal-checkpoints-and-durability page, what WAL is, what a checkpoint does, and why the first write to a page after a checkpoint emits a full 8 KB image. We recap that in rung 1 in one paragraph and then build only on top of it. If any of those words feel fuzzy, read that page first; everything here is a dial on its mechanism.
A few terms used throughout, defined on first use. Write throughput is how many bytes of changes per second your database can commit and persist. Write amplification is the ratio of bytes actually written to durable storage over the bytes of logical change you asked for: a ratio of 3 means every megabyte of row changes costs three megabytes on disk. WAL generation rate is how fast pg_wal fills, in bytes per second, under your real workload. Crash recovery time is how long the server takes to come back after an unclean shutdown, dominated by how much WAL it must replay. Hold those four; the whole page is about moving them against each other on purpose.
You apply every setting here with ALTER SYSTEM, which writes to postgresql.auto.conf, followed by a reload. Most WAL settings reload live; a few need a restart, and we flag those.
alter system set max_wal_size = '16GB';
select pg_reload_conf();
Mental Model
The wrong model, and it is a natural one, is that tuning WAL means making the log smaller and the fsyncs rarer so writes go faster, full stop. Under that picture every knob points the same way: shrink WAL, relax durability, win throughput. So you set max_wal_size small to keep pg_wal tidy, and you reach for synchronous_commit = off first because it is the famous fast switch.
Both moves are backwards. The better model is a budget with two accounts that trade against each other: a steady-state write cost account and a crash-recovery time account. Almost every WAL knob moves cost from one account to the other, and the default settings sit at a conservative point that favors fast recovery over write throughput. Tuning is deciding, deliberately, to spend more recovery time to buy back write throughput, because on a busy server recovery happens rarely and writes happen constantly. That is why the single biggest win is to make max_wal_size larger, not smaller: a bigger budget between checkpoints means fewer checkpoints, fewer full-page writes, less amplification, at the price of a longer replay if you do crash. Keep this picture. You are not minimizing WAL; you are choosing where on the steady-state-versus-recovery line your workload should sit.
Breaking it down
1. Recap: the one rule you tune around
One paragraph of recap, then we tune. A commit is durable the instant its WAL record is fsync’d, not when its data pages reach disk; the pages are written lazily by the background writer and flushed in bulk at each checkpoint. A checkpoint fires on whichever comes first, checkpoint_timeout or max_wal_size, and it bounds crash recovery because replay starts from the last checkpoint’s REDO point. The expensive twist is full-page writes: the first time a page is touched after a checkpoint, its entire 8 KB image goes into the WAL to survive a torn write, so the closer together your checkpoints, the more often every hot page pays that 8 KB toll instead of a tiny delta. That last sentence is the lever the whole page pulls. Spacing checkpoints further apart is the same act as cutting full-page-write volume, and that is why one knob, max_wal_size, sits at the top.
flowchart LR
A[Checkpoint spacing] --> B[Full-page-write volume]
B --> C[WAL generation rate]
C --> D[Write amplification]
A --> E[Crash recovery time]
B --> F[fsync and flush I/O]Read it as one chain. Widen the spacing on the left and everything in the middle column shrinks, while the one cost on the far right, recovery time, grows. The rest of this page is which knob moves which arrow.
2. max_wal_size: the highest-leverage write knob
Start here, because on a write-heavy server this one setting buys more throughput than all the others combined. max_wal_size (default 1 GB) is the soft cap on how much WAL may accumulate between checkpoints. When WAL written since the last checkpoint approaches it, a checkpoint is forced so old segments can be recycled. So max_wal_size is really a checkpoint-spacing knob wearing a size label: bigger value, longer gap between checkpoints.
Here is the concrete win. Suppose your workload generates WAL at 100 MB per second. With the default 1 GB cap, a volume-triggered checkpoint fires every ten seconds, far inside the 5-minute timeout. Every ten seconds, every hot page pays its full 8 KB image again. Raise max_wal_size to 16 GB and the volume trigger now needs about 160 seconds to fill, so checkpoints space out and most pages emit one full image per interval instead of dozens. Busy systems commonly run 8 GB to 32 GB here, and it is normal to see WAL generation drop 2x to 3x purely from the reduced full-page-write churn. That is the write amplification from the Mental Model collapsing.
# postgresql.conf
max_wal_size = '16GB' # soft cap on WAL between checkpoints; spaces checkpoints apart
min_wal_size = '2GB' # floor below which PostgreSQL stops shrinking the recycled pool
min_wal_size (default 80 MB) sets the floor of the recycled-segment pool. PostgreSQL keeps pre-allocated, renamed segments ready up to this size rather than deleting them, which avoids the filesystem churn of creating new files during a write burst. On a bursty workload, raising min_wal_size to a couple of GB keeps a warm pool of segments so a sudden spike does not pay allocation cost mid-flight. It costs only disk you have already decided to give pg_wal.
Now the price, stated explicitly, because raising this is not free. A larger max_wal_size means more WAL accumulates between checkpoints, and crash recovery must replay all of it from the last REDO point. With a 1 GB budget, a crash replays at most about 1 GB of WAL, often seconds of work. With 32 GB, a crash can replay tens of GB, and recovery can stretch to many minutes while the server is down and refusing connections. You are trading steady-state write cost for worst-case downtime. On most OLTP systems that trade is overwhelmingly worth it, because crashes are rare and writes are constant, but you must size it knowing the recovery bill, not blindly. A second cost is simply disk: pg_wal will sit larger, so provision for it.
The slider has no derive formula on purpose: the real relationship between max_wal_size and full-page volume depends on your page-touch pattern and cannot be a clean built-in. The bands carry the honest verdict. Drag it and read the trade move from “WAL amplifies” on the left to “recovery takes minutes” on the right; the sweet spot for most write-heavy OLTP is the middle-to-right of that range.
3. checkpoint_timeout and completion_target: pacing the flush
max_wal_size is the volume trigger; checkpoint_timeout (default 5 minutes) is the time trigger. Checkpoints fire on whichever comes first, so the two are partners, and you tune them together. The goal on a write-heavy server is to make the timeout the trigger that fires, not the volume cap. A time-paced checkpoint is predictable and evenly spaced; a volume-paced one fires as fast as your write bursts, which is exactly the storm you are trying to avoid.
So when you raise max_wal_size, raise checkpoint_timeout to match, commonly to 15 or 30 minutes on a write-heavy system with a large WAL budget. The two must move together: a huge max_wal_size with a 5-minute timeout just means time always wins and your big WAL budget is never used; a 30-minute timeout with a 1 GB cap means volume always wins and the timeout is decorative. Set them so that under normal load the timeout fires first and the volume cap is your safety valve for bursts.
# postgresql.conf (write-heavy server, paired with max_wal_size = 16GB)
checkpoint_timeout = '20min' # let time pace checkpoints, not volume
checkpoint_completion_target = 0.9 # spread the flush over 90% of the interval
checkpoint_warning = '30s' # warn if checkpoints fall closer than this
checkpoint_completion_target (default 0.9) controls the shape of each checkpoint’s I/O, not its frequency. A checkpoint that dumped all its dirty buffers at once would slam the disk with a write cliff and stall every query behind it. At 0.9, PostgreSQL paces the writes to finish over the first 90 percent of the way to the next checkpoint, turning the cliff into a gentle slope. Leave it at 0.9 on almost all systems. Lowering it sharpens the spike, which is the opposite of what you want; raising it toward 1.0 leaves no margin to finish before the next checkpoint begins. The longer your checkpoint_timeout, the more wall-clock time that 90 percent represents, so a longer timeout also makes each flush gentler. That is a quiet second benefit of spacing checkpoints out.
4. wal_compression: shrinking the full-page flood
Even perfectly spaced checkpoints still emit full-page images, one per page per interval, and right after every checkpoint those images dominate WAL. wal_compression attacks that volume directly. It compresses the full-page images before they go into the WAL, leaving the small delta records alone. Default is off. On modern PostgreSQL you set it to a compression method: lz4 (available from PostgreSQL 15) or zstd, rather than the legacy pglz you get from on.
# postgresql.conf
wal_compression = lz4 # compress full-page images; lz4 is fast, zstd compresses harder
The trade is CPU for I/O. Compressing every full-page image costs a little CPU on the write path; in return, fewer bytes hit pg_wal, fewer bytes fsync, and on a replicated system fewer bytes cross the network to standbys. lz4 is the usual pick: its compression is fast enough that the CPU cost is negligible on most servers, and full-page images compress well because heap pages carry free space and repetitive structure. zstd squeezes harder for a bit more CPU, worth it when WAL network bandwidth or archive storage is your bottleneck rather than CPU. On a write-heavy, full-page-dominated workload, wal_compression = lz4 is one of the clearest wins available: measure your WAL generation rate before and after with the two-point LSN diff from rung 10 to confirm it on your data.
The one place it does not help is a workload whose WAL is mostly small delta records, not full-page images, for example many tiny single-row updates spread thin across a huge table. There the images are a small fraction of WAL, so compressing them moves little. That is why you measure rather than assume.
5. wal_level: paying only for the replication you use
wal_level decides how much information each WAL record carries, and that directly sets a floor on WAL volume. There are three levels, and you pay for the most capable one whether or not you use it.
wal_level | What it supports | WAL volume | When to use |
|---|---|---|---|
minimal | Crash recovery only, no replication; some bulk operations skip WAL entirely | Lowest | A single primary with no streaming replica and no archiving |
replica (default) | Crash recovery plus streaming replication and archiving for point-in-time recovery | Moderate | The default; any setup with replicas or WAL archiving |
logical | Everything in replica plus logical decoding for logical replication and CDC | Highest | Logical replication, change-data-capture, or upgrade with logical replication |
The default is replica, and it is the right default because most production databases have a standby or archive WAL for recovery. The interesting cases are the edges. minimal writes the least WAL: with no replica to feed, certain bulk operations such as COPY into a table created or truncated in the same transaction, or CREATE TABLE AS, can skip WAL for the table contents entirely, because a crash would just roll the whole table back anyway. That makes minimal meaningfully faster for bulk loads, but it forecloses replication and archiving, so it suits a throwaway analytics box you rebuild from source, not a primary you protect. Changing wal_level requires a restart, and dropping to minimal also needs max_wal_senders = 0.
logical sits at the other end: it stamps extra identity information into WAL records so a downstream consumer can reconstruct logical row changes, which raises WAL volume above replica. Turn it on only when you actually run logical replication or a CDC pipeline, because you pay the volume whether a subscriber exists or not. The mistake to avoid is leaving wal_level = logical set “just in case” on a system that does no logical decoding: it is pure WAL overhead for an unused capability.
6. wal_buffers and the WAL writer
Before a WAL record is fsync’d it lives briefly in wal_buffers, a small ring of shared memory where backends append records cheaply. The default is -1, meaning auto-sized to about 1/32 of shared_buffers, capped near 16 MB. For most systems that automatic value is fine and you should leave it. The setting requires a restart to change.
When does pinning it higher help? On a high-concurrency write workload, many backends generate WAL at once, and if the buffer is too small it fills and backends must wait for a flush to free space before they can append, serializing what should be parallel. If you see WAL-buffer waits under heavy concurrent writes, setting wal_buffers explicitly to 64 MB or 128 MB gives more headroom for bursts of concurrent WAL generation. It is a modest knob, not a headline one: it removes a contention point under high concurrency, it does not change steady-state throughput on a lightly loaded system.
# postgresql.conf
wal_buffers = '64MB' # explicit; helps only under high write concurrency (restart required)
wal_writer_delay = '200ms' # how often the background WAL writer flushes filled buffers
The WAL writer is a background process that flushes filled WAL buffers to disk between commits, so backends find space waiting rather than flushing inline. wal_writer_delay (default 200 ms) is how often it wakes to do that. With synchronous_commit = off, this delay also bounds how stale the unflushed WAL can get, so it interacts with your durability window. You rarely need to touch it; the default cadence suits most systems. Lowering it makes the writer more eager at the cost of more frequent wakeups, which only matters at extreme write rates.
7. synchronous_commit and group commit
synchronous_commit is the knob with the biggest throughput swing and the sharpest edge, which is exactly why it is rung 7 and not rung 1. It moves the bar for what “durable” means before a commit returns. Per-transaction, so you can keep the strict default globally and relax it only where the data allows.
| Level | Commit returns after | Risk on crash |
|---|---|---|
remote_apply | WAL durable locally and replayed on a sync standby | Strongest, highest latency; a read on the standby sees the commit |
on (default) | WAL fsync’d locally (and on a sync standby if configured) | None on a single node: a returned commit survives a crash |
remote_write | WAL written, not yet fsync’d, on a sync standby | A standby OS crash could lose it; local node still safe |
local | WAL fsync’d locally, ignoring any standby | Standby may lag; local durability intact |
off | WAL handed to the OS, not yet fsync’d; commit returns at once | The last fraction of a second of commits can be lost on a crash. Never corrupts. |
The throughput lever is off. With it, a commit returns the moment its WAL record is in the OS, before the fsync completes, so commits stop waiting on the disk flush. On a flood of small single-row writes, where commit latency is dominated by fsync, this can multiply throughput, because the WAL writer batches the actual flushes in the background. The risk is precise and bounded: a crash loses the last small window of transactions that returned success but had not yet been flushed, on the order of wal_writer_delay. It never corrupts the database, because the write-ahead rule for data pages still holds; you lose recent commits, you never get a torn state. That is the whole distinction. Losing the tail is fine for high-volume event ingestion you can replay; it is unacceptable for financial or any data where a returned commit must never vanish.
-- Strict by default; relax for one bulk job you can safely re-run.
begin;
set local synchronous_commit = off;
-- ... high-volume inserts that are cheap to replay ...
commit;
When you must keep durability strict but still want throughput on a commit storm, reach for group commit instead of off. Under concurrency, PostgreSQL already batches WAL flushes so one fsync makes many commits durable. commit_delay (microseconds, default 0) tells a committing transaction to wait a tiny interval before flushing, so more transactions join the same fsync batch. commit_siblings (default 5) is the gate: the delay only kicks in when at least that many other transactions are already active, so a quiet system never pays it. The pairing harvests group commit on high-concurrency commit storms while keeping full synchronous_commit = on durability.
# postgresql.conf (durability stays strict; harvest group commit under load)
commit_delay = 100 # microseconds to wait so more commits share one fsync
commit_siblings = 5 # only delay when at least 5 transactions are already active
8. Background writer: cleaning ahead of the checkpoint
The checkpointer is not the only process that writes dirty pages to disk. The background writer continuously trickles dirty buffers out so that backends needing a free buffer find a clean one, and so that fewer dirty pages remain for the checkpoint to flush in one go. Tuning it smooths I/O and shrinks the checkpoint’s flush spike, complementing checkpoint_completion_target.
Three knobs shape it. bgwriter_delay (default 200 ms) is how often the writer wakes. bgwriter_lru_maxpages (default 100) caps how many buffers it may write per round; raise it to let the writer clean more aggressively on a write-heavy system. bgwriter_lru_multiplier (default 2.0) scales how far ahead of demand it cleans: a higher value tells it to write more buffers in anticipation of upcoming allocations, keeping a deeper pool of clean buffers ready.
# postgresql.conf (clean more aggressively ahead of demand on a write-heavy server)
bgwriter_delay = '100ms' # wake more often
bgwriter_lru_maxpages = 1000 # allow more buffers cleaned per round (default 100)
bgwriter_lru_multiplier = 4.0 # clean further ahead of demand (default 2.0)
The trade is the same shape as everywhere on this page: cleaning more pages ahead of time smooths the I/O profile and reduces checkpoint spikes, but it costs some extra writes, because a page cleaned early may be dirtied again before the checkpoint and written twice. On a heavy write workload that extra write is usually worth the smoother profile; on a read-mostly system the background writer has little to do and the defaults are fine. The honest framing: the background writer does not reduce total work, it spreads it out, which is exactly what you want when the alternative is a checkpoint cliff.
9. full_page_writes and the atomic-write exception
full_page_writes (default on) is what makes full-page images happen at all, and the strong default advice is simple: leave it on. It is the protection against torn pages, the half-old, half-new 8 KB pages a crash mid-write can leave behind. With it off, a crash during a page write can silently corrupt a page that no later WAL record can repair, because every delta record assumes the rest of the page is intact.
There is exactly one situation where turning it off is defensible: your storage layer guarantees atomic 8 KB writes, so a page can never be torn. The mechanism is worth seeing whole, because it is the same one that makes full-page images costly. A PostgreSQL page is 8 KB, but storage often writes in 4 KB sectors, so a crash mid-write can leave half the new page and half the old, a torn page that a WAL delta cannot repair because the delta assumes the unchanged bytes are intact. The full-page image fixes it by stamping a known-good 8 KB page down first during recovery, erasing the tear before the deltas replay. That protection is exactly why the first write to each page after every checkpoint emits the whole 8 KB image, which is the single reason checkpoint spacing controls WAL volume and why rung 2 matters so much.
So the only way to turn it off safely is to remove the torn-page possibility at the storage layer. A copy-on-write filesystem such as ZFS, which never overwrites a block in place, gives that guarantee, as do some enterprise arrays with battery-backed atomic writes. On such storage, full-page images defend against a failure mode that cannot occur, so disabling them removes a large slice of WAL volume for free. The decision is binary and unforgiving: if you are certain your storage is atomic at 8 KB, turning it off is a real WAL win; if you are even slightly unsure, leave it on, because the downside is silent, unrecoverable corruption with no error, not a graceful failure.
10. Measuring before you turn a knob
Every change above is a guess until you measure, and PostgreSQL gives you exactly the counters to drive them. Two measurements anchor all WAL tuning: how checkpoints are being triggered, and how fast WAL is being generated.
Checkpoint statistics live in pg_stat_checkpointer on PostgreSQL 17 and later, and in pg_stat_bgwriter on 16 and earlier. The key comparison is time-triggered versus volume-triggered checkpoints.
-- PostgreSQL 17+: are checkpoints paced by time or forced by volume?
select num_timed, num_requested, write_time, buffers_written
from pg_stat_checkpointer;
num_timed counts checkpoints fired by checkpoint_timeout; num_requested counts those forced by max_wal_size or a manual CHECKPOINT. If num_requested dominates num_timed, volume is driving your checkpoints and your max_wal_size is too small for your write rate: this is the storm from rung 2, and the fix is to raise max_wal_size and checkpoint_timeout together. write_time is the wall time checkpoints spent writing, your flush-spike budget; buffers_written is how many buffers the checkpointer flushed. On 16 and earlier, read checkpoints_timed, checkpoints_req, and buffers_checkpoint from pg_stat_bgwriter for the same picture.
To measure WAL generation rate directly, sample the LSN at two moments and diff them. pg_wal_lsn_diff returns the byte distance between two LSNs.
-- Note the value, let the workload run a representative interval, sample again, then diff.
select pg_current_wal_lsn(); -- e.g. 3A/1C5F08
-- ... wait, say, 60 seconds under normal load ...
select pg_wal_lsn_diff('3A/9F0000', '3A/1C5F08') as bytes_in_interval;
Divide the byte count by your interval and you have bytes per second. That single number sizes max_wal_size: multiply it by your target seconds between checkpoints and you have the floor your WAL budget must clear for the timeout, not volume, to pace checkpoints. It is also your before-and-after meter for wal_compression: run the diff with compression off, set lz4, run it again, and the ratio is your WAL savings on your real data. PostgreSQL also watches checkpoint frequency for you and warns in the server log when they crowd together:
LOG: checkpoints are occurring too frequently (8 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
checkpoint_warning (default 30 seconds) sets the threshold for that line: if two checkpoints land closer than this, you get warned. It is a free smoke alarm for an undersized max_wal_size. Treat the warning as a prompt to run the two measurements above, not as noise to silence.
11. The four failure modes
End with the four ways WAL tuning goes wrong in production, because recognizing the symptom is half the fix. Each maps to a knob you now understand.
The checkpoint storm is the default max_wal_size = 1GB left on a write-heavy server. Volume forces a checkpoint every few seconds, every hot page re-images on each one, and WAL amplifies 2x to 3x over the logical change. You see it as num_requested dwarfing num_timed, the “checkpoints are occurring too frequently” warning, and a WAL generation rate that fills 1 GB well inside checkpoint_timeout. The fix is rung 2: raise max_wal_size and checkpoint_timeout together, sized from your measured rate.
The lost-commit trap is synchronous_commit = off on data you cannot lose. It is a genuine throughput win and it never corrupts the database, but a crash can erase the last sub-second of committed transactions. On a financial ledger or anywhere a returned commit must be permanent, that is a correctness and audit failure. The fix is to keep the strict default on the critical path and relax synchronous_commit only per-transaction where loss is recoverable, or use commit_delay to get group-commit throughput with full durability.
The recovery-time blowout is max_wal_size raised so far that crash recovery replays tens of GB and the server is down for many minutes. This is the recovery account of the Mental Model overdrawn. The fix is to size max_wal_size against a recovery-time budget you can actually tolerate, not just for steady-state throughput, and to test your real recovery time after a change rather than assuming it.
The pg_wal disk-fill is retention you forgot, not spacing. max_wal_size is a soft target only; it does not cap WAL that a consumer still needs. A failing archive_command keeps every unarchived segment, and a stale or disconnected replication slot pins every segment its standby has not consumed, both without bound until the disk fills and the database halts, because it can no longer write WAL to commit. The fix is to watch pg_stat_archiver and pg_replication_slots, and to bound slot retention with max_slot_wal_keep_size so a dead standby can never take the primary’s disk down with it.
flowchart TB
A[Symptom in production] --> B{Which counter?}
B -->|num_requested high| C[Checkpoint storm: raise max_wal_size and checkpoint_timeout]
B -->|pg_wal growing past max_wal_size| D[Retention: check archiver and slots, set max_slot_wal_keep_size]
A --> E{Durability choice}
E -->|lost commits after crash| F[synchronous_commit=off on critical data: keep on, relax per-transaction]
E -->|recovery takes minutes| G[max_wal_size too large for recovery budget: resize to tolerable replay]Mastery Questions
-
A write-heavy OLTP database runs the defaults:
max_wal_size = 1GB,checkpoint_timeout = 5min. Monitoring shows WAL generated at roughly 120 MB/s,num_requestedclimbing fast whilenum_timedbarely moves, and the “checkpoints are occurring too frequently” warning every few seconds. Walk through the full chain of cause and effect, and give the concrete settings you would apply, with the cost of each.Answer. At 120 MB/s, the 1 GB budget fills in about eight seconds, so the volume trigger forces a checkpoint roughly every eight seconds, far inside the 5-minute timeout. That is why
num_requesteddominates and the warning fires. The expensive consequence is full-page writes: every checkpoint re-arms them, so each hot page emits its whole 8 KB image every eight seconds instead of one image per timeout interval, inflating WAL volume 2x to 3x, which makes the 1 GB budget fill even faster, a feedback loop. The fix is to let time pace checkpoints. Raisemax_wal_sizeto something like 16 GB, which at 120 MB/s gives roughly 130 seconds of headroom, and raisecheckpoint_timeoutto 20 minutes so the timeout, not volume, becomes the trigger under normal load. Keepcheckpoint_completion_target = 0.9so the now-larger flush still spreads gently. Addwal_compression = lz4to shrink the full-page images that remain. The costs, stated plainly:pg_walwill sit larger on disk, and crash recovery now replays up to roughly a checkpoint interval of WAL, so worst-case recovery stretches from seconds toward a couple of minutes. On a busy OLTP system that trade is strongly worth it. Confirm the win by re-running the two-point LSN diff: the WAL generation rate should drop sharply once the amplification collapses. -
An engineer wants to halve commit latency on a high-concurrency payments service. They propose
synchronous_commit = offglobally. Is the database at risk of corruption? What is actually at risk, and how would you get most of the latency win without that risk?Answer. No,
synchronous_commit = offcannot corrupt the database. The write-ahead rule for data pages still holds, so recovery always produces a consistent state; you never get a torn or half-applied page. What is at risk is durability of recently committed transactions. Withoff, a commit returns as soon as its WAL record is in the OS, before the fsync, so a crash can lose the last small window of transactions, on the order ofwal_writer_delay, that returned success but were not yet flushed. For a payments service that is unacceptable: a commit that told the client “paid” must never vanish, for correctness and for audit. So you do not set itoffon the payment path. Instead you harvest group commit with full durability: keepsynchronous_commit = on, and setcommit_delayto a small value such as 100 microseconds withcommit_siblings = 5. Under high concurrency a committing transaction waits that tiny interval so more transactions join the same fsync, and one flush makes many commits durable. You get most of the throughput benefit, because the cost being amortized is the fsync itself, with zero added data-loss risk. If there is a genuinely loss-tolerant path, such as writing non-critical analytics events, you may relax that path alone withset local synchronous_commit = off, leaving the money path strict. -
After raising
max_wal_sizeto 32 GB to kill a checkpoint storm, the team is happy with write throughput. Months later, the server crashes during a load spike and takes eleven minutes to accept connections again. Separately,pg_walhas on two occasions grown to 60 GB, well past 32 GB. Explain both, and what you would change.Answer. These are two different failure modes, both predictable from how the knobs work. The eleven-minute recovery is the recovery-time account overdrawn.
max_wal_sizeis the soft cap on WAL between checkpoints, and crash recovery must replay everything from the last REDO point, so a 32 GB budget means a crash mid-interval can replay tens of GB of WAL, which at the server’s replay speed took eleven minutes. That is the explicit cost of spacing checkpoints far apart: you bought steady-state throughput with worst-case downtime. The fix is to sizemax_wal_sizeagainst a recovery-time budget you can tolerate, not throughput alone. If eleven minutes is too long, pull it back, perhaps to 16 GB, accept a few more checkpoints, and measure the real recovery time after the change rather than guessing. The 60 GBpg_walis unrelated tomax_wal_size, which is only a soft target for WAL between checkpoints; it does not cap WAL that a consumer still needs. Something is pinning segments: most likely a failingarchive_commandretaining every unarchived segment, or a replication slot for a standby that disconnected or fell behind, holding every segment it has not consumed. You tell them apart withpg_stat_archiverfor the last archive failure andpg_replication_slotsfor an inactive slot with arestart_lsnfar behind current. The durable fix is to repair the archive or drop the dead slot, and to bound future risk withmax_slot_wal_keep_sizeso a stale standby can never fill the disk and halt the primary again.
Sources & evidence14 claims · 3 cited
Parameters, defaults, and behaviors are grounded in the WAL-configuration, WAL-internals, and high-availability sections of the PostgreSQL 16/17 docs (src_pg_docs_runtime_wal, src_pg_docs_wal, src_pg_docs_high_availability). Version-specific availability facts those sections do not state directly (lz4 from PG 15, pg_stat_checkpointer landing in PG 17, the auto-size cap of wal_buffers near 16 MB) and common-knowledge engineering facts (the 8 KB page, ZFS copy-on-write atomicity) are marked stable-common-knowledge with empty source_ids.
- max_wal_size (default 1 GB) is a soft cap on WAL accumulated between checkpoints that forces a checkpoint when approached, so raising it (busy servers commonly run 8 GB to 32 GB) spaces checkpoints apart and cuts full-page-write churn enough to drop WAL generation roughly 2x to 3x, at the cost of longer crash recovery (replay covers all WAL since the last REDO point) and more pg_wal disk.verified
- min_wal_size (default 80 MB) is the floor of the recycled-segment pool: PostgreSQL keeps pre-allocated renamed segments ready up to this size to avoid filesystem churn during write bursts.verified
- Checkpoints fire on whichever comes first of checkpoint_timeout (default 5 minutes) or max_wal_size, so a write-heavy server raises both together (timeout commonly 15 to 30 minutes) to make time, not volume, the trigger; checkpoint_completion_target (default 0.9) then spreads each checkpoint's writes over about 90 percent of the interval to smooth the I/O, and lowering it sharpens the spike rather than reducing checkpoint frequency.verified
- wal_compression (default off) compresses only the full-page images in WAL, trading CPU for fewer bytes; modern PostgreSQL accepts lz4 (from PostgreSQL 15) or zstd rather than the legacy pglz selected by on, with lz4 the usual pick for negligible CPU and zstd when WAL network or archive bandwidth is the bottleneck.verified
- wal_level has three levels: minimal (crash recovery only, lowest WAL, lets bulk operations like COPY into a same-transaction-created table or CREATE TABLE AS skip writing table contents), replica (the default, adds streaming replication and archiving), and logical (adds logical decoding at the highest WAL volume); changing it requires a restart and minimal also needs max_wal_senders = 0.verified
- wal_buffers defaults to -1, auto-sized to about 1/32 of shared_buffers capped near 16 MB; pinning it higher (64 MB to 128 MB, restart required) helps only under high write concurrency where a too-small buffer fills and serializes backends, while the WAL writer flushes filled buffers on a wal_writer_delay (default 200 ms) cadence that also bounds the unflushed window under synchronous_commit = off.verified
- synchronous_commit is per-transaction and ranges from remote_apply and on (default, durable on a single node) through remote_write and local to off, where a commit returns before the fsync so a crash can lose the last fraction of a second of committed transactions but never corrupts the database.verified
- commit_delay (microseconds, default 0) makes a committing transaction wait briefly so more transactions join one fsync (group commit), and commit_siblings (default 5) gates that delay so it only applies when at least that many transactions are already active, raising throughput on commit storms without relaxing durability.verified
- The background writer trickles dirty buffers to disk ahead of demand, controlled by bgwriter_delay (default 200 ms), bgwriter_lru_maxpages (default 100, the cap on buffers written per round) and bgwriter_lru_multiplier (default 2.0, how far ahead of demand it cleans); raising them smooths I/O and shrinks the checkpoint spike at the cost of some pages written twice.verified
- full_page_writes (default on) protects against torn pages by writing each page's full 8 KB image to WAL on its first modification after a checkpoint; turning it off is only safe when storage guarantees atomic 8 KB writes (such as a copy-on-write filesystem like ZFS), otherwise a crash can silently and unrecoverably corrupt a page.verified
- Checkpoint statistics live in pg_stat_checkpointer on PostgreSQL 17 and later (num_timed for time-triggered, num_requested for volume- or manually-triggered checkpoints, plus write_time and buffers_written) and in pg_stat_bgwriter on 16 and earlier; num_requested dominating num_timed signals an undersized max_wal_size.stable common knowledge
- WAL generation rate is measured by sampling pg_current_wal_lsn() at two moments and taking pg_wal_lsn_diff of the two LSNs, which gives bytes per interval to size max_wal_size and to verify wal_compression savings; checkpoint_warning (default 30 s) logs 'checkpoints are occurring too frequently' when two checkpoints land closer than that.verified
- A default max_wal_size = 1GB on a write-heavy server causes a checkpoint storm: volume forces frequent checkpoints that re-arm full-page writes, amplifying WAL roughly 2x to 3x in a feedback loop; the fix is to raise max_wal_size and checkpoint_timeout together from the measured rate.verified
- max_wal_size is a soft target only and does not cap WAL that a consumer still needs, so a failing archive_command or a stale replication slot pins segments and fills pg_wal until the disk is full and the database halts; max_slot_wal_keep_size bounds the slot-retention risk.verified
Cited sources
- PostgreSQL Documentation: Write Ahead Log Configuration · PostgreSQL Global Development Group
- PostgreSQL Documentation: Write-Ahead Logging (WAL) and Reliability · PostgreSQL Global Development Group
- PostgreSQL Documentation: High Availability, Load Balancing, and Replication · PostgreSQL Global Development Group