Streaming and Synchronous Replication
How PostgreSQL physical replication actually works: walsender on the primary streams WAL over libpq to a walreceiver on the standby, the startup process replays it, and the same WAL underpins hot standby reads, replication slots, and synchronous commits. Covers the four LSN positions, replica conflicts and hot_standby_feedback, quorum syntax in synchronous_standby_names, real cross-AZ and cross-region latency numbers, and the failure modes (stuck slot, canceled report, stalled commits, cross-region durability) that the design forces operators to face.
Learning outcomes
Replication is where the WAL stream stops being a recovery log and becomes a live feed. Everything on this page is one idea applied: a standby replays the primary’s WAL, fast enough to be a hot standby, durably enough to be a synchronous one. Get the mechanism right and durability, read scaling, and failover trade-offs stop being mysterious choices and start being knobs on a single dial.
After studying this page, you can:
- Explain how a standby connects to a primary, what walsender, walreceiver, and the startup process each do, and why both servers must run the same major version.
- Configure a streaming standby from scratch, including wal_level, max_wal_senders, pg_basebackup, primary_conninfo, and standby.signal.
- Decide when to use a physical replication slot and how to bound its risk with max_slot_wal_keep_size.
- Choose between asynchronous, remote_write, on, and remote_apply for synchronous_commit, including the millisecond cost across an availability zone or a region.
- Read pg_stat_replication’s four LSN positions, diagnose replica conflicts, and recover from the classic failure modes (stuck slot, canceled report, stalled commits).
Before we dive in
You should already know what the write-ahead log is and what a checkpoint does. The wal-checkpoints-and-durability page covers that foundation, and this page builds directly on it: every byte a standby applies is a WAL record produced for crash recovery on the primary, sent over the network instead of reread from disk. You also want a working picture of MVCC and vacuum, because two replication trade-offs (replica conflicts and hot_standby_feedback) trace back to the xmin horizon the vacuum-freezing-and-wraparound page describes.
A few words you will need. A standby is a server replaying another server’s WAL; the source is the primary. A byte-exact copy means the standby’s data files are bit-identical to the primary’s after every record applies, because they are produced by replaying the same low-level page operations. Physical replication operates at the page and WAL-record level. Logical replication, by contrast, ships decoded row changes; we touch it only briefly here to mark the boundary.
LSN is the log sequence number: a 64-bit byte offset into the WAL stream, shown as 0/1A2B3C4D. Every WAL record sits at a specific LSN, and every replication position on this page is one of these numbers.
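To make the LSN notation concrete, here is a minimal Python sketch, assuming only the textual format described above: two hexadecimal halves separated by a slash, the high 32 bits first. The function names are illustrative and not part of PostgreSQL; on a live server, pg_wal_lsn_diff() performs the same subtraction.

```python
def parse_lsn(text: str) -> int:
    """Convert 'hi/lo' hex text such as '0/1A2B3C4D' into a 64-bit byte offset."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value: int) -> str:
    """Render a 64-bit offset back into the 'hi/lo' hex notation."""
    return f"{value >> 32:X}/{value & 0xFFFFFFFF:X}"


def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between two positions (positive when a is ahead of b)."""
    return parse_lsn(a) - parse_lsn(b)
```

Because an LSN is just a byte offset, every lag figure later on this page is a subtraction of two such numbers.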
Mental Model
The wrong model, and an easy one to fall into, is that a standby periodically copies files from the primary, like a fancy backup. Under that model “lag” is just how long since the last copy, and “synchronous” means waiting for that copy to complete.
That picture is wrong on every count. A standby holds an open TCP connection to the primary and reads a continuous, ordered byte stream of WAL records as they are generated. It writes them to its own pg_wal/ directory, fsyncs them, and a separate process replays them into the page cache. There is no “copy interval.” The pipeline runs at the speed of the network and the standby’s disk, and lag is measured in bytes still in flight, not in time since the last snapshot.
Hold this picture. A standby is not a backup that catches up periodically. It is a replay engine listening to a live feed, and every replication feature on this page (slots, sync, conflict handling) is a knob on that feed.
Breaking it down
1. Why a second copy needs a feed, not a snapshot
A single PostgreSQL server is one disk away from a bad day. The host fails, the volume corrupts, the AZ disappears, and the database is offline until you restore from backup. Backups give you durability at hours of granularity. You want durability at milliseconds.
The cheap answer is to copy the data directory to a second machine every few minutes. It is also wrong. The window between copies is a window of data loss, and the copy itself competes with the production workload for I/O. More importantly, restoring an inconsistent snapshot (files copied while writes were in flight) gives you a corrupt database, not a usable one.
PostgreSQL’s answer rides on the WAL the primary already writes for crash recovery. Every change is a WAL record. If the standby has every WAL record the primary has emitted, replaying them produces a byte-exact copy of the primary’s data files. The feed is the WAL stream itself, ordered, durable, and already free.
This is why replication and durability are the same machinery viewed from two ends. The primary fsyncs WAL so it can recover after a crash; the standby reads that same WAL so it can be the recovery. Two consequences fall out immediately. The standby must run the same major version as the primary, because WAL records are page-level and the page format changes between majors. And it cannot run on a different CPU architecture in any way that changes the byte layout of a page. Physical replication is a low-level pact, not a high-level protocol.
2. The streaming mechanism: walsender, walreceiver, startup
Three processes carry the feed. On the primary, a walsender process is started for each connected standby. On the standby, a walreceiver holds the TCP socket and writes incoming WAL to disk. A separate startup process reads that WAL and applies it to the data files. Each process has one job, and every WAL record passes through them in order.
The wire is a libpq connection in replication mode: the standby connects with replication=true on the connection string (or replication=database for logical replication), authenticates as a role with the REPLICATION attribute, and the primary spawns a walsender to serve it. From the network’s point of view this is just another PostgreSQL connection on the usual port, which is why pg_hba.conf needs a replication entry to allow it.
sequenceDiagram
participant App as Client
participant P as Primary
participant WS as walsender (on primary)
participant WR as walreceiver (on standby)
participant SU as startup (on standby)
App->>P: COMMIT
P->>P: write WAL, fsync
P->>WS: WAL record at LSN N
WS->>WR: send over libpq
WR->>WR: write to pg_wal, fsync (flush_lsn = N)
WR->>SU: hand off record
SU->>SU: REDO on page (replay_lsn = N)
Read the diagram as four checkpoints. The record exists at the primary when it is durable in pg_wal/. It is sent when the walsender hands it to the socket. It is flushed when the walreceiver has fsynced it on the standby. It is applied when the startup process has replayed it into the pages. pg_stat_replication reports the last three as sent_lsn, flush_lsn, and replay_lsn, plus a write_lsn that sits between sent and flushed; we come back to all four in rung 7.
Two boundaries to remember. The streaming protocol is the modern default; the fallback is file-based log shipping, where the primary’s archive_command copies completed WAL segments to shared storage and the standby’s restore_command fetches them. Modern setups use streaming for live catch-up and archival to long-term storage (pgBackRest or wal-g) for point-in-time recovery. The archive is what you reach for when the standby has been offline so long the streaming gap cannot be closed.
3. Setting up a standby end to end
The mechanism is half the story. The other half is the handful of settings that turn it on and the steps that bootstrap the standby’s data directory.
On the primary, two settings make replication possible. wal_level = replica (the default since PG 10) makes the WAL self-contained enough to replay on a standby; wal_level = logical adds extra information for logical decoding. max_wal_senders caps how many concurrent walsenders can run (default 10), and you need one per standby plus headroom for pg_basebackup runs. The standby authenticates as a role with the REPLICATION attribute, and pg_hba.conf needs a matching entry whose database field is replication (a keyword, not a real database).
# postgresql.conf on the primary
wal_level = replica
max_wal_senders = 10
max_replication_slots = 10
# pg_hba.conf on the primary
host replication repluser 10.0.0.0/16 scram-sha-256
You seed the standby with pg_basebackup. It is a streaming-protocol client too: it opens a replication connection and copies every file under the data directory plus the WAL needed to make the copy self-consistent.
pg_basebackup \
-h primary.internal -U repluser \
-D /var/lib/postgresql/17/main \
-X stream -R -C -S standby1 -P
The flags earn their keep. -X stream opens a second connection that streams WAL in parallel with the base backup, so the copy includes every record needed to make it consistent. -R writes primary_conninfo and creates a standby.signal file (more on that in a moment). -C -S standby1 creates a physical replication slot named standby1 and pins the backup to it. -P prints progress while the copy runs.
The standby.signal file is the PG 12 way of telling a server “you are a standby.” Before PG 12 you put standby_mode = on and primary_conninfo in a separate recovery.conf file; in PG 12 and later recovery.conf is gone, those parameters live in postgresql.conf (or postgresql.auto.conf written by ALTER SYSTEM), and the presence of an empty standby.signal file flips the server into recovery mode at startup. A recovery.signal file (without standby.signal) means “run point-in-time recovery, then promote.”
# postgresql.auto.conf on the standby (written by pg_basebackup -R)
primary_conninfo = 'host=primary.internal port=5432 user=repluser application_name=standby1'
primary_slot_name = 'standby1'
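The signal-file rule can be sketched as a small decision function. This is an illustration in Python, not PostgreSQL source: the function name and the returned labels are made up here, and the one detail added beyond the text above is that standby.signal takes precedence when both files are present.

```python
def startup_mode(data_dir_files: set) -> str:
    """Classify how a PG 12+ server starts, given the file names in its data directory."""
    if "standby.signal" in data_dir_files:
        # Standby mode wins even if recovery.signal is also present.
        return "standby"
    if "recovery.signal" in data_dir_files:
        # Targeted (point-in-time) recovery, then promotion.
        return "targeted recovery"
    # No signal file: ordinary startup (crash recovery only if needed).
    return "normal"
```

In practice you might call it as startup_mode(set(os.listdir(data_dir))) when auditing a fleet for servers that were left in the wrong mode.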
Start the server. The walreceiver connects, the startup process catches up, and you have a streaming standby. To let it serve read-only queries you set hot_standby = on (the default since PG 10) before startup. Cascading replication is a small variation: a standby can itself accept connections from a third server and stream WAL onward, which is how you fan out a read fleet without overloading the primary’s network card.
4. Replication slots: a promise not to discard WAL
A streaming standby has one job, keeping up. If it falls behind, the primary may recycle and overwrite the WAL segments the standby still needs, and the standby fails with requested WAL segment has already been removed. Without help, the only recovery is to rebuild the standby with another pg_basebackup.
A physical replication slot is the primary’s promise not to recycle WAL until a named standby has consumed it. The slot records a restart_lsn, the oldest WAL position the standby may still need, and the WAL cleanup logic refuses to remove any segment that contains WAL at or after that point. (confirmed_flush_lsn is the corresponding column for logical slots and stays NULL for physical ones.) The standby links itself to the slot through primary_slot_name, and as it acknowledges progress the slot advances.
-- On the primary, create a slot before the standby connects.
select pg_create_physical_replication_slot('standby1');
-- Inspect what each slot is holding.
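-- restart_lsn is the position a physical slot holds; wal_status (PG 13+) reports
-- whether that WAL is still reserved. confirmed_flush_lsn is only set for logical slots.
select slot_name, active, restart_lsn, wal_status,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as retained
from pg_replication_slots;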
The promise cuts both ways. With a slot, the standby cannot fall irrecoverably behind, because the WAL it needs is kept. Without one, you are racing the WAL recycler. But if the standby disappears (host fire, decommission without dropping the slot) the slot keeps growing pg_wal/ on the primary forever, and the primary’s data volume runs out of space. The author has watched this take a production server down at 03:00.
max_slot_wal_keep_size (PG 13 and later) bounds the damage. It sets a hard cap on how much WAL a slot may retain; once exceeded, the slot is invalidated and the primary is free to recycle the WAL, at the cost of breaking the standby that needs it.
# postgresql.conf on the primary
max_slot_wal_keep_size = '64GB' # never let a slot retain more than 64 GB of WAL
The right knob depends on your blast radius. Set it large enough to ride out a planned standby maintenance window, small enough that one forgotten slot cannot fill the volume. And alert on pg_replication_slots.active = false plus pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) climbing.
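The alerting rule above can be written down as a small classifier. This is a sketch under stated assumptions: the function name, the three labels, and the warn_bytes threshold are hypothetical, the inputs are values you would read from pg_replication_slots and the server configuration, and a cap of -1 stands for "no limit", which is how max_slot_wal_keep_size expresses it. Real invalidation is decided by the server at checkpoint time, so treat "critical" as "at or past the cap", not as proof the slot is already gone.

```python
def slot_alert(active: bool, retained_bytes: int, cap_bytes: int, warn_bytes: int) -> str:
    """Classify one replication slot as 'ok', 'warn', or 'critical'."""
    if cap_bytes >= 0 and retained_bytes >= cap_bytes:
        return "critical"   # at or past max_slot_wal_keep_size
    if retained_bytes >= warn_bytes:
        return "warn"       # retention is climbing toward trouble
    if not active and retained_bytes > 0:
        return "warn"       # nobody is consuming, yet WAL is being held
    return "ok"
```

Choosing warn_bytes well below the cap leaves time to repair or drop the slot before the primary invalidates it.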
5. Hot standby and replica query conflicts
hot_standby = on lets the standby serve read-only queries while it replays WAL. This is what makes a read replica useful for reporting traffic. But two jobs now compete on the same server: applying the next WAL record and answering the running query, and they sometimes need different versions of the same page.
The classic conflict is vacuum on the primary versus a long read on the standby. The primary runs VACUUM and writes a WAL record that prunes a dead tuple from page 42. The standby has a SELECT in progress whose snapshot still wants to see that tuple. Replaying the prune record would erase data the live query is reading. PostgreSQL must choose: pause replay (lag grows) or cancel the query.
max_standby_streaming_delay (default 30 seconds) is the choice’s parameter. It is the longest replay will wait for a conflicting query to finish before canceling it with ERROR: canceling statement due to conflict with recovery. max_standby_archive_delay is the same idea for WAL coming from the archive. Set the values low and queries get canceled aggressively; set them high (or -1 for “wait forever”) and replication lag grows during heavy report runs.
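One subtlety is easier to see in code. According to the PostgreSQL documentation, the delay is counted from the moment the conflicting WAL was received on the standby, not from when the query started, so a query that runs into a conflict while replay is already behind can be canceled sooner than the configured value suggests. A minimal sketch of that decision, with a hypothetical function name and labels:

```python
def conflict_action(seconds_since_wal_received: float, max_delay_s: float) -> str:
    """Decide what replay does when a standby query blocks the next WAL record."""
    if max_delay_s < 0:
        return "pause_replay"    # -1 means wait forever; lag grows instead
    if seconds_since_wal_received >= max_delay_s:
        return "cancel_query"    # budget for applying this WAL is used up
    return "pause_replay"        # still inside the budget, so replay waits
```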
The release valve is hot_standby_feedback = on. The standby periodically tells the primary the oldest transaction id its running queries depend on, and the primary’s vacuum respects that horizon: it will not prune tuples a standby query still needs. The cost is that vacuum on the primary is held back by reads on the standby, which is the long-transaction trap from vacuum-freezing-and-wraparound, now operating across a network. Bloat on the primary becomes a function of how long your standby reports run.
# postgresql.conf on the standby
hot_standby = on
hot_standby_feedback = on
max_standby_streaming_delay = '30s'
The trade-off has no universal answer. A standby that runs minute-long queries with hot_standby_feedback = on is usually fine. A standby that runs four-hour analytical queries with feedback on will quietly bloat the primary; you may prefer to leave feedback off and accept that long reports occasionally die. Decide per workload, and watch pg_stat_database_conflicts to see which kind of conflict actually fires.
6. Synchronous replication and the four durability levels
So far every commit on the primary returns the moment the primary’s WAL is durable on the primary’s disk. The standby is asynchronous: it catches up at its own pace, and if the primary dies before the standby has the latest records, those commits are lost. For most workloads this is fine; for “never lose a confirmed commit” it is not.
Synchronous replication makes a commit wait until a named standby has acknowledged the record before returning success to the client. Two settings together control it. synchronous_standby_names lists which standbys count, and synchronous_commit controls how far down the pipeline the wait extends.
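synchronous_commit has four levels above off worth knowing in detail.
- local: the commit waits only for the primary’s own WAL fsync; no standby is consulted.
- remote_write: the commit waits until the standby has written the record to its operating system, but not yet fsynced it.
- on (the default): the commit waits until the standby has fsynced the record.
- remote_apply: the commit waits until the standby has replayed the record, so reads on the standby see the just-committed write.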
The latency cost is real and worth knowing in millisecond terms. A standby in the same availability zone over a 10 GbE network typically adds 0.5 to 2 ms per synchronous commit. Cross-AZ within a region adds 1 to 3 ms. Cross-region (say us-east to us-west) adds 50 to 150 ms or more, dominated by the speed of light. remote_apply adds the standby’s replay time on top, which for a heavy DDL or a large transaction can be tens of milliseconds more.
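The arithmetic behind those numbers is worth doing once. A single session that commits one transaction after another cannot commit faster than one per round of waiting, so the synchronous wait sets a ceiling on serial commit rate. The sketch below assumes the standby wait simply adds to the local flush time; the function name and the 0.5 ms local flush figure in the example are illustrative assumptions, not measurements.

```python
def max_serial_commits_per_second(local_flush_ms: float, sync_wait_ms: float) -> float:
    """Upper bound on back-to-back commits from one session, in commits per second."""
    return 1000.0 / (local_flush_ms + sync_wait_ms)
```

With a 0.5 ms local flush, adding a 99.5 ms cross-region wait drops the ceiling for one session from 2000 commits per second to 10. Concurrent sessions wait in parallel, so aggregate throughput falls far less than per-session latency rises.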
Choose the level by what you are protecting. remote_write is enough when the only loss you accept is the one that needs the primary to fail and the standby’s operating system to crash at nearly the same moment, before the standby’s cached WAL reaches disk. on is the right default for “no committed data lost on primary failure.” remote_apply is for the small set of workloads where the replica must be read-consistent with the primary the instant a commit returns.
The four levels also map cleanly to the WAL flow in rung 2: flushed on the primary (local), written to the standby’s operating system (remote_write), flushed on the standby (on), and applied on the standby (remote_apply). Each step is one level of synchronous_commit. Picturing them this way also explains why each level is at least as slow as the one before it: it waits for everything upstream.
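The mapping from level to position can be sketched as a lookup table. This is a simplified Python model with hypothetical names: LSNs are plain integers, the primary’s own flush is assumed to have completed already, and the standby dictionary stands in for one row of pg_stat_replication.

```python
# Which standby position must reach the commit's LSN before COMMIT returns.
REQUIRED_POSITION = {
    "local": None,               # primary flush only, no standby wait
    "remote_write": "write_lsn",
    "on": "flush_lsn",
    "remote_apply": "replay_lsn",
}


def commit_can_return(level: str, commit_lsn: int, standby: dict) -> bool:
    """True when the standby has progressed far enough for the given level."""
    column = REQUIRED_POSITION[level]   # KeyError for unknown levels
    if column is None:
        return True
    return standby[column] >= commit_lsn
```

For a standby that has written past the commit record but not yet flushed it, remote_write is satisfied while on and remote_apply are still waiting.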
synchronous_standby_names controls the quorum across multiple standbys. The two syntaxes worth memorizing are:
# Wait for ANY two of s1, s2, s3 to acknowledge.
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
# Wait for the two highest-priority connected standbys: s1 and s2.
# s3 is a potential synchronous standby and steps in if s1 or s2 disconnects.
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
ANY k is a quorum: any k of the listed standbys can satisfy a commit, which gives the best tail latency because slow stragglers do not hold up the group. FIRST k is priority order: the k highest-priority connected standbys (the earliest in the list) must all acknowledge, and a later entry is promoted only when an earlier one disconnects. That gives each synchronous standby a fixed identity, which is useful when one standby is your designated failover target.
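The difference between the two methods shows up most clearly when one standby is slow or missing. The following Python sketch models the rule, assuming the setting has already been split into a method, a count k, and a priority-ordered list of names; the function and parameter names are illustrative, and the real server tracks acknowledgments per LSN, not as a simple set.

```python
def commit_acknowledged(method: str, k: int, names: list,
                        connected: set, acked: set) -> bool:
    """Model whether a waiting commit may return under ANY k or FIRST k."""
    # Listed standbys that are currently connected, in priority order.
    candidates = [n for n in names if n in connected]
    if method == "ANY":
        # Quorum: any k of the connected, listed standbys is enough.
        return sum(1 for n in candidates if n in acked) >= k
    if method == "FIRST":
        # Priority: the k highest-priority connected standbys must all ack.
        sync = candidates[:k]
        return len(sync) == k and all(n in acked for n in sync)
    raise ValueError(f"unknown method: {method}")
```

With s1 connected but slow, ANY 2 is satisfied by s2 and s3 alone, while FIRST 2 keeps waiting for s1; once s1 disconnects, FIRST 2 falls through to s2 and s3.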
The dangerous failure mode is universal. If every named standby is unreachable, every commit on the primary blocks indefinitely. The primary keeps accepting connections, but no transaction can commit. The standard mitigation is to keep at least one fast async fallback (do not include it in synchronous_standby_names) so reads continue, and to monitor for “synchronous standbys configured, zero connected” as a paging alert. Some operators run with synchronous_standby_names set to a quorum across three standbys so any one can be down without stalling commits.
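The "configured but not connected" alert can be prototyped in a few lines. This Python sketch parses the common forms of synchronous_standby_names and reports whether commits would stall for a given set of connected standbys. It is deliberately incomplete: quoted names and the * wildcard are not handled, and the function names are hypothetical. The two defaults it relies on come from the PostgreSQL documentation: a count without a keyword means FIRST, and a bare list means FIRST 1.

```python
import re

_SPEC = re.compile(r"^\s*(?:(ANY|FIRST)\s+)?(\d+)\s*\((.*)\)\s*$", re.IGNORECASE)


def parse_sync_names(setting: str):
    """Return (method, k, names); an empty setting yields ('NONE', 0, [])."""
    setting = setting.strip()
    if not setting:
        return ("NONE", 0, [])
    match = _SPEC.match(setting)
    if match:
        method = (match.group(1) or "FIRST").upper()
        k = int(match.group(2))
        body = match.group(3)
    else:
        method, k, body = "FIRST", 1, setting   # bare list: first one is sync
    names = [n.strip() for n in body.split(",") if n.strip()]
    return (method, k, names)


def commits_would_stall(setting: str, connected: set) -> bool:
    """True when fewer listed standbys are connected than the setting requires."""
    method, k, names = parse_sync_names(setting)
    if method == "NONE":
        return False
    return sum(1 for n in names if n in connected) < k
```

Feeding it the application_name values from pg_stat_replication gives the paging condition directly, before commit latency has a chance to show the problem.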
7. Monitoring lag and reading the four LSNs
pg_stat_replication is the primary’s view of every connected standby. Read it as the diagram from rung 2, expressed as numbers.
-- One row per connected standby. Run this on the primary.
select application_name, state, sync_state,
pg_current_wal_lsn() as primary_lsn,
sent_lsn, write_lsn, flush_lsn, replay_lsn,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as apply_lag_bytes,
write_lag, flush_lag, replay_lag
from pg_stat_replication;
The four LSNs match the four guarantees. sent_lsn is what walsender has handed to the socket. write_lsn is what the standby has written to its OS. flush_lsn is what the standby has fsynced. replay_lsn is what the standby’s startup process has applied. They always satisfy sent_lsn >= write_lsn >= flush_lsn >= replay_lsn, and the gap between pg_current_wal_lsn() and replay_lsn is your apply lag.
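Splitting the total lag into per-stage gaps tells you where the feed is backing up. The sketch below is illustrative Python with made-up names; it takes the positions as integers (byte offsets) and assumes the ordering stated above holds for the row being examined.

```python
def stage_gaps(current: int, sent: int, write: int, flush: int, replay: int) -> dict:
    """Break apply lag into the bytes waiting at each stage of the pipeline."""
    return {
        "not_yet_sent": current - sent,          # walsender or network is behind
        "sent_not_written": sent - write,        # in flight or in the receiver
        "written_not_flushed": write - flush,    # standby disk sync is behind
        "flushed_not_replayed": flush - replay,  # startup process is behind
    }


def bottleneck(gaps: dict) -> str:
    """Name the stage currently holding the most bytes."""
    return max(gaps, key=gaps.get)
```

A large flushed_not_replayed gap usually points at replay (often a conflict or a single-threaded apply bottleneck), while a large not_yet_sent gap points at the network or the walsender.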
The time-valued columns write_lag, flush_lag, and replay_lag translate those byte gaps into intervals, by recording how long it took the standby to confirm each level. They are the right columns to plot in a dashboard, because a steady replay_lag of 50 ms is informative in a way that an LSN delta of 12,345,678 bytes is not.
The standby has its own view too. pg_stat_wal_receiver shows the walreceiver’s state, the slot it uses, the primary’s host, and the LSN it has last written. pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn() are the standby-side equivalents of flush_lsn and replay_lsn on the primary: the first returns the last WAL position received and synced to disk, the second the last position replayed.
Two reasons a standby can be “lag-free but useless” for a given workload deserve their own check. First, the standby is read-only, full stop. It cannot run any DDL, cannot run INSERT/UPDATE/DELETE, and cannot advance sequences with nextval(). An application that allocates primary keys from a sequence must talk to the primary even if it is otherwise read-only, or nextval('seq') will fail with ERROR: cannot execute nextval() in a read-only transaction. Second, temporary tables do not work on a hot standby either: creating one has to update the system catalogs, which is a write, so CREATE TEMP TABLE fails like any other DDL. Reports that stage intermediate results in temporary tables must be rewritten or pointed at the primary.
8. Failure modes that bite in production
The mechanisms above lead to a small, recurring set of incidents. Here they are by symptom, with the root cause and the fix.
The pattern across all six is the same shape. The streaming feed is a contract: the primary keeps WAL, the standby applies it, and synchronous_standby_names tells the primary which standbys must vote on a commit. Every failure mode is one side of that contract breaking, and every fix restores it in the cheapest way that keeps both durability and availability honest.
Mastery Questions
- Your team enables synchronous_commit = on with synchronous_standby_names = 'standby1', and the on-call gets paged because every transaction on the primary has been blocked for two minutes. pg_stat_replication shows no rows. What happened, and what is the right operational response?

  Answer. The single named standby is down, so the synchronous quorum cannot form. Every COMMIT is parked waiting for an acknowledgment that will never arrive, which is exactly what synchronous_commit = on promises. The emergency unblock is to clear synchronous_standby_names on the primary (ALTER SYSTEM SET synchronous_standby_names = '' and SELECT pg_reload_conf()), which immediately drops the requirement and lets commits proceed asynchronously. The durable fix is to stop naming a single standby. Use a quorum, ANY 1 (s1, s2), so either of two standbys satisfies the commit, and make sure at least one async fallback exists outside the synchronous list so a replica keeps serving reads even if every sync standby is offline. Add a paging alert on “synchronous standbys required: N, connected: 0,” not just on commit latency, because the first symptom is silence on the durability side, not slowness.

- A read replica running daily analytical reports is showing slow but steady bloat on the primary, even on tables the reports never touch. Walk through the chain of cause and the two viable fixes.

  Answer. The replica has hot_standby_feedback = on, which makes the standby’s running snapshots count toward the primary’s xmin horizon. Vacuum on the primary cannot prune dead tuples that any running snapshot might still need, and the horizon is database-wide, so a long-running report on the replica holds back cleanup on every table on the primary, not just the ones the report reads. This is the same long-transaction trap that vacuum-freezing-and-wraparound describes, now operating across the network. Two fixes work. First, turn hot_standby_feedback off and raise max_standby_streaming_delay so the replica tolerates the conflicts vacuum will now cause; long reports may occasionally die with “canceling statement due to conflict with recovery,” but the primary’s bloat stops. Second, keep feedback on but cap query duration on the replica with statement_timeout so no single snapshot lives long enough to do real damage. The right choice depends on whether you can rerun a canceled report cheaply (favor the first) or whether the reports must finish at all costs (favor the second). The mistake to avoid is leaving feedback on without a duration cap, because the replica then dictates the primary’s bloat ceiling.

- You inherit a setup with synchronous_standby_names set to FIRST 1 (us_east_a, us_east_b) and synchronous_commit = on. The primary is in us-east-1a and both standbys are listed. p99 commit latency is 1.4 ms. The team wants to add a standby in us-west-2a, us_west_2a, as a third synchronous standby for regional disaster recovery. What happens to commit latency, and how would you preserve both regional durability and acceptable performance?

  Answer. It depends on how the standby is added. Merely appending it, as in FIRST 1 (us_east_a, us_east_b, us_west_2a), changes nothing in normal operation: commits still wait only for the highest-priority connected standby, us_east_a, and us_west_2a becomes a potential synchronous standby that takes over only if both eastern standbys are down, at which point latency jumps. It also guarantees nothing about the western copy. To make every commit durable in the second region, the setting has to force a wait on the western standby (for example FIRST 3 across all three names), and p99 commit latency then goes from 1.4 ms to at least the cross-region round trip of 50 to 150 ms. There is no way to make a cross-region synchronous commit cheap without changing the durability target. The pragmatic option is to keep the synchronous quorum local with ANY 1 (us_east_a, us_east_b), so commits land on local stable storage in under 2 ms, and run us_west_2a as an asynchronous standby for regional disaster recovery, accepting a small RPO measured in seconds for a cross-region failure. If true zero-RPO across regions is mandated, the only honest answer is to embrace the latency: tell the application that p99 commits will take 50 to 150 ms or more, batch writes to amortize the cost, and design clients with that budget in mind. The lesson is that synchronous replication is a durability and latency contract, and there is no setting that buys cross-region durability at single-AZ speed.
Sources & evidence
14 claims · 3 cited
Mechanism, configuration, slot semantics, hot standby conflicts, and synchronous_standby_names quorum are grounded in the official PostgreSQL high-availability, warm-standby, WAL, and runtime-WAL documentation. Real inter-AZ and cross-region latency numbers, the operational use of pgBackRest and wal-g for WAL archival, and the standby.signal transition that replaced recovery.conf in PG 12 are correct facts not covered by the listed sources and are marked stable-common-knowledge.
- A standby connects to the primary in libpq replication mode and the primary spawns a walsender process for each connected standby that streams WAL records as they are generated, while a walreceiver on the standby writes them to pg_wal and a separate startup process replays them. (verified)
- Physical streaming replication produces a byte-exact copy of the primary because it ships and replays WAL records at the page level, so the primary and every standby must run the same PostgreSQL major version and the same WAL format. (verified)
- Streaming replication is the default modern path, while file-based log shipping using archive_command on the primary and restore_command on the standby remains supported as a fallback for filling streaming gaps. (verified)
- wal_level must be at least replica (the default since PostgreSQL 10) for streaming replication, max_wal_senders caps how many concurrent walsenders can run (default 10), and pg_basebackup with -X stream seeds a standby's data directory while streaming WAL in parallel to keep the base backup consistent. (verified)
- Since PostgreSQL 12 the legacy recovery.conf file is gone: primary_conninfo and related recovery parameters now live in postgresql.conf or postgresql.auto.conf, and the presence of an empty standby.signal file is what tells the server to start in standby mode. (stable common knowledge)
- A physical replication slot records a restart_lsn on the primary, the oldest WAL the standby may still need, and prevents the primary from recycling any WAL segment the standby has not yet consumed, so a slot stops the standby from falling irrecoverably behind but causes unbounded pg_wal growth if the standby disappears. (verified)
- max_slot_wal_keep_size (added in PostgreSQL 13) caps how much WAL a replication slot may retain; once the cap is exceeded the slot is invalidated and the primary is free to recycle the WAL, bounding the blast radius of a forgotten or dead slot. (verified)
- When replay of a WAL record (such as a vacuum cleanup) conflicts with a snapshot held by a query on the standby, replay waits up to max_standby_streaming_delay (default 30 seconds) or max_standby_archive_delay before canceling the conflicting query with ERROR: canceling statement due to conflict with recovery. (verified)
- hot_standby_feedback = on makes the standby periodically send its oldest snapshot's xmin to the primary, so the primary's vacuum holds back tuples the standby still needs, at the cost of extending the primary's xmin horizon based on standby query duration. (verified)
- synchronous_commit has four meaningful levels above off: local (primary fsync only), remote_write (standby OS write), on (standby fsync, the default), and remote_apply (standby has replayed the record so reads on the standby see the just-committed write). (verified)
- synchronous_standby_names supports both ANY k (quorum, any k standbys may acknowledge) and FIRST k (priority order, the k highest-priority connected standbys must acknowledge), with the failure mode that if no qualifying standby is reachable every commit on the primary blocks indefinitely until the list is changed or a standby reconnects. (verified)
- A synchronous standby in the same availability zone typically adds 0.5 to 2 ms per commit, cross-AZ within a region adds 1 to 3 ms, and a cross-region synchronous standby (for example us-east to us-west) adds 50 to 150 ms or more, dominated by speed-of-light round-trip time. (stable common knowledge)
- pg_stat_replication exposes sent_lsn, write_lsn, flush_lsn, and replay_lsn for each standby plus the time-valued write_lag, flush_lag, and replay_lag columns, and they always satisfy sent_lsn >= write_lsn >= flush_lsn >= replay_lsn so the four positions correspond to four nested durability guarantees. (verified)
- A hot standby is strictly read-only: it cannot execute DDL, cannot run INSERT/UPDATE/DELETE, and cannot advance sequences via nextval(), which is why an application that allocates primary keys from a sequence must still talk to the primary even when otherwise read-only. (verified)
Cited sources
- PostgreSQL Documentation: Log-Shipping Standby Servers and Replication Slots · PostgreSQL Global Development Group
- PostgreSQL Documentation: High Availability, Load Balancing, and Replication · PostgreSQL Global Development Group
- PostgreSQL Documentation: Write Ahead Log Configuration · PostgreSQL Global Development Group