High Availability and Failover
How to build HA on top of PostgreSQL: writing down RTO and RPO targets, running Patroni against a 3- or 5-node DCS quorum with a watchdog for split-brain prevention, promoting via pg_ctl promote and rejoining via pg_rewind, routing through HAProxy that follows Patroni's REST endpoint, combining synchronous replication and quorum failover for RPO=0, and grounding all of it on pgBackRest or wal-g archives with rehearsed point-in-time recovery.
Learning outcomes
High availability is not a feature you turn on. It is a small system you build on top of PostgreSQL, made of replication, a consensus store, a fencing mechanism, a router, and a tested backup pipeline. Each piece exists to prevent one specific failure, and any piece you skip becomes the way you lose data. This page is the closing rung of the replication track. It assumes you already know how streaming and synchronous replication work and how logical replication and CDC differ from physical replication.
After studying this page, you can:
- State your RTO and RPO targets in seconds and explain which replication mode is compatible with each.
- Explain why PostgreSQL ships without automatic failover, and what an external coordinator like Patroni adds on top.
- Configure a three-node Patroni cluster with etcd, including the TTL, loop, and lag settings that bound failover latency and data loss.
- Defend a cluster against split-brain using a distributed consensus store, a watchdog, and STONITH-style fencing.
- Re-attach a former primary as a standby using pg_rewind, and read an HAProxy health-check that follows the Patroni REST endpoint.
- Diagnose the failure modes that bite teams in production: joint-failure DCS placement, missing watchdogs, sync-rep wedged on a dead standby, and stale router health checks.
Before we dive in
You should be comfortable with WAL, streaming replication, replication slots, and what synchronous_commit and synchronous_standby_names do. We refer to the streaming and synchronous replication page for the mechanics of how WAL flows from primary to standby, and to the logical replication and CDC page when contrasting physical and logical failover paths. We do not re-derive those here.
A few words you will lean on. High availability (HA) is the property that the database keeps serving traffic across a single-node failure. Failover is the act of promoting a standby to primary when the old primary is gone. RTO (recovery time objective) is the maximum time you tolerate without writes. RPO (recovery point objective) is the maximum data loss, measured in time or transactions, you tolerate. A distributed consensus store (DCS) is a small fault-tolerant key-value store, typically etcd, Consul, or ZooKeeper, that external coordinators use to agree on who is primary. Fencing is the act of forcibly stopping or isolating a node so it cannot accept writes after losing leadership. Split-brain is the failure where two nodes both believe they are primary and accept divergent writes.
Hold those seven. Every mechanism below is one of them in disguise.
Mental Model
The wrong model, and a common one, is that HA is a switch on the database: turn on replication, point a virtual IP at the cluster, and the database “fails over” by itself. Under that picture, the standby is a hot spare that magically takes over when the primary dies.
PostgreSQL does not work that way. The better model is a small distributed system wrapped around PostgreSQL. The database itself only does two things: ship WAL to standbys, and accept a single command, “you are now primary.” Everything else (deciding the primary is dead, choosing the next one, telling everyone, kicking the old one off the network, and pointing clients at the new one) is done by an external coordinator, with help from a consensus store, a watchdog, and a router. If any of those pieces is missing or wrong, your “HA cluster” is actually a way to lose committed transactions while feeling safe.
Keep this picture: PostgreSQL provides the ingredients (WAL, standbys, the promote command). The coordinator provides the recipe. The consensus store provides the truth about who is leader. The fencing layer guarantees there is only one leader at a time. The router puts clients in front of that leader. Each of the eleven rungs below is one of those parts.
Breaking it down
1. RTO and RPO: the two numbers that define your HA
Before you choose any tool, write down two numbers. RTO is the time you tolerate with no writes after a failure. RPO is the data loss you tolerate, in seconds of WAL or in committed transactions. Every HA decision after this point falls out of those two numbers.
Async streaming replication, the default, optimizes RTO. The standby replays WAL as fast as it arrives, but commit on the primary returns to the client as soon as WAL hits local disk. If the primary dies between the local flush and the network send, those committed transactions are lost. A typical async setup gives an RTO of 30 to 60 seconds (long enough to detect and promote) and an RPO of milliseconds to seconds of WAL not yet shipped.
Synchronous replication can give RPO equal to zero. With synchronous_commit set to on or remote_apply and a standby named in synchronous_standby_names, a commit on the primary waits until at least one standby has flushed (or replayed) that WAL record before returning. The cost is commit latency: every transaction pays one extra network round trip. A primary across a 10 ms network from its sync standby has a hard floor of about 10 ms per commit, regardless of how fast its disks are.
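The latency floor is plain arithmetic, and a few lines make it concrete. The sketch below assumes the 10 ms figure is the round-trip time to the sync standby and adds a hypothetical 1 ms local flush; the function names are illustrative and not part of any PostgreSQL tooling.

```python
def commit_latency_floor_ms(local_flush_ms: float, rtt_ms: float, synchronous: bool) -> float:
    """Lower bound on commit latency: the local WAL flush, plus one network
    round trip when a synchronous standby must acknowledge first."""
    return local_flush_ms + (rtt_ms if synchronous else 0.0)


def max_serial_commits_per_second(latency_ms: float) -> float:
    """Commits a single session can issue back to back at that latency."""
    return 1000.0 / latency_ms


# Hypothetical numbers: 1 ms local flush, 10 ms round trip to the standby.
async_floor = commit_latency_floor_ms(1.0, 10.0, synchronous=False)
sync_floor = commit_latency_floor_ms(1.0, 10.0, synchronous=True)
print("async:", async_floor, "ms,", max_serial_commits_per_second(async_floor), "commits/s per session")
print("sync: ", sync_floor, "ms,", max_serial_commits_per_second(sync_floor), "commits/s per session")
```

Concurrent sessions overlap their waits, so total throughput can stay high; it is the latency of each individual commit that cannot drop below the round trip.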
flowchart LR
A[Choose your RTO and RPO] --> B{RPO acceptable as seconds of WAL?}
B -- yes --> C[Async streaming + fast failover]
B -- no, must be zero --> D[Synchronous replication with quorum]
C --> E[Coordinator promotes a standby on primary loss]
D --> E
E --> F[Router moves clients to the new primary]
A useful sanity check: if your business says “we can tolerate no data loss,” push back and ask for the cost they will pay. Sync replication on a hot path can double or triple commit latency. Many teams choose async plus very fast failover (sub-30-second RTO, sub-second RPO) instead, because the cost of zero-RPO is felt on every single commit, not only at failover.
2. Why PostgreSQL has no built-in automatic failover
PostgreSQL ships streaming replication, replication slots, and a single command to promote a standby. It does not ship a leader election, a health monitor, or a coordinator. This is a deliberate design choice, not an oversight. Promotion is destructive: once a standby is promoted, two timelines exist, and the old primary cannot simply come back. The PostgreSQL project leaves the policy of “when do we promote” to operators, because the right answer depends on RTO, RPO, your network topology, and what counts as a failure on your monitoring.
Promotion itself is a single operator action. On the standby you run pg_ctl promote, or in older versions you place a trigger file at the path named in recovery.conf. In response, the standby finishes replaying any WAL it has, switches to a new timeline (the timeline id increments), starts accepting writes, and stops applying WAL from upstream. That is it. PostgreSQL never decides on its own.
pg_ctl -D /var/lib/postgresql/16/main promote
External coordinators provide the missing logic. The dominant choice today is Patroni, which we cover in depth in rung 3. Other options include repmgr (older, simpler, less robust against split-brain), pg_auto_failover from Microsoft (uses a separate monitor node and is opinionated about topology), and stolon (a different architectural take with a separate proxy layer). Managed cloud services (Amazon RDS, Aurora, Google Cloud SQL, Azure Database for PostgreSQL) include their own automation, usually as a service you cannot tune. The trade-off is the recurring theme of this page: more automation means less control over policy.
3. Patroni and the distributed consensus store
Patroni is a daemon you run on every PostgreSQL node. Each daemon manages its local PostgreSQL: it can start and stop it, change postgresql.conf and pg_hba.conf, and call pg_ctl promote. The interesting work happens between the daemons, and they coordinate through a distributed consensus store, almost always etcd, sometimes Consul or ZooKeeper.
The DCS holds a small set of keys per cluster. The central one is the leader key: it names the current primary and carries a time-to-live (TTL). The primary’s Patroni renews the lease on this key on every loop. If it fails to renew within the TTL, the key expires automatically, and the remaining Patroni daemons race to claim it. Only the standby with the highest replayed WAL location (LSN) is allowed to win; that is how Patroni keeps RPO bounded.
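The lease behaviour described above can be sketched as a toy model. This is not Patroni's code or the etcd API: the class name, method names, and the explicit `now` argument are assumptions made so the example runs without a real DCS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaderLease:
    """Toy model of a DCS leader key with a TTL (illustrative only)."""
    ttl: float
    holder: Optional[str] = None
    expires_at: float = 0.0

    def current_holder(self, now: float) -> Optional[str]:
        # An expired key behaves as if it had been deleted.
        return self.holder if now < self.expires_at else None

    def renew(self, node: str, now: float) -> bool:
        # Only the current holder may extend the lease.
        if self.current_holder(now) != node:
            return False
        self.expires_at = now + self.ttl
        return True

    def try_acquire(self, node: str, now: float) -> bool:
        # Succeeds only when the key is absent; the first claimant wins.
        if self.current_holder(now) is not None:
            return False
        self.holder, self.expires_at = node, now + self.ttl
        return True


lease = LeaderLease(ttl=30)
lease.try_acquire("pg-1", now=0)
lease.renew("pg-1", now=10)                          # lease now runs until t=40
took_over_early = lease.try_acquire("pg-2", now=35)  # key still held by pg-1
took_over_late = lease.try_acquire("pg-2", now=41)   # key expired at t=40
```

The point of the model is that nobody has to declare the primary dead: silence from the holder is enough for the key to disappear and the race to start.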
scope: pg-prod
namespace: /db/
name: pg-1
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.11:8008
etcd3:
  hosts: 10.0.0.21:2379,10.0.0.22:2379,10.0.0.23:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      parameters:
        max_connections: 500
        wal_level: replica
        hot_standby: on
        max_wal_senders: 10
        max_replication_slots: 10
postgresql:
  data_dir: /var/lib/postgresql/16/main
  bin_dir: /usr/lib/postgresql/16/bin
  authentication:
    superuser: {username: postgres, password: '...'}
    replication: {username: replicator, password: '...'}
watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5
Four parameters carry most of the weight.
- ttl (default 30 seconds): how long the leader lease lives between renewals. Lowering ttl tightens RTO but raises the risk of a false failover from a brief network blip.
- loop_wait (default 10): the period Patroni waits between health-check iterations on each node. It bounds how quickly a replica notices the lease is gone.
- retry_timeout (default 10): how long Patroni keeps trying a DCS or PostgreSQL operation before treating it as failed.
- maximum_lag_on_failover (default 1 MiB): the most a candidate may lag the last known LSN and still be eligible to win the election. This is your RPO ceiling in bytes of WAL.
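Two checks follow from these parameters and are easy to sketch. The first encodes the rule from the Patroni documentation that loop_wait + 2 * retry_timeout must not exceed ttl. The second is a simplified model of election eligibility; the function names and the byte positions are hypothetical, and Patroni's real election logic has more inputs than this.

```python
def timing_is_consistent(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Patroni's documented rule: loop_wait + 2 * retry_timeout <= ttl."""
    return loop_wait + 2 * retry_timeout <= ttl


def eligible_candidates(last_leader_lsn: int, replica_lsns: dict,
                        maximum_lag_on_failover: int) -> list:
    """Replicas whose lag in bytes is within the cap, most advanced first."""
    within_cap = {
        name: lsn for name, lsn in replica_lsns.items()
        if last_leader_lsn - lsn <= maximum_lag_on_failover
    }
    return sorted(within_cap, key=within_cap.get, reverse=True)


defaults_ok = timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10)
candidates = eligible_candidates(
    last_leader_lsn=5_000_000,
    replica_lsns={"pg-2": 4_999_000, "pg-3": 3_000_000},  # hypothetical positions
    maximum_lag_on_failover=1_048_576,
)
```

Here pg-3 is about 2 MB behind and is excluded, so pg-2 is the only candidate. If every replica were over the cap, nobody would be promoted automatically, which is the trade the setting makes: availability is given up to keep data loss bounded.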
Run the DCS itself with an odd number of nodes, three or five, so it has a real quorum. A two-node etcd has no fault tolerance: lose one and the cluster has no majority. Spread the nodes across availability zones so a single zone failure cannot take both PostgreSQL and DCS down at once.
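The quorum arithmetic behind the odd-number advice fits in two functions. This is a generic majority calculation, not an etcd API.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority still exists."""
    return members - majority(members)


for n in (1, 2, 3, 4, 5):
    print(f"{n} members: majority {majority(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Two members tolerate no failures, and four tolerate no more than three do, which is why even sizes add cost without adding safety.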
flowchart TB
subgraph DCS["etcd quorum (3 nodes)"]
E1[etcd-1]
E2[etcd-2]
E3[etcd-3]
end
subgraph PG["PostgreSQL + Patroni"]
P1["pg-1 (primary)\nleader key holder"]
P2["pg-2 (standby)"]
P3["pg-3 (standby)"]
end
P1 -- "renew leader key" --> DCS
P2 -- "watch leader key" --> DCS
P3 -- "watch leader key" --> DCS
P1 -- "stream WAL" --> P2
P1 -- "stream WAL" --> P3
4. Split-brain and the fencing that prevents it
Split-brain is the single failure mode that turns an HA cluster into a corruption event. It happens when two nodes both believe they are primary at the same time: each accepts writes, and a few minutes later you have two divergent histories with no automated way to reconcile them. Replaying both would lose transactions. Picking one would also lose transactions. You will be on a call at 3 a.m. apologizing.
Patroni’s first defense is the leader key itself. A node only stays primary while it holds the key. If it cannot renew the key before the lease runs out, it must demote itself, because it can no longer prove it is still the leader. That sounds simple, and on a happy path it is. The problem is what happens when Patroni hangs.
Suppose Patroni on the primary deadlocks, or the kernel pauses it for ten seconds during heavy I/O. Patroni does not renew the lease. The DCS expires the key. A standby is elected and promoted. But the original primary’s PostgreSQL is still running and still accepting writes from clients who have not yet noticed. Now you have two primaries.
The watchdog is the second defense. Patroni opens /dev/watchdog (backed by a hardware watchdog or the Linux softdog kernel module) and feeds it on every successful loop. If Patroni stops feeding it, the watchdog resets the machine shortly before the leader lease would expire. A hard reset is the one way of stopping PostgreSQL that still works when software is hung. Configure mode: required in patroni.yml so the node refuses to become leader without a working watchdog; otherwise a missing /dev/watchdog silently disables your protection.
watchdog:
  mode: required
  device: /dev/watchdog
  safety_margin: 5
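How safety_margin interacts with the timing parameters can be worked out in a few lines. The sketch assumes, as the Patroni watchdog documentation describes, that the watchdog is set to expire safety_margin seconds before the TTL; the function and key names are illustrative.

```python
def watchdog_budget(ttl: int, loop_wait: int, retry_timeout: int, safety_margin: int) -> dict:
    """Time budgets (seconds) implied by the watchdog settings."""
    watchdog_timeout = ttl - safety_margin           # reset fires this long after the last feed
    ha_loop_budget = watchdog_timeout - loop_wait    # time one HA loop has to complete
    demote_budget = ha_loop_budget - retry_timeout   # time left after a DCS call times out
    return {
        "watchdog_timeout": watchdog_timeout,
        "ha_loop_budget": ha_loop_budget,
        "demote_budget": demote_budget,
    }


budget = watchdog_budget(ttl=30, loop_wait=10, retry_timeout=10, safety_margin=5)
print(budget)
```

A demote budget that reaches zero or goes negative means the machine could be reset before Patroni has had a chance to shut PostgreSQL down cleanly, which is one reason to re-run this arithmetic whenever ttl is lowered.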
The third defense is STONITH (“Shoot The Other Node In The Head”). Before promoting the new primary, the coordinator forcibly powers off or network-isolates the old one through an out-of-band channel: an IPMI power command, a cloud API call to stop the VM, a switch port shutdown. STONITH gives a positive proof, not just a belief, that the old node cannot serve writes. Patroni does not ship a built-in STONITH, but operators integrate it via callback scripts or by relying on the watchdog plus the routing layer (HAProxy refuses to send traffic to a non-leader).
stateDiagram-v2
[*] --> Primary
Primary --> Demoting: lost leader key
Demoting --> Standby: PostgreSQL stopped, ready to follow
Primary --> Rebooted: Patroni hung, watchdog fires
Rebooted --> Standby: machine restarts, rejoins as standby
Standby --> Promoting: won election
Promoting --> Primary: pg_ctl promote returned
A simple rule: if you cannot answer “what stops the old primary from accepting writes after it loses leadership,” your cluster is not protected against split-brain.
5. Promotion, rewind, and rejoin
Promotion itself is fast. Patroni runs pg_ctl promote on the winning standby, PostgreSQL finishes replaying any WAL it has, advances to a new timeline (the timeline id in pg_wal increments), and starts accepting writes. From the client’s view, the database is back.
Rejoining the old primary is where many engineers stumble. After an unplanned failover, the old primary is behind the new one (it missed every write made since the promotion), and worse, it may have written WAL of its own that the new primary never saw. A plain restart will not work: the two timelines have diverged.
Two tools fix this. pg_rewind takes the old primary back to the point of divergence: it compares timeline histories to find where the two nodes forked, scans the old primary’s own WAL from the last checkpoint before that point to learn which blocks changed afterwards, and copies just those blocks (plus configuration and other non-relation files) from the new primary so the old node can follow the new timeline. It is fast when little has diverged: the copy is proportional to the blocks changed since the fork rather than to the size of the database. Patroni runs pg_rewind automatically when use_pg_rewind: true is set in the cluster config.
pg_rewind --target-pgdata=/var/lib/postgresql/16/main \
--source-server="host=10.0.0.12 user=postgres dbname=postgres"
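The block-level idea behind pg_rewind can be illustrated with a toy model. This is a sketch of the concept only: WAL records are simplified to (LSN, block) pairs with hypothetical values, whereas the real tool parses binary WAL on the old primary and starts from the last checkpoint before the fork.

```python
def blocks_to_copy(old_primary_wal, divergence_lsn):
    """Return the blocks the old primary touched after the timelines forked.

    old_primary_wal: iterable of (lsn, (relation, block_number)) pairs.
    These are the blocks that must be overwritten from the new primary.
    """
    return sorted({block for lsn, block in old_primary_wal if lsn > divergence_lsn})


# Hypothetical WAL of the old primary; the fork happened at LSN 1000.
wal = [
    (900, ("orders", 7)),      # before the fork: identical on both nodes
    (1010, ("orders", 7)),     # after the fork: must be replaced
    (1020, ("customers", 3)),
    (1030, ("orders", 7)),     # same block touched twice, copied once
]
stale_blocks = blocks_to_copy(wal, divergence_lsn=1000)
print(stale_blocks)
```

The work scales with the number of distinct blocks touched after the fork, not with the size of the database.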
When pg_rewind cannot run (for instance, the old primary’s WAL was lost or wal_log_hints and data_checksums are both off), the fallback is a fresh pg_basebackup. This copies the entire data directory from the new primary, which can be slow on large clusters (minutes to hours for a multi-terabyte database) and floods the network. Production clusters should always enable data_checksums or wal_log_hints so pg_rewind has the bookkeeping it needs.
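# postgresql.conf (changing wal_log_hints requires a restart)
wal_log_hints = on
# data_checksums is read-only and cannot be set in this file:
# enable it at cluster creation with initdb --data-checksums,
# or offline on an existing cluster with pg_checksums --enable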
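The rejoin decision described above reduces to a small predicate. The function name and its boolean inputs are hypothetical; the sketch encodes only the two conditions named in the text, not every check pg_rewind or Patroni performs.

```python
def rejoin_strategy(wal_log_hints: bool, data_checksums: bool,
                    divergent_wal_available: bool) -> str:
    """Pick how a former primary rejoins, per the conditions in the text."""
    has_bookkeeping = wal_log_hints or data_checksums
    if has_bookkeeping and divergent_wal_available:
        return "pg_rewind"      # copy only the blocks that diverged
    return "pg_basebackup"      # recopy the whole data directory


print(rejoin_strategy(wal_log_hints=True, data_checksums=False, divergent_wal_available=True))
print(rejoin_strategy(wal_log_hints=False, data_checksums=False, divergent_wal_available=True))
```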
6. Routing writes with HAProxy and pgbouncer
Even after a successful promotion, clients still hold connections to the old primary’s IP. Something must move them. There are three common patterns: a virtual IP that the coordinator moves, an HAProxy fronted by Patroni’s REST API, and application-side discovery through the DCS.
The HAProxy pattern is the most common. HAProxy listens on a single port and uses an HTTP health check against the Patroni REST API on each backend (port 8008 by convention). Patroni exposes endpoints that return HTTP 200 on the role you ask about and 503 otherwise: /primary, /replica, /standby-leader, /read-only. So an HAProxy backend that wants to send writes to the primary checks /primary, and one that wants to spread reads checks /replica.
# /etc/haproxy/haproxy.cfg, the write path
frontend pg_writes
    bind *:5432
    default_backend pg_primary
backend pg_primary
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-1 10.0.0.11:5432 check port 8008
    server pg-2 10.0.0.12:5432 check port 8008
    server pg-3 10.0.0.13:5432 check port 8008
# the read path, spread across standbys
backend pg_replicas
    option httpchk GET /replica
    balance roundrobin
    default-server inter 3s fall 3 rise 2
    server pg-1 10.0.0.11:5432 check port 8008
    server pg-2 10.0.0.12:5432 check port 8008
    server pg-3 10.0.0.13:5432 check port 8008
The key knob is on-marked-down shutdown-sessions on the write backend. Without it, existing connections to the old primary keep working even after HAProxy marks it down, and your application happily writes to a non-leader (which, in a split-brain situation, can lose data). With it, HAProxy slams shut any connection to a backend that fails health checks, forcing the client to reconnect to the current primary.
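The rise and fall counters in the configuration decide how many consecutive check results it takes to change a server's state. The class below is a toy model of that behaviour, not HAProxy code; its names are assumptions, and it ignores details such as check timeouts.

```python
class ServerHealth:
    """Toy model of HAProxy's rise/fall counters for one backend server."""

    def __init__(self, rise: int = 2, fall: int = 3, up: bool = True):
        self.rise, self.fall, self.up = rise, fall, up
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Feed one health-check result and return the resulting state."""
        if check_passed == self.up:
            self._streak = 0               # result agrees with the current state
        else:
            self._streak += 1
            needed = self.fall if self.up else self.rise
            if self._streak >= needed:     # enough consecutive contradictions
                self.up = not self.up
                self._streak = 0
        return self.up


# Old primary starts answering 503 on /primary: down after 3 checks (fall 3).
old_primary = ServerHealth(up=True)
old_states = [old_primary.record(status == 200) for status in (503, 503, 503)]

# Promoted standby starts answering 200: up after 2 checks (rise 2).
new_primary = ServerHealth(up=False)
new_states = [new_primary.record(status == 200) for status in (200, 200)]
```

With inter 3s, those counts translate into roughly nine seconds to mark the old primary down and six to mark the new one up, which is the router's share of the failover time.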
Many teams put pgbouncer in front of HAProxy as a transaction-pooler. The pool layer survives a brief primary outage: when the old primary disappears, pgbouncer cancels its server-side connections, the client gets a few errors for a couple of seconds, and once HAProxy picks the new primary, pgbouncer reconnects without the application reconfiguring anything. Pair pgbouncer with server_lifetime and server_reset_query so its pools recycle through promotions cleanly.
7. Cloud-native and Kubernetes operators
On Kubernetes, the Patroni pattern is wrapped in an operator. The two dominant ones are CloudNativePG (a CNCF project, very active in 2024-2026) and the Crunchy PostgreSQL Operator (pgo). Both manage a set of PostgreSQL pods and their volumes, a service for the primary, services for replicas, a PodDisruptionBudget, and the backup pipeline. CloudNativePG uses its own controller rather than running Patroni inside each pod, and manages pods and PersistentVolumeClaims directly instead of going through a StatefulSet; pgo embeds Patroni.
A minimal CloudNativePG cluster spec captures the model:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-prod
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  storage:
    size: 200Gi
    storageClass: ssd
  postgresql:
    parameters:
      max_connections: "500"
      shared_buffers: "8GB"
      wal_level: "replica"
  bootstrap:
    initdb:
      database: app
      owner: app
  backup:
    barmanObjectStore:
      destinationPath: s3://pg-prod-backups
      s3Credentials:
        accessKeyId:
          name: backup-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-creds
          key: SECRET_ACCESS_KEY
The operator handles failover, backups, point-in-time recovery, and rolling upgrades. The trade-off is that the operator decides policy: you cannot, for instance, freely tune the equivalent of maximum_lag_on_failover; you live within what the operator exposes.
The most opinionated cloud model is Amazon Aurora. Aurora separates compute from storage: PostgreSQL writes to a distributed storage layer that lives independently of any compute node. A failover, in Aurora, is just swapping which compute node is primary; the storage is already shared and durable. RTO is sub-30 seconds and there is no need to copy data. The trade-off is that you cannot tune the storage, you pay for the proprietary substrate, and the PostgreSQL version is whatever Aurora supports today. RDS for PostgreSQL (the non-Aurora flavor) is closer to vanilla PostgreSQL with a managed failover via Multi-AZ replication, similar in feel to a managed Patroni.
8. Synchronous replication and HA together
If your RPO must be zero, you need synchronous replication, and you need it to keep working when one standby is gone. The naive setup fails the second test.
The setting that controls who waits is synchronous_standby_names. The simplest form, '*', waits for any one standby. A safer form names standbys explicitly. The form that matters for HA is the quorum form:
# postgresql.conf on the primary
synchronous_commit = on
synchronous_standby_names = 'ANY 1 ("pg-2", "pg-3")'   # names containing a hyphen must be double-quoted
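This says: a commit returns when any one of pg-2 or pg-3 has flushed the WAL. If pg-2 is down, pg-3 still acknowledges and writes continue. If you wrote FIRST 1 instead of ANY 1, the list would become a priority order: every commit waits for pg-2 specifically while it is connected, however slow it is, and pg-3 takes over as the synchronous standby only once the primary sees pg-2 disconnect. In both forms, commits block when no listed standby is connected. The payoff: with synchronous_commit = on a transaction is durable as soon as one sync standby has it on disk, even if the primary then crashes.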
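The difference between the two forms shows up most clearly when one standby is slow rather than dead. The function below is a simplified model of the wait rule for each form, with hypothetical acknowledgment times in milliseconds; it is not PostgreSQL code.

```python
from typing import Optional


def commit_wait_ms(method: str, num_sync: int, standbys: list, ack_ms: dict) -> Optional[float]:
    """Time until a commit may return, or None if it would block.

    standbys: names in the order listed in synchronous_standby_names.
    ack_ms:   flush-acknowledgment time per connected standby; a standby
              missing from the dict is treated as disconnected.
    """
    connected = [name for name in standbys if name in ack_ms]
    if len(connected) < num_sync:
        return None  # not enough standbys: the commit waits indefinitely
    if method == "FIRST":
        # Priority order: wait for the first num_sync connected standbys.
        return max(ack_ms[name] for name in connected[:num_sync])
    if method == "ANY":
        # Quorum: wait for the num_sync fastest acknowledgments.
        return sorted(ack_ms[name] for name in connected)[num_sync - 1]
    raise ValueError(f"unknown method: {method}")


names = ["pg-2", "pg-3"]
slow_pg2 = {"pg-2": 50.0, "pg-3": 5.0}   # pg-2 connected but slow
only_pg3 = {"pg-3": 5.0}                 # pg-2 disconnected

print(commit_wait_ms("ANY", 1, names, slow_pg2))
print(commit_wait_ms("FIRST", 1, names, slow_pg2))
print(commit_wait_ms("FIRST", 1, names, only_pg3))
```

In this model the quorum form is never slower than the priority form for the same set of standbys, which is the property that makes it the natural choice for HA.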
For RPO = 0 across a primary loss, the chosen winner of the failover must be one of the sync standbys, and there must always be at least one sync standby up. Patroni’s synchronous_mode: true ties these together: Patroni records the current synchronous standby in the DCS (in a sync key kept alongside the leader key), and only that standby is eligible to be promoted on a hard failure.
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_node_count: 1
A common production target is “RPO = 0 within a region, async cross-region.” Two sync standbys in different availability zones of one region give you zero data loss across single-node and single-zone failures, with cross-region async standbys as a separate disaster-recovery tier (whose RPO is measured in seconds of WAL).
9. Backups and point-in-time recovery as the floor
Replication is not a backup. A bug, a bad migration, or a malicious DELETE replicates exactly as fast as a correct write. If you depend only on a standby for safety, the moment someone runs DELETE FROM customers without a WHERE clause, both primary and standby are wrong, and you have nothing to restore from.
The 3-2-1 rule is the working floor: at least three copies of the data, on at least two different kinds of storage, with at least one off-site. For PostgreSQL, that almost always means a full backup plus continuous WAL archival to object storage, taken by pgBackRest or wal-g. Both tools are battle-tested and handle incremental backups, compression, encryption, parallel uploads, and retention policy.
# pgBackRest, daily full + continuous WAL archive to s3
pgbackrest --stanza=main --type=full backup
pgbackrest --stanza=main --type=incr backup
pgbackrest --stanza=main info # list backups and the oldest restorable point
postgresql.conf must continuously ship WAL to the archive so a restored backup can be rolled forward to any later moment.
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
archive_timeout = 60 # force a WAL switch at least once a minute, even on idle
wal_level = replica
Point-in-time recovery (PITR) is the partner of the archive. After a destructive event, you stop the cluster, restore the last full backup, and tell PostgreSQL to replay WAL up to a target. The target can be a timestamp, an LSN, a transaction id, or a named restore point.
pgbackrest --stanza=main \
--type=time --target="2026-06-06 14:32:15+00" \
restore
# postgresql.conf for the restored cluster
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
recovery_target_time = '2026-06-06 14:32:15+00'
recovery_target_action = 'promote'
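Choosing which backup to restore for a given target comes down to one rule: the backup must have finished before the target, and the latest such backup minimizes the WAL to replay. The sketch below uses hypothetical nightly backup times and is not a pgBackRest API.

```python
from datetime import datetime, timezone


def pick_base_backup(backup_stop_times, target):
    """Latest backup that finished at or before the recovery target."""
    usable = [stop for stop in backup_stop_times if stop <= target]
    if not usable:
        raise ValueError("no backup finished before the recovery target")
    return max(usable)


utc = timezone.utc
# Hypothetical nightly backups finishing at 02:00 UTC.
backups = [datetime(2026, 6, day, 2, 0, tzinfo=utc) for day in (4, 5, 6, 7)]
target = datetime(2026, 6, 6, 14, 32, 15, tzinfo=utc)

chosen = pick_base_backup(backups, target)
print("restore", chosen.isoformat(), "then replay WAL up to", target.isoformat())
```

The gap between the chosen backup and the target is the amount of WAL the restore has to replay, so more frequent backups shorten recovery even though they do not change what is recoverable.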
You signal recovery mode with an empty file in the data directory: standby.signal to come up as a standby, recovery.signal to run targeted recovery that ends at the recovery target (and, with recovery_target_action = 'promote', promotes there). Both signal files were introduced in PostgreSQL 12; before that, the same logic lived in recovery.conf.
A backup you have not restored is a backup you do not have. Schedule a regular restore drill (monthly, quarterly at the latest) onto a separate environment, walk through PITR to a known moment, and verify the result. Most outages traceable to “we had a backup” actually trace to “we never restored it before, and the restore broke.”
10. A worked failover timeline
Numbers make the choices in earlier rungs concrete. Consider a Patroni cluster with ttl: 30, loop_wait: 10, retry_timeout: 10, HAProxy with inter 3s fall 3 (so a backend needs three failed checks across nine seconds before HAProxy declares it down), and on-marked-down shutdown-sessions.
sequenceDiagram
participant App as Application
participant HA as HAProxy
participant P1 as pg-1 (primary)
participant DCS as etcd
participant P2 as pg-2 (standby)
Note over P1: t=0 network partition cuts pg-1 from etcd
P1->>DCS: renew leader key (fails)
Note over DCS: t=30 lease TTL expires, key removed
P2->>DCS: claim leader key (LSN highest, lag under cap)
DCS-->>P2: granted
P2->>P2: pg_ctl promote, timeline 2 begins
HA->>P2: GET /primary -> 200
HA->>P1: GET /primary -> 503 (Patroni demoted or unreachable)
HA->>App: shutdown-sessions on pg-1 backend
App->>HA: reconnect
HA->>P2: route writes here
Note over App: t roughly 35-45 s writes resume
The timeline lands close to ttl + loop_wait + a couple of HAProxy check intervals, give or take the time pg_ctl promote takes to replay residual WAL. With the defaults above, expect 30 to 45 seconds of write unavailability for a hard primary failure. Tuning ttl down to 10 and loop_wait down to 3 (with a correspondingly smaller retry_timeout) shortens this considerably, to roughly 15 to 20 seconds by the same arithmetic unless the HAProxy check interval is shortened too, but at the cost of more false failovers during brief network blips. Pick a target RTO, then tune the smallest set of values to meet it.
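The arithmetic can be written down so that different settings are easy to compare. This is a rough upper estimate under stated assumptions: the lease takes a full ttl to expire, a replica needs up to one loop_wait to notice, and HAProxy needs rise successful checks before routing to the new primary. The function name and the promote_s parameter are hypothetical.

```python
def estimate_write_outage_s(ttl: float, loop_wait: float, inter: float,
                            rise: int, promote_s: float = 0.0) -> float:
    """Rough upper estimate of write unavailability after a hard primary loss."""
    lease_expiry = ttl               # worst case: the lease was renewed just before the failure
    replica_notices = loop_wait      # a replica's next loop sees the missing key
    router_catches_up = rise * inter # consecutive passing checks on the new primary
    return lease_expiry + replica_notices + promote_s + router_catches_up


defaults = estimate_write_outage_s(ttl=30, loop_wait=10, inter=3, rise=2)
print(defaults)
```

In practice the lease was usually renewed some seconds before the failure, so observed outages tend to land below this bound.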
If you also use synchronous replication, the promoted standby must be one that holds every acknowledged commit, which is why Patroni’s synchronous_mode restricts candidates to the standby recorded as synchronous. In asynchronous mode, Patroni instead compares each candidate’s LSN with the last known leader LSN and rejects any whose lag exceeds maximum_lag_on_failover. Put plainly, that setting is your RPO ceiling in bytes of WAL, not just a sanity check.
11. Failure modes experienced teams hit
These are the ways HA breaks in production. Each is a real incident class, not a theoretical worry.
Mastery Questions
- Your business says “we cannot lose any committed transactions.” You propose synchronous replication with synchronous_standby_names = 'ANY 1 (pg-2, pg-3)', a Patroni cluster with three nodes, and synchronous_mode: true in the DCS. A network partition isolates the primary from both sync standbys but leaves it connected to clients. What happens, and is the business goal met?
Answer. The primary cannot get an acknowledgment from either sync standby, so every commit blocks waiting for the quorum. Clients see hung transactions. Meanwhile, Patroni on the primary cannot reach the DCS either (the partition cuts it off), so its leader lease expires; one of the sync standbys on the other side wins the election (its lag is by definition zero on the last committed transaction it acknowledged) and is promoted. On the old primary, Patroni demotes PostgreSQL once it can no longer renew the lease, and if Patroni is hung or the demotion stalls, the watchdog hard-resets the machine; either way the in-flight uncommitted transactions on the old primary side are killed. Clients reconnect via HAProxy to the new primary. No committed transaction is lost, because by construction a transaction was only ever reported as committed after at least one of the now-promoted standbys had flushed its WAL. The business goal is met, at the cost of all in-flight transactions on the old primary failing and a few tens of seconds of write unavailability during failover.
- A junior engineer wires up a Patroni cluster but co-locates the three etcd nodes on the same three machines as PostgreSQL, and skips the watchdog because /dev/watchdog is “not available in our VMs.” During an incident, one VM kernel-panics. What goes wrong, and what would you change?
Answer. Two things. First, the joint placement means one VM failure removes one PostgreSQL and one etcd at once. etcd still has a two-of-three majority, so the DCS survives. But if a second VM hiccups (and incidents cluster), two of three etcd nodes are gone, the DCS loses quorum, and now every Patroni daemon loses contact with the consensus store and must demote itself to be safe. The whole cluster goes read-only even though one of three PostgreSQLs is technically healthy. Second, without a watchdog, a hung Patroni on what is currently the primary cannot be guaranteed to stop accepting writes; if the leader key expires and a standby is promoted while the old primary is still running, you have split-brain and divergent histories. The fixes are independent: run etcd on its own nodes (or at minimum on a different failure domain from PostgreSQL), and enable a software or hardware watchdog (mode: required), even if it means switching to a VM type that exposes /dev/watchdog or loading the Linux softdog kernel module. Each decision trades a small amount of operational complexity for removing a whole class of incident.
- A migration accidentally truncates the orders table on the primary. Replication is healthy and replays the TRUNCATE on every standby within 50 milliseconds. The team’s runbook says “failover to the standby.” Walk through what actually recovers the data, and how the cluster’s design should change.
Answer. Failing over does nothing useful because the standbys have replayed the TRUNCATE too. The orders table is empty on every node in the cluster. The only recoverable copy is a backup, not a replica. The right move is to spin a separate restored environment from the last full backup taken by pgBackRest (or wal-g), then PITR forward to a target timestamp just before the TRUNCATE. With archive_timeout = 60 and continuous WAL archival to object storage, you can pick a recovery target within a minute of the bad transaction, restore there, dump the orders table, and import it back into production. The cluster design needs two changes. First, write down explicitly that replication is not a backup and that the recovery path for application-level bugs is PITR from object storage, not failover. Second, rehearse the PITR path on a recurring cadence onto a separate environment, because the first time you run pgBackRest restore should not be during the incident. A bonus is to gate destructive migrations behind a procedure that takes a logical dump of the affected tables first, so even before PITR you have a five-minute escape hatch.
Sources & evidence
18 claims · 4 cited
The PostgreSQL mechanisms (no built-in failover, pg_ctl promote, standby.signal/recovery.signal, recovery_target_*, archive_mode/archive_command, WAL streaming and sync replication, timelines and pg_rewind) are grounded in the official PostgreSQL docs (High Availability, Warm Standby, WAL). Patroni-specific values (ttl, loop_wait, retry_timeout, maximum_lag_on_failover defaults, synchronous_mode, watchdog mode required) are sourced to the Patroni docs. Operational patterns layered on top (HAProxy port 8008 health endpoint, CloudNativePG, Aurora compute/storage split, the 3-2-1 backup rule, pgBackRest commands, the worked 30-45 s failover timeline) are stable industry common knowledge and carry no specific source.
- Asynchronous streaming replication gives a low recovery time objective but a non-zero recovery point objective because WAL flush on the primary is acknowledged to the client before the WAL is shipped, so a primary crash between flush and send loses the most recent committed transactions.verified
- Synchronous replication with synchronous_commit set to on or remote_apply and a standby named in synchronous_standby_names can give RPO equal to zero at the cost of one network round trip per commit, so a primary across a 10 ms link from its sync standby has a hard floor of roughly 10 ms per commit.verified
- PostgreSQL has no built-in automatic failover: promotion is triggered externally by running pg_ctl promote on a standby (or placing a trigger file in older versions), and the project leaves the policy of when to promote to operators because failover is destructive and depends on RTO, RPO, and topology.verified
- After a successful promotion the new primary advances to a new timeline (the timeline id in pg_wal increments), so the old primary cannot rejoin by a plain restart because its history has diverged from the new timeline.verified
- Patroni runs as a daemon on every PostgreSQL node and coordinates through a distributed consensus store (typically etcd, Consul, or ZooKeeper) that holds a leader key with a TTL; the primary's Patroni renews the lease on each loop, and if it fails to renew within the TTL the key expires and remaining nodes race to claim it. (verified)
- Patroni's defaults for cluster timing are ttl 30 seconds, loop_wait 10 seconds, retry_timeout 10 seconds, and maximum_lag_on_failover roughly 1 MiB; only standbys whose lag is below maximum_lag_on_failover are eligible to win the election, which bounds RPO loss in bytes of WAL. (verified)
- Production deployments should run an odd number of DCS nodes (typically three or five) so the consensus store has a real quorum: a two-node etcd has no fault tolerance because losing one leaves no majority. (verified)
- Patroni's watchdog support (the Linux watchdog device at /dev/watchdog, backed either by hardware or by the softdog kernel module, configured with mode required) hard-reboots the host if Patroni stops feeding it, which is the only reliable way to stop a hung primary from accepting writes after losing the leader key and so to prevent split-brain. (verified)
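- pg_rewind re-attaches a former primary to a new timeline by comparing the two timeline histories to find the point of divergence, scanning the old primary's own WAL from the last checkpoint before that point to learn which blocks changed afterwards, and copying only those blocks (plus non-relation files such as configuration) from the new primary; it requires either data_checksums or wal_log_hints enabled on the old primary so that its WAL carries the bookkeeping needed for the rewrite. (verified)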
- HAProxy fronting a Patroni cluster uses an HTTP health check against the Patroni REST API (port 8008 by convention) where the /primary endpoint returns 200 on the current primary and 503 elsewhere, and the on-marked-down shutdown-sessions directive forcibly closes connections to a backend that fails the check so clients cannot keep writing to a demoted primary. (stable common knowledge)
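- CloudNativePG, Crunchy's PGO, and Zalando's postgres-operator apply the same pattern inside a Kubernetes controller that reconciles a cluster custom resource into pods, services, backups, and PITR, trading fine-grained policy control for declarative operations. PGO and the Zalando operator embed Patroni and run PostgreSQL in StatefulSets; CloudNativePG uses its own instance manager and manages pods directly rather than through a StatefulSet. (stable common knowledge)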
- Amazon Aurora separates compute from storage so a failover is just swapping which compute node is primary against an already-shared durable storage layer, which gives sub-30-second RTO and removes the need to copy data, at the cost of vendor lock-in and a fixed PostgreSQL version set. (stable common knowledge)
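- synchronous_standby_names supports a quorum form such as ANY 1 ("pg-2", "pg-3") so a commit returns once any one of the named candidates flushes the WAL, which keeps writes flowing when one sync standby is unavailable. The FIRST form is priority-based instead: a commit waits for the highest-priority connected candidates, and a lower-priority candidate takes over only after a higher-priority one disconnects. Either form blocks commits once fewer candidates than the requested number remain connected. (verified)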
- Point-in-time recovery restores a base backup taken before the target and replays archived WAL up to a recovery_target_time, recovery_target_lsn, recovery_target_xid, or recovery_target_name; targeted recovery is requested by an empty recovery.signal file in the data directory (an empty standby.signal starts the server as a standby instead), a model introduced in PostgreSQL 12 in place of recovery.conf. (verified)
- pgBackRest and wal-g implement the working backup floor for PostgreSQL: incremental backups plus continuous WAL archival to object storage, with the 3-2-1 rule of at least three copies on at least two media with at least one off-site, and a routinely tested restore because replication is not a backup (a destructive bug replicates correctly). (stable common knowledge)
- With Patroni defaults (ttl 30, loop_wait 10) and HAProxy inter 3 seconds with fall 3, a hard primary failure typically yields 30 to 45 seconds of write unavailability: roughly ttl seconds for the leader lease to expire, loop_wait for promotion, and a few HAProxy intervals to reroute clients. (stable common knowledge)
- Co-locating the DCS nodes on the same machines as PostgreSQL turns a single VM loss into a joint loss of one database and one consensus member, and a second concurrent fault can collapse the DCS quorum and force the whole cluster read-only even while a healthy database survives. (verified)
- Setting maximum_lag_on_failover too generously means a promotion can pick a standby that is missing committed transactions, causing silent data loss; the parameter is effectively your RPO ceiling in bytes of WAL and should be tight on clusters that require zero data loss. (verified)
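Two of the claims above are simple arithmetic: how many DCS members can fail before the quorum is lost, and how long writes stay unavailable after a hard primary failure. The Python sketch below works both out. It is a rough model under stated assumptions, not Patroni's or HAProxy's actual logic: the names dcs_fault_tolerance and FailoverTiming are hypothetical, the lease is assumed to have been renewed between 0 and loop_wait seconds before the crash, a standby is assumed to notice the expired key within one loop, and rerouting is assumed to take between one and fall HAProxy check intervals.

```python
from dataclasses import dataclass
from typing import Tuple


def dcs_fault_tolerance(members: int) -> int:
    """Members that can fail while a strict majority of the DCS survives."""
    majority = members // 2 + 1
    return members - majority


@dataclass(frozen=True)
class FailoverTiming:
    """Hypothetical helper; defaults mirror the values quoted in the list above."""
    ttl: int = 30         # Patroni leader lease, seconds
    loop_wait: int = 10   # Patroni loop interval, seconds
    inter: int = 3        # HAProxy check interval, seconds
    fall: int = 3         # HAProxy consecutive failures before marking down

    def window(self) -> Tuple[int, int]:
        """Return (best, worst) seconds of write unavailability in this model."""
        # Assumption: the lease was renewed 0..loop_wait seconds before the
        # crash, so it expires ttl - loop_wait .. ttl seconds after the crash.
        lease_best, lease_worst = self.ttl - self.loop_wait, self.ttl
        # Assumption: a standby sees the expired key within one loop.
        promote_best, promote_worst = 0, self.loop_wait
        # Assumption: rerouting takes between 1 and `fall` check intervals.
        route_best, route_worst = self.inter, self.inter * self.fall
        return (
            lease_best + promote_best + route_best,
            lease_worst + promote_worst + route_worst,
        )


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(f"{n} DCS members tolerate {dcs_fault_tolerance(n)} failure(s)")
    best, worst = FailoverTiming().window()
    print(f"write unavailability: {best} to {worst} seconds")
```

With the defaults this model brackets the outage between 23 and 49 seconds, and the typical 30 to 45 second figure quoted above sits inside those bounds. Lowering ttl and loop_wait narrows the window, at the price of more spurious failovers on a flaky network.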
Cited sources
- PostgreSQL Documentation: High Availability, Load Balancing, and Replication · PostgreSQL Global Development Group
- PostgreSQL Documentation: Write-Ahead Logging (WAL) and Reliability · PostgreSQL Global Development Group
- PostgreSQL Documentation: Log-Shipping Standby Servers and Replication Slots · PostgreSQL Global Development Group
- Patroni Documentation: a Template for HA PostgreSQL · Zalando / Patroni contributors