Globally Distributed PostgreSQL
The capstone of the scaling-and-distributed track: how PostgreSQL stretches across regions, where the physics and the regulators force the topology choice, and when the right answer is no longer stock PostgreSQL at all. Covers the four topologies (single-region HA, primary plus regional replicas, multi-writer, geo-sharded) with real latency numbers, the NewSQL contenders (CockroachDB, YugabyteDB, Spanner, Aurora), Aurora Global Database's actual RPO and RTO, a decision tree that starts at residency, hybrid patterns that ship in production, global backup design with per-region pgBackRest plus cross-region object replication, and the seven failure modes that bite globally distributed PostgreSQL hardest.
Learning outcomes
This is the capstone of the scaling and distributed track. Every page before this one tuned one mechanism: a buffer, an index, a vacuum, a streaming feed, a Citus shard, a read replica. Here those mechanisms collide with the planet. A user in Singapore types into the same form as a user in Frankfurt, regulators on three continents want their data at home, and the architect must decide where each row lives and which round trips each commit pays. There is no “scale globally” switch. There is only a topology, chosen against the access pattern and the residency map, with numbers honest about light and lawyers.
After studying this page, you can:
- Reason in milliseconds about cross-AZ, cross-region, and intercontinental latency, and predict how each choice of synchronous_commit lands on user-visible commit time.
- Pick between four real topologies (single-region HA, primary plus regional replicas, multi-writer, geo-sharded) by mapping access patterns and residency rules to each one’s strengths.
- Place PostgreSQL alongside the NewSQL contenders (CockroachDB, YugabyteDB, Spanner) and managed offerings (Aurora, Neon, Supabase, Crunchy), and explain when each is the right reach.
- Defend an Aurora architecture call against a stock Postgres one, including the Aurora Global Database RPO and RTO numbers.
- Build a global backup and DR plan around per-region pgBackRest, cross-region object replication, and a tested failover runbook.
- Walk a decision tree that starts at residency and ends at a topology you can name, and avoid the classic mistakes (sync rep across regions, multi-writer with no conflict policy, residency as an afterthought).
Before we dive in
You should already be at home with the rest of this track. The streaming-and-synchronous-replication page is the foundation for every mechanism here that moves a byte from one region to another. The high-availability-and-failover page covers Patroni, the consensus store, fencing, and the policy questions that surround a single-region failover; we lean on that page when discussing intra-region HA and only sketch what changes at global scale. The sharding-with-citus page is the partition story you extend into geographies on rung 6, and the read-scaling-and-caching page is the layer you stretch across regions on rung 4. We do not re-derive any of that here.
A few words you will need, defined as we use them. Region is a cloud or operator’s geographic deployment area (us-east-1, eu-west-1, ap-southeast-1), usually composed of two or more availability zones a few kilometres apart. Availability zone (AZ) is a fault domain within a region with its own power, cooling, and network. Round-trip time (RTT) is the time for a packet to make the round journey between two points; it is the cost we pay for every synchronous handshake. RPO is recovery point objective, the data loss you tolerate; RTO is recovery time objective, the downtime you tolerate. Data residency is the regulatory rule that says certain data must remain physically inside one jurisdiction. Hold these. Every choice below is one of them tightened to a number.
Mental Model
The wrong model, and a tempting one, is that “global PostgreSQL” is a single logical database with a knob labelled “more regions.” You spin up replicas in each region, point clients at the nearest one, and the system becomes globally fast. Under this picture, distributing the database is a matter of provisioning, not architecture.
That picture is wrong, and lethally so at scale. A database is a physical artefact, made of disks in racks in buildings in cities, and the speed of light through fibre is roughly 200 km per millisecond. Any commit that needs an acknowledgment from a disk in another city pays that distance, every single time. The continents themselves set hard floors: a US-to-EU round trip cannot go below about 80 to 100 ms even on perfect hardware, because Atlantic-bottom fibre is around 16,000 km of cable each way. You can add CPUs and bigger disks; you cannot add light.
So global PostgreSQL is not “more replicas.” It is a topology choice. You decide, ahead of time, which writes pay a round trip across the ocean and which do not, where each row lives, and what the regulator forces. The architecture must START with the access pattern (who reads what, who writes what, from where) and the residency map (which rows must stay where), and only THEN choose a mechanism. Choose the mechanism first and you spend the next two years re-sharding under outage pressure.
Breaking it down
1. The hard physics: light, distance, and what a millisecond buys
Start here, because every later decision is a budget against this number. Light through single-mode optical fibre travels at roughly two thirds of vacuum speed, which gives about 200 km per millisecond. A round trip doubles the distance, so a round-trip millisecond buys you about 100 km of separation. Real networks add switches, routers, congestion, and the path is almost never a straight line, so the rule of thumb is generous: assume a round trip pays at least 1.3 to 1.5 times the great-circle distance divided by 100,000 km per second.
The numbers worth memorising:
| Hop | One-way path | Realistic RTT | What it means for a sync commit |
|---|---|---|---|
| Same rack, intra-AZ | metres to a few hundred metres | 0.1 to 0.5 ms | ”Free” by sync replication standards |
| Cross-AZ, same region | 1 to 100 km | 1 to 3 ms | Comfortably synchronous |
| Same continent, different region | 500 to 5000 km | 10 to 80 ms | Painful, sometimes acceptable for bursty writes |
| US to EU | about 6000 km | 70 to 100 ms | Synchronous is a non-starter for OLTP |
| US to Asia | about 11000 km | 130 to 180 ms | Synchronous is impossible for interactive flows |
| Antipodes (London to Sydney) | about 17000 km | 230 to 280 ms | Even async replication needs careful buffering |
A synchronous commit is gated on the worst standby in the quorum. If synchronous_commit = on and the primary is in us-east-1 with the chosen standby in eu-west-1, every transaction pays roughly 80 to 100 ms before it returns to the client, no matter how fast the disks are. The streaming-and-synchronous-replication page covered this for one extra hop; the lesson at planetary scale is the same lesson, applied harder.
The point of the slider is not the exact number; it is to internalise that the bands are not continuous. There are four physically meaningful regimes, and a topology that is fine in one is a disaster in the next. The job of the architect is to know which regime each request lives in before promising the user a millisecond budget.
2. CAP in practice and the daily PACELC choice
The CAP theorem says that under a network partition, a distributed system must choose between consistency (refuse writes that cannot be safely accepted) and availability (accept writes and reconcile later). The theorem is true, and almost never the question that comes up in production, because clean partitions are rare. The question that comes up daily is the one PACELC adds: even when there is no partition (E), the system trades latency (L) against consistency (C) on every read and every write.
PostgreSQL on its own is a CP system. A single primary serialises every write, and a partition that separates the primary from its quorum stalls writes until either the partition heals or an operator forces a failover. PostgreSQL does not silently accept divergent writes; it stops. That is a feature, not a bug, and it is what synchronous_standby_names enforces when configured.
The PACELC twist is what you face every day. Even without a partition, sync replication across a 100 ms link adds 100 ms to every commit. Async replication removes that 100 ms but exposes you to a few seconds of RPO if the primary dies. Most “global database” decisions are not CAP decisions; they are PACELC decisions made twice, once for the partition case and once for the normal case.
NewSQL systems like Spanner and CockroachDB make the same CAP choice (CP) but use distributed consensus (Paxos or Raft) per data range, so a single region partition only stalls writes for ranges whose quorum sits across the partition, not for the whole database. The latency cost shows up as PACELC: every write pays a Raft round trip, and Spanner pays the additional TrueTime uncertainty window on top. There is no free lunch; the menus differ.
3. Topology one: single region with multi-AZ HA
The first topology is the one most teams should reach for and the one most teams over-engineer past. A primary in one AZ, a synchronous standby in a second AZ within the same region, and an asynchronous standby in a third AZ as cold spare. Patroni or pg_auto_failover runs the leader election; HAProxy or a cloud load balancer routes traffic. This is exactly the high-availability-and-failover page’s setup, and it gives the cheapest path to four nines of availability.
The cross-AZ RTT in this setup is 1 to 3 ms, so synchronous_commit = on is realistic: every commit pays a small synchronous tax in exchange for zero RPO on a single-AZ failure. The router cuts over in tens of seconds to the sync standby on a primary loss, so RTO is also bounded. A well-run three-AZ Postgres cluster delivers about 99.99 percent availability (roughly 52 minutes of downtime per year) at single-digit-millisecond commit latency.
# postgresql.conf on the primary (us-east-1a)
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby_1b, standby_1c)'
The ANY 1 quorum across two same-region standbys is the workhorse setting. Either one acknowledges, the commit proceeds; both are down, commits stall (and you page someone). The async standby outside the synchronous list keeps reads available even if both sync standbys vanish.
This topology wins when your users live in one geography, residency does not force you elsewhere, and your read/write pattern can be served from one region. That is the majority of B2B SaaS, a lot of internal tooling, and any product still finding its market. Reach for multi-region only when this stops being enough.
4. Topology two: primary plus regional read replicas
When reads dominate and they come from many geographies, you keep the single writer but pin a read replica into each user region. Writes still pay cross-region latency to reach the primary; reads are served by the local replica at intra-region latency. The read-scaling-and-caching page covers the read-routing layer, the staleness budget, and the read-your-write traps; this rung is the multi-region wrap around it.
The architecture: a primary in the home region (say us-east-1), asynchronous streaming replicas in eu-west-1, ap-southeast-1, and any other user-heavy region. Each region’s application instances send writes back to us-east-1 and serve reads locally. The async replication adds seconds of RPO on a region loss but keeps reads fast.
flowchart TB
subgraph US[us-east-1 primary]
P["Primary"]
end
subgraph EU[eu-west-1]
RE["Read replica"]
AE["App tier EU"]
end
subgraph AP[ap-southeast-1]
RA["Read replica"]
AA["App tier APAC"]
end
P -->|async WAL stream| RE
P -->|async WAL stream| RA
AE -->|reads| RE
AA -->|reads| RA
AE -->|writes 80 ms| P
AA -->|writes 180 ms| PRead the diagram twice. Forward, it says reads are local and fast. Backward, it says every write from Singapore pays roughly 180 ms before the user sees their action committed. For workloads with read-heavy fan-out (catalog browsing, dashboards, search), that trade is excellent. For interactive write flows (editing a document, sending a chat), the 180 ms makes the product feel broken in APAC.
This is the right topology when:
- The product is read-dominated (above about 80 percent reads is the typical threshold).
- Writes can tolerate cross-region latency, or are batched/asynchronous from the user’s point of view.
- A single jurisdiction owns the data and residency does not split it.
The trap is treating it as a global write topology. It is not. It is a global READ topology, and the more you push writes through it the more the primary’s region becomes a bottleneck for everybody.
5. Topology three: multi-writer with conflict resolution
What if reads AND writes must be local in every region? The obvious answer (run a writer in every region and reconcile) is the most dangerous topology in this page, and the one that bites teams hardest in year two.
PostgreSQL has three real paths to multi-writer. BDR (Bidirectional Replication from EDB, now sold as Postgres Distributed) is the most mature commercial offering, with row-level conflict detection and per-table conflict resolution policies. pglogical was the open-source ancestor, still usable but less polished. As of PostgreSQL 16, native logical replication with origin filtering can be configured to flow writes between two clusters without echo loops, which is the building block other systems use under the hood. All three reduce to the same underlying mechanism: logical decoding extracts row-level changes from one cluster’s WAL, ships them to another cluster, and applies them as if they were local writes.
The fatal subtlety is conflict resolution. Two writers update the same row at the same time. Each commits locally. The change ships to the other side. Which write wins?
- Last-write-wins (the easiest policy) compares commit timestamps and keeps the later one. The earlier write is silently lost. If two users in different regions edit the same field in the same second, one user’s edit vanishes with no error.
- Application-defined resolution lets a function decide per conflict (keep both, merge, escalate). This is correct but means every developer who writes to a multi-writer table must think about every conflict shape, forever. Most teams cannot sustain that discipline.
- Conflict avoidance by design keeps each row writable in only one region (effectively geo-sharding, our next rung). This is what most successful multi-region deployments end up doing, after spending a year trying to make a true multi-writer work.
Multi-writer wins for a narrow set of workloads: append-only events that never collide, audit logs that tolerate duplicates, last-seen counters that can be merged, presence pings. It loses, often catastrophically, for transactional data where every column has a single correct value. The rule of thumb: if you cannot draw the conflict-resolution diagram on a whiteboard in five minutes, you are not ready to run multi-writer. Choose another topology.
The toggle is the lesson in one image. The clever-looking multi-writer setup loses data quietly. The simpler-looking geo-shard pays a write hop but never lies. Most teams that try the first eventually rebuild as the second.
6. Topology four: geo-sharded with regional locality
Geo-sharding takes the sharding-with-citus model and applies it to geographies instead of tenant IDs. Each shard lives in the region of the tenants it serves. A European customer’s rows live in eu-west-1; an American customer’s rows live in us-east-1. The application routes each request to the home region of the tenant it concerns.
The mechanism is the same as single-region Citus, with one extra constraint: the distribution column must imply, or be a function of, the region. The natural choice is the tenant id, with a registry that maps each tenant to a home region.
-- Distribute tenants table by id; each shard's placement is pinned to a region.
SELECT create_distributed_table('tenants', 'tenant_id');
-- Co-located tables share the distribution column and the same shards.
SELECT create_distributed_table('orders', 'tenant_id', colocate_with => 'tenants');
SELECT create_distributed_table('events', 'tenant_id', colocate_with => 'tenants');
In Citus terms, each region is its own coordinator and worker set, with shards placed according to a tenant-to-region map. The application asks the routing layer “where does tenant 12345 live?” and sends the request there. Most operations are local: the tenant’s region is the tenant’s home, and reads and writes stay in-region at single-digit-millisecond latency.
The non-local cases are the interesting ones. A cross-region join (rare, by design) pays the full inter-region RTT per round trip. A tenant migration (a customer’s company moves headquarters) is a planned event that ships the tenant’s rows to the new region with logical replication and switches the routing entry. A global query that touches all tenants (a finance report) runs as a fan-out and aggregates across regions, with latency dominated by the slowest region.
Geo-sharding wins, and wins big, when:
- The access pattern is tenant-local: most operations belong to one tenant, and that tenant has a clear home.
- Residency is required: each tenant’s data is bound to a jurisdiction by law (see rung 7).
- The scale is large enough that any single region cannot host the whole workload.
It loses for workloads where tenants overlap (social graphs, marketplaces with cross-tenant orders) and for any model where the access pattern is “everyone reads everything,” because the fan-out cost eats the locality benefit.
7. Data residency: the regulator picks your topology
For a growing class of workloads, the topology decision is not made by engineers. It is made by regulators, and the rest of the architecture follows.
The GDPR in the European Union (2018, enforced) sets the strictest default: personal data of EU residents may be processed outside the EU only under specific safeguards (standard contractual clauses, adequacy decisions, binding corporate rules), and any transfer to a jurisdiction without strong privacy law is a risk. The Schrems II ruling (2020) struck down the EU-US Privacy Shield and put US transfers under specific scrutiny. The practical effect is that many EU customers contractually require their data to remain in EU regions, full stop.
India’s DPDP Act (2023) and China’s PIPL (2021) go further in different ways. PIPL requires “important data” and personal information of Chinese residents to be stored inside China, with cross-border transfers requiring security assessments. DPDP allows the central government to restrict transfers to specific jurisdictions. The trend across jurisdictions is one-way: more residency requirements, not fewer.
What this means for topology: if you have customers in jurisdictions with residency rules, multi-region is not optional and geo-sharding is almost certainly the answer. Each tenant’s rows live in their home region. A single-region setup excludes you from those markets. A primary-plus-replicas setup with the primary in us-east-1 is non-compliant for EU customers if writes flow through US infrastructure. A multi-writer setup is non-compliant if EU data is replicated to US writers as part of normal operation.
The trap is treating residency as a future problem. Lawyers escalate residency, and the escalation almost always comes with a deadline that is shorter than the time it takes to re-architect a non-geo-sharded system into a geo-sharded one. The right move is to bake the tenant-to-region map into the schema from day one, even if today every tenant lives in one region. The map is cheap to add early and expensive to add late.
8. The NewSQL alternatives to pure PostgreSQL
For workloads that genuinely need globally distributed writes with strong consistency, the right answer is often not stock PostgreSQL at all. A small set of systems were designed from scratch for this problem, each with a different set of compromises.
The trick is to be honest about which workload you actually have. Most teams who reach for Spanner do not need TrueTime; they need a fast intra-region database with good DR. Most teams who reach for CockroachDB end up surprised by the per-row write cost, because their workload was 5 percent of writes touching 1 percent of rows (a perfect fit for a single Postgres primary, a terrible fit for Raft per range). The decision tree is at rung 12 and starts with “do you actually need cross-region writes.”
The managed end of the spectrum is its own decision. Neon runs PostgreSQL with compute and storage separated on Kubernetes, giving instant branching and serverless compute at the cost of a different operational model. Supabase packages stock Postgres with a generated REST and realtime layer. Crunchy offers managed Postgres focused on operational excellence in your own cloud. None of these solve the global-writes problem; they solve different problems (developer experience, operations, branching) using PostgreSQL as a building block.
9. Aurora as a different kind of PostgreSQL
Aurora deserves its own rung because the architecture is genuinely different from stock Postgres, and the differences matter for both the win cases and the trap cases.
The first difference is storage. A stock Postgres writes a tuple to shared_buffers, writes a redo record to WAL, fsyncs the WAL, and at checkpoint time writes the dirty page back to its data file. The page is durably stored twice (once as WAL, once as the page itself) over the life of a checkpoint cycle. Aurora’s writer only ships redo log records to the storage layer; the storage nodes themselves reconstruct pages from the redo stream when asked. There is no double-write-and-WAL, no checkpoint of dirty pages back to disk by the writer. This typically halves write IOPS for the same workload and is the central reason Aurora can deliver higher throughput on the same instance size.
The second difference is replication topology. The storage layer is a single cluster volume, distributed across 6 storage nodes in 3 AZs (2 per AZ), with a 4-of-6 write quorum and 3-of-6 read quorum. A single AZ loss still has 4 storage nodes available; a single node loss has 5. Read replicas (up to 15) all read from the same shared storage, so they do not lag behind the writer the way streaming replicas do. Failover to a read replica is fast (typically under 30 seconds) because the replica already has full storage access; it only needs to start accepting writes and have clients reconnect.
The third difference is the cross-region story. Aurora Global Database allows one primary region and up to 5 secondary regions, with a dedicated cross-region replication path at the storage layer (not Postgres logical replication). The promised numbers are typical replication lag under one second, RPO under one second, and RTO of approximately one minute for managed failover. The trap, and it is widespread, is treating Aurora Global Database as RPO equals zero. It is not. There is a sub-second window during which committed writes may not yet have reached the secondary region; a sudden total loss of the primary region during that window loses those writes. For most workloads sub-second RPO is acceptable; for “no committed writes lost on regional loss” it is not, and you need the synchronous, multi-region NewSQL alternatives from rung 8.
Choosing Aurora over stock Postgres comes down to three questions. Does the workload benefit from the redo-log-to-storage architecture (write-heavy with regular checkpoints)? Yes, often. Do you need the fast failover and the low-lag read replicas Aurora provides natively? Often, yes. Are you willing to give up some configurability (some postgresql.conf parameters are managed, some extensions unavailable) in exchange? Most teams say yes. The classic mistake is choosing Aurora because “it scales globally,” then discovering Aurora Global Database does not actually give you a global writer; it gives you one writer with cross-region read replicas and managed failover.
10. Hybrid patterns that ship in real production
Real production systems rarely pick one topology. The most resilient global deployments are hybrids: each region runs a topology suited to its job, and the regions are connected by a clearly designed boundary.
The first hybrid is Citus plus Patroni plus cross-region logical replication. Inside each region, Citus shards a tenant table across multiple workers, with Patroni managing HA on the coordinator and each worker. The high-availability-and-failover page covers the in-region story. The cross-region story is a single dedicated logical-replication slot per region that ships analytical or backup-relevant data to a sibling region. The OLTP writes are regional; the analytical read-out is global. This pattern is common at SaaS companies in the 1 to 100 million customer range.
The second hybrid is primary plus cross-region async streaming standby for DR. Inside the primary region you run a full Patroni HA cluster with sync replication across AZs. A single async streaming standby sits in a second region, fed by the in-region primary. Normally the cross-region standby just consumes WAL; on regional disaster you promote it (manual operation, well-rehearsed) and accept the RPO window. RPO is typically seconds to a few minutes depending on how far the standby has fallen behind; RTO is the time it takes to switch DNS plus the time to confirm the primary region is truly gone and not flapping.
The third hybrid is a transactional store plus a separate analytical store. The OLTP workload runs on the topology that suits it (single region with HA, or geo-sharded), and a separate columnar store (Clickhouse, BigQuery, Snowflake) runs all the analytical queries, fed by CDC out of PostgreSQL via Debezium or native logical replication. The decoupling lets each system be sized and tuned for its workload and avoids the trap of expecting one PostgreSQL deployment to do both jobs across continents.
flowchart LR
subgraph US[us-east-1: Citus + Patroni HA]
CO["Coordinator"]
W1["Worker 1"]
W2["Worker 2"]
end
subgraph EU[eu-west-1: Citus + Patroni HA]
CE["Coordinator"]
E1["Worker 1"]
E2["Worker 2"]
end
subgraph AN[Global analytics]
CH["Columnar store"]
end
CO -->|logical CDC| CH
CE -->|logical CDC| CHThe point is that “globally distributed PostgreSQL” almost never means one PostgreSQL deployment serving the planet. It means several deployments, each playing the role it does well, with explicit boundaries between them.
11. Backups, restores, and testing a global failover
A global system needs a global backup story. The mistake is assuming the replication is the backup; it is not. Replication ships every change, including the accidental TRUNCATE, the bad migration, the dropped index. A backup is the thing you fall back to when replication faithfully shipped the disaster.
The shape that works at global scale: each PostgreSQL cluster (in each region) runs its own pgBackRest with full backups weekly, differential backups daily, and continuous WAL archiving. The backups land in an S3-equivalent object store in the same region as the cluster, so backup writes do not pay cross-region cost. The object store then replicates the backups to another region’s bucket with cloud-native cross-region replication, giving you off-site copies without making the database pay the wire.
# pgbackrest.conf on the primary
[global]
repo1-type=s3
repo1-s3-bucket=pgbackups-us-east-1
repo1-s3-endpoint=s3.us-east-1.amazonaws.com
repo1-s3-region=us-east-1
repo1-retention-full=4
repo1-retention-diff=14
archive-async=y
process-max=4
[pg-prod]
pg1-path=/var/lib/postgresql/17/main
The S3 lifecycle rule on the bucket sets up cross-region replication to pgbackups-eu-west-1 (or wherever your DR is), with delete tombstones suppressed. The result is that for any region-loss scenario, the most recent full plus diffs plus archived WAL are available in another region, and restoring to the latest LSN takes the time to download and replay them.
The other essential piece is testing the failover end to end, regularly. A failover runbook that has never been exercised is fiction. The discipline is to do scheduled game-day exercises: pick a Tuesday at 10am, fail over from primary to standby, watch what breaks, fix what broke, repeat until nothing breaks. The most common discovery is that some service has hard-coded the primary’s hostname, that an HAProxy health check is too slow to notice the switch, or that a connection pooler caches the old primary’s address for too long. None of these show up in normal operations; all of them turn a real disaster into a worse one.
12. A decision tree you can defend
Here is the decision tree to walk when someone asks you to “make our PostgreSQL global.” It starts with the constraint, not the technology.
The tree’s most important property is that residency goes first. Many teams skip step 1 because nobody mentioned a regulator yet, then discover six months in that a key customer is in the EU and the architecture cannot support it. Run step 1 first. If a regulator might force geo-sharding within 18 months, design for it now.
The second-most important property: steps 2 and 3 are different questions. Many teams hear “global” and assume both. A read-heavy workload (a marketplace, a catalog, a dashboard product) usually answers YES to step 2 and NO to step 3. A write-heavy workload (a ride-share, a payments processor, a real-time collaborative editor) often answers NO to step 2 (writes batch fine) but YES to step 3, and that is where Spanner-class systems earn their keep.
13. Failure modes that bite globally distributed systems
The mechanisms above lead to a small, recurring set of incidents. Here they are by symptom, with the root cause and the fix.
The pattern across all seven is the same shape: a topology choice that worked on paper missed something about the physical world (the speed of light, a regulator, a conflict, a real disaster), and the gap shows up at the worst possible moment. Every fix listed boils down to the same principle: start with the access pattern and the residency map, then choose the mechanism, then test it under real conditions until it stops surprising you.
Mastery Questions
-
Your CTO declares “we are going global next quarter” and asks you to add a synchronous standby in the EU to the existing us-east-1 PostgreSQL cluster, citing zero-RPO durability for European customers. The current cluster has 1.5 ms p99 commit latency with intra-region sync replication. Walk through what would actually happen and what you would propose instead.
Answer. Adding a synchronous standby across the Atlantic moves the p99 commit latency from 1.5 ms to roughly 80 to 100 ms, because every commit would now wait for a fsync acknowledgment from a disk separated by about 6000 km of fibre. The light-speed floor is a hard physical limit; no tuning or faster hardware reduces it. The user-visible effect is that every interactive write becomes noticeably slow, and the throughput of the primary collapses because each backend spends most of its time waiting for the network. This is not a misconfiguration to fix later; it is the physics of synchronous replication across an ocean. The right proposal depends on what “global” actually means. If the CTO’s goal is reads available in the EU, run an async streaming replica there and route EU reads to it (topology two, rung 4); writes stay fast in us-east-1, EU reads stay fast in eu-west-1, and the only cost is seconds of RPO on a regional loss. If the goal is data residency for EU customers, the answer is geo-sharding (rung 6): each tenant’s rows live in their home region, and there is no cross-region durability cost because the writes never cross. If the goal is genuinely zero-RPO across continents for the same data, then stock Postgres cannot do it; that requirement maps to Spanner-class systems (rung 8) and a much larger conversation about latency budgets and rewriting the application. The wrong answer is to do what was asked: add the sync standby, watch the system fall over, and roll it back at 3am. Push the decision tree from rung 12 in front of the conversation before the architecture changes.
-
A SaaS platform serving 50,000 customers across the US, EU, and Asia has been running on a single-region us-east-1 Postgres primary with regional async read replicas. Customer support is reporting two simultaneous trends: EU customers complain that their writes take “almost a second” to feel responsive, and a major European bank customer just asked for written confirmation that their data never leaves the EU. What topology do you migrate to, in what order, and what are the riskiest steps?
Answer. The destination is geo-sharded (topology four, rung 6) with tenant-to-region mapping: each customer’s rows live in a coordinator and worker set in their home region. EU customers’ writes commit at single-digit-millisecond latency in eu-west-1, and the bank’s data physically never leaves the EU, satisfying both the latency complaint and the residency requirement. The migration order matters because both problems are urgent but the second one (residency) is a regulatory liability that the lawyers will not let you defer. Step one is the schema-level work: add a region column to the tenants table, a tenant-to-region routing table, and a routing layer in the application that asks the registry “where does tenant X live?” before every query. This work is the heaviest lift but the most reversible: nothing changes operationally until the routing is exercised. Step two is to stand up eu-west-1 as a real Citus cluster with its own Patroni-managed HA, distinct from the existing us-east-1 cluster, plus a logical replication path from us-east-1 to eu-west-1 for the customers being migrated. Step three is per-tenant migration: for each EU customer, snapshot their rows, ship them to eu-west-1, switch their routing entry, and verify. Do the bank customer last because they are the loudest if something goes wrong and you want practice. Step four is the same for Asian customers in ap-southeast-1, on the rhythm the residency rules require (PIPL forces this for Chinese customers; other Asian markets may not yet). The riskiest steps are the routing layer (a bug there silently sends a write to the wrong region and creates orphan rows) and the per-tenant cutover (a tenant migrated mid-transaction loses writes; you need a brief read-only window per cutover). The slowest part is convincing the team that geo-sharding is the only honest answer; the residency requirement alone forecloses every other topology, and once that is internalised the rest is execution.
-
A startup pitches you on rewriting your Postgres-based product on CockroachDB because “Postgres does not scale globally and we should not be locked into a single region.” Your current workload is a B2B analytics dashboard, 200 customers, 95 percent reads, 5 percent writes, and writes are dominated by ingesting batches of events for one customer at a time. p99 read latency is 40 ms, p99 write latency is 8 ms, and the customer-perceived problem is dashboard load time, not consistency. How do you reason through the proposal, and what do you actually do?
Answer. The proposal mistakes a globally distributed STORE for a globally fast PRODUCT, and the actual problem is in a different layer. Walk through the decision tree (rung 12) explicitly. Step one, residency: is anyone forcing geo-sharding? Probably not yet for a 200-customer B2B product, but worth checking; if EU bank customers are on the horizon the answer changes and CockroachDB does not solve residency either (you still need to pin data per jurisdiction). Step two, do users need sub-100 ms reads globally? Almost certainly yes for a dashboard product, but the customer-perceived problem is dashboard load time of seconds, which is dominated by query latency (40 ms), application rendering, network to the user’s browser, and how many round trips the dashboard makes. None of those is fixed by changing the storage layer. The right answer is regional read replicas: deploy async Postgres read replicas in EU and APAC, route reads to the nearest replica via a routing layer (the read-scaling-and-caching page has the mechanism), and cache aggressively at the dashboard level. This buys real read latency wins for global users at the cost of seconds of read staleness, which a dashboard product can absorb. Step three, do you need sub-100 ms writes globally? No: writes are 5 percent of traffic, batched by customer, and the customer ingesting events does not care if their ingest takes 200 ms; they care if it succeeds. Step four, is conventional HA enough for the home region? Yes. So the actual plan is: keep the single-region Postgres primary with Patroni HA, add async read replicas in EU and APAC, fix the dashboard’s request fan-out, and revisit the topology when the access pattern actually changes. The startup’s pitch is selling a global writer to a workload that does not write much. CockroachDB might be the right database for a different product (a global real-time auction, a worldwide payments ledger), but for this dashboard it would deliver 15 ms p50 writes instead of 8 ms, restrict your schema flexibility, introduce a new operational model, and not change the dashboard’s load time at all. The lesson is to match the database to the access pattern, not to the keynote slide. Global PostgreSQL at hundreds of thousands of queries per second is achievable; choosing the topology before you know the access pattern is the biggest mistake on this page.
Sources & evidence15 claims · 5 cited
The PostgreSQL streaming and synchronous mechanisms, replication slots, and hot standby behaviours referenced in topologies 1 through 4 are grounded in the official PostgreSQL high-availability, warm-standby, and logical-replication documentation. Patroni's role in single-region HA is grounded in the Patroni docs, and the Citus geo-sharding mechanism (create_distributed_table, colocation, shard placement) in the Citus docs. The speed-of-light latency arithmetic, the regional RTT bands, the NewSQL system architectures (CockroachDB Raft per range, YugabyteDB DocDB sharding, Spanner TrueTime and Paxos), Aurora's separated storage and Aurora Global Database's published RPO/RTO numbers, and the GDPR/PIPL/DPDP residency rules are correct facts not covered by the allowed PostgreSQL sources and are marked stable-common-knowledge.
- Light through single-mode optical fibre travels at roughly two thirds of vacuum speed (about 200 km per millisecond), so a US-to-EU round trip cannot drop below about 70 to 100 ms even on perfect hardware because the great-circle Atlantic distance is approximately 6000 km.stable common knowledge
- PostgreSQL behaves as a CP system under the CAP theorem: a network partition that separates the primary from its synchronous quorum stalls writes rather than accepting divergent ones, and PACELC describes the everyday case where even without a partition the operator trades commit latency for replication consistency via synchronous_commit.verified
- A three-AZ PostgreSQL cluster with synchronous_commit = on and synchronous_standby_names set to ANY 1 of two same-region standbys typically delivers about 99.99 percent availability with single-digit-millisecond commit latency, because cross-AZ RTT within a region is 1 to 3 ms.verified
- Async streaming read replicas in remote regions serve local reads at intra-region latency while writes continue to pay the cross-region round trip to the single primary, which makes this topology a global-read topology, not a global-write topology.verified
- PostgreSQL multi-writer setups using BDR, pglogical, or native logical replication with origin filtering (introduced in PostgreSQL 16) all reduce to logical decoding of one cluster's WAL and apply on another, and last-write-wins conflict resolution silently drops the earlier of two near-simultaneous writes to the same row.verified
- Geo-sharded PostgreSQL with Citus distributes tables on a tenant id whose home region is recorded in a routing table, using create_distributed_table and colocate_with so co-located tables share the same shards and tenant-local operations remain in one region at single-digit-millisecond latency.verified
- GDPR (EU, 2018) restricts cross-border transfers of EU residents' personal data, PIPL (China, 2021) requires important data and personal information of Chinese residents to be stored inside China subject to security assessments for cross-border transfer, and India's DPDP Act (2023) lets the central government restrict transfers to specific jurisdictions, and all three trend toward more residency requirements rather than fewer.stable common knowledge
- CockroachDB is wire-compatible with PostgreSQL but stores each key range with a Raft-replicated quorum of 3 or 5 nodes, so every write pays a Raft round trip and intra-region writes typically run 5 to 15 ms while a hot single-row workload pays that round trip per write regardless of overall cluster size.stable common knowledge
- Google Spanner uses TrueTime (an API backed by GPS receivers and atomic clocks in every datacenter that exposes a bounded time uncertainty, typically under 7 ms) plus Paxos per data range to give externally consistent (linearizable) transactions globally, paying a Paxos round trip plus the TrueTime commit-wait on every transaction.stable common knowledge
- Amazon Aurora ships only redo log records from the writer to a shared storage layer that replicates them 6 ways across 3 availability zones (4-of-6 write quorum, 3-of-6 read quorum), eliminating the double-write of WAL plus checkpointed dirty pages that stock PostgreSQL performs and roughly halving write IOPS for the same workload.stable common knowledge
- Aurora Global Database supports one primary region with up to 5 secondary regions, advertises typical replication lag under one second, RPO under one second, and managed-failover RTO of approximately one minute for the database flip itself, meaning it is not RPO = 0 and a sudden total loss of the primary region can drop sub-second committed writes.stable common knowledge
- A common production pattern combines Citus plus Patroni for single-region OLTP HA with a single cross-region logical-replication slot that ships selected tables to a sibling region or to a separate columnar analytical store (Clickhouse, BigQuery, Snowflake) via Debezium or native logical decoding, keeping OLTP writes regional while making analytical reads global.verified
- A defensible global backup design runs per-region pgBackRest with full plus differential backups and continuous WAL archiving into a same-region object store, then uses cloud-native cross-region object-store replication to keep off-site copies in another region, so the database never pays cross-region write cost for backups and the off-site copy still exists after a regional disaster.stable common knowledge
- The defensible decision-tree order is residency first (any regulatory requirement forces geo-sharding), then global reads (regional async read replicas or geo-sharding), then global writes (NewSQL such as Spanner or CockroachDB if writes truly must be sub-100 ms across continents), then default to single-region Patroni-managed HA, because skipping the residency step is the single most common architectural mistake teams make at this scale.verified
- Adding a cross-region synchronous standby to synchronous_standby_names raises p99 commit latency from low-millisecond intra-region numbers to 80 to 150 ms because every commit waits for an fsync acknowledgment across the inter-region speed-of-light delay, and no tuning or hardware change reduces that floor.verified
Cited sources
- PostgreSQL Documentation: High Availability, Load Balancing, and Replication · PostgreSQL Global Development Group
- PostgreSQL Documentation: Log-Shipping Standby Servers and Replication Slots · PostgreSQL Global Development Group
- PostgreSQL Documentation: Logical Replication · PostgreSQL Global Development Group
- Citus Documentation: Distributed Tables, Colocation, and Query Processing · Citus Data / Microsoft
- Patroni Documentation: a Template for HA PostgreSQL · Zalando / Patroni contributors