Sharding with Citus
Citus turns a PostgreSQL cluster into a distributed database via an extension: coordinator plus workers, distributed/reference/local table types, colocation as the design lever, shard count and logical-replication rebalancing, single-shard vs multi-shard vs repartition query plans, primary-key and foreign-key rules, the multi-tenant SaaS sweet spot, and the failure modes (low-cardinality keys, cross-tenant transactions, non-colocated joins, coordinator bottleneck) experienced engineers still hit.
Learning outcomes
Connection pooling and read replicas, the two scaling moves the connection-pooling-at-scale and read-scaling-and-caching pages walk through, both keep one PostgreSQL instance as the single source of truth for writes. That works until the writes themselves outgrow what one machine can sustain. At that point you stop adding memory and start cutting the data into pieces, one piece per node, and accept a new class of trade-off. Citus is the mature way to do that cut inside PostgreSQL itself: not a fork, not a separate engine, an extension that turns a Postgres cluster into a sharded distributed database. This page makes the model and its sharp edges concrete enough to design a schema you will not regret.
After studying this page, you can:
- Decide when vertical scaling has actually run out, and when sharding is the right next move instead of caching or replication.
- Set up a Citus cluster of one coordinator and several workers, and choose distributed, reference, or local for each table in your schema.
- Pick a distribution column and a shard count that let queries route to one worker and let foreign keys and joins stay local.
- Read the catalogs that tell you where a shard lives and how a query was executed (router, real-time, or repartition).
- Avoid the three classic schema mistakes: a low-cardinality distribution column, a tiny lookup that should have been a reference table, and joins that are not colocated.
Before we dive in
You should be comfortable with regular PostgreSQL tables, primary keys, foreign keys, and partitioning. The table-partitioning page covers how one logical table can be split into many physical children on one node, and a Citus shard is exactly that idea pushed across machines: a shard is a partition that happens to live on a different worker. If you have read the partitioning page, you already understand most of what a shard is structurally. The new ideas here are about distribution, network cost, and the planner that hides both.
A few terms used throughout. A shard is a child table holding a slice of one distributed table’s rows; it is a real PostgreSQL table on a worker, with a name like orders_102045. A distribution column is the column the cluster uses to decide which shard a row belongs to (for example, tenant_id). The coordinator is the node that receives client connections, plans distributed queries, and routes them. A worker is a node that stores shards and runs the per-shard SQL. Colocation is the property that two tables sharded the same way send rows with the same distribution-column value to the same worker, so joins between them stay local. Hold these five. Everything below is built from them.
Mental Model
The wrong model, and it is a tempting one, is that Citus is “PostgreSQL but bigger”: you point your app at the coordinator instead of a regular Postgres, write the same SQL, and the cluster magically scales the way an autoscaling object store does. People reach for Citus expecting horizontal scale without schema changes.
That is not what Citus is. Citus is a distributed query planner sitting on top of regular PostgreSQL nodes, and the planner is only as smart as the schema lets it be. The choice of one column, your distribution column, decides for the rest of the cluster’s life whether a typical query touches one worker or fans out across all of them. Pick well and a multi-tenant SaaS query runs at near-single-node latency on a hundred-node cluster. Pick poorly and every query becomes a network gather, and adding workers makes things slower, not faster.
Keep that picture. Citus does not scale your workload, it scales the workload your distribution column lets it scale. The right way to read every rule below, the colocation requirement, the primary-key constraint, the choice of reference versus distributed, is “what does this let the planner avoid sending across the network?” Once that question is the lens, the rest of Citus stops feeling arbitrary and starts feeling inevitable.
Breaking it down
1. The wall: when a single PostgreSQL node stops scaling
Start with the wall, because sharding is heavy work and only earns its place when no cheaper move works. A single modern PostgreSQL node can do a lot. Cloud instances now reach hundreds of vCPUs, terabytes of RAM, and tens of gigabytes per second of NVMe bandwidth. With careful tuning, a single instance can sustain on the order of a hundred thousand transactions per second for short OLTP writes, and well past that for read-mostly workloads. Most applications never need to leave this regime.
Three signs say you have actually hit the wall, not just a tuning problem. First, the CPU on the primary is pinned, and the bottleneck is write commits, not reads (read replicas have already absorbed the reads). Second, the working set no longer fits in RAM no matter how much you add, and the active buffer-cache miss rate is the dominant latency. Third, vacuum and checkpoint pressure on a single huge heap are eating a steady fraction of every IO operation regardless of how you tune autovacuum. When all three are true at once, vertical scaling is over: the next bigger instance type is not bigger enough, and the one after that does not exist.
Sharding is the structural answer. You cut rows of one logical table across many nodes by a chosen key. The win is real: writes that used to compete for the same WAL stream now commit to different nodes in parallel, the working set per node shrinks to a fraction of the whole, and vacuum runs on smaller heaps. The cost is also real, and irreversible in shape. Cross-shard queries and joins are now network operations. Foreign keys and unique constraints that span the cut become impossible or expensive. The choice of distribution column is a schema decision you will be living with in production for years. This is why the partitioning page warned you that partitioning is not sharding: partitioning skips work on one node, sharding distributes work across many, and the latter buys you scale at a price the former does not pay.
flowchart LR
A["Single node, hundreds of cores"] --> B["WAL is the bottleneck"]
B --> C["Vertical scaling exhausted"]
C --> D["Shard by chosen key"]
D --> E["Writes parallel across nodes"]
D --> F["Cross-shard queries become network ops"]That diagram is the whole argument. Sharding moves the bottleneck from a single node’s commit pipeline to the network and the planner, and the new bottleneck behaves very differently from the old one.
2. Citus as an extension: coordinator and workers
Citus is a PostgreSQL extension, not a fork. You install it with CREATE EXTENSION citus on a stock PostgreSQL server, and the extension installs new catalogs, planner hooks, and functions. Every node in the cluster is a normal PostgreSQL server with the extension loaded, running the same major version as its peers. There is no separate query engine, no parallel storage layer underneath, no proprietary protocol. This matters in three concrete ways: you keep MVCC, you keep your favorite extensions where they make sense on each node, and your migration to Citus is mostly a schema decision rather than a database swap.
The cluster has two roles. The coordinator is the node clients connect to. It owns the distributed catalog (the pg_dist_* tables described in section 9), plans distributed queries, and routes them to the right workers. The workers are nodes that hold shards: each shard is a regular PostgreSQL table living on one worker, with a name that encodes its parent and shard id, like orders_102045. Workers run the per-shard SQL the coordinator routes to them. They do not talk to clients directly in normal operation.
flowchart TB
C["Coordinator: pg_dist_* catalogs, planner, router"]
W1["Worker 1: shards 0,3,6,..."]
W2["Worker 2: shards 1,4,7,..."]
W3["Worker 3: shards 2,5,8,..."]
Client["Client connection"]
Client --> C
C --> W1
C --> W2
C --> W3A few practical details that surprise people on first contact. The coordinator is itself one PostgreSQL node, so it can become a bottleneck if every query is forced to coordinate. Most Citus failure modes are about queries that overload the coordinator rather than the workers. Adding workers does not make the coordinator faster; it makes more parallel work available to a query the coordinator chooses to fan out. Section 6 returns to this when we walk through query routing.
You add a worker by calling citus_add_node('worker-3.internal', 5432) on the coordinator. New shards created after that point can be placed on it, and the rebalancer (section 5) can move existing shards onto it.
3. The three table types and the DDL that creates them
Citus gives you three table types, and choosing the right type for each table in your schema is the next thing after picking the distribution column. They differ in one question: where does each row physically live?
A distributed table is sharded by a column across all workers. Its rows are split into citus.shard_count shards (a cluster-wide default, typically 32, that you can override per table), and each shard is placed on one worker. This is the table type for anything large or write-heavy: orders, events, page views, telemetry.
-- A distributed table sharded by tenant_id.
create table orders (
id bigint generated always as identity,
tenant_id uuid not null,
total_cents bigint not null,
created_at timestamptz not null default now(),
primary key (id, tenant_id)
);
select create_distributed_table('orders', 'tenant_id');
A reference table is replicated in full to every worker. Every worker holds an identical copy, and writes are coordinated by Citus across all copies via two-phase commit. Use this for small, slow-changing lookup data that joins against distributed tables: country codes, currency tables, plan tiers, feature flags. Replication is the point: a join against a reference table runs locally on each worker without going over the network for the lookup side.
-- A reference table replicated to every worker.
create table country (
code char(2) primary key,
name text not null,
region text not null
);
select create_reference_table('country');
A local table lives only on the coordinator. It is not sharded and not replicated. Use this for admin tables and small dashboards that the coordinator alone reads or writes. Local tables cannot be efficiently joined with distributed tables in a single query (the data is on the wrong nodes), so reach for this type when you genuinely want a table that the workers never see.
The cost model differs sharply across the three. A distributed table scales writes and storage with the number of workers, but pays network and coordination cost on multi-shard queries. A reference table scales reads cheaply (every worker has a local copy) but pays a real cost on every write, since each write is a distributed two-phase commit across every node. A local table pays nothing extra but cannot scale beyond the coordinator. The rule of thumb: distribute the big and write-heavy, reference the small and read-mostly, leave the truly local on the coordinator.
The instinct to reach for is “make all the small lookups reference, distribute the big fact tables on tenant_id, leave the dashboards local.” That single rule, applied across a multi-tenant SaaS schema, gets you most of the way to a well-designed Citus cluster.
4. Colocation: the one idea that makes or breaks a Citus schema
If you only remember one section of this page, make it this one. Colocation is the property that two distributed tables sharded on the same column with the same shard count send rows with the same distribution-column value to the same worker. Two colocated tables that both shard on tenant_id will both put tenant_id = 'A' on the same shard on the same worker. A join between them on that column runs locally on each worker, in parallel, with no network traffic for the join itself.
Without colocation, joining two distributed tables across the network is dramatically more expensive. The coordinator either repartitions one side (reads it from every worker, hashes it, sends each row to the worker that has the matching shard of the other side, then runs the join) or pulls everything to itself and joins locally, which defeats the point of distributing. Both paths involve moving data across the cluster and are an order of magnitude slower than a colocated join. This is the cost that catches every team that thinks “I will just shard each table on whatever column makes most sense for it.” That instinct is right on one node, where each table tunes for its own access pattern. It is wrong on a sharded cluster, where two tables that need to join must share a distribution column or pay the network bill on every join.
The design rule is therefore stark: pick ONE distribution column for the whole schema and colocate everything around it. In a multi-tenant SaaS, that column is almost always tenant_id. Every distributed table that holds tenant data shards on tenant_id. Every join between two tenant tables filters on tenant_id, so it stays inside one tenant’s worker. Every foreign key from one tenant table to another includes tenant_id in the referenced key (Citus only allows distributed-to-distributed foreign keys between colocated tables, see section 7). The schema looks more verbose than its single-node ancestor, with tenant_id repeated everywhere, and the verbosity is the price you pay for keeping joins local.
-- Three distributed tables, all sharded on tenant_id with the same shard count,
-- so they are automatically colocated.
create table orders (
id bigint generated always as identity,
tenant_id uuid not null,
total_cents bigint not null,
primary key (id, tenant_id)
);
select create_distributed_table('orders', 'tenant_id');
create table order_items (
id bigint generated always as identity,
tenant_id uuid not null,
order_id bigint not null,
sku text not null,
qty int not null,
primary key (id, tenant_id),
foreign key (order_id, tenant_id)
references orders (id, tenant_id)
);
select create_distributed_table(
'order_items', 'tenant_id', colocate_with => 'orders'
);
create table payments (
id bigint generated always as identity,
tenant_id uuid not null,
order_id bigint not null,
paid_cents bigint not null,
primary key (id, tenant_id)
);
select create_distributed_table(
'payments', 'tenant_id', colocate_with => 'orders'
);
Now the join below runs locally per shard: each worker joins the slice of orders it holds with the slice of order_items it holds, in parallel, and returns one partial result per worker. The coordinator concatenates them. No row crosses the network between workers.
-- Colocated join, runs in parallel on each worker.
select o.id, sum(i.qty)
from orders o
join order_items i on i.order_id = o.id and i.tenant_id = o.tenant_id
where o.tenant_id = $1
group by o.id;
Reference tables are the second half of the colocation story. A join against a reference table is automatically colocated, because every worker has a local copy of the reference table. So a distributed table joined to a reference table on any column runs locally per worker without repartitioning. That is the reason small lookups are nearly free in Citus and large fact tables are not.
5. Shard count, placement, and rebalancing
The shard count is the number of pieces each distributed table is cut into, set per colocation group at distribution time (citus.shard_count is the cluster-wide default, commonly 32). You cannot change a table’s shard count after the fact without redistributing the data, so the choice matters.
More shards make the cluster easier to rebalance: shards are the unit the rebalancer moves, so finer pieces mean smoother spreading across workers and less data movement per move. Fewer shards lower planner overhead, since the coordinator considers each shard during planning. The community rule of thumb is 32 to 128 shards per colocation group for most clusters. Sixteen workers each holding two to eight shards lets you grow the cluster without immediately running out of room to rebalance. Going far above 128 makes planning noticeably slower without a real win; going far below 32 means a single shard is too big to move cheaply and you cannot rebalance hot workers smoothly.
Shards are placed on workers when they are first created. Citus assigns them across the available workers as evenly as it can. When you add a worker, no existing shards move automatically; you have to ask. The function you call is citus_rebalance_start(), and it does the work using PostgreSQL’s own logical replication to copy each shard to its new home with minimal blocking.
-- After citus_add_node('worker-3.internal', 5432), the new worker holds no shards.
-- Start a rebalance to spread existing shards across all workers.
select citus_rebalance_start();
-- Watch progress; the function returns immediately and the rebalance runs in
-- background workers, copying shards via logical replication.
select * from citus_rebalance_status();
The use of logical replication is the part to internalize. While a shard is being copied, the source remains readable and writable. Logical replication streams the WAL changes for that shard to the new copy until the two are caught up, then Citus performs a brief catalog flip to point reads and writes at the new placement. The window where the shard is unavailable is measured in seconds, not minutes, and there is no global lock on the cluster. Compare this to an in-place “shut down, copy, restart” rebalance, which is what naive sharding implementations have to do, and you can see why logical replication makes the rebalancer practical to run on a live OLTP cluster.
sequenceDiagram
participant App as Application
participant W1 as Worker 1 (source)
participant W3 as Worker 3 (target, new)
participant Coord as Coordinator
Coord->>W3: citus_rebalance_start, create empty shard copy
W1->>W3: logical replication of WAL changes
App->>W1: writes continue, replicated to W3
Note over W1,W3: catch-up phase, both copies live
Coord->>Coord: catalog flip: new placement is canonical
App->>W3: subsequent reads and writes route to W3Section 9 returns to the catalogs you query to see where each shard currently lives.
6. Query routing: single-shard, multi-shard, and repartition
Every query against a distributed table is routed by the coordinator into one of three plans, and which one you get is the single biggest determinant of how that query performs.
A single-shard router query has a qualifier that fixes the distribution column to one value (WHERE tenant_id = 'A'). The coordinator computes which shard holds that tenant, sends the query to the one worker that owns that shard, and returns the worker’s answer. There is no fan-out, no merge step, and no per-shard planning overhead beyond locating the shard. This path runs at near-single-node latency. It is the path you want for OLTP queries on a multi-tenant SaaS: log in, list this tenant’s orders, write a new payment, refund an order. Every one of those queries names the tenant in the WHERE clause and routes to one shard.
A multi-shard real-time query has no constraint on the distribution column, or asks for an aggregate across all shards. The coordinator fans the query out to every worker in parallel, each worker computes its slice, and the coordinator merges the results. This is the path you want for analytics: count active users across the platform, sum revenue across all tenants for the last hour, find the top SKUs across the catalog. Each worker scans only its own shards, and the work runs in parallel proportional to the number of workers, so the aggregate finishes much faster than the same query against a single-node table. The cost you pay is the fan-out: every worker has to do real work, and the coordinator merges every worker’s answer.
A repartition query is what the planner falls back to when a join between two distributed tables is not colocated. The coordinator either repartitions one or both sides (reading the data from every worker, hashing on the join key, and sending each row to the worker that holds the matching piece of the other side) or pulls one side to the coordinator and joins there. Both options send a large fraction of the data across the cluster network, and both are orders of magnitude slower than the colocated equivalent. The honest engineering rule is avoid repartition queries in OLTP and tolerate them only as occasional offline analytics. If you find one in your hot path, either fix the schema so the join is colocated, or rewrite the query so the heavy side filters by the distribution column first.
You can see exactly which path a query took with the Citus version of EXPLAIN, which annotates the plan with the route it chose.
-- A single-shard router query. Expect a "Custom Scan (Citus Adaptive)" with
-- one Task on one worker.
explain (analyze, verbose)
select sum(total_cents) from orders where tenant_id = $1;
Custom Scan (Citus Adaptive)
Task Count: 1
Tasks Shown: All
-> Task
Node: host=worker-2.internal port=5432 dbname=app
-> Aggregate
-> Index Scan using orders_pkey_102046 on orders_102046
Index Cond: (tenant_id = $1)
-- A multi-shard real-time aggregate. Expect Task Count = shard_count, with
-- one Task per shard fanned out to the workers in parallel.
explain (analyze, verbose)
select count(*) from orders;
Aggregate
-> Custom Scan (Citus Adaptive)
Task Count: 32
Tasks Shown: One of 32
-> Task
Node: host=worker-1.internal port=5432
-> Aggregate
-> Seq Scan on orders_102045
The number of Tasks and which nodes they run on is the truth: Task Count: 1 is the router path; Task Count: 32 (or whatever your shard count is) is the real-time fan-out; an Append on the coordinator with a per-row redistribution is the repartition path.
7. Constraints, foreign keys, and distributed DDL
Constraints on a distributed table follow one rule, the same shape as the rule the partitioning page made you live with: the distribution column must be part of every primary key and every unique constraint. PostgreSQL enforces uniqueness with an index, and on a distributed table each shard has its own local index. A primary key on (id) alone would need a global cluster-wide unique index, which does not exist, so the constraint would have to check every shard on every insert, which is not viable. Including the distribution column in the key guarantees that any uniqueness check happens inside one shard, by the local index on the worker that holds it, and that the per-shard pieces add up to a cluster-wide correct constraint.
This is the reason every example above writes primary key (id, tenant_id) rather than primary key (id). Practically, you start carrying the tenant id (or whatever your distribution column is) everywhere: in primary keys, in foreign keys, in queries. The schema looks more verbose, and the verbosity is the property the planner uses to route everything single-shard.
Foreign keys come in two flavors. A foreign key between two colocated distributed tables works as long as the referenced key includes the distribution column (so the lookup happens inside one shard). A foreign key from a distributed table to a reference table works because every worker has a local copy of the reference table, so the lookup is local. A foreign key from a distributed table to a non-colocated distributed table is not supported; the lookup would have to cross workers on every insert. The rule shapes the schema: use a single distribution column for everything tenant-shaped that needs to reference itself, and use reference tables for shared lookups.
-- Allowed: distributed-to-distributed, colocated, includes distribution column.
alter table order_items
add foreign key (order_id, tenant_id)
references orders (id, tenant_id);
-- Allowed: distributed-to-reference, lookup is local on every worker.
alter table orders
add foreign key (country_code)
references country (code);
-- Rejected: would require a cross-shard lookup on every insert.
alter table order_items
add foreign key (sku) references catalog (sku); -- error if catalog is distributed and not colocated
DDL on distributed tables goes through the coordinator and runs on every shard. ALTER TABLE orders ADD COLUMN refunded boolean DEFAULT false is one statement to you, and the coordinator turns it into one ALTER per shard on every worker. The change is transactional and atomic across the cluster, but it does take locks on every shard, so big-table DDL needs the same care as it does on a single-node Postgres: prefer ADD COLUMN ... NULL DEFAULT NULL (a metadata-only change in modern PostgreSQL) over a NOT NULL DEFAULT that rewrites every row, and schedule heavy DDL during low-traffic windows. The coordinator also applies the same DDL to new shards when they are added later.
8. Workloads that win, workloads that lose
Citus is a sharp tool, and the difference between a workload it scales effortlessly and one it actively hurts comes down to a single question: does the workload have a natural distribution column that nearly every query already includes?
Multi-tenant SaaS is the canonical Citus win, and it is the use case the rest of the design points at. Almost every query already names the tenant: log in for tenant X, list orders for tenant X, refund payment for tenant X. Shard everything tenant-shaped on tenant_id, make small lookups reference tables, and every OLTP query routes single-shard at near-single-node latency. The cluster scales horizontally as you add tenants, because tenants are independent and their data spreads naturally across workers.
Real-time analytics is the other strong win, especially over time-series data. A dashboard that aggregates events for the last hour fans out to every worker and uses the parallelism to finish in milliseconds where a single-node query would have taken seconds. Combine Citus distribution with pg_partman-style time partitioning inside each shard, and you get parallel per-worker scans over time-bounded slices, which is the recipe behind many production observability systems.
Large time-series with retention combines the two. You distribute the fact table on something like device_id, partition each shard by day, and drop old partitions per shard for retention. Inserts route by device, analytics queries fan out across all devices and prune to the recent partitions on every worker, and old data drops out as cheap metadata operations.
The losses are equally honest. Workloads with no good shard key are the most common one: a small app where every query is “show me everything across the whole database,” with no tenant or device or natural axis to cut along. Sharding such a workload makes every query a fan-out and adds coordination overhead with no per-query parallelism win. Tight cross-tenant transactions are the second loss: if your business logic regularly updates rows in two tenants atomically, you are paying distributed-transaction cost on every update and likely losing more than you gain. Very small datasets are the third: a 50 GB database that fits comfortably on a single instance does not need a distributed planner, and the overhead of coordinator routing and per-shard planning will make it slower, not faster, than a tuned single node.
The lens to read all of this through: Citus is not faster than PostgreSQL on workloads PostgreSQL can do alone. It is the only honest answer when PostgreSQL alone cannot do them at all.
9. Operating Citus: catalogs, version, and managed offerings
Two catalogs tell you everything you need to know about where data lives in a Citus cluster. pg_dist_shard lists every shard of every distributed table: which parent it belongs to, the hash range it covers, and the shard id (the number suffix that ends up in shard table names like orders_102045). pg_dist_placement lists where each shard physically lives: shard id, the worker node id, and the placement state. Joined together, these two catalogs answer the operator’s first question, “which worker holds this tenant?” without guesswork.
-- Find which worker holds a given tenant's orders shard.
select s.logicalrelid::regclass as parent,
s.shardid,
p.nodename,
p.nodeport
from pg_dist_shard s
join pg_dist_placement p using (shardid)
where s.logicalrelid = 'orders'::regclass
and get_shard_id_for_distribution_column('orders', $1) = s.shardid;
The other catalog worth knowing is pg_dist_node, which lists every node in the cluster, its role (coordinator or worker), and whether it is currently active. Together, these three give you the full physical map. Beyond them, citus_shards is a convenience view that joins the lot for human-readable inspection.
Two facts about Citus the version matter in practice. Citus 12.0 (released in 2023) was a major refactor that improved the shard-aware planner: queries that mix local and distributed tables plan better, and several long-standing limitations around schemas and the public schema were lifted. Subsequent releases continue to track new PostgreSQL versions (Citus on PG 16 and on PG 17 follow PostgreSQL major releases by a few months). When you adopt or upgrade Citus, the matrix you check is “which Citus versions support which PostgreSQL major,” and the upgrade path is usually a coordinated step across all nodes.
The managed offering you most often see is Azure Cosmos DB for PostgreSQL, which is Citus packaged and operated by Microsoft as a managed service on Azure. It runs the same Citus extension you would install yourself; the difference is in the operations layer (node provisioning, backups, monitoring) rather than the database semantics. Citing it here matters because many production Citus deployments run there rather than on self-managed clusters, and the documentation you might encounter on Microsoft’s side is documenting the same engine described in this page.
10. Failure modes experienced engineers still hit
Most of the production trouble with Citus traces back to one of a small set of mistakes, and each is a direct violation of a rule the earlier sections explained.
Choosing a low-cardinality distribution column. Hash distribution maps each value of the column to one shard. If your distribution column only takes a few hundred distinct values, only a few hundred hash buckets are ever exercised, and on a 32-shard layout you can easily get one giant hot shard with most of the data and 31 nearly empty ones. The fix is at design time: distribute on a high-cardinality column (tenant_id, user_id, device_id), never on a small enum or a status column. If you already shipped it, you are in for a redistribution.
Cross-tenant transactions costing ten times what you expect. Updates that touch rows belonging to two different tenants in one transaction are distributed transactions: Citus has to coordinate a two-phase commit across the two workers that hold those rows. The latency is on the order of ten milliseconds even when each side is microseconds, and the throughput ceiling is far below what a single-shard write achieves. The fix is to redesign the workflow to do tenant-local writes and reconcile across tenants asynchronously, or to push the cross-tenant data into a reference table where the cross-shard cost lives only on writes to the lookup.
Non-colocated joins thrashing the network. Two distributed tables that shard on different columns force the planner to repartition on every join. On a busy cluster, this becomes the dominant network traffic, and every join slows down as the cluster grows. The fix is to colocate: pick one distribution column for the schema, shard every related table on it, and include it in the join condition.
Tiny lookup tables that should be reference becoming distributed. A country table or a currency table accidentally created with create_distributed_table becomes 32 little shards on 32 workers, none with the full set of rows, and every join against it costs a repartition. The fix is mechanical: use create_reference_table for small lookups, and treat that as the default unless the table is large enough to need real distribution.
Shard count chosen too small to ever rebalance. A cluster created with citus.shard_count = 4 cannot spread shards across more than four workers without splitting shards, and you cannot change a colocation group’s shard count without redistributing. If you start small, you stay small. The fix is preventative: pick 32 to 128 shards per colocation group at table-creation time even if you only have a few workers today, so you have room to add workers and rebalance later.
Coordinator becoming the bottleneck. The coordinator is one PostgreSQL node. It serves every client connection, plans every distributed query, and merges results from multi-shard queries. Workloads that hammer it with a high rate of cheap multi-shard queries can saturate it even when every individual worker is idle. The fix is twofold: route OLTP queries to be single-shard so they put little planning load on the coordinator, and consider using a connection pooler in front of the coordinator the same way you would in front of any single Postgres (the connection-pooling-at-scale page explains the pattern). For very high throughput, Citus also supports multi-coordinator topologies through query routers, but on most clusters the answer is to keep coordinator load low by writing single-shard queries.
The unifying thread: every failure above is a symptom of either picking the wrong distribution column or violating colocation. Get those two things right at design time and Citus is uneventful. Get them wrong and no amount of hardware will fix the schema.
Mastery Questions
-
You operate a multi-tenant SaaS PostgreSQL database on a 96-vCPU instance with a 700 GB working set, and writes have just become CPU-bound during peak hours despite a properly sized buffer cache and a tuned autovacuum. Read replicas already handle the read workload. The team is debating Citus vs renting a 192-vCPU instance. Which would you pick, and what would you do first if you chose Citus?
Answer. Either can buy you some time, but they differ in shape, and the difference matters once you sketch what happens twelve months out. The 192-vCPU instance is one knob and a no-schema-change move; you get roughly two times the write headroom and you defer the structural decision. The catch is that at the current growth rate that headroom only lasts as long as the previous doubling did, you still have one WAL stream, and the next bigger instance type either does not exist or is dramatically more expensive per core. Citus, by contrast, gives you horizontal write parallelism that scales with the number of workers, and on a multi-tenant SaaS workload the natural distribution column (
tenant_id) is already present in essentially every query, so most OLTP traffic routes single-shard at near-single-node latency. The honest answer is that if you expect the workload to keep growing, vertical scaling has bought you the last six to twelve months you have time to make available, and you should start the Citus migration now while the system is still tractable. If you do go that way, the first thing you do is not to install Citus, it is to audit the schema and prove that every distributed table can be sharded ontenant_id. Every join between two tenant tables must filter ontenant_id, every primary key must include it, every foreign key between distributed tables must point colocation-compatible columns. Once that audit passes, you decide which small tables become reference tables and which stay local on the coordinator, pick 32 shards per colocation group as the starting point, and migrate tenants in waves, validating colocated joins and single-shard routing as you go. -
A teammate creates
country (code char(2) primary key, name text)and callscreate_distributed_table('country', 'code')because the small table “should be sharded for consistency with the rest.” Three weeks later, every analytics query that joinsorderstocountryhas gotten dramatically slower. What happened, and how do you fix it?Answer. Two things went wrong, and they compound. First, the country table now lives in 32 little shards spread across the workers, and no individual worker has the full set of country rows. Second,
ordersis distributed ontenant_id, whilecountryis distributed oncode, so the two tables are not colocated. Every join between them has to repartition: the planner reads each shard oforders, hashes itscountry_codecolumn on the workers, sends each row to the worker that holds the corresponding shard ofcountry, runs the join, and ships the result back to the coordinator. Repartition turns what should be a local hash lookup into a cluster-wide data move on every analytics query, and on a busy cluster it dominates the network. The fix is to makecountrya reference table:select undistribute_table('country')to undo the distribution, thenselect create_reference_table('country'). Now every worker holds a complete local copy ofcountry, and the same join becomes a local lookup on every worker in parallel, with no network at all for the lookup side. The deeper lesson is that “distribute everything for consistency” is exactly the wrong instinct on Citus. The right rule is “distribute the large and write-heavy on a shared distribution column, reference the small lookups, and leave the truly local on the coordinator.” Reference tables are not a corner case; they are the standard pattern for the parts of the schema that join against distributed tables but do not themselves need sharding. -
Your eight-worker Citus cluster is suddenly slow on every kind of query, but worker CPU is near idle and worker network is quiet. The coordinator’s CPU is pinned and
pg_stat_activityon it shows hundreds of short queries. Workers report normal latency on the queries that do reach them. What is the most likely cause at the cluster level, and how would you fix it?Answer. The signature points squarely at the coordinator as the bottleneck. The workers are idle because the coordinator cannot dispatch fast enough; it is spending its time on connection handling and per-query planning rather than on actual data work. Two causes are common and they often coexist. First, the application opens too many connections directly to the coordinator (a connection storm), so the coordinator spends its CPU on backend startup and context switching instead of query work, exactly the failure mode the connection-pooling-at-scale page warns about. The fix is a PgBouncer or equivalent pooler in front of the coordinator, in transaction-pooling mode, capping backend connections to a few hundred. Second, the application is sending high-rate multi-shard queries that should have been single-shard, so every query forces the coordinator to fan out to all eight workers, plan per-shard, and merge results, even though only one shard has the data the query actually wants. The fix is to audit the hot queries with
EXPLAINand confirm the Task Count is 1 for OLTP queries; any query withtenant_idin scope should be single-shard, and any that isn’t is missing the predicate. Apply the connection pooler and the query audit together, and the coordinator stops being the limit on a workload that the workers were always large enough to serve.
Sources & evidence14 claims · 2 cited
Citus-specific mechanism, DDL, table-type, colocation, query-routing, and pg_dist_* catalog claims are grounded in the Citus documentation; the use of PostgreSQL logical replication by the shard rebalancer is grounded in both the Citus docs and the PostgreSQL logical replication chapter; the table-partitioning analogy points at PostgreSQL's partitioning docs. Correct operational facts not stated quantitatively in those sources (vertical-scaling ceilings, the 32-to-128 shard-count rule of thumb, Citus 12.0 being a 2023 shard-aware planner refactor, Azure Cosmos DB for PostgreSQL being managed Citus, the coordinator-bottleneck pattern, and the schema-design rubric) are marked stable-common-knowledge with empty source_ids.
- A single modern PostgreSQL instance on a high-end cloud node tops out around hundreds of vCPUs, terabytes of RAM, and roughly a hundred thousand transactions per second for short OLTP writes; once the CPU on the primary is pinned with writes (not reads) as the bottleneck and the working set no longer fits in RAM, vertical scaling has effectively run out and horizontal sharding is the next structural move.stable common knowledge
- Citus is a PostgreSQL extension (not a fork) installed with CREATE EXTENSION citus on a stock Postgres server: every node in the cluster runs the same PostgreSQL major version with the extension loaded, the coordinator owns the distributed catalogs and plans queries, and worker nodes hold shards where each shard is a regular PostgreSQL table named like orders_102045.verified
- Citus offers exactly three table types: DISTRIBUTED tables are sharded across workers by a chosen column (create_distributed_table('orders', 'tenant_id')), REFERENCE tables are replicated in full to every worker for small lookups (create_reference_table('country'), writes coordinated via two-phase commit), and LOCAL tables live only on the coordinator and are not distributed.verified
- Two distributed tables sharded on the same column with the same shard count are colocated, so rows with the same distribution-column value live on the same worker and joins and foreign keys on that column run locally per shard in parallel; non-colocated joins force the planner to repartition data across the cluster network or pull a side to the coordinator, both an order of magnitude slower than the colocated path.verified
- A typical Citus colocation group uses 32 to 128 shards: more shards make rebalancing finer-grained (shards are the unit the rebalancer moves) but raise per-query planner overhead, while fewer than ~32 makes individual shards too large to move smoothly when workers are added.stable common knowledge
- citus_rebalance_start() spreads shards across newly added workers by copying each shard via PostgreSQL logical replication: the source shard remains readable and writable during the copy, logical replication streams the WAL changes to the new placement until both are caught up, and Citus then performs a brief catalog flip so subsequent reads and writes route to the new placement.verified
- Citus routes queries on distributed tables into one of three plans: a single-shard router query (a WHERE that fixes the distribution column to one value, executed on one worker with Task Count: 1 at near-single-node latency), a multi-shard real-time query (fanned out to every worker in parallel and merged by the coordinator, Task Count equal to the shard count), or a repartition query (a non-colocated join that moves data across the network and is dramatically slower than the colocated form).verified
- On a Citus distributed table the distribution column must be part of every primary key and every unique constraint, because uniqueness is enforced by per-shard local indexes (there is no global cross-shard unique index); foreign keys between two distributed tables are only supported when the tables are colocated and the referenced key includes the distribution column, while foreign keys from a distributed table to a reference table always work because every worker holds a local copy.verified
- DDL issued on the coordinator (such as ALTER TABLE ADD COLUMN) is propagated by Citus to every shard on every worker as part of one distributed operation, takes locks on each shard, and applies to new shards created afterwards.verified
- The canonical Citus win is multi-tenant SaaS where almost every query already filters by tenant_id, which lets the cluster route OLTP queries single-shard at near-single-node latency; the canonical losses are workloads with no natural distribution column (every query becomes a fan-out), tight cross-tenant transactions (each one is a distributed two-phase commit), and datasets small enough to fit on one tuned single node where coordinator and per-shard planning overhead outweighs any parallelism win.stable common knowledge
- The cluster's physical layout is visible through Citus catalogs: pg_dist_shard lists each shard's parent and hash range, pg_dist_placement lists where each shard physically lives (worker node and port), and pg_dist_node lists every node and its role; the helper function get_shard_id_for_distribution_column lets you compute which shard holds a given distribution-column value.verified
- Citus 12.0 (released in 2023) was a major refactor of the distributed planner with improved shard-aware planning and a relaxation of several long-standing schema limitations; subsequent releases track new PostgreSQL major versions (Citus on PG 16 and on PG 17 follow PostgreSQL major releases), and Azure Cosmos DB for PostgreSQL is the same Citus extension packaged and operated as a managed service on Azure.stable common knowledge
- Picking a low-cardinality distribution column (a small enum or status) maps the whole workload onto a few hash buckets and produces one giant hot shard with the others nearly empty; high-cardinality keys like tenant_id, user_id, or device_id are the safe defaults, and the choice cannot be changed without redistributing the data.stable common knowledge
- The coordinator is itself a single PostgreSQL node that serves every client connection, plans every distributed query, and merges results from multi-shard queries, so workloads dominated by high-rate multi-shard queries (or by raw connection storms) can saturate the coordinator while workers stay idle; the fixes are to make hot OLTP queries single-shard (Task Count: 1) and to place a connection pooler in front of the coordinator just as you would in front of any single Postgres.stable common knowledge
Cited sources
- Citus Documentation: Distributed Tables, Colocation, and Query Processing · Citus Data / Microsoft
- PostgreSQL Documentation: Logical Replication · PostgreSQL Global Development Group