Table Partitioning

PostgreSQL declarative table partitioning: RANGE/LIST/HASH methods, the partition-key-in-primary-key rule, partition pruning, partition-wise join and aggregate, the CONCURRENTLY index workflow, DETACH/ATTACH for rotation and bulk load, pg_partman automation, and the planning-time cost that caps partition counts at hundreds.

advanced 30 min read in Indexing And Schema

postgresqlpartitioningindexingschemaperformanceretention

Learning outcomes

A single PostgreSQL table can hold billions of rows and several terabytes before it becomes painful, but past that line every operation you care about, scans, index maintenance, vacuum, and bulk deletes, starts to drag in a way no amount of memory or faster disk fixes. Partitioning is the structural answer. You cut the table along a chosen key, the planner skips the pieces a query cannot touch, maintenance works on one piece at a time, and dropping a year of cold data becomes a metadata operation rather than a multi-hour DELETE. This page makes that machinery concrete enough to ship.

After studying this page, you can:

Choose between RANGE, LIST, and HASH partitioning for a given workload, and write the DDL that creates the parent and its children.
Place the partition key in your primary and unique keys correctly, and explain why PostgreSQL forces this.
Read an EXPLAIN plan that proves partition pruning happened, and identify queries that defeat it.
Roll out indexes and ATTACH new partitions without taking the table offline, including the CONCURRENTLY workflow.
Pick a sensible partition count for a workload, and recognize the planning-time and partition-explosion failure modes before they hit production.
Combine partitioning with row-level security to give each tenant a private slice that drops in one DDL.

Before we dive in

You should be comfortable with the basics of a CREATE TABLE, primary keys, indexes, and EXPLAIN. Two earlier ideas matter more than the rest. The data-types page covers how PostgreSQL stores timestamps, integers, and text on a page, which is what determines whether your partition key is cheap to compare a billion times a day. The b-tree internals page covers how the planner uses a sorted index to find a range of rows, which is exactly the operation partition pruning short-circuits at the table level. Refer back to those if a section here moves faster than you want.

A few terms used throughout. A partitioned table is the parent: a logical table with no storage of its own, declared with PARTITION BY. A partition (sometimes called a child or leaf) is an ordinary table that holds a slice of the parent’s rows. The partition key is the column or expression PostgreSQL uses to route a row to the right partition. Pruning is the planner or executor proving that a partition cannot contain rows the query asks for, so it skips reading it entirely. Hold these four. Everything below is built from them.

Mental Model

The wrong model, and it is a very common one, is that partitioning is just sharding by another name: a way to split data so you can spread it across many disks or machines. People reach for partitioning expecting the storage layer to do work it does not do.

PostgreSQL declarative partitioning is not a sharding engine. The partitions all live in the same database, on the same server, sharing the same buffer pool and the same WAL stream. The win is not parallel disks. The win is that the planner can prove, at plan time and again at execution time, that most of the table is irrelevant to your query, and skip it. Partitioning is a query-skipping mechanism dressed up as a storage layout.

Keep that picture. The point of cutting the table along a date is that a query for last Tuesday’s rows touches one small partition instead of two years of history. The point of cutting along tenant_id is that one tenant’s query never reads another tenant’s pages, and dropping a tenant is one DROP TABLE. Once you read partitioning as “let the planner skip work,” every rule that follows, the key-in-primary-key constraint, the partition-wise join settings, the planning-cost trade-off, starts to feel inevitable.

Breaking it down

1. Why a multi-terabyte table needs to be cut up

Start with the problem, because partitioning is real work and only earns its place when a single table has stopped being able to carry the workload. Imagine a page_views table that has grown to four terabytes across two years of history. The hot data your application reads is the last seven days. The cold data your analysts query is the last quarter. Everything older is read once a month at most. The single-table design is failing in four specific ways at once.

Sequential scans now touch four terabytes even when the query asks for last week. Index maintenance is brutal: a CREATE INDEX rewrites a four-terabyte b-tree, and routine VACUUM traverses the whole heap on every run. Deletes hurt the most. Removing eighteen-month-old rows means a DELETE that finds a billion tuples, marks each with xmax, generates a billion index tombstones, and then a long autovacuum pass behind it. The table is one indivisible unit, and every maintenance operation pays for the whole of it.

Partitioning fixes each of these by changing the unit of work. The planner skips partitions that cannot match. A CREATE INDEX on a new partition rewrites a few gigabytes, not a few terabytes. VACUUM runs per partition, in parallel where you choose. Dropping eighteen-month-old data becomes DROP TABLE page_views_2024_01, a metadata change that finishes in milliseconds and releases the disk space immediately. The same physical data, cut along the right line, makes every operation tractable again.

flowchart LR
    A["Monolithic 4 TB table"] --> B["Plan time: full stats"]
    A --> C["Vacuum: whole heap"]
    A --> D["Delete: row-by-row"]
    E["Partitioned by month"] --> F["Plan time: one partition"]
    E --> G["Vacuum: per partition"]
    E --> H["DROP partition: instant"]

That diagram is the whole argument. The same query, the same maintenance, the same retention, against two different physical layouts. The partitioned layout makes them small operations on small pieces.

2. The three partitioning methods and when each fits

PostgreSQL declarative partitioning offers three methods, and the choice between them is not aesthetic. Each one routes a row to a partition by a different rule, and each one matches a different shape of workload.

RANGE partitioning routes by a contiguous range of values. You declare bounds like FROM ('2026-01-01') TO ('2026-02-01'), and any row whose key falls in that half-open interval lands in that partition. This is the right tool for time-series data, append-only event logs, and any data where queries naturally ask for a window: last hour, last day, last quarter. It is by far the most common method in production.

LIST partitioning routes by an explicit set of discrete values. You declare a partition for FOR VALUES IN ('US', 'CA', 'MX'), and rows with those exact key values land there. This fits a small, stable set of keys: region, country, tenant tier, environment. If the set of possible values is unbounded or large, LIST is the wrong tool.

HASH partitioning routes by hash(key) mod N, evenly spreading rows across N partitions with no natural ordering. You declare FOR VALUES WITH (MODULUS 8, REMAINDER 0) for the first of eight partitions. Use it when there is no natural range or list to cut along, but you still want the per-partition wins on vacuum and bulk operations. It does not help pruning for range queries, since rows that share a date are scattered across all partitions, but it does help when queries always include the hashed key as an equality filter (a single tenant looking up their rows).

The DDL is identical in shape across the three. Here is RANGE on a date column, the most common case in practice.

create table page_views (
  id          bigint generated always as identity,
  tenant_id   uuid not null,
  run_date    date not null,
  path        text not null,
  status      smallint not null,
  primary key (id, run_date)
) partition by range (run_date);

create table page_views_2026_01
  partition of page_views
  for values from ('2026-01-01') to ('2026-02-01');

create table page_views_2026_02
  partition of page_views
  for values from ('2026-02-01') to ('2026-03-01');

create table page_views_default
  partition of page_views default;

A few details worth pinning down. The bounds are half-open: FROM is inclusive, TO is exclusive, so adjacent partitions meet without overlap. A DEFAULT partition catches rows that match no other partition, which sounds like a safety net and is actually a trap, see rung 9. And the parent has no storage of its own: every row physically lives in exactly one child.

The three partitioning methods at a glance

Routes by half-open intervals on the key. The canonical fit is time: one partition per day, week, or month on an event log or fact table. Pruning is excellent for any query with a BETWEEN, less-than, or greater-than on the key. Adjacent partitions never overlap. Use this unless you have a strong reason not to.

3. Keys, uniqueness, and the constraint that bites everyone

Here is the rule that catches almost every team on their first partitioned table. The partition key must be part of every primary key and every unique constraint on the partitioned table. No exceptions. If you partition by run_date, your primary key cannot be (id) alone. It must be (id, run_date) or some other tuple that includes the partition key.

This is not arbitrary. PostgreSQL enforces uniqueness with a unique index, and each partition gets its own local unique index covering just that partition’s rows. A primary key on (id) alone would require a global index that spans every partition, and global indexes are not supported. By forcing the partition key into the constraint, PostgreSQL guarantees that any unique check happens inside one partition, with a local index, and the per-partition index pieces add up to a partition-wise correct constraint.

This rule has two consequences you live with. First, your “primary key” on a partitioned fact table is almost always a composite ending in the partition key: (id, run_date), (order_id, created_at), (event_id, ingest_day). Second, foreign keys point one way. A non-partitioned table can have a foreign key into a partitioned table (PostgreSQL 12 and later), and a partitioned table can reference another table by foreign key. What you cannot do is enforce a unique constraint that ignores the partition key, so any natural-key uniqueness that crosses partitions has to be enforced at the application layer or by a deferrable trigger.

-- This fails: id alone is not unique across partitions.
-- ERROR: unique constraint on partitioned table must include all partitioning columns
alter table page_views add primary key (id);

-- This works: the composite includes the partition key.
alter table page_views add primary key (id, run_date);

The fix is mechanical: include the partition key. The lesson is that the partition key is no longer just a routing column. It is part of every row’s identity.

4. Partition pruning: how the planner skips what it does not need

Pruning is the payoff. When you write a query against the parent, the planner inspects each partition’s bounds against your query’s qualifiers. If the bounds prove the partition cannot contain a matching row, the planner does not even open it. The remaining partitions get planned in the normal way, with their own scan, index, and join choices.

Pruning happens in two passes. Plan-time pruning uses constants the planner knows when it builds the plan: a literal date in your WHERE clause, an immutable function applied to constants. Execution-time pruning uses values the planner does not know until the query runs: a parameter from a prepared statement, the inner side of a nested-loop join, or a value computed from now(). Both are controlled by enable_partition_pruning, which is on by default. Leaving it on costs you nothing if the query has no partition-key filter, and saves a multi-terabyte scan when it does.

Here is what pruning looks like on a typical analytics query. The table has twenty-four monthly partitions covering two years. The query asks for one day in January 2026.

explain (analyze, buffers)
select count(*)
from page_views
where run_date = date '2026-01-15';

 Aggregate  (cost=... rows=1)
   ->  Append  (cost=... rows=...)
         Subplans Removed: 23
         ->  Seq Scan on page_views_2026_01  (cost=...)
               Filter: (run_date = '2026-01-15'::date)

The crucial line is Subplans Removed: 23. Twenty-three of the twenty-four partitions were pruned. The plan only scans page_views_2026_01. That single number is how you prove pruning is working in production, and its absence is the first thing to check when a partitioned query is somehow slower than the monolithic table was.

How the planner prunes a date-range query

The query arrivesYou run select sum(bytes) from page_views where run_date between date '2026-01-10' and date '2026-01-20'. The parent has 24 monthly partitions from 2025-01 through 2026-12.

Step 1 of 5

Two ways to defeat pruning, both common. First, omit the partition key from the predicate. WHERE status = 200 cannot prune anything on a date-partitioned table, since the planner has no information about which partitions hold which statuses. The query scans every partition. Second, wrap the partition key in a function the planner cannot see through: WHERE date_trunc('day', run_date) = '2026-01-15' defeats plan-time pruning because the planner does not push the function into the bounds check. Write the qualifier in terms the planner can match: WHERE run_date = '2026-01-15' works, WHERE run_date >= '2026-01-15' AND run_date < '2026-01-16' also works.

5. Partition-wise join and aggregate

Pruning makes a single-table query small. The next two settings make joins and aggregates between two partitioned tables small in the same way.

enable_partitionwise_join lets the planner notice that two tables partitioned the same way along the same key can be joined partition-by-partition. Instead of building one giant hash table over the whole right side, the planner emits a separate join node for each matching partition pair. Each pair fits in memory, can be hashed independently, and can be executed in parallel. On aligned partitioned tables, this commonly takes a five-minute analytics query down to thirty seconds.

enable_partitionwise_aggregate does the same trick for GROUP BY. If your grouping keys include the partition key, the planner can aggregate each partition independently and then concatenate the per-partition results. The Append happens at the top of the plan, after the heavy work has finished, instead of at the bottom where every row would have to flow through one shared aggregator.

Both settings default to off. The reason is honest: they let the planner build larger plan trees with more nodes, and a query against a hundred-partition table can spend noticeably more time in planning. The trade is wonderful when the query is large enough that planning is a rounding error, and bad when the query is small enough that planning matters. Turn them on at the session or query level for analytics workloads, leave them off as the global default if your OLTP queries dominate.

-- Per-session, for an analytics workload:
set enable_partitionwise_join = on;
set enable_partitionwise_aggregate = on;

-- The two tables must be partitioned the same way for the planner to use this:
-- same method (RANGE), same key column, and the partition bounds must align.
select t.tenant_id, sum(v.bytes)
from page_views v
join tenants t on t.id = v.tenant_id and t.run_date = v.run_date
where v.run_date between '2026-01-01' and '2026-01-31'
group by t.tenant_id;

Bounds must match, not merely overlap. Two tables both partitioned monthly by run_date with identical month boundaries are aligned. Two tables one of which is daily and the other monthly are not aligned, and you get a plain join across the Appended children with none of the wins. When you can keep partition layouts identical across tables that join, the planner rewards you generously.

6. Indexes on a partitioned table

Indexes on the partitioned table are partitioned too. Since PostgreSQL 11, creating an index on the parent automatically creates a matching local index on every existing partition, and on every partition added later. There is no global index that spans partitions, and there does not need to be: each child’s local index plus pruning gives you the same lookup behavior.

The catch is CREATE INDEX ... CONCURRENTLY. It is not supported on the parent directly. The reason is that CONCURRENTLY needs to lock and rebuild one relation at a time, and the parent does not have storage of its own. The supported workflow is to build the per-partition pieces yourself, then attach them. This is the production-safe pattern.

-- 1. Create the parent index as INVALID first; this takes only an AccessShareLock.
create index page_views_path_idx on only page_views (path);

-- 2. Build each child index CONCURRENTLY, one at a time, with a stable name.
create index concurrently page_views_2026_01_path_idx
  on page_views_2026_01 (path);
create index concurrently page_views_2026_02_path_idx
  on page_views_2026_02 (path);
-- ... and so on for every partition.

-- 3. Attach each child to the parent index. The parent flips to VALID when every
--    partition has an attached child index.
alter index page_views_path_idx
  attach partition page_views_2026_01_path_idx;
alter index page_views_path_idx
  attach partition page_views_2026_02_path_idx;

This pattern adds an index to a multi-terabyte partitioned table without ever taking a heavy lock on the parent. New partitions created after this point inherit the index definition from the parent and get their own local index automatically. The same workflow applies to UNIQUE indexes that include the partition key.

One subtle pitfall worth flagging. If you create a local index on a child only and never attach it to a parent index, the parent will keep building its own local index on every new partition the next time you add one, and you can end up with two indexes per partition: yours, plus the auto-created one. Always attach, or always build through the parent.

7. DETACH, ATTACH, and rotating partitions in production

Partitioning shines on retention because adding and removing data becomes a metadata operation. Two commands carry most of the work: ATTACH PARTITION and DETACH PARTITION.

The bulk-load pattern. To add a month of historical data, build a standalone table, load it at full speed without partition routing overhead, create its indexes, run its constraints, and only then attach it.

-- 1. Build a standalone, identical-shape table.
create table page_views_stage (like page_views including all);
copy page_views_stage from '/data/page_views_2025_12.csv' csv header;

-- 2. Enforce the bound as a CHECK so ATTACH skips its own scan.
alter table page_views_stage
  add constraint chk_dec_2025
  check (run_date >= '2025-12-01' and run_date < '2026-01-01') not valid;
alter table page_views_stage validate constraint chk_dec_2025;

-- 3. Attach. Without the CHECK, PostgreSQL scans the whole table to verify
--    the bound; with it, the attach is almost instant.
alter table page_views attach partition page_views_stage
  for values from ('2025-12-01') to ('2026-01-01');

The matching NOT VALID then VALIDATE pair is the trick that keeps ATTACH from re-scanning a billion rows you already know are in range.

The retention pattern. To remove an old partition, detach it, then drop or archive at your leisure.

-- DETACH CONCURRENTLY (PG14+) avoids the heavy AccessExclusiveLock on the parent
-- by detaching in two phases. New queries stop seeing the partition immediately,
-- but in-flight queries that planned it can finish.
alter table page_views detach partition page_views_2024_01 concurrently;

-- Now the partition is a standalone table you can archive or drop without
-- affecting the live parent.
drop table page_views_2024_01;

Before DETACH CONCURRENTLY arrived in PostgreSQL 14, a plain DETACH took an AccessExclusiveLock on the parent, which on a busy OLTP system meant a brief but real outage. The concurrent form takes lighter locks and waits for in-flight queries to drain. Use it.

Removing eighteen months of cold data

ALTER TABLE page_views DETACH PARTITION page_views_2024_06 CONCURRENTLY then DROP TABLE page_views_2024_06. Two metadata operations. Disk space is reclaimed immediately, no autovacuum debt, and the parent stays online throughout.

8. Automating the boring part with pg_partman

In any time-partitioned workload, you have a daily chore: create tomorrow’s partition before midnight so writes do not fall into the default, and detach last year’s so storage does not grow forever. Doing this by hand is fine for a week and miserable forever. pg_partman is the community extension that automates this loop.

You configure pg_partman with one row in its config table per partitioned table: the parent name, the interval (daily, monthly, 1 day, 1 hour), how many partitions to premake ahead of the current time, and how many to retain behind it. Then a single function call runs every few minutes from cron or a scheduler.

create extension pg_partman;

-- Register page_views as a time-partitioned table.
select partman.create_parent(
  p_parent_table => 'public.page_views',
  p_control      => 'run_date',
  p_type         => 'range',
  p_interval     => '1 day',
  p_premake      => 7,
  p_start_partition => '2026-01-01'
);

-- Set retention: keep 90 days, detach older partitions.
update partman.part_config
  set retention = '90 days',
      retention_keep_table = false   -- drop instead of just detaching
  where parent_table = 'public.page_views';

-- The maintenance call. Schedule this every 5-10 minutes.
select partman.run_maintenance(p_analyze => false);

That single run_maintenance call premakes the next seven partitions, drops or detaches anything past retention, and updates statistics where you let it. You schedule it with pg_cron, an external cron job, or a sidecar process, and the partition graveyard stays groomed without you thinking about it.

9. The hidden cost: planning time and partition explosion

Pruning is fast at runtime. Building the plan is not free. The planner reads every partition’s catalog entry, considers each one for pruning, and only then discards the ones it can rule out. With ten partitions this is invisible. With ten thousand partitions, the time spent in the planner can dominate a fast OLTP query.

This is the partition-count trade-off. The cost per partition is small but real, paid on every query against the parent regardless of how many partitions actually survive pruning. A reasonable rule of thumb for PostgreSQL 16 and 17 is hundreds, not tens of thousands. A few hundred partitions is comfortable. A few thousand is the upper edge of comfortable. Ten thousand is a problem you will feel in p99 latency on point queries.

flowchart TB
    A["10 partitions"] --> B["Plan time invisible"]
    C["500 partitions"] --> D["Plan time noticeable on OLTP"]
    E["10000 partitions"] --> F["Plan time dominates fast queries"]
    G["1 partition per day for 10 years"] --> H["3650 partitions, planner pain"]

Some real cases that go wrong here. Daily partitions for a decade is 3,650 children, often well past the sweet spot. Sub-partitioning, partitioning a partition further, multiplies the count and rarely pays for itself on PostgreSQL declarative partitioning. A LIST partition per tenant on a system with fifty thousand tenants is a graveyard for the planner. The fix in each case is to coarsen: monthly partitions hold a decade in 120 children; tenants can share LIST partitions by bucket; HASH gives you a fixed N regardless of how many distinct keys exist.

There is also the default-partition trap. A DEFAULT partition silently absorbs any row that does not match an explicit bound. If you forget to premake tomorrow’s partition, tomorrow’s writes land in the default, mix with other unmatched rows, and now you cannot ATTACH a new partition for that range without first removing those misrouted rows from the default. The fix is not to remove the default; the fix is to make sure pg_partman premakes far enough ahead that the default is empty, and to alert when it is not.

Failure modes experienced engineers still hit

10. Tenant-aware partitioning and hard isolation

A second strong use case for partitioning is multi-tenant isolation. By partitioning on tenant_id, you make each tenant’s rows live in their own table, with their own indexes and their own page cache footprint. Three concrete wins follow.

First, dropping a tenant is DROP TABLE. A customer churns, you detach and drop their partition, their storage is reclaimed instantly. Second, per-tenant vacuum and analyze run in parallel, and a heavy tenant cannot stall maintenance on the rest. Third, partition pruning gives you defense in depth for tenancy: a query whose qualifier includes tenant_id = 'X' cannot touch any other tenant’s pages, period, regardless of bugs above.

The shape choice depends on tenant cardinality and lifetime. A small, stable set of premium tenants (tens or low hundreds) is a LIST partition each. A larger pool of similar tenants is HASH on tenant_id with a fixed modulus, keeping the partition count bounded as new tenants sign up. A fact table you want to partition by both time and tenant is usually best done by composite partitioning, RANGE on time at the parent and HASH on tenant in sub-partitions, but remember the planner cost: sub-partitioning multiplies the partition count.

Pair this with row-level security for the strongest isolation. The combination is powerful: RLS proves at the planner level that a tenant only sees their own rows, and partition pruning proves that the executor never even reads pages belonging to other tenants. A bug in one layer cannot leak data, because the other layer would still block it.

-- HASH partitioning on tenant_id, eight buckets, with RLS on top.
create table orders (
  id         bigint generated always as identity,
  tenant_id  uuid not null,
  amount_cents bigint not null,
  primary key (id, tenant_id)
) partition by hash (tenant_id);

create table orders_p0 partition of orders
  for values with (modulus 8, remainder 0);
-- ... orders_p1 through orders_p7 ...

alter table orders enable row level security;
create policy tenant_isolation on orders
  using (tenant_id = current_setting('app.tenant_id')::uuid);

The application sets app.tenant_id per request. RLS adds the tenant_id predicate to every query, which lets the executor prune to the one hash partition that holds this tenant’s rows. Two layers of safety, one schema.

11. Diagnostics: proving pruning, measuring per partition

Three tools tell you what is actually happening once partitioning is in place.

EXPLAIN with the right format. The plan must show fewer partitions than the parent has. Look for Subplans Removed: N and the list of children inside the Append. If every partition appears, pruning is not happening and the query is doing the work you tried to avoid.

explain (analyze, buffers, settings)
select count(*) from page_views
where run_date = current_date and tenant_id = $1;

pg_partition_tree, pg_partition_ancestors, and pg_partition_root. These three system functions let you enumerate the partition hierarchy programmatically. They are the right primitives for monitoring queries that count partitions per parent, find orphaned children, or check that premake produced what you expected.

-- All partitions under page_views, with their nesting level.
select * from pg_partition_tree('public.page_views');

-- For an arbitrary table, find the partitioned root it ultimately belongs to.
select pg_partition_root('public.page_views_2026_01');

pg_stat_user_tables per partition. Every partition is a real table, so its row counts, vacuum timings, and live and dead tuple counts show up under its own name. A partition with n_dead_tup climbing while its peers stay flat is doing more rewrite work than you thought, and probably needs its own vacuum settings.

select relname, n_live_tup, n_dead_tup, last_autovacuum
from pg_stat_user_tables
where relname like 'page_views_%'
order by n_dead_tup desc
limit 10;

When pruning is working, the older partitions stay mostly cold and their last_autovacuum recedes into the past, while the current month’s partition is the only hot relation in the working set. That is the steady state you are aiming for: a small number of busy children, a long tail of cold ones, and the parent sitting on top doing nothing but routing.

Mastery Questions

You are designing a 5 TB analytics fact table that ingests roughly 30 GB per day and is queried mostly by date ranges of a week or less, with retention of two years. What partitioning method and granularity would you pick, and what would you put in the primary key?

Answer. RANGE partitioning on the ingest date is the right choice, because almost every query has a date predicate the planner can use to prune. Monthly partitions over two years give you 24 children, comfortably in the sweet spot for planning time, and each partition holds about 900 GB, which is large but tractable for vacuum and index rebuilds. Daily partitions would give you 730 children, still workable but pushing planner cost up for fast OLTP queries that hit the parent, and gaining little for typical week-long range queries that already prune to one or two months. The primary key must include the partition key, so something like (id, ingest_date) is correct; (id) alone would be rejected. You would also automate creation with pg_partman premaking seven monthly partitions ahead and retaining 24, so the manual chore goes away and the DEFAULT partition stays empty. Finally, you would set enable_partitionwise_aggregate per session for the analytics workload, so monthly rollups happen in parallel per partition rather than across the whole Append.
A teammate complains that their query against a 100-partition table is much slower than it was before partitioning. The query is SELECT * FROM page_views WHERE status = 500 AND path LIKE '/api/%' ORDER BY created_at DESC LIMIT 100. What is almost certainly wrong, and how do you fix it?

Answer. The query has no predicate on the partition key, so the planner cannot prune. It Appends a scan of every one of the hundred partitions, sorts the combined result, and only then takes the top 100. Before partitioning, the single-table query had one index it could walk in created_at order with a LIMIT, and the same index would let it stop as soon as it had 100 rows matching the filters. Now the planner has to materialize matches from every partition before it can sort. The fix is to give the query a partition-key bound, even a loose one: WHERE created_at > now() - interval '7 days' lets the planner prune to roughly seven daily partitions, after which the per-partition (created_at DESC, status, path) index can do its job and the LIMIT stops the scan early. If the application genuinely needs unbounded-time results, the right answer is usually a denormalized summary or a different storage choice for that access pattern. Partitioning is a tool for queries whose access pattern aligns with the partition key, and queries that ignore the key pay for the alignment without benefiting from it.
Your team rolled out daily partitions on a tenant fact table four years ago. You now have about 1,500 partitions, EXPLAIN is showing 30 to 80 ms of planning time on point queries that used to run in 2 ms, and p99 latency has gotten worse even as data volume per day has stayed flat. What changed and what would you do?

Answer. Nothing in the data changed; the partition count crossed the threshold where planning cost dominates. Each fast OLTP query now spends most of its time in the planner inspecting partition descriptors before pruning narrows them down, even though execution against the surviving partition is still microseconds. The general remedy is to coarsen: roll up older partitions you no longer query at daily resolution into monthly ones. A practical path is to keep the last 90 days as daily partitions, where retention and per-day operations matter, and merge older daily partitions into monthly ones (build the monthly table, copy or attach the underlying data, detach the daily children). That brings the partition count back toward a few hundred and recovers the planning time. If the workload genuinely needs daily granularity for retention purposes, a sub-partitioning scheme of monthly-then-daily helps slightly but does not save you, since the total partition count is still high. The deeper lesson is that partition granularity is not free: it is a knob you tune against query patterns and partition count, and the right value drifts as the table ages. Audit it yearly.

Recommended next

Isolation Levels and MVCC Anomalies
Builds directly on this page: Isolation Levels and MVCC Anomalies is the next step in the PostgreSQL performance ladder.

Sources & evidence14 claims · 4 cited

Mechanism and DDL claims are grounded in the PostgreSQL docs for partitioning, indexes, and the planner statistics chapter; correct operational facts that the docs do not state quantitatively (pg_partman behavior, the hundreds-not-thousands partition-count rule of thumb, partition-wise settings being off by default for planning cost) are marked stable-common-knowledge with empty source_ids.

PostgreSQL declarative partitioning supports exactly three methods: RANGE (rows routed by a half-open interval on the key, FROM inclusive, TO exclusive, so adjacent partitions meet without overlap), LIST (an explicit set of discrete key values), and HASH (rows routed by hash(key) mod N across N partitions with MODULUS and REMAINDER bounds).verified
A partitioned table has no storage of its own; every row physically lives in exactly one child partition, and a DEFAULT partition catches rows that match no other partition's bounds.verified
PostgreSQL requires every primary key and every unique constraint on a partitioned table to include all of the partitioning columns, because uniqueness is enforced by a per-partition local unique index and global indexes are not supported; this is why composite keys like (id, run_date) are standard on partitioned fact tables.verified
Partition pruning runs in two passes governed by enable_partition_pruning (on by default): plan-time pruning uses constants known at planning, while execution-time pruning uses values bound at query start (prepared-statement parameters or values from the outer side of a nested-loop join) to discard further partitions before scanning them.verified
An EXPLAIN plan for a pruned query shows an Append node with a 'Subplans Removed: N' line counting partitions excluded by pruning, and only the surviving partitions appear as scan children; the absence of this line on a partition-key qualifier is the first signal that pruning is not happening.verified
Wrapping the partition key in a function the planner cannot constant-fold (such as date_trunc on a column) defeats plan-time pruning because the planner cannot push the call into the bounds check; rewriting the qualifier in the column's own terms (run_date >= '2026-01-15' AND run_date < '2026-01-16') restores it.verified
enable_partitionwise_join and enable_partitionwise_aggregate default to off because they grow the planner's plan tree with per-partition join and aggregate nodes, increasing planning time noticeably on large partition counts; they are worth enabling for analytics sessions where execution time dominates.stable common knowledge
Partition-wise join requires the two tables to be partitioned the same way along the same key with matching (not merely overlapping) partition bounds; differing granularities (daily versus monthly) prevent the optimization even when the join key is identical.verified
Since PostgreSQL 11, an index created on a partitioned parent is propagated to every existing partition and to every partition added later; CREATE INDEX CONCURRENTLY is not supported directly on the parent, so the production workflow is to create the parent index ON ONLY parent, build each child index CONCURRENTLY, then ALTER INDEX parent ATTACH PARTITION each child, which flips the parent to VALID when all children are attached.verified
ALTER TABLE ... ATTACH PARTITION scans the candidate partition to verify the partition constraint unless an equivalent CHECK constraint already exists (added as NOT VALID and then VALIDATEd), in which case the attach skips the scan and finishes almost instantly even on a billion-row table.verified
ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY (added in PostgreSQL 14) detaches in two phases with lighter locks than the plain DETACH, allowing in-flight queries that already planned the partition to finish; the plain form takes an AccessExclusiveLock on the parent and briefly blocks all queries.verified
The community extension pg_partman automates time- and serial-based partition lifecycle: a single run_maintenance() call (typically scheduled every five to ten minutes) premakes the next N partitions ahead of the current time and detaches or drops partitions older than the configured retention.stable common knowledge
The planner inspects every partition's catalog entry before pruning narrows the candidate set, so planning time grows with the total partition count regardless of how many survive pruning; the practical rule of thumb on PostgreSQL 16 and 17 is hundreds of partitions, not tens of thousands, before planning cost dominates fast OLTP queries.stable common knowledge
Each partition is an ordinary table in pg_stat_user_tables, so per-partition n_live_tup, n_dead_tup, and last_autovacuum are visible by relname; system functions pg_partition_tree, pg_partition_ancestors, and pg_partition_root enumerate the partition hierarchy for monitoring queries.verified

Cited sources

PostgreSQL Documentation: Table Partitioning · PostgreSQL Global Development Group
PostgreSQL Documentation: Indexes · PostgreSQL Global Development Group
PostgreSQL Documentation: How the Planner Uses Statistics · PostgreSQL Global Development Group
PostgreSQL Documentation: B-Tree Indexes · PostgreSQL Global Development Group