B-Tree Index Internals

The default PostgreSQL index: its on-disk shape, the leftmost-prefix rule for multicolumn keys, insert and update mechanics, deduplication and bottom-up deletion, INCLUDE columns, operator classes, and the diagnostics for finding bloated or unused indexes.

Learning outcomes

A B-tree is the only index you get for free in PostgreSQL: CREATE INDEX with no qualifier builds one, every primary key is one, and every other index type defends itself against the B-tree’s numbers. So before you can pick a GIN, a BRIN, or a partial index honestly, you need a working picture of how this one performs, where it spends its bytes, and what kind of writes hurt it. That is what this page builds.

After studying this page, you can:

  • Draw a B-tree’s on-disk shape (root, inner pages, leaves) and explain why three or four page reads suffice for hundreds of millions of rows.
  • Predict, for a multicolumn index on (a, b, c), which queries can use it as a full index scan, which get a partial range scan, and which fall back to a sequential scan.
  • Choose between a random-key primary key (uuidv4) and an ordered one (bigserial, uuidv7) based on insert locality, and read the bloat that follows from getting it wrong.
  • Decide when to add INCLUDE columns, when to use text_pattern_ops, and when to drop an index instead of building one.
  • Use pg_stat_user_indexes, pgstattuple, and REINDEX CONCURRENTLY to find a bloated or unused index and fix it without taking the table offline.

Before we dive in

You should be comfortable with PostgreSQL’s page model: the 8 kB page as the unit of I/O, and a tuple addressed by its ctid, the pair (page, item). The heap-pages-and-toast page covers both. You also want to be able to read an EXPLAIN (ANALYZE, BUFFERS) plan node, because every diagnostic in this page lives inside one. The reading-explain-analyze page covers that.

A few terms, defined as we use them. A key is the column value (or tuple of values, for a multicolumn index) the tree is sorted by. A separator key is the routing value stored in a non-leaf page that tells a descending search which child to follow. A leaf page is a B-tree page at the bottom of the tree, and it is the only level that stores actual (key, ctid) pointers into the heap. Fanout is how many children a single inner page can address, and it is the reason the tree stays shallow. Hold those four; the rest is built from them.

Mental Model

The wrong picture, and it is everywhere, is “an index is a sorted list of keys with pointers next to them, so a lookup just binary-searches the list.” Under that model the index is one big array, and adding a row means shifting everything after the insertion point.

PostgreSQL does not work that way. The better picture is a multi-level phone directory. The top level (the root page) is short: a handful of letters, each pointing to a thicker book. Each thicker book (an inner page) is itself a short index of sub-ranges, each pointing to a still thicker book underneath. Only the bottom-level books (the leaf pages) hold actual entries: a sorted run of (name, address) pairs, plus a thread tying each leaf to its left and right neighbors so you can read straight across without going back up.

Keep this picture. Every descent reads three or four pages, never the whole tree. Every insert finds its right leaf and edits one page (occasionally splitting it in two). A range scan does one descent to find the start, then walks the leaf thread sideways. Once that clicks, the leftmost-prefix rule, the uuidv4 fragmentation story, and the cost of INCLUDE columns all stop being trivia and start being consequences of one shape.

Breaking it down

1. Why a B-tree, and what shape it has on disk

A B-tree is a balanced tree of 8 kB pages. Each non-leaf page (the root at the top, and inner pages below it) holds a sorted array of separator keys, each paired with a pointer to a child page one level down. Each leaf page holds a sorted array of (key, ctid) entries: the indexed value, and the physical address of the corresponding heap tuple. A leaf does not store the row, only the pointer to it.

The cleverness of the shape is the fanout. With an 8 kB page, a small separator key (say a few dozen bytes), and the line-pointer overhead, an inner page can route to a few hundred children. A tree with fanout 200 reaches one million rows in three levels (200^3 = 8,000,000), and a hundred million rows in four (200^4 = 1.6 billion). That is why you keep hearing the same number for production B-trees: three or four levels covers nearly every table you will ever build, and the height grows logarithmically, not linearly, in the number of rows. The formal cost of a point lookup is O(log_n) page reads, and for realistic fanouts that logarithm is a single-digit integer.

The leaves carry one more trick: each leaf page holds a pointer to its left and right sibling. After the tree finds the leaf where a range scan begins, it follows those sibling pointers to read the rest in order, without ever climbing back to the root. This is what turns a B-tree into the right shape for both WHERE id = 42 and WHERE created_at BETWEEN '2026-01-01' AND '2026-01-31'.

flowchart TB
    R["Root page: 3 separator keys + 4 child pointers"]
    I1["Inner page: separators for keys 1..50,000"]
    I2["Inner page: separators for keys 50,001..100,000"]
    L1["Leaf page: sorted (key, ctid) pairs, keys 1..200"]
    L2["Leaf page: keys 201..400"]
    L3["Leaf page: keys 401..600"]
    R --> I1
    R --> I2
    I1 --> L1
    I1 --> L2
    I1 --> L3
    L1 -. left-right link .-> L2
    L2 -. left-right link .-> L3

The picture is the whole answer to “why is the tree shallow?” and “why are range scans cheap?” at once. Hold it.

2. A lookup, page by page

Now walk a single lookup, because the cost is set by exactly how many pages it reads. Suppose you run SELECT * FROM orders WHERE id = 8675309 against a table with 200 million rows and a B-tree on id.

PostgreSQL starts at the root page. It binary-searches the root’s separator keys, finds the slot whose range covers 8,675,309, and follows the child pointer to an inner page. That is read number one. On the inner page it does the same search and follows another child pointer to the next level down. Read number two. One or two levels later, it reaches a leaf page. Read three or four. On the leaf, it binary-searches the sorted (key, ctid) entries, finds the matching id, and reads the heap tuple at the ctid. That is one final page read on the heap.

Total: three or four index page reads plus one heap page read for a point lookup against 200 million rows. If those pages are already in the shared buffer pool (and the upper levels of a hot index almost always are), every read is a memory hit and the whole lookup takes microseconds. If the leaf is cold, you pay one random disk read for it. That is the entire performance story of a B-tree point lookup.

A range scan reuses the same descent. To run WHERE id BETWEEN 8000000 AND 8001000, PostgreSQL descends once to find the leaf holding the start of the range, then walks the sibling pointers leaf to leaf, emitting matching entries in order until the upper bound is crossed. The descent is paid once; the rest is sequential leaf reads. That is why an index range scan can return thousands of rows for the cost of one descent plus a handful of leaf pages, and why you should think of leaf order, not heap order, when reasoning about how a B-tree streams rows.

A point lookup on a four-level B-tree
Read the rootBinary-search the root's separator keys to find the slot that covers id = 8675309. Follow its child pointer. One page read.
Step 1 of 5

3. Multicolumn indexes and the leftmost-prefix rule

A B-tree can index more than one column at a time: CREATE INDEX ON orders (customer_id, status, created_at) builds a single tree whose key is the triple (customer_id, status, created_at). The entries are sorted by the first column, then by the second within ties on the first, then by the third within ties on the first two. That ordering is the one rule you need to predict every query the index can serve.

The shape of the order produces the leftmost-prefix rule. The index can be used as a full B-tree search when the query has equality conditions on a leftmost prefix of the columns, and a range or sort on the next column. So for an index on (a, b, c):

  • WHERE a = 1 uses the index. The tree descends to all entries with a = 1 and scans them on the leaf level.
  • WHERE a = 1 AND b = 2 uses the index, more tightly: the descent goes straight to entries with (a, b) = (1, 2).
  • WHERE a = 1 AND b = 2 AND c BETWEEN 100 AND 200 uses the index, fully: equality on the prefix (a, b) plus a range on the next column c is exactly what the ordering supports.
  • WHERE a = 1 AND c = 9 uses the index, but only partially: it can descend to a = 1, then it must filter on c as it scans, because c is not contiguous when b varies.
  • WHERE b = 2 cannot use the index for a tree descent at all. The values of b are scattered throughout the leaves once a varies, so the planner usually picks a sequential scan or a different index.

The same logic governs ORDER BY. An index on (a, b) can return rows in ORDER BY a, b order with no sort node, and it can return them in ORDER BY a order for free as well. It cannot serve ORDER BY b without a sort.

-- Index built once, supports several access patterns.
CREATE INDEX orders_cust_status_created
  ON orders (customer_id, status, created_at);

-- Full index scan: equality prefix, range on the next column.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = 42
  AND status = 'shipped'
  AND created_at >= now() - interval '30 days';

-- Partial scan: skips columns 2 and 3 in the descent, filters on the leaf.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders
WHERE customer_id = 42
  AND created_at >= now() - interval '30 days';

-- Cannot use the index for a descent: the planner falls back to Seq Scan or
-- picks a single-column index on status if one exists.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE status = 'shipped';

The practical rule is to put the column with the most selective equality predicate first, then the column whose ranges or sorts the workload cares about, then any extra equality columns. Get the order wrong and the index either helps fewer queries than you expected, or, worse, costs you write throughput while no read takes advantage of it.

Which queries can a B-tree on (a, b, c) serve?
Equality on a leftmost prefix, optionally followed by a range or sort on the next column. Examples: WHERE a = 1; WHERE a = 1 AND b = 2; WHERE a = 1 AND b = 2 AND c > 100; ORDER BY a, b. The descent goes straight to the matching leaves and the scan reads them in order.

4. Insert: the cost of random versus ordered keys

Every insert into a table with a B-tree adds one entry per index. The mechanics are the same shape as a lookup: descend to the right leaf for the new key, then write the (key, ctid) entry there. The cost shows up in what happens when the leaf is full.

If the target leaf has room, the insert is cheap: write the entry into the leaf, update the line-pointer array, fsync the WAL record, done. If the leaf is full, PostgreSQL splits it: it allocates a new leaf page, moves about half the entries to it, updates the sibling pointers on both sides, and pushes a new separator key up into the parent inner page. If the parent is also full, the split cascades upward. A split is a handful of page writes and one or two WAL records, and it is the unit of cost you pay for growth.

This makes the order of your inserts decisive. Two patterns dominate, and they have opposite economics:

  • Ordered insertion: a key that always increases, such as bigserial, a timestamptz for events, or uuidv7 (which encodes a timestamp as the leading bits). Every new entry lands on the rightmost leaf of the tree. That leaf fills, splits once, and the new entry goes onto the fresh leaf. Splits are local, predictable, and produce densely packed leaves. The rightmost path of the tree is hot in cache. Index growth is linear in the number of rows.
  • Random insertion: a key drawn from a uniform distribution, such as uuidv4. Every new entry lands at a uniformly random leaf. Any leaf can split at any time, so splits ripple across the tree. The cache footprint is the whole index, not the rightmost path, and disk locality is poor. Worse, splits leave each side roughly half full, and subsequent random inserts rarely fill the gaps in the same order, so the index ends up with leaves at low fill ratios. A uuidv4 primary key on a billion-row table can grow several times larger than the same table keyed by bigserial.

This is the practical core of the choice between random and ordered identifiers, and it is the right place to bring up the data-types-and-schema-design page: choosing uuidv7 over uuidv4 for a public-facing identifier preserves random-looking external IDs while keeping insertion locality, because uuidv7 is time-ordered. Choosing bigserial is even cheaper internally but exposes the row count to consumers, which is sometimes the reason people reach for UUIDs in the first place.

Inserting a million rows: random vs ordered key
Every insert lands on the current rightmost leaf. That leaf splits when it fills, the new entry goes onto the fresh leaf, and the old leaf is now densely packed. Only the rightmost path of the tree is hot, the cache footprint is tiny, and the index grows linearly. Same insert workload, a fraction of the WAL and a denser index.

5. Delete and update: dead entries, microvacuum, and HOT

A DELETE against the heap does not remove the corresponding B-tree entry. It cannot: index entries point to physical tuple addresses, and at the moment of the delete the dead tuple still exists on the heap (the MVCC page covers why). Instead, the index entry is left in place and is later marked dead during a scan that notices the heap tuple is gone, or it is reclaimed by VACUUM. Until then, it occupies space on its leaf and is read on every range scan that touches that leaf.

Two cleanup paths run in parallel. The first is VACUUM, which scans the heap to find dead tuples and then scans each index to remove the entries pointing to them. The second is microvacuum, an in-line cleanup that runs during a regular index scan: when the scan encounters an entry whose heap tuple is no longer visible to anyone, it can mark the entry’s line pointer dead immediately, so subsequent scans skip it. Microvacuum lets the index reclaim slot space between full vacuum runs, and it is one reason a busy index does not collapse under bloat even when autovacuum lags slightly.

An UPDATE is the case where the choice between inserting a new index entry or not gets interesting. The naive rule is that every UPDATE adds a new index entry per index, because the new tuple has a new ctid. PostgreSQL avoids that work when it can, through HOT updates (heap-only tuples). If the updated columns are not part of any index, and the new tuple version fits on the same heap page as the old one, the heap simply links the old tuple to the new with a t_ctid chain, and the indexes do not need to be touched. No new index entry, no leaf write, no WAL for index changes. HOT is the single largest reason to leave non-indexed columns off your indexes and to give tables a fillfactor below 100. The table-bloat-and-hot-updates page goes into exactly when HOT applies and when it stops applying.

The takeaway here is the rule: an index pays insert cost on every insert, and (non-HOT) update cost on every update that touches an indexed column or that cannot fit the new tuple on the same page. So adding an index does not only cost reads; it costs writes too, and HOT is the optimization that gives you some of that write cost back.

6. Deduplication and bottom-up index deletion

Two PostgreSQL releases changed the steady-state size of a B-tree dramatically, and both deserve a name.

Deduplication (PostgreSQL 13) changes how a leaf stores repeated keys. Before PG13, an index on a low-cardinality column (say a status column with five distinct values) stored each (key, ctid) pair separately, so a billion-row table produced a billion leaf entries. From PG13 on, the leaf can store a posting list: one copy of the key followed by a packed list of ctids that share it. The space savings on a leading column with low cardinality are dramatic, often shrinking the index by half or more, and lookups still work because a leaf entry can fan out into many heap rows. Deduplication is enabled by default for most operator classes; it is the reason a status-leading multicolumn index is much cheaper on PG13 and later than it used to be.

Bottom-up index deletion (PostgreSQL 14) changes how the index reclaims space under heavy churn. Before PG14, a B-tree leaf that filled with dead-but-not-yet-vacuumed entries would split when it ran out of room, even if most of its entries were dead and would be removed by the next vacuum. The split was wasted work, and it permanently grew the index. From PG14 on, when a leaf is about to split, PostgreSQL first attempts a bottom-up deletion: it visits the heap pages the leaf’s entries point to, identifies entries whose heap tuples are dead, and removes them. If enough space frees up, the split is avoided. This makes the index much more resilient to non-HOT update churn on indexed columns, and it noticeably reduces bloat between vacuum runs on update-heavy tables.

Together, deduplication and bottom-up deletion mean the modern B-tree (PG13 to PG17) is materially smaller and cheaper per insert than the one many older articles describe. If you are migrating from PG11 or PG12, expect indexes to shrink after a REINDEX, and expect update-heavy tables to need less aggressive vacuuming than they used to.

7. INCLUDE columns and index-only scans

An ordinary B-tree leaf stores (key, ctid). To return any column the query asks for that is not in the key, PostgreSQL has to fetch the heap tuple at that ctid. That heap fetch is what makes a normal Index Scan more expensive than an Index Only Scan, where the entire result can be served from the index.

PostgreSQL 11 added INCLUDE columns, the way to keep a leaf entry small for sorting purposes while still carrying extra payload for the planner to read. CREATE INDEX ON orders (customer_id) INCLUDE (status, total) builds an index whose tree is sorted by customer_id, but whose leaves also carry status and total for each row. A query that filters on customer_id and projects only customer_id, status, and total can run as an Index Only Scan, skipping the heap entirely when the visibility map says the relevant heap pages are all-visible.

The trade-off is exactly what you would expect: wider leaves. Every leaf page now holds the keys plus the included columns, so each leaf packs fewer rows, and the leaf level grows. A range scan that returns ten thousand rows reads more leaf pages than before, and the index occupies more space in the shared buffer pool, evicting other pages. INCLUDE earns its place when the index-only scan removes a hot heap-fetch cost from a query that runs often, and it is a waste when the included columns are wide (think text or jsonb) or the query rarely runs as an index-only scan.

The included columns also do not affect uniqueness. A UNIQUE index ON (customer_id) INCLUDE (status) enforces uniqueness only on customer_id. That is sometimes the reason you reach for INCLUDE: you want a covering payload without changing the uniqueness semantics of the key.

8. Operator classes, sort order, and unique enforcement

A B-tree compares keys using an operator class, a named bundle of comparison operators (<, <=, =, >=, >) that defines the index’s sort order. The default operator class for a column type matches the type’s default comparison: text_ops sorts text using the collation’s locale rules, int4_ops sorts integers numerically, and so on. You almost never name an operator class explicitly, which is exactly why it is worth knowing when you must.

The case that bites people most often is LIKE 'prefix%' on a text column under a non-C collation. Under, say, an English locale, the default text_ops does not order text in the order LIKE cares about (locale collation can split character runs, treat case as equivalent, and so on), so the planner refuses to use the default index for a prefix match. The fix is to build the index with text_pattern_ops (or varchar_pattern_ops, or bpchar_pattern_ops for char(n)), which sorts by raw byte order, the order LIKE 'prefix%' actually walks. With that operator class, the index supports anchored-prefix queries:

-- Default text_ops: does NOT serve LIKE 'foo%' under a non-C collation.
CREATE INDEX ON users (email);

-- text_pattern_ops: byte-order sort, perfect for LIKE 'foo%'.
CREATE INDEX ON users (email text_pattern_ops);

SELECT * FROM users WHERE email LIKE 'alice@%';

If your database is set up with the C collation, the default text_ops already does this, and you do not need text_pattern_ops. But the C collation has its own consequences (case-sensitive ordering, non-localized sorting), so most production databases keep a locale and add text_pattern_ops indexes where prefix matching matters.

The B-tree also sorts each column with a direction, and direction is per column. CREATE INDEX ON events (occurred_at DESC, severity ASC NULLS FIRST) builds a tree whose first column descends and whose second ascends, with nulls placed at the start of each group. The point of this is to match the ORDER BY of a hot query exactly. When the index direction lines up with the query’s ORDER BY, the planner can return rows already in order and skip the Sort node above the scan. A mismatch costs you an in-memory sort, sometimes a spill to disk if the result is large. The same trick lets a B-tree run reverse scans: it can walk the leaf sibling pointers from right to left as easily as left to right, so an ascending index can serve a DESC query at the same cost.

Uniqueness is just a B-tree with an extra check. A UNIQUE index inserts a new entry, then verifies that no other live entry shares the same key. Primary keys, unique constraints, and the UNIQUE keyword on CREATE INDEX all produce the same underlying structure. This is also why ON CONFLICT (column) DO ... requires a UNIQUE constraint or a unique index on that column: the conflict detection literally reuses the index’s uniqueness check.

9. The cost of an index, and the anti-patterns

Every B-tree index is a write tax. On every insert, PostgreSQL descends the tree and writes a leaf entry. On every non-HOT update of an indexed column, it adds a new entry and (later) reclaims the old one. On every delete, it leaves a dead entry until VACUUM or microvacuum cleans it up. The index also occupies cache. A four-gigabyte index competes with the heap for shared buffers, and indexes that are scanned rarely but updated constantly steal cache from indexes that earn it.

The natural mistake is the “every column gets an index” anti-pattern: a table with twenty columns and twenty single-column indexes. Most of those indexes will never be used; the planner reaches for them only when the column is selective, and a column that is rarely filtered in production is rarely indexed by the planner even when an index exists. Meanwhile every insert and every update is paying the write tax on twenty trees. The fix is to look at pg_stat_user_indexes (rung 10) and drop the ones with no idx_scan activity.

A close cousin is the low-selectivity index. An index on a column with three distinct values (status, for example) does not narrow a scan enough for the planner to prefer it over a sequential scan: even the best-case lookup returns a large fraction of the table, and the planner correctly chooses to read the heap sequentially instead. The index sits unused while it gets written to on every insert and update. The right answer is usually a multicolumn index that combines the low-cardinality column with a more selective one (the leftmost-prefix rule from rung 3 is the design tool), or a partial index that only covers the rare value (a topic the advanced-indexing-techniques page expands).

A third pitfall is the never-REINDEX’d index on an update-heavy table. Before PG14’s bottom-up index deletion, indexes on heavily updated columns would grow several times larger than the live data they pointed to. Even with PG14 and PG13’s improvements, an index that has lived through years of churn benefits from a periodic REINDEX CONCURRENTLY to compact it back down. We cover the diagnostic for this in rung 10.

The last is the over-INCLUDEd covering index: an index built to serve one specific query with a fat payload of included columns. It works beautifully for that one query and slowly starves the rest of the workload by occupying half the buffer pool. Reach for INCLUDE only when the index-only scan is the bottleneck of a hot query, and keep the payload narrow.

10. Diagnosing and maintaining a B-tree in production

You should be able to answer four questions about any index in production: is it being used, how big is it, how bloated is it, and how do you rebuild it without downtime.

Is it being used? The view pg_stat_user_indexes answers this directly. idx_scan counts how many index scans have started against the index since the last statistics reset; idx_tup_read counts entries returned, and idx_tup_fetch counts heap tuples fetched. An index with zero or near-zero idx_scan over a meaningful period is a strong drop candidate.

-- Indexes that have never been scanned (or almost never).
SELECT s.schemaname, s.relname AS table, s.indexrelname AS index,
       s.idx_scan, pg_size_pretty(pg_relation_size(s.indexrelid)) AS size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE NOT i.indisunique          -- never drop a uniqueness-enforcing index
  AND s.idx_scan < 50            -- threshold for "essentially unused"
ORDER BY pg_relation_size(s.indexrelid) DESC;

The pg_index catalog has the structural metadata: indisunique, indisprimary, indkey (the columns), and the operator classes. Cross-join it to skip uniqueness and primary-key indexes when hunting drop candidates.

How big is it? pg_relation_size('orders_cust_status_created') returns bytes; pg_size_pretty formats them. Compare to pg_relation_size('orders') for the table itself. A single index larger than its table is normal for wide multicolumn or covering indexes; several indexes whose combined size dwarfs the table is the sign of the “every column gets an index” anti-pattern from rung 9.

How bloated is it? The pgstattuple extension provides pgstatindex(<index_name>), which reports leaf fragmentation as avg_leaf_density (target around 90 percent, with 50 to 70 percent typical of a bloated index) and leaf_fragmentation (the percent of leaf pages whose neighbors are out of order, a proxy for how badly a range scan reads non-sequentially). High fragmentation plus low density is the signature of an index that has lost ground to churn.

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Watch the density and fragmentation of a hot index.
SELECT * FROM pgstatindex('orders_cust_status_created');
 version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       4 |          3 |  812400640 |             3 |            612 |      98562 |           0 |           412 |             62.4 |               21.7

How do you rebuild it without downtime? REINDEX CONCURRENTLY <index> builds a new copy of the index alongside the old one, swaps them atomically, and drops the old one. Reads against the table continue to use the existing index throughout. The catch is the same as for CREATE INDEX CONCURRENTLY: the operation is not transactional, it takes a lower-level lock for the swap, and if it fails partway it leaves an invalid index behind that you need to drop manually. For uniqueness-enforcing indexes the rebuild is the only safe way to compact a bloated primary key on a live table.

-- Rebuild a bloated index without taking the table offline.
REINDEX INDEX CONCURRENTLY orders_cust_status_created;

Put these together and you have a maintenance loop: list indexes by size, prune the ones with no scans, REINDEX CONCURRENTLY the ones whose pgstatindex shows low density, and keep an eye on the views over time. A B-tree that is right-sized, scanned often, and periodically compacted earns its keep; one that is none of those is paying the write tax of rung 9 for nothing.

Mastery Questions

  1. A teammate proposes adding a uuidv4 primary key to a new high-write events table because UUIDs are “nicer to expose.” After six months at production traffic, what would you predict, and what would you suggest instead?

    Answer. I would predict that the primary key index ends up significantly larger than the equivalent bigserial index, that insert throughput degrades as the working set of the index outgrows shared buffers, and that pgstatindex shows low avg_leaf_density on the PK. The mechanism is that every insert lands on a uniformly random leaf, so splits scatter across the whole tree, leaves end up roughly half full, and the rightmost-path locality of an ordered key is lost; only the upper levels of the tree stay hot in cache, while leaf reads on insert become random I/O. The right alternative is uuidv7 (time-ordered, still random-looking externally) or bigserial (cheapest internally, but exposes counts). The data-types-and-schema-design page makes the same point in a wider context: the external shape of an identifier and its internal locality are separate decisions, and choosing the wrong identifier is one of the few schema mistakes that quietly compounds at scale.

  2. You add a covering index CREATE INDEX ON orders (customer_id) INCLUDE (status, notes_jsonb, last_event_at) to make one specific dashboard query fast, and a week later the rest of the application slows down. What is the most likely cause, and how would you find it?

    Answer. The INCLUDE payload is too wide. notes_jsonb in particular can be many kilobytes per row, so each leaf page packs only a handful of entries, the index grows several times larger than a plain (customer_id) index would, and it begins evicting other pages from shared buffers. Hot heap pages and other index leaves stop fitting in cache, so unrelated queries pay extra reads, even though their plans did not change. I would confirm by comparing pg_relation_size of the new index to other indexes on the table, checking pg_buffercache (if installed) for the share of buffers the index occupies, and watching pg_stat_user_indexes to see whether idx_scan on the new index justifies the cache cost. The fix is to drop the wide payload and keep only the columns that the dashboard actually needs (often status and a key timestamp, never the raw jsonb), or to drop the covering form entirely and accept the heap fetch if the dashboard does not run often enough to earn it.

  3. Your users table has a B-tree index on email, but EXPLAIN shows that SELECT ... WHERE email LIKE 'alice%' falls back to a sequential scan. The locale is en_US.UTF-8. Why, and how do you fix it without breaking anything that already uses the index?

    Answer. The existing index uses the default text_ops operator class, which sorts text under the database’s locale collation. LIKE 'alice%' matches by byte order, not by collation order, so even though the index is sorted, the planner cannot prove that all matching values are contiguous in it and refuses to use it for the prefix match. The fix is to create a second index with the byte-order operator class: CREATE INDEX users_email_pattern_idx ON users (email text_pattern_ops). That index sorts by raw bytes, which is exactly the order LIKE 'foo%' walks, so the planner uses it for anchored-prefix queries. Equality lookups (WHERE email = 'alice@example.com') keep using whichever index is cheaper; you do not need to drop the original. The cost is one extra index to maintain on writes, which is the right trade for a hot prefix-search query, and you can verify the win with EXPLAIN (ANALYZE, BUFFERS) showing an Index Scan using users_email_pattern_idx and a sharp drop in Buffers: shared read.

Recommended next

  • Index Types Beyond B-Tree
    Builds directly on this page: Index Types Beyond B-Tree is the next step in the PostgreSQL performance ladder.
Sources & evidence14 claims · 3 cited

Mechanism and operational claims grounded in the PostgreSQL B-tree, indexes, and CREATE INDEX documentation; release-pinned facts (PG11 INCLUDE, PG13 deduplication, PG14 bottom-up index deletion) and the fanout-shapes-log-n-page-reads math are marked stable-common-knowledge with empty source_ids where the listed docs do not state the exact version or number.

  • A PostgreSQL B-tree is built from 8 kB pages: a root and inner pages hold sorted separator keys plus child pointers, while leaf pages hold sorted (key, ctid) entries that point into the heap, and each leaf is linked to its left and right sibling so an ordered range scan can stream across the leaf level after one descent.verified
  • Fanout from the 8 kB page size keeps the tree shallow: with a typical inner-page fanout near 200, a B-tree reaches one million entries in 3 levels and hundreds of millions in 4, so a point lookup costs O(log_n) page reads that is a single-digit integer in practice.stable common knowledge
  • A point lookup descends root to inner to leaf with one binary search per page (typically 3-4 index page reads) and then follows the leaf entry's ctid into one heap page; a range scan reuses that single descent and then walks leaf sibling pointers in order, paying one descent plus a handful of sequential leaf reads regardless of how many rows match.verified
  • A multicolumn B-tree on (a, b, c) sorts entries by a, then b within ties on a, then c within ties on (a, b), so it serves a full index search for equality on a leftmost prefix optionally followed by a range or sort on the next column; predicates only on non-leftmost columns (such as WHERE b = 2 alone) cannot drive a tree descent and the planner typically falls back to a sequential scan or a different index.verified
  • Every insert descends to its target leaf and writes one (key, ctid) entry; if the leaf is full the page splits, allocating a new leaf, moving about half the entries, updating sibling pointers, and pushing a new separator key into the parent (which can cascade). Ordered keys (bigserial, uuidv7, timestamps) keep inserts on the rightmost leaf so splits are local and leaves stay densely packed, while uniformly random keys (uuidv4) split across the whole tree and leave low-density leaves that bloat the index several times beyond an ordered key on the same workload.verified
  • A DELETE does not remove the corresponding B-tree entry; the entry is left in place and reclaimed by VACUUM or marked dead in-line during a scan that notices the heap tuple is no longer visible (microvacuum). An UPDATE writes a new heap tuple with a new ctid and therefore a new index entry, unless the update qualifies for a HOT update (no indexed columns changed and the new tuple fits on the same heap page), in which case the indexes are not touched at all.verified
  • PostgreSQL 13 added B-tree deduplication: a leaf can store repeated keys as a posting list (one copy of the key plus a packed list of ctids), which dramatically shrinks indexes whose leading column has low cardinality.stable common knowledge
  • PostgreSQL 14 added bottom-up index deletion: before a leaf splits, PostgreSQL visits the heap pages its entries point to, identifies entries whose heap tuples are dead, and removes them in place, often avoiding the split entirely and noticeably reducing bloat on update-heavy tables between vacuum runs.stable common knowledge
  • PostgreSQL 11 added INCLUDE columns for B-tree indexes: non-key payload columns are stored in leaf entries without affecting the index's sort order or uniqueness, enabling index-only scans for queries that project those columns, at the cost of wider leaves (fewer rows per leaf page and more shared-buffer pressure).verified
  • Under a non-C collation, the default text_ops operator class does not order text by raw byte order, so the planner cannot use such an index for LIKE 'prefix%' matches; building the index with text_pattern_ops (or varchar_pattern_ops / bpchar_pattern_ops) sorts by byte order and lets the index serve anchored-prefix LIKE queries.verified
  • B-tree direction is per column (ASC or DESC, with NULLS FIRST or NULLS LAST), so an index whose column directions match the query's ORDER BY can return rows already sorted and skip a Sort node; B-trees can also walk leaf sibling pointers in reverse, so an ASC index serves a DESC scan at the same cost.verified
  • A UNIQUE index is a B-tree with an extra check that inserts verify no other live entry shares the same key; primary keys, UNIQUE constraints, and CREATE UNIQUE INDEX all produce the same underlying structure, and ON CONFLICT (column) reuses that index's uniqueness check.verified
  • Every B-tree adds insert overhead and (non-HOT) update overhead per indexed column, and an index on a low-selectivity column (a column with very few distinct values) is typically skipped by the planner in favor of a sequential scan, so the index pays the write tax without serving reads; the fix is usually a multicolumn or partial index rather than a single-column one.verified
  • pg_stat_user_indexes exposes idx_scan, idx_tup_read, and idx_tup_fetch per index for finding unused indexes; the pgstattuple extension's pgstatindex function reports avg_leaf_density and leaf_fragmentation for measuring index bloat; and REINDEX INDEX CONCURRENTLY rebuilds a bloated index alongside the old one and swaps them atomically, so reads continue uninterrupted, leaving an invalid index behind to drop manually only if the rebuild fails.verified

Cited sources