Advanced Indexing Techniques

How a principal PostgreSQL engineer shapes btree indexes for production: partial, expression, covering, multicolumn, and ordered indexes, built and repaired with CREATE/REINDEX CONCURRENTLY, kept healthy against bloat with pgstatindex and fillfactor, matched to queries with operator classes, and verified with EXPLAIN (ANALYZE, BUFFERS).

Learning outcomes

A vanilla btree on the right column is the first 80 percent of indexing. The next 20 percent, the part that decides whether a hot endpoint serves a thousand queries a second or thirty, lives in choices the planner cannot make for you: which rows the index even contains, what shape the leaf carries, what order it walks, and whether it can be built without taking the table offline. The b-tree-index-internals and index-types-beyond-b-tree pages set up the mechanics. This page is where you turn those mechanics into production decisions.

After studying this page, you can:

  • Choose between a partial, expression, covering, multicolumn, or ordered index for a given query and justify the choice with the predicate, projection, and sort it has to serve.
  • Write CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY correctly, and recover when one leaves an INVALID entry behind.
  • Spot index bloat with pgstatindex and pgstattuple, and plan a reindex window that does not block writes.
  • Read an EXPLAIN (ANALYZE, BUFFERS) and tell whether you are getting an index-only scan with low Heap Fetches, and what to fix when you are not.
  • Avoid the common failures: a covering index so wide it loses to a narrow one, a partial index whose query forgets the predicate, an expression index whose query uses a different cast, and the cargo-cult new index on a tiny table.

Before we dive in

You should already know that a btree leaf stores key columns in sort order with a TID pointing at the heap, that a query plan can use an index for filtering, for sorting, or for both, and that VACUUM maintains a visibility map flagging all-visible heap pages. The b-tree-index-internals page covers leaves, internal pages, and page splits. The index-types-beyond-b-tree page covers GIN, GiST, BRIN, and hash. We do not redo any of that here.

A few words you will need, defined as we use them. A sargable predicate is one the planner can satisfy with an index range scan rather than a filter on the heap. A predicate index is the same thing as a partial index: an index that carries a WHERE clause and only stores rows that satisfy it. An operator class (opclass) is the set of comparison and support functions an index uses to interpret a column for a particular access method, like text_pattern_ops for prefix matching. Hold those three. The rest of this page is consequences.

Mental Model

The tempting wrong model is that an index is “a sorted copy of a column” and that the way to make a query fast is to add one index per column it touches. Under that model more indexes are always better, and a covering index is just a wider sorted copy.

The better model is that an index is a tiny, purpose-built data structure you are commissioning for a specific query shape. Every choice trades read speed against three costs: write amplification (the same UPDATE now has to maintain N indexes), cache footprint (a wider index pushes hot pages out of shared_buffers), and planner complexity (more options means more chances to pick a bad plan). A partial index says “I only care about active orders.” An expression index says “I only care about the lowercased email.” A covering index says “I will pay extra leaf bytes to skip the heap entirely.” A multicolumn index says “this filter set always comes together.”

Hold that picture. An index is a contract for one query shape, not a sorted column. Once that clicks, the question stops being “should I add an index” and becomes “which of the five techniques in this page does this query actually want, and what is the bill.”

Breaking it down

1. Partial indexes: index only the rows that matter

Start with the technique that pays back the fastest. A partial index carries a WHERE predicate on its definition. The index contains entries only for rows that satisfy the predicate at insert time, and any row that later stops matching is dropped from the index when it is vacuumed. Two consequences follow at once: the index is smaller, often by orders of magnitude, and every write that touches rows outside the predicate skips the index entirely, so the write amplification cost is gone for that part of the table.

The classic shapes show up in every real schema. A soft-delete column means you almost never query the deleted rows, so an index over the live rows is what you want.

create index orders_user_active_idx
  on orders (user_id)
  where deleted_at is null;

A queue table where most rows are already processed wants an index over the small unprocessed tail.

create index jobs_pending_idx
  on jobs (created_at)
  where status = 'pending';

A multi-tenant table where one big tenant dominates and the rest are sparse can carry a partial index per tenant or a partial over the long tail, instead of one fat shared index.

There is a sharp catch the planner cannot fix for you: the query must include the predicate, or a predicate the planner can prove implies it, for the partial index to be used. A query that selects pending jobs by id without saying where status = 'pending' will not touch the index, because the planner cannot tell whether the row qualifies. Two equivalent forms can disagree: where status = 'pending' matches the index, but where status <> 'done' and status <> 'failed' does not, even on a workload where only those three states exist. Production query templates have to be aligned to the partial predicate, not the other way around.

flowchart LR
    A[1B-row jobs table] --> B[Full index on created_at]
    A --> C["Partial index WHERE status = 'pending'"]
    B --> D[Index size ~ table size, every UPDATE rewrites leaf]
    C --> E["Index size ~ 0.01 percent of table, only pending rows touched"]

The size collapse is the headline, but the write story is what wins production. On a queue where a row spends two seconds in pending and forever in done, the partial index is touched on insert and on the single transition to done. The full index is touched on every other status change too. That difference shows up as INSERT latency, UPDATE p99, and replication lag, not just as disk.

2. Expression indexes: making a predicate sargable

A query like where lower(email) = $1 cannot use a plain btree on email, because the index sorts the raw column and the predicate compares the lowercased one. The planner sees lower(email) as a function call on the column, decides it cannot prove that the function preserves order, and falls back to a sequential scan with the function applied per row. The fix is an expression index (sometimes called a functional index): you build the index on the expression itself.

create index users_email_lower_idx on users (lower(email));

Now the index physically sorts on the value of lower(email), and a query whose WHERE uses the exact same expression is sargable again. Three rules attach to that “exact same.”

First, the expression must be immutable: same inputs always produce the same output. lower(email) qualifies, date_trunc('day', ts) qualifies, but date_trunc('day', ts, 'America/New_York') does not because the time-zone interpretation can shift. Postgres rejects the index build if the function is not marked IMMUTABLE. If you need a stable timestamp truncation, store the truncated value in a generated column and index that column instead.

Second, the query has to use the same expression. where lower(email) = $1 uses the index; where email ilike $1 does not, even when the runtime behaviour is the same. Implicit casts bite here too: indexing (lower(email)) and querying where lower(email::citext) = $1 will not match, because the planner sees a different expression after the cast.

Third, since PostgreSQL 14 you can attach extended statistics on an expression so the planner has accurate row estimates for predicates over that expression, not just over plain columns.

create statistics users_email_lower_stats (ndistinct, dependencies)
  on lower(email)
  from users;
analyze users;

Without that, the planner falls back to a default selectivity for the expression and can pick a bad plan even when the right index exists. A reasonable rule: any time you create an expression index you actually rely on for selectivity, create the matching CREATE STATISTICS next to it.

3. Covering indexes with INCLUDE

A btree index already stores the key columns in its leaf. A covering index appends extra columns to the leaf with the INCLUDE clause, so a query that needs only those columns can be answered straight from the index without touching the heap. The plan node you want to see is an Index Only Scan, and the diagnostic you want to see on the next line is Heap Fetches: 0.

create index orders_user_covering_idx
  on orders (user_id)
  include (status, total_cents, created_at);

A query like select status, total_cents, created_at from orders where user_id = $1 can now be served entirely from the leaf pages of this index. On a hot read path, the win is large: an index lookup typically touches one or two index pages and zero heap pages, instead of an index lookup followed by a random heap fetch per row.

The trade is real, and forgetting it is how covering indexes turn into a net loss. INCLUDE columns are not key columns: they do not affect ordering, they do not enforce uniqueness, and they do not let you do anything new with the index in terms of search. What they do is widen every leaf entry. A wider leaf means more leaf pages, a larger index on disk, more index pages in shared_buffers, and more bytes to rewrite on every leaf split during a write. Push enough columns into INCLUDE and you can blow past the size of the table itself, at which point the index is no longer “in cache” for free and an Index Only Scan against a cold leaf is slower than the original heap-fetching plan.

A second trap is the visibility map, covered in rung 9. Even with a perfect covering index, the planner still has to check that the heap page is all-visible to skip the heap fetch. If VACUUM is behind, Heap Fetches climbs and the covering index loses most of its win.

Partial vs expression vs covering
An index over a SUBSET of the table's rows, filtered by a WHERE predicate. Smaller, faster, and never touched by writes outside the predicate. Wins when most rows do not match the query (queues, soft-deleted rows, active flags). Cost: the query must repeat the predicate (or one that implies it) or the planner cannot use the index.

4. Multicolumn versus several single-column indexes

A multicolumn btree on (a, b, c) is one index that physically sorts by a, then by b within each a, then by c within each (a, b). The planner can use it for any leftmost prefix of the columns: a, (a, b), and (a, b, c). It cannot use it for b alone, because rows with the same b are scattered across every a value in the leaf order.

Several single-column indexes on a, b, and c give the planner more flexibility. For a query with predicates on b and c, the planner can do a BitmapAnd: scan each single-column index to build a bitmap of matching TIDs, AND the bitmaps, then fetch the surviving heap pages in physical order. That works, and on ad-hoc analytic queries it is often the right call.

The decision rubric is straightforward. If one query shape dominates and its filter set is stable, a single multicolumn index aligned to that shape is faster: one index scan, no bitmap construction, leaf entries packed tightly along the access pattern. If many query shapes share the same columns in different combinations, a set of single-column indexes the planner can mix is more flexible at the cost of per-query overhead.

flowchart TB
    Q["WHERE a = ? AND b = ? AND c = ?"]
    Q --> M[Multicolumn index on a, b, c]
    Q --> S[Single-column indexes on a, b, c]
    M --> M1[One index scan, leaf packed by a then b then c]
    S --> S1[BitmapAnd of 3 bitmap scans, then heap]
    M1 --> R1[Faster when the filter set is stable]
    S1 --> R2[Flexible when filter sets vary across query shapes]

A subtle point about column order in a multicolumn index. Put the column with the most selective equality first, not the most selective overall. If a is an equality on a user id (high cardinality, narrow result) and b is a range on created_at, (a, b) lets the index do an equality on a and a range scan on b within that user. (b, a) forces the index to walk every a for every matching b value. The same reasoning gives you ordering for sort avoidance, which is the next rung.

5. Ordering for sort avoidance

A btree leaf is already sorted, so an index can sometimes deliver rows in the order a query needs without a sort step. The plan you want to see is the index scan feeding directly into LIMIT or a merge join with no Sort node between them.

The match has to be exact, including direction and NULL placement. An index on (created_at) walks ascending with NULLS LAST by default. A query that asks for ORDER BY created_at DESC NULLS FIRST LIMIT 10 cannot use it without a Sort, because the index walks the wrong way. Build the index in the query’s order to fix it.

create index orders_recent_idx
  on orders (created_at desc nulls first);

Now ORDER BY created_at DESC NULLS FIRST LIMIT 10 reads the first ten leaf entries and stops. On a billion-row table, that is the difference between a 200ms top-N and a sort of the entire table.

Compose this with the earlier techniques. A query that returns the most recent ten pending jobs for a user wants a partial, multicolumn, ordered index with the projected columns included.

create index jobs_user_recent_pending_idx
  on jobs (user_id, created_at desc)
  include (payload_id, priority)
  where status = 'pending';

That single index is partial (only pending rows), multicolumn (user equality plus time range), ordered (newest first), and covering (the projection comes from the leaf). The query plan should be an Index Only Scan with Heap Fetches: 0, walking exactly ten leaf entries.

6. Building indexes without downtime: CIC and REINDEX CONCURRENTLY

A plain CREATE INDEX takes an ACCESS EXCLUSIVE lock on the table for the duration of the build, which on a hot table is the same thing as an outage. The right tool in production is CREATE INDEX CONCURRENTLY, often shortened to CIC.

create index concurrently orders_user_active_idx
  on orders (user_id)
  where deleted_at is null;

CIC takes a SHARE UPDATE EXCLUSIVE lock, which does not block INSERT, UPDATE, or DELETE. It does block other schema changes on the same table, which is what you want anyway. Under the hood CIC performs two table scans and waits for old transactions in between, so it is slower and more I/O-heavy than a plain build. The price is paid for not blocking writes.

What CREATE INDEX CONCURRENTLY actually does
Phase 1: register the indexPostgres inserts a pg_index row marked invalid, then acquires a SHARE UPDATE EXCLUSIVE lock on the table. Reads and writes continue. Concurrent writers now know about the index and start maintaining it for new rows, even though it is not yet complete.
Step 1 of 5

Two operational facts follow from that sequence. First, CIC needs you to keep idle_in_transaction_session_timeout short, or phases 3 and 5 can stall behind a connection that does nothing useful. Second, when CIC fails (a unique-violation discovered in phase 4, a worker killed, a deadlock with a DDL on the same table), it leaves a row in pg_index with indisvalid = false. That index is visible to your \d output and is being maintained on every write, but the planner refuses to use it. The recovery is one line:

drop index concurrently orders_user_active_idx;

You then re-run the CIC. Always check pg_index.indisvalid after a CIC, do not assume success from a clean exit.

select c.relname, i.indisvalid, i.indisready
from pg_index i
join pg_class c on c.oid = i.indexrelid
where not i.indisvalid;

The matching operation for fixing a bloated or corrupt existing index is REINDEX CONCURRENTLY (added in PostgreSQL 12). It builds a new index alongside the old one, swaps them under a brief lock, and drops the old one. Same locking discipline as CIC: writes continue, and a failure leaves an INVALID shadow index named with an _ccnew suffix that you drop the same way.

reindex index concurrently orders_user_idx;

7. Bloat, fillfactor, and the reindex window

A btree index bloats for the same reason a heap does: dead entries accumulate and pages do not shrink on their own. You spot it with the pgstatindex extension, which reports leaf density, dead tuple counts, and the live-to-allocated ratio per index.

create extension if not exists pgstattuple;

select relname, pgstatindex(oid) from pg_class
where relname = 'orders_user_active_idx';

The output gives you avg_leaf_density, the percentage of leaf bytes that are live entries. A healthy btree sits in the 70 to 90 percent range. Numbers in the 40s or below mean the index is paying for storage it no longer uses, and on a hot table it is also paying for the pages in cache. Cross-check against pg_stat_user_indexes and the table-level pgstattuple to confirm the bloat outpaces the heap’s own bloat rather than just tracking it.

The cleanup tool is REINDEX CONCURRENTLY from rung 6. Plan the window: a reindex doubles the disk footprint of the index for the duration of the build (you have the old and new index at once), needs enough I/O headroom that it does not starve normal queries, and on a multi-terabyte table can take hours. The classic pattern is to reindex the bloated index off-hours, monitor pg_stat_progress_create_index during the build, and verify indisvalid = true afterward.

A subtler control is fillfactor. The default for a btree index is 90, meaning each leaf is filled to 90 percent at build time and 10 percent is reserved for in-place updates. Lower fillfactor (with (fillfactor = 70)) gives the index more room for new entries before a leaf has to split, which on a high-update workload reduces bloat at the cost of a larger initial index. The trade-off is not the same as on a heap: an index split is expensive, but a fillfactor that is too low wastes cache for the steady-state read workload. A reasonable starting point is to leave fillfactor at 90 for read-mostly indexes and consider 80 or 70 for indexes on columns that are updated in place often.

8. Operator classes you actually reach for

A btree index uses an operator class to decide what “comparison” means for the indexed type. The default opclass for text sorts using the database collation, which is the right choice for ORDER BY name but the wrong one for anchored LIKE. Three opclasses come up so often in production that you should recognize them on sight.

text_pattern_ops sorts text byte-by-byte, the way C’s memcmp does, ignoring collation. That makes it sargable for anchored LIKE patterns like where path like '/usr/local/%'. The default text_ops cannot do this on a non-C collation, because collation rules can reorder the bytes a pattern would compare. The cost is that this index cannot serve a normal ORDER BY name in collation order; you need a separate default-opclass index if you need both.

create index files_path_prefix_idx on files (path text_pattern_ops);
-- now this is an index range scan:
select * from files where path like '/usr/local/%';

gin_trgm_ops (from the pg_trgm extension) supports fuzzy and unanchored matching on text. It builds a GIN index over the set of trigrams in each value, so LIKE '%search%', ILIKE, and similarity queries (%, <%) become index searches instead of sequential scans. This is the standard tool for in-database fuzzy search before you reach for a separate search engine.

create extension if not exists pg_trgm;
create index articles_title_trgm_idx on articles using gin (title gin_trgm_ops);

jsonb_path_ops is a GIN opclass for jsonb that supports only the containment operator @>. It indexes paths-and-leaves directly rather than every key, so the index is smaller and faster than the default jsonb_ops opclass, at the price of giving up ?, ?|, and ?& (key-existence) queries. If your jsonb access is essentially “does this document contain this subdocument,” jsonb_path_ops is what you want.

create index events_payload_path_idx on events using gin (payload jsonb_path_ops);
-- sargable for: where payload @> '{"actor": {"role": "admin"}}'

The index-types-beyond-b-tree page covers GIN, GiST, and BRIN at the access-method level. The point on this page is that the opclass is the lever for matching an index to your real query shape, not the access method alone.

9. Index-only scans and the visibility map

An index-only scan is the plan node that answers a query entirely from the index, without touching the heap. It is the goal for hot read paths and the headline win of covering indexes, but it has a precondition that catches people: the heap page for each matching row must be marked all-visible in the visibility map, or Postgres has to fetch the heap anyway to check tuple visibility.

The visibility map is a per-relation bitmap that VACUUM updates: one bit per heap page, set when every tuple on that page is visible to every active snapshot. As long as that bit is set, the planner trusts the index for visibility too. When VACUUM falls behind, the bit goes stale, and an index-only scan starts showing Heap Fetches: 1, Heap Fetches: 1000, Heap Fetches: N, one per scanned row that lives on a heap page whose all-visible bit is not set.

The diagnostic line in EXPLAIN (ANALYZE, BUFFERS) is exactly this:

Index Only Scan using orders_user_covering_idx on orders
  (cost=0.56..18.62 rows=12 width=24)
  (actual time=0.041..0.118 rows=12 loops=1)
  Heap Fetches: 0
  Buffers: shared hit=4

You want Heap Fetches: 0, or at most a small number relative to the row count. Anything else means the visibility map is not pulling its weight, and the win you designed the covering index for is gone. The fix is to tune autovacuum so it runs often enough to keep the all-visible bits set on hot tables; the autovacuum-tuning page covers the per-table scale factor knob and the insert-driven autovacuum that catches append-only tables.

alter table orders set (autovacuum_vacuum_scale_factor = 0.02);
alter table orders set (autovacuum_vacuum_insert_scale_factor = 0.02);

If your covering index plan still shows non-zero Heap Fetches after autovacuum is keeping up, run VACUUM manually on the table and re-check. The visibility map is what lets a covering index pay back its width. Without it, the index is wider for nothing.

10. The case against pg_hint_plan in production

The pg_hint_plan extension lets you embed planner hints in a query, forcing a specific scan type, join order, or index. It is a real tool and it has its place: an emergency fix while you investigate a regression, a one-off override on a query you cannot rewrite right now. As a production design, it is a trap.

The reason is that a hint pins a plan against a snapshot of statistics that is going to drift. A HashJoin hint that was correct when one side was a thousand rows becomes wrong when it grows to ten million; an IndexScan hint becomes wrong when a new partial index would serve the query better. The hint does not know that, so the plan rots silently and you get a slow query nobody can explain because the trace looks deliberate.

Every problem pg_hint_plan claims to solve has a better answer in the techniques on this page. Bad row estimates lead to bad plans: the answer is CREATE STATISTICS on the right columns or expression (rung 2). A query stubbornly picks the wrong index: the answer is to either build the index it should pick (a partial or expression index aligned to the actual predicate) or fix the statistics so the cost model agrees. A nested-loop plan is too slow: usually the answer is the right covering or multicolumn index, not a hint that forces a hash join over the wrong data.

Use hints to diagnose, almost never to ship. If you find yourself reaching for pg_hint_plan in a hot path, the index design or the statistics are the real fix.

11. A hot-path checklist and its failure modes

You can compress everything above into a five-step checklist for designing an index against a single hot query.

  1. Identify the actual filter set. Read the query template, not the schema. Which columns appear in WHERE, with which operators, and which combinations actually show up at runtime? The filter set is what decides multicolumn order (rung 4) and whether a partial predicate would shrink the index (rung 1).
  2. Narrow with a partial predicate if you can. Any condition that is always true for the rows this query cares about belongs in a WHERE on the index, not in the index keys.
  3. Include the projected columns. The exact set of columns the query selects, no more, goes in INCLUDE so the plan can be an Index Only Scan (rung 3). Resist the urge to add “in case someone needs it.”
  4. Match the ORDER BY direction and NULLS placement. Build the index in the exact order the query asks for so the planner skips the Sort node (rung 5).
  5. Verify with EXPLAIN (ANALYZE, BUFFERS). Look for Index Only Scan, Heap Fetches: 0, and Buffers: shared hit=N with N small. If Heap Fetches is not near zero, autovacuum is the next fix, not another index (rung 9).
Check yourself
Your covering index on (user_id) INCLUDE (status, total_cents) shows Index Only Scan with Heap Fetches: 12000 on a 12000-row result. What is the most likely cause?

Five failure modes catch teams over and over. An INCLUDE clause so wide it bloats the leaves and pushes the index out of cache, turning the covering index into a net loss against a narrower btree plus a heap fetch. A partial index whose query template forgets to repeat the predicate, so the planner cannot use it and the index becomes pure write tax. An expression index whose query uses a different cast or a slightly different expression, so the rewrite misses and the index sits idle. A CREATE INDEX CONCURRENTLY that failed in phase 4 and left an INVALID entry, which is still being maintained on every write but never used by the planner. And the “add an index” reflex on a table small enough that a sequential scan over its single 8KB page beats any index scan, where the extra index is pure write amplification with no read win.

A few production gotchas worth keeping

Mastery Questions

  1. A teammate adds create index orders_status_idx on orders (status) to speed up select * from orders where status = 'pending' on a table where 0.1 percent of rows are pending and 99.9 percent are done. The query is faster, but the team’s overnight UPDATE orders set status = 'done' job is now 4x slower and replication lag spikes. What is the better index, and why is it faster on both axes?

    Answer. The full btree on status has one entry per row, including all 99.9 percent of rows that nobody queries by status. Every UPDATE that changes status has to update that index, and every INSERT has to extend it, even though only the pending slice is ever read through it. The better index is partial: create index concurrently orders_pending_idx on orders (id) where status = 'pending'. It contains roughly 0.1 percent of the rows, so reads are at least as fast (the leaf is tiny and already in cache), inserts only touch it for the small fraction of rows born pending, and the overnight UPDATE status = 'done' job actually deletes from the index as rows leave the predicate rather than rewriting leaves in place. The query template just has to keep saying where status = 'pending' so the planner uses the index. The lesson is rung 1: an index is a contract for a query shape, and the cheapest contract is one that only covers the rows the shape actually touches.

  2. Your hot endpoint runs select id, status, total_cents from orders where user_id = $1 order by created_at desc limit 50. You build create index concurrently orders_user_covering_idx on orders (user_id, created_at desc) include (status, total_cents). The plan is Index Only Scan and p50 is great, but p99 shows Heap Fetches in the thousands during the morning peak and latency triples. What is happening, and what do you change, in priority order?

    Answer. The index design is correct: partial would not help here because every user_id is queried, and the multicolumn (user_id, created_at desc) plus INCLUDE (status, total_cents) is exactly the shape rung 5 prescribes. The morning spike does not break the index, it breaks the visibility map. During the night, writes pile up on the orders table; autovacuum has not caught up, so the heap pages backing the recent rows are not marked all-visible. An Index Only Scan still has to verify visibility, and when the all-visible bit is missing it falls back to a heap fetch per row, so the win you designed the covering index for evaporates exactly when load is highest. The fix is not another index; it is the autovacuum-tuning page. Drop autovacuum_vacuum_scale_factor on orders to about 0.02, and autovacuum_vacuum_insert_scale_factor similarly, so the visibility map is refreshed often enough to keep Heap Fetches near zero. If autovacuum already runs but cannot finish in time, raise autovacuum_vacuum_cost_limit so vacuum is not throttled to a crawl on fast storage. Only if both fail do you consider the next-tier fix, which is partitioning so each partition is small enough for autovacuum to keep up.

  3. You ran create index concurrently users_email_lower_idx on users (lower(email)) last night. The planner is still doing a sequential scan on select id from users where lower(email) = $1. Walk through how you would diagnose this in five minutes, in order, and name two distinct root causes you would distinguish between.

    Answer. First, check the index is valid: select indisvalid, indisready from pg_index where indexrelid = 'users_email_lower_idx'::regclass. A CREATE INDEX CONCURRENTLY that failed in phase 4 leaves indisvalid = false, and the planner will not consider it. If so, run drop index concurrently on it and rebuild, do not re-run CIC on top of the broken one. Second, run EXPLAIN (ANALYZE, BUFFERS) on the actual query and check the predicate it shows. If it reads Filter: (lower((email)::text) = $1) with a Seq Scan, the index exists and is valid but the expression does not match. The two distinct root causes to distinguish are (a) the query uses a slightly different expression, for example lower(email::citext) or a (email)::text cast that the planner does not equate with the indexed lower(email), in which case you fix the query, and (b) the planner has poor selectivity estimates for the expression (no extended statistics on it), so it costs the index scan above the seq scan even though the index exists and matches. The fix for (b) is create statistics users_email_lower_stats (ndistinct) on lower(email) from users; analyze users; so the planner sees the real n_distinct for the expression. Both are routine to misdiagnose, and the order matters: validity first, expression match second, statistics third. The deeper lesson is rung 2: an expression index only earns its keep when the query template uses the exact same expression and the planner has accurate statistics on it.

Recommended next

  • Table Partitioning
    Builds directly on this page: Table Partitioning is the next step in the PostgreSQL performance ladder.
Sources & evidence14 claims · 5 cited

Syntax, locking semantics, and parameter defaults come from the PostgreSQL documentation for Indexes (partial, expression, covering, opclasses), CREATE INDEX (CONCURRENTLY semantics, fillfactor, INCLUDE), B-Tree (btree opclass behaviour), CREATE STATISTICS (extended statistics on expressions, PG14+), and Planner Statistics. REINDEX CONCURRENTLY's PG12 introduction, the practical fillfactor advice, the case against pg_hint_plan, and operational rules of thumb that the runtime docs do not state are tagged stable-common-knowledge.

  • A partial index defined with CREATE INDEX ... WHERE predicate contains only the rows that satisfy the predicate, so it is smaller than a full index on the same column and is never updated for writes whose rows fall outside the predicate.verified
  • The planner can use a partial index only when the query's WHERE clause includes the index predicate (or one that logically implies it); two semantically equivalent predicates, like status = 'pending' versus status <> 'done' AND status <> 'failed', are not treated as implying each other by the planner.verified
  • An expression index (CREATE INDEX ON t (lower(email))) physically sorts on the value of the expression, so a query whose WHERE uses the same expression becomes sargable; the function in the expression must be marked IMMUTABLE for the index build to succeed.verified
  • Since PostgreSQL 14 you can create extended statistics on an expression with CREATE STATISTICS ... ON lower(email) FROM users, so the planner has accurate ndistinct and dependency estimates for predicates over that expression instead of falling back to a default selectivity.verified
  • INCLUDE columns are appended to the btree leaf so an Index Only Scan can return them without a heap fetch, but they are NOT key columns: they do not affect ordering and do not contribute to uniqueness.verified
  • Pushing too many columns into INCLUDE widens every leaf entry, increases the on-disk index size and cache footprint, and adds write cost per insert and update, which can make a covering index slower than a narrower btree plus a heap fetch when the index no longer fits in shared_buffers.stable common knowledge
  • A multicolumn btree on (a, b, c) can be used for any leftmost prefix of its columns (a, (a, b), or (a, b, c)) but not for b or c alone, because rows sharing only b or c are scattered across the leaf order; when the planner has only single-column indexes, it can instead BitmapAnd their bitmap scans to combine predicates.verified
  • A btree index built with a matching ORDER BY direction and NULLS placement (for example created_at DESC NULLS FIRST) can supply rows in that order directly, letting the planner skip the Sort node above LIMIT or merge join.verified
  • CREATE INDEX CONCURRENTLY takes a SHARE UPDATE EXCLUSIVE lock on the table, which does not block INSERT, UPDATE, or DELETE; the build performs two table scans and waits for old transactions between them.verified
  • If CREATE INDEX CONCURRENTLY fails it leaves a row in pg_index with indisvalid = false; that index is still maintained by every write but the planner refuses to use it, and the only recovery is DROP INDEX CONCURRENTLY followed by a fresh CIC.verified
  • REINDEX CONCURRENTLY, introduced in PostgreSQL 12, rebuilds an existing index alongside the old one and swaps them under a brief lock using the same locking discipline as CIC; the default fillfactor for a btree index is 90, leaving 10 percent of each leaf for in-place updates, and lowering it reduces leaf-split pressure on hot indexes at the cost of a larger initial index.verified
  • text_pattern_ops sorts text byte-by-byte (ignoring database collation) so it is sargable for anchored LIKE patterns like '/usr/local/%', whereas the default text_ops opclass under a non-C collation cannot serve those patterns by index range scan.verified
  • The jsonb_path_ops GIN opclass indexes paths and leaves rather than every key, so it is smaller and faster than the default jsonb_ops but supports only the containment operator @>, not the key-existence operators ?, ?|, or ?&.verified
  • An Index Only Scan must still verify each row's visibility, and it can skip the heap fetch only when the heap page is marked all-visible in the visibility map; a stale visibility map (autovacuum behind on a hot table) forces Heap Fetches to climb toward one per scanned row.verified

Cited sources