Query Rewriting and Anti-Patterns

How to rewrite SQL the PostgreSQL planner handles badly into equivalent forms it optimizes well, covering sargability, implicit casts, OR-to-UNION, semi-joins and the NOT IN null trap, CTE materialization, keyset pagination, N+1 batching, cheap COUNT/LIMIT/DISTINCT shapes, generic-plan skew, set-based writes, and proving each rewrite with EXPLAIN ANALYZE and pg_stat_statements.

Learning outcomes

This page closes the query-optimization track. Everything before it taught you how PostgreSQL thinks: how to read a plan in Reading EXPLAIN ANALYZE, which scan wins in Scan Methods and When Each Wins, how relations combine in Join Algorithms and Join Order, and where the row counts come from in Statistics and Row Estimation. Now you turn that knowledge around and aim it at the SQL itself. The planner is good, but it can only optimize the query you wrote. Some shapes of SQL hide an index from it, force it to do work it could skip, or hand it a wrong estimate. The fix is almost never a new index or a config knob. It is a rewrite that says the same thing in a form the planner handles well.

After studying this page, you can:

  • Spot a non-sargable predicate and rewrite it to a half-open range, an expression index, or the right operator, so an index becomes usable again.
  • Recognize when an implicit type cast lands on the column side and defeats an index, and fix it by matching the types.
  • Rewrite an OR across columns as a UNION of index scans, and choose EXISTS, IN, or a JOIN for a semi-join without falling into the NOT IN null trap.
  • Replace deep OFFSET pagination with keyset pagination, and collapse an N+1 loop into one set-based query.
  • Use approximate COUNT, index-served ORDER BY LIMIT, and the right DISTINCT form, and reach for set-based writes over row-by-row.
  • Prove each rewrite with EXPLAIN ANALYZE before and after, and find the worst offenders with pg_stat_statements.

Before we dive in

You should be comfortable reading an EXPLAIN ANALYZE plan and you should know what a B-tree index gives you: an ordered structure that answers equality and range predicates on its leading columns without scanning the table. If those are fuzzy, the earlier pages in this track cover them.

A few terms, defined as we use them. A predicate is a condition in a WHERE, JOIN, or HAVING clause, for example status = 'active'. A predicate is sargable (Search ARGument able) when the planner can use it to seek into an index rather than test it row by row. A semi-join asks “does at least one matching row exist,” and returns the left side, never the right; EXISTS and IN express it. A half-open range is an interval that includes its lower bound and excludes its upper bound, written x >= a AND x < b; it is the safe way to express “one day” or “one month” without rounding errors at the edge. Hold onto sargable and half-open range. Most of this page is built from them.

One ground rule for everything below: you change the SQL, never the answer. A correct rewrite returns exactly the same rows as the original, only faster. We will check that as we go.

Mental Model

The tempting wrong model is that the planner is smart enough to see through how you wrote a query. You believe that WHERE date_trunc('day', ts) = '2026-06-06' and WHERE ts >= '2026-06-06' AND ts < '2026-06-07' are obviously the same request, so PostgreSQL will treat them the same and use your index either way. Under that model, query shape is cosmetic and only indexes and statistics matter.

That is not how it works. The planner optimizes the expression tree you gave it, not the intent behind it. When you wrap the indexed column in date_trunc(...), the planner sees a function call whose result it cannot look up in a plain B-tree on ts, because the index stores ts, not date_trunc('day', ts). It has no choice but to read every row and compute the function, which is a sequential scan. The two queries mean the same thing to you and two different things to the optimizer.

So the better model is this: the planner can only seek on an index when the indexed column appears bare on one side of a simple comparison. The moment you put a function, an arithmetic operation, or a type cast around the column, you have handed the planner a value it did not index, and the index goes dark. Rewriting is the craft of saying the same thing while keeping the column bare. Keep that picture. Every rule below is one way to keep the column visible to the index.

Breaking it down

1. SARGability: keep the indexed column bare

Start with the single most common reason an index sits unused: the query asks a question the index cannot answer in the form it was asked. Suppose you have CREATE INDEX ON events (ts) and you want every event from one day.

The natural way to write it is also the slow way:

-- Non-sargable: a function wraps the indexed column.
EXPLAIN ANALYZE
SELECT * FROM events
WHERE date_trunc('day', ts) = '2026-06-06';

The plan is a Seq Scan. The index is on ts, but the predicate compares date_trunc('day', ts), a value the index does not store. PostgreSQL must read every row, apply date_trunc, and test the result. On a billion-row table that is a full pass over the heap.

The rewrite keeps ts bare and expresses “one day” as a half-open range:

-- Sargable: the bare column meets a range the index can seek.
EXPLAIN ANALYZE
SELECT * FROM events
WHERE ts >= '2026-06-06' AND ts < '2026-06-07';

Now the planner does an Index Scan (or Bitmap Index Scan), descending the B-tree to the start of the range and walking leaf pages until it passes the end. Use the half-open form, not BETWEEN '2026-06-06' AND '2026-06-06 23:59:59'. That older trick drops any event in the final second, and it breaks outright on timestamptz with sub-second precision. >= start AND < next_start is exact at every boundary.

The same trap appears in three more disguises. A function on the column: WHERE lower(email) = 'x@y.com'. Arithmetic on the column: WHERE price + 1 = 100. A leading wildcard: WHERE name LIKE '%smith'. Each one buries the column inside an expression the index cannot match. Here is the catalog, with the fix for each.

The non-sargable catalog
WHERE lower(email) = 'a@b.com'. The btree on email stores email, not lower(email). Fix: either build an expression index, CREATE INDEX ON users (lower(email)), so the index stores the computed value, or store and query a normalized column. Once the expression index exists, the same predicate becomes sargable because the index literally holds lower(email).

The systems-level why is one sentence: a B-tree answers questions about the value it stored, and only that value. The moment your predicate is about a different value, a transformed one, the index has nothing to seek on. The failure mode is quiet. Nothing errors; the query just runs a Seq Scan and gets slowly worse as the table grows. The metric that exposes it is a Seq Scan in EXPLAIN where you expected an Index Scan, and a climbing seq_scan count in pg_stat_user_tables for a table you thought was well indexed.

2. Implicit casts that silently skip the index

This anti-pattern is sneakier than the last, because the function defeating your index is invisible: PostgreSQL inserts it for you. When you compare a column to a literal or parameter of a different type, the planner must make the types match before it can compare. Where it puts that cast decides whether your index survives.

Picture an accounts table whose id column is text (an external account number that happens to look numeric), indexed with a plain B-tree.

-- The literal is an integer; the column is text. PostgreSQL casts the COLUMN to match.
EXPLAIN ANALYZE
SELECT * FROM accounts WHERE id = 12345;

To compare text with integer, PostgreSQL resolves the comparison in a common type and ends up casting the column: the predicate effectively becomes id::integer = 12345 or account_number::numeric = .... That cast is a function on the column, so by the rule from rung 1 the index goes dark and you get a Seq Scan. Match the literal’s type to the column instead:

-- Quote the literal so it is text, matching the column. The index is usable again.
EXPLAIN ANALYZE
SELECT * FROM accounts WHERE id = '12345';

The same shape bites in three common pairings. A timestamptz column compared to a date literal can force a cast and widen or shift the comparison; compare against a timestamptz range instead. An integer column compared against a bigint parameter (very common from drivers that bind every integer as int8) can cast the column to bigint and lose the int4 index. And a numeric literal against an integer column does the same. The rule is blunt and worth memorizing: make the constant or parameter match the column’s exact type, so the cast (if any) lands on the constant, where it is computed once and harmlessly.

flowchart TD
    A["Predicate: col op literal"] --> B{"Same type?"}
    B -->|yes| C["Direct comparison, index seekable"]
    B -->|no| D{"Where does the cast land?"}
    D -->|on the literal| C
    D -->|on the column| E["Function wraps the column, index skipped, Seq Scan"]

The failure mode and its metric are the same as rung 1, a surprise Seq Scan, but the cause hides one level down. To confirm it, read the Filter or index condition in EXPLAIN: if you see the column wrapped in a cast like (id)::integer, that is your culprit. Driver-injected bigint binds are the classic version; if a query is fast in psql but slow from the app, suspect a parameter-type mismatch.

3. OR across columns, rewritten as UNION

An OR that spans different columns is the next index-killer. The planner can sometimes turn a single-column OR into a Bitmap Or of two index scans, but an OR across two different columns usually cannot use either column’s index well, because no single index seek satisfies the whole condition.

-- email and phone are separately indexed. The OR spans both; neither index alone matches.
EXPLAIN ANALYZE
SELECT * FROM users
WHERE email = 'a@b.com' OR phone = '555-0100';

Rewrite the OR as a UNION of two queries, each of which is a clean single-index lookup:

-- Each branch seeks its own index; UNION merges and de-duplicates.
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'a@b.com'
UNION
SELECT * FROM users WHERE phone = '555-0100';

Each branch is sargable on its own index, and PostgreSQL runs two cheap seeks instead of one Seq Scan. One subtlety decides correctness. UNION removes duplicate rows; if a row matches both branches it appears once, which mirrors the OR. UNION ALL is faster because it skips the de-duplication sort or hash, but it would return a both-matching row twice. Use UNION ALL only when the branches cannot overlap (for instance, mutually exclusive conditions), and UNION when they can.

A related cleanup: an OR chain on the same column is just a set membership. Replace col = 1 OR col = 2 OR col = 3 with col IN (1, 2, 3). IN against a list is sargable and the planner handles it as a single index condition, and it reads far better than a long OR chain. The OR-to-UNION rewrite is for different columns; the OR-to-IN rewrite is for one column with many values.

4. Semi-joins: EXISTS, IN, and the NOT IN null trap

When you only want to know whether a related row exists, you have three ways to write it, and one of them has a correctness trap that silently returns wrong answers. Take “customers who have placed at least one order.”

-- EXISTS: a correlated semi-join, stops at the first match.
SELECT c.* FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
-- IN: the planner usually rewrites this into the same semi-join.
SELECT c.* FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders);

For the positive case, EXISTS and IN are close to equivalent: the planner recognizes both as semi-joins and often produces the same plan (a Hash Semi Join or Nested Loop). EXISTS has one structural edge: it can short-circuit. A semi-join needs only the first matching inner row, so once EXISTS finds one order for a customer it stops looking; it never counts or materializes the rest. That is why SELECT 1 (not SELECT *) inside EXISTS is idiomatic; you are not using the projected columns, only the existence.

The danger is the negative case. NOT IN against a subquery is a trap whenever the subquery can yield a NULL:

-- DANGEROUS: if any orders.customer_id is NULL, this returns ZERO rows.
SELECT c.* FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);

Here is the why, and it is pure three-valued logic, not a bug. x NOT IN (a, b, NULL) expands to x <> a AND x <> b AND x <> NULL. The term x <> NULL is never true; it is UNKNOWN. So the whole AND can never be true, only FALSE or UNKNOWN, and no customer passes the filter. A single NULL in the subquery collapses the entire result to empty. Worse, it does so quietly: no error, just zero rows, and you may not notice until a report is mysteriously blank. Always express anti-joins with NOT EXISTS:

-- Correct and NULL-safe: NOT EXISTS is a true anti-join.
SELECT c.* FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

NOT EXISTS uses normal join matching, so a NULL customer_id simply fails to match any customer and is ignored, which is what you meant. It is both correct and well-optimized as an Anti Join.

Check yourself
orders.customer_id is nullable and a few rows have NULL there. You run: SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders). What do you get?

When should you reach for an explicit JOIN instead of a semi-join? When you actually need columns from the other table, or you want every matching pair. A JOIN to orders multiplies a customer by their order count; a semi-join returns each customer once. Choosing JOIN when you meant EXISTS is a common source of accidental duplicate rows that people then patch with a DISTINCT, which is slower and hides the real intent.

5. CTEs: optimization fence or inlined

A common-table expression, the WITH name AS (...) block, behaves differently across PostgreSQL versions, and getting the version boundary wrong leads to plans you cannot explain. Through PostgreSQL 11, a CTE was an optimization fence: the planner always materialized it, running the subquery to completion into a temporary result, then reading from that result. It never pushed your outer WHERE down into the CTE.

-- On PG 11 and earlier, the planner materializes the CTE, then filters the 1M rows AFTER.
WITH recent AS (
  SELECT * FROM events WHERE ts >= now() - interval '7 days'
)
SELECT * FROM recent WHERE user_id = 42;

On the old behavior, recent is built in full (every event of the last seven days, perhaps millions of rows), and only then is user_id = 42 applied. The fence stopped the planner from combining the two conditions into one index scan on (user_id, ts).

PostgreSQL 12 changed the default: a CTE that is referenced once, has no side effects, and is not recursive is inlined by default, folded into the outer query so the planner can push predicates down and use indexes across the boundary. On PG 12 and later the query above can become a single index scan on (user_id, ts). You control the behavior explicitly with two keywords:

-- Force the fence: materialize once, reuse the result several times.
WITH heavy AS MATERIALIZED (
  SELECT account_id, sum(amount) AS total
  FROM huge_ledger GROUP BY account_id
)
SELECT * FROM heavy h JOIN dim d ON d.account_id = h.account_id
UNION ALL
SELECT * FROM heavy h JOIN other o ON o.account_id = h.account_id;
-- Force inlining even when referenced once, so predicates push down.
WITH recent AS NOT MATERIALIZED (
  SELECT * FROM events WHERE ts >= now() - interval '7 days'
)
SELECT * FROM recent WHERE user_id = 42;

The decision is about reuse. Use AS MATERIALIZED when a CTE is expensive and you reference it several times: computing it once and reusing the result beats recomputing it per reference. Use AS NOT MATERIALIZED (or rely on the PG12 default) when you reference it once and want the planner to push your outer predicates and index choices through it. The failure mode on an upgrade is real: code written against PG 11, which leaned on the fence to control evaluation order, can change plans on PG 12 when those CTEs start inlining. If a query regressed right after a major-version upgrade and it uses a CTE, add AS MATERIALIZED to restore the old shape and confirm with EXPLAIN.

WITH recent AS (...) referenced once, on PG 12+
The CTE folds into the outer query. The planner pushes user_id = 42 down beside ts >= ..., so a single Index Scan on (user_id, ts) serves both conditions. One seek, no temporary result. This is what you want for a once-referenced CTE.

6. Keyset pagination instead of deep OFFSET

Pagination with LIMIT ... OFFSET N looks innocent and degrades badly. The reason is mechanical: OFFSET N does not skip ahead cheaply. PostgreSQL must produce the first N rows in order, then read and discard all N of them, and only then start returning rows. Page 1 discards nothing; page 1000 with a page size of 20 discards 20,000 rows before it returns a single one. Cost grows linearly with the page number, so deep pages crawl while the early ones feel fine, and the slowdown hides until users (or a crawler) page deep.

-- OFFSET pagination: page 5000 reads and throws away 100,000 rows first.
SELECT id, created_at, title FROM articles
ORDER BY created_at, id
LIMIT 20 OFFSET 100000;

Keyset pagination (also called seek pagination) replaces “skip N rows” with “start just after the last row I saw.” You remember the sort key of the last row on the previous page and ask for rows greater than it:

-- Keyset: seek directly past the last seen row. No discarding.
SELECT id, created_at, title FROM articles
WHERE (created_at, id) > ('2026-06-06 12:00:00', 918273)
ORDER BY created_at, id
LIMIT 20;

With an index on (created_at, id), the WHERE (created_at, id) > (...) is a row-value comparison the B-tree seeks directly: it descends to the boundary and reads exactly 20 leaf entries. No rows are discarded, so the cost of page 5000 is identical to the cost of page 2. Performance stays flat no matter how deep you go.

Two rules make it correct. First, the keyset columns must match the ORDER BY exactly, and they must form a unique ordering; that is why id is appended to created_at, to break ties so no row is skipped or repeated at a page boundary. Second, use the row-value form (a, b) > (x, y), not a > x AND b > y, which is a different and wrong condition. The trade-off keyset pays is that you lose random page access: you can go to the next page, but not jump to “page 500,” because you do not have its boundary key. For infinite scroll, activity feeds, and API cursors, that is exactly the access pattern anyway, so the trade is free.

Reading page 5000 (page size 20)
The index seeks straight to the boundary key and reads exactly 20 leaf entries. Zero rows discarded. Page 5000 costs the same as page 2: a constant handful of buffer reads. The plan is a plain Index Scan with no large skip.

7. The N+1 pattern and how to batch it

The N+1 pattern is a performance bug that lives in the application, not the database, which is why EXPLAIN on any single query looks fine. It happens when you fetch a list of N parents with one query, then loop and fire one more query per parent to fetch its children: 1 query for the list, plus N for the children, hence N+1. Each of those N queries is individually fast, but each carries a full network round-trip and a parse-plan-execute cycle. At N equals 500, you have paid 500 round-trips where one would do.

-- Query 1: the parents.
SELECT id FROM customers WHERE region = 'EU';

-- Then, in app code, ONE of these PER customer id (the +N):
SELECT * FROM orders WHERE customer_id = $1;

The round-trip is the cost that matters. A query that executes in 0.2 ms on the server can still take 1 to 2 ms wall-clock once you add network latency, and 500 of them serialized is a full second of mostly waiting. The fix is to ask once, for everything, with a set-based query. Three rewrites, in rough order of preference:

-- Batch with IN: one round-trip, the planner seeks the index once per key set.
SELECT * FROM orders WHERE customer_id IN (1, 2, 3, ... , 500);
-- Batch with a JOIN: get parents and children together in one pass.
SELECT c.id, o.*
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE c.region = 'EU';
-- LATERAL: when you need a per-parent slice, like the 3 most recent orders each.
SELECT c.id, recent.*
FROM customers c
CROSS JOIN LATERAL (
  SELECT * FROM orders o
  WHERE o.customer_id = c.id
  ORDER BY o.created_at DESC
  LIMIT 3
) AS recent
WHERE c.region = 'EU';

Use IN when you already have the parent ids in hand and just need their children. Use a JOIN when you want parents and children flattened together. Use LATERAL for the case the others cannot express: a correlated per-parent subquery, like “the three newest orders for each customer,” where the inner LIMIT depends on the outer row. The systems-level lesson generalizes past N+1: prefer one query that does set-based work to many queries that each do a little, because the fixed per-round-trip cost dominates once the row count is anything but tiny. The metric that exposes N+1 is calls in pg_stat_statements: a child query with a calls count that tracks your parent row count, and a high calls times mean-time product, is an N+1 loop.

8. Cheap shapes: COUNT, LIMIT, and DISTINCT

Three everyday query shapes have a cheap form and an expensive one. Knowing which you wrote is half the optimization.

COUNT. SELECT count(*) FROM big_table is not free. Because of MVCC, PostgreSQL must check each row’s visibility against your snapshot, so an exact count reads the whole relation (a Seq Scan, or an index-only scan if the visibility map is current). On a large table that is seconds of work for a number that is stale the moment you read it. When an approximate count is acceptable (a dashboard total, “about 4.2M rows”), read the planner’s own estimate instead:

-- Approximate row count from the planner statistics, effectively instant.
SELECT reltuples::bigint AS approx_rows
FROM pg_class WHERE relname = 'big_table';

reltuples is maintained by ANALYZE and VACUUM, so it lags slightly but is accurate to within a few percent on a regularly analyzed table. For a filtered approximate count, run EXPLAIN (not EXPLAIN ANALYZE) and read the estimated rows from the top node; that is the planner’s estimate for your exact predicate, again without touching the heap. Reserve exact count(*) for when the number must be precise.

LIMIT and ORDER BY. A LIMIT is only as good as the order it sits on. LIMIT with a matching ORDER BY that an index already provides is one of the best deals in SQL: the index hands rows back in order, so the executor reads exactly LIMIT rows and stops, with no sort node at all.

-- Index on (created_at DESC) serves the order; reads 10 rows, no Sort node.
SELECT * FROM articles ORDER BY created_at DESC LIMIT 10;

Two failure modes hide here. Without a matching index, ORDER BY ... LIMIT 10 must sort (or top-N heapsort) a large input just to take ten rows, which is far more work than the result suggests. And LIMIT without any ORDER BY is nondeterministic: PostgreSQL returns whatever ten rows are cheapest to produce, and that set can change between runs, after a vacuum, or when the plan changes. LIMIT without ORDER BY is a correctness smell, not just a style nit; if you care which rows you get, you must order them.

DISTINCT, GROUP BY, and DISTINCT ON. These overlap but are not interchangeable. SELECT DISTINCT a, b and SELECT a, b GROUP BY a, b are logically the same de-duplication and usually plan the same way; prefer whichever reads clearer (GROUP BY once you add aggregates). DISTINCT ON (key) is the PostgreSQL-specific tool for “one row per key, picking a particular one”:

-- The latest order per customer: one row per customer_id, ordered to pick the newest.
SELECT DISTINCT ON (customer_id) customer_id, id, created_at
FROM orders
ORDER BY customer_id, created_at DESC;

DISTINCT ON keeps the first row of each customer_id group as defined by the ORDER BY, so the leading ORDER BY columns must start with the DISTINCT ON key. It is far cleaner than the classic self-join-to-find-the-max, and it leads to the last rewrite worth naming here.

Self-join to window function. The old way to attach a per-group aggregate to each row is to self-join the table to a grouped copy of itself. A window function does it in one pass:

-- Replace a self-join-to-grouped-copy with a single windowed scan.
SELECT id, customer_id, amount,
       sum(amount) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;

The window function computes customer_total for every row without a second scan of orders and without a join, so it is both simpler and usually cheaper than the self-join it replaces. Window functions also express running totals, rankings, and “compare to previous row” that a GROUP BY cannot.

9. Generic plan skew and set-based thinking

Two final rewrites operate above the level of a single predicate.

Generic versus custom prepared-plan skew. When you use a prepared statement (every parameterized query from a driver is one), PostgreSQL chooses between a custom plan, re-planned for the actual parameter values each time, and a generic plan, planned once with placeholder selectivity and reused. The decision logic lives in The Query Planner and Cost Model: PostgreSQL builds custom plans for the first executions, then switches to the generic plan if it is not meaningfully more expensive on average. The skew problem is this: a generic plan uses an average-case row estimate, which is wrong for a skewed parameter. If one value of status matches 0.1 percent of rows (great for an index) and another matches 90 percent (better as a Seq Scan), a single generic plan cannot be right for both. Bind the common-but-rare value and you may get the wrong shape.

You control it per session with plan_cache_mode:

-- Force re-planning per call when parameter values are skewed and the plan must fit each.
SET plan_cache_mode = force_custom_plan;

-- Force the cached generic plan when planning cost itself is the bottleneck and values are uniform.
SET plan_cache_mode = force_generic_plan;

-- The default: let PostgreSQL decide via the custom-then-generic heuristic.
SET plan_cache_mode = auto;

Reach for force_custom_plan when a parameterized query is fast for some bind values and slow for others, and EXPLAIN of the generic plan shows a one-size estimate that does not fit the skewed value. The cost is re-planning on every execution, so use it where plan quality matters more than planning time.

Set-based thinking over row-by-row. The broadest anti-pattern of all is doing in a loop what the database can do in a set. The clearest case is inserts. A thousand single-row INSERT statements pay a thousand round-trips, a thousand parse-plan cycles, and (if each is its own transaction) a thousand WAL commits. One multi-row INSERT collapses that to a single statement:

-- Row-by-row: 1000 statements, 1000 round-trips, 1000 plan cycles.
INSERT INTO t (a, b) VALUES (1, 2);
INSERT INTO t (a, b) VALUES (3, 4);
-- ... 998 more ...

-- Set-based: one statement, one round-trip, one plan.
INSERT INTO t (a, b) VALUES (1, 2), (3, 4), (5, 6), ... ;

For bulk loads, COPY is faster still: it streams rows in a tight binary or text protocol that skips most per-row overhead and is the right tool for loading millions of rows. The same row-by-row trap appears in heavy per-row triggers: a trigger that fires expensive work for every affected row turns a single set-based UPDATE of a million rows into a million trigger invocations. When a statement that touches many rows is slow out of proportion to its size, suspect a per-row trigger doing work that could be batched or moved to a statement-level trigger.

10. Diagnostics: prove every rewrite

A rewrite you did not measure is a guess. Two tools turn guessing into evidence, and you should use both on every change.

EXPLAIN ANALYZE, before and after. For each rewrite on this page, run EXPLAIN (ANALYZE, BUFFERS) on the original and on the rewrite, and compare. You are looking for the plan to change the way you predicted: a Seq Scan becoming an Index Scan, a large Sort disappearing under a LIMIT, rows-removed-by-filter dropping toward zero, buffer reads falling by orders of magnitude. ANALYZE runs the query and reports actual time and actual rows, so it also catches a rewrite that looks better but is not. Always compare actual times, not just estimated costs.

-- Before (non-sargable): full scan, every row filtered.
Seq Scan on events  (cost=0.00..43210.00 rows=1 width=64)
                    (actual time=812.4..812.4 rows=1 loops=1)
  Filter: (date_trunc('day', ts) = '2026-06-06')
  Rows Removed by Filter: 4999999
  Buffers: shared read=21834

-- After (sargable half-open range): an index seek.
Index Scan using events_ts_idx on events  (cost=0.43..18.7 rows=1 width=64)
                    (actual time=0.05..0.06 rows=1 loops=1)
  Index Cond: ((ts >= '2026-06-06') AND (ts < '2026-06-07'))
  Buffers: shared hit=4

That contrast, 21,834 pages read and 5 million rows filtered against 4 buffer hits and a clean index condition, is the whole argument. Read Rows Removed by Filter and the Buffers line first; they expose wasted work faster than the cost numbers do.

pg_stat_statements, to find the worst offenders. You do not rewrite at random; you rewrite the queries that cost the most. The pg_stat_statements extension aggregates every executed query by a normalized form and records calls, total_exec_time, mean_exec_time, and rows. Sort by total time to find where the database actually spends its day:

-- The queries burning the most cumulative time: your rewrite worklist, ranked.
SELECT query, calls, total_exec_time, mean_exec_time, rows
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Read the columns together. A high total_exec_time with a low mean_exec_time and a huge calls is an N+1 loop or a chatty endpoint; the fix is batching from rung 7. A high mean_exec_time is a single slow query; the fix is a rewrite or an index from rungs 1 through 8. A query whose rows divided by calls is enormous is pulling back more than the application uses. This is how you turn a vague “the database is slow” into a ranked list of exactly which statements to rewrite first, each then proven with the EXPLAIN ANALYZE contrast above.

flowchart LR
    A["pg_stat_statements: rank by total_exec_time"] --> B["Pick the top offender"]
    B --> C["EXPLAIN ANALYZE the current query"]
    C --> D["Apply the matching rewrite"]
    D --> E["EXPLAIN ANALYZE the rewrite"]
    E --> F{"Plan improved, same rows?"}
    F -->|yes| G["Ship it, re-check the ranking"]
    F -->|no| C

Mastery Questions

  1. A reporting endpoint runs SELECT * FROM orders WHERE date_trunc('month', created_at) = '2026-06-01' against a table with CREATE INDEX ON orders (created_at), and it has slowed steadily for a year. EXPLAIN shows a Seq Scan with Rows Removed by Filter in the millions. Give the rewrite, explain why the original cannot use the index, and name two ways to confirm the fix.

    Answer. The predicate wraps the indexed column in date_trunc('month', created_at), so it is non-sargable: the B-tree stores created_at, not its month truncation, and the planner has no value in the index to seek on, forcing a full scan that computes the function for every row. Rewrite to a half-open range on the bare column: WHERE created_at >= '2026-06-01' AND created_at < '2026-07-01'. Now created_at appears bare against range bounds the index can descend to, so the plan becomes an Index Scan or Bitmap Index Scan. The half-open form is exact at the month boundary, unlike a BETWEEN ... AND '2026-06-30 23:59:59' that would drop part of the last day on a timestamptz. Confirm two ways: run EXPLAIN (ANALYZE, BUFFERS) before and after and watch the Seq Scan become an Index Scan with Rows Removed by Filter falling to near zero and buffer reads dropping by orders of magnitude; and check pg_stat_statements to see this query’s mean_exec_time and total_exec_time fall after the deploy. The slow growth over a year is the tell: a non-sargable predicate scales with table size, so it degrades exactly as the table grows.

  2. A teammate writes SELECT * FROM customers WHERE id NOT IN (SELECT customer_id FROM orders) to find customers with no orders. It returns zero rows in production but worked in their test database. Explain the mechanism, the difference between the two environments, and the correct rewrite.

    Answer. This is the NOT IN null trap. In production, orders.customer_id is nullable and at least one row holds a NULL; in the test database every customer_id happened to be non-null. x NOT IN (set) expands to x <> v1 AND x <> v2 AND ..., and once the set contains a NULL it includes the term x <> NULL, which evaluates to UNKNOWN, never TRUE. An AND chain with an UNKNOWN term can be FALSE or UNKNOWN but never TRUE, so no customer passes the filter and the result collapses to zero rows, silently, with no error. That is why it passed in test (no NULLs) and failed in production (one NULL is enough). The correct rewrite is WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id), a true anti-join: a NULL customer_id simply fails to match any customer and is ignored, so you get exactly the customers with no orders. NOT EXISTS is the right tool for every anti-join precisely because it is NULL-safe, and it also optimizes well as an Anti Join node. The durable lesson: never use NOT IN against a subquery on a nullable column.

  3. An API returns paginated results with LIMIT 20 OFFSET :n ordered by created_at. Early pages are instant; page 4000 takes two seconds. The table has an index on (created_at, id). Explain why deep pages are slow, give the rewrite, and state what capability you trade away.

    Answer. OFFSET n is not a cheap skip. PostgreSQL must produce the first n rows in sorted order and then read and discard every one of them before it returns the first row you actually want. At page 4000 with page size 20 that is 80,000 rows generated and thrown away per request, so cost grows linearly with the page number; early pages discard little and feel instant, deep pages discard a lot and crawl. The rewrite is keyset (seek) pagination: remember the sort key of the last row on the previous page and ask for rows past it, WHERE (created_at, id) > (:last_created_at, :last_id) ORDER BY created_at, id LIMIT 20. With the (created_at, id) index, the row-value comparison is a direct B-tree seek to the boundary that then reads exactly 20 leaf entries, discarding nothing, so page 4000 costs the same as page 2 and performance stays flat. Two correctness rules: the keyset columns must match the ORDER BY and form a unique order (hence id to break created_at ties), and you must use the row-value form (a, b) > (x, y), not a > x AND b > y. The capability you trade away is random page access: keyset can go to the next page but cannot jump straight to “page 500,” because it has no boundary key for it. For infinite scroll, feeds, and cursor-based APIs that access pattern is exactly what you want, so the trade costs nothing in practice.

Recommended next

  • Data Types and Schema Design
    Builds directly on this page: Data Types and Schema Design is the next step in the PostgreSQL performance ladder.
Sources & evidence16 claims · 5 cited

Sargability, index-condition vs filter behavior, and operator classes are grounded in src_pg_docs_indexes and src_pg_docs_btree; EXPLAIN/EXPLAIN ANALYZE plan-reading and the cost/rows/buffers output in src_pg_docs_explain; reltuples and selectivity estimation in src_pg_docs_planner_stats; pg_stat_statements columns in src_pg_docs_pgstatstatements. The CTE-inlining version boundary (PG11 fence vs PG12 default inlining and the AS MATERIALIZED / AS NOT MATERIALIZED keywords), keyset-vs-OFFSET behavior, the NOT IN three-valued-logic trap, and plan_cache_mode values are correct PostgreSQL facts not pinned to a single cited paragraph in the allowed sources and are marked stable-common-knowledge.

  • A predicate is sargable only when the indexed column appears bare on one side of a simple comparison; wrapping it in a function such as date_trunc('day', ts) or lower(email) makes the B-tree on the raw column unusable and forces a sequential scan, because the index stores the column value, not the transformed value.verified
  • A one-day predicate should be written as the half-open range ts >= '2026-06-06' AND ts < '2026-06-07' rather than BETWEEN '2026-06-06' AND '2026-06-06 23:59:59', because the half-open form is exact at the boundary on timestamptz with sub-second precision and keeps the column bare so the index can seek.verified
  • A leading-wildcard LIKE '%smith' cannot use a B-tree because the index is ordered left to right and a seek needs a known prefix; LIKE 'smith%' is sargable, and on a non-C locale a prefix LIKE needs the text_pattern_ops operator class to use the index, while substring search needs a trigram GIN index.verified
  • Comparing a column to a literal or parameter of a different type can force an implicit cast onto the column side (for example a text id column compared to an integer literal becomes id::integer = 12345), which wraps the column in a function and skips the index; matching the literal's type to the column keeps the cast on the constant side and preserves the index scan.verified
  • An OR across two different indexed columns usually cannot use either index well, but rewriting it as a UNION of two single-column queries lets each branch perform its own index seek; UNION de-duplicates to match OR semantics while UNION ALL is faster but only correct when the branches cannot overlap.verified
  • NOT IN against a subquery returns zero rows whenever the subquery yields a single NULL, because x NOT IN (..., NULL) expands to an AND chain containing x <> NULL which is UNKNOWN, so the predicate is never TRUE for any row; NOT EXISTS is the NULL-safe anti-join that should be used instead.stable common knowledge
  • EXISTS can short-circuit a semi-join by stopping at the first matching inner row, which is why SELECT 1 inside EXISTS is idiomatic; the planner typically recognizes both EXISTS and IN as the same semi-join and produces equivalent plans for the positive case.stable common knowledge
  • Through PostgreSQL 11 a CTE was always an optimization fence (materialized, with no predicate push-down); PostgreSQL 12 changed the default to inline a CTE that is referenced once, non-recursive, and side-effect free, and added the explicit keywords AS MATERIALIZED to force the fence and AS NOT MATERIALIZED to force inlining.stable common knowledge
  • OFFSET N forces PostgreSQL to produce and then discard the first N ordered rows before returning any result, so deep-page cost grows linearly with the page number; keyset pagination with WHERE (sort_key) > (last_seen) ORDER BY sort_key LIMIT n seeks directly past the last seen row using the index and keeps cost flat regardless of page depth.stable common knowledge
  • Keyset pagination requires the keyset columns to match the ORDER BY and form a unique ordering (appending a tiebreaker like id) and must use the row-value comparison (a, b) > (x, y) rather than a > x AND b > y; the trade-off is the loss of random page access.stable common knowledge
  • The N+1 pattern issues one query for a parent list plus one query per parent, paying a full network round-trip and parse-plan-execute cycle for each; it is collapsed into one set-based query with IN, a JOIN, or a correlated LATERAL subquery for a per-parent slice, because the fixed per-round-trip cost dominates once the row count is more than tiny.stable common knowledge
  • Because of MVCC visibility checks an exact count(*) reads the whole relation; when an approximate count is acceptable, pg_class.reltuples (maintained by ANALYZE and VACUUM) or the estimated rows from EXPLAIN gives a near-instant estimate within a few percent on a regularly analyzed table.verified
  • A LIMIT served by a matching ORDER BY that an index already provides reads exactly LIMIT rows with no Sort node, while LIMIT without ORDER BY is nondeterministic and can return a different set between runs; DISTINCT ON (key) keeps one row per key chosen by the leading ORDER BY, and a window function over PARTITION BY replaces a self-join-to-grouped-copy in a single scan.verified
  • A cached generic prepared-statement plan uses an average-case row estimate that can be wrong for a skewed parameter value, where one bind matches a tiny fraction of rows and another matches most; plan_cache_mode (auto, force_custom_plan, force_generic_plan) overrides the custom-then-generic heuristic so re-planning fits each value, at the cost of planning per execution.stable common knowledge
  • A multi-row INSERT collapses many single-row INSERTs into one statement, round-trip, and plan cycle, and COPY streams bulk rows faster still by skipping most per-row overhead; heavy per-row triggers re-introduce the row-by-row cost by firing once per affected row on an otherwise set-based statement.stable common knowledge
  • EXPLAIN (ANALYZE, BUFFERS) run before and after a rewrite reports actual time, actual rows, Rows Removed by Filter, and buffer reads, proving a Seq Scan became an Index Scan and that wasted work fell; pg_stat_statements aggregates each normalized query's calls, total_exec_time, mean_exec_time, and rows, so sorting by total_exec_time ranks the worst offenders to rewrite first.verified

Cited sources