Statistics and Row Estimation

How ANALYZE samples data into pg_stats, how the planner converts those statistics into selectivities and row estimates, and why the independence assumption on correlated columns causes most bad plans, with the extended-statistics and target fixes.

Learning outcomes

Almost every bad plan you will ever debug is a bad estimate wearing a costume. The query planner and cost model page showed you that the planner picks the cheapest plan by its own arithmetic, and the reading EXPLAIN ANALYZE page showed you how to compare that arithmetic against reality. This page is the one that explains WHY the arithmetic goes wrong: where the numbers come from, how they are computed, and the one assumption that breaks them most often.

After studying this page, you can:

  • Explain what ANALYZE samples, why it is a sample and not a full scan, and when you must run it by hand.
  • Read every column of pg_stats and say what the planner does with each one.
  • Compute a selectivity by hand for an equality and a range, and match it against the planner’s rows= estimate.
  • Diagnose a correlated-column misestimate and fix it with CREATE STATISTICS, checking the row estimate before and after.
  • Raise STATISTICS targets, override n_distinct, and add expression statistics where sampling alone fails.

Before we dive in

You should already be comfortable reading an EXPLAIN ANALYZE plan and spotting where estimated rows and actual rows disagree, and you should know that the planner is cost-based: it scores plans with numbers, never by trying them. Both ideas come from the two pages named above. You do not need to know join algorithms in depth here, though a wrong estimate is exactly what turns a good join plan into a disaster.

A few terms, defined as we meet them. Cardinality is the number of rows a node is expected to produce. Selectivity is the fraction of a table’s rows that a condition keeps, a number between 0 and 1: a selectivity of 0.01 means one row in a hundred passes. Statistics are the summary numbers PostgreSQL stores about your data’s distribution, living in the system catalog pg_statistic and shown to humans through the pg_stats view. Everything below is the story of how those statistics are gathered, read, and combined into a cardinality.

Mental Model

The tempting wrong model is that the planner knows your data. It does not. It has never counted your rows. The planner works from a small statistical sketch, a few hundred numbers per column, taken from a random sample some time ago. When it tells you rows=512, that is not a fact about your table; it is a prediction from a summary that may be stale, too coarse, or blind to how your columns relate.

The better model is a pollster. A pollster never asks every voter; they sample a few thousand and extrapolate. The estimate is good when the sample is fresh and the population is simple, and it is wrong in two famous ways: when the sample is out of date (you polled last month, the data moved), and when two questions are correlated (asking “do you own a truck” and “do you live rurally” as if they were independent badly undercounts the overlap). Hold the pollster picture. Every failure mode on this page is either a stale poll or a correlation the pollster did not model, and the fixes are the same two a real pollster reaches for: poll more often, and poll the correlated questions together.

Breaking it down

1. Why a planner needs a sample of your data

Start with the problem statistics solve. To choose a plan, the planner must guess how many rows each step produces. A WHERE clause that keeps ten rows wants an index scan; the same clause keeping ten million wants a sequential scan. The planner cannot run the query to find out, that would defeat the purpose, so it must predict the count from something cheap. That something is a stored summary of each column’s distribution.

Counting every value in a billion-row table on every query is far too expensive, so PostgreSQL does what every survey does: it samples. The command that builds the summary is ANALYZE. It reads a random sample of rows, computes a handful of distribution numbers, and writes them into the catalog table pg_statistic. From then on the planner reads those numbers, never your live data, until the next ANALYZE.

The sample size is tied to a setting called the statistics target, which defaults to 100. The rule is that ANALYZE collects roughly 300 times the target rows per analyzed table, so at the default it samples about 30000 rows. That is a fixed sample size, not a percentage. The deep consequence: on a thousand-row table the sample is the whole table and the stats are near-exact, but on a billion-row table 30000 rows is one row in thirty thousand, and the resulting estimates are genuinely approximate. Big tables get fuzzier statistics, and you cannot wish that away; you can only sample harder, which rung 4 covers.

-- ANALYZE one table by hand. Autovacuum normally does this for you,
-- but it lags, and after a bulk change it lags badly.
ANALYZE orders;

-- ANALYZE just one column, useful when only one column's stats are stale.
ANALYZE orders (status);

Autovacuum runs ANALYZE automatically when enough rows change, but it reacts to a threshold, so there is always a window where the data has moved and the stats have not. That window is the single most common cause of a sudden bad plan. After any bulk load, large UPDATE, or COPY, run ANALYZE yourself before you trust a plan, because autoanalyze may be minutes or millions of rows behind.

2. What ANALYZE collects: a tour of pg_stats

Everything ANALYZE learns about a column shows up in one row of the pg_stats view. Query it directly; it is the ground truth behind every estimate.

SELECT attname, null_frac, n_distinct,
       most_common_vals, most_common_freqs,
       histogram_bounds, correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';

Six columns carry the weight. Read each as an answer to a question the planner is about to ask.

  • null_frac is the fraction of the column that is null. An IS NULL test has exactly this selectivity, and any non-null predicate starts by discarding this fraction.
  • n_distinct is the number of distinct values. When positive it is a literal count. When negative it is a ratio of distinct values to total rows, between minus one and zero: -1 means every value is unique (a serial id), -0.2 means one distinct value per five rows. PostgreSQL stores the negative form for columns whose distinct count grows with the table, so the ratio stays valid as the table grows and the absolute count would not.
  • most_common_vals (the MCV list) holds the most frequent values, and most_common_freqs holds each one’s frequency as a fraction. These two arrays line up: the first value’s frequency is the first number. The MCV list is how the planner knows that status = 'active' is common and status = 'archived' is rare. Its length is capped by the statistics target.
  • histogram_bounds is a list of boundary values that split the non-MCV data into equal-population buckets. With 100 buckets, each bucket holds about one percent of the rows, and the bounds let the planner estimate a range like created_at > '2026-01-01' by counting whole buckets. The most common values are pulled out into the MCV list first, so the histogram describes only what is left.
  • correlation measures how closely the physical row order on disk matches the sorted order of this column, from -1 to +1. A value near +1 (rows already on disk in ascending order of this column) makes an index scan cheap, because matching rows sit on adjacent pages. A value near 0 scatters them across the whole table and pushes the planner toward a sequential or bitmap scan. This single number often decides index versus sequential.
flowchart TD
    A["ANALYZE samples ~300 x target rows"] --> B["pg_statistic catalog"]
    B --> C["pg_stats view (human-readable)"]
    C --> D["null_frac: fraction NULL"]
    C --> E["n_distinct: distinct count or ratio"]
    C --> F["MCV list: frequent values + freqs"]
    C --> G["histogram_bounds: equal-population buckets"]
    C --> H["correlation: physical vs sorted order"]
    D --> I["Planner combines these into a selectivity"]
    E --> I
    F --> I
    G --> I
    H --> I

3. Selectivity by hand: turning stats into a row count

Now the payoff: you can compute the planner’s estimate yourself and check its work. The recipe depends on what kind of predicate you have and whether the value is in the MCV list.

Take a concrete orders table of one million rows. Suppose status has an MCV list where 'active' has frequency 0.99 and a few rare values share the rest, and n_distinct is 5.

Equality on an MCV value. If the value you test sits in most_common_vals, the planner uses its stored frequency directly. For WHERE status = 'active', the frequency is 0.99, so the selectivity is 0.99 and the estimate is 0.99 * 1000000 = 990000 rows. No histogram, no arithmetic beyond the lookup. This is why the MCV list matters so much: it makes the common, skewed cases exact.

Equality on a non-MCV value. If the value is not in the MCV list, the planner cannot look it up, so it spreads the leftover probability evenly across the leftover distinct values. The formula is the residual frequency divided by the residual distinct count:

selectivity = (1 - sum_of_MCV_freqs) / (n_distinct - number_of_MCVs)

Suppose the MCV list captures 'active' (0.99) and 'pending' (0.005), summing to 0.995, leaving 3 of the 5 distinct values uncaptured. For a non-MCV value like status = 'cancelled':

selectivity = (1 - 0.995) / (5 - 2) = 0.005 / 3 = 0.001667
estimate    = 0.001667 * 1000000  = ~1667 rows

That is the planner assuming the rare values share the remaining 0.5 percent equally. It is a guess, and it is why a non-MCV value with a wildly uneven true distribution can still be misestimated even with good stats.

A range predicate. A range like WHERE amount BETWEEN 100 AND 200 is answered from the histogram. Each bucket holds about one percent of the rows, so the planner finds which buckets the range spans, counts the whole buckets inside it, and interpolates the fractional buckets at the ends. If the range covers four full buckets and half of two edge buckets, that is about 4 + 0.5 + 0.5 = 5 buckets, or 5 percent, an estimate of 0.05 * 1000000 = 50000 rows. More buckets means finer interpolation at the edges, which is the whole reason to raise the statistics target on range-queried columns.

Estimating WHERE status = 'active' AND amount > 500
The tableorders has 1,000,000 rows. We want the estimate for two conditions ANDed together.
Step 1 of 5

4. The statistics target: how fine the picture is

You have seen the statistics target twice now, setting the sample size and capping the MCV list and histogram length. It is worth its own rung because it is your main lever on estimate quality.

The global default is set by default_statistics_target, which is 100 in PostgreSQL 16 and 17. Raising it makes the MCV list longer (more skewed values captured exactly) and the histogram finer (more buckets, so tighter range estimates), at two costs: ANALYZE reads more rows and runs longer, and planning gets slightly slower because the planner walks larger arrays on every query. You rarely want to raise it globally; you want to raise it on the specific columns that need it.

-- Per-column target: raise it on a skewed or high-cardinality column
-- that range queries or filters care about. 1000 is a common choice.
ALTER TABLE orders ALTER COLUMN amount SET STATISTICS 1000;

-- The new target takes effect on the NEXT analyze, so run it now.
ANALYZE orders (amount);

When does a column need more? Two signatures. A high-cardinality column filtered or ranged on, where 100 histogram buckets are too coarse to separate the values you query, benefits from more buckets. A heavily skewed column with more than 100 meaningfully frequent values needs a longer MCV list, because at target 100 only the top 100 values are stored exactly and the 101st common value falls into the residual estimate and gets misjudged. The trade is real: target 1000 means ANALYZE samples about 300000 rows for that table and stores ten times the arrays, so reserve it for columns where the plan actually depends on the precision.

Statistics targetSample rows (approx)MCV / histogram sizeBest for
100 (default)~30,000up to 100 entriesmost columns
1000~300,000up to 1000 entriesskewed or high-cardinality filter columns
10000 (max)~3,000,000up to 10000 entriesrare, a critical column on a huge table

5. The independence assumption, the root of bad plans

This is the rung that explains most of the bad plans you will ever see, so slow down here.

When a query has several conditions, the planner needs the selectivity of all of them together. It computes each one’s selectivity separately, then multiplies them. Multiplying probabilities is only valid when the events are independent, and the planner assumes they always are. That assumption is the independence assumption, and it is wrong whenever your columns are correlated.

Consider WHERE city = 'San Francisco' AND state = 'CA'. Suppose city = 'San Francisco' has selectivity 0.001 and state = 'CA' has selectivity 0.12. The planner multiplies: 0.001 * 0.12 = 0.00012, so on a ten-million-row table it predicts 1200 rows. But every San Francisco row is already a California row. The state condition removes nothing once the city is fixed. The true selectivity is just 0.001, the true count 10000. The planner is off by more than eight times, and always in the dangerous direction: it underestimates.

Why is underestimating dangerous? Because a tiny estimate invites a nested loop. The reading EXPLAIN ANALYZE page showed that a nested loop probes its inner relation once per outer row, which is cheap when the outer side has 1200 rows and catastrophic when it really has 10000, or 10 million. The planner picks a nested loop sized for 1200 probes, then runs 10000 of them, or worse if the misestimate compounds up a join tree. This is the single most common mechanism behind a query that ran fine yesterday and melts down today: correlated columns, no extended statistics, and a join that flipped to a nested loop.

flowchart LR
    A["city = 'SF': sel 0.001"] --> C["Planner multiplies"]
    B["state = 'CA': sel 0.12"] --> C
    C --> D["Estimate: 0.001 x 0.12 = 0.00012, ~1,200 rows"]
    D --> E["Reality: every SF row is a CA row, ~10,000 rows"]
    E --> F["8x under-estimate, planner picks a doomed nested loop"]

The classic correlated pairs to watch: city and state, ZIP and city, manufacturer and model, country and currency, and any pair where one column functionally implies the other. When you see two such columns filtered together and an EXPLAIN ANALYZE showing estimated rows far below actual, you have found your bug.

Check yourself
A query filters on both manufacturer = 'Toyota' and model = 'Corolla'. The planner estimates 30 rows; EXPLAIN ANALYZE shows 9,000 actual. What is the most likely cause?

6. Extended statistics: teaching the planner about correlation

The fix for the independence assumption is to give the planner statistics about how columns relate, not just about each column alone. The command is CREATE STATISTICS, and it builds one or more of three kinds of multi-column statistics, each answering a different question.

  • dependencies captures functional dependencies: when one column’s value determines another’s, like city implies state or model implies manufacturer. It tells the planner that once city is fixed, the state condition is redundant, so it stops multiplying in the redundant selectivity.
  • ndistinct captures the number of distinct combinations across several columns. The planner needs this for GROUP BY a, b: it would otherwise multiply the per-column distinct counts and badly overestimate the number of groups when the columns are correlated.
  • mcv captures the most common multi-column value combinations and their frequencies, the multi-column version of the MCV list. It is the most powerful and the largest, and it handles skew in specific combinations that a plain dependency cannot.

Here is the full fix for the correlated-column misestimate from rung 5, with the before and after.

-- BEFORE: the planner multiplies, assuming independence.
EXPLAIN ANALYZE
SELECT * FROM addresses WHERE city = 'San Francisco' AND state = 'CA';
-- ...rows=1200 (estimated)  ...actual rows=10000   <- 8x under
-- Create extended statistics on the correlated pair.
-- One object can carry several kinds; list what you need.
CREATE STATISTICS addr_city_state (dependencies, mcv)
  ON city, state FROM addresses;

-- Extended stats are populated by ANALYZE, like ordinary stats.
ANALYZE addresses;
-- AFTER: the planner now knows city implies state and stops
-- multiplying in the redundant condition.
EXPLAIN ANALYZE
SELECT * FROM addresses WHERE city = 'San Francisco' AND state = 'CA';
-- ...rows=10000 (estimated)  ...actual rows=10000   <- matches

You can inspect what the object captured through the pg_stats_ext view, which shows the dependency degrees and the multi-column MCV list, the extended-statistics analog of pg_stats. The estimate now matches reality, the planner sizes the join correctly, and the nested-loop trap is gone. Note the cost honestly: extended statistics add ANALYZE work and a little planning time, so create them for the specific correlated predicates that hurt, not on every column pair.

Row estimate for the correlated city/state filter
ANALYZE has recorded that city functionally determines state. The planner stops multiplying in the redundant state condition and estimates 10,000 rows, matching actual. It now sizes the join correctly and avoids the doomed nested loop.

7. Expressions and n_distinct overrides

Two more tools cover the cases plain column statistics miss.

Expression statistics. The planner has no statistics for the result of an expression. A filter like WHERE lower(email) = 'a@b.com' or WHERE date_trunc('month', created_at) = '2026-06-01' falls back to a hard-coded default selectivity, because nothing in pg_stats describes lower(email). Since PostgreSQL 14 you can fix this by creating statistics on the expression itself.

-- Give the planner real stats for an expression it would otherwise guess.
CREATE STATISTICS orders_lower_email ON lower(email) FROM orders;
ANALYZE orders;

There is a second path to the same place: an expression index automatically carries its own statistics on the indexed expression. If you have CREATE INDEX ON orders (lower(email)), the planner already has good stats for lower(email) from that index, and you do not also need a separate statistics object. Use CREATE STATISTICS ON (expression) when you want the estimate fixed but do not want, or cannot afford, the index.

Overriding n_distinct. Sometimes the sample lies about distinctness. On a huge table, a column with a very large number of distinct values can have its n_distinct underestimated, because a 30000-row sample sees most values only once and cannot tell how many more exist beyond the sample. A too-low n_distinct makes the planner think a value is more common than it is, inflating equality estimates. When you know the true distinctness, you can pin it.

-- Tell the planner this column is nearly unique, overriding the sample.
-- A negative value is the distinct-to-total ratio; -1 means fully unique.
ALTER TABLE events ALTER COLUMN session_id SET (n_distinct = -0.8);
ANALYZE events;

This override sticks until you change it; ANALYZE will respect it rather than recomputing that column’s distinct count. Reach for it only when you have measured the real distinctness (a SELECT count(DISTINCT session_id) on a representative slice) and confirmed the sample is wrong, because a bad override is worse than the sample it replaces.

8. The diagnostic workflow when an estimate lies

Tie it together into a procedure you can run during an incident. The trigger is always the same: a slow query whose plan does not match its data.

First, confirm it is an estimation problem. Run EXPLAIN (ANALYZE, BUFFERS) and read estimated rows= against actual rows on each node, bottom up, exactly as the reading EXPLAIN ANALYZE page teaches. A node where estimated and actual differ by a large factor is your culprit, and the lowest such node is the root cause; errors above it are usually just that error cascading upward.

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM addresses WHERE city = 'San Francisco' AND state = 'CA';

Second, find which column or columns drive the bad node. Query pg_stats for those columns and check whether the distribution looks current and whether two correlated columns are filtered together. Check pg_stat_user_tables to see when the table was last analyzed, because a stale last_analyze after a bulk change explains a misestimate instantly.

-- When was this table last analyzed, and how many rows have changed since?
SELECT relname, last_analyze, last_autoanalyze, n_mod_since_analyze, n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'addresses';

Third, apply the matching fix and re-ANALYZE. The decision is mechanical once you know the cause:

flowchart TD
    A["EXPLAIN ANALYZE: estimate far from actual"] --> B{"last_analyze stale?"}
    B -->|yes| C["Run ANALYZE, recheck"]
    B -->|no| D{"Two correlated columns?"}
    D -->|yes| E["CREATE STATISTICS on the pair, ANALYZE"]
    D -->|no| F{"Skewed or high-cardinality single column?"}
    F -->|yes| G["SET STATISTICS higher, ANALYZE"]
    F -->|no| H{"Expression or wrong n_distinct?"}
    H -->|yes| I["CREATE STATISTICS on expr or SET n_distinct, ANALYZE"]
    C --> J["Re-run EXPLAIN ANALYZE, confirm estimate now matches"]
    E --> J
    G --> J
    I --> J

Fourth, recheck. Re-run the EXPLAIN ANALYZE and confirm the estimate now tracks the actual count. If it does, the plan will follow. If it does not, you have the wrong column or the wrong fix, so go back to step two. This loop, estimate wrong, find the column, raise the target or add extended statistics, re-ANALYZE, recheck, is the whole job. Master it and most planner mysteries become routine.

The failure modes to keep in your head, each a stop on this loop: stale stats after a bulk load (run ANALYZE); correlated columns with no extended statistics (the independence trap); a high-cardinality column stuck at target 100 (raise it); n_distinct underestimated on a huge table (override it); and the quietest one of all, assuming ANALYZE is exact when it is a sample, so an estimate on a billion-row table can be a little off no matter what you do.

Mastery Questions

  1. After a nightly batch loads 50 million new rows into an empty staging table, the first report against it runs for an hour with a plan full of nested loops. The same report on the same data is fast the next morning. What happened, and what one command would have prevented the slow run?

    Answer. The batch loaded the rows before autoanalyze had run, so the staging table had no statistics, or empty-table statistics, at the moment the report ran. With no statistics, the planner falls back to crude defaults and badly underestimates cardinalities, which invites nested loops sized for far fewer rows than exist; with 50 million rows that is an hour-long disaster. By morning, autovacuum had triggered an autoanalyze (the table crossed its change threshold), real statistics landed, the estimates corrected, and the planner switched to hash joins. The prevention is one line at the end of the load: ANALYZE staging_table; (or ANALYZE on the whole database). Never trust a plan against a freshly bulk-loaded table until you have analyzed it yourself, because autoanalyze reacts on a delay and the first queries run inside that window.

  2. A query filters WHERE zip = '94110' AND city = 'San Francisco'. Each column’s own statistics are fresh and accurate, yet the planner estimates 4 rows when the real count is 1,200. Raising default_statistics_target to 1000 and re-analyzing does not help. Explain precisely why, and give the fix.

    Answer. This is the independence assumption, and it is immune to the statistics target because the error is not in either column’s own distribution. The planner computes selectivity for zip = '94110' and for city = 'San Francisco' separately, each correctly, then multiplies them as if the two are independent. They are not: a ZIP code functionally determines its city, so every 94110 row is already a San Francisco row and the city condition removes nothing. Multiplying double-counts the filtering and collapses the estimate to 4. Raising the target sharpens each column’s histogram and MCV list, but those were already accurate; the target never changes how two selectivities are combined. The fix is CREATE STATISTICS zip_city (dependencies, mcv) ON zip, city FROM addresses; followed by ANALYZE addresses;. With the functional dependency recorded, the planner stops multiplying in the redundant city condition and the estimate climbs to match the real 1,200, after which the join sizing corrects itself.

  3. On a 2-billion-row events table, equality filters on session_id consistently overestimate, predicting tens of thousands of matching rows when each session has only a few hundred events. The column is in fact close to unique relative to the table. Why does the sample mislead the planner here, and how do you correct it without a schema change?

    Answer. The miss is in n_distinct. ANALYZE samples roughly 30000 rows from 2 billion, one row in about 67000. With a near-unique column, almost every sampled value appears exactly once, and from a sample of all-singletons the estimator cannot distinguish “millions of distinct values” from “tens of millions,” so it underestimates n_distinct. A too-low distinct count makes each value look more common than it is, so an equality estimate inflates: the planner divides the table among too few distinct values and assigns each a larger share. The correct estimate needs a n_distinct that reflects the true distinctness. You fix it without touching the schema by overriding the value: measure the real ratio on a representative slice, then ALTER TABLE events ALTER COLUMN session_id SET (n_distinct = -0.95); (a negative value is the distinct-to-total ratio, here 95 percent distinct), and ANALYZE events;. The override sticks across future analyzes, so the planner stops recomputing a wrong number from an unrepresentative sample, and the equality estimates drop to the correct small counts.

Recommended next

Sources & evidence14 claims · 3 cited

ANALYZE behavior, pg_stats columns, selectivity computation, default_statistics_target, the independence assumption, CREATE STATISTICS kinds, expression statistics, and n_distinct override are grounded in src_pg_docs_planner_stats and src_pg_docs_create_statistics; reading estimated vs actual rows in EXPLAIN is grounded in src_pg_docs_explain. The 300x-target sampling constant and the worked numeric examples are stable common knowledge from PostgreSQL behavior, not pinned to a single cited paragraph.

  • ANALYZE collects statistics from a random sample of roughly 300 times the statistics target rows per table (about 30000 rows at the default target of 100), a fixed sample size rather than a percentage, so estimates on very large tables are approximate.stable common knowledge
  • Autovacuum runs ANALYZE automatically when enough rows change, but it reacts on a threshold and lags, so you should run ANALYZE manually after a bulk load, large UPDATE, or COPY before trusting a plan.verified
  • The pg_stats view exposes per-column statistics including null_frac, n_distinct, most_common_vals and most_common_freqs (the MCV list), histogram_bounds, and correlation; a negative n_distinct is the ratio of distinct values to total rows, stored for columns whose distinct count grows with the table.verified
  • The correlation statistic measures how closely physical on-disk row order matches the column's sorted order (-1 to +1); a value near +1 makes an index scan cheap because matching rows cluster on adjacent pages, while a value near 0 pushes the planner toward a sequential or bitmap scan.verified
  • Equality on a value present in the MCV list uses that value's stored frequency directly as the selectivity, so WHERE status = 'active' with an MCV frequency of 0.99 estimates 0.99 * 1000000 = 990000 rows.verified
  • Equality on a non-MCV value uses selectivity = (1 - sum of MCV frequencies) / (n_distinct - number of MCVs), spreading the residual probability evenly across the uncaptured distinct values, while a range predicate is estimated by counting histogram buckets.verified
  • default_statistics_target is 100 in PostgreSQL 16 and 17, and ALTER TABLE ... ALTER COLUMN ... SET STATISTICS raises a single column's target; a higher target lengthens the MCV list and refines the histogram at the cost of longer ANALYZE and slightly slower planning.verified
  • The planner computes each condition's selectivity separately and multiplies them under an independence assumption, so correlated columns (city and state, ZIP and city, manufacturer and model) make it badly underestimate the combined selectivity.verified
  • An underestimate is dangerous because a tiny row estimate invites a nested loop that probes its inner relation once per outer row; when the real row count is far higher, the nested loop sized for the estimate melts down, which is the common mechanism behind a query that suddenly slows.verified
  • CREATE STATISTICS builds three kinds of multi-column statistics: ndistinct (multi-column distinct counts, used for GROUP BY), dependencies (functional dependencies such as city implies state), and mcv (multi-column most-common value combinations); extended statistics are populated by ANALYZE.verified
  • Creating dependencies and mcv statistics on a correlated pair lets the planner stop multiplying in the redundant condition, correcting a city/state estimate from about 1200 to the actual 10000 rows on a ten-million-row table.verified
  • Since PostgreSQL 14, CREATE STATISTICS can be created on an expression to give the planner real statistics for a predicate like lower(email), and an expression index also carries its own statistics on the indexed expression.verified
  • ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...) overrides the sampled distinct count, useful when a 30000-row sample underestimates distinctness on a huge near-unique column; a negative value is the distinct-to-total ratio and the override persists across future ANALYZE runs.verified
  • A misestimate is detected by comparing estimated rows against actual rows in EXPLAIN ANALYZE bottom-up (the lowest node with a large discrepancy is the root cause), with pg_stat_user_tables.last_analyze and n_mod_since_analyze revealing stale statistics after a bulk change.verified

Cited sources