Heap Pages and TOAST

How PostgreSQL physically stores rows: the 8 kB page, the heap-tuple header and alignment padding, fillfactor, TOAST for oversized values, and the FSM/VM maps, plus the diagnostics to measure it all.

Learning outcomes

Every query you will ever tune ends at the same place: bytes on a page. PostgreSQL reads, writes, caches, and locks in fixed 8 kB pages, so the shape of your rows on disk decides how many pages a scan must touch and how much memory your buffer pool can hold. This page makes that physical layout concrete, so that “this table is slow” becomes “this table needs more pages than it should, and here is exactly why.”

After studying this page, you can:

  • Explain why the 8 kB page is the unit of I/O, caching, and locking, and why a row can never span two pages.
  • Read a heap page’s anatomy: its header, the line-pointer array, free space, and tuples filling from the end, and address any tuple as a (page, item) pair.
  • Reorder columns to cut alignment padding, and compute the byte savings before you run the migration.
  • Predict when a value gets pushed out to TOAST, choose a storage strategy and compression codec, and avoid the detoast cost trap.
  • Diagnose physical layout with pg_column_size(), the relation-size functions, and the pageinspect extension.

Before we dive in

You should be comfortable with basic SQL, with CREATE TABLE and column types, and with the idea that a table lives in files on disk. It helps to have read the MVCC and tuple visibility page, because the tuple header we dissect here is the same header that carries xmin and xmax. You do not need that page to follow this one, but the two fit together.

A few terms, defined as we use them. The heap is PostgreSQL’s default table storage: an unordered pile of rows, as opposed to an index, which is ordered. A page (also called a block) is a fixed-size chunk of that file, 8 kB by default, and the smallest amount PostgreSQL ever reads from or writes to disk. A tuple is one physical row version stored inside a page. A varlena is any variable-length type, such as text, bytea, jsonb, or numeric, that carries its own length header. Hold onto those four. Everything below is built from them.

Mental Model

The wrong model, and almost everyone starts here, is that a row is a contiguous record sitting wherever it lands, and that a wide column like a 50 kB JSON document simply makes that one record bigger. Under that model, row size and column order do not matter much, and a fat column only costs you when you read it.

PostgreSQL does not work that way. The better model is a mailroom of fixed-size pigeonholes. Every pigeonhole is exactly 8 kB. A row must fit inside one pigeonhole, whole, never split across two. So when a value is too wide to fit, the mailroom does not buy a bigger box: it compresses the value, and if that is still too big, it tears the value out, files it in a separate overflow cabinet (the TOAST table), and leaves a small claim ticket in the row. The visible row stays small; the bulk lives elsewhere and is fetched only when you ask for it.

Keep this picture. A row is a tenant in a fixed box, wide values are filed out of line, and the on-disk size of a row is set as much by alignment padding and what got toasted as by the data you think you stored. Once that clicks, padding waste, the detoast surprise, and why a mostly-NULL wide table is cheap all stop being mysteries.

Breaking it down

1. Why the 8 kB page is the only unit that matters

Start with the single fact the rest of the page hangs on: PostgreSQL never reads or writes one row. It reads and writes whole pages, 8 kB at a time (BLCKSZ, fixed at compile time, 8192 bytes by default). When you ask for one tiny row, PostgreSQL loads the entire 8 kB page that holds it into the shared buffer pool, and every later read of any row on that page is a memory hit.

This is why physical layout drives performance. The currency of a scan is pages touched, not rows returned. If your rows are 200 bytes, roughly 40 of them share a page, and a sequential scan of a million rows reads about 25,000 pages. Bloat the same rows to 400 bytes through careless column order, and you double the pages, double the I/O, and halve how much of the table fits in cache. The query plan did not change; the physics did.

The page size also sets a hard ceiling. Because a tuple must live inside one page, and PostgreSQL reserves room for the page header and a line pointer, a single row’s inline data cannot exceed roughly 8 kB. That ceiling is the entire reason TOAST exists, which we reach in rung 6. Hold the number 8192; it explains more PostgreSQL behavior than any other constant.

flowchart LR
    Q["Query asks for row 42"] --> B{"Page in buffer pool?"}
    B -->|yes| H["Memory hit, return tuple"]
    B -->|no| D["Read whole 8 kB page from disk"]
    D --> C["Cache the page"]
    C --> H

2. Inside one page: headers, line pointers, and tuples

Now open a page and look inside. A heap page has four regions, and they grow toward each other from the two ends.

At the very start sits a 24-byte page header (PageHeaderData): the page’s WAL position, checksum, flags, and two crucial offsets, pd_lower and pd_upper. Right after the header comes the line-pointer array (also called the item-id array): a list of 4-byte pointers, one per tuple, each holding the offset and length of a tuple within the page. The line pointers grow downward from the top. The tuples themselves are written from the bottom of the page upward. Between the growing line-pointer array and the growing tuple data lies the free space.

pd_lower marks where the line-pointer array ends; pd_upper marks where the tuple data begins. The gap between them, pd_upper - pd_lower, is exactly the free space on the page. When you insert a row, PostgreSQL writes the tuple just below pd_upper (moving pd_upper down) and adds a line pointer just above pd_lower (moving pd_lower up). The two ends march toward each other until they meet; then the page is full.

flowchart TB
    H["Page header (24 B): pd_lower, pd_upper, checksum, LSN"]
    L["Line pointers grow down (4 B each)"]
    F["Free space (pd_upper minus pd_lower)"]
    T["Tuples grow up from the end"]
    H --> L --> F --> T

This split is what makes a tuple’s address stable and cheap. A tuple is named by its ctid, a pair (page number, item number): which page, and which slot in that page’s line-pointer array. An index entry does not point at a byte offset; it points at a ctid. That indirection is the whole trick: PostgreSQL can move a tuple’s bytes around within its page to reclaim space, and as long as it updates the line pointer, every index that references that ctid still works. The line pointer is a level of indirection between “where the index thinks the row is” and “where the bytes actually sit.”

3. The heap tuple: 23 bytes of header, then your data

Zoom in one more level, to a single tuple. Every tuple begins with a fixed header (HeapTupleHeaderData) before any of your column data. That header is 23 bytes, and on most platforms it is padded to a 24-byte boundary before the data starts, so budget 24 bytes of overhead per row before you have stored a single column.

The header is not waste; it is the machinery of MVCC and addressing. The fields that matter:

  • t_xmin (4 bytes): the transaction that created this tuple.
  • t_xmax (4 bytes): the transaction that deleted or superseded it, or 0 if live.
  • t_cid / t_field3 (4 bytes): command id within the transaction, union’d with other state.
  • t_ctid (6 bytes): the ctid this tuple points to, used to chain to a newer version after an update.
  • t_infomask2 (2 bytes): the number of attributes, plus flag bits.
  • t_infomask (2 bytes): status flags, including whether a NULL bitmap is present and whether the row has been toasted.
  • t_hoff (1 byte): the offset from the tuple start to the user data, that is, the header length including any null bitmap and padding.

If the row contains any NULL, an optional null bitmap follows the fixed header: one bit per column, so a table with up to 8 columns adds 1 byte, up to 16 columns adds 2 bytes, and so on, all rounded up to the alignment of the first data column. The bitmap is present only when at least one column in that row is NULL. Then comes your actual column data, packed in column order, each value placed at its type’s required alignment.

That last clause is the one that costs people money, so it gets its own rung.

4. Alignment and padding: how column order changes row size

Here is the rung that turns a schema review into real savings. Each PostgreSQL type has an alignment requirement: the byte boundary its value must start on. A 4-byte int must begin at a multiple of 4. An 8-byte bigint, timestamptz, or double precision must begin at a multiple of 8. A 2-byte smallint must begin at a multiple of 2. A 1-byte bool or "char" can start anywhere. To honor those boundaries, PostgreSQL inserts padding bytes between columns, and that padding is dead weight stored on every single row.

Padding depends on the order you declare columns, because each value is placed right after the previous one and then bumped forward to its alignment. Declare a bool then a bigint, and the bigint cannot start at offset 1; it needs offset 8, so 7 padding bytes are burned. Declare them the other way, or group the small types together, and the padding vanishes.

Make it concrete. Consider this table.

-- Naive order: types interleaved, alignment fought at every step.
create table event_bad (
  flag    boolean,        -- 1 byte
  id      bigint,         -- 8 bytes, must align to 8
  kind    smallint,       -- 2 bytes, must align to 2
  ts      timestamptz,    -- 8 bytes, must align to 8
  active  boolean         -- 1 byte
);

Walk the layout after the 24-byte tuple header (the header is already 8-aligned, so column packing starts at a fresh 8 boundary). flag takes offset 0, 1 byte. id needs an 8-boundary, so 7 bytes of padding are inserted, then id occupies offsets 8 through 15. kind (smallint, 2-align) takes offsets 16 and 17. ts needs an 8-boundary, so offsets 18 and 19 are wasted on padding, then ts occupies offsets 24 through 31. active takes offset 32, 1 byte. The data ends at offset 33, then the whole tuple is padded to an 8-boundary (MAXALIGN), pushing the row to 40 bytes of data. Add the 24-byte header and one row costs 64 bytes on disk.

Now reorder by alignment, widest first.

-- Aligned order: 8-byte columns first, then 2-byte, then the 1-byte flags.
create table event_good (
  id      bigint,         -- 8 bytes
  ts      timestamptz,    -- 8 bytes
  kind    smallint,       -- 2 bytes
  flag    boolean,        -- 1 byte
  active  boolean         -- 1 byte
);

Now id takes 0 through 7, ts takes 8 through 15, kind takes 16 through 17, flag takes 18, active takes 19. Data ends at offset 20, padded to 24 (MAXALIGN). The row data is 24 bytes, not 40. Add the 24-byte header and the row is 48 bytes, not 64. That is a 25 percent cut on every row, with identical columns and identical data, purely from declaration order. On a billion-row table that is gigabytes of disk, cache, and I/O reclaimed for free.

Same five columns, two orders
The two 8-byte columns lead, the 2-byte column follows, the two 1-byte flags pack into the tail. No internal padding. Result: 24 bytes of column data, 48 bytes per row. Same data, 25 percent smaller.

The rule that falls out: order columns from widest alignment to narrowest, 8-byte types first, then 4-byte, then 2-byte, then the 1-byte and variable-length types last. You can prove the win before migrating by summing pg_column_size() over a representative row in each order, which we do in rung 10.

Padding tax across a table
Rows in the table100000000
10000001000000000
Padding saved at 16 bytes per row1,600,000,000 bytes reclaimed by reordering
Mid table: reordering reclaims real cache and I/O

5. fillfactor: leaving room on the page on purpose

So far we have packed pages as tightly as possible. Sometimes you want the opposite. fillfactor is a per-table (and per-index) setting that tells PostgreSQL to stop filling a page once it reaches a given percent full, leaving the rest as reserved free space. For tables the default is 100, meaning fill the page completely; for B-tree indexes the default is 90.

Why would you ever leave a page partly empty on purpose? Because that reserved space is where an updated row can land on the same page as its old version. PostgreSQL has an optimization in which, if an update does not change any indexed column and the new tuple fits on the same page, it can write the new version beside the old one and skip updating every index, a technique called a HOT update (covered in depth on the Table Bloat and HOT Updates page). A HOT update needs free space on the page to work. A page filled to 100 percent has none, so the update spills to a different page and every index must be touched. Lowering fillfactor to, say, 90 or 85 on an update-heavy table reserves the room that lets those cheaper same-page updates happen.

-- Reserve 15 percent of each page for in-place (HOT) updates.
alter table sessions set (fillfactor = 85);
-- Existing pages are not rewritten until they are next updated or you VACUUM FULL.
vacuum full sessions;  -- optional, to apply immediately; takes an exclusive lock.

The trade-off is direct: lower fillfactor means fewer rows per page, so more pages for the same data, so larger sequential scans and a bigger on-disk footprint. You spend space to buy cheaper updates. Use it on hot, frequently-updated tables where the indexed columns rarely change; leave it at 100 for append-only or read-mostly tables, where the reserved space would just be permanent waste.

6. TOAST: when a value is too wide to fit

Return to the hard ceiling from rung 1: a tuple must fit in one 8 kB page, and a row can never span pages. So what happens when you store a 2 MB JSON document or a 100 kB text blob? It cannot fit. This is the problem TOAST solves. TOAST stands for The Oversized-Attribute Storage Technique, and it is the mechanism that lets a logical row hold values far larger than a page.

The trigger is a threshold. PostgreSQL tries to keep each tuple at or below TOAST_TUPLE_THRESHOLD, which is about 2 kB (precisely 2032 bytes, derived as roughly one quarter of the 8 kB page minus overhead). When a row’s total width exceeds that threshold, PostgreSQL goes to work on its varlena columns, the variable-length ones, in order of largest first, doing two things: it compresses wide values, and if the row is still over the threshold, it moves the largest values out of line into a separate TOAST table, replacing the in-row value with an 18-byte pointer (a TOAST pointer holding the value’s size and the row identifier in the TOAST table).

That separate TOAST table is hidden. Every table whose schema has a TOAST-able column automatically gets a companion table named pg_toast.pg_toast_<oid>, with its own index, created and managed for you. A toasted value is sliced into roughly 2 kB chunks, each chunk stored as one row in the TOAST table, and reassembled on read. You never name this table in queries; PostgreSQL detoasts transparently when you select the column.

Two limits matter. First, only varlena types can be toasted: text, bytea, jsonb, json, numeric, tsvector, arrays, and the like. A fixed-width type such as bigint, timestamptz, or a plain uuid cannot be pushed out of line, because it has no length header to indirect through. Second, TOAST is what lets a single value reach up to about 1 GB; without it, the page ceiling would cap every value near 8 kB.

flowchart TB
    R["Row exceeds about 2 kB (TOAST_TUPLE_THRESHOLD)"] --> C{"Compress widest varlena"}
    C -->|"now fits"| K["Keep compressed value inline"]
    C -->|"still too wide"| M["Move value to pg_toast table"]
    M --> P["Leave an 18-byte TOAST pointer in the row"]
    M --> S["Slice value into ~2 kB chunks, one TOAST row each"]

7. The four storage strategies and the compression choice

You control TOAST per column. Every TOAST-able column has a storage strategy that decides whether PostgreSQL is allowed to compress it, move it out of line, or both. There are four, set with ALTER TABLE ... ALTER COLUMN ... SET STORAGE.

StrategyCompress inline?Move out of line?Typical use
PLAINnonofixed-width types, and short values you never want indirected
MAINyesonly as a last resortkeep compressed but inline when at all possible
EXTERNALnoyeswide values you slice or substring often (no compression cost on read)
EXTENDEDyesyesthe default for varlena: compress first, then move out if still too big

EXTENDED is the default for every toastable column, and it is usually right: it compresses, and moves out only what it must. MAIN keeps a value inline (compressed) unless the row simply cannot fit otherwise, which helps when you read the whole value on nearly every query and want to avoid the extra TOAST-table fetch. EXTERNAL is the sharp tool: it stores the value out of line uncompressed, which makes substring and length operations on that column much faster, because PostgreSQL can seek to a chunk and read a slice without decompressing the whole value first. PLAIN forbids both and is the only option for non-varlena types.

Compression itself has a codec. Historically PostgreSQL used pglz, its built-in algorithm. PostgreSQL 14 added lz4, which compresses and especially decompresses much faster than pglz at a similar or better ratio for typical data. You choose with default_toast_compression (cluster default) or per column.

-- Use lz4 as the cluster default for new toasted values (PostgreSQL 14+).
alter system set default_toast_compression = 'lz4';
select pg_reload_conf();

-- Per-column control of codec and strategy.
alter table docs alter column body set compression lz4;
alter table docs alter column body set storage external;  -- slice-friendly, no compression

One caution: changing STORAGE or compression only affects values written after the change. Existing rows keep whatever they were stored with until they are next rewritten (an UPDATE of that row, or a table rewrite via VACUUM FULL or a no-op ALTER TABLE). To convert a whole table you must rewrite it.

Choosing a storage strategy
Compress with the column codec, then move out of line if the row is still over the threshold. The right default for almost all text, jsonb, and bytea columns: small values stay inline and compressed, only genuinely huge ones get pushed out.

8. The maps beside the heap: FSM and the visibility map

The heap does not travel alone. Each table has two small companion forks that PostgreSQL maintains beside it, and both exist to let it avoid reading pages.

The Free Space Map (FSM, the _fsm fork) tracks, for every heap page, roughly how many bytes of free space it has. When you insert a row, PostgreSQL consults the FSM to find a page with enough room instead of always appending to the end, which is how deleted-and-vacuumed space gets reused. The FSM is approximate (it stores free space in coarse buckets) and is updated lazily, mostly by VACUUM. If the FSM is stale or missing, inserts tend to pile onto the end of the table and old free space goes unreused, which is one quiet cause of bloat.

The Visibility Map (VM, the _vm fork) is the higher-leverage one. It holds two bits per heap page: an all-visible bit and an all-frozen bit. The all-visible bit is set when every tuple on that page is visible to all current and future transactions, that is, the page has no dead tuples and nothing in flight. The all-frozen bit is set when every tuple on the page is also frozen (its xmin is old enough to be treated as permanently committed).

Those two bits buy two large wins. First, VACUUM can skip pages: a page marked all-visible has no dead tuples to clean, so an ordinary vacuum passes over it, and a page marked all-frozen can be skipped even by the anti-wraparound freeze pass. On a mostly-static table, vacuum touches only the handful of recently changed pages instead of the whole relation. Second, the all-visible bit enables index-only scans. When an index has every column a query needs, PostgreSQL would still normally visit the heap to check each tuple’s visibility (the index itself has no MVCC info). But if the VM says the tuple’s page is all-visible, that heap visit is unnecessary, and the scan answers from the index alone. The fraction of a table’s pages marked all-visible therefore directly governs how effective index-only scans are, which is why a freshly bulk-loaded table that has never been vacuumed gets few index-only scans until vacuum sets the bits.

flowchart LR
    V["Visibility map: 2 bits per page"] --> A["all-visible bit"]
    V --> F["all-frozen bit"]
    A --> S["VACUUM skips clean pages"]
    A --> I["Index-only scans skip the heap visit"]
    F --> W["Freeze pass skips already-frozen pages"]

9. The detoast cost trap

Now the trap that catches even experienced engineers, and it follows directly from rungs 6 and 7. A toasted value lives out of line, possibly compressed and sliced across many TOAST-table rows. To use that value, PostgreSQL must detoast it: fetch every chunk from the TOAST table and decompress them back into the full value. That work is invisible in your SQL but very real in your CPU and I/O.

Here is where it bites. Any operation that needs the content of a toasted column forces a detoast: filtering on it (WHERE body LIKE '%error%'), sorting on it (ORDER BY body), grouping on it, or computing on it. A sequential scan with a predicate on a large toasted column detoasts that column for every row it tests, even rows it ultimately rejects. And the broadest offender is SELECT *: it asks for every column, so it detoasts every toasted value in every returned row, whether or not you read them. A query that returns a 50 kB document column you never look at still paid to fetch and decompress all 50 kB per row.

The fix is layout discipline. Keep the columns you filter, sort, and join on small and inline, and push the rarely-read bulk into separate toastable columns (or even a separate table) that you select only when you need them. Then a scan that filters on a small inline column reads narrow rows fast and never touches the TOAST table at all, because a toasted value is fetched lazily, only when its column is actually projected or evaluated. Select named columns, not *, in hot paths. This is the practical payoff of the whole page: a row’s hot, frequently-touched data should be small and inline; its cold bulk should be toasted and left alone until asked for.

Check yourself
A reporting query filters on a small status column but runs SELECT * over a table whose rows carry a 40 kB jsonb payload. It is far slower than expected. What is the most likely cause?

10. Measuring it: pg_column_size, sizes, and pageinspect

You do not have to guess at any of this; PostgreSQL exposes the bytes directly. Start with per-value and per-row width.

-- octet_length is the logical length of a value; pg_column_size is the
-- ON-DISK size, AFTER TOAST compression, including the varlena header.
select octet_length(body)      as logical_bytes,
       pg_column_size(body)    as stored_bytes,     -- compressed if toasted
       pg_column_size(d.*)     as whole_row_bytes   -- full tuple incl. header
from docs d
where id = 123;

pg_column_size() is the tool that proves a column-reorder win before you migrate: build a one-row sample in each order and compare pg_column_size(t.*). It also reveals compression, because a 40 kB document might report 6 kB stored.

Next, size at the relation level, where three functions answer three different questions.

-- pg_relation_size: just the main heap fork (no indexes, no TOAST).
-- pg_total_relation_size: heap + all indexes + the TOAST table and its index.
-- The difference, plus the TOAST size, tells you where the bytes really live.
select
  pg_size_pretty(pg_relation_size('docs'))                         as heap_only,
  pg_size_pretty(pg_total_relation_size('docs'))                   as heap_plus_indexes_plus_toast,
  pg_size_pretty(pg_total_relation_size(reltoastrelid))            as toast_size
from pg_class
where relname = 'docs';

When toast_size dwarfs heap_only, your data lives out of line, and the detoast trap from rung 9 is your main performance lever. When pg_total_relation_size far exceeds pg_relation_size and TOAST is small, the bulk is in indexes.

For the deepest look, the pageinspect extension lets you read raw page structure: the header offsets and the line-pointer array, tuple by tuple.

create extension if not exists pageinspect;

-- The page header: pd_lower, pd_upper (their gap is the free space), and the LSN.
select lower, upper, special, pagesize
from page_header(get_raw_page('event_good', 0));

-- The line-pointer array of page 0: each item's offset (lp_off), length (lp_len),
-- and state. lp_len near zero with a redirect flag is a dead or redirected line pointer.
select lp, lp_off, lp_len, t_ctid, t_xmin, t_xmax
from heap_page_items(get_raw_page('event_good', 0))
limit 10;

page_header shows lower and upper directly, so upper - lower is the free space you reserved or consumed, the same pd_lower/pd_upper from rung 2. heap_page_items walks the line-pointer array and shows each tuple’s ctid, xmin, and xmax, which is how you watch the structure from rung 2 with your own eyes. These are how you confirm a fillfactor change actually reserved space, or that a reorder actually shrank the tuple.

Failure modes even experts hit

Mastery Questions

  1. You inherit a 2-billion-row events table declared as (active boolean, user_id bigint, score smallint, created_at timestamptz, archived boolean). Storage cost and cache pressure are both too high. Without dropping a single column or changing a type, what would you do, why does it work, and roughly how much could you save?

    Answer. Reorder the columns by alignment, widest first: (user_id bigint, created_at timestamptz, score smallint, active boolean, archived boolean). The win is pure padding elimination. In the original order, user_id (8-byte aligned) cannot start right after the 1-byte active, so 7 padding bytes are inserted; created_at likewise gets padding after score. Grouping the two 8-byte columns first, then the 2-byte score, then the two 1-byte booleans, removes that internal padding, taking the row from 40 bytes of column data down to 24, that is, from 64 to 48 bytes including the 24-byte tuple header. That is roughly a 25 percent cut on every row, which on 2 billion rows is many gigabytes of disk, and proportionally more of the table now fits in the buffer pool, so scans touch fewer pages. You prove it first with pg_column_size(t.*) on a sample row in each order, then apply it with a table rewrite (a new table plus INSERT ... SELECT, or pg_repack), since column order is fixed at creation.

  2. A documents table has a body jsonb column averaging 40 kB. A dashboard query, SELECT * FROM documents WHERE tenant_id = 7 ORDER BY created_at DESC LIMIT 20, is slow even though tenant_id and created_at are indexed and only 20 rows come back. What is happening physically, and what is the single highest-leverage fix?

    Answer. The bottleneck is detoasting driven by SELECT *. The body value is toasted, stored out of line and compressed across many pg_toast chunks. SELECT * projects every column, so for each of the 20 returned rows PostgreSQL fetches all of that row’s TOAST chunks and decompresses the full 40 kB document, even though the dashboard never displays the body. That is roughly 800 kB of TOAST reads and decompression the query does not need. The highest-leverage fix is to stop selecting the column: name only the columns the dashboard shows (SELECT id, title, created_at ...). Because a toasted value is fetched lazily, only when its column is actually projected, dropping body from the select list skips the TOAST table entirely and the query returns from the small inline columns. More broadly, keep hot, frequently-projected columns small and inline, and leave heavy bulk in toasted columns you select only on the detail view.

  3. After a large bulk load, you notice an important query refuses to use an index-only scan, and that a routine VACUUM of the table takes far longer than you expected for a table that is barely changing. The two symptoms turn out to share one cause. What is it, and what fixes both?

    Answer. The shared cause is that the table’s visibility map has few or no pages marked all-visible, because the freshly bulk-loaded pages have never been vacuumed. An index-only scan can skip the heap visit only for tuples on pages the VM marks all-visible; with the bits unset, PostgreSQL must visit the heap to check visibility on every tuple, so the planner sees no advantage and falls back to a regular index scan. The same unset bits mean VACUUM cannot skip any pages: it has to scan the entire relation rather than passing over clean, all-visible pages, so it runs long even though little has changed. The fix for both is to run VACUUM (or VACUUM ANALYZE) on the table after the load. That single pass sets the all-visible bits on the stable pages, which immediately enables index-only scans for queries the index covers and lets every subsequent vacuum skip those same pages, making routine maintenance fast. This is why running VACUUM after a bulk load is standard practice, not an afterthought.

Recommended next

Sources & evidence14 claims · 4 cited

Quantitative and mechanism claims grounded in the PostgreSQL storage, datatypes, vacuuming, and indexes docs; widely-known structural facts (page header size, ctid addressing, lz4 in PG14) marked stable-common-knowledge with empty source_ids where a listed doc does not state the exact number.

  • PostgreSQL never reads or writes a single row; it reads and writes whole pages whose default size BLCKSZ is 8192 bytes, set at compile time, so the entire page holding a requested row is loaded into the buffer pool.verified
  • A heap page has a 24-byte page header, a line-pointer (item-id) array growing downward with one 4-byte pointer per tuple, free space, and tuples written from the end upward; pd_lower and pd_upper mark the array end and tuple-data start (their difference is the free space), and a tuple is addressed by its ctid (page, item) so an index references the line pointer rather than a byte offset, letting PostgreSQL move a tuple within its page.verified
  • Every heap tuple begins with a fixed 23-byte header (HeapTupleHeaderData) carrying t_xmin, t_xmax, t_ctid, t_infomask, t_infomask2, and t_hoff, typically padded to a 24-byte boundary before the user data starts, so each row costs about 24 bytes of overhead before any column.verified
  • A per-row null bitmap (one bit per column) is present only when the row contains at least one NULL value, so a NULL column stores no data bytes and is recorded merely as a cleared bit.verified
  • Each type has an alignment requirement (8 for bigint/timestamptz/double, 4 for int, 2 for smallint, 1 for bool), and PostgreSQL inserts padding bytes to honor it, so declaration order changes a row's on-disk size; ordering columns widest-alignment first can shrink a row materially (the worked example drops 40 bytes of column data to 24).verified
  • fillfactor defaults to 100 for tables and 90 for B-tree indexes; lowering a table's fillfactor reserves free space per page so an update can place the new tuple on the same page, which is the room HOT updates need to avoid updating every index.verified
  • Because a tuple cannot span pages, PostgreSQL keeps tuples at or below TOAST_TUPLE_THRESHOLD (about 2 kB) by compressing wide varlena columns and, if still too large, moving the largest out of line into an automatically created hidden pg_toast.pg_toast_<oid> table, replacing the in-row value with a TOAST pointer.verified
  • Only variable-length (varlena) types such as text, bytea, jsonb, numeric, and arrays can be TOASTed, because TOAST requires the value's length header to indirect through; fixed-width types like bigint or uuid cannot be pushed out of line.verified
  • Each TOAST-able column has one of four storage strategies set with ALTER TABLE ... SET STORAGE: PLAIN (no compress, no out-of-line), MAIN (compress, out-of-line only as last resort), EXTERNAL (out-of-line, no compression, faster substring/length), and EXTENDED (compress then out-of-line, the default for varlena).verified
  • PostgreSQL 14 added lz4 alongside the historical pglz codec, selectable via default_toast_compression or per column; changing STORAGE or compression affects only values written after the change, so existing rows keep their prior storage until rewritten.verified
  • The Visibility Map holds two bits per heap page, all-visible and all-frozen; the all-visible bit lets VACUUM skip clean pages and enables index-only scans by removing the heap visibility check, and the all-frozen bit lets even the freeze pass skip pages.verified
  • The Free Space Map records approximate free space per heap page so inserts can reuse vacuumed space instead of always appending; it is updated lazily, mostly by VACUUM, and a stale or missing FSM lets old free space go unreused and contributes to bloat.verified
  • Operations that need a toasted column's content (filtering, sorting, grouping, or SELECT *) force a detoast that fetches every TOAST chunk and decompresses the value per row, while projecting only the needed columns fetches a toasted value lazily and can skip the TOAST table entirely.verified
  • pg_column_size reports the on-disk size of a value after TOAST compression including its varlena header (distinct from octet_length's logical length); pg_relation_size measures only the main heap fork while pg_total_relation_size includes indexes and the TOAST table, and the pageinspect extension's page_header and heap_page_items expose raw page offsets and the line-pointer array.verified

Cited sources