Data Types and Schema Design
Picking the right PostgreSQL types and constraints up front, from integer widths and timestamptz to uuidv7 versus bigint identity, jsonb versus columns, alignment-driven column ordering, NULL semantics, identity columns, and the constraint vocabulary that turns the schema into a spec.
Learning outcomes
A schema is the most expensive thing you will ever ship in PostgreSQL. Every later choice, indexes, plans, replication, even how aggressively autovacuum has to work, is shaped by which types you picked and how you ordered them on day one. Most of those choices are cheap to make right and ruinous to change once a table is large, because a wrong type often forces a full table rewrite under an exclusive lock. This page teaches the default choices a senior PostgreSQL engineer reaches for without thinking, and the exact reasons they are defaults.
After studying this page, you can:
- Pick the right numeric, text, and time types for a new column and defend each choice in one sentence.
- Choose between a bigint identity column and a uuidv7 primary key based on workload, and explain why uuidv4 fragments a B-tree.
- Decide when jsonb earns its place and when columns would be the right answer instead.
- Reorder a table’s columns to cut row size through alignment, using the heap-pages-and-toast model as the basis.
- Recognize the schema mistakes that experienced teams keep making, varchar(255), naive timestamps, uuidv4 keys, jsonb-as-schema, and avoid them before they become a migration.
Before we dive in
You should be comfortable with CREATE TABLE, ALTER TABLE, and PostgreSQL’s basic types. It helps to have read the heap-pages-and-toast page, because everything we say about alignment, padding, and on-disk row size is built on that page’s model of the 8 kB page and the 24 byte tuple header. If you have not read it, treat this as the working summary: each PostgreSQL type has an alignment requirement, columns are laid out in declaration order, and the database inserts padding bytes between columns to honour those alignments.
A few terms, defined as we use them. A type family is a group of related types that store the same kind of value at different widths, like smallint, integer, and bigint. A varlena is any variable-length type, such as text or jsonb, that carries its own length header. An identity column is a column whose value PostgreSQL generates from a sequence, the modern replacement for the older serial shortcut. A CHECK constraint is an inline rule the database verifies on every insert or update. Hold onto those four. The rest of the page is built from them.
Mental Model
The wrong model, and almost everyone starts here, is that types are mostly cosmetic: varchar(255) is “a string”, integer is “a number”, timestamp is “a time”, and you can always ALTER your way out later. Under that model you reach for the loosest type that fits today’s value, because tightening it later feels safe.
PostgreSQL does not work that way. The better model is a contract carved into every row on disk. A column’s type sets the bytes it occupies, the alignment padding around it, the operators and indexes available to it, and whether changing it later rewrites the whole table under an AccessExclusiveLock that blocks every reader. varchar(255) and text look interchangeable until you discover the length cap was meaningless, the index it implied does not exist, and shrinking the cap on a billion row table takes a maintenance window. timestamp looks like “a moment in time” until daylight saving day, when half your rows mean one thing and half mean another. A schema decision compounds on every row you will ever write.
Keep this picture. The right type now is almost always cheaper than the right type later, because “later” means rewriting the heap. Once that clicks, the rules below stop feeling like style preferences and start reading as risk management.
Breaking it down
1. Numeric types and the right default integer
Start with the most common column you will ever declare: an integer. PostgreSQL gives you three native integer widths, and they are not interchangeable.
smallint(int2) is 2 bytes, range about negative 32k to 32k. Use it for genuinely small bounded values, like a percentage or a small enum tag, where the saving across many rows matters and you can prove the range will not grow.integer(int4) is 4 bytes, range about negative 2.1 billion to 2.1 billion. It is the right default for most counts, scores, and quantities.bigint(int8) is 8 bytes, range about negative 9.2 quintillion to 9.2 quintillion. It is the right default for surrogate primary keys, monetary amounts in cents, and anything that will grow forever.
The trade-off looks obvious until you join across columns of mismatched widths. If your users.id is bigint (because you used bigserial or bigint generated as identity) but a foreign key column on a child table is integer, the planner inserts an implicit cast on every join, and an index on the narrower side may not be usable for that join. The fix is to make the foreign key the same type as the primary key it references. The simple rule: pick bigint for surrogate primary keys, and use the same width on every foreign key that points at them. The 4 byte saving per row is not worth a join plan disaster on a billion row table.
Then come the decimal types. PostgreSQL has numeric (also spelled decimal), and the floating point pair real (float4, 4 bytes) and double precision (float8, 8 bytes). They solve different problems.
numericis arbitrary precision and exact. It stores the value as a variable-length sequence of base-10000 digits with a scale, so0.1 + 0.2equals exactly0.3. It is slow compared to floats, allocated as a varlena, and the only correct choice for money, accounting, and anything where the user can see rounding.double precisionis IEEE 754 binary floating point, fast and inexact. It is the right choice for scientific quantities, telemetry, and aggregates where a tiny relative error is acceptable.
For currency the rule is firm: use numeric(p, s) with an explicit scale, or store integer minor units in bigint. A bank account in double precision will eventually round in front of a customer, and the bug will be hard to reproduce because the rounding depends on the order of additions. Storing cents in a bigint (so a 10 dollar charge is the integer 1000) is the fastest exact representation; numeric(19, 4) is the standard choice when you need a decimal column the application can read directly.
-- A money column you will not regret in two years.
create table charges (
id bigint generated always as identity primary key,
amount_cents bigint not null check (amount_cents >= 0), -- exact, fast, indexable
currency char(3) not null, -- ISO 4217
charged_at timestamptz not null default now()
);
-- The same value expressed as numeric, when the application needs decimal output.
create table charges_decimal (
id bigint generated always as identity primary key,
amount numeric(19, 4) not null, -- exact, slower
currency char(3) not null
);
2. Text types and the varchar(n) myth
PostgreSQL has three string types, and one of the most persistent myths in our industry is that picking between them matters for performance. It does not.
textis a variable-length string, no declared limit other than the roughly 1 GB hard ceiling on any value.varchar(n)istextplus a length check. PostgreSQL stores both types identically on disk and dispatches the same operators against them. The only behavioural difference is thatvarchar(n)rejects values longer thanncharacters.char(n)(also spelledcharacter(n)) is a fixed-length, space-padded string. Every value is padded with trailing spaces up toncharacters, then PostgreSQL trims them again on read. It is almost always the wrong choice, because the padding wastes bytes and the trimming hides values that look identical but are not.
There is no performance reason to prefer one over another. varchar(255) is the famous example of this confusion: it has no special meaning in PostgreSQL, it does not enable any optimization, and the 255 is a cargo culted byte counter from another database’s storage model. If the business does not actually require a 255 character cap, the cap is fiction that you will eventually exceed.
The rule, then: default to text. Add a length cap only when it is a real domain constraint, and express it as a CHECK constraint so the intent is visible. A CHECK is cheaper to evolve than a type change, because tightening it is a constant-time catalog update plus a validation scan; loosening it is instant.
-- Default to text, add a real length check when the limit is real.
create table users (
id bigint generated always as identity primary key,
email text not null check (length(email) between 3 and 254), -- RFC 5321 cap
display text not null check (length(display) between 1 and 80),
bio text -- no cap, no problem
);
One thing the rule does not say: a CHECK does not stop the column from holding arbitrarily large content if you ever remove the check. If a length cap is a real business invariant, the CHECK is the contract; if it is not, do not write one.
3. Time types: timestamptz is almost always the answer
PostgreSQL has four temporal types you will use, and one of them is a footgun pretending to be the obvious choice.
timestamp with time zone(commonly writtentimestamptz) stores a UTC instant, 8 bytes, no timezone attached. PostgreSQL converts inputs to UTC on write and converts back to the session’s timezone on read.timestamp without time zone(commonly writtentimestamp) stores wall-clock numbers, 8 bytes, with no concept of which timezone they refer to. Two rows with the sametimestampvalue can mean different real moments depending on where each was written.dateis just a calendar date, 4 bytes, no time-of-day.intervalstores a duration (years, months, days, microseconds), 16 bytes. It is the right type for “subscription length”, not “expires at”.
The recommendation is firm: for any column that records when something happened, use timestamptz. The name is misleading: it does not store a timezone, it stores a UTC instant. A timestamp column is correct only when you genuinely mean a wall-clock value that is intentionally not tied to an instant, such as the time on a recurring alarm that should fire at 09:00 local wherever the user is. Almost no production column is that.
The classic outage is to declare created_at timestamp without time zone, ingest “the current time” from application code in different timezones, and then discover that one hour every spring has either two events or zero, and one hour every autumn has rows that no longer sort in causal order. Once you have a year of such data, fixing it requires guessing the timezone of each historical write.
There is one tax to be aware of with timestamptz. Because the value depends on the session’s timezone parameter for display, two readers in different timezones see different strings for the same row. The underlying instant is identical; only the formatting differs. This is a feature, not a bug, but it surprises engineers who expect the database to return “what was written”.
4. Identifiers: bigint identity versus uuid
Now the choice that shapes everything downstream: the primary key. PostgreSQL gives you two strong defaults, and the trade-off is real.
The first is a bigint identity column generated from a sequence. It is 8 bytes, monotonically increasing, and inserts hit the right end of the primary key’s B-tree. The B-tree handles monotonic inserts beautifully: new rows go onto the rightmost leaf page, splits are rare and predictable, and index locality is high so caches stay warm. The downside is that the value is guessable, leaks volume information (an id of 4,837,201 tells competitors how many users you have), and is awkward to generate before the row is inserted, which complicates client-side optimistic flows.
The second is a uuid primary key. A uuid is 16 bytes, twice the width of bigint, and PostgreSQL has a native uuid type with its own operators. But not all UUIDs behave the same way as a key.
uuidv4is fully random. It is the historical default, and on a B-tree primary key it is the wrong choice at scale. Every insert lands at a random point in the tree, so each insert touches a different leaf page. On a hot table this means more leaf pages dirty per transaction, far more page splits, and a primary key index that does not fit in cache because the working set spans the whole index. The damage is silent until the table is big enough that the index outgrows shared buffers, then write throughput collapses.uuidv7is a draft IETF standard that prepends a millisecond Unix timestamp to a random suffix, so values are time-ordered while still globally unique. Inserts hit the right edge of the B-tree the same way a sequence does, so locality is preserved and page splits are rare. As of PostgreSQL 18, the server ships a built-inuuidv7()function; on PostgreSQL 16 and 17 you generate them in the application or with an extension.
Pick based on workload. For most internal tables, bigint generated always as identity is the right default: smallest, fastest, friendliest to the B-tree. When you need globally unique ids generated by clients (microservices that mint ids before talking to the database, mobile clients offline-first, multi-region writers), reach for uuidv7, not uuidv4. The only case for uuidv4 is when you genuinely need unpredictability as a security property, and even then the right answer is usually a separate non-key field.
flowchart TB
Q["New primary key"] --> M{"Need globally unique, client-generated ids?"}
M -->|"no"| B["bigint identity: 8 bytes, monotonic, right-edge inserts"]
M -->|"yes"| U{"Unpredictability required as a security property?"}
U -->|"no"| V7["uuidv7: 16 bytes, time-ordered, preserves locality"]
U -->|"yes"| V4["uuidv4 only as a non-key field, plus a key from one of the choices above"]-- Default for internal tables: bigint identity. Small, hot, B-tree friendly.
create table orders (
id bigint generated always as identity primary key,
customer_id bigint not null references customers(id),
total_cents bigint not null,
placed_at timestamptz not null default now()
);
-- When client-generated ids are needed: uuidv7 keeps insert locality.
-- On PostgreSQL 18, uuidv7() is built in; on 16/17, mint v7 in application code.
create table events (
id uuid primary key, -- caller passes a v7
tenant_id bigint not null references tenants(id),
payload jsonb not null,
occurred_at timestamptz not null default now()
);
5. Booleans, enums, and small lookup sets
A boolean column is one of the few PostgreSQL types that is simply itself: boolean is 1 byte, holds true, false, or NULL, and is the right answer for any genuine yes-or-no column. The only subtlety is that a NULL boolean propagates: is_admin AND NOT is_banned returns NULL (not false) when either column is NULL, which means a security check written that way silently passes the row to the result. The fix is to write defaults that exclude NULL (is_admin boolean not null default false).
When you have a small fixed set of string-valued choices (“draft”, “published”, “archived”), PostgreSQL offers three reasonable answers, and you should pick deliberately.
- Native
ENUMis a compact, ordered, named type. Values are stored as a 4 byte oid, sorted in declaration order, and the type is reusable across columns. The historical downside is that adding or reordering values used to require a carefulALTER TYPE, andALTER TYPE ... ADD VALUEcould not run inside a transaction block before PostgreSQL 12. As of PostgreSQL 16/17, adding a value is cheap and transactional, but removing or renaming a value still requires the same care: you must rewrite affected rows. textwith aCHECKconstraint stores the literal string. It is the cheapest to evolve: changing the allowed set means dropping and recreating one constraint (validated in the background withNOT VALIDplusVALIDATE CONSTRAINT). The trade is more bytes per row (the string itself) and a constraint check on every write.- A small lookup table with a foreign key. This is the most relational answer: a
statusestable with id and label, and anintorsmallintforeign key on every row. It is the best fit when you have additional attributes per choice (a sort order, a localized label, an archived flag).
The senior heuristic: native ENUM is the right default for a small, stable, performance-sensitive set; text plus CHECK is right when the set evolves often; the lookup table is right when the choice has its own attributes. Do not reach for ENUM if you expect to remove values regularly; do not reach for text plus CHECK if storage on a billion row table matters; do not reach for the lookup table if the join hurts more than it helps.
-- Native ENUM: compact and ordered.
create type post_status as enum ('draft', 'published', 'archived');
create table posts (
id bigint generated always as identity primary key,
status post_status not null default 'draft'
);
-- text + CHECK: cheap to evolve.
create table posts_text (
id bigint generated always as identity primary key,
status text not null default 'draft'
check (status in ('draft', 'published', 'archived'))
);
6. jsonb, arrays, domains, and composites
PostgreSQL gives you four types that look like they let you escape “a schema is a contract”, and each of them earns its place only in narrow circumstances.
jsonb is the headline feature. It stores JSON in a decomposed binary form, which means PostgreSQL parses the structure once at write time and stores it as a tree of nested objects, arrays, numbers, and strings. Operators include containment (@>), key existence (?), and path navigation (->, ->>, #>), and you can build a GIN index on a jsonb column to make those operators fast. There is also json, the older type, which stores the exact original text and reparses it on every operation. The choice is easy: use jsonb unless you genuinely need to round-trip the original JSON byte for byte (which is rare).
jsonb is the right answer when the shape of the document is genuinely open: customer-defined fields, third-party webhook payloads, audit logs of arbitrary events. It is the wrong answer when the shape is known and stable, because everything you can do with jsonb you can do better with columns. A column has a type and a NULL-or-not-NULL contract; a jsonb field has neither. A column gets a B-tree index for free; a jsonb field needs an expression index or a GIN index with jsonb_path_ops. A column shows up in autocomplete and is queryable by name; a jsonb field is invisible until you know the path.
The trap is jsonb-as-schema: a payload jsonb column that ends up holding customer_id, amount, status, and every other field the team did not bother to declare. Two years in, the team has invented a worse query language to interrogate fields that should have been columns, the planner has no statistics on individual paths so estimates are wrong, and changes to the shape have no migration history. If you would name the field on a whiteboard, it should be a column.
flowchart TB
F["A field you want to store"] --> K{"Known and stable shape?"}
K -->|"yes"| C["Column with the right type, NOT NULL, and CHECK as needed"]
K -->|"no"| O{"Are you ever the writer?"}
O -->|"yes, you control the shape"| N["Normalize before it ships: promote known fields to columns"]
O -->|"no, shape is genuinely open"| J["jsonb, plus a GIN index using jsonb_path_ops for containment queries"]For arrays, PostgreSQL supports native array types of any base type: text[], int[], uuid[], all addressable with arr[1] (1-indexed), arr @> ARRAY[...] for containment, and unnest() to expand them into rows. Arrays earn their place for small bounded lists that are read and written together as a unit: a row’s set of tags, the few permissions on a record, a fixed-length vector of feature scores. They are the wrong answer when the list is large (write amplification: any change rewrites the whole array), or when individual elements need their own attributes or independent updates. The relational answer (a child table) wins whenever the array would have its own queries.
Domain types are a CREATE DOMAIN shorthand: a named alias over an existing type plus constraints. A create domain email_address as text check (...) lets you reuse one definition across many columns. They are useful when the constraint is non-trivial and shared by many tables; they are over-engineering when the constraint is one line of CHECK.
Composite types are user-defined row types you can declare with CREATE TYPE ... AS (...) and use as a column. They are occasionally the right answer for tightly coupled data (a point’s (x, y), a money column’s (amount, currency) pair), but they cannot have constraints or indexes of their own, and most teams should reach for two normal columns instead.
-- jsonb shines when the shape is genuinely open.
create table webhook_events (
id bigint generated always as identity primary key,
source text not null,
payload jsonb not null,
received_at timestamptz not null default now()
);
-- A GIN index over jsonb_path_ops accelerates containment queries cheaply.
create index on webhook_events using gin (payload jsonb_path_ops);
-- Arrays for small bounded lists read and written together.
create table articles (
id bigint generated always as identity primary key,
title text not null,
tags text[] not null default '{}' check (cardinality(tags) <= 16)
);
7. Alignment and column order: rewriting the same row smaller
Now the operational rung that turns a schema review into reclaimed disk. As the heap-pages-and-toast page covers in depth, each PostgreSQL type has an alignment requirement (the byte boundary its value must start on), and PostgreSQL inserts padding bytes between columns to honour those alignments. The padding is dead weight, stored on every single row, and the amount of it depends on the declaration order of the columns.
The alignments to remember:
- 8 bytes:
bigint,timestamptz,timestamp,double precision,bigserial, the newuuid(treated as 8-aligned in practice). - 4 bytes:
integer,date,real,oid. - 2 bytes:
smallint. - 1 byte:
boolean,"char". - Variable, internal:
text,varchar,jsonb,numeric, arrays. These are varlena, and their alignment depends on whether they have a long or short header, but for layout purposes treat them as needing 4-byte alignment at the start.
The rule that follows: order columns from widest alignment to narrowest, with the variable-length columns at the end. The reason is mechanical, not stylistic. Each column is placed right after the previous one, and PostgreSQL bumps the offset forward to honour the alignment. If a 1 byte boolean precedes an 8 byte bigint, the bigint cannot start at offset 1; it needs offset 8, so 7 bytes are burned as padding. Reverse the order, and the saving falls straight to disk.
Make it concrete. Consider a table that holds an event id, a flag, a count, and a timestamp.
-- Naive declaration order: interleaved widths, alignment fought at every step.
create table event_log_bad (
id bigint, -- 8 bytes, needs 8-alignment
flag boolean, -- 1 byte
count integer, -- 4 bytes, needs 4-alignment
created timestamptz -- 8 bytes, needs 8-alignment
);
Walk the layout after the 24 byte tuple header (the header is already 8-aligned, so column packing starts at a fresh 8 boundary). id takes offsets 0 through 7. flag takes offset 8, 1 byte. count (4-align) cannot start at offset 9; it needs offset 12, so 3 bytes are padding, then count occupies 12 through 15. created needs 8-alignment; offset 16 is already 8-aligned, so it takes 16 through 23. Column data ends at offset 24. Total row data: 24 bytes including 3 bytes of padding. Plus the 24 byte header, the row is 48 bytes.
Now reorder to put the widest columns first, then the 4 byte, then the 1 byte.
-- Widest first, then narrow. No internal padding.
create table event_log_good (
id bigint, -- 8 bytes
created timestamptz, -- 8 bytes
count integer, -- 4 bytes
flag boolean -- 1 byte
);
id takes 0 through 7, created takes 8 through 15, count takes 16 through 19, flag takes 20. Data ends at offset 21, then the whole tuple pads to MAXALIGN, that is 24 bytes. Column data: 24 bytes including 3 bytes of tail padding only. The savings on this small table look minor, but the principle scales: on a table with several 8 byte columns interleaved with 1 byte and 2 byte columns, ordering them right cuts row size by 10 to 30 percent across every single row.
You prove the win before migrating by comparing pg_column_size(t.*) on a sample row in each order. Column order is fixed at table creation, so applying the change to an existing table means a rewrite, either via a new table plus INSERT ... SELECT, or via pg_repack. Plan the rewrite during a quiet period; an ordinary ALTER TABLE that adds a column does not let you rearrange existing ones.
8. NULL, generated columns, and identity behaviour
A few features of the type system shape schema design more than people credit them for.
NULL is cheap on PostgreSQL. The page covers this in heap-pages-and-toast, but the consequence for schema design is worth stating plainly: a per-row null bitmap exists only when the row contains at least one NULL column. When present, the bitmap is one bit per column, and a NULL value stores zero data bytes. So a hundred-column table where most columns are NULL on most rows costs roughly the header plus the bitmap plus the few non-null values, not a hundred slots. Width on paper is not width on disk.
The trade-off is logical, not physical. NULL is the SQL three-valued-logic value, and it propagates: 5 = NULL is NULL, not false. count(*) counts all rows including those with NULL, but count(col) skips NULL rows. WHERE col != 'x' excludes rows where col is NULL. The rule: make a column NOT NULL whenever the business does not have a “we do not know” state, because the cost of forgetting which way three-valued logic falls is much higher than the cost of choosing a default.
Generated columns are columns whose value is computed from other columns in the same row. They have two flavours.
STORED(available since PostgreSQL 12) materializes the value at write time and stores it on disk. You can index it like any other column, and queries read it without recomputation. The cost is the disk it occupies and the write cost of recomputing on update.VIRTUAL(available in PostgreSQL 18 as the default forGENERATED, not in 16 or 17) computes the value at read time. It stores nothing on disk, but the expression runs on every read and the column cannot be indexed directly (you index the underlying expression instead).
Use STORED for derived values you read often and write rarely, or for values that must be indexed (a normalized email, a JSON-extracted user_id). Use VIRTUAL for values that are cheap to recompute and you do not want to pay storage for.
-- A stored generated column the application can index, no triggers required.
create table users_norm (
id bigint generated always as identity primary key,
email text not null,
email_lower text generated always as (lower(email)) stored
);
create unique index on users_norm (email_lower);
Identity versus SERIAL. The old serial and bigserial shortcuts expand to a column with a default that calls nextval() on a sequence, plus implicit ownership of the sequence. They have always worked, but they have two operational warts: the column’s default is mutable so an application can override it (which then leaves the sequence behind, and the next default-driven insert raises a duplicate key error), and they predate the SQL standard’s identity syntax.
The modern replacement is GENERATED ALWAYS AS IDENTITY (or GENERATED BY DEFAULT AS IDENTITY). It binds the sequence to the column at the catalog level, and GENERATED ALWAYS rejects user-supplied values unless the writer explicitly says OVERRIDING SYSTEM VALUE. That single guard prevents the most common cause of “duplicate key” surprises in production: an ETL job that inserts a literal id, advancing past the sequence’s next value.
-- The modern primary key. ALWAYS rejects user-supplied ids by default.
create table customers (
id bigint generated always as identity primary key,
name text not null
);
9. Constraints, normalization, and the failure modes experts hit
PostgreSQL is unusually rich in constraints, and a senior schema uses them as the spec. Each one is a contract the database verifies on every write, so once the rule is in the catalog, no application can violate it and no migration can silently drop it.
The constraints to know:
NOT NULLis the cheapest and most overlooked. It is a column attribute, not a constraint row, and the planner uses it for stronger inferences (IS NULLagainst aNOT NULLcolumn is always false, no scan needed).CHECKis an arbitrary expression that must evaluate true for every row. Use it for range bounds (amount_cents >= 0), enum-like sets, and cross-column invariants (start_at <= end_at). Adding aCHECKon a large table is fast if you add itNOT VALIDand thenVALIDATE CONSTRAINTin a separate step that does not hold anAccessExclusiveLock.UNIQUEis enforced by a unique B-tree index. A partial unique index (CREATE UNIQUE INDEX ... WHERE) lets you enforce uniqueness only on the rows that should be unique, like “one active subscription per customer”.FOREIGN KEYenforces referential integrity. It costs a lookup on the parent on every insert and update, and a lookup or scan on the child on every delete or update of the parent. Always index the child column; PostgreSQL does not do it for you, and the missing index is one of the top causes of slow deletes on large tables.EXCLUDEconstraints generalize uniqueness to “no two rows overlap by some operator”. The classic use is atsrangecolumn withEXCLUDE USING gist (room_id WITH =, during WITH &&), which enforces that no two bookings for the same room overlap in time. Nothing else in SQL expresses range non-overlap natively.
-- A partial unique index: at most one active subscription per customer.
create unique index one_active_subscription_per_customer
on subscriptions (customer_id)
where status = 'active';
-- An EXCLUDE constraint: no two bookings overlap in the same room.
create extension if not exists btree_gist;
create table bookings (
id bigint generated always as identity primary key,
room_id bigint not null references rooms(id),
during tsrange not null,
exclude using gist (room_id with =, during with &&)
);
On normalization, the senior heuristic is: start at third normal form (3NF) and denormalize only with measurement and a maintenance plan. 3NF means no transitive dependencies (an orders table does not carry the customer’s address; it carries a customer_id, and the address lives on the customer row). It is the default because it eliminates update anomalies: changing the customer’s address updates one row, not every order. The cost is the join, which on a properly indexed schema is cheap until it is not.
Denormalize when measurement proves the join cost matters, the underlying data is genuinely stable, and you can write the maintenance code (a trigger, a scheduled job, a materialized view refresh) that keeps the denormalized copy in sync. A “we denormalized for speed” comment without a maintenance plan attached is a future bug.
Two tracking columns are nearly universal and worth budgeting for. created_at timestamptz not null default now() and updated_at timestamptz (set by a trigger, or by the application on every write) cost one or two timestamps per row and pay for themselves the first time you need to debug “when did this change?”. A soft-delete column (deleted_at timestamptz, with a partial index that excludes soft-deleted rows) is the gentler alternative to a hard DELETE when references matter.
Mastery Questions
-
You are designing the primary key for a high-write events table that ingests one million events per second across many regions and is expected to reach a trillion rows. A teammate proposes
id uuid default gen_random_uuid(). What is the concrete physical problem, and what would you propose instead?Answer.
gen_random_uuid()returns a uuidv4, which is fully random. Every insert lands at a random point in the primary key’s B-tree, so each insert dirties a different leaf page, page splits happen constantly, and the working set of the index spans the whole tree. Once the index outgrows shared buffers, every insert becomes an I/O-bound page fetch, and the write throughput plateau is hard to escape because the cost is invisible to EXPLAIN. There are two reasonable proposals depending on whether you need globally unique ids. If a single primary writer is acceptable, usebigint generated always as identity: 8 bytes per key, monotonic, fits the right edge of the B-tree, and leaves the whole index largely cache-resident. If client-generated ids are required (multi-region writers, offline-first clients), useuuidv7instead ofuuidv4: 16 bytes per key but time-ordered, so inserts hit the right edge of the B-tree the same way a sequence does. PostgreSQL 18 has a built-inuuidv7(); on 16 and 17 you mint the value in the application. The general principle to internalize: random ids on a B-tree primary key destroy index locality at scale; time-ordered ids preserve it. -
A teammate proposes
varchar(255)for every text column “because it gives us a sensible cap and a small performance edge over text”. Walk them through whatvarchar(255)actually is in PostgreSQL, and propose the rule you would apply instead.Answer. PostgreSQL stores
varchar(n)andtextidentically on disk and dispatches the same operators against them. The only behavioural difference is thatvarchar(n)performs a length check on insert. There is no performance edge, no storage difference, no index advantage. Thevarchar(255)convention is a cargo culted byte counter from another database’s storage model and means nothing in PostgreSQL. The proposal also encodes a fiction: 255 is rarely a real business constraint, so the cap is either ignored (and someone bumps into it years later) or enforced and then exceeded (and someone runs anALTER TYPEthat rewrites the heap). The rule to apply instead: default totext, and add aCHECK (length(col) between min and max)only when the limits are real business constraints. TheCHECKis easier to evolve (loosen instantly, tighten withNOT VALIDthenVALIDATE CONSTRAINT) and the intent is visible in the schema. -
You inherit a
userstable declared as(is_active boolean, id bigint, login_count integer, created_at timestamptz, is_premium boolean)with 800 million rows. Storage and cache pressure are both painful. Without dropping a column or changing a type, what would you do, why does it work, and how would you prove the win before the migration?Answer. Reorder the columns by alignment, widest first:
(id bigint, created_at timestamptz, login_count integer, is_active boolean, is_premium boolean). As the heap-pages-and-toast page covers, PostgreSQL inserts padding bytes between columns to honour each type’s alignment, and the padding is fixed at table creation in declaration order. In the original layout,id(8-aligned) cannot start right after the 1-byteis_active, so 7 padding bytes are burned;created_atsimilarly takes padding afterlogin_count. Reordering puts the two 8-byte columns first (offsets 0 to 7 and 8 to 15), then the 4-byte integer (16 to 19), then the two 1-byte booleans (20 and 21), with only tail MAXALIGN padding remaining. The win is the same data, the same constraints, smaller rows, so more rows per cache page and fewer pages touched on every scan. You prove the win before the migration by building one row in each order on a scratch table and comparingpg_column_size(t.*): that returns the exact on-disk size of the row including the 24-byte tuple header, and gives the per-row saving you can multiply by 800 million to size the migration. The migration itself is a table rewrite (a new table plusINSERT ... SELECT, orpg_repackfor an online rewrite), because column order is fixed at creation.
Sources & evidence15 claims · 3 cited
Type behaviour, varlena storage, generated columns, identity, jsonb operators and GIN indexing, EXCLUDE constraints, and alignment grounded in the PostgreSQL data types, storage, indexes, and B-tree docs. Facts none of those four cover (uuidv7 time-ordering and its effect on B-tree locality, GENERATED ALWAYS AS IDENTITY rejecting user-supplied values, IDENTITY versus SERIAL operational warts, VIRTUAL versus STORED availability windows) are marked stable-common-knowledge with empty source_ids.
- PostgreSQL has three native integer widths, smallint (2 bytes), integer (4 bytes), and bigint (8 bytes), and numeric is an arbitrary-precision exact decimal stored as a variable-length sequence of base-10000 digits so 0.1 + 0.2 equals exactly 0.3, while double precision is IEEE 754 binary floating point that is fast but inexact.verified
- Currency should be stored either as numeric with an explicit scale or as a bigint count of minor units (cents); storing money in double precision eventually produces visible rounding errors because the result depends on the order of additions.verified
- PostgreSQL stores text and varchar(n) identically on disk and dispatches the same operators against them; varchar(n) only adds a length check on insert, so there is no performance reason to prefer it over text, and char(n) is space-padded to its declared length with trailing spaces trimmed on read.verified
- timestamp with time zone (timestamptz) stores an 8-byte UTC instant by converting inputs to UTC on write and back to the session timezone on read; timestamp without time zone stores 8 bytes of wall-clock numbers with no timezone attached, so two rows with the same value can refer to different real moments.verified
- uuidv4 is fully random and inserts land at random points in a B-tree primary key, so each insert dirties a different leaf page, page splits are frequent, and the working set spans the whole index; uuidv7 prepends a millisecond Unix timestamp to a random suffix, so inserts hit the right edge of the B-tree like a sequence does and the hot leaf pages stay cached.stable common knowledge
- PostgreSQL has a native fixed-width 16-byte uuid type with its own operators and B-tree support, and as of PostgreSQL 18 the server ships a built-in uuidv7() generator; on PostgreSQL 16 and 17 v7 values must be minted in the application or via an extension.stable common knowledge
- PostgreSQL native ENUM types store each value as a 4-byte oid sorted in declaration order; ALTER TYPE ... ADD VALUE became transactional and inexpensive in PostgreSQL 12, but removing or renaming a value still requires rewriting affected rows because the on-disk representation references the value by its position in the type.verified
- jsonb stores JSON in a decomposed binary form parsed once at write time, supports containment (@>), key existence (?), and path navigation operators, and can be GIN-indexed (often with the jsonb_path_ops opclass for smaller and faster indexes against containment queries); the older json type stores the exact original text and reparses on every operation.verified
- Arrays are stored as a single varlena, so updating any element rewrites the entire array; they earn their place for small bounded lists read and written together (a row's tags) and lose to a child table once individual elements need their own queries or independent updates.verified
- Each PostgreSQL type has an alignment requirement (8 for bigint/timestamptz/double precision, 4 for integer/date, 2 for smallint, 1 for boolean) and the database inserts padding bytes between columns to honour it, so the declaration order of columns changes the on-disk row size; ordering columns from widest alignment to narrowest, with varlena columns last, eliminates internal padding.verified
- A per-row null bitmap (one bit per column) is present only when the row contains at least one NULL value, so a NULL column stores no data bytes and is recorded merely as a cleared bit, making mostly-NULL wide tables cheap on disk.verified
- STORED generated columns (available since PostgreSQL 12) materialize their value at write time and can be indexed directly; VIRTUAL generated columns compute on read, store nothing on disk, and became available as the default GENERATED kind in PostgreSQL 18, so on PostgreSQL 16 and 17 only STORED is available.stable common knowledge
- GENERATED ALWAYS AS IDENTITY binds a sequence to the column at the catalog level and rejects user-supplied values unless the writer specifies OVERRIDING SYSTEM VALUE, preventing the classic SERIAL failure where a literal-id INSERT advances past the sequence's next value and the next default-driven insert raises a duplicate key error.stable common knowledge
- EXCLUDE constraints generalize uniqueness to 'no two rows overlap by some operator'; with the btree_gist extension, EXCLUDE USING gist (room_id WITH =, during WITH &&) on a tsrange column enforces that no two bookings for the same room overlap in time, an invariant no other standard SQL construct expresses natively.verified
- Adding a CHECK or FOREIGN KEY constraint on a large table with the NOT VALID clause skips the full-table validation scan and only enforces the constraint on new writes; a subsequent VALIDATE CONSTRAINT then verifies existing rows without holding an AccessExclusiveLock, so a constraint can be added on a billion-row table without blocking writers.verified
Cited sources
- PostgreSQL Documentation: Data Types · PostgreSQL Global Development Group
- PostgreSQL Documentation: Indexes · PostgreSQL Global Development Group
- PostgreSQL Documentation: Database Physical Storage (Page Layout and TOAST) · PostgreSQL Global Development Group