Read Scaling and Caching

How to scale PostgreSQL reads horizontally: the ladder from shared buffers through replicas, materialized views, and external caches, with routing, read-your-writes defenses, invalidation strategies, and the failure modes that look like database problems but are really caching mistakes.

Learning outcomes

PostgreSQL scales writes vertically and reads horizontally, but only if you give the read path more places to live than the primary. The trick to running a hot read-heavy workload is not picking the perfect cache. It is laying out a ladder of progressively cheaper-and-staler stores, and being honest about which rung answers each query. This page makes that ladder concrete.

After studying this page, you can:

  • Name the rungs of the read-scaling ladder from in-process memory to the edge, and decide which rung answers a given query class.
  • Route reads to streaming replicas without serving stale data to the user who just wrote it.
  • Choose between a materialized view, an external cache, and a covering index for a precomputed result, and refresh each safely.
  • Pick cache-aside, write-through, or write-behind for a workload, and design invalidation that does not silently lie.
  • Spot the failure modes that look like a database problem but are really a caching mistake: thundering herds, OFFSET pagination across replicas, the read-your-writes trap, and one hot replica while the others idle.

Before we dive in

You should already be comfortable with two ideas this page builds on. The first is streaming replication: the primary ships its write-ahead log to standbys that replay it, as the streaming-and-synchronous-replication page covers. The second is connection pooling: a transaction-mode pooler in front of PostgreSQL multiplexes thousands of client connections onto a small backend pool, as the connection-pooling-at-scale page covers. We will route reads through both of those.

A few vocabulary anchors before we start. A read replica is a hot standby running in recovery, accepting SELECT only. Replication lag is the wall-clock or byte gap between the primary’s current write-ahead-log position and the standby’s replay position. A cache is any store that answers a query faster than recomputing it from base tables. Staleness is the time a cached answer can be wrong by, measured against the source of truth. Invalidation is the act of telling a cache that an entry is no longer valid. Keep those five in mind. Everything below is a different point on the tradeoff between freshness and cost.

Mental Model

The tempting wrong model is that caching is one decision: pick Redis, put it in front of PostgreSQL, write some invalidation, ship it. Under that model “scaling reads” means “I added a cache.”

The better model is a ladder with a price tag on every rung. Each rung answers reads faster than the one below it, costs more operational pain, and tells more lies about how fresh the data is. The shared buffer pool inside PostgreSQL answers in microseconds with perfect freshness. A read replica answers in milliseconds with seconds of lag. A materialized view answers a hard aggregate in milliseconds but is stale until you refresh it. Redis answers in microseconds across the network with whatever staleness your invalidation lets through. A CDN answers in tens of microseconds at the user’s region with minutes of staleness.

You do not pick one rung. You place each read on the cheapest rung that is still fresh enough for that read, and you keep climbing only when the rung below cannot meet your latency budget. The art is matching the query class to the rung, not buying a bigger Redis. Once that picture clicks, every failure mode on this page becomes a story of a read landing on the wrong rung.

Breaking it down

1. The ladder of read scaling: from shared buffers to the edge

Start with the full ladder, because every later rung is meaningful only against the one below it. From cheapest and freshest to widest and stalest, the rungs are:

  1. The PostgreSQL shared buffer pool plus the OS page cache. Pages of tables and indexes are in process memory, on the same host as the executor. Answers in microseconds. Always fresh.
  2. Index-only and covering scans. The query never touches the heap because the index already carries the columns. Effectively a denormalized in-database cache, kept consistent for free.
  3. Materialized views and summary tables. SQL precomputed and stored. Stale until refreshed. The tuning-memory-for-throughput page covers how much of these stay resident in RAM.
  4. Read replicas. Streaming standbys serving SELECT. Lag is seconds, sometimes more under load.
  5. External caches such as Redis or Memcached. Across a network hop, faster than a replica for point lookups, but lying until invalidated.
  6. CDN and app-level memoization. Per-region or per-process. Stalest, widest, cheapest per request.
flowchart TB
    A[Shared buffers and OS page cache] --> B[Index-only and covering scans]
    B --> C[Materialized views and summary tables]
    C --> D[Read replicas]
    D --> E[External caches: Redis, Memcached]
    E --> F[CDN and app-level memoization]

Read the ladder the way an executor reads a plan: bottom up for cost, top down for freshness. The right question is never “should I add a cache?” It is “for this query class, what is the highest rung that is still fresh enough?” A profile view tolerates ten seconds of staleness, so it lives happily on a replica or in Redis. A bank balance after a transfer cannot be stale by a single statement, so it stays on the primary. A landing-page leaderboard tolerates an hour of staleness, so it belongs in a materialized view or on the CDN.

Pick a rung by query class
Tolerance: zero staleness in the user's own session. Rung: PostgreSQL primary, or a replica with read-your-writes enforcement (rung 4 with the routing in section 4). External caches and CDNs are wrong answers.

The rest of the page walks the ladder rung by rung. We will not skip a rung, because the failure modes on the higher rungs are usually a misuse of a lower one.

2. The cheapest cache: index-only scans and the buffer pool

The first rung is invisible because it has no separate operational story: PostgreSQL already caches. The shared_buffers setting reserves a chunk of RAM (commonly 25 percent of host memory) as a page cache that lives inside the server process. The kernel’s own page cache sits behind it, so a page evicted from shared_buffers may still be a memcpy away in the operating system. The tuning-memory-for-throughput page covers the sizing math; here, the point is that this rung is free, fresh, and large enough on a well-tuned host to absorb the working set of most workloads.

The cache hit ratio tells you whether your working set fits. It is not a perfect metric, but it is the first one to look at:

select datname,
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2) as hit_pct
  from pg_stat_database
 where datname = current_database();

A hit percentage above 99 on a busy OLTP database is normal and good. Below 95 means the working set is spilling to disk, and adding a replica or a Redis will not fix the root cause: your hot pages do not fit in memory.

The second sub-rung is the index-only scan, which is “the cache built into PostgreSQL” for read-heavy queries. When an index covers every column a query needs (either because all the predicates and projections live in the key, or because the columns are added via INCLUDE), the executor reads the index and never visits the heap. A point lookup that would touch two pages now touches one. As the advanced-indexing-techniques page covers, a covering index built with INCLUDE keeps the index keyset narrow for fast traversal while still answering the query without a heap fetch:

create index concurrently on orders (customer_id, created_at desc)
    include (status, total_cents);

-- Query plan: Index Only Scan, Heap Fetches: 0
explain (analyze, buffers)
  select status, total_cents
    from orders
   where customer_id = $1
   order by created_at desc
   limit 20;

Before reaching for any external cache, look for queries that could become index-only scans. The cheapest cache is the one PostgreSQL already maintains, and it is the only rung with no consistency story to write.

PostgreSQL’s own buffer-replacement policy is not a plain LRU. It is a clock-sweep with per-buffer usage counts: each access increments a counter (capped at 5), and the eviction pass decrements counters as it walks, evicting the first buffer whose count hits zero. The effect is closer to LFU than LRU: a buffer hit many times survives a flood of one-shot pages from a sequential scan. The practical consequence is that large sequential scans do not flush the cache the way they would in a strict LRU.

3. Read replicas: throughput at the cost of freshness

The next rung up is the read replica. The streaming-and-synchronous-replication page covers how the primary ships its write-ahead log and the standby replays it; here, the lens is “how do I send the right reads to this thing?” A hot standby in recovery accepts SELECT statements, refuses writes (they fail with cannot execute UPDATE in a read-only transaction), and lags the primary by however long it takes to ship and replay the WAL.

A healthy replica lags by milliseconds. A replica under stress (a long autovacuum on the primary, a slow disk, a heavy reporting query holding a snapshot) can lag by seconds or minutes. Measure it from the primary:

select application_name,
       state,
       sync_state,
       pg_wal_lsn_diff(sent_lsn,    replay_lsn)   as replay_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) as total_lag_bytes
  from pg_stat_replication;

And from the standby:

select pg_is_in_recovery(),
       now() - pg_last_xact_replay_timestamp() as time_lag,
       pg_last_wal_replay_lsn();

The fundamental rule of read replicas is the lag is a probability distribution, not a constant. Most reads see a millisecond of lag. Some reads, exactly when you cannot afford them, see a minute. Build for the tail.

sequenceDiagram
    participant App
    participant Primary
    participant Replica
    App->>Primary: UPDATE accounts SET balance = balance - 100 WHERE id = 1
    Primary-->>App: COMMIT (LSN = 0/1A2B3C4D)
    Primary->>Replica: WAL records (async, takes 30 ms)
    App->>Replica: SELECT balance FROM accounts WHERE id = 1
    Note over Replica: still at LSN 0/1A2B0000, has not seen the UPDATE
    Replica-->>App: 500 (stale read)

You will get reads like the last line of that diagram in production unless you do something about it. The next rung is exactly that defense.

4. The read-your-writes trap and how to defuse it

The single most common bug introduced by read replicas is the read-your-writes violation. A user updates their profile, the page refreshes, the refresh SELECT lands on a replica that has not yet replayed the update, the user sees their old name, and now they think your service ate their write. There is no error in any log. The data is fine. The routing is wrong.

You have three usable defenses, in order of operational simplicity.

Stickiness on commit. After a writing transaction commits, route that user’s reads to the primary for N seconds (often 5 to 30, longer than your worst routine lag). A flag in the session store or a header in the response carries “this user wrote at time T” forward; the router checks “now minus T less than the stick window” and picks the primary or a replica accordingly. Simple, effective, costs you a small bump in primary load for active users.

LSN gating, the precise version. On commit, the primary returns the commit’s log sequence number (LSN). The application carries that LSN forward (in a cookie, a session, or a request header). On a subsequent read, the router picks a replica only if that replica’s replay position is at or past the carried LSN. PostgreSQL gives you the primitives:

-- On the primary, right after commit:
select pg_current_wal_lsn();          -- e.g. 0/1A2B3C4D

-- On a candidate replica, before serving a read for that user:
select pg_last_wal_replay_lsn() >= '0/1A2B3C4D' as caught_up;

-- Or poll in a tight loop with a timeout, on any PostgreSQL version:
do $$
declare
  deadline timestamptz := clock_timestamp() + interval '50 ms';
begin
  while clock_timestamp() < deadline loop
    exit when pg_last_wal_replay_lsn() >= '0/1A2B3C4D';
    perform pg_sleep(0.001);
  end loop;
end $$;

This is the textbook “monotonic reads” trick. It costs a little code and gives you exactness: a session never sees a snapshot older than its own last write, period.

Tag the query. When neither stickiness nor LSN gating is available (a legacy ORM, a third-party SDK), an explicit hint in the query routes it: a comment such as /* primary */ on the SQL or a connection-string tag tells the router “send this one to the primary, lag is unacceptable here.” It is the worst defense because it depends on every author remembering, but it is sometimes the only one you can deploy quickly.

Cache-aside read with LSN gating
User commits a writeApplication writes through the primary. On commit it asks the primary for pg_current_wal_lsn() and stores it on the session: lastWrite = 0/1A2B3C4D.
Step 1 of 5

The third option, “send everything to the primary after a write,” is the brute-force version of stickiness. It works, but it concentrates load you wanted to shed.

5. Routing reads: pools, proxies, and per-query hints

Read routing happens at three plausible layers, and the right design picks one as the source of truth.

Application code. Many ORMs ship with a read/write split: Rails’ multi-database support, ActiveRecord’s connected_to(role: :reading), Django’s database routers, Node’s Sequelize replication option, Java’s Spring @Transactional(readOnly = true). The application keeps two connection pools (primary and replica) and decides per query. Pros: the application knows whether a query is part of a write workflow. Cons: every team writing a new service has to do the same plumbing.

A query-aware proxy. Tools such as PgBouncer + HAProxy, pgcat, Odyssey, and AWS RDS Proxy split reads from writes at the connection layer. pgcat, for example, parses the SQL and routes SELECT to a replica pool and writes to the primary pool, on a single client connection. As the connection-pooling-at-scale page covers, transaction-mode pooling shapes which session features (prepared statements, SET LOCAL, advisory locks) are safe on top.

DNS or a per-environment endpoint. A separate db-reader.example.com resolves to a load balancer in front of the replicas, and the primary lives at db-writer.example.com. The application picks the endpoint per query class. Crude but effective for batch jobs and read-only services.

Whichever you pick, the rules are the same. A transaction is sticky. Once a transaction starts on the primary, every statement in it goes to the primary. Once a transaction starts on a replica, every statement in it goes to that same replica (a different replica would have a different snapshot). Reads with side effects go to the primary. A SELECT ... FOR UPDATE is a write; a function call that issues an INSERT internally is a write. The proxy needs to know.

flowchart LR
    A[App or ORM] -->|writes| B[PgBouncer primary pool]
    A -->|reads with LSN gate| C[Proxy: pgcat or HAProxy]
    C -->|caught up| D[Replica 1]
    C -->|caught up| E[Replica 2]
    C -->|none caught up| B
    B --> F[PostgreSQL primary]
    D --> G[PostgreSQL replica 1]
    E --> H[PostgreSQL replica 2]

One more failure mode worth naming. If your routing is “round-robin across replicas,” one bad query (a sequential scan that holds locks, a runaway sort that fills work_mem) can pin one replica to one user’s session and make it look healthy to the load balancer while it serves nothing useful. Pair routing with a real query-latency check, not just TCP health.

6. Materialized views: precomputing the expensive query

A materialized view is a stored SELECT: the query runs once, the result is written to a table-shaped object, and subsequent reads scan that table instead of recomputing the query. It is the right rung for an expensive aggregate that does not need to be exact to the millisecond.

create materialized view daily_revenue as
  select date_trunc('day', created_at) as day,
         currency,
         sum(amount_cents)              as revenue_cents,
         count(*)                       as order_count
    from orders
   where status = 'paid'
   group by 1, 2;

create unique index on daily_revenue (day, currency);

Two refresh modes exist, and the difference matters in production.

REFRESH MATERIALIZED VIEW daily_revenue rebuilds the view’s contents and takes an ACCESS EXCLUSIVE lock on the view for the duration. Reads block. On a view that takes 30 seconds to rebuild, every query that touches it pauses for 30 seconds. This is almost never what you want in an online service.

REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue rebuilds the view in a scratch table, then diffs against the old contents and applies inserts, updates, and deletes under a much weaker lock that lets reads proceed throughout. It requires a UNIQUE index on the view (so the diff can match rows), runs slower than the blocking form, and writes more WAL. Use it in production; use the blocking form only in batch windows where reads are paused anyway.

PostgreSQL does not ship incremental materialized views in core. The leading extension is pg_ivm, which tracks the source tables and applies just the changed rows on each refresh, turning a minute-long rebuild into a sub-second update. It is a real option for high-frequency views; it has caveats (not every aggregate is supported, source tables get triggers), so read the docs before adopting it.

The trap to watch is refresh under load. A REFRESH MATERIALIZED VIEW CONCURRENTLY on a busy primary generates write-ahead log, which streams to your replicas, which now have a replication lag spike exactly when your dashboards reload. If the dashboard reads route to a replica, the read-your-writes window of section 4 widens during refresh. Either schedule refreshes off-peak, or route the dashboard reads to a dedicated replica that you accept will lag during the refresh.

REFRESH MATERIALIZED VIEW: blocking vs concurrent
Builds the new contents in a scratch table, diffs against the old, applies the delta under a lock that lets readers continue. Requires a UNIQUE index. Slower to run and more WAL, but reads never block. The right default for any online materialized view.

7. External caches: cache-aside, write-through, write-behind

When the per-query cost is small but the volume is enormous (a profile lookup answered a hundred thousand times per second), the right rung is an external in-memory store: Redis (with its data structures, pipelining, and pub-sub) or Memcached (simpler, slab-allocated, multithreaded). Both answer point lookups in under a millisecond across the network and let you drop a large fraction of read traffic from PostgreSQL entirely.

Three patterns dominate, and the differences are entirely about who writes the cache and when.

Cache-aside, also known as lazy or read-through. The application checks the cache; on miss, it reads from PostgreSQL, writes the value to the cache, and returns it. On a write, the application writes PostgreSQL and either deletes the cache entry (preferred) or overwrites it. This is the dominant pattern: simple, correct under most conditions, and pushes the freshness decision to the writer.

Write-through. Every write goes to the cache and the database in lockstep, in the same code path. Reads always hit the cache. The cache is always populated. The cost is write latency (you pay for both stores) and complexity (now you have two stores in your write transaction, with no native two-phase commit).

Write-behind, also known as write-back. Writes go only to the cache. A background worker drains the cache to the database asynchronously. Reads are fastest of all. The cost is durability: a crash before the drain loses writes. This pattern is right for non-critical counters and analytics, wrong for money.

sequenceDiagram
    participant App
    participant Cache as Redis
    participant DB as PostgreSQL
    App->>Cache: GET user:42
    Cache-->>App: nil (miss)
    App->>DB: SELECT ... WHERE id = 42
    DB-->>App: row
    App->>Cache: SETEX user:42 60 row
    Cache-->>App: OK
    Note over App: later: a write happens
    App->>DB: UPDATE users SET ... WHERE id = 42
    DB-->>App: ok
    App->>Cache: DEL user:42

A note on Redis honesty. Treat the external cache as a derived store, not the source of truth. The day you “just use Redis for X” is the day you discover an eviction, a failover, or a misconfigured persistence has dropped X. The database stays authoritative; the cache is a faster window onto it. If you must persist in Redis (sessions, queues), accept the durability tradeoff explicitly.

The N+1 problem follows you up the ladder. A loop that fires one GET per row per request is as bad as a loop that fires one SELECT per row, sometimes worse because Redis is fast enough that you do not notice the round trips until the network saturates. Batch with MGET, MSET, pipelining (one round trip carries many commands), or hashes (HMGET) when the data is naturally keyed. The same rule as for SQL: do not loop where you can batch.

8. Invalidation: TTLs, LISTEN/NOTIFY, and change-data-capture

The classic line about caching is that there are only two hard problems in computer science: cache invalidation and naming things. Routing is mechanical. Invalidation is where caching gets a service paged at 3 a.m.

You have three honest strategies, and you usually compose two of them.

Time-based TTL. Every cache entry expires after a fixed time. The window of staleness is bounded by the TTL; the cache is self-healing because every entry eventually refreshes. The cost is steady-state load on PostgreSQL when entries expire, and a worst case where every popular key expires at once and a stampede of requests all miss together (the thundering herd). Mitigate with TTL jitter (each key’s expiry is the base TTL plus a small random offset) and stale-while-revalidate (a worker refreshes the entry in the background; readers see the old value during the refresh).

Explicit invalidation from the writer. On every UPDATE or DELETE, the writing code deletes or updates the affected cache keys. Correct in principle, but it puts the burden on every writer and breaks the moment a write path forgets, or a job edits the database without going through the app.

Change-data-capture (CDC). A subscriber tails PostgreSQL’s logical replication stream and emits invalidation events to the cache. As the logical-replication-and-cdc page covers, tools such as Debezium or wal2json read the WAL, decode it into row-level change events, and publish them to Kafka, RabbitMQ, or directly to a cache invalidator. The huge win is that every change to the database, no matter what wrote it, produces an invalidation. No one can forget. The cost is the operational burden of a CDC pipeline.

A lightweight alternative when you do not need cross-process pub-sub is PostgreSQL’s built-in LISTEN/NOTIFY. A trigger on the table issues NOTIFY cache_invalidate, 'user:42', and a worker LISTENing for that channel deletes the key in Redis. It is small, transactional (the NOTIFY is part of the commit), and works for single-region deployments. It is not the right tool for fan-out across data centers; for that, use CDC.

create or replace function notify_user_change()
  returns trigger
  language plpgsql
as $$
begin
  perform pg_notify('cache_invalidate',
                    format('user:%s', coalesce(new.id, old.id)));
  return coalesce(new, old);
end;
$$;

create trigger users_invalidate
  after insert or update or delete on users
  for each row execute function notify_user_change();

Redis itself helps with invalidation via pub-sub channels (PUBLISH/SUBSCRIBE), keyspace notifications (Redis emits events when keys change), and Redis 6’s client-side caching protocol (the server tracks which keys each client cached and pushes invalidations).

Invalidation pitfalls in the wild

9. Database-fronting caches and horizontal read partitioning

A newer category sits between Redis and PostgreSQL: a service that speaks the PostgreSQL wire protocol and caches query results transparently. The application thinks it is talking to a database; in reality, the proxy answers from cache when it can and forwards to PostgreSQL when it cannot. PolyScale is one such service (edge-cached query results, with cache-key derivation per query). ReadySet uses the streaming replication protocol to materialize and incrementally maintain query results, then serves them directly. pgpool-II has an in-memory query cache for repeated SELECTs.

These are not magic. They work best for read-heavy workloads where the same parameterized query runs millions of times (a profile fetch, a product detail). They struggle with queries that vary in subtle ways (random sort orders, per-user filters with high cardinality), and they all face the same invalidation question section 8 covers. Read the docs about their consistency model before adopting one in production: ReadySet is open-source and transparent about lag; PolyScale offers tunable freshness.

The other axis to scale on, when you have several replicas, is horizontal read partitioning. Instead of round-robin across replicas, hash the user id (or another stable key) modulo the number of replicas, and pin that user’s reads to replica H(user) mod N. The PostgreSQL replicas are all reading the same data, so this is not sharding (the data is fully replicated everywhere). The benefit is that each replica’s hot working set is a fraction of the total, so it fits in memory better; the cache hit ratio of each goes up; tail latency goes down. You also get sticky caching for free: a user always hits the same replica, so any application-level cache that keys on connection or session warms up.

flowchart LR
    A[Reader for user 42] -->|hash mod N| B[Replica 0]
    A2[Reader for user 91] -->|hash mod N| C[Replica 1]
    A3[Reader for user 17] -->|hash mod N| D[Replica 2]
    B -->|reads only| E[PostgreSQL streaming]
    C -->|reads only| E
    D -->|reads only| E

The cost is a hotspot when one user is far more active than the others (the celebrity problem). Mix in random fallback for the heaviest keys, or pin the celebrity to the primary.

10. Picking what to cache, and what breaks when you get it wrong

The first move is not to add a cache. The first move is to find out which queries actually cost you. The pg_stat_statements extension, as the pg_stat_statements page covers, tells you the top-N queries by total time, mean time, and total calls. Queries that show up high in total time and have a high calls count are the ones whose result is worth caching; queries that are slow but rare are usually fixed by an index, not a cache.

select query,
       calls,
       total_exec_time,
       mean_exec_time,
       rows
  from pg_stat_statements
 order by total_exec_time desc
 limit 20;

Now the failure modes that look like database problems but are caching mistakes.

OFFSET pagination across replicas. A user fetches page 5 with LIMIT 20 OFFSET 80. Their next request fetches page 6 with LIMIT 20 OFFSET 100. If the two requests land on different replicas with different lag, the rows can shift between them: the same row appears on both pages, or a row vanishes entirely. As the query-rewriting-and-anti-patterns page covers, the fix is keyset pagination (WHERE created_at < $last AND id < $last_id ORDER BY created_at desc, id desc LIMIT 20), which is stable across replicas because every request carries the cursor of where it left off.

One hot replica, others idle. Round-robin routing combined with sticky transactions can pin all the long-running queries to one replica. The pool stays “healthy” because connection counts look fine; latency on that one replica climbs because every slot is occupied. Watch per-replica query time, not just connection counts, and rebalance with hashing (section 9) when the workload is sticky.

Replication lag during a materialized view refresh. The REFRESH MATERIALIZED VIEW CONCURRENTLY writes a lot of WAL. The replicas fall behind. Now every read-your-writes gate widens, every dashboard refresh feels slow, and the on-call sees “replicas lagging” alerts. The fix is to schedule the refresh, accept the lag, and either pin dashboard reads to the primary during the window or use a dedicated replica for the dashboard that you accept will be stale during refresh.

Caching a query whose plan depends on its parameters. A query that sometimes uses an index scan and sometimes a sequential scan (depending on the parameter’s selectivity) is dangerous to cache without thinking. The cached result is correct for the parameters you cached it with, but if you cache the SQL plan (as some database-fronting caches do), the plan may be wrong for the next set of parameters. Cache results keyed on parameters, not plans.

Trusting the cache hit ratio alone. A 99.5 percent hit ratio sounds great until you notice the 0.5 percent of misses are the queries that matter (a hot page that just got invalidated, the first request after a deploy). Watch the tail of the miss distribution, not the average. A cache that hits often but misses on the slowest queries is doing less for you than the dashboard suggests.

Check yourself
Your service routes all SELECTs round-robin across three replicas. A user updates their email, then refreshes the profile page, and sees the old email. You have to fix this without sending every read to the primary. Which approach is correct?

Mastery Questions

  1. Your team adds three read replicas behind a round-robin proxy. Write throughput on the primary is healthy, total read QPS is up by 4x, and most pages feel snappier. But the support queue starts filling with “I changed my address and it reverted” tickets. Walk through what is happening, how you would prove it in production, and the smallest change that fixes it without slamming all reads back onto the primary.

    Answer. This is the read-your-writes trap from section 4. A user commits an UPDATE to the primary; the next request lands on a replica whose replay position is behind the commit; the user sees the pre-update row and reads it as a regression. You prove it by correlating the support tickets to pg_stat_replication lag on the primary at the time of the complaint (look for replay_lsn lagging sent_lsn by more than a second). The fix is LSN gating: the application captures pg_current_wal_lsn() on commit, carries it on the session for a short window (say 30 seconds), and the proxy picks a replica only if pg_last_wal_replay_lsn() is greater than or equal to that LSN, falling back to the primary otherwise. A simpler stopgap is stickiness: route reads for that user to the primary for 30 seconds after a write. Both keep the load-shedding benefit of replicas for the 99 percent of reads that are not in a read-after-write window. The deeper lesson is that “the replicas are caught up” is a probability statement, not a guarantee, and any session that just wrote needs an explicit defense.

  2. A landing-page leaderboard runs an expensive aggregate over 50 million rows and shows up at the top of pg_stat_statements. The product manager says “two minutes stale is fine.” You consider three solutions: a covering index, a materialized view, and Redis. Which do you pick, why, and what is the failure mode of your choice?

    Answer. The right answer is a materialized view, refreshed every two minutes with REFRESH MATERIALIZED VIEW CONCURRENTLY. A covering index helps a specific query shape but does not precompute a wide aggregate over 50 million rows; the planner still has to sum and group. Redis is fine for caching the rendered leaderboard page, but you need to populate it from something, and that something is exactly the aggregate query you are trying to avoid running constantly. The materialized view runs that query once every two minutes, stores the result, and lets every page request scan a tiny table. The failure modes are the ones from sections 6 and 10: the refresh writes WAL and can bump replication lag during the refresh window, so either route the dashboard reads to the primary or accept the lag on the replica that serves them; the CONCURRENTLY form requires a UNIQUE index on the view; and if traffic spikes during the refresh, you can stack refreshes (a second one starts before the first finishes), so guard with an advisory lock or a flag. If two minutes is too coarse and the aggregate is incremental in shape (sum, count), pg_ivm becomes attractive: the view stays sub-second fresh at the cost of triggers on the source tables.

  3. You instrument your service and discover that your Redis cache has a 99.5 percent hit rate and the application is faster than ever. A week later, a deploy reveals that several team members “just put it in Redis” for fields that have lived only in Redis since launch. A failover drops some of those keys, and the team learns the field had no PostgreSQL record at all. What is the structural mistake here, what consistency model should the team adopt going forward, and what cache pattern would have prevented the loss?

    Answer. The structural mistake is treating Redis as the source of truth for state that the business actually cares about. Redis is a derived store: fast, in-memory, and bounded by eviction and persistence configuration. It will lose data on a failover, an eviction, or a misconfigured appendonly setting, and that is fine for a cache because the database can rehydrate it. It is not fine for the only copy of a user’s data. The consistency model the team should adopt is “PostgreSQL is authoritative; every persistent field has a column; Redis caches a view of that column.” The cache pattern that prevents this is cache-aside (section 7): every write goes to PostgreSQL and invalidates Redis; every read tries Redis, falls back to PostgreSQL on miss, and repopulates the cache. Even if Redis is wiped, the next request reads from PostgreSQL and warms the cache; no data is lost. Write-through would also work, with the cost of paying both write latencies; write-behind would specifically be the wrong choice here because it is the pattern that lets the cache hold writes the database has not seen. The deeper lesson is to write down which store is the source of truth for each field, and treat any other store as a derived view that must survive its own destruction.

Recommended next

  • Sharding with Citus
    Builds directly on this page: Sharding with Citus is the next step in the PostgreSQL performance ladder.
Sources & evidence16 claims · 4 cited

Replica mechanics, WAL functions, and pg_stat_statements grounded in the PostgreSQL high availability, warm standby, runtime resource, and pg_stat_statements docs; cache patterns (cache-aside, write-through, write-behind), Redis behaviour, pg_ivm, ReadySet, PolyScale, the read-your-writes trap, materialized-view refresh modes (the existence of CONCURRENTLY and its UNIQUE-index requirement is widely documented), and PostgreSQL's clock-sweep buffer replacement are marked stable-common-knowledge with empty source_ids where the listed docs do not state the exact fact.

  • Read scaling in PostgreSQL is a ladder from cheapest-and-freshest to widest-and-stalest: shared buffer pool plus OS page cache, index-only and covering scans, materialized views and summary tables, streaming read replicas, external caches (Redis, Memcached), and CDN or app-level memoization, and the design choice is which rung is the highest one still fresh enough for a given query class.stable common knowledge
  • PostgreSQL's shared buffer pool sits inside the server process and is sized by the shared_buffers GUC; common operational guidance is roughly 25 percent of host RAM, with the kernel page cache acting as a secondary layer behind it.verified
  • PostgreSQL's buffer replacement is a clock-sweep with a per-buffer usage counter capped at 5: hits increment the counter, the eviction pass decrements counters as it walks, and the first buffer at zero is evicted, which behaves closer to LFU than LRU and protects hot pages from being flushed by one-shot sequential scans.stable common knowledge
  • An index-only scan answers a query without visiting the heap when the index alone supplies every required column (either in the key or via INCLUDE), so a covering index is effectively a denormalized in-database cache that PostgreSQL keeps consistent for free.stable common knowledge
  • A hot standby in recovery accepts SELECT only and refuses writes; replication lag from the primary is measurable from pg_stat_replication (sent_lsn, write_lsn, flush_lsn, replay_lsn) using pg_wal_lsn_diff against pg_current_wal_lsn, and from the standby with pg_is_in_recovery, pg_last_wal_replay_lsn, and pg_last_xact_replay_timestamp.verified
  • Routing reads to a streaming replica violates read-your-writes when the replica's replay position is behind the user's last commit; the textbook defenses are session stickiness on commit, LSN gating (capture pg_current_wal_lsn at commit and only pick a replica whose pg_last_wal_replay_lsn is at or past it, otherwise fall back to the primary), or explicit per-query primary hints.stable common knowledge
  • Read/write splitting can live in the application (ORM-driven, e.g. Rails connected_to, Django routers, Spring readOnly), in a query-aware proxy (PgBouncer + HAProxy, pgcat, Odyssey, RDS Proxy), or behind separate DNS endpoints; in every layout a transaction is sticky to the connection it started on, and any read with side effects (SELECT FOR UPDATE, a function that writes) must route to the primary.stable common knowledge
  • REFRESH MATERIALIZED VIEW takes ACCESS EXCLUSIVE on the view and blocks readers for the entire rebuild, while REFRESH MATERIALIZED VIEW CONCURRENTLY requires a UNIQUE index on the view, builds the new contents in a scratch table, diffs them against the old, and applies the delta under a weaker lock so that readers proceed throughout (at the cost of more WAL and a slower refresh).stable common knowledge
  • Incremental materialized views are not part of core PostgreSQL; pg_ivm is the leading extension, applying only the changed rows via triggers on the source tables to turn a full rebuild into a sub-second update, with the caveat that not every aggregate is supported and source tables gain triggers.stable common knowledge
  • External caching follows three patterns: cache-aside (the application checks the cache, falls back to PostgreSQL on miss, and either deletes or overwrites the cache entry on a write), write-through (every write goes to the cache and the database in lockstep so reads always hit), and write-behind (writes hit only the cache and a worker drains them to the database, sacrificing durability on crash), and treating Redis as a derived store rather than the source of truth is what keeps a failover or eviction from losing data.stable common knowledge
  • The N+1 problem reappears at the cache layer: per-row GET calls against Redis can saturate the network in the same shape as per-row SELECTs against PostgreSQL, and the fix is the same shape: batch with MGET, MSET, hash field reads (HMGET), or pipelining so one round trip carries many commands.stable common knowledge
  • Invalidation strategies compose: a time-based TTL bounds staleness but causes a thundering herd at expiry (mitigated by jitter, stale-while-revalidate, or per-key single-flight), explicit writer-side deletes are correct only if every write path remembers, and change-data-capture (Debezium, wal2json) tails the WAL and emits invalidation events that no writer can forget.stable common knowledge
  • PostgreSQL's built-in LISTEN/NOTIFY is a transactional, single-region alternative for cache invalidation: a trigger calls pg_notify on the row change, the notify is delivered only if the transaction commits, and a worker LISTENing on the channel deletes the corresponding cache key, with no operational CDC pipeline required.stable common knowledge
  • Database-fronting caches such as PolyScale, ReadySet, and pgpool-II's query cache speak the PostgreSQL wire protocol and answer repeated parameterized reads from cache; ReadySet uses streaming replication to incrementally maintain query results, and the consistency model of each (TTL-based versus replication-driven) determines how stale a cached answer can be.stable common knowledge
  • The pg_stat_statements extension records execution counts, total and mean execution time, and rows per parameterized query, and is the canonical input to picking which queries deserve a cache: high total_exec_time with high calls is a cacheable hot read, while a slow but rare query is almost always fixed by an index instead.verified
  • OFFSET pagination is unstable across replicas because two requests with different OFFSETs can land on replicas at different replay positions and see rows shift between them; keyset (cursor) pagination using a WHERE clause on the last-seen sort key is stable because each request carries the boundary forward.stable common knowledge

Cited sources