Tuning Memory for Throughput
How PostgreSQL's memory parameters (shared_buffers, effective_cache_size, work_mem, maintenance_work_mem, huge_pages) actually behave, how to measure cache hits and temp-file spills, and how to size them against total RAM so a tuning change does not invite the out-of-memory killer.
Learning outcomes
Memory is where PostgreSQL tuning pays off fastest and where it goes wrong most expensively. A few parameters decide whether a query reads from RAM or from disk, whether a sort runs in microseconds or spills to a temp file, and whether a traffic spike serves more requests or kills the server outright. The catch is that the two parameters everyone reaches for first, shared_buffers and work_mem, do not mean what their names suggest, and the defaults are tuned for a laptop, not your fleet.
After studying this page, you can:
- Explain why
shared_buffersat 25 percent of RAM is a starting point and why pushing it to 75 percent usually wastes memory. - Measure your cache hit ratio from
pg_stat_databaseand inspect what is actually cached withpg_buffercache. - Predict peak server memory from
work_memtimes concurrent operations, and size it without inviting the out-of-memory killer. - Read an
EXPLAIN (ANALYZE, BUFFERS)plan to tell an in-memory sort from one that spilled to disk. - Raise
maintenance_work_memandhuge_pagesdeliberately, knowing exactly what each speeds up and what each can break.
Before we dive in
You should be comfortable running SQL and editing postgresql.conf (or using ALTER SYSTEM). You do not need to have memorized any specific numbers; we pin every default to PostgreSQL 16 and 17 as we go. It helps to know that PostgreSQL reads and writes its tables in fixed 8 KB blocks called pages, and that the planner, as the query-planner-and-cost-model page covers, chooses a plan as a tree of nodes (a sort, a hash join, a scan) before any rows move.
Three words you will need. A buffer is one 8 KB page held in memory. A cache hit is a page request that found its page already in memory, so no disk read happened. A spill is what a memory-using operation does when it runs out of its allowance: it writes intermediate data to a temporary file on disk and keeps going, correctly but slowly. Hold those three. Almost every decision below is a fight to raise hits and avoid spills without running the machine out of RAM.
Mental Model
The tempting wrong model is that shared_buffers is simply “how much RAM PostgreSQL gets,” so more is always better, and you should hand it most of the machine. People set it to 60 or 70 percent of RAM and expect a matching speedup.
That model misses a second cache sitting underneath. PostgreSQL does not own its memory the way a JVM heap does. It keeps its own page cache in shared_buffers, but every page it reads also passes through the operating system’s page cache, so a hot page is often held in memory twice. The better model is two stacked caches, not one big pool. shared_buffers is the inner cache PostgreSQL manages directly. The OS page cache is the outer one the kernel manages for free. Give the inner cache too much and you starve the outer one, double-buffer the same pages, and leave less room for the per-query memory (work_mem) and the maintenance memory that real throughput actually needs.
Keep this picture. Memory is a budget split across two caches and many concurrent operations, not a single dial you turn to the right. Once that clicks, the 25 percent rule, the per-node nature of work_mem, and the out-of-memory risk all stop being arbitrary and start being arithmetic.
Breaking it down
1. Why memory is the highest-leverage knob
Start with the cost gap that makes memory matter. A page already in RAM is returned in roughly a tenth of a microsecond. The same page fetched from an SSD costs tens of microseconds, and from a spinning disk, several milliseconds. That is a difference of three to five orders of magnitude for the exact same 8 KB of data.
So the single most important thing memory tuning does is keep hot pages in RAM and keep working data out of temp files. Every parameter on this page serves one of those two goals. shared_buffers and effective_cache_size are about page caching: did your scan read from memory or from disk. work_mem and maintenance_work_mem are about working memory: did your sort, hash, or index build finish in RAM or spill to disk.
You can see why this dominates other tuning. A perfect index still has to read its pages from somewhere; if they are on disk every time, the index is slow. A clever query plan still has to sort its output; if the sort spills, the plan is slow. Memory sits underneath the planner and the indexes, which is why this is the first configuration topic and shapes every later one.
The rest of the page is a budget. We will meet each consumer of memory, learn what it really controls, and at the end add them up against total RAM, because the sum is what the kernel’s out-of-memory killer cares about.
2. shared_buffers and the double-buffering trap
shared_buffers is PostgreSQL’s own page cache: a single shared region, sized at startup, that holds 8 KB pages so repeat reads skip the disk. Every backend reads and writes table and index pages through it. The default is a tiny 128 MB, deliberately small so PostgreSQL starts on any machine; on a real server it is the first thing you raise.
The standard starting point is 25 percent of system RAM. On a 64 GB server that is 16 GB. Here is the part that surprises people: the reason you stop near a quarter, rather than handing PostgreSQL most of the box, is double buffering. A page PostgreSQL reads from disk lands first in the OS page cache, then gets copied into shared_buffers. Hot pages therefore live in memory twice. If you set shared_buffers to 75 percent of RAM, you have not gained a 75 percent cache; you have shrunk the OS cache that was already holding many of those same pages, so much of that extra allocation just duplicates what the kernel had for free. Past a point the returns flatten and then go negative, because the OS cache loses the room it needs.
# postgresql.conf, a 64 GB server: 25% of RAM as the inner cache.
shared_buffers = 16GB
Raising it above 25 percent earns its keep in specific cases, not as a reflex. It helps when your hot working set is large and fits better in the inner cache, and it helps write-heavy workloads, where keeping dirty pages resident in shared_buffers lets many updates to the same page coalesce before one write reaches disk, so you re-read and re-write fewer pages. It does not help when your data is far larger than any plausible cache, or when the OS cache was already serving those reads; there you have just moved memory from one cache to the other and lost the flexibility of the kernel’s.
You do not guess whether it is working. You measure the cache hit ratio: of all page requests, the fraction that found their page already in shared_buffers. pg_stat_database counts blks_hit (served from the buffer cache) and blks_read (had to go below it), so the ratio is blks_hit / (blks_hit + blks_read).
-- Buffer-cache hit ratio per database. Aim well above 0.99 on an OLTP workload.
select datname,
blks_hit,
blks_read,
round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) as hit_ratio
from pg_stat_database
where datname = current_database();
A healthy transactional database sits above 99 percent. If the ratio is low and stays low, your hot set does not fit, and raising shared_buffers (or adding RAM) may genuinely help. To go further and see what is in the cache right now, install the pg_buffercache extension, which exposes one row per buffer.
create extension if not exists pg_buffercache;
-- Which relations occupy the most buffers right now?
select c.relname,
count(*) as buffers,
round(count(*) * 8192 / 1024.0 / 1024.0, 1) as mb_cached
from pg_buffercache b
join pg_class c on c.relfilenode = b.relfilenode
group by c.relname
order by buffers desc
limit 10;
That query turns an abstract setting into a concrete answer: these tables and indexes are the ones actually living in your cache, in this much memory.
3. The OS page cache and effective_cache_size
The second tier of caching is the one PostgreSQL does not manage at all. Every read that misses shared_buffers goes to the kernel, and the kernel keeps its own page cache out of otherwise-free RAM. On a typical Linux database host most of physical memory ends up here, holding recently read file pages. This is exactly why you leave the OS room in the budget: it is doing a large share of your caching for free, and it adapts to the whole workload, not just PostgreSQL.
effective_cache_size is how you tell the planner about that second tier. It is not an allocation. PostgreSQL never reserves a single byte for it. It is a hint: your estimate of how much memory, across both shared_buffers and the OS cache, is realistically available to cache this database’s pages. The planner reads it when costing an index scan, to estimate how many of the index and table pages a repeated scan will find already cached rather than fetched from disk.
The effect is concrete. A larger effective_cache_size tells the planner that more pages are likely cached, which lowers the estimated cost of index scans (especially repeated or nested-loop lookups) and biases it toward using an index instead of sequentially scanning the whole table. Set it too low and the planner believes your cache is small, overcharges index scans, and falls back to sequential scans that are slower on a machine that actually had the pages cached.
The standard value is 50 to 75 percent of total RAM. The default of 4 GB is, like shared_buffers, a conservative placeholder.
# A planner hint, not an allocation: ~75% of a 64 GB box.
effective_cache_size = 48GB
flowchart LR
Q["Query needs a page"] --> SB{"In shared_buffers?"}
SB -->|hit| FAST["Return from RAM"]
SB -->|miss| OS{"In OS page cache?"}
OS -->|hit| WARM["Copy into shared_buffers"]
OS -->|miss| DISK["Read from disk (slow)"]
DISK --> WARMRead the diagram as the life of a page request. The two caches are tiers: shared_buffers first, the OS cache behind it, disk last. shared_buffers sizing decides how often you stop at the first tier; effective_cache_size is your honest estimate of the first two tiers combined, handed to the planner so it costs scans against the cache you really have.
4. work_mem is per node, not per query
Now the subtle parameter, the one that causes more outages than any other on this page. work_mem is the memory budget a single operation may use for sorting or hashing before it spills to a temporary file. The default is 4 MB.
Here is the trap in the name. It is not per query, and it is not per connection. It is per operation node in the plan. One query can contain many memory-using nodes at once: each Sort, each hash table for a Hash Join, each HashAggregate, each materialize. Every one of those nodes may use up to work_mem independently and at the same time. A single reporting query with three sorts and two hash joins can therefore use five times work_mem on its own.
Multiply that by concurrency and the real number appears. If work_mem is 64 MB and you have 200 connections each running a query with, say, four memory-using nodes, the worst-case working memory is 64 MB * 4 * 200, which is about 50 GB, on top of shared_buffers and everything else. That is how a setting that looked modest becomes the thing that runs the server out of memory.
flowchart TB
Q["One report query"] --> S1["Sort node (up to work_mem)"]
Q --> H1["Hash Join build (up to work_mem)"]
Q --> A1["HashAggregate (up to work_mem)"]
S1 --> SUM["Peak = work_mem x number of these nodes"]
H1 --> SUM
A1 --> SUMOne more lever changes the picture for hash nodes specifically. Hash tables get more headroom than sorts: hash_mem_multiplier, which defaults to 2.0, multiplies work_mem for hash-based nodes only. So with work_mem at 64 MB, a HashAggregate or a hash join’s table may use up to 128 MB before it spills. PostgreSQL gives hashing extra room because hash spills are especially painful, and because a hash table that barely overflows is a common, avoidable cause of slow aggregates.
How do you size it? Start low and global, then go high and local. Keep the global work_mem modest, sized so that work_mem * (typical nodes) * (peak connections) stays comfortably inside RAM. Then grant generous memory only where it pays, per role or per statement, for the few heavy reports that need it.
-- Global stays safe for the many small OLTP queries.
alter system set work_mem = '32MB';
select pg_reload_conf();
-- The analytics role gets room for big sorts and hash joins.
alter role analytics set work_mem = '512MB';
-- Or just for one heavy report, scoped to this transaction.
begin;
set local work_mem = '1GB';
-- ... the big reporting query ...
commit;
SET LOCAL is the surgical tool here: it raises the budget for one transaction and reverts automatically at commit, so a single report can have its gigabyte without every connection being allowed one.
5. Watching a node spill to disk
A parameter you cannot observe is a parameter you cannot tune, so make spills visible. When a sort or hash exceeds its work_mem budget, it does not fail; it spills the overflow to a temporary file under pgsql_tmp and finishes correctly, just slower. The danger is that this is silent by default. A report that quietly spills every night looks fine until you ask why it takes nine minutes.
Two tools turn the lights on. The first is log_temp_files: set it to a byte threshold (or 0 to log every temp file) and PostgreSQL writes a log line whenever an operation spills more than that.
# Log any temporary file 10 MB or larger. 0 logs all of them.
log_temp_files = 10MB
The second, sharper tool is the query plan itself. Run the statement under EXPLAIN (ANALYZE, BUFFERS) and read what each node actually did. A Sort node tells you its method and where it ran. In memory, it reads Sort Method: quicksort Memory: ...kB. Spilled, it reads Sort Method: external merge Disk: ...kB, which is the tell that this sort blew past work_mem.
explain (analyze, buffers)
select customer_id, sum(amount)
from orders
group by customer_id
order by sum(amount) desc;
Sort (cost=84512.30..85012.30 rows=200000 width=12)
(actual time=512.4..534.1 rows=200000 loops=1)
Sort Key: (sum(amount)) DESC
Sort Method: external merge Disk: 24560kB
Buffers: shared hit=18342 read=2210, temp read=3071 written=3071
-> HashAggregate (cost=...)
...
Read that output closely, because it carries two separate stories. Sort Method: external merge Disk: 24560kB says the sort spilled about 24 MB to disk; raising work_mem to roughly 32 MB for this query would let it use quicksort in memory instead. The Buffers: line is the other half. shared hit=18342 read=2210 is your cache story (18342 pages came from shared_buffers, 2210 had to be read from below it), and temp read=3071 written=3071 is your spill story in 8 KB blocks. When temp is non-zero, an operation went to disk for working memory, and work_mem is the lever.
6. maintenance_work_mem for builds and vacuum
work_mem covers the memory a query uses to answer you. A second, separate budget covers the heavy maintenance operations that keep the database healthy: maintenance_work_mem. It governs how much memory one such operation may use, and the default is 64 MB.
Three operations draw on it, and they are exactly the ones you wait on. VACUUM uses it to hold the list of dead tuple identifiers it is reclaiming, so a larger budget means fewer passes over a big table’s indexes. CREATE INDEX uses it for the sort that builds the index, so a larger budget builds indexes dramatically faster. ALTER TABLE ... ADD FOREIGN KEY and similar validations use it too. Because these run rarely and usually one or a few at a time, you can afford to be generous: 1 to 2 GB is a common, safe value on a server with plenty of RAM, and it can turn an hours-long index build into a much shorter one.
maintenance_work_mem = 2GB
-- Give one big index build a large budget without touching the global value.
set maintenance_work_mem = '4GB';
create index concurrently idx_orders_customer on orders (customer_id);
There is a catch worth knowing before you set it high. Autovacuum runs several workers in parallel, and each worker uses maintenance_work_mem by default. To stop a high maintenance budget from being multiplied by every autovacuum worker at once, PostgreSQL provides autovacuum_work_mem. Leave it at the default of -1 and autovacuum workers fall back to maintenance_work_mem; set it explicitly to cap autovacuum’s per-worker share separately from your manual builds. So you can run with maintenance_work_mem = 2GB for fast manual index builds while holding each autovacuum worker to, say, 256 MB.
maintenance_work_mem = 2GB # manual VACUUM, CREATE INDEX, ADD FOREIGN KEY
autovacuum_work_mem = 256MB # caps each autovacuum worker's share
7. huge_pages: backing buffers with bigger pages
This rung is about the memory under the memory. The kernel hands out RAM in pages, normally 4 KB each, and the CPU caches the lookup from a virtual address to a physical page in a small, fast table called the TLB. A large shared_buffers spans an enormous number of 4 KB pages, so the TLB thrashes, and every miss is a slow walk through the kernel’s page tables. The fix is to back shared_buffers with huge pages, typically 2 MB each on x86-64. One huge page covers what 512 normal pages did, so the same shared_buffers needs far fewer TLB entries, and you cut TLB misses and page-table overhead under load.
The parameter is huge_pages, with three values: try (the default), on, and off. With try, PostgreSQL asks for huge pages at startup and silently falls back to normal pages if the kernel cannot supply them. With on, it demands them, and if enough huge pages are not available, the server refuses to start. That last point is the operational gotcha: huge_pages = on is the strict, high-performance choice, but it turns “not enough huge pages reserved” into a failed startup. Use on only when you have reserved the pages and want a hard guarantee; try is the safe default that degrades gracefully.
To use them you must reserve enough huge pages in the kernel to cover shared_buffers. The arithmetic is simple: number of huge pages needed is roughly shared_buffers divided by the huge-page size, plus a little headroom for other shared memory. For 16 GB of shared_buffers with 2 MB pages, that is about 8192 pages.
# 16 GB of shared_buffers / 2 MB per huge page ~= 8192 pages (add headroom).
sysctl -w vm.nr_hugepages=8400
# Make it persist across reboots:
echo 'vm.nr_hugepages = 8400' >> /etc/sysctl.conf
# Demand huge pages once they are reserved; the server will not start without them.
huge_pages = on
PostgreSQL even helps you size the reservation: with shared_buffers set, postgres -C shared_memory_size_in_huge_pages prints the exact number of huge pages it needs, which you feed straight into vm.nr_hugepages.
8. The whole-RAM budget and the OOM killer
Now assemble the pieces, because the operating system judges PostgreSQL on the sum, not on any single parameter. On Linux, when memory runs out, the kernel’s out-of-memory killer picks a process and terminates it, and a large PostgreSQL backend is a prime target. A killed backend can take the whole server down with it. So the real constraint is one inequality: the total of every memory consumer must stay safely below physical RAM, with room left for the OS and its page cache.
The budget, in words, is the sum of these:
shared_buffers: the inner cache, fixed at startup.work_memtimes the peak number of concurrent memory-using nodes: the working memory of all queries at once.maintenance_work_memtimes the number of maintenance operations running together (including autovacuum workers, unless capped byautovacuum_work_mem).- A per-connection overhead of a few megabytes each, multiplied by your connection count.
- Whatever you leave for the operating system and its page cache.
A worked example on a 64 GB server makes it real. Suppose shared_buffers = 16GB, work_mem = 32MB, a peak of 200 connections, and roughly 3 memory-using nodes per query at peak. The working memory term is 32 MB * 3 * 200, about 19 GB. Add a few autovacuum workers at the maintenance budget and a few megabytes per connection, and you are already near 16 + 19 = 35 GB of committed memory before the OS cache gets anything. That still fits 64 GB with headroom. Now imagine someone “optimizes” by setting work_mem = 256MB. The same term becomes 256 MB * 3 * 200, about 150 GB, more than double the machine. Nothing warns you at configuration time; the server runs fine until a concurrency spike hits and the OOM killer fires.
The interactive below is that budget. Drag the inputs and watch total committed memory move; the OS gets whatever is left under your RAM. The formula is exactly the arithmetic above, with nothing hidden.
Switch the mode to “Free for OS” and push work_mem up: watch the headroom go negative. The instant it does, you are promising the kernel more memory than exists, and the only question left is which backend the OOM killer takes first.
9. The failure modes experts still hit
You now have every piece. This last rung names the three mistakes that catch experienced engineers, because each one passes review and only bites in production.
The first is global work_mem set high, then a concurrency spike. Someone profiles a slow report, sees it spilling, raises the global work_mem to fix it, and ships. The report is fast. Weeks later, traffic spikes, every connection runs a query with several memory-using nodes at the new high budget, the sum sails past RAM, and the OOM killer takes a backend down at the worst moment. The fix is the rung-4 discipline: keep global work_mem modest, and grant the large budget per role or per statement to the few queries that need it.
The second is shared_buffers cranked to 60 percent with no measured benefit. It feels like giving PostgreSQL more memory must help. But without checking the cache hit ratio first, you cannot know whether the inner cache was even the bottleneck, and past 25 percent you are mostly starving the OS cache and double-buffering pages it already held. The fix is to measure blks_hit / (blks_hit + blks_read) before and after any change, and to treat a flat hit ratio as proof the change did nothing but cost you OS-cache room.
The third is the quiet one: ignoring temp-file spills. A nightly report spills to disk every run, and because it returns the right answer, nobody notices it is disk-bound. The fix is to make spills loud, with log_temp_files, and to scan plans for external merge and non-zero temp blocks under EXPLAIN (ANALYZE, BUFFERS). A spill is the cheapest performance bug to find and one of the easiest to fix, but only if you are looking.
Mastery Questions
-
A teammate set
shared_buffersto 48 GB on a 64 GB server “to make PostgreSQL faster,” but the application is no quicker and memory feels tight. Walk through what you would check and what you would expect to find.Answer. Start by separating the claim from the measurement. The belief is that more
shared_buffersis always more cache; the reality is two stacked caches, and at 48 GB on a 64 GB box (75 percent) the OS page cache has been squeezed to almost nothing while many hot pages now sit in memory twice, once in each cache. I would pullblks_hitandblks_readfrompg_stat_databaseand compute the hit ratio, ideally comparing against the value before the change; I would expect it to be roughly flat, which is the proof that the extra 32 GB bought no additional caching. I would also expect memory pressure: with so little left for the OS cache and forwork_memtimes concurrency, the machine is closer to the out-of-memory killer than before. The fix is to returnshared_bufferstoward 25 percent (16 GB here), give the freed memory back to the OS cache and towork_memheadroom, and only deviate upward if the hit ratio data actually justifies it for a specific large, hot working set. The lesson: never moveshared_bufferson faith; move it on the hit ratio. -
A nightly aggregation report has slowly crept from two minutes to nine over a few months, with no schema change and no obvious cause. How do you confirm it is a memory problem, and how do you fix it without endangering the rest of the server?
Answer. Run the report under
EXPLAIN (ANALYZE, BUFFERS)and read two things. On the sort and hash nodes, look at the method:external merge Disk: NkBorSort Methodshowing disk means the operation outgrewwork_memand spilled. On theBuffers:line, a non-zerotemp read/temp writtenconfirms working data went to disk in 8 KB blocks. As the data grew, the same query crossed thework_memthreshold and started spilling, which is exactly the kind of slow creep that has no schema cause. I would also enablelog_temp_filesso future spills announce themselves. The fix is to give this one report more memory without raising the global budget for every connection: wrap it in a transaction andSET LOCAL work_mem = '1GB'(sized a bit above theDisk:figure the plan reported), or run it under a dedicated role with a higherwork_mem. That keeps the global value safe for the hundreds of small OLTP queries while the one heavy report gets the room it needs, so the fix cannot trigger an out-of-memory event under normal concurrency. -
You are asked to set a single global
work_memfor a server with 64 GB of RAM,shared_buffers = 16GB, and a peak of 300 connections that each typically run queries with about four memory-using nodes. What value is defensible, and what is the reasoning you would write down?Answer. The reasoning is the budget inequality, not a rule of thumb. Working memory at peak is roughly
work_mem * nodes_per_query * connections, herework_mem * 4 * 300, which iswork_mem * 1200. Total committed memory is that plusshared_buffers(16 GB) plus per-connection overhead plus the maintenance budget, and it must stay below 64 GB with real headroom for the OS page cache. If I leave, say, 40 GB for working memory aftershared_buffersand the OS, thenwork_mem * 1200must be under about 40 GB, sowork_memlands near 32 MB and I would round down to be safe, perhaps 24 to 32 MB. The number itself matters less than the discipline behind it: I would write down that the global value is deliberately modest because it is multiplied by nodes and by concurrency, and that any query needing more gets it throughSET LOCALor a per-role override, never by raising the global. The trap I am explicitly avoiding is sizingwork_memfor one slow report and forgetting that the same generosity is handed to all 300 connections at once, which is precisely how a tuning change becomes an out-of-memory outage.
Sources & evidence15 claims · 3 cited
Parameter behaviour, defaults, and diagnostics are grounded in the PostgreSQL 16/17 runtime resource-consumption docs, the official performance-tuning wiki, and the EXPLAIN documentation (src_pg_docs_runtime_resource, src_pg_wiki_tuning, src_pg_docs_explain). Correct facts those sources do not state directly (the relative cost of a RAM hit versus a disk read, the 2 MB x86-64 huge-page size, the per-page TLB rationale) are marked stable-common-knowledge with empty source_ids.
- shared_buffers is PostgreSQL's shared page cache; its default is 128 MB and a common starting point on a dedicated server is about 25 percent of system RAM.verified
- Because PostgreSQL relies on the operating system's own page cache as well, pages read into shared_buffers are often held in memory twice (double buffering), so raising shared_buffers far above roughly 25 percent of RAM gives diminishing returns and can starve the OS cache.verified
- The buffer-cache hit ratio is blks_hit / (blks_hit + blks_read) from pg_stat_database, and the pg_buffercache extension exposes one row per shared buffer so you can see which relations are cached.verified
- effective_cache_size is a planner hint, not an allocation; it estimates the total memory available for caching across shared_buffers and the OS cache, is typically set to 50 to 75 percent of RAM, and a larger value lowers estimated index-scan cost and biases the planner toward index scans.verified
- work_mem (default 4 MB) is the memory budget for a single sort or hash operation node, not per query or per connection, so a query with several memory-using nodes can use work_mem multiple times and peak server memory is work_mem times the number of concurrent memory-using nodes.verified
- hash_mem_multiplier (default 2.0) multiplies the work_mem budget for hash-based nodes only, so a hash join's table or a HashAggregate may use up to work_mem times 2.0 before spilling.verified
- work_mem can be set per role with ALTER ROLE or per transaction with SET LOCAL, so a heavy report can be granted a large budget while the global value stays modest.verified
- When a sort or hash exceeds work_mem it spills to a temporary file and still returns the correct result; log_temp_files logs such spills and EXPLAIN (ANALYZE, BUFFERS) reports 'Sort Method: external merge Disk: NkB' for a spilled sort versus 'quicksort Memory: NkB' for an in-memory one.verified
- In EXPLAIN (ANALYZE, BUFFERS) output, the Buffers line reports shared hit and read (the buffer-cache story) and temp read/written in 8 KB blocks (the spill story), so non-zero temp blocks signal that an operation went to disk for working memory.verified
- maintenance_work_mem (default 64 MB) budgets maintenance operations such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY, and raising it to 1 to 2 GB speeds index builds and vacuum.verified
- autovacuum_work_mem (default -1, meaning fall back to maintenance_work_mem) caps the memory each autovacuum worker uses, so it can hold autovacuum to a smaller share while manual maintenance keeps a large maintenance_work_mem.verified
- huge_pages backs shared_buffers with larger memory pages (typically 2 MB each on x86-64) to reduce TLB misses, and the number of huge pages to reserve is roughly shared_buffers divided by the huge-page size.stable common knowledge
- huge_pages accepts try (the default, fall back to normal pages if unavailable), on (require them), and off; with on the server will not start if enough huge pages are not available.verified
- Total committed memory is approximately shared_buffers plus work_mem times peak concurrent memory-using nodes plus maintenance_work_mem times concurrent maintenance operations plus per-connection overhead plus the OS reservation; if this exceeds physical RAM the Linux out-of-memory killer can terminate a backend and take down the server.verified
- A page already in RAM is returned in a fraction of a microsecond, while a page read from SSD costs tens of microseconds and from a spinning disk several milliseconds, a difference of several orders of magnitude that makes keeping pages and working data in memory the dominant tuning concern.stable common knowledge
Cited sources
- PostgreSQL Documentation: Resource Consumption (Memory) · PostgreSQL Global Development Group
- PostgreSQL Wiki: Tuning Your PostgreSQL Server and Number Of Database Connections · PostgreSQL community
- PostgreSQL Documentation: Using EXPLAIN · PostgreSQL Global Development Group