Buffer Cache and WAL
© Postgres Professional, 2017–2021
Authors: Egor Rogov, Pavel Luzanov
Translated by Liudmila Mantrova
Usage of Course Materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed on an unrestricted basis. Commercial use is only possible with prior
written permission of Postgres Professional company. Modification of course
materials is forbidden.
Contact Us
Please send your feedback to:
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Buffer cache overview
Eviction algorithm
Write-ahead log
Buffer Cache
An array of buffers
data page (8 kB)
additional metadata
Memory locks
for shared access
background workers
shared memory
buffer cache
Buffer cache is used to even out the speed difference between disk and
RAM access. It consists of an array of buffers that store data pages and
additional metadata (such as the name of the file and page location within
this file).
Page size is usually 8 kB (it can be configured at build time, but there is
usually no point in doing it).
All data page access operations go via buffer cache. If any process needs to
access a page, it first tries to find it in cache. If the page is not found there,
the process asks the operating system to read this page and loads it into
buffer cache. (Note that the OS can read the page from disk or find it in its
own cache.)
Once the page is in buffer cache, it can be accessed multiple times without
any overhead caused by system calls.
However, buffer cache (just like other shared memory structures) is
protected by locks to handle concurrent access. Even though locks are
implemented quite efficiently, buffer cache access is not as fast as direct
access to RAM. So in most cases, the less data is read and updated by a
query, the faster this query completes.
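The lookup-or-load behavior described above can be illustrated with a minimal sketch (this is not PostgreSQL source code; `read_page_from_os` is a made-up stand-in for a real OS read):

```python
def read_page_from_os(page_id):
    """Placeholder for an OS read (from disk or the OS page cache)."""
    return f"contents of page {page_id}"

class BufferCache:
    def __init__(self):
        self.buffers = {}   # page_id -> page contents
        self.reads = 0      # counts simulated OS reads

    def get_page(self, page_id):
        if page_id not in self.buffers:                  # cache miss:
            self.buffers[page_id] = read_page_from_os(page_id)
            self.reads += 1                              # ask the OS
        return self.buffers[page_id]                     # hit: no OS call

cache = BufferCache()
cache.get_page(1)
cache.get_page(1)   # second access: served from the cache
cache.get_page(2)
assert cache.reads == 2   # page 1 was read from the "OS" only once
```

Repeated access to the same page costs only a dictionary lookup here; in the real buffer cache it still requires locking, which is the overhead the next paragraph mentions.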
Eviction of rarely
accessed pages
a “dirty” buffer
is flushed to disk
another page is loaded
into the freed space
Buffer cache size is usually not big enough to hold the whole database. It is
limited both by the available RAM and by additional overhead that arises
when we try to increase it. Sooner or later, buffer cache will become
completely full while a new page needs to be read. In this case, page eviction is
performed.
The eviction algorithm selects a page in cache that has been used less
recently than others and replaces it with the new page. If the selected
page has been updated, it must first be flushed to disk so that these
changes are not lost (a buffer that contains an updated page is called
“dirty”).
This eviction algorithm is called LRU (Least Recently Used). It ensures that
actively used data is kept in cache. Such “hot” data is usually not abundant,
so a buffer cache of just the right size can significantly reduce the number
of system calls (and disk operations).
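A minimal LRU sketch in Python may clarify the idea (PostgreSQL's actual implementation is a clock-sweep approximation of LRU operating on shared memory, not this code; the dirty-buffer flush here is simulated by recording the page number):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = OrderedDict()   # page_id -> (contents, dirty)
        self.flushed = []              # pages written back on eviction

    def access(self, page_id, dirty=False):
        if page_id in self.buffers:
            contents, was_dirty = self.buffers.pop(page_id)
            dirty = dirty or was_dirty
        else:
            if len(self.buffers) >= self.capacity:   # cache is full:
                victim, (_, victim_dirty) = self.buffers.popitem(last=False)
                if victim_dirty:                     # dirty buffer must be
                    self.flushed.append(victim)      # flushed before reuse
            contents = f"page {page_id}"
        self.buffers[page_id] = (contents, dirty)    # most recent goes last

cache = LRUCache(capacity=2)
cache.access(1, dirty=True)
cache.access(2)
cache.access(1)   # page 1 becomes the most recently used
cache.access(3)   # evicts page 2 (clean, nothing to flush)
cache.access(4)   # evicts page 1 (dirty, flushed to "disk" first)
assert cache.flushed == [1]
```

The least recently used page is evicted first, and a dirty victim is flushed before its slot is reused, exactly as the text describes.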
Write-Ahead Log (WAL)
Problem: in case of a failure, all data from RAM will be lost
if not flushed to disk
a stream of information about all performed operations,
which allows repeating them if they are lost because of a failure
WAL entry is written to disk before the modified data
WAL protects
pages of tables, indexes, and other objects
transaction status information (xact)
WAL does not protect
temporary and unlogged tables
Buffer cache (like other buffers in RAM) boosts performance, but negatively
affects reliability. If the database system fails, all buffer cache contents will
be lost. In case of operating system or hardware failures, OS buffer contents
will also be cleared (but the operating system copes with it on its own).
To ensure fault tolerance, PostgreSQL uses write-ahead logging. For each
operation, a WAL entry is generated; it contains the essential information
required to repeat this operation. Such entry must be written to disk (or any
other persistent storage) before the actual data update (hence the “write-
ahead” log).
WAL protects all objects that are handled in RAM, such as tables or
indexes, as well as transaction status metadata.
WAL files do not store any data about temporary tables (such tables can be
accessed only by their owner within the current session or transaction) and
unlogged tables (such tables are just like regular ones, except that they are
not protected by WAL). In case of a failure, such tables are simply cleared.
They are used to speed up access to data that can be recovered by other
means.
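The write-ahead rule itself can be sketched in a few lines (an illustration only, not PostgreSQL code; the lists stand in for the durable WAL and the data files):

```python
wal = []     # stands in for the durable WAL file
disk = {}    # stands in for data files on disk
cache = {}   # stands in for dirty pages in the buffer cache

def update_page(page_id, new_value):
    wal.append(("update", page_id, new_value))   # WAL entry goes first ...
    cache[page_id] = new_value                   # ... then the change in RAM

def crash_and_recover():
    cache.clear()                     # RAM contents are lost in a failure
    for op, page_id, value in wal:    # replay every logged operation
        disk[page_id] = value         # restore the missing changes

update_page("A", 1)
update_page("B", 2)
crash_and_recover()
assert disk == {"A": 1, "B": 2}   # no logged change was lost
```

Because every change reaches the log before (or instead of) the data file, replaying the log reconstructs everything the crash wiped out of RAM.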
Checkpoint
Periodic flushing of all dirty buffers to disk
guarantees that all changes before the checkpoint are saved on disk
reduces the size of WAL required for recovery
Crash recovery
starts from the latest checkpoint
replays WAL entries one by one if the corresponding changes are missing
checkpoint → crash: WAL files required for recovery
When the PostgreSQL server is started after a failure, it enters recovery
mode. At this moment, the data stored on disk is inconsistent: some “hot”
pages may not have been flushed yet, even though they were updated before
other pages that had already been written to disk.
To restore consistency, PostgreSQL reads WAL and replays all WAL entries
one by one if the corresponding changes have not been flushed to disk.
Thus, it restarts all transactions and then aborts those that were not
registered in WAL as committed.
However, WAL size could become huge during server operation. It is
absolutely impossible to store all WAL entries and replay them all after a
failure. That’s why the database system periodically performs a checkpoint:
all dirty buffers are forced to disk (including xact buffers that store
transaction status metadata). It guarantees that all transaction changes that
had happened before the checkpoint are saved on disk.
A checkpoint can take a lot of time, and that’s OK. The “point” itself in the
sense of a particular moment marks the beginning of the process. But the
checkpoint is considered complete only after all dirty buffers that were
present at that moment are flushed to disk.
Crash recovery starts from the latest checkpoint, which allows PostgreSQL
to store only those WAL files that were written after the last completed
checkpoint.
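Why a checkpoint shortens recovery can be shown with a tiny sketch (illustration only; the real checkpoint also flushes dirty buffers, which is reduced to a comment here):

```python
wal = []
checkpoint_pos = 0   # index of the first WAL entry still required

def log_update(page_id, value):
    wal.append((page_id, value))

def checkpoint():
    global checkpoint_pos
    # ... all dirty buffers are flushed to disk here ...
    checkpoint_pos = len(wal)     # earlier entries can be recycled

def entries_to_replay():
    return wal[checkpoint_pos:]   # recovery starts at the checkpoint

log_update("A", 1)
log_update("B", 2)
checkpoint()
log_update("C", 3)
assert entries_to_replay() == [("C", 3)]
```

Everything logged before the completed checkpoint is already safely on disk, so only the tail of the log has to be kept and replayed.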
Synchronous mode
flushed at commit time
by backend
Asynchronous mode
flushed in the background
by walwriter
background workers
shared memory
buffer cache, xact, wal
Using write-ahead logging is more efficient than direct unbuffered disk
writes. First of all, the size of WAL entries is smaller than the size of the
whole data page. Second, WAL entries are written sequentially (and are
usually not read until a failure), so this process can be easily handled even
by HDDs.
You can also tune performance by changing some settings. If the data is
flushed at once (synchronously), it is guaranteed that a committed
transaction won’t disappear. But flushing is quite expensive, and the
backend performing the commit has to wait for its completion. To keep WAL
entries from getting stuck in the operating system cache, fsync is called:
PostgreSQL assumes it is enough to ensure that the data reaches
persistent storage.
That’s why PostgreSQL also provides a delayed (asynchronous) commit
mode. In this case, commits are gradually flushed to disk by the walwriter
background worker (with some delay). It is less reliable, but provides better
performance in return. In case of a failure, a consistent recovery is still
guaranteed, but some of the recently committed transactions may be lost.
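The trade-off between the two modes can be sketched as follows (illustration only; in a real server the choice is made per transaction with the `synchronous_commit` parameter, and the background flush is done by walwriter):

```python
wal_buffer = []    # WAL entries still in memory
wal_on_disk = []   # WAL entries already flushed (durable)

def flush_wal():                 # done by the walwriter when asynchronous
    wal_on_disk.extend(wal_buffer)
    wal_buffer.clear()

def commit(txid, synchronous):
    wal_buffer.append(txid)
    if synchronous:              # the backend waits for the flush itself
        flush_wal()

commit(1, synchronous=True)
commit(2, synchronous=False)     # returns before the entry is flushed
wal_buffer.clear()               # crash: in-memory entries are lost
assert wal_on_disk == [1]        # the asynchronous commit was lost
```

The synchronous commit survives the crash because the backend did not report success until its WAL entry was durable; the asynchronous one traded that guarantee for a faster commit.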
Buffer cache boosts performance by reducing the number of
disk access operations
Reliability is ensured by write-ahead logging
WAL size is reduced by using checkpoints
WAL is a convenient mechanism used in many scenarios
to perform recovery after a failure
during backup
for replicating data between servers
1. Check how buffer cache is used when updating a single row
in a regular table and in a temporary table. Try to explain the difference.
2. Create an unlogged table and insert several rows into it. Simulate
a system failure by stopping the server in the immediate mode
as shown in the demo.
Start the server and check the table state.
In the server log, find the entries related to recovery.
Task 1. Temporary tables look just like regular ones, but their lifetime is
limited to the current session. Likewise, such tables are visible only in the
current session.
Use the following command as shown in the demo:
EXPLAIN (analyze, buffers, costs off, timing off)
Task 2. To stop the server in the immediate mode, run:
sudo pg_ctlcluster 12 main stop -m immediate --skip-systemctl-redirect
The --skip-systemctl-redirect parameter is required because PostgreSQL
has been installed on Ubuntu from a package. The server is managed by
the pg_ctlcluster command that calls the systemctl utility, and the specified
mode gets lost by the time pg_ctl is started. This parameter allows us to do
without systemctl and pass the command to pg_ctl directly.