Buffer cache and WAL
© Postgres Professional, 2015–2022
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Translated by Alexander Meleshko
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Please send your feedback, comments and suggestions to:
Postgres Professional assumes no responsibility for any damages and
losses, including loss of income, caused by direct or indirect, intentional or
accidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Buffer cache overview
Replacement algorithm
Write-ahead log
Processes related to the buffer cache and WAL
Buffer cache
[Diagram: background processes access the buffer cache in shared memory]
Buffer array: each buffer holds a data page (8 KB) plus additional information
“Dirty” buffers: written to disk asynchronously
Locks in memory for shared access
The buffer cache is used to smooth out the difference between the RAM and
disk speed. It consists of an array of buffers that contain data pages and
some additional information (for example, the file name and the position of
the page inside this file).
The page size is usually 8 KB; the size can only be changed when building PostgreSQL from source.
Any work with data pages goes through the buffer cache. If any process is
going to work with the page, it first tries to find it in the cache. If the page
does not exist, the process requests the operating system to read this page
and places it in the buffer cache. (Note that the OS can read the page either
from disk or from its own cache.)
After the page gets into the buffer cache, it can be accessed repeatedly
without the overhead of operating system calls.
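The lookup path described above can be sketched in miniature. The following is a hedged Python model, not PostgreSQL internals (the class and names are invented for illustration); it shows that only the first access to a page costs a request to the operating system:

```python
# Toy model of a buffer cache lookup: hit = no system call, miss = ask the OS.

class BufferCache:
    def __init__(self):
        self.buffers = {}          # (file, page_no) -> page bytes
        self.os_reads = 0          # number of requests passed to the OS

    def read_page(self, file, page_no):
        key = (file, page_no)
        if key in self.buffers:    # cache hit: served straight from RAM
            return self.buffers[key]
        self.os_reads += 1         # cache miss: the OS reads the page
        page = b"\x00" * 8192      # stand-in for an 8 KB data page
        self.buffers[key] = page   # keep the page for repeated access
        return page

cache = BufferCache()
cache.read_page("base/16384/1259", 0)   # first access goes to the OS
cache.read_page("base/16384/1259", 0)   # repeated access is served from cache
print(cache.os_reads)                   # -> 1
```

Note that, just as in the text, the model cannot tell whether the OS served the miss from disk or from its own cache; either way the subsequent accesses avoid the system call.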
If a process has changed the data in the page, the corresponding buffer is called “dirty”. The modified page must be written to disk, but for performance reasons the write occurs asynchronously and may be deferred for some time.
The buffer cache, like other shared memory structures, is protected by locks to control concurrent access. Although the locks are implemented efficiently, access to the buffer cache is not nearly as fast as simply accessing RAM.
Therefore, in general, the less data a query reads and modifies, the faster it
will work.
Least Recently Used
[Diagram: a dirty buffer is written to disk, and another page is read into the vacant space; background processes work with the buffer cache in shared memory]
The buffer cache size is usually not so large as to fit the entire database. It is limited by the available RAM; besides, the larger the buffer cache, the greater the overhead of maintaining it. Therefore, when reading yet another page, sooner or later the buffer cache will run out of space. In this case, page replacement comes into play.
Page replacement selects the page in the cache that has been used least recently. If the selected buffer is dirty, the page is written to disk first to preserve the changes. Then a new page takes its place in the buffer.
This is called Least Recently Used replacement, or LRU. It keeps the most frequently accessed data in the cache. Such “hot” blocks of data are not very common, and this approach significantly reduces the number of requests to the OS (and disk operations), provided the cache is large enough.
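The replacement policy just described can be modeled in a few lines. This is a minimal Python sketch of the idea on the slide, with invented names and a fixed-size cache; PostgreSQL's actual implementation uses a clock-sweep approximation rather than strict LRU:

```python
from collections import OrderedDict

# Toy LRU cache: evict the least recently used page; a dirty victim
# must be "written to disk" (recorded in `evicted` here) before eviction.

class LRUCache:
    def __init__(self, size):
        self.size = size
        self.pages = OrderedDict()   # page -> dirty flag, oldest first
        self.evicted = []            # dirty pages written out on eviction

    def access(self, page, dirty=False):
        if page in self.pages:
            self.pages.move_to_end(page)               # now most recently used
            self.pages[page] = self.pages[page] or dirty
        else:
            if len(self.pages) >= self.size:
                victim, was_dirty = self.pages.popitem(last=False)
                if was_dirty:
                    self.evicted.append(victim)        # flush before reuse
            self.pages[page] = dirty

cache = LRUCache(size=2)
cache.access("A", dirty=True)
cache.access("B")
cache.access("A")            # refresh A, so B becomes least recently used
cache.access("C")            # evicts B (clean, no write needed)
cache.access("D")            # evicts A (dirty: must be written first)
print(list(cache.pages))     # -> ['C', 'D']
print(cache.evicted)         # -> ['A']
```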
Write-ahead log (WAL)
Problem: when a crash occurs, data from RAM that is not
written on disk is lost
a stream of records of the actions being performed; can be used to
re-trace the steps lost during the crash
the records are stored on disk before the actual changes are written
The log tracks changes to
pages in tables, indexes and other objects
transaction status (clog)
The log does not track changes to
temporary and unlogged tables
Having a buffer cache (and other RAM buffers) increases performance at
the cost of reliability. When a crash happens, all buffer cache content is lost.
If the crash occurs on the OS or hardware level, the content of OS buffers
will also be lost (the OS may have its own failsafes for this).
To increase reliability, PostgreSQL uses the Write-ahead log. When performing any operation, PostgreSQL logs the minimum information necessary to be able to perform the operation again. The record must be written to non-volatile storage before the data modified by the operation is written (that's why it's called the Write-ahead log).
WAL files are located in the PGDATA/pg_wal directory.
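The ordering rule, log record first and data page second, can be illustrated with a toy model. All names here are invented for illustration; this is not how PostgreSQL is actually implemented:

```python
# Toy model of the write-ahead rule: a change is described by a durable
# log record before the modified page itself may be flushed.

wal = []            # durable log: records already on "disk"
disk_pages = {}     # durable data pages
cache = {}          # volatile in-memory page versions

def modify(page, new_value):
    cache[page] = new_value            # change the page in RAM only
    wal.append((page, new_value))      # log record describing the change

def flush(page):
    # The write-ahead rule: the record must already be in the log.
    assert any(p == page for p, _ in wal), "WAL record must be written first"
    disk_pages[page] = cache[page]

modify("page 0", "new row")
# Suppose a crash happens before flush(): disk_pages is stale,
# but the log lets us redo the lost change.
recovered = dict(disk_pages)
for page, value in wal:
    recovered[page] = value            # replay the log
print(recovered)                       # -> {'page 0': 'new row'}
```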
All objects that are being worked on in RAM have their operations logged.
These include tables, indexes and other objects, and transaction statuses.
Operations on temporary tables (tables that exist only within the scope of a session or a transaction and are visible only to the user who created them) aren't logged. You can also explicitly make a regular table unlogged. Such a table is quicker to work with, but its contents are wiped out after a crash.
Regular flushing of all dirty buffers to disk
ensures that all data changes before the checkpoint get to the disk
limits the size of the log required for recovery
Crash recovery
starts from the last checkpoint
WAL records are replayed one-by-one to restore data consistency
[Diagram: a timeline with a checkpoint followed by a crash; the WAL files required for recovery span the interval from the checkpoint to the crash]
When PostgreSQL crashes, it enters the recovery mode on the next start.
The data on disk at this point is inconsistent. Changes to hot pages were in
the buffer cache and are now lost, while some of the later changes have
been flushed to disk already.
To restore consistency, PostgreSQL reads the WAL and sequentially replays its records, repeating the changes that did not make it to disk. This way, the state of all transactions at the time of the crash is restored. Then, any transactions that haven't been logged as committed are aborted.
However, logging all changes throughout a server’s lifetime and replaying
everything from day 1 after each crash is impractical, if not impossible.
Instead, PostgreSQL uses checkpoints. Every now and then, it forces all
dirty buffers to disk (including clog buffers with transaction statuses) to
ensure that all data changes up to this point are safe in non-volatile memory.
This state is called a checkpoint. The “point” in checkpoint is the moment in time when the flushing of all data to disk is started. However, you only have a valid checkpoint when the flushing is complete, and that may take a while.
Now, when a crash occurs, recovery can start from the latest checkpoint. Consequently, it's sufficient to store WAL files only as far back as the last checkpoint goes.
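A minimal sketch of this recovery procedure, assuming a simplified WAL of key-value changes with an explicit checkpoint record (all names are invented for illustration):

```python
# Toy model of checkpoint-based recovery: only the WAL records after
# the last checkpoint need to be replayed.

wal = [
    ("set", "A", 1),
    ("set", "B", 2),
    ("checkpoint",),        # all dirty buffers were flushed at this point
    ("set", "A", 3),
    ("set", "C", 4),        # the crash happens after this record
]

# State on disk is consistent as of the checkpoint:
disk = {"A": 1, "B": 2}

# Find the last checkpoint and replay everything after it.
last_cp = max(i for i, rec in enumerate(wal) if rec[0] == "checkpoint")
for _, key, value in wal[last_cp + 1:]:
    disk[key] = value

print(disk)   # -> {'A': 3, 'B': 2, 'C': 4}
```

Everything before the checkpoint record could be discarded without affecting the result, which is exactly why older WAL files need not be kept.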
Synchronous mode: write on commit
Asynchronous mode: write in background
[Diagram: shared memory holds the buffer cache, clog, and WAL buffers]
The WAL approach is faster than working directly with disk without a buffer
cache. Firstly, a WAL record is smaller than an entire page of data.
Secondly, the log is written sequentially (and usually not read until a crash
occurs), which is better for basic hard disk drives.
Various configuration settings also affect WAL performance. If the records are flushed to disk immediately (synchronous mode), it is guaranteed that a committed operation will reach the disk. But writing to disk is expensive and forces the committing process to wait in line. To prevent log entries from getting “stuck” in the OS cache, PostgreSQL calls the fsync function, which forces the data into non-volatile storage.
There is also asynchronous mode, which has a background process
(walwriter) constantly sending WAL records to disk with a certain delay. It’s
more efficient at the cost of some reliability, but still ensures consistency
after crash recovery.
In fact, both modes work together. The log records of long transactions are written asynchronously (to free up WAL buffers), and if a page is about to be flushed to disk while its corresponding log record has not reached the disk yet, that record is immediately written in synchronous mode.
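The difference between the two modes can be sketched with an ordinary file, letting fsync stand in for flushing WAL to stable storage. The function names here are invented; real PostgreSQL behavior is governed by the synchronous_commit parameter:

```python
import os
import tempfile

# Toy model: a synchronous commit forces the record to stable storage with
# fsync before returning; an asynchronous commit leaves the record buffered
# until a background "walwriter" flushes it a bit later.

wal_path = os.path.join(tempfile.mkdtemp(), "wal")
wal_file = open(wal_path, "ab", buffering=0)

def commit(record, synchronous=True):
    wal_file.write(record + b"\n")     # append the commit record
    if synchronous:
        os.fsync(wal_file.fileno())    # wait until it is on stable storage

def walwriter_flush():
    os.fsync(wal_file.fileno())        # background flush, done with a delay

commit(b"COMMIT tx1", synchronous=True)    # durable when commit() returns
commit(b"COMMIT tx2", synchronous=False)   # fast, but lost if a crash hits now
walwriter_flush()                          # now tx2 is durable too

with open(wal_path, "rb") as f:
    print(f.read().count(b"COMMIT"))       # -> 2
```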
Main processes
walwriter: writes WAL records to disk
checkpointer: flushes all dirty buffers
bgwriter: flushes some dirty buffers
backends: flush the dirty buffer being replaced, if needed
[Diagram: checkpointer, bgwriter, and walwriter work with the buffer cache, clog, and WAL in shared memory]
Let’s take a step back and look at the processes that maintain the buffer
cache and the WAL.
First, there is walwriter. This process writes WAL records to disk in
asynchronous mode. In synchronous mode, this job is handled by the
process that commits the transaction.
Second, checkpointer, the checkpoint process. It periodically flushes all dirty
buffers to disk.
Third, bgwriter (or background writer). It operates similarly to checkpointer, but it only flushes some of the dirty buffers, prioritizing those at high risk of being replaced soon. It frees up buffer space in advance, so that when a backend selects a buffer to put a new page in, it doesn't have to flush the old contents of the buffer to disk itself.
Fourth, there are the backends that put data into the buffer cache.
Whenever the buffer selected for replacement is still dirty (despite the efforts of checkpointer and bgwriter), the backend itself will flush it to disk.
Log levels
Minimal: guarantees crash recovery
Replica (default): supports replication, that is, transfer and replay of the log on another server
Logical: supports logical replication, providing information about adding, changing, and deleting table rows
WAL was developed as a data protection tool to mitigate the risk of data loss
due to crashes.
However, the WAL mechanism turned out to have other applications, if its
records are supplemented with additional info.
The amount of data stored in each WAL record is controlled by the wal_level parameter.
The minimal level is sufficient to recover after a crash, and nothing else.
The replica level stores additional information that allows it to be used for
backup (see the Backup module) and replication (see the Replication
module). During replication, WAL records are streamed to the replica and
applied there, creating an exact copy of the original server.
At the logical level, information is added to the log that allows decoding
“physical” log entries and forming “logical” records of adding, changing
and deleting table rows. This is logical replication (also discussed in the
“Replication” module).
Buffer cache increases performance by reducing the number of
disk operations
WAL increases reliability
WAL size is kept in check by checkpoints
WAL has multiple uses:
crash recovery
backup
replication
1. Using the OS tools, find the processes responsible for the buffer
cache and the WAL.
2. Stop PostgreSQL in fast mode; start it again.
Check the server message log.
3. Now stop PostgreSQL in immediate mode; start it again.
Check the server message log and compare with the
previous one.
2. To stop in fast mode, use the command
pg_ctlcluster 13 main stop
This makes the server abort all open connections and perform a checkpoint
before shutting down, so that all data is flushed to disk and consistent.
In this mode, the shutdown may take some time, but on startup the server
will be good to go right away.
3. To stop in immediate mode, use the command
pg_ctlcluster 13 main stop -m immediate --skip-systemctl-redirect
The server will also abort open connections, but will not perform a
checkpoint. Data on disk will be inconsistent, like after a crash. In this mode,
the server shuts down quickly, but will enter recovery mode on startup and
will use the WAL to reach consistency.
If your PostgreSQL is compiled from source code, the fast stop command
will be
pg_ctl stop
and the immediate stop command will be
pg_ctl stop -m immediate