Architecture

Buffer Cache and WAL

Authors Egor Rogov, Pavel Luzanov, Ilya Bashtanov, Igor Gnatyuk

Translated by: Liudmila Mantrova, Alexander Meleshko, Elena Sharafutdinova

Photo by: Oleg Bartunov (Phu monastery, Bhrikuti Peak, Nepal)

Use of Course Materials

Non-commercial use of course materials (presentations, demonstrations) is

allowed without restrictions. Commercial use is possible only with the written

permission of Postgres Professional. It is prohibited to make changes to the

course materials.

Feedback

Please send your feedback, comments and suggestions to:

edu@postgrespro.ru

Disclaimer

Postgres Professional assumes no responsibility for any damages and losses,

including loss of income, caused by direct or indirect, intentional or accidental

use of course materials. Postgres Professional company specifically disclaims

any warranties on course materials. Course materials are provided “as is,” and

Postgres Professional company has no obligations to provide maintenance,

support, updates, enhancements, or modifications.

Topics

Buffer Cache

Replacement Algorithm

Write-Ahead Log

Checkpoint

Processes Related to the Buffer Cache and WAL

Buffer array

data page (8 KB)

additional information

“Dirty” buffers

asynchronous write

Locks in memory

for shared access

Buffer Cache

PostgreSQL

postmaster

backend

background processes

shared memory

buffer cache

dirty

buffer

cache

The buffer cache is used to smooth out the difference between the RAM and

disk speed. It consists of an array of buffers which contain data pages and

some additional information (for example, the file name and the position of

the page inside this file).

The page size is usually 8 KB; the size can only be changed when building

PostgreSQL.

Any work with data pages goes through the buffer cache. If any process is

going to work with the page, it first tries to find it in the cache. If the page

does not exist, the process requests the operating system to read this page

and places it in the buffer cache. (Note that the OS can read the page either

from disk or from its own cache.)

After the page is written to the buffer cache, it can be accessed repeatedly

without the overhead of operating system calls.

If a process has changed the data in the page, the corresponding buffer

becomes “dirty”. The modified page must be written on disk, but for

performance reasons, the recording occurs asynchronously and may be

delayed.

The buffer cache, like other shared memory structures, is protected by locks

to control concurrent access. Although locks are implemented effectively,

access to the buffer cache is not nearly as fast as simply accessing RAM.

Therefore, in general, the less data a query reads and modifies, the faster it

will work.

Replacement

Least Recently Used

replacement

dirty buffer is

written on disk

another page is read

into the vacant space

PostgreSQL

postmaster

backend

background processes

shared memory

buffer cache

cache

The buffer cache size is usually not so large as to fit the entire database. It is

limited by the available RAM. Also, the larger the buffer cache, the greater

the overhead. Therefore, when reading the next page, sooner or later the

buffer cache has to run out of space. In this case, page replacement

happens.

The replacement algorithm selects a page in the cache that has been used

less often than others. If the selected buffer is dirty, the page is written on

disk first to store the changes made to it. Then, a new page is written into

the buffer.

This replacement is called LRU (Least Recently Used). It keeps the most

frequently accessed data in the cache. Such “hot” blocks of data are not

very common, and this approach helps to significantly reduce the number of

requests to OS (and disk operations), provided enough cache memory.

Write-ahead Log (WAL)

Problem: when a crash occurs, data from RAM that is not

written on disk is lost

WAL

stream of records of the actions being performed;

can be used to redo the steps lost during the crash

records are written to disk earlier than the changed data

WAL tracks changes to

pages in tables, indexes and other objects

transaction status (clog)

WAL does not track changes to

temporary and unlogged tables

Having a buffer cache (and other buffers in RAM) increases performance at

the cost of reliability. When a crash happens, all buffer cache content is lost.

If the crash occurs on the OS or hardware level, the content of OS buffers

will also be lost (the OS may have its own failsafes for this).

To increase reliability, PostgreSQL uses the Write-ahead log. When

performing any operation, the WAL records minimum necessary information

about the operation to be able to perform it again. The record must be

written into the disk (or another persistent storage) before the data modified

by the operation is (that is why it is called Write-ahead log).

WAL files are located in the PGDATA/pg_wal directory.

All objects that are being worked on in RAM have their operations logged.

These include tables, indexes and other objects, as well as transaction

statuses. Operations with temporary tables (tables which exist only during

the scope of a session or a transaction and are only available to the user

who has created them) are not logged. You can also set a regular table to be

explicitly unlogged. The table will be quicker to work with, but will be wiped

on crash.

https://postgrespro.com/docs/postgresql/16/wal-intro

Checkpoint

Regular flushing of all dirty buffers to disk

ensures that all data changes before the checkpoint get to the disk

limits the size of the WAL required for recovery

Crash recovery

starts from the last checkpoint

WAL records are replayed one-by-one to restore data

xid

checkpoint

checkpoint crash

required WAL files

recovery

start

When PostgreSQL crashes, it enters the recovery mode on the next start.

The data on disk at this point is inconsistent. Changes to hot pages were in

the buffer cache, and are now lost, while some of the later changes have

been flushed to disk already.

To restore consistency, PostgreSQL sequentially reads the WAL records,

replaying the changes that did not make it to the disk. This way, the state of

all transactions at the time of the crash is restored. Then, any transactions

that have not been logged as committed are considered aborted.

However, logging all changes throughout a server’s lifetime and replaying

everything from day one after each crash is impractical, if not impossible.

Instead, PostgreSQL uses checkpoints. Every now and then, it forces all

dirty buffers to disk (including clog buffers, which store transaction statuses).

A checkpoint is the moment in time when the flushing of all data to disk is

started. However, you only have a valid checkpoint when the flushing of all

such buffers is complete. It ensures that all data changes up to this point are

safe in persistent memory.

In production environments with a large buffer cache, a checkpoint can flush

many dirty buffers, so the server spreads this flushing over time to smooth

out the I/O load.

When a crash occurs, recovery is started from the last completed

checkpoint. Consequently, it is sufficient to store WAL files only as far back

as the last completed checkpoint goes.

Performance

Synchronous mode

write on commit

backend

Asynchronous mode

background write

walwriter

PostgreSQL

backend

postmaster

checkpointerwalwriter

shared memory

buffer cacheclogWAL

WAL

transaction

status

fsync

cache

The WAL approach is faster than working directly with disk without a buffer

cache. Firstly, a WAL record is smaller than an entire page of data.

Secondly, the WAL is written sequentially (and usually not read until a crash

occurs), which is better for basic hard disk drives.

Performance can also be managed via configuration settings. If the records

are stored to disk immediately (synchronous mode), this guarantees that the

committed transaction will not be lost. But recording to disk is expensive and

forces the committing backend process to wait in line. To prevent WAL

records from getting “stuck” in the OS cache, PostgreSQL relies on call of

the fsync function, which forces the data into persistent storage.

There is also asynchronous mode, which has a background process

walwriter constantly flushing WAL records to disk, with a certain delay. It is

more efficient at the cost of some reliability, but still ensures consistency

after crash recovery.

In fact, both modes work together. WAL records of a long transaction are

written asynchronously (to free up WAL buffers). And if a page is getting

flushed to disk and the corresponding WAL record is not there yet, it will

immediately be flushed in synchronous mode.

Main Processes

WAL writer

Checkpointer

flush all

dirty buffers

Background writer

flush some

dirty buffers

Backends

flush replaced

dirty buffer

PostgreSQL

backend

postmaster

checkpointer bgwriterwalwriter

shared memory

buffer cacheclogWAL

cache

Let’s take a step back and look at the processes that maintain the buffer

cache and the WAL.

First, there is walwriter. This process writes WAL records to disk in

asynchronous mode. In synchronous mode, this job is handled by the

backend that commits the transaction.

Second, checkpointer, the checkpoint process. It periodically flushes all dirty

buffers to disk.

Third, bgwriter (or background writer). It operates similarly to checkpointer,

but it only flushes some of the dirty buffers, specifically, those that are most

likely to be replaced soon. It frees up buffer space so that when backend

selects a buffer to put a new page in, it does not have to flush the old

contents of the buffer to disk itself.

Fourth, there are backends that put data into the buffer cache. Whenever a

buffer being replaced is still dirty (despite the efforts of checkpointer and

bgwriter), the backend will flush it to disk.

WAL Levels

Minimal

guarantees crash recovery

Replica (default)

backup

replication: transmit the WAL on another server and replay it there

Logical

logical replication: information about adding, changing,

and deleting table rows

WAL was developed as a data protection tool to mitigate the risk of data loss

due to crashes.

However, the WAL mechanism turned out to have other applications, if it is

supplemented with additional information.

The data stored in the WAL is controlled by the wal_level parameter.

●

The minimal level is sufficient to recover after a crash, and nothing else.

●

The replica level stores additional information that allows WAL to be used

for backup and replication. During replication, WAL records are

transmitted to another server and applied there, creating an exact copy

(replica) of the original server.

●

At the logical level, information is added to the WAL that allows decoding

“physical” WAL records and forming “logical” records of adding, changing

and deleting table rows. This enables logical replication (see

corresponding lessons of DEV2 and DBA3 courses for details).

Takeaways

Buffer cache increases performance by reducing the number of

disk operations

WAL ensures reliability

WAL size is kept in check by checkpoints

WAL has multiple uses:

crash recovery

backup

replication

Practice

1. Using the OS tools, find the processes responsible for the buffer

cache and the WAL.

2. Stop PostgreSQL in fast mode; start it again. Check the server

message log.

3. Now stop PostgreSQL in immediate mode; start it again.

Check the server message log and compare with the previous

one.

2. To stop in fast mode, use the command

pg_ctlcluster 16 main stop

This makes the server abort all open connections and perform a checkpoint

before shutting down, so that all data is flushed to disk and consistent. In

this mode, the shutdown may take some time, but on startup the server will

be good to go right away.

3. To stop in immediate mode, use the command

pg_ctlcluster 16 main stop -m immediate --skip-systemctl-redirect

The server will also abort open connections, but will not perform a

checkpoint. Data on disk will be inconsistent, like after a crash. In this mode,

the server shuts down quickly, but will have to restore data consistency

using WAL on startup.

If your PostgreSQL is compiled from source code, the fast stop command

will be

pg_ctl stop

and the immediate stop command will be

pg_ctl stop -m immediate