Buffer Cache and WAL
© Postgres Professional, 2017–2021
Authors: Egor Rogov, Pavel Luzanov
Translated by Liudmila Mantrova
Usage of Course Materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed on an unrestricted basis. Commercial use is only possible with prior
written permission of Postgres Professional company. Modification of course
materials is forbidden.
Contact Us
Please send your feedback to:
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Buffer cache overview
Eviction algorithm
Write-ahead log
Buffer Cache
An array of buffers
data page (8 kB)
additional metadata
Memory locks
for shared access
background workers
shared memory
buffer cache
Buffer cache is used to even out the speed difference between disk and
RAM access. It consists of an array of buffers that store data pages and
additional metadata (such as the name of the file and page location within
this file).
Page size is usually 8 kB (it can be configured at build time, but there is
usually no point in doing it).
All data page access operations go via buffer cache. If any process needs to
access a page, it first tries to find it in cache. If the page is not found there,
the process asks the operating system to read this page and loads it into
buffer cache. (Note that the OS can read the page from disk or find it in its
own cache.)
Once the page is in buffer cache, it can be accessed multiple times without
any overhead caused by system calls.
However, buffer cache (just like other shared memory structures) is
protected by locks to handle concurrent access. Even though locks are
implemented quite efficiently, buffer cache access is not as fast as direct
access to RAM. So in most cases, the less data is read and updated by a
query, the faster this query completes.
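The lookup-or-load behavior described above can be illustrated with a minimal sketch (this is not PostgreSQL source code; `read_page_from_os` is a made-up stand-in for a real OS read):

```python
def read_page_from_os(page_id):
    """Placeholder for an OS read (from disk or the OS page cache)."""
    return f"contents of page {page_id}"

class BufferCache:
    def __init__(self):
        self.buffers = {}   # page_id -> page contents
        self.reads = 0      # counts simulated OS reads

    def get_page(self, page_id):
        if page_id not in self.buffers:                  # cache miss:
            self.buffers[page_id] = read_page_from_os(page_id)
            self.reads += 1                              # ask the OS
        return self.buffers[page_id]                     # hit: no OS call

cache = BufferCache()
cache.get_page(1)
cache.get_page(1)   # second access: served from the cache
cache.get_page(2)
assert cache.reads == 2   # page 1 was read from the "OS" only once
```

Repeated access to the same page costs only a dictionary lookup here; in the real buffer cache it still requires locking, which is the overhead the next paragraph mentions.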
Eviction of rarely
accessed pages
a “dirty” buffer
is flushed to disk
another page is loaded
into the freed space
Buffer cache size is usually not big enough to hold the whole database. It is
limited both by the available RAM and by additional overhead that arises
when we try to increase it. Sooner or later, buffer cache will become
completely full while a new page needs to be read. In this case, page eviction is
performed.
The eviction algorithm selects a page in cache that has been used less
recently than others and replaces it with the new page. If the selected
page has been updated, it must first be flushed to disk so that these
changes are not lost (a buffer that contains an updated page is called
“dirty”).
This eviction algorithm is called LRU (Least Recently Used). It ensures that
actively used data is kept in cache. Such “hot” data is usually not abundant,
so a buffer cache of just the right size can significantly reduce the number
of system calls (and disk operations).
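A minimal LRU sketch in Python may clarify the idea (PostgreSQL's actual implementation is a clock-sweep approximation of LRU operating on shared memory, not this code; the dirty-buffer flush here is simulated by recording the page number):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffers = OrderedDict()   # page_id -> (contents, dirty)
        self.flushed = []              # pages written back on eviction

    def access(self, page_id, dirty=False):
        if page_id in self.buffers:
            contents, was_dirty = self.buffers.pop(page_id)
            dirty = dirty or was_dirty
        else:
            if len(self.buffers) >= self.capacity:   # cache is full:
                victim, (_, victim_dirty) = self.buffers.popitem(last=False)
                if victim_dirty:                     # dirty buffer must be
                    self.flushed.append(victim)      # flushed before reuse
            contents = f"page {page_id}"
        self.buffers[page_id] = (contents, dirty)    # most recent goes last

cache = LRUCache(capacity=2)
cache.access(1, dirty=True)
cache.access(2)
cache.access(1)   # page 1 becomes the most recently used
cache.access(3)   # evicts page 2 (clean, nothing to flush)
cache.access(4)   # evicts page 1 (dirty, flushed to "disk" first)
assert cache.flushed == [1]
```

The least recently used page is evicted first, and a dirty victim is flushed before its slot is reused, exactly as the text describes.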
Write-Ahead Log (WAL)
Problem: in case of a failure, all data from RAM will be lost
if not flushed to disk
a stream of information about all performed operations,
which allows repeating them if they are lost because of a failure
WAL entry is written to disk before the modified data
WAL protects
pages of tables, indexes, and other objects
transaction status information (xact)
WAL does not protect
temporary and unlogged tables
Buffer cache (like other buffers in RAM) boosts performance, but negatively
affects reliability. If the database system fails, all buffer cache contents will
be lost. In case of operating system or hardware failures, OS buffer contents
will also be cleared (but the operating system copes with it on its own).
To ensure fault tolerance, PostgreSQL uses write-ahead logging. For each
operation, a WAL entry is generated; it contains the essential information
required to repeat this operation. Such entry must be written to disk (or any
other persistent storage) before the actual data update (hence the “write-
ahead” log).
WAL protects all objects that are handled in RAM, such as tables or
indexes, as well as transaction status metadata.
WAL files do not store any data about temporary tables (such tables can be
accessed only by their owner within the current session or transaction) and
unlogged tables (such tables are just like regular ones, except that they are
not protected by WAL). In case of a failure, such tables are simply cleared.
They are used to speed up access to data that can be recovered by other
means.
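The write-ahead rule itself can be sketched in a few lines (an illustration only, not PostgreSQL code; the lists stand in for the durable WAL and the data files):

```python
wal = []     # stands in for the durable WAL file
disk = {}    # stands in for data files on disk
cache = {}   # stands in for dirty pages in the buffer cache

def update_page(page_id, new_value):
    wal.append(("update", page_id, new_value))   # WAL entry goes first ...
    cache[page_id] = new_value                   # ... then the change in RAM

def crash_and_recover():
    cache.clear()                     # RAM contents are lost in a failure
    for op, page_id, value in wal:    # replay every logged operation
        disk[page_id] = value         # restore the missing changes

update_page("A", 1)
update_page("B", 2)
crash_and_recover()
assert disk == {"A": 1, "B": 2}   # no logged change was lost
```

Because every change reaches the log before (or instead of) the data file, replaying the log reconstructs everything the crash wiped out of RAM.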
Checkpoint
Periodic flushing of all dirty buffers to disk
guarantees that all changes before the checkpoint are saved on disk
reduces the size of WAL required for recovery
Crash recovery
starts from the latest checkpoint
replays WAL entries one by one if the corresponding changes are missing
checkpoint → crash: WAL files required for recovery
When the PostgreSQL server is started after a failure, it enters recovery
mode. At this moment, the data stored on disk is inconsistent: some “hot”
pages may not have been flushed yet, even though they were updated before
other pages that had already been written to disk.
To restore consistency, PostgreSQL reads WAL and replays all WAL entries
one by one if the corresponding changes have not been flushed to disk.
Thus, it restarts all transactions and then aborts those that were not
registered in WAL as committed.
However, WAL size could become huge during server operation. It is
absolutely impossible to store all WAL entries and replay them all after a
failure. That’s why the database system periodically performs a checkpoint:
all dirty buffers are forced to disk (including xact buffers that store
transaction status metadata). It guarantees that all transaction changes that
had happened before the checkpoint are saved on disk.
A checkpoint can take a lot of time, and that’s OK. The “point” itself in the
sense of a particular moment marks the beginning of the process. But the
checkpoint is considered complete only after all dirty buffers that were
present at that moment are flushed to disk.
Crash recovery starts from the latest checkpoint, which allows PostgreSQL
to store only those WAL files that were written after the last completed
checkpoint.
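Why a checkpoint shortens recovery can be shown with a tiny sketch (illustration only; the real checkpoint also flushes dirty buffers, which is reduced to a comment here):

```python
wal = []
checkpoint_pos = 0   # index of the first WAL entry still required

def log_update(page_id, value):
    wal.append((page_id, value))

def checkpoint():
    global checkpoint_pos
    # ... all dirty buffers are flushed to disk here ...
    checkpoint_pos = len(wal)     # earlier entries can be recycled

def entries_to_replay():
    return wal[checkpoint_pos:]   # recovery starts at the checkpoint

log_update("A", 1)
log_update("B", 2)
checkpoint()
log_update("C", 3)
assert entries_to_replay() == [("C", 3)]
```

Everything logged before the completed checkpoint is already safely on disk, so only the tail of the log has to be kept and replayed.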
Synchronous mode
flushed at commit time
by backend
Asynchronous mode
flushed in the background
by walwriter
background workers
shared memory
buffer cache, xact, wal
Using write-ahead logging is more efficient than direct unbuffered disk
writes. First of all, the size of WAL entries is smaller than the size of the
whole data page. Second, WAL entries are written sequentially (and are
usually not read until a failure), so this process can be easily handled even
by HDDs.
You can also tune performance by changing some settings. If the data is
flushed at once (synchronously), it is guaranteed that a committed
transaction won’t disappear. But flushing is quite expensive, and the
backend performing the commit has to wait for its completion. To keep WAL
entries from getting stuck in the operating system cache, fsync is called:
PostgreSQL assumes it is enough to ensure that the data reaches
persistent storage.
That’s why PostgreSQL also provides a delayed (asynchronous) commit
mode. In this case, commits are gradually flushed to disk by the walwriter
background worker (with some delay). It is less reliable, but provides better
performance in return. In case of a failure, a consistent recovery is still
guaranteed, but some of the recently committed transactions may be lost.
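The trade-off between the two modes can be sketched as follows (illustration only; in a real server the choice is made per transaction with the `synchronous_commit` parameter, and the background flush is done by walwriter):

```python
wal_buffer = []    # WAL entries still in memory
wal_on_disk = []   # WAL entries already flushed (durable)

def flush_wal():                 # done by the walwriter when asynchronous
    wal_on_disk.extend(wal_buffer)
    wal_buffer.clear()

def commit(txid, synchronous):
    wal_buffer.append(txid)
    if synchronous:              # the backend waits for the flush itself
        flush_wal()

commit(1, synchronous=True)
commit(2, synchronous=False)     # returns before the entry is flushed
wal_buffer.clear()               # crash: in-memory entries are lost
assert wal_on_disk == [1]        # the asynchronous commit was lost
```

The synchronous commit survives the crash because the backend did not report success until its WAL entry was durable; the asynchronous one traded that guarantee for a faster commit.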
Buffer cache boosts performance by reducing the number of
disk access operations
Reliability is ensured by write-ahead logging
WAL size is reduced by using checkpoints
WAL is a convenient mechanism used in many scenarios
to perform recovery after a failure
during backup
for replicating data between servers
1. Check how buffer cache is used when updating a single row
in a regular table and in a temporary table. Try to explain the difference.
2. Create an unlogged table and insert several rows into it. Simulate
a system failure by stopping the server in the immediate mode
as shown in the demo.
Start the server and check the table state.
In the server log, find the entries related to recovery.
Task 1. Temporary tables look just like regular ones, but their lifetime is
limited to the current session. Likewise, such tables are visible only in the
current session.
Use the following command as shown in the demo:
EXPLAIN (analyze, buffers, costs off, timing off)
Task 2. To stop the server in the immediate mode, run:
sudo pg_ctlcluster 12 main stop -m immediate --skip-systemctl-redirect
The --skip-systemctl-redirect parameter is required because PostgreSQL
has been installed on Ubuntu from a package. The server is managed by
the pg_ctlcluster command that calls the systemctl utility, and the specified
mode gets lost by the time pg_ctl is started. This parameter allows us to do
without systemctl and pass the command to pg_ctl directly.