Buffer cache and WAL
© Postgres Professional, 2015–2022
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Translated by Alexander Meleshko
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Please send your feedback, comments and suggestions to:
Postgres Professional assumes no responsibility for any damages and
losses, including loss of income, caused by direct or indirect, intentional or
accidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Buffer cache overview
Replacement algorithm
Write-ahead log
Processes related to the buffer cache and WAL
Buffer cache
[Diagram: background processes access the buffer cache in shared memory]
Buffer array: each buffer holds a data page (8 KB) plus additional information
“Dirty” buffers: written to disk asynchronously
Locks in memory for shared access
The buffer cache is used to smooth out the difference between the RAM and
disk speed. It consists of an array of buffers that contain data pages and
some additional information (for example, the file name and the position of
the page inside this file).
The page size is usually 8 KB; the size can only be changed when building PostgreSQL from source.
Any work with data pages goes through the buffer cache. If any process is
going to work with the page, it first tries to find it in the cache. If the page
does not exist, the process requests the operating system to read this page
and places it in the buffer cache. (Note that the OS can read the page either
from disk or from its own cache.)
After the page gets into the buffer cache, it can be accessed repeatedly
without the overhead of operating system calls.
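The lookup path described above can be sketched in miniature. The following is a hedged Python model, not PostgreSQL internals (the class and names are invented for illustration); it shows that only the first access to a page costs a request to the operating system:

```python
# Toy model of a buffer cache lookup: hit = no system call, miss = ask the OS.

class BufferCache:
    def __init__(self):
        self.buffers = {}          # (file, page_no) -> page bytes
        self.os_reads = 0          # number of requests passed to the OS

    def read_page(self, file, page_no):
        key = (file, page_no)
        if key in self.buffers:    # cache hit: served straight from RAM
            return self.buffers[key]
        self.os_reads += 1         # cache miss: the OS reads the page
        page = b"\x00" * 8192      # stand-in for an 8 KB data page
        self.buffers[key] = page   # keep the page for repeated access
        return page

cache = BufferCache()
cache.read_page("base/16384/1259", 0)   # first access goes to the OS
cache.read_page("base/16384/1259", 0)   # repeated access is served from cache
print(cache.os_reads)                   # -> 1
```

Note that, just as in the text, the model cannot tell whether the OS served the miss from disk or from its own cache; either way the subsequent accesses avoid the system call.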
If a process has changed the data in the page, the corresponding buffer is called “dirty”. The modified page must be written to disk, but for performance reasons the write occurs asynchronously and may be deferred for some time.
The buffer cache, like other shared memory structures, is protected by locks to control concurrent access. Although the locks are implemented efficiently, access to the buffer cache is not nearly as fast as simply accessing RAM.
Therefore, in general, the less data a query reads and modifies, the faster it
will work.
Least Recently Used
[Diagram: a dirty buffer is written to disk, and another page is read into the vacant space; background processes work with the buffer cache in shared memory]
The buffer cache size is usually not so large as to fit the entire database. It is limited by the available RAM; besides, the larger the buffer cache, the greater the overhead of maintaining it. Therefore, when reading yet another page, sooner or later the buffer cache will run out of space. In this case, page replacement comes into play.
Page replacement selects the page in the cache that has been used least recently. If the selected buffer is dirty, the page is written to disk first to preserve the changes. Then a new page takes its place in the buffer.
This is called Least Recently Used replacement, or LRU. It keeps the most frequently accessed data in the cache. Such “hot” blocks of data are not very common, and this approach significantly reduces the number of requests to the OS (and disk operations), provided the cache is large enough.
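The replacement policy just described can be modeled in a few lines. This is a minimal Python sketch of the idea on the slide, with invented names and a fixed-size cache; PostgreSQL's actual implementation uses a clock-sweep approximation rather than strict LRU:

```python
from collections import OrderedDict

# Toy LRU cache: evict the least recently used page; a dirty victim
# must be "written to disk" (recorded in `evicted` here) before eviction.

class LRUCache:
    def __init__(self, size):
        self.size = size
        self.pages = OrderedDict()   # page -> dirty flag, oldest first
        self.evicted = []            # dirty pages written out on eviction

    def access(self, page, dirty=False):
        if page in self.pages:
            self.pages.move_to_end(page)               # now most recently used
            self.pages[page] = self.pages[page] or dirty
        else:
            if len(self.pages) >= self.size:
                victim, was_dirty = self.pages.popitem(last=False)
                if was_dirty:
                    self.evicted.append(victim)        # flush before reuse
            self.pages[page] = dirty

cache = LRUCache(size=2)
cache.access("A", dirty=True)
cache.access("B")
cache.access("A")            # refresh A, so B becomes least recently used
cache.access("C")            # evicts B (clean, no write needed)
cache.access("D")            # evicts A (dirty: must be written first)
print(list(cache.pages))     # -> ['C', 'D']
print(cache.evicted)         # -> ['A']
```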
Write-ahead log (WAL)
Problem: when a crash occurs, data from RAM that is not
written on disk is lost
a stream of records of the actions being performed; can be used to
re-trace the steps lost during the crash
the records are stored on disk before the actual changes are written
The log tracks changes to
pages in tables, indexes and other objects
transaction status (clog)
The log does not track changes to
temporary and unlogged tables
Having a buffer cache (and other RAM buffers) increases performance at
the cost of reliability. When a crash happens, all buffer cache content is lost.
If the crash occurs on the OS or hardware level, the content of OS buffers
will also be lost (the OS may have its own failsafes for this).
To increase reliability, PostgreSQL uses the Write-ahead log. When performing any operation, PostgreSQL logs the minimum information necessary to be able to perform the operation again. The record must be written to non-volatile storage before the data modified by the operation is written (that's why it's called the Write-ahead log).
WAL files are located in the PGDATA/pg_wal directory.
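The ordering rule, log record first and data page second, can be illustrated with a toy model. All names here are invented for illustration; this is not how PostgreSQL is actually implemented:

```python
# Toy model of the write-ahead rule: a change is described by a durable
# log record before the modified page itself may be flushed.

wal = []            # durable log: records already on "disk"
disk_pages = {}     # durable data pages
cache = {}          # volatile in-memory page versions

def modify(page, new_value):
    cache[page] = new_value            # change the page in RAM only
    wal.append((page, new_value))      # log record describing the change

def flush(page):
    # The write-ahead rule: the record must already be in the log.
    assert any(p == page for p, _ in wal), "WAL record must be written first"
    disk_pages[page] = cache[page]

modify("page 0", "new row")
# Suppose a crash happens before flush(): disk_pages is stale,
# but the log lets us redo the lost change.
recovered = dict(disk_pages)
for page, value in wal:
    recovered[page] = value            # replay the log
print(recovered)                       # -> {'page 0': 'new row'}
```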
All objects that are being worked on in RAM have their operations logged.
These include tables, indexes and other objects, and transaction statuses.
Operations on temporary tables (tables that exist only within the scope of a session or a transaction and are visible only to the user who created them) aren't logged. You can also explicitly make a regular table unlogged. Such a table is quicker to work with, but its contents are wiped out after a crash.
Regular flushing of all dirty buffers to disk
ensures that all data changes before the checkpoint get to the disk
limits the size of the log required for recovery
Crash recovery
starts from the last checkpoint
WAL records are replayed one-by-one to restore data consistency
[Diagram: a timeline with a checkpoint followed by a crash; the WAL files required for recovery span the interval from the checkpoint to the crash]
When PostgreSQL crashes, it enters the recovery mode on the next start.
The data on disk at this point is inconsistent. Changes to hot pages were in
the buffer cache and are now lost, while some of the later changes have
been flushed to disk already.
To restore consistency, PostgreSQL reads the WAL and sequentially replays its records, repeating the changes that did not make it to disk. This way, the state of all transactions at the time of the crash is restored. Then, any transactions that haven't been logged as committed are aborted.
However, logging all changes throughout a server’s lifetime and replaying
everything from day 1 after each crash is impractical, if not impossible.
Instead, PostgreSQL uses checkpoints. Every now and then, it forces all
dirty buffers to disk (including clog buffers with transaction statuses) to
ensure that all data changes up to this point are safe in non-volatile memory.
This state is called a checkpoint. The “point” in checkpoint is the moment in time when the flushing of all data to disk is started. However, you only have a valid checkpoint when the flushing is complete, and that may take a while.
Now, when a crash occurs, recovery can start from the latest checkpoint. Consequently, it's sufficient to store WAL files only as far back as the last checkpoint goes.
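A minimal sketch of this recovery procedure, assuming a simplified WAL of key-value changes with an explicit checkpoint record (all names are invented for illustration):

```python
# Toy model of checkpoint-based recovery: only the WAL records after
# the last checkpoint need to be replayed.

wal = [
    ("set", "A", 1),
    ("set", "B", 2),
    ("checkpoint",),        # all dirty buffers were flushed at this point
    ("set", "A", 3),
    ("set", "C", 4),        # the crash happens after this record
]

# State on disk is consistent as of the checkpoint:
disk = {"A": 1, "B": 2}

# Find the last checkpoint and replay everything after it.
last_cp = max(i for i, rec in enumerate(wal) if rec[0] == "checkpoint")
for _, key, value in wal[last_cp + 1:]:
    disk[key] = value

print(disk)   # -> {'A': 3, 'B': 2, 'C': 4}
```

Everything before the checkpoint record could be discarded without affecting the result, which is exactly why older WAL files need not be kept.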
Synchronous mode: write on commit
Asynchronous mode: write in background
[Diagram: shared memory holds the buffer cache, clog, and WAL buffers]
The WAL approach is faster than working directly with disk without a buffer
cache. Firstly, a WAL record is smaller than an entire page of data.
Secondly, the log is written sequentially (and usually not read until a crash
occurs), which is better for basic hard disk drives.
Various configuration settings also affect WAL performance. If the records are flushed to disk immediately (synchronous mode), it is guaranteed that a committed operation will reach the disk. But writing to disk is expensive and forces the committing process to wait in line. To prevent log entries from getting “stuck” in the OS cache, PostgreSQL calls the fsync function, which forces the data into non-volatile storage.
There is also asynchronous mode, which has a background process
(walwriter) constantly sending WAL records to disk with a certain delay. It’s
more efficient at the cost of some reliability, but still ensures consistency
after crash recovery.
In fact, both modes work together. The log records of long transactions are written asynchronously (to free up WAL buffers), and if a page is about to be flushed to disk while its corresponding log record has not reached the disk yet, that record is immediately written in synchronous mode.
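The difference between the two modes can be sketched with an ordinary file, letting fsync stand in for flushing WAL to stable storage. The function names here are invented; real PostgreSQL behavior is governed by the synchronous_commit parameter:

```python
import os
import tempfile

# Toy model: a synchronous commit forces the record to stable storage with
# fsync before returning; an asynchronous commit leaves the record buffered
# until a background "walwriter" flushes it a bit later.

wal_path = os.path.join(tempfile.mkdtemp(), "wal")
wal_file = open(wal_path, "ab", buffering=0)

def commit(record, synchronous=True):
    wal_file.write(record + b"\n")     # append the commit record
    if synchronous:
        os.fsync(wal_file.fileno())    # wait until it is on stable storage

def walwriter_flush():
    os.fsync(wal_file.fileno())        # background flush, done with a delay

commit(b"COMMIT tx1", synchronous=True)    # durable when commit() returns
commit(b"COMMIT tx2", synchronous=False)   # fast, but lost if a crash hits now
walwriter_flush()                          # now tx2 is durable too

with open(wal_path, "rb") as f:
    print(f.read().count(b"COMMIT"))       # -> 2
```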
Main processes
walwriter: writes WAL records to disk
checkpointer: flushes all dirty buffers
bgwriter: flushes some dirty buffers
backends: flush the dirty buffer being replaced, if needed
[Diagram: checkpointer, bgwriter, and walwriter work with the buffer cache, clog, and WAL in shared memory]
Let’s take a step back and look at the processes that maintain the buffer
cache and the WAL.
First, there is walwriter. This process writes WAL records to disk in
asynchronous mode. In synchronous mode, this job is handled by the
process that commits the transaction.
Second, checkpointer, the checkpoint process. It periodically flushes all dirty
buffers to disk.
Third, bgwriter (or background writer). It operates similarly to checkpointer, but it only flushes some of the dirty buffers, prioritizing those at high risk of being replaced soon. It frees up buffer space in advance, so that when a backend selects a buffer to put a new page in, it doesn't have to flush the old contents of the buffer to disk itself.
Fourth, there are the backends that put data into the buffer cache.
Whenever the buffer selected for replacement is still dirty (despite the efforts of checkpointer and bgwriter), the backend itself will flush it to disk.
Log levels
Minimal: guarantees crash recovery
Replica (default): supports replication, that is, transfer and replay of the log on another server
Logical: supports logical replication, providing information about adding, changing, and deleting table rows
WAL was developed as a data protection tool to mitigate the risk of data loss
due to crashes.
However, the WAL mechanism turned out to have other applications, if its
records are supplemented with additional info.
The amount of data stored in each WAL record is controlled by the wal_level parameter.
The minimal level is sufficient to recover after a crash, and nothing else.
The replica level stores additional information that allows it to be used for
backup (see the Backup module) and replication (see the Replication
module). During replication, WAL records are streamed to the replica and
applied there, creating an exact copy of the original server.
At the logical level, information is added to the log that allows decoding
“physical” log entries and forming “logical” records of adding, changing
and deleting table rows. This is logical replication (also discussed in the
“Replication” module).
Buffer cache increases performance by reducing the number of
disk operations
WAL increases reliability
WAL size is kept in check by checkpoints
WAL has multiple uses:
crash recovery
backup
replication
1. Using the OS tools, find the processes responsible for the buffer
cache and the WAL.
2. Stop PostgreSQL in fast mode; start it again.
Check the server message log.
3. Now stop PostgreSQL in immediate mode; start it again.
Check the server message log and compare with the
previous one.
2. To stop in fast mode, use the command
pg_ctlcluster 13 main stop
This makes the server abort all open connections and perform a checkpoint
before shutting down, so that all data is flushed to disk and consistent.
In this mode, the shutdown may take some time, but on startup the server
will be good to go right away.
3. To stop in immediate mode, use the command
pg_ctlcluster 13 main stop -m immediate --skip-systemctl-redirect
The server will also abort open connections, but will not perform a
checkpoint. Data on disk will be inconsistent, like after a crash. In this mode,
the server shuts down quickly, but will enter recovery mode on startup and
will use the WAL to reach consistency.
If your PostgreSQL is compiled from source code, the fast stop command
will be
pg_ctl stop
and the immediate stop command will be
pg_ctl stop -m immediate