Checkpoint
Regular flushing of all dirty buffers to disk
    ensures that all data changes before the checkpoint reach the disk
    limits the size of the WAL required for recovery
Crash recovery
    starts from the last checkpoint
    WAL records are replayed one by one to restore data
[Diagram: timeline along the xid axis showing checkpoints, a crash, the WAL files required for recovery, and the point where recovery starts]
When PostgreSQL crashes, it enters recovery mode on the next start. At this
point the data on disk is inconsistent: some changes to hot pages lived only
in the buffer cache and are now lost, while some later changes had already
been flushed to disk.
To restore consistency, PostgreSQL reads the WAL sequentially and replays the
changes that did not make it to disk. This recreates the state of all
transactions as of the moment of the crash; any transaction that was not
logged as committed is then considered aborted.
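As a small illustration of that last point, the per-transaction commit status
consulted here is kept in clog and can also be inspected from SQL. The sketch
below assumes PostgreSQL 13 or later for the function names (older releases
provide txid_current() and txid_status() instead); the xid in the last query
is only a placeholder.

    BEGIN;
    SELECT pg_current_xact_id();         -- note the xid assigned to this transaction
    COMMIT;

    -- Replace 731 with the xid noted above; 731 is just a placeholder.
    SELECT pg_xact_status('731'::xid8);  -- reports whether it committed, aborted, or is in progress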
However, logging all changes throughout a server's lifetime and replaying
everything from day one after each crash is impractical, if not impossible.
Instead, PostgreSQL uses checkpoints. Every now and then, it forces all
dirty buffers to disk (including clog buffers, which store transaction statuses).
A checkpoint is the moment in time when this flushing starts, but it becomes
valid only once all such buffers have been written out. A completed checkpoint
guarantees that every data change made before it is safely on persistent
storage.
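To make this concrete, here is a minimal sketch in SQL: it requests an
immediate checkpoint and then reads the position of the last completed
checkpoint from the control file. Column names assume PostgreSQL 10 or later;
CHECKPOINT itself requires superuser rights (newer releases also accept
members of the pg_checkpoint role).

    CHECKPOINT;                    -- force an immediate checkpoint

    SELECT checkpoint_lsn,         -- WAL position of the checkpoint record
           redo_lsn,               -- position from which recovery would start reading
           checkpoint_time
    FROM pg_control_checkpoint();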
In production environments with a large buffer cache, a checkpoint can flush
many dirty buffers, so the server spreads this flushing over time to smooth
out the I/O load.
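The spreading is governed by a few configuration parameters; one way to look
them up is shown below (these are standard PostgreSQL settings).

    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('checkpoint_timeout',            -- maximum time between automatic checkpoints
                   'max_wal_size',                   -- WAL volume that triggers an extra checkpoint
                   'checkpoint_completion_target');  -- fraction of the interval over which writes are spread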
When a crash occurs, recovery starts from the last completed checkpoint.
Consequently, it is sufficient to keep WAL files only back to the last
completed checkpoint.
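As a rough sketch of how to see this boundary: the redo pointer of the last
completed checkpoint marks the oldest WAL position still needed for crash
recovery, and pg_walfile_name() maps it to a segment file name (PostgreSQL
10+ function name; older versions used pg_xlogfile_name()).

    SELECT redo_lsn,
           pg_walfile_name(redo_lsn) AS oldest_wal_file_needed
    FROM pg_control_checkpoint();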