Checkpoint
Regular flushing of all dirty buffers to disk
ensures that all data changes before the checkpoint get to the disk
limits the size of the log required for recovery
Crash recovery
starts from the last checkpoint
WAL records are replayed one-by-one to restore data consistency
[Figure: timeline along the xid axis showing the checkpoint, the later crash, the range of WAL files required for recovery between them, and the point where recovery starts]
When PostgreSQL crashes, it enters recovery mode on the next start.
At this point, the data on disk is inconsistent: changes to hot pages were
still sitting in the buffer cache and are now lost, while some later changes
had already been flushed to disk.
To restore consistency, PostgreSQL reads the WAL sequentially, record by
record, replaying the changes that did not make it to disk.
This way, the state of all transactions at the time of the crash is restored.
Then, any transactions that haven’t been logged as committed are aborted.
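For illustration, you can check what became of a particular transaction after
recovery by its identifier. This is a minimal sketch assuming PostgreSQL 10 or
later; the id passed to txid_status is a hypothetical placeholder for a value
noted beforehand:

SELECT txid_current();    -- note the id of the transaction you are interested in

-- After a crash and restart, check how that transaction ended up.
-- A transaction whose commit record never reached the WAL will not be
-- reported as committed (the id 12345 is only a placeholder):
SELECT txid_status(12345);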
However, logging all changes throughout a server’s lifetime and replaying
everything from day 1 after each crash is impractical, if not impossible.
Instead, PostgreSQL uses checkpoints. Every now and then, it forces all
dirty buffers to disk (including clog buffers with transaction statuses) to
ensure that all data changes up to this point are safe in non-volatile storage.
This state is called a checkpoint. The “point” in checkpoint is the moment in
time when the flushing of all data to disk is started. However, you only have
a valid checkpoint when the flushing is complete, and it may take a bit of
time.
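In practice, checkpoints are triggered automatically, either after a certain
time or after a certain volume of WAL has been written, and can also be
requested manually. A minimal illustration using the standard settings and
command:

SHOW checkpoint_timeout;            -- maximum time between automatic checkpoints
SHOW max_wal_size;                  -- WAL volume that triggers an extra checkpoint
SHOW checkpoint_completion_target;  -- how much the writes are spread over the interval

CHECKPOINT;   -- force an immediate checkpoint (requires appropriate privileges)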
Now, when a crash occurs, you can start recovery from the latest checkpoint.
Consequently, it’s sufficient to store WAL files only as far back as the last
checkpoint goes.
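To see where the last completed checkpoint is, and therefore where recovery
would begin and which WAL segments are still needed, you can query the control
file from SQL. A minimal sketch assuming PostgreSQL 10 or later:

-- the last checkpoint and its redo point, i.e. the WAL position from which
-- crash recovery would start replaying records:
SELECT checkpoint_lsn, redo_lsn, checkpoint_time
FROM pg_control_checkpoint();

-- the WAL segment file containing the redo point; older segments are not
-- needed for crash recovery:
SELECT pg_walfile_name(redo_lsn) FROM pg_control_checkpoint();

-- a rough estimate of how much WAL recovery would have to replay right now:
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_insert_lsn(), redo_lsn))
FROM pg_control_checkpoint();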