© Postgres Professional, 2015–2022
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Translated by Alexander Meleshko
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Please send your feedback, comments and suggestions to:
Postgres Professional assumes no responsibility for any damages and
losses, including loss of income, caused by direct or indirect, intentional or
accidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Replication purposes and types
Physical replication
Logical replication
Replication use cases
Replication purposes
the process of synchronizing multiple copies of a database cluster
on different servers
reliability: if one of the servers fails,
the system must maintain availability
(performance degradation is acceptable)
scalability: load distribution between servers
A single database server may not meet all the requirements.
First, reliability. One physical server is a possible point of failure. If the
server fails, the system becomes unavailable.
Second, performance. One server may not be able to handle
the load. Often, the ability to scale and distribute the load across multiple
servers is preferable to increasing the capacity of a single server.
The solution is to have multiple servers managing the same databases.
Replication refers to the process of synchronizing these servers.
Replication types
primary-replica: data flow in one direction only
delivery of WAL records or files
binary server compatibility is required
only the cluster as a whole can be replicated
publication-subscription: data flow is possible in both directions
row-level information (wal_level = logical)
protocol-level compatibility is required
can replicate individual tables
There are multiple ways to set up synchronization between servers. The two
main approaches available in PostgreSQL are physical and logical replication.
During physical replication, one server is assigned the main server role and
the other becomes a replica. The main server transfers WAL records to a
replica (in the form of files or a stream of records). The replica applies these
records to its data files. The WAL record application is purely mechanical,
without “understanding the meaning” of the changes, so binary compatibility
between servers is necessary (the same platforms and major PostgreSQL
versions). Since the WAL is shared across the entire cluster, only the cluster
as a whole can be replicated.
During logical replication, higher-level information is added to the WAL,
allowing the replica to extract changes at the row level (this requires the
parameter wal_level = logical). This sort of replication does not require
binary compatibility; it only needs the replica to be able to understand the
incoming WAL information. Logical replication also makes it possible to
replicate only the changes made to individual tables.
Logical replication was introduced in PostgreSQL 10. Before that, you had to
use the pglogical extension or set up trigger-based replication.
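As an illustration, per-table logical replication might be set up roughly like this. The hostnames, database, and table names below are made up for the sketch; the publisher must run with wal_level = logical, and the table must already exist on the subscriber with a matching schema:

```shell
# On the publisher: announce which tables are replicated.
psql -h pub.example.com -d appdb -c \
    "CREATE PUBLICATION accounts_pub FOR TABLE accounts;"

# On the subscriber: connect to the publisher and start receiving
# row-level changes (an initial table copy is performed first).
psql -h sub.example.com -d appdb -c \
    "CREATE SUBSCRIPTION accounts_sub
         CONNECTION 'host=pub.example.com dbname=appdb'
         PUBLICATION accounts_pub;"
```

After the initial synchronization, row changes on the publisher's table flow continuously to the subscriber, with no binary compatibility requirement between the two servers.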
Physical replication
How physical replication works
WAL transmission modes
Replica usage
Switching to replica and back
Replica configuration and use cases
Let’s discuss physical replication first.
It works by transmitting changes to the replica in the form of WAL records.
This is a very efficient mechanism, but it requires binary compatibility
between servers (the major version of the server, the operating system, the
hardware platform).
Physical replication is one-way only: while there may be any number of
replicas, there is always only one main server.
base backup via pg_basebackup
WAL files archive
Continuous recovery
deploy the backup
set configuration parameters
create a standby.signal file
launch the server
the server restores consistency and continues to apply incoming logs
delivery by replication protocol stream or WAL archive
connections (read-only) are allowed immediately after consistency is restored
Setting up replication is very similar to setting up recovery from a physical
backup. The difference is that the backup is deployed immediately, without
waiting for the main server to crash, and works in continuous recovery mode:
it continuously reads and applies new WAL records coming from the main
server. To tell the replica to start in this mode, a standby.signal file is
created instead of recovery.signal.
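A minimal setup sketch with made-up hostnames and paths; the -R flag of pg_basebackup conveniently creates standby.signal and writes the connection settings for us:

```shell
# Take a base backup from the primary and prepare it as a standby:
# -R creates standby.signal and adds primary_conninfo to
# postgresql.auto.conf in the target data directory.
pg_basebackup -h primary.example.com -U replicator \
    -D /var/lib/postgresql/standby -R -P

# Starting the server in this data directory brings it up as a replica:
# it restores consistency, then keeps applying incoming WAL.
pg_ctl -D /var/lib/postgresql/standby start
```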
This way, the replica is constantly maintained in an almost up-to-date state
and if the main server fails, the replica is ready to take over.
By default, the replica operates in the “hot standby” mode. This means that
during the recovery process, it allows connections to read data (as soon as
consistency is restored). You can prohibit these connections (this is called
“warm standby”).
Unlike backup, replication does not allow you to recover to an arbitrary point
in the past. In other words, replication cannot be used to correct an error
(although it is possible to configure a replica so that it lags behind the main
server by a certain amount of time).
Replica usage
Allowed
read-only queries (select, copy to, cursors)
setting server parameters (set, reset)
transaction management (begin, commit, rollback...)
creating a backup (pg_basebackup)
Not allowed
any changes (insert, update, delete, truncate, nextval...)
locks expecting changes (select for update...)
DDL commands (create, drop...), including creating temporary tables
maintenance commands (vacuum, analyze, reindex...)
access control (grant, revoke...)
triggers and advisory locks are disabled
In hot standby mode, no data changes (including sequences), locks, DDL
commands, service commands such as VACUUM and ANALYZE, or access
control commands are allowed on the replica. Basically, anything that
changes the data in any way is prohibited.
The replica can still process read-only queries. Changing server parameters
and executing transaction management commands is allowed. For example,
you can start a (reading) transaction with a specific isolation level.
In addition, the replica can also be used for making backups (of course,
taking into account the possible lag behind the main server).
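A quick sketch of what a hot-standby replica accepts and rejects; the replica address and the table are illustrative:

```shell
psql -h replica.example.com -d appdb <<'SQL'
BEGIN ISOLATION LEVEL REPEATABLE READ;   -- transaction control works
SELECT count(*) FROM accounts;           -- reads are allowed
COMMIT;
SET work_mem = '64MB';                   -- setting parameters is allowed
INSERT INTO accounts DEFAULT VALUES;     -- fails: cannot execute INSERT
                                         -- in a read-only transaction
SQL
```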
Streaming replication
[Diagram: clients run select/insert/update/delete on the main server; the
wal sender process on the main server streams WAL records to the wal receiver
process on the backup server.]
There are two ways to deliver WALs from the primary server to the replica.
The one used more commonly in production is streaming replication.
In this case, the replica connects to the primary server via the replication
protocol and receives the WAL record stream. This minimizes the replica lag
and can even eliminate it entirely (in synchronous mode).
There's a notable possible issue with reading from a replica. While a query
on a replica holds an MVCC snapshot, the primary server may vacuum away the
row versions required for that snapshot. The affected query on the replica
has to be terminated in this case. With streaming replication, the issue is
resolved by the feedback mechanism: it lets the primary server know which
transaction IDs are still in use by the replica so that it can delay the
vacuuming.
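The feedback mechanism is controlled by the hot_standby_feedback parameter on the replica; a configuration sketch:

```shell
# On the replica: report the oldest snapshot still in use back to the
# primary, so that vacuum on the primary delays removing the row
# versions that replica queries still need.
psql -c "ALTER SYSTEM SET hot_standby_feedback = on;"
psql -c "SELECT pg_reload_conf();"
```

The trade-off is that long-running queries on the replica can then hold back vacuuming on the primary.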
Replication via WAL archive
[Diagram: the same setup as above, plus a WAL archive; the main server
archives full WAL segments, and the backup server can take WAL either from
the streaming connection or from the archive.]
During streaming replication, there’s a chance that the primary server will
delete a WAL segment that hasn’t been received by the replica yet. To
ensure that it does not happen, you have to use either a replication slot or
streaming replication together with a WAL archive (that you need for
backups anyway).
When using a WAL archive, a special archiver process on the primary
server archives full WAL segments using the archive_command (this
mechanism is discussed in the Backup module).
If the replica cannot receive the next WAL entry via the replication protocol,
it will try to read it from the archive using the command from the
restore_command parameter.
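A configuration sketch of archive-based delivery; the archive directory path is illustrative:

```shell
# On the primary: copy each completed WAL segment into the archive
# unless a file with that name is already there.
psql -c "ALTER SYSTEM SET archive_mode = on;"
psql -c "ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';"

# On the replica: fetch segments from the same archive when they are
# not available via the streaming connection.
psql -c "ALTER SYSTEM SET restore_command = 'cp /mnt/wal_archive/%f %p';"
```

Note that changing archive_mode takes effect only after a server restart.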
In fact, replication can work with just the archive, without streaming
replication. But in this case:
the replica is forced to lag behind the primary server by the time it takes
to fill the WAL segment,
the primary server is not aware of the replica’s existence, so vacuuming
can delete the row versions needed for replica snapshots (you can set up
a delay for applying conflicting records, but it is not always clear how long
the delay should be).
Switching to a replica
Scheduled switchover
shutdown of the main server for maintenance without interruption of service
manual mode
Emergency switchover
switch to a replica due to a primary server failure
manual mode, but can be automated with external cluster software
There are different reasons for switching to a backup server. The switchover
can be performed routinely at a convenient time to allow for a maintenance
shutdown of the main server. In the event of a main server failure, on the
other hand, the switchover has to be performed as quickly as possible to avoid
service interruption.
Even an emergency switchover must be performed manually, because
PostgreSQL does not come with integrated cluster management software
that would monitor the state of the servers and initiate switchovers.
Switching to a replica
[Diagram: before the switchover, the main server on the left streams WAL to
the backup server on the right via wal sender → wal receiver.]
The image above illustrates the state of the servers before a switchover.
The main server is on the left, the replica on the right, and replication is set
up between the two.
Switching to a replica
[Diagram: after the switchover, the former main server is disconnected, and
the former replica serves select/insert/update/delete as the new main server.]
In case of a main server failure or a planned switchover, the replica is given
the command to stop recovering and become an independent server, and
the former main server is disconnected.
Of course, a way to redirect users to a new server is required, but this is
done by means outside of PostgreSQL.
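The command that stops recovery and promotes the replica can be sketched as follows; the data directory path is illustrative, and pg_promote() is available in recent PostgreSQL versions:

```shell
# Tell the replica to finish recovery and become an independent
# (writable) server, either from the command line:
pg_ctl -D /var/lib/postgresql/standby promote

# ...or from any session on the replica:
psql -c "SELECT pg_promote();"
```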
Main server recovery
[Diagram: the recovered former main server reconnects as a backup server,
receiving WAL from the new main server via wal sender → wal receiver.]
After the former main server is recovered or any maintenance on it is
complete, it connects as a replica to the new main server.
Main server recovery
Simply restarting the server will not work
WAL records missed by the replica due to delay will be lost
Restoring from a backup “from scratch”
a fresh new replica is deployed at the former primary server
this is time-consuming (rsync can accelerate it somewhat)
Using pg_rewind
“rolls back” the lost WAL records, replacing the corresponding pages on the
disk with pages from the new primary server
comes with a number of limitations
If the switchover has occurred due to a hardware failure (disk or server
replacement is required) or an operating system failure (OS reinstallation is
required), then the only option is to create a completely new replica on the
restored server. If the switchover was a planned one, the server can be
reconnected quickly (now as a replica).
Unfortunately, you can't simply switch the server back on and connect it to
the new primary server over the replication protocol. Because of replication
lag, some WAL records may not have made it to the replica. If the old
primary server has such records and the new primary server doesn't, then
applying WAL from the new primary server would corrupt the database.
You can always scrap the old primary server data and create a brand new
replica instead from a base backup. However, for large databases, this can
take a long time. The rsync process can speed this up to an extent.
An even faster option is to use pg_rewind.
pg_rewind detects WAL records that have not reached the replica (starting
from the last common checkpoint) and finds the pages affected by these
records. These pages (usually just a few) are replaced with pages from the
new primary server. In addition, pg_rewind copies all service files from the
source server (the new primary server). The usual recovery process takes it
from there.
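A pg_rewind invocation might look roughly like this; hosts and paths are made up, and the former primary must be cleanly shut down first:

```shell
# Run on the former primary server while its instance is stopped:
# replace diverged pages with pages from the new primary.
pg_rewind \
    --target-pgdata=/var/lib/postgresql/old_primary \
    --source-server='host=new-primary.example.com user=postgres'

# After the rewind, mark the server as a standby (primary_conninfo
# must also point at the new primary), then start it; it continues
# as a replica of the new primary server.
touch /var/lib/postgresql/old_primary/standby.signal
pg_ctl -D /var/lib/postgresql/old_primary start
```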
1. Multiple replicas
high availability and distributed read load
[Diagram: the main server streams WAL to two backup servers (replicas A
and B), each served by its own wal sender / wal receiver pair.]
The replication mechanism offers flexible system design options for a variety
of applications. Let’s consider several typical cases and possible solutions.
Case: a system with high availability and reading load distribution.
Solution: a primary server and multiple replicas. Replicas can serve
read-only queries and can take over immediately if the main server fails.
Each replica will have a dedicated wal sender process on the main server
and a replication slot, if necessary.
The read load distribution between replicas must be done by external means.
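Creating a replication slot for a replica can be sketched like this; the slot name is illustrative. A slot keeps the primary from recycling WAL segments that the replica has not yet received:

```shell
# On the primary: create a physical replication slot for the replica.
psql -h primary.example.com -c \
    "SELECT pg_create_physical_replication_slot('replica_a_slot');"

# On the replica: tell it to use that slot when streaming WAL.
psql -c "ALTER SYSTEM SET primary_slot_name = 'replica_a_slot';"
```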
2. Synchronous replication
data storage reliability
[Diagram: the main server streams WAL to replicas A and B; the synchronous
replica may see changes before the main server reports the commit, while the
asynchronous replica may lag behind.]
Case: in the event of a main server failure, no data must be lost when
switching to a replica.
Synchronous replication is the solution. In a single server environment,
synchronous WAL recording ensures that the commi