Architecture

Vacuuming

Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov

Translated by Alexander Meleshko

Use of course materials

Non-commercial use of course materials (presentations, demonstrations) is

allowed without restrictions. Commercial use is possible only with the written

permission of Postgres Professional. It is prohibited to make changes to the

course materials.

Feedback

Please send your feedback, comments and suggestions to:

edu@postgrespro.ru

Disclaimer

Postgres Professional assumes no responsibility for any damages and

losses, including loss of income, caused by direct or indirect, intentional or

accidental use of course materials. Postgres Professional company

specifically disclaims any warranties on course materials. Course materials

are provided “as is,” and Postgres Professional company has no obligations

to provide maintenance, support, updates, enhancements, or modifications.

Topics

Tasks that require periodic execution

Vacuum and analysis

Table and index bloating

Full vacuum and rebuilding of indexes

Routine tasks

Vacuuming pages to remove MVCC historical data

dead row versions are vacuumed out of tables

index entries referring to dead versions are vacuumed from indexes

MVCC makes it possible to effectively implement snapshot isolation, but as

a result, old versions of rows accumulate in table pages, and links to these

versions accumulate in index pages. Historical versions are needed for

some time so that transactions can work with their data snapshots. But over

time, a row version will no longer have any snapshot that needs it. Such a

row version is called “dead”.

The vacuum procedure cleans out dead row versions from table pages, as

well as unnecessary index entries that reference such versions.

If historical data is not vacuumed in a timely manner, tables and indexes will

bloat uncontrollably and the search for current row versions in them will slow

down.

https://postgrespro.com/docs/postgresql/13/routine-vacuuming

Routine tasks

Updating the visibility map

tracks pages where all row versions are visible in all snapshots

used to optimize vacuuming and speed up index access

exists only for tables

In addition to this main task, vacuuming also performs other instance

maintenance tasks. Vacuuming updates the visibility map and the free

space map. This is service information that is stored alongside the main

data.

The visibility map shows pages that contain only the current row versions

visible in all data snapshots. In other words, these are pages that have not

changed for long enough to be cleared out of outdated row versions.

The visibility map has several uses:

●

Vacuuming optimization.

The marked pages cannot contain dead row versions, so they can be

skipped during vacuuming.

●

Index Only access speedup.

Versioning information is stored only for tables, but not for indexes (that's

why indexes don’t have visibility maps). After getting a reference to a row

version from an index, you usually need to read the table page to check

its visibility. But if the index itself already has all the necessary columns,

and at the same time the page is marked in the visibility map, then

reading the table page can be skipped.

If the visibility map isn't updated regularly, index access can slow down. This

is described in more detail in the course “Query Performance Tuning” (QPT).

Routine tasks

Updating the free space map

tracks the free space in the pages after vacuuming

used when inserting new row versions

exists for both tables and indexes

The visibility map tracks available free space within pages. This space is

constantly changing, decreasing when new row versions are added and

increasing when they are removed.

When inserting new row versions, the map helps to quickly find a suitable

page to record the data into. The free space map has a complex tree-like

structure designed to improve search speed.

Indexes can have free space maps as well. However, since index entries

are inserted into specific positions within an index, the map only tracks

empty pages, which form when all index entries are deleted from them.

These pages are excluded from the index and later can be included again

into another part of the index.

Routine tasks

Updating statistics

used by the query optimizer

calculated based on a random sample

The query optimizer requires statistical information about the data it’s

working with, such as the number of rows in tables and the distribution of

data in columns. The process of collecting the statistics is called analysis.

For analysis, a random sample of data of a certain size is read from the

table. This way, the system can quickly collect statistics even on very large

tables. The result is not accurate, but it is not expected to be. The data

constantly changes, so it is impossible to maintain absolutely accurate

statistics all of the time anyway. It is sufficient to keep it relatively up-to-date

and relatively close to the truth.

If statistics are not updated regularly, they will no longer represent the data

accurately, leading to the optimizer proposing inefficient execution plans.

Because of this, queries may start executing orders of magnitude slower

than they could.

Routine tasks

Freezing

prevents the consequences of the 32-bit transaction counter overflow

As already mentioned, PostgreSQL orders events by transaction ID. The

counter has 32 bits allocated for it, and sooner or later it will overflow (this is

known as transaction ID wraparound). But when the counter resets to zero,

the order of transactions will be disrupted.

This is why the transaction ID scope is looped. For each transaction, half of

the IDs are considered to be in the future, half in the past.

Sufficiently old row versions are marked as frozen. This means that they

were created so far in the past that their transaction ID no longer means

anything and can be reused.

To avoid scanning extra pages, the visibility map has a bit that marks the

pages where all row versions are frozen.

Without regular freezing, the server may end up with no available

transaction IDs for new transactions. This is an emergency: the server

stops, all active transactions drop, and the administrator has to manually

restart the server and do the freezing.

Autovacuum

Autovacuum launcher

background process

periodically launches

worker processes

Autovacuum worker

vacuums tables of a

separate database that

require attention

PostgreSQL

backend

postmaster

background processes

autovacuum

shared memory

cache

All the maintenance tasks discussed above are taken care of by the

autovacuum process. It dynamically reacts to the frequency of table

updates, and the more active the changes, the more often the table will be

vacuumed.

The autovacuum launcher process runs in the background, scheduling the

vacuuming work and launching the required number of parallel autovacuum

workers.

Vacuuming works page-by-page, so it doesn't block other transactions

It still does require additional resources from the I/O system.

Autovacuum will stop when either of the two parameters autovacuum or

track_counts is switched off. It may seem that disabling autovacuum can

increase system performance by eliminating “unnecessary” I/O operations,

but it cannot. Failure to vacuum will lead to the consequences described

above: uncontrolled bloating, slower query processing, and the risk of an

emergency server shutdown. Ultimately, this will lead to a complete system

paralysis.

Autovacuuming is absolutely necessary. There is a large number of

configuration parameters that allow tweaking the autovacuum process.

They are discussed in detail in the course DBA2 “Configuration and

monitoring”.

Manual vacuuming

Vacuuming

VACUUM [table, ...] vacuuming specific tables

VACUUM vacuuming the entire database

$ vacuumdb wrapper for the OS

Analysis

ANALYZE

$ vacuumdb --analyze-only

Vacuum and analysis

VACUUM ANALYZE

$ vacuumdb --analyze

If necessary, vacuuming and analysis can be started manually using the

following commands:

VACUUM (vacuuming only), ANALYZE (analysis only), and VACUUM

ANALYZE (both vacuuming and analysis).

Autovacuum is different from running scheduled vacuum and analysis in that

it reacts to the frequency of data changes. Running it on a schedule too

often will create unnecessary load on the system. On the other hand, if you

vacuum too rarely, and data is changed often, the files may have time to

bloat significantly between vacuums.

https://postgrespro.com/docs/postgresql/13/sql-vacuum

https://postgrespro.com/docs/postgresql/13/sql-analyze

Bloating

Vacuuming does not reduce the size of tables and indexes

the “holes” in the pages are used for new data,

but never returned to the operating system

Causes of bloating

incorrect autovacuum configuration

massive changes of data

long-running transactions

Negative consequences

inefficient disk space use

slower sequential table scan

less efficient index access

Vacuuming cleans out outdated row versions from pages. This creates spots

of free space in the pages, which is then used to store new data. But the

free space is not returned to the operating system, so, from the point of view

of the OS, the size of the data files does not decrease.

In the case of indexes (B-trees), it is complicated by the fact that if there is

not enough space in the page to place an index entry, the page is split into

two. The resulting pages are never merged again, even if all index entries

are removed from them.

If autovacuuming is configured correctly, the data files grow by a certain

constant amount due to updates between vacuumings. But if a large amount

of data is being changed at the same time, or there are active long

transactions (keeping old data snapshots active and not allowing you to

vacuum out old row versions), vacuuming will not be able to free up the

space in time. As a result, the tables and indexes may continue to grow in

size.

File size bloating leads not only to disk space overuse (including for

backups), but also to a decrease in performance.

https://wiki.postgresql.org/wiki/Show_database_bloat

https://postgrespro.com/docs/postgresql/13/pgstattuple

Rebuilding objects

Full vacuum

VACUUM FULL

$ vacuumdb --full

completely rebuilds the contents of tables and indexes

locks the table completely

Rebuilding indexes

REINDEX

rebuilds indexes

locks the index completely and locks the associated table for write operations

In order to reduce the physical size of bloated tables and indexes, a full

vacuum is required.

The VACUUM FULL command completely rewrites the contents of the table

and its indexes, minimizing the space occupied. However, this process

requires an exclusive table lock and therefore cannot be executed in parallel

with other transactions.

https://postgrespro.com/docs/postgresql/13/sql-vacuum

You can rebuild an index or several indexes without touching the table. This

is done by the REINDEX command. It locks the table for writing (reading is

still available), so transactions trying to change the table will be blocked.

https://postgrespro.com/docs/postgresql/13/sql-reindex

If prolonged exclusive locking is undesirable, you can consider pg_repack,

a third-party extension (https://github.com/reorg/pg_repack) that allows you

to rebuild tables and their indexes on the fly.

Rebuilding objects

Non-blocking index rebuilding

REINDEX ... CONCURRENTLY

rebuilds indexes without blocking tables for writing

takes longer and may fail

not transactional

does not work for system indexes

does not work for indexes associated with exclusion constraints

In PostgreSQL 12 and higher, the REINDEX ... CONCURRENTLY command

can work without blocking the table for writing. However, non-blocking

rebuilding takes longer and may fail (due to deadlocks). In this case, the

index will need to be rebuilt again.

Non-blocking index rebuilding has a number of limitations: it cannot be

performed inside a transaction, and it cannot rebuild system indexes and

indexes with exclusion constraints (EXCLUDE).

Takeaways

Row versions accumulate, so periodic vacuuming is necessary

Vacuuming serves multiple goals:

updating visibility maps and free space maps

collecting statistics for the planner

freezing old row versions

Autovacuuming is necessary, but requires configuration

Full vacuum may be necessary to combat bloating

Practice

1. Disable autovacuuming and make sure it doesn’t work.

2. In a new database, create a table with one numeric column and

an index for this table. Insert 100,000 random numbers into the

table.

3. Change half of the table rows several times. Write down the size

of the table and the index each time.

4. Run a full vacuum.

5. Repeat step 4, running a regular vacuum each time you change

the values. Compare the results.

6. Turn autovacuuming back on.

1. Set the autovacuum parameter to off and reload the configuration files.

3. Use the functions pg_table_size (table name) and

pg_indexes_size (table name). For more information about calculating

the sizes of various objects, see the module “Data organization”.

6. Set the autovacuum parameter back to on (or reset this parameter with

the RESET command), then reload the configuration files.