Data Access
03. Access Methods
Copyright
© Postgres Professional, 2019–2024
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Photo by: Oleg Bartunov (Phu monastery, Bhrikuti summit, Nepal)
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Feedback
Please send your feedback, comments and suggestions to:
edu@postgrespro.com
Disclaimer
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Topics
Sequential Scan (Seq Scan)
Index Scan
Bitmap Scan
Index-Only Scan
Comparing Access Method Efficiency
Seq Scan
Reading all pages sequentially
The pages are read into the buffer cache
The visibility of row versions is checked
Data is returned in an arbitrary order
The scan time depends on the file's physical size.
[Figure: the pages of a table are read one after another; among the tuples are an outdated row version and a NULL value]
The optimizer has several ways to access the data. The simplest method is a
sequential scan of the table: the table's main data files are read page by
page from beginning to end. Note that data is read through the buffer cache
(for temporary tables, through the session-local cache).
Sequential file reading leverages the fact that the operating system typically
reads data in larger portions than the page size — likely, several subsequent
pages are already in the OS cache.
A sequential scan works well when reading the entire table or a significant
portion of it (if the condition's selectivity is low).
During a sequential scan, all row versions on each page are examined —
including non-current (dead) versions that haven't been removed by the
cleanup (autovacuum) process yet. If the table's files contain many dead row
versions that haven't been removed, this can cause a decrease in the
performance of sequential scans.
Details on row versions, data snapshots, and cleaning up unnecessary row
versions are covered in the DBA2 course.
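As a sketch, a sequential scan can be observed with EXPLAIN. The `flights` table here comes from the demo database used in the practice assignments; any sufficiently large table will do.

```sql
-- Reading the whole table: the planner chooses a sequential scan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM flights;
-- The plan contains a "Seq Scan on flights" node; with BUFFERS, the output
-- also shows how many pages came from the cache (hit) and from disk (read).
```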
Bulk Page Replacement
[Figure: a process scans the table through a 32-page buffer ring inside shared_buffers; a second process joins the existing ring; a dirty buffer is either removed from the ring or written to disk]
A sequential scan accesses a large number of pages containing "single-use"
data, which could evict useful pages from the buffer cache. To prevent this,
a sequential scan of a large table (exceeding a quarter of the buffer cache
size) uses only 32 pages of the entire cache, with eviction occurring within
this subset. The remaining data in the cache stays unaffected.
If the hint bits or the actual data are modified during the read operation, a
dirty buffer is generated. This buffer is removed from the ring and will be
evicted normally, with a new buffer added to the ring. This strategy assumes
that data is mainly read, not modified.
If another process needs the same table during a scan, it doesn't start from
the beginning but joins the existing buffer ring. After the scan completes, the
process reads the "missed" beginning of the table.
When using temporary tables via the local cache, the buffer ring mechanism
is not employed.
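The effect of the buffer ring can be observed with the `pg_buffercache` extension, if it is installed. A hedged sketch (again using the demo `flights` table; joining on `relfilenode` is a simplification that ignores other databases in the cluster):

```sql
-- Count how many buffers of the table remain cached after a full scan.
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

SELECT count(*)
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE c.relname = 'flights';
-- For a table larger than a quarter of shared_buffers, this stays small
-- (on the order of the 32-page ring) rather than covering the whole table.
```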
B-tree
Index
A supporting structure in external memory
maps keys to table row identifiers
Structure: a search tree
balanced
highly branched
only sortable data types (using 'greater than' and 'less than' operations)
search results are automatically sorted
Usage
improved access speed
support for integrity constraints
In this section, we will focus on one of the index types available in
PostgreSQL: the B-tree. This is the most commonly used index type in
practice. Other types of indexes are covered later in this course, in the
"Types of Indexes" section.
Like all indexes in PostgreSQL, the B-tree is a secondary structure — it
contains no information that isn't already available from the table itself, but it
does take up additional disk space. An index can be dropped and recreated.
Indexes are used to accelerate operations that involve a small part of the
table, such as retrieving a limited number of rows, and enforcing integrity
constraints (primary and unique keys).
Indexes map the values of indexed fields (search keys) to table row
identifiers. In the B-tree index, an ordered key tree is built, allowing quick
lookup of the desired key along with item pointers to row versions. For
example, numbers, strings, and dates can be indexed, but planar points
cannot (other index types are available for them). When indexing text
strings, you should take into account the specifics of sorting rules (for more
details, see course DBA2, topic 'Localization').
The B-tree is characterized by its balanced structure (constant depth) and
high branching factor. Although the tree's size depends on the indexed
columns, in practice, trees typically have a depth of no more than 4–5.
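The depth can be inspected with the `pageinspect` extension. A minimal sketch, assuming a hypothetical table `t` with a single integer column:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

CREATE TABLE t(id integer);
INSERT INTO t SELECT generate_series(1, 1000000);
CREATE INDEX t_id_idx ON t(id);

-- level is the height of the tree above the leaf level;
-- even with a million keys it remains small.
SELECT level FROM bt_metap('t_id_idx');
```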
B-tree
[Figure: an example B-tree over the keys 1–12 and NULL. The root page references internal pages, internal pages reference leaf pages, and leaf pages point to table pages; at the leaf level the keys 1, 2, ..., 12, NULL follow in order]
An example of a B-tree is shown at the top of the slide. Its pages are
composed of index entries, each containing:
the key, that is, the values of the columns the index was created on
(such columns are referred to as key columns);
a pointer to another index page, or pointers to table row versions.
The keys within a page are always sorted.
Leaf pages directly point to table tuple versions containing the index keys.
These pages are linked in a bidirectional list to facilitate traversing keys in
ascending or descending order.
Internal pages point to lower-level index pages, with key values defining the
range of values accessible by following the link.
The page at the top of the tree, which no other index page points to, is called the root.
An index page may be only partially filled. Free space is used to insert new
records into the index. If there's not enough space on a page, it is split into
two new pages. Split pages are never merged, which can lead to index
growth in some cases.
By default, null values are treated as 'greater than' non-null values, so they
are stored on the right side of the tree. This order can be adjusted when
creating an index using the NULLS LAST and NULLS FIRST clauses.
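A sketch of the syntax, reusing the hypothetical table `t` from above:

```sql
-- By default an ascending index is built with NULLS LAST,
-- i.e. NULLs end up at the right edge of the tree.
-- The ordering can be changed at creation time:
CREATE INDEX t_id_nulls_first_idx ON t (id NULLS FIRST);
```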
Index Scan: a Single Value
[Figure: searching the B-tree for the key 4. The search descends from the root through an internal page to a leaf page, and from there follows the pointer to the table page]
Let's look at how searching for a single value using an index works. For
example, we want to find a row in the table where the value of the indexed
column is four.
We start at the tree's root. The index entries on the root page specify the key
value ranges for the lower-level pages: '1 to 9' and '9 and above'. The '1 to 9'
range applies here, corresponding to a row with key 1. It's worth noting that
since the keys are stored in order, page-level searches are highly efficient.
The link in the found entry leads to the second-level page. In it, we find
the '3 to 6' range (key 3) and move to the third-level page.
This is a leaf page. In this page, we find the key value 4 and navigate to the
table page.
Note that keys may be duplicated, even in a unique index, because the
multiversion concurrency control mechanism can create different versions of
the same row. To save space, keys are stored in the index page as a single
instance.
Before the found row versions are returned, their visibility is checked.
In the illustrations, the entries and pages that had to be read are color-
coded.
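In a query plan, this kind of lookup appears as an Index Scan node. A sketch against the hypothetical table `t` with index `t_id_idx` introduced earlier:

```sql
EXPLAIN SELECT * FROM t WHERE id = 4;
-- Typical plan shape (for a table large enough to make the index worthwhile):
--   Index Scan using t_id_idx on t
--     Index Cond: (id = 4)
```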
Index Scan: a Range
[Figure: scanning the leaf pages for the range of keys from 4 to 9. The corresponding table pages are read in a chaotic order]
B-trees enable efficient searching for not just individual values, but also
ranges of values, such as "less than," "greater than," "less than or equal to,"
"greater than or equal to," and "between."
This is how it works. First, we search for the condition's extreme key. For
example, for the "between 4 and 9" condition, we can select 4 or 9, whereas
for the "less than 9" condition, we use 9. Then, we proceed to the index's
leaf page as discussed in the previous example and retrieve the first value
from the table.
Then, we proceed along the index's leaf pages in the appropriate direction
(right or left, depending on the condition), scanning the records until we
encounter a key outside the specified range.
The slide demonstrates an example of searching for values using the
condition "x BETWEEN 4 AND 9" or, equivalently, "x >= 4 AND x <= 9". Once
we reach the value 4, we scan the keys 5, 6, and so on until 9. When
we encounter key 10, we stop the search.
Two properties are at work: the ordered arrangement of keys on all pages
and the bidirectional linking of leaf pages. The search results are
automatically sorted.
Note that we had to access the same table page multiple times. We
accessed the first table page (value 4), then the last one (also 4), followed
by the first page again (5), then the last page (6), and so on.
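A range condition produces the same Index Scan node, and since leaf pages are traversed in key order, the result needs no separate sort. A sketch with the same hypothetical table `t`:

```sql
EXPLAIN
SELECT * FROM t
WHERE id BETWEEN 4 AND 9
ORDER BY id;
-- Typical plan shape:
--   Index Scan using t_id_idx on t
--     Index Cond: ((id >= 4) AND (id <= 9))
-- Note the absence of a Sort node above the scan.
```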
Bitmap Index Scan
[Figure: the first stage of a bitmap scan: the index is traversed and a bitmap of matching tuples is built in local memory]
Repeatedly accessing the same table pages is extremely inefficient. Even in
the best case, if the required page is in the buffer cache, it must be located
and locked (see DBA2 course: "Buffer Cache" topic in the "Journaling"
module), whereas in the worst case, you end up dealing with random disk
reads.
To avoid wasting resources on repeated access to table pages, a bitmap
scan is used as an alternative access method. It's similar to a standard
index access, but it's carried out in two stages.
First, the index (Bitmap Index Scan) is scanned, and a bitmap is constructed
in the process's local memory. The bitmap is divided into fragments.
Fragments correspond to table pages, with each bit in a fragment
representing a tuple on the page. When constructing a bitmap, the tuple
versions that meet the condition and need to be read are marked within it.
Because the bitmap is divided into fragments, a bitmap that marks only a few
tuple versions occupies minimal space.
Bitmap Heap Scan
[Figure: the second stage of a bitmap scan: each table page marked in the bitmap is read exactly once]
Once the index has been scanned and the bitmap is ready, the table scan
(Bitmap Heap Scan) starts. In the process:
a dedicated prefetching mechanism is employed, asynchronously reading
effective_io_concurrency pages ahead (the default is 1);
multiple row versions can be checked on a single page, but each page is
read exactly once.
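In a plan, the two stages appear as nested nodes. A sketch against the hypothetical table `t`, with a condition of medium selectivity:

```sql
EXPLAIN SELECT * FROM t WHERE id < 100000;
-- Typical plan shape:
--   Bitmap Heap Scan on t
--     Recheck Cond: (id < 100000)
--     ->  Bitmap Index Scan on t_id_idx
--           Index Cond: (id < 100000)
```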
Approximate fragments
Bitmap without accuracy loss
as long as the map's size is within work_mem, the information is stored
with row-version accuracy
Bitmap with accuracy loss
if memory is exhausted, part of the existing map is coarsened down to
whole pages
approximately 1 MB of memory is required per 64 GB of data; the
work_mem limit may be exceeded
The bitmap is stored in the local memory of the backend process, and
work_mem bytes are allocated for storing it. Temporary files are never used.
If the map exceeds the work_mem limit, some of its fragments are
coarsened: each bit then corresponds to an entire page instead of an
individual row version (a lossy bitmap). The processing overhead for these
fragments increases, but the freed-up space is used to continue constructing
the map.
With a heavily restricted work_mem and a large dataset, the bitmap may not
fit in memory even after all fragments have been coarsened to page level. In
this case, the work_mem limit is exceeded: additional memory is allocated
for the bitmap as needed.
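Coarsening can be provoked deliberately by restricting work_mem. A sketch using the hypothetical table `t`:

```sql
SET work_mem = '64kB';

EXPLAIN (ANALYZE, COSTS OFF)
SELECT count(*) FROM t WHERE id < 500000;
-- The Bitmap Heap Scan node then reports a line such as:
--   Heap Blocks: exact=... lossy=...
-- where "lossy" counts pages whose fragments were coarsened.

RESET work_mem;
```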
Approximate fragments
[Figure: a bitmap consisting of two fragments: an exact fragment, where each bit represents a single tuple, and an inaccurate fragment, where each bit represents an entire page and a recheck is required]
As shown in the figure, a bitmap consists of two fragments. The first
fragment is precise, with each bit representing a single tuple. The second
fragment is imprecise, with each bit representing an entire page.
Imprecise fragments require rechecking conditions for all tuple versions in a
table page, which impacts performance. When combining two bitmaps, if at
least one contains an imprecise fragment, the resulting fragment must also
be imprecise. Therefore, the size of work_mem is crucial for efficient bitmap
scanning.
Index Scan / Bitmap Scan
[Figure: a table whose data is physically sorted by the indexed key; a regular index scan then never returns to a table page it has already read]
If the table's data is physically sorted, a regular index scan won't read the
data page again. In such (rare in practice) cases, the bitmap scan method is
outperformed by a regular index scan.
Naturally, the query planner also accounts for this (how exactly is covered in
the "Basic Statistics" section).
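The planner's view of the physical ordering is the correlation statistic. A sketch, again assuming the hypothetical table `t` (after ANALYZE has been run on it):

```sql
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 't' AND attname = 'id';
-- A correlation close to 1 or -1 means the rows are stored almost
-- in index order; values near 0 mean the order is essentially random.
```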
Index-Only Scan
[Figure: scanning the leaf level of the index. If a page is present in the visibility map, no table check is required; if it is missing from the visibility map, the table page must be checked]
If a query only needs indexed data, that data is already available in the
index, so there's no need to access the table. Such an index is referred to as
a covering index for the query.
This is a good optimization that eliminates the need to access table pages.
Unfortunately, index pages don't contain information about row visibility: to
determine whether a row found in the index should be returned, we have to
examine the table page, which undermines the optimization.
Therefore, the visibility map is crucial for the efficiency of index-only scans. If
a table page contains data that's definitely visible and this is indicated in the
visibility map, you don't need to access that table page. However, pages not
marked in the visibility map still need to be accessed.
One reason to run vacuum frequently is that this process updates the
visibility map.
The planner doesn't know exactly how many table pages will need checking,
but it considers the estimated number. If the planner's estimate is poor, it
may avoid using index-only scans.
During processing of each index entry in a leaf node, the visibility map is first
checked for the presence of the table page, and if it's not found, the table
page itself is read.
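The effect of the visibility map is visible in the plan. A sketch with the hypothetical table `t`:

```sql
-- VACUUM updates the visibility map, among other things.
VACUUM t;

EXPLAIN (ANALYZE, COSTS OFF)
SELECT id FROM t WHERE id < 100;
-- Typical plan shape:
--   Index Only Scan using t_id_idx on t
--     Heap Fetches: 0
-- A nonzero Heap Fetches count means some table pages still had
-- to be checked.
```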
Multi-column Index
[Figure: a B-tree built on two columns; its leaf entries (1,A 1,B 1,C 2,A ... N,N) are ordered first by the first column and then by the second]
You can create an index on multiple columns. In this case, the order of the
columns and the sort order matter.
The figure illustrates a multicolumn index created using two columns, both
sorted in ascending order. This index improves query performance when the
query involves conditions on one or more of the leading columns, as the
index entries are sorted first by the first column, then the second, and so on.
In the example on the slide, the query involves the first and second columns;
the index would also work for a condition that references only the first
column.
Multi-column Index
[Figure: the same two-column index; leaf entries whose second column is "A" occur in many different places at the leaf level]
However, if the query contains a condition only on the second column, the
index becomes ineffective. As illustrated on the slide, leaf entries where the
second column is "A" can be found anywhere in the index — we don't have
a way to reach them from the root.
In such cases, the query planner may still opt for index-only scanning, but
this results in a full index scan.
Similarly, the index cannot return records in a different order than the one
specified during its creation. For example, the index depicted on the slide
cannot return records ordered by the first column in ascending order and the
second column in descending order.
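Both cases can be sketched with a hypothetical two-column table `t2`:

```sql
CREATE TABLE t2(a integer, b text);
CREATE INDEX t2_a_b_idx ON t2(a, b);

-- A condition on the leading column (or on both) can descend the tree:
EXPLAIN SELECT * FROM t2 WHERE a = 1 AND b = 'A';

-- A condition on the second column alone cannot: at best the planner
-- performs a full scan of the entire index, at worst a Seq Scan,
-- because entries with b = 'A' are scattered across the leaf level.
EXPLAIN SELECT * FROM t2 WHERE b = 'A';
```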
Indexes with an INCLUDE clause
Include indexes
CREATE INDEX ... INCLUDE (...)
Non-key columns
are not used in index searches
are not subject to the unique constraint
The values reside in the index entry and are retrieved without needing to access the table
A covering index typically improves query performance. To create a covering
index, you might need to add columns, but this isn't always feasible:
Adding a column to a unique index would violate the unique constraint of
the original columns;
The added column's data type might not be supported by the index.
In such cases, you can add non-key columns to the index by including them
in the INCLUDE clause.
The values of such columns do not contribute to the index structure but are
stored as supplementary data in leaf page index entries. Although queries
on non-key columns aren't supported, their values can be retrieved without
accessing the table.
Currently, include indexes are only supported for B-tree, GiST, and SP-GiST
indexes.
Include indexes are created to make the index covering, but these are not
the same thing. An index can be covering for a query without using the
INCLUDE clause.
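A sketch with a hypothetical table `items`: uniqueness is enforced on `id` alone, while `payload` rides along in the leaf entries.

```sql
CREATE TABLE items(id integer, payload text);
CREATE UNIQUE INDEX items_id_idx ON items(id) INCLUDE (payload);

-- Both columns can be returned from the index, so (once the visibility
-- map permits it) the planner can choose an Index Only Scan:
EXPLAIN SELECT id, payload FROM items WHERE id = 1;
```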
Efficiency Comparison
[Figure: time versus selectivity (from 0 to 1) for Index Scan, Bitmap Index Scan, and Seq Scan. Index scan time grows in proportion to the number of rows, bitmap scan time in proportion to the number of pages, and the sequential scan time does not depend on selectivity]
Index scans perform best under high selectivity, when a single value or a
few values are retrieved using the index.
Bitmap scans typically perform best at medium selectivity. Although this
approach requires building a bitmap first, it outperforms index scans by
eliminating repeated reads of the same pages (unless the table data is
physically ordered, which is uncommon). In the worst case, index scan
performance scales with the number of selected rows, while bitmap scans
scale with the number of pages.
At low selectivity, sequential scanning is most effective when selecting all or
nearly all table rows, as accessing index pages adds unnecessary
overhead. This effect is amplified when using rotating disks, as random read
speeds are significantly lower than sequential read speeds.
The selectivity threshold at which switching to a different access method
becomes beneficial varies significantly depending on the specific table and
index. The planner considers multiple parameters to select the most suitable
method.
Another observation: Index access returns results in sorted order, which can
make it more attractive even at low selectivity.
Efficiency Comparison
[Figure: time versus selectivity (from 0 to 1) for Index Scan, Bitmap Index Scan, Seq Scan, and Index Only Scan; the Index Only Scan curve lies between a best case and a worst case]
The effectiveness of index-only scanning is heavily dependent on the current
state of the visibility map and the number of data pages that actually contain
only up-to-date tuple versions.
In the best case, this access method can outperform sequential scanning
even when selectivity is low, particularly when the index is smaller than the
table, especially on SSDs.
In the worst-case scenario, when visibility for each row must be checked, the
method reverts to a standard index scan.
Therefore, the planner must consider the visibility map's state: if the forecast
is unfavorable, index-only scans are avoided due to the risk of experiencing
a slowdown instead of a speedup.
Takeaways
The optimizer employs various access methods
sequential scan
index scan
index-only scan
bitmap scan
The cost model considers numerous parameters
Practice
1. Make sure that when re-running the query that selects all rows
from the flights table, it reads data from the DB cache.
2. An include index was created for the tickets table during the
demo. Replace it with the primary key index.
3. Create an index for the amount column in the ticket_flights table.
Identify flights costing more than 120,000 rubles (approximately
1% of the rows). Which access method was selected?
4. Repeat point 3 for costs under 4 rubles (approximately 90% of
the rows).
1. Use the EXPLAIN command with the ANALYZE and BUFFERS options.
4. Also try disabling the selected access method (parameter
enable_seqscan) and compare the performance.