Statistics
Basic Statistics
Copyright
© Postgres Professional, 2019–2024
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Photo by: Oleg Bartunov (Phu monastery, Bhrikuti summit, Nepal)
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Feedback
Please send your feedback, comments and suggestions to:
edu@postgrespro.ru
Disclaimer
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Topics
Basic statistics
Most common values and histogram
Statistics for elements of composite values
Using statistics to estimate cardinality and selectivity
Custom and generic plans
Partial and expression indexes
Basic statistics
Table size
rows (pg_class.reltuples) and pages (pg_class.relpages)
Collected by
DDL operations
vacuum
analyze
Configuration
default_statistics_target = 100
Basic statistics are collected at the table level and the column level.
Table statistics include data on the object's size: reltuples and relpages in
the pg_class table. Because these statistics are crucial, they are updated
by certain DDL operations (CREATE INDEX, CREATE TABLE AS SELECT)
and refined during vacuum and analyze.
Additionally, the planner adjusts the row count based on the difference
between the actual data file size and the relpages value.
During analysis, a random sample of rows is examined. Research has
shown that the sample size ensuring accurate estimates is largely
independent of the table size. The sample size is 300 rows multiplied by the
statistics target defined by the default_statistics_target parameter.
Keep in mind that the statistics don't have to be perfectly accurate for the
planner to select an acceptable plan; often, being in the right ballpark is
sufficient.
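These size statistics can be inspected directly in the system catalog. A minimal sketch, using the tickets table that appears later in the practice section:

```sql
-- Row and page counts as currently known to the planner
SELECT reltuples, relpages
FROM pg_class
WHERE relname = 'tickets';
```

The planner scales reltuples by the ratio of the current file size to relpages, so the estimate stays reasonable even between analyze runs.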
pg_statistic (pg_stats)
Basic statistics
null_frac — proportion of NULLs
n_distinct — number (or proportion) of distinct values
pg_class.reltuples, relpages
During table analysis, all other statistics are collected separately for each
column. This is typically handled by autoanalysis, with its configuration
discussed in the DBA2 course.
The pg_statistic table stores column-level statistics, but the pg_stats view is
easier to use, as it displays the information in a more convenient format.
The null_frac field indicates the proportion of rows with null values in the
column (ranging from 0 to 1).
The n_distinct field contains the number of distinct values in the column. If
n_distinct is negative, its absolute value represents the proportion of unique
values. For example, -1 indicates that all values are unique (a typical case
for a primary key).
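Both fields can be examined through the pg_stats view; a quick sketch (again using the tickets table as an example):

```sql
-- Per-column basic statistics: fraction of NULLs and
-- number (or proportion, if negative) of distinct values
SELECT attname, null_frac, n_distinct
FROM pg_stats
WHERE tablename = 'tickets';
```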
Most common values
null_frac
most_common_vals — array of the most common values (size ≤ default_statistics_target)
most_common_freqs — array of their frequencies
pg_class.reltuples, relpages
Had the data been uniformly distributed—that is, if all values occurred with
equal frequency—this information would have been nearly sufficient
(minimum and maximum values would still be required).
However, non-uniform distributions are very common in practice. Therefore,
the following information is also gathered.
the array of the most common values — the most_common_vals field;
the array of frequencies of these values — the most_common_freqs field.
The frequencies from these arrays directly serve as a selectivity estimate for
querying a specific value.
This works well as long as the number of distinct values isn't too large. The
maximum size of each array is limited by the default_statistics_target
parameter. The target can also be set on a per-column basis; in that case,
the sample size is determined by the largest target among the table's
columns.
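A per-column target is set with ALTER TABLE; a sketch (the value 500 is an arbitrary example):

```sql
-- Raise the statistics target for a single column, then re-analyze
-- so the larger arrays are actually collected
ALTER TABLE tickets ALTER COLUMN passenger_name SET STATISTICS 500;
ANALYZE tickets;
```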
The tricky part is "wide" values. To keep pg_statistic from growing too large
and to avoid overloading the planner with unnecessary work, values
exceeding 1 KB are excluded from analysis and statistics. But if such large
values are stored in a column, they're likely unique and wouldn't have made
it into most_common_vals anyway.
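The arrays themselves can be viewed in pg_stats; a sketch of how an equality estimate is derived from them (column name per the practice section):

```sql
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'tickets' AND attname = 'passenger_name';
-- For "passenger_name = 'ANNA VASILEVA'": if the value appears in
-- most_common_vals, the planner takes the matching entry of
-- most_common_freqs as the selectivity and multiplies it by reltuples
-- to get the cardinality estimate.
```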
Histogram
null_frac
histogram_bounds — bucket boundaries (number of buckets ≤ default_statistics_target)
When the number of distinct values grows too large to store them all in an
array, the system switches to a histogram representation. A histogram
distributes the values across several buckets; their number is limited by the
same default_statistics_target parameter.
The bucket widths are chosen so that each bucket contains roughly the
same number of values (shown as equal rectangle areas in the figure).
With this setup, only the array of bucket boundaries needs to be
stored — the histogram_bounds field. The frequency of each bucket is 1
divided by the number of buckets.
To estimate the selectivity of the condition field < value, calculate N divided
by the total number of buckets, where N is the number of buckets lying
entirely to the left of value. The estimate can be refined by adding the
fraction of the bucket that contains the value itself.
However, when estimating the selectivity of the condition field = value, the
histogram cannot help, and you must rely on the assumption of a uniform
distribution, taking 1/n_distinct as the estimate.
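The boundaries are visible in pg_stats; a sketch of the range estimate (ticket_no is assumed to be a high-cardinality column):

```sql
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tickets' AND attname = 'ticket_no';
-- For "ticket_no < '0005432000500'": selectivity is approximately
--   (buckets entirely below the value
--    + the fraction of the straddling bucket) / total buckets.
-- For "ticket_no = ...", the histogram does not help:
-- the planner falls back to 1 / n_distinct.
```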
Method combination
null_frac
most_common_vals, most_common_freqs
histogram_bounds — for the remaining values
However, the two methods are typically integrated: a list of the most
common values is created, with all remaining values represented in a
histogram.
The histogram is constructed to exclude values that are already in the list.
This helps improve the estimates.
Additional fields
Ordering (use an index scan or a bitmap scan?)
pg_stats.correlation (1 = ascending, 0 = random, −1 = descending)
Visibility (use an index-only scan?)
pg_class.relallvisible
Average value size in bytes (memory estimates)
pg_stats.avg_width
The server maintains additional statistical metrics.
The pg_stats.correlation field records the physical ordering of the column's
values. If the values are stored in strictly ascending order, the value will be
close to 1; if in descending order, close to -1. The more randomly the data is
arranged on disk, the closer the value gets to zero. The optimizer uses this
field when deciding between a bitmap scan and a regular index scan.
The pg_class.relallvisible field tracks the number of table pages that contain
only tuples visible to all transactions (this data is updated together with the
visibility map). If that number is too low, the planner may prefer a bitmap
scan over an index-only scan.
The pg_stats.avg_width field stores the average size of values in bytes for
this column to estimate the memory needed for the operation.
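All three metrics can be inspected with two short queries; a sketch:

```sql
-- Physical ordering of values and average value width per column
SELECT attname, correlation, avg_width
FROM pg_stats
WHERE tablename = 'tickets';

-- Number of all-visible pages (relevant for index-only scans)
SELECT relallvisible, relpages
FROM pg_class
WHERE relname = 'tickets';
```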
Composite Field Elements
Most Common Elements
pg_stats.most_common_elems
pg_stats.most_common_elem_freqs
Element Count Histogram
pg_stats.elem_count_histogram
For composite types like arrays or tsvector, pg_stats stores not just the
distribution of the values themselves, but also their elements:
The most_common_elems and most_common_elem_freqs columns store
the most frequent elements and their frequencies;
elem_count_histogram contains a histogram of the number of elements per
value (for an array column, effectively a histogram of array lengths).
This enables more accurate query planning for fields that are not in the first
normal form. In particular, this information is crucial for GIN index method
operator classes, as it allows distinguishing between frequent elements
(those appearing in many documents, i.e., poorly selective conditions) and
rare elements (highly selective conditions).
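These fields live in the same pg_stats view; a sketch, where the table and array column names (schedule, days_of_week) are purely illustrative:

```sql
-- Element-level statistics for a composite (e.g. array) column
SELECT most_common_elems, most_common_elem_freqs, elem_count_histogram
FROM pg_stats
WHERE tablename = 'schedule' AND attname = 'days_of_week';
```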
Custom and Generic Plans
Custom plans
are built taking the parameter values into account;
may be beneficial for uneven distributions, but the query is re-planned
each time it's executed
Generic plan
is built without considering the parameter values;
best suited for uniform distributions;
cached within the session for prepared statements
Seq Scan on tasks
Filter: status = 'done'
Index Scan on tasks
Index Cond: status = 'todo'
Index Scan on tasks
Index Cond: id = $1
With the simple query protocol (refer to the "Planning and Execution"
section), each query is planned anew, taking the parameter values into
account.
The extended protocol allows statements, possibly with parameters, to be
prepared. Preparation always involves parsing and query rewriting, with the
parse tree stored in the backend process's local memory.
When executing a prepared statement, there are options. Typically the
query is re-planned each time, taking the parameter values into account.
Such plans are called custom plans.
It makes sense when there's an uneven distribution, as optimal plans may
vary depending on the values. For example, index access is more effective
for highly selective conditions, while a sequential scan is preferable when
the value is common.
But the query can also be planned without considering the parameter
values. This allows caching not only the parse tree but also the plan itself,
avoiding repeated planning. Such a plan is called generic.
A generic plan performs well with a uniform distribution, where the
condition's selectivity does not depend on the specific value. But with an
uneven distribution, a generic plan may perform well for some values and
poorly for others.
Let's see how the planner decides which option to use.
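The behavior can be observed with a prepared statement; a sketch (statement name and value are illustrative, using the tickets table from the practice):

```sql
PREPARE t(text) AS
SELECT count(*) FROM tickets WHERE passenger_name = $1;

-- The first several executions get custom plans, built for the actual
-- parameter value:
EXPLAIN EXECUTE t('ANNA VASILEVA');

-- Once the generic plan's cost proves competitive with the average
-- custom-plan cost, the cached generic plan may be used instead.
-- The choice can also be forced:
SET plan_cache_mode = force_generic_plan;  -- or force_custom_plan, auto
```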
Takeaways
Data characteristics are collected as statistics.
Statistics are used to estimate cardinality.
Cardinality is used to estimate cost.
Cost is used to select the optimal plan.
The key to success lies in accurate statistics and accurate
cardinality estimates.
Practice
1. Create an index on the tickets table on the passenger_name
column.
2. What statistics are available for this table?
3. Explain how cardinality is estimated and how the execution plan
is selected for the following queries: a) selecting all tickets;
b) selecting tickets by the name ALEKSANDR IVANOV; c) selecting
tickets by the name ANNA VASILEVA; d) selecting a ticket by the
identifier 0005432000284.