Statistics
Basic Statistics
16
Copyright
© Postgres Professional, 2019–2024
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Photo by: Oleg Bartunov (Phu monastery, Bhrikuti summit, Nepal)
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Feedback
Please send your feedback, comments and suggestions to:
edu@postgrespro.ru
Disclaimer
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Topics
Functional dependency
Most frequent value combinations
Count of unique value combinations
Expression Statistics
Extended Statistics
Contains
Multivariate statistics (across multiple columns)
Expression Statistics
Database object created manually
resides in pg_statistic_ext and pg_statistic_ext_data
displayed in the pg_stats_ext and pg_stats_ext_exprs views
Once created, the statistics are automatically collected.
The base statistics automatically collected may not be sufficient for accurate
estimates of cardinality and selectivity.
PostgreSQL allows the database administrator to manually determine which
additional, extended statistics are required. You can collect statistics that
cover multiple columns (multivariate statistics) or statistics for arbitrary
expressions.
Keep in mind that base statistics are automatically collected for tables and
their columns, but not for indexes—except for expression indexes.
Therefore, an index built on multiple columns does not automatically result in
the generation of multivariate statistics.
Extended statistics are created with the CREATE STATISTICS command. Once
the object exists, its statistics are collected automatically, either in the
background by autoanalyze or when ANALYZE is run manually.
The collected information is stored in the tables pg_statistic_ext and
pg_statistic_ext_data; the statistics accessible to users are displayed in the
views pg_stats_ext and pg_stats_ext_exprs.
Dependent Columns
Functional Dependency (Dependencies)
The value of one column defines the value of another column.
These statistics improve the estimation of condition selectivity.
city             index
Kazan            420000
Samara           443000
Nizhny Novgorod  603000
Veliky Novgorod  173000
There are several types of multivariate statistics (i.e., statistics for multiple
table columns) that can be specified when creating an extended statistics
object.
Functional dependency between columns shows how much the data in one
column is determined by the value of another column.
In the example on the slide, the postal code clearly defines the city, so the
selectivity of the condition city = 'Samara' and index = '443000' is
determined by the selectivity of the index = '443000' predicate. Such
predicates, whose selectivity cannot be calculated independently of each
other, are called correlated.
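The effect of such a correlation can be illustrated with a small calculation. The sketch below is a toy model in Python, not PostgreSQL's estimator: the row counts per city are invented for illustration, and selectivity is a hypothetical helper. It compares the estimate obtained by multiplying the two selectivities, as if the predicates were independent, with the actual fraction of matching rows.

```python
from fractions import Fraction

# Toy table (city, index). The row counts per city are made up for illustration.
rows = (
    [("Kazan", "420000")] * 4
    + [("Samara", "443000")] * 3
    + [("Nizhny Novgorod", "603000")] * 2
    + [("Veliky Novgorod", "173000")] * 1
)


def selectivity(rows, predicate):
    """Return the fraction of rows that satisfy the predicate."""
    matching = sum(1 for row in rows if predicate(row))
    return Fraction(matching, len(rows))


sel_city = selectivity(rows, lambda r: r[0] == "Samara")
sel_index = selectivity(rows, lambda r: r[1] == "443000")

# Estimate under the independence assumption: multiply the selectivities.
independent_estimate = sel_city * sel_index

# Actual selectivity of the combined condition.
actual = selectivity(rows, lambda r: r[0] == "Samara" and r[1] == "443000")

print("independent estimate:", independent_estimate)
print("actual selectivity:  ", actual)
```

In this toy data the actual selectivity equals the selectivity of the index predicate alone, while the independence assumption underestimates it.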
Most Frequent Combinations
Most Common Value Combinations (MCV)
Similar to pg_stats.most_common_vals and most_common_freqs, but for
multiple columns.
These statistics improve the estimation of condition selectivity.
city             river
Kazan            Volga
Nizhny Novgorod  Volga
Veliky Novgorod  Volkhov
Samara           Volga
Nizhny Novgorod  Oka
The list of most common values for several columns stores the most frequent
combinations of values together with their frequencies.
The slide shows possible city-river pairs. It's clear that the predicates with
the city and river columns are correlated, but unlike the previous example,
neither column determines the other — there's a many-to-many relationship
between cities and rivers. In this case, functional dependency statistics won't
help improve the estimates.
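A minimal sketch of the idea, using the five city-river pairs from the slide as the whole table. It is a toy model in Python, not PostgreSQL's implementation; independent_estimate and mcv_estimate are hypothetical helper names. Because the table is tiny, the combination list here is complete, whereas a real list keeps only the most common combinations.

```python
from collections import Counter
from fractions import Fraction

# The five (city, river) pairs from the slide, treated as the whole table.
rows = [
    ("Kazan", "Volga"),
    ("Nizhny Novgorod", "Volga"),
    ("Veliky Novgorod", "Volkhov"),
    ("Samara", "Volga"),
    ("Nizhny Novgorod", "Oka"),
]
total = len(rows)

# Per-column frequencies, analogous to single-column most-common-value lists.
city_freq = {c: Fraction(n, total) for c, n in Counter(r[0] for r in rows).items()}
river_freq = {v: Fraction(n, total) for v, n in Counter(r[1] for r in rows).items()}

# Frequencies of (city, river) combinations: a toy multi-column list.
pair_freq = {p: Fraction(n, total) for p, n in Counter(rows).items()}


def independent_estimate(city, river):
    """Selectivity estimate that treats the two predicates as independent."""
    return city_freq.get(city, Fraction(0)) * river_freq.get(river, Fraction(0))


def mcv_estimate(city, river):
    """Selectivity taken directly from the combination frequencies."""
    return pair_freq.get((city, river), Fraction(0))


for city, river in [("Nizhny Novgorod", "Volga"), ("Samara", "Oka")]:
    print(city, river, independent_estimate(city, river), mcv_estimate(city, river))
```

The pair Samara and Oka never occurs in the data, yet the independence assumption assigns it a non-zero selectivity; the combination frequencies do not.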
Unique Value Combinations
Count of Unique Value Combinations (ndistinct)
Similar to pg_stats.n_distinct, but for multiple columns
These statistics improve the estimation of cardinality for grouping.
city             region                  index
Kazan            Republic of Tatarstan   420000
Samara           Samara Region           443000
Nizhny Novgorod  Nizhny Novgorod Region  603000
Veliky Novgorod  Novgorod Region         173000
The number of unique value combinations enables more accurate cardinality
estimates when grouping by multiple columns.
In the example on the slide, the number of combinations that actually occur
cannot be determined by simply multiplying the unique counts for each
column: each column has four distinct values, so the product is 64, yet only
four combinations exist.
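The arithmetic can be checked with a short sketch in Python that uses the four rows from the slide as the whole table. It is only an illustration of why per-column counts are not enough, not a model of the planner; the variable names are hypothetical.

```python
from math import prod

# The four rows from the slide: (city, region, index).
rows = [
    ("Kazan", "Republic of Tatarstan", "420000"),
    ("Samara", "Samara Region", "443000"),
    ("Nizhny Novgorod", "Nizhny Novgorod Region", "603000"),
    ("Veliky Novgorod", "Novgorod Region", "173000"),
]

# Number of distinct values in each column taken separately.
per_column = [len({row[i] for row in rows}) for i in range(3)]

# Naive estimate of the number of groups: product of per-column counts.
naive_groups = prod(per_column)

# Actual number of distinct (city, region, index) combinations.
actual_groups = len(set(rows))

print("distinct values per column:", per_column)
print("naive number of groups:    ", naive_groups)
print("actual number of groups:   ", actual_groups)
```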
Expression statistics
Extended expression statistics
as if a generated column existed in the table
These statistics improve selectivity estimates for conditions that use
expressions.
city             index   address
Kazan            420000  420000, Kazan
Samara           443000  443000, Samara
Nizhny Novgorod  603000  603000, Nizhny Novgorod
Veliky Novgorod  173000  173000, Veliky Novgorod
Extended statistics on an expression enable the collection of all the basic
statistics that would be gathered if a column computed using this expression
existed in the table.
If a predicate uses an expression instead of a column name on either side
of the operator, the planner uses a fixed selectivity estimate. Expression
statistics resolve this issue.
Note that expressions can also be used instead of column names in any
type of multivariate statistics.
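The difference between a fixed estimate and an estimate based on expression statistics can be sketched as follows. This is a toy model in Python under stated assumptions: the address function is a hypothetical stand-in for the expression on the slide, and the constant 0.005 is assumed here as an example of a fixed default for equality conditions, not taken from the course text.

```python
from collections import Counter
from fractions import Fraction

# Assumption: an example of a fixed fallback selectivity for equality.
DEFAULT_EQ_SEL = Fraction(5, 1000)

# The four (city, index) rows from the slide, treated as the whole table.
rows = [
    ("Kazan", "420000"),
    ("Samara", "443000"),
    ("Nizhny Novgorod", "603000"),
    ("Veliky Novgorod", "173000"),
]


def address(city, index):
    """Hypothetical expression that builds the address shown on the slide."""
    return f"{index}, {city}"


# Statistics over the expression, gathered as if it were a table column.
values = [address(city, index) for city, index in rows]
expr_freq = {v: Fraction(n, len(values)) for v, n in Counter(values).items()}

target = "443000, Samara"

# Without statistics on the expression only the fixed default is available.
estimate_without_stats = DEFAULT_EQ_SEL

# With statistics the frequency of the computed value is known.
estimate_with_stats = expr_freq.get(target, Fraction(0))

print("fixed estimate:        ", estimate_without_stats)
print("estimate with statistics:", estimate_with_stats)
```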
Takeaways
Extended statistics assist in complex scenarios
Manually created and automatically maintained
May increase the costs of analysis and planning
Practice
1. Using the commands from the demo, create statistics such as
dependencies, mcv, and ndistinct for the flights table. Measure
the execution time of the ANALYZE command. Remove the
extended statistics, measure the execution time of the ANALYZE
command once more, and compare it to the previous result.
2. Write a query to select all business class flights priced over
100,000 rubles. Will extended statistics help improve the
cardinality estimate for the result? If so, which type of statistics
is better to use?
1. Measure the execution time multiple times and average the results to
smooth out irregularities.