Data Access
05. Index Types
16
Copyright
© Postgres Professional, 2019–2024
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Photo by: Oleg Bartunov (Phu monastery, Bhrikuti summit, Nepal)
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Feedback
Please send your feedback, comments and suggestions to:
edu@postgrespro.ru
Disclaimer
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Topics
Hash Index
GiST
Operator Class
SP-GiST
GIN
BRIN
Concept of Hashing
CREATE INDEX ON t USING hash(title);
SELECT * FROM t WHERE title = 'Abbey Road';
hash('Abbey Road') = 100010 01
Table t:
id  title             year
1   Yellow Submarine  1969
6   Abbey Road        1969
4   The Beatles       1968
3   Let It Be         1970
Hash table (the last two bits of the hash code give the bucket number):
Bucket 00: 010001 00
Bucket 01: 101000 01, 100010 01
Bucket 10: (empty)
Bucket 11: 110001 11
The concept of hashing is that values of any data type are evenly distributed
across a limited number of buckets in a hash table using a hash function. If a
hash table is large enough to ensure that, on average, each bucket contains
only one value (hash code), then searching for a value in the hash table
takes constant time. A lookup proceeds in three steps:
1) the hash function is applied to the given value;
2) the bucket number is determined by several bits of the resulting hash
code;
3) the bucket is scanned for the hash code.
When data is not evenly distributed, a large number of values can end up in
a single bucket. In this case, search efficiency will decrease.
Essentially, a hash index is a hash table stored on disk.
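The lookup steps can be tried out by hand. The following is a minimal sketch, assuming a hypothetical table albums with made-up rows in the spirit of the slide. It relies on hashtext(), the built-in hash function for text values; taking the two lowest bits only imitates the choice among four buckets, since the real index computes the bucket number internally.

```sql
-- Hypothetical table and rows, used only for illustration.
CREATE TABLE albums (id int, title text, year int);
INSERT INTO albums VALUES
    (1, 'Yellow Submarine', 1969),
    (6, 'Abbey Road', 1969),
    (4, 'The Beatles', 1968),
    (3, 'Let It Be', 1970);

-- Step 1: compute the hash code.
-- Step 2: imitate the bucket choice with the two lowest bits (4 buckets).
SELECT title,
       hashtext(title)     AS hash_code,
       hashtext(title) & 3 AS bucket
FROM albums;

-- The index performs the same steps on its own, then scans the bucket.
CREATE INDEX albums_title_hash ON albums USING hash (title);
SELECT * FROM albums WHERE title = 'Abbey Road';
```

On a table this small the planner will most likely prefer a sequential scan, so the index only becomes visible in query plans on larger data sets.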
Hash Index
Stores only hash codes, not the original data.
The index size is independent of the indexing key.
Index-only scans are not possible.
The index grows dynamically
spiky growth
Searches only support equality conditions
A hash index stores only hash values and item pointers; the actual indexed
value is not stored. Therefore, the hash index size is not affected by the
indexing key size, but index-only scans are not possible — the value must
be retrieved from the table.
The size of the hash index grows dynamically as new values are added. As
the number of buckets doubles with each increase, the size grows in a spiky
manner.
Unlike B-trees, hash indexes have several limitations: they support only
equality searches, because the hash function does not preserve the order of
values; they do not support unique constraints; and they cannot be
multicolumn or have additional INCLUDE columns.
Therefore, hash indexes haven't become widely used. However, the hash
index can be faster in some cases due to its smaller size and fixed search
time compared to a B-tree index.
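These limitations are easy to observe. The sketch below assumes a hypothetical table hash_demo filled with generated data; the plans actually chosen depend on statistics and settings, so the comments describe what is expected rather than guaranteed.

```sql
-- Hypothetical table with generated data.
CREATE TABLE hash_demo (code text, note text);
INSERT INTO hash_demo
    SELECT md5(g::text), repeat('x', 100)
    FROM generate_series(1, 10000) AS g;

CREATE INDEX hash_demo_code_hash ON hash_demo USING hash (code);
ANALYZE hash_demo;

-- Equality condition: the hash index is a candidate access path.
EXPLAIN SELECT * FROM hash_demo WHERE code = md5('42');

-- Range condition: the hash index cannot be used at all.
EXPLAIN SELECT * FROM hash_demo WHERE code > 'f';

-- Each of these statements is expected to fail with an error:
-- CREATE UNIQUE INDEX ON hash_demo USING hash (code);
-- CREATE INDEX ON hash_demo USING hash (code, note);
-- CREATE INDEX ON hash_demo USING hash (code) INCLUDE (note);
```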
GiST Index Example
GiST stands for Generalized Search Tree.
We'll explore how GiST indexes work using an example of points on a plane.
The plane is divided into several rectangles that together cover all the
indexed points. These rectangles make up the top level of the tree.
As shown in the figure, rectangles may overlap (although this may reduce
search efficiency).
On the next level of the tree, each large rectangle splits into smaller
rectangles.
On the last level of the tree, each bounding rectangle holds as many points
as can fit on one index page.
The basic splitting condition is that the parent node's rectangle encloses all
rectangles within the corresponding subtree. This enables, for instance,
efficient retrieval of points located within a specific region:
1) at the topmost level of the index, locate the rectangles that intersect
the specified area;
2) move down into the selected subtrees and repeat the search.
This indexing method is referred to as an R-tree.
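A region search of this kind might look as follows. This is a sketch, assuming a hypothetical table places with randomly generated points; the <@ operator means "is contained in".

```sql
-- Hypothetical table with random points on a 100 x 100 plane.
CREATE TABLE places (name text, location point);
INSERT INTO places
    SELECT 'p' || g, point(random() * 100, random() * 100)
    FROM generate_series(1, 10000) AS g;

CREATE INDEX places_location_gist ON places USING gist (location);

-- Points inside the given rectangle; the index can skip subtrees
-- whose bounding rectangles do not intersect it.
SELECT count(*)
FROM places
WHERE location <@ box '(10,10),(20,20)';
```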
GiST Index
Balanced search tree
Supports arbitrary data types
Ordering is not required
Commonly supported operations
Inclusion within the area
Determining left, right, top, and bottom positions relative to the area
Nearest neighbor search
The first k values nearest to the specified one
A B-tree is a balanced tree where the values are ordered according to the
'greater than' and 'less than' operations. A GiST index also forms a
balanced tree, but the values are organized differently, for example, based
on the relative positioning of points on a plane.
This makes GiST suitable for data types where 'greater than' and 'less than'
operations lack inherent meaning, while enabling optimization of other
crucial operations for these types. For example, the GiST index can speed
up searching for a value that falls within a specific area or finding values
located on a particular side of a specified area.
Another important feature of the GiST index is its support for nearest
neighbor searches. The index enables quick retrieval of several values
closest to a given one.
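A nearest neighbor search is written as an ORDER BY on a distance operator combined with LIMIT. The sketch below assumes a hypothetical table shops with random points; <-> is the distance operator for points.

```sql
-- Hypothetical table with random points.
CREATE TABLE shops (name text, location point);
INSERT INTO shops
    SELECT 'shop ' || g, point(random() * 100, random() * 100)
    FROM generate_series(1, 10000) AS g;

CREATE INDEX shops_location_gist ON shops USING gist (location);

-- The three shops closest to the point (50,50).
SELECT name, location <-> point '(50,50)' AS distance
FROM shops
ORDER BY location <-> point '(50,50)'
LIMIT 3;
```

With the GiST index in place, the rows can be returned in distance order without sorting the whole table.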
Operator Class
Bridge between the index method and data type
Can encompass a substantial portion of the indexing logic
GiST operator classes by data type: point_ops (point), box_ops (box),
inet_ops (inet), range_ops (anyrange)
To enable access methods to work with various data types (which can be
dynamically loaded in PostgreSQL), there's an intermediary called an
operator class that contains the required operators and support functions.
B-trees and hash indexes, like other index types, rely on operator classes,
though these classes are relatively simple, containing basic operators like
"equal," "greater than," and "less than."
Operator classes for GiST indexes contain a significant portion of the
indexing logic and define the rules for adding values to the index and
searching for them. Therefore, GiST can accelerate different operations for
various data types.
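The operator classes available for an access method can be listed from the system catalog. A minimal sketch using the pg_opclass and pg_am catalogs:

```sql
-- GiST operator classes and the data types they index.
SELECT am.amname              AS access_method,
       opc.opcname            AS operator_class,
       opc.opcintype::regtype AS indexed_type,
       opc.opcdefault         AS is_default
FROM pg_opclass AS opc
JOIN pg_am AS am ON am.oid = opc.opcmethod
WHERE am.amname = 'gist'
ORDER BY opc.opcname;
```

Changing the amname filter to 'btree', 'hash', 'spgist', 'gin', or 'brin' shows the classes of the other index types.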
For instance, GiST can be used to index values of range types, such as
int4range or tstzrange. This index enables you to find ranges contained
within, overlapping, or adjacent to the specified range, and so on.
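As a sketch of range indexing, assume a hypothetical table bookings_demo with a few made-up int4range values. The operators && (overlaps), <@ (is contained in), and -|- (is adjacent to) are all supported by the default GiST operator class for ranges.

```sql
-- Hypothetical table with made-up ranges.
CREATE TABLE bookings_demo (room int, during int4range);
INSERT INTO bookings_demo VALUES
    (1, '[10,20)'),
    (1, '[20,30)'),
    (2, '[15,25)');

CREATE INDEX bookings_demo_during_gist
    ON bookings_demo USING gist (during);

-- Ranges overlapping [18,22)
SELECT * FROM bookings_demo WHERE during && int4range(18, 22);

-- Ranges contained within [0,25)
SELECT * FROM bookings_demo WHERE during <@ int4range(0, 25);

-- Ranges adjacent to [30,40)
SELECT * FROM bookings_demo WHERE during -|- int4range(30, 40);
```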
GiST can be viewed as a framework upon which custom indexing schemes
(not just R-trees) are built, allowing for the implementation of operator
classes as needed. This approach is far simpler than building a new index
type from the ground up, a process that is highly complex and requires
substantial developer expertise.
SP-GiST Index Example
SP-GiST stands for Space-Partitioned GiST. This is also a generalized
search tree, but it is built by dividing the search space into non-overlapping
regions.
Let's look at an example of an SP-GiST index for points on a plane.
One option is a quadtree: the root node splits the plane into four
quadrants relative to a selected centroid point.
Each of the four quadrants is further divided into sub-quadrants.
Splitting continues until all points in a quadrant fit on a single index
page.
SP-GiST Index
Unbalanced search tree
A sparsely branching tree with significant depth
Supports arbitrary data types
The operations are similar to those in GiST
Inclusion within the area
Determining left, right, top, and bottom positions relative to the area
Nearest neighbor search
SP-GiST, similar to GiST, serves as a framework for building arbitrary
indexing schemes through the implementation of operator classes. For
example, the quadtree for points is implemented by the quad_point_ops
operator class. Another approach is to divide the plane into two parts
instead of four. This approach is known as a k-d tree and is implemented by
a different operator class, kd_point_ops.
Dividing the plane into non-overlapping regions results in unbalanced trees
that typically exhibit limited branching and possess greater depth.
SP-GiST indexes typically support the same data types and operators as
GiST. However, their different index structure can make them either more or
less efficient than GiST.
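Both partitioning schemes can be tried on the same data. The sketch below assumes a hypothetical table sp_points with random points; quad_point_ops is the default SP-GiST operator class for the point type, so it does not need to be named.

```sql
-- Hypothetical table with random points.
CREATE TABLE sp_points (p point);
INSERT INTO sp_points
    SELECT point(random() * 100, random() * 100)
    FROM generate_series(1, 10000);

-- Quadtree (default operator class for point).
CREATE INDEX sp_points_quad ON sp_points USING spgist (p);

-- k-d tree: the operator class is named explicitly.
CREATE INDEX sp_points_kd ON sp_points USING spgist (p kd_point_ops);

-- The same kind of region query as with GiST.
SELECT count(*) FROM sp_points WHERE p <@ box '(10,10),(20,20)';
```

Which of the indexes performs better depends on the data distribution, so it is worth measuring on real data.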
The Concept of the GIN Index
GIN, or generalized inverted index
Figure: an inverted index, in which each value is listed together with the
documents (pages) that contain it.
GIN stands for Generalized Inverted Index.
The easiest way to understand this indexing method is by examining a
subject index in a regular book. Terms appear on the book's pages, while the
subject index lists all terms in alphabetical order, along with the page
numbers where they appear.
The GIN index is primarily used for document indexing to speed up full-text
searches. Essentially, it's a standard B-tree, but instead of storing the
documents themselves, it stores the individual words that compose them.
The GIN index is optimized for scenarios where each word may appear in
multiple documents. If the "page list" is very large, it is stored in a separate
B-tree instead of the index page itself.
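For full-text search, a GIN index is typically built over the tsvector representation of the documents. This is a sketch, assuming a hypothetical table docs with a few made-up rows and the built-in 'english' text search configuration.

```sql
-- Hypothetical table with made-up documents.
CREATE TABLE docs (id int, body text);
INSERT INTO docs VALUES
    (1, 'A B-tree index keeps values in sorted order'),
    (2, 'An inverted index maps words to documents'),
    (3, 'A search tree can be balanced or unbalanced');

-- Words (lexemes) are indexed, not whole documents.
CREATE INDEX docs_body_gin
    ON docs USING gin (to_tsvector('english', body));

-- Documents containing both "index" and "tree".
SELECT id
FROM docs
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'index & tree');
```

The query must use the same expression as the index definition for the index to be applicable.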
GIN Index
Inverted list
For data types whose values (documents) are composed of elements
Elements are indexed, not the documents themselves
Commonly supported operations
Check if a document matches a search query
Check if an element is present in an array
Search JSON documents by keys or values
GIN is designed for data types where values consist of elements rather than
being atomic. Rather than indexing the values themselves, the elements are
indexed.
Like GiST and SP-GiST, GIN is a framework that can be configured not only
for text (consisting of words) but also for other data types, such as arrays
(composed of elements) and JSON documents (containing keys and
values). To accomplish this, an operator class is created to break down the
document into elements and verify if it matches the search query.
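The same approach applies to arrays and JSON. The sketch below assumes a hypothetical table articles with made-up rows; it uses the default GIN operator classes for arrays and for the jsonb type.

```sql
-- Hypothetical table with made-up rows.
CREATE TABLE articles (id int, tags text[], meta jsonb);
INSERT INTO articles VALUES
    (1, ARRAY['postgres', 'index'], '{"author": "alice", "year": 2024}'),
    (2, ARRAY['postgres', 'vacuum'], '{"author": "bob"}'),
    (3, ARRAY['index', 'gin'], '{"year": 2023}');

CREATE INDEX articles_tags_gin ON articles USING gin (tags);
CREATE INDEX articles_meta_gin ON articles USING gin (meta);

-- Arrays containing the element 'index'
SELECT id FROM articles WHERE tags @> ARRAY['index'];

-- JSON documents that have the key "author"
SELECT id FROM articles WHERE meta ? 'author';

-- JSON documents containing the given key/value pair
SELECT id FROM articles WHERE meta @> '{"year": 2024}';
```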
BRIN Index Example
SELECT * FROM t WHERE temperature BETWEEN 20 AND 30;
Each range is a group of sequentially arranged pages with its own summary
information:
range  pages
1      1-128
2      129-256
3      257-384
4      385-512
5      513-640
6      641-768
7      769-896
8      897-1024
9      1025-1152
BRIN stands for Block Range Index. The table is divided into
zones of a defined (configurable) length, each comprising a set of
sequentially arranged pages. Each zone contains summary data, including
the minimum and maximum values of the indexed column.
During query execution, you can skip all zones that are guaranteed not to
satisfy the condition. As shown on the slide, only two zones could contain
temperature values satisfying the condition.
Unlike other indexes, the BRIN index does not store row version identifiers,
so all row versions in the selected zones must be examined. In a way, BRIN
can be considered an accelerator for sequential scans.
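The query from the slide can be reproduced on generated data. This is a sketch, assuming a hypothetical table readings in which temperature grows together with the physical row order, so that each zone covers a narrow interval of values.

```sql
-- Hypothetical table: values increase along with the physical order.
CREATE TABLE readings AS
    SELECT g AS id, g / 1000.0 AS temperature
    FROM generate_series(1, 1000000) AS g;

CREATE INDEX readings_temperature_brin
    ON readings USING brin (temperature);
ANALYZE readings;

-- Only the zones whose [min, max] interval intersects [20, 30]
-- need to be read; every row in those zones is then rechecked.
EXPLAIN
SELECT * FROM readings WHERE temperature BETWEEN 20 AND 30;
```

If the planner picks the index, the plan is expected to show a bitmap scan with a recheck condition; the exact plan depends on statistics and settings.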
BRIN Index
List of zones containing summary information
A zone spans a group of consecutive pages
Summary information: minimum, maximum, and so on
Correlation with the physical location of rows is required
Does not maintain item pointers to row versions
Bitmap scan only
Designed for very large tables
Compact size
Configurable size-accuracy ratio
Using operator classes, you can select the summary data stored in the index
for each zone. This can range from just minimum and maximum values to
multiple value ranges, and for geometric data types, it can store an
enclosing rectangle (as in GiST).
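The kind of summary is chosen through the operator class in the index definition. The sketch below assumes a hypothetical table measurements; the operator class names are the built-in ones, and the multi-range variant is available in PostgreSQL 14 and later.

```sql
-- Hypothetical table.
CREATE TABLE measurements (taken_at timestamptz, value int, area box);

-- Default for integers: one minimum and one maximum per zone.
CREATE INDEX measurements_value_minmax
    ON measurements USING brin (value);

-- Several value ranges per zone (PostgreSQL 14 and later).
CREATE INDEX measurements_value_multi
    ON measurements USING brin (value int4_minmax_multi_ops);

-- Geometric type: an enclosing rectangle per zone.
CREATE INDEX measurements_area_incl
    ON measurements USING brin (area box_inclusion_ops);
```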
Regardless of the case, BRIN requires a correlation between column values
and the physical row location to function effectively, ensuring that values
with similar summary information are grouped into the same zone. Data
updates can disrupt the correlation, potentially impacting the index's
efficiency.
Since BRIN does not store row version pointers, it can only return a lossy
bitmap of the pages in the matching zones. Standard index scans (and
index-only scans) are not possible.
However, a BRIN index is very small, and its size can be adjusted by
specifying the zone size. The larger the zone, the smaller the index, but the
lower the accuracy. This makes BRIN a good fit for the massive tables
commonly found in data warehouses.
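The zone size is set with the pages_per_range storage parameter, which defaults to 128 pages. The following sketch, assuming a hypothetical table brin_sizes, builds two BRIN indexes and a B-tree on the same column so that their sizes can be compared.

```sql
-- Hypothetical table with one million sequential values.
CREATE TABLE brin_sizes AS
    SELECT g AS n FROM generate_series(1, 1000000) AS g;

-- Default zone size (128 pages).
CREATE INDEX brin_sizes_128 ON brin_sizes USING brin (n);

-- Smaller zones: a larger but more accurate index.
CREATE INDEX brin_sizes_8 ON brin_sizes USING brin (n)
    WITH (pages_per_range = 8);

-- B-tree on the same column, for comparison.
CREATE INDEX brin_sizes_btree ON brin_sizes (n);

SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname LIKE 'brin_sizes%'
ORDER BY relname;
```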
Takeaways
Besides B-tree, there are other, more specialized index types.
Hash index for equality queries
GiST and SP-GiST for non-sortable data types
GIN for documents
BRIN for very large tables
Practice
1. Compare the size and build time of the hash index on columns
with varying sizes (book_ref and contact_data) in the tickets
table.
Do the same for a B-tree index.
2. Use the GIN index and pg_trgm extension to find passengers
whose phone numbers contain the digit sequence 1234.
Can a B-tree help speed up this query?
1. To estimate the size of an index, use the pg_total_relation_size function
with the index name as a parameter.
2. Create a GIN index on the expression contact_data->>'phone' using
the gin_trgm_ops operator class provided by the pg_trgm extension.
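The syntax behind both hints can be sketched on a small stand-in table. The table tickets_demo and its phone numbers are made up for illustration; in the exercise, the same statements would target the tickets table.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical stand-in for the tickets table.
CREATE TABLE tickets_demo (ticket_no text, contact_data jsonb);
INSERT INTO tickets_demo VALUES
    ('0001', '{"phone": "+70001234567"}'),
    ('0002', '{"phone": "+70009876543"}');

-- Expression index: note the extra parentheses around the expression.
CREATE INDEX tickets_demo_phone_trgm
    ON tickets_demo USING gin ((contact_data->>'phone') gin_trgm_ops);

-- Substring search that a trigram index can support.
SELECT ticket_no
FROM tickets_demo
WHERE contact_data->>'phone' LIKE '%1234%';

-- Index size, passing the index name as the parameter.
SELECT pg_size_pretty(pg_total_relation_size('tickets_demo_phone_trgm'));
```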