Join Methods: Hash Join
16
Copyright
© Postgres Professional, 2019–2024
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Photo by: Oleg Bartunov (Phu monastery, Bhrikuti summit, Nepal)
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Feedback
Please send your feedback, comments and suggestions to:
edu@postgrespro.ru
Disclaimer
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
2
Topics
Sequential hash join: single- and two-pass
Computational complexity
Parallel hash join: single- and two-pass
3
A hash table is constructed from the rows of one set, and the
rows of the other set are matched against it.
Equi-joins only
Hash Join
Hash Join
    Hash
        Seq Scan
    Seq Scan
[Figure: the hash table is built in the Hash node.]
Hash Join
The main idea of hashing is discussed in the "Types of Indexes" and
"Grouping" sections.
Hashing is used for joining as follows. Initially, a hash table is constructed
based on one of the datasets within the Hash node. Then, in the Hash Join
node, the rows from the other dataset are compared to the built table.
Similar to a hash index, a hash join can only function with an equality
condition: the hash function does not maintain the order of values.
Let's look at an example of how the join works.
4
One-pass Join
Used when the hash table fits into available memory
5
SELECT a.title, s.name FROM albums a JOIN songs s ON a.id = s.album_id;
Building the Hash Table
[Figure: the albums rows (id, title, year) are hashed on id, the column involved in the join condition; each hash table entry stores the hash code plus the fields used within the query (id, title). The table has four buckets (00–11), is limited to work_mem × hash_mem_multiplier, and is built from the inner set.]
The first step involves constructing a hash table in memory.
The rows of the first (inner) set are processed sequentially, with a hash
function computed for each row based on the values of the fields involved in
the join condition (in our example, these are numeric ID values).
The computed hash code and all fields involved in the join condition or used
in the query are stored in the hash table bucket.
The size of the hash table in memory is constrained by work_mem ×
hash_mem_multiplier. Optimal performance is achieved when the entire
hash table fits within this memory allocation. Therefore, the planner typically
selects the smaller of the two row sets as the inner one. It's also a good idea
to avoid including unnecessary fields in the query, such as the asterisk, to
prevent the hash table from being overloaded with extra data.
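The build phase can be sketched in Python (an illustrative model, not PostgreSQL's actual C implementation; the function and field names are invented for the example):

```python
# Illustrative model of the build phase: the bucket number is taken
# from the low bits of the hash code, matching the two-bit bucket
# labels 00-11 on the slide.
def build_hash_table(inner_rows, key, n_buckets=4):
    buckets = [[] for _ in range(n_buckets)]
    for row in inner_rows:
        h = hash(row[key])              # hash code of the join key
        # store the hash code and only the fields the query needs
        buckets[h % n_buckets].append((h, row))
    return buckets

albums = [{"id": 1, "title": "Yellow Submarine"},
          {"id": 3, "title": "Let It Be"},
          {"id": 4, "title": "The Beatles"},
          {"id": 6, "title": "Abbey Road"}]
table = build_hash_table(albums, "id")
```

Storing only the needed fields is exactly why `SELECT *` can hurt here: every extra column inflates each hash table entry.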
6
Row Matching
SELECT a.title, s.name FROM albums a JOIN songs s ON a.id = s.album_id;
[Figure: the songs rows (album_id, name) form the outer set; each is hashed on album_id and probed against the hash table built from albums.]
During the second phase, we read the second set of rows in
sequence. While reading, we compute the hash function from the values of
the fields involved in the join condition. If the matching bucket of the hash
table contains a row
1) with a matching hash code,
2) and with field values that satisfy the join condition,
then we have found a match.
Simply checking the hash code isn't enough. First, not all join conditions
listed in the query are taken into account during a hash join (only equijoins
are supported). Second, collisions may occur: different values can produce
the same hash code (the probability is low, but not zero).
In our example, the first row has no match.
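Both phases together can be modeled in a few lines of Python (again an illustrative sketch with invented names, not the executor's code); note the explicit recheck of the join condition after the hash codes match:

```python
# Illustrative one-pass hash join: build, then probe with a recheck.
def build(inner_rows, n_buckets=4):
    buckets = [[] for _ in range(n_buckets)]
    for row in inner_rows:
        h = hash(row["id"])
        buckets[h % n_buckets].append((h, row))
    return buckets

def probe(outer_rows, table):
    for row in outer_rows:
        h = hash(row["album_id"])
        for h_inner, inner in table[h % len(table)]:
            # a matching hash code alone is not enough: recheck the
            # join condition itself, since different keys can collide
            if h_inner == h and inner["id"] == row["album_id"]:
                yield (inner["title"], row["name"])

albums = [{"id": 1, "title": "Yellow Submarine"},
          {"id": 3, "title": "Let It Be"}]
songs = [{"album_id": 5, "name": "Act Naturally"},
         {"album_id": 1, "name": "All Together Now"},
         {"album_id": 3, "name": "Across the Universe"}]
pairs = list(probe(songs, build(albums)))
```

The first song (album_id 5) lands in a non-empty bucket but fails the hash-code check, mirroring the "no match" case on the slide.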
7
Row Matching
SELECT a.title, s.name FROM albums a JOIN songs s ON a.id = s.album_id;
[Figure: the second outer row (album_id 1, "All Together Now") lands in the bucket holding inner row id 1, "Yellow Submarine".]
The second row of the second set yields a match that can be passed up to
the higher-level node in the query plan: ('Yellow Submarine', 'All Together
Now').
8
Row Matching
SELECT a.title, s.name FROM albums a JOIN songs s ON a.id = s.album_id;
[Figure: the third outer row hashes to an empty bucket.]
No match exists for the third row (the corresponding hash table bucket is
empty)
9
Row Matching
SELECT a.title, s.name FROM albums a JOIN songs s ON a.id = s.album_id;
[Figure: the fourth outer row (album_id 3, "Across the Universe") lands in a bucket holding two inner rows, both of which are checked.]
The fourth match is ("Let It Be", "Across the Universe").
Note that the hash table bucket contains two rows from the first set, and
both must be examined.
10
Row Matching
SELECT a.title, s.name FROM albums a JOIN songs s ON a.id = s.album_id;
[Figure: the fifth outer row (album_id 1, "All You Need Is Love") again matches "Yellow Submarine".]
The fifth row corresponds to ("Yellow Submarine", "All You Need Is Love").
11
SELECT a.title, s.name FROM albums a JOIN songs s ON a.id = s.album_id;
Row Matching
[Figure: the sixth outer row hashes to an empty bucket.]
No match exists for the sixth row. The join is complete.
The algorithm's source code is located in the file
src/backend/executor/nodeHashjoin.c
13
Two-pass Join
Used when the hash table cannot fit into main memory: data
sets are divided into batches and joined sequentially.
14
Building the Hash Table
[Figure: the inner set is split into four batches; the hash table for batch 00 stays in memory (limited by work_mem × hash_mem_multiplier), while batches 01, 10, 11 are written to temporary files.]
If the hash table exceeds the memory limit set by work_mem ×
hash_mem_multiplier, the initial (inner) set of rows is divided into separate
batches. A certain number of hash code bits are used to distribute the data
into batches, so the number of batches is always a power of two. Ideally,
each batch would contain roughly the same number of rows, but if key
values repeat, skew can occur.
During query planning, the minimum number of batches required is
calculated in advance to ensure that the hash table for each batch fits
into memory. If the planner's estimates turn out to be wrong, the number
of batches can be adjusted dynamically during execution.
The hash table for the first batch stays in memory, while rows belonging
to other batches are written to disk as temporary files, each batch in
its own file.
The figure shows four batches.
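Batch assignment can be sketched as follows (an illustrative Python model; the real executor takes the batch number from specific bits of the hash code, approximated here with a simple bit mask, and lists stand in for temporary files):

```python
# Illustrative batch assignment: a few bits of the hash code select
# the batch, so the number of batches is always a power of two.
N_BATCHES = 4                  # always a power of two
MASK = N_BATCHES - 1           # low bits select the batch

def batch_of(join_key):
    return hash(join_key) & MASK

batches = [[] for _ in range(N_BATCHES)]
for album_id in [1, 3, 4, 6, 8, 11]:      # invented inner-set keys
    batches[batch_of(album_id)].append(album_id)
```

Because a row's batch depends only on its hash code, matching rows from both sets are guaranteed to land in the same batch.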
15
Join: Batch 1
[Figure: outer rows belonging to batch 00 are matched against the in-memory hash table; outer rows belonging to batches 01, 10, 11 are written to their own temporary files, subject to temp_file_limit.]
Next, the second (outer) set of rows is processed. If a row belongs
to the first batch, it is matched against the in-memory hash table built for
that batch. There is no need to match the row against other batches:
they cannot contain a match, as the hash codes are guaranteed to differ.
If a row belongs to another batch, it is written to disk, again with each
batch stored in its own temporary file.
Therefore, with N batches, a total of 2(N−1) files are used.
Note that the use of temporary files on disk is limited by the temp_file_limit
parameter, which sets the total disk space limit for the session. (Temporary
table buffers are excluded from this limit.)
16
Join: Batch 2
[Figure: batch 01 of the inner set is loaded into the hash table from its temporary file and probed with batch 01 of the outer set.]
Next, all batches are processed in turn, starting with the second. The
inner set's rows are loaded into a hash table from a temporary file;
then the outer set's rows are read from another temporary file and
matched against the hash table.
The procedure is repeated for all remaining batches, N−1 in total.
The figure illustrates the join for the second batch (01).
17
Join: Batch 3
[Figure: batch 10 of both sets is joined.]
The figure illustrates the join for the third batch (10).
18
Join: Batch 4
[Figure: the last batch (11) is joined.]
After processing the final batch, the join is complete and the temporary
files are freed.
Therefore, when insufficient RAM is available, the join algorithm operates in
two passes: each batch (except the first) must be written to disk and then
read back. Naturally, this impacts the join's efficiency. To avoid it, make
sure that:
only the fields actually needed are included in the hash table (the query
author's responsibility);
the hash table is built for the smaller set (the planner's responsibility).
20
Computational complexity
~ N + M, where
N and M represent the number of rows in the first and second data sets.
Initial overhead for building a hash table
Efficient for a large number of rows
The overall complexity of a hash join is proportional to the combined number
of rows in both data sets. Thus, hash joins are significantly more efficient
than nested loops when handling large datasets.
However, since a hash table must be built to start the join, the nested loop is
more efficient for small row counts.
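The crossover between the two methods can be illustrated with back-of-the-envelope cost functions (the constants here are invented for illustration only, not the planner's actual cost model):

```python
# Rough cost model: a hash join visits each row about once, plus the
# up-front cost of building the table; an unindexed nested loop
# rescans the inner set for every outer row.
def hash_join_cost(n, m, build_factor=2.0):
    return build_factor * n + m        # build, then probe

def nested_loop_cost(n, m):
    return n * m
```

With tiny inputs the build overhead dominates and the nested loop wins; as N and M grow, N + M quickly beats N × M.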
Hash joins (combined with full table scans) are typically used in OLAP
queries that require processing a large number of rows, where throughput is
prioritized over response time.
22
Efficient algorithm: parallel hash table construction and parallel
matching
Parallel algorithm
Gather
    Parallel Hash Join
        Parallel Hash
            Parallel Seq Scan
        Parallel Seq Scan
[Figure: several worker processes each execute the Parallel Hash Join subtree under Gather.]
Unlike other join methods, hash join not only supports parallel execution
plans but also has a distinct efficient algorithm. This algorithm enables
parallel execution of both join stages: building the hash table from the first
(inner) row set and matching the second (outer) row set against it.
The parallel hash join capability is governed by the enable_parallel_hash
parameter; by default, this parameter is enabled.
Similar to the sequential algorithm, the parallel version comes in two
variants: single-pass, when sufficient memory is available, and two-pass
otherwise.
Let's start with the single-pass approach.
23
Parallel, single pass
Processes use a shared hash table
24
Building the Hash Table
[Figure: worker processes build a single shared hash table (Parallel Hash), limited by work_mem × hash_mem_multiplier × the number of processes.]
The single-pass algorithm is employed when the hash table can fit into the
total memory allocated to all processes involved in the join, which means the
hash table's size is constrained by work_mem × hash_mem_multiplier × the
number of processes.
Processes read the first set of rows in parallel (e.g., using the Parallel Seq
Scan node) and build a shared hash table in shared memory, which all
processes can access.
25
Row Matching
[Figure: worker processes probe the shared hash table in parallel (Parallel Hash Join).]
Once the hash table is fully built, worker processes begin parallel scanning
of the second dataset, comparing the rows they read against the shared
hash table. Each process only examines a subset of the data using the hash
table.
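The probe phase parallelizes naturally because the finished hash table is read-only. A sketch using Python threads (a toy model of the shared table, not PostgreSQL's shared-memory implementation; all names are invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model: the hash table built from the inner set (albums) is
# read-only during the probe phase, so threads can share it freely.
shared_table = {1: "Yellow Submarine", 3: "Let It Be",
                4: "The Beatles", 6: "Abbey Road"}

def probe_chunk(chunk):
    # each worker scans its own slice of the outer set (songs)
    return [(shared_table[album_id], name)
            for album_id, name in chunk
            if album_id in shared_table]

songs = [(5, "Act Naturally"), (1, "All Together Now"),
         (3, "Across the Universe"), (1, "All You Need Is Love")]
chunks = [songs[:2], songs[2:]]              # split the outer set
with ThreadPoolExecutor(max_workers=2) as pool:
    matches = [m for part in pool.map(probe_chunk, chunks)
               for m in part]
```

Since no worker writes to the table during the probe, no locking is needed in this phase.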
27
Two parallel passes
Row sets are split into batches that are then processed in
parallel by worker processes.
28
Splitting into Batches
[Figure: both the inner and the outer set are split into batches 00–11, each written to its own temporary file.]
A hash table may not fit within the memory limit determined by work_mem ×
hash_mem_multiplier × number of processes, and this may become
apparent during the join execution. In this case, a two-pass algorithm is
employed, which differs significantly from both the two-pass sequential and
the single-pass parallel approaches.
First, worker processes read the first dataset in parallel, split it into batches,
and write the batches to temporary files. Unlike the sequential algorithm,
the first batch is also written to a file; no hash table is built in memory
at this stage.
Note that each process writes rows to all temporary files, with the writes
synchronized.
Next, worker processes read the second dataset in parallel, splitting it into
batches and writing them to temporary files.
As a result, 2N files are written to disk when there are N batches.
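The difference in temporary file counts between the sequential and parallel two-pass variants can be stated as a one-line formula (a sketch based on the figures quoted in the text):

```python
# Temporary file counts quoted in the text: the sequential two-pass
# join keeps batch 0 in memory and writes 2*(N-1) files; the parallel
# variant partitions both sets up front, writing 2*N files.
def temp_files(n_batches, parallel):
    if parallel:
        return 2 * n_batches          # inner + outer file per batch
    return 2 * (n_batches - 1)        # batch 0 never touches disk
```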
29
Row Matching
[Figure: each worker loads one batch's inner rows into its own hash table (limited by work_mem × hash_mem_multiplier) in shared memory and matches that batch's outer rows against it.]
Each worker process then selects one batch.
The process loads the selected batch's inner rows into a hash table in
memory. In this algorithm, each process has its own hash table, sized at
work_mem × hash_mem_multiplier, but these tables are stored in shared
memory, allowing all worker processes to access each table.
Once the hash table is populated, the process reads the second dataset
from the selected batch and matches the rows.
Once a process finishes processing a batch, it selects the next unprocessed
one.
30
Row Matching
[Figure: a worker with no batches left helps another worker finish batch 11, sharing its hash table.]
If there are no unprocessed batches left, the process teams up with another
process to help it finish up its batch. They can do that because all the hash
tables reside in shared memory.
Using multiple hash tables is more efficient than a single large one: it
simplifies coordination and reduces synchronization overhead.
32
Takeaways
Hash joins require preparation
A hash table must be built
Efficient for large data sets
Parallel joins are supported
Sensitive to the join order
The inner set should be smaller than the outer set, to reduce the
hash table's size.
Only equijoins are supported.
The 'greater than' and 'less than' operators are not applicable to hash codes.
Unlike nested loop joins, hash joins require setup, such as building a hash
table. Until the hash table is built, no result rows can be produced.
Hash joins, however, are efficient with large data volumes. Both row sets are
read sequentially and only once (twice if there's insufficient RAM).
A limitation of hash joins is that they only support equijoins. The issue is that
hash values can only be compared for equality, and the 'greater than' and
'less than' operations are not applicable.
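The equijoin-only restriction follows directly from the nature of hashing: a hash function preserves equality but destroys ordering. A quick illustration, using SHA-256 as a stand-in hash function:

```python
import hashlib

# A hash function preserves equality (equal inputs always yield the
# same code) but not order, so only "=" predicates can use a hash
# join: whether hash_code(1) < hash_code(2) holds is arbitrary.
def hash_code(value):
    return hashlib.sha256(str(value).encode()).hexdigest()
```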
33
Practice
1. Write a query to show occupied seats in the cabin for all
flights. Which join strategy did the planner select? Check whether
there was sufficient RAM to allocate the hash tables.
2. Modify the query to display only the total count of occupied
seats. How did the query plan change? Why didn't the planner use
the same plan for the previous query?
3. Examine the query plan to determine the order of operations.
1. To do this, simply join the flights table with the boarding_passes table.
3. Such a query needs to join three tables: tickets, ticket_flights,
and flights.