Join Methods
16
Copyright
© Postgres Professional, 2019–2024
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Photo by: Oleg Bartunov (Phu monastery, Bhrikuti summit, Nepal)
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Feedback
Please send your feedback, comments and suggestions to:
edu@postgrespro.ru
Disclaimer
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Topics
Merge Join Algorithm
Computational complexity
Merge Join in Parallel Execution Plans
Merging Two Sorted Row Sets
The result of the join is automatically sorted
[Diagram: a Merge Join node whose inputs are either sorted explicitly (Sort over Seq Scan) or arrive as already sorted data (Index Scan)]
The third and final join method is the merge join.
The idea of this method is that two pre-sorted data sets can be easily
merged into a single combined set that is sorted in the same way. The
Gather Merge node operates similarly.
Before performing a merge join, both sets of rows need to be sorted.
Sorting is an expensive operation with a complexity of O(N log N).
But sometimes this phase can be skipped if the rows are already sorted by
the required columns, for example, via indexed access to the table.
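The idea of combining two pre-sorted sets in a single pass can be sketched as follows. This is only an illustration of the principle, not PostgreSQL code; the function name `merge_sorted` and the sample key lists are assumptions made for the example.

```python
def merge_sorted(left, right):
    """Merge two already sorted lists into one sorted list in a single pass."""
    result = []
    i = j = 0
    # Repeatedly take the smaller of the two current values.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One input is exhausted; the rest of the other is already in order.
    result.extend(left[i:])
    result.extend(right[j:])
    return result


# Hypothetical sorted key lists for the two inputs.
merged = merge_sorted([1, 3, 4, 6], [1, 1, 2, 2, 3, 5])
print(merged)
```

Each input is read exactly once, which is why the pass itself is linear once the inputs are sorted.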
Merge
SELECT a.title, s.name
FROM albums a JOIN songs s ON a.id = s.album_id;

albums, sorted by id:
id  title             year
1   Yellow Submarine  1969
3   Let It Be         1970
4   The Beatles       1968
6   Abbey Road        1969

songs, sorted by album_id:
album_id  name
1         All Together Now
1         All You Need Is Love
2         Act Naturally
2         Another Girl
3         Across the Universe
5         A Day in the Life
The merge itself is straightforward. First, we take the first rows from both
sets and compare them. In this case, we immediately find a match and can
output the first row of the result: ('Yellow Submarine', 'All Together Now').
In general, the algorithm reads the next row from the set with the smaller
join key value (one set "catches up" with the other). If the values are equal,
as in our example, we read the next row from the second (inner) set.
Again there is a match: ('Yellow Submarine', 'All You Need Is Love').
We read the next row from the second set once more.
This time there is no match.
Since 1 < 2, we proceed to the next row in the first set.
No match found.
Since 3 > 2, we proceed to the next row in the second set.
Again there is no match, and again 3 > 2, so we read the next row from
the second set.
There is a match: ('Let It Be', 'Across the Universe').
Since 3 = 3, we proceed to the next row in the second set.
No match found.
Since 3 < 5, we read the next row from the first set.
No match found.
Since 4 < 5, we again read the next row from the first set.
And the final step: no match again.
The merge join has concluded.
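The steps of this walkthrough can be reproduced with a short sketch. It is a simplified illustration that assumes the join keys in the first (outer) set are unique, as they appear to be for albums.id in the example; the function name `merge_join_trace` and the trace format are hypothetical and not taken from PostgreSQL.

```python
def merge_join_trace(outer, inner):
    """Join two lists of (key, value) pairs sorted by key.

    Simplification: outer keys are assumed to be unique.
    Returns the joined rows and a log of every comparison made.
    """
    result, trace = [], []
    i = j = 0
    while i < len(outer) and j < len(inner):
        okey, oval = outer[i]
        ikey, ival = inner[j]
        if okey < ikey:
            trace.append(f"{okey} < {ikey}: next row in the first set")
            i += 1
        elif okey > ikey:
            trace.append(f"{okey} > {ikey}: next row in the second set")
            j += 1
        else:
            trace.append(f"{okey} = {ikey}: match, next row in the second set")
            result.append((oval, ival))
            j += 1
    return result, trace


albums = [(1, "Yellow Submarine"), (3, "Let It Be"),
          (4, "The Beatles"), (6, "Abbey Road")]
songs = [(1, "All Together Now"), (1, "All You Need Is Love"),
         (2, "Act Naturally"), (2, "Another Girl"),
         (3, "Across the Universe"), (5, "A Day in the Life")]

rows, steps = merge_join_trace(albums, songs)
for step in steps:
    print(step)
print(rows)
```

The trace contains one line per comparison, in the same order as the steps described above.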
In reality, the algorithm is more complex: if the first (outer) row set has
multiple identical values, it must be able to reread the rows of the second
(inner) set that have the same join key.
The algorithm's pseudocode can be found in the file
src/backend/executor/nodeMergejoin.c.
Notably, the merge algorithm returns the join result already sorted. In
particular, the resulting row set can be used for the next merge join without
further sorting.
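A minimal sketch of how rereading the inner rows might look is shown below. It is not a transcription of nodeMergejoin.c; it only illustrates the idea of remembering where a group of equal inner keys starts and returning to that position for each outer row with the same key. The name `merge_join` and the sample data are assumptions.

```python
def merge_join(outer, inner):
    """Inner equijoin of two lists of (key, value) pairs, each sorted by key.

    Duplicate keys are allowed on both sides.
    """
    result = []
    i = j = 0
    while i < len(outer) and j < len(inner):
        okey, ikey = outer[i][0], inner[j][0]
        if okey < ikey:
            i += 1
        elif okey > ikey:
            j += 1
        else:
            mark = j  # remember where the group of equal inner keys starts
            while i < len(outer) and outer[i][0] == okey:
                j = mark  # reread the inner group for this outer row
                while j < len(inner) and inner[j][0] == okey:
                    result.append((okey, outer[i][1], inner[j][1]))
                    j += 1
                i += 1
    return result


# Hypothetical data with a duplicated key (1) on both sides.
outer = [(1, "a"), (1, "b"), (2, "c"), (4, "d")]
inner = [(1, "x"), (1, "y"), (3, "z"), (4, "w")]
joined = merge_join(outer, inner)
print(joined)
```

The join key is kept in each output row so that it is easy to check that the result comes out ordered by that key.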
Computational complexity
~ N + M, where N and M are the numbers of rows in the first and second
data sets, if no sorting is required
~ N log N + M log M, if sorting is required
Potential Initial Sorting Costs
Efficient for a large number of rows
When the data does not need to be sorted, the overall complexity of a merge
join is proportional to the total number of rows in both data sets. Unlike a
hash join, a merge join has no overhead for building a hash table.
Therefore, merge joins can be used effectively in both OLTP and OLAP
queries.
However, if sorting is required, the cost becomes proportional to the number
of rows multiplied by the logarithm of that number. On large datasets, this
approach is likely to be less efficient than a hash join.
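To get a feeling for the difference between the two cases, the formulas can be evaluated for a few sizes. The sketch below counts abstract "steps" only; it is not the planner's cost model, and the function name `merge_join_work` is hypothetical.

```python
import math


def merge_join_work(n, m, need_sort):
    """Rough number of steps: N + M for the merge, plus N log N + M log M for sorting."""
    work = n + m
    if need_sort:
        # log2 is undefined for 0 and equals 0 for 1, so skip tiny inputs.
        work += sum(x * math.log2(x) for x in (n, m) if x > 1)
    return work


for size in (1_000, 1_000_000):
    presorted = merge_join_work(size, size, need_sort=False)
    with_sort = merge_join_work(size, size, need_sort=True)
    print(size, presorted, round(with_sort), round(with_sort / presorted, 1))
```

The ratio between the two estimates grows with the logarithm of the row count, which is why an explicit sort weighs more heavily on large data sets.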
In parallel execution plans
The outer row set is scanned in parallel, while the inner one
is read in full by each process on its own.
[Diagram: a Gather node over three Merge Join nodes; each Merge Join combines a Parallel Index Scan with a Sort over a Seq Scan]
The merge join algorithm can be used in a parallelized plan.
Just like with a nested loop join, scanning one set of rows is executed in
parallel by worker processes, but the other set of rows is read entirely by
each worker process on its own. Therefore, hash joins are far more
commonly used in parallel execution plans when dealing with large row sets,
due to their efficient parallel algorithm.
Each worker process scans the inner row set from the start until no
more matches can be found.
This slide shows the rows processed by the first process.
The second process will scan through the full set and find a match.
However, the third process will find no matches.
Of course, all three processes examine the inner set at the same time,
rather than one after another.
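The behavior of the three processes can be imitated with a small simulation. The split of the outer rows between workers is an assumption chosen to mirror the slides (in a real plan it depends on how the parallel scan hands out pages), the workers run one after another here rather than concurrently, and the helper name `worker_merge_join` is hypothetical. Outer keys are assumed to be unique.

```python
def worker_merge_join(outer_part, inner):
    """Merge join of one worker's share of the outer set with the whole inner set.

    Returns the joined rows and the number of inner rows the worker looked at.
    """
    result = []
    i = j = inner_seen = 0
    while i < len(outer_part) and j < len(inner):
        inner_seen = max(inner_seen, j + 1)
        okey, oval = outer_part[i]
        ikey, ival = inner[j]
        if okey < ikey:
            i += 1
        elif okey > ikey:
            j += 1
        else:
            result.append((oval, ival))
            j += 1
    return result, inner_seen


songs = [(1, "All Together Now"), (1, "All You Need Is Love"),
         (2, "Act Naturally"), (2, "Another Girl"),
         (3, "Across the Universe"), (5, "A Day in the Life")]

# Assumed distribution of the outer rows (albums) among three workers.
shares = [
    [(1, "Yellow Submarine")],
    [(3, "Let It Be")],
    [(4, "The Beatles"), (6, "Abbey Road")],
]

gathered, inner_rows_seen = [], []
for share in shares:
    rows, seen = worker_merge_join(share, songs)
    gathered.extend(rows)  # the Gather node collects rows from all workers
    inner_rows_seen.append(seen)

print(gathered)
print(inner_rows_seen)
```

Every worker starts reading the inner set from the beginning, so the inner rows are read several times in total; only the scan of the outer set is actually divided between the processes.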
Takeaways
Merge join may require some preparation
The row sets need to be sorted
or arrive already sorted
Efficient for large row sets
It is beneficial if the row sets are already sorted
It is beneficial if a sorted result is required
The method does not depend on the join order
Only equijoins are supported
Other join conditions are not implemented, but there are no fundamental
restrictions
To perform a merge join, both row sets must be sorted. It's beneficial if the
data is already in the correct order; otherwise, sorting is required.
Merge operations are highly efficient, even for large data sets. As a nice
bonus, the output is also sorted, making this join method advantageous
when higher-level plan nodes need sorted input (e.g., a query with an ORDER BY
clause or another merge join).
Thus, the planner has three join methods: nested loop, hash join, and merge join
(not counting their various modifications). Each method has scenarios where it
outperforms the others. This allows the planner to select the method that is
expected to be the most suitable for each specific scenario.
Practice
1. Check the execution plan of a query that returns all seats in the
cabins, ordered by aircraft code:
SELECT * FROM aircrafts a JOIN seats s ON a.aircraft_code = s.aircraft_code
ORDER BY a.aircraft_code;
Declare the query as a cursor.
Reduce the cursor_tuple_fraction parameter value by a factor of
ten. How does the execution plan change?
2. An aircraft can be replaced with another if the capacity
difference is no more than 20%. Generate a replacement table
between Boeing and Airbus models using a full join. How does
the query execute?
2. The query to execute:
WITH cap AS (
  SELECT a.model, COUNT(*)::NUMERIC AS capacity
  FROM aircrafts a JOIN seats s ON a.aircraft_code = s.aircraft_code
  GROUP BY a.model
), a AS (
  SELECT * FROM cap WHERE model LIKE 'Airbus%'
), b AS (
  SELECT * FROM cap WHERE model LIKE 'Boeing%'
)
SELECT a.model AS airbus, b.model AS boeing
FROM a FULL JOIN b ON b.capacity::NUMERIC / a.capacity BETWEEN 0.8 AND 1.2
ORDER BY 1, 2;