Join Methods
16
Copyright
© Postgres Professional, 2019–2024
Authors: Egor Rogov, Pavel Luzanov, Ilya Bashtanov
Photo by: Oleg Bartunov (Phu monastery, Bhrikuti summit, Nepal)
Use of course materials
Non-commercial use of course materials (presentations, demonstrations) is
allowed without restrictions. Commercial use is possible only with the written
permission of Postgres Professional. It is prohibited to make changes to the
course materials.
Feedback
Please send your feedback, comments and suggestions to:
edu@postgrespro.com
Disclaimer
In no event shall Postgres Professional company be liable for any damages
or loss, including loss of profits, that arise from direct or indirect, special or
incidental use of course materials. Postgres Professional company
specifically disclaims any warranties on course materials. Course materials
are provided “as is,” and Postgres Professional company has no obligations
to provide maintenance, support, updates, enhancements, or modifications.
Topics
General Considerations on Joins
Nested loop join
Variations: left, semi-, and anti-joins
Computational complexity
Parallel Execution Plans with Nested Loops
Joins
Join methods are not SQL joins
inner, left, right, full, and cross joins, along with IN and EXISTS, are logical
operations
Join methods are the implementation mechanism
It's not tables that are joined, but sets of rows.
may originate from any node in the execution plan tree
Row sets are joined in pairs
The order of joins is crucial for performance.
The order within the pair is typically important.
Retrieving data with the access methods discussed earlier isn't enough;
you also need to be able to join the resulting row sets. PostgreSQL offers
several methods for this.
Join methods are algorithms designed to combine two row sets. These
methods also implement other SQL constructs, such as EXISTS. Avoid
confusing the two: SQL joins are logical operations on two sets, whereas
PostgreSQL's join methods are the actual implementations that consider
performance.
It's common to hear that tables are joined. This is a convenient
simplification, but in reality, row sets are what get joined. These sets can be
directly retrieved from the table (via one of the access methods), but they
can also, for instance, result from joining other row sets.
Finally, row sets are always joined pairwise. The order in which tables are
joined typically doesn't affect the query result (for example, (a join b) join c
versus a join (b join c)), but it can greatly impact performance. As we'll see
later, the order within a pair also matters (e.g., a join b versus b join a).
Nested Loop
For each row in one set, we look for matching rows in the other
set
SELECT a.title, s.name FROM albums a JOIN songs s ON
a.id = s.album_id;
Plan tree:
Nested Loop
  Seq Scan on albums (outer set)
  Index Scan on songs (inner set)
Let's begin with the nested loop join, the simplest approach. The algorithm
works by iterating through each row in one set and returning the
corresponding rows from the second set. In essence, this is two nested
loops, which is why it's called the nested loop method.
Note that the second (inner) set is accessed as many times as there are
rows in the first (outer) set. If there's no efficient way to find the
corresponding rows in the second set (in practice, an index on the table),
many non-matching rows have to be scanned over and over. This is clearly
not the best choice, although for small sets the algorithm can still be
quite efficient.
In the query plan, you'll see a Nested Loop node with two child nodes (which
can represent not just access methods, but also other operations like joins
or aggregations).
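The algorithm described above can be sketched in a few lines of Python. This is only an illustration, not PostgreSQL's implementation: the sample rows come from the slides, while build_index and nested_loop_join are hypothetical helper names, and a dictionary stands in for the index on songs.

```python
from collections import defaultdict

# Sample data taken from the slides.
albums = [  # (id, title, year): the outer set
    (3, "Let It Be", 1970),
    (1, "Yellow Submarine", 1969),
    (6, "Abbey Road", 1969),
    (4, "The Beatles", 1968),
]
songs = [  # (album_id, name): the inner set
    (1, "All Together Now"),
    (1, "All You Need Is Love"),
    (2, "Act Naturally"),
    (2, "Another Girl"),
    (3, "Across the Universe"),
    (5, "A Day in the Life"),
]

def build_index(rows, key_pos):
    """Hypothetical stand-in for an index: key value -> list of rows."""
    index = defaultdict(list)
    for row in rows:
        index[row[key_pos]].append(row)
    return index

def nested_loop_join(outer_rows, inner_index, outer_key_pos):
    """Yield (outer_row, inner_row) pairs, one outer row at a time."""
    for outer_row in outer_rows:                     # outer loop
        key = outer_row[outer_key_pos]
        for inner_row in inner_index.get(key, []):   # inner loop via the index
            yield outer_row, inner_row

songs_by_album = build_index(songs, key_pos=0)
result = [(a[1], s[1]) for a, s in nested_loop_join(albums, songs_by_album, 0)]
```

Because the function is a generator, the first pair is available as soon as the first match is found, which mirrors how a Nested Loop node hands rows to its parent one at a time.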
[Figure: Nested Loop, step 1. Outer set albums (id, title, year): (3, Let It Be, 1970), (1, Yellow Submarine, 1969), (6, Abbey Road, 1969), (4, The Beatles, 1968). Inner set songs (album_id, name): (1, All Together Now), (1, All You Need Is Love), (2, Act Naturally), (2, Another Girl), (3, Across the Universe), (5, A Day in the Life). The first album row is paired with its matching song.]
The figures illustrate this join method. In the figures:
Rows that have already been accessed are shown in gray.
Rows currently being accessed are highlighted in color.
Rows forming a pair that matches the join condition (here, equality of the
numeric identifiers) are highlighted with an orange border.
First, we read the first row of the first set and find its corresponding row in
the second set. A match was found, and the first result row is ready to be
returned to the parent plan node: ('Let It Be', 'Across the Universe').
[Figure: Nested Loop, step 2. The second album row (1, Yellow Submarine, 1969) is paired with its first matching song (1, All Together Now).]
Next, we read the second row of the first set.
For it, we again go through the matching rows of the second set. First, we
return ('Yellow Submarine', 'All Together Now')...
[Figure: Nested Loop, step 3. The same album row (1, Yellow Submarine, 1969) is paired with its second matching song (1, All You Need Is Love).]
...and then the second pair: ('Yellow Submarine', 'All You Need Is Love').
[Figure: Nested Loop, step 4. The third album row (6, Abbey Road, 1969) has no matching songs.]
We move on to the third row of the first set. It has no matches.
[Figure: Nested Loop, step 5. The fourth album row (4, The Beatles, 1968) has no matching songs. Not all rows of the inner set were read.]
The fourth row has no matches either. The join ends here.
Some rows from the second set were not considered at all — as shown in
the figure, they remained white.
(The algorithm's source code is available in the file
src/backend/executor/nodeNestloop.c.)
Memoization
Caching repeated data in the inner relation
SELECT a.title, s.name FROM albums a JOIN songs s ON
a.id = s.album_id;
Plan tree:
Nested Loop
  Seq Scan on songs
  Memoize (caching)
    Index Scan on albums
If the inner set is scanned multiple times with repeated parameter values,
caching the result can help avoid repeatedly reading the same data. This
operation is known as memoization. It is performed in the Memoize node,
which is placed between the Nested Loop and the data-providing node.
(If the planner's estimates turn out to be wrong, you can disable memoization
by setting the enable_memoize parameter to off.)
Caching is implemented using a hash table. The hash key is the parameter or
parameters used to access the inner set.
[Figure: Memoization, cache miss. Outer set songs (album_id, name): (1, It's All Too Much), (1, All Together Now), (3, Across the Universe), (1, All You Need Is Love). Inner set albums (id, title, year): (1, Yellow Submarine, 1969), (6, Abbey Road, 1969), (4, The Beatles, 1968), (3, Let It Be, 1970), (2, Help!, 1965), (8, Revolver, 1966), (5, A Hard Day's Night, 1964), (7, Please Please Me, 1963). The cache is limited to work_mem × hash_mem_multiplier. Callouts: parameter values are repeated; in general, one value maps to several rows.]
Consider the example in the demo. In contrast to the previous case, there
are fewer songs than albums here; additionally, nearly all of them are from
the same album.
If the required row isn't in the hash table, the Memoize node fetches it from
the inner set, caches it, and passes it to the parent Nested Loop node.
In general, a parameter value may match multiple rows from the inner
set, and all of these rows are cached. If they do not all fit into memory
(limited to work_mem × hash_mem_multiplier), the rows for that parameter
value are not cached, because caching only part of them is pointless. In the
query execution plan, the number of such cases is reported as Overflows.
[Figure: Memoization, cache hit. The album row for id 1 is already in the cache and is returned without reading albums.]
If the required row is already in the cache, the Memoize node immediately
returns it to the Nested Loop node. The inner relation is not accessed.
[Figure: Memoization, adding an entry. The album row for id 3 (Let It Be) is fetched and added to the front of the cache.]
As long as there's space in the cache, new values keep getting cached.
Each new value is added to the front of the cache.
[Figure: Memoization, repeated access. A cached entry that is used again moves to the front of the cache.]
The already cached value rises to the top when accessed, while the others
sink down accordingly.
[Figure: Memoization, eviction. The cache is full; when a new entry does not fit, old entries are evicted.]
When memory runs out, the least recently used entries are evicted from the
cache. Thus, the LRU replacement algorithm is implemented.
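The caching behavior walked through above (miss, hit, move to front, eviction, overflow) can be sketched as follows. This is a simplified model, not the Memoize node's actual code: MemoizeCache, fetch_album and inner_reads are hypothetical names, and the memory limit is counted in rows rather than in bytes of work_mem × hash_mem_multiplier.

```python
from collections import OrderedDict

class MemoizeCache:
    """LRU cache keyed by the join parameter (hypothetical helper)."""

    def __init__(self, fetch_rows, max_rows):
        self.fetch_rows = fetch_rows   # reads matching rows from the inner set
        self.max_rows = max_rows       # assumed stand-in for the memory limit
        self.entries = OrderedDict()   # key -> rows; most recently used is last
        self.cached_rows = 0
        self.hits = self.misses = self.evictions = self.overflows = 0

    def lookup(self, key):
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)    # a used entry becomes the freshest
            return self.entries[key]
        self.misses += 1
        rows = self.fetch_rows(key)
        if len(rows) > self.max_rows:        # would not fit even in an empty cache
            self.overflows += 1
            return rows                      # pass the rows through uncached
        while self.cached_rows + len(rows) > self.max_rows:
            _, old_rows = self.entries.popitem(last=False)  # evict the LRU entry
            self.cached_rows -= len(old_rows)
            self.evictions += 1
        self.entries[key] = rows
        self.cached_rows += len(rows)
        return rows

albums_by_id = {
    1: [(1, "Yellow Submarine", 1969)],
    2: [(2, "Help!", 1965)],
    3: [(3, "Let It Be", 1970)],
}
inner_reads = []   # records every real access to the inner set

def fetch_album(album_id):
    inner_reads.append(album_id)
    return albums_by_id.get(album_id, [])

cache = MemoizeCache(fetch_album, max_rows=2)
for album_id in [1, 1, 3, 1, 2]:   # parameter values coming from the outer set
    cache.lookup(album_id)
```

With room for only two rows, the five lookups cause three reads of the inner set; the entry for album 3 is evicted because album 1 was used more recently.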
Computational complexity
~ N × M, where
N is the number of rows in the outer set,
M is the average number of rows in the inner set per iteration
The join is efficient only for a small number of rows
If N is the number of rows in the outer set and M is the average number of
rows in the inner set per iteration, the overall join complexity is
proportional to the product N × M.
In a non-parameterized join, M equals the total number of rows in the
inner set; in a parameterized join, M can be much smaller.
The nested loop join is efficient only for a small number of rows. In
particular, this method (combined with index access) is typical for
OLTP queries that must quickly return a small number of rows.
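To make the N × M estimate concrete, the following sketch counts how many inner rows a nested loop touches with and without an index. The function count_inner_rows_visited is a hypothetical helper written for this illustration, and the key values are the album identifiers from the slides.

```python
def count_inner_rows_visited(outer_keys, inner_keys, use_index):
    """Return how many inner rows a nested loop touches in total."""
    if use_index:
        # Parameterized join: each iteration sees only the matching rows.
        matches_per_key = {}
        for key in inner_keys:
            matches_per_key[key] = matches_per_key.get(key, 0) + 1
        return sum(matches_per_key.get(key, 0) for key in outer_keys)
    # Non-parameterized join: each iteration rescans the whole inner set.
    return len(outer_keys) * len(inner_keys)

album_ids = [3, 1, 6, 4]               # N = 4 rows in the outer set
song_album_ids = [1, 1, 2, 2, 3, 5]    # 6 rows in the inner set

full_rescan = count_inner_rows_visited(album_ids, song_album_ids, use_index=False)
with_index = count_inner_rows_visited(album_ids, song_album_ids, use_index=True)
```

Here full_rescan is 4 × 6 = 24, while with_index is 3, which corresponds to M = 0.75 rows per iteration on average.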
In parallel execution plans
The outer set is scanned in parallel, while the inner one
is processed sequentially by each process.
SELECT a.title, s.name FROM albums a JOIN songs s ON
a.id = s.album_id;
Plan tree (the figure shows three processes, each running its own copy of the Nested Loop):
Gather
  Nested Loop
    Parallel Seq Scan on albums
    Index Scan on songs
Nested loop joins can be used in parallel execution plans.
The outer row set is scanned in parallel by multiple worker processes.
After obtaining a row from the outer set, each process sequentially
iterates through the corresponding rows in the inner set.
[Figure: In parallel execution plans. Parallel Seq Scan distributes the pages of albums among the processes; each process looks up the matching songs for its own album rows.]
To retrieve the next page from the outer set, processes synchronize. Each
process accesses the inner data set independently.
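The division of work can be imitated with threads. This is only an analogy under stated assumptions: PostgreSQL uses separate processes, and PAGE_SIZE, take_next_page and worker are hypothetical names. The point it shows is that the only shared step is claiming the next page of the outer set.

```python
import threading

albums = [(3, "Let It Be"), (4, "The Beatles"), (6, "Abbey Road"),
          (2, "Help!"), (8, "Revolver"), (1, "Yellow Submarine")]
songs_by_album = {   # dictionary standing in for the index on songs
    1: ["All Together Now", "All You Need Is Love"],
    2: ["Another Girl", "Act Naturally"],
    3: ["Across the Universe"],
    5: ["A Day in the Life"],
}

PAGE_SIZE = 2   # hypothetical number of rows per page
pages = [albums[i:i + PAGE_SIZE] for i in range(0, len(albums), PAGE_SIZE)]
page_iter = iter(pages)
page_lock = threading.Lock()

def take_next_page():
    """Workers synchronize only here, when claiming the next outer page."""
    with page_lock:
        return next(page_iter, None)

def worker(output):
    while (page := take_next_page()) is not None:
        for album_id, title in page:
            # Each worker probes the inner set on its own, without coordination.
            for name in songs_by_album.get(album_id, []):
                output.append((title, name))

outputs = [[] for _ in range(3)]   # one result list per worker
threads = [threading.Thread(target=worker, args=(out,)) for out in outputs]
for t in threads:
    t.start()
for t in threads:
    t.join()

gathered = [row for out in outputs for row in out]   # plays the role of Gather
```

Which worker handles which page, and therefore the order of rows in gathered, can differ from run to run; only the set of result rows is fixed.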
Takeaways
A nested loop requires no preparatory actions
Can deliver the join result without delay
Efficient for small row sets
The outer row set isn't very large.
The inner table can be accessed efficiently (typically through an index).
It depends on the join order
It's usually better if the outer row set is smaller than the inner one.
Supports joins on any condition
Supports equijoins as well as any other
The main advantage of the nested loop join is its simplicity: it requires no
preparation, allowing results to be returned almost immediately.
The downside is that this approach is highly inefficient with large datasets.
The same applies to indexes: the larger the data volume, the higher the
overhead.
Therefore, a nested loop join is worth using when:
one of the row sets is small;
the other set can be accessed efficiently by the join condition;
the result set contains a small number of rows.
This is a typical case for OLTP queries (e.g., user interface queries where a
web page or form must load quickly without handling large data volumes).
Another important point is that a nested loop join can handle any join
condition: it works for equijoins (as in the example above) as well as for
any other condition.
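A short sketch shows why no restriction on the condition is needed: the inner loop simply evaluates a predicate for each pair of rows. The function nested_loop_join and the "released earlier" condition are hypothetical examples, and no index is assumed, so every pair is checked.

```python
def nested_loop_join(outer_rows, inner_rows, condition):
    """Join two row lists on an arbitrary predicate (no index assumed)."""
    return [(o, i) for o in outer_rows for i in inner_rows if condition(o, i)]

albums = [("Let It Be", 1970), ("Abbey Road", 1969), ("The Beatles", 1968)]

# Non-equijoin: pair each album with every album released earlier.
earlier = nested_loop_join(albums, albums, lambda a, b: b[1] < a[1])
earlier_titles = [(a[0], b[0]) for a, b in earlier]

# Equijoin with the same function: pair albums released in the same year.
same_year = nested_loop_join(albums, albums, lambda a, b: a[1] == b[1])
```

The price of this generality is that, without an index usable for the condition, the whole inner set is examined for every outer row.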
Practice
1. Create an index on the departure_airport column of the flights
table. Find all flights departing from Ulyanovsk and examine the
query's execution plan.
2. Create a distance table between all airports (with each pair
appearing only once).
Use the <@> operator from the earthdistance extension.