Hi,

With PG17's new SAOP handling, the performance of certain index scans improved significantly in the serial case. However, for parallel index scans the performance picture is not as straightforward, and this has caused some issues earlier.
Background
----------

Before PG17, we performed one btree descent + scan for every unique combination of SAOP operators. With PG17, that changed to one btree descent + scan, plus however many additional btree descents we decide are needed to skip past keys we know won't match. In other words, we went from always skipping to skipping only when we think it's beneficial.

In parallel index scans before PG17, a shared counter tracked the number of SAOP entries handled. As a result, the IO patterns of serial and parallel scans were about the same: the critical path for IO was almost identical, with no scan-specific race conditions, because all backends of a given scan agreed on the state of the scan. With the introduction of the new SAOP handling in PG17, however, the shared state has become a bit more muddied. Because the default changed from "stop scanning at the end of a SAOP value's range" to "continue scanning, unless ...", and because it is relatively expensive to determine whether we need to stop or continue, we release the parallel scan to continue before we know whether a new primitive scan would be preferred.

Problem
-------

The new parallel index scan code can only start a new primitive scan IFF no parallel worker has taken the next block in the time between the call to _bt_parallel_release at the start of _bt_readpage and the call to _bt_parallel_primscan_schedule (reached through _bt_checkkeys) slightly later in _bt_readpage. This makes the code susceptible to race conditions; in the worst case, a parallel SAOP index scan can behave like a range index scan over the full value range of the SAOP parameters, rather than being limited to scanning only leaf pages whose value ranges match the SAOP parameters, skidding past what would be considered bounds in the serial scan variant.
Given the above, a parallel index scan with ScanKey `id = ANY (minvalue, maxvalue)` and bad enough timing/luck could thus scan the whole index, while just 2 primitive index scans (the behaviour before PG17) would be sufficient. In practice it is quite unlikely that parallel index scans have such bad luck, but I don't think luck or timing should be a major factor in the performance picture of parallel index scans.

Solution
--------

I think I've found a reasonably elegant solution, which limits the number of buffers for which we can fail to start a new primitive scan to the number of parallel workers + 1: if we detect that a parallel worker in the same primscan range thought this was the right moment to start a new primscan (but couldn't due to concurrent advancement), we don't release the parallel scan immediately. Instead, we only release it after reading the page's contents, so that we know whether we really should start a new primitive scan. If so, we do that; if not, we mark that the scan has reached a new primscan range, do some cleanup, and then continue business as usual.

There is some room to tune the number of buffers for which we can fail to start a new primitive scan, by locally keeping track of the number of pages we've processed without starting an expected primitive scan, but I suspect that waiting any longer than it takes to do the skip check ASAP would result in overall worse performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
v1-0001-Avoid-full-btree-index-scans-when-skipping-is-pos.patch