On Tue, Oct 8, 2024 at 12:52 PM Peter Geoghegan <p...@bowt.ie> wrote: > The code in v3-0001-Avoid-unneeded-nbtree-backwards-scan-buffer-locks.patch > looks reasonable to me. I don't think that there are any impediments > to committing it sometime this week.
I see significant benefits for parallel index-only backwards scans with this patch. It's not that hard to see about a 10% decrease in execution time. I imagine that there are disproportionate benefits for the parallel case because the work we're avoiding is work that takes place while a backend has seized the scan. Not bad! I've been looking at this patch in more detail. I think that you (Matthias) have gone too far in the direction of making the backwards scan handling within _bt_readnextpage match the forwards scan handling. You fail to update blkno when we have to deal with a concurrent split/deletion that necessitates reading from another blkno (when the correct left sibling page to read from changes) -- v3 still used the potentially-stale blkno passed by _bt_readnextpage's caller. Attached v4 deals with that issue. It also goes further in the direction of simplifying the code in and around _bt_readnextpage. The code in this area has always seemed to me to be quite a lot more complicated than it really needed to be. Now seems like a good time to refactor and simplify. It's likely that we'll need to do more work nearby to enable prefetching during plain index scans, without regressing the kill_prior_tuple/_bt_killitems mechanism -- Tomas Vondra is currently working on that. The mechanism in question is closely tied to the stuff that I'm simplifying now, particularly the _bt_drop_lock_and_maybe_pin/TID-recycle-safety business. The main simplification new to v4 is that v4 isolates the need to call _bt_drop_lock_and_maybe_pin to only 2 functions: _bt_readnextpage, and its new _bt_readfirstpage "sibling" function. The functions have similar preconditions, and identical postconditions -- all of which are now clearly documented. It's not just _bt_drop_lock_and_maybe_pin that's only called from these 2 functions in v4. All calls to _bt_readpage (plus their associated PredicateLockPage calls) are now also isolated to those same 2 functions. This shift enables removing the _bt_parallel_readpage function entirely -- the point at the top of _bt_first where _bt_parallel_readpage used to be called can now call _bt_readnextpage directly instead. ISTM that having a separate _bt_parallel_readpage function never made much sense -- it was papering over the fact that the interfaces it uses are so confusing. v4 also demotes _bt_steppage to a helper function that sets up a call to _bt_readnextpage. Note that the 2017 parallel index scan commit split _bt_steppage in two: it became the current _bt_steppage code, plus the current _bt_readnextpage code. Making that relationship explicit seems much clearer. I'm essentially finishing off the process of splitting _bt_steppage up. Side note: all of these simplifications were enabled by making the preconditions for calling _bt_readnextpage work independently of the scan direction (which was present in earlier versions of the patch from Matthias). We now always do a "BTScanPosUnpinIfPinned(so->currPos)" within _bt_steppage, so it's easy to talk clearly about the preconditions/postconditions for _bt_readnextpage without directly considering its _bt_steppage caller (still the main caller, including during parallel index scans). Performance improvement bonus: We do that "BTScanPosUnpinIfPinned(so->currPos)" at a slightly different point in _bt_steppage in v4, as compared to master, even during forwards scans: we consistently unpin *before* the required next call to _bt_parallel_seize takes place (unless, of course, it's a serial scan, which'll never need to call _bt_parallel_seize). Presumably this is why I see a ~5% increase in throughput for parallel index-only forward (standard) scans with this v4 -- all parallel scans (backwards and forwards alike) now receive at least some benefit from this patch. It makes a certain amount of sense; we should do as little as possible during any window where our backend has seized the scan. I do need to do more testing, but the direction of v4 seems promising. -- Peter Geoghegan
v4-0001-Optimize-and-simplify-nbtree-backwards-scans.patch
Description: Binary data