On Thu, Sep 12, 2024 at 10:49 AM Matthias van de Meent <boekewurm+postg...@gmail.com> wrote: > Thanks to Peter for the description, that helped me debug the issue. I > think I found a fix for the issue: regression tests for 811af978 > consistently got stuck on my macbook before the attached patch 0001, > after applying that this patch they completed just fine.
Thanks for taking a look at it. > The fix in 0001 is relatively simple: we stop backends from waiting > for a concurrent backend to resolve the NEED_PRIMSCAN condition, and > instead move our local state machine so that we'll hit _bt_first > ourselves, so that we may be able to start the next primitive scan. I agree with your approach, but I'm concerned about it causing confusion inside _bt_parallel_done. And so I attach a v2 revision of your bug fix. v2 adds a check that nails that down, too. I'm not 100% sure if the change to _bt_parallel_done becomes strictly necessary, to make the basic fix robust, but it's a good idea either way. In fact, it seemed like a good idea even before this bug came to light: it was already clear that this was strictly necessary for the skip scan patch. And for reasons that really have nothing to do with the requirements for skip scan (it's related to how we call _bt_parallel_done without much care in code paths from the original parallel index scan commit). More details on changes in v2 that didn't appear in Matthias' v1: v2 makes _bt_parallel_done do nothing at all when the backend-local so->needPrimScan flag is set (regardless of whether it has been set by _bt_parallel_seize or by _bt_advance_array_keys). This is a bit like the approach taken before the Postgres 17 work went in: _bt_parallel_done used to only permit the shared btps_pageStatus state to become BTPARALLEL_DONE when it found that "so->arrayKeyCount >= btscan->btps_arrayKeyCount" (else the call was a no-op). With this extra hardening, _bt_parallel_done will only permit setting BTPARALLEL_DONE when "!so->needPrimScan". Same idea, more or less. v2 also changes comments in _bt_parallel_seize. The comment tweaks suggest that the new "if (!first && status == BTPARALLEL_NEED_PRIMSCAN) return false" path is similar to the existing master branch "if (!first && so->needPrimScan) return false" precheck logic on master (the precheck that takes place before examining any state in shared memory). The new path can be thought of as dealing with cases where the backend-local so->needPrimScan flag must have been stale back when it was prechecked -- it's essentially the same logic, though unlike the precheck it works against the authoritative shared memory state. My current plan is to commit something like this in the next day or two. -- Peter Geoghegan
From 11282515bae8090b30663814c5f91db00488508d Mon Sep 17 00:00:00 2001 From: Peter Geoghegan <pg@bowt.ie> Date: Mon, 16 Sep 2024 14:28:57 -0400 Subject: [PATCH v2] Fix stuck parallel btree scans Before, a backend that called _bt_parallel_seize was not always guaranteed to be able to move forward on a state where more work was expected from parallel backends, and handled NEED_PRIMSCAN as a semi-ADVANCING state. This caused issues when the leader process was waiting for the state to advance and concurrent backends were waiting for the leader to consume the buffered tuples they still had after updating the state to NEED_PRIMSCAN. This is fixed by treating _bt_parallel_seize()'s status output as the status of a currently active primitive scan. If _seize is called from outside _bt_first, and the scan state is NEED_PRIMSCAN, then we'll end our current primitive scan and set the scan up for a new primitive scan, eventually hitting _bt_first's call to _seize. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reported-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com Backpatch: 17-, where nbtree SAOP execution was enhanced. --- src/backend/access/nbtree/nbtree.c | 53 +++++++++++++++++++----------- 1 file changed, 33 insertions(+), 20 deletions(-) diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c index 686a3206f..456a04995 100644 --- a/src/backend/access/nbtree/nbtree.c +++ b/src/backend/access/nbtree/nbtree.c @@ -585,7 +585,10 @@ btparallelrescan(IndexScanDesc scan) * or _bt_parallel_done(). * * The return value is true if we successfully seized the scan and false - * if we did not. The latter case occurs if no pages remain. + * if we did not. The latter case occurs when no pages remain, or when + * another primitive index scan is scheduled that caller's backend cannot + * start just yet (only backends that call from _bt_first are capable of + * starting primitive index scans, which they indicate by passing first=true). * * If the return value is true, *pageno returns the next or current page * of the scan (depending on the scan direction). An invalid block number @@ -596,10 +599,6 @@ btparallelrescan(IndexScanDesc scan) * scan will return false. * * Callers should ignore the value of pageno if the return value is false. - * - * Callers that are in a position to start a new primitive index scan must - * pass first=true (all other callers pass first=false). We just return false - * for first=false callers that require another primitive index scan. */ bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first) @@ -616,13 +615,7 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first) { /* * Initialize array related state when called from _bt_first, assuming - * that this will either be the first primitive index scan for the - * scan, or a previous explicitly scheduled primitive scan. - * - * Note: so->needPrimScan is only set when a scheduled primitive index - * scan is set to be performed in caller's worker process. It should - * not be set here by us for the first primitive scan, nor should we - * ever set it for a parallel scan that has no array keys. + * that this will be the first primitive index scan for the scan */ so->needPrimScan = false; so->scanBehind = false; @@ -630,8 +623,8 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first) else { /* - * Don't attempt to seize the scan when backend requires another - * primitive index scan unless we're in a position to start it now + * Don't attempt to seize the scan when it requires another primitive + * index scan, since caller's backend cannot start it right now */ if (so->needPrimScan) return false; @@ -653,12 +646,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first) { Assert(so->numArrayKeys); - /* - * If we can start another primitive scan right away, do so. - * Otherwise just wait. - */ if (first) { + /* Can start another primitive scan right away, so do so */ btscan->btps_pageStatus = BTPARALLEL_ADVANCING; for (int i = 0; i < so->numArrayKeys; i++) { @@ -668,11 +658,25 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first) array->cur_elem = btscan->btps_arrElems[i]; skey->sk_argument = array->elem_values[array->cur_elem]; } - so->needPrimScan = true; - so->scanBehind = false; *pageno = InvalidBlockNumber; exit_loop = true; } + else + { + /* + * Don't attempt to seize the scan when it requires another + * primitive index scan, since caller's backend cannot start + * it right now + */ + status = false; + } + + /* + * Either way, update backend local state to indicate that a + * pending primitive scan is required + */ + so->needPrimScan = true; + so->scanBehind = false; } else if (btscan->btps_pageStatus != BTPARALLEL_ADVANCING) { @@ -731,6 +735,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page) void _bt_parallel_done(IndexScanDesc scan) { + BTScanOpaque so = (BTScanOpaque) scan->opaque; ParallelIndexScanDesc parallel_scan = scan->parallel_scan; BTParallelScanDesc btscan; bool status_changed = false; @@ -739,6 +744,13 @@ _bt_parallel_done(IndexScanDesc scan) if (parallel_scan == NULL) return; + /* + * Should not mark parallel scan done when there's still a pending + * primitive index scan + */ + if (so->needPrimScan) + return; + btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan, parallel_scan->ps_offset); @@ -747,6 +759,7 @@ _bt_parallel_done(IndexScanDesc scan) * already */ SpinLockAcquire(&btscan->btps_mutex); + Assert(btscan->btps_pageStatus != BTPARALLEL_NEED_PRIMSCAN); if (btscan->btps_pageStatus != BTPARALLEL_DONE) { btscan->btps_pageStatus = BTPARALLEL_DONE; -- 2.45.2