On 9/12/24 16:49, Matthias van de Meent wrote:
> On Mon, 9 Sept 2024 at 21:55, Peter Geoghegan <p...@bowt.ie> wrote:
>>
> ...
>
> The fix in 0001 is relatively simple: we stop backends from waiting
> for a concurrent backend to resolve the NEED_PRIMSCAN condition, and
> instead move our local state machine so that we'll hit _bt_first
> ourselves, so that we may be able to start the next primitive scan.
> Also attached is 0002, which adds tracking of responsible backends to
> parallel btree scans, thus allowing us to assert we're never waiting
> for our own process to move the state forward. I found this patch
> helpful while working on solving this issue, even if it wouldn't have
> found the bug as reported.
>
No opinion on the analysis / coding, but per my testing the fix indeed
addresses the issue. The script used to get stuck reliably within a
minute; now it has been running for ~1h just fine. It also checks the
results, and those seem fine too.

regards

-- 
Tomas Vondra