Hi,

With PG17's new SAOP handling, the performance of certain index scans improved significantly in the serial case. However, for parallel index scans the performance picture is not as straightforward, and this has caused some issues earlier.
Background
----------

Before PG17, we performed one btree descent + scan for every unique combination of SAOP operators. With PG17, that changed to one btree descent + scan, plus however many additional btree descents we decide are needed to skip past keys we know won't match. In other words, we went from always skipping to skipping only when we think it's beneficial.

In parallel index scans before PG17, a shared counter tracked the number of SAOP entries handled. As a result, the IO patterns of serial and parallel scans were about the same: the critical path for IO was almost identical, with no scan-specific race conditions, because all backends of a given scan agreed on the state of the scan. With the introduction of the new SAOP handling in PG17, however, the shared state has become a bit more muddied. Because the default changed from "stop scanning at the end of a SAOP value's range" to "continue scanning, unless ...", and because it is relatively expensive to determine whether we need to stop or continue, we release the parallel scan to continue before we know whether a new primitive scan would be preferred.

Problem
-------

The new parallel index scan code can only start a new primitive scan IFF no parallel worker has taken the next block in the time between the call to _bt_parallel_release at the start of _bt_readpage and the call to _bt_parallel_primscan_schedule (reached through _bt_checkkeys) slightly later in _bt_readpage. This makes the code susceptible to race conditions; in the worst case, a parallel SAOP index scan can behave like a range index scan over the full value range of the SAOP parameters, rather than being limited to scanning only leaf pages whose value ranges match the SAOP parameters, skidding past what would be considered bounds in the serial scan variant.
Given the above, a parallel index scan with ScanKey `id = ANY (minvalue, maxvalue)` and bad enough timing/luck could thus scan the whole index, while just 2 primitive index scans (the behaviour before PG17) would be sufficient. In practice it is quite unlikely that parallel index scans have such bad luck, but I don't think luck or timing should be a major factor in the performance picture of parallel index scans.

Solution
--------

I think I've found a reasonably elegant solution, which limits the number of buffers for which we can fail to start a new primitive scan to the number of parallel workers + 1: if we detect that a parallel worker in the same primscan range thought this was the right moment to start a new primscan (but couldn't due to concurrent advancement), we don't release the parallel scan immediately. Instead, we only release it after reading the page's contents, so that we know whether we really should start a new primitive scan. If so, we do that; if not, we mark that the scan has reached a new primscan range, do some cleanup, and then continue business as usual.

There is some room to tune the number of buffers for which we can fail to start a new primitive scan, by locally keeping track of the number of pages we've processed without starting an expected primitive scan, but I suspect that waiting any longer than it takes to do the skip check ASAP would result in overall worse performance.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
v1-0001-Avoid-full-btree-index-scans-when-skipping-is-pos.patch