There's one aspect of my recent bugfix commit 5f4d98d4 that still doesn't sit well with me: the way that it taught _bt_set_startikey to completely back out of applying its optimizations in the presence of a RowCompare. Without this "row compare back out" behavior, there are obscure cases where resetting the scan's array keys within _bt_readpage leads to an endless cycle of ending the current primitive index scan, then starting another, only to end the new primitive index scan on exactly the same leaf page/page offset as last time. This could only happen in the presence of a NULL row compare member, mostly due to the weird rules about those within _bt_check_rowcompare (and how they interact with _bt_advance_array_keys when we reset the scan's array keys in _bt_readpage, another thing introduced by commit 5f4d98d4).
For example, if I remove the "don't set forcenonrequired" code from _bt_set_startikey, a query like the following can get into an endless cycle:

select * from fuzz_skip_scan
where (b, c) >= (6, null) and (b, c) <= (7, null)
order by a desc, b desc, c desc, d desc;

Notice that we'll have a skip array on "a" here, since it is the leading column, but has been omitted. I'm assuming the use of skip scan/multiple primitive scans, and some use of forcenonrequired. Obviously, this kind of case is ultra contrived/adversarial.

My confusion about how best to deal with the problem was on clear display in commit 54c6ea8c, which removed forcenonrequired=true support from _bt_check_rowcompare. I had to revert that commit shortly thereafter (in commit dd2ce379), once it became clear that _bt_check_rowcompare does in fact need to be called during forcenonrequired=true scans. You see, the logic I added to _bt_set_startikey in commit 5f4d98d4 only prevents application of forcenonrequired mode (as well as setting pstate.ikey to a key past the first key) for row compares whose first member's index column is also the first index column whose tuple values change within the page in question (the page examined by _bt_set_startikey). I have no reason to believe that there's any bug here, but it is certainly needlessly complicated.

Thinking about the problem some more today, I realized that the issue wasn't really with _bt_set_startikey, or with _bt_advance_array_keys. The problem was with the weird rules in _bt_check_rowcompare around NULL row compare members, and with the lack of joined-up thinking between _bt_check_rowcompare and the corresponding _bt_first code that builds an insertion scan key from a row compare. These two places should really have symmetrical rules, including in their handling of NULL row compare members, but right now they don't.
Disabling forcenonrequired mode in _bt_set_startikey now seems to me to be a case of the tail wagging the dog. It would make much more sense to teach row comparisons to behave more like standard scalar inequalities in their handling of NULLs/NULL semantics/_bt_first NULL positioning constraints. That way, _bt_set_startikey can safely treat row compares (with or without NULL members) just like any other kind of inequality -- it'd *always* be safe to use forcenonrequired=true mode when that made sense, no matter what.

The attached patch shows how this could work. It refines the rules around NULL row comparison members in _bt_check_rowcompare, and in _bt_first. The fundamental idea here is to make _bt_check_rowcompare *consistent* with _bt_first. That way _bt_set_startikey doesn't have to know anything about row compares.

I would like to commit this to Postgres 18, treating it as a bugfix. It seems like a much more solid fix than what I came up with in bugfix commit 5f4d98d4. Does anybody else have an opinion on that question?

Note that this patch effectively *adds a new optimization* to _bt_first around row comparison keys with NULL row members. I don't actually care about the performance of such row comparisons. Again, what I actually care about is making _bt_first *consistent* with _bt_check_rowcompare in its handling of NULLs. This approach happens to be the best way to make them consistent (see the patch's draft commit message if you want to know about an alternative approach involving *removing* an old optimization from _bt_check_rowcompare).

In summary: in general, when we end a primitive index scan, the code that sets continuescan=false (any such code, not just _bt_check_rowcompare NULL row member code) has to make sure that starting the next primitive index scan will actually allow the top-level scan to make forward progress -- the new primscan needs to land on some later leaf page.
Right now, _bt_check_rowcompare doesn't provide that guarantee in a way that I'd consider truly robust. We need that guarantee to avoid these cycles/infinite looping. AFAIK there are no bugs of that general nature on Postgres 18, but all of that depends on _bt_set_startikey knowing about _bt_advance_array_keys and _bt_check_rowcompare behaviors from a great distance, which is pretty grotty. Seems worth fixing sooner rather than later.

--
Peter Geoghegan
v1-0001-Improve-_bt_first-row-compare-NULL-member-handlin.patch
Description: Binary data