There's one aspect of my recent bugfix commit 5f4d98d4 that still
doesn't sit well with me: the way that it taught _bt_set_startikey to
completely back out of applying its optimizations in the
presence of a RowCompare. Without this "row compare back out"
behavior, there are obscure cases where resetting the scan's array
keys within _bt_readpage leads to an endless cycle of ending the
current primitive index scan, then starting another, only to end the
new primitive index scan on exactly the same leaf page/page offset as
last time. This could only happen in the presence of a NULL row
compare member, mostly due to the weird rules about those within
_bt_check_rowcompare (and how they interact with
_bt_advance_array_keys when we reset the scan's array keys in
_bt_readpage, another thing introduced by commit 5f4d98d4).

For example, a query like the following could get into an endless
cycle, if I remove the "don't set forcenonrequired" code from
_bt_set_startikey:

select *
from fuzz_skip_scan
where (b, c) >= (6, null) and (b, c) <= (7, null)
order by a desc, b desc, c desc, d desc;

Notice that we'll have a skip array on "a" here, since it is the
leading column, but has been omitted. I'm assuming the use of skip
scan/multiple primitive scans, and some use of forcenonrequired.
Obviously, this kind of case is ultra contrived/adversarial.
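For reference, the SQL-level behavior of that qual follows from the
three-valued lexicographic rules for row comparisons. Here's a minimal
Python model (my own sketch, nothing to do with nbtree internals; None
stands in for SQL NULL/UNKNOWN):

```python
def row_compare_ge(row, bound):
    """Three-valued evaluation of row >= bound, per SQL's lexicographic
    row comparison rules.  Returns True, False, or None (SQL UNKNOWN)."""
    for v, b in zip(row, bound):
        if v is None or b is None:
            return None      # a NULL member is reached -> UNKNOWN
        if v != b:
            return v > b     # first unequal column pair decides the result
    return True              # every column pair compared equal

# For "(b, c) >= (6, null)": rows with b = 7 evaluate to TRUE without
# ever reaching the NULL member, while rows with b = 6 come out
# UNKNOWN (and so fail the qual), and rows with b = 5 are FALSE.
assert row_compare_ge((7, 3), (6, None)) is True
assert row_compare_ge((6, 3), (6, None)) is None
assert row_compare_ge((5, 3), (6, None)) is False
```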

My confusion about how best to deal with the problem was on clear
display in commit 54c6ea8c, which removed forcenonrequired=true
support from _bt_check_rowcompare. I had to revert that commit shortly
thereafter (in commit dd2ce379) after it became clear that
_bt_check_rowcompare does in fact need to be called during
forcenonrequired=true scans. You see, the logic I added to
_bt_set_startikey in commit 5f4d98d4 only prevents the application of
forcenonrequired mode (and the setting of pstate.ikey to a key past
the first key) for row compares whose first member's index column is
also the first index column whose tuple values change within the page
being examined by _bt_set_startikey. I have no reason to believe that
there's any bug here, but it is certainly needlessly complicated.

Thinking about the problem some more today, I realized that the issue
wasn't really with _bt_set_startikey, or with _bt_advance_array_keys.
The problem was with the weird rules in _bt_check_rowcompare around
NULL row compare members, and with the lack of joined-up thinking
between _bt_check_rowcompare and the corresponding _bt_first code that
builds an insertion scan key from a row compare. These two places
should really have symmetrical rules, including in their handling of
NULL row compare members, but right now they don't. Disabling
forcenonrequired mode in _bt_set_startikey now seems to me to be a
case of the tail wagging the dog.

It would make much more sense to teach row comparisons to behave more
like standard scalar inequalities in their handling of NULLs/NULL
semantics/_bt_first NULL positioning constraints. That way,
_bt_set_startikey can safely treat row compares (with or without NULL
members) just like any other kind of inequality -- it'd *always* be
safe to use forcenonrequired=true mode when that made sense, no matter
what.
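The equivalence that makes this safe can be checked mechanically from
the SQL spec's expansion of a two-member row comparison, (a, b) >= (x,
y) ≡ a > x OR (a = x AND b >= y), under three-valued logic. Again,
this is just an illustrative Python model of the SQL semantics, not
code from the patch:

```python
def sql_row_ge(a, b, x, y):
    """(a, b) >= (x, y) via the spec expansion a > x OR (a = x AND b >= y),
    evaluated under three-valued logic.  None models SQL NULL/UNKNOWN."""
    def gt(u, v): return None if u is None or v is None else u > v
    def ge(u, v): return None if u is None or v is None else u >= v
    def eq(u, v): return None if u is None or v is None else u == v
    def AND(p, q):
        if p is False or q is False: return False
        if p is None or q is None: return None
        return True
    def OR(p, q):
        if p is True or q is True: return True
        if p is None or q is None: return None
        return False
    return OR(gt(a, x), AND(eq(a, x), ge(b, y)))

# With y = NULL, the qual is TRUE exactly when a > x: the row compare
# degenerates into a strict scalar inequality on the leading column,
# which is what lets it be treated like any other inequality.
for a in range(4, 9):
    assert (sql_row_ge(a, 0, 6, None) is True) == (a > 6)
```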

Attached patch shows how this could work. It refines the rules around
NULL row comparison members in _bt_check_rowcompare, and in _bt_first.
The fundamental idea here is to make _bt_check_rowcompare *consistent*
with _bt_first. That way _bt_set_startikey doesn't have to know
anything about row compares.

I would like to commit this to Postgres 18, treating it as a bugfix.
It seems like a much more solid fix than what I came up with in bugfix
commit 5f4d98d4. Does anybody else have an opinion on that question?

Note that this patch effectively *adds a new optimization* to
_bt_first around row comparison keys with NULL row members. I don't
actually care about the performance of such row comparisons. Again,
what I actually care about is making _bt_first *consistent* with
_bt_check_rowcompare in its handling of NULLs. This approach happens
to be the best way to make them consistent (see the patch's draft
commit message if you want to know about an alternative approach
involving *removing* an old optimization from _bt_check_rowcompare).

In summary:

In general, when we end a primitive index scan, the code that sets
continuescan=false (any such code, not just _bt_check_rowcompare NULL
row member code) has to make sure that starting the next primitive
index scan will actually allow the top-level scan to make forward
progress -- the new primscan needs to land on some later leaf page.
Right now, _bt_check_rowcompare doesn't guarantee it in a way that I'd
consider truly robust. We need that guarantee to avoid these
cycles/infinite looping. AFAIK there are no bugs of that general
nature on Postgres 18, but all of that depends on _bt_set_startikey
knowing about _bt_advance_array_keys and _bt_check_rowcompare
behaviors from a great distance, which is pretty grotty. Seems worth
fixing sooner rather than later.

-- 
Peter Geoghegan

Attachment: v1-0001-Improve-_bt_first-row-compare-NULL-member-handlin.patch