On Mon, Jun 16, 2025 at 12:46 PM Peter Geoghegan <p...@bowt.ie> wrote:
> Making this change in _bt_readpage creates a problem in _bt_killitems:
> it currently expects posting list TIDs in ascending order (the loop
> invariant relies on that), which is why the deduplication patch didn't
> just do things like this in _bt_readpage to begin with. The patch
> deals with that new problem by qsort'ing the so->killedItems[] array
> (at the top of _bt_killitems) such that the items always appear
> exactly once, in the expected ASC order.
Actually, we can just completely get rid of so->killedItems[]. We can
replace it with a new 1-bit itemDead field in the BTScanPosItem struct
(the struct used for so->currPos.items[] entries). That way, the patch
avoids a qsort. The patch doesn't need to change the order of anything
(except, of course, for the order that _bt_readpage initially saves
posting list tuple TIDs in when scanning backwards).

The problem with so->killedItems[] is that it stores things in
whatever order successive btgettuple calls find most convenient in the
kill_prior_tuple path. It turns out that it's even more convenient for
btgettuple if this kill_prior_tuple path simply sets the relevant
(prior/just-returned) item's so->currPos.items[].itemDead field. There
is no need to allocate memory for so->killedItems[], and no need to
worry about running out of space in so->killedItems[] in cases
involving scrollable cursors that happen to scroll back and forth,
triggering kill_prior_tuple for the same TID more than once. We'll
simply set the TID/item's so->currPos.items[].itemDead field a second
or a third time, which can't complicate things, since my new approach
makes that idempotent.

Attached is v2 of the patch, which works that way. This seems like a
better direction to take things in within _bt_killitems.

In general we're quite sensitive to the size of the BTScanPosItem
struct, since (at least for now) we routinely allocate
MaxTIDsPerBTreePage-many of them (i.e., 1,358 of them). 1,358 * 10
bytes is already too much. But there's no need to make the relevant
allocation even larger: I can steal a bit from
BTScanPosItem.tupleOffset for the new
so->currPos.items[].itemDead/BTScanPosItem.itemDead field. That's
safe, since we only really need 15 bits for each
BTScanPosItem.tupleOffset offset.

--
Peter Geoghegan
v2-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch
Description: Binary data