> On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote: > > > Well, it's obviously wrong, thanks for noticing. What is necessary is to > > compare two index tuples, the start and the next one, to test if they're > > the same (in which case if I'm not mistaken probably we can compare item > > pointers). I've got this question when I was about to post a new version > > with changes to address feedback from Andy, now I'll combine them and > > send a cumulative patch. > > This sounds like approximately the same problem as the one that > _bt_killitems() has to deal with as of Postgres 13. This is handled in > a way that is admittedly pretty tricky, even though the code does not > need to be 100% certain that it's "the same" tuple. Deduplication kind > of makes that a fuzzy concept. In principle there could be one big > index tuple instead of 5 tuples, even though the logical contents of > the page have not been changed between the time we recording heap TIDs > in local and the time _bt_killitems() tried to match on those heap > TIDs to kill_prior_tuple-kill some index tuples -- a concurrent > deduplication pass could do that. Your code needs to be prepared for > stuff like that. > > Ultimately posting list tuples are just a matter of understanding the > on-disk representation -- a "Small Matter of Programming". Even > without deduplication there are potential hazards from the physical > deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is > not code that runs in VACUUM, despite the name). Make sure that you > hold a buffer pin on the leaf page throughout, because you need to do > that to make sure that VACUUM cannot concurrently recycle heap TIDs. > If VACUUM *is* able to concurrently recycle heap TIDs then it'll be > subtly broken. _bt_killitems() is safe because it either holds on to a > pin or gives up when the LSN changes at all. (ISTM that your only > choice is to hold on to a leaf page pin, since you cannot just decide > to give up in the way that _bt_killitems() sometimes can.)
I see, thanks for clarification. You're right, in this part of implementation there is no way to give up if LSN changes like _bt_killitems does. As far as I can see the leaf page is already pinned all the time between reading relevant tuples and comparing them, I only need to handle posting list tuples.