Hi all, During investigating the issue our customer had, I realized that _bt_killitems() can record the same FPI pages multiple times simultaneously. This can happen when several concurrent index scans are processing pages that contain killable tuples. Because killing index items could be performed while holding a buffer lock in shared mode concurrent processes record multiple FPI_FOR_HINT for the same block.
Here is the reproducer: cat <<EOF | psql -d postgres drop table if exists tbl; create table tbl (c int primary key) with (autovacuum_enabled = off); insert into tbl select generate_series(1,300); update tbl set c = c * -1 where c = 100; checkpoint; EOF for n in `seq 1 4` do psql -d postgres -c "select from tbl where c = 100" & done The server needs to enable wal_log_hints and this might need to run several times. After running the script we can see this issue by pg_waldump: rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1C3D8, prev 1/8FD1C368, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767 blk 0 FPW rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1E408, prev 1/8FD1C3D8, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767 blk 0 FPW This is an excerpt from _bt_killitems() of version 12.2. By recent changes the code of HEAD looks different much but the part in question is essentially not changed much. That is, it's reproducible even with HEAD. for (i = 0; i < numKilled; i++) { int itemIndex = so->killedItems[i]; BTScanPosItem *kitem = &so->currPos.items[itemIndex]; OffsetNumber offnum = kitem->indexOffset; Assert(itemIndex >= so->currPos.firstItem && itemIndex <= so->currPos.lastItem); if (offnum < minoff) continue; /* pure paranoia */ while (offnum <= maxoff) { ItemId iid = PageGetItemId(page, offnum); IndexTuple ituple = (IndexTuple) PageGetItem(page, iid); if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid)) { /* found the item */ ItemIdMarkDead(iid); killedsomething = true; break; /* out of inner search loop */ } offnum = OffsetNumberNext(offnum); } } /* * Since this can be redone later if needed, mark as dirty hint. * * Whenever we mark anything LP_DEAD, we also set the page's * BTP_HAS_GARBAGE flag, which is likewise just a hint. */ if (killedsomething) { opaque->btpo_flags |= BTP_HAS_GARBAGE; MarkBufferDirtyHint(so->currPos.buf, true); } The inner test in the comment "found the item" never tests the item for being dead. So maybe we can add !ItemIdIsDead(iid) to that condition. But there still is a race condition of recording multiple FPIs can happen. Maybe a better solution is to change the lock to exclusive, at least when wal_log_hints = on, so that only one process can run this code -- the reduction in concurrency might be won back by the fact that we don't wal-log the page multiple times. I understand that we can call MarkBufferDirtyHint while holding a buffer lock in share mode as the comment of MarkBufferDirtyHint() says, but I'd like to improve this behavior so that we can avoid multiple FPI_FOR_HINT for the same block. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services