Hi all,

During investigating the issue our customer had, I realized that
_bt_killitems() can record the same FPI pages multiple times
simultaneously. This can happen when several concurrent index scans
are processing pages that contain killable tuples. Because killing
index items could be performed while holding a buffer lock in shared
mode concurrent processes record multiple FPI_FOR_HINT for the same
block.

Here is the reproducer:

cat <<EOF | psql -d postgres
drop table if exists tbl;
create table tbl (c int primary key) with (autovacuum_enabled = off);
insert into tbl select generate_series(1,300);
update tbl set c = c * -1 where c = 100;
checkpoint;
EOF

for n in `seq 1 4`
do
    psql -d postgres -c "select from tbl where c = 100" &
done

The server needs to enable wal_log_hints and this might need to run
several times. After running the script we can see this issue by
pg_waldump:

rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1C3D8,
prev 1/8FD1C368, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW
rmgr: XLOG len (rec/tot): 49/ 8209, tx: 0, top: 0, lsn: 1/8FD1E408,
prev 1/8FD1C3D8, desc: FPI_FOR_HINT , blkref #0: rel 1663/12643/16767
blk 0 FPW

This is an excerpt from _bt_killitems() of version 12.2. By recent
changes the code of HEAD looks different much but the part in question
is essentially not changed much. That is, it's reproducible even with
HEAD.

    for (i = 0; i < numKilled; i++)
    {
        int         itemIndex = so->killedItems[i];
        BTScanPosItem *kitem = &so->currPos.items[itemIndex];
        OffsetNumber offnum = kitem->indexOffset;

        Assert(itemIndex >= so->currPos.firstItem &&
               itemIndex <= so->currPos.lastItem);
        if (offnum < minoff)
            continue;           /* pure paranoia */
        while (offnum <= maxoff)
        {
            ItemId      iid = PageGetItemId(page, offnum);
            IndexTuple  ituple = (IndexTuple) PageGetItem(page, iid);

            if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
            {
                /* found the item */
                ItemIdMarkDead(iid);
                killedsomething = true;
                break;          /* out of inner search loop */
            }
            offnum = OffsetNumberNext(offnum);
        }
    }

    /*
     * Since this can be redone later if needed, mark as dirty hint.
     *
     * Whenever we mark anything LP_DEAD, we also set the page's
     * BTP_HAS_GARBAGE flag, which is likewise just a hint.
     */
    if (killedsomething)
    {
        opaque->btpo_flags |= BTP_HAS_GARBAGE;
        MarkBufferDirtyHint(so->currPos.buf, true);
    }

The inner test in the comment "found the item" never tests the item
for being dead. So maybe we can add !ItemIdIsDead(iid) to that
condition. But there still is a race condition of recording multiple
FPIs can happen. Maybe a better solution is to change the lock to
exclusive, at least when wal_log_hints = on, so that only one process
can run this code -- the reduction in concurrency might be won back by
the fact that we don't wal-log the page multiple times.

I understand that we can call MarkBufferDirtyHint while holding a
buffer lock in share mode as the comment of MarkBufferDirtyHint()
says, but I'd like to improve this behavior so that we can avoid
multiple FPI_FOR_HINT for the same block.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Reply via email to