On 24/09/2024 18:55, Andres Freund wrote:
> What I'd instead like to propose is to implement the right to set hint bits as
> a bit in each buffer's state, similar to BM_IO_IN_PROGRESS. Tentatively I
> named this BM_SETTING_HINTS. It's only allowed to set BM_SETTING_HINTS when
> BM_IO_IN_PROGRESS isn't already set and StartBufferIO has to wait for
> BM_SETTING_HINTS to be unset to start IO.
>
> Naively implementing this, by acquiring and releasing the permission to set
> hint bits in SetHintBits() unfortunately leads to a significant performance
> regression. While the performance is unaffected for OLTPish workloads like
> pgbench (both read and write), sequential scans of unhinted tables regress
> significantly, due to the per-tuple lock acquisition this would imply.

It'd be nice to implement this without having to set and clear BM_SETTING_HINTS every time. Could we put the overhead on FlushBuffer() instead?

How about something like this:

To set hint bits:

1. Lock page in SHARED mode.
2. Read BM_IO_IN_PROGRESS
3. If !BM_IO_IN_PROGRESS, set hint bits, otherwise don't
4. Unlock page
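
A minimal, self-contained sketch of the setter's side under those rules. Names here (SimBufferDesc, try_set_hint_bits) are invented for illustration, the buffer state is reduced to a plain atomic word, and the caller is assumed to already hold the page lock in shared mode; the real code would use pg_atomic_uint32 in the buffer header:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BM_IO_IN_PROGRESS (1U << 0)   /* placeholder bit, not the real value */

typedef struct
{
    _Atomic uint32_t buf_state;       /* stand-in for the buffer header state */
    bool hint_bits_set;               /* stand-in for the tuple's hint bits */
} SimBufferDesc;

/*
 * Steps 2-3 above: with the page share-locked, only set hint bits if no
 * I/O (and hence no checksum computation) is in progress on the buffer.
 * Returns true if the hint bits were set.
 */
static bool
try_set_hint_bits(SimBufferDesc *buf)
{
    uint32_t state = atomic_load(&buf->buf_state);

    if (state & BM_IO_IN_PROGRESS)
        return false;                 /* flusher owns the page image: skip */

    buf->hint_bits_set = true;        /* e.g. HEAP_XMIN_COMMITTED on a tuple */
    return true;
}
```

Whether a plain read of the flag suffices, or a barrier is needed to order it against the hint-bit store, would need the same care as the existing buffer-header code.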

To flush a buffer:

1. Lock page in SHARED mode
2. Set BM_IO_IN_PROGRESS
3. Read the lock count on the buffer lock, to see if we're the only locker.
4. If anyone else is holding the lock, upgrade it to exclusive mode, and immediately downgrade back to share mode.
5. Calculate the CRC and flush the buffer
6. Clear BM_IO_IN_PROGRESS and unlock page.
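
Roughly, the flush side could look like the sketch below. Everything here is invented for illustration: the share-lock count is modeled as a readable counter (which is exactly the LWLock addition this would require), and the wait in step 4 is a stub for the exclusive-acquire/downgrade trick:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define BM_IO_IN_PROGRESS (1U << 0)   /* placeholder bit */

typedef struct
{
    _Atomic uint32_t buf_state;
    _Atomic int share_lock_count;     /* hypothetical: readable reader count */
    uint32_t checksum;
} SimBuffer;

/* Stub for step 4: wait until the share-lockers present at entry are gone.
 * The real version would briefly take the lock exclusively, then downgrade. */
static void
wait_for_existing_share_lockers(SimBuffer *buf)
{
    (void) buf;
}

static uint32_t
toy_checksum(SimBuffer *buf)
{
    /* placeholder; depends on buf_state only to show when it runs */
    return 0xC0FFEEu ^ atomic_load(&buf->buf_state);
}

/* Steps 1-6 above; the caller is assumed to hold the page share lock. */
static void
flush_buffer(SimBuffer *buf)
{
    atomic_fetch_or(&buf->buf_state, BM_IO_IN_PROGRESS);    /* step 2 */

    if (atomic_load(&buf->share_lock_count) > 1)            /* step 3 */
        wait_for_existing_share_lockers(buf);               /* step 4 */

    buf->checksum = toy_checksum(buf);                      /* step 5 */
    /* ... write the page image out here ... */

    atomic_fetch_and(&buf->buf_state, ~BM_IO_IN_PROGRESS);  /* step 6 */
}
```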

This goes back to the idea of adding LWLock support for this, but the amount of change could be pretty small. The operation we're missing today is reading the share-lock count on the lock in step 3. That seems simple to add.

Acquiring the exclusive lock in step 4 is just a way to wait for all the existing share-lockers to release the lock. You wouldn't need to block new share-lockers. We have LW_WAIT_UNTIL_FREE, which is almost what we need, but it currently ignores share-lockers. So doing this "properly" would require more changes to LWLocks. Briefly acquiring an exclusive lock seems acceptable though.


> But: We can address this and improve performance over the status quo! Today we
> determine tuple visibility one-by-one, even when checking the visibility of an
> entire page worth of tuples.

Yes, batching the visibility checks makes sense in any case.
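
As a toy illustration of what batching buys (all names and the clog stand-in below are invented): with one pass over the page, each distinct XID's status only needs to be looked up once, which matters a lot on bulk-loaded pages where most tuples share an xmin:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SimTransactionId;

static int sim_clog_lookups = 0;      /* counts simulated clog accesses */

static bool
sim_xid_committed(SimTransactionId xid)
{
    sim_clog_lookups++;
    return (xid % 2) == 0;            /* toy rule: even XIDs are committed */
}

/*
 * Determine visibility for all tuples on a "page" in one pass, caching the
 * status of the previously seen xmin so runs of identical XIDs cost one
 * lookup instead of one per tuple.  Returns the number of visible tuples.
 */
static int
count_visible(const SimTransactionId *xmins, int ntuples, bool *visible)
{
    SimTransactionId last_xid = 0;
    bool last_committed = false;
    bool have_last = false;
    int nvisible = 0;

    for (int i = 0; i < ntuples; i++)
    {
        if (!have_last || xmins[i] != last_xid)
        {
            last_xid = xmins[i];
            last_committed = sim_xid_committed(last_xid);
            have_last = true;
        }
        visible[i] = last_committed;
        if (visible[i])
            nvisible++;
    }
    return nvisible;
}
```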

--
Heikki Linnakangas
Neon (https://neon.tech)
