Re: [RFC] Lock-free XLog Reservation from WAL

Yura Sokolov Fri, 10 Jan 2025 04:43:20 -0800

09.01.2025 19:03, Zhou, Zhiguo wrote:

On 1/7/2025 10:49 AM, Юрий Соколов wrote:
On 6 Jan 2025, at 09:46, Zhou, Zhiguo <[email protected]> wrote:

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:
Hi
Thank you for your patch. Increasing NUM_XLOGINSERT_LOCKS to 128 will, I think, be challenged; should we make it a GUC?
I noticed there have been some discussions (for example, [1] and its responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a controversial proposal. Given that, we may first focus on the lock-free XLog reservation implementation, and leave the increase of NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more quantitative evidence for the various implementations. WDYT?
On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <[email protected]> wrote:
   Good day, Zhiguo.
   Idea looks great.
   Minor issue:
   - you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.
   I initially thought it became un-synchronized against
   `ReserveXLogInsertLocation`, but looking closer I found it is
   synchronized with `WALInsertLockAcquireExclusive`.
   Since there are no other `insertpos_lck` usages after your patch, I
   don't see why it should exist and be used in `ReserveXLogSwitch`.
   Still, I'd prefer to see a CAS loop in this place, to be consistent with
   the other non-locking accesses. It would also allow getting rid of
   `WALInsertLockAcquireExclusive` (though that is probably not a big
   issue).
Exactly, it should be safe to remove `insertpos_lck`. And I agree with you on getting rid of `WALInsertLockAcquireExclusive` with a CAS loop, which should significantly reduce the synchronization cost here, especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try it in the next version of the patch.
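
A minimal sketch of the kind of CAS loop discussed here, written with C11 atomics rather than PostgreSQL's `pg_atomic_*` API. `DemoInsertCtl`, `demo_reserve` and `demo_reserve_to_boundary` are hypothetical names for illustration and are not taken from the patch; the boundary variant is a much simplified stand-in for what `ReserveXLogSwitch` has to do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared insert state; not PostgreSQL's struct. */
typedef struct
{
    _Atomic uint64_t curr_byte_pos;  /* next unreserved byte position */
} DemoInsertCtl;

/* Reserve `size` bytes; returns the start of the reserved range. */
static uint64_t
demo_reserve(DemoInsertCtl *ctl, uint64_t size)
{
    uint64_t start;

    assert(size > 0);
    start = atomic_load_explicit(&ctl->curr_byte_pos, memory_order_relaxed);
    /* On failure the CAS stores the current value into `start`, so we retry. */
    while (!atomic_compare_exchange_weak_explicit(&ctl->curr_byte_pos,
                                                  &start, start + size,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
        ;
    return start;
}

/*
 * Reserve everything up to the next multiple of `seg_size` (a simplified
 * stand-in for a segment switch). Returns the start; *end gets the boundary.
 * If the position is already on a boundary, nothing is reserved.
 */
static uint64_t
demo_reserve_to_boundary(DemoInsertCtl *ctl, uint64_t seg_size, uint64_t *end)
{
    uint64_t start;
    uint64_t next;

    assert(seg_size > 0);
    start = atomic_load_explicit(&ctl->curr_byte_pos, memory_order_relaxed);
    do
    {
        /* Recomputed on every retry, because `start` may have moved. */
        next = (start + seg_size - 1) / seg_size * seg_size;
    } while (!atomic_compare_exchange_weak_explicit(&ctl->curr_byte_pos,
                                                    &start, next,
                                                    memory_order_acq_rel,
                                                    memory_order_relaxed));
    *end = next;
    return start;
}
```

For the plain "add `size`" case an atomic fetch-and-add would do the same job; the CAS form matters when the new position is a non-trivial function of the old one, as in the boundary variant.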
   Major issue:
   - `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic writes/reads on
   platforms where MAXALIGN != 8 or without native 64-bit load/store. The
   branch with `memcpy` is rather obvious, but even pointer dereferencing
   in the "lucky case" is not safe either.
   I have no idea how to fix it at the moment.
Indeed, non-atomic write/read operations can lead to safety issues in some situations. My initial thought is to define a bit near the prev-link to flag the completion of the update. In this way, we could allow non-atomic or even discontinuous write/read operations on the prev-link, while simultaneously guaranteeing its atomicity through atomic operations (as well as memory barriers) on the flag bit. What do you think of this as a viable solution?
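
One possible reading of the flag idea, sketched with C11 atomics. The slot layout and the names `DemoPrevSlot`, `demo_slot_init`, `demo_set_prev` and `demo_get_prev` are assumptions made for illustration; the sketch also assumes each slot is written once and that the flag is reset before the slot is reused, which a real patch would have to guarantee separately.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical slot: a possibly unaligned prev-link plus a completion flag. */
typedef struct
{
    unsigned char prev_link[sizeof(uint64_t)];  /* written non-atomically */
    atomic_bool   prev_link_ready;              /* set after the write completes */
} DemoPrevSlot;

static void
demo_slot_init(DemoPrevSlot *slot)
{
    memset(slot->prev_link, 0, sizeof(slot->prev_link));
    atomic_init(&slot->prev_link_ready, false);
}

/* Writer: store the link with plain memcpy, then publish it via the flag. */
static void
demo_set_prev(DemoPrevSlot *slot, uint64_t prev)
{
    memcpy(slot->prev_link, &prev, sizeof(prev));
    /* Release ordering makes the bytes above visible before the flag. */
    atomic_store_explicit(&slot->prev_link_ready, true, memory_order_release);
}

/* Reader: returns false if the link has not been published yet. */
static bool
demo_get_prev(DemoPrevSlot *slot, uint64_t *prev)
{
    /* Acquire ordering pairs with the writer's release store. */
    if (!atomic_load_explicit(&slot->prev_link_ready, memory_order_acquire))
        return false;
    memcpy(prev, slot->prev_link, sizeof(*prev));
    return true;
}
```

A reader that gets `false` has to wait or retry; how it does so is left open here.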
   Readability issue:
   - It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
   I had a hard time recognizing that `upto` is strictly not in the future.
   - Certainly, the final version has to have fixed and improved comments.
   Many of the patch's ideas are strictly non-obvious. I had a hard time
   recognizing that the patch is not a piece of ... (excuse me for the swear
   sentence).
Thanks for the suggestion and patience. It's really more readable after inserting the assertion. I will fix it and improve other comments in the following patches.
   Indeed, the patch is much better than it looks at first sight.
   I came up with an alternative idea yesterday, but looking closer at your
   patch today I see it is superior to mine (if atomic access is fixed).
[1] https://www.postgresql.org/message-id/2266698.1704854297%40sss.pgh.pa.us
Good day, Zhiguo.
Here’s my attempt to organise the link to the previous record without messing with xlog buffers:
- link is stored in lock-free hash table instead.

I don’t claim it is any better than using xlog buffers.
It is just an alternative vision.

Some tricks in the implementation:
- Relying on the byte-position nature, a position can be converted to a
  32-bit unique value with `(uint32)(pos ^ (pos>>32))`. Certainly it is not
  totally unique, but it is certainly unique among 32GB of consecutive log.
- PrevBytePos can be calculated as a difference between positions, and
  this difference is certainly less than 4GB, so it can also be stored as
  a 32-bit value (PrevSize).
- Since xlog records are aligned, we could use the lowest bit of PrevSize
  as a lock.
- While Cuckoo Hashing could suffer from unsolvable cycle conflicts,
  this implementation relies on concurrent deleters, which will
  eventually break such cycles, if any.
I have a version without the 32-bit conversion trick; it is a bit lighter on atomic instruction count, but it performs badly in the absence of native 64-bit atomics.
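
The first three tricks can be sketched as small helpers. This is an illustration only: the names `demo_pos_id`, `demo_prev_size`, `demo_prev_pos` and `DEMO_PREV_LOCK_BIT` are hypothetical and do not come from the attached patch, and the hash table itself is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Lowest bit of PrevSize, free to use as a lock because sizes are aligned. */
#define DEMO_PREV_LOCK_BIT UINT32_C(1)

/* Fold a 64-bit byte position into a 32-bit id, as quoted in the message. */
static inline uint32_t
demo_pos_id(uint64_t pos)
{
    return (uint32_t) (pos ^ (pos >> 32));
}

/* Store the distance to the previous record instead of its full position. */
static inline uint32_t
demo_prev_size(uint64_t curr_pos, uint64_t prev_pos)
{
    uint64_t diff = curr_pos - prev_pos;

    assert(prev_pos < curr_pos);
    assert(diff <= UINT32_MAX);              /* "certainly less than 4GB" */
    assert((diff & DEMO_PREV_LOCK_BIT) == 0); /* aligned, so bit 0 is free */
    return (uint32_t) diff;
}

/* Recover the previous position, ignoring the lock bit if it is set. */
static inline uint64_t
demo_prev_pos(uint64_t curr_pos, uint32_t prev_size)
{
    return curr_pos - (prev_size & ~DEMO_PREV_LOCK_BIT);
}
```

Because the lock bit is masked off on decode, a reader gets the same previous position whether or not the entry is currently locked.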
——
regards
Yura Sokolov aka funny-falcon
Good day, Yura!
Your implementation based on the lock-free hash table is truly impressive! One of the aspects I particularly admire is how your solution doesn't require breaking the current convention of XLog insertion, whose revision is quite error-prone and ungraceful.


That is the main benefit of my approach, though it is not strictly better
than yours.

My minor concern is that the limited number of entries (256) in the hash table would be a bottleneck for parallel memory reservation, but I believe this is not a critical issue.

If you consider the hash-table fill rate, then 256 is quite enough for 128 concurrent inserters.


But I agree that 8 items per cache line could lead to false sharing.
Items could be stretched to 16 bytes (and then CurrPosId could be fully
unique), so there would be just 4 entries per cache line.

I will soon try to evaluate the performance impact of your patch on my device with the TPCC benchmark and also profile it to see if there are any changes that could be made to further improve it.

It would be great. On my notebook (Mac Air M1) I don't see any benefit from either my patch or yours )) My colleague will also test it on a 20-core virtual machine (but backported to v15).

BTW, do you have a plan to merge this patch to the master branch? Thanks!

I'm not a committer )) We will both struggle for many months to get something committed ;-)


BTW, your version could use a similar trick for guaranteed atomicity:

- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset` and store the offset to the previous record's start.


Since there are two limits:

    #define XLogRecordMaxSize   (1020 * 1024 * 1024)
    #define WalSegMaxSize 1024 * 1024 * 1024

the offset to the previous record cannot be larger than 2GB.

Yes, it is a format change that some backup utilities will have to adapt to.

But it saves 4 bytes in XLogRecord (which could be spent on storing FullTransactionId instead of TransactionId) and it compresses better.

And your version then will not need to handle the case where this value is split across two buffers (since MAXALIGN is not less than 4), and PostgreSQL already relies on 4-byte read/write atomicity (in some places even without use of pg_atomic_uint32).
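
A small sketch of the offset encoding proposed above, under stated assumptions. The `DEMO_*` constants mirror the two limits quoted earlier but are redefined locally as 64-bit values; `demo_encode_prev` and `demo_decode_prev` are hypothetical helpers, not existing PostgreSQL functions. One way to read the 2GB bound is that the sum of the two limits stays below 2^31, which the static assertion checks.

```c
#include <assert.h>
#include <stdint.h>

/* Local 64-bit copies of the two limits quoted in the message. */
#define DEMO_XLOG_RECORD_MAX_SIZE (UINT64_C(1020) * 1024 * 1024)
#define DEMO_WAL_SEG_MAX_SIZE     (UINT64_C(1024) * 1024 * 1024)
#define DEMO_MAX_PREV_OFFSET      (UINT64_C(1) << 31)   /* 2GB */

/* 1020MB + 1024MB = 2044MB, which is below 2048MB. */
_Static_assert(DEMO_XLOG_RECORD_MAX_SIZE + DEMO_WAL_SEG_MAX_SIZE
               < DEMO_MAX_PREV_OFFSET,
               "sum of the two limits must stay below 2GB");

/* Encode the previous record's start as a 32-bit backward offset. */
static uint32_t
demo_encode_prev(uint64_t rec_start, uint64_t prev_start)
{
    uint64_t offset = rec_start - prev_start;

    assert(prev_start < rec_start);
    assert(offset < DEMO_MAX_PREV_OFFSET);
    return (uint32_t) offset;
}

/* Recover the previous record's start from the stored offset. */
static uint64_t
demo_decode_prev(uint64_t rec_start, uint32_t xl_prev_offset)
{
    return rec_start - xl_prev_offset;
}
```

How the very first record, which has no predecessor, would be represented is not covered by this sketch.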


----

regards
Sokolov Yura aka funny-falcon
