Hello,

On 11/12/24 10:34, Andres Freund wrote:
I have working code - pretty ugly at this stage, but mostly needs a fair bit
of elbow grease, not divine inspiration...  It's not a trivial change, but
entirely doable.

The short summary of how it works is that it uses a single 64bit atomic that
is internally subdivided into a ringbuffer position in N high bits and an
offset from a base LSN in the remaining bits.  The insertion sequence is

...

This leaves you with a single xadd to a contended cacheline as the contention
point (scales far better than cmpxchg and far far better than
cmpxchg16b). There's a bit of contention on ringbuffer[].oldpos being set
and read, but that only involves two backends, not all of them.

That sounds rather promising.
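Just to make sure I understand the packing trick: the following is only my own
minimal sketch of the idea, not your patch. It assumes a 16/48 bit split of the
control word, and the names (wal_insert_ctl, ReserveWalInsert, etc.) are made
up; the real code obviously also has to deal with wraparound, advancing the
base LSN, and completion tracking.

/*
 * Illustrative sketch only.  A single 64-bit control word is laid out as
 *   [ ring position : 16 bits ][ offset from base LSN : 48 bits ]
 * so that one atomic fetch-add both claims a ring slot and advances the
 * insert offset.  All names here are hypothetical.
 */
#include <stdatomic.h>
#include <stdint.h>

#define OFFSET_BITS   48
#define OFFSET_MASK   ((UINT64_C(1) << OFFSET_BITS) - 1)

static _Atomic uint64_t wal_insert_ctl;   /* the single contended cacheline */
static uint64_t         base_lsn;         /* low bits are relative to this */

typedef struct
{
	uint32_t	ringpos;	/* slot in the insertion ringbuffer */
	uint64_t	start_lsn;	/* where this record's data begins */
} WalReservation;

static WalReservation
ReserveWalInsert(uint32_t size)
{
	/*
	 * One xadd advances the ring position (high bits) by one and the LSN
	 * offset (low bits) by the record size.  The returned old value tells
	 * us which slot and which start LSN we were handed.
	 */
	uint64_t	delta = (UINT64_C(1) << OFFSET_BITS) | size;
	uint64_t	old = atomic_fetch_add(&wal_insert_ctl, delta);
	WalReservation r;

	r.ringpos = (uint32_t) (old >> OFFSET_BITS);	/* caller wraps modulo ring size */
	r.start_lsn = base_lsn + (old & OFFSET_MASK);
	return r;
}

If that is roughly the shape of it, the single fetch-add as the only
cross-backend synchronization point explains the scalability numbers.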

Would it be reasonable to have both implementations available, at least at compile time if not at runtime? Is it possible that we need to do that anyway for some time, or are those atomic operations available on all supported CPU architectures?



The nice part is this scheme leaves you with a ringbuffer that's ordered by
the insertion LSN. That makes it possible to make WaitXLogInsertionsToFinish()
far more efficient and to get rid of NUM_XLOGINSERT_LOCKS (by removing WAL
insertion locks). Right now NUM_XLOGINSERT_LOCKS is a major scalability limit -
but at the same time, increasing it makes the contention on the spinlock *much*
worse, leading to slowdowns in other workloads.

Yeah, that is a complex wart. I believe it was the answer to the NUMA overload that Kevin Grittner and I discovered many years ago, where on a 4-socket machine the cacheline stealing would get so bad that whoever was holding the lock could not release it.
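And if I follow why the LSN ordering helps: WaitXLogInsertionsToFinish() could
then walk forward from the oldest outstanding slot and stop at the first
insertion that is still in progress, instead of checking every insertion lock.
A rough sketch of that idea below - the slot layout and the 'finished' flag are
my own assumptions for illustration, not what your patch actually does.

/*
 * Hypothetical sketch: because ring[] entries are in insertion-LSN order,
 * everything before the first unfinished slot is known complete.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 128

typedef struct
{
	_Atomic uint64_t start_lsn;	/* start LSN of this insertion */
	_Atomic bool	 finished;	/* set by the inserting backend */
} InsertSlot;

static InsertSlot ring[RING_SIZE];

/*
 * Return the LSN up to which all insertions are known to have finished,
 * scanning from 'oldest' (oldest slot not yet reclaimed) to 'newest'.
 */
static uint64_t
KnownCompleteUpto(uint32_t oldest, uint32_t newest, uint64_t reserved_upto)
{
	for (uint32_t pos = oldest; pos != newest; pos++)
	{
		InsertSlot *slot = &ring[pos % RING_SIZE];

		if (!atomic_load(&slot->finished))
			return atomic_load(&slot->start_lsn);	/* first gap: stop here */
	}
	return reserved_upto;		/* no gaps: everything reserved is complete */
}

That would indeed be a lot friendlier than the current dance over
NUM_XLOGINSERT_LOCKS insertion locks.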

In any case, thanks for the input. Looks like in the long run we need to come up with a different way to solve the inversion problem.


Best Regards, Jan


