AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

Andres Freund Thu, 09 Nov 2023 19:54:41 -0800

Hi,

I just created a primary with wal_segment_size=512. Then tried to create a
standby via pg_basebackup. The pg_basebackup appeared to just hang, for quite
a while, but did eventually complete. Over a minute for an empty cluster, when
using -c fast.


In this case I had used wal_sync_method=open_datasync - it's often faster and
if we want to scale WAL writes more we'll have to use it more widely (you
can't have multiple fdatasyncs in progress and reason about which one affects
what, but you can have multiple DSYNC writes in progress at the same time).

After a bit of confused staring and debugging I figured out that the problem
is that the RequestXLogSwitch() within the code for starting a basebackup was
triggering writing back the WAL in individual 8kB writes via
GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these
writes is durable - on this drive each take about 1ms.


Normally we write out WAL in bigger chunks - but as it turns out, we don't
have any logic for doing larger writes when AdvanceXLInsertBuffers() is called
from within GetXLogBuffer(). We just try to make enough space so that one
buffer can be replaced.


The time for a single SELECT pg_switch_wal() on this system, when using
open_datasync and a 512MB segment, are:

wal_buffers   time for pg_switch_xlog()
16            64s
100           53s
400           13s
600           1.3s

That's pretty bad.  We don't really benefit from more buffering here, it just
avoids flushing in tiny increments. With a smaller wal_buffers, the large
record by pg_switch_xlog() needs to replace buffers it inself inserted, and
does so one-by-one. If we never re-encounter an buffer we inserted ourself
earlier due to a larger wal_buffers, the problem isn't present.

This can bit with smaller segments too, it doesn't require large ones ones.


The reason this doesn't constantly become an issue is that walwriter normally
tries to write out WAL, and if it does, the AdvanceXLInsertBuffers() called in
backends doesn't need to (walsender also calls AdvanceXLInsertBuffers(), but
it won't ever write out data).

In my case, walsender is actually trying to do something - but it never gets
WALWriteLock. The semaphore does get set after AdvanceXLInsertBuffers()
releases WALWriteLock, but on this system walwriter never succeeds taking the
lwlock before AdvanceXLInsertBuffers() succeeds re-acquiring it.


I think it might be a lucky accident that the problem was visible this
blatantly in this one case - I suspect that this behaviour is encountered
during normal operation in the wild, but much harder to pinpoint, because it
doesn't happen "exclusively".

E.g. I see a lot higher throughput bulk-loading data with larger wal_buffers
when using open_datasync, but basically no difference when using
fdatasync. And there are a lot of wal_buffers_full writes.


To fix this, I suspect we need to make
GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
specific case, we even know for sure that we are going to fill a lot more
buffers, so no heuristic would be needed. In other cases however we need some
heuristic to know how much to write out.

Given how *extremely* aggressive we are about flushing out nearly all pending
WAL in XLogFlush(), I'm not sure there's much point in not also being somewhat
aggressive in GetXLogBuffer()->AdvanceXLInsertBuffer().


Greetings,

Andres Freund

AdvanceXLInsertBuffers() vs wal_sync_method=open_datasync

Reply via email to