Hi, I just created a primary with wal_segment_size=512. Then tried to create a standby via pg_basebackup. The pg_basebackup appeared to just hang, for quite a while, but did eventually complete. Over a minute for an empty cluster, when using -c fast.
In this case I had used wal_sync_method=open_datasync - it's often faster and if we want to scale WAL writes more we'll have to use it more widely (you can't have multiple fdatasyncs in progress and reason about which one affects what, but you can have multiple DSYNC writes in progress at the same time). After a bit of confused staring and debugging I figured out that the problem is that the RequestXLogSwitch() within the code for starting a basebackup was triggering writing back the WAL in individual 8kB writes via GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these writes is durable - on this drive each take about 1ms. Normally we write out WAL in bigger chunks - but as it turns out, we don't have any logic for doing larger writes when AdvanceXLInsertBuffers() is called from within GetXLogBuffer(). We just try to make enough space so that one buffer can be replaced. The time for a single SELECT pg_switch_wal() on this system, when using open_datasync and a 512MB segment, are: wal_buffers time for pg_switch_xlog() 16 64s 100 53s 400 13s 600 1.3s That's pretty bad. We don't really benefit from more buffering here, it just avoids flushing in tiny increments. With a smaller wal_buffers, the large record by pg_switch_xlog() needs to replace buffers it inself inserted, and does so one-by-one. If we never re-encounter an buffer we inserted ourself earlier due to a larger wal_buffers, the problem isn't present. This can bit with smaller segments too, it doesn't require large ones ones. The reason this doesn't constantly become an issue is that walwriter normally tries to write out WAL, and if it does, the AdvanceXLInsertBuffers() called in backends doesn't need to (walsender also calls AdvanceXLInsertBuffers(), but it won't ever write out data). In my case, walsender is actually trying to do something - but it never gets WALWriteLock. The semaphore does get set after AdvanceXLInsertBuffers() releases WALWriteLock, but on this system walwriter never succeeds taking the lwlock before AdvanceXLInsertBuffers() succeeds re-acquiring it. I think it might be a lucky accident that the problem was visible this blatantly in this one case - I suspect that this behaviour is encountered during normal operation in the wild, but much harder to pinpoint, because it doesn't happen "exclusively". E.g. I see a lot higher throughput bulk-loading data with larger wal_buffers when using open_datasync, but basically no difference when using fdatasync. And there are a lot of wal_buffers_full writes. To fix this, I suspect we need to make GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this specific case, we even know for sure that we are going to fill a lot more buffers, so no heuristic would be needed. In other cases however we need some heuristic to know how much to write out. Given how *extremely* aggressive we are about flushing out nearly all pending WAL in XLogFlush(), I'm not sure there's much point in not also being somewhat aggressive in GetXLogBuffer()->AdvanceXLInsertBuffer(). Greetings, Andres Freund