On 10/11/2023 05:54, Andres Freund wrote:
> In this case I had used wal_sync_method=open_datasync - it's often faster and
> if we want to scale WAL writes more we'll have to use it more widely (you
> can't have multiple fdatasyncs in progress and reason about which one affects
> what, but you can have multiple DSYNC writes in progress at the same time).

Not sure I understand that. If you issue an fdatasync, it will sync all writes that were complete before the fdatasync started. Right? If you have multiple fdatasyncs in progress, that's true for each fdatasync. Or is there a bottleneck in the kernel with multiple in-progress fdatasyncs or something?
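For illustration, a minimal sketch of the two flushing styles being compared; this is not PostgreSQL code, the function names are made up and error handling is omitted:

#include <fcntl.h>
#include <unistd.h>

/*
 * Style 1: plain writes followed by fdatasync().  The fdatasync() covers
 * every write that completed before it was issued, so with several
 * fdatasyncs in flight it's hard to attribute a particular write's
 * durability to a particular fdatasync() call.
 */
static void
flush_with_fdatasync(int fd, const char *buf, size_t len, off_t offset)
{
    pwrite(fd, buf, len, offset);
    fdatasync(fd);
}

/*
 * Style 2: the fd is opened with O_DSYNC, so each pwrite() is durable by
 * the time it returns.  Several such writes can be in flight at once, and
 * each completion says exactly which bytes are on stable storage.
 */
static int
open_wal_dsync(const char *path)
{
    return open(path, O_WRONLY | O_DSYNC);
}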

> After a bit of confused staring and debugging I figured out that the problem
> is that the RequestXLogSwitch() within the code for starting a basebackup was
> triggering writing back the WAL in individual 8kB writes via
> GetXLogBuffer()->AdvanceXLInsertBuffer(). With open_datasync each of these
> writes is durable - on this drive each takes about 1ms.

I see. So the assumption in AdvanceXLInsertBuffer() is that XLogWrite() is relatively fast. But with open_datasync, it's not.
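To put rough numbers on it: at ~1ms per durable 8kB write, flushing N dirty pages one page at a time costs about N milliseconds, so a segment's worth of pages (16MB / 8kB = 2048, with the default segment size) is on the order of two seconds, versus a few milliseconds if the same range were covered by a handful of large writes. A toy sketch of the two patterns, with made-up names rather than the actual AdvanceXLInsertBuffer()/XLogWrite() code:

#include <unistd.h>

#define PAGE_SZ 8192

/* each pwrite() on an O_DSYNC fd is a separate durable I/O, ~1ms apiece above */
static void
flush_page_at_a_time(int fd, const char *pages, int npages, off_t start)
{
    for (int i = 0; i < npages; i++)
        pwrite(fd, pages + (size_t) i * PAGE_SZ, PAGE_SZ,
               start + (off_t) i * PAGE_SZ);
}

/* one durable I/O covering the whole range */
static void
flush_in_one_write(int fd, const char *pages, int npages, off_t start)
{
    pwrite(fd, pages, (size_t) npages * PAGE_SZ, start);
}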

> To fix this, I suspect we need to make
> GetXLogBuffer()->AdvanceXLInsertBuffer() flush more aggressively. In this
> specific case, we even know for sure that we are going to fill a lot more
> buffers, so no heuristic would be needed. In other cases however we need some
> heuristic to know how much to write out.

+1. Maybe use the same logic as in XLogFlush().

I wonder if the 'flexible' argument to XLogWrite() is too inflexible. It would be nice to pass a hard minimum XLogRecPtr that it must write up to, but still allow it to write more than that if it's convenient.
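Purely as an interface sketch of what I mean, with made-up names rather than the current XLogWrite() signature (XLogRecPtr and TimeLineID are the usual typedefs):

/*
 * Hypothetical request: the callee must write everything up to
 * must_write_upto, and may extend the write anywhere up to
 * opportunistic_upto if that allows fewer, larger writes.
 */
typedef struct XLogWriteRequest
{
    XLogRecPtr  must_write_upto;    /* hard minimum, always honored */
    XLogRecPtr  opportunistic_upto; /* may stop anywhere in between */
} XLogWriteRequest;

void XLogWriteUpto(XLogWriteRequest req, TimeLineID tli);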

--
Heikki Linnakangas
Neon (https://neon.tech)


