Hi,

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:
> On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:
> > One benefit would be that it'd make it more realistic to use direct
> > IO for WAL - for which I have seen significant performance benefits.
> > But when we afterwards have to re-read it from disk to replicate,
> > it's less clearly a win.
>
> Does this patch still look like a good fit for your (or someone else's)
> plans for direct IO here? If so, would committing this soon make it
> easier to make progress on that, or should we wait until it's actually
> needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except, of
course, that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from WAL buffers
would be beneficial, even without further AIO infrastructure.

I also think there are other quite desirable features that are made easier by
this patch. One of the primary problems with using synchronous replication is
the latency increase, obviously. We can't send out WAL before it has been
written out and flushed to disk locally. For some workloads, we could
substantially lower synchronous commit latency if we were able to send WAL to
remote nodes *before* WAL has been made durable locally, even if the receiving
systems wouldn't be allowed to write that data to disk yet: at commit time it
takes less time to send just "write LSN: %X/%X, flush LSN: %X/%X" than to also
have to send all the not-yet-durable WAL.

In many OLTP workloads there won't be WAL flushes between generating WAL for
DML and commit, which means that the amount of WAL that needs to be sent out
at commit can be of nontrivial size. E.g. for pgbench, a transaction is
normally about 550 bytes (fitting in a single TCP/IP packet), but a pgbench
transaction that needs to emit FPIs for everything is a lot larger: ~45kB (not
fitting in a single packet). Obviously many real-world OLTP workloads actually
do more writes than pgbench. Making the commit latency of the latter be closer
to the commit latency of the former when using syncrep would obviously be
great.

Of course this patch is just a relatively small step towards that: we'd also
need in-memory buffering on the receiving side, the replication protocol would
need to be improved, and we'd likely need an option to explicitly opt into
receiving unflushed data. But it's still a pretty much required step.

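To put rough numbers on the single-packet argument above, here is a
back-of-the-envelope sketch in Python. It assumes a 1500-byte Ethernet MTU and
IPv4/TCP headers without options (so a 1460-byte MSS), reads "45kB" as 45,000
bytes, and ignores replication protocol framing. The name tcp_segments is
purely illustrative and does not correspond to anything in PostgreSQL.

```python
# Assumption: 1500-byte MTU, 20-byte IPv4 header, 20-byte TCP header, no options.
MTU = 1500
MSS = MTU - 20 - 20  # payload bytes that fit into one TCP segment


def tcp_segments(payload_bytes: int, mss: int = MSS) -> int:
    """Return the minimum number of TCP segments needed for payload_bytes."""
    if payload_bytes <= 0:
        return 0
    # Integer ceiling division, avoids floating point rounding.
    return -(-payload_bytes // mss)


# Approximate WAL volume per pgbench transaction, taken from the text above.
wal_without_fpis = 550       # bytes
wal_with_fpis = 45 * 1000    # bytes, assuming "45kB" means 45,000 bytes

if __name__ == "__main__":
    print("segments without FPIs:", tcp_segments(wal_without_fpis))
    print("segments with FPIs:   ", tcp_segments(wal_with_fpis))
```

Only the order of magnitude matters here: one segment in the common case
versus roughly thirty when FPIs are emitted for everything, all of which
currently has to go out after the local flush.
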
Greetings,

Andres Freund