Hi,

On 2020-12-08 13:01:38 +0800, Craig Ringer wrote:
> Have you done much bpf / systemtap / perf based work on measurement and
> tracing of latencies etc? If not that's something I'd be keen to help with.
> I've mostly been using systemtap so far but I'm trying to pivot over to
> bpf.

Not much - there are still so many low-hanging fruits and architectural
things to finish that it didn't yet seem pressing.

> > I've got asynchronous writing of WAL mostly working, but need to
> > redesign the locking a bit further. Right now it's a win in some cases,
> > but not others. The latter to a significant degree due to unnecessary
> > blocking...

> That's where io_uring's I/O ordering operations looked interesting. But I
> haven't looked closely enough to see if they're going to help us with I/O
> ordering in a multiprocessing architecture like postgres.

The ordering ops aren't quite powerful enough to be a huge boon
performance-wise (yet). They can cut down on syscall and intra-process
context switch overhead to some degree, but otherwise it's no different
from userspace submitting another request upon receiving a completion.
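For concreteness, the linking looks roughly like this with liburing -
just a sketch, with a hypothetical function name, fds and buffer, and
error handling omitted:

    #include <sys/types.h>
    #include <liburing.h>

    /*
     * Queue a WAL write with an fsync linked behind it.  IOSQE_IO_LINK
     * only guarantees the fsync doesn't start before the write has
     * completed; the kernel doesn't get any smarter about scheduling.
     */
    static void
    submit_wal_write_then_fsync(struct io_uring *ring, int wal_fd,
                                const void *buf, unsigned len, off_t offset)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, wal_fd, buf, len, offset);
        sqe->flags |= IOSQE_IO_LINK;    /* next SQE waits for this one */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, wal_fd, IORING_FSYNC_DATASYNC);

        io_uring_submit(ring);
    }

That saves the intermediate completion / resubmission round trip, but
nothing more.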
> In an ideal world we could tell the kernel about WAL-to-heap I/O
> dependencies and even let it apply WAL then heap changes out-of-order so
> long as they didn't violate any ordering constraints we specify between
> particular WAL records or between WAL writes and their corresponding heap
> blocks. But I don't know if the io_uring interface is that capable.

It's not. And that kind of dependency inference wouldn't be cheap on the
PG side either.

I don't think it'd help that much for WAL apply anyway. You need
read-ahead of the WAL to avoid unnecessary waits for a lot of records
anyway. And the writes during WAL apply are mostly pretty asynchronous
(mainly writeback during buffer replacement).

An imo considerably more interesting case is avoiding blocking on a WAL
flush when needing to write a page out in an OLTPish workload. But I can
think of more efficient ways there too.

> How feasible do you think it'd be to take it a step further and structure
> redo as a pipelined queue, where redo calls enqueue I/O operations and
> completion handlers then return immediately? Everything still goes to disk
> in the order it's enqueued, and the callbacks will be invoked in order, so
> they can update appropriate shmem state etc. Since there's no concurrency
> during redo, it should be *much* simpler than normal user backend
> operations where we have all the tight coordination of buffer management,
> WAL write ordering, PGXACT and PGPROC, the clog, etc.

I think it'd be a fairly massive increase in complexity. And I don't see a
really large payoff: once you have real readahead in the WAL there's
really not much synchronous IO left. What am I missing?

Greetings,

Andres Freund
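PS: For the page-write-out case above, the same linking could in
principle be used to queue the WAL flush and the page write as a pair,
so the backend doesn't sleep waiting for the flush - again only a
sketch, with hypothetical fds/offsets and no error handling:

    #include <sys/types.h>
    #include <liburing.h>

    #define BLCKSZ 8192    /* default postgres block size */

    /*
     * Queue "flush WAL, then write the dirty page" as one linked
     * submission and return without waiting for either to complete.
     */
    static void
    queue_page_write_after_wal_flush(struct io_uring *ring,
                                     int wal_fd, int data_fd,
                                     const void *page, off_t page_offset)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, wal_fd, IORING_FSYNC_DATASYNC);
        sqe->flags |= IOSQE_IO_LINK;    /* page write waits for the flush */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, data_fd, page, BLCKSZ, page_offset);

        io_uring_submit(ring);
    }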