Hi,

On 2020-12-08 13:01:38 +0800, Craig Ringer wrote:
> Have you done much bpf / systemtap / perf based work on measurement and
> tracing of latencies etc? If not that's something I'd be keen to help with.
> I've mostly been using systemtap so far but I'm trying to pivot over to
> bpf.

Not much - there are still so many low-hanging fruits and architectural
things to finish that it didn't yet seem pressing.

> > I've got asynchronous writing of WAL mostly working, but need to
> > redesign the locking a bit further. Right now it's a win in some cases,
> > but not others. The latter to a significant degree due to unnecessary
> > blocking...

> That's where io_uring's I/O ordering operations looked interesting. But I
> haven't looked closely enough to see if they're going to help us with I/O
> ordering in a multiprocessing architecture like postgres.

The ordering ops aren't quite powerful enough to be a huge boon
performance-wise (yet). They can cut down on syscall and intra-process
context switch overhead to some degree, but otherwise it's no different
from userspace submitting another request upon receiving a completion.
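For concreteness, the linking looks roughly like this with liburing -
just a sketch, with a hypothetical function name, fds and buffer, and
error handling omitted:

    #include <sys/types.h>
    #include <liburing.h>

    /*
     * Queue a WAL write with an fsync linked behind it.  IOSQE_IO_LINK
     * only guarantees the fsync doesn't start before the write has
     * completed; the kernel doesn't get any smarter about scheduling.
     */
    static void
    submit_wal_write_then_fsync(struct io_uring *ring, int wal_fd,
                                const void *buf, unsigned len, off_t offset)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, wal_fd, buf, len, offset);
        sqe->flags |= IOSQE_IO_LINK;    /* next SQE waits for this one */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, wal_fd, IORING_FSYNC_DATASYNC);

        io_uring_submit(ring);
    }

That saves the intermediate completion / resubmission round trip, but
nothing more.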
> In an ideal world we could tell the kernel about WAL-to-heap I/O
> dependencies and even let it apply WAL then heap changes out-of-order so
> long as they didn't violate any ordering constraints we specify between
> particular WAL records or between WAL writes and their corresponding heap
> blocks. But I don't know if the io_uring interface is that capable.

It's not. And that kind of dependency inference wouldn't be cheap on the
PG side either.

I don't think it'd help that much for WAL apply anyway. You need
read-ahead of the WAL to avoid unnecessary waits for a lot of records
anyway. And the writes during WAL apply are mostly pretty asynchronous
(mainly writeback during buffer replacement).

An imo considerably more interesting case is avoiding blocking on a WAL
flush when needing to write a page out in an OLTPish workload. But I can
think of more efficient ways there too.

> How feasible do you think it'd be to take it a step further and structure
> redo as a pipelined queue, where redo calls enqueue I/O operations and
> completion handlers then return immediately? Everything still goes to disk
> in the order it's enqueued, and the callbacks will be invoked in order, so
> they can update appropriate shmem state etc. Since there's no concurrency
> during redo, it should be *much* simpler than normal user backend
> operations where we have all the tight coordination of buffer management,
> WAL write ordering, PGXACT and PGPROC, the clog, etc.

I think it'd be a fairly massive increase in complexity. And I don't see a
really large payoff: once you have real readahead in the WAL there's
really not much synchronous IO left. What am I missing?

Greetings,

Andres Freund
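PS: For the page-write-out case above, the same linking could in
principle be used to queue the WAL flush and the page write as a pair,
so the backend doesn't sleep waiting for the flush - again only a
sketch, with hypothetical fds/offsets and no error handling:

    #include <sys/types.h>
    #include <liburing.h>

    #define BLCKSZ 8192    /* default postgres block size */

    /*
     * Queue "flush WAL, then write the dirty page" as one linked
     * submission and return without waiting for either to complete.
     */
    static void
    queue_page_write_after_wal_flush(struct io_uring *ring,
                                     int wal_fd, int data_fd,
                                     const void *page, off_t page_offset)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_fsync(sqe, wal_fd, IORING_FSYNC_DATASYNC);
        sqe->flags |= IOSQE_IO_LINK;    /* page write waits for the flush */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, data_fd, page, BLCKSZ, page_offset);

        io_uring_submit(ring);
    }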