On Wed, 2019-06-19 at 11:14 +0100, Stefan Hajnoczi wrote:
> On Mon, Jun 17, 2019 at 03:26:50PM +0300, Maxim Levitsky wrote:
> > On Mon, 2019-06-10 at 19:18 +0530, Aarushi Mehta wrote:
> > > +        if (!cqes) {
> > > +            break;
> > > +        }
> > > +        LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> > > +        ret = cqes->res;
> > > +
> > > +        if (ret == luringcb->qiov->size) {
> > > +            ret = 0;
> > > +        } else if (ret >= 0) {
> >
> > You should very carefully check the allowed return values here.
> >
> > It looks like you can get '-EINTR' here, which would ask you to rerun the
> > read operation, and otherwise you will get the number of bytes read, which
> > might be less than what was asked for. That implies that you need to retry
> > the read operation with the remainder of the buffer rather than zero the
> > end of the buffer, IMHO.
> >
> > (0 is returned on EOF according to 'read' semantics, which I think are
> > used here, thus a short read might not be an EOF.)
> >
> > Looking at linux-aio.c though, I do see that it just passes through the
> > returned value with no special treatment, including no check for -EINTR.
> >
> > I assume that since aio is Linux specific, and it only supports direct IO,
> > it happens to have the assumption of no short reads/-EINTR (but since
> > libaio has very sparse documentation I can't verify this).
> >
> > On the other hand, the aio=threads implementation actually does everything
> > as specified on the 'write' manpage, retrying the reads on -EINTR, and
> > doing additional reads if fewer than the required number of bytes were
> > read.
> >
> > Looking at the io_uring implementation in the kernel I see that it does
> > support synchronous (non O_DIRECT) mode, and in this case it goes through
> > the same ->read_iter, which is pretty much the same path that a regular
> > read() takes, and so it might return short reads and/or -EINTR.
>
> Interesting point.  Investigating EINTR should at least be a TODO
> comment and needs to be resolved before io_uring lands in a QEMU
> release.
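
Agreed. Just to make it concrete, what I have in mind is roughly what the
aio=threads path already does: keep issuing the read until the buffer is
full, EOF, or a real error is hit. A minimal sketch with plain pread()
(illustration only, not patch code -- the real fix would have to live on the
completion path and resubmit sqes instead of looping like this):

#include <errno.h>
#include <unistd.h>

/* Read exactly 'len' bytes unless EOF or a real error is hit (sketch). */
static ssize_t read_full(int fd, void *buf, size_t len, off_t offset)
{
    size_t done = 0;

    while (done < len) {
        ssize_t n = pread(fd, (char *)buf + done, len - done, offset + done);

        if (n < 0 && errno == EINTR) {
            continue;               /* interrupted, just retry */
        }
        if (n < 0) {
            return -errno;          /* real error */
        }
        if (n == 0) {
            break;                  /* EOF, a short read is not an error */
        }
        done += n;                  /* short read, read the remainder */
    }
    return done;
}
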
> > > +static int ioq_submit(LuringState *s)
> > > +{
> > > +    int ret = 0;
> > > +    LuringAIOCB *luringcb, *luringcb_next;
> > > +
> > > +    while (s->io_q.in_queue > 0) {
> > > +        QSIMPLEQ_FOREACH_SAFE(luringcb, &s->io_q.sq_overflow, next,
> > > +                              luringcb_next) {
> >
> > I am torn about the 'sq_overflow' name. It seems to me that it's not
> > immediately clear that these are the requests that are waiting because the
> > io_uring got full, but I can't think of a better name right now.
> >
> > Maybe add a comment here to explain what is going on?
>
> Hmm... I suggested this name because I thought it was clear.  But the
> fact that it puzzled you proves it wasn't clear :-).
>
> Can anyone think of a better name?  It's the queue we keep in QEMU to
> hold requests while the io_uring sq ring is full.
>
> > Also, maybe we could somehow utilize the plug/unplug facility to avoid
> > reaching that state in the first place? Maybe the block layer has some
> > kind of 'max outstanding requests' limit that could be used?
> >
> > In my nvme-mdev I opted not to process the input queues when such a
> > condition is detected, but here you can't, as the block layer pretty much
> > calls you to process the requests.
>
> Block layer callers are allowed to submit as many I/O requests as they
> like and there is no feedback mechanism.  It's up to linux-aio.c and
> io_uring.c to handle the case where host kernel I/O submission resources
> are exhausted.
>
> Plug/unplug is a batching performance optimization to reduce the number
> of io_uring_enter() calls but it does not stop the callers from
> submitting more I/O requests.  So plug/unplug isn't directly applicable
> here.

Thanks for the explanation! I guess we can leave that name as is, but add a
comment or so in the place where the queue is accessed.
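
Something along these lines, purely as a suggestion for the wording -- I'm
only guessing at the exact declaration here, the point is the comment next
to it:

/*
 * Requests that could not get a free sqe because the io_uring submission
 * queue was full when they were submitted; ioq_submit() drains this list
 * back into the ring as soon as sqes become available again.
 */
QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
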
> > > +static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
> > > +                             uint64_t offset, int type)
> > > +{
> > > +    struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
> > > +    if (!sqes) {
> > > +        sqes = &luringcb->sqeq;
> > > +        QSIMPLEQ_INSERT_TAIL(&s->io_q.sq_overflow, luringcb, next);
> > > +    }
> > > +
> > > +    switch (type) {
> > > +    case QEMU_AIO_WRITE:
> > > +        io_uring_prep_writev(sqes, fd, luringcb->qiov->iov,
> > > +                             luringcb->qiov->niov, offset);
> > > +        break;
> > > +    case QEMU_AIO_READ:
> > > +        io_uring_prep_readv(sqes, fd, luringcb->qiov->iov,
> > > +                            luringcb->qiov->niov, offset);
> > > +        break;
> > > +    case QEMU_AIO_FLUSH:
> > > +        io_uring_prep_fsync(sqes, fd, 0);
> > > +        break;
> > > +    default:
> > > +        fprintf(stderr, "%s: invalid AIO request type, aborting 0x%x.\n",
> > > +                __func__, type);
> >
> > Nitpick: Don't we use some kind of error printing function like
> > 'error_setg' rather than fprintf?
>
> Here we're not in a context where an Error object can be returned (e.g.
> printed by the QMP monitor).  We only have an errno return value that
> the emulated storage controller may squash down further to a single
> EIO-type error code.
>
> 'type' is a QEMU-internal value so the default case is basically
> assert(false); /* we should never get here */
>
> For these reasons the fprintf() seems okay here.

All right then.

> > I plan to do some benchmarks of the code this week or the next, and I
> > will share the results as soon as I have them.
>
> Excellent, Aarushi has been benchmarking too.  Perhaps you can share the
> QEMU command-line and fio configuration so that the results can be
> compared.

I'll do that soon. I already did some initial benchmarks (I didn't save the
results yet), and in its current state io_uring is a tiny bit slower than
libaio, which is nothing to worry about, because io_uring has a lot of
benefits besides performance, plus the current code can be optimized further.

I'm redoing all the benchmarks now and I'll post them as soon as I have the
results.

Best regards,
	Maxim Levitsky