Re: [PATCH] block/rbd: Do not use BDS's AioContext

Hanna Czenczek Wed, 12 Feb 2025 06:28:19 -0800

On 12.02.25 14:26, Kevin Wolf wrote:

On 12.02.2025 at 10:32, Hanna Czenczek wrote:

RBD schedules the request completion code (qemu_rbd_finish_bh()) to run
in the BDS's AioContext.  The intent seems to be to run it in the same
context that the original request coroutine ran in, i.e. the thread on
whose stack the RBDTask object exists (see qemu_rbd_start_co()).


However, with multiqueue, that thread is not necessarily the same as the
BDS's AioContext.  Instead, we need to remember the actual AioContext
and schedule the completion BH there.

Buglink: https://issues.redhat.com/browse/RHEL-67115

Please add a short summary of what actually happens to the commit
message. I had to check the link to remember what the symptoms are.

Sure. The problem is, I don’t know exactly what’s going wrong (looked like a coroutine being rescheduled after it was already done), and I don’t know how this fixes it.

Reported-by: Junyao Zhao <junz...@redhat.com>
Signed-off-by: Hanna Czenczek <hre...@redhat.com>
---
I think I could also drop RBDTask.ctx and just use
`qemu_coroutine_get_aio_context(RBDTask.co)` instead, but this is the
version of the patch that was tested and confirmed to fix the issue (I
don't have a local reproducer), so I thought I'd post this first.

Did you figure out why it even makes a difference in which thread
qemu_rbd_finish_bh() runs? For context:

     static void qemu_rbd_finish_bh(void *opaque)
     {
         RBDTask *task = opaque;
         task->complete = true;
         aio_co_wake(task->co);
     }

This looks as if it should be working in any thread, except maybe for a
missing barrier after updating task->complete - but I think the failure
mode for that would be a hang in qemu_rbd_start_co().

Yes, I thought the same thing. All I could imagine was that maybe reading task->co returns the wrong result, but given how long ago that must have been set, it seems quite unlikely (to say the least). In addition, qemu_rbd_completion_cb() already reads the object from a different thread, and that seems to work fine.

Really, all I know is that the notion of a BDS’s AioContext no longer makes sense in a multiqueue I/O path, so this should be scheduled in the I/O’s AioContext (just conceptually speaking), and that this seems to fix the bug.

  block/rbd.c | 10 ++++++----
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index af984fb7db..9d4e0817e0 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -102,7 +102,7 @@ typedef struct BDRVRBDState {
 } BDRVRBDState;
 
 typedef struct RBDTask {
-    BlockDriverState *bs;
+    AioContext *ctx;
     Coroutine *co;
     bool complete;
     int64_t ret;
@@ -1269,8 +1269,7 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
     task->ret = rbd_aio_get_return_value(c);
     rbd_aio_release(c);
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
-                            qemu_rbd_finish_bh, task);
+    aio_bh_schedule_oneshot(task->ctx, qemu_rbd_finish_bh, task);
 }
 
 static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
@@ -1281,7 +1280,10 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
                                           RBDAIOCmd cmd)
 {
     BDRVRBDState *s = bs->opaque;
-    RBDTask task = { .bs = bs, .co = qemu_coroutine_self() };
+    RBDTask task = {
+        .ctx = qemu_get_current_aio_context(),
+        .co = qemu_coroutine_self(),
+    };
     rbd_completion_t c;
     int r;
 
Nothing wrong I can see about the change, but I don't understand why it
fixes the problem.

Me neither. But if this patch had been part of one of the original multiqueue series (without pointing out the linked bug), would there have been any argument against it?

Indeed it is a problem that I don’t understand what’s happening. But even more honestly, I’ll have to admit I can’t ever claim to understand what’s happening in a multi-threaded asynchronous C environment; even more so when the reproducer is installing Windows on RBD.


Hanna
