On Thu, Jul 5, 2018 at 10:26 PM, 858585 jemmy <jemmy858...@gmail.com> wrote:
> On Thu, Jun 28, 2018 at 2:59 AM, Dr. David Alan Gilbert
> <dgilb...@redhat.com> wrote:
>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>>> ibv_dereg_mr waits for a long time for a big memory size virtual server.
>>>
>>> The test result is:
>>>    10GB   326ms
>>>    20GB   699ms
>>>    30GB  1021ms
>>>    40GB  1387ms
>>>    50GB  1712ms
>>>    60GB  2034ms
>>>    70GB  2457ms
>>>    80GB  2807ms
>>>    90GB  3107ms
>>>   100GB  3474ms
>>>   110GB  3735ms
>>>   120GB  4064ms
>>>   130GB  4567ms
>>>   140GB  4886ms
>>>
>>> This causes the guest OS to hang for a while when migration finishes,
>>> so create a dedicated thread to release the rdma resources.
>>>
>>> Signed-off-by: Lidong Chen <lidongc...@tencent.com>
>>> ---
>>>  migration/rdma.c | 43 +++++++++++++++++++++++++++----------------
>>>  1 file changed, 27 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>> index dfa4f77..f12e8d5 100644
>>> --- a/migration/rdma.c
>>> +++ b/migration/rdma.c
>>> @@ -2979,35 +2979,46 @@ static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>>      }
>>>  }
>>>
>>> -static int qio_channel_rdma_close(QIOChannel *ioc,
>>> -                                  Error **errp)
>>> +static void *qio_channel_rdma_close_thread(void *arg)
>>>  {
>>> -    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>> -    RDMAContext *rdmain, *rdmaout;
>>> -    trace_qemu_rdma_close();
>>> +    RDMAContext **rdma = arg;
>>> +    RDMAContext *rdmain = rdma[0];
>>> +    RDMAContext *rdmaout = rdma[1];
>>>
>>> -    rdmain = rioc->rdmain;
>>> -    if (rdmain) {
>>> -        atomic_rcu_set(&rioc->rdmain, NULL);
>>> -    }
>>> -
>>> -    rdmaout = rioc->rdmaout;
>>> -    if (rdmaout) {
>>> -        atomic_rcu_set(&rioc->rdmaout, NULL);
>>> -    }
>>> +    rcu_register_thread();
>>>
>>>      synchronize_rcu();
>>
>> * see below
>>
>>> -
>>>      if (rdmain) {
>>>          qemu_rdma_cleanup(rdmain);
>>>      }
>>> -
>>>      if (rdmaout) {
>>>          qemu_rdma_cleanup(rdmaout);
>>>      }
>>>
>>>      g_free(rdmain);
>>>      g_free(rdmaout);
>>> +    g_free(rdma);
>>> +
>>> +    rcu_unregister_thread();
>>> +    return NULL;
>>> +}
>>> +
>>> +static int qio_channel_rdma_close(QIOChannel *ioc,
>>> +                                  Error **errp)
>>> +{
>>> +    QemuThread t;
>>> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>> +    RDMAContext **rdma = g_new0(RDMAContext*, 2);
>>> +
>>> +    trace_qemu_rdma_close();
>>> +    if (rioc->rdmain || rioc->rdmaout) {
>>> +        rdma[0] = rioc->rdmain;
>>> +        rdma[1] = rioc->rdmaout;
>>> +        qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
>>> +                           rdma, QEMU_THREAD_DETACHED);
>>> +        atomic_rcu_set(&rioc->rdmain, NULL);
>>> +        atomic_rcu_set(&rioc->rdmaout, NULL);
>>
>> I'm not sure this pair is ordered with the synchronise_rcu above;
>> doesn't that mean, on a bad day, that you could get:
>>
>>   main-thread               rdma_cleanup              another-thread
>>   qemu_thread_create
>>                             synchronise_rcu
>>                                                       reads rioc->rdmain
>>                                                       starts doing something
>>                                                           with rdmain
>>   atomic_rcu_set
>>                             rdma_cleanup
>>
>> so the another-thread is using it during the cleanup?
>> Would just moving the atomic_rcu_sets before the qemu_thread_create
>> fix that?

> yes, I will fix it.

>>
>> However, I've got other worries as well:
>>   a) qemu_rdma_cleanup does:
>>        migrate_get_current()->state == MIGRATION_STATUS_CANCELLING
>>
>>      which worries me a little if someone immediately tries to restart
>>      the migration.
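For reference, the reordering agreed above would look something like this: a
minimal sketch of qio_channel_rdma_close() with both atomic_rcu_set() calls
moved before qemu_thread_create(), so the detached thread's synchronize_rcu()
cannot run before the contexts are hidden from new readers (illustrative
only, not the final patch; the else branch is added here to avoid leaking the
array when there is nothing to clean up):

    static int qio_channel_rdma_close(QIOChannel *ioc,
                                      Error **errp)
    {
        QemuThread t;
        QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
        RDMAContext **rdma = g_new0(RDMAContext*, 2);

        trace_qemu_rdma_close();
        if (rioc->rdmain || rioc->rdmaout) {
            rdma[0] = rioc->rdmain;
            rdma[1] = rioc->rdmaout;
            /* Publish the NULLs first: any reader still holding the old
             * pointers is inside an RCU read-side critical section, which
             * the cleanup thread's synchronize_rcu() will wait for. */
            atomic_rcu_set(&rioc->rdmain, NULL);
            atomic_rcu_set(&rioc->rdmaout, NULL);
            qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
                               rdma, QEMU_THREAD_DETACHED);
        } else {
            /* nothing to hand to the thread; don't leak the array */
            g_free(rdma);
        }

        return 0;
    }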
Because the current qemu version doesn't wait for the RDMA_CM_EVENT_DISCONNECTED
event after rdma_disconnect, I think it's not necessary to send
RDMA_CONTROL_ERROR. Compared to sending RDMA_CONTROL_ERROR, I think using a cm
event to notify the peer qemu is better: maybe the rdma is already in the error
state, and RDMA_CONTROL_ERROR cannot be sent successfully.

For the peer qemu, the qemu_rdma_wait_comp_channel function should not only
poll rdma->comp_channel->fd, it should also poll rdma->channel->fd.

[GS] I agree that we should poll on CM events in addition to CQ events.

For the source qemu, it's fixed by this patch:
    migration: poll the cm event while wait RDMA work request completion
and for the destination qemu, a new patch is needed to fix it.

So I prefer to remove MIGRATION_STATUS_CANCELLING in the qemu_rdma_cleanup
function.

[GS] The intention of this patch is to move the costly ibv_dereg_mr to a
separate thread, so we do not actually need to also move the RDMA_CONTROL_ERROR
and rdma_disconnect, which come beforehand in qemu_rdma_cleanup, to a thread.
I suggest that we move them out of qemu_rdma_cleanup to a new function
"qemu_rdma_disconnect" and call it on each RDMAContext before the
qemu_thread_create.

>>
>> b) I don't understand what happens if someone does try and restart
>>    the migration after that, but in the ~5s it takes the ibv cleanup
>>    to happen.

I prefer to add a new variable in current_migration: if the rdma cleanup
thread has not exited, a new migration should not start.

[GS] I support this suggestion.

> yes, I will try to fix it.
>
>>
>> Dave
>>
>>> +    }
>>>
>>>      return 0;
>>>  }
>>> --
>>> 1.8.3.1
>>>
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
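To make the qemu_rdma_disconnect suggestion concrete, here is a rough sketch.
It assumes the existing RDMAContext fields cm_id, connected and error_state
and the existing qemu_rdma_post_send_control() helper; the exact guard
conditions are illustrative, with the MIGRATION_STATUS_CANCELLING check
dropped as proposed above:

    static void qemu_rdma_disconnect(RDMAContext *rdma)
    {
        if (!rdma || !rdma->cm_id || !rdma->connected) {
            return;
        }
        if (rdma->error_state) {
            RDMAControlHeader head = { .len = 0,
                                       .type = RDMA_CONTROL_ERROR,
                                       .repeat = 1 };
            /* Best effort: this may fail if the connection is already
             * broken, but rdma_disconnect() below still raises a CM
             * event on the peer. */
            qemu_rdma_post_send_control(rdma, NULL, &head);
        }
        rdma_disconnect(rdma->cm_id);
        rdma->connected = false;
    }

qio_channel_rdma_close() would then call this on rioc->rdmain and
rioc->rdmaout before qemu_thread_create(), leaving only the slow
ibv_dereg_mr work to the detached cleanup thread.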
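Similarly, a sketch of the guard variable for worry (b); the field name
rdma_cleanup_running and the exact check sites are hypothetical:

    /* migration.h: new field in MigrationState */
        bool rdma_cleanup_running;

    /* qio_channel_rdma_close(): set it before spawning the thread */
        atomic_set(&migrate_get_current()->rdma_cleanup_running, true);

    /* qio_channel_rdma_close_thread(): clear it as the last action */
        atomic_set(&migrate_get_current()->rdma_cleanup_running, false);

    /* qmp_migrate(): refuse to start a new migration while the previous
     * cleanup is still in flight */
        if (atomic_read(&migrate_get_current()->rdma_cleanup_running)) {
            error_setg(errp, "RDMA cleanup from the previous migration "
                       "is still in progress");
            return;
        }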