On Thu, Jul 5, 2018 at 10:26 PM, 858585 jemmy <jemmy858...@gmail.com> wrote:
> On Thu, Jun 28, 2018 at 2:59 AM, Dr. David Alan Gilbert
> <dgilb...@redhat.com> wrote:
>> * Lidong Chen (jemmy858...@gmail.com) wrote:
>>> ibv_dereg_mr waits for a long time for a big memory size virtual server.
>>>
>>> The test result is:
>>>    10GB   326ms
>>>    20GB   699ms
>>>    30GB  1021ms
>>>    40GB  1387ms
>>>    50GB  1712ms
>>>    60GB  2034ms
>>>    70GB  2457ms
>>>    80GB  2807ms
>>>    90GB  3107ms
>>>   100GB  3474ms
>>>   110GB  3735ms
>>>   120GB  4064ms
>>>   130GB  4567ms
>>>   140GB  4886ms
>>>
>>> This causes the guest OS to hang for a while when migration finishes,
>>> so create a dedicated thread to release the rdma resources.
>>>
>>> Signed-off-by: Lidong Chen <lidongc...@tencent.com>
>>> ---
>>>  migration/rdma.c | 43 +++++++++++++++++++++++++++----------------
>>>  1 file changed, 27 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/migration/rdma.c b/migration/rdma.c
>>> index dfa4f77..f12e8d5 100644
>>> --- a/migration/rdma.c
>>> +++ b/migration/rdma.c
>>> @@ -2979,35 +2979,46 @@ static void qio_channel_rdma_set_aio_fd_handler(QIOChannel *ioc,
>>>      }
>>>  }
>>>
>>> -static int qio_channel_rdma_close(QIOChannel *ioc,
>>> -                                  Error **errp)
>>> +static void *qio_channel_rdma_close_thread(void *arg)
>>>  {
>>> -    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>> -    RDMAContext *rdmain, *rdmaout;
>>> -    trace_qemu_rdma_close();
>>> +    RDMAContext **rdma = arg;
>>> +    RDMAContext *rdmain = rdma[0];
>>> +    RDMAContext *rdmaout = rdma[1];
>>>
>>> -    rdmain = rioc->rdmain;
>>> -    if (rdmain) {
>>> -        atomic_rcu_set(&rioc->rdmain, NULL);
>>> -    }
>>> -
>>> -    rdmaout = rioc->rdmaout;
>>> -    if (rdmaout) {
>>> -        atomic_rcu_set(&rioc->rdmaout, NULL);
>>> -    }
>>> +    rcu_register_thread();
>>>
>>>      synchronize_rcu();
>>
>> * see below
>>
>>> -
>>>      if (rdmain) {
>>>          qemu_rdma_cleanup(rdmain);
>>>      }
>>> -
>>>      if (rdmaout) {
>>>          qemu_rdma_cleanup(rdmaout);
>>>      }
>>>
>>>      g_free(rdmain);
>>>      g_free(rdmaout);
>>> +    g_free(rdma);
>>> +
>>> +    rcu_unregister_thread();
>>> +    return NULL;
>>> +}
>>> +
>>> +static int qio_channel_rdma_close(QIOChannel *ioc,
>>> +                                  Error **errp)
>>> +{
>>> +    QemuThread t;
>>> +    QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
>>> +    RDMAContext **rdma = g_new0(RDMAContext*, 2);
>>> +
>>> +    trace_qemu_rdma_close();
>>> +    if (rioc->rdmain || rioc->rdmaout) {
>>> +        rdma[0] = rioc->rdmain;
>>> +        rdma[1] = rioc->rdmaout;
>>> +        qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
>>> +                           rdma, QEMU_THREAD_DETACHED);
>>> +        atomic_rcu_set(&rioc->rdmain, NULL);
>>> +        atomic_rcu_set(&rioc->rdmaout, NULL);
>>
>> I'm not sure this pair is ordered with the synchronise_rcu above;
>> doesn't that mean, on a bad day, that you could get:
>>
>>   main-thread               rdma_cleanup              another-thread
>>   qemu_thread_create
>>                             synchronise_rcu
>>                                                       reads rioc->rdmain
>>                                                       starts doing something
>>                                                           with rdmain
>>   atomic_rcu_set
>>                             rdma_cleanup
>>
>> so the another-thread is using it during the cleanup?
>> Would just moving the atomic_rcu_sets before the qemu_thread_create
>> fix that?

> yes, I will fix it.

>>
>> However, I've got other worries as well:
>>   a) qemu_rdma_cleanup does:
>>        migrate_get_current()->state == MIGRATION_STATUS_CANCELLING
>>
>>      which worries me a little if someone immediately tries to restart
>>      the migration.
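For reference, the reordering agreed above would look something like this: a
minimal sketch of qio_channel_rdma_close() with both atomic_rcu_set() calls
moved before qemu_thread_create(), so the detached thread's synchronize_rcu()
cannot run before the contexts are hidden from new readers (illustrative
only, not the final patch; the else branch is added here to avoid leaking the
array when there is nothing to clean up):

    static int qio_channel_rdma_close(QIOChannel *ioc,
                                      Error **errp)
    {
        QemuThread t;
        QIOChannelRDMA *rioc = QIO_CHANNEL_RDMA(ioc);
        RDMAContext **rdma = g_new0(RDMAContext*, 2);

        trace_qemu_rdma_close();
        if (rioc->rdmain || rioc->rdmaout) {
            rdma[0] = rioc->rdmain;
            rdma[1] = rioc->rdmaout;
            /* Publish the NULLs first: any reader still holding the old
             * pointers is inside an RCU read-side critical section, which
             * the cleanup thread's synchronize_rcu() will wait for. */
            atomic_rcu_set(&rioc->rdmain, NULL);
            atomic_rcu_set(&rioc->rdmaout, NULL);
            qemu_thread_create(&t, "rdma cleanup", qio_channel_rdma_close_thread,
                               rdma, QEMU_THREAD_DETACHED);
        } else {
            /* nothing to hand to the thread; don't leak the array */
            g_free(rdma);
        }

        return 0;
    }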
Because the current qemu version doesn't wait for the RDMA_CM_EVENT_DISCONNECTED
event after rdma_disconnect, I think it's not necessary to send
RDMA_CONTROL_ERROR. Compared to sending RDMA_CONTROL_ERROR, I think using a cm
event to notify the peer qemu is better: maybe the rdma is already in the error
state, and RDMA_CONTROL_ERROR cannot be sent successfully.

For the peer qemu, the qemu_rdma_wait_comp_channel function should not only
poll rdma->comp_channel->fd, it should also poll rdma->channel->fd.

[GS] I agree that we should poll on CM events in addition to CQ events.

For the source qemu, it's fixed by this patch:
    migration: poll the cm event while wait RDMA work request completion
and for the destination qemu, a new patch is needed to fix it.

So I prefer to remove MIGRATION_STATUS_CANCELLING in the qemu_rdma_cleanup
function.

[GS] The intention of this patch is to move the costly ibv_dereg_mr to a
separate thread, so we do not actually need to also move the RDMA_CONTROL_ERROR
and rdma_disconnect, which come beforehand in qemu_rdma_cleanup, to a thread.
I suggest that we move them out of qemu_rdma_cleanup to a new function
"qemu_rdma_disconnect" and call it on each RDMAContext before the
qemu_thread_create.

>>
>> b) I don't understand what happens if someone does try and restart
>>    the migration after that, but in the ~5s it takes the ibv cleanup
>>    to happen.

I prefer to add a new variable in current_migration: if the rdma cleanup
thread has not exited, a new migration should not start.

[GS] I support this suggestion.

> yes, I will try to fix it.
>
>>
>> Dave
>>
>>> +    }
>>>
>>>      return 0;
>>>  }
>>> --
>>> 1.8.3.1
>>>
>> --
>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
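To make the qemu_rdma_disconnect suggestion concrete, here is a rough sketch.
It assumes the existing RDMAContext fields cm_id, connected and error_state
and the existing qemu_rdma_post_send_control() helper; the exact guard
conditions are illustrative, with the MIGRATION_STATUS_CANCELLING check
dropped as proposed above:

    static void qemu_rdma_disconnect(RDMAContext *rdma)
    {
        if (!rdma || !rdma->cm_id || !rdma->connected) {
            return;
        }
        if (rdma->error_state) {
            RDMAControlHeader head = { .len = 0,
                                       .type = RDMA_CONTROL_ERROR,
                                       .repeat = 1 };
            /* Best effort: this may fail if the connection is already
             * broken, but rdma_disconnect() below still raises a CM
             * event on the peer. */
            qemu_rdma_post_send_control(rdma, NULL, &head);
        }
        rdma_disconnect(rdma->cm_id);
        rdma->connected = false;
    }

qio_channel_rdma_close() would then call this on rioc->rdmain and
rioc->rdmaout before qemu_thread_create(), leaving only the slow
ibv_dereg_mr work to the detached cleanup thread.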
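Similarly, a sketch of the guard variable for worry (b); the field name
rdma_cleanup_running and the exact check sites are hypothetical:

    /* migration.h: new field in MigrationState */
        bool rdma_cleanup_running;

    /* qio_channel_rdma_close(): set it before spawning the thread */
        atomic_set(&migrate_get_current()->rdma_cleanup_running, true);

    /* qio_channel_rdma_close_thread(): clear it as the last action */
        atomic_set(&migrate_get_current()->rdma_cleanup_running, false);

    /* qmp_migrate(): refuse to start a new migration while the previous
     * cleanup is still in flight */
        if (atomic_read(&migrate_get_current()->rdma_cleanup_running)) {
            error_setg(errp, "RDMA cleanup from the previous migration "
                       "is still in progress");
            return;
        }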