On Thu, May 17, 2018 at 3:31 PM, Aviad Yehezkel <avia...@dev.mellanox.co.il> wrote: > > > On 5/17/2018 5:42 AM, 858585 jemmy wrote: >> >> On Wed, May 16, 2018 at 11:11 PM, Aviad Yehezkel >> <avia...@dev.mellanox.co.il> wrote: >>> >>> Hi Lidong and David, >>> Sorry for the late response, I had to ramp up on the migration code and build >>> a >>> setup on my side. >>> >>> PSB my comments for this patch below. >>> For the RDMA post-copy patches I will comment next week after testing on >>> the Mellanox side too. >>> >>> Thanks! >>> >>> On 5/16/2018 5:21 PM, Aviad Yehezkel wrote: >>>> >>>> >>>> -----Original Message----- >>>> From: Dr. David Alan Gilbert [mailto:dgilb...@redhat.com] >>>> Sent: Wednesday, May 16, 2018 4:13 PM >>>> To: 858585 jemmy <jemmy858...@gmail.com> >>>> Cc: Aviad Yehezkel <avia...@mellanox.com>; Juan Quintela >>>> <quint...@redhat.com>; qemu-devel <qemu-devel@nongnu.org>; Gal Shachaf >>>> <gal...@mellanox.com>; Adi Dotan <ad...@mellanox.com>; Lidong Chen >>>> <lidongc...@tencent.com> >>>> Subject: Re: [PATCH 2/2] migration: not wait RDMA_CM_EVENT_DISCONNECTED >>>> event after rdma_disconnect >>>> >>>> * 858585 jemmy (jemmy858...@gmail.com) wrote: >>>> >>>> <snip> >>>> >>>>>>>>>> I wonder why dereg_mr takes so long - I could understand if >>>>>>>>>> reg_mr took a long time, but why for dereg? That sounds like the >>>>>>>>>> easy side. >>>>>>>>> >>>>>>>>> I used perf to collect the information when ibv_dereg_mr is invoked.
>>>>>>>>> >>>>>>>>> - 9.95% client2 [kernel.kallsyms] [k] put_compound_page >>>>>>>>> - put_compound_page >>>>>>>>> - 98.45% put_page >>>>>>>>> __ib_umem_release >>>>>>>>> ib_umem_release >>>>>>>>> dereg_mr >>>>>>>>> mlx5_ib_dereg_mr >>>>>>>>> ib_dereg_mr >>>>>>>>> uverbs_free_mr >>>>>>>>> remove_commit_idr_uobject >>>>>>>>> _rdma_remove_commit_uobject >>>>>>>>> rdma_remove_commit_uobject >>>>>>>>> ib_uverbs_dereg_mr >>>>>>>>> ib_uverbs_write >>>>>>>>> vfs_write >>>>>>>>> sys_write >>>>>>>>> system_call_fastpath >>>>>>>>> __GI___libc_write >>>>>>>>> 0 >>>>>>>>> + 1.55% __ib_umem_release >>>>>>>>> + 8.31% client2 [kernel.kallsyms] [k] >>>>>>>>> compound_unlock_irqrestore >>>>>>>>> + 7.01% client2 [kernel.kallsyms] [k] page_waitqueue >>>>>>>>> + 7.00% client2 [kernel.kallsyms] [k] set_page_dirty >>>>>>>>> + 6.61% client2 [kernel.kallsyms] [k] unlock_page >>>>>>>>> + 6.33% client2 [kernel.kallsyms] [k] put_page_testzero >>>>>>>>> + 5.68% client2 [kernel.kallsyms] [k] set_page_dirty_lock >>>>>>>>> + 4.30% client2 [kernel.kallsyms] [k] __wake_up_bit >>>>>>>>> + 4.04% client2 [kernel.kallsyms] [k] free_pages_prepare >>>>>>>>> + 3.65% client2 [kernel.kallsyms] [k] release_pages >>>>>>>>> + 3.62% client2 [kernel.kallsyms] [k] arch_local_irq_save >>>>>>>>> + 3.35% client2 [kernel.kallsyms] [k] page_mapping >>>>>>>>> + 3.13% client2 [kernel.kallsyms] [k] >>>>>>>>> get_pageblock_flags_group >>>>>>>>> + 3.09% client2 [kernel.kallsyms] [k] put_page >>>>>>>>> >>>>>>>>> the reason is that __ib_umem_release will loop many times for each page.
>>>>>>>>> >>>>>>>>> static void __ib_umem_release(struct ib_device *dev, struct >>>>>>>>> ib_umem *umem, int dirty) { >>>>>>>>> struct scatterlist *sg; >>>>>>>>> struct page *page; >>>>>>>>> int i; >>>>>>>>> >>>>>>>>> if (umem->nmap > 0) >>>>>>>>> ib_dma_unmap_sg(dev, umem->sg_head.sgl, >>>>>>>>> umem->npages, >>>>>>>>> DMA_BIDIRECTIONAL); >>>>>>>>> >>>>>>>>> for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) { >>>>>>>>> << >>>>>>>>> loop a lot of times for each page here >>>>>>>> >>>>>>>> Why 'lot of times for each page'? I don't know this code at all, >>>>>>>> but I'd have expected once per page? >>>>>>> >>>>>>> sorry, once per page, but a lot of pages for a big virtual >>>>>>> machine. >>>>>> >>>>>> Ah OK; so yes it seems best if you can find a way to do the release >>>>>> in the migration thread then; still, maybe this is something some of >>>>>> the kernel people could look at speeding up? >>>>> >>>>> The kernel code seems not complex, but I have no idea how to speed >>>>> it up. >>>> >>>> Me neither; but I'll ask around. >>> >>> I will ask internally and get back on this one. >>>> >>>> >>>>>>>> With your other kernel fix, does the problem of the missing >>>>>>>> RDMA_CM_EVENT_DISCONNECTED events go away? >>>>>>> >>>>>>> Yes, after the kernel and qemu fixes, this issue never happens again. >>>>>> >>>>>> I'm confused; which qemu fix? My question was whether the kernel fix >>>>>> by itself fixed the problem of the missing event. >>>>> >>>>> this qemu fix: >>>>> migration: update index field when delete or qsort RDMALocalBlock >>>> >>>> OK good; so then we shouldn't need this 2/2 patch. >>> >>> It is good that the qemu fix solved this issue, but my colleagues >>> and I think we need the 2/2 patch anyway. >>> According to the IB spec, once the active side sends a DREQ message it should wait >>> for >>> a DREP message, and only once that arrives should it trigger a DISCONNECT >>> event. >>> The DREP message can be dropped due to network issues.
>>> For that case the spec defines a DREP_timeout state in the CM state >>> machine; >>> if the DREP is dropped we should get a timeout and a TIMEWAIT_EXIT event >>> will be triggered. >>> Unfortunately the current kernel CM implementation doesn't include the >>> DREP_timeout state, and in the above scenario we will not get DISCONNECT or >>> TIMEWAIT_EXIT events. >>> (A similar scenario also exists for the passive side.) >>> We think it is best to apply this patch until we enhance the kernel >>> code. >>> >> Hi Aviad: >> How long is the DREP_timeout? >> Does RDMA have a tool like tcpdump that I can use to confirm >> whether >> the DREP message was received? >> >> Thanks. > > Are you running IB or RoCE?
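The DREQ/DREP handshake described above can be sketched as a toy model. This is illustrative only, not real rdma-cm code: the function name and boolean flags are inventions for the sketch, and the spec leaves the actual timeout value implementation-defined.

```python
# Toy model of the IB CM disconnect handshake discussed in this thread.
# Illustrative only -- not real rdma-cm code.

def active_side_disconnect(drep_arrives, kernel_has_drep_timeout):
    """Event the active side observes after sending DREQ via rdma_disconnect()."""
    if drep_arrives:
        # Normal case: DREP received -> DISCONNECTED event is delivered.
        return "RDMA_CM_EVENT_DISCONNECTED"
    if kernel_has_drep_timeout:
        # Per the spec's DREP_timeout state: the timer expires and a
        # TIMEWAIT_EXIT event is triggered instead.
        return "RDMA_CM_EVENT_TIMEWAIT_EXIT"
    # Current kernel CM as described above: no DREP_timeout state, so a
    # dropped DREP produces no event at all and a caller blocking in
    # rdma_get_cm_event() waits forever.
    return None
```

In the last case a caller that waits for the event after rdma_disconnect, as the pre-patch qemu_rdma_cleanup does, would hang; that is the motivation given for keeping the 2/2 patch until the kernel CM is enhanced.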
RoCE v2. >>>>> >>>>> this issue is also caused by some ram block not being released, but I have not >>>>> found the root cause. >>>> >>>> Hmm, we should try and track that down. >>>> >>>>>>> Do you think we should remove rdma_get_cm_event after >>>>>>> rdma_disconnect? >>>>>> >>>>>> I don't think so; if 'rdma_disconnect' is supposed to generate the >>>>>> event I think we're supposed to wait for it to know that the >>>>>> disconnect is really complete. >>>>> >>>>> After moving qemu_fclose to the migration thread, it will not block the main >>>>> thread while waiting for the disconnect event. >>>> >>>> I'm not sure about moving the fclose to the migration thread; it worries >>>> me because of the interaction with cancel and other failures. >>>> >>>> Dave >>>> >>>>>> Dave >>>>>> >>>>>>>> Dave >>>>>>>> >>>>>>>>> page = sg_page(sg); >>>>>>>>> if (umem->writable && dirty) >>>>>>>>> set_page_dirty_lock(page); >>>>>>>>> put_page(page); >>>>>>>>> } >>>>>>>>> >>>>>>>>> sg_free_table(&umem->sg_head); >>>>>>>>> return; >>>>>>>>> } >>>>>>>>> >>>>>>>>>> Dave >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>>> Dave >>>>>>>>>>>> >>>>>>>>>>>>>>> Anyway, it should not invoke rdma_get_cm_event in the main >>>>>>>>>>>>>>> thread, and the event channel is also destroyed in >>>>>>>>>>>>>>> qemu_rdma_cleanup.
>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Signed-off-by: Lidong Chen <lidongc...@tencent.com> >>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>> migration/rdma.c | 12 ++---------- >>>>>>>>>>>>>>> migration/trace-events | 1 - >>>>>>>>>>>>>>> 2 files changed, 2 insertions(+), 11 deletions(-) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> diff --git a/migration/rdma.c b/migration/rdma.c index >>>>>>>>>>>>>>> 0dd4033..92e4d30 100644 >>>>>>>>>>>>>>> --- a/migration/rdma.c >>>>>>>>>>>>>>> +++ b/migration/rdma.c >>>>>>>>>>>>>>> @@ -2275,8 +2275,7 @@ static int >>>>>>>>>>>>>>> qemu_rdma_write(QEMUFile *f, RDMAContext *rdma, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> static void qemu_rdma_cleanup(RDMAContext *rdma) { >>>>>>>>>>>>>>> - struct rdma_cm_event *cm_event; >>>>>>>>>>>>>>> - int ret, idx; >>>>>>>>>>>>>>> + int idx; >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> if (rdma->cm_id && rdma->connected) { >>>>>>>>>>>>>>> if ((rdma->error_state || >>>>>>>>>>>>>>> @@ -2290,14 +2289,7 @@ static void >>>>>>>>>>>>>>> qemu_rdma_cleanup(RDMAContext *rdma) >>>>>>>>>>>>>>> qemu_rdma_post_send_control(rdma, NULL, >>>>>>>>>>>>>>> &head); >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> - ret = rdma_disconnect(rdma->cm_id); >>>>>>>>>>>>>>> - if (!ret) { >>>>>>>>>>>>>>> - >>>>>>>>>>>>>>> trace_qemu_rdma_cleanup_waiting_for_disconnect(); >>>>>>>>>>>>>>> - ret = rdma_get_cm_event(rdma->channel, >>>>>>>>>>>>>>> &cm_event); >>>>>>>>>>>>>>> - if (!ret) { >>>>>>>>>>>>>>> - rdma_ack_cm_event(cm_event); >>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>> - } >>>>>>>>>>>>>>> + rdma_disconnect(rdma->cm_id); >>>>>>>>>>>>>> >>>>>>>>>>>>>> I'm worried whether this change could break stuff: >>>>>>>>>>>>>> The docs say for rdma_disconnect that it flushes any posted >>>>>>>>>>>>>> work >>>>>>>>>>>>>> requests to the completion queue; so unless we wait for the >>>>>>>>>>>>>> event >>>>>>>>>>>>>> do we know the stuff has been flushed? In the normal >>>>>>>>>>>>>> non-cancel case >>>>>>>>>>>>>> I'm worried that means we could lose something. 
>>>>>>>>>>>>>> (But I don't know the rdma/infiniband specs well enough to >>>>>>>>>>>>>> know >>>>>>>>>>>>>> if it's >>>>>>>>>>>>>> really a problem). >>>>>>>>>>>>> >>>>>>>>>>>>> The qemu_fclose function invokes qemu_fflush(f) before invoking >>>>>>>>>>>>> f->ops->close, >>>>>>>>>>>>> so I think it's safe for the normal migration case. >>>>>>>>>>>>> >>>>>>>>>>>>> For the root cause of why the RDMA_CM_EVENT_DISCONNECTED >>>>>>>>>>>>> event is not received >>>>>>>>>>>>> after rdma_disconnect, >>>>>>>>>>>>> I looped in Aviad Yehezkel <avia...@mellanox.com>; Aviad will help >>>>>>>>>>>>> us >>>>>>>>>>>>> find the root cause. >>> >>> >>> Regarding the previous discussion on this thread: >>> rdma_disconnect can be invoked even without invoking ib_dereg_mr first. >>> >>> Common teardown flow: >>> >>> rdma_disconnect >>> <-- here resources like QPs are still alive and can DMA to RAM >>> rdma_destroy_cm_id >>> <-- all resources are destroyed but the MR can still be active >>> ib_dereg_mr >>> >>> >>> >>>>>>>>>>>>>> Dave >>>>>>>>>>>>>> >>>>>>>>>>>>>>> trace_qemu_rdma_cleanup_disconnect(); >>>>>>>>>>>>>>> rdma->connected = false; >>>>>>>>>>>>>>> } >>>>>>>>>>>>>>> diff --git a/migration/trace-events b/migration/trace-events >>>>>>>>>>>>>>> index d6be74b..64573ff 100644 >>>>>>>>>>>>>>> --- a/migration/trace-events >>>>>>>>>>>>>>> +++ b/migration/trace-events >>>>>>>>>>>>>>> @@ -125,7 +125,6 @@ qemu_rdma_accept_pin_state(bool pin) "%d" >>>>>>>>>>>>>>> qemu_rdma_accept_pin_verbsc(void *verbs) "Verbs context >>>>>>>>>>>>>>> after >>>>>>>>>>>>>>> listen: %p" >>>>>>>>>>>>>>> qemu_rdma_block_for_wrid_miss(const char *wcompstr, int >>>>>>>>>>>>>>> wcomp, const char *gcompstr, uint64_t req) "A Wanted wrid %s >>>>>>>>>>>>>>> (%d) but got %s >>>>>>>>>>>>>>> (%" PRIu64 ")" >>>>>>>>>>>>>>> qemu_rdma_cleanup_disconnect(void) "" >>>>>>>>>>>>>>> -qemu_rdma_cleanup_waiting_for_disconnect(void) "" >>>>>>>>>>>>>>> qemu_rdma_close(void) "" >>>>>>>>>>>>>>> qemu_rdma_connect_pin_all_requested(void) "" >>>>>>>>>>>>>>>
qemu_rdma_connect_pin_all_outcome(bool pin) "%d" >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> 1.8.3.1 >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
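For a sense of the scale behind the __ib_umem_release profile earlier in the thread: the loop calls put_page() once per pinned page, and a big guest pins a lot of pages. A quick back-of-envelope sketch, assuming base 4 KiB pages and no huge pages (the page size and guest sizes here are assumptions for illustration):

```python
# Rough count of put_page() iterations __ib_umem_release performs for an
# MR covering all of guest RAM (assumes 4 KiB pages, no huge pages).
PAGE_SIZE = 4096  # bytes

def put_page_calls(ram_bytes):
    # One loop iteration (and one put_page) per pinned page.
    return ram_bytes // PAGE_SIZE

for gib in (8, 64, 512):
    calls = put_page_calls(gib * 1024**3)
    print(f"{gib:>3} GiB guest RAM -> {calls:,} put_page() calls")
```

A 64 GiB guest already means roughly 17 million iterations inside ibv_dereg_mr, which is consistent with the observation above that deregistration is slow enough to matter on the migration path and is better moved off the main thread.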