On Thu, May 31, 2018 at 1:33 AM, Dr. David Alan Gilbert <dgilb...@redhat.com> wrote: > * Lidong Chen (jemmy858...@gmail.com) wrote: >> If the peer qemu is crashed, the qemu_rdma_wait_comp_channel function >> maybe loop forever. so we should also poll the cm event fd, and when >> receive any cm event, we consider some error happened. >> >> Signed-off-by: Lidong Chen <lidongc...@tencent.com> > > I don't understand enough about the way the infiniband fd's work to > fully review this; so I'd appreciate if some one who does could > comment/add their review.
Hi Avaid: we need your help. I also not find any document about the cq channel event fd and cm channel event f. Should we set the events to G_IO_IN | G_IO_HUP | G_IO_ERR? or G_IO_IN is enough? pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR; pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR; Thanks. > >> --- >> migration/rdma.c | 35 ++++++++++++++++++++++++----------- >> 1 file changed, 24 insertions(+), 11 deletions(-) >> >> diff --git a/migration/rdma.c b/migration/rdma.c >> index 1b9e261..d611a06 100644 >> --- a/migration/rdma.c >> +++ b/migration/rdma.c >> @@ -1489,6 +1489,9 @@ static uint64_t qemu_rdma_poll(RDMAContext *rdma, >> uint64_t *wr_id_out, >> */ >> static int qemu_rdma_wait_comp_channel(RDMAContext *rdma) >> { >> + struct rdma_cm_event *cm_event; >> + int ret = -1; >> + >> /* >> * Coroutine doesn't start until migration_fd_process_incoming() >> * so don't yield unless we know we're running inside of a coroutine. >> @@ -1504,25 +1507,35 @@ static int qemu_rdma_wait_comp_channel(RDMAContext >> *rdma) >> * But we need to be able to handle 'cancel' or an error >> * without hanging forever. >> */ >> - while (!rdma->error_state && !rdma->received_error) { >> - GPollFD pfds[1]; >> + while (!rdma->error_state && !rdma->received_error) { >> + GPollFD pfds[2]; >> pfds[0].fd = rdma->comp_channel->fd; >> pfds[0].events = G_IO_IN | G_IO_HUP | G_IO_ERR; >> + pfds[0].revents = 0; >> + >> + pfds[1].fd = rdma->channel->fd; >> + pfds[1].events = G_IO_IN | G_IO_HUP | G_IO_ERR; >> + pfds[1].revents = 0; >> + >> /* 0.1s timeout, should be fine for a 'cancel' */ >> - switch (qemu_poll_ns(pfds, 1, 100 * 1000 * 1000)) { >> - case 1: /* fd active */ >> - return 0; >> + qemu_poll_ns(pfds, 2, 100 * 1000 * 1000); > > Shouldn't we still check the return value of this; if it's negative > something has gone wrong. I will fix this. Thanks. > > Dave > >> - case 0: /* Timeout, go around again */ >> - break; >> + if (pfds[1].revents) { >> + ret = rdma_get_cm_event(rdma->channel, &cm_event); >> + if (!ret) { >> + rdma_ack_cm_event(cm_event); >> + } >> + error_report("receive cm event while wait comp channel," >> + "cm event is %d", cm_event->event); >> >> - default: /* Error of some type - >> - * I don't trust errno from qemu_poll_ns >> - */ >> - error_report("%s: poll failed", __func__); >> + /* consider any rdma communication event as an error */ >> return -EPIPE; >> } >> >> + if (pfds[0].revents) { >> + return 0; >> + } >> + >> if (migrate_get_current()->state == >> MIGRATION_STATUS_CANCELLING) { >> /* Bail out and let the cancellation happen */ >> return -EPIPE; >> -- >> 1.8.3.1 >> > -- > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK