On Tue, Aug 11, 2020 at 03:41:30PM +0100, Stefan Hajnoczi wrote: > On Fri, Jul 31, 2020 at 02:20:24PM -0400, Jagannathan Raman wrote: > > @@ -343,3 +349,49 @@ static void probe_pci_info(PCIDevice *dev, Error > > **errp) > > } > > } > > } > > + > > +static void hb_msg(PCIProxyDev *dev) > > +{ > > + DeviceState *ds = DEVICE(dev); > > + Error *local_err = NULL; > > + MPQemuMsg msg = { 0 }; > > + > > + msg.cmd = PROXY_PING; > > + msg.bytestream = 0; > > + msg.size = 0; > > + > > + (void)mpqemu_msg_send_and_await_reply(&msg, dev->ioc, &local_err); > > + if (local_err) { > > + error_report_err(local_err); > > + qio_channel_close(dev->ioc, &local_err); > > + error_setg(&error_fatal, "Lost contact with device %s", ds->id); > > + } > > +} > > Here is my feedback from the last revision. Was this addressed? >
Hi Stefan, Thank you for reviewing the patchset. In this version we decided to shutdown the guest when the heartbeat did not get a reply from the remote by setting the error_fatal. Should we approach it differently or you prefer us to get rid of the heartbeat in this form? Thank you, Elena > This patch seems incomplete since no action is taken when the device > fails to respond. vCPU threads that access the device will still get > stuck. > > The simplest way to make this useful is to close the connection when a > timeout occurs. Then the G_IO_HUP handler for the UNIX domain socket > should perform connection cleanup. At that point there are a few > choices: > > 1. Stop guest execution and wait for the host admin to restore the > mplink so execution can resume. This is similar to how -drive > rerror=stop pauses the guest when a disk I/O error is encountered. > > 2. Stop guest execution but defer it until this stale device is actually > accessed. This maximizes guest uptime. Guests that rarely access the > device may not notice at all. > > 3. Return 0 from MemoryRegion read operations and ignore writes. The > guest continues executing but the device is broken. This is risky > because device drivers inside the guest may not be ready to deal with > this. The result could be data loss or corruption. > > 4. Raise a bus-level event. Maybe PCI error reporting can be used to > offline the device. > > 5. Terminate the guest with an error message. > > 6. ? > > Until the heartbeat is fully implemented and tested I suggest dropping > it from this patch series. Remember the G_IO_HUP will happen anyway if > the remote device process terminates.