Hi Alexey, Looking at the skiboot side I think we only fence the NVLink bricks as part of a PCIe function level reset (FLR) rather than a PCI Hot or Fundamental reset which I believe is what the code here does. So to fence the bricks you would need to do either a FLR on the given link or alter Skiboot to fence a given link as part of a hot reset.
- Alistair On Monday, 15 October 2018 6:17:51 PM AEDT Alexey Kardashevskiy wrote: > Ping? > > > On 02/10/2018 13:20, Alexey Kardashevskiy wrote: > > The skiboot firmware has a hot reset handler which fences the NVIDIA V100 > > GPU RAM on Witherspoons and makes accesses no-op instead of throwing HMIs: > > https://github.com/open-power/skiboot/commit/fca2b2b839a67 > > > > Now we are going to pass V100 via VFIO which most certainly involves > > KVM guests which are often terminated without getting a chance to offline > > GPU RAM so we end up with a running machine with misconfigured memory. > > Accessing this memory produces hardware management interrupts (HMI) > > which bring the host down. > > > > To suppress HMIs, this wires up this hot reset hook to vfio_pci_disable() > > via pci_disable_device() which switches NPU2 to a safe mode and prevents > > HMIs. > > > > Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru> > > --- > > Changes: > > v2: > > * updated the commit log > > --- > > arch/powerpc/platforms/powernv/pci-ioda.c | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > > b/arch/powerpc/platforms/powernv/pci-ioda.c > > index cde7102..e37b9cc 100644 > > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > > @@ -3688,6 +3688,15 @@ static void pnv_pci_release_device(struct pci_dev > > *pdev) > > pnv_ioda_release_pe(pe); > > } > > > > +static void pnv_npu_disable_device(struct pci_dev *pdev) > > +{ > > + struct eeh_dev *edev = pci_dev_to_eeh_dev(pdev); > > + struct eeh_pe *eehpe = edev ? edev->pe : NULL; > > + > > + if (eehpe && eeh_ops && eeh_ops->reset) > > + eeh_ops->reset(eehpe, EEH_RESET_HOT); > > +} > > + > > static void pnv_pci_ioda_shutdown(struct pci_controller *hose) > > { > > struct pnv_phb *phb = hose->private_data; > > @@ -3732,6 +3741,7 @@ static const struct pci_controller_ops > > pnv_npu_ioda_controller_ops = { > > .reset_secondary_bus = pnv_pci_reset_secondary_bus, > > .dma_set_mask = pnv_npu_dma_set_mask, > > .shutdown = pnv_pci_ioda_shutdown, > > + .disable_device = pnv_npu_disable_device, > > }; > > > > static const struct pci_controller_ops pnv_npu_ocapi_ioda_controller_ops = > > { > > > >