Re: [Qemu-devel] [PATCH 0/3] vfio-pci: support recovery of AER non fatal error

Cao jin Tue, 07 Mar 2017 03:39:59 -0800

ping

On 02/27/2017 03:30 PM, Cao jin wrote:
> This is a nearly new design of the feature, so the version numbering
> restarts from 0.
> 
> About the test:
> The hardware is still unstable, as before. The test server is in another
> country at site A, while my contact in that country is at site B, so it is
> not convenient to get help (plugging in a cable, or checking the hardware).
> So my NIC (which has 2 functions) still has only func1 connected to the
> gateway. If anyone else who has the hardware could test the patches, that
> would be a great help.
> 
> 
> Basically, the unstable hardware shows two symptoms:
> 1. Start the vm; the hardware emits a fatal error by itself before I do
>    anything, which causes the vm to stop.
> 2. Start the vm, assign an IP to func1, then ping the gateway; it shows
>    "Destination Host Unreachable" after dozens or hundreds of successful
>    pings, and guest dmesg shows nothing abnormal.  I think this symptom is
>    *strong evidence* of unstable hardware; I suspect that the cable has a
>    problem.
> 
>    On the other hand, I also saw a perfect result 2 times in my numerous
>    tests, which assigned just func1 while func0 had no user. It could ping
>    for several hours (more than 15000 pings) without any problem; during
>    that period, I injected non-fatal errors into func0 & func1, and error
>    recovery was very good.
> 
>    So, most of the time, I must do the test quickly, before the hardware
>    goes crazy, until I get what I expected.
> 
> 
> Test:
> scenario 1: assign func1 to vm while func0 has no user.
> scenario 2: assign both functions to 1 vm, with the same topology as host.
> scenario 3: assign both functions to 1 vm, under different bus.
> scenario 4: assign each function to a separate vm.
> 
> The steps are: assign an IP to func1, ping the gateway, inject a non-fatal
> error into both functions, and see if func1 can still ping after recovery.
> 
> Although we don't have a cable for func0, in a test like scenario 4,
> injecting into func0 doesn't affect func1's recovery, so I think it shows
> that one function's recovery doesn't affect the other.
> 
> 
> Extra info FYI:
> 1. During the test, some debug lines were added in vfio_err_notifier_handler
>    to read the uncor status register in this function when a fatal error
>    occurred; it shows all F's every time.
> 2. Based on the v10 patch & the corresponding kernel part, modified per the
>    comments: revert the eventfd handling (don't signal uncor status), and
>    guest link reset will induce the host link reset. The test result shows:
>    non-fatal error recovery is good; fatal error recovery has the same result
>    as what Alex found before (guest kernel crash), because the guest device
>    driver's error_detected() accesses the MMIO registers and gets all F's.
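
As an aside on the all-F's reads mentioned above: a read that returns all
ones usually means the device did not respond (for example, the link is
down), so the value should not be decoded as error bits. Below is a minimal
C sketch of that check. It assumes the PCIe AER convention that a bit set in
the Uncorrectable Error Severity register marks the matching status bit as
fatal. The enum and function names are hypothetical and are not taken from
the patches.

```c
#include <stdint.h>

/* Hypothetical result codes for illustration only. */
enum aer_read_result {
    AER_READ_NO_ERROR,     /* no uncorrectable error bit is set        */
    AER_READ_NONFATAL,     /* only bits with severity 0 are set        */
    AER_READ_FATAL,        /* at least one set bit has severity 1      */
    AER_READ_DEVICE_GONE,  /* all F's: the read itself did not succeed */
};

/*
 * Classify a value read from the AER Uncorrectable Error Status register.
 * "severity" is the value of the Uncorrectable Error Severity register.
 */
static enum aer_read_result classify_uncor_status(uint32_t status,
                                                  uint32_t severity)
{
    /* All ones means the device is unreachable; do not decode the bits. */
    if (status == UINT32_MAX) {
        return AER_READ_DEVICE_GONE;
    }
    if (status == 0) {
        return AER_READ_NO_ERROR;
    }
    /* Any pending error whose severity bit is set makes the event fatal. */
    if (status & severity) {
        return AER_READ_FATAL;
    }
    return AER_READ_NONFATAL;
}
```

With an example severity value of 0x00462030, a status of 0x00100000
(Unsupported Request) would be classified as non-fatal, a status of
0x00000010 (Data Link Protocol Error) as fatal, and 0xffffffff as an
unreachable device.
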
> 
> 
> Cao jin (3):
>   pcie aer: verify if AER functionality is available
>   vfio pci: new function to init AER capability
>   vfio-pci: process non fatal error of AER
> 
>  hw/pci/pcie_aer.c          |  28 +++++++
>  hw/vfio/pci.c              | 180 +++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/pci.h              |   3 +
>  linux-headers/linux/vfio.h |   1 +
>  4 files changed, 207 insertions(+), 5 deletions(-)
>


-- 
Sincerely,
Cao jin
