ping

On 02/27/2017 03:30 PM, Cao jin wrote:
> This is nearly a new design of the feature, so the version is renumbered
> from 0.
>
> About the test:
> The hardware problem (unsteadiness) still occurs as before. The test
> server is in another country at spot A, and my contact in that country is
> located at spot B, so it is not very convenient to get help (plugging in a
> cable or checking the hardware). So, my NIC (which has 2 functions) still
> has only func1 connected to the gateway. If anyone else who has the
> hardware could test the patches, that would be a great help.
>
> Basically, the unsteady hardware shows two symptoms:
> 1. Start the vm; the hardware emits a fatal error by itself before I do
>    anything, causing the vm to stop.
> 2. Start the vm, assign an IP to func1, then ping the gateway; it shows
>    "Destination Host Unreachable" after dozens or hundreds of successful
>    pings, and guest dmesg shows nothing abnormal. I think this symptom is
>    *strong evidence* of unsteady hardware; I speculate that the cable has
>    a problem.
>
> On the other hand, I also saw a perfect result 2 times in my numerous
> tests, in which only func1 was assigned while func0 had no user. It could
> ping for several hours (more than 15000 pings) without any problem; during
> that period, I injected non-fatal errors to func0 & func1, and error
> recovery was very good.
>
> So, most of the time, I must do the test quickly, before the hardware goes
> crazy, until I get what I expected.
>
> Test:
> scenario 1: assign func1 to a vm while func0 has no user.
> scenario 2: assign both functions to 1 vm, with the same topology as host.
> scenario 3: assign both functions to 1 vm, under different buses.
> scenario 4: assign each function to a separate vm.
>
> The steps are: assign an IP to func1, ping the gateway, inject a non-fatal
> error to both functions, and see if func1 can still ping after recovery.
>
> Although we don't have a cable for func0, in a test like scenario 4,
> injecting an error to func0 doesn't affect func1's recovery, so I think it
> proves that one function's recovery doesn't affect another's.
>
> Extra info FYI:
> 1. During the test, some debug lines were added to
>    vfio_err_notifier_handler to read the uncor status register when a
>    fatal error occurred; it shows all F's every time.
> 2. Based on the v10 patch & the corresponding kernel part, modified as per
>    the comments: revert the eventfd handling (don't signal uncor status),
>    and guest link reset will induce the host link reset. The test result
>    shows: non-fatal error recovery is good; fatal error recovery has the
>    same result as what Alex found before (guest kernel crash), because the
>    guest device driver's error_detected() accesses the MMIO registers and
>    gets all F's.
>
> Cao jin (3):
>   pcie aer: verify if AER functionality is available
>   vfio pci: new function to init AER capability
>   vfio-pci: process non fatal error of AER
>
>  hw/pci/pcie_aer.c          |  28 +++++++
>  hw/vfio/pci.c              | 180 +++++++++++++++++++++++++++++++++++++++++++--
>  hw/vfio/pci.h              |   3 +
>  linux-headers/linux/vfio.h |   1 +
>  4 files changed, 207 insertions(+), 5 deletions(-)
>

--
Sincerely,
Cao jin