On Tue, Aug 16, 2022 at 1:29 PM Ganesh Goudar <ganes...@linux.ibm.com> wrote: > > Hi, > > EEH reocvery is currently serialized and these patches shorten > the time taken for EEH recovery by making the recovery to run > in parallel. The original author of these patches is Sam Bobroff, > I have rebased and tested these patches. > > On powervm with 64 VFs and I see approximately 48% reduction > in time taken in EEH recovery,
Can you elaborate on how you're testing this? A VM with 64 separate VFs on 64 separate PHBs I guess? How are you injecting errors? > Yet to be tested on powernv. If you're not testing on powernv you might as well not be testing. EEH support on pseries is relatively straightforward since managing PCI bridges, etc is all hidden away in the hypervisor. Most of the things about EEH which give me headaches simply don't happen on pseries since the PCI topology seen by the guest is flat. > These patches were originally posted as separate RFCs, I think > posting them as single series would be more helpful, I know the > patches are too big, I will try to logically divide in next > iterations. > > Thanks > > Ganesh Goudar (3): > powerpc/eeh: Synchronization for safety I didn't pick that patch up after Sam left for a reason. I don't recall the exact details, but I think I decided that the lock being added had the same (or mostly the same) scope as the pci rescan lock. All modifications to the EEH PE tree have to occur with the rescan lock held since the per-device EEH setup occurs while configuring the pci_dev (similarly the teardown happens when we remove the pci_dev from its bus). I don't think that was true when Same wrote the patch originally though since it pre-dates my big EEH init re-work. It's possible (probable even!) I got that wrong, or that it was something I was trying to make true through re-works rather than an accurate assessment of the current state of things. I'll try page some of this back into my brain later. I think I left some comments on Sam's original RFC patch which might still be valid. > powerpc/eeh: Asynchronous recovery I remember not hating the idea, but I think the implementation leaves a lot to be desired. If you only really care about pseries then just moving to a model where each PHB has its own recovery thread would get you most of the benefit without needing to turn the already complicated recovery path into an async trainwreck. IIRC Sam's motivation for that patch was to make recovery faster for powernv systems when an error was injected on a PF with a lot of child VFs. Certain drivers took forever to recover each VF (might have been mlx5) so doing it in parallel would have sped things up considerably. Oliver