On 9/24/20 7:11 AM, Alexey Kardashevskiy wrote:
>
>
> On 23/09/2020 17:06, Cédric Le Goater wrote:
>> On 9/23/20 2:33 AM, Qian Cai wrote:
>>> On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
>>>> When a passthrough IO adapter is removed from a pseries machine using
>>>> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
>>>> guest OS to clear all page table entries related to the adapter. If
>>>> some are still present, the RTAS call which isolates the PCI slot
>>>> returns error 9001 "valid outstanding translations" and the removal of
>>>> the IO adapter fails. This is because when the PHBs are scanned, Linux
>>>> automatically maps the INTx interrupts in the Linux interrupt number
>>>> space but these are never removed.
>>>>
>>>> To solve this problem, we introduce a PPC platform specific
>>>> pcibios_remove_bus() routine which clears all interrupt mappings when
>>>> the bus is removed. This also clears the associated page table entries
>>>> of the ESB pages when using XIVE.
>>>>
>>>> For this purpose, we record the logical interrupt numbers of the
>>>> mapped interrupts under the PHB structure and let pcibios_remove_bus()
>>>> do the cleanup.
>>>>
>>>> Since some PCI adapters, like GPUs, use the "interrupt-map" property
>>>> to describe interrupt mappings other than the legacy INTx interrupts,
>>>> we cannot restrict the size of the mapping array to PCI_NUM_INTX. The
>>>> number of interrupt mappings is computed from the "interrupt-map"
>>>> property and the mapping array is allocated accordingly.
>>>>
>>>> Cc: "Oliver O'Halloran" <ooh...@gmail.com>
>>>> Cc: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>> Signed-off-by: Cédric Le Goater <c...@kaod.org>
>>>
>>> Some syscall fuzzing will trigger this on POWER9 NV where the traces
>>> pointed to this patch.
>>>
>>> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
>>
>> OK. The patch is missing a NULL assignment after kfree() and that
>> might be the issue.
>>
>> I did try PHB removal under PowerNV, so I would like to understand
>> how we managed to remove the PCI bus twice and possibly reproduce it.
>> Any chance we could grab what the syscall fuzzer (syzkaller) did?
>
>
>
> My guess would be it is doing this in parallel to provoke races.
Concurrent removal and rescan should be serialized by:

  pci_stop_and_remove_bus_device_locked()
  pci_lock_rescan_remove()

And, in the report, the stack traces are on the same CPU and PID.

What I think is happening is that we did a couple of remove/rescan
cycles on the same PHB. The problem is that ->irq_map is not
reallocated the second time, because the PHB is re-scanned differently
on the PowerNV platform. At the second remove, ->irq_map is not NULL,
so we try to kfree() it again and the allocator warns of a double
free :/

This works fine on pseries/KVM because the PHB is never removed, and
on pseries/pHyp, pcibios_scan_phb() is called at PHB hotplug. But on
PowerNV, pcibios_scan_phb() is only called at probe/boot time and not
at hotplug time :/

It was a good thing to spot that before merge. But I need to revise
that patch again.

Thanks,

C.
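For readers following the thread, here is a minimal userspace model of the bookkeeping described above. It is a sketch under stated assumptions, not the patch itself: struct model_phb and the model_* helpers are hypothetical stand-ins for the PHB's ->irq_map handling, free() plays the role of kfree(), and a comment marks where the kernel would presumably call irq_dispose_mapping(). It shows that resetting the pointer after the free makes a second removal harmless, and also that, when no re-scan reallocates the array (the PowerNV hotplug case), later mappings simply go unrecorded, so the NULL assignment alone does not address the underlying issue.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical userspace model of per-PHB IRQ bookkeeping.
 * struct model_phb and the model_* functions are illustrative
 * names, not kernel APIs.
 */
struct model_phb {
	unsigned int *irq_map; /* recorded Linux IRQ numbers, NULL when empty */
	int irq_count;         /* entries currently recorded */
	int irq_capacity;      /* size of irq_map ("interrupt-map" derived in the patch) */
};

/* Scan time: allocate room for nr_maps mappings. Returns 0 on success. */
int model_phb_scan(struct model_phb *phb, int nr_maps)
{
	free(phb->irq_map); /* drop any stale array instead of leaking it */
	phb->irq_map = calloc((size_t)nr_maps, sizeof(*phb->irq_map));
	phb->irq_count = 0;
	phb->irq_capacity = phb->irq_map ? nr_maps : 0;
	return phb->irq_map ? 0 : -1;
}

/* Record one mapped IRQ; returns -1 if the array is missing or full. */
int model_phb_record_irq(struct model_phb *phb, unsigned int virq)
{
	if (!phb->irq_map || phb->irq_count >= phb->irq_capacity)
		return -1;
	phb->irq_map[phb->irq_count++] = virq;
	assert(phb->irq_count <= phb->irq_capacity);
	return 0;
}

/*
 * Bus removal: dispose of every recorded mapping (only counted here),
 * then free the array AND reset the pointer. Without the NULL
 * assignment, a second removal with no intervening re-scan would call
 * free() on the same pointer again. Returns the number disposed.
 */
int model_phb_remove_bus(struct model_phb *phb)
{
	int disposed = 0;

	for (int i = 0; i < phb->irq_count; i++) {
		/* kernel sketch (assumed): irq_dispose_mapping(phb->irq_map[i]); */
		disposed++;
	}
	free(phb->irq_map); /* kfree() in the kernel; free(NULL) is a no-op */
	phb->irq_map = NULL;
	phb->irq_count = 0;
	phb->irq_capacity = 0;
	return disposed;
}
```

A second call to model_phb_remove_bus() without an intervening model_phb_scan() returns 0 instead of freeing the same pointer twice, while model_phb_record_irq() fails until the array is allocated again.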