On Wed, Nov 20, 2019 at 10:48 AM Matan Azrad <ma...@mellanox.com> wrote:
>
> When a rte_device is unplugged, the driver should be detached from the
> device.
>
> The PCI detach driver operation wrongly didn't clear the driver from the
> device structure what remain the device in probe state from the EAL
> point of view.
>
> For example, when a device is removed twice using rte_dev_remove, it
> cause a crash in EAL.

I can see a crash when using port detach in testpmd with a virtio pci device.

testpmd> port attach 0000:07:00.0
Attaching a new port...
EAL: PCI device 0000:07:00.0 on NUMA socket -1
EAL:   Invalid NUMA socket, default to 0
EAL:   probe driver: 1af4:1041 net_virtio
Port 1 is attached. Now total ports is 2
Done

testpmd> port close 1
Closing ports...
EAL: Releasing pci mapped resource for 0000:07:00.0
EAL: Calling pci_unmap_resource for 0000:07:00.0 at 0x2200006000
Done

testpmd> port detach 1
Removing a device...

Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
/root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
315        if (dev->bus->unplug == NULL) {
Missing separate debuginfos, use: debuginfo-install
glibc-2.17-292.el7.x86_64 libgcc-4.8.5-39.el7.x86_64
libpcap-1.5.3-11.el7.x86_64 numactl-libs-2.0.12-3.el7.x86_64
(gdb) p *dev
$1 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0x1cf8078
"0000:07:00.0", driver = 0x16c68f0 <rte_virtio_pmd+16>, bus =
0x16b2640 <rte_pci_bus>, numa_node = 0, devargs = 0x1cf8060}
(gdb) c
Continuing.
Device of port 1 is detached
Now total ports is 1
Done

On the first detach, the pci bus frees the rte_pci_device which embeds
the rte_device object.

static int
pci_unplug(struct rte_device *dev)
{
	struct rte_pci_device *pdev;
	int ret;

	pdev = RTE_DEV_TO_PCI(dev);
	ret = rte_pci_detach_dev(pdev);
	if (ret == 0) {
		rte_pci_remove_device(pdev);
		rte_devargs_remove(dev->devargs);
		free(pdev);
	}
	return ret;
}

testpmd> port detach 1
Removing a device...

Breakpoint 1, local_dev_remove (dev=0x1de64b0) at
/root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
315        if (dev->bus->unplug == NULL) {
(gdb) p *dev
$2 = {next = {tqe_next = 0x0, tqe_prev = 0x0}, name = 0xa <Address 0xa
out of bounds>, driver = 0x0, bus = 0x4637, numa_node = 1, devargs =
0x40000002e040018}
(gdb) c
Continuing.

Program received signal SIGSEGV, Segmentation fault.
0x00000000007c1ddd in local_dev_remove (dev=0x1de64b0) at
/root/dpdk/lib/librte_eal/common/eal_common_dev.c:315
315        if (dev->bus->unplug == NULL) {

On the second detach, testpmd passes the same rte_device pointer it
extracts from rte_eth_devices, but the malloc'd location has been
reused in the meantime (with a watchpoint on the location, I found the
reuse happens somewhere around rte_mp_request_sync/opendir()), and
then *crunch* on dev->bus.

From my pov:
- testpmd is wrongly reusing a pointer coming from rte_eth_devices[],
  without caring about the port state (this is what your second patch
  fixes),
- testpmd is directly kicking pointers in rte_eth_devices[] (setting
  ->device = NULL for its own logic), which is bad too,
- this patch just hides the reuse of a freed pointer.

-- 
David Marchand
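
For illustration only, the lifetime problem discussed in this mail can be modelled in a few lines of C. This is a minimal sketch, not DPDK code: struct device, struct pci_device, ports[], port_attach() and port_detach() are hypothetical stand-ins for rte_device, rte_pci_device, rte_eth_devices[] and the attach/detach paths. It assumes the layer that owns the port table checks the port state before touching the stored pointer, and drops that pointer before the container is freed, so a second detach is refused instead of dereferencing freed memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for rte_device (not the DPDK definition). */
struct device {
	const char *name;
};

/* Hypothetical stand-in for rte_pci_device: the generic device is
 * embedded, so freeing the container also frees the embedded object. */
struct pci_device {
	struct device device;
	int mapped;
};

enum port_state { PORT_UNUSED = 0, PORT_ATTACHED };

/* Hypothetical port table entry, loosely modelled on rte_eth_devices[]. */
struct port {
	enum port_state state;
	struct device *device; /* points into a heap-allocated pci_device */
};

#define MAX_PORTS 4
static struct port ports[MAX_PORTS];

/* Recover the container from the embedded member (container_of style). */
static struct pci_device *
dev_to_pci(struct device *dev)
{
	return (struct pci_device *)
		((char *)dev - offsetof(struct pci_device, device));
}

/* Allocate a device and publish it in the port table. */
int
port_attach(int id, const char *name)
{
	struct pci_device *pdev;

	if (id < 0 || id >= MAX_PORTS || ports[id].state != PORT_UNUSED)
		return -1;
	pdev = calloc(1, sizeof(*pdev));
	if (pdev == NULL)
		return -1;
	pdev->device.name = name;
	ports[id].device = &pdev->device;
	ports[id].state = PORT_ATTACHED;
	return 0;
}

/* Detach a port. The state check comes first, so a second detach
 * returns an error without ever reading the stale pointer. */
int
port_detach(int id)
{
	struct pci_device *pdev;

	if (id < 0 || id >= MAX_PORTS || ports[id].state != PORT_ATTACHED)
		return -1;
	assert(ports[id].device != NULL);
	pdev = dev_to_pci(ports[id].device);
	/* Drop the table's reference before the memory goes away. */
	ports[id].device = NULL;
	ports[id].state = PORT_UNUSED;
	free(pdev);
	return 0;
}
```

With this model, attaching port 1 and calling port_detach(1) twice returns 0 and then -1. Whether the real fix belongs in testpmd, ethdev or the bus is a separate question from what this sketch shows.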