Hi Jerin, On Mon, May 8, 2017 at 4:20 PM, Jerin Jacob <jerin.ja...@caviumnetworks.com> wrote:
> -----Original Message----- > > Date: Sun, 30 Apr 2017 19:29:49 +0200 > > From: Thomas Monjalon <tho...@monjalon.net> > > To: Alejandro Lucero <alejandro.luc...@netronome.com> > > Cc: dev@dpdk.org, "Burakov, Anatoly" <anatoly.bura...@intel.com> > > Subject: Re: [dpdk-dev] [PATCH] vfio: fix device unplug when several > > devices per vfio group > > > > 28/04/2017 15:25, Burakov, Anatoly: > > > From: Alejandro Lucero [mailto:alejandro.luc...@netronome.com] > > > > VFIO allows a secure way of assigning devices to user space and those > > > > devices which can not be isolated from other ones are set in same > VFIO > > > > group. Releasing or unplugging a device should be aware of remaining > > > > devices is the same group for avoiding to close such a group. > > > > > > > > Fixes: 94c0776b1bad ("vfio: support hotplug") > > > > > > > > Signed-off-by: Alejandro Lucero <alejandro.luc...@netronome.com> > > > > > > I have tested this on my setup on an old kernel with multiple > attach/detaches, and it works (whereas it fails without this patch). > > > > > > Acked-by: Anatoly Burakov <anatoly.bura...@intel.com> > > > > Applied, thanks > > This patch creates issue when large number of PCIe devices connected to > system. > Found it through git bisect. > > This issue is, vfio_group_fd goes beyond 64(VFIO_MAX_GROUPS) and writes > to wrong memory on following code execution and sub sequentially creates > issues in vfio mapping or such. > vfio_cfg.vfio_groups[vfio_group_fd].devices++; > > I can increase VFIO_MAX_GROUPS, but I think, it is not correct fix as > vfio_group_fd generated from open system call. > > I add some prints the code for debug. Please find below the output. > Any thoughts from VFIO experts? > > That is a silly but serious bug. We are using the file descriptor as the index for updating devices counter of a vfio group structure internal to DPDK VFIO code. We should be using the vfio_group that file descriptor is registered with. I will send a fix where vfio_group_device_get/put/count functions are implemented which take the file descriptor as a parameter and then go through the vfio_group array for working with the right one. Thomas, is this fix in time yet for 17.05? I will send the patch today but I can just test it against a system with the "normal" case for VFIO device groups. Maybe Jerin or/and Anatoly can test it against the other case. > ➜ [master]83xx [dpdk-master] $ git diff > diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c > b/lib/librte_eal/linuxapp/eal/eal_vfio.c > index d3eae20..2d8ee4c 100644 > --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c > +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c > @@ -100,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no) > snprintf(filename, sizeof(filename), > VFIO_GROUP_FMT, iommu_group_no); > vfio_group_fd = open(filename, O_RDWR); > + printf("###### name %s vfio_group_fd %d\n", filename, > vfio_group_fd); > if (vfio_group_fd < 0) { > /* if file not found, it's not an error */ > if (errno != ENOENT) { > @@ -259,6 +260,8 @@ vfio_setup_device(const char *sysfs_base, const char > *dev_addr, > if (vfio_group_fd < 0) > return -1; > > + printf("#### iommu_group_fd %d vfio_group_fd=%d\n", > iommu_group_no, vfio_group_fd); > + > /* if group_fd == 0, that means the device isn't managed by VFIO > * */ > if (vfio_group_fd == 0) { > RTE_LOG(WARNING, EAL, " %s not managed by VFIO driver, > skipping\n", > @@ -266,6 +269,7 @@ vfio_setup_device(const char *sysfs_base, const char > *dev_addr, > return 1; > } > /* > * at this point, we know that this group is viable (meaning, > * all devices > * are either bound to VFIO or not bound to anything) > @@ -359,6 +363,7 @@ vfio_setup_device(const char *sysfs_base, const char > *dev_addr, > return -1; > } > vfio_cfg.vfio_groups[vfio_group_fd].devices++; > + printf("vfio_group_fd %d device %d\n", vfio_group_fd, > vfio_cfg.vfio_groups[vfio_group_fd].devices++); > > return 0; > } > > > output log > ---------- > > EAL: PCI device 0000:07:00.1 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/114 vfio_group_fd 44 > #### iommu_group_fd 114 vfio_group_fd=44 > EAL: using IOMMU type 1 (Type 1) > vfio_group_fd 44 device 1 > EAL: PCI device 0000:07:00.2 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/115 vfio_group_fd 47 > #### iommu_group_fd 115 vfio_group_fd=47 > vfio_group_fd 47 device 1 > EAL: PCI device 0000:07:00.3 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/116 vfio_group_fd 50 > #### iommu_group_fd 116 vfio_group_fd=50 > vfio_group_fd 50 device 1 > EAL: PCI device 0000:07:00.4 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/117 vfio_group_fd 53 > #### iommu_group_fd 117 vfio_group_fd=53 > vfio_group_fd 53 device 1 > EAL: PCI device 0000:07:00.5 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/118 vfio_group_fd 56 > #### iommu_group_fd 118 vfio_group_fd=56 > vfio_group_fd 56 device 1 > EAL: PCI device 0000:07:00.6 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/119 vfio_group_fd 59 > #### iommu_group_fd 119 vfio_group_fd=59 > vfio_group_fd 59 device 1 > EAL: PCI device 0000:07:00.7 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/120 vfio_group_fd 62 > #### iommu_group_fd 120 vfio_group_fd=62 > vfio_group_fd 62 device 1 > EAL: PCI device 0000:07:01.0 on NUMA socket 0 > EAL: probe driver: 177d:a04b octeontx_ssovf > ###### name /dev/vfio/121 vfio_group_fd 65 > #### iommu_group_fd 121 vfio_group_fd=65 > vfio_group_fd 65 device 1632632833 > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^(memory corruption here) > > EAL: PCI device 0000:08:00.1 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > ###### name /dev/vfio/122 vfio_group_fd 68 > #### iommu_group_fd 122 vfio_group_fd=68 > vfio_group_fd 68 device 1 > EAL: PCI device 0000:08:00.2 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > ###### name /dev/vfio/123 vfio_group_fd 71 > #### iommu_group_fd 123 vfio_group_fd=71 > vfio_group_fd 71 device 99999941 > EAL: PCI device 0000:08:00.3 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > ###### name /dev/vfio/124 vfio_group_fd 74 > #### iommu_group_fd 124 vfio_group_fd=74 > vfio_group_fd 74 device 1 > EAL: PCI device 0000:08:00.4 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > ###### name /dev/vfio/125 vfio_group_fd 77 > #### iommu_group_fd 125 vfio_group_fd=77 > vfio_group_fd 77 device 1 > EAL: PCI device 0000:08:00.5 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > ###### name /dev/vfio/126 vfio_group_fd 80 > #### iommu_group_fd 126 vfio_group_fd=80 > vfio_group_fd 80 device 1 > EAL: PCI device 0000:08:00.6 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > ###### name /dev/vfio/127 vfio_group_fd 83 > #### iommu_group_fd 127 vfio_group_fd=83 > vfio_group_fd 83 device 1 > EAL: PCI device 0000:08:00.7 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > EAL: PCI device 0000:08:01.0 on NUMA socket 0 > EAL: probe driver: 177d:a04d octeontx_ssowvf > EAL: PCI device 0001:01:00.1 on NUMA socket 0 > EAL: probe driver: 177d:a034 net_thunderx > ###### name /dev/vfio/64 vfio_group_fd 86 > #### iommu_group_fd 64 vfio_group_fd=86 > vfio_group_fd 86 device 1 > Segmentation fault > > >