[cc Bandan] On Tue, Feb 21, 2017 at 5:33 AM, Jintack Lim <jint...@cs.columbia.edu> wrote:
> > > On Wed, Feb 15, 2017 at 9:47 PM, Alex Williamson < > alex.william...@redhat.com> wrote: > >> On Thu, 16 Feb 2017 10:28:39 +0800 >> Peter Xu <pet...@redhat.com> wrote: >> >> > On Wed, Feb 15, 2017 at 11:15:52AM -0700, Alex Williamson wrote: >> > >> > [...] >> > >> > > > Alex, do you like something like below to fix above issue that >> Jintack >> > > > has encountered? >> > > > >> > > > (note: this code is not for compile, only trying show what I >> mean...) >> > > > >> > > > ------8<------- >> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >> > > > index 332f41d..4dca631 100644 >> > > > --- a/hw/vfio/pci.c >> > > > +++ b/hw/vfio/pci.c >> > > > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice >> *vdev) >> > > > */ >> > > > config = g_memdup(pdev->config, vdev->config_size); >> > > > >> > > > - /* >> > > > - * Extended capabilities are chained with each pointing to the >> next, so we >> > > > - * can drop anything other than the head of the chain simply >> by modifying >> > > > - * the previous next pointer. For the head of the chain, we >> can modify the >> > > > - * capability ID to something that cannot match a valid >> capability. ID >> > > > - * 0 is reserved for this since absence of capabilities is >> indicated by >> > > > - * 0 for the ID, version, AND next pointer. However, >> pcie_add_capability() >> > > > - * uses ID 0 as reserved for list management and will >> incorrectly match and >> > > > - * assert if we attempt to pre-load the head of the chain with >> this ID. >> > > > - * Use ID 0xFFFF temporarily since it is also seems to be >> reserved in >> > > > - * part for identifying absence of capabilities in a root >> complex register >> > > > - * block. If the ID still exists after adding capabilities, >> switch back to >> > > > - * zero. We'll mark this entire first dword as emulated for >> this purpose. >> > > > - */ >> > > > - pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE, >> > > > - PCI_EXT_CAP(0xFFFF, 0, 0)); >> > > > - pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0); >> > > > - pci_set_long(vdev->emulated_config_bits + >> PCI_CONFIG_SPACE_SIZE, ~0); >> > > > - >> > > > for (next = PCI_CONFIG_SPACE_SIZE; next; >> > > > next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) { >> > > > header = pci_get_long(config + next); >> > > > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice >> *vdev) >> > > > switch (cap_id) { >> > > > case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse >> OVMF */ >> > > > case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function >> virtualization */ >> > > > + /* keep this ecap header (4 bytes), but mask cap_id to >> 0xffff */ >> > > > + ... >> > > > trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, >> cap_id, next); >> > > > break; >> > > > default: >> > > > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice >> *vdev) >> > > > >> > > > } >> > > > >> > > > - /* Cleanup chain head ID if necessary */ >> > > > - if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == >> 0xFFFF) { >> > > > - pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0); >> > > > - } >> > > > - >> > > > g_free(config); >> > > > return; >> > > > } >> > > > ----->8----- >> > > > >> > > > Since after all we need the assumption that 0xffff is reserved for >> > > > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack, >> > > > which is imho error-prone and hacky. >> > > >> > > This doesn't fix the bug, which is that pcie_add_capability() uses a >> > > valid capability ID for it's own internal tracking. It's only doing >> > > this to find the end of the capability chain, which we could do in a >> > > spec complaint way by looking for a zero next pointer. Fix that and >> > > then vfio doesn't need to do this set to 0xffff then back to zero >> > > nonsense at all. Capability ID zero is valid. Thanks, >> > >> > Yeah I see Michael's fix on the capability list stuff. However, imho >> > these are two different issues? Or say, even if with that patch, we >> > should still need this hack (first 0x0, then 0xffff) right? Since >> > looks like that patch didn't solve the problem if the first pcie ecap >> > is masked at 0x100. >> >> I thought the problem was that QEMU in the host exposes a device with a >> capability ID of 0 to the L1 guest. QEMU in the L1 guest balks at a >> capability ID of 0 because that's how it finds the end of the chain. >> Therefore if we make QEMU not use capability ID 0 for internal >> purposes, things work. vfio using 0xffff and swapping back to 0x0 >> becomes unnecessary, but doesn't hurt anything. Thanks, >> > > I've applied Peter's hack and Michael's patch below, but still can't use > the assigned device in L2. > commit 4bb571d857d973d9308d9fdb1f48d983d6639bd4 > Author: Michael S. Tsirkin <m...@redhat.com> > Date: Wed Feb 15 22:37:45 2017 +0200 > > pci/pcie: don't assume cap id 0 is reserved > > I was able to boot L2 with following qemu warnings, > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available > reset mechanism. > qemu-system-x86_64: vfio: Cannot reset device 0000:00:03.0, no available > reset mechanism. > > but then I don't see the network device, which I was trying to assign to > L2, in L2. > This is from L2 dmesg, and it looks like the device is not initialized. > > [ 5.884115] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) > [ 5.891563] mlx4_core: Initializing 0000:00:03.0 > [ 5.896947] ACPI: PCI Interrupt Link [GSIH] enabled at IRQ 23 > [ 6.913559] mlx4_core 0000:00:03.0: Installed FW has unsupported > command interface revision 0 > [ 6.920925] mlx4_core 0000:00:03.0: (Installed FW version is 0.0.000) > [ 6.926490] mlx4_core 0000:00:03.0: This driver version supports only > revisions 2 to 3 > [ 6.933300] mlx4_core 0000:00:03.0: QUERY_FW command failed, aborting > [ 6.940279] mlx4_core 0000:00:03.0: Failed to init fw, aborting. > > This is the full kernel log from L2. > https://paste.ubuntu.com/24039462/ > > L0, L1 and L2 are using the same kernel, so I think they are using the > same device driver. > This is the L0/L1 kernel log about the network device. > > --- From L0 --- > [ 8.175533] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) > [ 8.175543] mlx4_core: Initializing 0000:08:00.0 > [ 14.524093] mlx4_core 0000:08:00.0: PCIe link speed is 8.0GT/s, device > supports 8.0GT/s > [ 14.533030] mlx4_core 0000:08:00.0: PCIe link width is x8, device > supports x8 > [ 14.714296] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb > 2014) > [ 14.722295] mlx4_en 0000:08:00.0: Activating port:2 > [ 14.735186] mlx4_en: 0000:08:00.0: Port 2: Using 128 TX rings > [ 14.741608] mlx4_en: 0000:08:00.0: Port 2: Using 8 RX rings > [ 14.747826] mlx4_en: 0000:08:00.0: Port 2: frag:0 - size:1522 > prefix:0 stride:1536 > [ 14.756698] mlx4_en: 0000:08:00.0: Port 2: Initializing port > [ 14.764036] mlx4_en 0000:08:00.0: registered PHC clock > > --- From L1 --- > [ 3.790302] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014) > [ 3.791089] mlx4_core: Initializing 0000:00:03.0 > [ 9.053077] mlx4_core 0000:00:03.0: Unable to determine PCIe device BW > capabilities > [ 9.203290] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb > 2014) > [ 9.204503] mlx4_en 0000:00:03.0: Activating port:2 > [ 9.212853] mlx4_en: 0000:00:03.0: Port 2: Using 32 TX rings > [ 9.213514] mlx4_en: 0000:00:03.0: Port 2: Using 4 RX rings > [ 9.214131] mlx4_en: 0000:00:03.0: Port 2: frag:0 - size:1522 > prefix:0 stride:1536 > [ 9.215260] mlx4_en: 0000:00:03.0: Port 2: Initializing port > [ 9.216377] mlx4_en 0000:00:03.0: registered PHC clock > [ 9.261518] mlx4_en: eth1: Link Up > [ 9.690730] mlx4_core 0000:00:03.0 eth2: renamed from eth1 > > Any thoughts? > I've tried another network device on a different machine. It has "Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection" ethernet controller. I got the same problem of getting the network device initialization failure in L2. I think I'm missing something since I heard from Bandan that he had no problem to assign a device to L2 with ixgbe. This is the error message from dmesg in L2. [ 3.692871] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 4.2.1-k [ 3.697716] ixgbe: Copyright (c) 1999-2015 Intel Corporation. [ 3.964875] ixgbe 0000:00:02.0: HW Init failed: -12 [ 3.972362] ixgbe: probe of 0000:00:02.0 failed with error -12 I checked that L2 indeed had that device. root@guest0:~# lspci 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02) 00:02.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01) I'm describing steps I took, so if you notice something wrong, PLEASE let me know. 1. [L0] Check the device with lspci. Result is [1] 2. [L0] Unbind from the original driver and bind to vfio-pci driver following [2][3] 3. [L0] Start L1 with this script. [4] 4. [L1] L1 is able to use the network device. 5. [L1] Unbind from the original driver and bind to vfio-pci driver same as the step 2. 6. [L1] Start L2 with this script. [5] 7. [L2] Got the init failure error message above. [1] https://paste.ubuntu.com/24055745/ [2] http://www.linux-kvm.org/page/10G_NIC_performance:_VFIO_vs_virtio [3] http://www.linux-kvm.org/images/b/b4/2012-forum-VFIO.pdf [4] https://paste.ubuntu.com/24055715/ [5] https://paste.ubuntu.com/24055720/ Thanks, Jintack > > >> Alex >> >> >