On Wed, 15 Feb 2017 11:34:52 +0800 Peter Xu <pet...@redhat.com> wrote:
> On Wed, Feb 15, 2017 at 10:52:43AM +0800, Peter Xu wrote: > > [...] > > > > > > > > > Then, I *think* above assertion you encountered would fail only if > > > > prev == 0 here, but I still don't quite sure why was that happening. > > > > Btw, could you paste me your "lspci -vvv -s 00:03.0" result in your L1 > > > > guest? > > > > > > > > > > Sure. This is from my L1 guest. > > > > Hmm... I think I found the problem... > > > > > > > > root@guest0:~# lspci -vvv -s 00:03.0 > > > 00:03.0 Network controller: Mellanox Technologies MT27500 Family > > > [ConnectX-3] > > > Subsystem: Mellanox Technologies Device 0050 > > > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > > > Stepping- SERR+ FastB2B- DisINTx+ > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- > > > <MAbort- >SERR- <PERR- INTx- > > > Latency: 0, Cache Line Size: 64 bytes > > > Interrupt: pin A routed to IRQ 23 > > > Region 0: Memory at fe900000 (64-bit, non-prefetchable) [size=1M] > > > Region 2: Memory at fe000000 (64-bit, prefetchable) [size=8M] > > > Expansion ROM at fea00000 [disabled] [size=1M] > > > Capabilities: [40] Power Management version 3 > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-) > > > Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME- > > > Capabilities: [48] Vital Product Data > > > Product Name: CX354A - ConnectX-3 QSFP > > > Read-only fields: > > > [PN] Part number: MCX354A-FCBT > > > [EC] Engineering changes: A4 > > > [SN] Serial number: MT1346X00791 > > > [V0] Vendor specific: PCIe Gen3 x8 > > > [RV] Reserved: checksum good, 0 byte(s) reserved > > > Read/write fields: > > > [V1] Vendor specific: N/A > > > [YA] Asset tag: N/A > > > [RW] Read-write area: 105 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 253 byte(s) free > > > [RW] Read-write area: 252 byte(s) free > > > End > > > Capabilities: [9c] MSI-X: Enable+ Count=128 Masked- > > > Vector table: BAR=0 offset=0007c000 > > > PBA: BAR=0 offset=0007d000 > > > Capabilities: [60] Express (v2) Root Complex Integrated Endpoint, MSI 00 > > > DevCap: MaxPayload 256 bytes, PhantFunc 0 > > > ExtTag- RBE+ > > > DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+ > > > RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- > > > MaxPayload 256 bytes, MaxReadReq 4096 bytes > > > DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend- > > > DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not > > > Supported > > > DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF > > > Disabled > > > Capabilities: [100 v0] #00 > > > > Here we have the head of ecap capability as cap_id==0, then when we > > boot the l2 guest with the same device, we'll first copy this > > cap_id==0 cap, then when adding the 2nd ecap, we'll probably encounter > > problem since pcie_find_capability_list() will thought there is no cap > > at all (cap_id==0 is skipped). > > > > Do you want to try this "hacky patch" to see whether it works for you? > > > > ------8<------- > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > > index 332f41d..bacd302 100644 > > --- a/hw/vfio/pci.c > > +++ b/hw/vfio/pci.c > > @@ -1925,11 +1925,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > > > > } > > > > - /* Cleanup chain head ID if necessary */ > > - if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) { > > - pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0); > > - } > > - > > g_free(config); > > return; > > } > > ------>8------- > > > > I don't think it's a good solution (it just used 0xffff instead of 0x0 > > for the masked cap_id, then l2 guest would like to co-op with it), but > > it should workaround this temporarily. I'll try to think of a better > > one later and post when proper. > > > > (Alex, please leave comment if you have any better suggestion before > > mine :) > > Alex, do you like something like below to fix above issue that Jintack > has encountered? > > (note: this code is not for compile, only trying show what I mean...) > > ------8<------- > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > index 332f41d..4dca631 100644 > --- a/hw/vfio/pci.c > +++ b/hw/vfio/pci.c > @@ -1877,25 +1877,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > */ > config = g_memdup(pdev->config, vdev->config_size); > > - /* > - * Extended capabilities are chained with each pointing to the next, so > we > - * can drop anything other than the head of the chain simply by modifying > - * the previous next pointer. For the head of the chain, we can modify > the > - * capability ID to something that cannot match a valid capability. ID > - * 0 is reserved for this since absence of capabilities is indicated by > - * 0 for the ID, version, AND next pointer. However, > pcie_add_capability() > - * uses ID 0 as reserved for list management and will incorrectly match > and > - * assert if we attempt to pre-load the head of the chain with this ID. > - * Use ID 0xFFFF temporarily since it is also seems to be reserved in > - * part for identifying absence of capabilities in a root complex > register > - * block. If the ID still exists after adding capabilities, switch back > to > - * zero. We'll mark this entire first dword as emulated for this > purpose. > - */ > - pci_set_long(pdev->config + PCI_CONFIG_SPACE_SIZE, > - PCI_EXT_CAP(0xFFFF, 0, 0)); > - pci_set_long(pdev->wmask + PCI_CONFIG_SPACE_SIZE, 0); > - pci_set_long(vdev->emulated_config_bits + PCI_CONFIG_SPACE_SIZE, ~0); > - > for (next = PCI_CONFIG_SPACE_SIZE; next; > next = PCI_EXT_CAP_NEXT(pci_get_long(config + next))) { > header = pci_get_long(config + next); > @@ -1917,6 +1898,8 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > switch (cap_id) { > case PCI_EXT_CAP_ID_SRIOV: /* Read-only VF BARs confuse OVMF */ > case PCI_EXT_CAP_ID_ARI: /* XXX Needs next function virtualization */ > + /* keep this ecap header (4 bytes), but mask cap_id to 0xffff */ > + ... > trace_vfio_add_ext_cap_dropped(vdev->vbasedev.name, cap_id, > next); > break; > default: > @@ -1925,11 +1908,6 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev) > > } > > - /* Cleanup chain head ID if necessary */ > - if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) { > - pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0); > - } > - > g_free(config); > return; > } > ----->8----- > > Since after all we need the assumption that 0xffff is reserved for > cap_id. Then, we can just remove the "first 0xffff then 0x0" hack, > which is imho error-prone and hacky. This doesn't fix the bug, which is that pcie_add_capability() uses a valid capability ID for it's own internal tracking. It's only doing this to find the end of the capability chain, which we could do in a spec complaint way by looking for a zero next pointer. Fix that and then vfio doesn't need to do this set to 0xffff then back to zero nonsense at all. Capability ID zero is valid. Thanks, Alex