Hi Damjan, Benoit,

On the CSIT call I learned that you were looking into why VPP Device jobs sometimes fail. I'm working on adding VPP Device jobs for arm, and we're seeing the same issue that's behind these random failures, possibly even more frequently.

There are two high level observations I'll start with:
* The surface level issue is that VPP doesn't return interface dump info from all interfaces (we're using 2 VFs in the tests and sometimes we're getting info from only 1 and sometimes not even that)
* The issue happens only when there are multiple jobs running on the same server

Looking at logs, I was able to figure out that the failure is tied to VPP startup, particularly this in DPDK plugin:

2020/11/09 16:05:39:251 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
2020/11/09 16:05:39:251 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/11/09 16:05:39:251 notice dpdk i40evf_init_vf(): check_api version failed
2020/11/09 16:05:39:251 notice dpdk i40evf_dev_init(): Init vf failed
2020/11/09 16:05:39:251 notice dpdk EAL: Releasing pci mapped resource for 0000:91:02.5
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000
2020/11/09 16:05:39:251 notice dpdk EAL: Requested device 0000:91:02.5 cannot be used

There are multiple variations of the same failure (DPDK failing to talk to the PF). I've documented them here:
https://jira.fd.io/browse/VPP-1943

This leads me to think we're dealing with a race condition when multiple VPPs are trying to access the same PF (they're using different VFs that belong to the same PF) during VPP startup.

In case this is a DPDK bug, I've created a bug in their bugzilla:
https://bugs.dpdk.org/show_bug.cgi?id=578

How do we debug this further? Putting together a script that loops over multiple VPPs starting at the same time should reproduce this issue, but I don't know what to look for. We could also try updating firmware/kernel (for a newer vfio-pci version). I've documented the versions we use on aarch64/x86_64 in the Jira ticket.

What do you think?

Juraj
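
P.S. To make the "what to look for" part a bit more concrete, here is a rough sketch of a log check that a reproduction loop could run after each round of concurrent VPP startups. It only assumes that each VPP instance's log is available as text. The function name scan_vpp_log and the list of markers are my own and are taken purely from the excerpt above, so the other variations documented in VPP-1943 would still need to be added. Starting the VPP instances themselves is deliberately left out.

```python
import re

# Patterns derived from the log excerpt in this mail (assumption: other
# failure variations from VPP-1943 use different messages and are not covered).
PROBE = re.compile(r"Probe PCI driver: .* device: (\S+)")
UNUSABLE = re.compile(r"Requested device (\S+) cannot be used")
FAILURE_MARKERS = (
    "PF/VF API version mismatch",
    "check_api version failed",
    "Init vf failed",
)


def scan_vpp_log(text):
    """Return {pci_address: [failure messages]} for VFs that DPDK could not use."""
    failed = {}
    current = None  # PCI address of the device most recently probed
    for line in text.splitlines():
        probe = PROBE.search(line)
        if probe:
            current = probe.group(1)
            continue
        if current and any(marker in line for marker in FAILURE_MARKERS):
            # Keep only the message part after the "dpdk" log class.
            message = line.split(" dpdk ", 1)[-1].strip()
            failed.setdefault(current, []).append(message)
            continue
        unusable = UNUSABLE.search(line)
        if unusable:
            # Record the device even if none of the known markers preceded it.
            failed.setdefault(unusable.group(1), [])
    return failed


SAMPLE_LOG = """\
2020/11/09 16:05:39:251 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
2020/11/09 16:05:39:251 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/11/09 16:05:39:251 notice dpdk i40evf_init_vf(): check_api version failed
2020/11/09 16:05:39:251 notice dpdk i40evf_dev_init(): Init vf failed
2020/11/09 16:05:39:251 notice dpdk EAL: Requested device 0000:91:02.5 cannot be used
"""

result = scan_vpp_log(SAMPLE_LOG)
for address, messages in result.items():
    print(address, messages)
```

Running this over the logs of every instance in a round would show which VFs were dropped and with which message, which could then be compared against the start times of the other instances on the same PF.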