Hi Damjan, Benoit,

On the CSIT call I learned that you were looking into why VPP Device jobs 
sometimes fail. I'm working on adding VPP Device jobs for Arm, and we're seeing 
the same issue that's behind these random failures, possibly even more 
frequently.

I'll start with two high-level observations:

* The surface-level issue is that VPP doesn't return interface dump info
  from all interfaces (we're using 2 VFs in the tests; sometimes we get
  info from only 1, and sometimes from neither).

* The issue happens only when multiple jobs are running on the same
  server.

Looking at the logs, I was able to figure out that the failure is tied to VPP 
startup, in particular to this output from the DPDK plugin:
2020/11/09 16:05:39:251 notice     dpdk           EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
2020/11/09 16:05:39:251 notice     dpdk           i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/11/09 16:05:39:251 notice     dpdk           i40evf_init_vf(): check_api version failed
2020/11/09 16:05:39:251 notice     dpdk           i40evf_dev_init(): Init vf failed
2020/11/09 16:05:39:251 notice     dpdk           EAL: Releasing pci mapped resource for 0000:91:02.5
2020/11/09 16:05:39:251 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000
2020/11/09 16:05:39:251 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000
2020/11/09 16:05:39:251 notice     dpdk           EAL: Requested device 0000:91:02.5 cannot be used
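
To triage logs like the excerpt above across many job runs, something along 
these lines could pull out the affected devices. This is only a sketch: the 
patterns are taken from this one excerpt, so the other failure variations would 
need their own patterns, and the function name summarize_dpdk_log is made up 
for illustration.

```python
import re

# Patterns derived from the single log excerpt above (an assumption:
# other failure variants will need additional patterns).
PROBE_RE = re.compile(r"EAL: Probe PCI driver: (\S+) \(\S+\) device: (\S+)")
MISMATCH_RE = re.compile(
    r"PF/VF API version mismatch:\((\d+\.\d+)\)-\((\d+\.\d+)\)"
)
UNUSABLE_RE = re.compile(r"EAL: Requested device (\S+) cannot be used")


def summarize_dpdk_log(text):
    """Return probed devices, version mismatches and unusable devices."""
    # Collapse all whitespace so entries that were hard-wrapped
    # (for example by a mail client) still match.
    flat = " ".join(text.split())
    return {
        "probed": PROBE_RE.findall(flat),        # (driver, PCI address)
        "mismatches": MISMATCH_RE.findall(flat), # (first, second) version
        "unusable": UNUSABLE_RE.findall(flat),   # PCI addresses
    }
```

Any PCI address that shows up under "unusable" is a VF that VPP will not 
report in the interface dump for that run.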

There are multiple variations of the same failure (DPDK failing to talk to the 
PF). I've documented them here: https://jira.fd.io/browse/VPP-1943

This leads me to think we're dealing with a race condition when multiple VPPs 
are trying to access the same PF (they're using different VFs that belong to 
the same PF) during VPP startup.

In case this is a DPDK bug, I've also filed it in their Bugzilla: 
https://bugs.dpdk.org/show_bug.cgi?id=578

How do we debug this further? Putting together a script that loops over 
multiple VPPs starting at the same time should reproduce this issue, but I 
don't know what to look for. We could also try updating firmware/kernel (for a 
newer vfio-pci version). I've documented the versions we use on aarch64/x86_64 
in the Jira ticket.
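
A rough sketch of what such a reproduction loop could look like is below. It 
assumes that each command is a hypothetical wrapper which starts one VPP 
instance on its own VF, prints that instance's startup log and then exits; the 
wrapper itself, the function names and the timeout are placeholders, not 
anything that exists in CSIT today. The failure markers are the strings from 
the log above.

```python
import subprocess
import threading

# Strings seen in the failing startup log; other variants may differ.
FAILURE_MARKERS = (
    "PF/VF API version mismatch",
    "Init vf failed",
    "cannot be used",
)


def run_concurrently(commands, timeout=60):
    """Start all commands at nearly the same moment; return their output."""
    barrier = threading.Barrier(len(commands))
    outputs = [""] * len(commands)

    def worker(index, command):
        barrier.wait()  # release every worker together to overlap startups
        try:
            proc = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                timeout=timeout,
            )
            outputs[index] = proc.stdout
        except subprocess.TimeoutExpired:
            outputs[index] = ""  # a hung wrapper yields no output to scan

    threads = [
        threading.Thread(target=worker, args=(i, cmd))
        for i, cmd in enumerate(commands)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outputs


def failed_instances(outputs):
    """Return indexes of instances whose output contains a failure marker."""
    return [
        i for i, out in enumerate(outputs)
        if any(marker in out for marker in FAILURE_MARKERS)
    ]
```

Calling run_concurrently in a loop until failed_instances returns a non-empty 
list would give a failure rate per number of parallel instances, which could 
then be compared across firmware and kernel versions.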

What do you think?
Juraj