It happens in CI on x86 as well (with the older PF driver), but I didn't reproduce it manually, since I didn't want to touch x86 hardware in production (x86 is running voting jobs, aarch64 is running non-voting, so aarch64 is a bit safer to tinker with).
Juraj

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Damjan Marion via lists.fd.io
Sent: Wednesday, December 2, 2020 4:48 PM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>
Cc: Benoit Ganne (bganne) <bga...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io; Andrew Yourtchenko (ayourtch) <ayour...@cisco.com>
Subject: Re: [vpp-dev] VPP Device jobs randomly failing

I doubt changing anything around vfio-pci will help. That module doesn’t participate in communication between PF and VF. Indeed, this looks like a PF driver bug.

This is on AArch64, right? Are you able to repro on x86?

-- Damjan

On 02.12.2020., at 14:44, Juraj Linkeš <juraj.lin...@pantheon.tech> wrote:

Updating to the latest PF driver version (2.13.10) did not help. I'm seeing the same failures. We'll talk about other options in the CSIT calls, things like whether it makes sense to try newer firmware or vfio-pci versions.

Juraj

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Juraj Linkeš
Sent: Wednesday, December 2, 2020 11:05 AM
To: Damjan Marion (damarion) <damar...@cisco.com>
Cc: Benoit Ganne (bganne) <bga...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io; Andrew Yourtchenko (ayourtch) <ayour...@cisco.com>
Subject: Re: [vpp-dev] VPP Device jobs randomly failing

I've looked into this a bit more and I'm seeing an error with avf in the logs, but it doesn't actually impact VPP negatively:

2020/12/02 09:36:48:219 error avf 0000:05:10.0: send_to_pf failed (timeout 1.269s)

And a different log:

2020/12/02 09:36:47:176 error avf 0000:05:10.2: aq_desc_enq failed (timeout .266s)

When this error appears in the logs, the interfaces take a bit longer to show up in show int, whereas with dpdk they never show up.
This hints at an issue with the PF driver, since avf seems to be able to handle the error. The PF driver is not the latest and I'll try to test with the latest. We'll then need to document that users will need to update their Ubuntu 18.04 drivers or document a known issue with old drivers (if the newer version fixes it). I'll update this thread when I test the latest version.

Juraj

From: Damjan Marion (damarion) <damar...@cisco.com>
Sent: Wednesday, November 25, 2020 5:06 PM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>
Cc: Benoit Ganne (bganne) <bga...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; csit-...@lists.fd.io; Andrew Yourtchenko (ayourtch) <ayour...@cisco.com>
Subject: Re: VPP Device jobs randomly failing

On 25.11.2020., at 16:55, Juraj Linkeš <juraj.lin...@pantheon.tech> wrote:

Hi Damjan, Benoit,

In the CSIT call I've learned that you were looking into why VPP Device jobs sometimes fail. I'm working on adding VPP Device jobs for arm and we're seeing the same issue that's behind these random failures, possibly even more frequently.
There are two high-level observations I'll start with:
• The surface-level issue is that VPP doesn't return interface dump info from all interfaces (we're using 2 VFs in the tests and sometimes we're getting info from only 1 and sometimes not even that)
• The issue happens only when there are multiple jobs running on the same server

Looking at the logs, I was able to figure out that the failure is tied to VPP startup, particularly this in the DPDK plugin:

2020/11/09 16:05:39:251 notice dpdk EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
2020/11/09 16:05:39:251 notice dpdk i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/11/09 16:05:39:251 notice dpdk i40evf_init_vf(): check_api version failed
2020/11/09 16:05:39:251 notice dpdk i40evf_dev_init(): Init vf failed
2020/11/09 16:05:39:251 notice dpdk EAL: Releasing pci mapped resource for 0000:91:02.5
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000
2020/11/09 16:05:39:251 notice dpdk EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000
2020/11/09 16:05:39:251 notice dpdk EAL: Requested device 0000:91:02.5 cannot be used

There are multiple variations of the same failure (DPDK failing to talk to the PF). I've documented them here: https://jira.fd.io/browse/VPP-1943

This leads me to think we're dealing with a race condition when multiple VPPs are trying to access the same PF (they're using different VFs that belong to the same PF) during VPP startup. In case this is a DPDK bug, I've created a bug in their bugzilla: https://bugs.dpdk.org/show_bug.cgi?id=578

How do we debug this further? Putting together a script that loops over multiple VPPs starting at the same time should reproduce this issue, but I don't know what to look for. We could also try updating firmware/kernel (for a newer vfio-pci version). I've documented the versions we use on aarch64/x86_64 in the Jira ticket.

What do you think?
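
A reproduction loop of the kind mentioned above could be sketched roughly as follows. This is only a sketch under stated assumptions, not CSIT code: `run_round`, `reproduce` and the `commands` argument are hypothetical names, and the command line that actually starts each VPP instance on its own VF is deliberately left to the caller. Each command is assumed to start one instance, print its startup log, and exit. The only thing checked is the "PF/VF API version mismatch" string from the DPDK log above.

```python
import subprocess
import threading

# Substring taken from the DPDK startup log quoted in this thread.
SIGNATURE = "PF/VF API version mismatch"


def run_round(commands, timeout=60.0):
    """Start all commands at (nearly) the same instant and collect their output.

    `commands` is a list of argv lists, one per instance. What they contain
    (how each VPP is started on its own VF) is an assumption left to the caller.
    """
    barrier = threading.Barrier(len(commands))
    outputs = [""] * len(commands)

    def worker(index, argv):
        barrier.wait()  # release every worker together to provoke the race
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                timeout=timeout,
            )
            outputs[index] = proc.stdout
        except subprocess.TimeoutExpired:
            outputs[index] = ""  # a hung instance yields no output here

    threads = [
        threading.Thread(target=worker, args=(i, argv))
        for i, argv in enumerate(commands)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return outputs


def reproduce(commands, rounds=10):
    """Repeat the concurrent start; return (round, instance) pairs that hit SIGNATURE."""
    hits = []
    for round_no in range(rounds):
        for instance, output in enumerate(run_round(commands)):
            if SIGNATURE in output:
                hits.append((round_no, instance))
    return hits
```

If the race theory holds, running this with a single command should never produce hits, while running it with several commands that share one PF should produce them occasionally.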
This looks to me like a Linux PF driver issue. Are you using the latest Intel-provided PF driver [1]? (In case you find it useful, I’m maintaining my own DKMS Debian packaging [2] for the Intel driver.)

Do you see the same issue with the avf plugin?

[1] https://sourceforge.net/projects/e1000/files/i40e%20stable/
[2] https://github.com/dmarion/deb-i40e

-- Damjan
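
To compare the dpdk and avf behaviour across runs, the failure messages quoted in this thread could be counted per VF with a small log parser. The following is a minimal sketch that assumes log lines keep the format shown in the excerpts. The function name `count_signatures` and the labels in `SIGNATURES` are made up for illustration, and only the three messages quoted in the thread are recognised. The DPDK mismatch line carries no PCI address, so it is attributed to the most recently seen address, which may be wrong if output from several VFs is interleaved.

```python
import re
from collections import Counter

# PCI address in domain:bus:device.function form, e.g. 0000:91:02.5
PCI_ADDR = re.compile(r"\b[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b")

# Illustrative labels mapped to substrings copied from the logs in this thread.
SIGNATURES = {
    "dpdk_api_mismatch": "PF/VF API version mismatch",
    "avf_send_to_pf": "send_to_pf failed",
    "avf_aq_desc_enq": "aq_desc_enq failed",
}


def count_signatures(lines):
    """Count known failure messages, keyed by (label, PCI address)."""
    counts = Counter()
    last_addr = None  # most recent PCI address seen in the log
    for line in lines:
        match = PCI_ADDR.search(line)
        if match:
            last_addr = match.group(0)
        for label, needle in SIGNATURES.items():
            if needle in line:
                counts[(label, last_addr)] += 1
    return counts
```

It can be fed an open log file directly, for example `count_signatures(open(path))`, where `path` is whatever file the VPP log was saved to.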