Re: [vpp-dev] VPP Device jobs randomly failing

Juraj Linkeš Wed, 02 Dec 2020 23:11:59 -0800

It happens in CI on x86 as well (with the older PF driver), but I didn't 
reproduce it manually, since I didn't want to touch x86 hardware in production 
(x86 is running voting jobs, aarch64 is running non-voting, so aarch64 is a bit 
safer to tinker with).

Juraj

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Damjan Marion via 
lists.fd.io
Sent: Wednesday, December 2, 2020 4:48 PM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>
Cc: Benoit Ganne (bganne) <bga...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; 
csit-...@lists.fd.io; Andrew Yourtchenko (ayourtch) <ayour...@cisco.com>
Subject: Re: [vpp-dev] VPP Device jobs randomly failing

I doubt changing anything around vfio-pci will help. That module doesn’t 
participate in communication between PF and VF.

Indeed, this looks like a PF driver bug. This is on AArch64, right? Are you 
able to repro on x86?

—
Damjan

On 02.12.2020., at 14:44, Juraj Linkeš <juraj.lin...@pantheon.tech> wrote:

Updating to the latest PF version (2.13.10) did not help. I'm seeing the same 
failures. We'll talk about other options in the CSIT calls, things like whether 
it makes sense to try newer firmware or vfio-pci versions.

Juraj

From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Juraj Linkeš
Sent: Wednesday, December 2, 2020 11:05 AM
To: Damjan Marion (damarion) <damar...@cisco.com>
Cc: Benoit Ganne (bganne) <bga...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; 
csit-...@lists.fd.io; Andrew Yourtchenko (ayourtch) <ayour...@cisco.com>
Subject: Re: [vpp-dev] VPP Device jobs randomly failing

I've looked into this a bit more and I'm seeing an error with avf in logs, but 
that actually doesn't impact VPP negatively:
2020/12/02 09:36:48:219 error      avf            0000:05:10.0: send_to_pf failed (timeout 1.269s)
And a different log:
2020/12/02 09:36:47:176 error      avf            0000:05:10.2: aq_desc_enq failed (timeout .266s)

When this error appears in logs, the interfaces take a bit longer to show up in 
show int, whereas with dpdk they never show up. This hints at an issue with the 
PF driver, since avf seems to be able to handle the error.

The PF driver is not the latest and I'll try to test with the latest. We'll 
then need to document that users will need to update their Ubuntu 18.04 drivers 
or document a known issue with old drivers (if the newer version fixes it). I'll 
update this thread when I test the latest version.

Juraj

From: Damjan Marion (damarion) <damar...@cisco.com>
Sent: Wednesday, November 25, 2020 5:06 PM
To: Juraj Linkeš <juraj.lin...@pantheon.tech>
Cc: Benoit Ganne (bganne) <bga...@cisco.com>; vpp-dev <vpp-dev@lists.fd.io>; 
csit-...@lists.fd.io; Andrew Yourtchenko (ayourtch) <ayour...@cisco.com>
Subject: Re: VPP Device jobs randomly failing

On 25.11.2020., at 16:55, Juraj Linkeš <juraj.lin...@pantheon.tech> wrote:

Hi Damjan, Benoit,

In the CSIT call I've learned that you were looking into why VPP Device jobs 
sometimes fail. I'm working on adding VPP Device jobs for arm and we're seeing 
the same issue that's behind these random failures, possibly even more 
frequently.

There are two high-level observations I'll start with:
• The surface-level issue is that VPP doesn't return interface dump info from 
all interfaces (we're using 2 VFs in the tests; sometimes we get info from 
only 1, and sometimes not even that)
• The issue happens only when multiple jobs are running on the same server

Looking at logs, I was able to figure out that the failure is tied to VPP 
startup, particularly this output from the DPDK plugin:
2020/11/09 16:05:39:251 notice     dpdk           EAL: Probe PCI driver: net_i40e_vf (8086:154c) device: 0000:91:02.5 (socket 1)
2020/11/09 16:05:39:251 notice     dpdk           i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)
2020/11/09 16:05:39:251 notice     dpdk           i40evf_init_vf(): check_api version failed
2020/11/09 16:05:39:251 notice     dpdk           i40evf_dev_init(): Init vf failed
2020/11/09 16:05:39:251 notice     dpdk           EAL: Releasing pci mapped resource for 0000:91:02.5
2020/11/09 16:05:39:251 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101014000
2020/11/09 16:05:39:251 notice     dpdk           EAL: Calling pci_unmap_resource for 0000:91:02.5 at 0x2101024000
2020/11/09 16:05:39:251 notice     dpdk           EAL: Requested device 0000:91:02.5 cannot be used

There are multiple variations of the same failure (DPDK failing to talk to the 
PF). I've documented them here: https://jira.fd.io/browse/VPP-1943
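
To make these variations easier to spot across many job logs, a small scanner 
along the following lines might help. This is only a sketch: the signature 
strings are copied from the log excerpts quoted in this thread, the real logs 
may contain other variants (see the Jira ticket), and the function name 
scan_log is a hypothetical helper, not part of VPP, DPDK or CSIT.

```python
import re
from collections import defaultdict

# Substrings taken from the log excerpts quoted in this thread.
# Assumption: one log entry per line; other failure variants may exist.
SIGNATURES = (
    "PF/VF API version mismatch",
    "Init vf failed",
    "cannot be used",
    "send_to_pf failed",
    "aq_desc_enq failed",
)

# PCI address in domain:bus:device.function form, e.g. 0000:91:02.5
PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def scan_log(text):
    """Map each failure signature to the PCI addresses seen on matching lines.

    Lines that match a signature but carry no PCI address are recorded
    under the placeholder "unknown".
    """
    hits = defaultdict(set)
    for line in text.splitlines():
        for sig in SIGNATURES:
            if sig in line:
                addrs = PCI_ADDR.findall(line)
                hits[sig].update(addrs or ["unknown"])
    return dict(hits)
```

Feeding a whole VPP log to scan_log() returns a dictionary keyed by signature, 
which should make it easy to see which VFs failed and in which way.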

This leads me to think we're dealing with a race condition when multiple VPPs 
are trying to access the same PF (they're using different VFs that belong to 
the same PF) during VPP startup.

In case this is a DPDK bug, I've created a bug in their bugzilla: 
https://bugs.dpdk.org/show_bug.cgi?id=578

How do we debug this further? Putting together a script that loops over 
multiple VPPs starting at the same time should reproduce this issue, but I 
don't know what to look for. We could also try updating firmware/kernel (for a 
newer vfio-pci version). I've documented the versions we use on aarch64/x86_64 
in the Jira ticket.
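
A minimal sketch of such a reproduction loop, in Python, under stated 
assumptions: each entry in commands is a hypothetical wrapper (not an existing 
VPP or CSIT tool) that starts one VPP instance on its own VF, dumps the 
interfaces and then exits, and is_failure is a caller-supplied check on the 
exit code and captured output. The names start_together and reproduce are 
made up for illustration.

```python
import subprocess
import threading


def start_together(commands, timeout=60):
    """Run all commands, starting them as close to simultaneously as possible.

    Returns a list of (returncode, combined_output) tuples in command order.
    A command that exceeds the timeout is reported as (None, "").
    """
    barrier = threading.Barrier(len(commands))
    results = [None] * len(commands)

    def worker(index, argv):
        barrier.wait()  # all launcher threads are released at the same moment
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
                timeout=timeout,
            )
            results[index] = (proc.returncode, proc.stdout)
        except subprocess.TimeoutExpired:
            results[index] = (None, "")

    threads = [
        threading.Thread(target=worker, args=(i, argv))
        for i, argv in enumerate(commands)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def reproduce(commands, iterations, is_failure):
    """Repeat the simultaneous start; return the iteration numbers that failed."""
    failed = []
    for n in range(iterations):
        results = start_together(commands)
        if any(is_failure(rc, out) for rc, out in results):
            failed.append(n)
    return failed
```

For example, reproduce(cmds, 100, lambda rc, out: "Init vf failed" in out) 
would report which of 100 rounds hit the DPDK failure quoted above, assuming 
the wrapper prints the VPP log to its standard output.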

What do you think?

This looks to me like a Linux PF driver issue.
Are you using the latest Intel-provided PF driver [1]?

(In case you find it useful, I'm maintaining my own DKMS Debian packaging [2] 
for the Intel driver.)

Do you see the same issue with the avf plugin?

[1] https://sourceforge.net/projects/e1000/files/i40e%20stable/
[2] https://github.com/dmarion/deb-i40e

—
Damjan
