Hi DPDK devs,

A while back I submitted this bug: https://bugs.dpdk.org/show_bug.cgi?id=578
and now we have a pretty good idea where the issue stems from. TL;DR: it seems
to be in either the XL710 firmware or the i40e driver, with downstream effects
which we may need to address in DPDK.

What is the issue?
We're using an XL710 NIC with SR-IOV set up, with multiple virtual functions
(VFs) that belong to the same physical function (PF). We're observing
intermittent failures when multiple DPDK EAL instances try to initialize
different VFs of that PF at the same time. One of the failures looks like this:
i40evf_check_api_version(): PF/VF API version mismatch:(0.0)-(1.1)

This results in VPP (which uses DPDK to initialize these VFs) not being able to
use the VFs. There is an associated syslog message from the PF driver:

[Thu Dec  3 02:30:56 2020] i40e 0000:05:00.1: Unable to send the message to VF 49 aq_err 12

Digging through the sources, we found that aq_err 12 maps to this error code:
https://elixir.bootlin.com/linux/v4.15/source/drivers/net/ethernet/intel/i40evf/i40e_adminq_cmd.h#L115

This suggests it's an issue with either the driver or firmware and that leads 
us to two questions:
1) Is this condition expected? What causes this contention, is it normal, and
what is the correct behavior for the calling code?
2) If the answer to the previous question is "yes", then, since the caller in
this case is the DPDK initialization code, should we address it there (e.g.
with some retries or a lock)?
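
To make the "retries" option concrete, here is a minimal sketch of what a
bounded retry with exponential backoff around the version check could look
like. Everything in it is hypothetical: vf_check_fn stands in for whatever
wraps the PF/VF API version exchange, delay_fn stands in for a delay primitive
(in DPDK that might be something like rte_delay_ms), and flaky_check and
fake_delay exist only to exercise the loop. It is not taken from the i40evf
PMD, and it does not settle whether retrying is the right fix.

```c
#include <stddef.h>

/* Hypothetical: wraps the PF/VF version exchange; 0 on success, <0 on failure. */
typedef int (*vf_check_fn)(void *ctx);
/* Hypothetical: blocks the caller for roughly `ms` milliseconds. */
typedef void (*delay_fn)(unsigned int ms);

/*
 * Call check() up to max_attempts times. After each failed attempt except
 * the last one, wait and double the wait time. Returns 0 on the first
 * success, otherwise the result of the last failed attempt (or -1 if
 * max_attempts is 0).
 */
static int retry_with_backoff(vf_check_fn check, void *ctx,
                              unsigned int max_attempts,
                              unsigned int base_delay_ms, delay_fn delay)
{
    int ret = -1;
    unsigned int delay_ms = base_delay_ms;

    for (unsigned int attempt = 1; attempt <= max_attempts; attempt++) {
        ret = check(ctx);
        if (ret == 0)
            return 0;
        if (attempt == max_attempts)
            break; /* no point in waiting after the final attempt */
        delay(delay_ms);
        delay_ms *= 2;
    }
    return ret;
}

/* Demo helpers, for illustration only. */
static unsigned int total_delay_ms;

/* Records the requested delay instead of sleeping. */
static void fake_delay(unsigned int ms)
{
    total_delay_ms += ms;
}

/* Fails as many times as *ctx says, then succeeds. */
static int flaky_check(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -1;
    }
    return 0;
}
```

With flaky_check set to fail twice and a budget of five attempts, the call
succeeds on the third attempt. A lock shared between the EAL instances would
be the alternative: it avoids the contention instead of recovering from it,
at the cost of cross-process coordination.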

Are there any Intel (or SR-IOV) experts who could help answer the first
question? Or should we address this in DPDK regardless of what the expected
behavior is?

This is just a short description; there's more information in Bugzilla.

Thanks,
Juraj
