Public bug reported:

SRU Justification:

[Impact]

This issue is reproducible on systems that already carry heavy background
traffic when the user additionally issues one of the two docker pulls below:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa

The second image is very large (17 GB).

While docker pull is running, the OOB interface stops responding to ping,
and the docker pull either stalls for a very long time (more than 3 minutes)
or times out.

[Fix]

* Update the RX_CQE_CI before updating the RX_PI to avoid a race condition
in which we wrongly inform HW that there is space for the WQE.
* Disable the RX DMA while we are handling incoming packets to avoid overflow.

[Test Case]

* A script was created that loops 200 times and runs one of the following
docker pulls in each iteration:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa
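
The loop described above could look roughly like the following sketch. It is
not the script used by the reporters. The pull_loop function name, the DOCKER
override variable and the removal of the image between iterations (so that
every pull transfers data again) are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the stress loop from the test case.
# Assumptions (not from the bug report): the DOCKER override variable,
# the pull_loop name, and the "rmi" step between iterations.

DOCKER="${DOCKER:-docker}"

# pull_loop IMAGE [COUNT]
# Pulls IMAGE COUNT times (default 200), prints a one-line summary and
# returns non-zero if any pull failed. The body runs in a subshell so
# its variables do not leak into the caller.
pull_loop() (
    image="$1"
    count="${2:-200}"
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$DOCKER" pull "$image" >/dev/null; then
            echo "iteration $i: docker pull failed" >&2
            failures=$((failures + 1))
        fi
        # Drop the local copy so the next pull downloads the layers again.
        "$DOCKER" rmi -f "$image" >/dev/null 2>&1 || true
        i=$((i + 1))
    done
    echo "pulls=$count failures=$failures"
    [ "$failures" -eq 0 ]
)

# Example invocation (uncomment to run against a real registry):
# pull_loop nvcr.io/ea-doca-hbn/hbn/hbn:latest 200
```

Running ping against the OOB address from another host while the loop is
active would show whether the interface stays reachable. The second image is
hosted on an internal registry and presumably requires access to it.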

[Regression Potential]

* Packet handling could become slower, since the RX DMA is now disabled and
re-enabled periodically.
* Although this fix has been tested by the people who opened the bug, QA needs
to test it thoroughly to confirm that the issue is no longer reproducible.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1964984

Title:
  Fix OOB handling RX packets in heavy traffic
