Public bug reported:

SRU Justification:

[Impact]

The issue is reproducible on systems that already carry heavy background traffic.
On top of that traffic, the user issues one of the two docker pulls below:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa

The second one is a very large container (17 GB).

When the user runs docker pull, the OOB interface stops responding to ping, and the docker pull either stalls for a very long time (more than 3 minutes) or times out.

[Fix]

* Update the RX_CQE_CI before updating the RX_PI to avoid a race condition where we wrongly inform HW that there is space for the WQE.
* Disable the RX DMA while we are handling incoming packets to avoid overflow.

[Test Case]

* Created a script which loops 200 times and does a docker pull in each loop:
docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa

[Regression Potential]

* This could result in slower packet handling since we are disabling/enabling the DMA periodically.
* Although this fix has been tested by the people who opened the bug, QA needs to test it thoroughly to confirm that the issue is no longer reproducible.

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-bluefield in Ubuntu.
https://bugs.launchpad.net/bugs/1964984

Title:
  Fix OOB handling RX packets in heavy traffic

Status in linux-bluefield package in Ubuntu:
  New
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-bluefield/+bug/1964984/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
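For illustration, the loop described under [Test Case] could look roughly like the sketch below. This is not the script that was actually used: the `DOCKER` variable and the `pull_loop` function are hypothetical names, and the report does not say whether the image is removed between iterations, so this sketch does not remove it.

```shell
#!/bin/sh
# Sketch of a repeated-pull test loop (assumed structure, not the original script).

# Command used for pulling; overridable so the loop can be dry-run without docker.
DOCKER="${DOCKER:-docker}"

# pull_loop COUNT IMAGE
# Pulls IMAGE COUNT times, reports each failed iteration on stderr,
# prints a summary line, and returns 0 only if every pull succeeded.
pull_loop() {
    count="$1"
    image="$2"
    failures=0
    i=1
    while [ "$i" -le "$count" ]; do
        if ! "$DOCKER" pull "$image"; then
            echo "iteration $i: pull failed" >&2
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "failures: $failures of $count"
    [ "$failures" -eq 0 ]
}
```

With the functions loaded, `pull_loop 200 nvcr.io/ea-doca-hbn/hbn/hbn:latest` would run the 200 iterations mentioned in the test case; the background traffic described under [Impact] has to be generated separately.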