> The first step is to check if the linked patch fixes the issue at hand,
> could you please give it a try?

Sorry for the long delay. Verification took some time because I do not
have free access to the environment I used for testing. I apologize for
the inconvenience.

I have confirmed that applying patch 1a0f25a52e08 to the Ubuntu 22.04
kernel (5.15) resolves the issue.

I’ve attached graphs showing the transmit/receive statistics before and
after applying the patch. The data before May 30th is from before the
patch was applied, and the data after May 30th is from after the patch.

However, since kernel 5.15 still uses struct ice_ring instead of
struct ice_tx_ring, I was not able to apply the patch as-is. I had to
make two small modifications to replace struct ice_tx_ring with
struct ice_ring.

As the attached graphs show, patch 1a0f25a52e08 appears to be effective
on 5.15 as well, so I would greatly appreciate it if you could consider
backporting it.

On Mon, Feb 24, 2025 at 20:21, Przemek Kitszel <[email protected]> wrote:
> On 2/21/25 04:12, Masakazu Asama wrote:
> > We have observed a very rare issue in Intel E810 environments where
> > SNMP-retrieved TX/RX counter values are sometimes nearly twice the
> > actual values.
> >
> > Upon investigation, we identified a problem in the process that updates
> > the transmit and receive ring statistics in the ice driver. This issue
> > occurs when the counter update process is executed simultaneously on
> > different CPU cores.
> >
> > I have attached a patch to fix this issue.
> >
> > This patch is intended for Linux kernel 5.15 on Ubuntu 22.04, as my
> > environment is Ubuntu 22.04.
> >
> > In my test environment, applying this patch prevents the issue from
> > occurring.
> >
> > The function ice_update_vsi_ring_stats takes a pointer to a struct
> > ice_vsi as an argument. This structure is allocated on the heap and
> > shared across all CPU cores. The function resets the counter values to
> > zero and then accumulates the values from each ring of the NIC.
> >
> > However, since struct ice_vsi is shared across all CPU cores, the
> > following race condition can occur when ice_update_vsi_ring_stats is
> > executed simultaneously on different CPUs:
> >
> > 1. Multiple CPU cores reset the counter values in struct ice_vsi to zero
> > at the same time.
> >
> > 2. Each CPU core independently increments the counter values.
> >
> > As a result, the counter values may be updated to a higher-than-actual
> > value.
>
> We had observed other problems caused by the very same shared data; it
> was already fixed as part of kernel 5.16 via
> commit 1a0f25a52e08 ("ice: safer stats processing").
> Sadly it was not backported to 5.15.
>
> From your proposed patch I could tell that the fix is not present on
> your Ubuntu kernel.
>
> The first step is to check if the linked patch fixes the issue at hand,
> could you please give it a try?
>
> >
> > The attached patch modifies the implementation to store the counter
> > values on the stack, initialize them to zero, increment them with the
> > values from each ring, and finally update struct ice_vsi. By avoiding
> > the use of shared data for intermediate calculations, this fix prevents
> > the issue.
> >
> > In my environment, multiple Intel E810 NICs are bonded together.
> >
> > I use Zabbix to graph the RX/TX counters of the bonding interface.
> > However, due to the way bonding ignores decreases in the counters of
> > slave interfaces, this issue makes the statistics completely unreliable.
> >
> > Graphs generated from the slave interfaces may appear normal because,
> > even if the counter temporarily increases, it is corrected in the next
> > observation.
> >
> > When I reported this issue to the Ubuntu bug tracking system, I was told
> > to get it merged upstream first.
> >
> > I would like this issue to be fixed, but what should I do to get it
> > accepted?
> >
> > Any advice would be greatly appreciated.
>
> You hit the correct mailing list for the upstream process.
>
> Process is a bit different depending on whether we will need to just
> backport Jesse's patch or parts of yours. For backports you will reach
> to [email protected]
>
> One more question prior to adding more patches: does the issue reproduce
> with the current kernel (6.13, or even better if net-next:
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git )
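
For reference, here is a minimal user-space sketch of the race described
in the quoted report and of the stack-accumulation approach. It is not
code from the ice driver or from commit 1a0f25a52e08. The types
ring_stats and vsi_stats and the functions racy_reset, racy_accumulate,
and update_via_local are hypothetical names made up for this
illustration. The racy path is split into two steps only so that the
bad interleaving can be replayed deterministically without threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; NOT the real ice driver types. */
struct ring_stats {
	uint64_t pkts;
	uint64_t bytes;
};

struct vsi_stats {
	uint64_t tx_pkts;
	uint64_t tx_bytes;
};

/* Racy pattern, step 1: zero the shared totals in place. */
void racy_reset(struct vsi_stats *vsi)
{
	vsi->tx_pkts = 0;
	vsi->tx_bytes = 0;
}

/* Racy pattern, step 2: add every ring directly into the shared totals. */
void racy_accumulate(struct vsi_stats *vsi,
		     const struct ring_stats *rings, size_t n)
{
	assert(rings != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		vsi->tx_pkts += rings[i].pkts;
		vsi->tx_bytes += rings[i].bytes;
	}
}

/* Safer pattern: sum into stack locals, then store each total once. */
void update_via_local(struct vsi_stats *vsi,
		      const struct ring_stats *rings, size_t n)
{
	uint64_t pkts = 0;
	uint64_t bytes = 0;

	assert(rings != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		pkts += rings[i].pkts;
		bytes += rings[i].bytes;
	}
	/* The shared structure never holds a partial or doubled sum. */
	vsi->tx_pkts = pkts;
	vsi->tx_bytes = bytes;
}
```

Replaying the interleaving from steps 1 and 2 of the report (two calls
to racy_reset followed by two calls to racy_accumulate on the same
vsi_stats) leaves totals that are twice the sum of the rings, which
matches the "nearly twice the actual values" symptom. Calling
update_via_local any number of times leaves the plain sum. The sketch
only illustrates the accumulation order; it says nothing about the
locking or memory ordering a real driver needs.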
