On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: > On 2024-04-23 13:15, Ferruh Yigit wrote: >> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>> Cache align Rx and Tx queue struct to avoid false sharing. >>> >>> RX struct happens to be 64 bytes on x86_64 already, so cache alignment >>> makes no change there, but it does on 32-bit ISAs. >>> >>> TX struct is 56 bytes on x86_64. >>> >> >> Hi Mattias, >> >> No objection to the patch. Is the improvement theoretical or do you >> measure any improvement practically, if so how much is the improvement? >> > > I didn't run any benchmarks. > > Two cores storing to a (falsely) shared cache line on a per-packet basis > is going to be very expensive, at least for "light touch" applications. >
ack I expect for af_packet bottleneck is the kernel side, so I don't expect any visible improvement practically, but OK to fix this theoretical issue. >>> Both structs keep counters, and in the RX case they are updated even >>> for empty polls. >>> >> >> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >> the loop? >> > > No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes > are declared volatile, so you are forcing a load-modify-store cycle for > every increment. > My intention was to prevent updating stats in empty polls, I thought stats will be hot in the cache but won't work with volatile. > I would drop "volatile", or replace it with an atomic (although *not* > use an atomic add for incrementing, but rather atomic load + <n> > non-atomic adds + atomic store). > As only single core polls from an Rx queue, I assume atomics is required only for stats reset case. And for that case won't we need atomic add? And do you know if there is a performance difference between atomic add and keeping volatile qualifier? >>> Signed-off-by: Mattias Rönnblom <mattias.ronnb...@ericsson.com> >>> --- >>> drivers/net/af_packet/rte_eth_af_packet.c | 5 +++-- >>> 1 file changed, 3 insertions(+), 2 deletions(-) >>> >>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c >>> b/drivers/net/af_packet/rte_eth_af_packet.c >>> index 397a32db58..28aeb7d08e 100644 >>> --- a/drivers/net/af_packet/rte_eth_af_packet.c >>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c >>> @@ -6,6 +6,7 @@ >>> * All rights reserved. >>> */ >>> +#include <rte_common.h> >>> #include <rte_string_fns.h> >>> #include <rte_mbuf.h> >>> #include <ethdev_driver.h> >>> @@ -53,7 +54,7 @@ struct pkt_rx_queue { >>> volatile unsigned long rx_pkts; >>> volatile unsigned long rx_bytes; >>> -}; >>> +} __rte_cache_aligned; >>> >> >> Latest location for '__rte_cache_aligned' tag is at the beginning of the >> struct [1], so something like: >> `struct __rte_cache_aligned pkt_rx_queue {` >> >> [1] >> https://patchwork.dpdk.org/project/dpdk/list/?series=31746&state=%2A&archive=both