On 4/24/2024 11:28 AM, Bruce Richardson wrote: > On Wed, Apr 24, 2024 at 11:21:52AM +0100, Ferruh Yigit wrote: >> On 4/23/2024 9:56 PM, Mattias Rönnblom wrote: >>> On 2024-04-23 13:15, Ferruh Yigit wrote: >>>> On 4/23/2024 10:08 AM, Mattias Rönnblom wrote: >>>>> Cache align Rx and Tx queue struct to avoid false sharing. >>>>> >>>>> RX struct happens to be 64 bytes on x86_64 already, so cache >>>>> alignment makes no change there, but it does on 32-bit ISAs. >>>>> >>>>> TX struct is 56 bytes on x86_64. >>>>> >>>> >>>> Hi Mattias, >>>> >>>> No objection to the patch. Is the improvement theoretical or do you >>>> measure any improvement practically, if so how much is the >>>> improvement? >>>> >>> >>> I didn't run any benchmarks. >>> >>> Two cores storing to a (falsely) shared cache line on a per-packet >>> basis is going to be very expensive, at least for "light touch" >>> applications. >>> >> >> ack I expect for af_packet bottleneck is the kernel side, so I don't >> expect any visible improvement practically, but OK to fix this >> theoretical issue. >> >>>>> Both structs keep counters, and in the RX case they are updated even >>>>> for empty polls. >>>>> >>>> >>>> Do you think does it help if move 'rx_pkts' & 'rx_bytes' update within >>>> the loop? >>>> >>> >>> No, why? Wouldn't that be worse? Especially since rx_pkts and rx_bytes >>> are declared volatile, so you are forcing a load-modify-store cycle for >>> every increment. >>> >> >> My intention was to prevent updating stats in empty polls, I thought >> stats will be hot in the cache but won't work with volatile. >> > Yes, it will. Volatile only prevents caching in registers, it does not > affect the storing of data within the cache hierarchy. Reads/writes of > stats on empty polls should indeed hit the L1 as expected. However, that is > still less efficient than just doing a register increment which could > theoretically be the result without the volatile. > >
Thanks Bruce for clarification.