On 2024-03-26 03:19, Michael Ellerman wrote:
> Mathieu Desnoyers <mathieu.desnoy...@efficios.com> writes:
>> Hi,
>
> Hi Mathieu,
>
>> In the powerpc architecture support within the liburcu project [1]
>> we have a cache line size defined as 256 bytes with the following
>> comment:
>>
>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>> #define CAA_CACHE_LINE_SIZE 256
>>
>> I recently received a pull request on GitHub [2] asking to
>> change this to 128 bytes. All the material provided supports
>> that the cache line sizes on powerpc are 128 bytes or less (even
>> L3 on POWER7, POWER8, and POWER9) [3].
>>
>> I wonder where the 256-byte L3 cache line size for POWER5+
>> we have in liburcu comes from, and whether it is the right choice
>> for a cache line size on all powerpc, considering that the Linux
>> kernel cache line size appears to be 128 bytes on recent Power
>> architectures. I recall some benchmark experiments Paul and I did
>> on a 64-core 1.9GHz POWER5+ machine that benefited from a 256-byte
>> cache line size, and I suppose this is why we came up with this
>> value, but I don't have the detailed specs of that machine.
>>
>> Any feedback on this matter would be appreciated.
>
> The ISA doesn't specify the cache line size, other than that it is
> smaller than a page.
>
> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
> bytes. There are some 64-bit CPUs that use 64 bytes, e.g. pasemi PA6T
> and Freescale e6500.
>
> It is possible to discover the cache line size at runtime via the
> auxiliary vector (AUXV). But that's no use if you want a compile-time
> constant.
Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
so changing this would require a soname bump, which I don't want to do
without really good reasons.
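
For reference, the runtime query looks roughly like this (a standalone
sketch, not liburcu code); it assumes a Linux kernel that populates
AT_DCACHEBSIZE in the auxiliary vector, which powerpc does and x86
does not:

/* build: cc -O2 dcachebsize.c */
#include <stdio.h>
#include <sys/auxv.h>   /* getauxval() */
#include <elf.h>        /* AT_DCACHEBSIZE */

int main(void)
{
        unsigned long dcache_line = getauxval(AT_DCACHEBSIZE);

        if (dcache_line)
                printf("data cache block size: %lu bytes\n", dcache_line);
        else
                printf("AT_DCACHEBSIZE not provided by the kernel\n");
        return 0;
}
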
> I'm happy to run some benchmarks if you can point me at what to run. I
> had a poke around the repository and found short_bench, but it seemed
> to run for a very long time.
I've created a dedicated test program for this, see:
https://github.com/compudj/userspace-rcu-dev/tree/false-sharing
Example use, on an AMD Ryzen 7 PRO 6850U with Radeon Graphics:
for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 21320
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 22657
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 47599
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 531364
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 523634
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 519402
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 520651
1..1
This points to false-sharing for strides smaller than 64 bytes,
suggesting a 64-byte cache line. I get similar results (false-sharing
below 64 bytes) on an AMD EPYC 9654 96-Core Processor.
The test program runs 4 threads by default, which can be overridden
with "-t N". This may be needed to exercise all the cores of a larger
machine. See "-h" for the other options.
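
To show the idea without requiring the branch, here is a simplified
sketch of what such a test does (my illustration, not the actual
tests/unit/test_false_sharing code): each thread increments its own
counter, the counters are laid out a configurable stride apart in one
shared buffer, and throughput collapses once several counters fall on
the same cache line.

/* build: cc -O2 -pthread false_sharing_sketch.c */
#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_THREADS      4
#define DURATION_MS     1000
#define MAX_STRIDE      512

static char buffer[NR_THREADS * MAX_STRIDE]
                __attribute__((aligned(MAX_STRIDE)));
static size_t stride;
static volatile int stop;       /* plain volatile flag, for brevity */

static void *thread_fn(void *arg)
{
        /* This thread's counter sits "stride" bytes after its neighbour's. */
        volatile uint64_t *counter =
                (volatile uint64_t *)(buffer + (uintptr_t)arg * stride);

        while (!stop)
                (*counter)++;
        return NULL;
}

int main(int argc, char **argv)
{
        pthread_t tids[NR_THREADS];
        uint64_t total = 0;
        uintptr_t i;

        stride = argc > 1 ? strtoul(argv[1], NULL, 0) : 64;
        if (stride < sizeof(uint64_t) || stride > MAX_STRIDE) {
                fprintf(stderr, "stride out of range\n");
                return 1;
        }
        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tids[i], NULL, thread_fn, (void *)i);
        poll(NULL, 0, DURATION_MS);     /* let the threads run ~1s */
        stop = 1;
        for (i = 0; i < NR_THREADS; i++) {
                pthread_join(tids[i], NULL);
                total += *(uint64_t *)(buffer + i * stride);
        }
        printf("stride %zu bytes: %llu increments per ms per thread\n",
                stride,
                (unsigned long long)(total / NR_THREADS / DURATION_MS));
        return 0;
}

The real test additionally pins threads and reports per-thread rates,
but the throughput collapse below the line size shows up even in this
rough form.
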
On a POWER9 (architected), altivec supported:
for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 12264
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 12276
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 25638
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 39934
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 53971
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 53599
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 53962
1..1
This points to false-sharing below a 128-byte stride, matching the
128-byte cache lines of IBM's 64-bit server CPUs.
On an e6500, altivec supported, Model 2.0 (pvr 8040 0120):
for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9049
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 9054
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 18643
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 37417
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 37906
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 37870
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 37899
1..1
This points to false-sharing below a 64-byte stride, matching the
e6500's 64-byte cache lines.
I prefer to be cautious about this cache line size value and aim for a
value that covers the largest known cache line size for an
architecture, rather than a value that is too small, because
false-sharing causes a large overhead.
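
The reason a too-small constant hurts is that CAA_CACHE_LINE_SIZE is
used to align and pad per-thread data so that hot fields never share a
line. A hypothetical illustration (the structure below is made up for
this example, it is not liburcu API):

#include <stdint.h>

#define CAA_CACHE_LINE_SIZE 256 /* the conservative powerpc value */

/*
 * Hypothetical per-thread counter: the alignment guarantees that two
 * counters never land on the same cache line, provided that
 * CAA_CACHE_LINE_SIZE is at least as large as the hardware line size.
 * If the constant were smaller than the real line size, the padding
 * would be ineffective and updates from different threads would
 * contend on the same line.
 */
struct per_thread_count {
        uint64_t count;
} __attribute__((aligned(CAA_CACHE_LINE_SIZE)));
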
Feedback is welcome.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com