On 28/11/2024 19:59, Sam Russell wrote:
I've ported the PCLMUL code to support ARMv8. It looks to be about an 80%
reduction in user CPU time on an EC2 T4g instance:
$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Vendor ID: ARM
Model name: Neoverse-N1
Model: 1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
Stepping: r3p1
BogoMIPS: 243.75
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32
atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
# ubuntu 24.04 package
$ time cksum ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m2.044s
sys 0m1.691s
# built from head
$ time ./cksum_old ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.217s
user 0m2.022s
sys 0m1.770s
# this patch using only pmull opcodes
$ time ./cksum_neon ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.135s
user 0m0.353s
sys 0m1.819s
# this patch using pmull and pmull2 opcodes
$ time ./cksum_neon2 ubuntu.iso
914429447 2773874688 ubuntu.iso
real 0m20.136s
user 0m0.346s
sys 0m1.819s
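For readers unfamiliar with the opcodes named above: pmull performs a
64x64 -> 128-bit carry-less (polynomial) multiplication, and pmull2 does the
same on the upper 64-bit lanes of its two 128-bit operands. The following is
only a portable software sketch of that operation, not the code from the
patch; the names clmul128 and clmul64 are made up for illustration. On
aarch64 the equivalent intrinsics in <arm_neon.h> are vmull_p64() and
vmull_high_p64().

```c
#include <stdint.h>

/* Hypothetical helper type: the 128-bit product as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } clmul128;

/* Carry-less multiply of two 64-bit polynomials over GF(2).
   This mirrors what one pmull instruction computes in hardware:
   a shift-and-XOR multiplication with no carries between bits. */
static clmul128 clmul64 (uint64_t a, uint64_t b)
{
  clmul128 r = { 0, 0 };
  for (int i = 0; i < 64; i++)
    {
      if ((b >> i) & 1)
        {
          r.lo ^= a << i;
          /* Bits shifted out of the low half land in the high half.
             Skip i == 0, where a shift by 64 would be undefined.  */
          if (i > 0)
            r.hi ^= a >> (64 - i);
        }
    }
  return r;
}
```

For example, clmul64 (3, 3) yields lo = 5, hi = 0, because
(x + 1)^2 = x^2 + 1 when coefficients are reduced modulo 2.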
Benchmark scripts (I used the crc_sum_stream() function, so the hash output
is different, but I have verified against the pclmul script functions locally):
$ time ./cksum_bench_old 65536 400000
Hash: 8984ED89, length: 65536
real 0m19.300s
user 0m19.299s
sys 0m0.001s
$ time ./cksum_bench_neon2 65536 400000
Hash: 828F9BAC, length: 65536
real 0m5.001s
user 0m4.997s
sys 0m0.003s
For hash validation:
$ time ./cksum_bench_neon2 1048576 40000
Hash: EFA0B24F, length: 1048576
real 0m7.540s
user 0m7.538s
sys 0m0.001s
$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576
real 0m3.018s
user 0m3.018s
sys 0m0.000s
-O3 does most of the optimisation work for us. There may be more savings,
but this is still a good improvement.
Some questions
- There's no direct equivalent of "__builtin_cpu_supports" for ARM, but the
hwcaps interface seems to be the way to test this [1] [2]
- ARM is a much more diverse ecosystem than x86_64, so it's possible that
some platforms (e.g. phones) would see a slowdown. Is this something we want
to give maintainers a flag to disable?
- ARMv8 also has a CRC32() opcode. A quick test showed it wasn't super
efficient, but it's possible that interleaving it with the folding
approach might add extra speedups. This is an exercise for the reader.
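
As a rough illustration of the hwcaps question above, a runtime check on
Linux could look something like the sketch below. It assumes glibc's
getauxval() from <sys/auxv.h> and the kernel's HWCAP_PMULL bit from
<asm/hwcap.h>; the function name pmull_supported is hypothetical and is not
taken from the patch. On anything other than Linux/aarch64 it simply
reports no support.

```c
#include <stdbool.h>

#if defined (__linux__) && defined (__aarch64__)
# include <sys/auxv.h>   /* getauxval, AT_HWCAP */
# include <asm/hwcap.h>  /* HWCAP_PMULL */
#endif

/* Hypothetical helper: return true if the kernel reports that the CPU
   implements the pmull/pmull2 polynomial multiply instructions.  */
static bool pmull_supported (void)
{
#if defined (__linux__) && defined (__aarch64__) && defined (HWCAP_PMULL)
  return (getauxval (AT_HWCAP) & HWCAP_PMULL) != 0;
#else
  /* Assumption: no portable way to ask here, so take the generic path.  */
  return false;
#endif
}
```

A caller would test this once at startup and fall back to the generic
table-driven implementation when it returns false, in the same spirit as
the existing __builtin_cpu_supports checks on x86_64.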
Cool. I'll try this out on some of the arm64 machines at:
https://portal.cfarm.net/machines/list/
Note builders can disable this already with:
./configure utils_cv_vmull_intrinsic_exists=no
thanks!
Pádraig