I've ported the PCLMUL implementation to ARMv8. It looks to be an 80% reduction in CPU (user) time on an EC2 T4g instance:

$ lscpu
Architecture:         aarch64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               2
On-line CPU(s) list:  0,1
Vendor ID:            ARM
Model name:           Neoverse-N1
Model:                1
Thread(s) per core:   1
Core(s) per socket:   2
Socket(s):            1
Stepping:             r3p1
BogoMIPS:             243.75
Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

# ubuntu 24.04 package
$ time cksum ubuntu.iso
914429447 2773874688 ubuntu.iso

real 0m20.136s
user 0m2.044s
sys  0m1.691s

# built from head
$ time ./cksum_old ubuntu.iso
914429447 2773874688 ubuntu.iso

real 0m20.217s
user 0m2.022s
sys  0m1.770s

# this patch using only pmull opcodes
$ time ./cksum_neon ubuntu.iso
914429447 2773874688 ubuntu.iso

real 0m20.135s
user 0m0.353s
sys  0m1.819s

# this patch using pmull and pmull2 opcodes
$ time ./cksum_neon2 ubuntu.iso
914429447 2773874688 ubuntu.iso

real 0m20.136s
user 0m0.346s
sys  0m1.819s

Benchmark scripts (I used the crc_sum_stream() function so the hash output is different, but I have verified against the pclmul script functions locally):

$ time ./cksum_bench_old 65536 400000
Hash: 8984ED89, length: 65536

real 0m19.300s
user 0m19.299s
sys  0m0.001s

$ time ./cksum_bench_neon2 65536 400000
Hash: 828F9BAC, length: 65536

real 0m5.001s
user 0m4.997s
sys  0m0.003s

For hash validation:

$ time ./cksum_bench_neon2 1048576 40000
Hash: EFA0B24F, length: 1048576

real 0m7.540s
user 0m7.538s
sys  0m0.001s

$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576

real 0m3.018s
user 0m3.018s
sys  0m0.000s

-O3 does most of the optimisation work for us. There may be more savings, but this is still a good improvement.

Some questions:

- There's no direct equivalent of "__builtin_cpu_supports" for ARM, but the hwcaps interface seems to be the way to test this [1] [2]
- ARM is a much more diverse ecosystem than x86_64, and it's possible that some platforms (e.g. phones) would see a slowdown. Is this something we want to give maintainers a flag to disable?
- ARMv8 also has a CRC32() opcode. A quick test showed it wasn't super efficient, but it's possible that interleaving it with the folding approach might add extra speedups. This is an exercise for the reader.

Cheers
Sam

[1] <https://docs.kernel.org/arch/arm64/elf_hwcaps.html>
[2] <https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/runtime-detection-of-cpu-features-on-an-armv8-a-cpu>
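
P.S. For the first question, here is a minimal sketch of how a hwcaps-based runtime check could look. It assumes Linux on AArch64 with glibc's getauxval(); the helper name pmull_supported() is hypothetical and not taken from the patch. On any other platform the sketch simply reports "not supported" so that callers fall back to the generic code.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
# include <sys/auxv.h>            /* getauxval, AT_HWCAP */
# ifndef HWCAP_PMULL
/* Fallback in case the libc headers do not provide it; this is the
   bit assigned to PMULL in the arm64 <asm/hwcap.h> UAPI header.  */
#  define HWCAP_PMULL (1UL << 4)
# endif
#endif

/* Hypothetical helper: return true if the kernel reports that the
   CPU implements the PMULL/PMULL2 instructions.  */
static bool
pmull_supported (void)
{
#if defined(__linux__) && defined(__aarch64__)
  return (getauxval (AT_HWCAP) & HWCAP_PMULL) != 0;
#else
  /* Not Linux/AArch64: assume no PMULL and let the caller use the
     generic implementation.  */
  return false;
#endif
}
```

A caller would presumably run this once, cache the result, and pick either the PMULL path or the existing generic path. A build-time or environment flag for maintainers could be checked in the same place.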
0001-cksum-Use-ARMv8-SIMD-extensions.patch