[PATCH] cksum: Use ARMv8 SIMD extensions

Sam Russell Thu, 28 Nov 2024 12:00:24 -0800

I've ported the PCLMUL implementation to ARMv8. It looks to give roughly
an 80% reduction in user CPU time over the plain CPU implementation on an
EC2 T4g instance:


$ lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   2
  On-line CPU(s) list:    0,1
Vendor ID:                ARM
  Model name:             Neoverse-N1
    Model:                1
    Thread(s) per core:   1
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             r3p1
    BogoMIPS:             243.75
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32
atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs

# ubuntu 24.04 package
$ time cksum ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.136s
user    0m2.044s
sys     0m1.691s

# built from head
$ time ./cksum_old ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.217s
user    0m2.022s
sys     0m1.770s

# this patch using only pmull opcodes
$ time ./cksum_neon ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.135s
user    0m0.353s
sys     0m1.819s

# this patch using pmull and pmull2 opcodes
$ time ./cksum_neon2 ubuntu.iso
914429447 2773874688 ubuntu.iso

real    0m20.136s
user    0m0.346s
sys     0m1.819s
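
For anyone unfamiliar with the opcodes: PMULL multiplies two 64-bit values
as polynomials over GF(2) (a carry-less multiply), which is the primitive
the folding approach is built on. PMULL2 does the same on the upper 64-bit
halves of its two source vectors. The sketch below is not taken from the
patch. The names clmul128, clmul64_ref, clmul64_pmull and clmul_selftest
are hypothetical, and the intrinsic path is only compiled when the
compiler reports AES/PMULL support (e.g. -march=armv8-a+crypto).

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__) && \
    (defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>
#define HAVE_PMULL_INTRINSICS 1
#endif

/* 128-bit carry-less product, split into two 64-bit halves (hypothetical type). */
typedef struct { uint64_t lo, hi; } clmul128;

/* Portable reference: multiply a and b as polynomials over GF(2).
   Partial products are combined with XOR, so no carries propagate. */
static clmul128 clmul64_ref(uint64_t a, uint64_t b)
{
    clmul128 r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            r.lo ^= a << i;
            if (i > 0)              /* avoid an undefined shift by 64 */
                r.hi ^= a >> (64 - i);
        }
    }
    return r;
}

#ifdef HAVE_PMULL_INTRINSICS
/* Same operation via the PMULL instruction (vmull_p64 intrinsic).
   vmull_high_p64 is the PMULL2 counterpart for the upper vector halves. */
static clmul128 clmul64_pmull(uint64_t a, uint64_t b)
{
    uint64x2_t v = vreinterpretq_u64_p128(vmull_p64((poly64_t)a, (poly64_t)b));
    clmul128 r = { vgetq_lane_u64(v, 0), vgetq_lane_u64(v, 1) };
    return r;
}
#endif

/* Small sanity check of the reference (and the intrinsic, if built). */
static void clmul_selftest(void)
{
    clmul128 r = clmul64_ref(3, 3);          /* (x+1)^2 = x^2 + 1 */
    assert(r.lo == 5 && r.hi == 0);
    r = clmul64_ref(UINT64_MAX, 2);          /* shift left by one bit */
    assert(r.lo == 0xFFFFFFFFFFFFFFFEULL && r.hi == 1);
#ifdef HAVE_PMULL_INTRINSICS
    clmul128 h = clmul64_pmull(0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL);
    clmul128 s = clmul64_ref(0x0123456789ABCDEFULL, 0xFEDCBA9876543210ULL);
    assert(h.lo == s.lo && h.hi == s.hi);
#endif
}
```

Calling clmul_selftest() from main() is enough to exercise it. On
non-ARM builds only the portable reference is compiled.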

Benchmark scripts (I used the crc_sum_stream() function, so the hash output
is different, but I have verified against the pclmul script functions
locally):

$ time ./cksum_bench_old 65536 400000
Hash: 8984ED89, length: 65536

real    0m19.300s
user    0m19.299s
sys     0m0.001s

$ time ./cksum_bench_neon2 65536 400000
Hash: 828F9BAC, length: 65536

real    0m5.001s
user    0m4.997s
sys     0m0.003s

For hash validation:

$ time ./cksum_bench_neon2 1048576 40000
Hash: EFA0B24F, length: 1048576

real    0m7.540s
user    0m7.538s
sys     0m0.001s

$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576

real    0m3.018s
user    0m3.018s
sys     0m0.000s

-O3 does most of the optimisation work for us. There may be more savings
to find, but this is already a good improvement.

Some questions:
- There's no direct equivalent of "__builtin_cpu_supports" for ARM, but the
hwcaps interface seems to be the way to test this [1] [2].
- ARM is a much more diverse ecosystem than x86_64, so it's possible that
some platforms (e.g. phones) would see a slowdown. Is this something we want
to give maintainers a flag to disable?
- ARMv8 also has CRC32 instructions. A quick test showed they weren't super
efficient, but it's possible that interleaving them with the folding
approach might add extra speedups. This is an exercise for the reader.
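
To make the first question concrete, here is a minimal sketch of a runtime
check through the hwcaps interface on Linux, assuming getauxval() from
<sys/auxv.h> and the HWCAP_PMULL bit described in [1]. The helper name
have_pmull() is hypothetical and not part of the patch. On any other
platform it just reports false so the generic code path is used.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_PMULL */
#endif

/* Hypothetical helper: true if the kernel reports PMULL support. */
static bool have_pmull(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_PMULL)
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#else
    return false;  /* not Linux/aarch64: use the generic implementation */
#endif
}
```

The result could be cached in a static variable and used to pick between
the table-driven and PMULL code paths at startup, much as the x86_64 code
does with __builtin_cpu_supports.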

Cheers
Sam

[1] <https://docs.kernel.org/arch/arm64/elf_hwcaps.html>
[2] <https://community.arm.com/arm-community-blogs/b/operating-systems-blog/posts/runtime-detection-of-cpu-features-on-an-armv8-a-cpu>

0001-cksum-Use-ARMv8-SIMD-extensions.patch
Description: Binary data
