I've added a sample benchmarking program to measure the difference without hitting disk; it shows roughly a 40% speedup:
$ time ./cksum_bench_pclmul 1048576 10000
Hash: EFA0B24F, length: 1048576

real	0m3.018s
user	0m3.018s
sys	0m0.000s

$ time ./cksum_bench_avx2 1048576 10000
Hash: EFA0B24F, length: 1048576

real	0m1.824s
user	0m1.804s
sys	0m0.020s

The code effectively replicates the existing pclmul code, with new constants generated for the larger folds. The main gotcha was that the previous CRC gets inserted at an unusual offset due to endianness and byte swapping.

I don't have a Skylake processor, so I spun up an AWS instance to test the AVX512 version. It turns out there's a bug where virtualisation environments don't handle the AVX512 pclmul correctly despite the CPU reporting support: it gets past the __builtin_cpu_supports() gate but then raises an illegal instruction partway through the function. It might be worth disabling that path for now, though it would be nice if we could at least validate it. AVX2 has been around for over 10 years, so that part seems a safer addition.
#include "config.h"

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include "cksum.h"

/* Fill BUFFER with LEN deterministic pseudo-random bytes (xorshift32).  */
static void
xorshift_populate (char *buffer, size_t len)
{
  unsigned int state = 0x123;
  for (size_t i = 0; i < len; i++)
    {
      state ^= state << 13;
      state ^= state >> 17;
      state ^= state << 5;
      buffer[i] = (char) state;
    }
}

int
main (int argc, char *argv[])
{
  uint_fast32_t hash = 0;
  uintmax_t length = 0;

  if (argc != 3)
    {
      fprintf (stderr, "Usage: %s length iterations\n", argv[0]);
      return EXIT_FAILURE;
    }

  size_t buffer_len = strtoul (argv[1], NULL, 10);
  size_t iterations = strtoul (argv[2], NULL, 10);

  char *buffer = calloc (1, buffer_len);
  if (!buffer)
    {
      fprintf (stderr, "%s: out of memory\n", argv[0]);
      return EXIT_FAILURE;
    }
  xorshift_populate (buffer, buffer_len);

  for (size_t i = 0; i < iterations; i++)
    {
      /* fmemopen keeps the data in memory, so the loop measures the
         CRC folding rather than disk I/O.  */
      FILE *fp = fmemopen (buffer, buffer_len, "r");
      cksum_pclmul (fp, &hash, &length);
      fclose (fp);
    }

  free (buffer);
  printf ("Hash: %08X, length: %ju\n", (unsigned int) hash, length);
  return EXIT_SUCCESS;
}
0001-cksum-Use-AVX2-and-AVX512-for-speedup.patch