On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer <raghuveer.devulapa...@intel.com> wrote: > I ran the same benchmark drive_crc32c with the postgres infrastructure and > found that your v2 sse42 version from corsix is slower than > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
That matches my findings as well. > I think the reason is that postgres is not using -O3 flag build the crc32c > source files and the compiler generates less than optimal code. Adding that > flag fixes the regression for buffers with 64 bytes – 128 bytes. Could you > confirm that behavior on your end too? On my machine that still regresses compared to master in that range (although by not as much) so I still think 128 bytes is the right threshold. The effect of -O3 with gcc14.2 is that the single-block loop (after the 4-block loop) is unrolled. Unrolling adds branches and binary space, so it'd be nice to avoid that even for systems that build with -O3. I tried leaving out the single block loop, so that the tail is called for the remaining <63 bytes, and it's actually better: v2: 128 latency average = 10.256 ms 144 latency average = 11.393 ms 160 latency average = 12.977 ms 176 latency average = 14.364 ms 192 latency average = 12.627 ms remove single-block loop: 128 latency average = 10.211 ms 144 latency average = 10.987 ms 160 latency average = 12.094 ms 176 latency average = 12.927 ms 192 latency average = 12.466 ms Keeping the extra loop is probably better at this benchmark on newer machines, but I don't think it's worth it. Microbenchmarks with fixed sized input don't take branch mispredicts into account. > > I did the benchmarks on my older machine, which I believe has a latency of > > 7 cycles for this instruction. > > May I ask which processor does you older machine have? I am benchmarking on a > Tigerlake processor. It's an i7-7500U / Kaby Lake. > > It's probably okay to fold these together in the same compile-time > > check, since both are fairly old by now, but for those following > > along, pclmul is not in SSE 4.2 and is a bit newer. So this would > > cause machines building on Nehalem (2008) to fail the check and go > > back to slicing-by-8 with it written this way. > > Technically, the current version of the patch does not have a runtime cpuid > check for pclmul and so would cause it to crash with segill on Nehalam > (currently we only check for sse4.2). This needs to be fixed by adding an > additional cpuid check for pcmul but it would fall back to slicing by 8 on > Nehalem and use the latest version on Westmere and above. If you care about > keeping the performance on Nehalem, then I am happy to update the choose > function to pick the right pointer accordingly. Let me know which one you > would prefer. Okay, Nehalem is 17 years old, and the additional cpuid check would still work on hardware 14-15 years old, so I think it's fine to bump the requirement for runtime hardware support. -- John Naylor Amazon Web Services