Re: Improve CRC32C performance on SSE4.2

John Naylor Mon, 10 Feb 2025 23:02:52 -0800

On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer
<raghuveer.devulapa...@intel.com> wrote:
> I ran the same benchmark drive_crc32c with the postgres infrastructure and 
> found that your v2 sse42 version from corsix is slower than 
> pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.


That matches my findings as well.

> I think the reason is that postgres is not using -O3 flag build the crc32c 
> source files and the compiler generates less than optimal code. Adding that 
> flag fixes the regression for buffers with 64 bytes – 128 bytes. Could you 
> confirm that behavior on your end too?

On my machine that still regresses compared to master in that range
(although by not as much) so I still think 128 bytes is the right
threshold.

The effect of -O3 with gcc14.2 is that the single-block loop (after
the 4-block loop) is unrolled. Unrolling adds branches and binary
space, so it'd be nice to avoid that even for systems that build with
-O3. I tried leaving out the single block loop, so that the tail is
called for the remaining <63 bytes, and it's actually better:

v2:

128
latency average = 10.256 ms
144
latency average = 11.393 ms
160
latency average = 12.977 ms
176
latency average = 14.364 ms
192
latency average = 12.627 ms

remove single-block loop:

128
latency average = 10.211 ms
144
latency average = 10.987 ms
160
latency average = 12.094 ms
176
latency average = 12.927 ms
192
latency average = 12.466 ms

Keeping the extra loop is probably better at this benchmark on newer
machines, but I don't think it's worth it. Microbenchmarks with fixed
sized input don't take branch mispredicts into account.

> > I did the benchmarks on my older machine, which I believe has a latency of 
> > 7 cycles for this instruction.
>
> May I ask which processor does you older machine have? I am benchmarking on a 
> Tigerlake processor.

It's an i7-7500U / Kaby Lake.

> > It's probably okay to fold these together in the same compile-time
> > check, since both are fairly old by now, but for those following
> > along, pclmul is not in SSE 4.2 and is a bit newer. So this would
> > cause machines building on Nehalem (2008) to fail the check and go
> > back to slicing-by-8 with it written this way.
>
> Technically, the current version of the patch does not have a runtime cpuid 
> check for pclmul and so would cause it to crash with segill on Nehalam 
> (currently we only check for sse4.2). This needs to be fixed by adding an 
> additional cpuid check for pcmul but it would fall back to slicing by 8 on 
> Nehalem and use the latest version on Westmere and above. If you care about 
> keeping the performance on Nehalem, then I am happy to update the choose 
> function to pick the right pointer accordingly.  Let me know which one you 
> would prefer.

Okay, Nehalem is 17 years old, and the additional cpuid check would
still work on hardware 14-15 years old, so I think it's fine to bump
the requirement for runtime hardware support.

--
John Naylor
Amazon Web Services

Re: Improve CRC32C performance on SSE4.2

Reply via email to