On Tue, Jun 17, 2025 at 1:55 AM John Naylor wrote:
I took the minimal repro from [1] and took a look at the code generated
between clang 17 -O0 [2] and clang 17 -O3 [3]. I saw that -O3 (and
actually -O1 and -O2) generated the following code for:
castval = _mm512_castsi128_si512(_mm_cvtsi32_si128
On Mon, Jun 23, 2025 at 10:05 PM Nathan Bossart
wrote:
>
> On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote:
> > This has broken the build completely on my RHEL8 x86_64 box,
> > with gcc 8.5.0:
> >
> > $ ./configure ...
> > $ make -s
> > pg_crc32c_sse42.c: In function 'pg_comp_crc32c_avx51
On Mon, Jun 23, 2025 at 11:29:32AM -0400, Tom Lane wrote:
> The code still uses _mm512_castsi512_si128, so I think removing it
> from the configure snippet might not be bright.
Ah, right. I'm not firing on all cylinders this morning, apparently.
> I adapted what's
> there now to get the attached
Nathan Bossart writes:
> On Mon, Jun 23, 2025 at 11:10:45AM -0400, Tom Lane wrote:
>> I was imagining just including _mm512_zextsi128_si512() in the
>> code being tested during configure, so that we fall back to
>> the non-AVX-512 code if the compiler is too old to have it.
>> I don't really feel
On Mon, Jun 23, 2025 at 11:10:45AM -0400, Tom Lane wrote:
> Nathan Bossart writes:
>> On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote:
>>> This has broken the build completely on my RHEL8 x86_64 box,
>>> with gcc 8.5.0:
>
>> Unfortunately, this will probably require more than replacing
>
Nathan Bossart writes:
> On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote:
>> This has broken the build completely on my RHEL8 x86_64 box,
>> with gcc 8.5.0:
> Unfortunately, this will probably require more than replacing
> _mm512_castsi512_si128 with _mm512_zextsi512_si128 because the la
On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote:
> This has broken the build completely on my RHEL8 x86_64 box,
> with gcc 8.5.0:
>
> $ ./configure ...
> $ make -s
> pg_crc32c_sse42.c: In function 'pg_comp_crc32c_avx512':
> pg_crc32c_sse42.c:126:25: warning: implicit declaration of functi
John Naylor writes:
> Pushed, thanks everyone!
This has broken the build completely on my RHEL8 x86_64 box,
with gcc 8.5.0:
$ ./configure ...
$ make -s
pg_crc32c_sse42.c: In function 'pg_comp_crc32c_avx512':
pg_crc32c_sse42.c:126:25: warning: implicit declaration of function
'_mm512_zextsi128_s
On Tue, Jun 17, 2025 at 3:55 PM John Naylor wrote:
> > > Replacing that with _mm512_zextsi128_si512 fixes the problem.
>
> Here's a patch for testing, which also reverts the previous
> workaround.
Pushed, thanks everyone!
--
John Naylor
Amazon Web Services
Verified that the patch works with clang-19 -O0 and that AVX-512 was selected
for the CRC at runtime. Thanks for fixing this.
Regards,
Deep (VMware)
John Naylor writes:
Hi,
>> Just be curious, what kind of optimization (like what -O2 does) could
>> mask this issue?
>
> In case Andy is asking about "how" rather than "under what
> circumstances", my guess is: -O1+ may have just chosen instructions
> that also happen to zero-extend, which are
> In case Andy is asking about "how" rather than "under what circumstances", my
> guess is: -O1+ may have just chosen instructions that also happen to
> zero-extend, which are common. -O0 doesn't represent the naive straightforward
> structure of what the programmer wrote, it's more like an "ex
On Tue, Jun 17, 2025 at 03:55:06PM +0700, John Naylor wrote:
>> "Devulapalli, Raghuveer" writes:
>> > Replacing that with _mm512_zextsi128_si512 fixes the problem.
>
> Here's a patch for testing, which also reverts the previous
> workaround. Help welcome, but I still promise to test it in the nea
On Tue, Jun 17, 2025 at 6:40 AM Andy Fan wrote:
>
> "Devulapalli, Raghuveer" writes:
>
> > Great catch! From the intrinsic manual:
> >
> > Cast vector of type __m128i to type __m512i; the upper 384 bits of the
> > result are undefined.
Thanks Raghuveer and Nathan, for the diagnosis!
> Just be c
> Subject: RE: Improve CRC32C performance on SSE4.2
>
> > Just be curious, what kind of optimization (like what -O2 does) could
> > mask this issue?
>
> >= -O1. Only -O0 shows this problem.
>
> Raghuveer
>
> On Mon, Jun 16, 2025 at 06:31:11PM +, Devulapalli, R
> Just be curious, what kind of optimization (like what -O2 does) could mask
> this issue?
>= -O1. Only -O0 shows this problem.
Raghuveer
"Devulapalli, Raghuveer" writes:
> Great catch! From the intrinsic manual:
>
> Cast vector of type __m128i to type __m512i; the upper 384 bits of the
> result are undefined.
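The difference between the two intrinsics can be sketched as follows. This is a minimal illustration, not the PostgreSQL code: the function name `upper_bits_after_zext` is hypothetical, and when the file is not compiled with AVX-512F enabled, a plain array stands in for the vector register so the sketch still builds. It only demonstrates the documented guarantee of `_mm512_zextsi128_si512`; with `_mm512_castsi128_si512` the same check would be meaningless, because the upper 384 bits are undefined.

```c
#include <stdint.h>
#include <string.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/*
 * Widen a 32-bit value to 512 bits and return the OR of the upper
 * 384 bits (64-bit lanes 2..7). Zero means the upper bits are clear.
 *
 * Hypothetical helper for illustration only.
 */
uint64_t
upper_bits_after_zext(uint32_t x)
{
    uint64_t lanes[8];
    uint64_t acc = 0;

#if defined(__AVX512F__)
    /* zext guarantees that bits 128..511 of the result are zero */
    __m512i v = _mm512_zextsi128_si512(_mm_cvtsi32_si128((int) x));

    _mm512_storeu_si512((void *) lanes, v);
#else
    /* Portable stand-in that models the zero-extending behavior */
    memset(lanes, 0, sizeof(lanes));
    lanes[0] = x;
#endif

    for (int i = 2; i < 8; i++)
        acc |= lanes[i];
    return acc;
}
```

As reported elsewhere in this thread, older compilers such as gcc 8.5.0 do not declare `_mm512_zextsi128_si512`, so the AVX-512 branch above assumes a compiler new enough to provide it.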
Just be curious, what kind of optimization (like what -O2 does) could
mask this issue?
> Replacing that with _mm512_zext
On Mon, Jun 16, 2025 at 06:31:11PM +, Devulapalli, Raghuveer wrote:
> Attached is a simple reproducer. It passes with clang v16 -O0, but fails
> with 17 and 18 only when built with -O0.
I've just started looking into this, but the difference in code generated
for _mm512_castsi128_si512() betw
> On Sun, Jun 15, 2025 at 8:32 AM Andy Fan wrote:
> >
> > Jesper Pe
On Sun, Jun 15, 2025 at 8:32 AM Andy Fan wrote:
>
> Jesper Pedersen writes:
>
> Hi,
>
> Thank you Nathan, Tomas and Jesper for the answers. The patch at [0]
> works for me and I could work with master smoothly now.
Pushed, thanks for testing! I'll do some more testing to see what
versions/levels
Jesper Pedersen writes:
Hi,
Thank you Nathan, Tomas and Jesper for the answers. The patch at [0]
works for me and I could work with master smoothly now.
> On 6/14/25 9:56 AM, Nathan Bossart wrote:
>> I'm also genuinely curious why folks are using -O0...
>>
>
> Personally, I use
>
> -O0 -fno-o
Hi Nathan,
On 6/14/25 9:56 AM, Nathan Bossart wrote:
> I'm also genuinely curious why folks are using -O0...
Personally, I use
-O0 -fno-omit-frame-pointer
for FlameGraph [1] investigations.
[1] https://github.com/brendangregg/FlameGraph
Best regards,
Jesper
On 6/14/25 15:56, Nathan Bossart wrote:
> On Sat, Jun 14, 2025 at 03:47:33PM +0200, Tomas Vondra wrote:
>> I suggest you try with a newer gcc, perhaps 13.4. There's been a bunch
>> of fixes related to AVX512 since 13.0, chances are this was already
>> fixed. I don't see this failure on 14.3.1.
>
On Sat, Jun 14, 2025 at 03:47:33PM +0200, Tomas Vondra wrote:
> I suggest you try with a newer gcc, perhaps 13.4. There's been a bunch
> of fixes related to AVX512 since 13.0, chances are this was already
> fixed. I don't see this failure on 14.3.1.
From the config.log, it looks like Andy is usin
Hi,
I suggest you try with a newer gcc, perhaps 13.4. There's been a bunch
of fixes related to AVX512 since 13.0, chances are this was already
fixed. I don't see this failure on 14.3.1.
T.
On 6/14/25 12:24, Andy Fan wrote:
> Hi,
>
> Recently I always get below error during initdb.
>
> """
> UT
On Sat, Apr 5, 2025 at 5:15 AM Nathan Bossart wrote:
>
> I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but
> everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the
> "32C" included). I tested the v16 patches (with the macro fixed and
> assertions enabled)
On Wed, Apr 02, 2025 at 02:10:40PM +0700, John Naylor wrote:
> Thanks for looking, I plan to commit this over the weekend unless
> there are objections.
I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but
everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the
> Thanks for looking, I plan to commit this over the weekend unless there are
> objections.
LGTM.
Raghuveer
On Tue, Apr 1, 2025 at 11:25 PM Nathan Bossart wrote:
>
> On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
> > wrote:
> >> (2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c since
> >> they contain both sse42
On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
> wrote:
>> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of
>> pg_popcount_avx512.c but perhaps that’s fine for now. I will propose a
>> consolidated SIMD r
On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
wrote:
>
> Hello John,
>
> v15 LGTM. Couple of minor comments:
>
> > I'm leaning towards a length limit for v15-0001 so that inlined
> > instructions are likely to be unrolled. Aside from lack of commit message,
> > I think that one is r
Hello John,
v15 LGTM. Couple of minor comments:
> I'm leaning towards a length limit for v15-0001 so that inlined instructions
> are likely to be unrolled. Aside from lack of commit message, I think that one
> is ready for commit soon-ish.
+1
> I'm feeling pretty good about 0002, but
On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote:
>
> I'll take a look at the configure
> checks soon, since I had some questions there.
One other thing I forgot to mention: The previous test function had
local constants that the compiler was able to fold, resulting in no
actual vector instructi
On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote:
> I'll take a look at the configure
> checks soon, since I had some questions there.
I'm leaning towards a length limit for v15-0001 so that inlined
instructions are likely to be unrolled. Aside from lack of commit
message, I think that one is re
On Thu, Mar 13, 2025 at 11:38 PM Devulapalli, Raghuveer
wrote:
>
> > > I've gone ahead and added the generated AVX-512 algorithm in
> > > v14-0005
>
> Here are my benchmark numbers (master vs. v14) on TGL (time in ms):
>
> Buf-size-bytes Master v14
> 64 9.547 6.095
...
> 256 31.399
On Wed, Mar 12, 2025 at 3:57 AM Devulapalli, Raghuveer
wrote:
>
> > I've gone ahead and added the generated AVX-512 algorithm in v14-0005
> I'm a little confused. Why is v14 missing the earlier versions of pclmul
> implementation that works without avx-512?
As I mentioned upthread, the 128-bit
> > I've gone ahead and added the generated AVX-512 algorithm in
> > v14-0005
Here are my benchmark numbers (master vs. v14) on TGL (time in ms):

Buf-size-bytes    Master    v14
64                9.547     6.095
80                10.999    6.248
96                12.443    7.756
112               14.129    9.62
128               15.295    6.534
144               16.7
On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> wrote:
>> Overall, I wish we could avoid splitting things into separate files and
>> adding more header file gymnastics, but maybe there isn't much better we
>> can do without overhaulin
> I've gone ahead and added the generated AVX-512 algorithm in v14-0005
+ pg_attribute_target("avx512vl,vpclmulqdq")
+ pg_crc32c
+ pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length)
I'm a little confused. Why is v14 missing the earlier versions of pclmul
implementation that wo
On Wed, Mar 5, 2025 at 10:52 PM Nathan Bossart wrote:
>
> On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> > That was my hunch too, but I wanted to be more sure, so I modified the
> > benchmark so it doesn't know the address of the next calculation until
> > it finishes the last calc
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote:
>
> Overall, I wish we could avoid splitting things into separate files and
> adding more header file gymnastics, but maybe there isn't much better we
> can do without overhauling the CPU feature detection code.
I wanted to make an attempt to m
On Tue, Mar 11, 2025 at 4:47 AM Nathan Bossart wrote:
>
> On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> > wrote:
> >> Overall, I wish we could avoid splitting things into separate files and
> >> adding more header file gymnastics
On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> That was my hunch too, but I wanted to be more sure, so I modified the
> benchmark so it doesn't know the address of the next calculation until
> it finishes the last calculation so we can hopefully see the latency
> caused by indirecti
On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart wrote:
>
> On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> > wrote:
> >> This could potentially lead to a small regression for machines with SSE
> >> 4.2 but not PCLMUL, but that may b
On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> wrote:
>> This could potentially lead to a small regression for machines with SSE
>> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
>> worry about.
>
> Note al
On Tue, Mar 4, 2025 at 5:41 AM Devulapalli, Raghuveer
wrote:
> Some feedback on v11:
>
> if ((exx[2] & (1 << 20)) != 0) /* SSE 4.2 */
> {
> pg_comp_crc32c = pg_comp_crc32c_sse42;
> #ifdef USE_PCLMUL_WITH_RUNTIME_CHECK
> if ((exx[2] & (1 << 1)) != 0) /* PCLMUL */
>
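The bit tests quoted above can be sketched as a small pure function. This is an illustration under stated assumptions, not the patch itself: the enum, the macro names, and `choose_crc_impl` are hypothetical, and the ECX value is passed in as a parameter rather than read with CPUID so the logic can be exercised on any machine. The bit positions (bit 20 for SSE 4.2, bit 1 for PCLMUL in ECX of CPUID leaf 1) are the ones used in the quoted code.

```c
#include <stdint.h>

/* CPUID leaf 1, ECX feature bits (same positions as the quoted code) */
#define CPUID_ECX_PCLMUL (UINT32_C(1) << 1)
#define CPUID_ECX_SSE42  (UINT32_C(1) << 20)

/* Hypothetical labels for the available implementations */
typedef enum
{
    CRC_IMPL_GENERIC,   /* portable fallback */
    CRC_IMPL_SSE42,     /* crc32 instruction only */
    CRC_IMPL_PCLMUL     /* crc32 instruction plus carryless multiply */
} crc_impl;

/*
 * Pick an implementation from the ECX register of CPUID leaf 1.
 * PCLMUL is only considered when SSE 4.2 is present, mirroring the
 * nesting of the checks in the quoted snippet.
 */
crc_impl
choose_crc_impl(uint32_t ecx)
{
    if ((ecx & CPUID_ECX_SSE42) == 0)
        return CRC_IMPL_GENERIC;
    if ((ecx & CPUID_ECX_PCLMUL) != 0)
        return CRC_IMPL_PCLMUL;
    return CRC_IMPL_SSE42;
}
```

Keeping the decision separate from the CPUID call makes the selection logic testable without the hardware in question.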
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote:
> I spent some time staring at pg_crc32.h with all these patches applied, and
> IIUC it leads to the following behavior:
>
> * For compiled-in SSE 4.2 builds, we branch based on the length. For
> smaller inputs, we are using an inlined versi
Hi John,
> You raised some interesting points, which deserve a thoughtful response. After
> sleeping on it, however I came to the conclusion that a sweeping change in
> runtime checks, with either of our approaches, has downsides and unresolved
> questions. Perhaps we can come back to it at a lat
On Fri, Feb 28, 2025 at 07:11:29PM +0700, John Naylor wrote:
> 0002: For SSE4.2 builds, arrange so that constant input uses an
> inlined path so that the compiler can emit unrolled loops anywhere.
> This is particularly important for the WAL insertion lock, so this is
> possibly committable on its
Hi Raghuveer,
You raised some interesting points, which deserve a thoughtful
response. After sleeping on it, however I came to the conclusion that
a sweeping change in runtime checks, with either of our approaches,
has downsides and unresolved questions. Perhaps we can come back to it
at a later t
Hi John,
Going back to your earlier comment:
> > > I'm not a fan of exposing these architecture-specific details to
> > > places that consult the capabilities. That requires things like
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> > Isn't that one thing currently pg_cpu_have(FE
On Thu, Feb 27, 2025 at 3:53 AM Devulapalli, Raghuveer
wrote:
> > I'm not a fan of exposing these architecture-specific details to places
> > that consult the capabilities. That requires things like
> >
> > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> > [...]
> > +#define PGC
Attached v10 with minor changes based on the feedback.
> Thanks, I think this is a good direction. Some comments:
>
> #if defined(HAVE__GET_CPUID)
> __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]);
> #elif defined(HAVE__CPUID)
> __cpuid(exx, 1);
> #endif
Oh yeah good catch. Fixed in v10.
> +
On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer
wrote:
>
> > I agree it would be preferable to make a centralized check work.
>
> Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid
> checks into one single place including the AVX512 popcount code. Any new
> feature
> I agree it would be preferable to make a centralized check work.
Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid
checks into one single place including the AVX512 popcount code. Any new
feature that requires CPUID information can access that information with
pg_cpu_
On Tue, Feb 18, 2025 at 1:40 PM John Naylor wrote:
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
> wrote:
> > While this needn't block this patch set, I do find the dispatch code to be
> > pretty complicated. Maybe we can improve that in the future by using
> > macros to auto-generate muc
On Tue, Feb 25, 2025 at 3:17 AM Devulapalli, Raghuveer
wrote:
>
> > Here's another idea to make it more automatic: Give up on initializing every
> > capability at once.
>
> I'm not sure I like giving up this. Initializing and running CPUID check with
> the attribute constructor is very valuable f
> Here's another idea to make it more automatic: Give up on initializing every
> capability at once.
I'm not sure I like giving up this. Initializing and running CPUID check with
the attribute constructor is very valuable for two reasons: (1) you get
everything done at load time before main and
On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer
wrote:
>
>
> > Now, there is no equivalent on MSVC and workarounds are fragile [1].
> > Maybe we could only assert initialization happened for the backend and for
> > frontend either
> > - add a couple manual initializations to the frontend
> Now, there is no equivalent on MSVC and workarounds are fragile [1].
> Maybe we could only assert initialization happened for the backend and for
> frontend either
> - add a couple manual initializations to the frontend programs where we
>   don't want to lose performance for non-gcc/clang.
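The trade-off discussed here, load-time initialization where the compiler supports it and a manual call elsewhere, might look roughly like the sketch below. All names are hypothetical, and the CPUID probing is replaced by a counter so the behavior can be observed; this is not the code from any posted patch version.

```c
#include <stdbool.h>

static bool cpu_features_ready = false;
static int  cpu_feature_inits = 0;

/* Idempotent initializer; a real version would run CPUID here. */
static void
init_cpu_features(void)
{
    if (cpu_features_ready)
        return;
    cpu_feature_inits++;
    cpu_features_ready = true;
}

#if defined(__GNUC__) || defined(__clang__)
/* With gcc/clang this runs at load time, before main(). */
static void __attribute__((constructor))
init_cpu_features_at_load(void)
{
    init_cpu_features();
}
#endif

/* Manual entry point for compilers without a constructor attribute. */
void
ensure_cpu_features(void)
{
    init_cpu_features();
}

bool
cpu_features_initialized(void)
{
    return cpu_features_ready;
}

int
cpu_feature_init_count(void)
{
    return cpu_feature_inits;
}
```

Because the initializer is idempotent, calling `ensure_cpu_features()` from a frontend program costs nothing on gcc/clang, where the constructor has already run, and covers the case where no constructor mechanism exists.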
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer
wrote:
> The runtime detection code could also be appended with function
> __attribute__((constructor)) so that it gets executed before main.
Hmm! I didn't know about that, thanks! It works on old gcc/clang, so
that's good. I can't verify on
Hi John,
Just a few comments on v7:
> pg_cpucap_crc32c
I like the idea of moving all CPU runtime detection to a single module outside
of implementations. This makes it easy to extend in the future and keeps the
functionalities separate.
> - Rename the CRC choose*.c files to pg_cpucap_{x86,
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tr
On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
wrote:
>
> On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > I tried using branching for the runtime check, and this looks like the
> > way to go:
> > - Existing -msse4.2 builders will still call directly, but inside the
> > function t
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> I tried using branching for the runtime check, and this looks like the
> way to go:
> - Existing -msse4.2 builders will still call directly, but inside the
> function there is a length check and only for long input will it do a
> runtim
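The idea of branching on length before doing any runtime check can be sketched as follows. This is a simplified model with assumptions labeled: the 128-byte cutoff is borrowed from the benchmark discussion elsewhere in the thread, the names are hypothetical, and both paths use the same bitwise CRC-32C so the sketch stays portable. In the real patches the long path would be a vectorized implementation selected after a CPU check.

```c
#include <stddef.h>
#include <stdint.h>

#define LONG_INPUT_THRESHOLD 128    /* hypothetical cutoff in bytes */

static int long_path_calls = 0;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return crc;
}

/* Stand-in for the path that would use wider instructions. */
static uint32_t
crc32c_long_path(uint32_t crc, const void *data, size_t len)
{
    long_path_calls++;
    return crc32c_bitwise(crc, data, len);
}

/* Short inputs take the direct path; only long inputs branch further. */
uint32_t
crc32c_update(uint32_t crc, const void *data, size_t len)
{
    if (len < LONG_INPUT_THRESHOLD)
        return crc32c_bitwise(crc, data, len);
    return crc32c_long_path(crc, data, len);
}

/* Convenience wrapper applying the usual initial and final XOR. */
uint32_t
crc32c(const void *data, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}

int
crc32c_long_path_calls(void)
{
    return long_path_calls;
}
```

The point of the structure is that short inputs never pay for the extra check, which matters for callers that checksum small, fixed-size buffers.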
On Thu, Feb 13, 2025 at 5:19 AM Nathan Bossart wrote:
>
> On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote:
> >> Well, I suspect the AVX-512 version will pretty much always need the
> >> runtime check given that it's not available on a lot of newer hardware and
> >> requires
On Thu, Feb 13, 2025 at 4:18 AM Nathan Bossart wrote:
>
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible. I looked at switching to always using runtime checks
> for this stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m
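For readers unfamiliar with the function pointer overhead mentioned here, the runtime-check variant can be sketched like this. It is a minimal model with hypothetical names, not PostgreSQL's actual code: a function pointer initially targets a "choose" routine, which selects an implementation on the first call and redirects the pointer, so every later call pays only for the indirect call. A build that knows the instruction set at compile time can call the implementation directly and skip the pointer.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*crc_fn) (uint32_t crc, const void *data, size_t len);

static uint32_t crc_choose(uint32_t crc, const void *data, size_t len);

/* The pointer starts out aimed at the chooser. */
static crc_fn crc_impl = crc_choose;
static int    choose_calls = 0;

/* Portable bitwise CRC-32C used as the only "implementation" here. */
static uint32_t
crc_generic(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1)));
    }
    return crc;
}

/* First call only: pick an implementation, then forward the call. */
static uint32_t
crc_choose(uint32_t crc, const void *data, size_t len)
{
    choose_calls++;
    crc_impl = crc_generic;     /* real code would probe the CPU here */
    return crc_impl(crc, data, len);
}

/* Every call goes through the pointer; this is the overhead at issue. */
uint32_t
crc_dispatch(uint32_t crc, const void *data, size_t len)
{
    return crc_impl(crc, data, len);
}

int
crc_choose_calls(void)
{
    return choose_calls;
}
```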
> > Sounds good to me. Although, users building with just -msse4.2 will
> > now encounter an additional pclmul runtime check. That would be a
> > regression unless they update to building with both -msse4.2 and -mpclmul.
>
> My thinking was that building with just -msse4.2 would cause the exi
On Wed, Feb 12, 2025 at 09:48:57PM +, Devulapalli, Raghuveer wrote:
>> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
>> overhead if possible. I looked at switching to always using runtime checks
>> for this stuff, and we concluded that we'd better not [0].
>
>
> Well, I suspect the AVX-512 version will pretty much always need the runtime
> check given that it's not available on a lot of newer hardware and requires a
> bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
> worth doing for PCLMUL. Otherwise, I think we'd have to leave
On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote:
>> Well, I suspect the AVX-512 version will pretty much always need the runtime
>> check given that it's not available on a lot of newer hardware and requires a
>> bunch of extra runtime checks (see pg_popcount_avx512.c). But i
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible. I looked at switching to always using runtime checks
> for this stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13
Does that
On Wed, Feb 12, 2025 at 09:02:27PM +, Devulapalli, Raghuveer wrote:
> Also, do we really need to have both USE_SSE42_CRC32C and
> USE_SSE42_CRC32C_WITH_RUNTIME_CHECK features support? The former macro is
> used to enable running the SSE42 version without a runtime check when someone
> buil
Hi,
> 2. Unfortunately, there is another wrinkle that I failed to consider: If you
> search the web for "VirtualBox pclmulqdq" you can see a few reports from not
> very long ago that some hypervisors don't enable the CPUID for pclmul. I don't
> know how big a problem that is in practice tod
On Wed, Feb 12, 2025 at 4:34 AM Devulapalli, Raghuveer
wrote:
> > On my machine that still regresses compared to master in that range
> > (although by not as much) so I still think 128 bytes is the right threshold.
>
> On my TGL, buffer sizes as small as 64 bytes see performance benefits.
Y
Hello,
Attached v3 which is same as v2 with the added PCLMULQDQ runtime CPUID check.
> > I ran the same benchmark drive_crc32c with the postgres infrastructure and
> > found that your v2 sse42 version from corsix is slower than
> > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
>
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer
wrote:
> I ran the same benchmark drive_crc32c with the postgres infrastructure and
> found that your v2 sse42 version from corsix is slower than
> pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
That matches my findings a
Hi John,
> I'm highly suspicious of these numbers because they show this version
> is about 20x faster than "scalar", so relatively speaking 3x faster
> than the AVX-512 proposal?
Apologies for this. I was suspicious of this too and looks like I had
unintentionally set the scalar version I wrote
On Thu, Feb 6, 2025 at 3:49 AM Devulapalli, Raghuveer
wrote:
>
> This patch improves the performance of SSE42 CRC32C algorithm. The current
> SSE4.2 implementation of CRC32C relies on the native crc32 instruction and
> processes 8 bytes at a time in a loop. The technique in this paper uses the