subject:"RE\: Improve CRC32C performance on SSE4.2"

Re: Improve CRC32C performance on SSE4.2

2025-07-13 Thread Soumyadeep Chakraborty

On Tue, Jun 17, 2025 at 1:55 AM John Naylor wrote: I took the minimal repro from [1] and took a look at the code generated between clang 17 -O0 [2] and clang 17 -O3 [3]. I saw that -O3 (and actually -O1 and -O2) generated the following code for: castval = _mm512_castsi128_si512(_mm_cvtsi32_si128

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread John Naylor

On Mon, Jun 23, 2025 at 10:05 PM Nathan Bossart wrote: > > On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote: > > This has broken the build completely on my RHEL8 x86_64 box, > > with gcc 8.5.0: > > > > $ ./configure ... > > $ make -s > > pg_crc32c_sse42.c: In function 'pg_comp_crc32c_avx51

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread Nathan Bossart

On Mon, Jun 23, 2025 at 11:29:32AM -0400, Tom Lane wrote: > The code still uses _mm512_castsi512_si128, so I think removing it > from the configure snippet might not be bright. Ah, right. I'm not firing on all cylinders this morning, apparently. > I adapted what's > there now to get the attached

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread Tom Lane

Nathan Bossart writes: > On Mon, Jun 23, 2025 at 11:10:45AM -0400, Tom Lane wrote: >> I was imagining just including _mm512_zextsi128_si512() in the >> code being tested during configure, so that we fall back to >> the non-AVX-512 code if the compiler is too old to have it. >> I don't really feel

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread Nathan Bossart

On Mon, Jun 23, 2025 at 11:10:45AM -0400, Tom Lane wrote: > Nathan Bossart writes: >> On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote: >>> This has broken the build completely on my RHEL8 x86_64 box, >>> with gcc 8.5.0: > >> Unfortunately, this will probably require more than replacing >

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread Tom Lane

Nathan Bossart writes: > On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote: >> This has broken the build completely on my RHEL8 x86_64 box, >> with gcc 8.5.0: > Unfortunately, this will probably require more than replacing > _mm512_castsi512_si128 with _mm512_zextsi512_si128 because the la

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread Nathan Bossart

On Mon, Jun 23, 2025 at 10:51:21AM -0400, Tom Lane wrote: > This has broken the build completely on my RHEL8 x86_64 box, > with gcc 8.5.0: > > $ ./configure ... > $ make -s > pg_crc32c_sse42.c: In function 'pg_comp_crc32c_avx512': > pg_crc32c_sse42.c:126:25: warning: implicit declaration of functi

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread Tom Lane

John Naylor writes: > Pushed, thanks everyone! This has broken the build completely on my RHEL8 x86_64 box, with gcc 8.5.0: $ ./configure ... $ make -s pg_crc32c_sse42.c: In function 'pg_comp_crc32c_avx512': pg_crc32c_sse42.c:126:25: warning: implicit declaration of function '_mm512_zextsi128_s

Re: Improve CRC32C performance on SSE4.2

2025-06-23 Thread John Naylor

On Tue, Jun 17, 2025 at 3:55 PM John Naylor wrote: > > > Replacing that with _mm512_zextsi128_si512 fixes the problem. > > Here's a patch for testing, which also reverts the previous > workaround. Pushed, thanks everyone! -- John Naylor Amazon Web Services

Re: Improve CRC32C performance on SSE4.2

2025-06-20 Thread Soumyadeep Chakraborty

Verified that the patch works w/ clang-19 -O0 and that avx-512 was selected for the CRC at runtime. Thanks for fixing this. Regards, Deep (VMware)

Re: Improve CRC32C performance on SSE4.2

2025-06-18 Thread Andy Fan

John Naylor writes: Hi, >> Just be curious, what kind of optimization (like what -O2 does) could >> mask this issue? > > In case Andy is asking about "how" rather than "under what > circumstances", my guess is: -O1+ may have just chosen instructions > that also happen to zero-extend, which are

RE: Improve CRC32C performance on SSE4.2

2025-06-17 Thread Devulapalli, Raghuveer

> In case Andy is asking about "how" rather than "under what circumstances", my > guess is: -O1+ may have just chosen instructions that also happen to > zero-extend, > which are common. -O0 doesn't represent the naive straightforward structure of > what the programmer wrote, it's more like an "ex

Re: Improve CRC32C performance on SSE4.2

2025-06-17 Thread Nathan Bossart

On Tue, Jun 17, 2025 at 03:55:06PM +0700, John Naylor wrote: >> "Devulapalli, Raghuveer" writes: >> > Replacing that with _mm512_zextsi128_si512 fixes the problem. > > Here's a patch for testing, which also reverts the previous > workaround. Help welcome, but I still promise to test it in the nea

Re: Improve CRC32C performance on SSE4.2

2025-06-17 Thread John Naylor

On Tue, Jun 17, 2025 at 6:40 AM Andy Fan wrote: > > "Devulapalli, Raghuveer" writes: > > > Great catch! From the intrinsic manual: > > > > Cast vector of type __m128i to type __m512i; the upper 384 bits of the > > result are undefined. Thanks Raghuveer and Nathan, for the diagnosis! > Just be c

RE: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Devulapalli, Raghuveer

ondra ; pgsql-hackers@lists.postgresql.org; > Shankaran, Akash > Subject: RE: Improve CRC32C performance on SSE4.2 > > > Just be curious, what kind of optimization (like what -O2 does) could > > mask this issue? > > >= -O1. Only -O0 shows this problem. > > Raghuveer >

RE: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Devulapalli, Raghuveer

4 PM > To: Devulapalli, Raghuveer > Cc: John Naylor ; Andy Fan > ; Jesper Pedersen ; > Tomas Vondra ; pgsql-hackers@lists.postgresql.org; > Shankaran, Akash > Subject: Re: Improve CRC32C performance on SSE4.2 > > On Mon, Jun 16, 2025 at 06:31:11PM +, Devulapalli, R

RE: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Devulapalli, Raghuveer

> Just be curious, what kind of optimization (like what -O2 does) could mask > this > issue? >= -O1. Only -O0 shows this problem. Raghuveer

Re: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Andy Fan

"Devulapalli, Raghuveer" writes: > Great catch! From the intrinsic manual: > > Cast vector of type __m128i to type __m512i; the upper 384 bits of the > result are undefined. Just be curious, what kind of optimization (like what -O2 does) could mask this issue? > Replacing that with _mm512_zext

Re: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Nathan Bossart

On Mon, Jun 16, 2025 at 06:31:11PM +, Devulapalli, Raghuveer wrote: > Attached is a simple reproducer. It passes with clang v16 -O0, but fails > with 17 and 18 only when built with -O0.. I've just started looking into this, but the difference in code generated for _mm512_castsi128_si512() betw

RE: Improve CRC32C performance on SSE4.2

2025-06-16 Thread Devulapalli, Raghuveer

> Cc: Jesper Pedersen ; Nathan Bossart > ; Tomas Vondra ; Devulapalli, > Raghuveer ; pgsql- > hack...@lists.postgresql.org; Shankaran, Akash > Subject: Re: Improve CRC32C performance on SSE4.2 > > On Sun, Jun 15, 2025 at 8:32 AM Andy Fan wrote: > > > > Jesper Pe

Re: Improve CRC32C performance on SSE4.2

2025-06-15 Thread John Naylor

On Sun, Jun 15, 2025 at 8:32 AM Andy Fan wrote: > > Jesper Pedersen writes: > > Hi, > > Thank you Nathan, Tomas and Jesper for the answers. The patch at [0] > works for me and I could work with master smoothly now. Pushed, thanks for testing! I'll do some more testing to see what versions/levels

Re: Improve CRC32C performance on SSE4.2

2025-06-14 Thread Andy Fan

Jesper Pedersen writes: Hi, Thank you Nathan, Tomas and Jesper for the answers. The patch at [0] works for me and I could work with master smoothly now. > On 6/14/25 9:56 AM, Nathan Bossart wrote: >> I'm also genuinely curious why folks are using -O0... >> > > Personally, I use > > -O0 -fno-o

Re: Improve CRC32C performance on SSE4.2

2025-06-14 Thread Jesper Pedersen

Hi Nathan, On 6/14/25 9:56 AM, Nathan Bossart wrote: I'm also genuinely curious why folks are using -O0... Personally, I use -O0 -fno-omit-frame-pointer for FlameGraph [1] investigations. [1] https://github.com/brendangregg/FlameGraph Best regards, Jesper

Re: Improve CRC32C performance on SSE4.2

2025-06-14 Thread Tomas Vondra

On 6/14/25 15:56, Nathan Bossart wrote: > On Sat, Jun 14, 2025 at 03:47:33PM +0200, Tomas Vondra wrote: >> I suggest you try with a newer gcc, perhaps 13.4. There's been a bunch >> of fixes related to AVX512 since 13.0, chances are this was already >> fixed. I don't see this failure on 14.3.1. >

Re: Improve CRC32C performance on SSE4.2

2025-06-14 Thread Nathan Bossart

On Sat, Jun 14, 2025 at 03:47:33PM +0200, Tomas Vondra wrote: > I suggest you try with a newer gcc, perhaps 13.4. There's been a bunch > of fixes related to AVX512 since 13.0, chances are this was already > fixed. I don't see this failure on 14.3.1. >From the config.log, it looks like Andy is usin

Re: Improve CRC32C performance on SSE4.2

2025-06-14 Thread Tomas Vondra

Hi, I suggest you try with a newer gcc, perhaps 13.4. There's been a bunch of fixes related to AVX512 since 13.0, chances are this was already fixed. I don't see this failure on 14.3.1. T. On 6/14/25 12:24, Andy Fan wrote: > Hi, > > Recently I always get below error during initdb. > > """ > UT

Re: Improve CRC32C performance on SSE4.2

2025-04-06 Thread John Naylor

On Sat, Apr 5, 2025 at 5:15 AM Nathan Bossart wrote: > > I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but > everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the > "32C" included). I tested the v16 patches (with the macro fixed and > assertions enabled)

Re: Improve CRC32C performance on SSE4.2

2025-04-04 Thread Nathan Bossart

On Wed, Apr 02, 2025 at 02:10:40PM +0700, John Naylor wrote: > Thanks for looking, I plan to commit this over the weekend unless > there are objections. I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the

RE: Improve CRC32C performance on SSE4.2

2025-04-03 Thread Devulapalli, Raghuveer

> Thanks for looking, I plan to commit this over the weekend unless there are > objections. LGTM. Raghuveer

Re: Improve CRC32C performance on SSE4.2

2025-04-02 Thread John Naylor

On Tue, Apr 1, 2025 at 11:25 PM Nathan Bossart wrote: > > On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote: > > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer > > wrote: > >> (2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c since > >> they contain both sse42

Re: Improve CRC32C performance on SSE4.2

2025-04-01 Thread Nathan Bossart

On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote: > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer > wrote: >> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of >> pg_popcount_avx512.c but perhaps that’s fine for now. I will propose a >> consolidated SIMD r

Re: Improve CRC32C performance on SSE4.2

2025-04-01 Thread John Naylor

On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer wrote: > > Hello John, > > v15 LGTM. Couple of minor comments: > > > I'm leaning towards a length limit for v15-0001 so that inlined > > instructions are > > likely to be unrolled. Aside from lack of commit message, I think that one > > is r

RE: Improve CRC32C performance on SSE4.2

2025-03-26 Thread Devulapalli, Raghuveer

Hello John, v15 LGTM. Couple of minor comments: > I'm leaning towards a length limit for v15-0001 so that inlined instructions > are > likely to be unrolled. Aside from lack of commit message, I think that one is > ready > for commit soon-ish. +1 > I'm feeling pretty good about 0002, but

Re: Improve CRC32C performance on SSE4.2

2025-03-25 Thread John Naylor

On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote: > > I'll take a look at the configure > checks soon, since I had some questions there. One other thing I forgot to mention: The previous test function had local constants that the compiler was able to fold, resulting in no actual vector instructi

Re: Improve CRC32C performance on SSE4.2

2025-03-25 Thread John Naylor

On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote: > I'll take a look at the configure > checks soon, since I had some questions there. I'm leaning towards a length limit for v15-0001 so that inlined instructions are likely to be unrolled. Aside from lack of commit message, I think that one is re

Re: Improve CRC32C performance on SSE4.2

2025-03-24 Thread John Naylor

On Thu, Mar 13, 2025 at 11:38 PM Devulapalli, Raghuveer wrote: > > > > I've gone ahead and added the generated AVX-512 algorithm in > > > v14-0005 > > Here is my benchmark numbers (master v/s v14) on TGL (time in ms): > > Buf-size-bytes Master v14 > 64 9.547 6.095 ... > 256 31.399

Re: Improve CRC32C performance on SSE4.2

2025-03-15 Thread John Naylor

On Wed, Mar 12, 2025 at 3:57 AM Devulapalli, Raghuveer wrote: > > > I've gone ahead and added the generated AVX-512 algorithm in v14-0005 > I'm a little confused. Why is v14 missing the earlier versions of pclmul > implementation that works without avx-512? As I mentioned upthread, the 128-bit

RE: Improve CRC32C performance on SSE4.2

2025-03-13 Thread Devulapalli, Raghuveer

> > I've gone ahead and added the generated AVX-512 algorithm in > > v14-0005 Here is my benchmark numbers (master v/s v14) on TGL (time in ms): Buf-size-bytes Master v14 64 9.547 6.095 80 10.999 6.248 96 12.443 7.756 112 14.129 9.62 128 15.295 6.534 144 16.7

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread Nathan Bossart

On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote: > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > wrote: >> Overall, I wish we could avoid splitting things into separate files and >> adding more header file gymnastics, but maybe there isn't much better we >> can do without overhaulin

RE: Improve CRC32C performance on SSE4.2

2025-03-11 Thread Devulapalli, Raghuveer

> I've gone ahead and added the generated AVX-512 algorithm in v14-0005 + pg_attribute_target("avx512vl,vpclmulqdq") + pg_crc32c + pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length) I'm a little confused. Why is v14 missing the earlier versions of pclmul implementation that wo

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread John Naylor

On Wed, Mar 5, 2025 at 10:52 PM Nathan Bossart wrote: > > On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote: > > That was my hunch too, but I wanted to be more sure, so I modified the > > benchmark so it doesn't know the address of the next calculation until > > it finishes the last calc

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread John Naylor

On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote: > > Overall, I wish we could avoid splitting things into separate files and > adding more header file gymnastics, but maybe there isn't much better we > can do without overhauling the CPU feature detection code. I wanted to make an attempt to m

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread John Naylor

On Tue, Mar 11, 2025 at 4:47 AM Nathan Bossart wrote: > > On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote: > > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > > wrote: > >> Overall, I wish we could avoid splitting things into separate files and > >> adding more header file gymnastics

Re: Improve CRC32C performance on SSE4.2

2025-03-05 Thread Nathan Bossart

On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote: > That was my hunch too, but I wanted to be more sure, so I modified the > benchmark so it doesn't know the address of the next calculation until > it finishes the last calculation so we can hopefully see the latency > caused by indirecti

Re: Improve CRC32C performance on SSE4.2

2025-03-04 Thread John Naylor

On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart wrote: > > On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote: > > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > > wrote: > >> This could potentially lead to a small regression for machines with SSE > >> 4.2 but not PCLMUL, but that may b

Re: Improve CRC32C performance on SSE4.2

2025-03-04 Thread Nathan Bossart

On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote: > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > wrote: >> This could potentially lead to a small regression for machines with SSE >> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not >> worry aobut. > > Note al

Re: Improve CRC32C performance on SSE4.2

2025-03-03 Thread John Naylor

On Tue, Mar 4, 2025 at 5:41 AM Devulapalli, Raghuveer wrote: > Some feedback on v11: > > if ((exx[2] & (1 << 20)) != 0) /* SSE 4.2 */ > { > pg_comp_crc32c = pg_comp_crc32c_sse42; > #ifdef USE_PCLMUL_WITH_RUNTIME_CHECK > if ((exx[2] & (1 << 1)) != 0) /* PCLMUL */ >

Re: Improve CRC32C performance on SSE4.2

2025-03-03 Thread John Naylor

On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote: > I spent some time staring at pg_crc32.h with all these patches applied, and > IIUC it leads to the following behavior: > > * For compiled-in SSE 4.2 builds, we branch based on the length. For > smaller inputs, we are using an inlined versi

RE: Improve CRC32C performance on SSE4.2

2025-03-03 Thread Devulapalli, Raghuveer

Hi John, > You raised some interesting points, which deserve a thoughtful response. After > sleeping on it, however I came to the conclusion that a sweeping change in > runtime checks, with either of our approaches, has downsides and unresolved > questions. Perhaps we can come back to it at a lat

Re: Improve CRC32C performance on SSE4.2

2025-03-03 Thread Nathan Bossart

On Fri, Feb 28, 2025 at 07:11:29PM +0700, John Naylor wrote: > 0002: For SSE4.2 builds, arrange so that constant input uses an > inlined path so that the compiler can emit unrolled loops anywhere. > This is particularly important for the WAL insertion lock, so this is > possibly committable on its

Re: Improve CRC32C performance on SSE4.2

2025-02-28 Thread John Naylor

Hi Raghuveer, You raised some interesting points, which deserve a thoughtful response. After sleeping on it, however I came to the conclusion that a sweeping change in runtime checks, with either of our approaches, has downsides and unresolved questions. Perhaps we can come back to it at a later t

RE: Improve CRC32C performance on SSE4.2

2025-02-27 Thread Devulapalli, Raghuveer

Hi John, Going back to your earlier comment: > > > I'm not a fan of exposing these architecture-specific details to > > > places that consult the capabilities. That requires things like +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42) > > Isn't that one thing currently pg_cpu_have(FE

Re: Improve CRC32C performance on SSE4.2

2025-02-26 Thread John Naylor

On Thu, Feb 27, 2025 at 3:53 AM Devulapalli, Raghuveer wrote: > > I'm not a fan of exposing these architecture-specific details to places > > that consult > > the capabilities. That requires things like > > > > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42) > > [...] > > +#define PGC

RE: Improve CRC32C performance on SSE4.2

2025-02-26 Thread Devulapalli, Raghuveer

Attached v10 with minor changes based on the feedback. > Thanks, I think this is a good direction. Some comments: > > #if defined(HAVE__GET_CPUID) > __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID) > __cpuid(exx, 1); #endif Oh yeah good catch. Fixed in v10. > +

Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor

On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer wrote: > > > I agree it would be preferable to make a centralized check work. > > Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid > checks into one single place including the AVX512 popcount code. Any new > feature

RE: Improve CRC32C performance on SSE4.2

2025-02-25 Thread Devulapalli, Raghuveer

> I agree it would be preferable to make a centralized check work. Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid checks into one single place including the AVX512 popcount code. Any new feature that requires CPUID information can access that information with pg_cpu_

Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor

On Tue, Feb 18, 2025 at 1:40 PM John Naylor wrote: > > On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart > wrote: > > While this needn't block this patch set, I do find the dispatch code to be > > pretty complicated. Maybe we can improve that in the future by using > > macros to auto-generate muc

Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor

On Tue, Feb 25, 2025 at 3:17 AM Devulapalli, Raghuveer wrote: > > > Here's another idea to make it more automatic: Give up on initializing every > > capability at once. > > I'm not sure I like giving up this. Initializing and running CPUID check with > the attribute constructor is very valuable f

RE: Improve CRC32C performance on SSE4.2

2025-02-24 Thread Devulapalli, Raghuveer

> Here's another idea to make it more automatic: Give up on initializing every > capability at once. I'm not sure I like giving up this. Initializing and running CPUID check with the attribute constructor is very valuable for two reasons: (1) you get everything done at load time before main and

Re: Improve CRC32C performance on SSE4.2

2025-02-20 Thread John Naylor

On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer wrote: > > > > Now, there is no equivalent on MSVC and workarounds are fragile [1]. > > Maybe we could only assert initialization happened for the backend and for > > frontend either > > - add a couple manual initializations to to the frontend

RE: Improve CRC32C performance on SSE4.2

2025-02-20 Thread Devulapalli, Raghuveer

> Now, there is no equivalent on MSVC and workarounds are fragile [1]. > Maybe we could only assert initialization happened for the backend and for > frontend either > - add a couple manual initializations to to the frontend programs where we > don't > want to lose performance for non-gcc/clang.

Re: Improve CRC32C performance on SSE4.2

2025-02-18 Thread John Naylor

On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer wrote: > The runtime detection code could also be appended with function > __attribute__((constructor)) so that it gets executed before main. Hmm! I didn't know about that, thanks! It works on old gcc/clang, so that's good. I can't verify on

RE: Improve CRC32C performance on SSE4.2

2025-02-18 Thread Devulapalli, Raghuveer

Hi John, Just a few comments on v7: > pg_cpucap_crc32c I like the idea of moving all CPU runtime detection to a single module outside of implementations. This makes it easy to extend in the future and keeps the functionalities separate. > - Rename the CRC choose*.c files to pg_cpucap_{x86,

RE: Improve CRC32C performance on SSE4.2

2025-02-18 Thread Devulapalli, Raghuveer

alli, Raghuveer ; pgsql- > hack...@lists.postgresql.org; Shankaran, Akash > Subject: Re: Improve CRC32C performance on SSE4.2 > > On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart > wrote: > > > > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote: > > > I tr

Re: Improve CRC32C performance on SSE4.2

2025-02-17 Thread John Naylor

On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart wrote: > > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote: > > I tried using branching for the runtime check, and this looks like the > > way to go: > > - Existing -msse4.2 builders will still call directly, but inside the > > function t

Re: Improve CRC32C performance on SSE4.2

2025-02-17 Thread Nathan Bossart

On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote: > I tried using branching for the runtime check, and this looks like the > way to go: > - Existing -msse4.2 builders will still call directly, but inside the > function there is a length check and only for long input will it do a > runtim

Re: Improve CRC32C performance on SSE4.2

2025-02-17 Thread John Naylor

On Thu, Feb 13, 2025 at 5:19 AM Nathan Bossart wrote: > > On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote: > >> Well, I suspect the AVX-512 version will pretty much always need the > >> runtime > >> check given that its not available on a lot of newer hardware and requires

Re: Improve CRC32C performance on SSE4.2

2025-02-13 Thread John Naylor

On Thu, Feb 13, 2025 at 4:18 AM Nathan Bossart wrote: > > I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer > overhead if possible. I looked at switching to always using runtime checks > for this stuff, and we concluded that we'd better not [0]. > > [0] https://postgr.es/m

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer

> > Sounds good to me. Although, users building with just -msse4.2 will > > now encounter an an additional pclmul runtime check. That would be a > > regression unless they update to building with both -msse4.2 and -mpclmul. > > My thinking was that building with just -msse4.2 would cause the exi

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Nathan Bossart

On Wed, Feb 12, 2025 at 09:48:57PM +, Devulapalli, Raghuveer wrote: >> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer >> overhead if possible. I looked at switching to always using runtime checks >> for this >> stuff, and we concluded that we'd better not [0]. > >

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer

> Well, I suspect the AVX-512 version will pretty much always need the runtime > check given that its not available on a lot of newer hardware and requires a > bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be > worth doing for PCLMUL. Otherwise, I think we'd have to leave

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Nathan Bossart

On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote: >> Well, I suspect the AVX-512 version will pretty much always need the runtime >> check given that its not available on a lot of newer hardware and requires a >> bunch of extra runtime checks (see pg_popcount_avx512.c). But i

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer

> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer > overhead if possible. I looked at switching to always using runtime checks > for this > stuff, and we concluded that we'd better not [0]. > > [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13 Does that

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Nathan Bossart

On Wed, Feb 12, 2025 at 09:02:27PM +, Devulapalli, Raghuveer wrote: > Also, do we really need to have both USE_SSE42_CRC32C and > USE_SSE42_CRC32C_WITH_RUNTIME_CHECK > features support? The former macro is used to enable running the SSE42 > version without a runtime check > when someone buil

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer

Hi, > 2. Unfortunately, there is another wrinkle that I failed to consider: If you > search > the web for "VirtualBox pclmulqdq" you can see a few reports from not very > long > ago that some hypervisors don't enable the CPUID for pclmul. I don't know how > big a problem that is in practice tod

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread John Naylor

On Wed, Feb 12, 2025 at 4:34 AM Devulapalli, Raghuveer wrote: > > On my machine that still regresses compared to master in that range > > (although by > > not as much) so I still think 128 bytes is the right threshold. > > On my TGL, buffer sizes as small as 64 bytes see performance benefits. Y

RE: Improve CRC32C performance on SSE4.2

2025-02-11 Thread Devulapalli, Raghuveer

Hello, Attached v3 which is same as v2 with the added PCLMULQDQ runtime CPUID check. > > I ran the same benchmark drive_crc32c with the postgres infrastructure and > found that your v2 sse42 version from corsix is slower than > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes. >

Re: Improve CRC32C performance on SSE4.2

2025-02-10 Thread John Naylor

On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer wrote: > I ran the same benchmark drive_crc32c with the postgres infrastructure and > found that your v2 sse42 version from corsix is slower than > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes. That matches my findings a

RE: Improve CRC32C performance on SSE4.2

2025-02-10 Thread Devulapalli, Raghuveer

Hi John, > I'm highly suspicious of these numbers because they show this version > is about 20x faster than "scalar", so relatively speaking 3x faster > than the AVX-512 proposal? Apologies for this. I was suspicious of this too and looks like I had unintentionally set the scalar version I wrote

Re: Improve CRC32C performance on SSE4.2

2025-02-09 Thread John Naylor

On Thu, Feb 6, 2025 at 3:49 AM Devulapalli, Raghuveer wrote: > > This patch improves the performance of SSE42 CRC32C algorithm. The current > SSE4.2 implementation of CRC32C relies on the native crc32 instruction and > processes 8 bytes at a time in a loop. The technique in this paper uses the

80 matches

Mail list logo