Re: Improve CRC32C performance on SSE4.2

2025-04-06 Thread John Naylor
On Sat, Apr 5, 2025 at 5:15 AM Nathan Bossart wrote: > > I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but > everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the > "32C" included). I tested the v16 patches (with the macro fixed and > assertions enabled)

Re: Improve CRC32C performance on SSE4.2

2025-04-04 Thread Nathan Bossart
On Wed, Apr 02, 2025 at 02:10:40PM +0700, John Naylor wrote: > Thanks for looking, I plan to commit this over the weekend unless > there are objections. I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the

RE: Improve CRC32C performance on SSE4.2

2025-04-03 Thread Devulapalli, Raghuveer
> Thanks for looking, I plan to commit this over the weekend unless there are > objections. LGTM. Raghuveer

Re: Improve CRC32C performance on SSE4.2

2025-04-02 Thread John Naylor
On Tue, Apr 1, 2025 at 11:25 PM Nathan Bossart wrote: > > On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote: > > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer > > wrote: > >> (2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c since > >> they contain both sse42

Re: Improve CRC32C performance on SSE4.2

2025-04-01 Thread Nathan Bossart
On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote: > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer > wrote: >> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of >> pg_popcount_avx512.c but perhaps that’s fine for now. I will propose a >> consolidated SIMD r

Re: Improve CRC32C performance on SSE4.2

2025-04-01 Thread John Naylor
On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer wrote: > > Hello John, > > v15 LGTM. Couple of minor comments: > > > I'm leaning towards a length limit for v15-0001 so that inlined > > instructions are > > likely to be unrolled. Aside from lack of commit message, I think that one > > is r

RE: Improve CRC32C performance on SSE4.2

2025-03-26 Thread Devulapalli, Raghuveer
Hello John, v15 LGTM. Couple of minor comments: > I'm leaning towards a length limit for v15-0001 so that inlined instructions > are > likely to be unrolled. Aside from lack of commit message, I think that one is > ready > for commit soon-ish. +1 > I'm feeling pretty good about 0002, but

Re: Improve CRC32C performance on SSE4.2

2025-03-25 Thread John Naylor
On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote: > > I'll take a look at the configure > checks soon, since I had some questions there. One other thing I forgot to mention: The previous test function had local constants that the compiler was able to fold, resulting in no actual vector instructi

Re: Improve CRC32C performance on SSE4.2

2025-03-25 Thread John Naylor
On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote: > I'll take a look at the configure > checks soon, since I had some questions there. I'm leaning towards a length limit for v15-0001 so that inlined instructions are likely to be unrolled. Aside from lack of commit message, I think that one is re

Re: Improve CRC32C performance on SSE4.2

2025-03-24 Thread John Naylor
On Thu, Mar 13, 2025 at 11:38 PM Devulapalli, Raghuveer wrote: > > > > I've gone ahead and added the generated AVX-512 algorithm in > > > v14-0005 > > Here is my benchmark numbers (master v/s v14) on TGL (time in ms): > > Buf-size-bytes Master v14 > 64 9.547 6.095 ... > 256 31.399

Re: Improve CRC32C performance on SSE4.2

2025-03-15 Thread John Naylor
On Wed, Mar 12, 2025 at 3:57 AM Devulapalli, Raghuveer wrote: > > > I've gone ahead and added the generated AVX-512 algorithm in v14-0005 > I'm a little confused. Why is v14 missing the earlier versions of pclmul > implementation that works without avx-512? As I mentioned upthread, the 128-bit

RE: Improve CRC32C performance on SSE4.2

2025-03-13 Thread Devulapalli, Raghuveer
> > I've gone ahead and added the generated AVX-512 algorithm in > > v14-0005 Here is my benchmark numbers (master v/s v14) on TGL (time in ms): Buf-size-bytes Master v14 64 9.547 6.095 80 10.999 6.248 96 12.443 7.756 112 14.129 9.62 128 15.295 6.534 144 16.7

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread Nathan Bossart
On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote: > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > wrote: >> Overall, I wish we could avoid splitting things into separate files and >> adding more header file gymnastics, but maybe there isn't much better we >> can do without overhaulin

RE: Improve CRC32C performance on SSE4.2

2025-03-11 Thread Devulapalli, Raghuveer
> I've gone ahead and added the generated AVX-512 algorithm in v14-0005 + pg_attribute_target("avx512vl,vpclmulqdq") + pg_crc32c + pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length) I'm a little confused. Why is v14 missing the earlier versions of pclmul implementation that wo

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread John Naylor
On Wed, Mar 5, 2025 at 10:52 PM Nathan Bossart wrote: > > On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote: > > That was my hunch too, but I wanted to be more sure, so I modified the > > benchmark so it doesn't know the address of the next calculation until > > it finishes the last calc

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread John Naylor
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote: > > Overall, I wish we could avoid splitting things into separate files and > adding more header file gymnastics, but maybe there isn't much better we > can do without overhauling the CPU feature detection code. I wanted to make an attempt to m

Re: Improve CRC32C performance on SSE4.2

2025-03-11 Thread John Naylor
On Tue, Mar 11, 2025 at 4:47 AM Nathan Bossart wrote: > > On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote: > > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > > wrote: > >> Overall, I wish we could avoid splitting things into separate files and > >> adding more header file gymnastics

Re: Improve CRC32C performance on SSE4.2

2025-03-05 Thread Nathan Bossart
On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote: > That was my hunch too, but I wanted to be more sure, so I modified the > benchmark so it doesn't know the address of the next calculation until > it finishes the last calculation so we can hopefully see the latency > caused by indirecti

Re: Improve CRC32C performance on SSE4.2

2025-03-04 Thread John Naylor
On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart wrote: > > On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote: > > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > > wrote: > >> This could potentially lead to a small regression for machines with SSE > >> 4.2 but not PCLMUL, but that may b

Re: Improve CRC32C performance on SSE4.2

2025-03-04 Thread Nathan Bossart
On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote: > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart > wrote: >> This could potentially lead to a small regression for machines with SSE >> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not >> worry aobut. > > Note al

Re: Improve CRC32C performance on SSE4.2

2025-03-03 Thread John Naylor
On Tue, Mar 4, 2025 at 5:41 AM Devulapalli, Raghuveer wrote: > Some feedback on v11: > > if ((exx[2] & (1 << 20)) != 0) /* SSE 4.2 */ > { > pg_comp_crc32c = pg_comp_crc32c_sse42; > #ifdef USE_PCLMUL_WITH_RUNTIME_CHECK > if ((exx[2] & (1 << 1)) != 0) /* PCLMUL */ >

Re: Improve CRC32C performance on SSE4.2

2025-03-03 Thread John Naylor
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote: > I spent some time staring at pg_crc32.h with all these patches applied, and > IIUC it leads to the following behavior: > > * For compiled-in SSE 4.2 builds, we branch based on the length. For > smaller inputs, we are using an inlined versi

RE: Improve CRC32C performance on SSE4.2

2025-03-03 Thread Devulapalli, Raghuveer
Hi John, > You raised some interesting points, which deserve a thoughtful response. After > sleeping on it, however I came to the conclusion that a sweeping change in > runtime checks, with either of our approaches, has downsides and unresolved > questions. Perhaps we can come back to it at a lat

Re: Improve CRC32C performance on SSE4.2

2025-03-03 Thread Nathan Bossart
On Fri, Feb 28, 2025 at 07:11:29PM +0700, John Naylor wrote: > 0002: For SSE4.2 builds, arrange so that constant input uses an > inlined path so that the compiler can emit unrolled loops anywhere. > This is particularly important for the WAL insertion lock, so this is > possibly committable on its

Re: Improve CRC32C performance on SSE4.2

2025-02-28 Thread John Naylor
Hi Raghuveer, You raised some interesting points, which deserve a thoughtful response. After sleeping on it, however I came to the conclusion that a sweeping change in runtime checks, with either of our approaches, has downsides and unresolved questions. Perhaps we can come back to it at a later t

RE: Improve CRC32C performance on SSE4.2

2025-02-27 Thread Devulapalli, Raghuveer
Hi John, Going back to your earlier comment: > > > I'm not a fan of exposing these architecture-specific details to > > > places that consult the capabilities. That requires things like +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42) > > Isn't that one thing currently pg_cpu_have(FE

Re: Improve CRC32C performance on SSE4.2

2025-02-26 Thread John Naylor
On Thu, Feb 27, 2025 at 3:53 AM Devulapalli, Raghuveer wrote: > > I'm not a fan of exposing these architecture-specific details to places > > that consult > > the capabilities. That requires things like > > > > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42) > > [...] > > +#define PGC

RE: Improve CRC32C performance on SSE4.2

2025-02-26 Thread Devulapalli, Raghuveer
Attached v10 with minor changes based on the feedback. > Thanks, I think this is a good direction. Some comments: > > #if defined(HAVE__GET_CPUID) > __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID) > __cpuid(exx, 1); #endif Oh yeah good catch. Fixed in v10. > +

Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor
On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer wrote: > > > I agree it would be preferable to make a centralized check work. > > Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid > checks into one single place including the AVX512 popcount code. Any new > feature

RE: Improve CRC32C performance on SSE4.2

2025-02-25 Thread Devulapalli, Raghuveer
> I agree it would be preferable to make a centralized check work. Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid checks into one single place including the AVX512 popcount code. Any new feature that requires CPUID information can access that information with pg_cpu_

Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor
On Tue, Feb 18, 2025 at 1:40 PM John Naylor wrote: > > On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart > wrote: > > While this needn't block this patch set, I do find the dispatch code to be > > pretty complicated. Maybe we can improve that in the future by using > > macros to auto-generate muc

Re: Improve CRC32C performance on SSE4.2

2025-02-25 Thread John Naylor
On Tue, Feb 25, 2025 at 3:17 AM Devulapalli, Raghuveer wrote: > > > Here's another idea to make it more automatic: Give up on initializing every > > capability at once. > > I'm not sure I like giving up this. Initializing and running CPUID check with > the attribute constructor is very valuable f

RE: Improve CRC32C performance on SSE4.2

2025-02-24 Thread Devulapalli, Raghuveer
> Here's another idea to make it more automatic: Give up on initializing every > capability at once. I'm not sure I like giving up this. Initializing and running CPUID check with the attribute constructor is very valuable for two reasons: (1) you get everything done at load time before main and

Re: Improve CRC32C performance on SSE4.2

2025-02-20 Thread John Naylor
On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer wrote: > > > > Now, there is no equivalent on MSVC and workarounds are fragile [1]. > > Maybe we could only assert initialization happened for the backend and for > > frontend either > > - add a couple manual initializations to to the frontend

RE: Improve CRC32C performance on SSE4.2

2025-02-20 Thread Devulapalli, Raghuveer
> Now, there is no equivalent on MSVC and workarounds are fragile [1]. > Maybe we could only assert initialization happened for the backend and for > frontend either > - add a couple manual initializations to to the frontend programs where we > don't > want to lose performance for non-gcc/clang.

Re: Improve CRC32C performance on SSE4.2

2025-02-18 Thread John Naylor
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer wrote: > The runtime detection code could also be appended with function > __attribute__((constructor)) so that it gets executed before main. Hmm! I didn't know about that, thanks! It works on old gcc/clang, so that's good. I can't verify on

RE: Improve CRC32C performance on SSE4.2

2025-02-18 Thread Devulapalli, Raghuveer
Hi John, Just a few comments on v7: > pg_cpucap_crc32c I like the idea of moving all CPU runtime detection to a single module outside of implementations. This makes it easy to extend in the future and keeps the functionalities separate. > - Rename the CRC choose*.c files to pg_cpucap_{x86,

RE: Improve CRC32C performance on SSE4.2

2025-02-18 Thread Devulapalli, Raghuveer
alli, Raghuveer ; pgsql- > hack...@lists.postgresql.org; Shankaran, Akash > Subject: Re: Improve CRC32C performance on SSE4.2 > > On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart > wrote: > > > > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote: > > > I tr

Re: Improve CRC32C performance on SSE4.2

2025-02-17 Thread John Naylor
On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart wrote: > > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote: > > I tried using branching for the runtime check, and this looks like the > > way to go: > > - Existing -msse4.2 builders will still call directly, but inside the > > function t

Re: Improve CRC32C performance on SSE4.2

2025-02-17 Thread Nathan Bossart
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote: > I tried using branching for the runtime check, and this looks like the > way to go: > - Existing -msse4.2 builders will still call directly, but inside the > function there is a length check and only for long input will it do a > runtim

Re: Improve CRC32C performance on SSE4.2

2025-02-17 Thread John Naylor
On Thu, Feb 13, 2025 at 5:19 AM Nathan Bossart wrote: > > On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote: > >> Well, I suspect the AVX-512 version will pretty much always need the > >> runtime > >> check given that its not available on a lot of newer hardware and requires

Re: Improve CRC32C performance on SSE4.2

2025-02-13 Thread John Naylor
On Thu, Feb 13, 2025 at 4:18 AM Nathan Bossart wrote: > > I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer > overhead if possible. I looked at switching to always using runtime checks > for this stuff, and we concluded that we'd better not [0]. > > [0] https://postgr.es/m

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
> > Sounds good to me. Although, users building with just -msse4.2 will > > now encounter an an additional pclmul runtime check. That would be a > > regression unless they update to building with both -msse4.2 and -mpclmul. > > My thinking was that building with just -msse4.2 would cause the exi

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Nathan Bossart
On Wed, Feb 12, 2025 at 09:48:57PM +, Devulapalli, Raghuveer wrote: >> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer >> overhead if possible. I looked at switching to always using runtime checks >> for this >> stuff, and we concluded that we'd better not [0]. > >

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
> Well, I suspect the AVX-512 version will pretty much always need the runtime > check given that its not available on a lot of newer hardware and requires a > bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be > worth doing for PCLMUL. Otherwise, I think we'd have to leave

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Nathan Bossart
On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote: >> Well, I suspect the AVX-512 version will pretty much always need the runtime >> check given that its not available on a lot of newer hardware and requires a >> bunch of extra runtime checks (see pg_popcount_avx512.c). But i

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer > overhead if possible. I looked at switching to always using runtime checks > for this > stuff, and we concluded that we'd better not [0]. > > [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13 Does that

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Nathan Bossart
On Wed, Feb 12, 2025 at 09:02:27PM +, Devulapalli, Raghuveer wrote: > Also, do we really need to have both USE_SSE42_CRC32C and > USE_SSE42_CRC32C_WITH_RUNTIME_CHECK > features support? The former macro is used to enable running the SSE42 > version without a runtime check > when someone buil

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
Hi, > 2. Unfortunately, there is another wrinkle that I failed to consider: If you > search > the web for "VirtualBox pclmulqdq" you can see a few reports from not very > long > ago that some hypervisors don't enable the CPUID for pclmul. I don't know how > big a problem that is in practice tod

Re: Improve CRC32C performance on SSE4.2

2025-02-12 Thread John Naylor
On Wed, Feb 12, 2025 at 4:34 AM Devulapalli, Raghuveer wrote: > > On my machine that still regresses compared to master in that range > > (although by > > not as much) so I still think 128 bytes is the right threshold. > > On my TGL, buffer sizes as small as 64 bytes see performance benefits. Y

RE: Improve CRC32C performance on SSE4.2

2025-02-11 Thread Devulapalli, Raghuveer
: v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch Description: v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch v3-0004-Shorter-version-from-corsix.patch Description: v3-0004-Shorter-version-from-corsix.patch

Re: Improve CRC32C performance on SSE4.2

2025-02-10 Thread John Naylor
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer wrote: > I ran the same benchmark drive_crc32c with the postgres infrastructure and > found that your v2 sse42 version from corsix is slower than > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes. That matches my findings a

RE: Improve CRC32C performance on SSE4.2

2025-02-10 Thread Devulapalli, Raghuveer
t pointer accordingly. Let me know which one you would prefer. Raghuveer From: Devulapalli, Raghuveer Sent: Wednesday, February 5, 2025 12:49 PM To: pgsql-hackers@lists.postgresql.org Cc: Shankaran, Akash ; Devulapalli, Raghuveer Subject: Improve CRC32C performance on SSE4.2 This patch i

Re: Improve CRC32C performance on SSE4.2

2025-02-09 Thread John Naylor
while(count--) + { + INIT_CRC32C(crc); + COMP_CRC32C(crc, data, num); + FIN_CRC32C(crc); + } + + free((void *)data); + + PG_RETURN_INT64((int64_t)crc); +} diff --git a/contrib/test_crc32c/test_crc32c.control b/contrib/test_crc32c/test_crc32c.control new file mode 100644 index 00..878a077

Improve CRC32C performance on SSE4.2

2025-02-05 Thread Devulapalli, Raghuveer
116 Raghuveer v1-0001-Add-more-test-coverage-for-crc32c.patch Description: v1-0001-Add-more-test-coverage-for-crc32c.patch v1-0002-Improve-CRC32C-performance-on-SSE4.2.patch Description: v1-0002-Improve-CRC32C-performance-on-SSE4.2.patch