RE: Improve CRC32C performance on SSE4.2

2025-04-03 Thread Devulapalli, Raghuveer
> Thanks for looking, I plan to commit this over the weekend unless there are > objections. LGTM. Raghuveer

RE: Improve CRC32C performance on SSE4.2

2025-03-26 Thread Devulapalli, Raghuveer
Hello John, v15 LGTM. Couple of minor comments: > I'm leaning towards a length limit for v15-0001 so that inlined instructions > are > likely to be unrolled. Aside from lack of commit message, I think that one is > ready > for commit soon-ish. +1 > I'm feeling pretty good about 0002, but

RE: Improve CRC32C performance on SSE4.2

2025-03-13 Thread Devulapalli, Raghuveer
> > I've gone ahead and added the generated AVX-512 algorithm in > > v14-0005 Here is my benchmark numbers (master v/s v14) on TGL (time in ms): Buf-size-bytes Master v14 64 9.547 6.095 80 10.999 6.248 96 12.443 7.756 112 14.129 9.62 128 15.295 6.534 144 16.7

RE: CRC32C Parallel Computation Optimization on ARM

2025-03-12 Thread Devulapalli, Raghuveer
> > Intel has contributed SSE4.2 CRC32C [1] and AVX-512 CRC32C [2] based on > similar techniques to postgres. > > ...this is a restatement of facts we already know. I'm guessing the intended > takeaway is "since Intel submitted an implementation to us based on paper A, > then we are free to separa

RE: Improve CRC32C performance on SSE4.2

2025-03-11 Thread Devulapalli, Raghuveer
> I've gone ahead and added the generated AVX-512 algorithm in v14-0005 + pg_attribute_target("avx512vl,vpclmulqdq") + pg_crc32c + pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length) I'm a little confused. Why is v14 missing the earlier versions of pclmul implementation that wo

RE: CRC32C Parallel Computation Optimization on ARM

2025-03-11 Thread Devulapalli, Raghuveer
lingering concerns anyone may have w.r.t using this technique for the benefit of Postgres. Raghuveer > -Original Message- > From: John Naylor > Sent: Monday, March 10, 2025 6:31 PM > To: Devulapalli, Raghuveer > Cc: Dmitry Dolgov <9erthali...@gmail.com>; Nathan Bossart &g

RE: CRC32C Parallel Computation Optimization on ARM

2025-03-10 Thread Devulapalli, Raghuveer
Hi John, > On the other hand, looking at Linux kernel sources, it seems a patch using > this > technique was contributed by Intel over a decade ago: > > https://github.com/torvalds/linux/blob/master/arch/x86/crypto/crc32c-pcl-intel- > asm_64.S > > So one more thing to ask our friends at Intel.

RE: Improve CRC32C performance on SSE4.2

2025-03-03 Thread Devulapalli, Raghuveer
Hi John, > You raised some interesting points, which deserve a thoughtful response. After > sleeping on it, however I came to the conclusion that a sweeping change in > runtime checks, with either of our approaches, has downsides and unresolved > questions. Perhaps we can come back to it at a lat

RE: Improve CRC32C performance on SSE4.2

2025-02-27 Thread Devulapalli, Raghuveer
Hi John, Going back to your earlier comment: > > > I'm not a fan of exposing these architecture-specific details to > > > places that consult the capabilities. That requires things like +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42) > > Isn't that one thing currently pg_cpu_have(FE

RE: Improve CRC32C performance on SSE4.2

2025-02-26 Thread Devulapalli, Raghuveer
Attached v10 with minor changes based on the feedback. > Thanks, I think this is a good direction. Some comments: > > #if defined(HAVE__GET_CPUID) > __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID) > __cpuid(exx, 1); #endif Oh yeah good catch. Fixed in v10. > +

RE: Improve CRC32C performance on SSE4.2

2025-02-25 Thread Devulapalli, Raghuveer
> I agree it would be preferable to make a centralized check work. Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid checks into one single place including the AVX512 popcount code. Any new feature that requires CPUID information can access that information with pg_cpu_

RE: Improve CRC32C performance on SSE4.2

2025-02-24 Thread Devulapalli, Raghuveer
> Here's another idea to make it more automatic: Give up on initializing every > capability at once. I'm not sure I like giving up this. Initializing and running CPUID check with the attribute constructor is very valuable for two reasons: (1) you get everything done at load time before main and

RE: Improve CRC32C performance on SSE4.2

2025-02-20 Thread Devulapalli, Raghuveer
> Now, there is no equivalent on MSVC and workarounds are fragile [1]. > Maybe we could only assert initialization happened for the backend and for > frontend either > - add a couple manual initializations to to the frontend programs where we > don't > want to lose performance for non-gcc/clang.

RE: SIMD optimization for list_sort

2025-02-20 Thread Devulapalli, Raghuveer
> Note that our current implemention is highly optimized for low-cardinality > inputs. > This is needed for aggregate queries. I found this write-up of a couple > scalar and > vectorized sorts, and they show this library doing very poorly on very-low > cardinality inputs. I would look into that b

RE: Improve CRC32C performance on SSE4.2

2025-02-18 Thread Devulapalli, Raghuveer
Hi John, Just a few comments on v7: > pg_cpucap_crc32c I like the idea of moving all CPU runtime detection to a single module outside of implementations. This makes it easy to extend in the future and keeps the functionalities separate. > - Rename the CRC choose*.c files to pg_cpucap_{x86,

RE: Improve CRC32C performance on SSE4.2

2025-02-18 Thread Devulapalli, Raghuveer
and seems a lot cleaner to me. See https://github.com/r-devulap/postgressql/pull/5/commits/cf80f7375f24d2fb5cd29e95dc73d4988fc6d066/ Raghuveer > -Original Message- > From: John Naylor > Sent: Monday, February 17, 2025 10:40 PM > To: Nathan Bossart > Cc: Devulap

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
> > Sounds good to me. Although, users building with just -msse4.2 will > > now encounter an an additional pclmul runtime check. That would be a > > regression unless they update to building with both -msse4.2 and -mpclmul. > > My thinking was that building with just -msse4.2 would cause the exi

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
> Well, I suspect the AVX-512 version will pretty much always need the runtime > check given that its not available on a lot of newer hardware and requires a > bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be > worth doing for PCLMUL. Otherwise, I think we'd have to leave

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer > overhead if possible. I looked at switching to always using runtime checks > for this > stuff, and we concluded that we'd better not [0]. > > [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13 Does that

RE: Improve CRC32C performance on SSE4.2

2025-02-12 Thread Devulapalli, Raghuveer
Hi, > 2. Unfortunately, there is another wrinkle that I failed to consider: If you > search > the web for "VirtualBox pclmulqdq" you can see a few reports from not very > long > ago that some hypervisors don't enable the CPUID for pclmul. I don't know how > big a problem that is in practice tod

RE: Improve CRC32C performance on SSE4.2

2025-02-11 Thread Devulapalli, Raghuveer
Hello, Attached v3 which is same as v2 with the added PCLMULQDQ runtime CPUID check. > > I ran the same benchmark drive_crc32c with the postgres infrastructure and > found that your v2 sse42 version from corsix is slower than > pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes. >

RE: Improve CRC32C performance on SSE4.2

2025-02-10 Thread Devulapalli, Raghuveer
r sse4.2). This needs to be fixed by adding an additional cpuid check for pcmul but it would fall back to slicing by 8 on Nehalem and use the latest version on Westmere and above. If you care about keeping the performance on Nehalem, then I am happy to update the choose function to pick the righ

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2025-02-05 Thread Devulapalli, Raghuveer
Hi John, > Further, we verified upthread that Intel's current and near-future product > line > includes server chips (some with over 100 cores, so not exactly low-end) that > don't support AVX-512 at all. I have no idea how common they will be, but they > will certainly be found in cloud datacen

Improve CRC32C performance on SSE4.2

2025-02-05 Thread Devulapalli, Raghuveer
This patch improves the performance of SSE42 CRC32C algorithm. The current SSE4.2 implementation of CRC32C relies on the native crc32 instruction and processes 8 bytes at a time in a loop. The technique in this paper uses the pclmulqdq instruction and processing 64 bytes at time. The algorithm

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2025-01-24 Thread Devulapalli, Raghuveer
Hi John, Thanks for your summary and here are responses: > #1 - The choice of AVX-512. There is no such thing as a "CRC instruction > operating > on 8 bytes", and the proposed algorithm is a multistep process using carryless > multiplication and requiring at least 256 bytes of input. The Chrom

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2025-01-21 Thread Devulapalli, Raghuveer
> Hello! I'm Matthew Sterrett and I'm a coworker of Raghuveer; he asked me to > look into the Windows build failures related to pg_comp_crc32c. > > It seems that the only thing that was required to fix that is to mark > pg_comp_crc32c as PGDLLIMPORT, so I added a patch that does just that. > I'm n

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-12-07 Thread Devulapalli, Raghuveer
> Sorry for going back so far, but this thread was pointed out to me, and this > aspect > of the design could use some more discussion: > > + * pg_crc32c_avx512(): compute the crc32c of the buffer, where the > + * buffer length must be at least 256, and a multiple of 64. Based > > There is anoth

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-12-07 Thread Devulapalli, Raghuveer
> [0] https://cirrus-ci.com/task/6023394207989760 > [1] https://cirrus-ci.com/task/5460444254568448 > [2] https://cirrus-ci.com/task/6586344161411072 I was able to fix [0] and [1], but I can't think of why [2] fails. When I tried to reproduce this locally, I get a different unrelated error. Any i

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-12-03 Thread Devulapalli, Raghuveer
> Thanks! cfbot is showing a couple of errors [0] [1] [2]. Oh yikes, the CI had passed with an earlier version. Wonder if I made a mess of the rebase. I will take a look and fix them. Raghuveer

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-12-03 Thread Devulapalli, Raghuveer
> Raghuveer, would you mind rebasing this patch set now that the SSE4.2 patch is > committed? Rebased to master branch. Raghuveer v8-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch Description: v8-0001-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch v8-0002-Refactor-cons

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-27 Thread Devulapalli, Raghuveer
> Here's what I have staged for commit (except for the commit message, which > needs some work). LGTM. Raghuveer

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-25 Thread Devulapalli, Raghuveer
> > Thanks. I think my only remaining feedback is that we should probably add a > comment to explain why we aren't doing this for ARM yet [0]. Sounds good. Where would you like me to add this comment? Meson.build and configure?

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-11-25 Thread Devulapalli, Raghuveer
> > create mode 100644 src/test/modules/test_crc32c/test_crc32c.c > > create mode 100644 src/test/modules/test_crc32c/test_crc32c.control > > Needs to be integrated with the meson based build as well. Done. > > +drive_crc32c(PG_FUNCTION_ARGS) > > +{ > > + int64 count = P

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-22 Thread Devulapalli, Raghuveer
> I think we should still use the test program even when __SSE4_2__ is defined, > but > we can use that macro to determine whether to use a runtime check. I think > that > would keep autoconf and meson consistent. Sure. Updated patch. v5-0001-Use-__attribute__-target-sse4.2-for-SSE42-CRC32C.p

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-21 Thread Devulapalli, Raghuveer
> I think it is in pretty good shape. After a read-through, the only thing > that stands > out to me is the difference in configure tests between autoconf and meson. I > think we should consider picking one approach, although I'm not sure I have a > strong preference. Thanks for the review! We

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-20 Thread Devulapalli, Raghuveer
Anymore feedback on this patch? Hoping this is a straightforward one. Raghuveer > -Original Message- > From: Devulapalli, Raghuveer > Sent: Friday, November 8, 2024 11:05 AM > To: Devulapalli, Raghuveer ; Nathan Bossart > > Cc: pgsql-hackers@lists.postgresql.org

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-08 Thread Devulapalli, Raghuveer
V3: With the suggested changes. Raghuveer > -Original Message- > From: Devulapalli, Raghuveer > Sent: Friday, November 8, 2024 10:43 AM > To: Nathan Bossart > Cc: pgsql-hackers@lists.postgresql.org; Shankaran, Akash > > Subject: RE: Use __attribute__((targe

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-08 Thread Devulapalli, Raghuveer
> I believe we expect MSVC builds to use meson at this point, which is probably > why there's this extra check: > > if cc.get_id() == 'msvc' > cdata.set('USE_SSE42_CRC32C', false) > cdata.set('USE_SSE42_CRC32C_WITH_RUNTIME_CHECK', 1) > have_optimized_crc = true > Ah, I missed this.

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-08 Thread Devulapalli, Raghuveer
> -prog = ''' > +sse42_crc_prog = ''' > > nitpick: Why does this need to be renamed? Doesn't need to be renamed, but just a better name to describe that the code is meant for. It will be followed up with avx512_crc_prog in the next patch. > > +#ifdef TEST_SSE42_WITH_ATTRIBUTE > +#if d

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-07 Thread Devulapalli, Raghuveer
Here is the updated patch with the #if defined(__has_attribute) && __has_attribute (target) guard. > Oh, you are right, sorry. No worries! 😊 > Is the idea that we will put both "choose" functions in one file and the > actual > CRC-32C code in another? I'm okay with that. Yup, there can only

RE: Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-07 Thread Devulapalli, Raghuveer
> +__attribute__((target("sse4.2"))) > > These need to be surrounded with > > #if defined(__has_attribute) && __has_attribute (target) > > so that we still attempt the check on compilers that don't support it (e.g., > MSVC). I assumed configure was only used on linux and ignored this

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-11-07 Thread Devulapalli, Raghuveer
> Would you mind moving the function attribute change for the existing SSE > 4.2 code to its own patch? I think that is pretty straightforward, and IMHO > it'd be > nice to take care of it first so that we can focus on the new stuff. Just submitted a separate patch for this. Will update the CRC3

Use __attribute__((target(sse4.2))) for SSE42 CRC32C

2024-11-07 Thread Devulapalli, Raghuveer
Based on the discussion in PostgreSQL: Re: Proposal for Updating CRC32C with AVX-512 Algorithm., this patch leverages pg_attribute_target to build the SSE42 CRC32C code using function attributes. Raghuveer v1-0001-Use-__attribu

RE: Popcount optimization using AVX512

2024-11-07 Thread Devulapalli, Raghuveer
> Of course, as soon as I committed this, I noticed that it's broken. It seems > that > compilers are rather picky about how multiple target options are specified. Just curious, which compiler complained? Raghuveer

RE: Popcount optimization using AVX512

2024-10-31 Thread Devulapalli, Raghuveer
> Here is an updated patch with this change. LGTM. Raghuveer

RE: Popcount optimization using AVX512

2024-10-30 Thread Devulapalli, Raghuveer
> Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the > configure programs for meson. pg_attribute_target will be empty on MSVC, and > I > believe we only support meson builds there. Right. __has_attribute (target) produces a compiler warning on MSVC: https://gcc.godbolt.o

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-10-18 Thread Devulapalli, Raghuveer
> I've proposed a patch to move the existing AVX-512 code in Postgres to use > __attribute__((target("..."))) instead of per-translation-unit compiler flags > [0]. We > should likely do something similar for this one. > > [0] https://postgr.es/m/ZxAqRG1-8fJLMRUY%40nathan I assume this will be

RE: Proposal for Updating CRC32C with AVX-512 Algorithm.

2024-10-08 Thread Devulapalli, Raghuveer
nted to introduce > the > engineer who will be taking over the CRC32c proposal and commit fest entry. > > Devulapalli, Raghuveer > > I have brought him up to speed and he will be the go-to for technical review > comments and questions. Please welcome him into the community. > > Thanks, > Paul

RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> On that note, is it necessary to also check for avx512f? At the moment, we > are assuming that's supported if the other AVX-512 instructions are available. No, it's not needed. There are no CPU's with avx512bw/avx512popcnt without avx512f. Unfortunately though, avx512popcnt does not mean avx

RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> Thanks for the feedback. I've attached an updated patch. (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise zmm_regs_available() will return false. (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the same cpuid leaf. You could combine them into one to avo