On Sat, Apr 5, 2025 at 5:15 AM Nathan Bossart wrote:
>
> I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but
> everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the
> "32C" included). I tested the v16 patches (with the macro fixed and
> assertions enabled)
On Wed, Apr 02, 2025 at 02:10:40PM +0700, John Naylor wrote:
> Thanks for looking, I plan to commit this over the weekend unless
> there are objections.
I noticed that autoconf is defining USE_AVX512_CRC_WITH_RUNTIME_CHECK, but
everywhere else expects USE_AVX512_CRC32C_WITH_RUNTIME_CHECK (with the
> Thanks for looking, I plan to commit this over the weekend unless there are
> objections.
LGTM.
Raghuveer
On Tue, Apr 1, 2025 at 11:25 PM Nathan Bossart wrote:
>
> On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> > On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
> > wrote:
> >> (2) Might be apt to rename pg_crc32c_sse42*.c to pg_crc32c_x86*.c since
> >> they contain both sse42
On Tue, Apr 01, 2025 at 05:33:02PM +0700, John Naylor wrote:
> On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
> wrote:
>> (1) zmm_regs_available() in pg_crc32c_sse42_choose.c is a duplicate of
>> pg_popcount_avx512.c but perhaps that’s fine for now. I will propose a
>> consolidated SIMD r
On Thu, Mar 27, 2025 at 2:55 AM Devulapalli, Raghuveer
wrote:
>
> Hello John,
>
> v15 LGTM. Couple of minor comments:
>
> > I'm leaning towards a length limit for v15-0001 so that inlined
> > instructions are
> > likely to be unrolled. Aside from lack of commit message, I think that one
> > is r
Hello John,
v15 LGTM. Couple of minor comments:
> I'm leaning towards a length limit for v15-0001 so that inlined instructions
> are
> likely to be unrolled. Aside from lack of commit message, I think that one is
> ready
> for commit soon-ish.
+1
> I'm feeling pretty good about 0002, but
On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote:
>
> I'll take a look at the configure
> checks soon, since I had some questions there.
One other thing I forgot to mention: The previous test function had
local constants that the compiler was able to fold, resulting in no
actual vector instructi
On Mon, Mar 24, 2025 at 6:37 PM John Naylor wrote:
> I'll take a look at the configure
> checks soon, since I had some questions there.
I'm leaning towards a length limit for v15-0001 so that inlined
instructions are likely to be unrolled. Aside from lack of commit
message, I think that one is re
On Thu, Mar 13, 2025 at 11:38 PM Devulapalli, Raghuveer
wrote:
>
> > > I've gone ahead and added the generated AVX-512 algorithm in
> > > v14-0005
>
> Here is my benchmark numbers (master v/s v14) on TGL (time in ms):
>
> Buf-size-bytes Master v14
> 64 9.547 6.095
...
> 256 31.399
On Wed, Mar 12, 2025 at 3:57 AM Devulapalli, Raghuveer
wrote:
>
> > I've gone ahead and added the generated AVX-512 algorithm in v14-0005
> I'm a little confused. Why is v14 missing the earlier versions of pclmul
> implementation that works without avx-512?
As I mentioned upthread, the 128-bit
> > I've gone ahead and added the generated AVX-512 algorithm in
> > v14-0005
Here is my benchmark numbers (master v/s v14) on TGL (time in ms):
Buf-size-bytes Master v14
64 9.547 6.095
80 10.999 6.248
96 12.443 7.756
112 14.129 9.62
128 15.295 6.534
144 16.7
On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> wrote:
>> Overall, I wish we could avoid splitting things into separate files and
>> adding more header file gymnastics, but maybe there isn't much better we
>> can do without overhaulin
> I've gone ahead and added the generated AVX-512 algorithm in v14-0005
+ pg_attribute_target("avx512vl,vpclmulqdq")
+ pg_crc32c
+ pg_comp_crc32c_pclmul(pg_crc32c crc, const void *data, size_t length)
I'm a little confused. Why is v14 missing the earlier versions of pclmul
implementation that wo
On Wed, Mar 5, 2025 at 10:52 PM Nathan Bossart wrote:
>
> On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> > That was my hunch too, but I wanted to be more sure, so I modified the
> > benchmark so it doesn't know the address of the next calculation until
> > it finishes the last calc
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote:
>
> Overall, I wish we could avoid splitting things into separate files and
> adding more header file gymnastics, but maybe there isn't much better we
> can do without overhauling the CPU feature detection code.
I wanted to make an attempt to m
On Tue, Mar 11, 2025 at 4:47 AM Nathan Bossart wrote:
>
> On Mon, Mar 10, 2025 at 03:48:31PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> > wrote:
> >> Overall, I wish we could avoid splitting things into separate files and
> >> adding more header file gymnastics
On Wed, Mar 05, 2025 at 08:51:21AM +0700, John Naylor wrote:
> That was my hunch too, but I wanted to be more sure, so I modified the
> benchmark so it doesn't know the address of the next calculation until
> it finishes the last calculation so we can hopefully see the latency
> caused by indirecti
On Wed, Mar 5, 2025 at 12:36 AM Nathan Bossart wrote:
>
> On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> > On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> > wrote:
> >> This could potentially lead to a small regression for machines with SSE
> >> 4.2 but not PCLMUL, but that may b
On Tue, Mar 04, 2025 at 12:09:09PM +0700, John Naylor wrote:
> On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart
> wrote:
>> This could potentially lead to a small regression for machines with SSE
>> 4.2 but not PCLMUL, but that may be uncommon enough at this point to not
>> worry aobut.
>
> Note al
On Tue, Mar 4, 2025 at 5:41 AM Devulapalli, Raghuveer
wrote:
> Some feedback on v11:
>
> if ((exx[2] & (1 << 20)) != 0) /* SSE 4.2 */
> {
> pg_comp_crc32c = pg_comp_crc32c_sse42;
> #ifdef USE_PCLMUL_WITH_RUNTIME_CHECK
> if ((exx[2] & (1 << 1)) != 0) /* PCLMUL */
>
On Tue, Mar 4, 2025 at 2:11 AM Nathan Bossart wrote:
> I spent some time staring at pg_crc32.h with all these patches applied, and
> IIUC it leads to the following behavior:
>
> * For compiled-in SSE 4.2 builds, we branch based on the length. For
> smaller inputs, we are using an inlined versi
Hi John,
> You raised some interesting points, which deserve a thoughtful response. After
> sleeping on it, however I came to the conclusion that a sweeping change in
> runtime checks, with either of our approaches, has downsides and unresolved
> questions. Perhaps we can come back to it at a lat
On Fri, Feb 28, 2025 at 07:11:29PM +0700, John Naylor wrote:
> 0002: For SSE4.2 builds, arrange so that constant input uses an
> inlined path so that the compiler can emit unrolled loops anywhere.
> This is particularly important for the WAL insertion lock, so this is
> possibly committable on its
Hi Raghuveer,
You raised some interesting points, which deserve a thoughtful
response. After sleeping on it, however I came to the conclusion that
a sweeping change in runtime checks, with either of our approaches,
has downsides and unresolved questions. Perhaps we can come back to it
at a later t
Hi John,
Going back to your earlier comment:
> > > I'm not a fan of exposing these architecture-specific details to
> > > places that consult the capabilities. That requires things like
+#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> > Isn't that one thing currently pg_cpu_have(FE
On Thu, Feb 27, 2025 at 3:53 AM Devulapalli, Raghuveer
wrote:
> > I'm not a fan of exposing these architecture-specific details to places
> > that consult
> > the capabilities. That requires things like
> >
> > +#define PGCPUCAP_CRC32C pg_cpu_have(PG_CPU_FEATURE_SSE42)
> > [...]
> > +#define PGC
Attached v10 with minor changes based on the feedback.
> Thanks, I think this is a good direction. Some comments:
>
> #if defined(HAVE__GET_CPUID)
> __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); #elif defined(HAVE__CPUID)
> __cpuid(exx, 1); #endif
Oh yeah good catch. Fixed in v10.
> +
On Wed, Feb 26, 2025 at 7:21 AM Devulapalli, Raghuveer
wrote:
>
> > I agree it would be preferable to make a centralized check work.
>
> Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid
> checks into one single place including the AVX512 popcount code. Any new
> feature
> I agree it would be preferable to make a centralized check work.
Here is my first stab at it. v9 is same as v8 + a commit to move all cpuid
checks into one single place including the AVX512 popcount code. Any new
feature that requires CPUID information can access that information with
pg_cpu_
On Tue, Feb 18, 2025 at 1:40 PM John Naylor wrote:
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
> wrote:
> > While this needn't block this patch set, I do find the dispatch code to be
> > pretty complicated. Maybe we can improve that in the future by using
> > macros to auto-generate muc
On Tue, Feb 25, 2025 at 3:17 AM Devulapalli, Raghuveer
wrote:
>
> > Here's another idea to make it more automatic: Give up on initializing every
> > capability at once.
>
> I'm not sure I like giving up this. Initializing and running CPUID check with
> the attribute constructor is very valuable f
> Here's another idea to make it more automatic: Give up on initializing every
> capability at once.
I'm not sure I like giving up this. Initializing and running CPUID check with
the attribute constructor is very valuable for two reasons: (1) you get
everything done at load time before main and
On Fri, Feb 21, 2025 at 1:24 AM Devulapalli, Raghuveer
wrote:
>
>
> > Now, there is no equivalent on MSVC and workarounds are fragile [1].
> > Maybe we could only assert initialization happened for the backend and for
> > frontend either
> > - add a couple manual initializations to to the frontend
> Now, there is no equivalent on MSVC and workarounds are fragile [1].
> Maybe we could only assert initialization happened for the backend and for
> frontend either
> - add a couple manual initializations to to the frontend programs where we
> don't
> want to lose performance for non-gcc/clang.
On Wed, Feb 19, 2025 at 1:28 AM Devulapalli, Raghuveer
wrote:
> The runtime detection code could also be appended with function
> __attribute__((constructor)) so that it gets executed before main.
Hmm! I didn't know about that, thanks! It works on old gcc/clang, so
that's good. I can't verify on
Hi John,
Just a few comments on v7:
> pg_cpucap_crc32c
I like the idea of moving all CPU runtime detection to a single module outside
of implementations. This makes it easy to extend in the future and keeps the
functionalities separate.
> - Rename the CRC choose*.c files to pg_cpucap_{x86,
alli, Raghuveer ; pgsql-
> hack...@lists.postgresql.org; Shankaran, Akash
> Subject: Re: Improve CRC32C performance on SSE4.2
>
> On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
> wrote:
> >
> > On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > > I tr
On Tue, Feb 18, 2025 at 12:41 AM Nathan Bossart
wrote:
>
> On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> > I tried using branching for the runtime check, and this looks like the
> > way to go:
> > - Existing -msse4.2 builders will still call directly, but inside the
> > function t
On Mon, Feb 17, 2025 at 05:58:01PM +0700, John Naylor wrote:
> I tried using branching for the runtime check, and this looks like the
> way to go:
> - Existing -msse4.2 builders will still call directly, but inside the
> function there is a length check and only for long input will it do a
> runtim
On Thu, Feb 13, 2025 at 5:19 AM Nathan Bossart wrote:
>
> On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote:
> >> Well, I suspect the AVX-512 version will pretty much always need the
> >> runtime
> >> check given that its not available on a lot of newer hardware and requires
On Thu, Feb 13, 2025 at 4:18 AM Nathan Bossart wrote:
>
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible. I looked at switching to always using runtime checks
> for this stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m
> > Sounds good to me. Although, users building with just -msse4.2 will
> > now encounter an an additional pclmul runtime check. That would be a
> > regression unless they update to building with both -msse4.2 and -mpclmul.
>
> My thinking was that building with just -msse4.2 would cause the exi
On Wed, Feb 12, 2025 at 09:48:57PM +, Devulapalli, Raghuveer wrote:
>> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
>> overhead if possible. I looked at switching to always using runtime checks
>> for this
>> stuff, and we concluded that we'd better not [0].
>
>
> Well, I suspect the AVX-512 version will pretty much always need the runtime
> check given that its not available on a lot of newer hardware and requires a
> bunch of extra runtime checks (see pg_popcount_avx512.c). But it might be
> worth doing for PCLMUL. Otherwise, I think we'd have to leave
On Wed, Feb 12, 2025 at 10:12:20PM +, Devulapalli, Raghuveer wrote:
>> Well, I suspect the AVX-512 version will pretty much always need the runtime
>> check given that its not available on a lot of newer hardware and requires a
>> bunch of extra runtime checks (see pg_popcount_avx512.c). But i
> I think the idea behind USE_SSE42_CRC32C is to avoid the function pointer
> overhead if possible. I looked at switching to always using runtime checks
> for this
> stuff, and we concluded that we'd better not [0].
>
> [0] https://postgr.es/m/flat/20231030161706.GA3011%40nathanxps13
Does that
On Wed, Feb 12, 2025 at 09:02:27PM +, Devulapalli, Raghuveer wrote:
> Also, do we really need to have both USE_SSE42_CRC32C and
> USE_SSE42_CRC32C_WITH_RUNTIME_CHECK
> features support? The former macro is used to enable running the SSE42
> version without a runtime check
> when someone buil
Hi,
> 2. Unfortunately, there is another wrinkle that I failed to consider: If you
> search
> the web for "VirtualBox pclmulqdq" you can see a few reports from not very
> long
> ago that some hypervisors don't enable the CPUID for pclmul. I don't know how
> big a problem that is in practice tod
On Wed, Feb 12, 2025 at 4:34 AM Devulapalli, Raghuveer
wrote:
> > On my machine that still regresses compared to master in that range
> > (although by
> > not as much) so I still think 128 bytes is the right threshold.
>
> On my TGL, buffer sizes as small as 64 bytes see performance benefits.
Y
: v3-0002-Add-a-Postgres-SQL-function-for-crc32c-benchmarki.patch
v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch
Description: v3-0003-Improve-CRC32C-performance-on-SSE4.2.patch
v3-0004-Shorter-version-from-corsix.patch
Description: v3-0004-Shorter-version-from-corsix.patch
On Tue, Feb 11, 2025 at 7:25 AM Devulapalli, Raghuveer
wrote:
> I ran the same benchmark drive_crc32c with the postgres infrastructure and
> found that your v2 sse42 version from corsix is slower than
> pg_comp_crc32c_sse42 in master branch when buffer is < 128 bytes.
That matches my findings a
t pointer accordingly. Let me know which one you
would prefer.
Raghuveer
From: Devulapalli, Raghuveer
Sent: Wednesday, February 5, 2025 12:49 PM
To: pgsql-hackers@lists.postgresql.org
Cc: Shankaran, Akash ; Devulapalli, Raghuveer
Subject: Improve CRC32C performance on SSE4.2
This patch i
while(count--)
+ {
+ INIT_CRC32C(crc);
+ COMP_CRC32C(crc, data, num);
+ FIN_CRC32C(crc);
+ }
+
+ free((void *)data);
+
+ PG_RETURN_INT64((int64_t)crc);
+}
diff --git a/contrib/test_crc32c/test_crc32c.control b/contrib/test_crc32c/test_crc32c.control
new file mode 100644
index 00..878a077
116
Raghuveer
v1-0001-Add-more-test-coverage-for-crc32c.patch
Description: v1-0001-Add-more-test-coverage-for-crc32c.patch
v1-0002-Improve-CRC32C-performance-on-SSE4.2.patch
Description: v1-0002-Improve-CRC32C-performance-on-SSE4.2.patch
55 matches
Mail list logo