Re: Popcount optimization using AVX512

2024-11-07 Thread Nathan Bossart
On Thu, Nov 07, 2024 at 08:38:21PM +, Devulapalli, Raghuveer wrote: > >> Of course, as soon as I committed this, I noticed that it's broken. It >> seems that >> compilers are rather picky about how multiple target options are specified. > > Just curious, which compiler complained? Clang.

RE: Popcount optimization using AVX512

2024-11-07 Thread Devulapalli, Raghuveer
> Of course, as soon as I committed this, I noticed that it's broken. It seems > that > compilers are rather picky about how multiple target options are specified. Just curious, which compiler complained? Raghuveer

Re: Popcount optimization using AVX512

2024-11-07 Thread Nathan Bossart
On Thu, Nov 07, 2024 at 02:03:04PM -0600, Nathan Bossart wrote: > Committed. Of course, as soon as I committed this, I noticed that it's broken. It seems that compilers are rather picky about how multiple target options are specified. AFAICT the commonly supported syntax is to put the entire lis

Re: Popcount optimization using AVX512

2024-11-07 Thread Nathan Bossart
Committed. -- nathan

Re: Popcount optimization using AVX512

2024-11-07 Thread Nathan Bossart
On Thu, Nov 07, 2024 at 11:12:37AM -0500, Andres Freund wrote: > One thing that'd I'd like to see this being used is to elide the indirection > when the current target platform *already* supports the necessary > intrinsics. Adding a bunch of indirection for short & common operations is > decidedly

Re: Popcount optimization using AVX512

2024-11-07 Thread Andres Freund
Hi, On 2024-11-06 20:26:47 -0600, Nathan Bossart wrote: > From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001 > From: Nathan Bossart > Date: Wed, 16 Oct 2024 15:57:55 -0500 > Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff One thing that'd I'd like t

Re: Popcount optimization using AVX512

2024-11-06 Thread Nathan Bossart
rebased -- nathan >From d0fb7e0e375f7b76d4df90910c21e9448dd3b380 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 16 Oct 2024 15:57:55 -0500 Subject: [PATCH v3 1/1] use __attribute__((target(...))) for AVX-512 stuff --- config/c-compiler.m4 | 64 +- configure

Re: Popcount optimization using AVX512

2024-11-04 Thread Nathan Bossart
On Thu, Oct 31, 2024 at 07:58:06PM +, Devulapalli, Raghuveer wrote: > LGTM. Thanks. Barring additional feedback, I plan to commit this soon. -- nathan

Re: Popcount optimization using AVX512

2024-10-31 Thread Nathan Bossart
On Wed, Oct 30, 2024 at 04:10:10PM -0500, Nathan Bossart wrote: > On Wed, Oct 30, 2024 at 08:53:10PM +, Raghuveer Devulapalli wrote: >> BTW, I just realized function attributes for xsave and avx512 don't work >> on MSVC (see >> https://developercommunity.visualstudio.com/t/support-function-targ

RE: Popcount optimization using AVX512

2024-10-31 Thread Devulapalli, Raghuveer
> Here is an updated patch with this change. LGTM. Raghuveer

RE: Popcount optimization using AVX512

2024-10-30 Thread Devulapalli, Raghuveer
> Oh, good catch. IIUC we only need to check for #ifndef _MSC_VER in the > configure programs for meson. pg_attribute_target will be empty on MSVC, and > I > believe we only support meson builds there. Right. __has_attribute (target) produces a compiler warning on MSVC: https://gcc.godbolt.o

Re: Popcount optimization using AVX512

2024-10-30 Thread Nathan Bossart
On Wed, Oct 30, 2024 at 08:53:10PM +, Raghuveer Devulapalli wrote: > BTW, I just realized function attributes for xsave and avx512 don't work > on MSVC (see > https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630). > Not sure if you care abo

Re: Popcount optimization using AVX512

2024-10-30 Thread Raghuveer Devulapalli
BTW, I just realized function attributes for xsave and avx512 don't work on MSVC (see https://developercommunity.visualstudio.com/t/support-function-target-attribute-and-mutiversioning/10130630). Not sure if you care about it. Its an easy fix (see https://gcc.godbolt.org/z/Pebdj3vMx).

Re: Popcount optimization using AVX512

2024-10-29 Thread Raghuveer Devulapalli
The following review has been posted through the commitfest application: make installcheck-world: tested, failed Implements feature: tested, failed Spec compliant: tested, failed Documentation:tested, failed Changes LGTM. Makes the Makefile look clean. Built and ran te

Re: Popcount optimization using AVX512

2024-10-16 Thread Nathan Bossart
On Tue, Oct 08, 2024 at 09:36:03PM -0500, Nathan Bossart wrote: > On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote: >> On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote: >>> I think we'd be better off enabling architectural features on a per-function >>> basis, roughly lik

Re: Popcount optimization using AVX512

2024-10-08 Thread Nathan Bossart
On Wed, Jul 31, 2024 at 04:43:02PM -0500, Nathan Bossart wrote: > On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote: >> I think we'd be better off enabling architectural features on a per-function >> basis, roughly like this: >> >> [...] >> >> /* FIXME: Should be gated by configure che

Re: Popcount optimization using AVX512

2024-07-31 Thread Nathan Bossart
On Wed, Jul 31, 2024 at 01:52:54PM -0700, Andres Freund wrote: > On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote: >> As I started on this, I remembered why I needed it. The file >> pg_popcount_avx512_choose.c is compiled without the AVX-512 flags in order >> to avoid inadvertently issuing any A

Re: Popcount optimization using AVX512

2024-07-31 Thread Andres Freund
Hi, On 2024-07-30 22:12:18 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote: > > On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote: > >> My point is that _xgetbv() is made available by -mavx512vpopcntdq > >> -mavx512bw > >> alone, without n

Re: Popcount optimization using AVX512

2024-07-30 Thread Nathan Bossart
On Tue, Jul 30, 2024 at 10:01:50PM -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote: >> My point is that _xgetbv() is made available by -mavx512vpopcntdq -mavx512bw >> alone, without needing -mxsave: > > Oh, I see. I'll work on a patch to remove that co

Re: Popcount optimization using AVX512

2024-07-30 Thread Nathan Bossart
On Tue, Jul 30, 2024 at 07:43:08PM -0700, Andres Freund wrote: > On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote: >> The main purpose of the XSAVE compiler check is to determine whether we >> need to add -mxsave in order to use _xgetbv() [0]. If that wasn't a >> factor, we could probably skip i

Re: Popcount optimization using AVX512

2024-07-30 Thread Andres Freund
Hi, On 2024-07-30 21:01:31 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote: > > On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote: > >> On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: > >> > Why are we actually checking for xsave? We're

Re: Popcount optimization using AVX512

2024-07-30 Thread Nathan Bossart
On Tue, Jul 30, 2024 at 06:46:51PM -0700, Andres Freund wrote: > On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote: >> On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: >> > Why are we actually checking for xsave? We're not using xsave itself and I >> > couldn't find a comment in 7927

Re: Popcount optimization using AVX512

2024-07-30 Thread Andres Freund
Hi, On 2024-07-30 20:20:34 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: > > Ah, I somehow thought we'd avoid the runtime check in case we determine at > > compile time we don't need any extra flags to enable the AVX512 stuff > > (similar > > to how

Re: Popcount optimization using AVX512

2024-07-30 Thread Nathan Bossart
On Tue, Jul 30, 2024 at 05:49:59PM -0700, Andres Freund wrote: > Ah, I somehow thought we'd avoid the runtime check in case we determine at > compile time we don't need any extra flags to enable the AVX512 stuff (similar > to how we deal with crc32). But it looks like that's not the case - which >

Re: Popcount optimization using AVX512

2024-07-30 Thread Thomas Munro
On Wed, Jul 31, 2024 at 12:50 PM Andres Freund wrote: > It's one thing for the avx512 path to have that overhead, but it's > particularly absurd for pg_popcount32/pg_popcount64, where > > a) The function call overhead is a larger proportion of the cost. > b) the instruction is almost universally a

Re: Popcount optimization using AVX512

2024-07-30 Thread Andres Freund
Hi, On 2024-07-30 16:32:07 -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: > > Now, a reasonable counter-argument would be that only some of these macros > > are > > defined for msvc ([1]). However, as it turns out, the test is broken > > today, as m

Re: Popcount optimization using AVX512

2024-07-30 Thread Nathan Bossart
On Tue, Jul 30, 2024 at 04:32:07PM -0500, Nathan Bossart wrote: > On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: >> Afaict we could just check for predefined preprocessor macros: >> >> echo|time gcc -c -mxsave -mavx512vpopcntdq -mavx512bw -xc -dM -E - -o >> -|grep -E '__XSAVE__|_

Re: Popcount optimization using AVX512

2024-07-30 Thread Nathan Bossart
On Tue, Jul 30, 2024 at 02:07:01PM -0700, Andres Freund wrote: > I've noticed that the configure probes for this are quite slow - pretty much > the slowest step in a meson setup (and autoconf is similar). While looking > into this, I also noticed that afaict the tests don't do the right thing for

Re: Popcount optimization using AVX512

2024-07-30 Thread Andres Freund
Hi, On 2024-04-23 11:02:07 -0500, Nathan Bossart wrote: > On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote: > > Makes sense, thanks. I'm planning to commit this fix sometime early next > > week. > > Committed. I've noticed that the configure probes for this are quite slow - pretty

Re: Popcount optimization using AVX512

2024-04-23 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 05:13:58PM -0500, Nathan Bossart wrote: > Makes sense, thanks. I'm planning to commit this fix sometime early next > week. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 10:11:08PM +, Devulapalli, Raghuveer wrote: >> On that note, is it necessary to also check for avx512f? At the moment, >> we are assuming that's supported if the other AVX-512 instructions are >> available. > > No, it's not needed. There are no CPU's with avx512bw/avx5

RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> On that note, is it necessary to also check for avx512f? At the moment, we > are assuming that's supported if the other AVX-512 instructions are available. No, it's not needed. There are no CPU's with avx512bw/avx512popcnt without avx512f. Unfortunately though, avx512popcnt does not mean avx

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 09:29:55PM +, Devulapalli, Raghuveer wrote: > (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise > zmm_regs_available() will return false.. Yes, that's a mistake. I fixed that in v3. > (2) Nitpick: avx512_popcnt_available and avx512_bw_available() ru

RE: Popcount optimization using AVX512

2024-04-18 Thread Devulapalli, Raghuveer
> Thanks for the feedback. I've attached an updated patch. (1) Shouldn't it be: return (_xgetbv(0) & 0xe6) == 0xe6; ? Otherwise zmm_regs_available() will return false. (2) Nitpick: avx512_popcnt_available and avx512_bw_available() run the same cpuid leaf. You could combine them into one to avo

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 08:24:03PM +, Devulapalli, Raghuveer wrote: >> This seems to contradict the note about doing step 3 at any point, and >> given step 1 is the OSXSAVE check, I'm not following what this means, >> anyway. > > It is recommended that you run the xgetbv code before you check

Re: Popcount optimization using AVX512

2024-04-18 Thread Nathan Bossart
On Thu, Apr 18, 2024 at 06:12:22PM +, Shankaran, Akash wrote: > Good find. I confirmed after speaking with an intel expert, and from the > intel AVX-512 manual [0] section 14.3, which recommends to check bit27. From > the manual: > > "Prior to using Intel AVX, the application must identify t

RE: Popcount optimization using AVX512

2024-04-18 Thread Shankaran, Akash
> It was brought to my attention [0] that we probably should be checking for > the OSXSAVE bit instead of the XSAVE bit when determining whether there's > support for the XGETBV instruction. IIUC that should indicate that both the > OS and the processor have XGETBV support (not just the process

Re: Popcount optimization using AVX512

2024-04-17 Thread Nathan Bossart
It was brought to my attention [0] that we probably should be checking for the OSXSAVE bit instead of the XSAVE bit when determining whether there's support for the XGETBV instruction. IIUC that should indicate that both the OS and the processor have XGETBV support (not just the processor). I've a

Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart writes: > On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: >> The Intel documentation for _mm256_undefined_si256() [0] >> indicates that it is intended to return "undefined elements," so it seems >> like the use of an uninitialized variable might be intentional. > Se

Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:23:32PM -0500, Nathan Bossart wrote: > The Intel documentation for _mm256_undefined_si256() [0] > indicates that it is intended to return "undefined elements," so it seems > like the use of an uninitialized variable might be intentional. See also https://gcc.gnu.org/git/

Re: Popcount optimization using AVX512

2024-04-07 Thread Nathan Bossart
On Sun, Apr 07, 2024 at 08:42:12PM -0400, Tom Lane wrote: > Today's Coverity run produced this warning, which seemingly was > triggered by one of these commits, but I can't make much sense > of it: > > *** CID 1596255: Uninitialized variables (UNINIT) > /usr/lib/gcc/x86_64-linux-gnu/10/include/a

Re: Popcount optimization using AVX512

2024-04-07 Thread Tom Lane
Nathan Bossart writes: > Here is what I have staged for commit, which I intend to do shortly. Today's Coverity run produced this warning, which seemingly was triggered by one of these commits, but I can't make much sense of it: *** CID 1596255: Uninitialized variables (UNINIT) /usr/lib/gcc/x86

Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 02:41:01PM -0500, Nathan Bossart wrote: > Here is what I have staged for commit, which I intend to do shortly. Committed. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

Re: Popcount optimization using AVX512

2024-04-06 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 02:51:39PM +1300, David Rowley wrote: > On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote: >> On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: >> > Won't Valgrind complain about this? >> > >> > +pg_popcount_avx512(const char *buf, int bytes) >> > >> > + buf = (

Re: Popcount optimization using AVX512

2024-04-05 Thread David Rowley
On Sat, 6 Apr 2024 at 14:17, Nathan Bossart wrote: > > On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: > > Won't Valgrind complain about this? > > > > +pg_popcount_avx512(const char *buf, int bytes) > > > > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); > > > > + val =

Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Sat, Apr 06, 2024 at 12:08:14PM +1300, David Rowley wrote: > Won't Valgrind complain about this? > > +pg_popcount_avx512(const char *buf, int bytes) > > + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); > > + val = _mm512_maskz_loadu_epi8(mask, (const __m512i *) buf); I haven't be

Re: Popcount optimization using AVX512

2024-04-05 Thread David Rowley
On Sat, 6 Apr 2024 at 04:38, Nathan Bossart wrote: > This seems to provide a small performance boost, so I've incorporated it > into v27. Won't Valgrind complain about this? +pg_popcount_avx512(const char *buf, int bytes) + buf = (const char *) TYPEALIGN_DOWN(sizeof(__m512i), buf); + val = _mm

Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Fri, Apr 05, 2024 at 07:58:44AM -0500, Nathan Bossart wrote: > On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: >> The main issue I saw was that clang was able to peel off the first >> iteration of the loop and then eliminate the mask assignment and >> replace masked load with a memor

Re: Popcount optimization using AVX512

2024-04-05 Thread Nathan Bossart
On Fri, Apr 05, 2024 at 10:33:27AM +0300, Ants Aasma wrote: > The main issue I saw was that clang was able to peel off the first > iteration of the loop and then eliminate the mask assignment and > replace masked load with a memory operand for vpopcnt. I was not able > to convince gcc to do that re

Re: Popcount optimization using AVX512

2024-04-05 Thread Ants Aasma
On Fri, 5 Apr 2024 at 07:15, Nathan Bossart wrote: > Here is an updated patch set. IMHO this is in decent shape and is > approaching committable. I checked the code generation on various gcc and clang versions. It looks mostly fine starting from versions where avx512 is supported, gcc-7.1 and cl

Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
Here is an updated patch set. IMHO this is in decent shape and is approaching committable. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From df59d3e78604e4530f5096bafc08ac94e13d82d2 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Mar 2024 16:39:24 -0500 Subject: [P

Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
On Thu, Apr 04, 2024 at 04:02:53PM +0300, Ants Aasma wrote: > Speaking of which, what does bumping up the inlined version threshold > to 16 do with and without AVX-512 available? Linearly extrapolating > the 2 and 4 byte numbers it might just come ahead in both cases, > making the choice easy. IIR

Re: Popcount optimization using AVX512

2024-04-04 Thread Nathan Bossart
On Thu, Apr 04, 2024 at 04:28:58PM +1300, David Rowley wrote: > On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote: >> If we can verify this approach won't cause segfaults and can stomach the >> regression between 8 and 16 bytes, I'd happily pivot to this approach so >> that we can avoid the functi

Re: Popcount optimization using AVX512

2024-04-04 Thread Ants Aasma
On Thu, 4 Apr 2024 at 01:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. The approach I posted does not r

Re: Popcount optimization using AVX512

2024-04-03 Thread David Rowley
On Thu, 4 Apr 2024 at 11:50, Nathan Bossart wrote: > If we can verify this approach won't cause segfaults and can stomach the > regression between 8 and 16 bytes, I'd happily pivot to this approach so > that we can avoid the function call dance that I have in v25. > > Thoughts? If we're worried a

Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 11:30:39PM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out portio

Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
On Wed, Apr 03, 2024 at 12:41:27PM -0500, Nathan Bossart wrote: > I committed v23-0001. Here is a rebased version of the remaining patches. > I intend to test the masking idea from Ants next. 0002 was missing a cast that is needed for the 32-bit builds. I've fixed that in v25. -- Nathan Bossar

Re: Popcount optimization using AVX512

2024-04-03 Thread Nathan Bossart
I committed v23-0001. Here is a rebased version of the remaining patches. I intend to test the masking idea from Ants next. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com >From 295b03530de5f42fe876b4489191da2f8dc83194 Mon Sep 17 00:00:00 2001 From: Nathan Bossart Date: Wed, 27 Ma

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:20:20PM -0500, Nathan Bossart wrote: > Sorry for the noise. I noticed a couple of silly mistakes immediately > after sending v21. Sigh... I missed a line while rebasing these patches, which seems to have grossly offended cfbot. Apologies again for the noise. -- Nath

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 05:01:32PM -0500, Nathan Bossart wrote: > In v21, 0001 is just the above inlining idea, which seems worth doing > independent of $SUBJECT. 0002 and 0003 are the AVX-512 patches, which I've > modified similarly to 0001, i.e., I've inlined the "fast" version in the > function

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:40:21PM -0500, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: >> I don't like the double evaluation of the macro argument. Seems like >> you could get the same results more safely with >> >> static inline uint64 >> pg_popcoun

Re: Popcount optimization using AVX512

2024-04-02 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To a

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:43:48PM -0400, Tom Lane wrote: > Alvaro Herrera writes: >> On 2024-Apr-02, Nathan Bossart wrote: >>> Another idea I had is to turn pg_popcount() into a macro that just uses the >>> pg_number_of_ones array when called for few bytes: >>> >>> static inline uint64 >>>

Re: Popcount optimization using AVX512

2024-04-02 Thread Tom Lane
Alvaro Herrera writes: > On 2024-Apr-02, Nathan Bossart wrote: >> Another idea I had is to turn pg_popcount() into a macro that just uses the >> pg_number_of_ones array when called for few bytes: >> >> static inline uint64 >> pg_popcount_inline(const char *buf, int bytes) >> { >>

Re: Popcount optimization using AVX512

2024-04-02 Thread Alvaro Herrera
On 2024-Apr-02, Nathan Bossart wrote: > Another idea I had is to turn pg_popcount() into a macro that just uses the > pg_number_of_ones array when called for few bytes: > > static inline uint64 > pg_popcount_inline(const char *buf, int bytes) > { > uint64

Re: Popcount optimization using AVX512

2024-04-02 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 05:11:17PM -0500, Nathan Bossart wrote: > Here is a v19 of the patch set. I moved out the refactoring of the > function pointer selection code to 0001. I think this is a good change > independent of $SUBJECT, and I plan to commit this soon. In 0002, I > changed the syslog

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 01:09:57AM +0300, Ants Aasma wrote: > On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: >> On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: >> > What about using the masking capabilities of AVX-512 to handle the >> > tail in the same code path? Masked out portio

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
Here is a v19 of the patch set. I moved out the refactoring of the function pointer selection code to 0001. I think this is a good change independent of $SUBJECT, and I plan to commit this soon. In 0002, I changed the syslogger.c usage of pg_popcount() to use pg_number_of_ones instead. This is

Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Tue, 2 Apr 2024 at 00:31, Nathan Bossart wrote: > > On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > > What about using the masking capabilities of AVX-512 to handle the > > tail in the same code path? Masked out portions of a load instruction > > will not generate an exception. To

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Tue, Apr 02, 2024 at 12:11:59AM +0300, Ants Aasma wrote: > What about using the masking capabilities of AVX-512 to handle the > tail in the same code path? Masked out portions of a load instruction > will not generate an exception. To allow byte level granularity > masking, -mavx512bw is needed.

Re: Popcount optimization using AVX512

2024-04-01 Thread Ants Aasma
On Mon, 1 Apr 2024 at 18:53, Nathan Bossart wrote: > > On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > > On 2024-Mar-31, Nathan Bossart wrote: > >> +popcnt = _mm512_reduce_add_epi64(accum); > >> +return popcnt + pg_popcount_fast(buf, bytes); > > > > Hmm, doesn't this arra

Re: Popcount optimization using AVX512

2024-04-01 Thread Nathan Bossart
On Mon, Apr 01, 2024 at 01:06:12PM +0200, Alvaro Herrera wrote: > On 2024-Mar-31, Nathan Bossart wrote: >> +popcnt = _mm512_reduce_add_epi64(accum); >> +return popcnt + pg_popcount_fast(buf, bytes); > > Hmm, doesn't this arrangement cause an extra function call to > pg_popcount_fast to be

Re: Popcount optimization using AVX512

2024-04-01 Thread Alvaro Herrera
On 2024-Mar-31, Nathan Bossart wrote: > +uint64 > +pg_popcount_avx512(const char *buf, int bytes) > +{ > + uint64 popcnt; > + __m512i accum = _mm512_setzero_si512(); > + > + for (; bytes >= sizeof(__m512i); bytes -= sizeof(__m512i)) > + { > + const

Re: Popcount optimization using AVX512

2024-03-31 Thread Nathan Bossart
On Sat, Mar 30, 2024 at 03:03:29PM -0500, Nathan Bossart wrote: > My current plan is to add some new tests for > pg_popcount() with many bytes, and then I'll give it a few more days for > any additional feedback before committing. Here is a v18 with a couple of new tests. Otherwise, it is the sam

Re: Popcount optimization using AVX512

2024-03-30 Thread Nathan Bossart
I used John Naylor's test_popcount module [0] to put together the attached graphs (note that the "small arrays" one is semi-logarithmic). For both graphs, the X-axis is the number of 64-bit words in the array, and Y-axis is the amount of time in milliseconds to run pg_popcount() on it 100,000 time

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Here's a v17 of the patch. This one has configure checks for everything (i.e., CPUID, XGETBV, and the AVX512 intrinsics) as well as the relevant runtime checks (i.e., we call CPUID to check for XGETBV and AVX512 POPCNT availability, and we call XGETBV to ensure the ZMM registers are enabled). I re

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 03:08:28PM -0500, Nathan Bossart wrote: >> +#if defined(HAVE__GET_CPUID) >> +__get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); >> +#elif defined(HAVE__CPUID) >> +__cpuidex(exx, 7, 0); > > Is there any reason we can't use __get_cpuid() and __cpuid() here,

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 02:13:12PM -0500, Nathan Bossart wrote: > * If the compiler understands AVX512 intrinsics, we assume that it also > knows about the required CPUID and XGETBV intrinsics, and we assume that > the conditions for TRY_POPCNT_FAST are true. Bleh, cfbot's 32-bit build is unha

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
Okay, here is a slightly different approach that I've dubbed the "maximum assumption" approach. In short, I wanted to see how much we could simplify the patch by making all possibly-reasonable assumptions about the compiler and CPU. These include: * If the compiler understands AVX512 intrinsics,

RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> A counterexample is the CRC32C code. AFAICT we assume the presence of > CPUID in that code (and #error otherwise). I imagine its probably safe to > assume the compiler understands CPUID if it understands AVX512 intrinsics, > but that is still mostly a guess. If AVX-512 intrinsics are available

RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > > We don't do MSVC via autoconf/Make. We used to have a special build > > framework for MSVC which parsed Makefiles to produce "solution" files, > > but it was removed as soon as Meson was mature enough to build. See > > commit 1

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 12:30:14PM -0400, Tom Lane wrote: > Nathan Bossart writes: >>> I see google web references to the xgetbv instruction as far back as 2009 >>> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >>> _xgetbv() MSVC built-in. How far back do you need to go?

RE: Popcount optimization using AVX512

2024-03-29 Thread Shankaran, Akash
> From: Nathan Bossart > Sent: Friday, March 29, 2024 9:17 AM > To: Amonson, Paul D > On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote: >> Yeah, I understand that much, but I want to know how portable the >> XGETBV instruction is. Unless I can assume that all x86_64 systems >>

Re: Popcount optimization using AVX512

2024-03-29 Thread Tom Lane
Nathan Bossart writes: >> I see google web references to the xgetbv instruction as far back as 2009 >> for Intel 64 bit HW and 2010 for AMD 64bit HW, maybe you could test for >> _xgetbv() MSVC built-in. How far back do you need to go? > Hm. It seems unlikely that a compiler would understand AVX5

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 10:59:40AM -0500, Nathan Bossart wrote: > It might be nice if we conditionally built pg_popcount_avx512.o in autoconf > builds, too, but AFAICT we still need to wrap most of that code with > macros, so I'm not sure it's worth the trouble. I'll take another look at > this...

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Fri, Mar 29, 2024 at 04:06:17PM +, Amonson, Paul D wrote: >> Yeah, I understand that much, but I want to know how portable the XGETBV >> instruction is. Unless I can assume that all x86_64 systems and compilers >> support that instruction, we might need an additional configure check and/or

RE: Popcount optimization using AVX512

2024-03-29 Thread Amonson, Paul D
> -Original Message- > > Cool. I think we should run the benchmarks again to be safe, though. Ok, sure go ahead. :) > >> I forgot to mention that I also want to understand whether we can > >> actually assume availability of XGETBV when CPUID says we support > >> AVX512: > > > > You canno

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:29:47PM +, Amonson, Paul D wrote: > I see in the meson.build you added the new file twice? > > @@ -7,6 +7,7 @@ pgport_sources = [ >'noblock.c', >'path.c', >'pg_bitutils.c', > + 'pg_popcount_avx512.c', >'pg_strong_random.c', >'pgcheckdir.c', >

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 11:10:33PM +0100, Alvaro Herrera wrote: > We don't do MSVC via autoconf/Make. We used to have a special build > framework for MSVC which parsed Makefiles to produce "solution" files, > but it was removed as soon as Meson was mature enough to build. See > commit 1301c80b216

Re: Popcount optimization using AVX512

2024-03-29 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 10:03:04PM +, Amonson, Paul D wrote: >> * I think we need to verify there isn't a huge performance regression for >> smaller arrays. IIUC those will still require an AVX512 instruction or >> two as well as a function call, which might add some noticeable overhead. >

RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message- > From: Amonson, Paul D > Sent: Thursday, March 28, 2024 3:03 PM > To: Nathan Bossart > ... > I will review the new patch to see if there are anything that jumps out at me. I see in the meson.build you added the new file twice? @@ -7,6 +7,7 @@ pgport_sources = [

Re: Popcount optimization using AVX512

2024-03-28 Thread Alvaro Herrera
On 2024-Mar-28, Amonson, Paul D wrote: > > -Original Message- > > From: Nathan Bossart > > Sent: Thursday, March 28, 2024 2:39 PM > > To: Amonson, Paul D > > > > * The latest patch set from Paul Amonson appeared to support MSVC in the > > meson build, but not the autoconf one. I don'

RE: Popcount optimization using AVX512

2024-03-28 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > Sent: Thursday, March 28, 2024 2:39 PM > To: Amonson, Paul D > > * The latest patch set from Paul Amonson appeared to support MSVC in the > meson build, but not the autoconf one. I don't have much expertise here, > so the v14 patch doesn

Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
On Thu, Mar 28, 2024 at 04:38:54PM -0500, Nathan Bossart wrote: > Here is a v14 of the patch that I think is beginning to approach something > committable. Besides general review and testing, there are two things that > I'd like to bring up: > > * The latest patch set from Paul Amonson appeared t

Re: Popcount optimization using AVX512

2024-03-28 Thread Nathan Bossart
Here is a v14 of the patch that I think is beginning to approach something committable. Besides general review and testing, there are two things that I'd like to bring up: * The latest patch set from Paul Amonson appeared to support MSVC in the meson build, but not the autoconf one. I don't ha

RE: Popcount optimization using AVX512

2024-03-27 Thread Amonson, Paul D
> -Original Message- > From: Nathan Bossart > Sent: Wednesday, March 27, 2024 3:00 PM > To: Amonson, Paul D > > ... (I realize that I'm essentially > recanting much of my previous feedback, which I apologize for.) It happens. LOL As long as the algorithm for AVX-512 is not altered I am

Re: Popcount optimization using AVX512

2024-03-27 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 03:05:51PM -0500, Nathan Bossart wrote: > On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: >> Ok, CI turned green after my re-post of the patches. Can this please get >> merged? > > Thanks for the new patches. I intend to take another look soon. Thanks fo

Re: Popcount optimization using AVX512

2024-03-25 Thread Nathan Bossart
On Mon, Mar 25, 2024 at 06:42:36PM +, Amonson, Paul D wrote: > Ok, CI turned green after my re-post of the patches. Can this please get > merged? Thanks for the new patches. I intend to take another look soon. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

RE: Popcount optimization using AVX512

2024-03-25 Thread Amonson, Paul D
s.postgresql.org > Subject: RE: Popcount optimization using AVX512 > Ok, CI turned green after my re-post of the patches. Can this please get merged? Thanks, Paul

Re: Popcount optimization using AVX512

2024-03-25 Thread Joe Conway
On 3/25/24 11:12, Tom Lane wrote: "Amonson, Paul D" writes: I am re-posting the patches as CI for Mac failed (CI error not code/test error). The patches are the same as last time. Just for a note --- the cfbot will re-test existing patches every so often without needing a bump. The current

  1   2   >