Inlined the hex encode/decode functions in "src/include/utils/builtins.h"
similar to pg_popcount() in pg_bitutils.h.
---
Chiranmoy
v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch
> The meson configure check seems to fail on my machine
> This test looks quite different than the autoconf one. Why is that? I
> would expect them to be the same. And I think ideally the test would check
> that all the intrinsics functions we need are available.
Fixed, both meson and autoconf have
> Hm. These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment. Is there any reason not
> to do that?
It seems that the double load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove it to simpl
Thank you for the suggestion; we have removed the `xsave` flag.
We have used the following command for benchmarking:
time ./build_fj/bin/psql pop_db -c "select drive_popcount(1000, 16);"
We ran it 20 times and took the average to flatten any CPU fluctuations. The
results observed on `m7g.4xl
Hello Nathan,
We tried auto-vectorization and observed no performance improvement.
The instructions in src/include/port/simd.h are based on older SIMD
architectures like NEON, whereas the patch uses the newer SVE, so some of the
instructions used in the patch may not have direct equivalents in N
Hi all,
Here is the updated patch using pg_attribute_target("arch=armv8-a+sve") to
compile the arch-specific function instead of using compiler flags.
---
Chiranmoy
v3-0001-SVE-support-for-popcount-and-popcount-masked.patch
> The approach looks generally reasonable to me, but IMHO the code needs
> much more commentary to explain how it works.
Added comments to explain the SVE implementation.
> I would be interested to see how your bytea test compares with the
> improvements added in commit e24d770 and with sending the
> This looks good. Thanks Chiranmoy and team. Can you address any other
> feedback from Nathan or others here? Then we can pursue further reviews and
> merging of the patch.
Thank you for the review.
If there is no further feedback from the community, may we submit the patch for
the next commit
I realized I didn't attach the patch.
v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?
Auto-vectorization is working now with the following addition on Graviton
On Wed, Mar 13, 2025 at 12:02:07AM +, nathandboss...@gmail.com wrote:
> Those are nice results. I'm a little worried about the Neon implementation
> for smaller inputs since it uses a per-byte loop for the remaining bytes,
> though. If we can ensure there's no regression there, I think this p
It seems that the patch doesn't compile on macOS: the compiler is unable to
map 'i' and 'len', which are of type 'size_t', to 'uint64'. This appears to be
a macOS-specific issue. The latest patch should resolve this by casting
'size_t' to 'uint64' before passing it to 'svwhilelt_b8'.
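A minimal sketch of the cast described above, not the code from the patch.
The function name sketch_popcount is hypothetical, and the scalar fallback is
only there so the example builds on machines without SVE. The SVE branch is
compiled only when the compiler defines __ARM_FEATURE_SVE.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: count the set bits in buf[0 .. len-1]. */
uint64_t
sketch_popcount(const uint8_t *buf, size_t len)
{
	uint64_t	total = 0;

#if defined(__ARM_FEATURE_SVE)
	for (size_t i = 0; i < len; i += svcntb())
	{
		/*
		 * Cast explicitly: the overloaded svwhilelt_b8 accepts 32-bit or
		 * 64-bit integer arguments, and size_t may be a distinct type from
		 * uint64_t on some platforms.
		 */
		svbool_t	pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);
		svuint8_t	vec = svld1_u8(pred, buf + i);

		/* Per-byte popcount, then a horizontal sum over the active lanes. */
		total += svaddv_u8(pred, svcnt_u8_x(pred, vec));
	}
#else
	/* Assumption: plain scalar fallback for builds without SVE. */
	for (size_t i = 0; i < len; i++)
	{
		for (uint8_t b = buf[i]; b != 0; b &= (uint8_t) (b - 1))
			total++;
	}
#endif

	return total;
}
```

The predicate from svwhilelt_b8 also covers the final partial vector, so this
sketch needs no separate per-byte tail loop in the SVE branch.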
[11:04:07.478] ../src/back
> Hm. Any idea why that is? I wonder if the compiler isn't using as many
> SVE registers as it could for this.
Not sure. We tried forcing loop unrolling using the line below in the Makefile,
but the results are the same.
pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native
> I've
> Interesting. I do see different assembly with the 2 and 4 register
> versions, but I didn't get to testing it on a machine with SVE support
> today.
> Besides some additional benchmarking, I might make some small adjustments
> to the patch. But overall, it seems to be in decent shape.
Sounds
On Wed, Mar 12, 2025 at 02:41:18AM +, nathandboss...@gmail.com wrote:
> v5-no-sve is the result of using a function pointer, but pointing to the
> "slow" versions instead of the SVE version. v5-sve is the result of the
> latest patch in this thread on a machine with SVE support, and v5-4reg i
Looks good, the code is more readable now.
> For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.
There was no regression on Graviton 3 when using the 4-register
Here's the rebased patch with a few modifications.
The hand-unrolled hex encode performs better than the non-unrolled version on
r8g.4xlarge. No improvement on m7g.4xlarge.
Added line-by-line comments explaining the changes with an example.
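To illustrate what hand-unrolling means here, below is a scalar sketch that
encodes four input bytes per iteration and finishes with a per-byte loop. It
is not the SVE code from the patch; the name sketch_hex_encode and the unroll
factor of four are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

static const char sketch_hextbl[] = "0123456789abcdef";

/*
 * Hypothetical helper: write 2 * len hex digits for src into dst and return
 * the number of bytes written. dst must have room for 2 * len bytes; no
 * terminating NUL is added.
 */
size_t
sketch_hex_encode(const uint8_t *src, size_t len, char *dst)
{
	char	   *p = dst;
	size_t		i = 0;

	/* Unrolled body: four input bytes produce eight output characters. */
	for (; i + 4 <= len; i += 4)
	{
		p[0] = sketch_hextbl[src[i] >> 4];
		p[1] = sketch_hextbl[src[i] & 0xF];
		p[2] = sketch_hextbl[src[i + 1] >> 4];
		p[3] = sketch_hextbl[src[i + 1] & 0xF];
		p[4] = sketch_hextbl[src[i + 2] >> 4];
		p[5] = sketch_hextbl[src[i + 2] & 0xF];
		p[6] = sketch_hextbl[src[i + 3] >> 4];
		p[7] = sketch_hextbl[src[i + 3] & 0xF];
		p += 8;
	}

	/* Tail: handle the remaining zero to three bytes one at a time. */
	for (; i < len; i++)
	{
		*p++ = sketch_hextbl[src[i] >> 4];
		*p++ = sketch_hextbl[src[i] & 0xF];
	}

	return (size_t) (p - dst);
}
```

Whether unrolling helps depends on the hardware, which matches the different
results reported for r8g.4xlarge and m7g.4xlarge.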
Below are the results. Input size is in bytes, and exec