Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-02-03 Thread chiranmoy.bhattacha...@fujitsu.com
Inlined the hex encode/decode functions in "src/include/utils/builtins.h" similar to pg_popcount() in pg_bitutils.h. --- Chiranmoy v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch Description: v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch

Re: [PATCH] SVE popcount support

2025-02-04 Thread chiranmoy.bhattacha...@fujitsu.com
> The meson configure check seems to fail on my machine > This test looks quite different than the autoconf one. Why is that? I would expect them to be the same. And I think ideally the test would check that all the intrinsics functions we need are available. Fixed, both meson and autoconf have

Re: [PATCH] SVE popcount support

2025-02-06 Thread chiranmoy.bhattacha...@fujitsu.com
> Hm. These results are so similar that I'm tempted to suggest we just > remove the section of code dedicated to alignment. Is there any reason not > to do that? It seems that the double load overhead from unaligned memory access isn’t too taxing, even on larger inputs. We can remove it to simpl

Re: [PATCH] SVE popcount support

2024-12-11 Thread chiranmoy.bhattacha...@fujitsu.com
Thank you for the suggestion; we have removed the `xsave` flag. We have used the following command for benchmarking: time ./build_fj/bin/psql pop_db -c "select drive_popcount(1000, 16);" We ran it 20 times and took the average to flatten any CPU fluctuations. The results observed on `m7g.4xl

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-10 Thread chiranmoy.bhattacha...@fujitsu.com
Hello Nathan, We tried auto-vectorization and observed no performance improvement. The instructions in src/include/port/simd.h are based on older SIMD architectures like NEON, whereas the patch uses the newer SVE, so some of the instructions used in the patch may not have direct equivalents in N

Re: [PATCH] SVE popcount support

2025-01-10 Thread chiranmoy.bhattacha...@fujitsu.com
Hi all, Here is the updated patch using pg_attribute_target("arch=armv8-a+sve") to compile the arch-specific function instead of using compiler flags. --- Chiranmoy v3-0001-SVE-support-for-popcount-and-popcount-masked.patch Description: v3-0001-SVE-support-for-popcount-and-popcount-masked.p

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
> The approach looks generally reasonable to me, but IMHO the code needs much more commentary to explain how it works. Added comments to explain the SVE implementation. > I would be interested to see how your bytea test compares with the improvements added in commit e24d770 and with sending the

Re: [PATCH] SVE popcount support

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
> This looks good. Thanks Chiranmoy and team. Can you address any other > feedback from Nathan or others here? Then we can pursue further reviews and > merging of the patch. Thank you for the review. If there is no further feedback from the community, may we submit the patch for the next commit

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
I realized I didn't attach the patch. v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch Description: v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread chiranmoy.bhattacha...@fujitsu.com
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote: > Do you mean that the auto-vectorization worked and you observed no > performance improvement, or the auto-vectorization had no effect on the > code generated? Auto-vectorization is working now with the following addition on Graviton

Re: [PATCH] SVE popcount support

2025-03-19 Thread chiranmoy.bhattacha...@fujitsu.com
On Wed, Mar 13, 2025 at 12:02:07AM +, nathandboss...@gmail.com wrote: > Those are nice results. I'm a little worried about the Neon implementation > for smaller inputs since it uses a per-byte loop for the remaining bytes, > though. If we can ensure there's no regression there, I think this p
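For readers following the concern about small inputs, here is a minimal portable sketch of the loop shape being discussed: a wide main loop followed by a per-byte loop for the remaining bytes. It is not the Neon code from the patch. Plain 64-bit words stand in for vector registers, and the names sketch_popcount64 and sketch_popcount_with_tail are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable bit-twiddling popcount of one 64-bit word (no intrinsics). */
static inline uint64_t
sketch_popcount64(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

/*
 * Hypothetical stand-in for a vectorized popcount: 8-byte words play the
 * role of vector registers, and whatever is left goes through a per-byte
 * loop. Inputs shorter than one word never enter the wide loop at all.
 */
uint64_t
sketch_popcount_with_tail(const unsigned char *buf, size_t len)
{
    uint64_t count = 0;

    assert(buf != NULL || len == 0);

    /* Main loop: one full word per iteration. */
    while (len >= sizeof(uint64_t))
    {
        uint64_t word;

        memcpy(&word, buf, sizeof(word));   /* safe for unaligned input */
        count += sketch_popcount64(word);
        buf += sizeof(word);
        len -= sizeof(word);
    }

    /* Tail loop: the remaining 0..7 bytes, one byte at a time. */
    while (len > 0)
    {
        count += sketch_popcount64(*buf++);
        len--;
    }

    return count;
}
```

With a wider register the tail can be proportionally longer, so a short input may spend all of its time in the per-byte loop. That is the case the quoted comment asks to be checked for regressions.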

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-02-19 Thread chiranmoy.bhattacha...@fujitsu.com
It seems that the patch doesn't compile on macOS: it is unable to map 'i' and 'len', which are of type 'size_t', to 'uint64'. This appears to be a macOS-specific issue. The latest patch should resolve this by casting 'size_t' to 'uint64' before passing it to 'svwhilelt_b8'. [11:04:07.478] ../src/back
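A minimal sketch of the cast described above, not taken from the patch. One plausible explanation (an assumption, not stated in the message) is that size_t on macOS is unsigned long, a different type from the unsigned long long behind uint64_t there, so the overloaded svwhilelt_b8 has no exact match. The helper name sketch_count_predicated_bytes is hypothetical, and the non-SVE branch merely emulates a 16-byte vector so the example builds everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/*
 * Hypothetical helper: walks [0, len) in vector-sized steps and counts how
 * many byte lanes the loop predicate activates. The total must equal len.
 */
uint64_t
sketch_count_predicated_bytes(size_t len)
{
    uint64_t total = 0;

#if defined(__ARM_FEATURE_SVE)
    const uint64_t vec_len = svcntb();      /* bytes per SVE vector */

    for (size_t i = 0; i < len; i += vec_len)
    {
        /* Explicit casts pick the uint64_t overload on every platform. */
        svbool_t pred = svwhilelt_b8((uint64_t) i, (uint64_t) len);

        total += svcntp_b8(svptrue_b8(), pred);
    }
#else
    /* Fallback for builds without SVE: emulate a 16-byte vector (assumed). */
    const size_t vec_len = 16;

    for (size_t i = 0; i < len; i += vec_len)
        total += (len - i < vec_len) ? (len - i) : vec_len;
#endif

    /* Every byte is covered by exactly one active lane. */
    assert(total == (uint64_t) len);
    return total;
}
```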

Re: [PATCH] SVE popcount support

2025-02-19 Thread chiranmoy.bhattacha...@fujitsu.com
> Hm. Any idea why that is? I wonder if the compiler isn't using as many > SVE registers as it could for this. Not sure; we tried forcing loop unrolling using the line below in the Makefile, but the results are the same. pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native > I've

Re: [PATCH] SVE popcount support

2025-03-06 Thread chiranmoy.bhattacha...@fujitsu.com
> Interesting. I do see different assembly with the 2 and 4 register > versions, but I didn't get to testing it on a machine with SVE support > today. > Besides some additional benchmarking, I might make some small adjustments > to the patch. But overall, it seems to be in decent shape. Sounds

Re: [PATCH] SVE popcount support

2025-03-12 Thread chiranmoy.bhattacha...@fujitsu.com
On Wed, Mar 12, 2025 at 02:41:18AM +, nathandboss...@gmail.com wrote: > v5-no-sve is the result of using a function pointer, but pointing to the > "slow" versions instead of the SVE version. v5-sve is the result of the > latest patch in this thread on a machine with SVE support, and v5-4reg i
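The "function pointer" arrangement mentioned in the quoted text can be sketched roughly as follows. This is an illustration only, with hypothetical names throughout: the pointer starts at a chooser routine that resolves the implementation on the first call. The feature check is a stub that always reports "unavailable", so the pointer ends up at the slow version, similar in spirit to the v5-no-sve configuration described above. The "fast" function is a scalar stand-in, not SVE code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t sketch_popcount_slow(const unsigned char *buf, size_t len);
static uint64_t sketch_popcount_fast(const unsigned char *buf, size_t len);
static uint64_t sketch_popcount_choose(const unsigned char *buf, size_t len);

/* Starts at the chooser; after the first call it points at a real version. */
static uint64_t (*sketch_popcount_impl) (const unsigned char *, size_t) =
    sketch_popcount_choose;

/* Stub (assumption): a real build would query the CPU for SVE support. */
static int
sketch_fast_path_available(void)
{
    return 0;
}

/* "Slow" version: examine each bit of each byte. */
static uint64_t
sketch_popcount_slow(const unsigned char *buf, size_t len)
{
    uint64_t count = 0;

    for (size_t i = 0; i < len; i++)
        for (unsigned char b = buf[i]; b != 0; b >>= 1)
            count += b & 1;
    return count;
}

/* Scalar stand-in for the vectorized version: nibble lookup table. */
static uint64_t
sketch_popcount_fast(const unsigned char *buf, size_t len)
{
    static const unsigned char nibble_bits[16] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    uint64_t count = 0;

    for (size_t i = 0; i < len; i++)
        count += nibble_bits[buf[i] >> 4] + nibble_bits[buf[i] & 0x0F];
    return count;
}

/* Resolve the implementation once, then forward the current call. */
static uint64_t
sketch_popcount_choose(const unsigned char *buf, size_t len)
{
    sketch_popcount_impl = sketch_fast_path_available()
        ? sketch_popcount_fast
        : sketch_popcount_slow;
    return sketch_popcount_impl(buf, len);
}

/* Public entry point: every call pays one indirect jump. */
uint64_t
sketch_popcount(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return sketch_popcount_impl(buf, len);
}
```

Pointing the pointer at the slow version isolates the cost of the indirection itself, which is presumably what such a comparison is meant to measure.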

Re: [PATCH] SVE popcount support

2025-03-23 Thread chiranmoy.bhattacha...@fujitsu.com
Looks good, the code is more readable now. > For both Neon and SVE, I do see improvements with looping over 4 > registers at a time, so IMHO it's worth doing so even if it performs the > same as 2-register blocks on some hardware. There was no regression on Graviton 3 when using the 4-register
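As a rough illustration of "looping over 4 registers at a time", the sketch below processes four 64-bit words per iteration into four independent accumulators. This is an assumption about the general shape, not the code under review: plain words stand in for Neon/SVE registers, and the names sketch_popcount_word and sketch_popcount_4way are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable popcount of one 64-bit word. */
static inline uint64_t
sketch_popcount_word(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (x * 0x0101010101010101ULL) >> 56;
}

/*
 * Four words per iteration, each with its own accumulator, so the four
 * additions do not depend on one another and can overlap in the pipeline.
 */
uint64_t
sketch_popcount_4way(const unsigned char *buf, size_t len)
{
    uint64_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    uint64_t count;

    assert(buf != NULL || len == 0);

    while (len >= 4 * sizeof(uint64_t))
    {
        uint64_t w[4];

        memcpy(w, buf, sizeof(w));          /* safe for unaligned input */
        acc0 += sketch_popcount_word(w[0]);
        acc1 += sketch_popcount_word(w[1]);
        acc2 += sketch_popcount_word(w[2]);
        acc3 += sketch_popcount_word(w[3]);
        buf += sizeof(w);
        len -= sizeof(w);
    }

    count = acc0 + acc1 + acc2 + acc3;

    /* Remaining 0..31 bytes, one byte at a time. */
    while (len > 0)
    {
        count += sketch_popcount_word(*buf++);
        len--;
    }

    return count;
}
```

Whether the wider block helps depends on the hardware, which matches the observation that results can be the same as with 2-register blocks on some machines.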

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-06-09 Thread chiranmoy.bhattacha...@fujitsu.com
Here's the rebased patch with a few modifications. The hand-unrolled hex encode performs better than the non-unrolled version on r8g.4xlarge. No improvement on m7g.4xlarge. Added line-by-line comments explaining the changes with an example. Below are the results. Input size is in bytes, and exec
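To make "hand-unrolled hex encode" concrete, here is a scalar sketch that handles two input bytes per loop iteration. It is only an illustration of the unrolling idea. The patch operates on SVE vectors, and its actual code is not reproduced here. The names sketch_hextbl and sketch_hex_encode are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static const char sketch_hextbl[] = "0123456789abcdef";

/*
 * Writes 2 * len hex digits to dst (no NUL terminator) and returns the
 * number of characters written. The main loop is unrolled by hand to
 * encode two input bytes per iteration.
 */
size_t
sketch_hex_encode(const unsigned char *src, size_t len, char *dst)
{
    size_t i = 0;

    assert((src != NULL && dst != NULL) || len == 0);

    /* Unrolled body: two source bytes, four output characters. */
    for (; i + 2 <= len; i += 2)
    {
        dst[2 * i]     = sketch_hextbl[src[i] >> 4];
        dst[2 * i + 1] = sketch_hextbl[src[i] & 0x0F];
        dst[2 * i + 2] = sketch_hextbl[src[i + 1] >> 4];
        dst[2 * i + 3] = sketch_hextbl[src[i + 1] & 0x0F];
    }

    /* At most one leftover byte when len is odd. */
    for (; i < len; i++)
    {
        dst[2 * i]     = sketch_hextbl[src[i] >> 4];
        dst[2 * i + 1] = sketch_hextbl[src[i] & 0x0F];
    }

    return 2 * len;
}
```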