Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-02-03 Thread chiranmoy.bhattacha...@fujitsu.com
Inlined the hex encode/decode functions in "src/include/utils/builtins.h"
similar to pg_popcount() in pg_bitutils.h.

---
Chiranmoy


v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v3-0001-SVE-support-for-hex-encode-and-hex-decode.patch


Re: [PATCH] SVE popcount support

2025-02-04 Thread chiranmoy.bhattacha...@fujitsu.com
> The meson configure check seems to fail on my machine
> This test looks quite different than the autoconf one.  Why is that?  I
> would expect them to be the same.  And I think ideally the test would check
> that all the intrinsics functions we need are available.

Fixed; both meson and autoconf now use the same test program containing all
the intrinsics, so meson should work now.
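
For reference, a minimal sketch of the kind of test program such a configure
check might compile (illustrative only, not the patch's exact code):

#include <arm_sve.h>
#include <stdint.h>

int
main(void)
{
    unsigned char buf[64] = {0};
    svbool_t    pred = svwhilelt_b8((uint64_t) 0, (uint64_t) sizeof(buf));
    svuint8_t   vec = svld1(pred, buf);

    /* exercise the intrinsics the implementation relies on */
    return (int) svaddv(svptrue_b8(), svcnt_x(pred, vec));
}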

> +pgac_save_CFLAGS=$CFLAGS
> +CFLAGS="$pgac_save_CFLAGS $1"

> I don't see any extra compiler flag tests used, so we no longer need this,
> right?

True, removed it.

> +  if test x"$pgac_cv_arm_sve_popcnt_intrinsics" = x"yes"; then
> +pgac_arm_sve_popcnt_intrinsics=yes
> +  fi

> I'm curious why this doesn't use Ac_cachevar like the examples above it
> (e.g., PGAC_XSAVE_INTRINSICS).

Implemented using Ac_cachevar, similar to PGAC_XSAVE_INTRINSICS.

> +#include "c.h"
> +#include "port/pg_bitutils.h"
> +
> +#ifdef USE_SVE_POPCNT_WITH_RUNTIME_CHECK

> nitpick: The USE_SVE_POPCNT_WITH_RUNTIME_CHECK check can probably go above
> the #include for pg_bitutils.h (but below the one for c.h).

Done.

> pg_crc32c_armv8_available() (in pg_crc32c_armv8_choose.c) looks quite a bit
> more complicated than this.  Are we missing something here?

SVE is only available on aarch64, so we don't need to worry about aarch32. The
latest patch includes runtime checks for Linux and FreeBSD. For all other
operating systems, false is returned, because we cannot verify SVE support
there.
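
The check is roughly along these lines (an illustrative sketch, not the
patch's exact code; the header names and the HWCAP_SVE spelling are
assumptions here and may differ from what the patch uses):

#include <stdbool.h>

#if defined(__linux__)
#include <sys/auxv.h>            /* getauxval() */
#include <asm/hwcap.h>           /* HWCAP_SVE */

static bool
pg_popcount_sve_available(void)
{
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
}
#elif defined(__FreeBSD__)
#include <sys/auxv.h>            /* elf_aux_info() */
#include <machine/elf.h>         /* HWCAP_SVE */

static bool
pg_popcount_sve_available(void)
{
    unsigned long hwcap = 0;

    if (elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) != 0)
        return false;
    return (hwcap & HWCAP_SVE) != 0;
}
#else
static bool
pg_popcount_sve_available(void)
{
    return false;                /* cannot verify, assume unavailable */
}
#endif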

> +/*
> + * For smaller inputs, aligning the buffer degrades the performance.
> + * Therefore, align the buffers only when the input size is sufficiently large.
> + */

> Is the inverse true, i.e., does aligning the buffer improve performance for
> larger inputs?  I'm also curious what level of performance degradation you
> were seeing.

Here is a comparison of all three cases. Alignment is marginally better for
inputs above 1024 B, but the difference is small; unaligned performs better
for smaller inputs.

Aligned After 128B => the current implementation, i.e. "if (aligned != buf && bytes > 4 * vec_len)"
Always Aligned => the "bytes > 4 * vec_len" condition is removed
Unaligned => the whole if block is removed

   buf  | Always Aligned | Aligned After 128B | Unaligned
--------+----------------+--------------------+-----------
     16 |         37.851 |             38.203 |    34.971
     32 |         37.859 |             38.187 |    34.972
     64 |         37.611 |             37.405 |    34.121
    128 |         45.357 |             45.897 |    41.890
    256 |         62.440 |             63.454 |    58.666
    512 |        100.120 |            102.767 |    99.861
   1024 |        159.574 |            158.594 |   164.975
   2048 |        282.354 |            281.198 |   283.937
   4096 |        532.038 |            531.068 |   533.699
   8192 |       1038.973 |           1038.083 |  1039.206
  16384 |       2028.604 |           2025.843 |  2033.940

---
Chiranmoy


v4-0001-SVE-support-for-popcount-and-popcount-masked.patch
Description:  v4-0001-SVE-support-for-popcount-and-popcount-masked.patch


Re: [PATCH] SVE popcount support

2025-02-06 Thread chiranmoy.bhattacha...@fujitsu.com
> Hm.  These results are so similar that I'm tempted to suggest we just
> remove the section of code dedicated to alignment.  Is there any reason not
> to do that?

It seems that the double load overhead from unaligned memory access isn’t
too taxing, even on larger inputs. We can remove it to simplify the code.

> Does this hand-rolled loop unrolling offer any particular advantage?  What
> do the numbers look like if we don't do this or if we process, say, 4
> vectors at a time?

The unrolled version performs better than the non-unrolled one, but
processing four vectors provides no additional benefit. The numbers
and code used are given below.

  buf  | Not Unrolled | Unrolled x2 | Unrolled x4
-------+--------------+-------------+-------------
    16 |        4.774 |       4.759 |       5.634
    32 |        6.872 |       6.486 |       7.348
    64 |       11.070 |      10.249 |      10.617
   128 |       20.003 |      16.205 |      16.764
   256 |       40.234 |      28.377 |      29.108
   512 |       83.825 |      53.420 |      53.658
  1024 |      191.181 |     101.677 |     102.727
  2048 |      389.160 |     200.291 |     201.544
  4096 |      785.742 |     404.593 |     399.134
  8192 |     1587.226 |     811.314 |     810.961

/* Process 4 vectors */
for (; i < loop_bytes; i += vec_len * 4)
{
  vec64_1 = svld1(pred, (const uint64 *) (buf + i));
  accum1 = svadd_x(pred, accum1, svcnt_x(pred, vec64_1));
  vec64_2 = svld1(pred, (const uint64 *) (buf + i + vec_len));
  accum2 = svadd_x(pred, accum2, svcnt_x(pred, vec64_2));

  vec64_3 = svld1(pred, (const uint64 *) (buf + i + 2 * vec_len));
  accum3 = svadd_x(pred, accum3, svcnt_x(pred, vec64_3));
  vec64_4 = svld1(pred, (const uint64 *) (buf + i + 3 * vec_len));
  accum4 = svadd_x(pred, accum4, svcnt_x(pred, vec64_4));
}

-Chiranmoy


Re: [PATCH] SVE popcount support

2024-12-11 Thread chiranmoy.bhattacha...@fujitsu.com
Thank you for the suggestion; we have removed the `xsave` flag.

We have used the following command for benchmarking:
time ./build_fj/bin/psql pop_db -c "select drive_popcount(1000, 16);"

We ran it 20 times and took the average to smooth out CPU fluctuations. The
results observed on `m7g.4xlarge` are in the attached Excel file.

We have also updated the buffer-alignment condition, skipping the alignment
step if the buffer is already aligned. This improved performance by a few
milliseconds because the input buffer provided by `drive_popcount` is already
aligned. PFA the updated patch file.

Thanks,
Chiranmoy





From: Malladi, Rama
Sent: Monday, December 9, 2024 10:06 PM
To: Susmitha, Devanga; Nathan Bossart
Cc: Kirill Reshke; pgsql-hackers; Bhattacharya, Chiranmoy; M A, Rajat;
Hajela, Ragesh
Subject: Re: [PATCH] SVE popcount support


On 12/9/24 12:21 AM, devanga.susmi...@fujitsu.com wrote:
Hello,

We are sharing our patch for pg_popcount with SVE support as a contribution 
from our side in this thread. We hope this contribution will help in exploring 
and refining the popcount implementation further.
Our patch uses the existing infrastructure, i.e. the 
"choose_popcount_functions" method, to determine the correct popcount 
implementation based on the architecture, thereby requiring fewer code changes. 
The patch also includes implementations for popcount and popcount masked.
We can reference both solutions and work together toward achieving the most 
efficient and effective implementation for PostgreSQL.
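
For context, the "choose" infrastructure mentioned above is a function-pointer
dispatch along these lines (a rough sketch with illustrative names; the actual
PostgreSQL symbols differ):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn) (const char *buf, size_t bytes);

extern uint64_t pg_popcount_slow(const char *buf, size_t bytes);
extern uint64_t pg_popcount_sve(const char *buf, size_t bytes);
extern bool pg_popcount_sve_available(void);

static uint64_t pg_popcount_choose(const char *buf, size_t bytes);

/* the first call lands in the chooser, which rebinds the pointer */
static popcount_fn pg_popcount_optimized = pg_popcount_choose;

static uint64_t
pg_popcount_choose(const char *buf, size_t bytes)
{
    pg_popcount_optimized = pg_popcount_sve_available()
        ? pg_popcount_sve
        : pg_popcount_slow;
    return pg_popcount_optimized(buf, bytes);
}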

Thanks for the patch; it looks good. I will review the full patch in the next
couple of days. One observation: the patch adds `xsave` flags, which isn't
needed.


`pgport_cflags = {'crc': cflags_crc, 'popcnt': cflags_popcnt + cflags_popcnt_arm, 'xsave': cflags_xsave}`

Algorithm Overview (sketched below):
1. For larger inputs, align the buffers to avoid double loads. For smaller
inputs, alignment is not necessary and might even degrade performance.
2. Process the aligned buffer chunk by chunk until the last, incomplete chunk.
3. Process the last, incomplete chunk.
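
A rough sketch of this flow (illustrative only; the actual patch unrolls the
main loop, and the alignment step is omitted here for brevity):

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t
sve_popcount_sketch(const char *buf, size_t bytes)
{
    svbool_t    pred = svptrue_b64();
    svuint64_t  accum = svdup_u64(0);
    uint64_t    vec_len = svcntb();     /* bytes per SVE register */
    uint64_t    total;
    size_t      i = 0;

    /* step 2: whole vectors, chunk by chunk */
    for (; i + vec_len <= bytes; i += vec_len)
    {
        svuint64_t  vec = svld1(pred, (const uint64_t *) (buf + i));

        accum = svadd_x(pred, accum, svcnt_x(pred, vec));
    }
    total = svaddv(pred, accum);

    /* step 3: the last, incomplete chunk (done in scalar here for brevity) */
    for (; i < bytes; i++)
        total += (uint64_t) __builtin_popcount((unsigned char) buf[i]);

    return total;
}
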
Our setup:
Machine: AWS EC2 c7g.8xlarge (32 vCPU, 64 GB RAM)
OS: Ubuntu 22.04.5 LTS
GCC: 11.4

Benchmark and Result:
We have used the PostgreSQL community's recommended popcount-test-module [0]
for benchmarking and observed a speed-up of more than 4x for larger buffers.
Even for small inputs of 8 and 16 bytes, no performance degradation was
observed.
Looking forward to your thoughts!


I tested the patch; attached is the performance I see on a `c7g.xlarge`. The
perf data doesn't quite match what you observed (especially for 256 B). The
chart compares the baseline, the AWS SVE implementation (what I had
implemented), and the Fujitsu SVE popcount implementation. Can you confirm the
command line you used for the benchmark run?


I had used the below command-line:

`sudo su postgres -c "/usr/local/pgsql/bin/psql -c 'EXPLAIN ANALYZE SELECT 
drive_popcount(10, 16);'"`



From: Nathan Bossart
Sent: Wednesday, December 4, 2024 21:37
To: Malladi, Rama
Cc: Kirill Reshke; pgsql-hackers
Subject: Re: [PATCH] SVE popcount support

On Wed, Dec 04, 2024 at 08:51:39AM -0600, Malladi, Rama wrote:
> Thank you, Kirill, for the review and the feedback. Please find inline my
> reply and an updated patch.

Thanks for the updated patch.  I have a couple of high-level comments.
Would you mind adding this to the commitfest system
(https://commitfest.postgresql.org/) so that it is picked up by our
automated patch testing tools?

> +# Check for ARMv8 SVE popcount intrinsics
> +#
> +CFLAGS_POPCNT=""
> +PG_POPCNT_OBJS=""
> +PGAC_SVE_POPCNT_INTRINSICS([])
> +if test x"$pgac_sve_popcnt_intrinsics" != x"yes"; then
> +  PGAC_SVE_POPCNT_INTRINSICS([-march=armv8-a+sve])
> +fi
> +if test x"$pgac_sve_popcnt_intrinsics" = x"yes"; then
> +  PG_POPCNT_OBJS="pg_popcount_sve.o"
> +  AC_DEFINE(USE_SVE_POPCNT_WITH_RUNTIME_CHECK, 1, [Define to 1 to use SVE 
> popcount instructions with a runtime check.])
> +fi
> +AC_SUBST(CFLAGS_POPCNT)
> +AC_SUBST(PG_POPCNT_OBJS)

We recently switched some intrinsics support in PostgreSQL to use
__attribute__((target("..."))) instead of applying special compiler flags
to specific files (e.g., commits f78667b and 4b03a27).  The hope is that
this approach will be a little more sustainable as we add more
architecture-specific code.  IMHO we should do something similar here.
While this means that older versions of clang might not pick up this
optimization (see the commit message for 4b03a27 for details), I think
that's okay because 1) this patch is intended for the next major version of
Postgres, which will take some time for significant adoption, 

Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-10 Thread chiranmoy.bhattacha...@fujitsu.com
Hello Nathan,

We tried auto-vectorization and observed no performance improvement. The
instructions in src/include/port/simd.h are based on older SIMD architectures
like NEON, whereas the patch uses the newer SVE, so some of the instructions
used in the patch may not have direct equivalents in NEON. We will check the
feasibility of integrating SVE into "src/include/port/simd.h" and get back to
you.

The actual encoding/decoding implementation takes less than 100 lines; the
rest of the code is config-related plus the "choose" logic. One option is to
move the implementation to a new file, making src/backend/utils/adt/encode.c
less bloated.

Thanks,
Chiranmoy


Re: [PATCH] SVE popcount support

2025-01-10 Thread chiranmoy.bhattacha...@fujitsu.com
Hi all,

Here is the updated patch using pg_attribute_target("arch=armv8-a+sve") to 
compile the arch-specific function instead of using compiler flags.
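
In essence, the annotation moves onto the function itself. The stub below is a
reduced, illustrative sketch rather than the actual patch code; it only
touches the first vector's worth of bytes, just to show the shape:

#include "c.h"                  /* pg_attribute_target(), uint64, uint8 */
#include <arm_sve.h>

pg_attribute_target("arch=armv8-a+sve")
static uint64
pg_popcount_sve_stub(const char *buf, uint64 bytes)
{
    svbool_t    pred = svwhilelt_b8((uint64) 0, bytes);
    svuint8_t   vec = svld1(pred, (const uint8 *) buf);

    return svaddv(svptrue_b8(), svcnt_x(pred, vec));
}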

---
Chiranmoy




v3-0001-SVE-support-for-popcount-and-popcount-masked.patch
Description:  v3-0001-SVE-support-for-popcount-and-popcount-masked.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
> The approach looks generally reasonable to me, but IMHO the code needs
> much more commentary to explain how it works.

Added comments to explain the SVE implementation.

> I would be interested to see how your bytea test compares with the
> improvements added in commit e24d770 and with sending the data in binary.

The following are the bytea test results with commit e24d770.
The same query and tables were used.

With commit e24d770:
Query exec time: 2.324 sec
hex_encode function time: 0.72 sec

Pre-commit e24d770:
Query exec time: 2.858 sec
hex_encode function time: 1.228 sec

SVE patch:
Query exec time: 1.654 sec
hex_encode_sve function time: 0.085 sec

> The functions that test the length before potentially calling a function
> pointer should probably be inlined (see pg_popcount() in pg_bitutils.h).
> I wouldn't be surprised if some compilers are inlining this stuff
> already, but it's probably worth being explicit about it.

Should we implement an inline function in "utils/builtins.h", similar to
pg_popcount()? Currently, we have not modified the header file; everything
is implemented statically in encode.c.
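
The inline wrapper being discussed would look roughly like this (a sketch
only; the names hex_encode_scalar, hex_encode_optimized and the 64-byte cutoff
are illustrative, not taken from the patch):

extern uint64 (*hex_encode_optimized) (const char *src, size_t len, char *dst);
extern uint64 hex_encode_scalar(const char *src, size_t len, char *dst);

#define HEX_SIMD_THRESHOLD 64   /* illustrative cutoff */

static inline uint64
hex_encode(const char *src, size_t len, char *dst)
{
    /* test the length before going through the function pointer */
    if (len < HEX_SIMD_THRESHOLD)
        return hex_encode_scalar(src, len, dst);
    return hex_encode_optimized(src, len, dst);
}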

---
Chiranmoy


Re: [PATCH] SVE popcount support

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
> This looks good. Thanks Chiranmoy and team. Can you address any other 
> feedback from Nathan or others here? Then we can pursue further reviews and 
> merging of the patch.

Thank you for the review.
If there is no further feedback from the community, may we submit the patch for 
the next commit fest?


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-22 Thread chiranmoy.bhattacha...@fujitsu.com
I realized I didn't attach the patch.


v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v2-0001-SVE-support-for-hex-encode-and-hex-decode.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-01-13 Thread chiranmoy.bhattacha...@fujitsu.com
On Fri, Jan 10, 2025 at 09:38:14AM -0600, Nathan Bossart wrote:
> Do you mean that the auto-vectorization worked and you observed no
> performance improvement, or the auto-vectorization had no effect on the
> code generated?

Auto-vectorization is working now with the following addition on Graviton 3
(m7g.4xlarge) with GCC 11.4, and the results match yours. Previously,
auto-vectorization had no effect because we had missed the -march=native
option.

  encode.o: CFLAGS += ${CFLAGS_VECTORIZE} -march=native

There is a 30% improvement using auto-vectorization.

    buf | default | auto_vec |  SVE
--------+---------+----------+------
     16 |      16 |       12 |    8
     64 |      58 |       40 |    9
    256 |     223 |      152 |   18
   1024 |     934 |      613 |   54
   4096 |    3533 |     2430 |  202
  16384 |   14081 |     9831 |  800
  65536 |   56374 |    38702 | 3202

Auto-vectorization had no effect on hex_decode due to the presence of control 
flow.

-
Here is a comment snippet from src/include/port/simd.h

"While Neon support is technically optional for aarch64, it appears that all 
available 64-bit hardware does have it."

Currently, it is assumed that all aarch64 machines support NEON, but for newer
SIMD extensions like SVE (and AVX-512 on x86) this assumption may not hold, so
a runtime check is needed. Using src/include/port/simd.h to abstract away
these advanced SIMD implementations may be difficult.

We will update the thread once a solution is found.

-
Chiranmoy


Re: [PATCH] SVE popcount support

2025-03-19 Thread chiranmoy.bhattacha...@fujitsu.com
On Wed, Mar 13, 2025 at 12:02:07AM +, nathandboss...@gmail.com wrote:
> Those are nice results.  I'm a little worried about the Neon implementation
> for smaller inputs since it uses a per-byte loop for the remaining bytes,
> though.  If we can ensure there's no regression there, I think this patch
> will be in decent shape.

True, the Neon implementation in patch v6 did perform worse for smaller inputs.
This is solved in v7: we have added pg_popcount64 to speed up the processing of
smaller inputs and remaining bytes. Also, similar to SVE, the neon-2reg version
performed better than neon-1reg, but there was no improvement with neon-4reg.
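
The tail handling is along these lines (a simplified sketch, not the exact
patch code):

#include <string.h>
#include "c.h"
#include "port/pg_bitutils.h"

static uint64
popcount_tail(const char *buf, size_t bytes, uint64 popcnt)
{
    /* consume leftover bytes eight at a time via pg_popcount64() */
    while (bytes >= 8)
    {
        uint64      word;

        memcpy(&word, buf, sizeof(word));   /* avoids unaligned uint64 loads */
        popcnt += pg_popcount64(word);
        buf += 8;
        bytes -= 8;
    }
    /* then fall back to the per-byte lookup table for what remains */
    while (bytes-- > 0)
        popcnt += pg_number_of_ones[(unsigned char) *buf++];

    return popcnt;
}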

The below table compares patches v6 and v7 on m7g.4xlarge
Query: SELECT drive_popcount(100, 8-byte words);
 8-byte words | master | v6-neon-2reg | v7-neon-2reg | v7-sve
--------------+--------+--------------+--------------+--------
            1 |  4.051 |        6.239 |        3.431 |  3.343
            2 |  4.429 |       10.773 |        3.899 |  3.335
            3 |  4.844 |       14.066 |        4.398 |  3.348
            4 |  5.324 |        3.342 |        3.663 |  3.365
            5 |  5.900 |        7.108 |        4.349 |  4.441
            6 |  6.478 |       11.720 |        4.851 |  4.441
            7 |  7.192 |       15.686 |        5.551 |  4.447
            8 |  8.016 |        4.288 |        4.367 |  4.013


We modified [0] to get the numbers for pg_popcount_masked
 8-byte words |  master | v7-neon-2reg |  v7-sve
--------------+---------+--------------+---------
            1 |   4.289 |        4.202 |   3.827
            2 |   4.993 |        4.662 |   3.823
            3 |   5.981 |        5.459 |   3.834
            4 |   6.438 |        4.230 |   3.846
            5 |   7.169 |        5.236 |   5.072
            6 |   7.949 |        5.922 |   5.106
            7 |   9.130 |        6.535 |   5.060
            8 |   9.796 |        5.328 |   4.718
          512 | 387.543 |      182.801 |  77.077
         1024 | 760.644 |      360.660 | 150.519

[0] 
https://postgr.es/m/CAFBsxsE7otwnfA36Ly44zZO+b7AEWHRFANxR1h1kxveEV=g...@mail.gmail.com

-Chiranmoy


v7-0001-SVE-and-NEON-support-for-pg_popcount.patch
Description: v7-0001-SVE-and-NEON-support-for-pg_popcount.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-02-19 Thread chiranmoy.bhattacha...@fujitsu.com
It seems that the patch doesn't compile on macOS: it is unable to map 'i' and
'len', which are of type 'size_t', to 'uint64'. This appears to be a
macOS-specific issue. The latest patch should resolve it by casting 'size_t'
to 'uint64' before passing the values to 'svwhilelt_b8'; a short snippet of
the fix is shown after the error log below.
[11:04:07.478] ../src/backend/utils/adt/encode.c:356:10: error: call to 
'svwhilelt_b8' is ambiguous
[11:04:07.478] 356 | pred = svwhilelt_b8(i, len);
[11:04:07.478] | ^~~~
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10:
 note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478] | ^
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10:
 note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478] | ^
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10:
 note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478] | ^
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10:
 note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478] | ^
[11:04:07.478] ../src/backend/utils/adt/encode.c:433:10: error: call to 
'svwhilelt_b8' is ambiguous
[11:04:07.478] 433 | pred = svwhilelt_b8(i / 2, len / 2);
[11:04:07.478] | ^~~~
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28288:10:
 note: candidate function
[11:04:07.478] 28288 | svbool_t svwhilelt_b8(uint32_t, uint32_t);
[11:04:07.478] | ^
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28296:10:
 note: candidate function
[11:04:07.478] 28296 | svbool_t svwhilelt_b8(uint64_t, uint64_t);
[11:04:07.478] | ^
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28304:10:
 note: candidate function
[11:04:07.478] 28304 | svbool_t svwhilelt_b8(int32_t, int32_t);
[11:04:07.478] | ^
[11:04:07.478] 
/Applications/Xcode_16.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_sve.h:28312:10:
 note: candidate function
[11:04:07.478] 28312 | svbool_t svwhilelt_b8(int64_t, int64_t);
[11:04:07.478] | ^
[11:04:07.478] 2 errors generated.
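
In essence, the fix forces the uint64 overload at the two call sites flagged
above (illustrative snippet, not the patch's exact lines):

pred = svwhilelt_b8((uint64) i, (uint64) len);              /* encode.c:356 */
pred = svwhilelt_b8((uint64) (i / 2), (uint64) (len / 2));  /* encode.c:433 */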

---
Chiranmoy


v4-0001-SVE-support-for-hex-encode-and-hex-decode.patch
Description: v4-0001-SVE-support-for-hex-encode-and-hex-decode.patch


Re: [PATCH] SVE popcount support

2025-02-19 Thread chiranmoy.bhattacha...@fujitsu.com
> Hm.  Any idea why that is?  I wonder if the compiler isn't using as many
> SVE registers as it could for this.

Not sure; we tried forcing loop unrolling using the line below in the Makefile,
but the results are the same.

pg_popcount_sve.o: CFLAGS += ${CFLAGS_UNROLL_LOOPS} -march=native


> I've also noticed that the latest patch doesn't compile on my M3 macOS
> machine.  After a quick glance, I think the problem is that the
> TRY_POPCNT_FAST macro is set, so it's trying to compile the assembly
> versions.

Fixed. We had tried using the existing "choose" logic guarded by
TRY_POPCNT_FAST; the latest patch bypasses TRY_POPCNT_FAST by having separate
choose logic for aarch64.


-Chiranmoy


v5-0001-SVE-support-for-popcount-and-popcount-masked.patch
Description:  v5-0001-SVE-support-for-popcount-and-popcount-masked.patch


Re: [PATCH] SVE popcount support

2025-03-06 Thread chiranmoy.bhattacha...@fujitsu.com
> Interesting.  I do see different assembly with the 2 and 4 register
> versions, but I didn't get to testing it on a machine with SVE support
> today.

> Besides some additional benchmarking, I might make some small adjustments
> to the patch.  But overall, it seems to be in decent shape.

Sounds good. Let us know your findings.

-Chiranmoy


Re: [PATCH] SVE popcount support

2025-03-12 Thread chiranmoy.bhattacha...@fujitsu.com
On Wed, Mar 12, 2025 at 02:41:18AM +, nathandboss...@gmail.com wrote:

> v5-no-sve is the result of using a function pointer, but pointing to the
> "slow" versions instead of the SVE version.  v5-sve is the result of the
> latest patch in this thread on a machine with SVE support, and v5-4reg is
> the result of the latest patch in this thread modified to process 4
> register's worth of data at a time.

Nice. I wonder why I did not observe any performance gain in the 4reg version;
did you modify the 4reg version code?

One possible explanation is that you used Graviton4-based instances whereas I
used Graviton3-based instances.

> For the latter point, I think we should consider trying to add a separate
> Neon implementation that we use as a fallback for machines that don't have
> SVE.  My understanding is that Neon is virtually universally supported on
> 64-bit Arm gear, so that will not only help offset the function pointer
> overhead but may even improve performance for a much wider set of machines.

I have added the NEON implementation in the latest patch.

Here are the numbers for drive_popcount(100, 1024) on m7g.8xlarge:
Scalar - 692ms
Neon - 298ms
SVE - 112ms

-Chiranmoy


v6-0001-SVE-and-NEON-support-for-popcount.patch
Description: v6-0001-SVE-and-NEON-support-for-popcount.patch


Re: [PATCH] SVE popcount support

2025-03-23 Thread chiranmoy.bhattacha...@fujitsu.com
Looks good, the code is more readable now.

> For both Neon and SVE, I do see improvements with looping over 4
> registers at a time, so IMHO it's worth doing so even if it performs the
> same as 2-register blocks on some hardware.

There was no regression on Graviton 3 when using the 4-register version, so we
can keep it.

-Chiranmoy


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-07-07 Thread chiranmoy.bhattacha...@fujitsu.com
Attaching the rebased patch, some regression tests for SIMD hex-coding,
and a script to test bytea performance (usage info in the script).

The results obtained using the script on an m7g.4xlarge are shown below.

Read Operation
 table (MB) | HEAD (ms) | SVE (ms) | improvement (%)
------------+-----------+----------+-----------------
         52 |       136 |      111 |           18.38
        105 |       215 |      164 |           23.72
        209 |       452 |      331 |           26.76
        419 |       830 |      602 |           27.46

Write Operation - table size after write
 table (MB) | HEAD (ms) | SVE (ms) | improvement (%)
------------+-----------+----------+-----------------
         52 |      1430 |     1361 |            4.82
        105 |      2956 |     2816 |            4.73

The bytea write numbers are averaged over 7 runs, with the table truncated and
vacuumed after each run.

Chiranmoy
from time import time
from sys import argv
import psycopg2
# import psycopg2.extras

"""
Usage
Set the input_path, output_path and conn_params in the main guard below.
 
To create the test table
python3 bytea_test.py c

To insert the file at input_path N times
python3 bytea_test.py w <N>

To read the data from the table and write the first file to output_path
python3 bytea_test.py r
"""


def read_binary(input_path):
with open(input_path, 'rb') as f:
return f.read()


def create_table(conn_params):
conn = psycopg2.connect(**conn_params)
cursor = conn.cursor()

cursor.execute("""CREATE TABLE IF NOT EXISTS BYTEA_TABLE (data BYTEA);""",)

conn.commit()
cursor.close()
conn.close()


def save_file(binary_data, duplicate, conn_params):
conn = psycopg2.connect(**conn_params)
cursor = conn.cursor()

data = psycopg2.Binary(binary_data)

# rows = [(data,) for _ in range(duplicate)]
# start = time()
# psycopg2.extras.execute_values(cursor, """INSERT INTO bytea_table (data) VALUES %s;""", rows)
# print(time()-start)

start = time()
for _ in range(duplicate):
cursor.execute("""INSERT INTO bytea_table (data) VALUES (%s);""", (data,))
print(time()-start)

conn.commit()
cursor.close()
conn.close()
print(f"file saved {duplicate} times")


def retrieve_file(output_path, conn_params):
conn = psycopg2.connect(**conn_params)
cursor = conn.cursor()

start = time()
cursor.execute("""SELECT data FROM bytea_table;""")
print(time()-start)

result = cursor.fetchone()
if result:
binary_data = result[0]
with open(output_path, 'wb') as f:
f.write(binary_data)
print(f"File retrieved, saved to '{output_path}'")
else:
print(f"ERROR")

cursor.close()
conn.close()


if __name__ == '__main__':
input_path = 'INPUT/PATH/file.jpg'
    output_path = 'OUTPUT/PATH/retrieved.jpg'
conn_params = {
'dbname': '',
'user': '',
'password': '',
'host': '',
'port': 0
}

if len(argv) == 1:
print("Specify Mode")
elif argv[1].lower() == "c":
create_table(conn_params)
elif argv[1].lower() == "r":
retrieve_file(output_path, conn_params)
    elif argv[1].lower() == 'w' and len(argv) > 2 and int(argv[2]) > 0:
save_file(read_binary(input_path), int(argv[2]), conn_params)
else:
print("Invalid Mode")


v1-0001-hex-coding-regress-test.patch
Description: v1-0001-hex-coding-regress-test.patch


v5-0001-SVE-support-for-hex-coding.patch
Description: v5-0001-SVE-support-for-hex-coding.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-06-09 Thread chiranmoy.bhattacha...@fujitsu.com
Here's the rebased patch with a few modifications.

The hand-unrolled hex encode performs better than the non-unrolled version on
r8g.4xlarge. No improvement on m7g.4xlarge.
Added line-by-line comments explaining the changes with an example.

Below are the results. Input size is in bytes, and exec time is in ms.

encode - r8g.4xlarge

 Input |   master |     SVE | SVE-unrolled
-------+----------+---------+--------------
     8 |    4.971 |   6.434 |        6.623
    16 |    8.532 |   4.399 |        4.710
    24 |   12.296 |   5.007 |        5.780
    32 |   16.003 |   5.027 |        5.234
    40 |   19.628 |   5.807 |        6.201
    48 |   23.277 |   5.815 |        6.222
    56 |   26.927 |   6.744 |        7.030
    64 |   30.419 |   6.774 |        6.347
   128 |   83.250 |  10.214 |        9.104
   256 |  112.158 |  17.892 |       16.313
   512 |  216.544 |  31.060 |       29.876
  1024 |  429.351 |  59.310 |       53.374
  2048 |  854.677 | 116.769 |      101.004
  4096 | 1706.528 | 237.322 |      195.297
  8192 | 3723.884 | 499.520 |      385.424
---

encode - m7g.4xlarge

 Input |   master |     SVE | SVE-unrolled
-------+----------+---------+--------------
     8 |    5.503 |   7.986 |        8.053
    16 |    9.881 |   9.583 |        9.888
    24 |   13.854 |   9.212 |       10.138
    32 |   18.056 |   9.208 |        9.364
    40 |   22.127 |  10.134 |       10.540
    48 |   26.214 |  10.186 |       10.550
    56 |   29.718 |  10.197 |       10.428
    64 |   33.613 |  10.982 |       10.497
   128 |   66.060 |  12.460 |       12.624
   256 |  130.225 |  18.491 |       18.872
   512 |  267.105 |  30.343 |       31.661
  1024 |  515.603 |  54.371 |       55.341
  2048 | 1013.766 | 103.898 |      105.192
  4096 | 2018.705 | 202.653 |      203.142
  8192 | 4000.496 | 400.918 |      401.842
---
decode - r8g.4xlarge

 Input |   master |      SVE
-------+----------+----------
     8 |    7.641 |    8.787
    16 |   14.301 |   14.477
    32 |   28.663 |    6.091
    48 |   42.940 |   17.604
    64 |   57.483 |   10.549
    80 |   71.637 |   19.194
    96 |   85.918 |   15.586
   112 |  100.272 |   25.956
   128 |  114.740 |   19.829
   256 |  229.176 |   36.032
   512 |  458.295 |   68.222
  1024 |  916.741 |  132.927
  2048 | 1833.422 |  262.741
  4096 | 3667.096 |  522.009
  8192 | 7333.886 | 1042.447
---

decode - m7g.4xlarge

 Input |   master |      SVE
-------+----------+----------
     8 |    8.194 |    9.433
    16 |   14.397 |   15.606
    32 |   26.669 |   29.006
    48 |   45.971 |   48.984
    64 |   58.468 |   12.388
    80 |   70.820 |   22.295
    96 |   84.792 |   43.470
   112 |   98.992 |   54.282
   128 |  113.250 |   25.508
   256 |  218.743 |   45.165
   512 |  414.133 |   86.800
  1024 |  828.493 |  174.670
  2048 | 1617.921 |  346.375
  4096 | 3259.159 |  689.391
  8192 | 6551.879 | 1376.195


Chiranmoy


v5-0001-SVE-support-for-hex-coding.patch
Description: v5-0001-SVE-support-for-hex-coding.patch


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-09-11 Thread chiranmoy.bhattacha...@fujitsu.com
> Thanks.  I noticed that this stuff is simple enough that we can use
> port/simd.h (with a few added functions).  This is especially nice because
> it takes care of x86, too.  The performance gains look similar to what you
> reported for v6:

This looks good, much cleaner.
One possible improvement would be to use a vectorized table lookup instead of
compare-and-add. I compared the v6 and v7 Neon versions, and v6 is always
faster. I'm not sure whether SSE2 has a table lookup similar to Neon's.
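
For reference, the table-lookup idea on Neon would look roughly like this (an
illustrative sketch, not the v6 code):

#include <arm_neon.h>
#include <stdint.h>

/* encode 16 input bytes into 32 lowercase hex characters */
static void
hex_encode_neon_16(const uint8_t *src, uint8_t *dst)
{
    const uint8x16_t lut = vld1q_u8((const uint8_t *) "0123456789abcdef");
    uint8x16_t   bytes = vld1q_u8(src);
    uint8x16x2_t out;

    /* look up the hex digit for each high and each low nibble */
    out.val[0] = vqtbl1q_u8(lut, vshrq_n_u8(bytes, 4));
    out.val[1] = vqtbl1q_u8(lut, vandq_u8(bytes, vdupq_n_u8(0x0F)));

    /* interleaved store: high digit, low digit, high digit, low digit, ... */
    vst2q_u8(dst, out);
}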

arm - m7g.4xlarge
  buf  | v6-Neon | v7-Neon | % diff
-------+---------+---------+--------
    64 |    6.16 |    8.57 |  28.07
   128 |   11.37 |   15.77 |  27.87
   256 |   18.54 |   30.28 |  38.77
   512 |   33.98 |   62.15 |  45.33
  1024 |   64.46 |  117.55 |  45.16
  2048 |  124.28 |  254.86 |  51.24
  4096 |  243.47 |  509.23 |  52.19
  8192 |  487.34 |  953.81 |  48.91

-
Chiranmoy


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-09-14 Thread chiranmoy.bhattacha...@fujitsu.com
> My current philosophy with this stuff is to favor simplicity,
> maintainability, portability, etc. over extracting the absolute maximum
> amount of performance gain, so I think we should proceed with the simd.h
> approach. But I'm curious how others feel about this.

>  +1. The maintainability aspect is critical over the long run.
> Also, there's a very real danger of optimizing for the specific
> hardware and test case you are working with, leading to actually
> worse performance with future hardware.

Using simd.h does make it easier to maintain.
Is there a plan to upgrade simd.h to use SSE4 or SSSE3 in the future?
Since SSE2 is much older, it lacks some of the more specialized intrinsics.
For example, a vectorized table lookup can be implemented via pshufb [0],
which is available in SSSE3 and later x86 instruction sets.

[0] https://www.felixcloutier.com/x86/pshufb
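
An illustrative sketch of the same nibble lookup with pshufb (assuming an
SSSE3 baseline; this is not proposed code):

#include <tmmintrin.h>          /* SSSE3: _mm_shuffle_epi8 */
#include <stdint.h>

/* encode 16 input bytes into 32 lowercase hex characters */
static void
hex_encode_ssse3_16(const uint8_t *src, uint8_t *dst)
{
    const __m128i lut = _mm_loadu_si128((const __m128i *) "0123456789abcdef");
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i     bytes = _mm_loadu_si128((const __m128i *) src);
    __m128i     hi = _mm_and_si128(_mm_srli_epi16(bytes, 4), mask);
    __m128i     lo = _mm_and_si128(bytes, mask);
    __m128i     hexhi = _mm_shuffle_epi8(lut, hi);
    __m128i     hexlo = _mm_shuffle_epi8(lut, lo);

    /* interleave high/low digits into output order */
    _mm_storeu_si128((__m128i *) dst, _mm_unpacklo_epi8(hexhi, hexlo));
    _mm_storeu_si128((__m128i *) (dst + 16), _mm_unpackhi_epi8(hexhi, hexlo));
}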

-
Chiranmoy


Re: [PATCH] Hex-coding optimizations using SVE on ARM.

2025-09-03 Thread chiranmoy.bhattacha...@fujitsu.com
Hi all,

Since the CommitFest is underway, could we get some feedback to improve the 
patch?

___
Chiranmoy