Re: CRC32C Parallel Computation Optimization on ARM

2024-12-17 Thread John Naylor
On Mon, Dec 2, 2024 at 2:01 AM Dmitry Dolgov <9erthali...@gmail.com> wrote: > > One side note, I think it would be great to properly cite the white > paper the patch is referring to. Besides paying some respect to the > authors, it will also make it easier to actually find it. After a quick > searc

Re: CRC32C Parallel Computation Optimization on ARM

2024-12-11 Thread John Naylor
On Wed, Dec 11, 2024 at 11:54 PM Nathan Bossart wrote: > > On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote: > > and how light it was. With more hardware support, we can go much lower > > than 1024 bytes, but that can be left for future work. > > Nice. I'm curious how this compares to

Re: CRC32C Parallel Computation Optimization on ARM

2024-12-11 Thread Nathan Bossart
On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote: > I added a port to x86 and poked at it, with the intent to have an easy > on-ramp to that at least accelerates computation of CRCs on FPIs. > > The 0008 patch only worked on chunks of 1024 at a time. At that size, > the presence of hard

Re: CRC32C Parallel Computation Optimization on ARM

2024-12-10 Thread John Naylor
I wrote: > > 1. I looked at a couple implementations of this idea, and found that > the constants used in the carryless multiply are tied to the length of > the blocks. With a lookup table we can do the 3-way algorithm on any > portion of a full block length, rather than immediately fall to doing >

Re: CRC32C Parallel Computation Optimization on ARM

2024-12-03 Thread John Naylor
On Mon, Dec 4, 2023 at 2:27 PM Xiang Gao wrote: > > [v8 patch] I have a couple quick thoughts on this: 1. I looked at a couple implementations of this idea, and found that the constants used in the carryless multiply are tied to the length of the blocks. With a lookup table we can do the 3-way a

Re: CRC32C Parallel Computation Optimization on ARM

2024-12-01 Thread Dmitry Dolgov
> On Mon, Dec 04, 2023 at 10:18:09PM -0600, Nathan Bossart wrote: > > Thanks for the new patch. I am hoping to spend much more time on this in > the near future... Hi, The patch looks interesting, having around 8% improvement on that sounds attractive. Nathan, do you plan to come back to it and

Re: CRC32C Parallel Computation Optimization on ARM

2023-12-04 Thread Nathan Bossart
On Mon, Dec 04, 2023 at 07:27:01AM +, Xiang Gao wrote: > This is the latest patch. Looking forward to your feedback, thanks! Thanks for the new patch. I am hoping to spend much more time on this in the near future... -- Nathan Bossart Amazon Web Services: https://aws.amazon.com

RE: CRC32C Parallel Computation Optimization on ARM

2023-12-03 Thread Xiang Gao
On Date: Thu, 30 Nov 2023 14:54:26PM -0600, Nathan Bossart wrote: >>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO} >> >> It does not work correctly. CFLAGS ='-march=armv8-a+crc, >> -march=armv8-a+crypto', what actually works is '-march=armv8-a+crypto'. >> >> We set a new variable CLAGS

Re: CRC32C Parallel Computation Optimization on ARM

2023-11-30 Thread Nathan Bossart
On Thu, Nov 23, 2023 at 08:05:26AM +, Xiang Gao wrote: > On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote: >>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO} > > It does not work correctly. CFLAGS ='-march=armv8-a+crc, > -march=armv8-a+crypto', what actually works is

RE: CRC32C Parallel Computation Optimization on ARM

2023-11-23 Thread Xiang Gao
On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote: >> On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote: >>>+__attribute__((target("+crc+crypto"))) >>> >>>I'm not sure we can assume that all compilers will understand this, and I'm >>>not sure we need it. >> >> CFLAGS_

Re: CRC32C Parallel Computation Optimization on ARM

2023-11-22 Thread Nathan Bossart
On Wed, Nov 22, 2023 at 10:16:44AM +, Xiang Gao wrote: > On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote: >>+__attribute__((target("+crc+crypto"))) >> >>I'm not sure we can assume that all compilers will understand this, and I'm >>not sure we need it. > > CFLAGS_CRC is "-march

RE: CRC32C Parallel Computation Optimization on ARM

2023-11-22 Thread Xiang Gao
On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote: >-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC >-pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC) >-pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC) >-pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC) > >Why are these lines deleted? > >- [

Re: CRC32C Parallel Computation Optimization on ARM

2023-11-10 Thread Nathan Bossart
On Tue, Nov 07, 2023 at 08:05:45AM +, Xiang Gao wrote: > I think I understand what you mean, this is the latest patch. Thank you! Thanks for the new patch. +# PGAC_ARMV8_VMULL_INTRINSICS +# +# Check if the compiler supports the vmull_p64 +# intrinsic functions. Th

RE: CRC32C Parallel Computation Optimization on ARM

2023-11-07 Thread Xiang Gao
On Mon, 6 Nov 2023 13:16:13PM -0600, Nathan Bossart wrote: >>> The idea is that we don't want to start forcing runtime checks on builds >>>where we aren't already doing runtime checks. IOW if the compiler can use >>>the ARMv8 CRC instructions with the default compiler flags, we should only >>>use

Re: CRC32C Parallel Computation Optimization on ARM

2023-11-06 Thread Nathan Bossart
On Fri, Nov 03, 2023 at 10:46:57AM +, Xiang Gao wrote: > On Date: Thu, 2 Nov 2023 09:35:50AM -0500, Nathan Bossart wrote: >> The idea is that we don't want to start forcing runtime checks on builds >> where we aren't already doing runtime checks. IOW if the compiler can use >> the ARMv8 CRC in

RE: CRC32C Parallel Computation Optimization on ARM

2023-11-03 Thread Xiang Gao
On Date: Thu, 2 Nov 2023 09:35:50AM -0500, Nathan Bossart wrote: >On Thu, Nov 02, 2023 at 06:17:20AM +, Xiang Gao wrote: >> After reading the discussion, I understand that in order to avoid performance >> regression in some instances, we need to try our best to avoid runtime >> checks. > >I d

Re: CRC32C Parallel Computation Optimization on ARM

2023-11-02 Thread Nathan Bossart
On Thu, Nov 02, 2023 at 06:17:20AM +, Xiang Gao wrote: > After reading the discussion, I understand that in order to avoid performance > regression in some instances, we need to try our best to avoid runtime checks. > I don't know if I understand it correctly. The idea is that we don't want to

RE: CRC32C Parallel Computation Optimization on ARM

2023-11-01 Thread Xiang Gao
On Tue, 31 Oct 2023 15:48:21PM -0500, Nathan Bossart wrote: >> Thanks. I went ahead and split this prerequisite part out to a separate >> thread [0] since it's sort-of unrelated to your proposal here. It's not >> really a prerequisite, but I do think it will simplify things a bit. >Per the other

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-31 Thread Nathan Bossart
On Mon, Oct 30, 2023 at 11:21:43AM -0500, Nathan Bossart wrote: > On Fri, Oct 27, 2023 at 07:01:10AM +, Xiang Gao wrote: >> On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote: We consider that a runtime check needs to be done in any scenario. Here we only confirm that the com

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-30 Thread Nathan Bossart
On Fri, Oct 27, 2023 at 07:01:10AM +, Xiang Gao wrote: > On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote: >>> We consider that a runtime check needs to be done in any scenario. >>> Here we only confirm that the compilation can be successful. >> >A runtime check will be done when cho

RE: CRC32C Parallel Computation Optimization on ARM

2023-10-27 Thread Xiang Gao
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote: >> We consider that a runtime check needs to be done in any scenario. >> Here we only confirm that the compilation can be successful. > >A runtime check will be done when choosing which algorithm. > >You can think of us as merging USE_ARM

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Nathan Bossart
On Thu, Oct 26, 2023 at 08:53:31AM +, Xiang Gao wrote: > On Tue, 24 Oct, 2023 20:45:39PM -0500, Nathan Bossart wrote: >>I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds >>without the patch and around 7.4 seconds with it (an 8% improvement). >>pg_waldump on 1 millio

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Nathan Bossart
On Thu, Oct 26, 2023 at 07:28:35AM +, Xiang Gao wrote: > On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote: >>+# Use ARM VMULL if available and ARM CRC32C intrinsic is avaliable too. >>+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" || >>test x"$USE_ARMV8_

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Bharath Rupireddy
On Thu, Oct 26, 2023 at 2:23 PM Xiang Gao wrote: > > On Tue, 24 Oct, 2023 20:45:39PM -0500, Nathan Bossart wrote: > >I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds > >without the patch and around 7.4 seconds with it (an 8% improvement). > >pg_waldump on 1 million ~1

RE: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Xiang Gao
On Tue, 24 Oct, 2023 20:45:39PM -0500, Nathan Bossart wrote: >I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds >without the patch and around 7.4 seconds with it (an 8% improvement). >pg_waldump on 1 million ~16kB records took around 3.2 seconds without the >patch and a

RE: CRC32C Parallel Computation Optimization on ARM

2023-10-26 Thread Xiang Gao
On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote: >+pg_crc32c >+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len) >It looks like most of this function is duplicated from >pg_comp_crc32c_armv8(). I understand that we probably need a separate >function becau

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-25 Thread Nathan Bossart
+pg_crc32c +pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len) It looks like most of this function is duplicated from pg_comp_crc32c_armv8(). I understand that we probably need a separate function because of the runtime check, but perhaps we could create a common static

RE: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Xiang Gao
Thanks for your suggestion. This is the modified patch, along with two test files. -----Original Message----- From: Michael Paquier Sent: Friday, October 20, 2023 4:19 PM To: Xiang Gao Cc: pgsql-hackers@lists.postgresql.org Subject: Re: CRC32C Parallel Computation Optimization on ARM On Fri, Oct 20

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Nathan Bossart
On Wed, Oct 25, 2023 at 07:17:55AM +0900, Michael Paquier wrote: > If you are looking at computing the CRC of records with arbitrary > sizes, why not just generating a series with > pg_logical_emit_message() before doing a comparison with pg_waldump or > a custom replay loop to go through the recor

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Michael Paquier
On Wed, Oct 25, 2023 at 12:37:45AM +0300, Heikki Linnakangas wrote: > On 25/10/2023 00:18, Nathan Bossart wrote: >> Actually, since the pg_waldump benchmark likely only involves very small >> WAL records, it would make sense that there isn't much difference. >> *facepalm* > > No need to guess, pg_

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Heikki Linnakangas
On 25/10/2023 00:18, Nathan Bossart wrote: On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote: I'm able to reproduce the speedup with the provided benchmark on an Apple M1 Pro (which appears to have the required instructions). There was almost no change for the 512-byte case, but th

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Nathan Bossart
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote: > I'm able to reproduce the speedup with the provided benchmark on an Apple > M1 Pro (which appears to have the required instructions). There was almost > no change for the 512-byte case, but there was a ~60% speedup for the > 4096-by

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-24 Thread Nathan Bossart
On Fri, Oct 20, 2023 at 05:18:56PM +0900, Michael Paquier wrote: > On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote: >> This patch uses a parallel computing optimization algorithm to >> improve crc32c computing performance on ARM. The algorithm comes >> from Intel whitepaper: >> crc-iscsi-

Re: CRC32C Parallel Computation Optimization on ARM

2023-10-20 Thread Michael Paquier
On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote: > This patch uses a parallel computing optimization algorithm to > improve crc32c computing performance on ARM. The algorithm comes > from Intel whitepaper: > crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided > into three

CRC32C Parallel Computation Optimization on ARM

2023-10-20 Thread Xiang Gao
Hi all, This patch uses a parallel computing optimization algorithm to improve crc32c computing performance on ARM. The algorithm comes from the Intel whitepaper: crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided into three equal-sized blocks. Three parallel blocks (crc0, crc1, crc2
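
The three-block idea described above can be sketched in portable C. This is only an illustration under stated assumptions, not the code from the patch: the function names (crc32c_update, crc32c_shift, crc32c_3way) are hypothetical, the per-block CRC is computed with a plain bitwise loop rather than the ARMv8 CRC instructions, and the combine step advances each partial CRC by feeding zero bytes instead of using a carry-less multiply (vmull_p64) with precomputed constants. It relies on the fact that the raw CRC state update is linear over GF(2), so blocks started from a zero state can be folded into the running CRC afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reflected CRC-32C (Castagnoli) polynomial. */
#define CRC32C_POLY 0x82F63B78u

/*
 * Bitwise update of a raw CRC state. The caller applies the initial
 * 0xFFFFFFFF and the final inversion, as with INIT/FIN style macros.
 */
static uint32_t
crc32c_update(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (CRC32C_POLY & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Advance a raw CRC state past nbytes zero bytes. This stands in for the
 * carry-less multiply by a block-length-dependent constant; it is slow and
 * is used here only to keep the sketch short and obviously correct.
 */
static uint32_t
crc32c_shift(uint32_t crc, size_t nbytes)
{
    const unsigned char zero = 0;

    while (nbytes--)
        crc = crc32c_update(crc, &zero, 1);
    return crc;
}

/* Hypothetical 3-way variant: same result as crc32c_update(crc, p, len). */
static uint32_t
crc32c_3way(uint32_t crc, const unsigned char *p, size_t len)
{
    size_t      blk = len / 3;  /* three equal-sized blocks */
    uint32_t    crc0, crc1, crc2;

    assert(p != NULL);

    /* The three updates are independent, so hardware can overlap them. */
    crc0 = crc32c_update(crc, p, blk);          /* continues running CRC */
    crc1 = crc32c_update(0, p + blk, blk);      /* starts from zero */
    crc2 = crc32c_update(0, p + 2 * blk, blk);  /* starts from zero */

    /* Fold the partial results: shift each by the bytes that follow it. */
    crc = crc32c_shift(crc0, 2 * blk) ^ crc32c_shift(crc1, blk) ^ crc2;

    /* Handle the 0 to 2 leftover bytes serially. */
    return crc32c_update(crc, p + 3 * blk, len - 3 * blk);
}
```

Called as crc32c_3way(0xFFFFFFFF, data, len) ^ 0xFFFFFFFF, the sketch yields the standard CRC-32C check value 0xE3069283 for the nine ASCII bytes "123456789", the same as the serial loop. The zero-byte shift makes the dependence on block length explicit: a real implementation would replace it with constants precomputed for a fixed block size, or with a lookup table indexed by length.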