On Mon, Dec 2, 2024 at 2:01 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> One side note, I think it would be great to properly cite the white
> paper the patch is referring to. Besides paying some respect to the
> authors, it will also make it easier to actually find it. After a quick
> searc
On Wed, Dec 11, 2024 at 11:54 PM Nathan Bossart
wrote:
>
> On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:
> > and how light it was. With more hardware support, we can go much lower
> > than 1024 bytes, but that can be left for future work.
>
> Nice. I'm curious how this compares to
On Wed, Dec 11, 2024 at 02:08:58PM +0700, John Naylor wrote:
> I added a port to x86 and poked at it, with the intent to have an easy
> on-ramp to that at least accelerates computation of CRCs on FPIs.
>
> The 0008 patch only worked on chunks of 1024 at a time. At that size,
> the presence of hard
I wrote:
>
> 1. I looked at a couple implementations of this idea, and found that
> the constants used in the carryless multiply are tied to the length of
> the blocks. With a lookup table we can do the 3-way algorithm on any
> portion of a full block length, rather than immediately fall to doing
>
On Mon, Dec 4, 2023 at 2:27 PM Xiang Gao wrote:
>
> [v8 patch]
I have a couple quick thoughts on this:
1. I looked at a couple implementations of this idea, and found that
the constants used in the carryless multiply are tied to the length of
the blocks. With a lookup table we can do the 3-way a
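To illustrate point 1, here is a minimal software sketch of why the multiply constants depend on the block length, and what a per-length lookup table could look like. It is not code from any patch in this thread: `k_table`, `MAX_BLOCK`, `shift_by_table` and the other names are hypothetical, and `gf2_mulmod` is a plain-C stand-in for the hardware carryless multiply. Moving a raw CRC register past n bytes is a multiplication by x^(8n) modulo the CRC-32C polynomial, so each block length needs its own constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32C_POLY 0x82F63B78u   /* CRC-32C polynomial, bit-reflected */
#define MAX_BLOCK   1024          /* hypothetical largest block length */

/* Multiply a register value by x modulo the polynomial (one zero input bit). */
static uint32_t
times_x(uint32_t v)
{
    return (v >> 1) ^ (CRC32C_POLY & (0u - (v & 1u)));
}

/* Reference: advance a raw CRC register past nbytes zero bytes, bit by bit. */
static uint32_t
shift_by_zeros(uint32_t state, size_t nbytes)
{
    for (size_t i = 0; i < nbytes * 8; i++)
        state = times_x(state);
    return state;
}

/* Software stand-in for a carryless multiply followed by reduction. */
static uint32_t
gf2_mulmod(uint32_t a, uint32_t b)
{
    uint32_t    p = 0;

    /* Bit 31 holds the x^0 coefficient in the reflected representation. */
    for (uint32_t m = 0x80000000u; m != 0; m >>= 1)
    {
        if (a & m)
            p ^= b;
        b = times_x(b);
    }
    return p;
}

/* k_table[n] = x^(8n) mod P: one constant per block length. */
static uint32_t k_table[MAX_BLOCK + 1];

static void
init_k_table(void)
{
    uint32_t    k = 0x80000000u;    /* the polynomial "1" */

    for (size_t n = 0; n <= MAX_BLOCK; n++)
    {
        k_table[n] = k;
        for (int bit = 0; bit < 8; bit++)
            k = times_x(k);
    }
}

/* Same result as shift_by_zeros(), but with a single multiply. */
static uint32_t
shift_by_table(uint32_t state, size_t nbytes)
{
    assert(nbytes <= MAX_BLOCK);
    return gf2_mulmod(k_table[nbytes], state);
}
```

With such a table, a block of any length up to `MAX_BLOCK` can be combined with one multiply, instead of only the lengths whose constants happen to be hard-coded.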
> On Mon, Dec 04, 2023 at 10:18:09PM -0600, Nathan Bossart wrote:
>
> Thanks for the new patch. I am hoping to spend much more time on this in
> the near future...
Hi,
The patch looks interesting, having around 8% improvement on that sounds
attractive. Nathan, do you plan to come back to it and
On Mon, Dec 04, 2023 at 07:27:01AM +, Xiang Gao wrote:
> This is the latest patch. Looking forward to your feedback, thanks!
Thanks for the new patch. I am hoping to spend much more time on this in
the near future...
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Date: Thu, 30 Nov 2023 14:54:26PM -0600, Nathan Bossart wrote:
>>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}
>>
>> It does not work correctly. CFLAGS ='-march=armv8-a+crc,
>> -march=armv8-a+crypto', what actually works is '-march=armv8-a+crypto'.
>>
>> We set a new variable CLAGS
On Thu, Nov 23, 2023 at 08:05:26AM +, Xiang Gao wrote:
> On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote:
>>pg_crc32c_armv8.o: CFLAGS += ${CFLAGS_CRC} ${CFLAGS_CRYPTO}
>
> It does not work correctly. CFLAGS ='-march=armv8-a+crc,
> -march=armv8-a+crypto', what actually works is
On Date: Wed, 22 Nov 2023 15:06:18PM -0600, Nathan Bossart wrote:
>> On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:
>>>+__attribute__((target("+crc+crypto")))
>>>
>>>I'm not sure we can assume that all compilers will understand this, and I'm
>>>not sure we need it.
>>
>> CFLAGS_
On Wed, Nov 22, 2023 at 10:16:44AM +, Xiang Gao wrote:
> On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:
>>+__attribute__((target("+crc+crypto")))
>>
>>I'm not sure we can assume that all compilers will understand this, and I'm
>>not sure we need it.
>
> CFLAGS_CRC is "-march
On Date: Fri, 10 Nov 2023 10:36:08AM -0600, Nathan Bossart wrote:
>-# all versions of pg_crc32c_armv8.o need CFLAGS_CRC
>-pg_crc32c_armv8.o: CFLAGS+=$(CFLAGS_CRC)
>-pg_crc32c_armv8_shlib.o: CFLAGS+=$(CFLAGS_CRC)
>-pg_crc32c_armv8_srv.o: CFLAGS+=$(CFLAGS_CRC)
>
>Why are these lines deleted?
>
>- [
On Tue, Nov 07, 2023 at 08:05:45AM +, Xiang Gao wrote:
> I think I understand what you mean, this is the latest patch. Thank you!
Thanks for the new patch.
+# PGAC_ARMV8_VMULL_INTRINSICS
+#
+# Check if the compiler supports the vmull_p64
+# intrinsic functions. Th
On Mon, 6 Nov 2023 13:16:13PM -0600, Nathan Bossart wrote:
>>> The idea is that we don't want to start forcing runtime checks on builds
>>>where we aren't already doing runtime checks. IOW if the compiler can use
>>>the ARMv8 CRC instructions with the default compiler flags, we should only
>>>use
On Fri, Nov 03, 2023 at 10:46:57AM +, Xiang Gao wrote:
> On Date: Thu, 2 Nov 2023 09:35:50AM -0500, Nathan Bossart wrote:
>> The idea is that we don't want to start forcing runtime checks on builds
>> where we aren't already doing runtime checks. IOW if the compiler can use
>> the ARMv8 CRC in
On Date: Thu, 2 Nov 2023 09:35:50AM -0500, Nathan Bossart wrote:
>On Thu, Nov 02, 2023 at 06:17:20AM +, Xiang Gao wrote:
>> After reading the discussion, I understand that in order to avoid performance
>> regression in some instances, we need to try our best to avoid runtime
>> checks.
> >I d
On Thu, Nov 02, 2023 at 06:17:20AM +, Xiang Gao wrote:
> After reading the discussion, I understand that in order to avoid performance
> regression in some instances, we need to try our best to avoid runtime checks.
> I don't know if I understand it correctly.
The idea is that we don't want to
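A rough sketch of the idea discussed in the messages above: if the default compiler flags already enable the ARMv8 CRC instructions, call them directly, and only pay for a function-pointer dispatch with a runtime check otherwise. `__ARM_FEATURE_CRC32` and `__crc32cb` come from the Arm C Language Extensions; everything else (`comp_crc32c`, `crc32c_choose`, `hw_crc_usable`, `crc32c_hw_placeholder`) is a hypothetical name for illustration, and the probe is a stub rather than a real CPU feature check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __ARM_FEATURE_CRC32
#include <arm_acle.h>

/* Default flags already target CRC: direct call, no runtime check. */
static uint32_t
comp_crc32c(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    assert(data != NULL || len == 0);
    while (len--)
        crc = __crc32cb(crc, *p++);
    return crc;
}

#else

/* Portable bitwise fallback (reflected CRC-32C polynomial). */
static uint32_t
crc32c_sw(uint32_t crc, const void *data, size_t len)
{
    const uint8_t *p = data;

    assert(data != NULL || len == 0);
    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Placeholder: a real build would compile this with CRC flags enabled. */
static uint32_t
crc32c_hw_placeholder(uint32_t crc, const void *data, size_t len)
{
    return crc32c_sw(crc, data, len);
}

/* Stub probe; a real one would query the OS or CPU for CRC support. */
static int
hw_crc_usable(void)
{
    return 0;
}

static uint32_t crc32c_choose(uint32_t crc, const void *data, size_t len);

/* First call goes through the chooser, later calls go straight through. */
static uint32_t (*comp_crc32c) (uint32_t, const void *, size_t) = crc32c_choose;

static uint32_t
crc32c_choose(uint32_t crc, const void *data, size_t len)
{
    comp_crc32c = hw_crc_usable() ? crc32c_hw_placeholder : crc32c_sw;
    return comp_crc32c(crc, data, len);
}

#endif
```

Callers use `comp_crc32c(crc, data, len)` in both configurations; only builds that lack compile-time support carry the indirection.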
On Tue, 31 Oct 2023 15:48:21PM -0500, Nathan Bossart wrote:
>> Thanks. I went ahead and split this prerequisite part out to a separate
>> thread [0] since it's sort-of unrelated to your proposal here. It's not
>> really a prerequisite, but I do think it will simplify things a bit.
>Per the other
On Mon, Oct 30, 2023 at 11:21:43AM -0500, Nathan Bossart wrote:
> On Fri, Oct 27, 2023 at 07:01:10AM +, Xiang Gao wrote:
>> On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
We consider that a runtime check needs to be done in any scenario.
Here we only confirm that the com
On Fri, Oct 27, 2023 at 07:01:10AM +, Xiang Gao wrote:
> On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>>> We consider that a runtime check needs to be done in any scenario.
>>> Here we only confirm that the compilation can be successful.
>> >A runtime check will be done when cho
On Thu, 26 Oct, 2023 11:37:52AM -0500, Nathan Bossart wrote:
>> We consider that a runtime check needs to be done in any scenario.
>> Here we only confirm that the compilation can be successful.
> >A runtime check will be done when choosing which algorithm.
> >You can think of us as merging USE_ARM
On Thu, Oct 26, 2023 at 08:53:31AM +, Xiang Gao wrote:
> On Tue, 24 Oct, 2023 20:45:39PM -0500, Nathan Bossart wrote:
>>I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
>>without the patch and around 7.4 seconds with it (an 8% improvement).
>>pg_waldump on 1 millio
On Thu, Oct 26, 2023 at 07:28:35AM +, Xiang Gao wrote:
> On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote:
>>+# Use ARM VMULL if available and ARM CRC32C intrinsic is available too.
>>+if test x"$USE_ARMV8_VMULL" = x"" && (test x"$USE_ARMV8_CRC32C" = x"1" ||
>>test x"$USE_ARMV8_
On Thu, Oct 26, 2023 at 2:23 PM Xiang Gao wrote:
>
> On Tue, 24 Oct, 2023 20:45:39PM -0500, Nathan Bossart wrote:
> >I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
> >without the patch and around 7.4 seconds with it (an 8% improvement).
> >pg_waldump on 1 million ~1
On Tue, 24 Oct, 2023 20:45:39PM -0500, Nathan Bossart wrote:
>I tried this. pg_waldump on 2 million ~8kB records took around 8.1 seconds
>without the patch and around 7.4 seconds with it (an 8% improvement).
>pg_waldump on 1 million ~16kB records took around 3.2 seconds without the
>patch and a
On Wed, 25 Oct, 2023 at 10:43:25 -0500, Nathan Bossart wrote:
>+pg_crc32c
>+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)
>It looks like most of this function is duplicated from
>pg_comp_crc32c_armv8(). I understand that we probably need a separate
>function becau
+pg_crc32c
+pg_comp_crc32c_with_vmull_armv8(pg_crc32c crc, const void *data, size_t len)
It looks like most of this function is duplicated from
pg_comp_crc32c_armv8(). I understand that we probably need a separate
function because of the runtime check, but perhaps we could create a common
static
Thanks for your suggestion, this is the modified patch and two test files.
-Original Message-
From: Michael Paquier
Sent: Friday, October 20, 2023 4:19 PM
To: Xiang Gao
Cc: pgsql-hackers@lists.postgresql.org
Subject: Re: CRC32C Parallel Computation Optimization on ARM
On Fri, Oct 20
On Wed, Oct 25, 2023 at 07:17:55AM +0900, Michael Paquier wrote:
> If you are looking at computing the CRC of records with arbitrary
> sizes, why not just generating a series with
> pg_logical_emit_message() before doing a comparison with pg_waldump or
> a custom replay loop to go through the recor
On Wed, Oct 25, 2023 at 12:37:45AM +0300, Heikki Linnakangas wrote:
> On 25/10/2023 00:18, Nathan Bossart wrote:
>> Actually, since the pg_waldump benchmark likely only involves very small
>> WAL records, it would make sense that there isn't much difference.
>> *facepalm*
>
> No need to guess, pg_
On 25/10/2023 00:18, Nathan Bossart wrote:
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
I'm able to reproduce the speedup with the provided benchmark on an Apple
M1 Pro (which appears to have the required instructions). There was almost
no change for the 512-byte case, but th
On Tue, Oct 24, 2023 at 04:09:54PM -0500, Nathan Bossart wrote:
> I'm able to reproduce the speedup with the provided benchmark on an Apple
> M1 Pro (which appears to have the required instructions). There was almost
> no change for the 512-byte case, but there was a ~60% speedup for the
> 4096-by
On Fri, Oct 20, 2023 at 05:18:56PM +0900, Michael Paquier wrote:
> On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote:
>> This patch uses a parallel computing optimization algorithm to
>> improve crc32c computing performance on ARM. The algorithm comes
>> from Intel whitepaper:
>> crc-iscsi-
On Fri, Oct 20, 2023 at 07:08:58AM +, Xiang Gao wrote:
> This patch uses a parallel computing optimization algorithm to
> improve crc32c computing performance on ARM. The algorithm comes
> from Intel whitepaper:
> crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided
> into three
Hi all
This patch uses a parallel computing optimization algorithm to improve crc32c
computing performance on ARM. The algorithm comes from the Intel whitepaper:
crc-iscsi-polynomial-crc32-instruction-paper. Input data is divided into three
equal-sized blocks. Three parallel blocks (crc0, crc1, crc2
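The three-block split described above can be sketched in portable C as follows. This is only an illustration of the arithmetic, not the patch: the function names are hypothetical, the three partial CRCs are computed one after another rather than in parallel, and the recombination step feeds zero bytes through the register where the patch would use a carryless multiply by precomputed constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw bitwise CRC-32C register update (reflected polynomial 0x82F63B78). */
static uint32_t
crc32c_update(uint32_t state, const unsigned char *p, size_t len)
{
    while (len--)
    {
        state ^= *p++;
        for (int k = 0; k < 8; k++)
            state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
    }
    return state;
}

/* Advance the register as if len zero bytes followed (slow stand-in). */
static uint32_t
crc32c_shift(uint32_t state, size_t len)
{
    while (len--)
        for (int k = 0; k < 8; k++)
            state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
    return state;
}

/* Ordinary one-pass CRC-32C, for comparison. */
static uint32_t
crc32c_serial(const void *data, size_t len)
{
    assert(data != NULL || len == 0);
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}

/* Split the input into three equal blocks plus a tail, then recombine. */
static uint32_t
crc32c_3way(const void *data, size_t len)
{
    const unsigned char *p = data;
    size_t      block = len / 3;
    size_t      tail = len - 3 * block;
    uint32_t    crc0, crc1, crc2, crc;

    assert(data != NULL || len == 0);

    /* Independent partial CRCs: these are what hardware can overlap. */
    crc0 = crc32c_update(0xFFFFFFFFu, p, block);
    crc1 = crc32c_update(0, p + block, block);
    crc2 = crc32c_update(0, p + 2 * block, block);

    /* CRC is linear: shift earlier results past later blocks, then XOR. */
    crc = crc32c_shift(crc0, block) ^ crc1;
    crc = crc32c_shift(crc, block) ^ crc2;

    /* Leftover bytes that did not fit into three equal blocks. */
    crc = crc32c_update(crc, p + 3 * block, tail);
    return crc ^ 0xFFFFFFFFu;
}
```

For any input, `crc32c_3way()` returns the same value as `crc32c_serial()`; the split only pays off once the shift step is cheaper than processing the bytes, which is where the multiply constants come in.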