Re: Popcount optimization using SVE for ARM

devanga.susmi...@fujitsu.com Fri, 06 Dec 2024 01:30:11 -0800

Hi Kirill,
This work has been conducted independently and is not connected to 
https://www.postgresql.org/message-id/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com.


Our patch uses the existing infrastructure, i.e. the 
"choose_popcount_functions" function, to select the popcount 
implementation for the architecture, and therefore requires fewer code changes. 
The patch also includes implementations of popcount32, popcount64 and masked 
popcount. We'd be happy to discuss any overlap and to collaborate so that the 
best solution is integrated.
Looking forward to your feedback!
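
To illustrate the selection mechanism mentioned above, here is a minimal 
sketch of a function pointer that starts out at a chooser and is replaced on 
first use. All names (popcount_impl, popcount_choose, popcount_fast, 
fast_path_available, my_popcount) are hypothetical and are not taken from the 
patch or from PostgreSQL; the "fast" path is only a stand-in that delegates to 
the scalar code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the patch's code. */

static uint64_t popcount_choose(const char *buf, int bytes);

/* The pointer starts at the chooser, which replaces it on first use. */
static uint64_t (*popcount_impl) (const char *buf, int bytes) = popcount_choose;

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t
popcount_scalar(const char *buf, int bytes)
{
    uint64_t    total = 0;

    for (int i = 0; i < bytes; i++)
    {
        unsigned char b = (unsigned char) buf[i];

        while (b)
        {
            b &= (unsigned char) (b - 1);
            total++;
        }
    }
    return total;
}

/* Stand-in for a vectorized version; here it just delegates. */
static uint64_t
popcount_fast(const char *buf, int bytes)
{
    return popcount_scalar(buf, bytes);
}

/* Hypothetical capability probe; a real one would query the hardware. */
static bool
fast_path_available(void)
{
    return false;
}

/* First call lands here: pick an implementation, then forward the call. */
static uint64_t
popcount_choose(const char *buf, int bytes)
{
    popcount_impl = fast_path_available() ? popcount_fast : popcount_scalar;
    return popcount_impl(buf, bytes);
}

/* Public entry point: always goes through the pointer. */
uint64_t
my_popcount(const char *buf, int bytes)
{
    return popcount_impl(buf, bytes);
}
```

Callers only ever use my_popcount(); after the first call the pointer refers 
directly to the selected implementation, so the check is paid once.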


Thanks & regards,
Susmitha Devanga.


________________________________
From: Kirill Reshke <reshkekir...@gmail.com>
Sent: Friday, December 6, 2024 12:52
To: Susmitha, Devanga <devanga.susmi...@fujitsu.com>
Cc: pgsql-hack...@postgresql.org <pgsql-hack...@postgresql.org>; Hajela, Ragesh 
<ragesh.haj...@fujitsu.com>; Bhattacharya, Chiranmoy 
<chiranmoy.bhattacha...@fujitsu.com>; M A, Rajat <rajat...@fujitsu.com>
Subject: Re: Popcount optimization using SVE for ARM



On Fri, 6 Dec 2024 at 10:54, devanga.susmi...@fujitsu.com wrote:
Hello,

This email is to discuss our contribution: a speed-up of popcount and masked 
popcount for the ARM architecture using SVE intrinsics.

The current popcount on ARM relies on compiler intrinsics or C code that 
processes data in scalar fashion, one integer at a time. With SVE intrinsics, 
popcount can process multiple integers simultaneously, depending on the vector 
length, which significantly improves performance.

We have designed this feature for compatibility and robustness. It includes 
compile-time and runtime checks for SVE support in both the compiler and the 
hardware. If either check fails, the code falls back to the existing scalar 
implementation. We also reuse the existing infrastructure for selecting between 
popcount implementations, which avoids additional complexity.
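
As a rough sketch of the two checks described above (not the patch's actual 
code): the compile-time side is modelled by a hypothetical macro, 
HYPOTHETICAL_HAVE_SVE_BUILD, that a build system would define when the 
compiler can build the SVE code, and the runtime side asks the Linux kernel via 
getauxval() whether the CPU reports SVE. The function names are assumptions, 
and popcount64_sve is only a stand-in that would hold the intrinsics in a real 
build.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Runtime check: does the CPU (as reported by the kernel) support SVE? */
bool
sve_available(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_SVE)
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
    return false;               /* not Linux/AArch64: never take the SVE path */
#endif
}

/* Existing scalar behaviour: clear the lowest set bit until zero. */
static int
popcount64_scalar(uint64_t word)
{
    int         count = 0;

    while (word)
    {
        word &= word - 1;
        count++;
    }
    return count;
}

#ifdef HYPOTHETICAL_HAVE_SVE_BUILD
/* Stand-in: a real build would implement this with SVE intrinsics. */
static int
popcount64_sve(uint64_t word)
{
    return popcount64_scalar(word);
}
#endif

typedef int (*popcount64_fn) (uint64_t);

/* Use the SVE version only if both checks pass; otherwise fall back. */
popcount64_fn
select_popcount64(void)
{
#ifdef HYPOTHETICAL_HAVE_SVE_BUILD
    if (sve_available())
        return popcount64_sve;
#endif
    return popcount64_scalar;
}
```

If either the macro is undefined or sve_available() returns false, 
select_popcount64() returns the scalar function, which is the fallback 
behaviour described above.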

Algorithm Overview:
1. For larger inputs, align the buffers to avoid double loads. For smaller 
inputs alignment is not necessary and might even degrade the performance.
2. Process the aligned buffer chunk by chunk till the last incomplete chunk.
3. Process the last incomplete chunk.
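
The three steps above can be sketched in portable C as follows. This is an 
illustration only, not the patch: an 8-byte word stands in for one SVE vector, 
and ALIGN_THRESHOLD and all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK sizeof(uint64_t)  /* stand-in for the SVE vector length */
#define ALIGN_THRESHOLD 64      /* hypothetical cutoff, not from the patch */

/* Count set bits by clearing the lowest set bit until zero. */
static int
popcount_word(uint64_t w)
{
    int         count = 0;

    while (w)
    {
        w &= w - 1;
        count++;
    }
    return count;
}

uint64_t
popcount_chunked(const char *buf, size_t bytes)
{
    const unsigned char *p = (const unsigned char *) buf;
    uint64_t    total = 0;

    /* Step 1: only for larger inputs, consume bytes until p is aligned. */
    if (bytes >= ALIGN_THRESHOLD)
    {
        while (bytes > 0 && ((uintptr_t) p % CHUNK) != 0)
        {
            total += popcount_word(*p++);
            bytes--;
        }
    }

    /* Step 2: whole chunks; memcpy keeps the load valid if p is unaligned. */
    while (bytes >= CHUNK)
    {
        uint64_t    w;

        memcpy(&w, p, CHUNK);
        total += popcount_word(w);
        p += CHUNK;
        bytes -= CHUNK;
    }

    /* Step 3: the last incomplete chunk, byte by byte. */
    while (bytes > 0)
    {
        total += popcount_word(*p++);
        bytes--;
    }
    return total;
}
```

In a vector implementation the final partial chunk would more likely be 
handled with a predicated load than with a byte loop; the byte loop here only 
keeps the sketch portable.
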

Our setup:
Machine: AWS EC2 c7g.8xlarge (32 vCPUs, 64 GB RAM)
OS: Ubuntu 22.04.5 LTS
GCC: 11.4

Benchmark and Results:
We used John Naylor's popcount-test-module [0] for benchmarking and observed a 
speed-up of more than 3x for larger buffers. Even for small inputs of 8 and 32 
bytes, we observed no performance degradation.

[two inline images omitted]

We would like to contribute this work so that it is available to the 
community. To do so, we are following the procedure described in "Submitting a 
Patch" on the PostgreSQL wiki 
<https://wiki.postgresql.org/wiki/Submitting_a_Patch>. Please find the patch 
and performance results attached.
Please let us know if you have any questions or suggestions.


Thanks & Regards,
Susmitha Devanga.

Hi! Is this patch somehow related to [0]?


[0] 
https://www.postgresql.org/message-id/010101936e4aaa70-b474ab9e-b9ce-474d-a3ba-a3dc223d295c-000000%40us-west-2.amazonses.com

--
Best regards,
Kirill Reshke
