Hi,

Comparing the current SSE4.2 implementation of the CRC32C algorithm in
Postgres to an optimized AVX-512 algorithm [0], we observed significant gains:
a ~6.6X average performance multiplier for buffers of 256 bytes or more,
measured on 3 different Intel products. Details below. The AVX-512 algorithm
in C is a port of the ISA-L library [1] assembler code.

Workload call size distribution details (write heavy):
   * Average was approximately 1,010 bytes per call
   * ~80% of the calls were under 256 bytes
   * ~20% of the calls were between 256 bytes and the max buffer size of
     8192 bytes

The 256-byte threshold is important because the AVX-512 algorithm needs a
minimum of 256 bytes to operate. For smaller buffers it makes sense to fall
back to the existing implementation.
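
The fallback described above can be sketched as a simple length check. This is
only an illustration, not the proposed patch: the names crc32c_dispatch,
crc32c_sse42_stub, crc32c_avx512_stub and CRC32C_AVX512_MIN_LEN are
hypothetical, and both "implementations" are stand-ins that call a portable
bitwise CRC32C so the sketch compiles without SSE4.2 or AVX-512 support.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cutoff: the AVX-512 routine needs at least 256 bytes. */
#define CRC32C_AVX512_MIN_LEN 256

/*
 * Portable bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Stand-in for both real code paths; slow, but correct on any machine.
 */
static uint32_t
crc32c_bitwise(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Placeholder for the existing SSE4.2 path (assumption: same signature). */
static uint32_t
crc32c_sse42_stub(uint32_t crc, const void *data, size_t len)
{
    return crc32c_bitwise(crc, data, len);
}

/* Placeholder for the AVX-512 path; it must never see a short buffer. */
static uint32_t
crc32c_avx512_stub(uint32_t crc, const void *data, size_t len)
{
    assert(len >= CRC32C_AVX512_MIN_LEN);
    return crc32c_bitwise(crc, data, len);
}

/* Short buffers take the existing path, everything else the AVX-512 path. */
static uint32_t
crc32c_dispatch(uint32_t crc, const void *data, size_t len)
{
    if (len < CRC32C_AVX512_MIN_LEN)
        return crc32c_sse42_stub(crc, data, len);
    return crc32c_avx512_stub(crc, data, len);
}

/* Convenience wrapper applying the usual 0xFFFFFFFF init and final XOR. */
static uint32_t
crc32c_buffer(const void *data, size_t len)
{
    return crc32c_dispatch(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

For example, crc32c_buffer("123456789", 9) takes the short-buffer path and
returns the standard CRC32C check value 0xE3069283. The cost of the length
test itself is part of what a real implementation would have to measure for
the small-buffer case.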

Varying the share of calls under 256 bytes in the above workload:
at 0%    calls < 256 bytes, an 841% improvement on average for crc32c
functionality was observed.
at 50%   calls < 256 bytes, a 758% improvement on average for crc32c
functionality was observed.
at 90%   calls < 256 bytes, a 44% improvement on average for crc32c
functionality was observed.
at 97.6% calls < 256 bytes, the workload's crc32c performance breaks even.
at 100%  calls < 256 bytes, a 14% regression is seen when using the AVX-512
implementation.

The results above are averages over 3 machines, and were measured on: Intel
Sapphire Rapids bare metal, and using EC2 on AWS cloud: Intel Sapphire Rapids
(m7i.2xlarge) and Intel Ice Lake (m6i.2xlarge).

Summary Data (Sapphire Rapids bare metal, AWS m6i-2xl, and AWS m7i-2xl):
+---------------------+-------------------+-------------------+-------------------+--------------------+
| Rates in Bytes/us   |     Bare Metal    |    AWS m6i-2xl    |    AWS m7i-2xl    |                    |
| (Larger is Better)  +---------+---------+---------+---------+---------+---------+ Overall Multiplier |
|                     | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 | SSE 4.2 | AVX-512 |                    |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 256-8192    |  12,046 |  83,196 |   7,471 |  39,965 |  11,867 |  84,589 |        6.62        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
| Numbers 64 - 255    |  16,865 |  15,909 |   9,209 |   7,363 |  12,496 |  10,046 |        0.86        |
+---------------------+---------+---------+---------+---------+---------+---------+--------------------+
                                                    |  Weighted Multiplier [*]    |        1.44        |
                                                    +-----------------------------+--------------------+
The perf data showed no evidence of AVX-512 frequency throttling; the clock
frequency stayed steady during the test.
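
The weighted multiplier in the table can be checked with a little arithmetic.
The sketch below assumes a simple linear blend of the two overall multipliers
(0.86 for calls under 256 bytes, 6.62 for calls of 256 bytes or more), weighted
by the share of calls in each class. The function names are hypothetical. This
model reproduces the 90%, break-even and 100% figures quoted earlier; it does
not reproduce the 0% and 50% figures, which appear to have been averaged
differently.

```c
#include <assert.h>

/*
 * Blended multiplier when a fraction small_frac of calls is under 256 bytes.
 * small_mult and large_mult are the per-class multipliers (AVX-512 / SSE4.2).
 */
static double
blended_multiplier(double small_frac, double small_mult, double large_mult)
{
    assert(small_frac >= 0.0 && small_frac <= 1.0);
    return small_frac * small_mult + (1.0 - small_frac) * large_mult;
}

/*
 * Fraction of small calls at which the blended multiplier equals 1.0,
 * obtained by solving p * small + (1 - p) * large = 1 for p.
 */
static double
break_even_small_frac(double small_mult, double large_mult)
{
    assert(large_mult != small_mult);
    return (large_mult - 1.0) / (large_mult - small_mult);
}
```

With the table's values, blended_multiplier(0.90, 0.86, 6.62) is
0.9 * 0.86 + 0.1 * 6.62 = 1.436, which rounds to the 1.44 shown, and
break_even_small_frac(0.86, 6.62) is 5.62 / 5.76, or about 97.6% of calls.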

Feedback on this proposed improvement is appreciated. Some questions: 
1) This AVX-512 ISA-L derived code uses the BSD 3-Clause license [2]. Is this
compatible with the PostgreSQL License [3]? Both appear to be very permissive
licenses, but I am not an expert on licenses.
2) Is there a preferred benchmark I should run to test this change? 

If licensing is a non-issue, I can post the initial patch along with my 
Postgres benchmark function patch for further review.

Thanks,
Paul

[0] https://www.researchgate.net/publication/263424619_Fast_CRC_computation#full-text
[1] https://github.com/intel/isa-l
[2] https://opensource.org/license/bsd-3-clause
[3] https://opensource.org/license/postgresql
                      
[*] Weights used were 90% of requests less than 256 bytes, 10% greater than or 
equal to 256 bytes.

