Integrated: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Jatin Bhateja Thu, 21 Nov 2024 10:26:21 -0800

On Sun, 29 Sep 2024 04:21:19 GMT, Jatin Bhateja <[email protected]> wrote:


> This patch optimizes LongVector multiplication by inferring the VPMUL[U]DQ 
> instruction for the following IR patterns.
>   
> 
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF)   ( AndV  SRC2,  0xFFFFFFFF) 
>        MulVL   (URShiftVL SRC1 , 32) (URShiftVL SRC2, 32)
>        MulVL   (URShiftVL SRC1 , 32)  ( AndV  SRC2,  0xFFFFFFFF)
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF) (URShiftVL SRC2 , 32)
>        MulVL   (VectorCastI2X SRC1) (VectorCastI2X SRC2)
>        MulVL   (RShiftVL SRC1 , 32) (RShiftVL SRC2, 32)
> 
> 
> 
> A 64x64-bit multiplication produces a 128-bit result, and can be performed by 
> individually multiplying the upper and lower doublewords of the multiplier with 
> the multiplicand and assembling the partial products to compute the full-width 
> result. Targets supporting vector quadword multiplication have separate 
> instructions to compute the upper and lower quadwords of the 128-bit result. 
> Therefore, the existing VectorAPI multiplication operator expects shape 
> conformance between source and result vectors.
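
The partial-product decomposition described above can be sketched in scalar Java. This is only an illustration of the arithmetic, not code from the patch; the class and method names are made up for this note, and only the low 64 bits of the product are assembled because that is what a Java `long` multiply returns.

```java
// Scalar sketch: low 64 bits of a 64x64-bit product built from 32-bit halves.
// PartialProducts and mulLowViaPartialProducts are hypothetical names.
final class PartialProducts {
    static final long LOW_MASK = 0xFFFFFFFFL;

    static long mulLowViaPartialProducts(long a, long b) {
        long aLo = a & LOW_MASK, aHi = a >>> 32;
        long bLo = b & LOW_MASK, bHi = b >>> 32;

        long loLo = aLo * bLo;               // full 64-bit product of the low halves
        long cross = aHi * bLo + aLo * bHi;  // only its low 32 bits survive the shift
        // aHi * bHi contributes only to bits 64..127, so it is not needed here.
        return loLo + (cross << 32);
    }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEF0L, b = 0x0FEDCBA987654321L;
        // Compare against the built-in wrapping multiply.
        System.out.println(mulLowViaPartialProducts(a, b) == a * b);
    }
}
```

When both upper halves are zero, `cross` is zero and only `loLo` remains.
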
> 
> If the upper 32 bits of the quadword multiplier and multiplicand are always set to 
> zero, then the result of the multiplication depends only on the partial product 
> of their lower doublewords and can be computed using an unsigned 32-bit 
> multiplication instruction that produces a full quadword result. The patch matches 
> this pattern in a target-dependent manner without introducing a new IR node.
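
As a scalar illustration of why the listed operand shapes qualify: masking with 0xFFFFFFFF or an unsigned shift by 32 leaves a zero-extended 32-bit value, while an int-to-long cast or a signed shift by 32 leaves a sign-extended one. In both cases the 64-bit product equals a widening 32x32-bit product. The sketch below checks this; the names are hypothetical, and the mapping of the sign-extended forms to the signed VPMULDQ variant is my reading of the pattern list rather than something stated explicitly.

```java
// Scalar sketch: operands that carry only 32 bits of payload multiply the same
// way under a full 64-bit multiply and under a widening 32x32 -> 64 multiply.
// NarrowOperands and its methods are hypothetical names.
final class NarrowOperands {
    static final long LOW_MASK = 0xFFFFFFFFL;

    // Unsigned widening multiply of the low doublewords.
    static long mulUnsignedLow32(long a, long b) {
        return Integer.toUnsignedLong((int) a) * Integer.toUnsignedLong((int) b);
    }

    // Signed widening multiply of the low doublewords.
    static long mulSignedLow32(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    public static void main(String[] args) {
        long x = 0xDEADBEEFCAFEBABEL, y = 0x0123456789ABCDEFL;

        // Zero-extended operands: AndV with 0xFFFFFFFF, URShiftVL by 32.
        long m1 = x & LOW_MASK, m2 = y >>> 32;
        System.out.println(m1 * m2 == mulUnsignedLow32(m1, m2));

        // Sign-extended operands: RShiftVL by 32, int-to-long cast.
        long s1 = x >> 32, s2 = (long) (int) y;
        System.out.println(s1 * s2 == mulSignedLow32(s1, s2));
    }
}
```
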
>  
> The VPMUL[U]DQ instruction performs [unsigned] multiplication between the 
> even-numbered doubleword lanes of two long vectors and produces a 64-bit result. 
> It has much lower latency than the full 64-bit multiplication instruction 
> "VPMULLQ". In addition, non-AVX512DQ targets do not support direct quadword 
> multiplication, so there we can save the redundant partial products for the 
> zeroed-out upper 32 bits. This results in throughput improvements on both 
> P-core and E-core Xeons.
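
The lane behaviour described above can be modelled in plain Java by viewing each long vector as an array of 32-bit lanes. This is a rough model for illustration only, with hypothetical names; it ignores vector width, masking, and everything else about the real instruction encoding.

```java
import java.util.Arrays;

// Rough lane-level model: each 64-bit result lane is the widening product of
// the even-numbered 32-bit lanes of the inputs; odd-numbered lanes are ignored.
// PmulLaneModel, pmuludq and pmuldq are hypothetical names for this sketch.
final class PmulLaneModel {
    // Unsigned variant (models VPMULUDQ).
    static long[] pmuludq(int[] a, int[] b) {
        long[] r = new long[a.length / 2];
        for (int i = 0; i < r.length; i++) {
            r[i] = Integer.toUnsignedLong(a[2 * i]) * Integer.toUnsignedLong(b[2 * i]);
        }
        return r;
    }

    // Signed variant (models VPMULDQ).
    static long[] pmuldq(int[] a, int[] b) {
        long[] r = new long[a.length / 2];
        for (int i = 0; i < r.length; i++) {
            r[i] = (long) a[2 * i] * (long) b[2 * i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] a = {-1, 7, 3, 9};  // lanes 1 and 3 do not take part
        int[] b = {2, 7, 4, 9};
        System.out.println(Arrays.toString(pmuludq(a, b))); // [8589934590, 12]
        System.out.println(Arrays.toString(pmuldq(a, b)));  // [-2, 12]
    }
}
```
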
> 
> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)
> included with the patch:
>  
> 
> Sierra Forest :-
> ============
> Baseline:-
> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
> 
> With Optimization:-
> Benchmark                                 (SIZE)   Mode  ...

This pull request has now been integrated.

Changeset: dc9a6ef6
Author:    Jatin Bhateja <[email protected]>
URL:       https://git.openjdk.org/jdk/commit/dc9a6ef6100d73a431cd0cfa2c252acf7743f8a3
Stats:     544 lines in 7 files changed: 543 ins; 0 del; 1 mod

8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Co-authored-by: Vladimir Ivanov <[email protected]>
Reviewed-by: vlivanov, sviswanathan

-------------

PR: https://git.openjdk.org/jdk/pull/21244
