> This patch optimizes LongVector multiplication by inferring the VPMULUDQ
> instruction for the following IR patterns.
>   
> 
>        MulL   (And SRC1, 0xFFFFFFFF)   (And SRC2, 0xFFFFFFFF)
>        MulL   (URShift SRC1, 32)       (URShift SRC2, 32)
>        MulL   (URShift SRC1, 32)       (And SRC2, 0xFFFFFFFF)
>        MulL   (And SRC1, 0xFFFFFFFF)   (URShift SRC2, 32)
> 
> 
> 
> A 64x64-bit multiplication produces a 128-bit result. It can be performed by
> multiplying the upper and lower doublewords of the multiplier individually
> with the multiplicand and assembling the partial products into the
> full-width result. Targets supporting vector quadword multiplication have
> separate instructions to compute the upper and lower quadwords of the
> 128-bit result. Therefore the existing VectorAPI multiplication operator
> expects shape conformance between source and result vectors.
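
The partial-product decomposition described above can be sketched in scalar Java. This is only an illustration under stated assumptions: the class and method names (`PartialProducts`, `mulFull`) are hypothetical and are not part of the patch, and the operands are treated as unsigned 64-bit values.

```java
final class PartialProducts {
    static final long MASK32 = 0xFFFFFFFFL;

    // Unsigned 64x64 -> 128 bit multiply built from four 32x32 -> 64 bit
    // partial products. Returns {high quadword, low quadword}.
    static long[] mulFull(long a, long b) {
        long aLo = a & MASK32, aHi = a >>> 32;
        long bLo = b & MASK32, bHi = b >>> 32;

        long ll = aLo * bLo;   // contributes at bit 0
        long lh = aLo * bHi;   // contributes at bit 32
        long hl = aHi * bLo;   // contributes at bit 32
        long hh = aHi * bHi;   // contributes at bit 64

        // Sum of everything that lands in bits 32..63; at most 3 * (2^32 - 1),
        // so it cannot overflow a long.
        long mid = (ll >>> 32) + (lh & MASK32) + (hl & MASK32);

        long low = (ll & MASK32) | (mid << 32);
        long high = hh + (lh >>> 32) + (hl >>> 32) + (mid >>> 32);
        return new long[] { high, low };
    }

    public static void main(String[] args) {
        long[] r = mulFull(0x123456789ABCDEF0L, 0xFEDCBA9876543210L);
        System.out.println(Long.toHexString(r[0]) + " " + Long.toHexString(r[1]));
    }
}
```

The low quadword equals Java's ordinary wrapping `a * b`, which is why a lane-wise `long` multiply only needs the lower half of this computation.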
> 
> If the upper 32 bits of the quadword multiplier and multiplicand are always
> zero, the result of the multiplication depends only on the partial product
> of their lower doublewords, and it can be computed with an unsigned 32-bit
> multiplication instruction that produces a full quadword result. The patch
> matches this pattern in a target-dependent manner without introducing a new
> IR node.
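
As a rough scalar illustration of why this works, each of the four operand shapes listed earlier leaves both inputs with zero upper 32 bits, so the 64-bit product is exactly an unsigned 32x32 -> 64 bit product. The names below (`LowDwordMultiply`, `mulU32`, and the four pattern methods) are hypothetical and only model the arithmetic; they say nothing about how the compiler matcher is written.

```java
final class LowDwordMultiply {
    static final long MASK32 = 0xFFFFFFFFL;

    // Reference operation: unsigned 32x32 -> 64 bit multiply.
    static long mulU32(int x, int y) {
        return Integer.toUnsignedLong(x) * Integer.toUnsignedLong(y);
    }

    // Scalar forms of the four operand shapes.
    static long andAnd(long a, long b)     { return (a & MASK32) * (b & MASK32); }
    static long shiftShift(long a, long b) { return (a >>> 32) * (b >>> 32); }
    static long shiftAnd(long a, long b)   { return (a >>> 32) * (b & MASK32); }
    static long andShift(long a, long b)   { return (a & MASK32) * (b >>> 32); }

    public static void main(String[] args) {
        long a = 0x123456789ABCDEF0L, b = 0xFEDCBA9876543210L;
        int aLo = (int) a, aHi = (int) (a >>> 32);
        int bLo = (int) b, bHi = (int) (b >>> 32);
        // Each comparison checks a full 64-bit multiply against the
        // 32x32 -> 64 bit reference on the selected halves.
        System.out.println(andAnd(a, b) == mulU32(aLo, bLo));
        System.out.println(shiftShift(a, b) == mulU32(aHi, bHi));
        System.out.println(shiftAnd(a, b) == mulU32(aHi, bLo));
        System.out.println(andShift(a, b) == mulU32(aLo, bHi));
    }
}
```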
>  
> The VPMULUDQ instruction performs unsigned multiplication between the
> even-numbered doubleword lanes of two long vectors and produces 64-bit
> results. It has much lower latency than the full 64-bit multiplication
> instruction "VPMULLQ". In addition, non-AVX512DQ targets do not support
> direct quadword multiplication, so we can skip the redundant partial
> products for the zeroed-out upper 32 bits. This results in throughput
> improvements on both P-core and E-core Xeons.
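
The lane behaviour described above can be modelled in plain Java, treating each `long` array element as one 64-bit lane whose low 32 bits are the even-numbered doubleword. This is a sketch for intuition only: `VpmuludqModel`, `vpmuludq`, and `mulLanes` are hypothetical names, and the code is a scalar model rather than the intrinsic or the code the JIT emits.

```java
import java.util.Arrays;

final class VpmuludqModel {
    // Per lane: multiply the low doublewords as unsigned 32-bit values and
    // keep the full 64-bit product. The upper doublewords are ignored.
    static long[] vpmuludq(long[] src1, long[] src2) {
        long[] dst = new long[src1.length];
        for (int i = 0; i < dst.length; i++) {
            dst[i] = (src1[i] & 0xFFFFFFFFL) * (src2[i] & 0xFFFFFFFFL);
        }
        return dst;
    }

    // Per lane: low 64 bits of the full 64x64 bit product, i.e. an ordinary
    // lane-wise long multiply.
    static long[] mulLanes(long[] src1, long[] src2) {
        long[] dst = new long[src1.length];
        for (int i = 0; i < dst.length; i++) {
            dst[i] = src1[i] * src2[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        // Inputs with zero upper doublewords: both forms agree.
        long[] a = { 3L, 0xFFFFFFFFL, 0x80000000L, 7L };
        long[] b = { 5L, 0xFFFFFFFFL, 2L, 0L };
        System.out.println(Arrays.equals(vpmuludq(a, b), mulLanes(a, b)));
    }
}
```

When the upper doublewords are not known to be zero the two forms differ, which is why the substitution is only valid for the operand shapes the patch matches.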
> 
> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)
> included with the patch:-
>  
> 
> Sierra Forest :-
> ============
> Baseline:-
> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
> 
> With Optimization:-
> Benchmark                                 (SIZE)   Mode  Cnt     Score   Error   Units
> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  1299.407          ops/ms
> VectorXXH3HashingB...

Jatin Bhateja has updated the pull request with a new target base due to a 
merge or a rebase. The pull request now contains two commits:

 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
 - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

-------------

Changes: https://git.openjdk.org/jdk/pull/21244/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=21244&range=01
  Stats: 354 lines in 12 files changed: 343 ins; 0 del; 11 mod
  Patch: https://git.openjdk.org/jdk/pull/21244.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21244/head:pull/21244

PR: https://git.openjdk.org/jdk/pull/21244
