Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Vladimir Ivanov Wed, 06 Nov 2024 09:42:26 -0800

On Sun, 29 Sep 2024 04:21:19 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


> This patch optimizes LongVector multiplication by inferring VPMUL[U]DQ 
> instruction for following IR pallets.
>   
> 
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF)   ( AndV  SRC2,  0xFFFFFFFF) 
>        MulVL   (URShiftVL SRC1 , 32) (URShiftVL SRC2, 32)
>        MulVL   (URShiftVL SRC1 , 32)  ( AndV  SRC2,  0xFFFFFFFF)
>        MulVL   ( AndV  SRC1,  0xFFFFFFFF) (URShiftVL SRC2 , 32)
>        MulVL   (VectorCastI2X SRC1) (VectorCastI2X SRC2)
>        MulVL   (RShiftVL SRC1 , 32) (RShiftVL SRC2, 32)
> 
> 
> 
>  A  64x64 bit multiplication produces 128 bit result, and can be performed by 
> individually multiplying upper and lower double word of multiplier with 
> multiplicand and assembling the partial products to compute full width 
> result. Targets supporting vector quadword multiplication have separate 
> instructions to compute upper and lower quadwords for 128 bit result. 
> Therefore existing VectorAPI multiplication operator expects shape 
> conformance between source and result vectors.
> 
> If upper 32 bits of quadword multiplier and multiplicand is always set to 
> zero then result of multiplication is only dependent on the partial product 
> of their lower double words and can be performed using unsigned 32 bit 
> multiplication instruction with quadword saturation. Patch matches this 
> pattern in a target dependent manner without introducing new IR node.
>  
> VPMUL[U]DQ instruction performs [unsigned] multiplication between even 
> numbered doubleword lanes of two long vectors and produces 64 bit result.  It 
> has much lower latency compared to full 64 bit multiplication instruction 
> "VPMULLQ", in addition non-AVX512DQ targets does not support direct quadword 
> multiplication, thus we can save redundant partial product for zeroed out 
> upper 32 bits. This results into throughput improvements on both P and E core 
> Xeons.
> 
> Please find below the performance of [XXH3 hashing benchmark 
> ](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)included
>  with the patch:-
>  
> 
> Sierra Forest :-
> ============
> Baseline:-
> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error 
>   Units
> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228         
>  ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044         
>  ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641         
>  ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664         
>  ops/ms
> 
> With Optimization:-
> Benchmark                                 (SIZE)   Mode  ...

Some time ago, there was a relevant experiment to optimize vectorized Poly1305 
implementation by utilizing VPMULDQ instruction on x86 (see 
[JDK-8219881](https://bugs.openjdk.org/browse/JDK-8219881) for details). The 
implementation used int-to-long vector casts and produced the following IR 
shape: `MulVL (VectorCastI2X src1) (VectorCastI2X src2)`. Does it make sense to 
cover it as part of this particular enhancement?

IMO until C2 type system starts to track bitwise constant information 
([JDK-8001436](https://bugs.openjdk.org/browse/JDK-8001436) et al), there are 
not enough benefits to rely on IGVN here. So far, all the discussed patterns 
are simple enough for matcher to handle them without too much tweaking.

Also, I briefly looked at #21599 in the context of this particular enhancement, 
but still don't see how it can improve the situation (except input rewiring 
part) and not simply duplicate what matcher already does well.

src/hotspot/share/opto/vectornode.cpp line 2122:

> 2120:     // MulL (URShift SRC1 , 32) (URShift SRC2, 32)
> 2121:     // MulL (URShift SRC1 , 32)  ( And  SRC2,  0xFFFFFFFF)
> 2122:     // MulL ( And  SRC1,  0xFFFFFFFF) (URShift SRC2 , 32)

I don't understand how it works... According to the documentation, 
`VPMULDQ`/`VPMULUDQ` consume vectors of double words and produce a vector of 
quadwords. But it looks like `SRC1`/`SRC2` are always vectors of longs 
(quadwords). And `vmuludq_reg` in `x86.ad` just takes the immedate operands and 
pass them into `vpmuludq` which doesn't look right...

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2412582542
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2421529658
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2436531693
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1805886268

Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction

Reply via email to