Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v4]

Vladimir Ivanov Wed, 13 Nov 2024 19:06:05 -0800

On Wed, 13 Nov 2024 02:43:12 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:


>> This patch optimizes LongVector multiplication by inferring VPMUL[U]DQ 
>> instruction for following IR pallets.
>>   
>> 
>>        MulVL   ( AndV  SRC1,  0xFFFFFFFF)   ( AndV  SRC2,  0xFFFFFFFF) 
>>        MulVL   (URShiftVL SRC1 , 32) (URShiftVL SRC2, 32)
>>        MulVL   (URShiftVL SRC1 , 32)  ( AndV  SRC2,  0xFFFFFFFF)
>>        MulVL   ( AndV  SRC1,  0xFFFFFFFF) (URShiftVL SRC2 , 32)
>>        MulVL   (VectorCastI2X SRC1) (VectorCastI2X SRC2)
>>        MulVL   (RShiftVL SRC1 , 32) (RShiftVL SRC2, 32)
>> 
>> 
>> 
>>  A  64x64 bit multiplication produces 128 bit result, and can be performed 
>> by individually multiplying upper and lower double word of multiplier with 
>> multiplicand and assembling the partial products to compute full width 
>> result. Targets supporting vector quadword multiplication have separate 
>> instructions to compute upper and lower quadwords for 128 bit result. 
>> Therefore existing VectorAPI multiplication operator expects shape 
>> conformance between source and result vectors.
>> 
>> If upper 32 bits of quadword multiplier and multiplicand is always set to 
>> zero then result of multiplication is only dependent on the partial product 
>> of their lower double words and can be performed using unsigned 32 bit 
>> multiplication instruction with quadword saturation. Patch matches this 
>> pattern in a target dependent manner without introducing new IR node.
>>  
>> VPMUL[U]DQ instruction performs [unsigned] multiplication between even 
>> numbered doubleword lanes of two long vectors and produces 64 bit result.  
>> It has much lower latency compared to full 64 bit multiplication instruction 
>> "VPMULLQ", in addition non-AVX512DQ targets does not support direct quadword 
>> multiplication, thus we can save redundant partial product for zeroed out 
>> upper 32 bits. This results into throughput improvements on both P and E 
>> core Xeons.
>> 
>> Please find below the performance of [XXH3 hashing benchmark 
>> ](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)included
>>  with the patch:-
>>  
>> 
>> Sierra Forest :-
>> ============
>> Baseline:-
>> Benchmark                                 (SIZE)   Mode  Cnt    Score   
>> Error   Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228        
>>   ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044        
>>   ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641        
>>   ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664        
>>   ops/ms
>> 
>> With Optimizati...
>
> Jatin Bhateja has refreshed the contents of this pull request, and previous 
> commits have been removed. Incremental views are not available. The pull 
> request now contains seven commits:
> 
>  - Removing target specific hooks
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - Review resoultions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - Handle new I2L pattern, IR tests, Rewiring pattern inputs to MulVL further 
> optimizes JIT code
>  - Review resolutions
>  - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

Overall, looks good.

Some minor refactoring suggestions follow.

src/hotspot/share/opto/vectornode.cpp line 2092:

> 2090:          n->in(1)->is_Con() &&
> 2091:          n->in(1)->bottom_type()->isa_long() &&
> 2092:          n->in(1)->bottom_type()->is_long()->get_con() <= 4294967295L;

Is it clearer to use `0xFFFFFFFFL` representation here?

src/hotspot/share/opto/vectornode.cpp line 2114:

> 2112: 
> 2113: static bool has_vector_elements_fit_int(Node* n) {
> 2114:   auto is_cast_integer_to_long_pattern = [](const Node* n) {

I like how you use lambda expressions for node predicates. 
Please, shape `has_vector_elements_fit_uint()` in a similar fashion.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 51:

> 49: 
> 50:     public static final int SIZE = 1024;
> 51:     public static final Random r = new Random(1024);

For reproducibility purposes, it's better to use 
`jdk.test.lib.Utils.getRandomInstance()`. It reports the seed and supports 
overriding it.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 105:

> 103:             LongVector vsrc2 = LongVector.fromArray(LSP, lsrc2, i);
> 104:             vsrc1.lanewise(VectorOperators.AND, 0xFFFFFFFFL)
> 105:                  .lanewise(VectorOperators.MUL, 
> vsrc2.lanewise(VectorOperators.AND, 0xFFFFFFFFL))

It would be nice to randomize the constants (masks and shifts) to improve test 
coverage.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2434942773
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841495626
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841496794
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841491826
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841492208

Re: RFR: 8341137: Optimize long vector multiplication using x86 VPMUL[U]DQ instruction [v4]

Reply via email to