On Wed, 13 Nov 2024 02:43:12 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
>> This patch optimizes LongVector multiplication by inferring the VPMUL[U]DQ
>> instruction for the following IR patterns:
>>
>>     MulVL (AndV SRC1, 0xFFFFFFFF) (AndV SRC2, 0xFFFFFFFF)
>>     MulVL (URShiftVL SRC1, 32) (URShiftVL SRC2, 32)
>>     MulVL (URShiftVL SRC1, 32) (AndV SRC2, 0xFFFFFFFF)
>>     MulVL (AndV SRC1, 0xFFFFFFFF) (URShiftVL SRC2, 32)
>>     MulVL (VectorCastI2X SRC1) (VectorCastI2X SRC2)
>>     MulVL (RShiftVL SRC1, 32) (RShiftVL SRC2, 32)
>>
>> A 64x64 bit multiplication produces a 128 bit result and can be performed
>> by individually multiplying the upper and lower double words of the
>> multiplier with the multiplicand and assembling the partial products into
>> the full width result. Targets supporting vector quadword multiplication
>> have separate instructions to compute the upper and lower quadwords of the
>> 128 bit result. Therefore the existing VectorAPI multiplication operator
>> expects shape conformance between source and result vectors.
>>
>> If the upper 32 bits of the quadword multiplier and multiplicand are
>> always zero, then the result of the multiplication depends only on the
>> partial product of their lower double words and can be computed using an
>> unsigned 32 bit multiplication instruction with quadword saturation. The
>> patch matches this pattern in a target dependent manner without
>> introducing a new IR node.
>>
>> The VPMUL[U]DQ instruction performs [unsigned] multiplication between
>> even numbered doubleword lanes of two long vectors and produces a 64 bit
>> result. It has much lower latency than the full 64 bit multiplication
>> instruction VPMULLQ; in addition, non-AVX512DQ targets do not support
>> direct quadword multiplication, so we save the redundant partial products
>> for the zeroed out upper 32 bits. This results in throughput improvements
>> on both P and E core Xeons.
>>
>> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html) included with the patch:
>>
>> Sierra Forest:
>> ============
>> Baseline:
>> Benchmark                                 (SIZE)  Mode  Cnt    Score  Error  Units
>> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228        ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044        ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641        ops/ms
>> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664        ops/ms
>>
>> With Optimizati...
>
> Jatin Bhateja has refreshed the contents of this pull request, and previous
> commits have been removed. Incremental views are not available. The pull
> request now contains seven commits:
>
>  - Removing target specific hooks
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - Review resoultions
>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8341137
>  - Handle new I2L pattern, IR tests, Rewiring pattern inputs to MulVL
>    further optimizes JIT code
>  - Review resolutions
>  - 8341137: Optimize long vector multiplication using x86 VPMULUDQ instruction

Overall, looks good. Some minor refactoring suggestions follow.

src/hotspot/share/opto/vectornode.cpp line 2092:

> 2090:          n->in(1)->is_Con() &&
> 2091:          n->in(1)->bottom_type()->isa_long() &&
> 2092:          n->in(1)->bottom_type()->is_long()->get_con() <= 4294967295L;

Would it be clearer to use the `0xFFFFFFFFL` representation here?

src/hotspot/share/opto/vectornode.cpp line 2114:

> 2112:
> 2113: static bool has_vector_elements_fit_int(Node* n) {
> 2114:   auto is_cast_integer_to_long_pattern = [](const Node* n) {

I like how you use lambda expressions for node predicates. Please shape `has_vector_elements_fit_uint()` in a similar fashion.
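For reference, a standalone sketch of the scalar identity the rewrite relies on (illustrative names, not code from the patch): when both operands have their upper double word cleared, the full 64-bit multiply cannot overflow and equals the unsigned 32x32->64 product of the low words, which is what VPMULUDQ computes per lane. It also shows the two spellings of the mask constant side by side:

```java
// Sketch only: scalar model of the MulVL (AndV ..., 0xFFFFFFFF) pattern.
public class MulLowWordsSketch {
    public static void main(String[] args) {
        // The two representations of the mask discussed above are the same value.
        System.out.println(4294967295L == 0xFFFFFFFFL);  // true

        long a = 0x1234_5678_9ABC_DEF0L;
        long b = 0x0FED_CBA9_8765_4321L;

        // Zero the upper double word of each operand, then do a 64-bit multiply.
        long full = (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);

        // Unsigned 32x32->64 multiply of the low words: VPMULUDQ's per-lane result.
        long unsigned32 = Integer.toUnsignedLong((int) a)
                        * Integer.toUnsignedLong((int) b);

        System.out.println(full == unsigned32);  // true
    }
}
```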
test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 51:

> 49:
> 50:     public static final int SIZE = 1024;
> 51:     public static final Random r = new Random(1024);

For reproducibility purposes, it's better to use `jdk.test.lib.Utils.getRandomInstance()`. It reports the seed and supports overriding it.

test/hotspot/jtreg/compiler/vectorapi/VectorMultiplyOpt.java line 105:

> 103:             LongVector vsrc2 = LongVector.fromArray(LSP, lsrc2, i);
> 104:             vsrc1.lanewise(VectorOperators.AND, 0xFFFFFFFFL)
> 105:                  .lanewise(VectorOperators.MUL, vsrc2.lanewise(VectorOperators.AND, 0xFFFFFFFFL))

It would be nice to randomize the constants (masks and shifts) to improve test coverage.

-------------

PR Review: https://git.openjdk.org/jdk/pull/21244#pullrequestreview-2434942773
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841495626
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841496794
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841491826
PR Review Comment: https://git.openjdk.org/jdk/pull/21244#discussion_r1841492208
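P.S. A rough standalone sketch of the two test suggestions above combined: report a reproducible, overridable seed and randomize the shift constant. Everything here is illustrative (the `test.seed` property and class name are made up); a real jtreg test would get its `Random` from `jdk.test.lib.Utils.getRandomInstance()` instead of hand-rolling this.

```java
import java.util.Random;

// Sketch: randomized shifts with a reported seed, checked against the scalar
// model of the pattern. Not the actual VectorMultiplyOpt test.
public class RandomizedShiftSketch {
    public static void main(String[] args) {
        // Read an override property, fall back to a fixed default, and print
        // the seed so a failing run can be replayed.
        long seed = Long.getLong("test.seed", 42L);
        System.out.println("seed = " + seed);
        Random r = new Random(seed);

        for (int i = 0; i < 1000; i++) {
            // Shift in [32, 63]: the shifted operands have zero upper 32 bits,
            // so the MulVL (URShiftVL ...) pattern from the patch applies.
            int shift = 32 + r.nextInt(32);
            long a = r.nextLong();
            long b = r.nextLong();
            long expected = (a >>> shift) * (b >>> shift);
            // Unsigned 32x32->64 multiply of the low words, i.e. what
            // VPMULUDQ would compute for this lane.
            long actual = Integer.toUnsignedLong((int) (a >>> shift))
                        * Integer.toUnsignedLong((int) (b >>> shift));
            if (expected != actual) {
                throw new AssertionError("mismatch at shift=" + shift);
            }
        }
        System.out.println("ok");
    }
}
```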