> This patch optimizes LongVector multiplication by inferring the VPMUL[U]DQ
> instruction for the following IR patterns.
>
>
>     MulVL (AndV SRC1, 0xFFFFFFFF)  (AndV SRC2, 0xFFFFFFFF)
>     MulVL (URShiftVL SRC1, 32)     (URShiftVL SRC2, 32)
>     MulVL (URShiftVL SRC1, 32)     (AndV SRC2, 0xFFFFFFFF)
>     MulVL (AndV SRC1, 0xFFFFFFFF)  (URShiftVL SRC2, 32)
>     MulVL (VectorCastI2X SRC1)     (VectorCastI2X SRC2)
>     MulVL (RShiftVL SRC1, 32)      (RShiftVL SRC2, 32)
>
>
> A 64x64 bit multiplication produces a 128 bit result, and can be performed by
> individually multiplying the upper and lower doublewords of the multiplier
> with the multiplicand and assembling the partial products to compute the full
> width result. Targets supporting vector quadword multiplication have separate
> instructions to compute the upper and lower quadwords of the 128 bit result.
> Therefore the existing VectorAPI multiplication operator expects shape
> conformance between source and result vectors.
>
> If the upper 32 bits of the quadword multiplier and multiplicand are always
> set to zero, then the result of the multiplication depends only on the partial
> product of their lower doublewords and can be performed using an unsigned
> 32 bit multiplication instruction with quadword saturation. The patch matches
> this pattern in a target dependent manner without introducing a new IR node.
>
> The VPMUL[U]DQ instruction performs [unsigned] multiplication between the even
> numbered doubleword lanes of two long vectors and produces a 64 bit result. It
> has much lower latency than the full 64 bit multiplication instruction
> "VPMULLQ". In addition, non-AVX512DQ targets do not support direct quadword
> multiplication, so we can save the redundant partial products for the zeroed
> out upper 32 bits. This results in throughput improvements on both P and E
> core Xeons.
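The equivalence described above can be checked with a scalar model of a single 64-bit lane. This is only a sketch: the class and method names are hypothetical and not taken from the patch, and the two `...LowDwords` methods are a plain-Java approximation of what VPMULUDQ and VPMULDQ compute per lane, not the C2 matcher logic.

```java
// Scalar model of one 64-bit vector lane. All names here are illustrative
// assumptions; none of them appear in the patch.
final class MulDqSketch {
    static final long LOW32 = 0xFFFFFFFFL;

    // Low 64 bits of a 64x64 product, assembled from 32-bit partial products.
    // The aHi * bHi term only affects bits 64 and above, so it is dropped.
    static long mulViaPartialProducts(long a, long b) {
        long aLo = a & LOW32, aHi = a >>> 32;
        long bLo = b & LOW32, bHi = b >>> 32;
        return aLo * bLo + ((aHi * bLo + aLo * bHi) << 32);
    }

    // Approximation of VPMULUDQ on one lane: zero-extend the low doublewords
    // and multiply them into a 64-bit result.
    static long mulUnsignedLowDwords(long a, long b) {
        return (a & LOW32) * (b & LOW32);
    }

    // Approximation of VPMULDQ on one lane: sign-extend the low doublewords
    // and multiply them into a 64-bit result.
    static long mulSignedLowDwords(long a, long b) {
        return (long) (int) a * (long) (int) b;
    }

    public static void main(String[] args) {
        long x = 0x123456789ABCDEF0L;
        long y = 0xFEDCBA9876543210L;

        // The partial-product form agrees with the ordinary 64-bit multiply.
        System.out.println(mulViaPartialProducts(x, y) == x * y);

        // AndV / URShiftVL operands have zero upper halves, so the full
        // multiply collapses to the unsigned low-doubleword product.
        System.out.println((x & LOW32) * (y & LOW32) == mulUnsignedLowDwords(x, y));
        System.out.println((x >>> 32) * (y >>> 32)
                == mulUnsignedLowDwords(x >>> 32, y >>> 32));

        // RShiftVL / VectorCastI2X operands are sign-extended 32-bit values,
        // so the signed low-doubleword product gives the same lane result.
        System.out.println((x >> 32) * (y >> 32)
                == mulSignedLowDwords(x >> 32, y >> 32));
    }
}
```

With zeroed upper halves, `aHi` and `bHi` are 0 and the cross terms in `mulViaPartialProducts` vanish, which is the redundant work the description says the matched patterns avoid.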
>
> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)
> included with the patch:-
>
>
> Sierra Forest :-
> ============
> Baseline:-
> Benchmark                                 (SIZE)   Mode  Cnt    Score   Error   Units
> VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt    2  806.228          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt    2  403.044          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt    2  200.641          ops/ms
> VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt    2  100.664          ops/ms
>
> With Optimization:-
> Benchmark (SIZE) Mode ...

Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:

  Review suggestions incorporated.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/21244/files
  - new: https://git.openjdk.org/jdk/pull/21244/files/43320063..84f2e04f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=21244&range=04
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=21244&range=03-04

  Stats: 44 lines in 2 files changed: 12 ins; 14 del; 18 mod
  Patch: https://git.openjdk.org/jdk/pull/21244.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/21244/head:pull/21244

PR: https://git.openjdk.org/jdk/pull/21244