On Sun, 29 Sep 2024 04:21:19 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:
> This patch optimizes LongVector multiplication by inferring VPMUL[U]DQ
> instructions for the following IR patterns:
>
>     MulVL (AndV      SRC1, 0xFFFFFFFF) (AndV      SRC2, 0xFFFFFFFF)
>     MulVL (URShiftVL SRC1, 32)         (URShiftVL SRC2, 32)
>     MulVL (URShiftVL SRC1, 32)         (AndV      SRC2, 0xFFFFFFFF)
>     MulVL (AndV      SRC1, 0xFFFFFFFF) (URShiftVL SRC2, 32)
>     MulVL (VectorCastI2X SRC1)         (VectorCastI2X SRC2)
>     MulVL (RShiftVL  SRC1, 32)         (RShiftVL  SRC2, 32)
>
> A 64x64-bit multiplication produces a 128-bit result. It can be performed
> by individually multiplying the upper and lower double words of the
> multiplier with the multiplicand and assembling the partial products into
> the full-width result. Targets supporting vector quadword multiplication
> have separate instructions to compute the upper and lower quadwords of the
> 128-bit result. Therefore the existing VectorAPI multiplication operator
> expects shape conformance between source and result vectors.
>
> If the upper 32 bits of the quadword multiplier and multiplicand are
> always zero, then the result of the multiplication depends only on the
> partial product of their lower double words and can be computed with a
> single unsigned 32x32 -> 64-bit multiplication instruction. The patch
> matches this pattern in a target-dependent manner without introducing a
> new IR node.
>
> The VPMUL[U]DQ instruction performs [unsigned] multiplication between the
> even-numbered doubleword lanes of two long vectors and produces 64-bit
> results. It has much lower latency than the full 64-bit multiplication
> instruction VPMULLQ; in addition, non-AVX512DQ targets do not support
> direct quadword multiplication at all, so we also save the redundant
> partial products for the zeroed-out upper 32 bits. This results in
> throughput improvements on both P-core and E-core Xeons.
>
> Please find below the performance of the [XXH3 hashing benchmark](https://mail.openjdk.org/pipermail/panama-dev/2024-July/020557.html)
> included with the patch:
>
> Sierra Forest:
> ==============
> Baseline:
>
>     Benchmark                                 (SIZE)  Mode  Cnt    Score  Error  Units
>     VectorXXH3HashingBenchmark.hashingKernel    1024  thrpt   2  806.228         ops/ms
>     VectorXXH3HashingBenchmark.hashingKernel    2048  thrpt   2  403.044         ops/ms
>     VectorXXH3HashingBenchmark.hashingKernel    4096  thrpt   2  200.641         ops/ms
>     VectorXXH3HashingBenchmark.hashingKernel    8192  thrpt   2  100.664         ops/ms
>
> With Optimization:
>
>     Benchmark                                 (SIZE)  Mode  ...
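To make the matched shape concrete, here is a minimal Java sketch (the class
and method names are illustrative, not part of the patch) of Vector API code
that yields the first quoted IR pattern, together with the scalar reasoning
for why a single unsigned 32x32 -> 64 multiply suffices:

    import jdk.incubator.vector.LongVector;

    public class MulLo32Sketch {
        // Masking both operands to their low 32 bits yields the IR shape
        //   MulVL (AndV SRC1, 0xFFFFFFFF) (AndV SRC2, 0xFFFFFFFF)
        // which, per the quoted description, can be lowered to VPMULUDQ.
        static LongVector mulLo32(LongVector a, LongVector b) {
            return a.and(0xFFFFFFFFL).mul(b.and(0xFFFFFFFFL));
        }

        // Scalar view: in the schoolbook decomposition
        //   a * b = (aHi*bHi << 64) + ((aHi*bLo + aLo*bHi) << 32) + aLo*bLo
        // zeroed upper halves (aHi == bHi == 0) make the cross and upper
        // partial products vanish, so the low partial product is exact.
        static long mulLo32(long a, long b) {
            return (a & 0xFFFFFFFFL) * (b & 0xFFFFFFFFL);
        }
    }

(Compile and run with --add-modules jdk.incubator.vector, since the Vector
API is an incubator module.)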
Another approach is to do something similar to `MacroLogicVNode`: you can
make another node and transform `MulVL` into it before matching, which is
more flexible than using match rules.

I have a similar idea, which is to group those transformations together
into a `Phase` called `PhaseLowering`. It can be used, for example, to
split `ExtractI` into the 128-bit lane extraction and the element
extraction from that lane. This allows us to do GVN on those nodes, so that
`v.lane(5) + v.lane(7)` can be compiled nicely as:

    vextracti128 xmm0, ymm1, 1
    pextrd eax, xmm0, 1
    // the second vextracti128 xmm0, ymm1, 1 here will be gvn-ed
    pextrd ecx, xmm0, 3
    add eax, ecx

Personally, I think this optimization is not essential, so we should
proceed with introducing lowering first and then add this transformation to
that phase, instead of integrating this transformation now and refactoring
it into phase lowering later, which seems like a net extra step.

The issues I have with this patch are that:

- It convolutes the graph with machine-dependent nodes early in the
  compilation process.
- It overloads `MulVL` with alternative behaviours; this is fine now, as we
  do not perform much analysis on this node, but it would be problematic
  later.

I think it is preferable to have a separate IR node for this, such as
`MulVLowIToLNode`, or to have this transformation performed only just
before matching, or both.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2407793168
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2414491182
PR Comment: https://git.openjdk.org/jdk/pull/21244#issuecomment-2421157206