On Mon, 9 Dec 2024 14:12:19 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:
>> Hi, >> >> This is just a redo of https://github.com/openjdk/jdk/pull/13093. mostly >> just the revert of the backout. >> >> Regarding the related issues: >> >> - [JDK-8306008](https://bugs.openjdk.org/browse/JDK-8306008) and >> [JDK-8309531](https://bugs.openjdk.org/browse/JDK-8309531) have been fixed >> before the backout. >> - [JDK-8309373](https://bugs.openjdk.org/browse/JDK-8309373) was due to >> missing `ForceInline` on `AbstractVector::toBitsVectorTemplate` >> - [JDK-8306592](https://bugs.openjdk.org/browse/JDK-8306592), I have not >> been able to find the root causes. I'm not sure if this is a blocker, now I >> cannot even build x86-32 tests. >> >> Finally, I moved some implementation of public methods and methods that call >> into intrinsics to the concrete class as that may help the compiler know the >> correct types of the variables. >> >> Please take a look and leave reviews. Thanks a lot. >> >> The description of the original PR: >> >> This patch reimplements `VectorShuffle` implementations to be a vector of >> the bit type. Currently, `VectorShuffle` is stored as a byte array, and >> would be expanded upon usage. This poses several drawbacks: >> >> Inefficient conversions between a shuffle and its corresponding vector. This >> hinders the performance when the shuffle indices are not constant and are >> loaded or computed dynamically. >> Redundant expansions in `rearrange` operations. On all platforms, it seems >> that a shuffle index vector is always expanded to the correct type before >> executing the `rearrange` operations. >> Some redundant intrinsics are needed to support this handling as well as >> special considerations in the C2 compiler. >> Range checks are performed using `VectorShuffle::toVector`, which is >> inefficient for FP types since both FP conversions and FP comparisons are >> more expensive than the integral ones. >> Upon these changes, a `rearrange` can emit more efficient code: >> >> var species = IntVector.SPECIES_128; >> var v1 = IntVector.fromArray(species, SRC1, 0); >> var v2 = IntVector.fromArray(species, SRC2, 0); >> v1.rearrange(v2.toShuffle()).intoArray(DST, 0); >> >> Before: >> movabs $0x751589fa8,%r10 ; {oop([I{0x0000000751589fa8})} >> vmovdqu 0x10(%r10),%xmm2 >> movabs $0x7515a0d08,%r10 ; {oop([I{0x00000007515a0d08})} >> vmovdqu 0x10(%r10),%xmm1 >> movabs $0x75158afb8,%r10 ; {oop([I{0x000000075158afb8})} >> vmovdqu 0x10(%r10),%xmm0 >> vpand -0xddc12(%rip),%xmm0,%xmm0 # Stub::vector_int_to_byt... > > Quan Anh Mai has updated the pull request incrementally with one additional > commit since the last revision: > > adverb order Hi @merykitty , Nice work!, over all looks good to me with some mincro comments. Kindly address. src/hotspot/share/opto/library_call.hpp line 358: > 356: bool inline_vector_shuffle_to_vector(); > 357: bool inline_vector_wrap_shuffle_indexes(); > 358: bool inline_vector_shuffle_iota(); FTR, x86 ISA does not support a direct byte multiplier instruction, so we first unpack to a short vector, multiply at a short granularity, and then pack it back to byte vector. This was somewhat costly since now shuffle backing storage matches the lane size of the corresponding vector. Hence, computation with a non-unit scalar should improve. src/hotspot/share/opto/vectornode.hpp line 1691: > 1689: }; > 1690: > 1691: // The machine may not directly support the rearrange operation of an > element type. In those cases, `` Suggestion: // The target may not directly support the rearrange operation for an element type. In those cases, src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte128Vector.java line 822: > 820: static final Class<Byte> ETYPE = byte.class; // used by the JVM > 821: > 822: Byte128Shuffle(byte[] indices) { We still cannot accommodate all the indexes for the 2048 bit scalable vector for ARM SVE. Max index accommodable is 127 since byte is a signed type with value range b/w [-128 , 127]. src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Byte512Vector.java line 1046: > 1044: String msg = ("index "+si+"out of range > ["+length+"] in "+ > 1045: java.util.Arrays.toString(indices)); > 1046: throw new AssertionError(msg); Why not directly throw IndexOutOfBoundsException here? src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Double128Vector.java line 859: > 857: .reinterpretAsInts() > 858: .intoArray(a, offset); > 859: default -> { These cases for length() = 4, 8, and 16 looks redundant for 128-bit DeoubleVector. ------------- PR Review: https://git.openjdk.org/jdk/pull/21042#pullrequestreview-2491206953 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877513236 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877491672 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877459622 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877452008 PR Review Comment: https://git.openjdk.org/jdk/pull/21042#discussion_r1877466991