On Tue, 11 Apr 2023 09:36:06 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:

>> Hi @merykitty , Agree with you that SPECIES_PREFERRED is preferred for
>> vector algorithms intercepting both integral and floating point vectors.
>>
>> FTR, we see a perf regression with Float256 based micro now on AVX=1 targets,
>>
>>
>> public static short micro() {
>>     VectorShuffle<Float> iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true);
>>     return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1);
>> }
>>
>> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
>> CompileCommand: compileonly shufflef.micro bool compileonly = true
>> ** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0
>> ** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0
>> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
>> @ 34 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 54 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) failed to inline (intrinsic)
>> @ 17 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 24 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 45 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
>> [time] 386ms [res]3392
>> CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/
>> CPROMPT>export PATH=$JAVA_HOME/bin:$PATH
>> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . shufflef
>> CompileCommand: compileonly shufflef.micro bool compileonly = true
>> WARNING: Using incubator modules: jdk.incubator.vector
>> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
>> @ 3 jdk.internal.misc.Unsafe::loadFence (5 bytes) (intrinsic)
>> @ 17 jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes) (intrinsic)
>> @ 292 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 298 java.lang.Object::getClass (0 bytes) (intrinsic)
>> @ 322 jdk.internal.vm.vector.VectorSupport::convert (36 bytes) (intrinsic)
>> @ 16 jdk.internal.vm.vector.VectorSupport::extract (35 bytes) (intrinsic)
>> [time] 7ms [res]3392
>
> @jatin-bhateja Since `Float256Shuffle` is represented as a 256-bit int
> vector, which is not supported by AVX1, the compiled code falls back to Java
> implementation, which explains the regression. However, having a
> `VectorShuffle` but not for `Vector::rearrange` is not really useful, and the
> code snippet is similar to `ShortVector.SPECIES_128.iotaShuffle(0, 1,
> true).toVector().reinterpretAsShorts().lane(1)`. As a result, I think having
> some regressions in edge cases of AVX1 is acceptable in contrast with the
> improvement in all other operations on all platforms.

Agreed, this also fixes the case of shuffle vectors of less than 32 bits, i.e.
shuffles involving Long128, Int64 and Float64 will benefit on x86.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1163147535
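
For readers following the thread, here is a rough scalar sketch of what the quoted `micro()` computes. It is plain Java and deliberately does not call `jdk.incubator.vector`; the class `ShuffleModel` and its helper names are hypothetical, and the wrapping rule is an assumption about how `iotaShuffle(start, step, true)` fills its lanes (lane `i` holds `start + i * step` wrapped into the valid index range). It says nothing about how the intrinsics are compiled on AVX1.

```java
// Scalar model only: hypothetical names, no Vector API calls.
public class ShuffleModel {
    // Assumed model of iotaShuffle(start, step, wrap=true):
    // lane i holds (start + i * step) wrapped into [0, lanes).
    static int[] iotaShuffle(int lanes, int start, int step) {
        int[] idx = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            idx[i] = Math.floorMod(start + i * step, lanes);
        }
        return idx;
    }

    // Models casting the shuffle to a short species with the same lane
    // count and viewing the indices as short lanes.
    static short[] toShortLanes(int[] idx) {
        short[] out = new short[idx.length];
        for (int i = 0; i < idx.length; i++) {
            out[i] = (short) idx[i];
        }
        return out;
    }

    // Scalar counterpart of the quoted micro(): a 256-bit float species
    // and a 128-bit short species both have 8 lanes.
    static short micro() {
        int[] iota = iotaShuffle(8, 0, 1);
        return toShortLanes(iota)[1];
    }

    public static void main(String[] args) {
        System.out.println(micro());
    }
}
```

In this model the element type of the original shuffle plays no role in the result, which is consistent with the remark in the thread that the snippet behaves like one starting from `ShortVector.SPECIES_128.iotaShuffle(0, 1, true)`.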