On Tue, 11 Apr 2023 09:36:06 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:

>> Hi @merykitty, agreed that SPECIES_PREFERRED is the better choice for 
>> vector algorithms that involve both integral and floating-point vectors.
>> 
>> FTR, we now see a perf regression with a Float256-based micro on AVX=1 targets:
>> 
>> 
>>   public static short micro() {
>>      VectorShuffle<Float> iota = FloatVector.SPECIES_256.iotaShuffle(0, 1, true);
>>      return iota.cast(ShortVector.SPECIES_128).toVector().reinterpretAsShorts().lane(1);
>>   }
>> 
>> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 
>> -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . 
>> shufflef
>> CompileCommand: compileonly shufflef.micro bool compileonly = true
>>   ** not supported: arity=1 op=reinterpret/1 vlen1=8 etype1=int ismask=0
>>   ** not supported: arity=1 op=cast/1 vlen1=8 etype1=int ismask=0
>>                                     @ 17   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                     @ 24   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                     @ 45   
>> jdk.internal.vm.vector.VectorSupport::convert (36 bytes)   failed to inline 
>> (intrinsic)
>>                                   @ 34   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                   @ 54   
>> jdk.internal.vm.vector.VectorSupport::convert (36 bytes)   failed to inline 
>> (intrinsic)
>>                                     @ 17   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                     @ 24   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                     @ 45   
>> jdk.internal.vm.vector.VectorSupport::convert (36 bytes)   (intrinsic)
>>                                       @ 292   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                       @ 298   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                       @ 322   
>> jdk.internal.vm.vector.VectorSupport::convert (36 bytes)   (intrinsic)
>>                                       @ 292   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                       @ 298   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                       @ 322   
>> jdk.internal.vm.vector.VectorSupport::convert (36 bytes)   (intrinsic)
>>                                 @ 16   
>> jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   (intrinsic)
>> [time] 386ms  [res]3392
>> CPROMPT>export JAVA_HOME=/home/jatinbha/softwares/jdk-20/
>> CPROMPT>export PATH=$JAVA_HOME/bin:$PATH
>> CPROMPT>javad --add-modules=jdk.incubator.vector -XX:UseAVX=1 
>> -XX:+PrintIntrinsics -XX:CompileCommand=compileonly,shufflef::micro -cp . 
>> shufflef
>> CompileCommand: compileonly shufflef.micro bool compileonly = true
>> WARNING: Using incubator modules: jdk.incubator.vector
>>                                       @ 3   
>> jdk.internal.misc.Unsafe::loadFence (5 bytes)   (intrinsic)
>>                                         @ 3   
>> jdk.internal.misc.Unsafe::loadFence (5 bytes)   (intrinsic)
>>                                 @ 17   
>> jdk.internal.vm.vector.VectorSupport::shuffleToVector (33 bytes)   
>> (intrinsic)
>>                                       @ 292   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                       @ 298   java.lang.Object::getClass (0 
>> bytes)   (intrinsic)
>>                                       @ 322   
>> jdk.internal.vm.vector.VectorSupport::convert (36 bytes)   (intrinsic)
>>                                 @ 16   
>> jdk.internal.vm.vector.VectorSupport::extract (35 bytes)   (intrinsic)
>> [time] 7ms  [res]3392
>
> @jatin-bhateja Since `Float256Shuffle` is represented as a 256-bit int 
> vector, which AVX1 does not support, the compiled code falls back to the Java 
> implementation, which explains the regression. However, using a 
> `VectorShuffle` for anything other than `Vector::rearrange` is not really 
> useful, and the code snippet is similar to `ShortVector.SPECIES_128.iotaShuffle(0, 1, 
> true).toVector().reinterpretAsShorts().lane(1)`. As a result, I think some 
> regressions in AVX1 edge cases are acceptable given the improvement in all 
> other operations on all platforms.

Agreed. This also fixes the case of shuffle vectors smaller than 32 bits, i.e. 
shuffles involving Long128, Int64 and Float64 will benefit on x86.
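
For anyone following the thread, here is a rough plain-Java model of the lane 
arithmetic behind the micro and of the `Vector::rearrange` use case mentioned 
above. It is only a sketch: `ShuffleModel`, `iota` and `rearrange` are 
hypothetical names, the code does not use jdk.incubator.vector, and it assumes 
that `iotaShuffle(start, step, true)` wraps each index into [0, VLENGTH).

```java
import java.util.Arrays;

// Plain-Java model of the lane arithmetic only (hypothetical helper class);
// it is not the Vector API implementation and says nothing about intrinsics.
final class ShuffleModel {
    // Models iotaShuffle(start, step, true): lane i holds start + i * step,
    // wrapped into [0, vlen) (assumed wrapping behaviour).
    static int[] iota(int vlen, int start, int step) {
        int[] idx = new int[vlen];
        for (int i = 0; i < vlen; i++) {
            idx[i] = Math.floorMod(start + i * step, vlen);
        }
        return idx;
    }

    // Models v.rearrange(shuffle): result lane i is taken from source lane shuffle[i].
    static short[] rearrange(short[] v, int[] shuffle) {
        short[] r = new short[v.length];
        for (int i = 0; i < v.length; i++) {
            r[i] = v[shuffle[i]];
        }
        return r;
    }

    public static void main(String[] args) {
        // Eight lanes, as in both Float256 and Short128 shuffles.
        int[] identity = iota(8, 0, 1);
        System.out.println(Arrays.toString(identity));

        // A shuffle starting at 1 rotates the lanes left by one when used to rearrange.
        short[] v = {10, 11, 12, 13, 14, 15, 16, 17};
        System.out.println(Arrays.toString(rearrange(v, iota(8, 1, 1))));
    }
}
```

In this model the shuffle only carries lane indices, so an 8-lane iota has the 
same contents whether it was created from a Float256 or a Short128 species, 
which is why the two snippets discussed above are comparable.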

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/13093#discussion_r1163147535
