On Thu, 22 Aug 2024 18:21:50 GMT, Paul Sandoz <psan...@openjdk.org> wrote:

> API shapes are good!
>
> I see you intrinsified `selectFrom` which, IIUC, optimally generates C2 nodes
> that are functionally equivalent to the Java expression
> `v.rearrange(this.toShuffle())`. That way we can better generate an optimal
> set of instructions?
>
> Do you know what deficiencies there that blocks us from compiling the
> expression down to the same set of instructions as the intrinsic? Not
> suggesting we do that here, just for future reference.

Yes, I intrinsified it to generate an optimal set of instructions. In the expression `v.rearrange(this.toShuffle())`, the indexes are first partially wrapped as part of `this.toShuffle()` and then fully wrapped as part of `rearrange`. The intrinsic does only the full wrap. Without the intrinsic, if for whatever reason the JIT does not hoist `this.toShuffle()` out of the loop, we incur the additional overhead of the partial wrap in the hot code path. I saw this happen when the following was run as part of a JMH benchmark instead of being called from a standalone Java program with a loop:

    var index = ByteVector.fromArray(bspecies128, shuffles[1], 0);
    for (int j = 0; j < bspecies128.loopBound(size); j += bspecies128.length()) {
        var inpvect = ByteVector.fromArray(bspecies128, byteinp, j);
        index.selectFrom(inpvect).intoArray(byteres, j);
    }

The performance difference observed in this case between the intrinsic and non-intrinsic versions is about 20%.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20634#issuecomment-2305521441