On Tue, 7 Mar 2023 18:23:42 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:
> `Vector::slice` is a method at the top-level class of the Vector API that
> concatenates the 2 inputs into an intermediate composite and extracts a
> window equal to the size of the inputs into the result. It is used in vector
> conversion methods where the part number is not 0 to slice the parts to the
> correct positions. Slicing is also used in text processing such as UTF-8 and
> UTF-16 validation. x86, starting from SSSE3, has `palignr`, which does vector
> slicing very efficiently. As a result, I think it is beneficial to add a C2
> node for this operation as well as to intrinsify the `Vector::slice` method.
>
> A slice is currently implemented as
> `v2.rearrange(iota).blend(v1.rearrange(iota), blendMask)`, which requires
> preparation of the index vector and the blending mask. Even with the
> preparations being hoisted out of the loops, microbenchmarks show improvement
> using the slice intrinsics. Some have tremendous increases in throughput
> because a mask of length 2 cannot currently be intrinsified, so those cases
> fall back to the Java implementations.
>
> Please take a look and have some reviews. Thank you very much.
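
For readers who want the semantics spelled out, the slice described above can be modelled on plain arrays. This is only a scalar sketch, not the JDK implementation or the proposed C2 node. The names `SliceModel`, `slice`, `rotate` and `sliceViaBlend` are made up for illustration, and `rotate` stands in for `rearrange(iota)` on the assumption that the iota shuffle starts at the origin and wraps around the vector length.

```java
import java.util.Arrays;

public class SliceModel {
    // Direct definition: lane i of the result is lane (origin + i) of the
    // conceptual concatenation of first followed by second.
    static int[] slice(int origin, int[] first, int[] second) {
        int n = first.length;
        check(origin, n, second.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            int j = origin + i;
            result[i] = j < n ? first[j] : second[j - n];
        }
        return result;
    }

    // Scalar stand-in for rearrange with a wrapping iota shuffle (assumption):
    // lane i takes the value of lane (origin + i) mod n.
    static int[] rotate(int origin, int[] v) {
        int n = v.length;
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = v[(origin + i) % n];
        }
        return result;
    }

    // Rearrange-then-blend formulation: rotate both inputs, then pick the
    // rotated first input for lanes below (n - origin) and the rotated
    // second input for the remaining lanes.
    static int[] sliceViaBlend(int origin, int[] first, int[] second) {
        int n = first.length;
        check(origin, n, second.length);
        int[] a = rotate(origin, first);
        int[] b = rotate(origin, second);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            boolean fromFirst = i < n - origin; // plays the role of the blend mask
            result[i] = fromFirst ? a[i] : b[i];
        }
        return result;
    }

    private static void check(int origin, int n, int otherLength) {
        if (otherLength != n || origin < 0 || origin > n) {
            throw new IllegalArgumentException("mismatched lengths or origin out of range");
        }
    }

    public static void main(String[] args) {
        int[] first = {0, 1, 2, 3};
        int[] second = {4, 5, 6, 7};
        System.out.println(Arrays.toString(slice(1, first, second)));         // [1, 2, 3, 4]
        System.out.println(Arrays.toString(sliceViaBlend(1, first, second))); // [1, 2, 3, 4]
    }
}
```

Both formulations agree for every origin from 0 to the vector length. The second one needs the index vector and the mask to be prepared first, which is the overhead a dedicated slice node would avoid.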

Benchmark results:

                                                               Before                After
Benchmark                            (size)   Mode  Cnt     Score      Error     Score     Error   Units    Change
Byte128Vector.sliceBinaryConstant      1024  thrpt    5  5058.760 ± 2214.115  8315.263 ± 102.169  ops/ms   +64.37%
Byte256Vector.sliceBinaryConstant      1024  thrpt    5  6986.299 ± 1028.257  8440.387 ±  30.163  ops/ms   +20.81%
Byte64Vector.sliceBinaryConstant       1024  thrpt    5  2944.869 ±  849.548  5926.054 ± 493.146  ops/ms  +101.23%
ByteMaxVector.sliceBinaryConstant      1024  thrpt    5  7269.226 ±  366.246  8201.184 ± 309.539  ops/ms   +12.82%
Double128Vector.sliceBinaryConstant    1024  thrpt    5    10.204 ±    0.508   979.287 ±  19.991  ops/ms    x95.97
Double256Vector.sliceBinaryConstant    1024  thrpt    5   868.085 ±   26.378   967.799 ±  10.224  ops/ms   +11.49%
DoubleMaxVector.sliceBinaryConstant    1024  thrpt    5   813.646 ±   74.468   978.150 ±  14.316  ops/ms   +20.22%
Float128Vector.sliceBinaryConstant     1024  thrpt    5  1297.281 ±   23.650  1850.995 ±  29.741  ops/ms   +42.68%
Float256Vector.sliceBinaryConstant     1024  thrpt    5  1796.121 ±   26.662  2011.362 ±  38.418  ops/ms   +11.98%
Float64Vector.sliceBinaryConstant      1024  thrpt    5    10.381 ±    0.194  1628.510 ±   8.752  ops/ms   x156.87
FloatMaxVector.sliceBinaryConstant     1024  thrpt    5  1820.161 ±   26.802  1988.085 ±  41.835  ops/ms    +9.23%
Int128Vector.sliceBinaryConstant       1024  thrpt    5  1394.911 ±   40.815  1864.818 ±  33.792  ops/ms   +33.69%
Int256Vector.sliceBinaryConstant       1024  thrpt    5  1874.496 ±   60.541  1864.818 ±  33.792  ops/ms    -0.52%
Int64Vector.sliceBinaryConstant        1024  thrpt    5    10.942 ±    0.377  1621.849 ±  56.538  ops/ms   x148.22
IntMaxVector.sliceBinaryConstant       1024  thrpt    5  1870.746 ±   40.665  2027.041 ±  25.880  ops/ms    +8.35%
Long128Vector.sliceBinaryConstant      1024  thrpt    5    10.595 ±    0.306   991.969 ±  15.033  ops/ms    x93.63
Long256Vector.sliceBinaryConstant      1024  thrpt    5   815.689 ±   12.243   989.365 ±  25.969  ops/ms   +21.29%
LongMaxVector.sliceBinaryConstant      1024  thrpt    5   822.060 ±   12.337   977.061 ±  31.968  ops/ms   +18.86%
Short128Vector.sliceBinaryConstant     1024  thrpt    5  3062.676 ±  124.796  3890.796 ± 326.767  ops/ms   +27.04%
Short256Vector.sliceBinaryConstant     1024  thrpt    5  3747.778 ±  119.356  4125.463 ±  33.602  ops/ms   +10.08%
Short64Vector.sliceBinaryConstant      1024  thrpt    5  1879.203 ±   69.160  2899.515 ±  57.870  ops/ms   +54.29%
ShortMaxVector.sliceBinaryConstant     1024  thrpt    5  3717.217 ±   48.876  4035.455 ± 102.725  ops/ms    +8.56%

-------------

PR: https://git.openjdk.org/jdk/pull/12909