On Thu, 19 Jan 2023 03:37:33 GMT, Quan Anh Mai <qa...@openjdk.org> wrote:
>> The Vector API `indexInRange(int offset, int limit)` method is used
>> to compute a vector mask whose lanes are set to true if the index of
>> the lane is inside the range specified by the `offset` and `limit`
>> arguments; otherwise the lanes are set to false.
>>
>> There are two special cases for this API:
>> 1) If `offset >= 0 && offset >= limit`, all the lanes of the
>>    generated mask are false.
>> 2) If `offset >= 0 && limit - offset >= vlength`, all the lanes of
>>    the generated mask are true. Note that `vlength` is the number
>>    of vector lanes.
>>
>> For such special cases, we can simply use `maskAll(false|true)` to
>> implement the API. Otherwise, the original comparison with the
>> `iota` vector is needed. As a further optimization, SVE provides a
>> dedicated instruction (i.e. `whilelo` [1]) which can implement the
>> API directly if `offset >= 0`.
>>
>> In summary, to optimize the API, we can use if-else branches to
>> handle the special cases at the Java level and intrinsify the
>> remaining case in the C2 compiler:
>>
>>     public VectorMask<E> indexInRange(int offset, int limit) {
>>         if (offset < 0) {
>>             return this.and(indexInRange0Helper(offset, limit));
>>         } else if (offset >= limit) {
>>             return this.and(vectorSpecies().maskAll(false));
>>         } else if (limit - offset >= length()) {
>>             return this.and(vectorSpecies().maskAll(true));
>>         }
>>         return this.and(indexInRange0(offset, limit));
>>     }
>>
>> The last part (i.e. `indexInRange0`) in the above implementation is
>> expected to be intrinsified by the C2 compiler if the necessary IR
>> nodes are supported. Otherwise, it falls back to the original API
>> implementation (i.e. `indexInRange0Helper`). Regarding the
>> intrinsification, the compiler generates a `VectorMaskGen` IR node
>> with `limit - offset` as its input if the current platform supports
>> it. Otherwise, it generates `VectorLoadConst + VectorMaskCmp` based
>> on `iota < limit - offset`.
>>
>> For the following Java code which uses `indexInRange`:
>>
>>     static final VectorSpecies<Double> SPECIES =
>>             DoubleVector.SPECIES_PREFERRED;
>>     static final int LENGTH = 1027;
>>
>>     public static double[] da;
>>     public static double[] db;
>>     public static double[] dc;
>>
>>     private static void func() {
>>         for (int i = 0; i < LENGTH; i += SPECIES.length()) {
>>             var m = SPECIES.indexInRange(i, LENGTH);
>>             var av = DoubleVector.fromArray(SPECIES, da, i, m);
>>             av.lanewise(VectorOperators.NEG).intoArray(dc, i, m);
>>         }
>>     }
>>
>> the core code generated with a 256-bit SVE vector size is:
>>
>>     ptrue   p2.d                    ; maskAll(true)
>>     ...
>>     LOOP:
>>     ...
>>     sub     w11, w13, w14           ; limit - offset
>>     cmp     w14, w13
>>     b.cs    LABEL-1                 ; if (offset >= limit) => uncommon-trap
>>     cmp     w11, #0x4
>>     b.lt    LABEL-2                 ; if (limit - offset < vlength)
>>     mov     p1.b, p2.b
>>   LABEL-3:
>>     ld1d    {z16.d}, p1/z, [x10]    ; load vector masked
>>     ...
>>     cmp     w14, w29
>>     b.cc    LOOP
>>     ...
>>   LABEL-2:
>>     whilelo p1.d, x16, x10          ; VectorMaskGen
>>     ...
>>     b       LABEL-3
>>     ...
>>   LABEL-1:
>>     uncommon-trap
>>
>> Please note that if the array size `LENGTH` is aligned with the
>> 256-bit vector size (i.e. `LENGTH = 1024`), the branch at "LABEL-2"
>> is optimized out by the compiler and becomes another uncommon-trap.
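To make the iota-based fallback concrete, here is a minimal sketch of the
comparison it performs, written against the public Vector API rather than the
JDK-internal `indexInRange0Helper`; the method name and the `offset >= 0`
assumption here are mine:

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    // Sketch only, assuming offset >= 0: lane N of the result is set
    // iff N < limit - offset, i.e. "iota < limit - offset".
    static VectorMask<Integer> iotaCompareSketch(VectorSpecies<Integer> species,
                                                 int offset, int limit) {
        // Build the "iota" vector {0, 1, ..., vlength-1} by adding each
        // lane's own index (scaled by 1) to the all-zero vector.
        IntVector iota = (IntVector) species.zero().addIndex(1);
        // Broadcast compare; this is the step C2 lowers to
        // VectorLoadConst + VectorMaskCmp when VectorMaskGen is unsupported.
        return iota.compare(VectorOperators.LT, limit - offset);
    }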
>> For NEON, the main CFG is the same as above, but the compiler
>> intrinsification is different. Here is the code:
>>
>>     sub   x10, x10, x12          ; limit - offset
>>     scvtf d16, x10
>>     dup   v16.2d, v16.d[0]       ; replicateD
>>
>>     mov   x8, #0xd8d0
>>     movk  x8, #0x84cb, lsl #16
>>     movk  x8, #0xffff, lsl #32
>>     ldr   q17, [x8], #0          ; load the "iota" const vector
>>     fcmgt v18.2d, v16.2d, v17.2d ; mask = iota < limit - offset
>>
>> Here is the performance data of the newly added benchmark on an ARM
>> SVE 256-bit platform:
>>
>>     Benchmark                                 (size)     Before      After  Units
>>     IndexInRangeBenchmark.byteIndexInRange      1024  11203.697  41404.431  ops/ms
>>     IndexInRangeBenchmark.byteIndexInRange      1027   2365.920   8747.004  ops/ms
>>     IndexInRangeBenchmark.doubleIndexInRange    1024   1227.505   6092.194  ops/ms
>>     IndexInRangeBenchmark.doubleIndexInRange    1027    351.215   1156.683  ops/ms
>>     IndexInRangeBenchmark.floatIndexInRange     1024   1468.876  11032.580  ops/ms
>>     IndexInRangeBenchmark.floatIndexInRange     1027    699.645   2439.671  ops/ms
>>     IndexInRangeBenchmark.intIndexInRange       1024   2842.187  11903.544  ops/ms
>>     IndexInRangeBenchmark.intIndexInRange       1027    689.866   2547.424  ops/ms
>>     IndexInRangeBenchmark.longIndexInRange      1024   1394.135   5902.973  ops/ms
>>     IndexInRangeBenchmark.longIndexInRange      1027    355.621   1189.458  ops/ms
>>     IndexInRangeBenchmark.shortIndexInRange     1024   5521.468  21578.340  ops/ms
>>     IndexInRangeBenchmark.shortIndexInRange     1027   1264.816   4640.504  ops/ms
>>
>> And the performance data with ARM NEON:
>>
>>     Benchmark                                 (size)    Before      After  Units
>>     IndexInRangeBenchmark.byteIndexInRange      1024  4026.548  15562.880  ops/ms
>>     IndexInRangeBenchmark.byteIndexInRange      1027   305.314    576.559  ops/ms
>>     IndexInRangeBenchmark.doubleIndexInRange    1024   289.224   2244.080  ops/ms
>>     IndexInRangeBenchmark.doubleIndexInRange    1027    39.740     76.499  ops/ms
>>     IndexInRangeBenchmark.floatIndexInRange     1024   675.264   4457.470  ops/ms
>>     IndexInRangeBenchmark.floatIndexInRange     1027    79.918    144.952  ops/ms
>>     IndexInRangeBenchmark.intIndexInRange       1024   740.139   4014.583  ops/ms
>>     IndexInRangeBenchmark.intIndexInRange       1027    78.608    147.903  ops/ms
>>     IndexInRangeBenchmark.longIndexInRange      1024   400.683   2209.551  ops/ms
>>     IndexInRangeBenchmark.longIndexInRange      1027    41.146     69.599  ops/ms
>>     IndexInRangeBenchmark.shortIndexInRange     1024  1821.736   8153.546  ops/ms
>>     IndexInRangeBenchmark.shortIndexInRange     1027   158.810    243.205  ops/ms
>>
>> The performance improves by about 3.5x ~ 7.5x on the vector-size-aligned
>> (size 1024) benchmarks with both NEON and SVE, and by about 3.5x/1.8x on
>> the non-aligned (size 1027) benchmarks with SVE/NEON respectively. We can
>> also observe similar improvements on x86 platforms.
>>
>> [1] https://developer.arm.com/documentation/ddi0596/2020-12/SVE-Instructions/WHILELO--While-incrementing-unsigned-scalar-lower-than-scalar-
>
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/AbstractMask.java line 219:
>
>> 217:     @ForceInline
>> 218:     public VectorMask<E> indexInRange(int offset, int limit) {
>> 219:         if (offset < 0) {
>
> These fast paths penalise every usage of `VectorMask::indexInRange`,
> especially the common use case of tail-processing an array. So I don't
> think they are needed; the user can implement them themselves if their
> use cases find it beneficial.

Thanks. This `offset < 0` path can be optimized out by the compiler if
`offset >= 0` holds, as in the common cases (i.e. a normal loop with no
tail loop).

-------------

PR: https://git.openjdk.org/jdk/pull/12064
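For reference, the semantics that both the fast paths and the fallback must
preserve can be written as a short scalar sketch (the helper below is
illustrative only, not JDK code): lane N of the result stays set iff the
adjusted index `offset + N` falls inside `[0, limit)`, which is exactly why a
dedicated `offset < 0` path exists at all.

    // Illustrative scalar model of VectorMask::indexInRange (not JDK code):
    // starting from the receiver mask m, lane N stays set iff the adjusted
    // index offset + N is inside [0, limit).
    static boolean[] indexInRangeModel(boolean[] m, int offset, int limit) {
        boolean[] out = new boolean[m.length];
        for (int n = 0; n < m.length; n++) {
            long idx = (long) offset + n;   // widen to avoid int overflow
            out[n] = m[n] && idx >= 0 && idx < limit;
        }
        return out;
    }

When `offset` comes from a loop induction variable that starts at 0, as in
`func()` above, the `idx >= 0` half is provable at compile time, so the
`offset < 0` branch folds away.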