On Thu, 2 Feb 2023 08:15:14 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
>> While you talked about Java side changes, I found another opportunity for >> optimization in checkIndex0 implementation, in the following code snippet >> from checkIndex0 method, indexLimit is guaranteed to be a +ve value. >> >> >> int indexLimit = Math.max(0, Math.min(length - offset, vlength)); >> VectorMask<E> badMask = >> iota.compare(GE, iota.broadcast(indexLimit)); >> >> For float/double iota vectors, subsequent broadcast operation accepting long >> argument checks for precision loss before broadcasting floating point value. >> >> >> >> long longToElementBits(long value) { >> // Do the conversion, and then test it for failure. >> float e = (float) value; >> if ((long) e != value) { >> throw badElementBits(value, e); >> } >> return toBits(e); >> }` >> >> >> This can be saved by doing a prior casting of indexLimit to floating point >> value before calling broadcast, effectively we may need to move checkIndex0 >> to template generated abstract vector classes. > > Hi, I modified a version by using the old implementation for the tail loop > instead of adding the new intrinsics. The code looks like: > > public VectorMask<E> indexInRange(int offset, int limit) { > int vlength = length(); > if (offset >= 0 && limit - offset >= length()) { > return this; > } else if (offset >= limit) { > return vectorSpecies().maskAll(false); > } > > Vector<E> iota = vectorSpecies().zero().addIndex(1); > VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength); > return this.andNot(badMask); > } > > And I tested the performance of the new added benchmarks with different > vector size on NEON/SVE and x86 avx2/avx512 architectures. The results show > that the performance of changed version is not better than the current > version, if the array size is not aligned with the vector size, especially > for the double/long type with larger size. > > Here are some raw data with NEON: > > Benchmark (size) Mode Cnt current > modified Units > IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 2654.919 > 2584.423 ops/ms > IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 1830.364 > 1802.876 ops/ms > IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 1058.548 > 1073.742 ops/ms > IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 594.920 > 593.832 ops/ms > IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 308.678 > 149.279 ops/ms > IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 160.639 > 74.579 ops/ms > IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 1097.567 > 1104.008 ops/ms > IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 617.845 > 606.886 ops/ms > IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 315.978 > 152.046 ops/ms > IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 1165.279 > 1205.486 ops/ms > IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 633.648 > 631.672 ops/ms > IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 315.370 > 154.144 ops/ms > IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 639.840 > 633.623 ops/ms > IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 312.267 > 152.788 ops/ms > IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 163.028 > 78.150 ops/ms > IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 1834.318 > 1800.318 ops/ms > IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 1105.695 > 1094.347 ops/ms > IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 602.442 > 599.827 ops/ms > > > And the data with SVE 256-bit vector size: > > Benchmark (size) Mode Cnt current > modified Units > IndexInRangeBenchmark.byteIndexInRange 131 thrpt 5 23772.370 > 22921.113 ops/ms > IndexInRangeBenchmark.byteIndexInRange 259 thrpt 5 18930.388 > 17920.910 ops/ms > IndexInRangeBenchmark.byteIndexInRange 515 thrpt 5 13528.610 > 13282.504 ops/ms > IndexInRangeBenchmark.doubleIndexInRange 131 thrpt 5 7850.522 > 7975.720 ops/ms > IndexInRangeBenchmark.doubleIndexInRange 259 thrpt 5 4281.749 > 4373.926 ops/ms > IndexInRangeBenchmark.doubleIndexInRange 515 thrpt 5 2160.001 > 604.458 ops/ms > IndexInRangeBenchmark.floatIndexInRange 131 thrpt 5 13594.943 > 13306.904 ops/ms > IndexInRangeBenchmark.floatIndexInRange 259 thrpt 5 8163.134 > 7912.343 ops/ms > IndexInRangeBenchmark.floatIndexInRange 515 thrpt 5 4335.529 > 4198.555 ops/ms > IndexInRangeBenchmark.intIndexInRange 131 thrpt 5 22106.880 > 20348.266 ops/ms > IndexInRangeBenchmark.intIndexInRange 259 thrpt 5 11711.588 > 10958.299 ops/ms > IndexInRangeBenchmark.intIndexInRange 515 thrpt 5 5501.034 > 5358.441 ops/ms > IndexInRangeBenchmark.longIndexInRange 131 thrpt 5 9832.578 > 9829.398 ops/ms > IndexInRangeBenchmark.longIndexInRange 259 thrpt 5 4979.326 > 4947.166 ops/ms > IndexInRangeBenchmark.longIndexInRange 515 thrpt 5 2269.131 > 614.204 ops/ms > IndexInRangeBenchmark.shortIndexInRange 131 thrpt 5 19865.866 > 19297.628 ops/ms > IndexInRangeBenchmark.shortIndexInRange 259 thrpt 5 14005.214 > 13592.407 ops/ms > IndexInRangeBenchmark.shortIndexInRange 515 thrpt 5 8766.450 > 8531.675 ops/ms > > As a conclusion, I prefer to keep the current version. WDYT? Yes, I agree, I did collect performance data with your benchmarks at various AVX levels and see significant gains. ------------- PR: https://git.openjdk.org/jdk/pull/12064