On Thu, 2 Feb 2023 03:42:34 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:
>> Your patch re-organized the code to take care of following cases :- >> 1. offset < 0 : Still falling over to old code. >> 2. offset >= limit : Special case optimized >> 3. (limit - offset) >= length : Most general case. >> 4. Partial in range check : New intrinsic, existing java side >> implementation in checkIndex0 is also composed of intrinsified calls, >> infrequently taken paths will anyways lead to uncommon traps. >> >> Have you tried introducing just case 3 (first) and then case 2 in existing >> indexInRange implementation. > >> Thanks for the review @jatin-bhateja ! >> >> > Have you tried introducing just case 3 (first) and then case 2 in existing >> > indexInRange implementation. >> >> Yes, I tried with this way as well, together with the performance testing >> compared with current version. The result is: >> >> 1. For cases that the array size is aligned with the vector size, the >> performance is almost the same with each other. >> 2. For cases that the array size is not aligned (which needs the masked >> operation), the performance with current version can improve about 5%~6% >> compared with the way as you suggested. But the improvement is limited to >> the cases with smaller vector size, whose tail loop size is bigger. >> >> So, adding the new intrinsic has no side effect for me, except for the more >> complex java side code and one additional intrinsic. > > Another benefit is the code change will be simpler, that almost all the files > except the `AbstractMask.java` do not need to be changed. I will make the > modification and rerun the benchmarks to decide whether it's necessary to add > the new intrinsics. Thanks! While you talked about Java side changes, I found another opportunity for optimization in checkIndex0 implementation, in the following code snippet from checkIndex0 method, indexLimit is guaranteed to be a +ve value. int indexLimit = Math.max(0, Math.min(length - offset, vlength)); VectorMask<E> badMask = iota.compare(GE, iota.broadcast(indexLimit)); For float/double iota vectors, subsequent broadcast operation accepting long argument checks for precision loss before broadcasting floating point value. long longToElementBits(long value) { // Do the conversion, and then test it for failure. float e = (float) value; if ((long) e != value) { throw badElementBits(value, e); } return toBits(e); }` This can be saved by doing a prior casting of indexLimit to floating point value before calling broadcast, effectively we may need to move checkIndex0 to template generated abstract vector classes. ------------- PR: https://git.openjdk.org/jdk/pull/12064