On Thu, 2 Feb 2023 06:38:58 GMT, Jatin Bhateja <jbhat...@openjdk.org> wrote:

>>> Thanks for the review @jatin-bhateja!
>>> 
>>> > Have you tried introducing just case 3 (first) and then case 2 in the 
>>> > existing indexInRange implementation?
>>> 
>>> Yes, I tried it this way as well, together with performance testing 
>>> compared against the current version. The result is:
>>> 
>>> 1. For cases where the array size is aligned with the vector size, the 
>>> performance is almost the same.
>>> 2. For cases where the array size is not aligned (which need the masked 
>>> operation), the current version improves performance by about 5%~6% 
>>> compared with the approach you suggested. But the improvement is limited 
>>> to cases with smaller vector sizes, whose tail loops are bigger.
>>> 
>>> So, adding the new intrinsic has no downside for me, apart from the more 
>>> complex Java-side code and one additional intrinsic.
>> 
>> Another benefit is that the code change will be simpler: almost all the 
>> files except `AbstractMask.java` do not need to be changed. I will make 
>> the modification and rerun the benchmarks to decide whether it's necessary 
>> to add the new intrinsic. Thanks!
>
> While you talked about the Java side changes, I found another opportunity for 
> optimization in the checkIndex0 implementation. In the following code snippet 
> from the checkIndex0 method, indexLimit is guaranteed to be a positive value.
> 
> 
>          int indexLimit = Math.max(0, Math.min(length - offset, vlength));
>          VectorMask<E> badMask =
>             iota.compare(GE, iota.broadcast(indexLimit)); 
> 
> For float/double iota vectors, the subsequent broadcast operation accepting a 
> long argument checks for precision loss before broadcasting the floating-point 
> value.  
> 
> 
>        long longToElementBits(long value) {
>             // Do the conversion, and then test it for failure.
>             float e = (float) value;
>             if ((long) e != value) {
>                 throw badElementBits(value, e);
>             }
>             return toBits(e);
>         }
> 
>  
> This can be avoided by casting indexLimit to a floating-point value before 
> calling broadcast; effectively, we may need to move checkIndex0 into the 
> template-generated abstract vector classes.

Hi, I made a modified version that uses the old implementation for the tail loop 
instead of adding the new intrinsics. The code looks like:

public VectorMask<E> indexInRange(int offset, int limit) {
    int vlength = length();
    if (offset >= 0 && limit - offset >= vlength) {
        // Fast path: every lane is in range, so the mask is unchanged.
        return this;
    } else if (offset >= limit) {
        // Every lane is out of range.
        return vectorSpecies().maskAll(false);
    }

    // Tail case: compute the out-of-range lane mask and clear those lanes.
    Vector<E> iota = vectorSpecies().zero().addIndex(1);
    VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
    return this.andNot(badMask);
}
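
For reference, this is the kind of masked tail loop that indexInRange is meant 
to serve. The snippet below is only an illustrative sketch (the float[] a and 
the add operation are assumptions for the example), not code from the PR:

    // Sketch only: assumes import jdk.incubator.vector.*; and a float[] a.
    var species = FloatVector.SPECIES_PREFERRED;
    for (int i = 0; i < a.length; i += species.length()) {
        // In-range lanes stay set; lanes past a.length are cleared for the tail.
        VectorMask<Float> m = species.maskAll(true).indexInRange(i, a.length);
        FloatVector v = FloatVector.fromArray(species, a, i, m);
        v.add(1.0f).intoArray(a, i, m);
    }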

And I tested the performance of the newly added benchmarks with different vector 
sizes on NEON/SVE and x86 AVX2/AVX-512 architectures. The results show that the 
performance of the changed version is not better than the current version when 
the array size is not aligned with the vector size, especially for the 
double/long types with larger sizes.
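
The actual benchmark code is part of the PR; roughly, each invocation walks an 
array of the given size in vector-sized steps and builds the index-in-range mask 
at every step, along the lines of the sketch below. The class, field, and setup 
details here are illustrative only (not the PR's code) and assume JMH plus the 
jdk.incubator.vector module:

    // Illustrative sketch; assumes org.openjdk.jmh.annotations.*,
    // java.util.concurrent.TimeUnit and jdk.incubator.vector.* imports.
    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MILLISECONDS)
    @State(Scope.Thread)
    public class IndexInRangeSketch {
        static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

        // Sizes are deliberately not multiples of the vector length,
        // so the last iteration exercises the masked tail.
        @Param({"131", "259", "515"})
        int size;

        double[] a, r;

        @Setup
        public void setup() {
            a = new double[size];
            r = new double[size];
        }

        @Benchmark
        public void doubleIndexInRange() {
            for (int i = 0; i < size; i += SPECIES.length()) {
                var m = SPECIES.indexInRange(i, size);
                DoubleVector.fromArray(SPECIES, a, i, m).intoArray(r, i, m);
            }
        }
    }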

Here are some raw data with NEON:

Benchmark                                 (size)   Mode  Cnt   current   modified   Units
IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  2654.919   2584.423   ops/ms
IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  1830.364   1802.876   ops/ms
IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  1058.548   1073.742   ops/ms
IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   594.920    593.832   ops/ms
IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   308.678    149.279   ops/ms
IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   160.639     74.579   ops/ms
IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  1097.567   1104.008   ops/ms
IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   617.845    606.886   ops/ms
IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   315.978    152.046   ops/ms
IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  1165.279   1205.486   ops/ms
IndexInRangeBenchmark.intIndexInRange        259  thrpt    5   633.648    631.672   ops/ms
IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   315.370    154.144   ops/ms
IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   639.840    633.623   ops/ms
IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   312.267    152.788   ops/ms
IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   163.028     78.150   ops/ms
IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  1834.318   1800.318   ops/ms
IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  1105.695   1094.347   ops/ms
IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   602.442    599.827   ops/ms


And the data on SVE with a 256-bit vector size:

Benchmark                                 (size)   Mode  Cnt    current    modified   Units
IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  23772.370   22921.113  ops/ms
IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  18930.388   17920.910  ops/ms
IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  13528.610   13282.504  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   7850.522    7975.720  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   4281.749    4373.926  ops/ms
IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   2160.001     604.458  ops/ms
IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  13594.943   13306.904  ops/ms
IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   8163.134    7912.343  ops/ms
IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   4335.529    4198.555  ops/ms
IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  22106.880   20348.266  ops/ms
IndexInRangeBenchmark.intIndexInRange        259  thrpt    5  11711.588   10958.299  ops/ms
IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   5501.034    5358.441  ops/ms
IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   9832.578    9829.398  ops/ms
IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   4979.326    4947.166  ops/ms
IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   2269.131     614.204  ops/ms
IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  19865.866   19297.628  ops/ms
IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  14005.214   13592.407  ops/ms
IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   8766.450    8531.675  ops/ms

In conclusion, I prefer to keep the current version. WDYT?

-------------

PR: https://git.openjdk.org/jdk/pull/12064
