On Thu, 2 Feb 2023 08:15:14 GMT, Xiaohong Gong <xg...@openjdk.org> wrote:

>> While you talked about Java side changes, I found another opportunity for 
>> optimization in checkIndex0 implementation, in the following code snippet 
>> from checkIndex0 method, indexLimit is guaranteed to be a +ve value.
>> 
>> 
>>          int indexLimit = Math.max(0, Math.min(length - offset, vlength));
>>          VectorMask<E> badMask =
>>             iota.compare(GE, iota.broadcast(indexLimit)); 
>> 
>> For float/double iota vectors, subsequent broadcast operation accepting long 
>> argument checks for precision loss before broadcasting floating point value. 
>>  
>> 
>> 
>>        long longToElementBits(long value) {
>>             // Do the conversion, and then test it for failure.
>>             float e = (float) value;
>>             if ((long) e != value) {
>>                 throw badElementBits(value, e);
>>             }
>>             return toBits(e);
>>         }`
>> 
>>  
>> This can be saved by doing a prior casting of indexLimit to floating point 
>> value before calling broadcast, effectively we may need to move checkIndex0 
>> to template generated abstract vector classes.
>
> Hi, I modified a version by using the old implementation for the tail loop 
> instead of adding the new intrinsics. The code looks like:
> 
> public VectorMask<E> indexInRange(int offset, int limit) {
>         int vlength = length();
>         if (offset >= 0 && limit - offset >= length()) {
>             return this;
>         } else if (offset >= limit) {
>             return vectorSpecies().maskAll(false);
>         }
> 
>         Vector<E> iota = vectorSpecies().zero().addIndex(1);
>         VectorMask<E> badMask = checkIndex0(offset, limit, iota, vlength);
>         return this.andNot(badMask);
>     }
> 
> And I tested the performance of the new added benchmarks with different 
> vector size on NEON/SVE and x86 avx2/avx512 architectures. The results show 
> that the performance of changed version is not better than the current 
> version, if the array size is not aligned with the vector size, especially 
> for the double/long type with larger size.
> 
> Here are some raw data with NEON:
> 
> Benchmark                                 (size)   Mode  Cnt  current   
> modified   Units
> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  2654.919  
> 2584.423   ops/ms
> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  1830.364  
> 1802.876   ops/ms
> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  1058.548  
> 1073.742   ops/ms
> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   594.920   
> 593.832   ops/ms
> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   308.678   
> 149.279   ops/ms
> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   160.639    
> 74.579   ops/ms
> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  1097.567  
> 1104.008   ops/ms
> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   617.845   
> 606.886   ops/ms
> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   315.978   
> 152.046   ops/ms
> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  1165.279  
> 1205.486   ops/ms
> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5   633.648   
> 631.672   ops/ms
> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   315.370   
> 154.144   ops/ms
> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   639.840   
> 633.623   ops/ms
> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   312.267   
> 152.788   ops/ms
> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   163.028    
> 78.150   ops/ms
> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  1834.318  
> 1800.318   ops/ms
> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  1105.695  
> 1094.347   ops/ms
> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   602.442   
> 599.827   ops/ms
> 
> 
> And the data with SVE 256-bit vector size:
> 
> Benchmark                                 (size)   Mode  Cnt   current   
> modified Units
> IndexInRangeBenchmark.byteIndexInRange       131  thrpt    5  23772.370 
> 22921.113 ops/ms
> IndexInRangeBenchmark.byteIndexInRange       259  thrpt    5  18930.388 
> 17920.910 ops/ms
> IndexInRangeBenchmark.byteIndexInRange       515  thrpt    5  13528.610 
> 13282.504 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange     131  thrpt    5   7850.522  
> 7975.720 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange     259  thrpt    5   4281.749  
> 4373.926 ops/ms
> IndexInRangeBenchmark.doubleIndexInRange     515  thrpt    5   2160.001   
> 604.458 ops/ms
> IndexInRangeBenchmark.floatIndexInRange      131  thrpt    5  13594.943 
> 13306.904 ops/ms
> IndexInRangeBenchmark.floatIndexInRange      259  thrpt    5   8163.134  
> 7912.343 ops/ms
> IndexInRangeBenchmark.floatIndexInRange      515  thrpt    5   4335.529  
> 4198.555 ops/ms
> IndexInRangeBenchmark.intIndexInRange        131  thrpt    5  22106.880 
> 20348.266 ops/ms
> IndexInRangeBenchmark.intIndexInRange        259  thrpt    5  11711.588 
> 10958.299 ops/ms
> IndexInRangeBenchmark.intIndexInRange        515  thrpt    5   5501.034  
> 5358.441 ops/ms
> IndexInRangeBenchmark.longIndexInRange       131  thrpt    5   9832.578  
> 9829.398 ops/ms
> IndexInRangeBenchmark.longIndexInRange       259  thrpt    5   4979.326  
> 4947.166 ops/ms
> IndexInRangeBenchmark.longIndexInRange       515  thrpt    5   2269.131   
> 614.204 ops/ms
> IndexInRangeBenchmark.shortIndexInRange      131  thrpt    5  19865.866 
> 19297.628 ops/ms
> IndexInRangeBenchmark.shortIndexInRange      259  thrpt    5  14005.214 
> 13592.407 ops/ms
> IndexInRangeBenchmark.shortIndexInRange      515  thrpt    5   8766.450  
> 8531.675 ops/ms
> 
> As a conclusion, I prefer to keep the current version. WDYT?

Yes, I agree, I did collect performance data with your benchmarks at various 
AVX levels and see significant gains.

-------------

PR: https://git.openjdk.org/jdk/pull/12064

Reply via email to