On Mon, 6 Feb 2023 17:39:42 GMT, Paul Sandoz <psan...@openjdk.org> wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional 
>> commit since the last revision:
>> 
>>   Add smaller array size for benchmark tests
>
> I think it would be useful to adjust the naming and comments of some methods 
> to make it clearer the method parameter constraints.
> 
> `indexInRange0Helper` is now called if the index is partially or totally out 
> of range at the lower or upper ends and `indexInRange0` is called if 
> partially or totally out of range at the upper end.
> e.g. a more literal naming could be:
> `AbstractMask::indexInRange0Helper` -> 
> `AbstractMask::indexPartiallyInRangeHelper`
> `VectorSupport::indexInRange` -> VectorSupport::indexPartiallyInUpperRange`
> ?
> 
> IIUC the performance numbers show that when the array is not a multiple of 
> the vector size there is still quite an impact overall to calling 
> `VectorSupport::indexInRange` for the last loop iteration. I am guessing the 
> overall loop shape is different which impacts other optimizations?
> 
> To do this more optimally likely requires a loop transformation where the 
> last loop iteration is peeled off, but that's a harder transformation in one 
> of the more complicated areas of C2 (although it already supports pre/post 
> loop, so maybe its possible to leverage that?).

Thanks for looking at this PR @PaulSandoz !

> I think it would be useful to adjust the naming and comments of some methods 
> to make it clearer the method parameter constraints.
> 
> `indexInRange0Helper` is now called if the index is partially or totally out 
> of range at the lower or upper ends and `indexInRange0` is called if 
> partially or totally out of range at the upper end. e.g. a more literal 
> naming could be: `AbstractMask::indexInRange0Helper` -> 
> `AbstractMask::indexPartiallyInRangeHelper` `VectorSupport::indexInRange` -> 
> VectorSupport::indexPartiallyInUpperRange` ?

The renaming looks good to me. Thanks!

> IIUC the performance numbers show that when the array is not a multiple of 
> the vector size there is still quite an impact overall to calling 
> `VectorSupport::indexInRange` for the last loop iteration. I am guessing the 
> overall loop shape is different which impacts other optimizations?

I think the main influence of the benchmark result comes from the masked ` 
fromArray()/intoArray()` APIs, especially the masked intoArray() API. For the 
tail loop part, there is the vector boxing needed on all architectures, with 
the following reasons:

- If the architecture doesn't support predicate feature, it cannot be 
intrinsified.
- The `checkMaskFromIndexSize` called inside the `else->if` branch may not be 
inlined, and the `indexInRange()` generated mask `m` needs the boxing before it.


public final
    void intoArray(double[] a, int offset,
                   VectorMask<Double> m) {
        if (m.allTrue()) {
            intoArray(a, offset);
        } else {
            DoubleSpecies vsp = vspecies();
            if (!VectorIntrinsics.indexInRange(offset, vsp.length(), a.length)) 
{
                checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
            }
            intoArray0(a, offset, m);
        }
    }


If the array size is aligned with the vector size, the generated `m` is all 
true. Hence, the non-masked `intoArray()` is called instead, which improves the 
performance a lot.

Regarding to the `indexInRange()` API implementation, if the array size is the 
multiple num of vector size, the branch for the tail loop part is optimized out 
to an uncommon-trap by C2 compiler, which may improves the performance as well. 

Regarding to the added benchmark, since it is a testing for `indexInRange`, 
maybe we can remove the calling to the masked `fromArray()/intoArray()` APIs 
and directly save the mask into a boolean array instead. I guess this may 
reduce the overall performance gap. 

> 
> To do this more optimally likely requires a loop transformation where the 
> last loop iteration is peeled off, but that's a harder transformation in one 
> of the more complicated areas of C2 (although it already supports pre/post 
> loop, so maybe its possible to leverage that?).

Yes, it is!

-------------

PR: https://git.openjdk.org/jdk/pull/12064

Reply via email to