On Mon, 6 Feb 2023 17:39:42 GMT, Paul Sandoz <psan...@openjdk.org> wrote:
>> Xiaohong Gong has updated the pull request incrementally with one additional >> commit since the last revision: >> >> Add smaller array size for benchmark tests > > I think it would be useful to adjust the naming and comments of some methods > to make it clearer the method parameter constraints. > > `indexInRange0Helper` is now called if the index is partially or totally out > of range at the lower or upper ends and `indexInRange0` is called if > partially or totally out of range at the upper end. > e.g. a more literal naming could be: > `AbstractMask::indexInRange0Helper` -> > `AbstractMask::indexPartiallyInRangeHelper` > `VectorSupport::indexInRange` -> VectorSupport::indexPartiallyInUpperRange` > ? > > IIUC the performance numbers show that when the array is not a multiple of > the vector size there is still quite an impact overall to calling > `VectorSupport::indexInRange` for the last loop iteration. I am guessing the > overall loop shape is different which impacts other optimizations? > > To do this more optimally likely requires a loop transformation where the > last loop iteration is peeled off, but that's a harder transformation in one > of the more complicated areas of C2 (although it already supports pre/post > loop, so maybe its possible to leverage that?). Thanks for looking at this PR @PaulSandoz ! > I think it would be useful to adjust the naming and comments of some methods > to make it clearer the method parameter constraints. > > `indexInRange0Helper` is now called if the index is partially or totally out > of range at the lower or upper ends and `indexInRange0` is called if > partially or totally out of range at the upper end. e.g. a more literal > naming could be: `AbstractMask::indexInRange0Helper` -> > `AbstractMask::indexPartiallyInRangeHelper` `VectorSupport::indexInRange` -> > VectorSupport::indexPartiallyInUpperRange` ? The renaming looks good to me. Thanks! > IIUC the performance numbers show that when the array is not a multiple of > the vector size there is still quite an impact overall to calling > `VectorSupport::indexInRange` for the last loop iteration. I am guessing the > overall loop shape is different which impacts other optimizations? I think the main influence of the benchmark result comes from the masked ` fromArray()/intoArray()` APIs, especially the masked intoArray() API. For the tail loop part, there is the vector boxing needed on all architectures, with the following reasons: - If the architecture doesn't support predicate feature, it cannot be intrinsified. - The `checkMaskFromIndexSize` called inside the `else->if` branch may not be inlined, and the `indexInRange()` generated mask `m` needs the boxing before it. public final void intoArray(double[] a, int offset, VectorMask<Double> m) { if (m.allTrue()) { intoArray(a, offset); } else { DoubleSpecies vsp = vspecies(); if (!VectorIntrinsics.indexInRange(offset, vsp.length(), a.length)) { checkMaskFromIndexSize(offset, vsp, m, 1, a.length); } intoArray0(a, offset, m); } } If the array size is aligned with the vector size, the generated `m` is all true. Hence, the non-masked `intoArray()` is called instead, which improves the performance a lot. Regarding to the `indexInRange()` API implementation, if the array size is the multiple num of vector size, the branch for the tail loop part is optimized out to an uncommon-trap by C2 compiler, which may improves the performance as well. Regarding to the added benchmark, since it is a testing for `indexInRange`, maybe we can remove the calling to the masked `fromArray()/intoArray()` APIs and directly save the mask into a boolean array instead. I guess this may reduce the overall performance gap. > > To do this more optimally likely requires a loop transformation where the > last loop iteration is peeled off, but that's a harder transformation in one > of the more complicated areas of C2 (although it already supports pre/post > loop, so maybe its possible to leverage that?). Yes, it is! ------------- PR: https://git.openjdk.org/jdk/pull/12064