On Fri, 26 Sep 2025 12:00:31 GMT, Bhavana Kilambi <[email protected]> wrote:
> This patch adds mid-end support for vectorized add/mul reduction operations
> for half floats. It also includes backend aarch64 support for these
> operations. Only vectorization support through autovectorization is added as
> VectorAPI currently does not support Float16 vector species.
>
> Add and mul reductions vectorized through autovectorization must both be
> implemented in strict order. Each of these reductions is implemented as
> follows on the different aarch64 targets:
>
> **For AddReduction :**
> On Neon only targets (UseSVE = 0): Generates scalarized additions using the
> scalar `fadd` instruction for both 8B and 16B vector lengths. This is because
> Neon does not provide a direct instruction for computing strictly ordered
> floating point add reduction.
>
> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which computes
> add reduction for floating point in strict order.
>
> **For MulReduction :**
> Neither Neon nor SVE provides a direct instruction for computing strictly
> ordered floating point multiply reduction. For vector lengths of 8B and 16B,
> a scalarized sequence of scalar `fmul` instructions is generated. Multiply
> reduction for vector lengths > 16B is not supported.
>
> Below is the performance of the two newly added microbenchmarks in
> `Float16OperationsBenchmark.java`, tested on three different aarch64 machines
> with varying `MaxVectorSize`:
>
> Note: On all machines, the score (ops/ms) is compared with the master branch
> without this patch which generates a sequence of loads (`ldrsh`) to load the
> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded
> value to the running sum/product. The ratios given below are the ratios
> between the throughput with this patch and the throughput without this patch.
> Ratio > 1 indicates the performance with this patch is better than the master
> branch.
>
> **N1 (UseSVE = 0, max vector length = 16B):**
>
> Benchmark         vectorDim  Mode   Cnt  8B    16B
> ReductionAddFP16  256        thrpt  9    1.41  1.40
> ReductionAddFP16  512        thrpt  9    1.41  1.41
> ReductionAddFP16  1024       thrpt  9    1.43  1.40
> ReductionAddFP16  2048       thrpt  9    1.43  1.40
> ReductionMulFP16  256        thrpt  9    1.22  1.22
> ReductionMulFP16  512        thrpt  9    1.21  1.23
> ReductionMulFP16  1024       thrpt  9    1.21  1.22
> ReductionMulFP16  2048       thrpt  9    1.20  1.22
>
>
> On N1, the scalarized sequence of `fadd/fmul` are generated for both
> `MaxVectorSize` of 8B and 16B for add reduction ...
src/hotspot/cpu/aarch64/aarch64_vector.ad line 272:
> 270: if (length_in_bytes > 16 || !is_feat_fp16_supported()) {
> 271: return false;
> 272: }
Reductions with `length_in_bytes < 8` should also be skipped, because such
operations are not supported yet, while IR nodes with a 32-bit vector size
might exist, right?
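
One possible shape for such a guard, as a standalone sketch rather than the
actual HotSpot code. The function name `fp16_reduction_supported` and the
`has_fp16` parameter are hypothetical; `has_fp16` stands in for
`is_feat_fp16_supported()`:

```cpp
// Hypothetical standalone model of the size guard; not the real match rule.
// has_fp16 stands in for is_feat_fp16_supported().
bool fp16_reduction_supported(int length_in_bytes, bool has_fp16) {
  // Only 8B and 16B vectors are handled, so reject anything smaller
  // (e.g. a 32-bit vector) as well as anything larger.
  if (length_in_bytes < 8 || length_in_bytes > 16 || !has_fp16) {
    return false;
  }
  return true;
}
```

With this shape a 4-byte vector is rejected in the same way as a 32-byte one.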
src/hotspot/cpu/aarch64/aarch64_vector.ad line 3427:
> 3425: (!VM_Version::use_neon_for_vector(Matcher::vector_length_in_bytes(n->in(2))) ||
> 3426: n->as_Reduction()->requires_strict_order())) ||
> 3427: (Matcher::vector_element_basic_type(n->in(2)) == T_SHORT && UseSVE > 0));
`UseSVE > 0` is a requirement for all cases. I suggest a separate rule for
`AddReductionVHF`, since the predicate is much more complex now.
src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1900:
> 1898: fmulh(dst, dst, vtmp);
> 1899: ins(vtmp, H, vsrc, 0, 7);
> 1900: fmulh(dst, dst, vtmp);
Do you know why the performance does not improve significantly for multiply
reduction? It seems the successive `ins` instructions have a data dependence
on each other through `vtmp`, which is not expected. Could you please try
other instructions, or clear the register `vtmp` before each `ins`, and check
how the performance changes?
Note that a register-clearing move such as `MOVI Vd.2D, #0` has zero cost
according to V2's guide.
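
For context on the strict-ordering requirement mentioned in the description:
a strictly ordered reduction is a serial chain through the accumulator, and
reassociating it can change the rounded result. The sketch below illustrates
this with plain `float` standing in for FP16 lanes (an assumption made for
portability). Both function names are hypothetical and unrelated to the patch:

```cpp
#include <cstddef>

// Illustrative model only: float stands in for FP16 lanes.
// Strict order: each addition consumes the previous result, so the
// operations form one serial dependency chain through acc.
float reduce_add_strict(const float* lanes, std::size_t n, float init) {
  float acc = init;
  for (std::size_t i = 0; i < n; ++i) {
    acc = acc + lanes[i];
  }
  return acc;
}

// Pairwise (reassociated) order: shorter dependency chain, but a different
// evaluation order, so the rounded result may differ. Requires n >= 1.
float reduce_add_pairwise(const float* lanes, std::size_t n) {
  if (n == 1) {
    return lanes[0];
  }
  std::size_t half = n / 2;
  return reduce_add_pairwise(lanes, half) +
         reduce_add_pairwise(lanes + half, n - half);
}
```

For the lanes `{1e8f, 1.0f, -1e8f, 1.0f}` and `init = 0.0f`, the strict version
returns `1.0f` and the pairwise version returns `0.0f`, assuming IEEE 754
single-precision arithmetic without excess precision. So the dependence
through `dst` is inherent to the strict order, while the one through `vtmp`
is not.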
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2386942508
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2386995748
PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2387053858