On Mon, 29 Sep 2025 08:04:06 GMT, Xiaohong Gong <[email protected]> wrote:
>> This patch adds mid-end support for vectorized add/mul reduction operations for half floats. It also includes backend aarch64 support for these operations. Only vectorization support through autovectorization is added, as VectorAPI currently does not support Float16 vector species.
>>
>> Both add and mul reduction vectorized through autovectorization mandate the implementation to be strictly ordered. The following is how each of these reductions is implemented for different aarch64 targets -
>>
>> **For AddReduction:**
>> On Neon-only targets (UseSVE = 0): Generates scalarized additions using the scalar `fadd` instruction for both 8B and 16B vector lengths. This is because Neon does not provide a direct instruction for computing strictly ordered floating point add reduction.
>>
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction, which computes add reduction for floating point in strict order.
>>
>> **For MulReduction:**
>> Neither Neon nor SVE provides a direct instruction for computing strictly ordered floating point multiply reduction. For vector lengths of 8B and 16B, a scalarized sequence of scalar `fmul` instructions is generated, and multiply reduction for vector lengths > 16B is not supported.
>>
>> Below is the performance of the two newly added microbenchmarks in `Float16OperationsBenchmark.java`, tested on three different aarch64 machines and with varying `MaxVectorSize` -
>>
>> Note: On all machines, the score (ops/ms) is compared with the master branch without this patch, which generates a sequence of loads (`ldrsh`) to load the FP16 value into an FPR and a scalar `fadd`/`fmul` to add/multiply the loaded value to the running sum/product. The ratios given below are the ratios between the throughput with this patch and the throughput without this patch. A ratio > 1 indicates that the performance with this patch is better than the master branch.
>>
>> **N1 (UseSVE = 0, max vector length = 16B):**
>>
>> Benchmark         vectorDim  Mode  Cnt    8B    16B
>> ReductionAddFP16        256  thrpt    9  1.41   1.40
>> ReductionAddFP16        512  thrpt    9  1.41   1.41
>> ReductionAddFP16       1024  thrpt    9  1.43   1.40
>> ReductionAddFP16       2048  thrpt    9  1.43   1.40
>> ReductionMulFP16        256  thrpt    9  1.22   1.22
>> ReductionMulFP16        512  thrpt    9  1.21   1.23
>> ReductionMulFP16       1024  thrpt    9  1.21   1.22
>> ReductionMulFP16       2048  thrpt    9  1.20   1.22
>>
>> On N1, the scalarized sequence of `fadd`/`fmul` are gener...
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1900:
>
>> 1898:     fmulh(dst, dst, vtmp);
>> 1899:     ins(vtmp, H, vsrc, 0, 7);
>> 1900:     fmulh(dst, dst, vtmp);
>
> Do you know why the performance is not improved significantly for multiply reduction? It seems the instructions between the different `ins` instructions will have a data dependence, which is not expected. Could you please use other instructions instead, or clear the register `vtmp` before `ins`, and check how the performance changes?
>
> Note that a clearing `mov` such as `MOVI Vd.2D, #0` has zero cost according to V2's optimization guide.

Are you referring to the N1 numbers? The add reduction operation gains around 40%, while the mul reduction gains around 20% on N1. On V1 and V2 they look comparable (not considering the cases where we generate `fadda` instructions).

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2398197879
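
For reference, below is a minimal sketch of the kind of strictly ordered FP16 reduction kernels that `ReductionAddFP16`/`ReductionMulFP16` measure. This is an illustration, not the actual `Float16OperationsBenchmark.java` source; the class name `Fp16ReductionSketch` is made up, and it assumes the incubating `jdk.incubator.vector.Float16` API (`shortBitsToFloat16`, static `add`/`multiply`, `floatValue`).

```java
import jdk.incubator.vector.Float16;

// Hypothetical sketch of strictly ordered FP16 reduction loops; not the
// benchmark source. Assumes jdk.incubator.vector.Float16 as described above.
public class Fp16ReductionSketch {

    // Add reduction: each iteration depends on the previous sum, so the
    // lowering must preserve ordering (scalarized fadd on Neon, fadda on SVE).
    static float addReduce(short[] bits) {
        Float16 sum = Float16.shortBitsToFloat16((short) 0);        // +0.0 in FP16
        for (int i = 0; i < bits.length; i++) {
            sum = Float16.add(sum, Float16.shortBitsToFloat16(bits[i]));
        }
        return sum.floatValue();
    }

    // Mul reduction: same loop-carried dependence, so only a scalarized fmul
    // sequence is generated for 8B/16B vector lengths.
    static float mulReduce(short[] bits) {
        Float16 prod = Float16.shortBitsToFloat16((short) 0x3C00);  // 1.0 in FP16
        for (int i = 0; i < bits.length; i++) {
            prod = Float16.multiply(prod, Float16.shortBitsToFloat16(bits[i]));
        }
        return prod.floatValue();
    }
}
```

The loop-carried dependence on `sum`/`prod` is what mandates the strictly ordered lowering described above and, in the Neon multiply case, the serialized `ins`/`fmulh` chain being discussed.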
