On Fri, 26 Sep 2025 12:00:31 GMT, Bhavana Kilambi <[email protected]> wrote:
> This patch adds mid-end support for vectorized add/mul reduction operations
> for half floats. It also includes aarch64 backend support for these
> operations. Only vectorization through autovectorization is added, as the
> VectorAPI does not currently support a Float16 vector species.
>
> Both add and mul reductions vectorized through autovectorization mandate a
> strictly ordered implementation. Each of these reductions is implemented as
> follows on the different aarch64 targets -
>
> **For AddReduction:**
> On Neon-only targets (UseSVE = 0): generates scalarized additions using the
> scalar `fadd` instruction for both 8B and 16B vector lengths, because Neon
> does not provide a direct instruction for computing a strictly ordered
> floating-point add reduction.
>
> On SVE targets (UseSVE > 0): generates the `fadda` instruction, which
> computes the floating-point add reduction in strict order.
>
> **For MulReduction:**
> Neither Neon nor SVE provides a direct instruction for computing a strictly
> ordered floating-point multiply reduction. For vector lengths of 8B and 16B,
> a scalarized sequence of scalar `fmul` instructions is generated; multiply
> reduction for vector lengths > 16B is not supported.
>
> Below is the performance of the two newly added microbenchmarks in
> `Float16OperationsBenchmark.java`, tested on three different aarch64
> machines with varying `MaxVectorSize` -
>
> Note: On all machines, the score (ops/ms) is compared against the master
> branch without this patch, which generates a sequence of loads (`ldrsh`) to
> load each FP16 value into an FPR and a scalar `fadd`/`fmul` to add/multiply
> the loaded value to the running sum/product. The ratios given below are the
> throughput with this patch divided by the throughput without it; a ratio
> above 1 indicates that this patch performs better than the master branch.
>
> **N1 (UseSVE = 0, max vector length = 16B):**
>
> Benchmark          vectorDim   Mode  Cnt     8B    16B
> ReductionAddFP16         256  thrpt    9   1.41   1.40
> ReductionAddFP16         512  thrpt    9   1.41   1.41
> ReductionAddFP16        1024  thrpt    9   1.43   1.40
> ReductionAddFP16        2048  thrpt    9   1.43   1.40
> ReductionMulFP16         256  thrpt    9   1.22   1.22
> ReductionMulFP16         512  thrpt    9   1.21   1.23
> ReductionMulFP16        1024  thrpt    9   1.21   1.22
> ReductionMulFP16        2048  thrpt    9   1.20   1.22
>
> On N1, the scalarized sequence of `fadd`/`fmul` is generated for both
> `MaxVectorSize` of 8B and 16B for add reduction ...

I seem to be hitting a failure in
`compiler/vectorization/TestFloat16VectorOperations.java` on aarch64, in
`C2_MacroAssembler::neon_reduce_add_fp16(FloatRegister, FloatRegister, FloatRegister, unsigned int, FloatRegister)`
at `src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp:1930`:

    assert(vector_length_in_bytes == 8 || vector_length_in_bytes == 16) failed: unsupported
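As a side note on why strict ordering is mandated here: floating-point
addition is not associative, so a reassociated (per-lane) reduction, as a
naive vectorizer would emit, can produce a different result from the
sequential loop. A minimal sketch using `float` for illustration (the same
argument applies to FP16; the class name and values are made up, not from
the patch):

```java
public class StrictOrderDemo {
    public static void main(String[] args) {
        float[] a = {1e8f, 1.0f, -1e8f, 1.0f};

        // Strictly ordered (sequential) sum: ((1e8 + 1) + -1e8) + 1.
        // 1e8f + 1.0f rounds back to 1e8f, so the result is 1.0.
        float ordered = 0.0f;
        for (float v : a) {
            ordered += v;
        }

        // Reassociated sum, as a two-lane vector reduction might compute it:
        // (1e8 + -1e8) + (1 + 1) = 2.0.
        float lane0 = a[0] + a[2];
        float lane1 = a[1] + a[3];
        float reassociated = lane0 + lane1;

        System.out.println("ordered      = " + ordered);       // 1.0
        System.out.println("reassociated = " + reassociated);  // 2.0
    }
}
```

This is why the Neon lowering has to fall back to a strictly ordered chain of
scalar `fadd`/`fmul` instructions, while SVE can use `fadda` directly for the
add case.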
-------------

PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3339456935