On Fri, 26 Sep 2025 12:00:31 GMT, Bhavana Kilambi <[email protected]> wrote:
> This patch adds mid-end support for vectorized add/mul reduction operations
> for half floats. It also includes backend aarch64 support for these
> operations. Only vectorization support through autovectorization is added as
> VectorAPI currently does not support Float16 vector species.
>
> Both add and mul reduction vectorized through autovectorization mandate the
> implementation to be strictly ordered. The following is how each of these
> reductions is implemented for different aarch64 targets -
>
> **For AddReduction :**
> On Neon only targets (UseSVE = 0): Generates scalarized additions using the
> scalar `fadd` instruction for both 8B and 16B vector lengths. This is because
> Neon does not provide a direct instruction for computing strictly ordered
> floating point add reduction.
>
> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which computes
> add reduction for floating point in strict order.
>
> **For MulReduction :**
> Both Neon and SVE do not provide a direct instruction for computing strictly
> ordered floating point multiply reduction. For vector lengths of 8B and 16B,
> a scalarized sequence of scalar `fmul` instructions is generated and multiply
> reduction for vector lengths > 16B is not supported.
>
> Below is the performance of the two newly added microbenchmarks in
> `Float16OperationsBenchmark.java` tested on three different aarch64 machines
> and with varying `MaxVectorSize` -
>
> Note: On all machines, the score (ops/ms) is compared with the master branch
> without this patch which generates a sequence of loads (`ldrsh`) to load the
> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded
> value to the running sum/product. The ratios given below are the ratios
> between the throughput with this patch and the throughput without this patch.
> Ratio > 1 indicates the performance with this patch is better than the master
> branch.
>
> **N1 (UseSVE = 0, max vector length = 16B):**
>
> Benchmark         vectorDim  Mode   Cnt  8B    16B
> ReductionAddFP16  256        thrpt  9    1.41  1.40
> ReductionAddFP16  512        thrpt  9    1.41  1.41
> ReductionAddFP16  1024       thrpt  9    1.43  1.40
> ReductionAddFP16  2048       thrpt  9    1.43  1.40
> ReductionMulFP16  256        thrpt  9    1.22  1.22
> ReductionMulFP16  512        thrpt  9    1.21  1.23
> ReductionMulFP16  1024       thrpt  9    1.21  1.22
> ReductionMulFP16  2048       thrpt  9    1.20  1.22
>
>
> On N1, the scalarized sequence of `fadd/fmul` are generated for both
> `MaxVectorSize` of 8B and 16B for add reduction ...

I see now that the flags are not trivial:

    -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation
    -XX:+StressArrayCopyMacroNode -XX:+StressLCM -XX:+StressGCM
    -XX:+StressIGVN -XX:+StressCCP -XX:+StressMacroExpansion
    -XX:+StressMethodHandleLinkerInlining
    -XX:+StressCompiledExceptionHandlers -XX:VerifyConstraintCasts=1
    -XX:+StressLoopPeeling

That is a lot of stress flags, so it's likely that many runs might be needed to
reproduce.

The machine is a VM.Standard.A1.Flex shape, as described in
https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm

Backtrace at the failure:

    Current CompileTask:
    C2:1523  346 % b  compiler.vectorization.TestFloat16VectorOperations::vectorAddReductionFloat16 @ 4 (39 bytes)

    Stack: [0x0000ffff84799000,0x0000ffff84997000],  sp=0x0000ffff849920d0,  free space=2020k
    Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V  [libjvm.so+0x7da724]  C2_MacroAssembler::neon_reduce_add_fp16(FloatRegister, FloatRegister, FloatRegister, unsigned int, FloatRegister)+0x2b4  (c2_MacroAssembler_aarch64.cpp:1930)
    V  [libjvm.so+0x154492c]  PhaseOutput::scratch_emit_size(Node const*)+0x2ec  (output.cpp:3171)
    V  [libjvm.so+0x153d4a4]  PhaseOutput::shorten_branches(unsigned int*)+0x2e4  (output.cpp:528)
    V  [libjvm.so+0x154dcdc]  PhaseOutput::Output()+0x95c  (output.cpp:328)
    V  [libjvm.so+0x9be070]  Compile::Code_Gen()+0x7f0  (compile.cpp:3127)
    V  [libjvm.so+0x9c21c0]  Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1774  (compile.cpp:894)
    V  [libjvm.so+0x7eec64]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, bool, DirectiveSet*)+0x2e0  (c2compiler.cpp:147)
    V  [libjvm.so+0x9d0f8c]  CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb08  (compileBroker.cpp:2345)
    V  [libjvm.so+0x9d1eb8]  CompileBroker::compiler_thread_loop()+0x638  (compileBroker.cpp:1989)
    V  [libjvm.so+0xed25a8]  JavaThread::thread_main_inner()+0x108  (javaThread.cpp:775)
    V  [libjvm.so+0x18466dc]  Thread::call_run()+0xac  (thread.cpp:243)
    V  [libjvm.so+0x152349c]  thread_native_entry(Thread*)+0x12c  (os_linux.cpp:895)
    C  [libc.so.6+0x80b50]  start_thread+0x300

I've attached the replay file to the JBS issue, in case it helps.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3361203842
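
For readers wondering why the quoted description insists on a strictly ordered reduction: floating-point addition is not associative, so a reassociated (tree-shaped) reduction can give a different result from the sequential loop. The sketch below illustrates this with plain `float` rather than Float16, to avoid depending on the incubator module. The class and method names are hypothetical and are not taken from the patch or the test.

```java
// Illustrative only: plain float stands in for Float16, and all names here
// are hypothetical. This is not the code under review in the PR.
class StrictOrderReduction {

    // Strictly ordered reduction: (((0 + a[0]) + a[1]) + a[2]) + ...
    // This is the evaluation order a scalar Java loop has.
    static float strictSum(float[] a) {
        float sum = 0.0f;
        for (float v : a) {
            sum += v;
        }
        return sum;
    }

    // Pairwise (tree) reduction over a[lo, hi): the kind of reassociation
    // that an unordered vector reduction would be free to perform.
    static float pairwiseSum(float[] a, int lo, int hi) {
        if (hi - lo == 1) {
            return a[lo];
        }
        int mid = (lo + hi) >>> 1;
        return pairwiseSum(a, lo, mid) + pairwiseSum(a, mid, hi);
    }

    public static void main(String[] args) {
        // Near 1e8 adjacent floats are 8 apart, so adding 1.0f to +/-1e8f
        // is rounded away.
        float[] a = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println("strict   = " + strictSum(a));             // 1.0
        System.out.println("pairwise = " + pairwiseSum(a, 0, a.length)); // 0.0
    }
}
```

Because the two orders can disagree like this, an autovectorized reduction has to keep the sequential order to stay bit-for-bit compatible with the scalar loop, which is what the scalarized `fadd` sequence on Neon and `fadda` on SVE provide according to the description.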
