On Thu, 2 Oct 2025 13:21:32 GMT, Marc Chevalier <[email protected]> wrote:

>> This patch adds mid-end support for vectorized add/mul reduction operations 
>> for half floats. It also includes backend aarch64 support for these 
>> operations. Vectorization support is added only through autovectorization, as 
>> the VectorAPI does not yet support Float16 vector species.
>> 
>> Both add and mul reductions vectorized through autovectorization mandate a 
>> strictly ordered implementation. The following describes how each of these 
>> reductions is implemented on different aarch64 targets -
>> 
>> **For AddReduction :**
>> On Neon-only targets (UseSVE = 0): Generates scalarized additions using the 
>> scalar `fadd` instruction for both 8B and 16B vector lengths. This is 
>> because Neon does not provide a direct instruction for computing strictly 
>> ordered floating point add reduction.
>> 
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which 
>> computes add reduction for floating point in strict order.
>> 
>> **For MulReduction :**
>> Neither Neon nor SVE provides a direct instruction for computing strictly 
>> ordered floating point multiply reduction. For vector lengths of 8B and 16B, 
>> a scalarized sequence of scalar `fmul` instructions is generated; multiply 
>> reduction for vector lengths > 16B is not supported.
>> 
>> Below is the performance of the two newly added microbenchmarks in 
>> `Float16OperationsBenchmark.java` tested on three different aarch64 machines 
>> and with varying `MaxVectorSize` -
>> 
>> Note: On all machines, the score (ops/ms) is compared with the master branch 
>> without this patch which generates a sequence of loads (`ldrsh`) to load the 
>> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded 
>> value to the running sum/product. The ratios given below compare the 
>> throughput with this patch against the throughput without it; a ratio > 1 
>> indicates this patch performs better than the master branch.
>> 
>> **N1 (UseSVE = 0, max vector length = 16B):**
>> 
>> Benchmark         vectorDim  Mode   Cnt  8B     16B
>> ReductionAddFP16  256        thrpt  9    1.41   1.40
>> ReductionAddFP16  512        thrpt  9    1.41   1.41
>> ReductionAddFP16  1024       thrpt  9    1.43   1.40
>> ReductionAddFP16  2048       thrpt  9    1.43   1.40
>> ReductionMulFP16  256        thrpt  9    1.22   1.22
>> ReductionMulFP16  512        thrpt  9    1.21   1.23
>> ReductionMulFP16  1024       thrpt  9    1.21   1.22
>> ReductionMulFP16  2048       thrpt  9    1.20   1.22
>> 
>> 
>> On N1, the scalarized sequence of `fadd/fmul` are gener...
>
> I see now the flags are not trivial:
> 
> -XX:+UnlockDiagnosticVMOptions -XX:-TieredCompilation 
> -XX:+StressArrayCopyMacroNode -XX:+StressLCM -XX:+StressGCM -XX:+StressIGVN 
> -XX:+StressCCP -XX:+StressMacroExpansion 
> -XX:+StressMethodHandleLinkerInlining -XX:+StressCompiledExceptionHandlers 
> -XX:VerifyConstraintCasts=1 -XX:+StressLoopPeeling
> 
> That's a lot of stress flags. It's likely that many runs will be needed to 
> reproduce.
> 
> The machine is a VM.Standard.A1.Flex shape, as described in
> https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm
> 
> Backtrace at the failure:
> 
> Current CompileTask:
> C2:1523  346 %  b        
> compiler.vectorization.TestFloat16VectorOperations::vectorAddReductionFloat16 
> @ 4 (39 bytes)
> 
> Stack: [0x0000ffff84799000,0x0000ffff84997000],  sp=0x0000ffff849920d0,  free 
> space=2020k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> V  [libjvm.so+0x7da724]  
> C2_MacroAssembler::neon_reduce_add_fp16(FloatRegister, FloatRegister, 
> FloatRegister, unsigned int, FloatRegister)+0x2b4  
> (c2_MacroAssembler_aarch64.cpp:1930)
> V  [libjvm.so+0x154492c]  PhaseOutput::scratch_emit_size(Node const*)+0x2ec  
> (output.cpp:3171)
> V  [libjvm.so+0x153d4a4]  PhaseOutput::shorten_branches(unsigned int*)+0x2e4  
> (output.cpp:528)
> V  [libjvm.so+0x154dcdc]  PhaseOutput::Output()+0x95c  (output.cpp:328)
> V  [libjvm.so+0x9be070]  Compile::Code_Gen()+0x7f0  (compile.cpp:3127)
> V  [libjvm.so+0x9c21c0]  Compile::Compile(ciEnv*, ciMethod*, int, Options, 
> DirectiveSet*)+0x1774  (compile.cpp:894)
> V  [libjvm.so+0x7eec64]  C2Compiler::compile_method(ciEnv*, ciMethod*, int, 
> bool, DirectiveSet*)+0x2e0  (c2compiler.cpp:147)
> V  [libjvm.so+0x9d0f8c]  
> CompileBroker::invoke_compiler_on_method(CompileTask*)+0xb08  
> (compileBroker.cpp:2345)
> V  [libjvm.so+0x9d1eb8]  CompileBroker::compiler_thread_loop()+0x638  
> (compileBroker.cpp:1989)
> V  [libjvm.so+0xed25a8]  JavaThread::thread_main_inner()+0x108  
> (javaThread.cpp:775)
> V  [libjvm.so+0x18466dc]  Thread::call_run()+0xac  (thread.cpp:243)
> V  [libjvm.so+0x152349c]  thread_native_entry(Thread*)+0x12c  
> (os_linux.cpp:895)
> C  [libc.so.6+0x80b50]  start_thread+0x300
> 
> 
> I've attached the replay file in the JBS issue, if it can help.

@marc-chevalier Thanks! I have now been able to reproduce it using the flags 
you shared. Will update my patch soon with a fix for this along with addressing 
other review comments.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/27526#issuecomment-3361263768
