Re: RFR: 8366444: Add support for add/mul reduction operations for Float16

Bhavana Kilambi Thu, 02 Oct 2025 02:26:31 -0700

On Mon, 29 Sep 2025 07:18:42 GMT, Xiaohong Gong <[email protected]> wrote:


>> This patch adds mid-end support for vectorized add/mul reduction operations 
>> for half floats. It also includes backend aarch64 support for these 
>> operations. Only vectorization support through autovectorization is added as 
>> VectorAPI currently does not support Float16 vector species.
>> 
>> Add and mul reductions vectorized through autovectorization must be 
>> strictly ordered. The following is how each of these reductions is 
>> implemented for different aarch64 targets -
>> 
>> **For AddReduction :**
>> On Neon only targets (UseSVE = 0): Generates scalarized additions using the 
>> scalar `fadd` instruction for both 8B and 16B vector lengths. This is 
>> because Neon does not provide a direct instruction for computing strictly 
>> ordered floating point add reduction.
>> 
>> On SVE targets (UseSVE > 0): Generates the `fadda` instruction which 
>> computes add reduction for floating point in strict order.
>> 
>> **For MulReduction :**
>> Neither Neon nor SVE provides a direct instruction for computing strictly 
>> ordered floating point multiply reduction. For vector lengths of 8B and 16B, 
>> a scalarized sequence of scalar `fmul` instructions is generated; multiply 
>> reduction for vector lengths > 16B is not supported.
>> 
>> Below is the performance of the two newly added microbenchmarks in 
>> `Float16OperationsBenchmark.java` tested on three different aarch64 machines 
>> and with varying `MaxVectorSize` -
>> 
>> Note: On all machines, the score (ops/ms) is compared with the master branch 
>> without this patch which generates a sequence of loads (`ldrsh`) to load the 
>> FP16 value into an FPR and a scalar `fadd/fmul` to add/multiply the loaded 
>> value to the running sum/product. The ratios given below are the ratios 
>> between the throughput with this patch and the throughput without this patch.
>> Ratio > 1 indicates the performance with this patch is better than the 
>> master branch.
>> 
>> **N1 (UseSVE = 0, max vector length = 16B):**
>> 
>> Benchmark         vectorDim  Mode   Cnt  8B     16B
>> ReductionAddFP16  256        thrpt  9    1.41   1.40
>> ReductionAddFP16  512        thrpt  9    1.41   1.41
>> ReductionAddFP16  1024       thrpt  9    1.43   1.40
>> ReductionAddFP16  2048       thrpt  9    1.43   1.40
>> ReductionMulFP16  256        thrpt  9    1.22   1.22
>> ReductionMulFP16  512        thrpt  9    1.21   1.23
>> ReductionMulFP16  1024       thrpt  9    1.21   1.22
>> ReductionMulFP16  2048       thrpt  9    1.20   1.22
>> 
>> 
>> On N1, the scalarized sequence of `fadd/fmul` are gener...
>
> src/hotspot/cpu/aarch64/aarch64_vector.ad line 272:
> 
>> 270:         if (length_in_bytes > 16 || !is_feat_fp16_supported()) {
>> 271:           return false;
>> 272:         }
> 
> Reductions with `length_in_bytes < 8` should also be skipped. Because such 
> operations are not supported now, while the IRs with 32-bit vector size might 
> exist, right?

Hi @XiaohongGong, yes, `length_in_bytes < 8` is also not supported; currently 
we support only vector lengths of 8B and 16B.
IRs with a 32-bit vector size might exist, but we do not have an optimized 
implementation for 32B vector lengths, so I have disabled it. Instead, the 
16B scalarized Neon instruction sequence is generated for a 32B vector 
length. Is this what you were asking?
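
For illustration, if the lower bound were added to the check as suggested, the 
condition could look roughly like the sketch below. This is a stand-alone 
model, not the code in `aarch64_vector.ad`: the function name 
`fp16_reduction_supported` and the `has_fp16` parameter (standing in for 
`is_feat_fp16_supported()`) are hypothetical.

```cpp
// Hypothetical stand-alone model of the size check discussed above; it is
// NOT the actual matcher code. `has_fp16` stands in for
// is_feat_fp16_supported().
constexpr bool fp16_reduction_supported(int length_in_bytes, bool has_fp16) {
  // Accept only vectors of 8 to 16 bytes (8B and 16B in practice) and only
  // when the FP16 feature is available; everything else is rejected.
  return has_fp16 && length_in_bytes >= 8 && length_in_bytes <= 16;
}
```

With this shape, a 4-byte (32-bit) vector and a 32-byte vector are both 
rejected, while 8B and 16B are accepted when FP16 is available.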

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/27526#discussion_r2397961057
