This patch adds mid-end support for vectorized add/mul reduction operations for
half floats, along with aarch64 backend support for these operations.
Vectorization is supported only through autovectorization, as the VectorAPI
currently does not provide a Float16 vector species.

Add and mul reductions vectorized through autovectorization are required to be
strictly ordered: floating-point arithmetic is not associative, so the
vectorized reduction must accumulate elements in source order to match the
scalar result. Each of these reductions is implemented as follows for the
different aarch64 targets -
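
As a small illustration (using `float` for brevity; the same applies to FP16),
reassociating a floating-point sum, as a pairwise/tree reduction would, can
change the result:

```java
// Floating-point addition is not associative: accumulating in a different
// order than the source loop can produce a different value.
public class OrderMatters {
    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1e-3f;
        float ordered = (a + b) + c;   // strictly ordered: ~0.001
        float reassoc = a + (b + c);   // reassociated: 0.0 (c is lost in b + c)
        System.out.println(ordered + " vs " + reassoc);
    }
}
```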

**For AddReduction:**
On Neon-only targets (UseSVE = 0): generates scalarized additions using the
scalar `fadd` instruction for both 8B and 16B vector lengths, because Neon does
not provide a direct instruction for computing a strictly ordered floating-point
add reduction.
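
In other words (an illustrative Java sketch, not the backend code), the
reduction is expanded into a strictly ordered dependency chain of scalar adds
over the vector lanes, e.g. for an 8B vector (4 FP16 lanes):

```java
import jdk.incubator.vector.Float16;

class AddReductionSketch {
    // Illustrative expansion of AddReductionVHF for one 8B input vector
    // (4 FP16 lanes): each lane is folded into the running accumulator
    // 'acc' in lane order, preserving the strict ordering.
    static Float16 reduce8B(Float16 acc, Float16[] lanes) {
        acc = Float16.add(acc, lanes[0]);
        acc = Float16.add(acc, lanes[1]);
        acc = Float16.add(acc, lanes[2]);
        acc = Float16.add(acc, lanes[3]);
        return acc;
    }
}
```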

On SVE targets (UseSVE > 0): generates the `fadda` instruction, which computes
a strictly ordered floating-point add reduction.

**For MulReduction:**
Neither Neon nor SVE provides a direct instruction for computing a strictly
ordered floating-point multiply reduction. For vector lengths of 8B and 16B, a
scalarized sequence of scalar `fmul` instructions is generated; multiply
reduction for vector lengths > 16B is not supported.

Below is the performance of the two newly added microbenchmarks in
`Float16OperationsBenchmark.java`, tested on three different aarch64 machines
with varying `MaxVectorSize` -

Note: On all machines, the score (ops/ms) is compared with the master branch
without this patch, which generates a sequence of loads (`ldrsh`) to load the
FP16 value into an FPR and a scalar `fadd`/`fmul` to add/multiply the loaded
value to the running sum/product. The ratios given below are the throughput
with this patch divided by the throughput without it; a ratio > 1 indicates
that this patch performs better than the master branch.
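
For reference, the two kernels have roughly the following shape (a hedged
sketch rather than the actual `Float16OperationsBenchmark.java` code; the
class, field, and parameter names below are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

import jdk.incubator.vector.Float16;
import org.openjdk.jmh.annotations.*;

// Illustrative shape of the two reduction kernels: a strictly ordered
// running FP16 sum/product over vectorDim elements stored as raw FP16
// bits in a short[].
@State(Scope.Thread)
public class Fp16ReductionSketch {
    @Param({"256", "512", "1024", "2048"})
    int vectorDim;

    short[] input;

    @Setup
    public void setup() {
        input = new short[vectorDim];
        for (int i = 0; i < vectorDim; i++) {
            input[i] = Float.floatToFloat16(ThreadLocalRandom.current().nextFloat());
        }
    }

    @Benchmark
    public short reductionAddFP16() {
        Float16 sum = Float16.shortBitsToFloat16((short) 0);       // +0.0
        for (short bits : input) {
            sum = Float16.add(sum, Float16.shortBitsToFloat16(bits));
        }
        return Float16.float16ToRawShortBits(sum);
    }

    @Benchmark
    public short reductionMulFP16() {
        Float16 prod = Float16.shortBitsToFloat16((short) 0x3C00); // 1.0
        for (short bits : input) {
            prod = Float16.multiply(prod, Float16.shortBitsToFloat16(bits));
        }
        return Float16.float16ToRawShortBits(prod);
    }
}
```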

**N1 (UseSVE = 0, max vector length = 16B):**

Benchmark         vectorDim  Mode   Cnt  8B     16B
ReductionAddFP16  256        thrpt  9    1.41   1.40
ReductionAddFP16  512        thrpt  9    1.41   1.41
ReductionAddFP16  1024       thrpt  9    1.43   1.40
ReductionAddFP16  2048       thrpt  9    1.43   1.40
ReductionMulFP16  256        thrpt  9    1.22   1.22
ReductionMulFP16  512        thrpt  9    1.21   1.23
ReductionMulFP16  1024       thrpt  9    1.21   1.22
ReductionMulFP16  2048       thrpt  9    1.20   1.22


On N1, the scalarized sequences of `fadd`/`fmul` are generated for add
reduction and mul reduction respectively, for both `MaxVectorSize` = 8B and 16B.

**V1 (UseSVE = 1, max vector length = 32B):**

Benchmark         vectorDim  Mode   Cnt  8B     16B     32B
ReductionAddFP16  256        thrpt  9    1.11   1.75    2.02
ReductionAddFP16  512        thrpt  9    1.02   1.64    1.93
ReductionAddFP16  1024       thrpt  9    1.02   1.59    1.85
ReductionAddFP16  2048       thrpt  9    1.02   1.56    1.80
ReductionMulFP16  256        thrpt  9    1.12   0.99    1.09
ReductionMulFP16  512        thrpt  9    1.04   1.01    1.04
ReductionMulFP16  1024       thrpt  9    1.02   1.02    1.00
ReductionMulFP16  2048       thrpt  9    1.01   1.01    1.00


On V1, for MaxVectorSize = 8: a scalarized `fadd`/`fmul` sequence is generated
for `AddReductionVHF`/`MulReductionVHF`, as UseSVE defaults to 0 in this
configuration [2].
For MaxVectorSize = 16: a scalarized `fmul` sequence is generated for
`MulReductionVHF`, and `fadda` is generated for `AddReductionVHF`, which yields
significant gains.
For MaxVectorSize = 32: autovectorization of `MulReductionVHF` is disabled for
MaxVectorSize > 16B, so the autovectorizer checks the maximal implemented
size [1], which is 16B, and generates the scalarized `fmul` sequence for 16B in
this case. For `AddReductionVHF`, the `fadda` instruction is generated.

**V2 (UseSVE = 2, max vector length = 16B):**

Benchmark         vectorDim  Mode   Cnt  8B     16B
ReductionAddFP16  256        thrpt  9    1.16   1.70
ReductionAddFP16  512        thrpt  9    1.02   1.61
ReductionAddFP16  1024       thrpt  9    1.01   1.53
ReductionAddFP16  2048       thrpt  9    1.00   1.49
ReductionMulFP16  256        thrpt  9    1.18   0.99
ReductionMulFP16  512        thrpt  9    1.04   1.01
ReductionMulFP16  1024       thrpt  9    1.02   1.02
ReductionMulFP16  2048       thrpt  9    1.01   1.01


On V2, for MaxVectorSize = 8: a scalarized `fadd`/`fmul` sequence is generated,
as UseSVE defaults to 0 in this configuration [2].
For MaxVectorSize = 16: the `fadda` instruction is generated for
`AddReductionVHF`, which results in significant performance gains. For
`MulReductionVHF`, the scalarized `fmul` sequence is generated.

**Testing:**
hotspot_all, jdk (tiers 1-3), and langtools (tier 1) all pass on N1/V1/V2.

[1] https://github.com/openjdk/jdk/blob/a272696813f2e5e896ac9de9985246aaeb9d476c/src/hotspot/share/opto/superword.cpp#L1677
[2] https://github.com/openjdk/jdk/blob/a272696813f2e5e896ac9de9985246aaeb9d476c/src/hotspot/cpu/aarch64/vm_version_aarch64.cpp#L479

-------------

Commit messages:
 - 8366444: Add support for add/mul reduction operations for Float16

Changes: https://git.openjdk.org/jdk/pull/27526/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=27526&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8366444
  Stats: 500 lines in 12 files changed: 421 ins; 2 del; 77 mod
  Patch: https://git.openjdk.org/jdk/pull/27526.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/27526/head:pull/27526

PR: https://git.openjdk.org/jdk/pull/27526
