The Vector API `lanewise BITWISE_BLEND` operation on AArch64 is currently
lowered to a generic vector sequence built from `(XorV(AndV(XorV)))` nodes,
although AArch64 provides a more efficient mapping for this operation through
the NEON `BSL` and SVE `BSL` (bitwise select) instructions.
This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower them
to the dedicated AArch64 instructions for better performance.
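
For readers unfamiliar with the pattern, here is a minimal scalar sketch of why the `(XorV(AndV(XorV)))` shape is a bitwise select. It assumes `BITWISE_BLEND` computes `a ^ ((a ^ b) & c)`, which is how I read the `VectorOperators.BITWISE_BLEND` definition. The class and method names are hypothetical and are not part of this change, and the sketch says nothing about how operands are assigned to the `BSL` registers.

```java
public class BitwiseBlendSketch {
    // Generic form, mirroring the XorV(AndV(XorV)) node shape:
    // a ^ ((a ^ b) & c)
    static long blendXorAndXor(long a, long b, long c) {
        return a ^ ((a ^ b) & c);
    }

    // Bit-select form: take the bit from b where c is 1,
    // and the bit from a where c is 0.
    static long blendSelect(long a, long b, long c) {
        return (b & c) | (a & ~c);
    }

    public static void main(String[] args) {
        long a = 0b1100L, b = 0b1010L, c = 0b0110L;
        // Both forms print 1010 for these inputs.
        System.out.println(Long.toBinaryString(blendXorAndXor(a, b, c)));
        System.out.println(Long.toBinaryString(blendSelect(a, b, c)));
    }
}
```

The two forms agree for every input: where a bit of `c` is 0 the inner term vanishes and `a` passes through, and where it is 1 the expression reduces to `a ^ a ^ b`, which is `b`. That is the per-bit behaviour a single select instruction can provide.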
The change includes the AArch64 match rules and assembler support, updates the
AArch64 asm tests, adds IR framework nodes for the new mach instructions,
introduces a new jtreg IR test, and extends the MaskedLogicOpts JMH benchmark
to cover the 128-bit long type.
JMH results show **11% - 54%** performance improvements for the optimized
cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and
NEON configurations.

On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:

| Benchmark | Unit | ARRAYLEN | Before | Error | After | Error | Uplift |
|---|---|---|---|---|---|---|---|
| bitwiseBlendOperationInt128 | ops/s | 256.00 | 3787.49 | 5.29 | 4277.64 | 8.89 | 1.13 |
| bitwiseBlendOperationInt128 | ops/s | 512.00 | 1888.24 | 11.02 | 2143.21 | 6.32 | 1.14 |
| bitwiseBlendOperationInt128 | ops/s | 1024.00 | 938.22 | 6.24 | 1053.45 | 14.68 | 1.12 |
| bitwiseBlendOperationLong128 | ops/s | 256.00 | 1895.45 | 13.68 | 2140.31 | 3.68 | 1.13 |
| bitwiseBlendOperationLong128 | ops/s | 512.00 | 938.71 | 5.32 | 1052.16 | 14.07 | 1.12 |
| bitwiseBlendOperationLong128 | ops/s | 1024.00 | 474.15 | 2.33 | 526.49 | 2.62 | 1.11 |


On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:

| Benchmark | Unit | ARRAYLEN | Before | Error | After | Error | Uplift |
|---|---|---|---|---|---|---|---|
| bitwiseBlendOperationInt128 | ops/s | 256.00 | 2051.52 | 13.85 | 2481.44 | 0.27 | 1.21 |
| bitwiseBlendOperationInt128 | ops/s | 512.00 | 995.47 | 20.77 | 1235.10 | 5.70 | 1.24 |
| bitwiseBlendOperationInt128 | ops/s | 1024.00 | 507.73 | 9.83 | 617.59 | 2.43 | 1.22 |
| bitwiseBlendOperationLong128 | ops/s | 256.00 | 1000.99 | 21.50 | 1235.39 | 5.48 | 1.23 |
| bitwiseBlendOperationLong128 | ops/s | 512.00 | 507.73 | 9.74 | 617.67 | 2.32 | 1.22 |
| bitwiseBlendOperationLong128 | ops/s | 1024.00 | 258.86 | 0.01 | 310.70 | 0.04 | 1.20 |


On an Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:

| Benchmark | Unit | ARRAYLEN | Before | Error | After | Error | Uplift |
|---|---|---|---|---|---|---|---|
| bitwiseBlendOperationInt128 | ops/s | 256.00 | 2336.17 | 13.18 | 3505.19 | 19.61 | 1.50 |
| bitwiseBlendOperationInt128 | ops/s | 512.00 | 1145.50 | 12.40 | 1735.24 | 10.43 | 1.51 |
| bitwiseBlendOperationInt128 | ops/s | 1024.00 | 571.41 | 6.51 | 866.01 | 3.34 | 1.52 |
| bitwiseBlendOperationLong128 | ops/s | 256.00 | 1140.38 | 13.77 | 1740.28 | 11.16 | 1.53 |
| bitwiseBlendOperationLong128 | ops/s | 512.00 | 570.20 | 7.58 | 865.67 | 3.33 | 1.52 |
| bitwiseBlendOperationLong128 | ops/s | 1024.00 | 280.94 | 2.58 | 432.78 | 0.19 | 1.54 |

---------
- [x] I confirm that I make this contribution in accordance with the [OpenJDK
Interim AI Policy](https://openjdk.org/legal/ai).
-------------
Commit messages:
- 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation
with BSL
Changes: https://git.openjdk.org/jdk/pull/31269/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=31269&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8382052
Stats: 479 lines in 8 files changed: 411 ins; 2 del; 66 mod
Patch: https://git.openjdk.org/jdk/pull/31269.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/31269/head:pull/31269
PR: https://git.openjdk.org/jdk/pull/31269