On Thu, 4 Jun 2026 07:46:10 GMT, Eric Fang <[email protected]> wrote:

>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>> provides a more efficient mapping for this operation through the NEON `BSL`
>> and SVE `BSL` (bitwise select) instructions.
>>
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower
>> them to the dedicated AArch64 instructions for better performance.
>>
>> The change includes the AArch64 match rules and assembler support, updates
>> the AArch64 asm tests, adds IR framework nodes for the new mach
>> instructions, introduces a new jtreg IR test and extends the MaskedLogicOpts
>> JMH benchmark for 128-bit long type.
>>
>> JMH results show **11% - 54%** performance improvements for the optimized
>> cases, and all jtreg tests (tier1, tier2 and tier3) pass on SVE2, SVE1, and
>> NEON configurations.
>>
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>> bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>> bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>> bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>> bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>> bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>>
>>
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2051.52  13.85  2481.44  0.27   1.21
>> bitwiseBlendOperationInt128   ops/s  512.00    995.47   20.77  1235.10  5.70   1.24
>> bitwiseBlendOperationInt128   ops/s  1024.00   507.73   9.83   617.59   2.43   1.22
>> bitwiseBlendOperationLong128  ops/s  256.00    1000.99  21.50  1235.39  5.48   1.23
>> bitwiseBlendOperationLong128  ops/s  512.00    507.73   9.74   617.67   2.32   1.22
>> bitwiseBlendOperationLong128  ops/s  1024.00   258.86   0.01   310.70   0.04   1.20
>>
>>
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>>
>> Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>> bitwiseBlendOperationInt128   ops/s  256.00    2336.17  13.18  3505.19  19.61  1.50
>> bitwiseBlendOperationInt128   ops/s  512.00    1145.50 ...
>
> Eric Fang has updated the pull request with a new target base due to a merge
> or a rebase. The incremental webrev excludes the unrelated changes brought in
> by the merge/rebase. The pull request contains three additional commits since
> the last revision:
>
>  - Implement bitwise_blend in IGVN
>
>    The latest changes:
>
>    1. Defined a new IR `VectorBitwiseBlendNode`
>    2. Do the optimization in IGVN:
>        // XorV(a, AndV(sel, XorV(a, b))) => VectorBitwiseBlend(a, b, sel)
>        // XorV(a, AndV(sel, XorV(a, b)), mask) =>
>        //     VectorBlend(a, VectorBitwiseBlend(a, b, sel), mask)
>
>    3. Adjust the ad file match rules to match `VectorBitwiseBlendNode`.
>    4. Adjust the JTReg tests to check `VectorBitwiseBlendNode`.
>  - Merge branch 'master' into JDK-8382052-bitwise-blend
>  - 8382052: VectorAPI: AArch64: Optimize the lanewise BITWISE_BLEND operation
>    with BSL
>
>    Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a
>    generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64
>    provides a more efficient mapping for this operation through the NEON
>    `BSL` and SVE `BSL` (bitwise select) instructions.
>
>    This change teaches C2 to recognize the `BITWISE_BLEND` patterns and
>    lower them to the dedicated AArch64 instructions for better performance.
>
>    The change includes the AArch64 match rules and assembler support,
>    updates the AArch64 asm tests, adds IR framework nodes for the new mach
>    instructions, introduces a new jtreg IR test and extends the
>    MaskedLogicOpts JMH benchmark for 128-bit long type.
>
>    JMH results show **11% - 54%** performance improvements for the
>    optimized cases, and all jtreg tests (tier1, tier2 and tier3) pass on
>    SVE2, SVE1, and NEON configurations.
>
>    On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>    ```
>    Benchmark                     Unit   ARRAYLEN  Before   Error  After    Error  Uplift
>    bitwiseBlendOperationInt128   ops/s  256.00    3787.49  5.29   4277.64  8.89   1.13
>    bitwiseBlendOperationInt128   ops/s  512.00    1888.24  11.02  2143.21  6.32   1.14
>    bitwiseBlendOperationInt128   ops/s  1024.00   938.22   6.24   1053.45  14.68  1.12
>    bitwiseBlendOperationLong128  ops/s  256.00    1895.45  13.68  2140.31  3.68   1.13
>    bitwiseBlendOperationLong128  ops/s  512.00    938.71   5.32   1052.16  14.07  1.12
>    bitwiseBlendOperationLong128  ops/s  1024.00   474.15   2.33   526.49   2.62   1.11
>    ```
>
>    On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>    ```
>    Benchmar...

Thanks for the update! Overall, it looks good to me.

src/hotspot/cpu/aarch64/aarch64_vector.ad line 320:

> 318: }
> 319: break;
> 320: case Op_VectorBitwiseBlend:

Since SVE doesn't provide a predicated vector instruction for this operation, I suggest adding this op after this line:
https://github.com/openjdk/jdk/blob/1c1a13085620a604b5668df4a26f4b09a704ceb5/src/hotspot/cpu/aarch64/aarch64_vector.ad#L342

Although the method `match_rule_supported_vector_masked` is not called when this IR node is created in the mid-end, it's better to list it as well in case it is used in the future.

test/hotspot/jtreg/compiler/vectorapi/VectorBitwiseBlendTest.java line 29:

> 27: * @key randomness
> 28: * @library /test/lib /
> 29: * @summary IR tests for AArch64 BITWISE_BLEND optimization match rules

Do we need to update the summary, since this is now a mid-end level IR test?

-------------

PR Review: https://git.openjdk.org/jdk/pull/31269#pullrequestreview-4454825428
PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3377539279
PR Review Comment: https://git.openjdk.org/jdk/pull/31269#discussion_r3377608027
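
For readers following the IGVN rewrite quoted above, here is a minimal scalar sketch of why `XorV(a, AndV(sel, XorV(a, b)))` can be treated as a bitwise select. It is not code from the PR: the class name `BitwiseBlendSketch` and its methods are hypothetical, it models each vector lane as a plain `long`, and the masked variant assumes that lanes with the mask unset keep the value of `a`.

```java
import java.util.Random;

// Illustrative only: names are hypothetical and not taken from the PR.
public class BitwiseBlendSketch {

    // XOR/AND form from the quoted pattern: a ^ (sel & (a ^ b)).
    static long blendXorForm(long a, long b, long sel) {
        return a ^ (sel & (a ^ b));
    }

    // Select form: take b's bit where sel is 1, a's bit where sel is 0.
    static long blendSelectForm(long a, long b, long sel) {
        return (a & ~sel) | (b & sel);
    }

    // Assumed semantics of the masked pattern: lanes with the mask set
    // receive the blended value, all other lanes keep a.
    static long[] maskedBlend(long[] a, long[] b, long[] sel, boolean[] mask) {
        long[] result = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = mask[i] ? blendXorForm(a[i], b[i], sel[i]) : a[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Spot-check that both forms agree on random inputs.
        Random rnd = new Random(42);
        for (int i = 0; i < 1_000; i++) {
            long a = rnd.nextLong(), b = rnd.nextLong(), sel = rnd.nextLong();
            if (blendXorForm(a, b, sel) != blendSelectForm(a, b, sel)) {
                throw new AssertionError("forms disagree");
            }
        }
        // Low byte comes from b, high byte from a.
        System.out.printf("0x%04X%n", blendXorForm(0xF0F0L, 0x0F0FL, 0x00FFL));
    }
}
```

The select form is the per-bit behaviour that a bitwise select instruction such as `BSL` provides, which is why a single node for the whole pattern is a natural target for the match rules.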
