On Thu, 11 Jun 2026 07:37:30 GMT, Eric Fang <[email protected]> wrote:

>> Vector API `lanewise BITWISE_BLEND` on AArch64 is currently lowered to a 
>> generic vector sequence built from `(XorV(AndV(XorV)))` nodes. AArch64 
>> provides a more efficient mapping for this operation through the NEON `BSL` 
>> and SVE `BSL` (bitwise select) instructions.
>> 
>> This change teaches C2 to recognize the `BITWISE_BLEND` patterns and lower 
>> them to the dedicated AArch64 instructions for better performance.
>> 
>> The change includes the AArch64 match rules and assembler support, updates 
>> the AArch64 asm tests, adds IR framework nodes for the new mach 
>> instructions, introduces a new jtreg IR test and extends the MaskedLogicOpts 
>> JMH benchmark for 128-bit long type.
>> 
>> JMH results show **11% - 54%** performance improvements for the optimized 
>> cases, and all jtreg tests (tier1, tier2 and tier3) passe on SVE2, SVE1, and 
>> NEON configurations.
>> 
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:
>> 
>> Benchmark                        Unit        ARRAYLEN Before     Error    
>> After          Error       Uplift
>> bitwiseBlendOperationInt128      ops/s       256.00   3787.49    5.29     
>> 4277.64    8.89    1.13
>> bitwiseBlendOperationInt128      ops/s       512.00   1888.24    11.02    
>> 2143.21    6.32    1.14
>> bitwiseBlendOperationInt128      ops/s       1024.00  938.22     6.24     
>> 1053.45    14.68   1.12
>> bitwiseBlendOperationLong128 ops/s   256.00   1895.45    13.68    2140.31    
>> 3.68    1.13
>> bitwiseBlendOperationLong128 ops/s   512.00   938.71     5.32     1052.16    
>> 14.07   1.12
>> bitwiseBlendOperationLong128 ops/s   1024.00  474.15     2.33     526.49     
>>     2.62        1.11
>> 
>> 
>> On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:
>> 
>> Benchmark                        Unit        ARRAYLEN Before     Error    
>> After          Error       Uplift
>> bitwiseBlendOperationInt128      ops/s       256.00   2051.52    13.85    
>> 2481.44    0.27    1.21
>> bitwiseBlendOperationInt128      ops/s       512.00   995.47     20.77    
>> 1235.10    5.70    1.24
>> bitwiseBlendOperationInt128      ops/s       1024.00  507.73     9.83     
>> 617.59         2.43        1.22
>> bitwiseBlendOperationLong128 ops/s   256.00   1000.99    21.50    1235.39    
>> 5.48    1.23
>> bitwiseBlendOperationLong128 ops/s   512.00   507.73     9.74     617.67     
>>     2.32        1.22
>> bitwiseBlendOperationLong128 ops/s   1024.00  258.86     0.01     310.70     
>>     0.04        1.20
>> 
>> 
>> On a Nvidia Grace (Neoverse-V2) machine with 128-bit NEON:
>> 
>> Benchmark                        Unit        ARRAYLEN Before     Error    
>> After          Error       Uplift
>> bitwiseBlendOperationInt128      ops/s       256.00   2336.17    13.18    
>> 3505.19    19.61   1.50
>> bitwiseBlendOperationInt128      ops/s       512.00   1145.50 ...
>
> Eric Fang has updated the pull request incrementally with one additional 
> commit since the last revision:
> 
>   Add a movprfx optimization entry point for the newly added sve_bsl 
> instruction

I ran a full round of JTreg and JMH tests on this patch locally. All JTReg 
tests passed. After adding `movprfx` fusion support to the `BSL` instruction, 
the performance improvement of this PR is even greater. For example, on a 
128-bit SVE2 grace machine, the current performance improvement is as follows:
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40";>

<head>

<meta name=ProgId content=Excel.Sheet>
<meta name=Generator content="Microsoft Excel 15">
<link id=Main-File rel=Main-File
href="file:////Users/erfang/Library/Group%20Containers/UBF8T346G9.Office/TemporaryItems/msohtmlclip/clip.htm">
<link rel=File-List
href="file:////Users/erfang/Library/Group%20Containers/UBF8T346G9.Office/TemporaryItems/msohtmlclip/clip_filelist.xml">

</head>

<body link="#467886" vlink="#96607D">


Benchmark | ARRAYLEN | Unit | Before | Error | After | Error | Uplift
-- | -- | -- | -- | -- | -- | -- | --
bitwiseBlendOperationInt128 | 256 | ops/s | 3709.10 | 3.47 | 5801.07 | 3.99 | 
1.56
bitwiseBlendOperationInt128 | 512 | ops/s | 1850.16 | 15.67 | 2911.17 | 6.21 | 
1.57
bitwiseBlendOperationInt128 | 1024 | ops/s | 913.95 | 8.45 | 1423.66 | 34.04 | 
1.56
bitwiseBlendOperationLong128 | 256 | ops/s | 1850.36 | 12.54 | 2913.70 | 2.19 | 
1.57
bitwiseBlendOperationLong128 | 512 | ops/s | 920.63 | 5.76 | 1433.99 | 26.71 | 
1.56
bitwiseBlendOperationLong128 | 1024 | ops/s | 464.38 | 1.38 | 709.36 | 9.56 | 
1.53



</body>

</html>

-------------

PR Comment: https://git.openjdk.org/jdk/pull/31269#issuecomment-4686625641

Reply via email to