Ping.

Thanks,
Soumya

> On 29 Jan 2026, at 4:15 PM, Alex Coplan <[email protected]> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Hi both,
> 
> I'm looking at this and will aim to get back to you soon.  Sorry for not
> getting to this sooner.
> 
> Alex
> 
> On 28/01/2026 05:26, Soumya AR wrote:
>> Ping.
>> 
>> Thanks,
>> Soumya
>> 
>>> On 20 Jan 2026, at 3:59 PM, Kyrylo Tkachov <[email protected]> wrote:
>>> 
>>> 
>>> 
>>>> On 20 Jan 2026, at 10:06, Kyrylo Tkachov <[email protected]> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]> 
>>>>> wrote:
>>>>> 
>>>>> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
>>>>>> 
>>>>>> Ping.
>>>>>> 
>>>>>> I split the files from the previous mail so it's hopefully easier to 
>>>>>> review.
>>>>> 
>>>>> I can review this, but approval won't come until stage 1. This
>>>>> pass is too risky at this point in the release cycle.
>>>> 
>>>> Thanks for any feedback you can give. FWIW we’ve been testing this 
>>>> internally for a few months without any issues.
>>> 
>>> One option to reduce the risk, which Soumya’s initial patch implemented,
>>> was to enable this only for -mcpu=olympus. We initially developed and
>>> tested it on that target. That way it wouldn’t affect most aarch64
>>> targets, and we’d still have the -mno-* option as a workaround for users
>>> to disable it if it causes trouble.
>>> Would that be okay with you?
>>> Thanks,
>>> Kyrill
>>> 
>>>> 
>>>>> 
>>>>> Though I also wonder how much of this can/should be done on the gimple
>>>>> level in a generic way.
>>>> 
>>>> GIMPLE does have powerful ranger infrastructure for this, but I was
>>>> concerned about doing this earlier because some later pass could
>>>> introduce extra extend operations, which would likely undo the benefit
>>>> of the narrowing.
>>>> 
>>>> Thanks,
>>>> Kyrill
>>>> 
>>>>> And if there is a way to get the zero-bits from the gimple level down
>>>>> to the RTL level still so we don't need to keep on recomputing them
>>>>> (this is useful for other passes too).
>>>>> 
>>>>> Thanks,
>>>>> Andrew Pinski
>>>>> 
>>>>>> 
>>>>>> Also CC'ing Alex Coplan to this thread.
>>>>>> 
>>>>>> Thanks,
>>>>>> Soumya
>>>>>> 
>>>>>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
>>>>>>> 
>>>>>>> Hi Tamar,
>>>>>>> 
>>>>>>> Attaching an updated version of this patch that enables the pass at O2 
>>>>>>> and above
>>>>>>> on aarch64, and can be optionally disabled with -mno-narrow-gp-writes.
>>>>>>> 
>>>>>>> Enabling it by default at O2 touched quite a large number of tests, 
>>>>>>> which I
>>>>>>> have updated in this patch.
>>>>>>> 
>>>>>>> Most of the updates are straightforward: they change x registers to
>>>>>>> (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
>>>>>>> 
>>>>>>> There are some tests (e.g., aarch64/int_mov_immediate_1.c) where the
>>>>>>> representation of the immediate changes:
>>>>>>> 
>>>>>>>     mov w0, 4294927974 -> mov w0, -39322
>>>>>>> 
>>>>>>> This happens because the following RTL is narrowed to SImode:
>>>>>>> 
>>>>>>>     (set (reg/i:DI 0 x0)
>>>>>>>          (const_int 4294927974 [0xffff6666]))
>>>>>>> 
>>>>>>> After narrowing, the MSB becomes bit 31, which is set, so the
>>>>>>> immediate is printed as a signed value.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Soumya
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Ping.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Soumya
>>>>>>>> 
>>>>>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
>>>>>>>>> 
>>>>>>>>> This patch adds a new AArch64 RTL pass that optimizes 64-bit
>>>>>>>>> general purpose register operations to use 32-bit W-registers when the
>>>>>>>>> upper 32 bits of the register are known to be zero.
>>>>>>>>> 
>>>>>>>>> This benefits the Olympus core, which prefers 32-bit W-registers
>>>>>>>>> over 64-bit X-registers where possible, as recommended by the
>>>>>>>>> updated Olympus Software Optimization Guide, which will be
>>>>>>>>> published soon.
>>>>>>>>> 
>>>>>>>>> The pass is controlled with -mnarrow-gp-writes and runs at -O2 and
>>>>>>>>> above, but it is enabled by default only for -mcpu=olympus.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W 
>>>>>>>>> register
>>>>>>>>> that maps to its lower half.  When we can guarantee that the upper 32 
>>>>>>>>> bits
>>>>>>>>> are never used, we can safely narrow operations to use W registers 
>>>>>>>>> instead.
>>>>>>>>> 
>>>>>>>>> For example, this code:
>>>>>>>>> 
>>>>>>>>>     uint64_t foo(uint64_t a) {
>>>>>>>>>       return (a & 255) + 3;
>>>>>>>>>     }
>>>>>>>>> 
>>>>>>>>> currently compiles to:
>>>>>>>>> 
>>>>>>>>>     and x8, x0, #0xff
>>>>>>>>>     add x0, x8, #3
>>>>>>>>> 
>>>>>>>>> but with this pass enabled, it optimizes to:
>>>>>>>>> 
>>>>>>>>>     and x8, x0, #0xff
>>>>>>>>>     add w0, w8, #3      // Using W register instead of X
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> The pass operates in two phases:
>>>>>>>>> 
>>>>>>>>> 1) Analysis Phase:
>>>>>>>>> - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>>>>>>>>> - Computes nonzero bit masks for each register definition
>>>>>>>>> - Recursively processes PHI nodes
>>>>>>>>> - Identifies candidates for narrowing
>>>>>>>>> 2) Transformation Phase:
>>>>>>>>> - Applies narrowing to validated candidates
>>>>>>>>> - Converts DImode operations to SImode where safe
>>>>>>>>> 
>>>>>>>>> The pass runs late in the RTL pipeline, after register allocation, to 
>>>>>>>>> ensure
>>>>>>>>> stable def-use chains and avoid interfering with earlier 
>>>>>>>>> optimizations.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> nonzero_bits(src, DImode) is a function defined in rtlanal.cc that 
>>>>>>>>> recursively
>>>>>>>>> analyzes RTL expressions to compute a bitmask. However, nonzero_bits 
>>>>>>>>> has a
>>>>>>>>> limitation: when it encounters a register, it conservatively returns 
>>>>>>>>> the mode
>>>>>>>>> mask (all bits potentially set). Since this pass analyzes all defs in 
>>>>>>>>> an
>>>>>>>>> instruction, this information can be used to refine the mask. The 
>>>>>>>>> pass maintains
>>>>>>>>> a hash map of computed bit masks and installs a custom RTL hooks 
>>>>>>>>> callback
>>>>>>>>> to consult this mask when encountering a register.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> PHI nodes require special handling to merge masks from all inputs.
>>>>>>>>> This is done by combine_mask_from_phi. Three cases are handled:
>>>>>>>>> 1. Input edge has a definition: This is the simplest case. For
>>>>>>>>> each input edge to the PHI, the def information is retrieved and
>>>>>>>>> its mask is looked up.
>>>>>>>>> 2. Input edge has no definition: A conservative mask is assumed
>>>>>>>>> for that input.
>>>>>>>>> 3. Input edge is a PHI: combine_mask_from_phi is called
>>>>>>>>> recursively to merge the masks of all incoming values.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> When processing regular instructions, the pass handles single SET
>>>>>>>>> patterns and PARALLEL patterns that contain compare instructions.
>>>>>>>>> 
>>>>>>>>> Single SET instructions:
>>>>>>>>> 
>>>>>>>>> If the upper 32 bits of the source are known to be zero, then the 
>>>>>>>>> instruction
>>>>>>>>> qualifies for narrowing. Instead of just using lowpart_subreg for the 
>>>>>>>>> source,
>>>>>>>>> we define narrow_dimode_src to attempt further optimizations:
>>>>>>>>> 
>>>>>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via 
>>>>>>>>> simplify_gen_binary
>>>>>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
>>>>>>>>> 
>>>>>>>>> PARALLEL Instructions (Compare + SET):
>>>>>>>>> 
>>>>>>>>> The pass handles flag-setting operations (ADDS, SUBS, ANDS, etc.)
>>>>>>>>> where the SET source equals the first operand of the COMPARE.
>>>>>>>>> Depending on the condition-code mode of the compare, the pass
>>>>>>>>> checks that the required bits are zero:
>>>>>>>>> 
>>>>>>>>> - CC_Zmode/CC_NZmode: Upper 32 bits
>>>>>>>>> - CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
>>>>>>>>> 
>>>>>>>>> If the instruction does not match the above patterns (or matches
>>>>>>>>> but cannot be optimized), the pass still analyzes all its
>>>>>>>>> definitions, so that every definition has an entry in nzero_map.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> When transforming the qualified instructions, the pass uses 
>>>>>>>>> rtl_ssa::recog and
>>>>>>>>> rtl_ssa::change_is_worthwhile to verify the new pattern and determine 
>>>>>>>>> if the
>>>>>>>>> transformation is worthwhile.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> As an additional benefit, testing on Neoverse-V2 shows that instances 
>>>>>>>>> of
>>>>>>>>> 'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
>>>>>>>>> instructions after this pass narrows them.
>>>>>>>>> 
>>>>>>>>> ---
>>>>>>>>> 
>>>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu with
>>>>>>>>> no regressions.
>>>>>>>>> OK for mainline?
>>>>>>>>> 
>>>>>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
>>>>>>>>> Signed-off-by: Soumya AR <[email protected]>
>>>>>>>>> 
>>>>>>>>> gcc/ChangeLog:
>>>>>>>>> 
>>>>>>>>> * config.gcc: Add aarch64-narrow-gp-writes.o.
>>>>>>>>> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
>>>>>>>>> pass_narrow_gp_writes before pass_cleanup_barriers.
>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def 
>>>>>>>>> (AARCH64_EXTRA_TUNING_OPTION):
>>>>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
>>>>>>>>> * config/aarch64/tuning_models/olympus.h:
>>>>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
>>>>>>>>> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): 
>>>>>>>>> Declare.
>>>>>>>>> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
>>>>>>>>> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
>>>>>>>>> * doc/invoke.texi: Document -mnarrow-gp-writes.
>>>>>>>>> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
>>>>>>>>> 
>>>>>>>>> gcc/testsuite/ChangeLog:
>>>>>>>>> 
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
>> 
>> 
