Ping.

Thanks,
Soumya

> On 29 Jan 2026, at 4:15 PM, Alex Coplan <[email protected]> wrote:
>
> External email: Use caution opening links or attachments
>
>
> Hi both,
>
> I'm looking at this and will aim to get back to you soon. Sorry for not
> getting to this sooner.
>
> Alex
>
> On 28/01/2026 05:26, Soumya AR wrote:
>> Ping.
>>
>> Thanks,
>> Soumya
>>
>>> On 20 Jan 2026, at 3:59 PM, Kyrylo Tkachov <[email protected]> wrote:
>>>
>>>
>>>
>>>> On 20 Jan 2026, at 10:06, Kyrylo Tkachov <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> On 20 Jan 2026, at 05:23, Andrew Pinski <[email protected]>
>>>>> wrote:
>>>>>
>>>>> On Mon, Jan 19, 2026 at 8:12 PM Soumya AR <[email protected]> wrote:
>>>>>>
>>>>>> Ping.
>>>>>>
>>>>>> I split the files from the previous mail so it's hopefully easier to
>>>>>> review.
>>>>>
>>>>> I can review this but the approval won't be for until stage1. This
>>>>> pass at this point is too risky for this point of the release cycle.
>>>>
>>>> Thanks for any feedback you can give. FWIW we’ve been testing this
>>>> internally for a few months without any issues.
>>>
>>> One option to reduce the risk that Soumya’s initial patch implemented was
>>> to enable this only for -mcpu=olympus. We initially developed and tested it
>>> on that target.
>>> So that way it wouldn’t affect most aarch64 targets and we’d still have the
>>> -mno-* option to disable it as a workaround for users if it causes trouble.
>>> Would that be okay with you?
>>> Thanks,
>>> Kyrill
>>>
>>>>
>>>>>
>>>>> Though I also wonder how much of this can/should be done on the gimple
>>>>> level in a generic way.
>>>>
>>>> GIMPLE does have powerful ranger infrastructure for this, but I was
>>>> concerned about doing this earlier because it’s very likely that some
>>>> later pass could introduce extra extend operations, which would likely
>>>> undo the benefit of the narrowing.
>>>>
>>>> Thanks,
>>>> Kyrill
>>>>
>>>>> And if there is a way to get the zero-bits from the gimple level down
>>>>> to the RTL level still so we don't need to keep on recomputing them
>>>>> (this is useful for other passes too).
>>>>>
>>>>> Thanks,
>>>>> Andrew Pinski
>>>>>
>>>>>>
>>>>>> Also CC'ing Alex Coplan to this thread.
>>>>>>
>>>>>> Thanks,
>>>>>> Soumya
>>>>>>
>>>>>>> On 12 Jan 2026, at 12:42 PM, Soumya AR <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Tamar,
>>>>>>>
>>>>>>> Attaching an updated version of this patch that enables the pass at O2
>>>>>>> and above on aarch64, and can be optionally disabled with
>>>>>>> -mno-narrow-gp-writes.
>>>>>>>
>>>>>>> Enabling it by default at O2 touched quite a large number of tests,
>>>>>>> which I have updated in this patch.
>>>>>>>
>>>>>>> Most of the updates are straightforward, which involve changing x
>>>>>>> registers to (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
>>>>>>>
>>>>>>> There are some tests (eg. aarch64/int_mov_immediate_1.c) where the
>>>>>>> representation of the immediate changes:
>>>>>>>
>>>>>>>     mov w0, 4294927974 -> mov w0, -39322
>>>>>>>
>>>>>>> This is because when the following RTL is narrowed to SI:
>>>>>>>     (set (reg/i:DI 0 x0)
>>>>>>>          (const_int 4294927974 [0xffff6666]))
>>>>>>>
>>>>>>> Due to the MSB changing to Bit 31, which is set, the output is printed
>>>>>>> as signed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Soumya
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> On 1 Dec 2025, at 2:03 PM, Soumya AR <[email protected]> wrote:
>>>>>>>>
>>>>>>>> External email: Use caution opening links or attachments
>>>>>>>>
>>>>>>>>
>>>>>>>> Ping.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Soumya
>>>>>>>>
>>>>>>>>> On 13 Nov 2025, at 11:43 AM, Soumya AR <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> AArch64: Add RTL pass to narrow 64-bit GP reg writes to 32-bit
>>>>>>>>>
>>>>>>>>> This patch adds a new AArch64 RTL pass that optimizes 64-bit
>>>>>>>>> general purpose register operations to use 32-bit W-registers when the
>>>>>>>>> upper 32 bits of the register are known to be zero.
>>>>>>>>>
>>>>>>>>> This is beneficial for the Olympus core, which benefits from using
>>>>>>>>> 32-bit W-registers over 64-bit X-registers if possible. This is
>>>>>>>>> recommended by the updated Olympus Software Optimization Guide, which
>>>>>>>>> will be published soon.
>>>>>>>>>
>>>>>>>>> This pass can be controlled with -mnarrow-gp-writes and is active at
>>>>>>>>> -O2 and above, but not enabled by default, except for -mcpu=olympus.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> In AArch64, each 64-bit X register has a corresponding 32-bit W
>>>>>>>>> register that maps to its lower half. When we can guarantee that the
>>>>>>>>> upper 32 bits are never used, we can safely narrow operations to use W
>>>>>>>>> registers instead.
>>>>>>>>>
>>>>>>>>> For example, this code:
>>>>>>>>>     uint64_t foo(uint64_t a) {
>>>>>>>>>       return (a & 255) + 3;
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> Currently compiles to:
>>>>>>>>>     and x8, x0, #0xff
>>>>>>>>>     add x0, x8, #3
>>>>>>>>>
>>>>>>>>> But with this pass enabled, it optimizes to:
>>>>>>>>>     and x8, x0, #0xff
>>>>>>>>>     add w0, w8, #3 // Using W register instead of X
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> The pass operates in two phases:
>>>>>>>>>
>>>>>>>>> 1) Analysis Phase:
>>>>>>>>>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>>>>>>>>>    - Computes nonzero bit masks for each register definition
>>>>>>>>>    - Recursively processes PHI nodes
>>>>>>>>>    - Identifies candidates for narrowing
>>>>>>>>> 2) Transformation Phase:
>>>>>>>>>    - Applies narrowing to validated candidates
>>>>>>>>>    - Converts DImode operations to SImode where safe
>>>>>>>>>
>>>>>>>>> The pass runs late in the RTL pipeline, after register allocation, to
>>>>>>>>> ensure stable def-use chains and avoid interfering with earlier
>>>>>>>>> optimizations.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> nonzero_bits(src, DImode) is a function defined in rtlanal.cc that
>>>>>>>>> recursively analyzes RTL expressions to compute a bitmask. However,
>>>>>>>>> nonzero_bits has a limitation: when it encounters a register, it
>>>>>>>>> conservatively returns the mode mask (all bits potentially set). Since
>>>>>>>>> this pass analyzes all defs in an instruction, this information can be
>>>>>>>>> used to refine the mask. The pass maintains a hash map of computed bit
>>>>>>>>> masks and installs a custom RTL hooks callback to consult this mask
>>>>>>>>> when encountering a register.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> PHI nodes require special handling to merge masks from all inputs.
>>>>>>>>> This is done by combine_mask_from_phi. 3 cases are tackled here:
>>>>>>>>> 1. Input Edge has a Definition: This is the simplest case. For each
>>>>>>>>>    input edge to the PHI, the def information is retrieved and its
>>>>>>>>>    mask is looked up.
>>>>>>>>> 2. Input Edge has no Definition: A conservative mask is assumed for
>>>>>>>>>    that input.
>>>>>>>>> 3. Input Edge is a PHI: Recursively call combine_mask_from_phi to
>>>>>>>>>    merge the masks of all incoming values.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> When processing regular instructions, the pass first tackles SET and
>>>>>>>>> PARALLEL patterns with compare instructions.
>>>>>>>>>
>>>>>>>>> Single SET instructions:
>>>>>>>>>
>>>>>>>>> If the upper 32 bits of the source are known to be zero, then the
>>>>>>>>> instruction qualifies for narrowing. Instead of just using
>>>>>>>>> lowpart_subreg for the source, we define narrow_dimode_src to attempt
>>>>>>>>> further optimizations:
>>>>>>>>>
>>>>>>>>> - Bitwise operations (AND/OR/XOR/ASHIFT): simplified via
>>>>>>>>>   simplify_gen_binary
>>>>>>>>> - IF_THEN_ELSE: simplified via simplify_gen_ternary
>>>>>>>>>
>>>>>>>>> PARALLEL Instructions (Compare + SET):
>>>>>>>>>
>>>>>>>>> The pass tackles flag-setting operations (ADDS, SUBS, ANDS, etc.)
>>>>>>>>> where the SET source equals the first operand of the COMPARE.
>>>>>>>>> Depending on the condition code for the compare, the pass checks for
>>>>>>>>> the required bits to be zero:
>>>>>>>>>
>>>>>>>>> - CC_Zmode/CC_NZmode: Upper 32 bits
>>>>>>>>> - CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
>>>>>>>>>
>>>>>>>>> If the instruction does not match the above patterns (or matches but
>>>>>>>>> cannot be optimized), the pass still analyzes all its definitions, so
>>>>>>>>> that every definition has an entry in nzero_map.
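
To make the mask analysis described above concrete, here is a minimal standalone sketch in plain C. It is not code from the patch: every name in it (nz_and, nz_phi, set_can_narrow, cmp_set_can_narrow, the cc_kind enum) is hypothetical, and it only models the rules as the description states them, with a "nonzero mask" meaning the set of bits that may be nonzero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these come from the patch.  */
#define UPPER32_MASK UINT64_C(0xffffffff00000000)
#define BIT31_MASK   UINT64_C(0x0000000080000000)

/* Stand-ins for the CC modes named in the description.  */
enum cc_kind { CC_KIND_Z, CC_KIND_NZ, CC_KIND_NZV };

/* Possibly-nonzero bits of (a & b): a bit can be set in the result
   only if it can be set in both operands.  */
static uint64_t nz_and(uint64_t nz_a, uint64_t nz_b)
{
  return nz_a & nz_b;
}

/* PHI merge: the result may have any bit that any input may have.
   An input without a known definition gets the conservative
   all-ones mask.  */
static uint64_t nz_phi(const uint64_t *masks, const bool *has_def, int n)
{
  uint64_t merged = 0;
  for (int i = 0; i < n; i++)
    merged |= has_def[i] ? masks[i] : UINT64_MAX;
  return merged;
}

/* Single SET: narrowing qualifies when the upper 32 bits are known zero.  */
static bool set_can_narrow(uint64_t nz)
{
  return (nz & UPPER32_MASK) == 0;
}

/* Compare + SET: the Z/NZ cases need the upper 32 bits to be zero;
   the NZV case additionally needs bit 31 to be zero.  */
static bool cmp_set_can_narrow(uint64_t nz, enum cc_kind cc)
{
  uint64_t must_be_zero = UPPER32_MASK;
  if (cc == CC_KIND_NZV)
    must_be_zero |= BIT31_MASK;
  return (nz & must_be_zero) == 0;
}
```

For the `(a & 255) + 3` example, `nz_and(UINT64_MAX, 0xff)` yields 0xff even though nothing is known about x0, so a consumer of that value passes the upper-32-bits check. The real pass works on RTL and consults nonzero_bits; this sketch only illustrates the bit arithmetic.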
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> When transforming the qualified instructions, the pass uses
>>>>>>>>> rtl_ssa::recog and rtl_ssa::change_is_worthwhile to verify the new
>>>>>>>>> pattern and determine if the transformation is worthwhile.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> As an additional benefit, testing on Neoverse-V2 shows that instances
>>>>>>>>> of 'and x1, x2, #0xffffffff' are converted to zero-latency
>>>>>>>>> 'mov w1, w2' instructions after this pass narrows them.
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>>>>>>>>> regression.
>>>>>>>>> OK for mainline?
>>>>>>>>>
>>>>>>>>> Co-authored-by: Kyrylo Tkachov <[email protected]>
>>>>>>>>> Signed-off-by: Soumya AR <[email protected]>
>>>>>>>>>
>>>>>>>>> gcc/ChangeLog:
>>>>>>>>>
>>>>>>>>> * config.gcc: Add aarch64-narrow-gp-writes.o.
>>>>>>>>> * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
>>>>>>>>> pass_narrow_gp_writes before pass_cleanup_barriers.
>>>>>>>>> * config/aarch64/aarch64-tuning-flags.def
>>>>>>>>> (AARCH64_EXTRA_TUNING_OPTION):
>>>>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES.
>>>>>>>>> * config/aarch64/tuning_models/olympus.h:
>>>>>>>>> Add AARCH64_EXTRA_TUNE_NARROW_GP_WRITES to tune_flags.
>>>>>>>>> * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes):
>>>>>>>>> Declare.
>>>>>>>>> * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
>>>>>>>>> * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
>>>>>>>>> * doc/invoke.texi: Document -mnarrow-gp-writes.
>>>>>>>>> * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
>>>>>>>>>
>>>>>>>>> gcc/testsuite/ChangeLog:
>>>>>>>>>
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
>>>>>>>>> * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <0001-AArch64-Add-RTL-pass-to-narrow-64-bit-GP-reg-writes-.patch>
>>
>>
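
On the immediate change quoted above for int_mov_immediate_1.c (4294927974 becoming -39322): the arithmetic can be checked with a small standalone C sketch. The helper name simode_print_value is hypothetical and this is not how GCC prints constants internally; it only reinterprets the low 32 bits of the 64-bit constant as a signed 32-bit value, which is the effect the 12 Jan message describes.

```c
#include <stdint.h>

/* Hypothetical helper, not from the patch: take the low 32 bits of a
   64-bit constant and return them as a signed 32-bit quantity, i.e.
   with bit 31 treated as the sign bit.  */
static int64_t simode_print_value(uint64_t di_const)
{
  uint32_t low = (uint32_t) di_const;

  /* Sign-extend from bit 31 by hand, avoiding implementation-defined
     narrowing conversions.  */
  if (low & UINT32_C(0x80000000))
    return (int64_t) low - (INT64_C(1) << 32);
  return (int64_t) low;
}
```

With 0xffff6666 as input, bit 31 is set, so the result is 0xffff6666 - 2^32 = -39322; a constant such as 255, with bit 31 clear, is unchanged.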
