Hi Soumya,

Thanks for working on this, and thanks for your patience.  Sorry it's
taken me so long to get back to you.  Please see some comments inline
below.

Things I haven't reviewed yet:
 - optimize_compare_arith_insn: I need to look at this in detail but I don't
   want to hold up the rest of the review at this point.  I hope to get to this
   soon.
 - The testsuite changes.  I suppose most of these would go away if we decided
   to just enable this for Olympus initially, anyway.

On 12/01/2026 07:12, Soumya AR wrote:
> Hi Tamar,
> 
> Attaching an updated version of this patch that enables the pass at O2 and
> above on aarch64, and can be optionally disabled with -mno-narrow-gp-writes.
> 
> Enabling it by default at O2 touched quite a large number of tests, which I
> have updated in this patch.
> 
> Most of the updates are straightforward and involve changing x registers to
> (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
> 
> There are some tests (eg. aarch64/int_mov_immediate_1.c) where the
> representation of the immediate changes:
> 
>         mov w0, 4294927974 -> mov w0, -39322
> 
> This is because when the following RTL is narrowed to SImode:
>         (set (reg/i:DI 0 x0)
>                 (const_int 4294927974 [0xffff6666]))
> 
> the MSB becomes bit 31, which is set, so the immediate is printed as a
> signed value.
> 
> Thanks,
> Soumya
> 
> <snip>

> From 3a4de5a91b2ee3975dc633b4af4c8e25c4278545 Mon Sep 17 00:00:00 2001
> From: Soumya AR <[email protected]>
> Date: Tue, 6 Jan 2026 04:40:37 +0000
> Subject: [PATCH] AArch64: Add RTL pass to narrow 64-bit GP reg writes to
>  32-bit
> 
> This patch adds a new AArch64 RTL pass that optimizes 64-bit
> general purpose register operations to use 32-bit W-registers when the
> upper 32 bits of the register are known to be zero.
> 
> This primarily targets the Olympus core, which benefits from using 32-bit
> W-registers over 64-bit X-registers where possible. This is recommended by the
> updated Olympus Software Optimization Guide, which will be published soon.
> 
> This pass is enabled by default at -O2 and above, and can be controlled with
> -mnarrow-gp-writes.
> 
> ----
> 
> In AArch64, each 64-bit X register has a corresponding 32-bit W register
> that maps to its lower half.  When we can guarantee that the upper 32 bits
> are never used, we can safely narrow operations to use W registers instead.
> 
> For example, this code:
>     uint64_t foo (uint64_t a) {
>         return (a & 255) + 3;
>     }
> 
> Currently compiles to:
> 
>       and     x0, x0, 255
>       add     x0, x0, 3
>       ret
> 
> But with this pass enabled, it optimizes to:
> 
>       and     w0, w0, 255
>       add     w0, w0, 3
>       ret
> 
> ----
> 
> The pass operates in two phases:
> 
>  1) Analysis Phase:
>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>    - Computes nonzero bit masks for each register definition
>    - Recursively processes PHI nodes
>    - Identifies candidates for narrowing
>  2) Transformation Phase:
>    - Applies narrowing to validated candidates
>    - Converts DImode operations to SImode where safe
> 
> The pass runs late in the RTL pipeline, after register allocation, to ensure
> stable def-use chains and avoid interfering with earlier optimizations.
> 
> ----
> 
> nonzero_bits (src, DImode) is a function defined in rtlanal.cc that
> recursively analyzes RTL expressions to compute a bitmask of the bits that
> may be nonzero. However, nonzero_bits has a limitation: when it encounters a
> register, it conservatively returns the mode mask (all bits potentially
> set). Since this pass analyzes all defs in an instruction, this information
> can be used to refine the mask. The pass maintains a hash map of computed
> bit masks and installs a custom RTL hooks callback to consult this map when
> encountering a register.
> 
> ----
> 
> PHI nodes require special handling to merge masks from all inputs. This is
> done by combine_mask_from_phi. Three cases are handled:
>    1. Input edge has a definition: This is the simplest case. For each input
>    edge to the PHI, the def information is retrieved and its mask is
>    looked up.
>    2. Input edge has no definition: A conservative mask is assumed for that
>    input.
>    3. Input edge is a PHI: combine_mask_from_phi is called recursively to
>    merge the masks of all incoming values.
> 
> ----
> 
> When processing regular instructions, the pass first tackles SET and PARALLEL
> patterns with compare instructions.
> 
> Single SET instructions:
> 
> If the upper 32 bits of the source are known to be zero, then the instruction
> qualifies for narrowing. Instead of just using lowpart_subreg for the source,
> we define narrow_dimode_src to attempt further optimizations via
> simplify_gen_binary and simplify_gen_ternary.
> 
> PARALLEL Instructions (Compare + SET):
> 
> The pass tackles flag-setting operations (ADDS, SUBS, ANDS, etc.) where the
> SET source equals the first operand of the COMPARE. Depending on the
> condition code for the compare, the pass checks for the required bits to be
> zero:
> 
> - CC_Zmode/CC_NZmode: Upper 32 bits
> - CC_NZVmode: Upper 32 bits and bit 31 (for overflow)
> 
> If the instruction does not match the above patterns (or matches but cannot
> be optimized), the pass still analyzes all its definitions so that every
> definition has an entry in nzero_map.
> 
> ----
> 
> When transforming the qualified instructions, the pass uses rtl_ssa::recog and
> rtl_ssa::change_is_worthwhile to verify the new pattern and determine if the
> transformation is worthwhile.
> 
> ----
> 
> As an additional benefit, testing on Neoverse-V2 shows that instances of
> 'and x1, x2, #0xffffffff' are converted to zero-latency 'mov w1, w2'
> instructions after this pass narrows them.
> 
> ----
> 
> Enabling the pass by default at O2 touches a large number of tests that have
> been updated accordingly.
> 
> Most of the updates are straightforward and involve changing x registers to
> (w|x) registers (e.g., x[0-9]+ -> [wx][0-9]+).
> 
> There are some tests (eg. aarch64/int_mov_immediate_1.c) where the
> representation of the immediate changes:
> 
>       mov     w0, 4294927974 -> mov   w0, -39322
> 
> This is because when the following RTL is narrowed:
>     (set (reg/i:DI 0 x0)
>          (const_int 4294927974 [0xffff6666]))
> 
> the MSB becomes bit 31, which is set, so the immediate is printed as a
> signed value.
> 
> ----
> 
> The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
> OK for mainline?
> 
> Co-authored-by: Kyrylo Tkachov <[email protected]>
> Signed-off-by: Soumya AR <[email protected]>
> 
> gcc/ChangeLog:
> 
>       * config.gcc: Add aarch64-narrow-gp-writes.o.
>       * config/aarch64/aarch64-passes.def (INSERT_PASS_BEFORE): Insert
>       pass_narrow_gp_writes before pass_cleanup_barriers.
>       * config/aarch64/aarch64-protos.h (make_pass_narrow_gp_writes): Declare.
>       * config/aarch64/aarch64.opt (mnarrow-gp-writes): New option.
>       * config/aarch64/t-aarch64: Add aarch64-narrow-gp-writes.o rule.
>       * doc/invoke.texi: Document -mnarrow-gp-writes.
>       * config/aarch64/aarch64-narrow-gp-writes.cc: New file.
> 
> gcc/testsuite/ChangeLog:
> 
>       * gcc.target/aarch64/acle/pr110100.c: Change x register to (x|w).
>       * gcc.target/aarch64/ands_1.c: Likewise.
>       * gcc.target/aarch64/bitfield-bitint-abi-align16.c: Likewise.
>       * gcc.target/aarch64/bitfield-bitint-abi-align8.c: Likewise.
>       * gcc.target/aarch64/chkfeat-1.c: Likewise.
>       * gcc.target/aarch64/chkfeat-2.c: Likewise.
>       * gcc.target/aarch64/cmpbr.c: Likewise.
>       * gcc.target/aarch64/csinc-2.c: Likewise.
>       * gcc.target/aarch64/eh_return-3.c: Likewise.
>       * gcc.target/aarch64/ffs.c: Likewise.
>       * gcc.target/aarch64/gcspopm-1.c: Likewise.
>       * gcc.target/aarch64/gcsss-1.c: Likewise.
>       * gcc.target/aarch64/imm_choice_comparison.c: Likewise.
>       * gcc.target/aarch64/int_mov_immediate_1.c: Likewise.
>       * gcc.target/aarch64/memset-corner-cases-2.c: Likewise.
>       * gcc.target/aarch64/memset-corner-cases.c: Likewise.
>       * gcc.target/aarch64/mops_1.c: Likewise.
>       * gcc.target/aarch64/mops_2.c: Likewise.
>       * gcc.target/aarch64/mops_3.c: Likewise.
>       * gcc.target/aarch64/movk_3.c: Likewise.
>       * gcc.target/aarch64/pr71727.c: Likewise.
>       * gcc.target/aarch64/pr84882.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_b16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_b32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_b64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_b8.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_c16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_c32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_c64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilege_c8.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_b16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_b32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_b64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_b8.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_c16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_c32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_c64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilegt_c8.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_b16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_b32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_b64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_b8.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_c16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_c32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_c64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilele_c8.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_b16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_b32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_b64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_b8.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_c16.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_c32.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_c64.c: Likewise.
>       * gcc.target/aarch64/sme2/acle-asm/whilelt_c8.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/cntb_pat.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/cntd_pat.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/cnth_pat.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/cntw_pat.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/dup_f64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/dup_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/dup_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/dupq_f32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/dupq_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/dupq_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/index_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/index_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1_gather_f32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1_gather_f64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sb_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sb_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sb_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sb_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sh_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sh_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sh_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sh_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sw_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1sw_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1ub_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1ub_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1ub_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1ub_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1uh_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1uh_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1uh_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1uh_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1uw_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ld1uw_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1_gather_f32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1_gather_f64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sb_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sb_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sb_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sb_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sh_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sh_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sh_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sh_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sw_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1sw_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1ub_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1ub_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1ub_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1ub_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1uh_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1uh_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1uh_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1uh_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1uw_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/ldff1uw_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/prfb_gather.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/prfd_gather.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/prfh_gather.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/prfw_gather.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1_scatter_f32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1_scatter_f64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1_scatter_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1_scatter_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1b_scatter_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1b_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1b_scatter_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1b_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1h_scatter_s32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1h_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1h_scatter_u32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1h_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1w_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/st1w_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilele_b16.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilele_b32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilele_b64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilele_b8.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilelt_b16.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilelt_b32.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilelt_b64.c: Likewise.
>       * gcc.target/aarch64/sve/acle/asm/whilelt_b8.c: Likewise.
>       * gcc.target/aarch64/sve/const_2.c: Likewise.
>       * gcc.target/aarch64/sve/const_3.c: Likewise.
>       * gcc.target/aarch64/sve/pfalse-count_pred.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_bf16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_f16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_f32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_f64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_s16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_s8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_u16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ld1q_gather_u8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1_gather_f32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1_gather_f64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sb_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sb_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sb_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sb_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sw_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1sw_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1ub_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1ub_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1ub_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1ub_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1uw_gather_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/ldnt1uw_gather_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_bf16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_f16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_f32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_f64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_s16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_s8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_u16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/st1q_scatter_u8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1_scatter_f32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1_scatter_f64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1_scatter_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1_scatter_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1b_scatter_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1b_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1b_scatter_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1b_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1w_scatter_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/stnt1w_scatter_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b16_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b32_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b64_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_b8_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_c16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_c32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_c64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilege_c8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b16_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b32_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b64_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_b8_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_c16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_c32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_c64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilegt_c8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_b16_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_b32_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_b64_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_b8_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_c16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_c32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_c64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilele_c8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_b16_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_b32_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_b64_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_b8_x2.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_c16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_c32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_c64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilelt_c8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_bf16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_f16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_f32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_f64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_mf8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_s16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_s8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_u16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilerw_u8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_bf16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_f16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_f32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_f64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_mf8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_s16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_s32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_s64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_s8.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_u16.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_u32.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_u64.c: Likewise.
>       * gcc.target/aarch64/sve2/acle/asm/whilewr_u8.c: Likewise.
>       * gcc.target/aarch64/test_frame_17.c: Likewise.
>       * gcc.target/aarch64/vect-cse-codegen.c: Likewise.
>       * gcc.target/aarch64/narrow-gp-writes-1.c: New test.
>       * gcc.target/aarch64/narrow-gp-writes-2.c: New test.
>       * gcc.target/aarch64/narrow-gp-writes-3.c: New test.
>       * gcc.target/aarch64/narrow-gp-writes-4.c: New test.
>       * gcc.target/aarch64/narrow-gp-writes-5.c: New test.
>       * gcc.target/aarch64/narrow-gp-writes-6.c: New test.
>       * gcc.target/aarch64/narrow-gp-writes-7.c: New test.
> ---
>  gcc/config.gcc                                |  10 +-
>  .../aarch64/aarch64-narrow-gp-writes.cc       | 595 ++++++++++++++++++
>  gcc/config/aarch64/aarch64-passes.def         |   1 +
>  gcc/config/aarch64/aarch64-protos.h           |   1 +
>  gcc/config/aarch64/aarch64.opt                |   5 +
>  gcc/config/aarch64/t-aarch64                  |   5 +
>  gcc/doc/invoke.texi                           |   8 +-
>  .../gcc.target/aarch64/acle/pr110100.c        |   2 +-
>  gcc/testsuite/gcc.target/aarch64/ands_1.c     |   2 +-
>  .../aarch64/bitfield-bitint-abi-align16.c     | 162 ++---
>  .../aarch64/bitfield-bitint-abi-align8.c      | 146 ++---
>  gcc/testsuite/gcc.target/aarch64/chkfeat-1.c  |   6 +-
>  gcc/testsuite/gcc.target/aarch64/chkfeat-2.c  |   4 +-
>  gcc/testsuite/gcc.target/aarch64/cmpbr.c      |  38 +-
>  gcc/testsuite/gcc.target/aarch64/csinc-2.c    |   2 +-
>  .../gcc.target/aarch64/eh_return-3.c          |   4 +-
>  gcc/testsuite/gcc.target/aarch64/ffs.c        |   4 +-
>  gcc/testsuite/gcc.target/aarch64/gcspopm-1.c  |   6 +-
>  gcc/testsuite/gcc.target/aarch64/gcsss-1.c    |  10 +-
>  .../aarch64/imm_choice_comparison.c           |   4 +-
>  .../gcc.target/aarch64/int_mov_immediate_1.c  |   5 +-
>  .../aarch64/memset-corner-cases-2.c           |   4 +-
>  .../gcc.target/aarch64/memset-corner-cases.c  |   2 +-
>  gcc/testsuite/gcc.target/aarch64/mops_1.c     |   6 +-
>  gcc/testsuite/gcc.target/aarch64/mops_2.c     |   6 +-
>  gcc/testsuite/gcc.target/aarch64/mops_3.c     |   8 +-
>  gcc/testsuite/gcc.target/aarch64/movk_3.c     |   2 +-
>  .../gcc.target/aarch64/narrow-gp-writes-1.c   |  11 +
>  .../gcc.target/aarch64/narrow-gp-writes-2.c   |  20 +
>  .../gcc.target/aarch64/narrow-gp-writes-3.c   |  15 +
>  .../gcc.target/aarch64/narrow-gp-writes-4.c   |  15 +
>  .../gcc.target/aarch64/narrow-gp-writes-5.c   |  11 +
>  .../gcc.target/aarch64/narrow-gp-writes-6.c   |  21 +
>  .../gcc.target/aarch64/narrow-gp-writes-7.c   |  12 +
>  gcc/testsuite/gcc.target/aarch64/pr71727.c    |   2 +-
>  gcc/testsuite/gcc.target/aarch64/pr84882.c    |   2 +-
>  .../aarch64/sme2/acle-asm/whilege_b16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilege_b32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilege_b64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilege_b8.c        |  16 +-
>  .../aarch64/sme2/acle-asm/whilege_c16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilege_c32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilege_c64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilege_c8.c        |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_b16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_b32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_b64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_b8.c        |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_c16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_c32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_c64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilegt_c8.c        |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_b16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_b32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_b64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_b8.c        |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_c16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_c32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_c64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilele_c8.c        |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_b16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_b32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_b64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_b8.c        |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_c16.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_c32.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_c64.c       |  16 +-
>  .../aarch64/sme2/acle-asm/whilelt_c8.c        |  16 +-
>  .../aarch64/sve/acle/asm/cntb_pat.c           |  18 +-
>  .../aarch64/sve/acle/asm/cntd_pat.c           |   4 +-
>  .../aarch64/sve/acle/asm/cnth_pat.c           |  16 +-
>  .../aarch64/sve/acle/asm/cntw_pat.c           |   8 +-
>  .../gcc.target/aarch64/sve/acle/asm/dup_f64.c |   8 +-
>  .../gcc.target/aarch64/sve/acle/asm/dup_s64.c |  88 +--
>  .../gcc.target/aarch64/sve/acle/asm/dup_u64.c |  88 +--
>  .../aarch64/sve/acle/asm/dupq_f32.c           |  12 +-
>  .../aarch64/sve/acle/asm/dupq_s32.c           |   8 +-
>  .../aarch64/sve/acle/asm/dupq_u32.c           |   8 +-
>  .../aarch64/sve/acle/asm/index_s64.c          |  36 +-
>  .../aarch64/sve/acle/asm/index_u64.c          |  36 +-
>  .../aarch64/sve/acle/asm/ld1_gather_f32.c     |  28 +-
>  .../aarch64/sve/acle/asm/ld1_gather_f64.c     |  44 +-
>  .../aarch64/sve/acle/asm/ld1_gather_s32.c     |  28 +-
>  .../aarch64/sve/acle/asm/ld1_gather_s64.c     |  44 +-
>  .../aarch64/sve/acle/asm/ld1_gather_u32.c     |  28 +-
>  .../aarch64/sve/acle/asm/ld1_gather_u64.c     |  44 +-
>  .../aarch64/sve/acle/asm/ld1sb_gather_s32.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1sb_gather_s64.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1sb_gather_u32.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1sb_gather_u64.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1sh_gather_s32.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1sh_gather_s64.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1sh_gather_u32.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1sh_gather_u64.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1sw_gather_s64.c   |  28 +-
>  .../aarch64/sve/acle/asm/ld1sw_gather_u64.c   |  28 +-
>  .../aarch64/sve/acle/asm/ld1ub_gather_s32.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1ub_gather_s64.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1ub_gather_u32.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1ub_gather_u64.c   |   8 +-
>  .../aarch64/sve/acle/asm/ld1uh_gather_s32.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1uh_gather_s64.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1uh_gather_u32.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1uh_gather_u64.c   |  20 +-
>  .../aarch64/sve/acle/asm/ld1uw_gather_s64.c   |  28 +-
>  .../aarch64/sve/acle/asm/ld1uw_gather_u64.c   |  28 +-
>  .../aarch64/sve/acle/asm/ldff1_gather_f32.c   |  28 +-
>  .../aarch64/sve/acle/asm/ldff1_gather_f64.c   |  44 +-
>  .../aarch64/sve/acle/asm/ldff1_gather_s32.c   |  28 +-
>  .../aarch64/sve/acle/asm/ldff1_gather_s64.c   |  44 +-
>  .../aarch64/sve/acle/asm/ldff1_gather_u32.c   |  28 +-
>  .../aarch64/sve/acle/asm/ldff1_gather_u64.c   |  44 +-
>  .../aarch64/sve/acle/asm/ldff1sb_gather_s32.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1sb_gather_s64.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1sb_gather_u32.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1sb_gather_u64.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1sh_gather_s32.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1sh_gather_s64.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1sh_gather_u32.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1sh_gather_u64.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1sw_gather_s64.c |  28 +-
>  .../aarch64/sve/acle/asm/ldff1sw_gather_u64.c |  28 +-
>  .../aarch64/sve/acle/asm/ldff1ub_gather_s32.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1ub_gather_s64.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1ub_gather_u32.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1ub_gather_u64.c |   8 +-
>  .../aarch64/sve/acle/asm/ldff1uh_gather_s32.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1uh_gather_s64.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1uh_gather_u32.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1uh_gather_u64.c |  20 +-
>  .../aarch64/sve/acle/asm/ldff1uw_gather_s64.c |  28 +-
>  .../aarch64/sve/acle/asm/ldff1uw_gather_u64.c |  28 +-
>  .../aarch64/sve/acle/asm/prfb_gather.c        |  16 +-
>  .../aarch64/sve/acle/asm/prfd_gather.c        |  16 +-
>  .../aarch64/sve/acle/asm/prfh_gather.c        |  16 +-
>  .../aarch64/sve/acle/asm/prfw_gather.c        |  16 +-
>  .../aarch64/sve/acle/asm/st1_scatter_f32.c    |  28 +-
>  .../aarch64/sve/acle/asm/st1_scatter_f64.c    |  44 +-
>  .../aarch64/sve/acle/asm/st1_scatter_s32.c    |  28 +-
>  .../aarch64/sve/acle/asm/st1_scatter_s64.c    |  44 +-
>  .../aarch64/sve/acle/asm/st1_scatter_u32.c    |  28 +-
>  .../aarch64/sve/acle/asm/st1_scatter_u64.c    |  44 +-
>  .../aarch64/sve/acle/asm/st1b_scatter_s32.c   |   8 +-
>  .../aarch64/sve/acle/asm/st1b_scatter_s64.c   |   8 +-
>  .../aarch64/sve/acle/asm/st1b_scatter_u32.c   |   8 +-
>  .../aarch64/sve/acle/asm/st1b_scatter_u64.c   |   8 +-
>  .../aarch64/sve/acle/asm/st1h_scatter_s32.c   |  20 +-
>  .../aarch64/sve/acle/asm/st1h_scatter_s64.c   |  20 +-
>  .../aarch64/sve/acle/asm/st1h_scatter_u32.c   |  20 +-
>  .../aarch64/sve/acle/asm/st1h_scatter_u64.c   |  20 +-
>  .../aarch64/sve/acle/asm/st1w_scatter_s64.c   |  28 +-
>  .../aarch64/sve/acle/asm/st1w_scatter_u64.c   |  28 +-
>  .../aarch64/sve/acle/asm/whilele_b16.c        |  16 +-
>  .../aarch64/sve/acle/asm/whilele_b32.c        |  16 +-
>  .../aarch64/sve/acle/asm/whilele_b64.c        |  16 +-
>  .../aarch64/sve/acle/asm/whilele_b8.c         |  16 +-
>  .../aarch64/sve/acle/asm/whilelt_b16.c        |  16 +-
>  .../aarch64/sve/acle/asm/whilelt_b32.c        |  16 +-
>  .../aarch64/sve/acle/asm/whilelt_b64.c        |  16 +-
>  .../aarch64/sve/acle/asm/whilelt_b8.c         |  16 +-
>  .../gcc.target/aarch64/sve/const_2.c          |   2 +-
>  .../gcc.target/aarch64/sve/const_3.c          |   2 +-
>  .../aarch64/sve/pfalse-count_pred.c           |   2 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_bf16.c  |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_f16.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_f32.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_f64.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_s16.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_s32.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_s64.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_s8.c    |   8 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_u16.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_u32.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_u64.c   |  16 +-
>  .../aarch64/sve2/acle/asm/ld1q_gather_u8.c    |   8 +-
>  .../aarch64/sve2/acle/asm/ldnt1_gather_f32.c  |  44 +-
>  .../aarch64/sve2/acle/asm/ldnt1_gather_f64.c  |  60 +-
>  .../aarch64/sve2/acle/asm/ldnt1_gather_s32.c  |  44 +-
>  .../aarch64/sve2/acle/asm/ldnt1_gather_s64.c  |  60 +-
>  .../aarch64/sve2/acle/asm/ldnt1_gather_u32.c  |  44 +-
>  .../aarch64/sve2/acle/asm/ldnt1_gather_u64.c  |  60 +-
>  .../sve2/acle/asm/ldnt1sb_gather_s32.c        |  16 +-
>  .../sve2/acle/asm/ldnt1sb_gather_s64.c        |  16 +-
>  .../sve2/acle/asm/ldnt1sb_gather_u32.c        |  16 +-
>  .../sve2/acle/asm/ldnt1sb_gather_u64.c        |  16 +-
>  .../sve2/acle/asm/ldnt1sh_gather_s32.c        |  36 +-
>  .../sve2/acle/asm/ldnt1sh_gather_s64.c        |  36 +-
>  .../sve2/acle/asm/ldnt1sh_gather_u32.c        |  36 +-
>  .../sve2/acle/asm/ldnt1sh_gather_u64.c        |  36 +-
>  .../sve2/acle/asm/ldnt1sw_gather_s64.c        |  44 +-
>  .../sve2/acle/asm/ldnt1sw_gather_u64.c        |  44 +-
>  .../sve2/acle/asm/ldnt1ub_gather_s32.c        |  16 +-
>  .../sve2/acle/asm/ldnt1ub_gather_s64.c        |  16 +-
>  .../sve2/acle/asm/ldnt1ub_gather_u32.c        |  16 +-
>  .../sve2/acle/asm/ldnt1ub_gather_u64.c        |  16 +-
>  .../sve2/acle/asm/ldnt1uh_gather_s32.c        |  36 +-
>  .../sve2/acle/asm/ldnt1uh_gather_s64.c        |  36 +-
>  .../sve2/acle/asm/ldnt1uh_gather_u32.c        |  36 +-
>  .../sve2/acle/asm/ldnt1uh_gather_u64.c        |  36 +-
>  .../sve2/acle/asm/ldnt1uw_gather_s64.c        |  44 +-
>  .../sve2/acle/asm/ldnt1uw_gather_u64.c        |  44 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_bf16.c |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_f16.c  |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_f32.c  |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_f64.c  |  16 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_s16.c  |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_s32.c  |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_s64.c  |  16 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_s8.c   |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_u16.c  |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_u32.c  |   8 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_u64.c  |  16 +-
>  .../aarch64/sve2/acle/asm/st1q_scatter_u8.c   |   8 +-
>  .../aarch64/sve2/acle/asm/stnt1_scatter_f32.c |  44 +-
>  .../aarch64/sve2/acle/asm/stnt1_scatter_f64.c |  60 +-
>  .../aarch64/sve2/acle/asm/stnt1_scatter_s32.c |  44 +-
>  .../aarch64/sve2/acle/asm/stnt1_scatter_s64.c |  60 +-
>  .../aarch64/sve2/acle/asm/stnt1_scatter_u32.c |  44 +-
>  .../aarch64/sve2/acle/asm/stnt1_scatter_u64.c |  60 +-
>  .../sve2/acle/asm/stnt1b_scatter_s32.c        |  16 +-
>  .../sve2/acle/asm/stnt1b_scatter_s64.c        |  16 +-
>  .../sve2/acle/asm/stnt1b_scatter_u32.c        |  16 +-
>  .../sve2/acle/asm/stnt1b_scatter_u64.c        |  16 +-
>  .../sve2/acle/asm/stnt1h_scatter_s32.c        |  36 +-
>  .../sve2/acle/asm/stnt1h_scatter_s64.c        |  36 +-
>  .../sve2/acle/asm/stnt1h_scatter_u32.c        |  36 +-
>  .../sve2/acle/asm/stnt1h_scatter_u64.c        |  36 +-
>  .../sve2/acle/asm/stnt1w_scatter_s64.c        |  44 +-
>  .../sve2/acle/asm/stnt1w_scatter_u64.c        |  44 +-
>  .../aarch64/sve2/acle/asm/whilege_b16.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_b16_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_b32.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_b32_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_b64.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_b64_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_b8.c        |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_b8_x2.c     |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_c16.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_c32.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_c64.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilege_c8.c        |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b16.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b16_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b32.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b32_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b64.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b64_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b8.c        |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_b8_x2.c     |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_c16.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_c32.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_c64.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilegt_c8.c        |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_b16_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_b32_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_b64_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_b8_x2.c     |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_c16.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_c32.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_c64.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilele_c8.c        |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_b16_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_b32_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_b64_x2.c    |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_b8_x2.c     |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_c16.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_c32.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_c64.c       |  16 +-
>  .../aarch64/sve2/acle/asm/whilelt_c8.c        |  16 +-
>  .../aarch64/sve2/acle/asm/whilerw_bf16.c      |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_f16.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_f32.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_f64.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_mf8.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_s16.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_s32.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_s64.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_s8.c        |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_u16.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_u32.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_u64.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilerw_u8.c        |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_bf16.c      |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_f16.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_f32.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_f64.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_mf8.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_s16.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_s32.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_s64.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_s8.c        |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_u16.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_u32.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_u64.c       |   8 +-
>  .../aarch64/sve2/acle/asm/whilewr_u8.c        |   8 +-
>  .../gcc.target/aarch64/test_frame_17.c        |   2 +-
>  .../gcc.target/aarch64/vect-cse-codegen.c     |   6 +-
>  297 files changed, 3537 insertions(+), 2812 deletions(-)
>  create mode 100644 gcc/config/aarch64/aarch64-narrow-gp-writes.cc
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/narrow-gp-writes-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/narrow-gp-writes-2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/narrow-gp-writes-3.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/narrow-gp-writes-4.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/narrow-gp-writes-5.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/narrow-gp-writes-6.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/narrow-gp-writes-7.c
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index b2a48c02d3b..d421ab0cd4a 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -359,7 +359,15 @@ aarch64*-*-*)
>       c_target_objs="aarch64-c.o"
>       cxx_target_objs="aarch64-c.o"
>       d_target_objs="aarch64-d.o"
> -     extra_objs="aarch64-builtins.o aarch-common.o aarch64-elf-metadata.o 
> aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o 
> aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o 
> aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o 
> aarch-bti-insert.o aarch64-early-ra.o aarch64-ldp-fusion.o 
> aarch64-sched-dispatch.o aarch64-json-tunings-printer.o 
> aarch64-json-tunings-parser.o"
> +     extra_objs="aarch64-builtins.o aarch-common.o aarch64-elf-metadata.o \
> +     aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o \
> +     aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o \
> +     aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o \
> +     aarch64-speculation.o aarch-bti-insert.o aarch64-early-ra.o \
> +     aarch64-ldp-fusion.o aarch64-sched-dispatch.o \
> +     aarch64-json-tunings-printer.o \
> +     aarch64-json-tunings-parser.o \
> +     aarch64-narrow-gp-writes.o"
>       target_gtfiles="\$(srcdir)/config/aarch64/aarch64-protos.h 
> \$(srcdir)/config/aarch64/aarch64-builtins.h 
> \$(srcdir)/config/aarch64/aarch64-builtins.cc 
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h 
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>       target_has_targetm_common=yes
>       ;;
> diff --git a/gcc/config/aarch64/aarch64-narrow-gp-writes.cc 
> b/gcc/config/aarch64/aarch64-narrow-gp-writes.cc
> new file mode 100644
> index 00000000000..3ad67166b13
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-narrow-gp-writes.cc
> @@ -0,0 +1,595 @@
> +/* GP register writes narrowing pass.
> +   Copyright The GNU Toolchain Authors.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define IN_TARGET_CODE 1
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_ARRAY
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "hash-map.h"
> +#include "rtl-ssa.h"
> +#include "rtlhooks-def.h"
> +#include "rtl-iter.h"
> +#include "tree-pass.h"
> +#include "insn-attr.h"
> +
> +using namespace rtl_ssa;
> +
> +namespace {
> +
> +/* This pass converts 64-bit (X-register) operations to 32-bit (W-register)
> +   operations when the upper 32 bits of the result are known to be zero.
> +
> +   In AArch64, each 64-bit X register has a corresponding 32-bit W register
> +   that maps to its lower half.  When we can guarantee that the upper 32 bits
> +   are never used, we can safely narrow operations to use W registers 
> instead.
> +
> +   For example, this code:
> +     uint64_t foo(uint64_t a) {
> +      return (a & 255) + 3;
> +     }
> +
> +   Currently compiles to:
> +     and     x0, x0, 255
> +     add     x0, x0, 3
> +     ret
> +
> +   But can be optimized to:
> +     and     w0, w0, 255
> +     add     w0, w0, 3
> +     ret
> +
> +   The pass operates in two phases:
> +
> +   1) Analysis Phase:
> +      - Iterates through extended basic blocks (EBBs)
> +      - Computes nonzero bit masks for each register definition
> +      - Processes PHI nodes to handle control flow joins
> +      - Identifies candidates for narrowing
> +
> +   2) Transformation Phase:
> +      - Applies narrowing to validated candidates
> +      - Converts DImode operations to SImode where safe
> +
> +   - We use RTL-SSA to walk def-use chains for each register.
> +   - A custom nonzero_bits hook makes the analysis context-aware by
> +     maintaining a hash map (nzero_map) of computed bit masks
> +   - PHI nodes require special handling to merge masks from all inputs. This 
> is
> +     done by combine_mask_from_phi. We tackle 3 cases here:
> +      1. Input Edge has a Definition: This is the simplest case. For each 
> input
> +      edge to the PHI, we retrieve the def information and look up its mask.
> +      2. Input Edge has no Definition: We assume a conservative mask for that
> +      input.
> +      3. Input Edge is a PHI: We recursively call combine_mask_from_phi to
> +      merge the masks of all incoming values.
> +   - The pass tackles SET instructions and PARALLEL (compare+set)
> +     patterns.
> +   - The pass runs late in the RTL pipeline, after register allocation, to
> +     ensure stable def-use chains and avoid interfering with earlier
> +     optimizations.  */
> +
> +const pass_data pass_data_narrow_gp_writes = {
> +  RTL_PASS,        // type
> +  "narrow_gp_writes", // name
> +  OPTGROUP_NONE,      // optinfo_flags
> +  TV_MACH_DEP,             // tv_id
> +  0,               // properties_required
> +  0,               // properties_provided
> +  0,               // properties_destroyed
> +  0,               // todo_flags_start
> +  TODO_df_finish,     // todo_flags_finish
> +};
> +
> +using mask_t = unsigned HOST_WIDE_INT;
> +
> +/* Map from register definitions to their nonzero bit masks.  */
> +static hash_map<def_info *, mask_t> *nzero_map = nullptr;
> +
> +/* Current instruction being analyzed for nonzero_bits calculation.  */
> +static insn_info *curr_insn = nullptr;
> +
> +class narrow_gp_writes
> +{
> +public:
> +  narrow_gp_writes ();
> +  ~narrow_gp_writes ();
> +  void execute ();
> +
> +private:
> +  auto_vec<std::pair<insn_info *, rtx>> m_update_list;
> +  rtx optimize_single_set_insn (insn_info *);
> +  rtx optimize_compare_arith_insn (insn_info *);
> +  bool optimize_insn (insn_info *);
> +};
> +
> +static bool
> +relevant_access_p (access_info *acc)
> +{
> +  return GP_REGNUM_P (acc->regno ());
> +}
> +
> +/* Return the nonzero bit mask for definition DEF.  If we haven't computed
> +   a mask for this definition yet, return a conservative mask based on the
> +   mode.  */
> +
> +static mask_t
> +lookup_mask_from_def (def_info *def)
> +{
> +  mask_t *res;
> +  def = look_through_degenerate_phi (def);

I don't think this is necessary.  Degenerate phis should get processed just like
any other phi (via combine_mask_from_phi), and themselves end up in nzero_map.
So we can just query nzero_map directly.

> +  res = nzero_map->get (def);
> +  if (!res)
> +    return GET_MODE_MASK (def->mode ());
> +  return *res;

Easier as:

  return res ? *res : GET_MODE_MASK (def->mode ());

> +}
> +
> +/* Compute the nonzero bit mask for PHI node PHI by combining the masks
> +   of all incoming values.  VISITED tracks PHI nodes we've already processed
> +   to avoid infinite recursion.  */
> +
> +static mask_t
> +combine_mask_from_phi (phi_info *phi, auto_sbitmap &visited)
> +{
> +  if (mask_t *mask = nzero_map->get (phi))
> +    return *mask;
> +
> +  if (bitmap_bit_p (visited, phi->uid ()))
> +    return GET_MODE_MASK (phi->mode ());
> +  mask_t phi_mode_mask = GET_MODE_MASK (phi->mode ());
> +  mask_t combined_mask = 0;
> +  for (use_info *phi_use : phi->inputs ())
> +    {
> +      set_info *phi_set = phi_use->def ();
> +      if (!phi_set)
> +     {
> +       combined_mask |= phi_mode_mask;
> +       break;
> +     }
> +      else if (is_a<phi_info *> (phi_set))

I don't think we should special-case phi inputs here (and try to
recursively follow them), we should just treat them as any other def.
The pass does a single forward sweep over the RPO.  If we follow a phi
input along a backedge to an RPO-later def, then one of two things are
true:

(1) It's a regular def.  Since it's RPO-later, we won't have already
processed the def and it won't be in nzero_map, so we won't be able to
optimize/narrow based on that def.

(2) It's a phi.  Since it is RPO-after the parent phi, it's likely
(guaranteed?) that at least one of the phi's inputs is also defined
RPO-after the parent phi.  Thus we won't be able to narrow based on that
input, and we will have to conservatively use a mode mask for the entire
phi.

So in summary I don't think this recursive phi input chasing actually
gains us anything optimization-wise, but it is quite costly in terms of
compile time (especially because of the bitmap handling in the main
loop).  So let's drop it.

FWIW, I did an experiment disabling this recursive phi chasing, and I
observed no codegen differences on SPEC 2017 with/without it.

I think the pass should still handle phis at join points just fine
without the recursion.  If we want to later extend the analysis to
handle cycles and propagate information across backedges, we should use
the approach that Richard S outlined involving worklists.

> +     {
> +       /* Mark as visited before recursing to prevent infinite loops.  */
> +       bitmap_set_bit (visited, phi->uid ());
> +       combined_mask
> +         |= combine_mask_from_phi (as_a<phi_info *> (phi_set), visited);
> +     }
> +      else
> +     {
> +       mask_t def_mask = lookup_mask_from_def (phi_set);
> +       combined_mask |= def_mask;
> +     }
> +      if ((combined_mask & phi_mode_mask) == phi_mode_mask)
> +     break;
> +    }
> +
> +  return phi_mode_mask & combined_mask;
> +}
> +
> +/* RTL hooks callback for computing nonzero bits of registers.
> +   Updates NONZERO with the tracked nonzero bits for register X.  */
> +
> +static rtx
> +reg_nonzero_bits_for_narrow_gp_writes (const_rtx x, scalar_int_mode,
> +                                    scalar_int_mode,
> +                                    unsigned HOST_WIDE_INT *nonzero)
> +{
> +  for (use_info *use : curr_insn->uses ())
> +    {

No need for the curly braces here.

> +      if (use->regno () == REGNO (x) && GET_MODE (x) == use->mode ())

It looks like you could use the second param (`xmode`) instead of
ignoring it, i.e. write `xmode == use->mode ()` here.

> +     {
> +       def_info *set = use->def ();
> +       if (!set)
> +         {
> +           *nonzero &= GET_MODE_MASK (use->mode ());
> +           return NULL_RTX;
> +         }
> +       mask_t mask = lookup_mask_from_def (set);

Easier to do:

mask_t mask = set ? lookup_mask_from_def (set) : GET_MODE_MASK (use->mode ());

and delete the if block above.

> +       *nonzero &= mask;
> +       return NULL_RTX;
> +     }
> +    }
> +  return NULL_RTX;
> +}
> +
> +/* Convert a DImode RTL expression X to its SImode equivalent.
> +   Recursively narrows operands of supported operations.  */
> +
> +static rtx
> +narrow_dimode_src (rtx x)
> +{
> +  rtx op = lowpart_subreg (SImode, x, DImode);
> +  rtx_code code = GET_CODE (op);
> +  /* If the generic lowpart logic simplifies it then use that.
> +     If it just results in wrapping X in a subreg or truncate then try harder
> +     below.  */
> +  if (GET_CODE (op) != SUBREG && GET_CODE (op) != TRUNCATE)
> +    return op;
> +  if (subreg_lowpart_p (op))
> +    op = SUBREG_REG (op);
> +  else if (code == TRUNCATE)
> +    op = XEXP (op, 0);
> +  else
> +    gcc_unreachable ();
> +
> +  code = GET_CODE (op);
> +  if (code == AND || code == IOR || code == XOR || code == ASHIFT)
> +    {
> +      rtx op0 = lowpart_subreg (SImode, XEXP (op, 0), DImode);
> +      rtx op1 = lowpart_subreg (SImode, XEXP (op, 1), DImode);

As written this fails to handle ubfiz-type instructions.  Consider:

unsigned long f(unsigned long x)
{
  return (x & 0xff) << 3;
}

as it stands, this fails to narrow with:

  Trying to narrow insn:
  (insn:TI 12 7 13 (set (reg/i:DI 0 x0)
          (and:DI (ashift:DI (reg:DI 0 x0 [orig:106 x ] [106])
                  (const_int 3 [0x3]))
              (const_int 2040 [0x7f8]))) "shift.c":5:1 927 {*andim_ashiftdi_bfiz}
       (nil))
  with new pattern:
  (set (reg:SI 0 x0)
      (and:SI (subreg:SI (ashift:DI (reg:DI 0 x0 [orig:106 x ] [106])
                  (const_int 3 [0x3])) 0)
          (const_int 2040 [0x7f8])))
  Narrowed insn not valid, not modifying it

and we generate:

f:
        ubfiz   x0, x0, 3, 8
        ret

but we should be able to use the w-register variant.  I think you should
be able to handle such things by making the above calls recursively use
narrow_dimode_src instead of calling lowpart_subreg directly.

> +      return simplify_gen_binary (code, SImode, op0, op1);
> +    }
> +
> +  if (code == IF_THEN_ELSE)
> +    {
> +      rtx trueop = lowpart_subreg (SImode, XEXP (op, 1), DImode);
> +      rtx falseop = lowpart_subreg (SImode, XEXP (op, 2), DImode);

If it works, we should probably do the same and recursively call
narrow_dimode_src on the operands here.

> +      return simplify_gen_ternary (code, SImode, GET_MODE (XEXP (op, 0)),
> +                                XEXP (op, 0), trueop, falseop);
> +    }
> +
> +  return op;
> +}
> +
> +#undef RTL_HOOKS_REG_NONZERO_REG_BITS
> +#define RTL_HOOKS_REG_NONZERO_REG_BITS reg_nonzero_bits_for_narrow_gp_writes
> +static const struct rtl_hooks narrow_gp_writes_rtl_hooks
> +  = RTL_HOOKS_INITIALIZER;
> +
> +narrow_gp_writes::narrow_gp_writes ()
> +{
> +  if (!nzero_map)
> +    nzero_map = new hash_map<def_info *, mask_t> ();
> +  nzero_map->empty ();

Is this really required?  I would have thought that a newly-constructed hash_map
should be empty by default.

> +  rtl_hooks = narrow_gp_writes_rtl_hooks;
> +  curr_insn = nullptr;
> +  m_update_list.safe_grow_cleared (0);
> +}
> +
> +narrow_gp_writes::~narrow_gp_writes ()
> +{
> +  rtl_hooks = general_rtl_hooks;
> +  delete nzero_map;
> +  nzero_map = nullptr;
> +}
> +
> +/* Return true if INSN is a candidate for narrowing.  */
> +
> +static bool
> +optimizable_insn_p (insn_info *insn)
> +{
> +  if (!insn->is_real () || insn->is_asm () || insn->is_jump ()
> +      || !insn->can_be_optimized () || insn->has_volatile_refs ()

Note that insn->can_be_optimized () implies insn->is_real (), so the is_real ()
check is redundant here.

> +      || insn->has_pre_post_modify ())
> +    return false;
> +
> +  return true;

I think it would be nicer to have this directly return a boolean
expression which computes the positive condition for whether an insn is
optimizable, but please see the comments on optimize_insn below.

> +}
> +
> +/* Attempt to replace INSN's pattern with NEW_PAT.  Returns true if the
> +   replacement was successful.  */
> +
> +static bool
> +narrow_dimode_ops (insn_info *insn, rtx new_pat)
> +{
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Trying to narrow insn:\n");
> +      print_rtl_single (dump_file, insn->rtl ());
> +      fprintf (dump_file, "with new pattern:\n");
> +      print_rtl_single (dump_file, new_pat);
> +    }
> +
> +  auto attempt = crtl->ssa->new_change_attempt ();
> +  rtl_ssa::insn_change change (insn);
> +  rtx_insn *rtl = insn->rtl ();
> +  insn_change_watermark watermark;
> +  validate_change (rtl, &PATTERN (rtl), new_pat, 1);
> +  if (!rtl_ssa::recog (attempt, change)
> +      || !rtl_ssa::change_is_worthwhile (change))
> +    {
> +      if (dump_file)
> +     {
> +       fprintf (dump_file, "Narrowed insn not valid, not modifying it\n");

Can you separate the recog failure from the change_is_worthwhile failure
and use different dump messages for the two cases?

It's an important distinction for anyone looking at the dump file to
know if the change was rejected because of unrecognisable RTL or
just costing.

> +       print_rtl_single (dump_file, new_pat);
> +     }
> +      return false;
> +    }
> +  confirm_change_group ();
> +  crtl->ssa->change_insn (change);
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "vvvvvvvvvvvNarrowed insn!vvvvvvvvvv\n");

Are all the vs really necessary here?  Either way, at least a space
between them and the text would make the output a bit more readable imo.

> +      print_rtl_single (dump_file, new_pat);
> +    }
> +  return true;
> +}
> +
> +/* Try to narrow flag-setting arithmetic operations (e.g., ADDS, SUBS, ANDS).
> +   These are represented as PARALLEL patterns with a compare and a set.
> +   Returns the narrowed pattern or NULL_RTX if narrowing is not possible.  */
> +
> +rtx
> +narrow_gp_writes::optimize_compare_arith_insn (insn_info *insn)
> +{
> +  rtx pat = PATTERN (insn->rtl ());
> +  if (GET_CODE (pat) != PARALLEL || XVECLEN (pat, 0) != 2)
> +    return NULL_RTX;
> +
> +  rtx cmp_set = XVECEXP (pat, 0, 0);
> +  rtx set = XVECEXP (pat, 0, 1);
> +  if (GET_CODE (cmp_set) != SET || GET_CODE (set) != SET)
> +    return NULL_RTX;
> +
> +  if (!REG_P (SET_DEST (cmp_set)) || REGNO (SET_DEST (cmp_set)) != CC_REGNUM
> +      || GET_CODE (SET_SRC (cmp_set)) != COMPARE)
> +    return NULL_RTX;
> +
> +  rtx set_src = SET_SRC (set);
> +  rtx set_dst = SET_DEST (set);
> +  if (!REG_P (set_dst) || GET_MODE (set_dst) != DImode
> +      || !GP_REGNUM_P (REGNO (set_dst)))
> +    return NULL_RTX;
> +  if (GET_CODE (set_src) == ZERO_EXTEND
> +      && GET_MODE (XEXP (set_src, 0)) == SImode)
> +    return NULL_RTX;
> +  rtx cmp_op0 = XEXP (SET_SRC (cmp_set), 0);
> +  rtx cmp_op1 = XEXP (SET_SRC (cmp_set), 1);
> +  if (cmp_op1 != CONST0_RTX (DImode) || !rtx_equal_p (set_src, cmp_op0))
> +    return NULL_RTX;
> +
> +  /* Different condition code modes have different requirements for
> +     which bits must be zero to allow narrowing.  CC_Zmode only tests the Z 
> flag,
> +     while CC_NZmode tests both N and Z flags.  Both require all upper 32 
> bits
> +     to be zero.  CC_NZVmode tests N, Z, and V flags, and additionally 
> requires
> +     bit 31 to be zero to ensure correct overflow flag behavior.  */
> +  machine_mode cc_mode = GET_MODE (SET_SRC (cmp_set));
> +  mask_t valid_mask = 0;
> +  if (cc_mode == CC_Zmode || cc_mode == CC_NZmode)
> +    valid_mask = ~GET_MODE_MASK (SImode);
> +  else if (cc_mode == CC_NZVmode)
> +    valid_mask = ~(GET_MODE_MASK (SImode) >> 1);
> +  else
> +    return NULL_RTX;
> +  mask_t mask = nonzero_bits (set_src, DImode);
> +
> +  for (auto def : insn->defs ())
> +    {
> +      unsigned def_regno = def->regno ();
> +      if (REGNO (set_dst) == def_regno)
> +     nzero_map->put (def, mask);
> +      else if (def_regno != CC_REGNUM)
> +     gcc_unreachable ();
> +    }
> +
> +  if (valid_mask & mask)
> +    return NULL_RTX;
> +
> +  rtx new_set_src = narrow_dimode_src (set_src);
> +  rtx new_set_dst = lowpart_subreg (SImode, set_dst, DImode);
> +  rtx new_set = gen_rtx_SET (new_set_dst, new_set_src);
> +  rtx new_cmp_set
> +    = gen_rtx_SET (SET_DEST (cmp_set),
> +                gen_rtx_COMPARE (cc_mode, copy_rtx (new_set_src),
> +                                 CONST0_RTX (SImode)));
> +  rtx new_pat
> +    = gen_rtx_PARALLEL (VOIDmode, gen_rtvec (2, new_cmp_set, new_set));
> +  return new_pat;
> +}
> +
> +/* Try to narrow single SET instructions.
> +   Returns the narrowed pattern or NULL_RTX if narrowing is not possible.  */
> +
> +rtx
> +narrow_gp_writes::optimize_single_set_insn (insn_info *insn)
> +{
> +  set_info *sinfo = single_set_info (insn);
> +  if (!sinfo)
> +    return NULL_RTX;
> +  rtx set = single_set (insn->rtl ());
> +  if (!set)
> +    return NULL_RTX;
> +  rtx dst = SET_DEST (set);
> +  rtx src = SET_SRC (set);
> +  if (!REG_P (dst) || GET_MODE (dst) != DImode || !GP_REGNUM_P (REGNO (dst)))
> +    return NULL_RTX;
> +  gcc_assert (sinfo->regno () == REGNO (dst));
> +
> +  /* The pass can optimize many aarch64.md patterns from (x0:DI :=
> +     zero_extend:DI (op:SI...)) to (x0:SI := op:SI) but as those patterns
> +     represent a write of a W-register, the resulting assembly is the same.
> +     Avoid the compile-time cost of manipulating these patterns and also
> +     increasing the replacement statistics for something that has no effect
> +     at the assembly level.  */
> +  if (GET_CODE (src) == ZERO_EXTEND
> +      && SCALAR_INT_MODE_P (GET_MODE (XEXP (src, 0)))
> +      && (known_eq (GET_MODE_BITSIZE (GET_MODE (XEXP (src, 0))),
> +                 GET_MODE_BITSIZE (SImode))))

Can you not just check exactly for the inner mode being SImode here?
What other mode would satisfy this condition in this context?

Also, see my comments on optimize_insn below, but I think any early
return for "we can't / don't want to narrow this insn" should come
after we've done the nonzero bit tracking.

> +    return NULL_RTX;
> +  mask_t mask = nonzero_bits (src, DImode);
> +  nzero_map->put (sinfo, mask);
> +
> +  /* If mask overlaps with top 32 bits we can't narrow.  */
> +  if ((~GET_MODE_MASK (SImode)) & mask)
> +    {
> +      if (dump_file)
> +     fprintf (dump_file,
> +              "optimize_insn::Cannot narrow destination: mask %lx\n", mask);
> +      return NULL_RTX;
> +    }
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "optimize_insn::Can narrow insn!\n");
> +      print_rtl_single (dump_file, insn->rtl ());
> +    }
> +  rtx new_src = narrow_dimode_src (src);
> +  rtx new_dst = lowpart_subreg (SImode, dst, DImode);
> +  rtx new_pat;
> +  if (rtx_equal_p (new_src, new_dst))
> +    new_pat = gen_rtx_SET (dst, gen_rtx_ZERO_EXTEND (DImode, new_src));

Please can you explain why this is correct?  It looks a little odd at first
glance.  I suppose:

 (set (reg:SI x) (reg:SI x))

is just a nop, but you need the upper-bit-clearing effect of the
original insn to be preserved, hence the need for the ZERO_EXTEND to
DImode here.  Is that right?  I assume assembly-wise this just maps to

  mov wN, wN

?

> +  else
> +    new_pat = gen_rtx_SET (new_dst, new_src);
> +  return new_pat;
> +}
> +
> +/* Analyze INSN for narrowing opportunities.  Updates nzero_map with
> +   nonzero bit information for all definitions.  Returns true if INSN
> +   was marked for narrowing.  */
> +
> +bool
> +narrow_gp_writes::optimize_insn (insn_info *insn)
> +{
> +  bool optimizable = optimizable_insn_p (insn);

I don't really like the way this whole function is structured.  It seems
the loop at the bottom just duplicates work that should be done in
optimize_single_set_insn.

I think there are essentially three sets of instructions that we need to tease
apart here:

(1) Those that we can't do anything with at all (e.g. artificial insns, USEs,
    CLOBBERs).
(2) Those that we can track nonzero bits for, but can't transform.
(3) Those that we can both track nonzero bits for and potentially transform.

For insns in set (1) we should clearly just return early from this
function.  Clearly (3) is a subset of (2), and (1) is the complement of (2).
I think you need to tease out from optimizable_insn_p which conditions
characterise these sets.

ISTM the condition for being in (2) is likely:

  insn->can_be_optimized () && !insn->is_asm () && !insn->is_jump ()
  && !insn->has_pre_post_modify ()

For a member of (2), the additional condition for being in (3) is then:

  !insn->has_volatile_refs ()

?  It seems safe to (attempt to) track the nonzero bits of an insn with
volatile refs, but probably not to try and transform such an insn.  In practice
I think the optimize_*_insn routines themselves will have their own checks for
when narrowing is possible, and it likely wouldn't be possible for insns that
satisfy that condition anyway.

Structurally, I think all the nonzero bit tracking should happen in the
optimize_{single_set,compare_arith}_insn routines.  If appropriate, they
can then go ahead and try to narrow the insn too.

I think the high-level structure should look like this:

  if (!can_track_or_narrow_p (insn))
    return; // Early return for insns in set (1) above.

  rtx new_pat;
  if (auto set = single_set (insn->rtl ()))
    new_pat = optimize_single_set_insn (insn, set);
  else
    new_pat = optimize_compare_arith_insn (insn);

  if (new_pat)
    // either queue or immediately transform to new_pat

without the trailing for loop, as that work would be done by the above calls.
Their logic would be adjusted to do the tracking by default (even if we can't
narrow), and return a new pattern for narrowing if it is safe to do so.

As a general point, I think we should probably allow tracking of nonzero
bits for insns using modes other than DImode, as we may be able to use
such information via the nonzero_bits hook to allow narrowing of other
instructions.

I'm also wondering now whether it's really necessary to defer the transformation
until after analysis, given that we just make a single forward pass over the RPO
and propagate information forwards (without trying to follow backedges).  What
would go wrong if we tried to narrow as we go during analysis?

> +
> +  if (optimizable)
> +    {
> +      rtx new_pat = optimize_compare_arith_insn (insn);
> +      if (new_pat)
> +     {
> +       m_update_list.safe_push (std::make_pair (insn, new_pat));
> +       return true;
> +     }
> +      new_pat = optimize_single_set_insn (insn);
> +      if (new_pat)
> +     {
> +       m_update_list.safe_push (std::make_pair (insn, new_pat));
> +       return true;
> +     }
> +    }
> +  /* For instructions that can't be narrowed, still track their nonzero bits
> +     for use by later instructions.  */
> +  for (auto def : insn->defs ())
> +    {
> +      if (!relevant_access_p (def) || nzero_map->get (def))
> +     continue;
> +      mask_t mask = GET_MODE_MASK (def->mode ());
> +      rtx dst = NULL_RTX;
> +      rtx set = NULL_RTX;
> +      if (insn->is_real ())
> +     set = single_set (insn->rtl ());
> +      if (set && (dst = SET_DEST (set)) && REG_P (dst)
> +       && REGNO (dst) == def->regno ())
> +     {
> +       rtx src = SET_SRC (set);
> +       mask = nonzero_bits (src, DImode);

This is a slightly moot point, since I think we should drop this for
loop, but it seems wrong to unconditionally pass DImode here; how do we
know the set is in DImode?

> +     }
> +      nzero_map->put (def, mask);
> +    }
> +
> +  return false;
> +}
> +
> +void
> +narrow_gp_writes::execute ()
> +{
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_analyze ();
> +  crtl->ssa = new rtl_ssa::function_info (cfun);
> +  timevar_push (TV_MACH_DEP);

Stylistically, I think it would be nicer to put this
initialization/destruction boilerplate in the ctor/dtor of
the narrow_gp_writes class.

> +  auto_sbitmap visited (get_max_uid ());

get_max_uid () is not the correct domain for this bitmap.  It looks to
be keyed off rtl-ssa phi uids, which are allocated by rtl-ssa's
function_info::create_phi (via m_next_phi_uid).  get_max_uid () instead
returns the maximum uid over all RTL insns.

This is a somewhat moot point, though, since (as discussed previously),
I think we should drop the code in combine_mask_from_phi which
recursively follows phi inputs, so we can just drop the bitmap
altogether.

> +  bitmap_clear (visited);
> +  unsigned HOST_WIDE_INT narrowing_candidates = 0;
> +  for (auto ebb : crtl->ssa->ebbs ())
> +    {
> +      for (auto *phi : ebb->phis ())
> +     {
> +       if (dump_file)
> +         {
> +           fprintf (dump_file, "Processing phi:\n");
> +           dump (dump_file, phi);
> +         }
> +       if (!relevant_access_p (phi))
> +         continue;
> +       bitmap_clear (visited);

As discussed previously, this looks like it could be quadratic.

As written, since visited is get_max_uid () bits large, a single call to
bitmap_clear (visited) does O(num_insns) work.

If we assume that the number of phis in any given EBB is bounded above
by a small constant k, then the complexity of this call is:

O(num_insns * num_ebbs * k).

In branch-dense code, it is possible for the number of EBBs to approach
the number of insns, such that we have:

num_insns = num_ebbs * n
=> num_ebbs = num_insns / n

for some small constant n.  Thus, we see the worst-case time complexity
is:

O(num_insns * (num_insns / n) * k)
= O(num_insns^2).

Suppose instead we fix the visited bitmap to use the number of RTL-SSA
phis in the function (call that P). Then the complexity would be:

O(P * num_ebbs * k)

but it's likely that the number of phis is at least as big as the number
of ebbs, i.e. we have:

P = num_ebbs * k',

so this is just:

O(num_ebbs * k' * num_ebbs * k)
= O(num_ebbs^2)

and it is still quadratic in the number of EBBs, which is a problem.

As mentioned above, I think we can just drop the recursive phi-chasing
in combine_mask_from_phi, and thus eliminate the need for the bitmap
altogether.

> +       mask_t phi_mask = combine_mask_from_phi (phi, visited);
> +       nzero_map->put (phi, phi_mask);
> +     }
> +      for (auto insn : ebb->nondebug_insns ())
> +     {
> +       if (dump_file)
> +         {
> +           fprintf (dump_file, "\nexecute::=========================\n");
> +           fprintf (dump_file, "execute::Now processing insn: \n");
> +           dump (dump_file, insn);

It seems quite noisy to dump every nondebug insn that the pass comes
across; perhaps this should be hidden behind the TDF_DETAILS flag?

Thanks,
Alex

> +         }
> +
> +       curr_insn = insn;
> +       if (optimize_insn (insn))
> +         narrowing_candidates++;
> +     }
> +    }
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +            "execute::Finished analysing, performing mode narrowing\n");
> +    }
> +  curr_insn = nullptr;
> +  rtl_hooks = general_rtl_hooks;
> +  unsigned HOST_WIDE_INT successful_narrowings = 0;
> +  for (auto pair : m_update_list)
> +    if (narrow_dimode_ops (pair.first, pair.second))
> +      successful_narrowings++;
> +  crtl->ssa->perform_pending_updates ();
> +  free_dominance_info (CDI_DOMINATORS);
> +  delete crtl->ssa;
> +  crtl->ssa = nullptr;
> +  if (successful_narrowings > 0 && dump_file)
> +    fprintf (dump_file,
> +          "Finished narrowing pass: Narrowed " HOST_WIDE_INT_PRINT_UNSIGNED
> +          " / " HOST_WIDE_INT_PRINT_UNSIGNED " candidates\n",
> +          successful_narrowings, narrowing_candidates);
> +  timevar_pop (TV_MACH_DEP);
> +}
> +
> +class pass_narrow_gp_writes : public rtl_opt_pass
> +{
> +public:
> +  pass_narrow_gp_writes (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_narrow_gp_writes, ctxt)
> +  {}
> +
> +  /* opt_pass methods:  */
> +  virtual bool gate (function *)
> +  {
> +    return optimize >= 2 && aarch64_narrow_gp_writes != 0;
> +  }
> +  virtual unsigned int execute (function *);
> +};
> +
> +unsigned int
> +pass_narrow_gp_writes::execute (function *)
> +{
> +  narrow_gp_writes ().execute ();
> +  return 0;
> +}
> +
> +} // end namespace
> +
> +/* Create a new narrow gp writes pass instance.  */
> +rtl_opt_pass *
> +make_pass_narrow_gp_writes (gcc::context *ctxt)
> +{
> +  return new pass_narrow_gp_writes (ctxt);
> +}
> \ No newline at end of file
> diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> index 62a231cf79d..f4307aab551 100644
> --- a/gcc/config/aarch64/aarch64-passes.def
> +++ b/gcc/config/aarch64/aarch64-passes.def
> @@ -26,3 +26,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_late_track_s
>  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
>  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_ldp_fusion);
>  INSERT_PASS_BEFORE (pass_peephole2, 1, pass_ldp_fusion);
> +INSERT_PASS_BEFORE (pass_cleanup_barriers, 1, pass_narrow_gp_writes);
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 48d3a3de235..9a6dd1216c1 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1254,6 +1254,7 @@ rtl_opt_pass *make_pass_late_track_speculation (gcc::context *);
>  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
>  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
>  rtl_opt_pass *make_pass_ldp_fusion (gcc::context *);
> +rtl_opt_pass *make_pass_narrow_gp_writes (gcc::context *);
>  
>  poly_uint64 aarch64_regmode_natural_size (machine_mode);
>  
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index 9b59c15737f..ccb04cf7510 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -103,6 +103,11 @@ mfix-cortex-a53-843419
>  Target Var(aarch64_fix_a53_err843419) Init(2) Save
>  Workaround for ARM Cortex-A53 Erratum number 843419.
>  
> +mnarrow-gp-writes
> +Target Var(aarch64_narrow_gp_writes) Optimization Init(1) Save
> +Enable narrowing of 64-bit general purpose register writes to 32-bit when
> +upper 32 bits of the register are unused.
> +
>  mlittle-endian
>  Target RejectNegative InverseMask(BIG_END)
>  Assume target CPU is configured as little endian.
> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index 7584d3e4d59..7c04397c846 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -238,6 +238,11 @@ aarch64-json-tunings-parser.o: $(srcdir)/config/aarch64/aarch64-json-tunings-par
>       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>               $(srcdir)/config/aarch64/aarch64-json-tunings-parser.cc
>  
> +aarch64-narrow-gp-writes.o: $(srcdir)/config/aarch64/aarch64-narrow-gp-writes.cc \
> +    $(CONFIG_H) $(SYSTEM_H) $(CORETYPES_H) $(BACKEND_H) $(RTL_H) $(DF_H) \
> +    $(RTL_SSA_H) tree-pass.h
> +     $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> +             $(srcdir)/config/aarch64/aarch64-narrow-gp-writes.cc
>  comma=,
>  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst 
> $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
>  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index bae66ba6c45..eb0b6800158 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -909,7 +909,7 @@ Objective-C and Objective-C++ Dialects}.
>  -moverride=@var{string}
>  -mstack-protector-guard=@var{guard}  -mstack-protector-guard-reg=@var{sysreg}
>  -mstack-protector-guard-offset=@var{offset}  -mtrack-speculation
> --moutline-atomics  -mearly-ra  -mearly-ldp-fusion  -mlate-ldp-fusion
> +-moutline-atomics  -mearly-ra  -mearly-ldp-fusion  -mlate-ldp-fusion -mnarrow-gp-writes
>  -msve-vector-bits=@var{bits}}
>  
>  @emph{Adapteva Epiphany Options} (@ref{Adapteva Epiphany Options})
> @@ -22985,6 +22985,12 @@ register allocation.  Enabled by default at @samp{-O} and above.
>  Enable the copy of the AArch64 load/store pair fusion pass that runs after
>  register allocation.  Enabled by default at @samp{-O} and above.
>  
> +@opindex mnarrow-gp-writes
> +@item -mnarrow-gp-writes
> +Enable conversion of 64-bit general purpose register writes to equivalent
> +32-bit operations when the upper 32 bits are demonstrably unused.  Enabled
> +by default at @option{-O2} and above.
> +
>  @opindex msve-vector-bits
>  @item -msve-vector-bits=@var{bits}
>  Specify the number of bits in an SVE vector register.  This option only has
