Soumya AR <[email protected]> writes:
> [...]
> In AArch64, each 64-bit X register has a corresponding 32-bit W register
> that maps to its lower half.  When we can guarantee that the upper 32 bits
> are never used, we can safely narrow operations to use W registers instead.
>
> For example, this code:
>     uint64_t foo (uint64_t a) {
>         return (a & 255) + 3;
>     }
>
> Currently compiles to:
>
>       and     x0, x0, 255
>       add     x0, x0, 3
>       ret
>
> But with this pass enabled, it optimizes to:
>
>       and     w0, w0, 255
>       add     w0, w0, 3
>       ret
>
> ----
>
> The pass operates in two phases:
>
>  1) Analysis Phase:
>    - Using RTL-SSA, iterates through extended basic blocks (EBBs)
>    - Computes nonzero bit masks for each register definition
>    - Recursively processes PHI nodes
>    - Identifies candidates for narrowing
>  2) Transformation Phase:
>    - Applies narrowing to validated candidates
>    - Converts DImode operations to SImode where safe
>
> The pass runs late in the RTL pipeline, after register allocation, to ensure
> stable def-use chains and avoid interfering with earlier optimizations.

I haven't looked at the implementation in detail yet, but on the design:

As you say above, the pass makes a single pass through the instructions,
making pessimistic assumptions about backedges.  Did you consider instead
using a worklist algorithm that makes optimistic assumptions and then
corrects them?  That would cope better with loops, since a value carried
around a backedge would not have to be treated as unknown the first time
it is seen.

With that arrangement, the worklist/analysis phase would not make
optimisations on the fly.  It would simply record a mask for each
definition (e.g. in a map), making optimistic assumptions about
definitions that have not been processed yet.  If the mask calculated
for a definition invalidates an earlier assumption, the definition
would be pushed onto the worklist so that all its uses could be
reevaluated.  (See gimple-ssa-backprop for another pass that works
like this, although I'm sure there are simpler examples.)
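
To make that concrete, here is a rough standalone sketch of the scheme
described above.  It is not RTL-SSA code and not taken from the patch:
the Def structure, the Op enum and the function names are all made up
for illustration, and the toy IR only knows about constants, AND/ADD
with an immediate, and phis.  Masks start at zero (the optimistic
assumption) and only ever grow, so the iteration terminates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical toy IR: each definition is computed from earlier or later
// definitions (phis may refer to later ones via a backedge).
enum class Op { Param, Const, AndImm, AddImm, Phi };

struct Def {
  Op op;
  uint64_t imm;             // constant value or immediate operand
  std::vector<int> inputs;  // indices of the input definitions
};

// Set every bit at or below the highest set bit of X.
static uint64_t smear(uint64_t x) {
  for (unsigned s = 1; s < 64; s <<= 1)
    x |= x >> s;
  return x;
}

// Compute the possibly-nonzero bits of D from the current input masks.
static uint64_t transfer(const Def &d, const std::vector<uint64_t> &mask) {
  switch (d.op) {
  case Op::Param:
    return ~uint64_t(0);  // nothing known about incoming values
  case Op::Const:
    return d.imm;
  case Op::AndImm:
    return mask[d.inputs[0]] & d.imm;
  case Op::AddImm: {
    // The input is at most its mask, so the sum is at most mask + imm.
    uint64_t sum = mask[d.inputs[0]] + d.imm;
    if (sum < d.imm)
      return ~uint64_t(0);  // the bound wrapped; give up
    return smear(sum);
  }
  case Op::Phi: {
    uint64_t m = 0;
    for (int in : d.inputs)
      m |= mask[in];
    return m;
  }
  }
  return ~uint64_t(0);
}

// Record a mask for every definition.  No optimisation happens here.
std::vector<uint64_t> compute_nonzero_masks(const std::vector<Def> &defs) {
  size_t n = defs.size();
  std::vector<std::vector<int>> users(n);
  for (size_t i = 0; i < n; ++i)
    for (int in : defs[i].inputs) {
      assert(in >= 0 && size_t(in) < n);
      users[in].push_back(int(i));
    }

  std::vector<uint64_t> mask(n, 0);  // optimistic: no bits set yet
  std::vector<bool> queued(n, false);
  std::deque<int> worklist;

  // Recompute definition I; if its mask grew, an earlier assumption about
  // it was too optimistic, so queue it for its uses to be reevaluated.
  auto update = [&](int i) {
    uint64_t m = mask[i] | transfer(defs[i], mask);
    if (m == mask[i])
      return;
    mask[i] = m;
    if (!queued[i]) {
      queued[i] = true;
      worklist.push_back(i);
    }
  };

  for (size_t i = 0; i < n; ++i)
    update(int(i));
  while (!worklist.empty()) {
    int d = worklist.front();
    worklist.pop_front();
    queued[d] = false;
    for (int u : users[d])
      update(u);
  }
  return mask;
}
```

For a loop such as "i = phi (0, j); t = i + 1; j = t & 255", the
sketch settles on a mask of 0xff for the phi, whereas a single forward
pass that treats the not-yet-seen backedge value as unknown would have
to assume all 64 bits.  A later transformation phase would then consult
the recorded masks.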

That analysis framework seems generic rather than target-specific.
Perhaps it should be separated from the pass and provided as an
independent routine, so that multiple passes can use it.  (I'd
wondered at one point whether late-combine should do the same kind
of analysis, but never got around to trying it.)

Thanks,
Richard
