On 12/30/24 3:02 PM, Richard Sandiford wrote:
Jeff Law <jeffreya...@gmail.com> writes:
The BZ in question is a failure to recognize a pair of shifts as a sign
extension.

I originally thought simplify-rtx would be the right framework to
address this problem, but fwprop is actually better.  We can write the
recognizer much simpler in that framework.

fwprop already simplifies nested shifts/extensions to the desired RTL,
but it's not considered profitable and we throw away the good work done
by fwprop & simplifiers.

It's hard to see a scenario where nested shifts or nested extensions
that simplify down to a single sign/zero extension isn't a profitable
transformation.  So when fwprop has nested shifts/extensions that
simplifies to an extension, we consider it profitable.

This allow us to simplify the testcase on rv64 with ZBB enabled from a
pair of shifts to a single byte or half-word sign extension.

Hmm.  So just to summarise something that was discussed in the PR
comments, this is a case where combine's expand_compound_operation/
make_compound_operation wrangler hurts us, because the process isn't
idempotent, and combine produces two complex instructions:

(insn 6 3 7 2 (set (reg:DI 137 [ _3 ])
         (ashift:DI (reg:DI 139 [ x ])
             (const_int 24 [0x18]))) "foo.c":2:20 305 {ashldi3}
      (expr_list:REG_DEAD (reg:DI 139 [ x ])
         (nil)))
(insn 12 7 13 2 (set (reg/i:DI 10 a0)
         (sign_extend:DI (ashiftrt:SI (subreg:SI (reg:DI 137 [ _3 ]) 0)
                 (const_int 24 [0x18])))) "foo.c":2:27 321 {ashrsi3_extend}
      (expr_list:REG_DEAD (reg:DI 137 [ _3 ])
         (nil)))

given two simple instructions:

(insn 6 3 7 2 (set (reg:SI 137 [ _3 ])
         (sign_extend:SI (subreg:QI (reg/v:DI 136 [ x ]) 0))) "foo.c":2:20 533 
{*extendqisi2_bitmanip}
      (expr_list:REG_DEAD (reg/v:DI 136 [ x ])
         (nil)))
(insn 7 6 12 2 (set (reg:DI 138 [ _3 ])
         (sign_extend:DI (reg:SI 137 [ _3 ]))) "foo.c":2:20 discrim 1 133 
{*extendsidi2_internal}
      (expr_list:REG_DEAD (reg:SI 137 [ _3 ])
         (nil)))

If I run with -fdisable-rtl-combine then late_combine1 already does the
expected transformation.
That's an interesting observation and opens another path forward, namely to get something more sensible out of combine, perhaps without going directly to optimizing to an extension.


Although it would be nice to fix combine, that might be difficult.
If we treat combine as immutable then the options are:

(1) Teach simplify-rtx to simplify combine's output into a single sign_extend.

(2) Allow fwprop1 to get in first, before combine has a chance to mess
     things up.

The patch goes for (2).

Is that a fair summary?
Yes, very fair.


Playing devil's advocate, I suppose one advantage of (1) is that it
would allow the optimisation even if the original rtl looked like
combine's output.  And fwprop1 doesn't distinguish between cases in
which the source instruction disappears from cases in which the source
instruction is kept.  Thus we could transform:
Right. Initially I was drawn to simplify-rtx as it would work for combine and any other clients of the simplify-rtx framework, though it's probably a bit of a stretch to expect to see this kind of case pop up in the other clients.



   (set (reg:SI R2) (sign_extend:SI (reg:QI R1)))
   (set (reg:DI R3) (sign_extend:DI (reg:SI R2)))

into:

   (set (reg:SI R2) (sign_extend:SI (reg:QI R1)))
   (set (reg:DI R3) (sign_extend:DI (reg:QI R1)))

which increases the register pressure between the two instructions
(since R2 and R1 are both now live).  In general, there could be
quite a gap between the two instructions.

On the other hand, even in that case, fwprop1 would be parallelising
the extensions.  And since we're talking about unary operations,
even two-address targets would allow R1 to be extended without
tying the source and destination.

Also, it seems relatively unlikely that expand would produce code
that looks like combine's, since the gimple optimisers should have
simplified it into conversions.

So initially I was going to agree that it's worth trying in fwprop.  But...
The appeal of fwprop was to avoid having to write a full recognizer. But you're POC takes a different approach than mine and just assumes the subregs are going to be in there from the get-go rather than writing two instances of the recognizer or trying to generalize with another set of painful conditionals.

[ fwprop implementation elided ]



...I think we should make sure that XEXP (new_rtx, 0)) is a REG or a subreg
of a REG (not just any SUBREG), so that we don't duplicate complex expressions.
I'd pondered that and can certainly support tightening. We might consider allowing MEMs given that many targets can handle the extension as part of a memory reference.


However, even that would lead to the odd situation in which simplifying
(foo (bar (reg X))) to (reg X) would not be treated as profitable,
but simplifying (foo (bar (reg X))) to (sign_extend (reg X)) would be
treated as profitable.  I think the two cases would need to be handled
in the same way, and maybe also (foo (reg X)) to (reg X).  That's beginning
to feel like a big change in policy for this stage of development.
Just to be clear, I won't lose any sleep if we defer to gcc-16. It's just a minor missed optimization in the end.


And it looks like option (1) isn't simple either.  As well as mangling
the original instructions, combine also fuses the final operation into:

(insn 12 7 13 2 (set (reg/i:DI 10 a0)
         (reg:DI 138 [ _3 ])) "foo.c":2:27 283 {*movdi_64bit}
      (expr_list:REG_DEAD (reg:DI 138 [ _3 ])
         (nil)))

so that it ends up with a hard register destination (as for the final
insn 12 quoted earlier).  late-combine currently punts on such instructions:
I'm less worried about the hard register cases, I tend to look at those as symptomatic of testcase reduction.

Unfortunately after 18+ months I don't remember where I saw this show up to be 100% sure about the context where this showed up.



If we do try to fix combine, I think something like the attached
would fit within the current scheme.  It is a pure shift-for-shift
transformation, avoiding any extensions.
Which actually simplifies things a bit. I was going straight for the sign extension generation. But if we just simplify by getting the subregs out of the way, that may be enough to make everything else happy. I'll play with this a bit...


Will think more about it, but wanted to get the above stream of
consciousness out before I finish for the day :)
It's appreciated.  I know you're probably as busy as I am.

Jeff

Reply via email to