[…]
> >> > We generate for e.g.:
> >> >
> >> > #include <stdint.h>
> >> >
> >> > uint16_t f8 (uint8_t xr, uint8_t xc){
> >> >     return (uint8_t)(xr * xc);
> >> > }
> >> >
> >> > (insn 9 6 10 2 (set (reg:HI 101)
> >> (zero_extend:HI (reg/v:QI 96 [ xr ]))) "prom.c":4:16 -1
> >> (nil))
> >> (insn 10 9 11 2 (set (reg:HI 102)
> >> (zero_extend:HI (reg/v:QI 98 [ xc ]))) "prom.c":4:16 -1
> >> (nil))
> >> (insn 11 10 12 2 (set (reg:SI 103)
> >> (mult:SI (subreg:SI (reg:HI 101) 0)
> >> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> >> (nil))
> >> >
> >> > Out of expand. The paradoxical subreg isn't generated at all out of
> >> > expand unless it's needed. It does keep the original params around
> >> > as
> >> unused:
> >> >
> >> > (insn 2 7 4 2 (set (reg:QI 97)
> >> (reg:QI 0 x0 [ xr ])) "prom.c":3:37 -1
> >> (nil))
> >> (insn 4 2 3 2 (set (reg:QI 99)
> >> (reg:QI 1 x1 [ xc ])) "prom.c":3:37 -1
> >> (nil))
> >> >
> >> > And the paradoxical subreg is moved into the first operation requiring 
> >> > it:
> >> >
> >> > (insn 11 10 12 2 (set (reg:SI 103)
> >> (mult:SI (subreg:SI (reg:HI 101) 0)
> >> (subreg:SI (reg:HI 102) 0))) "prom.c":4:16 -1
> >> (nil))
> >>
> >> Ah, OK, this isn't what I'd imaagined.  I thought the xr and xc
> >> registers would be SIs and the DECL_RTLs would be QI subregs of those SI
> regs.
> >> I think that might work better, for the reasons above.  (That is,
> >> whenever we need the register in extended form, we can simply extend
> >> the existing reg rather than create a new one.)
> >
> > Ah, I see, no, I explicitly avoid this. When doing the type promotions
> > I tell it that size of the copies of xr and xc is still the original size, 
> > e.g. QI (i.e. I
> don't change 97 and 99).
> > This is different from what we do with extends where 97 and 99 *would*
> be changed.
> >
> > The reason is that if I make this SI the compiler thinks it knows the
> > value of all the bits in the register which led to various miscompares as it
> thinks it can use the SI value directly.
> >
> > This happens because again the xr and xc are hard regs. So having 97
> > be
> >
> > (set (reg:SI 97) (subreg:SI (reg:QI 0 x0 [ xr ]) 0))
> >
> > gets folded to an incorrect
> >
> > (set (reg:SI 97) (reg:SI 0 x0 [ xr ]))
> 
> This part I would expect (and hope for :-)).
> 
> > And now 97 is free to be used without any zero extension, as 97 on it's own
> is an invalid RTX.
> 
> But the way I'd imagined it working, expand would need to insert an
> extension before any operation that needs the upper 24 bits to be defined
> (e.g. comparisons, right shifts).  If the DECL_RTL is (subreg:QI (reg:SI x) 0)
> then the upper bits are not defined, since SUBREG_PROMOTED_VAR_P
> would/should be false for the subreg.

Ah I see, my fear here was that if we have a pattern which splits out the 
zero-extend for whatever reason
that if it gets folded it would be invalid.  But I think I understand what you 
meant.  In your case
we'd never again use the hardreg, but that everything goes through 97. Got it.

> 
> E.g. for:
> 
>   int8_t foo(int8_t x) { return x >> 1; }
> 
> x would have a DECL_RTL of (subreg:QI (reg:SI x) 0), the parameter
> assignment would be expanded as:
> 
>   (set (reg:SI x) (reg:SI x0))
> 
> the shift would be expanded as:
> 
>   (set (reg:SI x) (zero_extend:SI (subreg:QI (reg:SI x) 0)))
>   (set (reg:SI x) (ashiftrt:SI (reg:SI x) (const_int 1)))
> 
> and the return assignment would be expanded as:
> 
>   (set (reg:SI x0) (reg:SI x))
> 
> x + 1 would instead be expanded to just:
> 
>   (set (reg:SI x) (plus:SI (reg:SI x) (const_int 1)))
> 
> (without an extension).
> 
> I realised later though that, although reusing the DECL_RTL reg for the
> extension has the nice RA property of avoiding multiple live values, it would
> make it harder to combine the extension into the operation if the variable is
> still live afterwards.  So I guess we lose something both ways.
> 
> Maybe we need a different approach, not based on changing
> PROMOTE_MODE.
> 
> I wonder how easy it would be to do the promotion in gimple, then reuse
> backprop to determine when a sign/zero-extension (i.e. a normal gimple cast)
> can be converted into an “any extend”
> (probably represented as a new ifn).

Do you mean without changing the hook implementation but keeping the current 
promotion?

I guess the problem here is that it's the inverse cases that's the problem 
isn't it? It's not that in
gimple there are unneeded extends, it's that some operations require an 
any-extend no?

like in gimple ~a where a is an 8-bit quantity requires an any-extend, but no 
cast would be there
in gimple.

So for instance

#include <stdint.h>

uint8_t f (uint8_t a)
{
    return ~a;
}

Is just simply:

f (uint8_t a)
{
  uint8_t _2;

  <bb 2> [local count: 1073741824]:
  _2 = ~a_1(D);
  return _2;

}

In gimple. I'm also slightly worried about interfering with phi opts. Backprop 
runs
before ifcombine and pihops for instance and there are various phi opts like 
ifcombine_ifandif
that rely on the BB containing only the phi node.  Adding an any-extend in 
between would break this.

I also wonder if the IFN won't interfere with range analysis of expressions. 
Unless we manage to strategically
Insert the IFNs on entire expressions and not in the intermediate SSA 
components.

> 
> > So I have to keep the intermediate copy QI mode, after which the RTX
> > optimizations being done during expand generates the forms above.
> >
> >>
> >> I think that's where confusion was coming from.
> >>
> >> > In any case, I'm still not totally sure what the objection here is.
> >> > Afaik, compares need to be treated specially because in GIMPLE they
> >> > already are.  Afaik, C integer promotion rules state that in the
> >> > comparison 0 should have been promoted to an integer constant of
> >> > rank int and so the comparison itself should have been done as integer.
> >> > i.e. extended.  And most of our patterns are based around this.
> >> >
> >> > Gimple however doesn't do this, the comparison is done in the rank
> >> > of the variable and there is no explicit conversion.  This happened
> >> > to be fixed up before during the forced promotion.  So to me the
> >> > heuristic doesn't seem to be that crazy..
> >>
> >> I guess I still don't see the distinction.  C says the same thing
> >> about
> >> +, -, >>, etc.  And gimple is free to do those operations in narrow
> >> +types
> >> if it wants to, and if that doesn't change the semantics.  (Not that
> >> gimple always does them in narrow types.  But it is valid gimple.)
> >>
> >> The optimisation problem doesn't come from C or gimple semantics, but
> >> from the fact that AArch64 (unlike x86 say) doesn't have byte add,
> >> byte compare, byte right shift, etc.  We therefore need to promote
> >> 8-bit and 16- bit operations to 32 bits first.
> >>
> >> For add, subtract, multiply, left shift, and logic ops, we can avoid
> >> defining the upper bits of the inputs when we do these extensions,
> >> because the upper bits of the inputs don't affect the useful bits of
> >> the result.  But for comparisons, right shifts, and divides, etc., we do 
> >> need
> to extend.
> >>
> >> AIUI, the comparison case is special because (for AArch64-specific
> >> reasons), we prefer extend + cbz to tst + branch, especially when the
> >> extend can be shared.
> >
> > Agreed, so I think we agree on this 😊 I guess the disagreement is
> > where this should be done. I'll admit that the testcase above works by
> > coincidence. But if we don't do it during expand time, the only place
> > I can think of to introduce the zero extends is to add various
> > patterns to do an early split of any-extend
> > + cmp.  But wouldn't that be more fragile? At least at expand time all
> > + comparisons
> > are tcc_comparisons.
> 
> I guess one question is: is the patch without the comparison handling just
> exacerbating an existing problem?  Do we already make similar bad choices
> between extend+cbz and tst+branch in cases where the variables aren't
> short, but where intermediate calculations involve &s?  If so, it might be
> something worth tackling in its own right, regardless of where the &s or
> extensions come from.

I've tried various cases but they all look correct.  This is mostly because
we already have actual instructions for these. We generate a chain of
and + tst in most != 0 cases, in the rest we do the least amount of &s
and then a normal cmp. 

For the case in spec, we correctly only get an additional mov to copy
the value.  So it looks like it's only an issue with short cases.

Cheers,
Tamar

> 
> But yeah, I'm not sure how easily that would fit into existing passes.
> 
> Thanks,
> Richard

Reply via email to