https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114252

--- Comment #8 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #7)
> Note I do understand what you are saying, just the middle-end in detecting
> and using __builtin_bswap32 does what it does everywhere else - it checks
> whether the target implements the operation.
> 
> The middle-end doesn't try to actually compare costs (it has no idea of the
> bswapsi costs),

But even if the bswapsi insn cost nothing, the v14 code has these additional
6 movqi insns 32...37 compared to the v13 code.  In order to have the same
performance as the v13 code, a bswapsi would have to cost negative 6 insns.  And
an optimizer that assumes negative costs is not reasonable, in particular
because the recognition of bswap opportunities serves optimization -- or is
supposed to serve it, as far as I understand.

> and it most definitely doesn't see how AVR is special in
> having only QImode registers and thus the created SImode load (which the
> target supports!) will end up as four registers.

Even if the bswapsi insn cost nothing, the code would be worse.

> The only thing that maybe would make sense with AVR exposing bswapsi is
> users calling __builtin_bswap but since it always expands as a libcall
> even that makes no sense.

It makes perfect sense when C/C++ code uses __builtin_bswap32:

* With the current bswapsi insn, the code does a call that performs SI:22 =
bswap(SI:22) with NO additional register pressure.

* Without the bswapsi insn, the code does a real ABI call that performs SI:22 =
bswap(SI:22) PLUS IT CLOBBERS r18, r19, r20, r21, r26, r27, r30 and r31, which
are the most powerful GPRs.
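For illustration, this is a minimal sketch of the kind of C code in question (the function name is hypothetical); with the bswapsi insn it compiles to a transparent library call touching only the registers holding x, while without it a regular ABI call with the clobbers listed above is emitted:

```c
#include <stdint.h>

/* Hypothetical example.  On avr-gcc, __builtin_bswap32 expands via the
   bswapsi insn (if present) into a call that preserves all call-used
   registers except the four holding the operand.  */
uint32_t swap_be(uint32_t x)
{
    return __builtin_bswap32(x);
}
```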

> So my preferred fix would be to remove bswapsi from avr.md?

Is there a way that the backend can fold a call to an insn that performs better
than a call?  Like in TARGET_FOLD_BUILTIN?  As far as I know, the backend can
only fold target builtins, but not common builtins.  Tree folding cannot fold to
an insn, obviously, but it could fold to inline asm, no?

Or can the target change an optabs entry so that it expands to an insn that's
more profitable than the respective call?  (Like avr.md's bswap insn with a
transparent call is more profitable than a real call.)

The avr backend does this for many other operations, too:

divmod, SI and PSI multiplications, parity, popcount, clz, ffs, ...

> Does it benefit from recognizing bswap done with shifts on an int?

I don't fully understand that question.  Do you mean that code which shifts
bytes around, like
    uint32_t res = 0;
    res |= (uint32_t) buf[0] << 24;
    res |= (uint32_t) buf[1] << 16;
    res |= (uint32_t) buf[2] << 8;
    res |= buf[3];
    return res;
is better than a bswapsi call?
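For reference, the same operation on a value already in registers, written in the pure shift-and-mask form that the middle-end's bswap recognition is meant to detect, would look something like this (a sketch, not code from the bug report):

```c
#include <stdint.h>

/* Shift/mask byte swap of a 32-bit value; semantically identical to
   __builtin_bswap32, spelled out so the bswap pass can recognize it.  */
uint32_t swap32_shifts(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}
```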
