https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Eric Botcazou from comment #4)
> > I guess the transformations should accept MEMs instead of just REGs but, no,
> > I'm not particularly interested in quirks of CISC architectures, I have
> > enough to do with those of RISC architectures.
> 
> The problem is that with both function arguments in memory, combine
> simplifies sequence of bswaps with memory argument ( == movbe) in foo7 to:
> 
> Failed to match this instruction:
> (set (reg:SI 84 [ D.2318 ])
>     (xor:SI (mem/c:SI (plus:SI (reg/f:SI 16 argp)
>                 (const_int 4 [0x4])) [2 b+0 S4 A32])
>         (mem/c:SI (reg/f:SI 16 argp) [2 a+0 S4 A32])))
> 
> This is invalid RTX, where both input arguments are in memory.
> 
> The optimized tree dump for foo7 is:
> 
>   <bb 2>:
>   _2 = __builtin_bswap32 (a_1(D));
>   _4 = __builtin_bswap32 (b_3(D));
>   _5 = _4 ^ _2;
>   _6 = __builtin_bswap32 (_5); [tail call]
>   return _6;

Seems to me we want

  (bit_xor (bswap32 @0) (bswap32 @1)) -> (bswap32 (bit_xor @0 @1))

in match-and-simplify speak.

On trunk this transform would go to tree-ssa-forwprop.c as a pattern.
It would apply to all bitwise binary ops and all bswap builtins
(all bit/byte-shuffling operations applying the same shuffle to
both operands).

(for bitop in bit_xor bit_ior bit_and
  (for bswap in BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64
    (simplify
      (bitop (bswap @0) (bswap @1))
      (bswap (bitop @0 @1))))
  (simplify
    (bitop (vec_perm @1 @2 @0) (vec_perm @3 @4 @0))
    (vec_perm (bitop @1 @3) (bitop @2 @4) @0)))

Not sure whether the vector permute one is profitable (though I guess a
permute is always more expensive than a bit operation).
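
The bswap half of the pattern rests on the fact that a bitwise op treats
every bit position independently, so applying the same byte shuffle to both
operands commutes with it.  A minimal C sketch of that identity follows; the
helper names are made up for illustration and are not part of GCC, only the
__builtin_bswap* calls are real:

```c
#include <stdint.h>

/* Hypothetical helpers: each returns 1 when swapping before the bitwise
   op gives the same value as swapping after it, for xor, ior and and. */

int bswap16_commutes(uint16_t a, uint16_t b)
{
    uint16_t sa = __builtin_bswap16(a), sb = __builtin_bswap16(b);
    return (uint16_t)(sa ^ sb) == __builtin_bswap16((uint16_t)(a ^ b))
        && (uint16_t)(sa | sb) == __builtin_bswap16((uint16_t)(a | b))
        && (uint16_t)(sa & sb) == __builtin_bswap16((uint16_t)(a & b));
}

int bswap32_commutes(uint32_t a, uint32_t b)
{
    uint32_t sa = __builtin_bswap32(a), sb = __builtin_bswap32(b);
    return (sa ^ sb) == __builtin_bswap32(a ^ b)
        && (sa | sb) == __builtin_bswap32(a | b)
        && (sa & sb) == __builtin_bswap32(a & b);
}

int bswap64_commutes(uint64_t a, uint64_t b)
{
    uint64_t sa = __builtin_bswap64(a), sb = __builtin_bswap64(b);
    return (sa ^ sb) == __builtin_bswap64(a ^ b)
        && (sa | sb) == __builtin_bswap64(a | b)
        && (sa & sb) == __builtin_bswap64(a & b);
}
```

The same argument does not carry over to arithmetic ops such as plus, since
carries cross byte boundaries; that is why the pattern only lists the three
bitwise ops.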

The requested transform of course relies on something folding
bswap (bswap (x)) to x and, for vec_perm, on detecting a cancelling
permute (tree-ssa-forwprop.c can do that already, I think).
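
To make the expected end result concrete, here is a small C sketch.  The
first function is reconstructed from the optimized dump quoted above (the
original foo7 source is not shown in this comment, so the name and signature
are assumptions); the second is what should remain once the bswaps are sunk
through the xor and the resulting bswap (bswap (x)) is folded away:

```c
#include <stdint.h>

/* Mirrors the quoted tree dump: _2, _4, _5, _6 become t2, t4, t5 and the
   return value.  Hypothetical reconstruction, not the PR's test case. */
uint32_t foo7_as_dumped(uint32_t a, uint32_t b)
{
    uint32_t t2 = __builtin_bswap32(a);
    uint32_t t4 = __builtin_bswap32(b);
    uint32_t t5 = t4 ^ t2;
    return __builtin_bswap32(t5);
}

/* After (bit_xor (bswap32 @0) (bswap32 @1)) -> (bswap32 (bit_xor @0 @1))
   the outer bswap32 meets an inner one, and the pair cancels. */
uint32_t foo7_simplified(uint32_t a, uint32_t b)
{
    return a ^ b;
}
```

Both functions should return the same value for every input, which is the
property the combined transforms are meant to expose to the optimizers.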

Mine.  Fixed by the above on the match-and-simplify branch.

> It looks to me that the optimization has to be re-implemented as tree
> optimization (probably by extending fold_builtin_bswap in builtins.c). This
> generic optimization will also benefit targets without bswap RTX pattern,
> e.g. plain i386, as observed in Comment #2.
> 
> I'm recategorizing the PR as a tree-optimization.
