https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61559
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                       |Added
----------------------------------------------------------------------------
             Status|NEW                           |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Eric Botcazou from comment #4)
> > I guess the transformations should accept MEMs instead of just REGs but, no,
> > I'm not particularly interested in quirks of CISC architectures, I have
> > enough to do with those of RISC architectures.
>
> The problem is that with both function arguments in memory, combine
> simplifies the sequence of bswaps with a memory argument ( == movbe) in foo7 to:
>
> Failed to match this instruction:
> (set (reg:SI 84 [ D.2318 ])
>     (xor:SI (mem/c:SI (plus:SI (reg/f:SI 16 argp)
>                 (const_int 4 [0x4])) [2 b+0 S4 A32])
>         (mem/c:SI (reg/f:SI 16 argp) [2 a+0 S4 A32])))
>
> This is invalid RTX, where both input arguments are in memory.
>
> The optimized tree dump for foo7 is:
>
>   <bb 2>:
>   _2 = __builtin_bswap32 (a_1(D));
>   _4 = __builtin_bswap32 (b_3(D));
>   _5 = _4 ^ _2;
>   _6 = __builtin_bswap32 (_5); [tail call]
>   return _6;

Seems to me we want

  (bit_xor (bswap32 @0) (bswap32 @1)) -> (bswap32 (bit_xor @0 @1))

in match-and-simplify speak.  On trunk this transform would go into
tree-ssa-forwprop.c as a pattern.  It would apply to all bitwise binary
ops and all bswap builtins (to all bit/byte-shuffling operations that
apply the same shuffle to both operands):

(for bitop in bit_xor bit_ior bit_and
 (for bswap in BUILT_IN_BSWAP16 BUILT_IN_BSWAP32 BUILT_IN_BSWAP64
  (simplify
   (bitop (bswap @0) (bswap @1))
   (bswap (bitop @0 @1))))
 (simplify
  (bitop (vec_perm @1 @2 @0) (vec_perm @3 @4 @0))
  (vec_perm (bitop @1 @3) (bitop @2 @4) @0)))

Not sure if the vector permute one is profitable (but I guess a permute
is always more expensive than a bit operation).

The requested transform of course relies on somebody transforming
bswap (bswap (x)) to x, and, for vec_perm, on detecting a cancelling
operation (tree-ssa-forwprop.c can do that already, I think).

Mine.  Fixed by the above on match-and-simplify.

> It looks to me that the optimization has to be re-implemented as a tree
> optimization (probably by extending fold_builtin_bswap in builtins.c).  This
> generic optimization will also benefit targets without a bswap RTX pattern,
> e.g. plain i386, as observed in Comment #2.
>
> I'm recategorizing the PR as a tree-optimization.
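
For reference, a minimal C sketch of the shape that produces the foo7 tree
dump quoted above (the actual testcase is not quoted in this comment, so this
is a reconstruction from the dump, not the original source):

  unsigned int
  foo7 (unsigned int a, unsigned int b)
  {
    /* bswap32 (bswap32 (a) ^ bswap32 (b)): the proposed rule turns the XOR
       of two bswaps into a bswap of the XOR, and a subsequent
       bswap (bswap (x)) -> x cancellation leaves plain a ^ b.  */
    return __builtin_bswap32 (__builtin_bswap32 (a) ^ __builtin_bswap32 (b));
  }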
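
The vec_perm variant would match source like the following sketch, written
with GCC's generic vector extensions (illustrative only, not taken from the
PR's testcase; __builtin_shuffle with two input vectors and a mask maps onto
VEC_PERM_EXPR):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  perm_xor (v4si a, v4si b, v4si c, v4si d, v4si mask)
  {
    /* Both shuffles use the same mask, so the proposed rule would rewrite
       this as a single shuffle of (a ^ c) and (b ^ d).  */
    return __builtin_shuffle (a, b, mask) ^ __builtin_shuffle (c, d, mask);
  }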