https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109766

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-05-08
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
I believe the problem is in the cprop_hardreg pass, which undoes reload's
register assignments (to use DImode GPR registers with -Os) by propagating
DFmode values into *pushdi2_rex64.  Those pushes then get split during the
split3 pass into lea/movq pairs, each of which is larger than a DImode push.
The workaround, for this test case, is to use -Os -fno-cprop-registers, which
produces code that's shorter than -O2:

0000000000000000 <callfunc>:
   0:   66 48 0f 7e ca          movq   %xmm1,%rdx
   5:   66 48 0f 7e d1          movq   %xmm2,%rcx
   a:   66 48 0f 7e de          movq   %xmm3,%rsi
   f:   50                      push   %rax
  10:   66 49 0f 7e e0          movq   %xmm4,%r8
  15:   66 48 0f 7e c0          movq   %xmm0,%rax
  1a:   66 49 0f 7e e9          movq   %xmm5,%r9
  1f:   66 49 0f 7e f2          movq   %xmm6,%r10
  24:   66 49 0f 7e fb          movq   %xmm7,%r11
  29:   41 53                   push   %r11
  2b:   41 52                   push   %r10
  2d:   41 51                   push   %r9
  2f:   41 50                   push   %r8
  31:   56                      push   %rsi
  32:   51                      push   %rcx
  33:   52                      push   %rdx
  34:   50                      push   %rax
  35:   b0 08                   mov    $0x8,%al
  37:   e8 00 00 00 00          callq  3c <callfunc+0x3c>
  3c:   48 83 c4 48             add    $0x48,%rsp
  40:   c3                      retq
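
For readers without the attachment to hand, the following is a rough sketch of
the kind of source that leads to a listing like the one above.  It is an
assumption reconstructed from the disassembly, not the test case from the bug
report: the names sum_doubles and callfunc_sketch, the count parameter and the
return value are all hypothetical, so it will not reproduce the listing byte
for byte.  The point it shows is that on x86-64 SysV the first eight doubles
travel in %xmm0-%xmm7 and any further doubles passed to the callee have to be
pushed onto the stack.

```c
#include <stdarg.h>

/* Hypothetical variadic callee (name and signature assumed for
   illustration): adds up `count` double arguments.  */
double sum_doubles(int count, ...)
{
    va_list ap;
    double total = 0.0;

    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, double);
    va_end(ap);
    return total;
}

/* Hypothetical caller: the eight incoming doubles arrive in SSE
   registers.  Passing them twice means the second set no longer fits in
   registers and must be stored to the outgoing stack area, which is
   where the DFmode-vs-DImode push choice comes into play.  */
double callfunc_sketch(double a, double b, double c, double d,
                       double e, double f, double g, double h)
{
    return sum_doubles(16, a, b, c, d, e, f, g, h,
                           a, b, c, d, e, f, g, h);
}
```

Compiling such a file with gcc -Os -c and again with gcc -Os
-fno-cprop-registers -c, then comparing the objdump -d output, should show
whether the pushes are emitted as single GPR pushes or as lea/movq pairs.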

Now to figure out if there's a way, using target rtx_costs or pushdi2_rex64's
constraints/predicates, to prevent hardreg cprop from performing this
substitution.  Plan B might be to investigate reload's choice of DFmode SSE vs
DImode GPR, but the -fno-cprop-registers code is already within one or two
bytes of optimal (for four arguments I believe GCC would produce shorter code
than clang).
