https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109766
Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|                            |2023-05-08
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW

--- Comment #2 from Roger Sayle <roger at nextmovesoftware dot com> ---
I believe the problem is in the cprop_hardreg pass, which undoes reload's
register assignments (to use DImode GPR registers with -Os) by propagating
DFmode values into *pushdi2_rex64. These then get split during the split3
pass into lea/movq pairs, each of which is larger than a DImode push.

The workaround, for this test case, is to use -Os -fno-cprop-registers,
which produces code that's shorter than -O2.

0000000000000000 <callfunc>:
   0:   66 48 0f 7e ca          movq   %xmm1,%rdx
   5:   66 48 0f 7e d1          movq   %xmm2,%rcx
   a:   66 48 0f 7e de          movq   %xmm3,%rsi
   f:   50                      push   %rax
  10:   66 49 0f 7e e0          movq   %xmm4,%r8
  15:   66 48 0f 7e c0          movq   %xmm0,%rax
  1a:   66 49 0f 7e e9          movq   %xmm5,%r9
  1f:   66 49 0f 7e f2          movq   %xmm6,%r10
  24:   66 49 0f 7e fb          movq   %xmm7,%r11
  29:   41 53                   push   %r11
  2b:   41 52                   push   %r10
  2d:   41 51                   push   %r9
  2f:   41 50                   push   %r8
  31:   56                      push   %rsi
  32:   51                      push   %rcx
  33:   52                      push   %rdx
  34:   50                      push   %rax
  35:   b0 08                   mov    $0x8,%al
  37:   e8 00 00 00 00          callq  3c <callfunc+0x3c>
  3c:   48 83 c4 48             add    $0x48,%rsp
  40:   c3                      retq

Now to figure out if there's a way, using target rtx_costs or
pushdi2_rex64's constraints/predicates, to prevent hardreg cprop performing
this substitution. Plan B might be to investigate reload's choice of DFmode
SSE vs DImode GPR, but this is within one or two bytes of optimal (for four
arguments I believe GCC would produce shorter code than clang).
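
For anyone who wants to check the size argument against the listing, here is a rough sketch that tallies the encoded bytes per mnemonic from the disassembly in the comment. The names `LISTING` and `tally` are made up for illustration and are not part of GCC or binutils; the parser simply assumes every two-hex-digit token after the offset is an instruction byte, which holds for this listing but not for objdump output in general.

```python
import re
from collections import Counter

# Disassembly copied from the comment (callfunc built with
# -Os -fno-cprop-registers). Format per line: offset, bytes, instruction.
LISTING = """\
 0: 66 48 0f 7e ca movq %xmm1,%rdx
 5: 66 48 0f 7e d1 movq %xmm2,%rcx
 a: 66 48 0f 7e de movq %xmm3,%rsi
 f: 50 push %rax
10: 66 49 0f 7e e0 movq %xmm4,%r8
15: 66 48 0f 7e c0 movq %xmm0,%rax
1a: 66 49 0f 7e e9 movq %xmm5,%r9
1f: 66 49 0f 7e f2 movq %xmm6,%r10
24: 66 49 0f 7e fb movq %xmm7,%r11
29: 41 53 push %r11
2b: 41 52 push %r10
2d: 41 51 push %r9
2f: 41 50 push %r8
31: 56 push %rsi
32: 51 push %rcx
33: 52 push %rdx
34: 50 push %rax
35: b0 08 mov $0x8,%al
37: e8 00 00 00 00 callq 3c <callfunc+0x3c>
3c: 48 83 c4 48 add $0x48,%rsp
40: c3 retq
"""

# Assumption: an instruction byte is exactly two lowercase hex digits.
BYTE = re.compile(r"^[0-9a-f]{2}$")


def tally(listing):
    """Return (total_bytes, Counter mapping mnemonic -> encoded bytes)."""
    per_mnemonic = Counter()
    for line in listing.splitlines():
        if ":" not in line:
            continue
        # Drop the offset; what remains is "<bytes> <mnemonic> <operands>".
        _, rest = line.split(":", 1)
        tokens = rest.split()
        n = 0
        while n < len(tokens) and BYTE.match(tokens[n]):
            n += 1
        if n == 0 or n == len(tokens):
            continue  # no bytes, or no mnemonic after the bytes
        per_mnemonic[tokens[n]] += n
    return sum(per_mnemonic.values()), per_mnemonic


total, sizes = tally(LISTING)
print("total bytes:", total)
for mnemonic, nbytes in sizes.most_common():
    print(f"{mnemonic:6s} {nbytes}")
```

The total comes to 65 bytes (0x41), consistent with retq sitting at offset 0x40. The eight movq instructions account for 40 of those bytes and the nine pushes for 13, so each argument routed through a GPR costs 6 or 7 bytes in this listing.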