https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #9)
> 1703 : 401cb1: vmovq %xmm1,%r9   (*)
>  834 : 401cb6: vmovq %r8,%xmm1
> 1719 : 401cbb: vmovq %r9,%xmm0   (*)
>
> Looks like %r9 is dead after the second (*), and it can be optimized to
>
> 1703 : 401cb1: vmovq %xmm1,%xmm0
>  834 : 401cb6: vmovq %r8,%xmm1

Yep, we also have code like

        movabsq $0x3ff03db8fde2ef4e, %r8
        ...
        vmovq   %r8, %xmm11

or

        movq    .LC11(%rip), %rax
        vmovq   %rax, %xmm14

which is extremely odd to see ... (I didn't check how we arrive at that)

When I do

diff --git a/gcc/config/i386/x86-tune-costs.h b/gcc/config/i386/x86-tune-costs.h
index 017ffa69958..4c51358d7b6 100644
--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
@@ -1585,7 +1585,7 @@ struct processor_costs znver2_cost = {
                                           in 32,64,128,256 and 512-bit.  */
   {8, 8, 8, 8, 16},                     /* cost of storing SSE registers
                                           in 32,64,128,256 and 512-bit.  */
-  6, 6,                                 /* SSE->integer and integer->SSE
+  8, 8,                                 /* SSE->integer and integer->SSE
                                           moves.  */
   8, 8,                                 /* mask->integer and integer->mask moves */
   {6, 6, 6},                            /* cost of loading mask register

performance improves from 128 seconds to 115 seconds.  The result is a lot
more stack spilling in the code, but there are still cases like

        movq    .LC8(%rip), %rax
        vmovsd  .LC13(%rip), %xmm6
        vmovsd  .LC16(%rip), %xmm11
        vmovsd  .LC14(%rip), %xmm3
        vmovsd  .LC12(%rip), %xmm14
        vmovq   %rax, %xmm2
        vmovq   %rax, %xmm0
        movq    .LC9(%rip), %rax

see how we load .LC8 to %rax just to move it to xmm2 and xmm0, instead of at
least moving xmm2 to xmm0 (maybe that's now cprop_hardreg's job) or loading
directly to xmm0.

In the end register pressure is the main issue, but how we deal with it is bad.
It's likely caused by a combination of PRE & hoisting & sinking, which together
exploit

  if( ((*((unsigned int*) ((void*) (&((((srcGrid)[((FLAGS)+N_CELL_ENTRIES*((0)+
(0)*(1*(100))+(0)*(1*(100))*(1*(100))))+(i)]))))))) & (ACCEL))) {
        ux = 0.005;
        uy = 0.002;
        uz = 0.000;
      }

which makes the following computations partly compile-time resolvable.

I still think the above code generation issues need to be analyzed and we
should figure out why we emit this weird code under register pressure.  I'll
attach a testcase that has the function in question split out for easier
analysis.