https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #4 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> More so, gcc variant occupies 2 reservation station entries (2 fused uOps) vs
> 4 entries by de-transformed sequence.

I don't think this is true for the test at hand? With base+offset memory
operand the renaming stage already sees two separate uops for each fma, so
reservation etc. should also see two for each fma, 4 uops in total. And they
will not be fused.

It would be true if memory operands required just one register (and then
pressure on renaming stage would be the same for both variants).

> For me it's enough to know that it *is* slower.

Understood, but I hope GCC developers want to understand the nature of the
slowdown before attempting to fix it.
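
For readers following the uop argument, here is a minimal C sketch of the kind of code being discussed. It is not the test case from this bug: the function name `two_fma_indexed` and its parameters are hypothetical and chosen only for illustration. Each statement is a multiply-add whose multiplicand is loaded from an address formed from two registers (array base plus index). Whether a compiler turns each statement into a single FMA instruction with a folded memory operand depends on the compiler, the target, and flags such as `-O2 -mfma`, so treat that as an assumption rather than a guarantee.

```c
#include <stddef.h>

/* Hypothetical kernel, for illustration only.
 * Two independent multiply-adds; each reads its multiplicand from memory
 * at an address that needs two registers (base pointer + index i).
 * With FMA enabled and floating-point contraction allowed, a compiler may
 * emit each line as one FMA instruction with a memory source operand. */
double two_fma_indexed(const double *a, const double *b, size_t i,
                       double x, double acc0, double acc1)
{
    acc0 = x * a[i] + acc0;   /* candidate for FMA with operand [a + i*8] */
    acc1 = x * b[i] + acc1;   /* candidate for FMA with operand [b + i*8] */
    return acc0 + acc1;
}
```

Under the comment's reasoning, the two FMAs in such a sequence would show up as four uops in total at the renaming stage when the memory operands need two registers, while a variant whose memory operands need only one register could stay at two fused uops. Inspecting the generated assembly (for example with `gcc -O2 -mfma -S`) shows which addressing form the compiler actually picked.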