https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
Bug ID: 97127
Summary: FMA3 code transformation leads to slowdown on Skylake
Product: gcc
Version: 10.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: already5chosen at yahoo dot com
Target Milestone: ---
The following clever gcc transformation generates slower code than the
non-transformed original:
a = *mem;
a = a + b * c;
where both b and c are reused further down, is transformed to:
a = b;
a = *mem + a * c;
Or, expressing the same in asm terms:
vmovuxx (mem), %ymmA
vfnmadd231xx %ymmB, %ymmC, %ymmA
is transformed to:
vmovaxx %ymmB, %ymmA
vfnmadd213xx (mem), %ymmC, %ymmA
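To make the operand roles in the two sequences explicit, here is a minimal
scalar sketch. It follows the asm listing, so it uses the negated form
a = -(b * c) + a. The function names fnmadd_original, fnmadd_transformed and
check_equivalence are hypothetical and do not come from chol.cpp, and
std::fma on doubles only stands in for the ymm instructions; the compiler is
not guaranteed to emit either sequence from this source.

```cpp
#include <cassert>
#include <cmath>

// Original form: load the accumulator from memory, then fuse the
// negated product into it (mirrors vmovu + vfnmadd231).
double fnmadd_original(const double* mem, double b, double c) {
    double a = *mem;             // vmovuxx (mem), %ymmA
    a = std::fma(-b, c, a);      // a = -(b * c) + a
    return a;
}

// Transformed form: copy b into the destination first, then take the
// addend straight from memory (mirrors vmova + vfnmadd213).
double fnmadd_transformed(const double* mem, double b, double c) {
    double a = b;                // vmovaxx %ymmB, %ymmA
    a = std::fma(-a, c, *mem);   // a = -(a * c) + *mem
    return a;
}

// Hypothetical self-check: both forms compute 10 - 3 * 2.
void check_equivalence() {
    const double mem = 10.0, b = 3.0, c = 2.0;
    assert(fnmadd_original(&mem, b, c) == 4.0);
    assert(fnmadd_transformed(&mem, b, c) == 4.0);
}
```

Both functions return the same value and leave b and c intact. They differ
only in whether the extra instruction is a load or a register-to-register
copy, and in which instruction carries the memory operand.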
You may ask "Why is the transformed variant slower?" and I can try my best to
answer (my guess is that the performance bottleneck is in the rename stage
rather than in the execution stage, and the transformed code occupies 3 rename
slots vs 2 rename slots for the original), but it would be mostly pointless.
What matters is that on Skylake the transformed variant is slower and I can
prove it with a benchmark.
BTW, the same holds on Haswell.
You can see comparison of two variants at
https://github.com/already5chosen/others/tree/master/cholesky_solver/gcc-badopt-fma3
The interesting spot starts at line 367 in file chol.cpp,
or two lines below .L21: in the asm generated by gcc 10.2.0 (chol_a.s).
Run 's_chol_a 100' vs 's_chol_b 100' and see the difference in favor of the
second (de-transformed) variant.
The difference, in this particular case, is small, on the order of 2-4 percent,
but very consistent.
In tighter loops I would expect a bigger difference.