On Tue, Dec 12, 2023 at 3:38 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> this patch disables use of FMA in the matrix multiplication loop for generic
> (for x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17, where the difference was within noise for Core.
>
> On Core the micro-benchmark runs as follows:
>
> With FMA:
>
>      578,500,241      cycles:u          #    3.645 GHz                ( +-  0.12% )
>      753,318,477      instructions:u    #    1.30  insn per cycle     ( +-  0.00% )
>      125,417,701      branches:u        #  790.227 M/sec              ( +-  0.00% )
>         0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
>
> No FMA:
>
>      577,573,960      cycles:u          #    3.514 GHz                ( +-  0.15% )
>      878,318,479      instructions:u    #    1.52  insn per cycle     ( +-  0.00% )
>      125,417,702      branches:u        #  763.035 M/sec              ( +-  0.00% )
>         0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
>
> So the cycle count is unchanged and the discrete multiply+add takes the same
> time as FMA.
>
> While on Zen:
>
> With FMA:
>        484875179      cycles:u          #    3.599 GHz                ( +-  0.05% )  (82.11%)
>        752031517      instructions:u    #    1.55  insn per cycle
>        125106525      branches:u        #  928.712 M/sec              ( +-  0.03% )  (85.09%)
>           128356      branch-misses:u   #    0.10% of all branches    ( +-  0.06% )  (83.58%)
>
> No FMA:
>        375875209      cycles:u          #    3.592 GHz                ( +-  0.08% )  (80.74%)
>        875725341      instructions:u    #    2.33  insn per cycle
>        124903825      branches:u        #    1.194 G/sec              ( +-  0.04% )  (84.59%)
>         0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
>
> The difference is that Cores understand that fmadd does not need all three
> operands to start computation, while Zen cores don't.
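For concreteness, the chain being discussed reduces to the two shapes below.
This is a minimal standalone sketch (function names made up here, not code
from the patch), assuming single precision and a single accumulator:

/* Minimal sketch, not code from the patch: the two shapes of a
   single-accumulator reduction that the avoid_fma_chains tunings choose
   between.  (With -ffp-contract=fast the compiler may of course fuse the
   second form again; this is only meant to show the dependences.)  */

#include <math.h>

/* FMA chain: each fmadd needs the previous accumulator value, so every
   iteration waits the full FMA latency.  */
float
dot_fma (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; ++k)
    acc = fmaf (a[k], b[k], acc);
  return acc;
}

/* Split multiply + add: the multiply depends only on the loads and can start
   early; only the add sits on the loop-carried chain.  */
float
dot_mul_add (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; ++k)
    {
      float t = a[k] * b[k];
      acc += t;
    }
  return acc;
}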
This came up in a separate thread as well, though there in the context of
reassociating a chain with multiple dependent FMAs.

I can't understand how this uarch detail can affect performance when, as in
the testcase, the longest input latency is on the multiplication from a
memory load.  Do we actually understand _why_ the FMAs are slower here?

Do we know that Cores can start the multiplication part when the add operand
isn't ready yet?  I'm curious how you set up a micro benchmark to measure this.

There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMAs per
cycle.  So in theory we can do at most 2 FMAs per cycle, but with
latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be able
to squeeze out a little bit more throughput when there are many FADD/FMUL ops
to execute?  That works independently of whether FMAs have a head start on
multiplication, as you'd still be bottlenecked on the 2-wide issue for FMA?
(A back-of-the-envelope sketch of this arithmetic follows after the quoted
patch below.)

On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a
latency of four.  So you should get worse results there (looking at the
numbers above you do get worse results, slightly so); probably the higher
number of uops is hidden by the latency.

> Since this seems a noticeable win on Zen and not a loss on Core it seems
> like a good default for generic.
>
> I plan to commit the patch next week if there are no complaints.

complaint!

Richard.

> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j, k;
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          for(k=0; k<SIZE; ++k)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>    clock_t s, e;
>
>    init();
>    s=clock();
>    mult();
>    e=clock();
>    printf(" mult took %10d clocks\n", (int)(e-s));
>
>    return 0;
> }
>
>         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
>         X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> -          | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> -          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
>     smaller FMA chain.  */
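Coming back to the latency/throughput question above: a small
back-of-the-envelope sketch.  The latency and issue-width numbers are the
assumptions quoted in this thread for Zen3/4, not measurements; the program
just prints the resulting per-element bounds:

#include <stdio.h>

int
main (void)
{
  /* Assumptions from the discussion above, not measured here:
     Zen3/4 FMA latency 4, FADD/FMUL latency 3, 2 FMA and 2 FADD pipes.  */
  const double fma_latency = 4.0;   /* cycles */
  const double fadd_latency = 3.0;  /* cycles */
  const double fma_issue = 2.0;     /* FMAs per cycle */
  const double fadd_issue = 2.0;    /* FADDs per cycle */

  /* Single accumulator: the loop-carried dependence is the limit.  */
  printf ("single chain, fma:     %.2f cycles/element\n", fma_latency);
  printf ("single chain, mul+add: %.2f cycles/element\n", fadd_latency);

  /* Enough independent accumulators: issue width is the limit instead,
     and splitting the FMA no longer buys anything.  */
  printf ("many chains, fma:      %.2f cycles/element\n", 1.0 / fma_issue);
  printf ("many chains, mul+add:  %.2f cycles/element\n", 1.0 / fadd_issue);

  return 0;
}

For what it's worth, the 4-vs-3 carried latency is at least roughly in line
with the ~1.29x cycle difference between the two Zen runs quoted above.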