> > The difference is that Cores understand the fact that fmadd does not need
> > all three parameters to start computation, while Zen cores don't.
> >
> > Since this seems a noticeable win on Zen and not a loss on Core, it seems
> > like a good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
> The generic part LGTM. (It's exactly what we proposed in [1].)
>
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html

Thanks.  I wonder if we can think of other generic changes that would make
sense to do?
Concerning zen4 and FMA, it is not really a win with AVX512 enabled
(which is what I was benchmarking for znver4 tuning), but it is indeed a
win with AVX256, where the extra latency is not hidden by the parallelism
exposed by doing everything twice.

I re-benchmarked zen4 and it behaves similarly to zen3 with AVX256, so
for x86-64-v3 this makes sense.

Honza
> >
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          for(k=0; k<SIZE; ++k)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >    clock_t s, e;
> >
> >    init();
> >    s=clock();
> >    mult();
> >    e=clock();
> >    printf("        mult took %10d clocks\n", (int)(e-s));
> >
> >    return 0;
> >
> > }
> >
> > 	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
> > 	X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and Core.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > -          | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +          | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > -          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +          | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> >     smaller FMA chain.  */
> >
>
> --
> BR,
> Hongtao
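
For reference, the reasoning behind avoiding tight FMA chains can be
sketched as a back-of-the-envelope model of the loop-carried critical
path in a reduction such as the inner k loop of mult().  This is only a
rough sketch: the latencies are illustrative assumptions rather than
measurements, the function names are made up for the example, and
vectorization, memory effects and out-of-order overlap between
reductions are all ignored.

```c
/* Illustrative latencies in cycles.  These are assumptions chosen for the
   example, not measured values for any particular core.  */
enum { FMA_LATENCY = 4, MUL_LATENCY = 3, ADD_LATENCY = 3 };

/* Fused shape: sum = fma (a[k], b[k], sum).
   If the core waits for all three operands before starting, each
   iteration pays the full FMA latency on the loop-carried chain.  */
long chain_cycles_fused (long n)
{
  return n * FMA_LATENCY;
}

/* Split shape: prod = a[k] * b[k]; sum += prod.
   The multiplies do not depend on sum, so they can run ahead; only the
   adds form the chain, after one initial multiply.  */
long chain_cycles_split (long n)
{
  return n > 0 ? MUL_LATENCY + n * ADD_LATENCY : 0;
}
```

With n = 1000 (SIZE in the benchmark) and the assumed latencies, this
gives 4000 cycles for the fused chain versus 3003 for the split one.  A
core that can start the multiply part of fmadd before the addend is
ready would not pay the full FMA latency per iteration, which matches
the observation that the change is not a loss on Core.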