On Thu, Dec 14, 2023 at 12:03 AM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> > > The difference is that Core CPUs understand that fmadd does not need all
> > > three operands to start computation, while Zen cores do not.
> > >
> > > Since this seems to be a noticeable win on Zen and no loss on Core, it
> > > seems like a good default for generic.
> > >
> > > I plan to commit the patch next week if there are no complaints.
> > The generic part LGTM. (It's exactly what we proposed in [1].)
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
>
> Thanks.  I wonder if we can think of other generic changes that would
> make sense to do?
> Concerning zen4 and FMA, it is not really a win with AVX512 enabled
> (which is what I was benchmarking for znver4 tuning), but it is indeed a
> win with AVX256, where the extra latency is not hidden by the parallelism
> exposed by doing everything twice.
>
> I re-benchmarked zen4 and it behaves similarly to zen3 with AVX256, so
> for x86-64-v3 this makes sense.
>
> Honza
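
To illustrate the point about latency and parallelism, here is a minimal
sketch in plain C. It is only an illustration and not part of the patch; the
function names dot_single_chain and dot_two_chains are hypothetical. The first
loop has a single loop-carried accumulator, like the inner loop of mult() in
the test case quoted below, so each iteration has to wait for the previous
one. The second splits the sum into two independent chains, which is roughly
the kind of parallelism that can hide the extra latency. Note that the tuning
flags do not perform this transformation; as far as I understand, they only
keep GCC from fusing the multiply and the add into an FMA on such a
loop-carried chain.

```c
#include <stddef.h>

/* One accumulator: every iteration depends on the previous value of acc,
   so the loop is bound by the latency of the add (or of the FMA, if the
   compiler fuses the multiply and the add).  */
float dot_single_chain(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t k = 0; k < n; ++k)
        acc += a[k] * b[k];
    return acc;
}

/* Two accumulators: two independent dependency chains that the CPU can
   overlap.  Hypothetical manual rewrite, shown for illustration only.  */
float dot_two_chains(const float *a, const float *b, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t k = 0;

    for (; k + 1 < n; k += 2) {
        acc0 += a[k] * b[k];         /* chain 0 */
        acc1 += a[k + 1] * b[k + 1]; /* chain 1, independent of chain 0 */
    }
    if (k < n)                       /* leftover element when n is odd */
        acc0 += a[k] * b[k];

    return acc0 + acc1;
}
```

With exactly representable inputs both functions return the same value. In
general the two-accumulator version changes the order of the floating-point
additions, so the results may differ in the last bits.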
> > >
> > > Honza
> > >
> > > #include <stdio.h>
> > > #include <time.h>
> > >
> > > #define SIZE 1000
> > >
> > > float a[SIZE][SIZE];
> > > float b[SIZE][SIZE];
> > > float c[SIZE][SIZE];
> > >
> > > void init(void)
> > > {
> > >    int i, j;
> > >    for(i=0; i<SIZE; ++i)
> > >    {
> > >       for(j=0; j<SIZE; ++j)
> > >       {
> > >          a[i][j] = (float)i + j;
> > >          b[i][j] = (float)i - j;
> > >          c[i][j] = 0.0f;
> > >       }
> > >    }
> > > }
> > >
> > > void mult(void)
> > > {
> > >    int i, j, k;
> > >
> > >    for(i=0; i<SIZE; ++i)
> > >    {
> > >       for(j=0; j<SIZE; ++j)
> > >       {
> > >          for(k=0; k<SIZE; ++k)
> > >          {
> > >             c[i][j] += a[i][k] * b[k][j];
> > >          }
> > >       }
> > >    }
> > > }
> > >
> > > int main(void)
> > > {
> > >    clock_t s, e;
> > >
> > >    init();
> > >    s=clock();
> > >    mult();
> > >    e=clock();
> > >    printf("        mult took %10d clocks\n", (int)(e-s));
> > >
> > >    return 0;
> > >
> > > }
> > >
> > >         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
> > >         X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and Core.
> > >
> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 43fa9e8fd6d..74b03cbcc60 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> > >
> > >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> > >     smaller FMA chain.  */
> > > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > > -          | m_YONGFENG)
> > > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > > +          | m_YONGFENG | m_GENERIC)
> > >
> > >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> > >     smaller FMA chain.  */
> > > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > > -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > > +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
Can we backport the patch (at least the generic part) to the
GCC 11/GCC 12/GCC 13 release branches?
> > >
> > >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> > >     smaller FMA chain.  */
> >
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao