Re: Disable FMADD in chains for Zen4 and generic

Hongtao Liu Tue, 12 Dec 2023 15:48:41 -0800

On Tue, Dec 12, 2023 at 10:38 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> this patch disables use of FMA in matrix multiplication loop for generic (for
> x86-64-v3) and zen4.  I tested this on zen4 and Xeon Gold 6212U.
>
> For Intel this is neutral both on the matrix multiplication microbenchmark
> (attached) and spec2k17 where the difference was within noise for Core.
>
> On core the micro-benchmark runs as follows:
>
> With FMA:
>
>        578,500,241      cycles:u          #    3.645 GHz              ( +-  0.12% )
>        753,318,477      instructions:u    #    1.30  insn per cycle   ( +-  0.00% )
>        125,417,701      branches:u        #  790.227 M/sec            ( +-  0.00% )
>           0.159146 +- 0.000363 seconds time elapsed  ( +-  0.23% )
>
>
> No FMA:
>
>        577,573,960      cycles:u          #    3.514 GHz              ( +-  0.15% )
>        878,318,479      instructions:u    #    1.52  insn per cycle   ( +-  0.00% )
>        125,417,702      branches:u        #  763.035 M/sec            ( +-  0.00% )
>           0.164734 +- 0.000321 seconds time elapsed  ( +-  0.19% )
>
> So the cycle count is unchanged and a discrete multiply+add takes the same
> time as FMA.
>
> While on zen:
>
>
> With FMA:
>          484875179      cycles:u          #    3.599 GHz              ( +-  0.05% )  (82.11%)
>          752031517      instructions:u    #    1.55  insn per cycle
>          125106525      branches:u        #  928.712 M/sec            ( +-  0.03% )  (85.09%)
>             128356      branch-misses:u   #    0.10% of all branches  ( +-  0.06% )  (83.58%)
>
> No FMA:
>          375875209      cycles:u          #    3.592 GHz              ( +-  0.08% )  (80.74%)
>          875725341      instructions:u    #    2.33  insn per cycle
>          124903825      branches:u        #    1.194 G/sec            ( +-  0.04% )  (84.59%)
>           0.105203 +- 0.000188 seconds time elapsed  ( +-  0.18% )
>
> The difference is that Core CPUs understand that fmadd does not need
> all three operands to start computation, while Zen cores do not.
>
> Since this seems to be a noticeable win on Zen and no loss on Core, it seems
> like a good default for generic.
>
> I plan to commit the patch next week if there are no complaints.
The generic part LGTM. (It's exactly what we proposed in [1].)


[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637721.html
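
For reference, the perf numbers quoted above are easier to compare as cycles per loop iteration. The sketch below divides the reported cycle counts by the reported branch counts. It assumes the branch count is a fair stand-in for the number of inner-loop iterations in this benchmark; that is my assumption, not something stated in the thread. The struct and function names are made up for illustration.

```c
#include <stdio.h>

/* One perf measurement copied from the quoted mail. */
struct sample {
    const char *label;
    double cycles;
    double branches;
};

static const struct sample samples[] = {
    { "Core, FMA",    578500241.0, 125417701.0 },
    { "Core, no FMA", 577573960.0, 125417702.0 },
    { "Zen,  FMA",    484875179.0, 125106525.0 },
    { "Zen,  no FMA", 375875209.0, 124903825.0 },
};

/* Cycles per iteration, using branches as a proxy for iterations
   (assumption: roughly one branch per inner-loop iteration).  */
static double cycles_per_iter(const struct sample *s)
{
    return s->cycles / s->branches;
}

/* Print one line per measurement. */
static void print_cycles_per_iter(void)
{
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        printf("%-14s %.2f cycles/iteration\n",
               samples[i].label, cycles_per_iter(&samples[i]));
}
```

Under that assumption both Core runs come out at about 4.6 cycles per iteration, while Zen drops from about 3.9 with FMA to about 3.0 without it, i.e. the FMA loop needs roughly 1.29x the cycles.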
>
> Honza
>
> #include <stdio.h>
> #include <time.h>
>
> #define SIZE 1000
>
> float a[SIZE][SIZE];
> float b[SIZE][SIZE];
> float c[SIZE][SIZE];
>
> void init(void)
> {
>    int i, j;
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          a[i][j] = (float)i + j;
>          b[i][j] = (float)i - j;
>          c[i][j] = 0.0f;
>       }
>    }
> }
>
> void mult(void)
> {
>    int i, j, k;
>
>    for(i=0; i<SIZE; ++i)
>    {
>       for(j=0; j<SIZE; ++j)
>       {
>          for(k=0; k<SIZE; ++k)
>          {
>             c[i][j] += a[i][k] * b[k][j];
>          }
>       }
>    }
> }
>
> int main(void)
> {
>    clock_t s, e;
>
>    init();
>    s=clock();
>    mult();
>    e=clock();
>    printf("        mult took %10d clocks\n", (int)(e-s));
>
>    return 0;
>
> }
>
>         * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
>         X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and generic.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index 43fa9e8fd6d..74b03cbcc60 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> -          | m_YONGFENG)
> +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +          | m_YONGFENG | m_GENERIC)
>
>  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
>     smaller FMA chain.  */
> -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
>
>  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
>     smaller FMA chain.  */
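
As a reading aid for the hunk above: each DEF_TUNE entry pairs a tuning flag with a mask of processors for which the flag is on by default, so adding m_ZNVER4 and m_GENERIC to the mask turns the flag on when tuning for znver4 or generic. Below is a much simplified model of that idea. The three-entry enum, the M() macro and tune_enabled() are hypothetical names for illustration only; GCC's real processor list and lookup code are larger and differ in detail.

```c
#include <stdio.h>

/* Hypothetical, reduced processor list (not GCC's real enum). */
enum proc { PROC_GENERIC, PROC_ZNVER3, PROC_ZNVER4, PROC_COUNT };

/* One bit per processor, in the spirit of the m_* masks in the patch. */
#define M(p) (1u << (p))

/* Default-on masks for the 128-bit flag, before and after the patch,
   restricted to the three toy processors above.  */
static const unsigned avoid_128fma_before = M(PROC_ZNVER3);
static const unsigned avoid_128fma_after =
    M(PROC_ZNVER3) | M(PROC_ZNVER4) | M(PROC_GENERIC);

/* A flag is active when the mask contains the bit of the processor
   being tuned for.  */
static int tune_enabled(unsigned selector, enum proc tune)
{
    return (selector & M(tune)) != 0;
}
```

In this toy model, tune_enabled(avoid_128fma_before, PROC_ZNVER4) is 0 and tune_enabled(avoid_128fma_after, PROC_ZNVER4) is 1, which mirrors what the patch changes for znver4 and generic.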



-- 
BR,
Hongtao
