On Tue, Sep 3, 2024 at 3:07 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>
> Hi,
> We disable gathers for zen4.  It seems that gather has improved a bit compared
> to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> the indices are known ahead of time. Vector loads followed by shuffles result
> in a higher load bandwidth." however the situation seems to be more
> complicated.
>
> gather is 5-10% loss on parest benchmark as well as 30% loss on sparse dot
> products in TSVC. Curiously enough breaking these out into microbenchmark
> reversed the situation and it turns out that the performance depends on
> how indices are distributed.  gather is loss if indices are sequential,
> neutral if they are random and win for some strides (4, 8).
>
> This seems to be similar to earlier zens, so I think (especially for
> backporting znver5 support) that it makes sense to be conistent and disable
> gather unless we work out a good heuristics on when to use it. Since we
> typically do not know the indices in advance, I don't see how that can be 
> done.
>
> I opened PR116582 with some examples of wins and loses

Note there's no way to emulate masked gathers (well - emit control flow), so
they remain the choice when AVX512 is enabled and you have conditional
loads.  Similar for stores and scatter though there performance may be well
absymal - something for the cost model to resolve.  Note I think x86 doesn't
yet expose AVX512 masked gather/scatter - the builtin target hook doesn't
support it and the backend doesn't have any mask_gather_load or
mask_scatter_store
optabs to go the now prefered internal-fn way.

Open-coding 8-way gather is also heavy in code size and thus might effect
ucode re-use for large loops (OTOH gathers may take up much space in the
ucode cache or be not there at all).

Richard.

> Bootstrapped/regtested x86_64-linux, committed.
>
>
> gcc/ChangeLog:
>
>         * config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for
>         ZNVER5.
>         (X86_TUNE_USE_SCATTER_2PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_GATHER_4PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_SCATTER_4PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_GATHER_8PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_SCATTER_8PARTS): Disable for ZNVER5.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index da1a3d6a3c6..ed26136faee 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -476,35 +476,35 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, 
> "avoid_4byte_prefixes",
>  /* X86_TUNE_USE_GATHER_2PARTS: Use gather instructions for vectors with 2
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts",
> -         ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
> +         ~(m_ZNVER | m_CORE_HYBRID
>             | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_2PARTS: Use scater instructions for vectors with 2
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_2PARTS, "use_scatter_2parts",
> -         ~(m_ZNVER4))
> +         ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_USE_GATHER_4PARTS: Use gather instructions for vectors with 4
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts",
> -         ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
> +         ~(m_ZNVER | m_CORE_HYBRID
>             | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_4PARTS: Use scater instructions for vectors with 4
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_4PARTS, "use_scatter_4parts",
> -         ~(m_ZNVER4))
> +         ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_USE_GATHER: Use gather instructions for vectors with 8 or more
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_8PARTS, "use_gather_8parts",
> -         ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER4 | m_CORE_HYBRID | m_CORE_ATOM
> +         ~(m_ZNVER | m_CORE_HYBRID | m_CORE_ATOM
>             | m_YONGFENG | m_SHIJIDADAO | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER: Use scater instructions for vectors with 8 or more
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> -         ~(m_ZNVER4))
> +         ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */

Reply via email to