On Tue, Jan 9, 2018 at 11:26 AM, Jan Hubicka <hubi...@ucw.cz> wrote:
> Hi,
> gather instructions are rather hard to implement in hardware and except for
> skylake+ chips (i.e. haswell and Zen) they seems to be rather slow; to the
> degree I did not find real world loop where gather would help on Zen.
> This patch simply adds a knob to disable its autogeneration (builtin still
> works). I have considered two alternatives
>  1) tune this with x86-tune-costs because gather is still profitable than
>     scalar code if we do very expensive operations (such as sequence of 
> divides)
>     on the values gathered/scattered
>  2) implement expansion of gathers into primitive instructions.  This is 
> faster
>     as semantics of gather is bit weird and not fully needed for our 
> vectorizer
>
> I did not have luck to get any good results out of 1 alone as cost model is 
> not
> very realistic.  1+2 probably makes sense but we can do this incrementally as
> that would make most sense to be implemented generically in vectorizer on the
> top of this change.

Note that the vectorizer gives up on loops with gathers with no target
support for
gathers.  It could simply open-code the gather though (and properly cost that
open-coded variant), that's probably the way to go here.

Richard.

> Given that gather is problematic even on skylake+ as shown in the PR (which 
> has
> most optimized implementation of it) it is good to have a knob to control its
> codegen at first place.
>
> I have also disabled gathers for generic.  This is because its use causes
> some two-digit regression on zen for spec2k17 while there are no measurable
> benefits on Intel.  Note that this affects only
> -march=<somethig supporting gather> -mtune=generic
> as by default we do not use AVX2.
>
> Bootstrapped/regtested x86_64-linux, plan to commit it later today if there
> are no complains.
>
> Honza
>
>         PR target/81616
>         * i386.c (ix86_vectorize_builtin_gather): Check TARGET_USE_GATHER.
>         * i386.h (TARGET_USE_GATHER): Define.
>         * x86-tune.def (X86_TUNE_USE_GATHER): New.
> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c  (revision 256369)
> +++ config/i386/i386.c  (working copy)
> @@ -38233,7 +38233,7 @@ ix86_vectorize_builtin_gather (const_tre
>    bool si;
>    enum ix86_builtins code;
>
> -  if (! TARGET_AVX2)
> +  if (! TARGET_AVX2 || !TARGET_USE_GATHER)
>      return NULL_TREE;
>
>    if ((TREE_CODE (index_type) != INTEGER_TYPE
> Index: config/i386/i386.h
> ===================================================================
> --- config/i386/i386.h  (revision 256369)
> +++ config/i386/i386.h  (working copy)
> @@ -498,6 +498,8 @@ extern unsigned char ix86_tune_features[
>         ix86_tune_features[X86_TUNE_SLOW_PSHUFB]
>  #define TARGET_AVOID_4BYTE_PREFIXES \
>         ix86_tune_features[X86_TUNE_AVOID_4BYTE_PREFIXES]
> +#define TARGET_USE_GATHER \
> +       ix86_tune_features[X86_TUNE_USE_GATHER]
>  #define TARGET_FUSE_CMP_AND_BRANCH_32 \
>         ix86_tune_features[X86_TUNE_FUSE_CMP_AND_BRANCH_32]
>  #define TARGET_FUSE_CMP_AND_BRANCH_64 \
> Index: config/i386/x86-tune.def
> ===================================================================
> --- config/i386/x86-tune.def    (revision 256369)
> +++ config/i386/x86-tune.def    (working copy)
> @@ -399,6 +399,10 @@ DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_ps
>  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>            m_SILVERMONT | m_INTEL)
>
> +/* X86_TUNE_USE_GATHER: Use gather instructions.  */
> +DEF_TUNE (X86_TUNE_USE_GATHER, "use_gather",
> +          ~(m_ZNVER1 | m_GENERIC))
> +
>  
> /*****************************************************************************/
>  /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     
> */
>  
> /*****************************************************************************/

Reply via email to