On Fri, Sep 5, 2025 at 4:54 AM liuhongt <hongtao....@intel.com> wrote:
>
> SLP may take a broadcast as kind of vec_perm, the patch checks the
> permutation index to exclude those false positive.
>
> > Btw, you can now (in some cases) do better, namely you should
> > always have 'node' available and when SLP_TREE_PERMUTE_P (node)
> > then SLP_TREE_LANE_PERMUTATION could be inspected to
> > detect the harmful cross-lane permutes.  Note BB vectorization
> > still (always IIRC) uses SLP_TREE_LOAD_PERMUTATION,
> > so for permuted loads you have a load 'node' and the permutation
> > applied is visible in SLP_TREE_LOAD_PERMUTATION (which is
> > a simpler data structure).  That said, BB vectorization loads
> > could have harmful AVX2 permutes attached, so the patch is
> > maybe a bit overzealous.
> Changed.
>
> If we know it's a broadcast or in-lane permuation, then don't count
> it into m_num_avx256_vec_perm and also exclude 0 times vec_perm which
> is BIT_FIELD_REF <vect_**, 64, 0> and will be optimized to broadcast
> or optimized off?
>
> There're adjustments for the backend to generate vpermilpd/lps instead
> of vpermd/vpermq when the permutation is in-lane, that part is in the
> second patch.
>
> I notice there're several benchmarks benifited from this patch.
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> Any comments?
>
> gcc/ChangeLog:
>
>         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
>         Check permutation index for vec_perm, don't count it if we
>         know it's not a cross-lane permutation.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/i386/avx256_avoid_vec_perm.c: Adjust testcase.
>         * gcc.target/i386/avx256_avoid_vec_perm-2.c: New test.
> ---
>  gcc/config/i386/i386.cc                       | 60 ++++++++++++++++++-
>  .../gcc.target/i386/avx256_avoid_vec_perm-2.c | 21 +++++++
>  .../gcc.target/i386/avx256_avoid_vec_perm-5.c | 24 ++++++++
>  .../gcc.target/i386/avx256_avoid_vec_perm.c   |  2 +-
>  4 files changed, 104 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
>  create mode 100644 gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 55c9b16dd38..62779331774 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -26237,8 +26237,64 @@ ix86_vector_costs::add_stmt_cost (int count, 
> vect_cost_for_stmt kind,
>      stmt_cost = ix86_default_vector_cost (kind, mode);
>
>    if (kind == vec_perm && vectype
> -      && GET_MODE_SIZE (TYPE_MODE (vectype)) == 32)
> -    m_num_avx256_vec_perm[where]++;
> +      && GET_MODE_SIZE (TYPE_MODE (vectype)) == 32
> +      /* BIT_FIELD_REF <vect_**, 64, 0> 0 times vec_perm costs 0 in body.  */
> +      && count != 0)

so the vectorizer costs sth withy count == 0?  I'll see to fix that,
but this also
means the code should have used m_num_avx256_vec_perm[where] += count.

> +    {
> +      bool real_perm = true;
> +      unsigned nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +
> +      if (node
> +         && SLP_TREE_LOAD_PERMUTATION (node).exists ()
> +         /* Loop vectorization will have 4 times vec_perm
> +            with index as {0, 0, 0, 0}.
> +            But it actually generates
> +            vec_perm_expr <vect, vect, 0, 0, 0, 0>
> +            vec_perm_expr <vect, vect, 1, 1, 1, 1>
> +            vec_perm_expr <vect, vect, 2, 2, 2, 2>
> +            vec_perm_expr <vect, vect, 3, 3, 3, 3>  */
> +         && (!dyn_cast<loop_vec_info> (m_vinfo)
> +             || count * nunits == SLP_TREE_LANES (node)))

Indeed the actual permutes are not what is recorded in the load-permutation.
Instead it's the sequence of lanes to be "compressed" after loading
contiguously from the DR group (or strided from the DR group).  I'll note
that you should almost never see load-permutation when doing loop
vectorization (remaining ones are "legacy"), and for loop vectorization
a { 0 } load permutation would not be a broadcast, instead when the
group size is 2 and the mode is, say, V8SI it would be {0, 2, 4, 6, 8,
10, 12, 14 }.
But as said, you shouldn't see those for loop_vect.  For BB vect this
complication does not exist since VF == 1 there (no unrolling).

So - may I suggest to split this into loop and BB vect parts, separating the
two?  For loop vect I'd instead handle SLP_TREE_PERMUTE_P nodes
which have SLP_TREE_LANE_PERMUTATION.  All lane permutations
with SLP_TREE_LANES < nunits are necessarily cross-lane for loops,
a splat there has % nunits == 0 lanes and all source indexes the same.
There are also "good" permutes that have more than one source,
those degenerating to blends.  That said, SLP_TREE_PERMUTE_P
can have up to N input vectors, the SLP_TREE_LOAD_PERMUTATION
permutes one or two inputs from the same vector memory load.

> +       {
> +         unsigned half = nunits / 2;
> +         unsigned i = 0;
> +         bool allsame = true;
> +         unsigned first = SLP_TREE_LOAD_PERMUTATION (node)[0];
> +         bool cross_lane_p = false;
> +         for (i = 0 ; i != SLP_TREE_LANES (node); i++)
> +           {
> +             unsigned tmp = SLP_TREE_LOAD_PERMUTATION (node)[i];
> +             /* allsame is just a broadcast.  */
> +             if (tmp != first)
> +               allsame = false;
> +
> +             /* 4 times vec_perm with number of lanes multiple of nunits.  */
> +             tmp = tmp & (nunits - 1);
> +             unsigned index = i & (nunits - 1);
> +             if ((index < half && tmp >= half)
> +                 || (index >= half && tmp < half))
> +               cross_lane_p = true;
> +
> +             if (!allsame && cross_lane_p)
> +               break;
> +           }
> +
> +         if (i == SLP_TREE_LANES (node))
> +           real_perm = false;
> +       }
> +
> +      if (real_perm)
> +       {
> +         m_num_avx256_vec_perm[where]++;
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           {
> +             fprintf (dump_file, "Detected avx256 cross-lane permutation: ");
> +             if (stmt_info)
> +               print_gimple_expr (dump_file, stmt_info->stmt, 0, TDF_SLIM);
> +             fprintf (dump_file, " \n");
> +           }
> +       }
> +    }
>
>    /* Penalize DFmode vector operations for Bonnell.  */
>    if (TARGET_CPU_P (BONNELL) && kind == vector_stmt
> diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c 
> b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> new file mode 100644
> index 00000000000..8d4e641444d
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-2.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=sierraforest -O2 -fdump-tree-slp-details" } */
> +/* { dg-final { scan-tree-dump-times {(?n)Detected avx256 cross-lane 
> permutation} 1 "slp2" } } */
> +
> +void
> +foo (double* a, double* __restrict b, int c, int n)
> +{
> +  a[0] = b[100] * b[2];
> +  a[1] = b[100] * b[3];
> +  a[2] = b[100] * b[0];
> +  a[3] = b[100] * b[1];
> +}
> +
> +void
> +foo1 (double* a, double* __restrict b, int c, int n)
> +{
> +  a[0] = b[100] * b[0];
> +  a[1] = b[100] * b[1];
> +  a[2] = b[100] * b[3];
> +  a[3] = b[100] * b[2];
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c 
> b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> new file mode 100644
> index 00000000000..c11bea8c7b3
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm-5.c
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=sierraforest -Ofast" } */
> +/* { dg-final { scan-assembler-not {(?n)vpermpd.*%ymm} } } */
> +
> +typedef struct {
> +  unsigned short m1, m2, m3, m4;
> +} the_struct_t;
> +typedef struct {
> +  double m1, m2, m3, m4, m5;
> +} the_struct2_t;
> +
> +double bar1 (the_struct2_t*);
> +
> +double foo (double* k, unsigned int n, the_struct_t* the_struct) {
> +  unsigned int u;
> +  the_struct2_t result;
> +  for (u=0; u < n; u++, k--) {
> +    result.m1 += (*k)*the_struct[u].m1;
> +    result.m2 += (*k)*the_struct[u].m2;
> +    result.m3 += (*k)*the_struct[u].m3;
> +    result.m4 += (*k)*the_struct[u].m4;
> +  }
> +  return bar1 (&result);
> +}
> diff --git a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c 
> b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> index d4f00b3fb52..e0399041ad9 100644
> --- a/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> +++ b/gcc/testsuite/gcc.target/i386/avx256_avoid_vec_perm.c
> @@ -13,7 +13,7 @@ foo (void)
>        b[i*8+0] = a[i*8+0];
>        b[i*8+1] = a[i*8+0];
>        b[i*8+2] = a[i*8+3];
> -      b[i*8+3] = a[i*8+3];
> +      b[i*8+3] = a[i*8+5];
>        b[i*8+4] = a[i*8+4];
>        b[i*8+5] = a[i*8+6];
>        b[i*8+6] = a[i*8+4];
> --
> 2.34.1
>

Reply via email to