Hi,

I updated the patch after the dot product patch went in. This is the new cover letter:

This patch adds support for vectorizing the sum of absolute differences
(SAD_EXPR) using SVE.

Given this input code:

int
sum_abs (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  int sum = 0;

  for (int i = 0; i < n; i++)
    {
      sum += __builtin_abs (x[i] - y[i]);
    }

  return sum;
}

The resulting SVE code is:

0000000000000000 <sum_abs>:
   0:   7100005f        cmp     w2, #0x0
   4:   5400026d        b.le    50 <sum_abs+0x50>
   8:   d2800003        mov     x3, #0x0                        // #0
   c:   93407c42        sxtw    x2, w2
  10:   2538c002        mov     z2.b, #0
  14:   25221fe0        whilelo p0.b, xzr, x2
  18:   2538c023        mov     z3.b, #1
  1c:   2518e3e1        ptrue   p1.b
  20:   a4034000        ld1b    {z0.b}, p0/z, [x0, x3]
  24:   a4034021        ld1b    {z1.b}, p0/z, [x1, x3]
  28:   0430e3e3        incb    x3
  2c:   0520c021        sel     z1.b, p0, z1.b, z0.b
  30:   25221c60        whilelo p0.b, x3, x2
  34:   040d0420        uabd    z0.b, p1/m, z0.b, z1.b
  38:   44830402        udot    z2.s, z0.b, z3.b
  3c:   54ffff21        b.ne    20 <sum_abs+0x20>  // b.any
  40:   2598e3e0        ptrue   p0.s
  44:   04812042        uaddv   d2, p0, z2.s
  48:   1e260040        fmov    w0, s2
  4c:   d65f03c0        ret
  50:   1e2703e2        fmov    s2, wzr
  54:   1e260040        fmov    w0, s2
  58:   d65f03c0        ret

Notice how udot is used inside a fully masked loop.
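
For reference, this is roughly what the vectorized loop is doing, written by
hand with the SVE ACLE intrinsics (arm_sve.h). It is only an illustrative
sketch: the function name and the exact intrinsic choices are mine, not
something the patch generates.

#include <arm_sve.h>
#include <stdint.h>

int
sum_abs_sketch (uint8_t *restrict x, uint8_t *restrict y, int n)
{
  svuint32_t acc = svdup_n_u32 (0);
  svuint8_t ones = svdup_n_u8 (1);

  for (int64_t i = 0; i < n; i += svcntb ())
    {
      svbool_t pg = svwhilelt_b8_s64 (i, n);
      svuint8_t a = svld1_u8 (pg, x + i);
      svuint8_t b = svld1_u8 (pg, y + i);
      /* For inactive lanes, replace b with a so that their absolute
         difference is zero.  This is what the sel in the code above does,
         and it is what lets the unpredicated udot be used inside a fully
         masked loop.  */
      b = svsel_u8 (pg, b, a);
      /* uabd: unsigned absolute difference per byte.  */
      svuint8_t ad = svabd_u8_x (svptrue_b8 (), a, b);
      /* udot: multiply each byte by 1 and accumulate each group of four
         products into the corresponding 32-bit lane of acc.  */
      acc = svdot_u32 (acc, ad, ones);
    }

  return svaddv_u32 (svptrue_b32 (), acc);
}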

I tested this patch on an aarch64 machine by bootstrapping the compiler and
running the regression testsuite.

Alejandro

gcc/ChangeLog:

2019-05-07  Alejandro Martinez  <alejandro.martinezvice...@arm.com>

        * config/aarch64/aarch64-sve.md (<su>abd<mode>_3): New define_expand.
        (aarch64_<su>abd<mode>_3): Likewise.
        (*aarch64_<su>abd<mode>_3): New define_insn.
        (<sur>sad<vsi2qi>): New define_expand.
        * config/aarch64/iterators.md: Add MAX_OPP attribute.
        * tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
        (build_vect_cond_expr): Likewise.

gcc/testsuite/ChangeLog:

2019-05-07  Alejandro Martinez  <alejandro.martinezvice...@arm.com>

        * gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
        differences.
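
On the tree-vect-loop.c side, the reason a VEC_COND_EXPR is enough to mask
the SAD is the per-lane identity sketched below (illustrative C only, not
code from the patch): selecting the second operand to equal the first in the
inactive lanes makes their absolute difference zero, so the unpredicated
udot never adds anything for those lanes.

#include <stdint.h>

/* Per-lane view of the cond_expr masking that use_mask_by_cond_expr_p and
   build_vect_cond_expr now apply to SAD_EXPR; this is the scalar
   equivalent of the sel + uabd pair in the assembly above.  */
static inline int
sad_lane (int lane_active, uint8_t x, uint8_t y)
{
  uint8_t y_masked = lane_active ? y : x;   /* the sel instruction */
  return __builtin_abs (x - y_masked);      /* 0 for inactive lanes */
}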

> -----Original Message-----
> From: gcc-patches-ow...@gcc.gnu.org <gcc-patches-ow...@gcc.gnu.org>
> On Behalf Of Alejandro Martinez Vicente
> Sent: 11 February 2019 15:38
> To: James Greenhalgh <james.greenha...@arm.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; nd <n...@arm.com>; Richard
> Sandiford <richard.sandif...@arm.com>; Richard Biener
> <richard.guent...@gmail.com>
> Subject: RE: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> 
> > -----Original Message-----
> > From: James Greenhalgh <james.greenha...@arm.com>
> > Sent: 06 February 2019 17:42
> > To: Alejandro Martinez Vicente <alejandro.martinezvice...@arm.com>
> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; nd <n...@arm.com>; Richard
> > Sandiford <richard.sandif...@arm.com>; Richard Biener
> > <richard.guent...@gmail.com>
> > Subject: Re: [Aarch64][SVE] Vectorise sum-of-absolute-differences
> >
> > On Mon, Feb 04, 2019 at 07:34:05AM -0600, Alejandro Martinez Vicente
> > wrote:
> > > Hi,
> > >
> > > This patch adds support to vectorize sum of absolute differences
> > > (SAD_EXPR) using SVE. It also uses the new functionality to ensure
> > > that the resulting loop is masked. Therefore, it depends on
> > >
> > > https://gcc.gnu.org/ml/gcc-patches/2019-02/msg00016.html
> > >
> > > [Input code and generated SVE assembly snipped; identical to the cover
> > > letter above.]
> > >
> > > I tested this patch in an aarch64 machine bootstrapping the compiler
> > > and running the checks.
> >
> > This doesn't give us much confidence in SVE coverage; unless you have
> > been running in an environment using SVE by default? Do you have some
> > set of workloads you could test the compiler against to ensure correct
> > operation of the SVE vectorization?
> >
> I tested it using an SVE model and a big set of workloads, including SPEC 
> 2000,
> 2006 and 2017. On the plus side, nothing got broken. But impact on
> performance was very minimal (on average, a tiny gain over the whole set of
> workloads).
> 
> I still want this patch (and the companion dot product patch) to make into
> the compiler because they are the first steps towards vectorising workloads
> using fully masked loops when the target ISA (like SVE) doesn't support
> masking in all the operations.
> 
> Alejandro
> 
> > >
> > > I admit it is too late to merge this into gcc 9, but I'm posting it
> > > anyway so it can be considered for gcc 10.
> >
> > Richard Sandiford has the call on whether this patch is OK for trunk
> > now or GCC 10. With the minimal testing it has had, I'd be
> > uncomfortable with it as a GCC 9 patch. That said, it is a fairly
> > self-contained pattern for the compiler and it would be good to see this
> optimization in GCC 9.
> >
> > >
> > > Alejandro
> > >
> > >
> > > gcc/Changelog:
> > >
> > > 2019-02-04  Alejandro Martinez  <alejandro.martinezvice...@arm.com>
> > >
> > >   * config/aarch64/aarch64-sve.md (<su>abd<mode>_3): New
> > define_expand.
> > >   (aarch64_<su>abd<mode>_3): Likewise.
> > >   (*aarch64_<su>abd<mode>_3): New define_insn.
> > >   (<sur>sad<vsi2qi>): New define_expand.
> > >   * config/aarch64/iterators.md: Added MAX_OPP and max_opp
> > attributes.
> > >   Added USMAX iterator.
> > >   * config/aarch64/predicates.md: Added aarch64_smin and
> > aarch64_umin
> > >   predicates.
> > >   * tree-vect-loop.c (use_mask_by_cond_expr_p): Add SAD_EXPR.
> > >   (build_vect_cond_expr): Likewise.
> > >
> > > gcc/testsuite/Changelog:
> > >
> > > 2019-02-04  Alejandro Martinez  <alejandro.martinezvice...@arm.com>
> > >
> > >   * gcc.target/aarch64/sve/sad_1.c: New test for sum of absolute
> > >   differences.
> >

Attachment: sad_v3.patch