On Fri, Oct 24, 2014 at 01:06:41PM +0100, Alan Lawrence wrote:
> This migrates the reduction patterns in altivec.md and vector.md to
> the new names. I've not touched paired.md as I wasn't really sure
> how to fix that (how do I vec_extractv2sf ?), moreover the testing I
> did didn't seem to exercise any of those patterns (iow: I'm not sure
> what would be an appropriate target machine?).
> 
> I note the reduc_uplus_v16qi (which I've removed, as unsigned and
> signed addition should be equivalent) differed from
> reduc_splus_v16qi in using gen_altivec_vsum4ubs rather than
> gen_altivec_vsum4sbs.  Testcases 
> gcc.dg/vect/{slp-24-big-array.c,slp-24.c,vect-reduc-1char-big-array.c,vect-reduc-1char.c}
> thus produce assembly which differs from previously (only) in that
> "vsum4ubs" becomes "vsum4sbs". These tests are still passing so I
> assume this is OK.
> 
> The combining of signed and unsigned addition also improves 
> gcc.dg/vect/{vect-outer-4i.c,vect-reduc-1short.c,vect-reduc-dot-u8b.c,vect-reduc-pattern-1c-big-array.c,vect-reduc-pattern-1c.c}
> : these are now reduced using direct vector reduction, rather than
> with shifts as previously (because there was only a reduc_splus
> rather than the reduc_uplus these tests looked for).

I checked the integer vector add reductions, and they seem to generate the
same values with the old and new code, and I like eliminating the vector shift.
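
The signed/unsigned equivalence is easy to check in scalar C.  A standalone
sketch (the helper name byte_sum_low8 is my own illustration, not anything
from the patch): in two's complement, the low 8 bits of a byte-element sum
come out the same whether the elements are read as signed or unsigned, which
is why one splus pattern can serve both signednesses for a non-saturating add.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch (my own illustration, not from the patch): in two's
   complement, the low 8 bits of a byte-element sum are identical
   whether the elements are interpreted as signed or unsigned.  */
unsigned
byte_sum_low8 (const uint8_t *p, size_t n, int treat_signed)
{
  long sum = 0;
  size_t i;

  for (i = 0; i < n; i++)
    sum += treat_signed ? (long) (int8_t) p[i] : (long) p[i];

  return (unsigned) sum & 0xff;	/* low 8 bits of the reduction */
}
```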

> ((Side note: the RTL changes to vector.md are to match the combine
> patterns in vsx.md; now that we no longer depend upon combine to
> generate those patterns (as the optab outputs them directly), one
> might wish to remove the smaller pattern from vsx.md, and/or
> simplify the RTL. I theorize that a reduction of a two-element
> vector is just adding the first element to the second, so maybe to
> something like
> 
>   [(parallel [(set (match_operand:DF 0 "vfloat_operand" "")
>                  (VEC_reduc:V2DF
>                   (vec_select:DF
>                    (match_operand:V2DF 1 "vfloat_operand" "")
>                    (parallel [(const_int 1)]))
>                   (vec_select:DF
>                    (match_dup 1)
>                    (parallel [(const_int 0)]))))
>             (clobber (match_scratch:V2DF 2 ""))])]
> 
> but I think it's best for me to leave that to the port maintainers.))
> 
> Bootstrapped and check-gcc on powerpc64-none-linux-gnu
> (gcc110.fsffrance.org, with thanks to the GCC Compile Farm).
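
For what it's worth, the two-element observation in the side note checks out
in plain C.  This sketch uses GNU C vector extensions and is only the
scalar equivalent of the proposed RTL, not the pattern itself:

```c
/* Sketch (GNU C vector extensions; scalar equivalent of the quoted
   RTL, not the pattern itself): reducing a two-element double vector
   is simply adding lane 1 to lane 0.  */
typedef double v2df __attribute__ ((vector_size (16)));

double
reduce_v2df (v2df v)
{
  return v[1] + v[0];	/* matches the two vec_selects in the RTL */
}
```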

However, the double pattern is completely broken.  This cannot go in.

Consider this source:

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#ifndef TYPE
#define TYPE double
#endif

#ifndef OTYPE
#define OTYPE TYPE
#endif

#ifndef SIZE
#define SIZE 1024
#endif

#ifndef ALIGN
#define ALIGN 32
#endif

TYPE a[SIZE] __attribute__((__aligned__(ALIGN)));

OTYPE sum (void) __attribute__((__noinline__));

OTYPE
sum (void)
{
  size_t i;
  OTYPE s = (OTYPE) 0;

  for (i = 0; i < SIZE; i++)
    s += a[i];

  return s;
}

If I compile with today's trunk, and -mcpu=power8 -ffast-math -O3, I get code
that I expect (though it could use xxpermdi instead of xxsldwi):

sum:
        .quad   .L.sum,.TOC.@tocbase,0
        .previous
        .type   sum, @function
.L.sum:
        li 10,512
        addis 9,2,.LC1@toc@ha           # gpr load fusion, type long
        ld 9,.LC1@toc@l(9)
        xxlxor 0,0,0
        mtctr 10
        .p2align 4,,15
.L2:
        lxvd2x 12,0,9
        addi 9,9,16
        xvadddp 0,0,12
        bdnz .L2
        xxsldwi 12,0,0,2
        xvadddp 1,12,0
        xxpermdi 1,1,1,2
        blr
        .long 0

However, the code produced by the patches gives:

sum:
        .quad   .L.sum,.TOC.@tocbase,0
        .previous
        .type   sum, @function
.L.sum:
        xxlxor 0,0,0
        addi 10,1,-16
        li 8,512
        addis 9,2,.LC1@toc@ha           # gpr load fusion, type long
        ld 9,.LC1@toc@l(9)
        mtctr 8
        stxvd2x 0,0,10
        .p2align 5,,31
.L2:
        addi 10,1,-16
        lxvd2x 0,0,9
        addi 9,9,16
        lxvd2x 12,0,10
        xvadddp 12,12,0
        stxvd2x 12,0,10
        bdnz .L2
        lfd 0,-16(1)
        xxpermdi 1,12,12,2
        fadd 1,0,1
        blr
        .long 0

It is unacceptable for the inner loop to do a load, vector add, and store on
every iteration.
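
For comparison, the shape the vectorizer ought to produce, sketched here with
GNU C vector extensions (the function name is my own, and it assumes n is a
multiple of 2): the accumulator lives in a vector register across the whole
loop, with a single horizontal add at the end and no per-iteration load/store
of the accumulator.

```c
#include <stddef.h>

/* Sketch (GNU C vector extensions; my own names, assumes n is a
   multiple of 2): keep the accumulator in a vector register for the
   whole loop, then do one horizontal add after it.  */
typedef double v2df __attribute__ ((vector_size (16)));

double
sum_v2df (const double *a, size_t n)
{
  v2df acc = { 0.0, 0.0 };
  size_t i;

  for (i = 0; i < n; i += 2)
    {
      v2df chunk = { a[i], a[i + 1] };	/* one vector load's worth */
      acc += chunk;			/* stays in a register */
    }

  return acc[0] + acc[1];		/* one final horizontal add */
}
```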

-- 
Michael Meissner, IBM
IBM, M/S 2506R, 550 King Street, Littleton, MA 01460-6245, USA
email: meiss...@linux.vnet.ibm.com, phone: +1 (978) 899-4797