https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91573

Tamar Christina <tnfchris at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tnfchris at gcc dot gnu.org

--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
In this case, though, wouldn't the loop vectorizer also be able to handle it if
the permute were simpler? Re-rolling the loop or creating a minimal SLP tree
should be equivalent to:

char src[512];
char dst[512];

#define WIDTH 8

void foo(int height, int a, int b, int c, int d, int dst_stride) {
    char * ptr_src = src;
    char * ptr_dst = dst;

    for( int y = 0; y < height; y++ )
    {
        for( int x = 0; x < WIDTH; x++ )
        {
            int p1 = a + c;
            int p2 = b + d;
            char x1 = (p1 * ptr_src[x]  ) >> 6;
            char x2 = (p2 * ptr_src[x+1]) >> 6;
            ptr_dst[x] = x1 + x2;
        }

        ptr_dst += dst_stride;
        ptr_src += 32;
    }
}

This does vectorize (using Andre's patch for SUM reductions with sign-change
casts).

We've seen multiple other cases where doing so would (significantly) improve
vectorization and code generation. So perhaps we should try re-rolling the loop
or creating the smallest (in terms of height) possible SLP tree?
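
As a rough illustration of what re-rolling means here, the sketch below puts a
hand-unrolled row next to the rolled inner loop and checks that both write the
same bytes. It is an assumption-laden sketch, not the testcase from this PR:
the names pixel, foo_rolled, foo_unrolled, outputs_match and self_check are
hypothetical, the arrays are passed as parameters rather than being globals,
and the input data is chosen so that every intermediate value fits in a char.

```c
#include <assert.h>
#include <string.h>

#define WIDTH 8

/* One output element, same arithmetic as the loop body in foo(). */
static char pixel(int p1, int p2, const char *s, int x)
{
    char x1 = (p1 * s[x]) >> 6;
    char x2 = (p2 * s[x + 1]) >> 6;
    return x1 + x2;
}

/* Rolled form: the inner loop is left for the loop vectorizer. */
void foo_rolled(const char *in, char *out, int height,
                int a, int b, int c, int d, int dst_stride)
{
    int p1 = a + c, p2 = b + d;
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < WIDTH; x++)
            out[x] = pixel(p1, p2, in, x);
        out += dst_stride;
        in += 32;
    }
}

/* Hand-unrolled form (hypothetical): one statement per column, which is
   the shape that SLP vectorization would have to deal with instead. */
void foo_unrolled(const char *in, char *out, int height,
                  int a, int b, int c, int d, int dst_stride)
{
    int p1 = a + c, p2 = b + d;
    for (int y = 0; y < height; y++) {
        out[0] = pixel(p1, p2, in, 0);
        out[1] = pixel(p1, p2, in, 1);
        out[2] = pixel(p1, p2, in, 2);
        out[3] = pixel(p1, p2, in, 3);
        out[4] = pixel(p1, p2, in, 4);
        out[5] = pixel(p1, p2, in, 5);
        out[6] = pixel(p1, p2, in, 6);
        out[7] = pixel(p1, p2, in, 7);
        out += dst_stride;
        in += 32;
    }
}

/* Returns 1 if both forms produce identical output buffers.
   Callers must keep height <= 16 and dst_stride * height <= 512. */
int outputs_match(int height, int a, int b, int c, int d, int dst_stride)
{
    char in[512], out_r[512], out_u[512];
    for (int i = 0; i < 512; i++)
        in[i] = (char)(i % 100);   /* small non-negative sample data */
    memset(out_r, 0, sizeof out_r);
    memset(out_u, 0, sizeof out_u);
    foo_rolled(in, out_r, height, a, b, c, d, dst_stride);
    foo_unrolled(in, out_u, height, a, b, c, d, dst_stride);
    return memcmp(out_r, out_u, sizeof out_r) == 0;
}

/* Checks equivalence for two arbitrarily chosen parameter sets. */
void self_check(void)
{
    assert(outputs_match(4, 1, 2, 3, 4, 8));
    assert(outputs_match(16, 5, 6, 7, 8, 32));
}
```

Calling self_check() from a small main() is enough to exercise both forms.
Whether a given GCC version vectorizes either form would still need to be
checked separately, for example with -O3 -fopt-info-vec.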
