> I think we should try to fix the PR instead.  The widening operations
> can only be used for SLP if the group_size is at least double the
> number of elements in the vectype, so one idea (not worked through)
> is to make the vect_build_slp_tree family of routines undo pattern
> recognition for widening operations if the group size is less than that.

This seems reasonable. How do I go about 'undoing' the pattern recognition?

Ideally the patterns wouldn't be substituted in the first place, but group size 
is chosen after pattern substitution.
________________________________
From: Richard Sandiford <richard.sandif...@arm.com>
Sent: 21 January 2021 13:40
To: Richard Biener <rguent...@suse.de>
Cc: Joel Hutton via Gcc-patches <gcc-patches@gcc.gnu.org>; Joel Hutton 
<joel.hut...@arm.com>
Subject: Re: [AArch64] Remove backend support for widen-sub

Richard Biener <rguent...@suse.de> writes:
> On Thu, 21 Jan 2021, Richard Sandiford wrote:
>
>> Joel Hutton via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Hi all,
>> >
>> > This patch removes support for the widening subtract operation in the 
>> > aarch64 backend as it is causing a performance regression.
>> >
>> > In the following example:
>> >
>> > #include <stdint.h>
>> > extern void wdiff( int16_t d[16], uint8_t *restrict pix1, uint8_t 
>> > *restrict pix2)
>> > {
>> >   for( int y = 0; y < 4; y++ )
>> >   {
>> >     for( int x = 0; x < 4; x++ )
>> >       d[x + y*4] = pix1[x] - pix2[x];
>> >     pix1 += 16;
>> >     pix2 += 16;
>> >   }
>> > }
>> >
>> > The widening minus pattern is recognized and substituted, but cannot be
>> > used due to the input vector type chosen in SLP vectorization.  This
>> > results in an attempt to do an 8-byte -> 8-short widening subtract
>> > operation, which is not supported.
>> >
>> > The issue is documented in PR 98772.
>>
>> IMO removing just the sub patterns is too arbitrary.  Like you say,
>> the PR affects all widening instructions, but it happens that the
>> first regression encountered used subtraction.
>>
>> I think we should try to fix the PR instead.  The widening operations
>> can only be used for SLP if the group_size is at least double the
>> number of elements in the vectype, so one idea (not worked through)
>> is to make the vect_build_slp_tree family of routines undo pattern
>> recognition for widening operations if the group size is less than that.
>>
>> Richi would know better than me though.
>
> Why should the widen ops be only usable with such constraints?
> As I read md.texi they for example do v8qi -> v8hi operations
> (where the v8qi is either lo or hi part of a v16qi vector).

They're v16qi->v8hi operations rather than v8qi->v8hi.  The lo
operations operate on one half of the v16qi and the hi operations
operate on the other half.  They're supposed to be used together
to produce v16qi->v8hi+v8hi, so for BB SLP we need a group size
of 16 to feed them.  (Loop-aware SLP is fine as-is because we can
just increase the unroll factor.)
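
To make the lo/hi pairing concrete, here is a hedged scalar sketch (plain
illustrative C, not GCC internals and not the exact AArch64 instruction
semantics): each half of the pair reads 8 of the 16 narrow elements and
widens them, so feeding both halves takes a 16-element group.

```c
#include <stdint.h>

/* Scalar model of the paired widening subtract: the "lo" operation
   consumes the low 8 of the 16 narrow inputs, the "hi" operation the
   high 8.  Used together they implement v16qi -> v8hi+v8hi, which is
   why BB SLP needs a group of 16 to feed them.  */
static void widen_sub_lo(int16_t out[8],
                         const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    out[i] = (int16_t) a[i] - (int16_t) b[i];
}

static void widen_sub_hi(int16_t out[8],
                         const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 8; i++)
    out[i] = (int16_t) a[8 + i] - (int16_t) b[8 + i];
}
```
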

In the testcase, the store end of the SLP graph is operating on
8 shorts, which further up the graph are converted from 8 chars.
To use WIDEN_MINUS_EXPR at v8hi we'd need 16 shorts and 16 chars.
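
The group-size condition described above can be sketched as a hedged
predicate (the function name is illustrative, not an actual GCC routine):
with v8hi as the wide vectype there are 8 elements per vector, so a
BB-SLP group must supply at least double that, 16 scalars, before the
lo/hi widening pair is usable.

```c
/* Illustrative check only: a BB-SLP group can feed a lo/hi widening
   pair only if it provides at least two wide vectors' worth of
   elements, i.e. group_size >= 2 * nunits of the wide vectype.  */
static int widen_op_usable_p(unsigned group_size, unsigned wide_nunits)
{
  return group_size >= 2 * wide_nunits;
}
```

For the testcase this gives `widen_op_usable_p(8, 8)` false, matching
the failure: 8 shorts in the group, but 16 needed.
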

> The dumps show we use a VF of 4 which makes us have two v8hi
> vectors and one v16qi which vectorizable_conversion should
> be able to handle by emitting hi/lo widened subtracts.

AIUI the problem is with slp1.  I think the loop side is fine.

Thanks,
Richard
