> I think we should try to fix the PR instead. The widening operations
> can only be used for SLP if the group_size is at least double the
> number of elements in the vectype, so one idea (not worked through)
> is to make the vect_build_slp_tree family of routines undo pattern
> recognition for widening operations if the group size is less than that.
This seems reasonable. How do I go about 'undoing' the pattern
recognition? Ideally the patterns wouldn't be substituted in the first
place, but the group size is chosen after pattern substitution.
________________________________
From: Richard Sandiford <richard.sandif...@arm.com>
Sent: 21 January 2021 13:40
To: Richard Biener <rguent...@suse.de>
Cc: Joel Hutton via Gcc-patches <gcc-patches@gcc.gnu.org>; Joel Hutton <joel.hut...@arm.com>
Subject: Re: [AArch64] Remove backend support for widen-sub

Richard Biener <rguent...@suse.de> writes:
> On Thu, 21 Jan 2021, Richard Sandiford wrote:
>
>> Joel Hutton via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
>> > Hi all,
>> >
>> > This patch removes support for the widening subtract operation in the
>> > aarch64 backend as it is causing a performance regression.
>> >
>> > In the following example:
>> >
>> > #include <stdint.h>
>> > extern void wdiff( int16_t d[16], uint8_t *restrict pix1,
>> >                    uint8_t *restrict pix2)
>> > {
>> >   for( int y = 0; y < 4; y++ )
>> >   {
>> >     for( int x = 0; x < 4; x++ )
>> >       d[x + y*4] = pix1[x] - pix2[x];
>> >     pix1 += 16;
>> >     pix2 += 16;
>> >   }
>> > }
>> >
>> > The widening minus pattern is recognized and substituted, but cannot
>> > be used due to the input vector type chosen in SLP vectorization.
>> > This results in an attempt to do an 8 byte -> 8 short widening
>> > subtract operation, which is not supported.
>> >
>> > The issue is documented in PR 98772.
>>
>> IMO removing just the sub patterns is too arbitrary. Like you say,
>> the PR affects all widening instructions; it just happens that the
>> first encountered regression used subtraction.
>>
>> I think we should try to fix the PR instead. The widening operations
>> can only be used for SLP if the group_size is at least double the
>> number of elements in the vectype, so one idea (not worked through)
>> is to make the vect_build_slp_tree family of routines undo pattern
>> recognition for widening operations if the group size is less than that.
>>
>> Richi would know better than me though.
>
> Why should the widen ops be only usable with such constraints?
> As I read md.texi they for example do v8qi -> v8hi operations
> (where the v8qi is either lo or hi part of a v16qi vector).

They're v16qi->v8hi operations rather than v8qi->v8hi.  The lo
operations operate on one half of the v16qi and the hi operations
operate on the other half.

They're supposed to be used together to produce v16qi->v8hi+v8hi,
so for BB SLP we need a group size of 16 to feed them.  (Loop-aware
SLP is fine as-is because we can just increase the unroll factor.)

In the testcase, the store end of the SLP graph is operating on
8 shorts, which further up the graph are converted from 8 chars.
To use WIDEN_MINUS_EXPR at v8hi we'd need 16 shorts and 16 chars.

> The dumps show we use a VF of 4 which makes us have two v8hi
> vectors and one v16qi which vectorizable_conversion should
> be able to handle by emitting hi/lo widened subtracts.

AIUI the problem is with slp1.  I think the loop side is fine.

Thanks,
Richard