Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Biener <rguent...@suse.de>
>> Sent: Wednesday, April 23, 2025 9:31 AM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Sandiford
>> <richard.sandif...@arm.com>
>> Subject: Re: [PATCH]middle-end: Add new "max" vector cost model
>> 
>> On Wed, 23 Apr 2025, Tamar Christina wrote:
>> 
>> > Hi All,
>> >
>> > This patch proposes a new vector cost model called "max".  The cost model 
>> > is an
>> > intersection between two of our existing cost models.  Like `unlimited` it
>> > disables the costing vs scalar and assumes all vectorization to be 
>> > profitable.
>> >
>> > But unlike unlimited it does not fully disable the vector cost model.  That
>> > means that we still perform comparisons between vector modes.
>> >
>> > As an example, the following:
>> >
>> > void
>> > foo (char *restrict a, int *restrict b, int *restrict c,
>> >      int *restrict d, int stride)
>> > {
>> >     if (stride <= 1)
>> >         return;
>> >
>> >     for (int i = 0; i < 3; i++)
>> >         {
>> >             int res = c[i];
>> >             int t = b[i * stride];
>> >             if (a[i] != 0)
>> >                 res = t * d[i];
>> >             c[i] = res;
>> >         }
>> > }
>> >
>> > compiled with -O3 -march=armv8-a+sve -fvect-cost-model=dynamic fails to
>> > vectorize as it assumes scalar would be faster, and with
>> > -fvect-cost-model=unlimited it picks a vector type that's so big that the 
>> > large
>> > sequence generated is working on mostly inactive lanes:
>> >
>> >         ...
>> >         and     p3.b, p3/z, p4.b, p4.b
>> >         whilelo p0.s, wzr, w7
>> >         ld1w    z23.s, p3/z, [x3, #3, mul vl]
>> >         ld1w    z28.s, p0/z, [x5, z31.s, sxtw 2]
>> >         add     x0, x5, x0
>> >         punpklo p6.h, p6.b
>> >         ld1w    z27.s, p4/z, [x0, z31.s, sxtw 2]
>> >         and     p6.b, p6/z, p0.b, p0.b
>> >         punpklo p4.h, p7.b
>> >         ld1w    z24.s, p6/z, [x3, #2, mul vl]
>> >         and     p4.b, p4/z, p2.b, p2.b
>> >         uqdecw  w6
>> >         ld1w    z26.s, p4/z, [x3]
>> >         whilelo p1.s, wzr, w6
>> >         mul     z27.s, p5/m, z27.s, z23.s
>> >         ld1w    z29.s, p1/z, [x4, z31.s, sxtw 2]
>> >         punpkhi p7.h, p7.b
>> >         mul     z24.s, p5/m, z24.s, z28.s
>> >         and     p7.b, p7/z, p1.b, p1.b
>> >         mul     z26.s, p5/m, z26.s, z30.s
>> >         ld1w    z25.s, p7/z, [x3, #1, mul vl]
>> >         st1w    z27.s, p3, [x2, #3, mul vl]
>> >         mul     z25.s, p5/m, z25.s, z29.s
>> >         st1w    z24.s, p6, [x2, #2, mul vl]
>> >         st1w    z25.s, p7, [x2, #1, mul vl]
>> >         st1w    z26.s, p4, [x2]
>> >         ...
>> >
>> > With -fvect-cost-model=max you get more reasonable code:
>> >
>> > foo:
>> >         cmp     w4, 1
>> >         ble     .L1
>> >         ptrue   p7.s, vl3
>> >         index   z0.s, #0, w4
>> >         ld1b    z29.s, p7/z, [x0]
>> >         ld1w    z30.s, p7/z, [x1, z0.s, sxtw 2]
>> >    ptrue   p6.b, all
>> >         cmpne   p7.b, p7/z, z29.b, #0
>> >         ld1w    z31.s, p7/z, [x3]
>> >    mul     z31.s, p6/m, z31.s, z30.s
>> >         st1w    z31.s, p7, [x2]
>> > .L1:
>> >         ret
>> >
>> > This model has been useful internally for performance exploration and cost-
>> model
>> > validation.  It allows us to force realistic vectorization overriding the 
>> > cost
>> > model to be able to tell whether it's correct wrt to profitability.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu,
>> > arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
>> > -m32, -m64 and no issues.
>> >
>> > Ok for master?
>> 
>> Hmm.  I don't like another cost model.  Instead how about changing
>> 'unlimited' to still iterate through vector sizes?  Cost modeling
>> is really about vector vs. scalar, not vector vs. vector which is
>> completely under target control.  Targets should provide a way
>> to limit iteration, like aarch64 has with the aarch64-autovec-preference
>> --param or x86 has with -mprefer-vector-width.
>> 
>
> I'm ok with changing 'unlimited' if that's preferred, but I do want to point
> out that we don't have enough control with current --param or -m options
> to simulate all cases.
>
> For instance for SVE there's way for us to force a smaller type to be used
> and thus force an unpacking to happen.  Or there's no way to force an
> unrolling with Adv. SIMD.
>
> Basically there's not enough control over the VF to exercise some tests
> reliably.  Some tests explicitly relied on unlimited just picking the first
> mode.

FWIW, adding extra AArch64 --params sounds ok to me.  The ones we have
were just added on an as-needed/as-wanted basis, rather than as an attempt
to be complete.

After the aarch64-autovec-preference backward-compatibility controversy,
we should consider whether what we add is something that is intended for
developers and can be taken away at any time (--param), or whether it's
something that we promise to support going forward (-m).

Thanks,
Richard

Reply via email to