> On 1 Oct 2024, at 21:30, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> Hi Jennifer,
> 
>> -----Original Message-----
>> From: Jennifer Schmitz <jschm...@nvidia.com>
>> Sent: Tuesday, September 24, 2024 9:23 AM
>> To: gcc-patches@gcc.gnu.org
>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
>> <richard.sandif...@arm.com>; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
>> Subject: Re: [RFC][PATCH] AArch64: Remove
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> 
>> 
>> 
>>> On 28 Aug 2024, at 14:56, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>> 
>>> 
>>> 
>>>> On 28 Aug 2024, at 10:27, Tamar Christina <tamar.christ...@arm.com> wrote:
>>>> 
>>>>> -----Original Message-----
>>>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>>> Sent: Wednesday, August 28, 2024 8:55 AM
>>>>> To: Tamar Christina <tamar.christ...@arm.com>
>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Jennifer Schmitz
>>>>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
>>>>> <ktkac...@exchange.nvidia.com>
>>>>> Subject: Re: [RFC][PATCH] AArch64: Remove
>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Thanks to Jennifer for proposing a patch and Tamar and Richard for digging
>>>>> into it.
>>>>> 
>>>>>> On 27 Aug 2024, at 13:16, Tamar Christina <tamar.christ...@arm.com>
>> wrote:
>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>>>>> Sent: Tuesday, August 27, 2024 11:46 AM
>>>>>>> To: Tamar Christina <tamar.christ...@arm.com>
>>>>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org;
>> Kyrylo
>>>>>>> Tkachov <ktkac...@exchange.nvidia.com>
>>>>>>> Subject: Re: [RFC][PATCH] AArch64: Remove
>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>> 
>>>>>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>>>>>> Hi Jennifer,
>>>>>>>> 
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>>>>> Sent: Friday, August 23, 2024 1:07 PM
>>>>>>>>> To: gcc-patches@gcc.gnu.org
>>>>>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov
>>>>>>>>> <ktkac...@exchange.nvidia.com>
>>>>>>>>> Subject: [RFC][PATCH] AArch64: Remove
>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>>> 
>>>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the default.
>>>>>>> 
>>>>>>> Thanks for doing this.  This has been on my TODO list ever since the
>>>>>>> tunable was added.
>>>>>>> 
>>>>>>> The history is that these "new" costs were originally added in stage 4
>>>>>>> of GCC 11 for Neoverse V1.  Since the costs were added so late, it 
>>>>>>> wasn't
>>>>>>> appropriate to change the behaviour for any other core.  All the new 
>>>>>>> code
>>>>>>> was therefore gated on this option.
>>>>>>> 
>>>>>>> The new costs do two main things:
>>>>>>> 
>>>>>>> (1) use throughput-based calculations where available, including to 
>>>>>>> choose
>>>>>>> between Advanced SIMD and SVE
>>>>>>> 
>>>>>>> (2) try to make the latency-based costs more precise, by looking more
>> closely
>>>>>>> at the provided stmt_info
>>>>>>> 
>>>>>>> Old cost models won't be affected by (1) either way, since they don't
>>>>>>> provide any throughput information.  But they should in principle 
>>>>>>> benefit
>>>>>>> from (2).  So...
>>>>>>> 
>>>>>>>>> To that end, the function aarch64_use_new_vector_costs_p and its uses were
>>>>>>>>> removed. Additionally, guards were added to prevent null-pointer
>>>>>>>>> dereferences of fields in cpu_vector_cost.
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> I'm not against this change, but it does mean that we now switch old Adv.
>>>>>>>> SIMD cost models as well to the new throughput-based cost models.  That
>>>>>>>> means that -mcpu=generic now behaves differently, as does
>>>>>>>> -mcpu=neoverse-n1, and I think some distros explicitly use this
>>>>>>>> (I believe yocto for instance does).
>>>>>>> 
>>>>>>> ...it shouldn't mean that we start using throughput-based models for
>>>>>>> cortexa53 etc., since there's no associated issue info.
>>>>>> 
>>>>>> Yes, I was using "throughput-based model" as a name.  But as you indicated
>>>>>> in (2) it does change the latency calculation.
>>>>>> 
>>>>>> My question was because of things in e.g. aarch64_adjust_stmt_cost and
>>>>>> friends, e.g. aarch64_multiply_add_p changes the cost between FMA SIMD vs
>>>>>> scalar.
>>>>>> 
>>>>>> So my question..
>>>>>> 
>>>>>>> 
>>>>>>>> Have we validated that the old generic cost model still behaves sensibly
>>>>>>>> with this change?
>>>>>> 
>>>>>> is still valid I think, we *are* changing the cost for all models,
>>>>>> and while they should indeed be more accurate, there could be knock-on
>>>>>> effects.
>>>>>> 
>>>>> 
>>>>> We can run SPEC on a Grace system with -mcpu=generic to see what the effect
>>>>> is, but wider benchmarking would be more appropriate. Can you help with that,
>>>>> Tamar, once we agree on the other implementation details in this patch?
>>>>> 
>>>> 
>>>> Sure that's not a problem.  Just ping me when you have a patch you want me to
>>>> test :)
>>>> 
>>>>> 
>>>>>> Thanks,
>>>>>> Tamar
>>>>>> 
>>>>>>>> 
>>>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>>>>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>>>>>>>>> gather_load_extend_X.c where X is 1 to 4, strided_load_2.c,
>>>>>>>>> strided_store_2.c) fail because of small differences in codegen that make
>>>>>>>>> some of the scan-assembler-times tests fail.
>>>>>>>>> 
>>>>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these tests
>>>>>>>>> and add some
>>>>>>>> 
>>>>>>>> I don't personally like unlimited here as unlimited means just vectorize at
>>>>>>>> any cost.  This means that costing between modes is also disabled. A lot of
>>>>>>>> these testcases are intended to test not just that we vectorize but that we
>>>>>>>> vectorize with efficient code.
>>>>>>>> 
>>>>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>>>>>>> 
>>>>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the
>>>>>>> gather_load_extend_X.c tests, since we require the cost model to decide
>>>>>>> when to use extending loads vs loading a full vector and unpacking.
>>>>> 
>>>>> I had suggested using -fvect-cost-model=unlimited here because I thought these
>>>>> tests wanted to test the capability of GCC to detect and generate the
>>>>> particular SVE feature (gather load extends) for all supported data types
>>>>> regardless of profitability.
>>>> 
>>>> I think the problem is only for the load_extend tests specifically, because
>>>> unlimited also disables SVE vs SVE comparisons wrt changes to the VF, so not
>>>> just Adv. SIMD vs SVE.  That is, while it would still generate a gather, it
>>>> may choose, instead of gathering from B -> D, to do B -> S and then extend
>>>> the results, because this has a higher VF.
>>>> 
>>>> Without the cross SVE mode comparisons it wouldn't know that the extensions
>>>> would actually slow it down.
>>>> 
>>>>> If the tests are intended to also make the right profitability decisions for
>>>>> the generic tuning model then I agree using -fvect-cost-model=unlimited is not
>>>>> appropriate here, though I do think that it’d be useful to fix the backend
>>>>> vector cost model hooks to honor -fvect-cost-model=unlimited and not limit
>>>>> generation of gather/scatter in that case. What it should do for the SVE vs
>>>>> Neon decisions is an open question.
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> [...tries patch...]
>>>>>>> 
>>>>>>> It seems like three main things are contributing to the difference:
>>>>>>> 
>>>>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value
>>>>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
>>>>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>>>>>>> 
>>>>>>> (1) is expected.
>>>>>>> 
>>>>>>> (2) partly looks like a latent bug.  We're using the x32 cost for
>>>>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>>>>>>> 
>>>>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>>>>>>        && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>>>>      {
>>>>>>>        unsigned int nunits = vect_nunits_for_cost (vectype);
>>>>>>> -      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>>>>> +      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
>>>>>>>          return { sve_costs->gather_load_x64_cost, nunits };
>>>>>>>        return { sve_costs->gather_load_x32_cost, nunits };
>>>>>>>      }
>>>>>>> 
>>>>>>> fixes that.
>>>>> 
>>>>> Would you be willing to test and push that to trunk to get it out of the 
>>>>> way?
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> (3) is interesting.  generic_vector_cost isn't used by default for any
>>>>>>> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
>>>>>>> as "architectural intent"/somewhat idealised?  Or should we try to make
>>>>>>> it produce good code for existing SVE cores, in which case it would
>>>>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>>>>> 
>>>>>>> 
>>>>>>> If the former, we could set gather_load_x32_init_cost and
>>>>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
>>>>>>> (and nothing else, so that generic_armv*_a are unaffected).
>>>>> 
>>>>> I don’t have strong opinions on this point, but if Tamar is concerned about
>>>>> deviating too much from the known-good Advanced SIMD generic tuning we have
>>>>> now then we should aim for minimal codegen changes there?
>>>> 
>>>> No, I'm fine with Richard's suggestion.  The only way you'd be able to get this
>>>> model to fire is with -march=xxx+sve -mtune=generic, at which point I'm fine
>>>> with assuming gathers have no initial overhead.
>>> 
>>> As an aside, Jennifer pointed out to me that running the aarch64-sve.exp
>>> tests ends up using “-march=armv8.2-a+sve -mtune=generic -moverride=tune=none”
>>> for the tests, but aarch64-sve.exp doesn't add that -mtune and -moverride.  It
>>> seems that aarch64-sve-acle-asm.exp does add these flags (to avoid the
>>> CSE_SVE_VL_CONSTANTS logic IIRC), but it seems odd to use them in the top-level
>>> aarch64-sve.exp and I'm not sure by what DejaGNU mechanism they end up
>>> propagating there.
>> Without making any more changes to the patch, I reran the testsuite:
>> With Richard's patch, the gather_load_extend tests now pass.
>> However, 3 tests still fail: strided_load_2.c, strided_store_2.c, and
>> strided_store_5.c.
>> Below you see an extract of the diff of the produced assembly for these three
>> tests. Interestingly, there seem to be cases where the codegen is better 
>> with the
>> patch (strided_load_2.c, f_int64_t_32), but also cases that are worse
>> (strided_store_5.c, f_int32_t_5).
>> Can you help diagnose what the problem with the bad codegen in these cases
>> could be and advise on how to proceed?
>> Thank you, Jennifer
>> 
> 
> A lot of them look like costing issues.  I'm not exactly sure what
> -mtune=generic -moverride=tune=none resolves to; I would assume it goes back
> to the default tuning?
> 
>> ===== strided_load_2.c =====
>> Compiled with:
>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none
>> 
>> Failed test:
>> gcc.target/aarch64/sve/strided_load_2.c scan-assembler-times \\tld1d\\tz[0-9]+\\.d, p[0-7]/z, \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
>> Instead found 9 times
>> 
>> Example of difference in produced assembly:
>> f_int64_t_32:
>> cbz     w3, .L92
>>        mov     x4, 0
>>        uxtw    x3, w3
>> +       cntd    x5
>> +       whilelo p7.d, xzr, x3
>> +       mov     z29.s, w5
>>        mov     z31.s, w2
>> -       whilelo p6.d, xzr, x3
>> -       mov     x2, x3
>> -       index   z30.s, #0, #1
>> -       uqdecd  x2
>> -       ptrue   p5.b, all
>> -       whilelo p7.d, xzr, x2
>> +       index   z30.d, #0, #1
>> +       ptrue   p6.b, all
>>        .p2align 3,,7
>> .L94:
>> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
>> -       ld1d    z28.d, p6/z, [x0]
>> -       movprfx z29, z30
>> -       mul     z29.s, p5/m, z29.s, z31.s
>> -       incw    x4
>> -       uunpklo z0.d, z29.s
>> -       uunpkhi z29.d, z29.s
>> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
>> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
>> -       add     z25.d, z28.d, z25.d
>> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
>> +       movprfx z28, z30
>> +       mul     z28.s, p6/m, z28.s, z31.s
>> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>>        add     z26.d, z27.d, z26.d
>> -       st1d    z26.d, p7, [x0, #1, mul vl]
>> -       whilelo p7.d, x4, x2
>> -       st1d    z25.d, p6, [x0]
>> -       incw    z30.s
>> -       incb    x0, all, mul #2
>> -       whilelo p6.d, x4, x3
>> +       st1d    z26.d, p7, [x0, x4, lsl 3]
>> +       add     z30.s, z30.s, z29.s
>> +       incd    x4
>> +       whilelo p7.d, x4, x3
>>        b.any   .L94
>> .L92:
>>        ret
>> 
> 
> This to me looks like unrolling. It looks like the old code decided to unroll 
> once and the new one doesn't.
> I agree it does look better overall.
> 
>> ===== strided_store_2.c =====
>> Compiled with:
>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none
>> 
>> Failed test:
>> gcc.target/aarch64/sve/strided_store_2.c scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
>> Instead found 9 times
>> 
>> Example of difference in produced assembly:
>> f_float_64:
>> .LFB11:
>>        .cfi_startproc
>>        cbz     x3, .L67
>> -       lsl     x2, x2, 2
>> -       add     x3, x1, x3, lsl 2
>> -       fmov    s31, 1.0e+0
>> -       .p2align 3,,7
>> +       sub     x4, x3, #1
>> +       cmp     x4, 2
>> +       bls     .L73
>> +       lsr     x9, x3, 2
>> +       lsl     x8, x2, 2
>> +       fmov    v31.4s, 1.0e+0
>> +       lsl     x10, x2, 4
>> +       add     x4, x0, x8
>> +       add     x9, x1, x9, lsl 4
>> +       neg     x11, x8
>> +       lsl     x2, x2, 3
>> +       mov     x5, x1
>> +       .p2align 3,,7
>> +.L70:
>> +       ldr     q30, [x5], 16
>> +       add     x7, x4, x8
>> +       add     x6, x4, x2
>> +       fadd    v30.4s, v30.4s, v31.4s
>> +       str     s30, [x4, x11]
>> +       st1     {v30.s}[1], [x4]
>> +       add     x4, x4, x10
>> +       st1     {v30.s}[2], [x7]
>> +       st1     {v30.s}[3], [x6]
>> +       cmp     x5, x9
>> +       bne     .L70
>> +       and     x4, x3, -4
>> +       tst     x3, 3
>> +       beq     .L67
>> .L69:
>> -       ldr     s30, [x1], 4
>> -       fadd    s30, s30, s31
>> -       str     s30, [x0]
>> -       add     x0, x0, x2
>> -       cmp     x1, x3
>> -       bne     .L69
>> +       madd    x0, x8, x4, x0
>> +       fmov    s29, 1.0e+0
>> +.L72:
>> +       ldr     s28, [x1, x4, lsl 2]
>> +       add     x4, x4, 1
>> +       fadd    s28, s28, s29
>> +       str     s28, [x0]
>> +       add     x0, x0, x8
>> +       cmp     x3, x4
>> +       bhi     .L72
>> .L67:
>>        ret
>> +.L73:
>> +       lsl     x8, x2, 2
>> +       mov     x4, 0
>> +       b       .L69
>> 
> 
> This one feels like the vectorizer emulated a scatter store.
> Did you rebase by chance? I think our cost model most likely
> does not model the store costs of a single lane.  I'd look at the costing of
> the loop from the vectorizer.
> 
>> 
>> f_int64_t_32:
>> .LFB14:
>>        .cfi_startproc
>> -       cbz     w3, .L84
>> -       addvl   x5, x1, #1
>> +       cbz     w3, .L92
>>        mov     x4, 0
>>        uxtw    x3, w3
>> +       cntd    x5
>> +       whilelo p7.d, xzr, x3
>> +       mov     z29.s, w5
>>        mov     z31.s, w2
>> -       whilelo p6.d, xzr, x3
>> -       mov     x2, x3
>> -       index   z30.s, #0, #1
>> -       uqdecd  x2
>> -       ptrue   p5.b, all
>> -       whilelo p7.d, xzr, x2
>> -       .p2align 3,,7
>> -.L86:
>> -       ld1d    z28.d, p6/z, [x1, x4, lsl 3]
>> -       ld1d    z27.d, p7/z, [x5, x4, lsl 3]
>> -       movprfx z29, z30
>> -       mul     z29.s, p5/m, z29.s, z31.s
>> -       add     z28.d, z28.d, #1
>> -       uunpklo z26.d, z29.s
>> -       st1d    z28.d, p6, [x0, z26.d, lsl 3]
>> -       incw    x4
>> -       uunpkhi z29.d, z29.s
>> +       index   z30.d, #0, #1
>> +       ptrue   p6.b, all
>> +       .p2align 3,,7
>> +.L94:
>> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
>> +       movprfx z28, z30
>> +       mul     z28.s, p6/m, z28.s, z31.s
>>        add     z27.d, z27.d, #1
>> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
>> -       whilelo p7.d, x4, x2
>> -       incw    z30.s
>> -       whilelo p6.d, x4, x3
>> -       b.any   .L86
>> -.L84:
>> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
>> +       incd    x4
>> +       add     z30.s, z30.s, z29.s
>> +       whilelo p7.d, x4, x3
>> +       b.any   .L94
>> +.L92:
>>        ret
>> 
> 
> Same as before, it looks like we stopped unrolling on a long dependency chain,
> so I think the code is better.
> 
>> ==== strided_store_5.c =====
>> Compiled with:
>> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none -msve-vector-bits=256
>> 
>> Failed tests:
>> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d\\]\\n 15
>> Instead found 0 times
>> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, sxtw\\]\\n 3
>> Instead found 0 times
>> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, uxtw\\]\\n 12
>> Instead found 0 times
>> 
>> Example of difference in produced assembly:
>> f_int32_t_5:
>>        cmp     x2, 0
>>        ble     .L1
>> -       mov     x3, 0
>> -       mov     w4, 20
>> -       whilelo p7.s, xzr, x2
>> -       index   z31.s, #0, w4
>> -       .p2align 3,,7
>> +       sub     x3, x2, #1
>> +       cmp     x3, 6
>> +       bls     .L7
>> +       lsr     x8, x2, 3
>> +       mov     x4, x1
>> +       mov     x3, x0
>> +       ptrue   p6.b, vl32
>> +       add     x8, x1, x8, lsl 5
>> +       pfalse  p7.b
>> +       .p2align 3,,7
>> +.L4:
>> +       ld1w    z31.s, p6/z, [x4]
>> +       add     z31.s, z31.s, #1
>> +       umov    w7, v31.s[1]
>> +       umov    w6, v31.s[2]
>> +       umov    w5, v31.s[3]
>> +       dup     z1.s, z31.s[4]
>> +       dup     z0.s, z31.s[5]
>> +       dup     z30.s, z31.s[6]
>> +       lastb   s29, p7, z31.s
>> +       add     x4, x4, 32
>> +       str     s31, [x3]
>> +       add     x3, x3, 160
>> +       str     w7, [x3, -140]
>> +       str     w6, [x3, -120]
>> +       str     w5, [x3, -100]
>> +       str     s1, [x3, -80]
>> +       str     s0, [x3, -60]
>> +       str     s30, [x3, -40]
>> +       str     s29, [x3, -20]
>> +       cmp     x8, x4
>> +       bne     .L4
>> +       and     x3, x2, -8
>> +       cmp     x2, x3
>> +       beq     .L1
>> .L3:
>> -       ld1w    z30.s, p7/z, [x1, x3, lsl 2]
>> -       add     z30.s, z30.s, #1
>> -       st1w    z30.s, p7, [x0, z31.s, uxtw]
>> -       add     x3, x3, 8
>> -       add     x0, x0, 160
>> -       whilelo p7.s, x3, x2
>> -       b.any   .L3
>> +       lsl     x4, x3, 2
>> +       sub     x2, x2, x3
>> +       add     x3, x4, x3
>> +       whilelo p7.d, xzr, x2
>> +       add     x1, x1, x4
>> +       mov     x4, 20
>> +       ld1w    z2.d, p7/z, [x1]
>> +       add     x3, x0, x3, lsl 2
>> +       index   z28.d, #0, x4
>> +       add     z2.s, z2.s, #1
>> +       st1w    z2.d, p7, [x3, z28.d]
>> +       mov     x0, 4
>> +       whilelo p7.d, x0, x2
>> +       b.any   .L10
>> .L1:
>>        ret
>> +       .p2align 2,,3
>> +.L10:
>> +       ld1w    z29.d, p7/z, [x1, #1, mul vl]
>> +       add     x3, x3, 80
>> +       add     z29.s, z29.s, #1
>> +       st1w    z29.d, p7, [x3, z28.d]
>> +       ret
>> +.L7:
>> +       mov     x3, 0
>> +       b       .L3
>> 
> 
> This one kinda feels like we vectorized a constructor we don't really support 
> codegen for.
> You can check if the code at dce6 makes sense and if the scalarization is 
> done by veclower21.
I investigated the cause of the test failures a bit further:

Looking at the diff of the vect dumps (below is a section of the diff for 
strided_store_2.c), it seemed odd that vec_to_scalar operations cost 0 now, 
instead of the previous cost of 2:

+strided_store_1.c:38:151: note:    === vectorizable_operation ===
+strided_store_1.c:38:151: note:    vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
+strided_store_1.c:38:151: note:   ==> examining statement: *_6 = _7;
+strided_store_1.c:38:151: note:   vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
+strided_store_1.c:38:151: note:   Vectorizing an unaligned access.
+Applying pattern match.pd:236, generic-match-9.cc:4128
+Applying pattern match.pd:5285, generic-match-10.cc:4234
+strided_store_1.c:38:151: note:   vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
 *_2 1 times unaligned_load (misalign -1) costs 1 in body
-_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
 _3 + 1.0e+0 1 times vector_stmt costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+<unknown> 1 times vector_load costs 1 in prologue
+_7 1 times vec_to_scalar costs 0 in body
 _7 1 times scalar_store costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+_7 1 times vec_to_scalar costs 0 in body
 _7 1 times scalar_store costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+_7 1 times vec_to_scalar costs 0 in body
 _7 1 times scalar_store costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+_7 1 times vec_to_scalar costs 0 in body
 _7 1 times scalar_store costs 1 in body

Although the aarch64_use_new_vector_costs_p flag was used in multiple places in 
aarch64.cc, the location that causes this behavior is this one:
unsigned
aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
                                     stmt_vec_info stmt_info, slp_tree,
                                     tree vectype, int misalign,
                                     vect_cost_model_location where)
{
  [...]
  /* Try to get a more accurate cost by looking at STMT_INFO instead
     of just looking at KIND.  */
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
    {
      /* If we scalarize a strided store, the vectorizer costs one
         vec_to_scalar for each element.  However, we can store the first
         element using an FP store without a separate extract step.  */
      if (vect_is_store_elt_extraction (kind, stmt_info))
        count -= 1;

      stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
                                                      stmt_info, stmt_cost);

      if (vectype && m_vec_flags)
        stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
                                                        stmt_info, vectype,
                                                        where, stmt_cost);
    }
  [...]
  return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
}

Previously, for -mtune=generic, this function returned a cost of 2 for a 
vec_to_scalar operation in the vect body. Now "if (stmt_info)" is entered and 
"if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to true, which 
decrements count from 1 to 0 and leads to a return value of 0.
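
To convince myself of the arithmetic, here is a small standalone mock-up (not
the actual GCC code, just the logic above under the assumption that the
vectorizer costs each element's vec_to_scalar in its own add_stmt_cost call
with count == 1, which is what the "1 times" entries in the dump suggest):

#include <cstdio>

/* Mock-up of the store-element-extraction discount in
   aarch64_vector_costs::add_stmt_cost: COUNT is reduced by one before the
   cost is accumulated.  */
static unsigned
mock_add_stmt_cost (int count, bool is_store_elt_extraction,
                    unsigned stmt_cost)
{
  if (is_store_elt_extraction)
    count -= 1;
  return count * stmt_cost;
}

int
main ()
{
  /* Four extractions for a scalarized strided store, each costed in a
     separate call with count == 1 and a base cost of 2.  */
  for (int elt = 0; elt < 4; elt++)
    printf ("vec_to_scalar for element %d costs %u\n",
            elt, mock_add_stmt_cost (1, true, 2));
  return 0;
}

This prints a cost of 0 for every element, i.e. the discount that the comment
intends only for the first element (the one that can be stored with an FP
store without a separate extract) effectively applies to all of them, because
each call only ever sees count == 1.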

My assumption is that these "free" vec_to_scalar operations result in 
vectorization of the ADD instruction in the tests, followed by individual 
stores using ST1. But the ST1 stores aren't free so they likely shouldn't have 
a cost of 0.
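
As a rough back-of-the-envelope check (an illustration only, using the
per-statement body costs from the dump quoted above), the contribution of
those statements to the loop body cost drops considerably:

#include <cstdio>

int
main ()
{
  /* Body costs from the dump: one unaligned_load, one vector_stmt,
     four vec_to_scalar extractions and four scalar_stores.  */
  const unsigned n_elts = 4;
  unsigned old_body = 1 /* unaligned_load */
                      + 1 /* vector_stmt */
                      + n_elts * 2 /* vec_to_scalar at cost 2 each */
                      + n_elts * 1 /* scalar_store */;
  unsigned new_body = 1 + 1
                      + n_elts * 0 /* vec_to_scalar now free */
                      + n_elts * 1;
  printf ("old inside cost: %u, new inside cost: %u\n", old_body, new_body);
  return 0;
}

That is 14 before versus 6 after for the statements shown, which would
presumably tip the profitability comparison in favour of this scalarized-store
strategy and away from the code the scan-assembler-times patterns expect.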

What are your thoughts on this? Do you have a suggestion for which changes to 
make in order to resolve this?
> 
> Thanks,
> Tamar
> 
>>> 
>>> Thanks,
>>> Kyrill
>>> 
>>>> 
>>>> Cheers,
>>>> Tamar
>>>> 
>>>>> Thanks again,
>>>>> Kyrill
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> On the patch:
>>>>>>> 
>>>>>>>> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>>>>> {
>>>>>>>> const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>>>> const sve_vec_cost *sve_costs = nullptr;
>>>>>>>> -  if (vec_flags & VEC_ANY_SVE)
>>>>>>>> +  if (vec_costs->sve
>>>>>>>> +      && vec_flags & VEC_ANY_SVE)
>>>>>>>>  sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>>>> 
>>>>>>> This doesn't seem necessary.  I agree we might as well reuse the
>>>>>>> vec_costs local variable in the "if" body though.
>>>>>>> 
>>>>>>>> [...]
>>>>>>>> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>>>>>>>                             fractional_cost stmt_cost)
>>>>>>>> {
>>>>>>>> const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
>>>>>>>> +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>>>> const sve_vec_cost *sve_costs = nullptr;
>>>>>>>> -  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
>>>>>>>> -    sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>>>>> +  if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
>>>>>>>> +    sve_costs = vec_costs->sve;
>>>>>>> 
>>>>>>> Similarly here, this doesn't seem necessary.
>>>>>>> 
>>>>>>>> [...]
>>>>>>>> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>>>                                                 where, stmt_cost);
>>>>>>> 
>>>>>>>>       /* Check if we've seen an SVE gather/scatter operation and which size.  */
>>>>>>>> +      const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>>>>       if (kind == scalar_load
>>>>>>>>           && aarch64_sve_mode_p (TYPE_MODE (vectype))
>>>>>>>> -          && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>>>>> +          && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
>>>>>>>> +          && vec_costs->sve)
>>>>>>>>         {
>>>>>>>> -          const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>>>>> +          const sve_vec_cost *sve_costs = vec_costs->sve;
>>>>>>>>           if (sve_costs)
>>>>>>>>             {
>>>>>>>>               if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>>>>> 
>>>>>>> Here too.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Richard

