> On 28 Aug 2024, at 14:56, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>
>
>
>> On 28 Aug 2024, at 10:27, Tamar Christina <tamar.christ...@arm.com> wrote:
>>
>> External email: Use caution opening links or attachments
>>
>>
>>> -----Original Message-----
>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>> Sent: Wednesday, August 28, 2024 8:55 AM
>>> To: Tamar Christina <tamar.christ...@arm.com>
>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Jennifer Schmitz
>>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
>>> <ktkac...@exchange.nvidia.com>
>>> Subject: Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>
>>> Hi all,
>>>
>>> Thanks to Jennifer for proposing a patch and Tamar and Richard for digging
>>> into it.
>>>
>>>> On 27 Aug 2024, at 13:16, Tamar Christina <tamar.christ...@arm.com> wrote:
>>>>
>>>> External email: Use caution opening links or attachments
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>>> Sent: Tuesday, August 27, 2024 11:46 AM
>>>>> To: Tamar Christina <tamar.christ...@arm.com>
>>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org;
>>>>> Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
>>>>> Subject: Re: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>
>>>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>>>> Hi Jennifer,
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>>> Sent: Friday, August 23, 2024 1:07 PM
>>>>>>> To: gcc-patches@gcc.gnu.org
>>>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov
>>>>>>> <ktkac...@exchange.nvidia.com>
>>>>>>> Subject: [RFC][PATCH] AArch64: Remove AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>>
>>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable
>>>>>>> and the use_new_vector_costs entry in aarch64-tuning-flags.def and makes
>>>>>>> the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>>> default.
>>>>>
>>>>> Thanks for doing this. This has been on my TODO list ever since the
>>>>> tunable was added.
>>>>>
>>>>> The history is that these "new" costs were originally added in stage 4
>>>>> of GCC 11 for Neoverse V1. Since the costs were added so late, it wasn't
>>>>> appropriate to change the behaviour for any other core. All the new code
>>>>> was therefore gated on this option.
>>>>>
>>>>> The new costs do two main things:
>>>>>
>>>>> (1) use throughput-based calculations where available, including to choose
>>>>>     between Advanced SIMD and SVE
>>>>>
>>>>> (2) try to make the latency-based costs more precise, by looking more
>>>>>     closely at the provided stmt_info
>>>>>
>>>>> Old cost models won't be affected by (1) either way, since they don't
>>>>> provide any throughput information. But they should in principle benefit
>>>>> from (2). So...
>>>>>
>>>>>>> To that end, the function aarch64_use_new_vector_costs_p and its uses
>>>>>>> were removed. Additionally, guards were added to prevent null-pointer
>>>>>>> dereferences of fields in cpu_vector_cost.
>>>>>>>
>>>>>>
>>>>>> I'm not against this change, but it does mean that we now switch old
>>>>>> Adv. SIMD cost models as well to the new throughput-based cost models.
>>>>>> That means that -mcpu=generic now behaves differently, and so does
>>>>>> -mcpu=neoverse-n1, and I think some distros explicitly use this
>>>>>> (I believe yocto for instance does).
>>>>>
>>>>> ...it shouldn't mean that we start using throughput-based models for
>>>>> cortexa53 etc., since there's no associated issue info.
>>>>
>>>> Yes, I was using "throughput-based model" as a name. But as you indicated
>>>> in (2), it does change the latency calculation.
>>>>
>>>> My question was because of things in e.g. aarch64_adjust_stmt_cost and
>>>> friends, e.g. aarch64_multiply_add_p changes the cost between FMA SIMD
>>>> vs scalar.
>>>>
>>>> So my question...
>>>>
>>>>>
>>>>>> Have we validated that the old generic cost model still behaves sensibly
>>>>>> with this change?
>>>>
>>>> is still valid, I think: we *are* changing the cost for all models,
>>>> and while they should indeed be more accurate, there could be knock-on
>>>> effects.
>>>>
>>>
>>> We can run SPEC on a Grace system with -mcpu=generic to see what the effect
>>> is, but wider benchmarking would be more appropriate. Can you help with
>>> that, Tamar, once we agree on the other implementation details in this patch?
>>>
>>
>> Sure, that's not a problem. Just ping me when you have a patch you want me
>> to test :)
>>
>>>
>>>> Thanks,
>>>> Tamar
>>>>
>>>>>>
>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>>>>>>> gather_load_extend_X.c where X is 1 to 4, strided_load_2.c,
>>>>>>> strided_store_2.c) fail because of small differences in codegen that
>>>>>>> make some of the scan-assembler-times tests fail.
>>>>>>>
>>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these
>>>>>>> tests and add some
>>>>>>
>>>>>> I don't personally like unlimited here, as unlimited means just vectorize
>>>>>> at any cost. This means that costing between modes is also disabled. A lot
>>>>>> of these testcases are intended to test not just that we vectorize but
>>>>>> that we vectorize with efficient code.
>>>>>>
>>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>>>>>
>>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the
>>>>> gather_load_extend_X.c tests, since we require the cost model to decide
>>>>> when to use extending loads vs loading a full vector and unpacking.
>>>
>>> I had suggested using -fvect-cost-model=unlimited here because I thought
>>> these tests wanted to test the capability of GCC to detect and generate the
>>> particular SVE feature (gather load extends) for all supported data types
>>> regardless of profitability.
>>
>> I think the problem is specific to the load_extend tests, because unlimited
>> also disables SVE vs SVE comparisons wrt changes to VF, so not just Adv. SIMD
>> vs SVE. That is, while it would generate a gather, it may choose, instead of
>> gathering from B -> D, to do B -> S and then extend the results, because this
>> has a higher VF.
>>
>> Without the cross-SVE-mode comparisons it wouldn't know that the extensions
>> would actually slow it down.
>>
>>> If the tests are intended to also make the right profitability decisions
>>> for the generic tuning model, then I agree that using
>>> -fvect-cost-model=unlimited is not appropriate here, though I do think it'd
>>> be useful to fix the backend vector cost model hooks to honor
>>> -fvect-cost-model=unlimited and not limit generation of gather/scatter in
>>> that case. What it should do for the SVE vs Neon decisions is an open
>>> question.
>>>
>>>
>>>>>
>>>>> [...tries patch...]
>>>>>
>>>>> It seems like three main things are contributing to the difference:
>>>>>
>>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value
>>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
>>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>>>>>
>>>>> (1) is expected.
>>>>>
>>>>> (2) partly looks like a latent bug. We're using the x32 cost for
>>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>>>>>
>>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>>>>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>>     {
>>>>>       unsigned int nunits = vect_nunits_for_cost (vectype);
>>>>> -     if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>>> +     if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
>>>>>         return { sve_costs->gather_load_x64_cost, nunits };
>>>>>       return { sve_costs->gather_load_x32_cost, nunits };
>>>>>     }
>>>>>
>>>>> fixes that.
>>>
>>> Would you be willing to test and push that to trunk to get it out of the
>>> way?
>>>
>>>
>>>>>
>>>>> (3) is interesting. generic_vector_cost isn't used by default for any
>>>>> SVE CPU, or any -march=...+sve. So the question is: should we treat it
>>>>> as "architectural intent"/somewhat idealised? Or should we try to make
>>>>> it produce good code for existing SVE cores, in which case it would
>>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>>>
>>>>>
>>>>> If the former, we could set gather_load_x32_init_cost and
>>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
>>>>> (and nothing else, so that generic_armv*_a are unaffected).
>>>
>>> I don't have strong opinions on this point, but if Tamar is concerned about
>>> deviating too much from the known-good Advanced SIMD generic tuning we have
>>> now, then we should aim for minimal codegen changes there?
>>
>> No, I'm fine with Richard's suggestion. The only way you'd be able to get
>> this model to fire is with -march=xxx+sve -mtune=generic, at which point I'm
>> fine with assuming gathers have no initial overhead.
>
> As an aside, Jennifer pointed out to me that running the aarch64-sve.exp
> tests ends up using "-march=armv8.2-a+sve -mtune=generic -moverride=tune=none"
> for the tests, but aarch64-sve.exp doesn't add that -mtune and -moverride.
> It seems that aarch64-sve-acle-asm.exp does add these flags (to avoid the
> CSE_SVE_VL_CONSTANTS logic IIRC), but it seems odd to use them in the
> top-level aarch64-sve.exp and I'm not sure by what DejaGNU mechanism they
> end up propagating there.

Without making any more changes to the patch, I reran the testsuite: with
Richard's patch, the gather_load_extend tests now pass. However, 3 tests still
fail: strided_load_2.c, strided_store_2.c, and strided_store_5.c. Below you see
an extract of the diff of the produced assembly for these three tests.
Interestingly, there seem to be cases where the codegen is better with the
patch (strided_load_2.c, f_int64_t_32), but also cases that are worse
(strided_store_5.c, f_int32_t_5). Can you help diagnose what the problem with
the bad codegen could be in those cases and advise on how to proceed?
Thank you,
Jennifer
===== strided_load_2.c =====
Compiled with: -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none

Failed test: gcc.target/aarch64/sve/strided_load_2.c
  scan-assembler-times \\tld1d\\tz[0-9]+\\.d, p[0-7]/z, \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
  Instead found 9 times

Example of difference in produced assembly:

 f_int64_t_32:
 	cbz	w3, .L92
 	mov	x4, 0
 	uxtw	x3, w3
+	cntd	x5
+	whilelo	p7.d, xzr, x3
+	mov	z29.s, w5
 	mov	z31.s, w2
-	whilelo	p6.d, xzr, x3
-	mov	x2, x3
-	index	z30.s, #0, #1
-	uqdecd	x2
-	ptrue	p5.b, all
-	whilelo	p7.d, xzr, x2
+	index	z30.d, #0, #1
+	ptrue	p6.b, all
 	.p2align 3,,7
 .L94:
-	ld1d	z27.d, p7/z, [x0, #1, mul vl]
-	ld1d	z28.d, p6/z, [x0]
-	movprfx	z29, z30
-	mul	z29.s, p5/m, z29.s, z31.s
-	incw	x4
-	uunpklo	z0.d, z29.s
-	uunpkhi	z29.d, z29.s
-	ld1d	z25.d, p6/z, [x1, z0.d, lsl 3]
-	ld1d	z26.d, p7/z, [x1, z29.d, lsl 3]
-	add	z25.d, z28.d, z25.d
+	ld1d	z27.d, p7/z, [x0, x4, lsl 3]
+	movprfx	z28, z30
+	mul	z28.s, p6/m, z28.s, z31.s
+	ld1d	z26.d, p7/z, [x1, z28.d, uxtw 3]
 	add	z26.d, z27.d, z26.d
-	st1d	z26.d, p7, [x0, #1, mul vl]
-	whilelo	p7.d, x4, x2
-	st1d	z25.d, p6, [x0]
-	incw	z30.s
-	incb	x0, all, mul #2
-	whilelo	p6.d, x4, x3
+	st1d	z26.d, p7, [x0, x4, lsl 3]
+	add	z30.s, z30.s, z29.s
+	incd	x4
+	whilelo	p7.d, x4, x3
 	b.any	.L94
 .L92:
 	ret

===== strided_store_2.c =====
Compiled with: -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none

Failed test: gcc.target/aarch64/sve/strided_store_2.c
  scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
  Instead found 9 times

Example of difference in produced assembly:

 f_float_64:
 .LFB11:
 	.cfi_startproc
 	cbz	x3, .L67
-	lsl	x2, x2, 2
-	add	x3, x1, x3, lsl 2
-	fmov	s31, 1.0e+0
-	.p2align 3,,7
+	sub	x4, x3, #1
+	cmp	x4, 2
+	bls	.L73
+	lsr	x9, x3, 2
+	lsl	x8, x2, 2
+	fmov	v31.4s, 1.0e+0
+	lsl	x10, x2, 4
+	add	x4, x0, x8
+	add	x9, x1, x9, lsl 4
+	neg	x11, x8
+	lsl	x2, x2, 3
+	mov	x5, x1
+	.p2align 3,,7
+.L70:
+	ldr	q30, [x5], 16
+	add	x7, x4, x8
+	add	x6, x4, x2
+	fadd	v30.4s, v30.4s, v31.4s
+	str	s30, [x4, x11]
+	st1	{v30.s}[1], [x4]
+	add	x4, x4, x10
+	st1	{v30.s}[2], [x7]
+	st1	{v30.s}[3], [x6]
+	cmp	x5, x9
+	bne	.L70
+	and	x4, x3, -4
+	tst	x3, 3
+	beq	.L67
 .L69:
-	ldr	s30, [x1], 4
-	fadd	s30, s30, s31
-	str	s30, [x0]
-	add	x0, x0, x2
-	cmp	x1, x3
-	bne	.L69
+	madd	x0, x8, x4, x0
+	fmov	s29, 1.0e+0
+.L72:
+	ldr	s28, [x1, x4, lsl 2]
+	add	x4, x4, 1
+	fadd	s28, s28, s29
+	str	s28, [x0]
+	add	x0, x0, x8
+	cmp	x3, x4
+	bhi	.L72
 .L67:
 	ret
+.L73:
+	lsl	x8, x2, 2
+	mov	x4, 0
+	b	.L69

 f_int64_t_32:
 .LFB14:
 	.cfi_startproc
-	cbz	w3, .L84
-	addvl	x5, x1, #1
+	cbz	w3, .L92
 	mov	x4, 0
 	uxtw	x3, w3
+	cntd	x5
+	whilelo	p7.d, xzr, x3
+	mov	z29.s, w5
 	mov	z31.s, w2
-	whilelo	p6.d, xzr, x3
-	mov	x2, x3
-	index	z30.s, #0, #1
-	uqdecd	x2
-	ptrue	p5.b, all
-	whilelo	p7.d, xzr, x2
-	.p2align 3,,7
-.L86:
-	ld1d	z28.d, p6/z, [x1, x4, lsl 3]
-	ld1d	z27.d, p7/z, [x5, x4, lsl 3]
-	movprfx	z29, z30
-	mul	z29.s, p5/m, z29.s, z31.s
-	add	z28.d, z28.d, #1
-	uunpklo	z26.d, z29.s
-	st1d	z28.d, p6, [x0, z26.d, lsl 3]
-	incw	x4
-	uunpkhi	z29.d, z29.s
+	index	z30.d, #0, #1
+	ptrue	p6.b, all
+	.p2align 3,,7
+.L94:
+	ld1d	z27.d, p7/z, [x1, x4, lsl 3]
+	movprfx	z28, z30
+	mul	z28.s, p6/m, z28.s, z31.s
 	add	z27.d, z27.d, #1
-	st1d	z27.d, p7, [x0, z29.d, lsl 3]
-	whilelo	p7.d, x4, x2
-	incw	z30.s
-	whilelo	p6.d, x4, x3
-	b.any	.L86
-.L84:
+	st1d	z27.d, p7, [x0, z28.d, uxtw 3]
+	incd	x4
+	add	z30.s, z30.s, z29.s
+	whilelo	p7.d, x4, x3
+	b.any	.L94
+.L92:
 	ret

===== strided_store_5.c =====
Compiled with: -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none -msve-vector-bits=256

Failed tests: gcc.target/aarch64/sve/strided_store_5.c
  scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d\\]\\n 15
  Instead found 0 times
  scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, sxtw\\]\\n 3
  Instead found 0 times
  scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, uxtw\\]\\n 12
  Instead found 0 times

Example of difference in produced assembly:

 f_int32_t_5:
 	cmp	x2, 0
 	ble	.L1
-	mov	x3, 0
-	mov	w4, 20
-	whilelo	p7.s, xzr, x2
-	index	z31.s, #0, w4
-	.p2align 3,,7
+	sub	x3, x2, #1
+	cmp	x3, 6
+	bls	.L7
+	lsr	x8, x2, 3
+	mov	x4, x1
+	mov	x3, x0
+	ptrue	p6.b, vl32
+	add	x8, x1, x8, lsl 5
+	pfalse	p7.b
+	.p2align 3,,7
+.L4:
+	ld1w	z31.s, p6/z, [x4]
+	add	z31.s, z31.s, #1
+	umov	w7, v31.s[1]
+	umov	w6, v31.s[2]
+	umov	w5, v31.s[3]
+	dup	z1.s, z31.s[4]
+	dup	z0.s, z31.s[5]
+	dup	z30.s, z31.s[6]
+	lastb	s29, p7, z31.s
+	add	x4, x4, 32
+	str	s31, [x3]
+	add	x3, x3, 160
+	str	w7, [x3, -140]
+	str	w6, [x3, -120]
+	str	w5, [x3, -100]
+	str	s1, [x3, -80]
+	str	s0, [x3, -60]
+	str	s30, [x3, -40]
+	str	s29, [x3, -20]
+	cmp	x8, x4
+	bne	.L4
+	and	x3, x2, -8
+	cmp	x2, x3
+	beq	.L1
 .L3:
-	ld1w	z30.s, p7/z, [x1, x3, lsl 2]
-	add	z30.s, z30.s, #1
-	st1w	z30.s, p7, [x0, z31.s, uxtw]
-	add	x3, x3, 8
-	add	x0, x0, 160
-	whilelo	p7.s, x3, x2
-	b.any	.L3
+	lsl	x4, x3, 2
+	sub	x2, x2, x3
+	add	x3, x4, x3
+	whilelo	p7.d, xzr, x2
+	add	x1, x1, x4
+	mov	x4, 20
+	ld1w	z2.d, p7/z, [x1]
+	add	x3, x0, x3, lsl 2
+	index	z28.d, #0, x4
+	add	z2.s, z2.s, #1
+	st1w	z2.d, p7, [x3, z28.d]
+	mov	x0, 4
+	whilelo	p7.d, x0, x2
+	b.any	.L10
 .L1:
 	ret
+	.p2align 2,,3
+.L10:
+	ld1w	z29.d, p7/z, [x1, #1, mul vl]
+	add	x3, x3, 80
+	add	z29.s, z29.s, #1
+	st1w	z29.d, p7, [x3, z28.d]
+	ret
+.L7:
+	mov	x3, 0
+	b	.L3

>
> Thanks,
> Kyrill
>
>>
>> Cheers,
>> Tamar
>>
>>> Thanks again,
>>> Kyrill
>>>
>>>
>>>>>
>>>>> On the patch:
>>>>>
>>>>>> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>>>  {
>>>>>>    const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>>    const sve_vec_cost *sve_costs = nullptr;
>>>>>> -  if (vec_flags & VEC_ANY_SVE)
>>>>>> +  if (vec_costs->sve
>>>>>> +      && vec_flags & VEC_ANY_SVE)
>>>>>>      sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>>
>>>>> This doesn't seem necessary. I agree we might as well reuse the
>>>>> vec_costs local variable in the "if" body though.
>>>>>
>>>>>> [...]
>>>>>> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>>>>>                                      fractional_cost stmt_cost)
>>>>>>  {
>>>>>>    const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
>>>>>> +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>>    const sve_vec_cost *sve_costs = nullptr;
>>>>>> -  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
>>>>>> -    sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>>> +  if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
>>>>>> +    sve_costs = vec_costs->sve;
>>>>>
>>>>> Similarly here, this doesn't seem necessary.
>>>>>
>>>>>> [...]
>>>>>> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>>                                       where, stmt_cost);
>>>>>>
>>>>>>    /* Check if we've seen an SVE gather/scatter operation and which size.  */
>>>>>> +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>>    if (kind == scalar_load
>>>>>>        && aarch64_sve_mode_p (TYPE_MODE (vectype))
>>>>>> -      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>>> +      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
>>>>>> +      && vec_costs->sve)
>>>>>>      {
>>>>>> -      const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>>> +      const sve_vec_cost *sve_costs = vec_costs->sve;
>>>>>>        if (sve_costs)
>>>>>>          {
>>>>>>            if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>>>
>>>>> Here too.
>>>>>
>>>>> Thanks,
>>>>> Richard
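
For reference, a minimal, hypothetical sketch of the per-test cost-model
override discussed earlier in the thread, using -fvect-cost-model=dynamic as
Tamar preferred over =unlimited. The test body and function name below are
made up for illustration, and the thread has not settled on this as the right
fix for the remaining strided_* failures:

/* Hypothetical test fragment, not taken from the patch or testsuite.  */
/* { dg-do compile } */
/* { dg-additional-options "-fvect-cost-model=dynamic" } */

void
f (long *restrict dst, long *restrict src, int *restrict idx, int n)
{
  for (int i = 0; i < n; ++i)
    dst[i] = src[idx[i]];   /* gather-style indexed load.  */
}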