Hi Jennifer,

> -----Original Message-----
> From: Jennifer Schmitz <jschm...@nvidia.com>
> Sent: Tuesday, September 24, 2024 9:23 AM
> To: gcc-patches@gcc.gnu.org
> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> <richard.sandif...@arm.com>; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
> Subject: Re: [RFC][PATCH] AArch64: Remove
> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> 
> 
> 
> > On 28 Aug 2024, at 14:56, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
> >
> >
> >
> >> On 28 Aug 2024, at 10:27, Tamar Christina <tamar.christ...@arm.com> wrote:
> >>
> >> External email: Use caution opening links or attachments
> >>
> >>
> >>> -----Original Message-----
> >>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> >>> Sent: Wednesday, August 28, 2024 8:55 AM
> >>> To: Tamar Christina <tamar.christ...@arm.com>
> >>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Jennifer Schmitz
> >>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
> >>> <ktkac...@exchange.nvidia.com>
> >>> Subject: Re: [RFC][PATCH] AArch64: Remove
> >>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>
> >>> Hi all,
> >>>
> >>> Thanks to Jennifer for proposing a patch and Tamar and Richard for
> >>> digging into it.
> >>>
> >>>> On 27 Aug 2024, at 13:16, Tamar Christina <tamar.christ...@arm.com>
> wrote:
> >>>>
> >>>> External email: Use caution opening links or attachments
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: Richard Sandiford <richard.sandif...@arm.com>
> >>>>> Sent: Tuesday, August 27, 2024 11:46 AM
> >>>>> To: Tamar Christina <tamar.christ...@arm.com>
> >>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org;
> Kyrylo
> >>>>> Tkachov <ktkac...@exchange.nvidia.com>
> >>>>> Subject: Re: [RFC][PATCH] AArch64: Remove
> >>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>
> >>>>> Tamar Christina <tamar.christ...@arm.com> writes:
> >>>>>> Hi Jennifer,
> >>>>>>
> >>>>>>> -----Original Message-----
> >>>>>>> From: Jennifer Schmitz <jschm...@nvidia.com>
> >>>>>>> Sent: Friday, August 23, 2024 1:07 PM
> >>>>>>> To: gcc-patches@gcc.gnu.org
> >>>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov
> >>>>>>> <ktkac...@exchange.nvidia.com>
> >>>>>>> Subject: [RFC][PATCH] AArch64: Remove
> >>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>>
> >>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> >>>>>>> tunable and use_new_vector_costs entry in aarch64-tuning-flags.def
> >>>>>>> and makes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the
> >>>>>>> backend the default.
> >>>>>
> >>>>> Thanks for doing this.  This has been on my TODO list ever since the
> >>>>> tunable was added.
> >>>>>
> >>>>> The history is that these "new" costs were originally added in stage 4
> >>>>> of GCC 11 for Neoverse V1.  Since the costs were added so late, it 
> >>>>> wasn't
> >>>>> appropriate to change the behaviour for any other core.  All the new 
> >>>>> code
> >>>>> was therefore gated on this option.
> >>>>>
> >>>>> The new costs do two main things:
> >>>>>
> >>>>> (1) use throughput-based calculations where available, including to
> >>>>>  choose between Advanced SIMD and SVE
> >>>>>
> >>>>> (2) try to make the latency-based costs more precise, by looking more
> >>>>>  closely at the provided stmt_info
> >>>>>
> >>>>> Old cost models won't be affected by (1) either way, since they don't
> >>>>> provide any throughput information.  But they should in principle
> >>>>> benefit from (2).  So...
> >>>>>
> >>>>>>> To that end, the function aarch64_use_new_vector_costs_p and its
> >>>>>>> uses were removed. Additionally, guards were added to prevent
> >>>>>>> nullpointer dereferences of fields in cpu_vector_cost.
> >>>>>>>
> >>>>>>
> >>>>>> I'm not against this change, but it does mean that we now switch old
> >>>>>> Adv. SIMD cost models as well to the new throughput-based cost models.
> >>>>>> That means that -mcpu=generic and -mcpu=neoverse-n1 now behave
> >>>>>> differently, and I think some distros explicitly use this (I believe
> >>>>>> yocto for instance does).
> >>>>>
> >>>>> ...it shouldn't mean that we start using throughput-based models for
> >>>>> cortexa53 etc., since there's no associated issue info.
> >>>>
> >>>> Yes, I was using throughput based model as a name.  But as you indicated
> >>>> in (2) it does change the latency calculation.
> >>>>
> >>>> My question was because of things in e.g. aarch64_adjust_stmt_cost and
> >>>> friends, e.g. aarch64_multiply_add_p changes the cost between FMA SIMD
> >>>> vs scalar.
> >>>>
> >>>> So my question..
> >>>>
> >>>>>
> >>>>>> Have we validated that the old generic cost model still behaves
> >>>>>> sensibly with this change?
> >>>>
> >>>> is still valid I think, we *are* changing the cost for all models,
> >>>> and while they should indeed be more accurate, there could be knock-on
> >>>> effects.
> >>>>
> >>>
> >>> We can run SPEC on a Grace system with -mcpu=generic to see what the
> >>> effect is, but wider benchmarking would be more appropriate. Can you help
> >>> with that Tamar once we agree on the other implementation details in this
> >>> patch?
> >>>
> >>
> >> Sure that's not a problem.  Just ping me when you have a patch you want me
> >> to test :)
> >>
> >>>
> >>>> Thanks,
> >>>> Tamar
> >>>>
> >>>>>>
> >>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
> >>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
> >>>>>>> gather_load_extend_X.c where X is 1 to 4, strided_load_2.c,
> >>>>>>> strided_store_2.c) fail because of small differences in codegen that
> >>>>>>> make some of the scan-assembler-times tests fail.
> >>>>>>>
> >>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these
> >>>>>>> tests and add some
> >>>>>>
> >>>>>> I don't personally like unlimited here as unlimited means just
> >>>>>> vectorize at any cost.  This means that costing between modes is also
> >>>>>> disabled. A lot of these testcases are intended to test not just that
> >>>>>> we vectorize but that we vectorize with efficient code.
> >>>>>>
> >>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the
> >>>>>> testcases.
> >>>>>
> >>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the
> >>>>> gather_load_extend_X.c tests, since we require the cost model to decide
> >>>>> when to use extending loads vs loading a full vector and unpacking.
> >>>
> >>> I had suggested using -fvect-cost-model=unlimited here because I thought
> >>> these tests wanted to test the capability of GCC to detect and generate
> >>> the particular SVE feature (gather load extends) for all supported data
> >>> types regardless of profitability.
> >>
> >> I think the problem is specific to the load_extend tests, because
> >> unlimited also disables SVE vs SVE comparisons wrt changes to VF, so not
> >> just Adv. SIMD vs SVE.  That is, while it would generate a gather, it may
> >> choose, instead of gathering from B -> D, to do B -> S and then extend the
> >> results, because this has a higher VF.
> >>
> >> Without the cross SVE mode comparisons it wouldn't know that the
> >> extensions would actually slow it down.
> >>
> >>> If the tests are intended to also make the right profitability decisions
> >>> for the generic tuning model then I agree using -fvect-cost-model=unlimited
> >>> is not appropriate here, though I do think that it’d be useful to fix the
> >>> backend vector cost model hooks to honor -fvect-cost-model=unlimited and
> >>> not limit generation of gather/scatter in that case. What it should do for
> >>> the SVE vs Neon decisions is an open question.
> >>>
> >>>
> >>>>>
> >>>>> [...tries patch...]
> >>>>>
> >>>>> It seems like three main things are contributing to the difference:
> >>>>>
> >>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value
> >>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
> >>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
> >>>>>
> >>>>> (1) is expected.
> >>>>>
> >>>>> (2) partly looks like a latent bug.  We're using the x32 cost for
> >>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads.
> >>>>>
> >>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
> >>>>>     && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> >>>>>   {
> >>>>>     unsigned int nunits = vect_nunits_for_cost (vectype);
> >>>>> -      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> >>>>> +      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
> >>>>>    return { sve_costs->gather_load_x64_cost, nunits };
> >>>>>     return { sve_costs->gather_load_x32_cost, nunits };
> >>>>>   }
> >>>>>
> >>>>> fixes that.
> >>>
> >>> Would you be willing to test and push that to trunk to get it out of the 
> >>> way?
> >>>
> >>>
> >>>>>
> >>>>> (3) is interesting.  generic_vector_cost isn't used by default for any
> >>>>> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
> >>>>> as "architectural intent"/somewhat idealised?  Or should we try to make
> >>>>> it produce good code for existing SVE cores, in which case it would
> >>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a.
> >>>
> >>>>>
> >>>>> If the former, we could set gather_load_x32_init_cost and
> >>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
> >>>>> (and nothing else, so that generic_armv*_a are unaffected).
> >>>
> >>> I don’t have strong opinions on this point but if Tamar is concerned
> >>> about deviating too much from the known-good Advanced SIMD generic tuning
> >>> we have now then we should aim for minimal codegen changes in that?
> >>
> >> No, I'm fine with Richard's suggestion.  The only way you'd be able to get
> >> this model to fire is with -march=xxx+sve -mtune=generic, at which point
> >> I'm fine with assuming gathers have no initial overhead.
> >
> > As an aside, Jennifer pointed out to me that running the aarch64-sve.exp
> > tests ends up using “-march=armv8.2-a+sve -mtune=generic -moverride=tune=none”
> > for the tests, but aarch64-sve.exp doesn’t add that -mtune and -moverride.
> > It seems that aarch64-sve-acle-asm.exp does add these flags (to avoid the
> > CSE_SVE_VL_CONSTANTS logic IIRC), but it seems odd to use them in the
> > top-level aarch64-sve.exp and I’m not sure by what DejaGNU mechanism they
> > end up propagating there.
> Without making any more changes to the patch, I reran the testsuite:
> With Richard's patch, the gather_load_extend tests now pass.
> However, 3 tests still fail: strided_load_2.c, strided_store_2.c, and
> strided_store_5.c.
> Below you see an extract of the diff of the produced assembly for these three
> tests. Interestingly, there seem to be cases where the codegen is better with
> the patch (strided_load_2.c, f_int64_t_32), but also cases that are worse
> (strided_store_5.c, f_int32_t_5).
> Can you help diagnose what the problem behind the bad codegen in those cases
> could be and advise on how to proceed?
> Thank you, Jennifer
> 

A lot of them look like costing issues.  I'm not exactly sure what
-mtune=generic -moverride=tune=none resolves to; I would assume it goes back
to the default tuning?
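
One way to check that empirically (just a sketch, assuming that diffing the
generated assembly is enough to spot a tuning difference, and that the
testcase is built from its testsuite directory so any relative includes
resolve):

  gcc -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic \
      -S strided_load_2.c -o generic.s
  gcc -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic \
      -moverride=tune=none -S strided_load_2.c -o override.s
  diff -u generic.s override.s

If the two outputs are identical then the override is effectively a no-op on
top of the generic tuning.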

> ===== strided_load_2.c =====
> Compiled with:
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none
> 
> Failed test:
> gcc.target/aarch64/sve/strided_load_2.c scan-assembler-times \\tld1d\\tz[0-9]+\\.d, p[0-7]/z, \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
> Instead found 9 times
> 
> Example of difference in produced assembly:
> f_int64_t_32:
> cbz     w3, .L92
>         mov     x4, 0
>         uxtw    x3, w3
> +       cntd    x5
> +       whilelo p7.d, xzr, x3
> +       mov     z29.s, w5
>         mov     z31.s, w2
> -       whilelo p6.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p7.d, xzr, x2
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
>         .p2align 3,,7
>  .L94:
> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> -       ld1d    z28.d, p6/z, [x0]
> -       movprfx z29, z30
> -       mul     z29.s, p5/m, z29.s, z31.s
> -       incw    x4
> -       uunpklo z0.d, z29.s
> -       uunpkhi z29.d, z29.s
> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> -       add     z25.d, z28.d, z25.d
> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> +       movprfx z28, z30
> +       mul     z28.s, p6/m, z28.s, z31.s
> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>         add     z26.d, z27.d, z26.d
> -       st1d    z26.d, p7, [x0, #1, mul vl]
> -       whilelo p7.d, x4, x2
> -       st1d    z25.d, p6, [x0]
> -       incw    z30.s
> -       incb    x0, all, mul #2
> -       whilelo p6.d, x4, x3
> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> +       add     z30.s, z30.s, z29.s
> +       incd    x4
> +       whilelo p7.d, x4, x3
>         b.any   .L94
>  .L92:
>         ret
> 

To me this looks like an unrolling difference: the old code decided to unroll
once and the new one doesn't.
I agree it does look better overall.

> ===== strided_store_2.c =====
> Compiled with:
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none
> 
> Failed test:
> gcc.target/aarch64/sve/strided_store_2.c scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
> Instead found 9 times
> 
> Example of difference in produced assembly:
> f_float_64:
>  .LFB11:
>         .cfi_startproc
>         cbz     x3, .L67
> -       lsl     x2, x2, 2
> -       add     x3, x1, x3, lsl 2
> -       fmov    s31, 1.0e+0
> -       .p2align 3,,7
> +       sub     x4, x3, #1
> +       cmp     x4, 2
> +       bls     .L73
> +       lsr     x9, x3, 2
> +       lsl     x8, x2, 2
> +       fmov    v31.4s, 1.0e+0
> +       lsl     x10, x2, 4
> +       add     x4, x0, x8
> +       add     x9, x1, x9, lsl 4
> +       neg     x11, x8
> +       lsl     x2, x2, 3
> +       mov     x5, x1
> +       .p2align 3,,7
> +.L70:
> +       ldr     q30, [x5], 16
> +       add     x7, x4, x8
> +       add     x6, x4, x2
> +       fadd    v30.4s, v30.4s, v31.4s
> +       str     s30, [x4, x11]
> +       st1     {v30.s}[1], [x4]
> +       add     x4, x4, x10
> +       st1     {v30.s}[2], [x7]
> +       st1     {v30.s}[3], [x6]
> +       cmp     x5, x9
> +       bne     .L70
> +       and     x4, x3, -4
> +       tst     x3, 3
> +       beq     .L67
>  .L69:
> -       ldr     s30, [x1], 4
> -       fadd    s30, s30, s31
> -       str     s30, [x0]
> -       add     x0, x0, x2
> -       cmp     x1, x3
> -       bne     .L69
> +       madd    x0, x8, x4, x0
> +       fmov    s29, 1.0e+0
> +.L72:
> +       ldr     s28, [x1, x4, lsl 2]
> +       add     x4, x4, 1
> +       fadd    s28, s28, s29
> +       str     s28, [x0]
> +       add     x0, x0, x8
> +       cmp     x3, x4
> +       bhi     .L72
>  .L67:
>         ret
> +.L73:
> +       lsl     x8, x2, 2
> +       mov     x4, 0
> +       b       .L69
> 

This one feels like the vectorizer emulated a scatter store.
Did you rebase by chance? I think our cost model most likely does not model
the store cost of a single lane.  I'd look at the costing of the loop from
the vectorizer.
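
A sketch of how to look at that (the exact dump wording may differ between
GCC versions, so take the grep pattern as an assumption):

  gcc -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic \
      -moverride=tune=none -fdump-tree-vect-details -S strided_store_2.c
  grep -n -A 20 "Cost model analysis" strided_store_2.c.*t.vect

That should show the per-statement costs the vectorizer added up when it
decided to emulate the scatter.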

> 
> f_int64_t_32:
>  .LFB14:
>         .cfi_startproc
> -       cbz     w3, .L84
> -       addvl   x5, x1, #1
> +       cbz     w3, .L92
>         mov     x4, 0
>         uxtw    x3, w3
> +       cntd    x5
> +       whilelo p7.d, xzr, x3
> +       mov     z29.s, w5
>         mov     z31.s, w2
> -       whilelo p6.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p7.d, xzr, x2
> -       .p2align 3,,7
> -.L86:
> -       ld1d    z28.d, p6/z, [x1, x4, lsl 3]
> -       ld1d    z27.d, p7/z, [x5, x4, lsl 3]
> -       movprfx z29, z30
> -       mul     z29.s, p5/m, z29.s, z31.s
> -       add     z28.d, z28.d, #1
> -       uunpklo z26.d, z29.s
> -       st1d    z28.d, p6, [x0, z26.d, lsl 3]
> -       incw    x4
> -       uunpkhi z29.d, z29.s
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
> +       .p2align 3,,7
> +.L94:
> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> +       movprfx z28, z30
> +       mul     z28.s, p6/m, z28.s, z31.s
>         add     z27.d, z27.d, #1
> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> -       whilelo p7.d, x4, x2
> -       incw    z30.s
> -       whilelo p6.d, x4, x3
> -       b.any   .L86
> -.L84:
> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> +       incd    x4
> +       add     z30.s, z30.s, z29.s
> +       whilelo p7.d, x4, x3
> +       b.any   .L94
> +.L92:
>         ret
> 

Same as before: it looks like we stopped unrolling on a long dependency chain,
so I think the code is better.

> ===== strided_store_5.c =====
> Compiled with:
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none -msve-vector-bits=256
> 
> Failed tests:
> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d\\]\\n 15
> Instead found 0 times
> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, sxtw\\]\\n 3
> Instead found 0 times
> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, uxtw\\]\\n 12
> Instead found 0 times
> 
> Example of difference in produced assembly:
> f_int32_t_5:
>         cmp     x2, 0
>         ble     .L1
> -       mov     x3, 0
> -       mov     w4, 20
> -       whilelo p7.s, xzr, x2
> -       index   z31.s, #0, w4
> -       .p2align 3,,7
> +       sub     x3, x2, #1
> +       cmp     x3, 6
> +       bls     .L7
> +       lsr     x8, x2, 3
> +       mov     x4, x1
> +       mov     x3, x0
> +       ptrue   p6.b, vl32
> +       add     x8, x1, x8, lsl 5
> +       pfalse  p7.b
> +       .p2align 3,,7
> +.L4:
> +       ld1w    z31.s, p6/z, [x4]
> +       add     z31.s, z31.s, #1
> +       umov    w7, v31.s[1]
> +       umov    w6, v31.s[2]
> +       umov    w5, v31.s[3]
> +       dup     z1.s, z31.s[4]
> +       dup     z0.s, z31.s[5]
> +       dup     z30.s, z31.s[6]
> +       lastb   s29, p7, z31.s
> +       add     x4, x4, 32
> +       str     s31, [x3]
> +       add     x3, x3, 160
> +       str     w7, [x3, -140]
> +       str     w6, [x3, -120]
> +       str     w5, [x3, -100]
> +       str     s1, [x3, -80]
> +       str     s0, [x3, -60]
> +       str     s30, [x3, -40]
> +       str     s29, [x3, -20]
> +       cmp     x8, x4
> +       bne     .L4
> +       and     x3, x2, -8
> +       cmp     x2, x3
> +       beq     .L1
>  .L3:
> -       ld1w    z30.s, p7/z, [x1, x3, lsl 2]
> -       add     z30.s, z30.s, #1
> -       st1w    z30.s, p7, [x0, z31.s, uxtw]
> -       add     x3, x3, 8
> -       add     x0, x0, 160
> -       whilelo p7.s, x3, x2
> -       b.any   .L3
> +       lsl     x4, x3, 2
> +       sub     x2, x2, x3
> +       add     x3, x4, x3
> +       whilelo p7.d, xzr, x2
> +       add     x1, x1, x4
> +       mov     x4, 20
> +       ld1w    z2.d, p7/z, [x1]
> +       add     x3, x0, x3, lsl 2
> +       index   z28.d, #0, x4
> +       add     z2.s, z2.s, #1
> +       st1w    z2.d, p7, [x3, z28.d]
> +       mov     x0, 4
> +       whilelo p7.d, x0, x2
> +       b.any   .L10
>  .L1:
>         ret
> +       .p2align 2,,3
> +.L10:
> +       ld1w    z29.d, p7/z, [x1, #1, mul vl]
> +       add     x3, x3, 80
> +       add     z29.s, z29.s, #1
> +       st1w    z29.d, p7, [x3, z28.d]
> +       ret
> +.L7:
> +       mov     x3, 0
> +       b       .L3
> 

This one kinda feels like we vectorized a constructor that we don't really
support codegen for.
You can check whether the code at dce6 makes sense and whether the
scalarization is done by veclower21.
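
A sketch of how to get at those dumps (assuming -fdump-tree-all; the
instance-numbered names like dce6 and veclower21 come straight from the
generated dump file names):

  gcc -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic \
      -moverride=tune=none -msve-vector-bits=256 -fdump-tree-all \
      -S strided_store_5.c
  ls strided_store_5.c.*t.dce6 strided_store_5.c.*t.veclower21

and then compare what the stores look like in the dce6 dump with what
veclower21 turns them into.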

Thanks,
Tamar

> >
> > Thanks,
> > Kyrill
> >
> >>
> >> Cheers,
> >> Tamar
> >>
> >>> Thanks again,
> >>> Kyrill
> >>>
> >>>
> >>>>>
> >>>>> On the patch:
> >>>>>
> >>>>>> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
> >>>>>> {
> >>>>>> const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >>>>>> const sve_vec_cost *sve_costs = nullptr;
> >>>>>> -  if (vec_flags & VEC_ANY_SVE)
> >>>>>> +  if (vec_costs->sve
> >>>>>> +      && vec_flags & VEC_ANY_SVE)
> >>>>>>   sve_costs = aarch64_tune_params.vec_costs->sve;
> >>>>>
> >>>>> This doesn't seem necessary.  I agree we might as well reuse the
> >>>>> vec_costs local variable in the "if" body though.
> >>>>>
> >>>>>> [...]
> >>>>>> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
> >>>>>>                              fractional_cost stmt_cost)
> >>>>>> {
> >>>>>> const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
> >>>>>> +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >>>>>> const sve_vec_cost *sve_costs = nullptr;
> >>>>>> -  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
> >>>>>> -    sve_costs = aarch64_tune_params.vec_costs->sve;
> >>>>>> +  if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
> >>>>>> +    sve_costs = vec_costs->sve;
> >>>>>
> >>>>> Similarly here, this doesn't seem necessary.
> >>>>>
> >>>>>> [...]
> >>>>>> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >>>>>>                                                  where, stmt_cost);
> >>>>>
> >>>>>>     /* Check if we've seen an SVE gather/scatter operation and which size.  */
> >>>>>> +      const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
> >>>>>>     if (kind == scalar_load
> >>>>>>    && aarch64_sve_mode_p (TYPE_MODE (vectype))
> >>>>>> -     && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> >>>>>> +     && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
> >>>>>> +     && vec_costs->sve)
> >>>>>>  {
> >>>>>> -     const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
> >>>>>> +     const sve_vec_cost *sve_costs = vec_costs->sve;
> >>>>>>    if (sve_costs)
> >>>>>>      {
> >>>>>>        if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> >>>>>
> >>>>> Here too.
> >>>>>
> >>>>> Thanks,
> >>>>> Richard
> 
