Hi Jennifer, > -----Original Message----- > From: Jennifer Schmitz <jschm...@nvidia.com> > Sent: Tuesday, September 24, 2024 9:23 AM > To: gcc-patches@gcc.gnu.org > Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford > <richard.sandif...@arm.com>; Kyrylo Tkachov <ktkac...@exchange.nvidia.com> > Subject: Re: [RFC][PATCH] AArch64: Remove > AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > > > > > On 28 Aug 2024, at 14:56, Kyrylo Tkachov <ktkac...@nvidia.com> wrote: > > > > > > > >> On 28 Aug 2024, at 10:27, Tamar Christina <tamar.christ...@arm.com> wrote: > >> > >> External email: Use caution opening links or attachments > >> > >> > >>> -----Original Message----- > >>> From: Kyrylo Tkachov <ktkac...@nvidia.com> > >>> Sent: Wednesday, August 28, 2024 8:55 AM > >>> To: Tamar Christina <tamar.christ...@arm.com> > >>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Jennifer Schmitz > >>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov > >>> <ktkac...@exchange.nvidia.com> > >>> Subject: Re: [RFC][PATCH] AArch64: Remove > >>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>> > >>> Hi all, > >>> > >>> Thanks to Jennifer for proposing a patch and Tamar and Richard for digging > into it. > >>> > >>>> On 27 Aug 2024, at 13:16, Tamar Christina <tamar.christ...@arm.com> > wrote: > >>>> > >>>> External email: Use caution opening links or attachments > >>>> > >>>> > >>>>> -----Original Message----- > >>>>> From: Richard Sandiford <richard.sandif...@arm.com> > >>>>> Sent: Tuesday, August 27, 2024 11:46 AM > >>>>> To: Tamar Christina <tamar.christ...@arm.com> > >>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; > Kyrylo > >>>>> Tkachov <ktkac...@exchange.nvidia.com> > >>>>> Subject: Re: [RFC][PATCH] AArch64: Remove > >>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>> > >>>>> Tamar Christina <tamar.christ...@arm.com> writes: > >>>>>> Hi Jennifer, > >>>>>> > >>>>>>> -----Original Message----- > >>>>>>> From: Jennifer Schmitz <jschm...@nvidia.com> > >>>>>>> Sent: Friday, August 23, 2024 1:07 PM > >>>>>>> To: gcc-patches@gcc.gnu.org > >>>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov > >>>>>>> <ktkac...@exchange.nvidia.com> > >>>>>>> Subject: [RFC][PATCH] AArch64: Remove > >>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>> > >>>>>>> This patch removes the > AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS > >>>>>>> tunable and > >>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the > >>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend > >>> the > >>>>>>> default. > >>>>> > >>>>> Thanks for doing this. This has been on my TODO list ever since the > >>>>> tunable was added. > >>>>> > >>>>> The history is that these "new" costs were originally added in stage 4 > >>>>> of GCC 11 for Neoverse V1. Since the costs were added so late, it > >>>>> wasn't > >>>>> appropriate to change the behaviour for any other core. All the new > >>>>> code > >>>>> was therefore gated on this option. > >>>>> > >>>>> The new costs do two main things: > >>>>> > >>>>> (1) use throughput-based calculations where available, including to > >>>>> choose > >>>>> between Advanced SIMD and SVE > >>>>> > >>>>> (2) try to make the latency-based costs more precise, by looking more > closely > >>>>> at the provided stmt_info > >>>>> > >>>>> Old cost models won't be affected by (1) either way, since they don't > >>>>> provide any throughput information. But they should in principle > >>>>> benefit > >>>>> from (2). So... 
> >>>>> > >>>>>>> To that end, the function aarch64_use_new_vector_costs_p and its uses > >>> were > >>>>>>> removed. Additionally, guards were added prevent nullpointer > dereferences > >>> of > >>>>>>> fields in cpu_vector_cost. > >>>>>>> > >>>>>> > >>>>>> I'm not against this change, but it does mean that we now switch old > >>>>>> Adv. > >>> SIMD > >>>>>> cost models as well to the new throughput based cost models. That > >>>>>> means > >>> that > >>>>>> -mcpu=generic now behaves differently, and -mcpu=neoverse-n1 and I > think > >>>>>> some distros explicitly use this (I believe yocto for instance does). > >>>>> > >>>>> ...it shouldn't mean that we start using throughput-based models for > >>>>> cortexa53 etc., since there's no associated issue info. > >>>> > >>>> Yes, I was using throughput based model as a name. But as you indicated > >>>> in > (2) > >>>> it does change the latency calculation. > >>>> > >>>> My question was because of things in e.g. aarch64_adjust_stmt_cost and > >>> friends, > >>>> e.g. aarch64_multiply_add_p changes the cost between FMA SIMD vs scalar. > >>>> > >>>> So my question.. > >>>> > >>>>> > >>>>>> Have we validated that the old generic cost model still behaves > >>>>>> sensibly > with > >>> this > >>>>> change? > >>>> > >>>> is still valid I think, we *are* changing the cost for all models, > >>>> and while they should indeed be more accurate, there could be knock on > effects. > >>>> > >>> > >>> We can run SPEC on a Grace system with -mcpu=generic to see what the > >>> effect > is, > >>> but wider benchmarking would be more appropriate. Can you help with that > >>> Tamar once we agree on the other implementation details in this patch? > >>> > >> > >> Sure that's not a problem. Just ping me when you have a patch you want me > >> to > test :) > >> > >>> > >>>> Thanks, > >>>> Tamar > >>>> > >>>>>> > >>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu: > >>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp: > >>>>>>> gather_load_extend_X.c > >>>>>>> where X is 1 to 4, strided_load_2.c, strided_store_2.c) fail because > >>>>>>> of > small > >>>>>>> differences > >>>>>>> in codegen that make some of the scan-assembler-times tests fail. > >>>>>>> > >>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these > >>>>>>> tests > and > >>>>> add > >>>>>>> some > >>>>>> > >>>>>> I don't personally like unlimited here as unlimited means just > >>>>>> vectorize at > any > >>>>>> cost. This means that costing between modes are also disabled. A lot > >>>>>> of > these > >>>>>> testcases are intended to test not just that we vectorize but that we > vectorize > >>>>>> with efficient code. > >>>>>> > >>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the > >>>>>> testcases. > >>>>> > >>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the > >>>>> gather_load_extend_X.c tests, since we require the cost model to decide > >>>>> when to use extending loads vs loading a full vector and unpacking. > >>> > >>> I had suggested using -fvect-cost-model=unlimited here because I thought > these > >>> tests wanted to test the capability of GCC to detect and generate the > >>> particular > SVE > >>> feature (gather load extends) for all supported data types regardless of > >>> profitability. > >> > >> I think the problem only specifically for the load_extend tests, because > unlimited also > >> disables SVE vs SVE comparisons wrt to changes to VF, so not just Adv. > >> SIMD vs > SVE. 
> >> That is while it would generate a gather, it may choose to instead of > >> gather from > >> B -> D, do B -> S and then extend the results. Because this has a higher > >> VF. > >> > >> Without the cross SVE mode comparisons it wouldn't know that the extensions > would > >> actually slow it down. > >> > >>> If the tests are intended to also make the right profitability decisions > >>> for the > generic > >>> tuning model then I agree using -fvect-cost-model=unlimited is not > appropriate > >>> here, though I do think that it’d be useful to fix the backend vector > >>> cost model > >>> hooks to honor -fvect-cost-model=unlimited and not limit generation of > >>> gather/scatter in that case. What it should do for the SVE vs Neon > >>> decisions is > an > >>> open question. > >>> > >>> > >>>>> > >>>>> [...tries patch...] > >>>>> > >>>>> It seems like three main things are contributing to the difference: > >>>>> > >>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value > >>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost > >>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change) > >>>>> > >>>>> (1) is expected. > >>>>> > >>>>> (2) partly looks like a latent bug. We're using the x32 cost for > >>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads. > >>>>> > >>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype > (vec_info > >>>>> *vinfo, vect_cost_for_stmt kind, > >>>>> && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == > >>>>> VMAT_GATHER_SCATTER) > >>>>> { > >>>>> unsigned int nunits = vect_nunits_for_cost (vectype); > >>>>> - if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64) > >>>>> + if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), > >>>>> aarch64_sve_vg)) > >>>>> return { sve_costs->gather_load_x64_cost, nunits }; > >>>>> return { sve_costs->gather_load_x32_cost, nunits }; > >>>>> } > >>>>> > >>>>> fixes that. > >>> > >>> Would you be willing to test and push that to trunk to get it out of the > >>> way? > >>> > >>> > >>>>> > >>>>> (3) is interesting. generic_vector_cost isn't used by default for any > >>>>> SVE CPU, or any -march=...+sve. So the question is: should we treat it > >>>>> as "architectural intent"/somewhat idealised? Or should we try to make > >>>>> it produce good code for existing SVE cores, in which case it would > >>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a. > >>> > >>>>> > >>>>> If the former, we could set gather_load_x32_init_cost and > >>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost > >>>>> (and nothing else, so that generic_armv*_a are unaffected). > >>> > >>> I don’t have strong opinions on this point but if Tamar is concerned about > >>> deviating too much from the known-good Advanced SIMD generic tuning we > have > >>> now then we should aim for minimal codegen changes in that? > >> > >> No I'm fine with Richard's suggestion. The only way you'd be able to get > >> this > model > >> to fire is with -march=xxx=sve -mtune=generic, at which point I'm fine with > assuming > >> gathers have no initial overhead. > > > > As an aside, Jennifer pointed out to me that running the aarch64-sve.exp > > tests > ends up using “-march=armv8.2-a+sve -mtune=generic -moverride=tune=none” > for the tests but aarch64-sve.exp does’t add that -mtune and -moverride. 
It > seems > that aarch64-sve-acle-asm.exp does add these flags (to avoid the > CSE_SVE_VL_CONSTANTS logic IIRC), but it seems odd to use them in the > top-level > aarch64-sve.exp and I’m not sure by that DejaGNU mechanism they end up > propagating there. > Without making any more changes to the patch, I reran the testsuite: > With Richard's patch, the gather_load_extend tests now pass. > However, 3 tests still fail: strided_load_2.c, strided_store_2.c, and > strided_store_5.c. > Below you see an extract of the diff of the produced assembly for these three > tests. Interestingly, there seem to be cases where the codegen is better with > the > patch (strided_load_2.c, f_int64_t_32), but also cases that are worse > (strided_store_5.c, f_int32_t_5). > Can you help diagnosing what the problem of the bad codegen in some cases > could be and advise on how to proceed? > Thank you, Jennifer >
A lot of them look like costing issues; I'm not exactly sure what -mtune=generic -moverride=tune=none resolves to. I would assume it goes back to the default tuning?

> ===== strided_load_2.c =====
> Compiled with:
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none
>
> Failed test:
> gcc.target/aarch64/sve/strided_load_2.c scan-assembler-times \\tld1d\\tz[0-9]+\\.d, p[0-7]/z, \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
> Instead found 9 times
>
> Example of difference in produced assembly:
> f_int64_t_32:
>         cbz     w3, .L92
>         mov     x4, 0
>         uxtw    x3, w3
> +       cntd    x5
> +       whilelo p7.d, xzr, x3
> +       mov     z29.s, w5
>         mov     z31.s, w2
> -       whilelo p6.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p7.d, xzr, x2
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
>         .p2align 3,,7
> .L94:
> -       ld1d    z27.d, p7/z, [x0, #1, mul vl]
> -       ld1d    z28.d, p6/z, [x0]
> -       movprfx z29, z30
> -       mul     z29.s, p5/m, z29.s, z31.s
> -       incw    x4
> -       uunpklo z0.d, z29.s
> -       uunpkhi z29.d, z29.s
> -       ld1d    z25.d, p6/z, [x1, z0.d, lsl 3]
> -       ld1d    z26.d, p7/z, [x1, z29.d, lsl 3]
> -       add     z25.d, z28.d, z25.d
> +       ld1d    z27.d, p7/z, [x0, x4, lsl 3]
> +       movprfx z28, z30
> +       mul     z28.s, p6/m, z28.s, z31.s
> +       ld1d    z26.d, p7/z, [x1, z28.d, uxtw 3]
>         add     z26.d, z27.d, z26.d
> -       st1d    z26.d, p7, [x0, #1, mul vl]
> -       whilelo p7.d, x4, x2
> -       st1d    z25.d, p6, [x0]
> -       incw    z30.s
> -       incb    x0, all, mul #2
> -       whilelo p6.d, x4, x3
> +       st1d    z26.d, p7, [x0, x4, lsl 3]
> +       add     z30.s, z30.s, z29.s
> +       incd    x4
> +       whilelo p7.d, x4, x3
>         b.any   .L94
> .L92:
>         ret

This to me looks like unrolling. It looks like the old code decided to unroll once and the new one doesn't. I agree it does look better overall.

> ===== strided_store_2.c =====
> Compiled with:
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none
>
> Failed test:
> gcc.target/aarch64/sve/strided_store_2.c scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15
> Instead found 9 times
>
> Example of difference in produced assembly:
> f_float_64:
> .LFB11:
>         .cfi_startproc
>         cbz     x3, .L67
> -       lsl     x2, x2, 2
> -       add     x3, x1, x3, lsl 2
> -       fmov    s31, 1.0e+0
> -       .p2align 3,,7
> +       sub     x4, x3, #1
> +       cmp     x4, 2
> +       bls     .L73
> +       lsr     x9, x3, 2
> +       lsl     x8, x2, 2
> +       fmov    v31.4s, 1.0e+0
> +       lsl     x10, x2, 4
> +       add     x4, x0, x8
> +       add     x9, x1, x9, lsl 4
> +       neg     x11, x8
> +       lsl     x2, x2, 3
> +       mov     x5, x1
> +       .p2align 3,,7
> +.L70:
> +       ldr     q30, [x5], 16
> +       add     x7, x4, x8
> +       add     x6, x4, x2
> +       fadd    v30.4s, v30.4s, v31.4s
> +       str     s30, [x4, x11]
> +       st1     {v30.s}[1], [x4]
> +       add     x4, x4, x10
> +       st1     {v30.s}[2], [x7]
> +       st1     {v30.s}[3], [x6]
> +       cmp     x5, x9
> +       bne     .L70
> +       and     x4, x3, -4
> +       tst     x3, 3
> +       beq     .L67
> .L69:
> -       ldr     s30, [x1], 4
> -       fadd    s30, s30, s31
> -       str     s30, [x0]
> -       add     x0, x0, x2
> -       cmp     x1, x3
> -       bne     .L69
> +       madd    x0, x8, x4, x0
> +       fmov    s29, 1.0e+0
> +.L72:
> +       ldr     s28, [x1, x4, lsl 2]
> +       add     x4, x4, 1
> +       fadd    s28, s28, s29
> +       str     s28, [x0]
> +       add     x0, x0, x8
> +       cmp     x3, x4
> +       bhi     .L72
> .L67:
>         ret
> +.L73:
> +       lsl     x8, x2, 2
> +       mov     x4, 0
> +       b       .L69

This one feels like the vectorizer emulated a scatter store. Did you rebase by chance? I think our cost model most likely does not model the store costs of a single lane. I'd look at the costing of the loop from the vectorizer.
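For reference, one way to see where those costs come from (assuming the same flags as above plus the standard vectorizer dump switch) would be something like:

  -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none -fdump-tree-vect-details

and then checking the cost model analysis notes for the loop in the resulting .vect dump.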
>
> f_int64_t_32:
> .LFB14:
>         .cfi_startproc
> -       cbz     w3, .L84
> -       addvl   x5, x1, #1
> +       cbz     w3, .L92
>         mov     x4, 0
>         uxtw    x3, w3
> +       cntd    x5
> +       whilelo p7.d, xzr, x3
> +       mov     z29.s, w5
>         mov     z31.s, w2
> -       whilelo p6.d, xzr, x3
> -       mov     x2, x3
> -       index   z30.s, #0, #1
> -       uqdecd  x2
> -       ptrue   p5.b, all
> -       whilelo p7.d, xzr, x2
> -       .p2align 3,,7
> -.L86:
> -       ld1d    z28.d, p6/z, [x1, x4, lsl 3]
> -       ld1d    z27.d, p7/z, [x5, x4, lsl 3]
> -       movprfx z29, z30
> -       mul     z29.s, p5/m, z29.s, z31.s
> -       add     z28.d, z28.d, #1
> -       uunpklo z26.d, z29.s
> -       st1d    z28.d, p6, [x0, z26.d, lsl 3]
> -       incw    x4
> -       uunpkhi z29.d, z29.s
> +       index   z30.d, #0, #1
> +       ptrue   p6.b, all
> +       .p2align 3,,7
> +.L94:
> +       ld1d    z27.d, p7/z, [x1, x4, lsl 3]
> +       movprfx z28, z30
> +       mul     z28.s, p6/m, z28.s, z31.s
>         add     z27.d, z27.d, #1
> -       st1d    z27.d, p7, [x0, z29.d, lsl 3]
> -       whilelo p7.d, x4, x2
> -       incw    z30.s
> -       whilelo p6.d, x4, x3
> -       b.any   .L86
> -.L84:
> +       st1d    z27.d, p7, [x0, z28.d, uxtw 3]
> +       incd    x4
> +       add     z30.s, z30.s, z29.s
> +       whilelo p7.d, x4, x3
> +       b.any   .L94
> +.L92:
>         ret

Same as before, it looks like we stopped unrolling on a long dependency chain. So I think the code is better.

> ===== strided_store_5.c =====
> Compiled with:
> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none -msve-vector-bits=256
>
> Failed tests:
> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1d\\tz[0-9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d\\]\\n 15
> Instead found 0 times
> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, sxtw\\]\\n 3
> Instead found 0 times
> gcc.target/aarch64/sve/strided_store_5.c scan-assembler-times \\tst1w\\tz[0-9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, uxtw\\]\\n 12
> Instead found 0 times
>
> Example of difference in produced assembly:
> f_int32_t_5:
>         cmp     x2, 0
>         ble     .L1
> -       mov     x3, 0
> -       mov     w4, 20
> -       whilelo p7.s, xzr, x2
> -       index   z31.s, #0, w4
> -       .p2align 3,,7
> +       sub     x3, x2, #1
> +       cmp     x3, 6
> +       bls     .L7
> +       lsr     x8, x2, 3
> +       mov     x4, x1
> +       mov     x3, x0
> +       ptrue   p6.b, vl32
> +       add     x8, x1, x8, lsl 5
> +       pfalse  p7.b
> +       .p2align 3,,7
> +.L4:
> +       ld1w    z31.s, p6/z, [x4]
> +       add     z31.s, z31.s, #1
> +       umov    w7, v31.s[1]
> +       umov    w6, v31.s[2]
> +       umov    w5, v31.s[3]
> +       dup     z1.s, z31.s[4]
> +       dup     z0.s, z31.s[5]
> +       dup     z30.s, z31.s[6]
> +       lastb   s29, p7, z31.s
> +       add     x4, x4, 32
> +       str     s31, [x3]
> +       add     x3, x3, 160
> +       str     w7, [x3, -140]
> +       str     w6, [x3, -120]
> +       str     w5, [x3, -100]
> +       str     s1, [x3, -80]
> +       str     s0, [x3, -60]
> +       str     s30, [x3, -40]
> +       str     s29, [x3, -20]
> +       cmp     x8, x4
> +       bne     .L4
> +       and     x3, x2, -8
> +       cmp     x2, x3
> +       beq     .L1
> .L3:
> -       ld1w    z30.s, p7/z, [x1, x3, lsl 2]
> -       add     z30.s, z30.s, #1
> -       st1w    z30.s, p7, [x0, z31.s, uxtw]
> -       add     x3, x3, 8
> -       add     x0, x0, 160
> -       whilelo p7.s, x3, x2
> -       b.any   .L3
> +       lsl     x4, x3, 2
> +       sub     x2, x2, x3
> +       add     x3, x4, x3
> +       whilelo p7.d, xzr, x2
> +       add     x1, x1, x4
> +       mov     x4, 20
> +       ld1w    z2.d, p7/z, [x1]
> +       add     x3, x0, x3, lsl 2
> +       index   z28.d, #0, x4
> +       add     z2.s, z2.s, #1
> +       st1w    z2.d, p7, [x3, z28.d]
> +       mov     x0, 4
> +       whilelo p7.d, x0, x2
> +       b.any   .L10
> .L1:
>         ret
> +       .p2align 2,,3
> +.L10:
> +       ld1w    z29.d, p7/z, [x1, #1, mul vl]
> +       add     x3, x3, 80
> +       add     z29.s, z29.s, #1
> +       st1w    z29.d, p7, [x3, z28.d]
> +       ret
> +.L7:
> +       mov     x3, 0
> +       b       .L3

This one kinda feels like we vectorized a constructor we don't really support codegen for. You can check whether the code at dce6 makes sense and whether the scalarization is done by veclower21.
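In case it helps, recompiling with the tree dumps enabled (assuming the flags quoted above) should give you the relevant dump files, e.g.:

  -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic -moverride=tune=none -msve-vector-bits=256 -fdump-tree-all

and the .dce6 and .veclower21 dumps should show whether the scalarized stores are introduced there.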
Thanks, Tamar > > > > Thanks, > > Kyrill > > > >> > >> Cheers, > >> Tamar > >> > >>> Thanks again, > >>> Kyrill > >>> > >>> > >>>>> > >>>>> On the patch: > >>>>> > >>>>>> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency > (vec_info > >>>>> *vinfo, stmt_vec_info stmt_info, > >>>>>> { > >>>>>> const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs; > >>>>>> const sve_vec_cost *sve_costs = nullptr; > >>>>>> - if (vec_flags & VEC_ANY_SVE) > >>>>>> + if (vec_costs->sve > >>>>>> + && vec_flags & VEC_ANY_SVE) > >>>>>> sve_costs = aarch64_tune_params.vec_costs->sve; > >>>>> > >>>>> This doesn't seem necessary. I agree we might as well reuse the > >>>>> vec_costs local variable in the "if" body though. > >>>>> > >>>>>> [...] > >>>>>> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype > >>> (vec_info > >>>>> *vinfo, vect_cost_for_stmt kind, > >>>>>> fractional_cost stmt_cost) > >>>>>> { > >>>>>> const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype); > >>>>>> + const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs; > >>>>>> const sve_vec_cost *sve_costs = nullptr; > >>>>>> - if (aarch64_sve_mode_p (TYPE_MODE (vectype))) > >>>>>> - sve_costs = aarch64_tune_params.vec_costs->sve; > >>>>>> + if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve) > >>>>>> + sve_costs = vec_costs->sve; > >>>>> > >>>>> Similarly here, this doesn't seem necessary. > >>>>> > >>>>>> [...] > >>>>>> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int > >>>>> count, vect_cost_for_stmt kind, > >>>>>> where, stmt_cost); > >>>>> > >>>>>> /* Check if we've seen an SVE gather/scatter operation and which > >>>>>> size. */ > >>>>>> + const cpu_vector_cost *vec_costs = > >>>>>> aarch64_tune_params.vec_costs; > >>>>>> if (kind == scalar_load > >>>>>> && aarch64_sve_mode_p (TYPE_MODE (vectype)) > >>>>>> - && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == > >>>>> VMAT_GATHER_SCATTER) > >>>>>> + && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == > >>>>> VMAT_GATHER_SCATTER > >>>>>> + && vec_costs->sve) > >>>>>> { > >>>>>> - const sve_vec_cost *sve_costs = > >>>>>> aarch64_tune_params.vec_costs->sve; > >>>>>> + const sve_vec_cost *sve_costs = vec_costs->sve; > >>>>>> if (sve_costs) > >>>>>> { > >>>>>> if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64) > >>>>> > >>>>> Here too. > >>>>> > >>>>> Thanks, > >>>>> Richard >