> On 1 Oct 2024, at 21:30, Tamar Christina <tamar.christ...@arm.com> wrote: > > External email: Use caution opening links or attachments > > > Hi Jennifer, > >> -----Original Message----- >> From: Jennifer Schmitz <jschm...@nvidia.com> >> Sent: Tuesday, September 24, 2024 9:23 AM >> To: gcc-patches@gcc.gnu.org >> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford >> <richard.sandif...@arm.com>; Kyrylo Tkachov <ktkac...@exchange.nvidia.com> >> Subject: Re: [RFC][PATCH] AArch64: Remove >> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >> >> >> >>> On 28 Aug 2024, at 14:56, Kyrylo Tkachov <ktkac...@nvidia.com> wrote: >>> >>> >>> >>>> On 28 Aug 2024, at 10:27, Tamar Christina <tamar.christ...@arm.com> wrote: >>>> >>>> External email: Use caution opening links or attachments >>>> >>>> >>>>> -----Original Message----- >>>>> From: Kyrylo Tkachov <ktkac...@nvidia.com> >>>>> Sent: Wednesday, August 28, 2024 8:55 AM >>>>> To: Tamar Christina <tamar.christ...@arm.com> >>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Jennifer Schmitz >>>>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov >>>>> <ktkac...@exchange.nvidia.com> >>>>> Subject: Re: [RFC][PATCH] AArch64: Remove >>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>> >>>>> Hi all, >>>>> >>>>> Thanks to Jennifer for proposing a patch and Tamar and Richard for digging >> into it. >>>>> >>>>>> On 27 Aug 2024, at 13:16, Tamar Christina <tamar.christ...@arm.com> >> wrote: >>>>>> >>>>>> External email: Use caution opening links or attachments >>>>>> >>>>>> >>>>>>> -----Original Message----- >>>>>>> From: Richard Sandiford <richard.sandif...@arm.com> >>>>>>> Sent: Tuesday, August 27, 2024 11:46 AM >>>>>>> To: Tamar Christina <tamar.christ...@arm.com> >>>>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; >> Kyrylo >>>>>>> Tkachov <ktkac...@exchange.nvidia.com> >>>>>>> Subject: Re: [RFC][PATCH] AArch64: Remove >>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>> >>>>>>> Tamar Christina <tamar.christ...@arm.com> writes: >>>>>>>> Hi Jennifer, >>>>>>>> >>>>>>>>> -----Original Message----- >>>>>>>>> From: Jennifer Schmitz <jschm...@nvidia.com> >>>>>>>>> Sent: Friday, August 23, 2024 1:07 PM >>>>>>>>> To: gcc-patches@gcc.gnu.org >>>>>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov >>>>>>>>> <ktkac...@exchange.nvidia.com> >>>>>>>>> Subject: [RFC][PATCH] AArch64: Remove >>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> >>>>>>>>> This patch removes the >> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS >>>>>>>>> tunable and >>>>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the >>>>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend >>>>> the >>>>>>>>> default. >>>>>>> >>>>>>> Thanks for doing this. This has been on my TODO list ever since the >>>>>>> tunable was added. >>>>>>> >>>>>>> The history is that these "new" costs were originally added in stage 4 >>>>>>> of GCC 11 for Neoverse V1. Since the costs were added so late, it >>>>>>> wasn't >>>>>>> appropriate to change the behaviour for any other core. All the new >>>>>>> code >>>>>>> was therefore gated on this option. 
>>>>>>> >>>>>>> The new costs do two main things: >>>>>>> >>>>>>> (1) use throughput-based calculations where available, including to >>>>>>> choose >>>>>>> between Advanced SIMD and SVE >>>>>>> >>>>>>> (2) try to make the latency-based costs more precise, by looking more >> closely >>>>>>> at the provided stmt_info >>>>>>> >>>>>>> Old cost models won't be affected by (1) either way, since they don't >>>>>>> provide any throughput information. But they should in principle >>>>>>> benefit >>>>>>> from (2). So... >>>>>>> >>>>>>>>> To that end, the function aarch64_use_new_vector_costs_p and its uses >>>>> were >>>>>>>>> removed. Additionally, guards were added prevent nullpointer >> dereferences >>>>> of >>>>>>>>> fields in cpu_vector_cost. >>>>>>>>> >>>>>>>> >>>>>>>> I'm not against this change, but it does mean that we now switch old >>>>>>>> Adv. >>>>> SIMD >>>>>>>> cost models as well to the new throughput based cost models. That >>>>>>>> means >>>>> that >>>>>>>> -mcpu=generic now behaves differently, and -mcpu=neoverse-n1 and I >> think >>>>>>>> some distros explicitly use this (I believe yocto for instance does). >>>>>>> >>>>>>> ...it shouldn't mean that we start using throughput-based models for >>>>>>> cortexa53 etc., since there's no associated issue info. >>>>>> >>>>>> Yes, I was using throughput based model as a name. But as you indicated >>>>>> in >> (2) >>>>>> it does change the latency calculation. >>>>>> >>>>>> My question was because of things in e.g. aarch64_adjust_stmt_cost and >>>>> friends, >>>>>> e.g. aarch64_multiply_add_p changes the cost between FMA SIMD vs scalar. >>>>>> >>>>>> So my question.. >>>>>> >>>>>>> >>>>>>>> Have we validated that the old generic cost model still behaves >>>>>>>> sensibly >> with >>>>> this >>>>>>> change? >>>>>> >>>>>> is still valid I think, we *are* changing the cost for all models, >>>>>> and while they should indeed be more accurate, there could be knock on >> effects. >>>>>> >>>>> >>>>> We can run SPEC on a Grace system with -mcpu=generic to see what the >>>>> effect >> is, >>>>> but wider benchmarking would be more appropriate. Can you help with that >>>>> Tamar once we agree on the other implementation details in this patch? >>>>> >>>> >>>> Sure that's not a problem. Just ping me when you have a patch you want me >>>> to >> test :) >>>> >>>>> >>>>>> Thanks, >>>>>> Tamar >>>>>> >>>>>>>> >>>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu: >>>>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp: >>>>>>>>> gather_load_extend_X.c >>>>>>>>> where X is 1 to 4, strided_load_2.c, strided_store_2.c) fail because >>>>>>>>> of >> small >>>>>>>>> differences >>>>>>>>> in codegen that make some of the scan-assembler-times tests fail. >>>>>>>>> >>>>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these >>>>>>>>> tests >> and >>>>>>> add >>>>>>>>> some >>>>>>>> >>>>>>>> I don't personally like unlimited here as unlimited means just >>>>>>>> vectorize at >> any >>>>>>>> cost. This means that costing between modes are also disabled. A lot >>>>>>>> of >> these >>>>>>>> testcases are intended to test not just that we vectorize but that we >> vectorize >>>>>>>> with efficient code. >>>>>>>> >>>>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the >>>>>>>> testcases. 
>>>>>>> >>>>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the >>>>>>> gather_load_extend_X.c tests, since we require the cost model to decide >>>>>>> when to use extending loads vs loading a full vector and unpacking. >>>>> >>>>> I had suggested using -fvect-cost-model=unlimited here because I thought >> these >>>>> tests wanted to test the capability of GCC to detect and generate the >>>>> particular >> SVE >>>>> feature (gather load extends) for all supported data types regardless of >>>>> profitability. >>>> >>>> I think the problem only specifically for the load_extend tests, because >> unlimited also >>>> disables SVE vs SVE comparisons wrt to changes to VF, so not just Adv. >>>> SIMD vs >> SVE. >>>> That is while it would generate a gather, it may choose to instead of >>>> gather from >>>> B -> D, do B -> S and then extend the results. Because this has a higher >>>> VF. >>>> >>>> Without the cross SVE mode comparisons it wouldn't know that the extensions >> would >>>> actually slow it down. >>>> >>>>> If the tests are intended to also make the right profitability decisions >>>>> for the >> generic >>>>> tuning model then I agree using -fvect-cost-model=unlimited is not >> appropriate >>>>> here, though I do think that it’d be useful to fix the backend vector >>>>> cost model >>>>> hooks to honor -fvect-cost-model=unlimited and not limit generation of >>>>> gather/scatter in that case. What it should do for the SVE vs Neon >>>>> decisions is >> an >>>>> open question. >>>>> >>>>> >>>>>>> >>>>>>> [...tries patch...] >>>>>>> >>>>>>> It seems like three main things are contributing to the difference: >>>>>>> >>>>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value >>>>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost >>>>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change) >>>>>>> >>>>>>> (1) is expected. >>>>>>> >>>>>>> (2) partly looks like a latent bug. We're using the x32 cost for >>>>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads. >>>>>>> >>>>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype >> (vec_info >>>>>>> *vinfo, vect_cost_for_stmt kind, >>>>>>> && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == >>>>>>> VMAT_GATHER_SCATTER) >>>>>>> { >>>>>>> unsigned int nunits = vect_nunits_for_cost (vectype); >>>>>>> - if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64) >>>>>>> + if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), >>>>>>> aarch64_sve_vg)) >>>>>>> return { sve_costs->gather_load_x64_cost, nunits }; >>>>>>> return { sve_costs->gather_load_x32_cost, nunits }; >>>>>>> } >>>>>>> >>>>>>> fixes that. >>>>> >>>>> Would you be willing to test and push that to trunk to get it out of the >>>>> way? >>>>> >>>>> >>>>>>> >>>>>>> (3) is interesting. generic_vector_cost isn't used by default for any >>>>>>> SVE CPU, or any -march=...+sve. So the question is: should we treat it >>>>>>> as "architectural intent"/somewhat idealised? Or should we try to make >>>>>>> it produce good code for existing SVE cores, in which case it would >>>>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a. >>>>> >>>>>>> >>>>>>> If the former, we could set gather_load_x32_init_cost and >>>>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost >>>>>>> (and nothing else, so that generic_armv*_a are unaffected). 
>>>>> >>>>> I don’t have strong opinions on this point but if Tamar is concerned about >>>>> deviating too much from the known-good Advanced SIMD generic tuning we >> have >>>>> now then we should aim for minimal codegen changes in that? >>>> >>>> No I'm fine with Richard's suggestion. The only way you'd be able to get >>>> this >> model >>>> to fire is with -march=xxx=sve -mtune=generic, at which point I'm fine with >> assuming >>>> gathers have no initial overhead. >>> >>> As an aside, Jennifer pointed out to me that running the aarch64-sve.exp >>> tests >> ends up using “-march=armv8.2-a+sve -mtune=generic -moverride=tune=none” >> for the tests but aarch64-sve.exp does’t add that -mtune and -moverride. It >> seems >> that aarch64-sve-acle-asm.exp does add these flags (to avoid the >> CSE_SVE_VL_CONSTANTS logic IIRC), but it seems odd to use them in the >> top-level >> aarch64-sve.exp and I’m not sure by that DejaGNU mechanism they end up >> propagating there. >> Without making any more changes to the patch, I reran the testsuite: >> With Richard's patch, the gather_load_extend tests now pass. >> However, 3 tests still fail: strided_load_2.c, strided_store_2.c, and >> strided_store_5.c. >> Below you see an extract of the diff of the produced assembly for these three >> tests. Interestingly, there seem to be cases where the codegen is better >> with the >> patch (strided_load_2.c, f_int64_t_32), but also cases that are worse >> (strided_store_5.c, f_int32_t_5). >> Can you help diagnosing what the problem of the bad codegen in some cases >> could be and advise on how to proceed? >> Thank you, Jennifer >> > > A lot of them look like costing issues, I'm not exactly sure what > -mtune=generic -moverride=tune=none > > Resolves to. I would assume it goes back to the default tuning? > >> ===== strided_load_2.c ===== >> Compiled with: >> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic - >> moverride=tune=none >> >> Failed test: >> gcc.target/aarch64/sve/strided_load_2.c scan-assembler-times \\tld1d\\tz[0- >> 9]+\\.d, p[0-7]/z, \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15 >> Instead found 9 times >> >> Example of difference in produced assembly: >> f_int64_t_32: >> cbz w3, .L92 >> mov x4, 0 >> uxtw x3, w3 >> + cntd x5 >> + whilelo p7.d, xzr, x3 >> + mov z29.s, w5 >> mov z31.s, w2 >> - whilelo p6.d, xzr, x3 >> - mov x2, x3 >> - index z30.s, #0, #1 >> - uqdecd x2 >> - ptrue p5.b, all >> - whilelo p7.d, xzr, x2 >> + index z30.d, #0, #1 >> + ptrue p6.b, all >> .p2align 3,,7 >> .L94: >> - ld1d z27.d, p7/z, [x0, #1, mul vl] >> - ld1d z28.d, p6/z, [x0] >> - movprfx z29, z30 >> - mul z29.s, p5/m, z29.s, z31.s >> - incw x4 >> - uunpklo z0.d, z29.s >> - uunpkhi z29.d, z29.s >> - ld1d z25.d, p6/z, [x1, z0.d, lsl 3] >> - ld1d z26.d, p7/z, [x1, z29.d, lsl 3] >> - add z25.d, z28.d, z25.d >> + ld1d z27.d, p7/z, [x0, x4, lsl 3] >> + movprfx z28, z30 >> + mul z28.s, p6/m, z28.s, z31.s >> + ld1d z26.d, p7/z, [x1, z28.d, uxtw 3] >> add z26.d, z27.d, z26.d >> - st1d z26.d, p7, [x0, #1, mul vl] >> - whilelo p7.d, x4, x2 >> - st1d z25.d, p6, [x0] >> - incw z30.s >> - incb x0, all, mul #2 >> - whilelo p6.d, x4, x3 >> + st1d z26.d, p7, [x0, x4, lsl 3] >> + add z30.s, z30.s, z29.s >> + incd x4 >> + whilelo p7.d, x4, x3 >> b.any .L94 >> .L92: >> ret >> > > This to me looks like unrolling. It looks like the old code decided to unroll > once and the new one doesn't. > I agree it does look better overall. 
> >> ===== strided_store_2.c ===== >> Compiled with: >> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic - >> moverride=tune=none >> >> Failed test: >> gcc.target/aarch64/sve/strided_store_2.c scan-assembler-times \\tst1d\\tz[0- >> 9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d, lsl 3\\]\\n 15 >> Instead found 9 times >> >> Example of difference in produced assembly: >> f_float_64: >> .LFB11: >> .cfi_startproc >> cbz x3, .L67 >> - lsl x2, x2, 2 >> - add x3, x1, x3, lsl 2 >> - fmov s31, 1.0e+0 >> - .p2align 3,,7 >> + sub x4, x3, #1 >> + cmp x4, 2 >> + bls .L73 >> + lsr x9, x3, 2 >> + lsl x8, x2, 2 >> + fmov v31.4s, 1.0e+0 >> + lsl x10, x2, 4 >> + add x4, x0, x8 >> + add x9, x1, x9, lsl 4 >> + neg x11, x8 >> + lsl x2, x2, 3 >> + mov x5, x1 >> + .p2align 3,,7 >> +.L70: >> + ldr q30, [x5], 16 >> + add x7, x4, x8 >> + add x6, x4, x2 >> + fadd v30.4s, v30.4s, v31.4s >> + str s30, [x4, x11] >> + st1 {v30.s}[1], [x4] >> + add x4, x4, x10 >> + st1 {v30.s}[2], [x7] >> + st1 {v30.s}[3], [x6] >> + cmp x5, x9 >> + bne .L70 >> + and x4, x3, -4 >> + tst x3, 3 >> + beq .L67 >> .L69: >> - ldr s30, [x1], 4 >> - fadd s30, s30, s31 >> - str s30, [x0] >> - add x0, x0, x2 >> - cmp x1, x3 >> - bne .L69 >> + madd x0, x8, x4, x0 >> + fmov s29, 1.0e+0 >> +.L72: >> + ldr s28, [x1, x4, lsl 2] >> + add x4, x4, 1 >> + fadd s28, s28, s29 >> + str s28, [x0] >> + add x0, x0, x8 >> + cmp x3, x4 >> + bhi .L72 >> .L67: >> ret >> +.L73: >> + lsl x8, x2, 2 >> + mov x4, 0 >> + b .L69 >> > > This one feels like the vectorizer emulated a scatter store. > Did you rebase by chance? I think our cost model most likely > does not model the store cots of a single lane. I'd look at costing of > the loop from the vectorizer. > >> >> f_int64_t_32: >> .LFB14: >> .cfi_startproc >> - cbz w3, .L84 >> - addvl x5, x1, #1 >> + cbz w3, .L92 >> mov x4, 0 >> uxtw x3, w3 >> + cntd x5 >> + whilelo p7.d, xzr, x3 >> + mov z29.s, w5 >> mov z31.s, w2 >> - whilelo p6.d, xzr, x3 >> - mov x2, x3 >> - index z30.s, #0, #1 >> - uqdecd x2 >> - ptrue p5.b, all >> - whilelo p7.d, xzr, x2 >> - .p2align 3,,7 >> -.L86: >> - ld1d z28.d, p6/z, [x1, x4, lsl 3] >> - ld1d z27.d, p7/z, [x5, x4, lsl 3] >> - movprfx z29, z30 >> - mul z29.s, p5/m, z29.s, z31.s >> - add z28.d, z28.d, #1 >> - uunpklo z26.d, z29.s >> - st1d z28.d, p6, [x0, z26.d, lsl 3] >> - incw x4 >> - uunpkhi z29.d, z29.s >> + index z30.d, #0, #1 >> + ptrue p6.b, all >> + .p2align 3,,7 >> +.L94: >> + ld1d z27.d, p7/z, [x1, x4, lsl 3] >> + movprfx z28, z30 >> + mul z28.s, p6/m, z28.s, z31.s >> add z27.d, z27.d, #1 >> - st1d z27.d, p7, [x0, z29.d, lsl 3] >> - whilelo p7.d, x4, x2 >> - incw z30.s >> - whilelo p6.d, x4, x3 >> - b.any .L86 >> -.L84: >> + st1d z27.d, p7, [x0, z28.d, uxtw 3] >> + incd x4 >> + add z30.s, z30.s, z29.s >> + whilelo p7.d, x4, x3 >> + b.any .L94 >> +.L92: >> ret >> > > Same as before it looks like we stopped unrolling on a long dependency chain. > So I think the code is better, > >> ==== strided_store_5.c ===== >> Compiled with: >> -O2 -ftree-vectorize -march=armv8.2-a+sve -mtune=generic - >> moverride=tune=none -msve-vector-bits=256 >> >> Failed tests: >> gcc.target/aarch64/sve/strided_store_5. c scan-assembler-times \\tst1d\\tz[0- >> 9]+\\.d, p[0-7], \\[x[0-9]+, z[0-9]+.d\\]\\n 15 >> Instead found 0 times >> gcc.target/aarch64/sve/strided_store_5. c scan-assembler-times \\tst1w\\tz[0- >> 9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, sxtw\\]\\n 3 >> Instead found 0 times >> gcc.target/aarch64/sve/strided_store_5. 
c scan-assembler-times \\tst1w\\tz[0- >> 9]+\\.s, p[0-7], \\[x[0-9]+, z[0-9]+.s, uxtw\\]\\n 12 >> Instead found 0 times >> >> Example of difference in produced assembly: >> f_int32_t_5: >> cmp x2, 0 >> ble .L1 >> - mov x3, 0 >> - mov w4, 20 >> - whilelo p7.s, xzr, x2 >> - index z31.s, #0, w4 >> - .p2align 3,,7 >> + sub x3, x2, #1 >> + cmp x3, 6 >> + bls .L7 >> + lsr x8, x2, 3 >> + mov x4, x1 >> + mov x3, x0 >> + ptrue p6.b, vl32 >> + add x8, x1, x8, lsl 5 >> + pfalse p7.b >> + .p2align 3,,7 >> +.L4: >> + ld1w z31.s, p6/z, [x4] >> + add z31.s, z31.s, #1 >> + umov w7, v31.s[1] >> + umov w6, v31.s[2] >> + umov w5, v31.s[3] >> + dup z1.s, z31.s[4] >> + dup z0.s, z31.s[5] >> + dup z30.s, z31.s[6] >> + lastb s29, p7, z31.s >> + add x4, x4, 32 >> + str s31, [x3] >> + add x3, x3, 160 >> + str w7, [x3, -140] >> + str w6, [x3, -120] >> + str w5, [x3, -100] >> + str s1, [x3, -80] >> + str s0, [x3, -60] >> + str s30, [x3, -40] >> + str s29, [x3, -20] >> + cmp x8, x4 >> + bne .L4 >> + and x3, x2, -8 >> + cmp x2, x3 >> + beq .L1 >> .L3: >> - ld1w z30.s, p7/z, [x1, x3, lsl 2] >> - add z30.s, z30.s, #1 >> - st1w z30.s, p7, [x0, z31.s, uxtw] >> - add x3, x3, 8 >> - add x0, x0, 160 >> - whilelo p7.s, x3, x2 >> - b.any .L3 >> + lsl x4, x3, 2 >> + sub x2, x2, x3 >> + add x3, x4, x3 >> + whilelo p7.d, xzr, x2 >> + add x1, x1, x4 >> + mov x4, 20 >> + ld1w z2.d, p7/z, [x1] >> + add x3, x0, x3, lsl 2 >> + index z28.d, #0, x4 >> + add z2.s, z2.s, #1 >> + st1w z2.d, p7, [x3, z28.d] >> + mov x0, 4 >> + whilelo p7.d, x0, x2 >> + b.any .L10 >> .L1: >> ret >> + .p2align 2,,3 >> +.L10: >> + ld1w z29.d, p7/z, [x1, #1, mul vl] >> + add x3, x3, 80 >> + add z29.s, z29.s, #1 >> + st1w z29.d, p7, [x3, z28.d] >> + ret >> +.L7: >> + mov x3, 0 >> + b .L3 >> > > This one kinda feels like we vectorized a constructor we don't really support > codegen for. > You can check if the code at dce6 makes sense and if the scalarization is > done by veclower21. I investigated the cause of the test failures a bit further:
Looking at the diff of the vect dumps (below is a section of the diff for strided_store_2.c), it seemed odd that vec_to_scalar operations now cost 0 instead of the previous cost of 2:

+strided_store_1.c:38:151: note: === vectorizable_operation ===
+strided_store_1.c:38:151: note: vect_model_simple_cost: inside_cost = 1, prologue_cost = 0 .
+strided_store_1.c:38:151: note: ==> examining statement: *_6 = _7;
+strided_store_1.c:38:151: note: vect_is_simple_use: operand _3 + 1.0e+0, type of def: internal
+strided_store_1.c:38:151: note: Vectorizing an unaligned access.
+Applying pattern match.pd:236, generic-match-9.cc:4128
+Applying pattern match.pd:5285, generic-match-10.cc:4234
+strided_store_1.c:38:151: note: vect_model_store_cost: inside_cost = 12, prologue_cost = 0 .
*_2 1 times unaligned_load (misalign -1) costs 1 in body
-_3 + 1.0e+0 1 times scalar_to_vec costs 1 in prologue
_3 + 1.0e+0 1 times vector_stmt costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+<unknown> 1 times vector_load costs 1 in prologue
+_7 1 times vec_to_scalar costs 0 in body
_7 1 times scalar_store costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+_7 1 times vec_to_scalar costs 0 in body
_7 1 times scalar_store costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+_7 1 times vec_to_scalar costs 0 in body
_7 1 times scalar_store costs 1 in body
-_7 1 times vec_to_scalar costs 2 in body
+_7 1 times vec_to_scalar costs 0 in body
_7 1 times scalar_store costs 1 in body

Although the aarch64_use_new_vector_costs_p flag was used in multiple places in aarch64.cc, the location that causes this behavior is this one:

unsigned
aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
                                     stmt_vec_info stmt_info, slp_tree,
                                     tree vectype, int misalign,
                                     vect_cost_model_location where)
{
[...]
  /* Try to get a more accurate cost by looking at STMT_INFO instead
     of just looking at KIND.  */
-  if (stmt_info && aarch64_use_new_vector_costs_p ())
+  if (stmt_info)
    {
      /* If we scalarize a strided store, the vectorizer costs one
         vec_to_scalar for each element.  However, we can store the first
         element using an FP store without a separate extract step.  */
      if (vect_is_store_elt_extraction (kind, stmt_info))
        count -= 1;

      stmt_cost = aarch64_detect_scalar_stmt_subtype (m_vinfo, kind,
                                                      stmt_info, stmt_cost);

      if (vectype && m_vec_flags)
        stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
                                                        stmt_info, vectype,
                                                        where, stmt_cost);
    }
[...]
  return record_stmt_cost (stmt_info, where, (count * stmt_cost).ceil ());
}

Previously, for -mtune=generic, this function returned a cost of 2 for a vec_to_scalar operation in the vect body. Now "if (stmt_info)" is entered and "if (vect_is_store_elt_extraction (kind, stmt_info))" evaluates to true, which decrements count from 1 to 0 and therefore leads to a return value of 0.
My assumption is that these "free" vec_to_scalar operations result in vectorization of the ADD instruction in the tests, followed by individual stores using ST1. But the ST1 stores aren't free, so they likely shouldn't have a cost of 0.
What are your thoughts on this? Do you have a suggestion which changes to make in order to resolve this?
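For reference, here is a minimal sketch of the shape of loop these strided_store tests exercise (the function name and signature are mine, not the exact testsuite source). It is only meant to show where the per-lane vec_to_scalar + scalar_store pairs in the dump above come from, assuming a stride that is only known at run time:

/* Hypothetical kernel in the style of strided_store_2.c: the add can be
   vectorized, but because the destination stride is a runtime value the
   stores are scalarized, so the vectorizer costs one lane extract
   (vec_to_scalar) plus one scalar_store per element -- exactly the
   statements whose costs changed above.  */
void
f_float (float *restrict dst, float *restrict src, long stride, long n)
{
  for (long i = 0; i < n; ++i)
    dst[i * stride] = src[i] + 1.0f;
}

If I read the earlier f_float_64 assembly correctly, this is what now lowers to the Advanced SIMD sequence with one q-register load, a vector fadd, an str of lane 0 and ST1 stores for the remaining lanes, so treating every extract as free makes that scalarized store sequence look cheaper than it really is.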
> > Thanks, > Tamar > >>> >>> Thanks, >>> Kyrill >>> >>>> >>>> Cheers, >>>> Tamar >>>> >>>>> Thanks again, >>>>> Kyrill >>>>> >>>>> >>>>>>> >>>>>>> On the patch: >>>>>>> >>>>>>>> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency >> (vec_info >>>>>>> *vinfo, stmt_vec_info stmt_info, >>>>>>>> { >>>>>>>> const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs; >>>>>>>> const sve_vec_cost *sve_costs = nullptr; >>>>>>>> - if (vec_flags & VEC_ANY_SVE) >>>>>>>> + if (vec_costs->sve >>>>>>>> + && vec_flags & VEC_ANY_SVE) >>>>>>>> sve_costs = aarch64_tune_params.vec_costs->sve; >>>>>>> >>>>>>> This doesn't seem necessary. I agree we might as well reuse the >>>>>>> vec_costs local variable in the "if" body though. >>>>>>> >>>>>>>> [...] >>>>>>>> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype >>>>> (vec_info >>>>>>> *vinfo, vect_cost_for_stmt kind, >>>>>>>> fractional_cost stmt_cost) >>>>>>>> { >>>>>>>> const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype); >>>>>>>> + const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs; >>>>>>>> const sve_vec_cost *sve_costs = nullptr; >>>>>>>> - if (aarch64_sve_mode_p (TYPE_MODE (vectype))) >>>>>>>> - sve_costs = aarch64_tune_params.vec_costs->sve; >>>>>>>> + if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve) >>>>>>>> + sve_costs = vec_costs->sve; >>>>>>> >>>>>>> Similarly here, this doesn't seem necessary. >>>>>>> >>>>>>>> [...] >>>>>>>> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int >>>>>>> count, vect_cost_for_stmt kind, >>>>>>>> where, stmt_cost); >>>>>>> >>>>>>>> /* Check if we've seen an SVE gather/scatter operation and which >>>>>>>> size. */ >>>>>>>> + const cpu_vector_cost *vec_costs = >>>>>>>> aarch64_tune_params.vec_costs; >>>>>>>> if (kind == scalar_load >>>>>>>> && aarch64_sve_mode_p (TYPE_MODE (vectype)) >>>>>>>> - && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == >>>>>>> VMAT_GATHER_SCATTER) >>>>>>>> + && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == >>>>>>> VMAT_GATHER_SCATTER >>>>>>>> + && vec_costs->sve) >>>>>>>> { >>>>>>>> - const sve_vec_cost *sve_costs = >>>>>>>> aarch64_tune_params.vec_costs->sve; >>>>>>>> + const sve_vec_cost *sve_costs = vec_costs->sve; >>>>>>>> if (sve_costs) >>>>>>>> { >>>>>>>> if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64) >>>>>>> >>>>>>> Here too. >>>>>>> >>>>>>> Thanks, >>>>>>> Richard