> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, September 20, 2024 3:48 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Take into account when VF is higher than known
> scalar iters
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >>
> >> So my gut instinct is that we should instead tweak the condition for
> >> using latency costs, but I'll need to think about it more when I get
> >> back from holiday.
> >>
> >
> > I think that's a separate problem.  From first principles it should already
> > be very wrong to compare the scalar loop to an iteration count it will
> > *NEVER* reach.  So I don't understand why that would ever be valid.
>
> But I don't think we're doing that, or at least, not as the final result.
> Instead, we first calculate the minimum number of vector iterations for
> which the vector loop is sometimes profitable.  If this is N, then we're
> saying that the vector code is better than the scalar code for N*VF
> iterations.  Like you say, this part ignores whether N*VF is actually
> achievable.  But then:
>
>   /* Now that we know the minimum number of vector iterations,
>      find the minimum niters for which the scalar cost is larger:
>
>      SIC * niters > VIC * vniters + VOC - SOC
>
>      We know that the minimum niters is no more than
>      vniters * VF + NPEEL, but it might be (and often is) less
>      than that if a partial vector iteration is cheaper than the
>      equivalent scalar code.  */
>   int threshold = (vec_inside_cost * min_vec_niters
>                    + vec_outside_cost
>                    - scalar_outside_cost);
>   if (threshold <= 0)
>     min_profitable_iters = 1;
>   else
>     min_profitable_iters = threshold / scalar_single_iter_cost + 1;
>
> calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
> is the first to be profitable.
> This is specifically taking partial
> iterations into account and includes the N==1 case.  The lower niters is,
> the easier it is for the scalar code to win.
>
> This is what is printed as:
>
>   Calculated minimum iters for profitability: 7
>
> So we think that vectorisation should be rejected if the loop count
> is <= 6, but accepted if it's >= 7.
This 7 is the vector iteration count:

  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/app/example.c:4:21: note:  Runtime profitability threshold = 7
/app/example.c:4:21: note:  Static estimate profitability threshold = 7

Which says the vector code has to iterate at least 7 iterations for it to be
profitable.  This on its own is already impossible.  Even at the smallest VL
possible for SVE you can never reach 7 iterations.  At 7 iterations you've
processed more elements than you know are in the loop.  Since every vector
iteration processes at least 1 array element per lane and niters is 9,
7 vector iterations is an impossible situation.

So to me, you are essentially quoting back what I said but differently.  You
cannot *ever* reach 7 vector iterations.  This vector code will *NEVER* be
profitable.

Furthermore, the Advanced SIMD unrolling code also gets in the way:

  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 2
/app/example.c:4:21: note:  Runtime profitability threshold = 2
/app/example.c:4:21: note:  Static estimate profitability threshold = 2
/app/example.c:4:21: note: ***** Analysis succeeded with vector mode VNx4QI
/app/example.c:4:21: note: Comparing two main loops (VNx4QI at VF 4 vs VNx4SI at VF 16)
/app/example.c:4:21: note: Number of insns in unrolled Advanced SIMD loop = 8
/app/example.c:4:21: note: Preferring Advanced SIMD loop because it can be unrolled

And because of this it doesn't actually do any comparisons.  That's why we
end up with such horrible SVE codegen.  It assumes Adv. SIMD will be taken
but doesn't actually know if Adv. SIMD is even possible...

And sure enough, the vectorizer itself does the sensible thing and blocks
most Adv. SIMD modes in generic code:

/app/example.c:4:21: missed: not vectorized: iteration count smaller than vectorization factor.
/app/example.c:4:21: missed: Loop costings not worthwhile.
/app/example.c:4:21: note: ***** Analysis failed with vector mode V16QI

And the smaller modes lose based on costing.  So then the first SVE mode
tested is picked, which is a purely arbitrary selection.

The patch works by disqualifying those modes where you have this
impossibility.  You can instead view it as I phrased it above: the result of
the computation is an impossibility at any SVE VL.  Even at the lowest VL
your last 4 iterations are with a pfalse predicate, even if you somehow get
there.  The vectorizer knows this:

/app/example.c:4:21: note:   Original vector body cost = 46
/app/example.c:4:21: note: Vector loop iterates at most 1 times
...
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/app/example.c:4:21: note:  Runtime profitability threshold = 7
/app/example.c:4:21: note:  Static estimate profitability threshold = 7

So clearly this mode should never be used.  But instead of rejecting the
mode we continue with it and just disable throughput-based calculations:

  if (m_num_vector_iterations >= 1
      && m_num_vector_iterations < threshold)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                         "Low iteration count, so using pure latency"
                         " costs\n");
    }

But clearly, looking at latency here is wrong.  I assumed the reason we do
this is that the throughput-based model may have been too conservative.  But
the issue with ignoring throughput here is that you then also incorrectly
model predicate latency.

As I mentioned already, the pure latency-based calculation is inaccurate in
that it just takes the max of the pred ops and the vector ops.  But the pred
ops are required inputs to the vector ops, so the relationship cannot be a
MAX.  And the code admits this is simplistic:

/* Estimate the minimum number of cycles needed to issue the operations.
   This is a very simplistic model!  */
fractional_cost
aarch64_vec_op_count::min_cycles_per_iter () const
{
  return std::max (min_nonpred_cycles_per_iter (),
                   min_pred_cycles_per_iter ());
}

So this vastly underestimates the vector loop.

The reason it doesn't help generic code is that, as we already discussed
internally, the Advanced SIMD unrolling code needs to compare fractional VFs
before punting to Adv. SIMD, so that there's at least some ordering when
Adv. SIMD isn't usable and it falls back to SVE.

It works for specific CPU models because those aren't truly scalable.
E.g. Neoverse-V1 generates with the patch:

.L2:
        ld1b    z30.s, p7/z, [x0, x3]
        ld1w    z31.s, p7/z, [x1, x3, lsl 2]
        cmpne   p6.b, p7/z, z30.b, #0
        st1w    z31.s, p6, [x2, x3, lsl 2]
        add     x3, x3, x5
        whilelo p7.s, w3, w4
        b.any   .L2

because it disqualified the impossible SVE modes.

And given how the entire model was approximate, and that based purely on
latency it's hard to model the interaction between predicate ops and vector
ops, this made sense.

I thought we had agreed before that this was the correct thing to do; I
guess I misunderstood the outcome of that conversation.

Tamar

> So I think the costing framework is set up to handle niters<VF correctly
> on first principles.  It's "just" that the numbers being fed in give the
> wrong answer in this case.
>
> Thanks,
> Richard