> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Friday, September 20, 2024 3:48 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; Marcus Shawcroft
> <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH]AArch64: Take into account when VF is higher than known
> scalar iters
>
> Tamar Christina <tamar.christ...@arm.com> writes:
> >>
> >> So my gut instinct is that we should instead tweak the condition for
> >> using latency costs, but I'll need to think about it more when I get
> >> back from holiday.
> >>
> >
> > I think that's a separate problem.  From first principles it should already
> > be very wrong to compare the scalar loop to an iteration count it will
> > *NEVER* reach.  So I don't understand why that would ever be valid.
>
> But I don't think we're doing that, or at least, not as the final result.
> Instead, we first calculate the minimum number of vector iterations for
> which the vector loop is sometimes profitable.  If this is N, then we're
> saying that the vector code is better than the scalar code for N*VF
> iterations.  Like you say, this part ignores whether N*VF is actually
> achievable.  But then:
>
>   /* Now that we know the minimum number of vector iterations,
>      find the minimum niters for which the scalar cost is larger:
>
>      SIC * niters > VIC * vniters + VOC - SOC
>
>      We know that the minimum niters is no more than
>      vniters * VF + NPEEL, but it might be (and often is) less
>      than that if a partial vector iteration is cheaper than the
>      equivalent scalar code.  */
>   int threshold = (vec_inside_cost * min_vec_niters
>                    + vec_outside_cost
>                    - scalar_outside_cost);
>   if (threshold <= 0)
>     min_profitable_iters = 1;
>   else
>     min_profitable_iters = threshold / scalar_single_iter_cost + 1;
>
> calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
> is the first to be profitable.
> This is specifically taking partial
> iterations into account and includes the N==1 case.  The lower niters is,
> the easier it is for the scalar code to win.
>
> This is what is printed as:
>
>   Calculated minimum iters for profitability: 7
>
> So we think that vectorisation should be rejected if the loop count
> is <= 6, but accepted if it's >= 7.
This 7 is the vector iteration count:

  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/app/example.c:4:21: note:  Runtime profitability threshold = 7
/app/example.c:4:21: note:  Static estimate profitability threshold = 7

Which says the vector code has to iterate at least 7 iterations for it to be
profitable.  This on its own is already impossible.  Even at the smallest VL
possible for SVE you can never reach 7 iterations.  At 7 iterations you've
processed more elements than you know are in the loop.  Since every vector
iteration processes at least 1 array element per lane and niters is 9,
7 vector iterations is an impossible situation.

So to me, you are essentially quoting back what I said but differently.  You
cannot *ever* reach 7 vector iterations.  This vector code will *NEVER* be
profitable.

Furthermore, the Advanced SIMD unrolling code also gets in the way:

  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 2
/app/example.c:4:21: note:  Runtime profitability threshold = 2
/app/example.c:4:21: note:  Static estimate profitability threshold = 2
/app/example.c:4:21: note: ***** Analysis succeeded with vector mode VNx4QI
/app/example.c:4:21: note: Comparing two main loops (VNx4QI at VF 4 vs VNx4SI at VF 16)
/app/example.c:4:21: note: Number of insns in unrolled Advanced SIMD loop = 8
/app/example.c:4:21: note: Preferring Advanced SIMD loop because it can be unrolled

And because of this it doesn't actually do any comparisons.  That's why we
end up with such horrible SVE codegen.  It assumes Adv. SIMD will be taken
but doesn't actually know if Adv. SIMD is even possible...

And sure enough, the vectorizer itself does the sensible thing and blocks
most Adv. SIMD modes in generic code:

/app/example.c:4:21: missed: not vectorized: iteration count smaller than vectorization factor.
/app/example.c:4:21: missed: Loop costings not worthwhile.
/app/example.c:4:21: note: ***** Analysis failed with vector mode V16QI

And the smaller modes lose based on costing.  So then the first SVE mode
tested is picked, which is a purely arbitrary selection.

The patch works by disqualifying those modes where you have this
impossibility.  You can instead view it as I phrased it above: the result of
the computation is an impossibility at any SVE VL.  Even at the lowest VL
your last 4 iterations are with a pfalse predicate, even if you somehow get
there.  The vectorizer knows this:

/app/example.c:4:21: note:   Original vector body cost = 46
/app/example.c:4:21: note: Vector loop iterates at most 1 times
...
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/app/example.c:4:21: note:  Runtime profitability threshold = 7
/app/example.c:4:21: note:  Static estimate profitability threshold = 7

So clearly this mode should never be used.  But instead of rejecting the
mode we continue with it and just disable throughput-based calculations:

  if (m_num_vector_iterations >= 1
      && m_num_vector_iterations < threshold)
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                         "Low iteration count, so using pure latency"
                         " costs\n");
    }

But clearly, looking at latency here is wrong.  I assumed the reason we do
this is that the throughput-based model may have been too conservative.  But
the issue with ignoring throughput here is that you then also incorrectly
model predicate latency.

As I mentioned already, the pure latency-based calculation is inaccurate in
that it just takes the max of the pred ops and the vector ops.  But the pred
ops are required inputs to the vector ops, so the relationship cannot be a
MAX.  And the code admits this is simplistic:

/* Estimate the minimum number of cycles needed to issue the operations.
   This is a very simplistic model!  */
fractional_cost
aarch64_vec_op_count::min_cycles_per_iter () const
{
  return std::max (min_nonpred_cycles_per_iter (),
                   min_pred_cycles_per_iter ());
}

So this vastly underestimates the vector loop.

The reason it doesn't help generic code is that, as we already discussed
internally, the Advanced SIMD unrolling code needs to compare fractional VFs
before punting to Adv. SIMD, so that there's at least some ordering when
Adv. SIMD isn't usable and it falls back to SVE.

It works for specific CPU models because those aren't truly scalable.
E.g. Neoverse-V1 generates with the patch:

.L2:
        ld1b    z30.s, p7/z, [x0, x3]
        ld1w    z31.s, p7/z, [x1, x3, lsl 2]
        cmpne   p6.b, p7/z, z30.b, #0
        st1w    z31.s, p6, [x2, x3, lsl 2]
        add     x3, x3, x5
        whilelo p7.s, w3, w4
        b.any   .L2

because it disqualified the impossible SVE modes.

And given how the entire model was approximate, and that based purely on
latency it's hard to model the interaction between predicate ops and vector
ops, this made sense.

I thought we had agreed before that this was the correct thing to do; I
guess I misunderstood the outcome of that conversation.

Tamar

> So I think the costing framework is set up to handle niters<VF correctly
> on first principles.  It's "just" that the numbers being fed in give the
> wrong answer in this case.
>
> Thanks,
> Richard