On Sat, Aug 3, 2024 at 2:42 PM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> >> 1. Background
> >>
> >> For a loop reduction that accumulates the result of a widening operation,
> >> the preferred pattern is a lane-reducing operation, if supported by the
> >> target. Because this kind of operation need not preserve the intermediate
> >> results of the widening operation, and only produces a reduced number of
> >> final results for accumulation, choosing the pattern can lead to quite
> >> compact codegen.
> >>
> >> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
> >> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
> >> (SAD_EXPR). WIDEN_SUM_EXPR can be seen as a degenerate dot-product with a
> >> constant operand of "1". Currently, gcc only supports recognition of the
> >> simple lane-reducing case, in which each accumulation statement of a loop
> >> reduction forms one pattern:
> >>
> >>   char *d0, *d1;
> >>   short *s0, *s1;
> >>
> >>   for (i) {
> >>     sum += d0[i] * d1[i];      // <vector(4) int> = DOT_PROD <vector(16) char>
> >>     sum += abs(s0[i] - s1[i]); // <vector(4) int> = SAD <vector(8) short>
> >>   }
> >>
> >> We could rewrite the example as below using only one statement, whose
> >> non-reduction addend is the sum of the above right-hand sides. As a whole,
> >> the addend would match nothing, while its two sub-expressions could be
> >> recognized as the corresponding lane-reducing patterns.
> >>
> >>   for (i) {
> >>     sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
> >>   }
>
> > Note we try to recognize the original form as an SLP reduction (which of
> > course fails).
>
> >> This case might be too elaborately crafted to be very common in reality.
> >> Though, we do find seemingly variant but essentially similar code patterns
> >> in some AI applications, which use matrix-vector operations extensively;
> >> some usages are just a single loop reduction composed of multiple
> >> dot-products. A code snippet from ggml:
> >>
> >>   for (int j = 0; j < qk/2; ++j) {
> >>     const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
> >>     const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;
> >>
> >>     const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
> >>     const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;
> >>
> >>     sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> >>   }
> >>
> >> At the source level, it appears to be a natural and minor scaling-up of the
> >> simple single lane-reducing pattern, but it is beyond the capability of
> >> current vectorization pattern recognition and needs some kind of generic
> >> extension to the framework.
>
> Sorry for the late response.
>
> > So this is about re-associating lane-reducing ops to alternative
> > lane-reducing ops to save repeated accumulation steps?
>
> You mean re-associating slp-based lane-reducing ops to loop-based ones?
Yes.

> > The thing is that IMO pattern recognition as we do now is limiting and
> > should eventually move to the SLP side where we should be able to more
> > freely "undo" and associate.
>
> No matter whether pattern recognition is done prior to or within SLP, the
> essential thing is that we need some means to figure out which ops qualify
> for lane-reducing.
>
> For example, when seeing a mult in a loop with a vectorization-favored
> shape,
>
>   ...
>   t = a * b;   // char a, b
>   ...
>
> we could not say it is definitely applicable for reduced computation via
> dot-product, even if the corresponding target ISA is available.

True. Note there's a PR which shows SLP lane-reducing written out like

  a[i] = b[4*i] * 3 + b[4*i+1] * 3 + b[4*i+2] * 3 + b[4*i+3] * 3;

which we cannot change to a DOT_PROD because we do not know which lanes are
reduced. My point was that there are non-reduction cases where knowing
which actual lanes get reduced would help. For reductions it's not
important, and associating in a way that exposes more possible (reduction)
lane reductions is almost always going to be a win.

> Recognition of normal patterns merely involves a local statement-based
> match, while for lane-reducing, the validity check requires global
> loop-wise analysis of the reduction structure, probably not the same as,
> but close to, what is proposed in the RFC. The basic logic, IMHO, is
> independent of where pattern recognition is implemented. As a matter of
> fact, this is not about "associating", but "tagging" (marking all
> statements that qualify for lane-reducing). After that process, a
> "re-associator" could play its role in guiding the selection of either a
> loop-based or an slp-based lane-reducing op.

> > I've searched twice now, a few days ago I read that the optabs not
> > specifying which lanes are combined/reduced is a limitation.  Yes, it
> > is - I hope we can rectify this, so if this is motivation enough we
> > should split the optabs up into even/odd/hi/lo (or whatever else
> > interesting targets actually do).
>
> Actually, how lanes are combined/reduced does not matter too much with
> regard to recognition of lane-reducing patterns.

> > I did read through the rest of the e-mail before, I do in general like
> > the idea to do better. Costing is another place where we need to re-do
> > things
>
> Yes, the current pattern recognition framework is not costing-model-driven,
> and has no way to "undo" a decision previously made, even if it negatively
> impacts pattern matching later. But this is a weakness of the framework,
> not of any specific pattern. To overcome it, we may count on a separate
> task instead of mingling it with this RFC, and it would be better to fold
> that task into your plan of moving pattern recognition to SLP.

> > completely; my current idea is to feed targets the SLP graph so they'll
> > have some dependence info. They should already have access to the actual
> > operation done, though in awkward ways. I guess the first target to
> > implement sth better than we have now will get the lead to design a
> > better API with the vectorizer.
>
> For the targeted case found in ggml, if we can optimize it as proposed,
> some AI applications (such as Llama) relying on ggml would be accelerated
> by 10%, or even more than 20% with some input data. LLVM has already
> supported the two-dot-product pattern, though it seems to adopt a
> case-by-case solution.
>
> I know that you and Tamar are working on refactoring the vectorizer to be
> all-SLP. The aforementioned SLP-related work would depend on, or just be
> part of, that refactoring. And I feel it will be long-term work to get the
> new vectorizer full-fledged. Do we have a rough schedule for that work?
>
> Thus, it might be more practical to build this extended lane-reducing
> pattern recognition within the old framework, so that in a relatively
> short time we could get GCC enhanced to support possibly more complicated
> usages of dot-product for AI applications. When the all-SLP vectorizer
> gradually takes shape, we could adjust the pattern recognition framework
> in the way you suggested. Additionally, I still think we would not need to
> re-implement the idea of the proposal from scratch at that time; most of
> the completed code could be trivially migrated.

Note I didn't yet look at the actual patches (but I saw they are big,
adding new code). The plan is to get through the transition to SLP this
stage1 (after not getting anywhere with "incremental" for the last 5
years). So I'm wary of adding new features that do _not_ work within SLP
and thus require converting - like I was wary of adding the not-SLP-ready
multi-exit vectorization last year, which is now another obstacle to
overcome.

I didn't look into whether the code you add works with SLP reduction
vectorization or at what level it applies. But any kind of
"re-association" after SLP discovery has finished would have to happen on
the SLP representation.

That said, I'm not the only one able to review or approve vectorizer
patches - I just wanted to make the point again: please avoid adding new
stuff that's not SLP ready.

Richard.

> Thanks,
> Feng