>>
>> >> 1. Background
>> >>
>> >> For a loop reduction that accumulates the result of a widening operation,
>> >> the preferred pattern is a lane-reducing operation, if supported by the
>> >> target. Since this kind of operation need not preserve the intermediate
>> >> results of the widening operation, and only produces a reduced number of
>> >> final results for accumulation, choosing the pattern can lead to quite
>> >> compact codegen.
>> >>
>> >> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
>> >> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
>> >> (SAD_EXPR); WIDEN_SUM_EXPR can be seen as a degenerate dot-product whose
>> >> second operand is the constant "1". Currently, gcc only supports
>> >> recognition of the simple lane-reducing case, in which each accumulation
>> >> statement of a loop reduction forms one pattern on its own:
>> >>
>> >>  char  *d0, *d1;
>> >>  short *s0, *s1;
>> >>
>> >>  for (i) {
>> >>    sum += d0[i] * d1[i];      // <vector(4) int> = DOT_PROD <vector(16) char>
>> >>    sum += abs(s0[i] - s1[i]); // <vector(4) int> = SAD <vector(8) short>
>> >>  }
>> >>
>> >> We could rewrite the example as below, using only one statement whose
>> >> non-reduction addend is the sum of the two right-hand sides above. As a
>> >> whole, the addend would match nothing, while each of its two
>> >> sub-expressions could be recognized as the corresponding lane-reducing
>> >> pattern.
>> >>
>> >>  for (i) {
>> >>    sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
>> >>  }
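>> >>
>> >> Conceptually, the vectorized reduction we would like to get from this is
>> >> roughly the following (a hand-written sketch in gimple-like notation with
>> >> illustrative vector names, not actual compiler output):
>> >>
>> >>  for (i) {
>> >>    sum_v = DOT_PROD <d0_v, d1_v, sum_v>; // reduces <vector(16) char>
>> >>    sum_v = SAD <s0_v, s1_v, sum_v>;      // reduces <vector(8) short>
>> >>  }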
>> >
>> > Note we try to recognize the original form as SLP reduction (which of
>> > course fails).
>> >
>> >> This case might be too elaborately crafted to be very common in reality.
>> >> However, we do find superficially different but essentially similar code
>> >> patterns in some AI applications, which use matrix-vector operations
>> >> extensively; some usages are just a single loop reduction composed of
>> >> multiple dot-products. A code snippet from ggml:
>> >>
>> >>  for (int j = 0; j < qk/2; ++j) {
>> >>    const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
>> >>    const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;
>> >>
>> >>    const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
>> >>    const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;
>> >>
>> >>    sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
>> >>  }
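>> >>
>> >> If recognition succeeds here, the accumulation would conceptually turn
>> >> into two dot-products feeding one accumulator, along the lines of (a
>> >> sketch with illustrative names; x0_v/x1_v and y0_v/y1_v stand for the
>> >> narrow input vectors):
>> >>
>> >>  sumi_v = DOT_PROD <x0_v, y0_v, sumi_v>;
>> >>  sumi_v = DOT_PROD <x1_v, y1_v, sumi_v>;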
>> >>
>> >> At the source level, it appears to be a natural and minor scaling-up of
>> >> the simple one-statement lane-reducing pattern, but it is beyond the
>> >> capability of current vectorization pattern recognition, and needs some
>> >> kind of generic extension to the framework.
>>
>> Sorry for the late response.
>>
>> > So this is about re-associating lane-reducing ops to alternative
>> > lane-reducing ops to save repeated accumulation steps?
>>
>> You mean re-associating slp-based lane-reducing ops into loop-based ones?
> 
> Yes.
> 
>> > The thing is that IMO pattern recognition as we do now is limiting and
>> > should eventually move to the SLP side where we should be able to more
>> > freely "undo" and associate.
>>
>> No matter whether pattern recognition is done prior to or within SLP, the
>> essential thing is that we need to figure out, by some means, which ops
>> qualify for lane-reducing.
>>
>> For example, when seeing a mult in a loop with a vectorization-favored shape,
>>     ...
>>     t = a * b;     // char a, b
>>     ...
>>
>> we could not say it is definitely applicable for reduced computation via
>> dot-product, even when the corresponding target ISA is available.
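>>
>> For instance (an illustrative variation), if the widened product also has a
>> per-lane use, its intermediate results must be preserved, so folding the
>> multiply into a dot-product would be invalid:
>>
>>     t = a * b;     // char a, b
>>     out[i] = t;    // per-lane use: lanes must be kept
>>     sum += t;      // only this use is reduction-compatible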
> 
> True.  Note there's a PR which shows SLP lane-reducing written out like
> 
>   a[i] = b[4*i] * 3 + b[4*i+1] * 3 + b[4*i+2] * 3 + b[4*i+3] * 3;
> 
> which we cannot change to a DOT_PROD because we do not know which
> lanes are reduced.  My point was there are non-reduction cases where knowing
> which actual lanes get reduced would help.  For reductions it's not important
> and associating in a way to expose more possible (reduction) lane reductions
> is almost always going to be a win.
> 
>> Recognition of normal patterns merely involves local statement-based
>> matching, while for lane-reducing, the validity check requires global
>> loop-wise analysis of the reduction structure, probably not the same as,
>> but close to, what is proposed in the RFC. The basic logic, IMHO, is
>> independent of where pattern recognition is implemented. As a matter of
>> fact, this is not about "associating" but "tagging" (marking all statements
>> that qualify for lane-reducing). After that process, a "re-associator"
>> could play its role in guiding the selection of either a loop-based or an
>> slp-based lane-reducing op.
>>
>> > I've searched twice now, a few days ago I read that the optabs not
>> > specifying which lanes are combined/reduced is a limitation.  Yes, it is -
>> > I hope we can rectify this, so if this is motivation enough we should
>> > split the optabs up into even/odd/hi/lo (or whatever else interesting
>> > targets actually do).
>>
>> Actually, how lanes are combined/reduced does not matter too much with
>> regard to recognition of lane-reducing patterns.
>>
>> > I did read through the rest of the e-mail before, I do in general like
>> > the idea to do better.  Costing is another place where we need to re-do
>> > things
>>
>> Yes, the current pattern recognition framework is not costing-model-driven,
>> and has no way to "undo" a decision previously made even when it negatively
>> impacts pattern matching later. But this is a weakness of the framework,
>> not of any specific pattern. To overcome it, we may count on a separate
>> task instead of mingling it with this RFC, and it would be better to fold
>> that task into your plan of moving pattern recognition to SLP.
>>
>> > completely;  my current idea is to feed targets the SLP graph so they'll
>> > have some dependence info.  They should already have access to the
>> > actual operation done, though in awkward ways.  I guess the first target
>> > to implement sth better than we have now will get the lead to design
>> > a better API with the vectorizer.
>>
>> For the targeted case found in ggml, if we can optimize it as proposed,
>> some AI applications (such as Llama) relying on ggml would be accelerated
>> by 10%, or even more than 20% with some input data.  LLVM already supports
>> the two-dot-product pattern, though it seems to adopt a case-by-case
>> solution.
>>
>> I know that you and Tamar are working on refactoring the vectorizer to be
>> all-SLP. The aforementioned SLP-related work would depend on, or just be
>> part of, that refactoring. And I feel it would be a long-term effort to get
>> the new vectorizer full-fledged. Do we have a rough schedule for the work?
>>
>> Thus, it might be more practical to build this extended lane-reducing
>> pattern recognition within the old framework, so that in a relatively short
>> time we could get GCC enhanced to support possibly more complicated usages
>> of dot-product in AI applications. When the all-SLP vectorizer gradually
>> takes shape, we could adjust the pattern recognition framework in the way
>> you suggested. Additionally, I still think we would not need to
>> re-implement the idea of the proposal from scratch at that time; most of
>> the completed code could be trivially migrated.
> 
> Note I didn't yet look at the actual patches (but I saw they are big,
> adding new code).
> 
> The plan is to get through the transition to SLP this stage1 (after not
> getting anywhere with "incremental" for the last 5 years).  So I'm wary of
> adding new features that do _not_ work within SLP and thus require
> converting - like I was wary adding not SLP-ready multi-exit vectorization
> last year which now is another obstacle to overcome.
> 
> I didn't look into whether the code you add works with SLP reduction
> vectorization or at what level it applies.  But any kind of "re-association"
> after SLP discovery finished would have to happen on the SLP representation.

What my work focuses on is lane-reducing for loop reduction, and it is meant
to be handled in vect pattern recog (not the post-SLP vect slp-pattern
matcher), which happens as a preparation stage for vectorization and does not
need to know whether its result is fed into the SLP or non-SLP vectorizer, so
theoretically it would not disturb SLP.

You mentioned that vect pattern recog is supposed to be merged into SLP in
the future. I did think about what that would look like, but I feel it may
not completely disappear. Will the basic pattern recogs, like type cast
splitting and widening operation generation, be processed along with SLP
discovery, or after it?

For loop reduction, a lane-reducing op originates from a statement with a 1:1
correspondence, so would doing the transformation before SLP be an
alternative? Otherwise, if it is done after SLP discovery, there is another
complication: other pattern matching before SLP may break up a statement that
could have been simply mapped to a lane-reducing op, such as the widening
operation pattern. For example:

      sum += a * b + c * d + e * f;  // char a, b, c, d, e, f
      =>
      short patt_t1 = widen_mult(a, b);
      short patt_t2 = widen_mult(c, d);
      sum += widen_add (patt_t1, patt_t2) + e * f;
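
What we would have wanted instead, conceptually, is one dot-product per
product, all accumulating into the same reduction (an illustrative sketch,
not actual IR):

      sum = DOT_PROD <a, b, sum>;
      sum = DOT_PROD <c, d, sum>;
      sum = DOT_PROD <e, f, sum>;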

To recognize the three dot-products (a * b / c * d / e * f), we may need a
much more complex vect slp-pattern matcher that is able to manipulate
multiple slp instances as a whole. Then touching the slp-pattern matcher
probably introduces a big dependency on the SLP transition.


> That said, I'm not the only one able to review or approve vectorizer
> patches - I just wanted to make the point again that please avoid adding
> new stuff that's not SLP ready.
