>> 1. Background
>>
>> For a loop reduction that accumulates the result of a widening operation,
>> the preferred pattern is a lane-reducing operation, if supported by the
>> target. Because this kind of operation need not preserve the intermediate
>> results of the widening operation, and only produces a reduced number of
>> final results for accumulation, choosing the pattern could lead to pretty
>> compact codegen.
>>
>> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
>> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
>> (SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerate dot-product with
>> a constant operand of "1". Currently, gcc only supports recognition of the
>> simple lane-reducing case, in which each accumulation statement of the loop
>> reduction forms one pattern:
>>
>>  char  *d0, *d1;
>>  short *s0, *s1;
>>
>>  for (i) {
>>    sum += d0[i] * d1[i];      // <vector(4) int> = DOT_PROD <vector(16) char>
>>    sum += abs(s0[i] - s1[i]); // <vector(4) int> = SAD <vector(8) short>
>>  }
>>
>> We could rewrite the example as below using only one statement, whose
>> non-reduction addend is the sum of the above right-hand sides. As a whole,
>> the addend would match nothing, while its two sub-expressions could be
>> recognized as the corresponding lane-reducing patterns.
>>
>>  for (i) {
>>    sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
>>  }
> 
> Note we try to recognize the original form as SLP reduction (which of
> course fails).
> 
>> This case might be too elaborately crafted to be very common in reality.
>> However, we do find seemingly different but essentially similar code
>> patterns in some AI applications, which use matrix-vector operations
>> extensively; some usages are just a single loop reduction composed of
>> multiple dot-products. A code snippet from ggml:
>>
>>  for (int j = 0; j < qk/2; ++j) {
>>    const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
>>    const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;
>>
>>    const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
>>    const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;
>>
>>    sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
>>  }
>>
>> At the source level, it appears to be a natural and minor scaling-up of the
>> simple single lane-reducing pattern, but it is beyond the capability of the
>> current vectorization pattern recognition, and needs some kind of generic
>> extension to the framework.

Sorry for the late response.

> So this is about re-associating lane-reducing ops to alternative lane-reducing
> ops to save repeated accumulation steps?

Do you mean re-associating SLP-based lane-reducing ops to loop-based ones?

> The thing is that IMO pattern recognition as we do now is limiting and should
> eventually move to the SLP side where we should be able to more freely
> "undo" and associate.

Regardless of whether pattern recognition is done prior to or within SLP, the
essential thing is that we need to figure out, by some means, which ops qualify
for lane-reducing.

For example, when seeing a mult in a loop with a vectorization-favored shape,
    ...
    t = a * b;     // char a, b
    ...

we could not say it is decidedly applicable for reduced computation via
dot-product, even if the corresponding target ISA is available.

Recognition of normal patterns merely involves a local statement-based match,
while for lane-reducing, the validity check requires a global loop-wise
analysis of the structure of the reduction, probably not the same as, but
close to, what is proposed in the RFC. The basic logic, IMHO, is independent
of where pattern recognition is implemented. As a matter of fact, this is not
about "associating", but "tagging" (marking all statements that qualify for
lane-reducing). After that process, the "re-associator" could play its role
to guide the selection of either a loop-based or an SLP-based lane-reducing op.
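The tagging-then-selection flow I have in mind could be sketched roughly as
the following pseudocode (purely illustrative, not actual GCC internals; all
names are invented):

 // Pass 1: loop-wise analysis tags candidates on the reduction chain.
 for each statement S feeding the reduction of SUM:
   if S is a widening mult (or abs-diff) whose result is used only
      by the reduction, and the target has a matching optab:
     tag S as a lane-reducing candidate (DOT_PROD / SAD / WIDEN_SUM);

 // Pass 2: the re-associator consults the tags and chooses, for each
 // tagged statement, a loop-based or an SLP-based lane-reducing form.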

> I've searched twice now, a few days ago I read that the optabs not specifying
> which lanes are combined/reduced is a limitation.  Yes, it is - I hope we can
> rectify this, so if this is motivation enough we should split the optabs up
> into even/odd/hi/lo (or whatever else interesting targets actually do).

Actually, how lanes are combined/reduced does not matter much for the
recognition of lane-reducing patterns.

> I did read through the rest of the e-mail before, I do in general like the 
> idea
> to do better.  Costing is another place where we need to re-do things

Yes, the current pattern recognition framework is not costing-model-driven,
and has no way to "undo" a decision previously made, even if it negatively
impacts pattern matching later. But this is a weakness of the framework, not
of any specific pattern. To overcome it, we may count on a separate task
instead of mingling it with this RFC, and it would be better to have that
task included in your plan of moving pattern recognition to SLP.

> completely;  my current idea is to feed targets the SLP graph so they'll
> have some dependence info.  They should already have access to the
> actual operation done, though in awkward ways.  I guess the first target
> to implement sth better than we have now will get the lead to design
> a better API with the vectorizer.

For the targeted case found in ggml, if we can optimize it as in the proposal,
some AI applications (such as llama.cpp) relying on ggml would be accelerated
by 10%, or even more than 20% with some input data. LLVM already supports the
two-dot-product pattern, though it seems to adopt a case-by-case solution.

I know that you and Tamar are working on refactoring the vectorizer to be
all-SLP. The aforementioned SLP-related work would depend on, or just be part
of, that refactoring. And I feel it would be a long-term effort to get the
new vectorizer full-fledged. Do we have a rough schedule for that work?

Thus, it might be more practical to build this extended lane-reducing pattern
recognition within the old framework, so that in a relatively short time we
could get GCC enhanced to support possibly more complicated usages of
dot-product for AI applications. When the all-SLP vectorizer gradually takes
shape, we could adjust the pattern recognition framework in the way you
suggested. Additionally, I still think we would not need to re-implement the
idea of the proposal from scratch at that time; most of the completed code
could be trivially migrated.

Thanks,
Feng
