On Sat, Aug 3, 2024 at 2:42 PM Feng Xue OS <f...@os.amperecomputing.com> wrote:
>
> >> 1. Background
> >>
> >> For a loop reduction that accumulates the result of a widening operation,
> >> the preferred pattern is a lane-reducing operation, if supported by the
> >> target. Because this kind of operation need not preserve the intermediate
> >> results of the widening operation, and only produces a reduced number of
> >> final results for accumulation, choosing the pattern can lead to quite
> >> compact codegen.
> >>
> >> Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
> >> operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
> >> (SAD_EXPR). WIDEN_SUM_EXPR can be seen as a degenerate dot-product with a
> >> constant operand of "1". Currently, gcc only supports recognition of the
> >> simple lane-reducing case, in which each accumulation statement of a loop
> >> reduction forms one pattern:
> >>
> >>   char *d0, *d1;
> >>   short *s0, *s1;
> >>
> >>   for (i) {
> >>     sum += d0[i] * d1[i];      // <vector(4) int> = DOT_PROD <vector(16) char>
> >>     sum += abs(s0[i] - s1[i]); // <vector(4) int> = SAD <vector(8) short>
> >>   }
> >>
> >> We could rewrite the example as below using only one statement, whose
> >> non-reduction addend is the sum of the above right-hand sides. As a whole,
> >> the addend would match nothing, while its two sub-expressions could be
> >> recognized as the corresponding lane-reducing patterns.
> >>
> >>   for (i) {
> >>     sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
> >>   }
>
> > Note we try to recognize the original form as an SLP reduction (which of
> > course fails).
>
> >> This case might be too elaborately crafted to be very common in reality.
> >> Though, we do find seemingly variant but essentially similar code patterns
> >> in some AI applications, which use matrix-vector operations extensively;
> >> some usages are just a single loop reduction composed of multiple
> >> dot-products. A code snippet from ggml:
> >>
> >>   for (int j = 0; j < qk/2; ++j) {
> >>     const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
> >>     const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;
> >>
> >>     const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
> >>     const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;
> >>
> >>     sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
> >>   }
> >>
> >> At the source level, it appears to be a natural and minor scaling-up of the
> >> simple single lane-reducing pattern, but it is beyond the capability of
> >> current vectorization pattern recognition and needs some kind of generic
> >> extension to the framework.
>
> Sorry for the late response.
>
> > So this is about re-associating lane-reducing ops to alternative
> > lane-reducing ops to save repeated accumulation steps?
>
> You mean re-associating slp-based lane-reducing ops to loop-based ones?
Yes.

> > The thing is that IMO pattern recognition as we do now is limiting and
> > should eventually move to the SLP side where we should be able to more
> > freely "undo" and associate.
>
> No matter whether pattern recognition is done prior to or within SLP, the
> essential thing is that we need some means to figure out which ops qualify
> for lane-reducing.
>
> For example, when seeing a mult in a loop with a vectorization-favored
> shape,
>
>   ...
>   t = a * b;   // char a, b
>   ...
>
> we could not say it is definitely applicable for reduced computation via
> dot-product, even if the corresponding target ISA is available.

True. Note there's a PR which shows SLP lane-reducing written out like

  a[i] = b[4*i] * 3 + b[4*i+1] * 3 + b[4*i+2] * 3 + b[4*i+3] * 3;

which we cannot change to a DOT_PROD because we do not know which lanes are
reduced. My point was that there are non-reduction cases where knowing
which actual lanes get reduced would help. For reductions it's not
important, and associating in a way that exposes more possible (reduction)
lane reductions is almost always going to be a win.

> Recognition of normal patterns merely involves a local statement-based
> match, while for lane-reducing, the validity check requires global
> loop-wise analysis of the reduction structure, probably not the same as,
> but close to, what is proposed in the RFC. The basic logic, IMHO, is
> independent of where pattern recognition is implemented. As a matter of
> fact, this is not about "associating", but "tagging" (marking all
> statements that qualify for lane-reducing). After that process, a
> "re-associator" could play its role in guiding the selection of either a
> loop-based or an slp-based lane-reducing op.

> > I've searched twice now, a few days ago I read that the optabs not
> > specifying which lanes are combined/reduced is a limitation.  Yes, it
> > is - I hope we can rectify this, so if this is motivation enough we
> > should split the optabs up into even/odd/hi/lo (or whatever else
> > interesting targets actually do).
>
> Actually, how lanes are combined/reduced does not matter too much with
> regard to recognition of lane-reducing patterns.

> > I did read through the rest of the e-mail before, I do in general like
> > the idea to do better. Costing is another place where we need to re-do
> > things
>
> Yes, the current pattern recognition framework is not costing-model-driven,
> and has no way to "undo" a decision previously made, even if it negatively
> impacts pattern matching later. But this is a weakness of the framework,
> not of any specific pattern. To overcome it, we may count on a separate
> task instead of mingling it with this RFC, and it would be better to fold
> that task into your plan of moving pattern recognition to SLP.

> > completely; my current idea is to feed targets the SLP graph so they'll
> > have some dependence info. They should already have access to the actual
> > operation done, though in awkward ways. I guess the first target to
> > implement sth better than we have now will get the lead to design a
> > better API with the vectorizer.
>
> For the targeted case found in ggml, if we can optimize it as proposed,
> some AI applications (such as Llama) relying on ggml would be accelerated
> by 10%, or even more than 20% with some input data. LLVM has already
> supported the two-dot-product pattern, though it seems to adopt a
> case-by-case solution.
>
> I know that you and Tamar are working on refactoring the vectorizer to be
> all-SLP. The aforementioned SLP-related work would depend on, or just be
> part of, that refactoring. And I feel it will be long-term work to get the
> new vectorizer full-fledged. Do we have a rough schedule for that work?
>
> Thus, it might be more practical to build this extended lane-reducing
> pattern recognition within the old framework, so that in a relatively
> short time we could get GCC enhanced to support possibly more complicated
> usages of dot-product for AI applications. When the all-SLP vectorizer
> gradually takes shape, we could adjust the pattern recognition framework
> in the way you suggested. Additionally, I still think we would not need to
> re-implement the idea of the proposal from scratch at that time; most of
> the completed code could be trivially migrated.

Note I didn't yet look at the actual patches (but I saw they are big,
adding new code). The plan is to get through the transition to SLP this
stage1 (after not getting anywhere with "incremental" for the last 5
years). So I'm wary of adding new features that do _not_ work within SLP
and thus require converting - like I was wary of adding the not-SLP-ready
multi-exit vectorization last year, which is now another obstacle to
overcome.

I didn't look into whether the code you add works with SLP reduction
vectorization or at what level it applies. But any kind of
"re-association" after SLP discovery has finished would have to happen on
the SLP representation.

That said, I'm not the only one able to review or approve vectorizer
patches - I just wanted to make the point again: please avoid adding new
stuff that's not SLP ready.

Richard.

> Thanks,
> Feng