Hi, I have composed some patches to generalize recognition of lane-reducing
patterns (dot-product being a typical representative), and prepared an RFC
document to help with review. The original intention was to provide a complete
solution for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440. The work
might still be limited in some respects, so comments are welcome. Thanks.
-----

1. Background

For a loop reduction that accumulates the result of a widening operation, the
preferred pattern is a lane-reducing operation, if supported by the target.
Because this kind of operation need not preserve intermediate results of the
widening operation, and only produces a reduced number of final results for
accumulation, choosing the pattern leads to fairly compact codegen.

Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
(SAD_EXPR). WIDEN_SUM_EXPR could be seen as a degenerate dot-product with a
constant operand of "1". Currently, gcc only supports recognition of the
simple lane-reducing case, in which each accumulation statement of a loop
reduction forms one pattern:

  char *d0, *d1;
  short *s0, *s1;

  for (i) {
    sum += d0[i] * d1[i];       // <vector(4) int> = DOT_PROD <vector(16) char>
    sum += abs(s0[i] - s1[i]);  // <vector(4) int> = SAD <vector(8) short>
  }

We could rewrite the example as below using only one statement, whose
non-reduction addend is the sum of the two right-hand sides above. As a whole,
the addend would match nothing, while its two sub-expressions could each be
recognized as the corresponding lane-reducing pattern.

  for (i) {
    sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
  }

This case might be too elaborately crafted to be common in reality. However,
we do find seemingly different but essentially similar code patterns in some
AI applications, which use matrix-vector operations extensively; some usages
are just a single loop reduction composed of multiple dot-products.
A code snippet from ggml:

  for (int j = 0; j < qk/2; ++j) {
    const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
    const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;

    const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
    const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;

    sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
  }

At source level, this appears to be a natural and minor scaling-up of the
simple single lane-reducing pattern, but it is beyond the capability of the
current vectorization pattern recognition, and needs some kind of generic
extension to the framework.

2. Reasoning on validity of transform

First of all, we should determine what kind of expression is appropriate for
the lane-reducing transform. Given a loop, we use mathematical language to
define an abstract function f(x, i), whose first independent variable "x"
denotes a value that will participate in sum-based loop reduction either
directly or indirectly, and whose second variable "i" specifies the index of a
loop iteration, which implies other intra-iteration factors irrelevant to "x".
The function itself represents the value obtained by applying a series of
operations on "x" in the context of the "i"th loop iteration; this value is
directly accumulated into the loop reduction result. For the purpose of
vectorization, it is implicitly assumed that f(x, i) is a pure function and
free of loop dependences.

Additionally, for a value "x" defined in the loop, let "X" be the vector
<x0, x1, ..., xM>, consisting of the "x" values in all iterations; to be
specific, "X[i]" corresponds to "x" at iteration "i", or "xi". With sequential
execution order, a loop reduction with regard to f(x, i) would be expanded to:

  sum += f(x0, 0);
  sum += f(x1, 1);
  ...
  sum += f(xM, M);

2.1 Lane-reducing vs. Lane-combining

Following lane-reducing semantics, we introduce a new, similar lane-combining
operation that also manipulates a subset of lanes/elements in a vector, by
accumulating all of them into one of them while clearing the rest of the
lanes to zero.
The two operations are equivalent in essence; a major difference is that the
lane-combining operation does not reduce the number of lanes of the vector.
One advantage of this is that codegen of a lane-combining operation could
seamlessly inter-operate with that of a normal (non-lane-reducing) vector
operation. Any lane-combining operation could be synthesized from a sequence
of the most basic two-lane operations, which become the focus of our analysis.
Given two lanes "i" and "j", and letting X' = lane-combine(X, i, j), we have:

  X  = <..., xi     , ..., xj, ...>
  X' = <..., xi + xj, ...,  0, ...>

2.2 Equations for loop reduction invariance

Since the combining strategy of lane-reducing operations is target-specific,
for example accumulating quad lanes into one (#0 + #1 + #2 + #3 => #0), or low
into high (#0 + #4 => #4), we just make the conservative assumption that
combining could happen on arbitrary two lanes in either order. Under this
precondition, it is legitimate to optimize evaluation of a value "x" with a
lane-reducing pattern only if the loop reduction always produces an invariant
result no matter which two lanes of "X" (the vector value set for "x") are
combined. If the constraint is satisfied, one equation can easily be deduced:

  f(xi + xj, i) + f(0, j) = f(0, i) + f(xi + xj, j)     (∀i, ∀j)

Replacing "xi + xj" with a general variable "x", and letting j = 0, we get an
initial solution for f(x, i):

  f(x, i) = g(x) + h(i)     (g(0) = 0)

The two functions on the right are given below. Evidently, they are orthogonal
to each other.
  g(x) = f(x, 0) - f(0, 0)
  h(i) = f(0, i)

Furthermore, we resort to another equation derived from the above constraint:

  f(xi, i) + f(xj, j) = f(xi + xj, i) + f(0, j)     (∀i, ∀j, i != j)

Plugging the initial solution into this equation and simplifying, we have:

  g(xi) + g(xj) = g(xi + xj)     (∀xi, ∀xj)

Only a standard linear function, that is, an affine function crossing the
origin (0, 0), possesses this characteristic, thus the ultimate solution for
f(x, i) should be of the form:

  f(x, i) = cst * x + h(i)

In this formula, "cst" stands for a loop-invariant value, and h(i) could be
recursively decomposed into more child values just like f(x, i). In other
words, evaluation of a value qualifies as a candidate for the lane-reducing
transform only if it belongs to the affine closure that contributes to the
non-reduction addend of a loop reduction statement. Let "opX" denote a
lane-reducing operation; if any is recognized, the normalized representation
of the loop reduction pattern statements must be of the form:

  sum += cst0 * op0 + cst1 * op1 + ... + cstM * opM + h(i)

An invented example might be as complicated as:

  ~DOT_PROD(a1, a2, 0) + invar * (WIDEN_SUM(b, 0) - (SAD(c1, c2, 0) << 3))
  + NORMAL_OP

3. Implementation

3.1 Affine closure for loop reduction

As has been mentioned, the prerequisite for matching a value against
lane-reducing patterns is that its definition statement must lie in the affine
closure of the loop reduction addends. To be specific, the value and all
computation derived from it only end up in loop reductions, and are not used
in any non-linear transform operation. Since correctly determining this
property requires a traversal of the def-use graph leading to the reductions,
it would be quite costly to repeat this loop-wise analysis every time a
statement is to be checked. So we add a preprocessing step that marks all
statements in the affine closure with merely one-time effort, invoked at
initialization of the pattern recognition module.
Inside the affine closure, the underlying match framework uniformly handles
both lane-reducing and normal patterns; also, the resulting lane-combining
operation need not be treated distinctly, and couples smoothly with normal
operations. It does not matter whether a given value has a single use or not,
so even sharing a lane-combining operation among multiple loop reductions
could be supported.

  for (i) {
    int a = d0[i] * d1[i];   // a = DOT_PROD(d0, d1, 0)

    sum0 += a + 18;
    sum1 -= a << 2;
  }

Because what a lane-combining operation generates does not represent the real
result of the original statement, we should not initiate such a pattern match
from a statement outside the affine closure.

  for (i) {
    int a = d0[i] * d1[i];   // Not in affine closure

    data[i] = a;             // Prevents "a" from being in affine closure
    sum += a;
  }

But if profitable, such a statement may be forcibly pulled into the closure by
duplication. Assume the vectorization factor is 16; for the above case, the
vector value accumulated into "sum" is obtained by combining the
"<vector(16) int>" value of "a" into "<vector(4) int>" via 4 add instructions,
while it only takes one instruction if the dot-product pattern is enabled
after duplication:

  for (i) {
    int a0 = d0[i] * d1[i];  // Not in affine closure
    int a1 = d0[i] * d1[i];  // Duplicate into affine closure

    data[i] = a0;
    sum += a1;               // sum = DOT_PROD(d0, d1, sum)
  }

As we know, a successful recognition of a pattern replaces the matched
statement with new pattern statements, which might cause the affine closure to
change. Some statements originally in the closure have to be kicked out if the
linearity of a relay statement linking into the closure is broken.
  for (i) {
    char a = ...;
    char b = ...;

    int i_a = (int)a;        // In affine closure
    int i_b = (int)b;        // In affine closure
    int c = i_a + i_b;       // In affine closure

    sum += c;
  }

After the over_widening pattern:

  for (i) {
    char a = ...;
    char b = ...;

    int i_a = (int)a;                // Becomes unused
    int i_b = (int)b;                // Becomes unused

    short patt_a = (short)a;         // Not in affine closure
    short patt_b = (short)b;         // Not in affine closure
    short patt_c = patt_a + patt_b;  // Not in affine closure
    int c = (int)patt_c;

    sum += c;
  }

On the contrary, if a statement in the closure is substituted by a new
linear-transform-like operation, its operands could also be pulled into the
closure. This expansion happens recursively, but it is not necessary to add
statements that have already finished pattern recognition, whether matched or
not.

  for (i) {
    int a = ...;       // In affine closure
    int b = a * 41;    // In affine closure

    sum += b;
  }

Suppose the mult pattern is triggered for this case; the mult-by-constant
statement is rewritten to several adds and shifts:

  for (i) {
    int a = ...;
    int patt_a0 = a << 5;              // New in affine closure
    int patt_a1 = a << 3;              // New in affine closure
    int patt_a2 = patt_a0 + patt_a1;   // New in affine closure
    int b = patt_a2 + a;

    sum += b;
  }

Therefore, we resort to a postprocessing step after a pattern is recognized,
in which the semantics of the new pattern statements are parsed and,
accordingly, the affine closure of loop reductions is incrementally updated.
However, there is one complication: the adjustment might cause lane-combining
operations already generated to retreat from the affine closure, which must be
avoided. Since a lane-reducing pattern could not be reverted once it has been
formed, the alternative is to invalidate the other pattern in progress.
Actually, as far as existing patterns are concerned, this kind of conflict may
not occur, but we can not guarantee that a pattern introduced in the future
would not cause it.
For an imaginary example:

  for (i) {
    int patt_a = DOT_PROD(d0, d1, 0);  // a = d0[i] * d1[i], Range[30 - 132]
    short b = ...;
    int c = patt_a + (int)b;

    sum += c;
  }

Suppose a new pattern utilizes value range information to de-promote integer
values so that they fit into widening operations as much as possible:

  for (i) {
    int patt_a = DOT_PROD(d0, d1, 0);
    short patt_s_a = (short)patt_a;    // Conflict with DOT_PROD
    short b = ...;
    int c = patt_s_a w+ b;

    sum += c;
  }

Based on the above rationale, three kinds of status
(rpatt_none/rpatt_allow/rpatt_formed) are defined. A statement with
rpatt_allow is covered by the loop reduction affine closure, and is thus a
candidate to which lane-reducing patterns are applicable. Otherwise, it is
quickly ignored by the patterns. When we are about to remove a statement from
the affine closure, this status should be cleared. Once a lane-combining
statement is generated upon a pattern match, it is set to rpatt_formed;
besides, the status is also propagated to all dependent statements along the
paths that lead to the loop reduction. Any adjustment to the affine closure
imposed by a pending pattern recognition must ensure that statements marked
with rpatt_formed stay in the closure; if that can not be ensured, the change
due to the pattern is rolled back.

In the implementation detailed so far, rpatt_formed implies rpatt_allow.
Though, we consider making the two statuses independently represented, for the
sake of enabling degraded lane-reducing operations in later work. If
lane-reducing to a given result type is not supported by the target, we could
settle for second best by choosing a degraded lane-reducing operation with a
smaller intermediate result type. The operation serves the reduction and is
tagged rpatt_formed, but does not belong to the affine closure, so it carries
no rpatt_allow.
For example:

  long sum = ...;

  for (i) {
    sum += d0[i] * d1[i];
  }

Suppose the target only has a dot-product instruction that accumulates 8-bit
muls into a 32-bit integer; we could still get the reduction optimized as:

  for (i) {
    int patt_a = DOT_PROD(d0, d1, 0);  // <vector(4) int> = <vector(16) char>
    sum += (long)patt_a;
  }

Additionally, if widen-sum from 32-bit to 64-bit is available, we would end up
with cascading lane-reducing operations:

  for (i) {
    int patt_a = DOT_PROD(d0, d1, 0);      // <vector(4) int> = <vector(16) char>
    long patt_l_a = WIDEN_SUM(patt_a, 0);  // <vector(2) long> = <vector(4) int>
    sum += patt_l_a;
  }

3.2 Pattern priority

The current pattern priority strategy for a statement is that the first match
wins; the order of a pattern determines its priority. Basically, lane-reducing
patterns are not that sort of local statement-based pattern, in that they
require knowledge of loop structure. Naturally, it is anticipated that these
patterns would benefit loop vectorization much more than peephole-like
patterns. So we give lane-reducing patterns overriding priority, and move them
to the top of the pattern table, which reduces possible interference from the
others.

3.3 Pattern decoupling

A lane-reducing operation contains two aspects: the main primitive operation
and the appendant result accumulation. The original design handles matching of
the compound semantics in a single pattern, but this means is not suitable for
an operation that does not directly participate in the loop reduction. To
resolve the problem, lane-reducing patterns are tailored to only cover the
basic aspect, and a new pattern is dedicated to folding a summation into a
lane-reducing operation when appropriate.
  for (i) {
    int a = DOT_PROD(d0, d1, 0);   // a = d0[i] * d1[i]
    int b = SAD(s0, s1, 0);        // b = abs(s0[i] - s1[i])
    int c = WIDEN_SUM(w, 0);       // c = w[i]
    int f = (b + c) << 5;

    sum += a + f;
  }

After the lane-reducing accum pattern:

  for (i) {
    int a = DOT_PROD(d0, d1, 0);   // Becomes unused
    int b = SAD(s0, s1, 0);        // Becomes unused
    int c = WIDEN_SUM(w, 0);
    int f = SAD(s0, s1, c) << 5;   // Fold "c"

    sum += DOT_PROD(d0, d1, f);    // Fold "f"
  }

3.4 Dot-product pattern

For a dot-product operation that would be generated from a mult-by-invariant
statement like "t = invar * x", the conversion process is not as
straightforward as before. Since the statement itself is a linear transform,
the definition statement of its variant operand "x" must also be in the affine
closure, and must have been perfectly matched by the widen-sum pattern before
we get to the mult statement; if so, we could try to combine the widen-sum
operation and the invariant multiplier into a dot-product. Though, combining
is not always beneficial: if the invariant is wider than the operand of the
widen-sum, it results in reduced vectorization parallelism, so it is not
allowed.

  char invar0 = ...;
  short invar1 = ...;

  for (i) {
    char a0 = ...;
    char a1 = ...;

    int b0 = WIDEN_SUM(a0, 0);   // b0 = (int)a0;
    int b1 = WIDEN_SUM(a1, 0);   // b1 = (int)a1;

    int c0 = invar0 * b0;
    int c1 = invar1 * b1;
    ...
  }

After combining widen-sum and invariant:

  for (i) {
    char a0 = ...;
    char a1 = ...;

    int b0 = WIDEN_SUM(a0, 0);         // Becomes unused
    int b1 = WIDEN_SUM(a1, 0);

    int c0 = DOT_PROD(a0, invar0, 0);  // Combine
    int c1 = invar1 * b1;              // Do not combine
    ...
  }

The most common case is a mult statement with two variant operands. Firstly,
we check whether either operand is the result of another mult-by-invariant
statement; if so, we rely on reassociation to temporarily group invariants and
variants together respectively. This is repeated until neither operand is
resoluble.

  for (i) {
    char a0 = ...;
    char a1 = ...;

    int b0 = invar0 * a0;
    int b1 = invar1 * a1;

    int c = b0 * b1;
    ...
  }

When the pair of variant operands are both promoted from values of narrow
types, we try to make those original values fit into a dot-product operation.
And if the set of invariants is not empty, and its product can not be folded
to a constant or an existing value, as a minor optimization we generate the
evaluation statements outside of the loop.

  int patt_invar = invar0 * invar1;

  for (i) {
    char a0 = ...;
    char a1 = ...;

    int b0 = invar0 * a0;   // Becomes unused
    int b1 = invar1 * a1;   // Becomes unused

    int c = patt_invar * DOT_PROD(a0, a1, 0);
    ...
  }

In fact, what this enhancement targets is just a simplified instance of a more
generalized code pattern for a mult statement, in which each operand is an
affine combination of values with narrower types. Let "xcsti", "ycstj" stand
for affine invariant coefficients, and "xi", "yj" denote affine variable
elements. The mathematical representation is:

  result = (xcst0 * x0 + xcst1 * x1 + ... + xcstM * xM)
         * (ycst0 * y0 + ycst1 * y1 + ... + ycstN * yN)    // x0 = y0 = 1

Based on the distributive law, the mult statement could be expanded to a
summation of (M + 1) * (N + 1) terms of the form "(xcsti * ycstj) * (xi * yj)".
Suppose "M" and "N" are relatively small numbers; if all "xi * yj" could be
transformed to dot-products, codegen after distributive expansion may even
produce fewer instructions. In a simple example on aarch64, the original loop
body contains 8 instructions:

  char *a, *b, *c, *d;

  for (i) {
    sum += (a[i] + b[i]) * (c[i] + d[i]);

    // uaddl  v29.8h, v1.8b,  v30.8b
    // uaddl2 v30.8h, v1.16b, v30.16b
    // uaddl  v31.8h, v0.8b,  v27.8b
    // uaddl2 v27.8h, v0.16b, v27.16b
    // umull2 v28.4s, v29.8h, v31.8h
    // umlal  v28.4s, v29.4h, v31.4h
    // umlal  v28.4s, v30.4h, v27.4h
    // umlal2 v28.4s, v30.8h, v27.8h
  }

After distributive expansion, only 4 instructions:
  for (i) {
    int patt_t0 = DOT_PROD(a, c, sum);
    int patt_t1 = DOT_PROD(a, d, patt_t0);
    int patt_t2 = DOT_PROD(b, c, patt_t1);
    int patt_t3 = DOT_PROD(b, d, patt_t2);
    sum = patt_t3;

    // udot v27.4s, v28.16b, v29.16b
    // udot v27.4s, v28.16b, v30.16b
    // udot v27.4s, v31.16b, v29.16b
    // udot v27.4s, v31.16b, v30.16b
  }

However, on some architectures, the dot-product operation is not natively
supported, and is emulated by a sequence of other instructions. For instance,
some dot-product operations for the int8 type on x86-64 are emulated using
"vpdpwssd". Meanwhile, gcc lacks a mechanism to precisely model a gimple
statement's execution cost, measured as instruction cycles or count;
consequently, we could not easily judge whether a mult statement really
deserves distributive expansion in a target-independent way. Therefore, though
the scheme is specifically discussed here, we did not implement it in the
completed patches, and would consider it as a TODO.