On Sun, Jul 21, 2024 at 11:12 AM Feng Xue OS
<f...@os.amperecomputing.com> wrote:
>
> Hi,
>
>   I composed some patches to generalize lane-reducing pattern recognition
> (dot-product is a typical representative), and prepared an RFC document to
> help with review. The original intention was to provide a complete solution
> for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440. For sure, the work
> might be limited, so your comments are welcome. Thanks.
>
> -----
>
> 1. Background
>
> For a loop reduction that accumulates the result of a widening operation,
> the preferred pattern is a lane-reducing operation, if supported by the
> target. Because this kind of operation need not preserve the intermediate
> results of the widening operation, and only produces a reduced number of
> final results for accumulation, choosing the pattern can lead to fairly
> compact codegen.
>
> Three lane-reducing opcodes are defined in gcc: dot-product
> (DOT_PROD_EXPR), sum-of-absolute-difference (SAD_EXPR), and widening sum
> (WIDEN_SUM_EXPR), which can be seen as a degenerate dot-product with a
> constant operand of "1". Currently, gcc only supports recognition of the
> simple lane-reducing case, in which each accumulation statement of the
> loop reduction forms one pattern:
>
>  char  *d0, *d1;
>  short *s0, *s1;
>
>  for (i) {
>    sum += d0[i] * d1[i];      // <vector(4) int> = DOT_PROD <vector(16) char>
>    sum += abs(s0[i] - s1[i]); // <vector(4) int> = SAD <vector(8) short>
>  }
>
> We could rewrite the example as below using only one statement, whose non-
> reduction addend is the sum of the above right-hand sides. As a whole, the
> addend would match nothing, while its two sub-expressions could be
> recognized as the corresponding lane-reducing patterns.
>
>  for (i) {
>    sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
>  }

Note we try to recognize the original form as SLP reduction (which of
course fails).

> This case might be too elaborately crafted to be very common in reality.
> However, we do find seemingly different but essentially similar code
> patterns in some AI applications, which use matrix-vector operations
> extensively; some usages are just a single loop reduction composed of
> multiple dot-products. A code snippet from ggml:
>
>  for (int j = 0; j < qk/2; ++j) {
>    const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
>    const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;
>
>    const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
>    const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;
>
>    sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
>  }
>
> At the source level, it appears to be a natural and minor scaling-up of
> the simple one-statement lane-reducing pattern, but it is beyond the
> capability of current vectorization pattern recognition, and needs some
> kind of generic extension to the framework.

So this is about re-associating lane-reducing ops to alternative lane-reducing
ops to save repeated accumulation steps?

The thing is that IMO pattern recognition as we do now is limiting and should
eventually move to the SLP side where we should be able to more freely
"undo" and associate.

I've searched twice now, a few days ago I read that the optabs not specifying
which lanes are combined/reduced is a limitation.  Yes, it is - I hope we can
rectify this, so if this is motivation enough we should split the optabs up
into even/odd/hi/lo (or whatever else interesting targets actually do).

I did read through the rest of the e-mail before, I do in general like the idea
to do better.  Costing is another place where we need to re-do things
completely;  my current idea is to feed targets the SLP graph so they'll
have some dependence info.  They should already have access to the
actual operation done, though in awkward ways.  I guess the first target
to implement sth better than we have now will get the lead to design
a better API with the vectorizer.

Richard.

> 2. Reasoning on validity of transform
>
> First of all, we should tell what kind of expression is appropriate for
> the lane-reducing transform. Given a loop, we use the language of
> mathematics to define an abstract function f(x, i), whose first
> independent variable "x" denotes a value that participates in the
> sum-based loop reduction either directly or indirectly, and whose second
> variable "i" specifies the index of a loop iteration, which implies other
> intra-iteration factors irrelevant to "x". The function itself represents
> the value obtained by applying a series of operations on "x" in the
> context of the "i"th loop iteration, and this value is directly
> accumulated into the loop reduction result. For the purpose of
> vectorization, it is implicitly assumed that f(x, i) is a pure function
> and free of loop dependences.
>
> Additionally, for a value "x" defined in the loop, let "X" be the vector
> <x0, x1, ..., xM> consisting of the "x" values in all iterations; to be
> specific, "X[i]" corresponds to "x" at iteration "i", or "xi". With
> sequential execution order, a loop reduction with respect to f(x, i)
> would expand to:
>
>  sum += f(x0, 0);
>  sum += f(x1, 1);
>  ...
>  sum += f(xM, M);
>
> 2.1 Lane-reducing vs. Lane-combining
>
> Following lane-reducing semantics, we introduce a similar lane-combining
> operation that also manipulates a subset of lanes/elements of a vector,
> by accumulating all of them into one of them while clearing the rest to
> zero. The two operations are equivalent in essence; the major difference
> is that a lane-combining operation does not reduce the number of lanes of
> the vector. One advantage of this is that codegen of a lane-combining
> operation can seamlessly inter-operate with that of a normal
> (non-lane-reducing) vector operation.
>
> Any lane-combining operation can be synthesized by a sequence of the most
> basic two-lane operations, which are the focus of our analysis. Given two
> lanes "i" and "j", and letting X' = lane-combine(X, i, j), we have:
>
>   X  = <..., xi     , ...,  xj, ...>
>   X' = <..., xi + xj, ...,   0, ...>
>
> 2.2 Equations for loop reduction invariance
>
> Since the combining strategy of lane-reducing operations is
> target-specific, for example, accumulating quad lanes into one
> (#0 + #1 + #2 + #3 => #0), or low into high (#0 + #4 => #4), we just make
> the conservative assumption that combining could happen on arbitrary two
> lanes in either order. Under this precondition, it is legitimate to
> optimize the evaluation of a value "x" with a lane-reducing pattern only
> if the loop reduction always produces an invariant result no matter which
> two lanes of "X" (the vector value set for "x") are combined. If the
> constraint is satisfied, one equation can easily be deduced:
>
>  f(xi + xj, i) + f(0, j) = f(0, i) + f(xi + xj, j)  (∀i, ∀j)
>
> Replacing "xi + xj" with a general variable "x" and letting j = 0, we get
> an initial solution for f(x, i):
>
>  f(x, i) = g(x) + h(i)  (g(0) = 0)
>
> The two functions on the right are defined as below. Evidently, they are
> orthogonal to each other.
>
>  g(x) = f(x, 0) - f(0, 0)
>  h(i) = f(0, i)
>
> Furthermore, we resort to another equation derived from the above constraint:
>
>  f(xi, i) + f(xj, j) = f(xi + xj, i) + f(0, j)  (∀i, ∀j, i!= j)
>
> Plugging the initial solution into the equation and simplifying, we have:
>
>  g(xi) + g(xj) = g(xi + xj)  (∀xi, ∀xj)
>
> Only a linear function through the origin (0, 0) possesses this
> characteristic, thus the ultimate solution for f(x, i) should be of the
> form:
>
>  f(x, i) = cst * x + h(i)
>
> In this formula, "cst" stands for a loop-invariant value, and h(i) can be
> recursively decomposed into more child values, just like f(x, i). In
> other words, the evaluation of a value qualifies as a candidate for the
> lane-reducing transform only if it belongs to the affine closure that
> contributes to the non-reduction addend of the loop reduction statement.
> Letting "opX" denote a recognized lane-reducing operation, the normalized
> representation of loop reduction pattern statements must be of the form:
>
>  sum += cst0 * op0 + cst1 * op1 + ... + cstM * opM + h(i)
>
> An invented example might be as complicated as:
>
>  ~DOT_PROD(a1, a2, 0) + invar * (WIDEN_SUM(b, 0) - (SAD(c1, c2, 0) << 3))
>  + NORMAL_OP
>
> 3. Implementation
>
> 3.1 Affine closure for loop reduction
>
> As mentioned, the prerequisite for matching a value against lane-reducing
> patterns is that its definition statement must be located in the affine
> closure of the loop reduction addends. To be specific, the value and all
> its derived computations only end up in loop reductions, and are not used
> in any non-linear transform operation. Since correctly determining this
> property requires traversing the def-use graph leading to the reductions,
> it would be quite costly to repeat this loop-wise analysis every time a
> statement is checked. So we add a preprocessing step that marks all
> statements in the affine closure once, invoked at initialization of the
> pattern recognition module.
>
> Inside the affine closure, the underlying match framework handles both
> lane-reducing and normal patterns uniformly; also, the resulting
> lane-combining operation need not be treated specially, since it couples
> smoothly with normal operations. It does not matter whether a given value
> has a single use or not, so even sharing a lane-combining operation among
> multiple loop reductions can be supported.
>
>  for (i) {
>    int a = d0[i] * d1[i];   // a = DOT_PROD(d0, d1, 0)
>
>    sum0 += a + 18;
>    sum1 -= a << 2;
>  }
>
> Because what a lane-combining operation generates does not represent the
> real result of the original statement, we should not initiate such a
> pattern match from a statement outside the affine closure.
>
>  for (i) {
>    int a = d0[i] * d1[i];   // Not in affine closure
>
>    data[i] = a;             // Prevent "a" being in affine closure
>    sum += a;
>  }
>
> But if profitable, such a statement may be forcibly pulled into the
> closure by duplication. Assume the vectorization factor is 16; for the
> above case, the vector value accumulated into "sum" is obtained by
> combining the "<vector(16) int>" value of "a" into "<vector(4) int>" via
> 4 add instructions, while it takes only one instruction if the
> dot-product pattern is enabled after duplication:
>
>  for (i) {
>    int a0 = d0[i] * d1[i];  // Not in affine closure
>    int a1 = d0[i] * d1[i];  // Duplicate into affine closure
>
>    data[i] = a0;
>    sum += a1;               // sum = DOT_PROD(d0, d1, sum)
>  }
>
> As we know, a successful recognition of a pattern replaces the matched
> statement with new pattern statements, which might change the affine
> closure. Some statements originally in the closure have to be kicked out
> if the linearity of a relay statement linking into the closure is broken.
>
>  for (i) {
>    char a = ...;
>    char b = ...;
>    int i_a = (int)a;                  // In affine closure
>    int i_b = (int)b;                  // In affine closure
>    int c = i_a + i_b;                 // In affine closure
>
>    sum += c;
>  }
>
> After the over_widening pattern:
>
>  for (i) {
>    char a = ...;
>    char b = ...;
>    int i_a = (int)a;                  // Become unused
>    int i_b = (int)b;                  // Become unused
>    short patt_a = (short)a;           // Not in affine closure
>    short patt_b = (short)b;           // Not in affine closure
>    short patt_c = patt_a + patt_b;    // Not in affine closure
>    int c = (int)patt_c;
>
>    sum += c;
>  }
>
> Conversely, if a statement in the closure is substituted with a new
> linear-transform-like operation, its operands can also be pulled into the
> closure. This expansion happens recursively, but it is not necessary to
> add statements that have already finished pattern recognition, whether
> matched or not.
>
>  for (i) {
>    int a = ...;                       // In affine closure
>    int b = a * 41;                    // In affine closure
>
>    sum += b;
>  }
>
> Suppose the mult pattern is triggered for this case; the
> multiply-by-constant statement is rewritten into several adds and shifts:
>
>  for (i) {
>    int a = ...;
>    int patt_a0 = a << 5;              // New in affine closure
>    int patt_a1 = a << 3;              // New in affine closure
>    int patt_a2 = patt_a0 + patt_a1;   // New in affine closure
>    int b = patt_a2 + a;
>
>    sum += b;
>  }
>
> Therefore, we resort to a postprocessing step after a pattern is
> recognized, in which the semantics of the new pattern statements are
> parsed and, accordingly, the affine closure of the loop reductions is
> incrementally updated. However, there is one complication: the adjustment
> might cause already-generated lane-combining operations to retreat from
> the affine closure, which must be avoided. Since a lane-reducing pattern
> cannot be reverted once it has been formed, the alternative is to
> invalidate the other pattern in progress. Actually, as far as existing
> patterns are concerned, this kind of conflict may not occur, but we
> cannot guarantee that a new pattern introduced in the future would not
> cause it. For an imaginary example:
>
>  for (i) {
>    int patt_a = DOT_PROD(d0, d1, 0);  // a = d0[i] * d1[i], Range[30 - 132]
>    short b = ...;
>    int c = patt_a + (int)b;
>
>    sum += c;
>  }
>
> Suppose a new pattern that utilizes value range information to demote
> integer values to fit into widening operations as much as possible:
>
>  for (i) {
>    int patt_a = DOT_PROD(d0, d1, 0);
>    short patt_s_a = (short)patt_a;    // Conflict with DOT_PROD
>    short b = ...;
>    int c = patt_s_a w+ b;
>
>    sum += c;
>  }
>
> Based on the above rationale, three kinds of status (rpatt_none/
> rpatt_allow/rpatt_formed) are defined. A statement with rpatt_allow is
> covered by the loop reduction affine closure, and is thus a candidate to
> which lane-reducing patterns are applicable; otherwise, it is quickly
> ignored by those patterns. When we are about to remove a statement from
> the affine closure, this status should be cleared. Once a lane-combining
> statement is generated by a pattern match, it is set to rpatt_formed;
> besides, the status is also propagated to all dependent statements along
> the paths that lead to the loop reduction. Any adjustment of the affine
> closure imposed by pending pattern recognition must ensure that
> statements marked with rpatt_formed stay in the closure; if it cannot,
> the change due to the pattern is rolled back.
>
> In the implementation detailed so far, rpatt_formed implies rpatt_allow.
> Still, we consider representing the two statuses independently, for the
> sake of enabling degraded lane-reducing operations in later work. If
> lane-reducing to a given result type is not supported by the target, we
> could settle for second best by choosing a degraded lane-reducing
> operation with a smaller intermediate result type. The operation serves
> the reduction with rpatt_formed tagging, but does not belong to the
> affine closure, so it carries no rpatt_allow. For example:
>
>  long sum = ...;
>
>  for (i) {
>    sum += d0[i] * d1[i];
>  }
>
> Suppose the target only has a dot-product instruction that accumulates
> 8-bit multiplications into 32-bit integers; we could still get the
> reduction optimized as:
>
>  for (i) {
>    int patt_a = DOT_PROD(d0, d1, 0);     // <vector(4) int> = <vector(16) char>
>
>    sum += (long)patt_a;
>  }
>
> Additionally, if widen-sum from 32-bit to 64-bit is available, we would end up
> with cascading lane-reducing operations:
>
>  for (i) {
>    int patt_a = DOT_PROD(d0, d1, 0);     // <vector(4) int> = <vector(16) char>
>    long patt_l_a = WIDEN_SUM(patt_a, 0); // <vector(2) long> = <vector(4) int>
>
>    sum += patt_l_a;
>  }
>
> 3.2 Pattern priority
>
> The current pattern priority strategy for a statement is that the first
> match wins; the order of a pattern in the table determines its priority.
> Basically, lane-reducing patterns are not that sort of local
> statement-based patterns, in that they require knowledge of the loop
> structure. Naturally, it is anticipated that these patterns would benefit
> loop vectorization much more than peephole-like patterns. So we give
> lane-reducing patterns overriding priority, and move them to the top of
> the pattern table, which reduces possible interference from the others.
>
> 3.3 Pattern decoupling
>
> A lane-reducing operation consists of two aspects: the main primitive
> operation and an appended result-accumulation. The original design
> handles matching of the compound semantics in a single pattern, but this
> is not suitable for an operation that does not directly participate in
> the loop reduction. To resolve the problem, lane-reducing patterns are
> tailored to cover only the basic aspect, and a new pattern is dedicated
> to folding a summation into a lane-reducing operation when appropriate.
>
>  for (i) {
>    int a = DOT_PROD(d0, d1, 0);       // a = d0[i] * d1[i]
>    int b = SAD(s0, s1, 0);            // b = abs(s0[i] - s1[i])
>    int c = WIDEN_SUM(w, 0);           // c = w[i]
>    int f = (b + c) << 5;
>
>    sum += a + f;
>  }
>
> After lane-reducing accum pattern:
>
>  for (i) {
>    int a = DOT_PROD(d0, d1, 0);       // Become unused
>    int b = SAD(s0, s1, 0);            // Become unused
>    int c = WIDEN_SUM(w, 0);
>    int f = SAD(s0, s1, c) << 5;       // Fold "c"
>
>    sum += DOT_PROD(d0, d1, f);        // Fold "f"
>  }
>
> 3.4 Dot-product pattern
>
> For a dot-product operation that would be generated from a
> multiply-by-invariant statement like "t = invar * x", the conversion
> process is not as straightforward as before. Since the statement itself
> is a linear transform, the definition statement of its variant operand
> "x" must also be in the affine closure, and must have been matched by the
> widen-sum pattern before we get to the mult statement; then we could try
> to combine the widen-sum operation and the invariant multiplier into a
> dot-product. Though, combining is not always beneficial: if the invariant
> is wider than the operand of the widen-sum, the combination reduces
> vectorization parallelism, so it is not allowed.
>
>  char  invar0 = ...;
>  short invar1 = ...;
>
>  for (i) {
>    char a0 = ...;
>    char a1 = ...;
>    int b0 = WIDEN_SUM(a0, 0);         // b0 = (int)a0;
>    int b1 = WIDEN_SUM(a1, 0);         // b1 = (int)a1;
>    int c0 = invar0 * b0;
>    int c1 = invar1 * b1;
>    ....
>  }
>
> After combining widen-sum and invariant:
>
>  for (i) {
>    char a0 = ...;
>    char a1 = ...;
>    int b0 = WIDEN_SUM(a0, 0);         // Become unused
>    int b1 = WIDEN_SUM(a1, 0);
>    int c0 = DOT_PROD(a0, invar0, 0);  // Combine
>    int c1 = invar1 * b1;              // Do not combine
>    ...
>  }
>
> The most common case is a mult statement with two variant operands.
> Firstly, we check whether either operand is the result of another
> multiply-by-invariant statement; if so, we rely on reassociation to
> temporarily group invariants and variants together respectively. This is
> repeated until neither operand is further decomposable.
>
>  for (i) {
>    char a0 = ...;
>    char a1 = ...;
>    int b0 = invar0 * a0;
>    int b1 = invar1 * a1;
>    int c = b0 * b1;
>    ...
>  }
>
> When the pair of variant operands are both promoted from values of
> narrower types, we try to make those original values fit into a
> dot-product operation. And if the set of invariants is not empty, and
> their product cannot be folded into a constant or an existing value, as a
> minor optimization, we generate the evaluation statements outside the
> loop.
>
>  int patt_invar = invar0 * invar1;
>
>  for (i) {
>    char a0 = ...;
>    char a1 = ...;
>    int b0 = invar0 * a0;              // Become unused
>    int b1 = invar1 * a1;              // Become unused
>    int c = patt_invar * DOT_PROD(a0, a1, 0);
>    ...
>  }
>
> In fact, what the enhancement aims at is just a simplified instance of a
> more generalized code pattern for a mult statement, in which each operand
> is an affine combination of values with narrower types. Let "xcsti" and
> "ycstj" stand for the affine invariant coefficients, and "xi", "yj"
> denote the affine variable elements. The math representation is:
>
>  result = (xcst0 * x0 + xcst1 * x1 + ... + xcstM * xM) *
>           (ycst0 * y0 + ycst1 * y1 + ... + ycstN * yN)   // x0 = y0 = 1
>
> Based on the distributive law, the mult statement could be expanded into
> a summation of (M + 1) * (N + 1) items of the form
> "(xcsti * ycstj) * (xi * yj)". Suppose "M" and "N" are relatively small
> numbers; if all "xi * yj" could be transformed into dot-products, codegen
> after distributive expansion may even produce fewer instructions. In a
> simple example on aarch64, the original loop body contains 8
> instructions:
>
>  char *a, *b, *c, *d;
>
>  for (i) {
>    sum += (a[i] + b[i]) * (c[i] + d[i]);
>
>    // uaddl   v29.8h, v1.8b, v30.8b
>    // uaddl2  v30.8h, v1.16b, v30.16b
>    // uaddl   v31.8h, v0.8b, v27.8b
>    // uaddl2  v27.8h, v0.16b, v27.16b
>    // umull2  v28.4s, v29.8h, v31.8h
>    // umlal   v28.4s, v29.4h, v31.4h
>    // umlal   v28.4s, v30.4h, v27.4h
>    // umlal2  v28.4s, v30.8h, v27.8h
>  }
>
> After distributive expansion, only 4 instructions:
>
>  for (i) {
>    int patt_t0 = DOT_PROD(a, c, sum);
>    int patt_t1 = DOT_PROD(a, d, patt_t0);
>    int patt_t2 = DOT_PROD(b, c, patt_t1);
>    int patt_t3 = DOT_PROD(b, d, patt_t2);
>
>    sum = patt_t3;
>
>    // udot    v27.4s, v28.16b, v29.16b
>    // udot    v27.4s, v28.16b, v30.16b
>    // udot    v27.4s, v31.16b, v29.16b
>    // udot    v27.4s, v31.16b, v30.16b
>  }
>
> However, on some architectures, the dot-product operation is not natively
> supported, and is emulated by a sequence of other instructions. For
> instance, some dot-product operations for the int8 type on x86-64 are
> emulated using "vpdpwssd". Meanwhile, gcc lacks a mechanism to precisely
> model a gimple statement's execution cost, measured as instruction cycles
> or count; consequently, we could not easily judge in a target-independent
> way whether a mult statement really deserves distributive expansion.
> Thereby, though the scheme is specifically discussed here, we did not
> implement it in the completed patches, and would consider it a TODO.
