Hi,

  I composed some patches to generalize recognition of lane-reducing patterns
(dot-product is a typical representative), and prepared an RFC document to help
review. The original intention was to make a complete solution for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114440.  The work might still be
limited in some aspects, so comments are welcome. Thanks.

-----

1. Background

For a loop reduction that accumulates the result of a widening operation, the
preferred pattern is a lane-reducing operation, if supported by the target.
Because this kind of operation does not need to preserve intermediate results
of the widening operation, and only produces a reduced number of final results
for accumulation, choosing the pattern could lead to pretty compact codegen.

Three lane-reducing opcodes are defined in gcc, belonging to two kinds of
operations: dot-product (DOT_PROD_EXPR) and sum-of-absolute-difference
(SAD_EXPR); WIDEN_SUM_EXPR could be seen as a degenerate dot-product with a
constant operand of "1". Currently, gcc only supports recognition of the simple
lane-reducing case, in which each accumulation statement of the loop reduction
forms one pattern:

 char  *d0, *d1;
 short *s0, *s1;

 for (i) {
   sum += d0[i] * d1[i];      // <vector(4) int> = DOT_PROD <vector(16) char>
   sum += abs(s0[i] - s1[i]); // <vector(4) int> = SAD <vector(8) short>
 }

We could rewrite the example as below using only one statement, whose non-
reduction addend is the sum of the two right-hand sides above. As a whole, the
addend would match nothing, while its two sub-expressions could be recognized
as the corresponding lane-reducing patterns.

 for (i) {
   sum += d0[i] * d1[i] + abs(s0[i] - s1[i]);
 }

This case might be too elaborately crafted to be very common in reality.
However, we do find superficially different but essentially similar code
patterns in some AI applications, which use matrix-vector operations
extensively; some usages are just a single loop reduction composed of multiple
dot-products. A code snippet from ggml:

 for (int j = 0; j < qk/2; ++j) {
   const uint8_t xh_0 = ((qh >> (j +  0)) << 4) & 0x10;
   const uint8_t xh_1 = ((qh >> (j + 12))     ) & 0x10;

   const int32_t x0 = (x[i].qs[j] & 0xF) | xh_0;
   const int32_t x1 = (x[i].qs[j] >>  4) | xh_1;

   sumi += (x0 * y[i].qs[j]) + (x1 * y[i].qs[j + qk/2]);
 }

At the source level, it appears to be a natural and minor scaling-up of the
simple single lane-reducing pattern, but it is beyond the capability of current
vectorization pattern recognition, and needs some kind of generic extension to
the framework.

2. Reasoning on validity of transform

First of all, we should tell what kind of expression is appropriate for the
lane-reducing transform. Given a loop, we use the language of mathematics to
define an abstract function f(x, i), whose first independent variable "x"
denotes a value that will participate in a sum-based loop reduction either
directly or indirectly, and whose second variable "i" specifies the index of a
loop iteration, standing for any other intra-iteration factor irrelevant to
"x". The function itself represents the value obtained by applying a series of
operations on "x" in the context of the "i"th loop iteration, and this value is
directly accumulated into the loop reduction result. For the purpose of
vectorization, it is implicitly assumed that f(x, i) is a pure function, free
of loop-carried dependences.

Additionally, for a value "x" defined in the loop, let "X" be the vector
<x0, x1, ..., xM> consisting of the "x" values in all iterations; to be
specific, "X[i]" corresponds to "x" at iteration "i", or "xi". With sequential
execution order, a loop reduction with respect to f(x, i) would be expanded to:

 sum += f(x0, 0);
 sum += f(x1, 1);
 ...
 sum += f(xM, M);

2.1 Lane-reducing vs. Lane-combining

Following lane-reducing semantics, we introduce a new, similar lane-combining
operation that also manipulates a subset of lanes/elements in a vector, by
accumulating all of them into one lane, and at the same time clearing the rest
of the lanes to zero. The two operations are equivalent in essence, while a
major difference is that the lane-combining operation does not reduce the
number of lanes of the vector. One advantage of this is that codegen of a
lane-combining operation could seamlessly inter-operate with that of normal
(non-lane-reducing) vector operations.

Any lane-combining operation could be synthesized by a sequence of the most
basic two-lane operations, which become the focus of our analysis. Given two
lanes "i" and "j", and letting X' = lane-combine(X, i, j), we have:

  X  = <..., xi     , ...,  xj, ...>
  X' = <..., xi + xj, ...,   0, ...>

2.2 Equations for loop reduction invariance

Since the combining strategy of lane-reducing operations is target-specific,
for example, accumulating quad lanes into one (#0 + #1 + #2 + #3 => #0), or low
to high (#0 + #4 => #4), we just make a conservative assumption that combining
could happen on arbitrary two lanes in either order. Under this precondition,
it is legitimate to optimize evaluation of a value "x" with a lane-reducing
pattern only if the loop reduction always produces an invariant result no
matter which two lanes of "X" (the vector value set for "x") are combined. If
the constraint is satisfied, one equation could be easily deduced as:

 f(xi + xj, i) + f(0, j) = f(0, i) + f(xi + xj, j)  (∀i, ∀j)

Replacing "xi + xj" with a general variable "x", and letting j = 0, we get an
initial solution to f(x, i):

 f(x, i) = g(x) + h(i)  (g(0) = 0)

The two functions on the right are represented as below. Evidently, they are
orthogonal to each other.

 g(x) = f(x, 0) - f(0, 0)
 h(i) = f(0, i)

Furthermore, we resort to another equation derived from the above constraint:

 f(xi, i) + f(xj, j) = f(xi + xj, i) + f(0, j)  (∀i, ∀j, i != j)

Plugging the initial solution into the equation and simplifying, we have:

 g(xi) + g(xj) = g(xi + xj)  (∀xi, ∀xj)

Only a standard linear function, i.e. an affine function crossing the origin
(0, 0), possesses this characteristic, thus the ultimate solution to f(x, i)
should be of the form:

 f(x, i) = cst * x + h(i)

In this formula, "cst" stands for a loop-invariant value, and h(i) could be
recursively decomposed into more child values, just as f(x, i) was. In other
words, evaluation of a value is qualified to be a candidate for the
lane-reducing transform only if it belongs to the affine closure that
contributes to the non-reduction addend of the loop reduction statement. Let
"opX" denote a lane-reducing operation; if any are recognized, the normalized
representation of the loop reduction pattern statements must be of the form:

 sum += cst0 * op0 + cst1 * op1 + ... + cstM * opM + h(i)

An invented example might be as complicated as:

 ~DOT_PROD(a1, a2, 0) + invar * (WIDEN_SUM(b, 0) - (SAD(c1, c2, 0) << 3)) + 
NORMAL_OP

3. Implementation

3.1 Affine closure for loop reduction

As has been mentioned, the prerequisite for matching a value against
lane-reducing patterns is that its definition statement must be located in the
affine closure of the loop reduction addends. To be specific, the value and all
its derived computations only end up in loop reductions, and are not used in
any non-linear transform operation. Since correctly determining this property
requires traversing the def-use graph leading to the reductions, it would be
quite costly to repeat this loop-wise analysis every time a statement is to be
checked. So we add a preprocessing step that marks all statements in the affine
closure with merely one-time effort, which is invoked at initialization of the
pattern recognition module.

Inside the affine closure, the underlying match framework would uniformly
handle both lane-reducing and normal patterns; also, a resulting lane-combining
operation need not be treated distinctly, as it could be smoothly coupled with
normal operations. It does not matter whether a given value is singly used or
not, so even sharing a lane-combining operation among multiple loop reductions
could be supported.

 for (i) {
   int a = d0[i] * d1[i];   // a = DOT_PROD(d0, d1, 0)

   sum0 += a + 18;
   sum1 -= a << 2;
 }

Because what a lane-combining operation generates does not represent the real
result of the original statement, we should not initiate such a pattern match
from a statement outside the affine closure.

 for (i) {
   int a = d0[i] * d1[i];   // Not in affine closure

   data[i] = a;             // Prevent "a" being in affine closure
   sum += a;
 }

But if profitable, such a statement may be forcibly pulled into the closure by
duplication. Assume the vectorization factor is 16; for the above case, the
vector value accumulated into "sum" is obtained by combining the
"<vector(16) int>" value of "a" into "<vector(4) int>" via 4 add instructions,
while it only takes one instruction if the dot-product pattern is enabled after
duplication as:

 for (i) {
   int a0 = d0[i] * d1[i];  // Not in affine closure
   int a1 = d0[i] * d1[i];  // Duplicate into affine closure

   data[i] = a0;
   sum += a1;               // sum = DOT_PROD(d0, d1, sum)
 }

As we know, a successful recognition of a pattern would replace the matched
statement with new pattern statements, which might cause the affine closure to
change. Some statements that are originally in the closure have to be kicked
out if the linearity of a relay statement linking into the closure is broken.

 for (i) {
   char a = ...;
   char b = ...;
   int i_a = (int)a;                  // In affine closure
   int i_b = (int)b;                  // In affine closure
   int c = i_a + i_b;                 // In affine closure

   sum += c;
 }

After over_widening pattern:

 for (i) {
   char a = ...;
   char b = ...;
   int i_a = (int)a;                  // Become unused
   int i_b = (int)b;                  // Become unused
   short patt_a = (short)a;           // Not in affine closure
   short patt_b = (short)b;           // Not in affine closure
   short patt_c = patt_a + patt_b;    // Not in affine closure
   int c = (int)patt_c;

   sum += c;
 }

On the contrary, if a statement in the closure is substituted by a new
linear-transform-like operation, its operands could also be pulled into the
closure. This expansion would happen recursively, but it is not necessary to
add those statements that have already finished pattern recognition, whether
matched or not.

 for (i) {
   int a = ...;                       // In affine closure
   int b = a * 41;                    // In affine closure

   sum += b;
 }

Suppose the mult pattern is triggered for this case; the mult-by-constant
statement is rewritten to several adds and shifts as:

 for (i) {
   int a = ...;
   int patt_a0 = a << 5;              // New in affine closure
   int patt_a1 = a << 3;              // New in affine closure
   int patt_a2 = patt_a0 + patt_a1;   // New in affine closure
   int b = patt_a2 + a;

   sum += b;
 }

Therefore, we resort to a postprocessing step after a pattern is recognized, in
which the semantics of the new pattern statements are parsed and, accordingly,
the affine closure of the loop reductions is incrementally updated. However,
there is one complication: the adjustment might cause lane-combining operations
already generated to retreat from the affine closure, which must be avoided.
Since a lane-reducing pattern could not be reverted once it has been formed,
the alternative is to invalidate the other pattern in process. Actually, as far
as existing patterns are concerned, this kind of conflict may not occur, but we
cannot guarantee that introduction of a new pattern in the future would not
cause it. For an imaginary example:

 for (i) {
   int patt_a = DOT_PROD(d0, d1, 0);  // a = d0[i] * d1[i], Range[30 - 132]
   short b = ...;
   int c = patt_a + (int)b;

   sum += c;
 }

Suppose a new pattern utilizes value range information to de-promote integer
values to fit into widening operations as much as possible:

 for (i) {
   int patt_a = DOT_PROD(d0, d1, 0);
   short patt_s_a = (short)patt_a;    // Conflict with DOT_PROD
   short b = ...;
   int c = patt_s_a w+ b;

   sum += c;
 }

Based on the above rationale, three kinds of status (rpatt_none/rpatt_allow/
rpatt_formed) are defined. A statement with rpatt_allow indicates it is covered
by the loop reduction affine closure, thus it is a candidate to which
lane-reducing patterns are applicable. Otherwise, it would be quickly ignored
by the patterns. At the time we are to remove a statement from the affine
closure, this status should be cleared. Once a lane-combining statement is
generated upon a pattern match, it is set to rpatt_formed; besides, the status
would also be propagated to all dependent statements along the paths that lead
to the loop reduction. Any adjustment of the affine closure imposed by a
pending pattern recognition must ensure that statements marked with
rpatt_formed stay in the closure; if it cannot, the change due to the pattern
is rolled back.

In the implementation detailed so far, rpatt_formed implies rpatt_allow.
Though, we consider making the two statuses independently represented, for the
sake of enabling degraded lane-reducing operations in later work.  If lane-
reducing to a given result type is not supported by the target, we could settle
for second best by choosing a degraded lane-reducing operation with a smaller
intermediate result type. The operation serves the reduction, with rpatt_formed
tagging, but does not belong to the affine closure, so it gets no rpatt_allow.
For example:

 long sum = ...;

 for (i) {
   sum += d0[i] * d1[i];
 }

Suppose the target has only a dot-product instruction accumulating 8-bit muls
into 32-bit integers; we could still get the reduction optimized as:

 for (i) {
   int patt_a = DOT_PROD(d0, d1, 0);     // <vector(4) int> = <vector(16) char>

   sum += (long)patt_a;
 }

Additionally, if widen-sum from 32-bit to 64-bit is available, we would end up
with cascading lane-reducing operations:

 for (i) {
   int patt_a = DOT_PROD(d0, d1, 0);     // <vector(4) int> = <vector(16) char>
   long patt_l_a = WIDEN_SUM(patt_a, 0); // <vector(2) long> = <vector(4) int>

   sum += patt_l_a;
 }

3.2 Pattern priority

The current pattern priority strategy for a statement is that the first match
wins; the order of a pattern determines its priority. Basically, lane-reducing
patterns are not that sort of local statement-based pattern, in that they
require knowledge of loop structure. Naturally, it is anticipated that these
patterns would benefit loop vectorization much more than peephole-like
patterns. So we give lane-reducing patterns overriding priority, and move them
to the topmost positions in the pattern table, which could reduce possible
interference from the others.

3.3 Pattern decoupling

A lane-reducing operation contains two aspects: the main primitive operation
and the appendant result-accumulation. The original design handles matching of
the compound semantics in a single pattern, but this means is not suitable for
an operation that does not directly participate in a loop reduction. To resolve
the problem, lane-reducing patterns are tailored to only cover the basic
aspect, and a new pattern is dedicated to folding a summation into a
lane-reducing operation when appropriate.

 for (i) {
   int a = DOT_PROD(d0, d1, 0);       // a = d0[i] * d1[i]
   int b = SAD(s0, s1, 0);            // b = abs(s0[i] - s1[i])
   int c = WIDEN_SUM(w, 0);           // c = w[i]
   int f = (b + c) << 5;

   sum += a + f;
 }

After lane-reducing accum pattern:

 for (i) {
   int a = DOT_PROD(d0, d1, 0);       // Become unused
   int b = SAD(s0, s1, 0);            // Become unused
   int c = WIDEN_SUM(w, 0);
   int f = SAD(s0, s1, c) << 5;       // Fold "c"

   sum += DOT_PROD(d0, d1, f);        // Fold "f"
 }

3.4 Dot-product pattern

For a dot-product operation that would be generated from a mult-by-invariant
statement like "t = invar * x", the conversion process becomes not as
straightforward as before. Since the statement itself is a linear transform,
the definition statement of its variant operand "x" must also be in the affine
closure, and must have been perfectly matched by the widen-sum pattern before
we get to the mult statement; then we could try to combine the widen-sum
operation and the invariant multiplier into a dot-product. Though, combining is
not always beneficial: if the invariant is wider than the operand of the
widen-sum, it results in reduced vectorization parallelism, so it is not
allowed.

 char  invar0 = ...;
 short invar1 = ...;

 for (i) {
   char a0 = ...;
   char a1 = ...;
   int b0 = WIDEN_SUM(a0, 0);         // b0 = (int)a0;
   int b1 = WIDEN_SUM(a1, 0);         // b1 = (int)a1;
   int c0 = invar0 * b0;
   int c1 = invar1 * b1;
   ....
 }

After combining widen-sum and invariant:

 for (i) {
   char a0 = ...;
   char a1 = ...;
   int b0 = WIDEN_SUM(a0, 0);         // Become unused
   int b1 = WIDEN_SUM(a1, 0);
   int c0 = DOT_PROD(a0, invar0, 0);  // Combine
   int c1 = invar1 * b1;              // Do not combine
   ...
 }

The most common case is a mult statement with two variant operands. Firstly, we
check whether either operand is the result of another mult-by-invariant
statement; if so, we rely on reassociation to temporarily group invariants and
variants together respectively. This is repeated until neither operand is
resolvable.

 for (i) {
   char a0 = ...;
   char a1 = ...;
   int b0 = invar0 * a0;
   int b1 = invar1 * a1;
   int c = b0 * b1;
   ...
 }

When the pair of variant operands are both promoted from values of narrower
types, we try to make those original values fit into a dot-product operation.
And if the set of invariants is not empty, and its product is unable to be
folded into a constant or an existing value, as a minor optimization we would
generate evaluation statements outside of the loop.

 int patt_invar = invar0 * invar1;

 for (i) {
   char a0 = ...;
   char a1 = ...;
   int b0 = invar0 * a0;              // Become unused
   int b1 = invar1 * a1;              // Become unused
   int c = patt_invar * DOT_PROD(a0, a1, 0);
   ...
 }

In fact, what the enhancement aims at is just a simplified instance of a more
generalized code pattern for a mult statement, in which each operand is an
affine combination of values with narrower types. Let "xcsti", "ycsti" stand
for the affine invariant coefficients, and "xi", "yi" denote the affine
variable elements. The math representation is:

 result = (xcst0 * x0 + xcst1 * x1 + ... + xcstM * xM) *
          (ycst0 * y0 + ycst1 * y1 + ... + ycstN * yN)   // x0 = y0 = 1

Based on the distributive law, the mult statement could be expanded to a
summation of (M + 1) * (N + 1) items of the form "(xcsti * ycstj) * (xi * yj)".
Suppose "M" and "N" are relatively small numbers; if all "xi * yj" could be
transformed to dot-products, codegen after distributive expansion may even
produce fewer instructions. In a simple example on aarch64, the original loop
body contains 8 instructions:

 char *a, *b, *c, *d;

 for (i) {
   sum += (a[i] + b[i]) * (c[i] + d[i]);

   // uaddl   v29.8h, v1.8b, v30.8b
   // uaddl2  v30.8h, v1.16b, v30.16b
   // uaddl   v31.8h, v0.8b, v27.8b
   // uaddl2  v27.8h, v0.16b, v27.16b
   // umull2  v28.4s, v29.8h, v31.8h
   // umlal   v28.4s, v29.4h, v31.4h
   // umlal   v28.4s, v30.4h, v27.4h
   // umlal2  v28.4s, v30.8h, v27.8h
 }

After distributive expansion, only 4 instructions are needed.

 for (i) {
   int patt_t0 = DOT_PROD(a, c, sum);
   int patt_t1 = DOT_PROD(a, d, patt_t0);
   int patt_t2 = DOT_PROD(b, c, patt_t1);
   int patt_t3 = DOT_PROD(b, d, patt_t2);
  
   sum = patt_t3;

   // udot    v27.4s, v28.16b, v29.16b
   // udot    v27.4s, v28.16b, v30.16b
   // udot    v27.4s, v31.16b, v29.16b
   // udot    v27.4s, v31.16b, v30.16b
 }

However, on some architectures, the dot-product operation is not natively
supported, and is emulated by a sequence of other instructions. For instance,
some dot-product operations for the int8 type on x86-64 are emulated using
"vpdpwssd". Meanwhile, gcc lacks a mechanism to precisely model a gimple
statement's execution cost, measured as instruction cycles or count;
consequently, we could not easily judge whether a mult statement really
deserves distributive expansion in a target-independent way. Therefore, though
the scheme is specially discussed here, we did not implement it in the posted
patches, and would consider it as a TODO.
