On 14/11/18 09:53, Richard Biener wrote:
WIDEN_MULT_PLUS is special on our target in that it creates double-sized
vectors.
Are there really double-size vectors or does the target simply produce
the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
or _EVEN/ODD split operations.  Or, like - what I now remember - for the
DOT_PROD_EXPR optab, the target already reduces element pairs of the
result vector (unspecified which ones) so the result vector is of the same
size as the inputs.
The output of widening multiply and widening multiply-add is stored
in two consecutive registers.  So, they can be used as separate
vectors, but you can't choose the register numbers independently.
OTOH, you can treat them together as a double-sized vector, but
without any extra alignment requirements over a single-sized vector.

That is, if your target produces two vectors you may instead want to hide
that fact by claiming you support DOT_PROD_EXPR and expanding
it to the widen-mult-plus plus reducing (adding) the two result vectors to
get a single one.

Doing a part of the reduction in the loop is a bit pointless.

I have tried another approach: advertise the WIDEN_MULT_PLUS
and WIDEN_MULT operations as LO/HI part operations of the double
vector size, and also add fake double-vector patterns for move, widening
and add (they get expanded or split into single-vector patterns).
That seems to work for the dot product, it's like the code is unrolled by
a factor of two.  There are a few drawbacks though:
- the tree optimizer creates separate WIDEN_MULT and PLUS expressions,
and it is left to the combiner to clean that up.  That combination and
the register allocation might be a bit fragile.
- if the input isn't known to be aligned to the doubled vector size, a
run-time check is inserted, and an unvectorized loop is used if there is
no excess alignment.
- auto-increment for the loads is lost.  I can probably fix this by
keeping double-sized loads around for longer, or with some
special-purpose pass, but both might have some other drawbacks.  But
there's actually a configuration option for an instruction to load
multiple vector registers with register-indirect or auto-increment
addressing, so there is some merit to having a pattern for it.
- the code size is larger.
- vectorization will fail if any other code is mixed in for which no
double-vector patterns are provided.
- this approach uses SUBREGs in ways that are not safe according to the
documentation.  But then, other ports like i386 and little-endian
aarch64 do that too.  I think it is now (since we have SUBREG_BYTE)
safe to have subregs of registers whose hard register size is larger
than UNITS_PER_WORD, as long as you refer to entire hard registers.
Maybe we could change the documentation?  AFAICT, there are also only
four places that need to be patched to make a lowpart access with a
SUBREG of such a hard register safe.  I'm trying this at the moment;
it was just a few hours late for the phase 1->3 deadline.

I suppose for WIDEN_SUM_EXPR, I'd have to have one double-vector-sized pattern that adds the products of the two input vectors into the double output vector, and leave the rtl loop optimizer to get the constant pool load of the all-ones vector out of the loop. But again, there'll be issues with excess alignment requirements and code size.

The vectorizer cannot really deal with multiple sizes, thus for example
a V4SI * V4SI + V4DI operation; and that all those tree codes are exposed
as "scalar" is something that continues to confuse me, but it is mainly
done because at pattern recognition time there are only the scalars.
Well, the vectorizer makes an exception for reductions, as it allows maintaining either a vector or a scalar during the loop, so why not allow other sizes for that value as well?  It's all hidden in the final reduction emitted in the epilogue.
For vectorization I would advise to provide expansion patterns for codes that are already supported, in your case DOT_PROD_EXPR.
With vector size doubling, it seems to work better with LO/HI multiply and PLUS (and letting
the combiner take the strain).
Without it, for a straight expansion, there is little point.  The previous sum is in one register, the multiply results are spread over two registers, and DOT_PROD_EXPR is supposed to yield a scalar.  Even with a reduction instruction to sum up two registers, you need another instruction to add up all three, so a minimum of three instructions.  LO/HI multiply can be fudged by doing a full multiply and picking half the result, and CSE should reduce that to one multiply.  Again, two adds are needed, because the reduction variable is too narrow
to use widening multiply-add.
There may be some merit to DOT_PROD_EXPR if I make it do something strange.
But there's no easy way to use a special-purpose mode: there's no matching reduction pattern for a DOT_PROD_EXPR, and the reduction for a WIDEN_SUM_EXPR is not readily distinguishable from the one for a non-widening summation with the same output vector mode.  I could use a special kind of hard register that's really another view of a group of vector registers, reserved for this purpose unless eliminated.  The elimination would be blocked when there is a statement that uses these registers, because the expander for DOT_PROD_EXPR / WIDEN_SUM_EXPR sticks the actually used hard registers somewhere; and if the special 'hard reg' can't be obtained, another, more expensive pattern (suitably indicated in the constraints) is
used... but that's a lot of hair.
It's probably easier to write a special-purpose SSA pass to patch up the type of the reduction variable, and insert that pass to run after the vectorizer: widen the variable when entering the loop, reduce it when exiting.  If the loop is not understood, a more expensive pattern with the standard
reduction variable width is used.
In that case, the value of DOT_PROD_EXPR / WIDEN_SUM_EXPR is that they are somewhat special and thus stick out (in other words, you can afford to take a bit of time to verify you've got something interesting
when you find them).
