On 14/11/18 09:53, Richard Biener wrote:
>> WIDEN_MULT_PLUS is special on our target in that it creates double-sized
>> vectors.
> Are there really double-size vectors or does the target simply produce
> the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
> or _EVEN/ODD split operations.  Or, like - what I now remember - for the
> DOT_PROD_EXPR optab, the target already reduces element pairs of the
> result vector (unspecified which ones) so the result vector is of the
> same size as the inputs.

The output of widening multiply and widening multiply-add is stored in two
consecutive registers.  So they can be used as separate vectors, but you
can't choose the register numbers independently.  OTOH, you can treat them
together as a double-sized vector, but without any extra alignment
requirements over a single-sized vector.
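
To illustrate (with a made-up pattern name, predicate and mnemonic; say the
single vector mode is V4HI and the register pair is modelled as V4SI), the
widening multiply is essentially:

  ;; The V4SI result lives in two consecutive vector registers; the pair
  ;; only constrains register allocation, not alignment.
  (define_insn "widen_smult_pair_v4hi"
    [(set (match_operand:V4SI 0 "register_pair_operand" "=r")
          (mult:V4SI
            (sign_extend:V4SI (match_operand:V4HI 1 "register_operand" "r"))
            (sign_extend:V4SI (match_operand:V4HI 2 "register_operand" "r"))))]
    ""
    "vmuls\t%0,%1,%2")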

> That is, if your target produces two vectors you may instead want to hide
> that fact by claiming you support DOT_PROD_EXPR and expanding it to the
> widen-mult-plus plus reducing (adding) the two result vectors to get a
> single one.

Doing a part of the reduction in the loop is a bit pointless.
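
For reference, I read that suggestion as roughly the following (modes,
predicates and the gen_* helpers are made up; gen_widen_smult_pair_v4hi
stands for the hardware widening multiply sketched above, and addv2si3 is
assumed to exist for the single vector mode):

  (define_expand "sdot_prodv4hi"
    [(match_operand:V2SI 0 "register_operand")
     (match_operand:V4HI 1 "register_operand")
     (match_operand:V4HI 2 "register_operand")
     (match_operand:V2SI 3 "register_operand")]
    ""
  {
    /* The hardware widening multiply fills a V4SI register pair.  */
    rtx prod = gen_reg_rtx (V4SImode);
    emit_insn (gen_widen_smult_pair_v4hi (prod, operands[1], operands[2]));

    /* Reduce the two result vectors to a single one, then accumulate
       (byte offsets assume little-endian subword ordering).  */
    rtx lo = simplify_gen_subreg (V2SImode, prod, V4SImode, 0);
    rtx hi = simplify_gen_subreg (V2SImode, prod, V4SImode, 8);
    rtx sum = gen_reg_rtx (V2SImode);
    emit_insn (gen_addv2si3 (sum, lo, hi));
    emit_insn (gen_addv2si3 (operands[0], sum, operands[3]));
    DONE;
  })

That is two extra vector adds per multiply in the loop, which is what I mean
by doing part of the reduction in the loop.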

I have tried another approach: advertise the WIDEN_MULT_PLUS and WIDEN_MULT
operations as LO/HI part operations of the double vector size, and also add
fake double-vector patterns for move, widening and add (they get expanded or
split into single-vector patterns).  That seems to work for the dot product;
it's as if the code were unrolled by a factor of two.  There are a few
drawbacks, though (a sketch of one of the LO/HI patterns follows the list):
- the tree optimizer creates separate WIDEN_MULT and PLUS expressions, and
  it is left to the combiner to clean that up.  That combination and the
  register allocation might be a bit fragile.
- if the input isn't known to be aligned to the doubled vector size, a
  run-time check is inserted so that an unvectorized loop is used when
  there is no excess alignment.
- auto-increment for the loads is lost.  I can probably fix this by keeping
  double-sized loads around for longer or with some special-purpose pass,
  but both might have some other drawbacks.  But there's actually a
  configuration option for an instruction to load multiple vector registers
  with register-indirect or auto-increment addressing, so there is some
  merit in having a pattern for it.
- the code size is larger.
- vectorization will fail if any other code is mixed in for which no
  double-vector patterns are provided.
- this approach uses SUBREGs in ways that are not safe according to the
  documentation.  But then, other ports like i386 and little-endian aarch64
  do that too.  I think it is now (since we have SUBREG_BYTE) safe to have
  subregs of registers whose hard register size is larger than
  UNITS_PER_WORD, as long as you refer to entire hard registers.  Maybe we
  could change the documentation?
  AFAICT, there are also only four places that need to be patched to make a
  lowpart access with a SUBREG of such a hard register safe.  I'm trying
  this at the moment; it was just a few hours late for the phase 1->3
  deadline.
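
To make the LO/HI approach concrete, one of the fake double-vector patterns
would look roughly like this (names, modes and predicates made up again;
V8HI/V4SI stand for register pairs, and the SUBREGs taking the single-sized
halves of the double-sized operands are the ones I mean above):

  (define_expand "vec_widen_smult_lo_v8hi"
    [(match_operand:V4SI 0 "register_pair_operand")
     (match_operand:V8HI 1 "register_pair_operand")
     (match_operand:V8HI 2 "register_pair_operand")]
    ""
  {
    /* The low V4HI halves of the double-sized inputs feed one hardware
       widening multiply; its V4SI result already fills a register pair.
       Byte offset 0 is the low half on a little-endian layout.  */
    rtx lo1 = simplify_gen_subreg (V4HImode, operands[1], V8HImode, 0);
    rtx lo2 = simplify_gen_subreg (V4HImode, operands[2], V8HImode, 0);
    emit_insn (gen_widen_smult_pair_v4hi (operands[0], lo1, lo2));
    DONE;
  })

The _hi variant only differs in the subreg byte offset.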

I suppose for WIDEN_SUM_EXPR, I'd have to have one double-vector-sized
pattern that adds the products of the two input vectors into the double
output vector, and leave the rtl loop optimizer to get the constant pool
load of the all-ones vector out of the loop.  But again, there'll be issues
with excess alignment requirements and code size.
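
I.e., roughly (made-up names again; gen_widen_smult_plus_pair_v4hi stands
for the hardware widening multiply-add into a register pair, and I'm not
sure I have the widen_ssum mode naming right):

  (define_expand "widen_ssumv8hi3"
    [(match_operand:V4SI 0 "register_pair_operand")
     (match_operand:V8HI 1 "register_pair_operand")
     (match_operand:V4SI 2 "register_pair_operand")]
    ""
  {
    /* Sum the two halves of the double-sized input into the V4SI register
       pair by multiply-accumulating each half with an all-ones vector;
       hoisting the constant load is left to the rtl loop optimizer.  */
    rtx ones = force_reg (V4HImode, CONST1_RTX (V4HImode));
    rtx lo = simplify_gen_subreg (V4HImode, operands[1], V8HImode, 0);
    rtx hi = simplify_gen_subreg (V4HImode, operands[1], V8HImode, 8);
    emit_move_insn (operands[0], operands[2]);
    emit_insn (gen_widen_smult_plus_pair_v4hi (operands[0], lo, ones,
                                               operands[0]));
    emit_insn (gen_widen_smult_plus_pair_v4hi (operands[0], hi, ones,
                                               operands[0]));
    DONE;
  })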

> The vectorizer cannot really deal with multiple sizes, thus for example a
> V4SI * V4SI + V4DI operation.  And that all those tree codes are exposed
> as "scalar" is sth that continues to confuse me, but is mainly done
> because at pattern recognition time there's only the scalars.

Well, the vectorizer makes an exception for reductions, as it'll allow
maintaining either a vector or a scalar during the loop, so why not allow
other sizes for that value as well?  It's all hidden in the final reduction
emitted by the epilogue.

> For vectorization I would advise to provide expansion patterns for codes
> that are already supported, in your case DOT_PROD_EXPR.

With vector size doubling, it seems to work better with LO/HI multiply and
PLUS (and let the combiner take the strain).

Without it... for a straight expansion, there is little point.  The previous
sum is in one register, the multiply results are spread over two registers,
and DOT_PROD_EXPR is supposed to yield a scalar.  Even with a reduction
instruction to sum up two registers, you need another instruction to add up
all three, so a minimum of three instructions.  LO/HI multiply can be fudged
by doing a full multiply and picking half the result, and cse should reduce
that to one multiply.  Again, two adds are needed, because the reduction
variable is too narrow to use widening multiply-add.
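
The fudging would look something like this (single vector size now, names
still made up); the _lo and _hi expanders emit the same full multiply, which
is what cse is supposed to merge:

  (define_expand "vec_widen_smult_lo_v4hi"
    [(match_operand:V2SI 0 "register_operand")
     (match_operand:V4HI 1 "register_operand")
     (match_operand:V4HI 2 "register_operand")]
    ""
  {
    /* Do the full widening multiply into a fresh register pair and pick
       out the requested half; the _hi variant only differs in the subreg
       byte offset.  */
    rtx full = gen_reg_rtx (V4SImode);
    emit_insn (gen_widen_smult_pair_v4hi (full, operands[1], operands[2]));
    emit_move_insn (operands[0],
                    simplify_gen_subreg (V2SImode, full, V4SImode, 0));
    DONE;
  })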

There may be some merit to DOT_PROD_EXPR if I make it do something strange.
But there's no easy way to use a special-purpose mode, since there's no
matching reduction pattern for a DOT_PROD_EXPR, and the reduction for a
WIDEN_SUM_EXPR is not readily distinguishable from the one for a
non-widening summation with the same output vector mode.

I could use a special kind of hard register that's really another view of a
group of vector registers and that is reserved for this purpose unless
eliminated.  The elimination is blocked when there is a statement that uses
these registers, because the expander for the DOT_PROD_EXPR /
WIDEN_SUM_EXPR sticks the actually used hard registers somewhere, and if
this special 'hard reg' can't be obtained, another, more expensive pattern
(suitably indicated in the constraints) is used... but that's a lot of hair.

It's probably easier to write a special-purpose ssa pass to patch up the
type of the reduction variable, and insert that pass to run after the
vectorizer: widen the variable when entering the loop, reduce it when
exiting.  If the loop is not understood, a more expensive pattern with the
standard reduction variable width is used.

In which case, the value of DOT_PROD_EXPR / WIDEN_SUM_EXPR is that they are
somewhat special and thus stick out (or, in other words, you can take a bit
of time to verify you got something interesting when you find them).