On 14/11/18 09:53, Richard Biener wrote:
>> WIDEN_MULT_PLUS is special on our target in that it creates double-sized
>> vectors.
> Are there really double-size vectors or does the target simply produce
> the output in two vectors?  Usually targets have WIDEN_MULT_PLUS_LOW/HIGH
> or _EVEN/ODD split operations.  Or, like - what I now remember - for the
> DOT_PROD_EXPR optab, the target already reduces element pairs of the
> result vector (unspecified which ones) so the result vector is of the
> same size as the inputs.

The output of widening multiply and widening multiply-add is stored in two
consecutive registers.  So they can be used as separate vectors, but you
can't choose the register numbers independently.  OTOH, you can treat them
together as a double-sized vector, but without any extra alignment
requirements over a single-sized vector.
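
To illustrate (with a made-up pattern name, predicate and mnemonic; say the
single vector mode is V4HI and the register pair is modelled as V4SI), the
widening multiply is essentially:

  ;; The V4SI result lives in two consecutive vector registers; the pair
  ;; only constrains register allocation, not alignment.
  (define_insn "widen_smult_pair_v4hi"
    [(set (match_operand:V4SI 0 "register_pair_operand" "=r")
          (mult:V4SI
            (sign_extend:V4SI (match_operand:V4HI 1 "register_operand" "r"))
            (sign_extend:V4SI (match_operand:V4HI 2 "register_operand" "r"))))]
    ""
    "vmuls\t%0,%1,%2")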

> That is, if your target produces two vectors you may instead want to hide
> that fact by claiming you support DOT_PROD_EXPR and expanding it to the
> widen-mult-plus plus reducing (adding) the two result vectors to get a
> single one.

Doing a part of the reduction in the loop is a bit pointless.
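
For reference, I read that suggestion as roughly the following (modes,
predicates and the gen_* helpers are made up; gen_widen_smult_pair_v4hi
stands for the hardware widening multiply sketched above, and addv2si3 is
assumed to exist for the single vector mode):

  (define_expand "sdot_prodv4hi"
    [(match_operand:V2SI 0 "register_operand")
     (match_operand:V4HI 1 "register_operand")
     (match_operand:V4HI 2 "register_operand")
     (match_operand:V2SI 3 "register_operand")]
    ""
  {
    /* The hardware widening multiply fills a V4SI register pair.  */
    rtx prod = gen_reg_rtx (V4SImode);
    emit_insn (gen_widen_smult_pair_v4hi (prod, operands[1], operands[2]));

    /* Reduce the two result vectors to a single one, then accumulate
       (byte offsets assume little-endian subword ordering).  */
    rtx lo = simplify_gen_subreg (V2SImode, prod, V4SImode, 0);
    rtx hi = simplify_gen_subreg (V2SImode, prod, V4SImode, 8);
    rtx sum = gen_reg_rtx (V2SImode);
    emit_insn (gen_addv2si3 (sum, lo, hi));
    emit_insn (gen_addv2si3 (operands[0], sum, operands[3]));
    DONE;
  })

That is two extra vector adds per multiply in the loop, which is what I mean
by doing part of the reduction in the loop.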

I have tried another approach: advertise the WIDEN_MULT_PLUS and WIDEN_MULT
operations as LO/HI part operations of the double vector size, and also add
fake double-vector patterns for move, widening and add (they get expanded or
split into single-vector patterns).  That seems to work for the dot product;
it's as if the code were unrolled by a factor of two.  There are a few
drawbacks, though (a sketch of one of the LO/HI patterns follows the list):
- the tree optimizer creates separate WIDEN_MULT and PLUS expressions, and
  it is left to the combiner to clean that up.  That combination and the
  register allocation might be a bit fragile.
- if the input isn't known to be aligned to the doubled vector size, a
  run-time check is inserted so that an unvectorized loop is used when
  there is no excess alignment.
- auto-increment for the loads is lost.  I can probably fix this by keeping
  double-sized loads around for longer or with some special-purpose pass,
  but both might have some other drawbacks.  But there's actually a
  configuration option for an instruction to load multiple vector registers
  with register-indirect or auto-increment addressing, so there is some
  merit in having a pattern for it.
- the code size is larger.
- vectorization will fail if any other code is mixed in for which no
  double-vector patterns are provided.
- this approach uses SUBREGs in ways that are not safe according to the
  documentation.  But then, other ports like i386 and little-endian aarch64
  do that too.  I think it is now (since we have SUBREG_BYTE) safe to have
  subregs of registers whose hard register size is larger than
  UNITS_PER_WORD, as long as you refer to entire hard registers.  Maybe we
  could change the documentation?
  AFAICT, there are also only four places that need to be patched to make a
  lowpart access with a SUBREG of such a hard register safe.  I'm trying
  this at the moment; it was just a few hours late for the phase 1->3
  deadline.
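
To make the LO/HI approach concrete, one of the fake double-vector patterns
would look roughly like this (names, modes and predicates made up again;
V8HI/V4SI stand for register pairs, and the SUBREGs taking the single-sized
halves of the double-sized operands are the ones I mean above):

  (define_expand "vec_widen_smult_lo_v8hi"
    [(match_operand:V4SI 0 "register_pair_operand")
     (match_operand:V8HI 1 "register_pair_operand")
     (match_operand:V8HI 2 "register_pair_operand")]
    ""
  {
    /* The low V4HI halves of the double-sized inputs feed one hardware
       widening multiply; its V4SI result already fills a register pair.
       Byte offset 0 is the low half on a little-endian layout.  */
    rtx lo1 = simplify_gen_subreg (V4HImode, operands[1], V8HImode, 0);
    rtx lo2 = simplify_gen_subreg (V4HImode, operands[2], V8HImode, 0);
    emit_insn (gen_widen_smult_pair_v4hi (operands[0], lo1, lo2));
    DONE;
  })

The _hi variant only differs in the subreg byte offset.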

I suppose for WIDEN_SUM_EXPR, I'd have to have one double-vector-sized
pattern that adds the products of the two input vectors into the double
output vector, and leave the rtl loop optimizer to get the constant pool
load of the all-ones vector out of the loop.  But again, there'll be issues
with excess alignment requirements and code size.
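
I.e., roughly (made-up names again; gen_widen_smult_plus_pair_v4hi stands
for the hardware widening multiply-add into a register pair, and I'm not
sure I have the widen_ssum mode naming right):

  (define_expand "widen_ssumv8hi3"
    [(match_operand:V4SI 0 "register_pair_operand")
     (match_operand:V8HI 1 "register_pair_operand")
     (match_operand:V4SI 2 "register_pair_operand")]
    ""
  {
    /* Sum the two halves of the double-sized input into the V4SI register
       pair by multiply-accumulating each half with an all-ones vector;
       hoisting the constant load is left to the rtl loop optimizer.  */
    rtx ones = force_reg (V4HImode, CONST1_RTX (V4HImode));
    rtx lo = simplify_gen_subreg (V4HImode, operands[1], V8HImode, 0);
    rtx hi = simplify_gen_subreg (V4HImode, operands[1], V8HImode, 8);
    emit_move_insn (operands[0], operands[2]);
    emit_insn (gen_widen_smult_plus_pair_v4hi (operands[0], lo, ones,
                                               operands[0]));
    emit_insn (gen_widen_smult_plus_pair_v4hi (operands[0], hi, ones,
                                               operands[0]));
    DONE;
  })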

> The vectorizer cannot really deal with multiple sizes, thus for example a
> V4SI * V4SI + V4DI operation.  And that all those tree codes are exposed
> as "scalar" is sth that continues to confuse me, but is mainly done
> because at pattern recognition time there's only the scalars.

Well, the vectorizer makes an exception for reductions, as it'll allow
maintaining either a vector or a scalar during the loop, so why not allow
other sizes for that value as well?  It's all hidden in the final reduction
emitted by the epilogue.

> For vectorization I would advise to provide expansion patterns for codes
> that are already supported, in your case DOT_PROD_EXPR.

With vector size doubling, it seems to work better with LO/HI multiply and
PLUS (and let the combiner take the strain).

Without it... for a straight expansion, there is little point.  The previous
sum is in one register, the multiply results are spread over two registers,
and DOT_PROD_EXPR is supposed to yield a scalar.  Even with a reduction
instruction to sum up two registers, you need another instruction to add up
all three, so a minimum of three instructions.  LO/HI multiply can be fudged
by doing a full multiply and picking half the result, and cse should reduce
that to one multiply.  Again, two adds are needed, because the reduction
variable is too narrow to use widening multiply-add.
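
The fudging would look something like this (single vector size now, names
still made up); the _lo and _hi expanders emit the same full multiply, which
is what cse is supposed to merge:

  (define_expand "vec_widen_smult_lo_v4hi"
    [(match_operand:V2SI 0 "register_operand")
     (match_operand:V4HI 1 "register_operand")
     (match_operand:V4HI 2 "register_operand")]
    ""
  {
    /* Do the full widening multiply into a fresh register pair and pick
       out the requested half; the _hi variant only differs in the subreg
       byte offset.  */
    rtx full = gen_reg_rtx (V4SImode);
    emit_insn (gen_widen_smult_pair_v4hi (full, operands[1], operands[2]));
    emit_move_insn (operands[0],
                    simplify_gen_subreg (V2SImode, full, V4SImode, 0));
    DONE;
  })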

There may be some merit to DOT_PROD_EXPR if I make it do something strange.
But there's no easy way to use a special-purpose mode, since there's no
matching reduction pattern for a DOT_PROD_EXPR, and the reduction for a
WIDEN_SUM_EXPR is not readily distinguishable from the one for a
non-widening summation with the same output vector mode.

I could use a special kind of hard register that's really another view of a
group of vector registers and that is reserved for this purpose unless
eliminated.  The elimination is blocked when there is a statement that uses
these registers, because the expander for the DOT_PROD_EXPR /
WIDEN_SUM_EXPR sticks the actually used hard registers somewhere, and if
this special 'hard reg' can't be obtained, another, more expensive pattern
(suitably indicated in the constraints) is used... but that's a lot of hair.

It's probably easier to write a special-purpose ssa pass to patch up the
type of the reduction variable, and insert that pass to run after the
vectorizer: widen the variable when entering the loop, reduce it when
exiting.  If the loop is not understood, a more expensive pattern with the
standard reduction variable width is used.

In which case, the value of DOT_PROD_EXPR / WIDEN_SUM_EXPR is that they are
somewhat special and thus stick out (or, in other words, you can take a bit
of time to verify you got something interesting when you find them).