Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization

Richard Biener Mon, 11 Jan 2016 00:49:10 -0800

On Fri, Jan 8, 2016 at 5:11 PM, Alan Lawrence
<alan.lawre...@foss.arm.com> wrote:
> On Tues, Oct 27, 2015 at 2:39 PM, Richard Biener
> <richard.guent...@gmail.com> wrote:
>>
>> On Mon, Oct 26, 2015 at 6:59 AM, sameera <sameera.deshpa...@imgtec.com>
>> wrote:
>>>
>>>
>>> Richard, we have defined the input language for convenience in prototype
>>> implementation. However, we will be using GIMPLE as our IR. As per
>>> grammar
>>> of our tree, p-tree denote the permute order associated with the
>>> statement,
>>> whereas c-tree is actually the GIMPLE instruction, which performs compute
>>> operation. I tried looking at structures used in SLP, however they can
>>> not
>>> be used as they are, as main difference between current SLP
>>> implementation
>>> in GCC versus our representation is that, permute order in SLP is part of
>>> the tree node in current GCC, whereas in our representation permute order
>>> is
>>> represented as independent tree-node. Hence, I have created new tree
>>> structure for our pass, which will create p-tree nodes for permute order,
>>> and c-tree node which points to appropriate gimple statement.
>>
>>
>> Yes, that's the whole purpose - get the vectorizer (and SLP) a better
>> data structure
>> which is explicit about permutes.
>
>
>>>> I somewhat miss an idea on how to deal with irregular SLP cases - that
>>>> is,
>>>> does the AST model each (current) SLP node as a single statement or
>>>> does it model individual vector elements (so you can insert no-op
>>>> compensation
>>>> code to make the operations match).  Consider
>>>>
>>>>   a[i] = b[i] * 3;
>>>>   a[i+1] = b[i+1];
>>>>
>>>> which can be vectorized with SLP if you realize you can multiply b[i+1]
>>>> by
>>>> 1.
>>>
>>>
>>>
>>> The pass structure we are having for our optimization is as follows:
>>> - New pass target-aware-loop-vect with following subpasses:
>>>   - Loop structure identification
>>>   - Restricted loop construction
>>>   - Loop analysis (6 phases of target-aware reordering-instruction
>>> selection
>>> algorithm)
>>>   - Loop transformation (Cost aware gimple code generation)
>>> We might need some optimizations to be performed on loop body before we
>>> start our optimization for reordering instruction selection, so that
>>> input
>>> program can be transformed to fit into our restricted loop structure.
>>> These
>>> transformations can be performed by restricted loop construction subpass.
>>> Eg.: multi-dimentional array, nested loops, reduction chains etc. can be
>>> supported by transforming input loop. The case that you have mentioned,
>>> can
>>> be transformed at this level.
>
>
>>>> As of implementing the whole thing in GCC I recently spent some more
>>>> time
>>>> thinking and came to the conclusion that the path to the fastest
>>>> improvements
>>>> in GCC code-gen would be to build the new AST after the current analysis
>>>> phase finished, do the optimization and drive code-generation off the
>>>> new
>>>> AST.
>>>> Thus keep things like data-ref and dependence analysis as is, as well as
>>>> group analysis and SLP tree construction.  Build the AST from the
>>>> vectorizer
>>>> "IL" at the point vect_transform_loop / vect_slp_transform_bb is called.
>>>
>>>
>>>
>>> Current pass structure that we have designed, does exactly that. The only
>>> difference is that, we are planning to use the transform phase as a
>>> co-process of our transform pass, than reconstruct an AST from our
>>> representation to pass to vect_transform_loop.
>>>
>>>>
>>>> More and more of the rest of the vectorizer code can then be "slowly"
>>>> moved
>>>> to work on the AST and AST construction be moved earlier.
>
>
> So I think an incremental approach is much more likely to get us some
> benefit, and I'm wondering how much of the benefit here stems from which
> bit, and how many of these changes can be broken off.
>
> For example, making permutes into explicit operations on the SLP tree, and
> adding a pass that tries to push them up/down the tree into each other,
> might be done on the existing codebase, and would be a significant win in
> itself (bringing some/much(?) of improvements in interleaving code of your
> proposal).
>
> I'm not really sure how much of the benefit stems from what, though - e.g.
> does reducing vector length down to fit the target machine only after the
> other steps, bring any benefits besides raising abstraction? (Which
> certainly can be a benefit!). AFAICT this could be done independently of
> other changes?
>
> Another limitation of the present SLP structure is it's not understanding
> sharing (until later CSE, hence costing wrongly); it would be good to fix
> this (making the SLP tree into a DAG, and potentially reusing subtrees by
> adding permutes). This is a step we'd like to take anyway, but might tie in
> with your "Bottom-up commoning of p-node subtrees". And would help with (BB
> as well as loop) cost calculations.
>
> Are there any really fundamental differences that we cannot overcome in
> steps? ( / are there any steps we shouldn't take in isolation, because we'd
> have to 'throw them away' later?)


I think we can incrementally change the point where we go "into" the new
representation.  My plan would be to keep the analysis phase as-is until
we hit the point where we'd want to do cost compute / code generation at which
point we can create the new representation off the existing one and drive both
cost computation and code-generation off it (the immediate benefit with this
"copying" would be that we'd get to have multiple choices "for free" in case
we'd want to compare costs of them).

The main work would be disentangling all the vectorizable_* routines in
tree-vect-stmts.c (and the reduction variant in tree-vect-loop.c) into an
analysis and a code-gen part.

And of course desigining the new IL which would have permutes et al
represented explicitely.

That's the only "incremental" step I can see us performing.  Doing it this
way would also make it possible to largely keep the old code-path while
developing the new one and thus more easily debug issues coming up.
It's IMHO also natural to design the IL from the code-generation side.

Oh, a even more incremental step would be to create the new IL only
for cost computation and do code-gen from the old IL (of course that
doesn't allow for permute optimizations).

Richard.

> Alan

Re: Prototype implementation: Improving effectiveness and generality of auto-vectorization

Reply via email to