Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

Richard Biener Thu, 09 Jun 2016 01:54:44 -0700

On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande
<sameera.deshpa...@imgtec.com> wrote:
> Hi Richard,
>
> This is with reference to our discussion at GNU Tools Cauldron 2015 regarding 
> my talk titled "Improving the effectiveness and generality of GCC 
> auto-vectorization." Further to our prototype implementation of the concept, 
> we have started implementing this concept in GCC.
>
> We are following incremental model to add language support in our front-end, 
> and corresponding back-end (for auto-vectorizer) will be added for feature 
> completion.
>
> Looking at the complexity and scale of the project, we have divided this 
> project into subtasks listed below, for ease of implementation, testing and 
> review.
>
> 0. Add new pass to perform autovectorization using unified representation - 
> Current GCC framework does not give complete overview of the loop to be 
> vectorized : it either breaks the loop across body, or across iterations. 
> Because of which these data structures can not be reused for our approach 
> which gathers all the information of loop body at one place using primitive 
> permute operations. Hence, define new data structures and populate them.
>
> 1. Add support for vectorization of LOAD/STORE instructions
>     a. Create permute order tree for the loop with LOAD and STORE 
> instructions for single or multi-dimensional arrays, aggregates within nested 
> loops.
>     b. Basic transformation phase to generate vectorized code for the 
> primitive reorder tree generated at stage 1a using tree tiling algorithm. 
> This phase handles code generation for SCATTER, GATHER, stridded memory 
> accesses etc. along with permute instruction generation.
>
> 2. Implementation of k-arity promotion/reduction : The permute nodes within 
> primitive reorder tree generated from input program can have any arity. 
> However, the target can support maximum of arity = 2 in most of the cases. 
> Hence, we need to promote or reduce the arity of permute order tree to enable 
> successful tree tiling.
>
> 3. Vector size reduction : Depending upon the vector size for target, reduce 
> vector size per statement and adjust the loop count for vectorized loop 
> accordingly.
>
> 4. Support simple arithmetic operations :
>     a. Add support for analyzing statements with simple arithmetic operations 
> like +, -, *, / for vectorization, and create primitive reorder tree with 
> compute_op.
>     b. Generate vector code for primitive reorder tree generated at stage 4a 
> using tree tiling algorithm - here support for complex patterns like 
> multiply-add should be checked and appropriate instruction to be generated.
>
> 5. Support reduction operation :
>     a. Add support for reduction operation analysis and primitive reorder 
> tree generation. The reduction operation needs special handling, as the 
> finish statement should COLLAPSE the temporary reduction vector TEMP_VAR into 
> original reduction variable.
>     b. The code generation for primitive reorder tree does not need any 
> handling - as reduction tree is same as tree generated in 4a, with only 
> difference that in 4a, the destination is MEMREF (because of STORE operation) 
> and for reduction it is TEMP_VAR. At this stage, generate code for COLLAPSE 
> node in finish statements.
>
> 6. Support other vectorizable statements like complex arithmetic operations, 
> bitwise operations, type conversions etc.
>     a. Add support for analysis and primitive reorder tree generation.
>     b. Vector code generation.
>
> 7. Cost effective tree tiling algorithm : Till now, the tree tiling is 
> happening without considering cost of computation. However, there can be 
> multiple target instructions covering the tree - hence, instead of picking 
> first matched largest instruction cover, select the instruction cover based 
> on cost of instruction given in .md for the target.
>
> 8. Optimizations on created primitive reorder tree : This stage is open 
> ended, and depending upon perf analysis, the scope of it can be defined.
>
> The current patch I have attached herewith handles stage 0 and 1a : Adds new 
> pass to perform autovectorization using unified representation, defines new 
> data structures to cater to this requirement and creates primitive reorder 
> tree for LOAD/STORE instructions within the loop.
>
> The whole loop is represented using the ITER_NODE, which have information 
> about
> - The preparatory statements for vectorization to be executed before entering 
> the loop (like initialization of vectors, prepping for reduction operations, 
> peeling etc.)
> - Vectorizable loop body represented as PRIMOP_TREE (primitive reordering 
> tree)
> - Final statements (For peeling, variable loop bound, COLLAPSE operation for 
> reduction etc.)
> - Other loop attributes (loop bound, peeling needed, dependences, etc.)
>
> Memory accesses within a loop have definite repetitive pattern which can be 
> captured using primitive permute operators which can be used to  determine 
> desired permute order for the vector computations.  The PRIMOP_TREE is AST 
> which records all computations and permutations required to store  
> destination vector into continuous memory at the end of all iterations of the 
>  loop. It can have INTERLEAVE, CONCAT, EXTRACT, SPLIT, ITER or any compute 
> operation as intermediate node. Leaf nodes can either be memory reference, 
> constant or vector of loop invariants. Depending upon the operation, 
> PRIMOP_TREE holds appropriate information about the statement within the loop 
> which is necessary for vectorization.
>
> At this stage, these data structures are populated by gathering all the 
> information of the loop, statements within the loop and correlation of the 
> statements within the loop. Moreover the loop body is analyzed to check if 
> vectorization of each statement is possible. One has to note however that 
> this analysis phase will give worst-case estimate of instruction selection, 
> as it checks if specific named pattern is defined in .md for the target. It 
> not necessarily  give optimal cover which is aim of the transformation phase 
> using tree tiling algorithm - and can be invoked only once the loop body is 
> represented using primitive reoder tree.
>
> At this stage, the focus is to create permute order tree for the loop with 
> LOAD and STORE instructions only. The code we intend to compile is of the form
> FOR(i = 0; i < N; i + +)
> {
>   stmt 1 : D[k ∗ i + d 1 ] =S 1 [k ∗ i + c 11 ]
>   stmt 2 : D[k ∗ i + d 2 ] =S 1 [k ∗ i + c 21 ]
>   ...
>   stmt k : D[k ∗ i + d k ] =S 1 [k ∗ i + c k 1 ]
> }
> Here we are assuming that any data reference can be represented using base + 
> k * index + offset (The data structure struct data_reference from GCC is used 
> currently for this purpose). If not, the address is normalized to convert to 
> such representation.
>
> We are looking forward to your suggestions and insight in this regard for 
> better execution of this project.


I will look at the patch in detail this afternoon and will write up
some comments.

Generally staging the project as described above looks good to me.

Implementing the cost effective tiling will be interesting and probably require
additions to the target interfaces.  Hopefully most vector architectures
are not as weird as x86_64s with its gazillion variants.

> Also, as this is long term project, can we create a branch in GCC to put all 
> our patches at one place so that anyone interested can download and tweak 
> with the code?

Yes, working on a branch is appreciated - anyone with write access can create
a branch in SVN.  See https://gcc.gnu.org/svn.html for where development
branches should be documented.

Thanks,
Richard.

>
> - Thanks and regards,
>   Sameera D.

Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.

Reply via email to