Hi Richard,

This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." We at Imagination Technologies have since finalized the algorithms for the transformations needed to achieve efficient target-aware reordering-instruction selection. We have implemented a prototype in Python to demonstrate the capabilities of the algorithm and to verify the claims made in the presentation.

I am attaching the prototype along with the documented algorithm for review. We will be starting on the design and implementation of the same in GCC, and would be glad to receive comments, feedback and suggestions.

- Thanks and regards,
  Sameera Deshpande
Introduction
============
Current methods of auto-vectorization take the vectorization decision through a
generic analysis which is almost target independent
(except for target features such as vector sizes). However, the instruction
selection mechanism is largely ad hoc, in the sense that the
ideas used for instruction selection on one target may not carry over seamlessly
to another. This has led to a proliferation of
target-specific code in the generic parts of compiler frameworks such as GCC.

We propose a generic approach to reordering-instruction selection based on
the insight that effective data reordering for
vectorization can be defined in terms of a small set of well-defined primitive
operations. Interestingly, the same set of operations can be
used to describe the reordering instructions of almost all targets that we know
of. These operations thus form the basic building blocks of
reorderings for effective vectorization.

Primitive operations
====================
Let V_N denote a vector of N scalar elements. A sequence of k such vectors is
written V_N, ...k..., V_N.
We propose the following operations:
- Concatenation_k^N : V_N x ...k... x V_N --> V_kN. CONCAT_k^N (V_N, ...k...,
V_N) concatenates k vectors of size N to produce a
single vector of size kN.

- Split_k,i^kN : V_kN --> V_N. The split operation is the inverse of
concatenation. SPLT_k,i^kN (V_kN) splits a vector of size kN into
k contiguous vectors of size N each and returns the i-th vector.

- Interleave_k^N : V_N x ...k... x V_N --> V_kN. ILV_k^N (V_N, ...k..., V_N)
creates a vector of size kN such that the i-th element of the output
  vector is the (i/k)-th element of the (i%k)-th input vector.

- Extract_k,i^kN : V_kN --> V_N. The extract operation is the inverse of
interleave. EXTR_k,i^kN (V_kN) divides the input vector
 into N parts of k elements each and creates k vectors of size N by selecting
the j-th element of each part for the j-th vector; it returns the i-th such vector.

- Permute^N : V_N x I_M --> V_N. The permute operation takes a vector of size N
and an integer permute order of size M. It divides the vector into N/M
  subvectors of size M and permutes each subvector with the permute order I_M.
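
These primitives can be modelled directly on Python lists, which makes it easy
to sanity-check the definitions. A minimal sketch (function names are
illustrative, not the prototype's API):

```python
# List-based models of the primitive operations. A "vector" is a list.

def concat(*vecs):
    # CONCAT_k^N: concatenate k vectors of size N into one of size kN.
    return [x for v in vecs for x in v]

def split(k, i, vec):
    # SPLT_k,i^kN: split a vector of size kN into k contiguous vectors
    # of size N and return the i-th one.
    n = len(vec) // k
    return vec[i * n:(i + 1) * n]

def ilv(*vecs):
    # ILV_k^N: round-robin interleave; output element i is element
    # i // k of input vector i % k.
    return [v[j] for j in range(len(vecs[0])) for v in vecs]

def extr(k, i, vec):
    # EXTR_k,i^kN: inverse of ILV; gather every k-th element starting
    # at offset i.
    return vec[i::k]

def perm(order, vec):
    # PERM^N: permute each size-M subvector of vec by 'order' (length M).
    m = len(order)
    return [vec[base + j] for base in range(0, len(vec), m) for j in order]
```

For example, `extr(2, 1, ilv(a, b))` returns `b`, reflecting that extract
inverts interleave.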

Phase structure of Algorithm
============================
The algorithm for target-aware reordering-instruction selection for efficient
auto-vectorization is divided into the following phases:
Phase 1 : Create tree for each loop per destination - Implemented by Sameera Deshpande
Phase 2 : k-arity reduction/promotion - Implemented by Sameera Deshpande
Phase 3 : Top-down distribution of p-node subtree - Implemented by Sameera Deshpande
Phase 4 : Optimizations - Implemented by Sameera Deshpande
Phase 5 : Bottom-up commoning of p-node subtree - Implemented by Sameera Deshpande
          - Cost computation of p-tree - Implemented by Sameera Deshpande
Phase 6 : Vector-size reduction - Implemented by Sameera Deshpande
Phase 4 (again): Optimization - Implemented by Sameera Deshpande
Phase 7 : Cost computation - store min_cost_tree - Implemented by Prachi Godbole
Phase 8 : Target-specific code generation for min_cost_tree - Implemented by Prachi Godbole

Input loop
==========
To enable auto-vectorization using this algorithm, the loop structure should be
as follows:
a loop with k statements reading from S_1, S_2, ... S_m and writing to
destination D, where stmt_j is of the form
D[k * i + d_j] = op_j (S_1[k_1 * i + c_j1], S_2[k_2 * i + c_j2], ... S_m[k_m * i + c_jm])
such that

– All statements stmt_1, ... stmt_k in the loop body are independent and
vectorizable for a target with vector size VS.
– Iteration count N > VS.
– Induction variable i of the loop is not altered by stmt_1, stmt_2 ...
stmt_k.
– All of S_1, ... S_m and D are of the same type.
– The indices of all S and D have the same multiplicative factor. If the
factors differ, normalize by taking their LCM and duplicate the statements
  appropriately, adjusting the offsets.
– c_jp < k_p


FOR(i = 0 ... N - 1)
{
  stmt 1 : D[k * i + d_1] = op_1 (S_1[k_1 * i + c_11], S_2[k_2 * i + c_12], ... S_m[k_m * i + c_1m])
  stmt 2 : D[k * i + d_2] = op_2 (S_1[k_1 * i + c_21], S_2[k_2 * i + c_22], ... S_m[k_m * i + c_2m])
  ...
  stmt k : D[k * i + d_k] = op_k (S_1[k_1 * i + c_k1], S_2[k_2 * i + c_k2], ... S_m[k_m * i + c_km])
}

For ease of implementation, the loop information is taken as input in a small
language defined as follows:

Input language
--------------
loop_begin       ->    LOOP '(' NUMBER ')' dest_stmt_list
dest_stmt_list   ->    dest_stmt_list dest_stmt                         | dest_stmt
dest_stmt        ->    VARIABLE '[' NUMBER ',' NUMBER ']' '=' cop_tree
cop_tree         ->    VARIABLE '[' NUMBER ',' NUMBER ']'               | c-op '(' cop_tree_list ')'
cop_tree_list    ->    cop_tree_list ',' cop_tree                       | cop_tree
c-op             ->    Any order-preserving compute operation.

So, the above given loop can be represented as follows in the input language:

LOOP (N)
D[k,d_1] = op_1 (S_1[k_1, c_11], S_2[k_2, c_12], ... S_m[k_m, c_1m])
D[k,d_2] = op_2 (S_1[k_1, c_21], S_2[k_2, c_22], ... S_m[k_m, c_2m])
...
D[k,d_k] = op_k (S_1[k_1, c_k1], S_2[k_2, c_k2], ... S_m[k_m, c_km])

Algorithm
=========

Phase 1: Create AST for each loop per destination D.
---------------------------------------------------
Input:
- - - 
Program defined using input language.

Procedure:
- - - - -
- Identify loop count N.
- Identify the respective multiplicative factors k for the destination and
source arrays.
- Create an ILV_k^N node of arity k which writes to destination array D.
Its vector size equals the loop count.
- Create a vectorized AST for each statement within the loop. Each operation
op_j in statement j is vectorized to vop_j depending upon the
   target architecture. If statement j accesses memory S_p[k_p * i + c_jp],
create EXTR_k_p,c_jp on source S_p. If statement j
   writes to location D[k * i + d_j] in the iteration, attach the statement's
AST as the d_j-th operand of ILV_k^N.
- For each destination array D, repeat from the ILV_k^N creation step.
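
As an illustration, consider the hypothetical two-statement loop
D[2i] = S[2i] + S[2i+1]; D[2i+1] = S[2i] - S[2i+1]. Phase 1 would build
ILV_2(vadd(EXTR_2,0(S), EXTR_2,1(S)), vsub(EXTR_2,0(S), EXTR_2,1(S))).
A sketch checking this tree against the scalar loop, using list-based models
of ILV/EXTR (not the prototype's code):

```python
# List-based models of the primitives and two vector compute ops.
def ilv(*vecs):                      # ILV_k: round-robin interleave
    return [v[j] for j in range(len(vecs[0])) for v in vecs]

def extr(k, i, vec):                 # EXTR_k,i: inverse of ILV
    return vec[i::k]

def vadd(a, b): return [x + y for x, y in zip(a, b)]
def vsub(a, b): return [x - y for x, y in zip(a, b)]

S = [5, 1, 7, 2, 9, 3, 4, 8]         # N = 4 iterations, k = 2

# The vectorized AST built by phase 1, evaluated directly:
D = ilv(vadd(extr(2, 0, S), extr(2, 1, S)),
        vsub(extr(2, 0, S), extr(2, 1, S)))

# Reference: the original scalar loop.
ref = [0] * len(S)
for i in range(len(S) // 2):
    ref[2 * i]     = S[2 * i] + S[2 * i + 1]
    ref[2 * i + 1] = S[2 * i] - S[2 * i + 1]

assert D == ref
```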

Output:
- - - - 
Output of phase 1 is an AST which can be defined using following grammar:

tree   ->       ptree                                   | ctree                 | ITER (tree)
ptree  ->       p-op_k (tree_1, tree_2, ... , tree_k)
ctree  ->       c-op_k (tree_1, tree_2, ... , tree_k)   | MEM (src-name)
p-op_k ->       Permute operation of arity k.
c-op_k ->       Compute operation of arity k.

This phase is implemented in
- scanner.lex - Lexical analyzer to scan the input
- parser.y - Syntax analyzer and AST generator
- header.h - Header file

Commands to Execute:
To build phase 1:
flex scanner.lex
bison parser.y -d -v
gcc parser.tab.c lex.yy.c -g -o parser.out

To run phase 1:
./parser.out < <inputFile> > dmp.txt

Phase 2: k-arity reduction/promotion.
------------------------------------
Input:
- - -
AST constructed from dmp.txt rooted at root_node

Procedure:
- - - - -
IF root_node is p-op:
- Let m be machine arity
- Let k be arity of root_node

- IF m < k: arity reduction
  - k % m must be 0; otherwise auto-vectorization is not possible.
  - IF p-op of root_node is EXTR:
       EXTR_k,i^N               EXTR_m,(i/(k/m))%m^Nm/k
            |          =>                |
            |                            |
            T                   EXTR_k/m,i%(k/m)^N      <=== Recursion on newly 
created EXTR node
                                         |
                                         |
                                         T

  - ELSE IF p-op of root_node is ILV:
          ILV_k^N                              ILV_m^Nk/m
        /    |    \        =>                  /   |    \
       /     |     \                        /      |       \
      T1    T2 ... Tk                   /          |   ...     \
                                    /              |              \
                                ILV_k/m^N      ILV_k/m^N      ILV_k/m^N    <=== 
Recursion on all new ILV nodes
                                 /  |  \        /  |  \        /  |  \
                                /   |...\      /   |...\      /   |...\
                              T1 Tm+1 Tk-m+1  T2 Tm+2 Tk-m+2 Tm  T2m  Tk
- ELSE IF m == k:
  - The node already has the desired arity. Hence, invoke phase 2 recursively
on child(root_node).

- ELSE IF m > k: arity promotion
  - m % k must be 0; otherwise auto-vectorization is not possible.
  - The algorithm for ILV-promotion introduces new EXTR nodes, whereas the
algorithm for EXTR-promotion introduces new ILV nodes. To avoid an
    infinite loop, only one of ILV-promotion and EXTR-promotion can be
implemented. As phase 2 is designed as a top-down algorithm, ILV-promotion
    is implemented, with the expectation that propagating the newly created
EXTR nodes creates opportunities for arity reduction. If this cannot
    be done, auto-vectorization is not possible. So,
  - IF p-op of root_node is EXTR:
    - Do not perform auto-vectorization.
  - ELSE IF p-op of root_node is ILV:
            ILV_k^N                            ILV_m^Nk/m
          /    |    \        =>      /    /       |         \        \
         /     |     \              /    /        |          \        \
        T1    T2 ... Tk         EXTR     EXTR     EXTR   ...  EXTR     EXTR
                                m/k,0    m/k,0    m/k,1       m/k,m/k-1
                                  |        |        |            |       |
                                  T1 ...   Tk       T1   ...     T1 ...  Tk

    i.e. the promoted ILV_m has m children: for each extract index
    j = 0 ... m/k-1, the k nodes EXTR_m/k,j(T1), ... EXTR_m/k,j(Tk).

                      
ELSE #root_node is c-op or ITER or MEM
- Invoke phase 2 recursively on child(root_node)
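
The ILV arity-reduction rule above can be sanity-checked on small list-based
vectors. A sketch with k = 4 children and machine arity m = 2 (the `ilv`
helper is an illustrative model of ILV, not the prototype's API):

```python
# Check ILV_4(T1,T2,T3,T4) == ILV_2(ILV_2(T1,T3), ILV_2(T2,T4)):
# child j of the outer ILV_m interleaves T_(j+1), T_(j+1+m), ...

def ilv(*vecs):
    # ILV_k: round-robin interleave of k equal-length vectors.
    return [v[j] for j in range(len(vecs[0])) for v in vecs]

T1, T2, T3, T4 = [1, 5], [2, 6], [3, 7], [4, 8]

assert ilv(T1, T2, T3, T4) == ilv(ilv(T1, T3), ilv(T2, T4)) == [1, 2, 3, 4, 5, 6, 7, 8]
```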

Output:
- - - -
Output of phase 2 is an AST in which the arity of every p-node is adjusted to
the arity of the target's reordering instructions.

This phase is implemented in Transformations.py and can be invoked through the
driver function k_arity_reduction_promotion.

Phase 3: Top-down distribution of p-node subtree
------------------------------------------------
Input:
- - -
Input of phase 3 is an AST rooted at root_node with p-node arity adjusted to
the arity of the target's reordering instructions.

Procedure:
- - - - -
This phase accumulates all reordering operations on the source operands, which
makes effective instruction selection possible.
- Select a subtree T_N of the input expression tree such that all nodes within
the subtree are p-ops and all n children of the subtree
  are rooted at the same c-op node. Each c-op node has arity k.
                         ^
                        /  \
                       /    \
                      / T_N  \
                     +========+
                   ==============
                   /     |      \
                  /      |       \
                COP_k   ...     COP_k
                 /|\           /  |  \
                / | \         /   |   \
               C0...Ck-1    C_kN-k...C_kN-1  
   
               
- If such a subtree T_N can be found, construct a tree rooted at COP_k and
propagate T_N onto each child of COP_k.
                         
                       COP_k 
                     /   |  \
                    /    |   \
                   /     |    \
                  /      |     \
                 T_N    ...    T_N
                 /|\         /  |  \
                / | \       /   |   \
               C0...CNk-k  C_k-1...C_kN-1
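
The rewrite is legal for element-wise c-ops because a permute of results
equals the c-op applied to suitably regrouped, permuted inputs. A sketch with
T_N = ILV_2 and an element-wise add (list-based models, not the prototype's
code):

```python
# ILV_2(add(a,b), add(c,d)) == add(ILV_2(a,c), ILV_2(b,d))

def ilv(*vecs):
    # ILV_k: round-robin interleave of k equal-length vectors.
    return [v[j] for j in range(len(vecs[0])) for v in vecs]

def vadd(a, b): return [x + y for x, y in zip(a, b)]

a, b = [1, 2], [3, 4]
c, d = [5, 6], [7, 8]

assert ilv(vadd(a, b), vadd(c, d)) == vadd(ilv(a, c), ilv(b, d))
```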

Output:
- - - -
AST rooted at the newly created cop-node, or the original tree.

This phase is implemented in Transformations.py and can be invoked using 
top_down_distribution.

Phase 4: Optimization
---------------------
Input:
- - -
AST rooted at root_node with accumulated reordering operations whenever 
possible. 

Procedure:
- - - - -
The distribution of reordering operations (p-ops) can create opportunities for
redundancy elimination. A few identities that can
be eliminated are:
    1. ILV_m (EXTR_0(S), EXTR_1(S), ... EXTR_m-1(S)) => S
    2. EXTR_m,x (ILV_m(S1, S2, ... Sm)) => Sx
    3. CONCAT_m (SPLT_0(S), SPLT_1(S), ... SPLT_m-1(S)) => S
    4. SPLT_m,x (CONCAT_m(S1, S2, ... Sm)) => Sx
This phase processes tree nodes in post order, so that maximum redundancy
elimination can be achieved in a single pass over the tree.
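
The four identities can be checked directly on list-based models of the
primitives (a sketch; function names are illustrative):

```python
# List-based primitives: each identity below reduces to the plain source.
def concat(*vecs): return [x for v in vecs for x in v]
def split(k, i, v): n = len(v) // k; return v[i * n:(i + 1) * n]
def ilv(*vecs): return [v[j] for j in range(len(vecs[0])) for v in vecs]
def extr(k, i, v): return v[i::k]

S = [1, 2, 3, 4, 5, 6]
S1, S2 = [1, 2, 3], [4, 5, 6]

assert ilv(extr(2, 0, S), extr(2, 1, S)) == S            # identity 1
assert extr(2, 1, ilv(S1, S2)) == S2                     # identity 2
assert concat(split(2, 0, S), split(2, 1, S)) == S       # identity 3
assert split(2, 0, concat(S1, S2)) == S1                 # identity 4
```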

Output:
- - - -
AST rooted at root_node with identity elimination performed.

This phase is implemented in Transformations.py and can be invoked using 
optimizations.

Phase 5: Bottom-up commoning of p-node subtrees
-----------------------------------------------
Input:
- - -
Optimized tree rooted at root_node.

Procedure:
- - - - -
Each p-tree represents a set of permute instructions to be generated on the
sources of the p-tree. If a single reordering instruction can be
generated for a tree rooted at a c-node, then commoning out a subpart of a
child p-tree is always more expensive. Hence, commoning out a sub p-tree
is efficient only if a single instruction cannot be generated for each child
of the c-node.

This commoning gives better performance if, for a k-ary c-tree,
- Each child p-tree has cost > 1.
- Each child p-tree has a common sub p-tree with cost c - so k * c
instructions can be saved if the sub p-tree can be commoned across all
  children.
- The commoned sub p-tree now operates on the result, hence k * c instructions
are replaced by c instructions.

- Select a common subtree T_N from all children of the root COP_k node, such
that all nodes within the subtree are p-ops
  and T_N is not a subtree of a single-instruction match.

                       COP_k 
                   ==============
                   /     |      \
                  /      |       \
                 T_N   ...       T_N
                 /|\           /  |  \
                / | \         /   |   \
               C0...CNk-k   Ck-1 ... CkN-1
               
               
- If such a subtree T_N can be found, common out T_N from all children of
COP_k and construct a tree rooted
  at T_N, with each child a tree rooted at COP_k performing the c-op on the
appropriate children of the original tree.
                         ^
                        /  \
                       /    \
                      / T_N  \
                     +========+
                   ==============
                   /     |      \
                  /      |       \
                COP_k   ...     COP_k
                 /|\           /  |  \
                / | \         /   |   \
               C0...Ck-1    CkN-k ...CkN-1  
   
Output:
- - - -
The p-tree-commoned tree rooted at the new p-node, or the original tree.

This phase is implemented in Transformations.py and can be invoked using 
bottom_up_commoning.

Phase 6: Vector-size reduction
------------------------------
Input:
- - -
AST rooted at root_node

Procedure:
- - - - -
- Take the GCD of the vec_size fields of all nodes in the tree.
- IF the GCD is divisible by the target's vec_size,
  - create an ITER node iterating GCD/target_vec_size times.
- ELSE
  - do not vectorize.
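
A minimal sketch of this check (names are illustrative; the prototype's
actual interface is vs_reduction in Transformations.py):

```python
from functools import reduce
from math import gcd

def iter_count(node_vec_sizes, target_vs):
    # GCD of all nodes' vec_size fields decides the working vector size.
    g = reduce(gcd, node_vec_sizes)
    if g % target_vs == 0:
        return g // target_vs        # iteration count of the ITER node
    return None                      # do not vectorize

# gcd(16, 8, 24) = 8; with a target vector size of 4 the ITER node
# iterates 8 // 4 = 2 times.
```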

Output:
- - - -
AST rooted at ITER node with adjusted vector size in each node.

This phase is implemented in Transformations.py and can be invoked using 
vs_reduction.

Phase 7: Cost Computation
-------------------------
This phase covers the entire expression tree with tiles of target
instructions. For each node of the tree it records the selected instructions
along with their costs in a tuple <instruction, cost for input VS, desired
input VS>. For each target architecture, the reordering
instructions of the target are represented using the primitive operations. A
finite automaton is then constructed by creating transitions per p-op
node in each instruction. In the final state of each reordering instruction,
the tuple is populated. Along with the state transitions for each
target instruction, we also generate states in the automaton to match single
ILV and EXTR nodes, whose final states give tuples for a
combination of instructions or a combination of inputs respectively.

Input:
- - - 
AST for cost computation rooted at root_node.

Procedure for instruction selection:
- - - - - - - - - - - - - - - - - -
- Traverse the tree top-down, and for each subtree:
  - Transition over the generated automaton.
  - If a final state is reached, add the populated cost tuple to the root of
the matched subtree.

Output:
- - - -
Tree with cost-annotated nodes.

This phase is implemented in CostComputation.py and can be invoked using 
compute_cost

Phase 8: Code generation
------------------------
Input:
- - -
AST rooted at root_node with cost-tuple annotations at each node.

Procedure:
- - - - -
- Create a DAG from the AST.
- Scan the DAG bottom-up so that code for all children is generated before
their parent node.
- Depending upon the vector size of a tree node and the input vector size
expected by the selected instruction, tree_node(vec_size)/<desired input VS>
  instructions are generated.

Output:
- - - -
Assembly code or intermediate code.

This phase is implemented in CostComputation.py and can be invoked using 
generate_code.


TODOs
=====
- Implementation of ILV <-> CONCAT and EXTR <-> SPLT conversions.
- Implementation of PERM for target instructions of type vec_perm.


Currently, this algorithm can optimize loops which adhere to the input loop
structure described earlier. However, some
optimizations and loop transformations can help transform unacceptable loops
into an acceptable structure:
- Loop interchange - for contiguous element access in multi-dimensional arrays.
- Loop peeling - to get loop count % target VS = 0.
- Loop fusion - to avoid voids in destinations and to reuse sources better.
- Loop fission - to separate different destinations.

Attachment: final_prototype.tgz
Description: application/compressed-tar
