________________________________________
From: Richard Biener [richard.guent...@gmail.com]
Sent: 06 July 2016 16:46:17
To: Sameera Deshpande
Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
On Wed, Jul 6, 2016 at 12:49 PM, Sameera Deshpande <sameera.deshpa...@imgtec.com> wrote:
> ________________________________________
> From: Sameera Deshpande [sameera.deshpa...@imgtec.com]
> Sent: 20 June 2016 11:37:58
> To: Richard Biener
> Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
> Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
>
> On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
>> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande <sameera.deshpa...@imgtec.com> wrote:
>>> On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:
>>>> On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener <richard.guent...@gmail.com> wrote:
>>>>> On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande <sameera.deshpa...@imgtec.com> wrote:
>>>>>>
>>>>>> Hi Richard,
>>>>>>
>>>>>> This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." Further to our prototype implementation of the concept, we have started implementing it in GCC.
>>>>>>
>>>>>> We are following an incremental model: language support is added in our front-end first, and the corresponding back-end support (for the auto-vectorizer) is added afterwards for feature completion.
>>>>>>
>>>>>> Given the complexity and scale of the project, we have divided it into the subtasks listed below, for ease of implementation, testing and review.
>>>>>>
>>>>>> 0. Add a new pass to perform autovectorization using the unified representation - the current GCC framework does not give a complete overview of the loop to be vectorized: it either breaks the loop up across its body, or across iterations.
>>>>>> Because of this, these data structures cannot be reused for our approach, which gathers all the information about the loop body in one place using primitive permute operations. Hence, define new data structures and populate them.
>>>>>>
>>>>>> 1. Add support for vectorization of LOAD/STORE instructions
>>>>>> a. Create the permute order tree for the loop with LOAD and STORE instructions for single- or multi-dimensional arrays and aggregates within nested loops.
>>>>>> b. Basic transformation phase to generate vectorized code for the primitive reorder tree generated at stage 1a using the tree tiling algorithm. This phase handles code generation for SCATTER, GATHER, strided memory accesses etc., along with permute instruction generation.
>>>>>>
>>>>>> 2. Implementation of k-arity promotion/reduction: The permute nodes within the primitive reorder tree generated from the input program can have any arity. However, in most cases the target supports a maximum arity of 2. Hence, we need to promote or reduce the arity of the permute order tree to enable successful tree tiling.
>>>>>>
>>>>>> 3. Vector size reduction: Depending upon the vector size for the target, reduce the vector size per statement and adjust the loop count for the vectorized loop accordingly.
>>>>>>
>>>>>> 4. Support simple arithmetic operations:
>>>>>> a. Add support for analyzing statements with simple arithmetic operations like +, -, *, / for vectorization, and create the primitive reorder tree with compute_op.
>>>>>> b. Generate vector code for the primitive reorder tree generated at stage 4a using the tree tiling algorithm - here support for complex patterns like multiply-add should be checked and the appropriate instruction generated.
>>>>>>
>>>>>> 5. Support reduction operations:
>>>>>> a.
Add support for reduction operation analysis and primitive reorder tree generation. The reduction operation needs special handling, as the finish statement should COLLAPSE the temporary reduction vector TEMP_VAR into the original reduction variable.
>>>>>> b. The code generation for the primitive reorder tree does not need any special handling - the reduction tree is the same as the tree generated in 4a, the only difference being that in 4a the destination is a MEMREF (because of the STORE operation), whereas for reduction it is TEMP_VAR. At this stage, generate code for the COLLAPSE node in the finish statements.
>>>>>>
>>>>>> 6. Support other vectorizable statements like complex arithmetic operations, bitwise operations, type conversions etc.
>>>>>> a. Add support for analysis and primitive reorder tree generation.
>>>>>> b. Vector code generation.
>>>>>>
>>>>>> 7. Cost-effective tree tiling algorithm: So far, the tree tiling happens without considering the cost of computation. However, there can be multiple target instructions covering the tree - hence, instead of picking the first matching largest instruction cover, select the instruction cover based on the cost of the instructions given in the .md file for the target.
>>>>>>
>>>>>> 8. Optimizations on the created primitive reorder tree: This stage is open ended; depending upon performance analysis, its scope can be defined.
>>>>>>
>>>>>> The current patch attached herewith handles stages 0 and 1a: it adds a new pass to perform autovectorization using the unified representation, defines new data structures to cater to this requirement, and creates the primitive reorder tree for LOAD/STORE instructions within the loop.
>>>>>>
>>>>>> The whole loop is represented using the ITER_NODE, which has information about
>>>>>> - the preparatory statements for vectorization to be executed before entering the loop (like initialization of vectors, prepping for reduction operations, peeling etc.),
>>>>>> - the vectorizable loop body, represented as a PRIMOP_TREE (primitive reordering tree),
>>>>>> - the final statements (for peeling, variable loop bounds, the COLLAPSE operation for reduction etc.), and
>>>>>> - other loop attributes (loop bound, peeling needed, dependences, etc.).
>>>>>>
>>>>>> Memory accesses within a loop have a definite repetitive pattern which can be captured using primitive permute operators, which in turn can be used to determine the desired permute order for the vector computations. The PRIMOP_TREE is an AST which records all computations and permutations required to store the destination vector into contiguous memory at the end of all iterations of the loop. It can have INTERLEAVE, CONCAT, EXTRACT, SPLIT, ITER or any compute operation as an intermediate node. Leaf nodes can be a memory reference, a constant, or a vector of loop invariants. Depending upon the operation, the PRIMOP_TREE holds the information about the statement within the loop which is necessary for vectorization.
>>>>>>
>>>>>> At this stage, these data structures are populated by gathering all the information about the loop, the statements within it, and the correlation of those statements. Moreover, the loop body is analyzed to check whether vectorization of each statement is possible. Note, however, that this analysis phase gives a worst-case estimate of instruction selection, as it only checks whether a specific named pattern is defined in the .md file for the target.
>>>>>> It does not necessarily give the optimal cover, which is the aim of the transformation phase using the tree tiling algorithm - and which can be invoked only once the loop body is represented using the primitive reorder tree.
>>>>>>
>>>>>> At this stage, the focus is to create the permute order tree for loops with LOAD and STORE instructions only. The code we intend to compile is of the form
>>>>>>
>>>>>> for (i = 0; i < N; i++)
>>>>>>   {
>>>>>>     stmt 1: D[k*i + d_1] = S_1[k*i + c_11]
>>>>>>     stmt 2: D[k*i + d_2] = S_1[k*i + c_21]
>>>>>>     ...
>>>>>>     stmt k: D[k*i + d_k] = S_1[k*i + c_k1]
>>>>>>   }
>>>>>>
>>>>>> Here we are assuming that any data reference can be represented as base + k * index + offset (the data structure struct data_reference from GCC is currently used for this purpose). If not, the address is normalized to convert it to such a representation.
>>>>>>
>>>>>> We are looking forward to your suggestions and insight in this regard for better execution of this project.
>>>>>
>>>>> I will look at the patch in detail this afternoon and will write up some comments.
>>>>
>>>> Ok, so here we go.
>>>
>>> Hi Richard,
>>>
>>> Thanks for your detailed review. Please find my comments inlined.
>>>
>>>> I see you copy quite some code from the existing vectorizer - rather than doing this and adding a "new" pass I'd make the flag_tree_loop_vectorize_unified flag guard code inside the existing vectorizer - thus share the pass.
>>>
>>> I agree with you that I have copied a lot of code from the current implementation in GCC, and some data structures seem redundant as they hold the same information that is already available. However, as per our discussion at the Cauldron, I tried to generate ptrees after all the analysis phases were done, using the information generated there.
>>> However, because the information is scattered across various statements, loops and their respective info nodes, it did not yield much. Moreover, it turned out that many functions, or parts of them, were not very useful, and some stmt- or loop-related information needed a different way of computation in our scheme than in GCC's framework. So, instead of tweaking GCC's codebase and corrupting the information - thereby making it unusable if our optimization fails and the loop falls back to the default vectorizer - we created a new pass.
>>>
>>> I agree that currently it looks like we are reusing most of the checks and conditions for vectorization as in GCC - for the first phase, our aim is to match the performance of GCC before adding further optimizations. However, we plan to increase the scope further, and will therefore need to change the checks and the data generated accordingly.
>>>
>>> E.g., peeling information won't be generated until the transformation phase starts, where it will be added to the ITER_node depending upon alignment requirements, ITER_count and the VEC_SIZE of the target.
>>>
>>> Or, scatter/gather information is not needed, as tree tiling for vector load/store has different tiles for those instructions if the target supports them.
>>>
>>> Also, the way reduction operations are to be implemented in this scheme makes all the categorization of reduction operations in current GCC redundant.
>>>
>>> However, if you still think it is a good idea to reuse the same data structures and functions, I have an older patch available, which I can clean up, update with the new ptree generation part, and share.
>>
>> My comment was mostly about making patches smaller and thus easier to review. I do expect that little of the copied code will remain as-is.
>>
> Richard, the way we have phased out the project, the first patch is a big one, as we had to do all the initializations and checks relevant to auto-vectorization in it. However, we expect the following patches to be smaller, each dedicated to a single functionality. It is difficult to factor this patch, as the changes we have added are all needed for the functionality - to vectorize load/store operations.
>
>>>> Similarly I'd re-use loop_vinfo and simply add fields to it for the new vectorizer so you can dispatch to existing vectorizer functions "easily". Otherwise it's going to be hard to keep things in-sync. Likewise for your stmt_attr and the existing stmt_vec_info.
>>>
>>> The main reason behind creating a new ITER_node instead of reusing loop_vinfo is to capture the whole loop as a single entity, thereby allowing optimizations on nested loops as well. The ITER_node, when implemented completely, can handle vectorization of multiple nested loops: like all permute nodes, even an ITER_node can be distributed over each element in ITER_node.stmts - and can be propagated across the compute_tree in the permute order tree. So, for future expansion, we are creating the ITER_node with copies of some loop_vinfo fields, and do not compute the fields which are not needed for this scheme.
>>>
>>> Similar is the case with stmt_vinfo - the information gathered in stmt_vinfo is mostly usable for SLP optimizations, which we cannot use as-is. So, instead of generating redundant information, or altering the generated information, we chose to create a new data structure.
>>>
>>>> Why isn't the existing data_reference structure good enough and you need to invent a struct mem_ref? Even if that might be leaner adding new concepts makes the code harder to understand. You can always use dr->aux to add auxiliary data you need (like the existing vectorizer does).
>>>
>>> We can use the data_reference structure as-is; however, each of its components has a very specific meaning associated with it - whereas the mem-ref has only three components which are important: the stride, the offset - which is less than the stride - and the remaining base - the address of the first element. So, again, instead of overriding the current semantics of the components of data_ref, we created a new data structure.
>>
>> I was worried at the point you made an existing GCC function take data-ref pieces to make your mem-ref look like a data-ref. If the data-ref doesn't give you the information you want and that is important it seems to me it is better to fix that. It sounds to me you want to look at innermost behavior, DR_STEP and DR_BASE_ADDRESS + DR_OFFSET + DR_INIT where a part of DR_INIT is your 'offset' (I think it is what the current vectorizer computes as GROUPs and offset would be the offset inside the group, relative to the first element?)
>>
> I agree that the data structure mem-ref is redundant; I will eliminate it and use an altered data-ref. We can say that the computations to find a mem-ref are similar to those used to form GROUPs - however, a GROUP is a collection of all the statements which access elements of an aggregate at a definite interval, whereas a mem-ref looks at a small window of the aggregate with reference to the stride, and the offset addressing is always relative to the stride. Also, a mem-ref is not a collection of statements, but a place-holder identifying a contiguous piece of the aggregate within memory of size = stride. However, I can use a data-ref instead of a mem-ref as the place-holder.
>
>>>> It helps if you follow the GCC coding-conventions from the start - a lot of the new functions miss comments.
>>>
>>> Sorry about that. I will tidy up the patch.
>>>>
>>>> I realize the patch is probably still a "hack" and nowhere a final step 0/1a, but for example you copied the iteration scheme over vector sizes. I hope the new vectorizer can do without that and instead decide on the vector size as part of the cost effective tiling algorithm.
>>>
>>> I didn't get this point clearly. For my understanding: is the objection to the use of ITER_COUNT instead of VEC_SIZE for each statement? If that is the case, there is a basic difference between current GCC's implementation and this scheme - in GCC, we always look for the standard pattern name to be matched for a scalar instruction, for which VEC_TYPE becomes crucial. This scheme, in contrast, represents each statement within the loop as a vector with elements of type SCALAR_TYPE and initial_vec_size = ITER_COUNT (since that many instances of the statement will be executed). Then, depending upon the VEC_SIZE of the target, VEC_SIZE reduction is applied to it, on top of which the tiling algorithm operates. So, for tiling, we will have to use VEC_SIZE. However, for all other optimizations and transformations, we let each node of the permute order tree have vec_size = ITER_COUNT, to allow free movement of permute-nodes across compute-nodes, and various optimizations on them.
>>
>> The comment was about you copying the loop iterating over target vector sizes and repeating vectorization attempts with that single vector size. I hope the cost effective tiling will simply choose a single vector size based on cost and iteration count. Your description above suggests so thus the copying of the vector size iteration seems spurious and if then should be done _after_ building the ptree (and sharing the ptree).
> Yes, as I mentioned in my previous mail:
>
>> (I have kept the loop checking for different vector types in vect_analyze_loop_with_prim_tree() for now as I am still not very clear at what point in the transformation phase it will play any role... Once the end-to-end solution is developed for load/store chains, there will be some clarity - and the loop can be moved there or eliminated completely.)
>
> So, this loop will be eliminated once the backend for tree tiling of load/store is implemented.
>
>>>> As the main feature of the patch is creation of the ptree (no code generation seems to be done?) the biggest missing part is dumping the ptree in some nice form (like in DOT format and/or to the dump file).
>>>
>>> Yes, the ptree dump was put on the back burner because I was focusing on the functionality. However, I will share the patch for dumping the ptree in DOT format shortly.
>>
>> Thanks!
>>
>>>> You do seem to check for target capabilities at ptree build time (looking at vectorizable_store). I think that is a mistake as we'd ideally have the same ptree when considering different vector sizes (or even generic vectors). Target capabilities will differ between vector sizes.
>>>
>>> That's right. However, as long as ITER_COUNT > VEC_SIZE of the target, there is at least one cover available for the ptree we are creating. Hence, this check only verifies that a certain instruction can be represented somehow on the target architecture. This will actually be handled in the transform phase, after vec_size reduction is performed. (I have kept the loop checking for different vector types in vect_analyze_loop_with_prim_tree() for now as I am still not very clear at what point in the transformation phase it will play any role...
Once the end-to-end solution is developed for load/store chains, there will be some clarity - and the loop can be moved there or eliminated completely.)
>>
>> Ok.
>>
>> One limit of the current vectorizer is that it wants to use a single vector size.
>>
>> In principle vectorizable_* could compute a vector size mask for each scalar statement (ptree node) noting the sizes the target has support for encoding the stmt. Similar to permute nodes the ptree can have size-change nodes in case supported input/output size have no match.
>
> We can use the macro TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES to get the mask, and perform cost-effective instruction selection for the optimal vector size, rather than considering the availability of a single vector size in get_vectype_for_scalar_type_and_size(). However, I will check whether it needs to be done as early as vectorizable_*, or can be postponed until the tree tiling algorithm runs.
>>
>> Thanks,
>> Richard.
>>
>>>> That's all for now.
>>>>
>>>> Last but not least - at the Cauldron I suggested to incrementally rewrite the existing vectorizer by building the ptree at the point it is "ready" for code generation and perform the permute optimizations on it and then re-write the code generation routines to work off the ptree.
>>>>
>>>> Thanks,
>>>> Richard.
>>>
>>> - Thanks and regards,
>>> Sameera D.
>
>> I will create a new branch in GCC as you directed in the other thread, and create incremental patches for the changes suggested here. Will that be fine?
>>
>> - Thanks and regards,
>> Sameera D.
>
> Hi Richard,
>
> Sorry for the delay with this updated patch. Please find attached the patch with changes addressing the review comments.
>
> Is it ok to put in the branch?

Sure.

Thanks Richard, will create the branch.

- Thanks and regards,
Sameera D.

Thanks,
Richard.

> - Thanks and regards,
> Sameera D.