________________________________________
From: Richard Biener [richard.guent...@gmail.com]
Sent: 06 July 2016 16:46:17
To: Sameera Deshpande
Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
On Wed, Jul 6, 2016 at 12:49 PM, Sameera Deshpande <sameera.deshpa...@imgtec.com> wrote:
> ________________________________________
> From: Sameera Deshpande [sameera.deshpa...@imgtec.com]
> Sent: 20 June 2016 11:37:58
> To: Richard Biener
> Cc: Matthew Fortune; Rich Fuhler; Prachi Godbole; gcc@gcc.gnu.org; Jaydeep Patil
> Subject: Re: [Patch 0,1a] Improving effectiveness and generality of autovectorization using unified representation.
>
> On Wednesday 15 June 2016 05:52 PM, Richard Biener wrote:
>> On Mon, Jun 13, 2016 at 12:56 PM, Sameera Deshpande <sameera.deshpa...@imgtec.com> wrote:
>>> On Thursday 09 June 2016 05:45 PM, Richard Biener wrote:
>>>> On Thu, Jun 9, 2016 at 10:54 AM, Richard Biener <richard.guent...@gmail.com> wrote:
>>>>> On Tue, Jun 7, 2016 at 3:59 PM, Sameera Deshpande <sameera.deshpa...@imgtec.com> wrote:
>>>>>>
>>>>>> Hi Richard,
>>>>>>
>>>>>> This is with reference to our discussion at GNU Tools Cauldron 2015 regarding my talk titled "Improving the effectiveness and generality of GCC auto-vectorization." Further to our prototype implementation of the concept, we have started implementing it in GCC.
>>>>>>
>>>>>> We are following an incremental model: language support is added in our front-end first, and the corresponding back-end support (for the auto-vectorizer) is added afterwards for feature completion.
>>>>>>
>>>>>> Given the complexity and scale of the project, we have divided it into the subtasks listed below, for ease of implementation, testing and review.
>>>>>>
>>>>>> 0. Add a new pass to perform autovectorization using the unified representation - the current GCC framework does not give a complete overview of the loop to be vectorized: it either breaks the loop up across its body, or across iterations.
>>>>>> Because of this, these data structures cannot be reused for our approach, which gathers all the information about the loop body in one place using primitive permute operations. Hence, define new data structures and populate them.
>>>>>>
>>>>>> 1. Add support for vectorization of LOAD/STORE instructions
>>>>>> a. Create the permute order tree for the loop with LOAD and STORE instructions for single- or multi-dimensional arrays and aggregates within nested loops.
>>>>>> b. Basic transformation phase to generate vectorized code for the primitive reorder tree generated at stage 1a using the tree tiling algorithm. This phase handles code generation for SCATTER, GATHER, strided memory accesses etc., along with permute instruction generation.
>>>>>>
>>>>>> 2. Implementation of k-arity promotion/reduction: The permute nodes within the primitive reorder tree generated from the input program can have any arity. However, in most cases the target supports a maximum arity of 2. Hence, we need to promote or reduce the arity of the permute order tree to enable successful tree tiling.
>>>>>>
>>>>>> 3. Vector size reduction: Depending upon the vector size for the target, reduce the vector size per statement and adjust the loop count for the vectorized loop accordingly.
>>>>>>
>>>>>> 4. Support simple arithmetic operations:
>>>>>> a. Add support for analyzing statements with simple arithmetic operations like +, -, *, / for vectorization, and create the primitive reorder tree with compute_op.
>>>>>> b. Generate vector code for the primitive reorder tree generated at stage 4a using the tree tiling algorithm - here support for complex patterns like multiply-add should be checked and the appropriate instruction generated.
>>>>>>
>>>>>> 5. Support reduction operations:
>>>>>> a.
Add support for reduction operation analysis and primitive reorder tree generation. The reduction operation needs special handling, as the finish statement should COLLAPSE the temporary reduction vector TEMP_VAR into the original reduction variable.
>>>>>> b. The code generation for the primitive reorder tree does not need any special handling - the reduction tree is the same as the tree generated in 4a, the only difference being that in 4a the destination is a MEMREF (because of the STORE operation), whereas for reduction it is TEMP_VAR. At this stage, generate code for the COLLAPSE node in the finish statements.
>>>>>>
>>>>>> 6. Support other vectorizable statements like complex arithmetic operations, bitwise operations, type conversions etc.
>>>>>> a. Add support for analysis and primitive reorder tree generation.
>>>>>> b. Vector code generation.
>>>>>>
>>>>>> 7. Cost-effective tree tiling algorithm: So far, the tree tiling happens without considering the cost of computation. However, there can be multiple target instructions covering the tree - hence, instead of picking the first matching largest instruction cover, select the instruction cover based on the cost of the instructions given in the .md file for the target.
>>>>>>
>>>>>> 8. Optimizations on the created primitive reorder tree: This stage is open ended; depending upon performance analysis, its scope can be defined.
>>>>>>
>>>>>> The current patch attached herewith handles stages 0 and 1a: it adds a new pass to perform autovectorization using the unified representation, defines new data structures to cater to this requirement, and creates the primitive reorder tree for LOAD/STORE instructions within the loop.
>>>>>>
>>>>>> The whole loop is represented using the ITER_NODE, which has information about
>>>>>> - the preparatory statements for vectorization to be executed before entering the loop (like initialization of vectors, prepping for reduction operations, peeling etc.),
>>>>>> - the vectorizable loop body, represented as a PRIMOP_TREE (primitive reordering tree),
>>>>>> - the final statements (for peeling, variable loop bounds, the COLLAPSE operation for reduction etc.), and
>>>>>> - other loop attributes (loop bound, peeling needed, dependences, etc.).
>>>>>>
>>>>>> Memory accesses within a loop have a definite repetitive pattern which can be captured using primitive permute operators, which in turn can be used to determine the desired permute order for the vector computations. The PRIMOP_TREE is an AST which records all computations and permutations required to store the destination vector into contiguous memory at the end of all iterations of the loop. It can have INTERLEAVE, CONCAT, EXTRACT, SPLIT, ITER or any compute operation as an intermediate node. Leaf nodes can be a memory reference, a constant, or a vector of loop invariants. Depending upon the operation, the PRIMOP_TREE holds the information about the statement within the loop which is necessary for vectorization.
>>>>>>
>>>>>> At this stage, these data structures are populated by gathering all the information about the loop, the statements within it, and the correlation of those statements. Moreover, the loop body is analyzed to check whether vectorization of each statement is possible. Note, however, that this analysis phase gives a worst-case estimate of instruction selection, as it only checks whether a specific named pattern is defined in the .md file for the target.
>>>>>> It does not necessarily give the optimal cover, which is the aim of the transformation phase using the tree tiling algorithm - and which can be invoked only once the loop body is represented using the primitive reorder tree.
>>>>>>
>>>>>> At this stage, the focus is to create the permute order tree for loops with LOAD and STORE instructions only. The code we intend to compile is of the form
>>>>>>
>>>>>> for (i = 0; i < N; i++)
>>>>>>   {
>>>>>>     stmt 1: D[k*i + d_1] = S_1[k*i + c_11]
>>>>>>     stmt 2: D[k*i + d_2] = S_1[k*i + c_21]
>>>>>>     ...
>>>>>>     stmt k: D[k*i + d_k] = S_1[k*i + c_k1]
>>>>>>   }
>>>>>>
>>>>>> Here we are assuming that any data reference can be represented as base + k * index + offset (the data structure struct data_reference from GCC is currently used for this purpose). If not, the address is normalized to convert it to such a representation.
>>>>>>
>>>>>> We are looking forward to your suggestions and insight in this regard for better execution of this project.
>>>>>
>>>>> I will look at the patch in detail this afternoon and will write up some comments.
>>>>
>>>> Ok, so here we go.
>>>
>>> Hi Richard,
>>>
>>> Thanks for your detailed review. Please find my comments inlined.
>>>
>>>> I see you copy quite some code from the existing vectorizer - rather than doing this and adding a "new" pass I'd make the flag_tree_loop_vectorize_unified flag guard code inside the existing vectorizer - thus share the pass.
>>>
>>> I agree with you that I have copied a lot of code from the current implementation in GCC, and some data structures seem redundant as they hold the same information that is already available. However, as per our discussion at the Cauldron, I tried to generate ptrees after all the analysis phases were done, using the information generated there.
>>> However, because the information is scattered across various statements, loops and their respective info nodes, it did not yield much. Moreover, it turned out that many functions, or parts of them, were not very useful, and some stmt- or loop-related information needed a different way of computation in our scheme than in GCC's framework. So, instead of tweaking GCC's codebase and corrupting the information - thereby making it unusable if our optimization fails and the loop falls back to the default vectorizer - we created a new pass.
>>>
>>> I agree that currently it looks like we are reusing most of the checks and conditions for vectorization as in GCC - for the first phase, our aim is to match the performance of GCC before adding further optimizations. However, we plan to increase the scope further, and will therefore need to change the checks and the data generated accordingly.
>>>
>>> E.g., peeling information won't be generated until the transformation phase starts, where it will be added to the ITER_node depending upon alignment requirements, ITER_count and the VEC_SIZE of the target.
>>>
>>> Or, scatter/gather information is not needed, as tree tiling for vector load/store has different tiles for those instructions if the target supports them.
>>>
>>> Also, the way reduction operations are to be implemented in this scheme makes all the categorization of reduction operations in current GCC redundant.
>>>
>>> However, if you still think it is a good idea to reuse the same data structures and functions, I have an older patch available, which I can clean up, update with the new ptree generation part, and share.
>>
>> My comment was mostly about making patches smaller and thus easier to review. I do expect that little of the copied code will remain as-is.
>>
> Richard, the way we have phased out the project, the first patch is a big one, as we had to do all the initializations and checks relevant to auto-vectorization in it. However, we expect the following patches to be smaller, each dedicated to a single functionality. It is difficult to factor this patch, as the changes we have added are all needed for the functionality - to vectorize load/store operations.
>
>>>> Similarly I'd re-use loop_vinfo and simply add fields to it for the new vectorizer so you can dispatch to existing vectorizer functions "easily". Otherwise it's going to be hard to keep things in-sync. Likewise for your stmt_attr and the existing stmt_vec_info.
>>>
>>> The main reason behind creating a new ITER_node instead of reusing loop_vinfo is to capture the whole loop as a single entity, thereby allowing optimizations on nested loops as well. The ITER_node, when implemented completely, can handle vectorization of multiple nested loops: like all permute nodes, even an ITER_node can be distributed over each element in ITER_node.stmts - and can be propagated across the compute_tree in the permute order tree. So, for future expansion, we are creating the ITER_node with copies of some loop_vinfo fields, and do not compute the fields which are not needed for this scheme.
>>>
>>> Similar is the case with stmt_vinfo - the information gathered in stmt_vinfo is mostly usable for SLP optimizations, which we cannot use as-is. So, instead of generating redundant information, or altering the generated information, we chose to create a new data structure.
>>>
>>>> Why isn't the existing data_reference structure good enough and you need to invent a struct mem_ref? Even if that might be leaner adding new concepts makes the code harder to understand. You can always use dr->aux to add auxiliary data you need (like the existing vectorizer does).
>>>
>>> We can use the data_reference structure as-is; however, each of its components has a very specific meaning associated with it - whereas the mem-ref has only three components which are important: the stride, the offset - which is less than the stride - and the remaining base - the address of the first element. So, again, instead of overriding the current semantics of the components of data_ref, we created a new data structure.
>>
>> I was worried at the point you made an existing GCC function take data-ref pieces to make your mem-ref look like a data-ref. If the data-ref doesn't give you the information you want and that is important it seems to me it is better to fix that. It sounds to me you want to look at innermost behavior, DR_STEP and DR_BASE_ADDRESS + DR_OFFSET + DR_INIT where a part of DR_INIT is your 'offset' (I think it is what the current vectorizer computes as GROUPs and offset would be the offset inside the group, relative to the first element?)
>>
> I agree that the data structure mem-ref is redundant; I will eliminate it and use an altered data-ref. We can say that the computations to find a mem-ref are similar to those used to form GROUPs - however, a GROUP is a collection of all the statements which access elements of an aggregate at a definite interval, whereas a mem-ref looks at a small window of the aggregate with reference to the stride, and the offset addressing is always relative to the stride. Also, a mem-ref is not a collection of statements, but a place-holder identifying a contiguous piece of the aggregate within memory of size = stride. However, I can use a data-ref instead of a mem-ref as the place-holder.
>
>>>> It helps if you follow the GCC coding-conventions from the start - a lot of the new functions miss comments.
>>>
>>> Sorry about that. I will tidy up the patch.
>>>>
>>>> I realize the patch is probably still a "hack" and nowhere a final step 0/1a, but for example you copied the iteration scheme over vector sizes. I hope the new vectorizer can do without that and instead decide on the vector size as part of the cost effective tiling algorithm.
>>>
>>> I didn't get this point clearly. For my understanding: is the objection to the use of ITER_COUNT instead of VEC_SIZE for each statement? If that is the case, there is a basic difference between current GCC's implementation and this scheme - in GCC, we always look for the standard pattern name to be matched for a scalar instruction, for which VEC_TYPE becomes crucial. This scheme, in contrast, represents each statement within the loop as a vector with elements of type SCALAR_TYPE and initial_vec_size = ITER_COUNT (since that many instances of the statement will be executed). Then, depending upon the VEC_SIZE of the target, VEC_SIZE reduction is applied to it, on top of which the tiling algorithm operates. So, for tiling, we will have to use VEC_SIZE. However, for all other optimizations and transformations, we let each node of the permute order tree have vec_size = ITER_COUNT, to allow free movement of permute-nodes across compute-nodes, and various optimizations on them.
>>
>> The comment was about you copying the loop iterating over target vector sizes and repeating vectorization attempts with that single vector size. I hope the cost effective tiling will simply choose a single vector size based on cost and iteration count. Your description above suggests so thus the copying of the vector size iteration seems spurious and if then should be done _after_ building the ptree (and sharing the ptree).
> Yes, as I mentioned in my previous mail:
>
>> (I have kept the loop checking for different vector types in vect_analyze_loop_with_prim_tree() for now as I am still not very clear at what point in the transformation phase it will play any role... Once the end-to-end solution is developed for load/store chains, there will be some clarity - and the loop can be moved there or eliminated completely.)
>
> So, this loop will be eliminated once the backend for tree tiling of load/store is implemented.
>
>>>> As the main feature of the patch is creation of the ptree (no code generation seems to be done?) the biggest missing part is dumping the ptree in some nice form (like in DOT format and/or to the dump file).
>>>
>>> Yes, the ptree dump was put on the back burner because I was focusing on the functionality. However, I will share the patch for dumping the ptree in DOT format shortly.
>>
>> Thanks!
>>
>>>> You do seem to check for target capabilities at ptree build time (looking at vectorizable_store). I think that is a mistake as we'd ideally have the same ptree when considering different vector sizes (or even generic vectors). Target capabilities will differ between vector sizes.
>>>
>>> That's right. However, as long as ITER_COUNT > VEC_SIZE of the target, there is at least one cover available for the ptree we are creating. Hence, this check only verifies that a certain instruction can be represented somehow on the target architecture. This will actually be handled in the transform phase, after vec_size reduction is performed. (I have kept the loop checking for different vector types in vect_analyze_loop_with_prim_tree() for now as I am still not very clear at what point in the transformation phase it will play any role...
Once the end-to-end solution is developed for load/store chains, there will be some clarity - and the loop can be moved there or eliminated completely.)
>>
>> Ok.
>>
>> One limit of the current vectorizer is that it wants to use a single vector size.
>>
>> In principle vectorizable_* could compute a vector size mask for each scalar statement (ptree node) noting the sizes the target has support for encoding the stmt. Similar to permute nodes the ptree can have size-change nodes in case supported input/output size have no match.
>
> We can use the macro TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_SIZES to get the mask, and perform cost-effective instruction selection for the optimal vector size, rather than considering the availability of a single vector size in get_vectype_for_scalar_type_and_size(). However, I will check whether it needs to be done as early as vectorizable_*, or can be postponed until the tree tiling algorithm runs.
>>
>> Thanks,
>> Richard.
>>
>>>> That's all for now.
>>>>
>>>> Last but not least - at the Cauldron I suggested to incrementally rewrite the existing vectorizer by building the ptree at the point it is "ready" for code generation and perform the permute optimizations on it and then re-write the code generation routines to work off the ptree.
>>>>
>>>> Thanks,
>>>> Richard.
>>>
>>> - Thanks and regards,
>>> Sameera D.
>
>> I will create a new branch in GCC as you directed in the other thread, and create incremental patches for the changes suggested here. Will that be fine?
>>
>> - Thanks and regards,
>> Sameera D.
>
> Hi Richard,
>
> Sorry for the delay with this updated patch. Please find attached the patch with changes addressing the review comments.
>
> Is it ok to put in the branch?

Sure.

Thanks Richard, will create the branch.

- Thanks and regards,
Sameera D.

Thanks,
Richard.

> - Thanks and regards,
> Sameera D.