Proposal for a new unrolling degree computed before/after register allocation in GCC.
All:

Given a data dependency graph (DDG), the method proposed by Monica Lam et al. calculates the unrolling degree as follows:

  unrolling degree = length of the longest live range / number of cycles in the kernel (the initiation interval, II)

An unrolling degree based on the above leads to more register spills and reloads inside the loops. Given the DDG and the unrolling degree calculated as above, we build the reuse graph. Much has been published on the reuse graph: each node is labelled with a DDG node that writes into a register, and an arc (u,v) indicates that the destinations can share the same register if there is no dependency between the iterations of the DDG nodes. Since there is no dependency, as represented by the reuse graph, the live ranges do not overlap and the same register can be assigned. Each arc carries a weight that describes the allocated registers and the unrolling degree proposed by Monica Lam as given above.

The reuse graph is partitioned based on heuristics, and the sum of the weights of the arcs in each partition is calculated. The least common multiple of the per-partition weight sums then becomes the new unrolling degree. This new unrolling degree leads to better register allocation and a reduction of register pressure: because the destination registers are shared between different iterations that have no dependency, as described by the dependence distances given in the DDG, the allocated registers lead to fewer spills and reloads. Calculating the new unrolling degree as above enables register allocation for loops (candidates with lots of parallelism) to assign the same destination registers to DDG nodes that have no dependency between them.

Some of the logic for the calculation of the new unrolling degree is taken from the unrolling degree based on periodic register allocation proposed by Albert Cohen et al. This can be used efficiently both before and after register allocation in GCC: the weights of the reuse graph that describe the allocated registers are known after register allocation, while the weights that describe the unrolling degree are computed before register allocation, so the new unrolling degree can be calculated efficiently and reduces register pressure.

Suggestions?

Thanks & Regards
Ajit
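A minimal sketch of the final step described above, computing the new unrolling degree as the least common multiple of the per-partition weight sums (the array layout and function names are illustrative, not GCC APIs; weights are assumed positive):

  #include <stddef.h>

  /* Greatest common divisor, used to build the least common multiple.  */
  static unsigned gcd (unsigned a, unsigned b)
  {
    while (b != 0)
      {
        unsigned t = a % b;
        a = b;
        b = t;
      }
    return a;
  }

  static unsigned lcm (unsigned a, unsigned b)
  {
    return a / gcd (a, b) * b;
  }

  /* WEIGHTS[i] is the sum of the arc weights of reuse-graph partition i;
     the new unrolling degree is the LCM of all partition sums.  */
  unsigned new_unroll_degree (const unsigned *weights, size_t nparts)
  {
    unsigned degree = 1;
    for (size_t i = 0; i < nparts; i++)
      degree = lcm (degree, weights[i]);
    return degree;
  }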
RE: set_src_cost lying comment
-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jeff Law
Sent: Wednesday, June 24, 2015 10:36 AM
To: gcc@gcc.gnu.org
Subject: Re: set_src_cost lying comment

On 06/21/2015 11:57 PM, Alan Modra wrote:
> set_src_cost says it is supposed to
> /* Return the cost of moving X into a register, relative to the cost
>    of a register move.  SPEED_P is true if optimizing for speed rather
>    than size.  */
>
> Now, set_src_cost of a register move (set (reg1) (reg2)), is zero.
> Why?  Well, set_src_cost is used just on the right hand side of a SET,
> so the cost is that of (reg2), which is zero according to rtlanal.c
> rtx_cost.  targetm.rtx_costs doesn't get a chance to modify this.
>
> Now consider (set (reg1) (ior (reg2) (reg3))), for which set_src_cost
> on rs6000 currently returns COSTS_N_INSNS(1).  It seems to me that
> this also ought to return zero, if the set_src_cost comment is to be
> believed.  I'd claim the right hand side of this expression costs the
> same as a register move.  A register move machine insn "mr reg1,reg2"
> is encoded as "or reg1,reg2,reg2" on rs6000!

>>Certainly seems inconsistent -- all the costing stuff should be revisited. The basic design for costing dates back to the m68k/vax era.
>>I certainly agree that the cost of a move, logicals and arithmetic is essentially the same at the chip level for many processors. But a copy has other properties that make it "cheaper" -- namely we can often propagate it away or arrange for the source & dest of the copy to have the same hard register, which achieves the same effect.

I have seen, for a target like MicroBlaze, that setting the cost of move instructions differently from the chip-level cost/latency improves performance on the MiBench/EEMBC benchmarks. Changing the cost of the branch instruction, sometimes making it zero, also improves performance on the MiBench/EEMBC benchmarks.

>>So one could argue that a copy should have cost 0 as it has a reasonable chance of just going away, while logicals, alu operations on the appropriate chips should have a cost of 1.

jeff

Thanks & Regards
Ajit
RE: set_src_cost lying comment
-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard Kenner
Sent: Wednesday, June 24, 2015 9:28 PM
To: l...@redhat.com
Cc: gcc@gcc.gnu.org
Subject: Re: set_src_cost lying comment

> These are good examples of things the costing model simply wasn't ever designed to consider -- because they weren't significant issues on the m68k, vax and other ports in the gcc-1 era.
>
> So I don't really know how to tell you to proceed -- I've considered the costing models fundamentally flawed for many years, but haven't ever tried to come up with something that works better.

>>I'm not sure I'd call it "fundamentally flawed" or that it wasn't "designed to consider" these things. I see the costing model as meant as a *heuristic* for making choices, not as a precise metric for anything. Certainly, a lot of people worked on that model, including both of us, during the gcc-2 RISC porting days.
>>I see the "flaw" as trying to use it for too much. There's a limit to the amount of decisions you can make with purely local data. When trying to decide whether a source should be a constant or a register, you have to look at:
>>- the chances the reg-reg copy can be eliminated
>>- register pressure
>>- whether there's room to insn schedule around any conflicts

Increasing or decreasing the cost of moves affects optimal coalescing. It also affects live-range splitting, which generates shuffle code at the boundary if the registers in the partner live ranges are different. For split live ranges where one partner has a register and the other partner is in memory, a store is needed; conversely, when a live range is in memory and its partner live range is in a register, a load is generated. All of these conditions and heuristics are tied to the move cost on x86 and other architectures.

Thanks & Regards
Ajit

>>Unless you look at the whole picture, you're just guessing, which is, of course, what a heuristic is all about!
Multi-version IF-THEN-ELSE conditional
All:

The presence of aliases disables many optimizations such as CCP (conditional constant propagation), PRE (partial redundancy elimination), and scalar replacement for conditional IF-THEN-ELSE. The presence of aliasing also disables if-conversion.

I am proposing multi-version IF-THEN-ELSE, where one version of the IF-THEN-ELSE is generated without aliasing and the other versions with aliasing. Creating these multiple versions enables CCP, PRE, scalar replacement, and if-conversion for the version of the IF-THEN-ELSE whose pointer variables do not alias, while the versions that carry alias information are not affected by those optimizations.

I don't have examples on hand currently, but I am working on providing them.

Thoughts?

Thanks & Regards
Ajit
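A minimal hand-written sketch of the versioning idea, assuming a runtime overlap test guards the versions (the function, the overlap test, and the restrict-qualified copies are illustrative assumptions, not the proposed GCC implementation):

  /* Multi-versioned IF-THEN-ELSE: the body is duplicated, guarded by a
     runtime overlap test, and the no-overlap copy uses restrict-qualified
     pointers so the compiler can optimize it as alias-free.  */
  void scale (int *a, int *b, int n, int k)
  {
    if (a + n <= b || b + n <= a)       /* runtime proof of no overlap */
      {
        int *restrict pa = a;
        int *restrict pb = b;
        /* Alias-free version: CCP, PRE, scalar replacement and
           if-conversion can all apply here.  */
        for (int i = 0; i < n; i++)
          pa[i] = (pb[i] > 0) ? pb[i] * k : -pb[i] * k;
      }
    else
      {
        /* Possibly-aliasing version: left conservative.  */
        for (int i = 0; i < n; i++)
          a[i] = (b[i] > 0) ? b[i] * k : -b[i] * k;
      }
  }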
Transformation from SEME (Single Entry Multiple Exit) to SESE (Single Entry Single Exit)
All:

Single entry and multiple exits disable traditional loop optimizations. The presence of short-circuit evaluation also makes the CFG single entry, multiple exits. Transforming SEME (single entry, multiple exits) regions into SESE (single entry, single exit) regions enables many loop optimizations. An approach like node splitting to turn SEME regions into SESE regions is an important transformation on the CFG that enables optimizations of loops and conditionals. The loop transformations in LLVM do node splitting to convert SEME regions into SESE regions.

The presence of break and goto statements inside loops makes the CFG unstructured, turning it into SEME. Converting such unstructured control flow into structured control flow enables many loop transformations. I would like to implement a transformation phase on loops, run before any loop optimization pass, that transforms an unstructured CFG into a structured CFG as LLVM does. Does GCC already have such a transformation pass for loops?

Please share your thoughts.

Thanks & Regards
Ajit
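A small hand-written illustration of the kind of rewrite intended (the exit-flag encoding is one possible choice, not an existing GCC pass):

  /* SEME: the loop body has two ways out, the loop test and the break.  */
  int find_seme (const int *a, int n, int key)
  {
    int pos = -1;
    for (int i = 0; i < n; i++)
      if (a[i] == key)
        {
          pos = i;
          break;          /* second exit out of the loop */
        }
    return pos;
  }

  /* SESE: the break is folded into the single loop condition, so the
     region has one entry and one exit and stays structured.  */
  int find_sese (const int *a, int n, int key)
  {
    int pos = -1;
    int i = 0;
    while (i < n && pos < 0)
      {
        if (a[i] == key)
          pos = i;
        i++;
      }
    return pos;
  }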
Consideration of Cost associated with SEME regions.
All: The Cost Calculation for a candidate to Spill in the Integrated Register Allocator(IRA) considers only the SESE regions. The Cost Calculation in the IRA should consider the SEME regions into consider for spilling decisions. The Cost associated with the path that has un-matured exists should be less, thus making the more chances of spilling decision In the path of un-matured exits. The path that has un-matured (normal )exists should be having a higher cost than the cost of un-matured exists and Spilling decisions has to made accordingly in order to spill inside the less frequency path with the un-matured exists than the high frequency Path with the normal exits. I would like to propose the above for consideration of cost associated with SEME regions in IRA. Thoughts? Thanks & Regards Ajit
RE: Consideration of Cost associated with SEME regions.
Sorry for the typo error. I meant exits instead of exists. The below is corrected.

The cost calculation for a candidate to spill in the Integrated Register Allocator (IRA) considers only the SESE regions. The cost calculation in the IRA should take the SEME regions into consideration for spilling decisions. The cost associated with a path that has un-matured exits should be lower, thus increasing the chances of a spilling decision in the path with the un-matured exits. A path that has a normal exit should have a higher cost than the cost of an un-matured exit, and spilling decisions have to be made accordingly, in order to spill inside the less frequent path with the un-matured exits rather than the high-frequency path with the normal exits.

I would like to propose the above for consideration of the cost associated with SEME regions in IRA.

Thoughts?

Thanks & Regards
Ajit

-Original Message-
From: Ajit Kumar Agarwal
Sent: Thursday, July 02, 2015 3:33 PM
To: vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Consideration of Cost associated with SEME regions.

All: The Cost Calculation for a candidate to Spill in the Integrated Register Allocator(IRA) considers only the SESE regions. The Cost Calculation in the IRA should consider the SEME regions into consider for spilling decisions. The Cost associated with the path that has un-matured exists should be less, thus making the more chances of spilling decision In the path of un-matured exits. The path that has un-matured (normal )exists should be having a higher cost than the cost of un-matured exists and Spilling decisions has to made accordingly in order to spill inside the less frequency path with the un-matured exists than the high frequency Path with the normal exits. I would like to propose the above for consideration of cost associated with SEME regions in IRA. Thoughts? Thanks & Regards Ajit
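A small sketch of the frequency-weighted cost this proposal implies: references on a cold path ending in an un-matured exit contribute little, so the candidate becomes cheaper to spill there (the input arrays are illustrative, not IRA data structures):

  /* Spill cost of a candidate over a SEME region: each reference is
     weighted by the execution frequency of its basic block, so a path
     ending in a rarely taken (un-matured) exit adds little cost and
     attracts the spill.  */
  int seme_spill_cost (const int *ref_block_freq, int n_refs,
                       int mem_move_cost)
  {
    int cost = 0;
    for (int i = 0; i < n_refs; i++)
      cost += ref_block_freq[i] * mem_move_cost;
    return cost;
  }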
Live on Exit renaming.
All:

Design and Analysis of Profile-Based Optimization in Compaq's Compilation Tools for Alpha; Journal of Instruction-Level Parallelism 3 (2000) 1-25.

The existing tracer pass in GCC (the pass that performs the tail duplication needed for superblock formation) is based on the above paper. Another optimization of interest in the paper is the following.

Live-on-exit renamer: this optimization tries to remove a constraint that forces the compiler to create long dependent chains of operations in unrolled loops. Consider the following example:

while (a[i] != key)
return i;

Fig(1)

Unrolled loop:

1. while (a[i] != key) {
2.   i = i + 1;
3.   if (a[i] == key) goto E;
4.   i = i + 1;
5.   if (a[i] == key) goto E;
6.   i = i + 1;
7. }
8. E: return i;

Fig(2)

Live-on-exit renamer transformation:

while (a[i] != key) {
  i1 = i + 1;
  if (a[i1] == key) goto E1;
  i2 = i + 2;
  if (a[i2] == key) goto E2;
  i3 = i + 3;
  i = i3;
}
E:  return i;
E1: i = i1; goto E;
E2: i = i2; goto E;

Fig(3).

The above transformation removes the liveness on the exits and makes the unrolled iterations non-overlapping. In Fig(2), line 4 cannot be moved above line 3 because i is live on the exit at line 3. The transformation in Fig(3) removes the live-on-exit constraint, so the register allocator can allocate an optimized register set, and the unrolled loop body can form non-overlapping regions.

I am not sure why the above optimization is not implemented in GCC. If there is no specific reason, I would like to implement it.

Thanks & Regards
Ajit
RE: Live on Exit renaming.
Sorry for the typo error. The below is the corrected Fig (1).

while (a[i] != key)
  i = i + 1;
return i;

Fig (1).

Thanks & Regards
Ajit

-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal
Sent: Saturday, July 04, 2015 7:15 PM
To: l...@redhat.com; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Live on Exit renaming.

All: Design and Analysis of Profile-Based Optimization in Compaq's Compilation Tools for Alpha; Journal of Instruction-Level Parallelism 3 (2000) 1-25 The above paper based on this paper the existing tracer pass (This pass performs the tail duplication needed for superblock formation.) is Implemented in the GCC. There is another optimization that of interest in the above paper is the following. Live on Exit Renamer: This optimizations tries to remove a constraint that force the compiler to create long dependent chains of operations in unrolled loops. The following example While (a[i] != key) Return I; Fig(1) Unrolled Loop: 1.While (a[i] == key) { 2.I = I +1; 3. If(a[i] == key ) goto E 4. I = i+1; 5. If(a[i] == key) goto E 6.I = i+1; 7.} 8.E: return; Fig(2) Live on Exit renamer transformation. While ( a[i] == key) { I1 = I +1; If( a[i1] == key) goto E1 I2 = i+2; If(a[i2] == key) goto E2; I3 = i+3; } E: return I; E1: I = i1 goto E E2: I = i2 goto E Fig(3). The above transformation removes the Liveness of exits and make the unrolled loop non-overlapping. Thus the line 4 in Fig(2) cannot be moved Above 3 because of Live on Exit. The transformation in the Fig(3) remove the Live on Exits and the register allocator can be allocated with optimized Register sets. This can form the non-overlapping regions in the unrolled loop. I am not sure why the above optimization is not implemented in GCC. If there is no specific reasons I would like the implement the same. Thanks & Regards Ajit
Allocation of hot data structures with respect to the top of the stack.
All:

I am wondering whether allocating hot data structures closer to the top of the stack increases the performance of the application. The data structures are classified as hot or cold, all of them are sorted in decreasing order of hotness, and the hot data structures are allocated closest to the top of the stack. Loads and stores that access stack-allocated data structures become faster when the hot data structures are allocated near the top of the stack. Based on the above, code is generated with loads and stores using the stack offsets that result from allocating in decreasing order of hotness.

Thoughts?

Thanks & Regards
Ajit
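A small sketch of the allocation step described above (the slot descriptor, access counts, and the offset-0-is-top-of-stack layout are illustrative assumptions):

  #include <stdlib.h>

  struct slot { int size; int access_count; int offset; };

  static int hotter_first (const void *a, const void *b)
  {
    return ((const struct slot *) b)->access_count
           - ((const struct slot *) a)->access_count;
  }

  /* Sort stack slots by decreasing hotness and hand out offsets so the
     hottest slots sit closest to the top of the stack (offset 0).  */
  void layout_stack (struct slot *slots, int n)
  {
    qsort (slots, n, sizeof *slots, hotter_first);
    int offset = 0;
    for (int i = 0; i < n; i++)
      {
        slots[i].offset = offset;
        offset += slots[i].size;
      }
  }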
Reduction Pattern (Vectorization or Parallelization)
All:

Scalar and array reduction patterns can be identified when the result of a commutative update with +, *, min or max is applied to the same scalar or array variable on the LHS. A reduction pattern identified through such commutative updates helps vectorization or parallelization. Consider the following code:

for (j = 0; j <= N; j++)
  {
    y = d[j];
    for (i = 0; i < 8; i++)
      X[a[i]] = X[a[i]] + c[i] * y;
  }

Fig(1).

The above code has a reduction pattern on X with respect to the outer loop, exhibiting commutative updates on +, and can be identified in GCC as a reduction pattern with respect to the outer loop. I am wondering whether this can be identified as a reduction pattern and reduced to vectorized code, given that X is indexed through another array, so the accesses to X are not affine expressions. Can the above code be identified as a reduction pattern and transformed into vectorized or parallelized code?

Thoughts?

Thanks & Regards
Ajit
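For contrast, a minimal illustration of why the indirect index blocks straightforward vectorization (hand-written example, not GCC output):

  /* Histogram-style reduction: X[a[i]] += c[i] * y.  If a[] contains a
     repeated index, e.g. a = {3, 3, ...}, two lanes of one vector would
     read-modify-write the same element of X, so the loop cannot be
     vectorized as independent lanes without runtime conflict detection
     or a proof that the indices are distinct.  */
  void update (double *X, const int *a, const double *c, double y)
  {
    for (int i = 0; i < 8; i++)
      X[a[i]] = X[a[i]] + c[i] * y;
  }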
RE: Live on Exit renaming.
-Original Message-
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: Monday, July 06, 2015 7:04 AM
To: Steven Bosscher
Cc: Ajit Kumar Agarwal; l...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live on Exit renaming.

On Mon, Jul 6, 2015 at 6:02 AM, Steven Bosscher wrote:
> On Sat, Jul 4, 2015 at 3:45 PM, Ajit Kumar Agarwal wrote:
>> I am not sure why the above optimization is not implemented in GCC.
>
> -fsplit-ivs-in-unroller

>>And things might have changed. Given that GCC does IVO on gimple and unrolling on RTL, there is inconsistency between the two optimizers, since IVO takes the register pressure of IVs into consideration and assumes IVs will take single registers. At least for some cases, splitting the live range of IVs results in bad code. See PR29256 for more information. As described in the comment, I am actually going to do some experiments disabling such transformation to see what happens.

The above optimization is implemented as part of the unroller on GIMPLE. There is an unroller pass on RTL which does not have support for this optimization. Shouldn't the -fsplit-ivs-in-unroller optimization be implemented in the RTL unroller pass? I am looking at this from the implementation perspective, for implementing the -fsplit-ivs-in-unroller optimization in the RTL unroller pass.

Thanks & Regards
Ajit

Thanks,
bin

>
> Ciao!
> Steven
RE: Live on Exit renaming.
-Original Message-
From: Bin.Cheng [mailto:amker.ch...@gmail.com]
Sent: Monday, July 06, 2015 10:26 AM
To: Ajit Kumar Agarwal
Cc: Steven Bosscher; l...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live on Exit renaming.

On Mon, Jul 6, 2015 at 12:02 PM, Ajit Kumar Agarwal wrote:
>
> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: Monday, July 06, 2015 7:04 AM
> To: Steven Bosscher
> Cc: Ajit Kumar Agarwal; l...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live on Exit renaming.
>
> On Mon, Jul 6, 2015 at 6:02 AM, Steven Bosscher wrote:
>> On Sat, Jul 4, 2015 at 3:45 PM, Ajit Kumar Agarwal wrote:
>>> I am not sure why the above optimization is not implemented in GCC.
>>
>> -fsplit-ivs-in-unroller
>
>>>And things might have changed. Given that GCC does IVO on gimple and unrolling on RTL, there is inconsistency between the two optimizers, since IVO takes the register pressure of IVs into consideration and assumes IVs will take single registers. At least for some cases, splitting the live range of IVs results in bad code. See PR29256 for more information. As described in the comment, I am actually going to do some experiments disabling such transformation to see what happens.
>
> The above optimization is implemented as a part of unroller in gimple.
> There is an unroller pass in rtl which does not have support for this

>>As far as I understand, fsplit-ivs-in-unroller is a transformation in the RTL unroller.

My mistake, yes you are right: -fsplit-ivs-in-unroller is a transformation in the RTL unroller, and IVO on GIMPLE doesn't take unrolling into consideration, assuming a single register is assigned to each IV candidate. My thinking is that splitting IVs in the RTL unroller removes the long dependent chains, allows the iterations to overlap, gives better register allocation, and exposes movement of independent code that -fsplit-ivs-in-unroller uncovers. You have mentioned that splitting IV candidates results in bad code. I can only see the positive end of this optimization; could you please elaborate on the negative end of -fsplit-ivs-in-unroller, since you mentioned that it results in bad code in some cases?

Thanks & Regards
Ajit

Thanks,
bin

> optimization. Shouldn't be the fsplit-ivs-in-unroller optimization implemented in the unroller pass of rtl. I am looking at the implementation perspective for implementing the fsplit-ivs-in-unroller optimizations in the unroller rtl pass.
>
> Thanks & Regards
> Ajit
>
> Thanks,
> bin
>>
>> Ciao!
>> Steven
CFG transformation of loops with a continue statement inside the loop.
All:

while/for (condition1)
  {
    /* some code here */
    if (condition2)
      continue;
    /* some code here */
  }

Fig(1)

For the loop in Fig(1) there will be two back edges and multiple latches. The code can be transformed as below in order to have a single back edge:

while/for (condition1)
  {
    /* some code here */
    if (condition2)
      goto latch;
    /* some code here */
  latch:;
  }

Fig(2).

With the transformation shown in Fig(2), the presence of a goto inside the loop affects and disables many optimizations. Alternatively, the loop in Fig(1) can be transformed into multiple loops, one per back edge, which re-enables the affected optimizations; the resulting multiple loops enable many optimizations that are disabled by the goto in Fig(2) and by the multiple latches in Fig(1).

Thoughts?

Thanks & Regards
Ajit
RE: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
-Original Message-
From: Jeff Law [mailto:l...@redhat.com]
Sent: Friday, July 10, 2015 4:04 AM
To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE

On 06/02/2015 10:43 PM, Ajit Kumar Agarwal wrote:
>
> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Tuesday, June 02, 2015 9:19 PM
> To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org
> Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
>
> On 06/02/2015 12:35 AM, Ajit Kumar Agarwal wrote:
>>>> I don't offhand know if any of the benchmarks you cite above are free-enough to derive a testcase from. But one trick many of us use is to instrument the pass and compile some known free software (often gcc itself) to find triggering code and use that to generate tests for the new transformation.
>>
>> I will add tests in the suite. I could see many existing tests in the suite also get triggered with this optimization.
>>> Thanks. For cases in the existing testsuite where you need to change the expected output, it's useful to note why the expected output was changed. Sometimes a test is compromised by a new optimization, sometimes the expected output is changed and is papering over a problem, etc so it's something we look at reasonably closely.
>
> Thanks. I will modify accordingly.
>
>>> diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c index 9faa339..559ca96 100644
>>> --- a/gcc/cfghooks.c
>>> +++ b/gcc/cfghooks.c
>>> @@ -581,7 +581,7 @@ delete_basic_block (basic_block bb)
>>>
>>>   /* If we remove the header or the latch of a loop, mark the loop for
>>>      removal.  */
>>> -  if (loop->latch == bb
>>> +  if (loop && loop->latch == bb
>>>       || loop->header == bb)
>>>     mark_loop_for_removal (loop);
>>>> So what caused you to add this additional test? In general loop structures are supposed to always be available. The change here implies that the loop structures were gone at some point. That seems at first glance a mistake.
>>
>> I was using the gimple_duplicate_bb which will not add the duplicate basic block inside the current_loops. That's why the above condition is required. I am using duplicate_block instead of gimple_duplicate_bb. With this change the above check with loop is not required as it adds the duplicate basic block inside the loops.
>>> OK. Good to hear it's not required anymore.
>
>>> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c index aed5254..b25e409 100644
>>> --- a/gcc/tree-cfg.c
>>> +++ b/gcc/tree-cfg.c
>>> @@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val)
>>>  }
>>>  }
>>>
>>> +void
>>> +gimple_threaded_merge_blocks (basic_block a, basic_block b)
>> If we keep this function it will need a block comment.
>>>> I say "if" for a couple reasons. First we already have support routines that know how to merge blocks. If you really need to merge blocks you should try to use them.
>>>> Second, I'm not sure that you really need to worry about block merging in this pass. Just create the duplicates, wire them into the CFG and let the existing block merging support handle this problem.
>>
>> The above routine is not merging but duplicates the join nodes into its predecessors. If I change the name of the above function to gimple_threaded_duplicating_join_node it should be fine.
>>> But you don't need to duplicate into the predecessors. If you create the duplicates and wire them into the CFG properly the existing code in cfgcleanup should take care of this for you.
>
> Certainly I will do it.
>>
>>> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 4303a18..2c7d36d 100644
>>> --- a/gcc/tree-ssa-threadedge.c
>>> +++ b/gcc/tree-ssa-threadedge.c
Partition and subpartition Analysis that helps in further vectorization and parallelization
All:

I am trying to place the following analysis in the GCC vectorizer; it should help the vectorizer to a great extent for unit-stride, zero-stride and non-strided memory accesses.

The data dependence graph (DDG) is topologically sorted, and each node of the sorted DDG is assigned a timestamp with the following algorithm:

for each node in topologically sorted order in the DDG
  timestamp(node) = max(timestamps of all predecessors, or 0 if none) + 1;

Based on the above timestamp calculation, the DDG is partitioned: each partition holds the nodes with the same timestamp, so the nodes in each partition can be vectorized together, as they are independent nodes in the DDG.

To enable vectorization, each partition is further split into subpartitions by contiguous versus non-contiguous access. The memory addresses of all operands of each node in a partition are sorted in increasing/decreasing order, and based on that sorted order the subpartitions are formed by unit-stride access, zero-stride access, and accesses that require shuffling of operands through vectorized instructions.

The above analysis helps perform data layout on the partitioned nodes of the DDG: based on the subpartitions formed above, more vectorization opportunities are enabled by performing data layout on the non-contiguous accesses, while the subpartitions with contiguous accesses help vectorization directly.

Thoughts?

Thanks & Regards
Ajit
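A compact sketch of the timestamping over a topologically sorted DDG (the adjacency arrays are illustrative, not GCC data structures):

  /* ts[v] = 1 + max ts over predecessors of v; nodes with equal ts form
     one partition of mutually independent DDG nodes.  Nodes are numbered
     0..n-1 in topological order, and pred[v][0..npred[v]-1] lists the
     predecessors of v.  */
  void timestamp_ddg (int n, int **pred, const int *npred, int *ts)
  {
    for (int v = 0; v < n; v++)
      {
        int t = 0;
        for (int k = 0; k < npred[v]; k++)
          if (ts[pred[v][k]] > t)
            t = ts[pred[v][k]];
        ts[v] = t + 1;     /* source nodes get timestamp 1 */
      }
  }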
Traces on Data Dependency graph.
All:

I am wondering how useful it would be to form traces on the data dependency graph. On top of traces on the control flow graph, I was thinking of forming traces on the data dependency graph (DDG).

Would this help in finding further vectorization and parallelization candidates?

Thoughts?

Thanks & Regards
Ajit
RE: Traces on Data Dependency graph.
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Tuesday, July 14, 2015 6:35 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; Jan Hubicka; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Traces on Data Dependency graph.

On Tue, Jul 14, 2015 at 2:12 PM, Ajit Kumar Agarwal wrote:
> All:
>
> I am wondering how useful to form the traces on Data Dependency Graph.
> On top of the traces in the Control flow graph, I was thinking of forming the traces on data Dependency graph(DDG).
>
> Would this helps in further vectorization and parallelization candidates.
>
> Thoughts?

>>Not sure how you'd do that? Transform data dependences to control dependences?

Work has been done on control-flow traces and on dependence traces. Control-flow traces can also explicitly identify the defs and uses of registers, but the flow information for memory is difficult to extract from control-flow traces, so it depends on data dependence traces. Data dependence traces are long in the sense that they require the execution instance of each statement along with the data dependence. The DDG annotates the statements, and it can be further enhanced with the execution instance of each statement.

There has been work by Zhang and Gupta et al. on forming whole-execution traces with control-flow edges and dependence edges at the granularity of statements and blocks. I was thinking of data dependence traces over the DDG, with the statements augmented with their execution instances, which would give traces for the flow information of memory.

Dependence (control and data) profiles have been used in data-speculative optimizations (Lin et al., 2003-2004), speculative optimizations (2004), and the computation of dynamic slices. I am not sure how useful this would be and would like to have your opinion on it.

Thanks & Regards
Ajit

> Thanks & Regards
> Ajit
RETURN_ADDRESS_POINTER_REGNUM Macro
All:

From the description of the macro RETURN_ADDRESS_POINTER_REGNUM, I gather that this macro defines a register that helps in getting the return address from the stack or frame pointer. I see that many of the architectures supported by GCC do not define this macro. Does this impact the performance or correctness of the compiler? In which cases is it applicable to define it for a given architecture?

Thanks & Regards
Ajit
vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
All:

The following macro determines the statement cost that is added to the vectorization cost:

#define TARGET_VECTORIZE_ADD_STMT_COST

In the implementation of the above macro, the following is done for many architectures that support vectorization, such as i386 and ARM:

if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
  count *= 50;  /* FIXME.  */

I have the following questions:

1. Why was the multiplication factor of 50 chosen?
2. The comment mentions that the inner loop, relative to the loop being vectorized, is given more weight. If more weight is added to the inner loop of the loop being vectorized, the chances of vectorizing the inner loop decrease. Why is the inner-loop cost increased relative to the loop being vectorized?

Thanks & Regards
Ajit
RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Monday, August 03, 2015 2:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal wrote:
> All:
>
> The definition of the following macro that determine the statement cost that adds to vectorization cost.
>
> #define TARGET_VECTORIZE_ADD_STMT_COST.
>
> In the implementation of the above macro the following is done for many vectorization supported architectures like i386, ARM.
>
> if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
>   count *= 50;  /* FIXME. */
>
> I have the following questions.
>
> 1. Why the multiplication factor of 50 is choosen?

>>It's a wild guess. See tree-vect-loop.c:vect_get_single_scalar_iteration_cost.

> 2. The comment mentions that the inner loop relative to the loop being vectorized is added more weight. If more weight is added to the inner loop for the loop being vectorized, the chances of vectorizing the inner loop decreases. Why the inner loop cost is increased with relative to the loop being vectorized?

>>In fact adding more weight to the inner loop increases the chance of vectorizing it (if vectorizing the inner loop is profitable).
>>Both scalar and vector cost get biased by a factor of 50 (we assume 50 iterations of the inner loop for one iteration of the outer loop), so a non-profitable vectorization in the outer loop can be offsetted by profitable inner loop vectorization.
>>Yes, '50' can be improved if we actually know the iteration count of the inner loop or if we have profile-feedback.

Thanks for the valuable information and suggestions.

Thanks & Regards
Ajit

Richard.

> Thanks & Regards
> Ajit
Loop distribution for nested Loops.
All:

For the loop given in Fig(1), there is no possibility of loop distribution because statements S1 and S2 depend on the outer-loop index k; due to this dependency the loop cannot be distributed. The loop can be distributed with the transformation given in Fig(2), where the loop from Fig(1) becomes distributable thanks to the dependency-hoisting transformation. In dependency hoisting, the dependency is shifted onto a newly inserted outer loop, and the dependency is then carried by the inserted outer loop. This makes the k loop (S1) and the j loop (S2) distributable, transferring the dependency of S1 and S2 to the inserted outer loop.

Do k = 1, n-1
  Do i = k+1, n
S1: a(i,k) = a(i,k) / a(k,k)
  Enddo
  Do j = k+1, n
    Do i = k+1, n
S2:   a(i,j) = a(i,j) - a(i,k) * a(k,j)
    Enddo
  Enddo
Enddo

Fig(1)

Do x = 1, n
  Do k = 1, x-1
    Do i = k+1, n
S2:   a(i,x) = a(i,x) - a(i,k) * a(k,x)
    Enddo
  Enddo
  Do i = x+1, n
S1: a(i,x) = a(i,x) / a(x,x)
  Enddo
Enddo

Fig(2).

The above transformation looks interesting, making nested loops candidates for loop distribution in the presence of a dependency by shifting the dependency onto a newly inserted outer loop. It would be useful to have a dependency-hoisting transformation that makes loop distribution possible for nested loops. My question is: does the partitioning-based loop distribution transformation handle distribution of nested loops?

Thanks & Regards
Ajit
More of a Loop distribution.
All:

Loop distribution considers the DDG to decide on distributing loops. Loops with control statements like IF-THEN-ELSE can also be distributed: instead of only the data dependence graph, the control dependence graph should be considered in order to distribute loops in the presence of control statements.

The presence of multiple exits in a loop can also be considered for the loop distribution transformation.

The above transformation helps the libquantum benchmark in SPEC 2006.

The following articles look interesting to me:

"Loop Distribution in the presence of arbitrary control flow", Ken Kennedy et al.
"Loop Distribution in the presence of Multiple Exits", Bor-Ming Hsieh et al.

I don't think loop distribution in the presence of control flow is implemented in GCC/LLVM. I think it is feasible to consider the above for implementation in GCC.

Thanks & Regards
Ajit
RE: More of a Loop distribution.
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Thursday, August 13, 2015 3:23 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: More of a Loop distribution.

On Thu, Aug 13, 2015 at 9:37 AM, Ajit Kumar Agarwal wrote:
> All:
>
> Loop distribution considers DDG to decide on distributing the Loops.
> The Loops with control statements like IF-THEN-ELSE can also be Distributed. Instead of Data Dependency Graph, the Control Dependence Graph should be considered in order to distribute the loops In presence of control Statements.
>
> Also the presence of multiple exits in the Loop can also be considered for Loop distribution transformation.
>
> Thus the above transformation helps in the Libquantum benchmarks for SPEC 2006.
>
> There are following articles that looks interesting to me.
>
> "Loop Distribution in presence of arbitrarily control flow Ken Kennedy et.al."
> "Loop Distribution in presence of Multiple Exits Bor-Ming Hsieh etal."
>
> I don't think the loop distribution in presence of control flow is implemented in GCC/LLVM.
>
> I think it is feasible to consider the above for the implementation in GCC.

>>It's true that loop distribution does not try to distribute based on any control structure heuristics but it only considers data locality. It does however already compute the control dependence graph (and uses it to add control edges to the DDG to properly add data dependence edges to uses of control statements necessary in partitions).
>>So it should be a matter of specifying the proper set of starting statements it tries separating.

Thanks.

>>Not sure which kind of distribution you are after, can you give an example?

I would like to distribute loops having control flow. For example:

For (i = 2; i < N; i++)
  {
    If (condition)
      {
S1:     A[i] = ...
S2:     D[i] = A[i-1] ...
      }
  }

The above loop can be distributed into two loops, one loop with S1 inside the IF and another loop with S2 inside the IF. Two scenarios are possible:

1. The condition inside the IF checks A[i] and is therefore dependent on S1. In this case the distribution is difficult. The article by Ken Kennedy et al. stores the partial results of the comparison in a temporary array, and that array is tested inside the IF condition; this makes the loop distributable. This is what I was looking for, and found in the article.

2. The condition inside the IF in the above loop is not dependent on S1 or S2; in this case the loop can be distributed.

In both scenarios GCC can't distribute the loop, as the control dependence graph (the control structure) is not used. The advantage of this loop distribution is that it makes the loop vectorizable, which is otherwise not possible due to the presence of multiple statements inside the IF; the loop may also not be if-converted for the same reason. If we distribute the loop for the above two scenarios, the individual loops of the distribution can be vectorized, which is otherwise not possible.

Thanks & Regards
Ajit

Richard.

> Thanks & Regards
> Ajit
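A hand-written sketch of the scenario-1 transformation described above; the temporary condition array is the device from the Kennedy et al. article, and all names and statement bodies are illustrative (assume n > 2):

  /* Original: the IF condition reads A[], so it depends on S1 and the
     loop cannot be split directly.  Record the condition per iteration
     in a temporary array inside the first loop, then replay it in the
     second: each resulting loop carries one statement and can be
     vectorized or if-converted on its own.  */
  void distributed (double *A, double *D, const double *B, int n)
  {
    unsigned char cond[n];                 /* temporary condition array */

    for (int i = 2; i < n; i++)            /* loop 1: condition + S1 */
      {
        cond[i] = (A[i] > 0.0);
        if (cond[i])
          A[i] = B[i];                     /* S1 */
      }

    for (int i = 2; i < n; i++)            /* loop 2: S2 only */
      if (cond[i])
        D[i] = A[i-1] + B[i];              /* S2 */
  }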
RE: More of a Loop distribution.
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Friday, August 14, 2015 11:30 AM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: More of a Loop distribution.

On August 14, 2015 4:59:07 AM GMT+02:00, Ajit Kumar Agarwal wrote:
>
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Thursday, August 13, 2015 3:23 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: More of a Loop distribution.
>
> On Thu, Aug 13, 2015 at 9:37 AM, Ajit Kumar Agarwal wrote:
>> All:
>>
>> Loop distribution considers DDG to decide on distributing the Loops.
>> The Loops with control statements like IF-THEN-ELSE can also be Distributed. Instead of Data Dependency Graph, the Control Dependence Graph should be considered in order to distribute the loops In presence of control Statements.
>>
>> Also the presence of multiple exits in the Loop can also be considered for Loop distribution transformation.
>>
>> Thus the above transformation helps in the Libquantum benchmarks for SPEC 2006.
>>
>> There are following articles that looks interesting to me.
>>
>> "Loop Distribution in presence of arbitrarily control flow Ken Kennedy et.al."
>> "Loop Distribution in presence of Multiple Exits Bor-Ming Hsieh etal."
>>
>> I don't think the loop distribution in presence of control flow is implemented in GCC/LLVM.
>>
>> I think it is feasible to consider the above for the implementation in GCC.
>
>>>It's true that loop distribution does not try to distribute based on any control structure heuristics but it only considers data locality. It does however already compute the control dependence graph (and uses it to add control edges to the DDG to properly add data dependence edges to uses of control statements necessary in partitions).
>>>So it should be a matter of specifying the proper set of starting statements it tries separating.
>
> Thanks.
>
>>>Not sure which kind of distribution you are after, can you give an example?
>
> I would like to have a distribution of the loop having control flow. For example
>
> For (i = 2; i < N; i++)
> {
>   If (condition)
>     {
> S1:   A[i] = ...
> S2:   D[i] = A[i-1] ...
>     }
> }
>
> The above loop can be distributed with two loops having one loop with S1 inside IF and another loop with S2 with the IF. The two scenarios can be true.
>
> 1. The condition inside IF has a check on A[i] and is dependent on S1. In this case the distribution is difficult. And the above article from Ken Kennedy et al. does store the partial results of comparison in a temporary array and that array is compared inside the IF condition. This makes the loop distributed. This is what I was looking for which I found in the above article.
>
> 2. The condition inside the IF in the above loop is not dependent on S1 and S2, and in this case the loop can be distributed.
>
> In the above two scenarios GCC can't distribute the loop, as the control dependency graph (control structure) is not used. The advantage of the above loop distribution is that it makes the loop vectorizable, which is otherwise not possible due to the presence of multiple statements inside the IF, and it may also not be IF-converted due to the presence of multiple statements. If we distribute the loop for the above two scenarios, the individual loops in the distributed loop can be vectorized, which is otherwise not possible.

>>The above loop can be distributed by gcc just fine.

Does the existing loop distribution implementation in GCC distribute the above loop in both of the above scenarios?

Thanks & Regards
Ajit

Richard.
RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Monday, August 03, 2015 2:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal wrote:
> All:
>
> The definition of the following macro that determine the statement cost that adds to vectorization cost.
>
> #define TARGET_VECTORIZE_ADD_STMT_COST.
>
> In the implementation of the above macro the following is done for many vectorization supported architectures like i386, ARM.
>
> if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
>   count *= 50;  /* FIXME. */
>
> I have the following questions.
>
> 1. Why the multiplication factor of 50 is choosen?

>>It's a wild guess. See tree-vect-loop.c:vect_get_single_scalar_iteration_cost.

> 2. The comment mentions that the inner loop relative to the loop being vectorized is added more weight. If more weight is added to the inner loop for the loop being vectorized, the chances of vectorizing the inner loop decreases. Why the inner loop cost is increased with relative to the loop being vectorized?

>>In fact adding more weight to the inner loop increases the chance of vectorizing it (if vectorizing the inner loop is profitable).
>>Both scalar and vector cost get biased by a factor of 50 (we assume 50 iterations of the inner loop for one iteration of the outer loop), so a non-profitable vectorization in the outer loop can be offsetted by profitable inner loop vectorization.
>>Yes, '50' can be improved if we actually know the iteration count of the inner loop or if we have profile-feedback.

Instead of biasing the vector and scalar cost by a factor of 50, can the benefit of vectorization be calculated as follows?

  benefit = scalar_cost - vector_cost / VF;

  cost = 0;
  for (i = 1; i < N; i++)
    cost = cost + (final_value - initial_value) / steps;   /* per loop level */

  benefit = benefit * cost;

where
  N             = number of levels of the loop nest;
  final_value   = final iteration count of the loop;
  initial_value = initial iteration count of the loop;
  steps         = step of the iteration for the loop;
  VF            = vectorization factor.

Thus an increase in the number of loop levels increases the benefit of vectorization. Also, if the scalar cost is more than the vector cost, then (scalar_cost - vector_cost / VF) grows for the same vectorization factor, thus increasing the benefit of vectorization.

Thanks & Regards
Ajit

Richard.

> Thanks & Regards
> Ajit
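A direct transcription of the proposed formula as C, for concreteness (the per-level trip-count fields and all names are illustrative, not existing GCC code):

  struct loop_level { double initial_value, final_value, steps; };

  /* Proposed benefit: (scalar_cost - vector_cost / VF), scaled by the
     summed estimated trip counts over the levels of the loop nest.  */
  double vect_benefit (double scalar_cost, double vector_cost, double vf,
                       const struct loop_level *lv, int n_levels)
  {
    double benefit = scalar_cost - vector_cost / vf;
    double cost = 0.0;
    for (int i = 0; i < n_levels; i++)
      cost += (lv[i].final_value - lv[i].initial_value) / lv[i].steps;
    return benefit * cost;
  }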
More of a Loop fusion
All:

Loop fusion is an important optimization that fuses a set of loops if the following conditions hold:

1. The loops are conformant, i.e. they have the same iteration count.
2. The loops are control equivalent. Control equivalence can be identified with the dominator and post-dominator properties of the loops: two loops Lj and Lk are control equivalent if Lj dominates Lk and Lk post-dominates Lj.
3. The loops are adjacent, i.e. there is no intervening code between them. Intervening code can be detected with the dominance properties: if there is a node ax such that loop Lj dominates ax and loop Lk post-dominates ax, then there is intervening code.
4. There are only forward dependences between the loops.

If property 1 does not hold, i.e. the loops are not conformant and do not have the same iteration count, the transformation given in Fig(1) and Fig(2) below can still fuse them. In Fig(1), the loops over i1 and i2 are non-conformant as they do not have the same iteration count; Fig(2) shows the fused loop for the non-conformant loops over i1 and i2.

If property 3 does not hold and the loops are not adjacent, the intervening code can be moved above or below the loops so that the loops become adjacent with no intervening code. The intervening code can also be split, moving part of it up and part of it down, based on the dependence relation and the dominance properties. Such a transformation makes the loops adjacent with no intervening code.

Loop fusion should also be done after the loop normalization pass, i.e. loop normalization should run before loop fusion.

Checking the four properties required for fusion is done while traversing the loops from outermost to innermost. First the outer loops are traversed, and the loops at the same level as the outer loop are placed into one partition; that partition can be fused if the above four properties hold, otherwise the loops are taken out of the partition. The same logic is applied while traversing from outer to inner loops, forming partitions that are later fused based on the above properties. Each loop is traversed in the forward order of the dominance tree and also in the reverse order (i.e. the post-dominator tree), and forward dependence is checked. Based on the dependences the loops are fused, and for non-adjacent loops the intervening code is moved up or down, wholly or in part.

I think the above transformations and properties can enhance the loop fusion algorithm in GCC to achieve more loop fusion on the SPEC benchmarks.

For (i1 = 0; i1 < n; i1++)
  {
    if (i1 < n-1)
      S1;
    else
      S2;
    S3;
  }
For (i2 = 0; i2 < n-2; i2++)
  S4;

Fig(1).

For (i1 = 0; i1 < n; i1++)
  {
    if (i1 < n-2)
      {
        if (i1 < n-1)
          S1;
        else
          S2;
        S3;
        S4;
      }
    else
      {
        if (i1 < n-1)
          S1;
        else
          S2;
        S3;
      }
  } // For

Fig(2).

Thanks & Regards
Ajit
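For the baseline case where all four conditions above hold, a minimal hand-written before/after illustration:

  /* Before: two conformant, control-equivalent, adjacent loops with only
     a forward dependence (the second loop reads a[] written by the first).  */
  void unfused (double *a, double *b, const double *c, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] = c[i] * 2.0;
    for (int i = 0; i < n; i++)
      b[i] = a[i] + 1.0;
  }

  /* After fusion: one loop, and the reuse of a[i] now has better locality.  */
  void fused (double *a, double *b, const double *c, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = c[i] * 2.0;
        b[i] = a[i] + 1.0;
      }
  }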
RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Friday, August 14, 2015 9:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On August 14, 2015 5:03:58 PM GMT+02:00, Ajit Kumar Agarwal wrote:
>
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Monday, August 03, 2015 2:59 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
>
> On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal wrote:
>> All:
>>
>> The definition of the following macro that determine the statement cost that adds to vectorization cost.
>>
>> #define TARGET_VECTORIZE_ADD_STMT_COST.
>>
>> In the implementation of the above macro the following is done for many vectorization supported architectures like i386, ARM.
>>
>> if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
>>   count *= 50;  /* FIXME. */
>>
>> I have the following questions.
>>
>> 1. Why the multiplication factor of 50 is choosen?
>
>>>It's a wild guess. See tree-vect-loop.c:vect_get_single_scalar_iteration_cost.
>
>> 2. The comment mentions that the inner loop relative to the loop being vectorized is added more weight. If more weight is added to the inner loop for the loop being vectorized, the chances of vectorizing the inner loop decreases. Why the inner loop cost is increased with relative to the loop being vectorized?
>
>>>In fact adding more weight to the inner loop increases the chance of vectorizing it (if vectorizing the inner loop is profitable).
>>>Both scalar and vector cost get biased by a factor of 50 (we assume 50 iterations of the inner loop for one iteration of the outer loop), so a non-profitable vectorization in the outer loop can be offsetted by profitable inner loop vectorization.
>>>Yes, '50' can be improved if we actually know the iteration count of the inner loop or if we have profile-feedback.
>
> Instead of biasing the vector and scalar cost by a factor of 50, can the benefit of vectorization be calculated as follows?
>
> benefit = scalar_cost - vector_cost / VF;
>
> cost = 0;
> for (i = 1; i < N; i++)
>   cost = cost + (final_value - initial_value) / steps;
>
> benefit = benefit * cost;
>
> where
> N = number of levels of the loop nest;
> final_value = final iteration count of the loop;
> initial_value = initial iteration count of the loop;
> steps = step of the iteration for the loop;
> VF = vectorization factor.
>
> Thus an increase in the number of loop levels increases the benefit of vectorization. Also, if the scalar cost is more than the vector cost, then (scalar_cost - vector_cost / VF) grows for the same vectorization factor, thus increasing the benefit of vectorization.

>>Sure. But the number of iterations may only be available symbolically, thus the cost be only useful for the dynamic check at runtime. A better static estimate would also be useful.

Thanks. For the cases where the loop bound is known at compile time through value range analysis: GCC already uses value range information/analysis to calculate loop bounds. We can use the same loop bound information to get a static estimate of the number of iterations. Based on that estimate, the cost calculation I mentioned above can be used for the vectorization cost analysis.

Thanks & Regards
Ajit

Richard.

> Thanks & Regards
> Ajit
>
> Richard.
>
>> Thanks & Regards
>> Ajit
RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Monday, August 17, 2015 4:03 PM To: Richard Biener Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST -Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Friday, August 14, 2015 9:59 PM To: Ajit Kumar Agarwal Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST On August 14, 2015 5:03:58 PM GMT+02:00, Ajit Kumar Agarwal wrote: > > >-Original Message- >From: Richard Biener [mailto:richard.guent...@gmail.com] >Sent: Monday, August 03, 2015 2:59 PM >To: Ajit Kumar Agarwal >Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; >Vidhumouli Hunsigida; Nagaraju Mekala >Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST > >On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal > wrote: >> All: >> >> The definition of the following macro that determine the statement >cost that adds to vectorization cost. >> >> #define TARGET_VECTORIZE_ADD_STMT_COST. >> >> In the implementation of the above macro the following is done for >many vectorization supported architectures like i386, ARM. >> >> if (where == vect_body && stmt_info && stmt_in_inner_loop_p >(stmt_info)) >> count *= 50; /* FIXME. */ >> >> I have the following questions. >> >> 1. Why the multiplication factor of 50 is choosen? > >>>It's a wild guess. See >tree-vect-loop.c:vect_get_single_scalar_iteration_cost. > >> 2. The comment mentions that the inner loop relative to the loop >being >> vectorized is added more weight. If more weight is added to the inner > >> loop for the loop being vectorized, the chances of vectorizing the >inner loop decreases. Why the inner loop cost is increased with >relative to the loop being vectorized? > >>>In fact adding more weight to the inner loop increases the chance of >vectorizing it (if vectorizing the inner loop is profitable). >>>Both scalar and vector cost get biased by a factor of 50 (we assume >50 iterations of the inner loop for one iteration of the outer loop), >so a non-profitable >>vectorization in the outer loop can be offsetted >by profitable inner loop vectorization. > >>>Yes, '50' can be improved if we actually know the iteration count of >the inner loop or if we have profile-feedback. > >Instead of vector and scalar cost get biased by a factor of 50, Can the >benefit of vectorization calculated as follows > >Benefit = scalar cost - vector cost/VF; Cost = 0; For ( I = 1; I < N; >i++) { >Cost = cost + (final_value - Initial value)/steps; } > >Benefit = Benefit * cost; > >Where >N = No. of levels of the loop; >Final_value = Final iteration count of the loop. >Initial_value = Initial Iteration count of the loop. >Steps = steps of the iteration for the loop. >VF = vectorization factor. > >Thus increase in the Levels of the loops increases the benefit of >vectorization. Also if the scalar cost is more than the vectorization >cost then the Scalar cost - vector cost /VF increases with the same >vectorization Factor thus increasing the benefit of vectorization. >>Sure. But the number of iterations may only be available symbolically, thus >>the cost be only useful for the dynamic check at runtime. A better static >>>>estimate would also be useful. >>Thanks. 
Thanks. For the cases where the loop bound can be known at compile time through value range analysis: GCC already uses the value range information/analysis to calculate the loop bound. We can use the same loop bound info to get a static estimate of the number of iterations. Based on the above estimates, the cost calculation I mentioned can be used for the vectorization cost analysis.

On top of the above, the vectorizer cannot vectorize a loop if the trip count or iteration count is not known, so wherever the number of iterations is needed for the vectorizer cost calculation, the trip count is in fact known. The summation of the iteration counts of all the loop levels where the iteration count is known gives the static estimate and is included in the above vectorization cost. For the loops where the iteration or trip count is not known, the vectorizer cannot vectorize, and the iteration count of such loops can be neglected in the vectorization cost calculation. Only SLP or partial vectorization is possible there, since it considers isomorphic operations instead of vectorizing based on the trip or iteration count of the loops. A sketch of this static estimate follows below.

Thanks & Regards
Ajit

Richard.

>Thanks & Regards
>Ajit
>
>Richard.
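[Editor's note] A minimal sketch of the static benefit estimate proposed above, under the assumption that every loop level has a compile-time-known bound (e.g. from value range analysis). All names here are illustrative; none of them are GCC internals.

#include <stdio.h>

struct loop_level
{
  double initial_value;  /* initial iteration count */
  double final_value;    /* final iteration count */
  double steps;          /* iteration step */
};

static double
vect_benefit_estimate (double scalar_cost, double vector_cost, double vf,
                       const struct loop_level *levels, int n)
{
  /* benefit = scalar cost - vector cost / VF, scaled by the summed
     trip-count estimate over all N loop levels.  */
  double benefit = scalar_cost - vector_cost / vf;
  double trip = 0.0;
  for (int i = 0; i < n; i++)
    trip += (levels[i].final_value - levels[i].initial_value)
            / levels[i].steps;
  return benefit * trip;
}

int
main (void)
{
  struct loop_level levels[2] = { { 0.0, 100.0, 1.0 }, { 0.0, 50.0, 2.0 } };
  /* scalar cost 8, vector cost 12, VF 4 -> per-iteration benefit 5,
     trip estimate 100 + 25 = 125, total benefit 625.  */
  printf ("benefit = %.1f\n",
          vect_benefit_estimate (8.0, 12.0, 4.0, levels, 2));
  return 0;
}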
[RFC]: Vectorization cost benefit changes.
All:

I have done the vectorization cost changes as given below. I have considered only the cost associated with the inside of the loop instead of the outside. The inside scalar and vector costs are considered because the inner costs matter more than the outside costs.

min_profitable_iters = ((scalar_single_iter_cost - vec_inside_cost) * vf);

The scalar_single_iter_cost uses the hardcoded value 50 that is used for most targets; the scalar cost is multiplied by 50. The vector cost is subtracted from this scalar cost, and as the scalar cost increases, the chances of vectorization are higher with the same vectorization factor, so more loops will be vectorized.

I have not changed the iteration count, which is hardcoded to 50; I will do the changes to replace the 50 with static estimates of the iteration count if you agree with the changes below.

I have run the SPEC CPU 2000 benchmarks with the below changes for i386 targets, and significant gains are achieved with respect to the INT and FP benchmarks. Here is the data.

Ratio with vectorization cost changes (FP benchmarks) vs ratio without vectorization cost changes (FP benchmarks) = 4640.102 vs 4583.379.
Ratio with vectorization cost changes (INT benchmarks) vs ratio without vectorization cost changes (INT benchmarks) = 3812.883 vs 3778.558.

Please give your feedback on the below changes for the vectorization cost benefit.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 422b883..35d538f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2987,11 +2987,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     min_profitable_iters = 1;
   else
     {
-      min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) * vf
-                              - vec_inside_cost * peel_iters_prologue
-                              - vec_inside_cost * peel_iters_epilogue)
-                             / ((scalar_single_iter_cost * vf)
-                                - vec_inside_cost);
+      min_profitable_iters = ((scalar_single_iter_cost
+                               - vec_inside_cost) * vf);

       if ((scalar_single_iter_cost * vf * min_profitable_iters)
           <= (((int) vec_inside_cost * min_profitable_iters)

Thanks & Regards
Ajit
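[Editor's note] A side-by-side evaluation of the upstream min_profitable_iters formula and the proposed replacement, with made-up cost numbers purely for illustration (not taken from a real GCC run).

#include <stdio.h>

int
main (void)
{
  int vf = 4;
  int scalar_single_iter_cost = 30;
  int vec_inside_cost = 24;
  int vec_outside_cost = 8, scalar_outside_cost = 2;
  int peel_iters_prologue = 1, peel_iters_epilogue = 1;

  int old_iters = ((vec_outside_cost - scalar_outside_cost) * vf
                   - vec_inside_cost * peel_iters_prologue
                   - vec_inside_cost * peel_iters_epilogue)
                  / ((scalar_single_iter_cost * vf) - vec_inside_cost);
  int new_iters = (scalar_single_iter_cost - vec_inside_cost) * vf;

  /* The two formulas can diverge sharply: the upstream one amortizes
     setup and peeling cost over iterations, the proposed one only
     compares the per-iteration scalar and vector body cost.  */
  printf ("old min_profitable_iters = %d\n", old_iters);  /* 0  */
  printf ("new min_profitable_iters = %d\n", new_iters);  /* 24 */
  return 0;
}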
RE: [RFC]: Vectorization cost benefit changes.
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Friday, August 21, 2015 2:03 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; GCC Patches; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC]: Vectorization cost benefit changes.

On Fri, Aug 21, 2015 at 7:18 AM, Ajit Kumar Agarwal wrote:
> All:
>
> I have done the vectorization cost changes as given below. I have considered only the cost associated with the inside of the loop instead of the outside.
> The inside scalar and vector costs are considered because the inner costs matter more than the outside costs.

>>I think you are confused about what the variables' costs are associated with. You are changing a place that also computes the cost for non-outer-loop vectorization, so your patch is clearly not applicable. vec_outside_cost is the cost of setting up invariants, for example. All costs apply to the "outer" loop - if there is a nested loop inside that loop, its costs are folded into the "outer" loop cost already at this stage. So I think your analysis is simply wrong, and thus your patch. You need to find another place to fix inner loop cost.

Thanks for your valuable suggestions and feedback. I will certainly look into it.

Thanks & Regards
Ajit

Richard.

> min_profitable_iters = ((scalar_single_iter_cost
>                          - vec_inside_cost) * vf);
>
> The scalar_single_iter_cost uses the hardcoded value 50 that is used for most targets; the scalar cost is multiplied by 50. The vector cost is subtracted from this scalar cost, and as the scalar cost increases, the chances of vectorization are higher with the same vectorization factor, so more loops will be vectorized.
>
> I have not changed the iteration count, which is hardcoded to 50; I will do the changes to replace the 50 with static estimates of the iteration count if you agree with the changes below.
>
> I have run the SPEC CPU 2000 benchmarks with the below changes for i386 targets, and significant gains are achieved with respect to the INT and FP benchmarks.
>
> Here is the data.
>
> Ratio with vectorization cost changes (FP benchmarks) vs ratio without vectorization cost changes (FP benchmarks) = 4640.102 vs 4583.379.
> Ratio with vectorization cost changes (INT benchmarks) vs ratio without vectorization cost changes (INT benchmarks) = 3812.883 vs 3778.558.
>
> Please give your feedback on the below changes for the vectorization cost benefit.
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 422b883..35d538f 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -2987,11 +2987,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
>      min_profitable_iters = 1;
>    else
>      {
> -      min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) * vf
> -                              - vec_inside_cost * peel_iters_prologue
> -                              - vec_inside_cost * peel_iters_epilogue)
> -                             / ((scalar_single_iter_cost * vf)
> -                                - vec_inside_cost);
> +      min_profitable_iters = ((scalar_single_iter_cost
> +                               - vec_inside_cost) * vf);
>
>        if ((scalar_single_iter_cost * vf * min_profitable_iters)
>            <= (((int) vec_inside_cost * min_profitable_iters)
>
> Thanks & Regards
> Ajit
Awareness of register pressure on strength reduction of induction variables.
All:

Global code motion is an important optimization that has an impact on register spills and fetches, and so it takes the increase or decrease of register pressure into account. Strength reduction is likewise an important optimization that has an impact on register pressure, but the strength reduction optimizations don't take register pressure into account, especially the strength reduction for induction variables inside loops. Loops are the bottleneck for register pressure, and any spill inside a loop caused by strength reduction of induction variables degrades the performance of the loop.

Thus strength reduction in general, and especially strength reduction for induction variables, should take register pressure into account based on live range info. I don't think we take register pressure into account for the strength reduction optimization in general, nor especially for the strength reduction based on induction variables.

Thanks & Regards
Ajit
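[Editor's note] Illustration only: classic strength reduction of an induction-variable multiply. The transformed loop trades the multiply for an add, but the new temporary t is live across the whole loop body, which is exactly the register pressure cost the note above is concerned with.

/* Before: the multiply i * s is recomputed every iteration.  */
void
scale_before (int *a, int n, int s)
{
  for (int i = 0; i < n; i++)
    a[i] = i * s;
}

/* After strength reduction: one more live range (t) inside the loop.  */
void
scale_after (int *a, int n, int s)
{
  int t = 0;
  for (int i = 0; i < n; i++)
    {
      a[i] = t;
      t += s;
    }
}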
Commoning the control and Data Dependence
All:

The data dependence graph augmented with control dependence can be commoned out based on dominator info. If instruction I1 dominates all its uses, say instructions I2 and I3, then I2 and I3 depend on I1. Thus a graph can be formed from the dominator tree of all the instructions, where the edges represent the dominator info: there is an edge from I1 to I2 and from I1 to I3 if I1 dominates I2 and I3.

Such a representation commons out the data and control dependence, and a common data structure can be formed with the tree-based graph built on dominator info. This takes care of both the data dependence and the control dependence with a common graph representation based on the dominator tree over instructions.

Can such a representation replace the DDG augmented with control dependence used in the loop distribution pass to form the partitions for distributing loops?

Thanks & Regards
Ajit
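[Editor's note] A minimal sketch, assuming a precomputed immediate-dominator array: idom[v] is the instruction that immediately dominates instruction v. Emitting one edge idom[v] -> v per instruction yields exactly the dominator-tree-shaped graph proposed above as the shared carrier for data and control dependence. Nothing here corresponds to GCC internals.

#include <stdio.h>

#define NINSN 4

int
main (void)
{
  /* I1 is the entry (no dominator); I1 dominates I2 and I3; I2
     dominates I4.  Indices are zero-based, names are one-based.  */
  int idom[NINSN] = { -1, 0, 0, 1 };

  for (int v = 0; v < NINSN; v++)
    if (idom[v] >= 0)
      printf ("edge I%d -> I%d\n", idom[v] + 1, v + 1);
  return 0;
}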
Live range Analysis based on tree representations
All:

Live range info on the tree SSA representation is an important step towards SSA-based code motion optimizations, since code motion on the SSA representation affects register pressure, a common reason for performance bottlenecks.

I am proposing live range analysis based on the SSA representation. The live range analysis traverses the dominator tree. The SSA and phi variables are placed based on dominance frontier info, so the SSA representation reflects the dominance info, and from such dominance info an overlapping live range analysis can be derived.

The live ranges of variables V and W overlap at a point p if Vdef dominates p and Wdef dominates p. If Vdef dominates Wdef and Wdef dominates Udef, then Vdef dominates Udef; thus if the live range of V intersects W, and the live range of W intersects U, the live range of V intersects U. Such dominance info can be used to represent the overlapping live range analysis, and the register pressure is derived from the overlapping live ranges based on the dominator info inherited from the SSA representation. The SSA representation is built from the dominance frontier, and a traversal of the dominator tree over the SSA form can derive the overlapping live ranges; a small sketch of the overlap test follows below.

The above overlapping live range info can be used to derive the register pressure, and optimizations based on the tree representation can use these overlapping live ranges to take register pressure into account.

Thanks & Regards
Ajit
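[Editor's note] Sketch of the overlap test described above. dominates () is a toy oracle: definition points are numbered along a single dominator-tree path, so a smaller number dominates a larger one. A real implementation would query basic-block dominance instead; all names here are hypothetical.

#include <stdbool.h>
#include <stdio.h>

static bool
dominates (int a, int b)
{
  return a <= b;
}

/* Live ranges of V and W overlap at point p when both definitions
   dominate p; liveness of both names at p is assumed for brevity.  */
static bool
ranges_overlap_at (int vdef, int wdef, int p)
{
  return dominates (vdef, p) && dominates (wdef, p);
}

int
main (void)
{
  printf ("overlap at p=5: %d\n", ranges_overlap_at (1, 3, 5)); /* 1 */
  printf ("overlap at p=2: %d\n", ranges_overlap_at (1, 3, 2)); /* 0 */
  return 0;
}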
RE: Live range Analysis based on tree representations
-Original Message- From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com]
Sent: Wednesday, September 02, 2015 8:23 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live range Analysis based on tree representations

On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> All:
>
> Live range info on the tree SSA representation is an important step towards SSA-based code motion optimizations, since code motion on the SSA representation affects register pressure, a common reason for performance bottlenecks.
>
> I am proposing live range analysis based on the SSA representation. The live range analysis traverses the dominator tree. The SSA and phi variables are placed based on dominance frontier info, so the SSA representation reflects the dominance info, and from such dominance info an overlapping live range analysis can be derived.
>
> The live ranges of variables V and W overlap at a point p if Vdef dominates p and Wdef dominates p. If Vdef dominates Wdef and Wdef dominates Udef, then Vdef dominates Udef; thus if the live range of V intersects W, and the live range of W intersects U, the live range of V intersects U. Such dominance info can be used to represent the overlapping live range analysis, and the register pressure is derived from the overlapping live ranges based on the dominator info inherited from the SSA representation.
>
> The above overlapping live range info can be used to derive the register pressure, and optimizations based on the tree representation can use these overlapping live ranges to take register pressure into account.

>>Ajit,
>>I did a prototype of this kind of analysis at one point last year to see if it could help improve inlining decisions in LTO. Basically I did exactly what you suggest and computed the number of overlapping SSA live ranges and used that as a proxy for register pressure. It did appear to be able to help in some circumstances but the real solution is to improve register allocation so it doesn't fall down when register pressure gets high.

Aaron:
Would you mind explaining in which circumstances it helps and in which it won't? Live ranges on the SSA representation form chordal graphs, which may have different colorability requirements than the real register allocator sees. This may not give the exact register pressure compared to the register allocator, since the register allocator sits further down the optimization and code generation pipeline, but it forms the basis of SSA-based optimizations that affect register pressure.

>>The code is in a branch called lto-pressure.

Thanks. I would like to see the code.

Thanks & Regards
Ajit

>>Aaron

> Thanks & Regards
> Ajit

-- Aaron Sawdey, Ph.D. acsaw...@linux.vnet.ibm.com 050-2/C113 (507) 253-7520 home: 507/263-0782 IBM Linux Technology Center - PPC Toolchain
Cost and Benefit Allocation for moving the expression above the conditional.
All:

The cost and benefit associated with moving a given expression above a conditional are the important factors for a performance boost. Considering this, the cost and benefit calculation can be derived as follows.

For a given conditional entry point 'n', the benefit paths 'p' are the sub-paths that pass through 'n' on which the expression is available and anticipatable at the conditional entry n. For a given conditional entry point 'n', the cost paths 'p' are the sub-paths that pass through 'n' on which the expression is unavailable and un-anticipatable at the conditional entry n. Thus:

Benefit = summation of the frequencies of all the benefit paths given above.
Cost = summation of the frequencies of all the cost paths given above.

The movement of the expression above the conditional can then be enabled if Benefit > Cost as calculated above. The derivation above can be extended to loop invariant code motion.

The benefit and cost derivations can be refined as follows based on the number of defs and uses:

Benefit = summation of the frequencies of all the benefit paths
        + summation of (number of defs * latency of a store instruction) over all benefit paths
        + summation of (number of uses * latency of a load instruction) over all benefit paths
        - summation of (number of moves * latency of a move instruction) over all benefit paths.

Cost = summation of the frequencies of all the cost paths
     + summation of (number of defs * latency of a store instruction) over all cost paths
     + summation of (number of uses * latency of a load instruction) over all cost paths
     - summation of (number of moves * latency of a move instruction) over all cost paths.

The movement of expressions above the conditional, extended to LICM, is performed if the benefit is greater than the cost. The above is feasible and quite efficient for cost and benefit allocation for control flow code motion. This cost is somewhat equivalent to the cost of live range priority and the selection of live ranges for coloring based on priority.

Thanks & Regards
Ajit
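[Editor's note] A minimal sketch of the benefit/cost comparison described above. The path frequencies, def/use counts and latencies are made-up inputs; none of this mirrors a GCC data structure.

#include <stdio.h>

struct subpath
{
  double freq;          /* execution frequency of the sub-path through n */
  int ndefs, nuses, nmoves;
};

static double
path_weight (const struct subpath *p,
             double store_lat, double load_lat, double move_lat)
{
  return p->freq
         + p->ndefs * store_lat
         + p->nuses * load_lat
         - p->nmoves * move_lat;
}

int
main (void)
{
  /* One benefit path and one cost path through the conditional.  */
  struct subpath benefit_path = { 10.0, 1, 2, 1 };
  struct subpath cost_path    = {  4.0, 1, 1, 1 };

  double benefit = path_weight (&benefit_path, 2.0, 2.0, 1.0);
  double cost    = path_weight (&cost_path,    2.0, 2.0, 1.0);

  printf ("hoist above the conditional: %s\n",
          benefit > cost ? "yes" : "no");
  return 0;
}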
Inlining Decision Priority Function.
All:

Inlining decisions that avoid the formation of the callee's stack frame, by including the callee in the caller's context, increase performance. A priority function for inlining decisions can be calculated considering the following:

1. Nesting level of the callee.
2. Code size of the callee.
3. Code size of the largest function in the program.

Priority(c) = (Level(c) + 1) * LargestSize - Size(c)

where
Priority(c) = priority of the callee;
Level(c) = depth of the callee in the call graph;
Size(c) = code size of the callee;
LargestSize = largest code size among all the functions in the program.

The priority is highest for a callee with small code size that sits deepest among the call sites of the call graph: the deeper the level of the callee, the greater the priority, and the smaller the code size, the higher the priority. The priority function also uses the largest function size in the program as a scaling term, so that the priority stays positive.

Should the above heuristics be considered for inlining? If we already use similar heuristics in the inlining decisions, let me know.

Thanks & Regards
Ajit
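[Editor's note] Direct transcription of the priority function above; the sample numbers are illustrative.

#include <stdio.h>

static long
inline_priority (int level, long largest_size, long size)
{
  return (long) (level + 1) * largest_size - size;
}

int
main (void)
{
  /* A small callee deep in the call graph outranks a large shallow
     one, matching the intent described in the text.  */
  printf ("deep, small callee:  %ld\n", inline_priority (3, 1000, 40));
  printf ("shallow, big callee: %ld\n", inline_priority (1, 1000, 900));
  return 0;
}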
Selective criteria and Heuristics for Loop Unrolling.
All:

The loop unrolling decision, and in particular the choice of unrolling factor, is an important part of the loop unrolling optimization. Deciding the unrolling factor for a loop based on the criteria below improves the performance of unrolled loops:

1. Number of operations.
2. Number of operands.
3. Number of floating point operations.
4. Loop nest level.
5. Number of branches.
6. Number of memory operations.
7. Live range size.
8. Critical path length.
9. Known trip count.
10. Number of parallel computations in the loop.
11. The max dependence height of the computation.
12. Memory-to-memory loop carried dependences.

An unrolling factor heuristic based on the above criteria improves the performance of the unrolled loop; one possible combination is sketched below.

Thanks & Regards
Ajit
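[Editor's note] A minimal sketch of one way to combine a few of the criteria above into an unroll-factor choice; the thresholds, weights and the criteria struct are made up for illustration and are not from any GCC pass.

#include <stdio.h>

struct loop_stats
{
  int ops, mem_ops, branches, live_ranges, critical_path, trip_count;
};

static int
choose_unroll_factor (const struct loop_stats *s, int max_factor)
{
  int factor = max_factor;
  /* Shrink the factor as register pressure and code size risks grow.  */
  if (s->live_ranges > 8)
    factor /= 2;
  if (s->ops + s->mem_ops > 32)
    factor /= 2;
  if (s->branches > 1)
    factor /= 2;
  if (factor < 1)
    factor = 1;
  /* Never unroll past a known trip count.  */
  if (s->trip_count && factor > s->trip_count)
    factor = s->trip_count;
  return factor;
}

int
main (void)
{
  struct loop_stats s = { 12, 4, 1, 6, 5, 64 };
  printf ("unroll factor = %d\n", choose_unroll_factor (&s, 8));
  return 0;
}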
RE: Live range Analysis based on tree representations
-Original Message- From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com]
Sent: Friday, September 04, 2015 11:51 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live range Analysis based on tree representations

On Thu, 2015-09-03 at 15:22 +, Ajit Kumar Agarwal wrote:
>
> -Original Message- From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com]
> Sent: Wednesday, September 02, 2015 8:23 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
> Subject: Re: Live range Analysis based on tree representations
>
> On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> > All:
> >
> > Live range info on the tree SSA representation is an important step towards SSA-based code motion optimizations, since code motion on the SSA representation affects register pressure, a common reason for performance bottlenecks.
> >
> > I am proposing live range analysis based on the SSA representation. The live range analysis traverses the dominator tree. The SSA and phi variables are placed based on dominance frontier info, so the SSA representation reflects the dominance info, and from such dominance info an overlapping live range analysis can be derived.
> >
> > The live ranges of variables V and W overlap at a point p if Vdef dominates p and Wdef dominates p. If Vdef dominates Wdef and Wdef dominates Udef, then Vdef dominates Udef; thus if the live range of V intersects W, and the live range of W intersects U, the live range of V intersects U. Such dominance info can be used to represent the overlapping live range analysis, and the register pressure is derived from the overlapping live ranges based on the dominator info inherited from the SSA representation.
> >
> > The above overlapping live range info can be used to derive the register pressure, and optimizations based on the tree representation can use these overlapping live ranges to take register pressure into account.
>
> >>Ajit,
> >>I did a prototype of this kind of analysis at one point last year to see if it could help improve inlining decisions in LTO. Basically I did exactly what you suggest and computed the number of overlapping SSA live ranges and used that as a proxy for register pressure. It did appear to be able to help in some circumstances but the real solution is to improve register allocation so it doesn't fall down when register pressure gets high.
>
> Aaron:
> Would you mind explaining in which circumstances it helps and in which it won't? Live ranges on the SSA representation form chordal graphs, which may have different colorability requirements than the real register allocator sees. This may not give the exact register pressure compared to the register allocator, since the register allocator sits further down the optimization and code generation pipeline, but it forms the basis of SSA-based optimizations that affect register pressure.
>
> >>The code is in a branch called lto-pressure.
>
> Thanks. I would like to see the code.
Ajit,
>>The branch is here: svn://gcc.gnu.org/svn/gcc/branches/lto-pressure
>>The analysis is in ipa-inline.c; if you search for "pressure" you'll find the code.
>>The live ranges in SSA are certainly different than what the register allocator is going to see, but it's what we have to work with at the point where the inlining decisions are made, which is why I was looking at it. My hope was that it would be a reasonable proxy for downstream register pressure.
>>I went and did this after seeing a particular situation in bzip2 where a bunch of inlining done by LTO sent register pressure up a lot and resulted in a measurable increase in loads/stores due to extra spill code. Functions that are called in only one place and not externally visible will be inlined regardless of their size. There is one function in bzip2 that has particularly complex control flow, and inlining this into another non-trivial function caused all the excess spi
Replacing malloc with alloca.
All:

The replacement of malloc with alloca can be done based on the following analysis: if the lifetime of an object does not stretch beyond its immediate scope, the malloc can be replaced with alloca. This increases performance to a great extent. Inlining helps here to a great extent, since after inlining the lifetime of an object often doesn't stretch beyond its immediate scope, and the opportunity to replace malloc with alloca can be identified.

I am wondering in which phases of our optimization pipeline malloc is replaced with alloca, and what analysis is done to perform the transformation. Does this greatly increase the performance of the benchmarks? Is the analysis done through escape analysis? If yes, then what data structure is used for the abstract interpretation?

Thanks & Regards
Ajit
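[Editor's note] An illustrative candidate for the transformation discussed above: the allocation cannot escape the function (no pointer is stored or returned), so a compiler could replace the malloc/free pair with stack allocation. Function names are made up.

#include <alloca.h>
#include <stdlib.h>
#include <string.h>

int
sum_copy_malloc (const int *src, int n)
{
  int *tmp = malloc (n * sizeof *tmp);  /* lifetime ends in this scope */
  if (!tmp)
    return -1;
  memcpy (tmp, src, n * sizeof *tmp);
  int s = 0;
  for (int i = 0; i < n; i++)
    s += tmp[i];
  free (tmp);
  return s;
}

int
sum_copy_alloca (const int *src, int n)
{
  int *tmp = alloca (n * sizeof *tmp);  /* frame-local, no free needed */
  memcpy (tmp, src, n * sizeof *tmp);
  int s = 0;
  for (int i = 0; i < n; i++)
    s += tmp[i];
  return s;
}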
Live Range Splitting in Integrated Register Allocator
Sorry for resending again as plain text, as my earlier mail was sent with HTML enabled; this allows it to be sent to gcc@gcc.gnu.org. Sorry once again.

Thanks & Regards
Ajit

From: Ajit Kumar Agarwal
Sent: Wednesday, May 14, 2014 10:43 PM
To: 'gcc@gcc.gnu.org'; 'vmaka...@redhat.com'
Cc: 'Michael Eager'; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: Live Range Splitting in Integrated Register Allocator

Adding the gcc@gcc.gnu.org mailing list.

From: Ajit Kumar Agarwal
Sent: Wednesday, May 14, 2014 10:33 PM
To: 'gcc@gcc.gnu.org'; 'vmaka...@redhat.com'
Cc: 'Michael Eager'; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Live Range Splitting in Integrated Register Allocator

Hello All:

I am planning to implement live range splitting based on the following cases in the Integrated Register Allocator.

Consider a live range that spans from an outer region into the inner region of a loop: it is live-in at the entry of the loop header and live-out at the exit of the loop, but has no references inside the loop. Such live ranges lead to suboptimal spill and fetch inside the loop, conflicting with the shorter live ranges that span only the inside of the loop.

Let's call such a live range L1. L1 can be split at the loop boundary by inserting a store at the header of the loop and a load at the exit of the loop. This makes the live range conflict less with the live ranges that are local to the loop region, reducing the spills and fetches inside the loop.

From the code and documentation of the Integrated Register Allocator, the following is my understanding. As live range L1 is live in the outer region but has no reference inside the loop region, the allocno for L1 for a given variable v is assigned two allocnos, v1 and v2: v1 being the allocno for the outer region and v2 the allocno for the inner loop region. This allows accumulating the information from the inner loop region into the outer region.

Will the current Integrated Register Allocator consider the live range L1 as live inside the loop and the outer region? If yes, then it will conflict with the live ranges that are local to the loop region, leading to spills and fetches inside the loop. If the allocnos v1 and v2 are created, v1 for the outer region and v2 for the inner region, then v2 will conflict with the local live ranges inside the loop region and v1 will conflict with the live ranges of the outer region. Is this how live range splitting at the loop boundary is handled for a live range that spans the loop but is not referenced in it?

If such cases are not considered in the Integrated Register Allocator, then it would be useful to implement them in IRA, which would benefit the MicroBlaze target.

Please let me know what you think.

Thanks & Regards
Ajit
RE: Live Range Splitting in Integrated Register Allocator
On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote:
>
> Hello All:
>
> I am planning to implement live range splitting based on the following cases in the Integrated Register Allocator.
>
> Consider a live range that spans from an outer region into the inner region of a loop: it is live-in at the entry of the loop header and live-out at the exit of the loop, but has no references inside the loop. Such live ranges lead to suboptimal spill and fetch inside the loop, conflicting with the shorter live ranges that span only the inside of the loop.
>
> Let's call such a live range L1. L1 can be split at the loop boundary by inserting a store at the header of the loop and a load at the exit of the loop. This makes the live range conflict less with the live ranges that are local to the loop region, reducing the spills and fetches inside the loop.
>
> From the code and documentation of the Integrated Register Allocator, the following is my understanding. As live range L1 is live in the outer region but has no reference inside the loop region, the allocno for L1 for a given variable v is assigned two allocnos, v1 and v2: v1 being the allocno for the outer region and v2 the allocno for the inner loop region. This allows accumulating the information from the inner loop region into the outer region.
>
> Will the current Integrated Register Allocator consider the live range L1 as live inside the loop and the outer region? If yes, then it will conflict with the live ranges that are local to the loop region, leading to spills and fetches inside the loop. If the allocnos v1 and v2 are created, v1 for the outer region and v2 for the inner region, then v2 will conflict with the local live ranges inside the loop region and v1 will conflict with the live ranges of the outer region. Is this how live range splitting at the loop boundary is handled for a live range that spans the loop but is not referenced in it?
>
> If such cases are not considered in the Integrated Register Allocator, then it would be useful to implement them in IRA, which would benefit the MicroBlaze target.
>
> Please let me know what you think.
>
>>Allocno v2, corresponding to the live range inside the loop, has a very small cost for spilling, therefore it will be spilled if we still need registers for pseudos local to the loop. If allocno v1, corresponding to the live ranges outside the loop *and* inside the loop, gets a hard register, we will have live range splitting as you propose. So I do not see a necessity for the optimization you propose.

If the allocno v1 doesn't get a hard register and becomes a candidate for spilling, being live across the loop but not referenced inside it, it will lead to spills inside the loop, which become a bottleneck as far as performance is concerned. Is this case also handled in IRA? With my proposal, the load and store would be moved outside the loop instead of sitting inside the loop in such a case.

>>Moreover, my experience shows that making a lot of explicit transformations (e.g. the proposed splitting), even if we have transformations to undo them (e.g. coalescing), results in worse code. The explicit transformations should be as minimal as possible during RA in a good register allocator. So I guess the optimization you propose will actually result in worse code.
>>Although I might be wrong, because it is always hard to predict the result of heuristic optimizations.

There is a paper on improved passive splitting which does take care of splitting at loop boundaries for live ranges that span the loop but are not referenced in it: "Improved Passive Splitting" (2005) by Keith D. Cooper, Jason Eckhardt. This paper uses heuristics, claims to reduce the spills and fetches for the case I propose, and also claims to generate better code.

>>What is really missed in RA is a good splitting on BB boundaries. I have plans to try this as a part of a more common pass to decrease register pressure, on which I'll start work soon.
>>In any way, thanks for the proposal. You are free to try it to confirm or reject my prediction. Unfortunately, that is the only way to be sure about the result.

Thanks & Regards
Ajit
RE: Live Range Splitting in Integrated Register Allocator
Thanks Vladimir for the clarification.

Thanks & Regards
Ajit

-Original Message- From: Vladimir Makarov [mailto:vmaka...@redhat.com]
Sent: Thursday, May 15, 2014 8:39 PM
To: Ajit Kumar Agarwal; gcc@gcc.gnu.org
Cc: Michael Eager; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live Range Splitting in Integrated Register Allocator

On 05/15/2014 03:28 AM, Ajit Kumar Agarwal wrote:
>
> On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote:
>
>> Hello All:
>>
>> I am planning to implement live range splitting based on the following cases in the Integrated Register Allocator.
>>
>> Consider a live range that spans from an outer region into the inner region of a loop: it is live-in at the entry of the loop header and live-out at the exit of the loop, but has no references inside the loop. Such live ranges lead to suboptimal spill and fetch inside the loop, conflicting with the shorter live ranges that span only the inside of the loop.
>>
>> Let's call such a live range L1. L1 can be split at the loop boundary by inserting a store at the header of the loop and a load at the exit of the loop. This makes the live range conflict less with the live ranges that are local to the loop region, reducing the spills and fetches inside the loop.
>>
>> From the code and documentation of the Integrated Register Allocator, the following is my understanding. As live range L1 is live in the outer region but has no reference inside the loop region, the allocno for L1 for a given variable v is assigned two allocnos, v1 and v2: v1 being the allocno for the outer region and v2 the allocno for the inner loop region. This allows accumulating the information from the inner loop region into the outer region.
>>
>> Will the current Integrated Register Allocator consider the live range L1 as live inside the loop and the outer region? If yes, then it will conflict with the live ranges that are local to the loop region, leading to spills and fetches inside the loop. If the allocnos v1 and v2 are created, v1 for the outer region and v2 for the inner region, then v2 will conflict with the local live ranges inside the loop region and v1 will conflict with the live ranges of the outer region. Is this how live range splitting at the loop boundary is handled for a live range that spans the loop but is not referenced in it?
>>
>> If such cases are not considered in the Integrated Register Allocator, then it would be useful to implement them in IRA, which would benefit the MicroBlaze target.
>>
>> Please let me know what you think.
>>
>>> Allocno v2, corresponding to the live range inside the loop, has a very small cost for spilling, therefore it will be spilled if we still need registers for pseudos local to the loop. If allocno v1, corresponding to the live ranges outside the loop *and* inside the loop, gets a hard register, we will have live range splitting as you propose. So I do not see a necessity for the optimization you propose.
> If the allocno v1 doesn't get a hard register and becomes a candidate for spilling, being live across the loop but not referenced inside it, it will lead to spills inside the loop, which become a bottleneck as far as performance is concerned. Is this case also handled in IRA?
> With my proposal, the load and store would be moved outside the loop instead of sitting inside the loop in such a case.
>
If v1 does not get a hard register (and v2 most probably too), then there will not be any additional insns, because v1 and v2 will most probably get the same memory. If v1 gets a hard register and v2 does not, then we have a store on the loop entry and loads on the loop exits (the store/loads are not inside the loop).

>>> Moreover, my experience shows that making a lot of explicit transformations (e.g. the proposed splitting), even if we have transformations to undo them (e.g. coalescing), results in worse code. The explicit transformations should be as minimal as possible during RA in a good register allocator. So I guess the optimization you propose will actually result in worse code. Although I might be wrong, because it is always hard to predict the result of heuristic optimizations.
> There is a paper on improved passive splitting which does take care of splitting at loop boundari
RE: negative latencies
Is it the case of code speculation where negative latencies are used?

Thanks & Regards
Ajit

-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of shmeel gutl
Sent: Monday, May 19, 2014 12:23 PM
To: Andrew Pinski
Cc: gcc@gcc.gnu.org; Vladimir Makarov
Subject: Re: negative latencies

On 19-May-14 09:39 AM, Andrew Pinski wrote:
> On Sun, May 18, 2014 at 11:13 PM, shmeel gutl wrote:
>> Are there hooks in gcc to deal with negative latencies? In other words, an architecture that permits an instruction to use a result from an instruction that will be issued later.
> Do you mean bypasses? If so there is a bypass feature which you can use:
> https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.html#index-data-bypass-3773
>
> Thanks,
> Andrew Pinski

Unfortunately, bypasses in the pipeline description are not enough. They only allow you to calculate the latency of true dependencies, and they are forced to be zero or greater. The real question is how the scheduler and register allocator can deal with negative latencies.

Thanks
Shmeel

>> At first glance it seems that it will break a few things.
>> 1) The definition of dependencies cannot come from the simple ordering of rtl.
>> 2) The scheduling problem starts to look like "get off the train 3 stops before me".
>> 3) The definition of live ranges needs to use actual instruction timing information, not just instruction sequencing.
>>
>> The hooks in the scheduler seem to be enough to stop damage but not enough to take advantage of this "feature".
>>
>> Thanks
Reducing Register Pressure through Live range Shrinking through Loops!!
Hello All:

Simpson does live range shrinking and reduction of register pressure by using computations that are not loads and stores but arithmetic computations. Where the operands and result registers of a computation are live at the entry and exit of a basic block but not touched inside the block, the computation is moved to the end of the block, reducing the register pressure inside the block by one. I propose extending Simpson's work from computations not touched inside a basic block to computations spanning loops: if a live range spans a loop, is live at the entry and exit of the loop, but the computation is not touched inside the loop, then the computation is moved after the exit of the loop.

REDUCTION OF REGISTER PRESSURE THROUGH LIVE RANGE SHRINKING INSIDE THE LOOPS

For each loop, starting from the innermost to the outermost, do the following:

begin
  RELIEF_IN(i)  = empty, if i is the entry of the CFG;
                  otherwise, the intersection over all predecessors j of RELIEF_OUT(j)
  RELIEF_OUT(i) = (RELIEF_IN(i) - EXPOSED(i)) union RELIEF(i)
  INSERT(i,j)   = (RELIEF_OUT(i) - RELIEF_IN(j)) intersection LIVE(j)
end

Simpson's approach does take the nesting depth into consideration when placing the computation and relieving the register pressure, but it doesn't take into consideration computations that span an entire loop, whose operands and results are live at the entry and the exit of the loop but are not touched inside the loop; handling these can be useful in reducing register pressure inside loops. This approach will be useful in a region-based register allocator for live range splitting at the region boundaries.

The extension of Simpson's approach is to run the data flow analysis with respect to the given loop rather than the entire control flow graph. This data flow analysis starts from the inner loop and extends to the outer loops. If the reference is not within the nest, or only within some depth, the computation can be placed accordingly. For a register allocator by graph coloring, the live ranges of the operands and results of such computations are put onto the stack during the simplification phase, so there is a chance of such live ranges becoming colorable, which reduces the register pressure. This extends to a splitting approach based on containment of live ranges.

OPTIMAL PLACEMENT OF THE COMPUTATION FOR SINGLE ENTRY AND MULTIPLE EXIT LOOPS

The placement of the computation to reduce register pressure with Simpson's approach leads to a suboptimal solution for single entry, multiple exit loops. The solution is suboptimal because the exit node of the loop does not post-dominate all the basic blocks inside the loop, so placing the computation just after the tail block of the loop will lead to incorrect results. For an optimal placement, the computation needs to be placed in the block just after all the exit points of the loop reconverge, which post-dominates all the blocks of the loop. This takes care of reducing the register pressure for loops that are single entry and multiple exit. For irreducible loops, the transformation to make them reducible is done before register allocation; the approach then applies to structured control flow and thus reduces the register pressure.
The existing live range shrinking that reduces register pressure takes loads and stores into consideration, but not computations as proposed by Simpson. I am proposing to extend GCC to handle computations to reduce register pressure, and to handle loops as given above, for both single entry/single exit and single entry/multiple exit loops. A small source-level picture of the transformation is sketched below.

Please let me know what you think.

Thanks & Regards
Ajit
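[Editor's note] Illustration only: t = a + b is live across the loop, but neither its operands nor its result are touched inside it, so the computation can be sunk below the loop exit, shrinking the set of values live inside the loop by one, as the proposal above describes.

int
sink_before (int a, int b, int *v, int n)
{
  int t = a + b;           /* t, a and b all live across the loop */
  for (int i = 0; i < n; i++)
    v[i] = v[i] * 2;       /* touches none of a, b, t */
  return t;
}

int
sink_after (int a, int b, int *v, int n)
{
  for (int i = 0; i < n; i++)
    v[i] = v[i] * 2;       /* only a and b stay live across the loop */
  int t = a + b;           /* computation placed after the loop exit */
  return t;
}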
RE: Reducing Register Pressure through Live range Shrinking through Loops!!
On Friday, May 23, 2014 1:46 AM Vladimir Makarov wrote:

On 05/21/2014 12:25 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> Simpson does live range shrinking and reduction of register pressure by using computations that are not loads and stores but arithmetic computations. Where the operands and result registers of a computation are live at the entry and exit of a basic block but not touched inside the block, the computation is moved to the end of the block, reducing the register pressure inside the block by one. I propose extending Simpson's work from computations not touched inside a basic block to computations spanning loops: if a live range spans a loop, is live at the entry and exit of the loop, but the computation is not touched inside the loop, then the computation is moved after the exit of the loop.
>
> REDUCTION OF REGISTER PRESSURE THROUGH LIVE RANGE SHRINKING INSIDE THE LOOPS
>
> For each loop, starting from the innermost to the outermost, do the following:
>
> begin
>   RELIEF_IN(i)  = empty, if i is the entry of the CFG;
>                   otherwise, the intersection over all predecessors j of RELIEF_OUT(j)
>   RELIEF_OUT(i) = (RELIEF_IN(i) - EXPOSED(i)) union RELIEF(i)
>   INSERT(i,j)   = (RELIEF_OUT(i) - RELIEF_IN(j)) intersection LIVE(j)
> end
>
> Simpson's approach does take the nesting depth into consideration when placing the computation and relieving the register pressure, but it doesn't take into consideration computations that span an entire loop, whose operands and results are live at the entry and the exit of the loop but are not touched inside the loop; handling these can be useful in reducing register pressure inside loops. This approach will be useful in a region-based register allocator for live range splitting at the region boundaries.
>
> The extension of Simpson's approach is to run the data flow analysis with respect to the given loop rather than the entire control flow graph. This data flow analysis starts from the inner loop and extends to the outer loops. If the reference is not within the nest, or only within some depth, the computation can be placed accordingly. For a register allocator by graph coloring, the live ranges of the operands and results of such computations are put onto the stack during the simplification phase, so there is a chance of such live ranges becoming colorable, which reduces the register pressure. This extends to a splitting approach based on containment of live ranges.
>
> OPTIMAL PLACEMENT OF THE COMPUTATION FOR SINGLE ENTRY AND MULTIPLE EXIT LOOPS
>
> The placement of the computation to reduce register pressure with Simpson's approach leads to a suboptimal solution for single entry, multiple exit loops. The solution is suboptimal because the exit node of the loop does not post-dominate all the basic blocks inside the loop, so placing the computation just after the tail block of the loop will lead to incorrect results. For an optimal placement, the computation needs to be placed in the block just after all the exit points of the loop reconverge, which post-dominates all the blocks of the loop. This takes care of reducing the register pressure for loops that are single entry and multiple exit.
> For irreducible loops, the transformation to make them reducible is done before register allocation; the approach then applies to structured control flow and thus reduces the register pressure.
>
> The existing live range shrinking that reduces register pressure takes loads and stores into consideration, but not computations as proposed by Simpson. I am proposing to extend GCC to handle computations to reduce register pressure, and to handle loops as given above, for both single entry/single exit and single entry/multiple exit loops.
>
>>Thanks for sharing this. I have plans to work on a new register pressure relief pass too this summer (more accurately, this work is in my company's plans -- so I should work on this anyway). It will use Simpson's approach also. I prefer to do it as a separate pass, not as a part of IRA, because IRA is already complicated. It also permits rematerializing not only on loop borders (although those are the most important points).
>>It is hard for me to say which approach will be better, as there are too many transformations even after IRA (e.g. in IRA you can think that pseudos got hard registers and rematerialize
Reducing Register Pressure based on Instruction Scheduling and Register Allocator!!
Hello All:

I was looking further into the aspect of reducing register pressure based on register allocation and instruction scheduling, and the following observations are made based on the existing papers on reducing register pressure through scheduling. Have the following aspects of reducing register pressure already been implemented in GCC?

Minimum register usage through better instruction scheduling

Instruction scheduling plays an important role in the increase or decrease of register pressure. If instruction scheduling is done before register allocation, the register pressure will increase, whereas if register allocation is done before instruction scheduling, false antidependences will be created. For an out-of-order superscalar processor, the scheduling should be done in such a way that register pressure decreases: achieving ILP increases the register pressure, but on a superscalar processor the register pressure matters, so instruction scheduling for ILP should also consider register pressure.

Govindarajan proposed a scheme where list scheduling is modified to have the available registers as one of the constraints, along with latency. From the data dependence graph of a given basic block, chains are created based on depth-first search, and one register is allocated for each chain, in a baton-passing fashion where the def is used by the next instruction and that instruction's def is used by the one after it. A given register is assigned to a def and its use, and the same register is used for the next def and use, so each chain needs one register. The main challenge is to assign the same register to different chains if the chains don't overlap.

List scheduling is modified to take a list of available colors as a parameter, with a release operation returning colors to the available set. When a node of the DAG is removed from the ready queue, the live ranges that terminate at that node are released and added back as available colors. This ensures that the register pressure is not increased by the schedule, an idea derived from the graph coloring register allocator.

Register pressure sensitive instruction scheduling

A superscalar OOO processor does dynamic instruction scheduling over an instruction window, with register renaming, to achieve ILP. The main challenge is to take an ILP schedule and generate from it an instruction sequence that is sensitive to registers; this is called linearization. Groups are formed from the ILP schedule:

1. If the instruction selected is s, all the instructions scheduled before s form one group.
2. All the instructions scheduled after s form another group.

The linearization algorithm uses a "can" test and a "must" test. The can test: if the selected instruction s is generated at index i, all the instructions in group 1 that are not yet scheduled but are in the ready list can be generated by index i+W-1, where W is the size of the register window. The must test: if the selected instruction is generated at index i, then all the instructions scheduled after s must be generated by index i+W-1. If the instructions scheduled after s do not fit in the window, the ILP won't be achieved.
The linearization algorithm checks the can test; if it is satisfied, the instruction is generated. If it is not satisfied, the must test is checked, and if that is satisfied the instruction is generated. The register pressure should not increase during linearization, as there is a chance of the register pressure increasing during scheduling. To achieve this, a priority is assigned to each instruction in the ready list: if all the uses of the variables in the selected instruction are already scheduled, the priority is assigned as 2; if an instruction depends on i, and i depends on another instruction, the live range is too long, so the priority is lower and assigned as 1. From the ready list the algorithm selects the instruction with the highest priority, which ensures the register pressure doesn't increase. A register-budgeted ready-list walk in this spirit is sketched below this message.

There are approaches where, if the register pressure in a basic block is high, the instruction scheduler is run after register allocation, and otherwise before it. This is why GCC has an instruction scheduler both before and after the register allocator.

Integrated code scheduling and register allocation with minimum register pressure

Integrated code scheduling uses a prepass and a postpass instruction scheduler. The prepass scheduler increases the overuse of the registers and th
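[Editor's note] A toy register-budgeted ready-list walk in the spirit of the scheme described above; it is not GCC's scheduler. Each issued instruction takes one register for its def, and releases one if it is the last use of its operand; an instruction that would exceed the budget is skipped until a register is free. The DAG and the budget are made up.

#include <stdio.h>

#define NINSN 4

/* dep[i]: single predecessor of instruction i in the DAG, or -1.
   last_use[i]: 1 if issuing i frees the register of its operand.  */
static const int dep[NINSN]      = { -1, 0, -1, 2 };
static const int last_use[NINSN] = { 0, 1, 0, 1 };

int
main (void)
{
  int done[NINSN] = { 0 };
  int free_regs = 2;  /* the register budget ("available colors") */
  int issued = 0;

  while (issued < NINSN)
    for (int i = 0; i < NINSN; i++)
      {
        if (done[i] || (dep[i] >= 0 && !done[dep[i]]))
          continue;                    /* not ready yet */
        if (free_regs == 0 && !last_use[i])
          continue;                    /* would exceed the budget */
        free_regs += last_use[i] - 1;  /* def takes one, last use frees one */
        done[i] = 1;
        issued++;
        printf ("issue I%d (free regs: %d)\n", i + 1, free_regs);
      }
  return 0;
}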
vector load Rematerialization!!
Hello All:

There has been work done on load rematerialization: instead of storing and loading variables, they are kept in registers for the live range. Till now we do rematerialization of scalar loads. Is it feasible to have rematerialization for vector loads? This would help reduce the vectorized stores and loads for the dependencies across vectorized loops.

I was looking at a presentation which mentions that load rematerialization is implemented from GCC 4.8.2 onwards. Does this implementation take care of rematerialization of vector loads? Can we have this approach?

Please let me know what you think.

Thanks & Regards
Ajit
Register Pressure guided Unroll and Jam in GCC !!
Hello All:

I have worked on the Open64 compiler, where register pressure guided unroll and jam gave a good amount of performance improvement for the C and C++ SPEC benchmarks and also the Fortran benchmarks.

Unroll and jam increases the register pressure in the unrolled loop, leading to an increase in spills and fetches that degrades the performance of the unrolled loop. The cache locality achieved through unroll and jam is degraded by the presence of spill instructions due to the increased register pressure, so it is better to base the decision on the unroll factor of the loop on a performance model of the register pressure.

Most loop optimizations like unroll and jam are implemented on the high level IR. Register pressure based unroll and jam requires calculating register pressure on the high level IR, similar to the register pressure we calculate during register allocation, which makes the implementation complex. To overcome this, the Open64 compiler makes the unrolling decision both on the high level IR and at the code generation level, with some of the decisions made at the very end of code generation. The advantage of this approach is that it can use the register pressure information calculated by the register allocator, which keeps the implementation simpler and less complex.

Can we have this approach in GCC, with the unroll and jam decisions on the high level IR and some of the decisions deferred to the code generation level like Open64? A small source-level illustration of the transform follows below.

Please let me know what you think.

Thanks & Regards
Ajit
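[Editor's note] Illustration of unroll-and-jam (outer loop unrolled by two, the two bodies jammed into one inner loop). The doubled set of array references and temporaries live in the jammed body is the register pressure the note above wants the unroll-factor model to account for. n is assumed even for brevity; the array shapes are made up.

void
uaj_before (float c[][128], float a[][128], int n)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < 128; j++)
      c[i][j] = a[i][j] * 2.0f;
}

void
uaj_after (float c[][128], float a[][128], int n)
{
  for (int i = 0; i < n; i += 2)
    for (int j = 0; j < 128; j++)
      {
        c[i][j]     = a[i][j]     * 2.0f;  /* iteration i     */
        c[i + 1][j] = a[i + 1][j] * 2.0f;  /* iteration i + 1 */
      }
}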
RE: Register Pressure guided Unroll and Jam in GCC !!
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Monday, June 16, 2014 7:55 PM To: Ajit Kumar Agarwal Cc: gcc@gcc.gnu.org; Vladimir Makarov; Michael Eager; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Register Pressure guided Unroll and Jam in GCC !! On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal wrote: > Hello All: > > I have worked on the Open64 compiler where the Register Pressure Guided > Unroll and Jam gave a good amount of performance improvement for the C and > C++ Spec Benchmark and also Fortran benchmarks. > > The Unroll and Jam increases the register pressure in the Unrolled Loop > leading to increase in the Spill and Fetch degrading the performance of the > Unrolled Loop. The Performance of Cache locality achieved through Unroll and > Jam is degraded with the presence of Spilling instruction due to increases in > register pressure Its better to do the decision of Unrolled Factor of the > Loop based on the Performance model of the register pressure. > > Most of the Loop Optimization Like Unroll and Jam is implemented in the High > Level IR. The register pressure based Unroll and Jam requires the calculation > of register pressure in the High Level IR which will be similar to register > pressure we calculate on Register Allocation. This makes the implementation > complex. > > To overcome this, the Open64 compiler does the decision of Unrolling to both > High Level IR and also at the Code Generation Level. Some of the decisions > way at the end of the Code Generation . The advantage of using this approach > like Open64 helps in using the register pressure information calculated by > the Register Allocator. This helps the implementation much simpler and less > complex. > > Can we have this approach in GCC of the Decisions of Unroll and Jam in the > High Level IR and also to defer some of the decision at the Code Generation > Level like Open64? > > Please let me know what do you think. >>Sure, you can for example compute validity of the transform during the GIMPLE >>loop opts, annotate the loop meta-information with the desired transform and >>apply it (or not) later >>during RTL unrolling. Thanks !! Has RTL unrolling been already implemented? Richard. > Thanks & Regards > Ajit
RE: Possible LRA issue?
The xmalloc calls at the points shown below in the register allocator are not caused only by the structure itself or by changing the S passed as a template argument. They depend on how the structure below is referenced and used. From the stack traces I can see that the live range creation is driven by how the structure below is referenced and used. Thanks & Regards Ajit -Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Daniel Gutson Sent: Wednesday, August 27, 2014 7:58 PM To: gcc Mailing List Subject: Possible LRA issue? Hi, I have a large codebase where at some point, there's a structure that takes an unsigned integer template argument, and uses it as the size of an array, something like

template <unsigned S>
struct Struct
{
  typedef std::array<..., S> Chunk;
  typedef std::list<Chunk> Content;

  Content c;
};

Changing the values of S alters significantly the compile time and memory that the compiler takes. We use some large numbers there. At some point, the compiler runs out of memory (xmalloc fails). I wondered why, and did some analysis by debugging the 4.8.2 (same with 4.8.3), and did the following experiment turning off all the optimizations (-fno-* and -O0): I generated a report of xmalloc usage of two programs: one having S=10u, and another with S=11u, just to see the difference of 1. The report was generated as follows: I set a breakpoint at xmalloc, appending a bt to a file. Then I found common stack traces and counted how many xmallocs were called in one and another versions of the program (S=10u and S=11u as mentioned above). The differences were:

a) Stack trace:
xmalloc | pool_alloc | create_live_range | mark_pseudo_live | mark_regno_live | process_bb_lives | lra_create_live_ranges | lra | do_reload | rest_of_handle_reload | execute_one_pass | execute_pass_list | execute_pass_list | expand_function | output_in_order | compile | finalize_compilation_unit | cp_write_global_declarations | compile_file | do_compile | toplev_main | __libc_start_main | _start |
S=10u: 15 times
S=11u: 16 times

b) Stack trace:
xmalloc | lra_set_insn_recog_data | lra_get_insn_recog_data | lra_update_insn_regno_info | lra_update_insn_regno_info | lra_push_insn_1 | lra_push_insn | push_insns | lra_process_new_insns | curr_insn_transform | lra_constraints | lra | do_reload | rest_of_handle_reload | execute_one_pass | execute_pass_list | execute_pass_list | expand_function | output_in_order | compile | finalize_compilation_unit | cp_write_global_declarations | compile_file | do_compile | toplev_main | __libc_start_main | _start |
S=10u: 186 times
S=11u: 192 times

c) Stack trace:
xmalloc | df_install_refs | df_refs_add_to_chains | df_insn_rescan | emit_insn_after_1 | emit_pattern_after_noloc | emit_pattern_after_setloc | emit_insn_after_setloc | try_split | split_insn | split_all_insns | rest_of_handle_split_after_reload | execute_one_pass | execute_pass_list | execute_pass_list | execute_pass_list | expand_function | output_in_order | compile | finalize_compilation_unit | cp_write_global_declarations | compile_file | do_compile | toplev_main | __libc_start_main | _start |
S=10u: 617 times
S=11u: 619 times

d) Stack trace:
xmalloc | df_install_refs | df_refs_add_to_chains | df_bb_refs_record | df_scan_blocks | rest_of_handle_df_initialize | execute_one_pass | execute_pass_list | execute_pass_list | expand_function | output_in_order | compile | finalize_compilation_unit | cp_write_global_declarations | compile_file | do_compile | toplev_main | __libc_start_main | _start |
S=10u: 13223 times
S=11u: 13227 times

e) Stack trace:
xmalloc | __GI__obstack_newchunk | bitmap_element_allocate | bitmap_set_bit | update_lives | assign_hard_regno | assign_by_spills | lra_assign | lra | do_reload | rest_of_handle_reload | execute_one_pass | execute_pass_list | execute_pass_list | expand_function | output_in_order | compile | finalize_compilation_unit | cp_write_global_declarations | compile_file | do_compile | toplev_main | __libc_start_main | _start |
S=10u: 0 times (never!)
S=11u: 1

Unfortunately I can't disclose the source code nor have the time to isolate a piece of code reproducing the issue. Some comments about the code: I don't do template metaprogramming depending on S, but I do some for-range on the Content. I can extend the analysis to S=12 and compare with the previous values. I thought to fix this myself but lack the time and background on these optimizations. Any hint? I'm open to do more experiments if anybody asks me, or post -fdumps. I suspect that playing with gcc-min-heapsize and similar values this issue could be worked around, but I'd like to know why just changing the size of an array has such a consequence. Thanks! Daniel. -- Daniel F. Gutson Chief Engineering Officer, SPD San Lorenzo 47, 3rd Floor, Office 5 Córdoba, Argentina Phone: +54 351 42178
RE: Possible LRA issue?
-Original Message- From: Daniel Gutson [mailto:daniel.gut...@tallertechnologies.com] Sent: Wednesday, August 27, 2014 8:53 PM To: Ajit Kumar Agarwal Cc: gcc Mailing List Subject: Re: Possible LRA issue? On Wed, Aug 27, 2014 at 12:16 PM, Ajit Kumar Agarwal wrote: > The xmalloc calls at the points shown below in the register allocator are not caused only by the structure itself or by changing the S passed as a template argument. > They depend on how the structure below is referenced and used. From the stack traces I can see that the live range creation is driven by how the structure below is referenced and used. >>Could you please show me an example of such different usages and references? I would like you to provide the exact code with the templatized structures. Here is the flow of def and use:

Live range <1>    Live range <2>

1 = Def  |
         |        2 = Def  |
         |                 |
         |        Use of <2>
         |
         |
Use of <1>

If the size of the elements of the array increases, the creation of live ranges will increase, thus increasing the calls to xmalloc to create the live ranges. The above DEF and USE points are what populate the live ranges. Thus the use and reference of the array elements is the cause of the calls, rather than the declaration of the structures. Thanks & Regards Ajit > > Thanks & Regards > Ajit > > -Original Message- > From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf > Of Daniel Gutson > Sent: Wednesday, August 27, 2014 7:58 PM > To: gcc Mailing List > Subject: Possible LRA issue? > > Hi, > >I have a large codebase where at some point, there's a structure > that takes an unsigned integer template argument, and uses it as the size > of an array, something like > > template <unsigned S> > struct Struct > { > typedef std::array<..., S> Chunk; > typedef std::list<Chunk> Content; > > Content c; > }; > > Changing the values of S alters significantly the compile time and memory > that the compiler takes. We use some large numbers there. > At some point, the compiler runs out of memory (xmalloc fails). I wondered > why, and did some analysis by debugging the 4.8.2 (same with 4.8.3), and did > the following experiment turning off all the optimizations (-fno-* and -O0): > I generated a report of xmalloc usage of two programs: one having S=10u, > and another with S=11u, just to see the difference of 1. > The report was generated as follows: I set a breakpoint at xmalloc, appending > a bt to a file. Then I found common stack traces and counted how many > xmallocs were called in one and another versions of the program (S=10u and > S=11u as mentioned above).
> The difference were: > > a) Stack trace: > xmalloc | pool_alloc | create_live_range | mark_pseudo_live | > mark_regno_live | process_bb_lives | lra_create_live_ranges | lra | > do_reload | rest_of_handle_reload | execute_one_pass | > execute_pass_list | execute_pass_list | expand_function | > output_in_order | compile | finalize_compilation_unit | > cp_write_global_declarations | compile_file | do_compile | toplev_main > | __libc_start_main | _start | > > S=10u: 15 times > S=11u: 16 times > > > b) Stack trace: > xmalloc | lra_set_insn_recog_data | lra_get_insn_recog_data | > lra_update_insn_regno_info | lra_update_insn_regno_info | > lra_push_insn_1 | lra_push_insn | push_insns | lra_process_new_insns | > curr_insn_transform | lra_constraints | lra | do_reload | > rest_of_handle_reload | execute_one_pass | execute_pass_list | > execute_pass_list | expand_function | output_in_order | compile | > finalize_compilation_unit | cp_write_global_declarations | > compile_file | do_compile | toplev_main | __libc_start_main | _start | > > S=10u: 186 times > S=11u: 192 times > > c) Stack trace: > xmalloc | df_install_refs | df_refs_add_to_chains | > df_insn_rescan | emit_insn_after_1 | emit_pattern_after_noloc | > emit_pattern_after_setloc | emit_insn_after_setloc | try_split | > split_insn | split_all_insns | rest_of_handle_split_after_reload | > execute_one_pass | execute_pass_list | execute_pass_list | > execute_pass_list | expand_function | output_in_order | compile | > finalize_compilation_unit | cp_write_global_declar
Global Value Numbering on SSA representation based on Redundancy Class
Hello All: Please find below different global value numbering techniques on the SSA representation, and a proposal for GCC of global value numbering on the SSA representation based on redundancy classes. Can this be proposed? The SSA representation together with the control flow graph can be formulated with a global value numbering algorithm. GVN algorithms assign value numbers to expressions based on hashing, value partitioning, SCCs on the SSA graph, or dominators. Value partitioning is based on congruence classes, and the dominator approach traverses the dominator tree in reverse post order and assigns values to the expressions. The SCC approach on the SSA graph traverses the SSA graph in reverse post order and assigns value numbers based on an optimistic and a valid table. Different optimizations based on GVN are useful, like redundant expression elimination, redundant load and store elimination, common subexpression elimination, reassociation and value redundancy. Global value numbering is proposed here based on a redundancy class assigned to each expression through a renaming scheme on the SSA representation: the variable renaming scheme is extended to expressions in the SSA representation. The redundancy class, along with the SSA graph and the SCC representation of the SSA graph, is used in this proposal. The redundancy class concept is taken from Fred Chow's paper on PRE on SSA. Based on the redundancy classes, a new set of phi-like nodes is inserted that take redundancy classes as operands; this set of phi nodes, together with the redundancy class numbers, is used with the optimistic table and the valid table as suggested by Simpson's value numbering on SCC-based SSA graphs. This helps perform the optimizations mentioned above on the SSA representation.

Value-based partitioning: value-based partitioning assigns values by partitioning into congruence classes. Two expressions are congruent if they have the same opcode and each pair of corresponding operands is congruent; this is a recursive definition. Based on the congruence an initial partition is created, which is refined over iterations until the partitions become stable. For the expressions given below:

A = x - y
B = y - x
C = A + B

the initial partition created is {x}, {y}, {A, B, C}. The algorithm then works as follows. The worklist is the set of congruence classes. A class like {x} is selected; x is an operand of A, and the expressions derived from x form a proper subset of the class {A, B, C}, so a new partition {A} is split off. Similarly the partitions {B} and {C} are created and added to the worklist, and the algorithm continues until the partitions become stable. The final partitions will be {x}, {y}, {A}, {B}, {C}.

SCC-based value numbering on the SSA graph: the strongly connected components (SCCs) are found on the SSA graph with its cycles, and the set of nodes in each SCC is collapsed into a single node of the cycle. For each strongly connected component mapped to a single node, every node is traversed and value numbers are assigned. For the following SSA representation of a control flow graph:

i0 = 1
j0 = 1
while (..)
{
  i1 = phi(i0, i2)
  j1 = phi(j0, j2)
  i2 = i1 + 1
  j2 = j1 + 1
}

For the above example the SSA graph forms a strongly connected component containing the cycle, and each node of the SCC is traversed. Traversing the SSA graph in reverse post order for the variable i, the initial value is 1, which is added to the optimistic table, so the initial lattice value is 1.
At the point of i1 = phi(i0, i2), 1 is added to the optimistic table and assigned as the value of i1. At the first iteration i2 will be 2, as the initial value of i1 is propagated. In the second iteration i1 = phi(i0, i2) is added to the optimistic table and assigned as the value of i1. Traversing the SSA graph in reverse post order for the variable j, the initial value is 1, and j1 will be 1 at the first iteration. Since the value of i1 is assigned 1 and j hashes to the same value, j1 will be equal to i1. At the first iteration j2 will be 2, and j2 will equal i2 as the same value is hashed. In the second iteration j1 = phi(j0, j2) becomes i1 = phi(i0, i2), and this entry is already mapped in the optimistic table. This leads to i being congruent to j, so the same value is assigned to i and j, and the optimization removes the j entries and the related phi node for the j entries.

Global value numbering based on redundancy class: the redundancy class is formed, as given in PRE on SSA by Fred Chow, using SSA variable renaming. The SSA variables used to form the expressions are assigned redundancy classes, similar to the renaming scheme of SSA variables. For the following example:

i0 = 1
j0 = 1
while (..)
{
  i1 = phi(i0, i2)
  j1 = phi(j0, j2)
  i2 = i1 + 1
  j2 = j1 + 1
}

The above SSA program is represented with redundancy classes and the new phi nodes based on the redundancy classes. The above SSA program is modified as follows:

i0 = 1
j0 = 1
while (..)
{
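For reference, a minimal sketch of the simplest GVN flavor listed at the start of this note, hash-based value numbering; a linear scan stands in for the hash lookup to keep the sketch short, and Simpson's SCC algorithm iterates this kind of lookup with the optimistic and valid tables:

#define VN_TABLE_SIZE 1024

struct vn_entry { int opcode, vn1, vn2, value_number; };

static struct vn_entry vn_table[VN_TABLE_SIZE];
static int vn_entries, next_vn;

/* Two binary expressions with the same opcode and the same value numbers
   for their operands get the same value number.  Note that A = x - y and
   B = y - x keep distinct numbers, as in the partitioning example above. */
int value_number (int opcode, int vn1, int vn2)
{
  for (int i = 0; i < vn_entries; i++)
    if (vn_table[i].opcode == opcode
        && vn_table[i].vn1 == vn1
        && vn_table[i].vn2 == vn2)
      return vn_table[i].value_number;   /* congruent: reuse the number */
  /* New triple: allocate a fresh value number (no overflow handling in
     this sketch).  */
  vn_table[vn_entries].opcode = opcode;
  vn_table[vn_entries].vn1 = vn1;
  vn_table[vn_entries].vn2 = vn2;
  vn_table[vn_entries].value_number = ++next_vn;
  return vn_table[vn_entries++].value_number;
}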
Expansion of memset and memcpy calls.
Hello All: Memset and memcpy calls are used extensively in many benchmarks. Inlining or expanding the memcpy and memset calls improves the results of many performance benchmarks. I have implemented the expansion of strcmp into an optimized sequence of instructions in the Open64 compiler for the AMD x86 target. May I suggest and propose expanding the memset and memcpy calls into instruction sequences, as many targets like ARM are moving the implementation of memcpy and memset to assembly instead of C? This makes it easier to expand the memcpy and memset calls in GCC. To implement this in GCC we would need to expand them similarly to the existing builtin expansion. A sketch of the kind of sequence such an expansion could emit is given below. Let me know what you think. Thanks & Regards Ajit
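Illustration only, assuming a word-aligned destination and a known-constant length; the real GCC path would go through the builtin expanders and the target's move patterns:

#include <stdint.h>

/* memset (p, c, 16) expanded into four word stores; p is assumed to be
   4-byte aligned.  */
void memset16_expanded (uint32_t *p, unsigned char c)
{
  uint32_t word = c * 0x01010101u;   /* replicate the byte into a word */
  p[0] = word;
  p[1] = word;
  p[2] = word;
  p[3] = word;
}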
RE: Optimized Allocation of Argument registers
Hello All: I was looking at the optimized usage of and allocation to argument registers. There are two aspects to it, as follows. 1. We need to specify the argument registers, as required by the ABI, in the target-specific code. Based on the function argument registers defined in the target-dependent code, the function arguments are passed. Suppose the number of argument registers defined in the architecture is large, say 6 or 8 function argument registers. Most of the time in the benchmarks we don't pass that many arguments, and the number of arguments passed is quite small. Since we reserve the function argument registers as specified in the target-specific code for the given architecture, this leads to unoptimized usage, as these argument registers will not be used in the function. Thus we should steal some of the argument registers and use them inside the function, depending on the number of function argument registers supported. The stealing of function argument registers will make more registers available for use in the function, leading to fewer spills and fetches. 2. The other aspect of the function argument registers is not to spill and fetch the argument registers when they are live across a function call. Their liveness is limited to a certain point of the called function; after that point the function argument registers are not live and can be used inside the called function. Another aspect: if there is a shortage of registers, should the function argument registers be used as spill candidates? Will this lead to optimized code? Please let me know what you think. Thanks & Regards Ajit
RE: Optimized Allocation of Argument registers
-Original Message- From: Vladimir Makarov [mailto:vmaka...@redhat.com] Sent: Tuesday, November 18, 2014 1:57 AM To: Ajit Kumar Agarwal; gcc Mailing List Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Optimized Allocation of Argument registers On 2014-11-17 8:13 AM, Ajit Kumar Agarwal wrote: > Hello All: > > I was looking at the optimized usage and allocation to argument registers. > There are two aspects to it as follows. > > 1. We need to specify the argument registers as followed by ABI in the target > specific code. Based on the function > argument registers defined in the target dependent code the function > argument registers are passed. If the > number of argument registers defined in the Architecture is large say 6/8 > function argument registers. > Most of the time in the benchmarks we don't pass so many arguments and the > number of arguments passed > is quite less. Since we reserve the function arguments as specified > in the target specific code for the given architecture, leads to unoptimized > usage as this function argument registers will not be used in the function. > Thus we need to steal some of the arguments registers and have the > usage of those in the function depending on the support of the number of > function argument registers. The stealing of function argument registers will > lead more number of registers available that are to be used in the function > and leading to less spill and fetch. > >>The argument registers should be not reserved. They should be present in RTL >>and RA allocator will figure out itself when it can use them. >>That is how other ports work. Thanks Vladimir for Clarifications. > 2. The other aspect of the function argument registers is not spill > and fetch the argument registers as they are live across the function > call. But the liveness is limited to certain point of the called > function after that point the function argument registers are not live > and can be used inside the called function. Other aspect is if there is a > shortage of registers than can the function argument registers should be used > as spill candidate? Will this lead to the optimized code. > > >>You can remove unnecessary code to save/restore arg registers around calls if >>you can figure out that they are not used in called functions. >> There is already code for this written by Tom de Vries. So you can use it. Is the code written by Tom de Vries already a part of trunk? Could you please give the pointer to this part of code. Thanks & Regards Ajit
RE: Optimized Allocation of Argument registers
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Monday, November 17, 2014 9:27 PM To: Ajit Kumar Agarwal; Vladimir Makarov; gcc Mailing List Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Optimized Allocation of Argument registers On 11/17/14 06:13, Ajit Kumar Agarwal wrote: > Hello All: > > I was looking at the optimized usage and allocation to argument registers. > There are two aspects to it as follows. > > 1. We need to specify the argument registers as followed by ABI in the target > specific code. Based on the function > argument registers defined in the target dependent code the function > argument registers are passed. If the > number of argument registers defined in the Architecture is large say 6/8 > function argument registers. > Most of the time in the benchmarks we don't pass so many arguments and the > number of arguments passed > is quite less. Since we reserve the function arguments as specified > in the target specific code for the given architecture, leads to unoptimized > usage as this function argument registers will not be used in the function. > Thus we need to steal some of the arguments registers and have the > usage of those in the function depending on the support of the number of > function argument registers. The stealing of function argument registers will > lead more number of registers available that are to be used in the function > and leading to less spill and fetch. > > 2. The other aspect of the function argument registers is not spill > and fetch the argument registers as they are live across the function > call. But the liveness is limited to certain point of the called > function after that point the function argument registers are not live > and can be used inside the called function. Other aspect is if there is a > shortage of registers than can the function argument registers should be used > as spill candidate? Will this lead to the optimized code. > > Please let me know what do you think. >>Typically GCC ports do not reserve the function argument/return registers, >>they are allocatable just like any other call clobbered register. >>In essence arguments for function calls are set up at each call site by >>copying values out of pseudo registers, memory, constant initializations, etc >>to the >>appropriate outgoing argument register. The allocator will, when >>possible and profitable try to assign the pseudo register to the appropriate >>argument >>register to eliminate the copies. >>Similarly, GCC copies values out of the incoming argument registers and into >>pseudos at the start of a function. The allocator, again when possible and >>>>profitable, will try to allocate the pseudos to the incoming argument >>registers to avoid the copy. >>So in summary, there is no reason to reserve registers for argument passing >>in GCC and doing so would be wasteful. Treat the argument registers as any >>>>other call clobbered register and allow the register allocator to make >>appropriate decisions based on liveness, expected uses, call-crossing >>liveness, related >>copies, etc etc. Thanks Jeff for the explanation and Clarifications. Thanks & Regards Ajit jeff
RE: Optimized Allocation of Argument registers
All: The optimization of reducing saves and restores of callee-saved and caller-saved registers has received attention as a way of increasing benchmark performance. A callee-saved register is saved at the entry and restored at the exit of the procedure if the register is reused inside the procedure, whereas a caller-saved register is saved at the call site before the call and restored after it when a live variable spans the call. The GCC port has an optimization whereby a call-used register that is live inside the procedure and has its bit set is not saved and restored; this is based on data flow analysis. Callee-saved registers are useful when there are multiple calls in the call graph, whereas caller-saved registers are useful when the call is to a leaf procedure, where saving before the call and restoring after the call increases performance. By traversing the call graph in depth-first order, bottom-up, we can propagate the saves and restores at procedure entry and exit to the upper regions of the call graph, which removes the saves and restores in all the lower regions across the various lower calls. These decisions can be made based on the frequency of the calls in the call graph, as proposed by Fred Chow. Another approach to reducing the saves and restores at procedure entry and exit is moving them from the procedure entry to the active regions where the variable is live inside the procedure, based on data flow analysis; this improves the performance of many benchmarks, as proposed by Fred Chow. The propagation of saves and restores from the lower regions of the call graph to the upper regions is not implemented in the GCC framework, and neither is moving the saves and restores from the entry and exit to the active region of liveness inside the procedure. Can this be proposed and implemented in the GCC framework? (A small illustration of the second idea follows the quoted thread below.) For open procedures, where indirect calls, recursive calls and calls with external linkage cannot be optimized, the saves and restores at the entry and exit of the procedure are applied. But for an open procedure, if all the lower calls in the call graph are closed and resolved through the call graph, the saves and restores can be propagated to the upper region in the open procedure from the lower region of the calls that are closed and resolved. This can also improve the performance of many benchmarks; can this be proposed and implemented in the GCC framework? Let me know what you think. Thanks & Regards Ajit -Original Message- From: Ajit Kumar Agarwal Sent: Tuesday, November 18, 2014 7:01 PM To: 'Vladimir Makarov'; gcc Mailing List Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: Optimized Allocation of Argument registers -Original Message- From: Vladimir Makarov [mailto:vmaka...@redhat.com] Sent: Tuesday, November 18, 2014 1:57 AM To: Ajit Kumar Agarwal; gcc Mailing List Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Optimized Allocation of Argument registers On 2014-11-17 8:13 AM, Ajit Kumar Agarwal wrote: > Hello All: > > I was looking at the optimized usage and allocation to argument registers. > There are two aspects to it as follows. > > 1. We need to specify the argument registers as followed by ABI in the target > specific code.
Based on the function > argument registers defined in the target dependent code the function > argument registers are passed. If the > number of argument registers defined in the Architecture is large say 6/8 > function argument registers. > Most of the time in the benchmarks we don't pass so many arguments and the > number of arguments passed > is quite less. Since we reserve the function arguments as specified > in the target specific code for the given architecture, leads to unoptimized > usage as this function argument registers will not be used in the function. > Thus we need to steal some of the arguments registers and have the > usage of those in the function depending on the support of the number of > function argument registers. The stealing of function argument registers will > lead more number of registers available that are to be used in the function > and leading to less spill and fetch. > >>The argument registers should be not reserved. They should be present in RTL >>and RA allocator will figure out itself when it can use them. >>That is how other ports work. Thanks Vladimir for Clarifications. > 2. The other aspect of the function argument registers is not spill > and fetch the argument registers as they are live across the function > call. But the liveness is limited to certain point of the called > f
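To illustrate the 'active region' placement proposed above with a hypothetical example ('heavy' is an assumed external function that forces the use of a callee-saved register):

extern int heavy (int);

int f (int x)
{
  if (x == 0)
    return 0;   /* fast path: with the save moved into the active region,
                   no save/restore executes here at all */
  /* only this path needs the callee-saved register, so the save is
     placed here rather than in the procedure prologue */
  return heavy (x) + heavy (x + 1);
}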
RE: A Question About LRA/reload
-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jeff Law Sent: Tuesday, December 09, 2014 11:26 PM To: Vladimir Makarov; lin zuojian; gcc@gcc.gnu.org Subject: Re: A Question About LRA/reload On 12/09/14 10:10, Vladimir Makarov wrote: > generate the correct code in many cases even for x86. Jeff Law tried > IRA coloring reusage too for reload but whole RA became slower > (although he achieved performance improvements on x86). >>Right. After IRA was complete, I'd walk over the unallocated allocnos and >>split their ranges at EBB boundaries. That created new allocnos with a >>smaller conflict set and reduced the conflict set for the original >>unallocated allocnos. Jeff: Is the above approach of splitting the ranges for unallocated allocnos aggressive, or is it based on some heuristic so that the live ranges of unallocated allocnos are not touched inside the EBBs? >>After I'd done that splitting for all the EBBs, I called back into >>ira_reassign_pseudos to try to assign the original unallocated allocnos as >>well as the new >>allocnos. >>To get good results, much of IRA's cost analysis had to be redone from >>scratch. And from a compile-time standpoint, that's a killer. >>The other approach I was looking at was a backwards walk through each block. >>When I found an insn with an unallocated pseudo, that would trigger one of various range splitting techniques to try and free up a hard register. Then >>again I'd call into ira_reassign_pseudos to try the allocations again. This >>got even >>better results, but was obviously even more compile-time expensive. After the above splitting, are you building the conflict graph again to assign the new allocnos? If the conflict graph is built again, this will affect the compile time. Thanks & Regards Ajit I don't think much, if any, of that work is relevant given the current structure and effectiveness of LRA. jeff
Instruction scheduler with respect to prefetch instructions.
Hello All: Since prefetch instructions have no direct consumers in the code stream, they give the instruction scheduler considerable freedom. They are typically assigned lower priorities than most of the instructions in the code stream. This tends to cause all the prefetch instructions to be placed together in the final schedule, which degrades performance by placing them in clumps rather than evenly spreading the prefetch instructions. Evenly spreading the prefetch instructions gives better speedup ratios, compared with placing them in clumps, for dirty misses. I am curious to know how the schedulers in GCC handle prefetch instructions and how priorities are assigned to prefetch instructions in the GCC schedulers so that the prefetch instructions are evenly spread. A sketch of one possible mechanism is given below. Please let me know what you think. Thanks & Regards Ajit
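One possible mechanism, sketched with hypothetical names (MIN_PREFETCH_GAP and the insn representation are not GCC's): enforce a minimum issue gap between consecutive prefetches when choosing from the ready list, so they spread out instead of clumping:

#define MIN_PREFETCH_GAP 8   /* assumed cycles between prefetch issues */

struct sched_insn { int is_prefetch; int priority; };

/* Return the index of the chosen insn, or -1 if nothing is eligible.
   A prefetch is skipped while the gap constraint is unmet, so prefetches
   cannot all pile up at the end of the block.  */
int choose_insn (struct sched_insn *ready, int n,
                 int cycle, int last_prefetch_cycle)
{
  int best = -1;
  for (int i = 0; i < n; i++)
    {
      if (ready[i].is_prefetch
          && cycle - last_prefetch_cycle < MIN_PREFETCH_GAP)
        continue;
      if (best < 0 || ready[i].priority > ready[best].priority)
        best = i;
    }
  return best;
}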
RE: Instruction scheduler with respect to prefetch instructions.
-Original Message- From: paul_kon...@dell.com [mailto:paul_kon...@dell.com] Sent: Saturday, December 13, 2014 9:46 PM To: Ajit Kumar Agarwal Cc: vmaka...@redhat.com; l...@redhat.com; richard.guent...@gmail.com; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Instruction scheduler with respect to prefetch instructions. > On Dec 13, 2014, at 5:22 AM, Ajit Kumar Agarwal > wrote: > > Hello All: > > Since prefetch instructions have no direct consumers in the code > stream, they give the instruction scheduler considerable freedom. They are typically assigned lower priorities than most of the instructions in the code stream. > This tends to cause all the prefetch instructions to be placed > together in the final schedule, which degrades performance by placing them in clumps rather than evenly spreading the prefetch instructions. > > Evenly spreading the prefetch instructions gives better speedup > ratios, compared with placing them in clumps, for dirty misses. >>I can believe that's true for some processors; is it true for all of them? I >>have the impression that some MIPS processors don't mind clumped prefetches, >>so long as you don't exceed the limit on total number of concurrently >>pending memory accesses. I think it's okay to have clumped prefetches for architectures where prefetches are placed based on prefetch distance, as long as the clump doesn't exceed the limit on concurrent pending memory accesses. The clumped prefetches generated by the scheduler, because prefetches have no direct consumers in the code stream, can exceed the concurrent pending memory accesses if the scheduler takes no special measure, such as passing some information from the prefetch-generation phase to the scheduler. Due to the freedom given to the instruction scheduler by the lack of direct consumers in the code stream, the prefetch instructions clump at the end of the basic blocks, which invalidates the placement the prefetch-generation algorithm chose based on the prefetch distance. The prefetch algorithms based on prefetch distance take care of the cases where the clumped prefetches degrade performance due to dirty misses. My question is whether there is any special handling of prefetch instructions in the instruction scheduler to overcome the above. Thanks & Regards Ajit paul
Register Allocation with Instruction Scheduling.
Hello All: I was going through the article "Register allocation with instruction scheduling: a new approach" by Pinter et al. The phase ordering of register allocation and instruction scheduling is an important topic. Scheduling before the register allocator increases the register pressure, whereas instruction scheduling after register allocation sees the false dependencies created by the register allocator. The false dependencies between instructions prevent two instructions from being scheduled simultaneously, reducing the instruction-level parallelism. To achieve better ILP for instruction scheduling after register allocation, the false dependencies created by the register allocator should be reduced. Let Gs(V,E) be the scheduling graph before register allocation and G'(V,E) the scheduling graph after register allocation; G'(V,E) contains the false dependencies created by the register allocator. Another graph Gf(V,E) is created whose edges represent the false dependencies: the edges of Gf are the intersection of the edges of G'(V,E) with the edges of the complement of the scheduling graph Gs, i.e. the dependencies that appear only after allocation. With the false dependency graph and the interference graph available, any edge of the false dependency graph that is not already in the interference graph is added to the interference graph; since the two nodes then interfere, the assignment phase assigns them different registers. Thus the false dependencies can be removed, the scheduling graph after the register allocator has no false dependencies, and better ILP and scheduling are achieved. This overcomes the disadvantages of instruction scheduling after the register allocator. This looks feasible to implement in GCC, and I would like to implement the above if it is not already done in GCC; a sketch of the graph computation is given below. Please let me know what you think. Thanks & Regards Ajit
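A minimal sketch of the graph computation, using adjacency matrices for brevity; the indices stand for the live ranges defined by the corresponding instructions, and N is an assumed bound:

enum { N = 64 };   /* assumed instruction/live-range count for the sketch */

/* E(Gf) = E(G') intersected with the complement of E(Gs): an edge present
   only after allocation is a dependence the allocator manufactured.  */
void build_false_dep_graph (const char gs[N][N], const char gprime[N][N],
                            char gf[N][N])
{
  for (int u = 0; u < N; u++)
    for (int v = 0; v < N; v++)
      gf[u][v] = gprime[u][v] && !gs[u][v];
}

/* Adding each false dependence as an interference edge makes the
   assignment phase give the two live ranges different hard registers,
   so the false dependence disappears from the post-allocation schedule. */
void augment_interference (char interference[N][N], const char gf[N][N])
{
  for (int u = 0; u < N; u++)
    for (int v = 0; v < N; v++)
      if (gf[u][v])
        interference[u][v] = interference[v][u] = 1;
}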
Vectorization opportunities for conditional branches.
The following Fig (1) shows an implementation of the SSQ kernel from the BLAS library in ATLAS. Fig (2) shows the conversion of the IF-THEN-ELSE in Fig (1) to vectorized code. Normally in automatic vectorization the IF-THEN-ELSE is vectorized only after IF-CONVERSION, which converts control flow to data flow and thereby exposes the opportunity for vectorization. Fig (1) and Fig (2) are taken from the article "Vectorization Past Dependent Branches Through Speculation" by Qing Yi et al. The conversion given in that article checks the IF-THEN and the IF-ELSE parts separately as candidates for vectorization. If the IF-THEN can be vectorized but the IF-ELSE cannot, the conversion given in Fig (2) shows how the conditional branches can still be vectorized. In this case the IF-THEN-ELSE need not be IF-CONVERTED, and the conditional branch is vectorized by the transformation shown in Fig (2).

ssq = *ssq0; scal = *scal0;
for (i = 0; i < N; i++) {
  ax = x[i];
  ax = ABS & ax;
  if (ax <= scal) {
    t0 = ax / scal;
    ssq += t0 * t0;
  } else {
    t0 = scal / ax;
    t0 = t0 * t0;
    t1 = ssq * t0;
    ssq = 1.0 + t1;
    scal = ax;
  }
}
*ssq0 = ssq; *scal0 = scal;

FIG (1)

Transformed to:

VECTOR_PROLOGUE:
  VABS = {ABS, ABS, ABS, ABS};
  Vssq = {ssq0, 0.0, 0.0, 0.0};
  Vscal = {scal, scal, scal, scal};
VECTOR_LOOP:
  for (i = 0; i < N4; i++) {
    Vax = x[i:i+3];
    Vax = VABS & Vax;
    if (VEC_ANY (Vax > Vscal))
      goto SCALAR_RESTART;
    Vt0 = Vax / Vscal;
    Vssq += Vt0 * Vt0;
    continue;
SCALAR_RESTART:
    // Vector to scalar.
    ssq = sum(Vssq[0:3]);
    // Scalar loop.
    for (j = 0; j < 4; j++) {
      ax = x[i+j];
      ax = ABS & ax;
      if (ax > scal) {
        t0 = scal / ax;
        t0 = t0 * t0;
        t1 = ssq * t0;
        ssq = 1.0 + t1;
        scal = ax;
      } else {
        t0 = ax / scal;
        ssq += t0 * t0;
      }
    }
    Vssq = {ssq, 0.0, 0.0, 0.0};
    Vscal = {scal, scal, scal, scal};
  }
VECTOR_EPILOGUE:
  ssq = sum(Vssq[0:3]);
  scal = Vscal[0];

FIG (2)

This looks interesting to me, considering that the IF-THEN-ELSE inside the loop is vectorized as given above without IF-converting. I am not sure how many such sequences of code would be triggered for the SPEC benchmarks, but it looks interesting to me to implement. Thoughts please? Thanks & Regards Ajit
IRA : Changes in the cost of putting allocno into memory.
Hello Vladimir: We have made changes in ira-color.c, in ira_loop_edge_freq and move_spill_restore. The main motivation behind the change is to reduce the memory instructions with respect to loops. The change is to not consider the back edge frequency if there are no references to the allocno inside the loop, even though the allocno is live out at the exit of the loop and live in at the entry of the loop. This reduces the calculated cost of putting the allocno into memory. If we reduce the cost of putting an allocno into memory, there are better chances of getting a color in the simplification phase of the GCC IRA register allocator. Enabling this, there are chances of putting the allocno in a register, and the splitting phase will split the live range across the loop boundary. This is beneficial with respect to reducing memory instructions inside the loop. The other change enables some of the allocnos to be allocated in memory if the cost is less than or equal to zero. This is the first phase of the change in the IRA core register allocation module; I would like to get your feedback on this change, based on which the actual patch will be sent. The changes are accompanied by the patch created for the change, which should make it easier for you to review. We have tested the changes for the Microblaze target on the MIBENCH and EEMBC benchmarks; there is no degradation in the geomean, and some of the MIBENCH and EEMBC benchmarks benefit from this change. Please review the changes and let me know your feedback and the other steps needed to incorporate the changes upstream. We would like to get this change in for Microblaze; please also let us know if the change needs to be enabled with any switch flag. Changes are given below. From 758ee2227e9dde946ac35b772bee99279b1bf996 Mon Sep 17 00:00:00 2001 From: Ajit Kumar Agarwal Date: Tue, 6 Jan 2015 19:42:16 +0530 Subject: [PATCH] IRA: Changes in the cost of putting allocno into memory. Changes are made to not consider the back edge frequency for cost calculation if the allocno is live out at the exit of the loop and live in at the entry of the loop and there are no references inside the loop. Further changes are made to allocate some of the allocnos in memory if the cost is less than or equal to 0.

ChangeLog:
2015-01-06  Ajit Agarwal

	* ira-color.c (move_spill_restore): Add the check cost less than
	or equal to 0.
	(ira_loop_edge_freq): Ignore the loop edge frequency if there are
	no references to the allocno inside the loop.
Signed-off-by: Ajit Agarwal ajit...@xilinx.com
---
 gcc/ira-color.c | 22 +++++++++++++++++++---
 1 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index 39387c8..d180e86 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -2409,6 +2409,8 @@ int
 ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int regno, bool exit_p)
 {
   int freq, i;
+  bitmap_iterator bi;
+  unsigned int j;
   edge_iterator ei;
   edge e;
   vec<edge> edges;
@@ -2423,7 +2425,14 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int regno, bool exit_p)
 	    && (regno < 0
 		|| (bitmap_bit_p (df_get_live_out (e->src), regno)
 		    && bitmap_bit_p (df_get_live_in (e->dest), regno))))
-	  freq += EDGE_FREQUENCY (e);
+	  {
+	    EXECUTE_IF_SET_IN_BITMAP (loop_node->all_allocnos, 0, j, bi)
+	      {
+		if (regno == ALLOCNO_REGNO (ira_allocnos[j]))
+		  if (ALLOCNO_NREFS (ira_allocnos[j]) != 0)
+		    freq += EDGE_FREQUENCY (e);
+	      }
+	  }
     }
   else
     {
@@ -2432,7 +2441,14 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int regno, bool exit_p)
 	  if (regno < 0
 	      || (bitmap_bit_p (df_get_live_out (e->src), regno)
 		  && bitmap_bit_p (df_get_live_in (e->dest), regno)))
-	    freq += EDGE_FREQUENCY (e);
+	    {
+	      EXECUTE_IF_SET_IN_BITMAP (loop_node->all_allocnos, 0, j, bi)
+		{
+		  if (regno == ALLOCNO_REGNO (ira_allocnos[j]))
+		    if (ALLOCNO_NREFS (ira_allocnos[j]) != 0)
+		      freq += EDGE_FREQUENCY (e);
+		}
+	    }
       edges.release ();
     }
@@ -3441,7 +3457,7 @@ move_spill_restore (void)
 		       * (exit_freq + enter_freq));
 	    }
 	}
-      if (cost < 0)
+      if (cost <= 0)
 	{
 	  ALLOCNO_HARD_REGNO (a) = -1;
 	  if (internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
--
1.7.1

Thanks & Regards Ajit
RE: Support for architectures without hardware interlocks
-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Joel Sherrill Sent: Thursday, January 08, 2015 8:59 PM To: Eric Botcazou; Claudiu Zissulescu Cc: gcc@gcc.gnu.org; David Kang Subject: Re: Support for architectures without hardware interlocks On 1/8/2015 9:01 AM, Eric Botcazou wrote: >> I've worked on a gcc target that was porting an architecture without >> hardware interlock support. Basically, you need to emit nop >> operations to avoid possible hw conflicts. At the moment, this was >> done by patching the gcc scheduler to do so. Another issue to keep in mind is >> to check for hardware conflicts across basic-block boundaries. And >> not the last, is to prohibit/avoid any instruction stream >> modification after the scheduler (e.g., peephole optimizations etc.). > That's an overly complex approach, this usually can be done in a > simpler way with a machine-specific pass that runs at the end of the RTL > pipeline. > >>Isn't this similar to needing to fill a delay slot after a branch >>instruction? My recollection is that some SPARC and MIPS have to deal with >>that. I think this can be done at the machine description (md) file level with define_delay, where the delay slot requirements can be described. Thanks & Regards Ajit -- Joel Sherrill, Ph.D. Director of Research & Development joel.sherr...@oarcorp.comOn-Line Applications Research Ask me about RTEMS: a free RTOS Huntsville AL 35805 Support Available(256) 722-9985
Allocating some Loop allocno in memory
I was thinking of some opportunities with respect to reducing spills inside loops. If a live range (allocno) spans a loop, is live out at the exit of the loop, and has no references (is not touched) inside the loop, assign the allocno to memory. This increases the chances of getting a register for other allocnos that span the loop and do have references inside the loop, when the interference graph is not colorable. Since the allocno has no references inside the loop, there will be no load instructions for restores inside the loop; there will be a store instruction for the spill, which will be outside the loop. This reduces the conflict edges of some of the allocnos, increasing the chances of making the interference graph colorable and reducing the spills and restores inside the loop. This heuristic looks reasonable, and it goes side by side with live range splitting across the loop boundary. On top of this heuristic, if after live range splitting across the loop boundary the interference graph is still not colorable, we can assign some of the split live ranges to memory, giving the chance of registers to the live ranges (allocnos) that span the loop and have references inside the loop. We can change the cost of an allocno in memory based on the above heuristic and take this factor into consideration in the cost calculation; a sketch of the check is given below. Thoughts please? Thanks & Regards Ajit
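The check itself is simple; here it is sketched as a predicate, with the two inputs standing for facts IRA already tracks (loop liveness and the in-loop reference count):

/* An allocno that is live through the loop but never referenced inside
   it costs only one store outside the loop when kept in memory, and
   never a reload inside the loop, so memory is the cheap choice and a
   register is freed for allocnos that are referenced in the loop body. */
int prefer_memory_for_loop_allocno (int live_through_loop_p,
                                    int nrefs_in_loop)
{
  return live_through_loop_p && nrefs_in_loop == 0;
}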
RE: Allocating some Loop allocno in memory
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Sunday, January 11, 2015 8:05 PM To: Ajit Kumar Agarwal; vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Allocating some Loop allocno in memory On January 11, 2015 5:25:23 AM CET, Ajit Kumar Agarwal wrote: > >I was thinking of some opportunities with respect to reducing spills >inside loops. If a live range (allocno) spans a loop, is live out at >the exit of the loop, and has no references (is not touched) inside >the loop, assign the allocno to memory. This increases the chances of >getting a register for other allocnos that span the loop and have >references inside the loop, when the interference graph is not colorable. > >Since the allocno has no references inside the loop, there will be no >load instructions for restores inside the loop; there will be a store >instruction for the spill, which will be outside the loop. This >reduces the conflict edges of some of the allocnos, increasing the >chances of making the interference graph colorable and reducing the >spills and restores inside the loop. > >This heuristic looks reasonable, and it goes side by side with live >range splitting across the loop boundary. >On top of this heuristic, if after live range splitting across the >loop boundary the interference graph is still not colorable, we can >assign some of the split live ranges to memory, giving the chance of >registers to the live ranges (allocnos) that span the loop and have >references inside the loop. > >We can change the cost of an allocno in memory based on the above >heuristic and take this factor into consideration in the cost >calculation. > >Thoughts please? >>How can this result in better allocations if you only ever spill at >>life-range split points? The life-range covering the loop should already be >>assigned to >>memory if required. If the live range covering the loop is assigned to memory and there are references inside the loop, then there are chances of spills inside the loop, which degrades the performance. If a live range covers the loop and there are no references inside the loop, then assigning it to memory is reasonable, as there are no references. Since it is not touched inside the loop, there will be no load instructions for restores, which would be required if there were references. The store required for the def of the live range will be outside the loop. This avoids degrading the performance, as no loads and stores are required for spills and restores inside the loop. The above heuristic can also be accompanied by a heuristic on the number of use points in the live range. Considering both heuristics will lead to better allocation, even if we only ever spill at the live range split points, and there are chances of making the interference graph colorable without degrading the performance. Also, if the live range is split at the loop boundary, then spilling at the split points of the loop boundary and assigning registers to this live range is reasonable if there are no references inside the loop. Thanks & Regards Ajit Richard. >Thanks & Regards >Ajit
RE: Allocating some Loop allocno in memory
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Monday, January 12, 2015 2:33 PM To: Ajit Kumar Agarwal Cc: vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Allocating some Loop allocno in memory On Sun, Jan 11, 2015 at 4:45 PM, Ajit Kumar Agarwal wrote: > > > -Original Message- > From: Richard Biener [mailto:richard.guent...@gmail.com] > Sent: Sunday, January 11, 2015 8:05 PM > To: Ajit Kumar Agarwal; vmaka...@redhat.com; l...@redhat.com; > gcc@gcc.gnu.org > Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju > Mekala > Subject: Re: Allocating some Loop allocno in memory > > On January 11, 2015 5:25:23 AM CET, Ajit Kumar Agarwal > wrote: >> >>I was thinking of some of the opportunities with respect to reducing >>spills inside the Loop. If the Live range(allocno) spans through the >>Loop and Live out at the exit of the Loop and there are no references >>or not being touched upon inside the Loop, assign the allocno to the >>memory. This increase the chances of getting the register for other >>allocno which spans through the Loop and have reference inside the >>Loop if the interference graph is not colorable. >> >>Since allocno don't have references inside the Loop there won't be any >>load instructions with respect to restore inside the loops. There will >>be a store instruction with respect to spill which will be outside >>the Loop. This will reduce the conflicting edges of some of the >>allocno increasing the chances of making colorable of the Interference >>graph and reduces the spills and restore inside the Loop. >> >>This heuristics looks reasonable. This heuristics goes side by side >>with the Live range splitting across the Loop boundary. >>On top of this heuristics, after the Live range splitting across the >>Loop boundary there interference graph is not colorable then we can >>assign some of the splitted live ranges in the memory giving chances >>of registers for the Live ranges(allocno) that spans the Loop and >>have reference inside the Loop. >> >>We can change the cost of allocno in memory based on the above >>heuristics and take the above factor into consideration for the cost >>calculation. >> >>Thoughts Please? > >>>How can this result in better allocations if you only ever spill at >>>life-range split points? The life-range covering the loop should already be >>>assigned to >>memory if required. > > If the live range covering the loop assigned to memory, and there are > references inside the Loops then there are chances of spills inside > the loop which degrades the performance. If live range covering the > loops and there are no references inside the Loop, then assigning to > memory make reasonable as there are no references. Since there are no > references or not touched inside the Loop, there won't be load instruction to > restore which are required if there are any references. The store which > requires for the def of the Live range will be outside the Loops. This helps > in not degrading the performances as there are no load and store required for > spill and restore inside the Loops. > > The above heuristics can also be accompanied with heuristics of the > number of use points in the Live range. Considering both the > heuristics will lead to better allocation if ever spill at the live > range split points and there are chances of getting colorable the > Interference graph without degrading the performance. 
Also, if the live range > is split at the loop boundary, then spilling at the split points of the loop > boundary and assigning registers to this live range is reasonable > if there are no references inside the loop. >>But if we split the life-range of not used in-loop-body pseudos at the loop >>boundary the allocator should do the assignment to memory. >>There is no need to add another heuristic here. Or do you say we don't split >>the life-range of such pseudos at loop boundary? In the spill cost calculation we see the function ira_loop_edge_freq triggered for live ranges that are live at the exit of the loop but are not referenced inside the loop. We were still considering the frequency in the update of the spill cost for such live ranges. We have made a change to not consider the frequency of such live ranges when updating the spill cost, thus reducing their cost and giving them a chance to be placed in memory, which gives the conflicting allocnos a chance of getting a register. The conflicting allocnos with respect to the loop, which do have references inside it, then have a better chance of getting registers. The changes are made in the following: https://gcc.gnu.org/ml/gcc/2015-01/msg00033.html Thanks & Regards Ajit Richard. > Thanks & Regards > Ajit > >
Optimal Coalescing with respect to move instruction for Live range splitting
Register allocation with a two-phase approach does optimal coalescing after the spilling. Sometimes live range splitting makes the coalescing non-optimal. The split live ranges are connected by move instructions; thus live range splitting, and especially aggressive live range splitting, increases the number of move instructions, making the coalescing non-optimal. Optimal coalescing should take the frequency of the regions into consideration: the higher the frequency of the region, the higher the cost associated with the split live ranges. If that cost increases, we should do aggressive coalescing and remove the move instruction by assigning the same color (hard register) to the split live ranges connected by the move instruction. I am thinking of adding such a heuristic to the GCC top-down region-based register allocator for optimal coalescing, thereby reducing the move instructions connected with the split live ranges in the more frequently executed code; a sketch of the test is given below. Thoughts please? Thanks & Regards Ajit
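A sketch of the frequency-weighted test, with a hypothetical structure standing in for what the region allocator would track for a move joining two split partner ranges:

struct split_move { int src_range, dst_range, block_freq, ranges_conflict; };

/* The hotter the region, the more each surviving move costs at run time,
   so prefer giving both split live ranges the same hard register when
   they do not interfere.  */
int should_coalesce (const struct split_move *m, int hot_freq_threshold)
{
  return m->block_freq >= hot_freq_threshold && !m->ranges_conflict;
}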
Rematerialization and Live Range Splitting on Region Frequency
Hello All: It looks like live range splitting and rematerialization are connected to each other. If the boundary of a live range split is in a high-frequency region, then the moves connecting the split live ranges are inside the high-frequency region, which is the performance bottleneck for many benchmarks. Live range splitting based on the frequency of the region should be considered. Live range splitting in a high-frequency region is beneficial if the split live range is assigned a color (hard register), which is better than spilling inside the high-frequency region, even though there will be move instructions or shuffle code; that is still better. If one of the split live ranges has no use points and all its partner live ranges get hard registers, then the move instruction due to the splitting is costly for the high-frequency region. In such a case the split point should be moved up to the boundary of the transition from the low-frequency region to the high-frequency region, with the split live ranges still getting hard registers. This requires incremental checks of colorability, which increase the compile time but are beneficial with respect to run time. The above heuristic should be incorporated on top of the following live range splitting algorithm: consider the blocks in decreasing order of frequency, with the first block taken from the high-frequency region, and update incrementally until the graph becomes colorable. Thus the split points should be at the edges of the transitions from high-frequency to low-frequency regions or from low-frequency to high-frequency regions. The above live range splitting should be incorporated for all the flavors of graph coloring. Regarding rematerialization: Chaitin's rematerialization tries to recalculate the expression at all the use points of the live range, while Simpson's approach tries to move the arithmetic instruction down to the use points or basic blocks, provided the operands of the arithmetic instruction are not touched along the blocks of the live range. Considering all the effects of rematerialization, the remat point (the recalculation) should be at the split points, instead of Chaitin's approach of remat at every use point and Simpson's approach of untouched operands and moving the arithmetic instruction down to the use points. The above approaches look feasible to implement, taking the high-frequency regions into consideration. Thoughts please? Thanks & Regards Ajit
RE: Rematerialization and Live Range Splitting on Region Frequency
Thanks Vladimir for the inputs. It is quite helpful. Thanks & Regards Ajit -Original Message- From: Vladimir Makarov [mailto:vmaka...@redhat.com] Sent: Tuesday, January 27, 2015 1:10 AM To: Ajit Kumar Agarwal; l...@redhat.com; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Rematerialization and Live Range Splitting on Region Frequency On 2015-01-25 4:55 AM, Ajit Kumar Agarwal wrote: > [...] Ajit, nobody can tell you for sure what the final results of your proposal can be. I usually try a lot of things and most of them are rejected because the results are not what I expected. I personally implemented Simpson's register pressure decrease through rematerialization twice. 
The first time was long ago (as I remember > 10 years ago) and that time it worked not bad (as I remember it gave 1% for x86 - you can find the exact data in my GCC summit paper "Fighting register pressure in GCC"). It worked well because the old RA was quite bad (imho the problem of most research articles in compiler optimization field was/is in usage of some sub-par compiler where a new good optimization shines in environment of existing simple ones or because of absence of many other optimizations). Second time I reimplemented CFG-sensitive rematerialization (as a separate pass) last year and the results were worse than without it. Therefore I had to implement a rematerialization subpass in LRA which really improves the code. That is because the current RA is pretty good. Even if we have register pressure evaluation (which was absent in the old RA), I believe it is very inaccurate as IR uses a sophisticated coloring criteria which are actually based on dynamic intersected register classes (more accurately approximated by dynamic register class inclusion tree). Also it is very hard to predict a lot of decisions in LRA too. All the above is also true for the live range splitting implemented as a separate pass. There is a good point in your email that rematerialization should work better when it is done not for each use but for some
Unrolling factor heuristics for Loop Unrolling
Hello All: Loop unrolling without a good unrolling-factor heuristic becomes a performance bottleneck. An unrolling-factor heuristic based on the minimum initiation interval (MII) is quite useful for better ILP. The MII, derived from recurrence and resource calculations on the data dependency graph, together with register pressure, can drive the unrolling-factor heuristic. To achieve better ILP for a given schedule, loop unrolling and scheduling are interdependent; this interplay has been widely used in the software-pipelining literature, along with the more granular list and trace scheduling. The recurrence calculation based on loop-carried dependencies, and the resource calculation based on simultaneous resource accesses using the reservation table, give good inputs for the unrolling factor; both are captured in the MII calculation. Along with MII, register pressure should also be considered in the heuristic. The main advantage of this heuristic is that it can be implemented at the code-generation level, whereas currently loop unrolling is done much earlier, at the loop-optimizer level. Keeping the current unrolling at the optimizer level, the above heuristic can then drive a second unrolling at the code-generation level, using the accurate register-pressure calculation done in the register allocator. This looks like a feasible solution, which I am going to propose for the above unrolling heuristic. It enables loop unrolling at the optimizer level plus the code-generation level; this two-level unrolling is quite useful and overcomes the shortcomings of unrolling at the optimizer level alone. The SPEC benchmarks are better candidates for this heuristic than Mibench and EEMBC. Thanks & Regards Ajit
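As a sketch of the MII part of the heuristic: MII = max(RecMII, ResMII), with RecMII from dependence cycles in the DDG and ResMII from a reservation table. The input arrays are illustrative assumptions; a real implementation would walk the DDG and the target's reservation tables.

  struct ddg_cycle { int total_latency; int total_distance; /* >= 1 */ };

  int compute_mii (const struct ddg_cycle *cycles, int n_cycles,
                   const int *uses, const int *units, int n_resources)
  {
    int rec_mii = 1, res_mii = 1;

    /* RecMII: for each dependence cycle, ceil (latency / distance).  */
    for (int i = 0; i < n_cycles; i++)
      {
        int c = (cycles[i].total_latency + cycles[i].total_distance - 1)
                / cycles[i].total_distance;
        if (c > rec_mii) rec_mii = c;
      }

    /* ResMII: for each resource, ceil (uses / available units).  */
    for (int r = 0; r < n_resources; r++)
      {
        int c = (uses[r] + units[r] - 1) / units[r];
        if (c > res_mii) res_mii = c;
      }

    return rec_mii > res_mii ? rec_mii : res_mii;
  }

The unroll factor would then be chosen so that the unrolled body can be scheduled close to the MII bound without pushing register pressure past the available hard registers.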
Function outlining and partial Inlining
Hello All: Large functions are an important part of high-performance applications and contribute to performance bottlenecks in many respects. Some large hot functions are frequently executed, yet many regions inside them are cold. A large function blocks inlining because of code-size constraints. Such cold regions inside hot large functions can be extracted out; this is function outlining. Breaking large functions into smaller function segments allows the remaining hot part to be inlined at the call site, or helps partial inlining. The LLVM compiler has functionality for function outlining based on regions such as basic blocks, superblocks and hyperblocks, which get extracted into smaller function segments, thus enabling partial inlining and function inlining at the call site. This optimization is a good case for profile-guided optimization, driven by the profile feedback data available to the compiler; without profile information the above function outlining will not be useful. We already do a lot of optimization regarding polymorphism, and also indirect-call promotion, based on profile feedback over the call-graph profile. Are we doing function outlining in GCC for function inlining and partial inlining based on profile feedback data? If not, this optimization could be implemented. If it is already implemented in GCC, may I have a pointer to the code and the scope of the outlining optimization? If not implemented, may I propose a function-outlining optimization for GCC? Thoughts Please? Thanks & Regards Ajit
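An illustrative (not GCC-derived) example of the shape of code this targets: a hot function whose cold error path, once outlined into a separate call, leaves a small hot body that inlining heuristics can accept. handle_error_slow is a hypothetical name for the outlined part.

  void handle_error_slow (int code);   /* outlined cold region */

  int process (int x)
  {
    if (__builtin_expect (x < 0, 0))
      {
        /* Cold: rare error handling, now just a call, so the remaining
           hot body is small enough to inline at every caller.  */
        handle_error_slow (x);
        return -1;
      }
    return x * 2;   /* hot fast path */
  }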
unaligned memory access for vectorization
Hello All: Unaligned array accesses are a blocking factor for vectorization, because unaligned loads and stores are costly operations for SIMD instructions. To enable vectorization in the presence of unaligned array accesses, loop peeling is used to multiversion the loop: a non-vectorized loop covers the iterations with unaligned accesses, and a vectorized loop covers the aligned ones. With this loop multiversioning, no unaligned vector moves need to be generated. May I know the scope of this optimization and a pointer to the GCC code where it is implemented? If it is not implemented, it would be good to have. Thoughts Please? Thanks & Regards Ajit
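A minimal sketch of the peeling side of this, assuming 16-byte vectors of float: a scalar prologue peels iterations until the store target is aligned, after which the main loop is vectorizable with aligned accesses only. The alignment constant is an assumption for illustration.

  #include <stdint.h>

  void scale (float *a, int n, float k)
  {
    int i = 0;

    /* Scalar peel: run until &a[i] is 16-byte aligned.  */
    while (i < n && ((uintptr_t) &a[i] & 15) != 0)
      {
        a[i] *= k;
        i++;
      }

    /* Main loop: the vectorizer can now emit aligned SIMD loads/stores.  */
    for (; i < n; i++)
      a[i] *= k;
  }

When the misalignment is not known at compile time, the vectorizer instead versions the loop at run time, which is the multiversioning described above.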
RE: Function outlining and partial Inlining
-Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Thursday, February 12, 2015 10:34 PM To: Ajit Kumar Agarwal Cc: hubi...@ucw.cz; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Function outlining and partial Inlining > [...] >>The outlining pass is called ipa-split. The heuristic used is however quite simplistic and it looks for very specific case where you have small header of a function containing conditional and splits after that. It does use profile. Thanks! I have gone through the code in ipa-split.c. It has the infrastructure for function outlining with respect to early-exit routines. Early exits from a function expose more cold regions based on the profile data: the profile data for the early-return statements marks the regions lying beyond the early exits as cold. Early exits give rise to single-entry multiple-exit regions, and the code relative to those early exits can be formed into cold regions. >>Any work on improving the heuristics or providing interesting testcases to consider would be welcome. There are many large applications that form single-entry multiple-exit regions. I can see that the h264ref benchmark in SPEC 2006 has many early-exit routines, such as SubPelBlockMotionSearch, which are candidates for single entry and multiple exits with break statements instead of returns. I don't have other examples at hand, but there should be many functions with early returns, mainly around IF-THEN-ELSE regions. >>I think LLVM pass is doing pretty much the same analysis minus the profile feedback considerations. >>After splitting, LLVM will inline the header into all callers while GCC leaves this to the decision of inliner heuristics that may just merge the function back into one block. >>The actual outlining logic is contained in tree-inline.c and also used by OpenMP. Thanks! Thanks & Regards Ajit Honza
Tree SSA If-combine optimization pass in GCC
Hello All: I can see the if-combining (if-merging) optimization pass on the tree-SSA form of the intermediate representation. If-combining merges IF-THEN-ELSE constructs when their condition expressions are found to be congruent or similar. The if-combine happens when the two IF-THEN-ELSE constructs are contiguous. If the two IF-THEN-ELSE constructs are not contiguous but are far apart, with code in between, does if-combine handle this? That case requires head duplication and tail duplication of the code in between, to bring the two IF-THEN-ELSE constructs next to each other; after the head and tail duplication, the IF-THEN-ELSE sequence becomes contiguous. Apart from this, does the tree-ssa-ifcombine pass consider the control flow of the body of the IF-THEN-ELSE? Is there any limitation on the control flow allowed in the body? May I know the scope of the tree-ssa-ifcombine optimization pass with respect to the above points? Thoughts Please? Thanks & Regards Ajit
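For illustration, the kind of merge meant here, on two contiguous bit tests guarding the same outcome; this mirrors the pattern discussed later in the thread and is only an example, not GCC code.

  /* Before: two contiguous ifs with congruent structure.  */
  int test_separate (unsigned name, unsigned bit1, unsigned bit2)
  {
    if (name & bit1)
      return 1;
    if (name & bit2)
      return 1;
    return 0;
  }

  /* After if-combining: one test over the merged condition.  */
  int test_combined (unsigned name, unsigned bit1, unsigned bit2)
  {
    return (name & (bit1 | bit2)) != 0;
  }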
Cost Calculation on Loop Invariant on Arithmetic operations on RTL
Hello All: I can see the loop-invariant pass on RTL in GCC, which considers register pressure and cost manipulation with respect to the SET destination node in RTL. The pass handles address-arithmetic candidates for loop invariance. In the function get_inv_cost I can see the following check before comp_cost is updated; comp_cost is incremented with respect to all the dependences of the def node of the invariant:

  if (!inv->cheap_address
      || inv->def->n_addr_uses < inv->def->n_uses)
    (*comp_cost) += inv->cost + inv->eqno;

Is there any specific reason for the n_addr_uses < n_uses check on address-arithmetic candidates of loop invariants? I think we could be more aggressive and do the cost calculation without this check. One more point: the above cost calculation should not be considering the register-pressure costs. Thoughts Please ? Thanks & Regards Ajit
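For context, a toy loop where the invariant is pure address arithmetic, the case the cheap_address check is about. This is an illustration, not a reduced GCC testcase.

  #include <stddef.h>

  /* &a[k] is a loop-invariant address computation.  Whether hoisting it
     pays off is what get_inv_cost weighs; when the invariant is a cheap
     address and all its uses are address uses, the motion is treated as
     essentially free.  */
  void accumulate (int *a, size_t k, int n)
  {
    for (int i = 0; i < n; i++)
      a[k] += i;   /* the address a + k does not change across iterations */
  }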
RE: Tree SSA If-combine optimization pass in GCC
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Tuesday, February 17, 2015 3:42 PM To: Ajit Kumar Agarwal Cc: gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Tree SSA If-combine optimization pass in GCC On Tue, Feb 17, 2015 at 9:22 AM, Ajit Kumar Agarwal wrote: > [...] >>if-combine is a simple CFG + condition pattern matcher. It does not perform head/tail duplication. Also there is no "control flow" in the bodies, control flow is part of the CFG that is matched so I'm not quite getting your last question. Thanks! My last question was: if there is control flow, such as loops, inside the IF-THEN-ELSE (which is possible when loop unswitching has been performed and the loop body placed inside the IF-THEN-ELSE), can the two IF-THEN-ELSE constructs be merged when the cond_expr matches and the control flow of the bodies matches? There are many cases in the SPEC 2006 benchmarks where if-combine could be enabled if the if-then-else sequence were made contiguous by head/tail duplication. >>if-combine was designed to accompany IL-only patterns that get partly translated into control flow. Like

>>tem1 = name & bit1;
>>tem2 = name & bit2;
>>tem3 = tem1 | tem2;
>>if (tem3) ...

>>vs.

>>tem1 = name & bit1;
>>if (tem1)
>>  goto x;
>>else
>>  {
>>    tem2 = name & bit2;
>>    if (tem2)
>>      goto x;
>>  }
>>x:
>>...

Thanks for the examples. This explains the scope of the if-combine optimization pass. Thanks & Regards Ajit Richard.
RE: Tree SSA If-combine optimization pass in GCC
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Tuesday, February 17, 2015 5:49 PM To: Ajit Kumar Agarwal Cc: gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Tree SSA If-combine optimization pass in GCC On Tue, Feb 17, 2015 at 11:26 AM, Ajit Kumar Agarwal wrote: > [...] >>I'd be curious what those cases look like. Care to file some bugreports with testcases? This is not a bug; it is a performance-improvement optimization with respect to the h264ref SPEC 2006 benchmark. Here is the example:

  Var1 = funcptr();
  For (...) {
    code here
    For (...) {
      code here
      ...
      For (...) {
        ... code here ...
        If (*funcptr() == FastPely())
          FastPely();
        Else
          (*funcptr)();
        /* There are 16 such IF statements. */
        code here
      } // end for
      code here
    } // end for
    code here
  } // end for

The funcptr has two targets, FastPely() and UMVPely(). After indirect-call promotion the target is known to be either FastPely() or UMVPely(). The transformed code after indirect-call promotion looks as follows:

  Var1 = funcptr();
  For (...) {
    code here
    For (...) {
      code here
      ...
      For (...) {
        ... code here ...
        If (Var1 == FastPely())
          FastPely();
        Else
          UMVPely();
        /* There are 16 such IF statements. */
        code here
      } // end for
      code here
    } // end for
    code here
  } // end for
After the icall promotion, the functions FastPely and UMVPely can be inlined, since the target is known to be one of the two, and each becomes a candidate for the inlining heuristics. As you can see in the transformed code, the IF-THEN-ELSE constructs (16 such IF statements) can be if-combined and merged, and then inlined. Also, the code above the IF and below the For can be head-duplicated or tail-duplicated, which then makes this a 3-level loop-unswitching candidate after the if-combining/merging. I am planning to implement the above optimizations in GCC with respect to the h264ref SPEC 2006 benchmark; this gives a significant amount of gain. I have implemented the above optimization in the Open64 compiler, where it gave a significant gain as well. Thanks & Regards Ajit
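A hedged sketch of the end result being described, with FastPely/UMVPely standing in for the two promoted targets: once the 16 identical tests are merged into one, the loop can be unswitched on that single test. This is an illustration of the transformation, not the h264ref source.

  void FastPely (void);
  void UMVPely (void);

  void motion_search (int var1_is_fast, int n)
  {
    if (var1_is_fast)              /* loop unswitched on the merged test */
      for (int i = 0; i < n; i++)
        FastPely ();               /* all merged calls hit the hot target */
    else
      for (int i = 0; i < n; i++)
        UMVPely ();
  }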
Proposal on Unrolling factor based on Data reuse.
Hello All: I would like to propose an unrolling factor based on data reuse between different iterations, combining the data reuse of different iterations into a single iteration. A MaxFactor bounds the unroll factor computed from data reuse; MaxFactor is calculated as MaxInsts/LoopBodyInsts, where MaxInsts is the number of instructions that does not degrade the instruction cache. The calculation of MaxInsts also takes the def-use distance into account, since unrolling beyond it increases live-range width; this bound is combined with the instruction-cache-based MaxInsts. The following example is from the Livermore Loops benchmarks. The data flows from the previous iteration to the current iteration, giving data reuse. Unrolling the loop in Fig(1) can reuse the data and thus increase performance. The loop is unrolled 3 times as given in Fig(2), based on the unroll-factor algorithm given in Fig(3).

  For (i = 1; i < size; i++) {
    X[i] = z[i] * (y[i] - x[i-1]);
  }

Fig(1).

  For (i = 1; i < size; i += 3) {
    X[i]   = z[i] * (y[i] - x[i-1]);
    X[i+1] = z[i] * (y[i] - x[i]);
    X[i]   = z[i] * (y[i] - x[i+1]);
  }

Fig(2).

Algorithm for unroll factor based on data reuse:

  if (data_reuse == true) {
    unroll_factor = reuse_distance + 1;
    if (unroll_factor < MaxFactor)
      return unroll_factor;
    else {
      unroll_factor = MaxFactor - unroll_factor;
      if (unroll_factor < MaxDistance)
        return unroll_factor;
    }
  }

Fig(3).

In the example given in Fig(1), the reuse distance and the MaxFactor are computed from the maximum number of instructions that does not degrade the instruction cache and does not exceed the limit on def-use distance that would widen live ranges. With these considered, the loop is unrolled 3 times. Thoughts Please ? Thanks & Regards Ajit
RE: Proposal on Unrolling factor based on Data reuse.
Hello All: There is a typo error. Fig(2) is corrected as follows:

  For (i = 1; i < size; i += 3) {
    X[i]   = z[i] * (y[i] - x[i-1]);
    X[i+1] = z[i] * (y[i] - x[i]);
    X[i+2] = z[i] * (y[i] - x[i+1]);
  }

Fig(2). Thanks & Regards Ajit -Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Saturday, March 07, 2015 3:31 PM To: Richard Biener; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Proposal on Unrolling factor based on Data reuse. [...]
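A cleaned-up C sketch of the Fig(3) heuristic under the assumptions stated in the proposal (reuse_distance from dependence analysis, max_factor = MaxInsts/LoopBodyInsts). Note the fallback branch of Fig(3) is simplified here to a plain clamp; the names are illustrative.

  int unroll_factor_for_reuse (int has_reuse, int reuse_distance,
                               int max_factor)
  {
    if (!has_reuse)
      return 1;                       /* no cross-iteration reuse: leave it */

    int factor = reuse_distance + 1;  /* cover the whole reuse chain */
    if (factor <= max_factor)
      return factor;
    return max_factor;                /* clamp to icache/live-range bound */
  }

For Fig(1) the reuse distance is 1 (x[i-1] is produced by the previous iteration), so this yields an unroll factor of 2 at minimum; the proposal unrolls by 3 because the icache and def-use bounds still allow it.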
Proposal for inter-procedural loops fusion.
Hello All: I am proposing inter-procedural loop fusion. Generally, loops that are adjacent to each other and conformable are candidates for loop fusion within a procedure. To fuse loops across procedure calls, whole-program analysis with array-section analysis across the calls needs to be done. In the example given below, the doubly nested loop in routine sub1 in Fig(1) is a candidate for fusion with the doubly nested loop in routine sub2 in Fig(2): main calls sub1 and sub2, and their loops can be fused. The array-section analysis and dependence analysis are done on the regions across the procedure calls for the loops in sub1 and sub2. Here is the proposal for loop fusion across procedure calls. Normally loop peeling is done to make a loop a candidate for fusion, and the loops in sub2 can also be peeled. In the examples below, the arrays A and B are global variables, which is what allows fusion across the procedure calls. The examples are extracted from the article "Inter-procedure loop fusion, array compaction and rotation" by John Ng et al. Examples:

  program main
    common /s/ A(N),B(N)
    call sub1()
    call sub2()
  end

  subroutine sub1()
    common /s/ A(N),B(N)
    do j=1,N
      do i=1,N
        A(i+1,j) = j - i
      enddo
    enddo
    do i=1,N
      A(1,i) = i * 2
    enddo
    do i=1,N
      A(i+1,n+1) = A(i+1,1)
    enddo
  end

Fig (1)

  subroutine sub2()
    common /s/ A(N),B(N)
    do j=1,N
      do i=1,N
        B(i,j) = A(i+1,j+1) + A(i,j+1) + A(i,j) + A(i+1,j)
      enddo
    enddo
  end

Fig (2).

Thoughts Please? Thanks & Regards Ajit
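A simplified C sketch of the intended effect: once inter-procedural analysis proves the two bodies independent per iteration, the loops fuse into one traversal with better locality. This toy example deliberately avoids the peeling that the shifted A(i+1,j+1) accesses above would require; names and sizes are illustrative.

  #define N 1000
  extern double A[N], B[N];   /* globals, as in the Fortran common block */

  void sub1 (void) { for (int i = 0; i < N; i++) A[i] = i * 2.0; }
  void sub2 (void) { for (int i = 0; i < N; i++) B[i] = A[i] + 1.0; }

  /* Fused form: legal because B[i] reads only the A[i] written in the
     same iteration (dependence distance 0), and A is traversed once.  */
  void fused (void)
  {
    for (int i = 0; i < N; i++)
      {
        A[i] = i * 2.0;
        B[i] = A[i] + 1.0;
      }
  }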
Proposal for path splitting for reduction in register pressure for Loops.
Hello All: Path splitting replicates code to make better dataflow analysis available. One of the properties of path splitting is that it removes the join nodes of forked paths such as IF-THEN-ELSE and loops. Removing the join node splits the path into two independent paths. Increased register pressure in loops is a performance bottleneck and sometimes leads to spill code inside loops. Removing the join nodes of loops and IF-THEN-ELSE makes target-independent optimizations like CSE and partial redundancy elimination more effective. Along with easing these optimizations, path splitting reduces register usage, because the live ranges on the independent split paths do not intersect. The reduced intersection of live ranges lowers register pressure in loops, giving better register allocation and less spill code for loops. The heuristics used in the IRA code for register pressure will have a better impact on the split paths, and thus produce more optimized code. Thoughts Please? Thanks & Regards Ajit
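An illustrative before/after in C: the join block of the if-then-else inside the loop is replicated into both arms, so the two paths never merge and their temporaries do not interfere.

  /* Before: the final statement is a join block; the value t is live
     across the merge point on both paths.  */
  void loop_with_join (int *a, int n, int c)
  {
    for (int i = 0; i < n; i++)
      {
        int t;
        if (c)
          t = a[i] + 1;
        else
          t = a[i] - 1;
        a[i] = t * 2;              /* join block */
      }
  }

  /* After path splitting: the join work is replicated into each arm, so
     each path ends the iteration independently.  */
  void loop_split_paths (int *a, int n, int c)
  {
    for (int i = 0; i < n; i++)
      {
        if (c)
          a[i] = (a[i] + 1) * 2;   /* then-path copy of the join work */
        else
          a[i] = (a[i] - 1) * 2;   /* else-path copy of the join work */
      }
  }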
RE: Proposal for path splitting for reduction in register pressure for Loops.
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Sunday, March 08, 2015 9:05 PM To: Ajit Kumar Agarwal; vmaka...@redhat.com; Jeff Law; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Proposal for path splitting for reduction in register pressure for Loops. On March 8, 2015 3:39:08 PM CET, Ajit Kumar Agarwal wrote: > [...] >>Are you talking about loop unswitching? Loop unswitching is one way of path splitting, but I am proposing path splitting that removes the join node of a FOR loop containing an IF-THEN-ELSE path. At the tail of the loop there is a join node for the IF-THEN-ELSE forked at the header. Path splitting removes this join node by replicating its code into both paths, leaving two independent paths; the live ranges then do not intersect, which reduces register pressure in the loop. Loop unswitching moves the FOR loop inside the IF-THEN and IF-ELSE paths, whereas the path splitting I am proposing keeps the IF-THEN-ELSE region inside the loop and replicates the code of the join node into both paths. There is a cost to replicating the code, so the profitability of the register-pressure reduction must be weighed against the replication. I am confident this reduces register pressure through loops and the IF-THEN-ELSE regions inside them, and thus gives better register allocation; some profitability heuristics need to be added to this optimization. Thanks & Regards Ajit Richard.
RE: Proposal for path splitting for reduction in register pressure for Loops.
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Monday, March 09, 2015 11:01 PM To: Richard Biener Cc: Ajit Kumar Agarwal; vmaka...@redhat.com; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Proposal for path splitting for reduction in register pressure for Loops. On 03/09/15 05:40, Richard Biener wrote: > On Sun, Mar 8, 2015 at 8:49 PM, Jeff Law wrote: >> On 03/08/15 12:13, Richard Biener wrote: >>> I see. This basically creates two loop latches and thus will make >>> our loop detection code turn the loop into a fake loop nest. Not >>> sure if that is a good idea. >> I'd have to sit down and ponder this for a while -- what would be the >> register pressure implications of duplicating the contents of the >> join block into its predecessors, leaving an empty joiner so that we >> still have a single latch? > Good question. Now another question is why we don't choose this way > to disambiguate loops with multiple latches? (create a forwarder as > new single latch block) >>Dunno. The RTL jump optimizer ought to eliminate the unnecessary jumping late in the optimization pipeline and creating the forwarder allows us to put the loop into a "more canonical" form for the loop optimizers. Seems like it'd be worth some experimentation. I agree with Jeff on this. The above approach of path splitting will keep the loop in a more canonical form for the loop optimizers. >>I'm having trouble seeing how Ajit's proposal helps reduce register pressure in any significant way except perhaps by exposing partially dead code. And if that's the primary benefit, we may be better off actually building a proper framework for PDCE. Jeff: The above approach of path splitting is very similar to superblock formation, as we duplicate the join node into all its predecessors. By doing so, the predecessor blocks extended with the duplicated join code gain granularity for scheduling and register allocation, with advantages similar to superblocks: more basic-block scheduling opportunities for independent operations. Path splitting also has a significant impact on LRA: with larger predecessor blocks the scope of LRA increases, registers can be reused, live ranges and register pressure improve, and better spill heuristics can be used in LRA. >>I've often pondered if PDCE is really worth it. There's some good PLDI papers from many years ago. One is closely related to our RTL GCSE implementation (Knoop). But RTL seems the wrong place to be doing this. Click's GCM/GVN can achieve similar results by the nature of code motion algorithm IIRC, but as much as I like Click's algorithm, I wasn't ever able to get it to do anything significantly better than existing bits in GCC. Thanks Jeff. Thanks & Regards Ajit Jeff
Proposal for another approach for Loop transformation with conditional in Loops.
Hello All: I am proposing a new approach to loop transformation, shown in the example below, for loops with a conditional expression inside them. The loop body should be a reducible control-flow graph. The iteration space is partitioned into subspaces on which the cond_expr is either always true or always false, and the value of cond_expr on each partitioned subspace can be determined statically. Based on this partitioning and the analysis of cond_expr over the partitioned iteration spaces, the loop in Fig (a) is transformed into the loops in Fig (b). The values live into and out of the conditional are analyzed over the topologically ordered basic blocks of the conditional's CFG, and the iteration space is partitioned into subspaces where the conditional expression is true, false, or unknown. This gives much better optimized code: the loop is partitioned into different versions over different iteration spaces, each generated without the conditional where its value is known. The cases will be triggered in the SPEC benchmarks.

  For (x1 = lb1; x1 <= ub1; x1++)
    ...
    For (xn = lbn; xn <= ubn; xn++) {
      V = cond_expr;
      If (V)
        CODE_FOR_THEN;
      Else
        CODE_FOR_ELSE;
    }

Fig (a)

  /* Loop for cond_expr == true */
  For (x1 = ...)
    For (xn = ...)
      CODE_FOR_THEN;

  /* Loop for cond_expr == false */
  For (x1 = ...)
    For (xn = ...)
      CODE_FOR_ELSE;

  /* Loop for cond_expr == unknown */
  For (x1 = ...)
    For (xn = ...) {
      V = cond_expr;
      If (V)
        CODE_FOR_THEN;
      Else
        CODE_FOR_ELSE;
    }

Fig (b).

Thoughts Please ? Thanks & Regards Ajit
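A one-dimensional C sketch of the proposed partitioning, assuming the condition i < k is statically analyzable: the iteration space splits into an always-true and an always-false region, and the branch disappears from both loops. The function names are illustrative.

  void loop_with_branch (double *x, int n, int k)
  {
    for (int i = 0; i < n; i++)
      {
        if (i < k)
          x[i] = 0.0;        /* CODE_FOR_THEN */
        else
          x[i] = 1.0;        /* CODE_FOR_ELSE */
      }
  }

  void loop_partitioned (double *x, int n, int k)
  {
    /* Clamp the split point into [0, n] so both loops stay in bounds.  */
    int m = k < 0 ? 0 : (k > n ? n : k);
    for (int i = 0; i < m; i++)   /* cond_expr known true */
      x[i] = 0.0;
    for (int i = m; i < n; i++)   /* cond_expr known false */
      x[i] = 1.0;
  }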
RE: Proposal for another approach for Loop transformation with conditional in Loops.
-Original Message- From: Aditya K [mailto:hiradi...@msn.com] Sent: Sunday, March 15, 2015 11:37 AM To: Ajit Kumar Agarwal; Jeff Law; Richard Biener; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: Proposal for another approach for Loop transformation with conditional in Loops. > From: ajit.kumar.agar...@xilinx.com > To: l...@redhat.com; richard.guent...@gmail.com; gcc@gcc.gnu.org > CC: vin...@xilinx.com; shail...@xilinx.com; vid...@xilinx.com; nmek...@xilinx.com > Subject: Proposal for another approach for Loop transformation with conditional in Loops. > Date: Sun, 15 Mar 2015 04:40:54 + > [...] >>Ajit, >>How different is this from the loop-unswitch pass already in gcc (tree-ssa-loop-unswitch.c)? Aditya: Loop unswitching moves the loop inside the IF-THEN and IF-ELSE, producing different versions of the loop, but the iteration space remains the same for each version. In the proposal above, the different versions of the loop have different iteration spaces, based on whether cond_expr is true or false over each partition of the iteration space. Another difference: with loop unswitching the conditional expression of the IF-THEN-ELSE is still present, whereas in the above optimization the conditional is removed for the partitions where cond_expr is known true or known false.
The different versions of the loop, over their different iteration spaces, execute either the then-block statements or the else-block statements; for the unknown case the loop keeps the conditional over its own iteration space. This results in much more optimized code, without hot conditional branches inside the loops, and the branch-misprediction penalty of conditional branches inside the loops is removed or reduced to a negligible amount. As you mentioned, it should be applied to hot loops. Thanks & Regards Ajit -Aditya
Short Circuit compiler transformations!!
Hello All: Short-circuit compiler transformation for conditional branches: certain values of the variables make the conditional expression always true or always false, so one path is always executed, short-circuiting the other. That part of the condition is extracted into its own IF-THEN with a check of the value, and the else case checks the original expression. For the example given below:

  For (x1 = lb1; x1 <= ub1; x1++) {
    If (C1 && C2)
      S_then
  }

Fig (1).

For the input code in Fig (1), the condition C1 && C2 is always false if C1 or C2 is zero (false), short-circuiting the path: when C1 or C2 is false, only the fall-through path executes. The input code is transformed as follows:

  For (x1 = lb1; x1 <= ub1; x1++) {
    If (!C1)
      Continue
    else
      S_then
  }

This is much simpler code, and complex expressions inside the FOR loop can be extracted into different versions of the conditional IF-THEN-ELSE based on short-circuiting: the always-executed path is extracted into the IF-THEN, and the remaining cases stay in the else. Again, this should be driven by the profile data on hot conditional branches, performing the optimization based on their hotness. The short-circuiting optimization can also make irreducible loops with conditional branches reducible, so for short-circuiting candidates there is no need to convert irreducible loops to reducible ones with existing approaches such as node splitting. This optimization is implemented in LLVM, and I would like to know if it is implemented in GCC. If not implemented, I would like to propose this short-circuit compiler transformation. Thoughts Please ? Thanks & Regards Ajit
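A C sketch of the extraction for the loop-invariant part of the condition (here C1 is assumed invariant and C2 varying): when C1 is false, the conjunction is false on every iteration, so that whole case short-circuits to a loop-free early return. This is an illustration under those assumptions, not a statement of how any existing pass implements it.

  void loop_guarded (int *a, int n, int c1)
  {
    for (int i = 0; i < n; i++)
      if (c1 && a[i] > 0)   /* C1 && C2 */
        a[i] = 0;           /* S_then */
  }

  void loop_short_circuit (int *a, int n, int c1)
  {
    if (!c1)
      return;               /* C1 false: the conjunction is never true */
    for (int i = 0; i < n; i++)
      if (a[i] > 0)         /* only the varying part C2 remains */
        a[i] = 0;
  }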
RE: Short Circuit compiler transformations!!
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Sunday, March 15, 2015 3:05 PM To: Ajit Kumar Agarwal; Jeff Law; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Short Circuit compiler transformations!! On March 15, 2015 9:15:36 AM GMT+01:00, Ajit Kumar Agarwal wrote: > [...] >>This looks wrong as you miss to check C2. Richard: Sorry for the typo. I see that the short-circuit transformation is implemented in GCC, but only short-circuiting with respect to && and ||. I am thinking of complex expressions where setting any one variable condition in the expression makes the whole expression true or false; such conditions can be extracted out into different versions of the IF with a different parameter check (another way of short-circuiting). I would like to know the scope of this optimization in GCC. If it is not implemented for complex expressions over different iteration spaces, I would like to propose it. Thoughts Please? Thanks & Regards Ajit Richard.
RE: Short Circuit compiler transformations!!
-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Sunday, March 15, 2015 3:35 PM To: Richard Biener; Jeff Law; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: Short Circuit compiler transformations!! > [...] >>This looks wrong as you miss to check C2. >>Richard: Sorry for the typo. [...] Richard: Another aspect of the transformation is short-circuiting the IF-THEN-ELSE inside loops where the loop is irreducible. In that case the short-circuit transformation is implemented together with the CFG transformation that converts the irreducible loop into a reducible one. Thanks & Regards Ajit
RE: Function outlining and partial Inlining
-Original Message- From: Jan Hubicka [mailto:hubi...@ucw.cz] Sent: Thursday, February 12, 2015 10:34 PM To: Ajit Kumar Agarwal Cc: hubi...@ucw.cz; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Function outlining and partial Inlining > [...] >>The outlining pass is called ipa-split. The heuristic used is however quite simplistic and it looks for very specific case where you have small header of a function containing conditional and splits after that. It does use profile. >>Any work on improving the heuristics or providing interesting testcases to consider would be welcome. >>I think LLVM pass is doing pretty much the same analysis minus the profile feedback considerations. After splitting, LLVM will inline the header into all callers while GCC leaves this on the decision of inliner heuristics that may just merge the function back into one block. >>The actual outlining logic is contained in tree-inline.c and also used by OpenMP. Honza: LLVM has logic for extracting loops, using the -loop-extract flag, where a loop code region is extracted into a function. The LLVM heuristic requires that the header of the loop is the entry block and that there is a single entry to the loop, which is established by LLVM's loop-simplify pass; it also requires that there is no branch from the entry block to the header of the loop. You have mentioned that GCC's heuristic looks for a conditional in the header of the function and splits after that. Does GCC only split at a conditional in the header, or can the heuristic be augmented to split at loops, as done in LLVM? Thanks & Regards Ajit Honza
Proposal for the coarse grain unrolling heuristics and renaming for the enablement of better fine grain Loop transformation.
Hello All: Below are transformations for the loop given in Fig(1). Fig(2) is unroll-and-jam, and Fig(3) applies code motion to bring the two IFs adjacent to each other and the two while loops adjacent to each other. Fig(4) applies if-merging and loop fusion to the transformed loop of Fig(3). This shows how a loop with conditional branches can be unrolled in a way that enables if-merging and loop fusion once code motion brings the IFs and loops next to each other. This is a powerful optimization enabled by renaming and unrolling; such heuristics need to be added to loop unrolling, after which code motion can enable the if-merging and loop fusion on the unrolled loop. The example below is taken from the article by Albert Cohen et al., "Deep Jam: Conversion of Coarse-grain Parallelism to Fine-grain Vector Parallelism".

  For (i = 0; i < 10; i++) {
    If (p)
      a = ...;
    While (q)
      ... = a + ...;
  }

Fig (1)

  For (i = 0; i < 10; i += 2) {
    a1 = phi(a, a2);
    If (p1)
      a1 = ...;
    While (q1)
      ... = a1 + ...;
    If (p2)
      a2 = ...;
    a2 = phi(a2, a1);
    While (q2)
      ... = a2 + ...;
  }

Fig (2)

  For (i = 0; i < 10; i += 2) {
    a1 = phi(a, a2);
    If (p1)
      a1 = ...;
    If (p2)
      a2 = ...;
    a2 = phi(a2, a1);
    While (q1)
      ... = a1 + ...;
    While (q2)
      ... = a2 + ...;
  }

Fig (3)

  For (i = 0; i < 10; i += 2) {
    a1 = phi(a1, a2);
    If (p1 && p2) {
      a1 = ...;
      a2 = ...;
    } Else If (p1)
      a1 = ...;
    Else If (p2)
      a2 = ...;
    While (q1 && q2) {
      ... = a1 + ...;
      ... = a2 + ...;
    }
    While (q1)
      ... = a1 + ...;
    While (q2)
      ... = a2 + ...;
  }

Fig (4).

I would like to propose the above heuristics for unroll-and-jam and renaming, which enable loop fusion and if-merging to achieve optimized code. This shows how a coarse-grain unrolling heuristic enables fine-grain loop transformation. Thoughts Please? Thanks & Regards Ajit
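For reference, the bare mechanics of unroll-and-jam on a clean nest without Fig(1)'s conditionals, assuming an even trip count for brevity; the renaming/phi handling of Fig(2) sits on top of this. The array shapes are illustrative.

  void nest_original (double a[][64], const double b[][64], int n)
  {
    for (int i = 0; i < n; i++)
      for (int j = 0; j < 64; j++)
        a[i][j] = b[i][j] * 2.0;
  }

  void nest_unroll_jam (double a[][64], const double b[][64], int n)
  {
    for (int i = 0; i < n; i += 2)       /* outer loop unrolled by 2 */
      for (int j = 0; j < 64; j++)
        {                                /* the two inner copies jammed */
          a[i][j]     = b[i][j] * 2.0;
          a[i + 1][j] = b[i + 1][j] * 2.0;
        }
  }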
RE: Short Circuit compiler transformations!!
-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Sunday, March 15, 2015 9:30 PM To: Ajit Kumar Agarwal; Jeff Law; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: Short Circuit compiler transformations!!

On March 15, 2015 11:14:59 AM GMT+01:00, Ajit Kumar Agarwal wrote:
>
>-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit Kumar Agarwal Sent: Sunday, March 15, 2015 3:35 PM To: Richard Biener; Jeff Law; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: RE: Short Circuit compiler transformations!!
>
>-Original Message- From: Richard Biener [mailto:richard.guent...@gmail.com] Sent: Sunday, March 15, 2015 3:05 PM To: Ajit Kumar Agarwal; Jeff Law; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Short Circuit compiler transformations!!
>
>On March 15, 2015 9:15:36 AM GMT+01:00, Ajit Kumar Agarwal wrote:
>>Hello All:
>>
>>This is a short-circuit compiler transformation for conditional branches. For certain values of the variables in a conditional expression, the expression is always true or always false, so one of the paths is always executed, short-circuiting the other path. That part of the condition is extracted out into an IF-THEN with a check of the value, and the else case checks the original expression.
>>
>>For the example given below:
>>
>>for (x1 = lb1; x1 <= ub1; x1++)
>>{
>>  if (C1 && C2)
>>    S_then
>>}
>>
>>Fig (1).
>>
>>For the input code given in Fig (1), the condition C1 && C2 is always false if C1 or C2 is zero (false), thus short-circuiting the path; only S_else is executed if C1 or C2 is false.
>>
>>Thus the input code is transformed as follows:
>>
>>for (x1 = lb1; x1 <= ub1; x1++)
>>{
>>  if (!C1)
>>    continue;
>>  else
>>    S_then
>>}

>>>This looks wrong, as you miss checking C2.

>>>Richard: Sorry for the typo. I see that a short-circuit transformation is already implemented in GCC, but only with respect to &&/||. I am thinking in terms of complex expressions where the condition on any one variable in the expression sets the whole expression to true or false; such conditions can be extracted out into different versions of the IF with different parameter checks (another way of short-circuiting).

>>>I would like to know the scope of this optimization in GCC. If it is not implemented for complex expressions over different iteration spaces, I would like to propose it.

>>Just as a general note/question on all your suggestions: are you willing to implement all your proposals, or do you have the resources to spend on that effort?
>>It's not that we are not aware of loop optimizations GCC does not implement. It's a question of priorities and resources to implement something.

Yes, you are right. We are proposing different optimizations based on our analysis and past experience, and we would be interested in implementing them based on the suggestions and inputs from the community.
We may not implement all of them, but based on the suggestions and inputs from the community we would like to take them up on a priority basis.

Thanks & Regards Ajit

Richard.

>Richard: Another aspect of the transformation is short-circuiting an IF-THEN-ELSE inside loops where the loop is irreducible. In this case the short-circuit transformation is combined with a CFG transformation that converts the irreducible loop into a reducible one, along with the short-circuiting compiler transformation itself.
>
>Thanks & Regards
>Ajit
>
>>>Thoughts please?
>
>Thanks & Regards
>Ajit
>
>Richard.
>
>>This is much simpler code, and there could be complex expressions inside the FOR loop that can be extracted out as different versions of the conditional IF-THEN-ELSE inside the loop; based on short-circuiting, the path that is always executed can be extracted into an IF-THEN, and the other cases of condition will b
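To make the proposed transformation concrete, here is a minimal runnable C sketch, assuming loop-invariant conditions c1 and c2 and an invented s_then body; per Richard's comment, both conditions are tested in the short-circuited form:

#include <stdio.h>

/* Hypothetical then-part, standing in for S_then in the thread.  */
static void s_then (int x) { printf ("then %d\n", x); }

int main (void)
{
  int lb1 = 0, ub1 = 4;
  int c1 = 1, c2 = 0;   /* invented loop-invariant conditions */

  /* Original form from Fig (1).  */
  for (int x1 = lb1; x1 <= ub1; x1++)
    if (c1 && c2)
      s_then (x1);

  /* Short-circuited form.  Note both C1 and C2 are tested,
     addressing the missing-C2 check Richard pointed out.  */
  for (int x1 = lb1; x1 <= ub1; x1++)
    {
      if (!c1 || !c2)
        continue;
      s_then (x1);
    }
  return 0;
}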
RE: Proposal for another approach for Loop transformation with conditional in Loops.
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Monday, March 16, 2015 11:45 PM To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: Proposal for another approach for Loop transformation with conditional in Loops.

On 03/14/15 22:40, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I am proposing the new approach to loop transformation given below, for loops with a conditional expression inside the loop. The loop body should be a reducible control flow graph. The iteration space is partitioned into subspaces for which the cond_expr is either true or false; the value of the cond_expr over each partitioned iteration space can be determined statically. Based on the partitioned iteration spaces and the analysis of the cond_expr over them, the loop in Fig (a) is transformed to the loops in Fig (b).
>
> For the iteration spaces of the conditional inside the loop, the live-in information at the entry of the cond_expr and the live-out information are used; the control flow graph of the conditional is ordered topologically over its basic blocks, and the iteration space is partitioned into subspaces where the conditional expression is true, false, or unknown.
>
> Based on the above information and analysis, the loop of Fig (a) is transformed to the loop of Fig (b).
>
> This gives much better-optimized code. These cases are triggered in the SPEC benchmarks. The loop is partitioned into different versions with different iteration spaces, optimized by generating the loops without the conditional.

I fail to see how this is dramatically different than unswitching. You pull the condition out (it has to be loop invariant), then have two instances of the loop, one for the THEN, the other for the ELSE. You obviously select one or the other based on the condition of V.

>>Specific examples might be helpful in understanding how you see this as different than unswitching.

for (i = 1; i <= 5; i++)
  for (j = 1; j <= 5; j++)
    if (i > 3 && j > 3)
      a[i][j] = a[i+1][j-1];
    else
      a[i][j] = 0;

Fig (1)

for (i = 1; i <= 3; i++)
  for (j = 4; j <= 5; j++)
    a[i][j] = 0;

for (i = 4; i <= 5; i++)
  for (j = 4; j <= 5; j++)
    a[i][j] = a[i+1][j-1];

for (i = 1; i <= 5; i++)
  for (j = 1; j <= 3; j++)
    a[i][j] = 0;

Fig (2)

Fig (1) is the original code and Fig (2) is the transformed code for Fig (1). Note that the condition i > 3 && j > 3 is not loop invariant, so plain unswitching does not apply; the iteration space is split instead.

Thanks & Regards Ajit

jeff
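A runnable rendition of the example above, assuming the array is padded by one extra row so a[i+1] stays in bounds (an assumption, since the declarations are not shown in the thread); it checks that the partitioned loops of Fig (2) compute the same result as Fig (1):

#include <stdio.h>
#include <string.h>

int main (void)
{
  int a[7][7], b[7][7];

  /* Seed both copies identically; the values are arbitrary.  */
  for (int i = 0; i < 7; i++)
    for (int j = 0; j < 7; j++)
      a[i][j] = b[i][j] = i * 7 + j;

  /* Fig (1): original loop with the conditional inside.  */
  for (int i = 1; i <= 5; i++)
    for (int j = 1; j <= 5; j++)
      if (i > 3 && j > 3)
        a[i][j] = a[i+1][j-1];
      else
        a[i][j] = 0;

  /* Fig (2): iteration space partitioned so no test is needed.  */
  for (int i = 1; i <= 3; i++)
    for (int j = 4; j <= 5; j++)
      b[i][j] = 0;
  for (int i = 4; i <= 5; i++)
    for (int j = 4; j <= 5; j++)
      b[i][j] = b[i+1][j-1];
  for (int i = 1; i <= 5; i++)
    for (int j = 1; j <= 3; j++)
      b[i][j] = 0;

  printf (memcmp (a, b, sizeof a) == 0 ? "match\n" : "mismatch\n");
  return 0;
}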
More methods of reducing the register pressure
Hello All:

To reduce register pressure, I am proposing the following methods.

1. Assign the same register (i.e., share one register) for logical registers that hold the same value. Determining which logical registers hold the same value is the real challenge. Is there any way in GCC to determine the logical registers that hold the same value?

2. For architectures with self-modifying (read-modify-write) instructions, the DEF and USE do not overlap, which allows better register allocation by reclaiming a register as soon as it is no longer used. Such self-modifying instructions can be assigned the same register. For example:

OP R1        ; definition
...          ; not used
OP R1, R1    ; self-modifying
...          ; not used
OP R1, R1    ; self-modifying

With architecture support, such self-modifying instructions do not create overlapping live ranges: each use occurs at the point where the value is redefined, so the USE comes together with the DEF, the lifetimes do not overlap, and the same register can be assigned. Such cases need to be handled separately for better register-pressure usage.

Thoughts please?

Thanks & Regards Ajit
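For point 1, one standard way to discover logical registers that hold the same value is local value numbering. Below is a toy C sketch over an invented three-address IR — not GCC code and not GCC's algorithm: two pseudo-registers defined by the same operation on the same value numbers get the same value number, so an allocator could let them share a register.

#include <stdio.h>

/* Toy three-address instruction: dst = src1 OP src2.  */
struct insn { char op; int dst, src1, src2; };

#define NREGS 16

int main (void)
{
  /* r2 = r0 + r1; r3 = r0 + r1; r4 = r2 * r3 */
  struct insn code[] = {
    { '+', 2, 0, 1 },
    { '+', 3, 0, 1 },
    { '*', 4, 2, 3 },
  };
  int vn[NREGS];
  for (int i = 0; i < NREGS; i++)
    vn[i] = i;                 /* each register starts as its own value */

  struct { char op; int a, b, v; } table[64];
  int ntab = 0, next = NREGS;

  for (unsigned i = 0; i < sizeof code / sizeof *code; i++)
    {
      int a = vn[code[i].src1], b = vn[code[i].src2], v = -1;
      /* Look up (op, a, b); reuse its value number if seen before.  */
      for (int t = 0; t < ntab; t++)
        if (table[t].op == code[i].op && table[t].a == a && table[t].b == b)
          v = table[t].v;
      if (v < 0)
        {
          v = next++;
          table[ntab].op = code[i].op;
          table[ntab].a = a;
          table[ntab].b = b;
          table[ntab].v = v;
          ntab++;
        }
      vn[code[i].dst] = v;
      printf ("r%d has value number %d\n", code[i].dst, v);
    }

  printf ("r2 and r3 %s share one register\n",
          vn[2] == vn[3] ? "can" : "cannot");
  return 0;
}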
RE: dom1 prevents vectorization via partial loop peeling?
-Original Message- From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard Biener Sent: Tuesday, April 28, 2015 4:12 PM To: Jeff Law Cc: Alan Lawrence; gcc@gcc.gnu.org Subject: Re: dom1 prevents vectorization via partial loop peeling?

On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>> <bb ...>:
>>
>> <bb ...>:
>>   # i_11 = PHI <...>
>>   _5 = a[i_11];
>>   _6 = i_11 & _5;
>>   if (_6 != 0)
>>     goto <bb ...>;
>>   else
>>     goto <bb ...>;
>>
>> <bb ...>:
>>
>> <bb ...>:
>>   # m_2 = PHI <5(4), 4(3)>
>>   _7 = m_2 * _5;
>>   b[i_11] = _7;
>>   i_9 = i_11 + 1;
>>   if (i_9 != 32)
>>     goto <bb ...>;
>>   else
>>     goto <bb ...>;
>>
>> <bb ...>:
>>   goto <bb ...>;
>>
>> <bb ...>:
>>   return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup. The jump threading code realized that if we traverse the edge 2->3, then we'll always traverse 3->5. The net effect is like peeling the first iteration because we copy bb3. The original will be reached via 7->3 (i.e., the loop iterating), the copy via 2->3', and 3' will have its conditional removed and will unconditionally transfer control to bb5.
>
> This is a known problem, but we really don't have any good heuristics for when to do this vs. when not to do it.
>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>>   for (int i = 0; i < N; i++)
>>     {
>>       int cond = (a[i] & i) ? -1 : 0; // extra variable here
>>       int m = (cond) ? 5 : 4;
>>       b[i] = a[i] * m;
>>     }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>> <bb ...>:
>>
>> <bb ...>:
>>   # i_15 = PHI <...>
>>   _6 = a[i_15];
>>   _7 = i_15 & _6;
>>   if (_7 != 0)
>>     goto <bb ...>;
>>   else
>>     goto <bb ...>;
>>
>> <bb ...>:
>>   # m_3 = PHI <4(6), 5(3)>
>>   _8 = m_3 * _6;
>>   b[i_15] = _8;
>>   i_10 = i_15 + 1;
>>   if (i_10 != 32)
>>     goto <bb ...>;
>>   else
>>     goto <bb ...>;
>>
>> <bb ...>:
>>   goto <bb ...>;
>>
>> <bb ...>:
>>   return;
>>
>> <bb ...>:
>>   goto <bb ...>;
>>
>> with bb6 being out-of-line at the end of the function, rather than bb4 falling through from just above bb5. However, this prevents dom1 from doing the partial peeling, and dom1 only changes the "goto bb7" into a "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4
>
>> (1) dom1 should really, in the second case, perform the same partial peeling that it does in the first testcase, if that is what it thinks is desirable. (Of course, we might want to fix that only later, as ATM that'd take us backwards.)
>
> Please file a BZ. It could be something simple, or we might be hitting one of Zdenek's heuristics around keeping the overall loop structure.
>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?), but I'm inclined to think that if it's doing such optimizations, it's for a good reason ;) I guess there'll be other times where we *cannot* do partial peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when it was last discussed with Zdenek. I think the fundamental issue is we can't really predict when threading the loop is going to interfere with later optimizations or not. The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first iteration back into the loop in the vectorizer.

>>Well. In this case we hit
>>
>>  /* If one of the loop header's edge is an exit edge then do not
>>     apply if-conversion.  */
>>  FOR_EACH_EDGE (e, ei, loop->header->succs)
>>    if (loop_exit_edge_p (loop, e))
>>      return false;
>>
>>which is simply because even after if-conversion we'll at least end up with a non-empty latch block, which is what the vectorizer doesn't support. DOM rotated the loop into this non-canonical form. Running loop header copying again would probably undo this.

With the path-splitting approach I have proposed, the node at the source of the back edge is copied into its predecessors, leaving an empty latch. This would enable vectorization in the above scenario.

Thanks & Regards Ajit

Richard.

>
> Jeff
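For reference, the first testcase Alan compares against is not quoted in this excerpt; judging from the first dump (the i & a[i] test and the PHI of 5 and 4 feeding the multiply), it plausibly looked like the following, with no extra cond variable — a reconstruction, not text from the thread:

#define N 32

int a[N];
int b[N];

int foo ()
{
  for (int i = 0; i < N; i++)
    {
      int m = (a[i] & i) ? 5 : 4;   /* no extra 'cond' variable */
      b[i] = a[i] * m;
    }
}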
[RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
I have designed and implemented path splitting for loops with a conditional IF-THEN-ELSE, with the design given below. The implementation has gone through bootstrap for the Microblaze target along with the DejaGNU regression tests and the MiBench/EEMBC benchmarks. There is no regression in the DejaGNU tests, and the DejaGNU C++ test results are better with the path-splitting changes (fewer failures). The MiBench/EEMBC benchmarks show no performance degradation; a performance improvement of 9.15% is observed in telcom_gsm (MiBench) and of 9.4% in rgbcmy01_lite (EEMBC).

A proposal on path splitting was made earlier; the reference is attached: https://gcc.gnu.org/ml/gcc/2015-03/msg00057.html

The above gains are achieved because constant propagation on the conditional becomes much easier with the path-splitting changes, which also benefits other optimizations on the tree representation, along with better register allocation, because of the better granularity.

I will do a SPEC run and bootstrap on the i386 target. Before that I would like to receive your feedback and comments on the design and implementation. I will send the actual patch, with the SPEC run and bootstrap on i386, after I receive feedback on the design and the implementation described in this mail.

Design changes:

1. For a block ending in a conditional IF statement, say BB1, the blocks dominated by BB1 are examined; the block that is dominated by BB1 but is not a successor of BB1 is the join node of the IF-THEN-ELSE inside the loop.

2. The join node must be the same as the source of the loop's latch edge.

3. If conditions (1) and (2) hold, the join node (the source of the loop latch edge) is moved into its predecessors, and the join node is left as an empty statement block containing only the PHI nodes.

4. In the process of moving the join node into its predecessors, each use of a PHI result in the join node is replaced with the PHI argument corresponding to the predecessor the copy is moved into.

5. The dominator info is updated after performing steps (1)-(4).

6. The join node, being the source of the loop latch edge, is left empty except for the PHI nodes.

7. The implementation is done in the jump-threading phase of the machine-independent optimizations on the tree representation. Path splitting is invoked after the jump-threading optimization is performed. The following files are modified:

a) tree-vrp.c
b) tree-ssa-threadedge.c
c) tree-cfg.c
d) tree-ssa-threadedge.h
e) tree-cfg.h
f) cfghooks.c

The diff is attached along with this mail and pasted below. Please share your thoughts and feedback on the above optimization and on the design, coding, and implementation done for the design given above.

diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c
index 9faa339..559ca96 100644
--- a/gcc/cfghooks.c
+++ b/gcc/cfghooks.c
@@ -581,7 +581,7 @@ delete_basic_block (basic_block bb)

   /* If we remove the header or the latch of a loop, mark the loop for
      removal.
      */
-  if (loop->latch == bb
+  if (loop && loop->latch == bb
       || loop->header == bb)
     mark_loop_for_removal (loop);

diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index aed5254..b25e409 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val)
     }
 }

+void
+gimple_threaded_merge_blocks (basic_block a, basic_block b)
+{
+  gimple_stmt_iterator last, gsi, psi;
+
+  if (dump_file)
+    fprintf (dump_file, "Merging blocks %d and %d\n", a->index, b->index);
+
+  /* Remove labels from B and set gimple_bb to A for other statements.  */
+  for (gsi = gsi_start_bb (b); !gsi_end_p (gsi);)
+    {
+      gimple stmt = gsi_stmt (gsi);
+      if (glabel *label_stmt = dyn_cast <glabel *> (stmt))
+        {
+          tree label = gimple_label_label (label_stmt);
+          int lp_nr;
+
+          gsi_remove (&gsi, false);
+
+          if (FORCED_LABEL (label))
+            {
+              gimple_stmt_iterator dest_gsi = gsi_start_bb (a);
+              gsi_insert_before (&dest_gsi, stmt, GSI_NEW_STMT);
+            }
+          /* Other user labels keep around in a form of a debug stmt.  */
+          else if (!DECL_ARTIFICIAL (label) && MAY_HAVE_DEBUG_STMTS)
+            {
+              gimple dbg = gimple_build_debug_bind (label,
+                                                    integer_zero_node,
+                                                    stmt);
+              gimple_debug_bind_reset_value (dbg);
+              gsi_insert_before (&gsi, dbg, GSI_SAME_STMT);
+            }
+
+          lp_nr = EH_LANDING_PAD_NR (label);
+          if (lp_nr)
+            {
+
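A minimal example of the kind of loop this transformation targets — an IF-THEN-ELSE whose join block is also the source of the loop latch — might look like the sketch below; this is an illustrative shape with invented names, not a testcase taken from the patch:

int
path_split_candidate (int *a, int *b, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    {
      int t;
      if (a[i] > b[i])
        t = a[i] - b[i];
      else
        t = b[i] - a[i];
      /* Join block: this statement sits in the block that feeds the
         loop latch, so path splitting can duplicate it into both arms
         above, leaving an empty latch with only the PHI nodes.  */
      s += t;
    }
  return s;
}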
RE: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE
-Original Message- From: Jeff Law [mailto:l...@redhat.com] Sent: Friday, May 29, 2015 9:24 PM To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE

On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote:
> I have designed and implemented path splitting for loops with a conditional IF-THEN-ELSE, with the design given below. The implementation has gone through bootstrap for the Microblaze target along with the DejaGNU regression tests and the MiBench/EEMBC benchmarks. There is no regression in the DejaGNU tests, and the DejaGNU C++ test results are better with the path-splitting changes (fewer failures). The MiBench/EEMBC benchmarks show no performance degradation; a performance improvement of 9.15% is observed in telcom_gsm (MiBench) and of 9.4% in rgbcmy01_lite (EEMBC).
>
> A proposal on path splitting was made earlier; the reference is attached.

>>The first thing I notice is you haven't included any tests for the new optimization. We're trying really hard these days to have tests in the suite for this kind of new optimization/feature work.
>>I don't offhand know if any of the benchmarks you cite above are free enough to derive a testcase from. But one trick many of us use is to instrument the pass and compile some known free software (often GCC itself) to find triggering code and use that to generate tests for the new transformation.

I will add tests to the suite. I can see that many existing tests in the suite also trigger this optimization.

> Design changes.
>
> 1. For a block ending in a conditional IF statement, say BB1, the blocks dominated by BB1 are examined; the block that is dominated by BB1 but is not a successor of BB1 is the join node of the IF-THEN-ELSE inside the loop.

>>This isn't necessarily my area of strength, but it doesn't seem right to me. I guess it would work if there aren't any more nodes in the loop after the join block, except for the loop latch, but it doesn't seem right in the general case. It might work if you also verify that the IF block is the immediate dominator of the potential join node.

> 2. The join node must be the same as the source of the loop's latch edge.

>>OK. This is why your initial algorithm in #1 works.

Thanks.

> 3. If conditions (1) and (2) hold, the join node (the source of the loop latch edge) is moved into its predecessors, and the join node is left as an empty statement block containing only the PHI nodes.

>>Hopefully you do this by duplicating the join node and manipulating the CFG so that the original predecessors of the join each go to their copy of the join node. The copied join nodes then unconditionally pass control to the original join node.

Yes.

> 4. In the process of moving the join node into its predecessors, each use of a PHI result in the join node is replaced with the PHI argument corresponding to the predecessor the copy is moved into.

>>Right. Note there's a phi-only cprop pass which is designed to do this propagation and in theory should be very fast.

Thanks.

> 5. The dominator info is updated after performing steps (1)-(4).

>>Right.
>>Getting the sequencing of the various updates right can be tricky.
>>Even more so if you try to do it as part of jump threading, because you have to (for example) deal with unreachable blocks in the CFG.

> 6. The join node, being the source of the loop latch edge, is left empty except for the PHI nodes.

>>Right.

Thanks.

> 7. The implementation is done in the jump-threading phase of the machine-independent optimizations on the tree representation. Path splitting is invoked after the jump-threading optimization is performed.
> The following files are modified:
>
> a) tree-vrp.c
> b) tree-ssa-threadedge.c
> c) tree-cfg.c
> d) tree-ssa-threadedge.h
> e) tree-cfg.h
> f) cfghooks.c
>
> The diff is attached along with this mail and pasted below.

>>I would recommend implementing this as a separate pass. One of the goals for GCC 6 is to break the threader out into its own pass; if you're piggybacking on the threader, that just makes this task harder.
>>Also, by implementing this as a distinct pass you have