Proposal for a new unrolling degree computed before/after register allocation in GCC.

2015-06-13 Thread Ajit Kumar Agarwal
All:

Given a Data Dependency Graph (DDG), the unrolling degree proposed by Monica Lam
et al. is calculated as follows:

unrolling degree = (length of the longest live range) / (number of cycles in the
kernel, i.e. the initiation interval (II))

An unrolling degree based on the above leads to more register spills and reloads
inside the loops.

Given the Data Dependency Graph (DDG) and the unrolling degree calculated as
above, we build the reuse graph. Much has been published on reuse graphs: each
node is labelled with a DDG node that writes into a register, and an arc (u,v)
records that the destinations of u and v can share the same register when there
is no dependence between the corresponding iterations of the DDG nodes.

Where the reuse graph shows no dependence, the lifetimes do not overlap and the
same register can be assigned. Each arc carries a weight that describes the
allocated registers and the unrolling degree proposed by Monica Lam above. The
reuse graph is partitioned based on heuristics, and the sum of the arc weights
in each partition is calculated. Then the least common multiple of these
per-partition weight sums becomes the new unrolling degree (a minimal sketch of
this last step follows).
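As a minimal sketch of that last step in C (assuming the per-partition weight
sums have already been computed; all names here are hypothetical, not existing
GCC code):

#include <stddef.h>

/* Greatest common divisor, used to build the least common multiple.  */
static unsigned
gcd (unsigned a, unsigned b)
{
  while (b != 0)
    {
      unsigned t = b;
      b = a % b;
      a = t;
    }
  return a;
}

/* New unrolling degree: least common multiple of the weight sums of the
   reuse-graph partitions.  WEIGHTS[i] is the sum of arc weights in
   partition i; N is the number of partitions.  */
unsigned
new_unroll_degree (const unsigned *weights, size_t n)
{
  unsigned lcm = 1;
  for (size_t i = 0; i < n; i++)
    if (weights[i] != 0)
      lcm = lcm / gcd (lcm, weights[i]) * weights[i];
  return lcm;
}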

This new unrolling degree leads to better register allocation and reduced
register pressure: because the destination registers are shared between
iterations that, per the dependence distances in the DDG, have no dependence,
the allocation produces fewer spills and reloads.

Calculating the new unrolling degree as above improves register allocation for
loops that are candidates for lots of parallelism, by assigning the same
destination registers to DDG nodes that have no dependence between them.

Some of the logic for calculating the new unrolling degree is taken from the
unrolling degree proposed for periodic register allocation by Albert Cohen
et al.

This is used efficiently both before and after register allocation in GCC: the
reuse-graph weights that describe the allocated registers are known only after
register allocation, while the weights that describe the unrolling degree are
computed before register allocation, so the new unrolling degree can be
calculated efficiently and reduces register pressure.

Suggestions?

Thanks & Regards
Ajit 


RE: set_src_cost lying comment

2015-06-24 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jeff Law
Sent: Wednesday, June 24, 2015 10:36 AM
To: gcc@gcc.gnu.org
Subject: Re: set_src_cost lying comment

On 06/21/2015 11:57 PM, Alan Modra wrote:
> set_src_cost says it is supposed to
> /* Return the cost of moving X into a register, relative to the cost
> of a register move.  SPEED_P is true if optimizing for speed rather
> than size.  */
>
> Now, set_src_cost of a register move (set (reg1) (reg2)), is zero.
> Why?  Well, set_src_cost is used just on the right hand side of a SET, 
> so the cost is that of (reg2), which is zero according to rtlanal.c 
> rtx_cost.  targetm.rtx_costs doesn't get a chance to modify this.
>
> Now consider (set (reg1) (ior (reg2) (reg3))), for which set_src_cost 
> on rs6000 currently returns COSTS_N_INSNS(1).  It seems to me that 
> this also ought to return zero, if the set_src_cost comment is to be 
> believed.  I'd claim the right hand side of this expression costs the 
> same as a register move.  A register move machine insn "mr reg1,reg2"
> is encoded as "or reg1,reg2,reg2" on rs6000!
>>Certainly seems inconsistent -- all the costing stuff should be revisited.  
>>The basic design for costing dates back to the m68k/vax era.

>>I certainly agree that the cost of a move, logicals and arithmetic is 
>>essentially the same at the chip level for many processors.  But a copy has 
>>other properties that make it "cheaper" -- namely we can often propagate it 
>>away or arrange for the source & dest of the copy to have the same hard 
>>register which achieves the same effect.

I have seen for a target like MicroBlaze that setting the cost of move
instructions to something other than the chip-level cost/latency improves
performance on the Mibench/EEMBC benchmarks. Changing the cost of the branch
instruction, sometimes making it zero, also improves performance on the
Mibench/EEMBC benchmarks.
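For illustration of the inconsistency discussed above, a hypothetical fragment
of a target's rtx_costs hook (sketched against the GCC 5-era hook signature;
this is not any real port's code):

/* Hypothetical sketch, not a real port.  A bare (reg) on the RHS never
   reaches this hook: rtlanal.c's rtx_cost returns 0 for it directly, as
   noted above.  An (ior ...) does reach it, and is typically costed as
   one insn even on machines where a register move is encoded as an OR.  */
static bool
sketch_rtx_costs (rtx x ATTRIBUTE_UNUSED, int code,
                  int outer_code ATTRIBUTE_UNUSED,
                  int opno ATTRIBUTE_UNUSED, int *total,
                  bool speed ATTRIBUTE_UNUSED)
{
  switch (code)
    {
    case IOR:
      *total = COSTS_N_INSNS (1);   /* hence set_src_cost (ior) == 1,
                                       while set_src_cost (reg) == 0.  */
      return true;
    default:
      return false;                 /* fall back to generic costs.  */
    }
}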

>>So one could argue that a copy should have cost 0 as it has a reasonable 
>>chance of just going away, while logicals, alu operations on the appropriate 
>>chips should have a cost of 1.

Thanks & Regards
Ajit

jeff


RE: set_src_cost lying comment

2015-06-24 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard 
Kenner
Sent: Wednesday, June 24, 2015 9:28 PM
To: l...@redhat.com
Cc: gcc@gcc.gnu.org
Subject: Re: set_src_cost lying comment

> These are good examples of things the costing model simply wasn't ever 
> designed to consider  -- because they weren't significant issues on 
> the m68k, vax and other ports in the gcc-1 era.
> 
> So I don't really know how to tell you to proceed -- I've considered 
> the costing models fundamentally flawed for many years, but haven't 
> ever tried to come up with something that works better.

>>I'm not sure I'd call it "fundamentally flawed" or that it wasn't "designed 
>>to consider" these things.  I see the costing model as meant as a *heuristic* 
>>for making choices, not as a precise metric for anything.  Certainly, a lot 
>>of people worked on that model, including both of us, during the gcc-2 RISC 
>>porting days.

>>I see the "flaw" as trying to use it for too much.  There's a limit to the 
>>amount of decisions you can make with purely local data.  When trying to 
>>decide whether a source should be a constant or a register, you have to 
>>look at:

>>- the chances the reg-reg copy can be eliminated
>>- register pressure
>>- whether there's room to insn schedule around any conflicts

Increasing or decreasing the cost of moves affects optimal coalescing. It also
affects live range splitting, which generates shuffle code at the border when
the registers of the partner live ranges differ. For split live ranges where
one partner has a register and the other partner is in memory, a store is
needed at the border; conversely, when the live range is in memory and the
partner live range is in a register, a load is generated. All these conditions
and heuristics are tied to the move cost on x86 and other architectures (a
small sketch of the border cases follows).
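A small C sketch of those border cases (purely illustrative; the names are
hypothetical, not GCC's):

enum loc_kind { LOC_REG, LOC_MEM };

struct allocation
{
  enum loc_kind kind;
  int regno_or_slot;
};

/* Decide which shuffle code is needed at the border between two partner
   live ranges SRC and DST, per the cases described above.  A real
   allocator would emit the corresponding move, store or load insn.  */
const char *
shuffle_at_border (struct allocation src, struct allocation dst)
{
  if (src.kind == LOC_REG && dst.kind == LOC_REG)
    return src.regno_or_slot == dst.regno_or_slot
           ? "nothing" : "reg-reg move";
  if (src.kind == LOC_REG && dst.kind == LOC_MEM)
    return "store";  /* register partner feeding a spilled partner.  */
  if (src.kind == LOC_MEM && dst.kind == LOC_REG)
    return "load";   /* spilled partner feeding a register partner.  */
  return "mem-mem copy (via a scratch register)";
}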

Thanks & Regards
Ajit

>>Unless you look at the whole picture, you're just guessing, which is, of 
>>course, what a heuristic is all about!


Multi-version IF-THEN-ELSE conditional

2015-06-27 Thread Ajit Kumar Agarwal
All:

The presence of aliases disables many optimizations such as CCP (conditional
constant propagation), PRE (partial redundancy elimination), and scalar
replacement for conditional IF-THEN-ELSE. Aliasing also disables if-conversion.

I am proposing multi-version IF-THEN-ELSE, where one version of the
IF-THEN-ELSE is free of aliasing and the other versions carry the aliasing.
Creating these versions enables CCP, PRE, scalar replacement and if-conversion
on the version whose pointer variables do not alias, while the versions that
carry alias information are left untouched by those optimizations.

I don't have real-world examples on hand at the moment, but I am working on
providing them; a hypothetical illustration of the shape I have in mind
follows.
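As a hypothetical illustration (the pointers p and q and the runtime overlap
test are invented for the example):

void
example (int *p, int *q, int n)
{
  if (p + n <= q || q + n <= p)   /* runtime check: regions do not overlap */
    {
      /* Alias-free version: CCP, PRE, scalar replacement and
         if-conversion are free to work on this copy.  */
      for (int i = 0; i < n; i++)
        {
          if (p[i] > 0)
            q[i] = p[i] + 1;
          else
            q[i] = 0;
        }
    }
  else
    {
      /* May-alias version: left untouched by those optimizations.  */
      for (int i = 0; i < n; i++)
        {
          if (p[i] > 0)
            q[i] = p[i] + 1;
          else
            q[i] = 0;
        }
    }
}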

Thoughts?

Thanks & Regards
Ajit


Transformation from SEME(Single Entry Multiple Exit) to SESE(Single Entry Single Exit)

2015-07-02 Thread Ajit Kumar Agarwal
All:

Single entry and multiple exits disable traditional loop optimizations. The
presence of short-circuit code also makes the CFG single entry with multiple
exits. The transformation from SEME (Single Entry, Multiple Exits) to SESE
(Single Entry, Single Exit) enables many loop optimizations.

An approach like node splitting, which turns SEME regions into SESE regions, is
an important optimization on the CFG that enables transformations of loops and
conditionals.

The loop transformations in LLVM do node splitting to convert SEME regions into
SESE regions. The presence of break and goto statements inside loops makes the
CFG unstructured, turning it into SEME. Converting such unstructured control
flow into structured control flow enables many loop transformations (a simple
before/after illustration follows).
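As a simple before/after illustration of the effect on the CFG (this shows
exit merging rather than the node-splitting algorithm itself, but the
resulting region is the same: single entry, single exit):

/* Before: the break creates a second exit, so the loop is SEME.  */
int
find_before (const int *a, int n, int key)
{
  int pos = -1;
  for (int i = 0; i < n; i++)
    if (a[i] == key)
      {
        pos = i;
        break;          /* second exit out of the loop */
      }
  return pos;
}

/* After: the exit condition is folded into the loop header, so the
   loop has a single exit and forms an SESE region.  */
int
find_after (const int *a, int n, int key)
{
  int pos = -1;
  for (int i = 0; i < n && pos < 0; i++)
    if (a[i] == key)
      pos = i;
  return pos;
}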

I would like to implement a transformation phase on loops, run before any loop
optimization pass is enabled, that transforms an unstructured CFG into a
structured CFG, as LLVM does.

Does GCC already have such a transformation pass for loops? Please share your
thoughts.

Thanks & Regards
Ajit


Consideration of Cost associated with SEME regions.

2015-07-02 Thread Ajit Kumar Agarwal
All:

The Cost Calculation for a candidate to Spill in the Integrated Register 
Allocator(IRA) considers only the SESE regions.
The Cost Calculation in the IRA should consider the SEME regions into consider 
for spilling decisions. 

The Cost associated with the path that has un-matured exists should be less, 
thus making the more chances of spilling decision
In the path of  un-matured exits. The path that has un-matured (normal )exists 
should be having a higher cost than the cost of un-matured exists and
Spilling decisions has to made accordingly in order to spill inside the less 
frequency path with the un-matured exists than the high frequency
Path with the normal exits.

I would like to propose the above for consideration of cost associated with 
SEME regions in IRA.

Thoughts?

Thanks & Regards
Ajit


RE: Consideration of Cost associated with SEME regions.

2015-07-02 Thread Ajit Kumar Agarwal
Sorry for the typo error. I meant exits instead of exists.

The below is corrected.

The cost calculation for a spill candidate in the Integrated Register
Allocator (IRA) considers only SESE regions. The cost calculation in the IRA
should take SEME regions into consideration for spilling decisions as well.

The cost associated with a path that has un-matured exits should be lower,
increasing the chances of a spilling decision in the path with un-matured
exits. A path with a normal exit should have a higher cost than one with an
un-matured exit, and spilling decisions should be made accordingly, so that
spills land in the lower-frequency path with the un-matured exits rather than
the higher-frequency path with the normal exits (a small sketch of the
intended biasing follows).
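A minimal sketch of the intended biasing (hypothetical helper, not existing
IRA code; frequencies are assumed to be IRA-style scaled integers):

/* Scale a spill candidate's base cost by the execution frequency of the
   path it lies on, so a low-frequency path ending in an un-matured exit
   looks cheaper to spill in than the high-frequency normal-exit path.  */
int
biased_spill_cost (int base_cost, int path_freq, int entry_freq)
{
  return base_cost * path_freq / (entry_freq > 0 ? entry_freq : 1);
}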

I would like to propose the above for consideration of cost associated with 
SEME regions in IRA.

Thoughts?

Thanks & Regards
Ajit



-Original Message-
From: Ajit Kumar Agarwal 
Sent: Thursday, July 02, 2015 3:33 PM
To: vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Consideration of Cost associated with SEME regions.

All:

The Cost Calculation for a candidate to Spill in the Integrated Register 
Allocator(IRA) considers only the SESE regions.
The Cost Calculation in the IRA should consider the SEME regions into consider 
for spilling decisions. 

The Cost associated with the path that has un-matured exists should be less, 
thus making the more chances of spilling decision In the path of  un-matured 
exits. The path that has un-matured (normal )exists should be having a higher 
cost than the cost of un-matured exists and Spilling decisions has to made 
accordingly in order to spill inside the less frequency path with the 
un-matured exists than the high frequency Path with the normal exits.

I would like to propose the above for consideration of cost associated with 
SEME regions in IRA.

Thoughts?

Thanks & Regards
Ajit


Live on Exit renaming.

2015-07-04 Thread Ajit Kumar Agarwal
All:

Design and Analysis of Profile-Based Optimization in Compaq's
 Compilation Tools for Alpha; Journal of Instruction-Level
 Parallelism 3 (2000) 1-25

Based on the above paper, the existing tracer pass (which performs the tail
duplication needed for superblock formation) is implemented in GCC.

There is another optimization of interest in the above paper, described below.

Live on Exit Renamer:

This optimization tries to remove a constraint that forces the compiler to
create long dependent chains of operations in unrolled loops.

The following example

While (a[i] != key)
Return I;

Fig(1)

Unrolled loop:

1. while (a[i] != key)
   {
2.   i = i + 1;
3.   if (a[i] == key) goto E;
4.   i = i + 1;
5.   if (a[i] == key) goto E;
6.   i = i + 1;
7. }
8. E: return i;

Fig(2)

Live on Exit renamer transformation:

while (a[i] != key)
{
  i1 = i + 1;
  if (a[i1] == key) goto E1;
  i2 = i + 2;
  if (a[i2] == key) goto E2;
  i = i + 3;
}
E:  return i;
E1: i = i1; goto E;
E2: i = i2; goto E;

Fig(3).

The above transformation removes the liveness of i on the exits and makes the
live ranges in the unrolled loop non-overlapping. For example, line 4 in
Fig(2) cannot be moved above line 3 because i is live on that exit; the
transformation in Fig(3) removes the live-on-exit constraint, so the register
allocator can allocate an optimized register set over the now non-overlapping
ranges in the unrolled loop.

I am not sure why the above optimization is not implemented in GCC.
If there is no specific reason, I would like to implement it.

Thanks & Regards
Ajit




RE: Live on Exit renaming.

2015-07-04 Thread Ajit Kumar Agarwal
Sorry for the typo error.

The below is the corrected Fig (1).

while (a[i] != key)
  i = i + 1;
return i;

  Fig (1).

Thanks & Regards
Ajit
-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit 
Kumar Agarwal
Sent: Saturday, July 04, 2015 7:15 PM
To: l...@redhat.com; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Live on Exit renaming.

All:

Design and Analysis of Profile-Based Optimization in Compaq's
 Compilation Tools for Alpha; Journal of Instruction-Level
 Parallelism 3 (2000) 1-25

Based on the above paper, the existing tracer pass (which performs the tail
duplication needed for superblock formation) is implemented in GCC.

There is another optimization of interest in the above paper, described below.

Live on Exit Renamer:

This optimization tries to remove a constraint that forces the compiler to
create long dependent chains of operations in unrolled loops.

The following example

While (a[i] != key)
Return I;

Fig(1)

Unrolled loop:

1. while (a[i] != key)
   {
2.   i = i + 1;
3.   if (a[i] == key) goto E;
4.   i = i + 1;
5.   if (a[i] == key) goto E;
6.   i = i + 1;
7. }
8. E: return i;

Fig(2)

Live on Exit renamer transformation:

while (a[i] != key)
{
  i1 = i + 1;
  if (a[i1] == key) goto E1;
  i2 = i + 2;
  if (a[i2] == key) goto E2;
  i = i + 3;
}
E:  return i;
E1: i = i1; goto E;
E2: i = i2; goto E;

Fig(3).

The above transformation removes the liveness of i on the exits and makes the
live ranges in the unrolled loop non-overlapping. For example, line 4 in Fig(2)
cannot be moved above line 3 because i is live on that exit; the transformation
in Fig(3) removes the live-on-exit constraint, so the register allocator can
allocate an optimized register set over the now non-overlapping ranges in the
unrolled loop.

I am not sure why the above optimization is not implemented in GCC.
If there is no specific reason, I would like to implement it.

Thanks & Regards
Ajit




Allocation of hotness of data structure with respect to the top of stack.

2015-07-05 Thread Ajit Kumar Agarwal
All:

I am wondering whether allocating hot data structures closer to the top of the
stack increases the performance of the application. The data structures are
classified as hot and cold, all data structures are sorted in decreasing order
of hotness, and the hot data structures are allocated closest to the top of
the stack.

The loads and stores that access the stack-allocated data structures will then
be faster, with the hot data structures allocated closest to the top of the
stack.

Based on the above, code is generated with loads and stores using the correct
stack offsets, with the slots laid out in decreasing order of hotness (a small
sketch of the layout step follows).
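A small sketch of the intended layout step (hypothetical and target
independent; a real implementation must also honor alignment and the target's
stack direction):

#include <stdlib.h>

struct slot
{
  int id;              /* slot ids are assumed to be 0 .. n-1 */
  long access_count;   /* measured or estimated hotness */
  int size;
};

/* Sort hotter slots first.  */
static int
hotter_first (const void *a, const void *b)
{
  const struct slot *x = a, *y = b;
  return (y->access_count > x->access_count)
         - (y->access_count < x->access_count);
}

/* Assign frame offsets in decreasing order of hotness, so the hottest
   data sits closest to the top of the stack (offset 0 here).  */
void
assign_offsets (struct slot *slots, size_t n, int *offsets)
{
  qsort (slots, n, sizeof *slots, hotter_first);
  int off = 0;
  for (size_t i = 0; i < n; i++)
    {
      offsets[slots[i].id] = off;
      off += slots[i].size;
    }
}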

Thoughts?

Thanks & Regards
Ajit


Reduction Pattern ( Vectorization or Parallelization)

2015-07-05 Thread Ajit Kumar Agarwal
All:

Scalar and array reduction patterns can be identified when the result of a
commutative update with +, *, min or max is applied to the same scalar or
array variable on the LHS. Reduction patterns identified through such
commutative updates help in vectorization and parallelization.

For the following code:

for (j = 0; j <= N; j++)
{
  y = d[j];
  for (i = 0; i < 8; i++)
    X[a[i]] = X[a[i]] + c[i] * y;
}

Fig(1).

The above code exhibits a reduction pattern on X with respect to the outer
loop, with commutative updates on +, and could be identified in GCC as a
reduction pattern with respect to the outer loop. I am wondering whether it
can be recognized as a reduction pattern and reduced to vectorized code, given
that X is indexed by another array, so the accesses to X are not affine
expressions.

Can the above code be identified as a reduction pattern and transformed into
vectorized or parallelized code?

Thoughts?

Thanks & Regards
Ajit




RE: Live on Exit renaming.

2015-07-05 Thread Ajit Kumar Agarwal


-Original Message-
From: Bin.Cheng [mailto:amker.ch...@gmail.com] 
Sent: Monday, July 06, 2015 7:04 AM
To: Steven Bosscher
Cc: Ajit Kumar Agarwal; l...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live on Exit renaming.

On Mon, Jul 6, 2015 at 6:02 AM, Steven Bosscher  wrote:
> On Sat, Jul 4, 2015 at 3:45 PM, Ajit Kumar Agarwal wrote:
>> I am not sure why the above optimization is not implemented in GCC.
>
> -fsplit-ivs-in-unroller

>>And things might have changed.  Given the condition GCC does IVO on gimple,
>>unrolling on RTL, there is inconsistency between the two optimizers since IVO
>>takes register pressure of IVs into consideration and assumes IVs will take
>>single registers.  At least for some cases, splitting live ranges of IVs
>>results in bad code.  See PR29256 for more information.  As described in
>>the comment, actually I am going to do some experiments disabling such
>>transformation to see what happens.

The above optimization is implemented as part of the unroller on gimple. There
is an unroller pass on RTL which does not have support for this optimization.
Shouldn't the -fsplit-ivs-in-unroller optimization be implemented in the RTL
unroller pass? I am looking at this from the perspective of implementing the
-fsplit-ivs-in-unroller optimization in the RTL unroller pass.

Thanks & Regards
Ajit

Thanks,
bin
>
> Ciao!
> Steven


RE: Live on Exit renaming.

2015-07-05 Thread Ajit Kumar Agarwal


-Original Message-
From: Bin.Cheng [mailto:amker.ch...@gmail.com] 
Sent: Monday, July 06, 2015 10:26 AM
To: Ajit Kumar Agarwal
Cc: Steven Bosscher; l...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live on Exit renaming.

On Mon, Jul 6, 2015 at 12:02 PM, Ajit Kumar Agarwal 
 wrote:
>
>
> -Original Message-
> From: Bin.Cheng [mailto:amker.ch...@gmail.com]
> Sent: Monday, July 06, 2015 7:04 AM
> To: Steven Bosscher
> Cc: Ajit Kumar Agarwal; l...@redhat.com; Richard Biener; 
> gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
> Hunsigida; Nagaraju Mekala
> Subject: Re: Live on Exit renaming.
>
> On Mon, Jul 6, 2015 at 6:02 AM, Steven Bosscher  wrote:
>> On Sat, Jul 4, 2015 at 3:45 PM, Ajit Kumar Agarwal wrote:
>>> I am not sure why the above optimization is not implemented in GCC.
>>
>> -fsplit-ivs-in-unroller
>
>>>And things might have changed.  Given the condition GCC does IVO on gimple,
>>>unrolling on RTL, there is inconsistency between the two optimizers since IVO
>>>takes register pressure of IVs into consideration and assumes IVs will take
>>>single registers.  At least for some cases, splitting live ranges of IVs
>>>results in bad code.  See PR29256 for more information.  As described in
>>>the comment, actually I am going to do some experiments disabling such
>>>transformation to see what happens.
>
> The above optimization is implemented as a part of unroller in gimple. 
> There is an unroller pass in rtl which does not have support for this
>>As far as I understand, fsplit-ivs-in-unroller is a transformation in RTL 
>>unroller.

My mistake, you are right: -fsplit-ivs-in-unroller is a transformation in the
RTL unroller. IVO on gimple doesn't take unrolling into consideration and
assumes a single register is assigned per IV candidate. My thinking is that
splitting IVs in the RTL unroller removes the long dependent chains, allowing
the iterations to overlap, improving register allocation, and exposing
independent code that can then be moved (illustrated below).
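A source-level illustration of what the transformation buys (the pass itself
works on RTL, of course; this is only a sketch of the effect):

/* Unrolled by 3 before splitting: one serial chain of increments,
   each copy of the body waiting on the previous i.  */
void
before (int *a, int n, int s)
{
  for (int i = 0; i + 3 <= n;)
    {
      a[i] += s; i = i + 1;
      a[i] += s; i = i + 1;   /* depends on the previous increment */
      a[i] += s; i = i + 1;
    }
}

/* After splitting: each copy derives its index from the loop-entry
   value, so the address computations are independent of each other.  */
void
after (int *a, int n, int s)
{
  for (int i = 0; i + 3 <= n; i += 3)
    {
      a[i] += s;
      a[i + 1] += s;
      a[i + 2] += s;
    }
}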

You mentioned that splitting IV candidates results in bad code. I can only see
the positive side of this optimization; could you please elaborate on the
negative side of -fsplit-ivs-in-unroller, since you mentioned that it results
in bad code in some cases?

Thanks & Regards 
Ajit

Thanks,
bin
> optimization.  Shouldn't be the fsplit-ivs-in-unroller optimization 
> implemented in the unroller pass of rtl. I am looking at the implementation 
> perspective for implementing the fsplit-ivs-in-unroller optimizations in the 
> unroller rtl pass.
>
> Thanks & Regards
> Ajit
>
> Thanks,
> bin
>>
>> Ciao!
>> Steven


CFG transformation of loops with continue statement inside the loops.

2015-07-08 Thread Ajit Kumar Agarwal
All:

while/for (condition1)
{
  some code here;
  if (condition2)
    continue;
  some code here;
}

Fig(1)

For the above loop in Fig(1) there will be two backedges and multiple latches.
The code can be transformed as shown below in order to have a single backedge.

while/for (condition1)
{
  some code here;
  if (condition2)
    goto latch;
  some code here;
 latch:;
}

Fig(2).

With the transformation shown in Fig(2), the presence of the goto inside the
loop affects and disables many optimizations. Alternatively, the loop in
Fig(1) can be transformed into multiple loops, one per backedge, which
re-enables the affected optimizations: the resulting multiple loops enable
many optimizations that are disabled by the goto in Fig(2) and by the multiple
latches in Fig(1).

Thoughts?

Thanks & Regards
Ajit



RE: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE

2015-07-10 Thread Ajit Kumar Agarwal


-Original Message-
From: Jeff Law [mailto:l...@redhat.com]
Sent: Friday, July 10, 2015 4:04 AM
To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with 
Conditional IF-THEN-ELSE

On 06/02/2015 10:43 PM, Ajit Kumar Agarwal wrote:
>
>
> -Original Message-
> From: Jeff Law [mailto:l...@redhat.com]
> Sent: Tuesday, June 02, 2015 9:19 PM
> To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org
> Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju 
> Mekala
> Subject: Re: [RFC] Design and Implementation for Path Splitting for 
> Loop with Conditional IF-THEN-ELSE
>
> On 06/02/2015 12:35 AM, Ajit Kumar Agarwal wrote:
>>>> I don't offhand know if any of the benchmarks you cite above are 
>>>> free-enough to derive a testcase from.  But one trick many of us 
>>>> use is to instrument the pass and compile some known free 
>>>> software (often gcc
>>>> itself) to find triggering code and use that to generate tests for the new 
>>>> transformation.
>>
>> I will add tests in the suite. I could see many existing tests in the suite 
>> also get triggered with this optimization.
>>> Thanks.  For cases in the existing testsuite where you need to change the 
>>> expected output, it's useful to note why the expected output was changed.  
>>> Sometimes a test is compromised by a new optimization, sometimes the 
>>> expected output is changed and is papering over a problem, etc so it's 
>>> something we look at reasonably closely.
>
> Thanks. I will modify accordingly.
>
>>
>>>
>>> diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c index 9faa339..559ca96
>>> 100644
>>> --- a/gcc/cfghooks.c
>>> +++ b/gcc/cfghooks.c
>>> @@ -581,7 +581,7 @@ delete_basic_block (basic_block bb)
>>>
>>>   /* If we remove the header or the latch of a loop, mark the loop 
>>> for
>>>  removal.  */
>>> -  if (loop->latch == bb
>>> +  if (loop && loop->latch == bb
>>>   || loop->header == bb)
>>> mark_loop_for_removal (loop);
>>>> So what caused you to add this additional test?  In general loop 
>>>> structures are supposed to always be available.  The change here 
>>>> implies that the loop structures were gone at some point.  That 
>>>> seems at first glance a mistake.
>>
>> I was using the gimple_duplicate_bb which will not add the duplicate 
>> basic block inside the current_loops. That's why the above Condition 
>> is required. I am using duplicate_block instead of gimple_duplicate_bb. With 
>> this change the above check with loop Is not required as it adds the 
>> duplicate basic block inside the loops.
>>> OK.  Good to hear it's not required anymore.
>
>
>>
>>>
>>> diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c index aed5254..b25e409
>>> 100644
>>> --- a/gcc/tree-cfg.c
>>> +++ b/gcc/tree-cfg.c
>>> @@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val)
>>> }
>>> }
>>>
>>> +void
>>> +gimple_threaded_merge_blocks (basic_block a, basic_block b)
>> If we keep this function it will need a block comment.
>>
>>>> I say "if" for a couple reasons.  First we already have support 
>>>> routines that know how to merge blocks.  If you really need to 
>>>> merge blocks you should try to use them.
>>
>>>> Second, I'm not sure that you really need to worry about block 
>>>> merging in this pass.  Just create the duplicates, wire them into 
>>>> the CFG and let the existing block merging support handle this problem.
>>
>> The above routine is not merging but duplicates the join nodes into 
>> its predecessors. If I change the name of the above Function to the 
>> gimple_threaded_duplicating_join_node it should be fine.
>>> But you don't need to duplicate into the predecessors.  If you create the 
>>> duplicates and wire them into the CFG properly the existing code in 
>>> cfgcleanup should take care of this for you.
>
> Certainly I will do it.
>>
>>> diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c 
>>> index 4303a18..2c7d36d 100644
>>> --- a/gcc/tree-ssa-threadedge.c
>>> +++ b/gcc/tree-ssa-threadedge.c
>


Partition and subpartition Analysis that helps in further vectorization and parallelization

2015-07-14 Thread Ajit Kumar Agarwal
All:

I am trying to place the following analysis in the GCC vectorizer; it should
help the vectorizer to a great extent for unit-stride, zero-stride and
non-strided memory accesses.

The Data Dependence Graph (DDG) is topologically sorted, and on the sorted DDG
a timestamp is assigned to each node by the following algorithm:

for each node in topological order in the DDG
{
  timestamp(node) = max(0, timestamp of all predecessors) + 1;
}
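A minimal C sketch of this timestamping (the DDG node layout here is
hypothetical; the nodes are assumed to be given already in topological order):

#include <stddef.h>

struct ddg_node
{
  const int *preds;   /* indices of predecessor nodes */
  size_t n_preds;
};

/* Assign each node 1 + the maximum timestamp of its predecessors;
   nodes with no predecessors get timestamp 1.  Nodes that end up with
   the same timestamp are mutually independent and form one partition.  */
void
assign_timestamps (const struct ddg_node *nodes, size_t n, int *ts)
{
  for (size_t i = 0; i < n; i++)   /* topological order */
    {
      int max_pred = 0;
      for (size_t j = 0; j < nodes[i].n_preds; j++)
        if (ts[nodes[i].preds[j]] > max_pred)
          max_pred = ts[nodes[i].preds[j]];
      ts[i] = max_pred + 1;
    }
}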

Based on this timestamp calculation, the DDG is partitioned: each partition
holds the nodes with the same timestamp, so the nodes in a partition are
independent in the DDG and can be vectorized together. To enable the
vectorization, each partition is then sub-partitioned by contiguous versus
non-contiguous access: the memory addresses of all operands of each node in
the partition are sorted in increasing/decreasing order, and based on that
sorted order the sub-partitions separate unit-stride accesses, zero-stride
accesses, and accesses that require shuffling of operands through vectorized
instructions.

The above analysis also helps in performing data layout on the partitioned DDG
nodes: based on the sub-partitions formed above, more vectorization
opportunities are enabled by performing data layout on the non-contiguous
accesses, while the sub-partitions with contiguous accesses help vectorization
directly.

Thoughts?

Thanks & Regards
Ajit


Traces on Data Dependency graph.

2015-07-14 Thread Ajit Kumar Agarwal
All:

I am wondering how useful it would be to form traces on the Data Dependency
Graph. On top of traces on the control flow graph, I was thinking of forming
traces on the data dependency graph (DDG).

Would this help in finding further vectorization and parallelization
candidates?

Thoughts?

Thanks & Regards
Ajit


RE: Traces on Data Dependency graph.

2015-07-14 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Tuesday, July 14, 2015 6:35 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; Jan Hubicka; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Traces on Data Dependency graph.

On Tue, Jul 14, 2015 at 2:12 PM, Ajit Kumar Agarwal 
 wrote:
> All:
>
> I am wondering how useful to form the traces on Data Dependency Graph. 
> On top of the traces in the Control flow graph,  I was thinking of forming  
> the traces on data Dependency graph(DDG).
>
> Would this helps in  further  vectorization and parallelization candidates.
>
> Thoughts?

>>Not sure how you'd do that?  Transform data dependences to control 
>>dependences?

Work has been done on control flow traces and dependency traces. Control flow
traces can explicitly identify the defs and uses of registers, but the flow
information for memory is difficult to extract from control flow traces, so it
depends on data dependency traces. Data dependency traces are long in the
sense that they require the execution instance of each statement along with
the data dependence. The DDG annotates the statements, and it can be further
enhanced with the execution instance of each statement. There has been work by
Zhang and Gupta et al. forming whole execution traces with both control flow
edges and dependency edges at the granularity of statements and blocks.

I was thinking of data dependency traces over the DDG, holding the statements
augmented with their execution instances, which would provide traces for the
flow information of memory.

Both dependence (control and data) profiles have been used in data speculative
optimizations (Lin et al. 2003-2004), speculative optimizations (2004), and
the computation of dynamic slices.

I am not sure how useful this would be and would like to have your opinion on
it.

Thanks & Regards
Ajit

> Thanks & Regards
> Ajit


RETURN_ADDRESS_POINTER_REGNUM Macro

2015-07-23 Thread Ajit Kumar Agarwal
All:

From the description of the RETURN_ADDRESS_POINTER_REGNUM macro, it appears
that the macro defines a register that helps in getting the return address
from the stack or the frame pointer.

I can see that many of the architectures supported by GCC do not define this
macro.

Does this impact the performance or correctness of the compiler? In what cases
is it applicable to define it for a given architecture? (For context, a small
example of the user-visible feature involved is below.)
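For context, a minimal example of the user-visible feature this machinery
ultimately serves; how the address is found per target is what
RETURN_ADDRESS_POINTER_REGNUM and the related RETURN_ADDR_RTX macro describe:

#include <stdio.h>

void
who_called_me (void)
{
  /* Return address of the current frame.  */
  printf ("called from %p\n", __builtin_return_address (0));
}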

Thanks & Regards
Ajit


vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-08-02 Thread Ajit Kumar Agarwal
All:

The following macro determines the per-statement cost that is added to the
vectorization cost:

#define TARGET_VECTORIZE_ADD_STMT_COST.

In the implementation of the above macro, the following is done for many
architectures with vectorization support, such as i386 and ARM:

if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
count *= 50;  /* FIXME.  */

I have the  following questions.

1. Why was the multiplication factor of 50 chosen?
2. The comment mentions that the inner loop relative to the loop being
vectorized is given more weight. If more weight is added to the inner loop,
the chances of vectorizing the inner loop would seem to decrease. Why is the
inner-loop cost increased relative to the loop being vectorized?

Thanks & Regards
Ajit


RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-08-04 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Monday, August 03, 2015 2:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal 
 wrote:
> All:
>
> The definition of the following macro that determine the statement cost that 
> adds to vectorization cost.
>
> #define TARGET_VECTORIZE_ADD_STMT_COST.
>
> In the implementation of the above macro the following is done for many 
> vectorization supported architectures like i386, ARM.
>
> if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
> count *= 50;  /* FIXME.  */
>
> I have the  following questions.
>
> 1. Why the multiplication factor of 50 is choosen?

>>It's a wild guess.  See 
>>tree-vect-loop.c:vect_get_single_scalar_iteration_cost.

> 2. The comment mentions that the inner loop relative to the loop being 
> vectorized is added more weight. If more weight is added to the inner 
> loop for the loop being vectorized, the chances of vectorizing the inner loop 
> decreases. Why the inner loop cost is increased with relative to the loop 
> being vectorized?

>>In fact adding more weight to the inner loop increases the chance of 
>>vectorizing it (if vectorizing the inner loop is profitable).
>>Both scalar and vector cost get biased by a factor of 50 (we assume 50 
>>iterations of the inner loop for one iteration of the outer loop), so a 
>>non-profitable vectorization in the outer loop can be offsetted by 
>>profitable inner loop vectorization.

>>Yes, '50' can be improved if we actually know the iteration count of the 
>>inner loop or if we have profile-feedback.

Thanks for the valuable information and suggestions.

Thanks & Regards
Ajit

Richard.


> Thanks & Regards
> Ajit


Loop distribution for nested Loops.

2015-08-04 Thread Ajit Kumar Agarwal

All:

For the loop given in Fig(1) there is no possibility of loop distribution,
because S1 and S2 depend on the outer-loop index k; due to this dependence the
loop cannot be distributed.

The loop can be distributed with the transformation given in Fig(2), using
dependency hoisting: a new outer loop is inserted and the dependence is
shifted onto it. This makes the k loop (S1) and the j loop (S2) distributable,
with the dependence of S1 and S2 transferred to the inserted outer loop.

do k = 1, n-1
  do i = k+1, n
S1: a(i,k) = a(i,k) / a(k,k)
  enddo
  do j = k+1, n
    do i = k+1, n
S2:   a(i,j) = a(i,j) - a(i,k) * a(k,j)
    enddo
  enddo
enddo

Fig(1)

do x = 1, n
  do k = 1, x-1
    do i = k+1, n
S2:   a(i,x) = a(i,x) - a(i,k) * a(k,x)
    enddo
  enddo
  do i = x+1, n
S1: a(i,x) = a(i,x) / a(x,x)
  enddo
enddo

Fig(2).

The above transformation looks interesting, making nested loops with
dependences candidates for loop distribution by shifting the dependence onto a
newly inserted outer loop.

It would be useful to have a dependency hoisting transformation that makes
loop distribution possible for nested loops.

My question is: does the partitioning-based loop distribution transformation
handle distribution of nested loops?

Thanks & Regards
Ajit







More of a Loop distribution.

2015-08-13 Thread Ajit Kumar Agarwal
All:

Loop distribution uses the DDG to decide on distributing loops. Loops with
control statements such as IF-THEN-ELSE can also be distributed; instead of
only the data dependency graph, the control dependence graph should be
considered in order to distribute loops in the presence of control statements.

The presence of multiple exits in a loop can also be considered for the loop
distribution transformation.

The above transformations help the Libquantum benchmark in SPEC 2006.

The following articles look interesting to me:

"Loop Distribution in the Presence of Arbitrary Control Flow", Ken Kennedy et al.
"Loop Distribution in the Presence of Multiple Exits", Bor-Ming Hsieh et al.


I don't think loop distribution in the presence of control flow is implemented
in GCC/LLVM.

I think it is feasible to implement the above in GCC.

Thanks & Regards
Ajit


RE: More of a Loop distribution.

2015-08-13 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Thursday, August 13, 2015 3:23 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: Re: More of a Loop distribution.

On Thu, Aug 13, 2015 at 9:37 AM, Ajit Kumar Agarwal 
 wrote:
> All:
>
> Loop distribution considers DDG to decide on distributing the Loops. 
> The Loops with control statements like IF-THEN-ELSE can also be 
> Distributed. Instead of Data Dependency Graph, the Control Dependence Graph 
> should be considered in order to distribute the loops In presence of control 
> Statements.
>
> Also the presence of multiple exits in the Loop can also be considered for 
> Loop distribution transformation.
>
> Thus the above transformation helps in the Libquantum benchmarks for SPEC 
> 2006.
>
> There are following articles that looks interesting to me.
>
> "Loop Distribution in presence of arbitrarily control flow Ken Kennedy et.al."
> "Loop Distribution in presence of Multiple Exits Bor-Ming Hsieh etal."
>
>
> I don't think the loop distribution in presence of control flow is 
> implemented in GCC/LLVM.
>
> I think it is feasible to consider the above for the implementation in GCC.

>>It's true that loop distribution does not try to distribute based on any 
>>control structure heuristics but it only considers data locality.  It does 
>>however already compute the control dependence graph (and uses it to add 
>>control edges to the DDG to properly add data dependence edges to uses of 
>>control statements necessary in partitions).

>>So it should be a matter of specifying the proper set of starting statements 
>>it tries separating.

Thanks.

>>Not sure which kind of distribution you are after, can you give an example?

I would like to have distribution of loops that contain control flow. For
example:

for (i = 2; i < N; i++)
{
  if (condition)
    {
S1:   A[i] = ...
S2:   D[i] = A[i-1] ...
    }
}

The above loop can be distributed into two loops, one containing S1 inside the
IF and another containing S2 inside the IF. Two scenarios can occur.

1. The condition inside the IF tests A[i] and is therefore dependent on S1. In
this case the distribution is difficult, and the article by Ken Kennedy et al.
stores the partial results of the comparison in a temporary array which is
then tested inside the IF condition. This makes the loop distributable; this
is what I was looking for and found in the above article.

2. The condition inside the IF in the above loop is not dependent on S1 or S2;
in this case the loop can be distributed directly.

In both scenarios GCC can't distribute the loop, as the control dependence
graph (control structure) is not used. The advantage of distributing the loop
is that it becomes vectorizable, which is otherwise not possible due to the
presence of multiple statements inside the IF; the loop may also not be
if-converted for the same reason. If we distribute the loop for the above two
scenarios, the individual distributed loops can be vectorized, which is
otherwise not possible. (A sketch of the scenario-1 shape follows.)
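As a hypothetical sketch of scenario 1 in the Kennedy-style form (condition
and f are stand-ins for the original loop's condition and right-hand side;
cond_tmp is the temporary array of partial comparison results):

extern int condition (int i);
extern int f (int i);

void
distributed (int *A, int *D, unsigned char *cond_tmp, int N)
{
  /* Loop 1: evaluate the condition once, record it, and execute S1.  */
  for (int i = 2; i < N; i++)
    {
      cond_tmp[i] = condition (i) != 0;
      if (cond_tmp[i])
        A[i] = f (i);          /* S1 */
    }

  /* Loop 2: S2 replays the recorded condition.  Each loop can now be
     considered for vectorization on its own.  */
  for (int i = 2; i < N; i++)
    if (cond_tmp[i])
      D[i] = A[i - 1];         /* S2 */
}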

Thanks & Regards
Ajit


Richard.

> Thanks & Regards
> Ajit


RE: More of a Loop distribution.

2015-08-13 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Friday, August 14, 2015 11:30 AM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: RE: More of a Loop distribution.

On August 14, 2015 4:59:07 AM GMT+02:00, Ajit Kumar Agarwal 
 wrote:
>
>
>-Original Message-
>From: Richard Biener [mailto:richard.guent...@gmail.com]
>Sent: Thursday, August 13, 2015 3:23 PM
>To: Ajit Kumar Agarwal
>Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
>Vidhumouli Hunsigida; Nagaraju Mekala
>Subject: Re: More of a Loop distribution.
>
>On Thu, Aug 13, 2015 at 9:37 AM, Ajit Kumar Agarwal 
> wrote:
>> All:
>>
>> Loop distribution considers DDG to decide on distributing the Loops. 
>> The Loops with control statements like IF-THEN-ELSE can also be 
>> Distributed. Instead of Data Dependency Graph, the Control Dependence
>Graph should be considered in order to distribute the loops In presence 
>of control Statements.
>>
>> Also the presence of multiple exits in the Loop can also be
>considered for Loop distribution transformation.
>>
>> Thus the above transformation helps in the Libquantum benchmarks for
>SPEC 2006.
>>
>> There are following articles that looks interesting to me.
>>
>> "Loop Distribution in presence of arbitrarily control flow Ken
>Kennedy et.al."
>> "Loop Distribution in presence of Multiple Exits Bor-Ming Hsieh
>etal."
>>
>>
>> I don't think the loop distribution in presence of control flow is
>implemented in GCC/LLVM.
>>
>> I think it is feasible to consider the above for the implementation
>in GCC.
>
>>>It's true that loop distribution does not try to distribute based on
>any control structure heuristics but it only considers data locality. 
>It does however already compute the control dependence graph (and 
>uses it to add control edges to the DDG to properly add data dependence 
>edges to uses of control statements necessary in partitions).
>
>>>So it should be a matter of specifying the proper set of starting
>statements it tries separating.
>
>Thanks.
>
>>>Not sure which kind of distribution you are after, can you give an
>example?
>
>I would like to have a distribution of the loop having control flow.
>For example
>
>For (I = 2 ; I < N; i++)
>{
>If  (condition)
>   {
>  S1: A[i] = ...
>  S2:D[i] = A[i-1]...
>   }
>}
>
>The above loop can be distributed with two loops having one loop with
>S1  inside IF and another loop with S2 with the IF.
>The two scenario can be true.
>
>1. The condition inside IF have a check on A[i] and is dependent on S1.
>In this case the distribution is difficult. And the above article From 
>Ken Kennedy et.al does store the partial results of comparison in an 
>temporary array and that array is compared inside the IF Condition. 
>This makes the loop distributed. This is what I was looking for which I 
>found in the above article.
>
>2. The condition inside the IF in the above loop is not dependent on 
>the S1 and S2 , and this case the loop can be distributed.
>
>In the above two scenario the GCC can't distribute the loop, as the 
>control dependency graph ( control structure ) is not used.

>>The above loop can be distributed by gcc just fine.

Does the existing loop distribution implementation in GCC distribute the above
loop in both of the above scenarios?

Thanks & Regards
Ajit

Richard.

 The
>advantage of
>The above loop distribution makes the loop vectorizable which otherwise 
>not possible due to presence of multiple statements inside the IF and 
>Also may not be IF-converted due to presence of multiple statements. If 
>we distribute the loop for the above two scenarios the individual loops
>
>in the distributed loop can be vectorized which is otherwise not 
>possible.
>
>Thanks & Regards
>Ajit
>
>
>Richard.
>
>> Thanks & Regards
>> Ajit




RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-08-14 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Monday, August 03, 2015 2:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal 
 wrote:
> All:
>
> The definition of the following macro that determine the statement cost that 
> adds to vectorization cost.
>
> #define TARGET_VECTORIZE_ADD_STMT_COST.
>
> In the implementation of the above macro the following is done for many 
> vectorization supported architectures like i386, ARM.
>
> if (where == vect_body && stmt_info && stmt_in_inner_loop_p (stmt_info))
> count *= 50;  /* FIXME.  */
>
> I have the  following questions.
>
> 1. Why the multiplication factor of 50 is choosen?

>>It's a wild guess.  See 
>>tree-vect-loop.c:vect_get_single_scalar_iteration_cost.

> 2. The comment mentions that the inner loop relative to the loop being 
> vectorized is added more weight. If more weight is added to the inner 
> loop for the loop being vectorized, the chances of vectorizing the inner loop 
> decreases. Why the inner loop cost is increased with relative to the loop 
> being vectorized?

>>In fact adding more weight to the inner loop increases the chance of 
>>vectorizing it (if vectorizing the inner loop is profitable).
>>Both scalar and vector cost get biased by a factor of 50 (we assume 50 
>>iterations of the inner loop for one iteration of the outer loop), so a 
>>non-profitable vectorization in the outer loop can be offsetted by 
>>profitable inner loop vectorization.

>>Yes, '50' can be improved if we actually know the iteration count of the 
>>inner loop or if we have profile-feedback.

Instead of biasing the vector and scalar costs by a factor of 50, can the
benefit of vectorization be calculated as follows?

benefit = scalar_cost - vector_cost / VF;
cost = 0;
for (i = 1; i < N; i++)
  cost = cost + (final_value(i) - initial_value(i)) / steps(i);
benefit = benefit * cost;

where

N             = number of levels of the loop nest;
final_value   = final iteration count of the loop at level i;
initial_value = initial iteration count of the loop at level i;
steps         = iteration step of the loop at level i;
VF            = vectorization factor.

Thus an increase in the number of loop levels increases the benefit of
vectorization. Also, if the scalar cost is higher than the vector cost, then
scalar_cost - vector_cost / VF increases for the same vectorization factor,
again increasing the benefit of vectorization (a sketch of this computation is
below).
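A sketch of the proposed metric in C (the per-level final/initial/step values
are assumed to be supplied by the caller; this is only the formula above, not
existing GCC code):

/* N_LEVELS = number of loop levels; VF = vectorization factor.  */
double
proposed_benefit (double scalar_cost, double vector_cost, double vf,
                  const double *final_v, const double *initial_v,
                  const double *steps, int n_levels)
{
  double benefit = scalar_cost - vector_cost / vf;
  double cost = 0.0;
  for (int i = 0; i < n_levels; i++)
    cost += (final_v[i] - initial_v[i]) / steps[i];
  return benefit * cost;
}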

Thanks & Regards
Ajit

Richard.


> Thanks & Regards
> Ajit


More of a Loop fusion

2015-08-16 Thread Ajit Kumar Agarwal
All:

Loop fusion is an important optimization that fuses a set of loops when the
following conditions hold (a small example meeting all four follows the list).

1. The loops are conformant, i.e. they have the same iteration count.
2. The loops are control equivalent. Control equivalence can be determined
from dominator and post-dominator properties: two loops Lj and Lk are control
equivalent if Lj dominates Lk and Lk post-dominates Lj.
3. The loops are adjacent, i.e. there is no intervening code between them.
Intervening code can be detected with dominance properties: if there is a node
ax such that loop Lj dominates ax and loop Lk post-dominates ax, then ax is
intervening code.
4. Any dependences between the loops are forward dependences.
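A small example meeting all four conditions (hypothetical; the only dependence
is a forward dependence on a[]):

void
unfused (int *a, const int *b, int *c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] + 1;
  for (int i = 0; i < n; i++)
    c[i] = a[i] * 2;    /* forward dependence on a[i] */
}

void
fused (int *a, const int *b, int *c, int n)
{
  for (int i = 0; i < n; i++)
    {
      a[i] = b[i] + 1;
      c[i] = a[i] * 2;  /* the dependence is satisfied within the iteration */
    }
}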

If property 1 does not hold, i.e. the loops are not conformant because they do
not have the same iteration count, the transformation shown in Fig(1) and
Fig(2) can still fuse them.

In Fig(1) the i1 and i2 loops are not conformant, as they do not have the same
iteration count. The transformation in Fig(2) fuses the i1 and i2 loops
despite the differing iteration counts, by guarding the extra iterations.

If property 3 does not hold and the loops are not adjacent, the intervening
code can be moved above or below the loops so that they become adjacent with
no intervening code; the intervening code can also be split, with part moved
up and part moved down based on the dependence relation and the dominance
properties. Such a transformation makes the loops adjacent with no intervening
code.

Loop fusion should be done after the loop normalization pass, so the loop
normalization pass must run before loop fusion. The four properties required
for fusion are checked while traversing the loops from outermost to innermost:
the outer loops are traversed first, the loops at the same level as an outer
loop are put into one partition, and a partition is fused if the four
properties above hold, otherwise the loops are taken out of the partition. The
same logic is applied traversing from outer to inner loops, forming partitions
that are later fused based on the above properties.

Each loop is traversed in the forward order of the dominator tree and also in
the reverse order (post-dominators), and the forward dependences are checked.
Based on the dependences the loops are fused, and for non-adjacent loops the
intervening code is moved up, moved down, or split with part moved in each
direction.

I think the above transformations and properties can enhance the loop fusion
algorithm in GCC to achieve more loop fusion on the SPEC benchmarks.

for (i1 = 0; i1 < n; i1++)
{
  if (i1 < n-1)
    S1;
  else
    S2;
  S3;
}
for (i2 = 0; i2 < n-2; i2++)
  S4;

Fig(1).

for (i1 = 0; i1 < n; i1++)
{
  if (i1 < n-2)
    {
      if (i1 < n-1)
        S1;
      else
        S2;
      S3;
      S4;
    }
  else
    {
      if (i1 < n-1)
        S1;
      else
        S2;
      S3;
    }
}

Fig(2).

Thanks & Regards
Ajit

   





RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-08-17 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Friday, August 14, 2015 9:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On August 14, 2015 5:03:58 PM GMT+02:00, Ajit Kumar Agarwal 
 wrote:
>
>
>-Original Message-
>From: Richard Biener [mailto:richard.guent...@gmail.com]
>Sent: Monday, August 03, 2015 2:59 PM
>To: Ajit Kumar Agarwal
>Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
>Vidhumouli Hunsigida; Nagaraju Mekala
>Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
>
>On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal 
> wrote:
>> All:
>>
>> The definition of the following macro that determine the statement
>cost that adds to vectorization cost.
>>
>> #define TARGET_VECTORIZE_ADD_STMT_COST.
>>
>> In the implementation of the above macro the following is done for
>many vectorization supported architectures like i386, ARM.
>>
>> if (where == vect_body && stmt_info && stmt_in_inner_loop_p
>(stmt_info))
>> count *= 50;  /* FIXME.  */
>>
>> I have the  following questions.
>>
>> 1. Why the multiplication factor of 50 is choosen?
>
>>>It's a wild guess.  See
>tree-vect-loop.c:vect_get_single_scalar_iteration_cost.
>
>> 2. The comment mentions that the inner loop relative to the loop
>being
>> vectorized is added more weight. If more weight is added to the inner
>
>> loop for the loop being vectorized, the chances of vectorizing the
>inner loop decreases. Why the inner loop cost is increased with 
>relative to the loop being vectorized?
>
>>>In fact adding more weight to the inner loop increases the chance of
>vectorizing it (if vectorizing the inner loop is profitable).
>>>Both scalar and vector cost get biased by a factor of 50 (we assume
>50 iterations of the inner loop for one iteration of the outer loop), 
>so a non-profitable >>vectorization in the outer loop can be offsetted 
>by profitable inner loop vectorization.
>
>>>Yes, '50' can be improved if we actually know the iteration count of
>the inner loop or if we have profile-feedback.
>
>Instead of vector and scalar cost get biased by a factor of 50, Can the 
>benefit of vectorization calculated as follows
>
>Benefit = scalar cost - vector cost/VF; Cost = 0; For ( I = 1; I < N; 
>i++) {
>Cost = cost + (final_value - Initial value)/steps; }
>
>Benefit = Benefit * cost;
>
>Where
>N = No. of levels of the loop;
>Final_value = Final iteration count of the loop.
>Initial_value = Initial Iteration count of the loop.
>Steps = steps of the iteration for the loop.
>VF = vectorization factor.
>
>Thus increase in the Levels of the loops increases  the benefit of 
>vectorization. Also if the scalar cost is more than the vectorization 
>cost then the Scalar cost - vector cost /VF increases with the same 
>vectorization Factor thus increasing the benefit of vectorization.

>>Sure.  But the number of iterations may only be available symbolically, thus 
>>the cost be only useful for the dynamic check at runtime.  A better static 
>>>>estimate would also be useful.

Thanks. For the cases where the loop bound is known at compile time, it can be
obtained through value range analysis. GCC already uses value range information
to calculate loop bounds, and the same loop bound info can give a static estimate
of the number of iterations.

Based on that estimate, the cost calculation I described above can be used for the
vectorization cost analysis.
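
To make the proposal concrete, here is a minimal stand-alone sketch of the benefit
computation described in the quoted text. This is not GCC code; the structure and
names are hypothetical, with one entry per loop level standing in for the static
bound info.

#include <stdio.h>

struct loop_level {
  double initial;   /* initial iteration count */
  double final;     /* final iteration count   */
  double steps;     /* step of the iteration   */
};

static double
vect_benefit (double scalar_cost, double vector_cost, double vf,
              const struct loop_level *lev, int n)
{
  /* Benefit = scalar cost - vector cost / VF, scaled by the summed
     per-level iteration estimates.  */
  double benefit = scalar_cost - vector_cost / vf;
  double cost = 0.0;
  for (int i = 0; i < n; i++)
    cost += (lev[i].final - lev[i].initial) / lev[i].steps;
  return benefit * cost;
}

int
main (void)
{
  struct loop_level levels[2] = { { 0, 100, 1 }, { 0, 50, 2 } };
  printf ("%f\n", vect_benefit (10.0, 16.0, 4.0, levels, 2));
  return 0;
}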

Thanks & Regards
Ajit

Richard.

>Thanks & Regards
>Ajit
>
>Richard.
>
>
>> Thanks & Regards
>> Ajit




RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-08-19 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit 
Kumar Agarwal
Sent: Monday, August 17, 2015 4:03 PM
To: Richard Biener
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST



-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Friday, August 14, 2015 9:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On August 14, 2015 5:03:58 PM GMT+02:00, Ajit Kumar Agarwal 
 wrote:
>
>
>-Original Message-
>From: Richard Biener [mailto:richard.guent...@gmail.com]
>Sent: Monday, August 03, 2015 2:59 PM
>To: Ajit Kumar Agarwal
>Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
>Vidhumouli Hunsigida; Nagaraju Mekala
>Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
>
>On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal 
> wrote:
>> All:
>>
>> The definition of the following macro that determine the statement
>cost that adds to vectorization cost.
>>
>> #define TARGET_VECTORIZE_ADD_STMT_COST.
>>
>> In the implementation of the above macro the following is done for
>many vectorization supported architectures like i386, ARM.
>>
>> if (where == vect_body && stmt_info && stmt_in_inner_loop_p
>(stmt_info))
>> count *= 50;  /* FIXME.  */
>>
>> I have the  following questions.
>>
>> 1. Why the multiplication factor of 50 is choosen?
>
>>>It's a wild guess.  See
>tree-vect-loop.c:vect_get_single_scalar_iteration_cost.
>
>> 2. The comment mentions that the inner loop relative to the loop
>being
>> vectorized is added more weight. If more weight is added to the inner
>
>> loop for the loop being vectorized, the chances of vectorizing the
>inner loop decreases. Why the inner loop cost is increased with 
>relative to the loop being vectorized?
>
>>>In fact adding more weight to the inner loop increases the chance of
>vectorizing it (if vectorizing the inner loop is profitable).
>>>Both scalar and vector cost get biased by a factor of 50 (we assume
>50 iterations of the inner loop for one iteration of the outer loop), 
>so a non-profitable >>vectorization in the outer loop can be offsetted 
>by profitable inner loop vectorization.
>
>>>Yes, '50' can be improved if we actually know the iteration count of
>the inner loop or if we have profile-feedback.
>
>Instead of vector and scalar cost get biased by a factor of 50, Can the 
>benefit of vectorization calculated as follows
>
>Benefit = scalar cost - vector cost/VF; Cost = 0; For ( I = 1; I < N;
>i++) {
>Cost = cost + (final_value - Initial value)/steps; }
>
>Benefit = Benefit * cost;
>
>Where
>N = No. of levels of the loop;
>Final_value = Final iteration count of the loop.
>Initial_value = Initial Iteration count of the loop.
>Steps = steps of the iteration for the loop.
>VF = vectorization factor.
>
>Thus increase in the Levels of the loops increases  the benefit of 
>vectorization. Also if the scalar cost is more than the vectorization 
>cost then the Scalar cost - vector cost /VF increases with the same 
>vectorization Factor thus increasing the benefit of vectorization.

>>Sure.  But the number of iterations may only be available symbolically, thus 
>>the cost be only useful for the dynamic check at runtime.  A better static 
>>>>estimate would also be useful.

>>Thanks. For the cases the loop bound can be known at the compile time, 
>>through Value Range Analysis. Already GCC uses the value range  
>>>>Information/Analysis To calculate the Loop bound. We can use the same loop 
>>bound info to get the static estimate on the number of iterations.

 >> Based on the above estimates, the above cost calculation as I have 
 >> mentioned can be used for Vectorization cost Analysis.

On top of the above, the vectorizer cannot vectorize a loop whose trip count
(iteration count) is not known. So whenever the number of iterations is needed for
the vectorizer cost calculation, the trip count is in fact known. The summation of
the iteration counts of all the loop levels where the count is known gives the
static estimate and can be included in the above vectorization cost. For the loops
where the trip count is not known, the vectorizer cannot vectorize, and the
iteration count of such loops can be neglected in the vectorization cost
calculation.

Only SLP or partial vectorization is possible for such loops, since SLP considers
isomorphic operations instead of vectorizing based on the trip count of the loops.
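
A small stand-alone sketch of that static estimate (hypothetical structures, not
GCC code): sum the trip counts of the loop levels whose bounds are known and skip
the unknown ones.

#include <stdio.h>

struct loop_info {
  int trip_count_known;  /* nonzero if the bound is a compile-time constant */
  long trip_count;
};

static long
static_iteration_estimate (const struct loop_info *loops, int n)
{
  long estimate = 0;
  for (int i = 0; i < n; i++)
    if (loops[i].trip_count_known)
      estimate += loops[i].trip_count;   /* unknown levels are neglected */
  return estimate;
}

int
main (void)
{
  struct loop_info nest[3] = { { 1, 100 }, { 0, 0 }, { 1, 8 } };
  printf ("%ld\n", static_iteration_estimate (nest, 3));  /* prints 108 */
  return 0;
}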

Thanks & Regards
Ajit

Thanks & Regards
Ajit

Richard.

>Thanks & Regards
>Ajit
>
>Richard.
>
>
>> Thanks & Regards
>> Ajit




[RFC]: Vectorization cost benefit changes.

2015-08-20 Thread Ajit Kumar Agarwal
All:

I have made the vectorization cost changes given below. I consider only the cost
associated with the inside of the loop instead of the outside, on the assumption
that the inside scalar and vector costs matter more for profitability than the
outside costs.

 min_profitable_iters = ((scalar_single_iter_cost
- vec_inside_cost) *vf);

The scalar_single_iter_cost incorporates the hardcoded value 50 used by most
targets: the scalar cost is multiplied by 50. The vector cost is subtracted from
this scalar cost, and as the scalar cost increases, the chance of vectorization
rises for the same vectorization factor, so more loops will be vectorized.

I have not changed the iteration count, which is hardcoded to 50; I will make the
change to replace the 50 with a static estimate of the iteration count if you
agree with the changes below.

I have run the SPEC CPU 2000 benchmarks with the changes below for i386 targets,
and significant gains are achieved on both the INT and FP benchmarks.

Here is the data.

Ratio with the vectorization cost changes vs. without (FP benchmarks): 4640.102 vs. 4583.379.
Ratio with the vectorization cost changes vs. without (INT benchmarks): 3812.883 vs. 3778.558.

Please give your feedback on the below changes for vectorization cost benefit.

diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 422b883..35d538f 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -2987,11 +2987,8 @@ vect_estimate_min_profitable_iters (loop_vec_info 
loop_vinfo,
 min_profitable_iters = 1;
   else
 {
-  min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) * vf
- - vec_inside_cost * peel_iters_prologue
-  - vec_inside_cost * peel_iters_epilogue)
- / ((scalar_single_iter_cost * vf)
-- vec_inside_cost);
+  min_profitable_iters = ((scalar_single_iter_cost
+- vec_inside_cost) *vf);

   if ((scalar_single_iter_cost * vf * min_profitable_iters)
   <= (((int) vec_inside_cost * min_profitable_iters)
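
Purely for illustration, with made-up cost numbers, the old and new expressions
can be evaluated side by side; this only shows what each formula computes, not a
claim about real target costs.

#include <stdio.h>

int
main (void)
{
  int vec_outside_cost = 30, scalar_outside_cost = 10;
  int vec_inside_cost = 20, scalar_single_iter_cost = 28;
  int peel_iters_prologue = 1, peel_iters_epilogue = 1, vf = 4;

  /* Existing expression from vect_estimate_min_profitable_iters.  */
  int old_min = ((vec_outside_cost - scalar_outside_cost) * vf
                 - vec_inside_cost * peel_iters_prologue
                 - vec_inside_cost * peel_iters_epilogue)
                / ((scalar_single_iter_cost * vf) - vec_inside_cost);

  /* Proposed expression from the patch above.  */
  int new_min = (scalar_single_iter_cost - vec_inside_cost) * vf;

  printf ("old = %d, new = %d\n", old_min, new_min);  /* old = 0, new = 32 */
  return 0;
}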

Thanks & Regards
Ajit




RE: [RFC]: Vectorization cost benefit changes.

2015-08-21 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Friday, August 21, 2015 2:03 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; GCC Patches; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC]: Vectorization cost benefit changes.

On Fri, Aug 21, 2015 at 7:18 AM, Ajit Kumar Agarwal 
 wrote:
> All:
>
> I have done the vectorization cost changes as given below. I have considered 
> only the cost associated with the inner instead of outside.
> The consideration of inside scalar and vector cost is done as the inner cost 
> are the most cost effective than the outside cost.

>>I think you are confused about what the variables cost are associated to.  
>>You are changing a place that computes also the cost for 
>>non-outer-loop->>vectorization so your patch is clearly not applicable.

>>vec_outside_cost is the cost of setting up invariants for example.
>>All costs apply to the "outer" loop - if there is a nested loop inside that 
>>loop its costs are folded into the "outer" loop cost already at this stage.

>>So I think your analysis is simply wrong and thus your patch.

>>You need to find another place to fix inner loop cost.

Thanks for your valuable suggestions and feedback. I will certainly look into 
it.

Thanks & Regards
Ajit
Richard.

>  min_profitable_iters = ((scalar_single_iter_cost
> - vec_inside_cost) *vf);
>
> The Scalar_single_iter_cost consider the hardcoded value 50 which is 
> used for most of the targets and the scalar cost is multiplied With 
> 50. This scalar cost is subtracted with vector cost and as the scalar cost is 
> increased the chances of vectorization is more with same Vectorization factor 
>  and more loops will be vectorized.
>
> I have not changed the iteration count which is hardcoded with 50 and 
> I will do the changes to replace the 50 with the static Estimates of 
> iteration count if you agree upon the below changes.
>
> I have ran the SPEC cpu 2000 benchmarks with the below changes for 
> i386 targets and the significant gains are achieved with respect To INT and 
> FP benchmarks.
>
> Here is the data.
>
> Ratio of vectorization cost changes(FP benchmarks) vs Ratio of without 
> vectorization cost changes( FP benchmarks)  = 4640.102 vs 4583.379.
> Ratio of vectorization cost changes (INT benchmarks ) vs Ratio of 
> without vectorization cost changes( INT benchmarks0 = 3812.883 vs 
> 3778.558
>
> Please give your feedback on the below changes for vectorization cost benefit.
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 
> 422b883..35d538f 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -2987,11 +2987,8 @@ vect_estimate_min_profitable_iters (loop_vec_info 
> loop_vinfo,
>  min_profitable_iters = 1;
>else
>  {
> -  min_profitable_iters = ((vec_outside_cost - scalar_outside_cost) * 
> vf
> - - vec_inside_cost * peel_iters_prologue
> -  - vec_inside_cost * peel_iters_epilogue)
> - / ((scalar_single_iter_cost * vf)
> -- vec_inside_cost);
> +  min_profitable_iters = ((scalar_single_iter_cost
> +- vec_inside_cost) *vf);
>
>if ((scalar_single_iter_cost * vf * min_profitable_iters)
><= (((int) vec_inside_cost * min_profitable_iters)
>
> Thanks & Regards
> Ajit


Awareness of register pressure on strength reduction of induction variables.

2015-09-01 Thread Ajit Kumar Agarwal
All;

Global code motion is an important optimization that has an impact on register
spills and fetches, and it therefore takes the increase or decrease of register
pressure into account.

Strength reduction is likewise an important optimization with an impact on
register pressure, yet it does not take register pressure into account,
especially the strength reduction of induction variables inside loops. Loops are
the bottleneck for register pressure, and any spill inside a loop caused by
strength reduction of induction variables degrades the loop's performance.

Thus strength reduction in general, and the strength reduction of induction
variables in particular, should take register pressure into account based on live
range info. A small before/after example is sketched below.

I don't think we currently take register pressure into account for the strength
reduction optimization in general, nor for induction variable strength reduction
in particular.
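
A hand-written before/after example of the classical induction variable strength
reduction in question (illustrative source code, not a GCC dump):

/* Before: i * 4 is recomputed from i on every iteration.  */
void
before (int *a, int n)
{
  for (int i = 0; i < n; i++)
    a[i * 4] = 0;
}

/* After: the multiply is reduced to an addition on a new temporary t.
   Note that t is one extra live range across the whole loop body,
   which is why the transformation should consult register pressure
   before creating it.  */
void
after (int *a, int n)
{
  int t = 0;
  for (int i = 0; i < n; i++)
    {
      a[t] = 0;
      t += 4;
    }
}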

Thanks & Regards
Ajit


Commoning the control and Data Dependence

2015-09-01 Thread Ajit Kumar Agarwal
All:

The data dependency graph augmented with control dependence can be commoned out
based on dominator info. Suppose instruction I1 dominates all its uses, say
instructions I2 and I3; then I2 and I3 depend on I1. Thus a graph can be formed
over all the instructions whose edges represent the dominator info: there is an
edge I1 -> I2 and I1 -> I3 if I1 dominates I2 and I3.

Such a representation commons out the data and control dependence: a single
tree-based graph built on dominator info serves as the common data structure,
covering both the data dependence and the control dependence of the instructions.

Could this representation replace the DDG augmented with control dependence that
the loop distribution pass uses to form the partitions for distributing loops? A
small sketch of building the graph is given below.
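
A minimal sketch, with a toy hard-coded immediate-dominator array in place of real
IR, of building the graph described above: an edge I1 -> I2 is added whenever I1
dominates I2.

#include <stdio.h>

#define N 4

/* idom[i] is the immediate dominator of instruction i; -1 for the root. */
static const int idom[N] = { -1, 0, 1, 1 };

/* I1 dominates I2 if I1 is on I2's immediate-dominator chain. */
static int
dominates (int i1, int i2)
{
  for (int d = idom[i2]; d != -1; d = idom[d])
    if (d == i1)
      return 1;
  return 0;
}

int
main (void)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (dominates (i, j))
        printf ("edge I%d -> I%d\n", i, j);
  return 0;
}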

Thanks & Regards
Ajit





Live range Analysis based on tree representations

2015-09-01 Thread Ajit Kumar Agarwal
All:

Live range info on the tree SSA representation is an important step towards
SSA-based code motion optimizations, since code motion on the SSA representation
affects register pressure, which is a cause of performance bottlenecks.

I am proposing a live range analysis based on the SSA representation. The analysis
traverses the dominator tree: SSA and phi variables are placed based on dominance
frontier info, so the SSA representation reflects the dominance info, and from
that dominance info a live range overlap analysis can be derived.

Variable V can intersect W only if Vdef dominates Wdef (or vice versa); both live
ranges intersect at a point p if Vdef dominates p and Wdef dominates p. If Vdef
dominates Wdef and Wdef dominates Udef, then Vdef dominates Udef; so if the live
range of V intersects W at Wdef and the live range of W intersects U at Udef, the
live range of V can intersect U. Such dominance info can represent the overlapping
live range analysis, and register pressure is derived from the overlapping live
ranges based on the dominator info inherited from the SSA representation. Since
the SSA form is built from dominance frontiers, a traversal of the dominator tree
over the SSA form can derive the overlapping live ranges; a sketch follows below.

This overlapping live range info can be used to derive register pressure, and
optimizations on the tree representation can use it to take register pressure
into account.
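
A self-contained toy sketch of the dominance-based overlap test described above.
All structures here are hypothetical, and the liveness query is a crude stand-in
for real liveness data.

#include <stdbool.h>
#include <stdio.h>

/* Toy program points 0..3; idom gives each point's immediate dominator. */
static const int idom[4] = { -1, 0, 1, 1 };

static bool
dominates (int a, int b)
{
  for (int d = idom[b]; d != -1; d = idom[d])
    if (d == a)
      return true;
  return false;
}

struct ssa_name { int def_point; int last_use; };

/* Crude liveness stand-in: v is live at p if p lies between v's def
   and last use on the dominator chain. */
static bool
live_at (const struct ssa_name *v, int p)
{
  return dominates (v->def_point, p)
         && (p == v->last_use || dominates (p, v->last_use));
}

/* In SSA form, two ranges can only overlap if one def dominates the
   other, so the overlap test is a dominance query plus a liveness
   query at the dominated def. */
static bool
ranges_overlap (const struct ssa_name *v, const struct ssa_name *w)
{
  if (dominates (v->def_point, w->def_point))
    return live_at (v, w->def_point);
  if (dominates (w->def_point, v->def_point))
    return live_at (w, v->def_point);
  return false;
}

int
main (void)
{
  struct ssa_name v = { 0, 3 }, w = { 1, 2 };
  printf ("%d\n", ranges_overlap (&v, &w));  /* prints 1 */
  return 0;
}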

Thanks & Regards
Ajit


RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

2015-09-02 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit 
Kumar Agarwal
Sent: Wednesday, August 19, 2015 2:53 PM
To: Richard Biener
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST



-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit 
Kumar Agarwal
Sent: Monday, August 17, 2015 4:03 PM
To: Richard Biener
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST



-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Friday, August 14, 2015 9:59 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
Hunsigida; Nagaraju Mekala
Subject: RE: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST

On August 14, 2015 5:03:58 PM GMT+02:00, Ajit Kumar Agarwal 
 wrote:
>
>
>-Original Message-
>From: Richard Biener [mailto:richard.guent...@gmail.com]
>Sent: Monday, August 03, 2015 2:59 PM
>To: Ajit Kumar Agarwal
>Cc: Jeff Law; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
>Vidhumouli Hunsigida; Nagaraju Mekala
>Subject: Re: vectorization cost macro TARGET_VECTORIZE_ADD_STMT_COST
>
>On Sun, Aug 2, 2015 at 4:13 PM, Ajit Kumar Agarwal 
> wrote:
>> All:
>>
>> The definition of the following macro that determine the statement
>cost that adds to vectorization cost.
>>
>> #define TARGET_VECTORIZE_ADD_STMT_COST.
>>
>> In the implementation of the above macro the following is done for
>many vectorization supported architectures like i386, ARM.
>>
>> if (where == vect_body && stmt_info && stmt_in_inner_loop_p
>(stmt_info))
>> count *= 50;  /* FIXME.  */
>>
>> I have the  following questions.
>>
>> 1. Why the multiplication factor of 50 is choosen?
>
>>>It's a wild guess.  See
>tree-vect-loop.c:vect_get_single_scalar_iteration_cost.
>
>> 2. The comment mentions that the inner loop relative to the loop
>being
>> vectorized is added more weight. If more weight is added to the inner
>
>> loop for the loop being vectorized, the chances of vectorizing the
>inner loop decreases. Why the inner loop cost is increased with 
>relative to the loop being vectorized?
>
>>>In fact adding more weight to the inner loop increases the chance of
>vectorizing it (if vectorizing the inner loop is profitable).
>>>Both scalar and vector cost get biased by a factor of 50 (we assume
>50 iterations of the inner loop for one iteration of the outer loop), 
>so a non-profitable >>vectorization in the outer loop can be offsetted 
>by profitable inner loop vectorization.
>
>>>Yes, '50' can be improved if we actually know the iteration count of
>the inner loop or if we have profile-feedback.
>
>Instead of vector and scalar cost get biased by a factor of 50, Can the 
>benefit of vectorization calculated as follows
>
>Benefit = scalar cost - vector cost/VF; Cost = 0; For ( I = 1; I < N;
>i++) {
>Cost = cost + (final_value - Initial value)/steps; }
>
>Benefit = Benefit * cost;
>
>Where
>N = No. of levels of the loop;
>Final_value = Final iteration count of the loop.
>Initial_value = Initial Iteration count of the loop.
>Steps = steps of the iteration for the loop.
>VF = vectorization factor.
>
>Thus increase in the Levels of the loops increases  the benefit of 
>vectorization. Also if the scalar cost is more than the vectorization 
>cost then the Scalar cost - vector cost /VF increases with the same 
>vectorization Factor thus increasing the benefit of vectorization.

>>Sure.  But the number of iterations may only be available symbolically, thus 
>>the cost be only useful for the dynamic check at runtime.  A better static 
>>>>estimate would also be useful.

>>Thanks. For the cases the loop bound can be known at the compile time, 
>>through Value Range Analysis. Already GCC uses the value range  
>>>>Information/Analysis To calculate the Loop bound. We can use the same loop 
>>bound info to get the static estimate on the number of iterations.

 >> Based on the above estimates, the above cost calculation as I have 
 >> mentioned can be used for Vectorization cost Analysis.

>>On top of the above, the vectorizer cannot vectorize the loops if the trip 
>>count or iteration count is not known. In order to have the number Of 
>>iterations for >>vectorizer cost calculation, it's always  true the trip 

RE: Live range Analysis based on tree representations

2015-09-03 Thread Ajit Kumar Agarwal


-Original Message-
From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com] 
Sent: Wednesday, September 02, 2015 8:23 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live range Analysis based on tree representations

On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> All:
> 
> The Live ranges info on tree SSA representation is important step towards the 
> SSA based code motion optimizations.
> As the code motion optimization based on the SSA representation 
> effects the register pressure and reasons for performance Bottleneck.
> 
> I am proposing the Live range Analysis based on the SSA 
> representation. The Live range analysis traverses the dominator Tree. 
> The SSA and phi variables are represented based on dominance frontier info 
> and the SSA representation reflects The dominance info. Based on such 
> dominance info Live range Overlapping Analysis can be derived.
> 
> Variable V intersects W if Vdef dominates the Wdef. The variable v 
> intersects at point p if Vdef dominates P and Wdef Dominates the P. If 
> Vdef dominates Wdef and Wdef dominates Udef , then the Vdef dominates 
> Udef and thus Live range Of V intersect W and live range W intersect 
> U, thus the live range V intersects the U. Such dominance info can be 
> used to Represent the Overlapping Live range Analysis and the register 
> pressure is derived from Overlapping Live ranges based On the dominator info 
> inherited from the SSA representation. The SSA representation is derived 
> based on dominance Frontier and the traversal of dominator tree based on SSA 
> can derive the Overlapping Live ranges.
> 
> The above Overlapping Live range info can be used to derive the 
> register pressure and the optimization based out of tree Representation can 
> use the above overlapping live ranges to take register pressure into account.

>>Ajit,
  >>I did a prototype of this kind of analysis at one point last year to see if 
it could help improve inlining decisions in LTO. Basically I did >>exactly what 
you suggest and computed the number of overlapping SSA live ranges and used 
that as a proxy for register pressure. It >>did appear to be able to help in 
some circumstances but the real solution is to improve register allocation so 
it doesn't fall down when >>register pressure gets high.

Aaron:
Would you mind explaining in what circumstances it helps and when it doesn't?
Live ranges on the SSA representation form chordal graphs, which may have
different colorability requirements than what the real register allocator sees.
So this may not give the exact register pressure the register allocator will
face, since the allocator sits much further down the optimization and code
generation pipeline, but it forms the basis for SSA-level optimizations that
affect register pressure.
 
>>The code is in a branch called lto-pressure.

Thanks. I would like to see the code.

Thanks & Regards
Ajit

  >>Aaron

> 
> Thanks & Regards
> Ajit
> 

--
Aaron Sawdey, Ph.D.  acsaw...@linux.vnet.ibm.com
050-2/C113  (507) 253-7520 home: 507/263-0782 IBM Linux Technology Center - PPC 
Toolchain



Cost and Benefit Allocation for moving the expression above the conditional.

2015-09-06 Thread Ajit Kumar Agarwal
All:

The cost and benefit associated with moving a given expression above a
conditional are the important factors for the performance gain.

With that in mind, the cost and benefit can be calculated as follows.

For a given conditional entry point 'n', the benefit paths 'p' are the sub-paths
that pass through 'n' on which the expression is available and anticipatable at
the conditional entry n.

For a given conditional entry point 'n', the cost paths 'p' are the sub-paths
that pass through 'n' on which the expression is unavailable and un-anticipatable
at the conditional entry n.

Thus Benefit = summation of the frequencies of all the benefit paths defined above,
and Cost = summation of the frequencies of all the cost paths defined above.

Moving the expression above the conditional is thus enabled when Benefit > Cost.

The derivation above can be extended to Loop Invariant code motion.

The benefit and cost derivation above can be extended as follows, based on the
number of defs and uses:

Benefit = sum over all benefit paths of [ frequency
          + (number of defs * latency of a store instruction)
          + (number of uses * latency of a load instruction)
          - (number of moves * latency of a move instruction) ]

Cost    = sum over all cost paths of [ frequency
          + (number of defs * latency of a store instruction)
          + (number of uses * latency of a load instruction)
          - (number of moves * latency of a move instruction) ]
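
A small stand-alone sketch of this accounting. All frequencies and latencies are
made up, and the enumeration of the sub-paths themselves is assumed to be done
elsewhere.

#include <stdio.h>

struct subpath {
  double freq;                 /* execution frequency of the sub-path */
  int ndefs, nuses, nmoves;
};

static double
path_score (const struct subpath *p, int n,
            double store_lat, double load_lat, double move_lat)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += p[i].freq
         + p[i].ndefs * store_lat
         + p[i].nuses * load_lat
         - p[i].nmoves * move_lat;
  return s;
}

int
main (void)
{
  struct subpath benefit_paths[1] = { { 0.9, 1, 2, 1 } };
  struct subpath cost_paths[1]    = { { 0.1, 1, 1, 1 } };
  double benefit = path_score (benefit_paths, 1, 2.0, 3.0, 1.0);
  double cost    = path_score (cost_paths, 1, 2.0, 3.0, 1.0);
  printf ("move above the conditional: %s\n", benefit > cost ? "yes" : "no");
  return 0;
}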

The expression is moved above the conditional, and the same extends to LICM, when
the benefit is greater than the cost.

The above is feasible and quite efficient for cost and benefit allocation in
control flow code motion. The cost above is somewhat equivalent to the live range
priority cost used when selecting live ranges for coloring by priority.

Thanks & Regards
Ajit


Inlining Decision Priority Function.

2015-09-12 Thread Ajit Kumar Agarwal
All:

Inlining decisions that avoid setting up the callee's stack frame, by including
the callee in the caller's context, increase performance.

A priority function for inlining decisions can be calculated from the following:

1. Nesting level (depth) of the callee in the call graph.
2. Code size of the callee.
3. Code size of the largest function in the program.

Priority(c) = (level(c) + 1) * largestsize - size(c)

The priority is thus highest for a callee with small code size that sits deepest
among the call sites of the call graph.

The deeper the callee's level, the greater the priority; the smaller its code
size, the higher the priority.

The priority function also uses the largest function size in the program as the
scaling factor for the level term.

Priority(c) = priority of the callee.
Level(c) = depth of the callee.
Size(c) = code size of the callee.
Largestsize = the largest code size among all the functions in the program.
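
A direct transcription of the priority function, as a sketch:

#include <stdio.h>

static long
inline_priority (int level, long size, long largest_size)
{
  return (level + 1) * largest_size - size;
}

int
main (void)
{
  /* A deep, small callee outranks a shallow, large one. */
  printf ("%ld\n", inline_priority (3, 40, 1000));   /* 3960 */
  printf ("%ld\n", inline_priority (1, 900, 1000));  /* 1100 */
  return 0;
}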

The above heuristic should be considered for our inlining decisions. Let me know
if we already use a similar heuristic.

Thanks & Regards
Ajit


Selective criteria and Heuristics for Loop Unrolling.

2015-09-12 Thread Ajit Kumar Agarwal
All:

The decision on the unrolling factor is an important part of the loop unrolling
optimization.

Basing the unrolling factor on the criteria below improves the performance of the
unrolled loops.

1. Number of operations.
2. Number of operands.
3. Number of floating point operations.
4. Loop nest Level.
5. Number of branches.
6. Number of memory operations.
7. Live range size.
8. Critical Path Length.
9. Known Trip count.
10. Number of parallel computations in the loop.
11. The Max dependence height of the computation.
12. Main memory to memory loop carried dependence.

An unrolling factor heuristic built from the above criteria improves the
performance of the unrolled loop; one hypothetical way to combine a few of the
criteria is sketched below.
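
An entirely hypothetical sketch of such a heuristic: a weighted score, clamped and
snapped to a factor dividing the known trip count. The weights are illustrative,
not tuned, and only a few of the twelve criteria are shown.

#include <stdio.h>

struct loop_features {
  int n_ops, n_fp_ops, n_branches, n_mem_ops;
  int live_range_size, critical_path, trip_count;
};

static int
choose_unroll_factor (const struct loop_features *f, int max_factor)
{
  int score = 8;
  score -= f->n_branches;          /* control flow hurts unrolling   */
  score -= f->live_range_size / 4; /* register pressure proxy        */
  score += f->n_fp_ops / 2;        /* latency hiding opportunity     */
  if (score < 1)
    score = 1;
  if (score > max_factor)
    score = max_factor;
  while (f->trip_count % score)    /* prefer a factor dividing the   */
    score--;                       /* known trip count               */
  return score;
}

int
main (void)
{
  struct loop_features f = { 12, 4, 1, 3, 8, 6, 64 };
  printf ("unroll by %d\n", choose_unroll_factor (&f, 8));
  return 0;
}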

Thanks & Regards
Ajit


RE: Live range Analysis based on tree representations

2015-09-12 Thread Ajit Kumar Agarwal


-Original Message-
From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com] 
Sent: Friday, September 04, 2015 11:51 PM
To: Ajit Kumar Agarwal
Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; Vinod 
Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live range Analysis based on tree representations

On Thu, 2015-09-03 at 15:22 +, Ajit Kumar Agarwal wrote:
> 
> 
> -Original Message-
> From: Aaron Sawdey [mailto:acsaw...@linux.vnet.ibm.com]
> Sent: Wednesday, September 02, 2015 8:23 PM
> To: Ajit Kumar Agarwal
> Cc: Jeff Law; vmaka...@redhat.com; Richard Biener; gcc@gcc.gnu.org; 
> Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju 
> Mekala
> Subject: Re: Live range Analysis based on tree representations
> 
> On Tue, 2015-09-01 at 17:56 +, Ajit Kumar Agarwal wrote:
> > All:
> > 
> > The Live ranges info on tree SSA representation is important step towards 
> > the SSA based code motion optimizations.
> > As the code motion optimization based on the SSA representation 
> > effects the register pressure and reasons for performance Bottleneck.
> > 
> > I am proposing the Live range Analysis based on the SSA 
> > representation. The Live range analysis traverses the dominator Tree.
> > The SSA and phi variables are represented based on dominance frontier info 
> > and the SSA representation reflects The dominance info. Based on such 
> > dominance info Live range Overlapping Analysis can be derived.
> > 
> > Variable V intersects W if Vdef dominates the Wdef. The variable v 
> > intersects at point p if Vdef dominates P and Wdef Dominates the P. 
> > If Vdef dominates Wdef and Wdef dominates Udef , then the Vdef 
> > dominates Udef and thus Live range Of V intersect W and live range W 
> > intersect U, thus the live range V intersects the U. Such dominance 
> > info can be used to Represent the Overlapping Live range Analysis and the 
> > register pressure is derived from Overlapping Live ranges based On the 
> > dominator info inherited from the SSA representation. The SSA 
> > representation is derived based on dominance Frontier and the traversal of 
> > dominator tree based on SSA can derive the Overlapping Live ranges.
> > 
> > The above Overlapping Live range info can be used to derive the 
> > register pressure and the optimization based out of tree Representation can 
> > use the above overlapping live ranges to take register pressure into 
> > account.
> 
> >>Ajit,
>   >>I did a prototype of this kind of analysis at one point last year to see 
> if it could help improve inlining decisions in LTO. Basically I did >>exactly 
> what you suggest and computed the number of overlapping SSA live ranges and 
> used that as a proxy for register pressure. It >>did appear to be able to 
> help in some circumstances but the real solution is to improve register 
> allocation so it doesn't fall down when >>register pressure gets high.
> 
> Aaron:
> Would you mind in explaining on what circumstances it helps and when 
> it won't.  The Live ranges on SSA representation forms chordal Graphs 
> that might have the different colorability requirements than the real 
> register allocator. This may not give exact register pressure compared 
> to  register allocator as register allocator is  further down the 
> optimization and the code generation pipeline  but forms the basis of 
> optimization based on SSA that effects the register pressure.
>  
> >>The code is in a branch called lto-pressure.
> 
> Thanks. I would like to see the code.

Ajit,
>>  The branch is here: svn://gcc.gnu.org/svn/gcc/branches/lto-pressure
>>The analysis is in ipa-inline.c; if you search for "pressure" you'll find the 
>>code.

>>The live ranges in SSA are certainly different than what the register 
>>allocator is going to see, but it's what we have to work with at the point 
>>where the >>inlining decisions are made, which is why I was looking at it. My 
>>hope was that it would be a reasonable proxy for downstream register 
>>pressure. 

>>I went and did this after seeing a particular situation in bzip2 where a 
>>bunch of inlining done by LTO sent register pressure up a lot and resulted in 
>>a >>measureable increase in loads/stores due to extra spill code. Functions 
>>that are called in only one place and not externally visible will be inlined 
>>regardless of >>their size. There is one function in bzip2 that has 
>>particularly complex control and inlining this into another non-trivial 
>>function caused all the excess spi

Replacing malloc with alloca.

2015-09-13 Thread Ajit Kumar Agarwal
All:

Replacing malloc with alloca can be done based on the following analysis.

If the lifetime of an object does not stretch beyond its immediate scope, the
malloc can be replaced with alloca. This improves performance to a great extent.

Inlining helps here to a great extent: after inlining, it is easier to prove that
an object's lifetime does not stretch beyond its immediate scope, and the
opportunities for replacing malloc with alloca can be identified.
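
A hand-written before/after example of the transformation. Note that alloca is
only safe when the size is known to be small and bounded; error handling is
omitted for brevity.

#include <alloca.h>
#include <stdlib.h>
#include <string.h>

/* Before: heap allocation, even though tmp dies at return.  */
int
sum_before (const int *src, int n)
{
  int *tmp = malloc (n * sizeof *tmp);
  int s = 0;
  memcpy (tmp, src, n * sizeof *tmp);
  for (int i = 0; i < n; i++)
    s += tmp[i];
  free (tmp);
  return s;
}

/* After: since tmp provably does not escape sum_after, the allocation
   can live on the stack and is freed implicitly at return.  */
int
sum_after (const int *src, int n)
{
  int *tmp = alloca (n * sizeof *tmp);
  int s = 0;
  memcpy (tmp, src, n * sizeof *tmp);
  for (int i = 0; i < n; i++)
    s += tmp[i];
  return s;
}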

I am wondering in which phase of our optimization pipeline malloc is replaced
with alloca, and what analysis is done to drive the transformation, which greatly
improves benchmark performance. Is the analysis done through escape analysis?

If yes, what data structure is used for the abstract interpretation?

Thanks & Regards
Ajit


Live Range Splitting in Integrated Register Allocator

2014-05-14 Thread Ajit Kumar Agarwal
Sorry for resending; my earlier mail was sent with HTML enabled, and resending as
plain text makes it possible to deliver it to gcc@gcc.gnu.org.

Sorry once again.

Thanks & Regards
Ajit

From: Ajit Kumar Agarwal 
Sent: Wednesday, May 14, 2014 10:43 PM
To: 'gcc@gcc.gnu.org'; 'vmaka...@redhat.com'
Cc: 'Michael Eager'; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: Live Range Splitting in Integrated Register Allocator

Adding the gcc@gcc.gnu.org mailing list.

From: Ajit Kumar Agarwal 
Sent: Wednesday, May 14, 2014 10:33 PM
To: 'gcc@gcc.gnu.org'; 'vmaka...@redhat.com'
Cc: 'Michael Eager'; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Live Range Splitting in Integrated Register Allocator


Hello All:

I am planning to implement live range splitting in the Integrated Register
Allocator, based on the following cases.

Consider a live range that spans from the outer region into the inner region of a
loop: it is live-in at the loop header and live-out at the loop exit, but has no
references inside the loop. Such live ranges lead to suboptimal spills and
fetches inside the loop by conflicting with the shorter live ranges local to the
loop.

Let us call such a live range L1. L1 can be split at the loop boundary by storing
at the loop header and loading at the loop exit, as illustrated just below. This
makes L1 conflict less with the live ranges local to the loop region, reducing
the spills and fetches inside the loop.
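
A source-level illustration (not actual IRA output) of the proposed split;
spill_slot stands in for a stack slot:

int
example (int v, int *a, int n)
{
  int spill_slot;            /* stand-in for a stack slot */

  spill_slot = v;            /* store at the loop header  */
  for (int i = 0; i < n; i++)
    a[i] = a[i] * 2 + i;     /* body never references v   */
  v = spill_slot;            /* load at the loop exit     */

  return v + a[0];
}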

From the code and documentation of the Integrated Register Allocator, my
understanding is as follows.

Live range L1 is live in the outer region but has no references inside the loop
region. The variable v behind L1 is assigned two allocnos, v1 and v2: v1 is the
allocno for the outer region and v2 the allocno for the inner loop region. This
allows information from the inner loop region to be accumulated into the outer
region.

Will the current Integrated Register Allocator consider the live range L1 as live
both inside the loop and in the outer region? If yes, it will conflict with the
live ranges local to the loop region, leading to spills and fetches inside the
loop. If the two allocnos are created, v1 for the outer region and v2 for the
inner region, then v2 will conflict with the local live ranges inside the loop
region and v1 with the live ranges of the outer region. Is this how live range
splitting at the loop boundary is handled for a live range that spans the loop
but is not referenced in it?

If such cases are not currently handled in the Integrated Register Allocator, it
would be useful to implement them in IRA; the MicroBlaze target would benefit.

Please let me know what you think.

Thanks & Regards
Ajit


RE: Live Range Splitting in Integrated Register Allocator

2014-05-15 Thread Ajit Kumar Agarwal


On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote:

>
> Hello All:
>
> I am planning to implement the Live range splitting based on the following 
> cases in the Integrated Register Allocator.
>
> For a given Live range that spans from  from outer region to inner region of 
> the loop. Such Live ranges which are LiveIn at the entry of the header of the 
> Loop and Live Out at the exit of the loop but there are no references inside 
> the  Loop. Such Live ranges lead to unoptimal spill and fetch inside the Loop 
> conflicting with the shorter live ranges that spans inside the Loop.
>
> Lets say such Live range as L1. L1 can be splitted at the Loop Boundary 
> splitting the Live range by making a store at the header of the Loop and the 
> Load at the exit of the Loop. This makes the Live range less conflicting with 
> the Live ranges that are local to the Loop regions reducing the spill and 
> Fetch inside the Loops.
>
>  From the code and documentation of Integrated Register Allocator following 
> is the understanding.
>
> As Live range L1 is live in the outer region but as there are no reference 
> inside the Loop region. Since the allocno for L1 for a given variable v is 
> assigned two allocno v1 and v2 . V1 being assigned allocno for the outer 
> region and v2 as allocno for the inner Loop region. This allows to accumulate 
> the information from the inner loop region to outer region.
>
> Will the current Integrated Register Allocator will consider the Live range 
> L1 as Live inside the Loop and outer region? If Yes then there will be 
> conflicting with the Live ranges that are local to the Loop region leading to 
> spill and fetch inside the Loop.  If the v1 and v2 allocno are created v1 for 
> the outer region and v2 for the inner region then there will v2 will be 
> conflicting the local live ranges inside the Loop region and v1 will be 
> conflicting with the Live ranges of the outer regions.  This is how its been 
> considered as Live range splitting at the Loop Boundary for the Live range 
> that spans inside the Loop but not not being referenced?
>
> If Such cases are not being considered in the Integrated Register Allocator, 
> then it will be useful to implement such cases in IRA which will be 
> benefitted the microblaze target.
>
> Please let me know what do you think.
>

>>Allocno v2 corresponding to live range inside the loop has a very small cost 
>>for spilling therefore it will be spilled if we still need registers to 
>>pseudos local to the loop.  If allocno v1 >>corresponding live ranges outside 
>>the loop *and* inside the loop gets a hard register, we will have live range 
>>splitting as you propose.  So I do not see a necessity for the optimization 
>>>>you propose.

If allocno v1 does not get a hard register and becomes a spill candidate, for a
live range that spans the loop but is not referenced inside it, that will lead to
spills inside the loop, which become a performance bottleneck. Is this case also
handled in IRA? With my proposal, the load and store would be placed outside the
loop instead of inside it for such a case.

>>Moreover my experience shows that making a lot of explicit transformations 
>>(e.g. proposed splitting) even if we have transformations to undo them (e.g. 
>>coalescing) results in worse >>code. 
>>The explicit transformations should be as minimal as possible during RA in a 
>>good register allocator.  So I guess the optimization you propose will 
>>actually results in a worse code.  >>Although I might be wrong because it is 
>>always hard to predict the result of heuristic optimizations.

There is a paper on improved passive splitting that handles splitting at loop
boundaries for live ranges that span a loop but are not referenced in it:

Improved Passive Splitting (2005) by Keith D. Cooper and Jason Eckhardt. The
paper's heuristics are claimed to reduce the spills and fetches for the case I
propose, and to generate better code.

>>What is really missed in RA, it is a good splitting on BB boundaries. I have 
>>plans to try this as a part of more common pass to decrease register pressure 
>>on which I'll start work soon.

>>In any way thanks for the proposal.  You are free try it to confirm or reject 
>>my prediction.  Unfortunately, that is the only way to be sure about the 
>>result.

Thanks & Regards
Ajit



RE: Live Range Splitting in Integrated Register Allocator

2014-05-15 Thread Ajit Kumar Agarwal
Thanks Vladimir for the clarification.

Thanks & Regards
Ajit

-Original Message-
From: Vladimir Makarov [mailto:vmaka...@redhat.com] 
Sent: Thursday, May 15, 2014 8:39 PM
To: Ajit Kumar Agarwal; gcc@gcc.gnu.org
Cc: Michael Eager; Vinod Kathail; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Live Range Splitting in Integrated Register Allocator

On 05/15/2014 03:28 AM, Ajit Kumar Agarwal wrote:
>
> On 2014-05-14, 1:33 PM, Ajit Kumar Agarwal wrote:
>
>> Hello All:
>>
>> I am planning to implement the Live range splitting based on the following 
>> cases in the Integrated Register Allocator.
>>
>> For a given Live range that spans from  from outer region to inner region of 
>> the loop. Such Live ranges which are LiveIn at the entry of the header of 
>> the Loop and Live Out at the exit of the loop but there are no references 
>> inside the  Loop. Such Live ranges lead to unoptimal spill and fetch inside 
>> the Loop conflicting with the shorter live ranges that spans inside the Loop.
>>
>> Lets say such Live range as L1. L1 can be splitted at the Loop Boundary 
>> splitting the Live range by making a store at the header of the Loop and the 
>> Load at the exit of the Loop. This makes the Live range less conflicting 
>> with the Live ranges that are local to the Loop regions reducing the spill 
>> and Fetch inside the Loops.
>>
>>  From the code and documentation of Integrated Register Allocator following 
>> is the understanding.
>>
>> As Live range L1 is live in the outer region but as there are no reference 
>> inside the Loop region. Since the allocno for L1 for a given variable v is 
>> assigned two allocno v1 and v2 . V1 being assigned allocno for the outer 
>> region and v2 as allocno for the inner Loop region. This allows to 
>> accumulate the information from the inner loop region to outer region.
>>
>> Will the current Integrated Register Allocator will consider the Live range 
>> L1 as Live inside the Loop and outer region? If Yes then there will be 
>> conflicting with the Live ranges that are local to the Loop region leading 
>> to spill and fetch inside the Loop.  If the v1 and v2 allocno are created v1 
>> for the outer region and v2 for the inner region then there will v2 will be 
>> conflicting the local live ranges inside the Loop region and v1 will be 
>> conflicting with the Live ranges of the outer regions.  This is how its been 
>> considered as Live range splitting at the Loop Boundary for the Live range 
>> that spans inside the Loop but not not being referenced?
>>
>> If Such cases are not being considered in the Integrated Register Allocator, 
>> then it will be useful to implement such cases in IRA which will be 
>> benefitted the microblaze target.
>>
>> Please let me know what do you think.
>>
>>> Allocno v2 corresponding to live range inside the loop has a very small 
>>> cost for spilling therefore it will be spilled if we still need registers 
>>> to pseudos local to the loop.  If allocno v1 >>corresponding live ranges 
>>> outside the loop *and* inside the loop gets a hard register, we will have 
>>> live range splitting as you propose.  So I do not see a necessity for the 
>>> optimization >>you propose.
> If the allocno v1 does n't get the hard register and become the candidate of 
> spill which is Live that spans the Loop but not reference inside the Loop, it 
> will lead to spills inside the loop which become bottleneck as performance is 
> concerned. Does this case also handled in the IRA? With my proposal the load 
> and store will be moved outside the loop instead of having it inside the Loop 
> for such case.
>
If v1 does not get a hard register (and v2 most probably too), than there will 
be no any additional insns because v1 and v2 most probably gets the same 
memory.  If v1 gets a hard register and v2 does not, then we have store on the 
loop enter and loads on the loop exits (the store/loads are not inside the 
loop). 
>>> Moreover my experience shows that making a lot of explicit transformations 
>>> (e.g. proposed splitting) even if we have transformations to undo them 
>>> (e.g. coalescing) results in worse >>code. 
>>> The explicit transformations should be as minimal as possible during RA in 
>>> a good register allocator.  So I guess the optimization you propose will 
>>> actually results in a worse code.  >>Although I might be wrong because it 
>>> is always hard to predict the result of heuristic optimizations.
> There is a paper on Improved Passive Splitting which does take care of 
> Splitting at Loop boundari

RE: negative latencies

2014-05-19 Thread Ajit Kumar Agarwal

Is this the code speculation case, where negative latencies are used?

Thanks & Regards
Ajit
-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of shmeel 
gutl
Sent: Monday, May 19, 2014 12:23 PM
To: Andrew Pinski
Cc: gcc@gcc.gnu.org; Vladimir Makarov
Subject: Re: negative latencies

On 19-May-14 09:39 AM, Andrew Pinski wrote:
> On Sun, May 18, 2014 at 11:13 PM, shmeel gutl 
>  wrote:
>> Are there hooks in gcc to deal with negative latencies? In other 
>> words, an architecture that permits an instruction to use a result 
>> from an instruction that will be issued later.
> Do you mean bypasses?  If so there is a bypass feature which you can use:
> https://gcc.gnu.org/onlinedocs/gccint/Processor-pipeline-description.h
> tml#index-data-bypass-3773
>
> Thanks,
> Andrew Pinski
Unfortunately, bypasses in the pipeline description is not enough.
They only allow you to calculate the latency of true dependencies. They are 
also forced to be zero or greater. The real question is how the scheduler and 
register allocator can deal with negative latencies.

Thanks
Shmeel
>> At first glance it seems that it will will break a few things.
>> 1) The definition of dependencies cannot come from the simple 
>> ordering of rtl.
>> 2) The scheduling problem starts to look like "get off the train 3 
>> stops before me".
>> 3) The definition of live ranges needs to use actual instruction 
>> timing information, not just instruction sequencing.
>>
>> The hooks in the scheduler seem to be enough to stop damage but not 
>> enough to take advantage of this "feature".
>>
>> Thanks
>>
>




Reducing Register Pressure through Live range Shrinking through Loops!!

2014-05-20 Thread Ajit Kumar Agarwal

Hello All:

Simpson does live range shrinking and reduction of register pressure using
computations that are not loads and stores but arithmetic computations: when the
operands and result of a computation are live at the entry and exit of a basic
block but not touched inside the block, the computation is moved to the end of
the block, reducing the register pressure inside the block by one. I propose
extending Simpson's work from computations untouched inside a basic block to
computations untouched across whole loops: if a live range spans a loop, is live
at the entry and exit of the loop, and the computation is not touched inside the
loop, then the computation is moved after the exit of the loop; a before/after
illustration appears near the end of this note.

REDUCTION OF REGISTER PRESSURE THROUGH LIVE RANGE SHRINKING INSIDE THE LOOPS

for each loop, starting from the innermost and moving to the outermost:
begin
  RELIEFIN(i)  = {} if i is the entry of the CFG,
               = intersection of RELIEFOUT(j) over all predecessors j, otherwise
  RELIEFOUT(i) = RELIEFIN(i) union EXPOSED-RELIEF(i)
  INSERT(i,j)  = (RELIEFOUT(i) - RELIEFIN(j)) intersection LIVE(j)
end

The Simpson approach does take the nesting depth into consideration when placing
the computation to relieve register pressure. But it does not consider a
computation that spans a whole loop, whose operands and result are live at the
entry and exit of the loop yet untouched inside it; handling that case can reduce
register pressure inside loops. This approach will also be useful in a
region-based register allocator for live range splitting at region boundaries.

The extension of the Simpson approach is to run the data flow analysis with
respect to the given loop rather than over the entire control flow graph. The
analysis starts from the inner loop and extends to the outer loops; if a value is
not referenced through some nesting depth, the computation can be placed
accordingly. For a graph coloring register allocator, the live ranges of the
operands and result of such a computation are pushed onto the stack during the
simplification phase, improving the chance that those live ranges are colorable
and thus reducing register pressure. This extends to a splitting approach based
on containment of live ranges.

OPTIMAL PLACEMENT OF THE COMPUTATION FOR SINGLE ENTRY AND MULTIPLE EXIT LOOPS

For single-entry multiple-exit loops the Simpson placement leads to a suboptimal
solution, because no single exit node of the loop post-dominates all the basic
blocks inside it; placing the computation just after the tail block of the loop
would therefore be incorrect. For an optimal and correct placement, the
computation must be placed in the block where all the exit points of the loop
reconverge, which post-dominates all the blocks of the loop. This handles
register pressure reduction for loops that are single-entry and multiple-exit.
For irreducible loops, the transformation that converts them to reducible form is
done before register allocation, so the approach applies to structured control
flow and still reduces register pressure.

The existing live range shrinking that reduces register pressure considers loads
and stores but not computations as proposed by Simpson. I am proposing to extend
GCC to move computations to reduce register pressure, for the loops described
above, for both single-entry single-exit and single-entry multiple-exit loops.
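
A source-level analogue (not a GCC dump) of the loop case: the operands a and b
are live across the loop anyway, so sinking the add removes c's live range from
the loop and reduces the pressure inside it by one.

/* Before: c is computed above the loop and carried live through it,
   so a, b and c all occupy registers across the loop.  */
int
before (int a, int b, int *p, int n)
{
  int c = a + b;
  for (int i = 0; i < n; i++)
    p[i] += i;
  return a * b + c;      /* a and b are live here anyway */
}

/* After: the add is sunk past the loop exit; only a and b stay live
   across the loop.  */
int
after (int a, int b, int *p, int n)
{
  for (int i = 0; i < n; i++)
    p[i] += i;
  int c = a + b;
  return a * b + c;
}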

Please let me know what you think.

Thanks & Regards
Ajit


RE: Reducing Register Pressure through Live range Shrinking through Loops!!

2014-05-25 Thread Ajit Kumar Agarwal

On  Friday, May 23, 2014 1:46 AM Vladimir Makarov wrote:
On 05/21/2014 12:25 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> Simpson does the Live range shrinking and reduction of register 
> pressure by using the computation that are not load and store but the 
> arithmetic computation. The computation where the operands and registers are 
> live at the entry and exit of the basic block but not touched inside the 
> block then the computation is moved at the end of the block the reducing the 
> register pressure inside the block by one.  Extension of the Simpson work by 
> extending the computation not being touched inside the basic block to the 
> spanning of the Loops. If the Live ranges spans the Loops and live at the 
> entry and exit of the Loop but the computation is not being touched inside 
> the Loops then the computation is moved after the exit of the Loop.
>
> REDUCTION OF REGISTER PRESSURE THROUGH LIVE RANGE SHRINKING INSIDE THE 
> LOOPS
>
> for each Loop starting from inner to outer do the following begin
> RELIEFIN(i) = null if i is the entry of the cfg.
> Else
> For all predecessors j RELIEFOUT(j)
> RELIEFOUT(i) = RELIEFIN(i) exposed union relief
> INSERT(I,j) = RELIEFOUT(i) RELIEFIN(i) Intersection
> Live(i)
> end
>
> The Simpson approach does takes the nesting depth into consideration 
> of placing the computation and the relieve of the register pressure. Simpson 
> approach doesn't takes into consideration the computation which spans 
> throughout the loop and the operands and results are live at the entry of the 
> Loop and exit of the Loop but not touched inside the Loops can be useful in 
> reduction of register pressure inside the Loops. This  approach will be 
> useful in Region Based Register Allocator for Live Range Splitting at the 
> Region Boundaries.
>
> Extension of the Simpson approach is to consider the data flow 
> analysis with respect to the given Loop rather than having it for 
> entire control flow graph. This data flow analysis starts from the 
> inner loop and extends it to the outer loop. If the reference is not 
> through the nested depth or with some depth then the computation can 
> be placed accordingly. For register allocator by Graph coloring the 
> live ranges that are with respect to operands and results of the 
> computation are taken into consideration and for the above approach 
> put into the stack during simplification phase of Graph Coloring so 
> that there is a chance of getting such Live ranges colorable and thus 
> reduces the register pressure. This is extended to splitting approach 
> based on containment of Live ranges
>
> OPTIMAL PLACEMENT OF THE COMPUTATION FOR SINGLE ENTRY AND MULTIPLE 
> EXIT LOOPS
>
> The placement of the computation to reduce the register pressure for 
> Single Entry and Multiple exit by Simpson approach lead to unoptimal 
> solution. The unoptimal Solution is because of the exit node of the loop does 
> not post dominates all the basic block inside the Loops. Due to this the 
> placement of the computation just after the tail block of the Loop will lead 
> to incorrect results. In order to perform the Optimal Solution of the 
> placement of the computation, the computation needs to be placed the block 
> just after all the exit points of the Loop reconverge and which will post 
> dominates all the blocks of the Loops. This will take care of reducing the 
> register pressure for the Loops that are single Entry and Multiple Exit. For 
> irreducible Loops the optimization to convert to reducible is done before the 
> register allocation that reduces the register pressure and will be applicable 
> to structured control flow and thus reduces the register pressure.
>
> The Live range shrinkage reducing register pressure takes load and store into 
> consideration but not computation as proposed by Simpson. I am proposing to 
> extend in GCC for the computation to reduce register pressure and for the 
> Loop as given above for both Single Entry and Single Exit and Single Entry 
> and Multiple Exit Loops.
>
>
>>Thanks for sharing with this.  I have plans to work on a new register 
>>pressure relief pass too this summer (more accurately this work is in my 
>>company plans -- so I should work on this >>anyway).  It will use Simpson's 
>>approach also.  I prefer to do it as a separate pass not as a part of IRA 
>>because IRA is already complicated.  It also permits to rematerialize not 
>>only on loop  >>borders (although it is the most important points).

>>It is hard for me to say what approach will be better as there are too many 
>>transformations even after IRA (e.g. in IRA you can think that pseudos got 
>>hard registers and rematerialize >

Reducing Register Pressure based on Instruction Scheduling and Register Allocator!!

2014-06-06 Thread Ajit Kumar Agarwal
Hello All:

I was looking further into reducing register pressure through register
allocation and instruction scheduling, and the following observations are
based on the existing papers on reducing register pressure through a
scheduling approach.

Have the following aspects of reducing register pressure already been
implemented in GCC?

Minimum register usage through better Instruction scheduling

Instruction scheduling plays an important role in the increase or decrease
of register pressure. If instruction scheduling is done before register
allocation, the register pressure will increase, whereas if register
allocation is done before instruction scheduling, false antidependences will
be created. For an out-of-order superscalar processor the scheduling should
be done in such a way that register pressure decreases. Achieving ILP tends
to increase register pressure, but on a superscalar processor register
pressure matters, so instruction scheduling for ILP should also consider the
register pressure.

Govindarajan proposed a scheme where list scheduling is modified to take the
available registers as one of the constraints, along with latency, when
scheduling instructions. From the data dependency graph of the given basic
block, chains are created based on depth-first search, and one register is
allocated per chain so that values are handed on like a baton: a def is used
by the next instruction, and that instruction's def is used by the one after
it. A given register is assigned to a def and its use, and the same register
is reused for the next def and use, so one register suffices for each chain.
The main challenge is to assign the same register to different chains when
the chains don't overlap.

List scheduling is modified to take a list of available colors as a
parameter, with a release operation that returns colors to the available
set. When a node of the DAG is removed from the ready queue, the live ranges
that terminate at that node are released and added back as available colors.
This ensures the register pressure will not be increased by the schedule, an
idea derived from the graph coloring register allocator.
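
To make the idea concrete, here is a minimal sketch of color-constrained
list scheduling over a toy DAG. The node layout, the kill sets and
NUM_COLORS are illustrative assumptions of mine, not GCC's internals:

#include <stdio.h>
#include <stdbool.h>

#define NUM_NODES  4
#define NUM_COLORS 2   /* assumed number of available registers */

/* A toy DAG node: count of unscheduled predecessors, successor ids,
   and the ids of values whose live range ends (is "killed") here.  */
struct node {
  int num_preds;
  int succs[2];    /* -1 terminated */
  int kills[2];    /* -1 terminated */
};

static struct node dag[NUM_NODES] = {
  { 0, {  2, -1 }, { -1, -1 } },   /* n0 */
  { 0, {  3, -1 }, { -1, -1 } },   /* n1 */
  { 1, {  3, -1 }, {  0, -1 } },   /* n2 uses n0 last */
  { 2, { -1, -1 }, {  1,  2 } },   /* n3 uses n1 and n2 last */
};

int main (void)
{
  bool done[NUM_NODES] = { false };
  int avail = NUM_COLORS, emitted = 0;

  while (emitted < NUM_NODES)
    for (int i = 0; i < NUM_NODES; i++)
      {
        int releases = 0;
        for (int k = 0; k < 2 && dag[i].kills[k] >= 0; k++)
          releases++;

        /* Ready: all predecessors issued, and issuing this node
           leaves a color available for the value it defines.  */
        if (done[i] || dag[i].num_preds > 0 || avail + releases == 0)
          continue;

        avail += releases - 1;   /* dying values release, def takes one */
        done[i] = true;
        emitted++;
        printf ("issue n%d (colors left: %d)\n", i, avail);

        for (int s = 0; s < 2 && dag[i].succs[s] >= 0; s++)
          dag[dag[i].succs[s]].num_preds--;
      }
  return 0;
}

Each issued node consumes one color for its def and releases the colors of
the values that die at it, so the resulting schedule never exceeds the color
budget.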

Register pressure sensitivity based Instruction scheduling

A superscalar OOO processor does dynamic instruction scheduling, based on an
out-of-order instruction window and register renaming, to achieve ILP. The
main challenge is to produce an ILP schedule and then generate from it an
instruction sequence that is sensitive to registers; this step is called
linearization. Groups are formed from the ILP schedule as follows:

1. If the selected instruction is s, all the instructions scheduled before s
form one group.

2. All the instructions scheduled after s form another group.

The linearization algorithm uses a "can" test and a "must" test. The can
test: if the selected instruction s is generated at index i, all the
instructions in group 1 that are not yet scheduled but are in the ready list
can be generated up to index i+W-1, where W is the size of the register
window. The must test: if the selected instruction is generated at index i,
then all the instructions scheduled after s must be generated by index
i+W-1. If the instructions scheduled after s do not fit in the window, the
ILP won't be achieved.

The linearization algorithm first checks the can test; if it is satisfied,
the instruction is generated. If not, the must test is checked, and if that
is satisfied the instruction is generated.

Register pressure should not increase during linearization, since there is a
chance of register pressure increasing during scheduling. To achieve this, a
priority is assigned to each instruction in the ready list. If all the uses
of the variables in the selected instruction are already scheduled, the
priority is 2. If an instruction depends on i, and i in turn depends on
another instruction, the live range is long, so the priority is lower and is
assigned 1. The instruction with the highest priority is selected from the
ready list, which ensures the register pressure doesn't increase.
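
A small sketch of that priority rule, assuming we can ask whether an
instruction's source values die at it (the struct and field names are made
up for illustration):

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical summary of a ready-list entry.  */
struct rinsn {
  int id;
  bool ends_live_ranges;   /* all other uses of its sources are scheduled */
};

/* Priority rule from the text: prefer instructions that terminate
   live ranges (2) over those heading a long dependence chain (1).  */
static int priority (const struct rinsn *r)
{
  return r->ends_live_ranges ? 2 : 1;
}

int main (void)
{
  struct rinsn ready[] = { { 0, false }, { 1, true }, { 2, false } };
  int best = 0;
  for (int k = 1; k < 3; k++)
    if (priority (&ready[k]) > priority (&ready[best]))
      best = k;
  printf ("select insn %d from the ready list\n", best);   /* insn 1 */
  return 0;
}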

There are approaches where, if the register pressure in a basic block is
high, the instruction scheduler is run after register allocation, and
otherwise it is run before the register allocator. Along these lines, gcc
has an instruction scheduler both before and after the register allocator.

Integrated Code Scheduling and Register Allocation with minimum register 
pressure

Integrated code scheduling uses both a prepass and a postpass instruction
scheduler. The prepass scheduler increases the overuse of the registers and
th

vector load Rematerialization!!

2014-06-16 Thread Ajit Kumar Agarwal

Hello All:

There has been work done on load rematerialization: instead of storing and
loading variables, they are kept in registers for their live range. Till now
we have been rematerializing only scalar loads. Is it feasible to have
rematerialization for vector loads? This would be helpful in reducing the
vectorized stores and loads for the dependencies across vectorized loops. I
was looking at a presentation which mentions that load rematerialization is
implemented from GCC 4.8.2 onwards.

Does this implementation take care of rematerialization of vector loads?
Can we have this approach?
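
As a source-level illustration of the effect (using the GNU vector
extension; this only sketches the transformation, not how GCC implements
rematerialization internally):

/* If the allocator runs out of vector registers, v would normally be
   spilled to a stack slot and reloaded.  When the source memory 'a'
   is known unchanged in between, the reload can instead re-execute
   the original vector load -- rematerialization -- saving both the
   spill store and the stack slot.  */
typedef int v4si __attribute__ ((vector_size (16)));

v4si use_twice (const v4si *a)
{
  v4si v = *a;        /* original vector load */
  /* ... enough intervening vector code that v would be spilled ... */
  v4si w = *a;        /* rematerialized load instead of a stack reload */
  return v + w;
}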

Please let me know what you think.

Thanks & Regards
Ajit



Register Pressure guided Unroll and Jam in GCC !!

2014-06-16 Thread Ajit Kumar Agarwal
Hello All:
 
I have worked on the Open64 compiler, where register-pressure-guided Unroll
and Jam gave a good amount of performance improvement on the C and C++ SPEC
benchmarks and also on Fortran benchmarks.

Unroll and Jam increases the register pressure in the unrolled loop, leading
to an increase in spills and fetches that degrades the performance of the
unrolled loop. The cache locality achieved through Unroll and Jam is lost in
the presence of spill instructions caused by the increased register
pressure. It's better to base the decision on the unroll factor of the loop
on a performance model of the register pressure.

Most loop optimizations like Unroll and Jam are implemented in the
high-level IR. Register-pressure-based Unroll and Jam requires calculating
register pressure in the high-level IR, similar to the register pressure we
calculate during register allocation. This makes the implementation complex.

To overcome this, the Open64 compiler makes the unrolling decisions both in
the high-level IR and at the code generation level, deferring some of the
decisions to the very end of code generation. The advantage of this approach
is that it can use the register pressure information calculated by the
register allocator, which makes the implementation much simpler and less
complex.

Can we have this approach in GCC: making the Unroll and Jam decisions in the
high-level IR and also deferring some of the decisions to the code
generation level, like Open64?

Please let me know what you think.

Thanks & Regards
Ajit


RE: Register Pressure guided Unroll and Jam in GCC !!

2014-06-16 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Monday, June 16, 2014 7:55 PM
To: Ajit Kumar Agarwal
Cc: gcc@gcc.gnu.org; Vladimir Makarov; Michael Eager; Vinod Kathail; Shail 
Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Register Pressure guided Unroll and Jam in GCC !!

On Mon, Jun 16, 2014 at 4:14 PM, Ajit Kumar Agarwal 
 wrote:
> Hello All:
>
> I have worked on the Open64 compiler where the Register Pressure Guided 
> Unroll and Jam gave a good amount of performance improvement for the  C and 
> C++ Spec Benchmark and also Fortran benchmarks.
>
> The Unroll and Jam increases the register pressure in the Unrolled Loop 
> leading to increase in the Spill and Fetch degrading the performance of the 
> Unrolled Loop. The Performance of Cache locality achieved through Unroll and 
> Jam is degraded with the presence of Spilling instruction due to increases in 
> register pressure Its better to do the decision  of Unrolled Factor of the 
> Loop based on the Performance model of the register pressure.
>
> Most of the Loop Optimization Like Unroll and Jam is implemented in the High 
> Level IR. The register pressure based Unroll and Jam requires the calculation 
> of register pressure in the High Level IR  which will be similar to register 
> pressure we calculate on Register Allocation. This makes the implementation 
> complex.
>
> To overcome this, the Open64 compiler does the decision of Unrolling to both 
> High Level IR and also at the Code Generation Level. Some of the decisions 
> way at the end of the Code Generation . The advantage of using this approach 
> like Open64 helps in using the register pressure information calculated by 
> the Register Allocator. This helps the implementation much simpler and less 
> complex.
>
> Can we have this approach in GCC of the Decisions of Unroll and Jam in the 
> High Level IR  and also to defer some of the decision at the Code Generation 
> Level like Open64?
>
>  Please let me know what do you think.

>>Sure, you can for example compute validity of the transform during the GIMPLE 
>>loop opts, annotate the loop meta-information with the desired transform and 
>>apply it (or not) later during RTL unrolling.

Thanks !! Has RTL unrolling already been implemented?

Richard.

> Thanks & Regards
> Ajit


RE: Possible LRA issue?

2014-08-27 Thread Ajit Kumar Agarwal
The xmalloc calls at the points given below in the register allocator are
not caused only by the structure and by changing the S passed as a template
argument; they depend on how the structure below is referenced and used.
From the stack traces I can see that the live range creation is driven by
how the structure below is referenced and used.

Thanks & Regards
Ajit

-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Daniel 
Gutson
Sent: Wednesday, August 27, 2014 7:58 PM
To: gcc Mailing List
Subject: Possible LRA issue?

Hi,

   I have a large codebase where at some point, there's a structure that takes 
an unsigned integer template argument, and uses it as the size of an array, 
something like

template <unsigned int S>
struct Struct
{
typedef std::array<char, S> Chunk;  // element type eaten by the archive; char is a guess
typedef std::list<Chunk> Content;

   Content c;
};

Changing the values of S alters significantly the compile time and memory that 
the compiler takes. We use some large numbers there.
At some point, the compiler runs out of memory (xmalloc fails). I wondered why, 
and did some analysis by debugging the 4.8.2 (same with 4.8.3), and did the 
following experiment turning off all the optimizations (-fno-* and -O0):
  I generated a report of xmalloc usage of two programs: one having S=10u, and 
another with S=11u, just to see the difference of 1.
The report was generated as follows: I set a breakpoint at xmalloc, appending a 
bt to a file. Then I found common stack traces and counted how many xmallocs 
were called in one and another versions of the program (S=10u and S=11u as 
mentioned above).
The difference were:

a) Stack trace:
  xmalloc | pool_alloc | create_live_range | mark_pseudo_live | 
mark_regno_live | process_bb_lives | lra_create_live_ranges | lra | do_reload | 
rest_of_handle_reload | execute_one_pass | execute_pass_list | 
execute_pass_list | expand_function | output_in_order | compile | 
finalize_compilation_unit | cp_write_global_declarations | compile_file | 
do_compile | toplev_main
| __libc_start_main | _start |

 S=10u: 15 times
 S=11u: 16 times


b) Stack trace:
  xmalloc | lra_set_insn_recog_data | lra_get_insn_recog_data | 
lra_update_insn_regno_info | lra_update_insn_regno_info |
lra_push_insn_1 | lra_push_insn | push_insns | lra_process_new_insns | 
curr_insn_transform | lra_constraints | lra | do_reload | rest_of_handle_reload 
| execute_one_pass | execute_pass_list | execute_pass_list | expand_function | 
output_in_order | compile | finalize_compilation_unit | 
cp_write_global_declarations | compile_file | do_compile | toplev_main | 
__libc_start_main | _start |

 S=10u: 186 times
 S=11u: 192 times

c) Stack trace:
 xmalloc | df_install_refs | df_refs_add_to_chains | df_insn_rescan | 
emit_insn_after_1 | emit_pattern_after_noloc | emit_pattern_after_setloc | 
emit_insn_after_setloc | try_split | split_insn | split_all_insns | 
rest_of_handle_split_after_reload | execute_one_pass | execute_pass_list | 
execute_pass_list | execute_pass_list | expand_function | output_in_order | 
compile | finalize_compilation_unit | cp_write_global_declarations | 
compile_file | do_compile | toplev_main | __libc_start_main | _start |

 S=10u: 617 times
 S=11u: 619 times

d) Stack trace:
 xmalloc | df_install_refs | df_refs_add_to_chains | df_bb_refs_record | 
df_scan_blocks | rest_of_handle_df_initialize | execute_one_pass | 
execute_pass_list | execute_pass_list | expand_function | output_in_order | 
compile | finalize_compilation_unit | cp_write_global_declarations | 
compile_file | do_compile | toplev_main | __libc_start_main | _start |

S=10u: 13223 times
S=11u: 13227 times

e) Stack trace:
 xmalloc | __GI__obstack_newchunk | bitmap_element_allocate | 
bitmap_set_bit | update_lives | assign_hard_regno | assign_by_spills | 
lra_assign | lra | do_reload | rest_of_handle_reload | execute_one_pass | 
execute_pass_list | execute_pass_list | expand_function | output_in_order | 
compile | finalize_compilation_unit | cp_write_global_declarations | 
compile_file | do_compile | toplev_main | __libc_start_main | _start |

S=10u: 0 times (never!)
S=11u: 1

Unfortunately I can't disclose the source code nor have the time to isolate a 
piece of code reproducing the issue.
Some comments about the code: I don't do template metaprogramming depending on 
S, but I do some for-range on the Content.

I can extend the analysis to S=12 and compare with the previous values.
I thought to fix this myself but lack the time and background on these
optimizations. Any hint?
I'm open to do more experiments if anybody asks me, or post -fdumps.

I suspect that by playing with ggc-min-heapsize and similar params this
issue could be worked around, but I'd like to know why just changing the
size of an array has such a consequence.

Thanks!

Daniel.

-- 

Daniel F. Gutson
Chief Engineering Officer, SPD


San Lorenzo 47, 3rd Floor, Office 5

Córdoba, Argentina


Phone: +54 351 42178

RE: Possible LRA issue?

2014-08-28 Thread Ajit Kumar Agarwal


-Original Message-
From: Daniel Gutson [mailto:daniel.gut...@tallertechnologies.com] 
Sent: Wednesday, August 27, 2014 8:53 PM
To: Ajit Kumar Agarwal
Cc: gcc Mailing List
Subject: Re: Possible LRA issue?

On Wed, Aug 27, 2014 at 12:16 PM, Ajit Kumar Agarwal 
 wrote:
> The cause of xmalloc occurring at times given below in Register Allocator 
> will not be caused only by the structure and changing the passed S as 
> template argument.
> It depends on how the below structures is referenced or used. From the stack 
> trace I can see the live ranges creation is based on how the below structure 
> is referenced and Used.

>>Could you please show me an example of such different usages and references?

I will leave it to you to formulate the exact code with the templatized
structures; here is the flow of defs and uses:
             Live range<1>   Live range<2>

1 = Def           |
2 = Def           |                |
  .               |                |
  .               |                |
Use of <2>        |                |
  .               |
  .               |
Use of <1>        |
  
If the number of elements of the array increases, the creation of live
ranges increases, and with it the calls to xmalloc that create the live
ranges. The DEFs and USEs above are what populate the live ranges. Thus the
uses and references of the array elements cause the calls, rather than the
declaration of the structure.

Thanks & Regards
Ajit

>
> Thanks & Regards
> Ajit
>
> -Original Message-
> From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf 
> Of Daniel Gutson
> Sent: Wednesday, August 27, 2014 7:58 PM
> To: gcc Mailing List
> Subject: Possible LRA issue?
>
> Hi,
>
>I have a large codebase where at some point, there's a structure 
> that takes an unsigned integer template argument, and uses as the size 
> of an array, something like
>
> template 
> struct Struct
> {
> typedef std::array Chunk;
> typedef std::list Content;
>
>Content c;
> };
>
> Changing the values of S alters significantly the compile time and memory 
> that the compiler takes. We use some large numbers there.
> At some point, the compiler runs out of memory (xmalloc fails). I wondered 
> why, and did some analysis by debugging the 4.8.2 (same with 4.8.3), and did 
> the following experiment turning off all the optimizations (-fno-* and -O0):
>   I generated a report of xmalloc usage of two programs: one having S=10u, 
> and another with S=11u, just to see the difference of 1.
> The report was generated as follows: I set a breakpoint at xmalloc, appending 
> a bt to a file. Then I found common stack traces and counted how many 
> xmallocs were called in one and another versions of the program (S=10u and 
> S=11u as mentioned above).
> The difference were:
>
> a) Stack trace:
>   xmalloc | pool_alloc | create_live_range | mark_pseudo_live | 
> mark_regno_live | process_bb_lives | lra_create_live_ranges | lra | 
> do_reload | rest_of_handle_reload | execute_one_pass | 
> execute_pass_list | execute_pass_list | expand_function | 
> output_in_order | compile | finalize_compilation_unit | 
> cp_write_global_declarations | compile_file | do_compile | toplev_main
> | __libc_start_main | _start |
>
>  S=10u: 15 times
>  S=11u: 16 times
>
>
> b) Stack trace:
>   xmalloc | lra_set_insn_recog_data | lra_get_insn_recog_data | 
> lra_update_insn_regno_info | lra_update_insn_regno_info |
> lra_push_insn_1 | lra_push_insn | push_insns | lra_process_new_insns | 
> curr_insn_transform | lra_constraints | lra | do_reload | 
> rest_of_handle_reload | execute_one_pass | execute_pass_list | 
> execute_pass_list | expand_function | output_in_order | compile | 
> finalize_compilation_unit | cp_write_global_declarations | 
> compile_file | do_compile | toplev_main | __libc_start_main | _start |
>
>  S=10u: 186 times
>  S=11u: 192 times
>
> c) Stack trace:
>  xmalloc | df_install_refs | df_refs_add_to_chains | 
> df_insn_rescan | emit_insn_after_1 | emit_pattern_after_noloc | 
> emit_pattern_after_setloc | emit_insn_after_setloc | try_split | 
> split_insn | split_all_insns | rest_of_handle_split_after_reload | 
> execute_one_pass | execute_pass_list | execute_pass_list | 
> execute_pass_list | expand_function | output_in_order | compile | 
> finalize_compilation_unit | cp_write_global_declar

Global Value Numbering on SSA representation based on Redundancy Class

2014-09-19 Thread Ajit Kumar Agarwal
Hello All:

Please find below the different global value numbering techniques on SSA
representation. I am proposing for GCC a global value numbering on SSA
representation based on redundancy classes. Can this be proposed?

An SSA representation with a control flow graph can be formulated with a
global value numbering algorithm. GVN algorithms assign value numbers to
expressions based on hashing, value partitioning, SCCs on the SSA graph, or
dominators. Value partitioning is based on congruence classes; the
dominator-based approach traverses the dominator tree in reverse post order
and assigns values to the expressions; the SCC-based approach traverses the
SSA graph in reverse post order and assigns value numbers using an
optimistic table and a valid table.

Several optimizations based on GVN are useful, such as redundant expression
elimination, redundant load and store elimination, common subexpression
elimination, reassociation, and value redundancy.

Global value numbering is proposed here using a redundancy class assigned to
each expression, based on a renaming scheme on the SSA representation: the
variable renaming scheme is extended to expressions. The redundancy class is
used together with the SSA graph and its SCC representation. The redundancy
class concept is taken from the Fred Chow paper on PRE on SSA. Based on the
redundancy classes, a new set of phi-like nodes is inserted whose operands
are redundancy classes, and these phi nodes, along with the redundancy class
numbers, are used with the optimistic table and the valid table as suggested
by Simpson for value numbering on SCC-based SSA graphs.

This will help to perform the optimizations mentioned above on the SSA
representation.

 Value based partitioning:

Value-based partitioning assigns values by partitioning expressions into
congruence classes. Two expressions are congruent if they have the same
opcode and each pair of corresponding operands is congruent; this is a
recursive definition. Based on the congruence, an initial partition is
created, which is refined over the iterations until the partitions become
stable.

For the expressions given below:

A = x - y
B = y - x
C = A + B

the initial partition created is {x}, {y}, {A, B, C}. The algorithm then
works as follows. The worklist is the set of congruence classes. A class
such as {x} is selected; x feeds the derivation of A, and since A's
positions form a proper subset of the class {A, B, C}, a new partition {A}
is created. Similarly the partitions {B} and {C} are created and added to
the worklist, and the algorithm continues until the partitions become
stable. So the final partitions are {x}, {y}, {A}, {B}, {C}.
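
A toy refinement loop for exactly this example (a real implementation
splits a class by signature rather than peeling one member at a time, but
for these five expressions the result is the same):

#include <stdio.h>

/* Expressions: x, y, A = x - y, B = y - x, C = A + B.
   ids: x=0, y=1, A=2, B=3, C=4.  Leaves use op 'L'.  */
struct expr { char op; int lhs, rhs; };
static struct expr e[5] = {
  { 'L', -1, -1 },   /* x */
  { 'L', -1, -1 },   /* y */
  { '-',  0,  1 },   /* A */
  { '-',  1,  0 },   /* B */
  { '+',  2,  3 },   /* C */
};

int main (void)
{
  /* Initial partition from the text: {x}, {y}, {A, B, C}.  */
  int cls[5] = { 0, 1, 2, 2, 2 };
  int next = 3, changed = 1;

  while (changed)
    {
      changed = 0;
      for (int i = 0; i < 5; i++)
        for (int j = i + 1; j < 5; j++)
          /* Congruent means: same opcode and congruent operands.  */
          if (cls[i] == cls[j]
              && (e[i].op != e[j].op
                  || (e[i].lhs >= 0
                      && (cls[e[i].lhs] != cls[e[j].lhs]
                          || cls[e[i].rhs] != cls[e[j].rhs]))))
            {
              cls[j] = next++;   /* split j out of i's class */
              changed = 1;
            }
    }

  for (int i = 0; i < 5; i++)
    printf ("expr %c -> class %d\n", "xyABC"[i], cls[i]);
  return 0;
}

Running this prints the final partition {x}, {y}, {A}, {B}, {C} traced
above: A and B are not congruent because their operands are swapped.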

SCC based value numbering on SSA Graph

Strongly connected components (SCCs) are found on the SSA graph when it has
cycles. The set of nodes in each SCC is mapped to a single node; for each
such component, every node is traversed and value numbers are assigned.

For the following SSA representation of the control flow graph:

i0 = 1
j0 = 1
while (...)
{
  i1 = phi(i0, i2)
  j1 = phi(j0, j2)
  i2 = i1 + 1
  j2 = j1 + 1
}
For the above example the SSA graph forms a strongly connected component
with a cycle, and each node of the SCC is traversed.

Traversing the SSA graph in reverse post order for variable i, the initial
value is 1, which is added to the optimistic table, so the initial lattice
value is 1. At i1 = phi(i0, i2), the phi is added to the optimistic table
and assigned as the value of i1. In the first iteration i2 becomes 2 as the
initial value of i1 is propagated. In the second iteration i1 = phi(i0, i2)
is added to the optimistic table and assigned the value of i1.

Traversing the SSA graph in reverse post order for variable j, the initial
value is 1, so j1 is 1 in the first iteration. Since i1 was assigned 1 and j
hashes to the same value, j1 equals i1. In the first iteration j2 becomes 2,
and j2 equals i2 as they hash to the same value. In the second iteration
j1 = phi(j0, j2) becomes i1 = phi(i0, i2), and this entry is already mapped
in the optimistic table. This means i is congruent to j, so the same value
is assigned to i and j, and the optimization can remove the j entries and
the related phi node.

 Global Value Numbering based on redundancy class

The redundancy classes are formed as given in PRE on SSA by Fred Chow, using
SSA variable renaming: the SSA variables used to form the expressions are
assigned redundancy classes in the same way that SSA variables are renamed.

For the following example:

i0 = 1
j0 = 1
while (...)
{
  i1 = phi(i0, i2)
  j1 = phi(j0, j2)
  i2 = i1 + 1
  j2 = j1 + 1
}

The above SSA program is represented with the redundancy classes and the new
phi nodes based on the redundancy classes; it is modified as follows:

i0 = 1
j0 = 1
while (...)
{

Expansion of memset and memcpy calls.

2014-10-21 Thread Ajit Kumar Agarwal
Hello All:

Memset and memcpy calls are used extensively in many benchmarks. Inlining or
expanding the memcpy and memset calls improves the performance of many
benchmarks. I have implemented the expansion of strcmp into an optimized
instruction sequence in the Open64 compiler for the AMD x86 target.

May I suggest and propose expanding memset and memcpy calls into instruction
sequences? Many targets like ARM are moving the implementation of memcpy and
memset to assembly instead of C, which makes it easier to expand the memcpy
and memset calls in gcc.

To implement this in GCC we would need to expand them similarly to the
existing expansion of builtins.
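
As a source-level sketch of the kind of sequence such an expansion could
emit for a small, word-aligned memset (GCC's real expansion lives in the
builtin expanders and target hooks; this only illustrates the shape of the
generated code):

#include <stddef.h>
#include <stdint.h>

/* Expanded form of memset (p, c, n), assuming p is 4-byte aligned:
   word-wide stores for the bulk, byte stores for the tail.  */
static void expanded_memset (void *p, int c, size_t n)
{
  uint32_t word = 0x01010101u * (uint8_t) c;   /* splat the byte */
  uint32_t *wp = p;

  while (n >= 4)        /* the real expansion would unroll this */
    {
      *wp++ = word;
      n -= 4;
    }

  uint8_t *bp = (uint8_t *) wp;
  while (n--)           /* at most three tail bytes */
    *bp++ = (uint8_t) c;
}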

Let me know what you think.

Thanks & Regards
Ajit



RE: Optimized Allocation of Argument registers

2014-11-17 Thread Ajit Kumar Agarwal
Hello All:

I was looking at optimized usage and allocation of argument registers.
There are two aspects to it, as follows.

1. We need to specify the argument registers following the ABI in the
target-specific code, and the function arguments are passed in the argument
registers defined there. Suppose the number of argument registers defined in
the architecture is large, say 6 or 8. Most of the time in the benchmarks we
don't pass that many arguments; the number of arguments passed is quite
small. Since we reserve the function argument registers as specified in the
target-specific code for the given architecture, this leads to unoptimized
usage, as those argument registers will not be used in the function. Thus we
could steal some of the argument registers and use them inside the function,
depending on how many argument registers are actually needed. Stealing
function argument registers makes more registers available for use in the
function, leading to fewer spills and fetches.

2. The other aspect is to avoid spilling and fetching the argument registers
just because they are live across a function call. Their liveness extends
only to a certain point of the called function; beyond that point the
argument registers are not live and can be used inside the called function.
A further question: if there is a shortage of registers, should the function
argument registers be used as spill candidates? Would this lead to better
code?

Please let me know what you think.

Thanks & Regards
Ajit


RE: Optimized Allocation of Argument registers

2014-11-18 Thread Ajit Kumar Agarwal


-Original Message-
From: Vladimir Makarov [mailto:vmaka...@redhat.com] 
Sent: Tuesday, November 18, 2014 1:57 AM
To: Ajit Kumar Agarwal; gcc Mailing List
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Optimized Allocation of Argument registers

On 2014-11-17 8:13 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I was looking at the optimized usage and allocation to argument registers. 
> There are two aspects to it as follows.
>
> 1. We need to specify the argument registers as followed by ABI in the target 
> specific code. Based on the function
>   argument registers defined in the target dependent code the function 
> argument registers are passed. If the
>   number of argument registers defined in the Architecture is large say 6/8 
> function argument registers.
> Most of the time in the benchmarks we don't pass so many arguments and the 
> number of arguments passed
>   is quite less. Since we reserve the function arguments as specified 
> in the target specific code for the given architecture, leads to unoptimized 
> usage as this function argument registers will not be used in the function.
> Thus we need to steal some of the arguments registers and have the 
> usage of those in the function depending on the support of the number of 
> function argument registers. The stealing of function argument registers will
>   lead more number of registers available that are to be used in the function 
> and leading to less spill and fetch.
>

>>The argument registers should not be reserved.  They should be present in RTL 
>>and RA allocator will figure out itself when it can use them. 
>>That is how other ports work.

Thanks Vladimir for Clarifications.

> 2. The other aspect of the function argument registers is not spill 
> and fetch the argument registers as they are live across the function 
> call. But the liveness is limited to certain point of the called 
> function after that point the function argument registers are not live 
> and can be used inside the called function. Other aspect is if there is a 
> shortage of registers than can the function argument registers should be used 
> as spill candidate? Will this lead to the optimized code.
>
>

>>You can remove unnecessary code to save/restore arg registers around calls if 
>>you can figure out that they are not used in called functions. 
 >> There is already code for this written by Tom de Vries.  So you can use it.

Is the code written by Tom de Vries already part of trunk? Could you please
give me a pointer to this part of the code.

Thanks & Regards
Ajit


RE: Optimized Allocation of Argument registers

2014-11-18 Thread Ajit Kumar Agarwal


-Original Message-
From: Jeff Law [mailto:l...@redhat.com] 
Sent: Monday, November 17, 2014 9:27 PM
To: Ajit Kumar Agarwal; Vladimir Makarov; gcc Mailing List
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Optimized Allocation of Argument registers

On 11/17/14 06:13, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I was looking at the optimized usage and allocation to argument registers. 
> There are two aspects to it as follows.
>
> 1. We need to specify the argument registers as followed by ABI in the target 
> specific code. Based on the function
>   argument registers defined in the target dependent code the function 
> argument registers are passed. If the
>   number of argument registers defined in the Architecture is large say 6/8 
> function argument registers.
> Most of the time in the benchmarks we don't pass so many arguments and the 
> number of arguments passed
>   is quite less. Since we reserve the function arguments as specified 
> in the target specific code for the given architecture, leads to unoptimized 
> usage as this function argument registers will not be used in the function.
> Thus we need to steal some of the arguments registers and have the 
> usage of those in the function depending on the support of the number of 
> function argument registers. The stealing of function argument registers will
>   lead more number of registers available that are to be used in the function 
> and leading to less spill and fetch.
>
> 2. The other aspect of the function argument registers is not spill 
> and fetch the argument registers as they are live across the function 
> call. But the liveness is limited to certain point of the called 
> function after that point the function argument registers are not live 
> and can be used inside the called function. Other aspect is if there is a 
> shortage of registers than can the function argument registers should be used 
> as spill candidate? Will this lead to the optimized code.
>
> Please let me know what do you think.
>>Typically GCC ports do not reserve the function argument/return registers, 
>>they are allocatable just like any other call clobbered register.

>>In essence arguments for function calls are set up at each call site by 
>>copying values out of pseudo registers, memory, constant initializations, etc 
>>to the appropriate outgoing argument register.  The allocator will, when 
>>possible and profitable try to assign the pseudo register to the appropriate 
>>argument register to eliminate the copies.

>>Similarly, GCC copies values out of the incoming argument registers and into 
>>pseudos at the start of a function.  The allocator, again when possible and 
>>profitable, will try to allocate the pseudos to the incoming argument 
>>registers to avoid the copy.

>>So in summary, there is no reason to reserve registers for argument passing 
>>in GCC and doing so would be wasteful.  Treat the argument registers as any 
>>other call clobbered register and allow the register allocator to make 
>>appropriate decisions based on liveness, expected uses, call-crossing 
>>liveness, related copies, etc etc.

Thanks Jeff for the explanation and Clarifications.

Thanks & Regards
Ajit
jeff


RE: Optimized Allocation of Argument registers

2014-11-24 Thread Ajit Kumar Agarwal
All:

The optimization of reducing the saves and restores of callee- and
caller-saved registers has received attention as a way of increasing
benchmark performance. A callee-saved register is saved at the entry and
restored at the exit of a procedure if the register is reused inside the
procedure, whereas a caller-saved register is saved at the caller site
before the call and restored after it when a variable that is live across
the call occupies it.

GCC already does some optimization here: based on data flow analysis, if a
call-used register is not marked live inside the called procedure, it is not
saved and restored.

Callee-saved registers are useful when there are multiple calls in the call
graph, whereas caller-saved registers are useful for leaf procedures, where
saving before the call and restoring after it pays off and improves
performance.

By traversing the call graph in depth-first order, bottom-up, we can
propagate the saves and restores at procedure entry and exit to the upper
regions of the call graph, which removes the saves and restores in the lower
regions across the various lower calls. These decisions can be made based on
the call frequencies in the call graph, as proposed by Fred Chow.

Another approach to reducing the saves and restores at procedure entry and
exit is to move them from the procedure entry into the active regions where
the variable is live inside the procedure, based on data flow analysis; this
improves the performance of many benchmarks, as proposed by Fred Chow.
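
As a small example of the kind of function this targets (my example, not
Chow's): the hot early-return path uses no callee-saved registers, so saving
them in the prologue is wasted work on that path, and moving the
save/restore down into the cold region removes it from the fast path.

/* With saves at entry/exit, every call pays for saving the
   callee-saved registers that only the loop below needs.  After
   moving the saves into the region where the registers are live,
   the x == 0 path executes no save/restore at all.  */
int sum_or_zero (int x, const int *p)
{
  if (x == 0)                     /* hot early exit */
    return 0;

  int s = 0;                      /* cold region: callee-saved registers */
  for (int i = 0; i < x; i++)     /* are actually needed only here */
    s += p[i];
  return s;
}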

Neither the propagation of saves and restores from the lower regions of the
call graph to the upper regions, nor the movement of saves and restores from
the entry and exit to the active regions of liveness inside the procedure,
is implemented in the GCC framework. Can this be proposed and implemented in
the GCC framework?

For open procedures, with indirect calls, recursive calls and calls with
external linkage, this cannot be optimized, and in the open case the saves
and restores at procedure entry and exit are applied. But if all the lower
calls in the call graph of an open procedure are closed and resolved through
the call graph, the saves and restores can be propagated from the lower,
resolved regions to the upper region of the open procedure. This can also
improve the performance of many benchmarks; can this be proposed and
implemented in the GCC framework?

Let me know what you think.

Thanks & Regards
Ajit 

-Original Message-
From: Ajit Kumar Agarwal 
Sent: Tuesday, November 18, 2014 7:01 PM
To: 'Vladimir Makarov'; gcc Mailing List
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: Optimized Allocation of Argument registers



-Original Message-
From: Vladimir Makarov [mailto:vmaka...@redhat.com]
Sent: Tuesday, November 18, 2014 1:57 AM
To: Ajit Kumar Agarwal; gcc Mailing List
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Optimized Allocation of Argument registers

On 2014-11-17 8:13 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I was looking at the optimized usage and allocation to argument registers. 
> There are two aspects to it as follows.
>
> 1. We need to specify the argument registers as followed by ABI in the target 
> specific code. Based on the function
>   argument registers defined in the target dependent code the function 
> argument registers are passed. If the
>   number of argument registers defined in the Architecture is large say 6/8 
> function argument registers.
> Most of the time in the benchmarks we don't pass so many arguments and the 
> number of arguments passed
>   is quite less. Since we reserve the function arguments as specified 
> in the target specific code for the given architecture, leads to unoptimized 
> usage as this function argument registers will not be used in the function.
> Thus we need to steal some of the arguments registers and have the 
> usage of those in the function depending on the support of the number of 
> function argument registers. The stealing of function argument registers will
>   lead more number of registers available that are to be used in the function 
> and leading to less spill and fetch.
>

>>The argument registers should be not reserved.  They should be present in RTL 
>>and RA allocator will figure out itself when it can use them. 
>>That is how other ports work.

Thanks Vladimir for Clarifications.

> 2. The other aspect of the function argument registers is not spill 
> and fetch the argument registers as they are live across the function 
> call. But the liveness is limited to certain point of the called 
> f

RE: A Question About LRA/reload

2014-12-10 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Jeff Law
Sent: Tuesday, December 09, 2014 11:26 PM
To: Vladimir Makarov; lin zuojian; gcc@gcc.gnu.org
Subject: Re: A Question About LRA/reload

On 12/09/14 10:10, Vladimir Makarov wrote:
> generate the correct code in many cases even for x86.  Jeff Law tried 
> IRA coloring reusage too for reload but whole RA became slower 
> (although he achieved performance improvements on x86).
>>Right.  After IRA was complete, I'd walk over the unallocated allocnos and 
>>split their ranges at EBB boundaries.  That created new allocnos with a 
>>smaller conflict set and reduced the conflict set for the original 
>>unallocated allocnos.

Jeff: Is the above approach of splitting the ranges of unallocated allocnos
aggressive, or is it based on some heuristic such that the live ranges of
unallocated allocnos are not touched inside the EBBs?

>>After I'd done that splitting for all the EBBs, I called back into 
>>ira_reassign_pseudos to try to assign the original unallocated allocnos as 
>>well as the new >>allocnos.

>>To get good results, much of IRA's cost analysis had to be redone from 
>>scratch.  And from a compile-time standpoint, that's a killer.

>>The other approach I was looking at was a backwards walk through each block.
>>When I found an insn with an unallocated pseudo, that would trigger one of
>>various range splitting techniques to try and free up a hard register.  Then
>>again I'd call into ira_reassign_pseudos to try the allocations again.  This
>>got even better results, but was obviously even more compile-time expensive.

After the above splitting, do you build the conflict graph again to assign
the new allocnos? If the conflict graph is built again, that will affect the
compile time.

Thanks & Regards
Ajit
I don't think much, if any, of that work is relevant given the current 
structure and effectiveness of LRA.

jeff


Instruction scheduler with respect to prefetch instructions.

2014-12-13 Thread Ajit Kumar Agarwal
Hello All:

Since prefetch instructions have no direct consumers in the code stream,
they give the instruction scheduler considerable freedom. They are typically
assigned lower priorities than most other instructions in the code stream,
which tends to place all the prefetch instructions together in the final
schedule. This degrades performance by placing them in clumps rather than
evenly spreading them.

Evenly spreading the prefetch instructions gives better speedup ratios than
placing them in clumps, in the presence of dirty misses.

I am curious how the schedulers in GCC handle prefetch instructions and how
priorities are assigned to them in the GCC schedulers so that the prefetch
instructions are evenly spread.
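
To make the placement policy concrete, here is a toy post-pass that
redistributes clumped prefetches at an even stride over a linearized block
(purely illustrative; the markers and the stride rule are mine, not GCC's):

#include <stdio.h>

#define N 12

int main (void)
{
  /* 'p' marks a prefetch, '.' any other instruction.  The scheduler
     clumped all four prefetches at the end of the block.  */
  char clumped[N + 1] = "........pppp";
  char spread[N + 1];
  int n_pref = 4;
  int stride = N / n_pref;   /* one prefetch every three slots */
  int placed = 0;

  for (int slot = 0; slot < N; slot++)
    {
      if (placed < n_pref && slot % stride == 0)
        {
          spread[slot] = 'p';   /* drop the next prefetch here */
          placed++;
        }
      else
        spread[slot] = '.';
    }
  spread[N] = '\0';

  printf ("clumped: %s\nspread:  %s\n", clumped, spread);
  return 0;
}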

Please let me know what you think.

Thanks & Regards
Ajit



RE: Instruction scheduler with respect to prefetch instructions.

2014-12-19 Thread Ajit Kumar Agarwal


-Original Message-
From: paul_kon...@dell.com [mailto:paul_kon...@dell.com] 
Sent: Saturday, December 13, 2014 9:46 PM
To: Ajit Kumar Agarwal
Cc: vmaka...@redhat.com; l...@redhat.com; richard.guent...@gmail.com; 
gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; 
Nagaraju Mekala
Subject: Re: Instruction scheduler with respect to prefetch instructions.


> On Dec 13, 2014, at 5:22 AM, Ajit Kumar Agarwal 
>  wrote:
> 
> Hello All:
> 
> Since the prefetch instruction have no direct consumers  in the code 
> stream, they provide considerable freedom to the Instruction scheduler. They 
> are typically assigned lower priorities than most of the instructions in the 
> code stream.
> This tends to cause all the prefetch instructions to be placed 
> together in the final schedule. This causes the performance Degradations by 
> placing them in clumps rather than evenly spreading the prefetch instructions.
> 
> The evenly spreading the prefetch instruction gives better speed up 
> ratios as compared to be placing in clumps for dirty Misses.

>>I can believe that’s true for some processors; is it true for all of them?  I 
>>have the impression that some MIPS processors don’t mind clumped prefetches, 
>>so long as you don’t exceed the limit on total number of concurrently 
>>pending memory accesses.

I think it's okay to have clumped prefetches for architectures where
placement is decided based on prefetch distance, as long as it doesn't
exceed the concurrent pending memory accesses. But the clumped prefetches
produced by the scheduler, because prefetches have no direct consumers in
the code stream, sometimes do exceed the concurrent pending memory accesses,
unless the scheduler does something special, such as receiving information
passed from the prefetch generation phase.

Because of the freedom given to the instruction scheduler by the lack of
direct consumers in the code stream, it clumps the prefetch instructions at
the end of basic blocks, which invalidates the placement the prefetch
generation actually chose based on the prefetch distance.

Prefetch algorithms based on prefetch distance take care of the cases where
clumped prefetches degrade performance due to dirty misses.

My question is whether there is any special handling of prefetch
instructions in the instruction scheduler to overcome the above.
 
Thanks & Regards
Ajit

paul



Register Allocation with Instruction Scheduling.

2014-12-21 Thread Ajit Kumar Agarwal
Hello All:

I was going through the following article

" Register Allocation with instruction scheduling:  a new approach" by Pinter 
etal.

The phase ordering of register allocation and instruction scheduling is an
important topic. Scheduling before the register allocator increases the
register pressure, whereas instruction scheduling after register allocation
sees the false dependencies created by the register allocator. With false
dependencies between two instructions, they cannot be scheduled
simultaneously, reducing the instruction level parallelism. To achieve
better ILP for instruction scheduling after register allocation, the false
dependencies created by the register allocator should be reduced.

Let Gs(V,E) be the scheduling graph and G'(V,E) the scheduling graph after
register allocation; G'(V,E) contains the false dependencies created by the
register allocator. Another graph Gf(V,E) is created whose edges represent
the false dependencies: the edges of Gf are the intersection of the edges of
G'(V,E) with the edges of the complement of the scheduling graph Gs.

With the false dependency graph and the interference graph available, each
false dependency edge that does not already appear in the interference graph
is added to it; since its endpoints now interfere, the assignment phase will
give them different registers. Thus the false dependencies are removed, the
scheduling graph after the register allocator has no false dependencies, and
better ILP and scheduling are achieved. This overcomes the disadvantages of
instruction scheduling after register allocation.

This looks feasible to implement in GCC, and I would like to implement the
above if it is not already done in GCC.

Please let me know what you think.

Thanks & Regards
Ajit



Vectorization opportunities for conditional branches.

2015-01-04 Thread Ajit Kumar Agarwal

The following Fig (1) shows an implementation of the SSQ kernel from the
BLAS library in ATLAS. Fig (2) shows the conversion of the IF-THEN-ELSE in
Fig (1) to vectorized code. Normally in automatic vectorization, an
IF-THEN-ELSE is vectorized only after IF-CONVERSION, which converts control
flow to data flow and thus exposes the opportunity for vectorization.

The Fig(1) and Fig(2) is taken from the article

" Vectorization Past Dependent Branches through speculation" Qing Yi etal.

The conversion given in the above article checks the IF-THEN and IF-ELSE
parts separately as candidates for vectorization. If the IF-THEN part can be
vectorized and the IF-ELSE part cannot, the conversion given in Fig (2)
shows how the conditional branches can still be vectorized.

In the case below, the IF-THEN-ELSE need not be if-converted, and the
conditional branches are vectorized by the transformation shown in Fig (2).

ssq  = *ssq0;
scal = *scal0;

for (i = 0; i < N; i++)
{
    ax = x[i];
    ax = ABS & ax;
    if (ax <= scal)
    {
        t0 = ax / scal;
        ssq += t0 * t0;
    }
    else
    {
        t0 = scal / ax;
        t0 = t0 * t0;
        t1 = ssq * t0;
        ssq = 1.0 + t1;
        scal = ax;
    }
}
*ssq0  = ssq;
*scal0 = scal;

                FIG (1)

  Transformed to 

VECTOR_PROLOGUE:
VABS  = {ABS, ABS, ABS, ABS};
Vssq  = {ssq0, 0.0, 0.0, 0.0};
Vscal = {scal, scal, scal, scal};

VECTOR_LOOP:
for (i = 0; i < N4; i++)
{
    Vax = x[i : i+3];
    Vax = VABS & Vax;
    if (VEC_ANY(Vax > Vscal))
        goto SCALAR_RESTART;

    Vt0 = Vax / Vscal;
    Vssq += Vt0 * Vt0;
    continue;

SCALAR_RESTART:
    // Vector to scalar.
    ssq = sum(Vssq[0:3]);

    // Scalar loop.
    for (j = 0; j < 4; j++)
    {
        ax = x[i+j];
        ax = ABS & ax;
        if (ax > scal)
        {
            t0 = scal / ax;
            t0 = t0 * t0;
            t1 = ssq * t0;
            ssq = 1.0 + t1;
            scal = ax;
        }
        else
        {
            t0 = ax / scal;
            ssq += t0 * t0;
        }
    }
    Vssq  = {ssq, 0.0, 0.0, 0.0};
    Vscal = {scal, scal, scal, scal};
}
VECTOR_EPILOGUE:

ssq  = sum(Vssq[0:3]);
scal = Vscal[0];

                FIG (2)

This looks interesting to me: the IF-THEN-ELSE inside the loop is vectorized
as given above without if-conversion. I am not sure how many such code
sequences would be triggered in the SPEC benchmarks, but it looks
interesting to me to implement.

Thoughts, please?

Thanks & Regards
Ajit


IRA : Changes in the cost of putting allocno into memory.

2015-01-08 Thread Ajit Kumar Agarwal
Hello Vladimir:

 We have made the changes in the ira-color.c in ira_loop_edge_freq and 
move_spill_restore. The main motivation 
behind the change is to reduce the memory instruction with respect to the 
Loops. The changes are done to not to 
consider the back edge frequency if there are no references of the allocno 
inside the loop even though the allocno
are live out at the exit of the Loop and Live In at the entry of the Loop.  
This will reduce the calculation of the cost of
putting allocno into memory. If we reduce the cost of putting allocno into 
memory, then there are chances of getting
color in the simplification phase of the gcc IRA register allocator. Enabling 
this, there are chances of putting the allocno 
in the register and the splitting phase will split the live range across the 
Loop boundary. This will beneficial with respect 
to reducing memory instruction inside the Loop. Other changes is done to enable 
some of the allocno to allocate in
memory if the cost is less than equal to zero.

This is the first phase of the change in the IRA coloring module; I would
like to get your feedback on this change, based on which the actual patch
will be sent. The changes are accompanied by the patch created for them,
which should make the change easier for you to review.

We have tested the changes for the MicroBlaze target on the MiBench and
EEMBC benchmarks; there is no degradation in the geomean, and some of the
MiBench and EEMBC benchmarks benefit from this change. Please review the
changes and let me know your feedback and the other steps needed to
incorporate the changes upstream. We would like to get this change in for
MicroBlaze; please also let us know if the changes need to be enabled by a
switch flag.

Changes are given below.

From 758ee2227e9dde946ac35b772bee99279b1bf996 Mon Sep 17 00:00:00 2001
From: Ajit Kumar Agarwal 
Date: Tue, 6 Jan 2015 19:42:16 +0530
Subject: [PATCH] IRA : Changes in the cost of putting allocno into memory.

Changes are made to not consider the back edge frequency in the cost
calculation if the allocno is live out at the exit of the loop and live in
at the entry of the loop and there are no references inside the loop.
Further changes are made to allow some of the allocnos to go into memory if
the cost is less than or equal to 0.

ChangeLog:
2015-01-06  Ajit Agarwal  

* ira-color.c (move_spill_restore): Check for cost less than or
equal to 0.
(ira_loop_edge_freq): Ignore the loop edge frequency if there are
no references to the allocno inside the loop.

Signed-off-by:Ajit Agarwal ajit...@xilinx.com
---
 gcc/ira-color.c |   22 +++---
 1 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/gcc/ira-color.c b/gcc/ira-color.c
index 39387c8..d180e86 100644
--- a/gcc/ira-color.c
+++ b/gcc/ira-color.c
@@ -2409,6 +2409,8 @@ int
 ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int regno, bool exit_p)
 {
   int freq, i;
+  bitmap_iterator bi;
+  unsigned int j;
   edge_iterator ei;
   edge e;
   vec edges;
@@ -2423,7 +2425,14 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int 
regno, bool exit_p)
&& (regno < 0
|| (bitmap_bit_p (df_get_live_out (e->src), regno)
&& bitmap_bit_p (df_get_live_in (e->dest), regno
- freq += EDGE_FREQUENCY (e);
+   {
+ EXECUTE_IF_SET_IN_BITMAP (loop_node->all_allocnos, 0, j, bi)
+ {
+   if(regno ==  ALLOCNO_REGNO (ira_allocnos[j]))
+ if (ALLOCNO_NREFS (ira_allocnos[j]) != 0)
+   freq += EDGE_FREQUENCY (e);
+ }
+   }
 }
   else
 {
@@ -2432,7 +2441,14 @@ ira_loop_edge_freq (ira_loop_tree_node_t loop_node, int 
regno, bool exit_p)
if (regno < 0
|| (bitmap_bit_p (df_get_live_out (e->src), regno)
&& bitmap_bit_p (df_get_live_in (e->dest), regno)))
- freq += EDGE_FREQUENCY (e);
+  {
+EXECUTE_IF_SET_IN_BITMAP (loop_node->all_allocnos, 0, j, bi)
+{
+  if(regno ==  ALLOCNO_REGNO (ira_allocnos[j]))
+if(ALLOCNO_NREFS (ira_allocnos[j]) != 0)
+  freq += EDGE_FREQUENCY (e);
+ }
+  }
   edges.release ();
 }
 
@@ -3441,7 +3457,7 @@ move_spill_restore (void)
 * (exit_freq + enter_freq));
}
}
- if (cost < 0)
+ if (cost <= 0)
{
  ALLOCNO_HARD_REGNO (a) = -1;
  if (internal_flag_ira_verbose > 3 && ira_dump_file != NULL)
-- 
1.7.1

Thanks & Regards
Ajit


RE: Support for architectures without hardware interlocks

2015-01-08 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Joel 
Sherrill
Sent: Thursday, January 08, 2015 8:59 PM
To: Eric Botcazou; Claudiu Zissulescu
Cc: gcc@gcc.gnu.org; David Kang
Subject: Re: Support for architectures without hardware interlocks


On 1/8/2015 9:01 AM, Eric Botcazou wrote:
>> I've worked on a gcc target that was porting an architecture without 
>> hardware interlock support. Basically, you need to emit nop 
>> operations to avoid possible hw conflicts. At the moment, this was 
>> done by patching the gcc scheduler to do so, Another issue to keep is 
>> to check for hardware conflicts across basic-block boundaries. And 
>> not the last, is to prohibit/avoid any instruction stream 
>> modification after scheduler (e.g., peephole optimizations etc.).
> That's an overly complex approach, this usually can be done in a 
> simpler way with a machine-specific pass that runs at the end of the RTL 
> pipeline.
>
>>Isn't this similar to needing to fill a delay slot after a branch 
>>instruction? My recollection is that some SPARC and MIPS have to deal with 
>>that.

I think this can be done at the machine description (md file) level with
define_delay, where the delay slot requirements can be described.

Thanks & Regards
Ajit
-- 
Joel Sherrill, Ph.D. Director of Research & Development
joel.sherr...@oarcorp.comOn-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available(256) 722-9985



Allocating some Loop allocno in memory

2015-01-10 Thread Ajit Kumar Agarwal

I was thinking of some opportunities to reduce spills inside loops. If a
live range (allocno) spans a loop, is live out at the exit of the loop, and
has no references inside the loop, assign the allocno to memory. This
increases the chances of getting a register for the other allocnos that span
the loop and do have references inside the loop, when the interference graph
is not colorable.

Since the allocno has no references inside the loop, there won't be any load
instructions to restore it inside the loop; there will be a store
instruction for the spill, and it will be outside the loop. This reduces the
conflict edges of some allocnos, increasing the chances of making the
interference graph colorable, and reduces the spills and restores inside the
loop.

This heuristic looks reasonable, and it goes side by side with live range
splitting across the loop boundary. On top of this heuristic, if after live
range splitting across the loop boundary the interference graph is still not
colorable, we can assign some of the split live ranges to memory, giving the
registers a chance to go to the live ranges (allocnos) that span the loop
and have references inside it.

We can change the cost of putting an allocno in memory based on the above
heuristic, taking this factor into consideration in the cost calculation.

Thoughts, please?

Thanks & Regards
Ajit


RE: Allocating some Loop allocno in memory

2015-01-11 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Sunday, January 11, 2015 8:05 PM
To: Ajit Kumar Agarwal; vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Allocating some Loop allocno in memory

On January 11, 2015 5:25:23 AM CET, Ajit Kumar Agarwal 
 wrote:
>
>I was thinking of some of the opportunities with respect to reducing 
>spills inside the Loop. If the Live range(allocno) spans through the 
>Loop and Live out at the exit of the Loop and there are no references 
>or not being touched upon inside the Loop, assign the allocno to the 
>memory. This increase the chances of getting the register for other 
>allocno which spans through the Loop and have reference inside the Loop 
>if the interference graph is not colorable.
>
>Since allocno don't have references inside the Loop there won't be any 
>load instructions with respect to restore inside the loops. There will 
>be a store  instruction with respect to spill which will be outside the 
>Loop. This will reduce the conflicting edges of some of the allocno 
>increasing the chances of making colorable of the Interference graph 
>and reduces the spills and restore inside the Loop.
>
>This heuristics looks reasonable. This heuristics goes side by side 
>with the Live range splitting across the Loop boundary.
>On top of this heuristics, after the Live range splitting across the 
>Loop boundary there interference graph is not colorable then we can 
>assign some of the  splitted live ranges in the memory giving chances 
>of registers  for the Live ranges(allocno) that spans the Loop and have 
>reference inside the Loop.
>
>We can change the cost of allocno in memory based on the above 
>heuristics and take the above factor into consideration for the cost 
>calculation.
>
>Thoughts Please?

>>How can this result in better allocations if you only ever spill at 
>>life-range split points? The life-range covering the loop should already be 
>>assigned to >>memory if required.

If a live range covering the loop is assigned to memory and there are 
references inside the loop, then there are spills inside the loop, which 
degrades performance. If the live range covers the loop but has no references 
inside it, assigning it to memory is reasonable: with no references there is 
no load instruction to restore it, and the store for the def of the live range 
sits outside the loop. Performance is therefore not degraded, since no load or 
store for spill and restore is needed inside the loop.

The above heuristic can also be accompanied by a heuristic on the number of 
use points in the live range. Considering both will lead to better allocation 
when we only ever spill at live range split points, and it increases the 
chances of making the interference graph colorable without degrading 
performance. Also, if the live range is split at the loop boundary, spilling 
at those split points and assigning registers to the in-loop part remains 
reasonable as long as there are no references inside the loop.

Thanks & Regards
Ajit

Richard.

>Thanks & Regards
>Ajit




RE: Allocating some Loop allocno in memory

2015-01-12 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Monday, January 12, 2015 2:33 PM
To: Ajit Kumar Agarwal
Cc: vmaka...@redhat.com; l...@redhat.com; gcc@gcc.gnu.org; Vinod Kathail; Shail 
Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Allocating some Loop allocno in memory

On Sun, Jan 11, 2015 at 4:45 PM, Ajit Kumar Agarwal 
 wrote:
>
>
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Sunday, January 11, 2015 8:05 PM
> To: Ajit Kumar Agarwal; vmaka...@redhat.com; l...@redhat.com; 
> gcc@gcc.gnu.org
> Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju 
> Mekala
> Subject: Re: Allocating some Loop allocno in memory
>
> On January 11, 2015 5:25:23 AM CET, Ajit Kumar Agarwal 
>  wrote:
>>
>>I was thinking of some of the opportunities with respect to reducing 
>>spills inside the Loop. If the Live range(allocno) spans through the 
>>Loop and Live out at the exit of the Loop and there are no references 
>>or not being touched upon inside the Loop, assign the allocno to the 
>>memory. This increase the chances of getting the register for other 
>>allocno which spans through the Loop and have reference inside the 
>>Loop if the interference graph is not colorable.
>>
>>Since allocno don't have references inside the Loop there won't be any 
>>load instructions with respect to restore inside the loops. There will 
>>be a store  instruction with respect to spill which will be outside 
>>the Loop. This will reduce the conflicting edges of some of the 
>>allocno increasing the chances of making colorable of the Interference 
>>graph and reduces the spills and restore inside the Loop.
>>
>>This heuristics looks reasonable. This heuristics goes side by side 
>>with the Live range splitting across the Loop boundary.
>>On top of this heuristics, after the Live range splitting across the 
>>Loop boundary there interference graph is not colorable then we can 
>>assign some of the  splitted live ranges in the memory giving chances 
>>of registers  for the Live ranges(allocno) that spans the Loop and 
>>have reference inside the Loop.
>>
>>We can change the cost of allocno in memory based on the above 
>>heuristics and take the above factor into consideration for the cost 
>>calculation.
>>
>>Thoughts Please?
>
>>>How can this result in better allocations if you only ever spill at 
>>>life-range split points? The life-range covering the loop should already be 
>>>assigned to >>memory if required.
>
> If the live range covering the loop assigned to memory, and there are 
> references inside the Loops  then there are chances of spills inside 
> the loop which degrades the performance.  If live range covering the 
> loops and there are no references inside the Loop, then assigning to 
> memory make reasonable as there are no references. Since there are no 
> references or not touched inside the Loop, there won't be load instruction to 
> restore which are required if there are any references. The store which 
> requires for the def of the Live range will be outside the Loops. This helps 
> in not degrading the performances as there are no load and store required for 
> spill and restore inside the Loops.
>
> The above heuristics can also be accompanied with heuristics of the 
> number of use points in the Live range. Considering both  the 
> heuristics will lead to better allocation if ever spill at the live 
> range split points and there are chances of getting colorable  the 
> Interference graph without degrading the performance. Also if the Live range 
> is splitted at the Loop boundary then the spill  at the split points of loop 
> boundary and registers is assigned to this live range will make it reasonable 
> if there are no references inside the Loops.

>>But if we split the life-range of not used in-loop-body pseudos at the loop 
>>boundary the allocator should do the assignment to memory.
>>There is no need to add another heuristic here.  Or do you say we don't split 
>>the life-range of such pseudos at loop boundary?

In the spill cost calculation we see the function ira_loop_edge_freq triggered 
for live ranges that are live at the exit of a loop but are not referenced 
inside it, and the frequency is still factored into the spill cost update for 
such live ranges. We have made a change not to count the frequency of such 
live ranges in the spill cost update, reducing their cost and giving them a 
chance to be placed in memory, so that the conflicting allocnos get a chance 
at the registers. The conflicting allocnos that do have references inside the 
loop then have the edge in getting registers.
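In rough outline the change looks like this (a hedged sketch of the idea, not
the actual patch; everything except ira_loop_edge_freq is an illustrative
name):

  /* Count the loop-edge frequency in the spill cost only for pseudos
     that are actually referenced inside the loop; the predicate
     pseudo_has_refs_in_loop_p is hypothetical and stands in for the
     real check.  */
  if (pseudo_has_refs_in_loop_p (regno, loop_node))
    cost += (ira_loop_edge_freq (loop_node, regno, false)
             + ira_loop_edge_freq (loop_node, regno, true))
            * memory_move_cost;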

The changes are made in the following.

 https://gcc.gnu.org/ml/gcc/2015-01/msg00033.html   

Thanks & Regards
Ajit

Richard.

> Thanks & Regards
> Ajit
>
> Richard.
>
>>Thanks & Regards
>>Ajit
>
>


Optimal Coalescing with respect to move instruction for Live range splitting

2015-01-17 Thread Ajit Kumar Agarwal
Register allocation with a two-phase approach does coalescing after spilling. 
Sometimes live range splitting makes the coalescing non-optimal: the split 
live ranges are connected by move instructions, so live range splitting, and 
aggressive live range splitting in particular, increases the number of moves 
and degrades the coalescing.

Optimal coalescing should take the frequency of the regions into 
consideration. The higher the frequency of the region, the higher the cost 
associated with the split live ranges. When that cost grows, we should 
coalesce aggressively and remove the move instruction by giving the split live 
ranges connected by the move the same color (hard register).

I am thinking of adding such heuristics to the GCC top-down region-based 
register allocator for better coalescing, thereby reducing the moves 
connecting the split live ranges in the more frequently executed code.
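A minimal sketch of the heuristic (all names are illustrative, not GCC
identifiers):

#include <stdbool.h>

struct live_range;                      /* opaque split live range */

/* Hypothetical interference check between two split live ranges.  */
extern bool ranges_conflict_p (const struct live_range *a,
                               const struct live_range *b);

struct split_move
{
  struct live_range *src, *dst;         /* ranges connected by the move */
  double region_freq;                   /* frequency of the enclosing region */
};

/* Coalesce (assign one hard register and delete the move) when the move
   sits in a hot enough region and the two ranges do not interfere.  */
static bool
should_coalesce_p (const struct split_move *m, double freq_threshold)
{
  return m->region_freq >= freq_threshold
         && !ranges_conflict_p (m->src, m->dst);
}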

Thoughts please ?

Thanks & Regards
Ajit


Rematerialization and Live Range Splitting on Region Frequency

2015-01-25 Thread Ajit Kumar Agarwal
Hello All:

Live range splitting and rematerialization look connected to each other. If 
the live range split point lies in a high-frequency region, then the move 
connecting the split live ranges sits inside that high-frequency region, which 
is a performance bottleneck for many benchmarks.

Live range splitting should therefore consider the frequency of the region. 
Splitting in a high-frequency region is beneficial if the split live range is 
assigned a color (hard register), which beats spilling inside the 
high-frequency region even though it adds a move (shuffle code). If one of the 
split live ranges has no use points while all its partner live ranges get hard 
registers, then the move introduced by the split is a pure cost in the 
high-frequency region. In that case the split point should be moved up to the 
boundary where the low-frequency region transitions into the high-frequency 
region, with the split live ranges still getting hard registers. This requires 
an incremental colorability check, which increases compile time but pays off 
at run time. The above heuristic should be incorporated on top of the 
following live range splitting algorithm: consider the blocks in decreasing 
order of frequency, starting with a block from the high-frequency region, and 
update incrementally until the allocation becomes colorable. Split points thus 
fall on the edges where execution transitions from a high-frequency to a 
low-frequency region or from a low-frequency to a high-frequency region.

The above live range splitting should be incorporated in all the flavors of 
graph coloring.

Regarding rematerialization: Chaitin's rematerialization recalculates the 
expression at every use point of the live range, while the Simpson-based 
approach moves the arithmetic instruction down to the use points (or their 
basic blocks), provided the operands of the arithmetic instruction are not 
modified along the blocks of the live range.

Considering all the effects of rematerialization, the remat point (the 
recalculation) should be placed at the split points, instead of Chaitin's 
approach of rematerializing at every use point or the Simpson approach of 
moving the arithmetic instruction down to the use points when its operands are 
untouched.
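A small illustration of the idea (a hedged sketch; it assumes base is not
modified in the region):

  t = base + 16;   /* original def */
  ...              /* high-pressure region: t would otherwise be spilled
                      before the split point and reloaded after it */
  /* at the live range split point, recompute instead of reloading: */
  t = base + 16;   /* remat: one add instead of a load from the stack slot */
  use (t);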

The above approaches look feasible to implement, taking the high-frequency 
regions into consideration.

Thoughts Please ?

Thanks & Regards
Ajit


RE: Rematerialization and Live Range Splitting on Region Frequency

2015-01-28 Thread Ajit Kumar Agarwal
Thanks Vladimir for the inputs. It is quite helpful.

Thanks & Regards
Ajit

-Original Message-
From: Vladimir Makarov [mailto:vmaka...@redhat.com] 
Sent: Tuesday, January 27, 2015 1:10 AM
To: Ajit Kumar Agarwal; l...@redhat.com; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Rematerialization and Live Range Splitting on Region Frequency


On 2015-01-25 4:55 AM, Ajit Kumar Agarwal wrote:
> Hello All:
>
> Looks like Live range splitting and rematerialization are connected to 
> each other. If  the boundary of Live range Splitting is in the high 
> frequency of the region then the move connected to splitted live ranges are 
> inside the High frequency region which is the performance bottleneck for many 
> benchmarks.
>
> Live range splitting based on the frequency of the region should be 
> considered. The Live range splitting in the High frequency region is 
> beneficial if the splitted live range is assigned the color(hard 
> registers) which is better Spilling inside the high frequency region, 
> although there will be move instruction or shuffle code which is still 
> Better.  If one of the splitted live range does not have any use 
> points and all the partner live ranges gets the Hard register, then 
> the move instruction due to splitting will be costly for the high 
> frequency region. In such Case the  split point should be move up at 
> the boundary of the transition from low frequency region to high 
> Frequency region, and the splitted live ranges still get hard 
> registers.  This require increment check of colorabity which increases the 
> compile time but beneficial with respect to run time. The above heuristic 
> should be incorporated on top of the below  Live range splitting Algorithm. 
> Live range splitting algorithm should  consider the blocks in the decreasing 
> order of frequency with the first block should be taken from the high 
> frequency region and incrementally updates till it become colorable. Thus 
> split points should be at the edge of the transition from high frequency to 
> low frequency region or from low frequency region to high frequency region.
>
> The above Live range splitting should be incorporated  for all the flavors of 
> Graph Coloring.
>
> Regarding the rematerialization the Chaitin's Rematerialization try to 
> recalculate the expression at all the Use points of the Live ranges 
> and Simpson based approach for Rematerialized try to move the 
> arithmetic Instruction lower to use points or the basic blocks considering 
> the operands of Arithmetic instructions is Not touched along the blocks of 
> the Live range.
>
> Considering all the effects of rematerialization, The remat point or 
> the recalculation should be done at the split points instead of 
> Chaitin's approach of remat at every use points and the Simpson approach of 
> operands not being touched upon and the movement of arithmetic instruction 
> later at the use points.
>
> The above approaches looks feasible to implement consider the high frequency 
> region into consideration.
>
> Thoughts Please ?
>
>
Ajit, nobody can tell you for sure what the final results of your proposal can 
be.  I usually try a lot of things and most of them are rejected because the 
results are not what I expected.

I personally implemented Simpson's register pressure decrease through 
rematerialization twice.  The first time was long ago (as I remember >
10 years ago) and that time it worked not bad (as I remember it gave 1% for x86 
- you can find the exact data in my GCC summit paper "Fighting register 
pressure in GCC").  It worked well because the old RA was quite bad (imho the 
problem of most research articles in compiler optimization field was/is in 
usage of some sub-par compiler where a new good optimization shines in 
environment of existing simple ones or because of absence of many other 
optimizations).

   The second time, I reimplemented CFG-sensitive rematerialization (as a 
separate pass) last year and the results were worse than without it.  
Therefore I had to implement a rematerialization subpass in LRA, which really 
improves the code.  That is because the current RA is pretty good.  Even 
though we have register pressure evaluation (which was absent in the old RA), 
I believe it is very inaccurate, as IRA uses sophisticated coloring criteria 
which are actually based on dynamically intersected register classes (more 
accurately approximated by a dynamic register class inclusion tree).  It is 
also very hard to predict a lot of decisions in LRA.

   All the above is also true for the live range splitting implemented as a 
separate pass.

   There is a good point in your email that rematerialization should work 
better when it is done not for each use but for some 

Unrolling factor heuristics for Loop Unrolling

2015-02-12 Thread Ajit Kumar Agarwal
Hello All:

Loop unrolling without a good unrolling factor heuristic becomes a performance 
bottleneck. An unrolling factor heuristic based on the minimum initiation 
interval (MII) is quite useful for better ILP. The MII, derived from the 
recurrence and resource calculations on the data dependency graph, together 
with register pressure, can drive the unrolling factor heuristic. To achieve 
better ILP for a given schedule, loop unrolling and scheduling are 
interdependent; this interplay is widely used in the software pipelining 
literature, alongside the more granular list and trace scheduling.

The recurrence calculation is based on the loop-carried dependences, and the 
resource calculation models simultaneous access to resources using a 
reservation table; together they give good heuristics for calculating the 
unrolling factor. Both are already captured in the MII calculation.

Along with the MII, register pressure should also be considered in the 
heuristic for the unrolling factor.
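A compact sketch of the proposed calculation (illustrative C; the interfaces
are assumed, not taken from GCC):

/* rec_mii: recurrence-constrained II from loop-carried dependences in the
   DDG; res_mii: resource-constrained II from the reservation table;
   longest_lr: longest live range length in cycles.  */
static int
compute_unroll_factor (int rec_mii, int res_mii, int longest_lr,
                       int pressure_per_copy, int available_regs)
{
  int mii = rec_mii > res_mii ? rec_mii : res_mii;  /* MII = max of the two */
  int factor = (longest_lr + mii - 1) / mii;        /* ceil (LR length / MII) */

  /* Register pressure heuristic: shrink the factor while the unrolled
     body is estimated to exceed the available registers.  */
  while (factor > 1 && pressure_per_copy * factor > available_regs)
    factor--;
  return factor;
}

The first term is the classic longest-live-range over initiation-interval rule
from the software pipelining literature; the loop that shrinks it is the added
register pressure cap.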

The main advantage of the above heuristic is that it can be implemented at the 
code generation level, whereas loop unrolling currently happens much earlier. 
Keeping the existing unrolling at the loop optimizer level, the heuristic can 
then drive a second round of unrolling at the code generation level, where 
register pressure can be calculated accurately, as the register allocator 
does. This looks like a feasible solution, and it is what I am proposing.

This gives loop unrolling at the optimizer level plus at the code generation 
level. This two-level unrolling is quite useful and overcomes the shortcomings 
of unrolling at the optimizer level alone.

The SPEC benchmarks are better candidates for the above heuristics than 
MiBench and EEMBC.

Thanks & Regards
Ajit


Function outlining and partial Inlining

2015-02-12 Thread Ajit Kumar Agarwal
Hello All:

Large functions are an important part of high-performance applications, and 
they contribute to performance bottlenecks in many respects. Some large hot 
functions are frequently executed, yet many regions inside them are cold. A 
large function blocks inlining because of the code size constraints.

Such cold regions inside hot large functions can be extracted out as outlined 
functions. Breaking a large function into smaller function segments lets the 
remaining hot part be inlined at the call site, i.e. it enables partial 
inlining, as illustrated below.
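As a small illustration (hypothetical code, not from any benchmark), outlining
the cold region leaves a hot header small enough to inline:

/* Before: the long cold error path makes process () too big to inline.  */
int process (int x)
{
  if (__builtin_expect (x < 0, 0))
    {
      /* long, rarely executed error handling ...  */
      return -1;
    }
  return x * 2;                 /* hot path */
}

/* After outlining: the cold region becomes its own function and the
   remaining hot header can be inlined at every call site.  */
static int process_cold_part (int x)
{
  /* long, rarely executed error handling ...  */
  return -1;
}

static inline int process (int x)
{
  if (__builtin_expect (x < 0, 0))
    return process_cold_part (x);
  return x * 2;
}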

The LLVM compiler has functionality for function outlining based on regions 
(basic blocks, superblocks and hyperblocks) that get extracted into smaller 
function segments, enabling partial inlining and function inlining at the call 
site.

This optimization is a good candidate for profile-guided optimization, driven 
by the profile feedback data seen by the compiler. Without profile 
information, function outlining of this kind will not be useful.

We already do a lot of optimization around polymorphism, including indirect 
call promotion based on profile feedback from the call graph profile.

Are we doing this function outlining optimization in GCC, with respect to 
function inlining and partial inlining based on profile feedback data? If it 
is already implemented, can I get a pointer to the code and the scope of the 
optimization in GCC?

If it is not implemented, I would like to propose a function outlining 
optimization for GCC.

Thoughts Please?

Thanks & Regards
Ajit



unaligned memory access for vectorization

2015-02-12 Thread Ajit Kumar Agarwal
Hello All:

Unaligned array accesses are a blocking factor for vectorization, because 
unaligned SIMD loads and stores are costly operations.

To enable vectorization in the presence of possibly unaligned array accesses, 
loop peeling is used to multiversion the loop: a peeled (scalar, 
non-vectorized) loop handles the iterations up to the alignment boundary, and 
the main loop is then vectorized with aligned accesses. Loop multiversioning 
in this form avoids generating unaligned moves.
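A minimal sketch of the peeling variant (the 16-byte alignment target is a
hypothetical choice; the real vectorizer picks the strategy per target):

#include <stdint.h>

void add1 (float *a, int n)
{
  int i = 0;

  /* Peeled scalar prologue: run until the access is 16-byte aligned.  */
  while (i < n && (((uintptr_t) (a + i)) & 15) != 0)
    {
      a[i] += 1.0f;
      i++;
    }

  /* Main loop: every a[i] here is aligned, so the vectorizer can use
     aligned SIMD loads and stores.  */
  for (; i < n; i++)
    a[i] += 1.0f;
}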


Can I know the scope of the above optimization and get a pointer to the code 
in GCC where it is implemented? If it is not implemented, it would be good to 
have.

Thoughts Please?

Thanks & Regards
Ajit 




RE: Function outlining and partial Inlining

2015-02-16 Thread Ajit Kumar Agarwal


-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Thursday, February 12, 2015 10:34 PM
To: Ajit Kumar Agarwal
Cc: hubi...@ucw.cz; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Function outlining and partial Inlining

> Hello All:
> 
> The large functions are the important part of high performance 
> application. They contribute to performance bottleneck with many 
> respect. Some of the large hot functions are frequently executed but many 
> regions inside the functions are cold regions. The large Function blocks the 
> function inlining  to happen before of the code size constraints.
> 
> Such cold regions inside the hot large functions can be extracted out 
> and form the function outlining. Thus breaking the large functions Into 
> smaller function segments which causes the functions to be inlined at the 
> caller site or helps in partial inlining.
> 
> LLVM Compiler has the functionality and the optimizations for function 
> outlining based on regions like basic blocks, superblocks and 
> Hyperblocks which gets extracted out into smaller function segments and thus 
> enabling the partial inlining and function inlining to happen At the caller 
> site.
> 
> This optimization is the good case of profile guided optimizations and based 
> on the profile feedback data by the Compiler.
> Without profile information the above function outlining optimizations will 
> not be useful.
> 
> We are doing lot of optimization regarding polymorphism and also the 
> indirect icall promotion based on the profile feedback on the Callgraph 
> profile.
> 
> Are we doing the function outlining optimization in GCC with respect 
> to function inline and partial inline based on profile feedback Data. 
> If not this optimization can be implemented. If already implemented in GCC  
> Can I  know any pointer for such code in GCC and the Scope of this function 
> outlining optimization.

>>The outlining pass is called ipa-split.  The heuristic used is however quite 
>>simplistic and it looks for very specific case where you have small header of 
>>a function containing conditional and >>splits after that.  It does use 
>>profile.

Thanks! I have gone through the code in ipa-split. It has the infrastructure 
for function outlining with respect to routines with early exits. Early exits 
from a function expose cold regions based on the profile data: the profile 
data for an early return statement and the region it guards marks the rest of 
the function as cold. Early exits lead to single-entry multiple-exit regions, 
where the code beyond the early exits can be treated as cold regions.

>>Any work on improving the heuristics or providing interesting testcases to 
>>consider would be welcome.

There are many large applications that form single-entry multiple-exit 
regions. In the h264ref benchmark in SPEC 2006 there are many early-exit 
routines, such as SubPelBlockMotionSearch, which are single-entry 
multiple-exit candidates with break statements instead of returns. I don't 
have other examples at hand, but there should be many with early returns, 
mainly around IF-THEN-ELSE regions.

>>I think LLVM pass is doing pretty much the same analysis minus the profile 
>>feedback considerations.  After splitting, LLVm will inline the header into 
>>all callers while GCC leaves this on the >>decision of inliner heuristics 
>>that may just merge the function back into one block.

>>The actual outlining logic is contained in tree-inline.c and also used by 
>>OpenMP.

Thanks!

Thanks & Regards
Ajit
Honza
> 
> If not implemented , Can I propose to have the optimization like function 
> outlining in GCC.
> 
> Thoughts Please?
> 
> Thanks & Regards
> Ajit


Tree SSA If-combine optimization pass in GCC

2015-02-17 Thread Ajit Kumar Agarwal
Hello All:

I can see the if-combining (if-merging) optimization pass on the tree-SSA form 
of the intermediate representation. If-combine merges IF-THEN-ELSE constructs 
whose condition expressions are found to be congruent or similar.

If-combine happens when the two IF-THEN-ELSE constructs are contiguous. If 
they are not contiguous but separated by code in between, does if-combine 
handle that case? It would require head duplication and tail duplication of 
the code in between to bring the two IF-THEN-ELSE constructs next to each 
other, as in the sketch below.
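For instance (a hypothetical sketch; C is assumed not to modify p), duplicating
the code between the two ifs makes them contiguous so if-combine can merge
them:

/* before: C separates two ifs on the same condition */
if (p) a (); else b ();
C;
if (p) d (); else e ();

/* after tail-duplicating C into the arms of the first if */
if (p) { a (); C; } else { b (); C; }
if (p) d (); else e ();

/* the contiguous ifs can now be combined */
if (p) { a (); C; d (); } else { b (); C; e (); }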

After head and tail duplication of the code in between, the IF-THEN-ELSE 
sequence becomes contiguous. Apart from this, does the tree-ssa-ifcombine pass 
consider the control flow of the body of the IF-THEN-ELSE? Is there any 
limitation on the control flow of the body?

Can I know the scope of the tree-ssa-ifcombine optimization pass with respect 
to the above points?

Thoughts Please?

Thanks & Regards
Ajit


Cost Calculation on Loop Invariant on Arithmetic operations on RTL

2015-02-17 Thread Ajit Kumar Agarwal
Hello All:

I can see that the RTL loop invariant pass in GCC considers register pressure 
and manipulates the cost with respect to the SET destination node in RTL.

The check below concerns the address arithmetic candidates of loop invariance.

In the function get_inv_cost, I can see the following check for the update of 
comp_cost, where comp_cost gets incremented for all the dependences on the def 
node of the invariant variable.

if (!inv->cheap_address || inv->def->n_addr_uses < inv->def->n_uses)
  (*comp_cost) += inv->cost * inv->eqno;

Is there a specific reason for the check that the address uses are fewer than 
the total uses for an address arithmetic candidate of loop invariance? I think 
we could be more aggressive and do the cost calculation without this check.

One more point: the above cost calculation should not be considering the 
register pressure costs.

Thoughts Please ?

Thanks & Regards
Ajit


RE: Tree SSA If-combine optimization pass in GCC

2015-02-17 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Tuesday, February 17, 2015 3:42 PM
To: Ajit Kumar Agarwal
Cc: gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; 
Nagaraju Mekala
Subject: Re: Tree SSA If-combine optimization pass in GCC

On Tue, Feb 17, 2015 at 9:22 AM, Ajit Kumar Agarwal 
 wrote:
> Hello All:
>
> I can see the IF-combining (If-merging) pass of optimization on tree-ssa form 
> of intermediate representation.
> The IF-combine or merging takes of merging the IF-THEN-ELSE if the 
> condition Expr found be congruent or Similar.
>
> The IF-combine happens if the two IF-THEN-ELSE are contiguous to each other.
> If the IF-THEN-ELSE happens to be not contiguous but are wide apart with 
> there is code in between.
> Does the If-combine takes care of this. This requires to do the 
> head-duplication and Tail-duplication for the Code in between If-THEN-ELSE to 
> bring the IF-THEN-ELSE contiguous to each other.
>
> After the head and tail duplication of the code in between the 
> IF-THEN-ElSE sequence becomes contiguous to each other. Apart from 
> this, Does the tree-ssa-if-combine pass considers the control flow of the 
> body of the IF-THEN-ELSE. Is there any limitation on control flow of the body 
> of the IF-THEN-ELSE.
>
> Can I know the scope of tree-ssa-ifcombine optimizations pass with respect to 
> the above points.
>
> Thoughts Please?

>>if-combine is a simple CFG + condition pattern matcher.  It does not perform 
>>head/tail duplication.  Also there is no "control flow" in the bodies, 
>>control flow is part of the CFG that is >>matched so I'm not quite getting 
>>your last question.

Thanks! My last question was: if there is control flow such as loops inside 
the IF-THEN-ELSE, which can happen when loop unswitching places the loop body 
inside the IF-THEN-ELSE, can the two IF-THEN-ELSE constructs then be merged 
when the cond_exprs match and the control flow of the bodies matches?

There are many cases in the SPEC 2006 benchmarks where if-combine could be 
enabled if the if-then-else sequence were made contiguous by head/tail 
duplication.

>>if-combine was designed to accompany IL-only patterns that get partly 
>>translated into control flow.  Like

  >>tem1 = name & bit1;
  >>tem2 = name & bit2;
  >>tem3 = tem1 | tem2;
  >>if (tem3)
...

>>vs.

  >>tem1 = name & bit1;
  >>if (tem1)
   >>goto x;
  >>else
>>{
  >>tem2 = name & bit2;
  >>if (tem2)
   >> goto x;
>>}

>>x:
   >>...
Thanks for the examples. This explains the scope of the if-combine 
optimization pass.

Thanks & Regards
Ajit

Richard.

> Thanks & Regards
> Ajit


RE: Tree SSA If-combine optimization pass in GCC

2015-02-17 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Tuesday, February 17, 2015 5:49 PM
To: Ajit Kumar Agarwal
Cc: gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; 
Nagaraju Mekala
Subject: Re: Tree SSA If-combine optimization pass in GCC

On Tue, Feb 17, 2015 at 11:26 AM, Ajit Kumar Agarwal 
 wrote:
>
>
> -Original Message-
> From: Richard Biener [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, February 17, 2015 3:42 PM
> To: Ajit Kumar Agarwal
> Cc: gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; Vidhumouli 
> Hunsigida; Nagaraju Mekala
> Subject: Re: Tree SSA If-combine optimization pass in GCC
>
> On Tue, Feb 17, 2015 at 9:22 AM, Ajit Kumar Agarwal 
>  wrote:
>> Hello All:
>>
>> I can see the IF-combining (If-merging) pass of optimization on tree-ssa 
>> form of intermediate representation.
>> The IF-combine or merging takes of merging the IF-THEN-ELSE if the 
>> condition Expr found be congruent or Similar.
>>
>> The IF-combine happens if the two IF-THEN-ELSE are contiguous to each other.
>> If the IF-THEN-ELSE happens to be not contiguous but are wide apart with 
>> there is code in between.
>> Does the If-combine takes care of this. This requires to do the 
>> head-duplication and Tail-duplication for the Code in between If-THEN-ELSE 
>> to bring the IF-THEN-ELSE contiguous to each other.
>>
>> After the head and tail duplication of the code in between the 
>> IF-THEN-ElSE sequence becomes contiguous to each other. Apart from 
>> this, Does the tree-ssa-if-combine pass considers the control flow of the 
>> body of the IF-THEN-ELSE. Is there any limitation on control flow of the 
>> body of the IF-THEN-ELSE.
>>
>> Can I know the scope of tree-ssa-ifcombine optimizations pass with respect 
>> to the above points.
>>
>> Thoughts Please?
>
>>>if-combine is a simple CFG + condition pattern matcher.  It does not perform 
>>>head/tail duplication.  Also there is no "control flow" in the bodies, 
>>>control flow is part of the CFG that is >>matched so I'm not quite getting 
>>>your last question.
>
> Thanks ! My last question was If there is a control flow likes  loops 
> inside the IF-THEN-ELSE, which could be possible if the Loop 
> unswitching is performed and the Loop body is placed inside the IF-THEN-ELSE, 
> then in that case the two IF-THEN-ELSE can be merged if the cond expr matches 
> and the control flow of the body of If-then-else matches?
>
> There are many cases in SPEC 2006 benchmarks where the IF-combine 
> could be enabled if the if-then-else sequence is made contiguous by 
> performing the head/tail duplication.

>>I'd be curious what those cases look like.  Care to file some bugreports with 
>>testcases?

This is not a bug report; it is a performance improvement opportunity in the 
h264ref SPEC 2006 benchmark. Here is the example.


var1 = funcptr;
for (...)
  {
    /* code here */
    for (...)
      {
        /* code here */
        for (...)
          {
            /* code here */

            if (funcptr == FastPely)
              FastPely ();
            else
              (*funcptr) ();

            /* there are 16 such IF statements */

            /* code here */
          } /* end for */
        /* code here */
      }     /* end for */
    /* code here */
  }         /* end for */

The funcptr has two targets, FastPely() and UMVPely(). After indirect call 
promotion the target is known to be either FastPely() or UMVPely().

The transformed code after indirect call promotion looks as follows.

var1 = funcptr;
for (...)
  {
    /* code here */
    for (...)
      {
        /* code here */
        for (...)
          {
            /* code here */

            if (var1 == FastPely)
              FastPely ();
            else
              UMVPely ();

            /* there are 16 such IF statements */

            /* code here */
          } /* end for */
        /* code here */
      }     /* end for */
    /* code here */
  }         /* end for */

After the indirect call promotion, FastPely or UMVPely can be inlined, since 
the target is known to be one of the two, and the calls become candidates for 
the inlining heuristics. As you can see in the transformed code, the 
IF-THEN-ELSE constructs (16 such IF statements) can be if-combined, merged, 
and then inlined.

You can also see that the code above the IF and below the FOR can be head- or 
tail-duplicated, after which the loop nest becomes a 3-level loop unswitching 
candidate. So it can become a loop unswitching candidate after the 
if-combine/merging.

I am planning to implement the above optimizations in GCC with respect to the 
h264ref SPEC 2006 benchmark; this gives a significant gain. I implemented the 
above optimization in the Open64 compiler, where it also gave a significant 
gain.

Thanks & Regards
Ajit
   

Proposal on Unrolling factor based on Data reuse.

2015-03-07 Thread Ajit Kumar Agarwal
Hello All:

I would like to propose an unrolling factor based on data reuse between 
different iterations, which folds the data reuse of several iterations into a 
single iteration. A MaxFactor bounds the unroll factor computed from data 
reuse; MaxFactor is calculated as (MaxInsts / LoopBodyInsts), where MaxInsts 
is the number of instructions that does not degrade the instruction cache. The 
calculation of MaxInsts also considers the def-use distance, since unrolling 
increases live range width; this limit is multiplied into the MaxInsts derived 
from the instruction cache.

The following example is from the Livermore Loops benchmarks.

The reuse of data from the previous iteration in the current iteration creates 
the data reuse: unrolling the loop in Fig(1) can reuse the data and thus 
improve performance. The loop unrolled 3 times is given in Fig(2), based on 
the unroll factor algorithm given in Fig(3).

for (i = 1; i < size; i++)
  {
    x[i] = z[i] * (y[i] - x[i-1]);
  }

Fig(1).

for (i = 1; i < size; i += 3)
  {
    x[i]   = z[i] * (y[i] - x[i-1]);
    x[i+1] = z[i] * (y[i] - x[i]);
    x[i]   = z[i] * (y[i] - x[i+1]);
  }

Fig(2).

Algorithm for unroll factor based on data reuse.

if (data_reuse)
  {
    unroll_factor = reuse_distance + 1;
    if (unroll_factor < MaxFactor)
      return unroll_factor;
    else
      {
        unroll_factor = MaxFactor - unroll_factor;
        if (unroll_factor < MaxDistance)
          return unroll_factor;
      }
  }

Fig ( 3).

In the example in Fig(1), the data reuse distance is used together with the 
MaxFactor calculated from the maximum number of insts that does not degrade 
the instruction cache and does not exceed the limit on def-use distance (which 
would increase live range width). With these considered, the loop is unrolled 
3 times.

Thoughts Please ?

Thanks & Regards
Ajit


RE: Proposal on Unrolling factor based on Data reuse.

2015-03-07 Thread Ajit Kumar Agarwal
Hello All:

There is a typo in Fig(2). The corrected version follows.

for (i = 1; i < size; i += 3)
  {
    x[i]   = z[i]   * (y[i]   - x[i-1]);
    x[i+1] = z[i+1] * (y[i+1] - x[i]);
    x[i+2] = z[i+2] * (y[i+2] - x[i+1]);
  }

Fig(2).

Thanks & Regards
Ajit
-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit 
Kumar Agarwal
Sent: Saturday, March 07, 2015 3:31 PM
To: Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Proposal on Unrolling factor based on Data reuse.

Hello All:

I would like to propose the Unrolling factor based on Data reuse between 
different iterations.  This combines the data reuse of different iterations 
into single iterations. There is a use of MaxFactor which decides on the 
calculation of unroll factor based on Data reuse.The MaxFactor is calculated 
based on (MaxInsts/LoopBodyInsts). The MaxInsts decides on the number of 
instruction that doesn't degrades on the instruction cache. The calculation of 
the MaxInsts also considers the def and use distance for the max insts that 
does not degrades the instruction cache which leads to increase in the Live 
range width  and multiplied with the MaxInsts calculated based  on Instruction 
cache.

The following example from Livermore Loops benchmarks.

The data reuse from the previous iteration to current iteration makes the data 
reuse. Unrolling the Loop in Fig(1) can Reuse the data and thus increasing the 
performance. The unrolled loop is unrolled 3 times given in Fig(2) based on the 
algorithm for Calculation of unrolling factor used as given in Fig(3).

For ( I = 1; I < size; i++ )
{
X[i] =  z[i] * ( y[i] - x[i-1]);
}

Fig(1).

For ( I = 1; I < size ; i+=3)
{
   X[i] =  z[i] * ( y[i] - x[i-1]);
   X[i+1] =  z[i] * ( y[i] - x[i]);
   X[i] =  z[i] * ( y[i] - x[i+1]]); }

Fig(2).

Algorithm for unroll factor based on data reuse.

If ( data_reuse == true)
{
Unroll_factor = Reuse_distance +1;
   If(Unroll_factor < MaxFactor)
 Return unroll_factor;;
  Else{
 Unroll_factor = MaxFactor - unroll_factor.
 If( unroll_factor < MaxDistance)
Return unroll_factor;

   }
}

Fig ( 3).

In the example given above in Fig(1) the data reuse distance and the MaxFactor 
calculated based on the maximum number of insts that don't degrades the 
instructions caches and doesn't exceed the maximum limit on def and use 
distance that increases the width of Live range. The above is considered and 
the Loop is 3 times.

Thoughts Please ?

Thanks & Regards
Ajit


Proposal for inter-procedural loop fusion.

2015-03-07 Thread Ajit Kumar Agarwal

Hello All:

I am proposing inter-procedural loop fusion. Generally, loops adjacent to each 
other are the conformable candidates for fusion, and fusion is done on 
intra-procedural loops. To fuse loops across procedure calls, whole-program 
analysis with array section analysis across the calls needs to be done.

In the example in Fig(1), the doubly nested loop in routine sub1 is a 
candidate for fusion with the doubly nested loop in routine sub2 in Fig(2). 
main calls sub1 and sub2, and the loops in sub1 and sub2 can be fused. Array 
section analysis and dependence analysis are done on the regions across the 
procedure calls for the loops in sub1 and sub2.

Here is the proposal for loop fusion across procedure calls. Loop peeling is 
normally done to create fusion candidates, and the loops in sub2 can also be 
peeled.

In the examples below, arrays A and B are global variables, so that loop 
fusion across the procedure calls is possible.

The examples below are extracted from the article "Inter-procedure loop 
fusion, array compaction and rotation" by John Ng et al.

Examples:

      program main
      common /s/ A(N),B(N)
      call sub1()
      call sub2()
      end

      subroutine sub1()
      common /s/ A(N),B(N)
      do j=1,N
        do i=1,N
          A(i+1,j) = j - i
        enddo
      enddo
      do i=1,N
        A(1,i) = i * 2
      enddo
      do i=1,N
        A(i+1,N+1) = A(i+1,1)
      enddo
      end ! sub1

Fig (1)

      subroutine sub2()
      common /s/ A(N),B(N)
      do j=1,N
        do i=1,N
          B(i,j) = A(i+1,j+1) + A(i,j+1) + A(i,j) + A(i+1,j)
        enddo
      enddo
      end

Fig (2).

Thoughts Please?

Thanks & Regards
Ajit


Proposal for path splitting for reduction in register pressure for Loops.

2015-03-08 Thread Ajit Kumar Agarwal
Hello All:

Path splitting replicates code so that better data flow information is 
available. One of the properties of path splitting is that it removes the join 
nodes of forked paths such as IF-THEN-ELSE and loops.

The removal of a join node splits the path into two independent paths, as in 
the sketch below.
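A minimal sketch of the transformation (A, B and S stand for the statements of
the arms and of the join block):

/* Before: S is the join node of the if-then-else inside the loop.  */
for (i = 0; i < n; i++)
  {
    if (c)
      A;
    else
      B;
    S;        /* join node: values live out of A and B meet here */
  }

/* After path splitting: S is replicated into both arms, the join node
   disappears, and the two paths carry independent live ranges.  */
for (i = 0; i < n; i++)
  {
    if (c)
      { A; S; }
    else
      { B; S; }
  }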

Increased register pressure in loops is a performance bottleneck and sometimes 
leads to spill code inside loops.

Removing the join nodes of loops and IF-THEN-ELSE makes target-independent 
optimizations like CSE and partial redundancy elimination more effective. 
Besides easing those optimizations, path splitting reduces register use 
because liveness on the two independent split paths no longer intersects. Less 
intersection of liveness reduces register pressure in loops, giving better 
register allocation and less spill code for loops.

The register pressure heuristics used in the IRA code should have a better 
impact on the split paths, and thus produce better-optimized code.

Thoughts Please?

Thanks & Regards
Ajit

  


RE: Proposal for path splitting for reduction in register pressure for Loops.

2015-03-08 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Sunday, March 08, 2015 9:05 PM
To: Ajit Kumar Agarwal; vmaka...@redhat.com; Jeff Law; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Proposal for path splitting for reduction in register pressure for 
Loops.

On March 8, 2015 3:39:08 PM CET, Ajit Kumar Agarwal 
 wrote:
>Hello All:
>
>The path splitting that replicates the code for better Data flow 
>Analysis available. One of the properties of path splitting removes the 
>joining nodes for the forked path like IF-THEN-ELSE and the Loops.
>
>The removal of joining nodes makes the path splitted into two 
>independent paths.
>
>The increase in register Pressures for the loops are the performance 
>bottleneck and sometimes lead to spill code in loops.
>
>The removal of joining nodes like loops and IF-THEN-ELSE makes the 
>target independent optimization like CSE and Partial redundancy 
>Elimination effective. Along with ease of these optimization the path 
>splitting reduces the registers because Of the Liveness wont intersect 
>because of independent splitted path. The ease of less intersection of 
>liveness Reduces the register pressure for the Loops, thus better 
>register allocation and less spilling code for Loops.
>
>The heuristics used in IRA code for register pressure will have better 
>impact on the splitted path and thus optimized code.
>
>Thoughts Please?

>>Are you talking about loop unswitching?

Loop unswitching is one form of path splitting, but I am proposing path 
splitting that removes the join node of FOR loops containing an IF-THEN-ELSE 
path. At the tail of the loop there is a join node for the IF-THEN-ELSE forked 
at the header of the IF region. Path splitting removes that join node by 
replicating its code into both paths, which makes the paths independent; the 
live ranges then do not intersect, and register pressure in the loop is 
reduced.

Loop unswitching moves the FOR loop inside the IF-THEN and IF-ELSE paths. The 
path splitting I am proposing would keep the IF-THEN-ELSE region inside the 
loop, but replicate the code of the join node and split the path into two 
independent paths, so that no join node remains. There is a code-size cost to 
the replication, but the live ranges no longer intersect, which reduces 
register pressure and gives better register allocation.

This calls for a profitability check: the reduction in register pressure must 
justify replicating the join node. I am confident this reduces register 
pressure through the loops and the IF-THEN-ELSE regions inside them, and thus 
gives better register allocation.

Some profitability heuristics need to be added to this optimization.

Thanks & Regards
Ajit

Richard.

>Thanks & Regards
>Ajit
>
>  




RE: Proposal for path splitting for reduction in register pressure for Loops.

2015-03-09 Thread Ajit Kumar Agarwal


-Original Message-
From: Jeff Law [mailto:l...@redhat.com] 
Sent: Monday, March 09, 2015 11:01 PM
To: Richard Biener
Cc: Ajit Kumar Agarwal; vmaka...@redhat.com; gcc@gcc.gnu.org; Vinod Kathail; 
Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Proposal for path splitting for reduction in register pressure for 
Loops.

On 03/09/15 05:40, Richard Biener wrote:
> On Sun, Mar 8, 2015 at 8:49 PM, Jeff Law  wrote:
>> On 03/08/15 12:13, Richard Biener wrote:
>>>
>>>
>>> I see.  This basically creates two loop latches and thus will make 
>>> our loop detection code turn the loop into a fake loop nest.  Not 
>>> sure if that is a good idea.
>>
>> I'd have to sit down and ponder this for a while -- what would be the 
>> register pressure implications of duplicating the contents of the 
>> join block into its predecessors, leaving an empty joiner so that we 
>> still have a single latch?
>
> Good question.  Now another question is why we don't choose this way 
> to disambiguate loops with multiple latches?  (create a forwarder as 
> new single latch block)

>>Dunno.  The RTL jump optimizer ought to eliminate the unnecessary jumping 
>>late in the optimization pipeline and creating the forwarder allows us to put 
>>the loop into a "more canonical" >>form for the loop optimizers.  Seems like 
>>it'd be worth some experimentation.

I agree with Jeff on this. The above approach of path splitting keeps the loop 
in a more canonical form for the loop optimizers.

>>I'm having trouble seeing how Ajit's proposal helps reduce register pressure 
>>in any significant way except perhaps by exposing partially dead code.  And 
>>if that's the primary benefit, we >>may better off actually building a proper 
>>framework for PDCE.

Jeff: The above approach of path splitting is very similar to superblock 
formation, since we duplicate the join node into all of its predecessors. The 
predecessor blocks enlarged by the duplicated join code give the scheduler and 
the register allocator a bigger scope to work with, similar to the advantages 
of superblocks with more granularity. This gives the basic block scheduler 
more opportunities to schedule independent operations. The path splitting also 
benefits LRA: with larger predecessor blocks, LRA's scope increases and it can 
reuse registers, which affects the live ranges and the register pressure, and 
better heuristics for spill code can be applied in LRA.

>>I've often pondered if PDCE is really worth it.  There's some good PLDI 
>>papers from many years ago.  One is closely related to our RTL GCSE 
>>implementation (Knoop). But RTL seems the >>wrong place to be doing this.

>>Click's GCM/GVN can achieve similar results by the nature of code motion 
>>algorithm IIRC, but as much as I like Click's algorithm, I wasn't ever able 
>>to get it to do anything significantly >>better than existing bits in GCC.

Thanks Jeff.

Thanks & Regards
Ajit

Jeff


Proposal for another approach for Loop transformation with conditional in Loops.

2015-03-14 Thread Ajit Kumar Agarwal
Hello All:

I am proposing a new approach to loop transformation, illustrated in the 
example below, for loops with a conditional expression inside. The loop body 
must be a reducible control flow graph. The iteration space is partitioned 
into subspaces on which cond_expr is either true or false, and the value of 
cond_expr on each partition can be determined statically. Based on this 
partitioning and the analysis of cond_expr on each partition, the loop in 
Fig(a) is transformed into the loops in Fig(b).

The variables of the conditional are live in at the entry of the cond_expr and 
live out of it. The control flow graph of the conditional is topologically 
ordered over its basic blocks, and from it the partitioned iteration spaces 
can be formed: spaces on which the conditional expression is true, false, or 
unknown.

Based on the above information and analysis, the loop of Fig(a) is transformed 
into the loops of Fig(b).

The transformed code performs much better, and such cases do trigger in the 
SPEC benchmarks. The loop is partitioned into versions with different 
iteration spaces, and the generated loops for the known partitions carry no 
conditional.

for (x1 = lb1; x1 <= ub1; x1++)
  ...
    for (xn = lbn; xn <= ubn; xn++)
      {
        v = cond_expr;
        if (v)
          CODE_FOR_THEN;
        else
          CODE_FOR_ELSE;
      }

Fig(a)

/* Loop for cond_expr == true */
for (x1 = ...)
  for (xn = ...)
    CODE_FOR_THEN;

/* Loop for cond_expr == false */
for (x1 = ...)
  for (xn = ...)
    CODE_FOR_ELSE;

/* Loop for cond_expr == unknown */
for (x1 = ...)
  for (xn = ...)
    {
      v = cond_expr;
      if (v)
        CODE_FOR_THEN;
      else
        CODE_FOR_ELSE;
    }

Fig (b).
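As a concrete instance (a hedged sketch, assuming cond_expr is monotonic in 
the induction variable so the partition point k can be computed statically):

/* Original: the branch outcome is i < k.  */
for (i = lb; i <= ub; i++)
  {
    if (i < k)
      CODE_FOR_THEN;
    else
      CODE_FOR_ELSE;
  }

/* Partitioned iteration spaces: no branch inside either loop.  */
split = (k < lb) ? lb : (k > ub + 1 ? ub + 1 : k);  /* clamp k to [lb, ub+1] */
for (i = lb; i < split; i++)     /* cond_expr known true  */
  CODE_FOR_THEN;
for (i = split; i <= ub; i++)    /* cond_expr known false */
  CODE_FOR_ELSE;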

Thoughts Please ?

Thanks & Regards
Ajit


RE: Proposal for another approach for Loop transformation with conditional in Loops.

2015-03-14 Thread Ajit Kumar Agarwal


-Original Message-
From: Aditya K [mailto:hiradi...@msn.com] 
Sent: Sunday, March 15, 2015 11:37 AM
To: Ajit Kumar Agarwal; Jeff Law; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: Proposal for another approach for Loop transformation with 
conditional in Loops.




> From: ajit.kumar.agar...@xilinx.com
> To: l...@redhat.com; richard.guent...@gmail.com; gcc@gcc.gnu.org
> CC: vin...@xilinx.com; shail...@xilinx.com; vid...@xilinx.com; 
> nmek...@xilinx.com
> Subject: Proposal for another approach for Loop transformation with 
> conditional in Loops.
> Date: Sun, 15 Mar 2015 04:40:54 +
>
> Hello All:
>
> I am proposing the new approach to Loop transformation as given below 
> in the example For the loops with conditional expression inside the 
> Loops. The Loop body should be reducible control flow graph. The 
> iteration space is partitioned into different spaces for which either 
> the cond_expr is true or cond_expr is false. The evaluation of 
> cond_expr for the partitioned iteration spaces can be determined statically. 
> For the partitioned iterations and based on the analysis of cond_expr on the 
> partitioned iterations spaces the Loops in fig (a) will be transformed to 
> Loops in fig (b).
>
> for the iteration spaces of the conditional inside the loop is live in 
> at the entry of the cond_expr and Live out the Control flow graph is 
> topological ordered with the basic blocks for the conditional CFG and the 
> partitioned iteration spaces can be formed for which the spaces can be true 
> for conditional expr and false and unknown.
>
> Based on the above info and analysis the Loop of Fig (a) will be transformed 
> to Loop Fig (b).
>
> This is much more optimized code as compared to the performance. The 
> cases will be triggered for SPEC Benchmarks. Loops is partitioned to 
> different version with different iteration spaces. Optimized in the presence 
> Of the transformed generation of the loops without conditional.
>
> For ( x1= lb1; x1<= ub1; x1++)
> ..
> For(xn= lbn; xn <= ubn; xn++)
> {
>
> V = cond_expr;
> If ( V)
> Code_FOR _THEN;
> Else
> Code_FOR_ELSE;
> }
>
> }
> Fig(a)
>
> /* Loop for cond_expr == true */
> For( x1 = ) For(xn = 
> ) CODE_FOR_THEN; End End
>
> /* Loop for cond_expr == false */
> For ( x1 = ..)
> For( xn = ...)
> CODE_FOR_ELSE;
> End
> End
>
> /* Loop for cond_expr == unknown *//
> For ( x1 = ...) For( xn = 
> 
> {
> V = cond_expr;
> If( v)
> CODE_FOR_THEN;
> Else
> CODE_FOR_ELSE;
> }
> }
>
> Fig ( b).
>
> Thoughts Please ?
>
> Thanks & Regards
> Ajit

>>Ajit,

>>How different this is from the loop-unswitch pass already in gcc 
>>(tree-ssa-loop-unswitch.c)?

Aditya: Loop unswitching moves the loop inside the IF-THEN and IF-ELSE, 
creating different versions of the loop, but the iteration space stays the 
same for each version. In the proposal above, the different loop versions get 
different iteration spaces, derived from where cond_expr is true or false; the 
transformed loops are generated per partition.

Another difference: in loop unswitching the conditional expression of the 
IF-THEN-ELSE is still present, whereas in this optimization the conditional is 
removed entirely for the partitions where cond_expr is known true or false, 
and each such loop version executes only the then-statements or only the 
else-statements. For the unknown part of the iteration space, a loop with the 
conditional remains. This results in much better code: the hot conditional 
branches inside the loops, and the branch mispredictions they cause, are 
removed or reduced to a negligible amount.

As you mentioned, it should be applied to the hot loops.

Thanks & Regards
Ajit
-Aditya


  


Short Circuit compiler transformations!!

2015-03-15 Thread Ajit Kumar Agarwal
Hello All:

This is a short circuit compiler transformation for conditional branches. For 
branches controlled by a conditional expression, certain operand values make 
the expression always true or always false, so one path is always executed 
and the other is short-circuited. That part of the condition is extracted into 
an IF-THEN with a check of just that value, and the else case checks the 
original expression.

Thus, for given values of the variables in the conditional expression, the 
expression is always true or always false and one path is always executed.

For the example given below.

for (x1 = lb1; x1 <= ub1; x1++)
  {
    if (C1 && C2)
      S_then;
  }


Fig (1).

For the input code in Fig(1), the condition C1 && C2 is always false when C1 
or C2 is zero (false), short-circuiting the path: S_then is skipped whenever 
C1 or C2 is false.

Thus the input code is transformed as follows.

for (x1 = lb1; x1 <= ub1; x1++)
  {
    if (!C1)
      continue;
    else
      S_then;
  }

This is much simpler code. Complex expressions inside the FOR loop can 
likewise be extracted into different versions of the conditional IF-THEN-ELSE 
inside the loop: the short-circuit cases where one path is always executed go 
into the IF-THEN, and the remaining cases of the condition go into the else.

Again, this should be driven by the profile data for hot conditional branches: 
the above optimization is performed based on the hotness of the conditional 
branches.

The short-circuiting optimization also turns irreducible loops with 
conditional branches into reducible ones, so no separate conversion of 
irreducible loops (e.g. by node splitting) is needed for the short-circuiting 
candidates.

This optimization is implemented in LLVM, and I would like to know whether it 
is implemented in GCC. If it is not, I would like to propose this short 
circuit compiler transformation.

Thoughts Please ?

Thanks & Regards
Ajit


RE: Short Circuit compiler transformations!!

2015-03-15 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Sunday, March 15, 2015 3:05 PM
To: Ajit Kumar Agarwal; Jeff Law; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Short Circuit compiler transformations!!

On March 15, 2015 9:15:36 AM GMT+01:00, Ajit Kumar Agarwal 
 wrote:
>Hello All:
>
>Short circuit compiler transformation for conditional branches. The 
>conditional branches based on the conditional Expressions one of the 
>path is always executed thus short circuiting the path. Certains values 
>of the conditional Expressions makes the conditional expressions always 
>true or false.
>This part of the conditions in extracted out In the IF-THEN 
>cond_express with the check of the value and in the else case the 
>original expressions is checked.
>
>This makes for a given values of the variables of the conditional 
>expressions makes the conditional expression as True or false always 
>and the one path is always executed.
>
>For the example given below.
>
>For( x1 = lb1; x1 < = ub1 ; x1++)
>{
>   If ( C1 && C2)
> S_then
> }
>
>
>Fig (1).
>
>For the input code given in Fig (1) the condition C1 && C2 is always 
>false if C1 or C2 is zero(false) thus short circuiting The path and 
>only s_else will be executed if C1 or C2 is false.
>
>Thus the input code is transformed as follows.
>
>For( x1 = lb1; x1 < = ub1 ; x1++)
>{
>   If(!C1)
>Continue
>  else
>S_then
> }

>>This looks wrong, as you missed checking C2.

Richard: Sorry for the typo. I can see that the short-circuit
transformation is implemented in GCC, but only short-circuiting with
respect to && and ||. I am thinking in terms of complex expressions where
a single variable's value sets the whole expression to true or false;
such conditions can be extracted out into different versions of the IF
with different parameter checks (another way of short-circuiting).

I would like to know the scope of this optimization in GCC. If it is not
implemented for complex expressions over different iteration spaces, I
would like to propose it.
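For reference, a minimal sketch of the transformation with the missing
C2 check restored (C1, C2 and S_then are the symbolic placeholders from
Fig (1); whether testing C1 first is profitable depends on the relative
cost of the two sub-conditions):

For( x1 = lb1; x1 <= ub1 ; x1++)
{
   If(!C1)
      Continue        /* C1 false: C1 && C2 is false, skip S_then */
   else If(C2)
      S_then          /* both sub-conditions hold */
}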

Thoughts Please?

Thanks & Regards
Ajit

Richard.

>This is much simpler code and there could be complex expressions inside 
>the FOR Loop that can be extracted out with Different versions of the 
>conditional IF-THEN-ELSE inside the loops based on short circuiting 
>executing one of the path always Executed can be extracted in IF-THEN 
>and other cases of condition will be inside the else conditions.
>
>Again this has to be optimized based on the profile Data of hot 
>conditional branches and the above optimizations is performed Based on 
>the hotness of the conditional branches.
>
>The short Circuiting optimization also makes the irreducible loops with 
>conditional branches as reducible and there is no need for Conversion 
>of irreducible loops to reducible loop with any of the existing 
>approach like node splitting for the short circuiting candidate.
>
>This optimizations is implemented in LLVM and I would like to know if 
>this is implemented in GCC. If not implemented I would like To propose 
>this short circuit compiler transformation.
>
>Thoughts Please ?
>
>Thanks & Regards
>Ajit




RE: Short Circuit compiler transformations!!

2015-03-15 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Ajit 
Kumar Agarwal
Sent: Sunday, March 15, 2015 3:35 PM
To: Richard Biener; Jeff Law; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: Short Circuit compiler transformations!!



-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com]
Sent: Sunday, March 15, 2015 3:05 PM
To: Ajit Kumar Agarwal; Jeff Law; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Short Circuit compiler transformations!!

On March 15, 2015 9:15:36 AM GMT+01:00, Ajit Kumar Agarwal 
 wrote:
>Hello All:
>
>Short circuit compiler transformation for conditional branches. The 
>conditional branches based on the conditional Expressions one of the 
>path is always executed thus short circuiting the path. Certains values 
>of the conditional Expressions makes the conditional expressions always 
>true or false.
>This part of the conditions in extracted out In the IF-THEN 
>cond_express with the check of the value and in the else case the 
>original expressions is checked.
>
>This makes for a given values of the variables of the conditional 
>expressions makes the conditional expression as True or false always 
>and the one path is always executed.
>
>For the example given below.
>
>For( x1 = lb1; x1 < = ub1 ; x1++)
>{
>   If ( C1 && C2)
> S_then
> }
>
>
>Fig (1).
>
>For the input code given in Fig (1) the condition C1 && C2 is always 
>false if C1 or C2 is zero(false) thus short circuiting The path and 
>only s_else will be executed if C1 or C2 is false.
>
>Thus the input code is transformed as follows.
>
>For( x1 = lb1; x1 < = ub1 ; x1++)
>{
>   If(!C1)
>Continue
>  else
>S_then
> }

>>This looks wrong as you miss to check C2.

>>Riichard: Sorry for the typo error. I could see the short circuit 
>>transformation is implemented in gcc. I could see only Short circuiting with 
>>respect to &&/OR. I am thinking in terms of >>complex expressions where any 
>>of the variable condition In the expressions Set to to true or false the 
>>whole expressions and such conditions can be extracted out into different 
>>versions >>of IF with different parameter check( Another way of short 
>>circuiting).

>>I would like to know the scope of this optimization in gcc. If not 
>>implemented with respect to complex expressions with respect To different 
>>iteration spaces. I would like to propose the >>same.

Richard: Another aspect is short-circuiting the IF-THEN-ELSE inside
loops where the loop is irreducible. In that case the short-circuit
transformation is combined with the CFG transformation that converts the
irreducible loop into a reducible one.

Thanks & Regards
Ajit

>>Thoughts Please?

Thanks & Regards
Ajit

Richard.

>This is much simpler code and there could be complex expressions inside 
>the FOR Loop that can be extracted out with Different versions of the 
>conditional IF-THEN-ELSE inside the loops based on short circuiting 
>executing one of the path always Executed can be extracted in IF-THEN 
>and other cases of condition will be inside the else conditions.
>
>Again this has to be optimized based on the profile Data of hot 
>conditional branches and the above optimizations is performed Based on 
>the hotness of the conditional branches.
>
>The short Circuiting optimization also makes the irreducible loops with 
>conditional branches as reducible and there is no need for Conversion 
>of irreducible loops to reducible loop with any of the existing 
>approach like node splitting for the short circuiting candidate.
>
>This optimizations is implemented in LLVM and I would like to know if 
>this is implemented in GCC. If not implemented I would like To propose 
>this short circuit compiler transformation.
>
>Thoughts Please ?
>
>Thanks & Regards
>Ajit




RE: Function outlining and partial Inlining

2015-03-15 Thread Ajit Kumar Agarwal


-Original Message-
From: Jan Hubicka [mailto:hubi...@ucw.cz] 
Sent: Thursday, February 12, 2015 10:34 PM
To: Ajit Kumar Agarwal
Cc: hubi...@ucw.cz; gcc@gcc.gnu.org; Vinod Kathail; Shail Aditya Gupta; 
Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Function outlining and partial Inlining

> Hello All:
> 
> Large functions are an important part of high-performance
> applications, and they contribute to performance bottlenecks in many
> respects. Some large hot functions are frequently executed, but many
> regions inside them are cold. A large function also blocks function
> inlining because of code-size constraints.
> 
> Such cold regions inside hot large functions can be extracted out
> (function outlining), breaking the large function into smaller
> function segments; this allows the remaining function to be inlined
> at the call site, or helps partial inlining.
> 
> The LLVM compiler has functionality and optimizations for function
> outlining based on regions such as basic blocks, superblocks and
> hyperblocks, which get extracted into smaller function segments, thus
> enabling partial inlining and function inlining at the call site.
> 
> This optimization is a good case for profile-guided optimization,
> based on the profile feedback data available to the compiler.
> Without profile information the above function-outlining optimization
> will not be useful.
> 
> We are doing a lot of optimization regarding polymorphism, and also
> indirect-call promotion based on profile feedback on the call-graph
> profile.
> 
> Are we doing a function-outlining optimization in GCC, with respect to
> function inlining and partial inlining based on profile feedback data?
> If not, this optimization could be implemented. If it is already
> implemented in GCC, may I have a pointer to the relevant code and the
> scope of this function-outlining optimization?

>>The outlining pass is called ipa-split.  The heuristic used is however
>>quite simplistic: it looks for the very specific case where you have a
>>small header of a function containing a conditional, and splits after
>>that.  It does use profile.

>>Any work on improving the heuristics or providing interesting testcases to 
>>consider would be welcome.

>>I think the LLVM pass is doing pretty much the same analysis minus the
>>profile feedback considerations.  After splitting, LLVM will inline the
>>header into all callers, while GCC leaves this to the decision of the
>>inliner heuristics, which may just merge the function back into one
>>block.

>>The actual outlining logic is contained in tree-inline.c and also used by 
>>OpenMP.

Honza: LLVM has logic for extracting loops, via the -loop-extract flag,
where the loop region is extracted into a function. Its heuristic
requires that the header of the loop be the entry block and that the
loop have a single entry, which is ensured by LLVM's loop simplification
pass. The heuristic also requires that there be no branch from the entry
block into the middle of the loop.

You mentioned that GCC's heuristic looks for a conditional in the
function header and splits after that point. Does GCC only split on a
conditional in the header, or could the heuristic be augmented to split
out loops as is done in LLVM?
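
For reference, a hedged C sketch of the specific shape ipa-split looks
for today (all names here are made up; "lookup.part.N" is the naming
scheme GCC uses for split-off parts):

extern int recompute (int key);   /* large, rarely executed   */
int cache[1024];                  /* assume initialised to -1 */

int
lookup (int key)
{
  /* Small hot header containing a conditional: the pattern the
     current ipa-split heuristic matches.  */
  if (cache[key & 1023] != -1)
    return cache[key & 1023];

  /* Cold tail: ipa-split can outline everything from here into
     lookup.part.N, leaving a header small enough for the inliner
     to inline at every call site.  */
  cache[key & 1023] = recompute (key);
  return cache[key & 1023];
}

A loop-based heuristic as in LLVM's -loop-extract would instead treat a
whole (cold) loop body as the region to outline.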

Thanks & Regards
Ajit

Honza
> 
> If it is not implemented, may I propose adding a function-outlining
> optimization to GCC?
> 
> Thoughts Please?
> 
> Thanks & Regards
> Ajit


Proposal for the coarse grain unrolling heuristics and renaming for the enablement of better fine grain Loop transformation.

2015-03-15 Thread Ajit Kumar Agarwal

Hello All:

Below are the transformations for the loop given in Fig (1). Fig (2)
applies unroll-and-jam, and Fig (3) performs code motion to bring the
two IFs adjacent to each other and the two WHILE loops adjacent to each
other.

Fig (4) performs IF-merging and loop fusion on the transformed loop of
Fig (3), in which the two IFs and the two WHILE loops are adjacent.

This shows how a loop with conditional branches can be unrolled in a way
that enables IF-merging and loop fusion, after code motion on the
unrolled loop brings the IFs and the loops adjacent to each other.

This is a powerful optimization that is enabled by renaming and
unrolling.

Such heuristics need to be added for loop unrolling, so that the later
code motion can enable IF-merging and loop fusion for the unrolled loop.

The example below is taken from the article by Albert Cohen et al.,

"Deep Jam: Conversion of Coarse-grain Parallelism to Fine-grain Vector
Parallelism"

For ( i = 0; i < 10; i++)
{
   If (p)
      a = ...;
   While (q)
      ... = a + ...;
}

Fig (1)

For ( i = 0; i < 10; i += 2)
{
   a1 = phi(a, a2);
   If (p1)
      a1 = ...;
   While (q1)
      ... = a1 + ...;
   If (p2)
      a2 = ...;
   a2 = phi(a2, a1);
   While (q2)
      ... = a2 + ...;
}

Fig (2)

For ( i = 0; i < 10; i += 2)
{
   a1 = phi(a, a2);
   If (p1)
      a1 = ...;
   If (p2)
      a2 = ...;
   a2 = phi(a2, a1);
   While (q1)
      ... = a1 + ...;
   While (q2)
      ... = a2 + ...;
}

Fig (3)

For ( i = 0; i < 10; i += 2)
{
   a1 = phi(a, a2);
   If (p1 && p2)
   {
      a1 = ...;
      a2 = ...;
   }
   Else If (p1)
      a1 = ...;
   Else If (p2)
      a2 = ...;

   While (q1 && q2)
   {
      ... = a1 + ...;
      ... = a2 + ...;
   }
   While (q1)
      ... = a1 + ...;
   While (q2)
      ... = a2 + ...;
}

Fig (4).

I would like to propose the above heuristics for unroll-and-jam and
renaming, which enable loop fusion and IF-merging to achieve more
optimized code.

This shows how coarse-grain unrolling heuristics enable fine-grain loop
transformations.

Thoughts Please?

Thanks & Regards
Ajit


RE: Short Circuit compiler transformations!!

2015-03-15 Thread Ajit Kumar Agarwal


-Original Message-
From: Richard Biener [mailto:richard.guent...@gmail.com] 
Sent: Sunday, March 15, 2015 9:30 PM
To: Ajit Kumar Agarwal; Jeff Law; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: RE: Short Circuit compiler transformations!!

On March 15, 2015 11:14:59 AM GMT+01:00, Ajit Kumar Agarwal 
 wrote:
>
>
>-Original Message-
>From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of 
>Ajit Kumar Agarwal
>Sent: Sunday, March 15, 2015 3:35 PM
>To: Richard Biener; Jeff Law; gcc@gcc.gnu.org
>Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju 
>Mekala
>Subject: RE: Short Circuit compiler transformations!!
>
>
>
>-Original Message-
>From: Richard Biener [mailto:richard.guent...@gmail.com]
>Sent: Sunday, March 15, 2015 3:05 PM
>To: Ajit Kumar Agarwal; Jeff Law; gcc@gcc.gnu.org
>Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju 
>Mekala
>Subject: Re: Short Circuit compiler transformations!!
>
>On March 15, 2015 9:15:36 AM GMT+01:00, Ajit Kumar Agarwal 
> wrote:
>>Hello All:
>>
>>Short circuit compiler transformation for conditional branches. The 
>>conditional branches based on the conditional Expressions one of the 
>>path is always executed thus short circuiting the path. Certains
>values
>>of the conditional Expressions makes the conditional expressions
>always
>>true or false.
>>This part of the conditions in extracted out In the IF-THEN 
>>cond_express with the check of the value and in the else case the 
>>original expressions is checked.
>>
>>This makes for a given values of the variables of the conditional 
>>expressions makes the conditional expression as True or false always 
>>and the one path is always executed.
>>
>>For the example given below.
>>
>>For( x1 = lb1; x1 < = ub1 ; x1++)
>>{
>>   If ( C1 && C2)
>> S_then
>> }
>>
>>
>>Fig (1).
>>
>>For the input code given in Fig (1) the condition C1 && C2 is always 
>>false if C1 or C2 is zero(false) thus short circuiting The path and 
>>only s_else will be executed if C1 or C2 is false.
>>
>>Thus the input code is transformed as follows.
>>
>>For( x1 = lb1; x1 < = ub1 ; x1++)
>>{
>>   If(!C1)
>>Continue
>>  else
>>S_then
>> }
>
>>>This looks wrong as you miss to check C2.
>
>>>Riichard: Sorry for the typo error. I could see the short circuit
>transformation is implemented in gcc. I could see only Short circuiting 
>with respect to &&/OR. I am thinking in terms of >>complex expressions 
>where any of the variable condition In the expressions Set to to true 
>or false the whole expressions and such conditions can be extracted out 
>into different versions >>of IF with different parameter check( Another 
>way of short circuiting).
>
>>>I would like to know the scope of this optimization in gcc. If not
>implemented with respect to complex expressions with respect To 
>different iteration spaces. I would like to propose the >>same.

>>Just as a general note/question on all your suggestions.  Are you
>>willing to implement all your proposals, or do you have resources to
>>spend on that effort?

>>It's not that we are not aware of loop optimizations GCC does not implement.
>>It's a question of priorities and resources to implement something.

Yes, you are right. We are proposing different optimizations based on
our analysis and past experience, and we would be interested in
implementing them based on suggestions and inputs from the community.

We may not implement all of them, but based on the community's
suggestions and inputs we would like to take them up on a priority
basis.

Thanks & Regards
Ajit

Richard.

>Richard: Other aspect of the transformation for short circuiting the 
>IF-THEN-ELSE with the loops where the loop is irreducible.
>In this case implementing the short circuit transformations and the CFG 
>transformations to convert  from irreducibility to reducible Loops 
>along with the transformations of short circuiting compiler 
>transformations.
>
>Thanks & Regards
>Ajit
>
>>>Thoughts Please?
>
>Thanks & Regards
>Ajit
>
>Richard.
>
>>This is much simpler code and there could be complex expressions
>inside
>>the FOR Loop that can be extracted out with Different versions of the 
>>conditional IF-THEN-ELSE inside the loops based on short circuiting 
>>executing one of the path always Executed can be extracted in IF-THEN 
>>and other cases of condition will b

RE: Proposal for another approach for Loop transformation with conditional in Loops.

2015-03-16 Thread Ajit Kumar Agarwal


-Original Message-
From: Jeff Law [mailto:l...@redhat.com] 
Sent: Monday, March 16, 2015 11:45 PM
To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Proposal for another approach for Loop transformation with 
conditional in Loops.

On 03/14/15 22:40, Ajit Kumar Agarwal wrote:
> Hello All:
>
> I am proposing a new approach to loop transformation, shown in the
> example below, for loops with a conditional expression inside them.
> The loop body should be a reducible control flow graph. The iteration
> space is partitioned into sub-spaces on which the cond_expr is either
> true or false, and the value of the cond_expr on each partition can be
> determined statically. Based on this partitioning and the analysis of
> the cond_expr on each partition, the loop in Fig (a) is transformed
> into the loops in Fig (b).
>
> The variables of the conditional inside the loop are live in at the
> entry of the cond_expr and live out of it. The control flow graph of
> the conditional is topologically ordered, and the partitioned
> iteration spaces are formed so that on each partition the conditional
> expression is true, false, or unknown.
>
> Based on the above information and analysis, the loop of Fig (a) is
> transformed into the loops of Fig (b).
>
> This gives much better-optimized code. The cases are triggered in the
> SPEC benchmarks. The loop is partitioned into different versions with
> different iteration spaces, optimized by generating the transformed
> loops without the conditional.
I fail to see how this is dramatically different than unswitching.   You 
pull the condition out (it has to be loop invariant), then have two instances 
of the loop, one for the THEN, the other for the ELSE.  You obviously select 
one or the other based on the condition of V.

>>Specific examples might be helpful in understanding how you see this as 
>>different than unswitching.

For ( i = 1; i <= 5; i++)
  For ( j = 1; j <= 5; j++)
    If ( i > 3 && j > 3)
      A[i][j] = a[i+1][j-1];
    Else
      A[i][j] = 0;
    End
  End

Fig (1)

For ( i = 1; i <= 3; i++)
  For ( j = 4; j <= 5; j++)
    A[i][j] = 0;

For ( i = 4; i <= 5; i++)
  For ( j = 4; j <= 5; j++)
    A[i][j] = a[i+1][j-1];

For ( i = 1; i <= 5; i++)
  For ( j = 1; j <= 3; j++)
    A[i][j] = 0;

Fig (2) Transformed code for Fig (1).

Fig(1) is the original code and Fig(2) is the transformed code for fig(1).

Thanks & Regards
Ajit

jeff


More methods of reducing the register pressure

2015-04-19 Thread Ajit Kumar Agarwal
Hello All:

To reduce register pressure, I am proposing the following methods.

1. Assigning the same register to, or sharing one register among,
logical registers that hold the same value (see the sketch at the end of
this mail). Determining which logical registers hold the same value is
the real challenge. Is there any way in GCC to determine the logical
registers that hold the same value?

2. For architectures with self-modifying instructions (instructions
whose destination is also a source), the DEF and the USE need not
overlap, so the register allocator can do better, reclaiming a register
as soon as it is no longer used. Such self-modifying instructions can be
assigned the same register.

For example:

OP R1         - definition of R1
...           - R1 not used
OP R1, R1     - self-modifying: uses and redefines R1
...           - R1 not used
OP R1, R1     - self-modifying: uses and redefines R1

With architecture support for such self-modifying instructions, the live
ranges do not overlap: each use occurs in the same instruction that
redefines the register, so the USE comes right at the DEF and the
lifetimes chain without overlapping, allowing the same register to be
assigned throughout.

Such cases need to be handled separately to make better use of the
available registers and reduce register pressure.
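
A minimal C sketch of case 1, assuming the value equality can be proved
(here it is trivially a copy; the names are illustrative only):

extern void use (int);

int
f (int x)
{
  int a = x + 1;   /* definition of a                          */
  int b = a;       /* b provably holds the same value as a     */
  use (a);
  use (b);         /* although the live ranges of a and b
                      overlap, they carry the same value, so
                      one hard register can serve both and the
                      copy b = a disappears.                   */
  return b;
}

Coalescing an explicit copy like this is the classic instance; the
harder part of the proposal is proving value identity for registers that
are not related by a copy, e.g. via value numbering.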
Thoughts Please?

Thanks & Regards
Ajit


RE: dom1 prevents vectorization via partial loop peeling?

2015-04-28 Thread Ajit Kumar Agarwal


-Original Message-
From: gcc-ow...@gcc.gnu.org [mailto:gcc-ow...@gcc.gnu.org] On Behalf Of Richard 
Biener
Sent: Tuesday, April 28, 2015 4:12 PM
To: Jeff Law
Cc: Alan Lawrence; gcc@gcc.gnu.org
Subject: Re: dom1 prevents vectorization via partial loop peeling?

On Mon, Apr 27, 2015 at 7:06 PM, Jeff Law  wrote:
> On 04/27/2015 10:12 AM, Alan Lawrence wrote:
>>
>>
>> After copyrename3, immediately prior to dom1, the loop body looks like:
>>
>>:
>>
>>:
>># i_11 = PHI 
>>_5 = a[i_11];
>>_6 = i_11 & _5;
>>if (_6 != 0)
>>  goto ;
>>else
>>  goto ;
>>
>>:
>>
>>:
>># m_2 = PHI <5(4), 4(3)>
>>_7 = m_2 * _5;
>>b[i_11] = _7;
>>i_9 = i_11 + 1;
>>if (i_9 != 32)
>>  goto ;
>>else
>>  goto ;
>>
>>:
>>goto ;
>>
>>:
>>return;
>>
>> dom1 then peeled part of the first loop iteration, producing:
>
> Yup.  The jump threading code realized that if we traverse the edge 
> 2->3, then we'll always traverse 3->5.  The net effect is like peeling 
> the first iteration because we copy bb3.  The original will be reached 
> via 7->3 (it, loop iterating), the copy via 2->3' and 3' will have its 
> conditional removed and will unconditionally transfer control to bb5.
>
>
>
> This is a known problem, but we really don't have any good heuristics 
> for when to do this vs when not to do it.
>
>
>>
>> In contrast, a slightly-different testcase:
>>
>> #define N 32
>>
>> int a[N];
>> int b[N];
>>
>> int foo ()
>> {
>>for (int i = 0; i < N ; i++)
>>  {
>>int cond = (a[i] & i) ? -1 : 0; // extra variable here
>>int m = (cond) ? 5 : 4;
>>b[i] = a[i] * m;
>>  }
>> }
>>
>> after copyrename3, just before dom1, is only slightly different:
>>
>>:
>>
>>:
>># i_15 = PHI 
>>_6 = a[i_15];
>>_7 = i_15 & _6;
>>if (_7 != 0)
>>  goto ;
>>else
>>  goto ;
>>
>>:
>># m_3 = PHI <4(6), 5(3)>
>>_8 = m_3 * _6;
>>b[i_15] = _8;
>>i_10 = i_15 + 1;
>>if (i_10 != 32)
>>  goto ;
>>else
>>  goto ;
>>
>>:
>>goto ;
>>
>>:
>>return;
>>
>>:
>>goto ;
>>
>> with bb6 being out-of-line at the end of the function, rather than 
>> bb4 falling through from just above bb5. However, this prevents dom1 
>> from doing the partial peeling, and dom1 only changes the "goto bb7" 
>> into a "goto bb3":
>
> I would have still expected it to thread 2->3, 3->6->4
>
>
>>
>> (1) dom1 should really, in the second case, perform the same partial 
>> peeling that it does in the first testcase, if that is what it thinks 
>> is desirable. (Of course, we might want to fix that only later, as 
>> ATM that'd take us backwards).
>
> Please a file a BZ.  It could be something simple, or we might be 
> hitting one of Zdenek's heuristics around keeping overall loop structure.
>
>>
>> Alternatively, maybe we don't want dom1 doing that sort of thing (?), 
>> but I'm inclined to think that if it's doing such optimizations, it's 
>> for a good reason ;) I guess there'll be other times where we 
>> *cannot* do partially peeling of later iterations...
>
> It's an open question -- we never reached any kind of conclusion when 
> it was last discussed with Zdenek.  I think the fundamental issue is 
> we can't really predict when threading the loop is going to interfere 
> with later optimizations or not.  The heuristics we have are marginal at best.
>
> The one thought we've never explored was re-rolling that first 
> iteration back into the loop in the vectorizer.

>>Well.  In this case we hit


>>  /* If one of the loop header's edge is an exit edge then do not
>>     apply if-conversion.  */
>>  FOR_EACH_EDGE (e, ei, loop->header->succs)
>>    if (loop_exit_edge_p (loop, e))
>>      return false;

>>which is simply because even after if-conversion we'll at least end up with a 
>>non-empty latch block which is what the vectorizer doesn't support.
>>DOM rotated the loop into this non-canonical form.  Running loop header 
>>copying again would probably undo this.

With the path-splitting approach I have proposed, the node at the source
of the back edge is copied into its predecessors, leaving an empty
latch. This would enable vectorization in the above scenario.

Thanks & Regards
Ajit

Richard.

>
> Jeff


[RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE

2015-05-16 Thread Ajit Kumar Agarwal
I have designed and implemented path splitting for loops with a
conditional IF-THEN-ELSE, following the design below.
The implementation has gone through bootstrap for the MicroBlaze target
along with the DejaGnu regression tests and runs of the MiBench/EEMBC
benchmarks. There is no regression in the DejaGnu tests, and the DejaGnu
C++ test results are better with the path-splitting changes (fewer
failures). The MiBench/EEMBC benchmarks show no performance degradation;
a performance improvement of 9.15% is observed in telcom_gsm (MiBench)
and of 9.4% in rgbcmy01_lite (EEMBC).

A proposal on path splitting was made earlier; the reference is
attached.

https://gcc.gnu.org/ml/gcc/2015-03/msg00057.html

The above gains are achieved because constant propagation on the
conditional becomes much easier with the path-splitting changes; the
code also benefits from other optimizations on the tree representation
and from better register allocation, thanks to the finer granularity.

I will do a SPEC run and a bootstrap on the i386 target. Before that I
would like to receive your feedback and comments on the design and
implementation. I will send the actual patch, together with the SPEC run
and the i386 bootstrap, after I receive feedback on the design and
implementation described in this mail.

Design changes.

1. Given a block BB1 that ends in a conditional IF statement, the blocks
dominated by BB1 are examined; a block that is dominated by BB1 but is
not a successor of BB1 is the join node of the IF-THEN-ELSE inside the
loop.

2. The join node must be the same block as the source of the loop's
latch edge.

3. If the conditions in (1) and (2) hold, the join node (the source of
the loop latch edge) is moved into its predecessors, and the join node
is left as an empty statement block containing only the phi nodes (a
source-level sketch follows the file list below).

4. While moving the join node into its predecessors, each use of a phi
result from the join node is replaced by the corresponding phi argument,
chosen according to which predecessor the copied statements are placed
in.

5. The dominator info is updated after performing steps (1)-(4).

6. The join node, the source of the loop latch edge, is thus left empty
except for its phi nodes.

7. The implementation is done in the jump-threading phase of the
machine-independent optimizations on the tree representation; path
splitting is invoked after the jump-threading optimization has been
performed. The following files are modified:

a) tree-vrp.c
b) tree-ssa-threadedge.c
c) tree-cfg.c
d) tree-ssa-threadedge.h
e) tree-cfg.h
f) cfghooks.c
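
To make steps (1)-(6) concrete, here is a hedged source-level sketch of
the intended effect (the pass itself works on GIMPLE/CFG, not on source;
names and loop shape are illustrative only):

void
f (int n, int c, const int *a, int *b)
{
  for (int i = 0; i < n; i++)
    {
      int t;
      if (a[i] > 0)
        t = a[i];       /* THEN block                       */
      else
        t = -a[i];      /* ELSE block                       */
      b[i] = t * c;     /* join block, source of the latch  */
    }
}

/* After path splitting, as if the source had been written:  */
void
f_split (int n, int c, const int *a, int *b)
{
  for (int i = 0; i < n; i++)
    {
      if (a[i] > 0)
        b[i] = a[i] * c;    /* copy of the join block; the phi result
                               is replaced by the THEN-arm argument  */
      else
        b[i] = -a[i] * c;   /* copy with the ELSE-arm argument       */
      /* the original join block is left empty except for phi nodes  */
    }
}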

The diff is attached to this mail and pasted below.

Please share your thoughts and feedback on the above optimization and on
the design, coding, and implementation done for it.

diff --git a/gcc/cfghooks.c b/gcc/cfghooks.c
index 9faa339..559ca96 100644
--- a/gcc/cfghooks.c
+++ b/gcc/cfghooks.c
@@ -581,7 +581,7 @@ delete_basic_block (basic_block bb)
 
   /* If we remove the header or the latch of a loop, mark the loop for
 removal.  */
-  if (loop->latch == bb
-  || loop->header == bb)
+  if (loop && (loop->latch == bb
+   || loop->header == bb))
 mark_loop_for_removal (loop);
 
diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c
index aed5254..b25e409 100644
--- a/gcc/tree-cfg.c
+++ b/gcc/tree-cfg.c
@@ -1838,6 +1838,64 @@ replace_uses_by (tree name, tree val)
 }
 }
 
+void
+gimple_threaded_merge_blocks (basic_block a, basic_block b)
+{
+  gimple_stmt_iterator last, gsi, psi;
+
+  if (dump_file)
+fprintf (dump_file, "Merging blocks %d and %d\n", a->index, b->index);
+
+   /* Remove labels from B and set gimple_bb to A for other statements.  */
+  for (gsi = gsi_start_bb (b); !gsi_end_p (gsi);)
+ {
+   gimple stmt = gsi_stmt (gsi);
+   if (glabel *label_stmt = dyn_cast  (stmt))
+ {
+   tree label = gimple_label_label (label_stmt);
+   int lp_nr;
+
+   gsi_remove (&gsi, false);
+
+   if (FORCED_LABEL (label))
+ {
+   gimple_stmt_iterator dest_gsi = gsi_start_bb (a);
+   gsi_insert_before (&dest_gsi, stmt, GSI_NEW_STMT);
+ }
+ /* Other user labels keep around in a form of a debug stmt.  */
+ else if (!DECL_ARTIFICIAL (label) && MAY_HAVE_DEBUG_STMTS)
+ {
+   gimple dbg = gimple_build_debug_bind (label,
+ integer_zero_node,
+ stmt);
+
+   gimple_debug_bind_reset_value (dbg);
+
+   gsi_insert_before (&gsi, dbg, GSI_SAME_STMT);
+
+ }
+   lp_nr = EH_LANDING_PAD_NR (label);
+   if (lp_nr)
+ {
+   

RE: [RFC] Design and Implementation for Path Splitting for Loop with Conditional IF-THEN-ELSE

2015-06-01 Thread Ajit Kumar Agarwal


-Original Message-
From: Jeff Law [mailto:l...@redhat.com] 
Sent: Friday, May 29, 2015 9:24 PM
To: Ajit Kumar Agarwal; Richard Biener; gcc@gcc.gnu.org
Cc: Vinod Kathail; Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: [RFC] Design and Implementation for Path Splitting for Loop with 
Conditional IF-THEN-ELSE

On 05/16/2015 06:49 AM, Ajit Kumar Agarwal wrote:
> I have Designed and  implemented with the following design for the path 
> splitting of the loops with conditional IF-THEN-ELSE.
> The implementation has gone through the bootstrap for Microblaze 
> target along DEJA GNU regressions tests and running the MIBench/EEMBC 
> benchmarks. There is no regression seen in Deja GNU tests and Deja C++ 
> tests results are better with the path splitting changes(lesser 
> failures). The Mibench/EEMBC benchmarks are run and no performance 
> degradation is seen and the performance improvement in telcom_gsm( 
> Mibench benchmarks) of 9.15% and rgbcmy01_lite(EEMBC
> benchmarks) the performance improvement of 9.4% is observed.
>
> A proposal on path splitting was made earlier; the reference is
> attached.
>>The first thing I notice is you haven't included any tests for the new
>>optimization.  We're trying really hard these days to have tests in the
>>suite for this kind of new optimization/feature work.

>>I don't offhand know if any of the benchmarks you cite above are
>>free-enough to derive a testcase from.  But one trick many of us use is
>>to instrument the pass and compile some known free software (often gcc
>>itself) to find triggering code and use that to generate tests for the
>>new transformation.

I will add tests to the suite. I can see that many existing tests in the
suite also trigger this optimization.

>
> Design changes.
>
> 1. Given a block BB1 that ends in a conditional IF statement, the
> blocks dominated by BB1 are examined; a block that is dominated by BB1
> but is not a successor of BB1 is the join node of the IF-THEN-ELSE
> inside the loop.
This isn't necessarily my area of strength, but it doesn't seem right to 
me.   I guess it would work if there aren't any more nodes in the loop 
after the join block, except for the loop latch, but it doesn't seem
right in the general case.  It might work if you also verify that the IF
block is the immediate dominator of the potential join node.



>
> 2. The join node must be the same block as the source of the loop's
> latch edge.
>>OK.  This is why your initial algorithm in #1 works.
Thanks.

>
> 3. If the conditions in (1) and (2) hold, the join node (the source of
> the loop latch edge) is moved into its predecessors, and the join node
> is left as an empty statement block containing only the phi nodes.
>>Hopefully you do this by duplicating the join node and manipulating the
>>CFG so that the original predecessors of the join each go to their copy
>>of the join node.  The copied join nodes then unconditionally pass
>>control to the original join node.

Yes.

>
> 4. While moving the join node into its predecessors, each use of a phi
> result from the join node is replaced by the corresponding phi
> argument, chosen according to which predecessor the copied statements
> are placed in.
>>Right.  Note there's a phi-only cprop pass which is designed to do this 
>>propagation and in theory should be very fast.

Thanks. 

> 5. The dominator info is updated after performing steps (1)-(4).
>>Right.  Getting the sequencing of the various updates can be tricky. 
>>Even more so if you try to do it as a part of jump threading because you have 
>>to (for example) deal with unreachable blocks in the CFG.


>
> 6. The join node, the source of the loop latch edge, is thus left
> empty except for its phi nodes.
>>Right.

Thanks.

>
> 7. The implementation is done in the jump-threading phase of the
> machine-independent optimizations on the tree representation; path
> splitting is invoked after the jump-threading optimization has been
> performed. The following files are modified:
>
>  a) tree-vrp.c
>  b) tree-ssa-threadedge.c
>  c) tree-cfg.c
>  d) tree-ssa-threadedge.h
>  e) tree-cfg.h
>  f) cfghooks.c
>
> The diff is attached to this mail and pasted below.
>>I would recommend this as a separate pass.  One of the goals for GCC 6 
>>is to break out the threader into its own pass, if you're piggybacking 
>>on the threader it just makes this task harder.

>>Also by implementing this as a distinct pass you have

  1   2   >