On 2015-03-09 8:10 PM, Ajit Kumar Agarwal wrote:
-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: Monday, March 09, 2015 11:01 PM
To: Richard Biener
Cc: Ajit Kumar Agarwal; vmaka...@redhat.com; gcc@gcc.gnu.org; Vinod Kathail;
Shail Aditya Gupta; Vidhumouli Hunsigida; Nagaraju Mekala
Subject: Re: Proposal for path splitting for reduction in register pressure for
Loops.
On 03/09/15 05:40, Richard Biener wrote:
On Sun, Mar 8, 2015 at 8:49 PM, Jeff Law <l...@redhat.com> wrote:
On 03/08/15 12:13, Richard Biener wrote:
I see. This basically creates two loop latches and thus will make
our loop detection code turn the loop into a fake loop nest. Not
sure if that is a good idea.
I'd have to sit down and ponder this for a while -- what would be the
register pressure implications of duplicating the contents of the
join block into its predecessors, leaving an empty joiner so that we
still have a single latch?
Good question. Now another question is why we don't choose this way
to disambiguate loops with multiple latches? (create a forwarder as
new single latch block)
Dunno. The RTL jump optimizer ought to eliminate the unnecessary jumping late
in the optimization pipeline and creating the forwarder allows us to put the
loop into a "more canonical" form for the loop optimizers. Seems like it'd be
worth some experimentation.
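To make the forwarder idea concrete, here is a minimal sketch on a toy CFG. It is not GCC code: the dict-based CFG, the function name merge_latches and the block name latch_fwd are all made up for illustration, and the caller is assumed to already know which blocks are latches.

```python
def merge_latches(cfg, header, latches, forwarder="latch_fwd"):
    """Route every back edge to `header` through one new empty block.

    cfg maps a block name to the list of its successor names.
    `latches` is assumed to be supplied by the caller (hypothetical helper,
    no loop discovery is done here).
    """
    new_cfg = {block: list(succs) for block, succs in cfg.items()}
    for latch in latches:
        # Redirect the back edge latch -> header to latch -> forwarder.
        new_cfg[latch] = [forwarder if s == header else s
                          for s in new_cfg[latch]]
    # The forwarder is empty and becomes the only latch of the loop.
    new_cfg[forwarder] = [header]
    return new_cfg


# A loop whose two arms both jump back to the header (two latches).
cfg = {
    "preheader": ["header"],
    "header": ["body", "exit"],
    "body": ["then", "else"],
    "then": ["header"],
    "else": ["header"],
    "exit": [],
}

single_latch_cfg = merge_latches(cfg, "header", ["then", "else"])
```

After the call, "then" and "else" both branch to latch_fwd, and latch_fwd is the only block inside the loop with an edge back to the header. Whether the extra jump survives to the final code depends on later jump cleanup, as discussed above.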
I agree with Jeff on this. The above approach of path splitting will keep the
loop in a more canonical form for the loop optimizers.
I'm having trouble seeing how Ajit's proposal helps reduce register pressure
in any significant way except perhaps by exposing partially dead code. And if
that's the primary benefit, we may be better off actually building a proper
framework for PDCE.
Jeff: The above approach of path splitting is very similar to superblock
formation, as we are duplicating the join node into all of its predecessors.
By doing so, the predecessor blocks that receive the duplicated join node
become larger, which widens the scope available to scheduling and register
allocation. The advantages are similar to those of superblocks: basic block
scheduling gets more opportunities to schedule independent operations.
Path splitting should also have a significant impact on register allocation
in LRA. Because the predecessor blocks are larger, the scope of LRA
increases and LRA can reuse registers, which affects live ranges and
register pressure, and we can use better heuristics for spill code in LRA.
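As a rough illustration of the transformation being discussed, the sketch below duplicates a join block into its predecessors on a toy model and leaves the joiner empty. It is only a sketch under assumptions: blocks are plain lists of instruction strings, and the name split_paths and the example instructions are hypothetical, not part of GCC or of the proposal.

```python
def split_paths(blocks, preds, join):
    """Copy the instructions of `join` to the end of each predecessor.

    blocks maps a block name to its list of instruction strings.
    preds maps a block name to the list of its predecessor names.
    The join block is left empty, so it can still act as the single latch.
    """
    new_blocks = {name: list(insns) for name, insns in blocks.items()}
    for pred in preds[join]:
        # Each predecessor now ends with its own copy of the join code.
        new_blocks[pred].extend(blocks[join])
    new_blocks[join] = []
    return new_blocks


# Loop body with an if/else diamond; "join" is also the loop latch.
blocks = {
    "then": ["t = a + b"],
    "else": ["t = a - b"],
    "join": ["s = s + t", "i = i + 1"],
}
preds = {"join": ["then", "else"]}

split = split_paths(blocks, preds, "join")
```

In the result, "then" and "else" each hold three instructions and "join" is empty, so a block-local scheduler or allocator sees the definition of t and its use in one block on each path. The cost is the duplicated code, which is where the I-cache concerns raised below come in.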
We already have superblock formation (-ftracer and
-fsched2-use-superblocks), so it can be tried. But I guess they are
not the default for a reason. The proposal can improve code in some cases,
e.g. I can say it can definitely help the current inheritance in LRA.
The problem is getting stable improvements. So many things mix together
here, e.g. I-cache behaviour (loop unrolling has the same problems),
alignment and so on. It is very hard to predict the results of such
optimizations for modern Intel CPUs, which are, roughly speaking, very
complicated black-box interpreters. It is easier for less sophisticated
architectures. It was easy for the much more predictable Itanium. That is
why we already have such code.
In any case, it would be interesting to try RA on bigger blocks and
EBBs. Maybe some things can be found that can be improved in RA.
Spec95 fpppp (a famous benchmark with a huge basic block and very high
register pressure) is too big for a human to analyze its RA behaviour.