Re: Reducing Register Pressure based on Instruction Scheduling and Register Allocator!!

Vladimir Makarov Fri, 06 Jun 2014 09:08:23 -0700

On 2014-06-06, 10:48 AM, Ajit Kumar Agarwal wrote:

Hello All:


I was looking further the aspect of reducing register pressure based on 
Register Allocation and Instruction Scheduling and the
Following observation being made on reducing register pressure based on the 
existing papers on reducing register pressure
Based on scheduling approach.

Does the following aspect of reducing register pressure is already been 
implemented in GCC?

Minimum register usage through better Instruction scheduling

Instruction scheduling play an important role increase or decrease of register 
pressure. If the instruction scheduler is done before
register allocation the register pressure will increase whereas if the register 
allocation is done before instruction scheduling the false
antidependence will be created. For the out of order superscalar process the 
scheduling will be done in such a way that register
pressure will be decreased. To achieve ILP the register pressure increases but 
in superscalar processor the register pressure is
important. To achieve the decrease of register pressure the instruction 
scheduling for ILP should also consider the register pressure.

Scheduling for OOO is not so important. What is important is SoftwarePipelining. OOO processor can look through few branches but it can notlook through many of them. Quite opposite SP can look through alliterations. Therefore as I know Intel compiler had no insn scheduler atall until introducing Atom but the compiler has SP for a long time.

I recently implemented live-range shrinkage (resulting in registerpressure decrease) on the insn-scheduler base. It has a small effect onx86/x86-64 (and requires a lot of additional compiler-time) andtherefore it is not switched off by default. I guess one reason forthat is register-pressure before the 1st insn scheduling is alreadyclose to minimal. That is my observation which could be wrong aftermore thorough investigation of big code base.

Govindarajan proposed a scheme where the list scheduling is modified to have 
one of the constraint's as available register
along with latency to perform the instruction scheduling. From the data 
dependency graph of the given basic block the
chains are created based on depth first search to form different chains. And 
one register is allocated for each chain in order
to have a baton process where the def is used in next instruction and the next 
instruction def  will be used in next to next
instruction. A given register is assigned to def and use and the same register 
is used for next def and use, which makes it
to have one register for each chain, The main challenges is to assign the same 
register for the different chains if the chains
don't overlap.

The list scheduling is modified to have a parameter as list of available colors 
and release is the release of available colors.
 From the ready queue the nodes in the DAG is removed the live ranges that 
terminate at that node is released and as
added as available colors. This ensures the register pressure will not increase 
by the schedule which was derived from
Graph coloring register allocator.

I think it is too complicated and compiler-time consuming. Thecompile-time is a also big factor for GCC. For example, GCC has globaland local RA. The result probably would be better if we use classicaliterative global RA. But it would be a disaster with compiler timeperspective (besides too complicated RA).

Register pressure sensitivity based Instruction scheduling

The superscalar OOO processor does the dynamic instruction scheduling based on 
the out of order on instruction window
and register renaming to achieve ILP. The main challenges is to achieve a ILP 
schedule and from the ILP schedule, it generates
the instruction sequence with sensitivity to registers is called linearization. 
The ILP schedule groups are formed. The groups are,

1. If the instruction selected is s all the instruction that are scheduled 
before s are formed with one group.

2. All the instruction that are scheduled after s are formed with another group.

The linearization algorithm uses can test and must can test. The can test if 
the selected instruction is s , then all the instruction
that are formed with Group1 and are not scheduled but in the ready list are 
generated with W-1. If the selected instruction is
's' is generated at index i  ,all the instruction are scheduled before s are 
generated with i+W-1 where W is the size of register
  window. The must can test if the selected instruction is generated at index 
i, then all the instruction that are scheduled
after s are generated with index i+W-1. If the instruction scheduled after s 
are not in window the ILP won't be achieved.

The linearization algorithm checks for can test if it satisfies then the 
instruction is generated. If not satisfied it will be
checked with must can test and if it satisfies it will be generated.

The register pressure should not be increased during linearization as there is 
a chance of register pressure getting increased
during scheduling. To achieve this, the priority is assigned for each 
instruction that are in the ready list. If the instruction
selected all the use of the variables in the instruction are scheduled the 
priority is assigned as 2. if an instruction is dependent
  on i and i is dependent on another instruction the Live range is too long and 
the priority is less and assigned to be 1. In the
  ready list it selects the instruction with high priority which ensures the 
register pressure doesn't increases.


It is hard for me to say anything about this approach.

There are approaches where the register pressure is high in the basic block the 
instruction scheduler will be phased after
  Register allocation, otherwise the instruction scheduler is done before the 
register allocator. This is how in gcc there is
  an instruction scheduler before register allocator and after the register 
allocator.

Integrated Code Scheduling and Register Allocation with minimum register 
pressure

The integrated code scheduling where there is a prepass and post pass 
instruction scheduler. The prepass scheduler
increase the overuse of the registers and thus increase the register pressure. 
Whereas the post pass scheduling is
done after the register allocation and the overuse of register is less and thus 
perform schedule with reduced spill
and fetch. The prepass instruction scheduler achieves more parallelism and has 
high degree of parallelism and very
much useful when the register pressure is less and if the register pressure is 
high the post pass scheduling is often
used. HSU, proposed the integrated scheduling where the code scheduling with 
ILP is CSP and CSR is the scheduling
with minimum registers usage.

 From the leader set, the nodes in the DAG which doesn't have predecessors and 
the ready queue are the queues with
nodes whose predecessors has been scheduled. From the ready set or the leader 
is selected and the AVAILREG is greater
than threshold then a node from the ready set is selected and the scheduling 
done is CSP. When the registers become
high with the live registers are less than threshold then the CSR Scheduling is 
done. The nodes from the ready set either
generates a live register or frees the registers. In the CSR approach the nodes 
from the ready which frees the atmost registers
are selected. If there are more than one nodes which frees the registers, the 
nodes with highest cumulative cost is selected.

Global liveness analysis is performed for Liveness of the variables whereas the 
Reference counting is used to determine
when the variables are dead and can be freed. For some architectures the 
scheduling with greater pipeline depth and
higher degree of parallelism is more important with more interlock dependency 
than the spill and fetch. In that
cases if the register pressure is too high and the dependence node with high 
interlock dependencies, then one of the
register is spilled and the thread continues with the CSP which schedules the 
independent nodes between the nodes
with high interlock dependencies.


I guess it should not be called by integrated code scheduling and RA.

I believe something similar was already implemented in GCC(-fira-sched-pressure) with two different models. The first (simple)one is close what you wrote above.

Still even register-pressure sensitive insn scheduling is switched offfor x86/x86-64. Although it decrease code size by about 5% incomparison with regular insn scheduling (and this is a sign of registerpressure decreasing), it still generates a bit bigger code (about 1% asI remember) than without the 1st insn scheduling. As I wrote above Ibelieve that it is because of pressure is close to minimal in GCC beforethe 1st insn scheduling and any insn rearrangement will only increase it.

The currently implemented register-pressure sensitive insn schedulingfits to processors with moderate or big register file on program whereregister-pressure is not high before the 1st insn scheduling but can behigh after the regular the 1st insn scheduling.

In general I'd say that any project on scheduling or RA is hard even forexperienced GCC developer (scheduling is less harder than RA) becausethey are most target parameterized machine-independent optimizations andGCC supports too many targets and based on very complicated model oftarget description (.md description, register file description, numeroustarget hooks etc).

Therefore it is very hard to predict for anybody what the effect of theparticular optimization would be in GCC. The only right prediction isto try a prototype implementation (e.g. for RA, I tried many algorithmsand optimizations before found the current suitable approach).

So if you implement some approach really improving code, nobody willobject whatever they said before about the approach and only will beglad to include this into the GCC code base.

Also I don't completely believe articles in this area. When youimplement their approaches in GCC you will see that effect will be quitesmaller than reported one. I guess that is because the research is doneusually in a simple research compiler and researchers have a tendency toexaggerate their results as the research itself is their major workoutcome.

Re: Reducing Register Pressure based on Instruction Scheduling and Register Allocator!!

Reply via email to