Re: Improvements of the haifa scheduler

Ayal Zaks Fri, 09 Mar 2007 14:28:49 -0800

Maxim Kuvyrkov <[EMAIL PROTECTED]> wrote on 04/03/2007 11:53:47:

> Hi.
>
> I want to share some of my thoughts and doings on improving / cleaning
> up current GCC instruction scheduler (Haifa) - most of them are just
> small obvious improvements.
>
> I have semi-ready patches for about a half of them and would appreciate
> any early suggestion or comments on the following draft plan:
>
> 1. Remove compute_forward_dependencies ().  [Done]
>
> Since new representation of dependencies lists was checked in, we don't
> longer need to compute forward dependencies separately.  It would be
> natural to add forward links at the same time as we generate backward
ones.
>
> 2. Use alloc_pools instead of obstacks for dep_nodes and deps_lists.
> [In progress]
>
> As pointed out by Jan Hubicka scheduler peaks +100M on a testcase for
> PR28071 after my patch for new dependencies lists was checked in.
> Though alloc_pools should have been my first choice while writing that
> patch, I decided to mimic as close as possible the original rtx
> instruction lists with their scheme of deallocation at the very end.  So
> the next step would be to define proper lifetime of dependency lists
> and use alloc_pools to reuse nodes and lists from the previous regions.


It might be possible to shrink such large DDG's by removing edges that are
redundant wrt schduling, due to transitivity. Not sure how much we'll gain
by such a (non-trivial, dangerous) effort.

>
> Which brings us to ...
>
> 3. Define clear interface for manipulating dependencies.  [In progress]
>
> This one popped up when I began to debug <2> and understood that the
> scheduler uses and changes dependencies lists the way it shouldn't.
> Lists are being copied, edited and deleted directly without interaction
> with sched-deps.c .  What the scheduler really needs is the following
> set of primitives:
>    o FOR_EACH_DEP (insn, which_list, iterator, dep) - walk through
> insn's which_list (one of {backward, resolved_backward, forward,
> resolved_forward}) and provide the user with the dep.  Ayal Zaks
> suggested this type of macro weeks ago but at that time I didn't agree
> with him.
>    o dep_t find_dep_between (producer, consumer) - find dependency
> between two instructions.  Currently we walk through the list looking
> for what we need.  A better way would be to first check dependency
> caches and then, if we can't determine that there is no dependency, walk
> through the shorter list given the choice of two: producer's forward
> list and consumer's backward list.
>    o void add_dep (dep_t) - Add a dependency.
>    o void remove_dep (iterator) - Remove dependency pointed out by
iterator.
>    o void resolve_dep (iterator) - Resolve dependency pointed out by
> iterator.
>    o int get_list_size (insn, which_list) - Get the size of the insn's
> which_list.
>    o bool list_has_only_speculative_deps (insn, which_list) - Return
> true if all insn's dependencies can be overcome with some sort of
> speculation.

It would probably be cleaner to somehow seperate or designate the
speculative part from the core DDG part (in terms of data and interfaces).

>    o void {create, delete}_dependency_lists (insn) - Create / delete
> dependency lists for insn.
>
> As you can see, the scheduler doesn't need to know the internal
> representation of the deps_list / dep_node.
>
> 4. Support speculative loads into subregs.  [Planned]
>
> As noted by Jim Wilson current support for ia64 speculation doesn't
> handle subregs though that would be easy to fix.
>
> 5. Make sched-deps.c mark only those dependencies as speculative which
> can actually be overcame with speculation types currently in use.
[Planned]
>
> At the moment we first generate speculative dependencies and only at the
> moment of adding instruction to the ready list we check if we can (or it
> worth to) overcome every of its dependencies.
>
> 6. Make ds_t a structure.  [Planned]
>
> ds_t is type for representing status of a dependency.  It contains
> information about types of the dependency (true, output, anti) and
> probabilities of speculation success (begin_data, be_in_data,
> begin_control, be_in_control) - that makes three bits and for integers
> coded in a single int.  Historical reasons forced this inelegant
> approach but now the reasons are inexistent and the problem can be
> solved in a natural way.
>
> 7. Use cse_lib in sched-rgn.c .  [In progress]
>
> At the moment cse_lib works to improve alias analysis only during
> sched-ebb scheduling.  It is trivial that we can also enable it when
> scheduling single block regions in sched-rgn.  The patch for this is
> a one-liner which was tested on a bootstrap but not on SPECs.
>
> It is also possible to use cse_lib on sequential basic_blocks of the
> region thus handling them as an extended basic block.
>
> If it is possible to save cse_lib states, then we'll be able to process
> trees and merging capabilities are required for DAGs.  Don't know if
> this can be done.
>
> 8. Don't generate a memory barrier on simplejump.  [Done]
>
> sched-deps.c handles every jump in the scheduling region as a memory
> barrier - e.g. almost no memory operation can be moved through it.  But
> unconditional jumps don't really need such restrictions.

unconditional jumps typically target blocks that have multiple
predecessors, requiring duplicative (upwards) code motion. Right?

> A one-liner
> patch for this was tested on the bootstrap but not on SPECs.
>
> 9. Use sched-ebb on other architectures.  [Done]
>
> After patches for ia64 speculation and follow up fixes to them sched-ebb
> no longer corrupts CFG and can safely be used as not final pass on
> platforms other than ia64.  I successfully bootstrapped (and, probably,
> regtested) the compiler on x86_64 with no cfg rebuild after sched-ebb
> and a change in common.opt to run sched-ebb scheduler by default for
> sched2 pass.  I suppose powerpc might benefit from ebb scheduling,
> though don't have a SPEC box to run tests on.

we can test this on powerpc.

>
> 10. Leveled priority.  [Done]
>
> Haifa scheduler's main heuristic is the length of critical path.  It is
> computed just before the scheduling for the whole region and this
> computation has (as it looks to me) a severe drawback:
>    o It is good for single block regions only.
>
> Let's consider a diamond region if-then-else-endif:
>
> When scheduling 'if' ready instruction from 'if', 'then', 'else' and
> 'endif' look [almost] equally good to the scheduler, but
> instructions from only 'if' and 'endif' should.

hmm, wouldn't the 'then' and 'else blocks appear with probability 50 (or
something obtained from profiling), whereas 'endif' will appear with
probability 100?

>
> The problem can become worse on such architectures as arm: lets say 'if'
> is a small block with 3 cycle of critical path length and 'then'
> is an unlikely branch with several ready instructions and a jump in the
> end.  Guess what, all the ready instructions from 'then' will be moved
> to 'if'.  This happens because jump on arm has 16 cycle latency and
> therefore all the instructions in 'then' will get priorities starting
> from 16 which is surely greater then the maximal priority of 3 for
> instructions from 'if'.  There must be PR with the testcase in bugzilla
> but now I can't find it.
>
> There also are several less significant quirks in computation and use of
> the priorities.
>
> So, why leveled priority?  Because there will be 3 levels of priority:
>
>    o priority2 - the priority computed much like it is now *but* with
> respect to the probability of the execution of the consumer.
>
> priority2(producer) = max ((priority2(consumer) + dep_cost (producer,
> consumer)) * relative_probability (producer_block, consumer_block));

where max is taken over all consumers dependant upon the producer.

For region-based critical path, you may want to consider two or more
'independent' paths. For example, take the max over all consumers that
belong to the same block, and then compute a weighted average of these
max's (weighted according to normalized relative probability of reaching
each consumer block; admittedly not perfect, as they might not be
independent) instead of picking the largest and discarding all the rest.

>
>    o priority1 - priority2 after adjustment by target via
> adjust_priority() hook.  This is done at the moment of adding
> instruction to the ready list.
>
> priority1(insn) = targetm.sched.adjust_priority(insn, priority2);
>
>    o priority - priority actually used by scheduler.
>
> priority(insn) = priority1(insn) * relative_probability (target_block,
> insn_block);
>
> I have a patch that implements this scheme for sched-ebb.  Make it
> work for sched-rgn is not a problem - I just plan to make it first
> available for ia64 sched-ebb and only after public testing for commonly
> used sched-rgn as well.

Curious how much it helps.


> 11. Implement control speculation in sched-ebb.  [In progress]
>
> Few months ago I came across a spot in sched-deps.c where dependencies
> between jumps and memory operations are added.  Before I found that spot
> I thought that these dependencies are added in sched-ebb.c:
> add_deps_for_risky_insns () and considered it as a source for ia64
> control speculation during sched2.  Surprisingly
> add_deps_for_risky_insns () rarely added dependencies and none of those
> become control speculative.  Now I see what went wrong.
>
> I have a semi-ready patch for this and plan to finish it as soon as I'm
> done with <2> and <3>.
>
> 12. Initialize scheduler data structures for each region separately and
> index them by luid.  [Done]
>
> This will make a smaller memory footprint and more clear scheduler
> initialization infrastructure.
>
> A variant on this work will soon be checked in on sel-sched-branch.
>
> 13. Generate regions from extended basic blocks in sched-rgn.c  [Planned]
>
> sched-rgn.c already has all the necessary infrastructure to handle
> extended basic block.  The missing part is in formation of the regions
> which creates a ebb for each basic block.  Also we need to teach
> sched-rgn of few optimization sched-ebb capable of (e.g. moving a jump
> through partially dead instructions).  After this work is done we'll
> have a choice of three types of regions:
>    o single block regions
>    o extended basic block regions
>    o DAG of ebbs
>
> 14. Some sort of must alias analysis for scheduler.  [Planned]
>
> At the moment ia64 data speculation is used very conservatively - only
> when we can't issue anything else on the cycle.  Data speculation needs
> an alias analysis that can infer the probability of two memory locations
> to alias.  It doesn't need to be 100% correct as the recovery code will
> be generated in any case, but on the other hand it should be capable of
> finding trivial must alias relations.  Current implementation of data
> speculation suffers most when a pointer 'p1' is copied to 'p2' and then
> the program writes to 'p1' and reads from 'p2'.  The speculative loads
> from 'p2' will always fail and introduce a misspeculation stall of about
> 10 - 20 cycles.  The starting point for this work will be an aliasing
> patch by Sanjiv Kumar Gupta.

such must-alias analysis may be useful elsewhere in the compiler of-course.
E.g, when vectorizing a loop containing potentially aliasing pointers using
loop versioning (which is pointless if we know they must alias).

>
> ---
>
> Any comments, suggestions or 'please don't do that' will be greatly
> appreciated.
>
> I should also mention that I do all these works in my spare time which
> is not a lot, thus the above is my plan for about a half of the year.

at least it shows you enjoy doing it ;-)
Ayal.

>
>
> Thanks,
>
> Maxim
>
>

Re: Improvements of the haifa scheduler

Reply via email to