Re: [4/4] Make SMS schedule register moves

Ayal Zaks Mon, 10 Oct 2011 08:43:11 -0700

On Mon, Oct 10, 2011 at 1:57 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Ayal Zaks <ayal.z...@gmail.com> writes:
>>> I agree it's natural to schedule moves for intra-iteration dependencies
>>> in the normal get_sched_window way.  But suppose we have a dependency:
>>>
>>>   A --(T,N,1)--> B
>>>
>>> that requires two moves M1 and M2.  If we think in terms of cycles
>>> (in the SCHED_TIME sense), then this effectively becomes:
>>>
>>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>
>>> because it is now M1 that is fed by both the loop and the incoming edge.
>>> But if there is a second dependency:
>>>
>>>   A --(T,M,0)--> C
>>>
>>> that also requires two moves, we end up with:
>>>
>>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>                                        -->(T,M3,-1)--> B
>>>
>>> and dependence distances of -1 feel a bit weird. :-)  Of course,
>>> what we really have are two parallel dependencies:
>>>
>>>   A --(T,N1,1)--> M1 -->(T,N2,0)--> M2 -->(T,N3,0)--> B
>>>
>>>   A --(T,M1,0)--> M1' -->(T,M2,0)--> M2' -->(T,N3,0)--> B
>>>
>>> where M1' and M2' occupy the same position as M1 and M2 in the schedule,
>>> but are one stage further along.  But we only schedule them once,
>>> so if we take the cycle/SCHED_TIME route, we have to introduce
>>> dependencies of distance -1.
>>>
>>
>> Interesting; had to digest this distance 1 business, a result of
>> thinking in cycles instead of rows (or conversely), and mixing
>> dependences with scheduling; here's my understanding, based on your
>> explanations:
>>
>> Suppose a Use is truely dependent on a Def, where both have been
>> scheduled at some absolute cycles; think of them as timing the first
>> iteration of the loop.
>> Assume first that Use appears originally after Def in the original
>> instruction sequence of the loop (dependence distance 0). In this
>> case, Use requires register moves if its distance D from Def according
>> to the schedule is more than ii cycles long -- by the time Use is
>> executed, the value it needs is no longer available in the def'd
>> register due to intervening occurrences of Def. So in this case, the
>> first reg-move (among D/ii) should appear after Def, recording its
>> value before the next occurrence of Def overwrites it, and feeding
>> subsequent moves as needed before each is overwritten. Thus the
>> scheduling window of this first reg-move is within (Def, Def+ii).
>>
>> Now, suppose Use appears before Def, i.e., Use is upwards-exposed; if
>> it remains scheduled before the Def, no reg-move is needed. If, otoh,
>> Def is scheduled (D>0 cycles) before Use, breaking the anti-dependence
>> between them, (D/ii + 1) reg-moves are needed in order to feed Use
>> with the live-in value before Def. So in this case, the first reg-move
>> should appear before Def (again feeding subsequent moves as needed
>> before each is overwritten). Thus the scheduling window of this first
>> reg-move is within (Def-ii, Def).
>>
>> In any case, each use requires a specific number of reg-moves, which
>> begin either before or after the def; it is always safe (though
>> potentially redundant) to start reg-moving before the def -- uses that
>> do not need the live-in value will ignore it by feeding from a later
>> reg-move.
>
> Right.  The distance on the Def->Use dependency is effectively transferred
> to the dependency between the Def and first move.
>
> And we can potentially have both kinds of Use at the same time.
> We then end up scheduling two moves, one in each window, but require
> them to occupy the same row and column.  It feels more convenient to
> schedule the earlier of the two moves.
>
>> Q: if we generate reg-moves starting from before the def, would
>> redundant reg-moves be eliminated if no use needs access to live-in
>> value? If so, would that simplify their generation? (If not, it may be
>> interesting to understand how to improve DCE to catch it.)
>
> To be clear, the new version of the patch (unlike the old one)
> doesn't emit reg-moves before the def if all uses are distance 0.
> We only do it where some or all uses are distance 1.  The first move
> before the def is always needed in that case.
>


Understood. The question was whether it would simplify things if we
always emit reg-moves before the def. And rely on DCE to hopefully
eliminate redundancies.

> So redudant moves are confined to the case where (a) we have more
> than one move, (b) we have both distance 0 and distance 1 uses, and
> (c) one of the two distance sets requires more moves than the other.
> If the distance 0 dependencies require more moves than the distance
> 1 dependencies, we will have a redudant move in the prologue.
> If the distance 1 dependencies require more moves than the distance
> 0 dependencies, we will have a redudant move in the epilogue.
> But I believe that this is already handled by dce.
>
> (For avoidance of doubt, the new code is more accurate than
> current trunk in this respect.)
>
>> The issue of assigning stages to reg-moves is mostly relevant for
>> prolog and epilog generation, which requires and receives special
>> attention -- handled very nicely by ps_num_consecutive_stages! Note
>> that currently a simple boolean indicator for (the exceptional case
>> of) double stages would suffice, instead of generalizing to arbitrary
>> nums of consecutive stages (see any potential use for them?).
>
> Not in the immediate term.  But I think having a boolean indicator
> would be inconsistent.  If the distance field is an int (even though
> we only expect distance-0 and distance-1 register dependencies)
> then I think the number of stages should be too.
>
> I did wonder originally about using a boolean, but IMO, it makes
> the code less readable rather than more.  Instead of a simple
> range check like:
>
>     if (first_stage_for_insn <= last_stage_in_range
>         && last_stage_for_insn >= first_stage_in_range)
>
> we end up with the equivalent of:
>
>     if (first_stage_for_insn <= last_stage_in_range
>         && (double_stage_move_p (...)
>             ? first_stage_for_insn + 1 >= first_stage_in_range
>             : first_stage_for_insn >= first_stage_in_range))
>
> with no corresponding simplification elsewhere.
>

Sure. But setting the range can be done by consulting an simple
indicator, rather than generalizing to arbitrary stage numbers; e.g.:

+ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
+{
+  if (id >= ps->g->num_nodes && ps_reg_move (ps, id)->double_stages)
+    return 2;
+  else
+    return 1;
+}

or

-  last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
+  if (...double_stages) last_u = first_u + 1;
+  else last_u = first_u;

>>> However...
>>>
>>>> Also note that if moves are assigned absolute cycles, it becomes clear
>>>> to which stage each belongs (just like any other instruction),
>>>> regulating their generation in prolog and epilog.
>>>
>>> ...I have to concede that it makes the stage calculation easier,
>>> and that that tips the balance.  (Although again, a move can belong
>>> to two stages simultanouesly.)
>>>
>>> Anyway, here's an updated patch that uses cycle times.  I've also
>>> dropped the code that tried to allow windows to start and end on
>>> the same row (and therefore schedule "either side" of that row).
>>> I thought it might help with some VLIWy DFAs, but it was always
>>> going to be of minor benefit, and we don't try that elsewhere,
>>> so let's avoid the complication.
>>>
>>
>> ok. Such windows seem rare, as they imply zero move (or def->move)
>> latencies, iinm. Leaving a note behind would be good.
>
> OK.  Since this same restriction applies to the amin scheduling
> window code, I suppose the natural place would be in the algorithm
> description itself:
>
> /* The SMS scheduling algorithm itself
>   -----------------------------------
>   Input: 'O' an ordered list of insns of a loop.
>   Output: A scheduling of the loop - kernel, prolog, and epilogue.
>   ...
>   42. compute epilogue & prologue
>   43. finish - succeeded to schedule
> */
>
> E.g. adding something like this at the end:
>
>   ??? The algorithm restricts the scheduling window to II cycles.
>   In rare cases, it may be better to allow windows of II+1 cycles.
>   The window would then start and end on the same row, but with
>   different "must precede" and "must follow" requirements.
>
> Let me know what you think and I'll add it as a follow-on patch.
>

great, thanks.

>>> Bootstrapped & regression-tested on powerpc64-linux-gnu with
>>> -fmodulo-sched and -fmodulo-sched-allow-regmoves enabled by default.
>>> Also retested on the ARM benchmarks.  OK to install?
>>>
>>
>> Yes, OK.
>> Some points to consider raised above, and some comments added below.
>
> Thanks, applied with the changes below.
>
>>> @@ -466,21 +506,167 @@ print_node_sched_params (FILE *file, int
>>>   for (i = 0; i < num_nodes; i++)
>>>     {
>>>       node_sched_params_ptr nsp = SCHED_PARAMS (i);
>>> -      int j;
>>>
>>>       fprintf (file, "Node = %d; INSN = %d\n", i,
>>>               INSN_UID (ps_rtl_insn (ps, i)));
>>>       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>>>       fprintf (file, " time = %d:\n", nsp->time);
>>> -      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
>>> -      for (j = 0; j < nsp->nreg_moves; j++)
>>> -       {
>>> -         ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + 
>>> j);
>>
>> Is another way provided to print a summary of the regmoves? Or does
>> the info dumped while scheduling each one suffice for practical
>> tracing and debugging.
>
> Yeah, the new info should be more complete.  It gives the pattern
> (which is what we dump here), but also lists the producers and
> consumers of each move (which we don't print here).
>
>>> +      fprintf (file, " stage = %d:\n", nsp->stage);
>>> +    }
>>> +}
>>> +
>>> +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
>>> +static void
>>> +set_columns_for_row (partial_schedule_ptr ps, int row)
>>> +{
>>> +  ps_insn_ptr cur_insn;
>>> +  int column;
>>> +
>>> +  column = 0;
>>> +  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = 
>>> cur_insn->next_in_row)
>>> +    SCHED_COLUMN (cur_insn->id) = column++;
>>> +}
>>> +
>>> +/* Set SCHED_COLUMN for each instruction in PS.  */
>>> +static void
>>> +set_columns_for_ps (partial_schedule_ptr ps)
>>> +{
>>> +  int row;
>>> +
>>> +  for (row = 0; row < ps->ii; row++)
>>> +    set_columns_for_row (ps, row);
>>> +}
>>> +
>>> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>>> +   Its predecessors and successors have already been scheduled.
>>
>> well, except for some (preceeding or succeeding?) moves that have not
>> yet been scheduled, right?
>
> Ah, yeah, fixed to:
>
> /* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
>   Its single predecessor has already been scheduled, as has its
>   ddg node successors.  (The move may have also another move as its
>   successor, in which case that successor will be scheduled later.)
>

perfect

>>> +
>>> +   The move is part of a chain that satisfies register dependencies
>>> +   between a producing ddg node and various consuming ddg nodes.
>>> +   If some of these dependencies cross a loop iteration (that is,
>>> +   have a distance of 1) then DISTANCE1_USES is nonnull and contains
>>> +   the set of uses with distance-1 dependencies.  DISTANCE1_USES
>>> +   is null otherwise.
>>> +
>>
>> Maybe clarify that they are upwards-exposed or live-in uses.
>
> OK, changed to:
>
>   The move is part of a chain that satisfies register dependencies
>   between a producing ddg node and various consuming ddg nodes.
>   If some of these dependencies have a distance of 1 (meaning that
>   the use is upward-exposoed) then DISTANCE1_USES is nonnull and

exposed (typo)

>   contains the set of uses with distance-1 dependencies.
>   DISTANCE1_USES is null otherwise.
>
>>> +   MUST_FOLLOW is a scratch bitmap that is big enough to hold
>>> +   all current ps_insn ids.
>>> +
>>> +   Return true on success.  */
>>> +static bool
>>> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
>>> +                  sbitmap distance1_uses, sbitmap must_follow)
>>> +{
>>> +  unsigned int u;
>>> +  int this_time, this_distance, this_start, this_end, this_latency;
>>> +  int start, end, c, ii;
>>> +  sbitmap_iterator sbi;
>>> +  ps_reg_move_info *move;
>>> +  rtx this_insn;
>>> +  ps_insn_ptr psi;
>>> +
>>> +  move = ps_reg_move (ps, i_reg_move);
>>> +  ii = ps->ii;
>>> +  if (dump_file)
>>> +    {
>>> +      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
>>> +              ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
>>> +              PS_MIN_CYCLE (ps));
>>> +      print_rtl_single (dump_file, move->insn);
>>> +      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
>>> +      fprintf (dump_file, "=========== =========== =====\n");
>>> +    }
>>> +
>>> +  start = INT_MIN;
>>> +  end = INT_MAX;
>>> +
>>> +  /* For dependencies of distance 1 between a producer ddg node A
>>> +     and consumer ddg node B, we have a chain of dependencies:
>>> +
>>> +        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>>> +
>>> +     where Mi is the ith move.  For dependencies of distance 0 between
>>> +     a producer ddg node A and consumer ddg node C, we have a chain of
>>> +     dependencies:
>>> +
>>> +        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>>> +
>>> +     where Mi' occupies the same position as Mi but occurs a stage later.
>>> +     We can only schedule each move once, so if we have both types of
>>> +     chain, we model the second as:
>>> +
>>> +        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>>> +
>>> +     First handle the dependencies between the previously-scheduled
>>> +     predecessor and the move.  */
>>> +  this_insn = ps_rtl_insn (ps, move->def);
>>> +  this_latency = insn_latency (this_insn, move->insn);
>>> +  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
>>> +  this_time = SCHED_TIME (move->def) - this_distance * ii;
>>> +  this_start = this_time + this_latency;
>>> +  this_end = this_time + ii;
>>> +  if (dump_file)
>>> +    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>>> +            this_start, this_end, SCHED_TIME (move->def),
>>> +            INSN_UID (this_insn), this_latency, this_distance,
>>> +            INSN_UID (move->insn));
>>> +
>>> +  if (start < this_start)
>>
>> redundant; start=INT_MIN is surely < this_start.
>>
>>> +    start = this_start;
>>> +  if (end > this_end)
>>
>> redundant; end=INT_MAX is surely > this_end.
>
> I did this deliberately because the order in which we apply the limits
> isn't important.  Making them assymetrical by applying this kind of
> micro-optimisation is IMO a bad idea.  The compiler will do it for us.
>
> E.g. I originally had an extra limit to make sure that we didn't add
> new stages.  If we applied the optimisation by hand, and then someone
> added a new limit like that later, they'd have to be careful about
> where they put it.
>

ok, np.

>>> +    end = this_end;
>>> +
>>> +  /* Handle the dependencies between the move and previously-scheduled
>>> +     successors.  */
>>
>> (maybe assert they have indeed all been scheduled)
>
> Hmm, I don't think that'd be useful.  We've already read the schedule
> time (and row and column) by this point, and we don't assert the same
> thing elsewhere (in the code that assigns moves to uses, or in
> get_sched_window itself).
>

(ok; only a thought, hence the parentheses)

>>> +  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
>>> +    {
>>> +      this_insn = ps_rtl_insn (ps, u);
>>> +      this_latency = insn_latency (move->insn, this_insn);
>>> +      if (distance1_uses && !TEST_BIT (distance1_uses, u))
>>
>> shouldn't this be if (distance1_uses && TEST_BIT (distance1_uses, u))?
>>
>>> +       this_distance = -1;
>>> +      else
>>> +       this_distance = 0;
>
> No, this condition corresponds to:
>
>   /* For dependencies of distance 1 between a producer ddg node A
>      and consumer ddg node B, we have a chain of dependencies:
>
>         A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
>
>      where Mi is the ith move.  For dependencies of distance 0 between
>      a producer ddg node A and consumer ddg node C, we have a chain of
>      dependencies:
>
>         A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
>
>      where Mi' occupies the same position as Mi but occurs a stage later.
>      We can only schedule each move once, so if we have both types of
>      chain, we model the second as:
>
>         A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
>
>      First handle the dependencies between the previously-scheduled
>      predecessor and the move.  */
>
> i.e. we want a distance of -1 when (a) the original definition has uses
> with both distance 0 and distance 1 and (b) the particular use we're
> looking at has distance 0.
>

ah, right

>>> +      this_time = SCHED_TIME (u) + this_distance * ii;
>>> +      this_start = this_time - ps->ii;
>>
>> use ii instead of ps->ii
>
> Oops, fixed.
>
>>> +      this_end = this_time - this_latency;
>>> +      if (dump_file)
>>> +       fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
>>> +                this_start, this_end, SCHED_TIME (u), INSN_UID 
>>> (move->insn),
>>> +                this_latency, this_distance, INSN_UID (this_insn));
>>> +
>>> +      if (start < this_start)
>>> +       start = this_start;
>>> +      if (end > this_end)
>>> +       end = this_end;
>>> +    }
>>> +
>>> +  if (dump_file)
>>> +    {
>>> +      fprintf (dump_file, "----------- ----------- -----\n");
>>> +      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, 
>>> min)");
>>> +    }
>>>
>>> -         fprintf (file, " reg_move = ");
>>> -         print_rtl_single (file, move->insn);
>>> +  sbitmap_zero (must_follow);
>>> +  SET_BIT (must_follow, move->def);
>>> +
>>> +  start = MAX (start, end - (ii - 1));
>>
>> ok, so this excludes considering both ends of a possible ii-cycled
>> window. Such a window would imply zero move (or def->move) latencies
>> iinm, and more care in setting must_follow/must_precede.
>
> Right.
>
>>> +  for (c = end; c >= start; c--)
>>> +    {
>>> +      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
>>> +                                        move->uses, must_follow);
>>
>> ok - def must_follow the move, if both are scheduled on same row (as
>> we potentially start from this (end) row moving backwards), but uses
>> should precede the move, assuming that if move and use are on same row
>> the sched distance between them is ii (i.e., non-zero move->use
>> latency)
>
> Right.
>
>>> @@ -1513,6 +1651,19 @@ sms_schedule (void)
>>>              continue;
>>>            }
>>>
>>> +         /* Moves that handle incoming values might have been added
>>> +            to a new first stage.  It's too late to try any kind of
>>> +            rotation, so just bump the stage count.  */
>>
>> hmm, one could still apply the rotation optimization at this time if
>> desired, no?
>
> Hmm, maybe :-)  I changed the comment to:
>
>          /* Moves that handle incoming values might have been added
>             to a new first stage.  Bump the stage count if so.
>
>             ??? Perhaps we could consider rotating the schedule here
>             instead?  */
>

Great.
Thanks,
Ayal.


> Richard
>
>
> gcc/
>        * modulo-sched.c (ps_reg_move_info): Add num_consecutive_stages.
>        (SCHED_FIRST_REG_MOVE, SCHED_NREG_MOVES): Delete.
>        (node_sched_params): Remove first_reg_move and nreg_moves.
>        (ps_num_consecutive_stages, extend_node_sched_params): New functions.
>        (update_node_sched_params): Move up file.
>        (print_node_sched_params): Print the stage.  Don't dump info related
>        to first_reg_move and nreg_moves.
>        (set_columns_for_row): New function.
>        (set_columns_for_ps): Move up file and use set_columns_for_row.
>        (schedule_reg_move): New function.
>        (schedule_reg_moves): Call extend_node_sched_params and
>        schedule_reg_move.  Extend size of uses bitmap.  Initialize
>        num_consecutive_stages.  Return false if a move could not be
>        scheduled.
>        (apply_reg_moves): Don't emit moves here.
>        (permute_partial_schedule): Handle register moves.
>        (duplicate_insns_of_cycles): Remove for_prolog.  Emit moves according
>        to the same stage-count test as ddg nodes.
>        (generate_prolog_epilog): Update calls accordingly.
>        (sms_schedule): Allow move-scheduling to add a new first stage.
>
> Index: gcc/modulo-sched.c
> ===================================================================
> --- gcc/modulo-sched.c  2011-10-10 11:03:23.278273801 +0100
> +++ gcc/modulo-sched.c  2011-10-10 12:24:44.539932692 +0100
> @@ -153,6 +153,9 @@ struct ps_reg_move_info
>   rtx old_reg;
>   rtx new_reg;
>
> +  /* The number of consecutive stages that the move occupies.  */
> +  int num_consecutive_stages;
> +
>   /* An instruction that sets NEW_REG to the correct value.  The first
>      move associated with DEF will have an rhs of OLD_REG; later moves
>      use the result of the previous move.  */
> @@ -218,8 +221,6 @@ static partial_schedule_ptr sms_schedule
>  static void permute_partial_schedule (partial_schedule_ptr, rtx);
>  static void generate_prolog_epilog (partial_schedule_ptr, struct loop *,
>                                     rtx, rtx);
> -static void duplicate_insns_of_cycles (partial_schedule_ptr,
> -                                      int, int, int, rtx);
>  static int calculate_stage_count (partial_schedule_ptr, int);
>  static void calculate_must_precede_follow (ddg_node_ptr, int, int,
>                                           int, int, sbitmap, sbitmap, 
> sbitmap);
> @@ -233,8 +234,6 @@ #define NODE_ASAP(node) ((node)->aux.cou
>
>  #define SCHED_PARAMS(x) VEC_index (node_sched_params, node_sched_param_vec, 
> x)
>  #define SCHED_TIME(x) (SCHED_PARAMS (x)->time)
> -#define SCHED_FIRST_REG_MOVE(x) (SCHED_PARAMS (x)->first_reg_move)
> -#define SCHED_NREG_MOVES(x) (SCHED_PARAMS (x)->nreg_moves)
>  #define SCHED_ROW(x) (SCHED_PARAMS (x)->row)
>  #define SCHED_STAGE(x) (SCHED_PARAMS (x)->stage)
>  #define SCHED_COLUMN(x) (SCHED_PARAMS (x)->column)
> @@ -244,15 +243,6 @@ typedef struct node_sched_params
>  {
>   int time;    /* The absolute scheduling cycle.  */
>
> -  /* The following field (first_reg_move) is the ps_insn id of the first
> -     register-move instruction added to handle the modulo-variable-expansion
> -     of the register defined by this node.  This register-move copies the
> -     original register defined by the node.  */
> -  int first_reg_move;
> -
> -  /* The number of register-move instructions added.  */
> -  int nreg_moves;
> -
>   int row;    /* Holds time % ii.  */
>   int stage;  /* Holds time / ii.  */
>
> @@ -344,6 +334,17 @@ ps_first_note (partial_schedule_ptr ps,
>   return ps->g->nodes[id].first_note;
>  }
>
> +/* Return the number of consecutive stages that are occupied by
> +   partial schedule instruction ID in PS.  */
> +static int
> +ps_num_consecutive_stages (partial_schedule_ptr ps, int id)
> +{
> +  if (id < ps->g->num_nodes)
> +    return 1;
> +  else
> +    return ps_reg_move (ps, id)->num_consecutive_stages;
> +}
> +
>  /* Given HEAD and TAIL which are the first and last insns in a loop;
>    return the register which controls the loop.  Return zero if it has
>    more than one occurrence in the loop besides the control part or the
> @@ -456,6 +457,45 @@ set_node_sched_params (ddg_ptr g)
>                         node_sched_param_vec, g->num_nodes);
>  }
>
> +/* Make sure that node_sched_param_vec has an entry for every move in PS.  */
> +static void
> +extend_node_sched_params (partial_schedule_ptr ps)
> +{
> +  VEC_safe_grow_cleared (node_sched_params, heap, node_sched_param_vec,
> +                        ps->g->num_nodes + VEC_length (ps_reg_move_info,
> +                                                       ps->reg_moves));
> +}
> +
> +/* Update the sched_params (time, row and stage) for node U using the II,
> +   the CYCLE of U and MIN_CYCLE.
> +   We're not simply taking the following
> +   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> +   because the stages may not be aligned on cycle 0.  */
> +static void
> +update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> +{
> +  int sc_until_cycle_zero;
> +  int stage;
> +
> +  SCHED_TIME (u) = cycle;
> +  SCHED_ROW (u) = SMODULO (cycle, ii);
> +
> +  /* The calculation of stage count is done adding the number
> +     of stages before cycle zero and after cycle zero.  */
> +  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> +
> +  if (SCHED_TIME (u) < 0)
> +    {
> +      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> +      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> +    }
> +  else
> +    {
> +      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> +      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> +    }
> +}
> +
>  static void
>  print_node_sched_params (FILE *file, int num_nodes, partial_schedule_ptr ps)
>  {
> @@ -466,21 +506,169 @@ print_node_sched_params (FILE *file, int
>   for (i = 0; i < num_nodes; i++)
>     {
>       node_sched_params_ptr nsp = SCHED_PARAMS (i);
> -      int j;
>
>       fprintf (file, "Node = %d; INSN = %d\n", i,
>               INSN_UID (ps_rtl_insn (ps, i)));
>       fprintf (file, " asap = %d:\n", NODE_ASAP (&ps->g->nodes[i]));
>       fprintf (file, " time = %d:\n", nsp->time);
> -      fprintf (file, " nreg_moves = %d:\n", nsp->nreg_moves);
> -      for (j = 0; j < nsp->nreg_moves; j++)
> -       {
> -         ps_reg_move_info *move = ps_reg_move (ps, nsp->first_reg_move + j);
> +      fprintf (file, " stage = %d:\n", nsp->stage);
> +    }
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in row ROW of PS.  */
> +static void
> +set_columns_for_row (partial_schedule_ptr ps, int row)
> +{
> +  ps_insn_ptr cur_insn;
> +  int column;
> +
> +  column = 0;
> +  for (cur_insn = ps->rows[row]; cur_insn; cur_insn = cur_insn->next_in_row)
> +    SCHED_COLUMN (cur_insn->id) = column++;
> +}
> +
> +/* Set SCHED_COLUMN for each instruction in PS.  */
> +static void
> +set_columns_for_ps (partial_schedule_ptr ps)
> +{
> +  int row;
> +
> +  for (row = 0; row < ps->ii; row++)
> +    set_columns_for_row (ps, row);
> +}
> +
> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS.
> +   Its single predecessor has already been scheduled, as has its
> +   ddg node successors.  (The move may have also another move as its
> +   successor, in which case that successor will be scheduled later.)
>
> -         fprintf (file, " reg_move = ");
> -         print_rtl_single (file, move->insn);
> +   The move is part of a chain that satisfies register dependencies
> +   between a producing ddg node and various consuming ddg nodes.
> +   If some of these dependencies have a distance of 1 (meaning that
> +   the use is upward-exposoed) then DISTANCE1_USES is nonnull and
> +   contains the set of uses with distance-1 dependencies.
> +   DISTANCE1_USES is null otherwise.
> +
> +   MUST_FOLLOW is a scratch bitmap that is big enough to hold
> +   all current ps_insn ids.
> +
> +   Return true on success.  */
> +static bool
> +schedule_reg_move (partial_schedule_ptr ps, int i_reg_move,
> +                  sbitmap distance1_uses, sbitmap must_follow)
> +{
> +  unsigned int u;
> +  int this_time, this_distance, this_start, this_end, this_latency;
> +  int start, end, c, ii;
> +  sbitmap_iterator sbi;
> +  ps_reg_move_info *move;
> +  rtx this_insn;
> +  ps_insn_ptr psi;
> +
> +  move = ps_reg_move (ps, i_reg_move);
> +  ii = ps->ii;
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Scheduling register move INSN %d; ii = %d"
> +              ", min cycle = %d\n\n", INSN_UID (move->insn), ii,
> +              PS_MIN_CYCLE (ps));
> +      print_rtl_single (dump_file, move->insn);
> +      fprintf (dump_file, "\n%11s %11s %5s\n", "start", "end", "time");
> +      fprintf (dump_file, "=========== =========== =====\n");
> +    }
> +
> +  start = INT_MIN;
> +  end = INT_MAX;
> +
> +  /* For dependencies of distance 1 between a producer ddg node A
> +     and consumer ddg node B, we have a chain of dependencies:
> +
> +        A --(T,L1,1)--> M1 --(T,L2,0)--> M2 ... --(T,Ln,0)--> B
> +
> +     where Mi is the ith move.  For dependencies of distance 0 between
> +     a producer ddg node A and consumer ddg node C, we have a chain of
> +     dependencies:
> +
> +        A --(T,L1',0)--> M1' --(T,L2',0)--> M2' ... --(T,Ln',0)--> C
> +
> +     where Mi' occupies the same position as Mi but occurs a stage later.
> +     We can only schedule each move once, so if we have both types of
> +     chain, we model the second as:
> +
> +        A --(T,L1',1)--> M1 --(T,L2',0)--> M2 ... --(T,Ln',-1)--> C
> +
> +     First handle the dependencies between the previously-scheduled
> +     predecessor and the move.  */
> +  this_insn = ps_rtl_insn (ps, move->def);
> +  this_latency = insn_latency (this_insn, move->insn);
> +  this_distance = distance1_uses && move->def < ps->g->num_nodes ? 1 : 0;
> +  this_time = SCHED_TIME (move->def) - this_distance * ii;
> +  this_start = this_time + this_latency;
> +  this_end = this_time + ii;
> +  if (dump_file)
> +    fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> +            this_start, this_end, SCHED_TIME (move->def),
> +            INSN_UID (this_insn), this_latency, this_distance,
> +            INSN_UID (move->insn));
> +
> +  if (start < this_start)
> +    start = this_start;
> +  if (end > this_end)
> +    end = this_end;
> +
> +  /* Handle the dependencies between the move and previously-scheduled
> +     successors.  */
> +  EXECUTE_IF_SET_IN_SBITMAP (move->uses, 0, u, sbi)
> +    {
> +      this_insn = ps_rtl_insn (ps, u);
> +      this_latency = insn_latency (move->insn, this_insn);
> +      if (distance1_uses && !TEST_BIT (distance1_uses, u))
> +       this_distance = -1;
> +      else
> +       this_distance = 0;
> +      this_time = SCHED_TIME (u) + this_distance * ii;
> +      this_start = this_time - ii;
> +      this_end = this_time - this_latency;
> +      if (dump_file)
> +       fprintf (dump_file, "%11d %11d %5d %d --(T,%d,%d)--> %d\n",
> +                this_start, this_end, SCHED_TIME (u), INSN_UID (move->insn),
> +                this_latency, this_distance, INSN_UID (this_insn));
> +
> +      if (start < this_start)
> +       start = this_start;
> +      if (end > this_end)
> +       end = this_end;
> +    }
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "----------- ----------- -----\n");
> +      fprintf (dump_file, "%11d %11d %5s %s\n", start, end, "", "(max, 
> min)");
> +    }
> +
> +  sbitmap_zero (must_follow);
> +  SET_BIT (must_follow, move->def);
> +
> +  start = MAX (start, end - (ii - 1));
> +  for (c = end; c >= start; c--)
> +    {
> +      psi = ps_add_node_check_conflicts (ps, i_reg_move, c,
> +                                        move->uses, must_follow);
> +      if (psi)
> +       {
> +         update_node_sched_params (i_reg_move, ii, c, PS_MIN_CYCLE (ps));
> +         if (dump_file)
> +           fprintf (dump_file, "\nScheduled register move INSN %d at"
> +                    " time %d, row %d\n\n", INSN_UID (move->insn), c,
> +                    SCHED_ROW (i_reg_move));
> +         return true;
>        }
>     }
> +
> +  if (dump_file)
> +    fprintf (dump_file, "\nNo available slot\n\n");
> +
> +  return false;
>  }
>
>  /*
> @@ -508,6 +696,9 @@ schedule_reg_moves (partial_schedule_ptr
>       int nreg_moves = 0, i_reg_move;
>       rtx prev_reg, old_reg;
>       int first_move;
> +      int distances[2];
> +      sbitmap must_follow;
> +      sbitmap distance1_uses;
>       rtx set = single_set (u->insn);
>
>       /* Skip instructions that do not set a register.  */
> @@ -516,6 +707,7 @@ schedule_reg_moves (partial_schedule_ptr
>
>       /* Compute the number of reg_moves needed for u, by looking at life
>         ranges started at u (excluding self-loops).  */
> +      distances[0] = distances[1] = false;
>       for (e = u->out; e; e = e->next_out)
>        if (e->type == TRUE_DEP && e->dest != e->src)
>          {
> @@ -546,6 +738,11 @@ schedule_reg_moves (partial_schedule_ptr
>                gcc_assert (!autoinc_var_is_used_p (u->insn, e->dest->insn));
>              }
>
> +           if (nreg_moves4e)
> +             {
> +               gcc_assert (e->distance < 2);
> +               distances[e->distance] = true;
> +             }
>            nreg_moves = MAX (nreg_moves, nreg_moves4e);
>          }
>
> @@ -556,11 +753,10 @@ schedule_reg_moves (partial_schedule_ptr
>       first_move = VEC_length (ps_reg_move_info, ps->reg_moves);
>       VEC_safe_grow_cleared (ps_reg_move_info, heap, ps->reg_moves,
>                             first_move + nreg_moves);
> +      extend_node_sched_params (ps);
>
>       /* Record the moves associated with this node.  */
>       first_move += ps->g->num_nodes;
> -      SCHED_FIRST_REG_MOVE (i) = first_move;
> -      SCHED_NREG_MOVES (i) = nreg_moves;
>
>       /* Generate each move.  */
>       old_reg = prev_reg = SET_DEST (single_set (u->insn));
> @@ -569,15 +765,18 @@ schedule_reg_moves (partial_schedule_ptr
>          ps_reg_move_info *move = ps_reg_move (ps, first_move + i_reg_move);
>
>          move->def = i_reg_move > 0 ? first_move + i_reg_move - 1 : i;
> -         move->uses = sbitmap_alloc (g->num_nodes);
> +         move->uses = sbitmap_alloc (first_move + nreg_moves);
>          move->old_reg = old_reg;
>          move->new_reg = gen_reg_rtx (GET_MODE (prev_reg));
> +         move->num_consecutive_stages = distances[0] && distances[1] ? 2 : 1;
>          move->insn = gen_move_insn (move->new_reg, copy_rtx (prev_reg));
>          sbitmap_zero (move->uses);
>
>          prev_reg = move->new_reg;
>        }
>
> +      distance1_uses = distances[1] ? sbitmap_alloc (g->num_nodes) : NULL;
> +
>       /* Every use of the register defined by node may require a different
>         copy of this register, depending on the time the use is scheduled.
>         Record which uses require which move results.  */
> @@ -601,8 +800,21 @@ schedule_reg_moves (partial_schedule_ptr
>
>                move = ps_reg_move (ps, first_move + dest_copy - 1);
>                SET_BIT (move->uses, e->dest->cuid);
> +               if (e->distance == 1)
> +                 SET_BIT (distance1_uses, e->dest->cuid);
>              }
>          }
> +
> +      must_follow = sbitmap_alloc (first_move + nreg_moves);
> +      for (i_reg_move = 0; i_reg_move < nreg_moves; i_reg_move++)
> +       if (!schedule_reg_move (ps, first_move + i_reg_move,
> +                               distance1_uses, must_follow))
> +         break;
> +      sbitmap_free (must_follow);
> +      if (distance1_uses)
> +       sbitmap_free (distance1_uses);
> +      if (i_reg_move < nreg_moves)
> +       return false;
>     }
>   return true;
>  }
> @@ -626,39 +838,6 @@ apply_reg_moves (partial_schedule_ptr ps
>          df_insn_rescan (ps->g->nodes[i_use].insn);
>        }
>     }
> -
> -  FOR_EACH_VEC_ELT (ps_reg_move_info, ps->reg_moves, i, move)
> -    add_insn_before (move->insn, ps_first_note (ps, move->def), NULL);
> -}
> -
> -/* Update the sched_params (time, row and stage) for node U using the II,
> -   the CYCLE of U and MIN_CYCLE.
> -   We're not simply taking the following
> -   SCHED_STAGE (u) = CALC_STAGE_COUNT (SCHED_TIME (u), min_cycle, ii);
> -   because the stages may not be aligned on cycle 0.  */
> -static void
> -update_node_sched_params (int u, int ii, int cycle, int min_cycle)
> -{
> -  int sc_until_cycle_zero;
> -  int stage;
> -
> -  SCHED_TIME (u) = cycle;
> -  SCHED_ROW (u) = SMODULO (cycle, ii);
> -
> -  /* The calculation of stage count is done adding the number
> -     of stages before cycle zero and after cycle zero.  */
> -  sc_until_cycle_zero = CALC_STAGE_COUNT (-1, min_cycle, ii);
> -
> -  if (SCHED_TIME (u) < 0)
> -    {
> -      stage = CALC_STAGE_COUNT (-1, SCHED_TIME (u), ii);
> -      SCHED_STAGE (u) = sc_until_cycle_zero - stage;
> -    }
> -  else
> -    {
> -      stage = CALC_STAGE_COUNT (SCHED_TIME (u), 0, ii);
> -      SCHED_STAGE (u) = sc_until_cycle_zero + stage - 1;
> -    }
>  }
>
>  /* Bump the SCHED_TIMEs of all nodes by AMOUNT.  Set the values of
> @@ -699,22 +878,6 @@ reset_sched_times (partial_schedule_ptr
>       }
>  }
>
> -/* Set SCHED_COLUMN of each node according to its position in PS.  */
> -static void
> -set_columns_for_ps (partial_schedule_ptr ps)
> -{
> -  int row;
> -
> -  for (row = 0; row < ps->ii; row++)
> -    {
> -      ps_insn_ptr cur_insn = ps->rows[row];
> -      int column = 0;
> -
> -      for (; cur_insn; cur_insn = cur_insn->next_in_row)
> -       SCHED_COLUMN (cur_insn->id) = column++;
> -    }
> -}
> -
>  /* Permute the insns according to their order in PS, from row 0 to
>    row ii-1, and position them right before LAST.  This schedules
>    the insns of the loop kernel.  */
> @@ -731,8 +894,13 @@ permute_partial_schedule (partial_schedu
>        rtx insn = ps_rtl_insn (ps, ps_ij->id);
>
>        if (PREV_INSN (last) != insn)
> -         reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> -                             PREV_INSN (last));
> +         {
> +           if (ps_ij->id < ps->g->num_nodes)
> +             reorder_insns_nobb (ps_first_note (ps, ps_ij->id), insn,
> +                                 PREV_INSN (last));
> +           else
> +             add_insn_before (insn, last, NULL);
> +         }
>       }
>  }
>
> @@ -935,7 +1103,7 @@ optimize_sc (partial_schedule_ptr ps, dd
>
>  static void
>  duplicate_insns_of_cycles (partial_schedule_ptr ps, int from_stage,
> -                          int to_stage, int for_prolog, rtx count_reg)
> +                          int to_stage, rtx count_reg)
>  {
>   int row;
>   ps_insn_ptr ps_ij;
> @@ -944,7 +1112,7 @@ duplicate_insns_of_cycles (partial_sched
>     for (ps_ij = ps->rows[row]; ps_ij; ps_ij = ps_ij->next_in_row)
>       {
>        int u = ps_ij->id;
> -       int j, i_reg_moves, i_reg_move;
> +       int first_u, last_u;
>        rtx u_insn;
>
>         /* Do not duplicate any insn which refers to count_reg as it
> @@ -958,42 +1126,15 @@ duplicate_insns_of_cycles (partial_sched
>             || JUMP_P (u_insn))
>           continue;
>
> -       if (for_prolog)
> +       first_u = SCHED_STAGE (u);
> +       last_u = first_u + ps_num_consecutive_stages (ps, u) - 1;
> +       if (from_stage <= last_u && to_stage >= first_u)
>          {
> -           /* SCHED_STAGE (u) >= from_stage == 0.  Generate increasing
> -              number of reg_moves starting with the second occurrence of
> -              u, which is generated if its SCHED_STAGE <= to_stage.  */
> -           i_reg_moves = to_stage - SCHED_STAGE (u) + 1;
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *first* reg_move backwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (i_reg_moves - 1);
> -         }
> -       else /* It's for the epilog.  */
> -         {
> -           /* SCHED_STAGE (u) <= to_stage.  Generate all reg_moves,
> -              starting to decrease one stage after u no longer occurs;
> -              that is, generate all reg_moves until
> -              SCHED_STAGE (u) == from_stage - 1.  */
> -           i_reg_moves = (SCHED_NREG_MOVES (u)
> -                          - (from_stage - SCHED_STAGE (u) - 1));
> -           i_reg_moves = MAX (i_reg_moves, 0);
> -           i_reg_moves = MIN (i_reg_moves, SCHED_NREG_MOVES (u));
> -
> -           /* The reg_moves start from the *last* reg_move forwards.  */
> -           i_reg_move = SCHED_FIRST_REG_MOVE (u) + (SCHED_NREG_MOVES (u) - 
> 1);
> +           if (u < ps->g->num_nodes)
> +             duplicate_insn_chain (ps_first_note (ps, u), u_insn);
> +           else
> +             emit_insn (copy_rtx (PATTERN (u_insn)));
>          }
> -
> -       for (j = 0; j < i_reg_moves; j++)
> -         {
> -           ps_reg_move_info *move = ps_reg_move (ps, i_reg_move - j);
> -
> -           emit_insn (copy_rtx (PATTERN (move->insn)));
> -         }
> -       if (SCHED_STAGE (u) >= from_stage
> -           && SCHED_STAGE (u) <= to_stage)
> -         duplicate_insn_chain (ps_first_note (ps, u), u_insn);
>       }
>  }
>
> @@ -1027,7 +1168,7 @@ generate_prolog_epilog (partial_schedule
>     }
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, 0, i, 1, count_reg);
> +    duplicate_insns_of_cycles (ps, 0, i, count_reg);
>
>   /* Put the prolog on the entry edge.  */
>   e = loop_preheader_edge (loop);
> @@ -1039,7 +1180,7 @@ generate_prolog_epilog (partial_schedule
>   start_sequence ();
>
>   for (i = 0; i < last_stage; i++)
> -    duplicate_insns_of_cycles (ps, i + 1, last_stage, 0, count_reg);
> +    duplicate_insns_of_cycles (ps, i + 1, last_stage, count_reg);
>
>   /* Put the epilogue on the exit edge.  */
>   gcc_assert (single_exit (loop));
> @@ -1375,8 +1516,7 @@ sms_schedule (void)
>     {
>       rtx head, tail;
>       rtx count_reg, count_init;
> -      int mii, rec_mii;
> -      unsigned stage_count;
> +      int mii, rec_mii, stage_count, min_cycle;
>       HOST_WIDEST_INT loop_count = 0;
>       bool opt_sc_p;
>
> @@ -1486,13 +1626,12 @@ sms_schedule (void)
>                }
>
>              gcc_assert (stage_count >= 1);
> -             PS_STAGE_COUNT (ps) = stage_count;
>            }
>
>          /* The default value of PARAM_SMS_MIN_SC is 2 as stage count of
>             1 means that there is no interleaving between iterations thus
>             we let the scheduling passes do the job in this case.  */
> -         if (stage_count < (unsigned) PARAM_VALUE (PARAM_SMS_MIN_SC)
> +         if (stage_count < PARAM_VALUE (PARAM_SMS_MIN_SC)
>              || (count_init && (loop_count <= stage_count))
>              || (flag_branch_probabilities && (trip_count <= stage_count)))
>            {
> @@ -1520,6 +1659,7 @@ sms_schedule (void)
>
>          set_columns_for_ps (ps);
>
> +         min_cycle = PS_MIN_CYCLE (ps) - SMODULO (PS_MIN_CYCLE (ps), ps->ii);
>          if (!schedule_reg_moves (ps))
>            {
>              mii = ps->ii + 1;
> @@ -1527,6 +1667,21 @@ sms_schedule (void)
>              continue;
>            }
>
> +         /* Moves that handle incoming values might have been added
> +            to a new first stage.  Bump the stage count if so.
> +
> +            ??? Perhaps we could consider rotating the schedule here
> +            instead?  */
> +         if (PS_MIN_CYCLE (ps) < min_cycle)
> +           {
> +             reset_sched_times (ps, 0);
> +             stage_count++;
> +           }
> +
> +         /* The stage count should now be correct without rotation.  */
> +         gcc_checking_assert (stage_count == calculate_stage_count (ps, 0));
> +         PS_STAGE_COUNT (ps) = stage_count;
> +
>          canon_loop (loop);
>
>           if (dump_file)
>

Re: [4/4] Make SMS schedule register moves

Reply via email to