2011/8/8 Richard Sandiford <richard.sandif...@linaro.org> > > Ayal Zaks <ayal.z...@gmail.com> writes: > >> OK. For the follow-on iv patch, it seemed easier to keep both bounds > >> inclusive for the loop, then make the "end" exclusive when setting the > >> out parameters. (Note that there shouldn't be any problem with overflow > >> when making the bound exclusive, because the size of the range has been > >> restricted to "ii" by that stage. The follow-on patch does use > >> saturating maths for all ops though.) > > > > Sure, no problem having "end" inclusive, as it simplifies things. We > > can even keep it inclusive all the way through, iterating over: for (c > > = start; c != end+step; c += step)... > > Yeah, I'd prefer that too. I might do it once Revital's and > Roman's changes are in. > > >> I don't have powerpc hardware that I can do meaningful performance > >> testing on, but I did run it through a Popular* Embedded Benchmark > >> on an ARM Cortex-A8 board with -O3 -fmodulo-sched > >> -fmodulo-sched-allow-regmoves. There were no changes. (And this is > >> a benchmark that does benefit from modulo scheduling, in some cases > >> by a significant amount.) > >> > > Ok, the patch should be neutral performance-wise. One could check btw > > if SMS produces exactly the same output in both cases, even w/o a > > powerpc (or any other) HW. > > OK. The patch does change the output a little because of the MEM_DEP > range thing.
Ahh, right. > > To compensate for that, I tried compiling libav on ARM with: > > Index: gcc/modulo-sched.c > =================================================================== > --- gcc/modulo-sched.c 2011-08-05 11:23:16.000000000 +0100 > +++ gcc/modulo-sched.c 2011-08-05 11:27:57.000000000 +0100 > @@ -1680,7 +1680,7 @@ get_sched_window (partial_schedule_ptr p > p_st, early_start, e->latency); > > if (e->data_type == MEM_DEP) > - end = MIN (end, SCHED_TIME (v_node) + ii - 1); > + end = MIN (end, SCHED_TIME (v_node) + ii); > } > else if (dump_file) > fprintf (dump_file, "the node is not scheduled\n"); > @@ -1729,7 +1729,7 @@ get_sched_window (partial_schedule_ptr p > s_st, late_start, e->latency); > > if (e->data_type == MEM_DEP) > - end = MAX (end, SCHED_TIME (v_node) - ii + 1); > + end = MAX (end, SCHED_TIME (v_node) - ii); > if (dump_file) > fprintf (dump_file, "end = %d\n", end); > > @@ -1791,7 +1791,7 @@ get_sched_window (partial_schedule_ptr p > count_preds++; > > if (e->data_type == MEM_DEP) > - end = MIN (end, SCHED_TIME (v_node) + ii - 1); > + end = MIN (end, SCHED_TIME (v_node) + ii); > } > else if (dump_file) > fprintf (dump_file, "the node is not scheduled\n"); > > applied, then tried the same thing with the revised patch below. > The output was the same. ok, that's a good sanity check. > > (FWIW, libav did show up extra differences when using the patch > that I'd originally submitted. They were due to the count_preds > and count_succs thing that you picked up in your review.) (These differences had no noticable consequences performance-wise, right?) > > >> Bootstrapped & regression-tested on powerpc-ibm-aix5.3 with the > >> additional patch: > >> > >> Index: gcc/opts.c > >> =================================================================== > >> --- gcc/opts.c 2011-08-03 10:56:50.000000000 +0100 > >> +++ gcc/opts.c 2011-08-03 10:56:50.000000000 +0100 > >> @@ -449,6 +449,8 @@ static const struct default_options defa > >> { OPT_LEVELS_1_PLUS, OPT_ftree_ch, NULL, 1 }, > >> { OPT_LEVELS_1_PLUS, OPT_fcombine_stack_adjustments, NULL, 1 }, > >> { OPT_LEVELS_1_PLUS, OPT_fcompare_elim, NULL, 1 }, > >> + { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched, NULL, 1 }, > >> + { OPT_LEVELS_1_PLUS, OPT_fmodulo_sched_allow_regmoves, NULL, 1 }, > >> > >> /* -O2 optimizations. */ > >> { OPT_LEVELS_2_PLUS, OPT_finline_small_functions, NULL, 1 }, > >> > >> applied for both the "before" and "after" runs. OK to install? > > > > > > Yes, with just a couple of minor non-mandatory comments: > > 1. Setting "count_preds = psp_not_empty;" and "count_succs = > > pss_not_empty;" suggests that you want to decrement non-interesting > > neighbors instead of incrementing interesting ones. Otherwise can > > initialize to zero (doesn't really matter, just a bit confusing). > > Yeah, good point. I initialised them to zero, like you said, > and changed the reversal condition to: > > if (pss_not_empty && count_succs >= count_preds) > > Which I have admit makes a lot more sense than what I submitted, > and avoids some unwanted changes in output. > > > 2. I find it easier to read "late_start = MIN (late_start, early_start > > + (ii - 1));" instead of "late_start = MIN (early_start + (ii - 1), > > late_start);". > > Oops, that was accidental. Fixed. > > > 3. Suggest to set step=1 at the beginning, explaining from the start > > that we first compute a forward range (start<end), and may later > > possibly revert it. > > Yeah, that's better. > > > 4. "late_start" would better be renamed "late_end". > > Since you said this was optional, I decided to leave it as it is. > The name and value of "late start" come directly from the paper. Ok. We can write another paper ;-) > > > 5. While we're at it, as you've unified the treatments, we can now > > actually do with "start" and "end" only, and do away with early_start > > and late_start altogether... > > The IV patch needs the same distinction as we have now, so I'd prefer > to keep it. Ok, looking forward to the IV patch then... > > Thanks for the review. Here's what I applied after testing on > powerpc-ibm-aix5.3. > Thanks for looking into this, Ayal. > > Richard > > > gcc/ > * modulo-sched.c (get_sched_window): Use just one loop for predecessors > and one loop for successors. Fix upper bound of memory range. > > Index: gcc/modulo-sched.c > =================================================================== > *** gcc/modulo-sched.c 2011-08-05 10:39:27.000000000 +0100 > --- gcc/modulo-sched.c 2011-08-05 10:40:49.000000000 +0100 > *************** #define MAX_SPLIT_NUM 10 > *** 1630,1638 **** > > static int > get_sched_window (partial_schedule_ptr ps, ddg_node_ptr u_node, > ! sbitmap sched_nodes, int ii, int *start_p, int *step_p, int > *end_p) > { > int start, step, end; > ddg_edge_ptr e; > sbitmap psp = sbitmap_alloc (ps->g->num_nodes); > sbitmap pss = sbitmap_alloc (ps->g->num_nodes); > --- 1630,1640 ---- > > static int > get_sched_window (partial_schedule_ptr ps, ddg_node_ptr u_node, > ! sbitmap sched_nodes, int ii, int *start_p, int *step_p, > ! int *end_p) > { > int start, step, end; > + int early_start, late_start; > ddg_edge_ptr e; > sbitmap psp = sbitmap_alloc (ps->g->num_nodes); > sbitmap pss = sbitmap_alloc (ps->g->num_nodes); > *************** get_sched_window (partial_schedule_ptr p > *** 1640,1645 **** > --- 1642,1649 ---- > sbitmap u_node_succs = NODE_SUCCESSORS (u_node); > int psp_not_empty; > int pss_not_empty; > + int count_preds; > + int count_succs; > > /* 1. compute sched window for u (start, end, step). */ > sbitmap_zero (psp); > *************** get_sched_window (partial_schedule_ptr p > *** 1647,1861 **** > psp_not_empty = sbitmap_a_and_b_cg (psp, u_node_preds, sched_nodes); > pss_not_empty = sbitmap_a_and_b_cg (pss, u_node_succs, sched_nodes); > > ! if (psp_not_empty && !pss_not_empty) > ! { > ! int early_start = INT_MIN; > ! > ! end = INT_MAX; > ! for (e = u_node->in; e != 0; e = e->next_in) > ! { > ! ddg_node_ptr v_node = e->src; > ! > ! if (dump_file) > ! { > ! fprintf (dump_file, "\nProcessing edge: "); > ! print_ddg_edge (dump_file, e); > ! fprintf (dump_file, > ! "\nScheduling %d (%d) in psp_not_empty," > ! " checking p %d (%d): ", u_node->cuid, > ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID > ! (v_node->insn)); > ! } > ! > ! if (TEST_BIT (sched_nodes, v_node->cuid)) > ! { > ! int p_st = SCHED_TIME (v_node); > ! > ! early_start = > ! MAX (early_start, p_st + e->latency - (e->distance * ii)); > ! > ! if (dump_file) > ! fprintf (dump_file, > ! "pred st = %d; early_start = %d; latency: %d", > ! p_st, early_start, e->latency); > ! > ! if (e->data_type == MEM_DEP) > ! end = MIN (end, SCHED_TIME (v_node) + ii - 1); > ! } > ! else if (dump_file) > ! fprintf (dump_file, "the node is not scheduled\n"); > ! } > ! start = early_start; > ! end = MIN (end, early_start + ii); > ! /* Schedule the node close to it's predecessors. */ > ! step = 1; > ! > ! if (dump_file) > ! fprintf (dump_file, > ! "\nScheduling %d (%d) in a window (%d..%d) with step %d\n", > ! u_node->cuid, INSN_UID (u_node->insn), start, end, step); > ! } > ! > ! else if (!psp_not_empty && pss_not_empty) > ! { > ! int late_start = INT_MAX; > ! > ! end = INT_MIN; > ! for (e = u_node->out; e != 0; e = e->next_out) > ! { > ! ddg_node_ptr v_node = e->dest; > ! > ! if (dump_file) > ! { > ! fprintf (dump_file, "\nProcessing edge:"); > ! print_ddg_edge (dump_file, e); > ! fprintf (dump_file, > ! "\nScheduling %d (%d) in pss_not_empty," > ! " checking s %d (%d): ", u_node->cuid, > ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID > ! (v_node->insn)); > ! } > ! > ! if (TEST_BIT (sched_nodes, v_node->cuid)) > ! { > ! int s_st = SCHED_TIME (v_node); > ! > ! late_start = MIN (late_start, > ! s_st - e->latency + (e->distance * ii)); > ! > ! if (dump_file) > ! fprintf (dump_file, > ! "succ st = %d; late_start = %d; latency = %d", > ! s_st, late_start, e->latency); > ! > ! if (e->data_type == MEM_DEP) > ! end = MAX (end, SCHED_TIME (v_node) - ii + 1); > ! if (dump_file) > ! fprintf (dump_file, "end = %d\n", end); > ! > ! } > ! else if (dump_file) > ! fprintf (dump_file, "the node is not scheduled\n"); > > ! } > ! start = late_start; > ! end = MAX (end, late_start - ii); > ! /* Schedule the node close to it's successors. */ > ! step = -1; > > ! if (dump_file) > ! fprintf (dump_file, > ! "\nScheduling %d (%d) in a window (%d..%d) with step %d\n", > ! u_node->cuid, INSN_UID (u_node->insn), start, end, step); > > ! } > > ! else if (psp_not_empty && pss_not_empty) > ! { > ! int early_start = INT_MIN; > ! int late_start = INT_MAX; > ! int count_preds = 0; > ! int count_succs = 0; > > ! start = INT_MIN; > ! end = INT_MAX; > ! for (e = u_node->in; e != 0; e = e->next_in) > ! { > ! ddg_node_ptr v_node = e->src; > > ! if (dump_file) > ! { > ! fprintf (dump_file, "\nProcessing edge:"); > ! print_ddg_edge (dump_file, e); > fprintf (dump_file, > ! "\nScheduling %d (%d) in psp_pss_not_empty," > ! " checking p %d (%d): ", u_node->cuid, INSN_UID > ! (u_node->insn), v_node->cuid, INSN_UID > ! (v_node->insn)); > ! } > ! > ! if (TEST_BIT (sched_nodes, v_node->cuid)) > ! { > ! int p_st = SCHED_TIME (v_node); > ! > ! early_start = MAX (early_start, > ! p_st + e->latency > ! - (e->distance * ii)); > ! > ! if (dump_file) > ! fprintf (dump_file, > ! "pred st = %d; early_start = %d; latency = %d", > ! p_st, early_start, e->latency); > ! > ! if (e->type == TRUE_DEP && e->data_type == REG_DEP) > ! count_preds++; > > ! if (e->data_type == MEM_DEP) > ! end = MIN (end, SCHED_TIME (v_node) + ii - 1); > ! } > ! else if (dump_file) > ! fprintf (dump_file, "the node is not scheduled\n"); > > ! } > ! for (e = u_node->out; e != 0; e = e->next_out) > ! { > ! ddg_node_ptr v_node = e->dest; > > ! if (dump_file) > ! { > ! fprintf (dump_file, "\nProcessing edge:"); > ! print_ddg_edge (dump_file, e); > ! fprintf (dump_file, > ! "\nScheduling %d (%d) in psp_pss_not_empty," > ! " checking s %d (%d): ", u_node->cuid, INSN_UID > ! (u_node->insn), v_node->cuid, INSN_UID > ! (v_node->insn)); > ! } > > ! if (TEST_BIT (sched_nodes, v_node->cuid)) > ! { > ! int s_st = SCHED_TIME (v_node); > > ! late_start = MIN (late_start, > ! s_st - e->latency > ! + (e->distance * ii)); > > ! if (dump_file) > ! fprintf (dump_file, > ! "succ st = %d; late_start = %d; latency = %d", > ! s_st, late_start, e->latency); > > ! if (e->type == TRUE_DEP && e->data_type == REG_DEP) > ! count_succs++; > > ! if (e->data_type == MEM_DEP) > ! start = MAX (start, SCHED_TIME (v_node) - ii + 1); > ! } > ! else if (dump_file) > ! fprintf (dump_file, "the node is not scheduled\n"); > > ! } > ! start = MAX (start, early_start); > ! end = MIN (end, MIN (early_start + ii, late_start + 1)); > ! step = 1; > ! /* If there are more successors than predecessors schedule the > ! node close to it's successors. */ > ! if (count_succs >= count_preds) > ! { > ! int old_start = start; > > ! start = end - 1; > ! end = old_start - 1; > ! step = -1; > ! } > ! } > ! else /* psp is empty && pss is empty. */ > ! { > ! start = SCHED_ASAP (u_node); > ! end = start + ii; > ! step = 1; > } > > *start_p = start; > *step_p = step; > *end_p = end; > --- 1651,1772 ---- > psp_not_empty = sbitmap_a_and_b_cg (psp, u_node_preds, sched_nodes); > pss_not_empty = sbitmap_a_and_b_cg (pss, u_node_succs, sched_nodes); > > ! /* We first compute a forward range (start <= end), then decide whether > ! to reverse it. */ > ! early_start = INT_MIN; > ! late_start = INT_MAX; > ! start = INT_MIN; > ! end = INT_MAX; > ! step = 1; > ! > ! count_preds = 0; > ! count_succs = 0; > ! > ! /* Calculate early_start and limit end. Both bounds are inclusive. */ > ! if (psp_not_empty) > ! for (e = u_node->in; e != 0; e = e->next_in) > ! { > ! ddg_node_ptr v_node = e->src; > > ! if (dump_file) > ! { > ! fprintf (dump_file, "\nProcessing edge: "); > ! print_ddg_edge (dump_file, e); > ! fprintf (dump_file, > ! "\nScheduling %d (%d) in psp_not_empty," > ! " checking p %d (%d): ", u_node->cuid, > ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID > ! (v_node->insn)); > ! } > > ! if (TEST_BIT (sched_nodes, v_node->cuid)) > ! { > ! int p_st = SCHED_TIME (v_node); > > ! early_start = MAX (early_start, > ! p_st + e->latency - (e->distance * ii)); > > ! if (e->data_type == MEM_DEP) > ! end = MIN (end, p_st + ii - 1); > > ! if (e->type == TRUE_DEP && e->data_type == REG_DEP) > ! count_preds++; > > ! if (dump_file) > fprintf (dump_file, > ! "pred st = %d; early_start = %d; latency: %d;" > ! " end: %d\n", p_st, early_start, e->latency, end); > > ! } > ! else if (dump_file) > ! fprintf (dump_file, "the node is not scheduled\n"); > ! } > > ! /* Calculate late_start and limit start. Both bounds are inclusive. */ > ! if (pss_not_empty) > ! for (e = u_node->out; e != 0; e = e->next_out) > ! { > ! ddg_node_ptr v_node = e->dest; > > ! if (dump_file) > ! { > ! fprintf (dump_file, "\nProcessing edge:"); > ! print_ddg_edge (dump_file, e); > ! fprintf (dump_file, > ! "\nScheduling %d (%d) in pss_not_empty," > ! " checking s %d (%d): ", u_node->cuid, > ! INSN_UID (u_node->insn), v_node->cuid, INSN_UID > ! (v_node->insn)); > ! } > > ! if (TEST_BIT (sched_nodes, v_node->cuid)) > ! { > ! int s_st = SCHED_TIME (v_node); > > ! late_start = MIN (late_start, > ! s_st - e->latency + (e->distance * ii)); > > ! if (e->data_type == MEM_DEP) > ! start = MAX (start, s_st - ii + 1); > > ! if (e->type == TRUE_DEP && e->data_type == REG_DEP) > ! count_succs++; > > ! if (dump_file) > ! fprintf (dump_file, > ! "succ st = %d; late_start = %d; latency = %d;" > ! " start=%d", s_st, late_start, e->latency, start); > > ! } > ! else if (dump_file) > ! fprintf (dump_file, "the node is not scheduled\n"); > ! } > > ! /* Get a target scheduling window no bigger than ii. */ > ! if (early_start == INT_MIN && late_start == INT_MAX) > ! early_start = SCHED_ASAP (u_node); > ! else if (early_start == INT_MIN) > ! early_start = late_start - (ii - 1); > ! late_start = MIN (late_start, early_start + (ii - 1)); > ! > ! /* Apply memory dependence limits. */ > ! start = MAX (start, early_start); > ! end = MIN (end, late_start); > ! > ! /* If there are at least as many successors as predecessors, schedule the > ! node close to its successors. */ > ! if (pss_not_empty && count_succs >= count_preds) > ! { > ! int tmp = end; > ! end = start; > ! start = tmp; > ! step = -1; > } > > + /* Now that we've finalized the window, make END an exclusive rather > + than an inclusive bound. */ > + end += step; > + > *start_p = start; > *step_p = step; > *end_p = end; > *************** get_sched_window (partial_schedule_ptr p > *** 1867,1876 **** > if (dump_file) > fprintf (dump_file, "\nEmpty window: start=%d, end=%d, step=%d\n", > start, end, step); > ! return -1; > } > > ! return 0; > } > > /* Calculate MUST_PRECEDE/MUST_FOLLOW bitmaps of U_NODE; which is the > --- 1778,1787 ---- > if (dump_file) > fprintf (dump_file, "\nEmpty window: start=%d, end=%d, step=%d\n", > start, end, step); > ! return -1; > } > > ! return 0; > } > > /* Calculate MUST_PRECEDE/MUST_FOLLOW bitmaps of U_NODE; which is the