2011/9/23 Richard Sandiford <rdsandif...@googlemail.com> > > Thanks as always for the review.
Sure, always a pleasure. > > Ayal Zaks <ayal.z...@gmail.com> writes: > >> Richard Sandiford <richard.sandif...@linaro.org> wrote on 30/08/2011 > >> 03:29:26 PM: > >> > >> > From: Richard Sandiford <richard.sandif...@linaro.org> > >> > To: gcc-patches@gcc.gnu.org > >> > Cc: Ayal Zaks/Haifa/IBM@IBMIL > >> > Date: 30/08/2011 03:29 PM > >> > Subject: [4/4] Make SMS schedule register moves > >> > > >> > This is the move-scheduling patch itself. It should be fairly > >> > self-explanatory. Let me know if it isn't, and I'll try to improve > >> > the commentary. > >> > >> > > >> > One potentially controversial change is to the way we handle moves > >> > in the prologue and epilogue. The current code uses a conservative > > > > > > - conservative>>accurate? (Your approach seems to be more "conservative") > > They're both conservative (and yeah, mine is more conservative :-)). > What I was trying to say is that the current code already generates > more moves than necessary. If ddg node N is "R = foo", the current code > generates enough moves to handle definitions of R from both the prologue > copies of N _and_ the incoming edge (i.e. the value defined outside the > loop, such as an accumulator being initialised to zero). But the code > doesn't check whether this initial definition actually exists. If it > doesn't -- because we're dealing with an intra-loop dependency -- > we generate one more move than necessary. > > If we wanted to be accurate, we'd need to check whether there's > an incoming value or not. ahh, right. Not only are we producing dead code, but such that is using uninitialized values.. > > The current code is also conservative in that the epilogue can end up > with sequences like: > > R1 = foo > R2 = R1 (from a move) > ... no set of R1, because we've done with that stage... > R3 = ...R2... > > where the last instruction could use R1 directly instead. > But none of this matters because later passes (including IRA) > fix it up nicely. > > So this was all just a big excuse on my part along the lines of > "we already rely on this stuff being cleaned up, and it is, > so let's keep things simple". furthermore, it should be worthwhile to strengthen dce if it were not to clean up tidily. > > >> > check to decide which stages need which moves. This check copes > >> > with values that are live before the loop, and emits some moves > >> > that aren't actually needed. The code also emits chains of moves > >> > that can be trivially optimised through propagation. We rely on > >> > later patches to clean this up for us (and they do). > >> > > >> > So, rather than keep a rather complicated check here, I've simply > >> > emitted the moves for all stages. In principle, that will generate > >> > a little more dead code than now in cases where chains of two moves > >> > are needed, but that seems to be very rare (when moves are scheduled). > > > > indeed, it's better to simplify code generation and rely on subsequent > > cleanups. Not sure about generating more (dead?) code in principle; > > could you elaborate, for the record? > > Sorry, I wasn't very clear, but what I meant was: the patch generates > every move regardless of the stage, so it will in principle generate more > moves to unused registers than we do now. (As above, we already generate > some such moves.) Thoese insns will be dead, and will be later removed > as dead. But in practice, the number of extra instructions seems to be > very low, and is usually zero. ok, that's clear. Cases which are not zero should potentially open PR's against dce. > > >> > * modulo-sched.c (extend_node_sched_params): New function. > >> > (print_node_sched_params): Print the stage. > >> > (set_columns_for_row, schedule_reg_move): New functions. > >> > (set_columns_for_ps): Move up file and use set_columns_for_now. > > > > > > typo: now>>row (unless you intend this to be temporary ;-) > > Doh! :-) > > >> > (schedule_reg_moves): Call extend_node_sched_params and > >> > schedule_reg_move. Extend size of uses bitmap. Return false > >> > if a move could not be scheduled. > >> > (apply_reg_moves): Don't emit moves here. > >> > (permute_partial_schedule): Handle register moves. > >> > (duplicate_insns_of_cycles): Remove for_prolog. Always emit moves. > > > > > > Always emit all moves > > Oops, fixed. > > >> +/* Try to schedule the move with ps_insn identifier I_REG_MOVE in PS. > >> + The source of the move is provided by I_MUST_FOLLOW, which has > >> + already been scheduled. > > > > The source y of an x=y move is defined by the previous iteration of > > the next move y=z (or by the original y=... move->def). We schedule > > moves one after the other bottom-up, starting from the original > > move->def moving backwards in cycles. This way, the instruction > > I_MUST_FOLLOW defining y is always already scheduled when we come to > > schedule the dependent I_REG_MOVE x=y move. > > Right. > > >> + /* The cyclic lifetime of move->new_reg starts and ends at move->def > >> + (the instruction that defines move->old_reg). > > > > So instruction I_REG_MOVE (new_reg=reg) must be scheduled before the > > next I_MUST_FOLLOW move/original-def (due to anti-dependence: it > > overwrites reg), but after the previous instance of I_MUST_FOLLOW (due > > to true dependence; i.e. account for latency also). Why do moves, > > except for the one closest to move->def (which is directly dependent > > upon it, i.e. for whom move->def == I_MUST_FOLLOW), have to worry > > about move->def at all? (Or have their cyclic lifetimes start and end > > there?) > > Because the uses of new_reg belong to the same move->def based cycle. the "cycle" (overloaded term; rather "iteration" in this context) to which the uses belong, is inferred from the "cycle" (absolute schedule time) in which they are scheduled, regardless of move->def. > > E.g. say we have the following partial schedule, with [A]...[D] denoting > instructions: > > row 0: empty > row 1: empty > row 2: [A] use R_2 > row 3: [B] R_0 = ... > row 4: [C] use R_1 > row 5: [D] use R_2 then these 4 instructions span 18 cycles, corresponding to the following schedule (left column; upto some multiple of ii offset): cyc 3:[B]R_0= cyc 4: cyc 5: cyc 6: cyc 7: cyc 8: cyc 9: [B]R_0= cyc 10:[C]=R_1 cyc 11: cyc 12: cyc 13: cyc 14: cyc 15: [B]R_0= cyc 16: [C]=R_1 cyc 17:[D]=R_2 --------------------------------------- cyc 18: cyc 19: cyc 20:[A]=R_2 cyc 21: [B]R_0= cyc 22: [C]=R_1 cyc 23: [D]=R_2 --------------------------------------- > > and with the following as-yet-unplaced moves: > > [M1] R_1 = R_0 > [M2] R_2 = R_1 > > The loop with scheduled moves must behave in the same way as it > would at the moment. That is, it must behave "as if" the loop were: > > LOOP 1: > [A] use R_2 > [M2] R_2 = R_1 > [M1] R_1 = R_0 > [B] R_0 = ... > [C] use R_1 > [D] use R_2 agreed. This reschedules [A] two cycles earlier (to cycle 18; or perhaps it should have been there originally?), places [M1] in cycle 8, and places [M2] in cycle 13: cyc 3:[B]R_0= cyc 4: cyc 5: cyc 6: cyc 7: cyc 8:[M1]R1=R0 cyc 9: [B]R_0= cyc 10:[C]=R_1 cyc 11: cyc 12: cyc 13:[M2]R2=R1 cyc 14: [M1]R1=R0 cyc 15: [B]R_0= cyc 16: [C]=R_1 cyc 17:[D]=R_2 --------------------------------------- cyc 18:[A]=R_2 cyc 19: [M2]R2=R1 cyc 20: [M1]R1=R0 cyc 21: [B]R_0= cyc 22: [C]=R_1 cyc 23: [D]=R_2 --------------------------------------- > > So (I think this is the uncontroversial bit): [M1] must be scheduled > "cyclically before" [B] and "cyclically after" [C], with the cycle > based at [B]: > > row 3 after [B]: empty > row 4: [C] > row 5: [D] > row 0: empty > row 1: empty > row 2: [A] > row 3 before [B]: empty > > [M1] could therefore go in row 1. This part is OK. Here's how I see it: [M1] feeds [C] which is scheduled at cycle 10, so it must be scheduled before cycle 10-M_latency and after cycle 10-ii. [M1] uses the result of [B] which is scheduled at cycle 3, so must be scheduled after cycle 3+B_latency and before cycle 3+ii. Taking all latencies to be 1 and ii=6, this yields a scheduling window of cycles [4,9]\cap[4,9]=[4,9]; if scheduled at cycle 4 it must_follow [C], if scheduled at cycle 9 it must_precede [B]. This is identical to the logic behind the sched_window of any instruction, based on its dependencies (as you've updated not too long ago..), if we do not allow reg_moves (and arguably, one should not allow reg_moves when scheduling reg_moves...). To address the potential erroneous scenario of Loop 2, suppose [A] is scheduled as in the beginning in cycle 20, and that [M1] is scheduled in cycle 7 (\in[4,9]). Then [M2] feeds [D] and [A] which are scheduled at cycles 17 and 20, so it must be scheduled before cycle 17-1 and after cycle 20-6. [M2] uses the result of [M1], so must be scheduled after cycle 7+1 and before cycle 7+6. This yields the desired [14,16]\cap[8,13]=\emptyset. Also note that if moves are assigned absolute cycles, it becomes clear to which stage each belongs (just like any other instruction), regulating their generation in prolog and epilog. > > [M2] must then come "cyclically before" [M1] and "cyclically after" [A] > and [D]. If we based that cycle on [M1], we would have: > > row 1 after [M1]: empty > row 2: [A] > row 3: [B] > row 4: [C] > row 5: [D] > row 0: empty > row 1 before [M1]: empty > > So it would seem that [M2] could go in row 0. But that's incorrect, > it would lead to: > > LOOP 2: > [M2] R_2 = R_1 > [M1] R_1 = R_0 > [A] use R_2 > [B] R_0 = ... > [C] use R_1 > [D] use R_2 > > and [A] now gets the wrong value of R_2. In the first iteration of > loop 1, [A] uses the prologue-supplied value of R_2. In loop 2 it > uses the prologue-supplied value of R_1. I suppose it's effectively > moved up a stage. > > If we keep the cycle at [B] when scheduling [M2], we have: > > row 3 after [B]: empty > row 4: [C] > row 5: [D] > row 0: empty > row 1: [M1] > row 2: [A] > row 3 before [B]: empty > > We now have an empty scheduling window (successor [M1] comes > before predecessor [A]) and we reject the current ii. > > > In general, there's a distinction between the "cycle" an instruction > > is scheduled at (for the first iteration), which is an absolute > > arbitrary integer, and the "row" it is placed in the PS, which is > > between 0 and ii-1, and is simply cycle mod ii. When scheduling, it's > > clearer to reason about cycles, especially if your window includes a > > row twice. > > > > In addition to the (true+anti) dependences involving I_REG_MOVE with > > I_MUST_FOLLOW, it has (true+anti) dependences with each use it feeds. > > > > We could alternatively set these dependencies of I_REG_MOVE on > > I_MUST_FOLLOW, and on its uses, and call get_sched_window(). But it's > > presumably simpler to handle it directly here. > > > > Right? > > Well, I think get_sched_window would need a fair bit of surgery to > handle both cases. And IMO it'd be a bit of a false abstraction. > In this case, we always want things to be scheduled close to the > successor (I_MUST_FOLLOW), even if there are several predecessors > (the uses). But if there are more predecessors than successors, > get_sched_window would normally prefer to schedule closer to the > predecessors. So I think we'd still be treating them as two separate > cases to some extent. well, it's a chain of dependences, so we should Swing through it in one direction or the other; not sure it matters which. > > As you say, doing it here is much simpler :-) yes, it makes sense to replicate the process here, given that the moves are not in the ddg and their deps are simple; but the logic should be the same. Ayal. > > Richard