Alpha EV6 and newer can execute four instructions per cycle if correctly
scheduled. The architecture has two clusters {0, 1}, each with its own
register file. In each cluster, there are two slots {upper, lower}. Some
instructions can execute only from upper slots, others only from lower
slots.
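A minimal sketch of these slotting constraints, in Python purely for illustration. The table and function names are hypothetical (they are not GCC identifiers), and the pipe assignments are only the ones stated in the comments of ev6.md in the patch below.

```python
# Hypothetical illustration; these names are not GCC identifiers.
# Pipe names follow ev6.md: the letter is the slot (U = upper, L = lower)
# and the digit is the cluster.
ISSUE_PIPES = {
    "iadd":  ("U0", "L0", "U1", "L1"),  # arithmetic goes anywhere
    "shift": ("U0", "U1"),              # upper slots only
    "ild":   ("L0", "L1"),              # lower slots only
    "mvi":   ("U0",),                   # a single pipe
    "imul":  ("U1",),                   # a single pipe
    "jsr":   ("L0",),                   # a single pipe
}


def clusters_for(insn_type):
    """Return the sorted list of clusters an instruction type can issue to."""
    return sorted({int(pipe[1]) for pipe in ISSUE_PIPES[insn_type]})
```

Types that can reach both clusters (iadd, shift, ild here) are the ones whose reservations the patch duplicates per cluster; single-pipe types (mvi, imul, jsr) keep a single reservation.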
Register values produced in one cluster take 1 cycle to appear in the
other cluster, so improperly scheduled instructions may incur a
cross-cluster delay.

I've duplicated (define_insn_reservation ...) for instructions which can
execute from either cluster, increased latencies by 1, and added
bypasses.

In my limited testing it seems to provide a minor improvement (I
wouldn't expect much, since it should only remove single-cycle delays
here and there).

So, please review and provide feedback. I also have some questions:

 - In the Compiler Writer's Guide (CWG) [1] [2], there doesn't seem to
   be any mention of cross-cluster delays from integer load/store
   instructions as producers. It seems plausible that loads/stores
   could be a special case and update both clusters' register files at
   the same time, but maybe this is an oversight in (two versions of)
   the manual?

 - CMOV instructions are internally split as two distinct instructions
   on >=EV6 that may execute on any cluster/slot. Evidently, this means
   that the first part may execute on cluster 0 while the second
   executes on cluster 1, thereby incurring a 1-cycle cross-cluster
   delay. WTF. So, how can I represent this two-part instruction--by
   duplicating its define_insn_reservation 4 times? I can't find any
   rules for scheduling CMOVs in the CWG, so knowing this would be
   helpful too.

 - The CWG lists the latency of unconditional branches and jsr/call
   instructions as 3, whereas we have 1. I guess this latency value is
   only meaningful if the instruction produces a value? I'm a bit
   confused by this value in the CWG, since it lists the latency of
   conditional branches as N/A but these other types of branches as 3,
   although none produce a register value.

 - When increasing the default instruction latencies, I've added
   ',nothing' to the functional unit regexp. Is this the correct way to
   describe that the functional unit is free?

 - There's a ??? comment at the top that says "In addition, instruction
   order affects cluster issue." Does gcc understand how to do this
   already, or is this a TODO reminder? If it's a reminder, where
   should I look in gcc to add this?

 - I also see that fadd/fcmov/fmul instructions take an extra two
   cycles when the consumer is fst/ftoi, so something similar should be
   added for them. Can a (define_bypass ...) specify a latency value
   greater than the default latency, or should I raise the default
   latency and special-case fst/ftoi consumers like I've done for the
   cross-cluster delay?

Thanks a lot!
Matt Turner

[1] http://www.compaq.com/cpq-alphaserver/technology/literature/cmpwrgd.pdf
[2] http://download.majix.org/dec/comp_guide_v2.pdf

--- ev6.md.orig 2007-08-02 06:49:31.000000000 -0400
+++ ev6.md 2011-05-24 23:15:39.414919424 -0400
@@ -24,19 +24,19 @@
 ; EV6 has two symmetric pairs ("clusters") of two asymmetric integer
 ; units ("upper" and "lower"), yielding pipe names U0, U1, L0, L1.
 ;
-; ??? The clusters have independent register files that are re-synced
+; The clusters have independent register files that are re-synced
 ; every cycle.  Thus there is one additional cycle of latency between
-; insns issued on different clusters.  Possibly model that by duplicating
-; all EBOX insn_reservations that can issue to either cluster, increasing
-; all latencies by one, and adding bypasses within the cluster.
+; insns issued on different clusters.
 ;
-; ??? In addition, instruction order affects cluster issue.
+; ??? In addition, instruction order affects cluster issue. XXX: what to do?
 
 (define_automaton "ev6_0,ev6_1")
 (define_cpu_unit "ev6_u0,ev6_u1,ev6_l0,ev6_l1" "ev6_0")
 (define_reservation "ev6_u" "ev6_u0|ev6_u1")
 (define_reservation "ev6_l" "ev6_l0|ev6_l1")
-(define_reservation "ev6_ebox" "ev6_u|ev6_l")
+(define_reservation "ev6_ebox" "ev6_u|ev6_l") ; XXX: remove
+(define_reservation "ev6_e0" "ev6_l0|ev6_u0")
+(define_reservation "ev6_e1" "ev6_l1|ev6_u1")
 
 (define_cpu_unit "ev6_fa" "ev6_1")
 (define_cpu_unit "ev6_fm,ev6_fst0,ev6_fst1" "ev6_0")
@@ -50,15 +50,26 @@
 
 ; Integer loads take at least 3 clocks, and only issue to lower units.
 ; adjust_cost still factors in user-specified memory latency, so return 1 here.
-(define_insn_reservation "ev6_ild" 1
+; XXX: CWG doesn't mention cross-cluster delay for ild/ist producers ???
+(define_insn_reservation "ev6_ild_0" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "ild,ldsym,ld_l"))
-  "ev6_l")
+  "ev6_l0")
+
+(define_insn_reservation "ev6_ild_1" 1
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "ild,ldsym,ld_l"))
+  "ev6_l1")
 
-(define_insn_reservation "ev6_ist" 1
+(define_insn_reservation "ev6_ist_0" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "ist,st_c"))
-  "ev6_l")
+  "ev6_l0")
+
+(define_insn_reservation "ev6_ist_1" 1
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "ist,st_c"))
+  "ev6_l1")
 
 (define_insn_reservation "ev6_mb" 1
   (and (eq_attr "tune" "ev6")
@@ -84,48 +95,88 @@
   "ev6_fst,nothing,ev6_l")
 
 ; Arithmetic goes anywhere.
-(define_insn_reservation "ev6_arith" 1
+(define_insn_reservation "ev6_arith_0" 2
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "iadd,ilog,icmp"))
+  "ev6_e0,nothing")
+
+(define_insn_reservation "ev6_arith_1" 2
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "iadd,ilog,icmp"))
-  "ev6_ebox")
+  "ev6_e1,nothing")
 
 ; Motion video insns also issue only to U0, and take three ticks.
-(define_insn_reservation "ev6_mvi" 3
+(define_insn_reservation "ev6_mvi" 4
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "mvi"))
-  "ev6_u0")
+  "ev6_u0*3,nothing")
 
 ; Shifts issue to upper units.
-(define_insn_reservation "ev6_shift" 1
+(define_insn_reservation "ev6_shift_0" 2
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "shift"))
-  "ev6_u")
+  "ev6_u0,nothing")
+
+(define_insn_reservation "ev6_shift_1" 2
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "shift"))
+  "ev6_u1,nothing")
 
 ; Multiplies issue only to U1, and all take 7 ticks.
-(define_insn_reservation "ev6_imul" 7
+(define_insn_reservation "ev6_imul" 8
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "imul"))
-  "ev6_u1")
+  "ev6_u1*7,nothing")
 
 ; Conditional moves decompose into two independent primitives, each taking
 ; one cycle.  Since ev6 is out-of-order, we can't see anything but two cycles.
+; XXX: icmov can be UU, UL, LU, or LL. wtf.
 (define_insn_reservation "ev6_icmov" 2
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "icmov"))
   "ev6_ebox,ev6_ebox")
 
 ; Integer branches issue to upper units
+; XXX: CWG says latency for icbr is 3
+; XXX: CWG says callpal is part of the jsr group, and therefore is slotted L0
 (define_insn_reservation "ev6_ibr" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "ibr,callpal"))
   "ev6_u")
 
 ; Calls only issue to L0.
+; XXX: CWG says latency for jsr is 3
 (define_insn_reservation "ev6_jsr" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "jsr"))
   "ev6_l0")
 
+; 1-cycle 0 to 0
+(define_bypass 1
+  "ev6_arith_0,ev6_shift_0"
+  "ev6_ild_0,ev6_ist_0,ev6_arith_0,ev6_mvi,ev6_shift_0")
+
+; 3-cycle 0 to 0
+(define_bypass 3
+  "ev6_mvi"
+  "ev6_ild_0,ev6_ist_0,ev6_arith_0,ev6_mvi,ev6_shift_0")
+
+; 1-cycle 1 to 1
+(define_bypass 1
+  "ev6_arith_1,ev6_shift_1"
+  "ev6_ild_1,ev6_ist_1,ev6_arith_1,ev6_shift_1,ev6_imul,ev6_jsr")
+
+; 7-cycle 1 to 1
+(define_bypass 7
+  "ev6_imul"
+  "ev6_ild_1,ev6_ist_1,ev6_arith_1,ev6_shift_1,ev6_imul,ev6_jsr")
+
+; XXX: bypass for ild/ist/itof?
+
+; XXX: for instructions specified in the bypass, since I'm increasing their
+; default latencies, should I specify how long they'll be using the functional
+; unit, as is done for ev6_f{div,sqrt}?
+
 ; Ftoi/itof only issue to lower pipes.
 (define_insn_reservation "ev6_itof" 3
   (and (eq_attr "tune" "ev6")
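
The latency scheme in the patch (default latency raised by one, with a same-cluster bypass restoring the original value) can be sketched as a small lookup. This is only an illustrative Python model with hypothetical names, not GCC code. It assumes every same-cluster consumer takes the bypass, whereas the define_bypass lists in the patch name specific consumers, and it leaves out loads and stores because their cross-cluster behaviour is one of the open questions.

```python
# Hypothetical model of the scheme in the patch; not GCC code.
# Default latencies are the raised values from the new reservations.
DEFAULT_LATENCY = {"arith": 2, "shift": 2, "mvi": 4, "imul": 8}

# Home cluster of the single-pipe producers (mvi is U0, imul is U1).
FIXED_CLUSTER = {"mvi": 0, "imul": 1}


def latency(producer, producer_cluster, consumer_cluster):
    """Cycles until a consumer can use the producer's result."""
    # Single-pipe producers ignore the requested cluster.
    cluster = FIXED_CLUSTER.get(producer, producer_cluster)
    default = DEFAULT_LATENCY[producer]
    # A same-cluster consumer takes the bypass, which is one cycle shorter.
    return default - 1 if consumer_cluster == cluster else default
```

For example, `latency("arith", 0, 0)` gives 1 and `latency("arith", 0, 1)` gives 2, which matches the intent that only cross-cluster consumers see the extra cycle.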