Alpha EV6 and newer can execute four instructions per cycle if correctly
scheduled. The architecture has two clusters {0, 1}, each with its own
register file. Each cluster has two slots {upper, lower}. Some
instructions can execute only from the upper slots, others only from
the lower slots.

A register value produced in one cluster takes one extra cycle to
appear in the other cluster, so a consumer scheduled on the opposite
cluster from its producer incurs a one-cycle cross-cluster delay.

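To model this, I've duplicated the (define_insn_reservation ...) for
each instruction that can execute from either cluster (one copy per
cluster), increased the default latencies by 1, and added bypasses that
restore the original latency when producer and consumer are on the same
cluster.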
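
A minimal sketch of that pattern for a single insn type. It mirrors the
shift reservations in the patch below; the two one-line bypasses are a
simplification of the patch's longer consumer lists, so treat it as an
illustration rather than the exact change:

```
; Sketch only: one reservation per cluster.  The default latency of 2
; is the cross-cluster case (original latency 1, plus 1).
(define_insn_reservation "ev6_shift_0" 2
  (and (eq_attr "tune" "ev6")
       (eq_attr "type" "shift"))
  "ev6_u0,nothing")   ; U0 reserved in the first cycle, no unit in the second

(define_insn_reservation "ev6_shift_1" 2
  (and (eq_attr "tune" "ev6")
       (eq_attr "type" "shift"))
  "ev6_u1,nothing")   ; same, on cluster 1

; Same-cluster producer/consumer pairs get the original latency of 1.
(define_bypass 1 "ev6_shift_0" "ev6_shift_0")
(define_bypass 1 "ev6_shift_1" "ev6_shift_1")
```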

In my limited testing it seems to provide a minor improvement (I
wouldn't expect much, since it should only remove single-cycle delays
here and there).

Please review and provide feedback.

I also have some questions:

 - The Compiler Writer's Guide (CWG) [1] [2] doesn't seem to mention
   anything about cross-cluster delays with integer load/store
   instructions as producers. It seems plausible that loads/stores are
   a special case and update both clusters' register files at the same
   time, but maybe this is an oversight in (two versions of) the
   manual?

 - CMOV instructions are internally split into two distinct
   instructions on >=EV6, each of which may execute on any
   cluster/slot. Evidently, this means that the first part may execute
   on cluster 0 while the second executes on cluster 1, thereby
   incurring a 1-cycle cross-cluster delay. How should I represent this
   two-part instruction? By duplicating its define_insn_reservation
   four times? I can't find any rules for scheduling CMOVs in the CWG,
   so pointers on that would be helpful too.

 - The CWG lists the latency of unconditional branches and jsr/call
   instructions as 3, whereas ev6.md has 1. I guess this latency value
   is only meaningful if the instruction produces a value? The CWG
   confuses me here: it lists the latency of conditional branches as
   N/A but the latency of these other branch types as 3, although none
   of them produces a register value.

 - When increasing the default instruction latencies, I've added
   ',nothing' to the functional unit regexp. Is this the correct way to
   describe that the functional unit is free?

 - There's a ??? comment at the top that says "In addition, instruction
   order affects cluster issue." Does gcc understand how to do this
   already, or is this a TODO reminder? If it's a reminder, where should
   I look in gcc to add this?

 - I also see that fadd/fcmov/fmul results take an extra two cycles to
   reach an fst/ftoi consumer, so something similar should be added
   for them. Can a (define_bypass ...) specify a latency greater than
   the producer's default latency, or should I raise the default
   latency and add bypasses for the other consumers, like I've done
   for the cross-cluster delay?
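
To make the two alternatives in the last question concrete, here is a
rough sketch. The reservation names ev6_fadd, ev6_fmul, ev6_fst and
ev6_ftoi and the default latency of 4 are assumptions, since they are
not part of the diff below. Whether Option A is accepted at all is
exactly what I'm asking:

```
; Sketch only.  Assumes reservations named ev6_fadd, ev6_fmul, ev6_fst
; and ev6_ftoi exist, and that ev6_fadd/ev6_fmul have a default latency
; of 4.

; Option A: keep the default latency and give fst/ftoi consumers a
; longer one (4 + 2 = 6).
(define_bypass 6
  "ev6_fadd,ev6_fmul"
  "ev6_fst,ev6_ftoi")

; Option B: raise the default latency of ev6_fadd/ev6_fmul to 6 and
; give every other consumer the original latency back.
; (define_bypass 4
;   "ev6_fadd,ev6_fmul"
;   "ev6_fadd,ev6_fmul,...")  ; "..." = remaining FP consumers, not listed
```

Option B is commented out because the two are alternatives, not meant
to be combined. fcmov would need the same treatment with its own
latency.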

Thanks a lot!

Matt Turner

[1] http://www.compaq.com/cpq-alphaserver/technology/literature/cmpwrgd.pdf
[2] http://download.majix.org/dec/comp_guide_v2.pdf


--- ev6.md.orig 2007-08-02 06:49:31.000000000 -0400
+++ ev6.md      2011-05-24 23:15:39.414919424 -0400
@@ -24,19 +24,19 @@
 ; EV6 has two symmetric pairs ("clusters") of two asymmetric integer
 ; units ("upper" and "lower"), yielding pipe names U0, U1, L0, L1.
 ;
-; ??? The clusters have independent register files that are re-synced
+; The clusters have independent register files that are re-synced
 ; every cycle.  Thus there is one additional cycle of latency between
-; insns issued on different clusters.  Possibly model that by duplicating
-; all EBOX insn_reservations that can issue to either cluster, increasing
-; all latencies by one, and adding bypasses within the cluster.
+; insns issued on different clusters.
 ;
-; ??? In addition, instruction order affects cluster issue.
+; ??? In addition, instruction order affects cluster issue. XXX: what to do?
 
 (define_automaton "ev6_0,ev6_1")
 (define_cpu_unit "ev6_u0,ev6_u1,ev6_l0,ev6_l1" "ev6_0")
 (define_reservation "ev6_u" "ev6_u0|ev6_u1")
 (define_reservation "ev6_l" "ev6_l0|ev6_l1")
-(define_reservation "ev6_ebox" "ev6_u|ev6_l")
+(define_reservation "ev6_ebox" "ev6_u|ev6_l") ; XXX: remove
+(define_reservation "ev6_e0" "ev6_l0|ev6_u0")
+(define_reservation "ev6_e1" "ev6_l1|ev6_u1")
 
 (define_cpu_unit "ev6_fa" "ev6_1")
 (define_cpu_unit "ev6_fm,ev6_fst0,ev6_fst1" "ev6_0")
@@ -50,15 +50,26 @@
 
 ; Integer loads take at least 3 clocks, and only issue to lower units.
 ; adjust_cost still factors in user-specified memory latency, so return 1 here.
-(define_insn_reservation "ev6_ild" 1
+; XXX: CWG doesn't mention cross-cluster delay for ild/ist producers ???
+(define_insn_reservation "ev6_ild_0" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "ild,ldsym,ld_l"))
-  "ev6_l")
+  "ev6_l0")
+
+(define_insn_reservation "ev6_ild_1" 1
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "ild,ldsym,ld_l"))
+  "ev6_l1")
 
-(define_insn_reservation "ev6_ist" 1
+(define_insn_reservation "ev6_ist_0" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "ist,st_c"))
-  "ev6_l")
+  "ev6_l0")
+
+(define_insn_reservation "ev6_ist_1" 1
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "ist,st_c"))
+  "ev6_l1")
 
 (define_insn_reservation "ev6_mb" 1
   (and (eq_attr "tune" "ev6")
@@ -84,48 +95,88 @@
   "ev6_fst,nothing,ev6_l")
 
 ; Arithmetic goes anywhere.
-(define_insn_reservation "ev6_arith" 1
+(define_insn_reservation "ev6_arith_0" 2
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "iadd,ilog,icmp"))
+  "ev6_e0,nothing")
+
+(define_insn_reservation "ev6_arith_1" 2
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "iadd,ilog,icmp"))
-  "ev6_ebox")
+  "ev6_e1,nothing")
 
 ; Motion video insns also issue only to U0, and take three ticks.
-(define_insn_reservation "ev6_mvi" 3
+(define_insn_reservation "ev6_mvi" 4
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "mvi"))
-  "ev6_u0")
+  "ev6_u0*3,nothing")
 
 ; Shifts issue to upper units.
-(define_insn_reservation "ev6_shift" 1
+(define_insn_reservation "ev6_shift_0" 2
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "shift"))
-  "ev6_u")
+  "ev6_u0,nothing")
+
+(define_insn_reservation "ev6_shift_1" 2
+  (and (eq_attr "tune" "ev6")
+       (eq_attr "type" "shift"))
+  "ev6_u1,nothing")
 
 ; Multiplies issue only to U1, and all take 7 ticks.
-(define_insn_reservation "ev6_imul" 7
+(define_insn_reservation "ev6_imul" 8
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "imul"))
-  "ev6_u1")
+  "ev6_u1*7,nothing")
 
 ; Conditional moves decompose into two independent primitives, each taking
 ; one cycle.  Since ev6 is out-of-order, we can't see anything but two cycles.
+; XXX: icmov can be UU, UL, LU, or LL. wtf.
 (define_insn_reservation "ev6_icmov" 2
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "icmov"))
   "ev6_ebox,ev6_ebox")
 
 ; Integer branches issue to upper units
+; XXX: CWG says latency for icbr is 3
+; XXX: CWG says callpal is part of the jsr group, and therefore is slotted L0
 (define_insn_reservation "ev6_ibr" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "ibr,callpal"))
   "ev6_u")
 
 ; Calls only issue to L0.
+; XXX: CWG says latency for jsr is 3
 (define_insn_reservation "ev6_jsr" 1
   (and (eq_attr "tune" "ev6")
        (eq_attr "type" "jsr"))
   "ev6_l0")
 
+; 1-cycle 0 to 0
+(define_bypass 1
+  "ev6_arith_0,ev6_shift_0"
+  "ev6_ild_0,ev6_ist_0,ev6_arith_0,ev6_mvi,ev6_shift_0")
+
+; 3-cycle 0 to 0
+(define_bypass 3
+  "ev6_mvi"
+  "ev6_ild_0,ev6_ist_0,ev6_arith_0,ev6_mvi,ev6_shift_0")
+
+; 1-cycle 1 to 1
+(define_bypass 1
+  "ev6_arith_1,ev6_shift_1"
+  "ev6_ild_1,ev6_ist_1,ev6_arith_1,ev6_shift_1,ev6_imul,ev6_jsr")
+
+; 7-cycle 1 to 1
+(define_bypass 7
+  "ev6_imul"
+  "ev6_ild_1,ev6_ist_1,ev6_arith_1,ev6_shift_1,ev6_imul,ev6_jsr")
+
+; XXX: bypass for ild/ist/itof?
+
+; XXX: for instructions specified in the bypass, since I'm increasing their
+; default latencies, should I specify how long they'll be using the functional
+; unit, like as is done for ev6_f{div,sqrt}?
+
 ; Ftoi/itof only issue to lower pipes.
 (define_insn_reservation "ev6_itof" 3
   (and (eq_attr "tune" "ev6")
