Hi,

I am working with GCC 4.1.1 on a simple 5-stage scalar RISC processor, with the following pipeline description for loads and stores:
  (define_insn_reservation "integer" 1
    (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
    "issue,iu,wb")
  (define_insn_reservation "load" 3
    (eq_attr "type" "load")
    "issue,iu,wb")
  (define_insn_reservation "store" 1
    (eq_attr "type" "store")
    "issue,iu,wb")

I am seeing poor scheduling in Dhrystone where a memcpy call is expanded inline:

  memcpy (&dst, &src, 16)  ==>
    load  1, rA + 4
    store 1, rB + 4
    load  2, rA + 8
    store 2, rB + 8
    ...

Basically, instead of pipelining the loads, the current schedule stalls the processor for two cycles on every dependent store. Here is a dump from the .35.sched1 file:

;; ======================================================
;; -- basic block 0 from 6 to 36 -- before reload
;; ======================================================
;;    0-->     6  r84=r5             :issue,iu,wb
;;    1-->    13  r86=[`Ptr_Glob']   :issue,iu,wb
;;    2-->    25  r92=0x5            :issue,iu,wb
;;    3-->    12  r85=[r84]          :issue,iu,wb
;;    4-->    14  r87=[r86]          :issue,iu,wb
;;    7-->    15  [r85]=r87          :issue,iu,wb
;;    8-->    16  r88=[r86+0x4]      :issue,iu,wb
;;   11-->    17  [r85+0x4]=r88      :issue,iu,wb
;;   12-->    18  r89=[r86+0x8]      :issue,iu,wb
;;   15-->    19  [r85+0x8]=r89      :issue,iu,wb
;;   16-->    20  r90=[r86+0xc]      :issue,iu,wb
;;   19-->    21  [r85+0xc]=r90      :issue,iu,wb
;;   20-->    22  r91=[r86+0x10]     :issue,iu,wb
;;   23-->    23  [r85+0x10]=r91     :issue,iu,wb
;;   24-->    26  [r84+0xc]=r92      :issue,iu,wb
;;   25-->    31  clobber r3         :nothing
;;   25-->    36  use r3             :nothing
;; Ready list (final):
;; total time = 25
;; new head = 7
;; new tail = 36

There is an obviously better schedule to be had. Here is one such (hand-modified) schedule, which just pipelines two of the loads to obtain a shorter critical path length for the whole function (the function has only bb 0):

;;    0-->     6  r84=r5             :issue,iu,wb
;;    1-->    13  r86=[`Ptr_Glob']   :issue,iu,wb
;;    2-->    25  r92=0x5            :issue,iu,wb
;;    3-->    12  r85=[r84]          :issue,iu,wb
;;    4-->    14  r87=[r86]          :issue,iu,wb
;;    7-->    15  [r85]=r87          :issue,iu,wb
;;    8-->    16  r88=[r86+0x4]      :issue,iu,wb
;;    9-->    18  r89=[r86+0x8]      :issue,iu,wb
;;   10-->    20  r90=[r86+0xc]      :issue,iu,wb
;;   11-->    17  [r85+0x4]=r88      :issue,iu,wb
;;   12-->    19  [r85+0x8]=r89      :issue,iu,wb
;;   13-->    21  [r85+0xc]=r90      :issue,iu,wb
;;   14-->    22  r91=[r86+0x10]     :issue,iu,wb
;;   17-->    23  [r85+0x10]=r91     :issue,iu,wb
;;   18-->    26  [r84+0xc]=r92      :issue,iu,mb_wb
;;   19-->    31  clobber r3         :nothing
;;   20-->    36  use r3             :nothing
;; Ready list (final):
;; total time = 20
;; new head = 7
;; new tail = 36

This schedule is 5 cycles faster.

I have read and re-read the material surrounding the DFA scheduler. I understand that the heuristics optimize critical path length rather than stalls or other metrics, but in this case it is precisely the critical path length that the better schedule shortens.

I have been examining the various hooks available, and for a while it seemed that TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a larger window so that the scheduler considers more candidates from the ready queue. For instance, this discussion seems to say so:

  http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html

But a post that follows soon after seems to imply otherwise:

  http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html

Both posts are from Vladimir. In any case, the final conclusion seems to be that the lookahead is useful only for multi-issue processors.

So what is wrong in my description? How do I get GCC to produce the optimal schedule? I would appreciate any help, as I have run into similar scheduling situations with other architectures in the past (GCC seems to choose not to pipeline things as much as it could), and this time I would like to understand it a bit more.
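For reference, this is roughly how I understand a larger lookahead window would be wired into a port's .c file. It is only a sketch on my part: myport_sched_dfa_lookahead and the window of 4 are placeholder names/values, not anything my port currently defines.

  /* Sketch only: ask the first-cycle multipass scheduler to consider up
     to four insns from the ready list each cycle.  The hook takes no
     arguments and returns the size of the lookahead window.  */
  static int
  myport_sched_dfa_lookahead (void)
  {
    return 4;
  }

  #undef  TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
    myport_sched_dfa_lookahead

(The #define goes with the other target macro overrides before the targetm initializer in the port.)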
The compile flags used are basically just -O3.

--
Thanks,
Matt