Hi,

I am working with GCC 4.1.1 on a simple 5-stage scalar RISC processor, with the following pipeline description for loads and stores:
  (define_insn_reservation "integer" 1
    (eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
    "issue,iu,wb")
  (define_insn_reservation "load" 3
    (eq_attr "type" "load")
    "issue,iu,wb")
  (define_insn_reservation "store" 1
    (eq_attr "type" "store")
    "issue,iu,wb")

I am seeing poor scheduling in Dhrystone where a memcpy call is expanded inline:

  memcpy (&dst, &src, 16)  ==>
    load  1, rA + 4
    store 1, rB + 4
    load  2, rA + 8
    store 2, rB + 8
    ...

Basically, instead of pipelining the loads, the current schedule stalls the processor for two cycles on every dependent store. Here is a dump from the .35.sched1 file:

;; ======================================================
;; -- basic block 0 from 6 to 36 -- before reload
;; ======================================================
;;    0-->     6  r84=r5             :issue,iu,wb
;;    1-->    13  r86=[`Ptr_Glob']   :issue,iu,wb
;;    2-->    25  r92=0x5            :issue,iu,wb
;;    3-->    12  r85=[r84]          :issue,iu,wb
;;    4-->    14  r87=[r86]          :issue,iu,wb
;;    7-->    15  [r85]=r87          :issue,iu,wb
;;    8-->    16  r88=[r86+0x4]      :issue,iu,wb
;;   11-->    17  [r85+0x4]=r88      :issue,iu,wb
;;   12-->    18  r89=[r86+0x8]      :issue,iu,wb
;;   15-->    19  [r85+0x8]=r89      :issue,iu,wb
;;   16-->    20  r90=[r86+0xc]      :issue,iu,wb
;;   19-->    21  [r85+0xc]=r90      :issue,iu,wb
;;   20-->    22  r91=[r86+0x10]     :issue,iu,wb
;;   23-->    23  [r85+0x10]=r91     :issue,iu,wb
;;   24-->    26  [r84+0xc]=r92      :issue,iu,wb
;;   25-->    31  clobber r3         :nothing
;;   25-->    36  use r3             :nothing
;; Ready list (final):
;; total time = 25
;; new head = 7
;; new tail = 36

There is an obviously better schedule to be had. Here is one such (hand-modified) schedule, which just pipelines two of the loads to obtain a shorter critical path length for the whole function (the function has only bb 0):

;;    0-->     6  r84=r5             :issue,iu,wb
;;    1-->    13  r86=[`Ptr_Glob']   :issue,iu,wb
;;    2-->    25  r92=0x5            :issue,iu,wb
;;    3-->    12  r85=[r84]          :issue,iu,wb
;;    4-->    14  r87=[r86]          :issue,iu,wb
;;    7-->    15  [r85]=r87          :issue,iu,wb
;;    8-->    16  r88=[r86+0x4]      :issue,iu,wb
;;    9-->    18  r89=[r86+0x8]      :issue,iu,wb
;;   10-->    20  r90=[r86+0xc]      :issue,iu,wb
;;   11-->    17  [r85+0x4]=r88      :issue,iu,wb
;;   12-->    19  [r85+0x8]=r89      :issue,iu,wb
;;   13-->    21  [r85+0xc]=r90      :issue,iu,wb
;;   14-->    22  r91=[r86+0x10]     :issue,iu,wb
;;   17-->    23  [r85+0x10]=r91     :issue,iu,wb
;;   18-->    26  [r84+0xc]=r92      :issue,iu,mb_wb
;;   19-->    31  clobber r3         :nothing
;;   20-->    36  use r3             :nothing
;; Ready list (final):
;; total time = 20
;; new head = 7
;; new tail = 36

This schedule is 5 cycles faster.

I have read and re-read the material surrounding the DFA scheduler. I understand that the heuristics optimize critical path length rather than stalls or other metrics, but in this case it is precisely the critical path length that the better schedule shortens.

I have been examining the various hooks available, and for a while it seemed that TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a larger window so that the scheduler considers more candidates from the ready queue. For instance, this discussion seems to say so:

  http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html

But a post that follows soon after seems to imply otherwise:

  http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html

Both posts are from Vladimir. In any case, the final conclusion seems to be that the lookahead is useful only for multi-issue processors.

So what is wrong in my description? How do I get GCC to produce the optimal schedule? I would appreciate any help, as I have run into similar scheduling situations with other architectures in the past (GCC seems to choose not to pipeline things as much as it could), and this time I would like to understand it a bit more.
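For reference, this is roughly how I understand a larger lookahead window would be wired into a port's .c file. It is only a sketch on my part: myport_sched_dfa_lookahead and the window of 4 are placeholder names/values, not anything my port currently defines.

  /* Sketch only: ask the first-cycle multipass scheduler to consider up
     to four insns from the ready list each cycle.  The hook takes no
     arguments and returns the size of the lookahead window.  */
  static int
  myport_sched_dfa_lookahead (void)
  {
    return 4;
  }

  #undef  TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD
  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD \
    myport_sched_dfa_lookahead

(The #define goes with the other target macro overrides before the targetm initializer in the port.)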
The compile flags used are basically just -O3.

--
Thanks,
Matt