> Before merging the insn reservations, I need to compare the latency values
> for bdver1 and bdver3. I know that they are different for some of the
> instructions.
> In that case, the merging should prop up another subset of latency
> differences. I would like to keep these insn reservations in two .md files
> (one for bdver1 and one for bdver3) even after the merger.
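For context on the latency question above: in GCC machine descriptions the latency is the number on the define_insn_reservation form itself, separate from the automaton's unit reservations, so per-CPU latencies can stay in their own .md files even when automata are shared. A schematic form (the insn name, latency value, and reservation string here are illustrative, not taken from the real bdver files):

```lisp
;; Schematic only: the "4" is the latency of this reservation; the string on
;; the last line is what the automaton models.  Keeping bdver1 and bdver3
;; latencies different only requires two such forms, one per .md file, even
;; if the underlying cpu units live in a shared automaton.
(define_insn_reservation "bdver1_imul_example" 4
  (and (eq_attr "cpu" "bdver1,bdver2")
       (eq_attr "type" "imul"))
  "bdver1-direct,bdver1-ieu1")
```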
I am not really insisting on merging (define_insn_reservation "bdver3*") with
(define_insn_reservation "bdver1*"). What I have in mind is merging the actual
automatons in cases where it makes sense. Latencies are not really encoded in
those.

bdver1 has:
(define_automaton "bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu")
while bdver3:
(define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")

The automatons bdver1 and bdver3 are very different, because one handles up to
3 instructions, while the other handles only 2. I am still a bit confused by
this every-second-cycle logic, so let's discuss it incrementally.

I would propose to have (define_automaton "bdver3") or perhaps
(define_automaton "bdver3,bdver3_fp"). Now if you look at the use of
bdver3_ieu we have:

jan@linux-9ure:~/trunk/gcc/config/i386> grep bdver1-ieu *.md
bdver1.md:(define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
bdver1.md:(define_cpu_unit "bdver1-ieu1" "bdver1_ieu")
bdver1.md:(define_reservation "bdver1-ieu" "(bdver1-ieu0 | bdver1-ieu1)")
bdver1.md:(define_reservation "bdver1-ivector" "bdver1-ieu0+bdver1-ieu1+
bdver1.md:  "bdver1-direct1,bdver1-ieu1")
bdver1.md:  "bdver1-direct1,bdver1-ieu1")
bdver1.md:  "bdver1-direct1,bdver1-load,bdver1-ieu1")
bdver1.md:  "bdver1-direct1,bdver1-load,bdver1-ieu1")
bdver1.md:  "bdver1-vector,(bdver1-ieu0*6+(bdver1-fpsched,bdver1-fvector))")
bdver1.md:  "bdver1-vector,((bdver1-load,bdver1-ieu0*6)+(bdver1-fpsched,bdver1-fvector))")
bdver1.md:  "bdver1-vector,bdver1-load,bdver1-ieu0*6")
bdver1.md:  "bdver1-direct,bdver1-ieu")
bdver1.md:  "bdver1-vector,bdver1-ieu,bdver1-ieu")
bdver1.md:  "bdver1-direct,bdver1-load,bdver1-ieu")
bdver1.md:  "bdver1-vector,bdver1-load,bdver1-ieu,bdver1-ieu")
bdver1.md:  bdver1-ieu,bdver1-store,
bdver1.md:  bdver1-ieu,
bdver1.md:  bdver1-ieu,
bdver1.md:  "bdver1-direct,(bdver1-ieu+bdver1-agu),
bdver1.md:  "bdver1-vector,(bdver1-ieu+bdver1-agu),bdver1-ieu,
jan@linux-9ure:~/trunk/gcc/config/i386> grep bdver3-ieu *.md
bdver3.md:(define_cpu_unit "bdver3-ieu0" "bdver3_ieu")
bdver3.md:(define_cpu_unit "bdver3-ieu1" "bdver3_ieu")
bdver3.md:(define_reservation "bdver3-ieu" "(bdver3-ieu0|bdver3-ieu1)")
bdver3.md:(define_reservation "bdver3-ivector" "bdver3-ieu0+bdver3-ieu1+
bdver3.md:  "bdver3-double,(bdver3-agu | bdver3-ieu),nothing")
bdver3.md:  "bdver3-direct,bdver3-ieu,bdver3-store")
bdver3.md:  "bdver3-direct,bdver3-ieu")
bdver3.md:  "bdver3-direct,bdver3-ieu1")
bdver3.md:  "bdver3-direct,bdver3-ieu1")
bdver3.md:  "bdver3-direct,bdver3-load,bdver3-ieu1")
bdver3.md:  "bdver3-direct,bdver3-load,bdver3-ieu1")
bdver3.md:  "bdver3-direct,(bdver3-ieu|bdver3-agu)")
bdver3.md:  "bdver3-direct,bdver3-load,bdver3-ieu")
bdver3.md:  "bdver3-direct,bdver3-ieu,bdver3-store")
bdver3.md:  bdver3-ieu,bdver3-store,
bdver3.md:  "bdver3-direct,(bdver3-ieu+bdver3-agu),

While they are not always used the same way, it seems that bdver3
instructions have similar characteristics to bdver1 instructions, so unifying
the automatons would save binary size. This only means changing these to

bdver3.md:(define_cpu_unit "bdver3-ieu0" "bdver1_ieu")
bdver3.md:(define_cpu_unit "bdver3-ieu1" "bdver1_ieu")

and removing bdver3_ieu from the list of automatons. The automaton
minimization should take care of the rest, and the resulting schedules should
not change. I will give it a try.

> > Your version has the problem that it does not model the fact that the two
> > decoders work sequentially.
> The two stage modeling is required so that the decode unit reservations are
> screened from other unit reservations.
> But this sort of goes away in bdver3 because of the decode cycle.
> In bdver3, the decode units scan two of these windows every "two" cycles,
> decoding a maximum of eight instructions.
> The hardware scan is done every two cycles in bdver3, whereas it is done
> every single cycle in bdver1/bdver2.
> (But we have two separate hardware decoders, which guarantees higher
> throughput.)
> This means that the two stage modeling is not required in the scheduler
> descriptions, since the hardware sort of guarantees that with its scanning
> mechanism.

OK, my understanding always was that the decoders work sort of sequentially.
One decoder of bdver1 can do

  vector
  double, single
  single, single, single

where decoder 1 is somehow special, but the hardware is able to swap the
first and second single. Now if two decoders run, I would expect

  vector, vector
  single, single, single
  double, single, double, single

to be decoded in one cycle, but not

  single, vector, single

or

  double, double, single, single

A similar situation seems to exist for bdver3 too. As I understand, it does

  single, single
  double

So for example the

  single, double, single, double, single, double, single, double

pattern will take longer than

  single, single, double, double, single, single, double, double

If decoders are allocated sequentially and the instruction order matters,
modeling this precisely and enabling DFA_LOOKAHEAD should make GCC place
things correctly into groups (plus or minus a lot of luck, because of
dispatch windows and size limitations; there is already some code for Atom to
take care of this, perhaps it can be generalized). Or is the pre-scan smart
enough to allocate single, single into the first decoder and vector into the
second?

> Our job is to make sure that 8 direct instructions get scheduled in two
> cycles or 4 double instructions get scheduled in two cycles.
> So, I have modeled the bdver3 decoders such that within a cycle they
> guarantee to issue 4 direct instructions or 2 double instructions.
> This eliminates the sequencing problem in modeling decoders and also ensures
> that the issue rate can be numbered for a single cycle rather than two
> cycles.
> This is one of the reasons why I remodeled only bdver3. Let me know your
> comments on this.
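The capacity claim in the quoted paragraph (4 direct or 2 double instructions per cycle, hence 8 direct or 4 double per two-cycle window) can be illustrated with a toy in-order decode model. The slot costs and per-cycle budget below are assumptions made for illustration, not values taken from the real automaton:

```python
# Toy model of in-order decode with a fixed per-cycle slot budget.
# Assumed costs: direct = 1 slot, double = 2 slots, vector = whole cycle.
COST = {"direct": 1, "double": 2, "vector": 4}
SLOTS_PER_CYCLE = 4  # assumption: 8 direct ops per two-cycle window

def decode_cycles(insns):
    """Cycles needed to decode insns in order; an op never splits a cycle."""
    cycles, free = 0, 0
    for insn in insns:
        need = COST[insn]
        if need > free:      # current cycle cannot hold it: open a new one
            cycles += 1
            free = SLOTS_PER_CYCLE
        free -= need
    return cycles

print(decode_cycles(["direct"] * 8))   # -> 2
print(decode_cycles(["double"] * 4))   # -> 2
```

Under this simplified model the issue rate can indeed be stated per single cycle, matching the argument above; the real .md automaton would enforce the same budget through unit reservations rather than explicit counting.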
> > We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get
> > more realistic estimates on what still can be issued - the value of 6 is
> > unrealistically high.
> This would get more complicated if we go by decoder capacity in bdver3. As
> we have two hardware decoders in Steamroller (bdver3), they have a capacity
> to decode eight instructions per clock cycle, providing up to twice the
> decode and dispatch bandwidth compared to bdver1.
> If we model this in GCC we need to change the issue rate to 8. If 6 is high,
> then 8 would add more joy and excitement.
>
> TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in
> a different way.

> > We also should enable ia32_multipass_dfa_lookahead - with that the
> > scheduler should be able to put double decoded and vector decoded insns on
> > the proper places.
> Yes. Whenever we have this scheduler analysis in place we discuss this, but
> unfortunately it is left as it is.
> I will look into this after I do the enablement for bdver4.

I benchmarked this with my bdver1/2 changes (I don't have bdver3 for
benchmarking) and it does not seem to make much of a difference at this
stage. It may however simply be because the model is broken. Also SPEC is not
the best benchmark for scheduling, since it has many other bounds.

> > I will work on replacing most of the CPU cases into tuning flags + costs.
> I am planning to get the bdver4 enablement in place once the scheduler
> descriptions for bdver3 are done with.
> I will have cycles to look into the cost models. Please delegate some tasks
> if you can and I am willing to take them up. Thanks.

What I am working on now is deciding what the scheduling of generic should
be. For that I need to be sure that both Bulldozer and Core scheduling work
well, so I can compare their benefits on both CPUs. So getting through these
issues is very useful.
Honza

> Regards
> Ganesh
>
> -----Original Message-----
> From: Jan Hubicka [mailto:hubi...@ucw.cz]
> Sent: Tuesday, October 08, 2013 3:20 PM
> To: Gopalasubramanian, Ganesh
> Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com
> Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern
> x86 chips
>
> > Hi Honza,
> >
> > I am planning to update the scheduler descriptions for bdver3 first.
> > Attached is the patch. Please let me know your comments if any.
> >
> > Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines
> > and decoding schemes are different. So, let me know how I can approach
> > merging these.
>
> Yep, I think we need to merge only those automata that are the same for
> both:
> (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
> probably can become
> (define_automaton "bdver3,bdver3_fp")
> with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu
> changed to the bdver1 automaton. I think it should result in a smaller
> binary - the fact that all conditionals are physically duplicated in
> bdver1/bdver3.md should be optimized away by genautomata.
>
> I also played a bit with the decoders and I am attaching my version - that
> seems SPEC neutral though.
> Your version has the problem that it does not model the fact that the two
> decoders work sequentially.
>
> I removed the bdver1-decodev unit and instead I simply reserve all three
> decoders + I added a presence set requiring the second decoder to be taken
> only after the first one, and changed the presence set requiring decoder 2
> to be taken only after decoders 1+2 to a final presence set, so the decodev
> reservation has a chance to pass.
> Finally I added use-decodera, which consumes all of the first decoder as
> soon as we start to allocate the second decoder - we cannot really allocate
> them in interchanging order.
>
> I also noticed that push/pop instructions are modeled as being vector, while
> the manual says one micro op is generated unless a memory operand is used.
>
> I did not have much time to play further with this except for manual
> inspection of schedules, which seem better now and in rare cases handle 4-5
> instructions per cycle.
>
> We also should enable ia32_multipass_dfa_lookahead - with that the scheduler
> should be able to put double decoded and vector decoded insns on the proper
> places.
>
> We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more
> realistic estimates on what still can be issued - the value of 6 is
> unrealistically high.
>
> Seems like with the addition of Atom the scheduler macros became a very
> twisty maze of passages.
> I will work on replacing most of the CPU cases into tuning flags + costs.
>
> What do you think?
> Honza
>
> Index: bdver1.md
> ===================================================================
> --- bdver1.md	(revision 203204)
> +++ bdver1.md	(working copy)
> @@ -41,7 +41,9 @@
>  (define_cpu_unit "bdver1-decode0" "bdver1")
>  (define_cpu_unit "bdver1-decode1" "bdver1")
>  (define_cpu_unit "bdver1-decode2" "bdver1")
> -(define_cpu_unit "bdver1-decodev" "bdver1")
> +(define_cpu_unit "bdver1-decode0b" "bdver1")
> +(define_cpu_unit "bdver1-decode1b" "bdver1")
> +(define_cpu_unit "bdver1-decode2b" "bdver1")
>
>  ;; Model the fact that double decoded instruction may take 2 cycles
>  ;; to decode when decoder2 and decoder0 in next cycle
> @@ -57,18 +59,26 @@
>  ;; too. Vector decoded instructions then can't be issued when modeled
>  ;; as consuming decoder0+decoder1+decoder2.
>  ;; We solve that by specialized vector decoder unit and exclusion set.
> -(presence_set "bdver1-decode2" "bdver1-decode0")
> -(exclusion_set "bdver1-decodev" "bdver1-decode0,bdver1-decode1,bdver1-decode2")
> -
> -(define_reservation "bdver1-vector" "nothing,bdver1-decodev")
> -(define_reservation "bdver1-direct1" "nothing,bdver1-decode1")
> +(final_presence_set "bdver1-decode2" "bdver1-decode0,bdver1-decode1")
> +(presence_set "bdver1-decode0b,bdver1-decode1b,bdver1-decode2b" "bdver1-decode0,bdver1-decode1")
> +(final_presence_set "bdver1-decode2b" "bdver1-decode0b,bdver1-decode1b")
> +
> +(define_reservation "use-decodera" "((bdver1-decode0 | nothing)
> +				     + (bdver1-decode1 | nothing)
> +				     + (bdver1-decode2 | nothing))")
> +(define_reservation "bdver1-vector" "nothing,((bdver1-decode0+bdver1-decode1+bdver1-decode2)
> +					      |(use-decodera+bdver1-decode0b+bdver1-decode1b+bdver1-decode2b))")
> +(define_reservation "bdver1-direct1" "nothing,(bdver1-decode1|(use-decodera+bdver1-decode1b))")
>  (define_reservation "bdver1-direct" "nothing,
>  				     (bdver1-decode0 | bdver1-decode1
> -				      | bdver1-decode2)")
> +				      | bdver1-decode2 | (use-decodera+bdver1-decode0b)
> +				      | (use-decodera+bdver1-decode1b) | (use-decodera+bdver1-decode2b))")
>  ;; Double instructions behaves like two direct instructions.
>  (define_reservation "bdver1-double" "((bdver1-decode2,bdver1-decode0)
>  				     | (nothing,(bdver1-decode0 + bdver1-decode1))
> -				     | (nothing,(bdver1-decode1 + bdver1-decode2)))")
> +				     | (nothing,(bdver1-decode1 + bdver1-decode2))
> +				     | (nothing,(use-decodera + bdver1-decode0b + bdver1-decode1b))
> +				     | (nothing,(use-decodera + bdver1-decode1b + bdver1-decode2b)))")
>
>  (define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
> @@ -131,17 +141,28 @@
>  	      (eq_attr "type" "call,callv"))
>  	 "bdver1-double,bdver1-agu")
>  ;; PUSH mem is double path.
> +(define_insn_reservation "bdver1_pushmem" 1
> +			(and (eq_attr "cpu" "bdver1,bdver2")
> +			     (and (eq_attr "type" "push")
> +				  (eq_attr "memory" "both")))
> +			"bdver1-direct,bdver1-load,bdver1-store")
>  (define_insn_reservation "bdver1_push" 1
>  			(and (eq_attr "cpu" "bdver1,bdver2")
>  			     (eq_attr "type" "push"))
> -			"bdver1-direct,bdver1-agu,bdver1-store")
> +			"bdver1-direct,bdver1-store")
>  ;; POP r16/mem are double path.
> +;; 16bit pops are not really used by GCC.
> +(define_insn_reservation "bdver1_popmem" 1
> +			(and (eq_attr "cpu" "bdver1,bdver2")
> +			     (and (eq_attr "type" "pop")
> +				  (eq_attr "memory" "both")))
> +			"bdver1-direct,bdver1-load,bdver1-store")
>  (define_insn_reservation "bdver1_pop" 1
>  			(and (eq_attr "cpu" "bdver1,bdver2")
>  			     (eq_attr "type" "pop"))
> -			"bdver1-direct,bdver1-ivector")
> -;; LEAVE no latency info so far, assume same with amdfam10.
> -(define_insn_reservation "bdver1_leave" 3
> +			"bdver1-direct,bdver1-load")
> +;; By Agner Fog, latency is 4.
> +(define_insn_reservation "bdver1_leave" 4
>  			(and (eq_attr "cpu" "bdver1,bdver2")
>  			     (eq_attr "type" "leave"))
>  			"bdver1-vector,bdver1-ivector")
> Index: i386.c
> ===================================================================
> --- i386.c	(revision 203204)
> +++ i386.c	(working copy)
> @@ -24427,11 +25019,13 @@ ix86_issue_rate (void)
>      case PROCESSOR_K8:
>      case PROCESSOR_AMDFAM10:
>      case PROCESSOR_GENERIC:
> -    case PROCESSOR_BDVER1:
> -    case PROCESSOR_BDVER2:
> -    case PROCESSOR_BDVER3:
>      case PROCESSOR_BTVER1:
>        return 3;
> +    case PROCESSOR_BDVER3:
> +      return 4;
> +    case PROCESSOR_BDVER1:
> +    case PROCESSOR_BDVER2:
> +      return 6;
>
>      case PROCESSOR_CORE2:
>      case PROCESSOR_COREI7:
> @@ -24697,11 +25291,27 @@ ix86_adjust_cost (rtx insn, rtx link, rt
>      case PROCESSOR_GENERIC:
>        memory = get_attr_memory (insn);
>
> -      /* Stack engine allows to execute push&pop instructions in parall.  */
> +      /* Stack engine allows to execute push&pop instructions in parallel.
> +	 ??? There seems to be no detailed documentation of AMDFAM10, perhaps
> +	 it is actually equivalent to the stronger notion of engine below.  */
>        if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
>  	   && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
> -	  && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8))
> +	  && (ix86_tune == PROCESSOR_AMDFAM10))
> +	return 0;
> +
> +      /* For Bulldozer up, the stack engine makes the value of the stack
> +	 pointer available immediately, no matter the use.  (I.e. when ESP is
> +	 used as a pointer or for arithmetic, the cost is bypassed, too.)  */
> +      if (ix86_tune >= PROCESSOR_BDVER1
> +	  && dep_insn_type == TYPE_PUSH)
>  	return 0;
> +      if (ix86_tune >= PROCESSOR_BDVER1
> +	  && dep_insn_type == TYPE_POP)
> +	{
> +	  rtx dest = SET_DEST (PATTERN (dep_insn));
> +	  if (REG_P (dest) && !reg_referenced_p (dest, PATTERN (insn)))
> +	    return 0;
> +	}
>
>        /* Show ability of reorder buffer to hide latency of load by executing
>  	 in parallel with previous instruction in case
>