> Before merging the insn reservations, I need to compare the latency values
> for bdver1 and bdver3. I know that they are different for some of the
> instructions.
> In that case, the merging should prop up another subset of latency
> differences. I would like to keep these insn reservations in two .md files
> (one for bdver1 and one for bdver3) even after the merger.
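For context on the latency question above: in GCC machine descriptions the latency is the number on the define_insn_reservation form itself, separate from the automaton's unit reservations, so per-CPU latencies can stay in their own .md files even when automata are shared. A schematic form (the insn name, latency value, and reservation string here are illustrative, not taken from the real bdver files):

```lisp
;; Schematic only: the "4" is the latency of this reservation; the string on
;; the last line is what the automaton models.  Keeping bdver1 and bdver3
;; latencies different only requires two such forms, one per .md file, even
;; if the underlying cpu units live in a shared automaton.
(define_insn_reservation "bdver1_imul_example" 4
  (and (eq_attr "cpu" "bdver1,bdver2")
       (eq_attr "type" "imul"))
  "bdver1-direct,bdver1-ieu1")
```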
I am not really insisting on merging (define_insn_reservation "bdver3*") with
(define_insn_reservation "bdver1*"). What I have in mind is merging the actual
automatons in cases where it makes sense. Latencies are not really encoded in
those.

bdver1 has:
(define_automaton "bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu")
while bdver3:
(define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")

The automatons bdver1 and bdver3 are very different, because one handles up to
3 instructions, while the other handles only 2. I am still a bit confused by
this every-second-cycle logic, so let's discuss it incrementally.

I would propose to have (define_automaton "bdver3") or perhaps
(define_automaton "bdver3,bdver3_fp"). Now if you look at the use of
bdver3_ieu we have:

jan@linux-9ure:~/trunk/gcc/config/i386> grep bdver1-ieu *.md
bdver1.md:(define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
bdver1.md:(define_cpu_unit "bdver1-ieu1" "bdver1_ieu")
bdver1.md:(define_reservation "bdver1-ieu" "(bdver1-ieu0 | bdver1-ieu1)")
bdver1.md:(define_reservation "bdver1-ivector" "bdver1-ieu0+bdver1-ieu1+
bdver1.md:  "bdver1-direct1,bdver1-ieu1")
bdver1.md:  "bdver1-direct1,bdver1-ieu1")
bdver1.md:  "bdver1-direct1,bdver1-load,bdver1-ieu1")
bdver1.md:  "bdver1-direct1,bdver1-load,bdver1-ieu1")
bdver1.md:  "bdver1-vector,(bdver1-ieu0*6+(bdver1-fpsched,bdver1-fvector))")
bdver1.md:  "bdver1-vector,((bdver1-load,bdver1-ieu0*6)+(bdver1-fpsched,bdver1-fvector))")
bdver1.md:  "bdver1-vector,bdver1-load,bdver1-ieu0*6")
bdver1.md:  "bdver1-direct,bdver1-ieu")
bdver1.md:  "bdver1-vector,bdver1-ieu,bdver1-ieu")
bdver1.md:  "bdver1-direct,bdver1-load,bdver1-ieu")
bdver1.md:  "bdver1-vector,bdver1-load,bdver1-ieu,bdver1-ieu")
bdver1.md:  bdver1-ieu,bdver1-store,
bdver1.md:  bdver1-ieu,
bdver1.md:  bdver1-ieu,
bdver1.md:  "bdver1-direct,(bdver1-ieu+bdver1-agu),
bdver1.md:  "bdver1-vector,(bdver1-ieu+bdver1-agu),bdver1-ieu,
jan@linux-9ure:~/trunk/gcc/config/i386> grep bdver3-ieu *.md
bdver3.md:(define_cpu_unit "bdver3-ieu0" "bdver3_ieu")
bdver3.md:(define_cpu_unit "bdver3-ieu1" "bdver3_ieu")
bdver3.md:(define_reservation "bdver3-ieu" "(bdver3-ieu0|bdver3-ieu1)")
bdver3.md:(define_reservation "bdver3-ivector" "bdver3-ieu0+bdver3-ieu1+
bdver3.md:  "bdver3-double,(bdver3-agu | bdver3-ieu),nothing")
bdver3.md:  "bdver3-direct,bdver3-ieu,bdver3-store")
bdver3.md:  "bdver3-direct,bdver3-ieu")
bdver3.md:  "bdver3-direct,bdver3-ieu1")
bdver3.md:  "bdver3-direct,bdver3-ieu1")
bdver3.md:  "bdver3-direct,bdver3-load,bdver3-ieu1")
bdver3.md:  "bdver3-direct,bdver3-load,bdver3-ieu1")
bdver3.md:  "bdver3-direct,(bdver3-ieu|bdver3-agu)")
bdver3.md:  "bdver3-direct,bdver3-load,bdver3-ieu")
bdver3.md:  "bdver3-direct,bdver3-ieu,bdver3-store")
bdver3.md:  bdver3-ieu,bdver3-store,
bdver3.md:  "bdver3-direct,(bdver3-ieu+bdver3-agu),

While they are not always used the same way, it seems that bdver3
instructions have similar characteristics to bdver1 instructions, so unifying
the automatons would save binary size. This only means changing these to

bdver3.md:(define_cpu_unit "bdver3-ieu0" "bdver1_ieu")
bdver3.md:(define_cpu_unit "bdver3-ieu1" "bdver1_ieu")

and removing bdver3_ieu from the list of automatons. The automaton
minimization should take care of the rest, and the resulting schedules should
not change. I will give it a try.

> > Your version has the problem that it does not model the fact that the two
> > decoders work sequentially.
> The two stage modeling is required so that the decode unit reservations are
> screened from other unit reservations.
> But this sort of goes away in bdver3 because of the decode cycle.
> In bdver3, the decode units scan two of these windows every "two" cycles,
> decoding a maximum of eight instructions.
> The hardware scan is done every two cycles in bdver3, whereas it is done
> every single cycle in bdver1/bdver2.
> (But we have two separate hardware decoders, which guarantees higher
> throughput.)
> This means that the two stage modeling is not required in the scheduler
> descriptions, since the hardware sort of guarantees that with its scanning
> mechanism.

OK, my understanding always was that the decoders work sort of sequentially.
One decoder of bdver1 can do

  vector
  double, single
  single, single, single

where decoder 1 is somehow special, but the hardware is able to swap the
first and second single. Now if two decoders run, I would expect

  vector, vector
  single, single, single
  double, single, double, single

to be decoded in one cycle, but not

  single, vector, single

or

  double, double, single, single

A similar situation seems to exist for bdver3 too. As I understand, it does

  single, single
  double

So for example the

  single, double, single, double, single, double, single, double

pattern will take longer than

  single, single, double, double, single, single, double, double

If decoders are allocated sequentially and the instruction order matters,
modeling this precisely and enabling DFA_LOOKAHEAD should make GCC place
things correctly into groups (plus or minus a lot of luck, because of
dispatch windows and size limitations; there is already some code for Atom to
take care of this, perhaps it can be generalized). Or is the pre-scan smart
enough to allocate single, single into the first decoder and vector into the
second?

> Our job is to make sure that 8 direct instructions get scheduled in two
> cycles or 4 double instructions get scheduled in two cycles.
> So, I have modeled the bdver3 decoders such that within a cycle they
> guarantee to issue 4 direct instructions or 2 double instructions.
> This eliminates the sequencing problem in modeling decoders and also ensures
> that the issue rate can be numbered for a single cycle rather than two
> cycles.
> This is one of the reasons why I remodeled only bdver3. Let me know your
> comments on this.
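The capacity claim in the quoted paragraph (4 direct or 2 double instructions per cycle, hence 8 direct or 4 double per two-cycle window) can be illustrated with a toy in-order decode model. The slot costs and per-cycle budget below are assumptions made for illustration, not values taken from the real automaton:

```python
# Toy model of in-order decode with a fixed per-cycle slot budget.
# Assumed costs: direct = 1 slot, double = 2 slots, vector = whole cycle.
COST = {"direct": 1, "double": 2, "vector": 4}
SLOTS_PER_CYCLE = 4  # assumption: 8 direct ops per two-cycle window

def decode_cycles(insns):
    """Cycles needed to decode insns in order; an op never splits a cycle."""
    cycles, free = 0, 0
    for insn in insns:
        need = COST[insn]
        if need > free:      # current cycle cannot hold it: open a new one
            cycles += 1
            free = SLOTS_PER_CYCLE
        free -= need
    return cycles

print(decode_cycles(["direct"] * 8))   # -> 2
print(decode_cycles(["double"] * 4))   # -> 2
```

Under this simplified model the issue rate can indeed be stated per single cycle, matching the argument above; the real .md automaton would enforce the same budget through unit reservations rather than explicit counting.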
> > We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get
> > more realistic estimates on what still can be issued - the value of 6 is
> > unrealistically high.
> This would get more complicated if we go by decoder capacity in bdver3. As
> we have two hardware decoders in Steamroller (bdver3), they have a capacity
> to decode eight instructions per clock cycle, providing up to twice the
> decode and dispatch bandwidth compared to bdver1.
> If we model this in GCC we need to change the issue rate to 8. If 6 is high,
> then 8 would add more joy and excitement.
>
> TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in
> a different way.

> > We also should enable ia32_multipass_dfa_lookahead - with that the
> > scheduler should be able to put double decoded and vector decoded insns on
> > the proper places.
> Yes. Whenever we have this scheduler analysis in place we discuss this, but
> unfortunately it is left as it is.
> I will look into this after I do the enablement for bdver4.

I benchmarked this with my bdver1/2 changes (I don't have bdver3 for
benchmarking) and it does not seem to make much of a difference at this
stage. It may however simply be because the model is broken. Also SPEC is not
the best benchmark for scheduling, since it has many other bounds.

> > I will work on replacing most of the CPU cases into tuning flags + costs.
> I am planning to get the bdver4 enablement in place once the scheduler
> descriptions for bdver3 are done with.
> I will have cycles to look into the cost models. Please delegate some tasks
> if you can and I am willing to take them up. Thanks.

What I am working on now is deciding what the scheduling of generic should
be. For that I need to be sure that both Bulldozer and Core scheduling work
well, so I can compare their benefits on both CPUs. So getting through these
issues is very useful.
Honza

> Regards
> Ganesh
>
> -----Original Message-----
> From: Jan Hubicka [mailto:hubi...@ucw.cz]
> Sent: Tuesday, October 08, 2013 3:20 PM
> To: Gopalasubramanian, Ganesh
> Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.to...@gmail.com
> Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern
> x86 chips
>
> > Hi Honza,
> >
> > I am planning to update the scheduler descriptions for bdver3 first.
> > Attached is the patch. Please let me know your comments if any.
> >
> > Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines
> > and decoding schemes are different. So, let me know how I can approach
> > merging these.
>
> Yep, I think we need to merge only those automata that are the same for
> both:
> (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
> probably can become
> (define_automaton "bdver3,bdver3_fp")
> with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu
> changed to the bdver1 automaton. I think it should result in a smaller
> binary - the fact that all conditionals are physically duplicated in
> bdver1/bdver3.md should be optimized away by genautomata.
>
> I also played a bit with the decoders and I am attaching my version - that
> seems SPEC neutral though.
> Your version has the problem that it does not model the fact that the two
> decoders work sequentially.
>
> I removed the bdver1-decodev unit and instead I simply reserve all three
> decoders + I added a presence set requiring the second decoder to be taken
> only after the first one, and changed the presence set requiring decoder 2
> to be taken only after decoders 1+2 to a final presence set, so the decodev
> reservation has a chance to pass.
> Finally I added use-decodera, which consumes all of the first decoder as
> soon as we start to allocate the second decoder - we cannot really allocate
> them in interchanging order.
>
> I also noticed that push/pop instructions are modeled as being vector, while
> the manual says one micro op is generated unless a memory operand is used.
>
> I did not have much time to play further with this except for manual
> inspection of schedules, which seem better now and in rare cases handle 4-5
> instructions per cycle.
>
> We also should enable ia32_multipass_dfa_lookahead - with that the scheduler
> should be able to put double decoded and vector decoded insns on the proper
> places.
>
> We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more
> realistic estimates on what still can be issued - the value of 6 is
> unrealistically high.
>
> Seems like with the addition of Atom the scheduler macros became a very
> twisty maze of passages.
> I will work on replacing most of the CPU cases into tuning flags + costs.
>
> What do you think?
> Honza
>
> Index: bdver1.md
> ===================================================================
> --- bdver1.md	(revision 203204)
> +++ bdver1.md	(working copy)
> @@ -41,7 +41,9 @@
>  (define_cpu_unit "bdver1-decode0" "bdver1")
>  (define_cpu_unit "bdver1-decode1" "bdver1")
>  (define_cpu_unit "bdver1-decode2" "bdver1")
> -(define_cpu_unit "bdver1-decodev" "bdver1")
> +(define_cpu_unit "bdver1-decode0b" "bdver1")
> +(define_cpu_unit "bdver1-decode1b" "bdver1")
> +(define_cpu_unit "bdver1-decode2b" "bdver1")
>
>  ;; Model the fact that double decoded instruction may take 2 cycles
>  ;; to decode when decoder2 and decoder0 in next cycle
> @@ -57,18 +59,26 @@
>  ;; too. Vector decoded instructions then can't be issued when modeled
>  ;; as consuming decoder0+decoder1+decoder2.
>  ;; We solve that by specialized vector decoder unit and exclusion set.
> -(presence_set "bdver1-decode2" "bdver1-decode0")
> -(exclusion_set "bdver1-decodev" "bdver1-decode0,bdver1-decode1,bdver1-decode2")
> -
> -(define_reservation "bdver1-vector" "nothing,bdver1-decodev")
> -(define_reservation "bdver1-direct1" "nothing,bdver1-decode1")
> +(final_presence_set "bdver1-decode2" "bdver1-decode0,bdver1-decode1")
> +(presence_set "bdver1-decode0b,bdver1-decode1b,bdver1-decode2b" "bdver1-decode0,bdver1-decode1")
> +(final_presence_set "bdver1-decode2b" "bdver1-decode0b,bdver1-decode1b")
> +
> +(define_reservation "use-decodera" "((bdver1-decode0 | nothing)
> +				     + (bdver1-decode1 | nothing)
> +				     + (bdver1-decode2 | nothing))")
> +(define_reservation "bdver1-vector" "nothing,((bdver1-decode0+bdver1-decode1+bdver1-decode2)
> +					      |(use-decodera+bdver1-decode0b+bdver1-decode1b+bdver1-decode2b))")
> +(define_reservation "bdver1-direct1" "nothing,(bdver1-decode1|(use-decodera+bdver1-decode1b))")
>  (define_reservation "bdver1-direct" "nothing,
>  				     (bdver1-decode0 | bdver1-decode1
> -				      | bdver1-decode2)")
> +				      | bdver1-decode2 | (use-decodera+bdver1-decode0b)
> +				      | (use-decodera+bdver1-decode1b) | (use-decodera+bdver1-decode2b))")
>  ;; Double instructions behaves like two direct instructions.
>  (define_reservation "bdver1-double" "((bdver1-decode2,bdver1-decode0)
>  				     | (nothing,(bdver1-decode0 + bdver1-decode1))
> -				     | (nothing,(bdver1-decode1 + bdver1-decode2)))")
> +				     | (nothing,(bdver1-decode1 + bdver1-decode2))
> +				     | (nothing,(use-decodera + bdver1-decode0b + bdver1-decode1b))
> +				     | (nothing,(use-decodera + bdver1-decode1b + bdver1-decode2b)))")
>
>  (define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
> @@ -131,17 +141,28 @@
>  	      (eq_attr "type" "call,callv"))
>  	 "bdver1-double,bdver1-agu")
>  ;; PUSH mem is double path.
> +(define_insn_reservation "bdver1_pushmem" 1
> +			(and (eq_attr "cpu" "bdver1,bdver2")
> +			     (and (eq_attr "type" "push")
> +				  (eq_attr "memory" "both")))
> +			"bdver1-direct,bdver1-load,bdver1-store")
>  (define_insn_reservation "bdver1_push" 1
>  			(and (eq_attr "cpu" "bdver1,bdver2")
>  			     (eq_attr "type" "push"))
> -			"bdver1-direct,bdver1-agu,bdver1-store")
> +			"bdver1-direct,bdver1-store")
>  ;; POP r16/mem are double path.
> +;; 16bit pops are not really used by GCC.
> +(define_insn_reservation "bdver1_popmem" 1
> +			(and (eq_attr "cpu" "bdver1,bdver2")
> +			     (and (eq_attr "type" "pop")
> +				  (eq_attr "memory" "both")))
> +			"bdver1-direct,bdver1-load,bdver1-store")
>  (define_insn_reservation "bdver1_pop" 1
>  			(and (eq_attr "cpu" "bdver1,bdver2")
>  			     (eq_attr "type" "pop"))
> -			"bdver1-direct,bdver1-ivector")
> -;; LEAVE no latency info so far, assume same with amdfam10.
> -(define_insn_reservation "bdver1_leave" 3
> +			"bdver1-direct,bdver1-load")
> +;; By Agner Fog, latency is 4.
> +(define_insn_reservation "bdver1_leave" 4
>  			(and (eq_attr "cpu" "bdver1,bdver2")
>  			     (eq_attr "type" "leave"))
>  			"bdver1-vector,bdver1-ivector")
> Index: i386.c
> ===================================================================
> --- i386.c	(revision 203204)
> +++ i386.c	(working copy)
> @@ -24427,11 +25019,13 @@ ix86_issue_rate (void)
>      case PROCESSOR_K8:
>      case PROCESSOR_AMDFAM10:
>      case PROCESSOR_GENERIC:
> -    case PROCESSOR_BDVER1:
> -    case PROCESSOR_BDVER2:
> -    case PROCESSOR_BDVER3:
>      case PROCESSOR_BTVER1:
>        return 3;
> +    case PROCESSOR_BDVER3:
> +      return 4;
> +    case PROCESSOR_BDVER1:
> +    case PROCESSOR_BDVER2:
> +      return 6;
>
>      case PROCESSOR_CORE2:
>      case PROCESSOR_COREI7:
> @@ -24697,11 +25291,27 @@ ix86_adjust_cost (rtx insn, rtx link, rt
>      case PROCESSOR_GENERIC:
>        memory = get_attr_memory (insn);
>
> -      /* Stack engine allows to execute push&pop instructions in parall.  */
> +      /* Stack engine allows to execute push&pop instructions in parallel.
> +	 ??? There seems to be no detailed documentation of AMDFAM10, perhaps
> +	 it is actually equivalent to the stronger notion of engine below.  */
>        if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
>  	   && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
> -	  && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8))
> +	  && (ix86_tune == PROCESSOR_AMDFAM10))
> +	return 0;
> +
> +      /* For Bulldozer up, the stack engine makes the value of the stack
> +	 pointer available immediately, no matter the use.  (I.e. when ESP is
> +	 used as a pointer or for arithmetic, the cost is bypassed, too.)  */
> +      if (ix86_tune >= PROCESSOR_BDVER1
> +	  && dep_insn_type == TYPE_PUSH)
>  	return 0;
> +      if (ix86_tune >= PROCESSOR_BDVER1
> +	  && dep_insn_type == TYPE_POP)
> +	{
> +	  rtx dest = SET_DEST (PATTERN (dep_insn));
> +	  if (REG_P (dest) && !reg_referenced_p (dest, PATTERN (insn)))
> +	    return 0;
> +	}
>
>        /* Show ability of reorder buffer to hide latency of load by executing
>  	 in parallel with previous instruction in case
>