> OK, so it is about 2%. Did you try if you need lookahead even in the early
> pass (before reload)? My guess would be so, but if not, it could cut the
> cost to half. For -Ofast/-O3 it looks reasonable to me, but we will need to
> announce it on the ML. For other settings I think we need to work on more
> improvements or cut the expenses.
Yes, it is required before reload.
I have another idea worth considering. Can we enable lookahead with the value 4
(pre-reload) by default for now? This should cut most of the build-time cost.
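
To make the proposal concrete, here is a rough standalone sketch of the policy. It is not the actual ia32_multipass_dfa_lookahead code from config/i386/i386.c. The enum, the function names, and the explicit reload_done parameter are hypothetical stand-ins, and the values returned after reload and for non-BD targets are placeholders I have assumed only so that the sketch is complete.

```c
#include <stdbool.h>

/* Hypothetical CPU list for illustration only; GCC has its own
   processor enumeration in the i386 backend. */
enum sketch_cpu { SKETCH_GENERIC, SKETCH_BDVER1, SKETCH_BDVER2, SKETCH_BDVER3 };

/* Proposed pre-reload lookahead depth for BD targets (from this mail). */
#define SKETCH_BD_PRE_RELOAD_LOOKAHEAD 4

/* ASSUMPTION: the mail does not say what to use after reload, so this
   placeholder simply disables lookahead there. */
#define SKETCH_BD_POST_RELOAD_LOOKAHEAD 0

/* True for the BD family members modelled in this sketch. */
static bool sketch_is_bd(enum sketch_cpu cpu)
{
  return cpu == SKETCH_BDVER1 || cpu == SKETCH_BDVER2 || cpu == SKETCH_BDVER3;
}

/* Pick a DFA lookahead depth from the target and the reload state.
   Non-BD targets return 0 here purely as a placeholder. */
static int sketch_dfa_lookahead(enum sketch_cpu cpu, bool reload_done)
{
  if (!sketch_is_bd(cpu))
    return 0;
  return reload_done ? SKETCH_BD_POST_RELOAD_LOOKAHEAD
                     : SKETCH_BD_PRE_RELOAD_LOOKAHEAD;
}
```

With these assumptions, sketch_dfa_lookahead (SKETCH_BDVER3, false) gives 4, which is the setting measured below.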
I have done some measurements on the build time of some benchmarks (mentioned
below) with lookahead value 4. The 2% increase in build time with value 8 is
now almost gone.
benchmark     dfa4     no_lookahead
perlbench     191s     193s
bzip2         19s      19s
gcc           429s     429s
mcf           3s       3s
gobmk         116s     115s
hmmer         60s      60s
sjeng         18s      17s
libquantum    6s       6s
h264ref       107s     107s
omnetpp       128s     128s
astar         7s       7s
bwaves        5s       5s
gamess        1964s    1957s
milc          18s      18s
GemsFDTD      273s     272s
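
For anyone who wants to check the overall overhead, the small C sketch below sums the build times from this table and from the dfa8 table quoted further down, then reports the increase over no_lookahead as a percentage. The array and function names are my own, and summing all benchmarks is just one way to aggregate (the total is dominated by gamess).

```c
#include <assert.h>
#include <stddef.h>

/* Build times in seconds, copied from the tables in this thread, in the
   order perlbench, bzip2, gcc, mcf, gobmk, hmmer, sjeng, libquantum,
   h264ref, omnetpp, astar, bwaves, gamess, milc, GemsFDTD. */
static const int base_s[] = {193, 19, 429, 3, 115, 60, 17, 6,
                             107, 128, 7, 5, 1957, 18, 272};
static const int dfa8_s[] = {196, 19, 439, 3, 119, 62, 18, 6,
                             110, 132, 7, 4, 1996, 18, 276};
static const int dfa4_s[] = {191, 19, 429, 3, 116, 60, 18, 6,
                             107, 128, 7, 5, 1964, 18, 273};

enum { N_BENCH = sizeof base_s / sizeof base_s[0] };

/* Sum of n build times. */
static long total_seconds(const int *t, size_t n)
{
  long sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += t[i];
  return sum;
}

/* Percentage increase of the summed build time relative to the
   no_lookahead baseline. */
static double overhead_percent(const int *with_la, const int *base, size_t n)
{
  assert(n > 0);
  long b = total_seconds(base, n);
  long w = total_seconds(with_la, n);
  return 100.0 * (double)(w - b) / (double)b;
}
```

Calling overhead_percent (dfa8_s, base_s, N_BENCH) gives roughly 2.1%, and the same call with dfa4_s gives roughly 0.2% (totals of 3405s and 3344s against 3336s).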
Lookahead value 4 also helps because the modified decoder model in bdver3.md
is only two cycles deep (though in hardware it is actually 4 cycles deep). This
means that we can look another two levels deep for a better schedule.
GemsFDTD still retains the performance boost of around 6-7% with value 4.
Let me know your thoughts.
Regards
Ganesh
-----Original Message-----
From: Jan Hubicka [mailto:[email protected]]
Sent: Thursday, October 24, 2013 6:48 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; [email protected]; Uros Bizjak ([email protected]); H.J.
Lu ([email protected])
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86
chips
> Hi,
>
> > Is this with -fschedule-insns? Or only with default settings? Did you test
> > the compile time implications of increasing the lookahead? (value of 8 is
> > very large, we may consider enabling it only for -Ofast, limiting for
> > postreload only or something similar).
>
> The improvement is seen with the options "-fschedule-insns -fschedule-insns2
> -fsched-pressure"
>
> Below are the build times of some of the SPEC benchmarks
>
> benchmark     dfa8     no_lookahead
>
> perlbench     196s     193s
> bzip2         19s      19s
> gcc           439s     429s
> mcf           3s       3s
> gobmk         119s     115s
> hmmer         62s      60s
> sjeng         18s      17s
> libquantum    6s       6s
> h264ref       110s     107s
> omnetpp       132s     128s
> astar         7s       7s
> bwaves        4s       5s
> gamess        1996s    1957s
> milc          18s      18s
> GemsFDTD      276s     272s
>
> I think we can enable it by default rather than for -Ofast.
> Please let me know your inputs.
OK, so it is about 2%. Did you try if you need lookahead even in the early
pass (before reload)? My guess would be so, but if not, it could cut the cost
to half. For -Ofast/-O3 it looks reasonable to me, but we will need to announce
it on the ML. For other settings I think we need to work on more improvements
or cut the expenses.
Honza
>
> Regards
> Ganesh
>
> -----Original Message-----
> From: Jan Hubicka [mailto:[email protected]]
> Sent: Thursday, October 24, 2013 2:54 PM
> To: Gopalasubramanian, Ganesh
> Cc: [email protected]; Uros Bizjak ([email protected]);
> [email protected]; H.J. Lu ([email protected])
> Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for
> modern x86 chips
>
> > Attached is the patch which does the following scheduler related changes.
> > * re-models bdver3 decoder.
> > * It enables lookahead with value 8 for all BD architectures. The patch
> > doesn't consider if reloading is completed or not (an area that needs to be
> > worked on).
> > * The issue rate for BD architectures is set to 4.
> >
> > I see the following performance improvements on bdver3 machine.
> > * GemsFDTD improves by 6-7% with lookahead value changed to 8.
> > * Hmmer improves by 9% when the issue rate is set to 4.
>
> Is this with -fschedule-insns? Or only with default settings? Did you test
> the compile time implications of increasing the lookahead? (value of 8 is
> very large, we may consider enabling it only for -Ofast, limiting for
> postreload only or something similar).
>
> >
> > I have considered the following hardware details for the model.
> > * There are four decoders inside a hardware decoder block.
> > * These four independent decoders can execute in parallel. (They can take
> > 8B from four different instructions and decode).
> > * These four decoders are pipelined 4 cycles deep and are non-stalling.
> > * Each decoder takes 8B of instruction data every cycle and tries decoding
> > it.
> > * Issue rate is 4.
> What is the overall limitation on number of bytes the instructions can occupy?
> I think they need to fit into 2 16 byte windows, right?
> In that case we may want to tweak the existing corei7 scheduling code to take
> care of this. Making the scheduler not overly optimistic about the parallelism
> is good since it will create less register pressure during the first pass.
> >
> > Is it OK for upstream?
>
> Otherwise the patch seems OK, but I would like to know the compile time
> effect first.
>
> Honza
> >
> > Changelog
> > ========
> > 2013-10-24 Ganesh Gopalasubramanian
> > <[email protected]>
> >
> > * config/i386/bdver3.md : Added two additional decoder units
> > to support issue rate of 4 and remodeled vector unit.
> >
> > * config/i386/i386.c (ix86_issue_rate): Issue rate for BD
> > architectures is set to 4.
> >
> > * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
> > lookahead is set to 8 for BD architectures.
> >
> > Regards
> > Ganesh
> >
>
>
>