Hi,
> Is this with -fschedule-insns? Or only with default settings? Did you test
> the compile time implications of increasing the lookahead? (value of 8 is
> very large, we may consider enabling it only for -Ofast, limiting for
> postreload only or something similar).
The improvement is seen with the options "-fschedule-insns -fschedule-insns2
-fsched-pressure".
Below are the build times of some of the SPEC benchmarks:

Benchmark     dfa8     no_lookahead
perlbench     196s     193s
bzip2         19s      19s
gcc           439s     429s
mcf           3s       3s
gobmk         119s     115s
hmmer         62s      60s
sjeng         18s      17s
libquantum    6s       6s
h264ref       110s     107s
omnetpp       132s     128s
astar         7s       7s
bwaves        4s       5s
gamess        1996s    1957s
milc          18s      18s
GemsFDTD      276s     272s
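
For reference, the overall compile-time cost can be derived from the numbers above. The following is a minimal sketch in C that sums both columns and computes the relative increase. The struct and function names are hypothetical and exist only for this illustration; the data is copied from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Build times in seconds, copied from the table above.
   All names in this block are hypothetical, for illustration only.  */
struct build_time
{
  const char *name;
  int dfa8;          /* built with the "dfa8" configuration */
  int no_lookahead;  /* built with the "no_lookahead" configuration */
};

static const struct build_time spec_times[] = {
  {"perlbench", 196, 193}, {"bzip2", 19, 19},     {"gcc", 439, 429},
  {"mcf", 3, 3},           {"gobmk", 119, 115},   {"hmmer", 62, 60},
  {"sjeng", 18, 17},       {"libquantum", 6, 6},  {"h264ref", 110, 107},
  {"omnetpp", 132, 128},   {"astar", 7, 7},       {"bwaves", 4, 5},
  {"gamess", 1996, 1957},  {"milc", 18, 18},      {"GemsFDTD", 276, 272},
};

#define N_SPEC (sizeof spec_times / sizeof spec_times[0])

/* Sum one column of the table: dfa8 if USE_DFA8 is nonzero,
   no_lookahead otherwise.  */
static int
total_seconds (int use_dfa8)
{
  int sum = 0;
  for (size_t i = 0; i < N_SPEC; i++)
    sum += use_dfa8 ? spec_times[i].dfa8 : spec_times[i].no_lookahead;
  return sum;
}

/* Relative increase of WITH over WITHOUT, in percent.  */
static double
overhead_percent (int with, int without)
{
  assert (without > 0);
  return 100.0 * (with - without) / without;
}
```

Calling overhead_percent (total_seconds (1), total_seconds (0)) gives the summed build-time increase over these fifteen benchmarks, which comes to roughly 2%.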
Since the build-time increase is small (about 2% summed over these benchmarks),
I think we can enable it by default rather than only for -Ofast.
Please let me know your inputs.
Regards
Ganesh
-----Original Message-----
From: Jan Hubicka [mailto:[email protected]]
Sent: Thursday, October 24, 2013 2:54 PM
To: Gopalasubramanian, Ganesh
Cc: [email protected]; Uros Bizjak ([email protected]); [email protected];
H.J. Lu ([email protected])
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86
chips
> Attached is the patch which does the following scheduler related changes.
> * re-models bdver3 decoder.
> * It enables lookahead with value 8 for all BD architectures. The patch
> doesn't consider if reloading is completed or not (an area that needs to be
> worked on).
> * The issue rate for BD architectures is set to 4.
>
> I see the following performance improvements on bdver3 machine.
> * GemsFDTD improves by 6-7% with lookahead value changed to 8.
> * Hmmer improves by 9% when the issue rate is set to 4.
Is this with -fschedule-insns? Or only with default settings? Did you test the
compile time implications of increasing the lookahead? (value of 8 is very
large, we may consider enabling it only for -Ofast, limiting for postreload only
or something similar).
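
To make the "postreload only" option concrete, here is a minimal sketch in C of a lookahead hook that returns 8 for BD CPUs only once reload has completed. It is not the code from the patch: the enum and the sketch_* names are hypothetical, and reload_completed is passed in as a parameter to keep the example self-contained (in GCC it is a global flag). Returning 0 for every other CPU is a simplification made for this sketch.

```c
#include <assert.h>

/* Hypothetical stand-in for GCC's processor enumeration.  */
enum sketch_cpu
{
  SKETCH_CPU_GENERIC,
  SKETCH_CPU_BDVER1,
  SKETCH_CPU_BDVER2,
  SKETCH_CPU_BDVER3
};

/* Nonzero if CPU is one of the BD (bdver*) architectures.  */
static int
sketch_is_bd (enum sketch_cpu cpu)
{
  return cpu == SKETCH_CPU_BDVER1
	 || cpu == SKETCH_CPU_BDVER2
	 || cpu == SKETCH_CPU_BDVER3;
}

/* Lookahead depth for first-cycle multipass scheduling; 0 disables it.
   Assumption of this sketch: the large value is used only in the second
   (post-reload) scheduling pass, and non-BD CPUs get no lookahead.  */
static int
sketch_dfa_lookahead (enum sketch_cpu cpu, int reload_completed)
{
  assert (reload_completed == 0 || reload_completed == 1);

  if (!sketch_is_bd (cpu))
    return 0;

  return reload_completed ? 8 : 0;
}
```

With this shape the first scheduling pass stays cheap, and the deeper search is paid for only after register allocation.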
>
> I have considered the following hardware details for the model.
> * There are four decoders inside a hardware decoder block.
> * These four independent decoders can execute in parallel. (They can take 8B
> from four different instructions and decode).
> * These four decoders are pipelined 4 cycles deep and are non-stalling.
> * Each decoder takes 8B of instruction data every cycle and tries decoding
> it.
> * Issue rate is 4.
What is the overall limitation on number of bytes the instructions can occupy?
I think they need to fit into two 16-byte windows, right?
In that case we may want to tweak the existing corei7 scheduling code to take
care of this. Making the scheduler not overly optimistic about the parallelism
is good since it will create less register pressure during the first pass.
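
As a rough illustration of such a byte limit, the sketch below counts how many consecutive instructions could issue in one cycle under an issue rate of 4 and a budget of two 16-byte windows. The 32-byte budget is only the guess from the question above, not a confirmed hardware fact, and the model ignores window alignment. All names are hypothetical; this is not the existing corei7 code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ISSUE_RATE   4   /* issue rate proposed for BD */
#define SKETCH_WINDOW_BYTES 16  /* assumed fetch window size */
#define SKETCH_WINDOWS      2   /* assumed number of windows per cycle */

/* Given the byte lengths LEN[0..N-1] of instructions in program order,
   return how many of the leading instructions fit in one cycle under
   both the issue-rate limit and the assumed byte budget.  */
static size_t
sketch_insns_per_cycle (const unsigned char *len, size_t n)
{
  const size_t budget = SKETCH_WINDOW_BYTES * SKETCH_WINDOWS;
  size_t used = 0;
  size_t count = 0;

  while (count < n && count < SKETCH_ISSUE_RATE)
    {
      /* An x86 instruction is between 1 and 15 bytes long.  */
      assert (len[count] >= 1 && len[count] <= 15);

      if (used + len[count] > budget)
	break;

      used += len[count];
      count++;
    }

  return count;
}
```

Short instructions are capped by the issue rate, while a run of long instructions is capped by the byte budget first, which is the case where the scheduler would otherwise be too optimistic.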
>
> Is it OK for upstream?
Otherwise the patch seems OK, but I would like to know the compile time effect
first.
Honza
>
> Changelog
> ========
> 2013-10-24 Ganesh Gopalasubramanian
> <[email protected]>
>
> * config/i386/bdver3.md : Added two additional decoder units
> to support issue rate of 4 and remodeled vector unit.
>
> * config/i386/i386.c (ix86_issue_rate): Issue rate for BD
> architectures is set to 4.
>
> * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
> lookahead is set to 8 for BD architectures.
>
> Regards
> Ganesh
>