> OK, so it is about 2%. Did you try if you need lookahead even in the early
> pass (before reload)? My guess would be so, but if not, it could cut the
> cost to half. For -Ofast/-O3 it looks reasonable to me, but we will need to
> announce it on the ML. For other settings I think we need to work on more
> improvements or cut the expenses.

Yes, it is required before reload.

I have another idea worth considering. For now, can we enable lookahead with the value 4 (pre-reload) by default? This should remove most of the build-time cost. I have measured the build time of some benchmarks (listed below) with lookahead value 4. The 2% increase in build time seen with value 8 is now almost gone.

                  dfa4     no_lookahead

perlbench    -    191s     193s
bzip2        -    19s      19s
gcc          -    429s     429s
mcf          -    3s       3s
gobmk        -    116s     115s
hmmer        -    60s      60s
sjeng        -    18s      17s
libquantum   -    6s       6s
h264ref      -    107s     107s
omnetpp      -    128s     128s
astar        -    7s       7s
bwaves       -    5s       5s
gamess       -    1964s    1957s
milc         -    18s      18s
GemsFDTD     -    273s     272s

Lookahead value 4 also helps because the modified decoder model in bdver3.md is only two cycles deep (though in hardware it is actually 4 cycles deep). This means that we can look another two levels deep for a better schedule.

GemsFDTD still retains the performance boost of around 6-7% with value 4.

Let me know your thoughts.

Regards
Ganesh

-----Original Message-----
From: Jan Hubicka [mailto:hubi...@ucw.cz]
Sent: Thursday, October 24, 2013 6:48 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com); H.J. Lu (hjl.to...@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

> Hi,
>
> > Is this with -fschedule-insns? Or only with default settings? Did you test
> > the compile time implications of increasing the lookahead? (value of 8 is
> > very large, we may consider enabling it only for -Ofast, limiting for
> > postreload only or something similar).
>
> The improvement is seen with the options "-fschedule-insns -fschedule-insns2
> -fsched-pressure"
>
> Below are the build times of some of the SPEC benchmarks
>
>                   dfa8     no_lookahead
>
> perlbench    -    196s     193s
> bzip2        -    19s      19s
> gcc          -    439s     429s
> mcf          -    3s       3s
> gobmk        -    119s     115s
> hmmer        -    62s      60s
> sjeng        -    18s      17s
> libquantum   -    6s       6s
> h264ref      -    110s     107s
> omnetpp      -    132s     128s
> astar        -    7s       7s
> bwaves       -    4s       5s
> gamess       -    1996s    1957s
> milc         -    18s      18s
> GemsFDTD     -    276s     272s
>
> I think we can enable it by default rather than for -Ofast.
> Please let me know your inputs.

OK, so it is about 2%. Did you try if you need lookahead even in the early
pass (before reload)? My guess would be so, but if not, it could cut the
cost to half. For -Ofast/-O3 it looks reasonable to me, but we will need to
announce it on the ML. For other settings I think we need to work on more
improvements or cut the expenses.

Honza
>
> Regards
> Ganesh
>
> -----Original Message-----
> From: Jan Hubicka [mailto:hubi...@ucw.cz]
> Sent: Thursday, October 24, 2013 2:54 PM
> To: Gopalasubramanian, Ganesh
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubiz...@gmail.com);
> hubi...@ucw.cz; H.J. Lu (hjl.to...@gmail.com)
> Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for
> modern x86 chips
>
> > Attached is the patch which does the following scheduler related changes.
> > * re-models bdver3 decoder.
> > * It enables lookahead with value 8 for all BD architectures. The patch
> > doesn't consider if reloading is completed or not (an area that needs to be
> > worked on).
> > * The issue rate for BD architectures is set to 4.
> >
> > I see the following performance improvements on bdver3 machine.
> > * GemsFDTD improves by 6-7% with lookahead value changed to 8.
> > * Hmmer improves by 9% when the issue rate is set to 4.
>
> Is this with -fschedule-insns? Or only with default settings? Did you test
> the compile time implications of increasing the lookahead? (value of 8 is
> very large, we may consider enabling it only for -Ofast, limiting for
> postreload only or something similar).
>
> >
> > I have considered the following hardware details for the model.
> > * There are four decoders inside a hardware decoder block.
> > * These four independent decoders can execute in parallel. (They can take
> > 8B from four different instructions and decode).
> > * These four decoders are pipelined 4 cycles deep and are non-stalling.
> > * Each decoder takes 8B of instruction data every cycle and tries decoding
> > it.
> > * Issue rate is 4.
> What is the overall limitation on number of bytes the instructions can occupy?
> I think they need to fit into two 16-byte windows, right?
> In that case we may want to tweak the existing corei7 scheduling code to take
> care of this. Making the scheduler not overly optimistic about the parallelism
> is good since it will create less register pressure during the first pass.
> >
> > Is it OK for upstream?
>
> Otherwise the patch seems OK, but I would like to know the compile time
> effect first.
>
> Honza
> >
> > Changelog
> > ========
> > 2013-10-24 Ganesh Gopalasubramanian
> > <ganesh.gopalasubraman...@amd.com>
> >
> > * config/i386/bdver3.md : Added two additional decoder units
> > to support issue rate of 4 and remodeled vector unit.
> >
> > * config/i386/i386.c (ix86_issue_rate): Issue rate for BD
> > architectures is set to 4.
> >
> > * config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
> > lookahead is set to 8 for BD architectures.
> >
> > Regards
> > Ganesh
> >
>
>
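
For anyone who wants to re-check the "about 2%" and "almost gone" figures, here is a minimal sketch that sums the build times from the two tables in this thread and compares each lookahead setting against no_lookahead. The numbers are copied from the tables; the names BUILD_TIMES and overhead_percent are illustrative only, and summing over all listed benchmarks is an assumption about how the overall overhead was estimated.

```python
# Build times in seconds, copied from the dfa8 and dfa4 tables in this thread.
# Each entry maps benchmark -> (dfa8, dfa4, no_lookahead).
BUILD_TIMES = {
    "perlbench":  (196, 191, 193),
    "bzip2":      (19, 19, 19),
    "gcc":        (439, 429, 429),
    "mcf":        (3, 3, 3),
    "gobmk":      (119, 116, 115),
    "hmmer":      (62, 60, 60),
    "sjeng":      (18, 18, 17),
    "libquantum": (6, 6, 6),
    "h264ref":    (110, 107, 107),
    "omnetpp":    (132, 128, 128),
    "astar":      (7, 7, 7),
    "bwaves":     (4, 5, 5),
    "gamess":     (1996, 1964, 1957),
    "milc":       (18, 18, 18),
    "GemsFDTD":   (276, 273, 272),
}

DFA8, DFA4, NO_LOOKAHEAD = 0, 1, 2  # column indexes into each tuple


def overhead_percent(column):
    """Total build-time change of one column relative to no_lookahead, in percent."""
    total = sum(times[column] for times in BUILD_TIMES.values())
    baseline = sum(times[NO_LOOKAHEAD] for times in BUILD_TIMES.values())
    return 100.0 * (total - baseline) / baseline


if __name__ == "__main__":
    print(f"dfa8 vs no_lookahead: {overhead_percent(DFA8):+.2f}%")
    print(f"dfa4 vs no_lookahead: {overhead_percent(DFA4):+.2f}%")
```

Summed this way, value 8 costs roughly 2.1% and value 4 roughly 0.2% over no_lookahead. Note that gamess dominates both totals, so a per-benchmark comparison may be more informative than the sum.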