On Wed, Jul 20, 2016 at 03:07:45PM +0530, Virendra Pathak wrote:
> Hi gcc-patches group,
>
> Please find the patch for adding the basic scheduler for vulcan
> in the aarch64 port.
>
> Tested the patch with compiling cross aarch64-linux-gcc,
> bootstrapped native aarch64-unknown-linux-gnu and
> run gcc regression.
>
> Kindly review and merge the patch to trunk, if the patch is okay.
>
> There are few TODO in this patch which we have planned to
> submit in the next submission e.g. crc and crypto
> instructions, further improving integer & fp load/store
> based on addressing mode of the address.
Hi Virendra,

Thanks for the patch. I have some concerns about the size of the
automata that this scheduler description generates.

As you can see (use (automata_option "stats") in the description to
enable statistics), it generates an automaton for Vulcan roughly 10x
larger than the second largest description we have for AArch64
(cortex_a53_advsimd):

Automaton `cortex_a53_advsimd'
    9072 NDFA states,  49572 NDFA arcs
    9072 DFA states,  49572 DFA arcs
    4050 minimal DFA states,  23679 minimal DFA arcs
    368 all insns  11 insn equivalence classes
    0 locked states
    28759 transition comb vector els, 44550 trans table els: use simple vect
    44550 min delay table els, compression factor 2

Automaton `vulcan'
    103223 NDFA states,  651918 NDFA arcs
    103223 DFA states,  651918 DFA arcs
    45857 minimal DFA states,  352255 minimal DFA arcs
    368 all insns  28 insn equivalence classes
    0 locked states
    429671 transition comb vector els, 1283996 trans table els: use comb vect
    1283996 min delay table els, compression factor 2

Such a large automaton increases compiler build time and memory
consumption, often for little scheduling benefit. Normally an
automaton this large comes from using a large repeat expression (*),
as in your modelling of divisions:

> +(define_insn_reservation "vulcan_div" 13
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "sdiv,udiv"))
> +  "vulcan_i1*13")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivs,fsqrts"))
> +  "vulcan_f0*8|vulcan_f1*8")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivd,fsqrtd"))
> +  "vulcan_f0*12|vulcan_f1*12")

In other pipeline models we keep these repeat counts low to avoid the
large state-space growth they cause. For example, the Cortex-A57
pipeline model describes its divisions as:

(define_insn_reservation "cortex_a57_fp_divd" 16
  (and (eq_attr "tune" "cortexa57")
       (eq_attr "type" "fdivd, fsqrtd, neon_fp_div_d, neon_fp_sqrt_d"))
  "ca57_cx2_block*3")

The lower accuracy is acceptable because of the nature of the
scheduling model. For a machine with an issue rate of 4, like Vulcan,
the compiler tries to find four instructions to schedule in each cycle
it models before it advances the state of the automaton. If an
instruction is modelled as blocking the vulcan_i1 unit for 13 cycles,
the scheduler would have to find up to 52 (4 * 13) other instructions
to issue before the next instruction needing vulcan_i1. Because
scheduling works within basic blocks, the chance of finding that many
independent instructions is extremely low, so you would never see the
benefit of the 13-cycle block.

I tried lowering the repeat expressions like so:

> +(define_insn_reservation "vulcan_div" 13
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "sdiv,udiv"))
> +  "vulcan_i1*3")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_s" 16
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivs,fsqrts"))
> +  "vulcan_f0*3|vulcan_f1*3")
> +
> +(define_insn_reservation "vulcan_fp_divsqrt_d" 23
> +  (and (eq_attr "tune" "vulcan")
> +       (eq_attr "type" "fdivd,fsqrtd"))
> +  "vulcan_f0*5|vulcan_f1*5")

This more than halves the size of the generated automaton:

Automaton `vulcan'
    45370 NDFA states,  319261 NDFA arcs
    45370 DFA states,  319261 DFA arcs
    20150 minimal DFA states,  170824 minimal DFA arcs
    368 all insns  28 insn equivalence classes
    0 locked states
    215565 transition comb vector els, 564200 trans table els: use comb vect
    564200 min delay table els, compression factor 2

The other technique we use to reduce the size of the generated
automaton is to split the AdvSIMD/FP model off from the main pipeline
description (the thunderx_main, thunderx_mult, thunderx_divide, and
thunderx_simd models take this approach even further and achieve very
small automata as a result). A change like wiring the vulcan_f0 and
vulcan_f1 units to be cpu_units of a new define_automaton
"vulcan_simd" would cut the size of the automaton by half again:

Automaton `vulcan'
    8520 NDFA states,  52754 NDFA arcs
    8520 DFA states,  52754 DFA arcs
    2414 minimal DFA states,  19882 minimal DFA arcs
    368 all insns  19 insn equivalence classes
    0 locked states
    21062 transition comb vector els, 45866 trans table els: use simple vect
    45866 min delay table els, compression factor 2

Automaton `vulcan_simd'
    12231 NDFA states,  85833 NDFA arcs
    12231 DFA states,  85833 DFA arcs
    9246 minimal DFA states,  66554 minimal DFA arcs
    368 all insns  11 insn equivalence classes
    0 locked states
    84074 transition comb vector els, 101706 trans table els: use simple vect
    101706 min delay table els, compression factor 2

Finally, simplifying some of the remaining large expressions
(vulcan_asimd_load*_mult, vulcan_asimd_load*_elts) can bring the size
down by half again, making it much more in line with the sizes of the
other AArch64 automata. Rough sketches of the last two ideas follow.
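For the split, something like the following would be the shape of the
change. This is only a minimal sketch; it assumes vulcan_f0 and
vulcan_f1 are currently declared as cpu_units of the main "vulcan"
automaton, and the surrounding declarations in your patch may well
differ:

;; Build the FP/AdvSIMD pipes as their own automaton, so genautomata
;; constructs two smaller state machines rather than one large
;; product machine.
(define_automaton "vulcan_simd")

;; The FP/AdvSIMD issue pipes now live in the new automaton.
(define_cpu_unit "vulcan_f0" "vulcan_simd")
(define_cpu_unit "vulcan_f1" "vulcan_simd")

Reservations that mix integer and FP/AdvSIMD units keep working: each
automaton tracks only the units it owns, and the scheduler advances
all automata together.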
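For the vulcan_asimd_load*_mult and vulcan_asimd_load*_elts
reservations, the repeat-lowering trick above applies in the same way.
Rather than quoting your patch, here is a purely illustrative example
of the kind of rewrite I mean (the name, latency, type attribute, and
repeat counts are all invented):

;; Hypothetical "before": an FP pipe held for most of the latency,
;; e.g. "vulcan_f0*6|vulcan_f1*6".
;; Cheaper approximation: block a pipe for only a couple of cycles.
(define_insn_reservation "vulcan_asimd_load4_mult" 8
  (and (eq_attr "tune" "vulcan")
       (eq_attr "type" "neon_load4_4reg"))
  "(vulcan_f0|vulcan_f1)*2")

As with the divisions, the accuracy given up is unlikely to be visible
within a basic block, while the reduction in the state space is
substantial.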
Would you mind taking a look at some of these techniques to see
whether you can reduce the size of the generated automata without
causing too much trouble for code generation on Vulcan? Ideally we
want to keep the size of all the models at a reasonable level, to
avoid bugs like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70473 .

Thanks,
James

> Virendra Pathak <virendra.pat...@broadcom.com>
> Julian Brown <jul...@codesourcery.com>
>
> * config/aarch64/aarch64-cores.def: Change the scheduler
> to vulcan.
> * config/aarch64/aarch64.md: Include vulcan.md.
> * config/aarch64/vulcan.md: New file.