https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64036
--- Comment #4 from Oleg Endo <olegendo at gcc dot gnu.org> ---
(In reply to Oleg Endo from comment #2)
> An example function, compiling with -O2 -m4:
>
> int test_0 (unsigned short* x, int y, int z)
> {
>   return
>     (x[0] + x[1] + x[2] + x[3] + x[4] + x[5] + x[6]
>      + x[7] + x[8] + x[9] + x[10]) ? y : z;
> }
>
> Without sched1, there are lots of dependencies on the results of memory
> loads.
>
> With sched1, there is generally more code generated and variable live ranges
> are longer. The above function will use r8 and r9, which is not really
> necessary. Memory load dependencies are reduced and more LS/EX/MT
> instructions can be executed in parallel. Code size for the test function
> increases from ~37 insns to ~50 insns. Approximated cycles on SH4 pipeline
> should be ~37 cycles without sched1 and ~33 cycles with sched1. On SH4A the
> latency of a load is 1 cycle, so without sched1 it should be ~28 cycles.

BTW, with AMS there is no difference of sched1 or no-sched1 in code size,
because it uses post-inc loads. With AMS + sched1 the example above compiles
to 38 insns and there are no dependencies/stalls on the mem loads anymore. The
r8,r9 usage issue is also gone.
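
For illustration only, here is a rough C-level picture of what "post-inc loads" means for the example from comment #2. This is a hand-written sketch, not output of the AMS pass (which works on the compiler's internal representation, not on source code), and the name test_0_postinc is made up for this example. Each load advances the pointer by one element, which is the access pattern that maps onto the SH post-increment addressing mode (mov.w @Rm+,Rn). Whether a given compiler version actually emits post-inc loads for it is not something this sketch shows.

```c
/* Original form from comment #2: every element is read through a
   separate displacement off the same base pointer.  */
int test_0 (unsigned short* x, int y, int z)
{
  return
    (x[0] + x[1] + x[2] + x[3] + x[4] + x[5] + x[6]
     + x[7] + x[8] + x[9] + x[10]) ? y : z;
}

/* Hypothetical equivalent written with explicit post-increment:
   each iteration loads one element and then bumps the pointer.
   The sum of 11 unsigned shorts always fits in an int, so the
   result is the same as test_0.  */
int test_0_postinc (unsigned short* x, int y, int z)
{
  int sum = 0;
  int i;

  for (i = 0; i < 11; ++i)
    sum += *x++;

  return sum ? y : z;
}
```

Both functions return y if any of the 11 elements is non-zero and z otherwise, so they can be swapped when comparing the generated code with and without -fschedule-insns.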