On Wed, Sep 09, 2015 at 12:13:10PM +0100, Morten Rasmussen wrote:
> On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> > Sadly that makes the code worse; I get 14 mul instructions where
> > previously I had 11.
> >
> > What happens is that GCC gets confused and cannot constant propagate the
> > new variables, so what used to be shifts now end up being actual
> > multiplications.
> >
> > With this, I get back to 11. Can you see what happens on ARM where you
> > have both functions defined to non constants?
>
> We repeated the experiment on arm and arm64 but still with functions
> defined to constant to compare with your results. The mul instruction
> count seems to be somewhat compiler version dependent, but consistently
> show no effect of the patch:
>
> arm     before  after
> gcc4.9  12      12
> gcc4.8  10      10
>
> arm64   before  after
> gcc4.9  11      11
>
> I will get numbers with the arch-functions implemented as well and do
> hackbench runs to see what happens in terms of performance.
I have done some runs with the proposed fixes added:

1. PeterZ's util_sum shift fix (change util_sum).
2. Morten's scaling of weight instead of time (reduce bit loss).
3. PeterZ's unconditional calls to arch*() functions (compiler opt).

To be clear: 2 includes 1, and 3 includes 1 and 2.

Runs were done with the default (#define) implementation of the
arch-functions and with the arch-specific implementation for ARM.

I realized that just looking for 'mul' instructions in
update_blocked_averages() is probably not a fair comparison on ARM, as
it turned out that it has quite a few multiply-accumulate instructions.
So I have included the total count including those too.

Test platforms:

ARM TC2 (A7x3 only)
perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 200
#mul: grep -e mul (in update_blocked_averages())
#mul_all: grep -e mul -e mla -e mls -e mia (in update_blocked_averages())
gcc: 4.9.3

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 15000
#mul: grep -e mul (in update_blocked_averages())
gcc: 4.9.2

Results:

perf numbers are the average of three (x10) runs. Raw data is available
further down.

ARM TC2             #mul            #mul_all        perf bench
arch*()             default arm     default arm     default arm

1 shift_fix         10      16      22      36      13.401  13.288
2 scaled_weight     12      14      30      32      13.282  13.238
3 unconditional     12      14      26      32      13.296  13.427

Intel E5-2690       #mul            #mul_all        perf bench
arch*()             default         default         default

1 shift_fix         13                              14.786
2 scaled_weight     18                              15.078
3 unconditional     14                              15.195

Overall it appears that fewer 'mul' instructions don't necessarily mean
a better perf bench score. For ARM, 2 seems the best choice overall,
while 1 is better for Intel. If we want to try to avoid the bit loss by
scaling weight instead of time, 2 is best for both. However, all that
said, looking at the raw numbers there is a significant difference
between runs of perf --repeat, so we can't really draw any strong
conclusions. It all appears to be in the noise.
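As a rough cross-check of the averages and of the "in the noise" remark,
here is a minimal Python sketch. It uses only the elapsed times listed
under the raw numbers further down; the dictionary layout and the helper
name summarize() are made up for illustration. For each configuration it
prints the mean of the three runs and the max-min range between them, so
the range can be compared with the gaps between variants.

```python
from statistics import mean

# Elapsed times in seconds, copied from the raw numbers (three runs of
# 'perf stat --repeat 10' per configuration).
RUNS = {
    ("ARM TC2", "shift_fix", "default"): [13.384416727, 13.431014702, 13.387434890],
    ("ARM TC2", "shift_fix", "arm"): [13.271044081, 13.310189123, 13.283594740],
    ("ARM TC2", "scaled_weight", "default"): [13.295649553, 13.271634654, 13.280081329],
    ("ARM TC2", "scaled_weight", "arm"): [13.230659223, 13.222276527, 13.260275081],
    ("ARM TC2", "unconditional", "default"): [13.274904460, 13.307853511, 13.304084844],
    ("ARM TC2", "unconditional", "arm"): [13.432878577, 13.417950552, 13.431682719],
    ("Intel", "shift_fix", "default"): [14.905815416, 14.811113694, 14.639739309],
    ("Intel", "scaled_weight", "default"): [15.113275474, 15.056777680, 15.064074416],
    ("Intel", "unconditional", "default"): [15.105152500, 15.346405473, 15.132933523],
}


def summarize(times):
    """Return (mean, max - min) for one configuration's run times."""
    return mean(times), max(times) - min(times)


if __name__ == "__main__":
    for (platform, variant, arch), times in RUNS.items():
        avg, spread = summarize(times)
        print(f"{platform:8s} {variant:14s} {arch:8s} {avg:7.3f} {spread:6.3f}")
```

For instance, on the Intel box the three shift_fix runs span about
0.27 s, which is of the same order as the roughly 0.29 s gap between the
shift_fix and scaled_weight averages.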
I suggest that I spin a v2 of this series and go with scaled_weight to
reduce bit loss. Any objections?

While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
mess?

Raw numbers:

ARM TC2

shift_fix       default_arch
gcc4.9.3
#mul 10
#mul+mla+mls+mia 22
13.384416727 seconds time elapsed ( +- 0.17% )
13.431014702 seconds time elapsed ( +- 0.18% )
13.387434890 seconds time elapsed ( +- 0.15% )

shift_fix       arm_arch
gcc4.9.3
#mul 16
#mul+mla+mls+mia 36
13.271044081 seconds time elapsed ( +- 0.11% )
13.310189123 seconds time elapsed ( +- 0.19% )
13.283594740 seconds time elapsed ( +- 0.12% )

scaled_weight   default_arch
gcc4.9.3
#mul 12
#mul+mla+mls+mia 30
13.295649553 seconds time elapsed ( +- 0.20% )
13.271634654 seconds time elapsed ( +- 0.19% )
13.280081329 seconds time elapsed ( +- 0.14% )

scaled_weight   arm_arch
gcc4.9.3
#mul 14
#mul+mla+mls+mia 32
13.230659223 seconds time elapsed ( +- 0.15% )
13.222276527 seconds time elapsed ( +- 0.15% )
13.260275081 seconds time elapsed ( +- 0.21% )

unconditional   default_arch
gcc4.9.3
#mul 12
#mul+mla+mls+mia 26
13.274904460 seconds time elapsed ( +- 0.13% )
13.307853511 seconds time elapsed ( +- 0.15% )
13.304084844 seconds time elapsed ( +- 0.22% )

unconditional   arm_arch
gcc4.9.3
#mul 14
#mul+mla+mls+mia 32
13.432878577 seconds time elapsed ( +- 0.13% )
13.417950552 seconds time elapsed ( +- 0.12% )
13.431682719 seconds time elapsed ( +- 0.18% )

Intel

shift_fix       default_arch
gcc4.9.2
#mul 13
14.905815416 seconds time elapsed ( +- 0.61% )
14.811113694 seconds time elapsed ( +- 0.84% )
14.639739309 seconds time elapsed ( +- 0.76% )

scaled_weight   default_arch
gcc4.9.2
#mul 18
15.113275474 seconds time elapsed ( +- 0.64% )
15.056777680 seconds time elapsed ( +- 0.44% )
15.064074416 seconds time elapsed ( +- 0.71% )

unconditional   default_arch
gcc4.9.2
#mul 14
15.105152500 seconds time elapsed ( +- 0.71% )
15.346405473 seconds time elapsed ( +- 0.81% )
15.132933523 seconds time elapsed ( +- 0.82% )
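For anyone wanting to reproduce the #mul and #mul_all counts, the grep
step can be sketched in Python roughly as below. This is only an
illustration: it assumes the disassembly of update_blocked_averages()
has already been extracted into a list of text lines, the helper name
count_matching() is made up, and the sample lines are hypothetical
rather than taken from a real build. Like grep -e, it counts a line once
if any of the patterns occurs anywhere in it.

```python
# Patterns corresponding to the two grep invocations used for the counts.
MUL = ("mul",)
MUL_ALL = ("mul", "mla", "mls", "mia")


def count_matching(lines, patterns):
    """Count lines containing at least one pattern as a substring,
    similar to 'grep -e p1 -e p2 ... | wc -l'."""
    return sum(1 for line in lines if any(p in line for p in patterns))


# Hypothetical disassembly fragment, for illustration only.
SAMPLE = [
    "mul   r3, r2, r1",
    "mla   r0, r3, r4, r0",
    "umull r2, r3, r5, r6",
    "add   r1, r1, #1",
    "mls   r7, r8, r9, r7",
]

if __name__ == "__main__":
    print("#mul:    ", count_matching(SAMPLE, MUL))
    print("#mul_all:", count_matching(SAMPLE, MUL_ALL))
```

Note that this is plain substring matching, so long multiplies such as
umull are counted under 'mul', and a symbol name containing one of the
patterns would be counted as well.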