On Tue, Mar 02, 2021 at 12:57:37PM -0800, Josh Don wrote:
> From: Clement Courbet <cour...@google.com>
> 
> A significant portion of __calc_delta time is spent in the loop
> shifting a u64 by 32 bits. Use `fls` instead of iterating.
> 
> This is ~7x faster on benchmarks.
> 
> The generic `fls` implementation (`generic_fls`) is still ~4x faster
> than the loop.
> Architectures that have a better implementation will make use of it. For
> example, on X86 we get an additional factor 2 in speed without dedicated
> implementation.
> 
> On gcc, the asm versions of `fls` are about the same speed as the
> builtin. On clang, the versions that use fls (fls,fls64) are more than
> twice as slow as the builtin. This is because the way the `fls` function
> is written, clang puts the value in memory:
> https://godbolt.org/z/EfMbYe. This can be fixed in a separate patch.
> 
> ```
> name                                   cpu/op
> BM_Calc<__calc_delta_loop>             9.57ms ±12%
> BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
> BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
> BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
> BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
> BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
> BM_Calc<__calc_delta_builtin>          1.32ms ±11%
> ```
> 
> Signed-off-by: Clement Courbet <cour...@google.com>
> Signed-off-by: Josh Don <josh...@google.com>
> ---
>  kernel/sched/fair.c  | 30 ++++++++++++++++++++++--------
>  kernel/sched/sched.h |  1 +
>  2 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8a8bd7b13634..67e5a1d536ad 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -214,6 +214,16 @@ static void __update_inv_weight(struct load_weight *lw)
>  		lw->inv_weight = WMULT_CONST / w;
>  }
>  
> +/*
> + * An fls that handles an u32 value on architectures
> + * where `sizeof(unsigned int) < 32`.
> + */
> +#if (__SIZEOF_INT__ >= 32)
This should never happen, we use ILP32 or LP64 for Linux.

> +# define FLS_AT_LEAST_32(v) fls(v)
> +#else
> +# define FLS_AT_LEAST_32(v) fls64(v)
> +#endif
> +
>  /*
>   * delta_exec * weight / lw.weight
>   *   OR
> @@ -229,27 +239,31 @@ static void __update_inv_weight(struct load_weight *lw)
>  static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
>  {
>  	u64 fact = scale_load_down(weight);
> +	u32 fact_hi = (u32)(fact >> 32);
>  	int shift = WMULT_SHIFT;
> +	int fs;
>  
>  	__update_inv_weight(lw);
>  
> -	if (unlikely(fact >> 32)) {
> -		while (fact >> 32) {
> -			fact >>= 1;
> -			shift--;
> -		}
> +	if (unlikely(fact_hi)) {
> +		fs = FLS_AT_LEAST_32(fact_hi);

you made fact_hi u32, why can't we unconditionally use fls() ?

> +		shift -= fs;
> +		fact >>= fs;
>  	}
>  
>  	fact = mul_u32_u32(fact, lw->inv_weight);
>  
> -	while (fact >> 32) {
> -		fact >>= 1;
> -		shift--;
> +	fact_hi = (u32)(fact >> 32);
> +	if (fact_hi) {
> +		fs = FLS_AT_LEAST_32(fact_hi);
> +		shift -= fs;
> +		fact >>= fs;
>  	}
>  
>  	return mul_u64_u32_shr(delta_exec, fact, shift);
>  }

Horrific whitespace damage..
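
For anyone wanting to sanity-check the transformation outside the kernel, here is a
minimal userspace sketch (mine, not part of the patch; fls_u32() built on
__builtin_clz stands in for the kernel's fls()) comparing the original shift-by-one
loop against the fls-based normalization the patch introduces:

```
/* Standalone sketch, not kernel code: verify that dropping the high bits
 * with a single fls()-sized shift matches the original shift-by-one loop.
 */
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the kernel's fls(): 1-based index of the most
 * significant set bit, 0 when the argument is 0. */
static int fls_u32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Original: shift right one bit at a time until the high 32 bits are
 * clear, decrementing the weight shift along the way. */
static void normalize_loop(uint64_t *fact, int *shift)
{
	while (*fact >> 32) {
		*fact >>= 1;
		(*shift)--;
	}
}

/* Patched: compute the whole shift amount in one step from the high half. */
static void normalize_fls(uint64_t *fact, int *shift)
{
	uint32_t hi = (uint32_t)(*fact >> 32);

	if (hi) {
		int fs = fls_u32(hi);

		*shift -= fs;
		*fact >>= fs;
	}
}

int main(void)
{
	const uint64_t samples[] = {
		0, 1, 0xffffffffULL, 0x100000000ULL,
		0x0123456789abcdefULL, UINT64_MAX,
	};

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		uint64_t a = samples[i], b = samples[i];
		int sa = 32, sb = 32;	/* WMULT_SHIFT */

		normalize_loop(&a, &sa);
		normalize_fls(&b, &sb);
		printf("%016llx -> loop (%llx, %d), fls (%llx, %d)%s\n",
		       (unsigned long long)samples[i],
		       (unsigned long long)a, sa,
		       (unsigned long long)b, sb,
		       (a == b && sa == sb) ? "" : "  MISMATCH");
	}
	return 0;
}
```

Both paths end up with the same (fact, shift) pair for every input, since the loop
executes exactly fls(fact >> 32) iterations; the patch just replaces the iteration
with that closed form.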