On Wed, Mar 03, 2021 at 02:46:53PM -0800, Josh Don wrote:
> From: Clement Courbet <cour...@google.com>
>
> A significant portion of __calc_delta time is spent in the loop
> shifting a u64 by 32 bits. Use `fls` instead of iterating.
>
> This is ~7x faster on benchmarks.
>
> The generic `fls` implementation (`generic_fls`) is still ~4x faster
> than the loop.
> Architectures that have a better implementation will make use of it. For
> example, on X86 we get an additional factor 2 in speed without dedicated
> implementation.
>
> On gcc, the asm versions of `fls` are about the same speed as the
> builtin. On clang, the versions that use fls are more than twice as
> slow as the builtin. This is because the way the `fls` function is
> written, clang puts the value in memory:
> https://godbolt.org/z/EfMbYe. This bug is filed at
> https://bugs.llvm.org/show_bug.cgi?id=49406.
>
> ```
> name                                   cpu/op
> BM_Calc<__calc_delta_loop>             9.57ms ±12%
> BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
> BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
> BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
> BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
> BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
> BM_Calc<__calc_delta_builtin>          1.32ms ±11%
> ```
>
> Signed-off-by: Clement Courbet <cour...@google.com>
> Signed-off-by: Josh Don <josh...@google.com>

Thanks!
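
For anyone reading along, the idea in the quoted commit message can be sketched in userspace C. This is an illustration only, not the patch itself: `fls_sketch`, `reduce_loop`, `reduce_fls` and `check_same` are hypothetical names, and `fls_sketch` assumes the GCC/Clang `__builtin_clz` builtin as a stand-in for the kernel's `fls`. The loop form shifts a 64-bit factor right one bit at a time until it fits in 32 bits; the `fls` form reads the position of the highest set bit in the upper half and does the whole shift in one step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's fls(): 1-based index of the
 * highest set bit, 0 when x == 0. Assumes the GCC/Clang builtin. */
static int fls_sketch(uint32_t x)
{
    return x ? 32 - __builtin_clz(x) : 0;
}

/* Loop form: shift right one bit per iteration until fact fits in 32 bits,
 * decrementing *shift once for every bit dropped. */
static uint64_t reduce_loop(uint64_t fact, int *shift)
{
    while (fact >> 32) {
        fact >>= 1;
        (*shift)--;
    }
    return fact;
}

/* fls form: the number of bits to drop equals fls() of the upper 32 bits,
 * so the shift is applied in a single step. */
static uint64_t reduce_fls(uint64_t fact, int *shift)
{
    uint32_t hi = (uint32_t)(fact >> 32);

    if (hi) {
        int fs = fls_sketch(hi);   /* in the range 1..32 */

        *shift -= fs;
        fact >>= fs;
    }
    return fact;
}

/* Assert that both forms give the same reduced factor and shift. */
static void check_same(uint64_t fact, int shift)
{
    int s1 = shift, s2 = shift;
    uint64_t a = reduce_loop(fact, &s1);
    uint64_t b = reduce_fls(fact, &s2);

    assert(a == b && s1 == s2);
}
```

Calling `check_same(1ULL << 40, 32)` exercises both paths on a factor that needs nine bits dropped. Timings for a sketch like this would not be comparable to the benchmark figures quoted above.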