On Tue, Mar 02, 2021 at 12:55:07PM -0800, Josh Don wrote:
> On Fri, Feb 26, 2021 at 1:03 PM Peter Zijlstra <pet...@infradead.org> wrote:
> >
> > On Fri, Feb 26, 2021 at 11:52:39AM -0800, Josh Don wrote:
> > > From: Clement Courbet <cour...@google.com>
> > >
> > > A significant portion of __calc_delta time is spent in the loop
> > > shifting a u64 by 32 bits. Use a __builtin_clz instead of iterating.
> > >
> > > This is ~7x faster on benchmarks.
> >
> > Have you tried on hardware without such fancy instructions?
>
> Was not able to find any on hand unfortunately. Clement did rework the
> patch to use fls() instead, and has benchmarks for the generic and asm
> variations. All of which are faster than the loop. In my next reply,
> I'll include the updated patch inline.

Excellent; I have some vague memories where using fls ended up slower for some ARMs, but I can't seem to remember enough to even Google it :/
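
For reference, the idea under discussion can be sketched roughly as below. This is a minimal userspace illustration, not the patch from this thread: the names normalize_loop(), normalize_fls(), fls32() and check_equivalent() are hypothetical, fls32() is a stand-in for the kernel's fls() built on the GCC/Clang __builtin_clz, and the surrounding __calc_delta logic is left out. It assumes unsigned int is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for fls(): 1-based index of the highest set bit,
 * or 0 when x == 0 (__builtin_clz is undefined for a zero argument). */
static inline int fls32(uint32_t x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* Loop variant: shift right one bit at a time until *fact fits in 32 bits,
 * decrementing shift once per iteration. */
static int normalize_loop(uint64_t *fact, int shift)
{
	while (*fact >> 32) {
		*fact >>= 1;
		shift--;
	}
	return shift;
}

/* fls variant: find the highest set bit of the upper 32 bits once and
 * shift by that amount in a single step. */
static int normalize_fls(uint64_t *fact, int shift)
{
	uint32_t hi = (uint32_t)(*fact >> 32);

	if (hi) {
		int fs = fls32(hi);

		*fact >>= fs;
		shift -= fs;
	}
	return shift;
}

/* Assert that both variants agree for a given input (initial shift of 32
 * is an arbitrary choice for the comparison). */
static void check_equivalent(uint64_t v)
{
	uint64_t a = v, b = v;
	int sa = normalize_loop(&a, 32);
	int sb = normalize_fls(&b, 32);

	assert(a == b);
	assert(sa == sb);
}
```

Both variants should leave the same value and shift count; whether the single-step form is actually faster depends on how fls() is implemented on the target, which is the open question for hardware without a count-leading-zeros instruction.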