Even more efficient might be to do bitwise instead of logical or > if (tmp | d1 | d2 | d3) {
that should remove 3 of the 4 conditional jumps and should become 3 bitwise ors and one conditional jump On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini <pbonz...@redhat.com> wrote: > Il 11/03/2013 16:37, Peter Lieven ha scritto: >> >> Am 11.03.2013 um 16:29 schrieb Paolo Bonzini <pbonz...@redhat.com>: >> >>> Il 11/03/2013 16:24, Peter Lieven ha scritto: >>>> >>>>> How would that be different in your patch? But you can solve it by >>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking >>>>> BITS_PER_LONG. >>>> >>>> This is what I have now: >>>> >>>> diff --git a/util/bitops.c b/util/bitops.c >>>> index e72237a..b0dc93f 100644 >>>> --- a/util/bitops.c >>>> +++ b/util/bitops.c >>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, >>>> unsigned long size, >>>> const unsigned long *p = addr + BITOP_WORD(offset); >>>> unsigned long result = offset & ~(BITS_PER_LONG-1); >>>> unsigned long tmp; >>>> + unsigned long d0,d1,d2,d3; >>>> >>>> if (offset >= size) { >>>> return size; >>>> } >>>> size -= result; >>>> - offset %= BITS_PER_LONG; >>>> + offset &= (BITS_PER_LONG-1); >>>> if (offset) { >>>> tmp = *(p++); >>>> tmp &= (~0UL << offset); >>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, >>>> unsigned long size, >>>> size -= BITS_PER_LONG; >>>> result += BITS_PER_LONG; >>>> } >>>> - while (size & ~(BITS_PER_LONG-1)) { >>>> + while (size >= 4*BITS_PER_LONG) { >>>> + d0 = *p; >>>> + d1 = *(p+1); >>>> + d2 = *(p+2); >>>> + d3 = *(p+3); >>>> + if (d0 || d1 || d2 || d3) { >>>> + break; >>>> + } >>>> + p+=4; >>>> + result += 4*BITS_PER_LONG; >>>> + size -= 4*BITS_PER_LONG; >>>> + } >>>> + while (size >= BITS_PER_LONG) { >>>> if ((tmp = *(p++))) { >>>> goto found_middle; >>>> } >>>> >>> >>> Minus the %= vs. &=, >>> >>> Reviewed-by: Paolo Bonzini <pbonz...@redhat.com> >>> >>> Perhaps: >>> >>> tmp = *p; >>> d1 = *(p+1); >>> d2 = *(p+2); >>> d3 = *(p+3); >>> if (tmp) { >>> goto found_middle; >>> } >>> if (d1 || d2 || d3) { >>> break; >>> } >> >> i do not know what gcc interally makes of the d0 || d1 || d2 || d3 ? > > It depends on the target and how expensive branches are. > >> i would guess its sth like one addition w/ carry and 1 test? > > It could be either 4 compare-and-jump sequences, or 3 bitwise ORs > followed by a compare-and-jump. > > That is, either: > > test %r8, %r8 > jz second_loop > test %r9, %r9 > jz second_loop > test %r10, %r10 > jz second_loop > test %r11, %r11 > jz second_loop > > or > > or %r9, %r8 > or %r11, %r10 > or %r8, %r10 > jz second_loop > > Don't let the length of the code fool you. The processor knows how to > optimize all of these, and GCC knows too. > >> your proposed change would introduce 2 tests (maybe)? > > Yes, but I expect they to be fairly well predicted. > >> what about this to be sure? >> >> tmp = *p; >> d1 = *(p+1); >> d2 = *(p+2); >> d3 = *(p+3); >> if (tmp || d1 || d2 || d3) { >> if (tmp) { >> goto found_middle; > > I suspect that GCC would rewrite it my version (definitely if it > produces 4 compare-and-jumps; but possibly it does it even if it goes > for bitwise ORs, I haven't checked. > > Regarding your other question ("one last thought. would it make sense to > update only `size`in the while loops and compute the `result` at the end > as `orgsize` - `size`?"), again the compiler knows better and might even > do this for you. It will likely drop the p increases and use p[result], > so if you do that change you may even get the same code, only this time > p is increased and you get an extra subtraction at the end. :) > > Bottom line: don't try to outsmart an optimizing C compiler on > micro-optimization, unless you have benchmarked it and it shows there is > a problem. > > Paolo > >> } >> break; >> } >> >> Peter >> > >