Il 11/03/2013 16:37, Peter Lieven ha scritto: > > Am 11.03.2013 um 16:29 schrieb Paolo Bonzini <pbonz...@redhat.com>: > >> Il 11/03/2013 16:24, Peter Lieven ha scritto: >>> >>>> How would that be different in your patch? But you can solve it by >>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking >>>> BITS_PER_LONG. >>> >>> This is what I have now: >>> >>> diff --git a/util/bitops.c b/util/bitops.c >>> index e72237a..b0dc93f 100644 >>> --- a/util/bitops.c >>> +++ b/util/bitops.c >>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, >>> unsigned long size, >>> const unsigned long *p = addr + BITOP_WORD(offset); >>> unsigned long result = offset & ~(BITS_PER_LONG-1); >>> unsigned long tmp; >>> + unsigned long d0,d1,d2,d3; >>> >>> if (offset >= size) { >>> return size; >>> } >>> size -= result; >>> - offset %= BITS_PER_LONG; >>> + offset &= (BITS_PER_LONG-1); >>> if (offset) { >>> tmp = *(p++); >>> tmp &= (~0UL << offset); >>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, >>> unsigned long size, >>> size -= BITS_PER_LONG; >>> result += BITS_PER_LONG; >>> } >>> - while (size & ~(BITS_PER_LONG-1)) { >>> + while (size >= 4*BITS_PER_LONG) { >>> + d0 = *p; >>> + d1 = *(p+1); >>> + d2 = *(p+2); >>> + d3 = *(p+3); >>> + if (d0 || d1 || d2 || d3) { >>> + break; >>> + } >>> + p+=4; >>> + result += 4*BITS_PER_LONG; >>> + size -= 4*BITS_PER_LONG; >>> + } >>> + while (size >= BITS_PER_LONG) { >>> if ((tmp = *(p++))) { >>> goto found_middle; >>> } >>> >> >> Minus the %= vs. &=, >> >> Reviewed-by: Paolo Bonzini <pbonz...@redhat.com> >> >> Perhaps: >> >> tmp = *p; >> d1 = *(p+1); >> d2 = *(p+2); >> d3 = *(p+3); >> if (tmp) { >> goto found_middle; >> } >> if (d1 || d2 || d3) { >> break; >> } > > i do not know what gcc interally makes of the d0 || d1 || d2 || d3 ?
It depends on the target and how expensive branches are. > i would guess its sth like one addition w/ carry and 1 test? It could be either 4 compare-and-jump sequences, or 3 bitwise ORs followed by a compare-and-jump. That is, either: test %r8, %r8 jz second_loop test %r9, %r9 jz second_loop test %r10, %r10 jz second_loop test %r11, %r11 jz second_loop or or %r9, %r8 or %r11, %r10 or %r8, %r10 jz second_loop Don't let the length of the code fool you. The processor knows how to optimize all of these, and GCC knows too. > your proposed change would introduce 2 tests (maybe)? Yes, but I expect they to be fairly well predicted. > what about this to be sure? > > tmp = *p; > d1 = *(p+1); > d2 = *(p+2); > d3 = *(p+3); > if (tmp || d1 || d2 || d3) { > if (tmp) { > goto found_middle; I suspect that GCC would rewrite it my version (definitely if it produces 4 compare-and-jumps; but possibly it does it even if it goes for bitwise ORs, I haven't checked. Regarding your other question ("one last thought. would it make sense to update only `size`in the while loops and compute the `result` at the end as `orgsize` - `size`?"), again the compiler knows better and might even do this for you. It will likely drop the p increases and use p[result], so if you do that change you may even get the same code, only this time p is increased and you get an extra subtraction at the end. :) Bottom line: don't try to outsmart an optimizing C compiler on micro-optimization, unless you have benchmarked it and it shows there is a problem. Paolo > } > break; > } > > Peter >