On 25.03.2013 at 15:34, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
> Hmm, right. What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?
I tested this version as v3:

size_t buffer_find_nonzero_offset_v3(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    unsigned long *tmp = (unsigned long *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;

    assert(can_use_buffer_find_nonzero_offset(buf, len));

    if (!len) {
        return 0;
    }
    if (tmp[0]) {
        return 0;
    }
    if (tmp[1]) {
        return 1 * sizeof(unsigned long);
    }
    if (tmp[2]) {
        return 2 * sizeof(unsigned long);
    }
    if (tmp[3]) {
        return 3 * sizeof(unsigned long);
    }

    for (i = 0; i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }

    return i * sizeof(VECTYPE);
}

For reference, this is v2:

size_t buffer_find_nonzero_offset_v2(const void *buf, size_t len)
{
    VECTYPE *p = (VECTYPE *)buf;
    VECTYPE zero = ZERO_SPLAT;
    size_t i;

    assert(can_use_buffer_find_nonzero_offset(buf, len));

    if (!len) {
        return 0;
    }

    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
         i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = p[i + 0] | p[i + 1];
        VECTYPE tmp1 = p[i + 2] | p[i + 3];
        VECTYPE tmp2 = p[i + 4] | p[i + 5];
        VECTYPE tmp3 = p[i + 6] | p[i + 7];
        VECTYPE tmp01 = tmp0 | tmp1;
        VECTYPE tmp23 = tmp2 | tmp3;
        if (!ALL_EQ(tmp01 | tmp23, zero)) {
            break;
        }
    }

    return i * sizeof(VECTYPE);
}

I ran 3*2 tests, each with 1 GB of memory and 256 iterations of checking each 4k page for zero.
1) all pages zero

   a) SSE2
      is_zero_page:    res=67108864 (ticks 3289 user 1 system)
      is_zero_page_v2: res=67108864 (ticks 3326 user 0 system)
      is_zero_page_v3: res=67108864 (ticks 3305 user 3 system)
      is_dup_page:     res=67108864 (ticks 3648 user 1 system)

   b) unsigned long arithmetic
      is_zero_page:    res=67108864 (ticks 3474 user 3 system)
      is_zero_page_2:  res=67108864 (ticks 3516 user 1 system)
      is_zero_page_3:  res=67108864 (ticks 3525 user 3 system)
      is_dup_page:     res=67108864 (ticks 3826 user 4 system)

2) all pages non-zero, but the first 64 bits of each page zero

   a) SSE2
      is_zero_page:    res=0 (ticks 251 user 0 system)
      is_zero_page_v2: res=0 (ticks 87 user 0 system)
      is_zero_page_v3: res=0 (ticks 91 user 0 system)
      is_dup_page:     res=0 (ticks 82 user 0 system)

   b) unsigned long arithmetic
      is_zero_page:    res=0 (ticks 209 user 0 system)
      is_zero_page_v2: res=0 (ticks 89 user 0 system)
      is_zero_page_v3: res=0 (ticks 88 user 0 system)
      is_dup_page:     res=0 (ticks 88 user 0 system)

3) all pages non-zero, but the first 256 bits of each page zero

   a) SSE2
      is_zero_pages:   res=0 (ticks 260 user 0 system)
      is_zero_pages_2: res=0 (ticks 199 user 0 system)
      is_zero_pages_3: res=0 (ticks 342 user 0 system)
      is_dup_pages:    res=0 (ticks 223 user 0 system)

   b) unsigned long arithmetic
      is_zero_pages:   res=0 (ticks 230 user 0 system)
      is_zero_pages_2: res=0 (ticks 194 user 0 system)
      is_zero_pages_3: res=0 (ticks 280 user 0 system)
      is_dup_pages:    res=0 (ticks 191 user 0 system)

---

is_zero_page is the version from patch set v4.
is_zero_page_2 checks the first 8 * sizeof(VECTYPE) chunks one by one and then continues 8 chunks at a time without double checks.
is_zero_page_3 is the v3 version shown above.
is_dup_page is the old implementation.

All compiled with gcc -O3.

If no one objects I would use is_zero_page_2 and continue with v5 of the patch set, as I am out of office for the next 8 days starting tomorrow. I prefer v3, as it has better performance when the non-zeroness lies within the first 8 * sizeof(VECTYPE) bytes but not in the first 256 bits.
Paolo, with the version that has lower setup costs in mind, should I use the vectorized or the unrolled version of patch 4 (the find_next_bit optimization)?

Peter