On 25.03.2013 at 14:23, Peter Lieven <p...@kamp.de> wrote:
>
> On 25.03.2013 at 14:02, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>>> Maybe I should have explained the output in more detail. The percentages
>>> are cumulative. 35.8% in the second-to-last column means that
>>> 35.8% of pages have a return value that is less than TARGET_PAGE_SIZE.
>>> This was meant to illustrate how many 64-bit chunks you have
>>> to look at to catch a certain percentage of non-zero pages.
>>
>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>> the last 8 were not zero. Now it's clearer, and more logical too. :)
>>
>>> Looking e.g. at the third value, it means that looking at the first
>>> three 64-bit chunks will catch 34.0% of all pages.
>>> It turns out that the non-zeroness of a page can be detected by looking
>>> at the first 256 or so bits, and only a low
>>> percentage turns out to be non-zero at a later position. So after
>>> having checked the first chunks one by one,
>>> there is no big penalty in looking at the remaining chunks with the
>>> vectorized loop.
>>
>> I think it makes most sense to unroll the first four non-vectorized
>> iterations, i.e. not use SSE and use three or four ifs. Either:
>>
>>     if (foo[0]) return 0;
>>     if (foo[1]) return 8;
>>     if (foo[2]) return 16;
>>     if (foo[3]) return 24;
>>
>> or
>>
>>     if (foo[0]) return 0;
>>     if (foo[1] | foo[2] | foo[3]) return 8;
>>
>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>> the vectorized loop. foo+4 is aligned for SIMD operations on both
>> 32- and 64-bit machines, which makes this a nice choice.
>
> I can't start at foo+4 since the remaining X-4*sizeof(long) bytes
> are not divisible by 8*sizeof(VECTYPE).
>
> I could just do something like the following:
>
>     const unsigned long *tmp = buf;
>
>     for (i = 0;
>          i < sizeof(VECTYPE) * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>              / sizeof(unsigned long);
>          i += 4) {
>         if (tmp[i + 0]) return i * sizeof(unsigned long);
>         if (tmp[i + 1]) return (i+1) * sizeof(unsigned long);
>         if (tmp[i + 2]) return (i+2) * sizeof(unsigned long);
>         if (tmp[i + 3]) return (i+3) * sizeof(unsigned long);
>     }
>
>     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>          i < len / sizeof(VECTYPE);
>          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>         …
>     }

The performance of the above is bad compared to:

    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }
    …

The above is basically what the old is_dup_page is doing, but after the first
8 iterations the optimized version kicks in.

Peter
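
For illustration, the two-phase idea discussed in this thread (check the first
chunks one by one, then scan the rest in unrolled groups) could be sketched
roughly as below. This is a minimal sketch and not the actual patch: VECTYPE
and ALL_EQ are replaced by plain `unsigned long` stand-ins instead of SSE
types, the function name find_nonzero_offset_sketch is hypothetical, and the
body of the second loop (elided above) is only an assumed OR-accumulate
pattern. It also assumes that buf is suitably aligned and that len is a
multiple of BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE).

```c
#include <stddef.h>

/* Assumption: portable stand-ins for the SIMD type and compare macro. */
typedef unsigned long VECTYPE;
#define ALL_EQ(v1, v2) ((v1) == (v2))
#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8

/*
 * Hypothetical helper: returns a byte offset at or before the first
 * non-zero byte of buf, or len if the whole buffer is zero.
 * Assumes buf is aligned for VECTYPE and len is a multiple of
 * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE).
 */
static size_t find_nonzero_offset_sketch(const void *buf, size_t len)
{
    const VECTYPE *p = buf;
    const VECTYPE zero = 0;
    size_t i;

    /* Phase 1: test the first chunks individually, since most
     * non-zero pages are detected within them. */
    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    /* Phase 2 (assumed shape): OR a whole group of chunks together
     * and compare once per group. */
    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
         i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE acc = p[i + 0] | p[i + 1] | p[i + 2] | p[i + 3]
                    | p[i + 4] | p[i + 5] | p[i + 6] | p[i + 7];
        if (!ALL_EQ(acc, zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    return len;
}
```

In this sketch the second phase only reports the start of the group that
contains the non-zero data, so the result is exact to the chunk only within
the first BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR chunks. For an is-zero-page
test that is sufficient, because only "result == len" matters.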