On 25.03.2013 at 15:34, Paolo Bonzini <pbonz...@redhat.com> wrote:

> On 25/03/2013 14:32, Peter Lieven wrote:
>>
>> On 25.03.2013 at 14:23, Peter Lieven <p...@kamp.de> wrote:
>>
>>>
>>> On 25.03.2013 at 14:02, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>>
>>>>> Maybe I should have explained the output in more detail. The percentages
>>>>> are cumulative. 35.8% in the second-to-last column means that
>>>>> 35.8% have a return value that is less than TARGET_PAGE_SIZE.
>>>>> This was meant to illustrate how many 64-bit chunks you have
>>>>> to look at to catch a certain percentage of non-zero pages.
>>>>
>>>> Ok, I wrongly understood that many pages had 4088 zero bytes but
>>>> the last 8 were not zero. Now it's clearer, and more logical too. :)
>>>>
>>>>> Looking e.g. at the third value, it means that looking at the first
>>>>> three 64-bit chunks will catch 34.0% of all pages.
>>>>> It turns out that the non-zeroness of a page can be detected by looking
>>>>> at the first 256 or so bits, and only a low
>>>>> percentage turns out to be non-zero at a later position. So after
>>>>> having checked the first chunks one by one
>>>>> there is no big penalty in looking at the remaining chunks with the
>>>>> vectorized loop.
>>>>
>>>> I think it makes most sense to unroll the first four non-vectorized
>>>> iterations, i.e. not use SSE and use three or four ifs. Either:
>>>>
>>>>     if (foo[0]) return 0;
>>>>     if (foo[1]) return 8;
>>>>     if (foo[2]) return 16;
>>>>     if (foo[3]) return 24;
>>>>
>>>> or
>>>>
>>>>     if (foo[0]) return 0;
>>>>     if (foo[1] | foo[2] | foo[3]) return 8;
>>>>
>>>> and then proceed on the remaining 4096-4*sizeof(long) bytes with
>>>> the vectorized loop. foo+4 is aligned for SIMD operations on both
>>>> 32- and 64-bit machines, which makes this a nice choice.
>>>
>>> I can't start at foo+4 since the remaining X-4*sizeof(long) bytes
>>> are not divisible by 8*sizeof(VECTYPE).
>
> Hmm, right. What about just processing the first few longs twice, i.e.
> the above followed by "for (i = 0; i < len / sizeof(VECTYPE); i
> += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR)"?

I will profile it tomorrow. What is bad about processing the first 8 vectors
as described below?

>> for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>     if (!ALL_EQ(p[i], zero)) {
>>         return i * sizeof(VECTYPE);
>>     }
>> }

This way it would not be necessary to process them twice.

Peter
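
For illustration, the variant proposed in this message could be sketched roughly as follows. This is only a sketch under stated assumptions: a plain uint64_t stands in for the real SIMD VECTYPE, ALL_EQ is reduced to ==, the function name find_nonzero_offset_sketch is hypothetical, and the OR-based second loop is just one plausible shape for the vectorized loop, not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for this sketch only: a scalar 64-bit word replaces the
 * real SIMD vector type, and ALL_EQ is an ordinary comparison. */
typedef uint64_t VECTYPE;
#define ALL_EQ(a, b) ((a) == (b))
#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8

/* Hypothetical helper: returns a byte offset at or before the first
 * non-zero vector, or len if the whole buffer is zero.
 * buf must be aligned for VECTYPE and len must be a multiple of
 * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE). */
static size_t find_nonzero_offset_sketch(const void *buf, size_t len)
{
    const VECTYPE *p = buf;
    const VECTYPE zero = 0;
    size_t i, j;

    assert(len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
                  * sizeof(VECTYPE)) == 0);

    /* Phase 1: check the first 8 vectors one by one, since most
     * non-zero pages are detected within the first few chunks. */
    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    /* Phase 2: continue after those 8 vectors in groups of 8, so
     * nothing is processed twice and the remaining length stays a
     * multiple of the unroll factor. */
    for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
         i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE acc = zero;
        for (j = 0; j < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; j++) {
            acc |= p[i + j];
        }
        if (!ALL_EQ(acc, zero)) {
            /* Offset of the group that contains the non-zero vector. */
            return i * sizeof(VECTYPE);
        }
    }
    return len;
}
```

Because the first phase consumes exactly one unroll group, the second loop starts at a group boundary, which is what avoids the divisibility problem of starting at foo+4.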