> Rather than trying to cater to multiple assembly instruction implementations
> ourselves, have you tried taking the ideas in this earlier thread?
> https://lists.gnu.org/archive/html/qemu-devel/2015-10/msg05298.html
>
> Ideally, libc's memcmp() will already be using the most efficient assembly
> instructions without us having to reproduce the work of picking the
> instructions that work best.
>
Eric, thanks for the information. I hadn't noticed that discussion before.

I rewrote buffer_find_nonzero_offset() using 'memeqzero4_paolo', then wrote
a test program that checks a large number of zero pages, and used 'time' to
record the time taken by the different optimizations. The results are:

SSE2:
-----------------------------------
          |  test 1  |  test 2
-----------------------------------
Time (s): |  13.696  |  13.533
-----------------------------------

AVX2:
-----------------------------------
          |  test 1  |  test 2
-----------------------------------
Time (s): |  10.583  |  10.306
-----------------------------------

memeqzero4_paolo:
-----------------------------------
          |  test 1  |  test 2
-----------------------------------
Time (s): |   9.718  |   9.817
-----------------------------------

Paolo's implementation has the best performance, so it seems we can remove
the SSE2-related intrinsics.

Liang

> --
> Eric Blake eblake redhat com +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
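
For readers who have not followed the earlier thread, here is a minimal
sketch of the general idea of leaning on libc's memcmp() for the zero check.
It is not the code from that thread or from the patch under discussion: the
name buffer_is_zero_sketch and the 16-byte prefix length are assumptions
chosen for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper; not a QEMU function and not Paolo's code. */
static bool buffer_is_zero_sketch(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    /* Prefix length is an arbitrary choice for this sketch. */
    size_t head = len < 16 ? len : 16;
    size_t i;

    /* Check a short prefix byte by byte. */
    for (i = 0; i < head; i++) {
        if (p[i]) {
            return false;
        }
    }
    if (len == head) {
        return true;
    }

    /*
     * The first `head` bytes are zero.  Comparing the buffer with itself
     * shifted by `head` bytes checks p[i] == p[i + head] for every i, so
     * by induction all remaining bytes are zero too.  libc picks the
     * instructions used for the comparison.
     */
    return memcmp(p, p + head, len - head) == 0;
}
```

The point of the self-comparison is that the bulk of the work is a single
memcmp() call, so no per-architecture intrinsics need to be maintained in
the caller. Whether this beats hand-written SSE2 or AVX2 code depends on the
libc in use and would have to be measured.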