On 7 April 2016 at 11:30, Paolo Bonzini <pbonz...@redhat.com> wrote:
>
>> +#elif defined __aarch64__
>> +#include "arm_neon.h"
>> +
>> +#define NEON_VECTYPE uint64x2_t
>> +#define NEON_LOAD_N_ORR(v1, v2) (vld1q_u64(&v1) | vld1q_u64(&v2))
>
> Why is the load and orr necessary? Is ((v1) | (v2)) enough?
>
>> +#define NEON_ORR(v1, v2) ((v1) | (v2))
>> +#define NEON_NOT_EQ_ZERO(v1) \
>> +    ((vgetq_lane_u64(v1, 0) != 0) || (vgetq_lane_u64(v1, 1) != 0))
>> +
>> +#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR_NEON 16
>
> Unless you have numbers saying that a 16-unroll is better than an 8-unroll
> (and then you should put those in the commit message), you do not need to
> duplicate code, just add aarch64 definitions for the existing code.
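
For what it's worth, here's an untested sketch of what "just add aarch64
definitions for the existing code" could look like. I'm going from memory of
the macro names the generic code in util/cutils.c uses (VECTYPE, VEC_OR,
ALL_EQ), so treat those as assumptions and check them against the tree:

#elif defined(__aarch64__)
#include <arm_neon.h>
/* GCC and clang define the NEON types as vector types, so plain '|' and
 * '^' work on them; no explicit vld1q/vorrq intrinsics are needed. */
#define VECTYPE        uint64x2_t
#define VEC_OR(v1, v2) ((v1) | (v2))
/* All lanes equal <=> both 64-bit lanes of (v1 ^ v2) are zero. */
#define ALL_EQ(v1, v2) \
    ((vgetq_lane_u64((v1) ^ (v2), 0) | \
      vgetq_lane_u64((v1) ^ (v2), 1)) == 0)
/* If the generic code also wants a byte SPLAT, something like
 * vreinterpretq_u64_u8(vdupq_n_u8(*(p))) ought to do; I haven't checked
 * whether it's actually needed here. */

That keeps the existing 8-way unroll and the rest of the generic loop.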
This pure-NEON code is also not doing the initial short-loop to test for
non-zero buffers (roughly sketched in the P.S. below), which means it's not
an apples-to-apples comparison. It seems unlikely that the balance of
workloads differs between ARM and x86 enough that it's worth doing the small
loop on one but not the other.

(This is also why it's helpful to explain your benchmarking method -- the
short loop will slow things down for some cases like "large and untouched
RAM", but be faster again for cases like "large RAM of which most pages have
been dirtied".)

thanks
-- PMM
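
P.S. By "the initial short-loop" I mean the first per-vector loop in the
generic buffer_find_nonzero_offset(). Very roughly, and again from memory
(so check the real util/cutils.c rather than quoting this back at me), the
shape is:

size_t buffer_find_nonzero_offset(const void *buf, size_t len)
{
    const VECTYPE *p = buf;
    const VECTYPE zero = (VECTYPE){0};
    size_t i;

    /* The caller has already checked (via can_use_buffer_find_nonzero_offset,
     * IIRC) that buf is suitably aligned and len is a multiple of
     * BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR * sizeof(VECTYPE). */

    /* Short loop: test the first few vectors one at a time, so a page that
     * has been dirtied near its start returns almost immediately. */
    for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
        if (!ALL_EQ(p[i], zero)) {
            return i * sizeof(VECTYPE);
        }
    }

    /* Main loop: OR eight vectors together and test the combined result
     * once per iteration -- this is the part the NEON patch reimplements. */
    for (; i < len / sizeof(VECTYPE);
         i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
        VECTYPE tmp0 = VEC_OR(p[i + 0], p[i + 1]);
        VECTYPE tmp1 = VEC_OR(p[i + 2], p[i + 3]);
        VECTYPE tmp2 = VEC_OR(p[i + 4], p[i + 5]);
        VECTYPE tmp3 = VEC_OR(p[i + 6], p[i + 7]);
        if (!ALL_EQ(VEC_OR(VEC_OR(tmp0, tmp1), VEC_OR(tmp2, tmp3)), zero)) {
            break;
        }
    }

    /* Note the result is only vector granular: everything before the
     * returned offset is known to be zero, but the offset is not necessarily
     * that of the first nonzero byte. */
    return i * sizeof(VECTYPE);
}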