On 04/07/2016 02:58 AM, vija...@caviumnetworks.com wrote:
+#elif defined __aarch64__ +#include "arm_neon.h"
A better test is __NEON__, which asserts that neon is available at compile time (which will be true basically always for aarch64), and then you don't need a runime test for neon.
You also get support for armv7 with neon.
+#define NEON_VECTYPE uint64x2_t +#define NEON_LOAD_N_ORR(v1, v2) (vld1q_u64(&v1) | vld1q_u64(&v2)) +#define NEON_ORR(v1, v2) ((v1) | (v2)) +#define NEON_NOT_EQ_ZERO(v1) \ + ((vgetq_lane_u64(v1, 0) != 0) || (vgetq_lane_u64(v1, 1) != 0))
FWIW, I think that vmaxvq_u32 would be a better reduction for aarch64. Extracting the individual lanes isn't as efficient as one would like.
For armv7, folding via vget_lane_u64(vget_high_u64(v1) | vget_low_u64(v1), 0) is probably best.
r~