On 04/07/2016 02:58 AM, vija...@caviumnetworks.com wrote:
+#elif defined __aarch64__
+#include "arm_neon.h"

A better test is __ARM_NEON (older compilers spell it __ARM_NEON__), which asserts that NEON is available at compile time (which will basically always be true for aarch64), so you don't need a runtime test for NEON.

You also get support for armv7 with neon.
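
To make the compile-time test concrete, here is a minimal sketch of the guard, not the patch's actual code. EXAMPLE_HAVE_NEON and example_or128 are hypothetical names made up for illustration, and the scalar branch is an assumption added only so the snippet builds on hosts without NEON.

```c
#include <assert.h>
#include <stdint.h>

/* Select NEON at compile time; no runtime probe is involved.
 * EXAMPLE_HAVE_NEON is a hypothetical name, not from the patch. */
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define EXAMPLE_HAVE_NEON 1
#else
#define EXAMPLE_HAVE_NEON 0
#endif

/* OR two 16-byte chunks (two uint64_t each) into out[0..1]. */
static void example_or128(const uint64_t *a, const uint64_t *b, uint64_t *out)
{
    assert(a && b && out);
#if EXAMPLE_HAVE_NEON
    /* Same source works for aarch64 and for armv7 built with NEON. */
    vst1q_u64(out, vorrq_u64(vld1q_u64(a), vld1q_u64(b)));
#else
    /* Scalar fallback, assumed here for non-NEON builds. */
    out[0] = a[0] | b[0];
    out[1] = a[1] | b[1];
#endif
}
```

Because the guard keys on the feature rather than on the architecture, an armv7 build with NEON enabled takes the vector path without any extra #elif.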

+#define NEON_VECTYPE               uint64x2_t
+#define NEON_LOAD_N_ORR(v1, v2)    (vld1q_u64(&v1) | vld1q_u64(&v2))
+#define NEON_ORR(v1, v2)           ((v1) | (v2))
+#define NEON_NOT_EQ_ZERO(v1) \
+        ((vgetq_lane_u64(v1, 0) != 0) || (vgetq_lane_u64(v1, 1) != 0))

FWIW, I think that vmaxvq_u32 (applied to the vector reinterpreted as uint32x4_t) would be a better reduction for aarch64. Extracting the individual lanes isn't as efficient as one would like.

For armv7, folding via vget_lane_u64(vget_high_u64(v1) | vget_low_u64(v1), 0) is probably best.
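
The two reductions suggested above could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch's code: example_vec_nonzero and example_any_nonzero are hypothetical helper names, the armv7 fold uses vorr_u64 in place of the `|` operator, and the scalar branch exists only so the snippet builds where NEON is unavailable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: non-zero test on one 128-bit vector. */
static int example_vec_nonzero(uint64x2_t v)
{
#if defined(__aarch64__)
    /* Horizontal unsigned max over four 32-bit lanes;
     * the max is non-zero iff any bit in the vector is set. */
    return vmaxvq_u32(vreinterpretq_u32_u64(v)) != 0;
#else
    /* armv7: fold the high half into the low half, read one lane. */
    return vget_lane_u64(vorr_u64(vget_high_u64(v), vget_low_u64(v)), 0) != 0;
#endif
}

/* OR-accumulate n words (n must be even) and reduce once at the end. */
static int example_any_nonzero(const uint64_t *p, size_t n)
{
    assert(n % 2 == 0);
    uint64x2_t acc = vdupq_n_u64(0);
    for (size_t i = 0; i < n; i += 2) {
        acc = vorrq_u64(acc, vld1q_u64(p + i));
    }
    return example_vec_nonzero(acc);
}
#else
/* Scalar fallback, assumed here for non-NEON builds. */
static int example_any_nonzero(const uint64_t *p, size_t n)
{
    assert(n % 2 == 0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc |= p[i];
    }
    return acc != 0;
}
#endif
```

Either way the reduction happens once per accumulated vector, so its cost matters less the more loads are ORed together before testing.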


r~
