------- Comment #3 from siarhei dot siamashka at gmail dot com 2010-06-15 20:14 -------
The whole point of submitting this PR was to find an efficient way to use NEON instructions to operate on arbitrary scalar floating point values, in order to work around the inherent slowness of the Cortex-A8 VFP Lite (and maybe make it transparent by wrapping it in a C++ class and using operator overloading).

Using 'vdup_n_f32' to load a single floating point value seems to be better than 'vset_lane_f32' here, because we don't have to deal with the uninitialized part of the register. But 'vdup_n_f32' suffers from similar performance issues (the VLD1 instruction is not used directly) and results in redundant instructions being emitted when the value is loaded from memory.

Optimistically, something like this should have been used instead of 'vdup_n_f32' in this case:

static inline float32x2_t vdup_n_f32_mem(float *p)
{
    float32x2_t result;
    asm ("vld1.f32 {%P0[]}, [%1, :32]" : "=w" (result) : "r" (p) : "memory");
    return result;
}

I wonder if it is possible to check at compile time whether the operand comes from memory or from a register? Something similar to the '__builtin_constant_p' built-in function? Or use the multiple alternatives feature of inline assembly constraints to emit either VMOV or VLD1? Anything else?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43364
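
For comparison with the hand-written 'vdup_n_f32_mem' above, here is a minimal sketch of the same load-and-duplicate idea written with the standard 'vld1_dup_f32' intrinsic from arm_neon.h instead of inline assembly. The helper names 'load_dup' and 'scalar_mul' are hypothetical and exist only for illustration, and the non-NEON fallback is there only so the sketch builds on other targets. Whether a particular GCC version compiles the intrinsic down to a single VLD1 is not something this sketch establishes; that would have to be checked in the generated assembly.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: load one float from memory and replicate it
   into both lanes of a 64-bit NEON register via vld1_dup_f32. */
static inline float32x2_t load_dup(const float *p)
{
    return vld1_dup_f32(p);
}

/* Hypothetical helper: multiply two scalars that live in memory on the
   NEON unit. Both lanes hold the same product; lane 0 is returned. */
static float scalar_mul(const float *a, const float *b)
{
    float32x2_t r = vmul_f32(load_dup(a), load_dup(b));
    return vget_lane_f32(r, 0);
}
#else
/* Fallback (assumption: plain C arithmetic) for targets without NEON. */
static float scalar_mul(const float *a, const float *b)
{
    return *a * *b;
}
#endif
```

Calling 'scalar_mul(&a, &b)' with two float variables keeps both operands in memory at the call site, which is the case where the redundant instructions described above would show up.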