[Bug target/43364] Suboptimal code for the use of ARM NEON intrinsic "vset_lane_f32"

siarhei dot siamashka at gmail dot com Tue, 15 Jun 2010 13:15:16 -0700


------- Comment #3 from siarhei dot siamashka at gmail dot com  2010-06-15 20:14 -------
The whole point of submitting this PR was to find an efficient way to use NEON
instructions to operate on arbitrary scalar floating point values in order
to work around the inherent slowness of Cortex-A8 VFP Lite (maybe making it
transparent by wrapping it in a C++ class and using operator overloading).


Using 'vdup_n_f32' to load a single floating point value seems to be better
than 'vset_lane_f32' here because we don't have to deal with the uninitialized
part of the register. But 'vdup_n_f32' suffers from similar performance issues
(the VLD1 instruction is not used directly) and results in redundant
instructions being emitted when the value is loaded from memory. Ideally,
something like this should have been used instead of 'vdup_n_f32' in this case:

static inline float32x2_t vdup_n_f32_mem(float *p)
{
    float32x2_t result;
    asm ("vld1.f32 {%P0[]}, [%1, :32]" : "=w" (result) : "r" (p) : "memory");
    return result;
}
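
For comparison, the standard NEON intrinsic 'vld1_dup_f32' from arm_neon.h
expresses the same load-and-duplicate operation without inline assembly. The
following is only a sketch: the names 'vec2f', 'load_dup_f32' and 'lane_f32'
are made up for illustration, the non-NEON branch exists solely so that the
snippet builds on other targets, and whether a particular GCC version turns the
intrinsic into a single VLD1 has not been checked here.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

typedef float32x2_t vec2f;  /* hypothetical alias, not part of arm_neon.h */

/* Load one float from memory and replicate it into both lanes. */
static inline vec2f load_dup_f32(const float *p)
{
    return vld1_dup_f32(p);
}

/* Read back lane 0 or 1; the lane index passed to the intrinsic is a literal. */
static inline float lane_f32(vec2f v, int i)
{
    return i ? vget_lane_f32(v, 1) : vget_lane_f32(v, 0);
}
#else
/* Portable stand-in so the sketch compiles without NEON support. */
typedef struct { float lane[2]; } vec2f;

static inline vec2f load_dup_f32(const float *p)
{
    vec2f v = { { *p, *p } };
    return v;
}

static inline float lane_f32(vec2f v, int i)
{
    return v.lane[i ? 1 : 0];
}
#endif
```

Inspecting the generated assembly (for example with -O2 -S on a NEON target)
would show whether this avoids the redundant instructions described above.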

I wonder if it is possible to check at compile time whether the operand comes
from memory or from a register? Something similar to the '__builtin_constant_p'
builtin function? Or use the multiple alternatives feature of inline assembly
constraints to emit either VMOV or VLD1? Anything else?
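
For reference, this is how '__builtin_constant_p' behaves in the
constant-versus-variable case. As far as I know, GCC has no analogous builtin
that reports whether an operand lives in memory or in a register, so the sketch
below only illustrates the existing mechanism. The macro and function names are
hypothetical.

```c
/* __builtin_constant_p(x) evaluates to 1 when the compiler knows that x is a
 * compile-time constant, and to 0 otherwise. */
#define IS_CONSTANT(x) __builtin_constant_p(x)

/* A floating point literal is a compile-time constant. */
static inline int literal_is_constant(void)
{
    return IS_CONSTANT(3.0f);
}

/* A volatile object must be read at run time, so it is never treated as a
 * compile-time constant. */
static inline int volatile_is_constant(void)
{
    volatile float v = 3.0f;
    return IS_CONSTANT(v);
}
```

A wrapper could branch on such a test to pick a code path at compile time; the
open question here is whether anything comparable exists for the
memory-versus-register distinction.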


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43364
