http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60408
            Bug ID: 60408
           Summary: ARM: inefficient code for vget_lane_f32 intrinsic
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mans at mansr dot com

Consider this trivial function:

#include <arm_neon.h>

float foo(float32x2_t v)
{
    return vget_lane_f32(v, 0) + vget_lane_f32(v, 1);
}

Compiling with gcc 4.9 trunk from 2014-03-02 yields this (non-code output
removed):

$ gcc -O3 -march=armv7-a -mfpu=neon -S -o - test.c
foo:
        vmov.32 r3, d0[0]
        vmov.32 r2, d0[1]
        fmsr    s15, r3
        fmsr    s0, r2
        fadds   s0, s0, s15
        bx      lr

A simple "fadds s0, s0, s1" is what one would expect from code like this.
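
For comparison, the same two-lane sum can be written without arm_neon.h by
using GCC's generic vector extension. This is only a sketch and not part of
the original test case: the names v2f and foo_generic are made up for
illustration, and it assumes a compiler that supports
__attribute__((vector_size)) and vector subscripting (GCC 4.6 or later, or
Clang). Whether this variant compiles to the single fadds on ARM has not been
checked here; it may simply be a useful second data point when narrowing down
where the extra moves come from.

```c
/* Hypothetical comparison case, not taken from the original report.
   v2f is an illustrative name for a vector of two floats (8 bytes),
   standing in for float32x2_t. */
typedef float v2f __attribute__((vector_size(8)));

/* Sum both lanes using vector subscripting instead of vget_lane_f32. */
float foo_generic(v2f v)
{
    return v[0] + v[1];
}
```

Because it does not depend on the NEON header, this version also builds on
non-ARM hosts, which makes it easy to confirm the function itself is
semantically the same as foo before comparing the generated ARM assembly.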