http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51980
--- Comment #5 from Ramana Radhakrishnan <ramana at gcc dot gnu.org> 2012-03-30 08:17:21 UTC ---

Experimenting with applying the patch of PR48941 and the patch for
lower-subreg here:

http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01886.html

I now see the output below. We still have too many moves for my liking, but
the gratuitous spilling is now gone.

	.cpu cortex-a9
	.eabi_attribute 27, 3
	.fpu neon
	.eabi_attribute 20, 1
	.eabi_attribute 21, 1
	.eabi_attribute 23, 3
	.eabi_attribute 24, 1
	.eabi_attribute 25, 1
	.eabi_attribute 26, 2
	.eabi_attribute 30, 2
	.eabi_attribute 34, 1
	.eabi_attribute 18, 4
	.file	"t2.c"
	.text
	.align	2
	.global	sqrlen4D_16u8
	.type	sqrlen4D_16u8, %function
sqrlen4D_16u8:
	@ args = 16, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	vmov	d16, r0, r1	@ v16qi
	vmov	d17, r2, r3
	vldmia	sp, {d18-d19}
	vabd.u8	q10, q8, q9
	vmull.u8	q11, d20, d20
	vmull.u8	q10, d21, d21
	vmov	q8, q11	@ v4si  -- unnecessary ?
	vmov	q9, q10	@ v4si  -- unnecessary ?
	vuzp.32	q8, q9
	vpaddl.u16	q10, q8
	vmov	q11, q10	@ v4si  -- unnecessary
	vpadal.u16	q11, q9
	vmov	r0, r1, d22	@ v4si
	vmov	r2, r3, d23
	bx	lr
	.size	sqrlen4D_16u8, .-sqrlen4D_16u8
	.ident	"GCC: (GNU) 4.8.0 20120330 (experimental)"
	.section	.note.GNU-stack,"",%progbits

This probably makes it a dup of PR48941, but it's starting to look more
promising now.

Eric, could you try the 2 patches and see what you get? This isn't something
to be gratuitously backported, as we still have to see the effects elsewhere,
but it would be worth seeing if this helps on your intrinsics testcases.

Ramana
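
For reference, here is a scalar reading of what the instruction sequence in the
listing above appears to compute. It is inferred only from the assembly; the
source of t2.c is not shown in this comment, so the function name
sqrlen4D_16u8_ref, its pointer-based signature, and the loop structure are all
assumptions made for illustration. On that reading, each of the four 32-bit
result lanes is the sum of four squared byte differences, which would match
the "squared length of four 4D vectors" suggested by the symbol name.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of the NEON listing; the name and signature
   are illustrative and are not taken from t2.c. */
void sqrlen4D_16u8_ref(const uint8_t a[16], const uint8_t b[16],
                       uint32_t out[4])
{
    for (int v = 0; v < 4; v++) {
        uint32_t sum = 0;
        for (int k = 0; k < 4; k++) {
            /* vabd.u8: absolute difference of each pair of bytes */
            uint32_t d = (uint32_t)abs((int)a[4 * v + k] - (int)b[4 * v + k]);
            /* vmull.u8: widening square, at most 255 * 255 = 65025 */
            sum += d * d;
        }
        /* vuzp.32 + vpaddl.u16 + vpadal.u16: the unzip regroups the 16-bit
           squares so that the two pairwise adds leave the sum of four
           consecutive squares in each 32-bit lane */
        out[v] = sum;
    }
}
```

A model like this can serve as a check when comparing generated code before and
after the two patches, since the moves marked "unnecessary" should not change
the result.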