http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362
           Summary: Arm Neon intrinsic types not correctly interpreted by compiler.
           Product: gcc
           Version: 4.4.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: mark.pupi...@dyson.com

Created attachment 24485
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24485
C file with 2 functions that show the bug when compiled.

The Arm Neon intrinsics define the type uint32x4x2_t as

    typedef struct uint32x4x2_t
    {
      uint32x4_t val[2];
    } uint32x4x2_t;

The compiler interprets this literally as a struct. This should not be the
case. The compiler should treat it as a pair of registers, just as it treats
uint32x4_t as a single register and not as an array of 4 x uint32_t.

The attached C file contains two versions of the same function: one that uses
quad word loads (vld1q), and one that uses double quad word loads (vld2q).
The function that uses double quad word loads should take 2 instructions
fewer, but it is actually 44 instructions long compared to 19 for the vld1q
version. (Both functions compute the same results.)

I believe this bug arises because the compiler treats the following as an
array access instead of a reference into the register file:

    uint32x4x2_t A = vld2q_u32 ( a );
    A.val[0]; // This statement should be treated as a reference to a
              // register - not an array access!

Assembly for the vld2q version - hopefully I am not mistaken, as I am new to
ARM assembly, but it appears to do double quad word loads in the Neon
pipeline, then transfers the registers back to the ARM processor, indexes
them as arrays and then reloads them into the Neon pipeline again!

vld2q variant, 44 instructions:

00000014 <_ZN4Neon16hamming_distanceEPjS0_>:
  14:   e92d0070    push      {r4, r5, r6}
  18:   e24dd084    sub       sp, sp, #132    ; 0x84
  1c:   f460c38f    vld2.32   {d28-d31}, [r0]
  20:   e28d6020    add       r6, sp, #32
  24:   ecc6cb08    vstmia    r6, {d28-d31}
  28:   e1a0c001    mov       ip, r1
  2c:   e8b6000f    ldm       r6!, {r0, r1, r2, r3}
  30:   e28d4060    add       r4, sp, #96     ; 0x60
  34:   e1a05004    mov       r5, r4
  38:   f46c038f    vld2.32   {d16-d19}, [ip]
  3c:   e8a5000f    stmia     r5!, {r0, r1, r2, r3}
  40:   eccd0b08    vstmia    sp, {d16-d19}
  44:   e896000f    ldm       r6, {r0, r1, r2, r3}
  48:   e1a0c00d    mov       ip, sp
  4c:   e28d4040    add       r4, sp, #64     ; 0x40
  50:   e885000f    stm       r5, {r0, r1, r2, r3}
  54:   e1a05004    mov       r5, r4
  58:   e8bc000f    ldm       ip!, {r0, r1, r2, r3}
  5c:   e8a5000f    stmia     r5!, {r0, r1, r2, r3}
  60:   e89c000f    ldm       ip, {r0, r1, r2, r3}
  64:   e885000f    stm       r5, {r0, r1, r2, r3}
  68:   eddd4b10    vldr      d20, [sp, #64]  ; 0x40
  6c:   eddd5b12    vldr      d21, [sp, #72]  ; 0x48
  70:   edddab18    vldr      d26, [sp, #96]  ; 0x60
  74:   edddbb1a    vldr      d27, [sp, #104] ; 0x68
  78:   f34a61f4    veor      q11, q13, q10
  7c:   eddd8b14    vldr      d24, [sp, #80]  ; 0x50
  80:   eddd9b16    vldr      d25, [sp, #88]  ; 0x58
  84:   eddd4b1c    vldr      d20, [sp, #112] ; 0x70
  88:   eddd5b1e    vldr      d21, [sp, #120] ; 0x78
  8c:   f30461f8    veor      q3, q10, q12
  90:   f3f00546    vcnt.8    q8, q3
  94:   f3b04566    vcnt.8    q2, q11
  98:   f2042860    vadd.i8   q1, q2, q8
  9c:   f3f022c2    vpaddl.u8 q9, q1
  a0:   f3f422e2    vpaddl.u16 q9, q9
  a4:   f22201b2    vorr      d0, d18, d18
  a8:   f26321b3    vorr      d18, d19, d19
  ac:   f2620b90    vpadd.i32 d16, d18, d0
  b0:   f2600bb0    vpadd.i32 d16, d16, d16
  b4:   ee100b90    vmov.32   r0, d16[0]
  b8:   e28dd084    add       sp, sp, #132    ; 0x84
  bc:   e8bd0070    pop       {r4, r5, r6}
  c0:   e12fff1e    bx        lr

vld1q variant, only 19 instructions:

00000014 <_ZN4Neon16hamming_distanceEPjS0_>:
  14:   e2802010    add       r2, r0, #16
  18:   e2813010    add       r3, r1, #16
  1c:   f4606a8f    vld1.32   {d22-d23}, [r0]
  20:   f4624a8f    vld1.32   {d20-d21}, [r2]
  24:   f463aa8f    vld1.32   {d26-d27}, [r3]
  28:   f461ca8f    vld1.32   {d28-d29}, [r1]
  2c:   f34681fc    veor      q12, q11, q14
  30:   f30461fa    veor      q3, q10, q13
  34:   f3f00546    vcnt.8    q8, q3
  38:   f3b04568    vcnt.8    q2, q12
  3c:   f2042860    vadd.i8   q1, q2, q8
  40:   f3f022c2    vpaddl.u8 q9, q1
  44:   f3f422e2    vpaddl.u16 q9, q9
  48:   f22201b2    vorr      d0, d18, d18
  4c:   f26321b3    vorr      d18, d19, d19
  50:   f2620b90    vpadd.i32 d16, d18, d0
  54:   f2600bb0    vpadd.i32 d16, d16, d16
  58:   ee100b90    vmov.32   r0, d16[0]
  5c:   e12fff1e    bx        lr
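
For readers without the attachment, here is a rough sketch of why the two variants compute the same result even though vld2q de-interleaves its input. This is NOT the attached file: it is a portable scalar model in plain C, and every name in it (model_u32x4, model_vld1q_u32, model_vld2q_u32, hamming_vld1q_model, hamming_vld2q_model, self_check) is made up for illustration. It assumes each function compares 8 words per input, which matches the loads in the listings above. Real code would use the arm_neon.h types and the vld1q_u32 / vld2q_u32 intrinsics instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar stand-ins for uint32x4_t and uint32x4x2_t.
   These are illustration types, not the arm_neon.h definitions. */
typedef struct { uint32_t lane[4]; } model_u32x4;
typedef struct { model_u32x4 val[2]; } model_u32x4x2;

/* Models vld1q_u32: load four consecutive 32-bit words. */
static model_u32x4 model_vld1q_u32(const uint32_t *p)
{
    model_u32x4 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = p[i];
    return r;
}

/* Models vld2q_u32: load eight words and de-interleave them, so that
   even-indexed words land in val[0] and odd-indexed words in val[1]. */
static model_u32x4x2 model_vld2q_u32(const uint32_t *p)
{
    model_u32x4x2 r;
    for (int i = 0; i < 4; ++i) {
        r.val[0].lane[i] = p[2 * i];
        r.val[1].lane[i] = p[2 * i + 1];
    }
    return r;
}

/* Portable bit count of one 32-bit word. */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Number of differing bits between two 4-lane vectors (XOR, then count). */
static unsigned xor_popcount(model_u32x4 x, model_u32x4 y)
{
    unsigned n = 0;
    for (int i = 0; i < 4; ++i)
        n += popcount32(x.lane[i] ^ y.lane[i]);
    return n;
}

/* Variant modelled on the vld1q version: two contiguous loads per input. */
unsigned hamming_vld1q_model(const uint32_t *a, const uint32_t *b)
{
    model_u32x4 a0 = model_vld1q_u32(a), a1 = model_vld1q_u32(a + 4);
    model_u32x4 b0 = model_vld1q_u32(b), b1 = model_vld1q_u32(b + 4);
    return xor_popcount(a0, b0) + xor_popcount(a1, b1);
}

/* Variant modelled on the vld2q version: one de-interleaving load per
   input, then A.val[0] / A.val[1] used as two separate vectors. */
unsigned hamming_vld2q_model(const uint32_t *a, const uint32_t *b)
{
    model_u32x4x2 A = model_vld2q_u32(a);
    model_u32x4x2 B = model_vld2q_u32(b);
    return xor_popcount(A.val[0], B.val[0]) + xor_popcount(A.val[1], B.val[1]);
}

/* Both models must agree on arbitrary data: the same word pairs are
   compared in each, only their grouping into vectors differs. */
void self_check(void)
{
    const uint32_t a[8] = {0x0Fu, 0xF0u, 0xFF00u, 0u, 1u, 2u, 3u, 0x80000000u};
    const uint32_t b[8] = {0u, 0xFFu, 0u, 0xFFFFu, 1u, 0u, 3u, 0u};
    assert(hamming_vld1q_model(a, b) == hamming_vld2q_model(a, b));
}
```

Since both inputs are de-interleaved the same way, word i of a is still compared with word i of b, and the final sum does not depend on which vector a word ended up in. That is why I expected the vld2q version to be a drop-in replacement that only saves the two address computations.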