[Bug c/49362] New: Arm Neon intrinsic types not correctly interpreted by compiler.

mark.pupilli at dyson dot com Fri, 10 Jun 2011 04:35:45 -0700

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49362


           Summary: Arm Neon intrinsic types not correctly interpreted by
                    compiler.
           Product: gcc
           Version: 4.4.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: mark.pupi...@dyson.com


Created attachment 24485
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=24485
C file with two functions that show the bug when compiled.

The Arm Neon intrinsics define the type uint32x4x2_t as

typedef struct uint32x4x2_t { uint32x4_t val[2]; } uint32x4x2_t;

The compiler interprets this literally as a struct. This should not be the
case. The compiler should treat it as a pair of registers, just as it treats
uint32x4_t as a single register and not as an array of 4 x uint32_t.

The attached C file contains two versions of the same function: one that uses
quad word loads (vld1q), and one that uses double quad word loads (vld2q). The
function that uses double quad word loads should take 2 fewer instructions, but
it is actually 44 instructions long compared to 19 for the vld1q version. (Both
functions compute the same results.)
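
The attachment is not reproduced in this message, so the following is only a
scalar sketch of why the two load strategies can give the same result. It
assumes each function compares 8 words per input, which is what the listings
below suggest. vld2q de-interleaves on load (even-indexed words go to val[0],
odd-indexed words to val[1]), but a Hamming distance is a sum over all words,
so the order in which the words are paired up does not matter as long as both
inputs are loaded the same way. The function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in one 32-bit word (portable, no compiler builtins). */
static unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;   /* clear the lowest set bit */
        ++n;
    }
    return n;
}

/* Scalar model of the vld1q variant: words are consumed in memory order. */
unsigned hamming_contiguous(const uint32_t a[8], const uint32_t b[8])
{
    unsigned total = 0;
    for (size_t i = 0; i < 8; ++i)
        total += popcount32(a[i] ^ b[i]);
    return total;
}

/* Scalar model of the vld2q variant: the load de-interleaves, so val[0]
 * holds words 0,2,4,6 and val[1] holds words 1,3,5,7. */
unsigned hamming_deinterleaved(const uint32_t a[8], const uint32_t b[8])
{
    uint32_t a_val[2][4], b_val[2][4];
    for (size_t lane = 0; lane < 4; ++lane) {
        a_val[0][lane] = a[2 * lane];
        a_val[1][lane] = a[2 * lane + 1];
        b_val[0][lane] = b[2 * lane];
        b_val[1][lane] = b[2 * lane + 1];
    }

    unsigned total = 0;
    for (size_t v = 0; v < 2; ++v)
        for (size_t lane = 0; lane < 4; ++lane)
            total += popcount32(a_val[v][lane] ^ b_val[v][lane]);
    return total;
}
```

Both functions visit the same eight word pairs, only in a different order,
which is why the vld2q and vld1q variants are expected to agree.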

I believe this bug arises because the compiler treats the following as array
access instead of a reference into the register file:

uint32x4x2_t A = vld2q_u32 ( a );
A.val[0]; // This statement should be treated as a reference to a register,
          // not an array access!
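
For context, here is a guess at what the vld2q variant might look like in
source form. It is reconstructed from the instruction sequence in the listing
below (veor, vcnt.8, vadd.i8, vpaddl, vpadd), not taken from the attachment, so
the function name and exact structure are assumptions. The A.val[0] and
A.val[1] accesses are the ones that appear to trigger the round trip through
the stack. A scalar fallback is included only so the sketch also builds on
targets without Neon.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical reconstruction: Hamming distance over 8 words per input. */
uint32_t hamming_distance_vld2q(const uint32_t *a, const uint32_t *b)
{
    /* Each vld2q_u32 fills a pair of q registers (de-interleaved). */
    uint32x4x2_t A = vld2q_u32(a);
    uint32x4x2_t B = vld2q_u32(b);

    /* XOR the inputs, then count the set bits in every byte lane. */
    uint8x16_t c0 = vcntq_u8(vreinterpretq_u8_u32(veorq_u32(A.val[0], B.val[0])));
    uint8x16_t c1 = vcntq_u8(vreinterpretq_u8_u32(veorq_u32(A.val[1], B.val[1])));

    /* Each byte lane holds at most 8 + 8 = 16, so the 8-bit add cannot wrap. */
    uint8x16_t sum8 = vaddq_u8(c0, c1);

    /* Widen pairwise: 16 x u8 -> 8 x u16 -> 4 x u32. */
    uint32x4_t sum32 = vpaddlq_u16(vpaddlq_u8(sum8));

    /* Fold the four 32-bit lanes down to one. */
    uint32x2_t s = vpadd_u32(vget_low_u32(sum32), vget_high_u32(sum32));
    s = vpadd_u32(s, s);
    return vget_lane_u32(s, 0);
}
#else
/* Portable fallback with the same result, for targets without Neon. */
uint32_t hamming_distance_vld2q(const uint32_t *a, const uint32_t *b)
{
    uint32_t total = 0;
    for (int i = 0; i < 8; ++i) {
        uint32_t x = a[i] ^ b[i];
        while (x) {
            x &= x - 1;   /* clear the lowest set bit */
            ++total;
        }
    }
    return total;
}
#endif
```

Ideally A.val[0] and A.val[1] would each simply name one of the q registers
that the vld2 instruction has just written, with no stores or reloads in
between.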

Assembly for the vld2q version follows. Hopefully I am not mistaken, as I am
new to ARM assembly, but it appears to do the double quad word loads in the
Neon pipeline, store the registers to the stack, copy them through the ARM
core registers (indexing them as arrays), and then reload them into the Neon
pipeline again:

vld2q variant, 44 instructions:

00000014 <_ZN4Neon16hamming_distanceEPjS0_>:
  14:    e92d0070     push    {r4, r5, r6}
  18:    e24dd084     sub    sp, sp, #132    ; 0x84
  1c:    f460c38f     vld2.32    {d28-d31}, [r0]
  20:    e28d6020     add    r6, sp, #32
  24:    ecc6cb08     vstmia    r6, {d28-d31}
  28:    e1a0c001     mov    ip, r1
  2c:    e8b6000f     ldm    r6!, {r0, r1, r2, r3}
  30:    e28d4060     add    r4, sp, #96    ; 0x60
  34:    e1a05004     mov    r5, r4
  38:    f46c038f     vld2.32    {d16-d19}, [ip]
  3c:    e8a5000f     stmia    r5!, {r0, r1, r2, r3}
  40:    eccd0b08     vstmia    sp, {d16-d19}
  44:    e896000f     ldm    r6, {r0, r1, r2, r3}
  48:    e1a0c00d     mov    ip, sp
  4c:    e28d4040     add    r4, sp, #64    ; 0x40
  50:    e885000f     stm    r5, {r0, r1, r2, r3}
  54:    e1a05004     mov    r5, r4
  58:    e8bc000f     ldm    ip!, {r0, r1, r2, r3}
  5c:    e8a5000f     stmia    r5!, {r0, r1, r2, r3}
  60:    e89c000f     ldm    ip, {r0, r1, r2, r3}
  64:    e885000f     stm    r5, {r0, r1, r2, r3}
  68:    eddd4b10     vldr    d20, [sp, #64]    ; 0x40
  6c:    eddd5b12     vldr    d21, [sp, #72]    ; 0x48
  70:    edddab18     vldr    d26, [sp, #96]    ; 0x60
  74:    edddbb1a     vldr    d27, [sp, #104]    ; 0x68
  78:    f34a61f4     veor    q11, q13, q10
  7c:    eddd8b14     vldr    d24, [sp, #80]    ; 0x50
  80:    eddd9b16     vldr    d25, [sp, #88]    ; 0x58
  84:    eddd4b1c     vldr    d20, [sp, #112]    ; 0x70
  88:    eddd5b1e     vldr    d21, [sp, #120]    ; 0x78
  8c:    f30461f8     veor    q3, q10, q12
  90:    f3f00546     vcnt.8    q8, q3
  94:    f3b04566     vcnt.8    q2, q11
  98:    f2042860     vadd.i8    q1, q2, q8
  9c:    f3f022c2     vpaddl.u8    q9, q1
  a0:    f3f422e2     vpaddl.u16    q9, q9
  a4:    f22201b2     vorr    d0, d18, d18
  a8:    f26321b3     vorr    d18, d19, d19
  ac:    f2620b90     vpadd.i32    d16, d18, d0
  b0:    f2600bb0     vpadd.i32    d16, d16, d16
  b4:    ee100b90     vmov.32    r0, d16[0]
  b8:    e28dd084     add    sp, sp, #132    ; 0x84
  bc:    e8bd0070     pop    {r4, r5, r6}
  c0:    e12fff1e     bx    lr

vld1q variant, only 19 instructions:

00000014 <_ZN4Neon16hamming_distanceEPjS0_>:
  14:    e2802010     add    r2, r0, #16
  18:    e2813010     add    r3, r1, #16
  1c:    f4606a8f     vld1.32    {d22-d23}, [r0]
  20:    f4624a8f     vld1.32    {d20-d21}, [r2]
  24:    f463aa8f     vld1.32    {d26-d27}, [r3]
  28:    f461ca8f     vld1.32    {d28-d29}, [r1]
  2c:    f34681fc     veor    q12, q11, q14
  30:    f30461fa     veor    q3, q10, q13
  34:    f3f00546     vcnt.8    q8, q3
  38:    f3b04568     vcnt.8    q2, q12
  3c:    f2042860     vadd.i8    q1, q2, q8
  40:    f3f022c2     vpaddl.u8    q9, q1
  44:    f3f422e2     vpaddl.u16    q9, q9
  48:    f22201b2     vorr    d0, d18, d18
  4c:    f26321b3     vorr    d18, d19, d19
  50:    f2620b90     vpadd.i32    d16, d18, d0
  54:    f2600bb0     vpadd.i32    d16, d16, d16
  58:    ee100b90     vmov.32    r0, d16[0]
  5c:    e12fff1e     bx    lr
