On 26 September 2014 13:47, Tejas Belagod <tejas.bela...@arm.com> wrote:
> If we use type-punning, there are unnecessary spills that are generated
> which is also incorrect for BE because of the way we spill (st1 {v0.16b -
> v1.16b}, [sp]) and restore. The implementation without type-punning seems to
> give a more optimal result. Did your patches improve on the spills for the
> type-punning solution?

OK, this part seems too contentious, so I've respun the vldN_lane
parts without the type punning and reposted them. This issue can be
resolved separately.

Trying an example like this gives good code with type punning, and
poor code without.

#include <arm_neon.h>

void t2(int32_t *p)
{
    int32x4x4_t va = vld4q_s32(p);
    va = vld4q_lane_s32(p + 500, va, 1);
    vst4q_s32(p + 1000, va);
}


With type-punning, good code:
t2:
        ld4     {v0.4s - v3.4s}, [x0]
        add     x2, x0, 2000
        add     x1, x0, 4000
        ld4     {v0.s - v3.s}[1], [x2]
        st4     {v0.4s - v3.4s}, [x1]
        ret

Without type-punning, horrible code:
t2:
        ld4     {v0.4s - v3.4s}, [x0]
        sub     sp, sp, #64
        add     x14, x0, 2000
        add     x0, x0, 4000
        umov    x12, v0.d[0]
        umov    x13, v0.d[1]
        umov    x10, v1.d[0]
        umov    x11, v1.d[1]
        umov    x8, v2.d[0]
        str     x12, [sp]
        umov    x9, v2.d[1]
        str     x13, [sp, 8]
        str     q3, [sp, 48]
        str     x10, [sp, 16]
        str     x11, [sp, 24]
        str     x8, [sp, 32]
        str     x9, [sp, 40]
        ld1     {v0.16b - v3.16b}, [sp]
        ld4     {v0.s - v3.s}[1], [x14]
        umov    x10, v0.d[0]
        umov    x11, v0.d[1]
        umov    x8, v1.d[0]
        umov    x9, v1.d[1]
        umov    x6, v2.d[0]
        str     x10, [sp]
        umov    x7, v2.d[1]
        str     x11, [sp, 8]
        str     q3, [sp, 48]
        str     x8, [sp, 16]
        str     x9, [sp, 24]
        str     x6, [sp, 32]
        str     x7, [sp, 40]
        ld1     {v0.16b - v3.16b}, [sp]
        add     sp, sp, 64
        st4     {v0.4s - v3.4s}, [x0]
        ret

>> Maybe the solution is to pass the NEON
>> intrinsic types directly to the builtins? Is there a reason that it
>> wasn't done that way before?
>
> How do you mean? Do you mean pass a loaded value int32x2x2_t into a
> __builtin? How will that work?
>
> If you mean why we don't pass an int32x2x2_t into a builtin as a structure,
> I don't think that would work as it is struct type which would correspond to
> a  BLK mode, but we need RTL patterns with reg-lists to work with large int
> modes for the regalloc to allocate consecutive regs for the reglists.

OK, that makes sense. However, something needs to be done to create
the __aarch64_simd_ objects without register moves. Since the existing
mechanism causes problems (the lifetimes of the inputs overlap with
the lifetimes of the outputs), I think there are these options:

1. represent the construction/deconstruction as a single operation, to
avoid overlapping variable liveness in the source.
2. add a pass or peephole which can combine the existing builtins into
a single operation, so that the lifetimes are normalised.
3. teach the register allocator how to handle overlapping liveness of
a register and a subreg of that register.

Option 1 would require a new builtin interface which somehow handled a
whole int32x2x2_t in one operation. Construction is easy
(__builtin_aarch64_simd_construct(v.val[0], v.val[1]) or similar);
deconstruction is less obvious.

Option 2 sounds like a hack, but would probably be effective,
particularly if it can be done before inlining.

Option 3 would also help with poor code generation for ARM targets
with vget_low_*, vget_high_* and vcombine_*.

What do you think is the best approach?

Thanks
Charles
