On 26 September 2014 13:47, Tejas Belagod <tejas.bela...@arm.com> wrote: > If we use type-punning, there are unnecessary spills that are generated > which is also incorrect for BE because of of the way we spill (st1 {v0.16b - > v1.16b}, [sp]) and restore. The implementation without type-punning seems to > give a more optimal result. Did your patches improve on the spills for the > type-punning solution?
OK, this part seems too contentious, so I've respun the vldN_lane parts without the type punning and reposted them. This issue can be resolved separately. Trying an example like this gives good code with type punning, and poor code without. void t2(int32_t *p) { int32x4x4_t va = vld4q_s32(p); va = vld4q_lane_s32(p + 500, va, 1); vst4q_s32(p+1000, va); } With type-punning, good code: t2: ld4 {v0.4s - v3.4s}, [x0] add x2, x0, 2000 add x1, x0, 4000 ld4 {v0.s - v3.s}[1], [x2] st4 {v0.4s - v3.4s}, [x1] ret Without type-punning, horrible code: t2: ld4 {v0.4s - v3.4s}, [x0] sub sp, sp, #64 add x14, x0, 2000 add x0, x0, 4000 umov x12, v0.d[0] umov x13, v0.d[1] umov x10, v1.d[0] umov x11, v1.d[1] umov x8, v2.d[0] str x12, [sp] umov x9, v2.d[1] str x13, [sp, 8] str q3, [sp, 48] str x10, [sp, 16] str x11, [sp, 24] str x8, [sp, 32] str x9, [sp, 40] ld1 {v0.16b - v3.16b}, [sp] ld4 {v0.s - v3.s}[1], [x14] umov x10, v0.d[0] umov x11, v0.d[1] umov x8, v1.d[0] umov x9, v1.d[1] umov x6, v2.d[0] str x10, [sp] umov x7, v2.d[1] str x11, [sp, 8] str q3, [sp, 48] str x8, [sp, 16] str x9, [sp, 24] str x6, [sp, 32] str x7, [sp, 40] ld1 {v0.16b - v3.16b}, [sp] add sp, sp, 64 st4 {v0.4s - v3.4s}, [x0] ret >> Maybe the solution is to pass the NEON >> intrinsic types directly to the builtins? Is there a reason that it >> wasn't done that way before? > > How do you mean? Do you mean pass a loaded value int32x2x2_t into a > __builtin? How will that work? > > If you mean why we don't pass an int32x2x2_t into a builtin as a structure, > I don't think that would work as it is struct type which would correspond to > a BLK mode, but we need RTL patterns with reg-lists to work with large int > modes for the regalloc to allocate consecutive regs for the reglists. OK, that makes sense. However, something needs to be done to create the __arch64_simd_ objects without register moves. Since the existing mechanism causes problems because the lifetimes of the inputs overlap with the lifetimes of the outputs, I think there are these options: 1. represent the construction/deconstruction as a single operation, to avoid overlapping variable liveness in the source. 2. add a pass or peephole which can combine the existing builtins into a single operation, so that the lifetimes are normalised. 3. teach the register allocator how to handle overlapping liveness of a register and a subreg of that register. Option 1 would require a new builtin interface which somehow handled a whole int32x2x2_t in one operation. Construction is easy (__builtin_aarch64_simd_construct(v.val[0], v.val[1]) or similar). Deconstruction is less obvious Option 2 sounds like a hack, but would probably be effective, particularly if it can be done before inlining. Option 3 would also help with poor code generation for ARM targets with vget_low_*, vget_high_* and vcombine_*. What do you think is the best approach? Thanks Charles