On Tue, Nov 17, 2015 at 5:46 PM, Alan Lawrence <alan.lawre...@arm.com> wrote:
> On 16/11/15 14:42, Christophe Lyon wrote:
>>
>> Hi Alan,
>>
>> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
>> fails for armeb targets.
>> I haven't had time to look into the details yet, but I guess you can
>> reproduce it quickly enough.
>>
>
> Thanks - yes I see it now.
>
> -fdump-tree-optimized looks sensible:
>
> __attribute__((noinline))
> test ()
> {
>   vector(4) int vect__14.21;
>   vector(4) int vect__2.20;
>   vector(4) int vect__2.19;
>   vector(4) int vect__3.13;
>   vector(4) int vect__2.12;
>
>   <bb 2>:
>   vect__2.12_24 = MEM[(int *)&b];
>   vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
>   MEM[(int *)&a] = vect__3.13_27;
>   vect__2.19_31 = MEM[(int *)&b + 16B];
>   vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
>   vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
>   MEM[(int *)&a + 16B] = vect__14.21_35;
>   return;
> }
>
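> For reference, here's a sketch of the scalar code being vectorized,
> reconstructed from that dump (the real source is in
> gcc.dg/vect/bb-slp-subgroups-3.c; the b[i] = i + 4 initialization is
> inferred from the expected values below):
>
>   int a[8], b[8];   /* b[i] = i + 4, inferred from the results below */
>
>   __attribute__ ((noinline)) void
>   test ()
>   {
>     a[0] = b[0] + 1;   /* first subgroup: a[0..3] = b[0..3] + {1,2,3,4} */
>     a[1] = b[1] + 2;
>     a[2] = b[2] + 3;
>     a[3] = b[3] + 4;
>     a[4] = b[0] * 3;   /* second subgroup: even elements of b, hence
>     a[5] = b[2] * 4;      the VEC_PERM_EXPR <..., { 0, 2, 4, 6 }> */
>     a[6] = b[4] * 5;
>     a[7] = b[6] * 7;
>   }
>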
> However, while a[0..3] end up containing 5 7 9 11 as expected, a[4..7]
> end up with 30 32 30 28 rather than the expected 12 24 40 70 (with
> b[i] = i + 4, those should be 4*3, 6*4, 8*5, 10*7). That is, we end up
> with (10 8 6 4), rather than the expected (4 6 8 10), being multiplied
> by {3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 pair
> that should extract elements {0,2,4,6} of b. Here is the assembler,
> with my workings as to what's in each register:
>
> test:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         movw    r2, #:lower16:b
>         movt    r2, #:upper16:b
>         vldr    d22, .L11
>         vldr    d23, .L11+8
> ;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
>         movw    r3, #:lower16:a
>         movt    r3, #:upper16:a
>         vld1.64 {d16-d17}, [r2:64]
> ;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
>         vmov    q9, q8  @ v4si
> ;; q9 = (b[2] b[3] b[0] b[1])
>         vldr    d20, [r2, #16]
>         vldr    d21, [r2, #24]
> ;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
>         vuzp.32 q10, q9
> ;; So q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
> ;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
>         vldr    d20, .L11+16
>         vldr    d21, .L11+24
> ;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
>         vmul.i32        q9, q9, q11
> ;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
> ;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
>         vadd.i32        q8, q8, q10
> ;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
> ;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
>         vst1.64 {d16-d17}, [r3:64]
> ;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
>         vstr    d18, [r3, #16]
> ;; a[4] = b[6]*3, a[5] = b[4]*4
>         vstr    d19, [r3, #24]
> ;; a[6] = b[2]*5, a[7] = b[0]*7
>         bx      lr
> .L12:
>         .align  3
> .L11:
>         .word   3
>         .word   4
>         .word   5
>         .word   7
>         .word   1
>         .word   2
>         .word   3
>         .word   4
>
> Which is to say: the element order in the q-registers is neither big-
> nor little-endian, but elements get stored back to memory in an order
> consistent with how they were loaded, so we're OK as long as there are
> no permutes. Unfortunately a UZP operates on this mixed-up lane order,
> so the load-side and store-side swaps no longer cancel, and it messes
> everything up...
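>
> To see concretely why, here's a plain-C model of the lane order worked
> out above (just a sketch; load_q/store_q are made-up names, not
> anything in GCC):
>
>   #include <stdio.h>
>
>   /* Model of the armeb layout above: memory elements (m0 m1 m2 m3)
>      land in register lanes (m2 m3 m0 m1), i.e. the two d halves of
>      a q register are swapped relative to memory order.  */
>   static void load_q (const int *mem, int q[4])
>   {
>     q[0] = mem[2]; q[1] = mem[3]; q[2] = mem[0]; q[3] = mem[1];
>   }
>
>   /* The inverse swap on the store side: load_q followed by store_q
>      round-trips, which is why permute-free code works.  */
>   static void store_q (const int q[4], int *mem)
>   {
>     mem[0] = q[2]; mem[1] = q[3]; mem[2] = q[0]; mem[3] = q[1];
>   }
>
>   int main (void)
>   {
>     int b[8] = { 4, 5, 6, 7, 8, 9, 10, 11 };
>     int lo[4], hi[4], even[4], out[4];
>
>     load_q (b, lo);      /* lanes: (b2 b3 b0 b1) */
>     load_q (b + 4, hi);  /* lanes: (b6 b7 b4 b5) */
>
>     /* The "even elements" output of the unzip, taken in register-lane
>        order; the vectorizer wanted memory elements {0,2,4,6} of b.  */
>     even[0] = lo[0]; even[1] = lo[2];
>     even[2] = hi[0]; even[3] = hi[2];
>
>     store_q (even, out);
>     printf ("%d %d %d %d\n", out[0], out[1], out[2], out[3]);
>     /* Prints 10 8 6 4: exactly the (10 8 6 4) observed above instead
>        of the expected (4 6 8 10).  */
>     return 0;
>   }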
>
> Hmmm. I'm feeling that "properly" fixing this testcase amounts to
> fixing armeb's whole register-numbering/lane-flipping scheme, which
> might be quite a large task. OTOH it might also fix the significant
> number of failing vectorizer tests. A simpler solution might be to
> disable some part of vector support on armeb, but I'm not sure yet
> which part would be best.
>
> Thoughts (CC maintainers)?

As most of the permutation support in the arm backend is disabled
for armeb, I suppose disabling some more of it is appropriate ...

Or finally sitting down and fixing the mess.  For constant permutes,
fixing it on the constant side would be an option, I guess.
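
Something like the following sketch, i.e. rewriting the selector
constant so that, applied in the register-lane order modelled above, it
picks up the same data the memory-order selector asked for (purely
illustrative, not actual GCC code; for 4 x 32-bit lanes the d-half swap
puts memory element i in lane (i ^ 2)):

  /* Illustrative only, not GCC code.  Memory element i of a q register
     sits in register lane (i ^ 2) under the swapped-d-halves layout.  */
  static int lane_of (int i) { return i ^ 2; }

  /* Remap a two-input, four-lane VEC_PERM selector written in memory
     order into one that yields the right data when applied in
     register-lane order.  */
  static void remap_sel (const int in[4], int out[4])
  {
    for (int i = 0; i < 4; i++)
      {
        int src = in[i];
        /* Which register lane holds the requested memory element...  */
        int lane = (src & ~3) | lane_of (src & 3);
        /* ...and which output lane must receive it so the store-side
           swap puts it back at memory position i.  */
        out[lane_of (i)] = lane;
      }
  }

For example, the { 0, 2, 4, 6 } selector above becomes { 6, 4, 2, 0 };
applied to the lane-swapped inputs and stored with the usual swap, that
puts b[0], b[2], b[4], b[6] into the right memory slots.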

Richard.

> --Alan
>
