On 16/11/15 14:42, Christophe Lyon wrote:

Hi Alan,

I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
fails for armeb targets.
I haven't had time to look into it in more detail yet, but I guess you
can reproduce it quickly enough.


Thanks - yes I see it now.

The -fdump-tree-optimized output looks sensible:

__attribute__((noinline))
test ()
{
  vector(4) int vect__14.21;
  vector(4) int vect__2.20;
  vector(4) int vect__2.19;
  vector(4) int vect__3.13;
  vector(4) int vect__2.12;

  <bb 2>:
  vect__2.12_24 = MEM[(int *)&b];
  vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
  MEM[(int *)&a] = vect__3.13_27;
  vect__2.19_31 = MEM[(int *)&b + 16B];
  vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
  vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
  MEM[(int *)&a + 16B] = vect__14.21_35;
  return;
}
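
(In scalar terms that dump is doing roughly the following; this is just my
reconstruction from the GIMPLE, not the actual bb-slp-subgroups-3.c source:

int a[8], b[8];

__attribute__((noinline)) void
test (void)
{
  static const int addc[4] = { 1, 2, 3, 4 };
  static const int mulc[4] = { 3, 4, 5, 7 };
  for (int i = 0; i < 4; i++)
    a[i] = b[i] + addc[i];          /* the vadd.i32 / vst1.64 part below */
  for (int i = 0; i < 4; i++)
    a[i + 4] = b[2 * i] * mulc[i];  /* VEC_PERM_EXPR {0,2,4,6} + vmul.i32 */
}

i.e. a[4..7] should be the even elements of b times {3,4,5,7}.)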

but while a[0..3] end up containing 5 7 9 11 as expected, a[4..7] end up
with 30 32 30 28 rather than the expected 12 24 40 70. That is, we end up
with (10 8 6 4) rather than the expected (4 6 8 10) being multiplied by
{3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 pair that
should extract elements {0,2,4,6} of b. Assembler, with my workings as to
what's in each register:

test:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movw    r2, #:lower16:b
        movt    r2, #:upper16:b
        vldr    d22, .L11
        vldr    d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
        movw    r3, #:lower16:a
        movt    r3, #:upper16:a
        vld1.64 {d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
        vmov    q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
        vldr    d20, [r2, #16]
        vldr    d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
        vuzp.32 q10, q9
;; So  q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
        vldr    d20, .L11+16
        vldr    d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
        vmul.i32        q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
        vadd.i32        q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
        vst1.64 {d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
        vstr    d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
        vstr    d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
        bx      lr
.L12:
        .align  3
.L11:
        .word   3
        .word   4
        .word   5
        .word   7
        .word   1
        .word   2
        .word   3
        .word   4

Which is to say: the element order in the q-registers is neither big- nor little-endian, but the elements get stored back to memory in an order consistent with how they were loaded, so we're OK as long as there are no permutes. Unfortunately a UZP selects elements by lane position, so this lane-ordering mixup makes it pick the wrong ones and messes everything up...
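
Here's a toy model in plain C of what I think is going on. It just assumes
the 64-bit-half swap visible in my workings above (loads swap the halves,
stores swap them back, the permute selects by lane position); it is an
illustration of the failure mode, not a claim about the exact armeb lane
numbering. It reproduces both the correct a[0..3] and the bogus a[4..7]:

#include <stdio.h>

typedef struct { int e[4]; } qreg;

/* Load four ints with the 64-bit halves swapped (the mixup seen above).  */
static qreg q_load (const int *p)
{
  qreg r = { { p[2], p[3], p[0], p[1] } };
  return r;
}

/* Store undoes the same swap, so load followed by store is a no-op.  */
static void q_store (int *p, qreg r)
{
  p[0] = r.e[2]; p[1] = r.e[3]; p[2] = r.e[0]; p[3] = r.e[1];
}

/* VEC_PERM_EXPR <lo, hi, {0,2,4,6}> as executed: it selects by lane
   position, oblivious to the half-swap.  */
static qreg q_uzp_even (qreg lo, qreg hi)
{
  qreg r = { { lo.e[0], lo.e[2], hi.e[0], hi.e[2] } };
  return r;
}

static qreg q_add (qreg x, qreg y)
{
  qreg r;
  for (int i = 0; i < 4; i++) r.e[i] = x.e[i] + y.e[i];
  return r;
}

static qreg q_mul (qreg x, qreg y)
{
  qreg r;
  for (int i = 0; i < 4; i++) r.e[i] = x.e[i] * y.e[i];
  return r;
}

int main (void)
{
  /* b as implied by the reported values (a[0..3] == 5 7 9 11 etc.).  */
  int b[8] = { 4, 5, 6, 7, 8, 9, 10, 11 };
  int addc[4] = { 1, 2, 3, 4 };
  int mulc[4] = { 3, 4, 5, 7 };
  int a[8];

  qreg lo = q_load (b), hi = q_load (b + 4);
  q_store (a, q_add (lo, q_load (addc)));                      /* vadd + vst1 */
  q_store (a + 4, q_mul (q_uzp_even (lo, hi), q_load (mulc))); /* vuzp + vmul */

  /* Prints "5 7 9 11" (correct: the swap cancels out across load/store)
     and then "30 32 30 28" (the observed wrong result) instead of the
     intended "12 24 40 70".  */
  printf ("%d %d %d %d\n", a[0], a[1], a[2], a[3]);
  printf ("%d %d %d %d\n", a[4], a[5], a[6], a[7]);
  return 0;
}

With the half-swap removed from q_load/q_store, the same permute gives
12 24 40 70, which is why only the permuted half of the testcase goes wrong.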

Hmmm. My feeling is that "properly" fixing this testcase amounts to fixing armeb's whole register-numbering/lane-flipping scheme, which might be quite a large task; OTOH it might also fix the significant number of failing vectorizer tests. A simpler solution might be to disable... some part of vector support... on armeb, but I'm not sure yet which part would be best.

Thoughts (CC maintainers)?

--Alan
