On 16/11/15 14:42, Christophe Lyon wrote:

Hi Alan,

I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
fails for armeb targets.
I haven't had time to look into it in more detail yet, but I guess you
can reproduce it quickly enough.


Thanks - yes I see it now.

The -fdump-tree-optimized output looks sensible:

__attribute__((noinline))
test ()
{
  vector(4) int vect__14.21;
  vector(4) int vect__2.20;
  vector(4) int vect__2.19;
  vector(4) int vect__3.13;
  vector(4) int vect__2.12;

  <bb 2>:
  vect__2.12_24 = MEM[(int *)&b];
  vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
  MEM[(int *)&a] = vect__3.13_27;
  vect__2.19_31 = MEM[(int *)&b + 16B];
  vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
  vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
  MEM[(int *)&a + 16B] = vect__14.21_35;
  return;
}
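
(In scalar terms that dump is doing roughly the following; this is just my
reconstruction from the GIMPLE, not the actual bb-slp-subgroups-3.c source:

int a[8], b[8];

__attribute__((noinline)) void
test (void)
{
  static const int addc[4] = { 1, 2, 3, 4 };
  static const int mulc[4] = { 3, 4, 5, 7 };
  for (int i = 0; i < 4; i++)
    a[i] = b[i] + addc[i];          /* the vadd.i32 / vst1.64 part below */
  for (int i = 0; i < 4; i++)
    a[i + 4] = b[2 * i] * mulc[i];  /* VEC_PERM_EXPR {0,2,4,6} + vmul.i32 */
}

i.e. a[4..7] should be the even elements of b times {3,4,5,7}.)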

but while a[0..3] end up containing 5 7 9 11 as expected, a[4..7] end up
with 30 32 30 28 rather than the expected 12 24 40 70. That is, we end up
with (10 8 6 4) rather than the expected (4 6 8 10) being multiplied by
{3,4,5,7}. Looking at the RTL, those values come from a UZP1/2 pair that
should extract elements {0,2,4,6} of b. Assembler, with my workings as to
what's in each register:

test:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        movw    r2, #:lower16:b
        movt    r2, #:upper16:b
        vldr    d22, .L11
        vldr    d23, .L11+8
;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
        movw    r3, #:lower16:a
        movt    r3, #:upper16:a
        vld1.64 {d16-d17}, [r2:64]
;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
        vmov    q9, q8  @ v4si
;; q9 = (b[2] b[3] b[0] b[1])
        vldr    d20, [r2, #16]
        vldr    d21, [r2, #24]
;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
        vuzp.32 q10, q9
;; So  q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
        vldr    d20, .L11+16
        vldr    d21, .L11+24
;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
        vmul.i32        q9, q9, q11
;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
        vadd.i32        q8, q8, q10
;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
        vst1.64 {d16-d17}, [r3:64]
;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3]=b[3]+4 all ok
        vstr    d18, [r3, #16]
;; a[4] = b[6]*3, a[5] = b[4]*4
        vstr    d19, [r3, #24]
;; a[6] = b[2]*5, a[7] = b[0]*7
        bx      lr
.L12:
        .align  3
.L11:
        .word   3
        .word   4
        .word   5
        .word   7
        .word   1
        .word   2
        .word   3
        .word   4

Which is to say: the element order in the q-registers is neither big- nor little-endian, but the elements get stored back to memory in an order consistent with how they were loaded, so we're OK as long as there are no permutes. Unfortunately a UZP selects elements by lane position, so this lane-ordering mixup makes it pick the wrong ones and messes everything up...
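
Here's a toy model in plain C of what I think is going on. It just assumes
the 64-bit-half swap visible in my workings above (loads swap the halves,
stores swap them back, the permute selects by lane position); it is an
illustration of the failure mode, not a claim about the exact armeb lane
numbering. It reproduces both the correct a[0..3] and the bogus a[4..7]:

#include <stdio.h>

typedef struct { int e[4]; } qreg;

/* Load four ints with the 64-bit halves swapped (the mixup seen above).  */
static qreg q_load (const int *p)
{
  qreg r = { { p[2], p[3], p[0], p[1] } };
  return r;
}

/* Store undoes the same swap, so load followed by store is a no-op.  */
static void q_store (int *p, qreg r)
{
  p[0] = r.e[2]; p[1] = r.e[3]; p[2] = r.e[0]; p[3] = r.e[1];
}

/* VEC_PERM_EXPR <lo, hi, {0,2,4,6}> as executed: it selects by lane
   position, oblivious to the half-swap.  */
static qreg q_uzp_even (qreg lo, qreg hi)
{
  qreg r = { { lo.e[0], lo.e[2], hi.e[0], hi.e[2] } };
  return r;
}

static qreg q_add (qreg x, qreg y)
{
  qreg r;
  for (int i = 0; i < 4; i++) r.e[i] = x.e[i] + y.e[i];
  return r;
}

static qreg q_mul (qreg x, qreg y)
{
  qreg r;
  for (int i = 0; i < 4; i++) r.e[i] = x.e[i] * y.e[i];
  return r;
}

int main (void)
{
  /* b as implied by the reported values (a[0..3] == 5 7 9 11 etc.).  */
  int b[8] = { 4, 5, 6, 7, 8, 9, 10, 11 };
  int addc[4] = { 1, 2, 3, 4 };
  int mulc[4] = { 3, 4, 5, 7 };
  int a[8];

  qreg lo = q_load (b), hi = q_load (b + 4);
  q_store (a, q_add (lo, q_load (addc)));                      /* vadd + vst1 */
  q_store (a + 4, q_mul (q_uzp_even (lo, hi), q_load (mulc))); /* vuzp + vmul */

  /* Prints "5 7 9 11" (correct: the swap cancels out across load/store)
     and then "30 32 30 28" (the observed wrong result) instead of the
     intended "12 24 40 70".  */
  printf ("%d %d %d %d\n", a[0], a[1], a[2], a[3]);
  printf ("%d %d %d %d\n", a[4], a[5], a[6], a[7]);
  return 0;
}

With the half-swap removed from q_load/q_store, the same permute gives
12 24 40 70, which is why only the permuted half of the testcase goes wrong.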

Hmmm. My feeling is that "properly" fixing this testcase amounts to fixing armeb's whole register-numbering/lane-flipping scheme, which might be quite a large task; OTOH it might also fix the significant number of failing vectorizer tests. A simpler solution might be to disable... some part of vector support... on armeb, but I'm not sure yet which part would be best.

Thoughts (CC maintainers)?

--Alan
