On Tue, Nov 17, 2015 at 5:46 PM, Alan Lawrence <alan.lawre...@arm.com> wrote:
> On 16/11/15 14:42, Christophe Lyon wrote:
>>
>> Hi Alan,
>>
>> I've noticed that this new test (gcc.dg/vect/bb-slp-subgroups-3.c)
>> fails for armeb targets.
>> I haven't had time to look at more details yet, but I guess you can
>> reproduce it quickly enough.
>>
>
> Thanks - yes I see it now.
>
> -fdump-tree-optimized looks sensible:
>
> __attribute__((noinline))
> test ()
> {
>   vector(4) int vect__14.21;
>   vector(4) int vect__2.20;
>   vector(4) int vect__2.19;
>   vector(4) int vect__3.13;
>   vector(4) int vect__2.12;
>
>   <bb 2>:
>   vect__2.12_24 = MEM[(int *)&b];
>   vect__3.13_27 = vect__2.12_24 + { 1, 2, 3, 4 };
>   MEM[(int *)&a] = vect__3.13_27;
>   vect__2.19_31 = MEM[(int *)&b + 16B];
>   vect__2.20_33 = VEC_PERM_EXPR <vect__2.12_24, vect__2.19_31, { 0, 2, 4, 6 }>;
>   vect__14.21_35 = vect__2.20_33 * { 3, 4, 5, 7 };
>   MEM[(int *)&a + 16B] = vect__14.21_35;
>   return;
> }
>
> but while a[0...3] end up containing 5 7 9 11 as expected,
> a[4..7] end up with 30 32 30 28 rather than the expected 12 24 40 70.
> That is, we end up with (10 8 6 4), rather than the expected (4 6 8 10),
> being multiplied by {3,4,5,7}. Looking at the RTL, those values come from
> a UZP1/2 pair that should extract elements {0,2,4,6} of b. Assembler,
> with my workings as to what's in each register:
>
> test:
>         @ args = 0, pretend = 0, frame = 0
>         @ frame_needed = 0, uses_anonymous_args = 0
>         @ link register save eliminated.
>         movw    r2, #:lower16:b
>         movt    r2, #:upper16:b
>         vldr    d22, .L11
>         vldr    d23, .L11+8
> ;; So d22 = (3 4), d23 = (5 7), q11 = (5 7 3 4)
>         movw    r3, #:lower16:a
>         movt    r3, #:upper16:a
>         vld1.64 {d16-d17}, [r2:64]
> ;; So d16 = (b[0] b[1]), d17 = (b[2] b[3]), q8 = (b[2] b[3] b[0] b[1])
>         vmov    q9, q8  @ v4si
> ;; q9 = (b[2] b[3] b[0] b[1])
>         vldr    d20, [r2, #16]
>         vldr    d21, [r2, #24]
> ;; So d20 = (b[4] b[5]), d21 = (b[6] b[7]), q10 = (b[6] b[7] b[4] b[5])
>         vuzp.32 q10, q9
> ;; So q10 = (b[3] b[1] b[7] b[5]), i.e. d20 = (b[7] b[5]) and d21 = (b[3] b[1])
> ;; and q9 = (b[2] b[0] b[6] b[4]), i.e. d18 = (b[6] b[4]) and d19 = (b[2] b[0])
>         vldr    d20, .L11+16
>         vldr    d21, .L11+24
> ;; d20 = (1 2), d21 = (3 4), q10 = (3 4 1 2)
>         vmul.i32        q9, q9, q11
> ;; q9 = (b[2]*5 b[0]*7 b[6]*3 b[4]*4)
> ;; i.e. d18 = (b[6]*3 b[4]*4) and d19 = (b[2]*5 b[0]*7)
>         vadd.i32        q8, q8, q10
> ;; q8 = (b[2]+3 b[3]+4 b[0]+1 b[1]+2)
> ;; i.e. d16 = (b[0]+1 b[1]+2), d17 = (b[2]+3 b[3]+4)
>         vst1.64 {d16-d17}, [r3:64]
> ;; a[0] = b[0]+1, a[1] = b[1]+2, a[2] = b[2]+3, a[3] = b[3]+4 all ok
>         vstr    d18, [r3, #16]
> ;; a[4] = b[6]*3, a[5] = b[4]*4
>         vstr    d19, [r3, #24]
> ;; a[6] = b[2]*5, a[7] = b[0]*7
>         bx      lr
> .L12:
>         .align  3
> .L11:
>         .word   3
>         .word   4
>         .word   5
>         .word   7
>         .word   1
>         .word   2
>         .word   3
>         .word   4
>
> Which is to say - the bit order in the q-registers is neither big- nor
> little-endian, but the elements get stored back to memory in an order
> consistent with how they were loaded, so we're OK as long as there are
> no permutes. Unfortunately for UZP this lane-ordering mixup is not
> idempotent and messes everything up...
>
> Hmmm. I'm feeling that "properly" fixing this testcase amounts to fixing
> armeb's whole register-numbering/lane-flipping scheme, and might be quite
> a large task. OTOH it might also fix the significant number of failing
> vectorizer tests. A simpler solution might be to disable... some part of
> vector support... on armeb, but I'm not sure which part would be best, yet.
>
> Thoughts (CC maintainers)?
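
To make the expected values concrete, here is a scalar sketch of what the
vectorized dump above is meant to compute. This is a reconstruction for
illustration only; the actual source of gcc.dg/vect/bb-slp-subgroups-3.c
(in particular the initialisation of b[]) is not quoted here.

/* Illustrative scalar equivalent of the vectorized dump above; array
   declarations and helper names are assumed, not taken from the test.  */
int a[8], b[8];

void
test_scalar (void)
{
  static const int add[4] = { 1, 2, 3, 4 };
  static const int mul[4] = { 3, 4, 5, 7 };
  for (int i = 0; i < 4; i++)
    {
      a[i] = b[i] + add[i];          /* vect__3.13 = vect__2.12 + {1,2,3,4} */
      a[i + 4] = b[2 * i] * mul[i];  /* VEC_PERM <b[0..3], b[4..7], {0,2,4,6}>
                                        multiplied by {3,4,5,7} */
    }
}

The quoted results are consistent with this: a[0..3] = 5 7 9 11 implies
b[0..3] = {4,5,6,7}, the expected a[4..7] = 12 24 40 70 is {4,6,8,10} times
{3,4,5,7}, and the observed 30 32 30 28 is exactly the reversed (10 8 6 4)
times {3,4,5,7}.
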
As most of the permutation support in the arm backend is disabled for
armeb, I suppose disabling some more of it is appropriate... or finally
sitting down and fixing the mess. For constant permutes, fixing it on the
constant side would be an option, I guess.

Richard.

> --Alan
>
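
As background for the "constant side" option: the selector of a GIMPLE
VEC_PERM_EXPR indexes the concatenation of its two inputs in element
(memory) order, independent of how a backend numbers lanes inside
registers. A scalar model of the { 0, 2, 4, 6 } permute from the dump
above, purely as an illustration (this is not code from the arm backend):

/* Scalar model of VEC_PERM_EXPR <x, y, sel> for 4-element int vectors:
   selector values 0..3 pick elements of x, values 4..7 pick elements of y;
   sel = { 0, 2, 4, 6 } therefore picks b[0], b[2], b[4], b[6].  */
static void
vec_perm4 (const int *x, const int *y, const unsigned char *sel, int *out)
{
  for (int i = 0; i < 4; i++)
    out[i] = sel[i] < 4 ? x[sel[i]] : y[sel[i] - 4];
}

A constant-side fix would then amount to remapping the selector the backend
hands to UZP/ZIP/TBL so that, given armeb's in-register lane numbering, the
instruction still produces this element-order result.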