On Tue, 1 Oct 2019, Joel Hutton wrote: > On 01/10/2019 12:07, Joel wrote: > > > > SLP vectorization: vectorize vector constructors > > > > > > Currently SLP vectorization can build SLP trees staring from > > reductions or from group stores. This patch adds a third starting > > point: vector constructors. > > > > > > For the following test case (compiled with -O3 -fno-vect-cost-model): > > > > > > char g_d[1024], g_s1[1024], g_s2[1024]; void test_loop(void) { char /d > > = g_d, /s1 = g_s1, *s2 = g_s2; > > > > > > for ( int y = 0; y < 128; y++ ) > > { > > for ( int x = 0; x < 16; x++ ) > > d[x] = s1[x] + s2[x]; > > d += 16; > > } > > > > } > > > > > > before patch: test_loop: .LFB0: .cfi_startproc adrp x0, g_s1 adrp x2, > > g_s2 add x3, x0, :lo12:g_s1 add x4, x2, :lo12:g_s2 ldrb w7, [x2, > > #:lo12:g_s2] ldrb w1, [x0, #:lo12:g_s1] adrp x0, g_d ldrb w6, [x4, 1] > > add x0, x0, :lo12:g_d ldrb w5, [x3, 1] add w1, w1, w7 fmov s0, w1 ldrb > > w7, [x4, 2] add w5, w5, w6 ldrb w1, [x3, 2] ldrb w6, [x4, 3] add x2, > > x0, 2048 ins v0.b[1], w5 add w1, w1, w7 ldrb w7, [x3, 3] ldrb w5, [x4, > > 4] add w7, w7, w6 ldrb w6, [x3, 4] ins v0.b[2], w1 ldrb w8, [x4, 5] > > add w6, w6, w5 ldrb w5, [x3, 5] ldrb w9, [x4, 6] add w5, w5, w8 ldrb > > w1, [x3, 6] ins v0.b[3], w7 ldrb w8, [x4, 7] add w1, w1, w9 ldrb w11, > > [x3, 7] ldrb w7, [x4, 8] add w11, w11, w8 ldrb w10, [x3, 8] ins > > v0.b[4], w6 ldrb w8, [x4, 9] add w10, w10, w7 ldrb w9, [x3, 9] ldrb > > w7, [x4, 10] add w9, w9, w8 ldrb w8, [x3, 10] ins v0.b[5], w5 ldrb w6, > > [x4, 11] add w8, w8, w7 ldrb w7, [x3, 11] ldrb w5, [x4, 12] add w7, > > w7, w6 ldrb w6, [x3, 12] ins v0.b[6], w1 ldrb w12, [x4, 13] add w6, > > w6, w5 ldrb w5, [x3, 13] ldrb w1, [x3, 14] add w5, w5, w12 ldrb w13, > > [x4, 14] ins v0.b[7], w11 ldrb w12, [x4, 15] add w4, w1, w13 ldrb w1, > > [x3, 15] add w1, w1, w12 ins v0.b[8], w10 ins v0.b[9], w9 ins > > v0.b[10], w8 ins v0.b[11], w7 ins v0.b[12], w6 ins v0.b[13], w5 ins > > v0.b[14], w4 ins v0.b[15], w1 .p2align 3,,7 .L2: str q0, [x0], 16 cmp > > x2, x0 bne .L2 ret .cfi_endproc .LFE0: > > > > > > After patch: > > > > > > test_loop: .LFB0: .cfi_startproc adrp x3, g_s1 adrp x2, g_s2 add x3, > > x3, :lo12:g_s1 add x2, x2, :lo12:g_s2 adrp x0, g_d add x0, x0, > > :lo12:g_d add x1, x0, 2048 ldr q1, [x2] ldr q0, [x3] add v0.16b, > > v0.16b, v1.16b .p2align 3,,7 .L2: str q0, [x0], 16 cmp x0, x1 bne .L2 > > ret .cfi_endproc .LFE0: > > > > > > > > > > bootstrapped and tested on aarch64-none-linux-gnu > > > Patch attached:
I think a better place for the loop searching for CONSTRUCTORs is vect_slp_analyze_bb_1 where I'd put it before the check you remove, and I'd simply append found CONSTRUCTORs to the grouped_stores array. The fixup you do in vectorizable_operation doesn't belong there either, I'd add a new field to the SLP instance structure refering to the CONSTRUCTOR stmt and do the fixup in vect_schedule_slp_instance instead where you can simply replace the CONSTRUCTOR with the vectorized SSA name then. + /* Check that the constructor elements are unique. */ + FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (rhs), i, val) + { + tree prev_val; + int j; + FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (rhs), j, prev_val) + { + if (val == prev_val && i!=j) why's that necessary? (it looks incomplete, also doesn't catch [duplicate] constants) You miss to check that CONSTRUCTOR_NELTS == TYPE_VECTOR_SUBPARTS (we can have omitted trailing zeros). What happens if you have a vector constructor that is twice as large as the machine supports? The vectorizer will happily produce a two vector SSA name vectorized result but your CONSTRUCTOR replacement doesn't work here. I think this should be made work correctly (not give up on that case). Thanks, Richard.