On Wed, Mar 6, 2019 at 8:46 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > On Tue, Mar 5, 2019 at 1:46 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > On Mon, Mar 04, 2019 at 12:55:04PM +0100, Richard Biener wrote: > > > On Sun, Mar 3, 2019 at 10:13 PM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > > > On Sun, Mar 03, 2019 at 06:40:09AM -0800, Andrew Pinski wrote: > > > > > ) > > > > > ,On Sun, Mar 3, 2019 at 6:32 AM H.J. Lu <hjl.to...@gmail.com> wrote: > > > > > > > > > > > > For vector init constructor: > > > > > > > > > > > > --- > > > > > > typedef float __v4sf __attribute__ ((__vector_size__ (16))); > > > > > > > > > > > > __v4sf > > > > > > foo (__v4sf x, float f) > > > > > > { > > > > > > __v4sf y = { f, x[1], x[2], x[3] }; > > > > > > return y; > > > > > > } > > > > > > --- > > > > > > > > > > > > we can optimize vector init constructor with vector copy or permute > > > > > > followed by a single scalar insert: > > > > > and you want to advance to the _1 = BIT_INSERT_EXPR here. The easiest way > > > is to emit a new stmt for _2 = copy ...; and do the set_rhs with the > > > BIT_INSERT_EXPR. > > > > Thanks for BIT_INSERT_EXPR suggestion. I am testing this patch. > > > > > > H.J. > > --- > > We can optimize vector constructor with vector copy or permute followed > > by a single scalar insert: > > > > __v4sf y; > > __v4sf D.1930; > > float _1; > > float _2; > > float _3; > > > > <bb 2> : > > _1 = BIT_FIELD_REF <x_9(D), 32, 96>; > > _2 = BIT_FIELD_REF <x_9(D), 32, 64>; > > _3 = BIT_FIELD_REF <x_9(D), 32, 32>; > > y_6 = {f_5(D), _3, _2, _1}; > > return y_6; > > > > with > > > > __v4sf y; > > __v4sf D.1930; > > float _1; > > float _2; > > float _3; > > vector(4) float _8; > > > > <bb 2> : > > _1 = BIT_FIELD_REF <x_9(D), 32, 96>; > > _2 = BIT_FIELD_REF <x_9(D), 32, 64>; > > _3 = BIT_FIELD_REF <x_9(D), 32, 32>; > > _8 = x_9(D); > > y_6 = BIT_INSERT_EXPR <x_9(D), f_5(D), 0 (32 bits)>; > > return y_6; > > > > gcc/ > > > > PR tree-optimization/88828 > > * tree-ssa-forwprop.c (simplify_vector_constructor): Optimize > > vector init constructor with vector copy or permute followed > > by a single scalar insert. > > > > gcc/testsuite/ > > > > PR tree-optimization/88828 > > * gcc.target/i386/pr88828-1a.c: New test. > > * gcc.target/i386/pr88828-2b.c: Likewise. > > * gcc.target/i386/pr88828-2.c: Likewise. > > * gcc.target/i386/pr88828-3a.c: Likewise. > > * gcc.target/i386/pr88828-3b.c: Likewise. > > * gcc.target/i386/pr88828-3c.c: Likewise. > > * gcc.target/i386/pr88828-3d.c: Likewise. > > * gcc.target/i386/pr88828-4a.c: Likewise. > > * gcc.target/i386/pr88828-4b.c: Likewise. > > * gcc.target/i386/pr88828-5a.c: Likewise. > > * gcc.target/i386/pr88828-5b.c: Likewise. > > * gcc.target/i386/pr88828-6a.c: Likewise. > > * gcc.target/i386/pr88828-6b.c: Likewise. > > Here is the updated patch with run-time tests.
- if (TREE_CODE (elt->value) != SSA_NAME) + if (TREE_CODE (ce->value) != SSA_NAME) return false; hmm, so it doesn't allow { 0, v[1], v[2], v[3] }? I think the single scalar value can be a constant as well. if (!def_stmt) - return false; + { + if (gimple_nop_p (SSA_NAME_DEF_STMT (ce->value))) if (SSA_NAME_IS_DEFAULT_DEF (ce->value)) + { also you seem to disallow { i + 1, v[1], v[2], v[3] } because get_prop_source_stmt will return the definition computing i + 1 in this case and your code will be skipped? I think you can simplify the code by treating scalar_element != NULL as nscalars == 1 and eliding nscalars. - if (conv_code == ERROR_MARK) + if (conv_code == ERROR_MARK && !insert) gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig[0], orig[1], op2); else @@ -2148,10 +2198,25 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi) VEC_PERM_EXPR, orig[0], orig[1], op2); orig[0] = gimple_assign_lhs (perm); gsi_insert_before (gsi, perm, GSI_SAME_STMT); - gimple_assign_set_rhs_with_ops (gsi, conv_code, orig[0], + gimple_assign_set_rhs_with_ops (gsi, + (conv_code != ERROR_MARK + ? conv_code + : NOP_EXPR), + orig[0], NULL_TREE, NULL_TREE); I believe you should elide the last stmt for conv_code == ERROR_MARK, that is, why did you need to add the && !insert check in the guarding condition (this path should already do the correct thing?). Note that in all cases it looks that with conv_code != ERROR_MARK you may end up doing a float->int or int->float conversion on a value it wasn't done on before which might raise exceptions? That is, do we need to make sure we permute a value we already do convert into the place we're going to insert to? + if (insert) + { + /* Generate a single scalar insert. */ + tree var = make_ssa_name (type); + tree val = gimple_assign_rhs1 (stmt); + gimple *copy = gimple_build_assign (var, val); I believe this doesn't properly copy the stmt in case it is a permute. You can use (note the use of gsi_stmt - gimple_assign_set_rhs_with_ops can re-allocate the stmt) gimple *copy = gimple_copy (gsi_stmt (*gsi)); gimple_assign_set_lhs (copy, var); + gsi_insert_before (gsi, copy, GSI_SAME_STMT); + tree bitpos = bitsize_int (scalar_idx * elem_size); + gimple_assign_set_rhs_with_ops (gsi, BIT_INSERT_EXPR, var, + scalar_element, bitpos); + } Otherwise looks OK to me. As separate followup patch it might be interesting to support { 0, a[1], a[2], 3 } kinds as well, thus combining a VECTOR_CST (which is reasonably cheap to create) with another vector. That should be maybe done as a first patch given this is just a two-vector permute which the code already handles apart from not recognizing the implicit constant vector participating. Similar { 0, a[1], b[2], 3 } where the combination of a and b is blended with another constant vector. I'm not sure if handling an arbitrary number of scalar elements should be done in a similar way, that is, implementing { s1, a[1], a[2], s2, s3, b[0], b[1], b[2] } as tem = VEC_PERM <a, b, { ... }> tem2 = { s1, 0, 0, s2, s3, 0, 0, 0 } res = VEC_PERM <tem, tem2, { blend-mask }> where constructing tem2 should take at most N-1 inserts (the first element to insert into tem2 can use a splat or if element zero a zero-extending move). Doing this effectively lifts the restriction of only handling two vectors - we'd incrementally do two-vector permute plus blend of the rest which has its constructor re-processed. But as said - the code is already a bit awkward so changing this in multiple reivisions is preferred and the single-element case is certainly sth to do via a BIT_INSERT_EXPR. Thanks, Richard. > -- > H.J.