Currently strided SLP vectorization creates vector constructors composed of vector elements.  This is a constructor form that is not handled specially by the expander; it gets expanded via piecewise stores to scratch memory and a load of that scratch memory.  This is obviously bad on any modern CPU: store forwarding from the smaller stores to the larger load does not work on any CPU I know of, so the load stalls.  The following patch simply avoids the issue by making the vectorizer emit integer loads, compose a vector of those integers and then pun that to the desired vector type.  Thus (V4SF){V2SF, V2SF} becomes (V4SF)(V2DI){DI, DI} and everybody is happy.  x264 in particular gets a 5-10% improvement (depending on vector size and x86 sub-architecture).
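To make the transform concrete, here is a small stand-alone GNU C sketch (not part of the patch; the function name and the use of memcpy are just for illustration) of what the punned form corresponds to for two-element float groups filling a V4SF vector:

  typedef float v4sf __attribute__ ((vector_size (16)));
  typedef unsigned long long v2di __attribute__ ((vector_size (16)));

  /* Load two float pairs, a[0..1] and a[stride..stride+1].  Rather than
     building (v4sf){(v2sf) lo, (v2sf) hi}, which the expander pushes
     through scratch memory, load each pair as a single 64-bit integer,
     build a v2di from those and view-convert the result to v4sf.  */
  v4sf
  load_two_pairs (const float *a, unsigned long stride)
  {
    unsigned long long lo, hi;
    __builtin_memcpy (&lo, a, sizeof lo);
    __builtin_memcpy (&hi, a + stride, sizeof hi);
    v2di tmp = { lo, hi };
    return (v4sf) tmp;	/* same bits, no store/reload round-trip */
  }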
Handling the vector-vector constructors on the expander side would require either similar punning or making vec_init parametric on the element mode plus supporting vector elements in all targets (which in the end probably will simply pun them similarly).

Bootstrap and regtest running on x86_64-unknown-linux-gnu.

Any comments?

Thanks,
Richard.

2016-09-28  Richard Biener  <rguent...@suse.de>

	* tree-vect-stmts.c (vectorizable_load): Avoid emitting vector
	constructors with vector elements.

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 240565)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -6862,17 +6925,40 @@ vectorizable_load (gimple *stmt, gimple_
       int nloads = nunits;
       int lnel = 1;
       tree ltype = TREE_TYPE (vectype);
+      tree lvectype = vectype;
       auto_vec<tree> dr_chain;
       if (memory_access_type == VMAT_STRIDED_SLP)
 	{
-	  nloads = nunits / group_size;
 	  if (group_size < nunits)
 	    {
-	      lnel = group_size;
-	      ltype = build_vector_type (TREE_TYPE (vectype), group_size);
+	      /* Avoid emitting a constructor of vector elements by performing
+		 the loads using an integer type of the same size,
+		 constructing a vector of those and then re-interpreting it
+		 as the original vector type.  This works around the fact
+		 that the vec_init optab was only designed for scalar
+		 element modes and thus expansion goes through memory.
+		 This avoids a huge runtime penalty due to the general
+		 inability to perform store forwarding from smaller stores
+		 to a larger load.  */
+	      unsigned lsize
+		= group_size * TYPE_PRECISION (TREE_TYPE (vectype));
+	      enum machine_mode elmode = mode_for_size (lsize, MODE_INT, 0);
+	      enum machine_mode vmode = mode_for_vector (elmode,
+							 nunits / group_size);
+	      /* If we can't construct such a vector fall back to
+		 element loads of the original vector type.  */
+	      if (VECTOR_MODE_P (vmode)
+		  && optab_handler (vec_init_optab, vmode) != CODE_FOR_nothing)
+		{
+		  nloads = nunits / group_size;
+		  lnel = group_size;
+		  ltype = build_nonstandard_integer_type (lsize, 1);
+		  lvectype = build_vector_type (ltype, nloads);
+		}
 	    }
 	  else
 	    {
+	      nloads = 1;
 	      lnel = nunits;
 	      ltype = vectype;
 	    }
@@ -6925,9 +7011,17 @@ vectorizable_load (gimple *stmt, gimple_
 	}
       if (nloads > 1)
 	{
-	  tree vec_inv = build_constructor (vectype, v);
-	  new_temp = vect_init_vector (stmt, vec_inv, vectype, gsi);
+	  tree vec_inv = build_constructor (lvectype, v);
+	  new_temp = vect_init_vector (stmt, vec_inv, lvectype, gsi);
 	  new_stmt = SSA_NAME_DEF_STMT (new_temp);
+	  if (lvectype != vectype)
+	    {
+	      new_stmt = gimple_build_assign (make_ssa_name (vectype),
+					      VIEW_CONVERT_EXPR,
+					      build1 (VIEW_CONVERT_EXPR,
+						      vectype, new_temp));
+	      vect_finish_stmt_generation (stmt, new_stmt, gsi);
+	    }
 	}
       if (slp)