The following patch adds FRE after vectorization which is needed for IVOPTs to remove redundant PHI nodes (well, I'm testing a patch for FRE that will do it already there).
The patch also makes FRE preserve loop-closed SSA form, which makes it suitable for use in the loop pipeline.

With the placement in the vectorizer sub-pass, FRE will effectively be enabled by -O3 only (well, or if one requests loop vectorization). I've considered placing it after complete_unroll instead, but that would enable it at -O1 already. I have no strong opinion on the exact placement, but it should help all passes between vectorization and IVOPTs for vectorized loops.

Yeah, it adds yet another pass and thus I don't like it very much. But it does, for example, improve the code generated for gfortran.dg/vect/fast-math-pr37021.f90 from

.L14:
        movupd  (%r11), %xmm3
        addl    $1, %ecx
        addq    %rax, %r11
        movupd  (%r8), %xmm0
        addq    %rax, %r8
        unpckhpd        %xmm3, %xmm3
        movupd  (%rdi), %xmm2
        unpcklpd        %xmm0, %xmm0
        addq    %rsi, %rdi
        movupd  (%rbx), %xmm1
        mulpd   %xmm3, %xmm2
        addq    %rsi, %rbx
        cmpl    %ecx, %ebp
        palignr $8, %xmm1, %xmm1
        mulpd   %xmm1, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm0, %xmm1
        subpd   %xmm2, %xmm0
        shufpd  $2, %xmm0, %xmm1
        addpd   %xmm1, %xmm4
        jne     .L14

to

.L14:
        movupd  (%r8), %xmm0
        addl    $1, %ecx
        addq    %rax, %r8
        movapd  %xmm0, %xmm2
        movupd  (%rdi), %xmm1
        addq    %rsi, %rdi
        cmpl    %ecx, %r11d
        unpckhpd        %xmm0, %xmm2
        unpcklpd        %xmm0, %xmm0
        mulpd   %xmm1, %xmm2
        palignr $8, %xmm1, %xmm1
        mulpd   %xmm1, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm0, %xmm1
        subpd   %xmm2, %xmm0
        shufpd  $2, %xmm0, %xmm1
        addpd   %xmm1, %xmm3
        jne     .L14

(yeah, the vectorizer happily generates redundant loads and one IV for each such load)

Any other suggestions on pass placement? I can of course key that FRE run on -O3 explicitly. I'm not sure we want to start playing fancy games at this point, like setting a property when a pass (likely) generated redundancies that are worth fixing up and then keying FRE on that property (it gets harder and less predictable which transforms are run on the code).

Bootstrap / regtest running on x86_64-unknown-linux-gnu. With other placements I'd expect quite some testsuite fallout eventually.

Thoughts?

Thanks,
Richard.

2015-06-10  Richard Biener  <rguent...@suse.de>

        * passes.def (pass_vectorize): Add pass_fre.
        * tree-ssa-pre.c (eliminate_dom_walker::before_dom_children):
        Preserve loop-closed SSA form.

Index: gcc/passes.def
===================================================================
*** gcc/passes.def      (revision 224324)
--- gcc/passes.def      (working copy)
*************** along with GCC; see the file COPYING3.
*** 252,257 ****
--- 252,258 ----
             Please do not add any other passes in between.  */
          NEXT_PASS (pass_vectorize);
          PUSH_INSERT_PASSES_WITHIN (pass_vectorize)
+             NEXT_PASS (pass_fre);
              NEXT_PASS (pass_dce);
          POP_INSERT_PASSES ()
          NEXT_PASS (pass_predcom);
Index: gcc/tree-ssa-pre.c
===================================================================
*** gcc/tree-ssa-pre.c  (revision 224324)
--- gcc/tree-ssa-pre.c  (working copy)
*************** eliminate_dom_walker::before_dom_childre
*** 4013,4018 ****
--- 4013,4028 ----
       tailmerging.  Eventually we can reduce its reliance on SCCVN now
       that we fully copy/constant-propagate (most) things.  */
  
+   /* Compute whether this block has loop-closed PHI nodes we need
+      to preserve.  */
+   bool lc_phi = false;
+   edge e;
+   if (loops_state_satisfies_p (LOOP_CLOSED_SSA)
+       && single_pred_p (b)
+       && (e = single_pred_edge (b))
+       && loop_exit_edge_p (e->src->loop_father, e))
+     lc_phi = true;
+ 
    for (gphi_iterator gsi = gsi_start_phis (b); !gsi_end_p (gsi);)
      {
        gphi *phi = gsi.phi ();
*************** eliminate_dom_walker::before_dom_childre
*** 4026,4032 ****
  
        tree sprime = eliminate_avail (res);
        if (sprime
!           && sprime != res)
          {
            if (dump_file && (dump_flags & TDF_DETAILS))
              {
--- 4036,4043 ----
  
        tree sprime = eliminate_avail (res);
        if (sprime
!           && sprime != res
!           && !lc_phi)
          {
            if (dump_file && (dump_flags & TDF_DETAILS))
              {
*************** eliminate_dom_walker::before_dom_childre
*** 4466,4472 ****
  
    /* Replace destination PHI arguments.  */
    edge_iterator ei;
-   edge e;
    FOR_EACH_EDGE (e, ei, b->succs)
      {
        for (gphi_iterator gsi = gsi_start_phis (e->dest);
--- 4477,4482 ----
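
For anyone who wants to play with the lc_phi condition outside of GCC, it can be modelled roughly as below. This is only a toy sketch under my own assumptions: Loop, Block, inside_loop, is_loop_exit_edge and has_loop_closed_phis are made-up names, not GCC APIs, and the loop tree is reduced to a parent pointer. It mirrors the shape of the test in the patch: a block is treated as holding loop-closed PHIs when loop-closed SSA is active, the block has a single predecessor, and that incoming edge leaves the predecessor's innermost loop.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for GCC's loop tree and basic blocks (hypothetical names).
struct Loop {
  const Loop *outer;  // enclosing loop; nullptr for the root of the loop tree
};

struct Block {
  const Loop *loop_father;           // innermost loop containing this block
  std::vector<const Block *> preds;  // predecessor blocks
};

// True if bb belongs to loop or to any loop nested inside it.
static bool inside_loop(const Loop *loop, const Block *bb) {
  for (const Loop *l = bb->loop_father; l != nullptr; l = l->outer)
    if (l == loop)
      return true;
  return false;
}

// The edge src->dest exits loop if src is inside the loop and dest is not.
static bool is_loop_exit_edge(const Loop *loop, const Block *src,
                              const Block *dest) {
  return inside_loop(loop, src) && !inside_loop(loop, dest);
}

// Same shape as the lc_phi computation in the patch: loop-closed SSA must be
// active, the block must have exactly one predecessor, and the incoming edge
// must exit the predecessor's innermost loop.
static bool has_loop_closed_phis(const Block *b, bool loop_closed_ssa) {
  if (!loop_closed_ssa || b->preds.size() != 1)
    return false;
  const Block *src = b->preds[0];
  return is_loop_exit_edge(src->loop_father, src, b);
}
```

In this model a block reached only through a loop exit edge reports true, so a PHI in it would be left alone even if an equivalent value is available, which is the behavior the `&& !lc_phi` guard is after.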