https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686
--- Comment #3 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 18 Apr 2016, alekshs at hotmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70686
>
> --- Comment #2 from alekshs at hotmail dot com ---
> (In reply to Richard Biener from comment #1)
> > It's not so mind-blowing - it's simply that -fprofile-generate makes our
> > GIMPLE level if-conversion no longer apply. Without -fprofile-generate
> > we if-convert the loop into
> >
> >   for (i = 1; i <100000001; i++)
> >     {
> >       ...
> >
> >       b = b + (b < 1.00001) ? i + 12.43 : 0.0;
> >       ...
> >     }
> >
> > thus we always evaluate the i + 12.43 and one additional addition of zero.
> >
> > We do this to eventually enable vectorization but without any check
> > on whether it would be profitable when not vectorizing (your testcase
> > shows it's not profitable).
> >
> > Confirmed. -fno-tree-loop-if-convert should fix it in this particular case.
>
> Aha, thanks for the swift reply.
>
> Regarding profitability, I should note that the PGO misses entirely the fact
> that 20 mulsd could become 10 mulpd:
>
>
>   400560:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
>   400564:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
>   400568:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
>   40056c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
>   400570:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
>   400574:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
>   400578:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
>   40057c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
>   400580:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
>   400584:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
>   400588:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
>   40058c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
>   400590:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
>   400594:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
>   400598:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
>   40059c:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
>   4005a0:       f2 0f 59 e9             mulsd  %xmm1,%xmm5
>   4005a4:       f2 0f 59 e1             mulsd  %xmm1,%xmm4
>   4005a8:       f2 0f 59 d9             mulsd  %xmm1,%xmm3
>   4005ac:       f2 0f 59 d1             mulsd  %xmm1,%xmm2
>
>
> ...So there was job to be done there. That's at -03 -march=native btw (to
> preserve accuracy, unlike -Ofast). Ofast too doesn't pack them. It kind of
> splits to scalar muls and packed adds.

vectorization is confused by you computing a reduction that is broken
by the if (). This isn't easily vectorized.
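
For readers following along, here is a rough sketch of the two loop shapes discussed in comment #1. It is not the reporter's testcase: the function names, the starting value of b, and the loop bound parameter n are made up for illustration, and the statements elided as "..." in the quoted loop are left out. The first function keeps the branch. The second is a hand-written version of what the if-converted loop computes, assuming the conditional applies to the added term.

```c
/* Hypothetical stand-in for the reduction in the bug report.
   Branchy form: i + 12.43 is only computed when the condition holds. */
static double reduce_branchy(int n)
{
    double b = 0.0;
    for (int i = 1; i < n; i++) {
        if (b < 1.00001)
            b = b + (i + 12.43);
    }
    return b;
}

/* Hand-written equivalent of the if-converted loop: i + 12.43 is
   evaluated on every iteration, and 0.0 is added when the condition
   is false. */
static double reduce_if_converted(int n)
{
    double b = 0.0;
    for (int i = 1; i < n; i++) {
        double t = i + 12.43;               /* always evaluated */
        b = b + ((b < 1.00001) ? t : 0.0);  /* select, then add */
    }
    return b;
}
```

In this stand-in the condition is true only on the first iteration, so the second form does an extra addition and an add of zero on every later iteration without changing the result. That is the kind of cost that only pays off if the loop ends up vectorized.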