https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84490
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-04-10
   Target Milestone|---                         |8.0
            Summary|436.cactusADM regressed by  |[8 regression]
                   |6-8% percent with -Ofast on |436.cactusADM regressed by
                   |Zen, compared to gcc 7.2    |6-8% percent with -Ofast on
                   |                            |Zen and Haswell, compared
                   |                            |to gcc 7.2
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I also see this for Haswell:
https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/index.html
There it's more like 10-14%, depending on which parts you look at.

For bisection it's a bit weird:

201710240032 r254030 base 48.3 peak 52.2
201710230039 r253996 base 64.7 peak 57.2
201710221240 r253982 base 64.6 peak 65.8
201710210035 r253966 base 65.6 peak 65.2

where base is -Ofast -march=haswell and peak adds -flto. Note it might be
that around this time I disabled address-space randomization, in case it is
an issue similar to PR82362. I just don't remember exactly, so I'd have to
reproduce the regression around these revisions.

Between r253982 and r253996 the culprit is likely

r253993 | hubicka | 2017-10-23 00:09:47 +0200 (Mon, 23 Oct 2017) | 12 lines

	* i386.c (ix86_builtin_vectorization_cost): Use existing rtx_cost
	latencies instead of having a separate table; make a difference
	between integer and float costs.
	* i386.h (processor_costs): Remove scalar_stmt_cost,
	scalar_load_cost, scalar_store_cost, vec_stmt_cost,
	vec_to_scalar_cost, scalar_to_vec_cost, vec_align_load_cost,
	vec_unalign_load_cost, vec_store_cost.
	* x86-tune-costs.h: Remove entries which have been removed in
	processor_costs from all tables; make cond_taken_branch_cost and
	cond_not_taken_branch_cost COST_N_INSNS based.
Similarly, the other range includes

r254012 | hubicka | 2017-10-23 17:10:09 +0200 (Mon, 23 Oct 2017) | 15 lines

	* i386.c (dimode_scalar_chain::compute_convert_gain): Use
	xmm_move instead of sse_move.
	(sse_store_index): New function.
	(ix86_register_move_cost): Be more sensible about the mismatch
	stall; model AVX moves correctly; distinguish between
	sse->integer and integer->sse.
	(ix86_builtin_vectorization_cost): Model aligned and unaligned
	moves correctly; distinguish between SSE and AVX.
	* i386.h (processor_costs): Remove sse_move; add xmm_move,
	ymm_move and zmm_move. Increase the size of the sse load and
	store tables; add unaligned load and store tables; add
	ssemmx_to_integer.
	* x86-tune-costs.h: Update all entries according to real move
	latencies from Agner Fog's manual and chip documentation.

so at first glance it indeed looks like a target (vectorization) cost-model
issue. Profiling the difference between non-LTO r253982 and r254030 should
tell apart the important loop(s). Note that we did recover performance
later. cactusADM is a bit noisy (see that other PR), but base is now in the
range of 51-55, with peak a little higher than that.