https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84490
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-04-10
   Target Milestone|---                         |8.0
            Summary|436.cactusADM regressed by  |[8 regression]
                   |6-8% percent with -Ofast on |436.cactusADM regressed by
                   |Zen, compared to gcc 7.2    |6-8% percent with -Ofast on
                   |                            |Zen and Haswell, compared
                   |                            |to gcc 7.2
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
I also see this for Haswell:
https://gcc.opensuse.org/gcc-old/SPEC/CFP/sb-czerny-head-64-2006/index.html
There it's more like 10-14%, depending on which parts you look at.

For bisection it's a bit weird:

201710240032 r254030 base 48.3 peak 52.2
201710230039 r253996 base 64.7 peak 57.2
201710221240 r253982 base 64.6 peak 65.8
201710210035 r253966 base 65.6 peak 65.2

where base is -Ofast -march=haswell and peak adds -flto. Note it might be
that around this time I disabled address-space randomization, in case it is
an issue similar to PR82362. I just don't remember exactly, so I'd have to
reproduce the regression around these revisions.

Between r253982 and r253996 the culprit is likely

r253993 | hubicka | 2017-10-23 00:09:47 +0200 (Mon, 23 Oct 2017) | 12 lines

	* i386.c (ix86_builtin_vectorization_cost): Use existing rtx_cost
	latencies instead of having a separate table; make a difference
	between integer and float costs.
	* i386.h (processor_costs): Remove scalar_stmt_cost,
	scalar_load_cost, scalar_store_cost, vec_stmt_cost,
	vec_to_scalar_cost, scalar_to_vec_cost, vec_align_load_cost,
	vec_unalign_load_cost, vec_store_cost.
	* x86-tune-costs.h: Remove entries which have been removed in
	processor_costs from all tables; make cond_taken_branch_cost and
	cond_not_taken_branch_cost COST_N_INSNS based.
Similarly, the other range includes

r254012 | hubicka | 2017-10-23 17:10:09 +0200 (Mon, 23 Oct 2017) | 15 lines

	* i386.c (dimode_scalar_chain::compute_convert_gain): Use
	xmm_move instead of sse_move.
	(sse_store_index): New function.
	(ix86_register_move_cost): Be more sensible about the mismatch
	stall; model AVX moves correctly; distinguish between
	sse->integer and integer->sse.
	(ix86_builtin_vectorization_cost): Model aligned and unaligned
	moves correctly; distinguish between SSE and AVX.
	* i386.h (processor_costs): Remove sse_move; add xmm_move,
	ymm_move and zmm_move. Increase the size of the sse load and
	store tables; add unaligned load and store tables; add
	ssemmx_to_integer.
	* x86-tune-costs.h: Update all entries according to real move
	latencies from Agner Fog's manual and chip documentation.

so at first glance it indeed looks like a target (vectorization) cost-model
issue. Profiling the difference between non-LTO r253982 and r254030 should
tell apart the important loop(s). Note that we did recover performance
later. cactusADM is a bit noisy (see that other PR), but base is now in the
range of 51-55, with peak a little higher than that.