Jan Hubicka <hubi...@ucw.cz> writes:
>> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>> >
>> > Hi,
>> > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
>> > and -O2 -flto.  For non -Os and non-Windows ABI it should be practically
>> > the same as your variant that was simply returning mem_cost - 2.
>> >
>> I've tested O2/(Ofast march=native) with SPEC2017 on SPR, mostly
>> neutral (small improvement on povray).
>
> So I got ryzen3 runs with -O2, -O3 and -fno-ipa-ra.
>
> Overall differences are quite small, but I think it is expected. Here is
> what I get with -O2:
> --------------- -------  ---------  ---------    -------  ---------  ---------
> 500.perlbench_r       1        188       8.46  S       1        183       8.69  S
> 500.perlbench_r       1        187       8.52  *       1        182       8.75  *
> 500.perlbench_r       1        186       8.56  S       1        182       8.75  S
> 502.gcc_r             1        139      10.2   S       1        137      10.3   *
> 502.gcc_r             1        139      10.2   S       1        137      10.4   S
> 502.gcc_r             1        139      10.2   *       1        137      10.3   S
> 505.mcf_r             1        187       8.66  *       1        188       8.61  S
> 505.mcf_r             1        186       8.70  S       1        187       8.66  *
> 505.mcf_r             1        188       8.62  S       1        187       8.66  S
> 520.omnetpp_r         1        213       6.15  *       1        207       6.32  *
> 520.omnetpp_r         1        212       6.18  S       1        206       6.37  S
> 520.omnetpp_r         1        219       5.99  S       1        215       6.11  S
> 523.xalancbmk_r       1         --            CE       1         --            CE
> 525.x264_r            1        135      13.0   S       1        135      12.9   *
> 525.x264_r            1        135      13.0   *       1        135      12.9   S
> 525.x264_r            1        135      13.0   S       1        135      12.9   S
> 531.deepsjeng_r       1        167       6.86  *       1        167       6.85  S
> 531.deepsjeng_r       1        167       6.86  S       1        168       6.84  S
> 531.deepsjeng_r       1        167       6.86  S       1        167       6.85  *
> 541.leela_r           1        296       5.60  S       1        292       5.67  *
> 541.leela_r           1        293       5.65  S       1        293       5.65  S
> 541.leela_r           1        296       5.60  *       1        292       5.67  S
> 548.exchange2_r       1        208      12.6   S       1        208      12.6   S
> 548.exchange2_r       1        208      12.6   *       1        208      12.6   S
> 548.exchange2_r       1        208      12.6   S       1        208      12.6   *
> 557.xz_r              1        194       5.58  S       1        193       5.58  S
> 557.xz_r              1        192       5.62  S       1        193       5.60  S
> 557.xz_r              1        193       5.60  *       1        193       5.59  *
> =================================================================================
> 500.perlbench_r       1        187       8.52  *       1        182       8.75  *
> 502.gcc_r             1        139      10.2   *       1        137      10.3   *
> 505.mcf_r             1        187       8.66  *       1        187       8.66  *
> 520.omnetpp_r         1        213       6.15  *       1        207       6.32  *
> 523.xalancbmk_r                               NR                               NR
> 525.x264_r            1        135      13.0   *       1        135      12.9   *
> 531.deepsjeng_r       1        167       6.86  *       1        167       6.85  *
> 541.leela_r           1        296       5.60  *       1        292       5.67  *
> 548.exchange2_r       1        208      12.6   *       1        208      12.6   *
> 557.xz_r              1        193       5.60  *       1        193       5.59  *
>  Est. SPECrate2017_int_base              8.17
>  Est. SPECrate2017_int_peak                                               8.24
>
> Perlbench seems to improve consistently without LTO (at -O2, -O3 and
> -O2 -fno-ipa-ra), though I think it may be just luck with code layout.
> gcc is quite consistent in all settings.  Overall it seems to be a
> consistent small win.  For fp tests, I see only off-noise povray
> differences, and only with -Ofast and -Ofast -flto.
>
> Comparing code sizes at -O2:
>
> 500.perlbench_r/run/run_base_refrate_regalloc-m64.0000/perlbench_r_base.regalloc-m64   1699987   1731648 101.86
> 502.gcc_r/run/run_base_refrate_regalloc-m64.0000/cpugcc_r_base.regalloc-m64   7072031   7226911 102.19
> 503.bwaves_r/run/run_base_refrate_regalloc-m64.0000/bwaves_r_base.regalloc-m64     41327     41327 100.00
> 505.mcf_r/run/run_base_refrate_regalloc-m64.0000/mcf_r_base.regalloc-m64     17023     17023 100.00
> 507.cactuBSSN_r/run/run_base_refrate_regalloc-m64.0000/cactusBSSN_r_base.regalloc-m64   3432326   3464950 100.95
> 508.namd_r/run/run_base_refrate_regalloc-m64.0000/namd_r_base.regalloc-m64    835954    835457  99.94
> 510.parest_r/run/run_base_refrate_regalloc-m64.0000/parest_r_base.regalloc-m64   7498066   7587378 101.19
> 511.povray_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_511_base.regalloc-m64     18206     18222 100.08
> 511.povray_r/run/run_base_refrate_regalloc-m64.0000/povray_r_base.regalloc-m64    754591    761695 100.94
> 519.lbm_r/run/run_base_refrate_regalloc-m64.0000/lbm_r_base.regalloc-m64     10900     10916 100.14
> 520.omnetpp_r/run/run_base_refrate_regalloc-m64.0000/omnetpp_r_base.regalloc-m64   1403348   1425556 101.58
> 521.wrf_r/run/run_base_refrate_regalloc-m64.0000/diffwrf_521_base.regalloc-m64  16388136  16394552 100.03
> 521.wrf_r/run/run_base_refrate_regalloc-m64.0000/wrf_r_base.regalloc-m64  22293527  22302167 100.03
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_525_base.regalloc-m64     18206     18222 100.08
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/ldecod_r_base.regalloc-m64    398564    401667 100.77
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/x264_r_base.regalloc-m64    405515    407051 100.37
> 526.blender_r/run/run_base_refrate_regalloc-m64.0000/blender_r_base.regalloc-m64   7567792   7631536 100.84
> 526.blender_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_526_base.regalloc-m64     18206     18222 100.08
> 527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_r_base.regalloc-m64   5957695   5969535 100.19
> 527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_validate_527_base.regalloc-m64    606591    608767 100.35
> 531.deepsjeng_r/run/run_base_refrate_regalloc-m64.0000/deepsjeng_r_base.regalloc-m64     75304     76248 101.25
> 538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_538_base.regalloc-m64     18206     18222 100.08
> 538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagick_r_base.regalloc-m64   1638858   1651628 100.77
> 541.leela_r/run/run_base_refrate_regalloc-m64.0000/leela_r_base.regalloc-m64    132636    133146 100.38
> 544.nab_r/run/run_base_refrate_regalloc-m64.0000/nab_r_base.regalloc-m64    150146    150513 100.24
> 548.exchange2_r/run/run_base_refrate_regalloc-m64.0000/exchange2_r_base.regalloc-m64     76709     76709 100.00
> 549.fotonik3d_r/run/run_base_refrate_regalloc-m64.0000/fotonik3d_r_base.regalloc-m64    464940    465260 100.06
> 554.roms_r/run/run_base_refrate_regalloc-m64.0000/roms_r_base.regalloc-m64    833926    834166 100.02
> 557.xz_r/run/run_base_refrate_regalloc-m64.0000/xz_r_base.regalloc-m64    130345    133253 102.23
>
> The 2% code size increase for gcc is not very nice, but I think it is also
> expected, since we make the compiler use fewer push/pop instructions.
> There are 34091 push instructions with patch and 38939 without.
>
> With -fno-ipa-ra the story is similar:
>
> 500.perlbench_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/perlbench_r_base.regalloc-O2-noipara-m64   1701299   1733024 101.86
> 502.gcc_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cpugcc_r_base.regalloc-O2-noipara-m64   7074527   7229855 102.19
> 503.bwaves_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/bwaves_r_base.regalloc-O2-noipara-m64     41327     41327 100.00
> 505.mcf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/mcf_r_base.regalloc-O2-noipara-m64     17151     17151 100.00
> 507.cactuBSSN_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cactusBSSN_r_base.regalloc-O2-noipara-m64   3432326   3464950 100.95
> 508.namd_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/namd_r_base.regalloc-O2-noipara-m64    835954    835457  99.94
> 510.parest_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/parest_r_base.regalloc-O2-noipara-m64   7504722   7594098 101.19
> 511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_511_base.regalloc-O2-noipara-m64     18206     18222 100.08
> 511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/povray_r_base.regalloc-O2-noipara-m64    756639    763487 100.90
> 519.lbm_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/lbm_r_base.regalloc-O2-noipara-m64     10900     10916 100.14
> 520.omnetpp_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/omnetpp_r_base.regalloc-O2-noipara-m64   1403412   1425748 101.59
> 521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/diffwrf_521_base.regalloc-O2-noipara-m64  16394344  16400504 100.03
> 521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/wrf_r_base.regalloc-O2-noipara-m64  22300503  22308759 100.03
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_525_base.regalloc-O2-noipara-m64     18206     18222 100.08
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/ldecod_r_base.regalloc-O2-noipara-m64    399204    402179 100.74
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/x264_r_base.regalloc-O2-noipara-m64    406251    408299 100.50
> 526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/blender_r_base.regalloc-O2-noipara-m64   7583536   7648304 100.85
> 526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_526_base.regalloc-O2-noipara-m64     18206     18222 100.08
> 527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_r_base.regalloc-O2-noipara-m64   5962335   5974239 100.19
> 527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_validate_527_base.regalloc-O2-noipara-m64    607327    609375 100.33
> 531.deepsjeng_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/deepsjeng_r_base.regalloc-O2-noipara-m64     75240     76248 101.33
> 538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_538_base.regalloc-O2-noipara-m64     18206     18222 100.08
> 538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagick_r_base.regalloc-O2-noipara-m64   1641226   1654060 100.78
> 541.leela_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/leela_r_base.regalloc-O2-noipara-m64    132764    133274 100.38
> 544.nab_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/nab_r_base.regalloc-O2-noipara-m64    150498    150929 100.28
> 548.exchange2_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/exchange2_r_base.regalloc-O2-noipara-m64     76921     76921 100.00
> 549.fotonik3d_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/fotonik3d_r_base.regalloc-O2-noipara-m64    464940    465260 100.06
> 554.roms_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/roms_r_base.regalloc-O2-noipara-m64    833926    834166 100.02
> 557.xz_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/xz_r_base.regalloc-O2-noipara-m64    130697    133573 102.20
>
> Overall I think the new costing works reasonably well.

Thanks for running these.  I saw poor results for perlbench with my
initial aarch64 hooks because the hooks reduced the cost to zero for
the entry case:

            auto entry_cost = targetm.callee_save_cost
              (spill_cost_type::SAVE, hard_regno, mode, saved_nregs,
               ira_memory_move_cost[mode][rclass][0] * saved_nregs / nregs,
               allocated_callee_save_regs, existing_spills_p);
            /* In the event of a tie between caller-save and callee-save,
               prefer callee-save.  We apply this to the entry cost rather
               than the exit cost since the entry frequency must be at
               least as high as the exit frequency.  */
            if (entry_cost > 0)
              entry_cost -= 1;

I "fixed" that by bumping the cost to a minimum of 2, but I was
wondering whether the "entry_cost > 0" should instead be "entry_cost > 1",
so that the cost is always greater than not using a callee save for
registers that don't cross a call.  WDYT?
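
To make the question concrete, here is a minimal standalone sketch of the
two guards applied to a plain integer cost.  It is not GCC code: the
function names tie_break_current and tie_break_proposed are made up for
illustration, and the real entry_cost comes from the target hook shown
above.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; they are not part of GCC.
// Each takes the entry cost returned by the target hook and applies the
// "prefer callee-save on a tie" adjustment.

// Guard as in the quoted snippet: any positive cost is reduced by one,
// so a hook cost of 1 ends up as 0.
constexpr int tie_break_current(int entry_cost)
{
  return entry_cost > 0 ? entry_cost - 1 : entry_cost;
}

// Guard as proposed: only costs above 1 are reduced, so a positive
// hook cost never drops to 0.
constexpr int tie_break_proposed(int entry_cost)
{
  return entry_cost > 1 ? entry_cost - 1 : entry_cost;
}
```

The two variants differ only for a hook cost of exactly 1 (0 versus 1);
for costs of 2 or more both subtract one, and a hook that returns 0 still
yields 0 either way.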

Richard
