Jan Hubicka <hubi...@ucw.cz> writes:
>> On Wed, Feb 19, 2025 at 9:06 PM Jan Hubicka <hubi...@ucw.cz> wrote:
>> >
>> > Hi,
>> > this is a variant of a hook I benchmarked on cpu2016 with -Ofast -flto
>> > and -O2 -flto.  For non -Os and non-Windows ABI it should be practically
>> > the same as your variant that was simply returning mem_cost - 2.
>> >
>> I've tested O2/(Ofast march=native) with SPEC2017 on SPR, mostly
>> neutral (small improvement on povray).
>
> So I got ryzen3 runs with -O2, -O3 and -fno-ipa-ra.
>
> Overall differences are quite small, but I think it is expected.  Here is
> what I get with -O2:
> --------------- ------- --------- --------- ------- --------- ---------
> 500.perlbench_r       1       188      8.46 S       1       183      8.69 S
> 500.perlbench_r       1       187      8.52 *       1       182      8.75 *
> 500.perlbench_r       1       186      8.56 S       1       182      8.75 S
> 502.gcc_r             1       139      10.2 S       1       137      10.3 *
> 502.gcc_r             1       139      10.2 S       1       137      10.4 S
> 502.gcc_r             1       139      10.2 *       1       137      10.3 S
> 505.mcf_r             1       187      8.66 *       1       188      8.61 S
> 505.mcf_r             1       186      8.70 S       1       187      8.66 *
> 505.mcf_r             1       188      8.62 S       1       187      8.66 S
> 520.omnetpp_r         1       213      6.15 *       1       207      6.32 *
> 520.omnetpp_r         1       212      6.18 S       1       206      6.37 S
> 520.omnetpp_r         1       219      5.99 S       1       215      6.11 S
> 523.xalancbmk_r       1        --           CE      1        --           CE
> 525.x264_r            1       135      13.0 S       1       135      12.9 *
> 525.x264_r            1       135      13.0 *       1       135      12.9 S
> 525.x264_r            1       135      13.0 S       1       135      12.9 S
> 531.deepsjeng_r       1       167      6.86 *       1       167      6.85 S
> 531.deepsjeng_r       1       167      6.86 S       1       168      6.84 S
> 531.deepsjeng_r       1       167      6.86 S       1       167      6.85 *
> 541.leela_r           1       296      5.60 S       1       292      5.67 *
> 541.leela_r           1       293      5.65 S       1       293      5.65 S
> 541.leela_r           1       296      5.60 *       1       292      5.67 S
> 548.exchange2_r       1       208      12.6 S       1       208      12.6 S
> 548.exchange2_r       1       208      12.6 *       1       208      12.6 S
> 548.exchange2_r       1       208      12.6 S       1       208      12.6 *
> 557.xz_r              1       194      5.58 S       1       193      5.58 S
> 557.xz_r              1       192      5.62 S       1       193      5.60 S
> 557.xz_r              1       193      5.60 *       1       193      5.59 *
> =================================================================================
> 500.perlbench_r       1       187      8.52 *       1       182      8.75 *
> 502.gcc_r             1       139      10.2 *       1       137      10.3 *
> 505.mcf_r             1       187      8.66 *       1       187      8.66 *
> 520.omnetpp_r         1       213      6.15 *       1       207      6.32 *
> 523.xalancbmk_r                             NR                            NR
> 525.x264_r            1       135      13.0 *       1       135      12.9 *
> 531.deepsjeng_r       1       167      6.86 *       1       167      6.85 *
> 541.leela_r           1       296      5.60 *       1       292      5.67 *
> 548.exchange2_r       1       208      12.6 *       1       208      12.6 *
> 557.xz_r              1       193      5.60 *       1       193      5.59 *
>  Est. SPECrate2017_int_base            8.17
>  Est. SPECrate2017_int_peak                                          8.24
>
> Perlbench seems to improve consistently without LTO (both -O2, -O3 and
> -O2 -fno-ipa-ra), and I think it may be just luck with code layout;
> gcc is quite consistent in all settings.  Overall it seems a consistent
> small win.  For fp tests, I see only off-noise povray differences, and
> only in -Ofast and -Ofast -flto.
>
> Comparing code sizes at -O2:
>
> 500.perlbench_r/run/run_base_refrate_regalloc-m64.0000/perlbench_r_base.regalloc-m64  1699987 1731648 101.86
> 502.gcc_r/run/run_base_refrate_regalloc-m64.0000/cpugcc_r_base.regalloc-m64  7072031 7226911 102.19
> 503.bwaves_r/run/run_base_refrate_regalloc-m64.0000/bwaves_r_base.regalloc-m64  41327 41327 100.00
> 505.mcf_r/run/run_base_refrate_regalloc-m64.0000/mcf_r_base.regalloc-m64  17023 17023 100.00
> 507.cactuBSSN_r/run/run_base_refrate_regalloc-m64.0000/cactusBSSN_r_base.regalloc-m64  3432326 3464950 100.95
> 508.namd_r/run/run_base_refrate_regalloc-m64.0000/namd_r_base.regalloc-m64  835954 835457 99.94
> 510.parest_r/run/run_base_refrate_regalloc-m64.0000/parest_r_base.regalloc-m64  7498066 7587378 101.19
> 511.povray_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_511_base.regalloc-m64  18206 18222 100.08
> 511.povray_r/run/run_base_refrate_regalloc-m64.0000/povray_r_base.regalloc-m64  754591 761695 100.94
> 519.lbm_r/run/run_base_refrate_regalloc-m64.0000/lbm_r_base.regalloc-m64  10900 10916 100.14
> 520.omnetpp_r/run/run_base_refrate_regalloc-m64.0000/omnetpp_r_base.regalloc-m64  1403348 1425556 101.58
> 521.wrf_r/run/run_base_refrate_regalloc-m64.0000/diffwrf_521_base.regalloc-m64  16388136 16394552 100.03
> 521.wrf_r/run/run_base_refrate_regalloc-m64.0000/wrf_r_base.regalloc-m64  22293527 22302167 100.03
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_525_base.regalloc-m64  18206 18222 100.08
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/ldecod_r_base.regalloc-m64  398564 401667 100.77
> 525.x264_r/run/run_base_refrate_regalloc-m64.0000/x264_r_base.regalloc-m64  405515 407051 100.37
> 526.blender_r/run/run_base_refrate_regalloc-m64.0000/blender_r_base.regalloc-m64  7567792 7631536 100.84
> 526.blender_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_526_base.regalloc-m64  18206 18222 100.08
> 527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_r_base.regalloc-m64  5957695 5969535 100.19
> 527.cam4_r/run/run_base_refrate_regalloc-m64.0000/cam4_validate_527_base.regalloc-m64  606591 608767 100.35
> 531.deepsjeng_r/run/run_base_refrate_regalloc-m64.0000/deepsjeng_r_base.regalloc-m64  75304 76248 101.25
> 538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagevalidate_538_base.regalloc-m64  18206 18222 100.08
> 538.imagick_r/run/run_base_refrate_regalloc-m64.0000/imagick_r_base.regalloc-m64  1638858 1651628 100.77
> 541.leela_r/run/run_base_refrate_regalloc-m64.0000/leela_r_base.regalloc-m64  132636 133146 100.38
> 544.nab_r/run/run_base_refrate_regalloc-m64.0000/nab_r_base.regalloc-m64  150146 150513 100.24
> 548.exchange2_r/run/run_base_refrate_regalloc-m64.0000/exchange2_r_base.regalloc-m64  76709 76709 100.00
> 549.fotonik3d_r/run/run_base_refrate_regalloc-m64.0000/fotonik3d_r_base.regalloc-m64  464940 465260 100.06
> 554.roms_r/run/run_base_refrate_regalloc-m64.0000/roms_r_base.regalloc-m64  833926 834166 100.02
> 557.xz_r/run/run_base_refrate_regalloc-m64.0000/xz_r_base.regalloc-m64  130345 133253 102.23
>
> The 2% code size increase for gcc is not very nice, but I think it is also
> expected, since we make the compiler use fewer push/pop instructions.
> There are 34091 push instructions with patch and 38939 without.
>
> With -fno-ipa-ra the story is similar:
>
> 500.perlbench_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/perlbench_r_base.regalloc-O2-noipara-m64  1701299 1733024 101.86
> 502.gcc_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cpugcc_r_base.regalloc-O2-noipara-m64  7074527 7229855 102.19
> 503.bwaves_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/bwaves_r_base.regalloc-O2-noipara-m64  41327 41327 100.00
> 505.mcf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/mcf_r_base.regalloc-O2-noipara-m64  17151 17151 100.00
> 507.cactuBSSN_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cactusBSSN_r_base.regalloc-O2-noipara-m64  3432326 3464950 100.95
> 508.namd_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/namd_r_base.regalloc-O2-noipara-m64  835954 835457 99.94
> 510.parest_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/parest_r_base.regalloc-O2-noipara-m64  7504722 7594098 101.19
> 511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_511_base.regalloc-O2-noipara-m64  18206 18222 100.08
> 511.povray_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/povray_r_base.regalloc-O2-noipara-m64  756639 763487 100.90
> 519.lbm_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/lbm_r_base.regalloc-O2-noipara-m64  10900 10916 100.14
> 520.omnetpp_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/omnetpp_r_base.regalloc-O2-noipara-m64  1403412 1425748 101.59
> 521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/diffwrf_521_base.regalloc-O2-noipara-m64  16394344 16400504 100.03
> 521.wrf_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/wrf_r_base.regalloc-O2-noipara-m64  22300503 22308759 100.03
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_525_base.regalloc-O2-noipara-m64  18206 18222 100.08
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/ldecod_r_base.regalloc-O2-noipara-m64  399204 402179 100.74
> 525.x264_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/x264_r_base.regalloc-O2-noipara-m64  406251 408299 100.50
> 526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/blender_r_base.regalloc-O2-noipara-m64  7583536 7648304 100.85
> 526.blender_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_526_base.regalloc-O2-noipara-m64  18206 18222 100.08
> 527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_r_base.regalloc-O2-noipara-m64  5962335 5974239 100.19
> 527.cam4_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/cam4_validate_527_base.regalloc-O2-noipara-m64  607327 609375 100.33
> 531.deepsjeng_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/deepsjeng_r_base.regalloc-O2-noipara-m64  75240 76248 101.33
> 538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagevalidate_538_base.regalloc-O2-noipara-m64  18206 18222 100.08
> 538.imagick_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/imagick_r_base.regalloc-O2-noipara-m64  1641226 1654060 100.78
> 541.leela_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/leela_r_base.regalloc-O2-noipara-m64  132764 133274 100.38
> 544.nab_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/nab_r_base.regalloc-O2-noipara-m64  150498 150929 100.28
> 548.exchange2_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/exchange2_r_base.regalloc-O2-noipara-m64  76921 76921 100.00
> 549.fotonik3d_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/fotonik3d_r_base.regalloc-O2-noipara-m64  464940 465260 100.06
> 554.roms_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/roms_r_base.regalloc-O2-noipara-m64  833926 834166 100.02
> 557.xz_r/run/run_base_refrate_regalloc-O2-noipara-m64.0000/xz_r_base.regalloc-O2-noipara-m64  130697 133573 102.20
>
> Overall I think new costing works reasonably well.

Thanks for running these.  I saw poor results for perlbench with my
initial aarch64 hooks because the hooks reduced the cost to zero for
the entry case:

	  auto entry_cost = targetm.callee_save_cost
	    (spill_cost_type::SAVE, hard_regno, mode, saved_nregs,
	     ira_memory_move_cost[mode][rclass][0] * saved_nregs / nregs,
	     allocated_callee_save_regs, existing_spills_p);
	  /* In the event of a tie between caller-save and callee-save,
	     prefer callee-save.  We apply this to the entry cost rather
	     than the exit cost since the entry frequency must be at
	     least as high as the exit frequency.  */
	  if (entry_cost > 0)
	    entry_cost -= 1;

I "fixed" that by bumping the cost to a minimum of 2, but I was
wondering whether the "entry_cost > 0" should instead be
"entry_cost > 1", so that the cost is always greater than not using
a callee save for registers that don't cross a call.  WDYT?

Richard
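
To make the "entry_cost > 0" versus "entry_cost > 1" question concrete, here is a minimal standalone sketch of just the tie-break step.  It is not GCC code: tie_break_current and tie_break_proposed are made-up names for illustration, and the inputs stand in for whatever the callee_save_cost hook returns.

```cpp
#include <cassert>

// Hypothetical stand-ins, not GCC code: they model only the tie-break
// adjustment applied to the value returned by the callee_save_cost hook.

// Current condition: any positive entry cost is reduced by one.
constexpr int tie_break_current (int entry_cost)
{
  return entry_cost > 0 ? entry_cost - 1 : entry_cost;
}

// Proposed condition: only costs above one are reduced, so a positive
// cost can never drop to zero.
constexpr int tie_break_proposed (int entry_cost)
{
  return entry_cost > 1 ? entry_cost - 1 : entry_cost;
}

// A hook result of 1 becomes free under the current condition ...
static_assert (tie_break_current (1) == 0, "cost collapses to zero");
// ... but keeps a nonzero cost under the proposed one.
static_assert (tie_break_proposed (1) == 1, "cost stays positive");
// For hook results of 2 or more, both conditions give the same answer.
static_assert (tie_break_current (2) == tie_break_proposed (2),
	       "identical above 1");
```

The two conditions differ only when the hook returns exactly 1.  A hook that returns 0 stays at 0 either way, so the proposed condition only helps if the hook itself reports a positive cost.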