https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106265
Vineet Gupta <vineetg at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #12 from Vineet Gupta <vineetg at gcc dot gnu.org> --- Two years hence and we are a little wiser. The root-cause of spills is sched1 [PR/114729](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114729). The recent sum of two s12 patch does make the spill codegen better by having 1 less insn to materialize the stack access. And that shaves off 10% of cactu dynamic icounts which should be enough to close this PR. commit 4bfc4585c9935fbde75ccf04e44a15d24f42cde9 Author: Vineet Gupta <vine...@rivosinc.com> Date: Mon May 13 11:45:55 2024 -0700 RISC-V: avoid LUI based const materialization ... [part of PR/106265] ... if the constant can be represented as sum of two S12 values. The two S12 values could instead be fused with subsequent ADD insn. The helps - avoid an additional LUI insn - side benefits of not clobbering a reg e.g. w/o patch w/ patch long | | plus(unsigned long i) | li a5,4096 | { | addi a5,a5,-2032 | addi a0, a0, 2047 return i + 2064; | add a0,a0,a5 | addi a0, a0, 17 } | ret | ret NOTE: In theory not having const in a standalone reg might seem less CSE friendly, but for workloads in consideration these mat are from very late LRA reloads and follow on GCSE is not doing much currently. The real benefit however is seen in base+offset computation for array accesses and especially for stack accesses which are finalized late in optim pipeline, during LRA register allocation. Often the finalized offsets trigger LRA reloads resulting in mind boggling repetition of exact same insn sequence including LUI based constant materialization. This shaves off 290 billion dynamic instrustions (QEMU icounts) in SPEC 2017 Cactu benchmark which is over 10% of workload. In the rest of suite, there additional 10 billion shaved, with both gains and losses in indiv workloads as is usual with compiler changes. 500.perlbench_r-0 | 1,214,534,029,025 | 1,212,887,959,387 | 500.perlbench_r-1 | 740,383,419,739 | 739,280,308,163 | 500.perlbench_r-2 | 692,074,638,817 | 691,118,734,547 | 502.gcc_r-0 | 190,820,141,435 | 190,857,065,988 | 502.gcc_r-1 | 225,747,660,839 | 225,809,444,357 | <- -0.02% 502.gcc_r-2 | 220,370,089,641 | 220,406,367,876 | <- -0.03% 502.gcc_r-3 | 179,111,460,458 | 179,135,609,723 | <- -0.02% 502.gcc_r-4 | 219,301,546,340 | 219,320,416,956 | <- -0.01% 503.bwaves_r-0 | 278,733,324,691 | 278,733,323,575 | <- -0.01% 503.bwaves_r-1 | 442,397,521,282 | 442,397,519,616 | 503.bwaves_r-2 | 344,112,218,206 | 344,112,216,760 | 503.bwaves_r-3 | 417,561,469,153 | 417,561,467,597 | 505.mcf_r | 669,319,257,525 | 669,318,763,084 | 507.cactuBSSN_r | 2,852,767,394,456 | 2,564,736,063,742 | <+ 10.10% 508.namd_r | 1,855,884,342,110 | 1,855,881,110,934 | 510.parest_r | 1,654,525,521,053 | 1,654,402,859,174 | 511.povray_r | 2,990,146,655,619 | 2,990,060,324,589 | 519.lbm_r | 1,158,337,294,525 | 1,158,337,294,529 | 520.omnetpp_r | 1,021,765,791,283 | 1,026,165,661,394 | 521.wrf_r | 1,715,955,652,503 | 1,714,352,737,385 | 523.xalancbmk_r | 849,846,008,075 | 849,836,851,752 | 525.x264_r-0 | 277,801,762,763 | 277,488,776,427 | 525.x264_r-1 | 927,281,789,540 | 926,751,516,742 | 525.x264_r-2 | 915,352,631,375 | 914,667,785,953 | 526.blender_r | 1,652,839,180,887 | 1,653,260,825,512 | 527.cam4_r | 1,487,053,494,925 | 1,484,526,670,770 | 531.deepsjeng_r | 1,641,969,526,837 | 1,642,126,598,866 | 538.imagick_r | 2,098,016,546,691 | 2,097,997,929,125 | 541.leela_r | 1,983,557,323,877 | 1,983,531,314,526 | 544.nab_r | 1,516,061,611,233 | 1,516,061,407,715 | 548.exchange2_r | 2,072,594,330,215 | 2,072,591,648,318 | 549.fotonik3d_r | 1,001,499,307,366 | 1,001,478,944,189 | 554.roms_r | 1,028,799,739,111 | 1,028,780,904,061 | 557.xz_r-0 | 363,827,039,684 | 363,057,014,260 | 557.xz_r-1 | 906,649,112,601 | 905,928,888,732 | 557.xz_r-2 | 509,023,898,187 | 508,140,356,932 | 997.specrand_fr | 402,535,577 | 403,052,561 | 999.specrand_ir | 402,535,577 | 403,052,561 | This should still be considered damage control as the real/deeper fix would be to reduce number of LRA reloads or CSE/anchor those during LRA constraint sub-pass (re)runs (thats a different PR/114729. Implementation Details (for posterity) -------------------------------------- - basic idea is to have a splitter selected via a new predicate for constant being possible sum of two S12 and provide the transform. This is however a 2 -> 2 transform which combine can't handle. So we specify it using a define_insn_and_split. - the initial loose "i" constraint caused LRA to accept invalid insns thus needing a tighter new constraint as well. - An additional fallback alternate with catch-all "r" register constraint also needed to allow any "reloads" that LRA might require for ADDI with const larger than S12. Testing -------- This is testsuite clean (rv64 only). I'll rely on post-commit CI multlib run for any possible fallout for other setups such as rv32. | | gcc | g++ | gfortran | | rv64imafdc_zba_zbb_zbs_zicond/ lp64d/ medlow | 41 / 17 | 8 / 3 | 7 / 2 | | rv64imafdc_zba_zbb_zbs_zicond/ lp64d/ medlow | 41 / 17 | 8 / 3 | 7 / 2 | I also threw this into a buildroot run, it obviously boots Linux to userspace. bloat-o-meter on glibc and kernel show overall decrease in staic instruction counts with some minor spot increases. These are generally in the category of - LUI + ADDI are 2 byte each vs. two ADD being 4 byte each. - Knock on effects due to inlining changes. - Sometimes the slightly shorter 2-insn seq in a mult-exit function can cause in-place epilogue duplication (vs. a jump back). This is slightly larger but more efficient in execution. In summary nothing to fret about. | linux/scripts/bloat-o-meter build-gcc-240131/target/lib/libc.so.6 \ build-gcc-240131-new-splitter-1-variant/target/lib/libc.so.6 | | add/remove: 0/0 grow/shrink: 21/49 up/down: 520/-3056 (-2536) | Function old new delta | getnameinfo 2756 2892 +136 ... | tempnam 136 144 +8 | padzero 276 284 +8 ... | __GI___register_printf_specifier 284 280 -4 | __EI_xdr_array 468 464 -4 | try_file_lock 268 260 -8 | pthread_create@GLIBC_2 3520 3508 -12 | __pthread_create_2_1 3520 3508 -12 ... | _nss_files_setnetgrent 932 904 -28 | _nss_dns_gethostbyaddr2_r 1524 1480 -44 | build_trtable 3312 3208 -104 | printf_positional 25000 22580 -2420 | Total: Before=2107999, After=2105463, chg -0.12% Caveat: ------ Jeff noted during v2 review that the operand0 constraint !riscv_reg_frame_related could potentially cause issues with hard reg cprop in future. If that trips things up we will have to loosen the constraint while dialing down the const range to (-2048 to 2032) as opposed to fll S12 range of (-2048 to 2047) to keep stack regs aligned. gcc/ChangeLog: * config/riscv/riscv.h: New macros to check for sum of two S12 range. * config/riscv/constraints.md: New constraint. * config/riscv/predicates.md: New Predicate. * config/riscv/riscv.md: New splitter. * config/riscv/riscv.cc (riscv_reg_frame_related): New helper. * config/riscv/riscv-protos.h: New helper prototype. gcc/testsuite/ChangeLog: * gcc.target/riscv/sum-of-two-s12-const-1.c: New test: checks for new patterns output. * gcc.target/riscv/sum-of-two-s12-const-2.c: Ditto. * gcc.target/riscv/sum-of-two-s12-const-3.c: New test: should not ICE. Tested-by: Edwin Lu <e...@rivosinc.com> # pre-commit-CI #1520 Signed-off-by: Vineet Gupta <vine...@rivosinc.com>