On 5/11/25 19:22, 钟居哲 wrote:
> Hi, vineet.
>
> >> I have a feeling this has to do with following:
> >> https://godbolt.org/z/Px9es7j1r
>
> I saw in there are 2 fsrm instruction inside the main loop in Clang generated
> ASM which I think GCC is better.
>
> Correct me if I am wrong. Thanks.

Yes you are right, gcc is better here if you consider the loop.
However at this point I'm not competing with LLVM ;-)

gcc still emits the FRM save/restore in the fall back function call code-path
which doesn't change RM at all and thus can be optimized.

BTW just to be clear this is a follow-up problem *after* this series. The idea
is to delay the generation of initial FRM save to close to actual static
rounding mode. That was the FRM backup reg is NOT live on all edges and then at
the time of mode emit, we can check if the backup reg is NOT live and skip the
restore.
I've yet to figure out the implementation though.

Thx,
-Vineet



>
> --------------------------------------------------------------------------------
> juzhe.zh...@rivai.ai
>
>      
>     *From:* Vineet Gupta <mailto:vine...@rivosinc.com>
>     *Date:* 2025-05-10 04:27
>     *To:* gcc-patches <mailto:gcc-patches@gcc.gnu.org>
>     *CC:* gnu-toolchain <mailto:gnu-toolch...@rivosinc.com>; Jeff Law
>     <mailto:jeffreya...@gmail.com>; Robin Dapp <mailto:rdapp....@gmail.com>;
>     Juzhe Zhong <mailto:juzhe.zh...@rivai.ai>; Pan Li
>     <mailto:pan2...@intel.com>; Kito Cheng <mailto:kito.ch...@sifive.com>;
>     Vineet Gupta <mailto:vine...@rivosinc.com>
>     *Subject:* [PATCH 0/6] RISC-V: frm state-machine improvements
>     Hi,
>      
>     This came out of Rivos perf team reporting (shoutout to Siavash) that
>     some of the SPEC2017 workloads had unnecessary FRM wiggles, when
>     none were needed. The writes in particular could be expensive.
>      
>     I started with reduced test for PR/119164 from 
> blender:node_testure_util.c.
>      
>     However in trying to understand (and a botched rewrite of whole thing)
>     it turned out that lot of code was just unnecessary leading to more
>     complexity than warranted. As a result there are more deletions here and
>     the actual improvements come from just a few lines of actual changes.
>      
>     I've verified each patch incrementally with
>     - Testsuite run (unchanged, 1 unexpected pass
>     gcc.target/riscv/rvv/autovec/pr119114.c)
>     - SPEC build
>     - Static analysis of FRM read/write insns emitted in all of SPEC binaries.
>     - There's BPI date for some of this too, but the delta there is not
>        significant as this could really be uarch specific.
>      
>     Here's the result for static analysis.
>      
>      
>                 1. revert-confluence  2. remove-edge-insert 
>     4-fewer-frm-restore  5-call-backtrack
>                                       3. remove-mode-after
>                   -------------------  -------------------- 
>     -------------------  ---------------
>                     frrm fsrmi fsrm       frrm fsrmi fsrm       frrm fsrmi
>     fsrm     frrm fsrmi fsrm
>         perlbench_r   42    0    4          42    0    4          17    0   
>     1        17    0    1
>            cpugcc_r  167    0   17         167    0   17          11    0   
>     0        11    0    0
>            bwaves_r   16    0    1          16    0    1          16    0   
>     1        16    0    1
>               mcf_r   11    0    0          11    0    0          11    0   
>     0        11    0    0
>        cactusBSSN_r   79    0   27          76    0   27          19    0   
>     1        19    0    1
>              namd_r  119    0   63         119    0   63          14    0   
>     1        14    0    1
>            parest_r  218    0  114         168    0  114          24    0   
>     1        24    0    1
>            povray_r  123    1   17         123    1   17          26    1   
>     6        26    1    6
>               lbm_r    6    0    0           6    0    0           6    0   
>     0         6    0    0
>           omnetpp_r   17    0    1          17    0    1          17    0   
>     1        17    0    1
>               wrf_r 2287   13 1956        2287   13 1956        1268   13
>     1603       613   13   82
>          cpuxalan_r   17    0    1          17    0    1          17    0   
>     1        17    0    1
>            ldecod_r   11    0    0          11    0    0          11    0   
>     0        11    0    0
>              x264_r   14    0    1          14    0    1          11    0   
>     0        11    0    0
>           blender_r  724   12  182         724   12  182          61   12  
>     42        39   12   16
>              cam4_r  324   13  169         324   13  169          45   13  
>     20        40   13   17
>         deepsjeng_r   11    0    0          11    0    0          11    0   
>     0        11    0    0
>           imagick_r  265   16   34         265   16   34         132   16  
>     25        33   16   18
>             leela_r   12    0    0          12    0    0          12    0   
>     0        12    0    0
>               nab_r   13    0    1          13    0    1          13    0   
>     1        13    0    1
>         exchange2_r   16    0    1          16    0    1          16    0   
>     1        16    0    1
>         fotonik3d_r   20    0   11          20    0   11          19    0   
>     1        19    0    1
>              roms_r   33    0   23          33    0   23          21    0   
>     1        21    0    1
>                xz_r    6    0    0           6    0    0           6    0   
>     0         6    0    0
>                   --------------------  ------------------- 
>     -------------------  ----------------
>                     4551   55 2623        4498   55 2623        1804   55
>     1707      1023   55  150
>                   --------------------  ------------------- 
>     -------------------  ----------------
>                               7729                  7176                 
>     3566                1228
>                   --------------------  ------------------- 
>     -------------------  ----------------
>      
>     It seems wrf still has half of all read/writes
>                      613   13   82
>      
>     with one function having the bulk of them
>           solve_em_  555    1   50
>      
>     This is 1 static RM so ideally needs 1 save and 1 restore.
>      
>     I have a feeling this has to do with following:
>         https://godbolt.org/z/Px9es7j1r
>      
>     The function call code path need not bother with frm save/restore at
>     all. This is currently being investigated but could take more time.
>      
>     Please review.
>      
>     Thx,
>     -Vineet
>      
>     Vineet Gupta (6):
>       emit-rtl: document next_nonnote_nondebug_insn_bb () can breach into
>         next BB
>       RISC-V: frm/mode-switch: remove TARGET_MODE_CONFLUENCE
>       RISC-V: frm/mode-switch: remove dubious frm edge insertion before
>         call_insn
>       RISC-V: frm/mode-switch: TARGET_MODE_AFTER not needed for frm
>         switching
>       RISC-V: frm/mode-switch: Reduce FRM restores on DYN transition
>       RISC-V: frm/mode-switch: robustify call_insn backtracking
>         [PR119164][PR120203]
>      
>     gcc/config/riscv/riscv.cc                     | 166 +++---------------
>     gcc/emit-rtl.cc                               |   6 +-
>     .../rvv/base/float-point-dynamic-frm-74.c     |   2 +-
>     .../gcc.target/riscv/rvv/base/pr119164.c      |  22 +++
>     4 files changed, 50 insertions(+), 146 deletions(-)
>     create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/base/pr119164.c
>      
>     -- 
>     2.43.0
>      
>      
>

Reply via email to