https://gcc.gnu.org/bugzilla/show_bug.cgi?id=123086

            Bug ID: 123086
           Summary: RISC-V  possible optimization of fp move instruction
                    in SpacemiT-x60 tuning
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: [email protected]
  Target Milestone: ---
            Target: riscv

Created attachment 63031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=63031&action=edit
diff for fmv=3

While benchmarking a new spacemit-x60 tuning configuration on the RVV
benchmark, I found unexpected fmv instructions generated by GCC in the
mandelbrot_scalar_f32 example. The issue appears to be related to instruction
scheduling for fmv.
The compiler produces an unnecessary move inside a loop:

fmv.s   fa4,fa3   // unnecessary move

Relevant part of the generated assembly:
.L6:
        fsub.s  fa5,fa5,fa4
        addi    a5,a5,1
        fmv.s   fa4,fa3      // <- unnecessary
        fadd.s  fa3,fa5,fa1
        fadd.s  fa4,fa4,fa4
        fmul.s  fa5,fa3,fa3
        fmadd.s fa2,fa4,fa2,ft0
        fmul.s  fa4,fa2,fa2
        fadd.s  fa0,fa5,fa4
        fle.s   a4,fa0,ft1
        bne     a4,zero,.L22

When I modify the tuning description (latency of fmv lowered from 4 → 3), GCC
stops inserting the unnecessary fmv.
RVV benchmark (mandelbrot_scalar_f32) results, measured in bytes per cycle:

INPUT        10          100         1000        10000       100000      1000000
FMV=4    0.0087719   0.0016358   0.0008379   0.0009498   0.0009138   0.0009143
FMV=3    0.0094339   0.0019402   0.0009999   0.0011320   0.0010906   0.0010913
% diff   7.017246%   15.68910%   16.20162%   16.09540%   16.21125%   16.21918%

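We see approximately a 16% improvement for inputs of 100 bytes and above on
RVV mandelbrot_scalar_f32 compared to the fmv=4 version. Note that the % diff
row is computed relative to the fmv=3 value, (FMV3 - FMV4) / FMV3; relative to
the fmv=4 baseline the same figures give roughly 19%.
Diff: https://www.diffchecker.com/XmrzttMk/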
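
To make the percentage arithmetic explicit, here is a small sketch in C. The
helper names pct_rel_new and pct_rel_old are hypothetical and not part of the
benchmark; the first reproduces the "% diff" row from the table, the second
expresses the same gain relative to the fmv=4 baseline.

```c
/* Percentage difference relative to the new (fmv=3) value.
   This is the formula that reproduces the "% diff" row above. */
double pct_rel_new(double old_val, double new_val)
{
    return (new_val - old_val) / new_val * 100.0;
}

/* Percentage improvement relative to the old (fmv=4) baseline. */
double pct_rel_old(double old_val, double new_val)
{
    return (new_val - old_val) / old_val * 100.0;
}
```

For the 1000000 column, pct_rel_new(0.0009143, 0.0010913) gives about 16.22
and pct_rel_old(0.0009143, 0.0010913) gives about 19.36.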

Second minimal reproducer (much simpler)
Code:
float mandelbrot_scalar_f32_reduced()
{
    float zx = 1, zy = 0, zxS = 0;
    while (zxS < 77) {
        zxS = zy + zx + zx;
        zy = zx + zx;
        zx = zxS;
    }
    return zxS;
}
Generated assembly contains another unnecessary move:
.L2:
        fadd.s  fa4,fa0,fa5
        fmv.s   fa5,fa0      // <- unnecessary
        fadd.s  fa0,fa0,fa4
        fadd.s  fa5,fa5,fa5
        flt.s   a5,fa0,fa3
        bne     a5,zero,.L2
        ret
Adjusting the latency of fmv does not remove this redundant instruction, unlike
in the first example.
Full snippet: https://godbolt.org/z/6KaYdPahM
Compiler options: -O3 -mtune=spacemit-x60 -march=rv64gcb
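
To illustrate why the fmv.s in the reduced loop looks avoidable, here is a
source-level sketch in C. It is my reading of the dependency chain, not GCC
output: the function names reduced_reference and reduced_reordered are
hypothetical, and the register mapping in the comments is an assumption based
on the assembly above (fa0 = zx, fa5 = zy, fa4 = temporary). The idea is that
zy = zx + zx can be computed before zx is overwritten, so no copy of the old
zx is required.

```c
/* Reference: the reduced loop as written in the report. */
float reduced_reference(void)
{
    float zx = 1, zy = 0, zxS = 0;
    while (zxS < 77) {
        zxS = zy + zx + zx;
        zy = zx + zx;
        zx = zxS;
    }
    return zxS;
}

/* Hypothetical reordering: every read of the old zx happens before
   zx is overwritten, so no register copy is needed. */
float reduced_reordered(void)
{
    float zx = 1, zy = 0, zxS = 0;
    while (zxS < 77) {
        float t = zy + zx;  /* assumed: fadd.s fa4,fa0,fa5 */
        zy = zx + zx;       /* assumed: fadd.s fa5,fa0,fa0 (reads old zx) */
        zx = zx + t;        /* assumed: fadd.s fa0,fa0,fa4 */
        zxS = zx;
    }
    return zxS;
}
```

Both functions evaluate the same sums in the same association order, so they
should return identical results; with these starting values the loop exits
after five iterations.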
