Re: [PATCH] RISC-V: Split unordered FP comparisons into individual RTL insns

Maciej W. Rozycki Thu, 23 Jun 2022 06:39:46 -0700

On Thu, 9 Jun 2022, Maciej W. Rozycki wrote:

>  I'm yet running some benchmarking to see if the use of UNSPEC_VOLATILEs 
> makes any noticeable performance difference, but I suspect it does not as 
> the compiler could not do much about the original multiple-instruction 
> single RTL insns anyway.


 This has now finally completed.  I used SPECfp 2006 built at `-O3' and 
statically linked, which needs ~33 hours per run with the HiFive Unmatched 
board at its standard 1196MHz clock rate.  Here are the results merged by 
hand from original reports:

               Base     Base     Base     Base   Est Base  Est Base  Est Base   
Benchmarks     Ref.   Run Time Run Time Run Time   Ratio     Ratio     Ratio  
                       (base)  (length) (split)   (base)   (length)   (split)  
------------- ------  -------  -------  -------  --------  --------  -------- 
410.bwaves     13590    10353    10396    10370    1.31      1.31      1.31   
416.gamess     19580     9080     9410     9284    2.16      2.08      2.11   
433.milc        9180     5465     5475     5610    1.68      1.68      1.64   
434.zeusmp      9100     5773     5767     5761    1.58      1.58      1.58   
435.gromacs     7140     3605     3561     3545    1.98      2.00      2.01   
436.cactusADM  11950     7779     7658     7680    1.54      1.56      1.56   
437.leslie3d    9400    10280    10697    10274    0.914     0.879     0.915  
444.namd        8020     3141     3120     3129    2.55      2.57      2.56   
447.dealII     11440     3459     3490     3495    3.31      3.28      3.27   
450.soplex      8340     4698     4899     4781    1.78      1.70      1.74   
453.povray      5320     1953     1922     1916    2.72      2.77      2.78   
454.calculix    8250     4844     4857     4821    1.70      1.70      1.71   
459.GemsFDTD   10610     8703     8957     9028    1.22      1.18      1.18   
465.tonto       9840     4585     4539     4620    2.15      2.17      2.13   
470.lbm        13740    10172    10945    10789    1.35      1.26      1.27   
481.wrf        11170     8516     8646     8584    1.31      1.29      1.30   
482.sphinx3    19490     9240     9267     9280    2.11      2.10      2.10   
==============================================================================

The execution time reference (second column) is for a Sun Ultra Enterprise 
2 system from 1997, based on a 296MHz UltraSPARC II CPU, times are given 
in seconds (lower is better) and the ratios calculated are in relation to 
the reference (higher is better).

In the table above "base" results are with upstream master as at commit 
7b98910406b5 ("c++: ICE with template NEW_EXPR [PR105803]".  Then "length" 
results are with commit 72b185189f91 ("RISC-V: Reset the length to the 
default of 4 for FP comparisons") applied on top, as it does make changes 
to code produced even at `-O3' (where size matters less than speed), e.g.:

    46b2c:      8d01b787                fld     fa5,-1840(gp) # 7760a8 
<__SDATA_BEGIN__+0xd0>
-   46b30:      66f4b027                fsd     fa5,1632(s1)
-   46b34:      a029                    j       46b3e <gciinp_+0x124>
-   46b36:      8c01b787                fld     fa5,-1856(gp) # 776098 
<__SDATA_BEGIN__+0xc0>
-   46b3a:      66f4b027                fsd     fa5,1632(s1)
-   46b3e:      00ab67b7                lui     a5,0xab6
-   46b42:      0a07b707                fld     fa4,160(a5) # ab60a0 <runopt_>
-   46b46:      8d81b787                fld     fa5,-1832(gp) # 7760b0 
<__SDATA_BEGIN__+0xd8>
-   46b4a:      a2f727d3                feq.d   a5,fa4,fa5
-   46b4e:      18079fe3                bnez    a5,474ec <gciinp_+0xad2>
-   46b52:      00afd7b7                lui     a5,0xafd
-   46b56:      4607a703                lw      a4,1120(a5) # afd460 
<symtry_+0x47340>
-   46b5a:      4785                    li      a5,1
-   46b5c:      18f708e3                beq     a4,a5,474ec <gciinp_+0xad2>
-   46b60:      00aaeab7                lui     s5,0xaae
-   46b64:      d70a8a93                addi    s5,s5,-656 # aadd70 <infoa_>
-   46b68:      008aa783                lw      a5,8(s5)
-   46b6c:      8301b707                fld     fa4,-2000(gp) # 776008 
<__SDATA_BEGIN__+0x30>
-   46b70:      37fd                    addiw   a5,a5,-1
+   46b30:      00ab67b7                lui     a5,0xab6
+   46b34:      0a07b707                fld     fa4,160(a5) # ab60a0 <runopt_>
+   46b38:      66f4b027                fsd     fa5,1632(s1)
+   46b3c:      8d81b787                fld     fa5,-1832(gp) # 7760b0 
<__SDATA_BEGIN__+0xd8>
+   46b40:      a2f727d3                feq.d   a5,fa4,fa5
+   46b44:      c39d                    beqz    a5,46b6a <gciinp_+0x150>
+   46b46:      8901b787                fld     fa5,-1904(gp) # 776068 
<__SDATA_BEGIN__+0x90>
+   46b4a:      66f4b027                fsd     fa5,1632(s1)
+   46b4e:      a02d                    j       46b78 <gciinp_+0x15e>
+   46b50:      8c01b787                fld     fa5,-1856(gp) # 776098 
<__SDATA_BEGIN__+0xc0>
+   46b54:      66f4b027                fsd     fa5,1632(s1)
+   46b58:      00ab67b7                lui     a5,0xab6
+   46b5c:      0a07b707                fld     fa4,160(a5) # ab60a0 <runopt_>
+   46b60:      8d81b787                fld     fa5,-1832(gp) # 7760b0 
<__SDATA_BEGIN__+0xd8>
+   46b64:      a2f727d3                feq.d   a5,fa4,fa5
+   46b68:      fff9                    bnez    a5,46b46 <gciinp_+0x12c>
+   46b6a:      00afd7b7                lui     a5,0xafd
+   46b6e:      4607a703                lw      a4,1120(a5) # afd460 
<symtry_+0x47340>
+   46b72:      4785                    li      a5,1
+   46b74:      fcf709e3                beq     a4,a5,46b46 <gciinp_+0x12c>
+   46b78:      00aaeab7                lui     s5,0xaae
+   46b7c:      d70a8a93                addi    s5,s5,-656 # aadd70 <infoa_>
+   46b80:      008aa783                lw      a5,8(s5)
+   46b84:      8301b707                fld     fa4,-2000(gp) # 776008 
<__SDATA_BEGIN__+0x30>
+   46b88:      37fd                    addiw   a5,a5,-1

And finally "split" is with this patch also applied, changing code in 
places as well, e.g.:

@@ -4873598,13 +4873598,13 @@
   5f5744:      87bf70ef                jal     ra,5ecfbe 
<_gfortrani_internal_error>
   5f5748:      8281b407                fld     fs0,-2008(gp) # 776000 
<__SDATA_BEGIN__+0x28>
   5f574c:      221c                    fld     fa5,0(a2)
-  5f574e:      0079a7b7                lui     a5,0x79a
-  5f5752:      ac22                    fsd     fs0,24(sp)
-  5f5754:      a83e                    fsd     fa5,16(sp)
-  5f5756:      27c2                    fld     fa5,16(sp)
-  5f5758:      4907b707                fld     fa4,1168(a5) # 79a490 
<__global_pointer$+0x23cb8>
-  5f575c:      22f7a7d3                fabs.d  fa5,fa5
-  5f5760:      00102773                frflags a4
+  5f574e:      ac22                    fsd     fs0,24(sp)
+  5f5750:      a83e                    fsd     fa5,16(sp)
+  5f5752:      27c2                    fld     fa5,16(sp)
+  5f5754:      22f7a7d3                fabs.d  fa5,fa5
+  5f5758:      00102773                frflags a4
+  5f575c:      0079a7b7                lui     a5,0x79a
+  5f5760:      4907b707                fld     fa4,1168(a5) # 79a490 
<__global_pointer$+0x23cb8>
   5f5764:      a2e787d3                fle.d   a5,fa5,fa4
   5f5768:      00171073                fsflags a4
   5f576c:      2c078363                beqz    a5,5f5a32 
<determine_en_precision+0x328>

or:

@@ -4909410,9 +4909410,9 @@
   60eb8a:      a2f696d3                flt.d   a3,fa3,fa5
   60eb8e:      00161073                fsflags a2
   60eb92:      ee81                    bnez    a3,60ebaa <__hypot+0x58>
-  60eb94:      f20707d3                fmv.d.x fa5,a4
-  60eb98:      22f7a7d3                fabs.d  fa5,fa5
-  60eb9c:      00102673                frflags a2
+  60eb94:      00102673                frflags a2
+  60eb98:      f20707d3                fmv.d.x fa5,a4
+  60eb9c:      22f7a7d3                fabs.d  fa5,fa5
   60eba0:      a2f696d3                flt.d   a3,fa3,fa5
   60eba4:      00161073                fsflags a2
   60eba8:      c29d                    beqz    a3,60ebce <__hypot+0x7c>

(so no arithmetic FP instructions appear to be scheduled between FSFLAGS 
and FRFLAGS, though it's not clear to me how the compiler knows it is not 
allowed do it) or finally:

-   66204:      52cd754b                fnmsub.d        fa0,fs10,fa2,fa0
-   66208:      40157553                fcvt.s.d        fa0,fa0
-   6620c:      a0e517d3                flt.s   a5,fa0,fa4
-   66210:      58079263                bnez    a5,66794 <do_cg+0xd20>
-   66214:      00102773                frflags a4
-   66218:      a0e517d3                flt.s   a5,fa0,fa4
-   6621c:      00171073                fsflags a4
-   66220:      220793e3                bnez    a5,66c46 <do_cg+0x11d2>
-   66224:      580576d3                fsqrt.s fa3,fa0
+   6620c:      52cd754b                fnmsub.d        fa0,fs10,fa2,fa0
+   66210:      40157553                fcvt.s.d        fa0,fa0
+   66214:      a0e517d3                flt.s   a5,fa0,fa4
+   66218:      58079063                bnez    a5,66798 <do_cg+0xd1c>
+   6621c:      00102773                frflags a4
+   66220:      00171073                fsflags a4
+   66224:      220793e3                bnez    a5,66c4a <do_cg+0x11ce>
+   66228:      580576d3                fsqrt.s fa3,fa0

(at least removing a redundant FLT.S instruction, although this doesn't 
seem optimal anyway as there appears no way for the second BNEZ branch to 
be ever taken, but I gather that's an unfortunate consequence of the 
volatility of `riscv_frflags'/`riscv_fsflags' RTL insns) and I was able to 
spot a place where an FMV.D instruction has been removed too, indicating a 
better register allocation.

 Results quoted above seem to suggest that in some cases a performance 
regression has resulted from the change, but that may not necessarily be 
the case given that the benchmarks have been run on a live even if lightly 
loaded Linux system.  Obtaining standard three samples would require ~4.5 
days per SPECfp 2006 iteration or almost a fortnight total.

 Therefore I chose to rerun only one of the worst offenders and the 
results are as follows:

                                  Estimated  
                Base     Base       Base     
Benchmarks      Ref.   Run Time     Ratio    
-------------- ------  ---------  ---------  
416.gamess      19580       9138       2.14 S
416.gamess      19580       9498       2.06 S
416.gamess      19580       9478       2.07 *

corresponding to the "split" result earlier on.  So the variation between 
runs is similar to the supposed loss of performance and therefore I think 
we do not need to be concerned.  If there's anything that we're missing, 
it's the tracking of IEEE exception flags, as I previously mentioned.

 I did not run benchmarking for `-fsignaling-nans'.  Relative figures are 
expected to be similar as the only difference is the presence of a FEQ.fmt 
instruction following FSFLAGS.  I've spotted this anomaly however:

-   3c670:      5a057553                fsqrt.d fa0,fa0
-   3c674:      f20006d3                fmv.d.x fa3,zero
-   3c678:      8d01b787                fld     fa5,-1840(gp) # c5678 
<__SDATA_BEGIN__+0xd0>
-   3c67c:      0ad57553                fsub.d  fa0,fa0,fa3
-   3c680:      a2a797d3                flt.d   a5,fa5,fa0
-   3c684:      e3cd                    bnez    a5,3c726 
<_ZN9ResultSet5checkEv+0xe6>
-   3c686:      8d81b787                fld     fa5,-1832(gp) # c5680 
<__SDATA_BEGIN__+0xd8>
-   3c68a:      a2f517d3                flt.d   a5,fa0,fa5
-   3c68e:      efc1                    bnez    a5,3c726 
<_ZN9ResultSet5checkEv+0xe6>
-   3c690:      3578                    fld     fa4,232(a0)
-   3c692:      317c                    fld     fa5,224(a0)
-   3c694:      3968                    fld     fa0,240(a0)
-   3c696:      12e77753                fmul.d  fa4,fa4,fa4
-   3c69a:      72f7f7c3                fmadd.d fa5,fa5,fa5,fa4
-   3c69e:      7aa57543                fmadd.d fa0,fa0,fa0,fa5
-   3c6a2:      00102773                frflags a4
-   3c6a6:      a2d517d3                flt.d   a5,fa0,fa3
-   3c6aa:      00171073                fsflags a4
-   3c6ae:      a2d52053                feq.d   zero,fa0,fa3
-   3c6b2:      efc9                    bnez    a5,3c74c 
<_ZN9ResultSet5checkEv+0x10c>
-   3c6b4:      5a057553                fsqrt.d fa0,fa0
+   3c66e:      5a057553                fsqrt.d fa0,fa0
+   3c672:      f20006d3                fmv.d.x fa3,zero
+   3c676:      8d01b787                fld     fa5,-1840(gp) # c5678 
<__SDATA_BEGIN__+0xd0>
+   3c67a:      0ad57553                fsub.d  fa0,fa0,fa3
+   3c67e:      a2a797d3                flt.d   a5,fa5,fa0
+   3c682:      e7cd                    bnez    a5,3c72c 
<_ZN9ResultSet5checkEv+0xee>
+   3c684:      8d81b787                fld     fa5,-1832(gp) # c5680 
<__SDATA_BEGIN__+0xd8>
+   3c688:      a2f517d3                flt.d   a5,fa0,fa5
+   3c68c:      e3c5                    bnez    a5,3c72c 
<_ZN9ResultSet5checkEv+0xee>
+   3c68e:      3578                    fld     fa4,232(a0)
+   3c690:      317c                    fld     fa5,224(a0)
+   3c692:      3968                    fld     fa0,240(a0)
+   3c694:      12e77753                fmul.d  fa4,fa4,fa4
+   3c698:      72f7f7c3                fmadd.d fa5,fa5,fa5,fa4
+   3c69c:      7aa57543                fmadd.d fa0,fa0,fa0,fa5
+   3c6a0:      00102773                frflags a4
+   3c6a4:      a2d517d3                flt.d   a5,fa0,fa3
+   3c6a8:      00171073                fsflags a4
+   3c6ac:      f20007d3                fmv.d.x fa5,zero
+   3c6b0:      a2f52053                feq.d   zero,fa0,fa5
+   3c6b4:      efd9                    bnez    a5,3c752 
<_ZN9ResultSet5checkEv+0x114>
+   3c6b6:      5a057553                fsqrt.d fa0,fa0

where the compiler for some reason cannot realise it already has the value 
of 0.0 available in fa3 and instead uses an extra move to fa5 for the 
final FEQ.D.

>  No regressions with the GCC (with and w/o `-fsignaling-nans') and glibc 
> testsuites (as per commit 1fcbfb00fc67 ("RISC-V: Fix -fsignaling-nans for 
> glibc testsuite.")).  OK to apply?

 Any comments on the change, anyone?

  Maciej

Re: [PATCH] RISC-V: Split unordered FP comparisons into individual RTL insns

Reply via email to