On Mon, 18 Mar 2024 20:50:14 PDT (-0700), jeffreya...@gmail.com wrote:


On 3/18/24 3:09 AM, Jivan Hakobyan wrote:
As RV has round instructions it is reasonable to use them instead of
calling the library functions.

With my patch for the following C code:
double foo(double a) {
     return ceil(a);
}

GCC generates the following ASM code (before it was tail call)
foo:
         fabs.d  fa4,fa0
         lui     a5,%hi(.LC0)
         fld     fa3,%lo(.LC0)(a5)
         flt.d   a5,fa4,fa3
         beq     a5,zero,.L3
         fcvt.l.d a5,fa0,rup

I'm not sure exactly what context this is in, but my reading of "according to the current rounding mode" means we'd usually use the dynamic rounding mode.

         fcvt.d.l        fa4,a5
         fsgnj.d fa0,fa4,fa0
.L3:
         ret

.LC0:
         .word   0
         .word   1127219200     // 0x4330000000000000


The patch I have evaluated on SPEC2017.
Counted dynamic instructions counts and got the following improvements

510.parest_r       262 m      -
511.povray_r      2.1  b        0.04%
521.wrt_r            269 m       -
526.blender_r    3 b             0.1%
527.cam4_r       15 b           0.6%
538.imagick_r    365 b         7.6%

Overall executed 385 billion fewer instructions which is 0.5%.
A few more notes.

The sequence Jivan is using is derived from LLVM.  The condition in the
generated code tests for those values were are supposed to pass through
unaltered.  The condition in the pattern ensures we do something
sensible WRT FE_INEXACT and mirrors how other ports handle these insns.

My only worry here is that when we were doing the other patterns we decided not to do rint. I can't remember exactly why, but from reading the docs we probably just skipped it due to the inexact handling and Zfa having an instruction that just does this. FP stuff is always a bit of a time sink, so at that point it probably just fell off the priority list.

I'm not really an FP guy, so I usually just poke around what the other ports generate and try to figure out what's going on. arm64 has the Zfa instruction and x86 FP is complicated, so I'm not sure exactly who else to look at for this sort of stuff. From just looking at the code, though, I think there's two issues -- I'm not really an FP person, though, so take this with a grain of salt:

IIUC what we've got here doesn't actually set the inexact flag for the bounds check failure case, as we're just loading up an exact constant and doing the bounds check. We're also not clamping to INT_MAX-type values, but not sure if we're supposed to. I think we could fix both of those by adjusting the expansion to something like

             fabs.d  fa4,fa0
             lui     a5,%hi(.LC0)
             fld     fa3,%lo(.LC0)(a5)
             flt.d   a5,fa4,fa3
             bne     a5,zero,.L3
             mv      fa0, fa3
    .L3:
             fcvt.l.d a5,fa0,rup
             fcvt.d.l        fa4,a5
             fsgnj.d fa0,fa4,fa0
             ret

and then adjusting the constant to be an epsilon larger than INT_MAX so it'll still trigger the clamping but also inexact.

There's also a pair of changes to the ISA in 2020 that added the conversion inexact handling requirement, it was a grey area before. I don't remember exactly what happened there, but I did remember it happening. I don't think anyone cares all that much about the performance of systems that target the older ISAs, so maybe we just restrict the non-libcall expansion to ISAs that contain the new wording?

   commit 5890a1a702abf4157d5879717a39d8ecdae0de68 (tag: draft-20201229-5890a1a)
   Author: Andrew Waterman <and...@sifive.com>
   Date:   Mon Dec 28 18:19:44 2020 -0800
Clarify when FP conversions raise the Inexact flag diff --git a/src/f.tex b/src/f.tex
   index 8649d75..97c253b 100644
   --- a/src/f.tex
   +++ b/src/f.tex
   @@ -594,9 +594,9 @@ instructions round according to the {\em rm} field.  A 
floating-point register
    can be initialized to floating-point positive zero using FCVT.S.W {\em rd},
    {\tt x0}, which will never set any exception flags.
-All floating-point conversion instructions set the Inexact exception flag if the
   -result differs from its operand value, yet is representable in the 
destination
   -format.
   +All floating-point conversion instructions set the Inexact exception flag if
   +the rounded result differs from the operand value and the Invalid exception
   +flag is not set.
\vspace{-0.2in}
    \begin{center}
commit 27b40fbc798357fcb4b1deaba4553646fe677576 (tag: draft-20200229-27b40fb)
   Author: Andrew Waterman <and...@sifive.com>
   Date:   Fri Feb 28 18:12:47 2020 -0800
Clarify that FCVT instructions signal inexact diff --git a/src/f.tex b/src/f.tex
   index 7680347..a9022c4 100644
   --- a/src/f.tex
   +++ b/src/f.tex
   @@ -582,6 +582,10 @@ instructions round according to the {\em rm} field.  A 
floating-point register
    can be initialized to floating-point positive zero using FCVT.S.W {\em rd},
    {\tt x0}, which will never set any exception flags.
+All floating-point conversion instructions raise the Inexact exception if the
   +result differs from its operand value, yet is representable in the 
destination
   +format.
   +
    \vspace{-0.2in}
    \begin{center}
    \begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O}

Our internal testing shows a benefit well beyond the 7% reduction in
icounts.  Presumably due to fewer calls, fewer transfers across the
register files, better scheduling around the call site, etc.

Obviously for Zfa we'll use the more efficient instructions for that
extension.  But there's no reason to not go forward with this change for
gcc-15.

Ya, I think those are all pretty minor issues (If the exist at all). So it seems like the right way to go in general.


Jeff

Reply via email to