RE: SH optimized software floating point routines

Joern Rennecke Thu, 22 Jul 2010 19:00:46 -0700

Quoting "Joseph S. Myers" <jos...@codesourcery.com>:

> That diff does not appear to relate to undefined behavior.  GCC considers
> these out-of-range conversions to yield an unspecified value, possibly
> raising an exception, as per Annex F, and does not take the liberty of
> optimizing on the basis of them being undefined when not in an IEEE mode.


Well, still, the test is wrong in possibly raising an exception there,
with no provisions to ignore the exception or catch any signal raised.

For the ARCompact, in order to test the floating point emulation better,
I had small wrappers for each function (they are still there in
#if 0 /*DEBUG */ blocks) that evaluate it once with the hand-optimized
version and once with fp-bit.c, and abort on getting different values.
Now, fp-bit generally tries to yield some value that the programmer thought
might mean something, whereas the hand-optimized version treats computations
of unspecified values as irrelevant.
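
To illustrate the kind of wrapper meant here, a minimal C sketch follows.
It is not the actual ARCompact debug code: the names fixunsdfsi_opt,
fixunsdfsi_ref and fixunsdfsi_checked are hypothetical, the "optimized"
routine is a plain C stand-in for the assembly, and the reference routine
only imitates the idea of returning tidy values out of range; it is not
taken from fp-bit.c.  IEEE double layout is assumed.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a hand-optimized routine: correct for
   0 <= x < 2^32, and whatever falls out of the shifts otherwise. */
static uint32_t fixunsdfsi_opt(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);                 /* raw IEEE double bits */
    int exp = (int)((bits >> 52) & 0x7ff) - 1023;   /* unbiased exponent */
    if (exp < 0)
        return 0;                                   /* magnitude below 1.0 */
    uint64_t mant = (bits & 0xfffffffffffffULL) | (1ULL << 52); /* implicit 1 */
    /* Exact for exp <= 31.  The count is masked so that larger exponents
       give an unspecified but well-defined result instead of C-level
       undefined behaviour.  The sign bit is never looked at. */
    return (uint32_t)(mant >> ((52 - exp) & 63));
}

/* Hypothetical stand-in for a reference routine that tries to return
   something meaningful for every input. */
static uint32_t fixunsdfsi_ref(double x)
{
    if (!(x >= 1.0))
        return 0;                   /* also catches NaN and negatives */
    if (x >= 4294967296.0)
        return UINT32_MAX;          /* clamp instead of wrapping */
    return (uint32_t)x;
}

/* Debug wrapper: evaluate both versions and abort on any mismatch. */
uint32_t fixunsdfsi_checked(double x)
{
    uint32_t fast = fixunsdfsi_opt(x);
    uint32_t ref = fixunsdfsi_ref(x);
    if (fast != ref)
        abort();
    return fast;
}
```

For in-range arguments the two versions agree, so the wrapper is
transparent.  For out-of-range arguments (negative values, 2^32 and up,
NaN) they will typically differ and the wrapper aborts, which is why such
a comparison only makes sense on inputs whose result is actually specified.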

Considering:

GLOBAL(fixunsdfsi):
        mov.w   LOCAL(x413),r1  ! bias + 20
        mov     DBL0H,r0
        shll    DBL0H
        mov.l   LOCAL(mask),r3
        mov     #-21,r2
        shld    r2,DBL0H        ! SH4-200 will start this insn in a new cycle
        bt/s    LOCAL(ret0)
        sub     r1,DBL0H
        cmp/pl  DBL0H           ! SH4-200 will start this insn in a new cycle
        and     r3,r0
        bf/s    LOCAL(ignore_low)
        addc    r3,r0   ! uses T == 1; sets implicit 1
        mov     #11,r2
        shld    DBL0H,r0        ! SH4-200 will start this insn in a new cycle
        cmp/gt  r2,DBL0H
        add     #-32,DBL0H
        bt      LOCAL(retmax)
        shld    DBL0H,DBL0L
        rts
        or      DBL0L,r0

and:

__fixunsdfsi:
        bbit0 DBL0H,30,.Lret0or1
        lsr r2,DBL0H,20
        bmsk_s DBL0H,DBL0H,19
        sub_s r2,r2,19; 0x3ff+20-0x400
        neg_s r3,r2
        btst_s r3,10
        bset_s DBL0H,DBL0H,20
#ifdef __LITTLE_ENDIAN__
        mov.ne DBL0L,DBL0H
        asl DBL0H,DBL0H,r2
#else
        asl.eq DBL0H,DBL0H,r2
        lsr.ne DBL0H,DBL0H,r3
#endif
        lsr DBL0L,DBL0L,r3
        j_s.d [blink]
        add.eq r0,r0,r1
.Lret0:
        j_s.d [blink]
        mov_l r0,0
.Lret0or1:
        add_s DBL0H,DBL0H,0x100000
        lsr_s DBL0H,DBL0H,30
        j_s.d [blink]
        bmsk_l r0,DBL0H,0

You can see that an SH4-300 can perform software floating point
fixunsdfsi in ten cycles, and the SH4-400 (SH4-200 sans FPU)
and ARC700 in twelve.

Adding any code in order to compute nice, fluffy values for
unspecified results would cause a significant performance degradation.
