Re: RFR: JDK-8289551: Conversions between bit representations of half precision values and floats [v6]

Raffaello Giulietti Sun, 24 Jul 2022 08:49:11 -0700

On Sat, 23 Jul 2022 20:03:39 GMT, Raffaello Giulietti <d...@openjdk.org> wrote:


>> src/java.base/share/classes/java/lang/Float.java line 1122:
>> 
>>> 1120:             // binary16 (when rounding is done, could still round up)
>>> 1121:             int exp = Math.getExponent(f);
>>> 1122:             assert -25 <= exp && exp <= 15;
>> 
>> I think that both the subnormal and the normal case can be unified if we pay 
>> closer attention to the positions of the lsb, round and sticky bits in 
>> subnormals.
>> 
>> 
>>         // Clamp exp to the [-15, 15] range while retaining the
>>         // difference between the original value and -15 on clamping.
>>         // This is the excess shift value in addition to 13.
>>         int expdelta = Math.max(0, -15 - exp);
>>         exp += expdelta;
>>         assert -15 <= exp && exp <= 15;
>> 
>>         int f_signif_bits = doppel & 0x007f_ffff;  // original significand
>>         // Significand bits as if using rounding to zero (truncation).
>>         short signif_bits = (short)(f_signif_bits >> (13 + expdelta));
>> 
>>         // For round to nearest even, determining whether or
>>         // not to round up (in magnitude) is a function of the
>>         // least significant bit (LSB), the next bit position
>>         // (the round position), and the sticky bit (whether
>>         // there are any nonzero bits in the exact result to
>>         // the right of the round digit). An increment occurs
>>         // in three cases:
>>         //
>>         // LSB  Round Sticky
>>         // 0    1     1
>>         // 1    1     0
>>         // 1    1     1
>>         // See "Computer Arithmetic Algorithms," Koren, Table 4.9
>> 
>>         int lsb    = f_signif_bits & (1 << 13 + expdelta);
>>         int round  = f_signif_bits & (1 << 12 + expdelta);
>>         int sticky = f_signif_bits & ((1 << 12 + expdelta) - 1);
>> 
>>         if (round != 0 && ((lsb | sticky) != 0 )) {
>>             signif_bits++;
>>         }
>> 
>>         // No bits set in significand beyond the *first* exponent
>>         // bit, not just the significand; quantity is added to the
>>         // exponent to implement a carry out from rounding the
>>         // significand.
>>         assert (0xf800 & signif_bits) == 0x0;
>> 
>>         return (short)(sign_bit | ( ((exp + 15) << 10) + signif_bits ) );
>
> I didn't test this variant, will do tomorrow when also reviewing the tests 
> themselves.

The corrected variant below passes the tests.


        // For binary16 subnormals, besides forcing exp to -15,
        // retain the difference expdelta = E_min - exp, where E_min = -14.
        // This is the excess shift value, in addition to 13, to be used
        // in the computations below.
        // Furthermore, the (hidden) msb of f, with value 1, must be
        // included in the significand as well.
        int expdelta = 0;
        int msb = 0x0000_0000;
        if (exp < -14) {
            expdelta = -14 - exp;
            exp = -15;
            msb = 0x0080_0000;
        }
        int f_signif_bits = (doppel & 0x007f_ffff) | msb;

        // Significand bits as if using rounding to zero (truncation).
        short signif_bits = (short)(f_signif_bits >> (13 + expdelta));

        // For round to nearest even, determining whether or
        // not to round up (in magnitude) is a function of the
        // least significant bit (LSB), the next bit position
        // (the round position), and the sticky bit (whether
        // there are any nonzero bits in the exact result to
        // the right of the round digit). An increment occurs
        // in three cases:
        //
        // LSB  Round Sticky
        // 0    1     1
        // 1    1     0
        // 1    1     1
        // See "Computer Arithmetic Algorithms," Koren, Table 4.9

        int lsb    = f_signif_bits & (1 << (13 + expdelta));
        int round  = f_signif_bits & (1 << (12 + expdelta));
        int sticky = f_signif_bits & ((1 << (12 + expdelta)) - 1);

        if (round != 0 && ((lsb | sticky) != 0 )) {
            signif_bits++;
        }

        // No bits set in significand beyond the *first* exponent
        // bit, not just the significand; quantity is added to the
        // exponent to implement a carry out from rounding the
        // significand.
        assert (0xf800 & signif_bits) == 0x0;

        return (short)(sign_bit | ( ((exp + 15) << 10) + signif_bits ) );
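
To try the fragment in isolation, it can be wrapped in a self-contained
method, as in the minimal sketch below. The class name Float16Sketch, the
derivation of doppel and sign_bit, the NaN pattern 0x7e00, and the two
early returns for overflow and underflow are assumptions made for the
sketch; they are not taken from this message and need not match the code
under review. A useful probe is 0x1.0p-24f, the smallest binary16
subnormal: its float fraction field is all zeros, so only the hidden bit
can produce the expected result 0x0001.

```java
final class Float16Sketch {

    /** Sketch: float to binary16 bits, rounding to nearest even. */
    static short floatToFloat16(float f) {
        int doppel = Float.floatToRawIntBits(f);
        // Assumption: the sign is moved from bit 31 to bit 15.
        short sign_bit = (short) ((doppel & 0x8000_0000) >> 16);

        if (Float.isNaN(f)) {
            // Assumption: every NaN collapses to one quiet NaN pattern.
            return (short) (sign_bit | 0x7e00);
        }

        float abs_f = Math.abs(f);
        // Assumption: 65520 = binary16 MAX_VALUE + ulp/2 rounds to infinity.
        if (abs_f >= 65520.0f) {
            return (short) (sign_bit | 0x7c00);
        }
        // Assumption: at most half the smallest subnormal rounds to zero.
        if (abs_f <= 0x1.0p-25f) {
            return sign_bit;
        }

        int exp = Math.getExponent(f);
        assert -25 <= exp && exp <= 15;

        // Subnormal results: force exp to -15, remember the excess shift,
        // and make the hidden bit of f explicit.
        int expdelta = 0;
        int msb = 0x0000_0000;
        if (exp < -14) {
            expdelta = -14 - exp;
            exp = -15;
            msb = 0x0080_0000;
        }
        int f_signif_bits = (doppel & 0x007f_ffff) | msb;

        // Truncated significand, then the round-to-nearest-even decision.
        short signif_bits = (short) (f_signif_bits >> (13 + expdelta));
        int lsb    = f_signif_bits & (1 << (13 + expdelta));
        int round  = f_signif_bits & (1 << (12 + expdelta));
        int sticky = f_signif_bits & ((1 << (12 + expdelta)) - 1);
        if (round != 0 && ((lsb | sticky) != 0)) {
            signif_bits++;
        }
        assert (0xf800 & signif_bits) == 0x0;

        // Adding (not or-ing) lets a rounding carry bump the exponent.
        return (short) (sign_bit | (((exp + 15) << 10) + signif_bits));
    }

    public static void main(String[] args) {
        float[] probes = { 1.0f, 0x1.0p-24f, 0x1.8p-24f, 65504.0f };
        for (float p : probes) {
            System.out.printf("%a -> 0x%04x%n", p, floatToFloat16(p) & 0xffff);
        }
    }
}
```

With the earlier quoted variant, 0x1.0p-24f gives expdelta = 9 and an
all-zero f_signif_bits, so the result truncates to 0x0000; with the hidden
bit included, the shift by 23 leaves exactly one bit and the result is
0x0001.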

-------------

PR: https://git.openjdk.org/jdk/pull/9422
