On Fri, 30 May 2014, Michael Meissner wrote:

> One issue with the current mode setup is that when you create new floating
> point types, the widening system kicks in and the compiler will generate all
> sorts of widening from one 128-bit floating point format to another (because
> internally the precision for IBM extended double is less than the precision
> of IEEE 128-bit, due to the size of the mantissas).  Ideally we need a
> different way to create an alternate floating point mode than
> FRACTIONAL_FLOAT_MODE that does no automatic widening.  If there is a way
> under the current system, I am not aware of it.

When you support both types (under different names) in one compiler, you 
do of course need to support conversions between them - but the compiler 
shouldn't generate such conversions automatically.

Furthermore, if the usual arithmetic conversions are applied to find a 
common type, you have the issue that neither type's values are a subset of 
the other's (__float128 has wider range, but __ibm128 can represent values 
with discontiguous mantissa bits spanning more than 113 bits).  DTS 
18661-3 (N1834) says "If both operands have floating types and neither of 
the sets of values of their corresponding real types is a subset of (or 
equivalent to) the other, the behavior is undefined.".  I'd suggest making 
this (mixed arithmetic or conditional expressions between __float128 and 
__ibm128) an error for both C and C++, so people need to use an explicit 
cast, or implicit conversion by assignment etc., if they wish to mix the 
two types in arithmetic.
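
To make the "neither is a subset" point concrete, here is a small sketch in 
Python (not GCC code) using exact rationals.  It models an __ibm128 value as 
the exact sum of two doubles and assumes the usual 113-bit significand for 
__float128.  The helper name significand_bits and the sample values are 
hypothetical, chosen only for illustration.

```python
import sys
from fractions import Fraction

BINARY128_PRECISION = 113   # significand bits in IEEE binary128 (__float128)

def significand_bits(value: Fraction) -> int:
    """Bits spanned from the leading 1 to the trailing 1 of a dyadic rational."""
    num = abs(value.numerator)
    if num == 0:
        return 0
    trailing_zeros = (num & -num).bit_length() - 1
    return num.bit_length() - trailing_zeros

# An __ibm128-style pair: hi = 1.0, lo = 2**-200.  The pair holds the sum
# exactly, but the two set bits are 200 positions apart.
pair_value = Fraction(1.0) + Fraction(2.0 ** -200)
pair_bits = significand_bits(pair_value)

# A binary128-style value: 2**2000 needs one significand bit and its exponent
# is well inside binary128's range, yet it exceeds any sum of two finite
# doubles.
big = Fraction(2) ** 2000
exceeds_double_double = big > 2 * Fraction(sys.float_info.max)

print(pair_bits > BINARY128_PRECISION, exceeds_double_double)
```

The first value cannot be held in 113 contiguous bits, and the second cannot 
be held in a pair of doubles, so no conversion in either direction is 
value-preserving in general.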

(Conversion from __ibm128 to __float128 is a matter of converting the two 
halves and adding them - except for signed zero you must just convert the 
top half to avoid getting a zero of the wrong sign, and for NaNs you must 
also just convert the top half to avoid a spurious exception if the top 
half is a quiet NaN (meaning the whole long double is a quiet NaN) but the 
low half is a signaling NaN.  Conversion from __float128 to __ibm128 would 
presumably be done in the usual way of converting to double, and, if the 
result is finite, subtracting the double from the __float128 value, 
converting the remainder, and renormalizing in case the low part you get 
that way is exactly 0.5ulp of the high part and the high part has its low 
bit set.)
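
The two conversions described above can be sketched as follows.  This is a 
Python model, not an actual implementation: the wide (__float128) side is 
represented by an exact Fraction, the function names are hypothetical, and 
the renormalization is done with a fast two-sum step, which is my assumption 
about one reasonable way to do it.  Infinities are passed through only 
because Fraction cannot hold them, and overflow during renormalization near 
the largest double is not handled.

```python
import math
from fractions import Fraction

def ibm128_to_wide(hi: float, lo: float):
    """Model of __ibm128 -> __float128 for a (hi, lo) pair of doubles."""
    if hi == 0.0 or math.isnan(hi) or math.isinf(hi):
        # Zero: convert only the top half so the sign of zero is preserved.
        # NaN: ignore the low half, so a signaling NaN there is never touched.
        return hi
    # Otherwise convert both halves and add them (exact in this model).
    return Fraction(hi) + Fraction(lo)

def wide_to_ibm128(x: Fraction):
    """Model of __float128 -> __ibm128, returning a (hi, lo) pair of doubles."""
    try:
        hi = float(x)                 # correctly rounded conversion to double
    except OverflowError:
        return (math.inf if x > 0 else -math.inf, 0.0)
    if hi == 0.0:
        return (hi, 0.0)
    # Subtract the double from the wide value and convert the remainder.
    lo = float(x - Fraction(hi))
    # Renormalize: if lo is exactly half an ulp of hi and hi is odd, hi + lo
    # rounds to the even neighbour and lo changes sign.
    new_hi = hi + lo
    new_lo = lo - (new_hi - hi)
    return (new_hi, new_lo)

# A value just below the midpoint between two doubles, with an odd top half.
x = 1 + Fraction(1, 2**52) + Fraction(1, 2**53) - Fraction(1, 2**112)
print(wide_to_ibm128(x))
```

For this x, the first conversion to double gives 1 + 2**-52 (low bit set) 
and the remainder rounds to exactly 2**-53, which is the case the 
renormalization step exists for.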

-- 
Joseph S. Myers
jos...@codesourcery.com
