[fpc-pascal] Functions for dealing with floating-point precision

Ekkehard Domning via fpc-pascal Sun, 15 Sep 2024 00:34:17 -0700

Dear Thomas,

thank your for the excellent approach for solving the problem, finding the closest smaller or greater floating point value to a given one.

We had an descent discussion about the same topic on "LazarusForum.de"


https://www.lazarusforum.de/viewtopic.php?f=10&t=16019&hilit=kleinste&start=15

where your solution comes up.

I packed your code into a working unit, so that it may be easier to include into other projects.

While performing an intense testing, I found two minor bugs.

The first one is, that the const MinDouble from the unit math.pas

is not the smallest Double which could be generated. Thus the code will bypass all values between MinDouble (2.2250738585072014e-308) and 4.9406564584124654E-324

The second one is that the usage of the const MaxSingle, will not work as expected. The compiler will obviously convert MaxSingle into a Double value and subsequently calling the related Double functions, not as expected the Single function.

Example:
  CopySign (MaxSingle, Start)

will call
  function CopySign (x: Double; y: Double): Double;
and not
  function CopySign (x: Single; y: Single): Single;
This will luckily not lead into errors.

In the function NextAfter

 if (toZero) then begin
 (...)
 end else begin
   // we are increasing the magnitude, toward +-Infinity
   if (Start = 0.0) then Exit (CopySign (MinSingle, Direction));
   if (absStart = MaxSingle) then Exit (CopySign (Infinity, Start));
   Exit (CopySign (LongBitsToSingle(Succ(SingleToLongBits(absStart))),
Start));
  end;

this conversion leads to an unwanted behaviour in the line:

   if (absStart = MaxSingle) then Exit (CopySign (Infinity, Start));

since the term (absStart = MaxSingle) will never become true, even if NextAfter is called with a value containing MaxSingle. The reason is

that
  Single(MaxSingle) <> MaxSingle

Surprisingly the function is working and delivers the correct result, since the following line


CopySign (LongBitsToSingle(Succ(SingleToLongBits(absStart))),Start));

executed absStart containing MaxSingle, will deliver Infinity as expected.
So I decided to remove this line.

The unit contains some new defined consts and uses them to bypass the described problems


  MinMinDouble = 4.9406564584124654E-324;
  SingleMaxSingle : Single = Single(MaxSingle);
  NegSingleMaxSingle : Single = Single(-MaxSingle);

I attached the unit, which is also slightly modified due to my different pascal formatting style.


Thank you very much for sharing.

Best regards
Ekkehard Domning

{
  Unit UNextFloat
  a very close adaption of the Source found at
  https://www.mail-archive.com/fpc-pascal@lists.freepascal.org/msg55419.html

  The two mainfunction are
    NextAfter
  and
    ULP

  NextAfter is an replacement for the nextafter function found in the 
"msvcrt.dll"
  
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/nextafter-functions?view=msvc-170
  It returns the next closest float (single or double) value to a Start-value
  to the Direction-value

  ULP is "unit in the last place or unit of least precision (ulp) is the spacing
  between two consecutive floating-point numbers, i.e., the value the least
  significant digit (rightmost digit) represents if it is 1."
  https://en.wikipedia.org/wiki/Unit_in_the_last_place

  -=-=-=-Start of Page excerpt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  [fpc-pascal] Functions for dealing with floating-point precision

  Thomas Kurz via fpc-pascal Sun, 19 Jun 2022 13:38:48 -0700

  Hello,

  in my program I have need for checking floating-point precision. I'm 
internally
  using floating-points for calculations, but in the end I have to use integer
  numbers. I cannot use Round() because I have to check for thresholds. I.e. 
that
  I wish to accept a value of 1.9999.... as being ">=2". As you can see, I 
cannot
  use Trunc() either, so I decided to check for the difference in ULPs (unit in
  the last place).

  I couldn't find the corresponding functions in the RTL, but I think they 
should
  be included. I wrote the functions listed below. I would appreciate if the
  maintainers of FPC decide to add them to the Math unit. I skipped the 
functions
  for the Extended type, but if it's being desired, I can add hand them in 
later.

  Kind regards,
  Thomas
  -=-=-=-End of Page excerpt =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  Version  Date        Who Change
  0.0.1    2024-09-05  EDo First light
  0.0.2    2024-09-05  EDo Bugfix MinDouble, introducing MinMinDouble = 
4.9406564584124654E-324
  0.0.3    2024-09-06  EDo Bugfix in usage of MaxSingle (see comments below)
                           Introducing two typed consts SingleMaxSingle and
                           NegSingleMaxSingle to ease the usage.
}
unit unextfloat;

interface

uses Math;

// CAUTION NextAfter for single values is not compiled into the code if e.g
//   NextAfter(MaxSingle,-MaxSingle)
//   is written. In this case the compiler uses the overloaded Double version!
//   To ensure the usage of the single parameter function
//   always cast MaxSingle to Single (Single(MaxSingle), Single(-MaxSingle))
//   or use the below defined SingleMaxSingle and NegSingleMaxSingle constants.
function NextAfter(const Start: Single; const Direction: Single): Single;
function ULP(Argument: Single): Single;
{$ifdef FPC_HAS_TYPE_DOUBLE}
function NextAfter(const Start: Double; const Direction: Double): Double;
function ULP(Argument: Double): Double;
{$endif FPC_HAS_TYPE_DOUBLE}

// Other functions, probably only needed internal
function SingleToLongBits(const Argument: Single): UInt32;
function LongBitsToSingle(const Argument: UInt32): Single;
// returns the first floating-point argument with the sign of the second 
floating-point argument
function CopySign(const x: Single; const y: Single): Single;
function IsSameSign(const x: Single; const y: Single): Boolean;
function FloatPrior(const Argument: Single): Single;
function FloatNext(const Argument: Single): Single;

{$ifdef FPC_HAS_TYPE_DOUBLE}
function DoubleToLongBits(const Argument: Double): UInt64;
function LongBitsToDouble(Argument: UInt64): Double;
// returns the first floating-point argument with the sign of the second 
floating-point argument
function CopySign(const x: Double; const y: Double): Double;
function IsSameSign(const x: Double; const y: Double): Boolean;
function FloatPrior(const Argument: Double): Double;
function FloatNext(Argument: Double): Double;
{$endif FPC_HAS_TYPE_DOUBLE}

const
  MinMinDouble = 4.9406564584124654E-324;
  MinMinSingle = 1.40129846E-45;
  // MaxSingle failed in equal comparations!
  // singlevalue := MaxSingle;
  // equal := (singlevalue = MaxSingle);
  // will give equal = false !!
  // This is caused obviously by converting MaxSingle into a double value,
  // by the compiler, prior the comparism
  // coding
  // equal := (singlevalue = Single(MaxSingle));
  // will give equal = true!!
  // To bypass the casting two const are introduced to be used
  // instead MaxSingle and -MaxSingle
  SingleMaxSingle : Single = Single(MaxSingle);
  NegSingleMaxSingle : Single = Single(-MaxSingle);

implementation

const MAX_ULP_SINGLE = 2.028240960E+31;
const MAX_ULP_DOUBLE = 1.9958403095347198E+292;


function SingleToLongBits(const Argument: Single): UInt32;
var
  argResult : Cardinal absolute Argument;
begin
  Result := ArgResult;
end;

{$ifdef FPC_HAS_TYPE_DOUBLE}
function DoubleToLongBits(const Argument: Double): UInt64;
var
  argResult : UInt64 absolute Argument;
begin
  Result := argResult;
end;
{$endif FPC_HAS_TYPE_DOUBLE}

function LongBitsToSingle(const Argument: UInt32): Single;
var
  argResult : Single absolute Argument;
begin
  Result := argResult;
end;

{$ifdef FPC_HAS_TYPE_DOUBLE}
function LongBitsToDouble(Argument: UInt64): Double;
var
  argResult : Double absolute Argument;
begin
  Result := argResult;
end;
{$endif FPC_HAS_TYPE_DOUBLE}

// returns the first floating-point argument with the sign of the second 
floating-point argument
function CopySign(const x: Single; const y: Single): Single;
begin
  Result := (Abs(x) * Sign(y));
end;

{$ifdef FPC_HAS_TYPE_DOUBLE}
// returns the first floating-point argument with the sign of the second 
floating-point argument
function CopySign(const x: Double; const y: Double): Double;
begin
  Result := (Abs(x) * Sign(y));
end;
{$endif FPC_HAS_TYPE_DOUBLE}

function IsSameSign(const x: Single; const y: Single): Boolean;
begin
  Result := (CopySign(x, y) = x);
end;

{$ifdef FPC_HAS_TYPE_DOUBLE}
function IsSameSign(const x: Double; const y: Double): Boolean;
begin
  Result := (CopySign(x, y) = x);
end;
{$endif FPC_HAS_TYPE_DOUBLE}

function NextAfter(const Start: Single; const Direction: Single): Single;
var
  absStart: Single;
  absDir: Single;
  toZero: Boolean;
begin
  // If either argument is a NaN, then NaN is returned.
  if IsNan(Start) or IsNan(Direction) then Exit(NaN);
  // If both arguments compare as equal the second argument is returned.
  if (Start = Direction) then Exit(Direction);
  absStart := Abs(Start);
  absDir := Abs(Direction);
  toZero := not IsSameSign(Start, Direction) or (AbsDir < absStart);
  if (toZero) then
  begin
    // we are reducing the magnitude, going toward zero.
    if (absStart = MinMinSingle) then
      Exit(CopySign(0.0, Start));
    if IsInfinite(absStart) then
      Exit(CopySign(SingleMaxSingle, Start));
    Exit(CopySign(LongBitsToSingle(Pred(SingleToLongBits(absStart))), Start));
  end
  else
  begin
    // we are increasing the magnitude, toward +-Infinity
    if (Start = 0.0) then
      Exit(CopySign(MinMinSingle, Direction));
{
// This comparism does not work
//  if (absStart = MaxSingle) then
// to get it work it must be coded
    if (absStart = SingleMaxSingle) then
      Exit(CopySign(Infinity, Start));
// But since the next Value after MaxSingle is Infinity, the whole comparism is 
not needed!
}
    Exit(CopySign(LongBitsToSingle(Succ(SingleToLongBits(absStart))), Start));
  end;
end;

{$ifdef FPC_HAS_TYPE_DOUBLE}
function NextAfter(const Start: Double; const Direction: Double): Double;
var
  absStart: Double;
  absDir: Double;
  toZero: Boolean;
begin
  // If either argument is a NaN, then NaN is returned.
  if IsNan(Start) or IsNan(Direction) then Exit(NaN);
  // If both arguments compare as equal the second argument is returned.
  if (Start = Direction) then Exit(Direction);
  absStart := Abs(Start);
  absDir := Abs(Direction);
  toZero := not IsSameSign(Start, Direction) or (AbsDir < AbsStart);

  if (toZero) then
  begin
    // we are reducing the magnitude, going toward zero.
    if (absStart = MinMinDouble) then
      Exit(CopySign(0.0, Start));
    if IsInfinite(absStart) then
      Exit(CopySign(MaxDouble, Start));
    Exit(CopySign(LongBitsToDouble(Pred(DoubleToLongBits(absStart))),Start));
  end
  else
  begin
    // we are increasing the magnitude, toward +-Infinity
    if (Start = 0.0) then
      Exit(CopySign(MinMinDouble, Direction));
    if (AbsStart = MaxDouble) then
      Exit(CopySign(Infinity, Start));
    Exit(CopySign(LongBitsToDouble(Succ(DoubleToLongBits(absStart))), Start));
  end;
end;
{$endif FPC_HAS_TYPE_DOUBLE}

function FloatPrior(const Argument: Single): Single;
begin
  Result := NextAfter(Argument, NegInfinity);
end;

function FloatNext(const Argument: Single): Single;
begin
  Result := NextAfter(Argument, Infinity);
end;

{$ifdef FPC_HAS_TYPE_DOUBLE}
function FloatPrior(const Argument: Double): Double;
begin
  Result := NextAfter(Argument, NegInfinity);
end;
{$endif FPC_HAS_TYPE_DOUBLE}

{$ifdef FPC_HAS_TYPE_DOUBLE}
function FloatNext(Argument: Double): Double;
begin
  Result := NextAfter(Argument, Infinity);
end;
{$endif FPC_HAS_TYPE_DOUBLE}

function ULP(Argument: Single): Single;
begin
  // If the argument is NaN, then the result is NaN.
  if IsNaN(Argument) then Exit(NaN);
  // If the argument is positive or negative infinity, then the result is 
positive infinity.
  if IsInfinite(Argument) then Exit(Infinity);
  // If the argument is positive or negative zero, then the result is 
Single.MIN_VALUE.
  if (Argument = 0.0) then Exit(MinMinSingle);
  Argument := Abs(Argument);
  // If the argument is ±Single.MAX_VALUE, then the result is equal to 2^104.
  if (Argument = SingleMaxSingle) then
    Exit(MAX_ULP_SINGLE);
  Result := FloatNext(Argument) - Argument;
end;

{$ifdef FPC_HAS_TYPE_DOUBLE}
function ULP(Argument: Double): Double;
begin
  // If the argument is NaN, then the result is NaN.
  if IsNaN(Argument) then Exit(NaN);
  // If the argument is positive or negative infinity, then the result is 
positive infinity.
  if IsInfinite(Argument) then Exit(Infinity);
  // If the argument is positive or negative zero, then the result is 
Double.MIN_VALUE.
  if (Argument = 0.0) then Exit(MinMinDouble);
  Argument := Abs(Argument);
  // If the argument is ±Double.MAX_VALUE, then the result is equal to 2^971.
  if (Argument = MaxDouble) then Exit(MAX_ULP_DOUBLE);
  Result := FloatNext(Argument) - Argument;
end;
{$endif FPC_HAS_TYPE_DOUBLE}

end.

OpenPGP_signature.asc
Description: OpenPGP digital signature

_______________________________________________
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
https://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

[fpc-pascal] Functions for dealing with floating-point precision

Reply via email to