Efficient way to convert 32 bit float to 32 bit int (SSE)

Holger Strauss Wed, 26 Apr 2023 00:10:25 -0700

Hi,

thank you all for the interesting discussion posts on denorms and
fixed-point/floating-point processing.


I have a problem that is closely related to the arguments posted by B.J.,
who mentioned the lack of saturating arithmetic on x86/x64 processors.

I need to convert a batch of 32-bit float samples to 32-bit int samples with
appropriate clipping, i.e. samples outside the range of a 32-bit int
(-2147483648..2147483647) shall be clipped to -2147483648 or 2147483647.

Because the conversion shall be fast and efficient, I would prefer a
solution using SSE (2/3).

This sounds like an easy problem, but unfortunately it turned out it's not
so simple after all.
So I would like to challenge any SSE experts on this list.

Here is what I have found out already:

Starting with the following sample input:

    const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000, 3000000000.0, -3000000000.0);

My first approach was to convert this directly:

   const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);

This results in 1000, -1000, -2147483648, -2147483648, which is correct for
all input samples except 3000000000.0. It turns out that all values which
cannot be represented by an int32 are converted to -2147483648 (0x80000000).
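
For anyone who wants to reproduce this, here is a minimal sketch. The helper
name convertUnclipped is only for illustration, and it uses _mm_setr_ps
(instead of _mm_set_ps) so that the first argument ends up in lane 0. It
assumes the default MXCSR state (round to nearest, exceptions masked).

```cpp
#include <array>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Illustrative helper (not from any library): converts four floats with
// CVTPS2DQ and returns the four int32 lanes in memory order.
std::array<int32_t, 4> convertUnclipped(float a, float b, float c, float d)
{
    const __m128 in = _mm_setr_ps(a, b, c, d);   // setr: 'a' goes to lane 0
    const __m128i out = _mm_cvtps_epi32(in);     // no saturation happens here
    std::array<int32_t, 4> lanes;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes.data()), out);
    return lanes;
}
```

Calling convertUnclipped(1000.0f, -1000.0f, 3000000000.0f, -3000000000.0f)
should return 0x80000000 (INT32_MIN) for both out-of-range lanes, regardless
of their sign.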

To fix this, my next idea was to clip the maximum value before converting:

    const __m128 sseMax = _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
    const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput, sseMax));

Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is
happening here? The maximum possible int32 (2147483647) cannot be
represented exactly as a 32-bit float. So sseMax is rounded up to
2147483648.0 (printed as 2.14748365e+09), and the clipped value is therefore
still (just) out of range, resulting in the same int32 values.
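
The rounding can be checked in isolation with a small sketch like the one
below. The names roundedInt32Max and clampAndConvert are made up for this
example, and it assumes the usual round-to-nearest behaviour for the
int-to-float conversion.

```cpp
#include <cstdint>
#include <limits>
#include <emmintrin.h>  // SSE2 intrinsics

// The clipping limit as computed above: int32 max converted to float.
// float has only 24 significand bits, so 2147483647 rounds up to 2^31.
inline float roundedInt32Max()
{
    return static_cast<float>(std::numeric_limits<int32_t>::max());
}

// Illustrative helper: clamp x from above against 'limit', convert with
// CVTPS2DQ, and return lane 0 of the result.
inline int32_t clampAndConvert(float x, float limit)
{
    const __m128 clamped = _mm_min_ps(_mm_set1_ps(x), _mm_set1_ps(limit));
    return _mm_cvtsi128_si32(_mm_cvtps_epi32(clamped));
}
```

Here roundedInt32Max() compares equal to 2147483648.0f, so
clampAndConvert(3000000000.0f, roundedInt32Max()) still hands an out-of-range
value to the conversion and yields INT32_MIN.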

My final approach was to make sseMax minimally smaller:

    const __m128 sseMax = _mm_set1_ps(
        std::nextafterf(float(std::numeric_limits<int32_t>::max()), 0.0f));

This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
solution so far, but still not what I want, because 3000000000.0 does not
clip to the maximum possible int32 (2147483647). It is obviously the same
problem as before: The clipping limit cannot be represented exactly as a
float. (sseMax is 2.14748352e+09 here)
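
To make the size of the gap explicit: just below 2^31, adjacent floats are
128 apart, so the largest float that still fits into an int32 is 127 short
of the int32 maximum. A minimal sketch (both function names are
hypothetical, chosen only for this illustration):

```cpp
#include <cmath>
#include <cstdint>

// Largest float strictly below 2^31 (the value sseMax holds above).
inline float largestFloatBelowTwoPow31()
{
    return std::nextafterf(2147483648.0f, 0.0f);
}

// Distance between that float and the int32 maximum, computed in int64
// so the subtraction itself cannot overflow.
inline int64_t gapToInt32Max()
{
    return INT64_C(2147483647) - static_cast<int64_t>(largestFloatBelowTwoPow31());
}
```

largestFloatBelowTwoPow31() is 2147483520.0f and gapToInt32Max() is 127, so
no float clipping limit can produce exactly 2147483647 after conversion.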

Does anyone have an _efficient_ solution for this problem? Does it really
need a (probably very inefficient) detour using double or int64?
