Hi, thank you all for the interesting discussion posts on denorms and fixed-point/floating-point processing.
I have a problem that is closely related to the arguments posted by B.J., who mentioned the lack of saturating arithmetic on x86/x64 processors. I need to convert a batch of 32-bit float samples to 32-bit int samples with appropriate clipping, i.e. samples outside the range of a 32-bit int (-2147483648..2147483647) shall be clipped to -2147483648 or 2147483647. Because the conversion needs to be fast and efficient, I would prefer a solution using SSE (2/3). This sounds like an easy problem, but it turned out not to be so simple after all, so I would like to challenge any SSE experts on this list.

Here is what I have found out already. Starting with the following sample input:

    const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000, 3000000000.0, -3000000000.0);

My first approach was to convert this directly:

    const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);

This results in 1000, -1000, -2147483648, -2147483648, which is correct for all input samples except 3000000000.0. It turns out that all values which cannot be represented by an int32 are converted to -2147483648.

To fix this, my next idea was to clip to the maximum value before converting:

    const __m128 sseMax = _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
    const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput, sseMax));

Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is happening here? The maximum possible int32 (2147483647) cannot be represented exactly as a 32-bit float. It is rounded up to 2147483648 (printed as 2.14748365e+09), so sseMax is slightly too large. The result of _mm_min_ps is therefore still (slightly) out of range and converts to the same int32 value as before.

My final approach was to make sseMax minimally smaller:

    const __m128 sseMax = _mm_set1_ps(std::nextafterf(float(std::numeric_limits<int32_t>::max()), 0.0f));

This results in 1000, -1000, 2147483520, -2147483648.
This is the 'best' solution so far, but still not what I want, because 3000000000.0 does not clip to the maximum possible int32 (2147483647). It is obviously the same problem as before: the clipping limit cannot be represented exactly as a float (sseMax is 2.14748352e+09 here, i.e. 2147483520). Does anyone have an _efficient_ solution for this problem? Does it really require a (probably very inefficient) detour via double or int64?
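
For anyone who wants to reproduce this, the three attempts described above can be put into a small stand-alone sketch. The helper names (`lanes`, `convertDirect`, `convertClampedToIntMax`, `convertClampedBelowIntMax`) are made up for this illustration, and the sketch assumes an x86/x64 target with SSE2 and the default MXCSR rounding mode. It only demonstrates the problem, it is not a solution.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <limits>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: copy the four int32 lanes into an array, lane 0 first.
static std::array<int32_t, 4> lanes(__m128i v) {
    std::array<int32_t, 4> out{};
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}

// Attempt 1: convert directly. Every lane that does not fit into an int32
// comes back as 0x80000000 (-2147483648), regardless of its sign.
static std::array<int32_t, 4> convertDirect(__m128 in) {
    return lanes(_mm_cvtps_epi32(in));
}

// Attempt 2: clamp to float(INT32_MAX) first. That float is rounded up to
// 2147483648.0f, so a clamped lane is still out of range for int32.
static std::array<int32_t, 4> convertClampedToIntMax(__m128 in) {
    const __m128 sseMax =
        _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
    return lanes(_mm_cvtps_epi32(_mm_min_ps(in, sseMax)));
}

// Attempt 3: clamp to the next float below that limit (2147483520.0f).
// The conversion now succeeds, but the result is 127 short of INT32_MAX.
static std::array<int32_t, 4> convertClampedBelowIntMax(__m128 in) {
    const __m128 sseMax = _mm_set1_ps(
        std::nextafterf(float(std::numeric_limits<int32_t>::max()), 0.0f));
    return lanes(_mm_cvtps_epi32(_mm_min_ps(in, sseMax)));
}
```

Note that _mm_set_ps takes its arguments from the highest lane down to lane 0, so for the sample input above the array returned by `lanes` holds the values in reverse order compared to how I listed them: the 3000000000.0 sample ends up at index 1 and the 1000.0 sample at index 3.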
