Moin,

It seems the problem only exists for positive numbers, where some of the
lower bits end up zero instead of one.

I would suggest clipping and converting first, then constructing a mask
from the original float and OR-ing it into the result. The positive
comparison value should have 24 significant bits (23 + implicit) to match
float. Rough intrinsic usage:

    // 0xFFFFFFp7f = 2147483520, the largest float below 2^31
    clipped = _mm_min_ps(floatvalue, _mm_set1_ps(0xFFFFFFp7f));
    clipped = _mm_max_ps(clipped, _mm_set1_ps(-0x1p31f));
    y       = _mm_cvtps_epi32(clipped);
    // all-ones in the lanes that were clipped at the top
    mask    = _mm_castps_si128(_mm_cmpge_ps(floatvalue, _mm_set1_ps(0xFFFFFFp7f)));
    // 0xFFFFFFFF >> 1 = 0x7FFFFFFF fills in the missing low bits
    y       = _mm_or_si128(y, _mm_srli_epi32(mask, 1));

Not tested, but should work in principle.

Stefan

> On 26. Apr 2023, at 09:09, Holger Strauss <[email protected]> wrote:
>
> Hi,
>
> thank you all for the interesting discussion posts on denorms and
> fixed-point/floating-point processing.
>
> I have a problem that is very much related to the arguments posted by B.J.,
> mentioning the lack of saturation arithmetics on x86/x64 processors.
>
> I need to convert a batch of 32 bit float samples to 32 bit int samples with
> appropriate clipping. I.e. samples which are outside the range of a 32 bit
> int (-2147483648..2147483647) shall be clipped to -2147483648 or
> 2147483647.
>
> Because the conversion shall be fast and efficient, I would prefer a
> solution using SSE (2/3).
>
> This sounds like an easy problem, but unfortunately it turned out it's not
> so simple after all.
> So I would like to challenge any SSE experts on this list.
>
> Here is what I have found out already:
>
> Starting with the following sample input:
>
> const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000, 3000000000.0,
> -3000000000.0);
>
> My first approach was to convert this directly:
>
> const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);
>
> This results in 1000, -1000, -2147483648, -2147483648, which is correct for
> all input samples but 3000000000.0. It turns out that all values which
> cannot be represented by an int32 are converted to -2147483648.
>
> To fix this, my next idea was to clip the maximum value before converting:
>
> const __m128 sseMax =
> _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
> const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput,
> sseMax));
>
> Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is
> happening here? The maximum possible int32 (2147483647) cannot be
> represented exactly as a floating-point number. So sseMax is slightly larger
> (2.14748365e+09) and therefore sseClipMax is still (slightly) out of range,
> resulting in the same int32 values.
>
> My final approach was to make sseMax minimally smaller:
>
> const __m128 sseMax =
> _mm_set1_ps(std::nextafterf(float(std::numeric_limits<int32_t>::max()),
> 0.0f));
>
> This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
> solution so far, but still not what I want, because 3000000000.0 does not
> clip to the maximum possible int32 (2147483647). It is obviously the same
> problem as before: The clipping limit cannot be represented exactly as a
> float. (sseMax is 2.14748352e+09 here)
>
> Does anyone have an _efficient_ solution for this problem? Does it really
> need a (probably very inefficient) detour using double or int64?
>
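
For reference, here is a compilable sketch of the clip, convert, then mask
idea from the reply at the top of this message. It assumes an x86 target
with SSE2 and the default rounding mode. The names float_to_int32_saturated
and convert_block are made up for illustration and do not come from the
thread. One small deviation from the rough version: the mask compares
against 2^31 rather than against the clip limit, so that an input of
exactly 2147483520.0f converts to 2147483520. NaN inputs get no special
treatment here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper (not from the thread): converts four floats to four
// int32 values, saturating at INT32_MIN / INT32_MAX.
static inline __m128i float_to_int32_saturated(__m128 x)
{
    // Largest float below 2^31: 0xFFFFFF * 2^7 = 2147483520.
    const __m128 maxFloat = _mm_set1_ps(2147483520.0f);
    // -2^31 is exactly representable as a float.
    const __m128 minFloat = _mm_set1_ps(-2147483648.0f);
    const __m128 twoPow31 = _mm_set1_ps(2147483648.0f);

    // Clip into the range the conversion can handle, then convert.
    const __m128 clipped = _mm_max_ps(_mm_min_ps(x, maxFloat), minFloat);
    const __m128i y = _mm_cvtps_epi32(clipped);

    // All-ones in every lane whose original value was >= 2^31.
    const __m128i mask = _mm_castps_si128(_mm_cmpge_ps(x, twoPow31));

    // Logical shift turns 0xFFFFFFFF into 0x7FFFFFFF; OR-ing it into the
    // clipped result (0x7FFFFF80) sets the missing low bits.
    return _mm_or_si128(y, _mm_srli_epi32(mask, 1));
}

// Hypothetical batch wrapper; count is assumed to be a multiple of 4.
void convert_block(const float* in, std::int32_t* out, std::size_t count)
{
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        const __m128i v = float_to_int32_saturated(_mm_loadu_ps(in + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), v);
    }
}
```

With the sample values from the quoted message (1000, -1000, 3000000000,
-3000000000) this should yield 1000, -1000, 2147483647, -2147483648. The
extra cost over the plain clip and convert is one compare, one shift and
one OR per four samples, with no detour through double or int64.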

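
The rounding behaviour described in the quoted message can be checked
without any SSE code. A float has 24 significant bits, so between 2^30 and
2^31 neighbouring floats are 128 apart: 2147483647 rounds up to 2^31, and
the next float towards zero is 2147483520. The sketch below assumes
round-to-nearest conversion; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// INT32_MAX needs 31 significant bits, so the conversion rounds to 2^31.
inline float int32_max_as_float()
{
    return static_cast<float>(std::numeric_limits<std::int32_t>::max());
}

// Next representable float towards zero: 2^31 - 128 = 2147483520.
inline float largest_float_below_2_pow_31()
{
    return std::nextafter(int32_max_as_float(), 0.0f);
}

// Confirms the two values and the gap of 128 between them.
inline void check_float_spacing()
{
    assert(int32_max_as_float() == 2147483648.0f);
    assert(largest_float_below_2_pow_31() == 2147483520.0f);
    assert(int32_max_as_float() - largest_float_below_2_pow_31() == 128.0f);
}
```

This is why clipping alone can reach at most 2147483520 and the remaining
low bits have to be set on the integer side.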