Hi,

It seems the problem only exists for positive numbers, where some of the lower
bits are zero instead of one.
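
To make that concrete, here is a small scalar sketch (helper names are made up
for illustration). It assumes IEEE 754 single precision: the largest float
below 2^31 is 2147483520 = 0x7FFFFF80, so it lacks the seven low bits that
INT32_MAX = 0x7FFFFFFF has set.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper: the largest float that still fits into an int32.
static float largest_float_below_2_31() {
    // float(INT32_MAX) rounds up to 2^31, so step one float towards zero.
    const float limit = std::nextafter(static_cast<float>(INT32_MAX), 0.0f);
    assert(limit < 2147483648.0f);
    return limit;
}

// Hypothetical helper: bits that INT32_MAX has set but the float limit lacks.
static std::uint32_t missing_low_bits() {
    const std::int32_t asInt =
        static_cast<std::int32_t>(largest_float_below_2_31());
    return static_cast<std::uint32_t>(INT32_MAX - asInt);
}
```

missing_low_bits() comes out as 0x7F, which is exactly what has to be OR-ed
back in after clipping.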

I would suggest clipping and converting first, then building a mask from the
original float and OR-ing it into the result.
The positive comparison value should use all 24 significand bits (23 stored +
1 implicit) to match float. Rough intrinsic usage:

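const __m128 limit = _mm_set1_ps(0x0FFFFFFp7f);  // 2147483520, largest float < 2^31
__m128 clipped = _mm_min_ps(floatvalue, limit);
clipped = _mm_max_ps(clipped, _mm_set1_ps(-0x1p31f));
__m128i y = _mm_cvtps_epi32(clipped);

// all ones in lanes above the limit; shifted right by one this is 0x7FFFFFFF
__m128i mask = _mm_castps_si128(_mm_cmpgt_ps(floatvalue, limit));
y = _mm_or_si128(y, _mm_srli_epi32(mask, 1));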

Not tested, but should work in principle.
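
For completeness, a compilable sketch of the same idea. Assumptions on my side:
SSE2 via <emmintrin.h>, the default rounding mode, the made-up helper names
float_to_int32_saturated and convert_block, and a sample count that is a
multiple of 4. The lower clamp is left out here because _mm_cvtps_epi32 already
returns 0x80000000 (INT32_MIN) for anything out of range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical helper: converts four floats to four saturated int32 values.
static inline __m128i float_to_int32_saturated(__m128 x) {
    // Largest float below 2^31 (0xFFFFFF * 2^7).
    const __m128 limit = _mm_set1_ps(2147483520.0f);
    // Clip at the top, then convert; too-negative inputs become INT32_MIN.
    const __m128i y = _mm_cvtps_epi32(_mm_min_ps(x, limit));
    // All ones in every lane whose input was strictly above the limit,
    // so an input of exactly 2147483520.0f is left unchanged.
    const __m128i over = _mm_castps_si128(_mm_cmpgt_ps(x, limit));
    // over >> 1 is 0x7FFFFFFF; OR-ing it in fills the missing low bits.
    return _mm_or_si128(y, _mm_srli_epi32(over, 1));
}

// Hypothetical batch wrapper; assumes count is a multiple of 4.
static void convert_block(const float* in, std::int32_t* out, std::size_t count) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        const __m128i y = float_to_int32_saturated(_mm_loadu_ps(in + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), y);
    }
}
```

Calling convert_block on Holger's four sample values should give 1000, -1000,
2147483647 and -2147483648.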

Stefan


> On 26 Apr 2023, at 09:09, Holger Strauss wrote:
> 
> Hi,
> 
> thank you all for the interesting discussion posts on denorms and
> fixed-point/floating-point processing.
> 
> I have a problem that is very much related to the arguments posted by B.J.,
> mentioning the lack of saturating arithmetic on x86/x64 processors.
> 
> I need to convert a batch of 32-bit float samples to 32-bit int samples with
> appropriate clipping, i.e. samples which are outside the range of a 32-bit
> int (-2147483648..2147483647) shall be clipped to -2147483648 or
> 2147483647.
> 
> Because the conversion shall be fast and efficient, I would prefer a
> solution using SSE (2/3).
> 
> This sounds like an easy problem, but unfortunately it turned out it's not
> so simple after all.
> So I would like to challenge any SSE experts on this list.
> 
> Here is what I have found out already:
> 
> Starting with the following sample input:
> 
>     const __m128 sseFloatInput = _mm_set_ps(1000.0f, -1000.0f, 3000000000.0f,
>                                             -3000000000.0f);
> 
> My first approach was to convert this directly:
> 
>    const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);
> 
> This results in 1000, -1000, -2147483648, -2147483648, which is correct for
> all input samples but 3000000000.0. It turns out that all values which
> cannot be represented by an int32 are converted to -2147483648.
> 
> To fix this, my next idea was to clip the maximum value before converting:
> 
>     const __m128 sseMax =
> _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
>     const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput,
> sseMax));
> 
> Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is
> happening here? The maximum possible int32 (2147483647) cannot be
> represented exactly as a floating-point number. So sseMax is slightly larger
> (2.14748365e+09, i.e. 2^31) and therefore the clipped value is still
> (slightly) out of range, resulting in the same int32 values.
> 
> My final approach was to make sseMax minimally smaller:
> 
>     const __m128 sseMax =
> _mm_set1_ps(std::nextafterf(float(std::numeric_limits<int32_t>::max()),
> 0.0f));
> 
> This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
> solution so far, but still not what I want, because 3000000000.0 does not
> clip to the maximum possible int32 (2147483647). It is obviously the same
> problem as before: The clipping limit cannot be represented exactly as a
> float. (sseMax is 2.14748352e+09 here)
> 
> Does anyone have an _efficient_ solution for this problem? Does it really
> need a (probably very inefficient) detour using double or int64?
> 
