Sorry for spamming, but I am obsessive about optimisations and cannot spare you
the version with one less instruction:
#include <emmintrin.h>
#include <stdio.h>

int main()
{
    const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f,
                                            -3000000000.f);
    /* 2^31, the smallest float that _mm_cvtps_epi32 turns into 0x80000000 */
    const __m128 fcmp = _mm_set1_ps(0x1p31f);
    __m128i x = _mm_cvtps_epi32(sseFloatInput);
    /* all ones (-1) in every lane where the input is >= 2^31 */
    __m128i m = _mm_castps_si128(_mm_cmpge_ps(sseFloatInput, fcmp));
    /* 0x80000000 + (-1) = 0x7FFFFFFF */
    __m128i r = _mm_add_epi32(x, m);
    printf("%08X %08X %08X %08X\n",
           _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
           _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
           _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
           _mm_cvtsi128_si32(r)
    );
}
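
For converting a whole buffer, the same trick can be wrapped in a small helper. This is only a sketch: the name float4_to_int32_sat and the unaligned load/store are my assumptions, it uses SSE2 intrinsics only, and it relies on the default rounding mode. NaN inputs still come out as 0x80000000, because the comparison is false for NaN.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: convert four floats to int32, clipping values
   >= 2^31 to INT32_MAX. Values below -2^31 already come out of
   _mm_cvtps_epi32 as 0x80000000, which is INT32_MIN. */
static void float4_to_int32_sat(const float *in, int32_t *out)
{
    assert(in != NULL && out != NULL);

    const __m128 limit = _mm_set1_ps(2147483648.0f);   /* 2^31 */
    const __m128 v = _mm_loadu_ps(in);

    /* Out-of-range lanes become 0x80000000. */
    const __m128i x = _mm_cvtps_epi32(v);
    /* -1 (all ones) in lanes where v >= 2^31, 0 elsewhere. */
    const __m128i m = _mm_castps_si128(_mm_cmpge_ps(v, limit));
    /* 0x80000000 + (-1) = 0x7FFFFFFF in the overflowing lanes. */
    const __m128i r = _mm_add_epi32(x, m);

    _mm_storeu_si128((__m128i *)out, r);
}
```

Calling it in a loop over the buffer in steps of four samples would be the obvious use; any leftover samples need separate handling.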
> On 26. Apr 2023, at 10:34, Stefan Stenzel wrote:
>
> Stefano’s solution is elegant because it exploits the fact that values
> outside the range are all set to 0x80000000.
> But the implementation is a bit overcomplicated; this works just as well
> with fewer instructions and gives the same result:
>
> int main()
> {
>     const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f,
>                                             -3000000000.f);
>     /* 2^31, the smallest float that converts to 0x80000000 */
>     const __m128 fcmp = _mm_set1_ps(0x1p31f);
>
>     __m128i x = _mm_cvtps_epi32(sseFloatInput);
>     /* all ones in every lane where the input is >= 2^31 */
>     __m128i m = _mm_castps_si128(_mm_cmpge_ps(sseFloatInput, fcmp));
>     /* shift the mask down to 1, then 0x80000000 - 1 = 0x7FFFFFFF */
>     __m128i r = _mm_sub_epi32(x, _mm_srli_epi32(m, 31));
>
>     printf("%08X %08X %08X %08X\n",
>            _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>            _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>            _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>            _mm_cvtsi128_si32(r)
>     );
> }
>
>
>> On 26. Apr 2023, at 10:11, Stefano D'Angelo wrote:
>>
>> Hello,
>>
>> I'm no SSE expert either but I would exploit IEEE 754r single precision
>> floating point representation.
>>
>> Essentially you have that 0x4f000000 represents 2147483648.f while
>> 0x4effffff represents 2147483520.f. OTOH, in 2's complement 32 bits,
>> 0x7fffffff is 2147483647 and 0x80000000 is -2147483648.
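
The two bit patterns quoted above can be checked with a few lines of portable C. A minimal sketch, assuming float is a 32-bit IEEE 754 single; the helper name float_from_bits is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reinterpret a 32-bit pattern as a float.
   memcpy avoids the aliasing problems of a pointer cast. */
static float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);   /* assumes 32-bit float */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With this, float_from_bits(0x4f000000u) compares equal to 2147483648.0f and float_from_bits(0x4effffffu) compares equal to 2147483520.0f, so the largest float below 2^31 is 128 away from it.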
>>
>> The idea is then to convert using _mm_cvtps_epi32 as you did, and subtract 1
>> if the input is represented as a number bigger than 0x4effffff.
>>
>> Here's the code:
>>
>> #include <smmintrin.h>
>> #include <emmintrin.h>
>> #include <stdio.h>
>>
>> int main()
>> {
>>     const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f,
>>                                             -3000000000.f);
>>
>>     const __m128i ones = _mm_set_epi32(1, 1, 1, 1);
>>     /* 0x4effffff is the largest float below 2^31, so an input exactly
>>        equal to 2147483648.f (0x4f000000) is also caught */
>>     const __m128i h = _mm_set_epi32(0x4effffff, 0x4effffff, 0x4effffff,
>>                                     0x4effffff);
>>
>>     __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>     __m128i i = _mm_castps_si128(sseFloatInput);
>>     __m128i m = _mm_max_epi32(i, h);     /* SSE4.1 */
>>     __m128i s = _mm_sub_epi32(m, h);     /* > 0 only if i > h */
>>     __m128i y = _mm_sign_epi32(ones, s); /* SSSE3: 1 if s > 0, else 0 */
>>     __m128i r = _mm_sub_epi32(x, y);
>>
>>     printf("%d %d %d %d\n",
>>            _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>            _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>            _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>            _mm_cvtsi128_si32(r)
>>     );
>> }
>>
>> I get the correct result: 1000 -1000 2147483647 -2147483648.
>>
>> HTH.
>>
>> Best,
>>
>> Stefano D'Angelo
>>
>> Il 26/04/23 09:09, Holger Strauss ha scritto:
>>> Hi,
>>>
>>> thank you all for the interesting discussion posts on denorms and
>>> fixed-point/floating-point processing.
>>>
>>> I have a problem that is very much related to the arguments posted by B.J.,
>>> mentioning the lack of saturating arithmetic on x86/x64 processors.
>>>
>>> I need to convert a batch of 32 bit float samples to 32 bit int samples with
>>> appropriate clipping. I.e. samples which are outside the range of a 32 bit
>>> int (-2147483648..2147483647) shall be clipped to -2147483648 or
>>> 2147483647.
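
To pin down the requirement, here is a scalar reference that any SIMD version can be compared against. It is only a sketch under my own assumptions: the name clip_float_to_int32 is hypothetical, NaN is mapped to 0 (the original post does not say what should happen), and in-range values are truncated toward zero, whereas _mm_cvtps_epi32 rounds according to the current rounding mode.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the clipping described above. */
static int32_t clip_float_to_int32(float f)
{
    if (f != f)                  /* NaN: returning 0 is an assumption */
        return 0;
    if (f >= 2147483648.0f)      /* 2^31 and above, including +inf */
        return INT32_MAX;
    if (f < -2147483648.0f)      /* below -2^31, including -inf */
        return INT32_MIN;
    return (int32_t)f;           /* in range; truncates toward zero */
}
```

The comparisons are done against 2^31 and -2^31, both of which are exactly representable as floats, so no limit has to be rounded.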
>>>
>>> Because the conversion shall be fast and efficient, I would prefer a
>>> solution using SSE (2/3).
>>>
>>> This sounds like an easy problem, but unfortunately it turned out it's not
>>> so simple after all.
>>> So I would like to challenge any SSE experts on this list.
>>>
>>> Here is what I have found out already:
>>>
>>> Starting with the following sample input:
>>>
>>> const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000, 3000000000.0,
>>> -3000000000.0);
>>>
>>> My first approach was to convert this directly:
>>>
>>> const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);
>>>
>>> This results in 1000, -1000, -2147483648, -2147483648, which is correct for
>>> all input samples but 3000000000.0. It turns out that all values which
>>> cannot be represented by an int32 are converted to -2147483648.
>>>
>>> To fix this, my next idea was to clip the maximum value before converting:
>>>
>>> const __m128 sseMax =
>>> _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
>>> const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput,
>>> sseMax));
>>>
>>> Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is
>>> happening here? The maximum possible int32 (2147483647) cannot be
>>> represented exactly as a floating-point number. So sseMax is rounded up to
>>> 2147483648 (2.14748365e+09) and therefore the clipped value is still
>>> (slightly) out of range, resulting in the same int32 values.
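
The rounding described above can be made visible in plain C. A minimal sketch, assuming IEEE 754 single precision with round-to-nearest; the function name is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: how far (float)INT32_MAX lies above INT32_MAX.
   volatile forces the value to be stored as a real 32-bit float. */
static double float_excess_over_int32_max(void)
{
    volatile float as_float = (float)INT32_MAX;   /* rounds up to 2^31 */
    return (double)as_float - (double)INT32_MAX;  /* exact in double */
}
```

Since a float has a 24-bit significand, the neighbours of 2147483647 are 2147483520 and 2147483648, and the latter is nearer, so the function returns 1.0.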
>>>
>>> My final approach was to make sseMax minimally smaller:
>>>
>>> const __m128 sseMax =
>>> _mm_set1_ps(std::nextafterf(float(std::numeric_limits<int32_t>::max()),
>>> 0.0f));
>>>
>>> This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
>>> solution so far, but still not what I want, because 3000000000.0 does not
>>> clip to the maximum possible int32 (2147483647). It is obviously the same
>>> problem as before: The clipping limit cannot be represented exactly as a
>>> float. (sseMax is 2.14748352e+09 here)
>>>
>>> Does anyone have an _efficient_ solution for this problem? Does it really
>>> need a (probably very inefficient) detour using double or int64?
>>
>