Re: Efficient way to convert 32 bit float to 32 bit int (SSE)

STEFFAN DIEDRICHSEN Wed, 26 Apr 2023 02:05:31 -0700

That code snippet would be a good addition to the musicdsp source code archive:


https://www.musicdsp.org/en/latest/Other/index.html


Best,

Steffan 

> On 26. Apr 2023, at 10:50, Stefano D'Angelo <[email protected]> 
> wrote:
> 
> Yeah, Stefan's version is easier/better.
> 
> It only needs an extra _mm_castps_si128() to compute m, which costs nothing.
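
A minimal sketch of that shorter variant with the cast in place, wrapped in a batch helper. The names cvt4_sat and float_to_int32_sat are hypothetical and not from the thread, and the threshold is assumed to be 2^31 (the smallest float that no longer fits an int32). It needs SSE2 only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical kernel: convert four floats to int32 with positive saturation.
   Negative overflow already yields 0x80000000 from the conversion itself.
   NaN lanes also end up as 0x80000000 (no special handling here). */
__m128i cvt4_sat(__m128 v)
{
    const __m128 limit = _mm_set1_ps(2147483648.0f);        /* 2^31 */
    __m128i x = _mm_cvtps_epi32(v);                         /* out of range -> 0x80000000 */
    __m128i m = _mm_castps_si128(_mm_cmpge_ps(v, limit));   /* -1 where v >= 2^31, else 0 */
    return _mm_add_epi32(x, m);                             /* 0x80000000 + (-1) = 0x7FFFFFFF */
}

/* Hypothetical batch wrapper: unaligned loads/stores, padded tail. */
void float_to_int32_sat(const float *in, int32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_si128((__m128i *)(out + i), cvt4_sat(_mm_loadu_ps(in + i)));

    if (i < n) {
        /* Copy the 1..3 leftover samples into a zero-padded temporary. */
        float tin[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        int32_t tout[4];
        memcpy(tin, in + i, (n - i) * sizeof *in);
        _mm_storeu_si128((__m128i *)tout, cvt4_sat(_mm_loadu_ps(tin)));
        memcpy(out + i, tout, (n - i) * sizeof *out);
    }
}
```

The cast only reinterprets the register, so the per-vector work is one conversion, one compare and one add.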
> 
> Best,
> 
> Stefano D'Angelo
> 
> On 26/04/23 10:42, Stefan Stenzel wrote:
>> Sorry for spamming, but I am obsessive about optimisations and cannot spare 
>> you the version with one fewer instruction:
>> 
>> #include <emmintrin.h>
>> #include <stdio.h>
>> 
>> int main()
>> {
>>     const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>     /* 2^31: the smallest float that overflows int32 */
>>     const __m128 fcmp = _mm_set1_ps(0x1p31f);
>> 
>>     /* out-of-range lanes become 0x80000000 */
>>     __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>     /* all ones (-1) where the input is >= 2^31; the cast is free */
>>     __m128i m = _mm_castps_si128(_mm_cmpge_ps(sseFloatInput, fcmp));
>>     /* 0x80000000 + (-1) = 0x7FFFFFFF in the overflowing lanes */
>>     __m128i r = _mm_add_epi32(x, m);
>> 
>>     printf("%08X %08X %08X %08X\n",
>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>         _mm_cvtsi128_si32(r)
>>         );
>> }
>> 
>> 
>>> On 26. Apr 2023, at 10:34, Stefan Stenzel <[email protected]> wrote:
>>> 
>>> Stefano’s solution is elegant because it exploits the fact that values 
>>> outside the range are all set to 0x80000000.
>>> But the implementation is a bit overcomplicated. This works as well, with 
>>> fewer instructions and the same result:
>>> 
>>> #include <emmintrin.h>
>>> #include <stdio.h>
>>> 
>>> int main()
>>> {
>>>    const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>>    /* 2^31: the smallest float that overflows int32 */
>>>    const __m128 fcmp = _mm_set1_ps(0x1p31f);
>>> 
>>>    /* out-of-range lanes become 0x80000000 */
>>>    __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>>    /* all ones where the input is >= 2^31 */
>>>    __m128i m = _mm_castps_si128(_mm_cmpge_ps(sseFloatInput, fcmp));
>>>    /* shift the mask down to 0 or 1, then subtract it */
>>>    __m128i r = _mm_sub_epi32(x,_mm_srli_epi32(m,31));
>>> 
>>>    printf("%08X %08X %08X %08X\n",
>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>>        _mm_cvtsi128_si32(r)
>>>        );
>>> }
>>> 
>>> 
>>>> On 26. Apr 2023, at 10:11, Stefano D'Angelo <[email protected]> 
>>>> wrote:
>>>> 
>>>> Hello,
>>>> 
>>>> I'm no SSE expert either, but I would exploit the IEEE 754 single-precision 
>>>> floating-point representation.
>>>> 
>>>> Essentially you have that 0x4f000000 represents 2147483648.f while 
>>>> 0x4effffff represents 2147483520.f. OTOH, in 2's complement 32 bits, 
>>>> 0x7fffffff is 2147483647 and 0x80000000 is -2147483648.
>>>> 
>>>> The idea is then to convert using _mm_cvtps_epi32 as you did, and subtract 
>>>> 1 if the input is represented as a number bigger than 0x4effffff.
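
The two bit patterns quoted above can be checked with a few lines of portable C. This is only a sketch; the helper name float_from_bits is hypothetical, and it assumes float is IEEE 754 binary32.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a single-precision float.
   memcpy is the well-defined way to do this kind of type punning in C. */
float float_from_bits(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

float_from_bits(0x4f000000u) gives 2147483648.0f (2^31) and float_from_bits(0x4effffffu) gives 2147483520.0f (2^31 - 128), so every positive float whose bit pattern is above 0x4effffff is too large for an int32.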
>>>> 
>>>> Here's the code:
>>>> 
>>>> #include <smmintrin.h>
>>>> #include <emmintrin.h>
>>>> #include <stdio.h>
>>>> 
>>>> int main()
>>>> {
>>>>    const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>>> 
>>>>    const __m128i ones = _mm_set1_epi32(1);
>>>>    /* bit pattern of 2147483520.f, the largest float below 2^31 */
>>>>    const __m128i h = _mm_set1_epi32(0x4effffff);
>>>> 
>>>>    /* out-of-range lanes become 0x80000000 */
>>>>    __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>>>    /* raw float bits, compared as signed integers */
>>>>    __m128i i = _mm_castps_si128(sseFloatInput);
>>>>    __m128i m = _mm_max_epi32(i, h);       /* SSE4.1 */
>>>>    /* > 0 only where the input bits exceed h */
>>>>    __m128i s = _mm_sub_epi32(m, h);
>>>>    /* SSSE3: 1 where s > 0, 0 where s == 0 */
>>>>    __m128i y = _mm_sign_epi32(ones, s);
>>>>    __m128i r = _mm_sub_epi32(x,y);
>>>> 
>>>>    printf("%d %d %d %d\n",
>>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>>>        _mm_cvtsi128_si32(r)
>>>>        );
>>>> }
>>>> 
>>>> I get the correct result: 1000 -1000 2147483647 -2147483648.
>>>> 
>>>> HTH.
>>>> 
>>>> Best,
>>>> 
>>>> Stefano D'Angelo
>>>> 
>>>> On 26/04/23 09:09, Holger Strauss wrote:
>>>>> Hi,
>>>>> 
>>>>> thank you all for the interesting discussion posts on denorms and
>>>>> fixed-point/floating-point processing.
>>>>> 
>>>>> I have a problem that is closely related to the points B.J. raised
>>>>> about the lack of saturating arithmetic on x86/x64 processors.
>>>>> 
>>>>> I need to convert a batch of 32 bit float samples to 32 bit int samples
>>>>> with appropriate clipping, i.e. samples which are outside the range of a
>>>>> 32 bit int (-2147483648..2147483647) shall be clipped to -2147483648 or
>>>>> 2147483647.
>>>>> 
>>>>> Because the conversion shall be fast and efficient, I would prefer a
>>>>> solution using SSE (2/3).
>>>>> 
>>>>> This sounds like an easy problem, but unfortunately it turned out it's not
>>>>> so simple after all.
>>>>> So I would like to challenge any SSE experts on this list.
>>>>> 
>>>>> Here is what I have found out already:
>>>>> 
>>>>> Starting with the following sample input:
>>>>> 
>>>>>    const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000, 3000000000.0,
>>>>> -3000000000.0);
>>>>> 
>>>>> My first approach was to convert this directly:
>>>>> 
>>>>>   const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);
>>>>> 
>>>>> This results in 1000, -1000, -2147483648, -2147483648, which is correct
>>>>> for all input samples but 3000000000.0. It turns out that all values which
>>>>> cannot be represented by an int32 are converted to -2147483648.
>>>>> 
>>>>> To fix this, my next idea was to clip the maximum value before converting:
>>>>> 
>>>>>    const __m128 sseMax =
>>>>> _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
>>>>>    const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput,
>>>>> sseMax));
>>>>> 
>>>>> Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What
>>>>> is happening here? The maximum possible int32 (2147483647) cannot be
>>>>> represented exactly as a single-precision float. So sseMax rounds up to
>>>>> 2147483648.0 (2.14748365e+09), and the clamped value is therefore still
>>>>> (slightly) out of range, resulting in the same int32 values.
>>>>> 
>>>>> My final approach was to make sseMax minimally smaller:
>>>>> 
>>>>>    const __m128 sseMax =
>>>>> _mm_set1_ps(std::nextafterf(float(std::numeric_limits<int32_t>::max()),
>>>>> 0.0f));
>>>>> 
>>>>> This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
>>>>> solution so far, but still not what I want, because 3000000000.0 does not
>>>>> clip to the maximum possible int32 (2147483647). It is obviously the same
>>>>> problem as before: The clipping limit cannot be represented exactly as a
>>>>> float. (sseMax is 2.14748352e+09 here)
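
The two candidate clipping limits described above can be reproduced in scalar C. This is a sketch with hypothetical helper names; float_below stands in for nextafterf(f, 0.0f) by stepping the bit pattern down by one, which is only valid for positive normal floats, and it assumes IEEE 754 binary32 with round-to-nearest.

```c
#include <stdint.h>
#include <string.h>

/* INT32_MAX needs 31 significant bits but a float carries only 24,
   so the conversion rounds; under round-to-nearest it lands on 2^31. */
float int32_max_as_float(void)
{
    return (float)INT32_MAX;
}

/* Next representable float toward zero for a positive normal f
   (assumed equivalent to nextafterf(f, 0.0f) for such inputs). */
float float_below(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    bits -= 1;                      /* adjacent positive floats differ by 1 in their bits */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

int32_max_as_float() is 2147483648.0f, one above the int32 range, while float_below(int32_max_as_float()) is 2147483520.0f, which is 127 short of INT32_MAX. No float lies in between, so a clamp in single precision cannot hit 2147483647.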
>>>>> 
>>>>> Does anyone have an _efficient_ solution for this problem? Does it really
>>>>> need a (probably very inefficient) detour using double or int64?
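
For comparison, here is a rough sketch of what the detour through double could look like with SSE2 only. It is not taken from the thread; the function name cvt4_sat_via_double is hypothetical. Both int32 limits are exactly representable in double precision, so a plain min/max clamp works there.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: widen to double, clamp to the int32 range, convert.
   NaN inputs get no special treatment in this sketch. */
__m128i cvt4_sat_via_double(__m128 v)
{
    const __m128d hi = _mm_set1_pd(2147483647.0);
    const __m128d lo = _mm_set1_pd(-2147483648.0);

    __m128d d01 = _mm_cvtps_pd(v);                     /* lanes 0 and 1 */
    __m128d d23 = _mm_cvtps_pd(_mm_movehl_ps(v, v));   /* lanes 2 and 3 */

    d01 = _mm_max_pd(_mm_min_pd(d01, hi), lo);
    d23 = _mm_max_pd(_mm_min_pd(d23, hi), lo);

    __m128i i01 = _mm_cvtpd_epi32(d01);                /* two results in the low 64 bits */
    __m128i i23 = _mm_cvtpd_epi32(d23);
    return _mm_unpacklo_epi64(i01, i23);               /* lanes 0..3 in order */
}
```

This takes two widening conversions, four clamps, two narrowing conversions and a merge per four samples, which gives a feel for the cost the question is worried about.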
