Re: Efficient way to convert 32 bit float to 32 bit int (SSE)

Stefan Stenzel Wed, 26 Apr 2023 02:57:47 -0700

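Sorry, one more fix: the comparison has to be _mm_cmpgt_ps rather than _mm_cmpge_ps, so that an input of exactly 2147483520.f is left untouched. I’ll keep silent now.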

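// src/dst must be 16-byte aligned
void float2intx4(__m128 *src, __m128i *dst)
{
    // 0x00FFFFFFp7f == 2147483520.f, the largest float below 2^31
    const __m128 fcmp = _mm_set_ps1(0x00FFFFFFp7f);
    const __m128 sseFloatInput = *src;

    // out-of-range inputs convert to 0x80000000
    __m128i x = _mm_cvtps_epi32(sseFloatInput);
    // all-ones (-1) where the input exceeds fcmp; the cast costs nothing
    __m128i m = _mm_castps_si128(_mm_cmpgt_ps(sseFloatInput, fcmp));
    // 0x80000000 + (-1) == 0x7FFFFFFF
    *dst = _mm_add_epi32(x, m);
}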
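
For whole buffers, the routine above could be wrapped roughly as follows. This is only a sketch: the names float2int_sat4 and float2int_sat are made up for illustration, unaligned loads and stores are used so that the pointers need not be aligned, and the compare result goes through _mm_castps_si128() as suggested further down the thread. The constant and the convert/compare/add core are taken from the function above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four floats -> four int32, saturating at INT32_MAX.
   _mm_cvtps_epi32 already yields 0x80000000 (INT32_MIN) for every
   out-of-range input, so only the positive overflow needs fixing. */
static __m128i float2int_sat4(__m128 v)
{
    /* 0x00FFFFFFp7f == 2147483520.0f, the largest float below 2^31 */
    const __m128 fcmp = _mm_set1_ps(0x00FFFFFFp7f);
    __m128i x = _mm_cvtps_epi32(v);
    /* all-ones (-1) in lanes where v > fcmp */
    __m128i m = _mm_castps_si128(_mm_cmpgt_ps(v, fcmp));
    /* 0x80000000 + (-1) wraps to 0x7FFFFFFF in those lanes */
    return _mm_add_epi32(x, m);
}

/* Hypothetical batch wrapper (name and signature are assumptions). */
void float2int_sat(const float *src, int32_t *dst, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (src != NULL && dst != NULL));
    for (; i + 4 <= n; i += 4) {
        _mm_storeu_si128((__m128i *)(dst + i),
                         float2int_sat4(_mm_loadu_ps(src + i)));
    }
    if (i < n) {  /* 1..3 leftover samples: pad to a full vector */
        float tin[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        int32_t tout[4];
        memcpy(tin, src + i, (n - i) * sizeof tin[0]);
        _mm_storeu_si128((__m128i *)tout, float2int_sat4(_mm_loadu_ps(tin)));
        memcpy(dst + i, tout, (n - i) * sizeof tout[0]);
    }
}
```

Called as float2int_sat(in, out, n), the inputs 1000, -1000, 3000000000 and -3000000000 should come out as 1000, -1000, 2147483647 and -2147483648.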



> On 26. Apr 2023, at 11:50, Stefan Stenzel wrote:
> 
> OK, but it first needs some bug fixing. Here is a corrected version with the
> proper constant for comparison:
> 
> //src/dst must be aligned
> void float2intx4(__m128 *src, __m128i *dst)
> {
>    const __m128 fcmp = _mm_set_ps1(0x00FFFFFFp7f); 
>    const __m128 sseFloatInput = *src; 
> 
>    __m128i x = _mm_cvtps_epi32(sseFloatInput);
>    __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
>    *dst     = _mm_add_epi32(x,m);
> }
> 
> 
> 
>> On 26. Apr 2023, at 11:04, STEFFAN DIEDRICHSEN wrote:
>> 
>> That code snippet would be a good addition to the musicdsp source code 
>> archive:
>> 
>> https://www.musicdsp.org/en/latest/Other/index.html
>> 
>> Best,
>> 
>> Steffan 
>> 
>>> On 26. Apr 2023, at 10:50, Stefano D'Angelo wrote:
>>> 
>>> Yeah, Stefan's version is easier/better.
>>> 
>>> It only needs an extra _mm_castps_si128() to compute m, which costs nothing.
>>> 
>>> Best,
>>> 
>>> Stefano D'Angelo
>>> 
>>> Il 26/04/23 10:42, Stefan Stenzel ha scritto:
>>>> Sorry for spamming, but I am obsessive about optimisations and cannot 
>>>> spare you the version with one less instruction:
>>>> 
>>>> int main()
>>>> {
>>>>    const __m128 sseFloatInput =
>>>>        _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>>>    const __m128 fcmp =
>>>>        _mm_set_ps(0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f);
>>>> 
>>>>    __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>>>    __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
>>>>    __m128i r = _mm_add_epi32(x,m);
>>>> 
>>>>    printf("%08X %08X %08X %08X\n",
>>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>>>        _mm_cvtsi128_si32(r)
>>>>        );
>>>> }
>>>> 
>>>> 
>>>>> On 26. Apr 2023, at 10:34, Stefan Stenzel wrote:
>>>>> 
>>>>> Stefano’s solution is elegant because it exploits the fact that values
>>>>> outside the range are all set to 0x80000000.
>>>>> But the implementation is a bit overcomplicated; this works as well with
>>>>> fewer instructions and gives the same result:
>>>>> 
>>>>> int main()
>>>>> {
>>>>>   const __m128 sseFloatInput =
>>>>>       _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>>>>   const __m128 fcmp =
>>>>>       _mm_set_ps(0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f);
>>>>> 
>>>>>   __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>>>>   __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
>>>>>   __m128i r = _mm_sub_epi32(x,_mm_srli_epi32(m,31));
>>>>> 
>>>>>   printf("%08X %08X %08X %08X\n",
>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>>>>       _mm_cvtsi128_si32(r)
>>>>>       );
>>>>> }
>>>>> 
>>>>> 
>>>>>> On 26. Apr 2023, at 10:11, Stefano D'Angelo wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I'm no SSE expert either but I would exploit IEEE 754r single precision 
>>>>>> floating point representation.
>>>>>> 
>>>>>> Essentially you have that 0x4f000000 represents 2147483648.f while 
>>>>>> 0x4effffff represents 2147483520.f. OTOH, in 2's complement 32 bits, 
>>>>>> 0x7fffffff is 2147483647 and 0x80000000 is -2147483648.
>>>>>> 
>>>>>> The idea is then to convert using _mm_cvtps_epi32 as you did, and 
>>>>>> subtract 1 if the input is represented as a number bigger than 
>>>>>> 0x4effffff.
>>>>>> 
>>>>>> Here's the code:
>>>>>> 
>>>>>> #include <smmintrin.h>
>>>>>> #include <emmintrin.h>
>>>>>> #include <stdio.h>
>>>>>> 
>>>>>> int main()
>>>>>> {
>>>>>>   const __m128 sseFloatInput =
>>>>>>       _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>>>>> 
>>>>>>   const __m128i ones = _mm_set_epi32(1, 1, 1, 1);
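>>>>>>   // 0x4effffff is 2147483520.f, the largest float below 2^31; comparing
>>>>>>   // against it makes an input of exactly 2147483648.f saturate as well
>>>>>>   const __m128i h = _mm_set1_epi32(0x4effffff);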
>>>>>> 
>>>>>>   __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>>>>>   __m128i i = _mm_castps_si128(sseFloatInput);
>>>>>>   __m128i m = _mm_max_epi32(i, h);
>>>>>>   __m128i s = _mm_sub_epi32(m, h);
>>>>>>   __m128i y = _mm_sign_epi32(ones, s);
>>>>>>   __m128i r = _mm_sub_epi32(x,y);
>>>>>> 
>>>>>>   printf("%d %d %d %d\n",
>>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>>>>>       _mm_cvtsi128_si32(r)
>>>>>>       );
>>>>>> }
>>>>>> 
>>>>>> I get the correct result: 1000 -1000 2147483647 -2147483648.
>>>>>> 
>>>>>> HTH.
>>>>>> 
>>>>>> Best,
>>>>>> 
>>>>>> Stefano D'Angelo
>>>>>> 
>>>>>> Il 26/04/23 09:09, Holger Strauss ha scritto:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> thank you all for the interesting discussion posts on denorms and
>>>>>>> fixed-point/floating-point processing.
>>>>>>> 
>>>>>>> I have a problem that is very much related to the arguments posted by
>>>>>>> B.J., mentioning the lack of saturating arithmetic on x86/x64
>>>>>>> processors.
>>>>>>> 
>>>>>>> I need to convert a batch of 32 bit float samples to 32 bit int samples
>>>>>>> with appropriate clipping, i.e. samples which are outside the range of
>>>>>>> a 32 bit int (-2147483648..2147483647) shall be clipped to -2147483648
>>>>>>> or 2147483647.
>>>>>>> 
>>>>>>> Because the conversion shall be fast and efficient, I would prefer a
>>>>>>> solution using SSE (2/3).
>>>>>>> 
>>>>>>> This sounds like an easy problem, but unfortunately it turned out it's
>>>>>>> not so simple after all.
>>>>>>> So I would like to challenge any SSE experts on this list.
>>>>>>> 
>>>>>>> Here is what I have found out already:
>>>>>>> 
>>>>>>> Starting with the following sample input:
>>>>>>> 
>>>>>>>   const __m128 sseFloatInput =
>>>>>>>       _mm_set_ps(1000.0, -1000, 3000000000.0, -3000000000.0);
>>>>>>> 
>>>>>>> My first approach was to convert this directly:
>>>>>>> 
>>>>>>>  const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);
>>>>>>> 
>>>>>>> This results in 1000, -1000, -2147483648, -2147483648, which is correct
>>>>>>> for all input samples but 3000000000.0. It turns out that all values
>>>>>>> which cannot be represented by an int32 are converted to -2147483648.
>>>>>>> 
>>>>>>> To fix this, my next idea was to clip the maximum value before
>>>>>>> converting:
>>>>>>> 
>>>>>>>   const __m128 sseMax =
>>>>>>>       _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
>>>>>>>   const __m128i sseClippedInt =
>>>>>>>       _mm_cvtps_epi32(_mm_min_ps(sseFloatInput, sseMax));
>>>>>>> 
>>>>>>> Well, the output is the same: 1000, -1000, -2147483648, -2147483648.
>>>>>>> What is happening here? The maximum possible int32 (2147483647) cannot
>>>>>>> be represented exactly as a float. So sseMax is rounded up to
>>>>>>> 2147483648.0 (printed as 2.14748365e+09), and the clipped value is
>>>>>>> therefore still (just) out of range, resulting in the same int32 values.
>>>>>>> 
>>>>>>> My final approach was to make sseMax minimally smaller:
>>>>>>> 
>>>>>>>   const __m128 sseMax = _mm_set1_ps(
>>>>>>>       std::nextafterf(float(std::numeric_limits<int32_t>::max()), 0.0f));
>>>>>>> 
>>>>>>> This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
>>>>>>> solution so far, but still not what I want, because 3000000000.0 does
>>>>>>> not clip to the maximum possible int32 (2147483647). It is obviously the
>>>>>>> same problem as before: The clipping limit cannot be represented exactly
>>>>>>> as a float. (sseMax is 2.14748352e+09 here)
>>>>>>> 
>>>>>>> Does anyone have an _efficient_ solution for this problem? Does it
>>>>>>> really need a (probably very inefficient) detour using double or int64?
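
For comparison, the detour through double that the question asks about might look like the sketch below. The function name and the array-based signature are assumptions made for illustration. It relies on 2147483647 being exactly representable as a double, so a plain minimum before the conversion is enough. It needs noticeably more instructions than the convert/compare/add version at the top of this message, so it is probably most useful as a cross-check.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts in[0..3] to out[0..3] via double precision. */
void float2int_sat4_double(const float in[4], int32_t out[4])
{
    /* 2147483647 is exact in double, unlike in float */
    const __m128d maxd = _mm_set1_pd(2147483647.0);
    __m128 v;
    __m128d lo, hi;
    __m128i ilo, ihi;

    assert(in != NULL && out != NULL);
    v  = _mm_loadu_ps(in);
    lo = _mm_cvtps_pd(v);                     /* lanes 0 and 1 as double */
    hi = _mm_cvtps_pd(_mm_movehl_ps(v, v));   /* lanes 2 and 3 as double */

    /* maxd goes first so that a NaN input is passed through and then
       converts to 0x80000000, like _mm_cvtps_epi32 would give */
    ilo = _mm_cvtpd_epi32(_mm_min_pd(maxd, lo));
    ihi = _mm_cvtpd_epi32(_mm_min_pd(maxd, hi));

    /* each conversion leaves two int32 in the low 64 bits; join them */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi64(ilo, ihi));
}
```

Values below -2147483648 need no extra clamp here, because the double-to-int conversion also returns 0x80000000 for them.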
>> 
>
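
The float values quoted throughout the thread can be checked with a few lines of plain C. The helper names below are hypothetical, and the sketch assumes that float is an IEEE 754 binary32 and that int-to-float conversion rounds to nearest, which is the usual case on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit pattern as a float (assumes IEEE 754 binary32). */
float float_from_bits(uint32_t bits)
{
    float f;
    assert(sizeof f == sizeof bits);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Inverse of float_from_bits. */
uint32_t float_to_bits(float f)
{
    uint32_t bits;
    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* For a positive finite float, the next value toward zero has the bit
   pattern minus one; this mirrors nextafterf(x, 0.0f) without libm. */
float next_toward_zero(float positive_finite)
{
    return float_from_bits(float_to_bits(positive_finite) - 1u);
}
```

With these helpers, float_from_bits(0x4f000000) compares equal to 2147483648.f, (float)INT32_MAX has the bit pattern 0x4f000000, and next_toward_zero() of that value is 2147483520.f (0x4effffff), which is also what the hex literal 0x00FFFFFFp7f evaluates to.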
