AW: Efficient way to convert 32 bit float to 32 bit int (SSE)

Holger Strauss Wed, 26 Apr 2023 03:21:49 -0700

Thank you very much for the quick answers, Stefan, Stefano and Steffan (what a 
coincidental match... :-)). That was very helpful.


I had to add one more (no-op) instruction to get the types right (this may be
related to the Microsoft compiler, I'm not sure):

void float2intx4(__m128 *src, __m128i *dst) {
    const __m128 fcmp = _mm_set_ps1(0x00FFFFFFp7f);
    const __m128 sseFloatInput = *src;

    __m128i x = _mm_cvtps_epi32(sseFloatInput);
    __m128i m = _mm_castps_si128(_mm_cmpgt_ps(sseFloatInput,fcmp));
    *dst     = _mm_add_epi32(x,m);
}
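
A minimal sketch of how the conversion above could be applied to plain arrays and checked against the edge cases from this thread. The helper name float2int_clipped, the unaligned loads/stores and the requirement that n is a multiple of 4 are assumptions made for illustration; only the three-instruction core is taken from the function above. It assumes SSE2 and a compiler that accepts C99 hexadecimal float literals.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the thread): converts n floats to int32
   with clipping. n is assumed to be a multiple of 4. Unaligned loads and
   stores are used so the buffers need no special alignment. */
static void float2int_clipped(const float *src, int32_t *dst, size_t n)
{
    /* 0x00FFFFFFp7f == 2147483520.0f, the largest float below 2^31 */
    const __m128 fcmp = _mm_set1_ps(0x00FFFFFFp7f);

    for (size_t i = 0; i < n; i += 4) {
        const __m128 in = _mm_loadu_ps(src + i);
        /* out-of-range inputs convert to 0x80000000 */
        const __m128i x = _mm_cvtps_epi32(in);
        /* all-ones (== -1) in lanes where in >= 2^31, zero elsewhere */
        const __m128i m = _mm_castps_si128(_mm_cmpgt_ps(in, fcmp));
        /* 0x80000000 + (-1) == 0x7FFFFFFF in the too-large lanes */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(x, m));
    }
}
```

Called on { 1000.f, -1000.f, 3000000000.f, -3000000000.f }, this should yield 1000, -1000, 2147483647, -2147483648.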

Best,
Holger


> Sorry, one more fix, I’ll keep silent now.
> 
> //src/dst must be aligned
> void float2intx4(__m128 *src, __m128i *dst) {
>     const __m128 fcmp = _mm_set_ps1(0x00FFFFFFp7f);
>     const __m128 sseFloatInput = *src;
> 
>     __m128i x = _mm_cvtps_epi32(sseFloatInput);
>     __m128i m = _mm_cmpgt_ps(sseFloatInput,fcmp);
>     *dst     = _mm_add_epi32(x,m);
> }
> 
> 
> > On 26. Apr 2023, at 11:50, Stefan Stenzel <[email protected]> wrote:
> >
> > OK, but it first needs some bug fixing; here is a corrected version with the
> > proper constant for comparison:
> >
> > //src/dst must be aligned
> > void float2intx4(__m128 *src, __m128i *dst) {
> >    const __m128 fcmp = _mm_set_ps1(0x00FFFFFFp7f);
> >    const __m128 sseFloatInput = *src;
> >
> >    __m128i x = _mm_cvtps_epi32(sseFloatInput);
> >    __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
> >    *dst     = _mm_add_epi32(x,m);
> > }
> >
> >
> >
> >> On 26. Apr 2023, at 11:04, STEFFAN DIEDRICHSEN <0000009333a9e91c-[email protected]> wrote:
> >>
> >> That code snippet would be a good addition to the musicdsp source code archive:
> >>
> >> https://www.musicdsp.org/en/latest/Other/index.html
> >>
> >>
> >>
> >>
> >> Best,
> >>
> >> Steffan
> >>
> >>> On 26. Apr 2023, at 10:50, Stefano D'Angelo <[email protected]> wrote:
> >>>
> >>> Yeah, Stefan's version is easier/better.
> >>>
> >>> It only needs an extra _mm_castps_si128() to compute m, which costs nothing.
> >>>
> >>> Best,
> >>>
> >>> Stefano D'Angelo
> >>>
> >>> On 26/04/23 10:42, Stefan Stenzel wrote:
> >>>> Sorry for spamming, but I am obsessive about optimisations and cannot
> >>>> spare you the version with one fewer instruction:
> >>>>
> >>>> int main()
> >>>> {
> >>>>    const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f,
> >>>>                                            3000000000.f, -3000000000.f);
> >>>>    const __m128 fcmp =
> >>>>        _mm_set_ps(0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f);
> >>>>
> >>>>    __m128i x = _mm_cvtps_epi32(sseFloatInput);
> >>>>    __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
> >>>>    __m128i r = _mm_add_epi32(x,m);
> >>>>
> >>>>    printf("%08X %08X %08X %08X\n",
> >>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
> >>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
> >>>>        _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
> >>>>        _mm_cvtsi128_si32(r)
> >>>>        );
> >>>> }
> >>>>
> >>>>
> >>>>> On 26. Apr 2023, at 10:34, Stefan Stenzel <[email protected]> wrote:
> >>>>>
> >>>>> Stefano’s solution is elegant because it exploits the fact that values
> >>>>> outside the range are all set to 0x80000000.
> >>>>> But the implementation is a bit overcomplicated; this works as well with
> >>>>> fewer instructions and gives the same result:
> >>>>>
> >>>>> int main()
> >>>>> {
> >>>>>   const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f,
> >>>>>                                           3000000000.f, -3000000000.f);
> >>>>>   const __m128 fcmp =
> >>>>>       _mm_set_ps(0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f, 0x0FFFFFFp3f);
> >>>>>
> >>>>>   __m128i x = _mm_cvtps_epi32(sseFloatInput);
> >>>>>   __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
> >>>>>   __m128i r = _mm_sub_epi32(x,_mm_srli_epi32(m,31));
> >>>>>
> >>>>>   printf("%08X %08X %08X %08X\n",
> >>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
> >>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
> >>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
> >>>>>       _mm_cvtsi128_si32(r)
> >>>>>       );
> >>>>> }
> >>>>>
> >>>>>
> >>>>>> On 26. Apr 2023, at 10:11, Stefano D'Angelo <[email protected]> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I'm no SSE expert either but I would exploit IEEE 754r single precision
> >>>>>> floating point representation.
> >>>>>>
> >>>>>> Essentially you have that 0x4f000000 represents 2147483648.f while
> >>>>>> 0x4effffff represents 2147483520.f. OTOH, in 2's complement 32 bits,
> >>>>>> 0x7fffffff is 2147483647 and 0x80000000 is -2147483648.
> >>>>>>
> >>>>>> The idea is then to convert using _mm_cvtps_epi32 as you did, and
> >>>>>> subtract 1 if the input is represented as a number bigger than 0x4effffff.
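
The bit patterns quoted above can be checked with a few lines of scalar C. This is only a sketch: the helper names float_bits and needs_correction are hypothetical, and the comparison treats the float's bit pattern as a signed 32-bit integer, which is what makes negative inputs fall below the threshold automatically.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the raw IEEE 754 bit pattern of a float,
   reinterpreted as a signed 32-bit integer. */
static int32_t float_bits(float f)
{
    int32_t i;
    memcpy(&i, &f, sizeof i);  /* well-defined way to reinterpret the bits */
    return i;
}

/* Hypothetical helper: returns 1 if the converted value has to be reduced
   by 1, i.e. if the bit pattern is bigger than 0x4effffff. Negative floats
   have the sign bit set, so their patterns compare as negative integers. */
static int needs_correction(float f)
{
    return float_bits(f) > 0x4effffff;
}
```

For example, float_bits(2147483648.f) is 0x4f000000 and float_bits(2147483520.f) is 0x4effffff, so only the former is flagged.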
> >>>>>>
> >>>>>> Here's the code:
> >>>>>>
> >>>>>> #include <smmintrin.h>
> >>>>>> #include <emmintrin.h>
> >>>>>> #include <stdio.h>
> >>>>>>
> >>>>>> int main()
> >>>>>> {
> >>>>>>   const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f,
> >>>>>> 3000000000.f, -3000000000.f);
> >>>>>>
> >>>>>>   const __m128i ones = _mm_set_epi32(1, 1, 1, 1);
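> >>>>>>   /* 0x4effffff (2147483520.f) is the largest float below 2^31, so an
> >>>>>>      input of exactly 2147483648.f is corrected as well */
> >>>>>>   const __m128i h = _mm_set_epi32(0x4effffff, 0x4effffff,
> >>>>>>                                   0x4effffff, 0x4effffff);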
> >>>>>>
> >>>>>>   __m128i x = _mm_cvtps_epi32(sseFloatInput);
> >>>>>>   __m128i i = _mm_castps_si128(sseFloatInput);
> >>>>>>   __m128i m = _mm_max_epi32(i, h);
> >>>>>>   __m128i s = _mm_sub_epi32(m, h);
> >>>>>>   __m128i y = _mm_sign_epi32(ones, s);
> >>>>>>   __m128i r = _mm_sub_epi32(x,y);
> >>>>>>
> >>>>>>   printf("%d %d %d %d\n",
> >>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
> >>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
> >>>>>>       _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
> >>>>>>       _mm_cvtsi128_si32(r)
> >>>>>>       );
> >>>>>> }
> >>>>>>
> >>>>>> I get the correct result: 1000 -1000 2147483647 -2147483648.
> >>>>>>
> >>>>>> HTH.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Stefano D'Angelo
> >>>>>>
> >>>>>> On 26/04/23 09:09, Holger Strauss wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> thank you all for the interesting discussion posts on denorms
> >>>>>>> and fixed-point/floating-point processing.
> >>>>>>>
> >>>>>>> I have a problem that is very much related to the arguments
> >>>>>>> posted by B.J., mentioning the lack of saturating arithmetic on
> >>>>>>> x86/x64 processors.
> >>>>>>>
> >>>>>>> I need to convert a batch of 32 bit float samples to 32 bit int
> >>>>>>> samples with appropriate clipping. I.e. samples which are
> >>>>>>> outside the range of a 32 bit int (-2147483648..2147483647)
> >>>>>>> shall be clipped to  -2147483648 or 2147483647.
> >>>>>>>
> >>>>>>> Because the conversion shall be fast and efficient, I would
> >>>>>>> prefer a solution using SSE (2/3).
> >>>>>>>
> >>>>>>> This sounds like an easy problem, but unfortunately it turned
> >>>>>>> out it's not so simple after all.
> >>>>>>> So I would like to challenge any SSE experts on this list.
> >>>>>>>
> >>>>>>> Here is what I have found out already:
> >>>>>>>
> >>>>>>> Starting with the following sample input:
> >>>>>>>
> >>>>>>>   const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000,
> >>>>>>> 3000000000.0, -3000000000.0);
> >>>>>>>
> >>>>>>> My first approach was to convert this directly:
> >>>>>>>
> >>>>>>>  const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);
> >>>>>>>
> >>>>>>> This results in 1000, -1000, -2147483648, -2147483648, which is
> >>>>>>> correct for all input samples but 3000000000.0. It turns out
> >>>>>>> that all values which cannot be represented by an int32 are converted
> >>>>>>> to -2147483648.
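
The behaviour described above can be reproduced in isolation with a sketch like the following. The wrapper name convert_unclipped is hypothetical; it only wraps a plain _mm_cvtps_epi32 with unaligned load and store so the result can be inspected from ordinary arrays.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: converts four floats with a plain _mm_cvtps_epi32,
   without any clipping. Inputs that do not fit into an int32 come back as
   0x80000000 (-2147483648), whatever their sign. */
static void convert_unclipped(const float in[4], int32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, _mm_cvtps_epi32(_mm_loadu_ps(in)));
}
```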
> >>>>>>>
> >>>>>>> To fix this, my next idea was to clip the maximum value before converting:
> >>>>>>>
> >>>>>>>   const __m128 sseMax =
> >>>>>>>       _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
> >>>>>>>   const __m128i sseClippedInt =
> >>>>>>>       _mm_cvtps_epi32(_mm_min_ps(sseFloatInput, sseMax));
> >>>>>>>
> >>>>>>> Well, the output is the same: 1000, -1000, -2147483648,
> >>>>>>> -2147483648. What is happening here? The maximum possible int32
> >>>>>>> (2147483647) cannot be represented exactly as a floating-point
> >>>>>>> number. So sseMax is rounded up to 2147483648 (printed as
> >>>>>>> 2.14748365e+09), and therefore the clipped value is still (slightly)
> >>>>>>> out of range, resulting in the same int32 values.
> >>>>>>>
> >>>>>>> My final approach was to make sseMax minimally smaller:
> >>>>>>>
> >>>>>>>   const __m128 sseMax =
> >>>>>>>       _mm_set1_ps(std::nextafterf(
> >>>>>>>           float(std::numeric_limits<int32_t>::max()), 0.0f));
> >>>>>>>
> >>>>>>> This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
> >>>>>>> solution so far, but still not what I want, because 3000000000.0
> >>>>>>> does not clip to the maximum possible int32 (2147483647). It is
> >>>>>>> obviously the same problem as before: The clipping limit cannot
> >>>>>>> be represented exactly as a float. (sseMax is 2.14748352e+09
> >>>>>>> here)
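
The rounding behind both failed attempts can be seen with a scalar sketch. The helper name int_to_float is hypothetical. It relies on the fact that a float has a 24-bit significand, so just below 2^31 neighbouring floats are 128 apart and 2147483647 has no exact representation.

```c
#include <stdint.h>

/* Hypothetical helper: rounds an integer to the nearest float.
   Just below 2^31 the representable neighbours are 2147483520 and
   2147483648, so everything in between snaps to one of the two. */
static float int_to_float(int64_t v)
{
    return (float)v;
}
```

With this, int_to_float(2147483647) compares equal to 2147483648.0f, while int_to_float(2147483520) stays at 2147483520.0f, which matches the two sseMax values seen above.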
> >>>>>>>
> >>>>>>> Does anyone have an _efficient_ solution for this problem? Does
> >>>>>>> it really need a (probably very inefficient) detour using double or int64?
> >>
> >
