That code snippet would be a good addition to the musicdsp source code archive:
https://www.musicdsp.org/en/latest/Other/index.html

Best,
Steffan

> On 26. Apr 2023, at 10:50, Stefano D'Angelo wrote:
>
> Yeah, Stefan's version is easier/better.
>
> It only needs an extra _mm_castps_si128() to compute m, which costs nothing.
>
> Best,
>
> Stefano D'Angelo
>
> On 26/04/23 10:42, Stefan Stenzel wrote:
>> Sorry for spamming, but I am obsessive about optimisations and cannot spare
>> you the version with one less instruction:
>>
>> int main()
>> {
>>     const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>     const __m128 fcmp = _mm_set_ps(0x0FFFFFFp3f,0x0FFFFFFp3f,0x0FFFFFFp3f,0x0FFFFFFp3f);
>>
>>     __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>     __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
>>     __m128i r = _mm_add_epi32(x,m);
>>
>>     printf("%08X %08X %08X %08X\n",
>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>         _mm_cvtsi128_si32(r)
>>     );
>> }
>>
>>
>>> On 26. Apr 2023, at 10:34, Stefan Stenzel wrote:
>>>
>>> Stefano’s solution is elegant because it exploits the fact that values
>>> outside the range are all set to 0x80000000.
>>> But the implementation is a bit overcomplicated, this works as well with
>>> less instructions, same result:
>>>
>>> int main()
>>> {
>>>     const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>>     const __m128 fcmp = _mm_set_ps(0x0FFFFFFp3f,0x0FFFFFFp3f,0x0FFFFFFp3f,0x0FFFFFFp3f);
>>>
>>>     __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>>     __m128i m = _mm_cmpge_ps(sseFloatInput,fcmp);
>>>     __m128i r = _mm_sub_epi32(x,_mm_srli_epi32(m,31));
>>>
>>>     printf("%08X %08X %08X %08X\n",
>>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>>         _mm_cvtsi128_si32(r)
>>>     );
>>> }
>>>
>>>
>>>> On 26. Apr 2023, at 10:11, Stefano D'Angelo wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm no SSE expert either but I would exploit IEEE 754r single precision
>>>> floating point representation.
>>>>
>>>> Essentially you have that 0x4f000000 represents 2147483648.f while
>>>> 0x4effffff represents 2147483520.f. OTOH, in 2's complement 32 bits,
>>>> 0x7fffffff is 2147483647 and 0x80000000 is -2147483648.
>>>>
>>>> The idea is then to convert using _mm_cvtps_epi32 as you did, and subtract
>>>> 1 if the input is represented as a number bigger than 0x4effffff.
>>>>
>>>> Here's the code:
>>>>
>>>> #include <smmintrin.h>
>>>> #include <emmintrin.h>
>>>> #include <stdio.h>
>>>>
>>>> int main()
>>>> {
>>>>     const __m128 sseFloatInput = _mm_set_ps(1000.f, -1000.f, 3000000000.f, -3000000000.f);
>>>>
>>>>     const __m128i ones = _mm_set_epi32(1, 1, 1, 1);
>>>>     const __m128i h = _mm_set_epi32(0x4f000000, 0x4f000000, 0x4f000000, 0x4f000000);
>>>>
>>>>     __m128i x = _mm_cvtps_epi32(sseFloatInput);
>>>>     __m128i i = _mm_castps_si128(sseFloatInput);
>>>>     __m128i m = _mm_max_epi32(i, h);
>>>>     __m128i s = _mm_sub_epi32(m, h);
>>>>     __m128i y = _mm_sign_epi32(ones, s);
>>>>     __m128i r = _mm_sub_epi32(x,y);
>>>>
>>>>     printf("%d %d %d %d\n",
>>>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 3)),
>>>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 2)),
>>>>         _mm_cvtsi128_si32(_mm_shuffle_epi32(r, 1)),
>>>>         _mm_cvtsi128_si32(r)
>>>>     );
>>>> }
>>>>
>>>> I get the correct result: 1000 -1000 2147483647 -2147483648.
>>>>
>>>> HTH.
>>>>
>>>> Best,
>>>>
>>>> Stefano D'Angelo
>>>>
>>>> On 26/04/23 09:09, Holger Strauss wrote:
>>>>> Hi,
>>>>>
>>>>> thank you all for the interesting discussion posts on denorms and
>>>>> fixed-point/floating-point processing.
>>>>>
>>>>> I have a problem that is very much related to the arguments posted by B.J.,
>>>>> mentioning the lack of saturation arithmetics on x86/x64 processors.
>>>>>
>>>>> I need to convert a batch of 32 bit float samples to 32 bit int samples with
>>>>> appropriate clipping. I.e. samples which are outside the range of a 32 bit
>>>>> int (-2147483648..2147483647) shall be clipped to -2147483648 or 2147483647.
>>>>>
>>>>> Because the conversion shall be fast and efficient, I would prefer a
>>>>> solution using SSE (2/3).
>>>>>
>>>>> This sounds like an easy problem, but unfortunately it turned out it's not
>>>>> so simple after all.
>>>>> So I would like to challenge any SSE experts on this list.
>>>>>
>>>>> Here is what I have found out already:
>>>>>
>>>>> Starting with the following sample input:
>>>>>
>>>>> const __m128 sseFloatInput = _mm_set_ps(1000.0, -1000, 3000000000.0, -3000000000.0);
>>>>>
>>>>> My first approach was to convert this directly:
>>>>>
>>>>> const __m128i sseClippedInt = _mm_cvtps_epi32(sseFloatInput);
>>>>>
>>>>> This results in 1000, -1000, -2147483648, -2147483648, which is correct for
>>>>> all input samples but 3000000000.0. It turns out that all values which
>>>>> cannot be represented by an int32 are converted to -2147483648.
>>>>>
>>>>> To fix this, my next idea was to clip the maximum value before converting:
>>>>>
>>>>> const __m128 sseMax = _mm_set1_ps(float(std::numeric_limits<int32_t>::max()));
>>>>> const __m128i sseClippedInt = _mm_cvtps_epi32(_mm_min_ps(sseFloatInput, sseMax));
>>>>>
>>>>> Well, the output is the same: 1000, -1000, -2147483648, -2147483648. What is
>>>>> happening here? The maximum possible int32 (2147483647) cannot be
>>>>> represented exactly as a floating-point number. So sseMax is slightly larger
>>>>> (2.14748365e+09) and therefore sseClipMax is still (slightly) out of range,
>>>>> resulting in the same int32 values.
>>>>>
>>>>> My final approach was to make sseMax minimally smaller:
>>>>>
>>>>> const __m128 sseMax = _mm_set1_ps(std::nextafterf(float(std::numeric_limits<int32_t>::max()), 0.0f));
>>>>>
>>>>> This results in 1000, -1000, 2147483520, -2147483648. This is the 'best'
>>>>> solution so far, but still not what I want, because 3000000000.0 does not
>>>>> clip to the maximum possible int32 (2147483647). It is obviously the same
>>>>> problem as before: The clipping limit cannot be represented exactly as a
>>>>> float. (sseMax is 2.14748352e+09 here)
>>>>>
>>>>> Does anyone have an _efficient_ solution for this problem? Does it really
>>>>> need a (probably very inefficient) detour using double or int64?

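For reference, here is a minimal sketch of how the convert/compare/add idea from the thread could be packaged for a whole buffer, with the _mm_castps_si128() that Stefano mentions already applied. It is not code posted by anyone above: the function names are made up for illustration, and the threshold is an assumption. The sketch compares against 2^31 (2147483648.0f, exactly representable as a float) rather than the posted 0x0FFFFFFp3f, because that hex constant evaluates to 0xFFFFFF * 2^3 = 134217720, which would also shift in-range samples at or above that value by one.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the thread: converts four floats to int32,
   saturating at INT32_MIN / INT32_MAX. NaN lanes come out as INT32_MIN. */
static __m128i f32x4_to_i32x4_saturated(__m128 v)
{
    /* Assumption: 2^31 as the threshold (exactly representable as a float). */
    const __m128 limit = _mm_set1_ps(2147483648.0f);

    /* Out-of-range and NaN lanes become 0x80000000. */
    const __m128i x = _mm_cvtps_epi32(v);

    /* All ones (-1) in lanes where v >= 2^31, zero elsewhere. */
    const __m128i m = _mm_castps_si128(_mm_cmpge_ps(v, limit));

    /* 0x80000000 + (-1) wraps to 0x7FFFFFFF; other lanes are unchanged. */
    return _mm_add_epi32(x, m);
}

/* Hypothetical array wrapper: four samples per iteration, then the tail. */
void f32_to_i32_saturated(const float *in, int32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        const __m128i r = f32x4_to_i32x4_saturated(_mm_loadu_ps(in + i));
        _mm_storeu_si128((__m128i *)(out + i), r);
    }
    for (; i < n; ++i) {
        /* Reuse the vector path for leftovers; the upper lanes are zero. */
        const __m128i r = f32x4_to_i32x4_saturated(_mm_set_ss(in[i]));
        out[i] = _mm_cvtsi128_si32(r);
    }
}
```

Called on the four sample inputs from the original question (1000, -1000, 3000000000, -3000000000), this should give 1000, -1000, 2147483647, -2147483648. Negative overflow needs no extra work because the conversion already yields 0x80000000 there. In-range samples with a fractional part are rounded according to the current MXCSR rounding mode (round-to-nearest-even by default), the same as in the snippets above.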