That was fast! Thank you very much! I don't have access to my computer for
the weekend, but I'll check it as soon as I get back to the University on
tuesday (monday's holiday here).
In any case, I got to halfway implementing the AVX kernel, which I copy
below just for the record... I didn't even got to compile it, let alone
test it, but I surely learned a lot.
Best
Federico

static inline void
volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t* aVector,
                                           const lv_32fc_t* bVector,
unsigned int num_points)
{
  unsigned int number = 0;
  const unsigned int quarterPoints = num_points / 4;

  __m256 x, y, z, sq, mag_sq, mag_sq_un, div;
  lv_32fc_t* c = cVector;
  const lv_32fc_t* a = aVector;
  const lv_32fc_t* b = bVector;

  for(; number < quarterPoints; number++){
    x = _mm256_loadu_ps((float*) a); // Load the ar + ai, br + bi ... as
ar,ai,br,bi ...
    y = _mm256_loadu_ps((float*) b); // Load the cr + ci, dr + di ... as
cr,ci,dr,di ...
    z = _mm256_complexconjugatemul_ps(x, y);
    sq = _mm256_mul_ps(y, y); // Square the values
    mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared
magnitude, although out of order
    mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8) // I order it
    div = _mm256_div_ps(z,mag_sq);

    _mm256_storeu_ps((float*) c, div); // Store the results back into the C
container

    a += 4;
    b += 4;
    c += 4;
  }

(I got this far ).

2016-05-14 14:47 GMT-03:00 Marcus Müller <marcus.muel...@ettus.com>:

> Hi Federico,
>
> could you have a look at this branch [1] and tell me whether it works for
> you? It's SSE3 and AVX only, so far.
> Basically, you could
>
> cd /path/to/source/gnuradio/volk
> git pull https://github.com/marcusmueller/volk complex_division
> cd ../build
> make
> make install
>
> and use volk_32fc_x2_divide_32fc in your code.
>
> Best regards,
> Marcus
>
> [1] https://github.com/marcusmueller/volk/tree/complex_division
>
>
> On 11.05.2016 23:56, Federico Larroca wrote:
>
> Thank you very much for your quick answers!
> Marcus (Leech), I found the function you mentioned minutes after I sent
> the mail. Although it apparently works, Performance Monitor is behaving
> really weird when I use it. I have to look up that.
> Marcus (Müller), a very informative answer indeed. I will see if I can get
> that endless fame you mention :-).
> In any case, I'll post what I finally did and the performance gain
> achieved.
> Best
> Federico
>
>
> 2016-05-11 17:47 GMT-03:00 Marcus Müller <marcus.muel...@ettus.com>:
>
>> Hi Federico,
>>
>>
>> On 11.05.2016 21:09, Federico Larroca wrote:
>>
>> Hello everyone,
>> We are on the stage of optimizing our project (gr-isdbt).
>>
>> Awesome!
>>
>> One of the most consuming blocks is OFDM synchronization, and in
>> particular the equalization phase. This is simply the division between the
>> input signal and the estimated channel gains (two modestly big arrays of
>> ~5000 complexes for each OFDM symbol).
>> Until now, this was performed by a for loop, so my plan was to change it
>> for a volk function. However, there is no complex division in VOLK. So I've
>> done a rather indirect operation using the property that a/b =
>> a*conj(b)/|b|^2, resulting in six lines of code (a multiply conjugate, a
>> magnitude squared, a deinterleave, a couple of float divisions and an
>> interleave). Obviously the performance gain (measured with the Performance
>> Monitor) is marginal (to be optimistic)...
>>
>> I have to admit, I'd expect your "simple" for loop doing something like
>>
>> void yourclass::normalize(std::complex<float> *a, std::complex<float> *b) {
>>     for(size_t idx; idx < a_len; ++idx)
>>        a[idx] /= b[idx];
>> }
>>
>>
>> to be neatly optimizable by the compiler, at least if it knows that a and
>> b aren't pointing at the same memory-
>>
>> Your approach,
>> [image: $\frac ab = a \cdot \frac{b^*}{|b|^2}= a \cdot \frac{b^*}{b\,b^*}
>> = a \cdot \frac 1b$]
>> is correct; however, in C++ with std::complex<>
>>
>> a/b
>>
>> pretty much does that already (ugly std lib C++ ahead, from
>> /usr/include/c++/<version>/complex):
>>
>>   // XXX: This is a grammar school implementation.
>>   template<typename _Tp>
>>     template<typename _Up>
>>     complex<_Tp>&
>>     complex<_Tp>::operator/=(const complex<_Up>& __z)
>>     {
>>       const _Tp __r =  _M_real * __z.real() + _M_imag * __z.imag();
>>       const _Tp __n = std::norm(__z);
>>       _M_imag = (_M_imag * __z.real() - _M_real * __z.imag()) / __n;
>>       _M_real = __r / __n;
>>       return *this;
>>     }
>>
>> And the problem is that while doing that for every a and b separately
>> might mean you can't make full use of SIMD instructions to eg. do four
>> complex divisions at once, it avoids having to load and store original /
>> intermediate values from/to RAM. Basically, your CPU might not be the
>> bottleneck – RAM could be, and doing everything you need for a single
>> division at once, even if done without any optimization, might be faster
>> than incurring additional memory transfers. That's because your memory
>> controller pre-fetches whole cache lines worth of values when getting the
>> first elements of a and b, and working on values from cache is
>> significantly (read: factor > 50) than a single memory transfer.
>>
>> So, my immediate recommendation really is to keep your loop as minimal as
>> possible, giving your compiler a solid chance to see the potential for
>> optimization. There might not be much you can do. Even hand-written VOLK
>> kernels aren't always faster than automatically generated optimized machine
>> code.
>>
>> Does anyone has a better idea? Implementing a new kernel is simply out of
>> my knowledge scope.
>>
>> Ha! But it would mean endless (additional) fame!
>> Soooo: look at the volk_32fc_x2_multiply_conjugate_32fc.h kernel source.
>> Specifically, at the SSE3 implementation,
>> volk_32fc_x2_multiply_conjugate_32fc_u_sse3(…).
>> You'll notice line 134:
>>
>>      z = _mm_complexconjugatemul_ps(x, y);
>>
>>
>> As you can see, there's a a "VOLK intrinsic",
>>
>> _mm_complexconjugatemul_ps
>>
>> which is defined in volk_intrinsics.h. That same file contains
>> _mm_magnitudesquared_ps_sse3 . Maybe you can make something clever out
>> of that :)
>>
>> Best regards,
>> Marcus
>>
>>
>> [1] https://gcc.gnu.org/onlinedocs/gcc/Restricted-Pointers.html
>>
>> _______________________________________________
>> Discuss-gnuradio mailing list
>> Discuss-gnuradio@gnu.org
>> https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
>>
>>
>
>
_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio

Reply via email to