Hi Federico,

could you have a look at this branch [1] and tell me whether it works
for you? It's SSE3 and AVX only, so far.
Basically, you could

cd /path/to/source/gnuradio/volk
git pull https://github.com/marcusmueller/volk complex_division
cd ../build
make
make install

and use volk_32fc_x2_divide_32fc in your code.

Best regards,
Marcus

[1] https://github.com/marcusmueller/volk/tree/complex_division

On 11.05.2016 23:56, Federico Larroca wrote:
> Thank you very much for your quick answers!
> Marcus (Leech), I found the function you mentioned minutes after I
> sent the mail. Although it apparently works, Performance Monitor is
> behaving really weird when I use it. I have to look up that.
> Marcus (Müller), a very informative answer indeed. I will see if I can
> get that endless fame you mention :-).
> In any case, I'll post what I finally did and the performance gain
> achieved.
> Best
> Federico
>
>
> 2016-05-11 17:47 GMT-03:00 Marcus Müller <marcus.muel...@ettus.com
> <mailto:marcus.muel...@ettus.com>>:
>
>     Hi Federico,
>
>
>     On 11.05.2016 21:09, Federico Larroca wrote:
>>     Hello everyone,
>>     We are on the stage of optimizing our project (gr-isdbt).
>     Awesome!
>>     One of the most consuming blocks is OFDM synchronization, and in
>>     particular the equalization phase. This is simply the division
>>     between the input signal and the estimated channel gains (two
>>     modestly big arrays of ~5000 complexes for each OFDM symbol).
>>     Until now, this was performed by a for loop, so my plan was to
>>     change it for a volk function. However, there is no complex
>>     division in VOLK. So I've done a rather indirect operation using
>>     the property that a/b = a*conj(b)/|b|^2, resulting in six lines
>>     of code (a multiply conjugate, a magnitude squared, a
>>     deinterleave, a couple of float divisions and an interleave).
>>     Obviously the performance gain (measured with the Performance
>>     Monitor) is marginal (to be optimistic)...
>     I have to admit, I'd expect your "simple" for loop doing something
>     like
>
>     void yourclass::normalize(std::complex<float> *a, std::complex<float> *b) 
> {
>         for(size_t idx; idx < a_len; ++idx)
>            a[idx] /= b[idx];
>     }
>
>
>     to be neatly optimizable by the compiler, at least if it knows
>     that a and b aren't pointing at the same memory-
>
>     Your approach,
>     $\frac ab = a \cdot \frac{b^*}{|b|^2}= a \cdot \frac{b^*}{b\,b^*}
>     = a \cdot \frac 1b$
>     is correct; however, in C++ with std::complex<>
>
>     a/b
>
>     pretty much does that already (ugly std lib C++ ahead, from
>     /usr/include/c++/<version>/complex):
>
>       // XXX: This is a grammar school implementation.
>       template<typename _Tp>
>         template<typename _Up>
>         complex<_Tp>&
>         complex<_Tp>::operator/=(const complex<_Up>& __z)
>         {
>           const _Tp __r =  _M_real * __z.real() + _M_imag * __z.imag();
>           const _Tp __n = std::norm(__z);
>           _M_imag = (_M_imag * __z.real() - _M_real * __z.imag()) / __n;
>           _M_real = __r / __n;
>           return *this;
>         }
>
>     And the problem is that while doing that for every a and b
>     separately might mean you can't make full use of SIMD instructions
>     to eg. do four complex divisions at once, it avoids having to load
>     and store original / intermediate values from/to RAM. Basically,
>     your CPU might not be the bottleneck – RAM could be, and doing
>     everything you need for a single division at once, even if done
>     without any optimization, might be faster than incurring
>     additional memory transfers. That's because your memory controller
>     pre-fetches whole cache lines worth of values when getting the
>     first elements of a and b, and working on values from cache is
>     significantly (read: factor > 50) than a single memory transfer.
>
>     So, my immediate recommendation really is to keep your loop as
>     minimal as possible, giving your compiler a solid chance to see
>     the potential for optimization. There might not be much you can
>     do. Even hand-written VOLK kernels aren't always faster than
>     automatically generated optimized machine code.
>>     Does anyone has a better idea? Implementing a new kernel is
>>     simply out of my knowledge scope.
>     Ha! But it would mean endless (additional) fame!
>     Soooo: look at the volk_32fc_x2_multiply_conjugate_32fc.h kernel
>     source. Specifically, at the SSE3 implementation,
>     volk_32fc_x2_multiply_conjugate_32fc_u_sse3(…).
>     You'll notice line 134:
>
>          z = _mm_complexconjugatemul_ps(x, y);
>
>     As you can see, there's a a "VOLK intrinsic",
>
>     _mm_complexconjugatemul_ps
>
>     which is defined in volk_intrinsics.h. That same file contains
>     _mm_magnitudesquared_ps_sse3 . Maybe you can make something clever
>     out of that :)
>
>     Best regards,
>     Marcus
>
>
>     [1] https://gcc.gnu.org/onlinedocs/gcc/Restricted-Pointers.html
>
>     _______________________________________________
>     Discuss-gnuradio mailing list
>     Discuss-gnuradio@gnu.org <mailto:Discuss-gnuradio@gnu.org>
>     https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
>
>

_______________________________________________
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio

Reply via email to