That was fast! Thank you very much! I don't have access to my computer for the weekend, but I'll check it as soon as I get back to the University on tuesday (monday's holiday here). In any case, I got to halfway implementing the AVX kernel, which I copy below just for the record... I didn't even got to compile it, let alone test it, but I surely learned a lot. Best Federico
static inline void volk_32fc_x2_divide_32fc_u_avx(lv_32fc_t* cVector, const lv_32fc_t* aVector, const lv_32fc_t* bVector, unsigned int num_points) { unsigned int number = 0; const unsigned int quarterPoints = num_points / 4; __m256 x, y, z, sq, mag_sq, mag_sq_un, div; lv_32fc_t* c = cVector; const lv_32fc_t* a = aVector; const lv_32fc_t* b = bVector; for(; number < quarterPoints; number++){ x = _mm256_loadu_ps((float*) a); // Load the ar + ai, br + bi ... as ar,ai,br,bi ... y = _mm256_loadu_ps((float*) b); // Load the cr + ci, dr + di ... as cr,ci,dr,di ... z = _mm256_complexconjugatemul_ps(x, y); sq = _mm256_mul_ps(y, y); // Square the values mag_sq_un = _mm256_hadd_ps(w,w); // obtain the actual squared magnitude, although out of order mag_sq = _mm256_permute_ps(mag_sq_un, 0xd8) // I order it div = _mm256_div_ps(z,mag_sq); _mm256_storeu_ps((float*) c, div); // Store the results back into the C container a += 4; b += 4; c += 4; } (I got this far ). 2016-05-14 14:47 GMT-03:00 Marcus Müller <marcus.muel...@ettus.com>: > Hi Federico, > > could you have a look at this branch [1] and tell me whether it works for > you? It's SSE3 and AVX only, so far. > Basically, you could > > cd /path/to/source/gnuradio/volk > git pull https://github.com/marcusmueller/volk complex_division > cd ../build > make > make install > > and use volk_32fc_x2_divide_32fc in your code. > > Best regards, > Marcus > > [1] https://github.com/marcusmueller/volk/tree/complex_division > > > On 11.05.2016 23:56, Federico Larroca wrote: > > Thank you very much for your quick answers! > Marcus (Leech), I found the function you mentioned minutes after I sent > the mail. Although it apparently works, Performance Monitor is behaving > really weird when I use it. I have to look up that. > Marcus (Müller), a very informative answer indeed. I will see if I can get > that endless fame you mention :-). > In any case, I'll post what I finally did and the performance gain > achieved. > Best > Federico > > > 2016-05-11 17:47 GMT-03:00 Marcus Müller <marcus.muel...@ettus.com>: > >> Hi Federico, >> >> >> On 11.05.2016 21:09, Federico Larroca wrote: >> >> Hello everyone, >> We are on the stage of optimizing our project (gr-isdbt). >> >> Awesome! >> >> One of the most consuming blocks is OFDM synchronization, and in >> particular the equalization phase. This is simply the division between the >> input signal and the estimated channel gains (two modestly big arrays of >> ~5000 complexes for each OFDM symbol). >> Until now, this was performed by a for loop, so my plan was to change it >> for a volk function. However, there is no complex division in VOLK. So I've >> done a rather indirect operation using the property that a/b = >> a*conj(b)/|b|^2, resulting in six lines of code (a multiply conjugate, a >> magnitude squared, a deinterleave, a couple of float divisions and an >> interleave). Obviously the performance gain (measured with the Performance >> Monitor) is marginal (to be optimistic)... >> >> I have to admit, I'd expect your "simple" for loop doing something like >> >> void yourclass::normalize(std::complex<float> *a, std::complex<float> *b) { >> for(size_t idx; idx < a_len; ++idx) >> a[idx] /= b[idx]; >> } >> >> >> to be neatly optimizable by the compiler, at least if it knows that a and >> b aren't pointing at the same memory- >> >> Your approach, >> [image: $\frac ab = a \cdot \frac{b^*}{|b|^2}= a \cdot \frac{b^*}{b\,b^*} >> = a \cdot \frac 1b$] >> is correct; however, in C++ with std::complex<> >> >> a/b >> >> pretty much does that already (ugly std lib C++ ahead, from >> /usr/include/c++/<version>/complex): >> >> // XXX: This is a grammar school implementation. >> template<typename _Tp> >> template<typename _Up> >> complex<_Tp>& >> complex<_Tp>::operator/=(const complex<_Up>& __z) >> { >> const _Tp __r = _M_real * __z.real() + _M_imag * __z.imag(); >> const _Tp __n = std::norm(__z); >> _M_imag = (_M_imag * __z.real() - _M_real * __z.imag()) / __n; >> _M_real = __r / __n; >> return *this; >> } >> >> And the problem is that while doing that for every a and b separately >> might mean you can't make full use of SIMD instructions to eg. do four >> complex divisions at once, it avoids having to load and store original / >> intermediate values from/to RAM. Basically, your CPU might not be the >> bottleneck – RAM could be, and doing everything you need for a single >> division at once, even if done without any optimization, might be faster >> than incurring additional memory transfers. That's because your memory >> controller pre-fetches whole cache lines worth of values when getting the >> first elements of a and b, and working on values from cache is >> significantly (read: factor > 50) than a single memory transfer. >> >> So, my immediate recommendation really is to keep your loop as minimal as >> possible, giving your compiler a solid chance to see the potential for >> optimization. There might not be much you can do. Even hand-written VOLK >> kernels aren't always faster than automatically generated optimized machine >> code. >> >> Does anyone has a better idea? Implementing a new kernel is simply out of >> my knowledge scope. >> >> Ha! But it would mean endless (additional) fame! >> Soooo: look at the volk_32fc_x2_multiply_conjugate_32fc.h kernel source. >> Specifically, at the SSE3 implementation, >> volk_32fc_x2_multiply_conjugate_32fc_u_sse3(…). >> You'll notice line 134: >> >> z = _mm_complexconjugatemul_ps(x, y); >> >> >> As you can see, there's a a "VOLK intrinsic", >> >> _mm_complexconjugatemul_ps >> >> which is defined in volk_intrinsics.h. That same file contains >> _mm_magnitudesquared_ps_sse3 . Maybe you can make something clever out >> of that :) >> >> Best regards, >> Marcus >> >> >> [1] https://gcc.gnu.org/onlinedocs/gcc/Restricted-Pointers.html >> >> _______________________________________________ >> Discuss-gnuradio mailing list >> Discuss-gnuradio@gnu.org >> https://lists.gnu.org/mailman/listinfo/discuss-gnuradio >> >> > >
_______________________________________________ Discuss-gnuradio mailing list Discuss-gnuradio@gnu.org https://lists.gnu.org/mailman/listinfo/discuss-gnuradio