Hi Johannes, always remember that the "f" suffix after fftw is just for "float", without that suffix you use the double precision FFTw.
> fftwf_execute_dft(plan, *in, *out) http://www.fftw.org/fftw3_doc/New_002darray-Execute-Functions.html > fftwf_alignment_of() same page, and http://www.fftw.org/fftw3_doc/Planner-Flags.html#index-fftw_005falignment_005fof under "FFTW_UNALIGNED". also, and this might be a bit of a downer: use the source, Luke! (also, this might even be more of a downer:) fftw3/kernel/align.c, reduced by all false #if: #define ALGN 16 int X(ialignment_of)(R *p) { return (int)(((uintptr_t) p) % ALGN); } or in other words: address % 16 However, we should really be using the fftw function for that, in case they get around to writing say AVX7 FFT kernels, which need 2048-byte aligned memory instructions or so¹. Best regards, Marcus ¹if you're from the year 2028, and are reading up on intels new AVX7 instruction on the GNU Radio mailing list archive: this is a joke. Obviously, AVX7 only needs 1024-byte alignment and that you only use right-hand quantum registers. On Fri, 2019-06-07 at 07:51 +0000, Johannes Demel wrote: > Hi Marcus, > > do you have any online resources that point to the official > documentation of `fftwf_execute_dft(plan, *in, *out)` and > `fftwf_alignment_of()`? (for the latter this mail thread is on the > first > page of my Google results.) > All the documentation I could find is [0] and this seems like some > Intel > wrapper that is not part of FFTW?! > > Cheers > Johannes > > [0] > http://sep.stanford.edu/sep/claudio/Research/Prst_ExpRefl/ShtPSPI/intel/mkl/10.0.3.020/interfaces/fftw3xf/wrappers/fftwf_execute_dft.c > > Am 06.06.19 um 14:57 schrieb Müller, Marcus (CEL): > > Hi Albin, > > > > no, it's not been discussed on the list, as far as my memory > > reaches. > > We're talking about gr:fft:fft_vcc, right? > > > > Because that output memory shifting only happens when you set shift > > = 1 > > in the constructor. I don't have any hard feelings about whether to > > shift the output or perform an exp(j π n) frequency shift (a.k.a. > > multiplying with +1 -1 …). My gut feeling is that copying blocks of > > memory is probably significantly faster than accessing all elements > > of > > said blocks through the CPU sequentially. Memcpy implementations > > are > > heavily hand-assembler optimized! > > > > However, the actual FFT would access these elements most > > definitely, > > so, for amounts of input samples smaller than the available CPU > > cache, > > the +1 -1 variant would act like an elegant prefetcher and probably > > be > > a bit more efficient, but for sizes larger, it would probably even > > thrash the caches a bit and mean one additional loop over all data. > > So, honestly, in many cases I'd expect a small improvement, in some > > a > > performance degradation. Not quite sure whether looking into that > > is > > clever. > > > > I'd however say that it should be obvious to do the +1 -1 input > > multiplication instead of the memcpy output shift if a window is > > applied, because the +1 -1 can be absorbed into the window in that > > case. (Watch out for the odd-length corner case where you get two > > different windows depending on whether you do an odd- or an even- > > numbered transform.) > > > > There's another easy way we can optimize things at nearly zero > > cost: > > > > The memcpy to the FFTW input buffer only happens so that data is > > properly aligned for FFTW to be able to use SIMD. Obviously, that > > copy > > could be avoided alltogether if the input *is* already aligned. > > > > FFTW does have a fftwf_alignment_of() that you can use to compare > > the > > alignment of the especially aligned fftw plan input buffer, and the > > current workload input pointer. > > Same applies to the output buffer of the fftw plan and the output > > pointer. > > > > If the input buffer and the work input pointer, and the output > > buffer > > and the work output pointer, are of the same aligment, one could > > simply > > use fftwf_execute_dft(plan, *in, *out) instead of > > fftwf_execute(plan), > > and not memcpy anything (unless the user requests shifting). > > > > Chances are that, considering our circular buffers are page- > > aligned, > > and that fft_length mod 4 = 0, for most applications, > > > > fft_length * sizeof(gr_complex_float) = (fft_length/4) * 4 * 64 > > > > are typically already 256-byte aligned, and things are "inherently > > good". > > > > > > Cheers, > > Marcus > > > > On Thu, 2019-06-06 at 11:13 +0200, Albin Stigö wrote: > > > Hi, > > > > > > There's a lot of memcpy/memmove going on in the FFT kernel which > > > I > > > suspect degrades performance quite a bit... This is partly > > > because > > > some platforms requires buffers to be aligned for SIMD to work > > > and > > > partly because DC is not normally centered (which is normally > > > what you > > > would like to see in a spektrum). > > > > > > For ARM in particular the alignment requirements are less strict > > > so > > > that would get rid of at least one memcpy for that platform. > > > > > > For centering an FFT at DC there's a nice trick, you just rotate > > > the > > > input by pi, that is multiply by 1, -1, 1, -1... I suspect that > > > would > > > be quite a bit faster than two memcpy and should be available on > > > all > > > platforms. > > > > > > Has this been discussed on the list before? > > > > > > > > > --Albin > > > > > > _______________________________________________ > > > Discuss-gnuradio mailing list > > > Discuss-gnuradio@gnu.org > > > https://lists.gnu.org/mailman/listinfo/discuss-gnuradio > > > > > > _______________________________________________ > > > Discuss-gnuradio mailing list > > > Discuss-gnuradio@gnu.org > > > https://lists.gnu.org/mailman/listinfo/discuss-gnuradio
smime.p7s
Description: S/MIME cryptographic signature
_______________________________________________ Discuss-gnuradio mailing list Discuss-gnuradio@gnu.org https://lists.gnu.org/mailman/listinfo/discuss-gnuradio