On Sun, Dec 13, 2015 at 10:25 PM, Ronald S. Bultje <rsbul...@gmail.com> wrote:
> Hi,
>
> On Sun, Dec 13, 2015 at 7:29 PM, Ganesh Ajjanagadde <gajja...@mit.edu>
> wrote:
>
>> The worst part is that it is a bad idea to do runtime dispatch on the
>> fma() itself, as the function call overhead will be non-negligible, and
>> so one can't create a helper API in avutil or elsewhere. Thus, it can
>> only be used when a function is in a critical hotspot, where the
>> duplication of code and maintenance burden can be justified for the
>> performance benefits. I might be missing something here though.
>
>
> You would DSP'ize the loop, not the single fma instruction, right?
> Depending on the size of the array (i.e. the size variable), it may be ok.

That is a general problem: fma is useful in a variety of contexts, some of
which do not naturally map onto e.g. a level one BLAS a'*b + c. Thus, in an
ideal world (say, if I were developing only for my own machine), I would
simply use fma() wherever there is an x += y * z and reap a cheap performance
gain. I was planning demonstrations for vsrc_mandelbrot and avutil/lls (the
Cholesky code), but as you pointed out originally, this cheap method is not
something FFmpeg can accept. It is this lack of generality, i.e. the
inability to create a generic, easy-to-use fma wrapper across FFmpeg, that I
was referring to, not this particular case.

More concretely, to address your question: I avoided DSP'izing the loop,
since keeping the polynomial evaluation inline can potentially give a smart
compiler more room for optimization in bessel. For instance, I am not an asm
person, but based on what I know of the SIMD idea, the numerator and
denominator polynomials can be evaluated in parallel, at least until the
point where their degrees match. This depends on:
1. The compiler unrolling all of these loops, which it can do in principle,
   since it knows the sizes of the arrays and they are quite small (max 15).
2. The compiler being able to auto-vectorize the relevant computations.
3. Any alignment or other relevant hackery being done, of which I know
   nothing.

Anyway, the short summary is: I like keeping code as generic as possible. I
won't write asm for this particular case; anyone interested is free to create
an optimized bessel routine. For someone with the know-how, it should be
trivial. Of course, bessel is not used in speed-critical code, and hence I
don't like the idea myself. Note: avcodec/kbdwin also uses a bessel that is
inferior to the current code, so maybe there is some utility.

>
> Ronald

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel