On 12/13/2015 8:08 PM, Ganesh Ajjanagadde wrote:
> On Sun, Dec 13, 2015 at 5:55 PM, Ganesh Ajjanagadde
> <gajjanaga...@gmail.com> wrote:
>> On Sun, Dec 13, 2015 at 5:47 PM, Ronald S. Bultje <rsbul...@gmail.com> wrote:
>>> Hi,
>>>
>>> On Sun, Dec 13, 2015 at 4:59 PM, Ganesh Ajjanagadde <gajjanaga...@gmail.com>
>>> wrote:
>>>>
>>>> fma is a faster function on architectures supporting a native CPU
>>>> instruction for it.
>>>> This may be tested by the ISO C optionally defined FP_FAST_FMA. Although
>>>> in the x86 lineup this came fairly late
>>>> (from Haswell onwards, and hence is absent unless appropriate -march is
>>>> passed),
>>>> numerous other architectures support it:
>>>> https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation.
>>>>
>>>> Concretely, one can expect ~ 15-25% speedup that is of course heavily
>>>> architecture dependent.
>>>>
>>>> This patch also ensures that as people migrate to newer CPU's, the
>>>> benefit will slowly trickle in.
>>>>
>>>> I doubt this will cause build failures on broken libm's since I can't
>>>> imagine a platform where FP_FAST_FMA is defined but the function fma is
>>>> absent.
>>>>
>>>> Sample benchmark (x86-64, Haswell, GNU/Linux under -march=native)
>>>>
>>>> old:
>>>> 515828458 decicycles in build_filter (loop 1000), 1024 runs, 0 skips
>>>>
>>>> new (fma):
>>>> 435866377 decicycles in build_filter (loop 1000), 1024 runs, 0 skips
>>>>
>>>> Tested with FATE.
>>>>
>>>> Signed-off-by: Ganesh Ajjanagadde <gajjanaga...@gmail.com>
>>>> ---
>>>>  libswresample/resample.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/libswresample/resample.c b/libswresample/resample.c
>>>> index 34eb4c0..e61d4c5 100644
>>>> --- a/libswresample/resample.c
>>>> +++ b/libswresample/resample.c
>>>> @@ -33,8 +33,12 @@ static inline double eval_poly(const double *coeff, int size, double x) {
>>>>      double sum = coeff[size-1];
>>>>      int i;
>>>>      for (i = size-2; i >= 0; --i) {
>>>> +#ifdef FP_FAST_FMA
>>>> +        sum = fma(sum, x, coeff[i]);
>>>> +#else
>>>>          sum *= x;
>>>>          sum += coeff[i];
>>>> +#endif
>>>>      }
>>>>      return sum;
>>>>  }
>>>> --
>>>> 2.6.4
>>>
>>>
>>> Nope, this is not how we do CPU-specific optimizations. Check example
>>> implementations in libswresample/x86/*.asm and the related init functions
>>> plus macros to check for runtime cpu support in libswresample/x86/*_init.c.
>>> You want to follow that pattern.
>>
>> No, this is not x86 specific. This is generic code. If I did such a
>> maneouver, benefits would apply only to x86, an inferior outcome.
>
> To clarify: yes, in theory one could dump such things into
> swresample/x86, swresample/aarch64, and a ton of other architectures
> (for which some arches are actually lacking). Such a diff is far
> larger and more brittle - I can't even test things like mips and the
> like, and looking up the manuals for each and every one of these to
> find out when/what is the fma equivalent is a pain in the neck.
>
> ISO C provides a mechanism, albeit build-time and not runtime detection.
>
> This patch is thus something that gives benefits at minimal scope for
> regressions. Unless others show where/how fma detection can be done
> for all arches (aarch64, arm, mips, powerpc, itanium, etc in addition
> to x86-64), I view your idea as future work.

FP_FAST_FMA is apparently not defined on mingw-w64, even though it
supports fma() and GCC generates FMA3/FMA4 instructions when targeting
CPUs that have them.

I also noticed that on x86_32 GCC generates a call to an external fma
function instead of inlining the relevant FMA3/FMA4 instructions, the
same as it does when the target lacks fast fma instructions, so simply
checking the target CPU is not enough. On such builds this patch will
probably mean a slowdown.

No idea what GCC does with other arches.
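
For anyone who wants to check what their own toolchain does, here is a minimal standalone sketch, not the actual FFmpeg code. eval_poly mirrors the loop from the quoted patch; eval_poly_uses_fma() and target_has_fma3() are hypothetical helpers I made up for illustration. The second one relies on __FMA__, which GCC and Clang predefine on x86 when FMA3 code generation is enabled (for example via -mfma or a suitable -march); I am assuming nothing about other compilers.

```c
#include <math.h>

/* Horner evaluation as in the quoted patch: one fma() per coefficient when
 * the toolchain advertises FP_FAST_FMA, separate multiply and add otherwise. */
double eval_poly(const double *coeff, int size, double x)
{
    double sum = coeff[size - 1];
    int i;
    for (i = size - 2; i >= 0; --i) {
#ifdef FP_FAST_FMA
        sum = fma(sum, x, coeff[i]);
#else
        sum *= x;
        sum += coeff[i];
#endif
    }
    return sum;
}

/* Hypothetical helper: returns 1 if the fma() branch above was compiled in. */
int eval_poly_uses_fma(void)
{
#ifdef FP_FAST_FMA
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: returns 1 if the compiler was told the x86 target
 * has FMA3 (__FMA__ is predefined by GCC and Clang in that case). */
int target_has_fma3(void)
{
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}
```

If target_has_fma3() returns 1 while eval_poly_uses_fma() returns 0, the build is in the mingw-w64 situation described above: FMA instructions are available but the patch is a no-op. Whether fma() actually gets inlined can only be confirmed by looking at the generated assembly (e.g. gcc -S) for the build in question.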