On Sun, Dec 13, 2015 at 5:55 PM, Ganesh Ajjanagadde <gajjanaga...@gmail.com> wrote:
> On Sun, Dec 13, 2015 at 5:47 PM, Ronald S. Bultje <rsbul...@gmail.com> wrote:
>> Hi,
>>
>> On Sun, Dec 13, 2015 at 4:59 PM, Ganesh Ajjanagadde <gajjanaga...@gmail.com>
>> wrote:
>>>
>>> fma is a faster function on architectures supporting a native CPU
>>> instruction for it.
>>> This may be tested by the ISO C optionally defined FP_FAST_FMA. Although
>>> in the x86 lineup this came fairly late
>>> (from Haswell onwards, and hence is absent unless appropriate -march is
>>> passed),
>>> numerous other architectures support it:
>>> https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation.
>>>
>>> Concretely, one can expect ~ 15-25% speedup that is of course heavily
>>> architecture dependent.
>>>
>>> This patch also ensures that as people migrate to newer CPUs, the
>>> benefit will slowly trickle in.
>>>
>>> I doubt this will cause build failures on broken libm's since I can't
>>> imagine a platform where FP_FAST_FMA is defined but the function fma is
>>> absent.
>>>
>>> Sample benchmark (x86-64, Haswell, GNU/Linux under -march=native)
>>>
>>> old:
>>> 515828458 decicycles in build_filter (loop 1000), 1024 runs, 0 skips
>>>
>>> new (fma):
>>> 435866377 decicycles in build_filter (loop 1000), 1024 runs, 0 skips
>>>
>>> Tested with FATE.
>>>
>>> Signed-off-by: Ganesh Ajjanagadde <gajjanaga...@gmail.com>
>>> ---
>>>  libswresample/resample.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/libswresample/resample.c b/libswresample/resample.c
>>> index 34eb4c0..e61d4c5 100644
>>> --- a/libswresample/resample.c
>>> +++ b/libswresample/resample.c
>>> @@ -33,8 +33,12 @@ static inline double eval_poly(const double *coeff, int size, double x) {
>>>      double sum = coeff[size-1];
>>>      int i;
>>>      for (i = size-2; i >= 0; --i) {
>>> +#ifdef FP_FAST_FMA
>>> +        sum = fma(sum, x, coeff[i]);
>>> +#else
>>>          sum *= x;
>>>          sum += coeff[i];
>>> +#endif
>>>      }
>>>      return sum;
>>>  }
>>> --
>>> 2.6.4
>>
>>
>> Nope, this is not how we do CPU-specific optimizations. Check example
>> implementations in libswresample/x86/*.asm and the related init functions
>> plus macros to check for runtime cpu support in libswresample/x86/*_init.c.
>> You want to follow that pattern.
>
> No, this is not x86 specific. This is generic code. If I did such a
> maneuver, benefits would apply only to x86, an inferior outcome.
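
For reference, here is a minimal standalone sketch of the build-time selection the quoted patch makes. The eval_poly body follows the diff. The eval_poly_selfcheck helper and its test values are hypothetical additions for illustration only and are not part of the patch.

```c
#include <assert.h>
#include <math.h>

/*
 * Horner evaluation of
 *   coeff[0] + coeff[1]*x + ... + coeff[size-1]*x^(size-1),
 * mirroring eval_poly from the quoted patch. Assumes size >= 1.
 */
static inline double eval_poly(const double *coeff, int size, double x)
{
    double sum = coeff[size - 1];
    int i;
    for (i = size - 2; i >= 0; --i) {
#ifdef FP_FAST_FMA
        /* ISO C: <math.h> optionally defines FP_FAST_FMA when fma() is
         * generally about as fast as, or faster than, a multiply and an
         * add of double operands. */
        sum = fma(sum, x, coeff[i]);
#else
        /* Portable fallback: separate multiply and add. */
        sum *= x;
        sum += coeff[i];
#endif
    }
    return sum;
}

/*
 * Hypothetical self-check (not in the patch). All values are exactly
 * representable in double, so both branches must agree:
 * 1 + 2*2 + 3*2*2 = 17.
 */
static void eval_poly_selfcheck(void)
{
    static const double coeff[] = { 1.0, 2.0, 3.0 };
    assert(eval_poly(coeff, 3, 2.0) == 17.0);
}
```

Which branch is compiled depends entirely on the compiler flags at build time. As noted in the quoted message, on x86-64 the fma branch is only expected with an appropriate -march. On some toolchains the fma call may also need linking against libm.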

To clarify: yes, in theory one could put such code into swresample/x86,
swresample/aarch64, and a ton of other architecture directories (which some
arches actually lack). Such a diff is far larger and more brittle: I can't
even test things like mips, and looking up the manuals for each and every
architecture to find out when and how its fma equivalent is available is a
pain in the neck. ISO C provides a mechanism, albeit with build-time rather
than runtime detection. This patch thus gives benefits with minimal scope
for regressions. Unless others show where and how fma detection can be done
for all arches (aarch64, arm, mips, powerpc, itanium, etc. in addition to
x86-64), I view your idea as future work.

>
>>
>> Ronald
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel