On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote: > On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote: > > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer > > <mich...@niedermayer.cc> wrote: > > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote: > > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde > > >> <gajjanaga...@gmail.com> wrote: > > >> > This uses Stein's binary GCD algorithm: > > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm > > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard > > >> > architectures. > > >> > Have not benchmarked, so I can't comment > > >> > > >> Before you submit a patch that's supposed to make something faster, > > >> you should benchmark it to verify that it is in fact faster. Do this > > >> with inputs of various sizes on both 32- and 64-bit architectures and > > >> both with and without compilers that support __builtin_ctzll(v). > > > > > > without __builtin_ctzll() the old code seems faster in a simple test > > > like: > > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov > > > > > > > > also it seems a simple > > > while (u != v) { > > > if (u > v) > > > FFSWAP(int64_t, v, u); > > > v -= u; > > > do { > > > v >>= 1; > > > } while(!(v & 1)); > > > } > > > is faster than the non builtin ctzll in its place but still not as > > > fast as the current code > > > > So I checked out the various implementation of (non built in) ctzll at > > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch. > > The funny thing is, among the looping ones, the fastest is the simplest :): > > static inline int ff_ctzll_c(long long v) > > { > > int c = 0; > > while (!(v & 1)) { > > c++; > > v >>= 1; > > } > > return c; > > }
tested this with the matrixbench_mpeg2 ->mov test and it seems 20% or so slower than the LUT based version > > its possible gcc/clang recognizes this and replaces it by the > instruction for the builtin > this is perfectly fine when such an instruction is available > but when not this might be slower than what we would expect from > just testing on architectures with such instructions gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 fails to replace the loop by anything smart you can use make libavutil/mathematics.s to compile the C code to asm and see how the functions get compiled [...] -- Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB What does censorship reveal? It reveals fear. -- Julian Assange
signature.asc
Description: Digital signature
_______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel