On Sat, Oct 10, 2015 at 8:22 PM, Michael Niedermayer <mich...@niedermayer.cc> wrote: > On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote: >> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote: >> > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer >> > <mich...@niedermayer.cc> wrote: >> > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote: >> > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde >> > >> <gajjanaga...@gmail.com> wrote: >> > >> > This uses Stein's binary GCD algorithm: >> > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm >> > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard >> > >> > architectures. >> > >> > Have not benchmarked, so I can't comment >> > >> >> > >> Before you submit a patch that's supposed to make something faster, >> > >> you should benchmark it to verify that it is in fact faster. Do this >> > >> with inputs of various sizes on both 32- and 64-bit architectures and >> > >> both with and without compilers that support __builtin_ctzll(v). >> > > >> > > without __builtin_ctzll() the old code seems faster in a simple test >> > > like: >> > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov >> > >> > > >> > > also it seems a simple >> > > while (u != v) { >> > > if (u > v) >> > > FFSWAP(int64_t, v, u); >> > > v -= u; >> > > do { >> > > v >>= 1; >> > > } while(!(v & 1)); >> > > } >> > > is faster than the non builtin ctzll in its place but still not as >> > > fast as the current code >> > >> > So I checked out the various implementation of (non built in) ctzll at >> > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch. >> > The funny thing is, among the looping ones, the fastest is the simplest :): >> > static inline int ff_ctzll_c(long long v) >> > { >> > int c = 0; >> > while (!(v & 1)) { >> > c++; >> > v >>= 1; >> > } >> > return c; >> > } > > tested this with the matrixbench_mpeg2 ->mov test and it seems > 20% or so slower than the LUT based version
If the LUT seems to be a good idea, I think the De Bruijn one should be even better - no branching, only a multiply. and some shifts etc Cache effects should be similar for the LUT and the De Bruijn, perhaps slightly in favor of De Bruijn due to a smaller table (64 vs 256 entries). Most places I see only the 32 bit version, so I can't quickly check this. > > >> >> its possible gcc/clang recognizes this and replaces it by the >> instruction for the builtin >> this is perfectly fine when such an instruction is available >> but when not this might be slower than what we would expect from >> just testing on architectures with such instructions > > gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 > fails to replace the loop by anything smart > > you can use > make libavutil/mathematics.s > > to compile the C code to asm and see how the functions get compiled > > [...] > > -- > Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB > > What does censorship reveal? It reveals fear. -- Julian Assange > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel