On Sat, Oct 10, 2015 at 8:34 PM, Ganesh Ajjanagadde <gajja...@mit.edu> wrote: > On Sat, Oct 10, 2015 at 8:22 PM, Michael Niedermayer > <mich...@niedermayer.cc> wrote: >> On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote: >>> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote: >>> > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer >>> > <mich...@niedermayer.cc> wrote: >>> > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote: >>> > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde >>> > >> <gajjanaga...@gmail.com> wrote: >>> > >> > This uses Stein's binary GCD algorithm: >>> > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm >>> > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard >>> > >> > architectures. >>> > >> > Have not benchmarked, so I can't comment >>> > >> >>> > >> Before you submit a patch that's supposed to make something faster, >>> > >> you should benchmark it to verify that it is in fact faster. Do this >>> > >> with inputs of various sizes on both 32- and 64-bit architectures and >>> > >> both with and without compilers that support __builtin_ctzll(v). >>> > > >>> > > without __builtin_ctzll() the old code seems faster in a simple test >>> > > like: >>> > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov >>> > >>> > > >>> > > also it seems a simple >>> > > while (u != v) { >>> > > if (u > v) >>> > > FFSWAP(int64_t, v, u); >>> > > v -= u; >>> > > do { >>> > > v >>= 1; >>> > > } while(!(v & 1)); >>> > > } >>> > > is faster than the non builtin ctzll in its place but still not as >>> > > fast as the current code >>> > >>> > So I checked out the various implementation of (non built in) ctzll at >>> > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch. >>> > The funny thing is, among the looping ones, the fastest is the simplest >>> > :): >>> > static inline int ff_ctzll_c(long long v) >>> > { >>> > int c = 0; >>> > while (!(v & 1)) { >>> > c++; >>> > v >>= 1; >>> > } >>> > return c; >>> > } >> >> tested this with the matrixbench_mpeg2 ->mov test and it seems >> 20% or so slower than the LUT based version > > If the LUT seems to be a good idea, I think the De Bruijn one should > be even better - no branching, only a multiply. and some shifts etc > Cache effects should be similar for the LUT and the De Bruijn, perhaps > slightly in favor of De Bruijn due to a smaller table (64 vs 256 > entries). > > Most places I see only the 32 bit version, so I can't quickly check this.
For reference: http://supertech.csail.mit.edu/papers/debruijn.pdf. > >> >> >>> >>> its possible gcc/clang recognizes this and replaces it by the >>> instruction for the builtin >>> this is perfectly fine when such an instruction is available >>> but when not this might be slower than what we would expect from >>> just testing on architectures with such instructions >> >> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 >> fails to replace the loop by anything smart >> >> you can use >> make libavutil/mathematics.s >> >> to compile the C code to asm and see how the functions get compiled >> >> [...] >> >> -- >> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB >> >> What does censorship reveal? It reveals fear. -- Julian Assange >> >> _______________________________________________ >> ffmpeg-devel mailing list >> ffmpeg-devel@ffmpeg.org >> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel >> _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel