On Sat, Oct 10, 2015 at 8:37 PM, Ganesh Ajjanagadde <gajja...@mit.edu> wrote: > On Sat, Oct 10, 2015 at 8:34 PM, Ganesh Ajjanagadde <gajja...@mit.edu> wrote: >> On Sat, Oct 10, 2015 at 8:22 PM, Michael Niedermayer >> <mich...@niedermayer.cc> wrote: >>> On Sun, Oct 11, 2015 at 02:13:59AM +0200, Michael Niedermayer wrote: >>>> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote: >>>> > On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer >>>> > <mich...@niedermayer.cc> wrote: >>>> > > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote: >>>> > >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde >>>> > >> <gajjanaga...@gmail.com> wrote: >>>> > >> > This uses Stein's binary GCD algorithm: >>>> > >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm >>>> > >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard >>>> > >> > architectures. >>>> > >> > Have not benchmarked, so I can't comment >>>> > >> >>>> > >> Before you submit a patch that's supposed to make something faster, >>>> > >> you should benchmark it to verify that it is in fact faster. Do this >>>> > >> with inputs of various sizes on both 32- and 64-bit architectures and >>>> > >> both with and without compilers that support __builtin_ctzll(v). >>>> > > >>>> > > without __builtin_ctzll() the old code seems faster in a simple test >>>> > > like: >>>> > > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov >>>> > >>>> > > >>>> > > also it seems a simple >>>> > > while (u != v) { >>>> > > if (u > v) >>>> > > FFSWAP(int64_t, v, u); >>>> > > v -= u; >>>> > > do { >>>> > > v >>= 1; >>>> > > } while(!(v & 1)); >>>> > > } >>>> > > is faster than the non builtin ctzll in its place but still not as >>>> > > fast as the current code >>>> > >>>> > So I checked out the various implementation of (non built in) ctzll at >>>> > https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch. >>>> > The funny thing is, among the looping ones, the fastest is the simplest >>>> > :): >>>> > static inline int ff_ctzll_c(long long v) >>>> > { >>>> > int c = 0; >>>> > while (!(v & 1)) { >>>> > c++; >>>> > v >>= 1; >>>> > } >>>> > return c; >>>> > } >>> >>> tested this with the matrixbench_mpeg2 ->mov test and it seems >>> 20% or so slower than the LUT based version >> >> If the LUT seems to be a good idea, I think the De Bruijn one should >> be even better - no branching, only a multiply. and some shifts etc >> Cache effects should be similar for the LUT and the De Bruijn, perhaps >> slightly in favor of De Bruijn due to a smaller table (64 vs 256 >> entries). >> >> Most places I see only the 32 bit version, so I can't quickly check this. > > For reference: http://supertech.csail.mit.edu/papers/debruijn.pdf.
De-Bruijn works well, using my random gcd bench: 22 seconds old, 11 seconds (De Bruijn), 7 seconds (built-in). I will do some FATE testing and then post updated patch. > >> >>> >>> >>>> >>>> its possible gcc/clang recognizes this and replaces it by the >>>> instruction for the builtin >>>> this is perfectly fine when such an instruction is available >>>> but when not this might be slower than what we would expect from >>>> just testing on architectures with such instructions >>> >>> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 >>> fails to replace the loop by anything smart >>> >>> you can use >>> make libavutil/mathematics.s >>> >>> to compile the C code to asm and see how the functions get compiled >>> >>> [...] >>> >>> -- >>> Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB >>> >>> What does censorship reveal? It reveals fear. -- Julian Assange >>> >>> _______________________________________________ >>> ffmpeg-devel mailing list >>> ffmpeg-devel@ffmpeg.org >>> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel >>> _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel