On Sat, Oct 10, 2015 at 8:13 PM, Michael Niedermayer <mich...@niedermayer.cc> wrote: > On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote: >> On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer >> <mich...@niedermayer.cc> wrote: >> > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote: >> >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde >> >> <gajjanaga...@gmail.com> wrote: >> >> > This uses Stein's binary GCD algorithm: >> >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm >> >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard >> >> > architectures. >> >> > Have not benchmarked, so I can't comment >> >> >> >> Before you submit a patch that's supposed to make something faster, >> >> you should benchmark it to verify that it is in fact faster. Do this >> >> with inputs of various sizes on both 32- and 64-bit architectures and >> >> both with and without compilers that support __builtin_ctzll(v). >> > >> > without __builtin_ctzll() the old code seems faster in a simple test >> > like: >> > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov >> >> > >> > also it seems a simple >> > while (u != v) { >> > if (u > v) >> > FFSWAP(int64_t, v, u); >> > v -= u; >> > do { >> > v >>= 1; >> > } while(!(v & 1)); >> > } >> > is faster than the non builtin ctzll in its place but still not as >> > fast as the current code >> >> So I checked out the various implementation of (non built in) ctzll at >> https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch. >> The funny thing is, among the looping ones, the fastest is the simplest :): >> static inline int ff_ctzll_c(long long v) >> { >> int c = 0; >> while (!(v & 1)) { >> c++; >> v >>= 1; >> } >> return c; >> } > > its possible gcc/clang recognizes this and replaces it by the > instruction for the builtin > this is perfectly fine when such an instruction is available > but when not this might be slower than what we would expect from > just testing on architectures with such instructions
I doubt it replaces it with the builtin; it is significantly slower than the builtin (and in fact makes gcd (on random numbers)) and going to my earlier numbers: 25 seconds (old) vs 22 seconds (with this) vs 6-7 seconds (for builtin). > > >> >> The De Bruijn based approach is intriguing: >> https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup, >> since there is no loop there, and assuming multiplies are fast, it >> should be good. > > >> >> I guess one could in principle simply write the asm instruction, but I >> can't do that with my current knowledge. > > this would only work on architectures that do have such > instruction, i dont know to which non x86 architectures this applies The problem with coming up with a non builtin ctzll is that I think which method works best is highly architecture dependent. For instance, your lookup table approach is not good from a cache perspective, though it has other benefits. I don't know of a clean solution, and I lack the hardware/configurations to do performance comparisons across architectures. There are a few alternatives: 1. Drop this patch - I think that is a slight shame. 2. Do some ifdefry to enable this version of av_gcd only on things having a built in, else use the standard Euclidean version. This way we have no performance regression across any platform. 3. Find some version of non builtin ctzll that works well across all configurations we care about with no performance regression. This might be useful work, as we can then apply that knowledge to improve av_ctz as well (assuming 32 bit behaves similarly). However, I will be of limited use here due to my lack of configurations/hardware and lack of machine architecture knowledge. > > [...] > > -- > Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB > > If a bugfix only changes things apparently unrelated to the bug with no > further explanation, that is a good sign that the bugfix is wrong. > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > http://ffmpeg.org/mailman/listinfo/ffmpeg-devel > _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel