Re: [FFmpeg-devel] [PATCH] avutil/mathematics: speed up av_gcd by using Stein's binary GCD algorithm

Ganesh Ajjanagadde Sat, 10 Oct 2015 17:31:22 -0700

On Sat, Oct 10, 2015 at 8:13 PM, Michael Niedermayer
<mich...@niedermayer.cc> wrote:
> On Sat, Oct 10, 2015 at 07:45:17PM -0400, Ganesh Ajjanagadde wrote:
>> On Sat, Oct 10, 2015 at 7:14 PM, Michael Niedermayer
>> <mich...@niedermayer.cc> wrote:
>> > On Sat, Oct 10, 2015 at 11:32:06PM +0200, Henrik Gramner wrote:
>> >> On Sat, Oct 10, 2015 at 11:06 PM, Ganesh Ajjanagadde
>> >> <gajjanaga...@gmail.com> wrote:
>> >> > This uses Stein's binary GCD algorithm:
>> >> > https://en.wikipedia.org/wiki/Binary_GCD_algorithm
>> >> > to get a reported 1.7-2.1x speedup over Euclidean GCD on standard 
>> >> > architectures.
>> >> > Have not benchmarked, so I can't comment
>> >>
>> >> Before you submit a patch that's supposed to make something faster,
>> >> you should benchmark it to verify that it is in fact faster. Do this
>> >> with inputs of various sizes on both 32- and 64-bit architectures and
>> >> both with and without compilers that support __builtin_ctzll(v).
>> >
>> > without __builtin_ctzll() the old code seems faster in a simple test
>> > like:
>> > make -j12 && ./ffmpeg -i matrixbench_mpeg2.mpg test.mov
>>
>> >
>> > also it seems a simple
>> >     while (u != v) {
>> >         if (u > v)
>> >             FFSWAP(int64_t, v, u);
>> >         v -= u;
>> >         do {
>> >             v >>= 1;
>> >         } while(!(v & 1));
>> >     }
>> > is faster than the non builtin ctzll in its place but still not as
>> > fast as the current code
>>
>> So I checked out the various implementation of (non built in) ctzll at
>> https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightBinSearch.
>> The funny thing is, among the looping ones, the fastest is the simplest :):
>> static inline int ff_ctzll_c(long long v)
>> {
>>     int c = 0;
>>     while (!(v & 1)) {
>>         c++;
>>         v >>= 1;
>>     }
>>     return c;
>> }
>
> its possible gcc/clang recognizes this and replaces it by the
> instruction for the builtin
> this is perfectly fine when such an instruction is available
> but when not this might be slower than what we would expect from
> just testing on architectures with such instructions


I doubt it replaces it with the builtin; it is significantly slower
than the builtin (and in fact makes gcd (on random numbers)) and going
to my earlier numbers: 25 seconds (old) vs 22 seconds (with this) vs
6-7 seconds (for builtin).

>
>
>>
>> The De Bruijn based approach is intriguing:
>> https://graphics.stanford.edu/~seander/bithacks.html#ZerosOnRightMultLookup,
>> since there is no loop there, and assuming multiplies are fast, it
>> should be good.
>
>
>>
>> I guess one could in principle simply write the asm instruction, but I
>> can't do that with my current knowledge.
>
> this would only work on architectures that do have such
> instruction, i dont know to which non x86 architectures this applies

The problem with coming up with a non builtin ctzll is that I think
which method works best is highly architecture dependent. For
instance, your lookup table approach is not good from a cache
perspective, though it has other benefits.

I don't know of a clean solution, and I lack the
hardware/configurations to do performance comparisons across
architectures. There are a few alternatives:
1. Drop this patch - I think that is a slight shame.
2. Do some ifdefry to enable this version of av_gcd only on things
having a built in, else use the standard Euclidean version. This way
we have no performance regression across any platform.
3. Find some version of non builtin ctzll that works well across all
configurations we care about with no performance regression. This
might be useful work, as we can then apply that knowledge to improve
av_ctz as well (assuming 32 bit behaves similarly). However, I will be
of limited use here due to my lack of configurations/hardware and lack
of machine architecture knowledge.

>
> [...]
>
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> If a bugfix only changes things apparently unrelated to the bug with no
> further explanation, that is a good sign that the bugfix is wrong.
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> http://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
http://ffmpeg.org/mailman/listinfo/ffmpeg-devel

Re: [FFmpeg-devel] [PATCH] avutil/mathematics: speed up av_gcd by using Stein's binary GCD algorithm

Reply via email to