Hi, On Wed, Feb 25, 2015 at 12:52 PM, James Almer <jamr...@gmail.com> wrote:
> On 25/02/15 2:36 PM, Ronald S. Bultje wrote: > > Hi James, > > > > On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamr...@gmail.com> wrote: > > > >> On 25/02/15 9:41 AM, Ronald S. Bultje wrote: > >>> Hi, > >>> > >>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamr...@gmail.com> > wrote: > >>>> > >>>> +#if HAVE_FAST_POPCNT > >>>> +#if AV_GCC_VERSION_AT_LEAST(4,5) > >>>> +#ifndef av_popcount > >>>> + #define av_popcount __builtin_popcount > >>>> +#endif /* av_popcount */ > >>>> +#if HAVE_FAST_64BIT > >>>> +#ifndef av_popcount64 > >>>> + #define av_popcount64 __builtin_popcountll > >>>> +#endif /* av_popcount64 */ > >>>> +#endif /* HAVE_FAST_64BIT */ > >>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */ > >>>> +#endif /* HAVE_FAST_POPCNT */ > >>>> > >>> > >>> Is this just to get the sse4 popcnt instruction if we compile with > >>> -mcpu=sse4? The slightly odd thing is that we're using a built-in, yet > >>> configure still does an arch/cpu check. I'd expect the > built-in/compiler > >> to > >>> do that for us based on -mcpu, and we could always unconditionally use > >> this > >>> (as long as gcc >= 4.5); alternatively, you could use inline asm and > then > >>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a > >> little > >>> odd. I have no objection to it, patch is still fine, just odd. > >>> > >>> Ronald > >> > >> I purposely made the checks for gcc 4.5 and in configure for cpus that > >> support popcnt > >> because otherwise __builtin_popcount (at least gcc's) is slower than our > >> generic > >> av_popcount_c function from lavu/common.h. > >> When the CPU supports popcnt the builtin becomes a single inlined > >> instruction. > >> > >> I tried the __asm__ approach, but the code generated by the builtin > seemed > >> better. > > > > > > That's interesting, can you show the code you tried? > > > > Ronald > > If instead of builtins i use > > +#if HAVE_INLINE_ASM > + > +#ifdef __POPCNT__ > + > +#define av_popcount av_popcount_abm > +static av_always_inline av_const int av_popcount_abm(uint32_t a) > +{ > + int x; > + > + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); > + return x; > +} > + > +#if ARCH_X86_64 > +#define av_popcount64 av_popcount64_abm > +static av_always_inline av_const int av_popcount64_abm(uint64_t a) > +{ > + uint64_t x; > + > + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); > + return x; > +} > +#endif > + > +#endif /* __POPCNT__ */ > + > +#endif /* HAVE_INLINE_ASM */ > > in the x86/ header from the second version i sent, on x86_32 av_cpu_count > from > libavutil/cpu.o on Windows (Will not work with other OSes because of > HAVE_GETPROCESSAFFINITYMASK) generates > > 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] > 1b5: 31 c0 xor eax,eax > 1b7: f3 0f b8 c0 popcnt eax,eax > 1bb: 01 c3 add ebx,eax > > Whereas with the builtins i instead get > > 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] > > This is probably because of av_popcount64_c (Used unconditionally on > x86_32) calling > av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes in > x86_32. > The builtin seems to know proc_aff can't be right shifted 32 bits so it > generates > a single popcnt. It seems to me rather than the second version is defined as "pure" (or whatever the keyword was), i.e. "we don't depend on external state" -> "if you call this with a constant, we can calculate this at compile-time". Ronald _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel