Hi, On Wed, Feb 25, 2015 at 1:21 PM, James Almer <jamr...@gmail.com> wrote:
> On 25/02/15 2:58 PM, Ronald S. Bultje wrote: > > Hi, > > > > On Wed, Feb 25, 2015 at 12:52 PM, James Almer <jamr...@gmail.com> wrote: > > > >> On 25/02/15 2:36 PM, Ronald S. Bultje wrote: > >>> Hi James, > >>> > >>> On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamr...@gmail.com> > wrote: > >>> > >>>> On 25/02/15 9:41 AM, Ronald S. Bultje wrote: > >>>>> Hi, > >>>>> > >>>>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamr...@gmail.com> > >> wrote: > >>>>>> > >>>>>> +#if HAVE_FAST_POPCNT > >>>>>> +#if AV_GCC_VERSION_AT_LEAST(4,5) > >>>>>> +#ifndef av_popcount > >>>>>> + #define av_popcount __builtin_popcount > >>>>>> +#endif /* av_popcount */ > >>>>>> +#if HAVE_FAST_64BIT > >>>>>> +#ifndef av_popcount64 > >>>>>> + #define av_popcount64 __builtin_popcountll > >>>>>> +#endif /* av_popcount64 */ > >>>>>> +#endif /* HAVE_FAST_64BIT */ > >>>>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */ > >>>>>> +#endif /* HAVE_FAST_POPCNT */ > >>>>>> > >>>>> > >>>>> Is this just to get the sse4 popcnt instruction if we compile with > >>>>> -mcpu=sse4? The slightly odd thing is that we're using a built-in, > yet > >>>>> configure still does an arch/cpu check. I'd expect the > >> built-in/compiler > >>>> to > >>>>> do that for us based on -mcpu, and we could always unconditionally > use > >>>> this > >>>>> (as long as gcc >= 4.5); alternatively, you could use inline asm and > >> then > >>>>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a > >>>> little > >>>>> odd. I have no objection to it, patch is still fine, just odd. > >>>>> > >>>>> Ronald > >>>> > >>>> I purposely made the checks for gcc 4.5 and in configure for cpus that > >>>> support popcnt > >>>> because otherwise __builtin_popcount (at least gcc's) is slower than > our > >>>> generic > >>>> av_popcount_c function from lavu/common.h. > >>>> When the CPU supports popcnt the builtin becomes a single inlined > >>>> instruction. > >>>> > >>>> I tried the __asm__ approach, but the code generated by the builtin > >> seemed > >>>> better. > >>> > >>> > >>> That's interesting, can you show the code you tried? > >>> > >>> Ronald > >> > >> If instead of builtins i use > >> > >> +#if HAVE_INLINE_ASM > >> + > >> +#ifdef __POPCNT__ > >> + > >> +#define av_popcount av_popcount_abm > >> +static av_always_inline av_const int av_popcount_abm(uint32_t a) > >> +{ > >> + int x; > >> + > >> + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); > >> + return x; > >> +} > >> + > >> +#if ARCH_X86_64 > >> +#define av_popcount64 av_popcount64_abm > >> +static av_always_inline av_const int av_popcount64_abm(uint64_t a) > >> +{ > >> + uint64_t x; > >> + > >> + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); > >> + return x; > >> +} > >> +#endif > >> + > >> +#endif /* __POPCNT__ */ > >> + > >> +#endif /* HAVE_INLINE_ASM */ > >> > >> in the x86/ header from the second version i sent, on x86_32 > av_cpu_count > >> from > >> libavutil/cpu.o on Windows (Will not work with other OSes because of > >> HAVE_GETPROCESSAFFINITYMASK) generates > >> > >> 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] > >> 1b5: 31 c0 xor eax,eax > >> 1b7: f3 0f b8 c0 popcnt eax,eax > >> 1bb: 01 c3 add ebx,eax > >> > >> Whereas with the builtins i instead get > >> > >> 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] > >> > >> This is probably because of av_popcount64_c (Used unconditionally on > >> x86_32) calling > >> av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes > in > >> x86_32. > >> The builtin seems to know proc_aff can't be right shifted 32 bits so it > >> generates > >> a single popcnt. > > > > > > It seems to me rather than the second version is defined as "pure" (or > > whatever the keyword was), i.e. "we don't depend on external state" -> > "if > > you call this with a constant, we can calculate this at compile-time". > > > > Ronald > > Do you know how to get the asm version working right for this one case? > No, const should do it but looks like inline asm breaks that assumption. > Otherwise i think the second version with builtins i sent is good to > apply. Much > cleaner than the configure approach to be honest. OK. Ronald _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel