On 25/02/15 2:58 PM, Ronald S. Bultje wrote: > Hi, > > On Wed, Feb 25, 2015 at 12:52 PM, James Almer <jamr...@gmail.com> wrote: > >> On 25/02/15 2:36 PM, Ronald S. Bultje wrote: >>> Hi James, >>> >>> On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamr...@gmail.com> wrote: >>> >>>> On 25/02/15 9:41 AM, Ronald S. Bultje wrote: >>>>> Hi, >>>>> >>>>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamr...@gmail.com> >> wrote: >>>>>> >>>>>> +#if HAVE_FAST_POPCNT >>>>>> +#if AV_GCC_VERSION_AT_LEAST(4,5) >>>>>> +#ifndef av_popcount >>>>>> + #define av_popcount __builtin_popcount >>>>>> +#endif /* av_popcount */ >>>>>> +#if HAVE_FAST_64BIT >>>>>> +#ifndef av_popcount64 >>>>>> + #define av_popcount64 __builtin_popcountll >>>>>> +#endif /* av_popcount64 */ >>>>>> +#endif /* HAVE_FAST_64BIT */ >>>>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */ >>>>>> +#endif /* HAVE_FAST_POPCNT */ >>>>>> >>>>> >>>>> Is this just to get the sse4 popcnt instruction if we compile with >>>>> -mcpu=sse4? The slightly odd thing is that we're using a built-in, yet >>>>> configure still does an arch/cpu check. I'd expect the >> built-in/compiler >>>> to >>>>> do that for us based on -mcpu, and we could always unconditionally use >>>> this >>>>> (as long as gcc >= 4.5); alternatively, you could use inline asm and >> then >>>>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a >>>> little >>>>> odd. I have no objection to it, patch is still fine, just odd. >>>>> >>>>> Ronald >>>> >>>> I purposely made the checks for gcc 4.5 and in configure for cpus that >>>> support popcnt >>>> because otherwise __builtin_popcount (at least gcc's) is slower than our >>>> generic >>>> av_popcount_c function from lavu/common.h. >>>> When the CPU supports popcnt the builtin becomes a single inlined >>>> instruction. >>>> >>>> I tried the __asm__ approach, but the code generated by the builtin >> seemed >>>> better. >>> >>> >>> That's interesting, can you show the code you tried? >>> >>> Ronald >> >> If instead of builtins i use >> >> +#if HAVE_INLINE_ASM >> + >> +#ifdef __POPCNT__ >> + >> +#define av_popcount av_popcount_abm >> +static av_always_inline av_const int av_popcount_abm(uint32_t a) >> +{ >> + int x; >> + >> + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); >> + return x; >> +} >> + >> +#if ARCH_X86_64 >> +#define av_popcount64 av_popcount64_abm >> +static av_always_inline av_const int av_popcount64_abm(uint64_t a) >> +{ >> + uint64_t x; >> + >> + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); >> + return x; >> +} >> +#endif >> + >> +#endif /* __POPCNT__ */ >> + >> +#endif /* HAVE_INLINE_ASM */ >> >> in the x86/ header from the second version i sent, on x86_32 av_cpu_count >> from >> libavutil/cpu.o on Windows (Will not work with other OSes because of >> HAVE_GETPROCESSAFFINITYMASK) generates >> >> 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] >> 1b5: 31 c0 xor eax,eax >> 1b7: f3 0f b8 c0 popcnt eax,eax >> 1bb: 01 c3 add ebx,eax >> >> Whereas with the builtins i instead get >> >> 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] >> >> This is probably because of av_popcount64_c (Used unconditionally on >> x86_32) calling >> av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes in >> x86_32. >> The builtin seems to know proc_aff can't be right shifted 32 bits so it >> generates >> a single popcnt. > > > It seems to me rather than the second version is defined as "pure" (or > whatever the keyword was), i.e. "we don't depend on external state" -> "if > you call this with a constant, we can calculate this at compile-time". > > Ronald
Do you know how to get the asm version working right for this one case? Otherwise i think the second version with builtins i sent is good to apply. Much cleaner than the configure approach to be honest. Someone else can change it to add the relevant clang check/s because i can't test that. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel