On 25/02/15 2:36 PM, Ronald S. Bultje wrote: > Hi James, > > On Wed, Feb 25, 2015 at 12:25 PM, James Almer <jamr...@gmail.com> wrote: > >> On 25/02/15 9:41 AM, Ronald S. Bultje wrote: >>> Hi, >>> >>> On Tue, Feb 24, 2015 at 8:05 PM, James Almer <jamr...@gmail.com> wrote: >>>> >>>> +#if HAVE_FAST_POPCNT >>>> +#if AV_GCC_VERSION_AT_LEAST(4,5) >>>> +#ifndef av_popcount >>>> + #define av_popcount __builtin_popcount >>>> +#endif /* av_popcount */ >>>> +#if HAVE_FAST_64BIT >>>> +#ifndef av_popcount64 >>>> + #define av_popcount64 __builtin_popcountll >>>> +#endif /* av_popcount64 */ >>>> +#endif /* HAVE_FAST_64BIT */ >>>> +#endif /* AV_GCC_VERSION_AT_LEAST(4,5) */ >>>> +#endif /* HAVE_FAST_POPCNT */ >>>> >>> >>> Is this just to get the sse4 popcnt instruction if we compile with >>> -mcpu=sse4? The slightly odd thing is that we're using a built-in, yet >>> configure still does an arch/cpu check. I'd expect the built-in/compiler >> to >>> do that for us based on -mcpu, and we could always unconditionally use >> this >>> (as long as gcc >= 4.5); alternatively, you could use inline asm and then >>> have the configure check (HAVE_FAST_POPCNT). But doing both seems a >> little >>> odd. I have no objection to it, patch is still fine, just odd. >>> >>> Ronald >> >> I purposely made the checks for gcc 4.5 and in configure for cpus that >> support popcnt >> because otherwise __builtin_popcount (at least gcc's) is slower than our >> generic >> av_popcount_c function from lavu/common.h. >> When the CPU supports popcnt the builtin becomes a single inlined >> instruction. >> >> I tried the __asm__ approach, but the code generated by the builtin seemed >> better. > > > That's interesting, can you show the code you tried? > > Ronald
If instead of builtins i use +#if HAVE_INLINE_ASM + +#ifdef __POPCNT__ + +#define av_popcount av_popcount_abm +static av_always_inline av_const int av_popcount_abm(uint32_t a) +{ + int x; + + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); + return x; +} + +#if ARCH_X86_64 +#define av_popcount64 av_popcount64_abm +static av_always_inline av_const int av_popcount64_abm(uint64_t a) +{ + uint64_t x; + + __asm__ ("popcnt %1, %0" : "=r" (x) : "rm" (a)); + return x; +} +#endif + +#endif /* __POPCNT__ */ + +#endif /* HAVE_INLINE_ASM */ in the x86/ header from the second version i sent, on x86_32 av_cpu_count from libavutil/cpu.o on Windows (Will not work with other OSes because of HAVE_GETPROCESSAFFINITYMASK) generates 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] 1b5: 31 c0 xor eax,eax 1b7: f3 0f b8 c0 popcnt eax,eax 1bb: 01 c3 add ebx,eax Whereas with the builtins i instead get 1af: f3 0f b8 5c 24 18 popcnt ebx,DWORD PTR [esp+0x18] This is probably because of av_popcount64_c (Used unconditionally on x86_32) calling av_popcount twice, and proc_aff being DWORD_PTR type means it's 4 bytes in x86_32. The builtin seems to know proc_aff can't be right shifted 32 bits so it generates a single popcnt. This is probably a rare case since av_popcount64 is not normally called with a 32bits argument, but the builtin handled it better. If the argument used with av_popcount64 is effectively 64bits then both the builtin and the above asm seem to generate the same code. _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org http://ffmpeg.org/mailman/listinfo/ffmpeg-devel