On 03/30/2015 10:50, John Baldwin wrote:
> On Sunday, March 22, 2015 09:41:53 AM Bruce Evans wrote:
>> On Sat, 21 Mar 2015, John Baldwin wrote:
>>
>>> On 3/21/15 12:35 PM, Konstantin Belousov wrote:
>>>> On Sat, Mar 21, 2015 at 12:04:41PM -0400, John Baldwin wrote:
>>>>> On 3/20/15 9:02 AM, Konstantin Belousov wrote:
>>>>>> On Fri, Mar 20, 2015 at 10:27:06AM +0000, John Baldwin wrote:
>>>>>>> Author: jhb
>>>>>>> Date: Fri Mar 20 10:27:06 2015
>>>>>>> New Revision: 280279
>>>>>>> URL: https://svnweb.freebsd.org/changeset/base/280279
>>>>>>>
>>>>>>> Log:
>>>>>>>   Expand the bitcount* API to support 64-bit integers, plain ints and longs
>>>>>>>   and create a "hidden" API that can be used in other system headers without
>>>>>>>   adding namespace pollution.
>>>>>>>   - If the POPCNT instruction is enabled at compile time, use
>>>>>>>     __builtin_popcount*() to implement __bitcount*(), otherwise fall back
>>>>>>>     to software implementations.
>>>>>> Are you aware of the Haswell errata HSD146?  I see the described behaviour
>>>>>> on machines back to SandyBridge, but not on Nehalems.
>>>>>>
>>>>>> HSD146.  POPCNT Instruction May Take Longer to Execute Than Expected
>>>>>> Problem: POPCNT instruction execution with a 32 or 64 bit operand may be
>>>>>> delayed until previous non-dependent instructions have executed.
>>>>>>
>>>>>> Jilles noted that gcc head and 4.9.2 already provide a workaround by
>>>>>> xoring the dst register.  I have some patch for amd64 pmap, see the end
>>>>>> of the message.
>>>>> No, I was not aware, but I think it's hard to fix this anywhere but the
>>>>> compiler.  I set CPUTYPE in src.conf on my Ivy Bridge desktop and clang
>>>>> uses POPCNT for this function from ACPI-CA:
>>>>>
>>>>> static UINT8
>>>>> AcpiRsCountSetBits (
>>>>>     UINT16                  BitField)
>>>>> {
>>>>>     UINT8                   BitsSet;
>>>>>
>>>>>
>>>>>     ACPI_FUNCTION_ENTRY ();
>>>>>
>>>>>
>>>>>     for (BitsSet = 0; BitField; BitsSet++)
>>>>>     {
>>>>>         /* Zero the least significant bit that is set */
>>>>>
>>>>>         BitField &= (UINT16) (BitField - 1);
>>>>>     }
>>>>>
>>>>>     return (BitsSet);
>>>>> }
>>>>>
>>>>> (I ran into this accidentally because a kernel built on my system failed
>>>>> to boot in older qemu because the kernel panicked with an illegal
>>>>> instruction fault in this function.)
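As an aside, the compile-time selection that the commit log describes amounts
to something like the sketch below.  This is not the literal sys/types.h text,
just the shape of it: __POPCNT__ is the macro gcc and clang define when POPCNT
is enabled at compile time, and the software fallback is the usual SWAR
reduction rather than a bit-clearing loop like AcpiRsCountSetBits() above.

/*
 * Sketch only (not the committed header text): select __bitcount64() at
 * compile time.  __POPCNT__ is defined by gcc/clang when the POPCNT
 * instruction is enabled; otherwise use a portable SWAR count.
 */
#include <stdint.h>

#if defined(__POPCNT__)
#define	__bitcount64(x)	__builtin_popcountll((unsigned long long)(x))
#else
static __inline int
__bitcount64(uint64_t _x)
{

	_x = _x - ((_x >> 1) & 0x5555555555555555ULL);
	_x = (_x & 0x3333333333333333ULL) +
	    ((_x >> 2) & 0x3333333333333333ULL);
	_x = (_x + (_x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	return ((_x * 0x0101010101010101ULL) >> 56);
}
#endif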
>> Does it do the same for the similar home made popcount in pmap?:
> Yes:
>
> ffffffff807658d4: f6 04 25 46 e2 d6 80    testb  $0x80,0xffffffff80d6e246
> ffffffff807658db: 80
> ffffffff807658dc: 74 32                   je     ffffffff80765910 <pmap_demote_pde_locked+0x4d0>
> ffffffff807658de: 48 89 4d b8             mov    %rcx,-0x48(%rbp)
> ffffffff807658e2: f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
> ffffffff807658e8: 48 8b 50 20             mov    0x20(%rax),%rdx
> ffffffff807658ec: 48 89 55 b0             mov    %rdx,-0x50(%rbp)
> ffffffff807658f0: f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
> ffffffff807658f6: 01 ca                   add    %ecx,%edx
> ffffffff807658f8: 48 8b 48 28             mov    0x28(%rax),%rcx
> ffffffff807658fc: 48 89 4d a8             mov    %rcx,-0x58(%rbp)
> ffffffff80765900: f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
> ffffffff80765906: eb 1b                   jmp    ffffffff80765923 <pmap_demote_pde_locked+0x4e3>
> ffffffff80765908: 0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
> ffffffff8076590f: 00
> ffffffff80765910: f3 48 0f b8 c9          popcnt %rcx,%rcx
> ffffffff80765915: f3 48 0f b8 50 20       popcnt 0x20(%rax),%rdx
> ffffffff8076591b: 01 ca                   add    %ecx,%edx
> ffffffff8076591d: f3 48 0f b8 48 28       popcnt 0x28(%rax),%rcx
> ffffffff80765923: 01 d1                   add    %edx,%ecx
>
> It also uses popcnt for this in blist_fill() and blist_meta_fill():
>
>  742         /* Count the number of blocks we're about to allocate */
>  743         bitmap = scan->u.bmu_bitmap & mask;
>  744         for (nblks = 0; bitmap != 0; nblks++)
>  745                 bitmap &= bitmap - 1;
>
>> Always using new API would lose the micro-optimizations given by the runtime
>> decision for default CFLAGS (used by distributions for portability).  To
>> keep them, it seems best to keep the inline asm but replace
>> popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
>> be used to work around slowness in the software (that is actually
>> hardware) case.
> I'm not sure if bitcount64() is strictly better than the loop in this case
> even though it is O(1) given the claimed nature of the values in the comment.
>
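To make Bruce's suggestion concrete, here is roughly the shape I think he has
in mind (a sketch only: the helper name is mine, and I am assuming the usual
amd64 cpu_feature2/CPUID2_POPCNT feature test).  The xor of the destination
register mirrors the workaround gcc emits for HSD146.

#include <sys/types.h>
#include <machine/md_var.h>	/* cpu_feature2 (assumed location) */
#include <machine/specialreg.h>	/* CPUID2_POPCNT */

/* Hypothetical helper name; not the existing popcnt_pc_map_elem(). */
static __inline u_long
pc_map_popcnt(uint64_t elem)
{
	u_long count;

	if ((cpu_feature2 & CPUID2_POPCNT) != 0) {
		/*
		 * Hardware path: zero the destination first so POPCNT does
		 * not stall on a false output dependency (HSD146).
		 */
		__asm __volatile("xorl %k0,%k0; popcntq %1,%0"
		    : "=&r" (count) : "rm" (elem));
	} else {
		/* Software path: the new hidden API instead of a local loop. */
		count = __bitcount64(elem);
	}
	return (count);
}

That would keep the runtime micro-optimization for default-CFLAGS kernels
while letting the non-POPCNT path share __bitcount64().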
I checked.  Even with zeroes being more common than ones, bitcount64() is
faster than the simple loop.  Using bitcount64, reserve_pv_entries() takes on
average 4265 cycles during "buildworld" on my test machine.  In contrast, with
the simple loop, it takes on average 4507 cycles.  Even though bitcount64 is a
lot larger than the simple loop, we do the 3 bit count operations many times
in a loop, so the extra i-cache misses are being made up for by the repeated
execution of the faster code.

However, in the popcnt case, we are spilling the bit map to memory in order to
popcnt it.  That's rather silly:

  3570: 48 8b 48 18             mov    0x18(%rax),%rcx
  3574: f6 04 25 00 00 00 00    testb  $0x80,0x0
  357b: 80
  357c: 74 42                   je     35c0 <pmap_demote_pde_locked+0x2f0>
  357e: 48 89 4d b8             mov    %rcx,-0x48(%rbp)
  3582: 31 c9                   xor    %ecx,%ecx
  3584: f3 48 0f b8 4d b8       popcnt -0x48(%rbp),%rcx
  358a: 48 8b 50 20             mov    0x20(%rax),%rdx
  358e: 48 89 55 b0             mov    %rdx,-0x50(%rbp)
  3592: 31 d2                   xor    %edx,%edx
  3594: f3 48 0f b8 55 b0       popcnt -0x50(%rbp),%rdx
  359a: 01 ca                   add    %ecx,%edx
  359c: 48 8b 48 28             mov    0x28(%rax),%rcx
  35a0: 48 89 4d a8             mov    %rcx,-0x58(%rbp)
  35a4: 31 c9                   xor    %ecx,%ecx
  35a6: f3 48 0f b8 4d a8       popcnt -0x58(%rbp),%rcx
  35ac: 01 d1                   add    %edx,%ecx
  35ae: e9 12 01 00 00          jmpq   36c5 <pmap_demote_pde_locked+0x3f5>

Caveat: I'm still using clang 3.5.  Maybe the newer clang doesn't spill?
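For anyone who wants to reproduce the comparison in userland, a toy harness
looks roughly like the one below.  It is not the reserve_pv_entries()
instrumentation quoted above: it assumes a world built with r280279 so that
<sys/types.h> provides __bitcount64(), and loop_count() is just my stand-in
for the simple bit-clearing loop.

#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static unsigned
loop_count(uint64_t x)
{
	unsigned n;

	/* Same idea as AcpiRsCountSetBits(): clear the lowest set bit. */
	for (n = 0; x != 0; n++)
		x &= x - 1;
	return (n);
}

int
main(void)
{
	static uint64_t words[4096];
	uint64_t sum;
	int i, pass;

	/* Mostly-zero inputs: at most two set bits per word. */
	srandom(1);
	for (i = 0; i < 4096; i++)
		words[i] = (1ULL << (random() % 64)) |
		    (1ULL << (random() % 64));

	sum = 0;
	for (pass = 0; pass < 100000; pass++)
		for (i = 0; i < 4096; i++)
#ifdef USE_LOOP
			sum += loop_count(words[i]);
#else
			sum += __bitcount64(words[i]);
#endif
	printf("%ju\n", (uintmax_t)sum);
	return (0);
}

Build it once plain and once with -DUSE_LOOP (and with or without -mpopcnt)
and compare with time(1); the printed sum is only there to keep the compiler
from throwing the work away.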