On Sun, Aug 14, 2016 at 07:53:40PM -0700, Linus Torvalds wrote: > On Sun, Aug 14, 2016 at 7:28 PM, Dave Chinner <da...@fromorbit.com> wrote: > >> > >> Maybe your symbol table came from a old kernel, and functions moved > >> around enough that the profile attributions ended up bogus. > > > > No, I don't think so. I don't install symbol tables on my test VMs, > > I let /proc/kallsyms do that work for me. From an strace of 'perf > > top -U -g": > > Ok. But something is definitely wrong with your profile. > > What does it say if you annotate that _raw_spin_unlock_irqrestore() function? ....
raw_spin_unlock_irqrestore /proc/kcore ¿ ¿ ¿ ¿ Disassembly of section load0: ¿ ¿ ffffffff81e628b0 <load0>: ¿ nop ¿ push %rbp ¿ mov %rsp,%rbp ¿ movb $0x0,(%rdi) ¿ nop ¿ mov %rsi,%rdi ¿ push %rdi ¿ popfq 99.35 ¿ nop ¿ decl %gs:0x7e1a9bc7(%rip) 0.65 ¿ ¿ je 25 ¿ pop %rbp ¿ ¿ retq ¿25: callq 0xffffffff81002000 ¿ pop %rbp ¿ ¿ retq > I guarantee you that no, it's not spending 41% of time in > spin_unlock_irqrestore. That just isn't a valid profile. There's > something seriously wrong somewhere. > > The fact that you then get the same profile when you turn _off_ > CONFIG_DEBUG_SPINLOCK only proves there is something going on that is > pure garbage. > > I suspect that what you did was to edit the .config file, remove > DEBUG_SPINLOCK, and then do "make oldconfig" again. Yes. > And it got turned on again, No. I'm not that stupid - I checked: $ grep SPINLOCK .config CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y CONFIG_QUEUED_SPINLOCKS=y CONFIG_PARAVIRT_SPINLOCKS=y # CONFIG_DEBUG_SPINLOCK is not set $ > because you have one of the lock debugging > options on that force spinlock debuggin on again: > - DEBUG_WW_MUTEX_SLOWPATH > - DEBUG_LOCK_ALLOC > - PROVE_LOCKING None of which are set: $ grep 'DEBUG\|PROVE' .config |grep -v '#' CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_DEBUG_RODATA=y CONFIG_SLUB_DEBUG=y CONFIG_HAVE_DMA_API_DEBUG=y CONFIG_X86_DEBUGCTLMSR=y CONFIG_PM_DEBUG=y CONFIG_PM_SLEEP_DEBUG=y CONFIG_DEBUG_DEVRES=y CONFIG_PNP_DEBUG_MESSAGES=y CONFIG_XFS_DEBUG=y CONFIG_OCFS2_DEBUG_MASKLOG=y CONFIG_CIFS_DEBUG=y CONFIG_DEBUG_INFO=y CONFIG_DEBUG_FS=y CONFIG_DEBUG_KERNEL=y CONFIG_HAVE_DEBUG_KMEMLEAK=y CONFIG_DEBUG_STACK_USAGE=y CONFIG_HAVE_DEBUG_STACKOVERFLOW=y CONFIG_SCHED_DEBUG=y CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_DEBUG_BUGVERBOSE=y CONFIG_DEBUG_LIST=y CONFIG_FAULT_INJECTION_DEBUG_FS=y CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y CONFIG_DEBUG_BOOT_PARAMS=y $ > [ Light goes on ] > > Oh, no, I can see another possibility: you're not doing proper CPU > profiles, you're doing some timer-irq profile, and the reason you get > 41% on spin_unlock_irq_restore() is that that is where the interrupts > are enabled again. > > Timer-interrupt based profiles are not useful either. I've just been using whatever perf defaults to. Defaults are supposed to be useful - if they aren't then perf needs to be fixed. perf top reports this by default: Samples: 118K of event 'cpu-clock', Event count (approx.): 793748915 Overhead Shared O Symbol ¿ 34.48% [kernel] [k] _raw_spin_unlock_irqrestore ¿ 7.89% [kernel] [k] copy_user_generic_string ¿ 5.08% [kernel] [k] _raw_spin_unlock_irq ... > Make sure you actually use "perf record -e cycles:pp" or something > that uses PEBS to get real profiles using CPU performance counters. WTF is PEBS? I'm not a CPU nerd, and I certainly don't expect to have to learn all the intricacies of hardware performance counters just to profile the kernel in a correct and sane manner. That's what the *perf defaults* are supposed to do. Anyway: `perf top -U -e cycles:pp`: Samples: 301K of event 'cpu-clock:ppH', Event count (approx.): 69364814 Overhead Shared O Symbol ¿ 30.89% [kernel] [k] _raw_spin_unlock_irqrestore ¿ 7.04% [kernel] [k] _raw_spin_unlock_irq ¿ 4.08% [kernel] [k] copy_user_generic_string ¿ 2.44% [kernel] [k] get_page_from_freelist ¿ 1.81% [kernel] [k] _raw_spin_lock No change. $ sudo perf record -e cycles:pp -a --all-kernel -- xfs_io -f -c "pwrite 0 47g" /mnt/scratch/fooey # Samples: 2M of event 'cpu-clock:khppH' # Event count (approx.): 588517250000 # # Overhead Command Shared Object Symbol # ........ ............... ................. .......................................... # 83.09% swapper [kernel.kallsyms] [k] native_safe_halt 1.42% xfs_io [kernel.kallsyms] [k] copy_user_generic_string 1.26% kswapd3 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 1.24% kswapd1 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 1.09% kswapd2 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.98% kswapd0 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.80% xfs_io [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.77% kworker/u34:2 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.73% xfs_io [kernel.kallsyms] [k] _raw_spin_unlock_irq 0.51% xfs_io [kernel.kallsyms] [k] get_page_from_freelist 0.39% xfs_io [kernel.kallsyms] [k] __block_commit_write.isra.29 0.16% xfs_io [kernel.kallsyms] [k] _raw_spin_lock 0.14% xfs_io [kernel.kallsyms] [k] up_write 0.14% kworker/u34:2 [kernel.kallsyms] [k] clear_page_dirty_for_io 0.14% kworker/u34:2 [kernel.kallsyms] [k] xfs_do_writepage .... It's exactly the same profile, just reported as a percentage of 16 CPUs rather than normalised to a single CPU. From my ignorant viewpoing, I'd say that's expected because perf is still using "cpu-clock" event configuration. The hardware event counters are undocumented in the perf man pages, perf-list doesn't output a single "cpu" or "cycles" event counter, or even what hardware event counters are available. Hence I've got no idea if it's broken, why "cycles" (or "cpu-cycles") doesn't apparently record "cycle" triggered events, or even what perf is supposed to tell me is it's recording cycle triggered events. -Dave. -- Dave Chinner da...@fromorbit.com