On Tue, 2013-01-29 at 16:12 +0100, Frederic Weisbecker wrote: > I have set some cputime tracepoint and looked at the stacktrace. They > are sorted by weight, which is the amount of cputime accounted. I > can't find anything wrong on these. See attachment.
That's because a histogram doesn't show the issue. You need to see the order of events. That is, you need to see that user time is calculated again within a system call. I ran my trace again with stack trace enabled and it shows the race nicely: kernelsp-2367 7.... 7088.483227: funcgraph_entry: | syscall_trace_leave() { kernelsp-2367 7d..1 7088.483228: bprint: account_system_time : system: 17887 kernelsp-2367 7d..1 7088.483242: kernel_stack: <stack trace> => __vtime_account_system (ffffffff810646e5) => vtime_user_enter (ffffffff81064a5e) => user_enter (ffffffff810d8b2b) => syscall_trace_leave (ffffffff8100d03f) => return_to_handler (ffffffff814c8dc5) => int_check_syscall_exit_work (ffffffff814c925f) syscall_trace_leave() was called, we are leaving a system call and going back to userspace. We calculate the system time and set the state to IN_USER by user_enter(). kernelsp-2367 7d..1 7088.483244: bprint: account_user_time : user: 15753 kernelsp-2367 7d..1 7088.483262: kernel_stack: <stack trace> => vtime_account_user (ffffffff81064a12) => vtime_account_irq_enter (ffffffff81064b40) => irq_enter (ffffffff8103d36e) => smp_apic_timer_interrupt (ffffffff814caa83) => apic_timer_interrupt (ffffffff814c9bb2) => user_enter (ffffffff810d8b58) => syscall_trace_leave (ffffffff8100d03f) => return_to_handler (ffffffff814c8dc5) => int_check_syscall_exit_work (ffffffff814c925f) But before we got to user space, we took an interrupt (smp_apic_timer_interrupt). Now the vtime_account_irq_enter() sees the IN_USER set, and accounts the time back to userspace. That was 16us it calculated incorrectly. kernelsp-2367 7d.h1 7088.483267: bprint: account_system_time : system: 23401 kernelsp-2367 7d.h1 7088.483286: kernel_stack: <stack trace> => __vtime_account_system (ffffffff810646e5) => vtime_account_irq_exit (ffffffff810649b8) => irq_exit (ffffffff8103d39e) => smp_apic_timer_interrupt (ffffffff814caacf) => apic_timer_interrupt (ffffffff814c9bb2) => user_enter (ffffffff810d8b58) => syscall_trace_leave (ffffffff8100d03f) => return_to_handler (ffffffff814c8dc5) => int_check_syscall_exit_work (ffffffff814c925f) Now we are leaving the interrupt and since the state is still IN_USER, the system time is calculated. kernelsp-2367 7d..2 7088.483287: bprint: account_user_time : user: 19865 kernelsp-2367 7d..2 7088.483309: kernel_stack: <stack trace> => vtime_account_user (ffffffff81064a12) => vtime_account_irq_enter (ffffffff81064b40) => __do_softirq (ffffffff8103d0f5) => call_softirq (ffffffff814ca2fc) => do_softirq (ffffffff81003fdc) => irq_exit (ffffffff8103d3df) => smp_apic_timer_interrupt (ffffffff814caacf) => apic_timer_interrupt (ffffffff814c9bb2) => user_enter (ffffffff810d8b58) => syscall_trace_leave (ffffffff8100d03f) => return_to_handler (ffffffff814c8dc5) => int_check_syscall_exit_work (ffffffff814c925f) But oops, we are now running a softirq, which happens to get everything wrong again, and calculates another 20us incorrectly towards user time. Perhaps the best thing to do is to have the irq vtime determine usermode or not with "user_mode(get_irq_regs())". I'll try it. -- Steve -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/