The preempt_disable tracepoint only traces the disable <-> enable
transitions. That is correct, but consider this case:
--------------------------- %< ----------------------
        THREAD                                  IRQ
           |                                     |
preempt_disable_notrace() {
    __preempt_count_add(1)
}

/* preemption disabled/IRQs enabled */

                ------->        smp_apic_timer_interrupt() {
                                    preempt_disable() {
                                        do not trace (preempt count >= 1)
                                    }
                                    ....
                                    preempt_enable() {
                                        do not trace (preempt count >= 1)
                                    }
                <-------        }

preempt_enable_notrace() {
    __preempt_count_sub(1)
}
--------------------------- >% ----------------------

The non-traceable preempt_disable() can hide a legitimate
preempt_disable() that is worth tracing.

It is possible to observe this problem using this kernel module:

    http://bristot.me/files/efficient_verification/wip.tar.gz

and doing the following trace:

    # cd /sys/kernel/debug/tracing/
    # echo 1 > snapshot
    # cat available_events | grep preempt_ > set_event
    # echo irq_vectors >> /sys/kernel/debug/tracing/set_event
    # insmod wip.ko
    /* wait for a snapshot creation */
    # tail -100 snapshot
        sshd-1159  [000] d...1..  2440.866116: preempt_enable: caller=migrate_enable+0x1bb/0x330 parent=migrate_enable+0x1bb/0x330
        sshd-1159  [000] d..h1..  2440.866122: local_timer_entry: vector=236
        sshd-1159  [000] d..h1..  2440.866127: local_timer_exit: vector=236
        sshd-1159  [000] d.L.4..  2440.866129: process_event: event sched_waking not expected in the state preemptive
        sshd-1159  [000] d.L.4..  2440.866137: <stack trace>
     => process_event
     => __handle_event
     => ttwu_do_wakeup
     => try_to_wake_up
     => irq_exit
     => smp_apic_timer_interrupt
     => apic_timer_interrupt
     => kvm_clock_read
     => ktime_get_with_offset
     => posix_get_boottime
     => __x64_sys_clock_gettime
     => do_syscall_64
     => entry_SYSCALL_64_after_hwframe

In the trace above, kvm_clock_read() disables preemption without
tracing:

--------------------------- %< ----------------------
static u64 kvm_clock_read(void)
{
	u64 ret;

	preempt_disable_notrace();
	ret = pvclock_clocksource_read(this_cpu_pvti());
	preempt_enable_notrace();
	return ret;
}
--------------------------- >% ----------------------

Use a percpu variable to count the traced preempt_disable()/enable()
calls, and use it to decide whether to trace or not.

Signed-off-by: Daniel Bristot de Oliveira <bris...@redhat.com>
Cc: "Steven Rostedt (VMware)" <rost...@goodmis.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Cc: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Cc: Matthias Kaehlcke <m...@chromium.org>
Cc: "Joel Fernandes (Google)" <j...@joelfernandes.org>
Cc: Frederic Weisbecker <frede...@kernel.org>
Cc: Yangtao Li <tiny.win...@gmail.com>
Cc: Tommaso Cucinotta <tommaso.cucino...@santannapisa.it>
Cc: linux-kernel@vger.kernel.org
---
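To make the logic concrete, here is a minimal user-space sketch of the
approach (a plain C model, not kernel code): count only the traced
disable/enable calls, and fire the tracepoint only when that counter
crosses 0 <-> 1, so a surrounding notrace section cannot mask it. The
my_preempt_*() helpers and both counters are made-up names for
illustration.

--------------------------- %< ----------------------
#include <stdio.h>

static int preempt_count;		/* models the real preempt counter */
static int traced_preempt_count;	/* models __traced_preempt_count */

static void my_preempt_disable_notrace(void)
{
	preempt_count += 1;	/* no tracepoint, traced counter untouched */
}

static void my_preempt_enable_notrace(void)
{
	preempt_count -= 1;
}

static void my_preempt_disable(void)
{
	preempt_count += 1;
	if (traced_preempt_count == 0)	/* 0 -> 1: preemption was enabled */
		printf("trace: preempt_off\n");
	traced_preempt_count += 1;
}

static void my_preempt_enable(void)
{
	preempt_count -= 1;
	traced_preempt_count -= 1;
	if (traced_preempt_count == 0)	/* 1 -> 0: preemption re-enabled */
		printf("trace: preempt_on\n");
}

int main(void)
{
	/* the scenario from the changelog: notrace around a traced pair */
	my_preempt_disable_notrace();
	my_preempt_disable();	/* traces, even though preempt_count == 2 */
	my_preempt_enable();	/* traces on the way back to 0 */
	my_preempt_enable_notrace();
	return 0;
}
--------------------------- >% ----------------------

With the old preempt_count() == val check, the inner traced pair above
would stay invisible, because the notrace section had already raised
the preempt counter to 1.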
 kernel/sched/core.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index be4117d7384f..2e07d4174778 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3148,19 +3148,25 @@ static inline void sched_tick_stop(int cpu) { }
 
 #if defined(CONFIG_PREEMPT) && (defined(CONFIG_DEBUG_PREEMPT) || \
 				defined(CONFIG_TRACE_PREEMPT_TOGGLE))
+
+DEFINE_PER_CPU(int, __traced_preempt_count) = 0;
 /*
  * If the value passed in is equal to the current preempt count
  * then we just disabled preemption. Start timing the latency.
  */
 void preempt_latency_start(int val)
 {
-	if (preempt_count() == val) {
+	int curr = this_cpu_read(__traced_preempt_count);
+
+	if (!curr) {
 		unsigned long ip = get_lock_parent_ip();
 #ifdef CONFIG_DEBUG_PREEMPT
 		current->preempt_disable_ip = ip;
 #endif
 		trace_preempt_off(CALLER_ADDR0, ip);
 	}
+
+	this_cpu_write(__traced_preempt_count, curr + val);
 }
 
 static inline void preempt_add_start_latency(int val)
@@ -3200,8 +3206,12 @@ NOKPROBE_SYMBOL(preempt_count_add);
  */
 void preempt_latency_stop(int val)
 {
-	if (preempt_count() == val)
+	int curr = this_cpu_read(__traced_preempt_count) - val;
+
+	if (!curr)
 		trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+
+	this_cpu_write(__traced_preempt_count, curr);
 }
 
 static inline void preempt_sub_stop_latency(int val)
-- 
2.20.1