Igor reported garbage in a trace that looked like this: login-1718 [000] ...1 1340.195118: sched_process_exec: filename=/bin/login pid=1718 old_pid=1718 systemd-cgroups-2061 [006] ...1 1341.672207: sched_process_exec: filename=/usr/lib/systemd/systemd-cgroups-agent pid=2061 old_pid=2061 systemd-cgroups-2063 [004] ...1 1341.673437: sched_process_exec: filename=ÀÈûÉ pid=2063 old_pid=2063
See the junk in the last line for filename. Investigating this, I found that the problem is where the tracepoint lies. In exec.c exec_binprm() we have: ret = search_binary_handler(bprm); if (ret >= 0) { audit_bprm(bprm); trace_sched_process_exec(current, old_pid, bprm); The tracepoint uses bprm->filename to record the process that was executed. The problem is that the handlers called in search_binary_handler() can release the process' old memory for the new memory. The bprm->filename is not in this memory, but the issue is when this happens, if the parent used a vfork() it will wake up at that moment. To explain exactly what happens, we have this: cgroup_release_agent() agentbuf = kstrdup(); argv[0] = argentbuf; call_usermodehelper(argv[0],...,UMH_WAIT_EXEC); call_usermodehelper_setup() sub_info->path = path; (argv[0] from parameter) sub_info->work = __call_usermodehelper; call_usermodehelper_exec() queue_work(subinfo->work); wait_for_completion(); Now the work gets called: __call_usermodehelper kernel_thread(call_helper, CLONE_VFORK); Notice the VFORK flag. This means that the parent will sleep till the child exits or execs. call_helper() ____call_usermodehelper() do_execve(subinfo->path) filename = subinfo->path; (filename is allocated agentbuf) do_execve_common(filename) bprm->filename = filename; exec_bprm(bprm) search_binary_handler(bprm) fmt->load_binary() (load_elf_binary); load_elf_binary() flush_old_exec() exec_mmap() mm_release() if (tsk->vfork_done) complete_vfork_done(tsk); Here we wake up the parent. The one that called kthread_thread(call_helper). Now that one continues to run: case UMH_WAIT_EXEC: if (pid < 0) sub_info->retval = pid; umh_complete(sub_info); Now this wakes up the process that's waiting for completion. Which continues to run and does: kfree(agentbuf); This is what bprm->filename points to. When the search_binary_handler() returns, (with a possibility that filename has been freed), when it calls the tracepoint, the tracepoint will record bogus data. Luckily it didn't crash the kernel. Adding a few strategic trace_printks, we can see this happening: kworker/3:1-51 [003] .... 59.700772: cgroup_release_agent: agentbuf = ffff880114fc28c0 /usr/lib/systemd/systemd-cgroups-agent systemd-cgroups-1925 [005] .... 59.700866: do_execve_common.isra.25: calling search_binary_handler (bprm->filename=ffff880114fc28c0 /usr/lib/systemd/systemd-cgroups-agent systemd-cgroups-1925 [005] .... 59.700867: search_binary_handler: fmt->load_binary = load_misc_binary+0x0/0x302 systemd-cgroups-1925 [005] .... 59.700868: search_binary_handler: fmt->load_binary = load_script+0x0/0x1e8 systemd-cgroups-1925 [005] .... 59.700868: search_binary_handler: fmt->load_binary = load_elf_binary+0x0/0x179b systemd-cgroups-1925 [005] .... 59.700876: flush_old_exec: calling exec_mmap bprm->filename=ffff880114fc28c0 /usr/lib/systemd/systemd-cgroups-agent systemd-cgroups-1925 [005] .... 59.700876: flush_old_exec: calling mm_release systemd-cgroups-1925 [005] .... 59.700876: mm_release: tsk->vfork_done = ffff880000091d68 kworker/3:1-51 [003] .... 59.700904: cgroup_release_agent: freeing agentbuf systemd-cgroups-1925 [005] .... 59.700911: do_execve_common.isra.25: back from binary (bprm->filename=ffff880114fc28c0 @$üÿÿ/systemd/systemd-cgroups-agent systemd-cgroups-1925 [005] ...1 59.700912: sched_process_exec: filename=@$üÿÿ/systemd/systemd-cgroups-agent pid=1925 old_pid=1925 The junk starts after "freeing agentbuf" Now to fix this we need to save the filename before calling search_binary_handler(). But we don't want to save it if we are not tracing. Why slow everyone else down? To prevent this, I added a static key branch (nop or unconditional jump) that gets enabled when the tracepoint sched_process_exec is registered, and disabled when the tracepoint is unregistered. When the tracepoint is registered, the filename will be allocated by a kstrdup and freed after the call. If filename fails to allocate, we simply fall back to brpm->tcomm (just the name, not the path). This works, but is rather ugly. Looking for any other suggestions here. Reported-by: Igor Zhbanov <i.zhba...@samsung.com> [ I'm assuming this is the right person. All I have is a nick "IZh" on IRC] Signed-off-by: Steven Rostedt <rost...@goodmis.org> diff --git a/fs/exec.c b/fs/exec.c index e1529b4..9bb1b0f 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1407,9 +1407,22 @@ int search_binary_handler(struct linux_binprm *bprm) } EXPORT_SYMBOL(search_binary_handler); +static struct static_key sched_process_exec_key = STATIC_KEY_INIT_FALSE; + +void sched_process_exec_key_reg(void) +{ + static_key_slow_inc(&sched_process_exec_key); +} + +void sched_process_exec_key_unreg(void) +{ + static_key_slow_dec(&sched_process_exec_key); +} + static int exec_binprm(struct linux_binprm *bprm) { pid_t old_pid, old_vpid; + char *filename; int ret; /* Need to fetch pid before load_binary changes it */ @@ -1418,10 +1431,34 @@ static int exec_binprm(struct linux_binprm *bprm) old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent)); rcu_read_unlock(); - ret = search_binary_handler(bprm); + /* + * When the sched_process_exec tracepoint is activated, it + * enables the static key sched_process_exec_key to do some more + * work. The issue is that the tracepoint wants to record the + * filename of the bprm but only if search_binary_handler() returns + * without an error. The problem is that the search_binary_handler() + * may call mm_release() which can wake up the parent that did a vfork + * (see cgroup call_usermodehelper()) and the parent may free bprm->filename. + * + * The filename must be saved before calling search_binary_handler(). + * But as this is only needed for tracing, only do this work if + * the tracepoint is enabled. + * + * The static key usage removes conditional branches as well. + */ + if (static_key_false(&sched_process_exec_key)) { + /* If filename fails to alloc, just use tcomm */ + filename = kstrdup(bprm->filename, GFP_KERNEL); + ret = search_binary_handler(bprm); + if (ret >= 0) + trace_sched_process_exec(current, old_pid, + filename ? filename : bprm->tcomm); + kfree(filename); + } else + ret = search_binary_handler(bprm); + if (ret >= 0) { audit_bprm(bprm); - trace_sched_process_exec(current, old_pid, bprm); ptrace_event(PTRACE_EVENT_EXEC, old_vpid); proc_exec_connector(current); } diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 67e1bbf..a7eaa06 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -276,30 +276,35 @@ TRACE_EVENT(sched_process_fork, __entry->child_comm, __entry->child_pid) ); +extern void sched_process_exec_key_reg(void); +extern void sched_process_exec_key_unreg(void); + /* * Tracepoint for exec: */ -TRACE_EVENT(sched_process_exec, +TRACE_EVENT_FN(sched_process_exec, - TP_PROTO(struct task_struct *p, pid_t old_pid, - struct linux_binprm *bprm), + TP_PROTO(struct task_struct *p, pid_t old_pid, char *filename), - TP_ARGS(p, old_pid, bprm), + TP_ARGS(p, old_pid, filename), TP_STRUCT__entry( - __string( filename, bprm->filename ) + __string( filename, filename ) __field( pid_t, pid ) __field( pid_t, old_pid ) ), TP_fast_assign( - __assign_str(filename, bprm->filename); + __assign_str(filename, filename); __entry->pid = p->pid; __entry->old_pid = old_pid; ), TP_printk("filename=%s pid=%d old_pid=%d", __get_str(filename), - __entry->pid, __entry->old_pid) + __entry->pid, __entry->old_pid), + + sched_process_exec_key_reg, + sched_process_exec_key_unreg ); /* -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/