On Thu, 30 Apr 2020 16:11:21 +0200 Joerg Roedel <jroe...@suse.de> wrote:
> Hi, > > On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote: > > Talking with Mathieu about this on IRC, he pointed out that my code does > > have a vzalloc() that is called: > > > > in trace_pid_write() > > > > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3); > > > > This is done when -P1,2 is on the trace-cmd command line. > > Okay, tracked it down, some instrumentation in the page-fault and > double-fault handler gave me the stack-traces. Here is what happens: > > As already pointed out, it all happens because of page-faults on the > vzalloc'ed pid bitmap. It starts with this stack-trace: > > RIP: 0010:trace_event_ignore_this_pid+0x23/0x30 Interesting. Because that function is this: bool trace_event_ignore_this_pid(struct trace_event_file *trace_file) { struct trace_array *tr = trace_file->tr; struct trace_array_cpu *data; struct trace_pid_list *no_pid_list; struct trace_pid_list *pid_list; pid_list = rcu_dereference_raw(tr->filtered_pids); no_pid_list = rcu_dereference_raw(tr->filtered_no_pids); if (!pid_list && !no_pid_list) return false; data = this_cpu_ptr(tr->array_buffer.data); return data->ignore_pid; } Where it only sees if the pid masks exist. That is, it looks to see if there's pointers to them, it doesn't actually touch the vmalloc'd area. This check is to handle a race between allocating and deallocating the buffers and setting the ignore_pid bit. The reading of these arrays is done at sched_switch time, which sets or clears the ignore_pid field. That said, since this only happens on buffer instances (it does not trigger on the top level instance, which uses the same code for the pid masks) Could this possibly be for the tr->array_buffer.data, which is allocated with: allocate_trace_buffer() { [..] buf->data = alloc_percpu(struct trace_array_cpu); That is, the bug isn't the vmalloc being a problem, but perhaps the per_cpu allocation. This would explain why this crashes with the buffer instance and not with the top level instance. If it was related to the pid masks, then it would trigger for either (because they act the same in allocating at time of use). But when an instance is made, the tr->array_buffer.data is created. Which for the top level happens at boot up and the pages would have been synced long ago. But for a newly created instance, this happens just before its used. This could possibly explain why it's not a problem when doing it manually by hand, because the time between creating the instance, and the time to start and stop the tracing, is long enough for something to sync them page tables. tl;dr; It's not an issue with the vmalloc, it's an issue with per_cpu allocations! -- Steve