On Thu, 30 Apr 2020 12:16:20 -0400 (EDT) Mathieu Desnoyers <mathieu.desnoy...@efficios.com> wrote:
> ----- On Apr 30, 2020, at 12:11 PM, rostedt rost...@goodmis.org wrote: > > > On Thu, 30 Apr 2020 16:11:21 +0200 > > Joerg Roedel <jroe...@suse.de> wrote: > > > >> Hi, > >> > >> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote: > >> > Talking with Mathieu about this on IRC, he pointed out that my code does > >> > have a vzalloc() that is called: > >> > > >> > in trace_pid_write() > >> > > >> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3); > >> > > >> > This is done when -P1,2 is on the trace-cmd command line. > >> > >> Okay, tracked it down, some instrumentation in the page-fault and > >> double-fault handler gave me the stack-traces. Here is what happens: > >> > >> As already pointed out, it all happens because of page-faults on the > >> vzalloc'ed pid bitmap. It starts with this stack-trace: > >> > >> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30 > > > > Interesting. Because that function is this: > > > > bool trace_event_ignore_this_pid(struct trace_event_file *trace_file) > > { > > struct trace_array *tr = trace_file->tr; > > struct trace_array_cpu *data; > > struct trace_pid_list *no_pid_list; > > struct trace_pid_list *pid_list; > > > > pid_list = rcu_dereference_raw(tr->filtered_pids); > > no_pid_list = rcu_dereference_raw(tr->filtered_no_pids); > > > > if (!pid_list && !no_pid_list) > > return false; > > > > data = this_cpu_ptr(tr->array_buffer.data); > > > > return data->ignore_pid; > > } > > > > Where it only sees if the pid masks exist. That is, it looks to see if > > there's pointers to them, it doesn't actually touch the vmalloc'd area. > > This check is to handle a race between allocating and deallocating the > > buffers and setting the ignore_pid bit. The reading of these arrays is done > > at sched_switch time, which sets or clears the ignore_pid field. > > > > That said, since this only happens on buffer instances (it does not trigger > > on the top level instance, which uses the same code for the pid masks) > > > > Could this possibly be for the tr->array_buffer.data, which is allocated > > with: > > > > allocate_trace_buffer() { > > [..] > > buf->data = alloc_percpu(struct trace_array_cpu); > > > > That is, the bug isn't the vmalloc being a problem, but perhaps the per_cpu > > allocation. This would explain why this crashes with the buffer instance > > and not with the top level instance. If it was related to the pid masks, > > then it would trigger for either (because they act the same in allocating > > at time of use). But when an instance is made, the tr->array_buffer.data is > > created. Which for the top level happens at boot up and the pages would > > have been synced long ago. But for a newly created instance, this happens > > just before its used. This could possibly explain why it's not a problem > > when doing it manually by hand, because the time between creating the > > instance, and the time to start and stop the tracing, is long enough for > > something to sync them page tables. > > > > tl;dr; It's not an issue with the vmalloc, it's an issue with per_cpu > > allocations! > > Did I mention that alloc_percpu uses: > > static void *pcpu_mem_zalloc(size_t size, gfp_t gfp) > { > if (WARN_ON_ONCE(!slab_is_available())) > return NULL; > > if (size <= PAGE_SIZE) > return kzalloc(size, gfp); > else > return __vmalloc(size, gfp | __GFP_ZERO, PAGE_KERNEL); > } > > So yeah, it's vmalloc'd memory when size > PAGE_SIZE. > I certainly hope that struct trace_array_cpu is not bigger than PAGE_SIZE! -- Steve