Current tracing infrastructure such as perf and ftrace reports system-wide data when invoked inside a container. When such tools are invoked inside a container, the reported events need to be restricted to that container's context.
This RFC patch adds support for filtering container-specific events, without any change in the user interface, when the perf utility is invoked within a container; such support still needs to be extended to ftrace.

This patch assumes that debugfs is available within the container and that all the processes running inside a container are grouped into a single cgroup of the perf_event subsystem. It piggybacks on the existing support for tracing with cgroups [1] by setting the cgrp member of the event structure to the cgroup of the context the perf tool is invoked from.

However, this patch is not complete and requires more work to fully support tracing inside a container. It is intended to initiate the discussion on having container-aware tracing support. What is supported and which issues are pending are explained in detail below.

Suggestions, feedback, flames are welcome.

[1] https://lkml.org/lkml/2011/2/14/40

--------------------------------------------------------------------

Details:

With this patch, perf-stat, perf-record (tracepoints, [ku]probes) and perf-top, when executed within a container, report only the events that are triggered in that container's context.

However, there are a couple of limitations on how this works for kprobes/uprobes and for the ftrace infrastructure in general. The problem arises from the use of the files /sys/kernel/debug/tracing/[uk]probe_events. The perf utility inserts a probe by writing into the [uk]probe_events file, which the kernel parses to register an event. When debugfs is mounted inside containers, the contents of these files are visible to all containers. This implies that a user within a container can list and delete probes registered by other containers, leading to security issues and/or denial of service (e.g., by deleting a probe from another container every time it is registered). This could be undesirable depending on the way containers are used (e.g., in a multi-tenant setup where each user is assigned a container).
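To make the concern above concrete, here is a minimal sketch of how a probe is registered and deleted through a probe_events file. It assumes the command syntax described in the kernel's kprobe-based event tracing documentation ("p:GROUP/EVENT SYMBOL" to add, "-:GROUP/EVENT" to delete). The function names are hypothetical and are not taken from perf; the path to the file is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the command that registers a kprobe: "p:GROUP/EVENT SYMBOL".
 * Returns the command length, or -1 if buf is too small.
 */
int format_probe_add(char *buf, size_t len, const char *group,
		     const char *event, const char *symbol)
{
	int n;

	assert(buf && group && event && symbol);
	n = snprintf(buf, len, "p:%s/%s %s", group, event, symbol);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Build the command that deletes a probe: "-:GROUP/EVENT".
 * Note that nothing in the command identifies who registered the probe.
 */
int format_probe_del(char *buf, size_t len, const char *group,
		     const char *event)
{
	int n;

	assert(buf && group && event);
	n = snprintf(buf, len, "-:%s/%s", group, event);
	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Append one command to a probe_events file (hypothetical helper).
 * The file is opened in append mode so existing probes are kept.
 * Returns 0 on success, -1 on failure.
 */
int write_probe_cmd(const char *path, const char *cmd)
{
	FILE *fp;
	int ok;

	assert(path && cmd);
	fp = fopen(path, "a");
	if (!fp)
		return -1;
	ok = fprintf(fp, "%s\n", cmd) > 0;
	if (fclose(fp) != 0)	/* write errors may only surface on flush */
		ok = 0;
	return ok ? 0 : -1;
}
```

Since the delete command carries only the probe name, any writer that can see the shared file can remove any listed probe, which is the denial-of-service scenario described above.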
The issues mentioned above exist for any tracing infrastructure that uses the ftrace interface. One approach is to have a container-specific view of these files under /sys/kernel/debug/tracing. At this moment, this seems to require a significant rework of ftrace.

We are looking for feedback on the assumption we have made, namely that the processes running inside a container are grouped into a single cgroup of the perf_event subsystem, and also for any thoughts on extending such support to ftrace.

Regards,
Aravinda

Cc: Hari Bathini <hbath...@linux.vnet.ibm.com>
Signed-off-by: Aravinda Prasad <aravi...@linux.vnet.ibm.com>
---
 kernel/events/core.c |   49 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 81aa3a4..f6a1f89 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -589,17 +589,38 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 {
 	struct perf_cgroup *cgrp;
 	struct cgroup_subsys_state *css;
-	struct fd f = fdget(fd);
+	struct fd f;
 	int ret = 0;
 
-	if (!f.file)
-		return -EBADF;
+	if (fd != -1) {
+		f = fdget(fd);
+		if (!f.file)
+			return -EBADF;
 
-	css = css_tryget_online_from_dir(f.file->f_path.dentry,
+		css = css_tryget_online_from_dir(f.file->f_path.dentry,
 					 &perf_event_cgrp_subsys);
-	if (IS_ERR(css)) {
-		ret = PTR_ERR(css);
-		goto out;
+		if (IS_ERR(css)) {
+			ret = PTR_ERR(css);
+			fdput(f);
+			return ret;
+		}
+	} else if (event->attach_state == PERF_ATTACH_TASK) {
+		/* Tracing on a PID. No need to set event->cgrp */
+		return ret;
+	} else if (task_active_pid_ns(current) != &init_pid_ns) {
+		/* Don't set event->cgrp if task belongs to root cgroup */
+		if (task_css_is_root(current, perf_event_cgrp_id))
+			return ret;
+
+		css = task_css(current, perf_event_cgrp_id);
+		if (!css || !css_tryget_online(css))
+			return -ENOENT;
+	} else {
+		/*
+		 * perf invoked from global context and hence don't set
+		 * event->cgrp as all the events should be included
+		 */
+		return ret;
 	}
 
 	cgrp = container_of(css, struct perf_cgroup, css);
@@ -614,8 +635,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		perf_detach_cgroup(event);
 		ret = -EINVAL;
 	}
-out:
-	fdput(f);
+
+	if (fd != -1)
+		fdput(f);
+
 	return ret;
 }
 
@@ -7554,11 +7577,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
+	err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+	if (err)
+		goto err_ns;
 
 	pmu = perf_init_event(event);
 	if (!pmu)
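For reviewers, the cgroup-selection order that the modified perf_cgroup_connect() follows can be summarized as a small userspace model. This is only a sketch of the branch order in the hunk above: the struct, enum and function names are hypothetical, the inputs are flattened into booleans, and the error paths (-EBADF, -ENOENT, -EINVAL) are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome of the cgroup-selection step, as modeled here. */
enum cgrp_choice {
	CGRP_FROM_FD,		/* use the cgroup directory fd passed in */
	CGRP_FROM_CURRENT,	/* use the cgroup of the invoking task */
	CGRP_NONE,		/* leave event->cgrp unset: no filtering */
};

/* Hypothetical flattened view of the inputs the patch inspects. */
struct connect_ctx {
	int fd;			/* cgroup fd, or -1 if none was given */
	bool attached_to_task;	/* event is tracing a specific PID */
	bool in_init_pid_ns;	/* caller is in the initial PID namespace */
	bool in_root_cgroup;	/* caller is in the root perf_event cgroup */
};

/* Mirror the if/else-if chain of the patch, success paths only. */
enum cgrp_choice choose_cgroup(const struct connect_ctx *c)
{
	assert(c);

	if (c->fd != -1)
		return CGRP_FROM_FD;		/* explicit cgroup mode */
	if (c->attached_to_task)
		return CGRP_NONE;		/* tracing on a PID */
	if (!c->in_init_pid_ns && !c->in_root_cgroup)
		return CGRP_FROM_CURRENT;	/* invoked inside a container */
	return CGRP_NONE;			/* global context or root cgroup */
}
```

In this model, only a caller outside the initial PID namespace, in a non-root perf_event cgroup, with no explicit cgroup fd and no target PID, ends up filtered by its own cgroup; every other combination keeps the pre-patch behaviour.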