Peter Zijlstra <pet...@infradead.org> writes:

> On Fri, Sep 23, 2016 at 02:27:21PM +0300, Alexander Shishkin wrote:
>> In order to be able to allocate perf ring buffers in non-mmap path, we
>> need to make sure we can still account the memory to the user and that
>> they don't exceed their mlock limit.
>>
>> This patch moves ring buffer memory accounting down the rb_alloc() path
>> so that its callers won't have to worry about it. This also serves the
>> additional purpose of slightly cleaning up perf_mmap().
>
> While I like a cleanup of that code (it really can use one), I'm not a
> big fan of hidden buffers like this. Why is this needed?

So what I wanted is an interface similar to call stack sampling, or to
pretty much any other kind of sampling that we have at the moment. The
user would ask for AUX samples of, say, intel_pt, and would get a sample
with PT stuff right in the perf buffer every time their main event
overflows. They don't *need* to know that we have a kernel event with a
ring buffer under the hood. This was one of the use cases of 'hidden'
ring buffers. The other two are process core dump and system core dump
([1] tried to do it without involving perf at all, for reference).

> A quick look through the patches also leaves me wondering on the design
> and interface of this thing. A few words explaining the overall design
> would be nice.

Right; here goes. PERF_SAMPLE_AUX is set in the attr.sample_type of the
event that you want to sample. Then, using that event's
attr.aux_sample_type as the PMU 'type' and attr.aux_sample_config as
'config', we create a kernel event. For this kernel event, we then
allocate a ring buffer with 0 data pages and as many aux pages as would
fit the attr.aux_sample_size.

Then, we hook into the perf_prepare_sample()/perf_output_sample() path so
that when the original event goes off, we first stop the kernel event,
then memcpy the data from the 'hidden' aux buffer into the original
event's perf buffer under PERF_SAMPLE_AUX, and then restart the kernel
event. This all is happening on the local cpu. The 'hidden' aux buffer is
running in overwrite mode, so we copy attr.aux_sample_size bytes every
time, which means there may be overlaps between samples, but the tooling
has logic to handle this.

This is about it. Before creating a new counter we first look for an
existing one that fits the bill wrt filtering bits; if there is one, we
grab its reference and use it instead.
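
To illustrate the copy step described above, here is a minimal user-space
sketch. It is not the patch code: aux_copy_latest(), the flat byte buffer
and the free-running 'head' byte counter are all assumptions made up for
the example. It only shows how taking the newest N bytes out of an
overwrite-mode circular buffer has to deal with the wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Copy the newest 'len' bytes out of an overwrite-mode circular buffer.
 * 'size' is the buffer size in bytes and 'head' is the total number of
 * bytes ever written to it (an assumed free-running counter, so the
 * current write position is head % size).
 *
 * Returns the number of bytes actually copied, which is less than 'len'
 * if the buffer is smaller than 'len' or has not seen that much data yet.
 */
static size_t aux_copy_latest(void *dst, const void *buf, size_t size,
			      unsigned long long head, size_t len)
{
	const char *src = buf;
	size_t start, first;

	assert(dst != NULL && buf != NULL);

	if (size == 0)
		return 0;
	if (len > size)
		len = size;
	if ((unsigned long long)len > head)
		len = (size_t)head;

	/* Offset of the oldest byte that we want to copy. */
	start = (size_t)((head - len) % size);

	/* First chunk: from 'start' up to the end of the buffer. */
	first = size - start;
	if (first > len)
		first = len;
	memcpy(dst, src + start, first);

	/* Second chunk: whatever wrapped around to the beginning. */
	memcpy((char *)dst + first, src, len - first);

	return len;
}
```

Since each call takes the last 'len' bytes regardless of what the previous
call took, two samples that are less than 'len' bytes of trace apart will
contain some of the same data, which is the overlap mentioned above.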
Reusing an existing event is so that one could do things like

  $ perf record -Aintel_pt -e 'cycles,instructions,branch-misses' ls

or

  $ perf record -Aintel_pt -e 'sched:*' -a sleep 10

> Afaict there's no actual need to hide the AUX buffer for this sampling
> stuff; the user knows about all this and can simply mmap() the AUX part.

Yes, you're right here. We could also re-use the AUX record, adding a new
flag for this. It may be even better if I can work out the inheritance
(the current code doesn't handle inheritance at the moment in case we
decide to scrap it).

> The sample could either point to locations in the AUX buffer, or (as I
> think this code does) memcpy bits out.

Yes and yes, it does.

> Ideally we'd pass the AUX-event into the syscall, that way you avoid all
> the find_aux_event crud. I'm not sure we want to overload the group_fd
> thing more (its already very hard to create counter groups in a cgroup
> for example) ..

It can also be stuffed into the attribute or ioctl()ed. The latter is
probably the best.

> Coredump was mentioned somewhere, but I'm not sure I've seen
> code/interfaces for that. How was that envisioned to work?

Ok, so what I have is a new RLIMIT_PERF, which is set to the size of the
aux data sample to be included in the [process] core dump. At the
prlimit(RLIMIT_PERF) time, given that RLIMIT_CORE is also nonzero, I
create a kernel event with a 'hidden' buffer. The PMU for this event is,
in this scenario, a system-wide setting, which is a tad iffy, seeing as
we now have 2 PMUs in the system that can be used for this, but which are
mutually exclusive.

Now, when the core dump is written, we check if there's such an event on
the task's perf context and if there is, we dump_emit() data from the
hidden buffer into the file.

The difference with sampling is that this kernel event is also
inheritable, so that when the task fork()s, a new event is created.
The memory is counted against sysctl_perf_event_mlock + the user's
RLIMIT_MEMLOCK (just like the rest of the perf buffers), so when the user
is out of it, no new events are created.

The rlimit as an interface to enable this seems weirder the more I look
at it, which is also the reason why I haven't sent it out yet. The other
ideas I had for this were a prctl(), which would be more straightforward
and would also allow specifying the PMU, but, unlike prlimit(), would
only work on the current process. Yet another way would be to go through
perf_event_open() and then somehow feed the event into the ether instead
of polling it.

The last use case for the hidden buffer is system core dumps, which would
be either retrieved by kdump or stored in pstore/EFI capsule. I don't
have the code for this yet, but the general idea is that per-cpu AUX
events would start at boot time in overwrite mode and just hang in there
till things go south.

[1] http://marc.info/?l=linux-kernel&m=143814616805933

Thanks,
--
Alex