On Mon, Mar 28, 2016 at 06:41:32AM +0000, Wang Nan wrote:
> This patch introduces a 'write_backward' bit to perf_event_attr, which
> controls the direction of a ring buffer. When set, the corresponding
> ring buffer is written from end to beginning. This feature is designed
> to support reading from an overwritable ring buffer.
>
> A ring buffer can be created by mapping a perf event fd. The kernel
> puts event records into the ring buffer, and user programs like perf
> fetch them from the address returned by mmap(). To prevent racing
> between the kernel and perf, they communicate with each other through
> 'head' and 'tail' pointers. The kernel maintains the 'head' pointer,
> pointing it to the next free area (the tail of the last record). Perf
> maintains the 'tail' pointer, pointing it to the tail of the last
> consumed record (a record that has already been fetched). The kernel
> determines the available space in the ring buffer using these two
> pointers, preventing it from overwriting unfetched records.
>
> By mapping without 'PROT_WRITE', an overwritable ring buffer is
> created. Different from a normal ring buffer, perf is unable to
> maintain the 'tail' pointer because writing is forbidden. Therefore,
> for this type of ring buffer, the kernel overwrites old records
> unconditionally, working like a flight recorder. This feature would be
> useful if reading from an overwritable ring buffer were as easy as
> reading from a normal ring buffer. However, there's an obscure problem.
>
> The following figure demonstrates the state of an overwritable ring
> buffer which is nearly full. In this figure, the 'head' pointer points
> to the end of the last record, and a long record 'E' is pending. For a
> normal ring buffer, a 'tail' pointer would have pointed to position
> (X), so the kernel knows there's no more space in the ring buffer.
> However, for an overwritable ring buffer, the kernel doesn't care
> about the 'tail' pointer.
>
>  (X)                              head
>   .                                |
>   .                                V
> +------+-------+----------+------+---+
> |A....A|B.....B|C........C|D....D|   |
> +------+-------+----------+------+---+
>
> After writing record 'E', record 'A' is overwritten.
>
>    head
>     |
>     V
> +--+---+-------+----------+------+---+
> |.E|..A|B.....B|C........C|D....D|E..|
> +--+---+-------+----------+------+---+
>
> Now perf decides to read from this ring buffer. However, neither of
> the two natural positions, 'head' and the start of this ring buffer,
> points to the head of a record. Even if perf can read the full ring
> buffer, it is unable to find a position to start decoding.
>
> The first attempt to solve this problem, AFAIK, can be found in [1].
> It makes the kernel maintain the 'tail' pointer, updating it when the
> ring buffer is half full. However, this approach introduces overhead
> to the fast path. Test results show a 1% overhead [2]. In addition,
> this method utilizes no more than 50% of the records.
>
> Another attempt can be found in [3], which allows putting the size of
> an event at the end of each record. This approach allows perf to find
> records in a backward manner from the 'head' pointer by reading the
> size of a record from its tail. However, because of the alignment
> requirement, it needs 8 bytes to record the size of a record, which is
> a huge waste. Its performance is also not good, because more data
> needs to be written. This approach also introduces some extra branch
> instructions to the fast path.
>
> 'write_backward' is a better solution to this problem.
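The head/tail protocol described above can be simulated in a few lines of
user-space C. This is a hypothetical sketch, not the kernel code: the names
(RB_SIZE, reserve, consume) are invented for illustration, and the real
interface additionally needs memory barriers around the data_head/data_tail
fields of the mmap'ed control page. It only shows the invariant that makes a
normal ring buffer safe and an overwritable one problematic.

```c
/*
 * Hypothetical simulation of the normal head/tail protocol (not kernel
 * code). The writer advances 'head' only when there is room before
 * 'tail', so unconsumed records are never overwritten; the reader frees
 * space by advancing 'tail'. Without a 'tail' update (the overwritable
 * case), the writer has no space information at all.
 */
#include <stdint.h>

#define RB_SIZE 64                /* ring buffer size in bytes */

static uint64_t head, tail;       /* free-running byte counters */

/* Writer side: try to reserve 'size' bytes; -1 means the buffer is
 * full and the record must be dropped (a "lost" event). */
static int reserve(uint16_t size)
{
	if (head + size - tail > RB_SIZE)
		return -1;        /* would clobber an unconsumed record */
	head += size;
	return 0;
}

/* Reader side: mark 'size' bytes as consumed by publishing a new tail. */
static void consume(uint16_t size)
{
	tail += size;
}
```

With RB_SIZE = 64, a second reserve(40) fails until the reader consumes the
first record; an overwritable buffer never updates 'tail', which is exactly
why the kernel loses this space information.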
> The following figure demonstrates the state of the overwritable ring
> buffer when 'write_backward' is set, before overwriting:
>
>      head
>       |
>       V
> +---+------+----------+-------+------+
> |   |D....D|C........C|B.....B|A....A|
> +---+------+----------+-------+------+
>
> and after overwriting:
>
>                                    head
>                                     |
>                                     V
> +---+------+----------+-------+---+--+
> |..E|D....D|C........C|B.....B|A..|E.|
> +---+------+----------+-------+---+--+
>
> In each situation, 'head' points to the beginning of the newest
> record. From this record, perf can iterate over the full ring buffer,
> fetching as many records as possible one by one.
>
> The only limitation that needs to be considered is back-to-back
> reading. Due to the non-determinism of the user program, it is
> impossible to ensure the ring buffer stays stable during reading.
> Consider an extreme situation: perf is scheduled out after reading
> record 'D', then a burst of events comes and eats up the whole ring
> buffer (one or multiple rounds), but the 'head' pointer happens to be
> at the same position when perf comes back. Continuing to read after
> 'D' is now incorrect.
>
> To prevent this problem, we need a way to ensure the ring buffer is
> stable during reading. ioctl(PERF_EVENT_IOC_PAUSE_OUTPUT) is suggested
> because its overhead is lower than that of
> ioctl(PERF_EVENT_IOC_ENABLE).
>
> This patch utilizes the event's default overflow_handler introduced
> previously. perf_event_output_backward() is created as the default
> overflow handler for backward ring buffers. To avoid extra overhead on
> the fast path, the original perf_event_output() becomes
> __perf_event_output() and is marked '__always_inline'. In theory, no
> extra overhead is introduced to the fast path.
>
> Performance result:
>
> Calling close(-1) 3000000 times, using gettimeofday() to check the
> duration. 'perf record -o /dev/null -e raw_syscalls:*' is used to
> capture the system calls. Times are in ns.
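The backward layout above can be simulated in plain C to show why 'head'
alone is enough to decode the buffer. This is a hedged user-space sketch
with invented names (emit, count_records, rec_hdr), not the kernel
implementation: 'head' decreases before each write, so it always lands on
the header of the newest record, and each header's size field links to the
next-older one until a partially overwritten record is hit.

```c
/*
 * Hypothetical simulation of a backward-written overwritable ring
 * buffer (invented names; not kernel code). 'head' decreases before
 * each write, so (head & mask) is always the offset of the newest
 * record's header.
 */
#include <stdint.h>
#include <string.h>

#define RB_SIZE 64                     /* power of two, as in perf */

struct rec_hdr { uint16_t size; };     /* stand-in for perf_event_header */

static uint8_t rb[RB_SIZE];
static uint64_t head;                  /* free-running; masked on access */

/* Write one 'size'-byte record (header included), overwriting old data. */
static void emit(uint16_t size)
{
	struct rec_hdr h = { size };
	uint8_t tmp[sizeof(h)];
	uint16_t i;

	head -= size;                  /* head moves toward lower offsets */
	memcpy(tmp, &h, sizeof(h));
	for (i = 0; i < size; i++)     /* byte-wise copy handles wrapping */
		rb[(head + i) & (RB_SIZE - 1)] = i < sizeof(h) ? tmp[i] : 0;
}

/* Decode from 'head': newest record first; stop at a record that was
 * partially overwritten (its size would run past one full buffer). */
static int count_records(void)
{
	uint64_t used = 0;
	int n = 0;

	while (used + sizeof(struct rec_hdr) <= RB_SIZE) {
		struct rec_hdr h;
		unsigned int i;

		for (i = 0; i < sizeof(h); i++)
			((uint8_t *)&h)[i] = rb[(head + used + i) & (RB_SIZE - 1)];
		if (h.size < sizeof(h) || used + h.size > RB_SIZE)
			break;         /* stale or truncated record */
		used += h.size;
		n++;
	}
	return n;
}
```

After four 16-byte records the buffer is exactly full and all four decode;
emitting a fifth, 12-byte record clobbers part of the oldest one, and the
bounds check above drops only that truncated record, matching "fetching as
many records as possible".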
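The suggested pause-read-resume sequence could be wrapped as below. This is
an illustrative helper, not code from the patch: it assumes the
PERF_EVENT_IOC_PAUSE_OUTPUT ioctl proposed alongside this series, and the
fallback #define spells out its uapi value only for the sake of a
self-contained example.

```c
/* Illustrative helper (not from the patch): freeze or resume an event's
 * ring buffer so user space can walk it backward from 'head' safely. */
#include <sys/ioctl.h>
#include <linux/types.h>

#ifndef PERF_EVENT_IOC_PAUSE_OUTPUT
#define PERF_EVENT_IOC_PAUSE_OUTPUT _IOW('$', 9, __u32)
#endif

static int pause_output(int fd, unsigned int on)
{
	/* on = 1 pauses output (buffer stable for reading);
	 * on = 0 lets the kernel resume overwriting. */
	return ioctl(fd, PERF_EVENT_IOC_PAUSE_OUTPUT, on);
}
```

A reader would call pause_output(fd, 1), iterate records from data_head,
then pause_output(fd, 0); pausing is cheaper than disabling and re-enabling
the event itself.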
> Testing environment:
>
> CPU    : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
> Kernel : v4.5.0
>
>            MEAN         STDVAR
> BASE     800214.950    2853.083
> PRE1    2253846.700    9997.014
> PRE2    2257495.540    8516.293
> POST    2250896.100    8933.921
>
> Where 'BASE' is the pure performance without capturing, 'PRE1' is the
> test result on a pure v4.5.0 kernel, 'PRE2' is the test result before
> this patch, and 'POST' is the test result after this patch. See [4]
> for the detailed experimental setup.
>
> Considering the standard deviation, this patch doesn't introduce a
> performance overhead to the fast path.
>
> [1] http://lkml.iu.edu/hypermail/linux/kernel/1304.1/04584.html
> [2] http://lkml.iu.edu/hypermail/linux/kernel/1307.1/00535.html
> [3] http://lkml.iu.edu/hypermail/linux/kernel/1512.0/01265.html
> [4] http://lkml.kernel.org/g/56f89dcd.1040...@huawei.com
>
> Signed-off-by: Wang Nan <wangn...@huawei.com>
> Cc: He Kuang <heku...@huawei.com>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Cc: Arnaldo Carvalho de Melo <a...@redhat.com>
> Cc: Brendan Gregg <brendan.d.gr...@gmail.com>
> Cc: Jiri Olsa <jo...@kernel.org>
> Cc: Masami Hiramatsu <masami.hiramatsu...@hitachi.com>
> Cc: Namhyung Kim <namhy...@kernel.org>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Zefan Li <lize...@huawei.com>
> Cc: pi3or...@163.com
> ---
>  include/linux/perf_event.h      | 28 +++++++++++++++++++++---
>  include/uapi/linux/perf_event.h |  3 ++-
>  kernel/events/core.c            | 48 ++++++++++++++++++++++++++++++++++++-----
>  kernel/events/ring_buffer.c     | 14 ++++++++++++
>  4 files changed, 84 insertions(+), 9 deletions(-)
Very useful feature. Looks good.

Acked-by: Alexei Starovoitov <a...@kernel.org>