Re: [PATCH v6 4/7] perf annotate: Do not ignore call instruction with indirect target
Hi Arnaldo,

On Monday 19 September 2016 09:14 PM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Aug 19, 2016 at 06:29:35PM +0530, Ravi Bangoria escreveu:
>> Do not ignore call instruction with indirect target when it's already
>> identified as a call. This is an extension of commit e8ea1561952b
>> ("perf annotate: Use raw form for register indirect call instructions")
>> to generalize annotation for all instructions with indirect calls.
>>
>> This is needed for certain powerpc call instructions that use address
>> in a register (such as bctrl, btarl, ...).
>>
>> Apart from that, when kcore is used to disassemble function, all call
>> instructions were ignored. This patch will fix it as a side effect by
>> not ignoring them. For example,
>>
>> Before (with kcore):
>>      mov    %r13,%rdi
>>      callq  0x811a7e70
>>    ^ jmpq   64
>>      mov    %gs:0x7ef41a6e(%rip),%al
>>
>> After (with kcore):
>>      mov    %r13,%rdi
>>    → callq  0x811a7e70
>>    ^ jmpq   64
>>      mov    %gs:0x7ef41a6e(%rip),%al
>
> Ok, makes sense, but then now I have the -> and can't press enter to go
> to that function, in fact for the case I'm using as a test, the
> vsnprintf kernel function, I get:
>
> │ 56:   test   %al,%al
> │    ↓  je     81
> │       lea    -0x38(%rbp),%rsi
> │       mov    %r15,%rdi
> │    →  callq  0x993e3230
>
> That 0x993e3230 should've been resolved to:
>
> [root@jouet ~]# grep 993e3230 /proc/kallsyms
> 993e3230 t format_decode
>
> Trying to investigate why it doesn't...

I investigated this. If this example is with kcore, then it's expected:
perf annotate does not inspect kallsyms when it can't find the symbol
name in the disassembly itself.
For example, disassembly of finish_task_switch, with kcore:

  810cf1b0:  mov    $0x1,%esi
  810cf1b5:  mov    $0x4,%edi
  810cf1ba:  callq  0x811aced0
  810cf1bf:  andb   $0xfb,0x4c4(%rbx)
  810cf1c6:  jmpq   0x810cf0e9
  810cf1cb:  mov    %rbx,%rsi
  810cf1ce:  mov    %r13,%rdi
  810cf1d1:  callq  0x811a7e70
  810cf1d6:  jmpq   0x810cf0e4

with debuginfo:

  810cf1b0:  mov    $0x1,%esi
  810cf1b5:  mov    $0x4,%edi
  810cf1ba:  callq  811aced0 <___perf_sw_event>
  810cf1bf:  andb   $0xfb,0x4c4(%rbx)
  810cf1c6:  jmpq   810cf0e9
  810cf1cb:  mov    %rbx,%rsi
  810cf1ce:  mov    %r13,%rdi
  810cf1d1:  callq  811a7e70 <__perf_event_task_sched_in>
  810cf1d6:  jmpq   810cf0e4

call__parse tries to find the symbol name inside angle brackets, which
is not present in the kcore case.

-Ravi

> - Arnaldo
>
>> Suggested-by: Michael Ellerman
>> [Suggested about 'bctrl' instruction]
>> Signed-off-by: Ravi Bangoria
>> ---
>> Changes in v6:
>>   - No change
>>
>>  tools/perf/util/annotate.c | 8 ++------
>>  1 file changed, 2 insertions(+), 6 deletions(-)
>>
>> diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
>> index ea07588..a05423b 100644
>> --- a/tools/perf/util/annotate.c
>> +++ b/tools/perf/util/annotate.c
>> @@ -81,16 +81,12 @@ static int call__parse(struct ins_operands *ops, const char *norm_arch)
>>  	return ops->target.name == NULL ? -1 : 0;
>>
>>  indirect_call:
>> -	tok = strchr(endptr, '(');
>> -	if (tok != NULL) {
>> +	tok = strchr(endptr, '*');
>> +	if (tok == NULL) {
>>  		ops->target.addr = 0;
>>  		return 0;
>>  	}
>>
>> -	tok = strchr(endptr, '*');
>> -	if (tok == NULL)
>> -		return -1;
>> -
>>  	ops->target.addr = strtoull(tok + 1, NULL, 16);
>>  	return 0;
>>  }
>> --
>> 2.5.5
[PATCH v7 0/6] perf annotate: Cross arch support + few fixes
Currently perf annotate supports code navigation (branches and calls)
only when run on the same architecture where perf.data was recorded.
For example, recording on a powerpc server and annotating on a client's
x86 desktop is not supported. This patchset adds support for that.

Example:

Record on powerpc:
  $ ./perf record -a

Report -> Annotate on x86:
  $ ./perf report -i perf.data.powerpc --vmlinux vmlinux.powerpc

Changes in v7:
  - Use strings for normalized arch names instead of macros (i.e.
    removed patch 1/7 of v6).
  - In patch 1/6, make norm_arch a global variable instead of passing
    it to each parser.
  - In patches 1/6 and 6/6, slight change in initializing the
    instruction list.
  - Patch 4/7 of v6 is already accepted; removed it in v7.
  - Address other review comments.
  - Added more examples in patch descriptions.

v6 link: https://lkml.org/lkml/2016/8/19/411

Kim, I don't have an arm test machine. Can you please help me test this
on arm?

Kim Phillips (1):
  perf annotate: cross arch annotate support fixes for ARM

Naveen N. Rao (1):
  perf annotate: Add support for powerpc

Ravi Bangoria (4):
  perf annotate: Add cross arch annotate support
  perf annotate: Show raw form for jump instruction with indirect target
  perf annotate: Support jump instruction with target as second operand
  perf annotate: Fix jump target outside of function address range

 tools/perf/builtin-top.c          |   2 +-
 tools/perf/ui/browsers/annotate.c |   8 +-
 tools/perf/ui/gtk/annotate.c      |   2 +-
 tools/perf/util/annotate.c        | 259 ++++++++++++++++++++++++-----
 tools/perf/util/annotate.h        |   8 +-
 5 files changed, 232 insertions(+), 47 deletions(-)

--
2.5.5
[PATCH v7 1/6] perf annotate: Add cross arch annotate support
Change current data structures and functions to enable cross arch
annotate.

The current perf implementation does not support cross arch annotate.
To make it truly cross arch, the instruction tables of all archs should
be present in the perf binary, and the appropriate table should be used
based on the arch where perf.data was recorded.

Record on arm:
  $ ./perf record -a

Report -> Annotate on x86:
  $ ./perf report -i perf.data.arm --vmlinux vmlinux.arm

Signed-off-by: Ravi Bangoria
---
Changes in v7:
  - Make norm_arch a global variable instead of passing it to each
    parser.
  - Address other review comments.

 tools/perf/builtin-top.c          |   2 +-
 tools/perf/ui/browsers/annotate.c |   3 +-
 tools/perf/ui/gtk/annotate.c      |   2 +-
 tools/perf/util/annotate.c        | 151 ++++++++++++++++++++++++++----
 tools/perf/util/annotate.h        |   3 +-
 5 files changed, 134 insertions(+), 27 deletions(-)

diff --git a/tools/perf/builtin-top.c b/tools/perf/builtin-top.c
index 4007857..41ecdd6 100644
--- a/tools/perf/builtin-top.c
+++ b/tools/perf/builtin-top.c
@@ -129,7 +129,7 @@ static int perf_top__parse_source(struct perf_top *top, struct hist_entry *he)
 		return err;
 	}

-	err = symbol__disassemble(sym, map, 0);
+	err = symbol__disassemble(sym, map, 0, NULL);
 	if (err == 0) {
 out_assign:
 		top->sym_filter_entry = he;

diff --git a/tools/perf/ui/browsers/annotate.c b/tools/perf/ui/browsers/annotate.c
index 4c18271..214a14a 100644
--- a/tools/perf/ui/browsers/annotate.c
+++ b/tools/perf/ui/browsers/annotate.c
@@ -1050,7 +1050,8 @@ int symbol__tui_annotate(struct symbol *sym, struct map *map,
 			(nr_pcnt - 1);
 	}

-	err = symbol__disassemble(sym, map, sizeof_bdl);
+	err = symbol__disassemble(sym, map, sizeof_bdl,
+				  perf_evsel__env_arch(evsel));
 	if (err) {
 		char msg[BUFSIZ];
 		symbol__strerror_disassemble(sym, map, err, msg, sizeof(msg));

diff --git a/tools/perf/ui/gtk/annotate.c b/tools/perf/ui/gtk/annotate.c
index 42d3199..c127aba 100644
--- a/tools/perf/ui/gtk/annotate.c
+++ b/tools/perf/ui/gtk/annotate.c
@@ -167,7 +167,7 @@ static int symbol__gtk_annotate(struct symbol *sym, struct map *map,
 	if (map->dso->annotate_warned)
 		return -1;

-	err = symbol__disassemble(sym, map, 0);
+	err = symbol__disassemble(sym, map, 0, perf_evsel__env_arch(evsel));
 	if (err) {
 		char msg[BUFSIZ];
 		symbol__strerror_disassemble(sym, map, err, msg, sizeof(msg));

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index aeb5a44..816aa2c 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -21,10 +21,13 @@
 #include
 #include
 #include
+#include
+#include "../arch/common.h"

 const char	*disassembler_style;
 const char	*objdump_path;
 static regex_t	 file_lineno;
+static char	*norm_arch;

 static struct ins *ins__find(const char *name);
 static int disasm_line__parse(char *line, char **namep, char **rawp);
@@ -66,10 +69,8 @@ static int call__parse(struct ins_operands *ops, struct map *map)
 		name++;

-#ifdef __arm__
-	if (strchr(name, '+'))
+	if (!strcmp(norm_arch, "arm") && strchr(name, '+'))
 		return -1;
-#endif

 	tok = strchr(name, '>');
 	if (tok == NULL)
@@ -252,16 +253,12 @@ static int mov__parse(struct ins_operands *ops, struct map *map __maybe_unused)
 		return -1;

 	target = ++s;
-#ifdef __arm__
+
 	comment = strchr(s, ';');
-#else
-	comment = strchr(s, '#');
-#endif
+	if (comment == NULL)
+		comment = strchr(s, '#');

-	if (comment != NULL)
-		s = comment - 1;
-	else
-		s = strchr(s, '\0') - 1;
+	s = (comment != NULL) ?
+		comment - 1 : strchr(s, '\0') - 1;

 	while (s > target && isspace(s[0]))
 		--s;
@@ -364,14 +361,92 @@ bool ins__is_ret(const struct ins *ins)
 	return ins->ops == &ret_ops;
 }

-static struct ins instructions[] = {
+static struct ins instructions_x86[] = {
 	{ .name = "add",   .ops = &mov_ops,  },
 	{ .name = "addl",  .ops = &mov_ops,  },
 	{ .name = "addq",  .ops = &mov_ops,  },
 	{ .name = "addw",  .ops = &mov_ops,  },
 	{ .name = "and",   .ops = &mov_ops,  },
-#ifdef __arm__
-	{ .name = "b",     .ops = &jump_ops, }, // might also be a call
+	{ .name = "bts",   .ops = &mov_ops,  },
+	{ .name = "call",  .ops = &call_ops, },
+	{ .name = "callq", .ops = &call_ops, },
+	{ .name = "cmp",   .ops = &mov_ops,  },
+
[PATCH v7 4/6] perf annotate: Support jump instruction with target as second operand
Current perf is not able to parse a jump instruction when the second
operand contains the target address. Arches like powerpc have such
instructions, for example 'bne cr7,0xc00f6154'.

objdump output:

  c00f6140:  ld     r9,1032(r31)
  c00f6144:  cmpdi  cr7,r9,0
  c00f6148:  bne    cr7,0xc00f6154
  c00f614c:  ld     r9,2312(r30)
  c00f6150:  std    r9,1032(r31)
  c00f6154:  ld     r9,88(r31)

Before patch:

         ld     r9,1032(r31)
         cmpdi  cr7,r9,0
      v  bne    3ff09f2c
         ld     r9,2312(r30)
         std    r9,1032(r31)
  74:    ld     r9,88(r31)

After patch:

         ld     r9,1032(r31)
         cmpdi  cr7,r9,0
      v  bne    74
         ld     r9,2312(r30)
         std    r9,1032(r31)
  74:    ld     r9,88(r31)

Signed-off-by: Ravi Bangoria
---
Changes in v7:
  - Added example in description

 tools/perf/util/annotate.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 1ccf26a..a9dbac1 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -122,8 +122,12 @@ bool ins__is_call(const struct ins *ins)
 static int jump__parse(struct ins_operands *ops, struct map *map __maybe_unused)
 {
 	const char *s = strchr(ops->raw, '+');
+	const char *c = strchr(ops->raw, ',');

-	ops->target.addr = strtoull(ops->raw, NULL, 16);
+	if (c++ != NULL)
+		ops->target.addr = strtoull(c, NULL, 16);
+	else
+		ops->target.addr = strtoull(ops->raw, NULL, 16);

 	if (s++ != NULL)
 		ops->target.offset = strtoull(s, NULL, 16);
--
2.5.5
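The parsing rule the patch adds can be sketched in isolation: if the operand string contains a comma, the target address follows it; otherwise the operand itself is the target. This is a simplified model of the patched `jump__parse()` (names here are illustrative, not perf's exact code):

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned long long u64;

/* Sketch of the comma rule above: for powerpc forms such as
 * "bne cr7,0xc00f6154" the address follows the comma; for x86 forms
 * such as "jmpq 34ac0" the whole operand is the address. */
static u64 jump_target(const char *raw)
{
	const char *c = strchr(raw, ',');

	if (c != NULL)
		return strtoull(c + 1, NULL, 16);	/* second operand */

	return strtoull(raw, NULL, 16);			/* first operand */
}
```

Note that `strtoull(..., 16)` accepts both the bare `c00f6154` and the `0x`-prefixed form, so the same path handles either spelling.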
[PATCH v7 2/6] perf annotate: Add support for powerpc
From: "Naveen N. Rao"

Current perf can disassemble an annotated function, but it does not
have parsing logic for powerpc instructions, so none of the navigation
options are available for powerpc.

Apart from that, powerpc has a long list of branch instructions, and
hardcoding them in a table appears error-prone. So, add a function to
find the instruction instead of creating a table. This function
dynamically creates a table (a list of 'struct ins'), and instead of
creating an object every time, first checks whether the list already
contains an object for that instruction.

Signed-off-by: Naveen N. Rao
Signed-off-by: Ravi Bangoria
---
Changes in v7:
  - Slight change in initializing the instruction list.

 tools/perf/util/annotate.c | 112 +++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 816aa2c..5aa72d9 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -531,6 +531,11 @@ static struct ins instructions_arm[] = {
 	{ .name = "retq",  .ops = &ret_ops,  },
 };

+struct instructions_powerpc {
+	struct ins *ins;
+	struct list_head list;
+};
+
 static int ins__key_cmp(const void *name, const void *insp)
 {
 	const struct ins *ins = insp;
@@ -546,6 +551,111 @@ static int ins__cmp(const void *a, const void *b)
 	return strcmp(ia->name, ib->name);
 }

+static struct ins *list_add__ins_powerpc(struct instructions_powerpc *head,
+					 const char *name, struct ins_ops *ops)
+{
+	struct instructions_powerpc *ins_powerpc;
+	struct ins *ins;
+
+	ins = zalloc(sizeof(struct ins));
+	if (!ins)
+		return NULL;
+
+	ins_powerpc = zalloc(sizeof(struct instructions_powerpc));
+	if (!ins_powerpc)
+		goto out_free_ins;
+
+	ins->name = strdup(name);
+	if (!ins->name)
+		goto out_free_ins_power;
+
+	ins->ops = ops;
+	ins_powerpc->ins = ins;
+	list_add_tail(&(ins_powerpc->list), &(head->list));
+
+	return ins;
+
+out_free_ins_power:
+	zfree(&ins_powerpc);
+out_free_ins:
+	zfree(&ins);
+	return NULL;
+}
+
+static struct ins *list_search__ins_powerpc(struct instructions_powerpc *head,
+					    const char *name)
+{
+	struct instructions_powerpc *pos;
+
+	list_for_each_entry(pos, &head->list, list) {
+		if (!strcmp(pos->ins->name, name))
+			return pos->ins;
+	}
+	return NULL;
+}
+
+static struct ins *ins__find_powerpc(const char *name)
+{
+	int i;
+	struct ins *ins;
+	struct ins_ops *ops;
+	static struct instructions_powerpc head = {
+		.list = LIST_HEAD_INIT(head.list),
+	};
+
+	/*
+	 * - Interested only if instruction starts with 'b'.
+	 * - Few start with 'b', but aren't branch instructions.
+	 */
+	if (name[0] != 'b' ||
+	    !strncmp(name, "bcd", 3) ||
+	    !strncmp(name, "brinc", 5) ||
+	    !strncmp(name, "bper", 4))
+		return NULL;
+
+	/*
+	 * Return if we already have an object of 'struct ins' for this
+	 * instruction
+	 */
+	ins = list_search__ins_powerpc(&head, name);
+	if (ins)
+		return ins;
+
+	ops = &jump_ops;
+
+	i = strlen(name) - 1;
+	if (i < 0)
+		return NULL;
+
+	/* ignore optional hints at the end of the instructions */
+	if (name[i] == '+' || name[i] == '-')
+		i--;
+
+	if (name[i] == 'l' || (name[i] == 'a' && name[i-1] == 'l')) {
+		/*
+		 * if the instruction ends up with 'l' or 'la', then
+		 * those are considered 'calls' since they update LR.
+		 * ... except for 'bnl' which is branch if not less than
+		 * and the absolute form of the same.
+		 */
+		if (strcmp(name, "bnl") && strcmp(name, "bnl+") &&
+		    strcmp(name, "bnl-") && strcmp(name, "bnla") &&
+		    strcmp(name, "bnla+") && strcmp(name, "bnla-"))
+			ops = &call_ops;
+	}
+	if (name[i] == 'r' && name[i-1] == 'l')
+		/*
+		 * instructions ending with 'lr' are considered to be
+		 * return instructions
+		 */
+		ops = &ret_ops;
+
+	/*
+	 * Add instruction to list so next time no need to
+	 * allocate memory for i
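The suffix rules in `ins__find_powerpc()` above can be distilled into a small standalone classifier. This is a sketch under the same rules the patch describes (trailing `l`/`la` means call because it updates LR, trailing `lr` means return, the `bnl` family stays a plain branch, and `+`/`-` hints are ignored); it is illustrative, not perf's actual code:

```c
#include <string.h>

enum ins_kind { INS_JUMP, INS_CALL, INS_RET, INS_OTHER };

/* Classify a powerpc branch mnemonic by its suffix, mirroring the
 * rules described in the patch above (illustrative sketch). */
static enum ins_kind powerpc_branch_kind(const char *name)
{
	int i = strlen(name) - 1;

	if (i < 0 || name[0] != 'b')
		return INS_OTHER;

	/* ignore optional branch-prediction hints '+' / '-' */
	if (name[i] == '+' || name[i] == '-')
		i--;

	/* "...lr" (e.g. blr, bctrl would not match): return via LR */
	if (i > 0 && name[i] == 'r' && name[i - 1] == 'l')
		return INS_RET;

	/* "...l" or "...la": link forms update LR, i.e. calls --
	 * except "bnl" (branch if not less than) and its variants */
	if (name[i] == 'l' || (i > 0 && name[i] == 'a' && name[i - 1] == 'l')) {
		if (!strncmp(name, "bnl", 3))
			return INS_JUMP;
		return INS_CALL;
	}

	return INS_JUMP;
}
```

So `bctrl` classifies as a call, `blr` as a return, and `bne`/`bnl` as plain jumps, matching the behaviour the changelog describes.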
[PATCH v7 5/6] perf annotate: Fix jump target outside of function address range
If a jump target is outside of the function range, perf is not handling
it correctly. Especially when the target address is less than the
function start address, the target offset becomes negative. But the
target offset is declared unsigned, so the negative number is converted
to its 2's complement. See the example below: the target of the 'jmpq'
instruction at 34cf8 is 34ac0, which is less than the function start
address (34cf0).

  34ac0 - 34cf0 = -0x230 = 0xfffffffffffffdd0

Objdump output:

  0000000000034cf0 <__sigaction>:
  __GI___sigaction():
     34cf0: lea    -0x20(%rdi),%eax
     34cf3: cmp    $0x1,%eax
     34cf6: jbe    34d00 <__sigaction+0x10>
     34cf8: jmpq   34ac0 <__GI___libc_sigaction>
     34cfd: nopl   (%rax)
     34d00: mov    0x386161(%rip),%rax        # 3bae68 <_DYNAMIC+0x2e8>
     34d07: movl   $0x16,%fs:(%rax)
     34d0e: mov    $0xffffffff,%eax
     34d13: retq

perf annotate before applying patch:

  __GI___sigaction  /usr/lib64/libc-2.22.so
        lea    -0x20(%rdi),%eax
        cmp    $0x1,%eax
     v  jbe    10
     v  jmpq   fffffffffffffdd0
        nop
  10:   mov    _DYNAMIC+0x2e8,%rax
        movl   $0x16,%fs:(%rax)
        mov    $0xffffffff,%eax
        retq

perf annotate after applying patch:

  __GI___sigaction  /usr/lib64/libc-2.22.so
        lea    -0x20(%rdi),%eax
        cmp    $0x1,%eax
     v  jbe    10
     ^  jmpq   34ac0 <__GI___libc_sigaction>
        nop
  10:   mov    _DYNAMIC+0x2e8,%rax
        movl   $0x16,%fs:(%rax)
        mov    $0xffffffff,%eax
        retq

Signed-off-by: Ravi Bangoria
---
Changes in v7:
  - No changes

 tools/perf/ui/browsers/annotate.c |  5 +++--
 tools/perf/util/annotate.c        | 14 +++++++++-----
 tools/perf/util/annotate.h        |  5 +++--
 3 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/tools/perf/ui/browsers/annotate.c b/tools/perf/ui/browsers/annotate.c
index 214a14a..2d04bdf 100644
--- a/tools/perf/ui/browsers/annotate.c
+++ b/tools/perf/ui/browsers/annotate.c
@@ -215,7 +215,7 @@ static void annotate_browser__write(struct ui_browser *browser, void *entry, int
 		ui_browser__set_color(browser, color);
 	if (dl->ins && dl->ins->ops->scnprintf) {
 		if (ins__is_jump(dl->ins)) {
-			bool fwd = dl->ops.target.offset > (u64)dl->offset;
+			bool fwd = dl->ops.target.offset > dl->offset;

 			ui_browser__write_graph(browser, fwd ?
						SLSMG_DARROW_CHAR :
						SLSMG_UARROW_CHAR);
@@ -245,7 +245,8 @@ static bool disasm_line__is_valid_jump(struct disasm_line *dl, struct symbol *sym)
 {
 	if (!dl || !dl->ins || !ins__is_jump(dl->ins)
 	    || !disasm_line__has_offset(dl)
-	    || dl->ops.target.offset >= symbol__size(sym))
+	    || dl->ops.target.offset < 0
+	    || dl->ops.target.offset >= (s64)symbol__size(sym))
 		return false;

 	return true;

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index a9dbac1..fc44dd1 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -129,10 +129,12 @@ static int jump__parse(struct ins_operands *ops, struct map *map __maybe_unused)
 	else
 		ops->target.addr = strtoull(ops->raw, NULL, 16);

-	if (s++ != NULL)
+	if (s++ != NULL) {
 		ops->target.offset = strtoull(s, NULL, 16);
-	else
-		ops->target.offset = UINT64_MAX;
+		ops->target.offset_avail = true;
+	} else {
+		ops->target.offset_avail = false;
+	}

 	return 0;
 }
@@ -140,7 +142,7 @@ static int jump__parse(struct ins_operands *ops, struct map *map __maybe_unused)
 static int jump__scnprintf(struct ins *ins, char *bf, size_t size,
			   struct ins_operands *ops)
 {
-	if (!ops->target.addr)
+	if (!ops->target.addr || ops->target.offset < 0)
 		return ins__raw_scnprintf(ins, bf, size, ops);

 	return scnprintf(bf, size, "%-6.6s %" PRIx64, ins->name, ops->target.offset);
@@ -1373,9 +1375,11 @@ static int symbol__parse_objdump_line(struct symbol *sym, struct map *map,
 	if (dl == NULL)
 		return -1;

-	if (dl->ops.target.offset == UINT64_MAX)
+	if (!disasm_line__has_offset(dl)) {
 		dl->ops.target.offset = dl->ops.target.addr -
					map__rip_2objdump(map, sym->start);
+		dl->ops.target.offset_avail = true;
+	}

 	/* kcore has no symbols, so add the call target name */
 	if (dl->ins && ins__is_call(dl->ins) && !dl->ops.target.n
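The wraparound the changelog computes by hand (34ac0 - 34cf0 = -0x230 = 0xfffffffffffffdd0) and the signed range check the patch introduces can be demonstrated in a few lines. This is a simplified, hypothetical model of the fixed `disasm_line__is_valid_jump()` logic, not perf's actual code:

```c
#include <stdint.h>

/* A target below the symbol start produces a negative offset; stored
 * unsigned it wraps to a huge value (e.g. 0xfffffffffffffdd0) and
 * defeats a plain "offset >= size" check.  Treating the offset as
 * signed, as the patch does, rejects it cleanly. */
static int jump_offset_in_range(uint64_t target, uint64_t sym_start,
				uint64_t sym_size)
{
	int64_t offset = (int64_t)(target - sym_start);	/* may be < 0 */

	return offset >= 0 && offset < (int64_t)sym_size;
}
```

With the addresses from the changelog, the backward jump at 34cf8 to 34ac0 is now reported as out of range rather than as a bogus forward jump to fffffffffffffdd0.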
[PATCH v7 3/6] perf annotate: Show raw form for jump instruction with indirect target
For jump instructions that do not include the target address as a
direct operand, use the raw form instead. This is needed for certain
powerpc jump instructions that use a target address in a register (such
as bctr, btar, ...).

Before:
         ld     r12,32088(r12)
         mtctr  r12
      v  bctr   ca2c
         std    r2,24(r1)
         addis  r12,r2,-1

After:
         ld     r12,32088(r12)
         mtctr  r12
      v  bctr
         std    r2,24(r1)
         addis  r12,r2,-1

Suggested-by: Michael Ellerman
Signed-off-by: Ravi Bangoria
---
Changes in v7:
  - Added example in description

 tools/perf/util/annotate.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 5aa72d9..1ccf26a 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -136,6 +136,9 @@ static int jump__parse(struct ins_operands *ops, struct map *map __maybe_unused)
 static int jump__scnprintf(struct ins *ins, char *bf, size_t size,
			   struct ins_operands *ops)
 {
+	if (!ops->target.addr)
+		return ins__raw_scnprintf(ins, bf, size, ops);
+
 	return scnprintf(bf, size, "%-6.6s %" PRIx64, ins->name, ops->target.offset);
 }
--
2.5.5
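The fallback the patch adds is a two-branch decision: no numeric target parsed (e.g. `bctr`, which jumps through CTR) means print the raw operands; otherwise print the computed offset. A minimal stand-in sketch, with a hypothetical struct in place of perf's `ins_operands`:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the relevant ins_operands fields. */
struct ops {
	unsigned long long target_addr;		/* 0 => indirect target */
	unsigned long long target_offset;
	const char *raw;			/* raw operand text */
};

/* Sketch of the patched jump__scnprintf(): fall back to the raw form
 * for register-indirect jumps instead of printing a bogus offset. */
static int jump_snprintf(char *bf, size_t size, const char *name,
			 const struct ops *ops)
{
	if (!ops->target_addr)
		return snprintf(bf, size, "%-6.6s %s", name, ops->raw);

	return snprintf(bf, size, "%-6.6s %llx", name, ops->target_offset);
}
```

This is why `bctr ca2c` in the "Before" listing (a leftover offset) becomes plain `bctr` after the patch.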
[PATCH v7 6/6] perf annotate: cross arch annotate support fixes for ARM
From: Kim Phillips

For ARM we remove the list that contains non-arm insns, and instead add
more maintainable branch instruction regex logic.

Signed-off-by: Kim Phillips
Signed-off-by: Ravi Bangoria
---
Changes in v7:
  - Slight change in initializing the instruction list.

 tools/perf/util/annotate.c | 177 ++++++++++++-------------------------
 1 file changed, 65 insertions(+), 112 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index fc44dd1..83d5ac8 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -28,6 +28,7 @@ const char	*disassembler_style;
 const char	*objdump_path;
 static regex_t	 file_lineno;
 static char	*norm_arch;
+static regex_t	 arm_call_insn, arm_jump_insn;

 static struct ins *ins__find(const char *name);
 static int disasm_line__parse(char *line, char **namep, char **rawp);
@@ -449,98 +450,7 @@ static struct ins instructions_x86[] = {
 	{ .name = "retq",  .ops = &ret_ops,  },
 };

-static struct ins instructions_arm[] = {
-	{ .name = "add",    .ops = &mov_ops,  },
-	{ .name = "addl",   .ops = &mov_ops,  },
-	{ .name = "addq",   .ops = &mov_ops,  },
-	{ .name = "addw",   .ops = &mov_ops,  },
-	{ .name = "and",    .ops = &mov_ops,  },
-	{ .name = "b",      .ops = &jump_ops, }, /* might also be a call */
-	{ .name = "bcc",    .ops = &jump_ops, },
-	{ .name = "bcs",    .ops = &jump_ops, },
-	{ .name = "beq",    .ops = &jump_ops, },
-	{ .name = "bge",    .ops = &jump_ops, },
-	{ .name = "bgt",    .ops = &jump_ops, },
-	{ .name = "bhi",    .ops = &jump_ops, },
-	{ .name = "bl",     .ops = &call_ops, },
-	{ .name = "bls",    .ops = &jump_ops, },
-	{ .name = "blt",    .ops = &jump_ops, },
-	{ .name = "blx",    .ops = &call_ops, },
-	{ .name = "bne",    .ops = &jump_ops, },
-	{ .name = "bts",    .ops = &mov_ops,  },
-	{ .name = "call",   .ops = &call_ops, },
-	{ .name = "callq",  .ops = &call_ops, },
-	{ .name = "cmp",    .ops = &mov_ops,  },
-	{ .name = "cmpb",   .ops = &mov_ops,  },
-	{ .name = "cmpl",   .ops = &mov_ops,  },
-	{ .name = "cmpq",   .ops = &mov_ops,  },
-	{ .name = "cmpw",   .ops = &mov_ops,  },
-	{ .name = "cmpxch", .ops = &mov_ops,  },
-	{ .name = "dec",    .ops = &dec_ops,  },
-	{ .name = "decl",   .ops = &dec_ops,  },
-	{ .name = "imul",   .ops = &mov_ops,  },
-	{ .name = "inc",    .ops = &dec_ops,  },
-	{ .name = "incl",   .ops = &dec_ops,  },
-	{ .name = "ja",     .ops = &jump_ops, },
-	{ .name = "jae",    .ops = &jump_ops, },
-	{ .name = "jb",     .ops = &jump_ops, },
-	{ .name = "jbe",    .ops = &jump_ops, },
-	{ .name = "jc",     .ops = &jump_ops, },
-	{ .name = "jcxz",   .ops = &jump_ops, },
-	{ .name = "je",     .ops = &jump_ops, },
-	{ .name = "jecxz",  .ops = &jump_ops, },
-	{ .name = "jg",     .ops = &jump_ops, },
-	{ .name = "jge",    .ops = &jump_ops, },
-	{ .name = "jl",     .ops = &jump_ops, },
-	{ .name = "jle",    .ops = &jump_ops, },
-	{ .name = "jmp",    .ops = &jump_ops, },
-	{ .name = "jmpq",   .ops = &jump_ops, },
-	{ .name = "jna",    .ops = &jump_ops, },
-	{ .name = "jnae",   .ops = &jump_ops, },
-	{ .name = "jnb",    .ops = &jump_ops, },
-	{ .name = "jnbe",   .ops = &jump_ops, },
-	{ .name = "jnc",    .ops = &jump_ops, },
-	{ .name = "jne",    .ops = &jump_ops, },
-	{ .name = "jng",    .ops = &jump_ops, },
-	{ .name = "jnge",   .ops = &jump_ops, },
-	{ .name = "jnl",    .ops = &jump_ops, },
-	{ .name = "jnle",   .ops = &jump_ops, },
-	{ .name = "jno",    .ops = &jump_ops, },
-	{ .name = "jnp",    .ops = &jump_ops, },
-	{ .name = "jns",    .ops = &jump_ops, },
-	{ .name = "jnz",    .ops = &jump_ops, },
-	{ .name = "jo",     .ops = &jump_ops, },
-	{ .name = "jp",     .ops = &jump_ops, },
-	{ .name = "jpe",    .ops = &jump_ops, },
-	{ .name = "
Re: [PATCH v7 0/6] perf annotate: Cross arch support + few fixes
On Thursday 22 September 2016 01:04 AM, Kim Phillips wrote:
> On Wed, 21 Sep 2016 21:17:50 +0530
> Ravi Bangoria wrote:
>
>> Kim, I don't have arm test machine. Can you please help me to test
>> this on arm.
>
> This works for me: hitting return on return instructions yields
> "Invalid jump offset", but I'll get that later.

Thanks Kim.

Hmm.. so ins__find_arm does not contain logic for return instructions.
Navigation with return instructions is working fine for x86 and powerpc.

-Ravi

> Thanks,
>
> Kim
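The regex approach the ARM patch switches to (the exact patterns are truncated out of this excerpt, so the expression below is an assumption for illustration) replaces the per-mnemonic table with compiled POSIX regexes matched against the instruction name:

```c
#include <regex.h>
#include <string.h>

/* Illustrative only -- not the patch's actual regex.  Matches "b"
 * optionally followed by a link/exchange modifier (l, lx, x) and an
 * ARM condition-code suffix, the way the patch classifies branches
 * with regexec() instead of a table lookup. */
static int is_arm_branch(const char *name)
{
	regex_t re;
	int hit;

	if (regcomp(&re,
		    "^b(l|lx|x)?(eq|ne|cs|cc|mi|pl|vs|vc|hi|ls|ge|lt|gt|le|al)?$",
		    REG_EXTENDED | REG_NOSUB))
		return 0;	/* treat a bad pattern as "not a branch" */

	hit = regexec(&re, name, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}
```

In the patch the regexes are compiled once into the static `arm_call_insn`/`arm_jump_insn` variables rather than per call; compiling inline here just keeps the sketch self-contained. Note this sketch, like the patch Kim tested, has no rule for return instructions (`bx lr` is written as two tokens), which matches the "Invalid jump offset" behaviour reported above.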
Re: [PATCH v7 0/6] perf annotate: Cross arch support + few fixes
Hello,

Any updates? Arnaldo, if the patches look good to you, can you please
pick them up?

-Ravi

On Wednesday 21 September 2016 09:17 PM, Ravi Bangoria wrote:
> Currently perf annotate supports code navigation (branches and calls)
> only when run on the same architecture where perf.data was recorded.
> But, for example, record on powerpc server and annotate on client's
> x86 desktop is not supported.
>
> This patchset adds support for that.
>
> Example:
>
> Record on powerpc:
> $ ./perf record -a
>
> Report -> Annotate on x86:
> $ ./perf report -i perf.data.powerpc --vmlinux vmlinux.powerpc
>
> Changes in v7:
>   - Using string for normalized arch names instead of macros (i.e.
>     removed patch 1/7 of v6)
>   - In patch 1/6, make norm_arch as global var instead of passing them
>     to each parser.
>   - In patch 1/6 and 6/6, little bit change in initializing instruction
>     list.
>   - patch 4/7 of v6 is already accepted. Removed that in v7.
>   - Address other review comments.
>   - Added more examples in patch descriptions.
>
> v6 link:
> https://lkml.org/lkml/2016/8/19/411
>
> Kim, I don't have arm test machine. Can you please help me to test
> this on arm.
>
>
> Kim Phillips (1):
>   perf annotate: cross arch annotate support fixes for ARM
>
> Naveen N. Rao (1):
>   perf annotate: Add support for powerpc
>
> Ravi Bangoria (4):
>   perf annotate: Add cross arch annotate support
>   perf annotate: Show raw form for jump instruction with indirect target
>   perf annotate: Support jump instruction with target as second operand
>   perf annotate: Fix jump target outside of function address range
>
>  tools/perf/builtin-top.c          |   2 +-
>  tools/perf/ui/browsers/annotate.c |   8 +-
>  tools/perf/ui/gtk/annotate.c      |   2 +-
>  tools/perf/util/annotate.c        | 259 ++++++++++++++++++++++++-----
>  tools/perf/util/annotate.h        |   8 +-
>  5 files changed, 232 insertions(+), 47 deletions(-)
>
[RFC] perf uprobe: Skip prologue if program compiled without optimization
Function prologue prepares stack and registers before executing
function logic. When the target program is compiled without
optimization, function parameter information is only valid after the
prologue. When we probe the entrypc of the function and try to record
a function parameter, it contains a garbage value. For example,

  $ vim test.c

  #include <stdio.h>

  void foo(int i)
  {
          printf("i: %d\n", i);
  }

  int main()
  {
          foo(42);
          return 0;
  }

  $ gcc -g test.c -o test
  $ objdump -dl test | less

  foo():
  /home/ravi/test.c:4
    400536:  55                    push   %rbp
    400537:  48 89 e5              mov    %rsp,%rbp
    40053a:  48 83 ec 10           sub    $0x10,%rsp
    40053e:  89 7d fc              mov    %edi,-0x4(%rbp)
  /home/ravi/test.c:5
    400541:  8b 45 fc              mov    -0x4(%rbp),%eax
    ...
    ...
  main():
  /home/ravi/test.c:9
    400558:  55                    push   %rbp
    400559:  48 89 e5              mov    %rsp,%rbp
  /home/ravi/test.c:10
    40055c:  bf 2a 00 00 00        mov    $0x2a,%edi
    400561:  e8 d0 ff ff ff        callq  400536
  /home/ravi/test.c:11

  $ ./perf probe -x ./test 'foo i'
  $ cat /sys/kernel/debug/tracing/uprobe_events
  p:probe_test/foo /home/ravi/test:0x0536 i=-12(%sp):s32

  $ ./perf record -e probe_test:foo ./test
  $ ./perf script
  test  5778 [001]  4918.562027: probe_test:foo: (400536) i=0

Here variable 'i' is passed via the stack, and is pushed onto the stack
at 0x40053e. But we are probing at 0x400536. To resolve this issue, we
need to probe on the next instruction after the prologue. gdb and
systemtap also do the same thing. I've implemented this patch based on
the approach systemtap uses.

After applying the patch:

  $ ./perf probe -x ./test 'foo i'
  $ cat /sys/kernel/debug/tracing/uprobe_events
  p:probe_test/foo /home/ravi/test:0x0541 i=-4(%bp):s32

  $ ./perf record -e probe_test:foo ./test
  $ ./perf script
  test  6300 [001]  5877.879327: probe_test:foo: (400541) i=42

No need to skip the prologue for the optimized case, since debug info
is correct for each instruction with -O2 -g.

For more details please visit:
https://bugzilla.redhat.com/show_bug.cgi?id=612253#c6

Signed-off-by: Ravi Bangoria
---
 tools/perf/util/probe-finder.c | 156 +++++++++++++++++++++++++++++++++
 1 file changed, 156 insertions(+)

diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c
index f2d9ff0..a788b9c2 100644
--- a/tools/perf/util/probe-finder.c
+++ b/tools/perf/util/probe-finder.c
@@ -892,6 +892,161 @@ static int find_probe_point_lazy(Dwarf_Die *sp_die, struct probe_finder *pf)
 	return die_walk_lines(sp_die, probe_point_lazy_walker, pf);
 }

+static bool var_has_loclist(Dwarf_Die *die)
+{
+	Dwarf_Attribute loc;
+	int tag = dwarf_tag(die);
+
+	if (tag != DW_TAG_formal_parameter &&
+	    tag != DW_TAG_variable)
+		return false;
+
+	return (dwarf_attr_integrate(die, DW_AT_location, &loc) &&
+		dwarf_whatform(&loc) == DW_FORM_sec_offset);
+}
+
+/*
+ * For any object in given CU whose DW_AT_location is a location list,
+ * target program is compiled with optimization.
+ */
+static bool optimized_target(Dwarf_Die *die)
+{
+	Dwarf_Die tmp_die;
+
+	if (var_has_loclist(die))
+		return true;
+
+	if (!dwarf_child(die, &tmp_die) && optimized_target(&tmp_die))
+		return true;
+
+	if (!dwarf_siblingof(die, &tmp_die) && optimized_target(&tmp_die))
+		return true;
+
+	return false;
+}
+
+static bool get_entrypc_idx(Dwarf_Lines *lines, unsigned long nr_lines,
+			    Dwarf_Addr pf_addr, unsigned long *entrypc_idx)
+{
+	unsigned long i;
+	Dwarf_Addr addr;
+
+	for (i = 0; i < nr_lines; i++) {
+		if (dwarf_lineaddr(dwarf_onesrcline(lines, i), &addr))
+			return false;
+
+		if (addr == pf_addr) {
+			*entrypc_idx = i;
+			return true;
+		}
+	}
+	return false;
+}
+
+static bool get_postprologue_addr(unsigned long entrypc_idx,
+				  Dwarf_Lines *lines,
+				  unsigned long nr_lines,
+				  Dwarf_Addr highpc,
+				  Dwarf_Addr *postprologue_addr)
+{
+	unsigned long i;
+	int entrypc_lno, lno;
+	Dwarf_Line *line;
+	Dwarf_Addr addr;
+	bool p_end;
+
+	/* entrypc_lno is actual source line number */
+	line = dwarf_onesrcline(lines, entrypc_idx);
+	if (dwarf_lineno(line, &entrypc_lno))
+		return false;
+
+	for (i = entrypc_idx; i < n
[PATCH] perf uprobe: Skip prologue if program compiled without optimization
Function prologue prepares stack and registers before executing function logic. When target program is compiled without optimization, function parameter information is only valid after prologue. When we probe entrypc of the function, and try to record function parameter, it contains garbage value. For example, $ vim test.c #include void foo(int i) { printf("i: %d\n", i); } int main() { foo(42); return 0; } $ gcc -g test.c -o test $ objdump -dl test | less foo(): /home/ravi/test.c:4 400536: 55 push %rbp 400537: 48 89 e5mov%rsp,%rbp 40053a: 48 83 ec 10 sub-bashx10,%rsp 40053e: 89 7d fcmov%edi,-0x4(%rbp) /home/ravi/test.c:5 400541: 8b 45 fcmov-0x4(%rbp),%eax ... ... main(): /home/ravi/test.c:9 400558: 55 push %rbp 400559: 48 89 e5mov%rsp,%rbp /home/ravi/test.c:10 40055c: bf 2a 00 00 00 mov-bashx2a,%edi 400561: e8 d0 ff ff ff callq 400536 /home/ravi/test.c:11 $ ./perf probe -x ./test 'foo i' $ cat /sys/kernel/debug/tracing/uprobe_events p:probe_test/foo /home/ravi/test:0x0536 i=-12(%sp):s32 $ ./perf record -e probe_test:foo ./test $ ./perf script test 5778 [001] 4918.562027: probe_test:foo: (400536) i=0 Here variable 'i' is passed via stack which is pushed on stack at 0x40053e. But we are probing at 0x400536. To resolve this issues, we need to probe on next instruction after prologue. gdb and systemtap also does same thing. I've implemented this patch based on approach systemtap has used. After applying patch: $ ./perf probe -x ./test 'foo i' $ cat /sys/kernel/debug/tracing/uprobe_events p:probe_test/foo /home/ravi/test:0x0541 i=-4(%bp):s32 $ ./perf record -e probe_test:foo ./test $ ./perf script test 6300 [001] 5877.879327: probe_test:foo: (400541) i=42 No need to skip prologue for optimized case since debug info is correct for each instructions for -O2 -g. 
For more details please visit: https://bugzilla.redhat.com/show_bug.cgi?id=612253#c6 Signed-off-by: Ravi Bangoria --- Changes wrt RFC: - Skip prologue only when function parameter is specified - Notify user about skipping prologue tools/perf/util/probe-finder.c | 164 + 1 file changed, 164 insertions(+) diff --git a/tools/perf/util/probe-finder.c b/tools/perf/util/probe-finder.c index f2d9ff0..8efa7f2 100644 --- a/tools/perf/util/probe-finder.c +++ b/tools/perf/util/probe-finder.c @@ -892,6 +892,169 @@ static int find_probe_point_lazy(Dwarf_Die *sp_die, struct probe_finder *pf) return die_walk_lines(sp_die, probe_point_lazy_walker, pf); } +static bool var_has_loclist(Dwarf_Die *die) +{ + Dwarf_Attribute loc; + int tag = dwarf_tag(die); + + if (tag != DW_TAG_formal_parameter && + tag != DW_TAG_variable) + return false; + + return (dwarf_attr_integrate(die, DW_AT_location, &loc) && + dwarf_whatform(&loc) == DW_FORM_sec_offset); +} + +/* + * For any object in given CU whose DW_AT_location is a location list, + * target program is compiled with optimization. 
+ */
+static bool optimized_target(Dwarf_Die *die)
+{
+	Dwarf_Die tmp_die;
+
+	if (var_has_loclist(die))
+		return true;
+
+	if (!dwarf_child(die, &tmp_die) && optimized_target(&tmp_die))
+		return true;
+
+	if (!dwarf_siblingof(die, &tmp_die) && optimized_target(&tmp_die))
+		return true;
+
+	return false;
+}
+
+static bool get_entrypc_idx(Dwarf_Lines *lines, unsigned long nr_lines,
+			    Dwarf_Addr pf_addr, unsigned long *entrypc_idx)
+{
+	unsigned long i;
+	Dwarf_Addr addr;
+
+	for (i = 0; i < nr_lines; i++) {
+		if (dwarf_lineaddr(dwarf_onesrcline(lines, i), &addr))
+			return false;
+
+		if (addr == pf_addr) {
+			*entrypc_idx = i;
+			return true;
+		}
+	}
+	return false;
+}
+
+static bool get_postprologue_addr(unsigned long entrypc_idx,
+				  Dwarf_Lines *lines,
+				  unsigned long nr_lines,
+				  Dwarf_Addr highpc,
+				  Dwarf_Addr *postprologue_addr)
+{
+	unsigned long i;
+	int entrypc_lno, lno;
+	Dwarf_Line *line;
+	Dwarf_Addr addr;
+	bool p_end;
+
+	/* entrypc_lno is actual source line number */
+	line = dwarf_onesrcline(lines, entrypc_idx);
+	if (dwa
Re: [RFC] perf uprobe: Skip prologue if program compiled without optimization
On Saturday 30 July 2016 08:34 AM, Masami Hiramatsu wrote:

  On Thu, 28 Jul 2016 20:01:51 +0530
  Ravi Bangoria wrote:

    Function prologue prepares stack and registers before executing
    function logic. When target program is compiled without optimization,
    function parameter information is only valid after prologue. When we
    probe entrypc of the function, and try to record function parameter,
    it contains garbage value.

  Right! :)

Thanks Masami for the review. I've sent a patch with the changes you
suggested. Please review it.

-Ravi
Re: [PATCH] perf/powerpc: Fix build-test failure
Hi Arnaldo,

Can you please pick this up?

-Ravi

On Wednesday 31 August 2016 01:33 PM, Ravi Bangoria wrote:
> 'make -C tools/perf build-test' is failing with the below log for powerpc.
>
>   In file included from
>   /tmp/tmp.3eEwmGlYaF/perf-4.8.0-rc4/tools/perf/perf.h:15:0,
>                    from util/cpumap.h:8,
>                    from util/env.c:1:
>   /tmp/tmp.3eEwmGlYaF/perf-4.8.0-rc4/tools/perf/perf-sys.h:23:56:
>   fatal error: ../../arch/powerpc/include/uapi/asm/unistd.h: No such file
>   or directory
>   compilation terminated.
>
> I bisected it and found it started failing from commit ad430729ae00
> ("Remove: kernel unistd*h files from perf's MANIFEST, not used").
>
> Header file '../../arch/powerpc/include/uapi/asm/unistd.h' is included
> only for powerpc in tools/perf/perf-sys.h.
>
> Looking closely at the commit history, I found something a little odd:
>
> Commit f2d9cae9ea9e ("perf powerpc: Use uapi/unistd.h to fix build
> error") replaced 'asm/unistd.h' with 'uapi/asm/unistd.h'.
>
> Commit d2709c7ce4c5 ("perf: Make perf build for x86 with UAPI
> disintegration applied") removed the arch-specific 'uapi/asm/unistd.h'
> includes for all archs and added a generic one.
>
> Commit f0b9abfb0446 ("Merge branch 'linus' into perf/core") again
> includes 'uapi/asm/unistd.h' for powerpc. I don't know how exactly this
> happened, as this change is not part of the commit either.
>
> Signed-off-by: Ravi Bangoria
> ---
>  tools/perf/perf-sys.h | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/tools/perf/perf-sys.h b/tools/perf/perf-sys.h
> index 7ed72a4..e4b717e 100644
> --- a/tools/perf/perf-sys.h
> +++ b/tools/perf/perf-sys.h
> @@ -20,7 +20,6 @@
>  #endif
>
>  #ifdef __powerpc__
> -#include "../../arch/powerpc/include/uapi/asm/unistd.h"
>  #define CPUINFO_PROC	{"cpu"}
>  #endif
>
Re: [PATCH v6 0/7] perf: Cross arch annotate + few miscellaneous fixes
Hello,

Any update on this?

-Ravi

On Friday 19 August 2016 06:29 PM, Ravi Bangoria wrote:
> Currently perf annotate supports code navigation (branches and calls)
> only when run on the same architecture where perf.data was recorded.
> So, for example, recording on a powerpc server and annotating on a
> client's x86 desktop is not supported.
>
> This patchset enables cross-arch annotate. Currently I've used the x86
> and arm instructions which are already available, and added support
> for powerpc.
>
> Additionally, this patch series also contains a few other related fixes.
>
> Patches are prepared on top of acme/perf/core and tested with x86
> and powerpc only.
>
> Note for arm:
> I don't have an arm test machine. As suggested by Russell in one of the
> review comments, I've copied all instructions from the default table to
> the arm table. This way it won't break the tool on arm, but cleanup is
> needed for the x86-specific instructions added to the arm table.
>
> Example:
>
> Record on powerpc:
> $ ./perf record -a
>
> Report -> Annotate on x86:
> $ ./perf report -i perf.data.powerpc --vmlinux vmlinux.powerpc
>
> Changes in v6:
>   - Instead of adding only those instructions defined in #ifdef __arm__,
>     add all instructions from the default table to the arm table.
>
> v5 link:
> https://lkml.org/lkml/2016/8/19/35
>
> Naveen N. Rao (1):
>   perf annotate: Add support for powerpc
>
> Ravi Bangoria (6):
>   perf: Define macro for normalized arch names
>   perf annotate: Add cross arch annotate support
>   perf annotate: Do not ignore call instruction with indirect target
>   perf annotate: Show raw form for jump instruction with indirect target
>   perf annotate: Support jump instruction with target as second operand
>   perf annotate: Fix jump target outside of function address range
>
>  tools/perf/arch/common.c           |  36 ++--
>  tools/perf/arch/common.h           |  11 ++
>  tools/perf/builtin-top.c           |   2 +-
>  tools/perf/ui/browsers/annotate.c  |   8 +-
>  tools/perf/ui/gtk/annotate.c       |   2 +-
>  tools/perf/util/annotate.c         | 330 +++--
>  tools/perf/util/annotate.h         |  10 +-
>  tools/perf/util/unwind-libunwind.c |   4 +-
>  8 files changed, 327 insertions(+), 76 deletions(-)
>
[PATCH] perf/powerpc: Fix build-test failure
'make -C tools/perf build-test' is failing with the below log for powerpc.

  In file included from
  /tmp/tmp.3eEwmGlYaF/perf-4.8.0-rc4/tools/perf/perf.h:15:0,
                   from util/cpumap.h:8,
                   from util/env.c:1:
  /tmp/tmp.3eEwmGlYaF/perf-4.8.0-rc4/tools/perf/perf-sys.h:23:56:
  fatal error: ../../arch/powerpc/include/uapi/asm/unistd.h: No such file
  or directory
  compilation terminated.

I bisected it and found it started failing from commit ad430729ae00
("Remove: kernel unistd*h files from perf's MANIFEST, not used").

Header file '../../arch/powerpc/include/uapi/asm/unistd.h' is included
only for powerpc in tools/perf/perf-sys.h.

Looking closely at the commit history, I found something a little odd:

Commit f2d9cae9ea9e ("perf powerpc: Use uapi/unistd.h to fix build
error") replaced 'asm/unistd.h' with 'uapi/asm/unistd.h'.

Commit d2709c7ce4c5 ("perf: Make perf build for x86 with UAPI
disintegration applied") removed the arch-specific 'uapi/asm/unistd.h'
includes for all archs and added a generic one.

Commit f0b9abfb0446 ("Merge branch 'linus' into perf/core") again
includes 'uapi/asm/unistd.h' for powerpc. I don't know how exactly this
happened, as this change is not part of the commit either.

Signed-off-by: Ravi Bangoria
---
 tools/perf/perf-sys.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/tools/perf/perf-sys.h b/tools/perf/perf-sys.h
index 7ed72a4..e4b717e 100644
--- a/tools/perf/perf-sys.h
+++ b/tools/perf/perf-sys.h
@@ -20,7 +20,6 @@
 #endif
 
 #ifdef __powerpc__
-#include "../../arch/powerpc/include/uapi/asm/unistd.h"
 #define CPUINFO_PROC	{"cpu"}
 #endif
-- 
2.4.11
[PATCH 2/2] perf ppc64le: Fix probe location when using DWARF
Powerpc has a Global Entry Point (GEP) and a Local Entry Point (LEP) for
functions. The LEP catches calls made via both the GEP and the LEP. The
ELF symbol table contains the GEP and an offset from which we can
calculate the LEP, but debuginfo does not have LEP info.

Currently, perf prioritizes the symbol table over DWARF in order to
probe on the LEP for ppc64le. But when the user tries to probe with a
function parameter, we fall back to using DWARF (i.e. the GEP), and when
the function is called via the LEP, the probe will never be hit. For
example:

  $ objdump -d vmlinux
  ...
  do_sys_open():
  c02eb4a0:  e8 00 4c 3c  addis   r2,r12,232
  c02eb4a4:  60 00 42 38  addi    r2,r2,96
  c02eb4a8:  a6 02 08 7c  mflr    r0
  c02eb4ac:  d0 ff 41 fb  std     r26,-48(r1)

  $ sudo ./perf probe do_sys_open
  $ sudo cat /sys/kernel/debug/tracing/kprobe_events
  p:probe/do_sys_open _text+3060904

  $ sudo ./perf probe 'do_sys_open filename:string'
  $ sudo cat /sys/kernel/debug/tracing/kprobe_events
  p:probe/do_sys_open _text+3060896 filename_string=+0(%gpr4):string

In the second case, perf probed on the GEP. So when the function is
called via the LEP, the probe won't be hit:

  $ sudo ./perf record -a -e probe:do_sys_open ls
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.195 MB perf.data ]

To resolve this issue, let's not prioritize the symbol table; let perf
decide what it wants to use. Perf already converts the GEP to the LEP
when it uses the symbol table. When perf uses debuginfo, let it find the
LEP offset from the symbol table. This way we fall back to probing on
the LEP in all cases.
After the patch:

  $ sudo ./perf probe 'do_sys_open filename:string'
  $ sudo cat /sys/kernel/debug/tracing/kprobe_events
  p:probe/do_sys_open _text+3060904 filename_string=+0(%gpr4):string

  $ sudo ./perf record -a -e probe:do_sys_open ls
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.197 MB perf.data (11 samples) ]

Signed-off-by: Ravi Bangoria
---
 tools/perf/arch/powerpc/util/sym-handling.c | 27 +
 tools/perf/util/probe-event.c               | 37 -
 tools/perf/util/probe-event.h               |  6 -
 3 files changed, 49 insertions(+), 21 deletions(-)

diff --git a/tools/perf/arch/powerpc/util/sym-handling.c b/tools/perf/arch/powerpc/util/sym-handling.c
index c6d0f91..8d4dc97 100644
--- a/tools/perf/arch/powerpc/util/sym-handling.c
+++ b/tools/perf/arch/powerpc/util/sym-handling.c
@@ -54,10 +54,6 @@ int arch__compare_symbol_names(const char *namea, const char *nameb)
 #endif
 
 #if defined(_CALL_ELF) && _CALL_ELF == 2
-bool arch__prefers_symtab(void)
-{
-	return true;
-}
 
 #ifdef HAVE_LIBELF_SUPPORT
 void arch__sym_update(struct symbol *s, GElf_Sym *sym)
@@ -100,4 +96,27 @@ void arch__fix_tev_from_maps(struct perf_probe_event *pev,
 		tev->point.offset += lep_offset;
 	}
 }
+
+void arch__post_process_probe_trace_events(struct perf_probe_event *pev,
+					   int ntevs)
+{
+	struct probe_trace_event *tev;
+	struct map *map;
+	struct symbol *sym = NULL;
+	struct rb_node *tmp;
+	int i = 0;
+
+	map = get_target_map(pev->target, pev->uprobes);
+	if (!map || map__load(map, NULL) < 0)
+		return;
+
+	for (i = 0; i < ntevs; i++) {
+		tev = &pev->tevs[i];
+		map__for_each_symbol(map, sym, tmp) {
+			if (map->unmap_ip(map, sym->start) == tev->point.address)
+				arch__fix_tev_from_maps(pev, tev, map, sym);
+		}
+	}
+}
+
 #endif
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 4e215e7..5efa535 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -178,7 +178,7 @@ static struct map *kernel_get_module_map(const char *module)
 	return NULL;
 }
 
-static struct map *get_target_map(const char *target, bool user)
+struct map *get_target_map(const char *target, bool user)
 {
 	/* Init maps of given executable or kernel */
 	if (user)
@@ -703,19 +703,32 @@ post_process_kernel_probe_trace_events(struct probe_trace_event *tevs,
 	return skipped;
 }
 
+void __weak
+arch__post_process_probe_trace_events(struct perf_probe_event *pev __maybe_unused,
+				      int ntevs __maybe_unused)
+{
+}
+
 /* Post processing the probe events */
-static int post_process_probe_trace_events(struct probe_trace_event *tevs,
+static int post_process_probe_trace_events(struct perf_probe_event *pev,
+					   struct probe_trace_event *tevs,
 					   int ntevs, const char *module,
 					   bool uprobe)
 {
-	if (uprobe)
-		return add_exec_to_probe_tra
[PATCH 1/2] perf: Add function to post process kernel trace events
Instead of inline code, introduce a function to post process kernel
probe trace events.

Signed-off-by: Ravi Bangoria
---
 tools/perf/util/probe-event.c | 29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 953dc1a..4e215e7 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -664,22 +664,14 @@ static int add_module_to_probe_trace_events(struct probe_trace_event *tevs,
 	return ret;
 }
 
-/* Post processing the probe events */
-static int post_process_probe_trace_events(struct probe_trace_event *tevs,
-					   int ntevs, const char *module,
-					   bool uprobe)
+static int
+post_process_kernel_probe_trace_events(struct probe_trace_event *tevs,
+				       int ntevs)
 {
 	struct ref_reloc_sym *reloc_sym;
 	char *tmp;
 	int i, skipped = 0;
 
-	if (uprobe)
-		return add_exec_to_probe_trace_events(tevs, ntevs, module);
-
-	/* Note that currently ref_reloc_sym based probe is not for drivers */
-	if (module)
-		return add_module_to_probe_trace_events(tevs, ntevs, module);
-
 	reloc_sym = kernel_get_ref_reloc_sym();
 	if (!reloc_sym) {
 		pr_warning("Relocated base symbol is not found!\n");
@@ -711,6 +703,21 @@ static int post_process_probe_trace_events(struct probe_trace_event *tevs,
 	return skipped;
 }
 
+/* Post processing the probe events */
+static int post_process_probe_trace_events(struct probe_trace_event *tevs,
+					   int ntevs, const char *module,
+					   bool uprobe)
+{
+	if (uprobe)
+		return add_exec_to_probe_trace_events(tevs, ntevs, module);
+
+	if (module)
+		/* Currently ref_reloc_sym based probe is not for drivers */
+		return add_module_to_probe_trace_events(tevs, ntevs, module);
+
+	return post_process_kernel_probe_trace_events(tevs, ntevs);
+}
+
 /* Try to find perf_probe_event with debuginfo */
 static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 					  struct probe_trace_event **tevs)
-- 
2.7.4
Re: [PATCH v6 2/7] perf annotate: Add cross arch annotate support
Hi Kim,

I've tested your patch on x86 and powerpc and it looks fine to me.
Can you please add your Signed-off-by? Please add
Acked-by: Ravi Bangoria as well.

Regards,
-Ravi

On Wednesday 24 August 2016 02:06 AM, Kim Phillips wrote:
> On Tue, 23 Aug 2016 11:17:16 +0900
> Namhyung Kim wrote:
>
>> On Tue, Aug 23, 2016 at 8:01 AM, Kim Phillips wrote:
>>> On Fri, 19 Aug 2016 18:29:33 +0530
>>> Ravi Bangoria wrote:
>>>
>>>> Changes in v6:
>>>>   - Instead of adding only those instructions defined in #ifdef __arm__,
>>>>     add all instructions from default table to arm table.
>>> Thanks, I've gone through the list and removed all not-ARM
>>> instructions, and added some missing ARM branch instructions:
>> Can we use regex patterns instead?
> Yes, that helps prevent mistakes updating instruction lists - how does
> this look?:
>
> diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
> index b2c6cf3..52316f3 100644
> --- a/tools/perf/util/annotate.c
> +++ b/tools/perf/util/annotate.c
> @@ -26,6 +26,7 @@
>  const char *disassembler_style;
>  const char *objdump_path;
>  static regex_t	file_lineno;
> +static regex_t	arm_call_insn, arm_jump_insn;
>  
>  static struct ins *ins__find(const char *name, const char *norm_arch);
>  static int disasm_line__parse(char *line, char **namep, char **rawp);
> @@ -449,98 +450,7 @@ static struct ins instructions_x86[] = {
>  	{ .name = "retq",   .ops = &ret_ops, },
>  };
>  
> -static struct ins instructions_arm[] = {
> -	{ .name = "add",    .ops = &mov_ops, },
> -	{ .name = "addl",   .ops = &mov_ops, },
> -	{ .name = "addq",   .ops = &mov_ops, },
> -	{ .name = "addw",   .ops = &mov_ops, },
> -	{ .name = "and",    .ops = &mov_ops, },
> -	{ .name = "b",      .ops = &jump_ops, }, /* might also be a call */
> -	{ .name = "bcc",    .ops = &jump_ops, },
> -	{ .name = "bcs",    .ops = &jump_ops, },
> -	{ .name = "beq",    .ops = &jump_ops, },
> -	{ .name = "bge",    .ops = &jump_ops, },
> -	{ .name = "bgt",    .ops = &jump_ops, },
> -	{ .name = "bhi",    .ops = &jump_ops, },
> -	{ .name = "bl",     .ops = &call_ops, },
> -	{ .name = "bls",    .ops = &jump_ops, },
> -	{ .name = "blt",    .ops = &jump_ops, },
> -	{ .name = "blx",    .ops = &call_ops, },
> -	{ .name = "bne",    .ops = &jump_ops, },
> -	{ .name = "bts",    .ops = &mov_ops, },
> -	{ .name = "call",   .ops = &call_ops, },
> -	{ .name = "callq",  .ops = &call_ops, },
> -	{ .name = "cmp",    .ops = &mov_ops, },
> -	{ .name = "cmpb",   .ops = &mov_ops, },
> -	{ .name = "cmpl",   .ops = &mov_ops, },
> -	{ .name = "cmpq",   .ops = &mov_ops, },
> -	{ .name = "cmpw",   .ops = &mov_ops, },
> -	{ .name = "cmpxch", .ops = &mov_ops, },
> -	{ .name = "dec",    .ops = &dec_ops, },
> -	{ .name = "decl",   .ops = &dec_ops, },
> -	{ .name = "imul",   .ops = &mov_ops, },
> -	{ .name = "inc",    .ops = &dec_ops, },
> -	{ .name = "incl",   .ops = &dec_ops, },
> -	{ .name = "ja",     .ops = &jump_ops, },
> -	{ .name = "jae",    .ops = &jump_ops, },
> -	{ .name = "jb",     .ops = &jump_ops, },
> -	{ .name = "jbe",    .ops = &jump_ops, },
> -	{ .name = "jc",     .ops = &jump_ops, },
> -	{ .name = "jcxz",   .ops = &jump_ops, },
> -	{ .name = "je",     .ops = &jump_ops, },
> -	{ .name = "jecxz",  .ops = &jump_ops, },
> -	{ .name = "jg",     .ops = &jump_ops, },
> -	{ .name = "jge",    .ops = &jump_ops, },
> -	{ .name = "jl",     .ops = &jump_ops, },
> -	{ .name = "jle",    .ops = &jump_ops, },
> -	{ .name = "jmp",    .ops = &jump_ops, },
> -	{ .name = "jmpq",   .ops = &jump_ops, },
> -	{ .name = "jna",    .ops = &jump_ops, },
> -	{ .name = "jnae",   .ops = &jump_ops, },
> -	{ .name = "jnb",    .ops = &jump_ops, },
> -	{ .name = "jnbe",   .ops = &jump_ops, },
> -	{ .name = "jnc",    .ops = &a
[RFC] perf tool: Fix memory corruption because of zero length symbols
Perf top often crashes at very random locations on powerpc. After
investigating, I found the crash only happens when a sample is from a
zero-length symbol. The powerpc kernel has many such symbols whose
length details are not present in the vmlinux binary, and thus the
start and end addresses of such symbols are the same.

The structure

  struct sym_hist {
	u64		nr_samples;
	u64		period;
	struct sym_hist_entry addr[0];
  };

has a last member 'addr[]' of size zero. 'addr[]' is an array of
addresses that belong to one symbol (function). If a function consists
of 100 instructions, 'addr' points to an array of 100 'struct
sym_hist_entry' elements. For a zero-length symbol, it points to an
*empty* array, i.e. there are no members in the array, and thus even
offset 0 is invalid for such an array.

  static int __symbol__inc_addr_samples(...)
  {
	...
	offset = addr - sym->start;
	h = annotation__histogram(notes, evidx);
	h->nr_samples++;
	h->addr[offset].nr_samples++;
	h->period += sample->period;
	h->addr[offset].period += sample->period;
	...
  }

Here, when 'addr' is the same as 'sym->start', 'offset' becomes 0,
which is valid for normal symbols but *invalid* for zero-length
symbols, and thus updating h->addr[offset] causes memory corruption.

Fix this by adding one dummy element for zero-length symbols.

Fixes: edee44be5919 ("perf annotate: Don't throw error for zero length symbols")
Signed-off-by: Ravi Bangoria
---
 tools/perf/util/annotate.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 4397a8b..15db525 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -606,9 +606,19 @@ static struct arch *arch__find(const char *name)
 int symbol__alloc_hist(struct symbol *sym)
 {
 	struct annotation *notes = symbol__annotation(sym);
-	const size_t size = symbol__size(sym);
+	size_t size = symbol__size(sym);
 	size_t sizeof_sym_hist;
 
+	/*
+	 * Add buffer of one element for zero length symbol.
+	 * When sample is taken from first instruction of
+	 * zero length symbol, perf still resolves it and
+	 * shows symbol name in perf report and allows to
+	 * annotate it.
+	 */
+	if (size == 0)
+		size = 1;
+
 	/* Check for overflow when calculating sizeof_sym_hist */
 	if (size > (SIZE_MAX - sizeof(struct sym_hist)) / sizeof(struct sym_hist_entry))
 		return -1;
-- 
1.8.3.1
[PATCH] perf probe: Restrict user to use only one executable with -V
If multiple executables (-x) are specified with 'perf probe', listing
variables (-V) only considers the first executable. For example:

  $ perf probe -V do_sys_open -x ./test -V fun
  Failed to find debug information for address 40
  Debuginfo analysis failed.
    Error: Failed to show vars.
  Available variables at do_sys_open
          @
                  char*	filename
                  int	dfd
                  int	flags
                  struct open_flags	op
                  umode_t	mode

Here, the first executable is the kernel and the second is 'test'.
Instead of finding 'fun()' in 'test', perf is looking for 'fun()' in
the kernel.

Fix this by restricting the user to only one executable with -V:

  $ perf probe -V do_sys_open -x ./test -V fun
    Error: Multiple executables are not allowed.

  $ perf probe -V do_sys_open
  Available variables at do_sys_open
          @
                  char*	filename
                  int	dfd
                  int	flags
                  struct open_flags	op
                  umode_t	mode

  $ perf probe -x ./test -V fun
  Available variables at fun
          @
                  int	arg

Signed-off-by: Ravi Bangoria
---
 tools/perf/util/probe-event.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index b7aaf9b..0e04015 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1128,6 +1128,11 @@ int show_available_vars(struct perf_probe_event *pevs, int npevs,
 	int i, ret = 0;
 	struct debuginfo *dinfo;
 
+	if (npevs > 1) {
+		pr_err("Error: Multiple executables are not allowed.\n");
+		return 0;
+	}
+
 	ret = init_probe_symbol_maps(pevs->uprobes);
 	if (ret < 0)
 		return ret;
-- 
1.8.3.1
[PATCH] perf tool: Fix build failure when NO_AUXTRACE=1
The perf tool fails with the following build failure when AUXTRACE is
not set:

  $ make NO_AUXTRACE=1
    builtin-script.c: In function 'perf_script__process_auxtrace_info':
    util/auxtrace.h:608:44: error: called object is not a function or function pointer
     #define perf_event__process_auxtrace_info	0
                                                ^

Fix it by guarding the function under HAVE_AUXTRACE_SUPPORT.

Fixes: 47e5a26a916b ("perf script: Fix --per-event-dump for auxtrace synth evsels")
Signed-off-by: Ravi Bangoria
---
 tools/perf/builtin-script.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index ad6404dcf91c..9b43bda45a41 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -2848,6 +2848,7 @@ int process_cpu_map_event(struct perf_tool *tool __maybe_unused,
 	return set_maps(script);
 }
 
+#ifdef HAVE_AUXTRACE_SUPPORT
 static int perf_script__process_auxtrace_info(struct perf_tool *tool,
 					      union perf_event *event,
 					      struct perf_session *session)
@@ -2862,6 +2863,9 @@ static int perf_script__process_auxtrace_info(struct perf_tool *tool,
 
 	return ret;
 }
+#else
+#define perf_script__process_auxtrace_info 0
+#endif
 
 int cmd_script(int argc, const char **argv)
 {
-- 
2.13.6
Re: [PATCH 16/35] perf annotate: Add samples into struct annotation_line
Hi Jiri,

This patch seems to be causing a segfault with "perf top --stdio".

Steps to reproduce:
  1. Start "perf top --stdio" in one terminal.
  2. Run some simple workload in another terminal, and let it finish.
  3. Annotate a function from the previous workload in perf top
     (press 'a' followed by 's').

Perf will crash with:

  perf: Segmentation fault
  Obtained 8 stack frames.
  ./perf(sighandler_dump_stack+0x3e) [0x4f1b6e]
  /lib64/libc.so.6(+0x36a7f) [0x7ff3aa7e4a7f]
  ./perf() [0x4a27fd]
  ./perf(symbol__annotate+0x199) [0x4a4439]
  ./perf() [0x44e32d]
  ./perf() [0x44f098]
  /lib64/libpthread.so.0(+0x736c) [0x7ff3acee836c]
  /lib64/libc.so.6(clone+0x3e) [0x7ff3aa8bee1e]

Can you please check?

Thanks,
Ravi

On 10/11/2017 08:31 PM, Jiri Olsa wrote:

Adding samples array into struct annotation_line to hold the
annotation data. The data are populated in the following patches.

Link: http://lkml.kernel.org/n/tip-97yja5m7z9brrcuf2gwr5...@git.kernel.org
Signed-off-by: Jiri Olsa
---
 tools/perf/util/annotate.c |  8 
 tools/perf/util/annotate.h | 17 -
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 8e1e88aab45a..1b5c7d0a53e8 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -901,7 +901,14 @@ static struct annotation_line*
 annotation_line__new(struct annotate_args *args, size_t privsize)
 {
 	struct annotation_line *al;
+	struct perf_evsel *evsel = args->evsel;
 	size_t size = privsize + sizeof(*al);
+	int nr = 1;
+
+	if (perf_evsel__is_group_event(evsel))
+		nr = evsel->nr_members;
+
+	size += sizeof(al->samples[0]) * nr;
 
 	al = zalloc(size);
 	if (al) {
@@ -910,6 +917,7 @@ annotation_line__new(struct annotate_args *args, size_t privsize)
 		al->offset     = args->offset;
 		al->line       = strdup(args->line);
 		al->line_nr    = args->line_nr;
+		al->samples_nr = nr;
 	}
 
 	return al;
diff --git a/tools/perf/util/annotate.h b/tools/perf/util/annotate.h
index a02a2bf4f2ab..9c722a7e5f6d 100644
--- a/tools/perf/util/annotate.h
+++ b/tools/perf/util/annotate.h
@@ -58,6 +58,16 @@ bool ins__is_fused(struct arch *arch, const char *ins1, const char *ins2);
 
 struct annotation;
 
+struct sym_hist_entry {
+	u64		nr_samples;
+	u64		period;
+};
+
+struct annotation_data {
+	double			 percent;
+	struct sym_hist_entry	 he;
+};
+
 struct annotation_line {
 	struct list_head	 node;
 	struct rb_node		 rb_node;
@@ -67,6 +77,8 @@ struct annotation_line {
 	float			 ipc;
 	u64			 cycles;
 	size_t			 privsize;
+	int			 samples_nr;
+	struct annotation_data	 samples[0];
 };
 
 struct disasm_line {
@@ -87,11 +99,6 @@ static inline bool disasm_line__has_offset(const struct disasm_line *dl)
 	return dl->ops.target.offset_avail;
 }
 
-struct sym_hist_entry {
-	u64		nr_samples;
-	u64		period;
-};
-
 void disasm_line__free(struct disasm_line *dl);
 struct annotation_line *
 annotation_line__next(struct annotation_line *pos, struct list_head *head);
[PATCH] perf annotate: Remove precision for mnemonics
There are many instructions, especially on powerpc, whose mnemonics are
longer than 6 characters. Using a precision limit causes truncation of
such mnemonics. Fix this by removing the precision limit. Note that
'width' is still 6, so alignment isn't affected for lengths <= 6.

Before:
  li     r11,-1
  xscvdp vs1,vs1
  add.   r10,r10,r11

After:
  li     r11,-1
  xscvdpsxds vs1,vs1
  add.   r10,r10,r11

Reported-by: Donald Stence
Signed-off-by: Ravi Bangoria
---
 tools/perf/util/annotate.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
index 54321b947de8..6462a7423beb 100644
--- a/tools/perf/util/annotate.c
+++ b/tools/perf/util/annotate.c
@@ -165,7 +165,7 @@ static void ins__delete(struct ins_operands *ops)
 static int ins__raw_scnprintf(struct ins *ins, char *bf, size_t size,
 			      struct ins_operands *ops)
 {
-	return scnprintf(bf, size, "%-6.6s %s", ins->name, ops->raw);
+	return scnprintf(bf, size, "%-6s %s", ins->name, ops->raw);
 }
 
 int ins__scnprintf(struct ins *ins, char *bf, size_t size,
@@ -230,12 +230,12 @@ static int call__scnprintf(struct ins *ins, char *bf, size_t size,
 			   struct ins_operands *ops)
 {
 	if (ops->target.name)
-		return scnprintf(bf, size, "%-6.6s %s", ins->name, ops->target.name);
+		return scnprintf(bf, size, "%-6s %s", ins->name, ops->target.name);
 
 	if (ops->target.addr == 0)
 		return ins__raw_scnprintf(ins, bf, size, ops);
 
-	return scnprintf(bf, size, "%-6.6s *%" PRIx64, ins->name, ops->target.addr);
+	return scnprintf(bf, size, "%-6s *%" PRIx64, ins->name, ops->target.addr);
 }
 
 static struct ins_ops call_ops = {
@@ -299,7 +299,7 @@ static int jump__scnprintf(struct ins *ins, char *bf, size_t size,
 		c++;
 	}
 
-	return scnprintf(bf, size, "%-6.6s %.*s%" PRIx64,
+	return scnprintf(bf, size, "%-6s %.*s%" PRIx64,
 			 ins->name, c ? c - ops->raw : 0, ops->raw,
 			 ops->target.offset);
 }
@@ -372,7 +372,7 @@ static int lock__scnprintf(struct ins *ins, char *bf, size_t size,
 	if (ops->locked.ins.ops == NULL)
 		return ins__raw_scnprintf(ins, bf, size, ops);
 
-	printed = scnprintf(bf, size, "%-6.6s ", ins->name);
+	printed = scnprintf(bf, size, "%-6s ", ins->name);
 	return printed + ins__scnprintf(&ops->locked.ins, bf + printed,
 					size - printed, ops->locked.ops);
 }
@@ -448,7 +448,7 @@ static int mov__parse(struct arch *arch, struct ins_operands *ops, struct map *m
 static int mov__scnprintf(struct ins *ins, char *bf, size_t size,
 			  struct ins_operands *ops)
 {
-	return scnprintf(bf, size, "%-6.6s %s,%s", ins->name,
+	return scnprintf(bf, size, "%-6s %s,%s", ins->name,
 			 ops->source.name ?: ops->source.raw,
 			 ops->target.name ?: ops->target.raw);
 }
@@ -488,7 +488,7 @@ static int dec__parse(struct arch *arch __maybe_unused, struct ins_operands *ops
 static int dec__scnprintf(struct ins *ins, char *bf, size_t size,
 			  struct ins_operands *ops)
 {
-	return scnprintf(bf, size, "%-6.6s %s", ins->name,
+	return scnprintf(bf, size, "%-6s %s", ins->name,
 			 ops->target.name ?: ops->target.raw);
 }
 
@@ -500,7 +500,7 @@ static struct ins_ops dec_ops = {
 static int nop__scnprintf(struct ins *ins __maybe_unused, char *bf, size_t size,
 			  struct ins_operands *ops __maybe_unused)
 {
-	return scnprintf(bf, size, "%-6.6s", "nop");
+	return scnprintf(bf, size, "%-6s", "nop");
 }
 
 static struct ins_ops nop_ops = {
@@ -990,7 +990,7 @@ void disasm_line__free(struct disasm_line *dl)
 int disasm_line__scnprintf(struct disasm_line *dl, char *bf, size_t size, bool raw)
 {
 	if (raw || !dl->ins.ops)
-		return scnprintf(bf, size, "%-6.6s %s", dl->ins.name, dl->ops.raw);
+		return scnprintf(bf, size, "%-6s %s", dl->ins.name, dl->ops.raw);
 
 	return ins__scnprintf(&dl->ins, bf, size, &dl->ops);
 }
-- 
2.13.6
Re: [PATCH 16/35] perf annotate: Add samples into struct annotation_line
Hi Jiri,

On 11/14/2017 03:01 PM, Jiri Olsa wrote:
> On Mon, Nov 13, 2017 at 09:14:38PM +0100, Jiri Olsa wrote:
>> On Mon, Nov 13, 2017 at 09:16:20PM +0530, Ravi Bangoria wrote:
>>> Hi Jiri,
>>>
>>> This patch seems to be causing segfault with "perf top --stdio".
>>>
>>> Steps to reproduce:
>>> 1. start "perf top --stdio" in one terminal
>>> 2. run some simple workload in another terminal, let it get finished.
>>> 3. annotate function from previous workload in perf top
>>>    (press 'a' followed by 's')
>>>
>>> Perf will crash with:
>>>
>>> perf: Segmentation fault
>>> Obtained 8 stack frames.
>>> ./perf(sighandler_dump_stack+0x3e) [0x4f1b6e]
>>> /lib64/libc.so.6(+0x36a7f) [0x7ff3aa7e4a7f]
>>> ./perf() [0x4a27fd]
>>> ./perf(symbol__annotate+0x199) [0x4a4439]
>>> ./perf() [0x44e32d]
>>> ./perf() [0x44f098]
>>> /lib64/libpthread.so.0(+0x736c) [0x7ff3acee836c]
>>> /lib64/libc.so.6(clone+0x3e) [0x7ff3aa8bee1e]
>>>
>>> Can you please check.
>>
>> hum, I'm getting following crash after resizing the terminal window:
>>
>> perf: Floating point exception
>> Obtained 8 stack frames.
>> ./perf(dump_stack+0x2e) [0x510c89]
>> ./perf(sighandler_dump_stack+0x2e) [0x510d69]
>> /lib64/libc.so.6(+0x36a80) [0x7f9419588a80]
>> ./perf(perf_top__header_snprintf+0x208) [0x4f42c1]
>> ./perf() [0x453c09]
>> ./perf() [0x454ddb]
>> /lib64/libpthread.so.0(+0x736d) [0x7f941bc8c36d]
>> /lib64/libc.so.6(clone+0x3f) [0x7f9419662e1f]
>> Floating point exception (core dumped)
>>
>> working on fix
>
> so my crash is caused by bogus resize code, I have it working with
> fix for memory corruption happening in SIGWINCH signal handler
> (attached)
>
> could you please check if that fixes the code for you?

Yes, this fixes the crash caused by the resize. But the original crash
I reported is still there. The issue seems to be that evsel is NULL and
we are trying to dereference it somewhere inside annotation_line__new().

Will try to spend more time on it.

-Ravi
[PATCH] perf record: Fix tool crash with xyarray
I see 'perf record -p <pid>' crash with the following log:

  *** Error in `./perf': free(): invalid next size (normal): 0x0298b340 ***
  === Backtrace: =
  /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f7fd85c87e5]
  /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f7fd85d137a]
  /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f7fd85d553c]
  ./perf(perf_evsel__close+0xb4)[0x4b7614]
  ./perf(perf_evlist__delete+0x100)[0x4ab180]
  ./perf(cmd_record+0x1d9)[0x43a5a9]
  ./perf[0x49aa2f]
  ./perf(main+0x631)[0x427841]
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f7fd8571830]
  ./perf(_start+0x29)[0x427a59]

This is because the functions xyarray__max_x() and xyarray__max_y()
return incorrect values, causing a crash while accessing the xyarray.

Fixes: d74be4767367 ("perf xyarray: Save max_x, max_y")
Signed-off-by: Ravi Bangoria
---
 tools/perf/util/xyarray.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/perf/util/xyarray.h b/tools/perf/util/xyarray.h
index 4ba726c..d6858ed 100644
--- a/tools/perf/util/xyarray.h
+++ b/tools/perf/util/xyarray.h
@@ -21,12 +21,12 @@ static inline void *xyarray__entry(struct xyarray *xy, int x, int y)
 	return &xy->contents[x * xy->row_size + y * xy->entry_size];
 }
 
-static inline int xyarray__max_y(struct xyarray *xy)
+static inline int xyarray__max_x(struct xyarray *xy)
 {
 	return xy->max_x;
 }
 
-static inline int xyarray__max_x(struct xyarray *xy)
+static inline int xyarray__max_y(struct xyarray *xy)
 {
 	return xy->max_y;
 }
-- 
2.7.4
Re: [PATCH v7 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Hi Oleg, Sorry for bit late reply. On 08/03/2018 04:54 PM, Oleg Nesterov wrote: > Hi Ravi, > > I was going to give up and ack this series, but it seems I noticed > a bug... > > On 07/31, Ravi Bangoria wrote: >> >> +static int delayed_uprobe_add(struct uprobe *uprobe, struct mm_struct *mm) >> +{ >> +struct delayed_uprobe *du; >> + >> +if (delayed_uprobe_check(uprobe, mm)) >> +return 0; >> + >> +du = kzalloc(sizeof(*du), GFP_KERNEL); >> +if (!du) >> +return -ENOMEM; >> + >> +du->uprobe = uprobe; >> +du->mm = mm; > > I am surprised I didn't notice this before... > > So > du->mm = mm; > > is fine, mm can't go away, uprobe_clear_state() does > delayed_uprobe_remove(NULL,mm). > > But > du->uprobe = uprobe; > > doesn't look right, uprobe can go away and it can be freed, its memory can be > reused. > We can't rely on remove_breakpoint(), I'm sorry. I didn't get this. How can uprobe go away without calling uprobe_unregister() -> rergister_for_each_vma() -> remove_breakpoint() And remove_breakpoint() will get called only when last uprobe consumer is going away _for that mm_. So, we can rely on remove_breakpoint() to remove {uprobe,mm} from delayed_uprobe_list. Am I missing anything? Or it would be even better if you can tell me some example scenario. > the application can unmap the probed page/vma. > Yes we do not care about the application in this case, say, the next > uprobe_mmap() can > wrongly increment the counter, we do not care although this can lead to > hard-to-debug > problems. But, if nothing else, the kernel can crash if the freed memory is > unmapped. > So I think put_uprobe() should do delayed_uprobe_remove(uprobe, NULL) before > kfree() > and delayed_uprobe_remove() should be updated to handle the mm==NULL case. > > Also. delayed_uprobe_add() should check the list and avoid duplicates. > Otherwise the > trivial > > for (;;) > munmap(mmap(uprobed_file)); > > will eat the memory until uprobe is unregistered. 
I'm already calling delayed_uprobe_check(uprobe, mm) from delayed_uprobe_add(). So, we don't add same {uprobe,mm} multiple time in delayed_uprobe_list. Is it the same check you are asking me to add or something else. > > >> +static bool valid_ref_ctr_vma(struct uprobe *uprobe, >> + struct vm_area_struct *vma) >> +{ >> +unsigned long vaddr = offset_to_vaddr(vma, uprobe->ref_ctr_offset); >> + >> +return uprobe->ref_ctr_offset && >> +vma->vm_file && >> +file_inode(vma->vm_file) == uprobe->inode && >> +vma->vm_flags & VM_WRITE && >> +!(vma->vm_flags & VM_SHARED) && > > vma->vm_flags & (VM_WRITE|VM_SHARED) == VM_WRITE && > > looks a bit better to me, but I won't insist. Sure. I'll change it. > >> +static int delayed_uprobe_install(struct vm_area_struct *vma) >> +{ >> +struct list_head *pos, *q; >> +struct delayed_uprobe *du; >> +unsigned long vaddr; >> +int ret = 0, err = 0; >> + >> +mutex_lock(&delayed_uprobe_lock); >> +list_for_each_safe(pos, q, &delayed_uprobe_list) { >> +du = list_entry(pos, struct delayed_uprobe, list); >> + >> +if (!valid_ref_ctr_vma(du->uprobe, vma)) >> +continue; >> + >> +vaddr = offset_to_vaddr(vma, du->uprobe->ref_ctr_offset); >> +ret = __update_ref_ctr(vma->vm_mm, vaddr, 1); >> +/* Record an error and continue. */ >> +err = ret & !err ? ret : err; > > I try to avoid the cosmetic nits, but I simply can't look at this line ;) > > if (ret && !err) > err = ret; This is neat. Will replace it. > >> @@ -1072,7 +1281,14 @@ int uprobe_mmap(struct vm_area_struct *vma) >> struct uprobe *uprobe, *u; >> struct inode *inode; >> >> -if (no_uprobe_events() || !valid_vma(vma, true)) >> +if (no_uprobe_events()) >> +return 0; >> + >> +if (vma->vm_flags & VM_WRITE && >> +test_bit(MMF_HAS_UPROBES, &vma->vm_mm->flags)) >> +delayed_uprobe_install(vma); > > OK, so you also added the VM_WRITE check and I agree. But then I think we > should also check VM_SHARED, just like valid_ref_ctr_vma() does? Right. I'll add a check here. Thanks for reviewing :)
Re: [PATCH v7 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Hi Oleg, >> I'm sorry. I didn't get this. How can uprobe go away without calling >> uprobe_unregister() >> -> rergister_for_each_vma() >>-> remove_breakpoint() >> And remove_breakpoint() will get called > > assuming that _unregister() will find the same vma with the probed insn. But > as I said, the application can munmap the probed page/vma. Got it now :) Thanks.
[PATCH 3/3] perf script: Fix crash because of missing feat_op[] entry
perf_event__process_feature() tries to access feat_ops[feat].process,
which is not defined for feat = HEADER_LAST_FEATURE, and thus perf is
crashing. Add a dummy entry for HEADER_LAST_FEATURE in feat_ops.

Before:
  # ./perf record -o - ls | ./perf script
  Segmentation fault (core dumped)

After:
  # ./perf record -o - ls | ./perf script
  ls 7031 4392.099856: 25 cpu-clock:uhH: 7f5e0ce7cd60
  ls 7031 4392.100355: 25 cpu-clock:uhH: 7f5e0c706ef7

Signed-off-by: Ravi Bangoria
Fixes: 57b5de463925 ("perf report: Support forced leader feature in pipe mode")
---
 tools/perf/util/header.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 540cd2dcd3e7..de8e3e29d870 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -2555,14 +2555,18 @@ struct feature_ops {
 }
 
 /* feature_ops not implemented: */
+#define write_last_feature NULL
+
 #define print_tracing_data NULL
 #define print_build_id NULL
+#define print_last_feature NULL
 
 #define process_branch_stack NULL
 #define process_stat NULL
+#define process_last_feature NULL
 
-static const struct feature_ops feat_ops[HEADER_LAST_FEATURE] = {
+static const struct feature_ops feat_ops[HEADER_LAST_FEATURE + 1] = {
 	FEAT_OPN(TRACING_DATA,	tracing_data,	false),
 	FEAT_OPN(BUILD_ID,	build_id,	false),
 	FEAT_OPR(HOSTNAME,	hostname,	false),
@@ -2585,6 +2589,7 @@ static const struct feature_ops feat_ops[HEADER_LAST_FEATURE] = {
 	FEAT_OPN(CACHE,		cache,		true),
 	FEAT_OPR(SAMPLE_TIME,	sample_time,	false),
 	FEAT_OPR(MEM_TOPOLOGY,	mem_topology,	true),
+	FEAT_OPN(LAST_FEATURE,	last_feature,	false),
 };
 
 struct header_print_data {
-- 
2.14.4
[PATCH 2/3] perf script: Fix crash because of missing evsel->priv
perf script in piped mode is crashing because evsel->priv is not
set properly. Fix it.

Before:
  # ./perf record -o - -- ls | ./perf script
  Segmentation fault (core dumped)

After:
  # ./perf record -o - -- ls | ./perf script
  ls 2282 1031.731974: 25 cpu-clock:uhH: 7effe4b3d29e
  ls 2282 1031.73: 25 cpu-clock:uhH: 7effe4b3a650

Signed-off-by: Ravi Bangoria
Fixes: a14390fde64e ("perf script: Allow creating per-event dump files")
---
 tools/perf/builtin-script.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 4d1cee68cfd2..acee05562f5e 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1822,6 +1822,7 @@ static int process_attr(struct perf_tool *tool, union perf_event *event,
 	struct perf_evlist *evlist;
 	struct perf_evsel *evsel, *pos;
 	int err;
+	static struct perf_evsel_script *es;
 
 	err = perf_event__process_attr(tool, event, pevlist);
 	if (err)
@@ -1830,6 +1831,19 @@ static int process_attr(struct perf_tool *tool, union perf_event *event,
 	evlist = *pevlist;
 	evsel = perf_evlist__last(*pevlist);
 
+	if (!evsel->priv) {
+		if (scr->per_event_dump) {
+			evsel->priv = perf_evsel_script__new(evsel,
+						scr->session->data);
+		} else {
+			es = zalloc(sizeof(*es));
+			if (!es)
+				return -ENOMEM;
+			es->fp = stdout;
+			evsel->priv = es;
+		}
+	}
+
 	if (evsel->attr.type >= PERF_TYPE_MAX &&
 	    evsel->attr.type != PERF_TYPE_SYNTH)
 		return 0;
-- 
2.14.4
Re: [PATCH 2/3] perf script: Fix crash because of missing evsel->priv
Hi Arnaldo,

On 06/20/2018 07:22 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 20, 2018 at 07:00:29PM +0530, Ravi Bangoria escreveu:
>> perf script in pipped mode is crashing because evsel->priv is not
>> set properly. Fix it.
>>
>> Before:
>>   # ./perf record -o - -- ls | ./perf script
>>   Segmentation fault (core dumped)
>>
>> After:
>>   # ./perf record -o - -- ls | ./perf script
>>   ls 2282 1031.731974: 25 cpu-clock:uhH: 7effe4b3d29e
>>   ls 2282 1031.73: 25 cpu-clock:uhH: 7effe4b3a650
>>
>> Signed-off-by: Ravi Bangoria
>> Fixes: a14390fde64e ("perf script: Allow creating per-event dump files")
>
> Humm, this cset doesn't set evsel->priv to a 'struct perf_evsel_script'
> object, will check which one does to continue review.

Possible. Actually, git bisect led me to this commit.

Ravi
Re: [PATCH 3/3] perf script: Fix crash because of missing feat_op[] entry
Hi Arnaldo,

On 06/20/2018 07:19 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 20, 2018 at 07:00:30PM +0530, Ravi Bangoria escreveu:
>> perf_event__process_feature() tries to access feat_ops[feat].process
>> which is not defined for feat = HEADER_LAST_FEATURE and thus perf is
>> crashing. Add dummy entry for HEADER_LAST_FEATURE in the feat_ops.
>
> Humm, first impression is that we should check for HEADER_LAST_FEATURE
> and not try to access that array slot, as it is just a marker, not an
> actual feature.

Sure. Let me try to handle it inside perf_event__process_feature()
itself. Maybe something like below:

---
diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c
index 540cd2dcd3e7..4dc251d3b372 100644
--- a/tools/perf/util/header.c
+++ b/tools/perf/util/header.c
@@ -3461,6 +3461,10 @@ int perf_event__process_feature(struct perf_tool *tool,
 		return -1;
 	}
 
+	/* HEADER_LAST_FEATURE is just a marker. Ignore it. */
+	if (feat == HEADER_LAST_FEATURE)
+		return 0;
+
 	if (!feat_ops[feat].process)
 		return 0;
---

Thanks,
Ravi
Re: [PATCH 2/3] perf script: Fix crash because of missing evsel->priv
Hi Arnaldo,

On 06/20/2018 07:22 PM, Arnaldo Carvalho de Melo wrote:
> Em Wed, Jun 20, 2018 at 07:00:29PM +0530, Ravi Bangoria escreveu:
>> perf script in pipped mode is crashing because evsel->priv is not
>> set properly. Fix it.
>>
>> Before:
>>   # ./perf record -o - -- ls | ./perf script
>>   Segmentation fault (core dumped)
>>
>> After:
>>   # ./perf record -o - -- ls | ./perf script
>>   ls 2282 1031.731974: 25 cpu-clock:uhH: 7effe4b3d29e
>>   ls 2282 1031.73: 25 cpu-clock:uhH: 7effe4b3a650
>>
>> Signed-off-by: Ravi Bangoria
>> Fixes: a14390fde64e ("perf script: Allow creating per-event dump files")
>
> Humm, this cset doesn't set evsel->priv to a 'struct perf_evsel_script'
> object, will check which one does to continue review.

So, it's not about setting evsel->priv to a 'struct perf_evsel_script',
but about setting the proper output stream, either a file or stdout.
Before 'struct perf_evsel_script' was introduced, we were setting
evsel->priv to the output stream. So I think this commit missed setting
evsel->priv properly in the piped case. E.g., the below patch applied
_directly_ on top of a14390fde64e fixes the issue:

--
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index fb5e49b3bc44..66cc4b29bf4d 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1645,6 +1645,9 @@ static int process_attr(struct perf_tool *tool, union perf_event *event,
 	evlist = *pevlist;
 	evsel = perf_evlist__last(*pevlist);
 
+	if (!evsel->priv)
+		evsel->priv = stdout;
+
 	if (evsel->attr.type >= PERF_TYPE_MAX &&
 	    evsel->attr.type != PERF_TYPE_SYNTH)
 		return 0;
--

To me, this commit seems to be the bad one. Please let me know if that
is not the case. I'll change the last patch as you suggested and post
v2 soon.

Thanks,
Ravi
[PATCH v2 2/3] perf script: Fix crash because of missing evsel->priv
perf script in piped mode is crashing because evsel->priv is not
set properly. Fix it.

Before:
  # ./perf record -o - -- ls | ./perf script
  Segmentation fault (core dumped)

After:
  # ./perf record -o - -- ls | ./perf script
  ls 2282 1031.731974: 25 cpu-clock:uhH: 7effe4b3d29e
  ls 2282 1031.73: 25 cpu-clock:uhH: 7effe4b3a650

Signed-off-by: Ravi Bangoria
Fixes: a14390fde64e ("perf script: Allow creating per-event dump files")
---
 tools/perf/builtin-script.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index f3fefbcc4503..ad2ac1300420 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -1834,6 +1834,7 @@ static int process_attr(struct perf_tool *tool, union perf_event *event,
 	struct perf_evlist *evlist;
 	struct perf_evsel *evsel, *pos;
 	int err;
+	static struct perf_evsel_script *es;
 
 	err = perf_event__process_attr(tool, event, pevlist);
 	if (err)
@@ -1842,6 +1843,19 @@ static int process_attr(struct perf_tool *tool, union perf_event *event,
 	evlist = *pevlist;
 	evsel = perf_evlist__last(*pevlist);
 
+	if (!evsel->priv) {
+		if (scr->per_event_dump) {
+			evsel->priv = perf_evsel_script__new(evsel,
+						scr->session->data);
+		} else {
+			es = zalloc(sizeof(*es));
+			if (!es)
+				return -ENOMEM;
+			es->fp = stdout;
+			evsel->priv = es;
+		}
+	}
+
 	if (evsel->attr.type >= PERF_TYPE_MAX &&
 	    evsel->attr.type != PERF_TYPE_SYNTH)
 		return 0;
-- 
2.14.4
[PATCH v2 3/3] perf script/annotate: Fix crash caused by accessing feat_ops[HEADER_LAST_FEATURE]
perf_event__process_feature() accesses feat_ops[HEADER_LAST_FEATURE] which is not defined and thus perf is crashing. HEADER_LAST_FEATURE is used as an end marker for the perf report but it's unused for perf script/annotate. Ignore HEADER_LAST_FEATURE for perf script/ annotate. Before: # ./perf record -o - ls | ./perf script Segmentation fault (core dumped) After: # ./perf record -o - ls | ./perf script ls 7031 4392.099856: 25 cpu-clock:uhH: 7f5e0ce7cd60 ls 7031 4392.100355: 25 cpu-clock:uhH: 7f5e0c706ef7 Signed-off-by: Ravi Bangoria Fixes: 57b5de463925 ("perf report: Support forced leader feature in pipe mode") --- tools/perf/builtin-annotate.c | 11 ++- tools/perf/builtin-report.c | 3 ++- tools/perf/builtin-script.c | 11 ++- tools/perf/util/header.c | 2 +- 4 files changed, 23 insertions(+), 4 deletions(-) diff --git a/tools/perf/builtin-annotate.c b/tools/perf/builtin-annotate.c index 5eb22cc56363..8180319285af 100644 --- a/tools/perf/builtin-annotate.c +++ b/tools/perf/builtin-annotate.c @@ -283,6 +283,15 @@ static int process_sample_event(struct perf_tool *tool, return ret; } +static int process_feature_event(struct perf_tool *tool, +union perf_event *event, +struct perf_session *session) +{ + if (event->feat.feat_id < HEADER_LAST_FEATURE) + return perf_event__process_feature(tool, event, session); + return 0; +} + static int hist_entry__tty_annotate(struct hist_entry *he, struct perf_evsel *evsel, struct perf_annotate *ann) @@ -471,7 +480,7 @@ int cmd_annotate(int argc, const char **argv) .attr = perf_event__process_attr, .build_id = perf_event__process_build_id, .tracing_data = perf_event__process_tracing_data, - .feature= perf_event__process_feature, + .feature= process_feature_event, .ordered_events = true, .ordering_requires_timestamps = true, }, diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c index cdb5b6949832..c04dc7b53797 100644 --- a/tools/perf/builtin-report.c +++ b/tools/perf/builtin-report.c @@ -217,7 +217,8 @@ static int 
process_feature_event(struct perf_tool *tool, } /* -* All features are received, we can force the +* (feat_id = HEADER_LAST_FEATURE) is the end marker which +* means all features are received, now we can force the * group if needed. */ setup_forced_leader(rep, session->evlist); diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c index ad2ac1300420..568ddfac3213 100644 --- a/tools/perf/builtin-script.c +++ b/tools/perf/builtin-script.c @@ -3044,6 +3044,15 @@ int process_cpu_map_event(struct perf_tool *tool __maybe_unused, return set_maps(script); } +static int process_feature_event(struct perf_tool *tool, +union perf_event *event, +struct perf_session *session) +{ + if (event->feat.feat_id < HEADER_LAST_FEATURE) + return perf_event__process_feature(tool, event, session); + return 0; +} + #ifdef HAVE_AUXTRACE_SUPPORT static int perf_script__process_auxtrace_info(struct perf_tool *tool, union perf_event *event, @@ -3088,7 +3097,7 @@ int cmd_script(int argc, const char **argv) .attr= process_attr, .event_update = perf_event__process_event_update, .tracing_data= perf_event__process_tracing_data, - .feature = perf_event__process_feature, + .feature = process_feature_event, .build_id= perf_event__process_build_id, .id_index= perf_event__process_id_index, .auxtrace_info = perf_script__process_auxtrace_info, diff --git a/tools/perf/util/header.c b/tools/perf/util/header.c index 59fcc790c865..653ff65aa2c3 100644 --- a/tools/perf/util/header.c +++ b/tools/perf/util/header.c @@ -3464,7 +3464,7 @@ int perf_event__process_feature(struct perf_tool *tool, pr_warning("invalid record type %d in pipe-mode\n", type); return 0; } - if (feat == HEADER_RESERVED || feat > HEADER_LAST_FEATURE) { + if (feat == HEADER_RESERVED || feat >= HEADER_LAST_FEATURE) { pr_warning("invalid record type %d in pipe-mode\n", type); return -1; } -- 2.14.4
[PATCH v2 1/3] perf script: Add missing output fields in a hint
A few fields are missing in the 'perf script -F' hint. Add them.

Signed-off-by: Ravi Bangoria
---
 tools/perf/builtin-script.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index a31d7082188e..f3fefbcc4503 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -3125,8 +3125,9 @@ int cmd_script(int argc, const char **argv)
 		     "+field to add and -field to remove."
 		     "Valid types: hw,sw,trace,raw,synth. "
 		     "Fields: comm,tid,pid,time,cpu,event,trace,ip,sym,dso,"
-		     "addr,symoff,period,iregs,uregs,brstack,brstacksym,flags,"
-		     "bpf-output,callindent,insn,insnlen,brstackinsn,synth,phys_addr",
+		     "addr,symoff,srcline,period,iregs,uregs,brstack,"
+		     "brstacksym,flags,bpf-output,brstackinsn,brstackoff,"
+		     "callindent,insn,insnlen,synth,phys_addr,metric,misc",
 		     parse_output_fields),
 	OPT_BOOLEAN('a', "all-cpus", &system_wide,
 		    "system-wide collection from all CPUs"),
-- 
2.14.4
[PATCH v2 0/3] perf script: Few trivial fixes
The first patch fixes the perf output field hint. The second and third
fix crashes when perf is used in piped mode.

v2 changes:
  - [PATCH 3/3] Changed as suggested by Arnaldo

v1: https://lkml.org/lkml/2018/6/20/538

Ravi Bangoria (3):
  perf script: Add missing output fields in a hint
  perf script: Fix crash because of missing evsel->priv
  perf script/annotate: Fix crash caused by accessing
    feat_ops[HEADER_LAST_FEATURE]

 tools/perf/builtin-annotate.c | 11 ++-
 tools/perf/builtin-report.c   |  3 ++-
 tools/perf/builtin-script.c   | 30 +++---
 tools/perf/util/header.c      |  2 +-
 4 files changed, 40 insertions(+), 6 deletions(-)

-- 
2.14.4
[PATCH v8 4/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
We assume to have only one reference counter for one uprobe. Don't
allow users to register multiple uprobes having the same inode+offset
but different reference counter.

Signed-off-by: Ravi Bangoria
---
 kernel/events/uprobes.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 61b0481ef417..492a0e005b4d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -689,6 +689,12 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
 	if (cur_uprobe) {
+		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
+			pr_warn("Reference counter offset mismatch.\n");
+			put_uprobe(cur_uprobe);
+			kfree(uprobe);
+			return ERR_PTR(-EINVAL);
+		}
 		kfree(uprobe);
 		uprobe = cur_uprobe;
 	}
@@ -1103,6 +1109,9 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (!uprobe)
 		return -ENOMEM;
+	if (IS_ERR(uprobe))
+		return PTR_ERR(uprobe);
+
 	/*
 	 * We can race with uprobe_unregister()->delete_uprobe().
 	 * Check uprobe_is_active() and retry if it is false.
-- 
2.14.4
[PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Userspace Statically Defined Tracepoints[1] are dtrace-style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc.
have these markers embedded in them. These markers are added by
developers at important places in the code. Each marker source expands
to a single nop instruction in the compiled code, but there may be
additional overhead for computing the marker arguments, which expands
to a couple of instructions. When that overhead matters, executing them
can be omitted with a runtime if() condition while no one is tracing
the marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

The default value of the reference counter is 0. A tracer has to
increment the reference counter before tracing a marker and decrement
it when done with the tracing.

Implement the reference counter logic in the core uprobe code. Users
will be able to use it from trace_uprobe as well as from kernel
modules. The new trace_uprobe definition with a reference counter will
now be:

    :[(ref_ctr_offset)]

where ref_ctr_offset is an optional field. For kernel modules, a new
variant of uprobe_register() has been introduced:

    uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)

There is no new variant of uprobe_unregister() because it's assumed to
have only one reference counter for one uprobe.

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Note: The 'reference counter' is called a 'semaphore' in the original
Dtrace (and Systemtap, bcc and even ELF) documentation and code. But
the term 'semaphore' is misleading in this context. It is just a
counter used to hold the number of tracers tracing a marker; it is not
really used for any synchronization. So we refer to it as a 'reference
counter' in kernel / perf code.
Signed-off-by: Ravi Bangoria Reviewed-by: Masami Hiramatsu [Only trace_uprobe.c] --- include/linux/uprobes.h | 5 + kernel/events/uprobes.c | 246 ++-- kernel/trace/trace.c| 2 +- kernel/trace/trace_uprobe.c | 38 ++- 4 files changed, 280 insertions(+), 11 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index bb9d2084af03..103a48a48872 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -123,6 +123,7 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); @@ -160,6 +161,10 @@ uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { return -ENOSYS; } +static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) +{ + return -ENOSYS; +} static inline int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index c0418ba52ba8..61b0481ef417 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -73,6 +73,7 @@ struct uprobe { struct uprobe_consumer *consumers; struct inode*inode; /* Also hold a ref to inode */ loff_t offset; + loff_t ref_ctr_offset; unsigned long flags; /* @@ -88,6 +89,15 @@ struct uprobe { struct arch_uprobe arch; }; +struct delayed_uprobe { + struct list_head list; + 
struct uprobe *uprobe; + struct mm_struct *mm; +}; + +static DEFINE_MUTEX(delayed_uprobe_lock); +static LIST_HEAD(delayed_uprobe_list); + /* * Execute out of line area: anonymous executable mapping installed * by the probed task to execute the copy of the original instruction @@ -282,6 +292,157 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t return 1; } +static struct delayed_uprobe * +delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm) +{ + struct delayed_uprobe *du; + + list_for_each_entry(du, &delayed_uprobe_list, list) + if (du->uprobe == uprobe && du->mm == mm) + return du; + return NULL; +} + +static int delayed_uprobe_add(struct uprobe *uprobe, struct mm_struct *mm) +{ + struct
[PATCH v8 1/6] Uprobes: Simplify uprobe_register() body
Simplify uprobe_register() function body and let __uprobe_register() handle everything. Also move dependency functions around to fix build failures. Signed-off-by: Ravi Bangoria --- kernel/events/uprobes.c | 69 ++--- 1 file changed, 36 insertions(+), 33 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ccc579a7d32e..471eac896635 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -840,13 +840,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new) return err; } -static int __uprobe_register(struct uprobe *uprobe, struct uprobe_consumer *uc) -{ - consumer_add(uprobe, uc); - return register_for_each_vma(uprobe, uc); -} - -static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) +static void +__uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) { int err; @@ -860,24 +855,46 @@ static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *u } /* - * uprobe_register - register a probe + * uprobe_unregister - unregister a already registered probe. + * @inode: the file in which the probe has to be removed. + * @offset: offset from the start of the file. + * @uc: identify which probe if multiple probes are colocated. + */ +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +{ + struct uprobe *uprobe; + + uprobe = find_uprobe(inode, offset); + if (WARN_ON(!uprobe)) + return; + + down_write(&uprobe->register_rwsem); + __uprobe_unregister(uprobe, uc); + up_write(&uprobe->register_rwsem); + put_uprobe(uprobe); +} +EXPORT_SYMBOL_GPL(uprobe_unregister); + +/* + * __uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. * @uc: information on howto handle the probe.. 
* - * Apart from the access refcount, uprobe_register() takes a creation + * Apart from the access refcount, __uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation * refcount is released when the last @uc for the @uprobe - * unregisters. Caller of uprobe_register() is required to keep @inode + * unregisters. Caller of __uprobe_register() is required to keep @inode * (and the containing mount) referenced. * * Return errno if it cannot successully install probes * else return 0 (success) */ -int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +static int __uprobe_register(struct inode *inode, loff_t offset, +struct uprobe_consumer *uc) { struct uprobe *uprobe; int ret; @@ -904,7 +921,8 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * down_write(&uprobe->register_rwsem); ret = -EAGAIN; if (likely(uprobe_is_active(uprobe))) { - ret = __uprobe_register(uprobe, uc); + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); if (ret) __uprobe_unregister(uprobe, uc); } @@ -915,6 +933,12 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * goto retry; return ret; } + +int uprobe_register(struct inode *inode, loff_t offset, + struct uprobe_consumer *uc) +{ + return __uprobe_register(inode, offset, uc); +} EXPORT_SYMBOL_GPL(uprobe_register); /* @@ -946,27 +970,6 @@ int uprobe_apply(struct inode *inode, loff_t offset, return ret; } -/* - * uprobe_unregister - unregister a already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. 
- */ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - struct uprobe *uprobe; - - uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) - return; - - down_write(&uprobe->register_rwsem); - __uprobe_unregister(uprobe, uc); - up_write(&uprobe->register_rwsem); - put_uprobe(uprobe); -} -EXPORT_SYMBOL_GPL(uprobe_unregister); - static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm) { struct vm_area_struct *vma; -- 2.14.4
[PATCH v8 5/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
We assume to have only one reference counter for one uprobe. Don't allow user to add multiple trace_uprobe entries having same inode+offset but different reference counter. Ex, # echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events # echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xf)" >> uprobe_events bash: echo: write error: Invalid argument # dmesg trace_kprobe: Reference counter offset mismatch. There is one exception though: When user is trying to replace the old entry with the new one, we allow this if the new entry does not conflict with any other existing entries. Signed-off-by: Ravi Bangoria --- kernel/trace/trace_uprobe.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index bf2be098eb08..be64d943d7ea 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -324,6 +324,35 @@ static int unregister_trace_uprobe(struct trace_uprobe *tu) return 0; } +/* + * Uprobe with multiple reference counter is not allowed. i.e. + * If inode and offset matches, reference counter offset *must* + * match as well. Though, there is one exception: If user is + * replacing old trace_uprobe with new one(same group/event), + * then we allow same uprobe with new reference counter as far + * as the new one does not conflict with any other existing + * ones. + */ +static struct trace_uprobe *find_old_trace_uprobe(struct trace_uprobe *new) +{ + struct trace_uprobe *tmp, *old = NULL; + struct inode *new_inode = d_real_inode(new->path.dentry); + + old = find_probe_event(trace_event_name(&new->tp.call), + new->tp.call.class->system); + + list_for_each_entry(tmp, &uprobe_list, list) { + if ((old ? 
old != tmp : true) && + new_inode == d_real_inode(tmp->path.dentry) && + new->offset == tmp->offset && + new->ref_ctr_offset != tmp->ref_ctr_offset) { + pr_warn("Reference counter offset mismatch."); + return ERR_PTR(-EINVAL); + } + } + return old; +} + /* Register a trace_uprobe and probe_event */ static int register_trace_uprobe(struct trace_uprobe *tu) { @@ -333,8 +362,12 @@ static int register_trace_uprobe(struct trace_uprobe *tu) mutex_lock(&uprobe_lock); /* register as an event */ - old_tu = find_probe_event(trace_event_name(&tu->tp.call), - tu->tp.call.class->system); + old_tu = find_old_trace_uprobe(tu); + if (IS_ERR(old_tu)) { + ret = PTR_ERR(old_tu); + goto end; + } + if (old_tu) { /* delete old event */ ret = unregister_trace_uprobe(old_tu); -- 2.14.4
[PATCH v8 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Add an additional argument, 'arch_uprobe', to uprobe_write_opcode(). We
need this in a later set of patches.

Signed-off-by: Ravi Bangoria
---
 arch/arm/probes/uprobes/core.c | 2 +-
 arch/mips/kernel/uprobes.c     | 2 +-
 include/linux/uprobes.h        | 2 +-
 kernel/events/uprobes.c        | 9 +
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c
index d1329f1ba4e4..bf992264060e 100644
--- a/arch/arm/probes/uprobes/core.c
+++ b/arch/arm/probes/uprobes/core.c
@@ -32,7 +32,7 @@ bool is_swbp_insn(uprobe_opcode_t *insn)
 int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm,
	     unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr,
+	return uprobe_write_opcode(auprobe, mm, vaddr,
		   __opcode_to_mem_arm(auprobe->bpinsn));
 }

diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c
index f7a0645ccb82..4aaff3b3175c 100644
--- a/arch/mips/kernel/uprobes.c
+++ b/arch/mips/kernel/uprobes.c
@@ -224,7 +224,7 @@ unsigned long arch_uretprobe_hijack_return_addr(
 int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm,
	unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN);
+	return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
 }

 void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0a294e950df8..bb9d2084af03 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -121,7 +121,7 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
 extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
-extern int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 471eac896635..c0418ba52ba8 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -299,8 +299,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
  * Called with mm->mmap_sem held for write.
  * Return 0 (success) or a negative errno.
  */
-int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
-			uprobe_opcode_t opcode)
+int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
+			unsigned long vaddr, uprobe_opcode_t opcode)
 {
	struct page *old_page, *new_page;
	struct vm_area_struct *vma;
@@ -351,7 +351,7 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
  */
 int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN);
+	return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
 }

 /**
@@ -366,7 +366,8 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned
 int __weak set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr, *(uprobe_opcode_t *)&auprobe->insn);
+	return uprobe_write_opcode(auprobe, mm, vaddr,
+				   *(uprobe_opcode_t *)&auprobe->insn);
 }

 static struct uprobe *get_uprobe(struct uprobe *uprobe)
--
2.14.4
[PATCH v8 6/6] perf probe: Support SDT markers having reference counter (semaphore)
With this, perf buildid-cache will save SDT markers with a reference
counter in the probe cache, and perf probe will be able to probe
markers having a reference counter. E.g.:

  # readelf -n /tmp/tick | grep -A1 loop2
    Name: loop2
    ...
    Semaphore: 0x10020036

  # ./perf buildid-cache --add /tmp/tick
  # ./perf probe sdt_tick:loop2
  # ./perf stat -e sdt_tick:loop2 /tmp/tick
    hi: 0
    hi: 1
    hi: 2
    ^C
    Performance counter stats for '/tmp/tick':
                 3      sdt_tick:loop2
       2.561851452 seconds time elapsed

Signed-off-by: Ravi Bangoria
Acked-by: Masami Hiramatsu
Acked-by: Srikar Dronamraju
---
 tools/perf/util/probe-event.c | 39
 tools/perf/util/probe-event.h |  1 +
 tools/perf/util/probe-file.c  | 34 ++--
 tools/perf/util/probe-file.h  |  1 +
 tools/perf/util/symbol-elf.c  | 46 ---
 tools/perf/util/symbol.h      |  7 +++
 6 files changed, 106 insertions(+), 22 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index f119eb628dbb..e86f8be89157 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1819,6 +1819,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev)
		tp->offset = strtoul(fmt2_str, NULL, 10);
	}

+	if (tev->uprobes) {
+		fmt2_str = strchr(p, '(');
+		if (fmt2_str)
+			tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0);
+	}
+
	tev->nargs = argc - 2;
	tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs);
	if (tev->args == NULL) {
@@ -2012,6 +2018,22 @@ static int synthesize_probe_trace_arg(struct probe_trace_arg *arg,
	return err;
 }

+static int
+synthesize_uprobe_trace_def(struct probe_trace_event *tev, struct strbuf *buf)
+{
+	struct probe_trace_point *tp = &tev->point;
+	int err;
+
+	err = strbuf_addf(buf, "%s:0x%lx", tp->module, tp->address);
+
+	if (err >= 0 && tp->ref_ctr_offset) {
+		if (!uprobe_ref_ctr_is_supported())
+			return -1;
+		err = strbuf_addf(buf, "(0x%lx)", tp->ref_ctr_offset);
+	}
+	return err >= 0 ? 0 : -1;
+}
+
 char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 {
	struct probe_trace_point *tp = &tev->point;
@@ -2041,15 +2063,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
	}

	/* Use the tp->address for uprobes */
-	if (tev->uprobes)
-		err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address);
-	else if (!strncmp(tp->symbol, "0x", 2))
+	if (tev->uprobes) {
+		err = synthesize_uprobe_trace_def(tev, &buf);
+	} else if (!strncmp(tp->symbol, "0x", 2)) {
		/* Absolute address. See try_to_find_absolute_address() */
		err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "",
				  tp->module ? ":" : "", tp->address);
-	else
+	} else {
		err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "",
				  tp->module ? ":" : "", tp->symbol, tp->offset);
+	}
+
	if (err)
		goto error;
@@ -2633,6 +2657,13 @@ static void warn_uprobe_event_compat(struct probe_trace_event *tev)
 {
	int i;
	char *buf = synthesize_probe_trace_command(tev);
+	struct probe_trace_point *tp = &tev->point;
+
+	if (tp->ref_ctr_offset && !uprobe_ref_ctr_is_supported()) {
+		pr_warning("A semaphore is associated with %s:%s and "
+			   "seems your kernel doesn't support it.\n",
+			   tev->group, tev->event);
+	}

	/* Old uprobe event doesn't support memory dereference */
	if (!tev->uprobes || tev->nargs == 0 || !buf)
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 45b14f020558..15a98c3a2a2f 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -27,6 +27,7 @@ struct probe_trace_point {
	char		*symbol;	/* Base symbol */
	char		*module;	/* Module name */
	unsigned long	offset;		/* Offset from symbol */
+	unsigned long	ref_ctr_offset;	/* SDT reference counter offset */
	unsigned long	address;	/* Actual address of the trace point */
	bool		retprobe;	/* Return probe flag */
 };
diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/pr
[PATCH v8 0/6] Uprobes: Support SDT markers having reference count (semaphore)
v8 changes:
 - Call delayed_uprobe_remove(uprobe, NULL) from put_uprobe() as well.
 - Other minor optimizations.

v7: https://lkml.org/lkml/2018/7/31/20

Description:

Userspace Statically Defined Tracepoints[1] are dtrace-style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc.
have these markers embedded in them. These markers are added by
developers at important places in the code. Each marker source expands
to a single nop instruction in the compiled code, but there may be
additional overhead for computing the marker arguments, which expands
to a couple of instructions. When that overhead matters, the runtime
can skip it with an if() condition when no one is tracing the marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

The default value of the reference counter is 0. A tracer has to
increment the reference counter before tracing on a marker and
decrement it when done with the tracing.

Currently, the perf tool has limited support for SDT markers: it
cannot trace markers surrounded by a reference counter. Also, it's not
easy to add reference counter logic in a userspace tool like perf, so
the basic idea of this patchset is to add the reference counter logic
in the uprobe infrastructure. E.g.,[2]

  # cat tick.c
    ...
    for (i = 0; i < 100; i++) {
        DTRACE_PROBE1(tick, loop1, i);
        if (TICK_LOOP2_ENABLED()) {
            DTRACE_PROBE1(tick, loop2, i);
        }
        printf("hi: %d\n", i);
        sleep(1);
    }
    ...

Here tick:loop1 is a marker without a reference counter, whereas
tick:loop2 is surrounded by a reference counter condition.

  # perf buildid-cache --add /tmp/tick
  # perf probe sdt_tick:loop1
  # perf probe sdt_tick:loop2

  # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
    hi: 0
    hi: 1
    hi: 2
    ^C
    Performance counter stats for '/tmp/tick':
                 3      sdt_tick:loop1
                 0      sdt_tick:loop2
       2.747086086 seconds time elapsed

Perf failed to record data for tick:loop2. Same experiment with this
patch series:

  # ./perf buildid-cache --add /tmp/tick
  # ./perf probe sdt_tick:loop2
  # ./perf stat -e sdt_tick:loop2 /tmp/tick
    hi: 0
    hi: 1
    hi: 2
    ^C
    Performance counter stats for '/tmp/tick':
                 3      sdt_tick:loop2
       2.561851452 seconds time elapsed

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Ravi Bangoria (6):
  Uprobes: Simplify uprobe_register() body
  Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  Uprobes: Support SDT markers having reference count (semaphore)
  Uprobes/sdt: Prevent multiple reference counter for same uprobe
  trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
  perf probe: Support SDT markers having reference counter (semaphore)

 arch/arm/probes/uprobes/core.c |   2 +-
 arch/mips/kernel/uprobes.c     |   2 +-
 include/linux/uprobes.h        |   7 +-
 kernel/events/uprobes.c        | 329 +++--
 kernel/trace/trace.c           |   2 +-
 kernel/trace/trace_uprobe.c    |  75 +-
 tools/perf/util/probe-event.c  |  39 -
 tools/perf/util/probe-event.h  |   1 +
 tools/perf/util/probe-file.c   |  34 -
 tools/perf/util/probe-file.h   |   1 +
 tools/perf/util/symbol-elf.c   |  46 --
 tools/perf/util/symbol.h       |   7 +
 12 files changed, 472 insertions(+), 73 deletions(-)

--
2.14.4
Re: [PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Hi Song,

On 08/11/2018 01:27 PM, Song Liu wrote:
>> +
>> +static void delayed_uprobe_delete(struct delayed_uprobe *du)
>> +{
>> +	if (!du)
>> +		return;
> Do we really need this check?

Not strictly necessary, but I would still like to keep it for safety.

>
>> +	list_del(&du->list);
>> +	kfree(du);
>> +}
>> +
>> +static void delayed_uprobe_remove(struct uprobe *uprobe, struct mm_struct *mm)
>> +{
>> +	struct list_head *pos, *q;
>> +	struct delayed_uprobe *du;
>> +
>> +	if (!uprobe && !mm)
>> +		return;
> And do we really need this check?

Yes. delayed_uprobe_remove(uprobe=NULL, mm=NULL) is an invalid case. If
I remove this check, the code below (or, more accurately, the code
suggested by Oleg) will remove all entries from delayed_uprobe_list. So
I will keep this check but put a comment above the function.

[...]

>> +
>> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
>> +			FOLL_WRITE, &page, &vma, NULL);
>> +	if (unlikely(ret <= 0)) {
>> +		/*
>> +		 * We are asking for 1 page. If get_user_pages_remote() fails,
>> +		 * it may return 0, in that case we have to return error.
>> +		 */
>> +		ret = (ret == 0) ? -EBUSY : ret;
>> +		pr_warn("Failed to %s ref_ctr. (%d)\n",
>> +			d > 0 ? "increment" : "decrement", ret);
> This warning is not really useful. Seems this function has little information
> about which uprobe is failing here. Maybe we only need warning in the caller
> (or caller of caller).

Sure, I can move this warning to the caller of this function, but which
exact fields would you like to print with the warning? Something like
this is fine?

    pr_warn("ref_ctr %s failed for 0x%lx, 0x%lx, 0x%lx, 0x%p",
            d > 0 ? "increment" : "decrement", inode->i_ino,
            offset, ref_ctr_offset, mm);

More importantly, the reason I didn't print more info is that dmesg is
accessible to unprivileged users in many distros, but uprobes are not.
So printing this information may be a security violation. No?

>
>> +		return ret;
>> +	}
>> +
>> +	kaddr = kmap_atomic(page);
>> +	ptr = kaddr + (vaddr & ~PAGE_MASK);
>> +
>> +	if (unlikely(*ptr + d < 0)) {
>> +		pr_warn("ref_ctr going negative. vaddr: 0x%lx, "
>> +			"curr val: %d, delta: %d\n", vaddr, *ptr, d);
>> +		ret = -EINVAL;
>> +		goto out;
>> +	}
>> +
>> +	*ptr += d;
>> +	ret = 0;
>> +out:
>> +	kunmap_atomic(kaddr);
>> +	put_page(page);
>> +	return ret;
>> +}
>> +
>> +static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
>> +			  bool is_register)
> What's the reason of bool is_register here vs. short d in __update_ref_ctr()?
> Can we use short for both?

Yes, I can use short as well.

>
>> +{
>> +	struct vm_area_struct *rc_vma;
>> +	unsigned long rc_vaddr;
>> +	int ret = 0;
>> +
>> +	rc_vma = find_ref_ctr_vma(uprobe, mm);
>> +
>> +	if (rc_vma) {
>> +		rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
>> +		ret = __update_ref_ctr(mm, rc_vaddr, is_register ? 1 : -1);
>> +
>> +		if (is_register)
>> +			return ret;
>> +	}
> Mixing __update_ref_ctr() here and delayed_uprobe_add() in the same
> function is a little confusing (at least for me). How about we always use
> delayed uprobe for uprobe_mmap() and use non-delayed in other case(s)?

No. delayed_uprobe_add() is needed for the uprobe_register() case to
handle the race between uprobe_register() and process creation.

[...]

>>
>> +static int delayed_uprobe_install(struct vm_area_struct *vma)
> This function name is confusing. How about we call it delayed_ref_ctr_incr() or
> something similar? Also, we should add comments to highlight this is vma is not
> the vma containing the uprobe, but the vma containing the ref_ctr.

Sure, I'll do that.

>
>> +{
>> +	struct list_head *pos, *q;
>> +	struct delayed_uprobe *du;
>> +	unsigned long vaddr;
>> +	int ret = 0, err = 0;
>> +
>> +	mutex_lock(&delayed_uprobe_lock);
>> +	list_for_each_safe(pos, q, &delayed_uprobe_list) {
>> +		du = list_entry(pos, struct delayed_uprobe, list);
>> +
>> +		if (!valid_ref_ctr_vma(du->uprobe, vma))
>> +			continue;
>> +
>> +		vaddr = offset_to_vaddr(vma, du->uprobe->ref_ctr_offset);
>> +		ret = __update_ref_ctr(vma->vm_mm, vaddr, 1);
>> +		/* Record an error and continue. */
>> +		if (ret && !err)
>> +			err = ret;
> I think this is a good place (when ret != 0) to call pr_warn(). I guess we can
> print which mm get error for which uprobe (inode+offset).

__update_ref_ctr() is already printing a warning
Re: [PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Hi Song,

On 08/13/2018 11:17 AM, Ravi Bangoria wrote:
>>> +
>>> +static void delayed_uprobe_remove(struct uprobe *uprobe, struct mm_struct *mm)
>>> +{
>>> +	struct list_head *pos, *q;
>>> +	struct delayed_uprobe *du;
>>> +
>>> +	if (!uprobe && !mm)
>>> +		return;
>> And do we really need this check?
>
> Yes. delayed_uprobe_remove(uprobe=NULL, mm=NULL) is an invalid case. If I remove
> this check, code below (or more accurately code suggested by Oleg) will remove
> all entries from delayed_uprobe_list. So I will keep this check but put a comment
> above function.

Sorry, my bad. Please ignore the above comment.

Even though it saves us an unnecessary walk over the entire
delayed_uprobe_list when both uprobe and mm are NULL, that case is not
possible with the current code. Also, I'm not dereferencing either of
them. So, IMHO, it's fine to remove this check.

Thanks,
Ravi
Re: [PATCH v8 5/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
Hi Song,

On 08/11/2018 01:42 PM, Song Liu wrote:
> Do we really need this given we already have PATCH 4/6?
> uprobe_register() can be called
> out of trace_uprobe, this patch won't catch all conflicts anyway.

Right, but it at least catches all conflicts happening via
trace_uprobe. I don't mind removing this patch, but I would like to get
an opinion from Oleg/Srikar/Steven/Masami.

Thanks,
Ravi
Re: [PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Hi Oleg,

On 08/13/2018 05:20 PM, Oleg Nesterov wrote:
> On 08/13, Ravi Bangoria wrote:
>>
>> On 08/11/2018 01:27 PM, Song Liu wrote:
>>>> +
>>>> +static void delayed_uprobe_delete(struct delayed_uprobe *du)
>>>> +{
>>>> +	if (!du)
>>>> +		return;
>>> Do we really need this check?
>>
>> Not necessary though, but I would still like to keep it for a safety.
>
> Heh. I tried to ignore all minor problems in this version, but now that Song
> mentioned this unnecessary check...
>
> Personally I really dislike the checks like this one.
>
> - It can confuse the reader who will try to understand the purpose
>
> - it can hide a bug if delayed_uprobe_delete(du) is actually called
>   with du == NULL.
>
> IMO, you should either remove it and let the kernel crash (to notice the
> problem), or turn it into
>
>	if (WARN_ON(!du))
>		return;
>
> which is self-documented and reports the problem without kernel crash.

Ok. I'll remove that check.

>
>>>> +	rc_vma = find_ref_ctr_vma(uprobe, mm);
>>>> +
>>>> +	if (rc_vma) {
>>>> +		rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
>>>> +		ret = __update_ref_ctr(mm, rc_vaddr, is_register ? 1 : -1);
>>>> +
>>>> +		if (is_register)
>>>> +			return ret;
>>>> +	}
>>> Mixing __update_ref_ctr() here and delayed_uprobe_add() in the same
>>> function is a little confusing (at least for me). How about we always use
>>> delayed uprobe for uprobe_mmap() and use non-delayed in other case(s)?
>>
>> No. delayed_uprobe_add() is needed for uprobe_register() case to handle race
>> between uprobe_register() and process creation.
>
> Yes.
>
> But damn, process creation (exec) is trivial. We could add a new uprobe_exec()
> hook and avoid delayed_uprobe_install() in uprobe_mmap().

I'm sorry, I didn't get this.

>
> Afaics, the really problematic case is dlopen() which can race with _register()
> too, right?

dlopen() should internally use mmap(), right? So what is the problem
here? Can you please elaborate.

Thanks,
Ravi
Re: [PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
On 08/13/2018 06:47 PM, Oleg Nesterov wrote:
> On 08/13, Ravi Bangoria wrote:
>>
>>> But damn, process creation (exec) is trivial. We could add a new uprobe_exec()
>>> hook and avoid delayed_uprobe_install() in uprobe_mmap().
>>
>> I'm sorry. I didn't get this.
>
> Sorry for confusion...
>
> I meant, if only exec*() could race with _register(), we could add another uprobe
> hook which updates all (delayed) counters before return to user-mode.

Ok.

>
>>> Afaics, the really problematic case is dlopen() which can race with _register()
>>> too, right?
>>
>> dlopen() should internally use mmap() right? So what is the problem here? Can
>> you please elaborate.
>
> What I tried to say is that we can't avoid uprobe_mmap()->delayed_uprobe_install()
> because dlopen() can race with _register() too, just like exec.

Right :)

Thanks,
Ravi
Re: [PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Hi Song,

On 08/13/2018 10:42 PM, Song Liu wrote:
> On Mon, Aug 13, 2018 at 6:17 AM, Oleg Nesterov wrote:
>> On 08/13, Ravi Bangoria wrote:
>>>
>>>> But damn, process creation (exec) is trivial. We could add a new uprobe_exec()
>>>> hook and avoid delayed_uprobe_install() in uprobe_mmap().
>>>
>>> I'm sorry. I didn't get this.
>>
>> Sorry for confusion...
>>
>> I meant, if only exec*() could race with _register(), we could add another uprobe
>> hook which updates all (delayed) counters before return to user-mode.
>>
>>>> Afaics, the really problematic case is dlopen() which can race with _register()
>>>> too, right?
>>>
>>> dlopen() should internally use mmap() right? So what is the problem here? Can
>>> you please elaborate.
>>
>> What I tried to say is that we can't avoid uprobe_mmap()->delayed_uprobe_install()
>> because dlopen() can race with _register() too, just like exec.
>>
>> Oleg.
>>
> How about we do delayed_uprobe_install() per file? Say we keep a list
> of delayed_uprobe in load_elf_binary(). Then we can install delayed_uprobe
> after loading all sections of the file.

I'm not sure I totally understood the idea, but how would this approach
solve the dlopen() race with _register()?

Rather, making delayed_uprobe_list an mm field seems like a simple and
effective idea to me. The only overhead would be the
list_empty(mm->delayed_list) check.

Please let me know if I misunderstood you.

Thanks,
Ravi
Re: [PATCH v8 3/6] Uprobes: Support SDT markers having reference count (semaphore)
> +static int delayed_uprobe_install(struct vm_area_struct *vma)
> +{
> +	struct list_head *pos, *q;
> +	struct delayed_uprobe *du;
> +	unsigned long vaddr;
> +	int ret = 0, err = 0;
> +
> +	mutex_lock(&delayed_uprobe_lock);
> +	list_for_each_safe(pos, q, &delayed_uprobe_list) {
> +		du = list_entry(pos, struct delayed_uprobe, list);
> +
> +		if (!valid_ref_ctr_vma(du->uprobe, vma))
> +			continue;

I think we should compare mm here, i.e.:

	if (du->mm != vma->vm_mm || !valid_ref_ctr_vma(du->uprobe, vma))
		continue;

Otherwise things can mess up.

> +
> +		vaddr = offset_to_vaddr(vma, du->uprobe->ref_ctr_offset);
> +		ret = __update_ref_ctr(vma->vm_mm, vaddr, 1);
> +		/* Record an error and continue. */
> +		if (ret && !err)
> +			err = ret;
> +		delayed_uprobe_delete(du);
> +	}
> +	mutex_unlock(&delayed_uprobe_lock);
> +	return err;
> +}
Re: [PATCH v9 0/4] Uprobes: Support SDT markers having reference count (semaphore)
Hi Song,

> root@virt-test:~# ~/a.out
> 11
> semaphore 0
> semaphore 0
> semaphore 2 <<< when the uprobe is enabled

Yes, this happens when multiple vmas point to the same file portion.
Can you check /proc/`pgrep a.out`/maps?

The logic is simple: if we are going to patch an instruction, increment
the reference counter; if we are going to unpatch an instruction,
decrement the reference counter. In this case, we patched the
instruction twice and thus incremented the reference counter twice as
well.

Ravi
Re: [PATCH v9 0/4] Uprobes: Support SDT markers having reference count (semaphore)
Hi Song,

> However, if I start a.out AFTER enabling the uprobe, there is something wrong:
>
> root@virt-test:~# ~/a.out
> 11
> semaphore 0 <<< this should be non-zero, as the uprobe is already enabled

Ok. I'm able to reproduce this. Digging deeper.

Ravi
Re: [PATCH v9 0/4] Uprobes: Support SDT markers having reference count (semaphore)
Hi Song,

On 08/21/2018 10:53 AM, Ravi Bangoria wrote:
> Hi Song,
>
>> However, if I start a.out AFTER enabling the uprobe, there is something wrong:
>>
>> root@virt-test:~# ~/a.out
>> 11
>> semaphore 0    <<< this should be non-zero, as the uprobe is already enabled

In this testcase, the semaphore variable is stored in .bss:

  $ nm test | grep semaphore
  10010c5e B semaphore

  $ readelf -SW ./test | grep "data\|bss"
    [22] .data  PROGBITS  10010c58 000c58 04 00  WA  0  0  1
    [23] .bss   NOBITS    10010c5c 000c5c 04 00  WA  0  0  2

I'm not so sure, but I guess .bss data initialization happens after
calling uprobe_mmap(), and thus you are seeing the semaphore as 0.

To verify this, if I force the semaphore into the data section by
assigning a non-zero value to it:

  volatile short semaphore = 1;

  $ nm test | grep semaphore
  10010c5c D semaphore

  $ readelf -SW ./test | grep "data\|bss"
    [22] .data  PROGBITS  10010c58 000c58 06 00  WA  0  0  2
    [23] .bss   NOBITS    10010c5e 000c5e 02 00  WA  0  0  1

increment/decrement works fine.

Ravi
Re: [PATCH v9 0/4] Uprobes: Support SDT markers having reference count (semaphore)
On 08/21/2018 01:04 PM, Naveen N. Rao wrote:
> Song Liu wrote:
>> I am testing the patch set with the following code:
>>
>> #include <stdio.h>
>> #include <unistd.h>
>>
>> volatile short semaphore = 0;
>>
>> int for_uprobe(int c)
>> {
>>	printf("%d\n", c + 10);
>>	return c + 1;
>> }
>>
>> int main(int argc, char *argv[])
>> {
>>	for_uprobe(argc);
>>	while (1) {
>>		sleep(1);
>>		printf("semaphore %d\n", semaphore);
>>	}
>> }
>>
>> I created a uprobe on function for_uprobe(), that uses semaphore as
>> reference counter:
>>
>> echo "p:uprobe_1 /root/a.out:0x49a(0x1036)" >> uprobe_events
>
> Is that even valid? That _looks_ like a semaphore, but I'm not quite sure
> that it qualifies as an _SDT_ semaphore. Do you see this issue if you instead
> use the macros provided by <sys/sdt.h> to create SDT markers?

Right. By default, SDT reference counters (semaphores) go into the
.probes section:

  [25] .probes  PROGBITS  0060102c 00102c 04 00  WA  0  0  2

which has PROGBITS set, so this works fine. And there are many other
things which are coded into <sys/sdt.h>. So the official way to use SDT
markers should be through that header.

Ravi
Re: [PATCH] trace_uprobe: support reference count in fd-based uprobe
On 08/22/2018 03:53 AM, Song Liu wrote:
> This patch applies on top of Ravi Bangoria's work that enables reference
> count for uprobe:
>
> https://lkml.org/lkml/2018/8/20/37
>
> After Ravi's work, the effort to enable it in fd-based uprobe is
> straightforward. The highest 40 bits of perf_event_attr.config are used
> to store the offset of the reference count (semaphore).
>
> Format information in /sys/bus/event_source/devices/uprobe/format/ is
> updated to reflect this new feature.

LGTM.

Reviewed-and-tested-by: Ravi Bangoria

>
> Signed-off-by: Song Liu
> Cc: Ravi Bangoria
> Cc: Masami Hiramatsu
> Cc: Oleg Nesterov
> Cc: Srikar Dronamraju
> Cc: Naveen N. Rao
> Cc: Steven Rostedt (VMware)
> Cc: Peter Zijlstra
> Cc: Ingo Molnar
> Cc: Teng Qin
> ---
>  include/linux/trace_events.h    |  3 +-
>  kernel/events/core.c            | 49 ++---
>  kernel/trace/trace_event_perf.c |  7 +++--
>  kernel/trace/trace_probe.h      |  3 +-
>  kernel/trace/trace_uprobe.c     |  4 ++-
>  5 files changed, 50 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> index 78a010e19ed4..4130a5497d40 100644
> --- a/include/linux/trace_events.h
> +++ b/include/linux/trace_events.h
> @@ -575,7 +575,8 @@ extern int bpf_get_kprobe_info(const struct perf_event *event,
>				bool perf_type_tracepoint);
>  #endif
>  #ifdef CONFIG_UPROBE_EVENTS
> -extern int perf_uprobe_init(struct perf_event *event, bool is_retprobe);
> +extern int perf_uprobe_init(struct perf_event *event,
> +			    unsigned long ref_ctr_offset, bool is_retprobe);
>  extern void perf_uprobe_destroy(struct perf_event *event);
>  extern int bpf_get_uprobe_info(const struct perf_event *event,
>				u32 *fd_type, const char **filename,
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 8f0434a9951a..75a0219be420 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -8363,30 +8363,39 @@ static struct pmu perf_tracepoint = {
>   *
>   * PERF_PROBE_CONFIG_IS_RETPROBE	if set, create kretprobe/uretprobe
>   *					if not set, create kprobe/uprobe
> + *
> + * The following values specify a reference counter (or semaphore in the
> + * terminology of tools like dtrace, systemtap, etc.) Userspace Statically
> + * Defined Tracepoints (USDT). Currently, we use 40 bit for the offset.
> + *
> + * PERF_UPROBE_REF_CTR_OFFSET_BITS	# of bits in config as th offset
> + * PERF_UPROBE_REF_CTR_OFFSET_SHIFT	# of bits to shift left
>   */
>  enum perf_probe_config {
>	PERF_PROBE_CONFIG_IS_RETPROBE = 1U << 0,  /* [k,u]retprobe */
> +	PERF_UPROBE_REF_CTR_OFFSET_BITS = 40,
> +	PERF_UPROBE_REF_CTR_OFFSET_SHIFT = 64 - PERF_UPROBE_REF_CTR_OFFSET_BITS,
>  };
>
>  PMU_FORMAT_ATTR(retprobe, "config:0");
> +#endif
>
> -static struct attribute *probe_attrs[] = {
> +#ifdef CONFIG_KPROBE_EVENTS
> +static struct attribute *kprobe_attrs[] = {
>	&format_attr_retprobe.attr,
>	NULL,
>  };
>
> -static struct attribute_group probe_format_group = {
> +static struct attribute_group kprobe_format_group = {
>	.name = "format",
> -	.attrs = probe_attrs,
> +	.attrs = kprobe_attrs,
>  };
>
> -static const struct attribute_group *probe_attr_groups[] = {
> -	&probe_format_group,
> +static const struct attribute_group *kprobe_attr_groups[] = {
> +	&kprobe_format_group,
>	NULL,
>  };
> -#endif
>
> -#ifdef CONFIG_KPROBE_EVENTS
>  static int perf_kprobe_event_init(struct perf_event *event);
>  static struct pmu perf_kprobe = {
>	.task_ctx_nr	= perf_sw_context,
> @@ -8396,7 +8405,7 @@ static struct pmu perf_kprobe = {
>	.start		= perf_swevent_start,
>	.stop		= perf_swevent_stop,
>	.read		= perf_swevent_read,
> -	.attr_groups	= probe_attr_groups,
> +	.attr_groups	= kprobe_attr_groups,
>  };
>
>  static int perf_kprobe_event_init(struct perf_event *event)
> @@ -8428,6 +8437,24 @@ static int perf_kprobe_event_init(struct perf_event *event)
>  #endif /* CONFIG_KPROBE_EVENTS */
>
>  #ifdef CONFIG_UPROBE_EVENTS
> +PMU_FORMAT_ATTR(ref_ctr_offset, "config:63-24");
> +
> +static struct attribute *uprobe_attrs[] = {
> +	&format_attr_retprobe.attr,
> +	&format_attr_ref_ctr_offset.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group uprobe_format_group = {
> +	.name = "format",
> +	.attrs = uprobe_attrs
Re: [PATCH v6 5/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
Hi Oleg,

On 07/25/2018 04:38 PM, Oleg Nesterov wrote:
> No, I can't understand this patch...
>
> On 07/16, Ravi Bangoria wrote:
>>
>> --- a/kernel/events/uprobes.c
>> +++ b/kernel/events/uprobes.c
>> @@ -63,6 +63,8 @@ static struct percpu_rw_semaphore dup_mmap_sem;
>>
>>  /* Have a copy of original instruction */
>>  #define UPROBE_COPY_INSN	0
>> +/* Reference counter offset is reloaded with non-zero value. */
>> +#define REF_CTR_OFF_RELOADED	1
>>
>>  struct uprobe {
>>	struct rb_node		rb_node;	/* node in the rb tree */
>> @@ -476,9 +478,23 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>		return ret;
>>
>>	ret = verify_opcode(old_page, vaddr, &opcode);
>> -	if (ret <= 0)
>> +	if (ret < 0)
>>		goto put_old;
>
> I agree, "ret <= 0" wasn't nice even before this change, but "ret < 0" looks
> worse because this is simply not possible.

Ok. Any better idea? I think if we don't track all mms patched by the
uprobe, we have to rely on the current instruction.

>
>> +	/*
>> +	 * If instruction is already patched but reference counter offset
>> +	 * has been reloaded to non-zero value, increment the reference
>> +	 * counter and return.
>> +	 */
>> +	if (ret == 0) {
>> +		if (is_register &&
>> +		    test_bit(REF_CTR_OFF_RELOADED, &uprobe->flags)) {
>> +			WARN_ON(!uprobe->ref_ctr_offset);
>> +			ret = update_ref_ctr(uprobe, mm, true);
>> +		}
>> +		goto put_old;
>> +	}
>
> So we need to force update_ref_ctr(true) in case when uprobe_register_refctr()
> detects the already registered uprobe with ref_ctr_offset == 0, and then it calls
> register_for_each_vma().
>
> Why this can't race with uprobe_mmap() ?
>
> uprobe_mmap() can do install_breakpoint() right after REF_CTR_OFF_RELOADED was set,
> then register_for_each_vma() will find this vma and do install_breakpoint() too.
> If ref_ctr_vma was already mmaped, the counter will be incremented twice, no?

Hmm, right. Probably I can fix this race by using some lock, but I
don't know if it's good to do that inside uprobe_mmap().

>
>> @@ -971,6 +1011,7 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>>	bool is_register = !!new;
>>	struct map_info *info;
>>	int err = 0;
>> +	bool installed = false;
>>
>>	percpu_down_write(&dup_mmap_sem);
>>	info = build_map_info(uprobe->inode->i_mapping,
>> @@ -1000,8 +1041,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>>		if (is_register) {
>>			/* consult only the "caller", new consumer. */
>>			if (consumer_filter(new,
>> -					UPROBE_FILTER_REGISTER, mm))
>> +					UPROBE_FILTER_REGISTER, mm)) {
>>				err = install_breakpoint(uprobe, mm, vma,
>>						info->vaddr);
>> +				installed = true;
>> +			}
>>		} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
>>			if (!filter_chain(uprobe,
>>					UPROBE_FILTER_UNREGISTER, mm))
>> @@ -1016,6 +1059,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>>	}
>>  out:
>>	percpu_up_write(&dup_mmap_sem);
>> +	if (installed)
>> +		clear_bit(REF_CTR_OFF_RELOADED, &uprobe->flags);
>
> I simply can't understand this "bool installed"

That boolean is needed because consumer_filter() returns false when
this function gets called the first time from uprobe_register(), and
consumer_filter() returns true when it gets called by uprobe_apply().
If I make it unconditional, there is no effect of setting the
REF_CTR_OFF_RELOADED bit. So this boolean is needed.

Though, there is one more issue I found. Let's say there are two
processes running, and a user probes both of them via
uprobe_register(), using, say, SystemTap. Now, some other guy does
uprobe_register_refctr() using 'perf -p PID' on the same uprobe, but he
is interested in only one process. Here, we will increment the
reference counter only in the "PID" process, and we will clear the
REF_CTR_OFF_RELOADED flag. Later, some other guy does 'perf -a' which will ca
Re: [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore)
Oleg, Srikar, Masami, Song, Any feedback please? :) Thanks, Ravi On 07/16/2018 02:17 PM, Ravi Bangoria wrote: > Userspace Statically Defined Tracepoints[1] are dtrace style markers > inside userspace applications. Applications like PostgreSQL, MySQL, > Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc > have these markers embedded in them. These markers are added by developer > at important places in the code. Each marker source expands to a single > nop instruction in the compiled code but there may be additional > overhead for computing the marker arguments which expands to couple of > instructions. In case the overhead is more, execution of it can be > omitted by runtime if() condition when no one is tracing on the marker: > > if (reference_counter > 0) { > Execute marker instructions; > } > > Default value of reference counter is 0. Tracer has to increment the > reference counter before tracing on a marker and decrement it when > done with the tracing. > > Currently, perf tool has limited supports for SDT markers. I.e. it > can not trace markers surrounded by reference counter. Also, it's > not easy to add reference counter logic in userspace tool like perf, > so basic idea for this patchset is to add reference counter logic in > the a uprobe infrastructure. Ex,[2] > > # cat tick.c > ... > for (i = 0; i < 100; i++) { > DTRACE_PROBE1(tick, loop1, i); > if (TICK_LOOP2_ENABLED()) { > DTRACE_PROBE1(tick, loop2, i); > } > printf("hi: %d\n", i); > sleep(1); > } > ... > > Here tick:loop1 is marker without reference counter where as tick:loop2 > is surrounded by reference counter condition. > > # perf buildid-cache --add /tmp/tick > # perf probe sdt_tick:loop1 > # perf probe sdt_tick:loop2 > > # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick > hi: 0 > hi: 1 > hi: 2 > ^C > Performance counter stats for '/tmp/tick': > 3 sdt_tick:loop1 > 0 sdt_tick:loop2 > 2.747086086 seconds time elapsed > > Perf failed to record data for tick:loop2. 
> Same experiment with this patch series:
>
>     # ./perf buildid-cache --add /tmp/tick
>     # ./perf probe sdt_tick:loop2
>     # ./perf stat -e sdt_tick:loop2 /tmp/tick
>     hi: 0
>     hi: 1
>     hi: 2
>     ^C
>     Performance counter stats for '/tmp/tick':
>                  3      sdt_tick:loop2
>        2.561851452 seconds time elapsed
>
> v6 changes:
>  - Do not export struct uprobe outside of uprobe.c. Instead, use
>    container_of(arch_uprobe) to regain the uprobe in uprobe_write_opcode().
>  - Allow 0 as a special value for the reference counter _offset_. I.e.
>    two uprobes, one having ref_ctr_offset=0 and the other having a
>    non-zero ref_ctr_offset, can coexist.
>  - If the vma holding the reference counter is not present while
>    patching an instruction, we add that uprobe to delayed_uprobe_list.
>    When the appropriate mapping is created, we increment the reference
>    counter and remove the uprobe from delayed_uprobe_list. While doing
>    all this, v5 was searching for all such uprobes in uprobe_tree,
>    which is not required. Also, uprobes are stored in the rbtree with
>    inode+offset as the key, and v5 was searching for the uprobe based
>    on inode+ref_ctr_offset, which is wrong too. Fix this by directly
>    looking at delayed_uprobe_list.
>  - Consider a VM_SHARED vma as invalid for the reference counter.
>    Return false from valid_ref_ctr_vma() if vma->vm_flags has
>    VM_SHARED set.
>  - No need to use FOLL_FORCE in get_user_pages_remote() while getting
>    a page to update the reference counter. Remove it.
>  - Do not mention usage of the reference counter in Documentation/.
>
> v5 can be found at: https://lkml.org/lkml/2018/6/28/51
>
> Ravi Bangoria (6):
>   Uprobes: Simplify uprobe_register() body
>   Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
>   Uprobes: Support SDT markers having reference count (semaphore)
>   trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
>   Uprobes/sdt: Prevent multiple reference counter for same uprobe
>   perf probe: Support SDT markers having reference counter (semaphore)
>
>  arch/arm/probes/uprobes/core.c |   2 +-
>  arch/mips/kernel/uprobes.c     |   2 +-
>  include/linux/uprobes.h        |   7 +-
>  kernel/events/uprobes.c        | 358 ++++++++++++++++++++++++++++++-
>  kernel/trace/trace.c           |   2 +-
>  kernel/trace/trace_uprobe.c    |  78 +++++++-
>  tools/perf/util/probe-event.c  |  39 +++-
>  tools/perf/util/probe-event.h  |   1 +
>  tools/perf/util/probe-file.c   |  34 +++-
>  tools/perf/util/probe-file.h   |   1 +
>  tools/perf/util/symbol-elf.c   |  46 ++++-
>  tools/perf/util/symbol.h       |   7 +
>  12 files changed, 503 insertions(+), 74 deletions(-)
[PATCH v7 4/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
We assume there is only one reference counter per uprobe. Don't allow
the user to register multiple uprobes having the same inode+offset but
different reference counters.

Signed-off-by: Ravi Bangoria
---
 kernel/events/uprobes.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ad92fed11526..c27546929ae7 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -679,6 +679,12 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
 	if (cur_uprobe) {
+		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
+			pr_warn("Reference counter offset mismatch.\n");
+			put_uprobe(cur_uprobe);
+			kfree(uprobe);
+			return ERR_PTR(-EINVAL);
+		}
 		kfree(uprobe);
 		uprobe = cur_uprobe;
 	}
@@ -1093,6 +1099,9 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (!uprobe)
 		return -ENOMEM;
+	if (IS_ERR(uprobe))
+		return PTR_ERR(uprobe);
+
 	/*
 	 * We can race with uprobe_unregister()->delete_uprobe().
 	 * Check uprobe_is_active() and retry if it is false.
-- 
2.14.4
[PATCH v7 6/6] perf probe: Support SDT markers having reference counter (semaphore)
With this, perf buildid-cache will save SDT markers with reference counter in probe cache. Perf probe will be able to probe markers having reference counter. Ex, # readelf -n /tmp/tick | grep -A1 loop2 Name: loop2 ... Semaphore: 0x10020036 # ./perf buildid-cache --add /tmp/tick # ./perf probe sdt_tick:loop2 # ./perf stat -e sdt_tick:loop2 /tmp/tick hi: 0 hi: 1 hi: 2 ^C Performance counter stats for '/tmp/tick': 3 sdt_tick:loop2 2.561851452 seconds time elapsed Signed-off-by: Ravi Bangoria Acked-by: Masami Hiramatsu Acked-by: Srikar Dronamraju --- tools/perf/util/probe-event.c | 39 tools/perf/util/probe-event.h | 1 + tools/perf/util/probe-file.c | 34 ++-- tools/perf/util/probe-file.h | 1 + tools/perf/util/symbol-elf.c | 46 --- tools/perf/util/symbol.h | 7 +++ 6 files changed, 106 insertions(+), 22 deletions(-) diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index f119eb628dbb..e86f8be89157 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -1819,6 +1819,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev) tp->offset = strtoul(fmt2_str, NULL, 10); } + if (tev->uprobes) { + fmt2_str = strchr(p, '('); + if (fmt2_str) + tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0); + } + tev->nargs = argc - 2; tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs); if (tev->args == NULL) { @@ -2012,6 +2018,22 @@ static int synthesize_probe_trace_arg(struct probe_trace_arg *arg, return err; } +static int +synthesize_uprobe_trace_def(struct probe_trace_event *tev, struct strbuf *buf) +{ + struct probe_trace_point *tp = &tev->point; + int err; + + err = strbuf_addf(buf, "%s:0x%lx", tp->module, tp->address); + + if (err >= 0 && tp->ref_ctr_offset) { + if (!uprobe_ref_ctr_is_supported()) + return -1; + err = strbuf_addf(buf, "(0x%lx)", tp->ref_ctr_offset); + } + return err >= 0 ? 
0 : -1; +} + char *synthesize_probe_trace_command(struct probe_trace_event *tev) { struct probe_trace_point *tp = &tev->point; @@ -2041,15 +2063,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev) } /* Use the tp->address for uprobes */ - if (tev->uprobes) - err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address); - else if (!strncmp(tp->symbol, "0x", 2)) + if (tev->uprobes) { + err = synthesize_uprobe_trace_def(tev, &buf); + } else if (!strncmp(tp->symbol, "0x", 2)) { /* Absolute address. See try_to_find_absolute_address() */ err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "", tp->module ? ":" : "", tp->address); - else + } else { err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "", tp->module ? ":" : "", tp->symbol, tp->offset); + } + if (err) goto error; @@ -2633,6 +2657,13 @@ static void warn_uprobe_event_compat(struct probe_trace_event *tev) { int i; char *buf = synthesize_probe_trace_command(tev); + struct probe_trace_point *tp = &tev->point; + + if (tp->ref_ctr_offset && !uprobe_ref_ctr_is_supported()) { + pr_warning("A semaphore is associated with %s:%s and " + "seems your kernel doesn't support it.\n", + tev->group, tev->event); + } /* Old uprobe event doesn't support memory dereference */ if (!tev->uprobes || tev->nargs == 0 || !buf) diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h index 45b14f020558..15a98c3a2a2f 100644 --- a/tools/perf/util/probe-event.h +++ b/tools/perf/util/probe-event.h @@ -27,6 +27,7 @@ struct probe_trace_point { char*symbol;/* Base symbol */ char*module;/* Module name */ unsigned long offset; /* Offset from symbol */ + unsigned long ref_ctr_offset; /* SDT reference counter offset */ unsigned long address;/* Actual address of the trace point */ boolretprobe; /* Return probe flag */ }; diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/pr
[PATCH v7 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Userspace Statically Defined Tracepoints[1] are dtrace-style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc.
have these markers embedded in them. These markers are added by the
developer at important places in the code. Each marker source expands
to a single nop instruction in the compiled code, but there may be
additional overhead for computing the marker arguments, which expands
to a couple of instructions. If that overhead matters, execution of it
can be omitted by a runtime if() condition when no one is tracing the
marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

The default value of the reference counter is 0. A tracer has to
increment the reference counter before tracing the marker and
decrement it when done with the tracing.

Implement the reference counter logic in core uprobes. Users will be
able to use it from trace_uprobe as well as from kernel modules. The
new trace_uprobe definition with a reference counter will now be:

    <path>:<offset>[(ref_ctr_offset)]

where ref_ctr_offset is an optional field. For kernel modules, a new
variant of uprobe_register() has been introduced:

    uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)

There is no new variant of uprobe_unregister(), because we assume
there is only one reference counter per uprobe.

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Note: the 'reference counter' is called a 'semaphore' in the original
DTrace (and SystemTap, bcc and even ELF) documentation and code. But
the term 'semaphore' is misleading in this context: it is just a
counter holding the number of tracers tracing a marker, and is not
used for any synchronization. So we refer to it as a 'reference
counter' in kernel / perf code.
Signed-off-by: Ravi Bangoria Reviewed-by: Masami Hiramatsu [Only trace_uprobe.c] --- include/linux/uprobes.h | 5 + kernel/events/uprobes.c | 232 ++-- kernel/trace/trace.c| 2 +- kernel/trace/trace_uprobe.c | 38 +++- 4 files changed, 267 insertions(+), 10 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index bb9d2084af03..103a48a48872 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -123,6 +123,7 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct *vma); @@ -160,6 +161,10 @@ uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) { return -ENOSYS; } +static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) +{ + return -ENOSYS; +} static inline int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) { diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index c0418ba52ba8..ad92fed11526 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -73,6 +73,7 @@ struct uprobe { struct uprobe_consumer *consumers; struct inode*inode; /* Also hold a ref to inode */ loff_t offset; + loff_t ref_ctr_offset; unsigned long flags; /* @@ -88,6 +89,15 @@ struct uprobe { struct arch_uprobe arch; }; +struct delayed_uprobe { + struct list_head list; 
+ struct uprobe *uprobe; + struct mm_struct *mm; +}; + +static DEFINE_MUTEX(delayed_uprobe_lock); +static LIST_HEAD(delayed_uprobe_list); + /* * Execute out of line area: anonymous executable mapping installed * by the probed task to execute the copy of the original instruction @@ -282,6 +292,154 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t return 1; } +static struct delayed_uprobe * +delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm) +{ + struct delayed_uprobe *du; + + list_for_each_entry(du, &delayed_uprobe_list, list) + if (du->uprobe == uprobe && du->mm == mm) + return du; + return NULL; +} + +static int delayed_uprobe_add(struct uprobe *uprobe, struct mm_struct *mm) +{ + struct
[PATCH v7 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
Add addition argument 'arch_uprobe' to uprobe_write_opcode(). We need this in later set of patches. Signed-off-by: Ravi Bangoria --- arch/arm/probes/uprobes/core.c | 2 +- arch/mips/kernel/uprobes.c | 2 +- include/linux/uprobes.h| 2 +- kernel/events/uprobes.c| 9 + 4 files changed, 8 insertions(+), 7 deletions(-) diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c index d1329f1ba4e4..bf992264060e 100644 --- a/arch/arm/probes/uprobes/core.c +++ b/arch/arm/probes/uprobes/core.c @@ -32,7 +32,7 @@ bool is_swbp_insn(uprobe_opcode_t *insn) int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, + return uprobe_write_opcode(auprobe, mm, vaddr, __opcode_to_mem_arm(auprobe->bpinsn)); } diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c index f7a0645ccb82..4aaff3b3175c 100644 --- a/arch/mips/kernel/uprobes.c +++ b/arch/mips/kernel/uprobes.c @@ -224,7 +224,7 @@ unsigned long arch_uretprobe_hijack_return_addr( int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN); + return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); } void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 0a294e950df8..bb9d2084af03 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -121,7 +121,7 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn); extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); -extern int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); +extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, loff_t 
offset, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 471eac896635..c0418ba52ba8 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -299,8 +299,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t * Called with mm->mmap_sem held for write. * Return 0 (success) or a negative errno. */ -int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, - uprobe_opcode_t opcode) +int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, + unsigned long vaddr, uprobe_opcode_t opcode) { struct page *old_page, *new_page; struct vm_area_struct *vma; @@ -351,7 +351,7 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, */ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN); + return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN); } /** @@ -366,7 +366,8 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned int __weak set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr) { - return uprobe_write_opcode(mm, vaddr, *(uprobe_opcode_t *)&auprobe->insn); + return uprobe_write_opcode(auprobe, mm, vaddr, + *(uprobe_opcode_t *)&auprobe->insn); } static struct uprobe *get_uprobe(struct uprobe *uprobe) -- 2.14.4
[PATCH v7 0/6] Uprobes: Support SDT markers having reference count (semaphore)
v7 changes:
 - Don't allow both zero and non-zero reference counter offsets at the
   same time. It's painful and error prone.
 - Don't call delayed_uprobe_install() if vma->vm_mm does not have any
   breakpoint installed.

v6: https://lkml.org/lkml/2018/7/16/353

Description:

Userspace Statically Defined Tracepoints[1] are dtrace-style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc.
have these markers embedded in them. These markers are added by the
developer at important places in the code. Each marker source expands
to a single nop instruction in the compiled code, but there may be
additional overhead for computing the marker arguments, which expands
to a couple of instructions. If that overhead matters, execution of it
can be omitted by a runtime if() condition when no one is tracing the
marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

The default value of the reference counter is 0. A tracer has to
increment the reference counter before tracing the marker and
decrement it when done with the tracing.

Currently, the perf tool has limited support for SDT markers, i.e. it
cannot trace markers surrounded by a reference counter. Also, it's not
easy to add reference counter logic in a userspace tool like perf, so
the basic idea of this patchset is to add reference counter logic in
the uprobe infrastructure. Ex,[2]

    # cat tick.c
    ...
    for (i = 0; i < 100; i++) {
        DTRACE_PROBE1(tick, loop1, i);
        if (TICK_LOOP2_ENABLED()) {
            DTRACE_PROBE1(tick, loop2, i);
        }
        printf("hi: %d\n", i);
        sleep(1);
    }
    ...

Here tick:loop1 is a marker without a reference counter, whereas
tick:loop2 is surrounded by a reference counter condition.

    # perf buildid-cache --add /tmp/tick
    # perf probe sdt_tick:loop1
    # perf probe sdt_tick:loop2

    # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
    hi: 0
    hi: 1
    hi: 2
    ^C
    Performance counter stats for '/tmp/tick':
                 3      sdt_tick:loop1
                 0      sdt_tick:loop2
       2.747086086 seconds time elapsed

Perf failed to record data for tick:loop2. Same experiment with this
patch series:

    # ./perf buildid-cache --add /tmp/tick
    # ./perf probe sdt_tick:loop2
    # ./perf stat -e sdt_tick:loop2 /tmp/tick
    hi: 0
    hi: 1
    hi: 2
    ^C
    Performance counter stats for '/tmp/tick':
                 3      sdt_tick:loop2
       2.561851452 seconds time elapsed

Ravi Bangoria (6):
  Uprobes: Simplify uprobe_register() body
  Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  Uprobes: Support SDT markers having reference count (semaphore)
  Uprobes/sdt: Prevent multiple reference counter for same uprobe
  trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
  perf probe: Support SDT markers having reference counter (semaphore)

 arch/arm/probes/uprobes/core.c |   2 +-
 arch/mips/kernel/uprobes.c     |   2 +-
 include/linux/uprobes.h        |   7 +-
 kernel/events/uprobes.c        | 315 +++++++++++++++++++++++++++++---
 kernel/trace/trace.c           |   2 +-
 kernel/trace/trace_uprobe.c    |  75 +++++++-
 tools/perf/util/probe-event.c  |  39 +++-
 tools/perf/util/probe-event.h  |   1 +
 tools/perf/util/probe-file.c   |  34 +++-
 tools/perf/util/probe-file.h   |   1 +
 tools/perf/util/symbol-elf.c   |  46 ++++-
 tools/perf/util/symbol.h       |   7 +
 12 files changed, 459 insertions(+), 72 deletions(-)

-- 
2.14.4
[PATCH v7 5/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
We assume to have only one reference counter for one uprobe. Don't allow user to add multiple trace_uprobe entries having same inode+offset but different reference counter. Ex, # echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events # echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xf)" >> uprobe_events bash: echo: write error: Invalid argument # dmesg trace_kprobe: Reference counter offset mismatch. There is one exception though: When user is trying to replace the old entry with the new one, we allow this if the new entry does not conflict with any other existing entries. Signed-off-by: Ravi Bangoria --- kernel/trace/trace_uprobe.c | 37 +++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index bf2be098eb08..be64d943d7ea 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -324,6 +324,35 @@ static int unregister_trace_uprobe(struct trace_uprobe *tu) return 0; } +/* + * Uprobe with multiple reference counter is not allowed. i.e. + * If inode and offset matches, reference counter offset *must* + * match as well. Though, there is one exception: If user is + * replacing old trace_uprobe with new one(same group/event), + * then we allow same uprobe with new reference counter as far + * as the new one does not conflict with any other existing + * ones. + */ +static struct trace_uprobe *find_old_trace_uprobe(struct trace_uprobe *new) +{ + struct trace_uprobe *tmp, *old = NULL; + struct inode *new_inode = d_real_inode(new->path.dentry); + + old = find_probe_event(trace_event_name(&new->tp.call), + new->tp.call.class->system); + + list_for_each_entry(tmp, &uprobe_list, list) { + if ((old ? 
old != tmp : true) && + new_inode == d_real_inode(tmp->path.dentry) && + new->offset == tmp->offset && + new->ref_ctr_offset != tmp->ref_ctr_offset) { + pr_warn("Reference counter offset mismatch."); + return ERR_PTR(-EINVAL); + } + } + return old; +} + /* Register a trace_uprobe and probe_event */ static int register_trace_uprobe(struct trace_uprobe *tu) { @@ -333,8 +362,12 @@ static int register_trace_uprobe(struct trace_uprobe *tu) mutex_lock(&uprobe_lock); /* register as an event */ - old_tu = find_probe_event(trace_event_name(&tu->tp.call), - tu->tp.call.class->system); + old_tu = find_old_trace_uprobe(tu); + if (IS_ERR(old_tu)) { + ret = PTR_ERR(old_tu); + goto end; + } + if (old_tu) { /* delete old event */ ret = unregister_trace_uprobe(old_tu); -- 2.14.4
[PATCH v7 1/6] Uprobes: Simplify uprobe_register() body
Simplify uprobe_register() function body and let __uprobe_register() handle everything. Also move dependency functions around to fix build failures. Signed-off-by: Ravi Bangoria --- kernel/events/uprobes.c | 69 ++--- 1 file changed, 36 insertions(+), 33 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ccc579a7d32e..471eac896635 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -840,13 +840,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new) return err; } -static int __uprobe_register(struct uprobe *uprobe, struct uprobe_consumer *uc) -{ - consumer_add(uprobe, uc); - return register_for_each_vma(uprobe, uc); -} - -static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) +static void +__uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) { int err; @@ -860,24 +855,46 @@ static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *u } /* - * uprobe_register - register a probe + * uprobe_unregister - unregister a already registered probe. + * @inode: the file in which the probe has to be removed. + * @offset: offset from the start of the file. + * @uc: identify which probe if multiple probes are colocated. + */ +void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +{ + struct uprobe *uprobe; + + uprobe = find_uprobe(inode, offset); + if (WARN_ON(!uprobe)) + return; + + down_write(&uprobe->register_rwsem); + __uprobe_unregister(uprobe, uc); + up_write(&uprobe->register_rwsem); + put_uprobe(uprobe); +} +EXPORT_SYMBOL_GPL(uprobe_unregister); + +/* + * __uprobe_register - register a probe * @inode: the file in which the probe has to be placed. * @offset: offset from the start of the file. * @uc: information on howto handle the probe.. 
* - * Apart from the access refcount, uprobe_register() takes a creation + * Apart from the access refcount, __uprobe_register() takes a creation * refcount (thro alloc_uprobe) if and only if this @uprobe is getting * inserted into the rbtree (i.e first consumer for a @inode:@offset * tuple). Creation refcount stops uprobe_unregister from freeing the * @uprobe even before the register operation is complete. Creation * refcount is released when the last @uc for the @uprobe - * unregisters. Caller of uprobe_register() is required to keep @inode + * unregisters. Caller of __uprobe_register() is required to keep @inode * (and the containing mount) referenced. * * Return errno if it cannot successully install probes * else return 0 (success) */ -int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +static int __uprobe_register(struct inode *inode, loff_t offset, +struct uprobe_consumer *uc) { struct uprobe *uprobe; int ret; @@ -904,7 +921,8 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * down_write(&uprobe->register_rwsem); ret = -EAGAIN; if (likely(uprobe_is_active(uprobe))) { - ret = __uprobe_register(uprobe, uc); + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); if (ret) __uprobe_unregister(uprobe, uc); } @@ -915,6 +933,12 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer * goto retry; return ret; } + +int uprobe_register(struct inode *inode, loff_t offset, + struct uprobe_consumer *uc) +{ + return __uprobe_register(inode, offset, uc); +} EXPORT_SYMBOL_GPL(uprobe_register); /* @@ -946,27 +970,6 @@ int uprobe_apply(struct inode *inode, loff_t offset, return ret; } -/* - * uprobe_unregister - unregister a already registered probe. - * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. 
- */ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - struct uprobe *uprobe; - - uprobe = find_uprobe(inode, offset); - if (WARN_ON(!uprobe)) - return; - - down_write(&uprobe->register_rwsem); - __uprobe_unregister(uprobe, uc); - up_write(&uprobe->register_rwsem); - put_uprobe(uprobe); -} -EXPORT_SYMBOL_GPL(uprobe_unregister); - static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm) { struct vm_area_struct *vma; -- 2.14.4
Re: [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore)
Hi Oleg, On 07/23/2018 09:56 PM, Oleg Nesterov wrote: > I have a mixed feeling about this series... I'll try to summarise my thinking > tomorrow, but I do not see any obvious problem so far. Although I have some > concerns about 5/6, I need to re-read it after sleep. Sure. > > > On 07/16, Ravi Bangoria wrote: >> >> +static int delayed_uprobe_install(struct vm_area_struct *vma) >> +{ >> +struct list_head *pos, *q; >> +struct delayed_uprobe *du; >> +unsigned long vaddr; >> +int ret = 0, err = 0; >> + >> +mutex_lock(&delayed_uprobe_lock); >> +list_for_each_safe(pos, q, &delayed_uprobe_list) { >> +du = list_entry(pos, struct delayed_uprobe, list); >> + >> +if (!du->uprobe->ref_ctr_offset || > > Is it possible to see ->ref_ctr_offset == 0 in delayed_uprobe_list? I'll remove this check. > >> @@ -1072,7 +1282,13 @@ int uprobe_mmap(struct vm_area_struct *vma) >> struct uprobe *uprobe, *u; >> struct inode *inode; >> >> -if (no_uprobe_events() || !valid_vma(vma, true)) >> +if (no_uprobe_events()) >> +return 0; >> + >> +if (vma->vm_flags & VM_WRITE) >> +delayed_uprobe_install(vma); > > Obviously not nice performance-wise... OK, I do not know if it will actually > hurt in practice and probably we can use the better data structures if > necessary. > But can't we check MMF_HAS_UPROBES at least? I mean, > > if (vma->vm_flags & VM_WRITE && test_bit(MMF_HAS_UPROBES, > &vma->vm_mm->flags)) > delayed_uprobe_install(vma); > > ? Yes. > > > Perhaps we can even add another flag later, MMF_HAS_DELAYED_UPROBES set by > delayed_uprobe_add(). Yes, good idea. Thanks for the review, Ravi
[PATCH 0/2] perf trace: Two trivial fixes
Two independent fixes:

The first adds the 'generated' directory to .gitignore.
The second fixes call-graph output with perf trace.

Ravi Bangoria (2):
  perf tools: Add trace/beauty/generated/ into .gitignore
  perf trace: Fix call-graph output

 tools/perf/.gitignore      | 1 +
 tools/perf/builtin-trace.c | 5 ++++-
 2 files changed, 5 insertions(+), 1 deletion(-)

-- 
1.8.3.1
[PATCH 1/2] perf tools: Add trace/beauty/generated/ into .gitignore
No functionality changes.

Signed-off-by: Ravi Bangoria
---
 tools/perf/.gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/perf/.gitignore b/tools/perf/.gitignore
index 643cc4ba..3e5135d 100644
--- a/tools/perf/.gitignore
+++ b/tools/perf/.gitignore
@@ -31,5 +31,6 @@ config.mak.autogen
 .config-detected
 util/intel-pt-decoder/inat-tables.c
 arch/*/include/generated/
+trace/beauty/generated/
 pmu-events/pmu-events.c
 pmu-events/jevents
-- 
1.8.3.1
[PATCH 2/2] perf trace: Fix call-graph output
Recently, Arnaldo fixed global vs event-specific --max-stack usage
with commit bd3dda9ab0fb ("perf trace: Allow overriding global
--max-stack per event"). That commit introduced a regression when we
don't use --max-stack at all with perf trace. Ex,

  $ ./perf trace record -g ls
  $ ./perf trace -i perf.data
     0.076 ( 0.002 ms): ls/9109 brk(
     0.196 ( 0.008 ms): ls/9109 access(filename: 0x9f998b70, mode: R
     0.209 ( 0.031 ms): ls/9109 open(filename: 0x9f998978, flags: CLOEXEC

This is missing call-traces. After the patch:

  $ ./perf trace -i perf.data
     0.076 ( 0.002 ms): ls/9109 brk(
           do_syscall_trace_leave ([kernel.kallsyms])
           [0] ([unknown])
           syscall_exit_work ([kernel.kallsyms])
           brk (/usr/lib64/ld-2.17.so)
           _dl_sysdep_start (/usr/lib64/ld-2.17.so)
           _dl_start_final (/usr/lib64/ld-2.17.so)
           _dl_start (/usr/lib64/ld-2.17.so)
           _start (/usr/lib64/ld-2.17.so)
     0.196 ( 0.008 ms): ls/9109 access(filename: 0x9f998b70, mode: R
           do_syscall_trace_leave ([kernel.kallsyms])
           [0] ([unknown])

Fixes: bd3dda9ab0fb ("perf trace: Allow overriding global --max-stack per event")
Signed-off-by: Ravi Bangoria
---
 tools/perf/builtin-trace.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 17d11de..e90f5c3 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1661,9 +1661,12 @@ static int trace__resolve_callchain(struct trace *trace, struct perf_evsel *evse
 				    struct callchain_cursor *cursor)
 {
 	struct addr_location al;
+	int max_stack = evsel->attr.sample_max_stack ?
+			evsel->attr.sample_max_stack :
+			trace->max_stack;
 
 	if (machine__resolve(trace->host, &al, sample) < 0 ||
-	    thread__resolve_callchain(al.thread, cursor, evsel, sample, NULL, NULL, evsel->attr.sample_max_stack))
+	    thread__resolve_callchain(al.thread, cursor, evsel, sample, NULL, NULL, max_stack))
 		return -1;
 
 	return 0;
-- 
1.8.3.1
[PATCH] trace_uprobe: Display correct offset in uprobe_events
Recently, how pointers are printed with %p was changed by commit
ad67b74d2469 ("printk: hash addresses printed with %p"). This causes
a regression in how the offset is shown in the uprobe_events file.
Instead of %p, use %px to display the offset.

Before the patch:

  # perf probe -vv -x /tmp/a.out main
  Opening /sys/kernel/debug/tracing//uprobe_events write=1
  Writing event: p:probe_a/main /tmp/a.out:0x58c

  # cat /sys/kernel/debug/tracing/uprobe_events
  p:probe_a/main /tmp/a.out:0x49a0f352

After the patch:

  # cat /sys/kernel/debug/tracing/uprobe_events
  p:probe_a/main /tmp/a.out:0x058c

Signed-off-by: Ravi Bangoria
---
 kernel/trace/trace_uprobe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 40592e7b3568..268029ae1be6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -608,7 +608,7 @@ static int probes_seq_show(struct seq_file *m, void *v)
 
 	/* Don't print "0x  (null)" when offset is 0 */
 	if (tu->offset) {
-		seq_printf(m, "0x%p", (void *)tu->offset);
+		seq_printf(m, "0x%px", (void *)tu->offset);
 	} else {
 		switch (sizeof(void *)) {
 		case 4:
-- 
2.13.6
Re: [PATCH] trace_uprobe: Display correct offset in uprobe_events
On 01/08/2018 10:49 AM, Srikar Dronamraju wrote:
> * Ravi Bangoria [2018-01-06 11:12:46]:
>
>> Recently, how the pointers being printed with %p has been changed
>> by commit ad67b74d2469 ("printk: hash addresses printed with %p").
>> This is causing a regression while showing offset in the
>> uprobe_events file. Instead of %p, use %px to display offset.
>>
>> Signed-off-by: Ravi Bangoria
>> ---
>>  kernel/trace/trace_uprobe.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
>> index 40592e7b3568..268029ae1be6 100644
>> --- a/kernel/trace/trace_uprobe.c
>> +++ b/kernel/trace/trace_uprobe.c
>> @@ -608,7 +608,7 @@ static int probes_seq_show(struct seq_file *m, void *v)
>>
>>  	/* Don't print "0x  (null)" when offset is 0 */
>>  	if (tu->offset) {
>> -		seq_printf(m, "0x%p", (void *)tu->offset);
>> +		seq_printf(m, "0x%px", (void *)tu->offset);
>>  	} else {
>>  		switch (sizeof(void *)) {
>>  		case 4:
>
> Looks good to me. Did you consider %pK instead of %px?

Thanks Srikar,

Checked %pK. But I see the same issue with that:

perf probe:
  Opening /sys/kernel/debug/tracing//uprobe_events write=1
  Writing event: p:probe_a/main /tmp/a.out:0x58c

cat /sys/kernel/debug/tracing/uprobe_events:
  p:probe_a/main /tmp/a.out:0x14fd571e

> Acked-by: Srikar Dronamraju
[PATCH 0/3] perf trace powerpc: Remove libaudit dependency for syscalls
This is almost identical set of patches recently done for s390. With this, user can run perf trace without libaudit on powerpc as well. Ex, $ make ... libaudit: [ OFF ] $ ./perf trace ls 0.221 ( 0.005 ms): ls/43330 open(filename: 0xac1e2778, flags: CLOEXEC ) = 3 0.227 ( 0.003 ms): ls/43330 read(fd: 3, buf: 0x39c4d678, count: 832 ) = 832 0.233 ( 0.002 ms): ls/43330 fstat(fd: 3, statbuf: 0x39c4d4b0) = 0 ... $ ./perf trace -e "open*" ls 0.000 ( 0.014 ms): ls/43342 open(filename: 0x793d8978, flags: CLOEXEC ) = 3 0.038 ( 0.006 ms): ls/43342 open(filename: 0x793f2778, flags: CLOEXEC ) = 3 ... Ravi Bangoria (3): tools include powerpc: Grab a copy of arch/powerpc/include/uapi/asm/unistd.h perf powerpc: Generate system call table from asm/unistd.h perf trace powerpc: Use generated syscall table tools/arch/powerpc/include/uapi/asm/unistd.h | 399 + tools/perf/Makefile.config | 2 + tools/perf/arch/powerpc/Makefile | 21 ++ .../perf/arch/powerpc/entry/syscalls/mksyscalltbl | 35 ++ tools/perf/check-headers.sh| 1 + tools/perf/util/syscalltbl.c | 4 + 6 files changed, 462 insertions(+) create mode 100644 tools/arch/powerpc/include/uapi/asm/unistd.h create mode 100755 tools/perf/arch/powerpc/entry/syscalls/mksyscalltbl -- 1.8.3.1
[PATCH 2/3] perf powerpc: Generate system call table from asm/unistd.h
This should speed up accessing new system calls introduced with the
kernel rather than waiting for libaudit updates to include them.

Signed-off-by: Ravi Bangoria
---
 tools/perf/arch/powerpc/Makefile                   | 21 ++++++++++++
 .../perf/arch/powerpc/entry/syscalls/mksyscalltbl  | 35 ++++++++++++++++++
 2 files changed, 56 insertions(+)
 create mode 100755 tools/perf/arch/powerpc/entry/syscalls/mksyscalltbl

diff --git a/tools/perf/arch/powerpc/Makefile b/tools/perf/arch/powerpc/Makefile
index 42dab7c..c93e8f4 100644
--- a/tools/perf/arch/powerpc/Makefile
+++ b/tools/perf/arch/powerpc/Makefile
@@ -6,3 +6,24 @@ endif
 HAVE_KVM_STAT_SUPPORT := 1
 PERF_HAVE_ARCH_REGS_QUERY_REGISTER_OFFSET := 1
 PERF_HAVE_JITDUMP := 1
+
+#
+# Syscall table generation for perf
+#
+
+out    := $(OUTPUT)arch/powerpc/include/generated/asm
+header := $(out)/syscalls_64.c
+sysdef := $(srctree)/tools/arch/powerpc/include/uapi/asm/unistd.h
+sysprf := $(srctree)/tools/perf/arch/powerpc/entry/syscalls/
+systbl := $(sysprf)/mksyscalltbl
+
+# Create output directory if not already present
+_dummy := $(shell [ -d '$(out)' ] || mkdir -p '$(out)')
+
+$(header): $(sysdef) $(systbl)
+	$(Q)$(SHELL) '$(systbl)' '$(CC)' $(sysdef) > $@
+
+clean::
+	$(call QUIET_CLEAN, powerpc) $(RM) $(header)
+
+archheaders: $(header)
diff --git a/tools/perf/arch/powerpc/entry/syscalls/mksyscalltbl b/tools/perf/arch/powerpc/entry/syscalls/mksyscalltbl
new file mode 100755
index 000..975947c
--- /dev/null
+++ b/tools/perf/arch/powerpc/entry/syscalls/mksyscalltbl
@@ -0,0 +1,35 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Generate system call table for perf. Derived from
+# s390 script.
+#
+# Copyright IBM Corp. 2017
+# Author(s): Hendrik Brueckner
+# Changed by: Ravi Bangoria
+
+gcc=$1
+input=$2
+
+if ! test -r $input; then
+	echo "Could not read input file" >&2
+	exit 1
+fi
+
+create_table()
+{
+	local max_nr
+
+	echo 'static const char *syscalltbl_powerpc_64[] = {'
+	while read sc nr; do
+		printf '\t[%d] = "%s",\n' $nr $sc
+		max_nr=$nr
+	done
+	echo '};'
+	echo "#define SYSCALLTBL_POWERPC_64_MAX_ID $max_nr"
+}
+
+$gcc -m64 -E -dM -x c $input \
+	|sed -ne 's/^#define __NR_//p' \
+	|sort -t' ' -k2 -nu \
+	|create_table
--
1.8.3.1
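The pipeline in mksyscalltbl is easy to sanity-check without a powerpc
toolchain: feed the same sed/sort/read stages a hand-written sample of
`gcc -E -dM` output (the sample defines below are stand-ins, not real
build inputs) and they emit the table entries:

```shell
# Hypothetical sample of 'gcc -E -dM' output; only __NR_* lines survive
# the sed, and sort orders them numerically by syscall number.
printf '#define __NR_write 4\n#define __NR_read 3\n#define _LOCAL_GUARD 1\n' \
	| sed -ne 's/^#define __NR_//p' \
	| sort -t' ' -k2 -nu \
	| while read sc nr; do
		printf '\t[%d] = "%s",\n' "$nr" "$sc"
	  done
```

This prints the two array initializers, `[3] = "read",` and
`[4] = "write",` in numeric order, exactly as create_table would embed
them in the generated C file.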
[PATCH 1/3] tools include powerpc: Grab a copy of arch/powerpc/include/uapi/asm/unistd.h
Will be used for generating the syscall id/string translation table. Signed-off-by: Ravi Bangoria --- tools/arch/powerpc/include/uapi/asm/unistd.h | 399 +++ tools/perf/check-headers.sh | 1 + 2 files changed, 400 insertions(+) create mode 100644 tools/arch/powerpc/include/uapi/asm/unistd.h diff --git a/tools/arch/powerpc/include/uapi/asm/unistd.h b/tools/arch/powerpc/include/uapi/asm/unistd.h new file mode 100644 index 000..df8684f3 --- /dev/null +++ b/tools/arch/powerpc/include/uapi/asm/unistd.h @@ -0,0 +1,399 @@ +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */ +/* + * This file contains the system call numbers. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ +#ifndef _UAPI_ASM_POWERPC_UNISTD_H_ +#define _UAPI_ASM_POWERPC_UNISTD_H_ + + +#define __NR_restart_syscall 0 +#define __NR_exit1 +#define __NR_fork2 +#define __NR_read3 +#define __NR_write 4 +#define __NR_open5 +#define __NR_close 6 +#define __NR_waitpid 7 +#define __NR_creat 8 +#define __NR_link9 +#define __NR_unlink 10 +#define __NR_execve 11 +#define __NR_chdir 12 +#define __NR_time 13 +#define __NR_mknod 14 +#define __NR_chmod 15 +#define __NR_lchown 16 +#define __NR_break 17 +#define __NR_oldstat18 +#define __NR_lseek 19 +#define __NR_getpid 20 +#define __NR_mount 21 +#define __NR_umount 22 +#define __NR_setuid 23 +#define __NR_getuid 24 +#define __NR_stime 25 +#define __NR_ptrace 26 +#define __NR_alarm 27 +#define __NR_oldfstat 28 +#define __NR_pause 29 +#define __NR_utime 30 +#define __NR_stty 31 +#define __NR_gtty 32 +#define __NR_access 33 +#define __NR_nice 34 +#define __NR_ftime 35 +#define __NR_sync 36 +#define __NR_kill 37 +#define __NR_rename 38 +#define __NR_mkdir 39 +#define __NR_rmdir 40 +#define __NR_dup41 +#define __NR_pipe 42 +#define __NR_times 43 +#define 
__NR_prof 44 +#define __NR_brk45 +#define __NR_setgid 46 +#define __NR_getgid 47 +#define __NR_signal 48 +#define __NR_geteuid49 +#define __NR_getegid50 +#define __NR_acct 51 +#define __NR_umount252 +#define __NR_lock 53 +#define __NR_ioctl 54 +#define __NR_fcntl 55 +#define __NR_mpx56 +#define __NR_setpgid57 +#define __NR_ulimit 58 +#define __NR_oldolduname59 +#define __NR_umask 60 +#define __NR_chroot 61 +#define __NR_ustat 62 +#define __NR_dup2 63 +#define __NR_getppid64 +#define __NR_getpgrp65 +#define __NR_setsid 66 +#define __NR_sigaction 67 +#define __NR_sgetmask 68 +#define __NR_ssetmask 69 +#define __NR_setreuid 70 +#define __NR_setregid 71 +#define __NR_sigsuspend 72 +#define __NR_sigpending 73 +#define __NR_sethostname74 +#define __NR_setrlimit 75 +#define __NR_getrlimit 76 +#define __NR_getrusage 77 +#define __NR_gettimeofday 78 +#define __NR_settimeofday 79 +#define __NR_getgroups 80 +#define __NR_setgroups 81 +#define __NR_select 82 +#define __NR_symlink83 +#define __NR_oldlstat 84 +#define __NR_readlink 85 +#define __NR_uselib 86 +#define __NR_swapon 87 +#define __NR_reboot 88 +#define __NR_readdir89 +#define __NR_mmap 90 +#define __NR_munmap 91 +#define __NR_truncate 92 +#define __NR_ftruncate 93 +#define __NR_fchmod 94 +#define __NR_fchown 95 +#define __NR_getpriority96 +#define __NR_setpriority97 +#define __NR_profil 98 +#define __NR_statfs 99 +#define __NR_fstatfs 100 +#define __NR_ioperm101 +#define __NR_socketcall102 +#define __NR_syslog103 +#define __NR_setitimer 104 +#define __NR_getitimer 105 +#define __NR_stat 106 +#define __NR_lstat 107 +#define __NR_fstat 108 +#define
[PATCH 3/3] perf trace powerpc: Use generated syscall table
This should speed up accessing new system calls introduced with the
kernel rather than waiting for libaudit updates to include them. It also
enables users to specify wildcards, for example, perf trace -e 'open*',
just like was already possible on x86 and s390.

Signed-off-by: Ravi Bangoria
---
 tools/perf/Makefile.config   | 2 ++
 tools/perf/util/syscalltbl.c | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config
index 0dfdaa9..577a5d2 100644
--- a/tools/perf/Makefile.config
+++ b/tools/perf/Makefile.config
@@ -27,6 +27,8 @@ NO_SYSCALL_TABLE := 1
 # Additional ARCH settings for ppc
 ifeq ($(SRCARCH),powerpc)
   NO_PERF_REGS := 0
+  NO_SYSCALL_TABLE := 0
+  CFLAGS += -I$(OUTPUT)arch/powerpc/include/generated
   LIBUNWIND_LIBS := -lunwind -lunwind-ppc64
 endif

diff --git a/tools/perf/util/syscalltbl.c b/tools/perf/util/syscalltbl.c
index 303bdb8..b12c5f5 100644
--- a/tools/perf/util/syscalltbl.c
+++ b/tools/perf/util/syscalltbl.c
@@ -30,6 +30,10 @@
 #include
 const int syscalltbl_native_max_id = SYSCALLTBL_S390_64_MAX_ID;
 static const char **syscalltbl_native = syscalltbl_s390_64;
+#elif defined(__powerpc64__)
+#include
+const int syscalltbl_native_max_id = SYSCALLTBL_POWERPC_64_MAX_ID;
+static const char **syscalltbl_native = syscalltbl_powerpc_64;
 #endif

 struct syscall {
--
1.8.3.1
Re: [tip:perf/core] perf trace: Allow overriding global --max-stack per event
Hi Arnaldo,

This commit seems to be causing a regression:

  $ ./perf trace record -g ls

Perf compiled from acme/perf/core:

  $ ./perf trace -i perf.data
     0.200 ( 0.016 ms): ls/19722 brk(
     0.367 ( 0.024 ms): ls/19722 access(filename: 0xa1438b70, mode: R
     0.401 ( 0.019 ms): ls/19722 open(filename: 0xa1438978, flags: CLOEXEC

  Missing calltraces ^

Distro perf:

  $ perf trace -i perf.data
     0.200 ( 0.016 ms): ls/19722 brk(
		do_syscall_trace_leave ([kernel.kallsyms])
		[0] ([unknown])
		syscall_exit_work ([kernel.kallsyms])
		brk (/usr/lib64/ld-2.17.so)
		_dl_sysdep_start (/usr/lib64/ld-2.17.so)
		_dl_start_final (/usr/lib64/ld-2.17.so)
		_dl_start (/usr/lib64/ld-2.17.so)
		_start (/usr/lib64/ld-2.17.so)
     0.367 ( 0.024 ms): ls/19722 access(filename: 0xa1438b70, mode: R
		do_syscall_trace_leave ([kernel.kallsyms])
		[0] ([unknown])
		syscall_exit_work ([kernel.kallsyms])
		access (/usr/lib64/ld-2.17.so)
		dl_main (/usr/lib64/ld-2.17.so)
		_dl_sysdep_start (/usr/lib64/ld-2.17.so)
		_dl_start_final (/usr/lib64/ld-2.17.so)
		_dl_start (/usr/lib64/ld-2.17.so)
		_start (/usr/lib64/ld-2.17.so)
     0.401 ( 0.019 ms): ls/19722 open(filename: 0xa1438978, flags: CLOEXEC
		do_syscall_trace_leave ([kernel.kallsyms])
		[0] ([unknown])
		syscall_exit_work ([kernel.kallsyms])
		open64 (/usr/lib64/ld-2.17.so)

A patch something like the below should fix the issue??

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 531d43b..d0ace22 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -1642,9 +1642,12 @@ static int trace__resolve_callchain(struct trace *trace, struct perf_evsel *evsel,
 				    struct callchain_cursor *cursor)
 {
 	struct addr_location al;
+	int max_stack = evsel->attr.sample_max_stack ?
+			evsel->attr.sample_max_stack :
+			trace->max_stack;

 	if (machine__resolve(trace->host, &al, sample) < 0 ||
-	    thread__resolve_callchain(al.thread, cursor, evsel, sample, NULL, NULL, evsel->attr.sample_max_stack))
+	    thread__resolve_callchain(al.thread, cursor, evsel, sample, NULL, NULL, max_stack))
 		return -1;

 	return 0;

Thanks,
Ravi

On 01/17/2018 10:04 PM, tip-bot for Arnaldo Carvalho de Melo wrote:
> Commit-ID:  bd3dda9ab0fbdb8a91a2e869d93a0c9692b8444f
> Gitweb:     https://git.kernel.org/tip/bd3dda9ab0fbdb8a91a2e869d93a0c9692b8444f
> Author:     Arnaldo Carvalho de Melo
> AuthorDate: Mon, 15 Jan 2018 11:33:53 -0300
> Committer:  Arnaldo Carvalho de Melo
> CommitDate: Wed, 17 Jan 2018 10:23:33 -0300
>
> perf trace: Allow overriding global --max-stack per event
>
> The per-event max-stack setting wasn't overriding the global --max-stack
> setting:
>
>   # perf trace --no-syscalls --max-stack 4 -e probe_libc:inet_pton/call-graph=dwarf,max-stack=2/ ping -6 -c 1 ::1
>   PING ::1(::1) 56 data bytes
>   64 bytes from ::1: icmp_seq=1 ttl=64 time=0.072 ms
>
>   --- ::1 ping statistics ---
>   1 packets transmitted, 1 received, 0% packet loss, time 0ms
>   rtt min/avg/max/mdev = 0.072/0.072/0.072/0.000 ms
>      0.000 probe_libc:inet_pton:(7feb7a998350))
>                 __inet_pton (inlined)
>                 gaih_inet.constprop.7 (/usr/lib64/libc-2.26.so)
>                 __GI_getaddrinfo (inlined)
>                 [0xaa39b6108f3f] (/usr/bin/ping)
>   #
>
> Fix it:
>
>   # perf trace --no-syscalls --max-stack 4 -e probe_libc:inet_pton/call-graph=dwarf,max-stack=2/ ping -6 -c 1 ::1
>   PING ::1(::1) 56 data bytes
>   64 bytes from ::1: icmp_seq=1 ttl=64 time=0.073 ms
>
>   --- ::1 ping statistics ---
>   1 packets transmitted, 1 received, 0% packet loss, time 0ms
>   rtt min/avg/max/mdev = 0.073/0.073/0.073/0.000 ms
>      0.000 probe_libc:inet_pton:(7f1083221350))
>                 __inet_pton (inlined)
Re: [PATCH 2/2] trace_uprobe: Simplify probes_seq_show()
On 02/08/2018 09:13 AM, Ravi Bangoria wrote: > > On 02/08/2018 08:59 AM, Kees Cook wrote: >> On Tue, Feb 6, 2018 at 8:34 PM, Ravi Bangoria >> wrote: >>> Simplify probes_seq_show() function. We are using %lx to display >>> the offset and we don't prepend unnecessary 0s in the offset. >>> >>> Signed-off-by: Ravi Bangoria >>> --- >>> kernel/trace/trace_uprobe.c | 21 +++-- >>> 1 file changed, 3 insertions(+), 18 deletions(-) >>> >>> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c >>> index c2c965398893..c12a3957e1ee 100644 >>> --- a/kernel/trace/trace_uprobe.c >>> +++ b/kernel/trace/trace_uprobe.c >>> @@ -602,24 +602,9 @@ static int probes_seq_show(struct seq_file *m, void *v) >>> char c = is_ret_probe(tu) ? 'r' : 'p'; >>> int i; >>> >>> - seq_printf(m, "%c:%s/%s", c, tu->tp.call.class->system, >>> - trace_event_name(&tu->tp.call)); >>> - seq_printf(m, " %s:", tu->filename); >>> - >>> - /* Don't print "0x (null)" when offset is 0 */ >>> - if (tu->offset) { >>> - seq_printf(m, "0x%lx", tu->offset); >>> - } else { >>> - switch (sizeof(void *)) { >>> - case 4: >>> - seq_printf(m, "0x"); >>> - break; >>> - case 8: >>> - default: >>> - seq_printf(m, "0x"); >>> - break; >>> - } >>> - } >>> + seq_printf(m, "%c:%s/%s %s:0x%lx", c, tu->tp.call.class->system, >>> + trace_event_name(&tu->tp.call), tu->filename, >>> + tu->offset); >> To keep the prepended zeros (and avoid the redundant 0x prefix): >> >> "...%#0*lx...", ... sizeof(void *) * 2, tu->offset); >> >> As in: >> >> + seq_printf(m, "%c:%s/%s %s:%#0*lx", c, tu->tp.call.class->system, >> + trace_event_name(&tu->tp.call), tu->filename, >> + sizeof(void *) * 2, tu->offset); > This is useful, thanks Kees. > > @Wang, Do we really need those 0s? Won't just 0x0 should > suffice? Here is the sample output... 
>
> # echo "p:probe_a/main /tmp/a.out:0x0" > uprobe_events
>
> Before patch:
> # cat uprobe_events
> p:probe_a/main /tmp/a.out:0x
>
> After patch:
> # cat uprobe_events
> p:probe_a/main /tmp/a.out:0x0

Wang, ping :)

Kees, I haven't heard back from Wang and no one has reported any issues
with the patches yet. Can I have your Acked-by?

Thanks,
Ravi
Re: [PATCH 2/2] trace_uprobe: Simplify probes_seq_show()
On 03/07/2018 03:34 AM, Kees Cook wrote:
> On Tue, Mar 6, 2018 at 12:12 AM, Ravi Bangoria wrote:
>>
>> On 02/08/2018 09:13 AM, Ravi Bangoria wrote:
>>> Wang, ping :)
>>>
>>> Kees, I haven't heard back from Wang and no one has reported any
>>> issues with the patches yet. Can I have your Acked-by?
> I didn't see a v2 of these patches with the output fixed?

Kees, I didn't send a v2 because those 0s seem unnecessary to me.
Please let me know if you still want them shown. Will re-spin :)

Thanks,
Ravi

PS: Changing Masami's email address in cc.
Re: [RFC 3/4] trace_uprobe: Support SDT markers having semaphore
On 03/06/2018 05:29 PM, Peter Zijlstra wrote:
> On Wed, Feb 28, 2018 at 01:23:44PM +0530, Ravi Bangoria wrote:
>> Userspace Statically Defined Tracepoints[1] are dtrace-style markers
>> inside userspace applications. These markers are added by the developer
>> at important places in the code. Each marker source expands to a single
>> nop instruction in the compiled code, but there may be additional
>> overhead for computing the marker arguments, which expands to a couple
>> of instructions. If this computation is significant, it can be omitted
>> at runtime by an if() condition when no one is tracing the marker:
>>
>> if (semaphore > 0) {
>>     Execute marker instructions;
>> }
>>
>> The default value of the semaphore is 0. A tracer has to increment the
>> semaphore before recording on a marker and decrement it at the end of
>> tracing.
>>
>> Implement the semaphore flip logic in trace_uprobe, leaving the core
>> uprobe infrastructure as is, except for one new callback from
>> uprobe_mmap() to trace_uprobe.
> W.T.H. is that called a semaphore? afaict its just a usage-counter.

I totally agree with you. But it's not me who named it semaphore :)
Please refer to the "Semaphore Handling" section at:
https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

We can surely name it differently in the kernel code and document it
properly in Documentation/tracing/.

> There is no blocking, no releasing, nothing that would make it an actual
> semaphore.
>
> So please, remove all mention of semaphore from this code, because it,
> most emphatically, is not one.
>
> Also, would it not be much better to do userspace jump-labels for this?
> That completely avoids the dynamic branch at the SDT site.

Userspace jump-label is a good idea, but the semaphore logic has already
become a de facto ABI. Tools like bcc, gdb, systemtap etc. flip the
semaphore while probing the marker.

Thanks,
Ravi
Re: [PATCH v6 21/21] perf-probe: Add array argument support
Hi Masami :) On 03/22/2018 03:53 PM, Masami Hiramatsu wrote: > On Mon, 19 Mar 2018 13:29:59 +0530 > Ravi Bangoria wrote: > >> >> Is it okay to allow user to specify array size with type field? > Fro this patch, yes. So IIUC, perf _tool_ will allow user to record array either with "name[range]" or by "name:type[length]". Please correct me if that's wrong. And If perf tool will allow array length along with TYPE field, I guess we should document that in man perf-probe? Otherwise, Acked-by: Ravi Bangoria Thanks, Ravi > The availability of type is checked only when > it is automatically generated. > IMO, it should be done in another patch, something like > "Validate user specified type casting" patch. Would you need it? > > Thank you, >
Re: [PATCH v2 9/9] perf probe: Support SDT markers having reference counter (semaphore)
Hi Masami, On 04/09/2018 12:58 PM, Masami Hiramatsu wrote: > Hi Ravi, > > On Wed, 4 Apr 2018 14:01:10 +0530 > Ravi Bangoria wrote: > >> @@ -2054,15 +2060,21 @@ char *synthesize_probe_trace_command(struct >> probe_trace_event *tev) >> } >> >> /* Use the tp->address for uprobes */ >> -if (tev->uprobes) >> +if (tev->uprobes) { >> err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address); >> -else if (!strncmp(tp->symbol, "0x", 2)) >> +if (uprobe_ref_ctr_is_supported() && >> +tp->ref_ctr_offset && >> +err >= 0) >> +err = strbuf_addf(&buf, "(0x%lx)", tp->ref_ctr_offset); > If the kernel doesn't support uprobe_ref_ctr but the event requires > to increment uprobe_ref_ctr, I think we should (at least) warn user here. pr_debug("A semaphore is associated with %s:%s and seems your kernel doesn't support it.\n" tev->group, tev->event); Looks good? >> @@ -776,14 +784,21 @@ static char *synthesize_sdt_probe_command(struct >> sdt_note *note, >> { >> struct strbuf buf; >> char *ret = NULL, **args; >> -int i, args_count; >> +int i, args_count, err; >> +unsigned long long ref_ctr_offset; >> >> if (strbuf_init(&buf, 32) < 0) >> return NULL; >> >> -if (strbuf_addf(&buf, "p:%s/%s %s:0x%llx", >> -sdtgrp, note->name, pathname, >> -sdt_note__get_addr(note)) < 0) >> +err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx", >> +sdtgrp, note->name, pathname, >> +sdt_note__get_addr(note)); >> + >> +ref_ctr_offset = sdt_note__get_ref_ctr_offset(note); >> +if (uprobe_ref_ctr_is_supported() && ref_ctr_offset && err >= 0) >> +err = strbuf_addf(&buf, "(0x%llx)", ref_ctr_offset); > We don't have to care about uprobe_ref_ctr support here, because > this information will be just cached, not directly written to > uprobe_events. Sure, will remove the check. Thanks for the review :). Ravi
[PATCH 0/3] perf/buildid-cache: Add --list and --purge-all options
First patch is a trivial error message fix. Second and third adds new options --list and --purge-all to 'buildid-cache' subcommand. Ravi Bangoria (3): tools/parse-options: Add '\n' at the end of error messages perf/buildid-cache: Support --list option perf/buildid-cache: Support --purge-all option tools/lib/subcmd/parse-options.c| 6 +- tools/perf/Documentation/perf-buildid-cache.txt | 7 ++- tools/perf/builtin-buildid-cache.c | 75 - 3 files changed, 81 insertions(+), 7 deletions(-) -- 2.14.3
[PATCH 1/3] tools/parse-options: Add '\n' at the end of error messages
A few error messages do not have '\n' at the end, so the next shell
prompt gets printed on the same line. Ex,

  linux~$ perf buildid-cache -verbose --add ./a.out
  Error: did you mean `--verbose` (with two dashes ?)linux~$

Fix it.

Signed-off-by: Ravi Bangoria
---
 tools/lib/subcmd/parse-options.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/lib/subcmd/parse-options.c b/tools/lib/subcmd/parse-options.c
index f6a1babcbac4..cb7154eccbdc 100644
--- a/tools/lib/subcmd/parse-options.c
+++ b/tools/lib/subcmd/parse-options.c
@@ -433,7 +433,7 @@ static int parse_long_opt(struct parse_opt_ctx_t *p, const char *arg,

 	if (ambiguous_option) {
 		fprintf(stderr,
-			" Error: Ambiguous option: %s (could be --%s%s or --%s%s)",
+			" Error: Ambiguous option: %s (could be --%s%s or --%s%s)\n",
 			arg,
 			(ambiguous_flags & OPT_UNSET) ?  "no-" : "",
 			ambiguous_option->long_name,
@@ -458,7 +458,7 @@ static void check_typos(const char *arg, const struct option *options)
 		return;

 	if (strstarts(arg, "no-")) {
-		fprintf(stderr, " Error: did you mean `--%s` (with two dashes ?)", arg);
+		fprintf(stderr, " Error: did you mean `--%s` (with two dashes ?)\n", arg);
 		exit(129);
 	}

@@ -466,7 +466,7 @@ static void check_typos(const char *arg, const struct option *options)
 		if (!options->long_name)
 			continue;
 		if (strstarts(options->long_name, arg)) {
-			fprintf(stderr, " Error: did you mean `--%s` (with two dashes ?)", arg);
+			fprintf(stderr, " Error: did you mean `--%s` (with two dashes ?)\n", arg);
 			exit(129);
 		}
 	}
--
2.14.3
[PATCH 2/3] perf/buildid-cache: Support --list option
perf buildid-cache allows adding and removing files to/from the cache,
but there is no option to list all cached files. Add a --list option to
list all _valid_ cached files. Ex,

  # perf buildid-cache --add /tmp/a.out
  # perf buildid-cache -l
  /tmp/a.out (8a86ef73e44067bca52cc3f6cd3e5446c783391c)

Signed-off-by: Ravi Bangoria
---
 tools/perf/Documentation/perf-buildid-cache.txt |  4 ++-
 tools/perf/builtin-buildid-cache.c              | 41 +++++++++++++++++--
 2 files changed, 41 insertions(+), 4 deletions(-)

diff --git a/tools/perf/Documentation/perf-buildid-cache.txt b/tools/perf/Documentation/perf-buildid-cache.txt
index 73c2650bd0db..3f285ba6e1f9 100644
--- a/tools/perf/Documentation/perf-buildid-cache.txt
+++ b/tools/perf/Documentation/perf-buildid-cache.txt
@@ -59,7 +59,9 @@ OPTIONS
 	exactly same build-id, that is replaced by new one. It can be used
 	to update kallsyms and kernel dso to vmlinux in order to support
 	annotation.
-
+-l::
+--list::
+	List all valid binaries from cache.
 -v::
 --verbose::
 	Be more verbose.
diff --git a/tools/perf/builtin-buildid-cache.c b/tools/perf/builtin-buildid-cache.c
index 41db2cba77eb..50db05bd0cc6 100644
--- a/tools/perf/builtin-buildid-cache.c
+++ b/tools/perf/builtin-buildid-cache.c
@@ -25,6 +25,7 @@
 #include "util/session.h"
 #include "util/symbol.h"
 #include "util/time-utils.h"
+#include "util/probe-file.h"

 static int build_id_cache__kcore_buildid(const char *proc_dir, char *sbuildid)
 {
@@ -297,6 +298,25 @@ static int build_id_cache__update_file(const char *filename, struct nsinfo *nsi)
 	return err;
 }

+static void build_id_cache__show_all(void)
+{
+	struct strlist *bidlist;
+	struct str_node *nd;
+	char *buf;
+
+	bidlist = build_id_cache__list_all(true);
+	if (!bidlist) {
+		pr_debug("Failed to get buildids: -%d\n", errno);
+		return;
+	}
+	strlist__for_each_entry(nd, bidlist) {
+		buf = build_id_cache__origname(nd->s);
+		printf("%s (%s)\n", buf, nd->s);
+		free(buf);
+	}
+	strlist__delete(bidlist);
+}
+
 int cmd_buildid_cache(int argc, const char **argv)
 {
 	struct strlist *list;
@@ -304,6 +324,8 @@ int cmd_buildid_cache(int argc, const char **argv)
 	int ret = 0;
 	int ns_id = -1;
 	bool force = false;
+	bool list_files = false;
+	bool opts_flag = false;
 	char const *add_name_list_str = NULL,
 		   *remove_name_list_str = NULL,
 		   *purge_name_list_str = NULL,
@@ -327,6 +349,7 @@ int cmd_buildid_cache(int argc, const char **argv)
 		   "file(s) to remove"),
 	OPT_STRING('p', "purge", &purge_name_list_str, "file list",
 		   "file(s) to remove (remove old caches too)"),
+	OPT_BOOLEAN('l', "list", &list_files, "list all cached files"),
 	OPT_STRING('M', "missing", &missing_filename, "file",
 		   "to find missing build ids in the cache"),
 	OPT_BOOLEAN('f', "force", &force, "don't complain, do it"),
@@ -344,11 +367,18 @@ int cmd_buildid_cache(int argc, const char **argv)
 	argc = parse_options(argc, argv, buildid_cache_options,
 			     buildid_cache_usage, 0);

-	if (argc || (!add_name_list_str && !kcore_filename &&
-		     !remove_name_list_str && !purge_name_list_str &&
-		     !missing_filename && !update_name_list_str))
+	opts_flag = add_name_list_str || kcore_filename ||
+		remove_name_list_str || purge_name_list_str ||
+		missing_filename || update_name_list_str;
+
+	if (argc || !(list_files || opts_flag))
 		usage_with_options(buildid_cache_usage, buildid_cache_options);

+	/* -l is exclusive. It can not be used with other options. */
+	if (list_files && opts_flag)
+		usage_with_options_msg(buildid_cache_usage,
+			buildid_cache_options, "-l is exclusive.\n");
+
 	if (ns_id > 0)
 		nsi = nsinfo__new(ns_id);

@@ -366,6 +396,11 @@ int cmd_buildid_cache(int argc, const char **argv)

 	setup_pager();

+	if (list_files) {
+		build_id_cache__show_all();
+		goto out;
+	}
+
 	if (add_name_list_str) {
 		list = strlist__new(add_name_list_str, NULL);
 		if (list) {
--
2.14.3
[PATCH 3/3] perf/buildid-cache: Support --purge-all option
User can remove files from cache using --remove/--purge options but both needs list of files as an argument. It's not convenient when you want to flush out entire cache. Add an option to purge all files from cache. Ex, # perf buildid-cache -l /tmp/a.out (8a86ef73e44067bca52cc3f6cd3e5446c783391c) /tmp/a.out.1 (ebe71fdcf4b366518cc154d570a33cd461a51c36) # perf buildid-cache -P -v Removing /tmp/a.out (8a86ef73e44067bca52cc3f6cd3e5446c783391c): Ok Removing /tmp/a.out.1 (ebe71fdcf4b366518cc154d570a33cd461a51c36): Ok Purged all: Ok Signed-off-by: Ravi Bangoria --- tools/perf/Documentation/perf-buildid-cache.txt | 3 +++ tools/perf/builtin-buildid-cache.c | 36 - 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/tools/perf/Documentation/perf-buildid-cache.txt b/tools/perf/Documentation/perf-buildid-cache.txt index 3f285ba6e1f9..f6de0952ff3c 100644 --- a/tools/perf/Documentation/perf-buildid-cache.txt +++ b/tools/perf/Documentation/perf-buildid-cache.txt @@ -48,6 +48,9 @@ OPTIONS --purge=:: Purge all cached binaries including older caches which have specified path from the cache. +-P:: +--purge-all:: + Purge all cached binaries. This will flush out entire cache. -M:: --missing=:: List missing build ids in the cache for the specified file. 
diff --git a/tools/perf/builtin-buildid-cache.c b/tools/perf/builtin-buildid-cache.c
index 50db05bd0cc6..3b4373cabd49 100644
--- a/tools/perf/builtin-buildid-cache.c
+++ b/tools/perf/builtin-buildid-cache.c
@@ -240,6 +240,34 @@ static int build_id_cache__purge_path(const char *pathname, struct nsinfo *nsi)
 	return err;
 }

+static int build_id_cache__purge_all(void)
+{
+	struct strlist *list;
+	struct str_node *pos;
+	int err;
+	char *buf;
+
+	list = build_id_cache__list_all(false);
+	if (!list) {
+		pr_debug("Failed to get buildids: -%d\n", errno);
+		return -EINVAL;
+	}
+
+	strlist__for_each_entry(pos, list) {
+		buf = build_id_cache__origname(pos->s);
+		err = build_id_cache__remove_s(pos->s);
+		pr_debug("Removing %s (%s): %s\n", buf, pos->s,
+			 err ? "FAIL" : "Ok");
+		free(buf);
+		if (err)
+			break;
+	}
+	strlist__delete(list);
+
+	pr_debug("Purged all: %s\n", err ? "FAIL" : "Ok");
+	return err;
+}
+
 static bool dso__missing_buildid_cache(struct dso *dso, int parm __maybe_unused)
 {
 	char filename[PATH_MAX];
@@ -326,6 +354,7 @@ int cmd_buildid_cache(int argc, const char **argv)
 	bool force = false;
 	bool list_files = false;
 	bool opts_flag = false;
+	bool purge_all = false;
 	char const *add_name_list_str = NULL,
 		   *remove_name_list_str = NULL,
 		   *purge_name_list_str = NULL,
@@ -349,6 +378,7 @@ int cmd_buildid_cache(int argc, const char **argv)
 		   "file(s) to remove"),
 	OPT_STRING('p', "purge", &purge_name_list_str, "file list",
 		   "file(s) to remove (remove old caches too)"),
+	OPT_BOOLEAN('P', "purge-all", &purge_all, "purge all cached files"),
 	OPT_BOOLEAN('l', "list", &list_files, "list all cached files"),
 	OPT_STRING('M', "missing", &missing_filename, "file",
 		   "to find missing build ids in the cache"),
@@ -369,7 +399,8 @@ int cmd_buildid_cache(int argc, const char **argv)

 	opts_flag = add_name_list_str || kcore_filename ||
 		remove_name_list_str || purge_name_list_str ||
-		missing_filename || update_name_list_str;
+		missing_filename || update_name_list_str ||
+		purge_all;

 	if (argc || !(list_files || opts_flag))
 		usage_with_options(buildid_cache_usage, buildid_cache_options);
@@ -455,6 +486,9 @@ int cmd_buildid_cache(int argc, const char **argv)
 		}
 	}

+	if (purge_all)
+		ret = build_id_cache__purge_all();
+
 	if (missing_filename)
 		ret = build_id_cache__fprintf_missing(session, stdout);

--
2.14.3
Re: [PATCH v2 7/9] trace_uprobe/sdt: Fix multiple update of same reference counter
Hi Oleg, On 04/09/2018 06:47 PM, Oleg Nesterov wrote: > On 04/04, Ravi Bangoria wrote: >> +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm) >> +{ >> +struct mmu_notifier *mn; >> +struct sdt_mm_list *sml = kzalloc(sizeof(*sml), GFP_KERNEL); >> + >> +if (!sml) >> +return; >> +sml->mm = mm; >> +list_add(&(sml->list), &(tu->sml.list)); >> + >> +/* Register mmu_notifier for this mm. */ >> +mn = kzalloc(sizeof(*mn), GFP_KERNEL); >> +if (!mn) >> +return; >> + >> +mn->ops = &sdt_mmu_notifier_ops; >> +__mmu_notifier_register(mn, mm); >> +} > I didn't read this version yet, just one question... > > So now it depends on CONFIG_MMU_NOTIFIER, yes? I do not see any changes in > Kconfig > files, this doesn't look right... Yes, you are write. I'll make CONFIG_UPROBE_EVENTS dependent on CONFIG_MMU_NOTIFIER. Thanks, Ravi
Re: [PATCH v2 7/9] trace_uprobe/sdt: Fix multiple update of same reference counter
On 04/09/2018 07:02 PM, Ravi Bangoria wrote: > Hi Oleg, > > On 04/09/2018 06:47 PM, Oleg Nesterov wrote: >> I didn't read this version yet, just one question... >> >> So now it depends on CONFIG_MMU_NOTIFIER, yes? I do not see any changes in >> Kconfig >> files, this doesn't look right... > Yes, you are write. s/write/right.
Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open
Hi Song, On 12/07/2017 04:15 AM, Song Liu wrote: > With current kernel, user space tools can only create/destroy [k,u]probes > with a text-based API (kprobe_events and uprobe_events in tracefs). This > approach relies on user space to clean up the [k,u]probe after using them. > However, this is not easy for user space to clean up properly. > > To solve this problem, we introduce a file descriptor based API. > Specifically, we extended perf_event_open to create [k,u]probe, and attach > this [k,u]probe to the file descriptor created by perf_event_open. These > [k,u]probe are associated with this file descriptor, so they are not > available in tracefs. Sorry for being late. One simple question.. Will it be good to support k/uprobe arguments with perf_event_open()? Do you have any plans about that? Thanks, Ravi
Re: [PATCH v2 7/9] trace_uprobe/sdt: Fix multiple update of same reference counter
Hi Oleg, On 04/09/2018 06:59 PM, Oleg Nesterov wrote: > On 04/04, Ravi Bangoria wrote: >> +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm) >> +{ >> +struct mmu_notifier *mn; >> +struct sdt_mm_list *sml = kzalloc(sizeof(*sml), GFP_KERNEL); >> + >> +if (!sml) >> +return; >> +sml->mm = mm; >> +list_add(&(sml->list), &(tu->sml.list)); >> + >> +/* Register mmu_notifier for this mm. */ >> +mn = kzalloc(sizeof(*mn), GFP_KERNEL); >> +if (!mn) >> +return; >> + >> +mn->ops = &sdt_mmu_notifier_ops; >> +__mmu_notifier_register(mn, mm); >> +} > and what if __mmu_notifier_register() fails simply because signal_pending() > == T? > see mm_take_all_locks(). > > at first glance this all look suspicious and sub-optimal, Yes. I should have added checks for failure cases. Will fix them in v3. Thanks for the review, Ravi > but let me repeat that > I didn't read this version yet. > > Oleg. >
Re: [PATCH v2 7/9] trace_uprobe/sdt: Fix multiple update of same reference counter
Hi Oleg, On 04/10/2018 04:36 PM, Oleg Nesterov wrote: > Hi Ravi, > > On 04/10, Ravi Bangoria wrote: >>> and what if __mmu_notifier_register() fails simply because signal_pending() >>> == T? >>> see mm_take_all_locks(). >>> >>> at first glance this all look suspicious and sub-optimal, >> Yes. I should have added checks for failure cases. >> Will fix them in v3. > And what can you do if it fails? Nothing except report the problem. But > signal_pending() is not the unlikely or error condition, it should not > cause the tracing errors. ... > Plus mm_take_all_locks() is very heavy... BTW, uprobe_mmap_callback() is > called unconditionally. Whatever it does, can we at least move it after > the no_uprobe_events() check? Can't we also check MMF_HAS_UPROBES? Sure, I'll move it after these conditions. > Either way, I do not feel that mmu_notifier is the right tool... Did you > consider the uprobe_clear_state() hook we already have? Ah! This is really a good idea. We don't need mmu_notifier then. Thanks for suggestion, Ravi
Re: [PATCH v2 1/2] trace_uprobe: Use %lx to display offset
Hi Steve, Can you please pull these patches. Thanks, Ravi On 03/15/2018 01:57 PM, Ravi Bangoria wrote: > tu->offset is unsigned long, not a pointer, thus %lx should > be used to print it, not the %px. > > Fixes: 0e4d819d0893 ("trace_uprobe: Display correct offset in uprobe_events") > Suggested-by: Kees Cook > Signed-off-by: Ravi Bangoria > --- > v2 changelog: > - Keep prefixed 0s as is. > > kernel/trace/trace_uprobe.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c > index 2014f4351ae0..0298bd15be83 100644 > --- a/kernel/trace/trace_uprobe.c > +++ b/kernel/trace/trace_uprobe.c > @@ -608,7 +608,7 @@ static int probes_seq_show(struct seq_file *m, void *v) > > /* Don't print "0x (null)" when offset is 0 */ > if (tu->offset) { > - seq_printf(m, "0x%px", (void *)tu->offset); > + seq_printf(m, "0x%0*lx", (int)(sizeof(void *) * 2), tu->offset); > } else { > switch (sizeof(void *)) { > case 4:
[PATCH 2/8] mm: Prefix vma_ to vaddr_to_offset() and offset_to_vaddr()
No functionality changes. Signed-off-by: Ravi Bangoria --- include/linux/mm.h | 4 ++-- kernel/events/uprobes.c | 14 +++--- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 95909f2..d7ee526 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2275,13 +2275,13 @@ struct vm_unmapped_area_info { } static inline unsigned long -offset_to_vaddr(struct vm_area_struct *vma, loff_t offset) +vma_offset_to_vaddr(struct vm_area_struct *vma, loff_t offset) { return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT); } static inline loff_t -vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr) +vma_vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr) { return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start); } diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index bd6f230..535fd39 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -748,7 +748,7 @@ static inline struct map_info *free_map_info(struct map_info *info) curr = info; info->mm = vma->vm_mm; - info->vaddr = offset_to_vaddr(vma, offset); + info->vaddr = vma_offset_to_vaddr(vma, offset); } i_mmap_unlock_read(mapping); @@ -807,7 +807,7 @@ static inline struct map_info *free_map_info(struct map_info *info) goto unlock; if (vma->vm_start > info->vaddr || - vaddr_to_offset(vma, info->vaddr) != uprobe->offset) + vma_vaddr_to_offset(vma, info->vaddr) != uprobe->offset) goto unlock; if (is_register) { @@ -977,7 +977,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm) uprobe->offset >= offset + vma->vm_end - vma->vm_start) continue; - vaddr = offset_to_vaddr(vma, uprobe->offset); + vaddr = vma_offset_to_vaddr(vma, uprobe->offset); err |= remove_breakpoint(uprobe, mm, vaddr); } up_read(&mm->mmap_sem); @@ -1023,7 +1023,7 @@ static void build_probe_list(struct inode *inode, struct uprobe *u; INIT_LIST_HEAD(head); - min = vaddr_to_offset(vma, start); + min = 
vma_vaddr_to_offset(vma, start); max = min + (end - start) - 1; spin_lock(&uprobes_treelock); @@ -1076,7 +1076,7 @@ int uprobe_mmap(struct vm_area_struct *vma) list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) { if (!fatal_signal_pending(current) && filter_chain(uprobe, UPROBE_FILTER_MMAP, vma->vm_mm)) { - unsigned long vaddr = offset_to_vaddr(vma, uprobe->offset); + unsigned long vaddr = vma_offset_to_vaddr(vma, uprobe->offset); install_breakpoint(uprobe, vma->vm_mm, vma, vaddr); } put_uprobe(uprobe); @@ -1095,7 +1095,7 @@ int uprobe_mmap(struct vm_area_struct *vma) inode = file_inode(vma->vm_file); - min = vaddr_to_offset(vma, start); + min = vma_vaddr_to_offset(vma, start); max = min + (end - start) - 1; spin_lock(&uprobes_treelock); @@ -1730,7 +1730,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp) if (vma && vma->vm_start <= bp_vaddr) { if (valid_vma(vma, false)) { struct inode *inode = file_inode(vma->vm_file); - loff_t offset = vaddr_to_offset(vma, bp_vaddr); + loff_t offset = vma_vaddr_to_offset(vma, bp_vaddr); uprobe = find_uprobe(inode, offset); } -- 1.8.3.1
[PATCH 1/8] Uprobe: Export vaddr <-> offset conversion functions
No functionality changes. Signed-off-by: Ravi Bangoria --- include/linux/mm.h | 12 kernel/events/uprobes.c | 10 -- 2 files changed, 12 insertions(+), 10 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ad06d42..95909f2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2274,6 +2274,18 @@ struct vm_unmapped_area_info { return unmapped_area(info); } +static inline unsigned long +offset_to_vaddr(struct vm_area_struct *vma, loff_t offset) +{ + return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT); +} + +static inline loff_t +vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr) +{ + return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start); +} + /* truncate.c */ extern void truncate_inode_pages(struct address_space *, loff_t); extern void truncate_inode_pages_range(struct address_space *, diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ce6848e..bd6f230 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -130,16 +130,6 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_register) return vma->vm_file && (vma->vm_flags & flags) == VM_MAYEXEC; } -static unsigned long offset_to_vaddr(struct vm_area_struct *vma, loff_t offset) -{ - return vma->vm_start + offset - ((loff_t)vma->vm_pgoff << PAGE_SHIFT); -} - -static loff_t vaddr_to_offset(struct vm_area_struct *vma, unsigned long vaddr) -{ - return ((loff_t)vma->vm_pgoff << PAGE_SHIFT) + (vaddr - vma->vm_start); -} - /** * __replace_page - replace page in vma by new page. * based on replace_page in mm/ksm.c -- 1.8.3.1
[PATCH 0/8] trace_uprobe: Support SDT markers having reference count (semaphore)
Userspace Statically Defined Tracepoints[1] are dtrace-style markers inside userspace applications. These markers are added by developers at important places in the code. Each marker source expands to a single nop instruction in the compiled code, but there may be additional overhead for computing the marker arguments, which expands to a couple of instructions. If the overhead is significant, execution can be omitted by a runtime if() condition when no one is tracing on the marker:

  if (reference_counter > 0) {
	Execute marker instructions;
  }

The default value of the reference counter is 0. A tracer has to increment the reference counter before tracing on a marker and decrement it when done with the tracing.

Currently, the perf tool has limited support for SDT markers, i.e. it cannot trace markers surrounded by a reference counter. Also, it's not easy to add reference counter logic in a userspace tool like perf, so the basic idea of this patchset is to add the reference counter logic in the trace_uprobe infrastructure. For example[2],

  # cat tick.c
  ...
  for (i = 0; i < 100; i++) {
	DTRACE_PROBE1(tick, loop1, i);
	if (TICK_LOOP2_ENABLED()) {
		DTRACE_PROBE1(tick, loop2, i);
	}
	printf("hi: %d\n", i);
	sleep(1);
  }
  ...

Here tick:loop1 is a marker without a reference counter, whereas tick:loop2 is surrounded by a reference counter condition.

  # perf buildid-cache --add /tmp/tick
  # perf probe sdt_tick:loop1
  # perf probe sdt_tick:loop2
  # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
  hi: 0
  hi: 1
  hi: 2
  ^C
  Performance counter stats for '/tmp/tick':
	3	sdt_tick:loop1
	0	sdt_tick:loop2
	2.747086086 seconds time elapsed

Perf failed to record data for tick:loop2.
Same experiment with this patch series:

  # ./perf buildid-cache --add /tmp/tick
  # ./perf probe sdt_tick:loop2
  # ./perf stat -e sdt_tick:loop2 /tmp/tick
  hi: 0
  hi: 1
  hi: 2
  ^C
  Performance counter stats for '/tmp/tick':
	3	sdt_tick:loop2
	2.561851452 seconds time elapsed

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
[2] https://github.com/iovisor/bcc/issues/327#issuecomment-200576506
[3] https://lkml.org/lkml/2017/12/6/976

Note: The 'reference counter' is called a 'semaphore' in the original DTrace (and SystemTap, bcc, and even ELF) documentation and code. But the term 'semaphore' is misleading in this context. It is just a counter used to hold the number of tracers tracing on a marker; it is not really used for any synchronization. So we refer to it as a 'reference counter' in kernel / perf code.

RFC series can be found at: https://lkml.org/lkml/2018/2/28/76

Ravi Bangoria (8):
  Uprobe: Export vaddr <-> offset conversion functions
  mm: Prefix vma_ to vaddr_to_offset() and offset_to_vaddr()
  Uprobe: Rename map_info to uprobe_map_info
  Uprobe: Export uprobe_map_info along with uprobe_{build/free}_map_info()
  trace_uprobe: Support SDT markers having reference count (semaphore)
  trace_uprobe/sdt: Fix multiple update of same reference counter
  perf probe: Support SDT markers having reference counter (semaphore)
  trace_uprobe/sdt: Document about reference counter

 Documentation/trace/uprobetracer.txt |  16 +-
 include/linux/mm.h                   |  12 ++
 include/linux/uprobes.h              |  11 ++
 kernel/events/uprobes.c              |  62
 kernel/trace/trace.c                 |   2 +-
 kernel/trace/trace_uprobe.c          | 273 ++-
 tools/perf/util/probe-event.c        |  21 ++-
 tools/perf/util/probe-event.h        |   1 +
 tools/perf/util/probe-file.c         |  22 ++-
 tools/perf/util/symbol-elf.c         |  10 ++
 tools/perf/util/symbol.h             |   1 +
 11 files changed, 382 insertions(+), 49 deletions(-)

-- 
1.8.3.1
[PATCH 4/8] Uprobe: Export uprobe_map_info along with uprobe_{build/free}_map_info()
These exported data structure and functions will be used by other files in later patches. No functionality changes. Signed-off-by: Ravi Bangoria --- include/linux/uprobes.h | 9 + kernel/events/uprobes.c | 14 +++--- 2 files changed, 12 insertions(+), 11 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 0a294e9..7bd2760 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -109,12 +109,19 @@ enum rp_check { RP_CHECK_RET, }; +struct address_space; struct xol_area; struct uprobes_state { struct xol_area *xol_area; }; +struct uprobe_map_info { + struct uprobe_map_info *next; + struct mm_struct *mm; + unsigned long vaddr; +}; + extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); extern bool is_swbp_insn(uprobe_opcode_t *insn); @@ -149,6 +156,8 @@ struct uprobes_state { extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, void *src, unsigned long len); +extern struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info); +extern struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping, loff_t offset, bool is_register); #else /* !CONFIG_UPROBES */ struct uprobes_state { }; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 081b88c1..e7830b8 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -695,23 +695,15 @@ static void delete_uprobe(struct uprobe *uprobe) put_uprobe(uprobe); } -struct uprobe_map_info { - struct uprobe_map_info *next; - struct mm_struct *mm; - unsigned long vaddr; -}; - -static inline struct uprobe_map_info * -uprobe_free_map_info(struct uprobe_map_info *info) +struct uprobe_map_info *uprobe_free_map_info(struct uprobe_map_info *info) { struct uprobe_map_info *next = info->next; kfree(info); return next; } 
-static struct uprobe_map_info * -uprobe_build_map_info(struct address_space *mapping, loff_t offset, - bool is_register) +struct uprobe_map_info *uprobe_build_map_info(struct address_space *mapping, + loff_t offset, bool is_register) { unsigned long pgoff = offset >> PAGE_SHIFT; struct vm_area_struct *vma; -- 1.8.3.1
[PATCH 3/8] Uprobe: Rename map_info to uprobe_map_info
map_info is very generic name, rename it to uprobe_map_info. Renaming will help to export this structure outside of the file. Also rename free_map_info() to uprobe_free_map_info() and build_map_info() to uprobe_build_map_info(). No functionality changes. Signed-off-by: Ravi Bangoria --- kernel/events/uprobes.c | 32 +--- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 535fd39..081b88c1 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -695,27 +695,29 @@ static void delete_uprobe(struct uprobe *uprobe) put_uprobe(uprobe); } -struct map_info { - struct map_info *next; +struct uprobe_map_info { + struct uprobe_map_info *next; struct mm_struct *mm; unsigned long vaddr; }; -static inline struct map_info *free_map_info(struct map_info *info) +static inline struct uprobe_map_info * +uprobe_free_map_info(struct uprobe_map_info *info) { - struct map_info *next = info->next; + struct uprobe_map_info *next = info->next; kfree(info); return next; } -static struct map_info * -build_map_info(struct address_space *mapping, loff_t offset, bool is_register) +static struct uprobe_map_info * +uprobe_build_map_info(struct address_space *mapping, loff_t offset, + bool is_register) { unsigned long pgoff = offset >> PAGE_SHIFT; struct vm_area_struct *vma; - struct map_info *curr = NULL; - struct map_info *prev = NULL; - struct map_info *info; + struct uprobe_map_info *curr = NULL; + struct uprobe_map_info *prev = NULL; + struct uprobe_map_info *info; int more = 0; again: @@ -729,7 +731,7 @@ static inline struct map_info *free_map_info(struct map_info *info) * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through * reclaim. This is optimistic, no harm done if it fails. 
*/ - prev = kmalloc(sizeof(struct map_info), + prev = kmalloc(sizeof(struct uprobe_map_info), GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN); if (prev) prev->next = NULL; @@ -762,7 +764,7 @@ static inline struct map_info *free_map_info(struct map_info *info) } do { - info = kmalloc(sizeof(struct map_info), GFP_KERNEL); + info = kmalloc(sizeof(struct uprobe_map_info), GFP_KERNEL); if (!info) { curr = ERR_PTR(-ENOMEM); goto out; @@ -774,7 +776,7 @@ static inline struct map_info *free_map_info(struct map_info *info) goto again; out: while (prev) - prev = free_map_info(prev); + prev = uprobe_free_map_info(prev); return curr; } @@ -782,11 +784,11 @@ static inline struct map_info *free_map_info(struct map_info *info) register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new) { bool is_register = !!new; - struct map_info *info; + struct uprobe_map_info *info; int err = 0; percpu_down_write(&dup_mmap_sem); - info = build_map_info(uprobe->inode->i_mapping, + info = uprobe_build_map_info(uprobe->inode->i_mapping, uprobe->offset, is_register); if (IS_ERR(info)) { err = PTR_ERR(info); @@ -825,7 +827,7 @@ static inline struct map_info *free_map_info(struct map_info *info) up_write(&mm->mmap_sem); free: mmput(mm); - info = free_map_info(info); + info = uprobe_free_map_info(info); } out: percpu_up_write(&dup_mmap_sem); -- 1.8.3.1
[PATCH 5/8] trace_uprobe: Support SDT markers having reference count (semaphore)
Userspace Statically Defined Tracepoints[1] are dtrace-style markers inside userspace applications. These markers are added by developers at important places in the code. Each marker source expands to a single nop instruction in the compiled code, but there may be additional overhead for computing the marker arguments, which expands to a couple of instructions. If the overhead is significant, execution can be omitted by a runtime if() condition when no one is tracing on the marker:

  if (reference_counter > 0) {
	Execute marker instructions;
  }

The default value of the reference counter is 0. A tracer has to increment the reference counter before tracing on a marker and decrement it when done with the tracing.

Implement the reference counter logic in trace_uprobe, leaving the core uprobe infrastructure as is, except for one new callback from uprobe_mmap() to trace_uprobe. A trace_uprobe definition with a reference counter will now be:

  <path>:<offset>[(ref_ctr_offset)]

There are two different cases to handle while enabling the marker:

1. Tracing an existing process. In this case, find all suitable processes and increment the reference counter in them.

2. Enabling the trace before running the target binary. In this case, all mmaps will get notified to trace_uprobe, and trace_uprobe will increment the reference counter if the corresponding uprobe is enabled.

At the time of disabling probes, decrement the reference counter in all existing target processes.

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Note: The 'reference counter' is called a 'semaphore' in the original DTrace (and SystemTap, bcc, and even ELF) documentation and code. But the term 'semaphore' is misleading in this context. It is just a counter used to hold the number of tracers tracing on a marker; it is not really used for any synchronization. So we refer to it as a 'reference counter' in kernel / perf code.
Signed-off-by: Ravi Bangoria Signed-off-by: Fengguang Wu [Fengguang reported/fixed build failure in RFC patch] --- include/linux/uprobes.h | 2 + kernel/events/uprobes.c | 6 ++ kernel/trace/trace_uprobe.c | 172 +++- 3 files changed, 178 insertions(+), 2 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index 7bd2760..2d4df65 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -122,6 +122,8 @@ struct uprobe_map_info { unsigned long vaddr; }; +extern void (*uprobe_mmap_callback)(struct vm_area_struct *vma); + extern int set_swbp(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); extern int set_orig_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long vaddr); extern bool is_swbp_insn(uprobe_opcode_t *insn); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index e7830b8..06821bb 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1041,6 +1041,9 @@ static void build_probe_list(struct inode *inode, spin_unlock(&uprobes_treelock); } +/* Rightnow the only user of this is trace_uprobe. */ +void (*uprobe_mmap_callback)(struct vm_area_struct *vma); + /* * Called from mmap_region/vma_adjust with mm->mmap_sem acquired. 
* @@ -1053,6 +1056,9 @@ int uprobe_mmap(struct vm_area_struct *vma) struct uprobe *uprobe, *u; struct inode *inode; + if (uprobe_mmap_callback) + uprobe_mmap_callback(vma); + if (no_uprobe_events() || !valid_vma(vma, true)) return 0; diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 2014f43..b6c9b48 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -25,6 +25,8 @@ #include #include #include +#include +#include #include "trace_probe.h" @@ -58,6 +60,7 @@ struct trace_uprobe { struct inode*inode; char*filename; unsigned long offset; + unsigned long ref_ctr_offset; unsigned long nhit; struct trace_probe tp; }; @@ -362,10 +365,10 @@ static int create_trace_uprobe(int argc, char **argv) { struct trace_uprobe *tu; struct inode *inode; - char *arg, *event, *group, *filename; + char *arg, *event, *group, *filename, *rctr, *rctr_end; char buf[MAX_EVENT_NAME_LEN]; struct path path; - unsigned long offset; + unsigned long offset, ref_ctr_offset; bool is_delete, is_return; int i, ret; @@ -375,6 +378,7 @@ static int create_trace_uprobe(int argc, char **argv) is_return = false; event = NULL; group = NULL; + ref_ctr_offset = 0; /* argc must be >= 1 */ if (argv[0][0] == '-') @@ -454,6 +458,26 @@ static int create_trace_uprobe(int argc, char **argv) goto fail_address_
[PATCH 7/8] perf probe: Support SDT markers having reference counter (semaphore)
With this, perf buildid-cache will save SDT markers with reference counter in probe cache. Perf probe will be able to probe markers having reference counter. Ex, # readelf -n /tmp/tick | grep -A1 loop2 Name: loop2 ... Semaphore: 0x10020036 # ./perf buildid-cache --add /tmp/tick # ./perf probe sdt_tick:loop2 # ./perf stat -e sdt_tick:loop2 /tmp/tick hi: 0 hi: 1 hi: 2 ^C Performance counter stats for '/tmp/tick': 3 sdt_tick:loop2 2.561851452 seconds time elapsed Signed-off-by: Ravi Bangoria --- tools/perf/util/probe-event.c | 21 + tools/perf/util/probe-event.h | 1 + tools/perf/util/probe-file.c | 22 +++--- tools/perf/util/symbol-elf.c | 10 ++ tools/perf/util/symbol.h | 1 + 5 files changed, 48 insertions(+), 7 deletions(-) diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index e1dbc98..2cbe68a 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -1832,6 +1832,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev) tp->offset = strtoul(fmt2_str, NULL, 10); } + if (tev->uprobes) { + fmt2_str = strchr(p, '('); + if (fmt2_str) + tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0); + } + tev->nargs = argc - 2; tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs); if (tev->args == NULL) { @@ -2054,15 +2060,22 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev) } /* Use the tp->address for uprobes */ - if (tev->uprobes) - err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address); - else if (!strncmp(tp->symbol, "0x", 2)) + if (tev->uprobes) { + if (tp->ref_ctr_offset) + err = strbuf_addf(&buf, "%s:0x%lx(0x%lx)", tp->module, + tp->address, tp->ref_ctr_offset); + else + err = strbuf_addf(&buf, "%s:0x%lx", tp->module, + tp->address); + } else if (!strncmp(tp->symbol, "0x", 2)) { /* Absolute address. See try_to_find_absolute_address() */ err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "", tp->module ? 
":" : "", tp->address); - else + } else { err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "", tp->module ? ":" : "", tp->symbol, tp->offset); + } + if (err) goto error; diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h index 45b14f0..15a98c3 100644 --- a/tools/perf/util/probe-event.h +++ b/tools/perf/util/probe-event.h @@ -27,6 +27,7 @@ struct probe_trace_point { char*symbol;/* Base symbol */ char*module;/* Module name */ unsigned long offset; /* Offset from symbol */ + unsigned long ref_ctr_offset; /* SDT reference counter offset */ unsigned long address;/* Actual address of the trace point */ boolretprobe; /* Return probe flag */ }; diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c index 4ae1123..08ba3a6 100644 --- a/tools/perf/util/probe-file.c +++ b/tools/perf/util/probe-file.c @@ -701,6 +701,12 @@ static unsigned long long sdt_note__get_addr(struct sdt_note *note) : (unsigned long long)note->addr.a64[0]; } +static unsigned long long sdt_note__get_ref_ctr_offset(struct sdt_note *note) +{ + return note->bit32 ? (unsigned long long)note->addr.a32[2] + : (unsigned long long)note->addr.a64[2]; +} + static const char * const type_to_suffix[] = { ":s64", "", "", "", ":s32", "", ":s16", ":s8", "", ":u8", ":u16", "", ":u32", "", "", "", ":u64" @@ -776,14 +782,24 @@ static char *synthesize_sdt_probe_command(struct sdt_note *note, { struct strbuf buf; char *ret = NULL, **args; - int i, args_count; + int i, args_count, err; + unsigned long long ref_ctr_offset; if (strbuf_init(&buf, 32) < 0) return NULL; - if (strbuf_addf(&buf, "p:%s/%s %s:0x%llx", + ref_ctr_offset = sdt_note__get_ref_ctr_offset(note); + + if (ref_ctr_offset) +
[PATCH 6/8] trace_uprobe/sdt: Fix multiple update of same reference counter
For tiny binaries/libraries, different mmap regions point to the same file portion. In such cases, we may increment the reference counter multiple times. But on de-registration, the reference counter will get decremented only once, leaving the reference counter > 0 even if no one is tracing on that marker.

Ensure increment and decrement happen in sync by keeping a list of mms in trace_uprobe. Increment the reference counter only if the mm is not present in the list and decrement it only if the mm is present in the list.

Example:

  # echo "p:sdt_tick/loop2 /tmp/tick:0x6e4(0x10036)" > uprobe_events

Before patch:

  # perf stat -a -e sdt_tick:loop2
  # /tmp/tick
  # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
  000: 02 .
  # pkill perf
  # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
  000: 01 .

After patch:

  # perf stat -a -e sdt_tick:loop2
  # /tmp/tick
  # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
  000: 01 .
  # pkill perf
  # dd if=/proc/`pgrep tick`/mem bs=1 count=1 skip=$(( 0x10020036 )) 2>/dev/null | xxd
  000: 00 .
Signed-off-by: Ravi Bangoria --- kernel/trace/trace_uprobe.c | 105 +++- 1 file changed, 103 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index b6c9b48..9bf3f7a 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -50,6 +50,11 @@ struct trace_uprobe_filter { struct list_headperf_events; }; +struct sdt_mm_list { + struct mm_struct *mm; + struct sdt_mm_list *next; +}; + /* * uprobe event core functions */ @@ -61,6 +66,8 @@ struct trace_uprobe { char*filename; unsigned long offset; unsigned long ref_ctr_offset; + struct sdt_mm_list *sml; + struct rw_semaphore sml_rw_sem; unsigned long nhit; struct trace_probe tp; }; @@ -274,6 +281,7 @@ static inline bool is_ret_probe(struct trace_uprobe *tu) if (is_ret) tu->consumer.ret_handler = uretprobe_dispatcher; init_trace_uprobe_filter(&tu->filter); + init_rwsem(&tu->sml_rw_sem); return tu; error: @@ -921,6 +929,74 @@ static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func, return trace_handle_return(s); } +static bool sdt_check_mm_list(struct trace_uprobe *tu, struct mm_struct *mm) +{ + struct sdt_mm_list *tmp = tu->sml; + + if (!tu->sml || !mm) + return false; + + while (tmp) { + if (tmp->mm == mm) + return true; + tmp = tmp->next; + } + + return false; +} + +static void sdt_add_mm_list(struct trace_uprobe *tu, struct mm_struct *mm) +{ + struct sdt_mm_list *tmp; + + tmp = kzalloc(sizeof(*tmp), GFP_KERNEL); + if (!tmp) + return; + + tmp->mm = mm; + tmp->next = tu->sml; + tu->sml = tmp; +} + +static void sdt_del_mm_list(struct trace_uprobe *tu, struct mm_struct *mm) +{ + struct sdt_mm_list *prev, *curr; + + if (!tu->sml) + return; + + if (tu->sml->mm == mm) { + curr = tu->sml; + tu->sml = tu->sml->next; + kfree(curr); + return; + } + + prev = tu->sml; + curr = tu->sml->next; + while (curr) { + if (curr->mm == mm) { + prev->next = curr->next; + kfree(curr); + return; + } + prev = curr; + curr = curr->next; + } +} + +static void 
sdt_flush_mm_list(struct trace_uprobe *tu) +{ + struct sdt_mm_list *next, *curr = tu->sml; + + while (curr) { + next = curr->next; + kfree(curr); + curr = next; + } + tu->sml = NULL; +} + static bool sdt_valid_vma(struct trace_uprobe *tu, struct vm_area_struct *vma) { unsigned long vaddr = vma_offset_to_vaddr(vma, tu->ref_ctr_offset); @@ -989,17 +1065,25 @@ static void sdt_increment_ref_ctr(struct trace_uprobe *tu) if (IS_ERR(info)) goto out; + down_write(&tu->sml_rw_sem); while (info) { + if (sdt_check_mm_list(tu, info->mm)) + goto cont; + down_write(&info->mm->mmap_sem); vma = sdt_find_vma(info->mm, tu); vaddr = vma_offset_to_
[PATCH 8/8] trace_uprobe/sdt: Document about reference counter
No functionality changes. Signed-off-by: Ravi Bangoria --- Documentation/trace/uprobetracer.txt | 16 +--- kernel/trace/trace.c | 2 +- 2 files changed, 14 insertions(+), 4 deletions(-) diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt index bf526a7c..8fb13b0 100644 --- a/Documentation/trace/uprobetracer.txt +++ b/Documentation/trace/uprobetracer.txt @@ -19,15 +19,25 @@ user to calculate the offset of the probepoint in the object. Synopsis of uprobe_tracer - - p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe - r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe) - -:[GRP/]EVENT : Clear uprobe or uretprobe event + p[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS] + r[:[GRP/]EVENT] PATH:OFFSET[(REF_CTR_OFFSET)] [FETCHARGS] + -:[GRP/]EVENT + + p : Set a uprobe + r : Set a return uprobe (uretprobe) + - : Clear uprobe or uretprobe event GRP : Group name. If omitted, "uprobes" is the default value. EVENT : Event name. If omitted, the event name is generated based on PATH+OFFSET. PATH : Path to an executable or a library. OFFSET: Offset where the probe is inserted. + REF_CTR_OFFSET: Reference counter offset. Optional field. The reference + count gates the invocation of the probe. If present, the + reference count is 0 by default. The kernel needs to increment + it before tracing the probe and decrement it when done. This + is identical to the semaphore in Userspace Statically Defined + Tracepoints (USDT). FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 20a2300..2104d03 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -4604,7 +4604,7 @@ static int tracing_trace_options_open(struct inode *inode, struct file *file) "place (kretprobe): [:][+]|\n" #endif #ifdef CONFIG_UPROBE_EVENTS - "\tplace: :\n" + " place (uprobe): :[(ref_ctr_offset)]\n" #endif "\t args: =fetcharg[:type]\n" "\t fetcharg: %, @, @[+|-],\n" -- 1.8.3.1