On Sat, 2 Nov 2013 16:54:58 +0100, Oleg Nesterov wrote: > Hello, > > Let me first apologize again if this was already discussed. And I also > need to mention that I know almost nothing about elf/randomization/etc.
No no, this was not discussed enough. And I really appreciate your thorough review! :) > > However, > > On 10/29, Namhyung Kim wrote: >> >> # nm foo | grep -e glob$ -e str -e foo >> 00000000006008bc D foo >> 00000000006008a8 D glob >> 00000000006008ac D str >> >> # perf probe -x /home/namhyung/tmp/foo -a 'foo=main+0x13 glob=@0x8a8:s32 \ > > This does not look right to me. > > - get_user_vaddr() is costly, it does vma_interval_tree_foreach() under > ->i_mmap_mutex. Hmm.. yes, I think this is not needed. I guess it should lookup a proper vma in current->mm with mmap_sem read-locked. > > - this only allows to read the data from the same binary. Right. This is also an unnecessary restriction. We should be able to access data in other binary. > > - in particular, you can't read the data from bss I can't understand why.. The bss region should also be in a same vma of normal data, no? > > - get_user_vaddr() looks simply wrong. I blindly applied the whole series > and did the test to ensure. > > Test-case: > > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > > unsigned int global = 0x1234; > > void func(void) > { > } > > int main(void) > { > char cmd[64]; > > global = 0x4321; > func(); > > printf("addr = %p\n", &global); > > sprintf(cmd, "cat /proc/%d/maps", getpid()); > system(cmd); > > return 0; > } > > # nm foo | grep -w global > 0000000000600a04 D global > > # perf probe -x ./foo -a "func var=@0xa04:u32" > # perf record -e probe_foo:func ./foo > addr = 0x600a04 > 00400000-00401000 r-xp 00000000 fe:01 20958 > /root/foo > 00600000-00601000 rw-p 00000000 fe:01 20958 > /root/foo > ... > > # perf script | tail -1 > foo 555 [000] 1302.345642: probe_foo:func: (40059c) var=1234 > > Note that it reports "1234", not "4321". This is because > get_user_vaddr() finds another (1st) read-only mapping, and > prints the initial value of "global". > > IOW, it reads the memory from 0x400a04, not from 0x600a04. Argh.. This is a problem. I thought the gcc somehow aligns data to next page boundary. But if it's not the case, we need to recognize which is the proper one.. Simply preferring a writable vma to a read-only vma is what's came to my head now. Do you have an idea? > > ------------------------------------------------------------------------------- > Can't we simply implement get_user_vaddr() as > > static void __user *get_user_vaddr(unsigned long addr, struct > trace_uprobe *tu) > { > void __user *vaddr = (void __force __user *)addr; > > /* A NULL tu means that we already got the vaddr */ > if (tu) > vaddr += (current->mm->start_data & PAGE_MASK); > > return vaddr; > } > > ? > > I did this change, and now the test-case above works. And it also works > with "cc -pie -fPIC", > > # nm foo | grep -w global > 0000000000200c9c D global > > # perf probe -x ./foo -a "func var=@0xc9c:u32" > # perf record -e probe_foo:func ./foo > ... > # perf script | tail -1 > foo 576 [001] 475.519940: probe_foo:func: (7ffe95ca3814) > var=4321 > > What do you think? This can only work with the probes fetching data from the executable, right? But as I said it should support any other binaries too. What about this? static void __user *get_user_vaddr(unsigned long addr, struct trace_uprobe *tu) { unsigned long pgoff = addr >> PAGE_SHIFT; struct vm_area_struct *vma, *orig_vma = NULL; unsigned long vaddr = 0; if (tu == NULL) { /* A NULL tu means that we already got the vaddr */ return (void __force __user *) addr; } down_read(¤t->mm->mmap_sem); vma = current->mm->mmap; do { if (!vma->vm_file || vma->vm_file->f_inode != tu->inode) { /* * We found read-only mapping for this inode. * (provided that all mappings for this inode * have consecutive addresses) */ if (orig_vma) break; continue; } if (vma->vm_pgoff > pgoff || (vma->vm_pgoff + vma_pages(vma) <= pgoff)) continue; orig_vma = vma; /* * We prefer writable mapping over read-only since * data is usually in read/write memory region. But * in case of read-only data, it only can be found in * read-only mapping so we save orig_vma and check * whether it also has writable mapping. */ if (vma->vm_flags & VM_WRITE) break; } while ((vma = vma->vm_next) != NULL); if (orig_vma) vaddr = offset_to_vaddr(orig_vma, addr); up_read(¤t->mm->mmap_sem); return (void __force __user *) vaddr; } > > ------------------------------------------------------------------------------- > Note: > - I think that /* A NULL tu means that we already got the vaddr */ > needs more discussion... IOW, I am not sure about 11/13. Discussion and feedbacks are always more than welcome. :) > > - Perhaps it also makes sense to allow to pass the absolute address > (iow, += start_data should be conditional) For per-process uprobe, it might be useful. > > but lets ignore this for now. Okay. Let's discuss again after solving current issues. Thanks, Namhyung -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/