On 6/2/26 12:09, Alban Crequy wrote:
> From: Alban Crequy <[email protected]>
> 
> There are two categories of users for process_vm_readv:
> 
> 1. Debuggers like GDB or strace.
> 
>    When a debugger attempts to read the target memory and triggers a
>    page fault, the page fault needs to be resolved so that the debugger
>    can accurately interpret the memory. A debugger is typically attached
>    to a single process.
> 
> 2. Profilers like OpenTelemetry eBPF Profiler.
> 
>    The profiler uses a perf event to get stack traces from all
>    processes at 20Hz (20 stack traces to resolve per second). For
>    interpreted languages (Ruby, Python, etc.), the profiler uses
>    process_vm_readv to get the correct symbols. In this case,
>    performance matters most. It is fine if some stack traces cannot be
>    resolved, as long as the unresolved fraction is not statistically
>    significant.
> 
> The current behaviour of process_vm_readv is to resolve page faults in
> the target VM. This is what debuggers want, but it is unwelcome for
> profilers because resolving a page fault can take a long time
> depending on the backing filesystem. Additionally, since profilers
> monitor all processes, a slow page fault in one target process should
> not delay the monitoring of all the other target processes.
> 
> This patch adds the flag PROCESS_VM_NOWAIT, so the caller can choose
> not to block on IO if the memory access causes a page fault. When a page
> is not resident and would require IO to fault in, the syscall returns
> a short read (the number of bytes successfully read before the fault)
> or -1 with errno set to EFAULT if no bytes were read.
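A minimal userspace sketch of how a profiler might consume these return values. The helper read_remote() and the remote_read_status names are hypothetical. The PROCESS_VM_NOWAIT value is also an assumption: the real constant would come from the patched uapi header, which is not quoted here. Note that even without the flag, a short read or EFAULT can mean the remote range runs into unmapped memory.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical value; the real constant comes from the patched uapi header. */
#ifndef PROCESS_VM_NOWAIT
#define PROCESS_VM_NOWAIT (1UL << 1)
#endif

enum remote_read_status {
    REMOTE_READ_FULL,   /* every requested byte was copied */
    REMOTE_READ_SHORT,  /* copy stopped early: a non-resident page with
                           PROCESS_VM_NOWAIT, or an unmapped address */
    REMOTE_READ_NONE,   /* -1/EFAULT: nothing was copied */
    REMOTE_READ_ERROR,  /* any other failure; errno is left as set */
};

/* Copy len bytes at address addr in process pid into buf. */
static enum remote_read_status read_remote(pid_t pid, uintptr_t addr,
                                           void *buf, size_t len,
                                           unsigned long flags,
                                           size_t *copied)
{
    struct iovec local = { .iov_base = buf, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)addr, .iov_len = len };
    ssize_t n = process_vm_readv(pid, &local, 1, &remote, 1, flags);

    *copied = 0;
    if (n < 0)
        return errno == EFAULT ? REMOTE_READ_NONE : REMOTE_READ_ERROR;
    *copied = (size_t)n;
    return (size_t)n == len ? REMOTE_READ_FULL : REMOTE_READ_SHORT;
}
```

A profiler would pass PROCESS_VM_NOWAIT as flags. It would drop the sample on REMOTE_READ_SHORT or REMOTE_READ_NONE instead of retrying without the flag, because a retry would reintroduce the blocking fault.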
> 
> Additionally, this patch adds the flag PROCESS_VM_PIDFD to refer to the
> remote process via a PID file descriptor instead of a PID. Such a file
> descriptor can be obtained with pidfd_open(2). This avoids races with
> PID reuse. PID reuse is unlikely to be a problem for debuggers because
> they can monitor the target process termination in other ways
> (ptrace), but the flag can help in some profiling scenarios. When using
> PROCESS_VM_PIDFD, the first argument is a pidfd instead of a pid. If
> the pidfd is invalid, the syscall returns -1 with errno set to EBADF.
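A minimal sketch of calling the syscall with a pidfd, with a fallback for kernels that reject the flag. The helpers open_pidfd() and read_by_pidfd() are hypothetical, and so is the PROCESS_VM_PIDFD value. On EINVAL the sketch retries by numeric PID. It then polls the pidfd, which becomes readable once the process exits. If the process is still alive after the read, its PID cannot have been recycled during the read.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical value; the real constant comes from the patched uapi header. */
#ifndef PROCESS_VM_PIDFD
#define PROCESS_VM_PIDFD (1UL << 0)
#endif

/* pidfd_open(2) through syscall(2), for libcs that lack a wrapper. */
static int open_pidfd(pid_t pid)
{
#ifdef SYS_pidfd_open
    return (int)syscall(SYS_pidfd_open, pid, 0);
#else
    (void)pid;
    errno = ENOSYS;
    return -1;
#endif
}

/*
 * Read len bytes at remote_addr in the process referred to by pidfd.
 * pid must be the PID the pidfd was opened for; it is only used when
 * the kernel rejects PROCESS_VM_PIDFD with EINVAL.
 */
static ssize_t read_by_pidfd(int pidfd, pid_t pid, void *buf,
                             const void *remote_addr, size_t len)
{
    struct iovec local = { .iov_base = buf, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };
    ssize_t n;

    /* With PROCESS_VM_PIDFD the first argument carries the pidfd. */
    n = process_vm_readv((pid_t)pidfd, &local, 1, &remote, 1,
                         PROCESS_VM_PIDFD);
    if (n >= 0 || errno != EINVAL)
        return n;

    /* Kernel without the flag: read by PID instead. */
    n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
    if (n < 0)
        return n;

    /* A pidfd polls readable once its process has exited. */
    struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    int ready = poll(&pfd, 1, 0);
    if (ready < 0)
        return -1;
    if (ready > 0) {
        errno = ESRCH;  /* target exited; the bytes may not be its own */
        return -1;
    }
    return n;
}
```

The liveness check is conservative. It also rejects a target that has exited but has not yet been reaped, even though that target's PID is not yet reusable.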
> 
> If a given flag is unsupported, the syscall returns the error EINVAL
> without checking the buffers. This gives userspace a way to detect
> whether the current kernel supports a specific flag:
> 
>   process_vm_readv(pid, NULL, 1, NULL, 1, PROCESS_VM_PIDFD)
>   -> EINVAL if the kernel does not support the flag PROCESS_VM_PIDFD
>      (before this patch)
>   -> EFAULT if the kernel supports the flag (after this patch)
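The probe above can be wrapped in a small helper. The sketch below is hypothetical: process_vm_flag_supported() is an illustrative name, and the flag values are assumptions rather than the constants from the patch. It relies only on the ordering described above: an unknown flag fails with EINVAL, while a known flag gets as far as copying the NULL iovec array and fails with EFAULT, so no memory is read.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical values; the real constants come from the patched uapi header. */
#ifndef PROCESS_VM_PIDFD
#define PROCESS_VM_PIDFD  (1UL << 0)
#endif
#ifndef PROCESS_VM_NOWAIT
#define PROCESS_VM_NOWAIT (1UL << 1)
#endif

/*
 * Returns 1 if the running kernel accepts flag, 0 if it rejects it
 * with EINVAL, and -1 for any other outcome (e.g. a seccomp filter).
 */
static int process_vm_flag_supported(unsigned long flag)
{
    /* NULL iovecs with a count of 1: a kernel that knows the flag
     * fails with EFAULT while copying the iovec array. */
    if (process_vm_readv(getpid(), NULL, 1, NULL, 1, flag) != -1)
        return -1;
    if (errno == EFAULT)
        return 1;
    if (errno == EINVAL)
        return 0;
    return -1;
}
```

A profiler could call process_vm_flag_supported(PROCESS_VM_NOWAIT) once at startup and choose its read path from the result.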
> 
> Suggested man page update for process_vm_readv(2):
> 
>   The flags argument is the bitwise OR of zero or more of these flags:
> 
>   PROCESS_VM_PIDFD (since Linux 7.x)
>       The pid argument is a PID file descriptor (see pidfd_open(2))
>       instead of a PID number. When using this flag, the existing
>       ESRCH error applies if the process referred to by the pidfd
>       has exited.
> 
>   PROCESS_VM_NOWAIT (since Linux 7.x)
>       Do not block on IO. If a page in the remote address space is not
>       resident and would require disk IO to fault in, the system call
>       returns a short read or fails with EFAULT if no bytes were read.
> 
>   Additional error:
> 
>   EBADF  pid is not a valid file descriptor (PROCESS_VM_PIDFD only).
> 
> Signed-off-by: Alban Crequy <[email protected]>
> ---

Nothing jumped at me, thanks!

Acked-by: David Hildenbrand (Arm) <[email protected]>

-- 
Cheers,

David
