On 2026-06-03 10:27 +0200, Alban Crequy wrote:
> Sashiko raised a question about pidfd_get_task() and PIDFD_THREAD [1],
> so I ran some tests to understand the behavior.
> [1]
> https://sashiko.dev/#/patchset/[email protected]
>
> pidfd_get_task() always resolves pidfds using PIDTYPE_TGID (kernel/pid.c
> line 640), regardless of whether the pidfd was created with PIDFD_THREAD.
> This means:
>
> - A PIDFD_THREAD pidfd for a non-leader thread fails with ESRCH.
> - A regular pidfd for a process whose leader has exited (pthread_exit
>   in main, secondary thread still alive) also fails with ESRCH.
>
> This is not specific to my patch: process_madvise() uses pidfd_get_task()
> in the same way and has the same behavior. I wrote a test program
> confirming this:
>
> https://github.com/alban/tests/tree/alban_pvm_flags/pvm_flags/pidfd_thread_test
>
> Results summary:
>
> All threads alive:
>   pidfd_open(pid, 0) + process_vm_readv: OK
>   pidfd_open(tid, PIDFD_THREAD) + process_vm_readv: OK (leader tid)
>   pidfd_open(tid, PIDFD_THREAD) + process_vm_readv: ESRCH (non-leader)
>
> Leader thread exited (secondary still alive):
>   pidfd_open(pid, 0) + process_vm_readv: ESRCH
>   pidfd_open(pid, PIDFD_THREAD) + process_vm_readv: ESRCH
>   pidfd_open(tid, PIDFD_THREAD) + process_vm_readv: ESRCH (non-leader)
>   process_vm_readv(tid, flags=0): OK (plain TID path)
>
> process_madvise() behaves identically in all cases above.
>
> For the non-leader thread case when all threads are alive, this is fine in
> practice: all threads share the same mm_struct, so profilers just use a
> regular pidfd for the thread-group leader.

This was an intentional limitation back then because pidfds only came in
thread-group flavor. I only added subthread pidfds much later.
pidfd_get_task() should drop the flags argument btw. I think that's unused.

> However, the exited-leader case is a real limitation for profilers.
> OpenTelemetry eBPF Profiler wants to profile a process where the main thread
> has exited but secondary threads are still running [2].
> [2] https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/376

If the thread-group leader exits before all of its subthreads exit then
this is a broken program - even if it is a legal state. The thread-group
leader cannot be reaped while there are live subthreads and it also means
that any subthread exec "resurrects" the thread-group leader struct pid.
So that's going to make for fun profiling...

> Using plain TIDs (flags=0) would work, but it means users cannot use
> PROCESS_VM_PIDFD in this scenario.
>
> What do you think this patch should do? I see two options:
> - Address this limitation in a separate future patch that fixes
>   pidfd_get_task() to use PIDTYPE_PID when PIDFD_THREAD is detected in
>   f_flags, benefiting all callers (process_vm_readv, process_madvise,
>   and any future users).

As long as all users of the interface are fine with operating on
subthreads this should be perfectly fine.
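
For illustration only, here is a userspace toy model of the lookup change
described in the quoted option. It is not kernel code: struct pid_model,
struct pidfd_model, MODEL_PIDFD_THREAD and both helper functions are
hypothetical names made up for this sketch, and it only covers the
non-leader case with all threads alive. It assumes a thread's pid entry
carries a PIDTYPE_TGID attachment only when that thread is the
thread-group leader.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Names mirror the kernel's, but this is a simplified stand-in. */
enum pid_type { PIDTYPE_PID, PIDTYPE_TGID, PIDTYPE_MAX };

/* Hypothetical flag bit for this model only (not the real PIDFD_THREAD value). */
#define MODEL_PIDFD_THREAD 0x1u

struct task { int tid; };

/* One slot per pid type; NULL means "no task attached as this type". */
struct pid_model { struct task *tasks[PIDTYPE_MAX]; };

/* A pidfd in this model: the pid it refers to plus its open flags. */
struct pidfd_model { struct pid_model *pid; unsigned int f_flags; };

/* Behaviour as described in the thread: always look up as PIDTYPE_TGID. */
int get_task_current(const struct pidfd_model *fd, struct task **out)
{
    struct task *t = fd->pid->tasks[PIDTYPE_TGID];

    if (!t)
        return -ESRCH;
    *out = t;
    return 0;
}

/* Sketch of the proposed option: pick PIDTYPE_PID for thread pidfds. */
int get_task_proposed(const struct pidfd_model *fd, struct task **out)
{
    enum pid_type type = (fd->f_flags & MODEL_PIDFD_THREAD)
                             ? PIDTYPE_PID : PIDTYPE_TGID;
    struct task *t = fd->pid->tasks[type];

    if (!t)
        return -ESRCH;
    *out = t;
    return 0;
}
```

In this model a thread pidfd for a non-leader thread gets -ESRCH from
get_task_current() and resolves to that thread with get_task_proposed(),
while a regular pidfd for the leader behaves the same in both.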

