On 2026-06-03 10:27 +0200, Alban Crequy wrote:
> Sashiko raised a question about pidfd_get_task() and PIDFD_THREAD [1],
> so I ran some tests to understand the behavior.
> [1] https://sashiko.dev/#/patchset/[email protected]
> 
> pidfd_get_task() always resolves pidfds using PIDTYPE_TGID (kernel/pid.c
> line 640), regardless of whether the pidfd was created with PIDFD_THREAD.
> This means:
> 
>  - A PIDFD_THREAD pidfd for a non-leader thread fails with ESRCH.
>  - A regular pidfd for a process whose leader has exited (pthread_exit
>    in main, secondary thread still alive) also fails with ESRCH.
> 
> This is not specific to my patch: process_madvise() uses pidfd_get_task()
> in the same way and has the same behavior. I wrote a test program
> confirming this:
> 
>   https://github.com/alban/tests/tree/alban_pvm_flags/pvm_flags/pidfd_thread_test
> 
> Results summary:
> 
>   All threads alive:
>     pidfd_open(pid, 0)              + process_vm_readv: OK
>     pidfd_open(tid, PIDFD_THREAD)   + process_vm_readv: OK (leader tid)
>     pidfd_open(tid, PIDFD_THREAD)   + process_vm_readv: ESRCH (non-leader)
> 
>   Leader thread exited (secondary still alive):
>     pidfd_open(pid, 0)              + process_vm_readv: ESRCH
>     pidfd_open(pid, PIDFD_THREAD)   + process_vm_readv: ESRCH
>     pidfd_open(tid, PIDFD_THREAD)   + process_vm_readv: ESRCH (non-leader)
>     process_vm_readv(tid, flags=0)                    : OK (plain TID path)
> 
>   process_madvise() behaves identically in all cases above.
> 
> For the non-leader thread case when all threads are alive, this is fine in
> practice: all threads share the same mm_struct, so profilers just use a regular
> pidfd for the thread-group leader.

This was an intentional limitation back then because pidfds only came in
thread-group flavor. I only added subthread pidfds much later.
pidfd_get_task() should drop the flags argument btw. I think that's
unused.

> However, the exited-leader case is a real limitation for profilers.
> OpenTelemetry eBPF Profiler wants to profile a process where the main thread
> has exited but secondary threads are still running [2].
> [2] https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/376

If the thread-group leader exits before all of its subthreads exit
then this is a broken program - even if it is a legal state. The
thread-group leader cannot be reaped while there are live subthreads and
it also means that any subthread exec "resurrects" the thread-group
leader struct pid. So that's going to make for fun profiling...

> Using plain TIDs (flags=0) would work, but it means users cannot use
> PROCESS_VM_PIDFD in this scenario.
> 
> What do you think this patch should do? I see two options:
>  - Address this limitation in a separate future patch that fixes
>    pidfd_get_task() to use PIDTYPE_PID when PIDFD_THREAD is detected in
>    f_flags, benefiting all callers (process_vm_readv, process_madvise,
>    and any future users).

As long as all users of the interface are fine with operating on
subthreads this should be perfectly fine.
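
For illustration only, the option quoted above (choosing the lookup type
from the pidfd's f_flags) might look roughly like the user-space model
below. This is a sketch, not kernel code: the enum and both helper names
are made up for this example, and only the PIDFD_THREAD value (O_EXCL in
the uapi header) comes from the real interface.

```c
#include <fcntl.h>

/* Same value as include/uapi/linux/pidfd.h; defined here in case the
 * installed headers do not provide it. */
#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL
#endif

/* Hypothetical stand-in for the kernel's enum pid_type. */
enum model_pid_type {
	MODEL_PIDTYPE_PID,	/* resolve to one specific thread */
	MODEL_PIDTYPE_TGID,	/* resolve to the thread-group leader */
};

/* Behaviour described in this thread: the flags are ignored and the
 * lookup is always done by thread-group. */
static enum model_pid_type lookup_type_current(unsigned int f_flags)
{
	(void)f_flags;
	return MODEL_PIDTYPE_TGID;
}

/* Proposed behaviour (assumption): a pidfd opened with PIDFD_THREAD is
 * resolved per-thread, everything else stays thread-group based. */
static enum model_pid_type lookup_type_proposed(unsigned int f_flags)
{
	return (f_flags & PIDFD_THREAD) ? MODEL_PIDTYPE_PID
					: MODEL_PIDTYPE_TGID;
}
```

With this model, a regular pidfd keeps its current meaning under both
helpers, and only PIDFD_THREAD pidfds change which task is looked up.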

