On 12/10, Eric W. Biederman wrote: > > Oleg Nesterov <o...@redhat.com> writes: > > > On 12/09, EunTaik Lee wrote: > >> > >> There is a use-after-free case with below call stack. > >> > >> pid_nr_ns+0x10/0x38 > >> cgroup_pidlist_start+0x144/0x400 > >> cgroup_seqfile_start+0x1c/0x24 > >> kernfs_seq_start+0x54/0x90 > >> seq_read+0x15c/0x3a8 > >> kernfs_fop_read+0x38/0x160 > >> __vfs_read+0x28/0xc8 > >> vfs_read+0x84/0xfc > > How is this a use after free. The function pid_nr_ns should take a NULL > pointer > as input and return 0?
No, the task (task_struct) itself can't go away, but task->group_leader can point to nowhere. > Certainly if the addtion of pid_alive fixes it pid_vnr(task_tgid(tsk)) > is fine. Are we perhaps missing rcu locking? rcu_read_lock() is not enough in this case, see below. > Or is the problem simply that in task_tgid we are accessing > task->group_leader which may already be dead? Yes. Lets forget about the callchain above, I didn't even bother to verify that it can actually hit the problem. Although I think EunTaik is very right, css_task_iter_next() does get_task_struct() and drops css_set_lock, the task can exit after that. Forget. Just suppose that a task simply does pid = task_tgid_vnr(current); after it has already called exit_notify(). And this is what perf_event_pid() does, perhaps we have more buggy users. In this case current->group_leader or parent/real_parent can point to the exited/freed tasks. I already said this many times, ee really need to nullify them in __unhash_process() but this needs a lot of (mostly simple) cleanups. > If so the fix needs to be > in task_tgid. Yes, task_tgid() should probably return NULL in this case, but this connects to "a lot of cleanups" above. > > --- x/include/linux/pid.h > > +++ x/include/linux/pid.h > > @@ -8,7 +8,8 @@ enum pid_type > > PIDTYPE_PID, > > PIDTYPE_PGID, > > PIDTYPE_SID, > > - PIDTYPE_MAX > > + PIDTYPE_MAX, > > + PIDTYPE_TGID /* do not use */ > > > I would do: > > /* __PIDTYPE_TGID is only valid to __task_pid_nr_ns */ > #define __PIDTYPE_TGID PIDTYPE_MAX > > Prefixing __PIDTYPE_TGID with __ should help make it clear > this is a special use define. OK, will do, thanks. > I am also curious why pid_alive is the proper check to see if > task->group_leader is valid? That feels like it could get us into > trouble later. It is. pid_alive(task) == T means that this task was not removed from rcu-protected-lists and thus task->group_leader (in particular) can't go away. As long as you check pid_alive() and use ->group_leader in the same rcu_read_lock section, of course. To clarify, this is not because detach_pid(PIDTYPE_PID) is called before list_del_rcu(), this doesn't matter. What does matter is that if you see pid_alive() == T (or task->sighand != NULL, or anything else which means that release_task/__unhash_process was not called yet) you know that this task, its leader/parent/real_parent/pids/etc can't go away until rcu_read_unlock(). And this is why __task_pid_nr_ns() checks pid_alive() before it reads ->group_leader. Note that __task_pid_nr_ns(PIDTYPE_PID) does not need this check, task->pids[PID] is nullified by detach_pid() and pid_nr_ns() check pid != NULL. However. I think it should be renamed. Or, better, we should add a new helper to make this all more clear. Say, bool task_is_rcu_safe(task) { return task->sighand != NULL; } then change __task_pid_nr_ns() to use this helper rather than pid_alive(). Because, once again, it is not that we need to ensure that pids[PIDTYPE_PID].pid != NULL, we need to ensure that rcu_read_lock() can actually protect this task and its leader/parent/etc. And of course, it can have more users. Oleg.