On 10/25, Tetsuo Handa wrote:
>
> On 2018/10/25 21:17, Oleg Nesterov wrote:
> >>> And yes, task_is_descendant() can hit the dead child, if nothing
> >>> else it can be killed. This can explain the kasan report.
> >>
> >> The kasan is reporting that child->real_parent (or maybe
> >> child->real_parent->real_parent or
> >> child->real_parent->real_parent->real_parent ...) was pointing to
> >> already freed memory, isn't it?
> >
> > Yes. and you know, I am all confused. I no longer can understand you :/
>
> Why don't we need to check every time like shown below?
> Why checking only once is sufficient?
Why do you think it is not sufficient? Again, I can easily be wrong, rcu is
not simple, but so far I think we need a single check at the start.

> --- a/security/yama/yama_lsm.c
> +++ b/security/yama/yama_lsm.c
> @@ -285,7 +285,7 @@ static int task_is_descendant(struct task_struct *parent,
>  	rcu_read_lock();
>  	if (!thread_group_leader(parent))
>  		parent = rcu_dereference(parent->group_leader);
> -	while (walker->pid > 0) {
> +	while (pid_alive(walker) && walker->pid > 0) {

OK. To simplify, let's suppose that task_is_descendant() is called with
tasklist lock held. And let's suppose that all tasks are single-threaded.

Then we obviously need only a single check at the start: we need to ensure
that the child was not removed from its ->real_parent->children list. As
long as it is still on that list, if ->real_parent exits the child will be
re-parented and its ->real_parent will be updated.

So we could do

	read_lock(tasklist);
	if (list_empty(&child->sibling))
		// it is dead, removed from ->children list, we can't trust
		// child->real_parent
		return -EWHATEVER;

	task_is_descendant(current, child);

But note that we can safely use pid_alive(child) instead: detach_pid() and
list_del_init(&p->sibling) happen "at the same time" since we hold tasklist.
(And btw, I suggested several times to rename it, or to add another helper
with a better name. Note also that we could check, say, ->sighand != NULL
with the same effect.)

Now. Why do you think the lockless rcu_read_lock() case differs, so that we
need to check pid_alive() at every step?

Suppose that one of the grandparents exits and it is going to be freed.
Again, to (over)simplify things, let's suppose that release_task() does

	synchronize_rcu();
	free_task(p);

at the end.

Now, can

	rcu_read_lock();
	if (pid_alive(child)) {
		while (child->pid)
			child = child->real_parent;
	}
	rcu_read_unlock();

hit an already freed ->real_parent? Say, the freed
child->real_parent->real_parent.

Let's denote P1 = child->real_parent and P2 = P1->real_parent.

Can P2 be already freed? This is only possible if the synchronize_rcu()
above was called before our rcu_read_lock(), see the last paragraph below.

If P1->real_parent still points to P2, then P1 has already exited too
(otherwise it would have been re-parented when P2 exited). And if we still
observe child->real_parent == P1, this too is only possible if the child
has already exited and was released, so we must see pid_alive() == F.

Why must we see pid_alive() == F without holding tasklist? In memory it is
certainly false, release_task() is serialized by tasklist_lock; but why
can't we read a stale value under rcu_read_lock()?

Because our rcu read-side critical section extends beyond the return from
that synchronize_rcu(), and thus we must have a full memory barrier
_between_ that synchronize_rcu() and our rcu_read_lock(). We must see all
the memory updates which precede it, including thread_pid = NULL, which
makes pid_alive() == F.

Do you see any hole?

Oleg.
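
P.S. Just to make "a single check at the start" concrete, a completely
untested sketch of what I mean, written against task_is_descendant() in
security/yama/yama_lsm.c as quoted above (the surrounding lines are from
memory of that function, only the pid_alive() check and the out: label
are new):

	static int task_is_descendant(struct task_struct *parent,
				      struct task_struct *child)
	{
		int rc = 0;
		struct task_struct *walker = child;

		if (!parent || !child)
			return 0;

		rcu_read_lock();
		/*
		 * One check before we start walking: if the child has
		 * already been released (detach_pid() was called), its
		 * ->real_parent chain can no longer be trusted.
		 */
		if (!pid_alive(walker))
			goto out;

		if (!thread_group_leader(parent))
			parent = rcu_dereference(parent->group_leader);
		while (walker->pid > 0) {
			if (!thread_group_leader(walker))
				walker = rcu_dereference(walker->group_leader);
			if (walker == parent) {
				rc = 1;
				break;
			}
			walker = rcu_dereference(walker->real_parent);
		}
	out:
		rcu_read_unlock();

		return rc;
	}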