On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
> 
> <task foo calls exit()>
> 
> do_exit
>       exit_mm

If this is the last task of the process, we would expect:

  mm_count == 1
  mm_users == 1

at this point.
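
For reference, both counters start out at 1; a minimal excerpt,
simplified from mm_init() in kernel/fork.c (not the exact upstream
code):

  atomic_set(&mm->mm_users, 1);   /* tasks using this address space */
  atomic_set(&mm->mm_count, 1);   /* structural refs; pinned while
                                   * mm_users > 0 */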

>               mmgrab(mm);                     // foo's mm has count +1
>               BUG_ON(mm != current->active_mm);
>               task_lock(current);
>               current->mm = NULL;
>               task_unlock(current);

So active_mm is basically the last 'real' mm the CPU ran with, and its
purpose is to avoid a switch_mm() when switching between user tasks and
kernel tasks.

A kernel task has !->mm. We keep the borrowed mm alive by incrementing
mm_count when switching from a user task to a kernel task and
decrementing it when switching back from kernel to user (the
mmgrab()/mmdrop() pair visible in the trace below).

What exit_mm() does is change a user task into a 'kernel' task. So it
should increment mm_count to mirror the context switch. I suspect this
is what the mmgrab() in exit_mm() is for.
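
Something like this, heavily simplified (locking details and the
mm_release() bits omitted; not the exact upstream code):

  static void exit_mm(void)
  {
          struct mm_struct *mm = current->mm;

          mmgrab(mm);             /* mirror the user -> kernel switch */
          BUG_ON(mm != current->active_mm);
          task_lock(current);
          current->mm = NULL;     /* we're a 'kernel' task from here on */
          task_unlock(current);
          mmput(mm);              /* drop our mm_users reference */
  }

The mmgrab() stands in for the mm_count increment a real user->kernel
context switch would have done; the matching decrement happens in
finish_task_switch() once we switch away and observe !prev->mm.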

> <irq and ctxsw to kthread>
> 
> context_switch(prev=foo, next=kthread)
>       mm = next->mm;
>       oldmm = prev->active_mm;
> 
>       if (!mm) {                              // True for kthread
>               next->active_mm = oldmm;
>               mmgrab(oldmm);                  // foo's mm has count +2
>       }
> 
>       if (!prev->mm) {                        // True for foo
>               rq->prev_mm = oldmm;
>       }
> 
>       finish_task_switch
>               mm = rq->prev_mm;
>               if (mm) {                       // True (foo's mm)
>                       mmdrop(mm);             // foo's mm has count +1
>               }
> 
>       [...]
> 
> <ctxsw to task bar>
> 
> context_switch(prev=kthread, next=bar)
>       mm = next->mm;
>       oldmm = prev->active_mm;                // foo's mm!
> 
>       if (!prev->mm) {                        // True for kthread
>               rq->prev_mm = oldmm;
>       }
> 
>       finish_task_switch
>               mm = rq->prev_mm;
>               if (mm) {                       // True (foo's mm)
>                       mmdrop(mm);             // foo's mm has count +0

The context switch into the next user task then does that decrement
(the mmdrop() above). At this point foo no longer has a reference to
its mm, except for the on-stack pointer in exit_mm().

>               }
> 
>       [...]
> 
> <ctxsw back to task foo>
> 
> context_switch(prev=bar, next=foo)
>       mm = next->mm;
>       oldmm = prev->active_mm;
> 
>       if (!mm) {                              // True for foo
>               next->active_mm = oldmm;        // This is bar's mm
>               mmgrab(oldmm);                  // bar's mm has count +1
>       }
> 
> 
>       [return back to exit_mm]

Enter mm_users, which counts the number of tasks associated with the
mm. It starts at 1 in mm_init(), and when it drops to 0, we decrement
mm_count. Since mm_count also starts at 1, this all appears consistent.

  mmput() // --mm_users == 0, which then results in:

> mmdrop(mm);                                   // foo's mm has count -1

In the above case, that's the very last reference to the mm, and since
we started out with mm_count == 1, this -1 makes 0 and we do the actual
free.
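
Roughly (simplified from mmput()/__mmput() in kernel/fork.c and
mmdrop(); not the exact upstream code):

  void mmput(struct mm_struct *mm)
  {
          if (atomic_dec_and_test(&mm->mm_users)) {
                  exit_mmap(mm);  /* tear down the address space */
                  mmdrop(mm);     /* drop mm_init()'s mm_count ref */
          }
  }

  static inline void mmdrop(struct mm_struct *mm)
  {
          if (atomic_dec_and_test(&mm->mm_count))
                  __mmdrop(mm);   /* actually free the mm_struct */
  }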

> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log.

I'm not sure I see anything premature. At this point mm_users == 0,
mm_count == 0, we've freed the mm, there is no further use of the
on-stack mm pointer, and foo no longer has a pointer to it in either
->mm or ->active_mm. It's well and properly dead.

> A subsequent context-switch away from foo would therefore result in a
> use-after-free.

At the above point, foo no longer has a reference to mm: we cleared
->mm early in exit_mm(), and the subsequent context switches dropped
the ->active_mm reference. The switch back into foo then results in
foo->active_mm == bar->mm, which is fine.
