On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
> Instead, we've come up with a more plausible sequence that can in theory
> happen on a single CPU:
>
> <task foo calls exit()>
>
> do_exit
>   exit_mm
If this is the last task of the process, we would expect:

	mm_count == 1
	mm_users == 1

at this point.

>     mmgrab(mm);                         // foo's mm has count +1
>     BUG_ON(mm != current->active_mm);
>     task_lock(current);
>     current->mm = NULL;
>     task_unlock(current);

So the whole ->active_mm thing basically tracks the last 'real' mm; its
purpose is to avoid switch_mm() between user tasks and kernel tasks. A
kernel task has !->mm.

We do this by incrementing mm_count when switching from a user task to a
kernel task, and decrementing it when switching from a kernel task back
to a user task.

What exit_mm() does is change a user task into a 'kernel' task, so it
should increment mm_count to mirror the context switch. I suspect this is
what the mmgrab() in exit_mm() is for.

> <irq and ctxsw to kthread>
>
> context_switch(prev=foo, next=kthread)
>   mm = next->mm;
>   oldmm = prev->active_mm;
>
>   if (!mm) {                            // True for kthread
>     next->active_mm = oldmm;
>     mmgrab(oldmm);                      // foo's mm has count +2
>   }
>
>   if (!prev->mm) {                      // True for foo
>     rq->prev_mm = oldmm;
>   }
>
>   finish_task_switch
>     mm = rq->prev_mm;
>     if (mm) {                           // True (foo's mm)
>       mmdrop(mm);                       // foo's mm has count +1
>     }
>
> [...]
>
> <ctxsw to task bar>
>
> context_switch(prev=kthread, next=bar)
>   mm = next->mm;
>   oldmm = prev->active_mm;              // foo's mm!
>
>   if (!prev->mm) {                      // True for kthread
>     rq->prev_mm = oldmm;
>   }
>
>   finish_task_switch
>     mm = rq->prev_mm;
>     if (mm) {                           // True (foo's mm)
>       mmdrop(mm);                       // foo's mm has count +0

This is the decrement for the kernel -> user transition: the context
switch into the next user task drops the reference again. At this point
foo no longer has a reference to its mm, except for the pointer on the
stack in exit_mm().

>     }
>
> [...]
>
> <ctxsw back to task foo>
>
> context_switch(prev=bar, next=foo)
>   mm = next->mm;
>   oldmm = prev->active_mm;
>
>   if (!mm) {                            // True for foo
>     next->active_mm = oldmm;            // This is bar's mm
>     mmgrab(oldmm);                      // bar's mm has count +1
>   }
>
>
> [return back to exit_mm]

Enter mm_users; this counts the number of tasks associated with the mm.
We start with 1 in mm_init(), and when it drops to 0 we decrement
mm_count. Since we also start with mm_count == 1, this would appear
consistent.

exit_mm() then does:

	mmput()                           // --mm_users == 0, which then results in:

>     mmdrop(mm);                         // foo's mm has count -1

In the above case, that's the very last reference to the mm, and since we
started out with mm_count == 1, this -1 makes 0 and we do the actual free.

> At this point, we've got an imbalanced count on the mm and could free it
> prematurely as seen in the KASAN log.

I'm not sure I see the premature part. At this point mm_users == 0 and
mm_count == 0, we freed the mm, there is no further use of the on-stack
mm pointer, and foo no longer has a pointer to it in either ->mm or
->active_mm. It's well and proper dead.

> A subsequent context-switch away from foo would therefore result in a
> use-after-free.

At the above point, foo no longer has a reference to its old mm: we
cleared ->mm early in exit_mm(), and the reference held through
->active_mm was dropped when we context-switched through the kthread to
bar. The switch back into foo then results in foo->active_mm == bar->mm,
which is fine.
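For what it's worth, the arithmetic above can be replayed with a
stand-alone toy model. None of this is the real kernel API; the toy_*
names below are simplified stand-ins for mm_struct, mmgrab(), mmdrop()
and mmput(), and the sequence in main() just mirrors the grab/drop order
from the trace:

	/* Toy model of the mm_count/mm_users accounting above. */
	#include <assert.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct toy_mm {
		int mm_count;	/* lazy/kernel refs; also held once for the mm_users set */
		int mm_users;	/* tasks using the address space */
	};

	static struct toy_mm *toy_mm_init(void)
	{
		struct toy_mm *mm = malloc(sizeof(*mm));

		mm->mm_count = 1;	/* as in mm_init(): one ref for the mm_users set */
		mm->mm_users = 1;
		return mm;
	}

	static void toy_mmgrab(struct toy_mm *mm)
	{
		mm->mm_count++;
	}

	static void toy_mmdrop(struct toy_mm *mm)
	{
		if (--mm->mm_count == 0) {
			printf("last mm_count ref gone, freeing mm\n");
			free(mm);
		}
	}

	static void toy_mmput(struct toy_mm *mm)
	{
		if (--mm->mm_users == 0)
			toy_mmdrop(mm);	/* the mm_users set drops its mm_count ref */
	}

	int main(void)
	{
		struct toy_mm *mm = toy_mm_init();	/* foo's mm: count == 1, users == 1 */

		/* exit_mm(): foo turns into a 'kernel' task, take the lazy ref */
		toy_mmgrab(mm);
		assert(mm->mm_count == 2);		/* Will's "+1" on top of the initial 1 */

		/* ctxsw foo -> kthread: kthread borrows the mm as its active_mm */
		toy_mmgrab(mm);
		assert(mm->mm_count == 3);		/* "+2" */
		/* finish_task_switch(): prev (foo) has !->mm, drop via rq->prev_mm */
		toy_mmdrop(mm);
		assert(mm->mm_count == 2);		/* "+1" */

		/* ctxsw kthread -> bar: prev (kthread) has !->mm, drop its borrowed ref */
		toy_mmdrop(mm);
		assert(mm->mm_count == 1);		/* "+0" */

		/* back in exit_mm(): mmput() -> --mm_users == 0 -> mmdrop() -> free */
		toy_mmput(mm);

		return 0;
	}

The assert()s are Will's count annotations offset by the initial
mm_count == 1, and the single free() fires at the final mmput(), which is
where, per the analysis above, the real mm gets freed as well: balanced,
not premature.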