On Tue, Aug 01, 2017 at 05:20:11PM +0800, Yafang Shao wrote: > Hi Peter, > > 2017-08-01 17:12 GMT+08:00 Peter Zijlstra <pet...@infradead.org>: > > On Tue, Aug 01, 2017 at 04:57:43PM +0800, Yafang Shao wrote: > >> > And how would that happen? We only call pick_next_entity(.curr=NULL) > >> > when we _know_ cfs_rq->nr_running. > >> > >> It crashed my machine when I did hadoop test, and after I made this change > >> it works now. > >> On SMP system, cfs_rq->nr_running isn't protected well, although we _know_ > >> cfs_rq->nr_running, > >> but it is modified by other thread running on other CPU and the > >> sched_entity is set NULL as well. > >> Then this thread broken here as accessed the NULL pointer here. > > > > cfs_rq->nr_running should be protected by the rq->lock. If it is not, > > something else is buggered. > > Yes, I admit that something else is buggered, but unfortunately I > haven't understood the > scheduler deeply so can't find the root cause. > > But from my understanding, it is obviously a bug here as we can find > that it may occurs that both 'se' > and 'curr' are NULL. That's why I submit this patch to fix it.
Thing is, it doesn't fix a bug, it hides one. The real bug is nr_running changing while you own the rq->lock. That should _never_ happen.