On Mon, Sep 09, 2019 at 07:22:15AM -0500, Eric W. Biederman wrote:

> Let me see if I can explain my confusion in terms of task_numa_compare.
>
> The function task_numa_compare now does:
>
>     rcu_read_lock();
>     cur = rcu_dereference(dst_rq->curr);
>
> Then it proceeds to examine a few fields of cur:
>
>     cur->flags;
>     cur->cpus_ptr;
>     cur->numa_group;
>     cur->numa_faults[];
>     cur->total_numa_faults;
>     cur->se.cfs_rq;
>
> My concern here is the ordering between setting rq->curr and setting
> those other fields. If we read values of those fields that were not
> current when we set cur, then I think task_numa_compare would give an
> incorrect answer.
That code is explicitly without serialization. One memory barrier isn't
going to make any difference whatsoever. Specifically, those numbers will
change _after_ we make it current, not before.

> From what I can see pick_next_task_fair does not mess with any of
> the fields that task_numa_compare uses. So the existing memory barrier
> should be more than sufficient for that case.

There should not be, but even if there is, load-balancing in general
(very much including the NUMA balancing) is completely unserialized and
prone to reading stale data at all times. This is on purpose; it
minimizes the interference of load-balancing, and if we make a 'wrong'
choice, the next pass can fix it up.

> So I think I just need to write up a patch 4/3 that replaces my
> "rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
> next)". And includes a great big comment to that effect.
>
> Now maybe I am wrong. Maybe we can afford to see data that was changed
> before the task became rq->curr in task_numa_compare. But I don't think
> so. I really think it is that full memory barrier just a bit earlier
> in __schedule that is keeping us safe.
>
> Am I wrong?

Not insofar as we now both seem to agree on using RCU_INIT_POINTER();
but we're still disagreeing on why.

My argument is that since we can already obtain any task_struct pointer
through find_task_by_vpid() and friends, any access to its individual
members must already have serialization rules (or explicitly none, as
for load-balancing). I'm viewing this as just another variant of
find_task_by_vpid(), except we index by CPU instead of PID.

There can be changes to the task_struct between two invocations of
find_task_by_vpid(); any user will have to somehow deal with that. This
is no different.

Take for instance p->cpus_ptr; it has very specific serialization rules
(which that NUMA balancer thing blatantly ignores), as do a lot of
other task_struct fields.
The memory barrier in question isn't going to help with any of that.
Specifically, if we care about the consistency of the numbers changed by
pick_next_task(), we must acquire rq->lock (and of course, once we do
that, we can immediately use rq->curr without further RCU use).