On 16-11-22 12:41 PM, Daniel Borkmann wrote: > On 11/22/2016 08:28 PM, Cong Wang wrote: >> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <j...@resnulli.us> wrote: >>> Tue, Nov 22, 2016 at 05:04:11PM CET, dan...@iogearbox.net wrote: >>>> Hmm, I don't think we want to have such an additional test in fast >>>> path for each and every classifier. Can we think of ways to avoid that? >>>> >>>> My question is, since we unlink individual instances from such >>>> tp-internal >>>> lists through RCU and release the instance through call_rcu() as >>>> well as >>>> the head (tp->root) via kfree_rcu() eventually, against what are we >>>> protecting >>>> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? >>>> Something >>>> not respecting grace period? >>> >>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root >>> to null. > > But that's not really an answer to my question. ;) > >> We do need to respect the grace period if we touch the globally visible >> data structure tp in tcf_destroy(). Therefore Roi's patch is not >> fixing the >> right place. > > I think there may be multiple issues actually. > > At the time we go into tc_classify(), from ingress as well as egress side, > we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we > everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where > we use kfree_rcu() on tp, although we iterate tps (and implicitly inner > filters) > via rcu_dereference_bh() from reader side. Is there a reason why we don't > use call_rcu_bh() variant on destruction for all this instead?
I can't think of any if its all under _bh we can convert the call_rcu to call_rcu_bh it just needs an audit. > > Just looking at cls_bpf and others, what protects > RCU_INIT_POINTER(tp->root, > NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in > tcf_destroy() cases. Still active readers under RCU BH can race against > this > (tp->root being NULL), as the commit identified. Only the get() callback > checks > for head against NULL, but both are serialized under rtnl, and the only > place > we call this is tc_ctl_tfilter(). Even if we create a new tp, head > should not > be NULL there, if it was assigned during the init() cb, but contains an > empty > list. (It's different for things like cls_cgroup, though.) So, I'm > wondering > if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead > (unless I'm > missing something obvious)? Just took a look at this I think there are a couple possible solutions. The easiest is likely to fix all the call sites so that 'tp' is unlinked before calling the destroy() handlers AND not doing the NULL set. I only see one such call site where destroy is called before unlinking at the moment. This should enforce that after a grace period there is no path to reach the classifiers because 'tp' is unlinked. Calling destroy before unlinking 'tp' however could cause a small race between grace period of 'tp' and grace period of the filter. Another would be to only call the destroy path from the call_rcu path of the 'tp' object so that destroy is only ever called after the object is guaranteed to be unlinked from the tc_filter path. I think both solutions would be fine. Cong were you working on one of these? Or do you have another idea? > >> Also I don't know why you blame my commit, this problem should already >> exist prior to my commit, probably date back to John's RCU patches. > > It seems so.