On Tue, Nov 22, 2016 at 3:36 PM, John Fastabend <john.fastab...@gmail.com> wrote: > On 16-11-22 12:41 PM, Daniel Borkmann wrote: >> On 11/22/2016 08:28 PM, Cong Wang wrote: >>> On Tue, Nov 22, 2016 at 8:11 AM, Jiri Pirko <j...@resnulli.us> wrote: >>>> Tue, Nov 22, 2016 at 05:04:11PM CET, dan...@iogearbox.net wrote: >>>>> Hmm, I don't think we want to have such an additional test in fast >>>>> path for each and every classifier. Can we think of ways to avoid that? >>>>> >>>>> My question is, since we unlink individual instances from such >>>>> tp-internal >>>>> lists through RCU and release the instance through call_rcu() as >>>>> well as >>>>> the head (tp->root) via kfree_rcu() eventually, against what are we >>>>> protecting >>>>> setting RCU_INIT_POINTER(tp->root, NULL) in ->destroy() callback? >>>>> Something >>>>> not respecting grace period? >>>> >>>> If you call tp->ops->destroy in call_rcu, you don't have to set tp->root >>>> to null. >> >> But that's not really an answer to my question. ;) >> >>> We do need to respect the grace period if we touch the globally visible >>> data structure tp in tcf_destroy(). Therefore Roi's patch is not >>> fixing the >>> right place. >> >> I think there may be multiple issues actually. >> >> At the time we go into tc_classify(), from ingress as well as egress side, >> we're under RCU, but BH variant. In cls delete()/destroy() callbacks, we >> everywhere use call_rcu() and kfree_rcu(), same as for tcf_destroy() where >> we use kfree_rcu() on tp, although we iterate tps (and implicitly inner >> filters) >> via rcu_dereference_bh() from reader side. Is there a reason why we don't >> use call_rcu_bh() variant on destruction for all this instead? > > I can't think of any if its all under _bh we can convert the call_rcu to > call_rcu_bh it just needs an audit. > >> >> Just looking at cls_bpf and others, what protects >> RCU_INIT_POINTER(tp->root, >> NULL) against? The tp is unlinked in tc_ctl_tfilter() from the tp chain in >> tcf_destroy() cases. Still active readers under RCU BH can race against >> this >> (tp->root being NULL), as the commit identified. Only the get() callback >> checks >> for head against NULL, but both are serialized under rtnl, and the only >> place >> we call this is tc_ctl_tfilter(). Even if we create a new tp, head >> should not >> be NULL there, if it was assigned during the init() cb, but contains an >> empty >> list. (It's different for things like cls_cgroup, though.) So, I'm >> wondering >> if the RCU_INIT_POINTER(tp->root, NULL) can just be removed instead >> (unless I'm >> missing something obvious)? > > > Just took a look at this I think there are a couple possible solutions. > The easiest is likely to fix all the call sites so that 'tp' is unlinked > before calling the destroy() handlers AND not doing the NULL set. I only > see one such call site where destroy is called before unlinking at the > moment. This should enforce that after a grace period there is no path > to reach the classifiers because 'tp' is unlinked. Calling destroy > before unlinking 'tp' however could cause a small race between grace > period of 'tp' and grace period of the filter. > > Another would be to only call the destroy path from the call_rcu path > of the 'tp' object so that destroy is only ever called after the object > is guaranteed to be unlinked from the tc_filter path. > > I think both solutions would be fine. > > Cong were you working on one of these? Or do you have another idea?
Yeah, this is basic what I think as well, however, both are hard. On one hand, we can't detach the tp from the global singly-linked list before tcf_destroy() since we rely on its return value to make this decision. On the other hand, it is a singly-linked list, we have to pass in the address of its previous pointer to rcu callback to remove it, it seems racy as well since we modify a previous pointer which is still visible globally... Hmm, perhaps we really have to switch to a doubly-linked list, that is list_head. I need to double check. And also the semantic of ->destroy() needs to revise too. So yeah, my commit should be blamed. :-/