Hi Cong, sorry for the late reply. Generally the patch looks good to me, just a few comments inline:
On 04/17/2017 08:30 PM, Cong Wang wrote:
Roi reported we could have a race condition where in ->classify() path we dereference tp->root and meanwhile a parallel ->destroy() makes it a NULL.
Correct.
This is possible because ->destroy() could be called when deleting a filter to check if we are the last one in tp, this tp is still linked and visible at that time. Daniel fixed this in commit d936377414fa ("net, sched: respect rcu grace period on cls destruction"), but the root cause of this problem is the semantic of ->destroy(), it does two things (for non-force case): 1) check if tp is empty 2) if tp is empty we could really destroy it and its caller, if cares, needs to check its return value to see if it is really destroyed. Therefore we can't unlink tp unless we know it is empty. As suggested by Daniel, we could actually move the test logic to ->delete() so that we can safely unlink tp after ->delete() tells us the last one is just deleted and before ->destroy(). What's more, even we unlink it before ->destroy(), it could still have readers since we don't wait for a grace period here, we should not modify tp->root in ->destroy() either.
Here seems to be a bit of a mixup in this analysis, imo. The issue that Roi reported back then was exactly the one that d936377414fa ("net, sched: respect rcu grace period on cls destruction") fixed, which affected flower and other classifiers: Roi reported a crash in flower where tp->root was NULL in ->classify() callbacks. Reason is that in ->destroy() tp->root is set to NULL via RCU_INIT_POINTER(). It's problematic for some of the classifiers, because this doesn't respect RCU grace period for them, and as a result, still outstanding readers from tc_classify() will try to blindly dereference a NULL tp->root. The ->delete() callback was never used by Roi back then, he said that he just removed the ingress qdisc in his test, which implicitly purges all cls attached to it via tcf_destroy_chain(). So the above description with regards to the "root cause" of Roi's reported issue is not correct. The issue that this patch fixes is an _independent_ race that we found while auditing the code when looking into Roi's report back then. It fixes commit 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone"), which added the RCU_INIT_POINTER() after the tcf_destroy() in RTM_DELTFILTER case. That part of the description looks good, where you describe that "[...] we need to move the test logic to ->delete(), so that we can safely unlink tp after ->delete() tells us the last one is just deleted and before ->destroy()." Please also add Fixes tag, so it can be better tracked for backports. Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Reported-by: Roi Dayan <r...@mellanox.com> Cc: Daniel Borkmann <dan...@iogearbox.net> Cc: John Fastabend <john.fastab...@gmail.com> Cc: Jamal Hadi Salim <j...@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
[...]
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c index 9962090..d388536 100644 --- a/net/sched/cls_fw.c +++ b/net/sched/cls_fw.c
[...]
@@ -150,17 +144,17 @@ static bool fw_destroy(struct tcf_proto *tp, bool force) call_rcu(&f->rcu, fw_delete_filter); } } - RCU_INIT_POINTER(tp->root, NULL); kfree_rcu(head, rcu); - return true; }
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c index a371075..8e2baf8 100644 --- a/net/sched/cls_route.c +++ b/net/sched/cls_route.c @@ -312,12 +305,10 @@ static bool route4_destroy(struct tcf_proto *tp, bool force) kfree_rcu(b, rcu); } } - RCU_INIT_POINTER(tp->root, NULL); kfree_rcu(head, rcu); - return true; }
diff --git a/net/sched/cls_rsvp.h b/net/sched/cls_rsvp.h index d7f2923..9e3748b 100644 --- a/net/sched/cls_rsvp.h +++ b/net/sched/cls_rsvp.h @@ -302,22 +302,13 @@ static void rsvp_delete_filter(struct tcf_proto *tp, struct rsvp_filter *f)
[...]
- } - - RCU_INIT_POINTER(tp->root, NULL); + return; for (h1 = 0; h1 < 256; h1++) { struct rsvp_session *s; @@ -337,10 +328,9 @@ static bool rsvp_destroy(struct tcf_proto *tp, bool force)
[...] The above three RCU_INIT_POINTER(tp->root, NULL) are independent of the fix and actually do no harm right now. I described that in d936377414fa ("net, sched: respect rcu grace period on cls destruction") as well, meaning that they each handle tp->root being NULL in ->classify() path (for historic reasons), so this is handled gracefully, readers use rcu_dereference_bh(tp->root) and test for this being NULL. But I agree that this could be cleaned up along with the check in the ->classify() callbacks for these three (not sure if really worth it, though). However, such cleanup should be a separate patch and not included in this fix. Other than the mentioned things, this looks good to me. Thanks, Daniel