On Fri, Jun 15, 2018 at 08:23:14AM -0700, John Fastabend wrote:
> On 06/14/2018 10:41 PM, Martin KaFai Lau wrote:
> > On Thu, Jun 14, 2018 at 09:44:57AM -0700, John Fastabend wrote:
> >> First, in tcp_close, reduce the scope of the sk_callback_lock: it is
> >> only needed to protect smap_release_sock(); the ingress and cork
> >> lists are protected by the sock lock. Holding the lock over a wider
> >> scope is harmless, but may confuse the reader into inferring it is
> >> in fact needed.
> >>
> >> Next, in sock_hash_delete_elem() the pattern is as follows,
> >>
> >>   sock_hash_delete_elem()
> >>      [...]
> >>      spin_lock(bucket_lock)
> >>      l = lookup_elem_raw()
> >>      if (l)
> >>         hlist_del_rcu()
> >>         write_lock(sk_callback_lock)
> >>          .... destroy psock ...
> >>         write_unlock(sk_callback_lock)
> >>      spin_unlock(bucket_lock)
> >>
> >> The ordering is necessary because we only know the {p}sock after
> >> dereferencing the hash table, which we can't do unless we hold the
> >> bucket lock. Once we have the bucket lock and the psock element, the
> >> element is deleted from the hashmap to ensure any other path doing a
> >> lookup will fail. Finally, the refcnt is decremented and, if zero,
> >> the psock is destroyed.
> >>
> >> In parallel with the above (or with freeing the map), a tcp close
> >> event may trigger tcp_close(), which at the moment omits the bucket
> >> lock altogether (oops!); its flow looks like this,
> >>
> >>   bpf_tcp_close()
> >>      [...]
> >>      write_lock(sk_callback_lock)
> >>      for each psock->maps // list of maps this sock is part of
> >>          hlist_del_rcu(ref_hash_node);
> >>      .... destroy psock ...
> >>      write_unlock(sk_callback_lock)
> >>
> >> Obviously, and as demonstrated by syzbot, this is broken because we
> >> can have multiple threads deleting entries via hlist_del_rcu().
> >>
> >> To fix this we might be tempted to wrap the hlist operation in a
> >> bucket lock, but that would create a lock inversion problem. In
> >> summary: to follow the locking rules, the maps list needs the
> >> sk_callback_lock, but we need the bucket lock to do the
> >> hlist_del_rcu. To resolve the lock inversion, note that when
> >> bpf_tcp_close is called no updates can happen in parallel, due to
> >> the ESTABLISHED state check in the update logic, so pop the head of
> >> the list repeatedly and remove the reference until no more are
> >> left. If a delete happens in parallel from the BPF API, that is OK
> >> as well, because it will do a similar action: look up the sock in
> >> the map/hash, delete it from the map/hash, and dec the refcnt. We
> >> check for this case before doing a destroy on the psock to ensure
> >> we don't have two threads tearing down a psock. The new logic is as
> >> follows,
> >>
> >>   bpf_tcp_close()
> >>      e = psock_map_pop(psock->maps) // done with sk_callback_lock
> >>      bucket_lock() // lock hash list bucket
> >>      l = lookup_elem_raw(head, hash, key, key_size);
> >>      if (l) {
> >>         // only get here if element was not already removed
> >>         hlist_del_rcu()
> >>         ... destroy psock...
> >>      }
> >>      bucket_unlock()
> >>
> >> And finally, for all the above to work, add the missing
> >> sk_callback_lock around smap_list_remove in
> >> sock_hash_ctx_update_elem(). Otherwise delete and update may
> >> corrupt the maps list.
> >>
> >> (As an aside, the sk_callback_lock serves two purposes. The first
> >> is to update the sock callbacks: sk_data_ready, sk_write_space,
> >> etc. The second is to protect the psock 'maps' list, which is used
> >> (as shown above) to delete all map/hash references to a sock when
> >> the sock is closed.)
> >>
> >> (If we did not have the ESTABLISHED state guarantee from tcp_close,
> >> then we could not ensure completion, because updates could happen
> >> forever and pin the thread in the delete loop.)
> >>
> >> Reported-by: syzbot+0ce137753c78f7b6a...@syzkaller.appspotmail.com
> >> Fixes: 81110384441a ("bpf: sockmap, add hash map support")
> >> Signed-off-by: John Fastabend <john.fastab...@gmail.com>
> >> ---
> >>  0 files changed
> >>
>
> ^^^^ Will fix this "0 files changed" as well.
>
> >>  	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
> >> @@ -2114,10 +2169,13 @@ static void sock_hash_free(struct bpf_map *map)
> >>  	 */
> >>  	rcu_read_lock();
> >>  	for (i = 0; i < htab->n_buckets; i++) {
> >> -		struct hlist_head *head = select_bucket(htab, i);
> >> +		struct bucket *b = __select_bucket(htab, i);
> >> +		struct hlist_head *head;
> >>  		struct hlist_node *n;
> >>  		struct htab_elem *l;
> >>
> >> +		raw_spin_lock_bh(&b->lock);
> > There is a synchronize_rcu() at the beginning of sock_hash_free().
> > Taking the bucket's lock of the free-ing map at this point is a bit
> > suspicious. What may still access the map after synchronize_rcu()?
>
> tcp_close() may be called while the map is being freed. The sync_rcu
> will only sync the BPF side.
Thanks for the explanation.
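To make the new close logic concrete, here is a minimal sketch of the pop
helper it relies on. This is hand-written from the description above, not
quoted from the patch; the psock_map_pop() name and the
smap_psock_map_entry layout are assumptions:

	/* Pop one map reference off the psock's list. Done under
	 * sk_callback_lock so list edits cannot race with a parallel
	 * update or delete. Returns NULL once the list is empty, which
	 * bounds the close loop.
	 */
	static struct smap_psock_map_entry *psock_map_pop(struct sock *sk,
							  struct smap_psock *psock)
	{
		struct smap_psock_map_entry *e = NULL;

		write_lock_bh(&sk->sk_callback_lock);
		if (!list_empty(&psock->maps)) {
			e = list_first_entry(&psock->maps,
					     struct smap_psock_map_entry, list);
			list_del(&e->list);
		}
		write_unlock_bh(&sk->sk_callback_lock);
		return e;
	}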
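And a sketch of the hash-side close path built on top of it, mirroring the
pseudocode in the commit message. The entry fields (hash_link, htab) and
the free_htab_elem() helper are likewise assumed names:

	static void bpf_tcp_close_hash(struct sock *sk, struct smap_psock *psock)
	{
		struct smap_psock_map_entry *e;

		/* Caller holds rcu_read_lock(). No updates can race
		 * because tcp_close() already moved the sock out of
		 * ESTABLISHED, so this loop terminates.
		 */
		while ((e = psock_map_pop(sk, psock)) != NULL) {
			struct htab_elem *link = rcu_dereference(e->hash_link);
			struct bpf_htab *htab = rcu_dereference(e->htab);
			struct bucket *b = __select_bucket(htab, link->hash);
			struct htab_elem *l;

			raw_spin_lock_bh(&b->lock);
			l = lookup_elem_raw(&b->head, link->hash, link->key,
					    htab->map.key_size);
			/* NULL here means a parallel delete already
			 * unlinked the element and will drop the refcnt
			 * itself, so only one thread tears down the psock.
			 */
			if (l) {
				hlist_del_rcu(&link->hash_node);
				smap_release_sock(psock, link->sk);
				free_htab_elem(htab, link);
			}
			raw_spin_unlock_bh(&b->lock);
			kfree(e);
		}
	}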
hmm.... so bpf_tcp_close(), which runs under rcu_read_lock(), may still have a reference to this hash. Should it wait for another rcu grace period before freeing htab->buckets and the htab itself?
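To make the question concrete, a sketch of sock_hash_free() with a second
grace period before the backing memory is released. This is a hypothetical
ordering, not a patch; detaching each element from psock->maps is elided
for brevity, and free_htab_elem() is an assumed helper name:

	static void sock_hash_free(struct bpf_map *map)
	{
		struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
		int i;

		synchronize_rcu();	/* no new BPF-side lookups after this */
		rcu_read_lock();
		for (i = 0; i < htab->n_buckets; i++) {
			struct bucket *b = __select_bucket(htab, i);
			struct hlist_node *n;
			struct htab_elem *l;

			raw_spin_lock_bh(&b->lock);
			hlist_for_each_entry_safe(l, n, &b->head, hash_node) {
				struct sock *sock = l->sk;
				struct smap_psock *psock;

				hlist_del_rcu(&l->hash_node);
				write_lock_bh(&sock->sk_callback_lock);
				psock = smap_psock_sk(sock);
				if (psock) /* NULL if close tore it down first */
					smap_release_sock(psock, sock);
				write_unlock_bh(&sock->sk_callback_lock);
				free_htab_elem(htab, l);
			}
			raw_spin_unlock_bh(&b->lock);
		}
		rcu_read_unlock();
		synchronize_rcu();	/* let in-flight bpf_tcp_close() finish */
		bpf_map_area_free(htab->buckets);
		kfree(htab);
	}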