On Wed, Aug 10, 2022 at 07:08:39PM +0300, Alexander Mikhalitsyn wrote: > From: "Denis V. Lunev" <d...@openvz.org> > > Normal processing of ARP request (usually this is Ethernet broadcast > packet) coming to the host is looking like the following: > * the packet comes to arp_process() call and is passed through routing > procedure > * the request is put into the queue using pneigh_enqueue() if > corresponding ARP record is not local (common case for container > records on the host) > * the request is processed by timer (within 80 jiffies by default) and > ARP reply is sent from the same arp_process() using > NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside > pneigh_enqueue()) > > And here the problem comes. Linux kernel calls pneigh_queue_purge() > which destroys the whole queue of ARP requests on ANY network interface > start/stop event through __neigh_ifdown(). > > This is actually not a problem within the original world as network > interface start/stop was accessible to the host 'root' only, which > could do more destructive things. But the world is changed and there > are Linux containers available. Here container 'root' has an access > to this API and could be considered as untrusted user in the hosting > (container's) world. > > Thus there is an attack vector to other containers on node when > container's root will endlessly start/stop interfaces. We have observed > similar situation on a real production node when docker container was > doing such activity and thus other containers on the node become not > accessible. > > The patch proposed doing very simple thing. It drops only packets from > the same namespace in the pneigh_queue_purge() where network interface > state change is detected. This is enough to prevent the problem for the > whole node preserving original semantics of the code.
This is how I'd do it as well. > > v2: > - do del_timer_sync() if queue is empty after pneigh_queue_purge() > > Cc: "David S. Miller" <da...@davemloft.net> > Cc: Eric Dumazet <eduma...@google.com> > Cc: Jakub Kicinski <k...@kernel.org> > Cc: Paolo Abeni <pab...@redhat.com> > Cc: Daniel Borkmann <dan...@iogearbox.net> > Cc: David Ahern <dsah...@kernel.org> > Cc: Yajun Deng <yajun.d...@linux.dev> > Cc: Roopa Prabhu <ro...@nvidia.com> > Cc: Christian Brauner <brau...@kernel.org> > Cc: net...@vger.kernel.org > Cc: linux-ker...@vger.kernel.org > Cc: Alexey Kuznetsov <kuz...@ms2.inr.ac.ru> > Cc: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com> > Cc: Konstantin Khorenko <khore...@virtuozzo.com> > Cc: ker...@openvz.org > Cc: devel@openvz.org > Investigated-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com> > Signed-off-by: Denis V. Lunev <d...@openvz.org> > --- > net/core/neighbour.c | 25 +++++++++++++++++-------- > 1 file changed, 17 insertions(+), 8 deletions(-) > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index 54625287ee5b..19d99d1eff53 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n) > return 0; > } > > -static void pneigh_queue_purge(struct sk_buff_head *list) > +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net) > { > + unsigned long flags; > struct sk_buff *skb; > > - while ((skb = skb_dequeue(list)) != NULL) { > - dev_put(skb->dev); > - kfree_skb(skb); > + spin_lock_irqsave(&list->lock, flags); I'm a bit surprised to see a spinlock held around a while loop walking a linked list but that seems to be quite common in this file. I take it the lists are guaranteed to be short. > + skb = skb_peek(list); > + while (skb != NULL) { > + struct sk_buff *skb_next = skb_peek_next(skb, list); > + if (net == NULL || net_eq(dev_net(skb->dev), net)) { > + __skb_unlink(skb, list); > + dev_put(skb->dev); > + kfree_skb(skb); > + } > + skb = skb_next; > } > + spin_unlock_irqrestore(&list->lock, flags); > } > > static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev, > @@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct > net_device *dev, > write_lock_bh(&tbl->lock); > neigh_flush_dev(tbl, dev, skip_perm); > pneigh_ifdown_and_unlock(tbl, dev); > - > - del_timer_sync(&tbl->proxy_timer); > - pneigh_queue_purge(&tbl->proxy_queue); > + pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev)); > + if (skb_queue_empty_lockless(&tbl->proxy_queue)) > + del_timer_sync(&tbl->proxy_timer); > return 0; > } > > @@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table > *tbl) > cancel_delayed_work_sync(&tbl->managed_work); > cancel_delayed_work_sync(&tbl->gc_work); > del_timer_sync(&tbl->proxy_timer); > - pneigh_queue_purge(&tbl->proxy_queue); > + pneigh_queue_purge(&tbl->proxy_queue, NULL); > neigh_ifdown(tbl, NULL); > if (atomic_read(&tbl->entries)) > pr_crit("neighbour leakage\n"); > -- > 2.36.1 > _______________________________________________ Devel mailing list Devel@openvz.org https://lists.openvz.org/mailman/listinfo/devel