On 05.04.2018 17:07, Christian Brauner wrote: > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote: >> On 04.04.2018 22:48, Christian Brauner wrote: >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network >>> namespaces") >>> >>> enabled sending hotplug events into all network namespaces back in 2010. >>> Over time the set of uevents that get sent into all network namespaces has >>> shrunk. We have now reached the point where hotplug events for all devices >>> that carry a namespace tag are filtered according to that namespace. >>> >>> Specifically, they are filtered whenever the namespace tag of the kobject >>> does not match the namespace tag of the netlink socket. One example are >>> network devices. Uevents for network devices only show up in the network >>> namespaces these devices are moved to or created in. >>> >>> However, any uevent for a kobject that does not have a namespace tag >>> associated with it will not be filtered and we will *try* to broadcast it >>> into all network namespaces. >>> >>> The original patchset was written in 2010 before user namespaces were a >>> thing. With the introduction of user namespaces sending out uevents became >>> partially isolated as they were filtered by user namespaces: >>> >>> net/netlink/af_netlink.c:do_one_broadcast() >>> >>> if (!net_eq(sock_net(sk), p->net)) { >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) >>> return; >>> >>> if (!peernet_has_id(sock_net(sk), p->net)) >>> return; >>> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, >>> CAP_NET_BROADCAST)) >>> j return; >>> } >>> >>> The file_ns_capable() check will check whether the caller had >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user >>> namespace of interest. This check is fine in general but seems insufficient >>> to me when paired with uevents. The reason is that devices always belong to >>> the initial user namespace so uevents for kobjects that do not carry a >>> namespace tag should never be sent into another user namespace. This has >>> been the intention all along. But there's one case where this breaks, >>> namely if a new user namespace is created by root on the host and an >>> identity mapping is established between root on the host and root in the >>> new user namespace. Here's a reproducer: >>> >>> sudo unshare -U --map-root >>> udevadm monitor -k >>> # Now change to initial user namespace and e.g. do >>> modprobe kvm >>> # or >>> rmmod kvm >>> >>> will allow the non-initial user namespace to retrieve all uevents from the >>> host. This seems very anecdotal given that in the general case user >>> namespaces do not see any uevents and also can't really do anything useful >>> with them. >>> >>> Additionally, it is now possible to send uevents from userspace. As such we >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user >>> namespace of the network namespace of the netlink socket) userspace process >>> make a decision what uevents should be sent. >>> >>> This makes me think that we should simply ensure that uevents for kobjects >>> that do not carry a namespace tag are *always* filtered by user namespace >>> in kobj_bcast_filter(). Specifically: >>> - If the owning user namespace of the uevent socket is not init_user_ns the >>> event will always be filtered. >>> - If the network namespace the uevent socket belongs to was created in the >>> initial user namespace but was opened from a non-initial user namespace >>> the event will be filtered as well. >>> Put another way, uevents for kobjects not carrying a namespace tag are now >>> always only sent to the initial user namespace. The regression potential >>> for this is near to non-existent since user namespaces can't really do >>> anything with interesting devices. >>> >>> Signed-off-by: Christian Brauner <christian.brau...@ubuntu.com> >>> --- >>> lib/kobject_uevent.c | 10 +++++++++- >>> 1 file changed, 9 insertions(+), 1 deletion(-) >>> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c >>> index 15ea216a67ce..cb98cddb6e3b 100644 >>> --- a/lib/kobject_uevent.c >>> +++ b/lib/kobject_uevent.c >>> @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct >>> sk_buff *skb, void *data) >>> return sock_ns != ns; >>> } >>> >>> - return 0; >>> + /* >>> + * The kobject does not carry a namespace tag so filter by user >>> + * namespace below. >>> + */ >>> + if (sock_net(dsk)->user_ns != &init_user_ns) >>> + return 1; >>> + >>> + /* Check if socket was opened from non-initial user namespace. */ >>> + return sk_user_ns(dsk) != &init_user_ns; >>> } >>> #endif >> >> So, this prohibits to listen events of all devices except network-related >> in containers? If it's so, I don't think it's a good solution. Uevents is not > > No, this is not correct: As it is right now *without my patch* no > non-initial user namespace is receiving *any uevents* but those > specifically namespaced such as those for network devices. This patch > doesn't change that at all. The commit message outlines this in detail > how this comes about. > There is only one case where this currently breaks and this is as I > outlined explicitly in my commit message when you create a new user > namespace and map container(0) -> host(0). This patch fixes this.
Could you please point the place, where non-initial user namespaces are filtered? I only see the kobj_bcast_filter() logic, and it used to return 0, which means "accepted". Now it will return 1 sometimes. >> net-devices-only related interface and it's used for all devices in system. >> People may want to delegate block devices to nested user_ns, for example. > > That's fine but that's why I added uevent injection in a previous patch > series: I repeat no non-initial user namespace will by default receive > uevents. Thanks, Kirill