On Sun, Mar 06, 2016 at 07:49:14PM -0800, Andy Lutomirski wrote:
> On Sun, Mar 6, 2016 at 7:45 PM, Serge E. Hallyn <se...@hallyn.com> wrote:
> > On Sun, Mar 06, 2016 at 06:24:23PM -0800, Andy Lutomirski wrote:
> >> On Mar 6, 2016 2:03 PM, "Eric W. Biederman" <ebied...@xmission.com> wrote:
> >> >
> >> > "Serge E. Hallyn" <serge.hal...@ubuntu.com> writes:
> >> >
> >> > > Hi,
> >> > >
> >> > > So we've been over this many times... but unfortunately there is more
> >> > > breakage to report. Regular privileged and unprivileged containers
> >> > > work all right for us. But running an unprivileged container inside a
> >> > > privileged container is blocked.
> >> > >
> >> > > When creating privileged containers, lxc by default does a few things:
> >> > > it mounts some fuse.lxcfs files over proc files including /proc/meminfo
> >> > > and /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro as well as
> >> > > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> >> > > (because this container is not in a user namespace) then moves
> >> > > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> >> > > /sys/devices/virtual/net as writeable.
> >> > >
> >> > > If any of these are left enabled, unprivileged containers can't be
> >> > > started. If all are disabled, then they can be.
> >> > >
> >> > > Can we find a way to make these not block remounts in child user
> >> > > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
> >> >
> >> > Are any of these overmounts done for the purpose of security? It
> >> > appears the /proc/sys and /sys mounts being made read-only is for that
> >> > purpose.
> >> >
> >> > If none of the mounts are for security the easy solution that works
> >> > today is to also mount /proc and /sys somewhere else in your container
> >> > so that the permission check for mounting a new copy passes.
> >>
> >> Can we use the big hammer approach on /proc/sys? Specifically, what
> >> if we made it so that /proc mounts created in a non-root namespace
> >> *only* see things that are scoped to the active namespaces, and only
> >> those over which the mounter has capabilities? We could have mount
> >> options for this.
> >
> > Of course the problem is precisely non-user-namespaced containers which
> > do own and have capabilities over the /proc/sys files. For user-namespaced
> > containers /proc/sys/ isn't really an issue.
>
> What I mean is:
>
> mount -o nsonly=user,net -t proc none /proc
>
> would show the list of processors and things scoped to the current
> userns and netns, would *not* show global sysctls, and would fail
> unless the caller has appropriate caps over the userns and netns.
> This would work even if the old procfs is not fully visible.

Gah, so apparently I'd forgotten the workaround I'd implemented - I thought
things had regressed, but they haven't, I'd just missed a step. Sorry for
the noise.

I don't want to make things more complicated or more brittle when we can
make it work as is - thanks.

-serge
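
For anyone hitting the same wall, a rough way to see which overmounts are in
play is to list the mount points that sit below /proc and /sys in
/proc/self/mountinfo. The Python below is only a minimal sketch of that idea.
The sample mountinfo text is made up to mirror the lxc setup described in the
quoted report (it was not captured from a real container), and the helper
names parse_mountinfo and overmounts are hypothetical. It only lists
candidates; whether a given overmount actually blocks a fresh proc or sysfs
mount in a child user namespace is decided by the kernel's own check, which
this does not reproduce.

```python
# Illustrative mountinfo text, NOT real output: it imitates the privileged
# lxc container layout from the report (lxcfs files over /proc, read-only
# /proc/sys with /proc/sys/net moved back, read-only /sys with a writeable
# /sys/devices/virtual/net). IDs and device numbers are invented.
SAMPLE_MOUNTINFO = """\
100 90 0:40 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
101 100 0:45 /meminfo /proc/meminfo rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw
102 100 0:45 /uptime /proc/uptime rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw
103 100 0:40 /sysrq-trigger /proc/sysrq-trigger ro,nosuid,nodev,noexec,relatime - proc proc rw
104 100 0:40 /sys /proc/sys ro,nosuid,nodev,noexec,relatime - proc proc rw
105 104 0:40 /sys/net /proc/sys/net rw,nosuid,nodev,noexec,relatime - proc proc rw
110 90 0:41 / /sys ro,nosuid,nodev,noexec,relatime - sysfs sysfs rw
111 110 0:41 /devices/virtual/net /sys/devices/virtual/net rw,relatime - sysfs sysfs rw
"""


def parse_mountinfo(text):
    """Parse mountinfo-style text into a list of small dicts.

    Per line: fields before the " - " separator are mount ID, parent ID,
    major:minor, root, mount point, mount options, then optional fields;
    the first field after the separator is the filesystem type.
    Octal escapes in paths (e.g. \\040 for a space) are left undecoded.
    """
    mounts = []
    for line in text.splitlines():
        if " - " not in line:
            continue  # skip blank or malformed lines
        left, right = line.split(" - ", 1)
        left_fields = left.split()
        right_fields = right.split()
        mounts.append({
            "mount_point": left_fields[4],
            "options": left_fields[5].split(","),
            "fstype": right_fields[0],
        })
    return mounts


def overmounts(mounts, base):
    """Return the mounts whose mount point lies strictly below `base`."""
    prefix = base.rstrip("/") + "/"
    return [m for m in mounts if m["mount_point"].startswith(prefix)]


if __name__ == "__main__":
    parsed = parse_mountinfo(SAMPLE_MOUNTINFO)
    for base in ("/proc", "/sys"):
        for m in overmounts(parsed, base):
            mode = "ro" if "ro" in m["options"] else "rw"
            print(f"{base}: {m['mount_point']} ({m['fstype']}, {mode})")
```

To look at a live container instead of the sample, pass the contents of
/proc/self/mountinfo to parse_mountinfo.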