Re: [lxc-devel] Device Namespaces
On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> So the big issues for a device namespace to solve are filtering which
> devices a container has access to and being able to dynamically change
> which devices those are at run time (aka hotplug).

As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
anymore, because it was redundant), I think you need to really think
this through better (pci, memory, cpus, etc.) before you do anything in
the kernel.

> After having thought about this for a bit I don't know if a pure
> userspace solution is sufficient or actually a good idea.
>
> - We can manually manage a tmpfs with device nodes in userspace.
>   (But that is deprecated functionality in the mainstream kernel).

Yes, but I'm not going to namespace devtmpfs, as that is going to be an
impossible task, right?

And remember, udev doesn't create device nodes anymore...

> - We can manually export a subset of sysfs with bind mounts.
>   (But that feels hacky, and is essentially incompatible with hotplug).

True.

> - We can relay a call of /sbin/hotplug from outside of a container
>   to inside of a container based on policy.
>   (But no one uses /sbin/hotplug anymore).

That's right, they should be listening to libudev events, so why can't
your daemon shuffle them off to the proper container, all in userspace?

> - There is no way to fake netlink uevents for a container to see them.
>   (The best we could do is replace udev everywhere with something that
>   listens on a unix domain socket).

You shouldn't need to do this.

> - It would be nice to replace the device cgroup with a comprehensive
>   solution that really works. (Among other things the device cgroup
>   does not work in terms of struct device, the underlying kernel
>   abstraction for devices).

I didn't even know there was a device cgroup. Which means that if there
is one, odds are it's useless.

> We must manage sysfs entries as well as device nodes because:
> - Seeing more than we should has the real potential to confuse
>   userspace, especially a userspace that replays uevents.

You should never replay uevents. If you don't do that, why can't you
see all of sysfs?

> - Some device control must happen through writing to sysfs files and
>   if we don't remove all root privileges from a container only by
>   exporting a subset of sysfs to that container can we limit which
>   sysfs nodes can be written to.

But you have the issue of controlling devices in a "shared" way, which
isn't going to be usable for almost all devices.

> The current kernel tagged sysfs entry support does not look like a good
> match for implementing device filtering. The common case will
> be allowing devices like /dev/zero, and /dev/null that live in
> /sys/devices/virtual and are the devices we are most likely to care
> about. Those devices need to live in multiple device namespaces so
> everyone can use them. Perhaps exclusive assignment will be the more
> common paradigm for device namespaces like it is for network devices in
> the network namespace but from what little I can see of this problem
> right now I don't think so.
>
> I definitely think we should hold off on a kernel level implementation
> until we really understand the issues and are ready to implement device
> namespaces correctly.

I agree, especially as I don't think this will ever work.

> A userspace implementation looks like it can only do about 95% of what
> is really needed, but at the same time looks like an easy way to
> experiment until the problem is sufficiently well understood.

95% is probably way better than what you have today, and will fit the
needs of almost everyone today, so why not do it?

I'd argue that those last 5% either are custom solutions that never get
merged, or candidates for true virtualization.

> In summary the situation with device hotplug and containers sucks today,
> and we need to do something. Running a linux desktop in a container is
> a reasonably good example use case.

No it isn't. I'd argue that this is a horrible use case, one that you
shouldn't do. Why not just use multi-head machines like people do who
really want to do this, relying on user separation? That's a workable
solution that is quite common and works very well today.

> Having one standard common maintainable implementation would be very
> useful and the most logical place for that would be in the kernel.
> For now we should focus on simple device filtering and hotplug.

Just listen for libudev stuff, don't try to filter them, or ever
"replay" them, that way lies madness, and lots of nasty race conditions
that are guaranteed to break things.

good luck,

greg k-h

--
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
the latest Intel processors and
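[Editorial sketch, not part of the original message.] The userspace approach suggested in this reply, a daemon outside the containers that listens for device events and decides which container each one belongs to, might look roughly like the following. This is a minimal sketch under stated assumptions: it reads raw kernel uevents from a NETLINK_KOBJECT_UEVENT socket instead of going through libudev (which is what the message actually recommends), the policy table mapping DEVPATH prefixes to container names is hypothetical, and how an event is then delivered into a container is deliberately left out.

```python
import socket

# Protocol number from <linux/netlink.h>; defined here because not every
# Python build exports it from the socket module.
NETLINK_KOBJECT_UEVENT = 15


def parse_uevent(payload: bytes) -> dict:
    # A kernel uevent is a NUL-separated list: "action@devpath" first,
    # followed by KEY=VALUE fields such as ACTION, DEVPATH and SUBSYSTEM.
    fields = [f for f in payload.split(b"\x00") if f]
    event = {}
    for field in fields[1:]:
        key, sep, value = field.decode("utf-8", "replace").partition("=")
        if sep:
            event[key] = value
    return event


def pick_container(event: dict, policy: dict):
    # policy is a hypothetical mapping of DEVPATH prefix -> container name.
    devpath = event.get("DEVPATH", "")
    for prefix, container in policy.items():
        if devpath.startswith(prefix):
            return container
    return None


def open_uevent_socket() -> socket.socket:
    # Netlink addresses are (pid, groups): pid 0 lets the kernel pick an id,
    # and multicast group 1 carries events sent by the kernel itself.
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, 1))
    return sock


def run(policy: dict) -> None:
    # Blocking loop; usually needs to run on the host, outside any container.
    sock = open_uevent_socket()
    while True:
        event = parse_uevent(sock.recv(65536))
        target = pick_container(event, policy)
        if target is not None:
            print(f"{event.get('ACTION')} {event.get('DEVPATH')} -> {target}")
```

The sketch only observes events and makes a policy decision; it does not resend or "replay" anything, which the message above warns against.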
Re: [lxc-devel] Device Namespaces
On Thu, Sep 26, 2013 at 11:25:56AM +0300, Janne Karhunen wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman wrote:
>
> >> In summary the situation with device hotplug and containers sucks today,
> >> and we need to do something. Running a linux desktop in a container is
> >> a reasonably good example use case.
> >
> > No it isn't. I'd argue that this is a horrible use case, one that you
> > shouldn't do. Why not just use multi-head machines like people do who
> > really want to do this, relying on user separation? That's a workable
> > solution that is quite common and works very well today.
>
> I suppose so, but now you take the assumption that there is no
> need for running multiple Linux variants on the same host (say
> Ubuntu and Android side by side). Is this something you would
> not like to see done?

You can do that today without any need for device namespaces, so why is
this an issue here?

greg k-h

___
Lxc-devel mailing list
Lxc-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/lxc-devel
Re: [lxc-devel] Device Namespaces
On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote:
> That being said, our wish would be to support any combination of
> OS's and frankly, I'd be slightly annoyed to tell the customer that
> they can't do two Androids or we magically run out of bits.

If you want to support "any" combination of operating systems, then use
a hypervisor, that's what they are there for :)
Re: [lxc-devel] Device Namespaces
On Sun, Sep 29, 2013 at 10:28:55PM +0300, Amir Goldstein wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman wrote:
>
> > On Wed, Sep 25, 2013 at 02:34:54PM -0700, Eric W. Biederman wrote:
> > > So the big issues for a device namespace to solve are filtering which
> > > devices a container has access to and being able to dynamically change
> > > which devices those are at run time (aka hotplug).
> >
> > As _all_ devices are hotpluggable now (look, there's no CONFIG_HOTPLUG
> > anymore, because it was redundant), I think you need to really think
> > this through better (pci, memory, cpus, etc.) before you do anything in
> > the kernel.
> >
> > > After having thought about this for a bit I don't know if a pure
> > > userspace solution is sufficient or actually a good idea.
> > >
> > > - We can manually manage a tmpfs with device nodes in userspace.
> > >   (But that is deprecated functionality in the mainstream kernel).
> >
> > Yes, but I'm not going to namespace devtmpfs, as that is going to be an
> > impossible task, right?
>
> That sounds like a challenge ;-)
> Seriously, as Serge correctly noted, it would not be that different from
> devpts if you start from an empty devtmpfs and populate it with devices
> that are "added in the context of that namespace". The semantics in which
> devices are "added in the context of a namespace" is the missing piece
> of the puzzle.

And the fact that these devices are almost all created before userspace
starts up, is a non-trivial "piece of the puzzle" :)

Good luck,

greg k-h
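[Editorial sketch, not part of the original message.] The userspace alternative mentioned earlier in the thread, manually managing a tmpfs with device nodes for a container, can be illustrated with a small helper that starts from an empty directory and adds only an allowed set of nodes. This is a sketch under assumptions: the `ALLOWED` table and the `dev_root` path are hypothetical, the numbers used are the conventional ones for /dev/null (1:3) and /dev/zero (1:5), and actually creating nodes normally requires CAP_MKNOD.

```python
import os
import stat

# Hypothetical allow-list: node name -> (major, minor) of a character device.
ALLOWED = {"null": (1, 3), "zero": (1, 5)}


def plan_dev_nodes(dev_root: str, allowed=ALLOWED) -> list:
    # Compute (path, mode, device number) tuples without touching the
    # filesystem, so the policy can be inspected before anything is created.
    plan = []
    for name, (major, minor) in sorted(allowed.items()):
        path = os.path.join(dev_root, name)
        plan.append((path, stat.S_IFCHR | 0o666, os.makedev(major, minor)))
    return plan


def populate(dev_root: str, allowed=ALLOWED) -> None:
    # Create the planned nodes. Typically needs root (CAP_MKNOD), and the
    # resulting permissions are still subject to the process umask.
    for path, mode, device in plan_dev_nodes(dev_root, allowed):
        os.mknod(path, mode, device)
```

This covers only the static part of the problem. As the reply points out, most devices already exist before userspace starts, so deciding which namespace a device was "added in" is not something a helper like this can answer.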
Re: [lxc-devel] Device Namespaces
On Mon, Sep 30, 2013 at 08:37:19AM -0700, James Bottomley wrote:
> On Thu, 2013-09-26 at 10:07 -0700, Greg Kroah-Hartman wrote:
> > On Thu, Sep 26, 2013 at 08:01:31PM +0300, Janne Karhunen wrote:
> > > That being said, our wish would be to support any combination of
> > > OS's and frankly, I'd be slightly annoyed to tell the customer that
> > > they can't do two Androids or we magically run out of bits.
> >
> > If you want to support "any" combination of operating systems, then use
> > a hypervisor, that's what they are there for :)
>
> No that's not quite the right way to think about it: The correct
> statement is only use a hypervisor if you need different kernels. With
> Windows, it happens to be true that you need a different kernel for each
> different OS version. However; with Linux, thanks to strong ABI
> backwards compatibility, you mostly don't. The way OpenVZ works today
> is that it installs a modified kernel which can then bring up every
> Linux OS in a separate container. Our use case is the hosters that give
> you root login to a virtual private server and allow you to upgrade it
> on your own. The reason for using a container rather than a hypervisor
> is the old density and elasticity one: 3x the density (i.e. 1/3 the
> overhead cost to the hoster) and the boot only needs to start at init,
> not bring up of virtual hardware and booting a second kernel.

I understand that some people really like the idea of using OpenVZ for
various things like this, but to claim that because of it we need to
hack up the driver core in the kernel into unimaginable pieces is not
necessarily something that I'll agree with.

But all of this is just words, I have yet to see any patches for any of
this, so I'll just wait until that happens before worrying about it...

thanks,

greg k-h
Re: [lxc-devel] Device Namespaces
On Tue, Oct 01, 2013 at 09:19:58AM +0300, Janne Karhunen wrote:
> On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman wrote:
>
> >> - We can relay a call of /sbin/hotplug from outside of a container
> >>   to inside of a container based on policy.
> >>   (But no one uses /sbin/hotplug anymore).
> >
> > That's right, they should be listening to libudev events, so why can't
> > your daemon shuffle them off to the proper container, all in userspace?
>
> Which reminds me, one potential reason being..
> http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html

I really wish I had never seen that patch, and I am glad it was
rejected.
Re: [lxc-devel] Device Namespaces
On Tue, Oct 01, 2013 at 12:51:36PM -0700, Eric W. Biederman wrote:
> "Serge E. Hallyn" writes:
>
> > Quoting Andy Lutomirski (l...@amacapital.net):
> >> On Tue, Oct 1, 2013 at 7:19 AM, Janne Karhunen wrote:
> >> > On Thu, Sep 26, 2013 at 8:33 AM, Greg Kroah-Hartman wrote:
> >> >
> >> >>> - We can relay a call of /sbin/hotplug from outside of a container
> >> >>>   to inside of a container based on policy.
> >> >>>   (But no one uses /sbin/hotplug anymore).
> >> >>
> >> >> That's right, they should be listening to libudev events, so why can't
> >> >> your daemon shuffle them off to the proper container, all in userspace?
> >> >
> >> > Which reminds me, one potential reason being..
> >> > http://lists.linuxfoundation.org/pipermail/containers/2013-May/032591.html
> >> >
> >>
> >> Can't the daemon live outside the container and shuffle stuff in?
> >
> > That's exactly what Michael Warfield is suggesting, fwiw.
>
> Michael Warfield's example of dynamically assigning serial ports to
> containers is a pretty good test case. Serial ports are extremely well
> known kernel objects whose evolution effectively stopped long ago. We
> have ptys to virtualize serial ports when we need it, but in
> general unprivileged users are safe to directly use a serial port
> device.
>
> Glossing over the details. The general problem is some policy exists
> outside of the container that decides if and when a container gets a
> serial port and stuffs it in.
>
> The expectation is that system containers will then run the udev
> rules and send the libuevent event.
>
> To make that all work without kernel modifications requires placing
> a faux-udev in the container, that listens for a device assignment from
> outside the container and then does exactly what udev would have done.
>
> The problems with this that I see are:
>
> - udev is a moving target making it hard to build a faux-udev that will
>   work everywhere.

How is udev a moving target? Use libudev and all should be fine, that's
an ABI you can rely on, right?

Or, if you don't like/want udev, use mdev in your container. Or
something else, what does this have to do with the kernel?

> - On distros running systemd, udev integration is sufficiently tight
>   that I am not certain a faux-udev is possible or will continue to be
>   possible.

That's not a kernel issue, that's a "ouch, this is hard, let's give up"
issue. Or perhaps it is a "maybe I shouldn't even be trying to do this"
type issue... :)

> - There are two other widely deployed solutions for managing hotplug
>   devices besides udev.

I know of mdev, what's the other one? The hacked-up shell script that
Android uses? Or something else?

> So given these difficulties I do not believe that the evolution of linux
> device management is done, and that patches to udev, the kernel or both
> will be needed. While it would be good for testing and understanding
> the problem I don't think a faux-udev will be a long term maintainable
> solution.

You are saying that for some reason you feel helpless with the way
userspace is going, so we have to change the kernel. That's horrible,
and is not going to be a reason I accept to change the kernel, sorry.

> I also understand the point that we aren't talking patches yet and just
> discussing ideas. Right now it is my hope that if we talk this out we
> can figure out a general direction that has a hope of working.
>
> From where I am standing faking uevents instead of replacing
> udev/mdev/whatever looks simpler and more maintainable.

Have you really looked into this? Numerous people, who understand this
code path and userspace issues, have said it is not a good idea at all.

But hey, what do I know...

I still have yet to see a reason why you can't use libudev today for
something like this.

Anyway, I'm done discussing this as it's pointless this early, I'm going
to refrain from any more pithy comments until someone posts some code,
as this is just wasting people's time at the moment.

greg k-h
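[Editorial sketch, not part of the original message.] The "faux-udev" arrangement discussed above, where something outside the container decides on a device assignment and a listener inside the container receives it, can be sketched with a unix domain datagram socket. Everything about the wire format here is an assumption made for illustration: the JSON message with `action`, `devname`, `major` and `minor` keys is hypothetical, and a real listener would go on to do whatever udev or mdev would have done for the device. The example device numbers are the conventional ones for ttyS1 (4:65).

```python
import json
import socket

VALID_ACTIONS = ("add", "remove")


def encode_assignment(action: str, devname: str, major: int, minor: int) -> bytes:
    # Host side: serialize one device assignment (hypothetical format).
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    message = {"action": action, "devname": devname,
               "major": major, "minor": minor}
    return json.dumps(message).encode("utf-8")


def decode_assignment(datagram: bytes) -> dict:
    # Container side: parse and validate a datagram from the host daemon.
    message = json.loads(datagram.decode("utf-8"))
    if message.get("action") not in VALID_ACTIONS:
        raise ValueError("unexpected or missing action")
    return message


def demo() -> dict:
    # A socketpair stands in for a socket shared between host and container.
    host_end, container_end = socket.socketpair(socket.AF_UNIX,
                                                socket.SOCK_DGRAM)
    with host_end, container_end:
        host_end.send(encode_assignment("add", "ttyS1", 4, 65))
        return decode_assignment(container_end.recv(4096))
```

Datagram sockets keep message boundaries, so each assignment arrives as one unit. The sketch says nothing about the harder questions raised in the thread, such as keeping such a listener compatible with udev and systemd inside the container.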