On Mon, 2015-02-23 at 17:22 -0800, Benjamin Coddington wrote:
> On Tue, 24 Feb 2015, Ian Kent wrote:
>
> > On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> > > On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > > > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > > > "J. Bruce Fields" <bfie...@fieldses.org> writes:
> > > > > >
> > > > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > > > > >
> > > > > > >> The case of nfsd state-recovery might be similar but you'll need to help me out a bit with that too.
> > > > > > >
> > > > > > > Each network namespace can have its own virtual nfs server. Servers can be started and stopped independently per network namespace. We decide which server should handle an incoming rpc by looking at the network namespace associated with the socket that it arrived over.
> > > > > > >
> > > > > > > A server is started by the rpc.nfsd command writing a value into a magic file somewhere.
> > > > > >
> > > > > > nit. Unless I am completely turned around, that file is on the nfsd filesystem, which lives in fs/nfsd/nfsctl.c.
> > > > > >
> > > > > > So I believe this really is a case of figuring out what we want the semantics to be for mount and propagating the information down from mount to where we call the user mode helpers.
> > > > >
> > > > > Oops, I agree. So when I said:
> > > > >
> > > > >     The upcalls need to happen consistently in one context for a given virtual nfs server, and that context should probably be derived from rpc.nfsd's somehow.
> > > > >
> > > > > Instead of "rpc.nfsd's", I think I should have said "the mounter of the nfsd filesystem".
> > > > >
> > > > > Which is already how we choose a net namespace: nfsd_mount and nfsd_fill_super store the current net namespace in s_fs_info. (And then grep for "netns" to see the places where that's used.)
> > > >
> > > > This is going to be mostly a restatement of what's already been said, partly for me to refer back to later and partly to clarify and confirm what I need to do, so prepare to be bored.
> > > >
> > > > As a result of Oleg's recommendations and comments, the next version of the series will take a reference to an nsproxy and a user namespace (from the init process of the calling task, while it's still a child of that task); it won't carry around task structs. There are still a couple of questions with this, so it's not quite there yet.
> > > >
> > > > We'll have to wait and see if what I've done is enough to remedy Oleg's concerns too. LOL, and then there's the question of how much I'll need to do to get it to actually work.
> > > >
> > > > The other difference is that obtaining the context (now an nsproxy and a user namespace) is done entirely within the usermode helper. I think that's a good thing for the calling process isolation requirement. That may need to change again based on the discussion here.
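To make that a bit more concrete, the context capture I'm describing amounts to something like the sketch below. This is only an illustration, not the actual series: the structure and function names (umh_ns_context, umh_capture_ns_context) are invented, and the real code may end up looking quite different. The idea is just to take counted references to the nsproxy and user namespace of the caller's namespace init process (pid 1 as seen from the caller's pid namespace) rather than holding a task struct:

#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/user_namespace.h>
#include <linux/cred.h>
#include <linux/errno.h>

/* Illustration only -- invented names, not the actual patch series. */
struct umh_ns_context {
	struct nsproxy *nsproxy;
	struct user_namespace *user_ns;
};

static int umh_capture_ns_context(struct umh_ns_context *ctx)
{
	struct task_struct *init_tsk;

	rcu_read_lock();
	init_tsk = find_task_by_vpid(1);	/* init as seen by the caller */
	if (!init_tsk) {
		rcu_read_unlock();
		return -ESRCH;
	}
	get_task_struct(init_tsk);
	rcu_read_unlock();

	task_lock(init_tsk);
	ctx->nsproxy = init_tsk->nsproxy;
	if (ctx->nsproxy)
		get_nsproxy(ctx->nsproxy);	/* take a counted reference */
	task_unlock(init_tsk);

	if (!ctx->nsproxy) {			/* init task was exiting */
		put_task_struct(init_tsk);
		return -ESRCH;
	}

	/* and a reference to that task's user namespace */
	ctx->user_ns = get_user_ns(task_cred_xxx(init_tsk, user_ns));

	put_task_struct(init_tsk);
	return 0;
}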
> > > > Now that we're starting to look at actual usage, it's worth keeping in mind that how to execute within the required namespaces has to be sound before we tackle use cases that add requirements on top of this fundamental functionality.
> > > >
> > > > There are a couple of things to think about.
> > > >
> > > > One thing that's needed is how to work out if UMH_USE_NS is needed, and another is how to provide persistent usage of particular namespaces across containers. The latter will probably relate to the origin of the file system (which looks like it will be identified at mount time).
> > > >
> > > > The first case is when the mount originates in the root init namespace; most of the time (if not all the time) UMH_USE_NS doesn't need to be set and the helper should run in the root init namespace.
> > >
> > > The helper always runs in the original mount's container. Sometimes that container is the init container, yes, but I don't see what value there is in setting a flag in that one case.
> >
> > Yep, that's pretty much what I meant.
> >
> > > > That should work for mount propagation as well with mounts bound into a container.
> > > >
> > > > Is this also true for automounted mounts at mount point crossing? Or perhaps I should ask, should automounted NFS mounts inherit the property from their parent mount?
> > >
> > > Yes. If we run separate helpers in each container, then the superblocks should also be separate (so that one container can't poison cached values used by another). So the containers would all end up with entirely separate superblocks for the submounts.
> >
> > That's almost what I was thinking.
> >
> > The question relates to a mount for which the namespace proxy would have been set at mount time in a container and then bound into another container (in Docker, by using the "--volumes-from <name>" option). I believe the namespace information from the original mount should always be used when calling a usermode helper. This might not be a sensible question now but I think it needs to be considered.
> >
> > > That seems inefficient at least, and I don't think it's what an admin would expect as the default behavior.
> >
> > LOL, but the best way to manage this is to set the namespace information at mount time (as Eric mentioned long ago) and use that everywhere. It's consistent and it provides a way for a process with appropriate privilege to specify the namespace information.
> >
> > > > The second case is when the mount originates in another namespace, possibly a container. TBH I haven't thought too much about mounts that originate from namespaces created by unshare(1) or another source yet. I'm hoping that will just work once this is done, ;)
> > >
> > > So, one container mounts and spawns a "subcontainer" which continues to use that filesystem? Yes, I think helpers should continue to run in the container of the original mount; I don't see any tricky exception here.
> >
> > That's what I think should happen too.
> >
> > > > The last time I tried binding NFS mounts from one container into another it didn't work,
> > >
> > > I'm not sure what you mean by "binding NFS mounts from one container into another". What exactly didn't work?
> >
> > It's the volumes-from Docker option I'm thinking of.
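Coming back to setting the namespace information at mount time: my reading of the nfsd case Bruce pointed at is roughly the following. This is paraphrased from memory of fs/nfsd/nfsctl.c of that era, simplified, so treat the details as approximate; nfsd_net_from_sb() is an invented name, the real code just reads sb->s_fs_info (or uses net_generic()) directly:

#include <linux/fs.h>
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <net/net_namespace.h>

static int nfsd_fill_super(struct super_block *sb, void *data, int silent);

/*
 * Approximate sketch, not a verbatim copy: the mounter's network
 * namespace is captured when the nfsd filesystem is mounted and ends up
 * in sb->s_fs_info, so every later user of that superblock sees the
 * same namespace.
 */
static struct dentry *nfsd_mount(struct file_system_type *fs_type,
				 int flags, const char *dev_name, void *data)
{
	struct net *net = current->nsproxy->net_ns;	/* mount-time capture */

	return mount_ns(fs_type, flags, net, nfsd_fill_super);
}

/* invented helper name -- shown only to make the retrieval side explicit */
static inline struct net *nfsd_net_from_sb(struct super_block *sb)
{
	return sb->s_fs_info;
}

Doing the equivalent for the nsproxy and user namespace at mount time would give the helper one consistent context to use everywhere for that mount.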
> > I'm not sure now if my statement is accurate. I'll need to test it again. I thought I had, but what didn't work with volumes-from might have been autofs rather than NFS mounts.
> >
> > Anyway, I'm going to need to provide a way for clients to say "calculate the namespace information and give me an identifier so it can be used everywhere for this mount", which amounts to maintaining a list of the namespace objects.
>
> That sounds a lot closer to some of the work I've been doing to see if I can come up with a way to solve the "where's the namespace I need?" problem.
>
> I agree with Greg's very early comments that the easiest way to determine which namespace context a process should use is to keep it as a copy of the task -- and the place that copy should be done is fork(). The problem was where to keep that information and how to make it reusable.
>
> I've been hacking out a keyrings-based "key-agent" service that is basically a special type of key (like a keyring). A key_agent type roughly corresponds to a particular type of upcall user, such as the idmapper. A key_agent_type is registered, and that registration ties a particular key_type to that key_agent. When a process calls request_key() for that key_type, instead of using the helper to execute /sbin/request-key, the process' keyrings are searched for a key_agent. If a key_agent isn't found, the key_agent provider is then asked to provide an existing one based on some rules (is there an existing key_agent running in a different namespace that we might want to use for this purpose -- for example, is there one already running in the namespace where the mount occurred). If so, it is linked to the calling process' keyrings and then used for the upcall. If not, then the calling process itself is forked/execve-ed into a new persistent key_agent that is installed on the calling process' keyrings just like a key, and with the same lifetime and GC expectations as a key.
>
> A key_agent is a user-space process waiting for a realtime signal to process a particular key and provide the requested key information, which can be installed back onto the calling process' keyrings.
>
> Basically, this approach allows a particular user of a keyrings-based upcall to specify their own rules about how to provide a namespace context for a calling process. It does, however, require extra work to create a specific key_agent_type for each individual key_type that might want to use this mechanism.
>
> I've been waiting to have a bit more of a proof-of-concept before bringing this approach into the discussion. However, it looks like it may be important to allow particular users of the upcall to have their own rules about which namespace contexts they might want to use. This approach could provide that flexibility.
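Just to check my understanding of the registration side, is it roughly this shape? Everything below is invented on my end purely as illustration -- none of these names exist in the kernel and I haven't seen your code:

#include <linux/key.h>
#include <linux/key-type.h>
#include <linux/cred.h>

/*
 * Hypothetical sketch only: a key_agent_type ties a key_type to an
 * agent-selection policy that request_key() would consult instead of
 * running the /sbin/request-key usermode helper.
 */
struct key_agent_type {
	const char	*name;
	struct key_type	*managed_type;	/* key_type this agent serves */
	/*
	 * Policy hook: find an existing agent suitable for this caller
	 * (e.g. one already running in the namespace where the mount
	 * happened), or return NULL so the caller is forked/exec'd into
	 * a new persistent agent linked to its keyrings.
	 */
	struct key	*(*find_agent)(const struct cred *caller);
};

/* hypothetical registration call tying the key_type to its agent type */
int register_key_agent_type(struct key_agent_type *type);

If that's close, then the per-user policy you mention would live entirely in find_agent(), which is where the flexibility comes from.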
I was wondering if what you've been doing would help. This does sound interesting; perhaps I should wait a little before doing much more, in case it can be generalized a little and used here too. It's likely the current limited implementation I have will also be useful for upcalls that need a straight "just execute me in the caller's namespace", so it's probably worth continuing it for that case.

> Ben
>
> > I'm not sure yet if I should undo some of what I've done recently or leave it for users who need a straight "execute me now within the current namespace".
> >
> > > --b.
> > >
> > > > but if we assume that will work at some point then, as Bruce points out, we need to provide the ability to record the namespaces to be used for subsequent "in namespace" execution while maintaining caller isolation (i.e. derived from the caller's init process).
> > > >
> > > > I've been aware of the need for persistence for a while now and I've been thinking about how to do it, but I don't have a clear plan quite yet. Bruce, having noticed this, has described details about the environment I have to work with, so that's a start. I need the thoughts of others on this too.
> > > >
> > > > As a result I'm not sure yet if this persistence can be integrated into the current implementation or if additional calls will be needed to set and clear the namespace information while maintaining the needed isolation.
> > > >
> > > > As Bruce says, perhaps the namespace information should be saved as a property of a mount, or perhaps it should be a list keyed by some handle, the handle being the saved property. I'm not sure yet, but the latter might be unnecessary complication and overhead. The cleanup of the namespace information upon summary termination of processes could be a bit difficult, but perhaps it will be as simple as making it a function of freeing the object it's stored in (in the cases we have so far that would be the mount).
> > > >
> > > > So, yes, I've still got a fair way to go yet, ;)
> > > >
> > > > Ian