On Wed, Mar 19, 2014 at 10:39:45PM +0100, Max Kellermann wrote:
> mount.mnt_hash is RCU-protected.  However, list_move() breaks RCU
> protection: when one thread walks the linked list while another calls
> list_move(), it may "redirect" the first thread into the new list,
> making it loop endlessly in __lookup_mnt(), because the list head is
> never found.
> The right way to delete items from an RCU-protected list is
> list_del_rcu().  Before the item is inserted into another list
> (completing the list_move), synchronize_rcu() must be called.
>
> The fix is to avoid reusing "mnt_hash".  This patch adds a new
> list_head attribute dedicated to umounting.  This avoids clobbering
> mnt_hash.next, allowing all rcu_locked threads to continue walking the
> list until namespace_unlock() is called.

NAK.  Nice catch, the bug is real, but the fix is wrong.  For one thing,
you have missed detach_mnt()/attach_mnt(), so you are not covering all
places where the sucker might be removed from the list.  For another, I
don't believe that this is the right approach.

The *only* thing we care about is not getting stuck in __lookup_mnt().
If it misses an entry because something in front of it just got moved
around, etc., we are fine.  We'll notice the mount_lock mismatch and
that'll be it.  IOW, there are two problems here:

(1) We can, indeed, get caught in an infinite loop.

(2) __follow_mount_rcu() and follow_mount_rcu() ought to recheck
nd->m_seq and bugger off with ECHILD in case of mismatch (the latter
probably ought to be folded into follow_dotdot_rcu()).  Otherwise
legitimize_mnt() will fail at the end of lookup *and* we are risking an
incorrect hard error if we fail to cross a mountpoint and continue
under it.

I would prefer to deal with (1) by turning mnt_hash into an hlist; the
problem with that is __lookup_mnt_last().  That sucker is only called
under mount_lock, so RCU issues do not play there, but it's there and
it complicates things.  There might be a way to get rid of that thing
for good, but that's more invasive than what I'd be happy with for
backports.

Let's just have __lookup_mnt() recheck mount_lock if it goes on for too
long - e.g. every 256 iterations (and if the hash chain is that long,
the price of an extra smp_rmb() + fetching the seqcount won't matter
anyway).  And have the fs/namei.c callers check nd->m_seq properly
(lookup_mnt() already does).
I've looked into taking the check into __lookup_mnt() itself, but it
leads to clumsier calling conventions and worse code in the callers.
How about the following (completely untested yet):

diff --git a/fs/mount.h b/fs/mount.h
index 46790ae1..905d1bc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -77,7 +77,7 @@ static inline int is_mounted(struct vfsmount *mnt)
 	return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
 }
 
-extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
+extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, unsigned seq);
 extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
 
 extern bool legitimize_mnt(struct vfsmount *, unsigned);
diff --git a/fs/namei.c b/fs/namei.c
index d6e32a1..458572b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1115,7 +1115,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		if (!d_mountpoint(path->dentry))
 			break;
 
-		mounted = __lookup_mnt(path->mnt, path->dentry);
+		mounted = __lookup_mnt(path->mnt, path->dentry, nd->m_seq);
 		if (!mounted)
 			break;
 		path->mnt = &mounted->mnt;
@@ -1129,20 +1129,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
 		 */
 		*inode = path->dentry->d_inode;
 	}
-	return true;
-}
-
-static void follow_mount_rcu(struct nameidata *nd)
-{
-	while (d_mountpoint(nd->path.dentry)) {
-		struct mount *mounted;
-		mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry);
-		if (!mounted)
-			break;
-		nd->path.mnt = &mounted->mnt;
-		nd->path.dentry = mounted->mnt.mnt_root;
-		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-	}
+	return !read_seqretry(&mount_lock, nd->m_seq);
 }
 
 static int follow_dotdot_rcu(struct nameidata *nd)
@@ -1170,7 +1157,17 @@ static int follow_dotdot_rcu(struct nameidata *nd)
 			break;
 		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
 	}
-	follow_mount_rcu(nd);
+	while (d_mountpoint(nd->path.dentry)) {
+		struct mount *mounted;
+		mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry, nd->m_seq);
+		if (!mounted)
+			break;
+		nd->path.mnt = &mounted->mnt;
+		nd->path.dentry = mounted->mnt.mnt_root;
+		nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+		if (read_seqretry(&mount_lock, nd->m_seq))
+			goto failed;
+	}
 
 	nd->inode = nd->path.dentry->d_inode;
 	return 0;
diff --git a/fs/namespace.c b/fs/namespace.c
index 203475b..72c20b0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -599,16 +599,23 @@ bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 
 /*
  * find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+ * call under rcu_read_lock().  seq is normally *not* checked - that's
+ * for the caller to deal with; here we are just breaking infinite loops.
+ * If we get hash chains longer than 256 elements, the price of extra
+ * smp_rmb() won't matter.
  */
-struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, unsigned seq)
 {
 	struct list_head *head = m_hash(mnt, dentry);
 	struct mount *p;
+	unsigned char n = 0;
 
-	list_for_each_entry_rcu(p, head, mnt_hash)
+	list_for_each_entry_rcu(p, head, mnt_hash) {
+		if (unlikely(!--n) && read_seqretry(&mount_lock, seq))
+			break;
 		if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
 			return p;
+	}
 	return NULL;
 }
 
@@ -652,7 +659,7 @@ struct vfsmount *lookup_mnt(struct path *path)
 	rcu_read_lock();
 	do {
 		seq = read_seqbegin(&mount_lock);
-		child_mnt = __lookup_mnt(path->mnt, path->dentry);
+		child_mnt = __lookup_mnt(path->mnt, path->dentry, seq);
 		m = child_mnt ? &child_mnt->mnt : NULL;
 	} while (!legitimize_mnt(m, seq));
 	rcu_read_unlock();