On Wed, Mar 19, 2014 at 10:39:45PM +0100, Max Kellermann wrote:
> mount.mnt_hash is RCU-protected.  However, list_move() breaks RCU
> protection: when one thread walks the linked list while another calls
> list_move(), it may "redirect" the first thread into the new list,
> making it loop endlessly in __lookup_mnt(), because the list head is
> never found.

> The right way to delete items from a RCU-protected list is
> list_del_rcu().  Before the item is inserted into another list
> (completing the list_move), synchronize_rcu() must be called.

> The fix is to avoid reusing "mnt_hash".  This patch adds a new
> list_head attribute dedicated for umounting.  This avoids clobbering
> mnt_hash.next, allowing all rcu_locked threads to continue walking the
> list until namespace_unlock() is called.
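
For reference, the two orderings the report contrasts, as a sketch;
"tmp_list" here is just an illustrative private list head, not a name
taken from either patch:

        /* unsafe: a walker that has already fetched p->mnt_hash.next gets
         * redirected into tmp_list and never sees the original chain head
         * again, so its loop never terminates */
        list_move(&p->mnt_hash, &tmp_list);

        /* the ordering the report asks for: unhash under RCU, wait out the
         * current readers, only then reinsert into the private list */
        list_del_rcu(&p->mnt_hash);
        synchronize_rcu();
        list_add(&p->mnt_hash, &tmp_list);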

NAK.  Nice catch, the bug is real, but the fix is wrong.  For one thing,
you have missed detach_mnt()/attach_mnt(), so you are not covering
all places where the sucker might be removed from the list.  For another,
I don't believe that this is the right approach.

The *only* thing we care about is not getting stuck in __lookup_mnt().  If
it misses an entry because something in front of it just got moved around,
etc., we are fine.  We'll notice that mount_lock mismatch and that'll be
it.
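
That's the discipline lookup_mnt() already follows (visible as context in
the last hunk below): sample mount_lock, do the walk under rcu_read_lock(),
then verify, retrying on mismatch.  Roughly:

        rcu_read_lock();
        do {
                seq = read_seqbegin(&mount_lock);       /* sample */
                child_mnt = __lookup_mnt(path->mnt, path->dentry);
                m = child_mnt ? &child_mnt->mnt : NULL;
        } while (!legitimize_mnt(m, seq));      /* false on mount_lock mismatch => retry */
        rcu_read_unlock();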

IOW, there are two problems here:
        (1) we can, indeed, get caught into an infinite loop.
        (2) __follow_mount_rcu() and follow_mount_rcu() ought to recheck
nd->m_seq and bugger off with ECHILD in case of mismatch (the latter probably
ought to be folded into follow_dotdot_rcu()).  legitimize_mnt() will fail
at the end of lookup *and* we are risking an incorrect hard error if we fail
to cross a mountpoint and continue under it.

I would prefer to deal with (1) by turning mnt_hash into hlist; the problem
with that is __lookup_mnt_last().  That sucker is only called under
mount_lock, so RCU issues do not play there, but it's there and it
complicates things.  There might be a way to get rid of that thing for
good, but that's more invasive than what I'd be happy with for backports.
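
For the record, the hlist flavour would look something like this - a sketch
only, assuming struct mount grows a struct hlist_node mnt_hash and m_hash()
hands back an hlist_head; the point is that the walker terminates on NULL
instead of having to find its way back to the chain head, so getting
redirected into another chain can't loop forever:

        struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
        {
                struct hlist_head *head = m_hash(mnt, dentry);
                struct mount *p;

                hlist_for_each_entry_rcu(p, head, mnt_hash)
                        if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
                                return p;
                return NULL;
        }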

Let's just have __lookup_mnt() recheck mount_lock if it goes for too long -
e.g. every 256 iterations (and if the hash chain is that long, the price
of extra smp_rmb() + fetching seqcount won't matter anyway).  And have the
fs/namei.c callers check nd->m_seq properly (lookup_mnt() already does).
I've looked into taking the check into __lookup_mnt() itself, but it
leads to clumsier calling conventions and worse code in callers.

How about the following (completely untested yet):

diff --git a/fs/mount.h b/fs/mount.h
index 46790ae1..905d1bc 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -77,7 +77,7 @@ static inline int is_mounted(struct vfsmount *mnt)
        return !IS_ERR_OR_NULL(real_mount(mnt)->mnt_ns);
 }
 
-extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *);
+extern struct mount *__lookup_mnt(struct vfsmount *, struct dentry *, unsigned seq);
 extern struct mount *__lookup_mnt_last(struct vfsmount *, struct dentry *);
 
 extern bool legitimize_mnt(struct vfsmount *, unsigned);
diff --git a/fs/namei.c b/fs/namei.c
index d6e32a1..458572b 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1115,7 +1115,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
                if (!d_mountpoint(path->dentry))
                        break;
 
-               mounted = __lookup_mnt(path->mnt, path->dentry);
+               mounted = __lookup_mnt(path->mnt, path->dentry, nd->m_seq);
                if (!mounted)
                        break;
                path->mnt = &mounted->mnt;
@@ -1129,20 +1129,7 @@ static bool __follow_mount_rcu(struct nameidata *nd, struct path *path,
                 */
                *inode = path->dentry->d_inode;
        }
-       return true;
-}
-
-static void follow_mount_rcu(struct nameidata *nd)
-{
-       while (d_mountpoint(nd->path.dentry)) {
-               struct mount *mounted;
-               mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry);
-               if (!mounted)
-                       break;
-               nd->path.mnt = &mounted->mnt;
-               nd->path.dentry = mounted->mnt.mnt_root;
-               nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
-       }
+       return !read_seqretry(&mount_lock, nd->m_seq);
 }
 
 static int follow_dotdot_rcu(struct nameidata *nd)
@@ -1170,7 +1157,17 @@ static int follow_dotdot_rcu(struct nameidata *nd)
                        break;
                nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
        }
-       follow_mount_rcu(nd);
+       while (d_mountpoint(nd->path.dentry)) {
+               struct mount *mounted;
+               mounted = __lookup_mnt(nd->path.mnt, nd->path.dentry, nd->m_seq);
+               if (!mounted)
+                       break;
+               nd->path.mnt = &mounted->mnt;
+               nd->path.dentry = mounted->mnt.mnt_root;
+               nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);
+               if (read_seqretry(&mount_lock, nd->m_seq))
+                       goto failed;
+       }
        nd->inode = nd->path.dentry->d_inode;
        return 0;
 
diff --git a/fs/namespace.c b/fs/namespace.c
index 203475b..72c20b0 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -599,16 +599,23 @@ bool legitimize_mnt(struct vfsmount *bastard, unsigned seq)
 
 /*
  * find the first mount at @dentry on vfsmount @mnt.
- * call under rcu_read_lock()
+ * call under rcu_read_lock().  seq is normally *not* checked - that's
+ * for the caller to deal with; here we are just breaking infinite loops.
+ * If we get hash chains longer than 256 elements, the price of extra
+ * smp_rmb() won't matter.
  */
-struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
+struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry, unsigned seq)
 {
        struct list_head *head = m_hash(mnt, dentry);
        struct mount *p;
+       unsigned char n = 0;
 
-       list_for_each_entry_rcu(p, head, mnt_hash)
+       list_for_each_entry_rcu(p, head, mnt_hash) {
+               if (unlikely(!--n) && read_seqretry(&mount_lock, seq))
+                       break;
                if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
                        return p;
+       }
        return NULL;
 }
 
@@ -652,7 +659,7 @@ struct vfsmount *lookup_mnt(struct path *path)
        rcu_read_lock();
        do {
                seq = read_seqbegin(&mount_lock);
-               child_mnt = __lookup_mnt(path->mnt, path->dentry);
+               child_mnt = __lookup_mnt(path->mnt, path->dentry, seq);
                m = child_mnt ? &child_mnt->mnt : NULL;
        } while (!legitimize_mnt(m, seq));
        rcu_read_unlock();