On Mon, Mar 29, 2021 at 11:21:34AM +0200, Christian Brauner wrote:
> On Sat, Mar 27, 2021 at 11:33:37PM +0000, Al Viro wrote:
> > On Fri, Mar 26, 2021 at 02:50:11PM +0100, Christian Brauner wrote:
> > > @@ -632,6 +632,7 @@ EXPORT_SYMBOL(close_fd); /* for ksys_close() */
> > >  static inline void __range_cloexec(struct files_struct *cur_fds,
> > >                              unsigned int fd, unsigned int max_fd)
> > >  {
> > > + unsigned int cur_max;
> > >   struct fdtable *fdt;
> > >  
> > >   if (fd > max_fd)
> > > @@ -639,7 +640,12 @@ static inline void __range_cloexec(struct 
> > > files_struct *cur_fds,
> > >  
> > >   spin_lock(&cur_fds->file_lock);
> > >   fdt = files_fdtable(cur_fds);
> > > - bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);
> > > + /* make very sure we're using the correct maximum value */
> > > + cur_max = fdt->max_fds;
> > > + cur_max--;
> > > + cur_max = min(max_fd, cur_max);
> > > + if (fd <= cur_max)
> > > +         bitmap_set(fdt->close_on_exec, fd, cur_max - fd + 1);
> > >   spin_unlock(&cur_fds->file_lock);
> > >  }
> > 
> > Umm...  That's harder to follow than it ought to be.  What's the point of
> > having
> >         max_fd = min(max_fd, cur_max);
> > done in the caller, anyway?  Note that in __range_close() you have to
> > compare with re-fetched ->max_fds (look at pick_file()), so...
> 
> Yeah, I'll massage that patch a bit. I wanted to know whether this fixes
> the issue first though.
> 
> > 
> > BTW, I really wonder if the cost of jerking ->file_lock up and down
> > in that loop in __range_close() is negligible.  What values do we
> 
> Just for the record, I remember you pointing at that originally. Linus
> argued that this likely wasn't going to be a problem and that if people
> see performance hits we'll optimize.
> 
> > typically get from callers and how sparse does descriptor table tend
> > to be for those?
> 
> Weirdly, I can actually somewhat answer that question since I tend to
> regularly "survey" large userspace projects I know or am involved in
> that adopt new APIs we added just to see how they use it.
> 
> A few users:
> 1. crun
>    
> https://github.com/containers/crun/blob/a1c0ef1b886ca30c2fb0906c7c43be04b555c52c/src/libcrun/utils.c#L1490
>    ret = syscall_close_range (n, UINT_MAX, CLOSE_RANGE_CLOEXEC);
> 
> 2. LXD
>    
> https://github.com/lxc/lxd/blob/f12f03a4ba4645892ef6cc167c24da49d1217b02/lxd/main_forkexec.go#L293
>    ret = close_range(EXEC_PIPE_FD + 1, UINT_MAX, CLOSE_RANGE_UNSHARE);
> 
> 3. LXC
>    
> https://github.com/lxc/lxc/blob/1718e6d6018d5d6072a01d92a11d5aafc314f98f/src/lxc/rexec.c#L165
>    ret = close_range(STDERR_FILENO + 1, MAX_FILENO, CLOSE_RANGE_CLOEXEC);
> 
> Of these three 1. and 3. don't matter because they rely on
> CLOSE_RANGE_CLOEXEC and exec.
> For 2. I can say that the fdtable is likely going to be sparse.
> close_range() here is basically used to prevent accidental fd leaks
> across an exec. So 2. should never have more > 4 file. In fact, this
> could and should probably be switched to CLOSE_RANGE_CLOEXEC too.
> 
> The next two cases might be more interesting:
> 
> 4. systemd
>    - 
> https://github.com/systemd/systemd/blob/fe96c0f86d15e844d74d539c6cff7f971078cf84/src/basic/fd-util.c#L228
>      close_range(3, -1, 0)
>    - 
> https://github.com/systemd/systemd/blob/fe96c0f86d15e844d74d539c6cff7f971078cf84/src/basic/fd-util.c#L271
>      
> https://github.com/systemd/systemd/blob/fe96c0f86d15e844d74d539c6cff7f971078cf84/src/basic/fd-util.c#L288
>      /* Close everything between the start and end fds (both of which shall 
> stay open) */
>      if (close_range(start + 1, end - 1, 0) < 0) {
>      if (close_range(sorted[n_sorted-1] + 1, -1, 0) >= 0)
> 
> 5. Python
>    
> https://github.com/python/cpython/blob/9976834f807ea63ca51bc4f89be457d734148682/Python/fileutils.c#L2250
> 
> systemd has the regular case that others have too where it simply closes
> all fds over 3 and it also has the more complicated case where it has an
> ordered array of fds closing up to the lower bound and after the upper
> bound up to the maximum. PID 1 can have a large number of fds open
> because of socket activation so here close_range() will encounter less
> sparse fd tables where it needs to close a lot of fds.
> 
> For Python's os.closerange() implementation which depends on our syscall
> it's harder to say given that this will be used by a lot of projects but
> I would _guess_ that if people use closerange() they do so because they
> actually have something to close.
> 
> In short, I would think that close_range() without the
> CLOSE_RANGE_CLOEXEC feature will usually be used in scenarios where
> there's work to be done, i.e. where the caller likely knows that they
> might inherit a non-trivial number of file descriptors (usually after a
> fork) that they want to close and they want to do it either because they
> don't exec or they don't know when they'll exec. All others I'd expect
> to switch to CLOSE_RANGE_CLOEXEC on kernels where it's supported.

The patch below is a bit noiser then I'd like but it rips out the double
logic where we check min() twice and should also tweak the worst-case
where we keep taking the spinlock for a few fds that are already past
fdt->max_fds.
Afaict, this should be a very rare (pathological?) case. Even before
this change the logic would've capped max_fds to fdt->max_fds but
wouldn't have bothered to recheck within the pick_file() loop.
I've tweaked it to do that and then to return as soon as we see that
we're past the last fd. If you think that's worth then I'm happy to
write a commit message. I'm not sure if it's worth doing something more
fancy than this tbh but curious to hear what you think (Only compile
tested for now.):

---
 fs/file.c | 72 +++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 49 insertions(+), 23 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index f3a4bac2cbe9..5027cd75ec59 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -596,18 +596,32 @@ void fd_install(unsigned int fd, struct file *file)
 
 EXPORT_SYMBOL(fd_install);
 
+/**
+ * pick_file - return file associatd with fd
+ * @files: file struct to retrieve file from
+ * @fd: file descriptor to retrieve file for
+ *
+ * If this functions returns an EINVAL error pointer the fd was beyond the
+ * current maximum number of file descriptors for that fdtable.
+ *
+ * Returns: The file associated with @fd, on error returns an error pointer.
+ */
 static struct file *pick_file(struct files_struct *files, unsigned fd)
 {
-       struct file *file = NULL;
+       struct file *file;
        struct fdtable *fdt;
 
        spin_lock(&files->file_lock);
        fdt = files_fdtable(files);
-       if (fd >= fdt->max_fds)
+       if (fd >= fdt->max_fds) {
+               file = ERR_PTR(-EINVAL);
                goto out_unlock;
+       }
        file = fdt->fd[fd];
-       if (!file)
+       if (!file) {
+               file = ERR_PTR(-EBADF);
                goto out_unlock;
+       }
        rcu_assign_pointer(fdt->fd[fd], NULL);
        __put_unused_fd(files, fd);
 
@@ -622,24 +636,37 @@ int close_fd(unsigned fd)
        struct file *file;
 
        file = pick_file(files, fd);
-       if (!file)
+       if (IS_ERR(file))
                return -EBADF;
 
        return filp_close(file, files);
 }
 EXPORT_SYMBOL(close_fd); /* for ksys_close() */
 
+/**
+ * last_fd - return last valid index into fd table
+ * @cur_fds: files struct
+ *
+ * Context: Either rcu read lock or files_lock must be held.
+ *
+ * Returns: Last valid index into fdtable.
+ */
+static inline unsigned last_fd(struct fdtable *fdt)
+{
+       return fdt->max_fds - 1;
+}
+
 static inline void __range_cloexec(struct files_struct *cur_fds,
                                   unsigned int fd, unsigned int max_fd)
 {
        struct fdtable *fdt;
 
-       if (fd > max_fd)
-               return;
-
+       /* make sure we're using the correct maximum value */
        spin_lock(&cur_fds->file_lock);
        fdt = files_fdtable(cur_fds);
-       bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);
+       max_fd = min(last_fd(fdt), max_fd);
+       if (fd <= max_fd)
+               bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);
        spin_unlock(&cur_fds->file_lock);
 }
 
@@ -650,11 +677,16 @@ static inline void __range_close(struct files_struct 
*cur_fds, unsigned int fd,
                struct file *file;
 
                file = pick_file(cur_fds, fd++);
-               if (!file)
+               if (!IS_ERR(file)) {
+                       /* found a valid file to close */
+                       filp_close(file, cur_fds);
+                       cond_resched();
                        continue;
+               }
 
-               filp_close(file, cur_fds);
-               cond_resched();
+               /* beyond the last fd in that table */
+               if (PTR_ERR(file) == -EINVAL)
+                       return;
        }
 }
 
@@ -669,7 +701,6 @@ static inline void __range_close(struct files_struct 
*cur_fds, unsigned int fd,
  */
 int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
 {
-       unsigned int cur_max;
        struct task_struct *me = current;
        struct files_struct *cur_fds = me->files, *fds = NULL;
 
@@ -679,13 +710,6 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned 
int flags)
        if (fd > max_fd)
                return -EINVAL;
 
-       rcu_read_lock();
-       cur_max = files_fdtable(cur_fds)->max_fds;
-       rcu_read_unlock();
-
-       /* cap to last valid index into fdtable */
-       cur_max--;
-
        if (flags & CLOSE_RANGE_UNSHARE) {
                int ret;
                unsigned int max_unshare_fds = NR_OPEN_MAX;
@@ -697,8 +721,12 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned 
int flags)
                 * If the caller requested all fds to be made cloexec copy all
                 * of the file descriptors since they still want to use them.
                 */
-               if (!(flags & CLOSE_RANGE_CLOEXEC) && (max_fd >= cur_max))
-                       max_unshare_fds = fd;
+               if (!(flags & CLOSE_RANGE_CLOEXEC)) {
+                       rcu_read_lock();
+                       if (max_fd >= last_fd(files_fdtable(cur_fds)))
+                               max_unshare_fds = fd;
+                       rcu_read_unlock();
+               }
 
                ret = unshare_fd(CLONE_FILES, max_unshare_fds, &fds);
                if (ret)
@@ -712,8 +740,6 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned 
int flags)
                        swap(cur_fds, fds);
        }
 
-       max_fd = min(max_fd, cur_max);
-
        if (flags & CLOSE_RANGE_CLOEXEC)
                __range_cloexec(cur_fds, fd, max_fd);
        else
-- 
2.27.0

Reply via email to