Matt Helsley wrote:
> On Fri, Jul 24, 2009 at 04:35:14AM -0400, Oren Laadan wrote:
>>
>> Matt Helsley wrote:
>>>  Add checkpoint/restart support for epoll files. This is the minimum
>>>  support necessary to recreate the epoll item sets without any pending
>>>  events.
>>>
>>>  This is an RFC to show where I'm going with the patch and give an idea
>>>  of how much code I expect it will take. Compiles and boots on x86 but
>>>  I haven't tested it.
>>>
>>>  Caveats: Does not correctly support restoring epoll fds to epoll sets
>>>       as this requires some recursion detection/avoidance. See the
>>>       "TODO" that mentions deferqueues.
>>>
>>>       Does not attempt to save pending event information for
>>>       delivery during restart.
>>>
>>>       TODO: Look into if we need a restart block for epoll_wait().
>> Since epoll restore can only complete after all fd's of a task have
>> been restores, there is little advantage to even attempt restore
>> earlier, on the fly.
>>
>> Instead, I suggest that first epoll fd's be created (like all other
>> files), and after all fd's are in place, the epoll state be restored.
>>
>> To avoid keeping potentially large data describing this state in the
>> kernel, modify checkpoint to only dump the internal state of epoll
>> fd's after all other fd's of the task have been dumped.
>>
>> In both cases deferqueue is our friend:
>>
>> At checkpoint, have do_checkpoint_file_table() setup a deferqueue
>> to which checkpoint_file_desc() may add work. The deferqueue will
>> be executed at the end of do_checkpoint_file_table(), and dump the
>> state of each epoll fd.
>>
>> At restart, do_restore_file_table() will do the same: setup a
>> deferqueue and use it to restore the state of all epoll fd's.
>>
>>> Signed-off-by: Matt Helsley <[email protected]>
>>> ---
>>>  checkpoint/files.c             |   13 +++
>>>  fs/eventpoll.c                 |  181 
>>> +++++++++++++++++++++++++++++++++++++++-
>>>  include/linux/checkpoint_hdr.h |   15 ++++
>>>  include/linux/eventpoll.h      |   14 +++-
>>>  4 files changed, 221 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/checkpoint/files.c b/checkpoint/files.c
>>> index 5ca2e6c..7233a9b 100644
>>> --- a/checkpoint/files.c
>>> +++ b/checkpoint/files.c
>>> @@ -484,6 +484,11 @@ struct restore_file_ops {
>>>                               struct ckpt_hdr_file *ptr);
>>>  };
>>>  
>>> +#ifdef CONFIG_EPOLL
>>> +extern struct file* ep_file_restore(struct ckpt_ctx *ctx,
>>> +                               struct ckpt_hdr_file *ptr);
>>> +#endif
>>> +
>> Already in eventpoll.h ?
> 
> Yup, good point. Fixed.
> 
>>>  static struct restore_file_ops restore_file_ops[] = {
>>>     /* ignored file */
>>>     {
>>> @@ -503,6 +508,14 @@ static struct restore_file_ops restore_file_ops[] = {
>>>             .file_type = CKPT_FILE_PIPE,
>>>             .restore = pipe_file_restore,
>>>     },
>>> +#ifdef CONFIG_EPOLL
>>> +   /* epoll */
>>> +   {
>>> +           .file_name = "EPOLL",
>>> +           .file_type = CKPT_FILE_EPOLL,
>>> +           .restore = ep_file_restore,
>>> +   },
>>> +#endif
>>>  };
>>>  
>>>  static struct file *do_restore_file(struct ckpt_ctx *ctx)
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index 5458e80..9b2414b 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -668,10 +668,17 @@ static unsigned int ep_eventpoll_poll(struct file 
>>> *file, poll_table *wait)
>>>     return pollflags != -1 ? pollflags : 0;
>>>  }
>>>  
>>> +#ifdef CONFIG_CHECKPOINT
>>> +static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file 
>>> *file);
>>> +#else
>>> +#define ep_eventpoll_checkpoint NULL
>>> +#endif
>>> +
>>>  /* File callbacks that implement the eventpoll file behaviour */
>>>  static const struct file_operations eventpoll_fops = {
>>>     .release        = ep_eventpoll_release,
>>> -   .poll           = ep_eventpoll_poll
>>> +   .poll           = ep_eventpoll_poll,
>>> +   .checkpoint     = ep_eventpoll_checkpoint,
>>>  };
>>>  
>>>  /* Fast test to see if the file is an evenpoll file */
>>> @@ -1410,6 +1417,178 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct 
>>> epoll_event __user *, events,
>>>  
>>>  #endif /* HAVE_SET_RESTORE_SIGMASK */
>>>  
>>> +#ifdef CONFIG_CHECKPOINT
>>> +#include <linux/checkpoint.h>
>>> +#include <linux/checkpoint_hdr.h>
>>> +
>> Each file/fd registered in an epoll fd produces a reference count
>> to the target file, this needs to be taken into account for leak
>> detection when collecting references.
>>
>> I'm thinking of adding a new fops method for files for this purpose:
>> e.g. collect(), such that collect_file_desc() will invoke this method
>> on the file if that method is non-NULL.
>>
>> (Turns out that it's useful also for ptys/ttys, since the underlying
>> tty also needs to be counted for leaks).
>>
>> For epoll, the collect() method will add all those files that are
>> being tracked.
> 
> Sounds good to me -- it was on my TODO list. Here's a rough draft of the
> collect routine that I've got:
> 
> static int ep_file_collect(struct ckpt_ctx *ctx, struct file *file)
> {
>         struct rb_node *rbp;
>         struct eventpoll *ep;
>         int rc = 0;
> 
> #if 0
>       /* Not needed if we have a "collect" function ptr, otherwise
>          can be called unconditionally from ckpt collect files path
>          and this will exit early.. */
>         if (!is_file_epoll(file))
>                 return rc;
> #endif
> 
>         ep = file->private_data;
>         mutex_lock(&ep->mtx);
>         for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp), nitems++) {
>                 struct epitem *epi;
> 
>                 epi = rb_entry(rbp, struct epitem, rbn);
>                 rc = ckpt_obj_collect(ctx, epi->ffd.file, CKPT_OBJ_FILE);
>                 if (rc < 0)
>                       break;
>         }
>         mutex_unlock(&ep->mtx);
>       return rc;
> }

Looks ok to me.

> 
> 
>>> +static int ep_eventpoll_checkpoint(struct ckpt_ctx *ctx, struct file *file)
>>> +{
>>> +   struct ckpt_hdr_file_eventpoll *h;
>>> +   struct ckpt_eventpoll_item *cepi;
>>> +   struct rb_node *rbp;
>>> +   struct eventpoll *ep;
>>> +   int nitems = 0, rc = -ENOMEM;
>>> +
>>> +   h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_FILE_EPOLL);
>>> +   if (!h)
>>> +           return rc;
>>> +
>>> +   ep = file->private_data;
>>> +   /* TODO see if we need mutex_lock(&ep->mtx);*/
>> Yes we do, especially for subtree (non-full-container) checkpoints
>> where another, not-checkpointed process, may modify it concurrently.
> 
> OK.
> 
>>> +   for (rbp = rb_first(&ep->rbr); rbp && nitems++; rbp = rb_next(rbp)) {}
>>> +   h->num_items = nitems;
>>> +   h->common.f_type = CKPT_FILE_EPOLL;
>>> +   rc = checkpoint_file_common(ctx, file, &h->common);
>>> +   if (!rc) {
>>> +           /*
>>> +            * We write this in such a weird way! The problem is we want
>>> +            * to use the common file checkpoint code above but we also
>>> +            * want a variable number of saved epitems to follow the
>>> +            * num_items field. So we write out the header type and length,
>>> +            * then we write the remaining, fixed-size part of the struct.
>>> +            * Finally we'll write each epitem by walking the rbtree.
>>> +            */
>> If we write the epoll-specific state later (as suggested above),
>> then this weirdness disappears.
>>
>> And if not, I suggest that you separate this header from the actual
>> epoll-specific state, namely the array of epoll elements. The latter
>> should go into its own object.
>>
>> Besides, during restart, the entire object will be read into memory,
>> and the array can be (or be made) very large.
> 
> Sure.
> 
>>> +           h->common.h.len += nitems*sizeof(*cepi);
>>> +           rc = ckpt_write_obj_type(ctx, NULL, h->common.h.len,
>>> +                                    h->common.h.type);
>>> +           ckpt_kwrite(ctx, ((void*)&h->common.h) + sizeof(h->common.h),
>>> +                       sizeof(*h) - sizeof(h->common.h));
>>> +   }
>>> +   ckpt_hdr_put(ctx, h);
>>> +   if (rc)
>>> +           goto out;
>>> +
>>> +   /* 
>>> +    * FIXME for now we do it one at a time. Later we might do it like
>>> +    * checkpoint_pids() for better performance since there can be many
>>> +    * more epoll items than pids.
>>> +    */
>> Defer the writing below ...
> 
> OK
> 
>>> +   cepi = ckpt_hdr_get(ctx, sizeof(*cepi));
>>> +   for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
>>> +           struct epitem *epi = rb_entry(rbp, struct epitem, rbn);
>>> +           cepi->fd = epi->ffd.fd;
>>> +           cepi->events = epi->event.events;
>>> +           cepi->data = epi->event.data;
>>> +           if (ckpt_kwrite(ctx, cepi, sizeof(*cepi)) < 0)
>>> +                   break;
>>> +   }
>>> +   _ckpt_hdr_put(ctx, cepi, sizeof(*cepi));
>>> +out:
>>> +   /*mutex_unlock(&ep->mtx);*/
>>> +   return rc;
>>> +}
>>> +
>>> +struct file* ep_file_restore(struct ckpt_ctx *ctx,
>>> +                        struct ckpt_hdr_file *ptr)
>>> +{
>>> +   struct ckpt_hdr_file_eventpoll *h;
>>> +   struct eventpoll *ep;
>>> +   struct file *epfile;
>>> +   int epfd, error, i;
>>> +
>>> +   h = container_of(ptr, struct ckpt_hdr_file_eventpoll, common);
>>> +   if (h->common.h.type != CKPT_HDR_FILE_EPOLL ||
>>> +       h->common.h.len < sizeof(*h) ||
>>> +       h->common.f_type != CKPT_FILE_EPOLL)
>>> +           return ERR_PTR(-EINVAL);
>>> +
>>> +   /* limit max ep watches and the use of flags */
>>> +   if (h->num_items >= max_user_watches)
>>> +           return ERR_PTR(-ENOSPC);
>> This check should be against the sum of all epoll fd's per user.
>> It will take place elsewhere, and here it's incomplete and doesn't
>> add much.
> 
> Yeah, I noticed that too. Currently, I've got:
> 
>        /* Limit max ep watches. */
>         user = get_current_user();
>         remaining_watches = (max_user_watches -
>                              atomic_read(&user->epoll_watches));
>         free_uid(user);
>         if (h->num_items > remaining_watches)
>                 return ERR_PTR(-ENOSPC);
> 
> ep_insert() checks user->epoll_watches too. So even the original version
> would have failed eventually if the number of watches would have been
> exceeded.
> 
> The check in ep_insert() also means we'll properly enforce the limit even
> if we're running in parallel with other epoll file restores.
> 
> So really this check is redundant. I added it originally with the idea
> that I might not be able to use ep_insert() directly. Now I'm thinking
> it might be better to remove it entirely.

Note that by reusing code from epoll_ctl() (after refactoring),
you'll get all the standard checks, including this one, for free.

> 
>>> +   if (h->common.f_flags & ~EPOLL_CLOEXEC)
>>> +           return ERR_PTR(-EINVAL);
>>> +
>>> +   /* similar to epoll_create() */
>>> +   error = ep_alloc(&ep);
>>> +   if (error < 0)
>>> +           return ERR_PTR(error);
>>> +   error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
>>> +                            ptr->f_flags & O_CLOEXEC);
>>> +   if (error < 0) {
>>> +           ep_free(ep);
>>> +           return ERR_PTR(error);
>>> +   }
>>> +
>>> +   /* Everything's allocated, we just need a file* */
>>> +   epfd = error;
>>> +   epfile = fget(epfd);
>>> +   BUG_ON(!epfile);
>> Is there a reason not to use sys_epoll_create(), or refactor it
>> and use the common code ?
> 
> Yeah, I'll reuse it.
> 
>>> +
>>> +   /* Restore the common file bits */
>>> +   i = 0;
>>> +   error = restore_file_common(ctx, epfile, ptr);
>>> +   if (error < 0)
>>> +           goto error_return;
>>> +
>>> +   /* Restore the epoll items/watches */
>>> +   for (; i < h->num_items; i++) {
>> Defer these ...
> 
> OK.
> 
>>> +           /* 
>>> +            * Loop body like multiple epoll_ctl(ep, ADD, event)
>>> +            * calls except we've already done much of the checking.
>>> +            */
>>> +           struct ckpt_eventpoll_item cepi;
>>> +           struct epoll_event epev;
>>> +           struct epitem *epi;
>>> +           struct file *tfile;
>>> +
>>> +           error = ckpt_kread(ctx, &cepi, sizeof(cepi));
>>> +           if (error < 0) {
>>> +                   i++;
>>> +                   break;
>>> +           }
>>> +           epev.events = cepi.events;
>>> +           epev.data = cepi.data;
>> The code below is a duplicate of sys_epoll_ctl(). Is there a reason
>> not to use that code, or refactor it and share the common code ?
> 
> I certainly can't reuse it directly since it does a copy_from_user().
> Also I've already got the epoll file* and know the op. I'll look for a
> good way to factor common pieces from both sys_epoll_ctl() and
> ep_file_restore().
> 
>>> +
>>> +           /* Get the file* for the target file */
>>> +           error = -EFAULT;
>>> +           tfile = fget(cepi.fd);
>>> +           if (!tfile) {
>>> +                   /*
>>> +                    * TODO delayed addition of epitem because 
>>> +                    * tfile hasn't been restored yet.
>>> +                    */
>>> +                   continue;
>>> +           }
>>> +
>>> +           /* The target file descriptor must support poll */
>>> +           error = -EPERM;
>>> +           if (!tfile->f_op || !tfile->f_op->poll)
>>> +                   goto error_tgt_fput;
>>> +
>>> +           /* Cannot add an epoll file descriptor inside itself. */
>>> +           error = -EINVAL;
>>> +           if (epfile == tfile)
>>> +                   goto error_tgt_fput;
>>> +
>>> +           /* mutex_lock(&ep->mtx); TODO check if we need */
>> Yes, please.
> 
> OK.
> 
>>> +           epi = ep_find(ep, tfile, cepi.fd);
>>> +           if (!epi) {
>>> +                   epev.events |= POLLERR | POLLHUP;
>>> +                   error = ep_insert(ep, &epev, tfile, cepi.fd);
>>> +           } else
>>> +                   error = -EEXIST;
>>> +           /* mutex_unlock(&ep->mtx); */
>>> +error_tgt_fput:
>>> +           fput(tfile);
>>> +           if (error < 0)
>>> +                   break;
>>> +   }
>>> +error_return:
>>> +   /* Read the remaining number of items. */
>>> +   for (; i < h->num_items; i++) {
>>> +           struct ckpt_eventpoll_item cepi;
>>> +           ckpt_kread(ctx, &cepi, sizeof(cepi));
>>> +   }
>> What's the purpose of this ?
> 
> If we encounter an error we're left in the middle of the epoll item
> array. This effectively seeks to the end of the array. Not needed
> when deferring as you suggested since it changes the way we read the
> ckpt image..

Ok. I actually wondered what is the purpose of seeking to the end
of the array after you detect an error that would clearly abort the
restart ...

Thanks,

Oren.

> 
>>> +   if (error < 0) {
>>> +           /* TODO closeup epfile and epfd */
>>> +           fput(epfile);
>>> +           sys_close(epfd);
>> sys_close() should happen regardless of whether we succeeded or
>> failed.
> 
> OK, for some reason I thought restore_file_desc() tried fget() before
> resorting to attach_file()...
> 
> Thanks for the review.
> 
> Cheers,
>       -Matt Helsley
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers
_______________________________________________
Containers mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/containers

_______________________________________________
Devel mailing list
[email protected]
https://openvz.org/mailman/listinfo/devel

Reply via email to