On Tue, 20 Mar 2007 11:37:14 -0700
Davide Libenzi <davidel@xmailserver.org> wrote:

> This is a very simple and light file descriptor, that can be used as
> event wait/dispatch by userspace (both wait and dispatch) and by the
> kernel (dispatch only). It can be used instead of pipe(2) in all cases
> where those would simply be used to signal events. Their kernel overhead
> is much lower than pipes, and they do not consume two fds. When used in
> the kernel, it can offer an fd-bridge to enable, for example, functionalities
> like KAIO or syslets/threadlets to signal to an fd the completion of certain
> operations. But more in general, an eventfd can be used by the kernel to
> signal readiness, in a POSIX poll/select way, of interfaces that would
> otherwise be incompatible with it. The API is:
> 
> int eventfd(unsigned int count);
>
> The eventfd API accepts an initial "count" parameter, and returns an
> eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and 
> write(2).
> The POLLIN flag is raised when the internal counter is greater than zero.
> The POLLOUT flag is raised when at least a value of "1" can be written to
> the internal counter.
> The POLLERR flag is raised when an overflow in the counter value is detected.
> The write(2) operation can never overflow the counter, since it blocks
> (unless O_NONBLOCK is set, in which case -EAGAIN is returned).
> But the eventfd_signal() function can do it, since it's supposed to not
> sleep during its operation.
> The read(2) function reads the __u64 counter value, and reset the internal
> value to zero. If the value read is equal to (__u64) -1, an overflow
> happened on the internal counter (due to 2^64 eventfd_signal() posts
> that has never been retired - unlickely, but possible).
> The write(2) call writes an __u64 count value, and adds it
> to the current counter. The eventfd fd supports O_NONBLOCK also.
> On the kernel side, we have:
> 
> struct file *eventfd_fget(int fd);
> int eventfd_signal(struct file *file, unsigned int n);
> 
> The eventfd_fget() should be called to get a struct file* from an eventfd
> fd (this is an fget() + check of f_op being an eventfd fops pointer).
> The kernel can then call eventfd_signal() every time it wants to post
> an event to userspace. The eventfd_signal() function can be called from any
> context.
> An eventfd() simple test and bench is available here:
> 
> http://www.xmailserver.org/eventfd-bench.c
> 
> This is the eventfd-based version of pipetest-4 (pipe(2) based):
> 
> http://www.xmailserver.org/pipetest-4.c
> 
> Not that performance matters much in the eventfd case, but eventfd-bench
> shows almost as double as performance than pipetest-4.
> 
>...
> 
> Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.21-rc3.quilt/fs/eventfd.c       2007-03-19 19:01:58.000000000 
> -0700
> @@ -0,0 +1,237 @@
> +/*
> + *  fs/eventfd.c
> + *
> + *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
> + *
> + */
> +
> +#include <linux/file.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/jiffies.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/eventfd.h>
> +
> +#include <asm/uaccess.h>
> +
> +
> +
> +struct eventfd_ctx {
> +     spinlock_t lock;
> +     wait_queue_head_t wqh;
> +     __u64 count;
> +};

Again, can we borrow wqh.lock?

`count' needs documentation - these things are key to understanding the
code.

> +
> +
> +static int eventfd_close(struct inode *inode, struct file *file);
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait);
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t 
> count,
> +                         loff_t *ppos);
> +static ssize_t eventfd_write(struct file *file, const char __user *buf, 
> size_t count,
> +                          loff_t *ppos);

dittos

> +
> +
> +

dittos

> +static const struct file_operations eventfd_fops = {
> +     .release        = eventfd_close,

dittos

> +     .poll           = eventfd_poll,
> +     .read           = eventfd_read,
> +     .write          = eventfd_write,
> +};
> +
> +
> +
> +struct file *eventfd_fget(int fd)
> +{
> +     struct file *file;
> +
> +     file = fget(fd);
> +     if (!file)
> +             return ERR_PTR(-EBADF);
> +     if (file->f_op != &eventfd_fops) {
> +             fput(file);
> +             return ERR_PTR(-EINVAL);
> +     }
> +
> +     return file;
> +}
> +
> +/*
> + * This function is supposed to be called by the kernel in paths
> + * that do not allow to sleep. In this function we allow the counter

"that do not allow sleeping"

> + * to reach the ULLONG_MAX value, and we signal this as overflow
> + * condition by returining a POLLERR to poll(2).

typo

> + */

So it is the caller's responsibility to ensure that *file refers to an
eventfd file?

> +int eventfd_signal(struct file *file, int n)
> +{
> +     struct eventfd_ctx *ctx = file->private_data;
> +     unsigned long flags;
> +
> +     if (n < 0)
> +             return -EINVAL;
> +     spin_lock_irqsave(&ctx->lock, flags);
> +     if (ULLONG_MAX - ctx->count < n)
> +             n = (int) (ULLONG_MAX - ctx->count);
> +     ctx->count += n;
> +     if (waitqueue_active(&ctx->wqh))
> +             wake_up_locked(&ctx->wqh);
> +     spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +     return n;
> +}

Neither the incoming arg (usefully named "n") nor the return value are
documented.


> +asmlinkage long sys_eventfd(unsigned int count)
> +{
> +     int error, fd;
> +     struct eventfd_ctx *ctx;
> +     struct file *file;
> +     struct inode *inode;
> +
> +     ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> +     if (!ctx)
> +             return -ENOMEM;
> +
> +     init_waitqueue_head(&ctx->wqh);
> +     spin_lock_init(&ctx->lock);
> +     ctx->count = count;
> +
> +     /*
> +      * When we call this, the initialization must be complete, since
> +      * aino_getfd() will install the fd.
> +      */
> +     error = aino_getfd(&fd, &inode, &file, "[eventfd]",
> +                        &eventfd_fops, ctx);
> +     if (!error)
> +             return fd;
> +
> +     kfree(ctx);
> +     return error;
> +}

Documentation?

> +static int eventfd_close(struct inode *inode, struct file *file)
> +{
> +     kfree(file->private_data);
> +     return 0;
> +}
> +
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait)
> +{
> +     struct eventfd_ctx *ctx = file->private_data;
> +     unsigned int events = 0;
> +     unsigned long flags;
> +
> +     poll_wait(file, &ctx->wqh, wait);
> +
> +     spin_lock_irqsave(&ctx->lock, flags);
> +     if (ctx->count > 0)
> +             events |= POLLIN;
> +     if (ctx->count == ULLONG_MAX)
> +             events |= POLLERR;
> +     if (ULLONG_MAX - 1 > ctx->count)
> +             events |= POLLOUT;
> +     spin_unlock_irqrestore(&ctx->lock, flags);
> +
> +     return events;
> +}
> +
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t 
> count,
> +                         loff_t *ppos)
> +{
> +     struct eventfd_ctx *ctx = file->private_data;
> +     ssize_t res;
> +     __u64 ucnt;
> +     DECLARE_WAITQUEUE(wait, current);
> +
> +     if (count < sizeof(ucnt))
> +             return -EINVAL;
> +     spin_lock_irq(&ctx->lock);
> +     res = -EAGAIN;
> +     ucnt = ctx->count;
> +     if (ucnt > 0)
> +             res = sizeof(ucnt);
> +     else if (!(file->f_flags & O_NONBLOCK)) {
> +             __add_wait_queue(&ctx->wqh, &wait);
> +             for (res = 0;;) {
> +                     set_current_state(TASK_INTERRUPTIBLE);
> +                     if (ctx->count > 0) {
> +                             ucnt = ctx->count;
> +                             res = sizeof(ucnt);
> +                             break;
> +                     }
> +                     if (signal_pending(current)) {
> +                             res = -ERESTARTSYS;
> +                             break;
> +                     }
> +                     spin_unlock_irq(&ctx->lock);
> +                     schedule();
> +                     spin_lock_irq(&ctx->lock);
> +             }
> +             __remove_wait_queue(&ctx->wqh, &wait);
> +             __set_current_state(TASK_RUNNING);
> +     }
> +     if (res > 0) {
> +             ctx->count = 0;
> +             if (waitqueue_active(&ctx->wqh))
> +                     wake_up_locked(&ctx->wqh);
> +     }
> +     spin_unlock_irq(&ctx->lock);
> +     if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
> +             return -EFAULT;
> +
> +     return res;
> +}

Needs interface documentation, please.  Even the changelog doesn't tell us
what an EAGAIN return from read() means.

> +static ssize_t eventfd_write(struct file *file, const char __user *buf, 
> size_t count,
> +                          loff_t *ppos)
> +{
> +     struct eventfd_ctx *ctx = file->private_data;
> +     ssize_t res;
> +     __u64 ucnt;
> +     DECLARE_WAITQUEUE(wait, current);
> +
> +     if (count < sizeof(ucnt))
> +             return -EINVAL;
> +     if (get_user(ucnt, (const __u64 __user *) buf))
> +             return -EFAULT;

Some architectures do not implement 64-bit get_user()

> +     if (ucnt == ULLONG_MAX)
> +             return -EINVAL;
> +     spin_lock_irq(&ctx->lock);
> +     res = -EAGAIN;
> +     if (ULLONG_MAX - ctx->count > ucnt)
> +             res = sizeof(ucnt);
> +     else if (!(file->f_flags & O_NONBLOCK)) {
> +             __add_wait_queue(&ctx->wqh, &wait);
> +             for (res = 0;;) {
> +                     set_current_state(TASK_INTERRUPTIBLE);
> +                     if (ULLONG_MAX - ctx->count > ucnt) {
> +                             res = sizeof(ucnt);
> +                             break;
> +                     }
> +                     if (signal_pending(current)) {
> +                             res = -ERESTARTSYS;
> +                             break;
> +                     }
> +                     spin_unlock_irq(&ctx->lock);
> +                     schedule();
> +                     spin_lock_irq(&ctx->lock);
> +             }
> +             __remove_wait_queue(&ctx->wqh, &wait);
> +             __set_current_state(TASK_RUNNING);
> +     }
> +     if (res > 0) {
> +             ctx->count += ucnt;
> +             if (waitqueue_active(&ctx->wqh))
> +                     wake_up_locked(&ctx->wqh);
> +     }
> +     spin_unlock_irq(&ctx->lock);
> +
> +     return res;
> +}
> +

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to