On Fri, Oct 22, 2010 at 4:25 PM, Nicholas Marriott <nicholas.marri...@gmail.com> wrote: > I think something needs to walk the file->f_ep_links list on close() and > remove any epitems where epi->ffd->fd is the fd being closed from the > tree in epi->ep. > > I don't have a Linux box to try this on though :-).
I think you're right. I don't do kernel hacking myself though, so maybe somebody else should take this one on. In the meantime, this doesn't help people using any of the kernels currently in the wild. Fortunately, part of the point of Libevent is to work around weird kernel behaviors. Since the epoll tree _does_ keep a separate struct epitem for every (file, fd) pair, I believe that we can work around this behavior by having epoll fall back to EPOLL_CTL_MOD when EPOLL_CTL_ADD fails with EEXIST. Incidentally, there's another, more sinister bug hiding here. When Libevent's changelist code coalesces a (DEL,ADD) combo, rather than making it a no-op, it makes it into an "ADD", since it's possible that the fd was closed in between the DEL and the ADD. It calls these adds "precautionary". When the current epoll code encounters such an add and the add fails with EEXIST, we currently ignore the failure with a debug message. But if the user has been using dup() in the same way as Gilad's original code, this means that they'll think they have added the event, when in fact they haven't. I've attached a series of two patches that try to get working behavior here. See the commit messages for details. Please test them if you can; they'll probably go into 2.0.9 if nobody finds them to be buggy. They are also branch epoll_dup in my github repo. In the process, I also decided that maybe epoll_apply_changes should have fewer conditional branches, since they tend to be a poor choice on modern CPUs. I've got a branch epoll_table in my github repo that replaces the huge pile of conditionals at the start of eopll_apply_changes with a table lookup. It is *NOT* under consideration for 2.0, since it's a relatively large change and it doesn't fix a bug, but if anybody wants to have a look, that would be cool tool. (My github repo is at git://github.com/nmathewson/Libevent.git ) yrs, -- Nick
From c281aba30e69a501fc183d068894bfa47f891700 Mon Sep 17 00:00:00 2001 From: Nick Mathewson <ni...@torproject.org> Date: Sun, 24 Oct 2010 11:38:29 -0400 Subject: [PATCH 1/2] Fix a nasty bug related to use of dup() with epoll on Linux Current versions of the Linux kernel don't seem to remove the struct epitem for a given (file,fd) combo when the fd is closed unless the file itself is also completely closed. This means that if you do: fd = dup(fd_orig); add(fd); close(fd); dup2(fd_orig, fd); add(fd); you will get an EEXIST when you should have gotten a success. This could cause warnings and dropped events when using dup and epoll. The solution is pretty simple: when we get an EEXIST from EPOLL_CTL_ADD, we retry with EPOLL_CTL_MOD. Unit test included to demonstrate the bug. Found due to the patient efforts of Gilad Benjamini; diagnosed with help from Nicholas Marriott. --- epoll.c | 32 ++++++++++++++---------- test/regress.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 92 insertions(+), 13 deletions(-) diff --git a/epoll.c b/epoll.c index b574bf4..9c8f0e9 100644 --- a/epoll.c +++ b/epoll.c @@ -155,7 +155,6 @@ epoll_apply_changes(struct event_base *base) int op, events; for (i = 0; i < changelist->n_changes; ++i) { - int precautionary_add = 0; ch = &changelist->changes[i]; events = 0; @@ -203,13 +202,12 @@ epoll_apply_changes(struct event_base *base) on the fd, we need to try the ADD anyway, in case the fd was closed at some in the middle. If it wasn't, the ADD operation - will fail with; that's okay. - */ - precautionary_add = 1; + will fail with EEXIST; and we retry it as a + MOD. */ + op = EPOLL_CTL_ADD; } else if (ch->old_events) { op = EPOLL_CTL_MOD; } - } else if ((ch->read_change & EV_CHANGE_DEL) || (ch->write_change & EV_CHANGE_DEL)) { /* If we're deleting anything, we'll want to do a MOD @@ -255,14 +253,22 @@ epoll_apply_changes(struct event_base *base) (int)epev.events, ch->fd)); } - } else if (op == EPOLL_CTL_ADD && errno == EEXIST && - precautionary_add) { - /* If a precautionary ADD operation fails with - EEXIST, that's fine too. - */ - event_debug(("Epoll ADD(%d) on fd %d gave %s: ADD was redundant", - (int)epev.events, - ch->fd, strerror(errno))); + } else if (op == EPOLL_CTL_ADD && errno == EEXIST) { + /* If an ADD operation fails with EEXIST, + * either the operation was redundant (as with a + * precautionary add), or we ran into a fun + * kernel bug where using dup*() to duplicate the + * same file into the same fd gives you the same epitem + * rather than a fresh one. For the second case, + * we must retry with MOD. */ + if (epoll_ctl(epollop->epfd, EPOLL_CTL_MOD, ch->fd, &epev) == -1) { + event_warn("Epoll ADD(%d) on %d retried as MOD; that failed too", + (int)epev.events, ch->fd); + } else { + event_debug(("Epoll ADD(%d) on %d retried as MOD; succeeded.", + (int)epev.events, + ch->fd)); + } } else if (op == EPOLL_CTL_DEL && (errno == ENOENT || errno == EBADF || errno == EPERM)) { diff --git a/test/regress.c b/test/regress.c index 21c921b..0cc4017 100644 --- a/test/regress.c +++ b/test/regress.c @@ -2061,6 +2061,76 @@ end: } } +#ifndef WIN32 +/* You can't do this test on windows, since dup2 doesn't work on sockets */ + +static void +dfd_cb(evutil_socket_t fd, short e, void *data) +{ + *(int*)data = (int)e; +} + +/* Regression test for our workaround for a fun epoll/linux related bug + * where fd2 = dup(fd1); add(fd2); close(fd2); dup2(fd1,fd2); add(fd2) + * will get you an EEXIST */ +static void +test_dup_fd(void *arg) +{ + struct basic_test_data *data = arg; + struct event_base *base = data->base; + struct event *ev1=NULL, *ev2=NULL; + int fd, dfd=-1; + int ev1_got, ev2_got; + + tt_int_op(write(data->pair[0], "Hello world", + strlen("Hello world")), >, 0); + fd = data->pair[1]; + + dfd = dup(fd); + tt_int_op(dfd, >=, 0); + + ev1 = event_new(base, fd, EV_READ|EV_PERSIST, dfd_cb, &ev1_got); + ev2 = event_new(base, dfd, EV_READ|EV_PERSIST, dfd_cb, &ev2_got); + ev1_got = ev2_got = 0; + event_add(ev1, NULL); + event_add(ev2, NULL); + event_base_loop(base, EVLOOP_ONCE); + tt_int_op(ev1_got, ==, EV_READ); + tt_int_op(ev2_got, ==, EV_READ); + + /* Now close and delete dfd then dispatch. We need to do the + * dispatch here so that when we add it later, we think there + * was an intermediate delete. */ + close(dfd); + event_del(ev2); + ev1_got = ev2_got = 0; + event_base_loop(base, EVLOOP_ONCE); + tt_want_int_op(ev1_got, ==, EV_READ); + tt_int_op(ev2_got, ==, 0); + + /* Re-duplicate the fd. We need to get the same duplicated + * value that we closed to provoke the epoll quirk. Also, we + * need to change the events to write, or else the old lingering + * read event will make the test pass whether the change was + * successful or not. */ + tt_int_op(dup2(fd, dfd), ==, dfd); + event_free(ev2); + ev2 = event_new(base, dfd, EV_WRITE|EV_PERSIST, dfd_cb, &ev2_got); + event_add(ev2, NULL); + ev1_got = ev2_got = 0; + event_base_loop(base, EVLOOP_ONCE); + tt_want_int_op(ev1_got, ==, EV_READ); + tt_int_op(ev2_got, ==, EV_WRITE); + +end: + if (ev1) + event_free(ev1); + if (ev2) + event_free(ev2); + close(dfd); +} +#endif + #ifdef _EVENT_DISABLE_MM_REPLACEMENT static void test_mm_functions(void *arg) @@ -2229,6 +2299,9 @@ struct testcase_t main_testcases[] = { { "event_once", test_event_once, TT_ISOLATED, &basic_setup, NULL }, { "event_pending", test_event_pending, TT_ISOLATED, &basic_setup, NULL }, +#ifndef WIN32 + { "dup_fd", test_dup_fd, TT_ISOLATED, &basic_setup, NULL }, +#endif { "mm_functions", test_mm_functions, TT_FORK, NULL, NULL }, BASIC(many_events, TT_ISOLATED), -- 1.7.2.3
From 2c66983a6d46cbd863b079a1c92c916b41efe0ba Mon Sep 17 00:00:00 2001 From: Nick Mathewson <ni...@torproject.org> Date: Sun, 24 Oct 2010 11:51:14 -0400 Subject: [PATCH 2/2] Simplify the logic for choosing EPOLL_CTL_ADD vs EPOLL_CTL_MOD Previously, we chose "ADD" whenever old_events==new_events, (since we expected the add to fail with EEXIST), or whenever old_events was==0, and MOD otherwise (i.e., when old_events was nonzero and not equal to new_events). But now that we retry failed MOD events as ADD *and* failed ADD events as MOD, the important thing is now to try to guess right the largest amount of the time, since guessing right means we do only one syscall, but guessing wrong means we do two. When old_events is 0, ADD is probably right (unless we're hitting the dup bug, when we'll fall back). And when old_events is set and != new_events, MOD is almost certainly right for the same reasons as before. But when old_events is equal to new events, then MOD will work fine unless we closed and reopened the fd, in which case we'll have to fall back to the ADD case. (Redundant del/add pairs are more common than closes for most use cases.) This change lets us avoid calculating new_events, which ought to save a little time in epoll.c --- epoll.c | 35 +++++++++++++++++++---------------- 1 files changed, 19 insertions(+), 16 deletions(-) diff --git a/epoll.c b/epoll.c index 9c8f0e9..672025f 100644 --- a/epoll.c +++ b/epoll.c @@ -169,43 +169,46 @@ epoll_apply_changes(struct event_base *base) */ + /* TODO: Turn this into a switch or a table lookup. */ + if ((ch->read_change & EV_CHANGE_ADD) || (ch->write_change & EV_CHANGE_ADD)) { /* If we are adding anything at all, we'll want to do * either an ADD or a MOD. */ - short new_events = ch->old_events; events = 0; op = EPOLL_CTL_ADD; if (ch->read_change & EV_CHANGE_ADD) { events |= EPOLLIN; - new_events |= EV_READ; } else if (ch->read_change & EV_CHANGE_DEL) { - new_events &= ~EV_READ; + ; } else if (ch->old_events & EV_READ) { events |= EPOLLIN; } if (ch->write_change & EV_CHANGE_ADD) { events |= EPOLLOUT; - new_events |= EV_WRITE; } else if (ch->write_change & EV_CHANGE_DEL) { - new_events &= ~EV_WRITE; + ; } else if (ch->old_events & EV_WRITE) { events |= EPOLLOUT; } if ((ch->read_change|ch->write_change) & EV_ET) events |= EPOLLET; - if (new_events == ch->old_events) { - /* - If the changelist has an "add" operation, - but no visible change to the events enabled - on the fd, we need to try the ADD anyway, in - case the fd was closed at some in the - middle. If it wasn't, the ADD operation - will fail with EEXIST; and we retry it as a - MOD. */ - op = EPOLL_CTL_ADD; - } else if (ch->old_events) { + if (ch->old_events) { + /* If MOD fails, we retry as an ADD, and if + * ADD fails we will retry as a MOD. So the + * only hard part here is to guess which one + * will work. As a heuristic, we'll try + * MOD first if we think there were old + * events and ADD if we think there were none. + * + * We can be wrong about the MOD if the file + * has in fact been closed and re-opened. + * + * We can be wrong about the ADD if the + * the fd has been re-created with a dup() + * of the same file that it was before. + */ op = EPOLL_CTL_MOD; } } else if ((ch->read_change & EV_CHANGE_DEL) || -- 1.7.2.3