Maybe not too much help but it seems like strace would answer possibilities 1, 2 and 3, that's how I typically track what is happening with epoll inside libevent.
On Fri, Oct 22, 2010 at 01:38:45PM -0400, Nick Mathewson wrote: > [summary: Nick looks for a bug, and finds only weirdness. But I think > we're on to something, and we *are* making progress. Big find: there > doesn't seem to be anything wrong with the changlist logic; something > weird is up with epoll here. Could other people please help me look > into this?] > > On Fri, Oct 22, 2010 at 12:20 AM, Gilad Benjamini > <gi...@altornetworks.com> wrote: > >> -----Original Message----- > >> From: Nick Mathewson <ni...@freehaven.net> > >> Sent: Tuesday, September 21, 2010 7:43 PM > >> To: libevent-us...@freehaven.net > >> Subject: Re: [Libevent-users] epoll erros > >> ... > >> > >> I just added a couple more detailed debugging messages in the git > >> master branch to help try to figure out what exactly we're doing to > >> epoll fd, and why. If you can get the same output with the new stuff > >> from git master, that would be great. > >> > > > > I am correct to assume these additions made it into 2.0.8-rc ? > > Assuming they did, I am attaching the output when run with 2.0.8-rc. > > > > If that's not enough, please direct me to the specific changes and I'll run > > with them > > Hm. Here's the first instance of the bug at issue here. Let's see if > it means anything. I've cut out some lines only having to do with > timeout events, so everything that remains is about the > soon-to-be-infamous fd 13: > > > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870 > > Here's the first appearance of fd 13. It's in event 0x801e7618, which > I'll be calling 618 for short. Note that this is a delete of an event > that hasn't been added yet. That's allowed. > > > [debug] event_add: event: 0x801e7618 (fd 13), EV_WRITE call 0x80105870 > > Then we add that event again. It's an EV_WRITE event. > > > [debug] Epoll ADD(4) on fd 13 okay. [old events were 0; read change was 0; > > write change was 1] > > And now fd 13 hits epoll for the first time. We add it for > writing(4==EPOLLOUT), since the write change is 1 (1==EV_CHANGE_ADD) > > > [debug] epoll_dispatch: epoll_wait reports 1 > > [debug] event_active: 0x801e7618 (fd 13), res 4, callback 0x80105870 > > [debug] event_process_active: event: 0x801e7618, EV_WRITE call 0x80105870 > > The event triggers! We process it.... > > > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870 > > Then delete it. > > > [debug] event_add: event: 0x801e7618 (fd 13), EV_READ call 0x80105870 > > Now we add it. This time 618 is a write event on fd 13. Interesting! > It looks like the architecture of this program is to have one event > per fd (or per fd of this kind), and use event_assign [or event_set if > this is legacy code] to switch it between reading and writing as > needed. > > This is allowed, but there's a possible bug to watch out for: you must > never event_assign [or event_set] an event that has been added without > first deleting it. That is almost certainly not happening here, since > Gilad has said that his code is calling event_enable_debug_mode(), and > that is supposed to catch any attempt to event_assign or event_set on > an added event. > > It won't catch attempts to manually reassign the event's ev_events > while the event is added. I sure hope nobody is doing that. Don't > touch struct event's fields by hand. > > So this is a little bug-prone in the wrong hands, but there's no > actual bug evident here yet. > > > [debug] Epoll MOD(1) on fd 13 okay. [old events were 4; read change was 1; > > write change was 2] > > So now we call epoll_ctl again, with command EPOLL_MOD(1==EPOLLIN). > MOD is right, because fd 13 was added before. The write > change(EV_CHANGE_DEL) is correct because we were previously writing. > The read change is right because we added a read. The old events were > indeed 4==EV_WRITE. > > > [debug] epoll_dispatch: epoll_wait reports 1 > > [debug] event_active: 0x801e7618 (fd 13), res 2, callback 0x80105870 > > [debug] event_process_active: event: 0x801e7618, EV_READ call 0x80105870 > > The read event triggers. > > > [debug] epoll_dispatch: epoll_wait reports 1 > > [debug] event_active: 0x801e7618 (fd 13), res 2, callback 0x80105870 > > [debug] event_process_active: event: 0x801e7618, EV_READ call 0x80105870 > > The read event triggers again. > > > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870 > > [debug] event_add: event: 0x801e7618 (fd 13), EV_WRITE call 0x80105870 > > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870 > > [debug] event_add: event: 0x801e7618 (fd 13), EV_READ call 0x80105870 > > We delete it, then it gets added as a write event, and deleted, and > added again as a read event. > > So that's event_del(read), event_add(write), event_del(write), and > event_add(read). ADD-then-DEL coalesces into a no-op, but > DEL-then-ADD coalesces into an ADD that might be redundant. > > > [debug] Epoll ADD(1) on fd 13 gave File exists: ADD was redundant > > Yup. > > > [debug] epoll_dispatch: epoll_wait reports 1 > > [debug] event_active: 0x801e7618 (fd 13), res 2, callback 0x80105870 > > [debug] event_process_active: event: 0x801e7618, EV_READ call 0x80105870 > > [debug] event_del: 0x801e7618 (fd 13), callback 0x80105870 > > The read triggers, then we delete it. > > > [debug] event_process_active: event: 0x801e76a8, call 0x80102126 > > [debug] Epoll DEL(1) on fd 13 gave Bad file descriptor: DEL was unnecessary. > > When we go to delete it, we get an EBADF. This happens when the file > is closed then we try to do an operation on it. It should be fine. > > > [debug] epoll_dispatch: epoll_wait reports 0 > > [debug] event_process_active: event: 0x801d5a98, call 0x80102126 > > [debug] event_del: 0x801e76a0 (fd 13), callback 0x80105870 > > [debug] event_add: event: 0x801e76a0 (fd 13), EV_WRITE call 0x80105870 > > [debug] event_del: 0x801e76a0 (fd 13), callback 0x80105870 > > [debug] event_add: event: 0x801e76a0 (fd 13), EV_READ call 0x80105870 > > Now here comes a new event on fd 13: 0x801e76a0 (6a0 for short). > Let's see what happens. We delete an (unknown) event, then add it for > writing, then delete that, then add it again for reading. > > > [warn] Epoll ADD(1) on fd 13 failed. Old events were 0; read change was 1 > > (add); write change was 0 (none): File exists > > And boom, there's the bug. This is not consistent with the sequence > above; if you replay the Epoll lines above with a test socketpair, and > try to close and reopen the socket at the appropriate time, you'll > find that you don't get the EEXIST here. I've attached a quick test > program to demonstrate that this sequence of operations shouldn't get > an EEXIST. > > To recap, the sequence of epoll events on 13 was: > 1: [debug] Epoll ADD(4) on fd 13 okay. [old events were 0; read change > was 0; write change was 1] > 2: [debug] Epoll MOD(1) on fd 13 okay. [old events were 4; read change > was 1; write change was 2] > 3: [debug] Epoll ADD(1) on fd 13 gave File exists: ADD was redundant > 4: (close the fd here, or else you won't see an EBADF in the next step) > 5: [debug] Epoll DEL(1) on fd 13 gave Bad file descriptor: DEL was > unnecessary. > 6: (reopen the fd here, or else you *will* see an EBADF in the next step.) > 7: [warn] Epoll ADD(1) on fd 13 failed. Old events were 0; read > change was 1 (add); write change was 0 (none): File exists > > But when I run my test program with Linux 2.6.34.7, I get: > 1: add(OUT) OK > 2: mod(IN) OK > 3: add(IN) 17: File exists > 4: close > 5: del(IN) 9: Bad file descriptor > 6: reopen > 7: add(IN) OK > > So, something's up! The logs have an eexist, an EEXIST, but the Epoll > operations above don't cause an EEXIST on their own. > > A couple of possibilities, none of them very pleasant looking: > 1) The debug logs are wrong. There are more epoll_ctl operations > going on than are getting logged. > 2) I am wrong about the close; something else causes the EBADF on step 5. > 3) I am wrong about the reopen; something else prevents an EBADF at step 7. > 4) The debug logs are confused somehow; the events are all happening, > and no other events are happening, but things are somehow not as they > seem and there is missing information that would make this all clear. > 5) My test harness is somehow wrong for your situation; maybe if you > use pipes or something, everthing is > > 1 seems unlikely from a simple read-through of the libevent code: > there is no way to call epoll_ctl without a message getting logged as > far as I can tell. > > If 2, then what else might cause an EBADF at step 5? If 3, then what > else might prevent an EBADF at step 7? (A cursory read of > fs/eventpoll.c in the Linux 2.6.34.7 source turns up nothing for > either.) > > 4 is conceivable under a couple of weird circumstances. Might there > be more than one event_base in the application, so that fd 13 is > actually getting added/removed from separate event_bases? Might there > be multithreading issues? Is the application forking, then calling > (or not calling) event_reinit, then using Libevent stuff? > > 5 is also conceivable; if you can find some way to tweak the attached > program so that step 7 gets an eexist from a series of operations > similar to those in your program as logged above, I'd love to know. > > -- > Nick *********************************************************************** To unsubscribe, send an e-mail to majord...@freehaven.net with unsubscribe libevent-users in the body.