There were a couple of errors in the code when I posted my last message. I have fixed those. The epoll bug still occurs.
-Andi

On Dec 13, 2012, at 7:16 PM, Andreas Voellmy <andreas.voel...@yale.edu> wrote:

> I believe I have found a bug in epoll. This bug causes the behavior I
> described in earlier emails. The bug is caused by the interaction of epoll
> instances which share no files in common.
>
> I wrote a C program that behaves similarly to my original program and
> triggers the bug. The bug only arises when I use enough cores and threads
> (about 16). The program is here:
> https://github.com/AndreasVoellmy/epollbug/blob/master/epollbug.c
> This program is a super-stripped down http server. It uses a number of
> threads that serve requests, each with its own epoll instance. There is also
> a "wakeup" thread that simply monitors an eventfd file and reads from the
> eventfd file when woken. All the worker threads write to the eventfd file
> when they process a request. This probably seems like a strange program, but
> something like this came up in a real system.
>
> I test the program using the weighttp http request generator
> (http://redmine.lighttpd.net/projects/weighttp/wiki). You need to test with
> enough requests, enough concurrent clients, and enough worker threads to
> create the problem. For example, I run with
> './weighttp -n 400000 -c 500 -t 6 -k "10.12.0.1:8080"'.
> With 16 cores for the server program (epollbug.c) this test workload
> triggers the bug about once every 3 runs. The server (epollbug.c) has been
> hardcoded to work with whatever specific request weighttp sends it. You need
> to find out what weighttp is sending from your test machine and then put
> that at the top of epollbug.c. You will see where it goes. You can uncomment
> the SHOW_DEBUG flag at the top of the program and run weighttp against it,
> and it will print the request weighttp is sending. Then update
> EXPECTED_HTTP_REQUEST with whatever you get.
>
> I am running Linux 3.4.0.0.
>
> Cheers,
> Andi
>
> On Dec 13, 2012, at 10:29 AM, Andreas Voellmy <andreas.voel...@yale.edu>
> wrote:
>
>> Hi Eric,
>>
>> On Dec 13, 2012, at 4:32 AM, Eric Wong <normalper...@yhbt.net> wrote:
>>
>>> Andreas Voellmy <andreas.voel...@yale.edu> wrote:
>>>
>>>>> Another thread, distinct from all of the threads serving particular
>>>>> sockets, is performing epoll_wait calls. When sockets are returned as
>>>>> being ready from an epoll_wait call, the thread signals to the
>>>>> condition variable for the socket.
>>>
>>> Perhaps there is a bug in the way your epoll_wait thread
>>> uses the condition variable to notify other threads?
>>>
>>
>> This is possible; I've tried very hard (e.g. I added assertions to check
>> various error conditions) to ensure that there is no problem in signaling
>> the other threads. From everything I can tell, it is working properly.
>>
>>>
>>>>> The problem I am encountering is that sometimes a thread will block
>>>>> waiting for the readiness signal and will never get notified, even
>>>>> though there is data to be read. This behavior seems to go away when
>>>>> I remove the EPOLLONESHOT flag when registering the event.
>>>
>>> Is the thread the one waiting on the condition variable or epoll_wait?
>>> In your situation (stream I/O via multiple threads, single epoll
>>> descriptor), I think EPOLLONESHOT is the /only/ sane thing to do.
>>
>> The one waiting on the condition variable.
>>
>> I think I've narrowed down the problem a bit more. In my program I have
>> multiple epoll instances. Most of the epoll instances are for monitoring
>> sockets. One is used for monitoring an eventfd that is written to by other
>> threads. The problem only occurs when I write to the eventfd after servicing
>> each http request on a socket; i.e. the epoll monitoring the eventfd is
>> returning from a blocking epoll_wait call very frequently.
>> If I don't do that write, or if I use a different notification facility, for
>> example poll, to monitor the eventfd, then the problem goes away. So it looks
>> like there may be some way in which different epoll instances can interfere
>> with each other.
>>
>> Probably this setup sounds weird to you, but I'm trying to spare you from
>> understanding my whole application; this is part of a multicore runtime
>> system for a programming language with user-level threads and to explain the
>> full story of this would probably take more time than you want to spend.
>> But I can provide more detail if you like.
>>
>> -Andi
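
For anyone who does not want to open epollbug.c, the eventfd wakeup path described in this thread can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the repository: the names run_demo, worker, wakeup, struct demo and MAX_WORKERS are made up for the sketch, the sockets and the per-worker epoll instances are left out entirely, and it is not expected to reproduce the hang. It only shows many threads writing to one eventfd while a separate thread blocks in epoll_wait on its own epoll instance and drains the counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define MAX_WORKERS 64  /* arbitrary cap for this sketch */

/* Hypothetical shared state; not taken from epollbug.c. */
struct demo {
    int efd;            /* eventfd every worker writes to */
    int writes;         /* writes per worker ("requests served") */
    uint64_t expected;  /* total writes across all workers */
    uint64_t seen;      /* total the wakeup thread has read back */
};

/* Stand-in for a worker: a real one would serve sockets from its own
 * epoll instance and write to the eventfd after each request. */
static void *worker(void *arg)
{
    struct demo *d = arg;
    uint64_t one = 1;
    for (int i = 0; i < d->writes; i++)
        if (write(d->efd, &one, sizeof one) != (ssize_t)sizeof one)
            abort();
    return NULL;
}

/* Wakeup thread: a separate epoll instance that watches only the eventfd. */
static void *wakeup(void *arg)
{
    struct demo *d = arg;
    struct epoll_event ev = { .events = EPOLLIN };
    struct epoll_event out;
    uint64_t count;
    int ep = epoll_create1(0);

    ev.data.fd = d->efd;
    if (ep < 0 || epoll_ctl(ep, EPOLL_CTL_ADD, d->efd, &ev) != 0)
        abort();
    while (d->seen < d->expected) {
        if (epoll_wait(ep, &out, 1, -1) <= 0)
            continue;  /* interrupted; try again */
        /* In the default (non-semaphore) mode a read returns the sum of
         * all writes since the last read and resets the counter. */
        if (read(d->efd, &count, sizeof count) == (ssize_t)sizeof count)
            d->seen += count;
    }
    close(ep);
    return NULL;
}

/* Runs the sketch and returns how many writes the wakeup thread saw. */
uint64_t run_demo(int workers, int writes)
{
    struct demo d = { .writes = writes };
    pthread_t wk, th[MAX_WORKERS];

    if (workers < 0 || writes < 0)
        return 0;
    if (workers > MAX_WORKERS)
        workers = MAX_WORKERS;
    d.efd = eventfd(0, 0);
    d.expected = (uint64_t)workers * (uint64_t)writes;
    if (d.efd < 0 || pthread_create(&wk, NULL, wakeup, &d) != 0)
        abort();
    for (int i = 0; i < workers; i++)
        if (pthread_create(&th[i], NULL, worker, &d) != 0)
            abort();
    for (int i = 0; i < workers; i++)
        pthread_join(th[i], NULL);
    pthread_join(wk, NULL);
    close(d.efd);
    assert(d.seen == d.expected);  /* every write was eventually read */
    return d.seen;
}
```

This is Linux-only and needs to be built with -pthread. Calling run_demo(4, 1000) from a small main() should return 4000 once all four workers have finished. The eventfd is left in blocking mode on purpose: the wakeup thread only reads after epoll_wait reports it readable, so the read should not block in this single-reader setup.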