On 08/05/2015 07:06 AM, Madars Vitolins wrote:
> Jason Baron @ 2015-08-04 18:02 wrote:
>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>> Madars Vitolins <m...@silodev.com> wrote:
>>>> Hi Folks,
>>>>
>>>> I am developing a kind of open-systems application which uses
>>>> multiple processes/executables, each of which monitors some set
>>>> of resources (in this case POSIX queues) via the epoll interface.
>>>> For example, when 10 processes are in epoll_wait() on the same
>>>> queue and one message arrives, all 10 processes get woken up and
>>>> all of them try to read the message from the queue. One succeeds,
>>>> the others get EAGAIN. The problem is those others, which generate
>>>> extra context switches - useless CPU usage. The more processes,
>>>> the higher the inefficiency.
>>>>
>>>> I tried to use EPOLLONESHOT, but it did not help. It seems to be
>>>> suitable for multi-threaded applications and not for multi-process
>>>> applications.
>>>
>>> Correct. Most FDs are not shared across processes.
>>>
>>>> The ideal mechanism for this would be:
>>>> 1. If multiple epoll sets in the kernel match the same event and
>>>> one or more processes are in epoll_wait(), then send the event to
>>>> only one waiter.
>>>> 2. If no process is in the wait state, then send the event to all
>>>> epoll sets (as it is currently). The first free process will then
>>>> grab the event.
>>>
>>> Jason Baron was working on this (search LKML archives for
>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>
>>> However, I was unconvinced about modifying epoll.
>>>
>>> Perhaps I may be more easily convinced about your mqueue case than
>>> his case for listen sockets, though[*]
>>>
>>
>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you can have
>> multiple epoll fds (or epoll sets) attached to the same wakeup
>> source, and have the wakeups 'rotate' among the epoll sets. The
>> wakeup essentially walks the list of waiters, wakes up the first
>> thread that is actively in epoll_wait(), stops, and moves the
>> woken-up epoll set to the end of the list. So it attempts to
>> balance the wakeups among the epoll sets, I think in the way that
>> you were describing.
>>
>> Here is the patchset:
>>
>> https://lkml.org/lkml/2015/2/24/667
>>
>> The test program shows how to use the API. Essentially, you have
>> to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag, which
>> you then attach to your shared wakeup source and then to your
>> epoll sets. Please let me know if it's unclear.
>>
>> Thanks,
>>
>> -Jason
>
> In my particular case I need to work with multiple running
> processes/executables (not threads) listening on the same queues.
> This concept allows a sysadmin to manage those processes easily
> (start new ones for load balancing or stop them without service
> interruption), and if any process dies for some reason (signal,
> core dump, etc.), the whole application does not get killed; only
> one transaction is lost.
>
> Recently I ran tests and found out that the kernel's epoll
> currently sends notifications to 4 processes (I think it is the
> EP_MAX_NESTS constant) waiting on the same resource (the other 6
> from my example stay asleep). So it is not as bad as I thought
> before. It would be nice if EP_MAX_NESTS were configurable, but I
> guess 4 is fine too.
>

hmmm... EP_MAX_NESTS is about the level of 'nesting' of epoll sets,
IE you can do ep1->ep2->ep3->ep4-> <wakeup src fd>, but you can't
add in an 'ep5'. The 'epN' above are epoll file descriptors that are
attached together via EPOLL_CTL_ADD. The nesting does not affect how
wakeups are done: all epoll fds that are attached to the event
source fd are going to get wakeups.

> Jason, does your patch work for multi-process applications? How
> hard would it be to implement this for such a scenario?

I don't think it would be too hard, but it requires:

1) adding the patches
2) re-compiling, running the new kernel
3) modifying your app to the new API (roughly sketched below)
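For illustration, here is approximately what 3) would look like,
going by the description above and the test program in the patchset.
This is only a sketch: EPOLL_ROTATE never went into mainline, so the
flag value below is made up, the queue name is hypothetical, and
error handling is omitted. The dummy fd also has to be shared by all
the processes, e.g. inherited across fork().

#include <sys/epoll.h>
#include <mqueue.h>
#include <fcntl.h>

#ifndef EPOLL_ROTATE
#define EPOLL_ROTATE (1 << 1)   /* made-up value, illustration only */
#endif

int main(void)
{
    struct epoll_event ev = { .events = EPOLLIN };

    /* shared wakeup source: an existing POSIX queue (on Linux a
     * mqd_t is a file descriptor, so it can go into an epoll set) */
    mqd_t mq = mq_open("/myq", O_RDONLY);

    /* 1) dummy epoll fd, created with the proposed rotate flag */
    int dummy = epoll_create1(EPOLL_ROTATE);

    /* 2) attach the shared wakeup source to the dummy set */
    ev.data.fd = (int)mq;
    epoll_ctl(dummy, EPOLL_CTL_ADD, (int)mq, &ev);

    /* 3) each process (e.g. a child after fork(), so 'dummy' is
     * shared) attaches the dummy fd to its own epoll set and waits
     * as usual; the kernel rotates wakeups among the sets */
    int epfd = epoll_create1(0);
    ev.data.fd = dummy;
    epoll_ctl(epfd, EPOLL_CTL_ADD, dummy, &ev);

    struct epoll_event out;
    epoll_wait(epfd, &out, 1, -1);
    return 0;
}

The idea is that every process funnels its interest in the queue
through the single rotate-enabled epoll fd, so the kernel can pick
one waiter per message instead of waking them all.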
Thanks,

-Jason

>
> Madars
>
>>
>>> Typical applications have few (probably only one) listen sockets
>>> or POSIX mqueues; so I would rather use dedicated threads to issue
>>> blocking syscalls (accept4 or mq_timedreceive).
>>>
>>> Making blocking syscalls allows exclusive wakeups to avoid
>>> thundering herds.
>>>
>>>> How do you think, would it be realistic to implement this? What
>>>> about concurrency?
>>>> Can you please give me some hints on which points in the code to
>>>> start from to implement these changes?
>>>
>>> For now, I suggest dedicating a thread in each process to do
>>> mq_timedreceive/mq_receive, assuming you only have a small number
>>> of queues in your system.
>>>
>>>
>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>> staying on the same CPU as much as possible.
>>> By contrast, accept4 only creates a client socket. With a C10K+
>>> socket server (e.g. http/memcached/DB), a typical new client
>>> socket spends a fair amount of time idle. Thus I don't believe
>>> memory locality inside the kernel is much of a concern when there
>>> are thousands of accepted client sockets.
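For reference, the dedicated-receiver-thread approach Eric suggests
above would look roughly like this (again just a sketch: the queue
name and buffer size are made up, and real code would need proper
shutdown and error handling; build with -lrt -lpthread):

#include <mqueue.h>
#include <pthread.h>
#include <fcntl.h>
#include <stdio.h>

static void *mq_worker(void *arg)
{
    mqd_t mq = *(mqd_t *)arg;
    char buf[8192];          /* must be >= the queue's mq_msgsize */
    unsigned int prio;

    for (;;) {
        /* blocking receive: the kernel wakes exactly one of the
         * blocked receivers per message, so there is no thundering
         * herd across processes */
        ssize_t n = mq_receive(mq, buf, sizeof(buf), &prio);
        if (n < 0) {
            perror("mq_receive");
            continue;
        }
        /* hand the message off to the rest of the process here */
        printf("got %zd bytes at priority %u\n", n, prio);
    }
    return NULL;
}

int main(void)
{
    mqd_t mq = mq_open("/myq", O_RDONLY);   /* hypothetical queue */
    pthread_t tid;

    if (mq == (mqd_t)-1) {
        perror("mq_open");
        return 1;
    }
    pthread_create(&tid, NULL, mq_worker, &mq);
    pthread_join(tid, NULL);   /* runs forever in this sketch */
    return 0;
}

Each process would run one such thread per queue, so starting or
stopping processes just adds or removes blocked receivers, which
fits the management model Madars describes.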