On 08/05/2015 07:06 AM, Madars Vitolins wrote:
> Jason Baron @ 2015-08-04 18:02 wrote:
>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>> Madars Vitolins <m...@silodev.com> wrote:
>>>> Hi Folks,
>>>>
>>>> I am developing a kind of open-systems application, which uses
>>>> multiple processes/executables, each of which monitors some set
>>>> of resources (in this case POSIX queues) via the epoll interface. For
>>>> example, when 10 processes wait on the same queue in epoll_wait()
>>>> and one message arrives, all 10 processes get woken up and all of
>>>> them try to read the message from the queue. One succeeds; the others
>>>> get EAGAIN. The problem is those others, which generate
>>>> extra context switches and useless CPU usage. With more processes,
>>>> the inefficiency grows.
>>>>
>>>> I tried EPOLLONESHOT, but it did not help. It seems suitable for
>>>> multi-threaded applications, not multi-process ones.
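
A minimal sketch of the pattern being described, assuming an existing
queue named /myq and the Linux default message size of 8192 bytes (both
illustrative): each worker process opens the same queue non-blocking
and waits on it via epoll.

#include <mqueue.h>
#include <sys/epoll.h>
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	mqd_t q = mq_open("/myq", O_RDONLY | O_NONBLOCK);
	int ep = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN }, out;
	char buf[8192];	/* must be >= the queue's mq_msgsize */

	if (q == (mqd_t)-1 || ep < 0) {
		perror("setup");
		return 1;
	}
	ev.data.fd = q;
	epoll_ctl(ep, EPOLL_CTL_ADD, q, &ev);

	for (;;) {
		if (epoll_wait(ep, &out, 1, -1) <= 0)
			continue;
		/* every waiting process reaches this point, but only
		 * one mq_receive() wins; the rest see EAGAIN -- the
		 * wasted wakeups described above */
		if (mq_receive(q, buf, sizeof(buf), NULL) < 0 &&
		    errno == EAGAIN)
			continue;	/* lost the race */
		/* ... process the message ... */
	}
}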
>>>
>>> Correct.  Most FDs are not shared across processes.
>>>
>>>> The ideal mechanism for this would be:
>>>> 1. If multiple epoll sets in the kernel match the same event and one
>>>> or more processes are in epoll_wait(), then send the event to only
>>>> one waiter.
>>>> 2. If no process is in the wait state, then send the event to all
>>>> epoll sets (as it is currently); the first process to become free
>>>> will grab the event.
>>>
>>> Jason Baron was working on this (search LKML archives for
>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE)
>>>
>>> However, I was unconvinced about modifying epoll.
>>>
>>> Perhaps I may be more easily convinced about your mqueue case than his
>>> case for listen sockets, though[*]
>>>
>>
>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>> essentially walks the list of waiters, wakes up the first thread
>> that is actively in epoll_wait(), stops, and moves the woken-up
>> epoll set to the end of the list. So it attempts to balance
>> the wakeups among the epoll sets, I think in the way that you
>> were describing.
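
In rough kernel-style pseudocode, that wakeup path would look something
like the sketch below. This paraphrases the description above rather than
quoting the patch; struct ep_set, waiter_in_epoll_wait() and wake_one()
are stand-ins for the real structures and calls:

#include <linux/list.h>

struct ep_set {
	struct list_head node;	/* position in the rotation list */
	/* ... */
};

static void ep_rotate_wakeup(struct list_head *epoll_sets)
{
	struct ep_set *s;

	list_for_each_entry(s, epoll_sets, node) {
		if (!waiter_in_epoll_wait(s))	/* skip sets with no waiter */
			continue;
		wake_one(s);			/* wake exactly one waiter */
		list_move_tail(&s->node, epoll_sets); /* rotate to the back */
		break;				/* stop after the first hit */
	}
}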
>>
>> Here is the patchset:
>>
>> https://lkml.org/lkml/2015/2/24/667
>>
>> The test program shows how to use the API. Essentially, you
>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>> which you then attach to your shared wakeup source and
>> then to your epoll sets. Please let me know if it's unclear.
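
For reference, a hedged sketch of that setup based only on the
description above. EPOLL_ROTATE comes from the unmerged patchset, so the
constant's value and the exact semantics below are assumptions:

#include <sys/epoll.h>

#ifndef EPOLL_ROTATE
#define EPOLL_ROTATE 2	/* placeholder; the real value is defined by the patch */
#endif

/* Build this process's epoll set so that wakeups for src_fd are
 * rotated across all participating epoll sets. */
int setup_rotated_wait(int src_fd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int rot, ep;

	rot = epoll_create1(EPOLL_ROTATE);	/* the 'dummy' rotate fd */
	ev.data.fd = src_fd;
	epoll_ctl(rot, EPOLL_CTL_ADD, src_fd, &ev); /* shared wakeup source */

	ep = epoll_create1(0);			/* this process's epoll set */
	ev.data.fd = rot;
	epoll_ctl(ep, EPOLL_CTL_ADD, rot, &ev);	/* attach via the rotate fd */

	return ep;	/* call epoll_wait() on this as usual */
}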
>>
>> Thanks,
>>
>> -Jason
> 
> In my particular case I need to work with multiple processes/executables 
> running (not threads) listening on the same queues. This concept lets a 
> sysadmin easily manage those processes (start new ones for load balancing, 
> or stop them without service interruption), and if any process dies for 
> some reason (signal, core dump, etc.), the whole application does not get 
> killed; only one transaction is lost.
> 
> Recently I ran tests and found out that the kernel's epoll currently sends 
> notifications to 4 processes (I think it is the EP_MAX_NESTS constant) 
> waiting on the same resource (the other 6 from my example stay asleep). So 
> it is not as bad as I thought before. It would be nice if EP_MAX_NESTS 
> were configurable, but I guess 4 is fine too.
> 

hmmm... EP_MAX_NESTS is about the level of 'nesting' of epoll sets, i.e.
you can do ep1->ep2->ep3->ep4-><wakeup src fd>, but you can't add in
an 'ep5'. The 'epN' above are epoll file descriptors that are attached
together via EPOLL_CTL_ADD.
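
Concretely, the nesting looks like this (a sketch; src_fd stands for any
event source fd):

#include <sys/epoll.h>

void nest_to_max_depth(int src_fd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int ep[4], i;

	for (i = 0; i < 4; i++)
		ep[i] = epoll_create1(0);

	ev.data.fd = src_fd;
	epoll_ctl(ep[3], EPOLL_CTL_ADD, src_fd, &ev);	/* ep4 -> src */
	for (i = 3; i > 0; i--) {
		ev.data.fd = ep[i];
		epoll_ctl(ep[i - 1], EPOLL_CTL_ADD, ep[i], &ev); /* epN -> epN+1 */
	}
	/* adding ep[0] into yet another epoll fd (the 'ep5' above) would
	 * exceed the nesting limit and be rejected by epoll_ctl() */
}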

The nesting does not affect how wakeups are done. All epoll fds
that are attached to the event source fd are going to get wakeups.


> Jason, does your patch work for multi-process applications? How hard would 
> it be to implement this for such a scenario?

I don't think it would be too hard, but it requires:

1) applying the patches
2) re-compiling and running the new kernel
3) modifying your app to use the new API.

Thanks,

-Jason


> 
> Madars
> 
>>
>>> Typical applications have few (probably only one) listen sockets or
>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>> blocking syscalls (accept4 or mq_timedreceive).
>>>
>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>> herds.
>>>
>>>> What do you think, would it be realistic to implement this? What about
>>>> concurrency?
>>>> Can you please give me some hints on where in the code to start
>>>> implementing these changes?
>>>
>>> For now, I suggest dedicating a thread in each process to do
>>> mq_timedreceive/mq_receive, assuming you only have a small number
>>> of queues in your system.
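
A minimal sketch of that suggestion, again assuming the illustrative
queue name /myq and the default 8192-byte message size: one thread per
process blocks in mq_receive(), and the kernel's exclusive wakeup on
the queue wakes exactly one blocked receiver per message.

#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <fcntl.h>

static void *mq_worker(void *arg)
{
	mqd_t q = mq_open("/myq", O_RDONLY);	/* blocking mode */
	char buf[8192];				/* >= the queue's mq_msgsize */

	(void)arg;
	if (q == (mqd_t)-1) {
		perror("mq_open");
		return NULL;
	}
	for (;;) {
		/* blocks here; no thundering herd on delivery */
		if (mq_receive(q, buf, sizeof(buf), NULL) >= 0) {
			/* ... hand the message to the rest of the app ... */
		}
	}
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, mq_worker, NULL);
	pthread_join(t, NULL);
	return 0;
}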
>>>
>>>
>>> [*] mq_timedreceive may copy a largish buffer, which benefits from
>>>     staying on the same CPU as much as possible.
>>>     By contrast, accept4 only creates a client socket.  With a C10K+
>>>     socket server (e.g. http/memcached/DB), a typical new client
>>>     socket spends a fair amount of time idle.  Thus I don't believe
>>>     memory locality inside the kernel is much of a concern when there
>>>     are thousands of accepted client sockets.
>>>
