Hi,

We are optimizing the requests per second (RPS) of an nginx HTTP server, and
we found that acquiring epmutex in eventpoll_release_file() becomes a
bottleneck in the one-request-per-connection scenario. The details of the
scenario are:

* HTTP server (nginx):
        * under ARM64 with 64 cores
        * 64 worker processes, each worker is bound to a specific CPU
        * keepalive_requests = 1 in nginx.conf: nginx will close the
          connection fd after a reply is sent
* HTTP benchmark tool (wrk):
        * under x86-64 with 48 cores
        * 16 threads, 64 connections per-thread
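
For readers who want to see the fd lifecycle behind this scenario, here is a
minimal userspace sketch. It is not taken from nginx or from the patch set:
churn_connections() is a hypothetical helper, and eventfd() is only a
stand-in for an accepted connection socket. Each iteration registers an fd
with epoll and then closes it without EPOLL_CTL_DEL, so the kernel has to
clean up the epoll item from eventpoll_release_file().

```c
#include <assert.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Register a short-lived fd with epoll, then close it while it is still
 * registered. eventfd() stands in for an accepted connection socket (an
 * assumption made to keep the sketch self-contained).
 * Returns the number of completed add + close cycles, or -1 if the epoll
 * instance could not be created.
 */
int churn_connections(int count)
{
    int epfd = epoll_create1(0);
    int done = 0;

    assert(count >= 0);
    if (epfd < 0)
        return -1;

    for (int i = 0; i < count; i++) {
        struct epoll_event ev = { .events = EPOLLIN };
        int fd = eventfd(0, EFD_NONBLOCK);

        if (fd < 0)
            break;
        ev.data.fd = fd;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
            close(fd);
            break;
        }
        /*
         * The last reference to the file is dropped while it is still
         * registered, so the kernel removes the epoll item itself via
         * eventpoll_release_file().
         */
        if (close(fd) == 0)
            done++;
    }

    close(epfd);
    return done;
}
```

With 64 workers doing this concurrently, every close() would contend on the
single global epmutex, which is the contention measured below.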

Before the patch set, the RPS measured by wrk is ~220K; after applying
it, the RPS is ~240K. We also measured the overhead of
eventpoll_release_file() and its children with perf: 29% before and
2% after.

In the following sections I will explain the purposes of epmutex and
how it is replaced by locks with a smaller granularity.

epmutex serves four purposes:
(1) serialize ep_loop_check() and ep_free()/eventpoll_release_file()
        (a) ensure the validity of ep when clearing visited_list
        The acquisition of epmutex in ep_free() prevents the freeing of ep.

        It's fixed in patch 2: when freeing ep, remove it from visited_list.
        When there is no nested-epoll case, ep will not be added to
        visited_list, so we check that condition first. If it has already been
        added to visited_list, we need to wait for the release of epmutex.

(2) serialize reverse_path_check() and ep_free()/eventpoll_release_file()
        (a) ensure the validity of file in tfile_check_list
        epi->ffd.file was added to tfile_check_list under ep->mtx, but
        was accessed without ep->mtx. The acquisition of epmutex in
        eventpoll_release_file() prevents the freeing of file.

        It's fixed in patch 3: when releasing file, remove it from
        tfile_check_list. If it has already been added, we need to
        wait for the release of epmutex.

        (b) ensure the validity of epi->ep and epi->ep->file
        The epmutex will prevent the freeing of ep and its related file,
        so it's OK to access epi->ep within an RCU read-side critical
        section.

        The change is done in patch 4: we free ep by RCU, so it's OK
        to access epi->ep->file within an RCU read-side critical section.
        The file is already freed by RCU, so it's also OK to access its
        fields.

(3) serialize the concurrent invocations of epoll_ctl(EPOLL_CTL_ADD)
    for the nested-epoll-fd case
        (a) protect tfile_check_list and visited_list

        There is nothing to do.
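
To make the nested-epoll-fd case concrete, here is a small userspace sketch
of the situation that ep_loop_check() guards against. The helper name
nested_epoll_loop_errno() is made up for illustration. It makes one epoll
instance watch another and then tries to close the cycle. The epoll_ctl(2)
man page documents ELOOP for an EPOLL_CTL_ADD that would create such a
circular loop.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Make "outer" watch "inner", then try to make "inner" watch "outer".
 * Returns the errno of the second EPOLL_CTL_ADD, or 0 if it succeeded.
 */
int nested_epoll_loop_errno(void)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int outer = epoll_create1(0);
    int inner = epoll_create1(0);
    int err = 0;
    int ret;

    /* A sketch: treat setup failures as fatal instead of handling them. */
    assert(outer >= 0 && inner >= 0);

    /* Nested epoll: an epoll fd is added to another epoll instance. */
    ev.data.fd = inner;
    ret = epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);
    assert(ret == 0);

    /* Closing the cycle is what the loop check has to reject. */
    ev.data.fd = outer;
    if (epoll_ctl(inner, EPOLL_CTL_ADD, outer, &ev) < 0)
        err = errno;

    close(inner);
    close(outer);
    return err;
}
```

Only EPOLL_CTL_ADD calls whose target is itself an epoll fd need this kind
of check, which is why the common non-nested path is the one worth
optimizing.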

(4) serialize ep_free() and eventpoll_release_file()
        (a) protect file->f_ep_links
        eventpoll_release_file() will read the list through
        file->f_ep_links, and modify it through epi->fllink.
        ep_free() will modify it through epi->fllink.

        It's fixed in patch 5: use RCU and list_first_or_null_rcu() to
        iterate file->f_ep_links instead of holding epmutex.
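
The shape of that iteration can be modelled in plain userspace C. The sketch
below is single-threaded and uses no RCU primitives, and all model_* names
are hypothetical. It only shows the loop structure: instead of walking the
list with a saved next pointer, the head is re-read on every pass and the
first entry (if any) is removed. In the kernel, list_first_or_null_rcu()
from <linux/rculist.h> would play the role of model_first_or_null(), and the
accesses would need the usual RCU read-side protection.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct model_node {
    struct model_node *next, *prev;
};

void model_list_init(struct model_node *head)
{
    head->next = head->prev = head;
}

void model_list_add_tail(struct model_node *head, struct model_node *n)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

void model_list_del(struct model_node *n)
{
    assert(n->next != NULL && n->prev != NULL);
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = n;
}

/* Return the first entry, or NULL when the list is empty. */
struct model_node *model_first_or_null(struct model_node *head)
{
    return head->next == head ? NULL : head->next;
}

/*
 * Drain the list by re-reading the head each time. No "next" pointer is
 * cached across iterations, so an entry removed by someone else between
 * two passes is simply never seen again.
 */
int model_drain(struct model_node *head)
{
    struct model_node *n;
    int removed = 0;

    while ((n = model_first_or_null(head)) != NULL) {
        model_list_del(n);
        removed++;
    }
    return removed;
}
```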

        (b) ensure the validity of epi->ep
        When eventpoll_release_file() gets epi from file->f_ep_links,
        epi->ep should still be valid.

        It's fixed in patches 4 and 6: add a reference counter to eventpoll
        and free eventpoll by RCU.
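
As a rough illustration of how a reference counter and a deferred free can
work together, here is a userspace model using C11 atomics. It is a generic
"get unless zero" pattern, not the code from patches 4 and 6. struct
ep_model and both helpers are hypothetical, and the actual patches may
organize the counter differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct eventpoll; only the counter matters. */
struct ep_model {
    atomic_int refs;
};

/*
 * Take a reference only if the object is not already on its way out.
 * A reader that found the object through a list would call this before
 * using it outside the read-side critical section.
 */
bool ep_model_get_unless_zero(struct ep_model *ep)
{
    int old = atomic_load(&ep->refs);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&ep->refs, &old, old + 1))
            return true;
    }
    return false;
}

/*
 * Drop a reference. Returns true when the caller dropped the last one
 * and is therefore responsible for scheduling the deferred free.
 */
bool ep_model_put(struct ep_model *ep)
{
    int old = atomic_fetch_sub(&ep->refs, 1);

    assert(old > 0);
    return old == 1;
}
```

The deferred free keeps the memory readable for a lookup that races with
the final put, and the failed "get" tells that lookup not to use the
object any further.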

        (c) protect the removal of epi
        Both ep_free() and eventpoll_release_file() will try to remove
        the same epi. If one function has removed the epi, the other
        function should not remove it again.

        It's fixed in patch 7: check whether or not ep_free() has already
        removed the epi before the invocation of ep_remove() in
        eventpoll_release_file().

        (d) ensure the validity of epi->ffd.file
        When ep_remove() is invoked by ep_free(), epi->ffd.file should
        still be valid.

        Nothing needs to be done: when ep_free() is invoking ep_remove()
        and accessing epi->ffd.file, if the file is being freed, the freeing
        will block on ep->mtx, so it's OK to access the file in ep_remove().

Patch 1 just removes epmutex from ep_free() and eventpoll_release_file(),
and patch 8 enlarges the region protected by ep->mtx to protect the
iteration of ep->rbr.

The patch set has passed the epoll-related test cases in LTP, and we are
planning to run some torture or performance test cases for nested-epoll
cases.

Comments and questions are welcome.

Regards,

Tao
---
Hou Tao (8):
  epoll: remove epmutex from ep_free() & eventpoll_release_file()
  epoll: remove ep from visited_list when freeing ep
  epoll: remove file from tfile_check_list when releasing file
  epoll: free eventpoll by rcu to provide existence guarantee
  epoll: iterate epi in file->f_ep_links by using list_first_or_null_rcu
  epoll: ensure the validity of ep when removing epi in
    eventpoll_release_file()
  epoll: prevent the double-free of epi in eventpoll_release_file()
  epoll: protect the iteration of ep->rbr by ep->mtx in ep_free()

 fs/eventpoll.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 88 insertions(+), 14 deletions(-)

-- 
2.7.5
