Public bug reported:

* High level description

When sending SIGHUP to the neutron-server process in a neutron_api container, the main process appears to lock up in a tight loop. strace shows it repeatedly waiting for a child process that doesn't exist:
wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes)

This is problematic because logrotate uses SIGHUP in the containerized
environment. It doesn't happen every time: it may take one or two
reasonably spaced signals to trigger it.
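For illustration only (a minimal sketch, not the actual oslo.service code), the strace pattern above is what you get when a parent keeps polling for dead children with a non-blocking wait and never treats ECHILD as a reason to stop:

import errno
import os
import time

# Minimal sketch of a reaping loop gone wrong (hypothetical, NOT the real
# oslo.service code): if ECHILD is swallowed instead of ending the loop,
# the parent spins on wait4(0, ..., WNOHANG) exactly as seen in strace.
def reap_children_forever():
    while True:
        try:
            pid, _ = os.waitpid(0, os.WNOHANG)
        except OSError as e:
            if e.errno == errno.ECHILD:
                continue  # a correct loop would break/return here
            raise
        if pid == 0:
            time.sleep(0.01)  # children still running, nothing to reap yet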

* Pre-conditions:

I'm using CentOS 7 + Queens RDO
"rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280"

I first noticed the issue when the neutron_api container would stop
working on the control nodes; it was eventually traced back to the
logrotate_crond container sending SIGHUP to all the processes owning log
files in /var/log/containers. This doesn't happen every time, but it's
pretty easy to trigger on my system.

* Step-by-step reproduction steps:

# Start with a clean container

docker restart neutron_api

# Identify the neutron-server PID (613782 in this case) and send SIGHUP

kill -HUP 613782
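For repeated testing I script the two steps above; a rough sketch, assuming the container's main process (as reported by docker inspect) is the neutron-server parent; if the image wraps it in an init process, pick the neutron-server parent PID from ps instead:

import os
import signal
import subprocess
import time

def neutron_server_pid(container="neutron_api"):
    # Host PID of the container's main process; assumed (not guaranteed)
    # to be the neutron-server parent.
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.State.Pid}}", container])
    return int(out.strip())

pid = neutron_server_pid()
for _ in range(2):
    os.kill(pid, signal.SIGHUP)   # what logrotate effectively does
    time.sleep(90)                # space the signals out a bit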

# The relevant log files generally look clean the first time:

2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True
2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with status 0
2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in child list
2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with status 0
2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with status 0
2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with status 0
2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with status 0
2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with status 0
2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with status 0
2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with status 0
2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with status 0
2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://10.0.105.101:9696
2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in child list
2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://10.0.105.101:9696

# But on the second SIGHUP, the following happened:

2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled!
2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17
2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after thread killed. Cleaning up.
2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True
2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True
2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 children to exit
2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with status 0
2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by signal 15
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with status 0
2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with status 0
2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in child list: OSError: [Errno 3] No such process
2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with status 0
2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with status 0
2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with status 0
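
To confirm that the master is actually spinning (and not just idle or blocked), I check whether its CPU time keeps climbing; a rough sketch, assuming the usual 100 Hz USER_HZ and using the master's host PID (613782 above):

import time

def cpu_ticks(pid):
    # utime + stime in clock ticks, fields 14 and 15 of /proc/<pid>/stat
    # (see proc(5)); splitting on whitespace is fine here since the comm
    # field of neutron-server contains no spaces.
    with open("/proc/%d/stat" % pid) as f:
        fields = f.read().split()
    return int(fields[13]) + int(fields[14])

def looks_busy_looping(pid, interval=5.0):
    before = cpu_ticks(pid)
    time.sleep(interval)
    after = cpu_ticks(pid)
    # A healthy, idle master barely accumulates CPU time; a tight
    # wait4/ECHILD loop burns close to a full core.
    return (after - before) > interval * 50

print(looks_busy_looping(613782))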

* Version

I'm running CentOS 7 RDO TripleO (commit tag
c9fd24040454913b4a325741094285676fb7e7bc_a0a28280) with the OVS ML2 plugin and
3 controllers in an HA setup. TripleO templates and deployment config are
available on request.

This corresponds to OpenStack Queens.

I have not fully investigated all the other services, but they seem to be
fine with SIGHUP. In particular, nova-api-metadata seems to be OK and
doing what's expected.

* Perceived severity

This blocks our private production stack from running: the neutron API
breaking has a big knock-on effect, unfortunately.

** Affects: neutron
     Importance: Undecided
         Status: New


** Tags: queens

https://bugs.launchpad.net/bugs/1780139

Title:
  Sending SIGHUP to neutron-server process causes it to hang

Status in neutron:
  New

