Public bug reported: * High level description
When sending SIGHUP to the neutron-server process in a neutron_api container, it looks like the main process locks up in a tight loop. strace output shows that it's waiting for a process that doesn't exist: wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes) This is problematic, because logrotate uses SIGHUP in the containerized environment. It doesn't happen always: it might take one or two signals reasonably interspersed, to trigger this. * Pre-conditions: I'm using CentOS 7 + Queens RDO "rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280" I first noticed the issue when the neutron_api docker would stop working on the control nodes, eventually it was traced back to the logrotate_crond container sending SIGHUP to all the processes owning log files in /var/log/containers. This doesn't happen every time, but it's pretty easy to trigger on my system. * Step-by-step reproduction steps: # Start with a clean container docker restart neutron_api # Identify the neutron-server PID: (613782 in this case) and send SIGHUP kill -HUP 613782 # the relevant log files generally look clean the first time: 2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children 2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children 2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True 2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with status 0 2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in child list 2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with status 0 2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with status 0 2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with status 0 2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with status 0 2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with status 0 2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with status 0 2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with status 0 2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with status 0 2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in child list 2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://10.0.105.101:9696 # But on the second SIGHUP, the following happened: 2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children 2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after thread killed. Cleaning up. 2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True 2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 children to exit 2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with status 0 2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with status 0 2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by signal 15 2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with status 0 2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with status 0 2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with status 0 2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with status 0 2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with status 0 2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with status 0 2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with status 0 2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with status 0 * Version I'm running Centos7 RDO TripleO (commit tag c9fd24040454913b4a325741094285676fb7e7bc_a0a28280), the OVS ML2 plugin and 3 controllers in a HA setup. tripleo templates and deployment config available on request. This corresponds to OpenStack Queens. I have not fully investigated all other services, but they seem to be fine with SIGHUP. In particular nova-api-metadata seems to be OK and doing what's expected. * Perceived severity This blocks our private production stack from running: neutron api breaking has a big knock-on effect, unfortunately. ** Affects: neutron Importance: Undecided Status: New ** Tags: queens -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1780139 Title: Sending SIGHUP to neutron-server process causes it to hang Status in neutron: New Bug description: * High level description When sending SIGHUP to the neutron-server process in a neutron_api container, it looks like the main process locks up in a tight loop. strace output shows that it's waiting for a process that doesn't exist: wait4(0, 0x7ffe97e025a4, WNOHANG, NULL) = -1 ECHILD (No child processes) This is problematic, because logrotate uses SIGHUP in the containerized environment. It doesn't happen always: it might take one or two signals reasonably interspersed, to trigger this. * Pre-conditions: I'm using CentOS 7 + Queens RDO "rdo_version": "c9fd24040454913b4a325741094285676fb7e7bc_a0a28280" I first noticed the issue when the neutron_api docker would stop working on the control nodes, eventually it was traced back to the logrotate_crond container sending SIGHUP to all the processes owning log files in /var/log/containers. This doesn't happen every time, but it's pretty easy to trigger on my system. * Step-by-step reproduction steps: # Start with a clean container docker restart neutron_api # Identify the neutron-server PID: (613782 in this case) and send SIGHUP kill -HUP 613782 # the relevant log files generally look clean the first time: 2018-07-04 16:50:34.730 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children 2018-07-04 16:50:34.739 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:50:34.740 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:50:34.761 33 INFO neutron.wsgi [-] (33) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 27 INFO neutron.wsgi [-] (27) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 28 INFO neutron.wsgi [-] (28) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 30 INFO neutron.wsgi [-] (30) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children 2018-07-04 16:50:34.761 32 INFO neutron.wsgi [-] (32) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 34 INFO neutron.wsgi [-] (34) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 29 INFO neutron.wsgi [-] (29) wsgi exited, is_accepting=True 2018-07-04 16:50:34.761 31 INFO neutron.wsgi [-] (31) wsgi exited, is_accepting=True 2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:50:34.771 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:50:34.792 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:50:34.807 7 INFO oslo_service.service [-] Child 27 exited with status 0 2018-07-04 16:50:34.807 7 WARNING oslo_service.service [-] pid 27 not in child list 2018-07-04 16:50:35.761 7 INFO oslo_service.service [-] Child 28 exited with status 0 2018-07-04 16:50:35.764 7 INFO oslo_service.service [-] Child 29 exited with status 0 2018-07-04 16:50:35.767 7 INFO oslo_service.service [-] Child 30 exited with status 0 2018-07-04 16:50:35.768 78 INFO neutron.wsgi [-] (78) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.771 79 INFO neutron.wsgi [-] (79) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.770 7 INFO oslo_service.service [-] Child 31 exited with status 0 2018-07-04 16:50:35.773 7 INFO oslo_service.service [-] Child 32 exited with status 0 2018-07-04 16:50:35.774 80 INFO neutron.wsgi [-] (80) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.776 7 INFO oslo_service.service [-] Child 33 exited with status 0 2018-07-04 16:50:35.777 81 INFO neutron.wsgi [-] (81) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.780 82 INFO neutron.wsgi [-] (82) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.779 7 INFO oslo_service.service [-] Child 34 exited with status 0 2018-07-04 16:50:35.782 7 INFO oslo_service.service [-] Child 43 exited with status 0 2018-07-04 16:50:35.783 83 INFO neutron.wsgi [-] (83) wsgi starting up on http://10.0.105.101:9696 2018-07-04 16:50:35.783 7 WARNING oslo_service.service [-] pid 43 not in child list 2018-07-04 16:50:35.786 84 INFO neutron.wsgi [-] (84) wsgi starting up on http://10.0.105.101:9696 # But on the second SIGHUP, the following happened: 2018-07-04 16:52:08.821 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children 2018-07-04 16:52:08.830 7 INFO neutron.common.config [-] Logging enabled! 2018-07-04 16:52:08.831 7 INFO neutron.common.config [-] /usr/bin/neutron-server version 12.0.3.dev17 2018-07-04 16:52:08.847 7 INFO oslo_service.service [-] Wait called after thread killed. Cleaning up. 2018-07-04 16:52:08.847 79 INFO neutron.wsgi [-] (79) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 82 INFO neutron.wsgi [-] (82) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 78 INFO neutron.wsgi [-] (78) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 84 INFO neutron.wsgi [-] (84) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 81 INFO neutron.wsgi [-] (81) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 80 INFO neutron.wsgi [-] (80) wsgi exited, is_accepting=True 2018-07-04 16:52:08.847 83 INFO neutron.wsgi [-] (83) wsgi exited, is_accepting=True 2018-07-04 16:52:08.848 7 INFO oslo_service.service [-] Waiting on 10 children to exit 2018-07-04 16:52:08.852 7 INFO oslo_service.service [-] Child 78 exited with status 0 2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 78 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 79 exited with status 0 2018-07-04 16:52:08.853 7 WARNING oslo_service.service [-] pid 79 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.853 7 INFO oslo_service.service [-] Child 44 killed by signal 15 2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 80 exited with status 0 2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 80 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 81 exited with status 0 2018-07-04 16:52:08.854 7 WARNING oslo_service.service [-] pid 81 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.854 7 INFO oslo_service.service [-] Child 82 exited with status 0 2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 82 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 83 exited with status 0 2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 83 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:08.855 7 INFO oslo_service.service [-] Child 84 exited with status 0 2018-07-04 16:52:08.855 7 WARNING oslo_service.service [-] pid 84 not in child list: OSError: [Errno 3] No such process 2018-07-04 16:52:15.039 7 INFO oslo_service.service [-] Child 98 exited with status 0 2018-07-04 16:52:30.025 7 INFO oslo_service.service [-] Child 97 exited with status 0 2018-07-04 16:52:38.017 7 INFO oslo_service.service [-] Child 96 exited with status 0 * Version I'm running Centos7 RDO TripleO (commit tag c9fd24040454913b4a325741094285676fb7e7bc_a0a28280), the OVS ML2 plugin and 3 controllers in a HA setup. tripleo templates and deployment config available on request. This corresponds to OpenStack Queens. I have not fully investigated all other services, but they seem to be fine with SIGHUP. In particular nova-api-metadata seems to be OK and doing what's expected. * Perceived severity This blocks our private production stack from running: neutron api breaking has a big knock-on effect, unfortunately. To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1780139/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp