Reviewed: https://review.openstack.org/411784 Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1927da1bc7c4e56162dd3704d58d3b922d4ebce9 Submitter: Jenkins Branch: master
commit 1927da1bc7c4e56162dd3704d58d3b922d4ebce9 Author: AKamyshnikova <akamyshnik...@mirantis.com> Date: Fri Dec 16 16:00:59 2016 +0400 Add check for ha state If all agents are shown as a standby it is possible changing state were lost due to problems with RabbitMQ. Current change adds check for ha state in fetch_and_sync_all_routers. If state is different - notify server that state should be changed. Also change _get_bindings_and_update_router_state_for_dead_agents to set standby for dead agent only in case we have more than one active. Change-Id: If5596eb24041ea9fae1d5d2563dcaf655c5face7 Closes-bug:#1648242 ** Changed in: neutron Status: In Progress => Fix Released -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1648242 Title: Failure to retry update_ha_routers_states Status in neutron: Fix Released Bug description: Version: Mitaka While performing failover testing of L3 HA routers, we've discovered an issue with regards to the failure of an agent to report its state. In this scenario, we have a router (7629f5d7-b205-4af5-8e0e- a3c4d15e7677) scheduled to (3) L3 agents: +--------------------------------------+--------------------------------------------------+----------------+-------+----------+ | id | host | admin_state_up | alive | ha_state | +--------------------------------------+--------------------------------------------------+----------------+-------+----------+ | 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True | :-) | active | | 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True | :-) | standby | | 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True | :-) | standby | +--------------------------------------+--------------------------------------------------+----------------+-------+----------+ The infra03 node was shut down completely and abruptly. The router transitioned to master on infra02 as indicated in these log messages: 2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists 2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master 2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464 2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup 2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110 2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master 2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920 2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master 2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848 Traffic to/from VMs through the new master router functioned as expected. However, the ha_state remained 'standby': +--------------------------------------+--------------------------------------------------+----------------+-------+----------+ | id | host | admin_state_up | alive | ha_state | +--------------------------------------+--------------------------------------------------+----------------+-------+----------+ | 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True | xxx | standby | | 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True | :-) | standby | | 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True | :-) | standby | +--------------------------------------+--------------------------------------------------+----------------+-------+----------+ A traceback was observed in the logs related to a message timeout, probably due to the cut of AMQP on infra03: 2016-12-07 15:22:30.525 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds. 2016-12-07 15:22:36.537 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds. 2016-12-07 15:22:37.553 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.238.65:5671 via [amqp] client 2016-12-07 15:22:51.210 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.246:5671 is unreachable: Basic.cancel: (0) 1. Trying again in 1 seconds. 2016-12-07 15:22:52.262 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.237.246:5671 via [amqp] client 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent [-] Failed reporting state! 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent Traceback (most recent call last): 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 686, in _report_state 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent True) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent return method(context, 'report_state', **kwargs) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=self.retry) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent timeout=timeout, retry=retry) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=retry) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent result = self._waiter.wait(msg_id, timeout) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent message = self.waiters.get(msg_id, timeout=timeout) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent 'to message ID %s' % msg_id) 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 03e3909c25d749908652c050eeeaec91 2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent 2016-12-07 15:22:55.829 18450 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 0.01 sec 2016-12-07 15:23:01.086 18450 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 03e3909c25d749908652c050eeeaec91 2016-12-07 15:23:45.180 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 4 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough. 2016-12-07 15:23:45.181 18450 WARNING neutron.common.rpc [-] Increasing timeout for update_ha_routers_states calls to 120 seconds. Restart the agent to restore it to the default value. 2016-12-07 15:23:51.941 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 35 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough. While subsequent transitions for other routers acted and updated Neutron accordingly, this router's ha_state was never updated after this initial failure. We could force the update with a restart of the L3 agent, or by causing a transition from infra02 to another node, and back. Neither is ideal. If you need any other information please let me know. Thanks, James To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1648242/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp