[Yahoo-eng-team] [Bug 1664299] [NEW] Issue about lost rpc status report from agent.

zhaobo Mon, 13 Feb 2017 09:06:21 -0800

Public bug reported:

Background:
We need a stable and functional public cloud, which means users can launch VMs 
and call the OpenStack API whenever they want.
So the server needs to be more robust and more error-tolerant.


Scenario:
1. The Neutron agent reports its status to the server side over RPC.
2. The agent has sent the message, and it is now in the message queue.
3. The Neutron server takes the message from the queue and starts processing 
the payload, but has not yet updated the agent in the DB.
4. At that moment the Neutron server restarts, so the RPC message is lost 
and the agent side keeps waiting for the server response.

In this situation, assume that the max wait time for a server response 
('rpc_response_timeout') is 60s and the max agent DOWN time on the Neutron 
server side is 150s.
As described in the background above, users may issue requests during that 
window, and the host where the agent is deployed may be selected as the 
destination. The agent side is still waiting for the response from the Neutron 
server; it does not retry as soon as possible, it just waits. 
If the Neutron server sets the agent DOWN while instances are being launched, 
all instances scheduled to that host will hit a binding failed error.
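
To make the timing concrete, here is a rough sketch of how long the server 
could go without recording a status report. It is only a simplified model, not 
Neutron code: the function names, the 30s report interval, and the rule that 
the agent blocks for the full timeout and then retries immediately are all 
assumptions for illustration. Only the 60s and 150s values come from the 
scenario above.

```python
def heartbeat_gap(report_interval, rpc_response_timeout, lost_reports):
    """Seconds between two status reports that the server actually records.

    Assumed model (hypothetical, not taken from Neutron):
    - the last recorded report is at t=0,
    - the next report is sent report_interval seconds later,
    - each lost report blocks the agent for the full rpc_response_timeout,
    - the agent retries immediately once the timeout expires.
    """
    return report_interval + lost_reports * rpc_response_timeout


def is_marked_down(gap, agent_down_time):
    """The server treats the agent as DOWN once the gap exceeds its limit."""
    return gap > agent_down_time


if __name__ == "__main__":
    REPORT_INTERVAL = 30        # assumed value, not given in this report
    RPC_RESPONSE_TIMEOUT = 60   # from the scenario above
    AGENT_DOWN_TIME = 150       # from the scenario above

    for lost in range(4):
        gap = heartbeat_gap(REPORT_INTERVAL, RPC_RESPONSE_TIMEOUT, lost)
        state = "DOWN" if is_marked_down(gap, AGENT_DOWN_TIME) else "alive"
        print(f"lost reports={lost}: gap={gap}s -> agent {state}")
```

Under these assumptions each lost report adds a full 60s of silence, so a few 
consecutive losses (for example during a slow server restart) are enough to 
push the gap past 150s. A shorter timeout for status reports would shrink that 
window.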

The result is unacceptable in some ways, especially in public products.
Could Neutron solve this issue in some nice way? :)  Thank you.

** Affects: neutron
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1664299

To manage notifications about this bug go to:
https://bugs.launchpad.net/neutron/+bug/1664299/+subscriptions

-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
