[jira] [Updated] (CLOUDSTACK-10400) VPC Router Corruption when working with large number of networks containing instances with public IP addresses

Barys Dubauski (JIRA) Fri, 02 Nov 2018 19:26:52 -0700


     [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Barys Dubauski updated CLOUDSTACK-10400:
----------------------------------------
    Description: 
We are using CloudStack 4.11.1 with KVM hosts.  To simulate our 
use case, we created a small program that calls the CloudStack API to

1) create a VPC network with 20 guest networks, each containing one virtual 
machine with a public IP address allocated.  

2) delete the machines and networks one by one. 

 

However, we frequently get a timeout error, sometimes during VM deletion, 
sometimes during guest network deletion, and sometimes even during the static 
NAT disable step.  Once the timeout occurs, the VPC network / Virtual Router 
appears to be in an *unstable/corrupted* state.  We need to restart the Virtual 
Router with the clean option (sometimes several times, because the restart 
itself can fail to deploy the router VM).  After that, we can continue deleting 
the remaining networks and environments.  Here are the high-level steps that we 
performed:
 # Create VPC Network
 # For each of the 20 "environments"
 ## Create Guest Network
 ## Add a VM to the network
 ## Acquire Public IP
 ## Associate the Public IP with VM
 # For each of the 20 environments
 ## Disassociate the Public IP
 ## Delete VM
 ## Delete Guest network
 # Delete VPC
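
For readers without the attached jar, the ordering of the steps above can be sketched as follows. This is only an illustration of the sequence, not the attached program: the class name `VpcChurnPlan` is hypothetical, and mapping "Associate the Public IP with VM" to `enableStaticNat` and "Disassociate the Public IP" to `disableStaticNat` is an assumption based on the static NAT step mentioned above. The sketch only builds the list of API command names and does not call a server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the API commands in the order the report describes.
final class VpcChurnPlan {

    // Builds the create phase for every environment first, then the teardown phase.
    static List<String> plan(int environments) {
        List<String> steps = new ArrayList<>();
        steps.add("createVPC");
        for (int i = 1; i <= environments; i++) {
            steps.add("createNetwork env=" + i);
            steps.add("deployVirtualMachine env=" + i);
            steps.add("associateIpAddress env=" + i);
            steps.add("enableStaticNat env=" + i);   // assumed mapping, see lead-in
        }
        for (int i = 1; i <= environments; i++) {
            steps.add("disableStaticNat env=" + i);  // assumed mapping, see lead-in
            steps.add("destroyVirtualMachine env=" + i);
            steps.add("deleteNetwork env=" + i);
        }
        steps.add("deleteVPC");
        return steps;
    }

    public static void main(String[] args) {
        plan(20).forEach(System.out::println);
    }
}
```

The point of the two separate loops is that all 20 guest networks exist on the VPC router at the same time before any teardown starts, which is the condition under which the failures were observed.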

 

The hang / timeout can occur at any point during environment 
deletion.  The first few deletions may go through successfully before a later 
one fails.  The failure can be at any stage, i.e. disassociating the public 
IP, deleting the VM or deleting the guest network.  We looked at cloud.log, the 
agent log and the management server log but couldn’t find any obvious errors.  
It seems that the management server sends the deletion request, but the VR does 
not respond and the system/network becomes stuck in an invalid state. The 
network often gets stuck in the “Shutdown” state as a result.
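
A client can tell "the job failed" apart from "the job never finished" by polling the async job with a deadline. The following is a minimal sketch under stated assumptions: the class `AsyncJobPoller` and the `TIMED_OUT` value are hypothetical, and the status supplier stands in for a call to `queryAsyncJobResult`, whose `jobstatus` field is 0 while pending, 1 on success and 2 on failure.

```java
import java.util.function.IntSupplier;

// Hypothetical helper: polls a job status source until it leaves "pending" or a deadline passes.
final class AsyncJobPoller {
    static final int PENDING = 0;    // jobstatus values as returned by queryAsyncJobResult
    static final int SUCCEEDED = 1;
    static final int FAILED = 2;
    static final int TIMED_OUT = -1; // client-side marker, not a CloudStack value

    static int waitForJob(IntSupplier jobStatus, long timeoutMillis, long intervalMillis) {
        long start = System.nanoTime();
        long budgetNanos = timeoutMillis * 1_000_000L;
        while (true) {
            int status = jobStatus.getAsInt();
            if (status != PENDING) {
                return status;               // finished, successfully or not
            }
            if (System.nanoTime() - start >= budgetNanos) {
                return TIMED_OUT;            // still pending when the budget ran out
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return TIMED_OUT;            // treat interruption as giving up
            }
        }
    }
}
```

Recording which of the two outcomes occurred for each step would make it easier to see whether a given run hit an explicit failure (as in the 530 result shown in the log) or a silent hang.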

 

Here are some errors in the management server log:

============================================
 2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async 
job-29965, jobStatus: FAILED, resultCode: 530, result: 
org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete network"}

2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Seq 
4-667095694804259240: Received: { Ans: , MgmtId: 7474664765770, via: 
4(cehv02.core.jazz.net), Ver: v1, Flags: 110, { GroupAnswer } }
 2018-11-01 01:15:29,245 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
 2018-11-01 01:15:29,247 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Failed to destroy guest network config Ntwk*[1122|Guest|12] on router 
VM[DomainRouter|r-3388-VM]
 2018-11-01 01:15:29,247 WARN  [c.c.n.e.VpcVirtualRouterElement] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Failed to unplug nic in network Ntwk*[1122|Guest|12] for virtual router 
VM[DomainRouter|r-3388-VM]
 2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Unable to complete shutdown of the network elements due to element: 
VpcVirtualRouter*
 2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown
 2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Network is not not in the correct state to be destroyed: Shutdown*

============================================
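
To pull the relevant warnings for one job out of a full management server log, a filter along these lines may help. It is a sketch, not a CloudStack tool: the class `JobLogFilter` is hypothetical, and the regular expression assumes each entry sits on one line in the layout shown above (timestamp, level, bracketed logger, parenthesised thread context containing the `job-NNNN` token).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only WARN entries whose thread context names the given job.
final class JobLogFilter {
    // Groups: 1 timestamp, 2 level, 3 logger, 4 thread context, 5 rest of the entry.
    private static final Pattern ENTRY = Pattern.compile(
        "^\\s*(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})\\s+(\\w+)\\s+"
        + "\\[([^\\]]+)\\]\\s+\\(([^)]*)\\)\\s*(.*)$");

    static List<String> warnings(List<String> lines, String jobId) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line);
            if (!m.matches()) {
                continue; // not a log entry in the assumed layout
            }
            // The context looks like "API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94".
            boolean sameJob = Arrays.asList(m.group(4).split("[\\s:]+")).contains(jobId);
            if (sameJob && "WARN".equals(m.group(2))) {
                out.add(m.group(1) + " " + m.group(3) + " " + m.group(5));
            }
        }
        return out;
    }
}
```

Called with the job id from the excerpt, for example `JobLogFilter.warnings(lines, "job-29965")`, it would return only the four WARN entries and skip the DEBUG lines around them.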

 

I'm attaching a simple Java program that performs all of the steps described 
above and lets us reproduce the bug consistently.

 

To use the application:

 

java -jar testCloudStack.jar <CloudStack API url: e.g. 
http://foo:8080/client/api> <apiKey> <secretKey> <zoneName>
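
The apiKey and secretKey arguments are needed because CloudStack API calls are signed. As a rough sketch of the documented signing procedure (sort the parameters, URL-encode the values, lower-case the string, HMAC-SHA1 it with the secret key, then Base64-encode and URL-encode the result), a client might build its query string as below. The class `CloudStackSigner` is hypothetical and is not taken from the attached jar; the `listZones` lookup by name in the usage note is likewise an assumption about how a zone name could be resolved.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: builds a signed query string for a CloudStack API call.
final class CloudStackSigner {

    static String signedQuery(Map<String, String> params, String apiKey, String secretKey) {
        try {
            // Sort parameters by name, ignoring case, and add the API key.
            TreeMap<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
            sorted.putAll(params);
            sorted.put("apikey", apiKey);

            StringBuilder query = new StringBuilder();
            for (Map.Entry<String, String> e : sorted.entrySet()) {
                if (query.length() > 0) {
                    query.append('&');
                }
                // URLEncoder emits '+' for spaces; the signature expects %20.
                query.append(e.getKey()).append('=')
                     .append(URLEncoder.encode(e.getValue(), "UTF-8").replace("+", "%20"));
            }

            // Sign the lower-cased query string with HMAC-SHA1 and Base64-encode the digest.
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            byte[] digest = mac.doFinal(
                query.toString().toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            String signature = Base64.getEncoder().encodeToString(digest);

            return query + "&signature=" + URLEncoder.encode(signature, "UTF-8");
        } catch (Exception e) {
            throw new IllegalStateException("could not sign request", e);
        }
    }
}
```

For example, passing `command=listZones`, `name=<zoneName>` and `response=json` yields a query string that can be appended to the API URL after a `?`.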

 

Note that the test application runs successfully against CloudStack server 
4.9.2 but consistently reproduces the bug against CloudStack server 4.11.1.

  was:
We are using CloudStack 4.11.1 running with KVM hosts.  To simulate our 
usecase, we created a small program that calls CloudStack API to

1) create VPC network with 20 guest networks, each containing one virtual 
machine with a public IP address allocated.  

2) delete the machines and networks one by one. 

 

However,  we frequently get a timeout error, sometimes during VM deletion, and 
sometimes during guest network deletion or even during static NAT disable step. 
 Once the timeout occurs, it seems that the VPC network / Virtual router is in 
an *unstable/corrupted* state.  We need to restart the Virtual Router with a 
clean option (sometimes have to try restart several times as it fails to deploy 
router VM as well).  After that, we can continue delete the network remaining 
environment.  Here is the high level steps that we did:
 # Create VPC Network
 # For each of the 20 "environments"
 ## Create Guest Network
 ## Add a VM to the network
 ## Acquire Public IP
 ## Associate the Public IP with VM
 # For each of the 20 environment
 ## Disassociate the Public IP
 ## Delete VM
 ## Delete Guest network
 # Delete VPC

 

The hanging / timeout problems could be in any time during environment 
deletion.  The first few deletion could go through successfully, and then fail 
at some point.  The failure could be in any stage.  i.e. Disassociate public 
IP, delete VM or delete guest network.  We look at cloud.log, agent log and 
management server log but couldn’t get any obvious errors.  It may seems that 
management server sends the request to do the deletion, but the VR does not 
respond and the system/network becomes stuck in an invalid state. Network is 
often gets stuck in “Shutdown” state as a result

 

Here are some error in the management server log:

============================================
2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async 
job-29965, jobStatus: FAILED, resultCode: 530, result: 
org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete network"}

2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Seq 
4-667095694804259240: Received:  { Ans: , MgmtId: 7474664765770, via: 
4(cehv02.core.jazz.net), Ver: v1, Flags: 110, { GroupAnswer } }
2018-11-01 01:15:29,245 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
2018-11-01 01:15:29,247 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Failed to destroy guest network config Ntwk*[1122|Guest|12] on router 
VM[DomainRouter|r-3388-VM]
2018-11-01 01:15:29,247 WARN  [c.c.n.e.VpcVirtualRouterElement] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Failed to unplug nic in network Ntwk*[1122|Guest|12] for virtual router 
VM[DomainRouter|r-3388-VM]
2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Unable to complete shutdown of the network elements due to element: 
VpcVirtualRouter*
2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown
2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
*Network is not not in the correct state to be destroyed: Shutdown*

============================================

 

I'm attaching the simple java program which performs all of the above described 
steps and which allowed us to consistently run into the bug.

 

To use the application:

 

java -jar testCloudStack.jar <CloudStack API url: e.g. 
http://foo:8080/client/api> <apiKey> <secretKey> <zoneName>

 

Note, that the test application works successfully with CloudStack server 4.9.2 
but consistently reproduces the bug with CloudStack server 4.11.1


> VPC Router Corruption when working with large number of networks containing 
> instances with public IP addresses 
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10400
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10400
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: API
>    Affects Versions: 4.11.1.0
>            Reporter: Barys Dubauski
>            Priority: Critical
>         Attachments: testCloudStack.jar
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
