[jira] [Commented] (CLOUDSTACK-10400) VPC Router Corruption when working with large number of networks containing instances with public IP addresses

Barys Dubauski (JIRA) Tue, 13 Nov 2018 10:39:11 -0800


    [ 
https://issues.apache.org/jira/browse/CLOUDSTACK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685587#comment-16685587
 ]


Barys Dubauski commented on CLOUDSTACK-10400:
---------------------------------------------

Thank you, Rohit: https://github.com/apache/cloudstack/issues/3025

> VPC Router Corruption when working with large number of networks containing 
> instances with public IP addresses 
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10400
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10400
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: API
>    Affects Versions: 4.11.1.0
>            Reporter: Barys Dubauski
>            Priority: Critical
>         Attachments: testCloudStack.jar
>
>
> We are using CloudStack 4.11.1 with KVM hosts.  To simulate our use case, 
> we wrote a small program that calls the CloudStack API to
> 1) create a VPC with 20 guest networks, each containing one virtual 
> machine with a public IP address allocated, and 
> 2) delete the machines and networks one by one. 
>  
> However, we frequently get a timeout error: sometimes during VM deletion, 
> sometimes during guest network deletion, and sometimes even during the 
> static NAT disable step.  Once the timeout occurs, the VPC / virtual router 
> appears to be left in an *unstable/corrupted* state.  We then need to 
> restart the virtual router with the clean-up option (sometimes several 
> times, because the router VM itself fails to deploy).  After that, we can 
> continue deleting the remaining environments.  Here are the high-level 
> steps that we performed:
>  # Create VPC Network
>  # For each of the 20 "environments"
>  ## Create Guest Network
>  ## Add a VM to the network
>  ## Acquire Public IP
>  ## Associate the Public IP with VM
>  # For each of the 20 environments
>  ## Disassociate the Public IP
>  ## Delete VM
>  ## Delete Guest network
>  # Delete VPC
>  
> The hang / timeout can occur at any point during environment deletion.  
> The first few deletions may go through successfully, and then one fails.  
> The failure can occur at any stage, i.e. disassociating the public IP, 
> deleting the VM, or deleting the guest network.  We looked at cloud.log, the 
> agent log, and the management server log but couldn’t find any obvious 
> errors.  It seems that the management server sends the deletion request, 
> but the VR does not respond and the system/network becomes stuck in an 
> invalid state.  As a result, the network often gets stuck in the 
> “Shutdown” state.
>  
> Here are some errors in the management server log:
> ============================================
>  2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async 
> job-29965, jobStatus: FAILED, resultCode: 530, result: 
> org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete network"}
> 2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
> Seq 4-667095694804259240: Received: 
> { Ans: , MgmtId: 7474664765770, via: 4(cehv02.core.jazz.net), Ver: v1, 
> Flags: 110, { GroupAnswer } }
>  2018-11-01 01:15:29,245 WARN  
> [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
> *Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
>  2018-11-01 01:15:29,247 WARN  
> [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
> *Failed to destroy guest network config Ntwk*[1122|Guest|12] on router 
> VM[DomainRouter|r-3388-VM]
>  2018-11-01 01:15:29,247 WARN  [c.c.n.e.VpcVirtualRouterElement] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
> *Failed to unplug nic in network Ntwk*[1122|Guest|12] for virtual router 
> VM[DomainRouter|r-3388-VM]
>  2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
> *Unable to complete shutdown of the network elements due to element: 
> VpcVirtualRouter*
>  2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
> Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown
>  2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] 
> (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) 
> *Network is not not in the correct state to be destroyed: Shutdown*
> ============================================
>  
> I'm attaching the simple Java program that performs all of the steps 
> described above and allowed us to reproduce the bug consistently.
>  
> To use the application:
>  
> java -jar testCloudStack.jar <CloudStack API URL, e.g. 
> http://foo:8080/client/api> <apiKey> <secretKey> <zoneName>
>  
> Note that the test application works successfully with CloudStack server 
> 4.9.2 but consistently reproduces the bug with CloudStack server 4.11.1.
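
For readers without the attachment, the call sequence described in the quoted report can be sketched roughly as follows. This is only an illustration, not the contents of testCloudStack.jar: the class name VpcLifecyclePlan is hypothetical, and the mapping from each reported step to a CloudStack API command name is an assumption on my part. The sketch only builds the ordered list of calls; it does not sign or send any request.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: lists the API calls of the create/teardown cycle in order. */
final class VpcLifecyclePlan {

    // Assumed mapping from the report's setup steps to API command names.
    static final String[] SETUP = {
        "createNetwork",         // Create Guest Network
        "deployVirtualMachine",  // Add a VM to the network
        "associateIpAddress",    // Acquire Public IP
        "enableStaticNat"        // Associate the Public IP with VM
    };

    // Assumed mapping for the teardown steps, in the order given in the report.
    static final String[] TEARDOWN = {
        "disableStaticNat",      // Disassociate the Public IP
        "destroyVirtualMachine", // Delete VM
        "deleteNetwork"          // Delete Guest network
    };

    /** Returns the full call sequence for the given number of environments. */
    static List<String> plan(int environments) {
        List<String> steps = new ArrayList<>();
        steps.add("createVPC");
        // All environments are created first...
        for (int env = 1; env <= environments; env++) {
            for (String command : SETUP) {
                steps.add(command + " env=" + env);
            }
        }
        // ...and only then torn down one by one.
        for (int env = 1; env <= environments; env++) {
            for (String command : TEARDOWN) {
                steps.add(command + " env=" + env);
            }
        }
        steps.add("deleteVPC");
        return steps;
    }

    public static void main(String[] args) {
        List<String> steps = plan(20);
        System.out.println(steps.size() + " calls");
        steps.forEach(System.out::println);
    }
}
```

With 20 environments this yields 142 calls against a single VPC router. According to the report, the failure can occur at any of the 60 teardown calls.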



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
