[ https://issues.apache.org/jira/browse/CLOUDSTACK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685587#comment-16685587 ]
Barys Dubauski commented on CLOUDSTACK-10400:
---------------------------------------------

Thank you, Rohit: https://github.com/apache/cloudstack/issues/3025

> VPC Router Corruption when working with large number of networks containing
> instances with public IP addresses
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10400
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10400
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: API
>    Affects Versions: 4.11.1.0
>            Reporter: Barys Dubauski
>            Priority: Critical
>         Attachments: testCloudStack.jar
>
>
> We are using CloudStack 4.11.1 running with KVM hosts. To simulate our
> use case, we created a small program that calls the CloudStack API to
> 1) create a VPC network with 20 guest networks, each containing one virtual
> machine with a public IP address allocated.
> 2) delete the machines and networks one by one.
>
> However, we frequently get a timeout error, sometimes during VM deletion,
> sometimes during guest network deletion, and sometimes even during the
> static NAT disable step. Once the timeout occurs, the VPC network / Virtual
> Router appears to be in an *unstable/corrupted* state. We need to restart
> the Virtual Router with the clean option (sometimes we have to retry the
> restart several times, because it also fails to deploy the router VM).
> After that, we can continue deleting the remaining environments. Here are
> the high-level steps that we performed:
> # Create VPC Network
> # For each of the 20 "environments"
> ## Create Guest Network
> ## Add a VM to the network
> ## Acquire Public IP
> ## Associate the Public IP with VM
> # For each of the 20 environments
> ## Disassociate the Public IP
> ## Delete VM
> ## Delete Guest network
> # Delete VPC
>
> The hanging / timeout problems can occur at any time during environment
> deletion.
> The first few deletions can go through successfully, and then a later one
> fails. The failure can occur at any stage, i.e. disassociating the public
> IP, deleting the VM, or deleting the guest network. We looked at cloud.log,
> the agent log, and the management server log but could not find any obvious
> errors. It seems that the management server sends the deletion request, but
> the VR does not respond and the system/network becomes stuck in an invalid
> state. The network often gets stuck in the "Shutdown" state as a result.
>
> Here are some errors in the management server log:
> ============================================
> 2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async job-29965, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete network"}
> 2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Seq 4-667095694804259240: Received: { Ans: , MgmtId: 7474664765770, via: 4(cehv02.core.jazz.net), Ver: v1, Flags: 110, { GroupAnswer } }
> 2018-11-01 01:15:29,245 WARN [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
> 2018-11-01 01:15:29,247 WARN [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to destroy guest network config Ntwk*[1122|Guest|12] on router VM[DomainRouter|r-3388-VM]
> 2018-11-01 01:15:29,247 WARN [c.c.n.e.VpcVirtualRouterElement] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to unplug nic in network Ntwk*[1122|Guest|12] for virtual router VM[DomainRouter|r-3388-VM]
> 2018-11-01 01:15:29,247 WARN [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to complete shutdown of the network elements due to element: VpcVirtualRouter*
> 2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown
> 2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Network is not not in the correct state to be destroyed: Shutdown*
> ============================================
>
> I'm attaching the simple Java program that performs all of the steps
> described above and that allowed us to reproduce the bug consistently.
>
> To use the application:
>
> java -jar testCloudStack.jar <CloudStack API url: e.g. [http://foo:8080/client/api]> <apiKey> <secretKey> <zoneName>
>
> Note that the test application works successfully with CloudStack server
> 4.9.2 but consistently reproduces the bug with CloudStack server 4.11.1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
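
For readers who cannot open the attachment: each step in the list above is one signed CloudStack API request built from the apiKey and secretKey arguments. The following is a minimal sketch of how such a request could be signed. It is not the code in testCloudStack.jar. The class and method names are hypothetical, and the signing steps (sort the parameters by name, lowercase the string, HMAC-SHA1 with the secret key, Base64, URL-encode) follow the scheme described in the CloudStack developer documentation as I understand it.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, for illustration only.
final class CloudStackRequestSketch {

    // URL-encode a value; spaces become %20 rather than '+'.
    static String encode(String value) throws Exception {
        return URLEncoder.encode(value, "UTF-8").replace("+", "%20");
    }

    // Build "name=value&...&signature=..." for the given API parameters.
    // The caller is expected to include "command" and "apiKey" in params.
    static String signedQuery(Map<String, String> params, String secretKey) throws Exception {
        // Sort parameters by name, ignoring case.
        Map<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        sorted.putAll(params);

        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            query.append(e.getKey()).append('=').append(encode(e.getValue()));
        }

        // The signature is computed over the lowercased query string.
        String toSign = query.toString().toLowerCase(Locale.ROOT);
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
        byte[] digest = mac.doFinal(toSign.getBytes(StandardCharsets.UTF_8));
        String signature = Base64.getEncoder().encodeToString(digest);

        return query + "&signature=" + encode(signature);
    }
}
```

Under these assumptions, one teardown step would be a GET to the API URL followed by "?" and the result of signedQuery for a parameter map such as command=deleteNetwork, id=<network UUID>, apiKey=<apiKey>, response=json. The disassociate and VM-delete steps would presumably use disableStaticNat and destroyVirtualMachine in the same way. Because these commands are asynchronous, a client would also need to poll the returned job for completion, which is where the timeouts described above would surface.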