Andrew Bayer created CLOUDSTACK-3163:
----------------------------------------

             Summary: Virtual Router startup time is painfully long
                 Key: CLOUDSTACK-3163
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3163
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
    Affects Versions: pre-4.0.0
         Environment: CloudPlatform 3.0.3, but I don't see any changes to the 
relevant code (I think) on master
            Reporter: Andrew Bayer
            Priority: Critical


When you've got a couple thousand instances spread across 10 or so pods, 
virtual router startup time is near crippling. In fact, if you don't enable 
the option to populate each virtual router only with the instances in its own 
pod, it *is* crippling: the virtual routers don't finish starting before the 
management server decides they've timed out and tries to start a new one.

This seems to be the result of a few painful inefficiencies:

- The same codepath is followed whether you're adding a new instance to an 
already running VR, or adding two hundred already running instances to a new 
VR. So each ssh/scp/sed/cp/chmod/etc command is replicated for each instance, 
rather than finding efficiencies by doing things across the whole set of 
instances. 
- But what really eats up the time is the population of vm data. For each 
piece of vm data (which, from a rough look at the code, seems to be something 
like 10 or 11 data files), there are something like 7 ssh calls and an scp 
call. So per instance, we have somewhere around 80 to 90 ssh/scp calls, plus 
the single ssh call for dhcp_entry.sh. With 200 instances, that's roughly 
16,000 to 18,000 ssh/scp calls on a single VR, with all the overhead entailed 
in opening that many ssh connections, starting bash, etc. In my experience, a 
VR with ~200 instances takes ~90 minutes to start up (I may be misremembering 
slightly - it could be that ~200 instances takes closer to 60 minutes, and 
~300 takes closer to 90), which works out to roughly 0.2 to 0.35 seconds per 
ssh/scp call. That doesn't seem implausible to me. 
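
As a back-of-the-envelope check, the per-instance counts above can be 
multiplied out. This is only a sketch: the function and parameter names are 
made up for illustration, and the inputs are the rough figures from this 
description (10 or 11 data files, 7 ssh calls plus 1 scp call per file, one 
dhcp_entry.sh call per instance), not values measured from the code.

```python
# Rough estimate of ssh/scp traffic during VR startup.
# All defaults are the approximate figures quoted in this report.

def calls_per_instance(data_files, ssh_per_file=7, scp_per_file=1,
                       dhcp_entry_calls=1):
    """ssh/scp calls needed to populate one instance on the VR."""
    return data_files * (ssh_per_file + scp_per_file) + dhcp_entry_calls


def total_calls(instances, data_files):
    """ssh/scp calls needed to populate every instance on one VR."""
    return instances * calls_per_instance(data_files)


def seconds_per_call(startup_minutes, calls):
    """Average wall-clock time per call implied by a startup duration."""
    return startup_minutes * 60 / calls


if __name__ == "__main__":
    for files in (10, 11):
        calls = total_calls(200, files)
        print(files, "data files:", calls, "calls,",
              round(seconds_per_call(90, calls), 2), "s per call at 90 min")
```

Changing the instance count or the startup time shows how sensitive the 
per-call figure is to the estimates.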

It shouldn't be this way. At a minimum, there's no reason not to move the 
whole process out of a script on the host that makes repeated ssh calls to the 
VR, and into a script on the VR that the host calls once; that script could be 
a temporary one, generated on the fly and copied over to the VR. That alone 
would probably save most of the VR startup time, just by dropping the number of 
ssh/scp connections per instance from 80-90 to 3 (dhcp_entry.sh call, scp of 
temporary script, execution of temporary script).
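
A minimal sketch of the "generate a temporary script, copy it over, run it 
once" idea follows. It is not the CloudStack implementation. The helper names, 
the /tmp/vmdata-example paths, the root@ login, and the file mode are all 
hypothetical, and the dhcp_entry.sh call is left out because its arguments 
aren't shown here. The sketch only builds the script text and the scp/ssh 
command lines; it doesn't execute anything.

```python
import shlex


def build_vmdata_script(vm_ip, entries, base_dir="/tmp/vmdata-example"):
    """Return one bash script that writes every vm data file for one instance.

    entries maps a data file name to its content. base_dir is a placeholder
    path, not the directory the real VR uses.
    """
    vm_dir = f"{base_dir}/{vm_ip}"
    lines = ["#!/bin/bash", "set -e", f"mkdir -p {shlex.quote(vm_dir)}"]
    for name, content in sorted(entries.items()):
        path = shlex.quote(f"{vm_dir}/{name}")
        # One local write plus chmod on the VR replaces several remote calls.
        lines.append(f"printf '%s' {shlex.quote(content)} > {path}")
        lines.append(f"chmod 644 {path}")
    return "\n".join(lines) + "\n"


def plan_commands(router_ip, local_script,
                  remote_script="/tmp/vmdata-example.sh"):
    """Return the two connections needed: copy the script, then run it."""
    target = f"root@{router_ip}"  # hypothetical login
    return [
        ["scp", local_script, f"{target}:{remote_script}"],
        ["ssh", target, f"bash {shlex.quote(remote_script)}"],
    ]


if __name__ == "__main__":
    script = build_vmdata_script(
        "10.1.1.5", {"public-ipv4": "10.1.1.5", "user-data": "hello world"})
    print(script)
    for cmd in plan_commands("192.0.2.10", "/tmp/vmdata-10.1.1.5.sh"):
        print(" ".join(cmd))
```

The same generator could emit the commands for every instance on a VR into 
one script, which would cut the connection count further when a new VR is 
populated with a few hundred running instances at once.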

