[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714141#comment-13714141
 ] 

Andrew Bayer commented on CLOUDSTACK-3163:
------------------------------------------

I owe you a beer - so glad to see this finally addressed! It's been one of our 
biggest pain points with CloudStack for a long time.
                
> KVM Virtual Router startup time is painfully long
> -------------------------------------------------
>
>                 Key: CLOUDSTACK-3163
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3163
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>          Components: KVM
>    Affects Versions: pre-4.0.0
>         Environment: CloudPlatform 3.0.3, but I don't see any changes to the 
> relevant code (I think) on master
>            Reporter: Andrew Bayer
>            Priority: Critical
>
> When you've got a couple thousand instances, spread across 10 or so pods, 
> virtual router startup time is near crippling - actually, if you don't enable 
> the option to have virtual routers only populated with instances in their 
> pod, it *is* crippling, in that the virtual routers don't finish starting 
> before the management server decides they've timed out and tries to start a 
> new one.
> This seems to be the result of a few painful inefficiencies:
> - The same codepath is followed whether you're adding a new instance to an 
> already running VR, or adding two hundred already running instances to a new 
> VR. So each ssh/scp/sed/cp/chmod/etc command is replicated for each instance, 
> rather than finding efficiencies by doing things across the whole set of 
> instances. 
> - But what really eats up the time is the population of vm data - for each 
> piece of vm data (which, from a rough look at the code, seems to be something 
> like 10 or 11 data files), there are something like 7 ssh calls and an scp 
> call. So that means that per instance, we have somewhere around 80 to 90 
> ssh/scp calls, plus the single ssh call for dhcp_entry.sh. So with 200 
> instances, that's 16,000 to 18,000 ssh/scp calls on a single VR, with all 
> the overhead entailed in opening that many ssh connections, starting bash, 
> etc. Given that in my experience, a VR with ~200 instances takes ~90 
> minutes to start up (I may be misremembering slightly - it could be that 
> ~200 instances takes closer to 60 minutes, and ~300 takes closer to 90), 
> that works out to roughly 0.3 seconds per ssh/scp, which doesn't seem 
> implausible to me. 
> So, this shouldn't be this way. At a minimum, there's no reason not to 
> offload the whole process from a script run on the host making repeated ssh 
> calls to the VR to a script on the VR that gets called from the host, albeit 
> possibly a temporary one that's generated on the fly and copied over to the 
> VR. That alone would probably save most of the VR startup time, just by 
> dropping the number of ssh/scp connections per instance from 80-90 to 3 
> (dhcp_entry.sh call, scp of temporary script, execution of temporary script).
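
The call-count reasoning in the description can be checked with a small model.
This is only a sketch built from the rough estimates quoted in the report
(about 7 ssh calls and one scp call per data file, 10 or 11 data files per
instance, one ssh call for dhcp_entry.sh, and 3 connections per instance after
the proposed change). None of the constants are measured values, and the
function names are hypothetical, not part of CloudStack.

```python
# Rough model of the ssh/scp call counts described in the report.
# Every constant is an estimate quoted from the issue text, not a measurement.

SSH_CALLS_PER_DATA_FILE = 7     # "something like 7 ssh calls"
SCP_CALLS_PER_DATA_FILE = 1     # "and an scp call"
DHCP_ENTRY_CALLS = 1            # the single ssh call for dhcp_entry.sh
BATCHED_CALLS_PER_INSTANCE = 3  # dhcp_entry.sh, scp of script, run script


def calls_per_instance(data_files: int) -> int:
    """Connections opened for one instance with the current per-file approach."""
    per_file = SSH_CALLS_PER_DATA_FILE + SCP_CALLS_PER_DATA_FILE
    return data_files * per_file + DHCP_ENTRY_CALLS


def total_calls(instances: int, data_files: int) -> int:
    """Connections opened while populating a new VR with existing instances."""
    return instances * calls_per_instance(data_files)


def batched_total_calls(instances: int) -> int:
    """Connections opened if the per-instance work runs as one script on the VR."""
    return instances * BATCHED_CALLS_PER_INSTANCE


def seconds_per_call(startup_minutes: float, calls: int) -> float:
    """Average time per connection implied by an observed startup time."""
    return startup_minutes * 60 / calls


if __name__ == "__main__":
    for data_files in (10, 11):
        calls = total_calls(200, data_files)
        per_call = seconds_per_call(90, calls)
        print(f"{data_files} data files: {calls} calls, {per_call:.2f} s each")
    print(f"batched: {batched_total_calls(200)} calls")
```

With those estimates the model gives 81 to 89 connections per instance, so
16,200 to 17,800 for 200 instances, against 600 for the batched approach.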
