[ https://issues.apache.org/jira/browse/CLOUDSTACK-3163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712294#comment-13712294 ]
Wido den Hollander commented on CLOUDSTACK-3163:
------------------------------------------------

So, I took a quick peek at how this works and I see it does about 13 calls on my setup, of which 11 are calling "vmdata.sh" with different parameters.

I think that can be brought back to one call, bringing the total (in my setup) back to 3 instead of 13.

I'll see if I can find the time to test this out.

> KVM Virtual Router startup time is painfully long
> -------------------------------------------------
>
>                 Key: CLOUDSTACK-3163
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-3163
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.)
>          Components: KVM
>    Affects Versions: pre-4.0.0
>         Environment: CloudPlatform 3.0.3, but I don't see any changes to the relevant code (I think) on master
>            Reporter: Andrew Bayer
>            Priority: Critical
>
> When you've got a couple thousand instances, spread across 10 or so pods,
> virtual router startup time is near crippling - actually, if you don't enable
> the option to have virtual routers only populated with instances in their
> pod, it *is* crippling, in that the virtual routers don't finish starting
> before the management server decides they've timed out and tries to start a
> new one.
> This seems to be the result of a few painful inefficiencies:
> - The same codepath is followed whether you're adding a new instance to an
> already running VR, or adding two hundred already running instances to a new
> VR. So each ssh/scp/sed/cp/chmod/etc command is replicated for each instance,
> rather than finding efficiencies by doing things across the whole set of
> instances.
> - But what really eats up the time is the population of vm data - for each
> piece of vm data (which, from a rough look at the code, seems to be something
> like 10 or 11 data files), there are something like 7 ssh calls and an scp
> call.
> So that means that per instance, we have somewhere around 80 to 90
> ssh/scp calls, plus the single ssh call for dhcp_entry.sh. So with 200
> instances, that's 16,000 to 18,000 ssh/scp calls on a single VR, with all the
> overhead entailed in opening that many ssh connections, starting bash, etc,
> etc... Given that in my experience, a VR with ~200 instances takes ~90
> minutes to start up (I may be misremembering slightly - it could be ~200
> instances takes closer to 60 minutes, and ~300 takes closer to 90), that
> works out to roughly 0.3 seconds per ssh/scp, which doesn't seem implausible
> to me.
> So, this shouldn't be this way. At a minimum, there's no reason not to
> offload the whole process from a script run on the host making repeated ssh
> calls to the VR to a script on the VR that gets called from the host, albeit
> possibly a temporary one that's generated on the fly and copied over to the
> VR. That alone would probably save most of the VR startup time, just by
> dropping the number of ssh/scp connections per instance from 80-90 to 3
> (dhcp_entry.sh call, scp of temporary script, execution of temporary script).
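
To make the connection counts in the description concrete, here is a minimal Python sketch. It assumes the rough estimates quoted above (about 7 ssh calls and 1 scp call per data file, 10 or 11 data files, plus one dhcp_entry.sh call). It also shows one possible way to render all vm data writes into a single temporary script that could be copied to the VR and run once. The function names, the data layout, and the target directory are hypothetical and are not taken from the CloudStack code.

```python
import shlex
from typing import Dict

# Estimates quoted from the issue description, not measured values.
SSH_CALLS_PER_DATA_FILE = 7
SCP_CALLS_PER_DATA_FILE = 1


def connections_per_instance(data_files: int) -> int:
    """Connections under the per-file model, plus one dhcp_entry.sh call."""
    per_file = SSH_CALLS_PER_DATA_FILE + SCP_CALLS_PER_DATA_FILE
    return data_files * per_file + 1


def build_batch_script(vm_data: Dict[str, Dict[str, str]], base_dir: str) -> str:
    """Render one shell script that writes every data file for every instance.

    vm_data maps an instance IP to {file name: contents}. Both this layout
    and base_dir are hypothetical; the real VR layout may differ.
    """
    lines = ["#!/bin/sh", "set -e"]
    for ip in sorted(vm_data):
        target = f"{base_dir}/{ip}"
        lines.append(f"mkdir -p {shlex.quote(target)}")
        for name, contents in sorted(vm_data[ip].items()):
            path = shlex.quote(f"{target}/{name}")
            # shlex.quote keeps arbitrary contents safe inside the script.
            lines.append(f"printf '%s' {shlex.quote(contents)} > {path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(connections_per_instance(10), connections_per_instance(11))
    demo = {"10.1.1.5": {"instance-id": "i-2-5-VM", "local-hostname": "vm5"}}
    print(build_batch_script(demo, "/tmp/vmdata-demo"), end="")
```

Under these assumptions the per-file model gives 81 to 89 connections per instance, or roughly 16,000 to 18,000 for 200 instances. With a generated script, the host would need only one scp to copy it over and one ssh call to execute it, regardless of how many data files it covers.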