Hi everyone,

I was wondering how many of you are running CloudStack with a cluster of
management servers. I would guess most of you, but it would be nice to hear
everyone's voice. And do you see hosts going over their capacity limits?

We discovered that during VM allocation, if a lot of parallel requests to
create new VMs come in, most notably with large profiles, the capacity
increase happens too long after the host capacity checks, and hosts end up
going over their capacity limits. To detail the steps: the deployment
planner checks the cluster/host capacity and picks a deployment plan
(zone, cluster, host). The plan is stored in the database as a VmWork job,
and another thread picks up that entry and starts the deployment,
increasing the host capacity and sending the commands. This leaves a gap
of a couple of seconds between the host being picked and the capacity
increase for that host, which is more than enough to go over the capacity
of one or more hosts: several VmWork jobs can be added to the DB queue
targeting the same host before any of them gets picked up. A toy
reproduction of the check-then-reserve gap is sketched below.
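
To make the window concrete, here is a minimal, self-contained Java sketch
(not CloudStack code; the capacity numbers and the sleep standing in for
the job queue delay are made up) showing how the gap lets several threads
past the same capacity check:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Toy model: 8 threads each plan a 4096 MB VM on an 8192 MB host.
    public class CapacityRace {
        static final long HOST_CAPACITY_MB = 8192;
        static final AtomicLong usedMb = new AtomicLong();

        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(8);
            for (int i = 0; i < 8; i++) {
                pool.submit(() -> deploy(4096));
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
            // Prints far more than 8192 MB used: the host is oversubscribed.
            System.out.printf("used %d of %d MB%n", usedMb.get(), HOST_CAPACITY_MB);
        }

        static void deploy(long vmMb) {
            // 1. Planner-style check against the current usage.
            if (usedMb.get() + vmMb > HOST_CAPACITY_MB) {
                return; // host skipped
            }
            // 2. Gap: the job sits in the DB queue before another thread runs it.
            try { Thread.sleep(200); } catch (InterruptedException e) { return; }
            // 3. Capacity is only increased here, well after the check.
            usedMb.addAndGet(vmMb);
        }
    }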

To fix this issue, we're using ZooKeeper as a multi-JVM lock manager
through the Apache Curator library (
https://curator.apache.org/curator-recipes/shared-lock.html). We also
moved the point where the capacity is increased: it now happens right
after the deployment plan is found, inside the ZooKeeper lock. This
ensures we never go over the capacity of any host, and it has been working
reliably for a month in our management server cluster.
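
For illustration, a minimal sketch of that pattern with Curator's shared
lock recipe; the connect string and znode path are placeholders, and the
real change wraps CloudStack's planning and capacity-reservation code
where the comment sits:

    import java.util.concurrent.TimeUnit;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessSemaphoreMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class DeployLockSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string; every management server points at
            // the same ZK ensemble so the lock spans all JVMs.
            try (CuratorFramework zk = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",
                    new ExponentialBackoffRetry(1000, 3))) {
                zk.start();
                // One well-known znode guards the plan-and-reserve section.
                InterProcessSemaphoreMutex lock =
                        new InterProcessSemaphoreMutex(zk, "/cloudstack/locks/deploy");
                if (!lock.acquire(30, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("timed out waiting for deploy lock");
                }
                try {
                    // Find the deployment plan and increase the reserved
                    // capacity of the chosen host here, before releasing the
                    // lock, so no other management server can plan against
                    // stale capacity numbers.
                } finally {
                    lock.release();
                }
            }
        }
    }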

This adds another potential requirement, which should be discussed before
proposing a PR. Today the code also works seamlessly without ZK, so it's
not a hard requirement, for example in a lab.
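
One way to keep ZK optional (a hypothetical seam, not the actual patch) is
to hide the lock behind a small interface and fall back to an in-JVM lock
when no ZK connect string is configured; a ZK-backed implementation would
then wrap the Curator lock from the sketch above:

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical abstraction: the allocator only sees this interface.
    interface DeployLock {
        boolean acquire(long timeout, TimeUnit unit) throws Exception;
        void release() throws Exception;
    }

    // Fallback used when ZK is not configured, e.g. a single-server lab.
    // It serializes deployments within one JVM, which is enough there.
    class InJvmDeployLock implements DeployLock {
        private final ReentrantLock lock = new ReentrantLock();

        public boolean acquire(long timeout, TimeUnit unit) throws InterruptedException {
            return lock.tryLock(timeout, unit);
        }

        public void release() {
            lock.unlock();
        }
    }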

Comments?

Kind regards,
Marc-Aurèle
