You can default it to the number of cores, but please make it configurable. Some ops cram lots of services onto one node, and one service doesn't get to monopolize all cores.
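A minimal sketch of that idea, assuming a hypothetical setting (called max_parallel here; it is not an existing Heat option) that falls back to the core count when unset. A thread pool is used purely for illustration; Heat's engine has its own concurrency model, so treat this as the shape of the knob rather than an implementation:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_max_parallel(configured=None):
    """Return the cap on concurrent resource actions.

    `configured` stands in for a hypothetical operator-set option.
    None or 0 means "not set", so fall back to the number of cores.
    """
    if configured is None or configured == 0:
        # os.cpu_count() can return None on some platforms.
        return os.cpu_count() or 1
    if configured < 0:
        raise ValueError("max_parallel must be a positive integer")
    return configured


def run_actions(actions, max_parallel=None):
    """Run zero-argument callables with at most `max_parallel` in flight."""
    limit = resolve_max_parallel(max_parallel)
    with ThreadPoolExecutor(max_workers=limit) as pool:
        # map() preserves input order in the returned results.
        return list(pool.map(lambda action: action(), actions))
```

On a node shared by several services, an operator could then set the value to, say, 2 rather than letting one service claim every core.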
Thanks,
Kevin
________________________________
From: Angus Salkeld [asalk...@mirantis.com]
Sent: Tuesday, September 01, 2015 4:53 PM
To: OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [Heat] convergence rally test results (so far)

On Tue, Sep 1, 2015 at 10:45 PM Steven Hardy <sha...@redhat.com> wrote:

On Fri, Aug 28, 2015 at 01:35:52AM +0000, Angus Salkeld wrote:
> Hi
> I have been running some rally tests against convergence and our existing
> implementation to compare.
> So far I have done the following:
> 1. defined a template with a resource group
>    https://github.com/asalkeld/convergence-rally/blob/master/templates/resource_group_test_resource.yaml.template
> 2. the inner resource looks like this:
>    https://github.com/asalkeld/convergence-rally/blob/master/templates/server_with_volume.yaml.template
>    (it uses TestResource to attempt to be a reasonable simulation of a
>    server+volume+floatingip)
> 3. defined a rally job:
>    https://github.com/asalkeld/convergence-rally/blob/master/increasing_resources.yaml
>    that creates X resources then updates to X*2 then deletes.
> 4. I then ran the above with/without convergence and with 2,4,8
>    heat-engines
> Here are the results compared:
> https://docs.google.com/spreadsheets/d/12kRtPsmZBl_y78aw684PTBg3op1ftUYsAEqXBtT800A/edit?usp=sharing
> Some notes on the results so far:
> * convergence with only 2 engines does suffer from RPC overload (it
>   gets message timeouts on larger templates). I wonder if this is the
>   problem in our convergence gate...
> * convergence does very well with a reasonable number of engines
>   running.
> * delete is slightly slower on convergence
> Still to test:
> * the above, but measure memory usage
> * many small templates (run concurrently)

So, I tried running my many-small-templates test here with convergence enabled:

https://bugs.launchpad.net/heat/+bug/1489548

In heat.conf I set:

max_resources_per_stack = -1
convergence_engine = true

Most other settings (particularly RPC and DB settings) are defaults.

Without convergence (but with max_resources_per_stack disabled) I see the time to create a ResourceGroup of 400 nested stacks (each containing one RandomString resource) is about 2.5 minutes (core i7 laptop w/SSD, 4 heat workers, i.e. the default for a 4-core machine).

With convergence enabled, I see these errors from sqlalchemy:

  File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 652, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 444, in checkout
    rec = pool._do_get()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 980, in _do_get
    (self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

I assume this means we're loading the DB much more in the convergence case and overflowing the QueuePool?

Yeah, looks like it.

This seems to happen when the RPC call from the ResourceGroup tries to create some of the 400 nested stacks.

Interestingly, after this error the parent stack moves to CREATE_FAILED, but the engine remains (very) busy, to the point of being only partially responsive, so it looks like maybe the cancel-on-fail isn't working (I'm assuming it isn't error_wait_time, because the parent stack has been marked FAILED and I'm pretty sure it's been more than 240s).

I'll dig a bit deeper when I get time, but for now you might like to try the stress test too.
It's a bit of a synthetic test, but it turns out to be a reasonable proxy for some performance issues we observed when creating large-ish TripleO deployments (which also create a large number of nested stacks concurrently).

Thanks a lot for testing, Steve!

I'll make 2 bugs for what you have raised:
1. limit the number of resource actions in parallel (maybe based on the number of cores)
2. the cancel on fail error

-Angus

Steve

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev