Hi everyone,

Following [1], a few of us sat down during the last day of the Austin Summit and discussed the possibility of adding formal support for Tooz to Neutron, specifically for the distributed locking mechanism it provides. The conclusion we reached was that benchmarks should be run to show whether and how Tooz affects the normal operation of Neutron (for example, if locking a resource through Zookeeper takes 3 seconds, the approach is not worthwhile at all).
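For anyone not familiar with Tooz, the idea is simply to wrap the critical section with a distributed lock provided by a coordination backend. Below is a minimal sketch of what that looks like; the backend URL, member id and lock name are made up for illustration and are not taken from the actual patch:

    # Minimal Tooz lock sketch (illustrative only, not the Neutron code)
    from tooz import coordination

    coordinator = coordination.get_coordinator(
        'zookeeper://127.0.0.1:2181', b'neutron-server-1')
    coordinator.start()

    lock = coordinator.get_lock(b'ha-vrid-allocation')
    if lock.acquire(blocking=True):  # may block if another server holds it
        try:
            pass  # critical section, e.g. pick a free vrid for the router
        finally:
            lock.release()

    coordinator.stop()

In the uncontended case acquire() returns immediately; under contention it blocks until the current holder releases the lock, which is exactly the overhead the benchmarks try to quantify.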
We've finally finished the benchmarks and they are available at [2]. They test a specific case: when creating an HA router, a lock-free algorithm is used to assign a vrid to the router (this is later used for keepalived), and the benchmark measures the effect of wrapping that function with a lock backed by either Zookeeper or Etcd, using the no-Tooz case as a baseline. The locking was checked in two different ways - one that presents no contention (acquire() always succeeds immediately) and one that presents contention (acquire() may block until a similar operation for the invoking tenant completes).

The benchmarks show that while using Tooz does raise the cost of an operation, the effect is not as bad as we initially feared. In the simple case of a single request at a time, using Zookeeper raised the average time it took to create a router by 1.5% (from 11.811 to 11.988 seconds). In the more realistic case of 6 simultaneous requests, Zookeeper raised the cost by 3.74% (from 16.533 to 17.152 seconds).

It is important to note that the setup itself was overloaded - it was built on a single bare-metal server hosting 5 VMs (4 of which were controllers), so we were unable to push further: 10 concurrent requests overloaded the server and caused some race conditions to appear in the L3 scheduler (bugs will be opened soon). For this reason we haven't tested heavier loads and limited ourselves to 6 simultaneous requests.

Also worth noting, a race condition was noticed in Tooz's etcd driver. We've discussed this with the Tooz devs and proposed a patch that should fix it [3].

Lastly, races were found in the L3 HA scheduler as well; we have yet to dig into them and find their cause - bugs will be opened for these too.

I've opened the summary [2] for comments, so you're welcome to discuss the results both on the ML and on the doc itself.

(CC to all those who attended the Austin Summit meeting and other interested parties)

Happy locking,

[1]: http://lists.openstack.org/pipermail/openstack-dev/2016-April/093199.html
[2]: https://docs.google.com/document/d/1jdI8gkQKBE0G9koR0nLiW02d5rwyWv_-gAp7yavt4w8
[3]: https://review.openstack.org/#/c/342096/

--
John Schwarz,
Senior Software Engineer,
Red Hat.