On Mon, 2014-03-17 at 14:55 -0400, Matthew Treinish wrote:
> Hi everyone,
>
> So a little while ago we noticed that in all the gate runs one of the
> ceilometer cli tests is consistently in the list of slowest tests. (and
> often the slowest) This was a bit surprising given the nature of the cli
> tests we expect them to execute very quickly.
>
> test_ceilometer_resource_list which just calls ceilometer resource_list
> from the CLI once is taking >=2 min to respond. For example:
> http://logs.openstack.org/68/80168/3/gate/gate-tempest-dsvm-postgres-full/07ab7f5/logs/tempest.txt.gz#_2014-03-17_17_08_25_003
> (where it takes > 3min)

Yep. At AT&T, we had to disable calls to GET /resources without any
filters. The call would return hundreds of thousands of records, all
JSON-ified at the Ceilometer API endpoint, and the result would take
minutes to return. There was no default limit on the query, which meant
every single record in the database was returned, and on even a
semi-busy system that meant horrendous performance.

Besides the problem that the SQLAlchemy driver doesn't yet support
pagination [1], the main problem with the get_resources() call is that
the underlying database schema for the Sample model is wacky and forces
the use of a dependent subquery in the WHERE clause [2], which
completely kills the performance of the query to get resources.

[1] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L436
[2] https://github.com/openstack/ceilometer/blob/master/ceilometer/storage/impl_sqlalchemy.py#L503

> The cli tests are supposed to be quick read-only sanity checks of the cli
> functionality and really shouldn't ever be on the list of slowest tests
> for a gate run.

Oh, the test is read-only, all right. ;) It's just that it's reading
hundreds of thousands of records.

> I think there was possibly a performance regression recently in
> ceilometer because from I can tell this test used to normally take
> ~60 sec. (which honestly is probably too slow for a cli test too) but it
> is currently much slower than that.
>
> From logstash it seems there are still some cases when the resource list
> takes as long to execute as it used to, but the majority of runs take a
> long time:
> http://goo.gl/smJPB9
>
> In the short term I've pushed out a patch that will remove this test from
> gate runs: https://review.openstack.org/#/c/81036 But, I thought it would
> be good to bring this up on the ML to try and figure out what changed or
> why this is so slow.

I agree with removing the test from the gate in the short term.
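
To make the "default limit plus pagination" point concrete, here is a
rough sketch of the general technique, not Ceilometer code. It uses
sqlite3 so it runs standalone. The `resource` table, the DEFAULT_LIMIT
value, and the list_resources() and make_demo_db() helpers are all
hypothetical names made up for illustration, and the marker-based
approach is just one common way to page through results.

```python
import sqlite3

DEFAULT_LIMIT = 100  # assumed value for illustration, not a real Ceilometer setting


def list_resources(conn, limit=None, marker=None):
    """Return (page, next_marker) of resource ids, ordered by id.

    Marker-based pagination: `marker` is the last id of the previous
    page, so each query reads only the rows it is going to return.
    """
    limit = DEFAULT_LIMIT if limit is None else limit
    if limit < 1:
        raise ValueError("limit must be >= 1")
    sql = "SELECT id FROM resource"
    params = []
    if marker is not None:
        sql += " WHERE id > ?"
        params.append(marker)
    sql += " ORDER BY id LIMIT ?"
    params.append(limit + 1)  # one extra row tells us whether another page exists
    rows = [row[0] for row in conn.execute(sql, params)]
    page = rows[:limit]
    next_marker = page[-1] if len(rows) > limit else None
    return page, next_marker


def make_demo_db(n):
    """Build an in-memory table holding n hypothetical resource ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE resource (id TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT INTO resource VALUES (?)",
        [("res-%06d" % i,) for i in range(n)],
    )
    return conn
```

A caller that passes no arguments gets at most DEFAULT_LIMIT rows back,
and walks the rest by feeding next_marker into the following call until
it comes back as None. An unfiltered call then costs one bounded page
instead of the whole table.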

Medium to long term, the root causes of the problem should be addressed:
GET /resources has no support for pagination on the query, there is no
default that limits results based on a "since" timestamp, and the
underlying database schema is non-optimal.

Best,
-jay