On 09/07/2016 06:46 AM, Chris Dent wrote:
> More updates on resource providers work:
>
> Yesterday we realized that a SQL join for associating inventories
> with allocations and resource providers was missing a critical AND
> clause. This was leading to allocations failing to be written when
> there should have been plenty of capacity.
>
> This was fixed in:
>
> https://review.openstack.org/#/c/366245/
>
> It will be merged in a few minutes. There are still some concerns
> that we don't understand why tests (of the prior code) were not
> failing.
As a follow up here, I actually got to the bottom of why the old tests
didn't fail. There were no tests which had > 1 resource class and > 1
consumer for a resource provider. And even if there had been, they
probably wouldn't have failed unless the scales of the resource
classes were such that mixing up the free / used values between them
would have caused an issue.

To reproduce the key issue you need active allocations in the database
that are not owned by your consumer, because one of the first things
that happens when setting allocations for your consumer is that its
existing allocations are deleted. If nothing is in the allocations
table, the left outer join has no usage rows to join and sum up.
Basically the column set:

    cols_in_output = [
        _RP_TBL.c.id.label('resource_provider_id'),
        _RP_TBL.c.uuid,
        _RP_TBL.c.generation,
        _INV_TBL.c.resource_class_id,
        _INV_TBL.c.total,
        _INV_TBL.c.reserved,
        _INV_TBL.c.allocation_ratio,
        usage.c.used,
    ]

ends up with None in the final column, so you'll get rows like:

    1,$uuid,1,2,1024,4,16.0,None
    1,$uuid,1,9,40,4,1.0,None

However, if there are existing allocations, the left outer join blows
this out into a matrix and you'd get:

    1,$uuid,1,2,1024,4,16.0,16
    1,$uuid,1,9,40,4,1.0,16
    1,$uuid,1,2,1024,4,16.0,1
    1,$uuid,1,9,40,4,1.0,1

where 1 is the usage for resource class 9, and 16 is the usage for
resource class 2. This is because the join was missing the condition
inventory.resource_class_id == allocs.resource_class_id. The fix
provides a test that will explode if we regress this.

Because this only shows up when we've got existing allocations by a
different consumer (i.e. a concurrently running guest), it explains
why this was sporadically showing up in the gate: only when 3 guests
were stood up at the same time (either in a test, or between tests)
would we get this issue. Our guests run at 64M memory, we run on
8 cpu hosts, with a 16x modifier.
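The join blow-out is easy to reproduce standalone. Here's a minimal
sketch against an in-memory SQLite database (table and column names
are simplified stand-ins, not Nova's actual schema) showing how the
left outer join on resource_provider_id alone pairs each inventory row
with every usage row, and how adding the resource_class_id condition
fixes it:

```python
# Toy reproduction of the missing-AND-clause join bug.
# Schema and data are illustrative, not Nova's real DDL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventories (
    resource_provider_id INT, resource_class_id INT,
    total INT, reserved INT, allocation_ratio REAL);
CREATE TABLE allocations (
    resource_provider_id INT, resource_class_id INT,
    consumer_id TEXT, used INT);
-- one provider, two inventories: resource classes 2 and 9
INSERT INTO inventories VALUES (1, 2, 1024, 4, 16.0), (1, 9, 40, 4, 1.0);
-- existing allocations owned by a *different* consumer
INSERT INTO allocations VALUES (1, 2, 'other', 16), (1, 9, 'other', 1);
""")

USAGE_SUBQUERY = """
    SELECT resource_provider_id, resource_class_id, SUM(used) AS used
    FROM allocations
    GROUP BY resource_provider_id, resource_class_id
"""

# Buggy join: only matches on resource_provider_id, so every
# inventory row pairs with every usage row for that provider.
buggy = conn.execute("""
    SELECT inv.resource_class_id, inv.total, usage.used
    FROM inventories inv
    LEFT OUTER JOIN (%s) usage
      ON usage.resource_provider_id = inv.resource_provider_id
""" % USAGE_SUBQUERY).fetchall()
print(len(buggy))  # 4 rows: the 2x2 matrix from the mail

# Fixed join: the extra AND pins usage to its own resource class.
fixed = conn.execute("""
    SELECT inv.resource_class_id, inv.total, usage.used
    FROM inventories inv
    LEFT OUTER JOIN (%s) usage
      ON usage.resource_provider_id = inv.resource_provider_id
     AND usage.resource_class_id = inv.resource_class_id
""" % USAGE_SUBQUERY).fetchall()
print(len(fixed))  # 2 rows, each class matched with its own usage
```

With an empty allocations table both variants return 2 rows with used
= None, which is why no test without pre-existing allocations from
another consumer could catch this.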
If we compare consumed ram to available cpu (which was the actual fail
happening): the first guest up consumes 64M ram, 1 vcpu. 128 vcpu can
be consumed, and 128 - 64 >= 0, so we survive the column shift. The
second guest gets us to 128M ram, 2 vcpu; again, we can actually
survive the column shift. But once we are >= 3 guests at once we can
hit this: 128 - 192 < 0. There are no ORDER BY clauses inside the SQL
monster
(https://github.com/openstack/nova/blob/25abb68039ca122b4b3796a9f8c9e3495db22772/nova/objects/resource_provider.py#L637)
which means the order we get the rows back is undefined, and with the
join blow-out sometimes we'll be correctly comparing, sometimes we
won't. But until you get to 3 guests at once, you'll never be able to
see it.

	-Sean

--
Sean Dague
http://dague.net

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
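P.S. The 3-guest threshold above is easy to sanity-check. A tiny
sketch of the broken comparison (the host shape and guest size are
taken from the numbers in this mail; the function name is mine):

```python
# Host shape from the mail: 8 physical cpus, 16x allocation ratio,
# guests at 64M ram / 1 vcpu each.
VCPU_CAPACITY = 8 * 16  # 128 consumable vcpus
RAM_PER_GUEST = 64      # MB

def mixed_up_check_passes(num_guests):
    # With the column shift, consumed ram gets compared against
    # vcpu capacity; the capacity check passes while the difference
    # stays non-negative.
    consumed_ram = num_guests * RAM_PER_GUEST
    return VCPU_CAPACITY - consumed_ram >= 0

assert mixed_up_check_passes(1)      # 128 - 64 >= 0
assert mixed_up_check_passes(2)      # 128 - 128 >= 0
assert not mixed_up_check_passes(3)  # 128 - 192 < 0: the gate failure
```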