On Wed, Nov 23, 2016 at 11:58 PM, Zane Bitter <zbit...@redhat.com> wrote:
> We discussed $SUBJECT at the summit as one of the main performance
> problems that people are running into when trying to create very large
> autoscaling groups, as projects like Sahara, Magnum, TripleO, OpenShift
> are wont to do. Of course, as we all know, validation happens
> synchronously, so it's prone to causing RPC timeouts that mean a hard
> failure of the parent stack.
>
> First the good news - I just committed this patch:
>
> https://review.openstack.org/#/c/400961/
>
> which should mean from now on that resources with identical definitions
> will not all be validated, and instead we'll just validate one
> representative one. In theory this should mean that autoscaling groups
> should now validate in constant rather than linear time. If anyone from
> one of the affected projects is able to confirm this, then I'd be happy
> to backport the patch to stable/newton. It really is very simple.
>
> The bad news here is for users of ResourceGroups with %index%
> substitution (*cough*TripleO*cough*) - this makes each resource
> definition unique, so it won't benefit from this fix. (Adding this to my
> mental list of reasons why index substitution is bad.)
>
>
> I also investigated another issue, which is that since the fix for
> https://bugs.launchpad.net/heat/+bug/1388140 landed (in Kilo) I believe
> we are validating nested stacks multiple times (specifically, m times,
> where m is the stack's depth in the tree):
>
>      root                 child                    grandchild
>
> create
>   -> validate ----------> validate --------------> validate
>   -> Resource.create ===> create
>                             -> validate ----------> validate
>                             -> Resource.create ===> create
>                                                       -> validate
>
> The only good news here is that ResourceGroup is smart enough to make
> sure that it generates a nested stack with at most 1 resource to
> validate when validate() is called. (However, when the nested stack is
> created, and thus validated, it is of course full-sized.) Autoscaling
> groups make no such allowances, but the patch above should actually have
> the same effect. (We can't get rid of the special case for ResourceGroup
> though, because of index substitution.)
>
> An obvious fix would be to disable validation - or, more specifically,
> validation of _resources_ - on create/update for stacks that have a
> non-null owner_id (i.e. nested stacks), so that we had something like:
>
>      root                 child                    grandchild
>
> create
>   -> validate ----------> validate --------------> validate
>   -> Resource.create ===> create
>                             -> Resource.create ===> create
>
> That would eliminate the duplication/triplication/multiplication of
> validation. It would also mean that we'd cut out the expensive part of
> ResourceGroup validation with index substitution, leaving only the cheap
> part.
>
> One downside is that in the ResourceGroup/index substitution case we'd
> be creating resources whose definitions hadn't _ever_ been validated. I
> _think_ that's safe, in the sense that you'd just hear about errors
> later, as opposed to everything falling over in a heap, but it's
> difficult to be certain. Hearing about problems late is also not ideal
> (since it may cause otherwise-healthy siblings to be cancelled), but I
> would guess that heavy users like TripleO developers would say that it's
> worth the tradeoff.
>
> However, one other thing about this bothers me. The part of validation
> that we're keeping:
>
>   -> validate ----------> validate --------------> validate
>
> involves loading all of the nested stacks in memory at once (i.e. the
> thing we were not supposed to be doing any more in Kilo, in favour of
> farming nested stacks out over RPC.) As we discovered when we found out
> we were doing the same thing with outputs[1], this is a bit like hanging
> out a giant "Kick Me" sign for the OOM Killer.
>
> That's mitigated quite a lot by my patch though... we'll load the whole
> autoscaling group stack in memory, but if its members are themselves
> nested stacks we'll load only one of them. So the scaling tendencies
> will hopefully be dominated by the complexity of your templates more
> than the size of your deployment. ResourceGroup is in a better position,
> because its nested stack will actually have only one member, so the size
> shouldn't affect memory consumption at all during validation.
>
> Some options:
> 1) Chalk it up to an acceptable tradeoff
> 2) Add a single-member special case for autoscaling group validation
> 3) Farm out the nested validation over RPC
> 4) Both (2) & (3)
> 5) Some totally different arrangement of how nested stacks are validated
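
To make the two effects above concrete, here is a rough Python sketch. It is not Heat's code and does not reflect how the patch is implemented: `definition_key`, `validate_unique`, `validation_counts` and `noop_validate` are hypothetical names, and the resource definitions are plain dicts invented for illustration. The first part validates one representative per distinct definition and shows why %index% substitution defeats that. The second part is a simplified model of the diagrams, counting how often each level of a nested chain is validated with and without validation on nested create.

```python
import json


def definition_key(defn):
    """Return a hashable key that is equal for identical definitions."""
    return json.dumps(defn, sort_keys=True)


def validate_unique(definitions, validate):
    """Call validate() once per distinct definition; return the call count."""
    seen = set()
    calls = 0
    for defn in definitions:
        key = definition_key(defn)
        if key in seen:
            continue  # an identical definition has already been validated
        seen.add(key)
        validate(defn)
        calls += 1
    return calls


def validation_counts(depth, nested_create_validates=True):
    """Count validations per level in a chain of nested stacks.

    Index 0 is the root. Each create validates the creating stack and all of
    its descendants. If nested_create_validates is False, only the root's
    create triggers validation (a simplified model of the proposed fix).
    """
    counts = [0] * depth
    for creator in range(depth):
        if creator > 0 and not nested_create_validates:
            continue
        for level in range(creator, depth):
            counts[level] += 1
    return counts


def noop_validate(defn):
    """Stand-in for real (expensive) resource validation."""


# 1000 identical members, as in an autoscaling group.
identical = [{"type": "OS::Nova::Server", "properties": {"flavor": "m1.small"}}
             for _ in range(1000)]

# 1000 members that differ only by a substituted index, as with %index%.
indexed = [{"type": "OS::Nova::Server", "properties": {"name": "node-%d" % i}}
           for i in range(1000)]

identical_calls = validate_unique(identical, noop_validate)
indexed_calls = validate_unique(indexed, noop_validate)
print(identical_calls, indexed_calls)       # 1 1000
print(validation_counts(3))                 # [1, 2, 3]
print(validation_counts(3, False))          # [1, 1, 1]
```

In this model the identical group costs one validation regardless of size, the indexed group still costs one per member, and a stack at depth m is validated m times unless nested creates skip validation.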
I think I'd like to see what difference 3 makes. Maybe then also do 2.
Again, we really need to have some reproducible big template that we can
use to make sure what we're doing is useful.

--
Thomas
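
As a starting point for such a reproducible template, a small generator along these lines might do. This is only a sketch under assumptions: `build_group_template` is a hypothetical helper, the choice of OS::Heat::RandomString as a cheap member resource is mine, and the property names used (`count`, `resource_def`, `min_size`, `max_size`, `desired_capacity`, `resource`, `length`, `salt`) should be checked against the resource type documentation for the release under test. The output is JSON, which is also valid YAML.

```python
import json


def build_group_template(size, kind="autoscaling", use_index=False):
    """Build a HOT template (as a dict) containing one group of `size` members.

    kind is "autoscaling" or "resource_group". use_index only applies to
    resource groups and makes every member definition unique via %index%.
    """
    member = {"type": "OS::Heat::RandomString", "properties": {"length": 16}}

    if kind == "autoscaling":
        group = {
            "type": "OS::Heat::AutoScalingGroup",
            "properties": {
                "min_size": size,
                "max_size": size,
                "desired_capacity": size,
                "resource": member,
            },
        }
    elif kind == "resource_group":
        if use_index:
            # %index% is substituted per member, so no two definitions match.
            member["properties"]["salt"] = "member-%index%"
        group = {
            "type": "OS::Heat::ResourceGroup",
            "properties": {"count": size, "resource_def": member},
        }
    else:
        raise ValueError("unknown group kind: %s" % kind)

    return {
        "heat_template_version": "2016-10-14",
        "resources": {"group": group},
    }


asg_template = build_group_template(500)
indexed_template = build_group_template(
    500, kind="resource_group", use_index=True)

# Serialise for use as a template file.
asg_text = json.dumps(asg_template, indent=2, sort_keys=True)
indexed_text = json.dumps(indexed_template, indent=2, sort_keys=True)
print(len(asg_text), len(indexed_text))
```

Timing validation of the same template at a few different sizes, before and after a change, would show whether validation time still grows with group size.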