On 08-Jan-15 16:09, Anant Patil wrote: > On 16-Dec-14 09:41, Zane Bitter wrote: >> On 15/12/14 09:32, Anant Patil wrote: >>> On 12-Dec-14 06:29, Zane Bitter wrote: >>>> On 11/12/14 01:14, Anant Patil wrote: >>>>> On 04-Dec-14 10:49, Zane Bitter wrote: >>>>>> On 01/12/14 02:02, Anant Patil wrote: >>>>>>> On GitHub: https://github.com/anantpatil/heat-convergence-poc >>>>>> >>>>>> I'm trying to review this code at the moment, and finding some stuff I >>>>>> don't understand: >>>>>> >>>>>> https://github.com/anantpatil/heat-convergence-poc/blob/master/heat/engine/stack.py#L911-L916 >>>>>> >>>>>> This appears to loop through all of the resources *prior* to kicking off >>>>>> any actual updates to check if the resource will change. This is >>>>>> impossible to do in general, since a resource may obtain a property >>>>>> value from an attribute of another resource and there is no way to know >>>>>> whether an update to said other resource would cause a change in the >>>>>> attribute value. >>>>>> >>>>>> In addition, no attempt to catch UpdateReplace is made. Although that >>>>>> looks like a simple fix, I'm now worried about the level to which this >>>>>> code has been tested. >>>>>> >>>>> We were working on a new branch and, as we discussed on Skype, we have >>>>> handled all these cases. Please have a look at our current branch: >>>>> https://github.com/anantpatil/heat-convergence-poc/tree/graph-version >>>>> >>>>> When a new resource is taken for convergence, its children are loaded >>>>> and the resource definition is re-parsed. The frozen resource definition >>>>> will have all the "get_attr" references resolved. >>>>> >>>>>> >>>>>> I'm also trying to wrap my head around how resources are cleaned up in >>>>>> dependency order. If I understand correctly, you store in the >>>>>> ResourceGraph table the dependencies between various resource names in >>>>>> the current template (presumably there could also be some left around >>>>>> from previous templates too?). For each resource name there may be a >>>>>> number of rows in the Resource table, each with an incrementing version. >>>>>> As far as I can tell though, there's nowhere that the dependency graph >>>>>> for _previous_ templates is persisted? So if the dependency order >>>>>> changes in the template we have no way of knowing the correct order to >>>>>> clean up in any more? (There's not even a mechanism to associate a >>>>>> resource version with a particular template, which might be one avenue >>>>>> by which to recover the dependencies.) >>>>>> >>>>>> I think this is an important case we need to be able to handle, so I >>>>>> added a scenario to my test framework to exercise it and discovered that >>>>>> my implementation was also buggy. Here's the fix: >>>>>> https://github.com/zaneb/heat-convergence-prototype/commit/786f367210ca0acf9eb22bea78fd9d51941b0e40 >>>>>> >>>>> >>>>> Thanks for pointing this out, Zane. We too had a buggy implementation for >>>>> handling inverted dependencies. I had a hard look at our algorithm, where >>>>> we were continuously merging the edges from the new template into the edges >>>>> from previous updates. It was an optimized way of traversing the graph >>>>> in both forward and reverse order without missing any resources. But >>>>> when the dependencies are inverted, this wouldn't work. >>>>> >>>>> We have changed our algorithm. The changes in edges are noted down in the >>>>> DB; only the delta of edges from the previous template is calculated and >>>>> kept.
At any given point of time, the graph table has all the edges from >>>>> the current template and the delta from previous templates. Each edge has a >>>>> template ID associated with it. >>>> >>>> The thing is, the cleanup dependencies aren't really about the template. >>>> The real resources really depend on other real resources. You can't >>>> delete a Volume before its VolumeAttachment, not because it says so in >>>> the template but because it will fail if you try. The template can give >>>> us a rough guide in advance to what those dependencies will be, but if >>>> that's all we keep then we are discarding information. >>>> >>>> There may be multiple versions of a resource corresponding to one >>>> template version. Even worse, the actual dependencies of a resource >>>> change on a smaller time scale than an entire stack update (this is the >>>> reason the current implementation updates the template one resource at a >>>> time as we go). >>>> >>> >>> Absolutely! The edges from the template are kept only for reference >>> purposes. When we have a resource in the new template, its template ID will >>> also be marked to the current template. At any point of time, a realized >>> resource will be from the current template, even if it was found in previous >>> templates. The template ID "moves" for a resource if it is found. >> >> In theory (disclaimer: I didn't implement this yet) it can change on an >> even smaller timescale than that. The existing plugins are something of >> a black box to us: if a failure occurs we don't necessarily know whether >> the real-world dependency is on the old or new version of another resource. >> > > Yes, and that's why we roll back the failed resource and its dependent > resources to older versions, provided that the older resources are not > deleted until the update is done. It is easier with the template ID as we know > the previous complete template. > >>>> Given that our Resource entries in the DB are in 1:1 correspondence with >>>> actual resources (we create a new one whenever we need to replace the >>>> underlying resource), I found it makes the most conceptual and practical >>>> sense to store the requirements in the resource itself, and update them >>>> at the time they actually change in the real world (bonus: introduces no >>>> new locking issues and no extra DB writes). I settled on this after a >>>> legitimate attempt at trying other options, but they didn't work out: >>>> https://github.com/zaneb/heat-convergence-prototype/commit/a62958342e8583f74e2aca90f6239ad457ba984d >>>> >>> >>> I am okay with the notion of the graph stored in the resource table. >>> >>>>> For resource clean-up, we start from the >>>>> first template (the template which was completed and on top of which the >>>>> updates were made, or the empty template otherwise), and move towards the current >>>>> template in the order in which the updates were issued, and for each >>>>> template the graph (edges if found for the template) is traversed in >>>>> reverse order and resources are cleaned up. >>>> >>>> I'm pretty sure this is backwards - you'll need to clean up newer >>>> resources first because they may reference resources from older >>>> templates. Also, if you have a stubborn old resource that won't delete, >>>> you don't want that to block clean-ups of anything newer. >>>> >>>> You're also serialising some of the work unnecessarily because you've >>>> discarded the information about dependencies that cross template >>>> versions, forcing you to clean up only one template version at a time.
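As an illustration of the clean-up ordering being discussed here, the following is a rough sketch (hypothetical names, not the actual Heat or PoC code). Each resource record carries its real-world requirements and, if it is a replacement, a pointer to the resource it replaced; the deletion order is derived from those records alone, so dependents and replacements are always checked before the resources they require or replaced:

    import collections


    class ResourceRecord(object):
        def __init__(self, key, requires=(), replaces=None):
            self.key = key                  # unique id of this version of the resource
            self.requires = set(requires)   # keys of resources this one actually depends on
            self.replaces = replaces        # key of the resource this one replaced, if any


    def cleanup_order(records):
        """Yield resource keys in a safe clean-up (deletion) order.

        A resource may only be checked for clean-up after every resource
        that requires it and after any resource that replaced it.
        """
        by_key = dict((r.key, r) for r in records)
        # blockers[k] = set of resources that must be cleaned up before k
        blockers = collections.defaultdict(set)
        for r in records:
            for req in r.requires:
                if req in by_key:
                    blockers[req].add(r.key)     # dependents are checked first
            if r.replaces in by_key:
                blockers[r.replaces].add(r.key)  # replacement before replaced

        done, remaining = set(), set(by_key)
        while remaining:
            ready = set(k for k in remaining if blockers[k] <= done)
            if not ready:
                raise RuntimeError('cycle in clean-up dependencies')
            for key in sorted(ready):
                yield key
            done |= ready
            remaining -= ready


    # Example: a replacement server still referenced by a volume attachment.
    records = [
        ResourceRecord('server-v1'),
        ResourceRecord('server-v2', replaces='server-v1'),
        ResourceRecord('attachment-v1', requires=('server-v2', 'volume-v1')),
        ResourceRecord('volume-v1'),
    ]
    print(list(cleanup_order(records)))
    # ['attachment-v1', 'server-v2', 'volume-v1', 'server-v1'] - the attachment
    # goes before the volume and the server; the replacement server goes before
    # the resource it replaced.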
>>>> >>> Since only the changed set is stored, it is going to be smaller in most >>> of the cases. Within each previous template, if there are any edges, >>> they are tried for clean-up in reverse order. If a resource is not >>> updated but is available in the new template, the template ID will be the current >>> template ID, as I mentioned above. Within a template, the process is >>> concurrent, but across the templates, it will be serial. I am okay with >>> this since, in the real world, the change in dependencies may not be as much >>> as we think. >> >> I agree the serialisation probably won't make a big difference in most >> cases, although autoscaling is an obvious and important one where >> serialising a lot of things unnecessarily could have a big impact on the >> user. >> > > I was probably over-cautious when I tested it, but I can see that the > resource clean-up can be triggered concurrently. This is possible since > the update-replaced resources have different template IDs for replaced > and replacement. > >>>>> The process ends up with >>>>> the current template being traversed in reverse order and resources being >>>>> cleaned up. All the update-replaced resources from the older templates >>>>> (older updates in concurrent updates) are cleaned up in the order in >>>>> which they are supposed to be. >>>>> >>>>> Resources are now tied to a template; they have a template_id instead of >>>>> a version. As we traverse the graph, we know which template we are working >>>>> on, and can take the relevant action on the resource. >>>>> >>>>> For rollback, another update is issued with the last completed template >>>>> (it is designed to have an empty template as the last completed template by >>>>> default). The current template being worked on becomes the predecessor of >>>>> the new incoming template. In case of rollback, the last completed >>>>> template becomes the incoming new template, the current one becomes the new >>>>> template's predecessor, and the successor of the last completed template will >>>>> have no predecessor. All these changes are available in the >>>>> 'graph-version' branch. (The branch name is a misnomer though!) >>>>> >>>>> I think it is simpler to think about the stack and concurrent updates when >>>>> we associate resources and edges with a template, and the stack with the current >>>>> template and its predecessors (if any). >>>> >>>> It doesn't seem simple to me because it's trying to reconstruct reality >>>> from a lossy version of history. The simplest way to think about it, in >>>> my opinion, is this: >>>> - When updating resources, respect their dependencies as given in the >>>> template. >>>> - When checking resources to clean up, respect their actual, current >>>> real-world dependencies, and check replacement resources before the >>>> resources that they replaced. >>>> - Don't check a resource for clean-up until it has been updated to the >>>> latest template. >>>> >>>>> I also think that we should decouple Resource from Stack. This is really >>>>> a hindrance when workers work on individual resources. The resource >>>>> should be abstracted enough from the stack for the worker to work on the >>>>> resource alone. The worker should load the required resource plug-in and >>>>> start converging. >>>> >>>> I think that's a worthy goal, and it would be really nice if we could >>>> load a Resource completely independently of its Stack, and I know this >>>> has always been a design goal of yours (hence you're caching the >>>> resource definition in the Resource rather than getting it from the >>>> template).
>>>> >>>> That said, I am convinced it's an unachievable goal, and I believe we >>>> should give up on it. >>>> >>>> - We'll always need to load _some_ central thing (e.g. to find out if >>>> the current traversal is still the valid one), so it might as well be >>>> the Stack. >>>> - Existing plugin abominations like HARestarter expect a working Stack >>>> object to be provided so it can go hunting for other resources. >>>> >>>> I think the best we can do is try to make heat.engine.stack.Stack as >>>> lazy as possible so that it only does extra work when strictly required, >>>> and just accept that the stack will always be loaded from the database. >>>> >>>> I am also strongly in favour of treating the idea of caching the >>>> unresolved resource definition in the Resource table as a straight >>>> performance optimisation that is completely separate to the convergence >>>> work. It's going to be inevitably ugly because there is no >>>> template-format-independent way to serialise a resource definition >>>> (while resource definition objects themselves are designed to be >>>> inherently template-format-independent). Once phase 1 is complete we can >>>> decide whether it's worth it based on measuring the actual performance >>>> improvement. >>>> >>> >>> The idea of keeping the resource definition in the DB was to decouple from >>> the stack as much as possible. It is not for performance optimization. >> >> OK, that's fair, but then you said... >> >>> When a worker job is received, the worker won't have to load the entire >>> template from the stack, get the realized definitions etc. >> >> ...that's a performance optimisation. >> > > OK... I had decoupling of the resource from the stack in mind! The worker > doesn't load the stack, but only the resource, and works on it. I did mention > scale somewhere but it was about a SHA hash of the resource definition for > quick comparison. > >>> The worker >>> simply loads the resource with its definition and a minimal stack (a lazily >>> loaded stack), and starts working on it. This work is towards making the >>> stack lazily loaded. >>> >>> Like you said, the stack has to be minimal when it is loaded. There will >>> be some information needed by the worker (like the resource definition and >>> requirers/requirements) which can be obtained without loading the >>> template and recalculating the graph again and again. >> >> None of this changes the fundamental design, so let's get the important >> stuff done first and when that's finished we can see if this makes a >> meaningful (positive) difference. >> >>>> (Note that we _already_ store the _resolved_ properties of the resource, >>>> which is what the observer will be comparing against for phase 2, so >>>> there should be no reason for the observer to need to load the stack.) >>>> >>>>> The README.rst is really helpful for bringing up the minimal devstack >>>>> and testing the PoC. It also has some notes on design. >>>>> >>>> [snip] >>>>> >>>>> Zane, I have a few questions: >>>>> 1. Our current implementation is based on notifications from the worker so >>>>> that the engine can take up the next set of tasks. I don't see this in your >>>>> case. I think we should be doing this. It gels well with the observer >>>>> notification mechanism. When the observer comes, it would send a >>>>> converge notification. Both the provisioning of the stack and the >>>>> continuous observation happen with notifications (async message >>>>> passing).
I see that the workers in your case pick up the parent when/if >>>>> it is done and schedule it or update the sync point. >>>> >>>> I'm not quite sure what you're asking here, so forgive me if I'm >>>> misunderstanding. What I think you're saying is that where my prototype >>>> propagates notifications thus: >>>>
>>>>     worker -> worker
>>>>            -> worker
>>>>
>>>> (where -> is an async message) >>>> you would prefer it to do: >>>>
>>>>     worker -> engine -> worker
>>>>                      -> worker
>>>>
>>>> Is that right? >>>> >>>> To me the distinction seems somewhat academic, given that we've decided >>>> that the engine and the worker will be the same process. I don't see a >>>> disadvantage to doing right away stuff that we know needs to be done >>>> right away. Obviously we should factor the code out tidily into a >>>> separate method where we can _also_ expose it as a notification that can >>>> be triggered by the continuous observer. >>>> >>>> You mentioned above that you thought the workers should not ever load >>>> the Stack, and I think that's probably the reason you favour this >>>> approach: the 'worker' would always load just the Resource and the >>>> 'engine' (even though they're really the same) would always load just >>>> the Stack, right? >>>> >>>> However, as I mentioned above, I think we'll want/have to load the Stack >>>> in the worker anyway, so eliminating the extra asynchronous call >>>> eliminates the performance penalty for having to do so. >>>> >>> >>> When we started, we had the following idea: >>>
>>>
>>>            (provisions with
>>>             converge API)                  (Observe)
>>>   Engine ------------------> Worker --------------> Observer
>>>     |                      |   |                       |
>>>     |                      |   |                       |
>>>     ^        (done)        v   ^      (converge)       v
>>>     ------------------------   -------------------------
>>>
>>> The boundary has to be clearly demarcated, as done above. While >>> provisioning a stack, the Engine uses the converge facility, as does the >>> observer when it sees a need to converge a resource. >>> >>> Even though we have these logical things in the same process, by not clearly >>> demarcating the responsibilities of each logical entity, we will end up >>> with the issues which we face when we mix responsibilities. >> >> Sure, and that's why we don't write all the code in one big method, but >> it doesn't follow that it has to be in a different process. Or in this >> case, since they're actually the same process anyway, it doesn't imply >> that they have to be called over RPC. >> >>> As you mentioned, the worker should have a notification API exposed >>> which can be used not only by the Observer, but also by the engine to >>> provision (CRUD) a stack resource. >> >> yep >> >>> The engine decides which resource >>> is to be converged next when a resource is done. >> >> Who says it has to? Each resource knows what depends on it... let it >> send the notifications directly. The engine is just there to do whatever >> setup requires everything to be loaded at once and then kick things off >> by starting the worker tasks for resources that don't depend on anything. >> > > I am fine with this. > >>> The observer is only >>> responsible for one resource and issues a converge request if it has >>> to. >>> >>>>> 2. The dependency graph travels everywhere. IMHO, we can keep the graph >>>>> in the DB and let the workers work on a resource, and the engine decide which >>>>> one is to be scheduled next by looking at the graph. There wouldn't be a >>>>> need for a lock here; in the engine, the DB transactions should take >>>>> care of concurrent DB updates. Our new PoC follows this model.
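As a rough illustration of that model (using the stdlib sqlite3 module as a stand-in for the real database, with a hypothetical schema rather than the actual Heat tables), the edges live in a graph table, the engine asks the DB which resources are ready to converge, and a small conditional-update transaction plays the role that select-for-update would play in a real database:

    import sqlite3

    SCHEMA = """
    CREATE TABLE resource (name TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE graph_edge (requirer TEXT, required TEXT, template_id TEXT);
    """


    def ready_to_converge(conn):
        """Return resources whose requirements are all COMPLETE."""
        rows = conn.execute(
            """
            SELECT r.name FROM resource r
            WHERE r.status = 'PENDING'
              AND NOT EXISTS (
                  SELECT 1 FROM graph_edge e
                  JOIN resource req ON req.name = e.required
                  WHERE e.requirer = r.name AND req.status != 'COMPLETE')
            """)
        return [name for (name,) in rows]


    def claim(conn, name):
        """Atomically mark a resource IN_PROGRESS; the transaction itself acts
        as the mutex, so only one engine/worker can claim a given resource."""
        with conn:  # commits on success, rolls back on exception
            cur = conn.execute(
                "UPDATE resource SET status = 'IN_PROGRESS' "
                "WHERE name = ? AND status = 'PENDING'", (name,))
            return cur.rowcount == 1


    conn = sqlite3.connect(':memory:')
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO resource VALUES (?, ?)",
                     [('vol', 'COMPLETE'), ('attach', 'PENDING'),
                      ('server', 'PENDING')])
    conn.executemany("INSERT INTO graph_edge VALUES (?, ?, ?)",
                     [('attach', 'vol', 't1'), ('attach', 'server', 't1')])
    print(ready_to_converge(conn))   # ['server'] - 'attach' still waits on it
    print(claim(conn, 'server'))     # True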
>>>> >>>> I'm fine with keeping the graph in the DB instead of having it flow with >>>> the notifications. >>>> >>>>> 3. The request ID is passed down to check_*_complete. Would the check >>>>> method be interrupted if a new request arrives? IMHO, the check method >>>>> should not get interrupted. It should return only when the resource has >>>>> reached a concrete state, either failed or completed. >>>> >>>> I agree, it should not be interrupted. >>>> >>>> I've started to think of phase 1 and phase 2 like this: >>>> >>>> 1) Make locks more granular: stack-level lock becomes resource-level >>>> 2) Get rid of locks altogether >>>> >>> >>> I thought that the DB transactions were enough for concurrency issues. >>> The resource's status is enough to know whether to trigger the update or >>> not. For resources currently in progress, the engine can wait for >>> notifications from them and trigger the update. With a completely >>> distributed graph and decision making, it is hard to achieve. If the >>> graph is stored in the DB, the decision to trigger an update is in one place >>> (by looking at the graph only) and hence locks are not required. >> >> Right, right, but it's still a lock (maybe I should have said mutex) >> because only one traversal can be operating on each resource at a time. >> We're just arguing semantics now. It doesn't matter whether you do >> select-for-update on the Resource table to mark it as in progress or if >> you do select-for-update on a separate Lock table to mark it as in >> progress, only one update can be working on a given resource at a time, >> and this is a big improvement in granularity because before only one >> update could be working on the whole stack at a time, but not as big an >> improvement as phase 2 will be. >> > > The select-for-update is good only for updates on one table. What if two > tables are to be updated? I have stored the graph in a table and I > update both the graph and some information in the resource in a transaction. > I don't see a value in keeping only one table with all the information > stuffed in it. Ideally, DB tables have to be normalized to avoid data > duplication. The graph table is normalized for graph information and the > resource table for resource information. > >>>> So in phase 1 we'll lock the resources and, like you said, it will return >>>> when it has reached a concrete state. In phase 2 we'll be able to >>>> just update the goal state for the resource and the observe/converge >>>> process will be able to automagically find the best way to that state >>>> regardless of whether it was in the middle of transitioning to another >>>> state or not. Or something. But that's for the future :) >>>> >>> >>> I am thinking about not having any locks at all. The concurrency is >>> handled by DB transactions, and notifications give us a hint on deciding >>> when and whether to update the next set of resources. If a resource is >>> currently in progress, and an update is issued on that resource, the >>> graph API doesn't return it as a ready-to-be-converged resource since a >>> previous version is in progress. When it is done and a notification is >>> received, the graph DB API returns it as next to converge. >> >> So sort of a hybrid with Michal's approach. >> > > Not sure what Michal's approach is. >> I think I get what you're saying here.
If we reach a particular resource >> in the traverse and it is still in-progress from a previous update, we >> can ignore it and use the notification from the previous update >> finishing to trigger it later, provided that the notifications go >> through the engine, because the in-progress resource can't possibly know >> anything about traversals that started later. >> >> Or on the other hand we could just sleep for a while until it's done and >> then carry on. >> > > I would prefer notifications triggering the next set of actions rather than > keeping many eventlets sleeping. > >>>>> 4. A lot of the synchronization issues which we faced in our PoC cannot be >>>>> encountered with the framework. How do we evaluate what happens when >>>>> synchronization issues are encountered (like the stack-lock kind of issues >>>>> which we are replacing with DB transactions)? >>>> >>>> Right, yeah, this is obviously the big known limitation of the >>>> simulator. I don't have a better answer other than to Think Very Hard >>>> about it. >>>> >>>> Designing software means solving for hundreds of constraints - too many >>>> for a human to hold in their brain at the same time. The purpose of >>>> prototyping is to fix enough of the responses to those constraints in a >>>> concrete form to allow reasoning about the remaining ones to become >>>> tractable. If you fix solutions for *all* of the constraints, then what >>>> you've built is by definition not a prototype but the final product. >>>> >>>> One technique available to us is to encapsulate the parts of the >>>> algorithm that are subject to synchronisation issues behind abstractions >>>> that offer stronger guarantees. Then in order to have confidence in the >>>> design we need only satisfy ourselves that we have analysed the >>>> guarantees correctly and that a concrete implementation offering those >>>> same guarantees is possible. For example, the SyncPoints are shown to >>>> work under the assumption that they are not subject to race conditions, >>>> and the SyncPoint code is small enough that we can easily see that it >>>> can be implemented in an atomic fashion using the same DB primitives >>>> already proven to work by StackLock. Therefore we can have a very high >>>> confidence (but not proof) that the overall algorithm will work when >>>> implemented in the final product. >>>> >>>> Having Thought Very Hard about it, I'm as confident as I can be that I'm >>>> not relying on any synchronisation properties that can't be implemented >>>> using select-for-update on a single database row. There will of course >>>> be surprises at implementation time, but I hope that won't be one of >>>> them and anticipate that any changes required to the plan will be >>>> localised and not wide-ranging. >>>> >>>> (This is in contrast BTW to my centralised-graph branch, linked above, >>>> where it became very obvious that it would require some sort of external >>>> locking - so there is reason to think that this process can reveal >>>> architectural problems related to synchronisation where they are present.) >>>> >>>> cheers, >>>> Zane. >> > > Zane, > I was playing around with the PoC and found a couple of issues: > > 1. The stack was failing when there were single disjoint resources or > just one resource in the template.
The graph did not include this resource > due to a minor bug in dependency_names(). I have added a test case and > fix here: > https://github.com/anantpatil/heat-convergence-prototype/commit/b58abd77cf596475ecf3f19ed38adf8ad3bb6b3b > > 2. The resource graph is created with keys for both the forward-order > traversal and the reverse-order traversal, and the update will finish the > forward order and then attempt the reverse order. If this is the case, then > the update-replaced resources will be deleted before the update is > complete, and if the update fails, the old resource is not available for > roll-back; a new resource then has to be created. I have added a test > case at the above-mentioned location. > > In our PoC, the updates (concurrent updates) won't remove an > update-replaced resource until all the resources are updated and the > resource clean-up phase is started. It is unacceptable to remove the old > resource to be rolled back to, since it may have changes which the user > doesn't want to lose; and that's probably why they use the roll-back > flag. >
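As a toy illustration of that behaviour (hypothetical classes, not the PoC or Heat code), the sketch below keeps a replaced resource around until the clean-up phase that follows a successful update, so a rollback can simply switch back to the old physical resource instead of re-creating it:

    class FakeResource(object):
        def __init__(self, name):
            self.name = name
            self.deleted = False

        def delete(self):
            self.deleted = True


    class UpdateTraversal(object):
        def __init__(self):
            self.replaced = []     # old resources kept around for possible rollback
            self.complete = False  # set once every resource reaches its new definition

        def replace(self, old, new):
            # Create the replacement first; the old resource is NOT deleted here.
            self.replaced.append(old)
            return new

        def cleanup(self):
            # Deletion of replaced resources only happens after the update is done.
            assert self.complete, 'clean-up must wait until the update completes'
            for old in self.replaced:
                old.delete()
            self.replaced = []

        def rollback(self):
            # The old resources still exist, so rollback re-uses them as-is.
            kept, self.replaced = self.replaced, []
            return kept


    update = UpdateTraversal()
    old_server = FakeResource('server-old')
    update.replace(old_server, FakeResource('server-new'))
    # The update fails part-way through, so we roll back:
    assert update.rollback() == [old_server] and not old_server.deleted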
Here is the link to the scenario that I mentioned in (2) above. Please ignore the one specified in the link in (1) above. This scenario simulates an update replace of a resource and then rolls back. The test case checks that the replaced resource is not re-created for rollback. The old resource should be retained for roll-back cases. https://github.com/anantpatil/heat-convergence-prototype/blob/distributed-graph/scenarios/multi_update_rollback.py > - Anant > _______________________________________________ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev