On Tue, Dec 2, 2014 at 1:49 AM, Clint Byrum <cl...@fewbar.com> wrote:
> Excerpts from Anant Patil's message of 2014-11-30 23:02:29 -0800:
> > On 27-Nov-14 18:03, Murugan, Visnusaran wrote:
> > > Hi Zane,
> > >
> > > At this stage our implementation (as mentioned in the wiki <https://wiki.openstack.org/wiki/Heat/ConvergenceDesign>) achieves your design goals.
> > >
> > > 1. In case of a parallel update, our implementation adjusts the graph according to the new template and waits for the dispatched resource tasks to complete.
> > > 2. Reasons for basing our PoC on the Heat code:
> > >    a. To solve the contention caused by all dependent resources processing the parent resource in parallel.
> > >    b. To avoid porting issues from the PoC to the Heat base (just to be aware of potential issues asap).
> > > 3. Resource timeouts would be helpful, but I guess they are resource specific and have to come from the template, with default values from the plugins.
> > > 4. We see resource notification aggregation, and processing the next level of resources without contention and with minimal DB usage, as the problem area. We are working on the following approaches in *parallel*:
> > >    a. Use a queue per stack to serialize notifications.
> > >    b. Get the parent ProcessLog (ResourceID, EngineID) and initiate convergence upon the first child notification. Subsequent children that fail to get the parent resource lock will directly send a message to the waiting parent task (topic=stack_id.parent_resource_id).
> > > Based on performance/feedback we can select either one, or a mashed-up version.
> > >
> > > Advantages:
> > > 1. Failed resource tasks can be re-initiated after a ProcessLog table lookup.
> > > 2. One worker == one resource.
> > > 3. Supports concurrent updates.
> > > 4. Delete == update with an empty stack.
> > > 5. Rollback == update to the previous known good/completed stack.
> > >
> > > Disadvantages:
> > > 1. Still holds the stackLock (WIP to remove it with ProcessLog).
> > >
> > > I completely understand your concern about reviewing our code, since the commits are numerous and the design changes course in places. Our start commit is [c1b3eb22f7ab6ea60b095f88982247dd249139bf], though this might not help. :)
> > >
> > > Your thoughts?
> > >
> > > Happy Thanksgiving.
> > > Vishnu.
> > >
> > > *From:* Angus Salkeld [mailto:asalk...@mirantis.com]
> > > *Sent:* Thursday, November 27, 2014 9:46 AM
> > > *To:* OpenStack Development Mailing List (not for usage questions)
> > > *Subject:* Re: [openstack-dev] [Heat] Convergence proof-of-concept showdown
> > >
> > > On Thu, Nov 27, 2014 at 12:20 PM, Zane Bitter <zbit...@redhat.com> wrote:
> > >
> > > A bunch of us have spent the last few weeks working independently on proof of concept designs for the convergence architecture. I think those efforts have now reached a sufficient level of maturity that we should start working together on synthesising them into a plan that everyone can forge ahead with. As a starting point I'm going to summarise my take on the three efforts; hopefully the authors of the other two will weigh in to give us their perspective.
> > >
> > > Zane's Proposal
> > > ===============
> > >
> > > https://github.com/zaneb/heat-convergence-prototype/tree/distributed-graph
> > >
> > > I implemented this as a simulator of the algorithm rather than using the Heat codebase itself in order to be able to iterate rapidly on the design, and indeed I have changed my mind many, many times in the process of implementing it. Its notable departure from a realistic simulation is that it runs only one operation at a time - essentially giving up the ability to detect race conditions in exchange for a completely deterministic test framework. You just have to imagine where the locks need to be. Incidentally, the test framework is designed so that it can easily be ported to the actual Heat code base as functional tests so that the same scenarios could be used without modification, allowing us to have confidence that the eventual implementation is a faithful replication of the simulation (which can be rapidly experimented on, adjusted and tested when we inevitably run into implementation issues).
> > >
> > > This is a complete implementation of Phase 1 (i.e. using existing resource plugins), including update-during-update, resource clean-up, replace on update and rollback; with tests.
> > >
> > > Some of the design goals which were successfully incorporated:
> > > - Minimise changes to Heat (it's essentially a distributed version of the existing algorithm), and in particular to the database
> > > - Work with the existing plugin API
> > > - Limit total DB access for Resource/Stack to O(n) in the number of resources
> > > - Limit overall DB access to O(m) in the number of edges
> > > - Limit lock contention to only those operations actually contending (i.e. no global locks)
> > > - Each worker task deals with only one resource
> > > - Only read resource attributes once
> > >
> > > Open questions:
> > > - What do we do when we encounter a resource that is in progress from a previous update while doing a subsequent update? Obviously we don't want to interrupt it, as it will likely be left in an unknown state. Making a replacement is one obvious answer, but in many cases there could be serious down-sides to that. How long should we wait before trying it? What if it's still in progress because the engine processing the resource already died?
> > >
> > > Also, how do we implement resource level timeouts in general?
> > >
> > > Michał's Proposal
> > > =================
> > >
> > > https://github.com/inc0/heat-convergence-prototype/tree/iterative
> > >
> > > Note that a version modified by me to use the same test scenario format (but not the same scenarios) is here:
> > >
> > > https://github.com/zaneb/heat-convergence-prototype/tree/iterative-adapted
> > >
> > > This is based on my simulation framework after a fashion, but with everything implemented synchronously and a lot of handwaving about how the actual implementation could be distributed. The central premise is that at each step of the algorithm, the entire graph is examined for tasks that can be performed next, and those are then started. Once all are complete (it's synchronous, remember), the next step is run.
> > > Keen observers will be asking how we know when it is time to run the next step in a distributed version of this algorithm, where it will be run, and what to do about resources that are in an intermediate state at that time. All of these questions remain unanswered.
> > >
> > > Yes, I was struggling to figure out how it could manage an IN_PROGRESS state as it's stateless, so you end up treading on the other action's toes. Assuming we use the resource's state (IN_PROGRESS) you could get around that. Then you kick off a converge whenever an action completes (if there is nothing new to be done then do nothing).
> > >
> > > A non-exhaustive list of concerns I have:
> > > - Replace on update is not implemented yet
> > > - AFAIK rollback is not implemented yet
> > > - The simulation doesn't actually implement the proposed architecture
> > > - This approach is punishingly heavy on the database - O(n^2) or worse
> > >
> > > Yes, re-reading the state of all resources whenever we run a new converge is worrying, but I think Michal had some ideas to minimize this.
> > >
> > > - A lot of phase 2 is mixed in with phase 1 here, making it difficult to evaluate which changes need to be made first and whether this approach works with existing plugins
> > > - The code is not really based on how Heat works at the moment, so there would be either a major redesign required or lots of radical changes in Heat or both
> > >
> > > I think there's a fair chance that given another 3-4 weeks to work on this, all of these issues and others could probably be resolved. The question for me at this point is not so much "if" but "why".
> > >
> > > Michał believes that this approach will make Phase 2 easier to implement, which is a valid reason to consider it. However, I'm not aware of any particular issues that my approach would cause in implementing phase 2 (note that I have barely looked into it at all though). In fact, I very much want Phase 2 to be entirely encapsulated by the Resource class, so that the plugin type (legacy vs. convergence-enabled) is transparent to the rest of the system. Only in this way can we be sure that we'll be able to maintain support for legacy plugins. So a phase 1 that mixes in aspects of phase 2 is actually a bad thing in my view.
> > >
> > > I really appreciate the effort that has gone into this already, but in the absence of specific problems with building phase 2 on top of another approach that are solved by this one, I'm ready to call this a distraction.
> > >
> > > In its defence, I like the simplicity of it. The concepts and code are easy to understand - though part of that is because it doesn't implement all the stuff on your list yet.
> > >
> > > Anant & Friends' Proposal
> > > =========================
> > >
> > > First off, I have found this very difficult to review properly, since the code is not separate from the huge mass of Heat code, nor is the commit history in the form that patch submissions would take (rather, it includes backtracking and iteration on the design). As a result, most of the information here has been gleaned from discussions about the code rather than direct review.
> > > I have repeatedly suggested that this proof-of-concept work should be done using the simulator framework instead, unfortunately so far to no avail.
> > >
> > > The last we heard on the mailing list about this, resource clean-up had not yet been implemented. That was a major concern because that is the more difficult half of the algorithm. Since then there have been a lot more commits, but it's not yet clear whether resource clean-up, update-during-update, replace-on-update and rollback have been implemented, though it is clear that at least some progress has been made on most or all of them. Perhaps someone can give us an update.
> > >
> > > https://github.com/anantpatil/heat-convergence-poc
> > >
> > > AIUI this code also mixes phase 2 with phase 1, which is a concern. For me the highest priority for phase 1 is to be sure that it works with existing plugins. Not only because we need to continue to support them, but because converting all of our existing 'integration-y' unit tests to functional tests that operate in a distributed system is virtually impossible in the time frame we have available. So the existing test code needs to stick around, and the existing stack create/update/delete mechanisms need to remain in place until such time as we have equivalent functional test coverage to begin eliminating existing unit tests. (We'll also, of course, need to have unit tests for the individual elements of the new distributed workflow, functional tests to confirm that the distributed workflow works in principle as a whole - the scenarios from the simulator can help with _part_ of this - and, not least, an algorithm that is as similar as possible to the current one so that our existing tests remain at least somewhat representative and don't require too many major changes themselves.)
> > >
> > > Speaking of tests, I gathered that this branch included tests, but I don't know to what extent there are automated end-to-end functional tests of the algorithm?
> > >
> > > From what I can gather, the approach seems broadly similar to the one I eventually settled on also. The major difference appears to be in how we merge two or more streams of execution (i.e. when one resource depends on two or more others). In my approach, the dependencies are stored in the resources and each joining of streams creates a database row to track it, which is easily locked, with contention on the lock extending only to those resources which are direct dependencies of the one waiting. In this approach, both the dependencies and the progress through the graph are stored in a database table, necessitating (a) reading of the entire table (as it relates to the current stack) on every resource operation, and (b) locking of the entire table (which is hard) when marking a resource operation complete.
> > >
> > > I chatted to Anant about this today and he mentioned that they had solved the locking problem by dispatching updates to a queue that is read by a single engine per stack.
> > >
> > > My approach also has the neat side-effect of pushing the data required to resolve get_resource and get_att (without having to reload the resources again and query them), as well as to update dependencies (e.g. because of a replacement or deletion), along with the flow of triggers. I don't know if anything similar is at work here.
> > >
> > > It's entirely possible that the best design might combine elements of both approaches.
> > >
> > > The same open questions I detailed under my proposal also apply to this one, if I understand correctly.
> > >
> > > I'm certain that I won't have represented everyone's work fairly here, so I encourage folks to dive in and correct any errors about theirs and ask any questions you might have about mine. (In case you have been living under a rock, note that I'll be out of the office for the rest of the week due to Thanksgiving, so don't expect immediate replies.)
> > >
> > > I also think this would be a great time for the wider Heat community to dive in and start asking questions and suggesting ideas. We need to, ahem, converge on a shared understanding of the design so we can all get to work delivering it for Kilo.
> > >
> > > Agree, we need to get moving on this.
> > >
> > > -Angus
> > >
> > > cheers,
> > > Zane.
> >
> > Thanks Zane for your e-mail and for summarizing everyone's work.
> >
> > The design goals mentioned above look more like performance goals and constraints to me. I understand that it is unacceptable to have a poorly performing engine or broken Resource plug-ins. The convergence spec clearly mentions that the existing Resource plugins should not be changed.
> >
> > IMHO, and my team's HO, the design goals of convergence would be:
> > 1. Stability: No transient failures, either in OpenStack/external services or in the resources themselves, should fail the stack. Therefore, we need to have Observers to check for divergence and converge a resource if needed, to bring it back to a stable state.
> > 2. Resiliency: Heat engines should be able to take up tasks in case of failures/restarts.
> > 3. Backward compatibility: "We don't break the user space." No existing stacks should break.
> >
> > We started the PoC with these goals in mind; any performance optimization would be a plus point for us. Note that I am neglecting the performance goal for now, just that it should be next in the pipeline. The questions we should ask ourselves are: are we storing enough data (the state of the stack) in the DB to enable resiliency? Are we distributing the load evenly to all Heat engines? Does our notification mechanism provide us some form of guarantee or acknowledgement?
> >
> > In retrospect, we had to struggle a lot to understand the existing Heat engine. We couldn't have done justice by just creating another project on GitHub without any concrete understanding of the existing state of affairs. We are not on the same page with the Heat core members; we are novices and the cores are experts.
> >
> > I am glad that we experimented with the Heat engine directly.
> > The current Heat engine is not resilient and the messaging also lacks reliability. We (my team, and I guess the cores also) understand that async message passing would be the way to go, as synchronous RPC calls simply wouldn't scale. But with async message passing there has to be some mechanism of ACKing back, which I think is lacking in the current infrastructure.
> >
> > How could we provide stable user-defined stacks if the underlying Heat core lacks stability? Convergence is all about stable stacks. To make the current Heat core stable we need to have, at the least:
> > 1. Some mechanism to ACK back messages over AMQP, or some other solid mechanism of message passing.
> > 2. Some mechanism for fault tolerance in the Heat engine using external tools/infrastructure like Celery/ZooKeeper. Without an external infrastructure/tool we will end up bloating the Heat engine with a lot of boilerplate code to achieve this. We had recommended Celery in our previous e-mail (from Vishnu).
> >
> > It was through our experiments with the Heat engine for this PoC that we could come up with the above recommendations.
> >
> > State of our PoC
> > ----------------
> >
> > On GitHub: https://github.com/anantpatil/heat-convergence-poc
> >
> > Our current implementation of the PoC locks the stack after each notification to mark the graph as traversed and produce the next level of resources for convergence. We are facing challenges in removing/minimizing these locks. We also have two different schools of thought for solving this lock issue, as mentioned above in Vishnu's e-mail. I will describe these in detail in the wiki. There will be different branches in our GitHub repo for these two approaches.
>
> It would be helpful if you explained why you need to _lock_ the stack. MVCC in the database should be enough here. Basically you need to:
>
> begin transaction
> update traversal information
> select resolvable nodes
> {in code not sql -- send converge commands into async queue}
> commit
>
> Any failure inside this transaction should roll back the transaction and retry it. It is o-k to have duplicate converge commands for a resource.
>
> This should be the single point of synchronization between workers that are resolving resources. Or perhaps this is the lock you meant? Either way, this isn't avoidable if you want to make sure everything is attempted at least once without having to continuously poll and re-poll the stack to look for unresolved resources. That is an option, but not one that I think is going to be as simple as the transactional method.
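For the archives, here is a rough, purely illustrative sketch (Python with SQLAlchemy) of the kind of single-transaction step Clint describes above. The resource and resource_dependency tables, the column names, and the queue object with a cast() method are all made up for illustration - this is not the PoC code and not Heat's actual schema:

# Illustrative only -- not the PoC code, and not Heat's real schema.
# Assumes a hypothetical resource table (stack_id, name, status) and a
# resource_dependency table (stack_id, requirer, requiree), plus a
# stand-in async queue client exposing cast().
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://heat:heat@localhost/heat")


def notify_complete(stack_id, resource_name, queue):
    # One traversal step, done entirely inside a single DB transaction.
    # If anything fails, the transaction rolls back and the step can be
    # retried; duplicate converge commands for a resource are harmless.
    with engine.begin() as conn:  # BEGIN ... COMMIT (ROLLBACK on error)
        # 1. Update traversal information.
        conn.execute(
            text("UPDATE resource SET status = 'COMPLETE' "
                 "WHERE stack_id = :s AND name = :n"),
            {"s": stack_id, "n": resource_name})

        # 2. Select resolvable nodes: pending resources whose
        #    requirements are now all COMPLETE.
        rows = conn.execute(
            text("SELECT r.name FROM resource r "
                 "WHERE r.stack_id = :s AND r.status = 'PENDING' "
                 "AND NOT EXISTS ("
                 "    SELECT 1 FROM resource_dependency d "
                 "    JOIN resource req ON req.stack_id = d.stack_id "
                 "                     AND req.name = d.requiree "
                 "    WHERE d.stack_id = :s AND d.requirer = r.name "
                 "      AND req.status != 'COMPLETE')"),
            {"s": stack_id}).fetchall()

        # 3. In code, not SQL: send converge commands into the async queue.
        for (name,) in rows:
            queue.cast({"action": "converge",
                        "stack_id": stack_id,
                        "resource": name})

The point is simply that the UPDATE, the SELECT of newly resolvable nodes, and the dispatch of converge commands all happen between BEGIN and COMMIT, so a failure anywhere rolls the whole step back and a retry (or a duplicate converge command) is harmless.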
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev