Hi John,

This is really good thinking on the future of the CloudStack driver/plugin
architecture.
I would be very happy if we can decouple the ACS releases from
vendor-specific releases. I also agree on the research & experiments (POCs,
tools) that need to be undertaken to conclude whether decoupled drivers,
driver upgrades, coexistence of multiple drivers, hot driver deployment,
etc. can actually work in the JVM.

Note - this approach reminds me of the Vert.x modules & containers that try
to manage this problem along with other concerns (read: instrumentation,
etc.).

Regards,
Amit
*CloudByte Inc.* <http://www.cloudbyte.com/>

On Wed, Aug 21, 2013 at 6:22 AM, Darren Shepherd <
darren.s.sheph...@gmail.com> wrote:

> Sure, I fully understand how it theoretically works, but I'm saying from
> a practical perspective it always seems to fall apart. What you're
> describing is done excellently in OSGi 4.2 Blueprint. It's a beautiful
> framework that allows you to expose services that can be dynamically
> updated at runtime.
>
> The issues always happen with unloading. I'll give you a real world
> example. As part of the servlet spec you're supposed to be able to stop
> and unload wars. But in practice if you do it enough times you typically
> run out of memory. One such issue was with commons logging (since fixed).
> When you do getLogger(MyClass.class) it would cache a reference from the
> Class object to the actual log impl. The commons logging jar is typically
> loaded by a system classloader, but MyClass.class would be loaded in the
> webapp classloader. So when you stop the war there is a reference chain:
> system classloader -> LogFactory -> MyClass -> webapp classloader. So the
> webapp never gets GC'd.
>
> So just pointing out the practical issues, that's it.
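The reference-chain leak Darren describes boils down to a static cache keyed by Class objects. A minimal sketch of the pattern (class and field names here are illustrative, not the actual commons-logging source):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the leak pattern: a factory loaded by a long-lived (system)
// classloader keeps a static cache keyed by the caller's Class object.
class LogFactory {
    // This map lives as long as the classloader that loaded LogFactory.
    static final Map<Class<?>, Object> LOG_CACHE = new HashMap<>();

    static Object getLogger(Class<?> clazz) {
        // clazz retains clazz.getClassLoader(), so caching the Class here
        // builds the chain: system CL -> LogFactory -> MyClass -> webapp CL.
        return LOG_CACHE.computeIfAbsent(clazz, c -> new Object());
    }
}

class WebappComponent {
    // Imagine this class were loaded by a webapp classloader: after the war
    // is stopped, LOG_CACHE still references WebappComponent.class, so the
    // webapp classloader (and every class it loaded) can never be GC'd.
    static final Object LOG = LogFactory.getLogger(WebappComponent.class);
}
```

The fix in later commons-logging releases was, roughly, to key such caches per classloader (or use weak references) so the cache dies with the webapp.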
>
> Darren
>
> On Aug 20, 2013, at 5:31 PM, John Burwell <jburw...@basho.com> wrote:
>
> > Darren,
> >
> > Actually, loading and unloading aren't difficult if resource management
> > and drivers work within the following constraints/assumptions:
> >
> > - Drivers are transient and stateless
> > - A driver instance is assigned per resource managed (i.e. no
> >   singletons)
> > - A lightweight thread and mailbox (i.e. actor model) are assigned per
> >   resource managed (outlined in the presentation referenced below)
> >
> > Based on these constraints and assumptions, the following upgrade
> > process could be implemented:
> >
> > 1. Load and verify the new driver version to make it available
> > 2. Notify the supervisor processes of each affected resource that a new
> >    driver is available
> > 3. Upon completion of the current message being processed by its
> >    associated actor, the supervisor kills and respawns the actor
> >    managing its associated resource
> > 4. As part of startup, the supervisor injects an instance of the new
> >    driver version and the actor resumes processing messages in its
> >    mailbox
> >
> > This process mirrors the process that would occur on management server
> > startup for each resource, minus killing an existing actor instance.
> > Eventually, the system will upgrade the driver without loss of
> > operation. More sophisticated policies could be added, but I think this
> > approach would be a solid default upgrade behavior. As a bonus, this
> > same approach could also be applied to global configuration settings --
> > allowing the system to apply changes to these values without restarting
> > the system.
> >
> > In summary, CloudStack and Eclipse are very different types of systems.
> > Eclipse is a desktop application implementing complex workflows, user
> > interactions, and management of shared state (e.g. project structure,
> > AST, compiler status, etc.). In contrast, CloudStack is an eventually
> > consistent distributed system performing automation control.
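The supervisor-driven swap John outlines can be sketched in a few lines. This is a single-threaded illustration only (names like `Driver` and `Supervisor` are assumptions, not CloudStack APIs, and a real version would run each actor on its own lightweight thread): the key property is that the driver reference is swapped between messages, so in-flight work completes on the old version and the mailbox survives the upgrade.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative driver contract: one operation per message.
interface Driver { String handle(String msg); }

class Supervisor {
    // The mailbox outlives any single driver instance, so no work is lost.
    private final Queue<String> mailbox = new ArrayDeque<>();
    private final AtomicReference<Driver> driver;

    Supervisor(Driver initial) { this.driver = new AtomicReference<>(initial); }

    void send(String msg) { mailbox.add(msg); }

    // Steps 1-2: a new driver version is made available; the swap takes
    // effect only after the current message finishes (steps 3-4).
    void upgrade(Driver newVersion) { driver.set(newVersion); }

    // Process exactly one queued message with whichever driver is current.
    String processOne() { return driver.get().handle(mailbox.poll()); }
}
```

Calling `upgrade(newDriver)` between `processOne()` calls then behaves like the kill-and-respawn John describes: the next message is handled by the new version while earlier queued messages are preserved.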
> > As such, its plugin requirements are not only very different but,
> > IMHO, much simpler.
> >
> > Thanks,
> > -John
> >
> > On Aug 20, 2013, at 7:44 PM, Darren Shepherd <
> > darren.s.sheph...@gmail.com> wrote:
> >
> >> I know this isn't terribly useful, but I've been drawing a lot of
> >> squares and circles and lines that connect those squares and circles
> >> lately, and I have a lot of architectural ideas for CloudStack. At the
> >> rate I'm going it will take me about two weeks to put together a
> >> discussion/proposal for the community. What I'm thinking is a superset
> >> of what you've listed out and should align with your idea of a CAR.
> >> The focus has a lot to do with modularity and extensibility.
> >>
> >> So more to come soon.... I will say one thing though: with Java you
> >> end up having a hard time doing dynamic loading and unloading of
> >> modules. There are plenty of frameworks that try really hard to do
> >> this right, like OSGi, but it's darn near impossible to do it right
> >> because of class loading and GC issues (and that's why Eclipse has you
> >> restart after installing plugins even though it is OSGi).
> >>
> >> I do believe that CloudStack should be capable of zero-downtime
> >> maintenance, and I have ideas around that, but at the end of the day,
> >> for plenty of practical reasons, you still need a JVM restart if
> >> modules change.
> >>
> >> Darren
> >>
> >> On Aug 20, 2013, at 3:39 PM, Mike Tutkowski <
> >> mike.tutkow...@solidfire.com> wrote:
> >>
> >>> I agree, John - let's get consensus first, then talk timetables.
> >>>
> >>> On Tue, Aug 20, 2013 at 4:31 PM, John Burwell <jburw...@basho.com>
> >>> wrote:
> >>>
> >>>> Mike,
> >>>>
> >>>> Before we can dig into timelines or implementations, I think we need
> >>>> to get consensus on the problem to be solved and the goals. Once we
> >>>> have a proper understanding of the scope, I believe we can chunk the
> >>>> work across a set of development lifecycles.
> >>>> The subject is vast, but it also has a far-reaching impact on both
> >>>> the storage and network layer evolution efforts. As such, I believe
> >>>> we need to start addressing it as part of the next release.
> >>>>
> >>>> As a separate thread, we need to discuss the timeline for the next
> >>>> release. I think we need to avoid the time compression caused by the
> >>>> overlap of the 4.1 stabilization effort and 4.2 development.
> >>>> Therefore, I don't think we should consider development of the next
> >>>> release started until the first 4.2 RC is released. I will try to
> >>>> open a separate discuss thread for this topic, as well as tie in the
> >>>> discussion of release code names.
> >>>>
> >>>> Thanks,
> >>>> -John
> >>>>
> >>>> On Aug 20, 2013, at 6:22 PM, Mike Tutkowski <
> >>>> mike.tutkow...@solidfire.com> wrote:
> >>>>
> >>>>> Hey John,
> >>>>>
> >>>>> I think this is some great stuff. Thanks for the write-up.
> >>>>>
> >>>>> It looks like you have ideas around what might go into a first
> >>>>> release of this plug-in framework. Were you thinking we'd have
> >>>>> enough time to squeeze that first rev into 4.3? I'm just wondering
> >>>>> (it's not a huge deal to hit that release for this) because we
> >>>>> would only have about five weeks.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>> On Tue, Aug 20, 2013 at 3:43 PM, John Burwell <jburw...@basho.com>
> >>>>> wrote:
> >>>>>
> >>>>>> All,
> >>>>>>
> >>>>>> In capturing my thoughts on storage, my thinking backed into the
> >>>>>> driver model. While we have the beginnings of such a model today,
> >>>>>> I see the following deficiencies:
> >>>>>>
> >>>>>> 1. *Multiple Models*: The Storage, Hypervisor, and Security layers
> >>>>>> each have a slightly different model for allowing system
> >>>>>> functionality to be extended/substituted.
> >>>>>> These differences increase the barrier to entry for vendors
> >>>>>> seeking to extend CloudStack and accrete code paths to be
> >>>>>> maintained and verified.
> >>>>>> 2. *Leaky Abstraction*: Plugins are registered through a Spring
> >>>>>> configuration file. In addition to being operator unfriendly (most
> >>>>>> sysadmins are not Spring experts, nor do they want to be), we
> >>>>>> expose the core bootstrapping mechanism to operators. Therefore, a
> >>>>>> misconfiguration could negatively impact the
> >>>>>> injection/configuration of internal management server components.
> >>>>>> Essentially handing them a loaded shotgun pointed at our right
> >>>>>> foot.
> >>>>>> 3. *Nondeterministic Load/Unload Model*: Because the core loading
> >>>>>> mechanism is Spring, the management server has little control over
> >>>>>> the timing and order of component loading/unloading. Changes to
> >>>>>> the Management Server's component dependency graph could break a
> >>>>>> driver by causing it to be started at an unexpected time.
> >>>>>> 4. *Lack of Execution Isolation*: As a Spring component, plugins
> >>>>>> are loaded into the same execution context as core management
> >>>>>> server components. Therefore, an errant plugin can corrupt the
> >>>>>> entire management server.
> >>>>>>
> >>>>>> For the next revision of the plugin/driver mechanism, I would like
> >>>>>> to see us migrate towards a standard pluggable driver model that
> >>>>>> supports all of the management server's extension points (e.g.
> >>>>>> network devices, storage devices, hypervisors, etc.) with the
> >>>>>> following capabilities:
> >>>>>>
> >>>>>> - *Consolidated Lifecycle and Startup Procedure*: Drivers share a
> >>>>>> common state machine and categorization (e.g. network, storage,
> >>>>>> hypervisor, etc.) that permits the deterministic calculation of
> >>>>>> initialization and destruction order (i.e.
> >>>>>> network layer drivers -> storage layer drivers -> hypervisor
> >>>>>> drivers). Plugin inter-dependencies would be supported between
> >>>>>> plugins sharing the same category.
> >>>>>> - *In-process Installation and Upgrade*: Adding or upgrading a
> >>>>>> driver does not require the management server to be restarted.
> >>>>>> This capability implies a system that supports the simultaneous
> >>>>>> execution of multiple driver versions and the ability to suspend
> >>>>>> execution of work on a resource while the underlying driver
> >>>>>> instance is replaced.
> >>>>>> - *Execution Isolation*: The deployment packaging and execution
> >>>>>> environment supports different (and potentially conflicting)
> >>>>>> versions of dependencies to be simultaneously used. Additionally,
> >>>>>> plugins would be sufficiently sandboxed to protect the management
> >>>>>> server against driver instability.
> >>>>>> - *Extension Data Model*: Drivers provide a property bag with a
> >>>>>> metadata descriptor to validate and render vendor-specific data.
> >>>>>> The contents of this property bag will be provided to every driver
> >>>>>> operation invocation at runtime. The metadata descriptor would be
> >>>>>> a lightweight description that provides a label resource key, a
> >>>>>> description resource key, a data type (string, date, number,
> >>>>>> boolean), a required flag, and an optional length limit.
> >>>>>> - *Introspection*: Administrative APIs/UIs allow operators to
> >>>>>> understand the drivers installed in the system, their
> >>>>>> configuration, and their current state.
> >>>>>> - *Discoverability*: Optionally, drivers can be discovered via a
> >>>>>> project repository definition (similar to Yum), allowing drivers
> >>>>>> to be remotely acquired and operators to be notified regarding
> >>>>>> update availability.
> >>>>>> The project would also provide, free of charge, certificates to
> >>>>>> sign plugins. This mechanism would support local mirroring for
> >>>>>> air-gapped management networks.
> >>>>>>
> >>>>>> Fundamentally, I do not want to turn CloudStack into an erector
> >>>>>> set with more screws than nuts, which is a risk with highly
> >>>>>> pluggable architectures. As such, I think we would need to tightly
> >>>>>> bound the scope of drivers and their behaviors to prevent the loss
> >>>>>> of system usability and stability. My thinking is that drivers
> >>>>>> would be packaged into a custom JAR, a CAR (CloudStack ARchive),
> >>>>>> that would be structured as follows:
> >>>>>>
> >>>>>> - META-INF
> >>>>>>   - MANIFEST.MF
> >>>>>>   - driver.yaml (driver metadata (e.g. version, name, description,
> >>>>>>     etc.) serialized in YAML format)
> >>>>>>   - LICENSE (a text file containing the driver's license)
> >>>>>> - lib (driver dependencies)
> >>>>>> - classes (driver implementation)
> >>>>>> - resources (driver message files and potentially JS resources)
> >>>>>>
> >>>>>> The management server would acquire drivers through a simple scan
> >>>>>> of a URL (e.g. file directory, S3 bucket, etc.). For every CAR
> >>>>>> object found, the management server would create an execution
> >>>>>> environment (likely a dedicated ExecutorService and Classloader)
> >>>>>> and transition the state of the driver to Running (the exact state
> >>>>>> model would need to be worked out). To be really nice, we could
> >>>>>> develop a custom Ant task/Maven plugin/Gradle plugin to create
> >>>>>> CARs. I can also imagine opportunities to add hooks to this model
> >>>>>> to register instrumentation information with JMX and
> >>>>>> authorization.
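The acquisition step John sketches (scan a location, give each CAR its own classloader and executor, move it to Running) could look roughly like the following. All names here (`DriverHost`, the `State` enum) are assumptions for illustration; the email deliberately leaves the real state model open.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: scan a directory for CAR files and give each driver
// an isolated classloader plus a dedicated executor.
class DriverHost {
    enum State { RUNNING }

    static Map<String, State> loadAll(File dir) {
        Map<String, State> drivers = new LinkedHashMap<>();
        File[] cars = dir.listFiles((d, name) -> name.endsWith(".car"));
        if (cars == null) return drivers;  // missing or unreadable directory
        for (File car : cars) {
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[]{car.toURI().toURL()},
                    DriverHost.class.getClassLoader())) {
                // Isolation: a per-driver classloader lets conflicting
                // dependency versions coexist; a per-driver executor keeps
                // an errant driver off the core management threads.
                ExecutorService exec = Executors.newSingleThreadExecutor();
                drivers.put(car.getName(), State.RUNNING);
                exec.shutdown();  // a real host would keep both alive
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return drivers;
    }
}
```

Note the sketch closes the classloader and executor immediately to stay self-contained; a real host would hold them for the driver's lifetime, and tearing them down cleanly runs straight into the unloading hazards Darren raises earlier in the thread.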
> >>>>>> To keep the scope of this email confined, we would introduce the
> >>>>>> general notion of a Resource, and (hand wave hand wave) eventually
> >>>>>> compartmentalize the execution of work around a resource [1]. This
> >>>>>> (hand-waved) compartmentalization would allow us the controls
> >>>>>> necessary to safely and reliably perform in-place driver upgrades.
> >>>>>> For an initial release, I would recommend implementing the
> >>>>>> abstractions, loading mechanism, extension data model, and
> >>>>>> discovery features. With these capabilities in place, we could
> >>>>>> then attack the in-place upgrade model.
> >>>>>>
> >>>>>> If we were to adopt such a pluggable capability, we would have the
> >>>>>> opportunity to decouple the vendor and CloudStack release
> >>>>>> schedules. For example, if a vendor were introducing a new product
> >>>>>> that required a new or updated driver, they would no longer need
> >>>>>> to wait for a CloudStack release to support it. They would also
> >>>>>> gain the ability to fix high-priority defects in the same manner.
> >>>>>>
> >>>>>> I have hand-waved a number of issues that would need to be
> >>>>>> resolved before such an approach could be implemented. However, I
> >>>>>> think we need to decide, as a community, that it is worth devoting
> >>>>>> energy and effort to enhancing the plugin/driver model, and agree
> >>>>>> on the goals of that effort, before diving head first into the
> >>>>>> deep rabbit hole of design/implementation.
> >>>>>>
> >>>>>> Thoughts?
> >>>>>> (/me ducks)
> >>>>>> -John
> >>>>>>
> >>>>>> [1]: My opinions on the matter from CloudStack Collab 2013 ->
> >>>>>> http://www.slideshare.net/JohnBurwell1/how-to-run-from-a-zombie-cloud-stack-distributed-process-management
> >>>>>
> >>>>> --
> >>>>> *Mike Tutkowski*
> >>>>> *Senior CloudStack Developer, SolidFire Inc.*
> >>>>> e: mike.tutkow...@solidfire.com
> >>>>> o: 303.746.7302
> >>>>> Advancing the way the world uses the cloud
> >>>>> <http://solidfire.com/solution/overview/?video=play>
> >>>>> *™*
> >>>
> >>> --
> >>> *Mike Tutkowski*
> >>> *Senior CloudStack Developer, SolidFire Inc.*
> >>> e: mike.tutkow...@solidfire.com
> >>> o: 303.746.7302
> >>> Advancing the way the world uses the cloud
> >>> <http://solidfire.com/solution/overview/?video=play>
> >>> *™*