On Wed, Sep 11, 2013 at 03:51:02AM +0000, Adrian Otto wrote:
> I have a different point of view. First I will offer some assertions:
>
> A-1) We need to keep it simple.
> A-1.1) Systems that are hard to comprehend are hard to debug, and
> that's bad.
> A-1.2) Complex systems tend to be much more brittle than simple ones.

I don't think anyone will disagree with this, but the solutions we've been discussing are neither complex nor hard to comprehend. The layered topology discussed is simply aimed at ensuring we don't have significant duplicate functionality between services, and the best way to do that is just to implement each piece of functionality in one service, ensuring the scope of each service is sufficiently well defined and separated.

> A-2) Scale-up operations need to be as-fast-as-possible.
> A-2.1) Auto-Scaling only works right if your new capacity is added
> quickly when your controller detects that you need more. If you spend a bunch
> of time goofing around before actually adding a new resource to a pool when
> its under staring.
> A-2.2) The fewer network round trips between "add-more-resources-now"
> and "resources-added" the better. Fewer = less brittle.

Sure, latency in any control system is important, but in this case the additional delay caused by one more service in the chain is very likely to be insignificant compared to the time taken to build, launch, and customize an instance.

> A-3) The control logic for scaling different applications vary.
> A-3.1) What metrics are watched may differ between various use cases.
> A-3.2) The data types that represent sensor data may vary.

So? The metric source will be the same regardless of where AS is implemented, i.e. Ceilometer.

> A-3.3) The policy that's applied to the metrics (such as max, min, and
> cooldown period) vary between applications. Not only the values vary, but the
> logic itself.
> A-3.4) A scaling policy may not just be a handful of simple parameters.
> Ideally it allows configurable logic that the end-user can control to some
> extent.

Ok, so having some way to implement specialized scaling policies seems, AFAICT, to be the main driver behind all this autoscaling-service discussion?

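A-3.3 and A-3.4 separate the simple parameters (min, max, cooldown) from the decision logic itself. A minimal Python sketch of that split, assuming a hypothetical interface in which a policy is a function from recent metric samples and the current group size to a desired size. `ScalingPolicy`, `mean_threshold` and `peak_proportional` are illustrative names only, not Heat, Ceilometer or Taskflow APIs, and the thresholds are made up:

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence

# Hypothetical interface: (recent metric samples, current size) -> desired size.
PolicyLogic = Callable[[Sequence[float], int], int]


def mean_threshold(samples: Sequence[float], current: int) -> int:
    """Step up one member above 80% average load, down one below 20%."""
    avg = mean(samples)
    if avg > 80.0:
        return current + 1
    if avg < 20.0:
        return current - 1
    return current


def peak_proportional(samples: Sequence[float], current: int) -> int:
    """Resize so the peak sample would fall to roughly 60% per member."""
    return math.ceil(current * max(samples) / 60.0)


@dataclass
class ScalingPolicy:
    logic: PolicyLogic   # user-supplied decision logic (A-3.4)
    min_size: int        # plain parameters (A-3.3)
    max_size: int
    cooldown: float      # seconds during which further alarms are ignored
    last_change: Optional[float] = None

    def evaluate(self, now: float, samples: Sequence[float], current: int) -> int:
        """Return the new group size; unchanged while cooling down."""
        if self.last_change is not None and now - self.last_change < self.cooldown:
            return current
        desired = max(self.min_size, min(self.max_size, self.logic(samples, current)))
        if desired != current:
            self.last_change = now
        return desired


web = ScalingPolicy(mean_threshold, min_size=2, max_size=10, cooldown=300.0)
batch = ScalingPolicy(peak_proportional, min_size=1, max_size=20, cooldown=600.0)
samples = [70.0, 85.0, 95.0]
print(web.evaluate(0.0, samples, 4))    # 5: mean is about 83.3, above 80
print(batch.evaluate(0.0, samples, 4))  # 7: ceil(4 * 95 / 60)
print(web.evaluate(60.0, samples, 5))   # 5: still inside the 300 s cooldown
```

The metric samples come from the same source in every case; only the pluggable `logic` differs. The cooldown is also what damps the feedback loop: with an illustrative 300 s cooldown, the extra latency of one more service hop is small by comparison.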
I don't think anyone has ever said we shouldn't provide an interface which allows end users to implement whatever scaling policy they want. To some extent, Provider resources already allow this.

Something we discussed at the Havana summit (but which has not yet been implemented) was the idea of a generic webhook-based policy resource, which took the data associated with the scaling event (alarm), simply made a request to $special_scaling_service, and then acted on the result. This would probably be very easy to implement as a Heat resource.

> A-4) Auto-scale operations are usually not orchestrations. They are usually
> simple linear workflows.
> A-4.1) The Taskflow project[1] offers a simple way to do workflows and
> stable state management that can be integrated directly into Autoscale.
> A-4.2) A task flow (workflow) can trigger a Heat orchestration if
> needed.

So, we should probably consider the Nova group scheduling features here; it seems like you basically just want a policy service between Ceilometer and the Nova group-scheduling API?

This is fine, provided you never care about dependencies between resources (instances). As soon as you start thinking about things like clustering, or notifying other dependent resources, it becomes an orchestration problem IMO.

> Now a mental tool to think about control policies:
>
> Auto-scaling is like steering a car. The control policy says that you want to
> drive equally between the two lane lines, and that if you drift off center,
> you gradually correct back toward center again. If the road bends, you try to
> remain in your lane as the lane lines curve. You try not to weave around in
> your lane, and you try not to drift out of the lane.
>
> If your controller notices that you are about to drift out of your lane
> because the road is starting to bend, and you are distracted, or your hands
> slip off the wheel, you might drift out of your lane into nearby traffic.
> That's why you don't want a Rube Goldberg Machine[2] between you and the
> steering wheel. See assertions A-1 and A-2.
>
> If you are driving an 18-wheel tractor/trailer truck, steering is different
> than if you are driving a Fiat. You need to wait longer and steer toward the
> outside of curves so your trailer does not lag behind on the inside of the
> curve behind you as you correct for a bend in the road. When you are driving
> the Fiat, you may want to aim for the middle of the lane at all times,
> possibly even apexing bends to reduce your driving distance, which is
> actually the opposite of what truck drivers need to do. Control policies
> apply to other parts of driving too. I want a different policy for braking
> than I use for steering. On some vehicles I go through a gear shifting
> workflow, and on others I don't. See assertion A-3.

Thanks for that amusingly verbose lecture in system dynamics ;D

As mentioned previously, I think the delay in the feedback loop caused by how long it takes to spin up instances means you will need to damp the loop so much (via cooldown periods) to avoid oscillation that any delay introduced by the controlling services is likely to be insignificant.

> So, I don't intend to argue the technical minutia of each design point, but I
> challenge you to make sure that we (1) arrive at a simple system that any
> OpenStack user can comprehend, (2) responds quickly to alarm stimulus, (3) is
> unlikely to fail, (4) can be easily customized with user-supplied logic that
> controls how the scaling happens, and under what conditions.

So:
(1) I think we all want something simple, but also flexible and without undue duplication.
(2) I think this concern is overstated, as argued above.
(3) Sure, there are a few ways to enable this, and a Heat scaling resource is a valid and useful one IMO (not necessarily the only one).

> It would be better if we could explain Autoscale like this:
>
> Heat -> Autoscale -> Nova, etc.
> -or-
> User -> Autoscale -> Nova, etc.
>
> This approach allows use cases where (for whatever reason) the end user does
> not want to use Heat at all, but still wants something simple to be
> auto-scaled for them. Nobody would be scratching their heads wondering why
> things are going in circles.
>
> From an implementation perspective, that means the auto-scale service needs
> at least a simple linear workflow capability in it that may trigger a Heat
> orchestration if there is a good reason for it. This way, the typical use
> cases don't have anything resembling circular dependencies. The source of
> truth for how many members are currently in an Autoscaling group should be
> the Autoscale service, not in the Heat database. If you want to expose that
> in list-stack-resources output, then cause Heat to call out to the Autoscale
> service to fetch that figure as needed. It is irrelevant to orchestration.
> Code does not need to be duplicated. Both Autoscale and Heat can use the same
> exact source code files for the code that launches/terminates instances of
> resources.

So I take issue with the "circular dependencies" statement; nothing proposed so far has anything resembling a circular dependency.

I think it's better to consider traditional encapsulation, where two projects may very well make use of the same class from a library. Why is it less valid to consider code reuse via another interface (a ReST service)?

The point of the arguments to date, AIUI, is to ensure that orchestration actions and the management of dependencies don't get duplicated in any AS service which is created.

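To make the encapsulation point concrete, a minimal Python sketch of the same code reused two ways: directly as a library, and through a request handler standing in for a ReST endpoint. `InstanceLauncher` and `LauncherService` are hypothetical names, the launcher keeps an in-memory list instead of calling Nova, and the JSON request format is an assumption made for illustration:

```python
import json
from typing import Any, Dict, List


class InstanceLauncher:
    """Hypothetical shared launch/terminate code (in-memory, no Nova calls)."""

    def __init__(self) -> None:
        self._instances: List[str] = []
        self._counter = 0

    def launch(self, count: int) -> List[str]:
        new = []
        for _ in range(count):
            self._counter += 1
            new.append(f"instance-{self._counter}")
        self._instances.extend(new)
        return new

    def terminate(self, count: int) -> List[str]:
        # Remove the most recently launched members first.
        removed: List[str] = []
        for _ in range(min(count, len(self._instances))):
            removed.append(self._instances.pop())
        return removed


class LauncherService:
    """Stand-in for a ReST endpoint: JSON string in, JSON string out."""

    def __init__(self, launcher: InstanceLauncher) -> None:
        self.launcher = launcher

    def handle(self, body: str) -> str:
        request: Dict[str, Any] = json.loads(body)
        action = request["action"]
        count = int(request["count"])
        if action == "launch":
            result = self.launcher.launch(count)
        elif action == "terminate":
            result = self.launcher.terminate(count)
        else:
            return json.dumps({"error": f"unknown action {action!r}"})
        return json.dumps({"instances": result})


# One caller links the class directly; another reaches it via the service.
direct = InstanceLauncher()
service = LauncherService(InstanceLauncher())
via_library = direct.launch(2)
via_service = json.loads(service.handle(json.dumps({"action": "launch", "count": 2})))["instances"]
print(via_library == via_service)  # True: same code, two interfaces
```

Either way the launch logic lives in exactly one place; the service wrapper adds an interface, not a second implementation.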
It seems to me, as previously stated, that what you are really describing (and seem to want) is an Autoscaling *policy* service, which can act as a decision point between alarms (from Ceilometer) and scaling actions (in Heat, or potentially directly in Nova).

The recent rework to enable scaling actions to be triggered directly from Ceilometer has actually made this much easier, and less tightly coupled to Heat. Heat AutoScaling actions can be triggered by a pre-signed web-hook URL, which is passed to Ceilometer when we set up the alarm, and called whenever an alarm happens.

We currently have:

    Ceilometer --> Heat --> Nova

You seem to want (where orchestration is required):

    Ceilometer --> AS Policy Service --> Heat --> Nova

You seem to want (where orchestration is *not* required):

    Ceilometer --> AS Policy Service --> Nova

An optional alternate data flow would be via a Heat resource representing the policy service, where Heat calls the policy service instead of using the internal simple policy implementation.

The only issue I have with this is that I'm still not sure what value the policy service actually adds, functionally, other than some perceived-to-be-simpler (or more AWS-ish?) AS API, and a way to plug in custom policies without defining a Heat resource (examples of how you envisage them being defined might help).

Steve

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev