This is primarily an issue for Nova.

Mike Spreitzer/Watson/IBM@IBMUS wrote on 08/20/2014 01:21:24 AM:
From: Mike Spreitzer/Watson/IBM@IBMUS
To: "OpenStack Development Mailing List" <openstack-dev@lists.openstack.org>
Date: 08/20/2014 01:24 AM
Subject: [openstack-dev] Separating the issues around smarter/solver/joint/holistic placement

There has been a lot of discussion around these issues; let me see if I can break it down into pieces, hopefully in a way that allows some progress on one of them first.

I continue to focus on the timeless version of the problem, in which the placement question is simply where we can put some guest(s) now, without making plans for whether/how things may change in the future. There is also a rich discussion to be had about placement over time (in other communities this is what "scheduling" means), but that is not my goal here.

I see essentially three issues: (i1) more sophisticated placement criteria, (i2) joint vs. sequential decision making, and (i3) single-service vs. cross-service. I think we could make a roadmap for addressing them in this order. Do you think the issues are separable and attackable in this way?

Objections have been raised regarding "complexity", and I would like to be clear about what complexity we are discussing. For each of these three issues, I can think of five kinds of complexity to discuss: the development complexity faced by us OpenStack developers and/or a cloud provider who wants to add such things; the runtime costs; the API complexity for allowing the issue to be addressed; the development complexity faced by cloud users in writing their placement problems; and the size of the written problems.

(i1) More sophisticated placement criteria. For example, I think cloud providers would like a way to express a preference for consolidation, to help them conserve energy. That is a fairly rich topic in itself, and I hope the details need not concern us here. OTOH, cloud users may want a way to express a preference for spreading their guests out --- for example, spreading around within an availability zone. This may seem relatively boring for stateless web servers, but if you think about things like ZooKeeper or Cassandra servers you see a serious preference for getting as much failure independence as you can within a given AZ. Other examples of what cloud users may want cut across services, like placing VMs and storage volumes in close proximity (by some measure).

Let's consider a couple of concrete examples from the record. One is referenced in the current revision of https://review.openstack.org/#/c/96543/ --- namely, http://goo.gl/vuHcz2 . There Yathi exhibits a binary integer programming problem for placing a group of guests so as to minimize the sum, over the group's members, of (the chosen host's free RAM before any members of this group are placed). Minimizing that objective has a roughly consolidating effect in many situations, even though it does not exactly express consolidation (expressing consolidation exactly requires a non-linear objective). Similarly, maximizing that objective has something of a tendency to spread things out, even though it does not exactly express spreading. (BTW, I do not think we should assume that all hosts in an AZ have the same capacity vector.)
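To make the shape of that binary program concrete, here is a minimal sketch in Python using the PuLP modeling library; PuLP is merely one convenient way to write such a program, not what Yathi's formulation or Nova uses, and the capacity numbers are made up:

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum

    guests = ["g1", "g2", "g3"]
    free_ram = {"h1": 2048, "h2": 8192}   # free RAM (MB) before any member is placed
    ram_needed = {"g1": 1024, "g2": 1024, "g3": 1024}

    prob = LpProblem("consolidating_placement", LpMinimize)
    x = LpVariable.dicts("place", (guests, list(free_ram)), cat="Binary")

    # Objective: sum, over guests, of the chosen host's free RAM as
    # measured before the group is placed (roughly consolidating).
    prob += lpSum(x[g][h] * free_ram[h] for g in guests for h in free_ram)

    # Each guest lands on exactly one host.
    for g in guests:
        prob += lpSum(x[g][h] for h in free_ram) == 1

    # No host is overcommitted.
    for h in free_ram:
        prob += lpSum(x[g][h] * ram_needed[g] for g in guests) <= free_ram[h]

    prob.solve()
    print({g: next(h for h in free_ram if x[g][h].value() == 1) for g in guests})

With these numbers the solver packs two guests onto the fuller host h1 and spills only the third onto h2; flipping LpMinimize to LpMaximize gives the rough spreading behavior instead.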
Another example is at https://wiki.openstack.org/wiki/Heat/PolicyExtension#LLMN_Anti-Collocation . There we see a criterion that is a precise spreading condition, for spreading a group of guests to a certain precise degree. (BTW, I tend to use "collocation" and "anti-collocation" when speaking of affinity stated in terms of location --- as opposed, for example, to network proximity.)

For any collection of more or less sophisticated placement criteria, if we consider the question of placing a single guest --- supposing it is stipulated that some guests have already had their placement determined and others will be determined later --- the question boils down to exactly the terms we have in Nova's FilterScheduler today: (a) which hosts can host this guest and (b) ranking the acceptable hosts. Of course, that is not the only way to address a problem of placing several guests --- but it is one valid way, even if the placement problem is stated as a mixed integer programming problem. Group placement problems are NP-hard, so solving them exactly is out of the question. The only question is how hard we are going to work at making a good approximation; that is what (i2) is about, and we will get to it below.

Let us review the two concrete examples with an eye toward the five kinds of complexity, starting with Yathi's RAM-minimization binary integer programming example. There is a difficulty if we try to work that example in today's framework: the objective is a function of the amount of RAM that was free on each host *before any group members are placed*. Today's FilterScheduler works on one guest at a time, updating host states as it goes along; it does not offer earlier host states for use in later decisions. However, today we can solve a slightly different problem --- which happens to be an even better approximation of a consolidation problem. That is to minimize the sum of (host's available RAM just before the guest is placed), rather than the sum of (host's available RAM before any guests in the group are placed). Today's Nova already has a weigher that the cloud provider can configure to do the revised sum. The code complexity of that weigher is code that evaluates the chosen weight function for a given guest on a given host. The runtime cost is the cost of doing that evaluation once for every guest in the group, for every acceptable host. The additional API complexity is nil, since no changes are required. There are no complexities for cloud users in this case, since it is a cloud provider matter.

Let us consider the LLMN anti-collocation statement applied to a group --- let's say a Nova server group. This could be handled in today's FilterScheduler by a filter that, when there is an LLMN anti-collocation statement in the scheduler hints, tests whether a proposed placement is consistent with the criterion and the placements already chosen for that group. This would be similar to the sort of logic we see in today's various affinity filters. That sort of logic evaluates the condition directly, without using anything cached for the group, which makes the runtime cost higher than necessary: it amounts to a triply nested loop over all group members, over all otherwise-acceptable hosts, and over all group members already placed --- that is, about 0.5*H*G^2 iterations of the innermost loop body, if H is the average number of otherwise-acceptable hosts and G is the number of guests in the group. If the code framework were expanded so that the filter could know the decisions made by the FilterScheduler and cache the number of group members already placed in a given L1 zone, the runtime complexity could be reduced to H*G iterations of a different innermost loop body. The additional API complexity is nil if we leave the framework alone; otherwise it is the aforementioned expansion of the SPI for filters. The development complexity for a cloud user consists of picking the L1, L2, and N parameters for a given group, and the size of the statement is the size of those parameters and their binding to the group.
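As a rough sketch of the cached variant just described --- assuming, for simplicity, that the criterion merely bounds the number of group members per L1 zone at N (the wiki page gives the full LLMN definition), and using invented class and method names rather than Nova's filter SPI:

    from collections import Counter

    class CachedAntiCollocation:
        """Per-group bookkeeping so each host check costs O(1) instead of
        rescanning every group member already placed."""

        def __init__(self, max_per_zone):
            self.max_per_zone = max_per_zone   # the N parameter
            self.placed = Counter()            # L1 zone -> members placed so far

        def host_passes(self, zone_of_host):
            # O(1) test against the cached count; the naive filter would
            # loop over all previously placed members here.
            return self.placed[zone_of_host] < self.max_per_zone

        def record(self, zone_of_host):
            # Called once per decision the scheduler commits to.
            self.placed[zone_of_host] += 1

    # G guests, each checked against H hosts: H*G iterations of an O(1) body.
    state = CachedAntiCollocation(max_per_zone=2)
    zone = {"host1": "rack1", "host2": "rack1", "host3": "rack2"}
    for guest in ["a", "b", "c"]:
        chosen = next(h for h in zone if state.host_passes(zone[h]))
        state.record(zone[chosen])
        print(guest, "->", chosen)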
(i2) Joint (AKA simultaneous) vs. sequential decision making. The change here is about changing the placement framework so that it can, in one invocation, choose the placement for each member of a general (i.e., not necessarily homogeneous) group of guests. The motivation for this is the observation that sequential decision making can paint itself into a corner, since it has no view of where it is going. You might think that this is a non-issue if your AZ has a comfortable amount of spare capacity, but that is not quite right --- it is individual hosts, rather than an AZ as a whole, that run out of capacity. Consider an example in which the cloud provider has a preference for consolidation --- but it is of secondary importance; the cloud provider wants to put the cloud users' criteria first. Suppose a cloud user has a group of 10 guests that he wants collocated on the same host. Suppose the cloud has several hosts that are full, one that has spare capacity for 5 such guests, and some hosts that are empty and have capacity for 20 such guests each. Sequential decision making is going to put the first 5 guests on that partially full host, and then be in a tight spot (live migration has costs, and is disallowed in some situations). Joint decision making, OTOH, will put all 10 guests on the same formerly empty host.

The development complexity of switching to joint decision making is illustrated by the existing change for that --- https://review.openstack.org/#/c/46588/ .

The runtime costs depend on input patterns, the way input is translated into an optimization problem, and the solver used. Typically we use H*G binary decision variables, which puts a lower bound of H*G on runtime costs. Even for a linear objective, runtime complexity is exponential in the worst case (the problem is NP-hard, remember) --- but runtime complexity is often much lower in practice, as with, for example, the Simplex method. I hope Yathi can chime in here with a more detailed statement about what he has seen.

My own group has been working with non-linear objectives and a very flexible solver based on biased sampling. Its complexity is O(G*(H+K)), where G and H are as above and K is the number of pairwise constraints (the only kind understood by this solver); the leading coefficient is more than one. In a not-so-recent paper (more recent results are better), my colleague Asser Tantawi was getting decent results from runs that evaluate between 10^2 and 10^3 samples; here a sample is a candidate joint placement (containing a candidate host for each guest in the group).
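Returning to the 10-guest collocation example above, a toy run (with invented capacities, and each policy reduced to a one-liner) shows the corner that sequential decision making paints itself into:

    # slots = how many of these guests each host can still take
    slots = {"full": 0, "partial": 5, "empty1": 20, "empty2": 20}

    def place_sequentially(slots, n):
        """One guest at a time, preferring the fullest host that still fits."""
        free, placement = dict(slots), []
        for _ in range(n):
            chosen = min((h for h in free if free[h] >= 1), key=lambda h: free[h])
            free[chosen] -= 1
            placement.append(chosen)
        return placement

    def place_jointly(slots, n):
        """Whole group at once: find a single host that fits all n members."""
        return [next(h for h in slots if slots[h] >= n)] * n

    print(place_sequentially(slots, 10))  # 5 on 'partial', then forced onto 'empty1'
    print(place_jointly(slots, 10))       # all 10 together on 'empty1'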
The client API complexity of switching to joint decision making centers on the need to present a whole group placement problem at once. If we leave orchestration to the clients (I detect a strong preference for this in the community, although I personally think alternatives could be viable), this requires introducing an introductory phase in which the group members and their placement criteria are sufficiently described and the joint decision is made. To cope properly with the fact that things can go wrong during orchestration, this would benefit from also adding a final confirmation phase in which the client declares that it has done all it is going to do about the group at hand --- and this, in turn, requires an orchestration leasing mechanism to make the final phase eventually happen even if the client aborts. The API in question does not have to be the Nova API; there is current work on separating out scheduling. But regardless of whether it is part of Nova or something else, there would need to be an API along these lines.

Part of the API complexity is about providing some way for the chosen placement decisions to get from the introductory phase into the orchestration phase (where the guests are actually put in the chosen places). I can see two approaches. One is for the introductory phase to return placement decisions to the client, and the client passes them into the guest create/update operations in the orchestration phase. The other is based on using a unique identifier for a group member in both the introductory and orchestration phases, so that the placement decision can travel under the covers while the client deals with only the identifier.
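To make those phases concrete, here is a self-contained toy of the flow; every name in it is hypothetical (this is the shape of the interaction, not an existing or proposed API), and it shows the first of the two decision-passing approaches, returning placements to the client:

    import uuid

    class ToyPlacementService:
        def __init__(self, capacity):
            self.capacity = dict(capacity)   # host -> free slots
            self.claims = {}                 # lease id -> claimed decisions

        def solve_joint(self, needs):
            """Introductory phase: place the whole group at once (greedy toy)."""
            free, decisions = dict(self.capacity), {}
            for member, slots in needs.items():
                host = max(free, key=free.get)            # emptiest host
                if free[host] < slots:
                    raise RuntimeError("group does not fit")
                free[host] -= slots
                decisions[member] = host
            return decisions

        def claim(self, decisions):
            """Joint claim on host capacity, held under a lease."""
            lease = str(uuid.uuid4())
            self.claims[lease] = decisions
            return lease

        def confirm(self, lease):
            """Final phase: client declares it is done; the claim becomes final."""
            del self.claims[lease]

    svc = ToyPlacementService({"hostA": 4, "hostB": 2})
    needs = {"web1": 1, "web2": 1, "db1": 2}

    decisions = svc.solve_joint(needs)            # introductory phase
    lease = svc.claim(decisions)
    for member in needs:                          # orchestration phase
        print("create", member, "on", decisions[member])   # stand-in for guest creation
    svc.confirm(lease)                            # final confirmation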
There are also SPI changes needed, so that the joint problem is preserved all the way from the API into the scheduler. Additionally, this needs a way to make a joint claim on host capacity (there is a potential connection to Blazar here).

The additional complexity for developers of clients is the complexity of: writing code that creates the introductory description of group members and their placement criteria; modifying the orchestration code to refer to the results of the introductory phase and to keep the orchestration lease alive; and writing code that does the final confirmation.

The additional input size is the size of the needed descriptions of group members (i.e., what capacities they need). If this is step 2 in a roadmap, then the placement criteria are already being described once (no new/different input is needed for that --- which is an advantage of this roadmap); depending on details, the new API would take one or two copies of each placement criterion.

Note that a user who is using Heat has already written such a description --- a Heat template already contains descriptions of group members and their placement criteria. A Heat user would also escape the development burden of using the expanded API, if Heat covers that (either in upstream code or in provider plugins).

(i3) Single service vs. cross service. Issues (i1) and (i2) appear even when you consider only one service. Those issues get more intense when you want placement criteria that involve multiple services (e.g., Cinder and Nova). I think we already have a well-established agreement that (i3) belongs last on the roadmap.

Thanks,
Mike

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev