On Wed, 17 February 2016, Sylvain Bauza wrote
On 17/02/2016 12:59, Chris Dent wrote:
On Wed, 17 Feb 2016, Cheng, Yingxin wrote:


To better illustrate the differences between the shared-state, resource-
provider and legacy schedulers, I've drawn 3 simplified pictures [1]
emphasizing the location of the resource view, the location of claims and
resource consumption, and the resource update/refresh pattern in the three
kinds of schedulers. I hope I've got the "resource-provider scheduler"
part right.

That's a useful visual aid, thank you. It aligns pretty well with my
understanding of each idea.

A thing that may be missing, which may help in exploring the usefulness
of each idea, is a representation of resources which are separate
from compute nodes and shared by them, such as shared disk or pools
of network addresses. In addition, some would argue that we need to
see bare-metal nodes for a complete picture.

One of the driving motivations of the resource-provider work is to
make it possible to adequately and accurately track and consume the
shared resources. The legacy scheduler currently fails to do that
well. As you correctly point out, it does this by having "strict
centralized consistency" as a design goal.

So, to be clear, I'm really happy to see the resource-providers series, for many 
reasons:
 - it will give us a nice Facade for getting resources and allocating them
 - it will help shared-storage deployments by making sure that we don't run into 
resource accounting problems when a resource is shared
 - it will create the possibility for external resource providers to expose some 
resource types to Nova so that the Nova scheduler can use them (like 
Neutron-related resources)

I really want to have that implemented in Mitaka and Newton, and I'm totally 
on board and supporting it.

To be clear, the only problem I see with the series is [2], not the whole series.

@cdent:
As far as I know, some resources are labeled "shared" simply because they are 
not resources of the compute node service. In other words, the compute node 
resource tracker does not have authority over those "shared" resources. For 
example, the "shared" storage resources are actually managed by the storage 
service, and the "shared" network resource "IP pool" is actually owned by the 
network service. If resources are labeled "shared" only because they are not 
owned by compute node services, then the shared-resource 
tracking/consumption problem can be solved by implementing resource trackers 
in all the authoritative services. Those resource trackers constantly provide 
incremental updates to schedulers, and have the responsibility to reserve and 
consume resources independently and in a distributed fashion, no matter which 
service they belong to: compute, storage, network, etc.
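To make that concrete, here is a minimal sketch, not actual Nova/Neutron/Cinder
code: ServiceResourceTracker and the send_update() transport are made-up names,
just to show a per-service tracker that owns its resources, reserves them
locally, and pushes incremental updates to schedulers.

# Minimal sketch: a resource tracker owned by the authoritative service
# (compute, storage, network, ...) that reserves its own resources and
# pushes incremental updates to schedulers. send_update() is hypothetical.

import threading
import time


class ServiceResourceTracker(object):

    def __init__(self, service_name, capacity, send_update):
        self.service_name = service_name
        self.capacity = dict(capacity)        # e.g. {'IPV4_ADDRESS': 256}
        self.used = {rc: 0 for rc in capacity}
        self.send_update = send_update        # callable pushing deltas to schedulers
        self._lock = threading.Lock()

    def claim(self, resource_class, amount):
        """Reserve resources locally; only this service can authorize it."""
        with self._lock:
            free = self.capacity[resource_class] - self.used[resource_class]
            if amount > free:
                return False
            self.used[resource_class] += amount
        # Push an incremental update instead of a full snapshot refresh.
        self.send_update({'service': self.service_name,
                          'resource_class': resource_class,
                          'delta': amount,
                          'timestamp': time.time()})
        return True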

As can be seen in the illustrations [1], the main compatibility issue
between the shared-state and resource-provider schedulers is caused by the
different locations of claim/consumption and the assumed consistent
resource view. IMO, unless claims are allowed to happen in both
places (the resource tracker and the resource-provider db), it seems difficult
to make the shared-state and resource-provider schedulers work together.

Yes, but doing claims twice feels intuitively redundant.

As I've explored this space I've often wondered why we feel it is
necessary to persist the resource data at all. Your shared-state
model is appealing because it lets the concrete resource(-provider)
be the authority about its own resources. That is information which
it can broadcast as it changes or on intervals (or both) to other
things which need that information. That feels like the correct
architecture in a massively distributed system, especially one where
resources are not scarce.
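The "as it changes or on intervals (or both)" part could look something like
this tiny companion to the tracker sketch above; send_snapshot() is again a
made-up transport, only to show periodic re-broadcast so consumers that missed
a delta eventually re-converge.

# Sketch: besides pushing deltas on every change, the authoritative service
# periodically re-broadcasts its full view. send_snapshot() is hypothetical.

import threading


def start_periodic_broadcast(tracker, send_snapshot, interval=60.0):
    def _broadcast():
        send_snapshot({'service': tracker.service_name,
                       'capacity': dict(tracker.capacity),
                       'used': dict(tracker.used)})
        timer = threading.Timer(interval, _broadcast)
        timer.daemon = True
        timer.start()

    _broadcast()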

So, IMHO, we should only have the compute nodes be the authority for 
allocating resources. There are many reasons for that, which I provided in the 
spec review, but I can repeat them here:
 - #1 If we consider that an external system, acting as a resource provider, 
will report usage of a single resource class (like network segment availability), 
it will still require the instance to be spawned *for* that resource class to be 
consumed, even if the scheduler accounts for it. That would mean that the 
scheduler would have to manage a list of allocations with TTLs, and periodically 
verify that each allocation succeeded by asking the external system (or getting 
feedback from the external system). See, that's racy.
 - #2 The scheduler is just a decision maker; in any case it doesn't account 
for the real instance creation (it doesn't hold ownership of the instance). 
Making it accountable for instance usage is really difficult. Take for example 
a request for CPU pinning or NUMA affinity: the user can't really express which 
pCPUs his vCPUs will be pinned to; it's the compute node which will decide that 
for him. Of course, the scheduler will help pick a host that can fit the 
request, but the real pinning will happen on the compute node (a rough sketch 
of that split follows below).
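Here is that rough sketch, with purely illustrative names: the scheduler side
can only verify that enough pCPUs are free, while the compute node decides
which pCPUs actually get pinned.

# Illustrative only: the scheduler-side check works on counts, while the
# compute node, which knows its own pCPU topology, performs the real pin.

def scheduler_can_fit(free_pcpu_count, requested_vcpus):
    # Decision making: filter hosts on aggregate numbers only.
    return free_pcpu_count >= requested_vcpus


def compute_node_pin(free_pcpu_ids, requested_vcpus):
    # Claiming: pick the concrete pin set; may still fail and trigger a retry.
    if len(free_pcpu_ids) < requested_vcpus:
        raise RuntimeError('claim failed on the host; reschedule')
    return sorted(free_pcpu_ids)[:requested_vcpus]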
IMHO, the authority to allocate resources is not limited to compute nodes; it 
also includes the network service, the storage service, and any other service 
that has the authority to manage its own resources. Those "shared" resources 
come from external services (i.e. systems) that are not the compute service. 
They all have the responsibility to push their own resource updates to 
schedulers and to make resource reservations and consume resources. The 
resource-provider series provides a flexible representation of all kinds of 
resources, so the scheduler can handle them without specific knowledge of each 
resource.
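For illustration only (the field and provider names below are made up, not the
actual resource-provider schema), such a generic representation could look like
this, with the scheduler doing the same class-agnostic arithmetic everywhere:

# Sketch of a generic inventory representation: every provider, compute node
# or not, exposes inventories keyed by resource class, and the scheduler needs
# no class-specific knowledge to compute free capacity.

from collections import namedtuple

Inventory = namedtuple('Inventory', ['resource_class', 'total', 'reserved', 'used'])

provider_inventories = {
    'compute-node-1': [Inventory('VCPU', 32, 0, 10),
                       Inventory('MEMORY_MB', 131072, 4096, 40960)],
    'shared-storage-pool': [Inventory('DISK_GB', 100000, 1000, 25000)],
    'neutron-ip-pool': [Inventory('IPV4_ADDRESS', 256, 2, 57)],
}


def free(inventory):
    # Same arithmetic for every resource class.
    return inventory.total - inventory.reserved - inventory.used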

Also, I'm very interested in keeping an optimistic scheduler which wouldn't 
lock the entire view of the world every time a request comes in. There are many 
papers showing different architectures and benchmarks of the different 
possibilities, and TBH I'm very concerned about the scaling effects.
We should also keep in mind our new paradigm called Cells V2, which implies a 
global distributed scheduler handling all requests. Having it follow the same 
design tenets as OpenStack [3] by using an "eventually consistent 
shared state" makes my gut say that I'd love to see that.
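To show what "optimistic, no global lock" could mean in practice, here is a
minimal sketch under assumed interfaces (cached_view is a local dict of free
vCPUs per host, compute_rpc.try_claim() is hypothetical): the scheduler decides
on possibly stale data, the compute node as the authority accepts or rejects
the claim, and a lost race simply means retrying another host.

# Minimal sketch of an optimistic scheduling loop: no global lock is taken;
# the decision is made on a possibly stale cached view and the compute node
# confirms the claim. compute_rpc.try_claim() is a hypothetical interface.

class NoValidHost(Exception):
    pass


def schedule(request, cached_view, compute_rpc, max_attempts=3):
    for _ in range(max_attempts):
        candidates = [h for h, free_vcpus in cached_view.items()
                      if free_vcpus >= request['vcpus']]
        if not candidates:
            raise NoValidHost()
        host = max(candidates, key=cached_view.get)   # naive weigher
        if compute_rpc.try_claim(host, request):      # compute node is the authority
            cached_view[host] -= request['vcpus']     # optimistic local update
            return host
        del cached_view[host]                         # stale entry; drop and retry
    raise NoValidHost()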





The advantage of a centralized datastore for that information is
that it provides administrative control (e.g. reserving resources for
other needs) and visibility. That level of command and control seems
to be something people really want (unfortunately).


My point is that while I truly understand the need for an API resource like 
"scheduler, tell me how much of my cloud is free", that doesn't necessarily 
need to be accurate; eventually consistent is enough.
If operators want to do capacity planning, they need trends and thresholds, not 
precise figures that can change every time a request comes in.

The disadvantage of a centralized datastore is that administrators/monitors 
have to refresh their resource view from the db and compare snapshots to see 
changes. Distributed resource trackers with a shared-state scheduler provide a 
natural way to present the constantly changing resource view with the lowest 
overhead. It is worth noting that a shared-state scheduler is a natural 
resource monitor if it does no scheduling.


-Sylvain




[2] https://review.openstack.org/#/c/271823/
[3] https://wiki.openstack.org/wiki/BasicDesignTenets



Regards,
-Yingxin

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
