On 19 February 2016 at 5:58, John Garbutt wrote:
> On 17 February 2016 at 17:52, Clint Byrum <cl...@fewbar.com> wrote:
> > Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
> >> Hi,
> >>
> >> I've uploaded a prototype https://review.openstack.org/#/c/280047/ to
> >> testify its design goals in accuracy, performance, reliability and
> >> compatibility improvements. It will also be an Austin Summit Session
> >> if elected:
> >> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
>
> Long term, I see a world where there are multiple schedulers Nova is able to use,
> depending on the deployment scenario.
>
> We have tried to stop any more schedulers going in tree (like the solver scheduler)
> while we get the interface between the nova-scheduler and the rest of Nova
> straightened out, to make that much easier.
Technically, what I've implemented is a new type of scheduler host manager, `shared_state_manager.SharedHostManager` [1], with the ability to synchronize host states directly from the resource trackers. The filter scheduler driver can choose to load this manager through stevedore [2], and thus get a different update model for its internal caches (a rough sketch of that wiring is included after the quoted text below). This new manager is highly compatible with the current scheduler architecture: a filter scheduler using HostManager can even run at the same time as schedulers loaded with SharedHostManager (tested). So why not have this in tree to give operators more options when choosing host managers?

I am also of the opinion that the caching scheduler is not really a new kind of scheduler driver: it only behaves differently in how it updates host states, so it should be implemented as a new kind of host manager instead.

What concerns me is that the resource-provider scheduler is going to change the architecture of the filter scheduler in Jay Pipes' bp [3]. There will be no host manager, and not even host state caches, in the future. So what I've done to keep compatibility now will become an incompatibility then.

[1] https://review.openstack.org/#/c/280047/2/nova/scheduler/shared_state_manager.py L55
[2] https://review.openstack.org/#/c/280047/2/setup.cfg L194
[3] https://review.openstack.org/#/c/271823

> So a big question for me is, does the new scheduler interface work if you look at
> slotting in your prototype scheduler?
>
> Specifically I am thinking about this interface:
> https://github.com/openstack/nova/blob/master/nova/scheduler/client/__init__.py

> >> There are several problems in this update model, proven in experiments[3]:
> >>
> >> 1. Performance: The scheduler performance is largely affected by db access in
> >> retrieving compute node records. The db block time of a single request is 355ms
> >> on average in a deployment of 3 compute nodes, compared with only 3ms for the
> >> in-memory decision-making. Imagine there could be 1k nodes, or even 10k nodes,
> >> in the future.
> >>
> >> 2. Race conditions: This is not only a parallel-scheduler problem, but also a
> >> problem when using only one scheduler. The detailed analysis of the one-scheduler
> >> problem is located in the bug analysis[2]. In short, there is a gap between when
> >> the scheduler makes a decision against its host state cache and when the compute
> >> node updates its in-db resource record according to that decision in the resource
> >> tracker. A recent scheduler resource consumption in the cache can be lost and
> >> overwritten by compute node data because of it, resulting in cache inconsistency
> >> and unexpected retries. In a one-scheduler experiment using a 3-node deployment,
> >> there were 7 retries out of 31 concurrent schedule requests recorded, resulting
> >> in 22.6% extra performance overhead.
> >>
> >> 3. Parallel scheduler support: The design of the filter scheduler leads to an
> >> "even worse" performance result when using parallel schedulers. In the same
> >> experiment with 4 schedulers on separate machines, the average db block time
> >> increased to 697ms per request and there were 16 retries out of 31 schedule
> >> requests, namely 51.6% extra overhead.
> >
> > This mostly agrees with recent tests I've been doing simulating 1000
> > compute nodes with the fake virt driver.
>
> Overall this agrees with what I saw in production before moving us to the
> caching scheduler driver.
>
> I would love a nova functional test that does that test. It will help us compare
> these different schedulers and find the strengths and weaknesses.
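Coming back to the stevedore point above: here is a minimal, hypothetical sketch of how such a host manager could be resolved from an entry point. The namespace "nova.scheduler.host_manager" and the entry-point name "shared_host_manager" are assumptions for illustration; the actual names are whatever the setup.cfg change in [2] registers.

    # Hypothetical sketch: resolving a scheduler host manager via stevedore.
    # The namespace and entry-point name below are illustrative assumptions,
    # not necessarily what the prototype registers in setup.cfg.
    from stevedore import driver

    def load_host_manager(name="shared_host_manager"):
        # stevedore resolves the name against entry points registered in
        # setup.cfg, e.g.:
        #   nova.scheduler.host_manager =
        #       shared_host_manager = nova.scheduler.shared_state_manager:SharedHostManager
        mgr = driver.DriverManager(
            namespace="nova.scheduler.host_manager",
            name=name,
            invoke_on_load=True,
        )
        return mgr.driver

With that wiring, switching between HostManager and SharedHostManager is just a matter of which entry-point name each scheduler is configured to load, which is why both kinds of schedulers can coexist in one deployment.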
I'm also working on functional tests for the nova scheduler; there is a patch showing my latest progress: https://review.openstack.org/#/c/281825/

IMO scheduler functional tests are not a good way to measure the real performance of different schedulers, because all of the services run as green threads instead of real processes. I think the better way to analyze real performance and the strengths and weaknesses is to start the services in separate processes with the fake virt driver (i.e. Clint Byrum's work), or to use Jay Pipes' work on emulating the different designs.

> >> 2. Since the scheduler claims are tracked and the "false overwrite" is eliminated,
> >> there should be 0 retries in a one-scheduler deployment, as proven in the
> >> experiment. Thanks to the quick claim responding implementation, there are only
> >> 2 retries out of 31 requests in the 4-scheduler experiment.
> >
> > This is a real win. I've seen 3 schedulers get so overwhelmed with
> > retries that they go slower than 1.
>
> When I was looking at this, I certainly saw DB queries dominating, and races
> between the decisions made in the scheduler. The short term hack I came up with
> was this approach:
> https://github.com/openstack/nova/blob/master/nova/scheduler/caching_scheduler.py
>
> How it works is very hidden, but it relies on this line of code (the thing we added
> for when you build multiple instances in a single request), so that the in-memory
> cache of the host state is shared between requests into the scheduler:
> https://github.com/openstack/nova/blob/392f1aca1ba1773ec87b5cf913a7fe940190f916/nova/scheduler/filter_scheduler.py#L136
>
> The case I was having issues with was a burst of around 1000 build requests
> within a one minute timeframe. Running a single caching scheduler was able to
> deal with that load way better than the existing scheduler.
>
> Now if you run multiple of these caching schedulers, particularly if you want a
> fill-first strategy, it's awful. But running one caching scheduler is proving a lot
> faster than running two or three of the regular filter schedulers (when doing fill
> first), due to the lower time spent doing DB queries and the fewer races between
> scheduler decisions. Now the caching scheduler is a total hack, and only works for
> certain deployment scenarios. It was really added to buy time while we agree how to
> build this proposed DB-based reservation scheduler that works well when running
> multiple schedulers.
>
> I am really interested in how your prototype and the caching scheduler compare?
> It looks like a single scheduler will perform in a very similar way, but multiple
> schedulers are less likely to race each other, although there are quite a few races?

I think the major weakness of the caching scheduler comes from its host-state update model, i.e. updating host states from the db every `CONF.scheduler_driver_task_period` seconds. Compute node resources are changed not only by booting instances, but also by other operations such as instance deletion, resizing, migration, and changes reported by the virt driver. The caching scheduler can only pick up those changes from the db every 60 seconds with the default setting. That is a huge inconsistency window between the scheduler cache and the real compute node resource view. The shared-state scheduler will not have the same issue because it receives updates from the resource trackers immediately.
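To make the contrast with the periodic DB refresh concrete, here is a highly simplified, hypothetical sketch of the push-style update model described above: the compute side publishes a host-state delta as soon as its resources change, and each scheduler applies it to its local cache instead of waiting for the next DB poll. All class and method names here are illustrative; the real prototype goes through the scheduler client and RPC layers rather than a direct callback.

    # Hypothetical sketch of incremental host-state updates pushed from the
    # compute side to a scheduler-side cache. Names are illustrative only.
    import threading

    class HostStateCache(object):
        """Scheduler-side cache of per-host resource views."""

        def __init__(self):
            self._lock = threading.Lock()
            self._hosts = {}  # host name -> dict of resource counters

        def apply_update(self, host, delta):
            # Apply an incremental change pushed by a resource tracker,
            # e.g. {"free_ram_mb": -2048, "free_disk_gb": -20, "vcpus_used": 1}.
            with self._lock:
                state = self._hosts.setdefault(host, {})
                for field, change in delta.items():
                    state[field] = state.get(field, 0) + change

    class ResourceTrackerStub(object):
        """Compute-side stand-in that pushes deltas instead of relying on
        schedulers polling the compute_nodes table."""

        def __init__(self, host, notify):
            self.host = host
            self._notify = notify  # e.g. an RPC fanout to all schedulers

        def instance_claimed(self, flavor):
            delta = {
                "free_ram_mb": -flavor["ram_mb"],
                "free_disk_gb": -flavor["disk_gb"],
                "vcpus_used": flavor["vcpus"],
            }
            # The DB write for persistence would still happen here; the point
            # is that schedulers see the change without waiting for a periodic
            # refresh such as scheduler_driver_task_period.
            self._notify(self.host, delta)

    cache = HostStateCache()
    rt = ResourceTrackerStub("node-1", cache.apply_update)
    rt.instance_claimed({"ram_mb": 2048, "disk_gb": 20, "vcpus": 1})

The same path also covers deletes, resizes and migrations, which is exactly the traffic the caching scheduler's 60-second poll misses in between refreshes.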
> Thanks,
> John

On 15 February 2016 at 22:02, Ed Leafe wrote:
> On 02/15/2016 03:27 AM, Sylvain Bauza wrote:
> > - can we have the feature optional for operators
>
> One thing that concerns me is the lesson learned from simply having a compute
> node's instance information sent and persisted in memory. That was resisted by
> several large operators, due to overhead. This proposal will have to store that
> and more in memory.

Sorry, I missed this part of the thread. This proposal actually won't store more in memory than the legacy filter scheduler does:

1. The size of the host state cache on the scheduler side is unchanged in the new design.

2. On the compute node side there is a copy of the host state cache, because I didn't want to make too many changes in the resource tracker, for the clarity of the design. As a next step (or in another bp), if the first bp is accepted, I'll focus on generating resource updates directly in the resource tracker and turning the compute node's host state cache into a proxy representation of `objects.ComputeNode`, so that there are no extra copies of the host state on the compute node side.

Regards,
-Yingxin