Yingxin,

This looks quite similar to the work of this bp:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
It's really nice that somebody is still trying to push scheduler refactoring in this way. Thanks.

Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin <yingxin.ch...@intel.com> wrote:

> Hi,

> I've uploaded a prototype, https://review.openstack.org/#/c/280047/, to verify its design goals of improved accuracy, performance, reliability and compatibility. It will also be an Austin Summit session if elected: https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

> I want to gather opinions about this idea:

> 1. Is this feature possible to be accepted in the Newton release?

> 2. Suggestions to improve its design and compatibility.

> 3. Possibilities to integrate with the resource-provider bp series: I know resource-provider is the major direction of the Nova scheduler, and there will be fundamental changes in the future, especially according to the bp https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst. However, this prototype proposes a much faster and compatible way to make scheduling decisions based on scheduler caches. The in-memory decisions are made at the same speed as in the caching scheduler, but the caches are kept consistent with the compute nodes as quickly as possible without db refreshing.

> Here is the detailed design of the mentioned prototype:

> >>----------------------------

> Background:

> The host state cache maintained by the host manager is the scheduler's resource view during decision making. It is updated whenever a request is received[1], and all compute node records are retrieved from the db every time. There are several problems with this update model, proven in experiments[3]:

> 1. Performance: Scheduler performance is largely affected by the db access needed to retrieve compute node records. The db block time of a single request averages 355ms in a deployment of 3 compute nodes, compared with only 3ms for the in-memory decision making itself. Imagine there could be as many as 1k nodes, or even 10k nodes in the future.

> 2. Race conditions: This is not only a parallel-scheduler problem; it also occurs with a single scheduler. A detailed analysis of the one-scheduler case is in the bug analysis[2]. In short, there is a gap between the moment the scheduler makes a decision against its host state cache and the moment the compute node updates its in-db resource record according to that decision in the resource tracker. Because of this gap, a recent resource consumption in the scheduler cache can be lost and overwritten by compute node data, resulting in cache inconsistency and unexpected retries. In a one-scheduler experiment using a 3-node deployment, there were 7 retries out of 31 concurrent schedule requests, i.e. 22.6% extra performance overhead.

> 3. Parallel scheduler support: The design of the filter scheduler leads to even worse performance with parallel schedulers. In the same experiment with 4 schedulers on separate machines, the average db block time increased to 697ms per request and there were 16 retries out of 31 schedule requests, namely 51.6% extra overhead.

> Improvements:

> This prototype solves the issues above by implementing a new update model for the scheduler host state cache. Instead of refreshing caches from the db, every compute node maintains its own accurate copy of the host state, updated by the resource tracker, and sends incremental updates directly to the schedulers. So the scheduler caches are synchronized to the correct state as soon as possible with the lowest overhead. The scheduler also sends a resource claim along with its decision to the target compute node. The compute node can decide immediately, from its local host state, whether the claim is successful and send a response back ASAP. With all claims tracked from schedulers to compute nodes, no false overwrites can happen, and thus the gap between the scheduler caches and the real compute node states is minimized.
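To make the claim flow above concrete, here is a rough sketch of the compute-node side of this model. All class, method and field names below are illustrative assumptions, not the prototype's actual code; the real implementation is in the review linked above.

    # Illustrative sketch only: names and signatures are assumptions,
    # not the prototype's actual interfaces.
    import itertools


    class ComputeHostStateCache(object):
        """Per-compute-node copy of the host state, owned by the compute side."""

        def __init__(self, host, rpcapi, schedulers):
            self.host = host
            self.rpcapi = rpcapi            # messaging client, casts only
            self.schedulers = schedulers    # currently known scheduler services
            self.free = {'ram_mb': 0, 'disk_gb': 0, 'vcpus': 0}
            self._seed = itertools.count()  # incremental seed attached to updates

        def _publish(self, delta):
            """Cast an incremental update of the free resources to every scheduler."""
            update = {'host': self.host, 'seed': next(self._seed), 'delta': delta}
            for scheduler in self.schedulers:
                # cast, never call: schedulers must not block on compute nodes
                self.rpcapi.send_incremental_update(scheduler, update)

        def handle_claim(self, claim, scheduler):
            """Accept or reject a scheduler's claim using only the local cache."""
            requested = claim['requested']
            ok = all(self.free[k] >= v for k, v in requested.items())
            if ok:
                for k, v in requested.items():
                    self.free[k] -= v
                # negative delta: the free resources shrank by the claimed amounts
                self._publish({k: -v for k, v in requested.items()})
            # respond immediately; the authoritative db write still happens
            # later in the resource tracker
            self.rpcapi.send_claim_response(scheduler, claim['id'], success=ok)

The important point is that both directions are one-way casts: a scheduler applies incremental updates as they arrive, and a compute node answers a claim from its local cache without touching the database.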
> The benefits are obvious in the recorded experiments[3], compared with the caching scheduler and the filter scheduler:

> 1. There is no db block time during scheduler decision making; the average decision time per request is about 3ms in both the single- and multiple-scheduler scenarios, which is equal to the in-memory decision time of the filter scheduler and the caching scheduler.

> 2. Since the scheduler claims are tracked and the "false overwrite" is eliminated, there should be 0 retries in a one-scheduler deployment, as proven in the experiment. Thanks to the quick claim-response implementation, there are only 2 retries out of 31 requests in the 4-scheduler experiment.

> 3. All the filtering and weighing algorithms are compatible because the data structure of HostState is unchanged. In fact, this prototype even supports running the filter scheduler at the same time (already tested). Other operations with resource changes, such as migration, resizing or shelving, make claims in the resource tracker directly and update the compute node host state immediately without major changes.

> Extra features:

> More effort was made to adapt the implementation to real-world scenarios, such as network issues, services going down unexpectedly, and overwhelming message volume:

> 1. The communication between schedulers and compute nodes consists only of casts; there are no RPC calls and thus no blocking during scheduling.

> 2. All updates from nodes to schedulers are labelled with an incremental seed, so any message reordering, loss or duplication caused by network issues can be detected by MessageWindow immediately, and the inconsistent cache can be detected and refreshed correctly (see the sketch after this list).

> 3. Overwhelming messages are compressed by MessagePipe in its async mode. There is no need to send all the messages one by one through the MQ; they can be merged before being sent to the schedulers.

> 4. When a new service comes up or recovers, it sends notifications to all known remotes for quick cache synchronization, even before its service record is available in the db. And if a remote service is unexpectedly down according to the service group records, no more messages are sent to it. The ComputeFilter is also removed because of this feature: the scheduler can detect remote compute nodes by itself.

> 5. In fact, the claim tracking is not only from the schedulers to the compute nodes, but also from the compute-node host state to the resource tracker. One reason is that there is still a gap between a claim being acknowledged by the compute-node host state and that claim succeeding in the resource tracker; it is necessary to track those unhandled claims to keep the host state accurate. The second reason is to decouple the schedulers from the compute node and the resource tracker. The scheduler only exports the limited interfaces `update_from_compute` and `handle_rt_claim_failure` to the compute service and the RT, so testing and reuse are easier with clear boundaries.
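A minimal sketch of the seed check mentioned in item 2 above; this is an assumption of how such a window could behave (including the exception name), not the prototype's actual MessageWindow:

    class CacheOutOfSync(Exception):
        """Raised when a gap in the seed sequence is detected."""
        def __init__(self, expected, got):
            msg = 'missed updates: expected seed %d, got %d' % (expected, got)
            super(CacheOutOfSync, self).__init__(msg)


    class MessageWindow(object):
        """Tracks the update seeds received from one compute node."""

        def __init__(self):
            self.expected = 0   # next seed expected from this node

        def check(self, seed):
            """Return True if the update can be applied in order.

            Duplicates and stale reordered messages are silently dropped;
            a gap raises CacheOutOfSync so the caller can trigger a full
            cache refresh instead of applying inconsistent increments.
            """
            if seed == self.expected:
                self.expected += 1
                return True
            if seed < self.expected:
                return False
            raise CacheOutOfSync(expected=self.expected, got=seed)

On a detected gap the scheduler would stop trusting further increments from that node and request a full refresh, which matches the "detected and refreshed correctly" behaviour described in item 2.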
> TODOs:

> There are still many features to be implemented, the most important being unit tests and incremental updates to PCI and NUMA resources; all of them are marked out inline.

> References:
> [1] https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
> [2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
> [3] http://paste.openstack.org/show/486929/

> ----------------------------<<

> The original commit history of this prototype is located at https://github.com/cyx1231st/nova/commits/shared-scheduler

> For instructions to install and test this prototype, please refer to the commit message of https://review.openstack.org/#/c/280047/

> Regards,
> -Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev