On 15/02/2016 10:48, Cheng, Yingxin wrote:
Thanks Sylvain,
1. The below ideas will be extended to a spec ASAP.
Nice, looking forward to it then :-)
2. Thanks for raising concerns I had not thought of yet; they will be
addressed in the spec soon.
3. Let me copy my thoughts from another thread about the integration
with resource-provider:
The idea is that “only the compute node knows its own final compute-node
resource view”, or “the accurate resource view only exists at the place
where it is actually consumed.” I.e., the incremental updates can only
come from the actual “consumption” action, no matter where it happens
(e.g. compute node, storage service, network service, etc.). Borrowing
the terms from resource-provider, a compute node can maintain its own
accurate version of the “compute-node-inventory” cache and send
incremental updates, because it is what actually consumes compute
resources; likewise, a storage service can maintain an accurate version
of a “storage-inventory” cache and send incremental updates if it is
what consumes storage resources. If there are central services in
charge of consuming all the resources, the accurate cache and updates
must come from them.
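To make that idea concrete, here is a minimal sketch (all names are
hypothetical, not actual Nova or resource-provider code) of a consuming
service that owns its inventory cache and publishes only incremental
updates:

class ComputeNodeInventory(object):
    """Authoritative per-node inventory, owned by the consuming service."""

    def __init__(self, total_vcpus, total_ram_mb):
        self.free_vcpus = total_vcpus
        self.free_ram_mb = total_ram_mb
        self.version = 0  # bumped on every local consumption

    def consume(self, vcpus, ram_mb):
        """Apply a local consumption and return the incremental update."""
        self.free_vcpus -= vcpus
        self.free_ram_mb -= ram_mb
        self.version += 1
        # Only the delta and the new version travel to interested
        # services (e.g. schedulers), never the full record.
        return {'version': self.version,
                'vcpus_delta': -vcpus,
                'ram_mb_delta': -ram_mb}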
That is one of the things I'd like to see in your spec, and how you
could interact with the new model.
Thanks,
-Sylvain
Regards,
-Yingxin
*From:* Sylvain Bauza [mailto:sba...@redhat.com]
*Sent:* Monday, February 15, 2016 5:28 PM
*To:* OpenStack Development Mailing List (not for usage questions)
<openstack-dev@lists.openstack.org>
*Subject:* Re: [openstack-dev] [nova] A prototype implementation
towards the "shared state scheduler"
On 15/02/2016 06:21, Cheng, Yingxin wrote:
Hi,
I’ve uploaded a prototype https://review.openstack.org/#/c/280047/
to demonstrate its design goals of improved accuracy, performance,
reliability and compatibility. It will also be an Austin Summit
session if selected:
https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
I want to gather opinions about this idea:
1. Could this feature be accepted in the Newton release?
Such a feature requires a spec file to be written:
http://docs.openstack.org/developer/nova/process.html#how-do-i-get-my-code-merged
Ideally, I'd like to see your below ideas written in that spec file so
it would be the best way to discuss on the design.
2. Suggestions to improve its design and compatibility.
I don't want to go into details here (that's rather the goal of the
spec), but my biggest concerns when reviewing the spec would be:
- how this can meet the OpenStack mission statement (ie. ubiquitous
solution that would be easy to install and massively scalable)
- how this can be integrated with the existing (filters, weighers) to
provide a clean and simple path for operators to upgrade
- how this can support rolling upgrades (old computes sending
updates to new schedulers)
- how can we test it
- can the feature be made optional for operators
3. Possibilities to integrate with the resource-provider bp series: I
know resource-provider is the major direction of the Nova scheduler,
and there will be fundamental changes in the future, especially
according to the bp
https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
However, this prototype proposes a much faster and compatible way
to make scheduling decisions based on scheduler caches. The
in-memory decisions are made at the same speed as in the caching
scheduler, but the caches are kept consistent with compute nodes
as quickly as possible without db refreshing.
That's the key point, thanks for noticing our priorities. So, you know
that our resource modeling is drastically subject to change in Mitaka
and Newton. That is the new game, so I'd love to see how you plan to
interact with that.
Ideally, I'd appreciate it if Jay Pipes, Chris Dent and you could share
your ideas, because all of you have great ideas for improving a
currently frustrating solution.
-Sylvain
Here is the detailed design of the mentioned prototype:
>>----------------------------
Background:
The host state cache maintained by the host manager is the scheduler's
resource view during scheduling decision making. It is updated
whenever a request is received[1], and all the compute node
records are retrieved from the db every time. There are several
problems with this update model, as shown by experiments[3]:
1. Performance: The scheduler performance is largely affected by
db access when retrieving compute node records. The db block time of
a single request is 355ms on average in a deployment of 3
compute nodes, compared with only 3ms for in-memory
decision-making. Imagine a deployment of 1k, or even 10k nodes in
the future.
2. Race conditions: This is not only a parallel-scheduler problem,
but also a problem when using only one scheduler. A detailed analysis
of the one-scheduler problem is in the bug analysis[2]. In short,
there is a gap between the moment the scheduler makes a decision in
its host state cache and the moment the compute node updates its
in-db resource record according to that decision in the resource
tracker. Because of this gap, a recent scheduler resource consumption
in the cache can be lost and overwritten by compute node data,
resulting in cache inconsistency and unexpected retries (see the
sketch after this list). In a one-scheduler experiment using a 3-node
deployment, 7 retries out of 31 concurrent schedule requests were
recorded, resulting in 22.6% extra performance overhead.
3. Parallel scheduler support: The design of the filter scheduler
leads to an "even worse" performance result with parallel
schedulers. In the same experiment with 4 schedulers on separate
machines, the average db block time increases to 697ms per
request and there are 16 retries out of 31 schedule requests,
namely 51.6% extra overhead.
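As an illustration of the race in problem 2, here is a minimal,
hypothetical sketch (variable names and numbers are invented, not
actual Nova code) of how a consumption recorded in the scheduler cache
is silently lost when the cache is refreshed from a db record that the
resource tracker has not yet updated:

scheduler_cache = {'free_ram_mb': 2048}    # scheduler's host state cache
compute_db_record = {'free_ram_mb': 2048}  # compute node's record in the db

# t1: the scheduler picks this host and consumes 1024 MB in its cache.
scheduler_cache['free_ram_mb'] -= 1024     # cache now says 1024 MB free

# t2: before the resource tracker has written the claim to the db, the
# scheduler refreshes its cache from the (still stale) db record ...
scheduler_cache['free_ram_mb'] = compute_db_record['free_ram_mb']

# ... and the earlier consumption is silently lost: the cache reports
# 2048 MB free again, so another request can land on this host and will
# later fail in the resource tracker, causing a retry.
assert scheduler_cache['free_ram_mb'] == 2048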
Improvements:
This prototype solves the issues mentioned above by implementing a
new update model for the scheduler host state cache. Instead of
refreshing caches from the db, every compute node maintains its own
accurate version of the host state cache, updated by the resource
tracker, and sends incremental updates directly to schedulers. The
scheduler caches are thus synchronized to the correct state as soon
as possible with the lowest overhead. Also, the scheduler sends a
resource claim along with its decision to the target compute node.
The compute node can decide immediately whether the resource claim is
successful from its local host state cache and send the response back
ASAP (a rough sketch of this claim flow follows the list below). With
all claims tracked from schedulers to compute nodes, no false
overwrites happen, and thus the gap between scheduler caches and real
compute node states is minimized. The benefits are evident in the
recorded experiments[3] compared with the caching scheduler and the
filter scheduler:
1. There is no db block time during scheduler decision making; the
average decision time per request is about 3ms in both single- and
multiple-scheduler scenarios, which is equal to the in-memory
decision time of the filter scheduler and the caching scheduler.
2. Since the scheduler claims are tracked and the "false
overwrite" is eliminated, there should be 0 retries in a
one-scheduler deployment, as proven in the experiment. Thanks to
the quick claim-response implementation, there are only 2
retries out of 31 requests in the 4-scheduler experiment.
3. All the filtering and weighing algorithms are compatible
because the data structure of HostState is unchanged. In fact,
this prototype even supports the filter scheduler running at the
same time (already tested). Other operations with resource changes,
such as migration, resizing or shelving, make claims in the
resource tracker directly and update the compute node host state
immediately, without major changes.
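Here is a rough, runnable sketch of the claim flow described above.
All class and method names are hypothetical and do not match the
prototype's actual interfaces (see
https://review.openstack.org/#/c/280047/ for those), and the
asynchronous casts are replaced by direct calls to keep the sketch
self-contained:

class ComputeSideHostState(object):
    """Authoritative host state, updated by the local resource tracker."""

    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def handle_claim(self, ram_mb):
        # Decide immediately from the local cache; no db round trip.
        if ram_mb <= self.free_ram_mb:
            self.free_ram_mb -= ram_mb   # tracked until the RT confirms it
            return True                  # would cast 'claim succeeded' back
        return False                     # would cast 'claim failed' -> retry


class SchedulerSideCache(object):
    """Scheduler-side, incrementally updated view of one compute node."""

    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

    def claim(self, ram_mb, host):
        self.free_ram_mb -= ram_mb       # consume in the local cache first
        if not host.handle_claim(ram_mb):
            self.free_ram_mb += ram_mb   # roll back and retry elsewhere
            return False
        return True


host = ComputeSideHostState(free_ram_mb=2048)
cache = SchedulerSideCache(free_ram_mb=2048)
assert cache.claim(1024, host)       # fits: both views now agree
assert not cache.claim(4096, host)   # rejected: cache rolled back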
Extra features:
More effort was made to adapt the implementation to real-world
scenarios such as network issues, services unexpectedly going
down, and overwhelming messages:
1. The communication between schedulers and compute nodes consists
only of casts; there are no RPC calls and thus no blocking during
scheduling.
2. All updates from nodes to schedulers are labelled with an
incremental seed, so any message reordering, loss or duplication
due to network issues can be detected by MessageWindow
immediately (sketched after this list). An inconsistent cache can
then be detected and refreshed correctly.
3. Overwhelming messages are compressed by MessagePipe in its
async mode. There is no need to send all the messages one by one
through the MQ; they can be merged before being sent to schedulers.
4. When a new service comes up or recovers, it sends notifications
to all known remotes for quick cache synchronization, even before
the service record is available in the db. And if a remote service
is unexpectedly down according to service group records, no more
messages are sent to it. The ComputeFilter is also removed
because of this feature; the scheduler can detect remote compute
nodes by itself.
5. In fact, claim tracking happens not only from schedulers to
compute nodes, but also from the compute-node host state to the
resource tracker. One reason is that there is still a gap between
the moment a claim is acknowledged by the compute-node host state
and the moment the claim succeeds in the resource tracker; it is
necessary to track those unhandled claims to keep the host state
accurate. The second reason is to separate schedulers from the
compute node and resource trackers: the scheduler only exports the
limited interfaces `update_from_compute` and
`handle_rt_claim_failure` to the compute service and the RT, so
testing and reuse are easier with clear boundaries.
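To illustrate the seed-based tracking from item 2, here is a minimal
sketch (the class below is hypothetical and simplified, not the
prototype's actual MessageWindow implementation):

class CacheOutOfSyncError(Exception):
    """At least one incremental update was lost; a full refresh is needed."""


class MessageWindow(object):
    """Detect lost, duplicated or reordered incremental updates."""

    def __init__(self):
        self.expected_seed = 1

    def accept(self, seed):
        if seed == self.expected_seed:
            self.expected_seed += 1
            return True           # in order: apply the incremental update
        if seed < self.expected_seed:
            return False          # duplicate or reordered: ignore it
        # A gap in the seeds means an update was lost, so the
        # scheduler-side cache is inconsistent and must be refreshed.
        raise CacheOutOfSyncError(seed)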
TODOs:
There are still many features to be implemented; the most
important are unit tests and incremental updates for PCI and NUMA
resources, all of which are marked inline.
References:
[1]
https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
[2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
[3] http://paste.openstack.org/show/486929/
----------------------------<<
The original commit history of this prototype is located at
https://github.com/cyx1231st/nova/commits/shared-scheduler
For instructions on how to install and test this prototype, please refer
to the commit message of https://review.openstack.org/#/c/280047/
Regards,
-Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev