Yingxin,

This looks quite similar to the work of this bp:
https://blueprints.launchpad.net/nova/+spec/no-db-scheduler
It's really nice that somebody is still trying to push scheduler refactoring in this way. Thanks.

Best regards,
Boris Pavlovic

On Sun, Feb 14, 2016 at 9:21 PM, Cheng, Yingxin <yingxin.ch...@intel.com> wrote:

> Hi,

> I've uploaded a prototype, https://review.openstack.org/#/c/280047/, to verify its design goals of improved accuracy, performance, reliability and compatibility. It will also be an Austin Summit session if elected: https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316

> I want to gather opinions about this idea:

> 1. Is this feature possible to be accepted in the Newton release?

> 2. Suggestions to improve its design and compatibility.

> 3. Possibilities to integrate with the resource-provider bp series: I know resource-provider is the major direction of the Nova scheduler, and there will be fundamental changes in the future, especially according to the bp https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst. However, this prototype proposes a much faster and compatible way to make scheduling decisions based on scheduler caches. The in-memory decisions are made at the same speed as in the caching scheduler, but the caches are kept consistent with the compute nodes as quickly as possible without db refreshing.

> Here is the detailed design of the mentioned prototype:

> >>----------------------------

> Background:

> The host state cache maintained by the host manager is the scheduler's resource view during decision making. It is updated whenever a request is received[1], and all compute node records are retrieved from the db every time. There are several problems with this update model, proven in experiments[3]:

> 1. Performance: Scheduler performance is largely affected by the db access needed to retrieve compute node records. The db block time of a single request averages 355ms in a deployment of 3 compute nodes, compared with only 3ms for the in-memory decision making itself. Imagine there could be as many as 1k nodes, or even 10k nodes in the future.

> 2. Race conditions: This is not only a parallel-scheduler problem; it also occurs with a single scheduler. A detailed analysis of the one-scheduler case is in the bug analysis[2]. In short, there is a gap between the moment the scheduler makes a decision against its host state cache and the moment the compute node updates its in-db resource record according to that decision in the resource tracker. Because of this gap, a recent resource consumption in the scheduler cache can be lost and overwritten by compute node data, resulting in cache inconsistency and unexpected retries. In a one-scheduler experiment using a 3-node deployment, there were 7 retries out of 31 concurrent schedule requests, i.e. 22.6% extra performance overhead.

> 3. Parallel scheduler support: The design of the filter scheduler leads to even worse performance with parallel schedulers. In the same experiment with 4 schedulers on separate machines, the average db block time increased to 697ms per request and there were 16 retries out of 31 schedule requests, namely 51.6% extra overhead.

> Improvements:

> This prototype solves the issues above by implementing a new update model for the scheduler host state cache. Instead of refreshing caches from the db, every compute node maintains its own accurate copy of the host state, updated by the resource tracker, and sends incremental updates directly to the schedulers. So the scheduler caches are synchronized to the correct state as soon as possible with the lowest overhead. The scheduler also sends a resource claim along with its decision to the target compute node. The compute node can decide immediately, from its local host state, whether the claim is successful and send a response back ASAP. With all claims tracked from schedulers to compute nodes, no false overwrites can happen, and thus the gap between the scheduler caches and the real compute node states is minimized.
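To make the claim flow above concrete, here is a rough sketch of the compute-node side of this model. All class, method and field names below are illustrative assumptions, not the prototype's actual code; the real implementation is in the review linked above.

    # Illustrative sketch only: names and signatures are assumptions,
    # not the prototype's actual interfaces.
    import itertools


    class ComputeHostStateCache(object):
        """Per-compute-node copy of the host state, owned by the compute side."""

        def __init__(self, host, rpcapi, schedulers):
            self.host = host
            self.rpcapi = rpcapi            # messaging client, casts only
            self.schedulers = schedulers    # currently known scheduler services
            self.free = {'ram_mb': 0, 'disk_gb': 0, 'vcpus': 0}
            self._seed = itertools.count()  # incremental seed attached to updates

        def _publish(self, delta):
            """Cast an incremental update of the free resources to every scheduler."""
            update = {'host': self.host, 'seed': next(self._seed), 'delta': delta}
            for scheduler in self.schedulers:
                # cast, never call: schedulers must not block on compute nodes
                self.rpcapi.send_incremental_update(scheduler, update)

        def handle_claim(self, claim, scheduler):
            """Accept or reject a scheduler's claim using only the local cache."""
            requested = claim['requested']
            ok = all(self.free[k] >= v for k, v in requested.items())
            if ok:
                for k, v in requested.items():
                    self.free[k] -= v
                # negative delta: the free resources shrank by the claimed amounts
                self._publish({k: -v for k, v in requested.items()})
            # respond immediately; the authoritative db write still happens
            # later in the resource tracker
            self.rpcapi.send_claim_response(scheduler, claim['id'], success=ok)

The important point is that both directions are one-way casts: a scheduler applies incremental updates as they arrive, and a compute node answers a claim from its local cache without touching the database.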
> The benefits are obvious in the recorded experiments[3], compared with the caching scheduler and the filter scheduler:

> 1. There is no db block time during scheduler decision making; the average decision time per request is about 3ms in both the single- and multiple-scheduler scenarios, which is equal to the in-memory decision time of the filter scheduler and the caching scheduler.

> 2. Since the scheduler claims are tracked and the "false overwrite" is eliminated, there should be 0 retries in a one-scheduler deployment, as proven in the experiment. Thanks to the quick claim-response implementation, there are only 2 retries out of 31 requests in the 4-scheduler experiment.

> 3. All the filtering and weighing algorithms are compatible because the data structure of HostState is unchanged. In fact, this prototype even supports running the filter scheduler at the same time (already tested). Other operations with resource changes, such as migration, resizing or shelving, make claims in the resource tracker directly and update the compute node host state immediately without major changes.

> Extra features:

> More effort was made to adapt the implementation to real-world scenarios, such as network issues, services going down unexpectedly, and overwhelming message volume:

> 1. The communication between schedulers and compute nodes consists only of casts; there are no RPC calls and thus no blocking during scheduling.

> 2. All updates from nodes to schedulers are labelled with an incremental seed, so any message reordering, loss or duplication caused by network issues can be detected by MessageWindow immediately, and the inconsistent cache can be detected and refreshed correctly (see the sketch after this list).

> 3. Overwhelming messages are compressed by MessagePipe in its async mode. There is no need to send all the messages one by one through the MQ; they can be merged before being sent to the schedulers.

> 4. When a new service comes up or recovers, it sends notifications to all known remotes for quick cache synchronization, even before its service record is available in the db. And if a remote service is unexpectedly down according to the service group records, no more messages are sent to it. The ComputeFilter is also removed because of this feature: the scheduler can detect remote compute nodes by itself.

> 5. In fact, the claim tracking is not only from the schedulers to the compute nodes, but also from the compute-node host state to the resource tracker. One reason is that there is still a gap between a claim being acknowledged by the compute-node host state and that claim succeeding in the resource tracker; it is necessary to track those unhandled claims to keep the host state accurate. The second reason is to decouple the schedulers from the compute node and the resource tracker. The scheduler only exports the limited interfaces `update_from_compute` and `handle_rt_claim_failure` to the compute service and the RT, so testing and reuse are easier with clear boundaries.
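A minimal sketch of the seed check mentioned in item 2 above; this is an assumption of how such a window could behave (including the exception name), not the prototype's actual MessageWindow:

    class CacheOutOfSync(Exception):
        """Raised when a gap in the seed sequence is detected."""
        def __init__(self, expected, got):
            msg = 'missed updates: expected seed %d, got %d' % (expected, got)
            super(CacheOutOfSync, self).__init__(msg)


    class MessageWindow(object):
        """Tracks the update seeds received from one compute node."""

        def __init__(self):
            self.expected = 0   # next seed expected from this node

        def check(self, seed):
            """Return True if the update can be applied in order.

            Duplicates and stale reordered messages are silently dropped;
            a gap raises CacheOutOfSync so the caller can trigger a full
            cache refresh instead of applying inconsistent increments.
            """
            if seed == self.expected:
                self.expected += 1
                return True
            if seed < self.expected:
                return False
            raise CacheOutOfSync(expected=self.expected, got=seed)

On a detected gap the scheduler would stop trusting further increments from that node and request a full refresh, which matches the "detected and refreshed correctly" behaviour described in item 2.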
> TODOs:

> There are still many features to be implemented, the most important being unit tests and incremental updates to PCI and NUMA resources; all of them are marked out inline.

> References:
> [1] https://github.com/openstack/nova/blob/master/nova/scheduler/filter_scheduler.py#L104
> [2] https://bugs.launchpad.net/nova/+bug/1341420/comments/24
> [3] http://paste.openstack.org/show/486929/

> ----------------------------<<

> The original commit history of this prototype is located at https://github.com/cyx1231st/nova/commits/shared-scheduler

> For instructions to install and test this prototype, please refer to the commit message of https://review.openstack.org/#/c/280047/

> Regards,
> -Yingxin
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev