On 17 February 2016 at 17:52, Clint Byrum <cl...@fewbar.com> wrote:
> Excerpts from Cheng, Yingxin's message of 2016-02-14 21:21:28 -0800:
>> Hi,
>>
>> I've uploaded a prototype https://review.openstack.org/#/c/280047/ to
>> demonstrate its design goals of improved accuracy, performance,
>> reliability and compatibility. It will also be an Austin Summit session
>> if elected:
>> https://www.openstack.org/summit/austin-2016/vote-for-speakers/Presentation/7316
Long term, I see a world where there are multiple schedulers Nova is able
to use, depending on the deployment scenario. We have tried to stop any
more schedulers going in tree (like the solver scheduler) while we get the
interface between nova-scheduler and the rest of Nova straightened out, to
make that much easier.

So a big question for me is: does the new scheduler interface work if you
look at slotting in your prototype scheduler? Specifically I am thinking
about this interface:
https://github.com/openstack/nova/blob/master/nova/scheduler/client/__init__.py

>> I want to gather opinions about this idea:
>> 1. Is this feature possible to be accepted in the Newton release?
>> 2. Suggestions to improve its design and compatibility.
>> 3. Possibilities to integrate with the resource-provider bp series: I
>> know resource-provider is the major direction of the Nova scheduler, and
>> there will be fundamental changes in the future, especially according to
>> the bp
>> https://review.openstack.org/#/c/271823/1/specs/mitaka/approved/resource-providers-scheduler.rst.
>> However, this prototype proposes a much faster and compatible way to
>> make scheduling decisions based on scheduler caches. The in-memory
>> decisions are made at the same speed as those of the caching scheduler,
>> but the caches are kept consistent with the compute nodes as quickly as
>> possible without db refreshing.
>>
>> Here is the detailed design of the mentioned prototype:
>>
>> ----------------------------
>> Background:
>> The host state cache maintained by the host manager is the scheduler's
>> resource view during schedule decision making. It is updated whenever a
>> request is received[1], and all the compute node records are retrieved
>> from the db every time. There are several problems in this update model,
>> proven in experiments[3]:
>> 1. Performance: The scheduler performance is largely affected by db
>> access in retrieving compute node records.
>> The average db blocking time of a single request is 355ms in a
>> deployment of 3 compute nodes, compared with only 3ms for the in-memory
>> decision-making. Imagine there could be as many as 1k nodes, or even
>> 10k nodes in the future.
>> 2. Race conditions: This is not only a problem with parallel
>> schedulers, but also with a single scheduler. A detailed analysis of
>> the one-scheduler problem is in the bug analysis[2]. In short, there is
>> a gap between the moment the scheduler makes a decision in the host
>> state cache and the moment the compute node updates its in-db resource
>> record according to that decision in the resource tracker. Because of
>> this gap, a recent scheduler resource consumption in the cache can be
>> lost and overwritten by compute node data, resulting in cache
>> inconsistency and unexpected retries. In a one-scheduler experiment
>> using a 3-node deployment, there were 7 retries out of 31 concurrent
>> schedule requests, i.e. 22.6% extra performance overhead.
>> 3. Parallel scheduler support: The design of the filter scheduler leads
>> to an "even worse" performance result with parallel schedulers. In the
>> same experiment with 4 schedulers on separate machines, the average db
>> blocking time increased to 697ms per request and there were 16 retries
>> out of 31 schedule requests, namely 51.6% extra overhead.
>
> This mostly agrees with recent tests I've been doing simulating 1000
> compute nodes with the fake virt driver.

Overall this agrees with what I saw in production before moving us to the
caching scheduler driver.

I would love a nova functional test that does that test. It would help us
compare these different schedulers and find their strengths and
weaknesses.

> My retry rate is much lower, because there's less window for race
> conditions, since there is no latency between nova-compute getting the
> message that the VM is scheduled to it and responding with a host
> update.
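To make the "false overwrite" in point 2 above concrete, here is a toy sketch of the lost-update race between the scheduler's cache and the periodic refresh from the compute node's db record. All names are hypothetical; this is a simplified model, not actual nova code:

```python
# Toy model of the lost-update race: the scheduler consumes resources in
# its in-memory cache, but a refresh from the (stale) DB record blindly
# overwrites that consumption before the compute node has claimed it.

class HostState:
    def __init__(self, free_ram_mb):
        self.free_ram_mb = free_ram_mb

def schedule(cache, ram_mb):
    """Scheduler consumes resources in its in-memory host state cache."""
    cache.free_ram_mb -= ram_mb

def refresh_from_db(cache, db_free_ram_mb):
    """Cache refresh overwrites the cache with whatever the DB says."""
    cache.free_ram_mb = db_free_ram_mb

cache = HostState(free_ram_mb=4096)
db_record = 4096           # the compute node has not yet written its claim

schedule(cache, 1024)      # scheduler places a 1 GB instance: cache says 3072
refresh_from_db(cache, db_record)  # stale DB data arrives first

# The scheduler's recent consumption is lost: the cache is back to 4096
# instead of the correct 3072, so a parallel request can be sent to RAM
# that is already spoken for, leading to a claim failure and a retry.
print(cache.free_ram_mb)   # 4096
```

The shared-state prototype closes this window by making the compute node the authoritative source of its own host state and tracking claims explicitly, so a refresh can never silently discard a decision.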
> Note that your database latency numbers seem much higher; we see about
> 200ms, and I wonder if you are running in a very resource-constrained
> database instance.

Just to double check: are you using pymysql rather than MySQL-python as
the sqlalchemy backend? If you use a driver that doesn't work well with
eventlet, things can get very bad, very quickly, particularly because of
the way the scheduling code hands back the results of the DB call. You
can get some benefit by shrinking the db and greenlet pools to reduce the
concurrency.

>> Improvements:
>> This prototype solves the issues mentioned above by implementing a new
>> update model for the scheduler host state cache. Instead of refreshing
>> caches from the db, every compute node maintains its own accurate
>> version of the host state cache, updated by the resource tracker, and
>> sends incremental updates directly to the schedulers, so the scheduler
>> caches are synchronized to the correct state as soon as possible with
>> the lowest overhead. The scheduler also sends a resource claim along
>> with its decision to the target compute node. The compute node can
>> decide immediately from its local host state cache whether the claim
>> succeeds, and sends a response back ASAP. With all claims tracked from
>> schedulers to compute nodes, no false overwrites can happen, and thus
>> the gaps between the scheduler caches and the real compute node states
>> are minimized. The benefits are obvious in the recorded experiments[3]
>> compared with the caching scheduler and the filter scheduler:
>
> You don't mention this, but I'm assuming this is true: at startup, a new
> shared-state scheduler fills its host state cache from the database.
>
>> 1. There is no db blocking time during scheduler decision making; the
>> average decision time per request is about 3ms in both single- and
>> multiple-scheduler scenarios, which is equal to the in-memory decision
>> time of the filter scheduler and the caching scheduler.
>> 2.
>> Since the scheduler claims are tracked and the "false overwrite" is
>> eliminated, there should be 0 retries in a one-scheduler deployment, as
>> proven in the experiment. Thanks to the quick claim-response
>> implementation, there are only 2 retries out of 31 requests in the
>> 4-scheduler experiment.
>
> This is a real win. I've seen 3 schedulers get so overwhelmed with
> retries that they go slower than 1.

When I was looking at this, I certainly saw DB queries dominating, and
races between the decisions made in the schedulers. The short-term hack I
came up with was this approach:
https://github.com/openstack/nova/blob/master/nova/scheduler/caching_scheduler.py

How it works is quite hidden: it relies on this line of code (the thing
we added for building multiple instances in a single request), so that
the in-memory cache of the host state is shared between requests into the
scheduler:
https://github.com/openstack/nova/blob/392f1aca1ba1773ec87b5cf913a7fe940190f916/nova/scheduler/filter_scheduler.py#L136

The case I was having issues with was a burst of around 1000 build
requests within a one-minute timeframe. A single caching scheduler was
able to deal with that load much better than the existing scheduler. If
you run multiple of these caching schedulers, particularly if you want a
fill-first strategy, it's awful. But running one caching scheduler is
proving a lot faster than running two or three of the regular filter
schedulers (when doing fill-first), due to less time spent doing DB
queries and fewer races between scheduler decisions.

Now the caching scheduler is a total hack, and it only works for certain
deployment scenarios. It was really added to buy time while we agree on
how to build the proposed DB-based reservation scheduler so that it works
well when running multiple schedulers.

I am really interested in how your prototype and the caching scheduler
compare.
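A rough sketch of the idea the caching scheduler relies on: host state objects are shared across requests, and each placement decision consumes resources from the shared object, so later requests see earlier decisions immediately without a db refresh. This is hypothetical simplified code, not nova's actual HostState or filter/weigher machinery:

```python
# Sketch of sharing one in-memory host state cache between scheduling
# requests. All names are illustrative, not actual nova classes.

class HostState:
    def __init__(self, name, free_ram_mb):
        self.name = name
        self.free_ram_mb = free_ram_mb

    def consume(self, ram_mb):
        # The key trick: mutate the shared object so the next request
        # sees this placement before the compute node writes to the DB.
        self.free_ram_mb -= ram_mb

# One cache, reused for every request instead of rebuilt from the DB.
shared_cache = {
    "node1": HostState("node1", 4096),
    "node2": HostState("node2", 8192),
}

def select_host(ram_mb):
    """Pick the host with the most free RAM (a spread-first weigher),
    then consume from the shared cache."""
    candidates = [h for h in shared_cache.values() if h.free_ram_mb >= ram_mb]
    best = max(candidates, key=lambda h: h.free_ram_mb)
    best.consume(ram_mb)
    return best.name

# Two back-to-back 3 GB requests: the second sees the first one's
# consumption (node2: 8192 -> 5120 -> 2048) with no DB round trip.
first = select_host(3072)
second = select_host(3072)
```

This only helps within a single scheduler process, which is exactly why running several caching schedulers (especially fill-first) races so badly: each process mutates its own private copy of the cache.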
It looks like a single scheduler will perform in a very similar way, but
multiple schedulers are less likely to race each other, although there
are still quite a few races?

Thanks,
John

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev