Thank you very much for the in-depth discussion about this topic, @Nikola and @Sylvain.

I agree that we should address the technical debt first, and then make the scheduler better.
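To convince myself about the race you both describe in the quoted thread below, I put together a small toy model. This is plain Python, not Nova code; the class and method names only echo the real ones and the numbers are made up. It shows two scheduler workers reading the same (already stale) host record and both deciding that a 4 GB instance fits on a host with 6 GB free:

    import copy


    class HostState(object):
        """In-memory view of one compute host, as a scheduler worker sees it."""

        def __init__(self, name, free_ram_mb):
            self.name = name
            self.free_ram_mb = free_ram_mb

        def consume_from_instance(self, ram_mb):
            # Mirrors the spirit of the real method: decrement the local
            # view, without checking whether the host still has the room.
            self.free_ram_mb -= ram_mb


    # The "database" row that both workers read before either claim lands.
    db_row = HostState('compute-1', free_ram_mb=6 * 1024)

    # Each worker gets its own snapshot of the host state; it is stale by
    # definition, since the other worker's decision is not reflected in it.
    worker_a_view = copy.deepcopy(db_row)
    worker_b_view = copy.deepcopy(db_row)

    request_ram = 4 * 1024
    worker_a_view.consume_from_instance(request_ram)  # A thinks 2 GB remain
    worker_b_view.consume_from_instance(request_ram)  # B thinks 2 GB remain

    # Only when both claims reach the compute node does the conflict appear:
    print('free RAM on the host after both claims: %d MB'
          % (db_row.free_ram_mb - 2 * request_ram))   # prints -2048

This is exactly the case that only the claim on the compute node (and the retry that follows) can catch today. I have also put two more sketches below the quoted thread: one for a capacity-checking consume step and one for the retry flow.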
Best Regards.

2015-03-05 21:12 GMT+08:00 Sylvain Bauza <sba...@redhat.com>:
>
> On 05/03/2015 13:00, Nikola Đipanov wrote:
>> On 03/04/2015 09:23 AM, Sylvain Bauza wrote:
>>>
>>> On 04/03/2015 04:51, Rui Chen wrote:
>>>>
>>>> Hi all,
>>>>
>>>> I want to make it easy to launch a bunch of scheduler processes on a
>>>> host; multiple scheduler workers will make use of the host's multiple
>>>> processors and enhance the performance of nova-scheduler.
>>>>
>>>> I have registered a blueprint and submitted a patch to implement it:
>>>> https://blueprints.launchpad.net/nova/+spec/scheduler-multiple-workers-support
>>>>
>>>> The patch has been applied in our performance environment and passed
>>>> some test cases, such as booting multiple instances concurrently, and
>>>> so far we have not found any inconsistency issues.
>>>>
>>>> IMO, nova-scheduler should be easy to scale horizontally, and multiple
>>>> workers should be supported as an out-of-the-box feature.
>>>>
>>>> Please feel free to discuss this feature, thanks.
>>>
>>> As I said when reviewing your patch, I think the problem is not just
>>> making sure that the scheduler is thread-safe; it's more about how the
>>> Scheduler accounts for resources and provides a retry if the consumed
>>> resources are higher than what's available.
>>>
>>> Here, the main problem is that two workers can actually consume two
>>> distinct resources on the same HostState object. In that case, the
>>> HostState object is decremented by the number of resources taken
>>> (modulo what that means for a resource that is not an integer...) for
>>> both, but nowhere in that section does it check whether the resource
>>> usage has been exceeded. As I said, it's not just about adding a
>>> semaphore; it's more about rethinking how the Scheduler manages its
>>> resources.
>>>
>>> That's why I'm -1 on your patch until [1] gets merged. Once this BP is
>>> implemented, we will have a set of classes for managing heterogeneous
>>> types of resources and consuming them, so it would be quite easy to
>>> provide a check against them in the consume_from_instance() method.
>>
>> I feel that the above explanation does not give the full picture, in
>> addition to being factually incorrect in several places. I have come to
>> realize that the current behaviour of the scheduler is subtle enough
>> that just reading the code is not enough to understand all the edge
>> cases that can come up. The evidence being that it trips up even people
>> who have spent significant time working on the code.
>>
>> It is also important to consider the design choices in terms of the
>> trade-offs that they were trying to make.
>>
>> So here are some facts about the way Nova schedules instances onto
>> compute hosts, considering the amount of resources requested by the
>> flavor (we will try to put the facts into a bigger picture later):
>>
>> * The scheduler receives a request to choose hosts for one or more
>> instances.
>> * Upon every request (_not_ for every instance, as there may be several
>> instances in a request) the scheduler learns the state of the resources
>> on all compute nodes from the central DB. This state may be inaccurate
>> (meaning out of date).
>> * Compute resources are updated by each compute host periodically. This
>> is done by updating the row in the DB.
>> * The wall-clock time between the scheduler deciding to schedule an
>> instance and the resource consumption being reflected in the data the
>> scheduler learns from the DB can be arbitrarily long (due to load on
>> the compute nodes and latency of message arrival).
>> * To cope with the above, there is the concept of retrying a request
>> that fails on a certain compute node because the scheduling decision
>> was made with data that was stale at the moment of the build; by
>> default we retry 3 times before giving up.
>> * When running multiple instances, decisions are made in a loop, and an
>> internal in-memory view of the resources gets updated (the widely
>> misunderstood consume_from_instance method is used for this), so as to
>> keep subsequent decisions as accurate as possible. As described above,
>> this is all thrown away once the request is finished.
>>
>> Now that we understand the above, we can start to consider what changes
>> when we introduce several concurrent scheduler processes.
>>
>> Several cases come to mind:
>> * Concurrent requests will no longer be serialized on reading the state
>> of all hosts (due to how eventlet interacts with the MySQL driver).
>> * In the presence of a single request for a large number of instances,
>> there is going to be drift in the accuracy of the decisions made by
>> other schedulers, as they will not have accounted for any of those
>> instances until the instances actually get claimed on their respective
>> hosts.
>>
>> All of the above limitations will likely not pose a problem under
>> normal load and usage, but they can cause issues to start appearing
>> when nodes are close to full or when there is heavy load. Also, this
>> changes drastically based on how we actually choose to utilize hosts
>> (see a very interesting Ironic bug [1]).
>>
>> Whether any of the above matters to users depends heavily on their use
>> case, though. This is why I feel we should be providing more
>> information.
>>
>> Finally, I think it is important to accept that the scheduler service
>> will always have to operate under the assumption of stale data, and
>> build for that. Based on that, I'd be happy to see real work go into
>> making multiple schedulers work well enough for most common use cases
>> while providing a way forward for people who need tighter bounds on the
>> feedback loop.
>>
>> N.
>
> Agreed 100% with all of your above email. Thanks, Nikola, for taking
> the time to explain how the Scheduler works; that's (btw.) something I
> hope to present at the Vancouver Summit if my proposal is accepted.
>
> That said, I hope my reviewers will understand that I want to see the
> Scheduler split out and moved to a separate repo first, before working
> on fixing the race conditions you mention above. Yes, I know, it's
> difficult to accept some limitations on the Nova scheduler while many
> customers would want them fixed, but we have so many technical debt
> issues here that I think we should really work on the split itself
> (like we did for Kilo and what we'll hopefully work on for Liberty)
> and then discuss the new design after that.
>
> -Sylvain
>
> [1] https://bugs.launchpad.net/nova/+bug/1341420
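Coming back to Sylvain's point about consume_from_instance(): once the resource-objects work he refers to is in place, I imagine the consume step could refuse to go below zero and tell the caller to pick another host. This is purely a sketch of that idea, not of the blueprint's actual design; the exception name and the fields are made up:

    class NotEnoughResources(Exception):
        """Signal that the chosen host cannot fit the requested resources."""


    class HostState(object):
        def __init__(self, name, free_ram_mb, free_vcpus):
            self.name = name
            self.free_ram_mb = free_ram_mb
            self.free_vcpus = free_vcpus

        def consume_from_instance(self, ram_mb, vcpus):
            # Check first, then decrement, so a worker that lost the race
            # gets a clear signal instead of an oversubscribed host view.
            if ram_mb > self.free_ram_mb or vcpus > self.free_vcpus:
                raise NotEnoughResources(
                    '%s cannot fit %d MB / %d vCPUs'
                    % (self.name, ram_mb, vcpus))
            self.free_ram_mb -= ram_mb
            self.free_vcpus -= vcpus


    host = HostState('compute-1', free_ram_mb=2048, free_vcpus=2)
    host.consume_from_instance(1024, 1)   # fine: 1 GB and 1 vCPU left
    host.consume_from_instance(4096, 1)   # raises NotEnoughResources

A check like this does not remove the need for the claim on the compute node, since the DB data can still be stale; it only keeps one worker's in-memory view honest.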
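And for completeness, the retry behaviour Nikola describes (the build is retried on another host when the claim fails, three attempts by default; in Nova this is driven by the retry dict in the filter properties and, if I remember correctly, the scheduler_max_attempts option) boils down to something like this toy loop:

    import random

    MAX_ATTEMPTS = 3   # stand-in for the scheduler_max_attempts default


    def claim(host, ram_mb):
        # Stand-in for the compute-side resource claim; pretend roughly
        # 30% of claims fail because the scheduler worked from stale data.
        return random.random() > 0.3


    def schedule_with_retries(hosts, ram_mb):
        tried = []   # plays the role of the retry dict's list of hosts
        for attempt in range(1, MAX_ATTEMPTS + 1):
            candidates = [h for h in hosts if h not in tried]
            if not candidates:
                break
            host = candidates[0]   # the real code filters and weighs here
            if claim(host, ram_mb):
                return host, attempt
            tried.append(host)     # remember the failure for the next pass
        raise RuntimeError('no valid host after %d attempts' % len(tried))


    try:
        print(schedule_with_retries(['compute-1', 'compute-2', 'compute-3'],
                                    4096))
    except RuntimeError as exc:
        print(exc)   # the NoValidHost-style outcome

As I read the thread, running multiple scheduler workers simply makes us lean on this claim-and-retry path more often, which is why the accuracy concerns above matter.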