On 10/12/15, 11:45 AM, "Joshua Harlow" <harlo...@fastmail.com> wrote:
>Alec Hothan (ahothan) wrote:
>>
>>
>>
>>
>> On 10/10/15, 11:35 PM, "Clint Byrum" <cl...@fewbar.com> wrote:
>>
>>> Excerpts from Alec Hothan (ahothan)'s message of 2015-10-09 21:19:14 -0700:
>>>> On 10/9/15, 6:29 PM, "Clint Byrum" <cl...@fewbar.com> wrote:
>>>>
>>>>> Excerpts from Chris Friesen's message of 2015-10-09 17:33:38 -0700:
>>>>>> On 10/09/2015 03:36 PM, Ian Wells wrote:
>>>>>>> On 9 October 2015 at 12:50, Chris Friesen <chris.frie...@windriver.com> wrote:
>>>>>>>
>>>>>>> Has anybody looked at why 1 instance is too slow and what it would
>>>>>>> take to make 1 scheduler instance work fast enough? This does not
>>>>>>> preclude the use of concurrency for finer grain tasks in the
>>>>>>> background.
>>>>>>>
>>>>>>> Currently we pull data on all (!) of the compute nodes out of the
>>>>>>> database via a series of RPC calls, then evaluate the various
>>>>>>> filters in python code.
>>>>>>>
>>>>>>>
>>>>>>> I'll say again: the database seems to me to be the problem here. Not
>>>>>>> to mention, you've just explained that they are in practice holding
>>>>>>> all the data in memory in order to do the work, so the benefit we're
>>>>>>> getting here is really a N-to-1-to-M pattern with a DB in the middle
>>>>>>> (the store-to-DB is rather secondary, in fact), and that without
>>>>>>> incremental updates to the receivers.
>>>>>> I don't see any reason why you couldn't have an in-memory scheduler.
>>>>>>
>>>>>> Currently the database serves as the persistent storage for the
>>>>>> resource usage, so if we take it out of the picture I imagine you'd
>>>>>> want to have some way of querying the compute nodes for their current
>>>>>> state when the scheduler first starts up.
>>>>>>
>>>>>> I think the current code uses the fact that objects are remotable via
>>>>>> the conductor, so changing that to do explicit posts to a known
>>>>>> scheduler topic would take some work.
>>>>>>
>>>>> Funny enough, I think that's exactly what Josh's "just use Zookeeper"
>>>>> message is about. Except instead of in memory, it is "in an observable
>>>>> storage location".
>>>>>
>>>>> Instead of having the scheduler do all of the compute node inspection
>>>>> and querying though, you have the nodes push their stats into something
>>>>> like Zookeeper or consul, and then have schedulers watch those stats
>>>>> for changes to keep their in-memory version of the data up to date. So
>>>>> when you bring a new one online, you don't have to query all the nodes,
>>>>> you just scrape the data store, which all of these stores (etcd, consul,
>>>>> ZK) are built to support atomically querying and watching at the same
>>>>> time, so you can have a reasonable expectation of correctness.
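For concreteness, a minimal sketch of what that push/watch pattern could look
like using the kazoo ZooKeeper client (paths, host and stats format are made
up for illustration; this is not proposed Nova code):

    import json

    from kazoo.client import KazooClient

    zk = KazooClient(hosts='127.0.0.1:2181')
    zk.start()
    zk.ensure_path('/compute_nodes')

    # Compute-node side: publish current stats as an ephemeral znode so that
    # a dead node simply disappears from the view.
    def publish_stats(node_name, stats):
        path = '/compute_nodes/' + node_name
        data = json.dumps(stats).encode('utf-8')
        if zk.exists(path):
            zk.set(path, data)
        else:
            zk.create(path, data, ephemeral=True)

    # Scheduler side: an in-memory copy of every node's stats, kept fresh by
    # ZooKeeper watches instead of querying each compute node.
    node_stats = {}
    watched = set()

    def watch_node(name):
        @zk.DataWatch('/compute_nodes/' + name)
        def on_stats_change(data, stat):
            if data is None:
                node_stats.pop(name, None)       # node went away
            else:
                node_stats[name] = json.loads(data.decode('utf-8'))

    @zk.ChildrenWatch('/compute_nodes')
    def on_membership_change(children):
        for name in children:
            if name not in watched:
                watched.add(name)
                watch_node(name)

A new scheduler coming online would just do the initial scrape and then rely
on the watches, which is exactly the "atomically query and watch" property
mentioned above.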
>>>>>
>>>>> Even if you figured out how to make the in-memory scheduler crazy fast,
>>>>> there's still value in concurrency for other reasons. No matter how
>>>>> fast you make the scheduler, you'll be slave to the response time of
>>>>> a single scheduling request. If you take 1ms to schedule each request
>>>>> (including just reading the request and pushing out your scheduling
>>>>> result!) you will never achieve greater than 1000/s. 1ms is already less
>>>>> than it takes just to shove a tiny message into RabbitMQ or even 0mq.
>>>> That is not what I have seen; measurements that I did myself or that were
>>>> done by others show between 5000 and 10000 sends *per sec* (depending on
>>>> mirroring, up to 1KB msg size) using oslo messaging/kombu over RabbitMQ.
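For reference, the kind of raw-kombu measurement I'm referring to looks
roughly like this (broker URL, queue name and message count are made up, and
the real numbers obviously depend on broker config and mirroring):

    import time

    from kombu import Connection, Exchange, Producer, Queue

    exchange = Exchange('sched_bench', type='direct')
    queue = Queue('sched_bench', exchange, routing_key='sched_bench')
    payload = 'x' * 1024                     # ~1KB message body
    count = 10000

    with Connection('amqp://guest:guest@localhost//') as conn:
        channel = conn.channel()
        # Declaring the bound queue also declares the exchange and binding.
        queue(channel).declare()
        producer = Producer(channel, exchange, routing_key='sched_bench')
        start = time.time()
        for _ in range(count):
            producer.publish(payload)
        elapsed = time.time() - start
        print('%d msgs in %.2fs -> %.0f msg/sec' % (count, elapsed, count / elapsed))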
>>> You're quoting throughput of RabbitMQ, but how many threads were
>>> involved? An in-memory scheduler that was multi-threaded would need to
>>> implement synchronization at a fairly granular level to use the same
>>> in-memory store, and we're right back to the extreme need for efficient
>>> concurrency in the design, though with much better latency on the
>>> synchronization.
>>
>> These were single-threaded tests, and you're correct that if you had multiple
>> threads trying to send something you'd have some inefficiency.
>> However, I'd question the likelihood of that happening, as it is very likely
>> that most of the CPU time will be spent outside of oslo messaging code.
>>
>> Furthermore, Python does not go faster with multiple threads. As a matter
>> of fact, for in-memory operations it can end up being slower because of
>> the inherent design of the interpreter (and there are many independent
>> measurements that have shown it).
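A trivial way to see that effect for yourself (toy micro-benchmark, nothing to
do with nova code; the loop size is arbitrary):

    import threading
    import time

    def burn(n):
        # Pure-Python CPU-bound loop; the GIL lets only one of these run at a time.
        while n:
            n -= 1

    N = 5000000

    start = time.time()
    burn(N)
    burn(N)
    print('sequential: %.2fs' % (time.time() - start))

    start = time.time()
    threads = [threading.Thread(target=burn, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('2 threads : %.2fs (typically no faster, often slower)' % (time.time() - start))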
>>
>>
>>>> And this is unmodified/highly unoptimized oslo messaging code.
>>>> If you remove the oslo messaging layer, you get 25000 to 45000 msg/sec
>>>> with kombu/rabbitMQ (which shows how inefficient the oslo messaging layer
>>>> itself is).
>>>>
>>>>> So I'm pretty sure this is o-k for small clouds, but would be
>>>>> a disaster for a large, busy cloud.
>>>> It all depends on how many sched/sec for the "large busy cloud"...
>>>>
>>> I think there are two interesting things to discern. Of course, the
>>> exact rate would be great to have as a target, but operational security
>>> and just plain secrecy of business models will probably prevent us from
>>> getting at many of these requirements.
>>
>> I don't think that is the case. We have no visibility because nobody has
>> really thought about these numbers. Ops should be ok to provide some rough
>> requirement numbers if asked (everybody is in the same boat).
>>
>>
>>> The second is the complexity model of scaling. We can just think about
>>> the actual cost/benefit of running 1, 3, and more schedulers and come up
>>> with some rough numbers for a lower bound on scheduler performance
>>> that would make sense.
>>>
>>>>> If, however, you can have 20 schedulers that all take 10ms on average,
>>>>> and have the occasional lock contention for a resource counter resulting
>>>>> in 100ms, now you're at 2000/s minus the lock contention rate. This
>>>>> strategy would scale better with the number of compute nodes, since
>>>>> more nodes means more distinct locks, so you can scale out the number
>>>>> of running servers separate from the number of scheduling requests.
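Just to make the arithmetic in those two scenarios explicit (these are the
hypothetical numbers quoted above, not measurements):

    per_decision = 0.001                  # 1 ms, single serial scheduler
    print(1 / per_decision)               # -> 1000 schedules/sec hard ceiling

    schedulers = 20
    avg_decision = 0.010                  # 10 ms average across 20 schedulers
    print(schedulers / avg_decision)      # -> 2000 schedules/sec, before lock contention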
>>>> How many compute nodes are we talking about max? How many schedules per
>>>> second is the requirement? And where are we today with the latest nova
>>>> scheduler?
>>>> My point is that without these numbers we could end up under-shooting,
>>>> over-shooting or over-engineering, along with the cost of maintaining that
>>>> extra complexity over the lifetime of openstack.
>>>>
>>>> I'll just make up some numbers for the sake of this discussion:
>>>>
>>>> the latest nova scheduler can do only 100 sched/sec with 1 scheduler
>>>> instance (I guess the 10ms average you bring up may not be that unrealistic)
>>>> the requirement is a sustained 500 sched/sec worst case with 10K nodes
>>>> (that is 5% of 10K, and today we can barely launch 100 VM/sec sustained)
>>>>
>>>> Are we going to achieve 5x with just 3 instances, which is what most people
>>>> deploy? Not likely.
>>>> Will using a more elaborate distributed infra/DLM like consul/zk/etcd get
>>>> us to that 500 mark with 3 instances? Maybe, but it will be at the
>>>> expense of added complexity in the overall solution.
>>>> Can we instead optimize the nova scheduler so a single instance can do
>>>> 500/sec? Maybe, and if we succeed we'll get a much simpler solution.
>>>>
>>> Of course we can. And simple solutions are great. So if you can get one
>>> node to do all the work and the cloud doesn't fall over dead when it dies
>>> (because you have more waiting on standby), that would be fantastic.
>>>
>>> I'm dubious that this will be a solution that works for large deployers.
>>> But I may be wrong!
>>
>> Control-plane apps that had to do more complex work than scheduling
>> instances have been working for a long time at scale using simple
>> active/standby designs.
>> In this case it is easier because we can even afford some scheduling
>> disruption when the active goes down (as long as the standby can pick up
>> most of the pending requests). Heck, when an entire openstack controller goes
>> down, I bet failing a few requests will be the least of your concerns
>> (because a lot more other things will go wrong).
>>
>>
>>>> Not saying that high concurrency and distributed schedulers are not the
>>>> solution, and maybe we really do need a distributed solution, but it'd be
>>>> good to have some numbers to frame the discussion.
>>>>
>>> Indeed, however, I don't think we can expect those numbers to materialize
>>> in public, ever. We have to make some informed guesses and see how the
>>> product managers respond when we tell them what we think should happen. ;)
>>
>>
>> Isn't that a concern? Shouldn't the TC provide at least some numbers for the
>> scale range so we do not under/over engineer? Any number would be better
>> than no number at all.
>>
>> It is an open secret that Nova/Neutron does not scale well even below the
>> 1000-node mark (400/500 nodes seems to be the threshold for doing anything
>> serious in production).
>> If we look at all the openstack/Neutron deployments today, nobody has an
>> exact count, but very likely the vast majority of deployments are smaller
>> than 100 nodes. Few are over 100 nodes, and even fewer are over 500 (well, we
>> know some companies/orgs have deployed thousands, but none with Neutron).
>> For those 99% of deployments, we only need to service fewer than 100 nodes.
>> In that space the inability to schedule over 100 instances per second is
>> probably not that important. If the problem we are trying to solve here is
>> "scheduling is too slow", then the first question to ask is how much faster
>> we need to go, then brainstorm on what would be the best way to achieve it.
>> I have no problem believing we can schedule a lot more instances/sec by
>> using a scale-out design with DLMs, but there is no free lunch. Is it fair to
>> impose on those 99% of deployments the complexity of a machinery that is
>> designed to handle 1000 schedules/sec? Ops are very concerned about adding
>> even more complexity to an already complex platform, and I think we should be
>> considerate of that, meaning worry about the cost impact in HW, config,
>> deployment, and troubleshooting.
>>
>
>Ah, therein lies the question: where do we want to go... This is a tough
>one to answer, and I can offer my inputs (mine would be ~5000
>hypervisors to start and bigger as we get there, with an instance
>bring-up/tear-down rate of let's say 50 instances per second to start,
>and a failure rate being, well, as low as we can get it), but my inputs
>aren't likely to be others' inputs. Sadly, if this isn't 1000+ nodes then I
>would start to believe that bigger players will start looking elsewhere
>for solutions. If openstack just wants to focus on the < 500 range that's
>cool, but the choice will IMHO decide the future of where openstack
>goes. So it's a tricky question to answer, but a good one to ask and
>think about :)
If you have 5K nodes and you only need to launch/restart 50 instances/sec, that
does not look impossible to achieve with just 1 scheduler, and I hope we will
not need 20 schedulers ;-)
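Back-of-the-envelope, assuming a pessimistic 10ms per scheduling decision
(that per-decision cost is an assumption on my part, not a measurement):

    rate_needed = 50                      # instances/sec to launch or restart
    per_decision = 0.010                  # assumed seconds per decision for one scheduler
    print(rate_needed * per_decision)     # -> 0.5, i.e. ~50% of one scheduler's capacity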
Just curious, have you tried actually launching 50/sec and can it keep up (with
any openstack release)? If it can't, what's to say a faster scheduler will make
it go faster overall? And I guess the 5K node deployment is using nova
networking?
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev