On 02/22/2016 04:23 AM, Sylvain Bauza wrote:
I won't argue against performance here. You made a very nice PoC for
testing scaling DB writes within a single python process and I trust
your findings. While I would be naturally preferring some shared-nothing
approach that can horizontally scale, one could mention that we can do
the same with Galera clusters.
a) My benchmarks aren't single process comparisons. They are
multi-process benchmarks.
b) The approach I've taken is indeed shared-nothing. The scheduler
processes do not share any data whatsoever.
c) Galera isn't horizontally scalable. Never was, never will be. That
isn't its strong-suit. Galera is best for having a
synchronously-replicated database cluster that is incredibly easy to
manage and administer, but it isn't a performance panacea. Its focus is
on availability, not performance :)
That said, most of the operators run a controller/compute situation
where all the services but the compute node are hosted on 1:N hosts.
Implementing the resource-providers-scheduler BP (and only that one)
will dramatically increase the number of writes we do on the scheduler
process (i.e. on the "controller" - quoting because there is no notion of
a "controller" in Nova, it's just a deployment choice).
Yup, no doubt about it. It won't increase the *total* number of writes
the system makes, just the concentration of those writes into the
scheduler processes. You are trading increased writes in the scheduler
for the challenges inherent in keeping a large distributed cache system
valid and fresh (which itself introduces a different kind of write).
That's a big game changer for operators who are currently capping their
capacity by adding more conductors. It would require them to do some DB
modifications to be able to scale their capacity. I'm not against that,
I just say it's a big thing that we need to consider and properly
communicate if agreed.
Agreed completely. I will say, however, that on a 1600 compute node
simulation (~60K variably-sized instances), an untuned stock MySQL 5.6
database with 128MB InnoDB buffer pool size barely breaks a sweat on my
local machine.
> It can be alleviated by changing to a stand-alone high-performance
database.
It doesn't need to be high-performance at all. In my benchmarks, a
small-sized stock MySQL database server is able to fulfill thousands
of placement queries and claim transactions per minute using
completely isolated non-shared, non-caching scheduler processes.
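To give a rough, concrete idea of what such a placement query looks like,
here is a small self-contained sketch. The schema and column names are
purely illustrative (they are not the actual resource-providers schema),
and sqlite3 is used only to keep the example runnable:

    # Illustrative sketch: select compute nodes with enough free capacity
    # in a single SQL statement, instead of filtering/weighing in Python.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE inventories (
        node_id INTEGER, total_vcpu INTEGER, total_ram_mb INTEGER);
    CREATE TABLE allocations (
        node_id INTEGER, used_vcpu INTEGER, used_ram_mb INTEGER);
    INSERT INTO inventories VALUES (1, 16, 32768);
    INSERT INTO inventories VALUES (2, 16, 32768);
    INSERT INTO allocations VALUES (1, 14, 30720);
    INSERT INTO allocations VALUES (2, 2, 4096);
    """)

    # Request: 4 VCPUs and 8192 MB of RAM.
    rows = conn.execute("""
    SELECT i.node_id
      FROM inventories i
      JOIN (SELECT node_id,
                   SUM(used_vcpu) AS used_vcpu,
                   SUM(used_ram_mb) AS used_ram_mb
              FROM allocations GROUP BY node_id) a
        ON a.node_id = i.node_id
     WHERE i.total_vcpu - a.used_vcpu >= ?
       AND i.total_ram_mb - a.used_ram_mb >= ?
    """, (4, 8192)).fetchall()

    print(rows)   # -> [(2,)] : only node 2 has enough free capacity

The point being that "which hosts have enough free capacity" becomes a
single SQL statement instead of Python-side filtering over a cached view.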
> And the cache refreshing is designed to be
replaced by direct SQL queries according to the resource-provider
scheduler spec [2].
Yes, this is correct.
> The performance bottleneck of the shared-state scheduler
may come from the overwhelming update messages; it can also be
alleviated by changing to a stand-alone distributed message queue and by
using the “MessagePipe” to merge messages.
In terms of the number of messages used in each design, I see the
following relationship:
resource-providers < legacy < shared-state-scheduler
Would you agree with that?
True. But that's manageable by adding more conductors, right? IMHO,
Nova performance is bound by the number of conductors you run, and I like
that - because it makes capacity easy to increase.
Also, the payload could be far smaller than the existing one: instead of
sending a full update for a single compute_node entry, it would only
send the diff (plus some full syncs periodically). We would then mitigate
the increase in messages by making sure we're sending less data per message.
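The kind of diff being described could be as small as the changed fields
of a compute node's resource view; a minimal sketch (field names purely
illustrative):

    # Sketch of "send only the diff": compare the previous and current
    # resource view of a compute node and emit only the changed fields.
    def resource_diff(old, new):
        return {key: value for key, value in new.items()
                if old.get(key) != value}

    previous = {"vcpus_used": 4, "memory_mb_used": 8192, "local_gb_used": 80}
    current = {"vcpus_used": 5, "memory_mb_used": 10240, "local_gb_used": 80}

    print(resource_diff(previous, current))
    # -> {'vcpus_used': 5, 'memory_mb_used': 10240}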
Sending no message at all is better than sending any message, regardless of
whether that message contains an incremental update or a full object.
The resource-providers proposal actually uses no update messages at
all (except in the abnormal case of a compute node failing to start
the resources that had previously been claimed by the scheduler). All
updates are done in a single database transaction when the claim is made.
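As a rough illustration of what "all updates in a single database
transaction" can look like, here is a simplified sketch. The guarded
UPDATE is just one possible way to protect against concurrent claims,
and the table and column names are illustrative, not the actual
resource-providers schema (sqlite3 again keeps the example runnable):

    # Sketch of a claim made in a single database transaction: bump usage
    # and insert the allocation atomically, and only succeed if the node
    # still has enough free capacity.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE usages (node_id INTEGER PRIMARY KEY,
                         total_vcpu INTEGER, used_vcpu INTEGER);
    CREATE TABLE allocations (instance_id TEXT, node_id INTEGER,
                              vcpus INTEGER);
    INSERT INTO usages VALUES (1, 16, 12);
    """)

    def claim(conn, instance_id, node_id, vcpus):
        with conn:  # one transaction: both statements commit or neither does
            cur = conn.execute(
                "UPDATE usages SET used_vcpu = used_vcpu + ? "
                "WHERE node_id = ? AND total_vcpu - used_vcpu >= ?",
                (vcpus, node_id, vcpus))
            if cur.rowcount == 0:
                # capacity changed under us; caller picks another node
                raise RuntimeError("claim failed, retry placement")
            conn.execute("INSERT INTO allocations VALUES (?, ?, ?)",
                         (instance_id, node_id, vcpus))

    claim(conn, "inst-a", 1, 4)   # succeeds: 12 + 4 <= 16
    print(conn.execute("SELECT used_vcpu FROM usages").fetchone())  # (16,)

Either both statements commit or neither does, so a competing claim that
would oversubscribe the node simply fails and the scheduler picks another
host, with no update message needed.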
See, I don't think that a compute node being unable to start a request is an
'abnormal case'. There are many reasons why a request can't be honored
by the compute node:
- the scheduler doesn't own all the compute resources and
thus can miss some information: for example, say that you want to pin a
specific pCPU and this pCPU is already assigned. The scheduler doesn't
know *which* pCPUs are free, it only knows *how many* are free.
That atomic operation (pick a free pCPU and assign it to the instance)
happens in the compute manager, not at the exact moment we decrease the
pCPU resource usage (which would be done in the scheduler process).
See my response to Chris Friesen about the above.
- some "reserved" RAM or disk could be underestimated and
consequently, spawning a VM could be either taking fare more time than
planned (which would mean that it would be a suboptimal placement) or it
would fail which would issue a reschedule.
Again, the above is an abnormal case.
<snip>
It's a distributed problem. Neither the shared-state scheduler (because
it only guarantees that its view will be *eventually* consistent) nor
the scheduler process (because the scheduler doesn't own the resource
usage made by the compute nodes) can have an accurate view.
If the scheduler claimed the resources on the compute node, it *would*
own those resources, though, and therefore it *would* have a 99.9%
accurate view of the inventory of the system.
It's not a distributed problem if you don't make it a distributed problem.
There's nothing wrong with the compute node having the final "say" about
if some claimed-by-the-scheduler resources truly cannot be provided by
the compute node, but again, that is a) an abnormal operating condition
and b) simply would trigger a notification to the scheduler to free said
resources.
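A rough sketch of what (b) could look like, with hypothetical names
rather than Nova's actual RPC API:

    # If the compute node cannot honor a claim, it notifies the scheduler
    # to free the claimed resources and the request gets rescheduled.
    class FakeSchedulerClient:
        def free_claim(self, claim_id):
            print("freeing claim", claim_id)

        def reschedule(self, instance_id):
            print("rescheduling", instance_id)

    def build_instance(spawn, scheduler_client, instance_id, claim_id):
        try:
            spawn(instance_id)          # normal path: claim already made
        except Exception:
            # abnormal path: give the claimed resources back, retry elsewhere
            scheduler_client.free_claim(claim_id)
            scheduler_client.reschedule(instance_id)

    def failing_spawn(instance_id):
        raise RuntimeError("hypervisor refused the request")

    build_instance(failing_spawn, FakeSchedulerClient(), "inst-a", "claim-1")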
4. Design goal difference:
The fundamental design goal of the two new schedulers is different. To copy
my views from [2], I think it is the choice between “the loose
distributed consistency with retries” and “the strict centralized
consistency with locks”.
There are a couple of other things that I believe we should be
documenting, considering and measuring with regard to scheduler designs:
a) Debuggability
The ability of a system to be debugged and for requests to that system
to be diagnosed is a critical component of the value of a particular
system design. I'm hoping that by removing a lot of the moving parts
from the legacy filter scheduler design (removing the caching,
removing the Python-side filtering and weighing, removing the window
during which placement decisions can conflict, removing the cost and
frequency of retry operations) the resource-provider scheduler
design will become simpler for operators to use.
Keep in mind that scheduler filters are a very handy pluggable system
for operators wanting to implement their own placement logic. If you
want to deprecate filters (and I'm not opinionated on that), just make
sure that you keep that extension capability.
Understood, and I will do my best to provide some level of extensibility.
b) Simplicity
Goes to the above point about debuggability, but I've always tried to
follow the mantra that the best software design is not when you've
added the last piece to it, but rather when you've removed the last
piece from it and still have a functioning and performant system.
Having a scheduler that can tackle the process of tracking resources,
deciding on placement, and claiming those resources instead of playing
an intricate dance of keeping state caches valid will, IMHO, lead to a
better scheduler.
I'm also very concerned about the upgrade path for both proposals, like I
said before. I haven't yet seen how either of those two proposals
changes the existing FilterScheduler and what is subject to change.
Also, please keep in mind that we support rolling upgrades, so we need
to support old compute nodes.
Nothing about any of the resource-providers specs inhibits rolling
upgrades. In fact, two of the specs (resource-providers-allocations and
compute-node-inventory) are *only* about how to do online data migration
from the legacy DB schema to the resource-providers schema. :)
Best,
-jay