On 02/12/2016 01:47 PM, Clint Byrum wrote:
> Excerpts from Jay Pipes's message of 2016-02-11 12:24:04 -0800:
>> Hello all,
>>
>> Performance working group, please pay attention to Chapter 2 in the
>> details section.
> <snipped the part you let us not pay attention to. ;)>
>> Chapter 2 - Addressing performance and scale
>> ============================================
>>
>> One of the significant performance problems with the Nova scheduler is
>> the fact that for every call to the select_destinations() RPC API method
>> -- which itself is called at least once every time a launch or migration
>> request is made -- the scheduler grabs all records for all compute nodes
>> in the deployment. After retrieving all these compute node records, the
>> scheduler runs each through a set of filters to determine which compute
>> nodes have the required capacity to service the instance's requested
>> resources. Having the scheduler continually retrieve every compute node
>> record on each request to select_destinations() is extremely
>> inefficient. The greater the number of compute nodes, the bigger the
>> performance and scale problem this becomes.
>>
>> On a loaded cloud deployment -- say there are 1000 compute nodes and 900
>> of them are fully loaded with active virtual machines -- the scheduler
>> is still going to retrieve all 1000 compute node records on every
>> request to select_destinations() and process each one of those records
>> through all scheduler filters. Clearly, if we could reduce the number of
>> compute node records that are returned by removing those nodes that do
>> not have available capacity, we could dramatically reduce the amount of
>> work that each call to select_destinations() would need to perform.
>>
>> The resource-providers-scheduler blueprint attempts to address the above
>> problem by replacing a number of the scheduler filters that currently
>> run *after* the database has returned all compute node records with a
>> series of WHERE clauses and join conditions on the database query. The
>> idea here is to winnow the number of returned compute node results as
>> much as possible. The fewer records the scheduler must post-process,
>> the faster the performance of each individual call to
>> select_destinations().
> This is great, and I think it is the way to go. However, I'm not sure how
> dramatic the overall benefit will be, since it also shifts some load from
> reads to writes.
No, the above is *only* talking about the destination host selection
process, not the claim process. There are no writes here at all.

From my benchmarking, I see a 7.0% to 38.6% increase in the average
time to perform the destination selection operation when doing the
resource filtering on the Python side as opposed to the DB side.
As you would expect, the larger the size of the deployment, the greater
the performance benefit you see from using the DB for querying instead
of Python (lower numbers are better here):
DB or Python    # Compute Nodes    Avg Time to Select    Delta
--------------------------------------------------------------
DB                    100               0.021035
Python                100               0.022517          +7.0%
DB                    200               0.023370
Python                200               0.026526         +13.5%
DB                    400               0.027638
Python                400               0.034666         +25.4%
DB                    800               0.034814
Python                800               0.048271         +38.6%
The above was for a serialized scenario (1 scheduler process). Parallel
runs with 2, 4 and 8 scheduler processes were virtually identical, as can
be expected, since this is testing the performance of the read
operations, not the writes.
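
To make concrete what "filtering on the DB side" means, here is a minimal
sketch of the idea using SQLite and an invented compute_nodes layout (the
real Nova schema and the resource-providers modelling are more involved
than this); the point is simply that the capacity check moves into the
WHERE clause instead of a Python loop over every record:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE compute_nodes (
        id INTEGER PRIMARY KEY,
        vcpus INTEGER, vcpus_used INTEGER,
        memory_mb INTEGER, memory_mb_used INTEGER
    );
    INSERT INTO compute_nodes VALUES
        (1, 16, 16, 32768, 32768),  -- fully loaded; should be filtered out
        (2, 16,  4, 32768,  8192);  -- has room for the request below
    """)

    requested_vcpus, requested_ram = 4, 8192

    # Push the capacity check into the WHERE clause so only hosts with
    # room are returned, instead of fetching every row and filtering in
    # Python.
    rows = conn.execute("""
        SELECT id
          FROM compute_nodes
         WHERE vcpus - vcpus_used >= ?
           AND memory_mb - memory_mb_used >= ?
    """, (requested_vcpus, requested_ram)).fetchall()

    print(rows)  # [(2,)] -- only the node with available capacity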
> With 1000 active compute nodes updating their status, each index added
> will be 1000 more index writes per update period. Still a net win, but
> I'm always cautious about shifting things to more writes on the database
> server. That said, I do think it will be a win and should be done.
Again, this isn't what the "move the filtering to the database query"
proposal is about :) You are describing the *claim* operation above, not
the select-destination operation.
The *current* scheduler design is what has each distributed compute node
sending updates to the scheduler^Wdatabase each time a claim occurs.
What the second part of my proposal does is move the claim from the
distributed compute nodes and into the scheduler, which should allow the
scheduler to operate on non-stale data (which will reduce the number of
long retry operations). More below.
>> The second major scale problem with the current Nova scheduler design
>> has to do with the fact that the scheduler does *not* actually claim
>> resources on a provider. Instead, the scheduler selects a destination
>> host to place the instance on and the Nova conductor then sends a
>> message to that target host which attempts to spawn the instance on its
>> hypervisor. If the spawn succeeds, the target compute host updates the
>> Nova database and decrements its count of available resources. These
>> steps (from nova-scheduler to nova-conductor to nova-compute to
>> database) all take some not insignificant amount of time. During this
>> time window, a different scheduler process may pick the exact same
>> target host for a like-sized launch request. If there is only room on
>> the target host for one of those size requests [5], one of those spawn
>> requests will fail and trigger a retry operation. This retry operation
>> will attempt to repeat the scheduler placement decisions (by calling
>> select_destinations()).
>>
>> This retry operation is relatively expensive, and needlessly so: if the
>> scheduler claimed the resources on the target host before sending its
>> pick back to the conductor, then the chances of producing a retry would
>> be almost eliminated [6]. The resource-providers-scheduler blueprint
>> attempts to remedy this second scaling design problem by having the
>> scheduler write records to the allocations table before sending the
>> selected target host back to the Nova conductor.
> *This*, to me, is the thing that makes the scheduler dramatically more
> scalable. The ability to run as many schedulers as I expect to need to
> respond to user requests in a reasonable amount of time is the key to
> victory here.
>
> However, I wonder how you will avoid serialization or getting into
> a much tighter retry race for the claiming operations. There's talk
> in the spec of inserting allocations in a table atomically. However,
> with multiple schedulers, you'll still have the problem where one will
> claim and the others will need to know that they cannot.
This is handled in my proposal with a single database transaction that
looks at a "generation" column on each resource provider and rolls back
the transaction if the generation is not the same as what was read
during the select-destination process.
> We can talk about nuts and bolts, but there's really only two ways this
> can work: exclusive locking, or compare and swap retry loops.
Yup. Compare and swap is what I propose and have implemented in the
placement-bench project here:
https://github.com/jaypipes/placement-bench/blob/master/placement.py#L123-L129
triggering a retry here:
https://github.com/jaypipes/placement-bench/blob/master/placement.py#L212-L217
Exclusive locking -- i.e. SELECT FOR UPDATE -- won't work on Galera
systems in multi-writer mode, as you already know :)
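
For illustration, here is a minimal sketch of that compare-and-swap using
SQLite and invented table/column names (resource_providers.generation and
a bare-bones allocations table); the placement-bench code linked above is
the authoritative version:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE resource_providers (id INTEGER PRIMARY KEY,
                                     generation INTEGER);
    CREATE TABLE allocations (provider_id INTEGER, consumer TEXT,
                              vcpus INTEGER);
    INSERT INTO resource_providers VALUES (1, 0);
    """)

    def claim(conn, provider_id, generation_read, consumer, vcpus):
        """Try to claim; return False if another scheduler got there first."""
        try:
            with conn:  # one transaction: commit on success, rollback on error
                conn.execute("INSERT INTO allocations VALUES (?, ?, ?)",
                             (provider_id, consumer, vcpus))
                # Bump the generation only if it still matches what we read
                # during select-destination; rowcount == 0 means we lost the
                # race, and the whole transaction (allocation included) is
                # rolled back.
                cur = conn.execute(
                    "UPDATE resource_providers SET generation = generation + 1"
                    " WHERE id = ? AND generation = ?",
                    (provider_id, generation_read))
                if cur.rowcount == 0:
                    raise sqlite3.OperationalError("generation conflict")
        except sqlite3.OperationalError:
            return False  # caller re-runs select-destination and retries
        return True

    gen = conn.execute(
        "SELECT generation FROM resource_providers WHERE id = 1").fetchone()[0]
    print(claim(conn, 1, gen, "instance-a", 4))  # True: claim recorded
    print(claim(conn, 1, gen, "instance-b", 4))  # False: stale generation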
In my initial benchmarks, I have found that this compare and swap
approach works OK at scale (higher numbers are better here):
# Compute Nodes    Successful claims per second
-----------------------------------------------
      100                     54.1
      200                     68.9
      400                     51.3
      800                     34.3
All of the above numbers are for 8 scheduler processes, using a
pack-first placement strategy and using no partitioning strategy (so,
pretty much worst-case scenario).
Using a simple modulo partitioning strategy but staying with the
pack-first placement strategy, I got much better results:
# Compute Nodes    Successful claims per second
-----------------------------------------------
      100                     97.1
      200                    124.5
      400                    115.1
      800                     89.4
This is to be expected since the modulo partitioning reduces the surface
area for conflicting writes.
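
A minimal sketch of the partitioning idea (not the exact placement-bench
code): each scheduler worker only considers resource providers whose id
maps to it, so concurrent claims from different workers mostly land on
disjoint sets of rows:

    NUM_SCHEDULERS = 8

    def providers_for_worker(provider_ids, worker_index):
        """Return the subset of provider ids this scheduler worker considers."""
        return [pid for pid in provider_ids
                if pid % NUM_SCHEDULERS == worker_index]

    # 16 providers across 8 workers -> each worker claims against its own
    # pair, so generation conflicts (and therefore retries) drop sharply.
    print(providers_for_worker(range(16), worker_index=3))  # [3, 11]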
I am still coding an emulation of doing the claims on the compute nodes
in the placement-bench project. Of course, I will notify everyone once I
complete this work. I'm quite curious to see the results! :)
Best,
-jay
> I think the right way to go is probably the retries, so we can make
> use of some of the advantages of Galera. But I think it will need some
> collision avoidance mechanisms added in so schedulers generally stay out
> of each other's way and avoid too many retries, especially while packing.
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev