Both of you, thanks for your insights. Greatly appreciated.

On 04/03/2016 09:25, Cheng, Yingxin wrote:
Hi,

First of all, many belated thanks to Jay Pipes for his benchmarking framework; I
learnt a lot from it : )

Other comments inline.

On Friday, March 4, 2016 8:42 AM Jay Pipes wrote:
Hi again, Yingxin, sorry for the delayed response... been traveling.
Comments inline :)

On 03/01/2016 12:34 AM, Cheng, Yingxin wrote:
Hi,

I have simulated the distributed resource management with the incremental
update model based on Jay's benchmarking framework:
https://github.com/cyx1231st/placement-bench/tree/shared-state-
demonstration. The complete result is at
http://paste.openstack.org/show/488677/. It was run on a VM with 4 cores and
4GB RAM, with the mysql service using the default settings except for
"innodb_buffer_pool_size", which is set to "2G". The number of simulated compute
nodes is set to "300".

A few things.

First, in order to make any predictions or statements about a potential
implementation's scaling characteristics, you need to run the benchmarks with
increasing levels of compute nodes. The results you show give us only a single
dimension of scaling (300 compute nodes). What you want to do is run the
benchmarks at 100, 200, 400, 800 and 1600 compute node scales. You don't
need to run *all* of the different permutations of placement/partition/workers
scenarios, of course. I'd suggest just running the none partition strategy and the
pack placement strategy at 8 worker processes. Those results will give you (and
us!) the data points that will indicate the scaling behaviour of the
shared-state-scheduler implementation proposal as the number of compute nodes
in the deployment increases. The "none" partitioning strategy represents the reality of
the existing scheduler implementation, which does not shard the deployment
into partitions but retrieves all compute nodes for the entire deployment on
every request to the scheduler's
select_destinations() method.
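
For what it's worth, the sweep I have in mind is nothing more than this kind of
loop (the run_benchmark() helper below is just a stand-in I'm making up for
whatever entry point the benchmark exposes, not its actual API):

    def run_benchmark(compute_nodes, partition_strategy, placement_strategy, workers):
        """Stand-in for the real benchmark driver; returns a summary dict."""
        # In the real sweep this would invoke placement-bench with the given
        # parameters and collect its result rows.
        return {'nodes': compute_nodes, 'strategy': placement_strategy,
                'workers': workers}

    results = {}
    for nodes in (100, 200, 400, 800, 1600):
        results[nodes] = run_benchmark(compute_nodes=nodes,
                                       partition_strategy='none',
                                       placement_strategy='pack',
                                       workers=8)
        print(nodes, results[nodes])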
Hmm... good suggestion. I don't like running all the benchmarks either; it makes
me wait for a whole day and leaves a lot of data to evaluate.

300 is the maximum number I can test in my environment; beyond that the db refuses to
work because of connection limits, since all those nodes are asking for connections.
Should I emulate "conductors" to limit the db connections, build up a connection pool,
or tune the db configuration? I'm wondering if I should write a new tool to run tests
in a more realistic environment.

Yeah, I think you need to simulate the same conditions as a real production environment, where the compute nodes are not actually writing to the DB directly but rather going through conductors. Having the same model (i.e. AMQP casts to the conductor, with the conductor worker writing to the DB) would give us more accurate figures and would potentially help you scale your lab.

Again, we have a fake oslo.messaging driver for this kind of purpose; I suggest you take a look at the Nova in-tree functional tests to see how we set that up.
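
Roughly something like this (just a sketch, not the actual Nova fixture; the
topic and method names are made up for illustration):

    from oslo_config import cfg
    import oslo_messaging

    # In-memory "fake" transport: no AMQP broker needed, messages stay
    # in-process, which is enough to model "compute casts to conductor".
    transport = oslo_messaging.get_transport(cfg.CONF, url='fake:///')

    # A simulated compute node casts its resource update to a conductor-like
    # topic instead of opening its own DB connection.
    client = oslo_messaging.RPCClient(
        transport, oslo_messaging.Target(topic='conductor'))
    client.cast({}, 'update_inventory', node_id=42, values={'vcpus_used': 4})

    # On the other side, a small pool of conductor workers would consume the
    # 'conductor' topic via oslo_messaging.get_rpc_server() and be the only
    # processes holding DB connections.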

Secondly, and I'm not sure if you intended this, the code in your
compute_node.py file in the placement-bench project is not thread-safe.
In other words, your code assumes that only a single process on each compute
node could ever actually run the database transaction that inserts allocation
records at any time.
[a]
So, a single thread in each node is already good enough for the "shared-state"
scheduler to make 1000+ decisions per second. And because those claims are made
in a distributed fashion by the nodes, they are in effect written to the db by
300 nodes in parallel. AFAIK, the compute node is single threaded; it uses
greenthreads instead of real threads.

True, but keep in mind that we have a very weird design where the compute manager actually initializes 1 or more ResourceTrackers (depending on the number of "nodes" attached to the compute-manager service), which means that you can potentially have multiple synchronized sections running concurrently when trying to update the stats.

I would appreciate it if you could amend your branch to add a synchronized section there: https://github.com/cyx1231st/placement-bench/blob/shared-state-demonstration/compute_node.py#L91 so that it really simulates how the RT works. Adding the possibility of a 1:N relationship between node(s) and service would also bring the simulator closer to the real situation.
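
Something like this, roughly (a minimal sketch; IIRC the real RT serializes on
COMPUTE_RESOURCE_SEMAPHORE, and the class/method names below are placeholders,
not the actual compute_node.py code):

    from oslo_concurrency import lockutils


    class FakeComputeService(object):
        """Sketch: one service owning N nodes, with serialized claim updates."""

        def __init__(self, nodes):
            self.nodes = nodes  # 1:N relationship between service and node(s)

        @lockutils.synchronized('compute_resources')
        def claim(self, node_id, request):
            # Every claim for every node of this service serializes here,
            # mimicking how concurrent ResourceTracker updates behave.
            return self.nodes[node_id].do_claim(request)  # placeholder method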
If you want more than a single process on the compute
node to be able to handle claims of resources, you will need to modify that code
to use a compare-and-update strategy, checking a "generation" attribute on the
inventory record to ensure that another process on the compute node hasn't
simultaneously updated the allocations information for that compute node.
I still don't think the compare-and-update strategy should be forced onto "compute-local"
resources, even if the compute service is changed to use multiple processes. The scheduler
decisions about those "compute-local" resources can be checked and confirmed against the
accurate in-memory view of local resources in the resource tracker, which is far faster
than db operations. The subsequent inventory insertion can then be concurrent, without locks.

The db is only responsible for using the "compare-and-update" strategy to claim shared
resources, persisting the confirmed scheduler decision with its consumption into the
inventories, and then telling the compute service that it's OK to start the long job of
spawning the VM.

No, I think Jay's right: you should use a compare-and-update strategy when writing the "compute claim", even if it's not necessary yet, precisely for the 1:N relationship I mentioned above.
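
For reference, the generation-based compare-and-update Jay describes boils down
to something like this (table and column names are illustrative, plain DB-API
cursor, not the actual benchmark or Nova code):

    def claim_with_generation(conn, provider_id, new_used, max_retries=3):
        """Optimistic claim: only apply the update if nobody else raced us."""
        for _ in range(max_retries):
            cur = conn.cursor()
            cur.execute("SELECT generation, used, total FROM inventories "
                        "WHERE provider_id = %s", (provider_id,))
            generation, used, total = cur.fetchone()
            if used + new_used > total:
                return False  # not enough capacity, no point retrying
            cur.execute("UPDATE inventories "
                        "SET used = used + %s, generation = generation + 1 "
                        "WHERE provider_id = %s AND generation = %s",
                        (new_used, provider_id, generation))
            conn.commit()
            if cur.rowcount == 1:
                return True   # our view was current, claim succeeded
            # generation moved underneath us: re-read and retry
        return False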

Third, you have your scheduler workers consuming messages off the request
queue using get_nowait(), while you left the original placement scheduler using
the blocking get() call. :) Probably best to compare apples to apples and have
them both using the blocking get() call.
Sorry, I don't agree with this. Consuming messages and getting requests are two entirely
different things. I've tried adding a timer around the "get()" method, and there is
actually no blocking, because the requests are already prepared and put into the queue.
Note there is a "None" sentinel for every scheduler at the end of the request queue, so
the emulated scheduler stops getting requests immediately when there are no more. There
is no wait at all.

Also, if you look into the code, "get_nowait()" is simply an implementation detail to process available messages ASAP during scheduling. The
"shared-state" emulation will actually BLOCK waiting for messages from compute nodes at the end, until all the requests are confirmed
"succeeded" or "failed" by the scheduler. That wait time is already included in the "Total wallclock time".
And that's why "Placement query count" = "Placement found provider count" + "Placement no found provider count" in the
shared-state emulation.
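
Roughly, the pattern in the emulation is this (simplified, not the exact
benchmark code; function and queue names are placeholders):

    import queue  # "Queue" on python 2

    def scheduler_worker(request_queue, feedback_queue, schedule, apply_feedback):
        while True:
            request = request_queue.get()   # requests are pre-filled, no real wait
            if request is None:             # per-worker sentinel: nothing left
                break
            # Opportunistically apply any compute-node confirmations that have
            # already arrived, without ever blocking the scheduling loop.
            while True:
                try:
                    apply_feedback(feedback_queue.get_nowait())
                except queue.Empty:
                    break
            schedule(request)
        # After the sentinel, the emulation *does* block until every request is
        # confirmed "succeeded" or "failed"; that time is counted in the
        # "Total wallclock time".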

That's a fair point, I tend to agree, things are different.

First, the conclusions from the results of the eventually-consistent scheduler
state simulation (i.e. rows where "do claim in compute?" = Yes):
#accuracy
1. The final decision accuracy is 100%: no resource usage exceeds the real
capacity, as verified by checking the sanity of the db records at the end of each run.

Again, with your simulation, this assumes only a single thread will ever 
attempt a
claim on each compute node at any given time.
Explained above in [a].
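
For reference, the check at the end of each run is essentially a query of this
kind (table/column names are illustrative, not the exact benchmark schema):

    # Sanity check run after each pass: no provider may have more allocated
    # than its inventory allows.
    OVERCOMMIT_SQL = """
        SELECT a.provider_id, a.resource_class, i.total, SUM(a.used) AS allocated
          FROM allocations a
          JOIN inventories i
            ON i.provider_id = a.provider_id
           AND i.resource_class = a.resource_class
         GROUP BY a.provider_id, a.resource_class, i.total
        HAVING SUM(a.used) > i.total
    """

    def check_no_overcommit(conn):
        cur = conn.cursor()
        cur.execute(OVERCOMMIT_SQL)
        offenders = cur.fetchall()
        assert not offenders, "capacity exceeded on: %r" % (offenders,)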

2. The scheduling decision accuracy is 100% if there's only one
scheduler: all successful scheduler decisions also succeed on the
compute nodes, so no retries are recorded, i.e. "Count of requests
processed" = "Placement query count". See
http://paste.openstack.org/show/488696/
Yep, no disagreement here :)

3. The scheduling decision accuracy is 100% if "Partition strategy" is
set to "modulo", no matter how many scheduler processes there are. See
http://paste.openstack.org/show/488697/
#racing
Yep, modulo partitioning eliminates the race conditions when the number of
partitions == the number of worker processes. However, this isn't representative
of the existing scheduler system which processes every compute node in the
deployment on every call to select_destinations().
Agreed.
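
Just to spell out what the modulo partitioning means here, it is nothing more
than (sketch):

    def owns_node(scheduler_index, num_schedulers, node_id):
        """Modulo partitioning: each compute node has exactly one scheduler
        responsible for it, so two schedulers can never race for its last slot."""
        return node_id % num_schedulers == scheduler_index

    # e.g. with 8 schedulers, node 42 is only ever considered by scheduler 2
    assert owns_node(2, 8, 42)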

What happens in the shared-state-scheduler approach when you want to scale
the scheduler process out with more scheduler processes handling more load?
Yes, it needs to be carefully evaluated.

What about having two scheduler processes handling the scheduling to the same
partition (i.e. highly-available scheduling)?
No. If we only want HA scheduling, why would we want schedulers to fight each
other? We should instead use an active-passive model for HA schedulers: the
passive scheduler only accepts resource updates and does no scheduling until
the active scheduler fails.

I strongly disagree.
Our OpenStack design tenets [dt] don't call for any active-passive setup, but rather a shared-nothing approach. So having independent schedulers running in parallel is a key thing, and I wouldn't accept a limitation like this that locks us into a certain type of deployment.

Jay's point is valid. If you're running two scheduler processes at the same time, they could have different views. That said, I think it's perfectly okay to accept that there could be a discrepancy leading to retries. That's the trade-off of having the compute nodes own the information, but I think the retry ratio should be far lower than what we have at the moment, and that's what I'd like to see in your placement benchmark, Yingxin (i.e. is the retry ratio very low, and does it avoid increasing exponentially when adding more computes or requests?).

[dt] https://wiki.openstack.org/wiki/BasicDesignTenets
Both of these situations will introduce contention into the scheduling process
and introduce races that will manifest themselves on the compute nodes instead
of in the scheduler processes themselves where the total deadlock and retry
time can be limited.
4. No racing happens if there's only one scheduler process or if the "Partition
strategy" is set to "modulo", as explained by 2. and 3.

Yes, no disagreement.

5. The multiple-scheduler racing rate is extremely low using the
"spread" or "random" placement strategies of the legacy filter
scheduler: the rate is 3.0% with the "spread" strategy and 0.15% with
the "random" strategy; note that 8 workers process about
12000 requests within 20 seconds. The result is even better than the
resource-provider scheduler (rows where "do claim in compute?" = No),
which shows 82.9% with the "spread" strategy and 2.52% with the "random"
strategy for 12000 requests within 70-190 seconds. See
http://paste.openstack.org/show/488699/. Note: the retry rate is
calculated as (Placement query count - Count of requests processed) /
Count of requests processed * 100%.
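
In code, that retry-rate definition is simply:

    def retry_rate(placement_query_count, requests_processed):
        """Extra placement queries (retries) per processed request, in percent."""
        extra_queries = placement_query_count - requests_processed
        return 100.0 * extra_queries / requests_processed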
I think you will find different numbers when we introduce the cost of the
messages into the system.
Agreed.

In addition, I'd like to point out that the retries when
done in the claim-in-scheduler solution are on the order of microseconds
whereas retries in the legacy solution are on the order of seconds. The
retry/race-for-last-spot-on-node issue can be addressed using partitioning, as
mentioned above.
In the claim-in-scheduler design, claims to the same compute-local resources
from different schedulers will fight each other inside the centralized
db, even when they wouldn't actually overcommit the resources, and that causes
unnecessary retries.
But in the distributed claim design, all claims to the same compute-local resources
are sent to the corresponding compute node. There is no such contention, because the
compute node first checks "compute-local" resource consumption against its
accurate in-memory view, which is really fast.

#overwhelming messages
6. The total count of messages is affected only by the number of
schedulers and the number of schedule queries, NOT by the number of
compute nodes. See http://paste.openstack.org/show/488701/
True, however you are not actually showing the true cost of messages in the
system in the benchmark utility. An IPC queue push/pop != an AMQP broker
push/pop :)
It truly is.

7. The number of messages per successful query is (number_of_schedulers + 2)
That is still a significant increase in the number of messages over the
resource-providers-claim-in-scheduler approach though, so it's important that we
accurately account for the message costs. Not saying that you are wrong, just
want to be fair and account for the message costs accurately if we can.
True, I still don't know the actual cost.

Its growth pattern is linear and affected only by the number of scheduler
processes. And there is only 1 message if the query fails. It is not a huge
number, plus there are no additional messages needed to access the db during
scheduling.

While it's a linear growth pattern, it's still a significant coefficient that
will multiply the number of messages sent over the bus by the number of scheduler
processes. 1000 requests to launch an instance with
1 scheduler process will produce 1000 messages over the bus; 8 scheduler
processes turn that into 8000 messages.
Yes, but it depends on how much performance gain we can get from the "shared-state"
design. If the performance can be multiplied by 10, then we don't need to care about
having 8 schedulers. That's not clear yet.
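
To make the growth pattern concrete, here is the per-query formula above in
code (it only models my message count, not the broker-side cost Jay points at):

    def messages_per_run(num_schedulers, successful_queries, failed_queries=0):
        """(num_schedulers + 2) messages per successful query, 1 per failed
        query; linear in the number of schedulers."""
        return successful_queries * (num_schedulers + 2) + failed_queries

    for n in (1, 2, 4, 8):
        print(n, messages_per_run(n, successful_queries=12000))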

Honestly, that's a very good point, and I agree with Jay: we need to see how things evolve at scale. I just think that adding more messages to the queue is a reasonable trade-off, because it's the same bottleneck we have at the moment and one that operators know how to deal with. What I still don't know is whether that message increase is reasonable or whether it would flood the MQ to the point of requiring a lot more conductors, and that's my concern.

And while there are no additional messages to update the database, the
database updates themselves don't go away. Instead of X scheduler
processes updating the database, you have Y compute nodes sending
those updates. And if you use multiple threads on each compute node to handle
launch requests, you will have Y * <threads> connections to the database. If you
use the conductor to throttle those now-distributed DB writes, you will need
to send a message to the conductor to update the DB, and you're back to having
more messages in the system :)
My point is, 1 "shared-state" scheduler will have the fewest messages, without
extra overhead on the db.
If multiple schedulers are needed because of a high volume of requests, we should
then weigh the gain of less db overhead against the cost of extra messages.

Second, here's what I've found in the centralized db claim design (i.e. rows where
"do claim in compute?" = No):
1. The legacy python filtering is not slow (see rows where
"Filter strategy" = python): "Placement total query time" records the
entire query cost, including fetching states from the db and filtering
in python. The actual cost of the python filtering is
(Placement_total_query_time - Placement_total_db_query_time), and
that's only about 1/4 of the total cost or even less. It also means python
in-memory filtering is much faster than db filtering in this
experiment. See http://paste.openstack.org/show/488710/
Heh, you only tested the above on 300 compute nodes :) The more compute
nodes you have in the deployment, the more work the Python legacy filtering
needs to do. This is why it's important to re-run the benchmarks at varying 
scales
of compute nodes. I found that at small numbers of compute nodes, the
difference you see isn't great -- just like you show in the above results. 
However,
the savings you see from doing the filtering on the DB side increase the bigger
the number of compute nodes in the system.
That's my bad. I was too eager to show the results before solving the "too many
connections" problem. :(

2. The speeds of the `db filter strategy` and the legacy `python filter
strategy` are in the same order of magnitude, so not a huge
improvement. See the comparison in the "Placement total query
time" column. Note that the extra cost of the `python filter strategy` mainly
comes from "Placement total db query time" (i.e. fetching states from
the db). See http://paste.openstack.org/show/488709/
You need to test at more than a single dimension of the number of compute
nodes :)

Third, my major concern with the "centralized db claim" design is that it puts
too much scheduling work into the centralized db, and that is not scalable
simply by adding conductors and schedulers.

This is certainly a valid concern, also raised by Sylvain. I'd like to point 
out that
there is an additional benefit to the centralized DB claim strategy around 
having
a single pane of glass view of all resources in the system, including shared
resource pool inventory. This is one thing that is problematic if you claim and
account for inventory on the distributed compute nodes. Those nodes that share
resources on, say, a shared storage system, don't have the ability to properly
claim resources or accurately report capacity and usage.
Not sure I get your worries, Jay. We can still support shared resources with your inventories and allocations tables; that's the whole purpose of the resource-providers series. There are 6 (or more) specs, and generic-resource-pools focuses on that particular shared-resources problem.

Having distributed compute nodes doesn't mean we can't also have a centralized DB, right? I honestly take that point as a strawman, or I misunderstood something really important.


I fully support and admire your design for the shared-resources management part and
the resource abstraction part. I only have concerns about moving the management of
"compute-local" resources from the resource trackers to the centralized db.
I'm also very curious whether it is possible to let resource trackers do
the claims on those shared resources using your design.

1. The filtering work is mostly done inside the db by executing complex SQL. If
the filtering logic becomes much more complex (currently only CPU and RAM are
accounted for in the experiment), the db overhead will be considerable.
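
For concreteness, "complex SQL" here means something in this spirit (schema,
connection URL and values are purely illustrative, not the actual
resource-providers queries):

    from sqlalchemy import create_engine, text

    # Find providers with enough free VCPU and RAM for one request, letting
    # the database do the filtering instead of python.
    FILTER_SQL = text("""
    SELECT cpu.provider_id
      FROM inventories cpu
      JOIN inventories ram ON ram.provider_id = cpu.provider_id
     WHERE cpu.resource_class = 'VCPU'
       AND ram.resource_class = 'MEMORY_MB'
       AND cpu.total - cpu.used >= :vcpus
       AND ram.total - ram.used >= :ram_mb
     LIMIT 10
    """)

    engine = create_engine('mysql+pymysql://user:pass@localhost/placement')
    with engine.connect() as conn:
        candidates = conn.execute(FILTER_SQL, {'vcpus': 2, 'ram_mb': 2048}).fetchall()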

I continue to ask that we back up statements like this with data. Let's not
assume that a SQL statement that may seem complex at first glance will actually
be inefficient when executed by a query planner :)
Sorry, but I see in http://paste.openstack.org/show/488709/ that the SQL filtering isn't
much faster than the legacy state refresh from the db (300 nodes). And the in-memory
filtering can be 4 times faster or even 100 times faster (see the 1-worker scenario in
http://paste.openstack.org/show/488717/, column "Placement avg query time")
without optimization.

And what do you mean by "query planner"? Is that a feature of MySQL?

2. Races between centralized claims are resolved by rolling back transactions
and checking the generations (see the design of the "compare and update"
strategy in https://review.openstack.org/#/c/283253/); this also causes additional
overhead on the db.

Yes, this is absolutely correct. However, please see my note about the
compute_node.py code in your benchmark program being single-threaded and if
you want it to be multi-threaded you would need to similarly handle concurrent
writes to the same hot data location.
Please see [a].

3. The db overhead of the filtering operation can be relieved by moving
it to the schedulers; that is 38 times faster and can be executed
in parallel by the schedulers, according to the "Placement avg query
time" column. See http://paste.openstack.org/show/488715/
4. The "compare and update" overhead can be partially relieved by using
distributed resource claims in the resource trackers. There is no need to roll back
transactions when updating inventories of compute-local resources in order to stay
accurate. This is confirmed by checking the db records at the end of each run of the
eventually-consistent scheduler state design.

Again, here you are assuming a single-threaded compute node.
Please refer to [a].

5. If a) all the filtering operations are done inside the schedulers,
          b) schedulers do not need to refresh their caches from the db because of
incremental updates,
          c) there is no need to do "compare and update" for compute-local
resources (i.e. non-shared resources),
       then here is the performance comparison using 1 scheduler
instance: http://paste.openstack.org/show/488717/
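
A rough sketch of what a), b) and c) amount to on the scheduler side (names
illustrative, not the actual implementation):

    class SchedulerCache(object):
        """Per-node in-memory view kept fresh by incremental updates."""

        def __init__(self):
            self.nodes = {}  # node_id -> {'free_vcpus': int, 'free_ram_mb': int}

        def apply_update(self, node_id, delta):
            # b) computes push deltas; no cache refresh from the db is needed.
            view = self.nodes.setdefault(node_id,
                                         {'free_vcpus': 0, 'free_ram_mb': 0})
            for key, value in delta.items():
                view[key] += value

        def filter_hosts(self, vcpus, ram_mb):
            # a) all filtering happens in the scheduler's memory, and
            # c) no compare-and-update is issued for compute-local resources;
            #    the compute node's own view is the final arbiter.
            return [node_id for node_id, view in self.nodes.items()
                    if view['free_vcpus'] >= vcpus
                    and view['free_ram_mb'] >= ram_mb]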

Finally, it is not fair to directly compare the actual capability of the
resource-provider scheduler and the shared-state scheduler using this benchmarking
tool, because 300 extra processes need to be created in order to simulate the
distributed resource management of 300 compute nodes, and there are no
conductors or MQ in the simulation. But I think it is still useful to provide
at least some statistics.

Agreed. I've been trying to think of a way that would be fair to emulate the
shared-state-scheduler approach and have yet to come up with a good
solution :( I'll try on the plane ride back home tomorrow to give this some
thought. :)
To shed some light here :)
What I'm trying to do is inject additional log statements at all the critical
places in nova-api, nova-scheduler, nova-compute and nova-conductor, then use the
fake virt driver and fake hosts to start compute nodes as separate processes.
Finally, I can analyze the timestamped logs and collect statistics while processing
multiple schedule requests.
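
For the analysis part, something as simple as this could do (the log format and
tags here are my own assumptions, not existing nova log lines):

    import re
    from datetime import datetime

    # Assumed injected log line format, e.g.:
    # 2016-03-04 09:25:01.123 SCHED_START req-abc123
    LINE_RE = re.compile(r'^(?P<ts>[\d-]+ [\d:.]+) (?P<tag>\w+) (?P<req>req-\S+)$')

    def request_latencies(log_lines):
        """Pair SCHED_START/CLAIM_DONE tags per request and return latencies (s)."""
        starts, latencies = {}, {}
        for line in log_lines:
            m = LINE_RE.match(line.strip())
            if not m:
                continue
            ts = datetime.strptime(m.group('ts'), '%Y-%m-%d %H:%M:%S.%f')
            if m.group('tag') == 'SCHED_START':
                starts[m.group('req')] = ts
            elif m.group('tag') == 'CLAIM_DONE' and m.group('req') in starts:
                latencies[m.group('req')] = (ts - starts[m.group('req')]).total_seconds()
        return latencies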


Cool, ping us if you need assistance.
-Sylvain

Best,
-jay

Regards,
-Yingxin

