Mathieu Gagné wrote:
Hi,
For those who attended the OpenStack Ops meetup, you probably heard
me complaining about a serious performance issue we had with Nova
scheduler (Kilo) with Ironic.
BTW, thanks for helping push this and complaining about it and ...
It's a tough and thankless job but it's needed IMHO :)
Without further ado,
Thanks to Sean Dague and Matt Riedemann, we found the root cause.
It was caused by this block of code [1] which is hitting the database
for each node loaded by the scheduler. This block of code is called if
no instance info is found in the scheduler cache.
I found that this instance info is only populated if the
scheduler_tracks_instance_changes config [2] is enabled, which it is by
default. But being a good operator (wink wink), I followed the Ironic
install guide which recommends disabling it [3], unknowingly getting
myself into deep trouble.
There isn't much information about the purpose of this config in the
kilo branch. Fortunately, you can find more info in the master branch
[4], thanks to the config documentation effort. This instance info
cache is used by filters which rely on instance location to perform
affinity/anti-affinity placement or anything that cares about the
instances running on the destination node.
Enabling this option will make it so Nova scheduler loads instance
info asynchronously at start up. Depending on the number of
hypervisors and instances, it can take several minutes. (we are
talking about 10-15 minutes with 600+ Ironic nodes, or ~1s per node in
our case)
This feels like a classic thing that could just be made better by a
scatter/gather (in threads or other?) to the database or other service.
1s per node seems ummm, sorta bad and/or non-optimal (I wonder if this
is low hanging fruit to improve this). I can travel around the world 7.5
times in that amount of time (if I was a light beam, haha).
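To make the scatter/gather idea concrete, here is a minimal sketch in plain Python. It is not Nova code: load_instance_info is a hypothetical stand-in for the per-node database lookup, and the thread pool size is an arbitrary assumption. Whether plain threads fit the scheduler's own concurrency model is not something this sketch addresses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_instance_info(node):
    """Hypothetical stand-in for the per-node database query."""
    time.sleep(0.01)  # pretend each lookup blocks on I/O
    return {"node": node, "instances": []}


def gather_instance_info(nodes, max_workers=32):
    """Scatter one lookup per node across threads, then gather the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input nodes
        results = list(pool.map(load_instance_info, nodes))
    return {r["node"]: r["instances"] for r in results}


nodes = ["node-%d" % i for i in range(64)]
info = gather_instance_info(nodes)
```

With blocking lookups, the total time is roughly the per-node latency times the number of nodes divided by the number of workers, instead of latency times the number of nodes.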
So Jim Roll jumped into the discussion on IRC and found a bug [5] he
opened and fixed in Liberty. It makes it so Nova scheduler never
populates the instance info cache if Ironic host manager is loaded.
For those running Nova with Ironic, you will agree that there is no
known use case where affinity/anti-affinity is used. (please reply if
you know of one)
To summarize, the poor performance of Nova scheduler will only show if
you are running the Kilo version of Nova and you disable
scheduler_tracks_instance_changes which might be the case if you are
running Ironic too.
For those curious about our Nova scheduler + Ironic setup, we have
done the following to get nova scheduler to ludicrous speed:
1) Use CachingScheduler
There was a great talk at the OpenStack Summit about why you would
want to use it. [6]
By default, the Nova scheduler will load ALL nodes (hypervisors) from
database to memory before each scheduling. If you have A LOT of
hypervisors, this process can take a while. This means scheduling
won't happen until this step is completed. It could also mean that
scheduling will always fail if you don't tweak service_down_time (see
3 below) if you have a lot of hypervisors.
This driver will make it so nodes (hypervisors) are loaded in memory
every ~60 seconds. Since information is now pre-cached, the scheduling
process can happen right away, and it is super fast.
There are a lot of side effects to using it though. For example:
- you can only run ONE nova-scheduler process since cache state won't
be shared between processes and you don't want instances to be
scheduled twice to the same node/hypervisor.
Out of curiosity, do you have only one scheduler process active and
passive scheduler process(es) idle waiting to become active if the other
scheduler dies? (pretty simply done via something like
https://kazoo.readthedocs.io/en/latest/api/recipe/election.html) Or do
you have some manual/other process that kicks off a new scheduler if the
'main' one dies?
- It can take ~1m before new capacity is recognized by the scheduler.
(new or freed nodes) The cache is refreshed every 60 seconds with a
periodic task. (this can be changed with scheduler_driver_task_period)
In the context of Ironic, it is a compromise we are willing to accept.
We are not adding Ironic nodes that often and nodes aren't
created/deleted as often as virtual machines.
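The trade-off described above (fast scheduling, but up to one period of staleness) can be illustrated with a small time-based cache. This is only a sketch and not the CachingScheduler implementation: the class name, the loader callable and the injectable clock are all assumptions, and it refreshes lazily on access where the real driver relies on a periodic task.

```python
import time


class CachedHostList:
    """Illustrative cache: reload the node list at most once per period."""

    def __init__(self, loader, period=60.0, clock=time.monotonic):
        self._loader = loader    # hypothetical callable returning all nodes
        self._period = period    # seconds between reloads
        self._clock = clock      # injectable so the behaviour can be tested
        self._hosts = None
        self._loaded_at = None

    def get_hosts(self):
        now = self._clock()
        expired = (self._loaded_at is None
                   or now - self._loaded_at >= self._period)
        if expired:
            # Only this path pays the cost of the slow load
            self._hosts = self._loader()
            self._loaded_at = now
        return self._hosts


# Usage with a fake clock and a loader that counts its calls
calls = []
fake_now = [0.0]


def loader():
    calls.append(1)
    return ["node-%d" % len(calls)]


cache = CachedHostList(loader, period=60.0, clock=lambda: fake_now[0])
first = cache.get_hosts()    # triggers the initial load
fake_now[0] = 30.0
second = cache.get_hosts()   # served from cache, possibly stale
fake_now[0] = 61.0
third = cache.get_hosts()    # period elapsed, reloads
```

The second call shows the side effect: anything that changed between t=0 and t=30 is invisible until the next reload.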
2) Run a single nova-compute service
I strongly suggest you DO NOT run multiple nova-compute services. If
you do, you will have duplicated hypervisors loaded by the scheduler
and you could end up with conflicting scheduling. You will also have
twice as many hypervisors to load in the scheduler.
This seems scary (whenever I hear run a single of anything in a *cloud*
platform, that makes me shiver). It'd be nice if we at least recommended
people run
https://kazoo.readthedocs.io/en/latest/api/recipe/election.html or have
some active/passive automatic election process to handle that single
thing dying (which they usually do, at odd times of the night). Honestly
I'd (personally) really like to get to the bottom of how we as a group
of developers ever got to the place where software was released (and/or
even recommended to be used) in a *cloud* platform that ever required
only one of anything to be run (that's crazy bonkers, and yes there is
history here, but damn, it just feels rotten as all hell, for lack of
better words).
Note: I heard about multiple compute host support in Nova for Ironic
using a hash ring, but I don't have many details about it. So
this recommendation might not apply to you if you are using a recent
version of Nova.
3) Increase service_down_time
If you have a lot of nodes, you might have to increase this value
which is set to 60 seconds by default. This value is used by the
ComputeFilter filter to exclude nodes it hasn't heard from. If it
takes more than 60 seconds to load the list of nodes, you can guess
what will happen: the scheduler will reject all of them, since node
info is already outdated when it finally hits the filtering steps. I
strongly suggest you tweak this setting, regardless of the use of
CachingScheduler.
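The failure mode above can be modelled in a few lines. This is a simplified model rather than the ComputeFilter code: the function names are made up, heartbeats are plain numbers in seconds relative to the moment loading starts, and the filter is assumed to run only once loading has finished.

```python
def is_service_up(last_heartbeat, now, service_down_time=60):
    """Treat a service as up if its last heartbeat is recent enough."""
    return (now - last_heartbeat) <= service_down_time


def surviving_nodes(heartbeats, load_duration, service_down_time=60):
    """Return the nodes that still look alive once loading has finished.

    heartbeats maps node name -> last heartbeat, in seconds relative to
    the start of loading (0 means "just now", -10 means 10 seconds ago).
    """
    now = load_duration  # filtering happens only after the load completes
    return [node for node, hb in heartbeats.items()
            if is_service_up(hb, now, service_down_time)]


heartbeats = {"node-a": 0, "node-b": -10}
rejected_all = surviving_nodes(heartbeats, load_duration=90)
kept_all = surviving_nodes(heartbeats, load_duration=90,
                           service_down_time=120)
```

With a 90 second load and the default of 60, every node is already considered down by the time the filter sees it. Raising the threshold above the load time keeps them.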
Same kind of feeling I had above also applies, something feels broken if
such things have to be found by operators (I'm pretty sure yahoo when I
was there saw something similar) and not by the developers making the
software. If I could (and I know I really can't due to the community we
work in) I'd very much have an equivalent of a retrospective around how
these kinds of solutions got built and how they ended up getting
released to the wider public with such flaws....
4) Tweak scheduler to only load empty nodes/hypervisors
So this is a hack [7] we did before finding out about the bug [5] we
described and identified earlier. When investigating our performance
issue, we enabled debug logging and saw that the periodic task was taking
forever to complete (10-15m) with the CachingScheduler driver.
We knew (strongly suspected) Nova scheduler was spending a huge amount
of time loading nodes/hypervisors. We (unfortunately) didn't push
our investigation further and jumped right away to the optimization phase.
So we came up with the idea of only loading empty nodes/hypervisors.
Remember, we are still in the context of Ironic, not cloud and virtual
machines. So it made perfect sense for us to stop spending time
loading nodes/hypervisors we would discard anyway.
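As an illustration of the idea only (this is not the code from the gist [7]), here is a sketch that drops occupied nodes before any per-node work is done. The dictionary layout and the running_instances key are hypothetical. The actual hack narrows down what gets loaded in the first place, rather than filtering in Python afterwards.

```python
def empty_nodes(nodes):
    """Keep only nodes with nothing running on them.

    Each node is a dict with a hypothetical 'running_instances' count.
    A bare-metal node hosts at most one instance, so an occupied node
    can never be a valid scheduling target.
    """
    return [n for n in nodes if n.get("running_instances", 0) == 0]


all_nodes = [
    {"name": "node-1", "running_instances": 1},
    {"name": "node-2", "running_instances": 0},
    {"name": "node-3", "running_instances": 0},
]
candidates = empty_nodes(all_nodes)
candidate_names = [n["name"] for n in candidates]
```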
Thanks to all who helped us debug our scheduling performance
issues. It is now crazy fast. =)
[1]
https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L589-L592
[2]
https://github.com/openstack/nova/blob/kilo-eol/nova/scheduler/host_manager.py#L65-L68
[3]
http://docs.openstack.org/developer/ironic/deploy/install-guide.html#configure-compute-to-use-the-bare-metal-service
[4]
https://github.com/openstack/nova/blob/282c257aff6b53a1b6bb4b4b034a670c450d19d8/nova/conf/scheduler.py#L166-L185
[5] https://bugs.launchpad.net/nova/+bug/1479124
[6] https://www.youtube.com/watch?v=BcHyiOdme2s
[7] https://gist.github.com/mgagne/1fbeca4c0b60af73f019bc2e21eb4a80
--
Mathieu
_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators