+openstack-operators to see if others have the same use case
On 5/31/2018 5:14 PM, Moore, Curt wrote:
We recently upgraded from Liberty to Pike and, looking ahead to the
code in Queens, noticed the image download deprecation notice with
instructions to post here if this interface was in use. As such, I’d
like to explain our use case and see if there is a better way of
accomplishing our goal, or lobby for the "un-deprecation" of this
extension point.
Thanks for speaking up - this is much easier *before* code is removed.
As with many installations, we are using Ceph for both our Glance image
store and VM instance disks. In a normal workflow, when both Glance and
libvirt are configured to use Ceph, libvirt reacts to the direct_url
field on the Glance image and performs an in-place clone of the RAW
disk image from the images pool into the vms pool, all within Ceph. The
snapshot creation process is very fast and is thinly provisioned as
it’s a COW snapshot.
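For reference, here is a minimal sketch of what that clone boils down
to with the rbd Python bindings (the conffile path, pool names, the
'snap' snapshot name and the image/instance identifiers are just
illustrative assumptions, not our actual code):

    import rados
    import rbd

    # Illustrative only: conffile, pool and image names are assumptions.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        images_ioctx = cluster.open_ioctx('images')
        vms_ioctx = cluster.open_ioctx('vms')
        try:
            # Glance keeps a protected snapshot (commonly named 'snap') on
            # each RAW image; cloning from it is copy-on-write, so no data
            # is copied up front and the new disk appears almost instantly.
            rbd.RBD().clone(images_ioctx, '<glance-image-id>', 'snap',
                            vms_ioctx, '<instance-uuid>_disk')
        finally:
            vms_ioctx.close()
            images_ioctx.close()
    finally:
        cluster.shutdown()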
This underlying workflow itself works great; the issue is with the
performance of the VM’s disk within Ceph, especially as the number of
nodes within the cluster grows. We have found, especially with Windows
VMs (largely as a result of I/O for the Windows pagefile), that the
performance of the Ceph cluster as a whole takes a very large hit in
keeping up with all of this I/O thrashing, especially when Windows is
booting. This is not the case with Linux VMs, as they do not use swap
as frequently as Windows nodes do with their pagefiles. Windows can be
run without a pagefile, but that leads to other oddities within
Windows.
I should also mention that in our case, the nodes themselves are
ephemeral and we do not care about live migration, etc.; we just want
raw performance.
As an aside on our Ceph setup, without getting into too many details:
we have very fast SSD-based Ceph nodes for this pool (separate crush
root, SSDs for both OSDs and journals, 2 replicas), interconnected on
the same switch backplane, each with bonded 10Gb uplinks to the switch.
Our Nova nodes are within the same datacenter (they also have bonded
10Gb uplinks to their switches) but are distributed across different
switches. We could move the Nova nodes to the same switch as the Ceph
nodes, but that is a larger logistical challenge, as it would mean
rearranging many servers to make space.
Back to our use case: in order to isolate this heavy I/O, a subset of
our compute nodes has a local SSD and is set to use qcow2 images
instead of rbd, so that libvirt will pull the image down from Glance
into the node’s local image cache and run the VM from the local SSD.
This allows Windows VMs to boot and perform their initial
cloudbase-init setup/reboot within ~20 sec vs. 4-5 min, regardless of
overall Ceph cluster load. Additionally, this prevents us from
"wasting" IOPS and instead keeps them local to the Nova node,
reclaiming the network bandwidth and Ceph IOPS for use by Cinder
volumes. This is essentially the use case outlined in the "Do
designate some non-Ceph compute hosts with low-latency local storage"
section here:
https://ceph.com/planet/the-dos-and-donts-for-ceph-for-openstack/
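For context, the split between node classes is really just the libvirt
image backend selection in nova.conf; a sketch of the relevant options
(the commented rbd values are assumptions shown for illustration, not
our exact settings):

    [libvirt]
    # SSD compute nodes: cache the image locally and run the VM as
    # qcow2 from the node-local SSD.
    images_type = qcow2

    # Ceph-backed compute nodes use the rbd backend instead, e.g.:
    # images_type = rbd
    # images_rbd_pool = vms
    # images_rbd_ceph_conf = /etc/ceph/ceph.conf
    # rbd_user = cinder
    # rbd_secret_uuid = <libvirt secret uuid>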
The challenge is that the Glance image transfer is _glacially slow_
when using the Glance HTTP API (~30 min for a 50GB Windows image; it’s
Windows, so it’s huge with all of the necessary tools installed). If
libvirt can instead perform an RBD export on the image using the image
download functionality, it is able to download the same image in ~30
sec. We have code that performs the direct download from Glance over
RBD, and it works great in our use case; it is very similar to the
code in this older patch:
https://review.openstack.org/#/c/44321/
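The core of it is nothing exotic; a rough, simplified sketch of the
RBD-export-to-local-file step (this is not the exact deprecated Nova
download-handler interface, and the function name, chunk size and
defaults are illustrative assumptions):

    import rados
    import rbd

    CHUNK = 8 * 1024 * 1024  # 8 MiB reads; arbitrary choice for the sketch


    def export_rbd_image(pool, image_name, snap_name, dst_path,
                         conffile='/etc/ceph/ceph.conf'):
        """Copy a RAW Glance image from RBD to a local file, chunk by chunk."""
        cluster = rados.Rados(conffile=conffile)
        cluster.connect()
        try:
            ioctx = cluster.open_ioctx(pool)
            try:
                image = rbd.Image(ioctx, image_name, snapshot=snap_name,
                                  read_only=True)
                try:
                    size = image.size()
                    with open(dst_path, 'wb') as dst:
                        offset = 0
                        while offset < size:
                            length = min(CHUNK, size - offset)
                            dst.write(image.read(offset, length))
                            offset += length
                finally:
                    image.close()
            finally:
                ioctx.close()
        finally:
            cluster.shutdown()

Because the read happens over the Ceph/storage network rather than
through the Glance HTTP API, this is what gets the 50GB image onto the
node in ~30 sec instead of ~30 min.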
It looks like at the time this had general approval (i.e. it wasn't
considered crazy) but was blocked simply due to the Havana feature
freeze. That's good to know.
We could look at attaching an additional ephemeral disk to the instance
and having cloudbase-init use it as the pagefile, but it appears that
if libvirt is using rbd for its images_type, _all_ disks must then come
from Ceph; there is no way at present to allow the VM image to run from
Ceph and have an ephemeral disk mapped in from node-local storage. Even
so, this would have the effect of "wasting" Ceph IOPS for the VM disk
itself, which could be better used for other purposes.
When you mentioned swap above I was thinking of something similar,
attaching a swap device, but as you've pointed out, all disks local to
the compute host are going to use the same image type backend, so you
can't have the root disk and swap/ephemeral disks using different
image backends.
Based on what I have explained about our use case, is there a
better/different way to accomplish the same goal without using the
deprecated image download functionality? If not, can we work to
"un-deprecate" the download extension point? Should I work to get the
code for this RBD download into the upstream repository?
I think you should propose your changes upstream with a blueprint, the
docs for the blueprint process are here:
https://docs.openstack.org/nova/latest/contributor/blueprints.html
Since it's not an API change, this might just be a specless blueprint,
but you'd need to write up the blueprint and probably post the PoC code
to Gerrit and then bring it up during the "Open Discussion" section of
the weekly nova meeting.
Once we can take a look at the code change, we can go from there on
whether or not to add that in-tree or go some alternative route.
Until that happens, I think we'll just say we won't remove that
deprecated image download extension code, but that won't hold for an
unlimited amount of time if you don't propose your changes upstream.
Is there going to be anything blocking or slowing you down on your end
with regard to contributing this change, like legal approval, license
agreements, etc.? If so, please be up front about that.
--
Thanks,
Matt