Dom0 CPU has to serve:

1. All I/O for all domains, including inspection / routing of network packets at the bridge.
2. All device emulation for non-PV domains, which is particularly expensive and unavoidable during Windows boot.
3. Display emulation for all HVM domains, which is moderately expensive.
4. VNC service, any time a customer wants to see the console, which is moderately expensive.
5. Performance metrics sampling and aggregation.
6. Control-plane operations.
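To see how this adds up on a given host, xentop in dom0 will show you per-domain CPU figures.  Here is a rough sketch that samples them from batch mode -- it assumes the batch output keeps the NAME and CPU(%) columns, that VM name-labels contain no spaces, and it sticks to Python 2.4 syntax because that is what dom0 ships with:

    # Rough sketch: sample per-domain CPU from dom0 via xentop's batch mode.
    # Assumes the batch output has NAME and CPU(%) columns and that VM
    # name-labels contain no spaces.
    import os

    def domain_cpu_percent():
        lines = os.popen("xentop -b -i 2 -d 1").read().splitlines()
        # Use the second sample; the first has no interval to compute CPU(%) over.
        headers = [i for i, l in enumerate(lines) if l.strip().startswith("NAME")]
        if not headers:
            return {}
        rows = lines[headers[-1]:]
        cpu_idx = rows[0].split().index("CPU(%)")
        usage = {}
        for row in rows[1:]:
            fields = row.split()
            try:
                usage[fields[0]] = float(fields[cpu_idx])
            except (IndexError, ValueError):
                pass
        return usage

    if __name__ == "__main__":
        for name, pct in sorted(domain_cpu_percent().items()):
            print "%-20s %5.1f%%" % (name, pct)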
Each individual thing isn't huge, but you've only got one domain 0 with four VCPUs, and it has to serve tens of domUs.  It starts to add up.  If you put something CPU-intensive in there too, such as a gunzip or some crypto, then you can find yourself with 30 customer VMs trying to funnel I/O through one or two dom0 CPUs, and we simply run out.

The majority of the cost of I/O is the copy of the payload.  We're doing work at the moment to move that cost into the domU, so that it's accounted to the domU's CPU time, not dom0's.  This will improve fairness between customer VMs, because while the cost of I/O is inside dom0, you can't ensure fairness between I/O-intensive and CPU-intensive customer VMs.

Yes, the hypervisor does generally schedule dom0 and domUs similarly.  There's actually a boost given to dom0, so if it has a VCPU ready to run, it will be scheduled ahead of a domU, but other than that they're basically the same.  The main problem is that there's only one dom0 and lots of domUs.

Cheers,

Ewan.

> -----Original Message-----
> From: Chris Behrens [mailto:chris.behr...@rackspace.com]
> Sent: 16 February 2011 17:13
> To: Ewan Mellor
> Cc: Chris Behrens; Rick Harris; openstack-xenapi@lists.launchpad.net
> Subject: Re: [Openstack-xenapi] Glance Plugin/DomU access to SR?
>
> Ewan,
>
> Can you explain why you say dom0 CPU is a scarce resource?  I agree that, for a lot of reasons, work like this should be done in a domU, but I'm just curious.  My thoughts would have been that it's not so scarce.  I know there are things like the disk drivers running in the dom0 kernel doing disk I/O, but I'd think that'd not be much CPU usage.  It'd be mostly I/O wait.  And I wouldn't think network receive in dom0 vs domU would cause much of a difference overall.  I thought the hypervisor scheduled dom0 and domUs similarly.  Am I wrong?
>
> The only thing I can think of is that when running HVM VMs, qemu can be using a lot of CPU.
>
> - Chris
>
> On Feb 16, 2011, at 7:12 AM, Ewan Mellor wrote:
>
> > Just for summary, the advantages of having the streaming inside a domU are:
> >
> > 1. You move the network receive and the image decompression / decryption (if you're using that) off dom0's CPU and onto the domU's.  Dom0 CPU is a scarce resource, even in the new release of XenServer with 4 CPUs in domain 0.  This avoids hurting customer workloads by contending inside domain 0.
> >
> > 2. You can easily apply network and CPU QoS to the operations above.  This also avoids hurting customer workloads, by simply capping the maximum amount of work that the OpenStack domU can do.
> >
> > 3. You can use Python 2.6 for OpenStack, even though XenServer dom0 is stuck on CentOS 5.5 (Python 2.4).
> >
> > 4. You get a minor security improvement, because you get to keep a network-facing service out of domain 0.
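An aside on point 2 above: the QoS knobs are just VM and VIF parameters in the XenAPI, so capping the OpenStack domU is only a few lines of Python.  A minimal sketch, assuming the standard XenAPI Python bindings; the host address, password, name-label and numbers are made up, and the settings are generally picked up when the domU and its VIFs next start:

    # Minimal sketch: throttle the OpenStack worker domU so image streaming
    # can't starve customer VMs.  Name-label, address and numbers are assumed.
    import XenAPI

    session = XenAPI.Session("http://xenserver-host")
    session.xenapi.login_with_password("root", "password")
    try:
        vm = session.xenapi.VM.get_by_name_label("openstack-compute-domU")[0]

        # Credit-scheduler parameters: weight is a relative share, cap is a
        # hard limit in percent of one physical CPU (100 == one full CPU).
        session.xenapi.VM.set_VCPUs_params(vm, {"weight": "256", "cap": "100"})

        # Rate-limit the domU's VIFs (kbps); applied when the VIF is plugged.
        for vif in session.xenapi.VM.get_VIFs(vm):
            session.xenapi.VIF.set_qos_algorithm_type(vif, "ratelimit")
            session.xenapi.VIF.set_qos_algorithm_params(vif, {"kbps": "51200"})
    finally:
        session.xenapi.session.logout()

The same thing can be done from the xe CLI with vm-param-set VCPUs-params:... and vif-param-set qos_algorithm_type=ratelimit.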
> > So, this is all fine if you're streaming direct to disk, but as you say, if you want to stream VHD files you have a problem, because the VHD needs to go into a filesystem mounted in domain 0.  It's not possible to write from a domU into a dom0-owned filesystem without some trickery.  Here are the options as I see them:
> >
> > Option A: Stream in two stages, one from Glance to domU, then from domU to dom0.  The stream from domU to dom0 could just be a really simple network put, and would just fit on the end of the current pipeline.  You lose a bit of dom0 CPU, because of the incoming stream, and it's less efficient overall, because of the two hops.  Its primary advantage is that you can still do most of the work inside the domU, so if you are intending to decompress and/or decrypt locally, then this would likely be a win.
> >
> > Option B: Stream from Glance directly into dom0.  This would be a xapi plugin acting as a Glance client.  This is the simplest solution, but loses all the benefits above.  I think it's the one that you're suggesting below.  This leaves you with similar performance problems to the ones that you suffer today on your existing architecture.  The advantage here is simplicity, and it's certainly worth considering.
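To make Option B concrete, here is a rough sketch of the dom0 side: a xapi plugin that acts as a Glance client and writes the VHD straight into the SR's mount point.  The Glance URL layout, the SR path, the argument names and the plugin name are all assumptions, and there is no auth, retry or error handling:

    #!/usr/bin/env python
    # Rough sketch of /etc/xapi.d/plugins/glance (dom0 side of Option B).
    # Pulls a VHD from Glance over plain HTTP and drops it into the SR's
    # filesystem.  URL layout, SR path and argument names are assumptions.
    import os
    import httplib
    import shutil

    import XenAPIPlugin

    CHUNK = 4 * 1024 * 1024

    def download_vhd(session, args):
        host = args["glance_host"]
        port = int(args["glance_port"])
        image_id = args["image_id"]
        sr_path = args["sr_path"]      # e.g. /var/run/sr-mount/<sr-uuid>
        vdi_uuid = args["vdi_uuid"]

        conn = httplib.HTTPConnection(host, port)
        conn.request("GET", "/images/%s" % image_id)
        resp = conn.getresponse()
        if resp.status != 200:
            raise Exception("Glance returned %d" % resp.status)

        target = os.path.join(sr_path, "%s.vhd" % vdi_uuid)
        out = open(target, "wb")
        try:
            shutil.copyfileobj(resp, out, CHUNK)
        finally:
            out.close()
            conn.close()
        return target

    if __name__ == "__main__":
        XenAPIPlugin.dispatch({"download_vhd": download_vhd})

The domU side would then call it with something like session.xenapi.host.call_plugin(host_ref, "glance", "download_vhd", args), followed by an SR.scan so that xapi notices the new VHD.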
> > Option C: Run an NFS server in domain 0, and mount that inside the domU.  You can then write direct to dom0's filesystem from the domU.  This sounds plausible, but I don't think that I recommend it.  The load on dom0 of doing this is probably no better than Options A or B, which would mean that the complexity wasn't worth it.
> >
> > Option D: Unpack the VHD file inside the domU, and write it through the PV path.  This is probably the option that you haven't considered yet.  The same VHD parsing code that we use in domain 0 is also available in an easily consumable form (called libvhdio).  This can be used to take a VHD file from the network and parse it, so that you can write the allocated blocks directly to the VDI.  This would have all the advantages above, but it adds yet another moving part to the pipeline.  Also, this is going to be pretty simple only if you're using VHDs just as a way to handle sparseness.  If you're expecting to stream a whole tree of snapshots as multiple files, and then expect all the relationships between the files to get wired up correctly, then this is not the solution you're looking for.  It's technically doable, but it's very fiddly.
> >
> > So, in summary:
> >
> > Option A: Two hops.  Ideal if you're worried about the cost of decompressing / decrypting on the host.
> > Option B: Direct to dom0.  Ideal if you want the simplest solution.
> > Option D: Parse the VHD.  Probably best performance.  Fiddly development work required.  Not a good idea if you want to work with trees of VHDs.
> >
> > Where do you think you stand?  I can advise in more detail about the implementation, if you have a particular option that you prefer.
> >
> > Cheers.
> >
> > Ewan.
> >
> > From: openstack-xenapi-bounces+ewan.mellor=citrix....@lists.launchpad.net [mailto:openstack-xenapi-bounces+ewan.mellor=citrix....@lists.launchpad.net] On Behalf Of Rick Harris
> > Sent: 11 February 2011 22:13
> > To: openstack-xenapi@lists.launchpad.net
> > Subject: [Openstack-xenapi] Glance Plugin/DomU access to SR?
> >
> > We recently moved to running the compute-worker within a domU instance.
> >
> > We could make this move because domU can access VBDs in dom0-space by performing a VBD.plug.
> >
> > The problem is that we'd like to deal with whole VHDs rather than kernel, ramdisk, and partitioning (the impetus of the unified-images BP).
> >
> > So, for snapshots we stream the base copy VHD held in the SR into Glance, and, likewise, for restores, we stream the snapshot VHD from Glance into the SR, rescan, and then spin up the instance.
> >
> > The problem is: now that we're running the compute-worker in domU, how can we access the SR?  Is there a way we can map it into domU space (a la VBD.plug)?
> >
> > The way we solved this for snapshots was by using the Glance plugin and performing these operations in dom0.
> >
> > So, my questions are:
> >
> > 1. Are SR operations something we need to use the Glance plugin for?
> >
> > 2. If we must use a dom0 plugin for this method of restore, does it make sense to just do everything image-related in the plugin?
> >
> > -Rick

_______________________________________________
Mailing list: https://launchpad.net/~openstack-xenapi
Post to     : openstack-xenapi@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack-xenapi
More help   : https://help.launchpad.net/ListHelp