Cool, thanks!
On Feb 16, 2011, at 9:30 AM, Ewan Mellor wrote:

> Dom0 CPU has to serve:
>
> 1. All I/O for all domains, including inspection / routing of network packets at the bridge.
> 2. All device emulation for non-PV domains, which is particularly expensive and unavoidable during Windows boot.
> 3. Display emulation for all HVM domains, which is moderately expensive.
> 4. VNC service, any time a customer wants to see the console, which is moderately expensive.
> 5. Performance metrics sampling and aggregation.
> 6. Control-plane operations.
>
> Each individual thing isn't huge, but you've only got one domain 0 with four VCPUs, and it has to serve tens of domUs. It starts to add up. If you put something CPU-intensive in there too, such as a gunzip or some crypto, then you can find yourself with 30 customer VMs trying to funnel I/O through one or two dom0 CPUs, and we simply run out.
>
> The majority of the cost of I/O is the copy of the payload. We're doing work at the moment to move that cost into the domU, so that it's accounted to the domU CPU time, not dom0. This will improve fairness between customer VMs, because if the cost of I/O is inside dom0, you can't ensure fairness between I/O-intensive customer VMs vs CPU-intensive ones.
>
> Yes, the hypervisor does generally schedule dom0 and domUs similarly. There's actually a boost given to dom0, so if it has a VCPU ready to run, it will be scheduled over a domU, but other than that they're basically the same. The main problem is that there's only one dom0 and lots of domUs.
>
> Cheers,
>
> Ewan.
>
>
>> -----Original Message-----
>> From: Chris Behrens [mailto:chris.behr...@rackspace.com]
>> Sent: 16 February 2011 17:13
>> To: Ewan Mellor
>> Cc: Chris Behrens; Rick Harris; openstack-xenapi@lists.launchpad.net
>> Subject: Re: [Openstack-xenapi] Glance Plugin/DomU access to SR?
>>
>> Ewan,
>>
>> Can you explain why you say dom0 CPU is a scarce resource? I agree for a lot of reasons work like this should be done in a domU, but I'm just curious. My thoughts would have been that it's not so scarce. I know there are things like the disk drivers running in the dom0 kernel doing disk I/O, but I'd think that'd not be much CPU usage. It'd be mostly I/O wait. And I wouldn't think network receive in dom0 vs domU would cause much of a difference overall. I thought the hypervisor scheduled dom0 and domUs similarly. Am I wrong?
>>
>> The only thing I can think of is when running HVM VMs, qemu can be using a lot of CPU.
>>
>> - Chris
>>
>>
>>
>> On Feb 16, 2011, at 7:12 AM, Ewan Mellor wrote:
>>
>>> Just for summary, the advantages of having the streaming inside a domU are:
>>>
>>> 1. You move the network receive and the image decompression / decryption (if you're using that) off dom0's CPU and onto the domU's. Dom0 CPU is a scarce resource, even in the new release of XenServer with 4 CPUs in domain 0. This avoids hurting customer workloads by contending inside domain 0.
>>> 2. You can easily apply network and CPU QoS to the operations above. This also avoids hurting customer workloads, by simply capping the maximum amount of work that the OpenStack domU can do.
>>> 3. You can use Python 2.6 for OpenStack, even though XenServer dom0 is stuck on CentOS 5.5 (Python 2.4).
>>> 4. You get a minor security improvement, because you get to keep a network-facing service out of domain 0.
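
To make point 2 above concrete, here is a minimal sketch of how such CPU and network caps might be applied through the XenAPI Python bindings. The host address, credentials, domU name label, and cap values are placeholders rather than anything from this thread:

    # Illustrative sketch only: cap the OpenStack domU's CPU and network share
    # via XenAPI.  Host, credentials, VM name label, and cap values are
    # placeholders, not values taken from this thread.
    import XenAPI

    session = XenAPI.Session("http://xenserver-host")        # placeholder host
    session.xenapi.login_with_password("root", "password")   # placeholder credentials
    try:
        vm = session.xenapi.VM.get_by_name_label("openstack-domU")[0]  # placeholder label

        # CPU: with the credit scheduler, "cap" limits the VM to a percentage of
        # one physical CPU (200 = at most two CPUs' worth of time).
        session.xenapi.VM.add_to_VCPUs_params_live(vm, "cap", "200")

        # Network: rate-limit each of the domU's VIFs (kbps value is illustrative).
        for vif in session.xenapi.VM.get_VIFs(vm):
            session.xenapi.VIF.set_qos_algorithm_type(vif, "ratelimit")
            session.xenapi.VIF.set_qos_algorithm_params(vif, {"kbps": "102400"})
    finally:
        session.xenapi.session.logout()

Note that the VIF rate limit generally only takes effect when the VIF is plugged, so in practice you would set it before the domU's interfaces come up, or replug them afterwards.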
>>>
>>> So, this is all fine if you're streaming direct to disk, but as you say, if you want to stream VHD files you have a problem, because the VHD needs to go into a filesystem mounted in domain 0. It's not possible to write from a domU into a dom0-owned filesystem without some trickery. Here are the options as I see them:
>>>
>>> Option A: Stream in two stages, one from Glance to domU, then from domU to dom0. The stream from domU to dom0 could just be a really simple network put, and would just fit on the end of the current pipeline. You lose a bit of dom0 CPU, because of the incoming stream, and it's less efficient overall, because of the two hops. Its primary advantage is that you can still do most of the work inside the domU, so if you are intending to decompress and/or decrypt locally, then this would likely be a win.
>>>
>>> Option B: Stream from Glance directly into dom0. This would be a xapi plugin acting as a Glance client. This is the simplest solution, but loses all the benefits above. I think it's the one that you're suggesting below. This leaves you with similar performance problems to the ones that you suffer today on your existing architecture. The advantage here is simplicity, and it's certainly worth considering.
>>>
>>> Option C: Run an NFS server in domain 0, and mount that inside the domU. You can then write direct to dom0's filesystem from the domU. This sounds plausible, but I don't think that I recommend it. The load on dom0 of doing this is probably no better than Options A or B, which would mean that the complexity wasn't worth it.
>>>
>>> Option D: Unpack the VHD file inside the domU, and write it through the PV path. This is probably the option that you haven't considered yet. The same VHD parsing code that we use in domain 0 is also available in an easily consumable form (called libvhdio). This can be used to take a VHD file from the network and parse it, so that you can write the allocated blocks directly to the VDI. This would have all the advantages above, but it adds yet another moving part to the pipeline. Also, this is going to be pretty simple if you're just using VHDs as a way to handle sparseness. If you're expecting to stream a whole tree of snapshots as multiple files, and then expect all the relationships between the files to get wired up correctly, then this is not the solution you're looking for. It's technically doable, but it's very fiddly.
>>>
>>> So, in summary:
>>>
>>> Option A: Two hops. Ideal if you're worried about the cost of decompressing / decrypting on the host.
>>> Option B: Direct to dom0. Ideal if you want the simplest solution.
>>> Option D: Parse the VHD. Probably best performance. Fiddly development work required. Not a good idea if you want to work with trees of VHDs.
>>>
>>> Where do you think you stand? I can advise in more detail about the implementation, if you have a particular option that you prefer.
>>>
>>> Cheers.
>>>
>>> Ewan.
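
To give a feel for Option A's second hop, here is a minimal sketch of a hypothetical receiver that could run in dom0 and accept the already-decompressed VHD from the domU over a plain HTTP PUT. It is written against Python 2.4's stdlib (since dom0 is CentOS 5.5); the port, SR mount path, handler name, and the absence of authentication are placeholders only, not part of any existing plugin:

    # Illustrative sketch of Option A's second hop (domU -> dom0): a tiny receiver
    # in dom0 that writes the incoming VHD straight into the SR's filesystem.
    # Port, SR path, and the lack of authentication are placeholders.
    import os
    import BaseHTTPServer

    SR_PATH = "/var/run/sr-mount"   # placeholder; the real path includes the SR uuid
    CHUNK = 1024 * 1024

    class VHDReceiver(BaseHTTPServer.BaseHTTPRequestHandler):
        def do_PUT(self):
            # The domU has already done the network receive and any decompression
            # or decryption on its own CPUs; dom0 only copies bytes to disk here.
            length = int(self.headers.getheader("content-length", "0"))
            name = os.path.basename(self.path)          # e.g. <vdi-uuid>.vhd
            out = open(os.path.join(SR_PATH, name), "wb")
            try:
                while length > 0:
                    data = self.rfile.read(min(CHUNK, length))
                    if not data:
                        break
                    out.write(data)
                    length -= len(data)
            finally:
                out.close()
            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        BaseHTTPServer.HTTPServer(("", 8099), VHDReceiver).serve_forever()

The domU side is then just an HTTP PUT of the decompressed stream to this port, which is why the extra hop costs dom0 only the final copy rather than the decompression or crypto work.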
>>>
>>> From: openstack-xenapi-bounces+ewan.mellor=citrix....@lists.launchpad.net [mailto:openstack-xenapi-bounces+ewan.mellor=citrix....@lists.launchpad.net] On Behalf Of Rick Harris
>>> Sent: 11 February 2011 22:13
>>> To: openstack-xenapi@lists.launchpad.net
>>> Subject: [Openstack-xenapi] Glance Plugin/DomU access to SR?
>>>
>>> We recently moved to running the compute-worker within a domU instance.
>>>
>>> We could make this move because domU can access VBDs in dom0-space by performing a VBD.plug.
>>>
>>> The problem is that we'd like to deal with whole VHDs rather than kernel, ramdisk, and partitioning (the impetus of the unified-images BP).
>>>
>>> So, for snapshots we stream the base copy VHD held in the SR into Glance, and, likewise, for restores, we stream the snapshot VHD from Glance into the SR, rescan, and then spin up the instance.
>>>
>>> The problem is: now that we're running the compute-worker in domU, how can we access the SR? Is there a way we can map it into domU space (a la VBD.plug)?
>>>
>>> The way we solved this for snapshots was by using the Glance plugin and performing these operations in dom0.
>>>
>>> So, my questions are:
>>>
>>> 1. Are SR operations something we need to use the Glance plugin for?
>>>
>>> 2. If we must use a dom0 plugin for this method of restore, does it make sense to just do everything image-related in the plugin?
>>>
>>> -Rick
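
For reference, a rough sketch of the kind of VBD.plug sequence Rick refers to, using the XenAPI Python bindings. The host, credentials, name labels, and record values are placeholders, and error handling is omitted:

    # Illustrative sketch: attach an existing VDI to the running compute-worker
    # domU via VBD.create + VBD.plug so it appears there as a block device.
    # Session details and name labels are placeholders.
    import XenAPI

    session = XenAPI.Session("http://xenserver-host")        # placeholder
    session.xenapi.login_with_password("root", "password")   # placeholder

    domu = session.xenapi.VM.get_by_name_label("compute-worker")[0]   # placeholder
    vdi = session.xenapi.VDI.get_by_name_label("target-vdi")[0]       # placeholder

    vbd = session.xenapi.VBD.create({
        "VM": domu,
        "VDI": vdi,
        "userdevice": session.xenapi.VM.get_allowed_VBD_devices(domu)[0],
        "bootable": False,
        "mode": "RW",
        "type": "Disk",
        "unpluggable": True,
        "empty": False,
        "other_config": {},
        "qos_algorithm_type": "",
        "qos_algorithm_params": {},
    })
    session.xenapi.VBD.plug(vbd)                  # hot-plug into the running domU
    device = session.xenapi.VBD.get_device(vbd)   # e.g. "xvdc" inside the domU

Inside the domU the disk then shows up through the PV block frontend (/dev/xvdX); as Rick notes, though, this only covers block devices and does not give the domU a view of the SR's VHD files themselves, which is the gap the options above are trying to fill.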