On Thu, 05 Jan 2012 15:41:33 -0600 Michael Roth <mdr...@linux.vnet.ibm.com> wrote:
> On 01/05/2012 02:25 PM, Luiz Capitulino wrote: > > On Thu, 05 Jan 2012 09:10:50 -0600 > > Michael Roth<mdr...@linux.vnet.ibm.com> wrote: > > > >> On 01/05/2012 08:42 AM, Luiz Capitulino wrote: > >>> On Thu, 5 Jan 2012 12:59:27 +0000 > >>> "Daniel P. Berrange"<berra...@redhat.com> wrote: > >>> > >>>> On Thu, Jan 05, 2012 at 10:37:14AM -0200, Luiz Capitulino wrote: > >>>>> On Thu, 5 Jan 2012 10:16:30 +0000 > >>>>> "Daniel P. Berrange"<berra...@redhat.com> wrote: > >>>>> > >>>>>> On Wed, Jan 04, 2012 at 05:45:11PM -0200, Luiz Capitulino wrote: > >>>>>>> This version drops modes 'sleep' and 'hybrid' because they don't work > >>>>>>> properly due to issues in qemu. Only the 'hibernate' mode is supported > >>>>>>> for now. > >>>>>> > >>>>>> IMHO this is short-sighted. When the bugs QEMU in are fixed so that > >>>>>> these modes work, you have needlessly put users in the situation where > >>>>>> they have to now upgrade the guest agent everywhere to take advantage > >>>>>> of the bugfix. > >>>>> > >>>>> That was my thinking until v4. But after discussing with Michael the > >>>>> issues > >>>>> we have with S3 I concluded that it doesn't make sense to offer an API > >>>>> to > >>>>> something that doesn't work, this will just generate bug reports. Also, > >>>>> updating to get new features is normal and expected. > >>>> > >>>> This is assuming that users will always upgrade their VMs& hosts in > >>>> lock step, which I rather doubt they will in practice. eg imagine a > >>>> deployment might have a mixture of hosts, running QEMU 1.1 (broken S3) > >>>> and QEMU 1.2 (working S3). If they build VM disk images they will likely > >>>> use the QEMU GA from 1.2 for all their builds, even if many of them > >>>> will only run on QEMU 1.1 hosts. So you'll end up having 'sleep' and > >>>> 'hybrid' commands available in the guest agent, even though the host > >>>> QEMU doesn't work properly. > >>>> > >>>> So you *will* ultimately need to make sure that QEMU GA from 1.2, has > >>>> sensible behaviour when run on a QEMU 1.1 host. If you don't address > >>>> this during 1.1, you may well find yourself in an un-winnable situation > >>>> for 1.2, where it is impossible to provide good behaviour on old hosts. > >>>> > >>>> So IMHO we are better off in the long run, if we include all commands > >>>> right now, even though some don't work yet, and work to ensure we have > >>>> good error reporting behaviour for those that don't work. > >>> > >>> Yes, I agree. As a side note: if we add error reporting it will only work > >>> on 1.1 and later. Ie, the problem you describe above will still happen > >>> with 1.0. > >>> > >>> But what you're suggesting seems to be the right thing to do. Do you agree > >>> Michael? > >> > >> Agree, but unless we add an RPC that QEMU uses to advertise > >> capabilities, I'm really not sure it's possible to detect whether or not > >> the host will support it. > > > > You mean an RPC to advertise if 'sleep' is supported? I think this is best > > done > > by making guest-suspend return an error as suggested by Daniel, otherwise a > > client that doesn't query for capabilities might run in trouble. > > Agreed, but what I mean is that if the user executes the suspend using > on up-level agent running on a down-level 1.0 host, the agent will still > see s3 advertised and issue the buggy suspend. That's why I suggested > the host->agent capabilities reporting as a possible (but somewhat ugly) > way to just simply tell the agent it can handle it (and, lacking that, > assume that it can't). That makes sense. > > > > > There's an important detail though: we need to make qemu not advertise S3 > > for > > this to work. However, we might be able to fix S3 for 1.1 (and bugs, like > > the > > S4 ones, can't be detected, limiting the scope of the 'unsupported' error). > > > > So, we could merge all modes and commit to get S3 fixed for 1.1 :) > > No disagreement there, if we can commit to making qemu-ga/qemu 1.1 > releases interoperable in this manner (whether by fixing s3 or not > advertising it), I think that approach is perfectly fine, ideal even. > Doing a 1.1 release where qemu and qemu-ga are not interoperable (qemu > missing s3 support, qemu-ga using s3) was my main objection. I see. > But there is a 2nd topic here I'm trying to mull over: what is qemu-ga's > support policy for down-level hosts? backward-compatible? incompatible? That's a good question, I think we should be backward-compatible, but I think that's not going to be trivial. > The above approach to this problem suggests the latter (qemu-ga 1.1 has > RPCs that will knowingly break 1.0 qemu instances). We could solve this > by introducing the capabilities negotiation I mentioned early. It > actually wouldn't need to be anything other than qemu telling qemu-ga > what qemu-ga version-level it supports. By default we assume 1.0, and > limit qemu-ga to that until qemu-ga is told otherwise (so, no > sleep/hybrid suspend modes). For new RPCs we may be able to handle this > version automatically, since we include qemu version levels for the RPCs > in the schema. For functionality within an RPC (like sleep/hybrid > suspend modes) we could use conditional code. > > If we take that approach (maintaining backward-compatibility), we'd need > to introduce that code in the agent now, and require qemu/libvirt > execute the guest-set-support-level RPC or whatever to access these 1.1 > features. What does guest-set-support-level do? It enables all 1.1 post features? A different approach would be to add a new field in the command dict in the schema file, say 'broken-in-qemu-version', and change qemu-ga to check that field in its main loop before executing a command. If 'broken-in-qemu-version' <= qemu version qemu-ga returns an not supported error. For commands like the guest-suspend which is partially supported, we'd have to do a manual check for the qemu version as you suggested above. That's just an idea though, I'm not sure what's the best way to do this. > > Technically, there's a required RPC qemu-ga clients need to execute > already: guest-sync. It's required because we have no way to reliably > detect EOF over virtio-serial, and thus an agent may send stale data to > a newly-connected qemu-ga client, so the client needs to do the > guest-sync command to find the expected response and re-sync the > streams. We could roll the guest-set-support-level functionality into > that. Basically just add another field. > > > > >> And if we can't detect that reliably, we're > >> better off leaving it out for now, because sleeping guests is not > >> obscure functionality, and accidentally nuking guests when a user sleeps > >> them (presumably because they want to retain their working state) is > >> much worse than telling a user to upgrade their agent, or not supported > >> or whatever. > >> > >>> > >>>> As an example, if S3 is broken in current QEMU, then we should not be > >>>> advertizing S3 to the guest OS. This would allow 'pm-is-supported > >>>> --suspend' > >>>> to return false, at which point the guest agent can send back a nice > >>>> error > >>>> message 'Suspend is not supported on this host', instead of just having > >>>> the > >>>> guest try to suspend& hang or worse. > >>> > >> > > >