> On 6 Jun 2019, at 14:07, Dr. David Alan Gilbert <dgilb...@redhat.com> wrote:
>
> * Liran Alon (liran.a...@oracle.com) wrote:
>>
>>
>>> On 6 Jun 2019, at 13:39, Dr. David Alan Gilbert <dgilb...@redhat.com> wrote:
>>>
>>> * Liran Alon (liran.a...@oracle.com) wrote:
>>>>
>>>>
>>>>> On 6 Jun 2019, at 12:23, Dr. David Alan Gilbert <dgilb...@redhat.com>
>>>>> wrote:
>>>>>
>>>>> * Liran Alon (liran.a...@oracle.com) wrote:
>>>>>>
>>>>>>
>>>>>>> On 6 Jun 2019, at 11:42, Dr. David Alan Gilbert <dgilb...@redhat.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> * Liran Alon (liran.a...@oracle.com) wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Looking at QEMU source code, I am puzzled regarding how migration
>>>>>>>> backwards compatibility is preserved regarding X86CPU.
>>>>>>>>
>>>>>>>> As I understand it, fields that are based on KVM capabilities and
>>>>>>>> guest runtime usage are defined in VMState subsections in order to not
>>>>>>>> send them if not necessary.
>>>>>>>> This is done such that in case they are not needed and we migrate to
>>>>>>>> an old QEMU which don’t support loading this state, migration will
>>>>>>>> still succeed
>>>>>>>> (As .needed() method will return false and therefore this state won’t
>>>>>>>> be sent as part of migration stream).
>>>>>>>> Furthermore, in case .needed() returns true and old QEMU don’t support
>>>>>>>> loading this state, migration fails. As it should because we are aware
>>>>>>>> that guest state
>>>>>>>> is not going to be restored properly on destination.
>>>>>>>>
>>>>>>>> I’m puzzled about what will happen in the following scenario:
>>>>>>>> 1) Source is running new QEMU with new KVM that supports save of some
>>>>>>>> VMState subsection.
>>>>>>>> 2) Destination is running new QEMU that supports load this state but
>>>>>>>> with old kernel that doesn’t know how to load this state.
>>>>>>>>
>>>>>>>> I would have expected in this case that if source .needed() returns
>>>>>>>> true, then migration will fail because of lack of support in
>>>>>>>> destination kernel.
>>>>>>>> However, it seems from current QEMU code that this will actually
>>>>>>>> succeed in many cases.
>>>>>>>>
>>>>>>>> For example, if msr_smi_count is sent as part of migration stream (See
>>>>>>>> vmstate_msr_smi_count) and destination have has_msr_smi_count==false,
>>>>>>>> then destination will succeed loading migration stream but
>>>>>>>> kvm_put_msrs() will actually ignore env->msr_smi_count and will
>>>>>>>> successfully load guest state.
>>>>>>>> Therefore, migration will succeed even though it should have failed…
>>>>>>>>
>>>>>>>> It seems to me that QEMU should have for every such VMState
>>>>>>>> subsection, a .post_load() method that verifies that relevant
>>>>>>>> capability is supported by kernel
>>>>>>>> and otherwise fail migration.
>>>>>>>>
>>>>>>>> What do you think? Should I really create a patch to modify all these
>>>>>>>> CPUX86 VMState subsections to behave like this?
>>>>>>>
>>>>>>> I don't know the x86 specific side that much; but from my migration side
>>>>>>> the answer should mostly be through machine types - indeed for smi-count
>>>>>>> there's a property 'x-migrate-smi-count' which is off for machine types
>>>>>>> pre 2.11 (see hw/i386/pc.c pc_compat_2_11) - so if you've got an old
>>>>>>> kernel you should stick to the old machine types.
>>>>>>>
>>>>>>> There's nothing guarding running the new machine type on old-kernels;
>>>>>>> and arguably we should have a check at startup that complains if
>>>>>>> your kernel is missing something the machine type uses.
>>>>>>> However, that would mean that people running with -M pc would fail
>>>>>>> on old kernels.
>>>>>>>
>>>>>>> A post-load is also a valid check; but one question is whether,
>>>>>>> for a particular register, the pain is worth it - it depends on the
>>>>>>> symptom that the missing state causes. If it's minor then you might
>>>>>>> conclude it's not worth a failed migration; if it's a hung or
>>>>>>> corrupt guest then yes it is. Certainly a warning printed is worth
>>>>>>> it.
>>>>>>>
>>>>>>> Dave
>>>>>>
>>>>>> I think we should have flags that allow user to specify which VMState
>>>>>> subsections user explicitly allow to avoid restore even though they are
>>>>>> required to fully restore guest state.
>>>>>> But it seems to me that the behaviour should be to always fail migration
>>>>>> in case we load a VMState subsections that we are unable to restore
>>>>>> unless user explicitly specified this is ok
>>>>>> for this specific subsection.
>>>>>> Therefore, it seems that for every VMState subsection that it’s restore
>>>>>> is based on kernel capability we should:
>>>>>> 1) Have a user-controllable flag (which is also tied to machine-type?)
>>>>>> to explicitly allow avoid restoring this state if cannot. Default should
>>>>>> be “false”.
>>>>>> 2) Have a .post_load() method that verifies we have required kernel
>>>>>> capability to restore this state, unless flag (1) was specified as
>>>>>> “true”.
>>>>>
>>>>> This seems a lot of flags; users aren't going to know what to do with
>>>>> all of them; I don't see what will set/control them.
>>>>
>>>> True but I think users will want to specify only for a handful of VMState
>>>> subsections that it is OK to not restore them even thought hey are deemed
>>>> needed by source QEMU.
>>>> We can create flags only for those VMState subsections.
>>>> User should set these flags explicitly on QEMU command-line. As a “-cpu”
>>>> property? I don’t think these flags should be tied to machine-type.
>>>
>>> I don't see who is going to work out these flags and send them.
>>>
>>>>>
>>>>>> Note that above mentioned flags is different than flags such as
>>>>>> “x-migrate-smi-count”.
>>>>>> The purpose of “x-migrate-smi-count” flag is to avoid sending the
>>>>>> VMState subsection to begin with in case we know we migrate to older
>>>>>> QEMU which don’t even have the relevant VMState subsection. But it is
>>>>>> not relevant for the case both source and destination runs QEMU which
>>>>>> understands the VMState subsection but run on kernels with different
>>>>>> capabilities.
>>>>>>
>>>>>> Also note regarding your first paragraph, that specifying flags based on
>>>>>> kernel you are running on doesn’t help for the case discussed here.
>>>>>> As source QEMU is running on new kernel. Unless you meant that source
>>>>>> QEMU should use relevant machine-type based on the destination kernel.
>>>>>> i.e. You should launch QEMU with old machine-type as long as you have
>>>>>> hosts in your migration pool that runs with old kernel.
>>>>>
>>>>> That's what I meant; stick to the old machine-type unless you know it's
>>>>> safe to use a newer one.
>>>>>
>>>>>> I don’ think it’s the right approach though. As there is no way to
>>>>>> change flags such as “x-migrate-smi-count” dynamically after all hosts
>>>>>> in migration pool have been upgraded.
>>>>>>
>>>>>> What do you think?
>>>>>
>>>>> I don't have an easy answer. The users already have to make sure they
>>>>> use a machine type that's old enough for all the QEMUs installed in
>>>>> their cluster; making sure it's also old enough for their oldest
>>>>> kernel isn't too big a difference - *except* that it's much harder to
>>>>> tell which kernel corresponds to which feature/machine type etc - so
>>>>> how does a user know what the newest supported machine type is?
>>>>> Failing at startup when selecting a machine type that the current
>>>>> kernel can't support would help that.
>>>>>
>>>>> Dave
>>>>
>>>> First, machine-type express the set of vHW behaviour and properties that
>>>> is exposed to guest.
>>>> Therefore, machine-type shouldn’t change for a given guest lifetime
>>>> (including Live-Migrations).
>>>> Otherwise, guest will experience different vHW behaviour and properties
>>>> before/after Live-Migration.
>>>> So I think machine-type is not relevant for this discussion. We should
>>>> focus on flags which specify
>>>> migration behaviour (such as “x-migrate-smi-count” which can also be
>>>> controlled by machine-type but not only).
>>>
>>> Machine type specifies two things:
>>> a) The view from the guest
>>> b) Migration compatibility
>>>
>>> (b) is explicitly documented in qemu's docs/devel/migration.rst, see the
>>> subsection on subsections.
>>>
>>>> Second, this strategy results in inefficient migration management.
>>>> Consider the following scenario:
>>>> 1) Guest running on new_qemu+old_kernel migrate to host with
>>>> new_qemu+new_kernel.
>>>> Because source is old_kernel than destination QEMU is launched with
>>>> (x-migrate-smi-count == false).
>>>> 2) Assume at this point fleet of hosts have half of hosts with old_kernel
>>>> and half with new_kernel.
>>>> 3) Further assume that guest workload indeed use msr_smi_count and
>>>> therefore relevant VMState subsection should be sent to properly preserve
>>>> guest state.
>>>> 4) From some reason, we decide to migrate again the guest in (1).
>>>> Even if guest is migrated to a host with new_kernel, then QEMU still
>>>> avoids sending msr_smi_count VMState subsection because it is launched
>>>> with (x-migrate-smi-count == false).
>>>>
>>>> Therefore, I think it makes more sense that source QEMU will always send
>>>> all VMState subsection that are deemed needed (i.e. .nedeed() returns true)
>>>> and let receive-side decide if migration should fail if this subsection
>>>> was sent but failed to be restored.
>>>> The only case which I think sender should limit the VMState subsection it
>>>> sends to destination is because source is running older QEMU
>>>> which is not even aware of this VMState subsection (Which is to my
>>>> understanding the rational behind using “x-migrate-smi-count” and tie it
>>>> up to machine-type).
>>>
>>> But we want to avoid failed migrations if we can; so in general we don't
>>> want to be sending subsections to destinations that can't handle them.
>>> The only case where it's reasonable is when there's a migration bug such
>>> that the behaviour in the guest is really nasty; if there's a choice
>>> between a failed migration or a hung/corrupt guest I'll take a failed
>>> migration.
>>>
>>>> Third, let’s assume all hosts in fleet was upgraded to new_kernel. How do
>>>> I modify all launched QEMUs on these new hosts to now have
>>>> “x-migrate-smi-count” set to true?
>>>> As I would like future migrations to do send this VMState subsection.
>>>> Currently there is no QMP command to update these flags.
>>>
>>> I guess that's possible - it's pretty painful though; you're going to
>>> have to teach your management layer about features/fixes of the kernels
>>> and which flags to tweak in qemu. Having said that, if you could do it,
>>> then you'd avoid having to restart VMs to pick up a few fixes.
>>>
>>>> Fourth, I think it’s not trivial for management-plane to be aware with
>>>> which flags it should set on destination QEMU based on currently running
>>>> kernels on fleet.
>>>> It’s not the same as machine-type, as already discussed above doesn’t
>>>> change during the entire lifetime of guest.
>>>
>>> Right, which is why I don't see your idea of adding flags will work.
>>> I don't see how anything will figure out what the right flags to use
>>> are.
>>> (Getting the management layers to do sane things with the cpuid flags
>>> is already a nightmare, and they're fairly well understood).
>>>
>>>> I’m also not sure it is a good idea that we currently control flags such
>>>> as “x-migrate-smi-count” from machine-type.
>>>> As it means that if a guest was initially launched using some old QEMU, it
>>>> will *forever* not migrate some VMState subsection during all it’s
>>>> Live-Migrations.
>>>> Even if all hosts and all QEMUs on fleet are capable of migrating this
>>>> state properly.
>>>> Maybe it is preferred that this flag was specified as part of “migrate”
>>>> command itself in case management-plane knows it wishes to migrate even
>>>> though dest QEMU
>>>> is older and doesn’t understand this specific VMState subsection.
>>>>
>>>> I’m left pretty confused about QEMU’s migration compatibility strategy...
>>>
>>> The compatibility strategy is the machine type; but yes it does
>>> have a problem when it's not really just a qemu version - but also
>>> kernel (and external libraries, etc).
>>> My general advice is that users should be updating their kernels and
>>> qemus together; but I realise there's lots of cases where that
>>> doesn't work.
>>>
>>> Dave
>>
>> I think it’s not practical advise to expect users to always upgrade kernel
>> and QEMU together.
>> In fact, users prefer to upgrade them separately to avoid doing major
>> upgrades at once and to better pin-point root-cause of issues.
>
> It's tricky; for distro-based users, hitting 'update' and getting both
> makes a lot of sense; but as you say you ened to let them do stuff
> individually if they want to, so they can track down problems.
> There's also a newer problem which is people want to run the QEMU in
> containers on hosts that have separate update schedules - the kernel
> version relationship is then much more fluid.
>
>> Compiling all above very useful discussion (thanks for this!), I may have a
>> better suggestion that doesn’t require any additional flags:
>> 1) Source QEMU will always send all all VMState subsections that is deemed
>> by source QEMU as required to not break guest semantic behaviour.
>> This is done by .needed() methods that examine guest runtime state to
>> understand if this state is required to be sent or not.
>
> So that's as we already do.
Besides the fact that today we also expect to add a flag tied to machine-type
for every new VMState subsection we add that didn’t exist on previous QEMU
versions...
>
>> 2) Destination QEMU will provide a generic QMP command which allows to set
>> names of VMState subsections that if accepted on migration stream
>> and failed to be loaded (because either subsection name is not implemented
>> or because .post_load() method failed) then the failure should be ignored
>> and migration should continue as usual. By default, the list of this names
>> will be empty.
>
> The format of the migration stream means that you can't skip an unknown
> subsection; it's not possible to resume parsing the stream without
> knowing what was supposed to be there. [This is pretty awful
> but my last attempts to rework it hit a dead end]
Wow… That is indeed pretty awful.
I thought every VMState subsection have a header with a length field… :(
Why did your last attempts to add such a length field to migration stream
protocol failed?
>
> So we still need to tie subsections to machine types; that way
> you don't send them to old qemu's and there for you don't have the
> problem of the qemu receiving something it doesn't know.
I agree that if there is no way to skip a VMState subsection in the stream,
then we must
have a way to specify to source QEMU to prevent sending this subsection to
destination…
I would suggest though that instead of having a flag tied to machine-type, we
will have a QMP command
that can specify names of subsections we explicitly wish to be skipped sending
to destination even if their .needed() method returns true.
This seems like a more explicit approach and doesn’t come with the down-side of
forever not migrating this VMState subsection
for the entire lifetime of guest.
>
> Still, you could skip things where the destination kernel doesn't know
> about it.
>
>> 3) Destination QEMU will implement .post_load() method for all these VMState
>> subsections that depend on kernel capability to be restored properly
>> such that it will fail subsection load in case kernel capability is not
>> present. (Note that this load failure will be ignored if subsection name is
>> specified in (2)).
>>
>> Above suggestion have the following properties:
>> 1) Doesn’t require any flag to be added to QEMU.
>
> There's no logical difference between 'flags' and 'names of subsections'
> - they're got the same problem in someone somewhere knowing which are
> safe.
I agree. But creating additional flags does come with a development and testing
overhead and makes code less intuitive.
I would have prefer to use subsection names.
>
>> 2) Moves all control on whether to fail migration because of failure to load
>> VMState subsection to receiver side. Sender always attempts to send max
>> state he believes is required.
>> 3) We remove coupling of migration compatibility from machine-type.
>>
>> What do you think?
>
> Sorry, can't do (3) - we need to keep the binding for subsections to
> machine types for qemu compatibility; I'm open for something for
> kernel compat, but not when it's breaking the qemu subsection
> checks.
>
> Dave
Agree. I have proposed now above how to not break qemu subsection checks while
still not tie this to machine-type.
Please tell me what you think on that approach. :)
We can combine that approach together with implementing the mentioned
.post_load() methods and maybe it solves the discussion at hand here.
-Liran
>
>>
>> -Liran
>>
>>>
>>>> -Liran
>>>>
>>>>>
>>>>>> -Liran
>>>>>>
>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Liran
>>>>>>> --
>>>>>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>>>>>>
>>>>> --
>>>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>>>>
>>> --
>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
>>
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK