On Thursday 28 March 2019 06:18 AM, David Gibson wrote:
> On Tue, Mar 26, 2019 at 12:22:50PM +0530, Aravinda Prasad wrote:
>>
>>
>> On Tuesday 26 March 2019 09:49 AM, David Gibson wrote:
>>> On Mon, Mar 25, 2019 at 02:31:10PM +0530, Aravinda Prasad wrote:
>>>>
>>>>
>>>> On Monday 25 March 2019 12:03 PM, David Gibson wrote:
>>>>> On Fri, Mar 22, 2019 at 12:04:25PM +0530, Aravinda Prasad wrote:
>>>>>> Block VM migration requests until the machine check
>>>>>> error handling is complete, as (i) these errors are
>>>>>> specific to the source hardware and are irrelevant on
>>>>>> the target hardware, and (ii) these errors cause data
>>>>>> corruption and should be handled before migration.
>>>>>>
>>>>>> Signed-off-by: Aravinda Prasad <aravi...@linux.vnet.ibm.com>
>>>>>> ---
>>>>>> hw/ppc/spapr_events.c | 17 +++++++++++++++++
>>>>>> hw/ppc/spapr_rtas.c | 4 ++++
>>>>>> include/hw/ppc/spapr.h | 3 +++
>>>>>> 3 files changed, 24 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
>>>>>> index d7cc0a4..6356a38 100644
>>>>>> --- a/hw/ppc/spapr_events.c
>>>>>> +++ b/hw/ppc/spapr_events.c
>>>>>> @@ -41,6 +41,7 @@
>>>>>> #include "qemu/bcd.h"
>>>>>> #include "hw/ppc/spapr_ovec.h"
>>>>>> #include <libfdt.h>
>>>>>> +#include "migration/blocker.h"
>>>>>>
>>>>>> #define RTAS_LOG_VERSION_MASK 0xff000000
>>>>>> #define RTAS_LOG_VERSION_6 0x06000000
>>>>>> @@ -866,6 +867,22 @@ static void spapr_mce_build_elog(PowerPCCPU *cpu, bool recovered)
>>>>>>  void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
>>>>>>  {
>>>>>>      SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>>>>>> +    int ret;
>>>>>> +    Error *local_err = NULL;
>>>>>> +
>>>>>> +    error_setg(&spapr->migration_blocker,
>>>>>> +               "Live migration not supported during machine check handling");
>>>>>> +    ret = migrate_add_blocker(spapr->migration_blocker, &local_err);
>>>>>> +    if (ret < 0) {
>>>>>> +        /*
>>>>>> +         * We don't want to abort; let the migration continue. In a
>>>>>> +         * rare case, the machine check handler will run on the target
>>>>>> +         * hardware. Though this is not preferable, it is better than
>>>>>> +         * aborting the migration or killing the VM.
>>>>>
>>>>> Can't you just discard the error in that case?
>>>>
>>>> I am actually discarding the error. Do you mean I don't have to print
>>>> the below warning or am I missing something?
>>>
>>> Um.. I guess we should keep the warning. The tricky thing here is
>>> that the information on the machine check is no longer relevant after
>>> the migration, however presumably the event which triggered it might
>>> have already corrupted guest memory, which *would* be relevant to the
>>> guest after migration. Not sure what to do about that.
>>
>> I think the information is still relevant after the migration, because
>> it includes the effective address in error and the context. If the
>> corruption is in the guest kernel context then the machine check handler
>> inside the guest kernel will call panic(). If the corruption is in user
>> space, then the application is sent a SIGBUS.
>
> Good point.
>
>> But the problem is that the guest machine check handler hardware-poisons
>> the page frame backing the corrupt memory, to avoid reuse of that page
>> frame. After migration, the page frame backing that memory on the target
>> is clean, but we would still poison the clean page frame.
>
> This is the guest side handler?  It sounds like we need two
> variants of the notification that say whether or not the area is still
> bad.

Yes, this is the guest-side handler.

Can we just let the guest machine check handler poison a clean page,
since a migration during a machine check is a rare occurrence? At most
we lose a clean page, which is a reasonable trade-off against the
additional complexity in the machine check handler.
--
Regards,
Aravinda