On 10/12/2017 11:25 AM, Cédric Le Goater wrote:
> On 10/12/2017 12:46 AM, David Gibson wrote:
>> On Wed, Oct 11, 2017 at 01:55:20PM +0200, Cédric Le Goater wrote:
>>> On 10/11/2017 08:45 AM, David Gibson wrote:
>>>> On Mon, Oct 09, 2017 at 05:49:28PM +0200, Cédric Le Goater wrote:
>>>>> When a CPU is stopped with the 'stop-self' RTAS call, its state
>>>>> 'halted' is switched to 1 and, in this case, the MSR is not taken into
>>>>> account anymore in the cpu_has_work() routine. Only the pending
>>>>> hardware interrupts are checked with their LPCR:PECE* enablement bit.
>>>>>
>>>>> If the DECR timer fires after 'stop-self' is called and before the CPU
>>>>> 'stop' state is reached, the nearly-dead CPU will have some work to do
>>>>> and the guest will crash. This case happens very frequently with the
>>>>> not yet upstream P9 XIVE exploitation mode. In XICS mode, the DECR is
>>>>> occasionally fired but after 'stop' state, so no work is to be done
>>>>> and the guest survives.
>>>>>
>>>>> I suspect there is a race between the QEMU mainloop triggering the
>>>>> timers and the TCG CPU thread but I could not quite identify the root
>>>>> cause. To be safe, let's disable the decrementer interrupt in the LPCR
>>>>> when the CPU is halted and reenable it when the CPU is restarted.
>>>>>
>>>>> Signed-off-by: Cédric Le Goater <c...@kaod.org>
>>>>> ---
>>>>>
>>>>> Changes in v2:
>>>>>
>>>>> - used a new routine ppc_cpu_pvr_match() to discriminate CPU versions
>>>>> - removed the LPCR:PECE* enablement bit when the CPU is initialized
>>>>>   if it is a secondary
>>>>>
>>>>>  hw/ppc/spapr_rtas.c         | 20 ++++++++++++++++++++
>>>>>  target/ppc/translate_init.c | 19 +++++++++++++++++--
>>>>>  2 files changed, 37 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>>>>> index cdf0b607a0a0..dfdbf1e2c6f8 100644
>>>>> --- a/hw/ppc/spapr_rtas.c
>>>>> +++ b/hw/ppc/spapr_rtas.c
>>>>> @@ -46,6 +46,7 @@
>>>>>  #include "qemu/cutils.h"
>>>>>  #include "trace.h"
>>>>>  #include "hw/ppc/fdt.h"
>>>>> +#include "target/ppc/cpu-models.h"
>>>>>
>>>>>  static void rtas_display_character(PowerPCCPU *cpu, sPAPRMachineState
>>>>> *spapr,
>>>>>                                     uint32_t token, uint32_t nargs,
>>>>> @@ -174,6 +175,15 @@ static void rtas_start_cpu(PowerPCCPU *cpu_,
>>>>> sPAPRMachineState *spapr,
>>>>>      kvm_cpu_synchronize_state(cs);
>>>>>
>>>>>      env->msr = (1ULL << MSR_SF) | (1ULL << MSR_ME);
>>>>> +
>>>>> +    /* Enable DECR interrupt */
>>>>> +    if (ppc_cpu_pvr_match(cpu, CPU_POWERPC_LOGICAL_3_00)) {
>>>>
>>>> Sorry, I didn't reply to your earlier mail in time. Going via the PVR
>>>> in this way seems bonkers to me - I like it even less than checking
>>>> the mmu type. After all, classifying a bunch of precise models (PVRs)
>>>> together by behaviour is kind of exactly what the CPU classes are for,
>>>> so using object_dynamic_cast() (==instance_of) is a better idea here.
>>>
>>> hmm, and which type should I use ? we don't have any TYPE_POWER9* we
>>> could use for a object_dynamic_cast(). I don't think so ? I could use
>>> the name and strcmp("power9") probably but it looks ugly.
>>
>> Actually there is, but, yeah, it's a lot less obvious than I thought.
>> It's constructed by the POWERPC_FAMILY macro and will be
>> "POWER9-family-powerpc64-cpu"
>>
>>> The only thing we have is "CPU_POWERPC_POWER9_BASE" and it only
>>> applies to PVR.
>>>
>>> Maybe I don't understand your idea.
>>
>> Urgh, sorry. This got much muckier than I thought it would be. I
>> think maybe it's best to go back to the mmu type test, and later on we
>> can fix up both the previously existing test like that, and the new
>> one to something better.
>
> Given that the bits are the same on all processors, why not just use :

grummf, P7 reserves bits 47 and 48.

C.

>     env->spr[SPR_LPCR] |= LPCR_PECE_L_MASK;
>
> and
>
>     env->spr[SPR_LPCR] &= ~LPCR_PECE_L_MASK;
>
> Thanks,
>
> C.
>
>
>>>>> +        env->spr[SPR_LPCR] |= LPCR_DEE;
>>>>> +    } else {
>>>>> +        /* P7 and P8 both have same bit for DECR */
>>>>> +        env->spr[SPR_LPCR] |= LPCR_P8_PECE3;
>>>>> +    }
>>>>> +
>>>>>      env->nip = start;
>>>>>      env->gpr[3] = r3;
>>>>>      cs->halted = 0;
>>>>
>>>> The other option I'm wondering about here is to actually add a
>>>> "shutdown" (or something) method to the cpu class, which does whatever
>>>> is necessary to put the vcpu into a quiescent state that won't be
>>>> woken up unless it's specifically requested.
>>>
>>> yes. That is a good idea.
>>>
>>> Thanks,
>>>
>>> C.
>>>
>>>
>>>>> @@ -210,6 +220,16 @@ static void rtas_stop_self(PowerPCCPU *cpu,
>>>>> sPAPRMachineState *spapr,
>>>>>       * no need to bother with specific bits, we just clear it.
>>>>>       */
>>>>>      env->msr = 0;
>>>>> +
>>>>> +    /* Don't let the decrementer run on a CPU being stopped. This could
>>>>> +     * deliver an interrupt on a dying CPU and crash the guest.
>>>>> +     */
>>>>> +    if (ppc_cpu_pvr_match(cpu, CPU_POWERPC_LOGICAL_3_00)) {
>>>>> +        env->spr[SPR_LPCR] &= ~LPCR_DEE;
>>>>> +    } else {
>>>>> +        /* P7 and P8 both have same bit for DECR */
>>>>> +        env->spr[SPR_LPCR] &= ~LPCR_P8_PECE3;
>>>>> +    }
>>>>>  }
>>>>>
>>>>>  static inline int sysparm_st(target_ulong addr, target_ulong len,
>>>>> diff --git a/target/ppc/translate_init.c b/target/ppc/translate_init.c
>>>>> index 0d6379fcc5b4..1a62159843e7 100644
>>>>> --- a/target/ppc/translate_init.c
>>>>> +++ b/target/ppc/translate_init.c
>>>>> @@ -8905,6 +8905,7 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu,
>>>>> PPCVirtualHypervisor *vhyp)
>>>>>      CPUPPCState *env = &cpu->env;
>>>>>      ppc_spr_t *lpcr = &env->spr_cb[SPR_LPCR];
>>>>>      ppc_spr_t *amor = &env->spr_cb[SPR_AMOR];
>>>>> +    CPUState *cs = CPU(cpu);
>>>>>
>>>>>      cpu->vhyp = vhyp;
>>>>>
>>>>> @@ -8946,8 +8947,15 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu,
>>>>> PPCVirtualHypervisor *vhyp)
>>>>>          } else {
>>>>>              lpcr->default_value &= ~(LPCR_UPRT | LPCR_GTSE);
>>>>>          }
>>>>> -        lpcr->default_value |= LPCR_PDEE | LPCR_HDEE | LPCR_EEE |
>>>>> LPCR_DEE |
>>>>> +        lpcr->default_value |= LPCR_PDEE | LPCR_HDEE | LPCR_EEE |
>>>>>              LPCR_OEE;
>>>>
>>>> But I guess we'd also need a "set_papr" method to go with that.
>>>>
>>>>> +
>>>>> +        /* Only let the decrementer wake up the boot CPU. The RTAS
>>>>> +         * command start-cpu will enable it on secondaries.
>>>>> +         */
>>>>> +        if (cs == first_cpu) {
>>>>> +            lpcr->default_value |= LPCR_DEE;
>>>>> +        }
>>>>>          break;
>>>>>      default:
>>>>>          /* P7 and P8 has slightly different PECE bits, mostly because P8
>>>>> adds
>>>>> @@ -8955,7 +8963,14 @@ void cpu_ppc_set_papr(PowerPCCPU *cpu,
>>>>> PPCVirtualHypervisor *vhyp)
>>>>>           * will work as expected for both implementations
>>>>>           */
>>>>>          lpcr->default_value |= LPCR_P8_PECE0 | LPCR_P8_PECE1 |
>>>>> LPCR_P8_PECE2 |
>>>>> -            LPCR_P8_PECE3 | LPCR_P8_PECE4;
>>>>> +            LPCR_P8_PECE4;
>>>>> +
>>>>> +        /* Only let the decrementer wake up the boot CPU. The RTAS
>>>>> +         * command start-cpu will enable it on secondaries.
>>>>> +         */
>>>>> +        if (cs == first_cpu) {
>>>>> +            lpcr->default_value |= LPCR_P8_PECE3;
>>>>> +        }
>>>>>      }
>>>>>
>>>>>      /* We should be followed by a CPU reset but update the active value
>>>>
>>>
>>
>
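
For readers following the thread, here is a minimal standalone sketch of the idea behind the patch: a halted CPU only "has work" when a pending interrupt is also enabled as a wake-up source, so clearing the decrementer wake-up bit on stop-self keeps a late DECR from reviving a dying CPU. Everything in it is an assumption made for illustration: ToyCPU, the TOY_* bits and the toy_* functions are hypothetical, the bit values are not the real LPCR layout, and none of this is QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bits for this sketch only; NOT the real LPCR layout.
 * A wake-enable bit and its interrupt share the same bit position. */
#define TOY_WAKE_EXT   (1u << 0)   /* external interrupt may wake the CPU */
#define TOY_WAKE_DECR  (1u << 1)   /* decrementer may wake the CPU */
#define TOY_IRQ_EXT    (1u << 0)   /* external interrupt is pending */
#define TOY_IRQ_DECR   (1u << 1)   /* decrementer interrupt is pending */

typedef struct ToyCPU {
    bool halted;         /* set by stop-self, cleared by start-cpu */
    bool msr_ee;         /* stands in for the MSR interrupt enable */
    uint32_t wake_mask;  /* stands in for the LPCR wake-up enable bits */
    uint32_t pending;    /* pending hardware interrupts */
} ToyCPU;

/* While halted, the MSR is ignored: only pending interrupts that are
 * enabled as wake-up sources count as work. */
static bool toy_cpu_has_work(const ToyCPU *cpu)
{
    assert(cpu != NULL);
    if (cpu->halted) {
        return (cpu->pending & cpu->wake_mask) != 0;
    }
    return cpu->msr_ee && cpu->pending != 0;
}

/* stop-self: clear the MSR and stop the decrementer from waking us. */
static void toy_stop_self(ToyCPU *cpu)
{
    assert(cpu != NULL);
    cpu->msr_ee = false;
    cpu->wake_mask &= ~TOY_WAKE_DECR;
    cpu->halted = true;
}

/* start-cpu: re-enable the decrementer wake-up and leave the halted state. */
static void toy_start_cpu(ToyCPU *cpu)
{
    assert(cpu != NULL);
    cpu->wake_mask |= TOY_WAKE_DECR;
    cpu->halted = false;
}
```

In this toy model, a decrementer interrupt that becomes pending after toy_stop_self() no longer makes toy_cpu_has_work() return true, whereas a halted CPU that still has TOY_WAKE_DECR set would be woken by it, which is the situation the commit message describes.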