On Tue, 17 Jan 2017 19:53:21 +0100 Laszlo Ersek <ler...@redhat.com> wrote:
> On 01/17/17 15:20, Igor Mammedov wrote:
> > On Tue, 17 Jan 2017 14:46:05 +0100
> > Laszlo Ersek <ler...@redhat.com> wrote:
> > 
> >> On 01/17/17 14:20, Igor Mammedov wrote:
> >>> On Tue, 17 Jan 2017 13:06:13 +0100
> >>> Laszlo Ersek <ler...@redhat.com> wrote:
> >>> 
> >>>> On 01/17/17 12:21, Igor Mammedov wrote:
> >>>>> On Mon, 16 Jan 2017 21:46:55 +0100
> >>>>> Laszlo Ersek <ler...@redhat.com> wrote:
> >>>>> 
> >>>>>> On 01/13/17 14:09, Igor Mammedov wrote:
> >>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
> >>>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
> >>>>>>> 
> >>>>>>>> The generic edk2 SMM infrastructure prefers
> >>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each
> >>>>>>>> processor. If Trigger() only brings the current processor into
> >>>>>>>> SMM, then edk2 handles it in the following ways:
> >>>>>>>> 
> >>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed
> >>>>>>>>     before ExitBootServices(), but is not necessarily true at
> >>>>>>>>     runtime), then:
> >>>>>>>> 
> >>>>>>>>     (a) If edk2 has been configured for "traditional" SMM
> >>>>>>>>         synchronization, then the BSP sends directed SMIs to
> >>>>>>>>         the APs with APIC delivery, bringing them into SMM
> >>>>>>>>         individually. Then the BSP runs the SMI handler /
> >>>>>>>>         dispatcher.
> >>>>>>>> 
> >>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM
> >>>>>>>>         synchronization, then the APs that are not already in
> >>>>>>>>         SMM are not brought in, and the BSP runs the SMI
> >>>>>>>>         handler / dispatcher.
> >>>>>>>> 
> >>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
> >>>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
> >>>>>>>>     efibootmgr"), then the AP in question brings in the BSP
> >>>>>>>>     with a directed SMI, and the BSP runs the SMI handler /
> >>>>>>>>     dispatcher.
> >>>>>>>> 
> >>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
> >>>>>>>> synchronization is slow.
> >>>>>>>> For example, the "taskset -c 1 efibootmgr" command from (2)
> >>>>>>>> can take more than 3 seconds to complete, because efibootmgr
> >>>>>>>> accesses non-volatile UEFI variables intensively.
> >>>>>>>> 
> >>>>>>>> The larger problem is that QEMU's current behavior diverges
> >>>>>>>> from the behavior usually seen on physical hardware, and that
> >>>>>>>> keeps exposing obscure corner cases, race conditions and other
> >>>>>>>> instabilities in edk2, which generally expects / prefers a
> >>>>>>>> software SMI to affect all CPUs at once.
> >>>>>>>> 
> >>>>>>>> Therefore introduce the "broadcast SMI" feature that causes
> >>>>>>>> QEMU to inject the SMI on all VCPUs.
> >>>>>>>> 
> >>>>>>>> While the original posting of this patch
> >>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
> >>>>>>>> only intended to speed up (2), based on our recent "stress
> >>>>>>>> testing" of SMM this patch actually provides functional
> >>>>>>>> improvements.
> >>>>>>>> 
> >>>>>>>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
> >>>>>>>> Cc: Gerd Hoffmann <kra...@redhat.com>
> >>>>>>>> Cc: Igor Mammedov <imamm...@redhat.com>
> >>>>>>>> Cc: Paolo Bonzini <pbonz...@redhat.com>
> >>>>>>>> Signed-off-by: Laszlo Ersek <ler...@redhat.com>
> >>>>>>>> Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
> >>>>>>>> ---
> >>>>>>>> 
> >>>>>>>> Notes:
> >>>>>>>>     v6:
> >>>>>>>>     - no changes, pick up Michael's R-b
> >>>>>>>> 
> >>>>>>>>     v5:
> >>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
> >>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
> >>>>>>>>       DEFINE_PROP_BIT() in the next patch)
> >>>>>>>> 
> >>>>>>>>  include/hw/i386/ich9.h |  3 +++
> >>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
> >>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
> >>>>>>>> 
> >>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
> >>>>>>>> index da1118727146..18dcca7ebcbf 100644
> >>>>>>>> --- a/include/hw/i386/ich9.h
> >>>>>>>> +++ b/include/hw/i386/ich9.h
> >>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
> >>>>>>>>  #define ICH9_SMB_HST_D1                     0x06
> >>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB              0x07
> >>>>>>>> 
> >>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
> >>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT        0
> >>>>>>>> +
> >>>>>>>>  #endif /* HW_ICH9_H */
> >>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
> >>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
> >>>>>>>> --- a/hw/isa/lpc_ich9.c
> >>>>>>>> +++ b/hw/isa/lpc_ich9.c
> >>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
> >>>>>>>> 
> >>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
> >>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
> >>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
> >>>>>>>> +        if (lpc->smi_negotiated_features &
> >>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
> >>>>>>>> +            CPUState *cs;
> >>>>>>>> +            CPU_FOREACH(cs) {
> >>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >>>>>>>> +            }
> >>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?
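
For reference, the feature-negotiation check in the patch above boils down to a single bit test. A standalone sketch in plain C, where `negotiated` stands in for the `lpc->smi_negotiated_features` field:

```c
#include <stdint.h>

/* Same bit position as defined in include/hw/i386/ich9.h above. */
#define ICH9_LPC_SMI_F_BROADCAST_BIT 0

/* Returns nonzero when the guest firmware has negotiated broadcast
 * SMI; 'negotiated' stands in for lpc->smi_negotiated_features. */
int smi_broadcast_negotiated(uint64_t negotiated)
{
    return (negotiated & (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) != 0;
}
```

Because the other 63 bits are reserved for future features, setting them leaves this particular test unaffected.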
> >>>>>>> 
> >>>>>> 
> >>>>>> I don't think so; as far as I recall, in edk2 the initial SMM
> >>>>>> entry code -- before SMBASE relocation -- accommodates all VCPUs
> >>>>>> entering the same code area for execution. The VCPUs are told
> >>>>>> apart from each other by Initial APIC ID (that is, "who am I"),
> >>>>>> which is used as an index or search criterion into a global
> >>>>>> shared array. Once they find their respective private data
> >>>>>> blocks, the VCPUs don't step on each other's toes.
> >>>>> That's what I'm not sure about.
> >>>>> If a broadcast SMI is sent before relocation, all CPUs will use
> >>>>> the same SMBASE and as a result save their state in the same
> >>>>> memory (even if it's the state after RESET/POWER ON), racing
> >>>>> with / overwriting each other's saved state
> >>>> 
> >>>> Good point!
> >>>> 
> >>>>> and then on exit from SMI they will all restore whatever state
> >>>>> that race produced, so it seems a plain wrong thing to do.
> >>>>> 
> >>>>> Are you sure that edk2 does broadcast for SMI initialization, or
> >>>>> does it use directed SMIs?
> >>>> 
> >>>> Hmmm, you are right. The SmmRelocateBases() function in
> >>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates
> >>>> SMBASE for each individual AP in succession, by sending SMI IPIs
> >>>> to them, one by one. Then it does the same for the BSP, by
> >>>> sending itself a similar SMI IPI.
> >>>> 
> >>>> The above QEMU code is only exercised much later, after the SMBASE
> >>>> relocation, with the SMM stack up and running. It is used when a
> >>>> (preferably broadcast) SMI is triggered for firmware services (for
> >>>> example, the UEFI variable services).
> >>>> 
> >>>> So, you are right.
> >>>> 
> >>>> Considering edk2 only, the difference shouldn't matter -- when
> >>>> this code is invoked (via an IO port write), the SMBASEs will all
> >>>> have been relocated.
> >>>> 
> >>>> Also, I've been informed that on real hardware, writing to APM_CNT
> >>>> triggers an SMI on all CPUs, regardless of the individual SMBASE
> >>>> values. So I believe the above code should be closest to real
> >>>> hardware, and the pre-patch code was only written in unicast form
> >>>> for SeaBIOS's sake.
> >>>> 
> >>>> I think the code above is safe. If the guest injects an SMI via
> >>>> APM_CNT after negotiating SMI broadcast, it should not have left
> >>>> any VCPUs without an SMI handler by that time. edk2 / OVMF
> >>>> conforms to this. In the future we can tweak this further, if
> >>>> necessary; we have 63 more bits, and we'll be able to reject
> >>>> invalid combinations of bits.
> >>>> 
> >>>> Do you feel that we should skip VCPUs whose SMBASE has not been
> >>>> changed from the default?
> >>>> 
> >>>> If so, doesn't that run the risk of missing a VCPU that has an
> >>>> actual SMI handler installed, and it just happens to be placed at
> >>>> the default SMBASE location?
> >>>> 
> >>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated
> >>>> SMBASEs, but I think (a) that's not what real hardware does, and
> >>>> (b) it is risky if a VCPU's actual SMI handler has been installed
> >>>> while keeping the default SMBASE value.
> >>>> 
> >>>> What do you think?
> >>> (a) probably doesn't consider hotplug, so it's totally fine for the
> >>> case where the firmware is the only one who wakes up / initializes
> >>> CPUs. However, consider 2 CPUs being hotplugged, and then a
> >>> broadcast SMI happening to serve some other task (if there isn't an
> >>> unpark 'feature'). Even if real hw does it, it's behavior not
> >>> defined by the SDM, with possibly bad consequences.
> >>> 
> >>> (b) alone is risky, so I'd skip based on a combination of
> >>> 
> >>>     if (default SMBASE && CPU is in reset state)
> >>>         skip;
> >>> 
> >>> that should be safe, and it leaves open the possibility to avoid
> >>> adding an unpark 'feature' to the CPU.
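
The proposed skip condition can be restated as a small compilable helper. The struct below is only a mock of the two CPUX86State fields involved; 0x30000 and 0xfff0 are the x86 reset values of SMBASE and EIP respectively:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mock of the two CPUX86State fields the filter would inspect; this is
 * an illustration only, not QEMU's real structure. */
typedef struct {
    uint32_t smbase; /* 0x30000 at reset; changed by SMBASE relocation */
    uint64_t eip;    /* 0xfff0 at reset ("waiting for first INIT") */
} MockX86State;

/* Skip a VCPU only when BOTH conditions hold: default SMBASE and CPU
 * still in reset state. A CPU that merely executed HLT has moved EIP
 * past 0xfff0 and therefore still receives the SMI. */
bool skip_smi_broadcast(const MockX86State *env)
{
    return env->smbase == 0x30000 && env->eip == 0xfff0;
}
```

Using both fields is what distinguishes a never-started (parked) CPU from one that happens to keep an SMI handler at the default SMBASE, or one that is simply halted.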
> >> 
> >> The above attributes ("SMBASE" and "CPU is in reset state") are
> >> specific to x86 VCPUs. Should I use some dynamic QOM casting for
> >> this? Like the X86_CPU() macro?
> >> 
> >> Can I assume here that the macro will never return NULL? (I think
> >> so, LPC is only used with x86.)
> > yep, that should work.
> > 
> >> ... I guess SMBASE can be accessed as
> >> 
> >>     X86_CPU(cs)->env.smbase
> > preferred style:
> >     X86CPU *cpu = X86_CPU(cs);
> >     cpu->...
> > 
> >> How do I figure out if the CPU is in reset state ("waiting for
> >> first INIT") though? Note that this state should be distinguished
> >> from simply "halted" (i.e., after a HLT). After a HLT, the SMI
> >> should be injected and delivered; see commit a9bad65d2c1f. Edk2
> >> regularly uses HLT to send APs to sleep.
> > 
> > how about using the EIP reset value?
> >     x86_cpu_reset()
> >     ...
> >         env->eip = 0xfff0;
> 
> Okay, so I tried this. It doesn't work.
> 
> This is the code I wrote (extracting a new ich9_apm_broadcast_smi()
> function from ich9_apm_ctrl_changed(), due to size reasons):
> 
> > static void ich9_apm_broadcast_smi(void)
> > {
> >     CPUState *cs;
> > 
> >     cpu_synchronize_all_states(); /* <--------- mark this */
it probably should be executed after pause_all_vcpus(),
maybe it will help.
> >     CPU_FOREACH(cs) {
> >         X86CPU *cpu = X86_CPU(cs);
> >         CPUX86State *env = &cpu->env;
> > 
> >         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
> >             CPUClass *k = CPU_GET_CLASS(cs);
> >             uint64_t cpu_arch_id = k->get_arch_id(cs);
> > 
> >             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
> >             continue;
> >         }
> > 
> >         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
> >     }
> > }
> [...]
> (b) If I add the cpu_synchronize_all_states() call before the loop,
> then the incorrect filter matches go away; QEMU sees the KVM VCPU
> states correctly, and the SMI is broadcast.
> 
> However, in this case, the boot slows to a *crawl*. It's unbearable.
> All VCPUs are pegged at 100%, and you can easily follow the OVMF
> debug log with the naked eye, almost line by line.
> I didn't expect that cpu_synchronize_all_states() would be so costly,
> but it makes the idea of checking VCPU state in the APM_CNT handler a
> non-starter.
I wonder where the bottleneck is that slows the guest down so much.
What is the actual cost of calling cpu_synchronize_all_states()?

Maybe calling pause_all_vcpus() before cpu_synchronize_all_states()
would help.

> Can we stick with the current "v6 wave 2" in light of this?
> 
> Thanks,
> Laszlo
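
A minimal mock of the call ordering suggested above: pause the VCPUs first, synchronize state once, filter and inject, then resume. The three `mock_*` functions below are stand-ins that only record their call order; real QEMU code would call pause_all_vcpus(), cpu_synchronize_all_states() and resume_all_vcpus() instead:

```c
#include <string.h>

static char call_order[64];

/* Stand-ins for QEMU's pause_all_vcpus(),
 * cpu_synchronize_all_states() and resume_all_vcpus(); they only
 * record the order in which they were invoked. */
static void mock_pause_all_vcpus(void)            { strcat(call_order, "pause "); }
static void mock_cpu_synchronize_all_states(void) { strcat(call_order, "sync "); }
static void mock_resume_all_vcpus(void)           { strcat(call_order, "resume"); }

/* Sketch: stop all VCPUs so their register state is stable, pull the
 * state from KVM once, then filter/inject, then let the VCPUs run. */
void ich9_apm_broadcast_smi_sketch(void)
{
    mock_pause_all_vcpus();
    mock_cpu_synchronize_all_states();
    /* ... CPU_FOREACH loop with the smbase/eip filter would go here,
     * ending in cpu_interrupt(cs, CPU_INTERRUPT_SMI) ... */
    mock_resume_all_vcpus();
}
```

Whether this ordering actually removes the slowdown is exactly the open question in the thread; the sketch only pins down the sequencing being proposed.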