On 01/17/17 15:20, Igor Mammedov wrote:
> On Tue, 17 Jan 2017 14:46:05 +0100
> Laszlo Ersek <ler...@redhat.com> wrote:
>
>> On 01/17/17 14:20, Igor Mammedov wrote:
>>> On Tue, 17 Jan 2017 13:06:13 +0100
>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>
>>>> On 01/17/17 12:21, Igor Mammedov wrote:
>>>>> On Mon, 16 Jan 2017 21:46:55 +0100
>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>
>>>>>> On 01/13/17 14:09, Igor Mammedov wrote:
>>>>>>> On Thu, 12 Jan 2017 19:24:45 +0100
>>>>>>> Laszlo Ersek <ler...@redhat.com> wrote:
>>>>>>>
>>>>>>>> The generic edk2 SMM infrastructure prefers
>>>>>>>> EFI_SMM_CONTROL2_PROTOCOL.Trigger() to inject an SMI on each
>>>>>>>> processor. If Trigger() only brings the current processor into
>>>>>>>> SMM, then edk2 handles it in the following ways:
>>>>>>>>
>>>>>>>> (1) If Trigger() is executed by the BSP (which is guaranteed before
>>>>>>>>     ExitBootServices(), but is not necessarily true at runtime),
>>>>>>>>     then:
>>>>>>>>
>>>>>>>>     (a) If edk2 has been configured for "traditional" SMM
>>>>>>>>         synchronization, then the BSP sends directed SMIs to the
>>>>>>>>         APs with APIC delivery, bringing them into SMM
>>>>>>>>         individually. Then the BSP runs the SMI handler /
>>>>>>>>         dispatcher.
>>>>>>>>
>>>>>>>>     (b) If edk2 has been configured for "relaxed" SMM
>>>>>>>>         synchronization, then the APs that are not already in SMM
>>>>>>>>         are not brought in, and the BSP runs the SMI handler /
>>>>>>>>         dispatcher.
>>>>>>>>
>>>>>>>> (2) If Trigger() is executed by an AP (which is possible after
>>>>>>>>     ExitBootServices(), and can be forced e.g. by "taskset -c 1
>>>>>>>>     efibootmgr"), then the AP in question brings in the BSP with a
>>>>>>>>     directed SMI, and the BSP runs the SMI handler / dispatcher.
>>>>>>>>
>>>>>>>> The smaller problem with (1a) and (2) is that the BSP and AP
>>>>>>>> synchronization is slow. For example, the "taskset -c 1 efibootmgr"
>>>>>>>> command from (2) can take more than 3 seconds to complete, because
>>>>>>>> efibootmgr accesses non-volatile UEFI variables intensively.
>>>>>>>>
>>>>>>>> The larger problem is that QEMU's current behavior diverges from
>>>>>>>> the behavior usually seen on physical hardware, and that keeps
>>>>>>>> exposing obscure corner cases, race conditions and other
>>>>>>>> instabilities in edk2, which generally expects / prefers a software
>>>>>>>> SMI to affect all CPUs at once.
>>>>>>>>
>>>>>>>> Therefore introduce the "broadcast SMI" feature that causes QEMU
>>>>>>>> to inject the SMI on all VCPUs.
>>>>>>>>
>>>>>>>> While the original posting of this patch
>>>>>>>> <http://lists.nongnu.org/archive/html/qemu-devel/2015-10/msg05658.html>
>>>>>>>> only intended to speed up (2), based on our recent "stress
>>>>>>>> testing" of SMM this patch actually provides functional
>>>>>>>> improvements.
>>>>>>>>
>>>>>>>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>>>>>>>> Cc: Gerd Hoffmann <kra...@redhat.com>
>>>>>>>> Cc: Igor Mammedov <imamm...@redhat.com>
>>>>>>>> Cc: Paolo Bonzini <pbonz...@redhat.com>
>>>>>>>> Signed-off-by: Laszlo Ersek <ler...@redhat.com>
>>>>>>>> Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>>     v6:
>>>>>>>>     - no changes, pick up Michael's R-b
>>>>>>>>
>>>>>>>>     v5:
>>>>>>>>     - replace the ICH9_LPC_SMI_F_BROADCAST bit value with the
>>>>>>>>       ICH9_LPC_SMI_F_BROADCAST_BIT bit position (necessary for
>>>>>>>>       DEFINE_PROP_BIT() in the next patch)
>>>>>>>>
>>>>>>>>  include/hw/i386/ich9.h |  3 +++
>>>>>>>>  hw/isa/lpc_ich9.c      | 10 +++++++++-
>>>>>>>>  2 files changed, 12 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/include/hw/i386/ich9.h b/include/hw/i386/ich9.h
>>>>>>>> index da1118727146..18dcca7ebcbf 100644
>>>>>>>> --- a/include/hw/i386/ich9.h
>>>>>>>> +++ b/include/hw/i386/ich9.h
>>>>>>>> @@ -250,4 +250,7 @@ Object *ich9_lpc_find(void);
>>>>>>>>  #define ICH9_SMB_HST_D1     0x06
>>>>>>>>  #define ICH9_SMB_HOST_BLOCK_DB     0x07
>>>>>>>>
>>>>>>>> +/* bit positions used in fw_cfg SMI feature negotiation */
>>>>>>>> +#define ICH9_LPC_SMI_F_BROADCAST_BIT     0
>>>>>>>> +
>>>>>>>>  #endif /* HW_ICH9_H */
>>>>>>>> diff --git a/hw/isa/lpc_ich9.c b/hw/isa/lpc_ich9.c
>>>>>>>> index 376b7801a42c..ced6f803a4f2 100644
>>>>>>>> --- a/hw/isa/lpc_ich9.c
>>>>>>>> +++ b/hw/isa/lpc_ich9.c
>>>>>>>> @@ -437,7 +437,15 @@ static void ich9_apm_ctrl_changed(uint32_t val, void *arg)
>>>>>>>>
>>>>>>>>      /* SMI_EN = PMBASE + 30. SMI control and enable register */
>>>>>>>>      if (lpc->pm.smi_en & ICH9_PMIO_SMI_EN_APMC_EN) {
>>>>>>>> -        cpu_interrupt(current_cpu, CPU_INTERRUPT_SMI);
>>>>>>>> +        if (lpc->smi_negotiated_features &
>>>>>>>> +            (UINT64_C(1) << ICH9_LPC_SMI_F_BROADCAST_BIT)) {
>>>>>>>> +            CPUState *cs;
>>>>>>>> +            CPU_FOREACH(cs) {
>>>>>>>> +                cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>>>>>>>> +            }
>>>>>>> Shouldn't CPUs with default SMI base be excluded from broadcast?
>>>>>>
>>>>>> I don't think so; as far as I recall, in edk2 the initial SMM entry
>>>>>> code -- before SMBASE relocation -- accommodates all VCPUs entering
>>>>>> the same code area for execution. The VCPUs are told apart from each
>>>>>> other by Initial APIC ID (that is, "who am I"), which is used as an
>>>>>> index or search criterion into a global shared array. Once they find
>>>>>> their respective private data blocks, the VCPUs don't step on each
>>>>>> other's toes.
>>>>> That's what I'm not sure.
>>>>> If broadcast SMI is sent before relocation all CPUs will use
>>>>> the same SMBASE and as result save their state in the same
>>>>> memory (even if it's state after RESET/POWER ON) racing/overwriting
>>>>> each other's saved state
>>>>
>>>> Good point!
>>>>
>>>>> and then on exit from SMI they all will restore
>>>>> whatever state that race produced so it seems plain wrong thing to do.
>>>>>
>>>>> Are you sure that edk2 does broadcast for SMI initialization or does
>>>>> it using directed SMIs?
>>>>
>>>> Hmmm, you are right. The SmmRelocateBases() function in
>>>> "UefiCpuPkg/PiSmmCpuDxeSmm/PiSmmCpuDxeSmm.c" first relocates SMBASE
>>>> for each individual AP in succession, by sending SMI IPIs to them, one
>>>> by one. Then it does the same for the BSP, by sending itself a similar
>>>> SMI IPI.
>>>>
>>>> The above QEMU code is only exercised much later, after the SMBASE
>>>> relocation, with the SMM stack up and running. It is used when a
>>>> (preferably broadcast) SMI is triggered for firmware services (for
>>>> example, the UEFI variable services).
>>>>
>>>> So, you are right.
>>>>
>>>> Considering edk2 only, the difference shouldn't matter -- when this
>>>> code is invoked (via an IO port write), the SMBASEs will all have been
>>>> relocated.
>>>>
>>>> Also, I've been informed that on real hardware, writing to APM_CNT
>>>> triggers an SMI on all CPUs, regardless of the individual SMBASE
>>>> values. So I believe the above code should be closest to real
>>>> hardware, and the pre-patch code was only written in unicast form for
>>>> SeaBIOS's sake.
>>>>
>>>> I think the code is safe above. If the guest injects an SMI via
>>>> APM_CNT after negotiating SMI broadcast, it should have not left any
>>>> VCPUs without an SMI handler by that time. edk2 / OVMF conforms to
>>>> this. In the future we can tweak this further, if necessary; we have
>>>> 63 more bits, and we'll be able to reject invalid combinations of
>>>> bits.
>>>>
>>>> Do you feel that we should skip VCPUs whose SMBASE has not been
>>>> changed from the default?
>>>>
>>>> If so, doesn't that run the risk of missing a VCPU that has an actual
>>>> SMI handler installed, and it just happens to be placed at the default
>>>> SMBASE location?
>>>>
>>>> ... I wouldn't mind skipping VCPUs with apparently unrelocated
>>>> SMBASEs, but I think (a) that's not what real hardware does, and (b)
>>>> it is risky if a VCPU's actual SMI handler has been installed while
>>>> keeping the default SMBASE value.
>>>>
>>>> What do you think?
>>> (a) probably doesn't consider hotplug, so it's totally fine for the
>>> case where firmware is the only one who wakes up/initializes CPUs.
>>> however consider 2 CPU hotplugged and then broadcast SMI happens
>>> to serve some other task (if there isn't unpark 'feature').
>>> Even if real hw does it, it's behavior not defined by SDM with
>>> possibly bad consequences.
>>>
>>> (b) alone it's risky so I'd skip based on combination of
>>>
>>>   if (default SMBASE & CPU is in reset state)
>>>       skip;
>>>
>>> that should be safe and it leaves open possibility to avoid adding
>>> unpark 'feature' to CPU.
>>
>> The above attributes ("SMBASE" and "CPU is in reset state") are
>> specific to x86 VCPUs. Should I use some dynamic QOM casting for this?
>> Like the X86_CPU() macro?
>>
>> Can I assume here that the macro will never return NULL? (I think so,
>> LPC is only used with x86.)
> yep, that should work.
>
>> ... I guess SMBASE can be accessed as
>>
>>   X86_CPU(cs)->env.smbase
> preferred style:
>   X86CPU *cpu = X86_CPU(cs);
>   cpu->...
>
>> How do I figure out if the CPU is in reset state ("waiting for first
>> INIT") though? Note that this state should be distinguished from simply
>> "halted" (i.e., after a HLT). After a HLT, the SMI should be injected
>> and delivered; see commit a9bad65d2c1f. Edk2 regularly uses HLT to send
>> APs to sleep.
>
> how about using EIP reset value?
> x86_cpu_reset()
>   ...
>   env->eip = 0xfff0;
Okay, so I tried this. It doesn't work. This is the code I wrote
(extracting a new ich9_apm_broadcast_smi() function from
ich9_apm_ctrl_changed(), due to size reasons):

> static void ich9_apm_broadcast_smi(void)
> {
>     CPUState *cs;
>
>     cpu_synchronize_all_states(); /* <--------- mark this */
>     CPU_FOREACH(cs) {
>         X86CPU *cpu = X86_CPU(cs);
>         CPUX86State *env = &cpu->env;
>
>         if (env->smbase == 0x30000 && env->eip == 0xfff0) {
>             CPUClass *k = CPU_GET_CLASS(cs);
>             uint64_t cpu_arch_id = k->get_arch_id(cs);
>
>             trace_ich9_apm_broadcast_smi_skip(cpu_arch_id);
>             continue;
>         }
>
>         cpu_interrupt(cs, CPU_INTERRUPT_SMI);
>     }
> }

There are two cases:

(a) If I do *not* add the cpu_synchronize_all_states() call before the
    loop, then the filter condition matches *despite* the APM_CNT write
    being performed by OVMF *after* SMBASE relocation. I see the trace
    messages in the QEMU log (this is a four VCPU Ia32 guest):

    28863@1484677204.715171:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0
    28863@1484677204.715182:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1
    28863@1484677204.715184:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2
    28863@1484677204.715185:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3
    28863@1484677204.724084:ich9_apm_broadcast_smi_skip cpu_arch_id=0x0
    28863@1484677204.724088:ich9_apm_broadcast_smi_skip cpu_arch_id=0x1
    28863@1484677204.724090:ich9_apm_broadcast_smi_skip cpu_arch_id=0x2
    28863@1484677204.724091:ich9_apm_broadcast_smi_skip cpu_arch_id=0x3

    That is, even though SMBASE relocation succeeds first, and APM_CNT is
    only written afterwards -- I proved this from the OVMF debug log --,
    the code above does not see an up-to-date CPU state for the KVM
    VCPUs, and it filters out the SMI for *all* VCPUs. Needless to say,
    this breaks the firmware entirely; it cannot boot.

(b) If I add the cpu_synchronize_all_states() call before the loop, then
    the incorrect filter matches go away; QEMU sees the KVM VCPU states
    correctly, and the SMI is broadcast.
    However, in this case, the boot slows to a *crawl*. It's unbearable.
    All VCPUs are pegged at 100%, and you can easily follow the OVMF
    debug log with the naked eye, almost line by line. I didn't expect
    that cpu_synchronize_all_states() would be so costly, but it makes
    the idea of checking VCPU state in the APM_CNT handler a non-starter.

Can we stick with the current "v6 wave 2" in light of this?

Thanks,
Laszlo