Re: [Qemu-devel] Help me with QEMU
> How can I add support for a new kind of device in QEMU, and what are the
> steps to do so?

https://github.com/rafilia/qemu_example_virtual_pcidev

-- 
Dongli Zhang (张东立)
finallyjustice.github.io
[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
The "perf stat" at the VM side still works even if we set "-cpu host,-pmu"
in the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
can disable the pmu virtualization in an AMD environment.

We still see the below at the VM kernel side ...

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like the below.

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu (v1) does not rely on cpuid to decide if the
pmu virtualization is supported.

We disable KVM_CAP_PMU_CAPABILITY if 'pmu' is disabled in the vcpu
properties.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 target/i386/kvm/kvm.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 8fec0bc5b5..0b1226ff7f 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -137,6 +137,8 @@ static int has_triple_fault_event;
 
 static bool has_msr_mcg_ext_ctl;
 
+static int has_pmu_cap;
+
 static struct kvm_cpuid2 *cpuid_cache;
 static struct kvm_cpuid2 *hv_cpuid_cache;
 static struct kvm_msr_list *kvm_feature_msrs;
@@ -1725,6 +1727,19 @@ static void kvm_init_nested_state(CPUX86State *env)
 
 void kvm_arch_pre_create_vcpu(CPUState *cs)
 {
+    X86CPU *cpu = X86_CPU(cs);
+    int ret;
+
+    if (has_pmu_cap && !cpu->enable_pmu) {
+        ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_PMU_CAPABILITY, 0,
+                                KVM_PMU_CAP_DISABLE);
+        if (ret < 0) {
+            error_report("kvm: Failed to disable pmu cap: %s",
+                         strerror(-ret));
+        }
+
+        has_pmu_cap = 0;
+    }
 }
 
 int kvm_arch_init_vcpu(CPUState *cs)
@@ -2517,6 +2532,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
+
     ret = kvm_get_supported_msrs(s);
     if (ret < 0) {
         return ret;
-- 
2.34.1
[PATCH 3/3] target/i386/kvm: get and put AMD pmu registers
The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
the KVM side.

However, only the Intel gp/fixed/global pmu registers are involved. There
is not any implementation for AMD pmu registers. The
'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters'
are calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work
for AMD. Before AMD PerfMonV2, the number of gp registers is decided based
on the CPU version.

This patch is to add support for the AMD version=1 pmu, to get and put
AMD pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

2. Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not put the AMD pmu registers to the KVM side. As a
result, some pmu events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports the below error:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

5. In the worst case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly into the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch fixes the issue by resetting the AMD pmu registers during the
reset.
Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 target/i386/cpu.h     |  5 +++
 target/i386/kvm/kvm.c | 83 +--
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index d4bc19577a..4cf0b98817 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -468,6 +468,11 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_K7_EVNTSEL0                 0xc0010000
+#define MSR_K7_PERFCTR0                 0xc0010004
+#define MSR_F15H_PERF_CTL0              0xc0010200
+#define MSR_F15H_PERF_CTR0              0xc0010201
+
 #define MSR_MC0_CTL                     0x400
 #define MSR_MC0_STATUS                  0x401
 #define MSR_MC0_ADDR                    0x402
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 0b1226ff7f..023fcbce48 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2005,6 +2005,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
         }
     }
 
+    if (IS_AMD_CPU(env)) {
+        int64_t family;
+
+        family = (env->cpuid_version >> 8) & 0xf;
+        if (family == 0xf) {
+            family += (env->cpuid_version >> 20) & 0xff;
+        }
+
+        /*
+         * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+         * disable the AMD pmu virtualization.
+         *
+         * If KVM_CAP_PMU_CAPABILITY is supported, "!has_pmu_cap" indicates
+         * the KVM side has already disabled the pmu virtualization.
+         */
+        if (family >= 6 && (!has_pmu_cap || cpu->enable_pmu)) {
+            has_architectural_pmu_version = 1;
+
+            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
+                num_architectural_pmu_gp_counters = 6;
+            } else {
+                num_architectural_pmu_gp_counters = 4;
+            }
+        }
+    }
+
     cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
 
     for (i = 0x80000000; i <= limit; i++) {
@@ -3326,7 +3352,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
     }
 
-    if (has_architectural_pmu_version > 0) {
+    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
         if (has_architectural_pmu_version > 1) {
             /* Stop the counter.  */
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -3357,6 +3383,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
                                   env->msr_global_ctrl);
             }
         }
+
+        if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+            uint32_t sel_base = MSR_K7_EVNTSEL0;
+            uint32_t ctr_base = MSR_K7_PERFCTR0;
+            uint32_t step = 1;
+
+            if (num_architectural_pmu_gp_counters == 6) {
+                sel_base = MSR_F15H_PERF_CTL0;
+                ctr_base = MSR_F15H_PERF_CTR0;
+                step = 2;
+            }
+
+            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+                kvm_msr_entry_a
[PATCH 0/3] kvm: fix two svm pmu virtualization bugs
This patchset is to fix two svm pmu virtualization bugs.

1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.

Using "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
virtualization. There is still the below at the VM linux side ...

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like the below.

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

Patches 1-2 disable the pmu virtualization via KVM_PMU_CAP_DISABLE if the
per-vcpu "pmu" property is disabled.

I considered 'KVM_X86_SET_MSR_FILTER' initially. Both
KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctls. I finally
used the latter because it is easier to use.

2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
at the KVM side may inject random unwanted/unknown NMIs to the VM.

The svm pmu registers are not reset during QEMU system_reset.

(1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while
it is running "perf top". The pmu registers are not disabled gracefully.

(2). Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not put the AMD pmu registers to the KVM side. As a
result, some pmu events are still enabled at the KVM side.

(3). The KVM pmc_speculative_in_use() always returns true so that the
events will not be reclaimed. The kvm_pmc->perf_event is still active.

(4). After the reboot, the VM kernel reports the below error:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

(5). In the worst case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
Patch 3 fixes the issue by resetting the AMD pmu registers as well as the
Intel registers.

This patchset does not cover PerfMonV2, until the below patchset is merged
into the KVM side.

[PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/

Dongli Zhang (3):
  kvm: introduce a helper before creating the 1st vcpu
  i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
  target/i386/kvm: get and put AMD pmu registers

 accel/kvm/kvm-all.c    |   7 ++-
 include/sysemu/kvm.h   |   2 +
 target/arm/kvm64.c     |   4 ++
 target/i386/cpu.h      |   5 +++
 target/i386/kvm/kvm.c  | 104 +++-
 target/mips/kvm.c      |   4 ++
 target/ppc/kvm.c       |   4 ++
 target/riscv/kvm.c     |   4 ++
 target/s390x/kvm/kvm.c |   4 ++
 9 files changed, 134 insertions(+), 4 deletions(-)

Thank you very much!

Dongli Zhang
[PATCH 1/3] kvm: introduce a helper before creating the 1st vcpu
Some per-VM kvm caps (e.g., KVM_CAP_PMU_CAPABILITY) can only be
enabled/disabled before creating the 1st vcpu, that is, when
(!kvm->created_vcpus) at the KVM side.

Unfortunately, some properties are still not set during kvm_arch_init().
The values of those properties are obtained during the init of each vcpu.

This is to add a new helper to provide the last chance before creating the
1st vcpu, in order for QEMU to set kvm caps based on the per-vcpu
properties (e.g., "pmu"). In a future patch, we may disable
KVM_CAP_PMU_CAPABILITY in the helper if "-pmu" is set for the vcpu.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 accel/kvm/kvm-all.c    | 7 +--
 include/sysemu/kvm.h   | 2 ++
 target/arm/kvm64.c     | 4 ++++
 target/i386/kvm/kvm.c  | 4 ++++
 target/mips/kvm.c      | 4 ++++
 target/ppc/kvm.c       | 4 ++++
 target/riscv/kvm.c     | 4 ++++
 target/s390x/kvm/kvm.c | 4 ++++
 8 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f99b0becd8..335ff6ce4d 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -367,8 +367,9 @@ void kvm_destroy_vcpu(CPUState *cpu)
     }
 }
 
-static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
+static int kvm_get_vcpu(KVMState *s, CPUState *cs)
 {
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
     struct KVMParkedVcpu *cpu;
 
     QLIST_FOREACH(cpu, &s->kvm_parked_vcpus, node) {
@@ -382,6 +383,8 @@ static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
         }
     }
 
+    kvm_arch_pre_create_vcpu(cs);
+
     return kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)vcpu_id);
 }
 
@@ -393,7 +396,7 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 
     trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
 
-    ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu));
+    ret = kvm_get_vcpu(s, cpu);
     if (ret < 0) {
         error_setg_errno(errp, -ret,
                          "kvm_init_vcpu: kvm_get_vcpu failed (%lu)",
                          kvm_arch_vcpu_id(cpu));
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index e9a97eda8c..9a2e2ba012 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -371,6 +371,8 @@ int kvm_arch_put_registers(CPUState *cpu, int level);
 
 int kvm_arch_init(MachineState *ms, KVMState *s);
 
+void kvm_arch_pre_create_vcpu(CPUState *cs);
+
 int kvm_arch_init_vcpu(CPUState *cpu);
 int kvm_arch_destroy_vcpu(CPUState *cpu);
 
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index 1197253d12..da4317ad06 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -834,6 +834,10 @@ static int kvm_arm_sve_set_vls(CPUState *cs)
     return kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 #define ARM_CPU_ID_MPIDR       3, 0, 0, 0, 5
 
 int kvm_arch_init_vcpu(CPUState *cs)
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a213209379..8fec0bc5b5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1723,6 +1723,10 @@ static void kvm_init_nested_state(CPUX86State *env)
     }
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     struct {
diff --git a/target/mips/kvm.c b/target/mips/kvm.c
index bcb8e06b2c..1be1695b6b 100644
--- a/target/mips/kvm.c
+++ b/target/mips/kvm.c
@@ -61,6 +61,10 @@ int kvm_arch_irqchip_create(KVMState *s)
     return 0;
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     MIPSCPU *cpu = MIPS_CPU(cs);
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index 7c25348b7b..9049c6eb5e 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -462,6 +462,10 @@ static void kvmppc_hw_debug_points_init(CPUPPCState *cenv)
     }
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     PowerPCCPU *cpu = POWERPC_CPU(cs);
diff --git a/target/riscv/kvm.c b/target/riscv/kvm.c
index 30f21453d6..811f65d4f6 100644
--- a/target/riscv/kvm.c
+++ b/target/riscv/kvm.c
@@ -394,6 +394,10 @@ void kvm_arch_init_irq_routing(KVMState *s)
 {
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     int ret = 0;
diff --git a/target/s390x/kvm/kvm.c b/target/s390x/kvm/kvm.c
index 3ac7ec9acf..65f701894e 100644
--- a/target/s390x/kvm/kvm.c
+++ b/target/s390x/kvm/kvm.c
@@ -404,6 +404,10 @@ unsigned long kvm_arch_vcpu_id(CPUState *cpu)
     return cpu->cpu_index;
 }
 
+void kvm_arch_pre_create_vcpu(CPUState *cs)
+{
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     unsigned int max_cpus = MACHINE(qdev_get_machine())->smp.max_cpus;
-- 
2.34.1
Re: [PATCH 0/3] kvm: fix two svm pmu virtualization bugs
Hi Like,

On 11/20/22 22:42, Like Xu wrote:
> On 19/11/2022 8:28 pm, Dongli Zhang wrote:
>> This patchset is to fix two svm pmu virtualization bugs.
>>
>> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.
>>
>> To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
>> virtualization. There is still below at the VM linux side ...
>
> Many QEMU vendor forks already have similar fixes, and
> thanks for bringing this issue back to the mainline.

Would you mind pointing to any prior patchset for mainline that tried to
resolve the issue?

>
>>
>> [ 0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>
>> ... although we expect something like below.
>>
>> [ 0.596381] Performance Events: PMU not available due to virtualization,
>> using software events only.
>> [ 0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>
>> The patch 1-2 is to disable the pmu virtualization via KVM_PMU_CAP_DISABLE
>> if the per-vcpu "pmu" property is disabled.
>>
>> I considered 'KVM_X86_SET_MSR_FILTER' initially. Both
>> KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctls. I
>> finally used the latter because it is easier to use.
>>
>>
>> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
>> at the KVM side may inject random unwanted/unknown NMIs to the VM.
>>
>> The svm pmu registers are not reset during QEMU system_reset.
>>
>> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> (2). Although the x86_cpu_reset() resets many registers to zero, the
>> kvm_put_msrs() does not put AMD pmu registers to the KVM side. As a result,
>> some pmu events are still enabled at the KVM side.
>>
>> (3). The KVM pmc_speculative_in_use() always returns true so that the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>
> I'm not sure if you're saying KVM is doing something wrong. I don't think
> so, because KVM doesn't sense the system_reset defined by QEMU or other
> user space. AMD's vPMC will continue to be enabled (if it was enabled
> before), generating pmi injection into the guest, and the newly started
> guest doesn't realize the counter is still enabled and blows up the
> error log.

I was not saying KVM was buggy. I was trying to explain how the issue
impacts the KVM side and the VM side.

>
>>
>> (4). After the reboot, the VM kernel reports below error:
>>
>> [ 0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected,
>> complain to your hardware vendor.
>> [ 0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
>> c0010200 is 530076)
>>
>> (5). In a worse case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly to the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The patch 3 is to fix the issue by resetting AMD pmu registers as well as
>> Intel registers.
>
> This fix idea looks good, it does require syncing the new changed device
> state of QEMU to KVM.

Thank you very much!

Dongli Zhang

>
>>
>>
>> This patchset does not cover PerfMonV2, until the below patchset
>> is merged into the KVM side.
>>
>> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
>> https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/
>>
>>
>> Dongli Zhang (3):
>>   kvm: introduce a helper before creating the 1st vcpu
>>   i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
>>   target/i386/kvm: get and put AMD pmu registers
>>
>>  accel/kvm/kvm-all.c    |   7 ++-
>>  include/sysemu/kvm.h   |   2 +
>>  target/arm/kvm64.c     |   4 ++
>>  target/i386/cpu.h      |   5 +++
>>  target/i386/kvm/kvm.c  | 104 +++-
>>  target/mips/kvm.c      |   4 ++
>>  target/ppc/kvm.c       |   4 ++
>>  target/riscv/kvm.c     |   4 ++
>>  target/s390x/kvm/kvm.c |   4 ++
>>  9 files changed, 134 insertions(+), 4 deletions(-)
>>
>> Thank you very much!
>>
>> Dongli Zhang
Re: [PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
Hi Greg and Liang,

On 11/21/22 6:23 AM, Liang Yan wrote:
>
> On 11/21/22 06:03, Greg Kurz wrote:
>> On Sat, 19 Nov 2022 04:29:00 -0800
>> Dongli Zhang wrote:
>>
>>> The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in
>>> the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC"
>>> could disable the pmu virtualization in an AMD environment.
>>>
>>> We still see below at VM kernel side ...
>>>
>>> [ 0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>>>
>>> ... although we expect something like below.
>>>
>>> [ 0.596381] Performance Events: PMU not available due to virtualization,
>>> using software events only.
>>> [ 0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>>>
>>> This is because the AMD pmu (v1) does not rely on cpuid to decide if the
>>> pmu virtualization is supported.
>>>
>>> We disable KVM_CAP_PMU_CAPABILITY if the 'pmu' is disabled in the vcpu
>>> properties.
>>>
>>> Cc: Joe Jin
>>> Signed-off-by: Dongli Zhang
>>> ---
>>>  target/i386/kvm/kvm.c | 17 +
>>>  1 file changed, 17 insertions(+)
>>>
>>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>>> index 8fec0bc5b5..0b1226ff7f 100644
>>> --- a/target/i386/kvm/kvm.c
>>> +++ b/target/i386/kvm/kvm.c
>>> @@ -137,6 +137,8 @@ static int has_triple_fault_event;
>>>  static bool has_msr_mcg_ext_ctl;
>>> +static int has_pmu_cap;
>>> +
>>>  static struct kvm_cpuid2 *cpuid_cache;
>>>  static struct kvm_cpuid2 *hv_cpuid_cache;
>>>  static struct kvm_msr_list *kvm_feature_msrs;
>>> @@ -1725,6 +1727,19 @@ static void kvm_init_nested_state(CPUX86State *env)
>>>  void kvm_arch_pre_create_vcpu(CPUState *cs)
>>>  {
>>> +    X86CPU *cpu = X86_CPU(cs);
>>> +    int ret;
>>> +
>>> +    if (has_pmu_cap && !cpu->enable_pmu) {
>>> +        ret = kvm_vm_enable_cap(kvm_state, KVM_CAP_PMU_CAPABILITY, 0,
>>> +                                KVM_PMU_CAP_DISABLE);
>>
>> It doesn't seem conceptually correct to configure VM level stuff out of
>> a vCPU property, which could theoretically be different for each vCPU,
>> even if this isn't the case with the current code base.
>>
>> Maybe consider controlling PMU with a machine property and this
>> could be done in kvm_arch_init() like other VM level stuff ?
>>
>
> There is already a 'pmu' property for x86_cpu with the variable 'enable_pmu',
> as we see in the above code. It is mainly used by Intel CPUs and has been
> set to off by default since qemu 1.5.
>
> And this property is spread to AMD CPUs too.
>
> I think you may need to set up a machine property to disable it from the
> current machine model. Otherwise, it will break the Live Migration scenario.

Initially, I was thinking about replacing the CPU level property 'pmu'
with a VM level property, but with the concern that reviewers/maintainers
might not be happy with it. That may break the way other Apps use QEMU.

I will add an extra VM level property to the kvm_accel_class_init() for
"-machine accel=kvm", in addition to the existing per CPU property. The VM
(kvm) level 'pmu' will be more privileged than the CPU level 'pmu'.

Thank you very much!

Dongli Zhang

>
>
>>> +        if (ret < 0) {
>>> +            error_report("kvm: Failed to disable pmu cap: %s",
>>> +                         strerror(-ret));
>>> +        }
>>> +
>>> +        has_pmu_cap = 0;
>>> +    }
>>>  }
>>>
>>>  int kvm_arch_init_vcpu(CPUState *cs)
>>> @@ -2517,6 +2532,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>>      }
>>>  }
>>> +    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
>>> +
>>>      ret = kvm_get_supported_msrs(s);
>>>      if (ret < 0) {
>>>          return ret;
>>
Re: [PATCH 3/3] target/i386/kvm: get and put AMD pmu registers
Hi Liang,

On 11/21/22 6:28 AM, Liang Yan wrote:
> A little bit more information from the kernel perspective.
>
> https://lkml.org/lkml/2022/10/31/476
>
> I was kindly thinking of the same idea, but not sure if it is expected
> from a bare-metal perspective, since the four legacy MSRs are always
> there. Also not sure if they are used by other applications.
>
> ~Liang

Can I assume you were replying to the patch ...

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled

... but not this one, which is about saving/restoring PMU registers?

[PATCH 3/3] target/i386/kvm: get and put AMD pmu registers

About the legacy registers, here is my understanding, just based on the
linux amd source code.

Line 1450 indicates the PMU initialization fails only when the family
is < 6. The amd_core_pmu_init() at line 1455 decides whether to use
PERFCTR_CORE.

1445 __init int amd_pmu_init(void)
1446 {
1447         int ret;
1448
1449         /* Performance-monitoring supported from K7 and later: */
1450         if (boot_cpu_data.x86 < 6)
1451                 return -ENODEV;
1452
1453         x86_pmu = amd_pmu;
1454
1455         ret = amd_core_pmu_init();
1456         if (ret)
1457                 return ret

By default (when PERFCTR_CORE is not available), the below 4 legacy
registers are used.

1270 static __initconst const struct x86_pmu amd_pmu = {
1271         .name            = "AMD",
...           ...
1279         .eventsel        = MSR_K7_EVNTSEL0,
1280         .perfctr         = MSR_K7_PERFCTR0,

#define MSR_K7_EVNTSEL0 0xc0010000
#define MSR_K7_PERFCTR0 0xc0010004
#define MSR_K7_EVNTSEL1 0xc0010001
#define MSR_K7_PERFCTR1 0xc0010005
#define MSR_K7_EVNTSEL2 0xc0010002
#define MSR_K7_PERFCTR2 0xc0010006
#define MSR_K7_EVNTSEL3 0xc0010003
#define MSR_K7_PERFCTR3 0xc0010007

If PERFCTR_CORE is supported, counters like below are used, as at lines
1368 and 1369.
1351 static int __init amd_core_pmu_init(void)
1352 {
1353         union cpuid_0x80000022_ebx ebx;
1354         u64 even_ctr_mask = 0ULL;
1355         int i;
1356
1357         if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
1358                 return 0;
1359
1360         /* Avoid calculating the value each time in the NMI handler */
1361         perf_nmi_window = msecs_to_jiffies(100);
1362
1363         /*
1364          * If core performance counter extensions exists, we must use
1365          * MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
1366          * amd_pmu_addr_offset().
1367          */
1368         x86_pmu.eventsel        = MSR_F15H_PERF_CTL;
1369         x86_pmu.perfctr         = MSR_F15H_PERF_CTR;
1370         x86_pmu.num_counters    = AMD64_NUM_COUNTERS_CORE;

#define MSR_F15H_PERF_CTL  0xc0010200
#define MSR_F15H_PERF_CTL0 MSR_F15H_PERF_CTL
#define MSR_F15H_PERF_CTL1 (MSR_F15H_PERF_CTL + 2)
#define MSR_F15H_PERF_CTL2 (MSR_F15H_PERF_CTL + 4)
#define MSR_F15H_PERF_CTL3 (MSR_F15H_PERF_CTL + 6)
#define MSR_F15H_PERF_CTL4 (MSR_F15H_PERF_CTL + 8)
#define MSR_F15H_PERF_CTL5 (MSR_F15H_PERF_CTL + 10)

#define MSR_F15H_PERF_CTR  0xc0010201
#define MSR_F15H_PERF_CTR0 MSR_F15H_PERF_CTR
#define MSR_F15H_PERF_CTR1 (MSR_F15H_PERF_CTR + 2)
#define MSR_F15H_PERF_CTR2 (MSR_F15H_PERF_CTR + 4)
#define MSR_F15H_PERF_CTR3 (MSR_F15H_PERF_CTR + 6)
#define MSR_F15H_PERF_CTR4 (MSR_F15H_PERF_CTR + 8)
#define MSR_F15H_PERF_CTR5 (MSR_F15H_PERF_CTR + 10)

Unfortunately, I do not have access to a machine with family < 6. It might
be interesting to use QEMU to emulate a machine with family < 6.

Thank you very much for the suggestion!

Dongli Zhang

>
>
> On 11/19/22 07:29, Dongli Zhang wrote:
>> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
>> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
>> the KVM side.
>>
>> However, only the Intel gp/fixed/global pmu registers are involved. There
>> is not any implementation for AMD pmu registers. The
>> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
>> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
>> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
>> the CPU version.
>>
>> This patch is to add the support for AMD version=1 pmu, to get and put AMD
>> pmu registers. Otherwise, there will be a bug:
>>
>> 1. The VM resets (e.g., via QEMU system_reset
[PATCH v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
The "perf stat" at the VM side still works even if we set "-cpu host,-pmu"
in the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu
EPYC" can disable the pmu virtualization in an AMD environment.

We still see the below at the VM kernel side ...

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like the below.

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu (v1) does not rely on cpuid to decide if the
pmu virtualization is supported.

We introduce a new property 'pmu-cap-disabled' for the KVM accel to set
KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only the x86
host is supported, because currently KVM uses KVM_CAP_PMU_CAPABILITY only
for x86.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
Changed since v1:
- In version 1 we did not introduce the new property. We issued the
  KVM_PMU_CAP_DISABLE ioctl only before the creation of the 1st vcpu, and
  had introduced a helper function to do this job before creating the 1st
  KVM vcpu.
 accel/kvm/kvm-all.c      |  1 +
 include/sysemu/kvm_int.h |  1 +
 qemu-options.hx          |  7 ++
 target/i386/kvm/kvm.c    | 46 ++++
 4 files changed, 55 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index f99b0becd8..5d4439ba74 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3620,6 +3620,7 @@ static void kvm_accel_instance_init(Object *obj)
     s->kvm_dirty_ring_size = 0;
     s->notify_vmexit = NOTIFY_VMEXIT_OPTION_RUN;
     s->notify_window = 0;
+    s->pmu_cap_disabled = false;
 }
 
 /**
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index 3b4adcdc10..e29ac5d767 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -110,6 +110,7 @@ struct KVMState
     struct KVMDirtyRingReaper reaper;
     NotifyVmexitOption notify_vmexit;
     uint32_t notify_window;
+    bool pmu_cap_disabled;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/qemu-options.hx b/qemu-options.hx
index 7f99d15b23..15a2f717ff 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -186,6 +186,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
     "                tb-size=n (TCG translation block cache size)\n"
     "                dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
     "                notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n"
+    "                pmu-cap-disabled=true|false (disable KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n"
     "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
SRST
``-accel name[,prop=value[,...]]``
@@ -247,6 +248,12 @@ SRST
         open up for a specified of time (i.e. notify-window).
         Default: notify-vmexit=run,notify-window=0.
 
+    ``pmu-cap-disabled=true|false``
+        When the KVM accelerator is used, it controls whether to disable the
+        KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the
+        PMU virtualization is disabled at the KVM module side. This is for
+        x86 host only.
+
 ERST
 
 DEF("smp", HAS_ARG, QEMU_OPTION_smp,
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a213209379..090e4fb44d 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -122,6 +122,7 @@ static bool has_msr_ucode_rev;
 static bool has_msr_vmx_procbased_ctls2;
 static bool has_msr_perf_capabs;
 static bool has_msr_pkrs;
+static bool has_pmu_cap;
 
 static uint32_t has_architectural_pmu_version;
 static uint32_t num_architectural_pmu_gp_counters;
@@ -2652,6 +2653,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
+    has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY);
+
+    if (s->pmu_cap_disabled) {
+        if (has_pmu_cap) {
+            ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0,
+                                    KVM_PMU_CAP_DISABLE);
+            if (ret < 0) {
+                s->pmu_cap_disabled = false;
+                error_report("kvm: Failed to disable pmu cap: %s",
+                             strerror(-ret));
+            }
+        } else {
+            s->pmu_cap_disabled = false;
+            error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported");
+        }
+    }
+
     return 0;
 }
 
@@ -5706,6 +5724,28 @@ static void kvm_arch_set_notify_window(Object *obj, Visitor *v,
     s->notify_window = value;
 }
 
+static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v,
+                                     const char *name, void *opaque,
+                                     Error **errp)
[PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs
This patchset is to fix two svm pmu virtualization bugs, x86 only.

version 1:
https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/

1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.

Using "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
virtualization. There is still the below at the VM linux side ...

[    0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like the below.

[    0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[    0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

The 1st patch introduces a new x86-only accel/kvm property
"pmu-cap-disabled=true" to disable the pmu virtualization via
KVM_PMU_CAP_DISABLE.

I considered 'KVM_X86_SET_MSR_FILTER' initially, before patchset v1. Both
KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctls. I finally
used the latter because it is easier to use.

2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
at the KVM side may inject random unwanted/unknown NMIs to the VM.

The svm pmu registers are not reset during QEMU system_reset.

(1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while
it is running "perf top". The pmu registers are not disabled gracefully.

(2). Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not put the AMD pmu registers to the KVM side. As a
result, some pmu events are still enabled at the KVM side.

(3). The KVM pmc_speculative_in_use() always returns true so that the
events will not be reclaimed. The kvm_pmc->perf_event is still active.

(4). After the reboot, the VM kernel reports the below error:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

(5). In the worst case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The 2nd patch fixes the issue by resetting the AMD pmu registers as well
as the Intel registers.

This patchset does not cover PerfMonV2, until the below patchset is merged
into the KVM side.

[PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/

Dongli Zhang (2):
  target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
  target/i386/kvm: get and put AMD pmu registers

 accel/kvm/kvm-all.c      |   1 +
 include/sysemu/kvm_int.h |   1 +
 qemu-options.hx          |   7 +++
 target/i386/cpu.h        |   5 ++
 target/i386/kvm/kvm.c    | 129 +-
 5 files changed, 141 insertions(+), 2 deletions(-)

Thank you very much!

Dongli Zhang
[PATCH v2 2/2] target/i386/kvm: get and put AMD pmu registers
The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
the KVM side. However, only the Intel gp/fixed/global pmu registers are
involved. There is no implementation for the AMD pmu registers.

The 'has_architectural_pmu_version' and
'num_architectural_pmu_gp_counters' are calculated at kvm_arch_init_vcpu()
via cpuid(0xa). This does not work for AMD. Before AMD PerfMonV2, the
number of gp registers is decided based on the CPU version.

This patch adds support for the AMD version=1 pmu, to get and put the AMD
pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

2. Although x86_cpu_reset() resets many registers to zero, kvm_put_msrs()
does not put the AMD pmu registers to the KVM side. As a result, some pmu
events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the
events will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports the below error:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

5. In a worse case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

This patch fixes the issue by resetting the AMD pmu registers during the
reset.
Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 target/i386/cpu.h     |  5 +++
 target/i386/kvm/kvm.c | 83 +--
 2 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index d4bc19577a..4cf0b98817 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -468,6 +468,11 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390

+#define MSR_K7_EVNTSEL0                 0xc001
+#define MSR_K7_PERFCTR0                 0xc0010004
+#define MSR_F15H_PERF_CTL0              0xc0010200
+#define MSR_F15H_PERF_CTR0              0xc0010201
+
 #define MSR_MC0_CTL                     0x400
 #define MSR_MC0_STATUS                  0x401
 #define MSR_MC0_ADDR                    0x402

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 090e4fb44d..296cd4cab7 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1987,6 +1987,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
         }
     }

+    /*
+     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+     * disable the AMD pmu virtualization.
+     *
+     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
+     * indicates the KVM side has already disabled the pmu virtualization.
+     */
+    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
+        int64_t family;
+
+        family = (env->cpuid_version >> 8) & 0xf;
+        if (family == 0xf) {
+            family += (env->cpuid_version >> 20) & 0xff;
+        }
+
+        if (family >= 6) {
+            has_architectural_pmu_version = 1;
+
+            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
+                num_architectural_pmu_gp_counters = 6;
+            } else {
+                num_architectural_pmu_gp_counters = 4;
+            }
+        }
+    }
+
     cpu_x86_cpuid(env, 0x8000, 0, &limit, &unused, &unused, &unused);

     for (i = 0x8000; i <= limit; i++) {
@@ -3323,7 +3349,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
     }

-    if (has_architectural_pmu_version > 0) {
+    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
         if (has_architectural_pmu_version > 1) {
             /* Stop the counter.  */
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -3354,6 +3380,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
             env->msr_global_ctrl);
         }
     }
+
+    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+        uint32_t sel_base = MSR_K7_EVNTSEL0;
+        uint32_t ctr_base = MSR_K7_PERFCTR0;
+        uint32_t step = 1;
+
+        if (num_architectural_pmu_gp_counters == 6) {
+            sel_base = MSR_F15H_PERF_CTL0;
+            ctr_base = MSR_F15H_PERF_CTR0;
+            step = 2;
+        }
+
+        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+            kvm_msr_entry_add(c
Re: [PATCH 1/1] dump: kdump-zlib data pages not dumped with pvtime/aarch64
Hi Marc-André,

May I get any feedback on this bugfix? Feel free to let me know if I can
help by collecting any data from the aarch64 server.

Since pvtime is supported by default by QEMU/Linux, this indicates that
the dump currently does not work for aarch64 with new versions of
QEMU/Linux.

Thank you very much!

Dongli Zhang

On 7/12/23 22:58, Dongli Zhang wrote:
> The kdump-zlib data pages are not dumped from the aarch64 host when
> 'pvtime' is involved, that is, when the block->target_end is not aligned
> to page_size. In the below example, it is expected to dump two blocks.
>
> (qemu) info mtree -f
> ... ...
> 090a-090a0fff (prio 0, ram): pvtime KVM
> ... ...
> 4000-0001bfff (prio 0, ram): mach-virt.ram KVM
> ... ...
>
> However, there is an issue with get_next_page() so that the pages for
> "mach-virt.ram" will not be dumped.
>
> At line 1296, although we have reached the end of the 'pvtime' block,
> since it is not aligned to the page_size (e.g., 0x1), it will not break
> at line 1298.
>
> 1255 static bool get_next_page(GuestPhysBlock **blockptr, uint64_t *pfnptr,
> 1256                           uint8_t **bufptr, DumpState *s)
> ... ...
> 1294             memcpy(buf + addr % page_size, hbuf, n);
> 1295             addr += n;
> 1296             if (addr % page_size == 0) {
> 1297                 /* we filled up the page */
> 1298                 break;
> 1299             }
>
> As a result, get_next_page() will continue to the next
> block ("mach-virt.ram"). Finally, when get_next_page() returns to the
> caller:
>
> - 'pfnptr' is referring to the 'pvtime'
> - but 'blockptr' is referring to the "mach-virt.ram"
>
> When get_next_page() is called the next time, "*pfnptr += 1" still refers
> to the prior 'pvtime'. It will exit immediately because it is out of the
> range of the current "mach-virt.ram".
>
> The fix is to break when it is time to come to the next block, so that
> both 'pfnptr' and 'blockptr' refer to the same block.
> > Fixes: 94d788408d2d ("dump: fix kdump to work over non-aligned blocks") > Cc: Joe Jin > Signed-off-by: Dongli Zhang > --- > dump/dump.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/dump/dump.c b/dump/dump.c > index 1f1a6edcab..c93e4c572f 100644 > --- a/dump/dump.c > +++ b/dump/dump.c > @@ -1293,8 +1293,8 @@ static bool get_next_page(GuestPhysBlock **blockptr, > uint64_t *pfnptr, > > memcpy(buf + addr % page_size, hbuf, n); > addr += n; > -if (addr % page_size == 0) { > -/* we filled up the page */ > +if (addr % page_size == 0 || addr >= block->target_end) { > +/* we filled up the page or the current block is finished */ > break; > } > } else {
Re: [PATCH v4 1/1] oslib-posix: initialize backend memory objects in parallel
t touch_all_pages(char *area, size_t hpagesize, > size_t numpages, > } > > if (use_madv_populate_write) { > -/* Avoid creating a single thread for MADV_POPULATE_WRITE */ > -if (context.num_threads == 1) { > +/* > + * Avoid creating a single thread for MADV_POPULATE_WRITE when > + * preallocating synchronously. > + */ > +if (context->num_threads == 1 && !async) { > if (qemu_madvise(area, hpagesize * numpages, > QEMU_MADV_POPULATE_WRITE)) { > return -errno; > @@ -445,50 +478,86 @@ static int touch_all_pages(char *area, size_t > hpagesize, size_t numpages, > touch_fn = do_touch_pages; > } > > -context.threads = g_new0(MemsetThread, context.num_threads); > -numpages_per_thread = numpages / context.num_threads; > -leftover = numpages % context.num_threads; > -for (i = 0; i < context.num_threads; i++) { > -context.threads[i].addr = addr; > -context.threads[i].numpages = numpages_per_thread + (i < leftover); > -context.threads[i].hpagesize = hpagesize; > -context.threads[i].context = &context; > +context->threads = g_new0(MemsetThread, context->num_threads); > +numpages_per_thread = numpages / context->num_threads; > +leftover = numpages % context->num_threads; > +for (i = 0; i < context->num_threads; i++) { > +context->threads[i].addr = addr; > +context->threads[i].numpages = numpages_per_thread + (i < leftover); > +context->threads[i].hpagesize = hpagesize; > +context->threads[i].context = context; > if (tc) { > -thread_context_create_thread(tc, &context.threads[i].pgthread, > +thread_context_create_thread(tc, &context->threads[i].pgthread, > "touch_pages", > - touch_fn, &context.threads[i], > + touch_fn, &context->threads[i], > QEMU_THREAD_JOINABLE); > } else { > -qemu_thread_create(&context.threads[i].pgthread, "touch_pages", > - touch_fn, &context.threads[i], > +qemu_thread_create(&context->threads[i].pgthread, "touch_pages", > + touch_fn, &context->threads[i], > QEMU_THREAD_JOINABLE); > } > -addr += context.threads[i].numpages * hpagesize; > +addr += 
context->threads[i].numpages * hpagesize; > +} > + > +if (async) { > +/* > + * async requests currently require the BQL. Add it to the list and > kick > + * preallocation off during qemu_finish_async_prealloc_mem(). > + */ > +assert(bql_locked()); > +QLIST_INSERT_HEAD(&memset_contexts, context, next); > +return 0; > } > > if (!use_madv_populate_write) { > -sigbus_memset_context = &context; > +sigbus_memset_context = context; > } > > qemu_mutex_lock(&page_mutex); > -context.all_threads_created = true; > +context->all_threads_created = true; > qemu_cond_broadcast(&page_cond); > qemu_mutex_unlock(&page_mutex); > > -for (i = 0; i < context.num_threads; i++) { > -int tmp = (uintptr_t)qemu_thread_join(&context.threads[i].pgthread); > +ret = wait_and_free_mem_prealloc_context(context); > > +if (!use_madv_populate_write) { > +sigbus_memset_context = NULL; > +} > +return ret; > +} > + > +bool qemu_finish_async_prealloc_mem(Error **errp) > +{ > +int ret, tmp; The above should be initialized? I did a build test and encounter: In file included from ../util/oslib-posix.c:36: ../util/oslib-posix.c: In function ‘qemu_finish_async_prealloc_mem’: /home/libvirt/vm/software/qemu/include/qapi/error.h:334:5: error: ‘ret’ may be used uninitialized in this function [-Werror=maybe-uninitialized] 334 | error_setg_errno_internal((errp), __FILE__, __LINE__, __func__, \ | ^ ../util/oslib-posix.c:531:9: note: ‘ret’ was declared here 531 | int ret, tmp; | ^~~ cc1: all warnings being treated as errors ninja: build stopped: subcommand failed. make: *** [Makefile:162: run-ninja] Error 1 Thank you very much! Dongli Zhang > +MemsetContext *context, *next_context; > + > +/* Waiting for preallocation requires the BQL. */ > +assert(bql_locked()); >
Re: [RFC 0/8] virtio,vhost: Add VIRTIO_F_IN_ORDER support
Hi Jonah,

Would you mind helping explain how VIRTIO_F_IN_ORDER improves the
performance?

https://lore.kernel.org/all/20240321155717.1392787-1-jonah.pal...@oracle.com/#t

I tried to look for it in prior discussions but could not find why.

https://lore.kernel.org/all/byapr18mb2791df7e6c0f61e2d8698e8fa0...@byapr18mb2791.namprd18.prod.outlook.com/

Thank you very much!

Dongli Zhang

On 3/21/24 08:57, Jonah Palmer wrote:
> The goal of these patches is to add support to a variety of virtio and
> vhost devices for the VIRTIO_F_IN_ORDER transport feature. This feature
> indicates that all buffers are used by the device in the same order in
> which they were made available by the driver.
>
> These patches attempt to implement a generalized, non-device-specific
> solution to support this feature.
>
> The core feature behind this solution is a buffer mechanism in the form
> of GLib's GHashTable. The decision behind using a hash table was to
> leverage their ability for quick lookup, insertion, and removal
> operations. Given that our keys are simply numbers of an ordered
> sequence, a hash table seemed like the best choice for a buffer
> mechanism.
>
> -----
>
> The strategy behind this implementation is as follows:
>
> We know that buffers that are popped from the available ring and
> enqueued for further processing will always be done in the same order in
> which they were made available by the driver. Given this, we can note
> their order by assigning the resulting VirtQueueElement a key. This key
> is a number in a sequence that represents the order in which they were
> popped from the available ring, relative to the other VirtQueueElements.
> > For example, given 3 "elements" that were popped from the available > ring, we assign a key value to them which represents their order (elem0 > is popped first, then elem1, then lastly elem2): > > elem2 -- elem1 -- elem0 ---> Enqueue for processing > (key: 2)(key: 1)(key: 0) > > Then these elements are enqueued for further processing by the host. > > While most devices will return these completed elements in the same > order in which they were enqueued, some devices may not (e.g. > virtio-blk). To guarantee that these elements are put on the used ring > in the same order in which they were enqueued, we can use a buffering > mechanism that keeps track of the next expected sequence number of an > element. > > In other words, if the completed element does not have a key value that > matches the next expected sequence number, then we know this element is > not in-order and we must stash it away in a hash table until an order > can be made. The element's key value is used as the key for placing it > in the hash table. > > If the completed element has a key value that matches the next expected > sequence number, then we know this element is in-order and we can push > it on the used ring. Then we increment the next expected sequence number > and check if the hash table contains an element at this key location. > > If so, we retrieve this element, push it to the used ring, delete the > key-value pair from the hash table, increment the next expected sequence > number, and check the hash table again for an element at this new key > location. This process is repeated until we're unable to find an element > in the hash table to continue the order. > > So, for example, say the 3 elements we enqueued were completed in the > following order: elem1, elem2, elem0. The next expected sequence number > is 0: > > exp-seq-num = 0: > > elem1 --> elem1.key == exp-seq-num ? 
--> No, stash it
> (key: 1)                             |
>                                      v
>                            |----------------|
>                            | key: 1 - elem1 |
>                            |----------------|
>
> ----------------------------------------------------
> exp-seq-num = 0:
>
> elem2 --> elem2.key == exp-seq-num ? --> No, stash it
> (key: 2)                             |
>                                      v
>                            |----------------|
>                            | key: 1 - elem1 |
>                            |----------------|
>                            | key: 2 - elem2 |
>                            |----------------|
>
> --
Re: qemu process consumes 100% host CPU after reverting snapshot
To share my test result: I could NOT reproduce the issue with the below
command line and QEMU-8.2. However, I can reproduce it with QEMU-6.2,
where the mainloop consumes 100% CPU usage.

qemu-system-x86_64 \
--enable-kvm -cpu host -smp 2 -m 8G \
-object throttle-group,id=limit0,x-iops-total=200,x-iops-total-max=200,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1 \
-object throttle-group,id=limit1,x-iops-total=250,x-iops-total-max=250,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1 \
-object throttle-group,id=limit2,x-iops-total=300,x-iops-total-max=300,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1 \
-object throttle-group,id=limit012,x-iops-total=400,x-iops-total-max=400,x-bps-total-max-length=1,x-bps-read-max-length=1,x-bps-write-max-length=1,x-iops-total-max-length=1,x-iops-read-max-length=1,x-iops-write-max-length=1 \
-device virtio-blk-pci,bus=pci.0,addr=0x4,drive=libvirt-4-format,id=virtio-disk0,bootindex=1 \
-blockdev driver=file,filename=image.qcow2,node-name=libvirt-4-storage,discard=unmap \
-blockdev node-name=libvirt-4-format,driver=qcow2,file=libvirt-4-storage \
-device virtio-blk-pci,bus=pci.0,addr=0x5,drive=libvirt-6-filter,id=virtio-disk1 \
-blockdev driver=file,filename=test01.qcow2,node-name=libvirt-3-storage,discard=unmap \
-blockdev node-name=libvirt-3-format,driver=qcow2,file=libvirt-3-storage \
-blockdev driver=throttle,node-name=libvirt-5-filter,throttle-group=limit0,file=libvirt-3-format \
-blockdev driver=throttle,node-name=libvirt-6-filter,throttle-group=limit012,file=libvirt-5-filter \
-device virtio-blk-pci,bus=pci.0,addr=0x6,drive=libvirt-4-filter,id=virtio-disk2 \
-blockdev driver=file,filename=test02.qcow2,node-name=libvirt-2-storage,discard=unmap \
-blockdev node-name=libvirt-2-format,driver=qcow2,file=libvirt-2-storage \
-blockdev driver=throttle,node-name=libvirt-3-filter,throttle-group=limit1,file=libvirt-2-format \
-blockdev driver=throttle,node-name=libvirt-4-filter,throttle-group=limit012,file=libvirt-3-filter \
-device virtio-blk-pci,bus=pci.0,addr=0x7,drive=libvirt-2-filter,id=virtio-disk3 \
-blockdev driver=file,filename=test03.qcow2,node-name=libvirt-1-storage,discard=unmap \
-blockdev node-name=libvirt-1-format,driver=qcow2,file=libvirt-1-storage \
-blockdev driver=throttle,node-name=libvirt-1-filter,throttle-group=limit2,file=libvirt-1-format \
-blockdev driver=throttle,node-name=libvirt-2-filter,throttle-group=limit012,file=libvirt-1-filter \
-netdev user,id=user0,hostfwd=tcp::5028-:22 \
-device virtio-net-pci,netdev=user0,id=net0,mac=52:54:00:12:34:56,bus=pci.0,addr=0x3 \
-monitor stdio

(qemu) info status
(qemu) savevm snapshot1
(qemu) loadvm snapshot1

Dongli Zhang

On 3/28/24 03:50, Chun Feng Wu wrote:
> BTW, if I download qemu8.0 (https://download.qemu.org/qemu-8.0.0.tar.xz),
> compile and install it in my ubuntu22.04, launch vm with the same
> command (just update --machine to be "pc"), the host cpu usage is not
> high, it seems to be a bug in QEMU6
>
> Also, I have another question, for disk iotune or throttle group, after
> reverting snapshot, if I login vm, and run fio, I/O performance drops a
> lot in both QEMU6 and QEMU8, do anyone know the reason? Any explanation
> would be appreciated!
>
> On 2024/3/28 07:32, Chun Feng Wu wrote:
>>
>> Hi,
>>
>> I am testing throttle filter chain (multiple throttle-groups on disk)
>> with the following steps:
>> 1. start guest vm (chained throttle filters applied on disk per
>> https://github.com/qemu/qemu/blob/master/docs/throttle.txt)
>> 2. take snapshot
>> 3. revert snapshot
>>
>> after step3, I noticed qemu process in host consumes 100% cpu, and
>> after I login guest vm, vm cannot (or slowly) respond to my cmd (it
>> works well before reverting).
>>
>>   PID USER  PR NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
>> 65455 root  20  0 9659924 891328 20132 R 100.3  5.4  29:39.93 qemu-system-x86
>>
>> Does anybody know why such issue happens? is it a bug or I
>> misunderstand something?
>>
>> my cmd:
>>
>> qemu-system-x86_64 \
>> -name ubuntu-20.04-vm,debug-threads=on \
>> -machine pc-i440fx-jammy,usb=off,dump-guest-core=off \
>> -acce
Re: [Qemu-devel] ssh session with qemu-arm using busybox
On 3/13/19 1:08 AM, Pintu Agarwal wrote: > On Tue, Mar 12, 2019 at 7:41 PM Suzuki K Poulose > wrote: >> >> >> >> On 12/03/2019 14:02, Pintu Agarwal wrote: >>>> >>>> -netdev user,id=unet,hostfwd=tcp::-:22 \ >>>> -net user \ >>>> >>>> and you 'll get guest's port 22 to be forwarded to hosts port , so >>>> you can do >>>> >>>> ssh root@localhost: >>>> >>>> from the host. >>>> >>> >>> I tried many different options, but unfortunately none worked for me. >>> 1) >>> qemu-system-arm -M vexpress-a9 -m 1024M -kernel >>> ../KERNEL/linux/arch/arm/boot/zImage -dtb >>> ../KERNEL/linux/arch/arm/boot/dts/vexpress-v2p-ca9.dtb -initrd >>> rootfs.img.gz -append "console=ttyAMA0 root=/dev/ram rdinit=/sbin/init >>> ip=dhcp" -nographic -smp 4 -netdev user,id=unet,hostfwd=tcp::-:22 >>> -net user >>> >>> With this the eth0 interface is removed, and I see this message >>> (although login works): >>> qemu-system-arm: warning: hub 0 with no nics >>> qemu-system-arm: warning: netdev unet has no peer >>> Booting Linux on physical CPU 0x0 >>> >>> NET: Registered protocol family 17 >>> >>> Run /sbin/init as init process >>> ifconfig: SIOCSIFADDR: No such device >>> route: SIOCADDRT: Network is unreachable >>> >>> But, ssh is still not working. >>> ssh root@localhost: >>> ssh: Could not resolve hostname localhost:: Name or service not known >> >> man ssh >> >> + >> >> Make sure you have sshd in your custom rootfs and has been stared. >> > > My busybox is a very minimal rootfs. It hardly contains any commands. > Is there any precompiled busybox for arm (cpio image), available (with > ssh, scp and networking in-built), which I can use directly to boot ? > > what else I am missing to make ssh/scp working on qemu ? > When I was working on arm binary rewriting/instrumentation in the university, I use the image created by Linaro. 
http://releases.linaro.org/

One example would be
https://www.linaro.org/blog/linaro-16-04-release-available-for-download-2/
or
https://releases.linaro.org/archive/ubuntu/images/developer-arm64/15.12/linaro-vivid-developer-20151215-114.tar.gz

The license should be good for academia. I do not know if there are any
restrictions on commercial/industry usage.

Thanks again to Linaro for providing those images, which helped me a lot
when I was in the university!

Dongli Zhang
Re: [PATCH] target/i386: Clear xsave pkru bit when KVM XCR0 not support
Hi Yuchen,

On 5/17/23 03:55, Yuchen wrote:
> Migrating guest from Intel new CPU (as Gold 6230) to old CPU (as
> E5-2650 v4) will pause on the destination host. Because old CPU
> not support xsave pkru feature, and KVM KVM_SET_XSAVE ioctl
> return EINVAL.
>
> This kernel commit introduces the problem:
> ea4d6938d4c0 x86/fpu: Replace KVMs home brewed FPU copy from user

The issue introduced by this kernel commit should be resolved by the
below kernel commit.

x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad856280ddea3401e1f5060ef20e6de9f6122c76

Since the old target server does not support pkru, I assume the VM's cpu
type should not support pkru. Therefore, the pkru should never be
migrated away from the source server.

Dongli Zhang

>
> Signed-off-by: YuChen
> ---
> target/i386/xsave_helper.c | 8
> 1 file changed, 8 insertions(+)
>
> diff --git a/target/i386/xsave_helper.c b/target/i386/xsave_helper.c
> index 996e9f3bfe..64e2b969fe 100644
> --- a/target/i386/xsave_helper.c
> +++ b/target/i386/xsave_helper.c
> @@ -6,6 +6,8 @@
>  #include "cpu.h"
>
> +static bool has_xsave_pkru;
> +
>  void x86_cpu_xsave_all_areas(X86CPU *cpu, void *buf, uint32_t buflen)
>  {
>      CPUX86State *env = &cpu->env;
> @@ -47,6 +49,9 @@ void x86_cpu_xsave_all_areas(X86CPU *cpu, void *buf,
> uint32_t buflen)
>          stq_p(xmm + 8, env->xmm_regs[i].ZMM_Q(1));
>      }
>
> +    if (!has_xsave_pkru) {
> +        env->xstate_bv &= ~XSTATE_PKRU_MASK;
> +    }
>      header->xstate_bv = env->xstate_bv;
>
>      e = &x86_ext_save_areas[XSTATE_YMM_BIT];
> @@ -181,6 +186,9 @@ void x86_cpu_xrstor_all_areas(X86CPU *cpu, const void
> *buf, uint32_t buflen)
>          env->xmm_regs[i].ZMM_Q(1) = ldq_p(xmm + 8);
>      }
>
> +    if (xsave->header.xstate_bv & XSTATE_PKRU_MASK) {
> +        has_xsave_pkru = true;
> +    }
>      env->xstate_bv = header->xstate_bv;
>
>      e = &x86_ext_save_areas[XSTATE_YMM_BIT];
> --
> 2.34.1
Re: [PATCH 0/2] scsi: to fix issue on passing host_status to the guest kernel
ping? Thank you very much! Dongli Zhang On 12/10/21 6:16 AM, Dongli Zhang wrote: > This patchset fixes the issue on passing 'host_status' to the guest kernel. > > The 1st patch fixes the erroneous usage of req->host_status. > > The 2nd patch is to pass the SCSI_HOST_ERROR to the guest kernel when the > req->bus->info->fail() is not implemented. I do not add 'Fixes:' because I > am not sure if to not pass SCSI_HOST_ERROR was on purpose, especially for > security reason. > > Thank you very much! > > Dongli Zhang > > >
[PATCH 1/1] monitor/hmp: print trace as option in help for log command
The below is printed when printing help information in the
qemu-system-x86_64 command line, and when CONFIG_TRACE_LOG is enabled:

$ qemu-system-x86_64 -d help
... ...
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.

However, they are not printed in the hmp "help log" command.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 monitor/hmp.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 15ca047..9f48b70 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -287,8 +287,13 @@ void help_cmd(Monitor *mon, const char *name)
         monitor_printf(mon, "Log items (comma separated):\n");
         monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
         for (item = qemu_log_items; item->mask != 0; item++) {
-            monitor_printf(mon, "%-10s %s\n", item->name, item->help);
+            monitor_printf(mon, "%-15s %s\n", item->name, item->help);
         }
+#ifdef CONFIG_TRACE_LOG
+        monitor_printf(mon, "trace:PATTERN  enable trace events\n");
+        monitor_printf(mon, "\nUse \"info trace-events\" to get a list of "
+                       "trace events.\n\n");
+#endif
         return;
     }
--
1.8.3.1
[PATCH 2/2] util/log: add timestamp to logs via qemu_log()
The qemu_log is very helpful for diagnostics. Add the timestamp to the
log when it is enabled (e.g., "-msg timestamp=on").

While there are many other places that may print to the log file, this
patch is only for qemu_log(), e.g., the developer may add
qemu_log/qemu_log_mask to selected locations to diagnose a QEMU issue.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
Please let me know if we should use 'error_with_guestname' as well.

 util/log.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/util/log.c b/util/log.c
index d6eb037..f0a081a 100644
--- a/util/log.c
+++ b/util/log.c
@@ -129,8 +129,15 @@ void qemu_log(const char *fmt, ...)
 {
     FILE *f = qemu_log_trylock();

     if (f) {
+        gchar *timestr;
         va_list ap;

+        if (message_with_timestamp) {
+            timestr = real_time_iso8601();
+            fprintf(f, "%s ", timestr);
+            g_free(timestr);
+        }
+
         va_start(ap, fmt);
         vfprintf(f, fmt, ap);
         va_end(ap);
--
1.8.3.1
[PATCH 1/2] error-report: make real_time_iso8601() non-static
Make real_time_iso8601() a non-static function so that it can be used by
other files later.

No functional change.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 include/qemu/error-report.h | 2 ++
 util/error-report.c         | 3 +--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/qemu/error-report.h b/include/qemu/error-report.h
index 3ae2357..cc73b99 100644
--- a/include/qemu/error-report.h
+++ b/include/qemu/error-report.h
@@ -30,6 +30,8 @@ void loc_set_none(void);
 void loc_set_cmdline(char **argv, int idx, int cnt);
 void loc_set_file(const char *fname, int lno);

+char *real_time_iso8601(void);
+
 int error_vprintf(const char *fmt, va_list ap) G_GNUC_PRINTF(1, 0);
 int error_printf(const char *fmt, ...) G_GNUC_PRINTF(1, 2);

diff --git a/util/error-report.c b/util/error-report.c
index 5edb2e6..63862cd 100644
--- a/util/error-report.c
+++ b/util/error-report.c
@@ -169,8 +169,7 @@ static void print_loc(void)
     }
 }

-static char *
-real_time_iso8601(void)
+char *real_time_iso8601(void)
 {
 #if GLIB_CHECK_VERSION(2,62,0)
     g_autoptr(GDateTime) dt = g_date_time_new_now_utc();
--
1.8.3.1
[PATCH v2 1/1] monitor/hmp: print trace as option in help for log command
The below is printed when printing help information in the
qemu-system-x86_64 command line, and when CONFIG_TRACE_LOG is enabled:

$ qemu-system-x86_64 -d help
... ...
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.

However, they are not printed in the hmp "help log" command.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
Changed since v1:
- change format for "none" as well.

 monitor/hmp.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 15ca047..467fc84 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name)
     if (!strcmp(name, "log")) {
         const QEMULogItem *item;
         monitor_printf(mon, "Log items (comma separated):\n");
-        monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
+        monitor_printf(mon, "%-15s %s\n", "none", "remove all logs");
         for (item = qemu_log_items; item->mask != 0; item++) {
-            monitor_printf(mon, "%-10s %s\n", item->name, item->help);
+            monitor_printf(mon, "%-15s %s\n", item->name, item->help);
         }
+#ifdef CONFIG_TRACE_LOG
+        monitor_printf(mon, "trace:PATTERN  enable trace events\n");
+        monitor_printf(mon, "\nUse \"info trace-events\" to get a list of "
+                       "trace events.\n\n");
+#endif
         return;
     }
--
1.8.3.1
Re: [PATCH 1/1] monitor/hmp: print trace as option in help for log command
Sorry that the format for "none" should be changed as well. I have sent a v2: https://lists.gnu.org/archive/html/qemu-devel/2022-08/msg04445.html Thank you very much! Dongli Zhang On 8/29/22 3:04 AM, Dongli Zhang wrote: > The below is printed when printing help information in qemu-system-x86_64 > command line, and when CONFIG_TRACE_LOG is enabled: > > $ qemu-system-x86_64 -d help > ... ... > trace:PATTERN enable trace events > > Use "-d trace:help" to get a list of trace events. > > However, they are not printed in hmp "help log" command. > > Cc: Joe Jin > Signed-off-by: Dongli Zhang > --- > monitor/hmp.c | 7 ++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/monitor/hmp.c b/monitor/hmp.c > index 15ca047..9f48b70 100644 > --- a/monitor/hmp.c > +++ b/monitor/hmp.c > @@ -287,8 +287,13 @@ void help_cmd(Monitor *mon, const char *name) > monitor_printf(mon, "Log items (comma separated):\n"); > monitor_printf(mon, "%-10s %s\n", "none", "remove all logs"); > for (item = qemu_log_items; item->mask != 0; item++) { > -monitor_printf(mon, "%-10s %s\n", item->name, item->help); > +monitor_printf(mon, "%-15s %s\n", item->name, item->help); > } > +#ifdef CONFIG_TRACE_LOG > +monitor_printf(mon, "trace:PATTERN enable trace events\n"); > +monitor_printf(mon, "\nUse \"info trace-events\" to get a list > of " > +"trace events.\n\n"); > +#endif > return; > } > >
Re: [PATCH v2 1/1] monitor/hmp: print trace as option in help for log command
Hi Markus, On 8/30/22 4:04 AM, Markus Armbruster wrote: > Dongli Zhang writes: > >> The below is printed when printing help information in qemu-system-x86_64 >> command line, and when CONFIG_TRACE_LOG is enabled: >> >> $ qemu-system-x86_64 -d help >> ... ... >> trace:PATTERN enable trace events >> >> Use "-d trace:help" to get a list of trace events. >> >> However, they are not printed in hmp "help log" command. > > This leaves me guessing what exactly the patch tries to do. I will clarify in the commit message. > >> Cc: Joe Jin >> Signed-off-by: Dongli Zhang >> --- >> Changed since v1: >> - change format for "none" as well. >> >> monitor/hmp.c | 9 +++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/monitor/hmp.c b/monitor/hmp.c >> index 15ca047..467fc84 100644 >> --- a/monitor/hmp.c >> +++ b/monitor/hmp.c >> @@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name) >> if (!strcmp(name, "log")) { >> const QEMULogItem *item; >> monitor_printf(mon, "Log items (comma separated):\n"); >> -monitor_printf(mon, "%-10s %s\n", "none", "remove all logs"); >> +monitor_printf(mon, "%-15s %s\n", "none", "remove all logs"); >> for (item = qemu_log_items; item->mask != 0; item++) { >> -monitor_printf(mon, "%-10s %s\n", item->name, item->help); >> +monitor_printf(mon, "%-15s %s\n", item->name, item->help); >> } >> +#ifdef CONFIG_TRACE_LOG >> +monitor_printf(mon, "trace:PATTERN enable trace events\n"); >> +monitor_printf(mon, "\nUse \"info trace-events\" to get a list >> of " >> +"trace events.\n\n"); > > Aha: it fixes help to show "log trace:PATTERN". Was that forgotten in > Paolo's commit c84ea00dc2 'log: add "-d trace:PATTERN"'? I will add the Fixes tag. > > "info trace-events", hmmm... it shows trace events and their state. > "log trace:help" also lists them, less their state, and in opposite > order. Why do we need both? I will print "log trace:help" in the help output. > > What about showing them in alphabetical order? 
The order follows how the items are defined in the qemu_log_items[] array.
Re-ordering them in the array may introduce more conflicts when backporting
a util/log patch to an older QEMU version.

Please let me know if you prefer to re-order; otherwise, I would prefer to
avoid that.

Thank you very much for the suggestions!

Dongli Zhang

>
>> +#endif
>> return;
>> }
>
[PATCH v3 1/1] monitor/hmp: print trace as option in help for log command
The following is printed as help information by the qemu-system-x86_64
command line when CONFIG_TRACE_LOG is enabled:

$ qemu-system-x86_64 -d help
... ...
trace:PATTERN enable trace events

Use "-d trace:help" to get a list of trace events.

However, the "trace:PATTERN" option is only printed by
"qemu-system-x86_64 -d help"; it is missing from the hmp "help log"
command.

Fixes: c84ea00dc2 ("log: add "-d trace:PATTERN"")
Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
Changed since v1:
- change format for "none" as well.
Changed since v2:
- use "log trace:help" in help message.
- add more clarification in commit message.
- add 'Fixes' tag.
---
 monitor/hmp.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 15ca04735c..a3375d0341 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name)
     if (!strcmp(name, "log")) {
         const QEMULogItem *item;
         monitor_printf(mon, "Log items (comma separated):\n");
-        monitor_printf(mon, "%-10s %s\n", "none", "remove all logs");
+        monitor_printf(mon, "%-15s %s\n", "none", "remove all logs");
         for (item = qemu_log_items; item->mask != 0; item++) {
-            monitor_printf(mon, "%-10s %s\n", item->name, item->help);
+            monitor_printf(mon, "%-15s %s\n", item->name, item->help);
         }
+#ifdef CONFIG_TRACE_LOG
+        monitor_printf(mon, "trace:PATTERN enable trace events\n");
+        monitor_printf(mon, "\nUse \"log trace:help\" to get a list of "
+                       "trace events.\n\n");
+#endif
         return;
     }
-- 
2.17.1
Re: [PATCH 2/2] util/log: add timestamp to logs via qemu_log()
Hi Markus and Richard, Thank you very much for the feedback. I agree this is not a good solution. I will look for alternatives to add timestamp. Thank you very much! Dongli Zhang On 8/30/22 8:31 AM, Richard Henderson wrote: > On 8/30/22 04:09, Markus Armbruster wrote: >> Dongli Zhang writes: >> >>> The qemu_log is very helpful for diagnostic. Add the timestamp to the log >>> when it is enabled (e.g., "-msg timestamp=on"). >>> >>> While there are many other places that may print to log file, this patch is >>> only for qemu_log(), e.g., the developer may add qemu_log/qemu_log_mask to >>> selected locations to diagnose QEMU issue. >> >> Opinions on the new feature, anyone? >> >>> Cc: Joe Jin >>> Signed-off-by: Dongli Zhang >>> --- >>> Please let me know if we should use 'error_with_guestname' as well. >>> >>> util/log.c | 7 +++ >>> 1 file changed, 7 insertions(+) >>> >>> diff --git a/util/log.c b/util/log.c >>> index d6eb037..f0a081a 100644 >>> --- a/util/log.c >>> +++ b/util/log.c >>> @@ -129,8 +129,15 @@ void qemu_log(const char *fmt, ...) >>> { >>> FILE *f = qemu_log_trylock(); >>> if (f) { >>> + gchar *timestr; >>> va_list ap; >>> + if (message_with_timestamp) { >>> + timestr = real_time_iso8601(); >>> + fprintf(f, "%s ", timestr); >>> + g_free(timestr); >>> + } >>> + >>> va_start(ap, fmt); >>> vfprintf(f, fmt, ap); >>> va_end(ap); >> >> This extends -msg timestamp=on to apply to log messages without >> documenting it in -help or anywhere else. Needs fixing. > > I think this is a poor place to add the timestamp. > > You'll find that qemu_log is used many times to assemble pieces, e.g. 
> 
> linux-user/thunk.c:360:    qemu_log("%" PRIu64, tswap64(val));
> linux-user/thunk.c:376:    qemu_log("\"");
> linux-user/thunk.c:379:    qemu_log("[");
> linux-user/thunk.c:384:    qemu_log(",");
> linux-user/thunk.c:391:    qemu_log("\"");
> linux-user/thunk.c:393:    qemu_log("]");
> linux-user/thunk.c:417:    qemu_log("{");
> linux-user/thunk.c:420:    qemu_log(",");
> linux-user/thunk.c:424:    qemu_log("}");
> 
> Not the best idea, really, but the replacement for this is to avoid
> qemu_log entirely, and use
> 
>     f = qemu_log_trylock();
>     if (f) {
>         fprintf some stuff
>         qemu_log_unlock(f);
>     }
> 
> at which point you don't get your timestamp either. You'd need to
> explicitly add timestamps to individual locations.
> 
> It would probably be easier to add timestamps to tracepoints, which are
> always emitted as a unit.
> 
> 
> r~
>
Re: [PATCH v2 1/1] monitor/hmp: print trace as option in help for log command
Hi Markus, On 8/31/22 11:47 PM, Markus Armbruster wrote: > Dongli Zhang writes: > >> Hi Markus, >> >> On 8/30/22 4:04 AM, Markus Armbruster wrote: >>> Dongli Zhang writes: >>> >>>> The below is printed when printing help information in qemu-system-x86_64 >>>> command line, and when CONFIG_TRACE_LOG is enabled: >>>> >>>> $ qemu-system-x86_64 -d help >>>> ... ... >>>> trace:PATTERN enable trace events >>>> >>>> Use "-d trace:help" to get a list of trace events. >>>> >>>> However, they are not printed in hmp "help log" command. >>> >>> This leaves me guessing what exactly the patch tries to do. >> >> I will clarify in the commit message. >> >>> >>>> Cc: Joe Jin >>>> Signed-off-by: Dongli Zhang >>>> --- >>>> Changed since v1: >>>> - change format for "none" as well. >>>> >>>> monitor/hmp.c | 9 +++-- >>>> 1 file changed, 7 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/monitor/hmp.c b/monitor/hmp.c >>>> index 15ca047..467fc84 100644 >>>> --- a/monitor/hmp.c >>>> +++ b/monitor/hmp.c >>>> @@ -285,10 +285,15 @@ void help_cmd(Monitor *mon, const char *name) >>>> if (!strcmp(name, "log")) { >>>> const QEMULogItem *item; >>>> monitor_printf(mon, "Log items (comma separated):\n"); >>>> -monitor_printf(mon, "%-10s %s\n", "none", "remove all logs"); >>>> +monitor_printf(mon, "%-15s %s\n", "none", "remove all logs"); >>>> for (item = qemu_log_items; item->mask != 0; item++) { >>>> -monitor_printf(mon, "%-10s %s\n", item->name, item->help); >>>> +monitor_printf(mon, "%-15s %s\n", item->name, item->help); >>>> } >>>> +#ifdef CONFIG_TRACE_LOG >>>> +monitor_printf(mon, "trace:PATTERN enable trace events\n"); >>>> +monitor_printf(mon, "\nUse \"info trace-events\" to get a >>>> list of " >>>> +"trace events.\n\n"); >>> >>> Aha: it fixes help to show "log trace:PATTERN". Was that forgotten in >>> Paolo's commit c84ea00dc2 'log: add "-d trace:PATTERN"'? >> >> I will add the Fixes tag. >> >>> >>> "info trace-events", hmmm... it shows trace events and their state. 
>>> "log trace:help" also lists them, less their state, and in opposite
>>> order. Why do we need both?
> 
> I guess we have both because we want an HMP command to show the state of
> trace events ("info trace-events"), and we want "-d trace" to provide
> help.
> 
> The latter also lets HMP command "log trace" help, which feels less
> important to me, since "info trace-events" exists and is easier to find
> and significantly more usable than "log trace:help": it can filter its
> output, and unfiltered output is too long to be useful without something
> like grep.
> 
> Could the two share more code?

Thank you very much for the suggestion. I will check whether they can
share the code.

> 
> Hmm, there seems to be something wrong with "log trace:help": I see
> truncated output. Moreover, output goes to stdout instead of the
> monitor. That's wrong. Any help you can also emit from the monitor
> should be printed with qemu_printf().

Currently "log trace:help" prints via fprintf() to stdout. I will fix
this.

169 void trace_list_events(FILE *f)
170 {
171     TraceEventIter iter;
172     TraceEvent *ev;
173     trace_event_iter_init_all(&iter);
174     while ((ev = trace_event_iter_next(&iter)) != NULL) {
175         fprintf(f, "%s\n", trace_event_get_name(ev));
176     }
177 #ifdef CONFIG_TRACE_DTRACE
178     fprintf(f, "This list of names of trace points may be incomplete "
179                "when using the DTrace/SystemTap backends.\n"
180                "Run 'qemu-trace-stap list %s' to print the full list.\n",
181             g_get_prgname());
182 #endif
183 }

> 
>> I will print "log trace:help" in the help output.
>> 
>>> What about showing them in alphabetical order?
>> 
>> The order is following how they are defined in the qemu_log_items[] array. To
>> re-order them in the array may introduce more conflicts when backporting a
>> util/log patch to QEMU old version.
>> 
>> Please let me know if you prefer to re-order. Otherwise, I prefer to avoid
>> that.
> 
> I'm talking about the output of "log trace:help", not the output of "log
> help".
The order is only guaranteed within each trace group. Re-ordering globally
may require some work in scripts/tracetool, at the point where the events
from each group/folder are combined. I will confirm.

Thank you very much for the suggestions!

Dongli Zhang
[PATCH 1/1] hw/core/cpu: always print cpu index with cpu state
The cpu_dump_state() does not print the cpu index. When cpu_dump_state()
is invoked due to a KVM failure, we are not able to tell from which CPU
the state is. The below is an example.

KVM internal error. Suberror: 764064
RAX=0002 RBX=8a9e57c38400 RCX= RDX=8a9cc00ba8a0
RSI=0003 RDI=8a9e57c38400 RBP=b6120c5b3c50 RSP=b6120c5b3c40
R8 = R9 =8a9cc00ba8a0 R10=8e467350 R11=0007
R12=000a R13=8f987e25 R14=8f988a01 R15=
RIP=8e51bb04 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES = 00c0
CS =0010 00a09b00 DPL=0 CS64 [-RA]
SS = 00c0
DS = 00c0
FS = 00c0
GS = 8ac27fcc 00c0
LDT= 00c0
TR =0040 fe096000 206f 8b00 DPL=0 TSS64-busy
GDT= fe094000 007f
IDT= fe00 0fff
CR0=80050033 CR2= CR3=0010ca40a001 CR4=003606e0
DR0= DR1= DR2= DR3=
DR6=fffe0ff0 DR7=0400
EFER=0d01
Code=0f 1f ... ...

Print the cpu->cpu_index in cpu_dump_state() and remove it from the
callers.

Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 hw/core/cpu-common.c      | 1 +
 monitor/hmp-cmds-target.c | 2 --
 softmmu/cpus.c            | 1 -
 3 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c
index 5ccc3837b6..d2503f2d09 100644
--- a/hw/core/cpu-common.c
+++ b/hw/core/cpu-common.c
@@ -105,6 +105,7 @@ void cpu_dump_state(CPUState *cpu, FILE *f, int flags)
     if (cc->dump_state) {
         cpu_synchronize_state(cpu);
+        qemu_fprintf(f, "\nCPU#%d\n", cpu->cpu_index);
         cc->dump_state(cpu, f, flags);
     }
 }
diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
index 0d3e84d960..f7dd354d2a 100644
--- a/monitor/hmp-cmds-target.c
+++ b/monitor/hmp-cmds-target.c
@@ -99,7 +99,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
     if (all_cpus) {
         CPU_FOREACH(cs) {
-            monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
             cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
         }
     } else {
@@ -114,7 +113,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
             return;
         }
 
-        monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
         cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
     }
 }
diff --git a/softmmu/cpus.c b/softmmu/cpus.c
index 9cbc8172b5..f69bbe6abc 100644
--- a/softmmu/cpus.c
+++ b/softmmu/cpus.c
@@ -122,7 +122,6 @@ void hw_error(const char *fmt, ...)
     vfprintf(stderr, fmt, ap);
     fprintf(stderr, "\n");
     CPU_FOREACH(cpu) {
-        fprintf(stderr, "CPU #%d:\n", cpu->cpu_index);
         cpu_dump_state(cpu, stderr, CPU_DUMP_FPU);
     }
     va_end(ap);
-- 
2.34.1
[PATCH 1/1] readline: fix hmp completion issue
The auto completion does not work in some cases.

Case 1.
1. (qemu) info reg
2. Press 'Tab'.
3. It does not auto complete.

Case 2.
1. (qemu) block_resize flo
2. Press 'Tab'.
3. It does not auto complete 'floppy0'.

readline_add_completion_of() already matches the prefix itself, and it
adds every completion when strlen(pfx) is zero. The extra
(name[0] == '\0') check in the callers is therefore redundant for empty
input and wrongly restricts completion to the empty-input case; remove
it.

Fixes: 52f50b1e9f8f ("readline: Extract readline_add_completion_of() from monitor")
Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 monitor/hmp.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/monitor/hmp.c b/monitor/hmp.c
index 2aa85d3982..fee410362f 100644
--- a/monitor/hmp.c
+++ b/monitor/hmp.c
@@ -1189,9 +1189,7 @@ static void cmd_completion(MonitorHMP *mon, const char *name, const char *list)
         }
         memcpy(cmd, pstart, len);
         cmd[len] = '\0';
-        if (name[0] == '\0') {
-            readline_add_completion_of(mon->rs, name, cmd);
-        }
+        readline_add_completion_of(mon->rs, name, cmd);
         if (*p == '\0') {
             break;
         }
@@ -1335,9 +1333,7 @@ static void monitor_find_completion_by_table(MonitorHMP *mon,
             /* block device name completion */
             readline_set_completion_index(mon->rs, strlen(str));
             while ((blk = blk_next(blk)) != NULL) {
-                if (str[0] == '\0') {
-                    readline_add_completion_of(mon->rs, str, blk_name(blk));
-                }
+                readline_add_completion_of(mon->rs, str, blk_name(blk));
             }
             break;
         case 's':
-- 
2.34.1
Re: [PATCH 1/1] hw/core/cpu: always print cpu index with cpu state
Hi Philippe, On 2/6/23 23:16, Philippe Mathieu-Daudé wrote: > On 7/2/23 00:42, Dongli Zhang wrote: >> The cpu_dump_state() does not print the cpu index. When the >> cpu_dump_state() is invoked due to the KVM failure, we are not able to tell >> from which CPU the state is. The below is an example. >> >> KVM internal error. Suberror: 764064 >> RAX=0002 RBX=8a9e57c38400 RCX= >> RDX=8a9cc00ba8a0 >> RSI=0003 RDI=8a9e57c38400 RBP=b6120c5b3c50 >> RSP=b6120c5b3c40 >> R8 = R9 =8a9cc00ba8a0 R10=8e467350 >> R11=0007 >> R12=000a R13=8f987e25 R14=8f988a01 >> R15= >> RIP=8e51bb04 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 >> ES = 00c0 >> CS =0010 00a09b00 DPL=0 CS64 [-RA] >> SS = 00c0 >> DS = 00c0 >> FS = 00c0 >> GS = 8ac27fcc 00c0 >> LDT= 00c0 >> TR =0040 fe096000 206f 8b00 DPL=0 TSS64-busy >> GDT= fe094000 007f >> IDT= fe00 0fff >> CR0=80050033 CR2= CR3=0010ca40a001 CR4=003606e0 >> DR0= DR1= DR2= >> DR3= >> DR6=fffe0ff0 DR7=0400 >> EFER=0d01 >> Code=0f 1f ... ... >> >> Print the cpu->cpu_index in cpu_dump_state() and remove it from the caller. >> >> Cc: Joe Jin >> Signed-off-by: Dongli Zhang >> --- >> hw/core/cpu-common.c | 1 + >> monitor/hmp-cmds-target.c | 2 -- >> softmmu/cpus.c | 1 - >> 3 files changed, 1 insertion(+), 3 deletions(-) >> >> diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c >> index 5ccc3837b6..d2503f2d09 100644 >> --- a/hw/core/cpu-common.c >> +++ b/hw/core/cpu-common.c >> @@ -105,6 +105,7 @@ void cpu_dump_state(CPUState *cpu, FILE *f, int flags) >> if (cc->dump_state) { >> cpu_synchronize_state(cpu); > > Should we check for: > > if (cpu->cpu_index != -1) { > >> + qemu_fprintf(f, "\nCPU#%d\n", cpu->cpu_index); > > } I think you meant if (cpu->cpu_index != UNASSIGNED_CPU_INDEX). I do not see this case may happen within my knowledge. The cpu_index is always expected to be assigned if cpu_exec_realizefn()-->cpu_list_add() is called. 
83 void cpu_list_add(CPUState *cpu)
84 {
85     QEMU_LOCK_GUARD(&qemu_cpu_list_lock);
86     if (cpu->cpu_index == UNASSIGNED_CPU_INDEX) {
87         cpu->cpu_index = cpu_get_free_index();
88         assert(cpu->cpu_index != UNASSIGNED_CPU_INDEX);
89     } else {
90         assert(!cpu_index_auto_assigned);
91     }
92     QTAILQ_INSERT_TAIL_RCU(&cpus, cpu, node);
93     cpu_list_generation_id++;
94 }

In addition, cc->dump_state() is always invoked via cpu_dump_state(). As
a result, e.g., arm_cpu_dump_state() or x86_cpu_dump_state() always
prints the cpu state unconditionally (same for mips, s390 and riscv). I
do not see a reason to hide the cpu_index.

Would you please let me know if the above is wrong? I do not think it is
necessary to filter the cpu_index against UNASSIGNED_CPU_INDEX.

Thank you very much!

Dongli Zhang

>
> ?
>
>>     cc->dump_state(cpu, f, flags);
>>     }
>> }
>> diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c
>> index 0d3e84d960..f7dd354d2a 100644
>> --- a/monitor/hmp-cmds-target.c
>> +++ b/monitor/hmp-cmds-target.c
>> @@ -99,7 +99,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict)
>>     if (all_cpus) {
>>         CPU_FOREACH(cs) {
>> -            monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index);
>>             cpu_dump_state(cs, NULL, CPU_DUMP_FPU);
>>         }
>
Re: [PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs
Can I get feedback on this patchset, especially [PATCH v2 2/2]?

About [PATCH v2 2/2]: currently the issue impacts the usage of PMUs in an
AMD VM, especially in the below case:

1. Enable panic on nmi.
2. Use perf to monitor the performance of the VM. Although untested, I
   think the nmi watchdog has the same effect.
3. A sudden system reset, or a kernel panic (kdump/kexec).
4. After reboot, there will be random unknown NMIs.
5. Unfortunately, the "panic on nmi" may panic the VM randomly at any time.

Thank you very much!

Dongli Zhang

On 12/1/22 16:22, Dongli Zhang wrote:
> This patchset is to fix two svm pmu virtualization bugs, x86 only.
> 
> version 1:
> https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/
> 
> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.
> 
> Using "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
> virtualization. There is still the below at the VM linux side ...
> 
> [0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
> 
> ... although we expect something like below.
> 
> [0.596381] Performance Events: PMU not available due to virtualization,
> using software events only.
> [0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
> 
> The 1st patch has introduced a new x86-only accel/kvm property
> "pmu-cap-disabled=true" to disable the pmu virtualization via
> KVM_PMU_CAP_DISABLE.
> 
> I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1.
> Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctls, I
> finally used the latter because it is easier to use.
> 
> 
> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
> at the KVM side may inject random unwanted/unknown NMIs to the VM.
> 
> The svm pmu registers are not reset during QEMU system_reset.
> 
> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
> is running "perf top". The pmu registers are not disabled gracefully.
> 
> (2). Although the x86_cpu_reset() resets many registers to zero, the
> kvm_put_msrs() does not put AMD pmu registers to the KVM side. As a result,
> some pmu events are still enabled at the KVM side.
> 
> (3). The KVM pmc_speculative_in_use() always returns true so that the events
> will not be reclaimed. The kvm_pmc->perf_event is still active.
> 
> (4). After the reboot, the VM kernel reports the below error:
> 
> [0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS
> detected, complain to your hardware vendor.
> [0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
> c0010200 is 530076)
> 
> (5). In a worse case, the active kvm_pmc->perf_event is still able to
> inject unknown NMIs randomly to the VM kernel.
> 
> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
> 
> The 2nd patch fixes the issue by resetting the AMD pmu registers as well
> as the Intel ones.
> 
> 
> This patchset does not cover PerfMonV2, until the below patchset is merged
> into the KVM side.
> 
> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
> https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/
> 
> 
> Dongli Zhang (2):
>   target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
>   target/i386/kvm: get and put AMD pmu registers
> 
>  accel/kvm/kvm-all.c      | 1 +
>  include/sysemu/kvm_int.h | 1 +
>  qemu-options.hx          | 7 +++
>  target/i386/cpu.h        | 5 ++
>  target/i386/kvm/kvm.c    | 129 +-
>  5 files changed, 141 insertions(+), 2 deletions(-)
> 
> Thank you very much!
> 
> Dongli Zhang
> 
>
[PATCH 1/1] vhost-scsi: fix memleak of vsc->inflight
This is the memleak detected when quitting qemu-system-x86_64 (with
vhost-scsi-pci).

(qemu) quit

=================================================================
==15568==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 40 byte(s) in 1 object(s) allocated from:
    #0 0x7f00aec57917 in __interceptor_calloc (/lib64/libasan.so.6+0xb4917)
    #1 0x7f00ada0d7b5 in g_malloc0 (/lib64/libglib-2.0.so.0+0x517b5)
    #2 0x5648ffd38bac in vhost_scsi_start ../hw/scsi/vhost-scsi.c:92
    #3 0x5648ffd38d52 in vhost_scsi_set_status ../hw/scsi/vhost-scsi.c:131
    #4 0x5648ffda340e in virtio_set_status ../hw/virtio/virtio.c:2036
    #5 0x5648ff8de281 in virtio_ioport_write ../hw/virtio/virtio-pci.c:431
    #6 0x5648ff8deb29 in virtio_pci_config_write ../hw/virtio/virtio-pci.c:576
    #7 0x5648ffe5c0c2 in memory_region_write_accessor ../softmmu/memory.c:493
    #8 0x5648ffe5c424 in access_with_adjusted_size ../softmmu/memory.c:555
    #9 0x5648ffe6428f in memory_region_dispatch_write ../softmmu/memory.c:1515
    #10 0x5648ffe8613d in flatview_write_continue ../softmmu/physmem.c:2825
    #11 0x5648ffe86490 in flatview_write ../softmmu/physmem.c:2867
    #12 0x5648ffe86d9f in address_space_write ../softmmu/physmem.c:2963
    #13 0x5648ffe86e57 in address_space_rw ../softmmu/physmem.c:2973
    #14 0x5648fffbfb3d in kvm_handle_io ../accel/kvm/kvm-all.c:2639
    #15 0x5648fffc0e0d in kvm_cpu_exec ../accel/kvm/kvm-all.c:2890
    #16 0x5648fffc90a7 in kvm_vcpu_thread_fn ../accel/kvm/kvm-accel-ops.c:51
    #17 0x56490042400a in qemu_thread_start ../util/qemu-thread-posix.c:505
    #18 0x7f00ac3b6ea4 in start_thread (/lib64/libpthread.so.0+0x7ea4)

Free the vsc->inflight on the 'stop' path.
Fixes: b82526c7ee ("vhost-scsi: support inflight io track")
Cc: Joe Jin
Cc: Li Feng
Signed-off-by: Dongli Zhang
---
 hw/scsi/vhost-scsi-common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/scsi/vhost-scsi-common.c b/hw/scsi/vhost-scsi-common.c
index 18ea5dcfa1..a06f01af26 100644
--- a/hw/scsi/vhost-scsi-common.c
+++ b/hw/scsi/vhost-scsi-common.c
@@ -113,6 +113,7 @@ void vhost_scsi_common_stop(VHostSCSICommon *vsc)
 
     if (vsc->inflight) {
         vhost_dev_free_inflight(vsc->inflight);
+        g_free(vsc->inflight);
         vsc->inflight = NULL;
     }
 
-- 
2.34.1
Re: [PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs
Ping?

About [PATCH v2 2/2], the bad part is that the customer will not be able
to notice the issue, that is, the "Broken BIOS detected" in dmesg,
immediately. As a result, the customer VM may panic randomly at any time
in the future (once the issue is encountered) if
"/proc/sys/kernel/unknown_nmi_panic" is enabled.

Thank you very much!

Dongli Zhang

On 12/19/22 06:45, Dongli Zhang wrote:
> Can I get feedback for this patchset, especially the [PATCH v2 2/2]?
> 
> About the [PATCH v2 2/2], currently the issue impacts the usage of PMUs
> on an AMD VM, especially the below case:
> 
> 1. Enable panic on nmi.
> 2. Use perf to monitor the performance of VM. Although without a test,
> I think the nmi watchdog has the same effect.
> 3. A sudden system reset, or a kernel panic (kdump/kexec).
> 4. After reboot, there will be random unknown NMI.
> 5. Unfortunately, the "panic on nmi" may panic the VM randomly at any time.
> 
> Thank you very much!
> 
> Dongli Zhang
> 
> On 12/1/22 16:22, Dongli Zhang wrote:
>> This patchset is to fix two svm pmu virtualization bugs, x86 only.
>> 
>> version 1:
>> https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/
>> 
>> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.
>> 
>> To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu
>> virtualization. There is still below at the VM linux side ...
>> 
>> [0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.
>> 
>> ... although we expect something like below.
>> 
>> [0.596381] Performance Events: PMU not available due to virtualization,
>> using software events only.
>> [0.600972] NMI watchdog: Perf NMI watchdog permanently disabled
>> 
>> The 1st patch has introduced a new x86 only accel/kvm property
>> "pmu-cap-disabled=true" to disable the pmu virtualization via
>> KVM_PMU_CAP_DISABLE.
>> 
>> I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1.
>> Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I
>> finally used the latter because it is easier to use.
>> 
>> 
>> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset)
>> at the KVM side may inject random unwanted/unknown NMIs to the VM.
>> 
>> The svm pmu registers are not reset during QEMU system_reset.
>> 
>> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>> 
>> (2). Although the x86_cpu_reset() resets many registers to zero, the
>> kvm_put_msrs() does not put AMD pmu registers to KVM side. As a result,
>> some pmu events are still enabled at the KVM side.
>> 
>> (3). The KVM pmc_speculative_in_use() always returns true so that the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>> 
>> (4). After the reboot, the VM kernel reports below error:
>> 
>> [0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS
>> detected, complain to your hardware vendor.
>> [0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
>> c0010200 is 530076)
>> 
>> (5). In a worse case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly to the VM kernel.
>> 
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>> 
>> The 2nd patch is to fix the issue by resetting AMD pmu registers as well as
>> Intel registers.
>> 
>> 
>> This patchset does not cover PerfMonV2, until the below patchset is merged
>> into the KVM side.
>> 
>> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support
>> https://lore.kernel.org/all/20221111102645.82001-1-lik...@tencent.com/
>> 
>> 
>> Dongli Zhang (2):
>>   target/i386/kvm: introduce 'pmu-cap-disabled' to set
>>   KVM_PMU_CAP_DISABLE
>>   target/i386/kvm: get and put AMD pmu registers
>> 
>>  accel/kvm/kvm-all.c      | 1 +
>>  include/sysemu/kvm_int.h | 1 +
>>  qemu-options.hx          | 7 +++
>>  target/i386/cpu.h        | 5 ++
>>  target/i386/kvm/kvm.c    | 129 +-
>>  5 files changed, 141 insertions(+), 2 deletions(-)
>> 
>> Thank you very much!
>> 
>> Dongli Zhang
>> 
>>
Re: [PATCH v3 1/1] monitor/hmp: print trace as option in help for log command
Hi Markus,

On 9/17/22 2:44 PM, Philippe Mathieu-Daudé via wrote:
> Hi Markus,
> 
> On 2/9/22 14:24, Markus Armbruster wrote:
>> Dongli Zhang writes:
>> 
>>> The below is printed when printing help information in qemu-system-x86_64
>>> command line, and when CONFIG_TRACE_LOG is enabled:
>>> 
>>> $ qemu-system-x86_64 -d help
>>> ... ...
>>> trace:PATTERN enable trace events
>>> 
>>> Use "-d trace:help" to get a list of trace events.
>>> 
>>> However, the options of "trace:PATTERN" are only printed by
>>> "qemu-system-x86_64 -d help", but missing in hmp "help log" command.
>>> 
>>> Fixes: c84ea00dc2 ("log: add "-d trace:PATTERN"")
>>> Cc: Joe Jin
>>> Signed-off-by: Dongli Zhang
>>> ---
>>> Changed since v1:
>>> - change format for "none" as well.
>>> Changed since v2:
>>> - use "log trace:help" in help message.
>>> - add more clarification in commit message.
>>> - add 'Fixes' tag.
>>> ---
>>> monitor/hmp.c | 9 +++--
>>> 1 file changed, 7 insertions(+), 2 deletions(-)
> 
>> Not this patch's fault:
>> 
>> 1. "-d help" terminates with exit status 1, "-d trace:help" with 0. The
>> former is wrong.

May I assume it is expected to exit with status 1 for "-d help"?

According to the output of "-d", there is not even a "help" option, only
a "-d trace:help" option. That is, "-d help" is not officially
supported. The example below uses "-d helloworld" rather than "help".
# qemu-system-x86_64 -d helloworld
Log items (comma separated):
out_asm         show generated host assembly code for each compiled TB
in_asm          show target assembly code for each compiled TB
op              show micro ops for each compiled TB
op_opt          show micro ops after optimization
op_ind          show micro ops before indirect lowering
int             show interrupts/exceptions in short format
exec            show trace before each executed TB (lots of logs)
cpu             show CPU registers before entering a TB (lots of logs)
fpu             include FPU registers in the 'cpu' logging
mmu             log MMU-related activities
pcall           x86 only: show protected mode far calls/returns/exceptions
cpu_reset       show CPU state before CPU resets
unimp           log unimplemented functionality
guest_errors    log when the guest OS does something invalid (eg accessing a
                non-existent register)
page            dump pages at beginning of user mode emulation
nochain         do not chain compiled TBs so that "exec" and "cpu" show
                complete traces
plugin          output from TCG plugins
strace          log every user-mode syscall, its input, and its result
tid             open a separate log file per thread; filename must contain '%d'
trace:PATTERN   enable trace events

Use "-d trace:help" to get a list of trace events.

According to the source code, qemu_str_to_log_mask() expects either log
items or "trace". For any other input (e.g., "help" or "helloworld"),
qemu_str_to_log_mask() returns 0 (no bit set in the mask). That
indicates the input (e.g., "help") is not an expected input.

Therefore, can I assume this is not a bug? I do not think something like
below is very helpful.

diff --git a/softmmu/vl.c b/softmmu/vl.c
index 263f029a8e..54c8e624bf 100644
--- a/softmmu/vl.c
+++ b/softmmu/vl.c
@@ -2389,6 +2389,8 @@ static void qemu_process_early_options(void)
         mask = qemu_str_to_log_mask(log_mask);
         if (!mask) {
             qemu_print_log_usage(stdout);
+            if (g_str_equal(log_mask, "help"))
+                exit(0);
             exit(1);
         }
     }

Thank you very much!

Dongli Zhang

>>
>> 2. HMP "log trace:help" prints to stdout instead of the current monitor.
>>
>> 3.
Output of HMP "log trace:help" sometimes is truncated for me.
>>
>> 4. Output of "log trace:help" and "info trace-events" is unwieldy.
>> Sorted output could be a bit less unwieldy.
>>
>> 5. Could "log trace:help" and "info trace-events" share code?
> 
> Do you mind opening issue(s) on our GitLab so we don't lose your
> analysis buried within the infinite mailing list?
>
[PATCH 0/2] scsi: to fix issue on passing host_status to the guest kernel
This patchset fixes the issue of passing 'host_status' to the guest
kernel.

The 1st patch fixes the erroneous usage of req->host_status.

The 2nd patch passes SCSI_HOST_ERROR to the guest kernel when
req->bus->info->fail() is not implemented.

I do not add 'Fixes:' tags because I am not sure whether not passing
SCSI_HOST_ERROR was on purpose, especially for security reasons.

Thank you very much!

Dongli Zhang
[PATCH 1/2] scsi/scsi_bus: use host_status as parameter for scsi_sense_from_host_status()
The scsi_sense_from_host_status() always returns GOOD, since
req->host_status is still 255 (-1) at this time.

Change req->host_status to host_status so that
scsi_sense_from_host_status() is able to return the expected value.

Fixes: f3126d65b393 ("scsi: move host_status handling into SCSI drivers")
Cc: Joe Jin
Signed-off-by: Dongli Zhang
---
 hw/scsi/scsi-bus.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-bus.c b/hw/scsi/scsi-bus.c
index 77325d8cc7..d46650bd8c 100644
--- a/hw/scsi/scsi-bus.c
+++ b/hw/scsi/scsi-bus.c
@@ -1465,7 +1465,7 @@ void scsi_req_complete_failed(SCSIRequest *req, int host_status)
     assert(req->ops != &reqops_unit_attention);
 
     if (!req->bus->info->fail) {
-        status = scsi_sense_from_host_status(req->host_status, &sense);
+        status = scsi_sense_from_host_status(host_status, &sense);
         if (status == CHECK_CONDITION) {
             scsi_req_build_sense(req, sense);
         }
-- 
2.17.1
[PATCH 2/2] scsi/utils: pass host_status = SCSI_HOST_ERROR to guest kernel
For scsi_req_complete_failed(), when req->bus->info->fail() is
implemented, virtio-scsi passes SCSI_HOST_ERROR to the guest kernel as
VIRTIO_SCSI_S_FAILURE, while pvscsi passes SCSI_HOST_ERROR to the guest
kernel as BTSTAT_HASOFTWARE.

However, scsi_req_complete_failed()->scsi_sense_from_host_status()
always returns GOOD for SCSI_HOST_ERROR when req->bus->info->fail() is
not implemented (e.g., megasas). As a result, the sense is not passed to
the guest kernel.

The SCSI_HOST_ERROR is reproduced on purpose with the below QEMU command
line:

-device megasas,id=vscsi0,bus=pci.0,addr=0x4 \
-drive file=/dev/sdc,format=raw,if=none,id=drive01 \
-device scsi-block,bus=vscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive01 \

... and by removing /dev/sdc from the host with:

"echo 1 > /sys/block/sdc/device/delete"

This patch passes SENSE_CODE(IO_ERROR) to the guest kernel for
host_status = SCSI_HOST_ERROR.

(This issue was detected by running test code from Rui Loura.)

Cc: Joe Jin
Cc: Adnan Misherfi
Cc: Rui Loura
Signed-off-by: Dongli Zhang
---
 scsi/utils.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/scsi/utils.c b/scsi/utils.c
index 357b036671..086a1fea66 100644
--- a/scsi/utils.c
+++ b/scsi/utils.c
@@ -638,6 +638,9 @@ int scsi_sense_from_host_status(uint8_t host_status,
     case SCSI_HOST_ABORTED:
         *sense = SENSE_CODE(COMMAND_ABORTED);
         return CHECK_CONDITION;
+    case SCSI_HOST_ERROR:
+        *sense = SENSE_CODE(IO_ERROR);
+        return CHECK_CONDITION;
     case SCSI_HOST_RESET:
         *sense = SENSE_CODE(RESET);
         return CHECK_CONDITION;
-- 
2.17.1
[PATCH 1/1] dump: kdump-zlib data pages not dumped with pvtime/aarch64
The kdump-zlib data pages are not dumped from an aarch64 host when 'pvtime' is involved, that is, when block->target_end is not aligned to page_size. In the example below, two blocks are expected to be dumped.

(qemu) info mtree -f
... ...
090a-090a0fff (prio 0, ram): pvtime KVM
... ...
4000-0001bfff (prio 0, ram): mach-virt.ram KVM
... ...

However, there is an issue in get_next_page() so that the pages for "mach-virt.ram" will not be dumped. At line 1296, although we have reached the end of the 'pvtime' block, since it is not aligned to the page_size (e.g., 0x1), it will not break at line 1298.

1255 static bool get_next_page(GuestPhysBlock **blockptr, uint64_t *pfnptr,
1256                           uint8_t **bufptr, DumpState *s)
... ...
1294             memcpy(buf + addr % page_size, hbuf, n);
1295             addr += n;
1296             if (addr % page_size == 0) {
1297                 /* we filled up the page */
1298                 break;
1299             }

As a result, get_next_page() will continue to the next block ("mach-virt.ram"). Finally, when get_next_page() returns to the caller:

- 'pfnptr' is referring to 'pvtime'
- but 'blockptr' is referring to "mach-virt.ram"

When get_next_page() is called the next time, "*pfnptr += 1" still refers to the prior 'pvtime'. It will exit immediately because it is out of the range of the current "mach-virt.ram".

The fix is to break when it is time to move to the next block, so that both 'pfnptr' and 'blockptr' refer to the same block.
Fixes: 94d788408d2d ("dump: fix kdump to work over non-aligned blocks") Cc: Joe Jin Signed-off-by: Dongli Zhang --- dump/dump.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/dump/dump.c b/dump/dump.c index 1f1a6edcab..c93e4c572f 100644 --- a/dump/dump.c +++ b/dump/dump.c @@ -1293,8 +1293,8 @@ static bool get_next_page(GuestPhysBlock **blockptr, uint64_t *pfnptr, memcpy(buf + addr % page_size, hbuf, n); addr += n; -if (addr % page_size == 0) { -/* we filled up the page */ +if (addr % page_size == 0 || addr >= block->target_end) { +/* we filled up the page or the current block is finished */ break; } } else { -- 2.34.1
Re: [PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
Hi Like, On 7/2/23 06:41, Like Xu wrote: > On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang wrote: >> >> The "perf stat" at the VM side still works even we set "-cpu host,-pmu" in >> the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC" >> could disable the pmu virtualization in an AMD environment. >> >> We still see below at VM kernel side ... >> >> [0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver. >> >> ... although we expect something like below. >> >> [0.596381] Performance Events: PMU not available due to virtualization, >> using software events only. >> [0.600972] NMI watchdog: Perf NMI watchdog permanently disabled >> >> This is because the AMD pmu (v1) does not rely on cpuid to decide if the >> pmu virtualization is supported. >> >> We introduce a new property 'pmu-cap-disabled' for KVM accel to set >> KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only x86 host >> is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for >> x86. > > We may check cpu->enable_pmu when creating the first CPU or a BSP one > (before it gets running) and then choose whether to disable guest pmu using > vm ioctl KVM_CAP_PMU_CAPABILITY. Introducing a new property is not too > acceptable if there are other options. In the v1 of the implementation, we have implemented something similar: not based on the cpu_index (or BSP), but to introduce a helper before creating the KVM vcpu to let the further implementation decide. We did the KVM_CAP_PMU_CAPABILITY in that helper once. 
[PATCH 1/3] kvm: introduce a helper before creating the 1st vcpu
https://lore.kernel.org/all/20221119122901.2469-2-dongli.zh...@oracle.com/

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
https://lore.kernel.org/all/20221119122901.2469-3-dongli.zh...@oracle.com/

The below was the suggestion from Greg Kurz about using a per-vCPU property to control a per-VM cap:

"It doesn't seem conceptually correct to configure VM level stuff out of a vCPU property, which could theoretically be different for each vCPU, even if this isn't the case with the current code base. Maybe consider controlling PMU with a machine property and this could be done in kvm_arch_init() like other VM level stuff?"

Would you mind commenting on that?

Thank you very much!

Dongli Zhang

>
>> Cc: Joe Jin
>> Cc: Like Xu
>> Signed-off-by: Dongli Zhang
>> ---
>> Changed since v1:
>> - In version 1 we did not introduce the new property. We ioctl
>>   KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
>>   introduced a helper function to do this job before creating the 1st
>>   KVM vcpu in v1.
>> >> accel/kvm/kvm-all.c | 1 + >> include/sysemu/kvm_int.h | 1 + >> qemu-options.hx | 7 ++ >> target/i386/kvm/kvm.c| 46 >> 4 files changed, 55 insertions(+) >> >> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c >> index 7679f397ae..238098e991 100644 >> --- a/accel/kvm/kvm-all.c >> +++ b/accel/kvm/kvm-all.c >> @@ -3763,6 +3763,7 @@ static void kvm_accel_instance_init(Object *obj) >> s->xen_version = 0; >> s->xen_gnttab_max_frames = 64; >> s->xen_evtchn_max_pirq = 256; >> +s->pmu_cap_disabled = false; >> } >> >> /** >> diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h >> index 511b42bde5..cbbe08ec54 100644 >> --- a/include/sysemu/kvm_int.h >> +++ b/include/sysemu/kvm_int.h >> @@ -123,6 +123,7 @@ struct KVMState >> uint32_t xen_caps; >> uint16_t xen_gnttab_max_frames; >> uint16_t xen_evtchn_max_pirq; >> +bool pmu_cap_disabled; >> }; >> >> void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml, >> diff --git a/qemu-options.hx b/qemu-options.hx >> index b57489d7ca..1976c0ca3e 100644 >> --- a/qemu-options.hx >> +++ b/qemu-options.hx >> @@ -187,6 +187,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel, >> "tb-size=n (TCG translation block cache size)\n" >> "dirty-ring-size=n (KVM dirty ring GFN count, default >> 0)\n" >> " >> notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM >> exit and set notify window, x86 only)\n" >> +"pmu-cap-disabled=true|false (disable >> KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n" >&g
Re: [PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers
Hi Like, On 7/2/23 07:15, Like Xu wrote: > On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang wrote: >> >> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM >> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to >> the KVM side. >> >> However, only the Intel gp/fixed/global pmu registers are involved. There >> is not any implementation for AMD pmu registers. The >> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are >> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for >> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on >> the CPU version. > > Updating the relevant documentation to clarify this part of the deficiency > would be a good first step. Would you mind suggesting the doc to add this TODO/deficiency? The only place I find is to add a new TODO under docs/system/i386, but not sure if it is worth it. This bugfix is not complex. > >> >> This patch is to add the support for AMD version=1 pmu, to get and put AMD >> pmu registers. Otherwise, there will be a bug: > > AMD version=1 ? > > AMD does not have version 1, just directly has 2, perhaps because of x86 > compatibility. AMD also does not have the so-called architectural pmu. > Maybe need to rename has_architectural_pmu_version for AMD. > Thank you very much for the explanation. I will use version 2. > It might be more helpful to add similar support for AMD PerfMonV2. Yes. I will do that. During that time, the AMD PerfMonV2 KVM patchset (from you) was still in progress. I see the pull request from Paolo today. I will add that in v3. > >> >> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it >> is running "perf top". The pmu registers are not disabled gracefully. >> >> 2. Although the x86_cpu_reset() resets many registers to zero, the >> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result, >> some pmu events are still enabled at the KVM side. 
> > I agree that we should have done that, especially if guest pmu is enabled > on the AMD platforms. > >> >> 3. The KVM pmc_speculative_in_use() always returns true so that the events >> will not be reclaimed. The kvm_pmc->perf_event is still active. >> >> 4. After the reboot, the VM kernel reports below error: >> >> [0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS >> detected, complain to your hardware vendor. >> [0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR >> c0010200 is 530076) >> >> 5. In a worse case, the active kvm_pmc->perf_event is still able to >> inject unknown NMIs randomly to the VM kernel. >> >> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0. >> >> The patch is to fix the issue by resetting AMD pmu registers during the >> reset. > > I'm not sure if the qemu_reset or VM kexec will necessarily trigger > kvm::amd_pmu_reset(). According to the mainline linux kernel: kvm_vcpu_reset() -> kvm_pmu_reset() -> amd_pmu_reset() The PMU will not reset when init_event==true, that is, when processing the INIT: line 12049. 11975 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) 11976 { ... ... 12049 if (!init_event) { 12050 kvm_pmu_reset(vcpu); 12051 vcpu->arch.smbase = 0x3; 12052 12053 vcpu->arch.msr_misc_features_enables = 0; 12054 vcpu->arch.ia32_misc_enable_msr = MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL | 12055 MSR_IA32_MISC_ENABLE_BTS_UNAVAIL; 12056 12057 __kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP); 12058 __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true); 12059 } According to the below ... https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d28bc9dd25ce023270d2e039e7c98d38ecbf7758 ... "INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP." That's why initially I did not send a KVM patch to remove the 'init_event'. 
> >> >> Cc: Joe Jin >> Cc: Like Xu >> Signed-off-by: Dongli Zhang >> --- >> target/i386/cpu.h | 5 +++ >> target/i386/kvm/kvm.c | 83 +-- >> 2 files changed, 86 insertions(+), 2 deletions(-) >> >> diff --git a/target/i386/cpu.h b/target/i386/cpu.h >> index cd047e0410..b8ba72e87a 100644 >> --- a/target/i386/cpu.h >> +++ b/target/i386/cpu.h >> @@ -471,6 +471,11 @@ typedef enum X86Seg { >> #define MSR_CORE_PERF_GLO
Re: [PATCH v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs
Hi Like and zhenyu, Thank you very much! That will be very helpful. In order to help the review, I will rebase the patchset on top of the most recent QEMU. Thank you very much! Dongli Zhang On 6/19/23 01:52, Like Xu wrote: > I think we've been stuck here too long. Sorry Dongli. > > +zhenyu, could you get someone to follow up on this, or I will start working > on > that. > > On 9/1/2023 9:19 am, Dongli Zhang wrote: >> Ping? >> >> About [PATCH v2 2/2], the bad thing is that the customer will not be able to >> notice the issue, that is, the "Broken BIOS detected" in dmesg, immediately. >> >> As a result, the customer VM many panic randomly anytime in the future (once >> issue is encountered) if "/proc/sys/kernel/unknown_nmi_panic" is enabled. >> >> Thank you very much! >> >> Dongli Zhang >> >> On 12/19/22 06:45, Dongli Zhang wrote: >>> Can I get feedback for this patchset, especially the [PATCH v2 2/2]? >>> >>> About the [PATCH v2 2/2], currently the issue impacts the usage of PMUs on >>> AMD >>> VM, especially the below case: >>> >>> 1. Enable panic on nmi. >>> 2. Use perf to monitor the performance of VM. Although without a test, I >>> think >>> the nmi watchdog has the same effect. >>> 3. A sudden system reset, or a kernel panic (kdump/kexec). >>> 4. After reboot, there will be random unknown NMI. >>> 5. Unfortunately, the "panic on nmi" may panic the VM randomly at any time. >>> >>> Thank you very much! >>> >>> Dongli Zhang >>> >>> On 12/1/22 16:22, Dongli Zhang wrote: >>>> This patchset is to fix two svm pmu virtualization bugs, x86 only. >>>> >>>> version 1: >>>> https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/ >>>> >>>> 1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization. >>>> >>>> To use "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu >>>> virtualization. There is still below at the VM linux side ... >>>> >>>> [ 0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver. >>>> >>>> ... 
although we expect something like below. >>>> >>>> [ 0.596381] Performance Events: PMU not available due to virtualization, >>>> using software events only. >>>> [ 0.600972] NMI watchdog: Perf NMI watchdog permanently disabled >>>> >>>> The 1st patch has introduced a new x86 only accel/kvm property >>>> "pmu-cap-disabled=true" to disable the pmu virtualization via >>>> KVM_PMU_CAP_DISABLE. >>>> >>>> I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1. >>>> Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctl. I >>>> finally used the latter because it is easier to use. >>>> >>>> >>>> 2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset) >>>> at the KVM side may inject random unwanted/unknown NMIs to the VM. >>>> >>>> The svm pmu registers are not reset during QEMU system_reset. >>>> >>>> (1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it >>>> is running "perf top". The pmu registers are not disabled gracefully. >>>> >>>> (2). Although the x86_cpu_reset() resets many registers to zero, the >>>> kvm_put_msrs() does not puts AMD pmu registers to KVM side. As a result, >>>> some pmu events are still enabled at the KVM side. >>>> >>>> (3). The KVM pmc_speculative_in_use() always returns true so that the >>>> events >>>> will not be reclaimed. The kvm_pmc->perf_event is still active. >>>> >>>> (4). After the reboot, the VM kernel reports below error: >>>> >>>> [ 0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS >>>> detected, complain to your hardware vendor. >>>> [ 0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR >>>> c0010200 is 530076) >>>> >>>> (5). In a worse case, the active kvm_pmc->perf_event is still able to >>>> inject unknown NMIs randomly to the VM kernel. >>>> >>>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0. >>>> >>>> The 2nd patch is to fix the issue by resetting AMD pmu registers as well as >>>> Intel registers. 
>>>> >>>> >>>> This patchset does not cover PerfMonV2, until the below patchset is merged >>>> into the KVM side. >>>> >>>> [PATCH v3 0/8] KVM: x86: Add AMD Guest PerfMonV2 PMU support >>>> https://lore.kernel.org/all/2022102645.82001-1-lik...@tencent.com/ >>>> >>>> >>>> Dongli Zhang (2): >>>> target/i386/kvm: introduce 'pmu-cap-disabled' to set >>>> KVM_PMU_CAP_DISABLE >>>> target/i386/kvm: get and put AMD pmu registers >>>> >>>> accel/kvm/kvm-all.c | 1 + >>>> include/sysemu/kvm_int.h | 1 + >>>> qemu-options.hx | 7 +++ >>>> target/i386/cpu.h | 5 ++ >>>> target/i386/kvm/kvm.c | 129 +- >>>> 5 files changed, 141 insertions(+), 2 deletions(-) >>>> >>>> Thank you very much! >>>> >>>> Dongli Zhang >>>> >>>> >> >>
[PATCH RESEND v2 2/2] target/i386/kvm: get and put AMD pmu registers
The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to the KVM side.

However, only the Intel gp/fixed/global pmu registers are involved. There is no implementation for the AMD pmu registers. The 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for AMD. Before AMD PerfMonV2, the number of gp registers is decided based on the CPU version.

This patch adds support for the AMD version=1 pmu, to get and put the AMD pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it is running "perf top". The pmu registers are not disabled gracefully.

2. Although x86_cpu_reset() resets many registers to zero, kvm_put_msrs() does not put the AMD pmu registers to the KVM side. As a result, some pmu events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the events will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports the error below:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

5. In a worse case, the active kvm_pmc->perf_event is still able to inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch fixes the issue by resetting the AMD pmu registers during the reset.
Cc: Joe Jin Cc: Like Xu Signed-off-by: Dongli Zhang --- target/i386/cpu.h | 5 +++ target/i386/kvm/kvm.c | 83 +-- 2 files changed, 86 insertions(+), 2 deletions(-) diff --git a/target/i386/cpu.h b/target/i386/cpu.h index cd047e0410..b8ba72e87a 100644 --- a/target/i386/cpu.h +++ b/target/i386/cpu.h @@ -471,6 +471,11 @@ typedef enum X86Seg { #define MSR_CORE_PERF_GLOBAL_CTRL 0x38f #define MSR_CORE_PERF_GLOBAL_OVF_CTRL 0x390 +#define MSR_K7_EVNTSEL0 0xc001 +#define MSR_K7_PERFCTR0 0xc0010004 +#define MSR_F15H_PERF_CTL0 0xc0010200 +#define MSR_F15H_PERF_CTR0 0xc0010201 + #define MSR_MC0_CTL 0x400 #define MSR_MC0_STATUS 0x401 #define MSR_MC0_ADDR0x402 diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index bf4136fa1b..a0f7273dad 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs) } } +/* + * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to + * disable the AMD pmu virtualization. + * + * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled + * indicates the KVM side has already disabled the pmu virtualization. + */ +if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) { +int64_t family; + +family = (env->cpuid_version >> 8) & 0xf; +if (family == 0xf) { +family += (env->cpuid_version >> 20) & 0xff; +} + +if (family >= 6) { +has_architectural_pmu_version = 1; + +if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) { +num_architectural_pmu_gp_counters = 6; +} else { +num_architectural_pmu_gp_counters = 4; +} +} +} + cpu_x86_cpuid(env, 0x8000, 0, &limit, &unused, &unused, &unused); for (i = 0x8000; i <= limit; i++) { @@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level) kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr); } -if (has_architectural_pmu_version > 0) { +if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) { if (has_architectural_pmu_version > 1) { /* Stop the counter. 
*/ kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0); @@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level) env->msr_global_ctrl); } } + +if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) { +uint32_t sel_base = MSR_K7_EVNTSEL0; +uint32_t ctr_base = MSR_K7_PERFCTR0; +uint32_t step = 1; + +if (num_architectural_pmu_gp_counters == 6) { +sel_base = MSR_F15H_PERF_CTL0; +ctr_base = MSR_F15H_PERF_CTR0; +step = 2; +} + +for (i = 0; i < num_architectural_pmu_gp_counters; i++) { +kvm_msr_entry_add(c
[PATCH RESEND v2 0/2] target/i386/kvm: fix two svm pmu virtualization bugs
This is to rebase the patchset on top of the most recent QEMU.

This patchset is to fix two svm pmu virtualization bugs, x86 only.

version 1:
https://lore.kernel.org/all/20221119122901.2469-1-dongli.zh...@oracle.com/

1. The 1st bug is that "-cpu,-pmu" cannot disable svm pmu virtualization.

Using "-cpu EPYC" or "-cpu host,-pmu" cannot disable the pmu virtualization. We still see the following at the VM Linux side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

The 1st patch has introduced a new x86-only accel/kvm property "pmu-cap-disabled=true" to disable the pmu virtualization via KVM_PMU_CAP_DISABLE.

I considered 'KVM_X86_SET_MSR_FILTER' initially before patchset v1. Since both KVM_X86_SET_MSR_FILTER and KVM_PMU_CAP_DISABLE are VM ioctls, I finally used the latter because it is easier to use.

2. The 2nd bug is that un-reclaimed perf events (after QEMU system_reset) at the KVM side may inject random unwanted/unknown NMIs to the VM.

The svm pmu registers are not reset during QEMU system_reset.

(1). The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it is running "perf top". The pmu registers are not disabled gracefully.

(2). Although the x86_cpu_reset() resets many registers to zero, the kvm_put_msrs() does not put the AMD pmu registers to the KVM side. As a result, some pmu events are still enabled at the KVM side.

(3). The KVM pmc_speculative_in_use() always returns true so that the events will not be reclaimed. The kvm_pmc->perf_event is still active.

(4). After the reboot, the VM kernel reports the error below:

[0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

(5).
In a worse case, the active kvm_pmc->perf_event is still able to inject unknown NMIs randomly to the VM kernel. [...] Uhhuh. NMI received for unknown reason 30 on CPU 0. The 2nd patch is to fix the issue by resetting AMD pmu registers as well as Intel registers. This patchset does not cover PerfMonV2, until the below patchset is merged into the KVM side. It has been queued to kvm-x86 next by Sean. [PATCH v7 00/12] KVM: x86: Add AMD Guest PerfMonV2 PMU support https://lore.kernel.org/all/168609790857.1417369.13152633386083458084.b4...@google.com/ Dongli Zhang (2): target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE target/i386/kvm: get and put AMD pmu registers accel/kvm/kvm-all.c | 1 + include/sysemu/kvm_int.h | 1 + qemu-options.hx | 7 +++ target/i386/cpu.h| 5 ++ target/i386/kvm/kvm.c| 129 +- 5 files changed, 141 insertions(+), 2 deletions(-) Thank you very much! Dongli Zhang
[PATCH RESEND v2 1/2] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
The "perf stat" at the VM side still works even if we set "-cpu host,-pmu" in the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC" could disable the pmu virtualization in an AMD environment.

We still see the following at the VM kernel side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu (v1) does not rely on cpuid to decide if the pmu virtualization is supported.

We introduce a new property 'pmu-cap-disabled' for the KVM accel to set KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only the x86 host is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for x86.

Cc: Joe Jin
Cc: Like Xu
Signed-off-by: Dongli Zhang
---
Changed since v1:
- In version 1 we did not introduce the new property. We ioctl
  KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
  introduced a helper function to do this job before creating the 1st
  KVM vcpu in v1.
accel/kvm/kvm-all.c | 1 + include/sysemu/kvm_int.h | 1 + qemu-options.hx | 7 ++ target/i386/kvm/kvm.c| 46 4 files changed, 55 insertions(+) diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index 7679f397ae..238098e991 100644 --- a/accel/kvm/kvm-all.c +++ b/accel/kvm/kvm-all.c @@ -3763,6 +3763,7 @@ static void kvm_accel_instance_init(Object *obj) s->xen_version = 0; s->xen_gnttab_max_frames = 64; s->xen_evtchn_max_pirq = 256; +s->pmu_cap_disabled = false; } /** diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h index 511b42bde5..cbbe08ec54 100644 --- a/include/sysemu/kvm_int.h +++ b/include/sysemu/kvm_int.h @@ -123,6 +123,7 @@ struct KVMState uint32_t xen_caps; uint16_t xen_gnttab_max_frames; uint16_t xen_evtchn_max_pirq; +bool pmu_cap_disabled; }; void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml, diff --git a/qemu-options.hx b/qemu-options.hx index b57489d7ca..1976c0ca3e 100644 --- a/qemu-options.hx +++ b/qemu-options.hx @@ -187,6 +187,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel, "tb-size=n (TCG translation block cache size)\n" "dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n" "notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n" +"pmu-cap-disabled=true|false (disable KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n" "thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL) SRST ``-accel name[,prop=value[,...]]`` @@ -254,6 +255,12 @@ SRST open up for a specified of time (i.e. notify-window). Default: notify-vmexit=run,notify-window=0. +``pmu-cap-disabled=true|false`` +When the KVM accelerator is used, it controls whether to disable the +KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled, the +PMU virtualization is disabled at the KVM module side. This is for +x86 host only. 
+ ERST DEF("smp", HAS_ARG, QEMU_OPTION_smp, diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c index de531842f6..bf4136fa1b 100644 --- a/target/i386/kvm/kvm.c +++ b/target/i386/kvm/kvm.c @@ -129,6 +129,7 @@ static bool has_msr_ucode_rev; static bool has_msr_vmx_procbased_ctls2; static bool has_msr_perf_capabs; static bool has_msr_pkrs; +static bool has_pmu_cap; static uint32_t has_architectural_pmu_version; static uint32_t num_architectural_pmu_gp_counters; @@ -2767,6 +2768,23 @@ int kvm_arch_init(MachineState *ms, KVMState *s) } } +has_pmu_cap = kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY); + +if (s->pmu_cap_disabled) { +if (has_pmu_cap) { +ret = kvm_vm_enable_cap(s, KVM_CAP_PMU_CAPABILITY, 0, +KVM_PMU_CAP_DISABLE); +if (ret < 0) { +s->pmu_cap_disabled = false; +error_report("kvm: Failed to disable pmu cap: %s", + strerror(-ret)); +} +} else { +s->pmu_cap_disabled = false; +error_report("kvm: KVM_CAP_PMU_CAPABILITY is not supported"); +} +} + return 0; } @@ -5951,6 +5969,28 @@ static void kvm_arch_set_xen_evtchn_max_pirq(Object *obj, Visitor *v, s->xen_evtchn_max_pirq = value; } +static void kvm_set_pmu_cap_disabled(Object *obj, Visitor *v, + const char *name, void *opaque, + Error **errp) +{ +
Re: [PATCH 1/1] hw/core/cpu: always print cpu index with cpu state
Ping?

The patch already got the Reviewed-by from Philippe Mathieu-Daudé and Alistair Francis.

The current log does not provide much information (1) when multiple CPUs are involved in the bug, and (2) when "info registers -a" is not used to collect the context from all CPUs for comparison. We may wait a very long time until the bug filer can reproduce the error again. Printing the cpu index helps a lot.

Thank you very much!

Dongli Zhang

On 2/6/23 15:42, Dongli Zhang wrote:
> The cpu_dump_state() does not print the cpu index. When the
> cpu_dump_state() is invoked due to the KVM failure, we are not able to tell
> from which CPU the state is. The below is an example.
>
> KVM internal error. Suberror: 764064
> RAX=0002 RBX=8a9e57c38400 RCX= RDX=8a9cc00ba8a0
> RSI=0003 RDI=8a9e57c38400 RBP=b6120c5b3c50 RSP=b6120c5b3c40
> R8 = R9 =8a9cc00ba8a0 R10=8e467350 R11=0007
> R12=000a R13=8f987e25 R14=8f988a01 R15=
> RIP=8e51bb04 RFL=00010046 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES = 00c0
> CS =0010 00a09b00 DPL=0 CS64 [-RA]
> SS = 00c0
> DS = 00c0
> FS = 00c0
> GS = 8ac27fcc 00c0
> LDT= 00c0
> TR =0040 fe096000 206f 8b00 DPL=0 TSS64-busy
> GDT= fe094000 007f
> IDT= fe00 0fff
> CR0=80050033 CR2= CR3=0010ca40a001 CR4=003606e0
> DR0= DR1= DR2= DR3=
> DR6=fffe0ff0 DR7=0400
> EFER=0d01
> Code=0f 1f ... ...
>
> Print the cpu->cpu_index in cpu_dump_state() and remove it from the caller.
> > Cc: Joe Jin > Signed-off-by: Dongli Zhang > --- > hw/core/cpu-common.c | 1 + > monitor/hmp-cmds-target.c | 2 -- > softmmu/cpus.c| 1 - > 3 files changed, 1 insertion(+), 3 deletions(-) > > diff --git a/hw/core/cpu-common.c b/hw/core/cpu-common.c > index 5ccc3837b6..d2503f2d09 100644 > --- a/hw/core/cpu-common.c > +++ b/hw/core/cpu-common.c > @@ -105,6 +105,7 @@ void cpu_dump_state(CPUState *cpu, FILE *f, int flags) > > if (cc->dump_state) { > cpu_synchronize_state(cpu); > +qemu_fprintf(f, "\nCPU#%d\n", cpu->cpu_index); > cc->dump_state(cpu, f, flags); > } > } > diff --git a/monitor/hmp-cmds-target.c b/monitor/hmp-cmds-target.c > index 0d3e84d960..f7dd354d2a 100644 > --- a/monitor/hmp-cmds-target.c > +++ b/monitor/hmp-cmds-target.c > @@ -99,7 +99,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict) > > if (all_cpus) { > CPU_FOREACH(cs) { > -monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index); > cpu_dump_state(cs, NULL, CPU_DUMP_FPU); > } > } else { > @@ -114,7 +113,6 @@ void hmp_info_registers(Monitor *mon, const QDict *qdict) > return; > } > > -monitor_printf(mon, "\nCPU#%d\n", cs->cpu_index); > cpu_dump_state(cs, NULL, CPU_DUMP_FPU); > } > } > diff --git a/softmmu/cpus.c b/softmmu/cpus.c > index 9cbc8172b5..f69bbe6abc 100644 > --- a/softmmu/cpus.c > +++ b/softmmu/cpus.c > @@ -122,7 +122,6 @@ void hw_error(const char *fmt, ...) > vfprintf(stderr, fmt, ap); > fprintf(stderr, "\n"); > CPU_FOREACH(cpu) { > -fprintf(stderr, "CPU #%d:\n", cpu->cpu_index); > cpu_dump_state(cpu, stderr, CPU_DUMP_FPU); > } > va_end(ap);
[PATCH 1/1] vhost-scsi: fix usage of error_reportf_err()
error_report() must be used instead of error_reportf_err() when the function that failed does not take local_err as an argument. Otherwise, local_err is always NULL and a segmentation fault may happen.

vhost_scsi_start()
-> vhost_scsi_set_endpoint(s)  --> does not allocate local_err
-> error_reportf_err()
   -> error_vprepend()
      -> g_string_append(newmsg, (*errp)->msg)  --> (*errp) is NULL

In addition, add ": " at the end of the other error_reportf_err() messages.

Fixes: 7962e432b4e4 ("vhost-user-scsi: support reconnect to backend")
Signed-off-by: Dongli Zhang
---
 hw/scsi/vhost-scsi.c      | 4 ++--
 hw/scsi/vhost-user-scsi.c | 3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c
index 3126df9e1d..9929c0d14b 100644
--- a/hw/scsi/vhost-scsi.c
+++ b/hw/scsi/vhost-scsi.c
@@ -91,13 +91,13 @@ static int vhost_scsi_start(VHostSCSI *s)
 
     ret = vhost_scsi_common_start(vsc, &local_err);
     if (ret < 0) {
-        error_reportf_err(local_err, "Error starting vhost-scsi");
+        error_reportf_err(local_err, "Error starting vhost-scsi: ");
         return ret;
     }
 
     ret = vhost_scsi_set_endpoint(s);
     if (ret < 0) {
-        error_reportf_err(local_err, "Error setting vhost-scsi endpoint");
+        error_report("Error setting vhost-scsi endpoint");
         vhost_scsi_common_stop(vsc);
     }
 
diff --git a/hw/scsi/vhost-user-scsi.c b/hw/scsi/vhost-user-scsi.c
index 780f10559d..af18c4f3d3 100644
--- a/hw/scsi/vhost-user-scsi.c
+++ b/hw/scsi/vhost-user-scsi.c
@@ -83,7 +83,8 @@ static void vhost_user_scsi_set_status(VirtIODevice *vdev, uint8_t status)
     if (should_start) {
         ret = vhost_user_scsi_start(s, &local_err);
         if (ret < 0) {
-            error_reportf_err(local_err, "unable to start vhost-user-scsi: %s",
+            error_reportf_err(local_err,
+                              "unable to start vhost-user-scsi: %s: ",
                               strerror(-ret));
             qemu_chr_fe_disconnect(&vs->conf.chardev);
         }
--
2.34.1
[PATCH v3 1/1] target/i386/kvm: introduce 'pmu-cap-disabled' to set KVM_PMU_CAP_DISABLE
The "perf stat" at the VM side still works even if we set "-cpu host,-pmu" in the QEMU command line. That is, neither "-cpu host,-pmu" nor "-cpu EPYC" could disable the pmu virtualization in an AMD environment.

We still see the following at the VM kernel side ...

[0.510611] Performance Events: Fam17h+ core perfctr, AMD PMU driver.

... although we expect something like below.

[0.596381] Performance Events: PMU not available due to virtualization, using software events only.
[0.600972] NMI watchdog: Perf NMI watchdog permanently disabled

This is because the AMD pmu does not rely on cpuid to decide if the pmu virtualization is supported.

We introduce a new property 'pmu-cap-disabled' for the KVM accel to set KVM_PMU_CAP_DISABLE if KVM_CAP_PMU_CAPABILITY is supported. Only the x86 host is supported because currently KVM uses KVM_CAP_PMU_CAPABILITY only for x86.

Cc: Joe Jin
Cc: Like Xu
Cc: Denis V. Lunev
Signed-off-by: Dongli Zhang
---
This is to resurrect the patch to disable the PMU. I have split the patchset and am sending only the patch that disables the PMU.

Changed since v1:
[PATCH 1/3] kvm: introduce a helper before creating the 1st vcpu
https://lore.kernel.org/all/20221119122901.2469-2-dongli.zh...@oracle.com/
[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled
https://lore.kernel.org/all/20221119122901.2469-3-dongli.zh...@oracle.com/
- In version 1 we did not introduce the new property. We ioctl
  KVM_PMU_CAP_DISABLE only before the creation of the 1st vcpu. We had
  introduced a helper function to do this job before creating the 1st
  KVM vcpu in v1.

Changed since v2:
https://lore.kernel.org/all/20230621013821.6874-2-dongli.zh...@oracle.com/
Nothing. I split the patchset and am sending this as a single patch.

As a summary:
- Greg Kurz and Liang Yan suggested introducing a machine property to
  disable the PMU (e.g., with the concern of live migration, or of the
  vCPU property theoretically being different for each vCPU).
- Denis V. Lunev and Like Xu preferred the method in the v1 patch: to
  re-use cpu->enable_pmu.
Would you please suggest whether we should go via v1 (re-use cpu->enable_pmu) or v2 (introduce a new machine prop)?

1. The v1 is to re-use cpu->enable_pmu. It sets KVM_PMU_CAP_DISABLE when creating the 1st vCPU. We may use the vCPU id or (current_cpu == first_cpu) to check when it is the 1st vCPU creation. The benefit is that the QEMU user (e.g., libvirt) will not require much change.

2. The v2 is to introduce the new machine property, as in this patch. The benefit: the 'pmu' is to configure cpuid, while KVM_PMU_CAP_DISABLE is a different KVM feature. They are orthogonal features.

Perhaps there is another option that combines both v1 and v2 ... Perhaps the maintainer can help make the decision on that :)

Thank you very much!

 accel/kvm/kvm-all.c      | 1 +
 include/sysemu/kvm_int.h | 1 +
 qemu-options.hx          | 7 ++
 target/i386/kvm/kvm.c    | 46
 4 files changed, 55 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index e39a810a4e..4acc5bdcc8 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -3619,6 +3619,7 @@ static void kvm_accel_instance_init(Object *obj)
     s->xen_version = 0;
     s->xen_gnttab_max_frames = 64;
     s->xen_evtchn_max_pirq = 256;
+    s->pmu_cap_disabled = false;
 }

 /**
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index fd846394be..b7c0c6ffee 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -120,6 +120,7 @@ struct KVMState
     uint32_t xen_caps;
     uint16_t xen_gnttab_max_frames;
     uint16_t xen_evtchn_max_pirq;
+    bool pmu_cap_disabled;
 };

 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/qemu-options.hx b/qemu-options.hx
index 42fd09e4de..7fe201e41c 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -188,6 +188,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
 "dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
 "eager-split-size=n (KVM Eager Page Split chunk size, default 0, disabled.
ARM only)\n" "notify-vmexit=run|internal-error|disable,notify-window=n (enable notify VM exit and set notify window, x86 only)\n" +"pmu-cap-disabled=true|false (disable KVM_CAP_PMU_CAPABILITY, x86 only, default false)\n" "thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL) SRST ``-accel name[,prop=value[,...]]`` @@ -269,6 +270,12 @@ SRST open up for a specified of time (i.e. notify-window). Default: notify-vmexit=run,notify-window=0. +``pmu-cap-disabled=true|false`` +When the KVM accelerator is used, it controls whether to disable the +KVM_CAP_PMU_CAPABILITY via KVM_PMU_CAP_DISABLE. When disabled,
Re: [PATCH] target/i386/monitor: synchronize cpu before printing lapic state
Hi David, On 10/26/23 08:39, David Woodhouse wrote: > From: David Woodhouse > > Where the local APIC is emulated by KVM, we need kvm_get_apic() to pull > the current state into userspace before it's printed. Otherwise we get > stale values. > > Signed-off-by: David Woodhouse > --- > target/i386/monitor.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/target/i386/monitor.c b/target/i386/monitor.c > index 6512846327..0754d699ba 100644 > --- a/target/i386/monitor.c > +++ b/target/i386/monitor.c > @@ -29,6 +29,7 @@ > #include "monitor/hmp.h" > #include "qapi/qmp/qdict.h" > #include "sysemu/kvm.h" > +#include "sysemu/hw_accel.h" > #include "qapi/error.h" > #include "qapi/qapi-commands-misc-target.h" > #include "qapi/qapi-commands-misc.h" > @@ -655,6 +656,7 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict) > if (qdict_haskey(qdict, "apic-id")) { > int id = qdict_get_try_int(qdict, "apic-id", 0); > cs = cpu_by_arch_id(id); > +cpu_synchronize_state(cs); AFAIR, there is a case that cs may be NULL here when I was sending the similar bugfix long time ago. https://lore.kernel.org/qemu-devel/20210701214051.1588-1-dongli.zh...@oracle.com/ ... and resend: https://lore.kernel.org/qemu-devel/20210908143803.29191-1-dongli.zh...@oracle.com/ ... and resent by Daniel as part of another patchset (after review): https://lore.kernel.org/qemu-devel/20211028155457.967291-19-berra...@redhat.com/ This utility is helpful for the diagnostic of loss of interrupt issue. Dongli Zhang > } else { > cs = mon_get_cpu(mon); > }
[PATCH v3 1/1] hmp: synchronize cpu state for lapic info
While the default "info lapic" always synchronizes cpu state ... mon_get_cpu() -> mon_get_cpu_sync(mon, true) -> cpu_synchronize_state(cpu) -> ioctl KVM_GET_LAPIC (taking KVM as example) ... the cpu state is not synchronized when the apic-id is available as argument. The cpu state should be synchronized when apic-id is available. Otherwise the "info lapic " always returns stale data. Reference: https://lore.kernel.org/all/20211028155457.967291-19-berra...@redhat.com/ Cc: Joe Jin Signed-off-by: Dongli Zhang Reviewed-by: Daniel P. Berrangé Reviewed-by: David Woodhouse --- Changed since v1: - I sent out wrong patch version in v1 Changed since v2: - Add the Reviewed-by from Daniel and David target/i386/monitor.c | 5 + 1 file changed, 5 insertions(+) diff --git a/target/i386/monitor.c b/target/i386/monitor.c index 6512846327..d727270fd0 100644 --- a/target/i386/monitor.c +++ b/target/i386/monitor.c @@ -28,6 +28,7 @@ #include "monitor/hmp-target.h" #include "monitor/hmp.h" #include "qapi/qmp/qdict.h" +#include "sysemu/hw_accel.h" #include "sysemu/kvm.h" #include "qapi/error.h" #include "qapi/qapi-commands-misc-target.h" @@ -654,7 +655,11 @@ void hmp_info_local_apic(Monitor *mon, const QDict *qdict) if (qdict_haskey(qdict, "apic-id")) { int id = qdict_get_try_int(qdict, "apic-id", 0); + cs = cpu_by_arch_id(id); +if (cs) { +cpu_synchronize_state(cs); +} } else { cs = mon_get_cpu(mon); } -- 2.34.1
Re: [PATCH] target/i386/monitor: synchronize cpu before printing lapic state
Hi David, Thank you very much for the Reviewed-by in another thread. I have re-based the patch and sent again. https://lore.kernel.org/all/20231026211938.162815-1-dongli.zh...@oracle.com/ Dongli Zhang On 10/26/23 09:39, Dongli Zhang wrote: > Hi David, > > On 10/26/23 08:39, David Woodhouse wrote: >> From: David Woodhouse >> >> Where the local APIC is emulated by KVM, we need kvm_get_apic() to pull >> the current state into userspace before it's printed. Otherwise we get >> stale values. >> >> Signed-off-by: David Woodhouse >> --- >> target/i386/monitor.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/target/i386/monitor.c b/target/i386/monitor.c >> index 6512846327..0754d699ba 100644 >> --- a/target/i386/monitor.c >> +++ b/target/i386/monitor.c >> @@ -29,6 +29,7 @@ >> #include "monitor/hmp.h" >> #include "qapi/qmp/qdict.h" >> #include "sysemu/kvm.h" >> +#include "sysemu/hw_accel.h" >> #include "qapi/error.h" >> #include "qapi/qapi-commands-misc-target.h" >> #include "qapi/qapi-commands-misc.h" >> @@ -655,6 +656,7 @@ void hmp_info_local_apic(Monitor *mon, const QDict >> *qdict) >> if (qdict_haskey(qdict, "apic-id")) { >> int id = qdict_get_try_int(qdict, "apic-id", 0); >> cs = cpu_by_arch_id(id); >> +cpu_synchronize_state(cs); > > AFAIR, there is a case that cs may be NULL here when I was sending the similar > bugfix long time ago. > > https://lore.kernel.org/qemu-devel/20210701214051.1588-1-dongli.zh...@oracle.com/ > > ... and resend: > > https://lore.kernel.org/qemu-devel/20210908143803.29191-1-dongli.zh...@oracle.com/ > > ... and resent by Daniel as part of another patchset (after review): > > https://lore.kernel.org/qemu-devel/20211028155457.967291-19-berra...@redhat.com/ > > > This utility is helpful for the diagnostic of loss of interrupt issue. > > Dongli Zhang > >> } else { >> cs = mon_get_cpu(mon); >> }
Re: [PATCH v3 1/1] hmp: synchronize cpu state for lapic info
Hi Juan, On 10/30/23 09:31, Juan Quintela wrote: > Dongli Zhang wrote: >> While the default "info lapic" always synchronizes cpu state ... >> >> mon_get_cpu() >> -> mon_get_cpu_sync(mon, true) >>-> cpu_synchronize_state(cpu) >> -> ioctl KVM_GET_LAPIC (taking KVM as example) >> >> ... the cpu state is not synchronized when the apic-id is available as >> argument. >> >> The cpu state should be synchronized when apic-id is available. Otherwise >> the "info lapic " always returns stale data. >> >> Reference: >> https://urldefense.com/v3/__https://lore.kernel.org/all/20211028155457.967291-19-berra...@redhat.com/__;!!ACWV5N9M2RV99hQ!KOLfuCesLC4T6ka9bjf4x6ncC34GPK9pVvWwOJhbwSZw2fwp3Mxlakk0fnR-NCoqRPKOX7X4SOAxozQBC7VQ$ >> >> >> Cc: Joe Jin >> Signed-off-by: Dongli Zhang >> Reviewed-by: Daniel P. Berrangé >> Reviewed-by: David Woodhouse > > Reviewed-by: Juan Quintela > > But I wonder how I did get CC'd on this patch O:-) > Thank you very much! This component does not have a maintainer. I just blindly cc all suggested reviewers :), in order to get it reviewed and merged. get_maintainer.pl: No maintainers found, printing recent contributors. get_maintainer.pl: Do not blindly cc: them on patches! Use common sense. Thank you very much! Dongli Zhang
[PATCH 1/1] multi-process: fix usage information
From the source code, the 'devid' of x-remote-object should be one of the devices in the remote QEMU process.

Signed-off-by: Dongli Zhang
---
I have verified this by reading the code and playing with the orchestrator below.
https://github.com/finallyjustice/sample/blob/master/kvm/multiprocess/orchestrator.py

 docs/system/multi-process.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/system/multi-process.rst b/docs/system/multi-process.rst
index 46bb0cafc2..210531ee17 100644
--- a/docs/system/multi-process.rst
+++ b/docs/system/multi-process.rst
@@ -45,7 +45,7 @@ Following is a description of command-line used to launch mpqemu.
   -device lsi53c895a,id=lsi0 \
   -drive id=drive_image2,file=/build/ol7-nvme-test-1.qcow2 \
   -device scsi-hd,id=drive2,drive=drive_image2,bus=lsi0.0,scsi-id=0 \
-  -object x-remote-object,id=robj1,devid=lsi1,fd=4,
+  -object x-remote-object,id=robj1,devid=lsi0,fd=4,

 * QEMU:
--
2.17.1
Re: [Qemu-devel] Qemu Trace
Hi Nesrine, On 02/01/2018 11:30 PM, Nesrine Zouari wrote: > Hello, > > I am a computer engineering student and I am actually working on my > graduation project at Lauterbach company. The project is about Qemu Trace > and as a future I would like to contribute this work to the main line. > > My project is divided into two parts: > > 1/ Collecting the Guest trace data : The trace solution should be able to > provide: > > a/ Instruction flow Trace > > b/ Memory read/write access About mem read/write, there is a project panda: https://github.com/panda-re/panda > > c/ Time Stamps. > > d/ For tracing rich operating systems that are using MMU, we > additionally need to trace the task switches. > > 2/ Sending the collected data to a third party tool for analysis. > > My question is about the first part. I would like to know, which trace > backend that better fit my use case. > > > Regards, > > - > > Nesrine ZOUARI > Computer Engineering Student > Department of Computer Engineering and Applied Mathematics > National Engineering School of Sfax (ENIS) > University of Sfax-Tunisia > Tel: +216 52 620 475 > Dongli Zhang
Re: [Qemu-devel] [PULL 05/33] virtio-blk: fix comment for virtio_blk_rw_complete
On 11/06/2018 02:15 AM, Michael S. Tsirkin wrote: > From: Yaowei Bai > > Here should be submit_requests, there is no submit_merged_requests > function. > > Signed-off-by: Yaowei Bai > Reviewed-by: Michael S. Tsirkin > Signed-off-by: Michael S. Tsirkin > --- > hw/block/virtio-blk.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c > index 225fe44b7a..83cf5c01f9 100644 > --- a/hw/block/virtio-blk.c > +++ b/hw/block/virtio-blk.c > @@ -97,8 +97,8 @@ static void virtio_blk_rw_complete(void *opaque, int ret) > > if (req->qiov.nalloc != -1) { > /* If nalloc is != 1 req->qiov is a local copy of the original Should it be "If nalloc is != -1" in the comment? Seems the initial state is -1. > - * external iovec. It was allocated in submit_merged_requests > - * to be able to merge requests. */ > + * external iovec. It was allocated in submit_requests to be > + * able to merge requests. */ > qemu_iovec_destroy(&req->qiov); > } > > Dongli Zhang
[Qemu-devel] [PATCH 1/1] virtio-blk: fix comment for virtio_blk_rw_complete as nalloc is initially -1
The initial value of nalloc is -1, not 1.

Signed-off-by: Dongli Zhang
---
This is based on git://git.kernel.org/pub/scm/virt/kvm/mst/qemu.git tags/for_upstream

 hw/block/virtio-blk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 83cf5c0..30999c3 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -96,7 +96,7 @@ static void virtio_blk_rw_complete(void *opaque, int ret)
     trace_virtio_blk_rw_complete(vdev, req, ret);

     if (req->qiov.nalloc != -1) {
-        /* If nalloc is != 1 req->qiov is a local copy of the original
+        /* If nalloc is != -1 req->qiov is a local copy of the original
          * external iovec. It was allocated in submit_requests to be
          * able to merge requests. */
         qemu_iovec_destroy(&req->qiov);
--
2.7.4
Re: [Qemu-devel] How to emulate block I/O timeout on qemu side?
On 11/06/2018 01:49 AM, Eric Blake wrote:
> On 11/2/18 3:11 AM, Dongli Zhang wrote:
>> Hi,
>>
>> Is there any way to emulate I/O timeout on qemu side (not fault injection in VM
>> kernel) without modifying qemu source code?
>
> You may be interested in Rich's work on nbdkit. If you don't mind the overhead
> of the host connecting through NBD, then you can use nbdkit's delay and
> fault-injection filters for inserting delays or even run-time-controllable
> failures to investigate how the guest reacts to those situations

Thank you all very much for the suggestions. I will take a look at nbdkit.

So far I have been reproducing the issue with NFS (by shutting down the link to the NFS server where the image is placed, on purpose), but it did not work well.

>
>>
>> For instance, I would like to observe/study/debug the I/O timeout handling of
>> nvme, scsi, virtio-blk (not supported) of VM kernel.
>>
>> Is there a way to trigger this on purpose on qemu side?
>>
>> Thank you very much!
>>
>> Dongli Zhang
>>
>

Dongli Zhang
[Qemu-devel] [PATCH 1/1] virtio-blk: rename iov to out_iov in virtio_blk_handle_request()
In virtio_blk_handle_request(), in_iov is used for input header while iov is used for output header. Rename iov to out_iov to pair output header's name with in_iov to avoid confusing people when reading source code. Signed-off-by: Dongli Zhang --- hw/block/virtio-blk.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index 83cf5c0..fb0d74d 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -482,7 +482,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) { uint32_t type; struct iovec *in_iov = req->elem.in_sg; -struct iovec *iov = req->elem.out_sg; +struct iovec *out_iov = req->elem.out_sg; unsigned in_num = req->elem.in_num; unsigned out_num = req->elem.out_num; VirtIOBlock *s = req->dev; @@ -493,13 +493,13 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) return -1; } -if (unlikely(iov_to_buf(iov, out_num, 0, &req->out, +if (unlikely(iov_to_buf(out_iov, out_num, 0, &req->out, sizeof(req->out)) != sizeof(req->out))) { virtio_error(vdev, "virtio-blk request outhdr too short"); return -1; } -iov_discard_front(&iov, &out_num, sizeof(req->out)); +iov_discard_front(&out_iov, &out_num, sizeof(req->out)); if (in_iov[in_num - 1].iov_len < sizeof(struct virtio_blk_inhdr)) { virtio_error(vdev, "virtio-blk request inhdr too short"); @@ -526,7 +526,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb) &req->out.sector); if (is_write) { -qemu_iovec_init_external(&req->qiov, iov, out_num); +qemu_iovec_init_external(&req->qiov, out_iov, out_num); trace_virtio_blk_handle_write(vdev, req, req->sector_num, req->qiov.size / BDRV_SECTOR_SIZE); } else { -- 2.7.4
Re: [Qemu-devel] [PATCH 1/1 V2] Add vhost-pci-blk driver
It looks the kernel space vhost-blk can only process raw image. How about to verify that only raw image is used in the drive command line when vhost-blk-pci is paired with it? Otherwise, vhost-blk-pci might be working with qcow2 image without any warning on qemu side. Dongli Zhang On 11/06/2018 04:56 AM, Vitaly Mayatskikh wrote: > This driver uses the kernel-mode acceleration for virtio-blk and > allows to get a near bare metal disk performance inside a VM. > > Signed-off-by: Vitaly Mayatskikh > --- > configure | 10 + > default-configs/virtio.mak| 1 + > hw/block/Makefile.objs| 1 + > hw/block/vhost-blk.c | 429 ++ > hw/virtio/virtio-pci.c| 60 + > hw/virtio/virtio-pci.h| 19 ++ > include/hw/virtio/vhost-blk.h | 43 > 7 files changed, 563 insertions(+) > create mode 100644 hw/block/vhost-blk.c > create mode 100644 include/hw/virtio/vhost-blk.h > > diff --git a/configure b/configure > index 46ae1e8c76..787bc780da 100755 > --- a/configure > +++ b/configure > @@ -371,6 +371,7 @@ vhost_crypto="no" > vhost_scsi="no" > vhost_vsock="no" > vhost_user="" > +vhost_blk="" > kvm="no" > hax="no" > hvf="no" > @@ -869,6 +870,7 @@ Linux) >vhost_crypto="yes" >vhost_scsi="yes" >vhost_vsock="yes" > + vhost_blk="yes" >QEMU_INCLUDES="-I\$(SRC_PATH)/linux-headers -I$(pwd)/linux-headers > $QEMU_INCLUDES" >supported_os="yes" >libudev="yes" > @@ -1263,6 +1265,10 @@ for opt do >;; >--enable-vhost-vsock) vhost_vsock="yes" >;; > + --disable-vhost-blk) vhost_blk="no" > + ;; > + --enable-vhost-blk) vhost_blk="yes" > + ;; >--disable-opengl) opengl="no" >;; >--enable-opengl) opengl="yes" > @@ -6000,6 +6006,7 @@ echo "vhost-crypto support $vhost_crypto" > echo "vhost-scsi support $vhost_scsi" > echo "vhost-vsock support $vhost_vsock" > echo "vhost-user support $vhost_user" > +echo "vhost-blk support $vhost_blk" > echo "Trace backends$trace_backends" > if have_backend "simple"; then > echo "Trace output file $trace_file-" > @@ -6461,6 +6468,9 @@ fi > if test "$vhost_user" = "yes" ; then >echo 
"CONFIG_VHOST_USER=y" >> $config_host_mak > fi > +if test "$vhost_blk" = "yes" ; then > + echo "CONFIG_VHOST_BLK=y" >> $config_host_mak > +fi > if test "$blobs" = "yes" ; then >echo "INSTALL_BLOBS=yes" >> $config_host_mak > fi > diff --git a/default-configs/virtio.mak b/default-configs/virtio.mak > index 1304849018..765c0a2a04 100644 > --- a/default-configs/virtio.mak > +++ b/default-configs/virtio.mak > @@ -1,5 +1,6 @@ > CONFIG_VHOST_USER_SCSI=$(call land,$(CONFIG_VHOST_USER),$(CONFIG_LINUX)) > CONFIG_VHOST_USER_BLK=$(call land,$(CONFIG_VHOST_USER),$(CONFIG_LINUX)) > +CONFIG_VHOST_BLK=$(CONFIG_LINUX) > CONFIG_VIRTIO=y > CONFIG_VIRTIO_9P=y > CONFIG_VIRTIO_BALLOON=y > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs > index 53ce5751ae..857ce823fc 100644 > --- a/hw/block/Makefile.objs > +++ b/hw/block/Makefile.objs > @@ -14,3 +14,4 @@ obj-$(CONFIG_SH4) += tc58128.o > obj-$(CONFIG_VIRTIO_BLK) += virtio-blk.o > obj-$(CONFIG_VIRTIO_BLK) += dataplane/ > obj-$(CONFIG_VHOST_USER_BLK) += vhost-user-blk.o > +obj-$(CONFIG_VHOST_BLK) += vhost-blk.o > diff --git a/hw/block/vhost-blk.c b/hw/block/vhost-blk.c > new file mode 100644 > index 00..4ca8040ee7 > --- /dev/null > +++ b/hw/block/vhost-blk.c > @@ -0,0 +1,429 @@ > +/* > + * vhost-blk host device > + * > + * Copyright(C) 2018 IBM Corporation > + * > + * Authors: > + * Vitaly Mayatskikh > + * > + * Largely based on the "vhost-user-blk.c" implemented by: > + * Changpeng Liu > + * > + * This work is licensed under the terms of the GNU LGPL, version 2 or later. > + * See the COPYING.LIB file in the top-level directory. > + * > + */ > + > +#include "qemu/osdep.h" > +#include "qapi/error.h" > +#include "qemu/error-report.h" > +#include "qemu/cutils.h" > +#include "qom/object.h" > +#include "hw/qdev-core.h" > +#include "hw/virtio/vhost.h" > +#include "hw/virtio/vhost-blk.h" > +#include "hw/virtio/virtio.h" > +#include "hw/virtio/virtio-bus.h" > +#in
Re: [Qemu-devel] [PATCH 1/1 V2] Add vhost-pci-blk driver
On 11/09/2018 12:47 AM, Michael S. Tsirkin wrote:
> On Thu, Nov 08, 2018 at 10:09:00PM +0800, Dongli Zhang wrote:
>> It looks the kernel space vhost-blk can only process raw image.
>>
>> How about to verify that only raw image is used in the drive command line when
>> vhost-blk-pci is paired with it?
>>
>> Otherwise, vhost-blk-pci might be working with qcow2 image without any warning
>> on qemu side.
>>
>> Dongli Zhang
>
> raw being raw can you really verify that?
>

I meant to verify the property 'format=' of '-drive', e.g., to check whether BlockBackend->root->bs->drv->format_name is raw.

We allow the user to erroneously give a qcow2 file with 'format=raw'. However, if 'format=qcow2' is set explicitly, vhost-blk-pci would exit with an error, as only raw is supported in kernel space.

Dongli Zhang
Re: [Qemu-devel] How to emulate block I/O timeout on qemu side?
On 11/12/2018 03:13 PM, Marc Olson via Qemu-devel wrote: > On 11/3/18 10:24 AM, Dongli Zhang wrote: >> Hi all, >> >> I tried with the patch at: >> >> https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg00394.html >> >> The patch is applied to qemu-3.0.0. >> >> >> Below configuration is used to test the feature for guest VM nvme. >> >> # qemu-system-x86_64 \ >> -smp 4 -m 2000M -enable-kvm -vnc :0 -monitor stdio \ >> -net nic -net user,hostfwd=tcp::5022-:22 \ >> -drive file=virtio-disk.img,format=raw,if=none,id=disk0 \ >> -device virtio-blk-pci,drive=disk0,id=disk0-dev,num-queues=2,iothread=io1 \ >> -object iothread,id=io1 \ >> -device nvme,drive=nvme1,serial=deadbeaf1 \ >> -drive file=blkdebug:blkdebug.config:nvme.img,if=none,id=nvme1 >> >> # cat blkdebug.config >> [delay] >> event = "write_aio" >> latency = "99" >> sector = "40960" >> >> >> The 'write' latency of sector=40960 is set to a very large value. When the >> I/O >> is stalled in guest due to that sector=40960 is accessed, I do see below >> messages in guest log: >> >> [ 80.807755] nvme nvme0: I/O 11 QID 2 timeout, aborting >> [ 80.808095] nvme nvme0: Abort status: 0x4001 >> >> >> However, then nothing happens further. nvme I/O hangs in guest. I am not >> able to >> kill the qemu process with Ctrl+C. Both vnc and qemu user net do not work. I >> need to kill qemu with "kill -9" >> >> >> The same result for virtio-scsi and qemu is stuck as well. > While I didn't try virtio-scsi, I wasn't able to reproduce this behavior using > nvme on Ubuntu 18.04 (4.15). What image and kernel version are you trying > against? Would you like to reproduce the "aborting" message or the qemu hang? guest image: ubuntu 16.04 guest kernel: mainline linux kernel (and default kernel in ubuntu 16.04) qemu: qemu-3.0.0 (with the blkdebug delay patch) Would you be able to see the nvme abort (which is indeed not supported by qemu) message in guest kernel? 
Once I see that message, I would not be able to kill the qemu-system-x86_64 command line with Ctrl+C. Dongli Zhang
Re: [Qemu-devel] [PATCH v3 1/3] blkdebug: fix one shot rule processing
Hi Marc, When I play with the v3 patch set, the qemu hangs again and I need to kill it with "kill -9". I got below from guest: [ 104.828127] nvme nvme0: I/O 52 QID 1 timeout, aborting [ 104.828470] nvme nvme0: Abort status: 0x4001 nvme abort is not supported by qemu and therefore 0x4001 (NVME_INVALID_OPCODE | NVME_DNR) is returned. qemu does not restart the device. Below is the environment: host: most recent qemu (master branch) with 3 patches applied, on top of Ubuntu 16.04.4 with 4.15.0-36-generic guest: Ubuntu 16.04.4 with 4.13.0-39-generic. The image to emulate nvme is a 128MB raw image (ext4). I randomly run "dd if=/dev/zero of=output bs=1M count=30" on the nvme raw image to hit the sector="40960" with "inject-delay" enabled as shown below: [inject-delay] event = "write_aio" latency = "99" sector = "40960" sudo ./x86_64-softmmu/qemu-system-x86_64 \ -drive file=os.img,format=raw,if=none,id=disk0 \ -device virtio-blk-pci,drive=disk0,id=device0,num-queues=2,iothread=io1,bootindex=0 \ -object iothread,id=io1 \ -smp 2 -m 2000M -enable-kvm -vnc :0 -monitor stdio \ -device nvme,drive=nvmedrive,serial=deadbeaf1 \ -drive file=blkdebug:blkdebug.config:nvme.img,format=raw,if=none,id=nvmedrive \ -net nic -net user,hostfwd=tcp::5022-:22 I will debug where it hangs. Dongli Zhang On 11/12/2018 03:06 PM, Marc Olson via Qemu-devel wrote: > If 'once' is specified, the rule should execute just once, regardless if > it is supposed to return an error or not. Take the example where you > want the first IO to an LBA to succeed, but subsequent IOs to fail. You > could either use state transitions, or create two rules, one with > error = 0 and once set to true, and one with a non-zero error. 
> > Signed-off-by: Marc Olson > --- > block/blkdebug.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/block/blkdebug.c b/block/blkdebug.c > index 0759452..327049b 100644 > --- a/block/blkdebug.c > +++ b/block/blkdebug.c > @@ -488,7 +488,7 @@ static int rule_check(BlockDriverState *bs, uint64_t > offset, uint64_t bytes) > } > } > > -if (!rule || !rule->options.inject.error) { > +if (!rule) { > return 0; > } > > @@ -500,7 +500,7 @@ static int rule_check(BlockDriverState *bs, uint64_t > offset, uint64_t bytes) > remove_rule(rule); > } > > -if (!immediately) { > +if (error && !immediately) { > aio_co_schedule(qemu_get_current_aio_context(), > qemu_coroutine_self()); > qemu_coroutine_yield(); > } >
Re: [Qemu-devel] How to emulate block I/O timeout on qemu side?
On 11/13/2018 06:52 AM, Marc Olson via Qemu-devel wrote: > On 11/11/18 11:36 PM, Dongli Zhang wrote: >> On 11/12/2018 03:13 PM, Marc Olson via Qemu-devel wrote: >>> On 11/3/18 10:24 AM, Dongli Zhang wrote: >>>> The 'write' latency of sector=40960 is set to a very large value. When the >>>> I/O >>>> is stalled in guest due to that sector=40960 is accessed, I do see below >>>> messages in guest log: >>>> >>>> [ 80.807755] nvme nvme0: I/O 11 QID 2 timeout, aborting >>>> [ 80.808095] nvme nvme0: Abort status: 0x4001 >>>> >>>> >>>> However, then nothing happens further. nvme I/O hangs in guest. I am not >>>> able to >>>> kill the qemu process with Ctrl+C. Both vnc and qemu user net do not work. >>>> I >>>> need to kill qemu with "kill -9" >>>> >>>> >>>> The same result for virtio-scsi and qemu is stuck as well. >>> While I didn't try virtio-scsi, I wasn't able to reproduce this behavior >>> using >>> nvme on Ubuntu 18.04 (4.15). What image and kernel version are you trying >>> against? >> Would you like to reproduce the "aborting" message or the qemu hang? > I could not reproduce IO hanging in the guest, but I can reproduce qemu > hanging. >> guest image: ubuntu 16.04 >> guest kernel: mainline linux kernel (and default kernel in ubuntu 16.04) >> qemu: qemu-3.0.0 (with the blkdebug delay patch) >> >> Would you be able to see the nvme abort (which is indeed not supported by >> qemu) >> message in guest kernel? > Yes. >> Once I see that message, I would not be able to kill the qemu-system-x86_64 >> command line with Ctrl+C. > > I missed this part. I wasn't expecting to handle very long timeouts, but what > appears to be happening is that the sleep doesn't get interrupted on > shutdown. 
I > suspect something like this, on top of the series I sent last night, should > help: > > diff --git a/block/blkdebug.c b/block/blkdebug.c > index 6b1f2d6..0bfb91b 100644 > --- a/block/blkdebug.c > +++ b/block/blkdebug.c > @@ -557,8 +557,11 @@ static int rule_check(BlockDriverState *bs, uint64_t > offset, uint64_t bytes) > remove_active_rule(s, delay_rule); > } > > -if (latency != 0) { > -qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, latency); > +while (latency > 0 && > !aio_external_disabled(bdrv_get_aio_context(bs))) { > +int64_t cur_latency = MIN(latency, 10ULL); > + > +qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, cur_latency); > +latency -= cur_latency; > } > } > > > /marc > > I am able to interrupt qemu with above patch to periodically wake up and sleep again. Dongli Zhang
Re: [Qemu-devel] [PATCH 1/1] virtio-blk: rename iov to out_iov in virtio_blk_handle_request()
Ping? Thanks! Dongli Zhang On 11/07/2018 12:09 AM, Dongli Zhang wrote: > In virtio_blk_handle_request(), in_iov is used for input header while iov > is used for output header. Rename iov to out_iov to pair output header's > name with in_iov to avoid confusing people when reading source code. > > Signed-off-by: Dongli Zhang > --- > hw/block/virtio-blk.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c > index 83cf5c0..fb0d74d 100644 > --- a/hw/block/virtio-blk.c > +++ b/hw/block/virtio-blk.c > @@ -482,7 +482,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > { > uint32_t type; > struct iovec *in_iov = req->elem.in_sg; > -struct iovec *iov = req->elem.out_sg; > +struct iovec *out_iov = req->elem.out_sg; > unsigned in_num = req->elem.in_num; > unsigned out_num = req->elem.out_num; > VirtIOBlock *s = req->dev; > @@ -493,13 +493,13 @@ static int virtio_blk_handle_request(VirtIOBlockReq > *req, MultiReqBuffer *mrb) > return -1; > } > > -if (unlikely(iov_to_buf(iov, out_num, 0, &req->out, > +if (unlikely(iov_to_buf(out_iov, out_num, 0, &req->out, > sizeof(req->out)) != sizeof(req->out))) { > virtio_error(vdev, "virtio-blk request outhdr too short"); > return -1; > } > > -iov_discard_front(&iov, &out_num, sizeof(req->out)); > +iov_discard_front(&out_iov, &out_num, sizeof(req->out)); > > if (in_iov[in_num - 1].iov_len < sizeof(struct virtio_blk_inhdr)) { > virtio_error(vdev, "virtio-blk request inhdr too short"); > @@ -526,7 +526,7 @@ static int virtio_blk_handle_request(VirtIOBlockReq *req, > MultiReqBuffer *mrb) > &req->out.sector); > > if (is_write) { > -qemu_iovec_init_external(&req->qiov, iov, out_num); > +qemu_iovec_init_external(&req->qiov, out_iov, out_num); > trace_virtio_blk_handle_write(vdev, req, req->sector_num, >req->qiov.size / BDRV_SECTOR_SIZE); > } else { >
[Qemu-devel] When AioHandler->is_external=true?
Hi,

Would you please help explain in which case AioHandler->is_external is true, and when it is false?

I have read about the iothread and the main loop, and I am a little bit confused about it.

Thank you very much!

Dongli Zhang
[Qemu-devel] [PATCH 1/1] virtio: pass argument by value for virtqueue_map_iovec()
Pass num_sg by value instead of by pointer, as num_sg is never modified in virtqueue_map_iovec(). Signed-off-by: Dongli Zhang --- hw/virtio/virtio.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 4136d23..a5cb4e6 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -796,13 +796,13 @@ static void virtqueue_undo_map_desc(unsigned int out_num, unsigned int in_num, } static void virtqueue_map_iovec(VirtIODevice *vdev, struct iovec *sg, -hwaddr *addr, unsigned int *num_sg, +hwaddr *addr, unsigned int num_sg, int is_write) { unsigned int i; hwaddr len; -for (i = 0; i < *num_sg; i++) { +for (i = 0; i < num_sg; i++) { len = sg[i].iov_len; sg[i].iov_base = dma_memory_map(vdev->dma_as, addr[i], &len, is_write ? @@ -821,8 +821,8 @@ static void virtqueue_map_iovec(VirtIODevice *vdev, struct iovec *sg, void virtqueue_map(VirtIODevice *vdev, VirtQueueElement *elem) { -virtqueue_map_iovec(vdev, elem->in_sg, elem->in_addr, &elem->in_num, 1); -virtqueue_map_iovec(vdev, elem->out_sg, elem->out_addr, &elem->out_num, 0); +virtqueue_map_iovec(vdev, elem->in_sg, elem->in_addr, elem->in_num, 1); +virtqueue_map_iovec(vdev, elem->out_sg, elem->out_addr, elem->out_num, 0); } static void *virtqueue_alloc_element(size_t sz, unsigned out_num, unsigned in_num) -- 2.7.4
[Qemu-devel] How to emulate block I/O timeout on qemu side?
Hi,

Is there any way to emulate I/O timeout on the qemu side (not fault injection in the VM kernel) without modifying the qemu source code?

For instance, I would like to observe/study/debug the I/O timeout handling of nvme, scsi, and virtio-blk (not supported) in the VM kernel.

Is there a way to trigger this on purpose on the qemu side?

Thank you very much!

Dongli Zhang
Re: [Qemu-devel] How to emulate block I/O timeout on qemu side?
Hi all, I tried the patch at: https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg00394.html The patch is applied to qemu-3.0.0. The below configuration is used to test the feature for guest VM nvme. # qemu-system-x86_64 \ -smp 4 -m 2000M -enable-kvm -vnc :0 -monitor stdio \ -net nic -net user,hostfwd=tcp::5022-:22 \ -drive file=virtio-disk.img,format=raw,if=none,id=disk0 \ -device virtio-blk-pci,drive=disk0,id=disk0-dev,num-queues=2,iothread=io1 \ -object iothread,id=io1 \ -device nvme,drive=nvme1,serial=deadbeaf1 \ -drive file=blkdebug:blkdebug.config:nvme.img,if=none,id=nvme1 # cat blkdebug.config [delay] event = "write_aio" latency = "99" sector = "40960" The 'write' latency of sector 40960 is set to a very large value. When I/O stalls in the guest because sector 40960 is accessed, I do see the below messages in the guest log: [ 80.807755] nvme nvme0: I/O 11 QID 2 timeout, aborting [ 80.808095] nvme nvme0: Abort status: 0x4001 However, nothing happens after that; nvme I/O hangs in the guest. I am not able to kill the qemu process with Ctrl+C. Both vnc and qemu user net stop working, and I have to kill qemu with "kill -9". The same happens with virtio-scsi: qemu gets stuck as well. About blkdebug, I can only trigger the error via the config file. Is there a way to inject an error or latency via the qemu monitor? For instance, I would like to inject an error not for a specific sector or state, but for the entire disk, when I input some command via the qemu monitor. Dongli Zhang On 11/03/2018 02:17 AM, John Snow wrote: > > > On 11/02/2018 01:55 PM, Marc Olson wrote: >> On 11/2/18 10:49 AM, John Snow wrote: >>> On 11/02/2018 04:11 AM, Dongli Zhang wrote: >>>> Hi, >>>> >>>> Is there any way to emulate I/O timeout on qemu side (not fault >>>> injection in VM >>>> kernel) without modifying qemu source code? >>>> >>>> For instance, I would like to observe/study/debug the I/O timeout >>>> handling of >>>> nvme, scsi, virtio-blk (not supported) of VM kernel. 
>>>> >>>> Is there a way to trigger this on purpose on qemu side? >>>> >>>> Thank you very much! >>>> >>>> Dongli Zhang >>>> >>> I don't think the blkdebug driver supports arbitrary delays right now. >>> Maybe we could augment it to do so? >>> >>> (I thought someone already had, but maybe it wasn't merged?) >>> >>> Aha, here: >>> >>> https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg05297.html >>> V2: https://lists.gnu.org/archive/html/qemu-devel/2018-09/msg00394.html >>> >>> Let's work from there. >> >> I've got updates to that patch series that fell on the floor due to >> other competing things. I'll get some screen time this weekend to work >> on them and submit v3. >> >> /marc >> > > Great! Please CC the usual maintainers, but also include me. > > In the meantime, Dongli Zhang, why don't you try the v2 patch and see if > that helps you out for your use case? Report back if it works for you or > not. > > --js >
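As for failing the entire disk rather than a specific sector: blkdebug's [inject-error] rules (described in QEMU's blkdebug documentation) fail matching requests with a given errno, and omitting the sector makes the rule apply to every request on the disk — still driven by the config file rather than the monitor, though. A sketch of such a config (errno 5 is EIO; the event name is one of blkdebug's standard events):

```
[inject-error]
event = "read_aio"
errno = "5"
```

This file would be attached the same way as the delay config above, e.g. -drive file=blkdebug:errors.config:nvme.img,if=none,id=nvme1.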
[Qemu-devel] vfio failure with intel 760p 128GB nvme
Hi, I obtained the below error when assigning an intel 760p 128GB nvme to a guest via vfio on my desktop: qemu-system-x86_64: -device vfio-pci,host=0000:01:00.0: vfio 0000:01:00.0: failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they don't fit in BARs, or don't align This is because the msix table overlaps with the pba. According to the 'lspci -vv' output from the host below, the distance between the msix table offset and the pba offset is only 0x100, although 22 entries are supported (22 entries need 0x160). It looks like qemu supports at most 0x800. # sudo lspci -vv ... ... 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev 03) (prog-if 02 [NVM Express]) Subsystem: Intel Corporation Device 390b ... ... Capabilities: [b0] MSI-X: Enable- Count=22 Masked- Vector table: BAR=0 offset=2000 PBA: BAR=0 offset=2100 A patch like the below works around the issue and passes through the nvme successfully. diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 5c7bd96..54fc25e 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -1510,6 +1510,11 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp) msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK; msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1; +if (msix->table_bar == msix->pba_bar && +msix->table_offset + msix->entries * PCI_MSIX_ENTRY_SIZE > msix->pba_offset) { +msix->entries = (msix->pba_offset - msix->table_offset) / PCI_MSIX_ENTRY_SIZE; +} + /* * Test the size of the pba_offset variable and catch if it extends outside * of the specified BAR. If it is the case, we need to apply a hardware Would you please help confirm whether this should be regarded as a bug in qemu, or an issue with the nvme hardware? Should we fix this in qemu, or should we never use such buggy hardware with vfio? Thank you very much! Dongli Zhang
Re: [Qemu-devel] vfio failure with intel 760p 128GB nvme
Hi Alex, On 12/02/2018 03:29 AM, Alex Williamson wrote: > On Sat, 1 Dec 2018 10:52:21 -0800 (PST) > Dongli Zhang wrote: > >> Hi, >> >> I obtained below error when assigning an intel 760p 128GB nvme to guest via >> vfio on my desktop: >> >> qemu-system-x86_64: -device vfio-pci,host=:01:00.0: vfio :01:00.0: >> failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they >> don't fit in BARs, or don't align >> >> >> This is because the msix table is overlapping with pba. According to below >> 'lspci -vv' from host, the distance between msix table offset and pba offset >> is >> only 0x100, although there are 22 entries supported (22 entries need 0x160). >> Looks qemu supports at most 0x800. >> >> # sudo lspci -vv >> ... ... >> 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev >> 03) (prog-if 02 [NVM Express]) >> Subsystem: Intel Corporation Device 390b >> ... ... >> Capabilities: [b0] MSI-X: Enable- Count=22 Masked- >> Vector table: BAR=0 offset=2000 >> PBA: BAR=0 offset=2100 >> >> >> >> A patch below could workaround the issue and passthrough nvme successfully. >> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >> index 5c7bd96..54fc25e 100644 >> --- a/hw/vfio/pci.c >> +++ b/hw/vfio/pci.c >> @@ -1510,6 +1510,11 @@ static void vfio_msix_early_setup(VFIOPCIDevice >> *vdev, Error **errp) >> msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK; >> msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1; >> >> +if (msix->table_bar == msix->pba_bar && >> +msix->table_offset + msix->entries * PCI_MSIX_ENTRY_SIZE > >> msix->pba_offset) { >> +msix->entries = (msix->pba_offset - msix->table_offset) / >> PCI_MSIX_ENTRY_SIZE; >> +} >> + >> /* >> * Test the size of the pba_offset variable and catch if it extends >> outside >> * of the specified BAR. If it is the case, we need to apply a hardware >> >> >> Would you please help confirm if this can be regarded as bug in qemu, or >> issue >> with nvme hardware? 
Should we fix thin in qemu, or we should never use such >> buggy >> hardware with vfio? > > It's a hardware bug, is there perhaps a firmware update for the device > that resolves it? It's curious that a vector table size of 0x100 gives > us 16 entries and 22 in hex is 0x16 (table size would be reported as > 0x15 for the N-1 algorithm). I wonder if there's a hex vs decimal > mismatch going on. We don't really know if the workaround above is > correct, are there really 16 entries or maybe does the PBA actually > start at a different offset? We wouldn't want to generically assume > one or the other. I think we need Intel to tell us in which way their > hardware is broken and whether it can or is already fixed in a firmware > update. Thanks, Thank you very much for the confirmation. I just realized this would cause trouble on my desktop as well once 17 vectors are used. I will report this to Intel and confirm how this can happen and whether any firmware update is available for this issue. Dongli Zhang
[Qemu-devel] [PATCH 1/1] msix: correct pba size calculation used for msix_exclusive_bar initialization
The bar_pba_size is larger than the pba actually requires, although this usually does not affect the bar_size used for the dev->msix_exclusive_bar initialization. Signed-off-by: Dongli Zhang --- hw/pci/msix.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/hw/pci/msix.c b/hw/pci/msix.c index 702dac4..234c0a3 100644 --- a/hw/pci/msix.c +++ b/hw/pci/msix.c @@ -345,7 +345,7 @@ int msix_init_exclusive_bar(PCIDevice *dev, unsigned short nentries, char *name; uint32_t bar_size = 4096; uint32_t bar_pba_offset = bar_size / 2; -uint32_t bar_pba_size = (nentries / 8 + 1) * 8; +uint32_t bar_pba_size = QEMU_ALIGN_UP(nentries, 64) / 8; /* * Migration compatibility dictates that this remains a 4k -- 2.7.4
Re: [Qemu-devel] [PATCH 1/1] msix: correct pba size calculation used for msix_exclusive_bar initialization
Hi Michael, CCed Jason as this is related to commit a0ccd2123ee2 "pci: remove hard-coded bar size in msix_init_exclusive_bar()" Please see my reply inline. On 12/17/2018 10:24 AM, Michael S. Tsirkin wrote: > On Mon, Dec 17, 2018 at 07:34:39AM +0800, Dongli Zhang wrote: >> The bar_pba_size is more than what the pba is expected to have, although >> this usually would not affect the bar_size used for dev->msix_exclusive_bar >> initialization. >> >> Signed-off-by: Dongli Zhang > > > If this does ever have an effect, we need a compat config > for old machine types. > Could you explain a bit more? Are there configs affected? > What are these? Here I copy parts of msix_init_exclusive_bar() and msix_init(). 'bar_pba_size' is computed (line 348) in msix_init_exclusive_bar(), while 'pba_size' is computed (line 291) in msix_init(). msix_init_exclusive_bar() is one of callers of msix_init(). I think both 'bar_pba_size' and 'pba_size' are indicating the size of pba. The difference is, while 'bar_pba_size' is used to initialize the size of MemoryRegion (dev->msix_exclusive_bar), 'pba_size' is used to initialize the config space in msix_init(). However, the equations to compute the two variables are different. Why would we use different size for MemoryRegion and config space? Thanks to line 365, both equations would reach at the same bar_size. For instance, assuming nvme with 1333 queues, with "uint32_t bar_pba_size = (nentries / 8 + 1) * 8;", bar_pba_size=1336 (line 348) and bar_size=32768 (line 365). with "uint32_t bar_pba_size = QEMU_ALIGN_UP(nentries, 64) / 8;", bar_pba_size=168 (line 348) and bar_size=32768 (line 365). That's why I mentioned "this usually would not affect the bar_size used for dev->msix_exclusive_bar initialization." 
341 int msix_init_exclusive_bar(PCIDevice *dev, unsigned short nentries, 342 uint8_t bar_nr, Error **errp) 343 { 344 int ret; 345 char *name; 346 uint32_t bar_size = 4096; 347 uint32_t bar_pba_offset = bar_size / 2; 348 uint32_t bar_pba_size = (nentries / 8 + 1) * 8; 349 350 /* 351 * Migration compatibility dictates that this remains a 4k 352 * BAR with the vector table in the lower half and PBA in 353 * the upper half for nentries which is lower or equal to 128. 354 * No need to care about using more than 65 entries for legacy 355 * machine types who has at most 64 queues. 356 */ 357 if (nentries * PCI_MSIX_ENTRY_SIZE > bar_pba_offset) { 358 bar_pba_offset = nentries * PCI_MSIX_ENTRY_SIZE; 359 } 360 361 if (bar_pba_offset + bar_pba_size > 4096) { 362 bar_size = bar_pba_offset + bar_pba_size; 363 } 364 365 bar_size = pow2ceil(bar_size); 366 367 name = g_strdup_printf("%s-msix", dev->name); 368 memory_region_init(&dev->msix_exclusive_bar, OBJECT(dev), name, bar_size); 369 g_free(name); 370 371 ret = msix_init(dev, nentries, &dev->msix_exclusive_bar, bar_nr, 372 0, &dev->msix_exclusive_bar, 373 bar_nr, bar_pba_offset, 374 0, errp); 375 if (ret) { 376 return ret; 377 } 378 379 pci_register_bar(dev, bar_nr, PCI_BASE_ADDRESS_SPACE_MEMORY, 380 &dev->msix_exclusive_bar); 381 382 return 0; 383 } 269 int msix_init(struct PCIDevice *dev, unsigned short nentries, 270 MemoryRegion *table_bar, uint8_t table_bar_nr, 271 unsigned table_offset, MemoryRegion *pba_bar, 272 uint8_t pba_bar_nr, unsigned pba_offset, uint8_t cap_pos, 273 Error **errp) 274 { 275 int cap; 276 unsigned table_size, pba_size; 277 uint8_t *config; 278 279 /* Nothing to do if MSI is not supported by interrupt controller */ 280 if (!msi_nonbroken) { 281 error_setg(errp, "MSI-X is not supported by interrupt controller"); 282 return -ENOTSUP; 283 } 284 285 if (nentries < 1 || nentries > PCI_MSIX_FLAGS_QSIZE + 1) { 286 error_setg(errp, "The number of MSI-X vectors is invalid"); 287 return -EINVAL; 288 } 289 290 
table_size = nentries * PCI_MSIX_ENTRY_SIZE; 291 pba_size = QEMU_ALIGN_UP(nentries, 64) / 8; Dongli Zhang > > >> --- >> hw/pci/msix.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/hw/pci/msix.c b/hw/pci/msix.c >> index 702dac4..234c0a3 100644 >> --- a/hw/pci/msix.c >> +++ b/hw/pci/msix.c >> @@ -345,7 +345,7 @@ int msix_init_exclusive_bar(PCIDevice *dev, unsigned >> short nentries, >> char *name; >> uint32_t bar_size = 4096; >> uint32_t bar_pba_offset = bar_size / 2; >> -uint32_t bar_pba_size = (nentries / 8 + 1) * 8; >> +uint32_t bar_pba_size = QEMU_ALIGN_UP(nentries, 64) / 8; >> >> /* >> * Migration compatibility dictates that this remains a 4k >> -- >> 2.7.4
[Qemu-devel] [PATCH 1/1] qdev: rename qdev_create() argument to sync with qdev_try_create()
The second argument used by qdev_create() is typename and 'name' is very confusing. Rename it from 'name' to 'type', which is the same used by qdev_try_create(). Signed-off-by: Dongli Zhang --- hw/core/qdev.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/hw/core/qdev.c b/hw/core/qdev.c index 6b3cc55..1d65e8f 100644 --- a/hw/core/qdev.c +++ b/hw/core/qdev.c @@ -113,17 +113,17 @@ void qdev_set_parent_bus(DeviceState *dev, BusState *bus) /* Create a new device. This only initializes the device state structure and allows properties to be set. The device still needs to be realized. See qdev-core.h. */ -DeviceState *qdev_create(BusState *bus, const char *name) +DeviceState *qdev_create(BusState *bus, const char *type) { DeviceState *dev; -dev = qdev_try_create(bus, name); +dev = qdev_try_create(bus, type); if (!dev) { if (bus) { -error_report("Unknown device '%s' for bus '%s'", name, +error_report("Unknown device '%s' for bus '%s'", type, object_get_typename(OBJECT(bus))); } else { -error_report("Unknown device '%s' for default sysbus", name); +error_report("Unknown device '%s' for default sysbus", type); } abort(); } -- 2.7.4
Re: [Qemu-devel] vfio failure with intel 760p 128GB nvme
Hi Alex, On 12/02/2018 09:29 AM, Dongli Zhang wrote: > Hi Alex, > > On 12/02/2018 03:29 AM, Alex Williamson wrote: >> On Sat, 1 Dec 2018 10:52:21 -0800 (PST) >> Dongli Zhang wrote: >> >>> Hi, >>> >>> I obtained below error when assigning an intel 760p 128GB nvme to guest via >>> vfio on my desktop: >>> >>> qemu-system-x86_64: -device vfio-pci,host=:01:00.0: vfio :01:00.0: >>> failed to add PCI capability 0x11[0x50]@0xb0: table & pba overlap, or they >>> don't fit in BARs, or don't align >>> >>> >>> This is because the msix table is overlapping with pba. According to below >>> 'lspci -vv' from host, the distance between msix table offset and pba >>> offset is >>> only 0x100, although there are 22 entries supported (22 entries need 0x160). >>> Looks qemu supports at most 0x800. >>> >>> # sudo lspci -vv >>> ... ... >>> 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 (rev >>> 03) (prog-if 02 [NVM Express]) >>> Subsystem: Intel Corporation Device 390b >>> ... ... >>> Capabilities: [b0] MSI-X: Enable- Count=22 Masked- >>> Vector table: BAR=0 offset=2000 >>> PBA: BAR=0 offset=2100 >>> >>> >>> >>> A patch below could workaround the issue and passthrough nvme successfully. >>> >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>> index 5c7bd96..54fc25e 100644 >>> --- a/hw/vfio/pci.c >>> +++ b/hw/vfio/pci.c >>> @@ -1510,6 +1510,11 @@ static void vfio_msix_early_setup(VFIOPCIDevice >>> *vdev, Error **errp) >>> msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK; >>> msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1; >>> >>> +if (msix->table_bar == msix->pba_bar && >>> +msix->table_offset + msix->entries * PCI_MSIX_ENTRY_SIZE > >>> msix->pba_offset) { >>> +msix->entries = (msix->pba_offset - msix->table_offset) / >>> PCI_MSIX_ENTRY_SIZE; >>> +} >>> + >>> /* >>> * Test the size of the pba_offset variable and catch if it extends >>> outside >>> * of the specified BAR. 
If it is the case, we need to apply a hardware >>> >>> >>> Would you please help confirm if this can be regarded as bug in qemu, or >>> issue >>> with nvme hardware? Should we fix thin in qemu, or we should never use such >>> buggy >>> hardware with vfio? >> >> It's a hardware bug, is there perhaps a firmware update for the device >> that resolves it? It's curious that a vector table size of 0x100 gives >> us 16 entries and 22 in hex is 0x16 (table size would be reported as >> 0x15 for the N-1 algorithm). I wonder if there's a hex vs decimal >> mismatch going on. We don't really know if the workaround above is >> correct, are there really 16 entries or maybe does the PBA actually >> start at a different offset? We wouldn't want to generically assume >> one or the other. I think we need Intel to tell us in which way their >> hardware is broken and whether it can or is already fixed in a firmware >> update. Thanks, > > Thank you very much for the confirmation. > > Just realized looks this would make trouble to my desktop as well when 17 > vectors are used. > > I will report to intel and confirm how this can happen and if there is any > firmware update available for this issue. > I found there is similar issue reported to kvm: https://bugzilla.kernel.org/show_bug.cgi?id=202055 I confirmed with my env again. By default, the msi-x count is 16. Capabilities: [b0] MSI-X: Enable+ Count=16 Masked- Vector table: BAR=0 offset=2000 PBA: BAR=0 offset=2100 The count is still 16 after the device is assigned to vfio (Enable- now): # echo :01:00.0 > /sys/bus/pci/devices/\:01\:00.0/driver/unbind # echo "8086 f1a6" > /sys/bus/pci/drivers/vfio-pci/new_id Capabilities: [b0] MSI-X: Enable- Count=16 Masked- Vector table: BAR=0 offset=2000 PBA: BAR=0 offset=2100 After I boot qemu with "-device vfio-pci,host=:01:00.0", count becomes 22. 
Capabilities: [b0] MSI-X: Enable- Count=22 Masked- Vector table: BAR=0 offset=2000 PBA: BAR=0 offset=2100 Another interesting observation is that vfio-based userspace nvme also changes the count from 16 to 22. I rebooted the host and the count was reset to 16. Then I booted the VM with "-drive file=nvme://0000:01:00.0/1,if=none,id=nvmedrive0 -device virtio-blk,drive=nvmedrive0,id=nvmevirtio0". As userspace nvme uses a different vfio code path, it boots successfully without the issue. However, the count then becomes 22: Capabilities: [b0] MSI-X: Enable- Count=22 Masked- Vector table: BAR=0 offset=2000 PBA: BAR=0 offset=2100 Both vfio and userspace nvme (based on vfio) change the count from 16 to 22. Dongli Zhang
Re: [Qemu-devel] vfio failure with intel 760p 128GB nvme
Hi Alex, On 12/27/2018 10:20 PM, Alex Williamson wrote: > On Thu, 27 Dec 2018 20:30:48 +0800 > Dongli Zhang wrote: > >> Hi Alex, >> >> On 12/02/2018 09:29 AM, Dongli Zhang wrote: >>> Hi Alex, >>> >>> On 12/02/2018 03:29 AM, Alex Williamson wrote: >>>> On Sat, 1 Dec 2018 10:52:21 -0800 (PST) >>>> Dongli Zhang wrote: >>>> >>>>> Hi, >>>>> >>>>> I obtained below error when assigning an intel 760p 128GB nvme to guest >>>>> via >>>>> vfio on my desktop: >>>>> >>>>> qemu-system-x86_64: -device vfio-pci,host=:01:00.0: vfio >>>>> :01:00.0: failed to add PCI capability 0x11[0x50]@0xb0: table & pba >>>>> overlap, or they don't fit in BARs, or don't align >>>>> >>>>> >>>>> This is because the msix table is overlapping with pba. According to below >>>>> 'lspci -vv' from host, the distance between msix table offset and pba >>>>> offset is >>>>> only 0x100, although there are 22 entries supported (22 entries need >>>>> 0x160). >>>>> Looks qemu supports at most 0x800. >>>>> >>>>> # sudo lspci -vv >>>>> ... ... >>>>> 01:00.0 Non-Volatile memory controller: Intel Corporation Device f1a6 >>>>> (rev 03) (prog-if 02 [NVM Express]) >>>>> Subsystem: Intel Corporation Device 390b >>>>> ... ... >>>>> Capabilities: [b0] MSI-X: Enable- Count=22 Masked- >>>>> Vector table: BAR=0 offset=2000 >>>>> PBA: BAR=0 offset=2100 >>>>> >>>>> >>>>> >>>>> A patch below could workaround the issue and passthrough nvme >>>>> successfully. 
>>>>> >>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c >>>>> index 5c7bd96..54fc25e 100644 >>>>> --- a/hw/vfio/pci.c >>>>> +++ b/hw/vfio/pci.c >>>>> @@ -1510,6 +1510,11 @@ static void vfio_msix_early_setup(VFIOPCIDevice >>>>> *vdev, Error **errp) >>>>> msix->pba_offset = pba & ~PCI_MSIX_FLAGS_BIRMASK; >>>>> msix->entries = (ctrl & PCI_MSIX_FLAGS_QSIZE) + 1; >>>>> >>>>> +if (msix->table_bar == msix->pba_bar && >>>>> +msix->table_offset + msix->entries * PCI_MSIX_ENTRY_SIZE > >>>>> msix->pba_offset) { >>>>> +msix->entries = (msix->pba_offset - msix->table_offset) / >>>>> PCI_MSIX_ENTRY_SIZE; >>>>> +} >>>>> + >>>>> /* >>>>> * Test the size of the pba_offset variable and catch if it extends >>>>> outside >>>>> * of the specified BAR. If it is the case, we need to apply a >>>>> hardware >>>>> >>>>> >>>>> Would you please help confirm if this can be regarded as bug in qemu, or >>>>> issue >>>>> with nvme hardware? Should we fix thin in qemu, or we should never use >>>>> such buggy >>>>> hardware with vfio? >>>> >>>> It's a hardware bug, is there perhaps a firmware update for the device >>>> that resolves it? It's curious that a vector table size of 0x100 gives >>>> us 16 entries and 22 in hex is 0x16 (table size would be reported as >>>> 0x15 for the N-1 algorithm). I wonder if there's a hex vs decimal >>>> mismatch going on. We don't really know if the workaround above is >>>> correct, are there really 16 entries or maybe does the PBA actually >>>> start at a different offset? We wouldn't want to generically assume >>>> one or the other. I think we need Intel to tell us in which way their >>>> hardware is broken and whether it can or is already fixed in a firmware >>>> update. Thanks, >>> >>> Thank you very much for the confirmation. >>> >>> Just realized looks this would make trouble to my desktop as well when 17 >>> vectors are used. >>> >>> I will report to intel and confirm how this can happen and if there is any >>> firmware update available for this issue. 
>>> >> >> I found there is similar issue reported to kvm: >> >> https://bugzilla.kernel.org/show_bug.cgi?id
Re: [Qemu-devel] [PATCH v3 1/5] virtio_net: Add VIRTIO_NET_F_STANDBY feature bit.
I am not familiar with libvirt and I would like to play with this only with qemu. With failover, we need to hotplug the VF on the destination server to the VM after live migration. However, the VF on the destination server would have a different mac address. How can we specify the mac for the new VF to hotplug via qemu, as the VF is only a vfio pci device? I am trying to play with this with only qemu (w/o libvirt). Thank you very much! Dongli Zhang On 01/08/2019 06:29 AM, Venu Busireddy wrote: > From: Sridhar Samudrala > > This feature bit can be used by a hypervisor to indicate to the virtio_net > device that it can act as a standby for another device with the same MAC > address. > > Signed-off-by: Sridhar Samudrala > Signed-off-by: Venu Busireddy > --- > hw/net/virtio-net.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c > index 385b1a0..411f8fb 100644 > --- a/hw/net/virtio-net.c > +++ b/hw/net/virtio-net.c > @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = { > true), > DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN), > DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str), > +DEFINE_PROP_BIT64("standby", VirtIONet, host_features, > VIRTIO_NET_F_STANDBY, > + false), > DEFINE_PROP_END_OF_LIST(), > }; > >
Re: [Qemu-devel] [PATCH v3 1/5] virtio_net: Add VIRTIO_NET_F_STANDBY feature bit.
Hi Venu, On 2019/1/9 1:25 AM, Venu Busireddy wrote: > On 2019-01-09 00:56:38 +0800, Dongli Zhang wrote: >> I am not familiar with libvirt and I would like to play with this only with >> qemu. >> >> With failover, we need to hotplug the VF on destination server to VM after >> live >> migration. However, the VF on destination server would have different mac >> address. >> >> How can we specify the mac for the new VF to hotplug via qemu, as VF is only >> a >> vfio pci device? > > How is the VF device on the destination host any different from the VF > on the source host? > > As you do on the source host, you first assign the MAC address of > 00:00:00:00:00:00 to the VF. After the migration, you assign the same > MAC address as that of the virtio_net device to the VF, and hotadd the VF This is what I was wondering: how is the mac address configured for a VF (or any NIC, such as a PF) after it is assigned to vfio? Thank you very much! Dongli Zhang > device to the VM. And then, after you receive the FAILOVER_PRIMARY_CHANGED > event, set the macvtap device to down state. > > Venu > >> >> I am trying to play with this with only qemu (w/o libvirt). >> >> Thank you very much! >> >> Dongli Zhang >> >> On 01/08/2019 06:29 AM, Venu Busireddy wrote: >>> From: Sridhar Samudrala >>> >>> This feature bit can be used by a hypervisor to indicate to the virtio_net >>> device that it can act as a standby for another device with the same MAC >>> address.
>>> >>> Signed-off-by: Sridhar Samudrala >>> Signed-off-by: Venu Busireddy >>> --- >>> hw/net/virtio-net.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c >>> index 385b1a0..411f8fb 100644 >>> --- a/hw/net/virtio-net.c >>> +++ b/hw/net/virtio-net.c >>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = { >>> true), >>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN), >>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str), >>> +DEFINE_PROP_BIT64("standby", VirtIONet, host_features, >>> VIRTIO_NET_F_STANDBY, >>> + false), >>> DEFINE_PROP_END_OF_LIST(), >>> }; >>> >>> >
Re: [Qemu-devel] [PATCH v3 1/5] virtio_net: Add VIRTIO_NET_F_STANDBY feature bit.
Hi Samudrala, On 2019/1/9 8:18 AM, Samudrala, Sridhar wrote: > On 1/8/2019 4:14 PM, Dongli Zhang wrote: >> Hi Venu, >> >> On 2019/1/9 1:25 AM, Venu Busireddy wrote: >>> On 2019-01-09 00:56:38 +0800, Dongli Zhang wrote: >>>> I am not familiar with libvirt and I would like to play with this only >>>> with qemu. >>>> >>>> With failover, we need to hotplug the VF on destination server to VM after >>>> live >>>> migration. However, the VF on destination server would have different mac >>>> address. >>>> >>>> How can we specify the mac for the new VF to hotplug via qemu, as VF is >>>> only a >>>> vfio pci device? >>> How is the VF device on the destination host any different from the VF >>> on the source host? >>> >>> As you do on the source host, you first assign the MAC address of >>> 00:00:00:00:00:00 to the VF. After the migration, you assign the same >>> MAC address as that of the virtio_net device to the VF, and hotadd the VF >> This was what I was wondering. >> >> How the mac address is configured for VF (or any NIC like PF) after it is >> assigned to vfio? > > ip link set <pf> vf <num> mac <address> > > See https://www.kernel.org/doc/html/latest/networking/net_failover.html > for a sample script that shows the steps to initiate live migration with VF > and virtio-net in standby mode. Thank you very much for the help! Sorry that I did not ask the question in the right way. Although I was talking about the VF, I would like to pass through the entire PF (with sriov_numvfs=0) to the guest VM. In this situation, I am not able to configure the mac address when the entire PF (or NIC) is assigned to VFIO: the netdev interface does not exist anymore, as it belongs to VFIO. Thank you very much! Dongli Zhang > > >> Thank you very much! >> >> Dongli Zhang >> >> >>> device to the VM. And then, after you receive the FAILOVER_PRIMARY_CHANGED >>> event, set the macvtap device to down state. >>> >>> Venu >>> >>>> I am trying to play with this with only qemu (w/o libvirt). >>>> >>>> Thank you very much!
>>>> >>>> Dongli Zhang >>>> >>>> On 01/08/2019 06:29 AM, Venu Busireddy wrote: >>>>> From: Sridhar Samudrala >>>>> >>>>> This feature bit can be used by a hypervisor to indicate to the virtio_net >>>>> device that it can act as a standby for another device with the same MAC >>>>> address. >>>>> >>>>> Signed-off-by: Sridhar Samudrala >>>>> Signed-off-by: Venu Busireddy >>>>> --- >>>>> hw/net/virtio-net.c | 2 ++ >>>>> 1 file changed, 2 insertions(+) >>>>> >>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c >>>>> index 385b1a0..411f8fb 100644 >>>>> --- a/hw/net/virtio-net.c >>>>> +++ b/hw/net/virtio-net.c >>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = { >>>>> true), >>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN), >>>>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str), >>>>> +DEFINE_PROP_BIT64("standby", VirtIONet, host_features, >>>>> VIRTIO_NET_F_STANDBY, >>>>> + false), >>>>> DEFINE_PROP_END_OF_LIST(), >>>>> }; >>>>> >>>>>
Re: [Qemu-devel] [Qemu-discuss] IRQ per CPU
On 07/04/2018 10:32 PM, Probir Roy wrote:
> I am writing a virtual device that would generate IRQ per CPU basis. I

Does 'per CPU basis' mean one irq per cpu, or one irq per device queue?

AFAIK, the device may create multiple queues in the driver (in the OS) and we
would have one irq (vector) per queue.

If you are talking about an irq (vector) per queue, I would suggest reading
the following nvme code, which involves a per-queue vector:

https://github.com/qemu/qemu/blob/master/hw/block/nvme.c

Although I am not an expert on qemu, in my opinion the qemu nvme code is very
helpful for understanding per-queue vectors.

> have written a PCI device from the template which does not generate
> IRQ per CPU. How can write such device in Qemu?
>
> The code of current device is here:
> https://gist.github.com/proywm/6ca98d3e8ca001965e2c8792fcf97911
>
> Regards,
> Probir
>

Dongli Zhang
Re: [Qemu-devel] [Qemu-discuss] IRQ per CPU
On 07/05/2018 12:12 PM, Probir Roy wrote:
>> Does 'per CPU basis' indicates irq per cpu, or irq per device queue?
>
> IRQ per CPU core, meaning that IRQ will be raised at and served by
> that CPU. Does IRQ per queue mean the same thing?
>

About 'IRQ per queue': the device may create multiple queues in the OS driver.
The number of queues is usually proportional to the number of CPU
cores/threads (although this is usually configurable by the OS driver).
Usually each per-queue irq/vector is bound to one CPU.

As the number of queues/irqs is usually the same as the number of CPUs, it is
sort of 'per CPU basis': each CPU serves the irq for its own queue.

Dongli Zhang
[Qemu-devel] Why qom-get is not available in qemu monitor
Hi,

Could you help confirm why the qom-get command is not supported by the qemu
monitor (HMP), but only by qapi (QMP)? Is qom-get in the monitor considered
useless for some reason, or has it just not been implemented yet?

Thank you very much!

Dongli Zhang
[PATCH 1/1] pci: don't skip function 0 occupancy verification for devfn auto assign
When the devfn is already assigned in the command line, the do_pci_register_device() may verify if the function 0 is already occupied. However, when devfn < 0, the verification is skipped because it is part of the last "else if". For instance, suppose there is already a device at addr=00.00 of a port. -device pcie-root-port,bus=pcie.0,chassis=115,id=port01,addr=0e.00 \ -device virtio-net-pci,bus=port01,id=vnet01,addr=00.00 \ When 'addr' is specified for the 2nd device, the hotplug is denied. (qemu) device_add virtio-net-pci,bus=port01,id=vnet02,addr=01.00 Error: PCI: slot 0 function 0 already occupied by virtio-net-pci, new func virtio-net-pci cannot be exposed to guest. When 'addr' is automatically assigned, the hotplug is not denied. This is because the verification is skipped. (qemu) device_add virtio-net-pci,bus=port01,id=vnet02 warning: PCI: slot 1 is not valid for virtio-net-pci, parent device only allows plugging into slot 0. Fix the issue by moving the verification into an independent 'if' statement. Fixes: 3f1e1478db2d ("enable multi-function hot-add") Reported-by: Aswin Unnikrishnan Signed-off-by: Dongli Zhang --- hw/pci/pci.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/hw/pci/pci.c b/hw/pci/pci.c index 4c7be52951..82ebd243d0 100644 --- a/hw/pci/pci.c +++ b/hw/pci/pci.c @@ -1186,14 +1186,15 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCI_SLOT(devfn), PCI_FUNC(devfn), name, bus->devices[devfn]->name, bus->devices[devfn]->qdev.id); return NULL; -} /* - * Populating function 0 triggers a scan from the guest that - * exposes other non-zero functions. Hence we need to ensure that - * function 0 wasn't added yet. - */ -else if (dev->hotplugged && - !pci_is_vf(pci_dev) && - pci_get_function_0(pci_dev)) { +} + +/* + * Populating function 0 triggers a scan from the guest that + * exposes other non-zero functions. Hence we need to ensure that + * function 0 wasn't added yet. 
+ */ +if (dev->hotplugged && !pci_is_vf(pci_dev) && +pci_get_function_0(pci_dev)) { error_setg(errp, "PCI: slot %d function 0 already occupied by %s," " new func %s cannot be exposed to guest.", PCI_SLOT(pci_get_function_0(pci_dev)->devfn), -- 2.34.1
[PATCH 0/6] Add debug interface to kick/call on purpose
The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to the
loss of a doorbell kick, e.g.,

https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html

... or due to the loss of an IRQ, e.g., as fixed by linux kernel commit
fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").

This patch introduces a new debug interface 'DeviceEvent' to DeviceClass to
help narrow down whether the issue is due to the loss of irq/kick. So far the
new interface handles only two events: 'call' and 'kick'. Any device (e.g.,
virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X or
legacy IRQ).

The 'call' is for the admin to inject an irq on purpose for a specific device
(e.g., vhost-scsi) from QEMU/host to the VM, while the 'kick' is for the admin
to kick the doorbell on purpose at the QEMU/host side for a specific device.

This interface can also be used as a workaround if a call/kick is lost due to
a virtualization software (e.g., kernel or QEMU) issue.

We may also implement the interface for VFIO PCI, e.g., writing to
VFIOPCIDevice->msi_vectors[i].interrupt would inject an IRQ into the VM on
purpose. This is considered future work once the virtio part is done.

Below is from live crash analysis. Initially, queue=2 has count=15 for the
'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
used available. We suspect this is because vhost-scsi was not notified by the
VM. In order to narrow down and analyze the issue, we use live crash to dump
the current counter of the eventfd for queue=2.

crash> eventfd_ctx 8f67f6bbe700
struct eventfd_ctx {
  kref = {
    refcount = {
      refs = {
        counter = 4
      }
    }
  },
  wqh = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    head = {
      next = 0x8f841dc08e18,
      prev = 0x8f841dc08e18
    }
  },
  count = 15, ---> eventfd is 15 !!!
  flags = 526336
}

Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic with
this interface.
{ "execute": "x-debug-device-event",
  "arguments": { "dev": "/machine/peripheral/vscsi0",
                 "event": "kick", "queue": 2 } }

The counter is increased to 16. If kicking on purpose resolves the hang, it
indicates something bad is in the software such that the 'kick' was lost.

crash> eventfd_ctx 8f67f6bbe700
struct eventfd_ctx {
  kref = {
    refcount = {
      refs = {
        counter = 4
      }
    }
  },
  wqh = {
    lock = {
      {
        rlock = {
          raw_lock = {
            val = {
              counter = 0
            }
          }
        }
      }
    },
    head = {
      next = 0x8f841dc08e18,
      prev = 0x8f841dc08e18
    }
  },
  count = 16, ---> eventfd incremented to 16 !!!
  flags = 526336
}

Original RFC link:
https://lists.nongnu.org/archive/html/qemu-devel/2021-01/msg03441.html

Changed since RFC:
- add support for more virtio/vhost pci devices
- add log (toggled by DEBUG_VIRTIO_EVENT) to virtio.c to say that this
  mischievous command had been used
- fix grammar error (s/lost/loss/)
- change version to 6.1
- fix incorrect example in qapi/qdev.json
- manage event types with enum/array, instead of hard coding

Dongli Zhang (6):
  qdev: introduce qapi/hmp command for kick/call event
  virtio: introduce helper function for kick/call device event
  virtio-blk-pci: implement device event interface for kick/call
  virtio-scsi-pci: implement device event interface for kick/call
  vhost-scsi-pci: implement device event interface for kick/call
  virtio-net-pci: implement device event interface for kick/call

 hmp-commands.hx                 | 14
 hw/block/virtio-blk.c           |  9 +
 hw/net/virtio-net.c             |  9 +
 hw/scsi/vhost-scsi.c            |  6
 hw/scsi/virtio-scsi.c           |  9 +
 hw/virtio/vhost-scsi-pci.c      | 10 ++
 hw/virtio/virtio-blk-pci.c      | 10 ++
 hw/virtio/virtio-net-pci.c      | 10 ++
 hw/virtio/virtio-scsi-pci.c     | 10 ++
 hw/virtio/virtio.c              | 64
 include/hw/qdev-core.h          |  9 +
 include/hw/virtio/vhost-scsi.h  |  3 ++
 include/hw/virtio/virtio-blk.h  |  2 ++
 include/hw/virtio/virtio-net.h  |  3 ++
 include/hw/virtio/virtio-scsi.h |  3 ++
 include/hw/virtio/virtio.h      |  3 ++
 include/monitor/hmp.h           |  1 +
 qapi/qdev.json                  | 30 +
 softmmu/qdev-monitor.c          | 56 +++
 19 files changed, 261 insertions(+)

I did
tests with the following cases:

- virtio-blk-pci (ioeventfd on/off, iothread, live migration)
- virtio-scsi-pci (ioeventfd on/off)
- vhost-scsi-pci
- virtio-net-pci (ioeventfd on/off, vhost)

Thank you very much!

Dongli Zhang
[PATCH 6/6] virtio-net-pci: implement device event interface for kick/call
This is to implement the device event interface for virtio-net-pci. Signed-off-by: Dongli Zhang --- hw/net/virtio-net.c| 9 + hw/virtio/virtio-net-pci.c | 10 ++ include/hw/virtio/virtio-net.h | 3 +++ 3 files changed, 22 insertions(+) diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 66b9ff4511..b5c3fa392c 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -3195,6 +3195,15 @@ static bool failover_hide_primary_device(DeviceListener *listener, return qatomic_read(&n->failover_primary_hidden); } +void virtio_net_device_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +VirtIONet *n = VIRTIO_NET(dev); +bool irqfd = n->vhost_started; + +virtio_device_event(dev, event, queue, irqfd, errp); +} + static void virtio_net_device_realize(DeviceState *dev, Error **errp) { VirtIODevice *vdev = VIRTIO_DEVICE(dev); diff --git a/hw/virtio/virtio-net-pci.c b/hw/virtio/virtio-net-pci.c index aa0b3caecb..1fa5a6fe5d 100644 --- a/hw/virtio/virtio-net-pci.c +++ b/hw/virtio/virtio-net-pci.c @@ -46,6 +46,15 @@ static Property virtio_net_properties[] = { DEFINE_PROP_END_OF_LIST(), }; +static void virtio_net_pci_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +VirtIONetPCI *vnet = VIRTIO_NET_PCI(dev); +DeviceState *vdev = DEVICE(&vnet->vdev); + +virtio_net_device_event(vdev, event, queue, errp); +} + static void virtio_net_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp) { DeviceState *qdev = DEVICE(vpci_dev); @@ -77,6 +86,7 @@ static void virtio_net_pci_class_init(ObjectClass *klass, void *data) k->class_id = PCI_CLASS_NETWORK_ETHERNET; set_bit(DEVICE_CATEGORY_NETWORK, dc->categories); device_class_set_props(dc, virtio_net_properties); +dc->event = virtio_net_pci_event; vpciklass->realize = virtio_net_pci_realize; } diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h index 7e96d193aa..d88c9969ea 100644 --- a/include/hw/virtio/virtio-net.h +++ b/include/hw/virtio/virtio-net.h @@ -214,4 +214,7 @@ struct VirtIONet 
{ void virtio_net_set_netclient_name(VirtIONet *n, const char *name, const char *type); +void virtio_net_device_event(DeviceState *dev, int event, int queue, + Error **errp); + #endif -- 2.17.1
[PATCH 2/6] virtio: introduce helper function for kick/call device event
This is to introduce the helper function for virtio device to kick or call. Signed-off-by: Dongli Zhang --- hw/virtio/virtio.c | 64 ++ include/hw/virtio/virtio.h | 3 ++ 2 files changed, 67 insertions(+) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 07f4e60b30..e081041a75 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -30,6 +30,8 @@ #include "sysemu/runstate.h" #include "standard-headers/linux/virtio_ids.h" +/* #define DEBUG_VIRTIO_EVENT */ + /* * The alignment to use between consumer and producer parts of vring. * x86 pagesize again. This is the default, used by transports like PCI @@ -2572,6 +2574,68 @@ void virtio_notify(VirtIODevice *vdev, VirtQueue *vq) virtio_irq(vq); } +static void virtio_device_event_call(VirtQueue *vq, bool eventfd, + Error **errp) +{ +#ifdef DEBUG_VIRTIO_EVENT +printf("The 'call' event is triggered for path=%s, queue=%d, irqfd=%d.\n", + object_get_canonical_path(OBJECT(vq->vdev)), + vq->queue_index, eventfd); +#endif + +if (eventfd) { +virtio_set_isr(vq->vdev, 0x1); +event_notifier_set(&vq->guest_notifier); +} else { +virtio_irq(vq); +} +} + +static void virtio_device_event_kick(VirtQueue *vq, bool eventfd, + Error **errp) +{ +#ifdef DEBUG_VIRTIO_EVENT +printf("The 'kick' event is triggered for path=%s, queue=%d.\n", + object_get_canonical_path(OBJECT(vq->vdev)), vq->queue_index); +#endif + +virtio_queue_notify(vq->vdev, virtio_get_queue_index(vq)); +} + +typedef void (VirtIOEvent)(VirtQueue *vq, bool eventfd, Error **errp); + +static VirtIOEvent *virtio_event_funcs[DEVICE_EVENT_MAX] = { +[DEVICE_EVENT_CALL] = virtio_device_event_call, +[DEVICE_EVENT_KICK] = virtio_device_event_kick +}; + +void virtio_device_event(DeviceState *dev, int event, int queue, + bool eventfd, Error **errp) +{ +struct VirtIODevice *vdev = VIRTIO_DEVICE(dev); +int num = virtio_get_num_queues(vdev); +VirtQueue *vq; + +assert(event < DEVICE_EVENT_MAX); + +if (vdev->broken) { +error_setg(errp, "Broken device"); +return; +} + +if (queue < 0 || 
queue >= num) { +error_setg(errp, "Invalid queue %d", queue); +return; +} + +vq = &vdev->vq[queue]; + +if (virtio_event_funcs[event]) +virtio_event_funcs[event](vq, eventfd, errp); +else +error_setg(errp, "The event is not supported"); +} + void virtio_notify_config(VirtIODevice *vdev) { if (!(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index b7ece7a6a8..21bb13ffa6 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -210,6 +210,9 @@ void virtqueue_get_avail_bytes(VirtQueue *vq, unsigned int *in_bytes, void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq); void virtio_notify(VirtIODevice *vdev, VirtQueue *vq); +void virtio_device_event(DeviceState *dev, int event, int queue, + bool eventfd, Error **errp); + int virtio_save(VirtIODevice *vdev, QEMUFile *f); extern const VMStateInfo virtio_vmstate_info; -- 2.17.1
[PATCH 4/6] virtio-scsi-pci: implement device event interface for kick/call
This is to implement the device event interface for virtio-scsi-pci. Signed-off-by: Dongli Zhang --- hw/scsi/virtio-scsi.c | 9 + hw/virtio/virtio-scsi-pci.c | 10 ++ include/hw/virtio/virtio-scsi.h | 3 +++ 3 files changed, 22 insertions(+) diff --git a/hw/scsi/virtio-scsi.c b/hw/scsi/virtio-scsi.c index 6d80730287..f437ed1a81 100644 --- a/hw/scsi/virtio-scsi.c +++ b/hw/scsi/virtio-scsi.c @@ -962,6 +962,15 @@ static struct SCSIBusInfo virtio_scsi_scsi_info = { .load_request = virtio_scsi_load_request, }; +void virtio_scsi_device_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +VirtIOSCSI *s = VIRTIO_SCSI(dev); +bool irqfd = s->dataplane_started && !s->dataplane_fenced; + +virtio_device_event(dev, event, queue, irqfd, errp); +} + void virtio_scsi_common_realize(DeviceState *dev, VirtIOHandleOutput ctrl, VirtIOHandleOutput evt, diff --git a/hw/virtio/virtio-scsi-pci.c b/hw/virtio/virtio-scsi-pci.c index 97fab74236..d5db743692 100644 --- a/hw/virtio/virtio-scsi-pci.c +++ b/hw/virtio/virtio-scsi-pci.c @@ -43,6 +43,15 @@ static Property virtio_scsi_pci_properties[] = { DEFINE_PROP_END_OF_LIST(), }; +static void virtio_scsi_pci_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +VirtIOSCSIPCI *vscsi = VIRTIO_SCSI_PCI(dev); +DeviceState *vdev = DEVICE(&vscsi->vdev); + +virtio_scsi_device_event(vdev, event, queue, errp); +} + static void virtio_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp) { VirtIOSCSIPCI *dev = VIRTIO_SCSI_PCI(vpci_dev); @@ -82,6 +91,7 @@ static void virtio_scsi_pci_class_init(ObjectClass *klass, void *data) k->realize = virtio_scsi_pci_realize; set_bit(DEVICE_CATEGORY_STORAGE, dc->categories); device_class_set_props(dc, virtio_scsi_pci_properties); +dc->event = virtio_scsi_pci_event; pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET; pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_SCSI; pcidev_k->revision = 0x00; diff --git a/include/hw/virtio/virtio-scsi.h b/include/hw/virtio/virtio-scsi.h index 
543681bc18..d1fff0eeac 100644 --- a/include/hw/virtio/virtio-scsi.h +++ b/include/hw/virtio/virtio-scsi.h @@ -163,4 +163,7 @@ void virtio_scsi_dataplane_setup(VirtIOSCSI *s, Error **errp); int virtio_scsi_dataplane_start(VirtIODevice *s); void virtio_scsi_dataplane_stop(VirtIODevice *s); +void virtio_scsi_device_event(DeviceState *dev, int event, int queue, + Error **errp); + #endif /* QEMU_VIRTIO_SCSI_H */ -- 2.17.1
[PATCH 3/6] virtio-blk-pci: implement device event interface for kick/call
This is to implement the device event interface for virtio-blk-pci. Signed-off-by: Dongli Zhang --- hw/block/virtio-blk.c | 9 + hw/virtio/virtio-blk-pci.c | 10 ++ include/hw/virtio/virtio-blk.h | 2 ++ 3 files changed, 21 insertions(+) diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index d28979efb8..2b3583a913 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -1118,6 +1118,15 @@ static int virtio_blk_load_device(VirtIODevice *vdev, QEMUFile *f, return 0; } +void virtio_blk_device_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +VirtIOBlock *s = VIRTIO_BLK(dev); +bool irqfd = s->dataplane_started && !s->dataplane_disabled; + +virtio_device_event(dev, event, queue, irqfd, errp); +} + static void virtio_resize_cb(void *opaque) { VirtIODevice *vdev = opaque; diff --git a/hw/virtio/virtio-blk-pci.c b/hw/virtio/virtio-blk-pci.c index 9d5795810c..f1fc72e7f1 100644 --- a/hw/virtio/virtio-blk-pci.c +++ b/hw/virtio/virtio-blk-pci.c @@ -47,6 +47,15 @@ static Property virtio_blk_pci_properties[] = { DEFINE_PROP_END_OF_LIST(), }; +static void virtio_blk_pci_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +VirtIOBlkPCI *vblk = VIRTIO_BLK_PCI(dev); +DeviceState *vdev = DEVICE(&vblk->vdev); + +virtio_blk_device_event(vdev, event, queue, errp); +} + static void virtio_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp) { VirtIOBlkPCI *dev = VIRTIO_BLK_PCI(vpci_dev); @@ -72,6 +81,7 @@ static void virtio_blk_pci_class_init(ObjectClass *klass, void *data) set_bit(DEVICE_CATEGORY_STORAGE, dc->categories); device_class_set_props(dc, virtio_blk_pci_properties); +dc->event = virtio_blk_pci_event; k->realize = virtio_blk_pci_realize; pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET; pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_BLOCK; diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h index 29655a406d..500be01dff 100644 --- a/include/hw/virtio/virtio-blk.h +++ b/include/hw/virtio/virtio-blk.h @@ 
-92,5 +92,7 @@ typedef struct MultiReqBuffer { bool virtio_blk_handle_vq(VirtIOBlock *s, VirtQueue *vq); void virtio_blk_process_queued_requests(VirtIOBlock *s, bool is_bh); +void virtio_blk_device_event(DeviceState *dev, int event, int queue, + Error **errp); #endif -- 2.17.1
[PATCH 5/6] vhost-scsi-pci: implement device event interface for kick/call
This is to implement the device event interface for vhost-scsi-pci. Signed-off-by: Dongli Zhang --- hw/scsi/vhost-scsi.c | 6 ++ hw/virtio/vhost-scsi-pci.c | 10 ++ include/hw/virtio/vhost-scsi.h | 3 +++ 3 files changed, 19 insertions(+) diff --git a/hw/scsi/vhost-scsi.c b/hw/scsi/vhost-scsi.c index 4d70fa036b..11dd94ff92 100644 --- a/hw/scsi/vhost-scsi.c +++ b/hw/scsi/vhost-scsi.c @@ -163,6 +163,12 @@ static const VMStateDescription vmstate_virtio_vhost_scsi = { .pre_save = vhost_scsi_pre_save, }; +void vhost_scsi_device_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +virtio_device_event(dev, event, queue, true, errp); +} + static void vhost_scsi_realize(DeviceState *dev, Error **errp) { VirtIOSCSICommon *vs = VIRTIO_SCSI_COMMON(dev); diff --git a/hw/virtio/vhost-scsi-pci.c b/hw/virtio/vhost-scsi-pci.c index cb71a294fa..c7a614cb11 100644 --- a/hw/virtio/vhost-scsi-pci.c +++ b/hw/virtio/vhost-scsi-pci.c @@ -44,6 +44,15 @@ static Property vhost_scsi_pci_properties[] = { DEFINE_PROP_END_OF_LIST(), }; +static void vhost_scsi_pci_event(DeviceState *dev, int event, int queue, + Error **errp) +{ +VHostSCSIPCI *vscsi = VHOST_SCSI_PCI(dev); +DeviceState *vdev = DEVICE(&vscsi->vdev); + +vhost_scsi_device_event(vdev, event, queue, errp); +} + static void vhost_scsi_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp) { VHostSCSIPCI *dev = VHOST_SCSI_PCI(vpci_dev); @@ -70,6 +79,7 @@ static void vhost_scsi_pci_class_init(ObjectClass *klass, void *data) k->realize = vhost_scsi_pci_realize; set_bit(DEVICE_CATEGORY_STORAGE, dc->categories); device_class_set_props(dc, vhost_scsi_pci_properties); +dc->event = vhost_scsi_pci_event; pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET; pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_SCSI; pcidev_k->revision = 0x00; diff --git a/include/hw/virtio/vhost-scsi.h b/include/hw/virtio/vhost-scsi.h index 7dc2bdd69d..b47854a0c6 100644 --- a/include/hw/virtio/vhost-scsi.h +++ b/include/hw/virtio/vhost-scsi.h @@ -32,4 +32,7 @@ struct 
VHostSCSI { VHostSCSICommon parent_obj; }; +void vhost_scsi_device_event(DeviceState *dev, int event, int queue, + Error **errp); + #endif -- 2.17.1
[PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event
The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to the loss of doorbell kick, e.g., https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler"). This patch introduces a new debug interface 'DeviceEvent' to DeviceClass to help narrow down if the issue is due to loss of irq/kick. So far the new interface handles only two events: 'call' and 'kick'. Any device (e.g., virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X or legacy IRQ). The 'call' is to inject irq on purpose by admin for a specific device (e.g., vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell on purpose by admin at QEMU/host side for a specific device. Signed-off-by: Dongli Zhang --- hmp-commands.hx| 14 +++ include/hw/qdev-core.h | 9 +++ include/monitor/hmp.h | 1 + qapi/qdev.json | 30 ++ softmmu/qdev-monitor.c | 56 ++ 5 files changed, 110 insertions(+) diff --git a/hmp-commands.hx b/hmp-commands.hx index 435c591a1c..d74b895fff 100644 --- a/hmp-commands.hx +++ b/hmp-commands.hx @@ -1725,3 +1725,17 @@ ERST .flags = "p", }, +{ +.name = "x-debug-device-event", +.args_type = "dev:s,event:s,queue:l", +.params = "dev event queue", +.help = "generate device event for a specific device queue", +.cmd= hmp_x_debug_device_event, +.flags = "p", +}, + +SRST +``x-debug-device-event`` *dev* *event* *queue* + Generate device event *event* for specific *queue* of *dev* +ERST + diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h index bafc311bfa..1ea8bf23b9 100644 --- a/include/hw/qdev-core.h +++ b/include/hw/qdev-core.h @@ -29,9 +29,17 @@ typedef enum DeviceCategory { DEVICE_CATEGORY_MAX } DeviceCategory; +enum { +DEVICE_EVENT_CALL, +DEVICE_EVENT_KICK, +DEVICE_EVENT_MAX +}; + typedef void (*DeviceRealize)(DeviceState *dev, Error **errp); typedef void (*DeviceUnrealize)(DeviceState *dev); typedef 
void (*DeviceReset)(DeviceState *dev); +typedef void (*DeviceEvent)(DeviceState *dev, int event, int queue, +Error **errp); typedef void (*BusRealize)(BusState *bus, Error **errp); typedef void (*BusUnrealize)(BusState *bus); @@ -132,6 +140,7 @@ struct DeviceClass { DeviceReset reset; DeviceRealize realize; DeviceUnrealize unrealize; +DeviceEvent event; /* device state */ const VMStateDescription *vmsd; diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h index 605d57287a..c7795d4ba5 100644 --- a/include/monitor/hmp.h +++ b/include/monitor/hmp.h @@ -129,5 +129,6 @@ void hmp_info_replay(Monitor *mon, const QDict *qdict); void hmp_replay_break(Monitor *mon, const QDict *qdict); void hmp_replay_delete_break(Monitor *mon, const QDict *qdict); void hmp_replay_seek(Monitor *mon, const QDict *qdict); +void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict); #endif diff --git a/qapi/qdev.json b/qapi/qdev.json index b83178220b..711c4a297a 100644 --- a/qapi/qdev.json +++ b/qapi/qdev.json @@ -124,3 +124,33 @@ ## { 'event': 'DEVICE_DELETED', 'data': { '*device': 'str', 'path': 'str' } } + +## +# @x-debug-device-event: +# +# Generate device event for a specific device queue +# +# @dev: device path +# +# @event: event (e.g., kick or call) to trigger +# +# @queue: queue id +# +# Returns: Nothing on success +# +# Since: 6.1 +# +# Notes: This is used to debug VM driver hang issue. The 'kick' event is to +#send notification to QEMU/vhost while the 'call' event is to +#interrupt VM on purpose. 
+# +# Example: +# +# -> { "execute": "x-debug-device-event", +# "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick", +# "queue": 1 } } +# <- { "return": {} } +# +## +{ 'command': 'x-debug-device-event', + 'data': {'dev': 'str', 'event': 'str', 'queue': 'int'} } diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c index a9955b97a0..bca53111fb 100644 --- a/softmmu/qdev-monitor.c +++ b/softmmu/qdev-monitor.c @@ -924,6 +924,62 @@ void hmp_device_del(Monitor *mon, const QDict *qdict) hmp_handle_error(mon, err); } +static const char * const device_events[DEVICE_EVENT_MAX] = { +[DEVICE_EVENT_KICK] = "kick", +[DEVICE_EVENT_CALL] = "call" +}; + +static int get_device_event(const ch
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
Hi Jason, On 3/26/21 12:24 AM, Jason Wang wrote: > > 在 2021/3/26 下午1:44, Dongli Zhang 写道: >> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to >> the loss of doorbell kick, e.g., >> >> https://urldefense.com/v3/__https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html__;!!GqivPVa7Brio!KS3pAU2cKjz4wgI4QSlE-YsJPhPG71nkE5_tGhaOf7mi_xvNxbvKkfn03rk5BNDLSEU$ >> >> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit >> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler"). >> >> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass >> to help narrow down if the issue is due to loss of irq/kick. So far the new >> interface handles only two events: 'call' and 'kick'. Any device (e.g., >> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X >> or legacy IRQ). >> >> The 'call' is to inject irq on purpose by admin for a specific device (e.g., >> vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick the doorbell >> on purpose by admin at QEMU/host side for a specific device. >> >> >> This device can be used as a workaround if call/kick is lost due to >> virtualization software (e.g., kernel or QEMU) issue. >> >> We may also implement the interface for VFIO PCI, e.g., to write to >> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject IRQ to VM >> on purpose. This is considered future work once the virtio part is done. >> >> >> Below is from live crash analysis. Initially, the queue=2 has count=15 for >> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no >> used available. We suspect this is because vhost-scsi was not notified by >> VM. In order to narrow down and analyze the issue, we use live crash to >> dump the current counter of eventfd for queue=2. 
>> >> crash> eventfd_ctx 8f67f6bbe700 >> struct eventfd_ctx { >> kref = { >> refcount = { >> refs = { >> counter = 4 >> } >> } >> }, >> wqh = { >> lock = { >> { >> rlock = { >> raw_lock = { >> val = { >> counter = 0 >> } >> } >> } >> } >> }, >> head = { >> next = 0x8f841dc08e18, >> prev = 0x8f841dc08e18 >> } >> }, >> count = 15, ---> eventfd is 15 !!! >> flags = 526336 >> } >> >> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostic >> with this interface. >> >> { "execute": "x-debug-device-event", >> "arguments": { "dev": "/machine/peripheral/vscsi0", >> "event": "kick", "queue": 2 } } >> >> The counter is increased to 16. Suppose the hang issue is resolved, it >> indicates something bad is in software that the 'kick' is lost. > > > What do you mean by "software" here? And it looks to me you're testing whether > event_notifier_set() is called by virtio_queue_notify() here. If so, I'm not > sure how much value could we gain from a dedicated debug interface like this > consider there're a lot of exisinting general purpose debugging method like > tracing or gdb. I'd say the path from virtio_queue_notify() to > event_notifier_set() is only a very small fraction of the process of virtqueue > kick which is unlikey to be buggy. Consider usually the ioeventfd will be > offloaded to KVM, it's more a chance that something is wrong in setuping > ioeventfd instead of here. Irq is even more complicated. Thank you very much! I am not testing whether event_notifier_set() is called by virtio_queue_notify(). The 'software' indicates the data processing and event notification mechanism involved with virtio/vhost PV driver frontend. E.g., while VM is waiting for an extra IRQ, vhost side did not trigger IRQ, suppose vring_need_event() erroneously returns false due to corrupted ring buffer status. This was initially proposed for vhost only and I was going to export ioeventfd/irqfd from vhost to admin via sysfs. 
Finally, I realized it would be better to implement this in QEMU. QEMU
initializes the eventfds (ioeventfd and irqfd) and offloads them to KVM/vhost.
The VM side sends requests to the ring buffer and kicks the doorbell (via
ioeventfd), while the backend vhost side sends responses back and calls the
IRQ (via irqfd). Unfortunately,
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
On 3/28/21 8:56 PM, Jason Wang wrote:
>
> On 2021/3/27 5:16 AM, Dongli Zhang wrote:
>> Hi Jason,
>>
>> On 3/26/21 12:24 AM, Jason Wang wrote:
>>> On 2021/3/26 1:44 PM, Dongli Zhang wrote:
>>>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>>>> the loss of doorbell kick, e.g.,
>>>>
>>>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>>>
>>>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>>>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>>>
>>>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>>>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>>>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>>>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>>>> or legacy IRQ).
>>>>
>>>> The 'call' is to inject an irq on purpose by the admin for a specific
>>>> device (e.g., vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick
>>>> the doorbell on purpose by the admin at the QEMU/host side for a specific
>>>> device.
>>>>
>>>> This interface can be used as a workaround if call/kick is lost due to a
>>>> virtualization software (e.g., kernel or QEMU) issue.
>>>>
>>>> We may also implement the interface for VFIO PCI, e.g., writing to
>>>> VFIOPCIDevice->msi_vectors[i].interrupt will be able to inject an IRQ into
>>>> the VM on purpose. This is considered future work once the virtio part is
>>>> done.
>>>>
>>>> Below is from live crash analysis. Initially, queue=2 has count=15 for the
>>>> 'kick' eventfd_ctx. Suppose there is data in vring avail while there is no
>>>> used available. We suspect this is because vhost-scsi was not notified by
>>>> the VM. In order to narrow down and analyze the issue, we use live crash to
>>>> dump the current counter of the eventfd for queue=2.
>>>>
>>>> crash> eventfd_ctx 8f67f6bbe700
>>>> struct eventfd_ctx {
>>>>   kref = {
>>>>     refcount = {
>>>>       refs = {
>>>>         counter = 4
>>>>       }
>>>>     }
>>>>   },
>>>>   wqh = {
>>>>     lock = {
>>>>       {
>>>>         rlock = {
>>>>           raw_lock = {
>>>>             val = {
>>>>               counter = 0
>>>>             }
>>>>           }
>>>>         }
>>>>       }
>>>>     },
>>>>     head = {
>>>>       next = 0x8f841dc08e18,
>>>>       prev = 0x8f841dc08e18
>>>>     }
>>>>   },
>>>>   count = 15, ---> eventfd is 15 !!!
>>>>   flags = 526336
>>>> }
>>>>
>>>> Now we kick the doorbell for vhost-scsi queue=2 on purpose for diagnostics
>>>> with this interface.
>>>>
>>>> { "execute": "x-debug-device-event",
>>>>   "arguments": { "dev": "/machine/peripheral/vscsi0",
>>>>                  "event": "kick", "queue": 2 } }
>>>>
>>>> The counter is increased to 16. Suppose the hang issue is resolved; that
>>>> indicates something bad is in software such that the 'kick' is lost.
>>>
>>> What do you mean by "software" here? And it looks to me you're testing
>>> whether event_notifier_set() is called by virtio_queue_notify() here. If so,
>>> I'm not sure how much value we could gain from a dedicated debug interface
>>> like this, considering there are a lot of existing general-purpose debugging
>>> methods like tracing or gdb. I'd say the path from virtio_queue_notify() to
>>> event_notifier_set() is only a very small fraction of the process of a
>>> virtqueue kick, which is unlikely to be buggy. Considering the ioeventfd
>>> will usually be offloaded to KVM, it's more likely that something is wrong
>>> in setting up the ioeventfd instead of here. Irq is even more complicated.
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
On 4/1/21 8:47 PM, Jason Wang wrote:
>
> On 2021/3/30 3:29 PM, Dongli Zhang wrote:
>> [...]
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
On 4/5/21 6:55 PM, Jason Wang wrote:
>
> On 2021/4/6 4:00 AM, Dongli Zhang wrote:
>> [...]
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
On 4/6/21 1:43 AM, Dongli Zhang wrote:
>
> On 4/5/21 6:55 PM, Jason Wang wrote:
>> [...]
Re: [PATCH 1/6] qdev: introduce qapi/hmp command for kick/call event
On 4/7/21 6:40 AM, Eduardo Habkost wrote:
> On Thu, Mar 25, 2021 at 10:44:28PM -0700, Dongli Zhang wrote:
>> The virtio device/driver (e.g., vhost-scsi or vhost-net) may hang due to
>> the loss of doorbell kick, e.g.,
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2018-12/msg01711.html
>>
>> ... or due to the loss of IRQ, e.g., as fixed by linux kernel commit
>> fe200ae48ef5 ("genirq: Mark polled irqs and defer the real handler").
>>
>> This patch introduces a new debug interface 'DeviceEvent' to DeviceClass
>> to help narrow down if the issue is due to loss of irq/kick. So far the new
>> interface handles only two events: 'call' and 'kick'. Any device (e.g.,
>> virtio/vhost or VFIO) may implement the interface (e.g., via eventfd, MSI-X
>> or legacy IRQ).
>>
>> The 'call' is to inject an irq on purpose by the admin for a specific
>> device (e.g., vhost-scsi) from QEMU/host to VM, while the 'kick' is to kick
>> the doorbell on purpose by the admin at the QEMU/host side for a specific
>> device.
>>
>> Signed-off-by: Dongli Zhang
> [...]
>> diff --git a/include/monitor/hmp.h b/include/monitor/hmp.h
>> index 605d57287a..c7795d4ba5 100644
>> --- a/include/monitor/hmp.h
>> +++ b/include/monitor/hmp.h
>> @@ -129,5 +129,6 @@ void hmp_info_replay(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_break(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_delete_break(Monitor *mon, const QDict *qdict);
>>  void hmp_replay_seek(Monitor *mon, const QDict *qdict);
>> +void hmp_x_debug_device_event(Monitor *mon, const QDict *qdict);
>>
>>  #endif
>> diff --git a/qapi/qdev.json b/qapi/qdev.json
>> index b83178220b..711c4a297a 100644
>> --- a/qapi/qdev.json
>> +++ b/qapi/qdev.json
>> @@ -124,3 +124,33 @@
>>  ##
>>  { 'event': 'DEVICE_DELETED',
>>    'data': { '*device': 'str', 'path': 'str' } }
>> +
>> +##
>> +# @x-debug-device-event:
>> +#
>> +# Generate device event for a specific device queue
>> +#
>> +# @dev: device path
>> +#
>> +# @event: event (e.g., kick or call) to trigger
>
> Any specific reason to not use an enum here?
>
> In addition to making the QAPI schema and documentation more
> descriptive, it would save you the work of manually defining the
> DEVICE_EVENT_* constants and implementing get_device_event().

Thank you very much for the suggestion! I will use an enum in the json file.

Dongli Zhang

>> +#
>> +# @queue: queue id
>> +#
>> +# Returns: Nothing on success
>> +#
>> +# Since: 6.1
>> +#
>> +# Notes: This is used to debug VM driver hang issues. The 'kick' event is to
>> +#        send a notification to QEMU/vhost while the 'call' event is to
>> +#        interrupt the VM on purpose.
>> +#
>> +# Example:
>> +#
>> +# -> { "execute": "x-debug-device-event",
>> +#      "arguments": { "dev": "/machine/peripheral/vscsi0", "event": "kick",
>> +#                     "queue": 1 } }
>> +# <- { "return": {} }
>> +#
>> +##
>> +{ 'command': 'x-debug-device-event',
>> +  'data': {'dev': 'str', 'event': 'str', 'queue': 'int'} }
> [...]
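For reference, one possible (untested) shape of the enum Eduardo suggested, following the usual QAPI conventions; the name 'DeviceEventType' is an assumption, not part of the posted patch:

```json
{ 'enum': 'DeviceEventType',
  'data': [ 'kick', 'call' ] }

{ 'command': 'x-debug-device-event',
  'data': { 'dev': 'str', 'event': 'DeviceEventType', 'queue': 'int' } }
```

With the enum, the QAPI generator emits the constants and lookup table, so the hand-written DEVICE_EVENT_* defines and get_device_event() can be dropped.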
Re: [PATCH 0/6] Add debug interface to kick/call on purpose
On 4/6/21 7:20 PM, Jason Wang wrote:
>
> On 2021/4/7 7:27 AM, Dongli Zhang wrote:
>>> This will answer your question that "Can it bypass the masking?".
>>>
>>> For vhost-scsi, virtio-blk, virtio-scsi and virtio-net, writing to the
>>> eventfd is not able to bypass masking because masking is done by
>>> unregistering the eventfd. Writing to the eventfd does not take effect.
>>>
>>> However, it is possible to bypass masking for vhost-net because vhost-net
>>> registers a specific masked_notifier eventfd in order to mask the irq.
>>> Writing to the original eventfd still takes effect.
>>>
>>> We may leave the user to decide whether to write to 'masked_notifier' or
>>> the original 'guest_notifier' for vhost-net.
>>
>> My fault here. To write to masked_notifier will always be masked :(
>
> Only when there's no bug in the qemu.
>
>> If it is at the EventNotifier level, we will not care whether the
>> EventNotifier is masked or not. It just provides an interface to write to
>> an EventNotifier.
>
> Yes.
>
>> To dump the MSI-X table for both virtio and vfio will help confirm if the
>> vector is masked.
>
> That would be helpful as well. It's probably better to extend the "info pci"
> command.
>
> Thanks

I will try to add this to "info pci" (introduce a new arg option to "info
pci"), or to introduce a new command.

About the EventNotifier, I will classify them as guest notifier or host
notifier so that it will be much easier for the user to tell if the eventfd
is for injecting an IRQ or kicking the doorbell.

Thank you very much for all the suggestions!

Dongli Zhang
[PATCH v2 3/3] virtio-pci/hmp: implement device specific hmp interface
This patch is to implement the device specific interface to dump the
mapping between virtio queues and vectors.

(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6

Cc: Jason Wang
Cc: Joe Jin
Suggested-by: Jason Wang
Signed-off-by: Dongli Zhang
---
 hw/virtio/virtio-pci.c     | 22 ++
 hw/virtio/virtio.c         | 10 ++
 include/hw/virtio/virtio.h |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index b321604d9b..edc143285b 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -37,6 +37,7 @@
 #include "qemu/range.h"
 #include "hw/virtio/virtio-bus.h"
 #include "qapi/visitor.h"
+#include "monitor/monitor.h"

 #define VIRTIO_PCI_REGION_SIZE(dev) VIRTIO_PCI_CONFIG_OFF(msix_present(dev))

@@ -1563,6 +1564,26 @@ static void virtio_pci_modern_io_region_unmap(VirtIOPCIProxy *proxy,
                                               &region->mr);
 }

+static void virtio_pci_dc_msix_info(Monitor *mon, PCIDevice *dev,
+                                    Error **errp)
+{
+    DeviceState *qdev = DEVICE(dev);
+    VirtIOPCIProxy *proxy = to_virtio_pci_proxy_fast(qdev);
+    VirtIODevice *vdev = virtio_bus_get_device(&proxy->bus);
+    int num = virtio_get_num_queues(vdev);
+    int i;
+
+    monitor_printf(mon, "virtio pci vector info:\n");
+
+    monitor_printf(mon, "config: %d\n", virtio_get_config_vector(vdev));
+
+    for (i = 0; i < num; i++)
+        monitor_printf(mon, "queue %d: %u\n",
+                       i, virtio_get_vector(vdev, i));
+
+    monitor_printf(mon, "\n");
+}
+
 static void virtio_pci_pre_plugged(DeviceState *d, Error **errp)
 {
     VirtIOPCIProxy *proxy = VIRTIO_PCI(d);
@@ -1975,6 +1996,7 @@ static void virtio_pci_class_init(ObjectClass *klass, void *data)
     k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
     k->revision = VIRTIO_PCI_ABI_VERSION;
     k->class_id = PCI_CLASS_OTHERS;
+    k->msix_info = virtio_pci_dc_msix_info;
     device_class_set_parent_realize(dc, virtio_pci_dc_realize,
                                     &vpciklass->parent_dc_realize);
     dc->reset = virtio_pci_reset;
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index e02544b2df..fd07d9deac 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -2581,6 +2581,16 @@ void virtio_notify_config(VirtIODevice *vdev)
     virtio_notify_vector(vdev, vdev->config_vector);
 }

+uint16_t virtio_get_vector(VirtIODevice *vdev, int n)
+{
+    return vdev->vq[n].vector;
+}
+
+uint16_t virtio_get_config_vector(VirtIODevice *vdev)
+{
+    return vdev->config_vector;
+}
+
 static bool virtio_device_endian_needed(void *opaque)
 {
     VirtIODevice *vdev = opaque;
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..6746227f73 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -268,6 +268,8 @@ int virtio_set_status(VirtIODevice *vdev, uint8_t val);
 void virtio_reset(void *opaque);
 void virtio_update_irq(VirtIODevice *vdev);
 int virtio_set_features(VirtIODevice *vdev, uint64_t val);
+uint16_t virtio_get_vector(VirtIODevice *vdev, int n);
+uint16_t virtio_get_config_vector(VirtIODevice *vdev);

 /* Base devices. */
 typedef struct VirtIOBlkConf VirtIOBlkConf;
-- 
2.17.1
[PATCH v2 2/3] msix/hmp: add interface to dump device specific info
While the previous patch is to dump the MSI-X table, sometimes we may need
to dump device specific data, e.g., to help match the vector with the
specific device queue. This patch is to add the PCI device specific
interface to help dump that information. Any PCI device class may
implement this PCIDeviceClass->msix_info interface.

Cc: Jason Wang
Cc: Joe Jin
Suggested-by: Jason Wang
Signed-off-by: Dongli Zhang
---
 hmp-commands-info.hx   |  7 ---
 include/hw/pci/pci.h   |  3 +++
 softmmu/qdev-monitor.c | 11 +++
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index cbd056442b..df9cdc6c7e 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -223,9 +223,10 @@ ERST

     {
         .name       = "msix",
-        .args_type  = "dev:s",
-        .params     = "dev",
-        .help       = "dump MSI-X information",
+        .args_type  = "info:-d,dev:s",
+        .params     = "[-d] dev",
+        .help       = "dump MSI-X information; "
+                      "(-d: show device specific info)",
         .cmd        = hmp_info_msix,
     },

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 6be4e0c460..4620b9e757 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -129,6 +129,8 @@ typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int region_num,
                                 pcibus_t addr, pcibus_t size, int type);
 typedef void PCIUnregisterFunc(PCIDevice *pci_dev);

+typedef void PCIMSIXInfoFunc(Monitor *mon, PCIDevice *dev, Error **errp);
+
 typedef struct PCIIORegion {
     pcibus_t addr; /* current PCI mapping address. -1 means not mapped */
 #define PCI_BAR_UNMAPPED (~(pcibus_t)0)
@@ -224,6 +226,7 @@ struct PCIDeviceClass {
     PCIUnregisterFunc *exit;
     PCIConfigReadFunc *config_read;
     PCIConfigWriteFunc *config_write;
+    PCIMSIXInfoFunc *msix_info;

     uint16_t vendor_id;
     uint16_t device_id;
diff --git a/softmmu/qdev-monitor.c b/softmmu/qdev-monitor.c
index 7837a17d0d..7fd3fe0ada 100644
--- a/softmmu/qdev-monitor.c
+++ b/softmmu/qdev-monitor.c
@@ -1011,7 +1011,9 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
 {
     const char *name = qdict_get_str(qdict, "dev");
     DeviceState *dev = find_device_state(name, NULL);
+    bool info = qdict_get_try_bool(qdict, "info", false);
     PCIDevice *pci_dev;
+    PCIDeviceClass *pc;
     Error *err = NULL;

     if (!dev) {
@@ -1027,6 +1029,15 @@ void hmp_info_msix(Monitor *mon, const QDict *qdict)
     pci_dev = PCI_DEVICE(dev);
     msix_dump_info(mon, pci_dev, &err);

+    if (info) {
+        pc = PCI_DEVICE_GET_CLASS(pci_dev);
+        if (pc->msix_info) {
+            pc->msix_info(mon, pci_dev, &err);
+        } else {
+            error_setg(&err, "Device specific info not supported");
+        }
+    }
+
 exit:
     hmp_handle_error(mon, err);
 }
-- 
2.17.1
[PATCH v2 1/3] msix/hmp: add hmp interface to dump MSI-X info
This patch is to add the HMP interface to dump the MSI-X table and PBA, in
order to help diagnose the loss-of-IRQ issue in the VM (e.g., if an MSI-X
vector is erroneously masked permanently). Here is an example with
vhost-scsi:

(qemu) info msix /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

Since the number of MSI-X entries is not fixed and might be very large, it
is sometimes inappropriate to dump them via QMP. Therefore, this patch
dumps MSI-X information only via HMP, which is similar to the
implementation of hmp_info_mem().

Cc: Jason Wang
Cc: Joe Jin
Signed-off-by: Dongli Zhang
Acked-by: Dr. David Alan Gilbert
---
Changed since v1:
- Add heading to MSI-X table (suggested by David Alan Gilbert)
- Not sure if it is fine to keep the Acked-by from David, as there are very
  few changes

 hmp-commands-info.hx   | 13 +
 hw/pci/msix.c          | 63 ++
 include/hw/pci/msix.h  |  2 ++
 include/monitor/hmp.h  |  1 +
 softmmu/qdev-monitor.c | 25 +
 5 files changed, 104 insertions(+)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index ab0c7aa5ee..cbd056442b 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -221,6 +221,19 @@ SRST
     Show PCI information.
 ERST

+    {
+        .name       = "msix",
+        .args_type  = "dev:s",
+        .params     = "dev",
+        .help       = "dump MSI-X information",
+        .cmd        = hmp_info_msix,
+    },
+
+SRST
+  ``info msix`` *dev*
+    Dump MSI-X information for device *dev*.
+ERST
+
 #if defined(TARGET_I386) || defined(TARGET_SH4) || defined(TARGET_SPARC) || \
     defined(TARGET_PPC) || defined(TARGET_XTENSA) || defined(TARGET_M68K)
     {
diff --git a/hw/pci/msix.c b/hw/pci/msix.c
index ae9331cd0b..4b4ec87eee 100644
--- a/hw/pci/msix.c
+++ b/hw/pci/msix.c
@@ -22,6 +22,7 @@
 #include "sysemu/xen.h"
 #include "migration/qemu-file-types.h"
 #include "migration/vmstate.h"
+#include "monitor/monitor.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "trace.h"
@@ -669,3 +670,65 @@ const VMStateDescription vmstate_msix = {
         VMSTATE_END_OF_LIST()
     }
 };
+
+static void msix_dump_table(Monitor *mon, PCIDevice *dev)
+{
+    int vector;
+    uint32_t val;
+    uint8_t *table_entry;
+
+    monitor_printf(mon, "Msg L.Addr ");
+    monitor_printf(mon, "Msg U.Addr ");
+    monitor_printf(mon, "Msg Data ");
+    monitor_printf(mon, "Vect Ctrl\n");
+
+    for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+        table_entry = dev->msix_table + vector * PCI_MSIX_ENTRY_SIZE;
+
+        val = pci_get_long(table_entry + PCI_MSIX_ENTRY_LOWER_ADDR);
+        monitor_printf(mon, "0x%08x ", val);
+
+        val = pci_get_long(table_entry + PCI_MSIX_ENTRY_UPPER_ADDR);
+        monitor_printf(mon, "0x%08x ", val);
+
+        val = pci_get_long(table_entry + PCI_MSIX_ENTRY_DATA);
+        monitor_printf(mon, "0x%08x ", val);
+
+        val = pci_get_long(table_entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
+        monitor_printf(mon, "0x%08x\n", val);
+    }
+
+    monitor_printf(mon, "\n");
+}
+
+static void msix_dump_pba(Monitor *mon, PCIDevice *dev)
+{
+    int vector;
+
+    monitor_printf(mon, "MSI-X PBA\n");
+
+    for (vector = 0; vector < dev->msix_entries_nr; vector++) {
+        monitor_printf(mon, "%d ", !!msix_is_pending(dev, vector));
+
+        if (vector % 16 == 15) {
+            monitor_printf(mon, "\n");
+        }
+    }
+
+    if (vector % 16 != 15) {
+        monitor_printf(mon, "\n");
+    }
+
+    monitor_printf(mon, "\n");
+}
+
+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp)
+{
+    if (!msix_present(dev)) {
+        error_setg(errp, "MSI-X not available");
+        return;
+    }
+
+    msix_dump_table(mon, dev);
+    msix_dump_pba(mon, dev);
+}
diff --git a/include/hw/pci/msix.h b/include/hw/pci/msix.h
index 4c4a60c739..10a4500295 100644
--- a/include/hw/pci/msix.h
+++ b/include/hw/pci/msix.h
@@ -47,6 +47,8 @@ int msix_set_vector_notifiers(PCIDevice *dev,
                               MSIVectorPollNotifier poll_notifier);
 void msix_unset_vector_notifiers(PCIDevice *dev);

+void msix_dump_info(Monitor *mon, PCIDevice *dev, Error **errp);
+
 extern const VMStateDescription vmstate_msix;

 #define VMSTATE_MSIX_TEST(_field, _state, _test) { \
diff --git a/include/monitor/hmp.h b/include/m
[PATCH v2 0/3] To add HMP interface to dump PCI MSI-X table/PBA
This patch set is to introduce the new HMP command to dump the MSI-X
table/PBA.

Here is the RFC v1:
https://lists.gnu.org/archive/html/qemu-devel/2021-04/msg04673.html

The idea was inspired by the discussion below:
https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg09020.html

(qemu) info msix -d /machine/peripheral/vscsi0
Msg L.Addr Msg U.Addr Msg Data Vect Ctrl
0xfee0 0x 0x4041 0x
0xfee0 0x 0x4051 0x
0xfee0 0x 0x4061 0x
0xfee0 0x 0x4071 0x
0xfee01000 0x 0x40b1 0x
0xfee02000 0x 0x40c1 0x
0xfee03000 0x 0x40d1 0x

MSI-X PBA
0 0 0 0 0 0 0

virtio pci vector info:
config: 0
queue 0: 1
queue 1: 2
queue 2: 3
queue 3: 4
queue 4: 5
queue 5: 6

Changed since RFC v1:
- Add heading to MSI-X table (suggested by David Alan Gilbert)
- Add device specific interface, e.g., to dump the virtio-pci
  queue-to-vector mapping (suggested by Jason)

 hmp-commands-info.hx       | 14 +
 hw/pci/msix.c              | 63 +
 hw/virtio/virtio-pci.c     | 22 ++
 hw/virtio/virtio.c         | 10 +++
 include/hw/pci/msix.h      |  2 ++
 include/hw/pci/pci.h       |  3 ++
 include/hw/virtio/virtio.h |  2 ++
 include/monitor/hmp.h      |  1 +
 softmmu/qdev-monitor.c     | 36 +++
 9 files changed, 153 insertions(+)

Thank you very much!

Dongli Zhang