[PATCH v4] KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
On vcpu schedule, the condition checked for tlb pollution is too loose. The tlb entries of a vcpu become polluted (vs stale) only when a different vcpu within the same logical partition runs in-between. Optimize the tlb invalidation condition keeping last_vcpu per logical partition id. With the new invalidation condition, a guest shows 4% performance improvement on P5020DS while running a memory stress application with the cpu oversubscribed, the other guest running a cpu intensive workload. Guest - old invalidation condition real 3.89 user 3.87 sys 0.01 Guest - enhanced invalidation condition real 3.75 user 3.73 sys 0.01 Host real 3.70 user 1.85 sys 0.00 The memory stress application accesses 4KB pages backed by 75% of available TLB0 entries: char foo[ENTRIES][4096] __attribute__ ((aligned (4096))); int main() { char bar; int i, j; for (i = 0; i < ITERATIONS; i++) for (j = 0; j < ENTRIES; j++) bar = foo[j][0]; return 0; } Signed-off-by: Mihai Caraman Cc: Scott Wood --- v4: - rename last_vcpu_on_cpu to last_vcpu_of_lpid - use "*[" syntax despite checkpatch error v3: - use existing logic while keeping last_cpu per lpid arch/powerpc/kvm/e500mc.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c index 17e4562..690499d 100644 --- a/arch/powerpc/kvm/e500mc.c +++ b/arch/powerpc/kvm/e500mc.c @@ -110,7 +110,7 @@ void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 old_msr) { } -static DEFINE_PER_CPU(struct kvm_vcpu *, last_vcpu_on_cpu); +static DEFINE_PER_CPU(struct kvm_vcpu *[KVMPPC_NR_LPIDS], last_vcpu_of_lpid); static void kvmppc_core_vcpu_load_e500mc(struct kvm_vcpu *vcpu, int cpu) { @@ -141,9 +141,9 @@ static void kvmppc_core_vcpu_load_e500mc(struct kvm_vcpu *vcpu, int cpu) mtspr(SPRN_GESR, vcpu->arch.shared->esr); if (vcpu->arch.oldpir != mfspr(SPRN_PIR) || - __get_cpu_var(last_vcpu_on_cpu) != vcpu) { + __get_cpu_var(last_vcpu_of_lpid)[vcpu->kvm->arch.lpid] != vcpu) { kvmppc_e500_tlbil_all(vcpu_e500); - __get_cpu_var(last_vcpu_on_cpu) = vcpu; + __get_cpu_var(last_vcpu_of_lpid)[vcpu->kvm->arch.lpid] = vcpu; } kvmppc_load_guest_fp(vcpu); -- 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
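For reference, the stress program quoted in the changelog, written out as a self-contained file; ENTRIES and ITERATIONS are assumed values (the posting does not spell them out), with 384 chosen on the assumption of a 512-entry TLB0, and bar is made volatile here so the compiler does not drop the loads:

#define ENTRIES    384		/* assumed: 75% of a 512-entry TLB0 */
#define ITERATIONS 100000	/* assumed: arbitrary repeat count */

static char foo[ENTRIES][4096] __attribute__ ((aligned (4096)));

int main(void)
{
	volatile char bar;	/* volatile so the loads are not optimized away */
	int i, j;

	for (i = 0; i < ITERATIONS; i++)
		for (j = 0; j < ENTRIES; j++)
			bar = foo[j][0];	/* touch one byte per 4KB page */

	(void)bar;
	return 0;
}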
Re: [PATCH v3 5/6] VMX: Validate capability MSRs
Il 18/06/2014 07:32, Jan Kiszka ha scritto: Check for required-0 or required-1 bits as well as known field value restrictions. Also check the consistency between VMX_*_CTLS and VMX_TRUE_*_CTLS and between CR0/4_FIXED0 and CR0/4_FIXED1. Signed-off-by: Jan Kiszka --- Changes in v3: - integrated suggestions of Paolo x86/vmx.c | 74 ++- x86/vmx.h | 5 +++-- 2 files changed, 76 insertions(+), 3 deletions(-) diff --git a/x86/vmx.c b/x86/vmx.c index f01e443..5bb5969 100644 --- a/x86/vmx.c +++ b/x86/vmx.c @@ -661,6 +661,77 @@ static void test_vmptrst(void) report("test vmptrst", (!ret) && (vmcs1 == vmcs2)); } +struct vmx_ctl_msr { + const char *name; + u32 index, true_index; + u32 default1; +} vmx_ctl_msr[] = { + { "MSR_IA32_VMX_PINBASED_CTLS", MSR_IA32_VMX_PINBASED_CTLS, + MSR_IA32_VMX_TRUE_PIN, 0x16 }, + { "MSR_IA32_VMX_PROCBASED_CTLS", MSR_IA32_VMX_PROCBASED_CTLS, + MSR_IA32_VMX_TRUE_PROC, 0x401e172 }, + { "MSR_IA32_VMX_PROCBASED_CTLS2", MSR_IA32_VMX_PROCBASED_CTLS2, + MSR_IA32_VMX_PROCBASED_CTLS2, 0 }, + { "MSR_IA32_VMX_EXIT_CTLS", MSR_IA32_VMX_EXIT_CTLS, + MSR_IA32_VMX_TRUE_EXIT, 0x36dff }, + { "MSR_IA32_VMX_ENTRY_CTLS", MSR_IA32_VMX_ENTRY_CTLS, + MSR_IA32_VMX_TRUE_ENTRY, 0x11ff }, +}; + +static void test_vmx_caps(void) +{ + u64 val, default1, fixed0, fixed1; + union vmx_ctrl_msr ctrl, true_ctrl; + unsigned int n; + bool ok; + + printf("\nTest suite: VMX capability reporting\n"); + + report("MSR_IA32_VMX_BASIC", + (basic.revision & (1ul << 31)) == 0 && + basic.size > 0 && basic.size <= 4096 && + (basic.type == 0 || basic.type == 6) && + basic.reserved1 == 0 && basic.reserved2 == 0); + + val = rdmsr(MSR_IA32_VMX_MISC); + report("MSR_IA32_VMX_MISC", + (!(ctrl_cpu_rev[1].clr & CPU_URG) || val & (1ul << 5)) && + ((val >> 16) & 0x1ff) <= 256 && + (val & 0xc0007e00) == 0); + + for (n = 0; n < ARRAY_SIZE(vmx_ctl_msr); n++) { + ctrl.val = rdmsr(vmx_ctl_msr[n].index); + default1 = vmx_ctl_msr[n].default1; + ok = (ctrl.set & default1) == default1; + ok = ok && (ctrl.set & ~ctrl.clr) == 0; + if (ok && basic.ctrl) { + true_ctrl.val = rdmsr(vmx_ctl_msr[n].true_index); + ok = ctrl.clr == true_ctrl.clr; + ok = ok && ctrl.set == (true_ctrl.set | default1); + } + report(vmx_ctl_msr[n].name, ok); + } + + fixed0 = rdmsr(MSR_IA32_VMX_CR0_FIXED0); + fixed1 = rdmsr(MSR_IA32_VMX_CR0_FIXED1); + report("MSR_IA32_VMX_IA32_VMX_CR0_FIXED0/1", + ((fixed0 ^ fixed1) & ~fixed1) == 0); + + fixed0 = rdmsr(MSR_IA32_VMX_CR4_FIXED0); + fixed1 = rdmsr(MSR_IA32_VMX_CR4_FIXED1); + report("MSR_IA32_VMX_IA32_VMX_CR4_FIXED0/1", + ((fixed0 ^ fixed1) & ~fixed1) == 0); + + val = rdmsr(MSR_IA32_VMX_VMCS_ENUM); + report("MSR_IA32_VMX_VMCS_ENUM", + (val & 0x3e) >= 0x2a && + (val & 0xfc01Ull) == 0); + + val = rdmsr(MSR_IA32_VMX_EPT_VPID_CAP); + report("MSR_IA32_VMX_EPT_VPID_CAP", + (val & 0xf07ef9eebebeUll) == 0); +} + /* This function can only be called in guest */ static void __attribute__((__used__)) hypercall(u32 hypercall_no) { @@ -803,7 +874,7 @@ static int test_run(struct vmx_test *test) regs = test->guest_regs; vmcs_write(GUEST_RFLAGS, regs.rflags | 0x2); launched = 0; - printf("\nTest suite : %s\n", test->name); + printf("\nTest suite: %s\n", test->name); vmx_run(); if (vmx_off()) { printf("%s : vmxoff failed.\n", __func__); @@ -842,6 +913,7 @@ int main(void) goto exit; } test_vmxoff(); + test_vmx_caps(); while (vmx_tests[++i].name != NULL) if (test_run(&vmx_tests[i])) diff --git a/x86/vmx.h b/x86/vmx.h index 00f2842..87457b1 100644 --- a/x86/vmx.h +++ b/x86/vmx.h @@ -46,12 +46,13 @@ union vmx_basic { struct { u32 revision; u32 
size:13, - : 3, + reserved1: 3, width:1, dual:1, type:4, insouts:1, - ctrl:1; + ctrl:1, + reserved2:8; }; }; Thanks, Reviewed-by: Paolo Bonzini -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Why I advise against using ivshmem
On Tue, Jun 17, 2014 at 11:44:11AM +0200, Paolo Bonzini wrote: > Il 17/06/2014 11:03, David Marchand ha scritto: > >>Unless someone steps up and maintains ivshmem, I think it should be > >>deprecated and dropped from QEMU. > > > >Then I can maintain ivshmem for QEMU. > >If this is ok, I will send a patch for MAINTAINERS file. > > Typically, adding yourself to maintainers is done only after having proved > your ability to be a maintainer. :) > > So, let's stop talking and go back to code! You can start doing what was > suggested elsewhere in the thread: get the server and uio driver merged into > the QEMU tree, document the protocol in docs/specs/ivshmem_device_spec.txt, > and start fixing bugs such as the ones that Markus reported. One more thing to add to the list: static void ivshmem_read(void *opaque, const uint8_t * buf, int flags) The "flags" argument should be "size". Size should be checked before accessing buf. Please also see the bug fixes in the following unapplied patch: "[PATCH] ivshmem: fix potential OOB r/w access (#2)" by Sebastian Krahmer https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg03538.html Stefan pgp5LqUdx7Rvg.pgp Description: PGP signature
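For what it's worth, the kind of fix being asked for might look roughly like the sketch below. This is not the actual QEMU code: the surviving logic is elided and the incoming_posn handling is only assumed; the point is the renamed third argument (it really is the number of valid bytes in buf) and the bounds check before buf is dereferenced.

#include <stdint.h>
#include <string.h>

/* Sketch only -- not the actual QEMU implementation. */
static void ivshmem_read(void *opaque, const uint8_t *buf, int size)
{
    int64_t incoming_posn;

    (void)opaque;
    if (size < (int)sizeof(incoming_posn)) {
        /* Short read from the server: bail out instead of reading past buf. */
        return;
    }
    memcpy(&incoming_posn, buf, sizeof(incoming_posn));
    /* ... continue with the existing peer/eventfd bookkeeping ... */
}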
Re: [Qemu-devel] Why I advise against using ivshmem
On Tue, Jun 17, 2014 at 11:03:32AM +0200, David Marchand wrote: > On 06/17/2014 04:54 AM, Stefan Hajnoczi wrote: > >ivshmem has a performance disadvantage for guest-to-host > >communication. Since the shared memory is exposed as PCI BARs, the > >guest has to memcpy into the shared memory. > > > >vhost-user can access guest memory directly and avoid the copy inside the > >guest. > > Actually, you can avoid this memory copy using frameworks like DPDK. I guess one has to be careful to allocate all packets in the mmapped BAR? That's fine if you can modify applications but doesn't work for unmodified applications using regular networking APIs. Stefan
Re: [PATCH 03/11] qspinlock: Add pending bit
Il 17/06/2014 22:36, Konrad Rzeszutek Wilk ha scritto: + /* One more attempt - but if we fail mark it as pending. */ + if (val == _Q_LOCKED_VAL) { + new = Q_LOCKED_VAL |_Q_PENDING_VAL; + + old = atomic_cmpxchg(&lock->val, val, new); + if (old == _Q_LOCKED_VAL) /* YEEY! */ + return; + val = old; + } Note that Peter's code is in a for(;;) loop: + for (;;) { + /* +* If we observe any contention; queue. +*/ + if (val & ~_Q_LOCKED_MASK) + goto queue; + + new = _Q_LOCKED_VAL; + if (val == new) + new |= _Q_PENDING_VAL; + + old = atomic_cmpxchg(&lock->val, val, new); + if (old == val) + break; + + val = old; + } + + /* +* we won the trylock +*/ + if (new == _Q_LOCKED_VAL) + return; So what you'd have is basically: /* * One more attempt if no one is already in queue. Perhaps * they have unlocked the spinlock already. */ if (val == _Q_LOCKED_VAL && atomic_read(&lock->val) == 0) { old = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL); if (old == 0) /* YEEY! */ return; val = old; } But I agree with Waiman that this is unlikely to trigger often enough. It does have to be handled in the slowpath for correctness, but the most likely path is (0,0,1)->(0,1,1). Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Nested paging in nested SVM setup
Hi all, I'm using a KVM/Qemu nested SVM setup to debug another hypervisor (Jailhouse) I contribute to. IOW, the scheme is: AMD64 Linux host running [paravirtualized] AMD64 Linux guest (the same kernel as the host) running Jailhouse. Jailhouse, in turn, uses Nested Paging to virtualize xAPIC: the APIC page (0xfee0, no APIC remapping) is mapped read-only into Jailhouse's guests. This of course implies that the APIC page appears to Jailhouse guests as uncacheable (UC). Is it achievable in the setup I described, or do I need to run my code on real hardware to make the APIC page accesses in Jailhouse guests uncacheable? Thanks in advance. -- Best regards, Valentine Sinitsyn
[PATCH] vfio: Fix endianness handling for emulated BARs
VFIO exposes BARs to user space as a byte stream so userspace can read it using pread()/pwrite(). Since this is a byte stream, VFIO should not do byte swapping and simply return values as it gets them from PCI device. Instead, the existing code assumes that byte stream in read/write is little-endian and it fixes endianness for values which it passes to ioreadXX/iowriteXX helpers. This works for little-endian as PCI is little endian and le32_to_cpu/... are stubs. This also works for big endian but rather by an accident: it reads 4 bytes from the stream (@val is big endian), converts to CPU format (which should be big endian) as it was little endian (@val becomes actually little endian) and calls iowrite32() which does not do swapping on big endian system. This removes byte swapping and makes use ioread32be/iowrite32be (and 16bit versions) on big-endian systems. The "be" helpers take native endian values and do swapping at the moment of writing to a PCI register using one of "store byte-reversed" instructions. Suggested-by: Benjamin Herrenschmidt Signed-off-by: Alexey Kardashevskiy --- drivers/vfio/pci/vfio_pci_rdwr.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 210db24..f363b5a 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -21,6 +21,18 @@ #include "vfio_pci_private.h" +#ifdef __BIG_ENDIAN__ +#define ioread16_nativeioread16be +#define ioread32_nativeioread32be +#define iowrite16_native iowrite16be +#define iowrite32_native iowrite32be +#else +#define ioread16_nativeioread16 +#define ioread32_nativeioread32 +#define iowrite16_native iowrite16 +#define iowrite32_native iowrite32 +#endif + /* * Read or write from an __iomem region (MMIO or I/O port) with an excluded * range which is inaccessible. The excluded range drops writes and fills @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf, if (copy_from_user(&val, buf, 4)) return -EFAULT; - iowrite32(le32_to_cpu(val), io + off); + iowrite32_native(val, io + off); } else { - val = cpu_to_le32(ioread32(io + off)); + val = ioread32_native(io + off); if (copy_to_user(buf, &val, 4)) return -EFAULT; @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf, if (copy_from_user(&val, buf, 2)) return -EFAULT; - iowrite16(le16_to_cpu(val), io + off); + iowrite16_native(val, io + off); } else { - val = cpu_to_le16(ioread16(io + off)); + val = ioread16_native(io + off); if (copy_to_user(buf, &val, 2)) return -EFAULT; -- 2.0.0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
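To illustrate the "byte stream" point of the changelog from the other side of the interface, a userspace consumer reads the BAR with pread() and interprets the bytes itself. The sketch below is an assumption-laden example, not VFIO documentation: fd is assumed to be an already-open VFIO device fd and bar_offset the region offset reported by VFIO_DEVICE_GET_REGION_INFO.

#include <stdint.h>
#include <unistd.h>
#include <endian.h>

/* Read a 32-bit little-endian device register from a VFIO BAR region. */
static int read_reg32(int fd, off_t bar_offset, off_t reg, uint32_t *out)
{
	uint32_t raw;

	if (pread(fd, &raw, sizeof(raw), bar_offset + reg) != sizeof(raw))
		return -1;

	/*
	 * PCI registers are little-endian; the interpretation happens here,
	 * in userspace, not in the kernel's do_io_rw().
	 */
	*out = le32toh(raw);
	return 0;
}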
Fwd: KVM_SYSTEM_TIME clock is marked unstable on a modern single-socket system
Hi, I'm working on OSv (https://github.com/cloudius-systems/osv), a guest operating system. I've been investigating a phenomena of KVM_SYSTEM_TIME being marked as unstable (PVCLOCK_TSC_STABLE_BIT cleared) by KVM on a modern single-socket CPU since the very beginning of guest's life time. According to the ABI, when pvclock_vcpu_time_info.flags does not have the PVCLOCK_TSC_STABLE_BIT set we must not assume monotonicity and need to compensate for this using atomic cmpxchg. However we would like to avoid that cost unless necessary and since modern single-socket CPUs have a stable TSC this cost seems unnecessary. Setups tested: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz, kernel 3.13.0-29-generic (Ubuntu), Qemu 2.0.0 Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, kernel 3.14.7-200.fc20.x86_64 (Fedora), Qemu 1.6.2 Both CPUs have constant_tsc and nonstop_tsc flags and the host kernel is using TSC as the master clock source. Here's what tracing KVM during QEMU (v2.0.0) startup shows. I used `trace-cmd record -e kvm` and then `trace-cmd report | egrep '(tsc|clock)'`. === 1 vCPU === qemu-system-x86-1353 [003] 75446.815273: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1355 [002] 75446.849021: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446556065789505232 qemu-system-x86-1355 [002] 75446.849024: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc qemu-system-x86-1353 [000] 75446.905313: kvm_write_tsc_offset: vcpu=0 prev=18446556065789505232 next=18446556065789505232 qemu-system-x86-1353 [000] 75446.905315: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 1 hostclock tsc qemu-system-x86-1353 [000] 75446.925064: kvm_write_tsc_offset: vcpu=0 prev=18446556065789505232 next=18446556065789505232 qemu-system-x86-1353 [000] 75446.925065: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 2 nr_online 1 hostclock tsc qemu-system-x86-1353 [000] 75446.925119: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1355 [000] 75446.932931: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 === 2 vCPUs === qemu-system-x86-1431 [003] 75539.014452: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1433 [003] 75539.054169: kvm_write_tsc_offset: vcpu=0 prev=0 next=18446555836061830054 qemu-system-x86-1433 [003] 75539.054171: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc qemu-system-x86-1434 [003] 75539.078266: kvm_write_tsc_offset: vcpu=1 prev=0 next=18446555836061830054 qemu-system-x86-1434 [003] 75539.078269: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.134695: kvm_write_tsc_offset: vcpu=0 prev=18446555836061830054 next=18446555836061830054 qemu-system-x86-1431 [002] 75539.134698: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 2 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.142465: kvm_write_tsc_offset: vcpu=1 prev=18446555836061830054 next=18446555836061830054 qemu-system-x86-1431 [002] 75539.142468: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 3 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.182614: kvm_write_tsc_offset: vcpu=0 prev=18446555836061830054 next=18446555836061830054 qemu-system-x86-1431 [002] 75539.182617: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 4 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.182758: kvm_write_tsc_offset: vcpu=1 prev=18446555836061830054 next=18446555836061830054 
qemu-system-x86-1431 [002] 75539.182758: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 5 nr_online 2 hostclock tsc qemu-system-x86-1431 [002] 75539.182840: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1433 [003] 75539.194415: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 qemu-system-x86-1434 [001] 75539.218634: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0 When "masterclock" is 0 it means the PVCLOCK_TSC_STABLE_BIT will not be set. It is not set because offsets are supposedly not matched. However according to the kvm_track_tsc tracepoints each TSC write is matched. But there are 3 writes per-CPU which makes ka->nr_vcpus_matched_tsc to become larger than kvm->online_vcpus which makes the following check in pvclock_update_vm_gtod_copy() to fail: vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 == atomic_read(&kvm->online_vcpus)); I turns out that kvm_write_tsc() is called with the target TSC value of 0 from these 3 places per each vCPU: The first one: 0xa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel] 0xa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm] 0xa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm] 0xa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm] 0x811fd0f0 0x811fd331 (inexact) 0x81121eb6 (inexact) 0xfff
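For context on the cost being avoided: when PVCLOCK_TSC_STABLE_BIT is not set, every clock read in the guest has to be clamped against a shared last-seen value. The sketch below is modelled loosely on Linux's pvclock_clocksource_read and is not OSv's actual code; pvclock_get_nsec() and pvclock_flags() are assumed helpers returning the per-vCPU pvclock reading and its flags byte.

#include <stdint.h>
#include <stdbool.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/* Assumed helpers: per-vCPU pvclock time in ns and the flags field. */
extern uint64_t pvclock_get_nsec(void);
extern uint8_t pvclock_flags(void);

static uint64_t last_value;	/* shared by all vCPUs */

uint64_t pvclock_read(void)
{
	uint64_t ret = pvclock_get_nsec();
	uint64_t last;

	if (pvclock_flags() & PVCLOCK_TSC_STABLE_BIT)
		return ret;		/* fast path: no global cmpxchg */

	/* Slow path: enforce monotonicity across vCPUs by hand. */
	do {
		last = __atomic_load_n(&last_value, __ATOMIC_ACQUIRE);
		if (ret < last)
			return last;	/* never go backwards */
	} while (!__atomic_compare_exchange_n(&last_value, &last, ret, false,
					      __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE));

	return ret;
}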
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto: On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote: From: Waiman Long This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. And also adds a third try on acquiring the lock. That I think should be a seperate patch. It doesn't really add a new try, the old code is: - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val & _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(&lock->val, val, new); - if (old == val) - break; - - val = old; - } /* -* we won the trylock; forget about queueing. */ - if (new == _Q_LOCKED_VAL) - goto release; The trylock happens if the "if (val)" hits the else branch. What the patch does is change it from attempting two transition with a single cmpxchg: -* 0,0,0 -> 0,0,1 ; trylock -* p,y,x -> n,y,x ; prev = xchg(lock, node) to first doing the trylock, then the xchg. If the trylock passes and the xchg returns prev=0,0,0, the next step of the algorithm goes to the locked/uncontended state + /* +* claim the lock: +* +* n,0 -> 0,1 : lock, uncontended Similar to your suggestion of patch 3, it's expected that the xchg will *not* return prev=0,0,0 after a failed trylock. However, I *do* agree with you that it's simpler to just squash this patch into 01/11. Paolo And instead of saying 'later patch' you should spell out the name of the patch. Especially as this might not be obvious from somebody doing git bisection. Signed-off-by: Waiman Long Signed-off-by: Peter Zijlstra --- include/asm-generic/qspinlock_types.h |2 + kernel/locking/qspinlock.c| 58 +- 2 files changed, 38 insertions(+), 22 deletions(-) --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -61,6 +61,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U << _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET) --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -86,6 +86,31 @@ static inline struct mcs_spinlock *decod #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * xchg_tail - Put in the new queue tail code word & retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* -> n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(&lock->val); + + for (;;) { + new = (val & _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(&lock->val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -182,36 +207,25 @@ void queue_spin_lock_slowpath(struct qsp node->next = NULL; /* -* we already touched the queueing cacheline; don't bother with pending -* stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 -> 0,0,1 ; trylock -* p,y,x -> n,y,x ; prev = xchg(lock, node) +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. 
*/ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val & _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(&lock->val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; So now are three of them? One in queue_spin_lock, then at the start of this function when checking for the pending bit, and the once more here. And that is because the local cache line might be cold for the 'mcs_index' struct? That all seems to be a bit of experimental. But then we are already in the slowpath so we could as well do: for (i = 0; i < 10; i++) if (queue_spin_trylock(lock)) goto release; And would have the same effect. /* -* we won the trylock; forget about queueing. +* we already t
Re: [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS
Il 15/06/2014 14:47, Peter Zijlstra ha scritto: - for (;;) { - new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - - old = atomic_cmpxchg(&lock->val, val, new); - if (old == val) - break; - - val = old; - } + clear_pending_set_locked(lock, val); return; Might as well add clear_pending_set_locked already in patch 3. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
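Concretely, the helper being asked for could simply wrap the cmpxchg loop that the hunk above deletes; a sketch of the generic variant only (the point of the NR_CPUS patch is that small configurations can do the same (0,1,0) -> (0,0,1) transition with a single halfword store instead):

/*
 * Take ownership of the lock for the current pending-bit owner,
 * i.e. clear the pending bit and set the locked bit.
 */
static void clear_pending_set_locked(struct qspinlock *lock, u32 val)
{
	u32 old, new;

	for (;;) {
		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;

		old = atomic_cmpxchg(&lock->val, val, new);
		if (old == val)
			break;

		val = old;
	}
}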
Re: [PATCH 06/11] qspinlock: Optimize pending bit
Il 15/06/2014 14:47, Peter Zijlstra ha scritto: XXX: merge into the pending bit patch.. Agree, or if not move it right after the pending bit patch, before the NR_CPUS optimization. Paolo It is possible to observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Signed-off-by: Peter Zijlstra
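The optimization being discussed amounts to a short spin in the slowpath before deciding to queue; a sketch based only on the description above, not a copy of the final code:

	/*
	 * If we see pending set but locked clear, the pending owner is about
	 * to take the lock: wait for the (0,1,0) -> (0,0,1) transition
	 * instead of queueing.
	 */
	if (val == _Q_PENDING_VAL) {
		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
			cpu_relax();
	}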
Re: [PATCH 10/11] qspinlock: Paravirt support
Il 17/06/2014 00:08, Waiman Long ha scritto: +void __pv_queue_unlock(struct qspinlock *lock) +{ + int val = atomic_read(&lock->val); + + native_queue_unlock(lock); + + if (val & _Q_LOCKED_SLOW) + ___pv_kick_head(lock); +} + Again a race can happen here between the reading and writing of the lock value. I can't think of a good way to do that without using cmpxchg. Could you just use xchg on the locked byte? Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
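Spelled out, the xchg suggestion would make the read and the clear of the locked byte a single atomic operation, so the slow-path hint cannot be lost in between; a sketch, assuming _Q_LOCKED_SLOW lives in the least-significant (locked) byte of lock->val as in this patch:

void __pv_queue_unlock(struct qspinlock *lock)
{
	/* Atomically clear the locked byte and fetch its old contents. */
	u8 locked = xchg((u8 *)&lock->val, 0);

	if (locked & _Q_LOCKED_SLOW)
		___pv_kick_head(lock);
}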
Re: [PATCH 10/11] qspinlock: Paravirt support
Il 15/06/2014 14:47, Peter Zijlstra ha scritto: #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) -#definequeue_spin_unlock queue_spin_unlock /** * queue_spin_unlock - release a queue spinlock * @lock : Pointer to queue spinlock structure * * An effective smp_store_release() on the least-significant byte. */ -static inline void queue_spin_unlock(struct qspinlock *lock) +static inline void native_queue_unlock(struct qspinlock *lock) { barrier(); ACCESS_ONCE(*(u8 *)lock) = 0; } +#else + +static inline void native_queue_unlock(struct qspinlock *lock) +{ + atomic_dec(&lock->val); +} + #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */ Should be (part of) an earlier patch? Also, does it get wrong if (CONFIG_X86_OOSTORE || CONFIG_X86_PPRO_FENCE) && paravirt patches the unlock to a single movb? Of course the paravirt spinlocks could simply depend on !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE. + +#define INVALID_HEAD -1 +#define NO_HEADnr_cpu_ids + -2, like Waiman said. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Nested paging in nested SVM setup
On 2014-06-18 13:36, Valentine Sinitsyn wrote: > Hi all, > > I'm using a KVM/Qemu nested SVM setup to debug another hypervisor > (Jailhouse) I contribute to. IOW, the scheme is: AMD64 Linux host > running [paravirtualized] AMD64 Linux guest (the same kernel as the > host) running Jailhouse. > > Jailhouse, in turn, uses Nested Paging to virtualize xAPIC: APIC page > (0xfee0, no APIC remapping) is mapped read-only into Jailhouse's > guests. This of course implies that APIC page appears to Jailhouse > guests as uncacheable (UC). > > Is it achievable in the setup I described, or do I need to run my code > on a real hardware to make the APIC page accesses in Jailhouse guests > uncacheable? If we want to provide useful nested SVM support, this must be feasible. If there is a bug, it has to be fixed. Maybe you can describe how you configured the involved units (NPT structures, guest / host PAR, MTRR etc.). Even better would be a test case based on kvm-unit-tests (see [1], x86/svm.c) that replicates the observed behavior. If it reveals a bug, this test would be very valuable for making sure it remains fixed (once that is done). Jan [1] https://git.kernel.org/cgit/virt/kvm/kvm-unit-tests.git -- Siemens AG, Corporate Technology, CT RTC ITP SES-DE Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 03/11] qspinlock: Add pending bit
On Wed, Jun 18, 2014 at 01:29:48PM +0200, Paolo Bonzini wrote: > Il 17/06/2014 22:36, Konrad Rzeszutek Wilk ha scritto: > >+/* One more attempt - but if we fail mark it as pending. */ > >+if (val == _Q_LOCKED_VAL) { > >+new = Q_LOCKED_VAL |_Q_PENDING_VAL; > >+ > >+old = atomic_cmpxchg(&lock->val, val, new); > >+if (old == _Q_LOCKED_VAL) /* YEEY! */ > >+return; > >+val = old; > >+} > > Note that Peter's code is in a for(;;) loop: > > > + for (;;) { > + /* > + * If we observe any contention; queue. > + */ > + if (val & ~_Q_LOCKED_MASK) > + goto queue; > + > + new = _Q_LOCKED_VAL; > + if (val == new) > + new |= _Q_PENDING_VAL; > + > + old = atomic_cmpxchg(&lock->val, val, new); > + if (old == val) > + break; > + > + val = old; > + } > + > + /* > + * we won the trylock > + */ > + if (new == _Q_LOCKED_VAL) > + return; > > So what you'd have is basically: > > /* >* One more attempt if no one is already in queue. Perhaps >* they have unlocked the spinlock already. >*/ > if (val == _Q_LOCKED_VAL && atomic_read(&lock->val) == 0) { > old = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL); > if (old == 0) /* YEEY! */ > return; > val = old; > } > > But I agree with Waiman that this is unlikely to trigger often enough. It > does have to be handled in the slowpath for correctness, but the most likely > path is (0,0,1)->(0,1,1). > > Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote: > Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto: > >On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote: > >>From: Waiman Long > >> > >>This patch extracts the logic for the exchange of new and previous tail > >>code words into a new xchg_tail() function which can be optimized in a > >>later patch. > > > >And also adds a third try on acquiring the lock. That I think should > >be a seperate patch. > > It doesn't really add a new try, the old code is: > > > - for (;;) { > - new = _Q_LOCKED_VAL; > - if (val) > - new = tail | (val & _Q_LOCKED_PENDING_MASK); > - > - old = atomic_cmpxchg(&lock->val, val, new); > - if (old == val) > - break; > - > - val = old; > - } > > /* > - * we won the trylock; forget about queueing. >*/ > - if (new == _Q_LOCKED_VAL) > - goto release; > > The trylock happens if the "if (val)" hits the else branch. > > What the patch does is change it from attempting two transition with a > single cmpxchg: > > - * 0,0,0 -> 0,0,1 ; trylock > - * p,y,x -> n,y,x ; prev = xchg(lock, node) > > to first doing the trylock, then the xchg. If the trylock passes and the > xchg returns prev=0,0,0, the next step of the algorithm goes to the > locked/uncontended state > > + /* > + * claim the lock: > + * > + * n,0 -> 0,1 : lock, uncontended > > Similar to your suggestion of patch 3, it's expected that the xchg will > *not* return prev=0,0,0 after a failed trylock. I do like your explanation. I hope that Peter will put it in the description as it explains the change quite well. > > However, I *do* agree with you that it's simpler to just squash this patch > into 01/11. Uh, did I say that? Oh I said why don't make it right the first time! I meant in terms of seperating the slowpath (aka the bytelock on the pending bit) from the queue (MCS code). Or renaming the function to be called 'complex' instead of 'slowpath' as it is getting quite hairy. The #1 patch is nice by itself - as it lays out the foundation of the MCS-similar code - and if Ingo decides he does not want this pending byte-lock bit business - it can be easily reverted or dropped. In terms of squashing this in #1 - I would advocate against that. Thanks! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: Increase the number of variable MTRR regs to 10
Recent Intel CPUs have 10 variable range MTRRs. Since operating systems sometimes make assumptions about CPUs while ignoring capability MSRs, it is better for KVM to be consistent with recent CPUs. Reporting more MTRRs than actually supported has no functional implications. Signed-off-by: Nadav Amit --- arch/x86/include/asm/kvm_host.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 4931415..0bab29d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -95,7 +95,7 @@ static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level) #define KVM_REFILL_PAGES 25 #define KVM_MAX_CPUID_ENTRIES 80 #define KVM_NR_FIXED_MTRR_REGION 88 -#define KVM_NR_VAR_MTRR 8 +#define KVM_NR_VAR_MTRR 10 #define ASYNC_PF_PER_VCPU 64 -- 1.9.1
[PATCH 2/3] KVM: x86: Emulator support for #UD on CPL>0
Certain instructions (e.g., mwait and monitor) cause a #UD exception when they are executed at CPL > 0. This is in contrast to regular privileged instructions, which cause #GP. In order not to mess with the SVM interception of mwait and monitor, which assumes privilege level checks take place before interception, a flag has been added. Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index f90194d..ef7a5a0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -163,6 +163,7 @@ #define SrcWrite((u64)1 << 46) /* Write back src operand */ #define NoMod ((u64)1 << 47) /* Mod field is ignored */ #define NoBigReal ((u64)1 << 48) /* No big real mode */ +#define UDOnPriv((u64)1 << 49) /* #UD instead of #GP on CPL > 0 */ #define DstXacc (DstAccLo | SrcAccHi | SrcWrite) @@ -4560,7 +4561,10 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) /* Privileged instruction can be executed only in CPL=0 */ if ((ctxt->d & Priv) && ops->cpl(ctxt)) { - rc = emulate_gp(ctxt, 0); + if (ctxt->d & UDOnPriv) + rc = emulate_ud(ctxt); + else + rc = emulate_gp(ctxt, 0); goto done; } -- 1.9.1
Re: [RFC PATCH 4/6] arm/arm64: KVM: vgic: Improve handling of GICD_I{CS}PENDRn
On 06/14/2014 10:51 PM, Christoffer Dall wrote: > The handling of writes to the GICD_ISPENDRn and GICD_ICPENDRn is > currently not handled correctly for level-triggered interrupts. Hi Christoffer, Thanks for this patch serie. I can confirm it fixes my QEMU/VFIO issue where all IRQs were pending cleared at guest OS boot while IRQ wires were set. Now those IRQs are left pending which is compliant with the GIC spec. You will find few comments/questions below. Best Regards Eric > spec states that for level-triggered interrupts, writes to the > GICD_ISPENDRn activates the output of a flip-flop which is in turn or'ed > with the actual input interrupt signal. Correspondingly, writes to > GICD_ICPENDRn simply deactives the output of that flip-flop, but does deactivates > not (of course) affect the external input signal. Reads from GICC_IAR > will also deactivate the flip-flop output. > > This requires us to track the state of the level-input separately from > the state in the flip-flop. Introduce two new variables on the > distributor struct to track these two exact states. Astute readers > may notice that this is introducing more state than required (because an > OR of the two states give you the pending state), but the remainding remaining > vgic code uses the pending bitmap for optimized operations to figure > out, at the end of the day, if an interrupt is pending or not on the > distributor side. Changing all the to consider the two state variables sentence > did not look pretty. > > Signed-off-by: Christoffer Dall > --- > include/kvm/arm_vgic.h | 16 ++- > virt/kvm/arm/vgic.c| 118 > - > 2 files changed, 122 insertions(+), 12 deletions(-) > > diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h > index 10fa64b..f678d5c 100644 > --- a/include/kvm/arm_vgic.h > +++ b/include/kvm/arm_vgic.h > @@ -86,9 +86,23 @@ struct vgic_dist { > /* Interrupt enabled (one bit per IRQ) */ > struct vgic_bitmap irq_enabled; > > - /* Interrupt state is pending on the distributor */ > + /* Level-triggered interrupt external input is asserted */ > + struct vgic_bitmap irq_level; was a bit confused by the name. I now understand this is the wire (interrupt signal to GIC), relevant in case of level-sensitive IRQ. ~ wire_level. > + > + /* > + * Interrupt state is pending on the distributor > + */ > struct vgic_bitmap irq_pending; ~ status_includes_pending > > + /* > + * Tracks writes to GICD_ISPENDRn and GICD_ICPENDRn for level-triggered > + * interrupts. Essentially holds the state of the flip-flop in > + * Figure 4-10 on page 4-101 in ARM IHI 0048B.b. > + * Once set, it is only cleared for level-triggered interrupts on > + * guest ACKs (when we queue it) or writes to GICD_ICPENDRn. > + */ > + struct vgic_bitmap irq_soft_pend; > + > /* Level-triggered interrupt queued on VCPU interface */ > struct vgic_bitmap irq_queued; > > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c > index 87c977c..0b41875 100644 > --- a/virt/kvm/arm/vgic.c > +++ b/virt/kvm/arm/vgic.c > @@ -67,6 +67,11 @@ > * - When the interrupt is EOIed, the maintenance interrupt fires, > * and clears the corresponding bit in irq_queued. This allow the > * interrupt line to be sampled again. > + * - Note that level-triggered interrupts can also be set to pending from > + * writes to GICD_ISPENDRn and lowering the external input line does not > + * cause the interrupt to become inactive in such a situation. > + * Conversely, writes to GICD_ICPENDRn do not cause the interrupt to become > + * inactive as long as the external input line is held high. 
> */ > > #define VGIC_ADDR_UNDEF (-1) > @@ -200,6 +205,41 @@ static void vgic_irq_clear_queued(struct kvm_vcpu *vcpu, > int irq) > vgic_bitmap_set_irq_val(&dist->irq_queued, vcpu->vcpu_id, irq, 0); > } > > +static int vgic_dist_irq_get_level(struct kvm_vcpu *vcpu, int irq) > +{ > + struct vgic_dist *dist = &vcpu->kvm->arch.vgic; > + > + return vgic_bitmap_get_irq_val(&dist->irq_level, vcpu->vcpu_id, irq); > +} > + > +static void vgic_dist_irq_set_level(struct kvm_vcpu *vcpu, int irq) > +{ > + struct vgic_dist *dist = &vcpu->kvm->arch.vgic; > + > + vgic_bitmap_set_irq_val(&dist->irq_level, vcpu->vcpu_id, irq, 1); > +} > + > +static void vgic_dist_irq_clear_level(struct kvm_vcpu *vcpu, int irq) > +{ > + struct vgic_dist *dist = &vcpu->kvm->arch.vgic; > + > + vgic_bitmap_set_irq_val(&dist->irq_level, vcpu->vcpu_id, irq, 0); > +} > + > +static int vgic_dist_irq_soft_pend(struct kvm_vcpu *vcpu, int irq) > +{ > + struct vgic_dist *dist = &vcpu->kvm->arch.vgic; > + > + return vgic_bitmap_get_irq_val(&dist->irq_soft_pend, vcpu->vcpu_id, > irq); > +} > + > +static void vgic_dist_irq_clear_soft_pend(struct kvm_vcpu *vcpu, int irq) > +{ > + struct vgic_dist
[PATCH 0/3] Correct monitor-mwait emulation as nop
KVM handles monitor-mwait as nop, but does not check any of the preconditions for the instructions. These instructions may generate all kinds of exceptions (#UD, #PF, #GP, #SS). They can also be executed in real-mode. This patch-set moves the handling of monitor-mwait to the emulator, to allow their execution in either real-mode or protected-mode. It tries to follow the SDM in checking the preconditions and generating the necessary exceptions. Thanks for reviewing the patch. Please try it with OS X to make sure it works properly without generating unnecessary exceptions. Nadav Amit (3): KVM: x86: Emulator flag for instruction with no big real mode KVM: x86: Emulator support for #UD on CPL>0 KVM: x86: correct mwait and monitor emulation arch/x86/kvm/emulate.c | 55 ++ arch/x86/kvm/svm.c | 22 ++-- arch/x86/kvm/vmx.c | 27 ++--- 3 files changed, 64 insertions(+), 40 deletions(-) -- 1.9.1
[PATCH 1/3] KVM: x86: Emulator flag for instruction with no big real mode
Certain instructions, such as monitor and xsave, do not support big real mode and cause a #GP exception if the effective address of any of the accessed bytes is not within [0, 0x]. This patch introduces a flag to mark these instructions, including the necessary checks. Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e4e833d..f90194d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -162,6 +162,7 @@ #define NoWrite ((u64)1 << 45) /* No writeback */ #define SrcWrite((u64)1 << 46) /* Write back src operand */ #define NoMod ((u64)1 << 47) /* Mod field is ignored */ +#define NoBigReal ((u64)1 << 48) /* No big real mode */ #define DstXacc (DstAccLo | SrcAccHi | SrcWrite) @@ -651,7 +652,12 @@ static int __linearize(struct x86_emulate_ctxt *ctxt, if (!fetch && (desc.type & 8) && !(desc.type & 2)) goto bad; lim = desc_limit_scaled(&desc); - if ((desc.type & 8) || !(desc.type & 4)) { + if ((ctxt->mode == X86EMUL_MODE_REAL) && !fetch && + (ctxt->d & NoBigReal)) { + /* la is between zero and 0x */ + if (la > 0x || (u32)(la + size - 1) > 0x) + goto bad; + } else if ((desc.type & 8) || !(desc.type & 4)) { /* expand-up segment */ if (addr.ea > lim || (u32)(addr.ea + size - 1) > lim) goto bad; -- 1.9.1
[PATCH 3/3] KVM: x86: correct mwait and monitor emulation
mwait and monitor are currently handled as nop. Considering this behavior, they should still be handled correctly, i.e., check execution conditions and generate exceptions when required. mwait and monitor may also be executed in real-mode and are not handled in that case. This patch performs the emulation of monitor-mwait according to Intel SDM (other than checking whether interrupt can be used as a break event). Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 41 +++-- arch/x86/kvm/svm.c | 22 ++ arch/x86/kvm/vmx.c | 27 +++ 3 files changed, 52 insertions(+), 38 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index ef7a5a0..424b58d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } +static int em_monitor(struct x86_emulate_ctxt *ctxt) +{ + int rc; + struct segmented_address addr; + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); + u64 rax = reg_read(ctxt, VCPU_REGS_RAX); + u8 byte; + + if (ctxt->mode != X86EMUL_MODE_PROT64) + rcx = (u32)rcx; + if (rcx != 0) + return emulate_gp(ctxt, 0); + + addr.seg = seg_override(ctxt); + addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax; + rc = segmented_read(ctxt, addr, &byte, 1); + if (rc != X86EMUL_CONTINUE) + return rc; + + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return X86EMUL_CONTINUE; +} + +static int em_mwait(struct x86_emulate_ctxt *ctxt) +{ + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); + + if (ctxt->mode != X86EMUL_MODE_PROT64) + rcx = (u32)rcx; + if ((rcx & ~1UL) != 0) + return emulate_gp(ctxt, 0); + + /* Accepting interrupt as break event regardless to cpuid */ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return X86EMUL_CONTINUE; +} + static bool valid_cr(int nr) { switch (nr) { @@ -3557,8 +3594,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e) static const struct opcode group7_rm1[] = { - DI(SrcNone | Priv, monitor), - DI(SrcNone | Priv, mwait), + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_monitor, monitor), + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_mwait, mwait), N, N, N, N, N, N, }; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ec8366c..a524e04 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3274,24 +3274,6 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } -static int nop_interception(struct vcpu_svm *svm) -{ - skip_emulated_instruction(&(svm->vcpu)); - return 1; -} - -static int monitor_interception(struct vcpu_svm *svm) -{ - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); - return nop_interception(svm); -} - -static int mwait_interception(struct vcpu_svm *svm) -{ - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); - return nop_interception(svm); -} - static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3349,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = monitor_interception, - [SVM_EXIT_MWAIT]= mwait_interception, + [SVM_EXIT_MONITOR] = emulate_on_interception, + [SVM_EXIT_MWAIT]= emulate_on_interception, [SVM_EXIT_XSETBV] = xsetbv_interception, [SVM_EXIT_NPF] = pf_interception, }; diff --git a/arch/x86/kvm/vmx.c 
b/arch/x86/kvm/vmx.c index 801332e..7023e71 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5672,22 +5672,17 @@ static int handle_pause(struct kvm_vcpu *vcpu) return 1; } -static int handle_nop(struct kvm_vcpu *vcpu) +static int handle_emulate(struct kvm_vcpu *vcpu) { - skip_emulated_instruction(vcpu); - return 1; -} + int err = emulate_instruction(vcpu, 0); -static int handle_mwait(struct kvm_vcpu *vcpu) -{ - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); - return handle_nop(vcpu); -} - -static int handle_monitor(struct kvm_vcpu *vcpu) -{ - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); - return handle
[PATCH v2 9/9] KVM: vmx: vmx instructions handling does not consider cs.l
VMX instructions use 32-bit operands in 32-bit mode, and 64-bit operands in 64-bit mode. The current implementation is broken since it does not use the register operands correctly, and always uses 64-bit for reads and writes. Moreover, write to memory in vmwrite only considers long-mode, so it ignores cs.l. This patch fixes this behavior. The field of vmread/vmwrite is kept intentionally as 64-bit read since if bits [63:32] are not cleared the instruction should fail, according to Intel SDM. Signed-off-by: Nadav Amit --- arch/x86/kvm/vmx.c | 8 arch/x86/kvm/x86.h | 9 + 2 files changed, 13 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index cbfbb8b..75dc888 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -6397,7 +6397,7 @@ static int handle_vmread(struct kvm_vcpu *vcpu) * on the guest's mode (32 or 64 bit), not on the given field's length. */ if (vmx_instruction_info & (1u << 10)) { - kvm_register_write(vcpu, (((vmx_instruction_info) >> 3) & 0xf), + kvm_register_writel(vcpu, (((vmx_instruction_info) >> 3) & 0xf), field_value); } else { if (get_vmx_mem_address(vcpu, exit_qualification, @@ -6434,14 +6434,14 @@ static int handle_vmwrite(struct kvm_vcpu *vcpu) return 1; if (vmx_instruction_info & (1u << 10)) - field_value = kvm_register_read(vcpu, + field_value = kvm_register_readl(vcpu, (((vmx_instruction_info) >> 3) & 0xf)); else { if (get_vmx_mem_address(vcpu, exit_qualification, vmx_instruction_info, &gva)) return 1; if (kvm_read_guest_virt(&vcpu->arch.emulate_ctxt, gva, - &field_value, (is_long_mode(vcpu) ? 8 : 4), &e)) { + &field_value, (is_64_bit_mode(vcpu) ? 8 : 4), &e)) { kvm_inject_page_fault(vcpu, &e); return 1; } @@ -6571,7 +6571,7 @@ static int handle_invept(struct kvm_vcpu *vcpu) } vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO); - type = kvm_register_read(vcpu, (vmx_instruction_info >> 28) & 0xf); + type = kvm_register_readl(vcpu, (vmx_instruction_info >> 28) & 0xf); types = (nested_vmx_ept_caps >> VMX_EPT_EXTENT_SHIFT) & 6; diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index c5b61a7..306a1b7 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -126,6 +126,15 @@ static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu, return is_64_bit_mode(vcpu) ? val : (u32)val; } +static inline void kvm_register_writel(struct kvm_vcpu *vcpu, + enum kvm_reg reg, + unsigned long val) +{ + if (!is_64_bit_mode(vcpu)) + val = (u32)val; + return kvm_register_write(vcpu, reg, val); +} + void kvm_before_handle_nmi(struct kvm_vcpu *vcpu); void kvm_after_handle_nmi(struct kvm_vcpu *vcpu); int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/9] KVM: x86: bit-ops emulation ignores offset on 64-bit
The current emulation of bit operations ignores the offset from the destination on 64-bit target memory operands. This patch fixes this behavior. Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e4e833d..f0b0a10 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -1220,12 +1220,14 @@ static void fetch_bit_operand(struct x86_emulate_ctxt *ctxt) long sv = 0, mask; if (ctxt->dst.type == OP_MEM && ctxt->src.type == OP_REG) { - mask = ~(ctxt->dst.bytes * 8 - 1); + mask = ~((long)ctxt->dst.bytes * 8 - 1); if (ctxt->src.bytes == 2) sv = (s16)ctxt->src.val & (s16)mask; else if (ctxt->src.bytes == 4) sv = (s32)ctxt->src.val & (s32)mask; + else + sv = (s64)ctxt->src.val & (s64)mask; ctxt->dst.addr.mem.ea += (sv >> 3); } -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
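A concrete illustration of the semantics being fixed, in the style of the kvm-unit-tests further down this digest (this is a hypothetical check, not one of the posted tests): with a 64-bit operand, the upper bits of the bit offset must be folded into the effective address, so btsq with bit offset 100 has to set bit 36 of the next qword, not a bit in the first one.

static void test_btsq_offset(u64 *mem)	/* mem points to at least 2 qwords */
{
	u64 bit = 100;

	mem[0] = mem[1] = 0;
	asm volatile("btsq %[bit], %[word]"
		     : [word] "+m" (*mem)
		     : [bit] "r" (bit)
		     : "cc", "memory");
	/* 100 = 64 + 36: the qword at mem[1] is addressed and bit 36 set. */
	report("btsq offset", mem[0] == 0 && mem[1] == (1ull << 36));
}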
[PATCH v2 6/9] KVM: x86: check DR6/7 high-bits are clear only on long-mode
From: Nadav Amit When the guest sets DR6 and DR7, KVM asserts the high 32-bits are clear, and otherwise injects a #GP exception. This exception should be injected only if running in long mode. Signed-off-by: Nadav Amit --- arch/x86/kvm/vmx.c | 2 +- arch/x86/kvm/x86.h | 18 ++ 2 files changed, 19 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 801332e..c0f53a0 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5165,7 +5165,7 @@ static int handle_dr(struct kvm_vcpu *vcpu) return 1; kvm_register_write(vcpu, reg, val); } else - if (kvm_set_dr(vcpu, dr, kvm_register_read(vcpu, reg))) + if (kvm_set_dr(vcpu, dr, kvm_register_readl(vcpu, reg))) return 1; skip_emulated_instruction(vcpu); diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index 8c97bac..c5b61a7 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -47,6 +47,16 @@ static inline int is_long_mode(struct kvm_vcpu *vcpu) #endif } +static inline bool is_64_bit_mode(struct kvm_vcpu *vcpu) +{ + int cs_db, cs_l; + + if (!is_long_mode(vcpu)) + return false; + kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); + return cs_l; +} + static inline bool mmu_is_nested(struct kvm_vcpu *vcpu) { return vcpu->arch.walk_mmu == &vcpu->arch.nested_mmu; @@ -108,6 +118,14 @@ static inline bool vcpu_match_mmio_gpa(struct kvm_vcpu *vcpu, gpa_t gpa) return false; } +static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu, + enum kvm_reg reg) +{ + unsigned long val = kvm_register_read(vcpu, reg); + + return is_64_bit_mode(vcpu) ? val : (u32)val; +} + void kvm_before_handle_nmi(struct kvm_vcpu *vcpu); void kvm_after_handle_nmi(struct kvm_vcpu *vcpu); int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip); -- 1.9.1
[PATCH v2 7/9] KVM: x86: Hypercall handling does not consider opsize correctly
Currently, the hypercall handling routine only considers LME as an indication of whether the guest uses 32/64-bit mode. This is inconsistent with the handling of hyperv hypercalls and with the common-sense approach of considering cs.l as well. This patch uses is_64_bit_mode instead of is_long_mode for that purpose. In addition, the result is masked with respect to the guest execution mode. Last, it changes kvm_hv_hypercall to use is_64_bit_mode as well to simplify the code. Signed-off-by: Nadav Amit --- arch/x86/kvm/x86.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index f32a025..c6dfe47 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5662,7 +5662,6 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) u64 param, ingpa, outgpa, ret; uint16_t code, rep_idx, rep_cnt, res = HV_STATUS_SUCCESS, rep_done = 0; bool fast, longmode; - int cs_db, cs_l; /* * hypercall generates UD from non zero cpl and real mode @@ -5673,8 +5672,7 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) return 0; } - kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l); - longmode = is_long_mode(vcpu) && cs_l == 1; + longmode = is_64_bit_mode(vcpu); if (!longmode) { param = ((u64)kvm_register_read(vcpu, VCPU_REGS_RDX) << 32) | @@ -5739,7 +5737,7 @@ static void kvm_pv_kick_cpu_op(struct kvm *kvm, unsigned long flags, int apicid) int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) { unsigned long nr, a0, a1, a2, a3, ret; - int r = 1; + int op_64_bit, r = 1; if (kvm_hv_hypercall_enabled(vcpu->kvm)) return kvm_hv_hypercall(vcpu); @@ -5752,7 +5750,8 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) trace_kvm_hypercall(nr, a0, a1, a2, a3); - if (!is_long_mode(vcpu)) { + op_64_bit = is_64_bit_mode(vcpu); + if (!op_64_bit) { nr &= 0x; a0 &= 0x; a1 &= 0x; @@ -5778,6 +5777,8 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) break; } out: + if (!op_64_bit) + ret = (u32)ret; kvm_register_write(vcpu, VCPU_REGS_RAX, ret); ++vcpu->stat.hypercalls; return r; -- 1.9.1
[PATCH v2 5/9] KVM: x86: NOP emulation incorrectly clears the high 32 bits of RAX
In long mode, the current NOP (0x90) emulation still writes back to RAX. As a result, EAX is zero-extended and the high 32 bits of RAX are cleared. Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index b354531..eb93eb4 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -4663,8 +4663,9 @@ special_insn: break; case 0x90 ... 0x97: /* nop / xchg reg, rax */ if (ctxt->dst.addr.reg == reg_rmw(ctxt, VCPU_REGS_RAX)) - break; - rc = em_xchg(ctxt); + ctxt->dst.type = OP_NONE; + else + rc = em_xchg(ctxt); break; case 0x98: /* cbw/cwde/cdqe */ switch (ctxt->op_bytes) { -- 1.9.1
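The symptom is easy to state as a guest-side check in the style of the kvm-unit-tests below; this is a hypothetical sketch, not the posted "Testing nop instruction on 64-bit" patch (which is not included in this digest), and a real test would have to force the instruction through the emulator, e.g. via the insn_page trick used by test_nop() in x86/emulator.c.

static void test_nop_rax(void)
{
	u64 val;

	asm volatile("movq $0x1234567812345678, %%rax\n\t"
		     ".byte 0x90\n\t"		/* plain nop */
		     "movq %%rax, %[val]\n\t"
		     : [val] "=r" (val) : : "rax");
	/* A 64-bit-mode NOP must leave the upper 32 bits of RAX intact. */
	report("nop keeps RAX high bits", val == 0x1234567812345678ul);
}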
[PATCH kvm-unit-tests 0/5] x86: Tests for recent emulator bugs
This patch-set checks recent emulator bugs as well as monitor-mwait emulation in real-mode. Nadav Amit (5): x86: Testing nop instruction on 64-bit x86: test xadd with two identical operands x86: Test btcq with operand larger than 64 x86: check cmov instruction on 64-bit x86: Test monitor and mwait on real-mode x86/emulator.c | 30 -- x86/realmode.c | 20 2 files changed, 48 insertions(+), 2 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 4/5] x86: check cmov instruction on 64-bit
cmov instruction on 64-bit with dword destination register operand should clear bits [63:32]. This test checks this behavior due to previous KVM bug. Signed-off-by: Nadav Amit --- x86/emulator.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/x86/emulator.c b/x86/emulator.c index 460949f..1fd0ca6 100644 --- a/x86/emulator.c +++ b/x86/emulator.c @@ -710,6 +710,18 @@ static void test_shld_shrd(u32 *mem) report("shrd (cl)", *mem == ((0x12345678 >> 3) | (5u << 29))); } +static void test_cmov(u32 *mem) +{ + u64 val; + *mem = 0xabcdef12u; + asm ("movq $0x1234567812345678, %%rax\n\t" +"cmpl %%eax, %%eax\n\t" +"cmovnel (%[mem]), %%eax\n\t" +"movq %%rax, %[val]\n\t" +: [val]"=r"(val) : [mem]"r"(mem) : "%rax", "cc"); + report("cmovnel", val == 0x12345678ul); +} + #define INSN_XCHG_ALL \ "xchg %rax, 0+save \n\t"\ "xchg %rbx, 8+save \n\t"\ @@ -1054,12 +1066,12 @@ int main() test_sreg(mem); test_lldt(mem); test_ltr(mem); + test_cmov(mem); test_mmx_movq_mf(mem, insn_page, alt_insn_page, insn_ram); test_movabs(mem, insn_page, alt_insn_page, insn_ram); test_smsw_reg(mem, insn_page, alt_insn_page, insn_ram); test_nop(mem, insn_page, alt_insn_page, insn_ram); - test_crosspage_mmio(mem); test_string_io_mmio(mem); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 5/5] x86: Test monitor and mwait on real-mode
monitor and mwait are now considered to behave as nop. New patch enables monitor and mwait in realmode as well. This test checks whether they are handled as nop in realmode as well. Signed-off-by: Nadav Amit --- x86/realmode.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/x86/realmode.c b/x86/realmode.c index 10c3e03..0a62b5d 100644 --- a/x86/realmode.c +++ b/x86/realmode.c @@ -1671,6 +1671,16 @@ void test_xadd(void) report("xadd", R_AX, outregs.eax == inregs.eax * 2); } +void test_monitor_mwait(void) +{ + MK_INSN(monitor, "monitor\n\t" +"mwait\n\t"); + inregs.ecx = 0; + inregs.eax = 0; + exec_in_big_real_mode(&insn_monitor); + report("monitor", 0, 1); +} + void realmode_start(void) { @@ -1721,6 +1731,7 @@ void realmode_start(void) test_smsw(); test_nopl(); test_xadd(); + test_monitor_mwait(); test_perf_loop(); test_perf_mov(); test_perf_arith(); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 8/9] KVM: vmx: handle_cr ignores 32/64-bit mode
On 32-bit mode only bits [31:0] of the CR should be used for setting the CR value. Otherwise, the host may incorrectly assume the value is invalid if bits [63:32] are not zero. Moreover, the CR is currently being read twice when CR8 is used. Last, nested mov-cr exiting is modified to handle the CR value correctly as well. Signed-off-by: Nadav Amit --- arch/x86/kvm/vmx.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c0f53a0..cbfbb8b 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5039,7 +5039,7 @@ static int handle_cr(struct kvm_vcpu *vcpu) reg = (exit_qualification >> 8) & 15; switch ((exit_qualification >> 4) & 3) { case 0: /* mov to cr */ - val = kvm_register_read(vcpu, reg); + val = kvm_register_readl(vcpu, reg); trace_kvm_cr_write(cr, val); switch (cr) { case 0: @@ -5056,7 +5056,7 @@ static int handle_cr(struct kvm_vcpu *vcpu) return 1; case 8: { u8 cr8_prev = kvm_get_cr8(vcpu); - u8 cr8 = kvm_register_read(vcpu, reg); + u8 cr8 = (u8)val; err = kvm_set_cr8(vcpu, cr8); kvm_complete_insn_gp(vcpu, err); if (irqchip_in_kernel(vcpu->kvm)) @@ -6751,7 +6751,7 @@ static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu, unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); int cr = exit_qualification & 15; int reg = (exit_qualification >> 8) & 15; - unsigned long val = kvm_register_read(vcpu, reg); + unsigned long val = kvm_register_readl(vcpu, reg); switch ((exit_qualification >> 4) & 3) { case 0: /* mov to cr */ -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
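kvm_register_readl() is added by an earlier patch in this series (arch/x86/kvm/x86.h) and does not appear in the diff above. Going by the cover letter's description, it simply masks the value to 32 bits unless the vcpu is in 64-bit mode; a sketch, not the verbatim helper:

static inline unsigned long kvm_register_readl(struct kvm_vcpu *vcpu,
					       enum kvm_reg reg)
{
	unsigned long val = kvm_register_read(vcpu, reg);

	/* outside 64-bit mode only bits [31:0] of the register are used */
	return is_64_bit_mode(vcpu) ? val : (u32)val;
}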
[PATCH v2 3/9] KVM: x86: Inter privilege level ret emulation is not implemented
Return an unhandleable error on an inter-privilege level ret instruction, since the current emulation does not check the privilege level correctly when loading CS, and does not pop RSP/SS as needed.

Signed-off-by: Nadav Amit
---
 arch/x86/kvm/emulate.c | 4
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 3c8d867..0183350 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2019,6 +2019,7 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
 {
 	int rc;
 	unsigned long cs;
+	int cpl = ctxt->ops->cpl(ctxt);
 
 	rc = emulate_pop(ctxt, &ctxt->_eip, ctxt->op_bytes);
 	if (rc != X86EMUL_CONTINUE)
@@ -2028,6 +2029,9 @@ static int em_ret_far(struct x86_emulate_ctxt *ctxt)
 	rc = emulate_pop(ctxt, &cs, ctxt->op_bytes);
 	if (rc != X86EMUL_CONTINUE)
 		return rc;
+	/* Outer-privilage level return is not implemented */
+	if (ctxt->mode >= X86EMUL_MODE_PROT16 && (cs & 3) > cpl)
+		return X86EMUL_UNHANDLEABLE;
 	rc = load_segment_descriptor(ctxt, (u16)cs, VCPU_SREG_CS);
 	return rc;
 }
--
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/9] KVM: x86: Wrong emulation on 'xadd X, X'
The emulator does not emulate the xadd instruction correctly if the two operands are the same. In this (unlikely) situation the result should be the sum of X and X (2X), whereas it is currently X. The solution is to perform the writeback to the source before writing to the destination. The only instruction which should be affected is xadd, as the other instructions that perform writeback to the source use the extended accumulator (e.g., RAX:RDX).

Signed-off-by: Nadav Amit
---
 arch/x86/kvm/emulate.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f0b0a10..3c8d867 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4711,17 +4711,17 @@ special_insn:
 		goto done;
 
 writeback:
-	if (!(ctxt->d & NoWrite)) {
-		rc = writeback(ctxt, &ctxt->dst);
-		if (rc != X86EMUL_CONTINUE)
-			goto done;
-	}
 	if (ctxt->d & SrcWrite) {
 		BUG_ON(ctxt->src.type == OP_MEM || ctxt->src.type == OP_MEM_STR);
 		rc = writeback(ctxt, &ctxt->src);
 		if (rc != X86EMUL_CONTINUE)
 			goto done;
 	}
+	if (!(ctxt->d & NoWrite)) {
+		rc = writeback(ctxt, &ctxt->dst);
+		if (rc != X86EMUL_CONTINUE)
+			goto done;
+	}
 
 	/*
 	 * restore dst type in case the decoding will be reused
--
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
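A stand-alone user-space check of the native behaviour the fix has to reproduce (assumes gcc on x86-64; illustrative only): when both xadd operands are the same register, the register ends up holding twice its old value.

#include <stdio.h>

int main(void)
{
	unsigned int eax = 0x12345678;

	/* xadd: tmp = src + dst; src = old dst; dst = tmp.
	 * With both operands being eax, eax ends up as 2 * its old value. */
	asm volatile("xaddl %%eax, %%eax" : "+a"(eax) : : "cc");
	printf("xadd eax, eax -> %#x (expected %#x)\n", eax, 0x12345678u * 2);

	return 0;
}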
Re: [RFC PATCH 1/6] arm/arm64: KVM: Rename irq_state to irq_pending
On 06/14/2014 10:51 PM, Christoffer Dall wrote: > The irq_state field on the distributor struct is ambiguous in its > meaning; the comment says it's the level of the input put, but that > doesn't make much sense for edge-triggered interrupts. The code > actually uses this state variable to check if the interrupt is in the > pending state on the distributor so clarify the comment and rename the > actual variable and accessor methods. > > Signed-off-by: Christoffer Dall > --- > include/kvm/arm_vgic.h | 4 ++-- > virt/kvm/arm/vgic.c| 52 > +- > 2 files changed, 28 insertions(+), 28 deletions(-) > > diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h > index f27000f..a5d55d5 100644 > --- a/include/kvm/arm_vgic.h > +++ b/include/kvm/arm_vgic.h > @@ -86,8 +86,8 @@ struct vgic_dist { > /* Interrupt enabled (one bit per IRQ) */ > struct vgic_bitmap irq_enabled; > > - /* Interrupt 'pin' level */ > - struct vgic_bitmap irq_state; > + /* Interrupt state is pending on the distributor */ > + struct vgic_bitmap irq_pending; > > /* Level-triggered interrupt in progress */ > struct vgic_bitmap irq_active; > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c > index 47b2983..aba960a 100644 > --- a/virt/kvm/arm/vgic.c > +++ b/virt/kvm/arm/vgic.c > @@ -37,7 +37,7 @@ > * > * - At any time, the dist->irq_pending_on_cpu is the oracle that knows if > * something is pending > - * - VGIC pending interrupts are stored on the vgic.irq_state vgic > + * - VGIC pending interrupts are stored on the vgic.irq_pending vgic > * bitmap (this bitmap is updated by both user land ioctls and guest > * mmio ops, and other in-kernel peripherals such as the > * arch. timers) and indicate the 'wire' state. Hi Christoffer, Shouldn't we say that vgic.irq_pending is a combination of wire state and soft pending set action(ISPENDR), corresponding to GIC archi spec "status_includes_pending" - at least for level sensitive IRQS - ? Best Regards Eric > @@ -45,8 +45,8 @@ > * recalculated > * - To calculate the oracle, we need info for each cpu from > * compute_pending_for_cpu, which considers: > - * - PPI: dist->irq_state & dist->irq_enable > - * - SPI: dist->irq_state & dist->irq_enable & dist->irq_spi_target > + * - PPI: dist->irq_pending & dist->irq_enable > + * - SPI: dist->irq_pending & dist->irq_enable & dist->irq_spi_target > * - irq_spi_target is a 'formatted' version of the GICD_ICFGR > * registers, stored on each vcpu. 
We only keep one bit of > * information per interrupt, making sure that only one vcpu can > @@ -204,21 +204,21 @@ static int vgic_dist_irq_is_pending(struct kvm_vcpu > *vcpu, int irq) > { > struct vgic_dist *dist = &vcpu->kvm->arch.vgic; > > - return vgic_bitmap_get_irq_val(&dist->irq_state, vcpu->vcpu_id, irq); > + return vgic_bitmap_get_irq_val(&dist->irq_pending, vcpu->vcpu_id, irq); > } > > -static void vgic_dist_irq_set(struct kvm_vcpu *vcpu, int irq) > +static void vgic_dist_irq_set_pending(struct kvm_vcpu *vcpu, int irq) > { > struct vgic_dist *dist = &vcpu->kvm->arch.vgic; > > - vgic_bitmap_set_irq_val(&dist->irq_state, vcpu->vcpu_id, irq, 1); > + vgic_bitmap_set_irq_val(&dist->irq_pending, vcpu->vcpu_id, irq, 1); > } > > -static void vgic_dist_irq_clear(struct kvm_vcpu *vcpu, int irq) > +static void vgic_dist_irq_clear_pending(struct kvm_vcpu *vcpu, int irq) > { > struct vgic_dist *dist = &vcpu->kvm->arch.vgic; > > - vgic_bitmap_set_irq_val(&dist->irq_state, vcpu->vcpu_id, irq, 0); > + vgic_bitmap_set_irq_val(&dist->irq_pending, vcpu->vcpu_id, irq, 0); > } > > static void vgic_cpu_irq_set(struct kvm_vcpu *vcpu, int irq) > @@ -392,7 +392,7 @@ static bool handle_mmio_set_pending_reg(struct kvm_vcpu > *vcpu, > struct kvm_exit_mmio *mmio, > phys_addr_t offset) > { > - u32 *reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_state, > + u32 *reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_pending, > vcpu->vcpu_id, offset); > vgic_reg_access(mmio, reg, offset, > ACCESS_READ_VALUE | ACCESS_WRITE_SETBIT); > @@ -408,7 +408,7 @@ static bool handle_mmio_clear_pending_reg(struct kvm_vcpu > *vcpu, > struct kvm_exit_mmio *mmio, > phys_addr_t offset) > { > - u32 *reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_state, > + u32 *reg = vgic_bitmap_get_reg(&vcpu->kvm->arch.vgic.irq_pending, > vcpu->vcpu_id, offset); > vgic_reg_access(mmio, reg, offset, > ACCESS_READ_VALUE | ACCESS_WRITE_CLEARBIT); > @@ -650,7 +650,7 @@ static void vgic_unqueue_irqs(struct kvm_vcpu *vcpu) >
[PATCH v2 4/9] KVM: x86: emulation of dword cmov on long-mode should clear [63:32]
Even if the condition of cmov is not satisfied, bits[63:32] should be cleared. This is clearly stated in Intel's CMOVcc documentation. The solution is to reassign the destination onto itself if the condition is unsatisfied. For that matter the original destination value needs to be read. Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 0183350..b354531 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3905,7 +3905,7 @@ static const struct opcode twobyte_table[256] = { N, N, N, N, N, N, N, N, N, N, /* 0x40 - 0x4F */ - X16(D(DstReg | SrcMem | ModRM | Mov)), + X16(D(DstReg | SrcMem | ModRM)), /* 0x50 - 0x5F */ N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, /* 0x60 - 0x6F */ @@ -4799,8 +4799,10 @@ twobyte_insn: ops->get_dr(ctxt, ctxt->modrm_reg, &ctxt->dst.val); break; case 0x40 ... 0x4f: /* cmov */ - ctxt->dst.val = ctxt->dst.orig_val = ctxt->src.val; - if (!test_cc(ctxt->b, ctxt->eflags)) + if (test_cc(ctxt->b, ctxt->eflags)) + ctxt->dst.val = ctxt->src.val; + else if (ctxt->mode != X86EMUL_MODE_PROT64 || +ctxt->op_bytes != 4) ctxt->dst.type = OP_NONE; /* no writeback */ break; case 0x80 ... 0x8f: /* jnz rel, etc*/ -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
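The same architectural rule can be checked natively from user space (gcc on x86-64 assumed; this mirrors what the corresponding kvm-unit-test does under emulation): even when the condition is false, cmov with a 32-bit destination register clears bits [63:32] of the destination.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t rax;
	uint32_t mem = 0xabcdef12u;

	asm volatile("movq $0x1234567812345678, %%rax\n\t"
		     "cmpl %%eax, %%eax\n\t"    /* ZF=1, so the "ne" condition is false */
		     "cmovnel %[m], %%eax\n\t"  /* no move, yet RAX[63:32] is cleared */
		     "movq %%rax, %[out]\n\t"
		     : [out] "=r"(rax)
		     : [m] "m"(mem)
		     : "rax", "cc");

	printf("rax = %016llx (low half kept, high half cleared)\n",
	       (unsigned long long)rax);

	return 0;
}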
[PATCH v2 0/9] KVM: x86: More emulator bugs
This patch-set resolves several emulator bugs. Each fix is independent of the others. The DR6 bug can occur during a DR-access exit (regardless of unrestricted mode, MMIO and SPT).

Changes in v2:
Introduced kvm_register_readl and kvm_register_writel which consider long-mode and cs.l when reading the registers. Fixed the register reads to respect 32/64-bit mode in hypercall handling, CR exit handling and VMX instruction handling.

Thanks for re-reviewing the patch.

Nadav Amit (9):
  KVM: x86: bit-ops emulation ignores offset on 64-bit
  KVM: x86: Wrong emulation on 'xadd X, X'
  KVM: x86: Inter privilege level ret emulation is not implemented
  KVM: x86: emulation of dword cmov on long-mode should clear [63:32]
  KVM: x86: NOP emulation clears (incorrectly) the high 32-bits of RAX
  KVM: x86: check DR6/7 high-bits are clear only on long-mode
  KVM: x86: Hypercall handling does not consider opsize correctly
  KVM: vmx: handle_cr ignores 32/64-bit mode
  KVM: vmx: vmx instructions handling does not consider cs.l

 arch/x86/kvm/emulate.c | 31 ---
 arch/x86/kvm/vmx.c | 16 
 arch/x86/kvm/x86.c | 11 ++-
 arch/x86/kvm/x86.h | 27 +++
 4 files changed, 61 insertions(+), 24 deletions(-)

--
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 3/5] x86: Test btcq with operand larger than 64
Previously, KVM did not calculate the offset for bit-operations correctly when quad-word operands were used. This test checks btcq when operand is larger than 64 in order to check this scenario. Signed-off-by: Nadav Amit --- x86/emulator.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/x86/emulator.c b/x86/emulator.c index bf8a873..460949f 100644 --- a/x86/emulator.c +++ b/x86/emulator.c @@ -496,7 +496,7 @@ void test_btc(void *mem) { unsigned int *a = mem; - memset(mem, 0, 3 * sizeof(unsigned int)); + memset(mem, 0, 4 * sizeof(unsigned int)); asm ("btcl $32, %0" :: "m"(a[0]) : "memory"); asm ("btcl $1, %0" :: "m"(a[1]) : "memory"); @@ -505,6 +505,10 @@ void test_btc(void *mem) asm ("btcl %1, %0" :: "m"(a[3]), "r"(-1) : "memory"); report("btcl reg, r/m", a[0] == 1 && a[1] == 2 && a[2] == 0x8004); + + asm ("btcq %1, %0" : : "m"(a[2]), "r"(-1l) : "memory"); + report("btcq reg, r/m", a[0] == 1 && a[1] == 0x8002 && + a[2] == 0x8004 && a[3] == 0); } void test_bsfbsr(void *mem) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
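The addressing rule that the test relies on, as a small stand-alone user-space program (gcc on x86-64 assumed; illustrative only): with a memory operand and a register bit offset, btcq treats the offset as a signed bit index relative to the operand's address, so it can reach quadwords below or above the one named in the instruction.

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
	uint64_t a[2];

	memset(a, 0, sizeof(a));

	/* bit offset -1 relative to &a[1] is bit 63 of the quadword below it */
	asm volatile("btcq %[off], %[mem]"
		     : [mem] "+m"(a[1])
		     : [off] "r"(-1L)
		     : "cc", "memory");

	printf("a[0] = %016llx, a[1] = %016llx\n",
	       (unsigned long long)a[0], (unsigned long long)a[1]);

	return 0;
}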
[PATCH kvm-unit-tests 2/5] x86: test xadd with two identical operands
Previously, KVM emulated xadd incorrectly when the source and destination operands were identical. The expected result is that the register would hold the sum (2x) and not the previous value (x). This test checks this behavior. It should be executed with a disabled unrestricted mode. Signed-off-by: Nadav Amit --- x86/realmode.c | 9 + 1 file changed, 9 insertions(+) diff --git a/x86/realmode.c b/x86/realmode.c index dc4a1d3..10c3e03 100644 --- a/x86/realmode.c +++ b/x86/realmode.c @@ -1663,6 +1663,14 @@ void test_smsw(void) report("smsw", R_AX | R_BX | R_CX, outregs.eax == outregs.ebx); } +void test_xadd(void) +{ + MK_INSN(xadd, "xaddl %eax, %eax\n\t"); + inregs.eax = 0x12345678; + exec_in_big_real_mode(&insn_xadd); + report("xadd", R_AX, outregs.eax == inregs.eax * 2); +} + void realmode_start(void) { @@ -1712,6 +1720,7 @@ void realmode_start(void) test_dr_mod(); test_smsw(); test_nopl(); + test_xadd(); test_perf_loop(); test_perf_mov(); test_perf_arith(); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 6/6] arm/arm64: KVM: vgic: Clarify and correct vgic documentation
On 06/14/2014 10:51 PM, Christoffer Dall wrote: > The VGIC virtual distributor implementation documentation was written a > very long time ago, before the true nature of the beast had been > partially absorbed into my bloodstream. I think this amalgamates the > two evil beings (myself and the code) a little more. > > Plus, it fixes an actual bug. ICFRn, pfff. > > Signed-off-by: Christoffer Dall > --- > virt/kvm/arm/vgic.c | 13 +++-- > 1 file changed, 7 insertions(+), 6 deletions(-) > > diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c > index 1f91b3b..cc776af 100644 > --- a/virt/kvm/arm/vgic.c > +++ b/virt/kvm/arm/vgic.c > @@ -36,21 +36,22 @@ > * How the whole thing works (courtesy of Christoffer Dall): > * > * - At any time, the dist->irq_pending_on_cpu is the oracle that knows if > - * something is pending > - * - VGIC pending interrupts are stored on the vgic.irq_pending vgic > - * bitmap (this bitmap is updated by both user land ioctls and guest > - * mmio ops, and other in-kernel peripherals such as the > - * arch. timers) and indicate the 'wire' state. > + * something is pending on the CPU interface. > + * - Interrupts that are pending on the distributor are stored on the > + * vgic.irq_pending vgic bitmap (this bitmap is updated by both user land > + * ioctls and guest mmio ops, and other in-kernel peripherals such as the > + * arch. timers). ok forget my previous comment related to wire state;-) > * - Every time the bitmap changes, the irq_pending_on_cpu oracle is > * recalculated > * - To calculate the oracle, we need info for each cpu from > * compute_pending_for_cpu, which considers: > * - PPI: dist->irq_pending & dist->irq_enable > * - SPI: dist->irq_pending & dist->irq_enable & dist->irq_spi_target > - * - irq_spi_target is a 'formatted' version of the GICD_ICFGR > + * - irq_spi_target is a 'formatted' version of the GICD_ITARGETSRn > * registers, stored on each vcpu. We only keep one bit of > * information per interrupt, making sure that only one vcpu can > * accept the interrupt. > + * - If any of the above state changes, we must recalculate the oracle. > * - The same is true when injecting an interrupt, except that we only > * consider a single interrupt at a time. The irq_spi_cpu array > * contains the target CPU for each SPI. > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Why I advise against using ivshmem
Hello Stefan, On 06/18/2014 12:48 PM, Stefan Hajnoczi wrote: One more thing to add to the list: static void ivshmem_read(void *opaque, const uint8_t * buf, int flags) The "flags" argument should be "size". Size should be checked before accessing buf. You are welcome to send a fix and I will review it. Please also see the bug fixes in the following unapplied patch: "[PATCH] ivshmem: fix potential OOB r/w access (#2)" by Sebastian Krahmer https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg03538.html Thanks for the pointer. I'll check it. -- David Marchand -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Why I advise against using ivshmem
On 06/18/2014 12:51 PM, Stefan Hajnoczi wrote: Actually, you can avoid this memory copy using frameworks like DPDK. I guess it's careful to allocate all packets in the mmapped BAR? Yes. That's fine if you can modify applications but doesn't work for unmodified applications using regular networking APIs. If you have access to source code, this should not be a problem. -- David Marchand -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH kvm-unit-tests 1/5] x86: Testing nop instruction on 64-bit
Previously, nop instruction emulation on 64-bit caused RAX bits [63:32] to be cleared. This test checks the behavior is correct and RAX is unmodified. Signed-off-by: Nadav Amit --- x86/emulator.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/x86/emulator.c b/x86/emulator.c index f653127..bf8a873 100644 --- a/x86/emulator.c +++ b/x86/emulator.c @@ -845,6 +845,15 @@ static void test_smsw_reg(uint64_t *mem, uint8_t *insn_page, report("64-bit smsw reg", outregs.rax == cr0); } +static void test_nop(uint64_t *mem, uint8_t *insn_page, + uint8_t *alt_insn_page, void *insn_ram) +{ + inregs = (struct regs){ .rax = 0x1234567890abcdeful }; + MK_INSN(nop, "nop\n\t"); + trap_emulator(mem, alt_insn_page, &insn_nop); + report("nop", outregs.rax == inregs.rax); +} + static void test_crosspage_mmio(volatile uint8_t *mem) { volatile uint16_t w, *pw; @@ -1045,6 +1054,7 @@ int main() test_mmx_movq_mf(mem, insn_page, alt_insn_page, insn_ram); test_movabs(mem, insn_page, alt_insn_page, insn_ram); test_smsw_reg(mem, insn_page, alt_insn_page, insn_ram); + test_nop(mem, insn_page, alt_insn_page, insn_ram); test_crosspage_mmio(mem); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] Why I advise against using ivshmem
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Am 18.06.2014 12:48, schrieb Stefan Hajnoczi: > On Tue, Jun 17, 2014 at 11:44:11AM +0200, Paolo Bonzini wrote: >> Il 17/06/2014 11:03, David Marchand ha scritto: Unless someone steps up and maintains ivshmem, I think it should be deprecated and dropped from QEMU. >>> >>> Then I can maintain ivshmem for QEMU. If this is ok, I will >>> send a patch for MAINTAINERS file. >> >> Typically, adding yourself to maintainers is done only after >> having proved your ability to be a maintainer. :) >> >> So, let's stop talking and go back to code! You can start doing >> what was suggested elsewhere in the thread: get the server and >> uio driver merged into the QEMU tree, document the protocol in >> docs/specs/ivshmem_device_spec.txt, and start fixing bugs such as >> the ones that Markus reported. > > One more thing to add to the list: > > static void ivshmem_read(void *opaque, const uint8_t * buf, int > flags) > > The "flags" argument should be "size". Size should be checked > before accessing buf. > > Please also see the bug fixes in the following unapplied patch: > "[PATCH] ivshmem: fix potential OOB r/w access (#2)" by Sebastian > Krahmer > https://lists.gnu.org/archive/html/qemu-devel/2014-04/msg03538.html Jumping > late onto this thread: SUSE Security team has just recently done a thorough review of QEMU ivshmem code because a customer has requested this be supported in SLES12. Multiple security-related patches were submitted by Stefan Hajnoczi and Sebastian Krahmer, and I fear they are probably still not merged for lack of active maintainer... In such cases, after review, I expect them to be picked up by Peter as committer or via qemu-trivial. So -1, against dropping it. Vincent, you will find an RFC for an ivshmem-test in the qemu-devel list archives or possibly on my qtest branch. The blocking issue that I haven't worked on yet is that we can't unconditionally run the qtest because it depends on KVM enabled at configure time (as opposed to runtime) to have the device available. http://patchwork.ozlabs.org/patch/336367/ As others have stated before, the nahanni server seems unmaintained, thus not getting packaged by SUSE either and making testing the interrupt parts of ivshmem difficult - unless we sort out and fill with actual test code my proposed qtest. Regards, Andreas - -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -BEGIN PGP SIGNATURE- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAEBAgAGBQJToanZAAoJEPou0S0+fgE/L6YP/jtPiwvz3YoW3+H/h/YzrnE7 xVP92jj5orzmbG3HMmEnx0l7YrtzYkwymgUO56dy2SrLFe0xVMnxuzcHLzHLsnm3 bYvMVq3eAx8sdx9c/O2B/rQbNo2p8PF/luTNewN7A+w5TX0XgxdI3TpLT2pVxf0b kMaBnfivzUf2JY/zg6NaiGnwvVrA/0kXsCGKcTBiMQxOX2EdDgNak842SjlmS332 dPbqp5PIMdxwCxI/p+gpmu0cSy1bl2H6N2gkmKQZ63Z2tA7bWn/APdQeHyOcESZE xRAfDz2Cs3/6EL7FLirJWdwT9EMNaFcM+eRgIqDamFzviQPZVuLKdDUteO1k9x1s FlhL3ZRa3qHair9ByEJItqzneAeYmuwZ2DkKh4p/HQfbcxLzZlL8a1EEtYz5DTy0 8+Ax6IU5U5RZmwJ4/M/Ov5eT4t/fNe0MbG3mf5A8FJ6GWoF11ut/wyj70p/EmXua QjUblK/eFemN4YvIi0ovD4DR9ZH2+bXOb44wKL7yFahKLldaP4y9DhJTap2J0mT1 b62FfFZ6hVIGP5n30OHLlhe39QY6SyIPc4JNc9VZ3GcpXtfOHPUOAD/ykt/As1P3 cPfL+jM0QSb6VNJHNbvUsSlJ6xI26qEWzyJ5R7ww4fyEoq4XiE2RCDUWJ2t9/jQb +Bi/esBUDhAduc1Eh3FK =MtPH -END PGP SIGNATURE- -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 00/12] KVM Support for MIPS32 Processors
Il 18/06/2014 00:10, James Hogan ha scritto: The patchset depends on v4 of "target-mips: implement UserLocal Register". I'm aiming for QEMU 2.1, hopefully it isn't too late to get some final review. Thanks to everybody who has already taken part in review. This patchset implements KVM support for MIPS32 processors, using Trap & Emulation. In KVM mode, CPU virtualization is handled via the kvm kernel module, while system and I/O virtualization leverage the Malta model already present in QEMU. Both Guest kernel and Guest Userspace execute in UM. The Guest address space is as folows: Guest User address space: 0x -> 0x4000 Guest Kernel Unmapped: 0x4000 -> 0x6000 Guest Kernel Mapped:0x6000 -> 0x8000 As a result, Guest Usermode virtual memory is limited to 1GB. KVM support (by trap and emulate) was added to the Linux kernel in v3.10. This patchset partly depends on MIPS KVM work which will land in v3.16 (for example to save/restore the state of various registers and the KVM Count/Compare timer). Changes in v5: Changes addressing review comments from v4 patchset, and to use the MIPS KVM timer API added in v3.16. A git tag for this version of the patchset can also be found on github: https://github.com/jahogan/qemu-kvm-mips.git kvm-mips-v5 - Rebase on master + v4 of "target-mips: implement UserLocal Register". - New patch ([01/12] target-mips: Reset CPU timer consistently) to address timer reset behaviour (reported by Paolo Bonzini). - New patch ([08/12] target-mips: Call kvm_mips_reset_vcpu() from mips_cpu_reset()) and rename kvm_arch_reset_vcpu to kvm_mips_reset_vcpu, based on commit 50a2c6e55fa2 (kvm: reset state from the CPU's reset method). - KSEG0 doesn't actually change size, so fix mask in cpu_mips_kseg0_to_phys() (patch 3) and use that instead of having the KVM specific cpu_mips_kvm_um_kseg0_to_phys() (patch 10). - Fix typo in patch 9 subject (s/interupts/interrupts/). - Rename kvm_mips_te_{put,get}_cp0_registers() functions to drop the "te_" since they're not really specific to T&E. - Pass level through from kvm_arch_put_registers() to kvm_mips_put_cp0_registers() rather than hard coding it to KVM_PUT_FULL_STATE. - Fix KVM_REG_MIPS_CP0_* definitions to set KVM_REG_MIPS and KVM_REG_SIZE_U32/KVM_REG_SIZE_U64 (using a macro). - Remove unused KVM_REG_MIPS_CP0_* definitions for now. - Correct type of kvm_mips_{get,put}_one_{,ul}reg() reg_id argument to uint64_t. Various high bits must be set to disambiguate the architecture and register size. - Simplify register access functions slightly. - Add register accessors for always-64-bit registers (rather than ulong registers). These are needed for virtual KVM registers for controlling the KVM Compare/Count timer. - Save and restore KVM timer state with the rest of the state, and also when VM clock is started or stopped. When the KVM timer state is restored (or VM clock restarted) it is resumed with the stored count at the monotonic time when the VM clock was last stopped. If the VM clock hasn't been stopped it resumes from the monotonic time when the state was saved (i.e. as if the timer was never stopped). Changes since RFC patch on kernel KVM thread "[PATCH v2 00/23] MIPS: KVM: Fixes and guest timer rewrite"): - Simplified, removing extra state for storing VM time of save/restore, at the cost of losing/gaining time when VM gets stopped and started (Paolo Bonzini). - Save and restore the UserLocal and HWREna CP0 registers. 
- Improve get/put KVM register error handling with DPRINTFs and fall through so that getting/putting of all the registers is attempted even if one of them fails due to being unimplemented in the kernel. Changes in v4: Changes mostly addressing a few review comments from v3 patchset. A git tag for this version of the patchset can also be found on github: https://github.com/jahogan/qemu-kvm-mips.git kvm-mips-v4 - Rebase on v2.0.0-rc0. - Use int32_t instead of int32 (which is for softfloat) in kvm register accessors (Andreas Färber). - Use uint64_t instead of __u64 (which is really just for kernel headers) in the kvm register accessors (Andreas Färber). - Cast pointer to uintptr_t rather than target_ulong in kvm register accessors. - Remove some redundant casts in kvm register accessors. - Add MAINTAINERS entry for MIPS KVM. Changes in v3: Changes mostly addressing review comments from v2 patchset. A git tag for this version of the patchset can also be found on github: https://github.com/jahogan/qemu-kvm-mips.git kvm-mips-v3 - Remove "target-mips: Set target page size to 16K in KVM mode". It should actually work fine with 4k TARGET_PAGE_SIZE as long as there is no cache aliasing or both host and guest kernels are configured to a sufficient page size to avoid aliasing (which the kernel arch/mips/kvm/00README.txt alludes to anyway). - Rewrote kvm s
Re: Fwd: KVM_SYSTEM_TIME clock is marked unstable on a modern single-socket system
On Wed, Jun 18, 2014 at 01:38:09PM +0200, Tomasz Grabiec wrote: > Hi, > > I'm working on OSv (https://github.com/cloudius-systems/osv), a guest > operating system. Right, please add a "tsc_matched" field to kvm_vcpu_arch (to keep track per-vcpu) and then increment the global counter only once per vcpu. > I've been investigating a phenomena of KVM_SYSTEM_TIME being marked as > unstable (PVCLOCK_TSC_STABLE_BIT cleared) by KVM on a modern > single-socket CPU since the very beginning of guest's life time. > According to the ABI, when pvclock_vcpu_time_info.flags does not have > the PVCLOCK_TSC_STABLE_BIT set we must not assume monotonicity and > need to compensate for this using atomic cmpxchg. However we would > like to avoid that cost unless necessary and since modern > single-socket CPUs have a stable TSC this cost seems unnecessary. > > Setups tested: > > Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz, kernel 3.13.0-29-generic > (Ubuntu), Qemu 2.0.0 > Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz, kernel 3.14.7-200.fc20.x86_64 > (Fedora), Qemu 1.6.2 > > Both CPUs have constant_tsc and nonstop_tsc flags and the host kernel > is using TSC as the master clock source. > > Here's what tracing KVM during QEMU (v2.0.0) startup shows. I used > `trace-cmd record -e kvm` and then `trace-cmd report | egrep > '(tsc|clock)'`. > > === 1 vCPU === > > qemu-system-x86-1353 [003] 75446.815273: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-1355 [002] 75446.849021: kvm_write_tsc_offset: > vcpu=0 prev=0 next=18446556065789505232 > qemu-system-x86-1355 [002] 75446.849024: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > qemu-system-x86-1353 [000] 75446.905313: kvm_write_tsc_offset: > vcpu=0 prev=18446556065789505232 next=18446556065789505232 > qemu-system-x86-1353 [000] 75446.905315: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 1 hostclock tsc > qemu-system-x86-1353 [000] 75446.925064: kvm_write_tsc_offset: > vcpu=0 prev=18446556065789505232 next=18446556065789505232 > qemu-system-x86-1353 [000] 75446.925065: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 2 nr_online 1 hostclock tsc > qemu-system-x86-1353 [000] 75446.925119: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-1355 [000] 75446.932931: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > > === 2 vCPUs === > > qemu-system-x86-1431 [003] 75539.014452: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-1433 [003] 75539.054169: kvm_write_tsc_offset: > vcpu=0 prev=0 next=18446555836061830054 > qemu-system-x86-1433 [003] 75539.054171: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > qemu-system-x86-1434 [003] 75539.078266: kvm_write_tsc_offset: > vcpu=1 prev=0 next=18446555836061830054 > qemu-system-x86-1434 [003] 75539.078269: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-1431 [002] 75539.134695: kvm_write_tsc_offset: > vcpu=0 prev=18446555836061830054 next=18446555836061830054 > qemu-system-x86-1431 [002] 75539.134698: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 2 nr_online 2 hostclock tsc > qemu-system-x86-1431 [002] 75539.142465: kvm_write_tsc_offset: > vcpu=1 prev=18446555836061830054 next=18446555836061830054 > qemu-system-x86-1431 [002] 75539.142468: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 3 nr_online 2 hostclock tsc > qemu-system-x86-1431 
[002] 75539.182614: kvm_write_tsc_offset: > vcpu=0 prev=18446555836061830054 next=18446555836061830054 > qemu-system-x86-1431 [002] 75539.182617: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 4 nr_online 2 hostclock tsc > qemu-system-x86-1431 [002] 75539.182758: kvm_write_tsc_offset: > vcpu=1 prev=18446555836061830054 next=18446555836061830054 > qemu-system-x86-1431 [002] 75539.182758: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 5 nr_online 2 hostclock tsc > qemu-system-x86-1431 [002] 75539.182840: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-1433 [003] 75539.194415: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-1434 [001] 75539.218634: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > > When "masterclock" is 0 it means the PVCLOCK_TSC_STABLE_BIT will not > be set. It is not set because offsets are supposedly not matched. > However according to the kvm_track_tsc tracepoints each TSC write is > matched. But there are 3 writes per-CPU which makes > ka->nr_vcpus_matched_tsc to become larger than kvm->online_vcpus which > makes the following check in pvclock_update_vm_gtod_copy() to fail: > >vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 == > atomic_read(&kvm->online_vcpus)); > > I turns out that kvm_write_tsc() is called with the ta
Re: [Qemu-devel] Why I advise against using ivshmem
Il 18/06/2014 16:57, David Marchand ha scritto: Hello Stefan, On 06/18/2014 12:48 PM, Stefan Hajnoczi wrote: One more thing to add to the list: static void ivshmem_read(void *opaque, const uint8_t * buf, int flags) The "flags" argument should be "size". Size should be checked before accessing buf. You are welcome to send a fix and I will review it. This is not what a maintainer should do. A maintainer should, if possible, contribute fixes to improve the code. I know this is very different from usual "company-style" development (even open source software can be developed on with methods more typical of proprietary software), but we're asking you to do it because you evidently understand ivshmem better than us. Claudio has more experience with free/open-source software. Since he's interested in ivshmem, he can help you too. Perhaps you could try sending out the patch, and Claudio can review it and send pull requests at least in the beginning? Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/7] KVM: PPC: Book3S HV: Fix ABIv2 on LE
Alexander Graf writes: > We use ABIv2 on Little Endian systems which gets rid of the dotted function > names. Branch to the actual functions when we see such a system. > > Signed-off-by: Alexander Graf As per patches sent by anton we don't need this. We can branch to the function rathen than the dot symbol http://article.gmane.org/gmane.linux.ports.ppc.embedded/68925 http://article.gmane.org/gmane.linux.ports.ppc.embedded/71005 -aneesh > --- > arch/powerpc/kvm/book3s_hv_rmhandlers.S | 22 ++ > 1 file changed, 14 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > index 1a71f60..1ff3ebd 100644 > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S > @@ -36,6 +36,12 @@ > #define NAPPING_CEDE 1 > #define NAPPING_NOVCPU 2 > > +#if defined(_CALL_ELF) && _CALL_ELF == 2 > +#define FUNC(name) name > +#else > +#define FUNC(name) GLUE(.,name) > +#endif > + > /* > * Call kvmppc_hv_entry in real mode. > * Must be called with interrupts hard-disabled. > @@ -668,9 +674,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM) > > mr r31, r4 > addir3, r31, VCPU_FPRS_TM > - bl .load_fp_state > + bl FUNC(load_fp_state) > addir3, r31, VCPU_VRS_TM > - bl .load_vr_state > + bl FUNC(load_vr_state) > mr r4, r31 > lwz r7, VCPU_VRSAVE_TM(r4) > mtspr SPRN_VRSAVE, r7 > @@ -1414,9 +1420,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM) > > /* Save FP/VSX. */ > addir3, r9, VCPU_FPRS_TM > - bl .store_fp_state > + bl FUNC(store_fp_state) > addir3, r9, VCPU_VRS_TM > - bl .store_vr_state > + bl FUNC(store_vr_state) > mfspr r6, SPRN_VRSAVE > stw r6, VCPU_VRSAVE_TM(r9) > 1: > @@ -2405,11 +2411,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX) > mtmsrd r8 > isync > addir3,r3,VCPU_FPRS > - bl .store_fp_state > + bl FUNC(store_fp_state) > #ifdef CONFIG_ALTIVEC > BEGIN_FTR_SECTION > addir3,r31,VCPU_VRS > - bl .store_vr_state > + bl FUNC(store_vr_state) > END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) > #endif > mfspr r6,SPRN_VRSAVE > @@ -2441,11 +2447,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX) > mtmsrd r8 > isync > addir3,r4,VCPU_FPRS > - bl .load_fp_state > + bl FUNC(load_fp_state) > #ifdef CONFIG_ALTIVEC > BEGIN_FTR_SECTION > addir3,r31,VCPU_VRS > - bl .load_vr_state > + bl FUNC(load_vr_state) > END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC) > #endif > lwz r7,VCPU_VRSAVE(r31) -aneesh -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 6/7] KVM: PPC: Book3S HV: Fix ABIv2 on LE
On 18.06.14 17:21, Aneesh Kumar K.V wrote: Alexander Graf writes: We use ABIv2 on Little Endian systems which gets rid of the dotted function names. Branch to the actual functions when we see such a system. Signed-off-by: Alexander Graf As per patches sent by anton we don't need this. We can branch to the function rathen than the dot symbol http://article.gmane.org/gmane.linux.ports.ppc.embedded/68925 http://article.gmane.org/gmane.linux.ports.ppc.embedded/71005 Ah, true in this case, because the file never gets compiled as a module. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/11] qspinlock: Paravirt support
On 06/18/2014 08:03 AM, Paolo Bonzini wrote: Il 17/06/2014 00:08, Waiman Long ha scritto: +void __pv_queue_unlock(struct qspinlock *lock) +{ +int val = atomic_read(&lock->val); + +native_queue_unlock(lock); + +if (val & _Q_LOCKED_SLOW) +___pv_kick_head(lock); +} + Again a race can happen here between the reading and writing of the lock value. I can't think of a good way to do that without using cmpxchg. Could you just use xchg on the locked byte? Paolo The slowpath flag is just an indication that the queue head cpu might have been suspended. It may not be due to spurious wakeup. Releasing the lock unconditionally may cause the queue to be changed while it is being inspected. It really depending on how the cpu kicking is being handled. My patch delays the unlocking until all the inspections had been done to make sure that we don't waste time doing a cpu kick that is not needed. -Longman -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 9/9] KVM: vmx: vmx instructions handling does not consider cs.l
Il 18/06/2014 16:19, Nadav Amit ha scritto: VMX instructions use 32-bit operands in 32-bit mode, and 64-bit operands in 64-bit mode. The current implementation is broken since it does not use the register operands correctly, and always uses 64-bit for reads and writes. Moreover, write to memory in vmwrite only considers long-mode, so it ignores cs.l. This patch fixes this behavior. The field of vmread/vmwrite is kept intentionally as 64-bit read since if bits [63:32] are not cleared the instruction should fail, according to Intel SDM. This is not how I read the SDM: "These instructions fail if given, in 64-bit mode, an operand that sets an encoding bit beyond bit 32." (Section 24.11.1.2) "Outside IA-32e mode, the source operand has 32 bits, regardless of the value of CS.D. In 64-bit mode, the source operand has 64 bits; however, if bits 63:32 of the source operand are not zero, VMREAD will fail due to an attempt to access an unsupported VMCS component (see operation section)." (Description of VMREAD in Chapter 30). I'll fix up the patch myself. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/9] KVM: x86: More emulator bugs
Il 18/06/2014 16:19, Nadav Amit ha scritto: This patch-set resolves several emulator bugs. Each fix is independent of the others. The DR6 bug can occur during DR-access exit (regardless to unrestricted mode, MMIO and SPT). Changes in v2: Introduced kvm_register_readl and kvm_register_writel which consider long-mode and cs.l when reading the registers. Fixing the register read to respect 32/64 bit in hypercall handling, CR exit handling and VMX instructions handling. Thanks for re-reviewing the patch Nadav Amit (9): KVM: x86: bit-ops emulation ignores offset on 64-bit KVM: x86: Wrong emulation on 'xadd X, X' KVM: x86: Inter privilage level ret emulation is not implemeneted KVM: x86: emulation of dword cmov on long-mode should clear [63:32] KVM: x86: NOP emulation clears (incorrectly) the high 32-bits of RAX KVM: x86: check DR6/7 high-bits are clear only on long-mode KVM: x86: Hypercall handling does not considers opsize correctly KVM: vmx: handle_cr ignores 32/64-bit mode KVM: vmx: vmx instructions handling does not consider cs.l arch/x86/kvm/emulate.c | 31 --- arch/x86/kvm/vmx.c | 16 arch/x86/kvm/x86.c | 11 ++- arch/x86/kvm/x86.h | 27 +++ 4 files changed, 61 insertions(+), 24 deletions(-) Thanks, looks good. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
On 06/18/2014 09:50 AM, Konrad Rzeszutek Wilk wrote: On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote: Il 17/06/2014 22:55, Konrad Rzeszutek Wilk ha scritto: On Sun, Jun 15, 2014 at 02:47:01PM +0200, Peter Zijlstra wrote: From: Waiman Long This patch extracts the logic for the exchange of new and previous tail code words into a new xchg_tail() function which can be optimized in a later patch. And also adds a third try on acquiring the lock. That I think should be a seperate patch. It doesn't really add a new try, the old code is: - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val& _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(&lock->val, val, new); - if (old == val) - break; - - val = old; - } /* -* we won the trylock; forget about queueing. */ - if (new == _Q_LOCKED_VAL) - goto release; The trylock happens if the "if (val)" hits the else branch. What the patch does is change it from attempting two transition with a single cmpxchg: -* 0,0,0 -> 0,0,1 ; trylock -* p,y,x -> n,y,x ; prev = xchg(lock, node) to first doing the trylock, then the xchg. If the trylock passes and the xchg returns prev=0,0,0, the next step of the algorithm goes to the locked/uncontended state + /* +* claim the lock: +* +* n,0 -> 0,1 : lock, uncontended Similar to your suggestion of patch 3, it's expected that the xchg will *not* return prev=0,0,0 after a failed trylock. I do like your explanation. I hope that Peter will put it in the description as it explains the change quite well. However, I *do* agree with you that it's simpler to just squash this patch into 01/11. Uh, did I say that? Oh I said why don't make it right the first time! I meant in terms of seperating the slowpath (aka the bytelock on the pending bit) from the queue (MCS code). Or renaming the function to be called 'complex' instead of 'slowpath' as it is getting quite hairy. The #1 patch is nice by itself - as it lays out the foundation of the MCS-similar code - and if Ingo decides he does not want this pending byte-lock bit business - it can be easily reverted or dropped. The pending bit code is needed for performance parity with ticket spinlock for light load. My own measurement indicates that the queuing overhead will cause the queue spinlock to be slower than ticket spinlock with 2-4 contending tasks. The pending bit solves the performance problem with 2 contending tasks, leave only the 3-4 tasks cases being a bit slower than the ticket spinlock which should be more than compensated by its superior performance with heavy contention and slightly better performance with no contention. -Longman -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
Il 18/06/2014 17:46, Waiman Long ha scritto: The #1 patch is nice by itself - as it lays out the foundation of the MCS-similar code - and if Ingo decides he does not want this pending byte-lock bit business - it can be easily reverted or dropped. The pending bit code is needed for performance parity with ticket spinlock for light load. My own measurement indicates that the queuing overhead will cause the queue spinlock to be slower than ticket spinlock with 2-4 contending tasks. The pending bit solves the performance problem with 2 contending tasks, leave only the 3-4 tasks cases being a bit slower than the ticket spinlock which should be more than compensated by its superior performance with heavy contention and slightly better performance with no contention. Note that this patch is not related to the pending bit, only to the trylock bit which is already in patch 1. It serializes two previously-parallel checks for transitions. This is why I thought it could already belong in patch 1. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] perf: ignore LBR and offcore_rsp.
On Wed, Jun 18, 2014 at 03:52:55PM +, Liang, Kan wrote: > perf ignore LBR and offcore_rsp. > > x86, perf: Protect LBR and offcore rsp against KVM lying > > With -cpu host, KVM reports LBR and offcore support, if the host has > support. > When the guest perf driver tries to access LBR or offcore_rsp MSR, > it #GPs all MSR accesses,since KVM doesn't handle LBR and offcore support. > So LBR and offcore rsp MSRs are needed to be protected by _safe(). > > For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL > = y. > And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n. > Start the guest with -cpu host. > Run perf record with --branch-any or --branch-filter in guest to trigger > LBR #GP. > Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in > guest to trigger offcore_rsp #GP > > Signed-off-by: Andi Kleen This order indicates Andi is the author; but there's no corresponding From. > Signed-off-by: Kan Liang And here I thought that Andi was of the opinion that if you set CPUID to indicate a particular CPU you had better also handle all its MSRs. pgpOQB8UrEqHl.pgp Description: PGP signature
RE: [PATCH] KVM: PPC: Book3E: Unlock mmu_lock when setting caching attribute
> -Original Message- > From: Mihai Caraman [mailto:mihai.cara...@freescale.com] > Sent: Wednesday, June 18, 2014 9:15 PM > To: kvm-...@vger.kernel.org > Cc: kvm@vger.kernel.org; linuxppc-...@lists.ozlabs.org; Caraman Mihai Claudiu- > B02008; Bhushan Bharat-R65777 > Subject: [PATCH] KVM: PPC: Book3E: Unlock mmu_lock when setting caching > atttribute > > The patch 08c9a188d0d0fc0f0c5e17d89a06bb59c493110f > kvm: powerpc: use caching attributes as per linux pte > do not handle properly the error case, letting mmu_lock locked. The lock > will further generate a RCU stall from kvmppc_e500_emul_tlbwe() caller. > > In case of an error go to out label. > > Signed-off-by: Mihai Caraman > Cc: Bharat Bhushan Thanks mike for fixing this; I am curious to know how you reached to this point :) Reviewed-by: Bharat Bhushan Regards -Bharat > --- > arch/powerpc/kvm/e500_mmu_host.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kvm/e500_mmu_host.c > b/arch/powerpc/kvm/e500_mmu_host.c > index 0528fe5..54144c7 100644 > --- a/arch/powerpc/kvm/e500_mmu_host.c > +++ b/arch/powerpc/kvm/e500_mmu_host.c > @@ -473,7 +473,8 @@ static inline int kvmppc_e500_shadow_map(struct > kvmppc_vcpu_e500 *vcpu_e500, > if (printk_ratelimit()) > pr_err("%s: pte not present: gfn %lx, pfn %lx\n", > __func__, (long)gfn, pfn); > - return -EINVAL; > + ret = -EINVAL; > + goto out; > } > kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg); > > -- > 1.7.11.7 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: PPC: Book3E: Unlock mmu_lock when setting caching attribute
Patch 08c9a188d0d0fc0f0c5e17d89a06bb59c493110f ("kvm: powerpc: use caching attributes as per linux pte") does not handle the error case properly, leaving mmu_lock locked. The held lock will then trigger an RCU stall from the kvmppc_e500_emul_tlbwe() caller. In case of an error, go to the out label.

Signed-off-by: Mihai Caraman
Cc: Bharat Bhushan
---
 arch/powerpc/kvm/e500_mmu_host.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c
index 0528fe5..54144c7 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -473,7 +473,8 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 		if (printk_ratelimit())
 			pr_err("%s: pte not present: gfn %lx, pfn %lx\n",
 					__func__, (long)gfn, pfn);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 	kvmppc_e500_ref_setup(ref, gtlbe, pfn, wimg);
 
--
1.7.11.7
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 9/9] KVM: vmx: vmx instructions handling does not consider cs.l
On 6/18/14, 6:41 PM, Paolo Bonzini wrote: Il 18/06/2014 16:19, Nadav Amit ha scritto: VMX instructions use 32-bit operands in 32-bit mode, and 64-bit operands in 64-bit mode. The current implementation is broken since it does not use the register operands correctly, and always uses 64-bit for reads and writes. Moreover, write to memory in vmwrite only considers long-mode, so it ignores cs.l. This patch fixes this behavior. The field of vmread/vmwrite is kept intentionally as 64-bit read since if bits [63:32] are not cleared the instruction should fail, according to Intel SDM. This is not how I read the SDM: "These instructions fail if given, in 64-bit mode, an operand that sets an encoding bit beyond bit 32." (Section 24.11.1.2) "Outside IA-32e mode, the source operand has 32 bits, regardless of the value of CS.D. In 64-bit mode, the source operand has 64 bits; however, if bits 63:32 of the source operand are not zero, VMREAD will fail due to an attempt to access an unsupported VMCS component (see operation section)." (Description of VMREAD in Chapter 30). I'll fix up the patch myself. Perhaps I am missing something, but I don't see where my mistake is. The VMREAD source operand is always read as 64-bits and I made no changes there. Therefore, if bits 63:32 are not zero, the instruction should fail when attempting to access the field. The value in the source operand of VMWRITE which represents the value which should be written is zero-extended outside 64-bit mode. Quoting: "The effective size of the primary source operand, which may be a register or in memory, is always 32 bits outside IA-32e mode (the setting of CS.D is ignored with respect to operand size) and 64 bits in 64-bit mode." (Description of VMWRITE in chapter 30). Regards, Nadav -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] perf: ignore LBR and offcore_rsp.
Peter Zijlstra writes: > > This order indicates Andi is the author; but there's no corresponding > From. I wrote an early version of the patch, but Kan took it over and extended it. So both are authors. BTW Kan you may want to use git send-email to get standard format. > >> Signed-off-by: Kan Liang > > And here I thought that Andi was of the opinion that if you set CPUID to > indicate a particular CPU you had better also handle all its MSRs. Yes, philosophically that would be the right way, but we needed a short term fix to stop things from crashing, and that was the simplest. -Andi -- a...@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PULL v2 043/106] Add kvm_eventfds_enabled function
From: Nikolay Nikolaev Add a function to check if the eventfd capability is present in KVM in the host kernel. Signed-off-by: Antonios Motakis Signed-off-by: Nikolay Nikolaev Reviewed-by: Michael S. Tsirkin Signed-off-by: Michael S. Tsirkin Acked-by: Paolo Bonzini --- include/sysemu/kvm.h | 11 +++ kvm-all.c| 4 kvm-stub.c | 1 + 3 files changed, 16 insertions(+) diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index e79e92c..c4556ad 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -43,6 +43,7 @@ extern bool kvm_allowed; extern bool kvm_kernel_irqchip; extern bool kvm_async_interrupts_allowed; extern bool kvm_halt_in_kernel_allowed; +extern bool kvm_eventfds_allowed; extern bool kvm_irqfds_allowed; extern bool kvm_msi_via_irqfd_allowed; extern bool kvm_gsi_routing_allowed; @@ -83,6 +84,15 @@ extern bool kvm_readonly_mem_allowed; #define kvm_halt_in_kernel() (kvm_halt_in_kernel_allowed) /** + * kvm_eventfds_enabled: + * + * Returns: true if we can use eventfds to receive notifications + * from a KVM CPU (ie the kernel supports eventds and we are running + * with a configuration where it is meaningful to use them). + */ +#define kvm_eventfds_enabled() (kvm_eventfds_allowed) + +/** * kvm_irqfds_enabled: * * Returns: true if we can use irqfds to inject interrupts into @@ -128,6 +138,7 @@ extern bool kvm_readonly_mem_allowed; #define kvm_irqchip_in_kernel() (false) #define kvm_async_interrupts_enabled() (false) #define kvm_halt_in_kernel() (false) +#define kvm_eventfds_enabled() (false) #define kvm_irqfds_enabled() (false) #define kvm_msi_via_irqfd_enabled() (false) #define kvm_gsi_routing_allowed() (false) diff --git a/kvm-all.c b/kvm-all.c index 4e19eff..92f56d8 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -113,6 +113,7 @@ KVMState *kvm_state; bool kvm_kernel_irqchip; bool kvm_async_interrupts_allowed; bool kvm_halt_in_kernel_allowed; +bool kvm_eventfds_allowed; bool kvm_irqfds_allowed; bool kvm_msi_via_irqfd_allowed; bool kvm_gsi_routing_allowed; @@ -1541,6 +1542,9 @@ int kvm_init(MachineClass *mc) (kvm_check_extension(s, KVM_CAP_READONLY_MEM) > 0); #endif +kvm_eventfds_allowed = +(kvm_check_extension(s, KVM_CAP_IOEVENTFD) > 0); + ret = kvm_arch_init(s); if (ret < 0) { goto err; diff --git a/kvm-stub.c b/kvm-stub.c index ac33d86..8e7737c 100644 --- a/kvm-stub.c +++ b/kvm-stub.c @@ -22,6 +22,7 @@ KVMState *kvm_state; bool kvm_kernel_irqchip; bool kvm_async_interrupts_allowed; +bool kvm_eventfds_allowed; bool kvm_irqfds_allowed; bool kvm_msi_via_irqfd_allowed; bool kvm_gsi_routing_allowed; -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] KVM: x86: Emulator support for #UD on CPL>0
Il 18/06/2014 16:19, Nadav Amit ha scritto: Certain instructions (e.g., mwait and monitor) cause a #UD exception when they are executed in privileged mode. It's actually "non-privileged mode" (Priv means the instruction is privileged, not the mode). So I've renamed the flag to PrivUD. Paolo This is in contrast to the regular privileged instructions which cause #GP. In order not to mess with SVM interception of mwait and monitor which assumes privilege level assertions take place before interception, a flag has been added. Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index f90194d..ef7a5a0 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -163,6 +163,7 @@ #define SrcWrite((u64)1 << 46) /* Write back src operand */ #define NoMod ((u64)1 << 47) /* Mod field is ignored */ #define NoBigReal ((u64)1 << 48) /* No big real mode */ +#define UDOnPriv((u64)1 << 49) /* #UD instead of #GP on CPL > 0 */ #define DstXacc (DstAccLo | SrcAccHi | SrcWrite) @@ -4560,7 +4561,10 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt) /* Privileged instruction can be executed only in CPL=0 */ if ((ctxt->d & Priv) && ops->cpl(ctxt)) { - rc = emulate_gp(ctxt, 0); + if (ctxt->d & UDOnPriv) + rc = emulate_ud(ctxt); + else + rc = emulate_gp(ctxt, 0); goto done; } -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
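For readers following the distinction, it is visible even from ring 3. A quick user-space probe (sketch only; assumes the SDM defaults, i.e. user-level MONITOR/MWAIT not enabled through IA32_MISC_ENABLE): MWAIT at CPL > 0 raises #UD, which Linux delivers as SIGILL, while a merely privileged instruction such as HLT raises #GP, delivered as SIGSEGV.

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void handler(int sig)
{
	printf("faulted with %s\n", sig == SIGILL ? "SIGILL (#UD)" : "SIGSEGV (#GP)");
	exit(0);
}

int main(void)
{
	signal(SIGILL, handler);
	signal(SIGSEGV, handler);
	asm volatile("mwait" : : "a"(0), "c"(0));	/* expect SIGILL (#UD)      */
	/* asm volatile("hlt"); */			/* would give SIGSEGV (#GP) */
	return 0;
}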
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
Il 18/06/2014 16:19, Nadav Amit ha scritto: mwait and monitor are currently handled as nop. Considering this behavior, they should still be handled correctly, i.e., check execution conditions and generate exceptions when required. mwait and monitor may also be executed in real-mode and are not handled in that case. This patch performs the emulation of monitor-mwait according to Intel SDM (other than checking whether interrupt can be used as a break event). Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 41 +++-- arch/x86/kvm/svm.c | 22 ++ arch/x86/kvm/vmx.c | 27 +++ 3 files changed, 52 insertions(+), 38 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index ef7a5a0..424b58d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } +static int em_monitor(struct x86_emulate_ctxt *ctxt) +{ + int rc; + struct segmented_address addr; + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); + u64 rax = reg_read(ctxt, VCPU_REGS_RAX); + u8 byte; + + if (ctxt->mode != X86EMUL_MODE_PROT64) + rcx = (u32)rcx; + if (rcx != 0) + return emulate_gp(ctxt, 0); + + addr.seg = seg_override(ctxt); + addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax; + rc = segmented_read(ctxt, addr, &byte, 1); + if (rc != X86EMUL_CONTINUE) + return rc; + + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); + return X86EMUL_CONTINUE; +} + +static int em_mwait(struct x86_emulate_ctxt *ctxt) +{ + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); + + if (ctxt->mode != X86EMUL_MODE_PROT64) + rcx = (u32)rcx; + if ((rcx & ~1UL) != 0) + return emulate_gp(ctxt, 0); + + /* Accepting interrupt as break event regardless to cpuid */ + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); + return X86EMUL_CONTINUE; +} + static bool valid_cr(int nr) { switch (nr) { @@ -3557,8 +3594,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e) static const struct opcode group7_rm1[] = { - DI(SrcNone | Priv, monitor), - DI(SrcNone | Priv, mwait), + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_monitor, monitor), + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_mwait, mwait), N, N, N, N, N, N, }; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ec8366c..a524e04 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3274,24 +3274,6 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } -static int nop_interception(struct vcpu_svm *svm) -{ - skip_emulated_instruction(&(svm->vcpu)); - return 1; -} - -static int monitor_interception(struct vcpu_svm *svm) -{ - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); - return nop_interception(svm); -} - -static int mwait_interception(struct vcpu_svm *svm) -{ - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); - return nop_interception(svm); -} - static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_READ_CR0] = cr_interception, [SVM_EXIT_READ_CR3] = cr_interception, @@ -3349,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { [SVM_EXIT_CLGI] = clgi_interception, [SVM_EXIT_SKINIT] = skinit_interception, [SVM_EXIT_WBINVD] = emulate_on_interception, - [SVM_EXIT_MONITOR] = monitor_interception, - [SVM_EXIT_MWAIT]= mwait_interception, + [SVM_EXIT_MONITOR] = emulate_on_interception, + [SVM_EXIT_MWAIT]= emulate_on_interception, [SVM_EXIT_XSETBV] = xsetbv_interception, [SVM_EXIT_NPF] = pf_interception, }; 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 801332e..7023e71 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5672,22 +5672,17 @@ static int handle_pause(struct kvm_vcpu *vcpu) return 1; } -static int handle_nop(struct kvm_vcpu *vcpu) +static int handle_emulate(struct kvm_vcpu *vcpu) { - skip_emulated_instruction(vcpu); - return 1; -} + int err = emulate_instruction(vcpu, 0); -static int handle_mwait(struct kvm_vcpu *vcpu) -{ - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); - return handle_nop(vcpu); -} - -static int handle_monitor(struct kvm_vcpu *vcpu) -{ - printk_once(KERN_WARNING "kvm: MONITOR instruction emu
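For context, the ring-0 pattern the new em_monitor()/em_mwait() code has to validate looks roughly like this (simplified sketch; a real kernel idle loop adds need_resched() checks and cpuidle integration). MONITOR arms the address in RAX, with RCX holding extensions that must currently be zero and RDX holding hints; MWAIT then stalls until the monitored line is written or an interrupt arrives.

static inline void monitor_mwait_idle(volatile char *flag)
{
	asm volatile("monitor" : : "a"(flag), "c"(0), "d"(0));
	if (!*flag)					/* re-check after arming */
		asm volatile("mwait" : : "a"(0), "c"(0));
}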
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
Nadav Amit writes: > mwait and monitor are currently handled as nop. Considering this behavior, > they > should still be handled correctly, i.e., check execution conditions and > generate > exceptions when required. mwait and monitor may also be executed in real-mode Is this necessary ? They are NOPs and kvm prints that out (correctly so) to dmesg. Implementing them correctly is a different thing, but adding extra checks for NOPs just seems like adding extra cycles. > and are not handled in that case. This patch performs the emulation of > monitor-mwait according to Intel SDM (other than checking whether interrupt > can > be used as a break event). > > Signed-off-by: Nadav Amit > --- > arch/x86/kvm/emulate.c | 41 +++-- > arch/x86/kvm/svm.c | 22 ++ > arch/x86/kvm/vmx.c | 27 +++ > 3 files changed, 52 insertions(+), 38 deletions(-) > > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c > index ef7a5a0..424b58d 100644 > --- a/arch/x86/kvm/emulate.c > +++ b/arch/x86/kvm/emulate.c > @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt) > return X86EMUL_CONTINUE; > } > > +static int em_monitor(struct x86_emulate_ctxt *ctxt) > +{ > + int rc; > + struct segmented_address addr; > + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); > + u64 rax = reg_read(ctxt, VCPU_REGS_RAX); > + u8 byte; > + > + if (ctxt->mode != X86EMUL_MODE_PROT64) > + rcx = (u32)rcx; > + if (rcx != 0) > + return emulate_gp(ctxt, 0); > + > + addr.seg = seg_override(ctxt); > + addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax; > + rc = segmented_read(ctxt, addr, &byte, 1); > + if (rc != X86EMUL_CONTINUE) > + return rc; > + > + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); > + return X86EMUL_CONTINUE; > +} > + > +static int em_mwait(struct x86_emulate_ctxt *ctxt) > +{ > + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); > + > + if (ctxt->mode != X86EMUL_MODE_PROT64) > + rcx = (u32)rcx; > + if ((rcx & ~1UL) != 0) > + return emulate_gp(ctxt, 0); > + > + /* Accepting interrupt as break event regardless to cpuid */ > + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); > + return X86EMUL_CONTINUE; > +} > + > static bool valid_cr(int nr) > { > switch (nr) { > @@ -3557,8 +3594,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) > F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e) > > static const struct opcode group7_rm1[] = { > - DI(SrcNone | Priv, monitor), > - DI(SrcNone | Priv, mwait), > + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_monitor, monitor), > + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_mwait, mwait), > N, N, N, N, N, N, > }; > > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index ec8366c..a524e04 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -3274,24 +3274,6 @@ static int pause_interception(struct vcpu_svm *svm) > return 1; > } > > -static int nop_interception(struct vcpu_svm *svm) > -{ > - skip_emulated_instruction(&(svm->vcpu)); > - return 1; > -} > - > -static int monitor_interception(struct vcpu_svm *svm) > -{ > - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); > - return nop_interception(svm); > -} > - > -static int mwait_interception(struct vcpu_svm *svm) > -{ > - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); > - return nop_interception(svm); > -} > - > static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { > [SVM_EXIT_READ_CR0] = cr_interception, > [SVM_EXIT_READ_CR3] = cr_interception, > @@ -3349,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm > 
*svm) = { > [SVM_EXIT_CLGI] = clgi_interception, > [SVM_EXIT_SKINIT] = skinit_interception, > [SVM_EXIT_WBINVD] = emulate_on_interception, > - [SVM_EXIT_MONITOR] = monitor_interception, > - [SVM_EXIT_MWAIT]= mwait_interception, > + [SVM_EXIT_MONITOR] = emulate_on_interception, > + [SVM_EXIT_MWAIT]= emulate_on_interception, > [SVM_EXIT_XSETBV] = xsetbv_interception, > [SVM_EXIT_NPF] = pf_interception, > }; > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 801332e..7023e71 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -5672,22 +5672,17 @@ static int handle_pause(struct kvm_vcpu *vcpu) > return 1; > } > > -static int handle_nop(struct kvm_vcpu *vcpu) > +static int handle_emulate(struct kvm_vcpu *vcpu) > { > - skip_emulated_instruction(vcpu); > -
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
Il 18/06/2014 18:43, Bandan Das ha scritto: > mwait and monitor are currently handled as nop. Considering this behavior, they > should still be handled correctly, i.e., check execution conditions and generate > exceptions when required. mwait and monitor may also be executed in real-mode Is this necessary ? They are NOPs and kvm prints that out (correctly so) to dmesg. Implementing them correctly is a different thing, but adding extra checks for NOPs just seems like adding extra cycles. Raising the correct exception is a good thing, though. The guest is going to busy wait anyway, it doesn't matter how fast it does that. :) Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 9/9] KVM: vmx: vmx instructions handling does not consider cs.l
Il 18/06/2014 18:01, Nadav Amit ha scritto: Perhaps I am missing something, but I don't see where my mistake is. The VMREAD source operand is always read as 64-bits and I made no changes there. Therefore, if bits 63:32 are not zero, the instruction should fail when attempting to access the field. In 32-bit mode, it should be read as 32 bits: "Outside IA-32e mode, the source operand has 32 bits, regardless of the value of CS.D" (so it's never 16 bits, but it's also never 64 bits). Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
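The practical consequence for the emulator is just an operand-size truncation before the failure check. An illustrative sketch (hedged: the helper below is not the actual vmx.c code, though kvm_register_read() and is_64_bit_mode() are existing helpers):

static u64 vmcs_field_operand(struct kvm_vcpu *vcpu, int reg)
{
	u64 field = kvm_register_read(vcpu, reg);

	/* Outside IA-32e mode the source operand is 32 bits - never 64. */
	if (!is_64_bit_mode(vcpu))
		field = (u32)field;
	return field;
}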
Re: Nested paging in nested SVM setup
Hi Jan, If we want to provide useful nested SVM support, this must be feasible. If there is a bug, it has to be fixed. I was more concerned about if it is supported (and it means I do something wrong), or if it is not supported (at least, now). Maybe you can describe how you configured the involved units (NPT structures, guest / host PAR, MTRR etc.). I've tried different combinations, but to be specific: - NPT: four-level long-mode page tables; all PTEs except terminal have U,R,P bits set (0x07), as per APMv2 15.25.5 - APIC page pte; physical address 0xfee0, flags: PAT, PWT, PCD, U, P (0x9D) - guest PAT and host PAT are the same, 0x7010600070106 (as set by the Linux kernel). Guest PAT is stored in VMCB; host PAT is restored at each #VMEXIT. - MTRRs. No changes to what Linux use prior to VM entry here; #0 (base 0x8000, mask 0xFF8000) uncacheable, others are disabled. I also noticed that setting PAT MSR from the nested hypervisor leaves high word unassigned, i.e. the code like this: mov $0x70106, %rax mov %rax, %rdx mov $0x0277, %rcx wrmsr rdmsr yields %rax = 0, %rdx = 0x70106. Even better would be a test case based on kvm-unit-tests (see [1], Will have a look at it, thanks. -- Best regards, Valentine Sinitsyn -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
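Since a kvm-unit-tests case was suggested, a rough sketch of a check for the PAT write behavior described above (assumes the suite's rdmsr/wrmsr and report helpers; the MSR constant is defined locally to keep the sketch self-contained):

#include "libcflat.h"
#include "processor.h"

#define PAT_MSR		0x277
#define LINUX_PAT	0x0007010600070106ULL

int main(void)
{
	u64 pat;

	wrmsr(PAT_MSR, LINUX_PAT);
	pat = rdmsr(PAT_MSR);
	report("PAT MSR round-trip", pat == LINUX_PAT);
	return 0;
}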
Re: [PATCH v4] KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
On Wed, 2014-06-18 at 10:15 +0300, Mihai Caraman wrote: > On vcpu schedule, the condition checked for tlb pollution is too loose. > The tlb entries of a vcpu become polluted (vs stale) only when a different > vcpu within the same logical partition runs in-between. Optimize the tlb > invalidation condition keeping last_vcpu per logical partition id. > > With the new invalidation condition, a guest shows 4% performance improvement > on P5020DS while running a memory stress application with the cpu > oversubscribed, > the other guest running a cpu intensive workload. > > Guest - old invalidation condition > real 3.89 > user 3.87 > sys 0.01 > > Guest - enhanced invalidation condition > real 3.75 > user 3.73 > sys 0.01 > > Host > real 3.70 > user 1.85 > sys 0.00 > > The memory stress application accesses 4KB pages backed by 75% of available > TLB0 entries: > > char foo[ENTRIES][4096] __attribute__ ((aligned (4096))); > > int main() > { > char bar; > int i, j; > > for (i = 0; i < ITERATIONS; i++) > for (j = 0; j < ENTRIES; j++) > bar = foo[j][0]; > > return 0; > } > > Signed-off-by: Mihai Caraman > Cc: Scott Wood > --- > v4: > - rename last_vcpu_on_cpu to last_vcpu_of_lpid > - use "*[" syntax despite checkpatch error > > v3: > - use existing logic while keeping last_cpu per lpid > > arch/powerpc/kvm/e500mc.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c > index 17e4562..690499d 100644 > --- a/arch/powerpc/kvm/e500mc.c > +++ b/arch/powerpc/kvm/e500mc.c > @@ -110,7 +110,7 @@ void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 > old_msr) > { > } > > -static DEFINE_PER_CPU(struct kvm_vcpu *, last_vcpu_on_cpu); > +static DEFINE_PER_CPU(struct kvm_vcpu *[KVMPPC_NR_LPIDS], last_vcpu_of_lpid); > > static void kvmppc_core_vcpu_load_e500mc(struct kvm_vcpu *vcpu, int cpu) > { > @@ -141,9 +141,9 @@ static void kvmppc_core_vcpu_load_e500mc(struct kvm_vcpu > *vcpu, int cpu) > mtspr(SPRN_GESR, vcpu->arch.shared->esr); > > if (vcpu->arch.oldpir != mfspr(SPRN_PIR) || > - __get_cpu_var(last_vcpu_on_cpu) != vcpu) { > + __get_cpu_var(last_vcpu_of_lpid)[vcpu->kvm->arch.lpid] != vcpu) { > kvmppc_e500_tlbil_all(vcpu_e500); > - __get_cpu_var(last_vcpu_on_cpu) = vcpu; > + __get_cpu_var(last_vcpu_of_lpid)[vcpu->kvm->arch.lpid] = vcpu; > } > > kvmppc_load_guest_fp(vcpu); Reviewed-by: Scott Wood -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
Paolo Bonzini writes: > Il 18/06/2014 18:43, Bandan Das ha scritto: >>> > mwait and monitor are currently handled as nop. Considering this >>> > behavior, they >>> > should still be handled correctly, i.e., check execution conditions and >>> > generate >>> > exceptions when required. mwait and monitor may also be executed in >>> > real-mode >> Is this necessary ? They are NOPs and kvm prints that out (correctly so) to >> dmesg. >> Implementing them correctly is a different thing, but adding extra checks >> for NOPs >> just seems like adding extra cycles. > > Raising the correct exception is a good thing, though. The guest is > going to busy wait anyway, it doesn't matter how fast it does that. :) Thanks, makes sense. I was also thinking why to advertise it at all, but it's probably more common than I think it is. > Paolo > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 9/9] KVM: vmx: vmx instructions handling does not consider cs.l
On Jun 18, 2014, at 7:06 PM, Paolo Bonzini wrote: > Il 18/06/2014 18:01, Nadav Amit ha scritto: >>> >> >> Perhaps I am missing something, but I don't see where my mistake is. >> The VMREAD source operand is always read as 64-bits and I made no >> changes there. Therefore, if bits 63:32 are not zero, the instruction >> should fail when attempting to access the field. > > In 32-bit mode, it should be read as 32 bits: "Outside IA-32e mode, the > source operand has 32 bits, regardless of the value of CS.D" (so it's never > 16 bits, but it's also never 64 bits). Oh. I now get it. I misunderstood what the SDM said, as I was thinking that bits 63:32 being set would lead to failure also in 32-bit mode. If you fix it, please fix both VMREAD and VMWRITE. If not, I would resubmit. Thanks, Nadav -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit wrote: > mwait and monitor are currently handled as nop. Considering this behavior, > they > should still be handled correctly, i.e., check execution conditions and > generate > exceptions when required. mwait and monitor may also be executed in real-mode > and are not handled in that case. This patch performs the emulation of > monitor-mwait according to Intel SDM (other than checking whether interrupt > can > be used as a break event). > > Signed-off-by: Nadav Amit > --- > arch/x86/kvm/emulate.c | 41 +++-- > arch/x86/kvm/svm.c | 22 ++ > arch/x86/kvm/vmx.c | 27 +++ > 3 files changed, 52 insertions(+), 38 deletions(-) > > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c > index ef7a5a0..424b58d 100644 > --- a/arch/x86/kvm/emulate.c > +++ b/arch/x86/kvm/emulate.c > @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt) > return X86EMUL_CONTINUE; > } > > +static int em_monitor(struct x86_emulate_ctxt *ctxt) > +{ > + int rc; > + struct segmented_address addr; > + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); > + u64 rax = reg_read(ctxt, VCPU_REGS_RAX); > + u8 byte; I'd request: u32 ebx, ecx, edx, eax = 1; ctxt->opt->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx); if (!(ecx & FFL(MWAIT))) return emulate_ud(ctxt); and also in em_mwait. > + > + if (ctxt->mode != X86EMUL_MODE_PROT64) > + rcx = (u32)rcx; > + if (rcx != 0) > + return emulate_gp(ctxt, 0); > + > + addr.seg = seg_override(ctxt); > + addr.ea = ctxt->ad_bytes == 8 ? rax : (u32)rax; > + rc = segmented_read(ctxt, addr, &byte, 1); > + if (rc != X86EMUL_CONTINUE) > + return rc; > + > + printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as > NOP!\n"); > + return X86EMUL_CONTINUE; > +} > + > +static int em_mwait(struct x86_emulate_ctxt *ctxt) > +{ > + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); > + > + if (ctxt->mode != X86EMUL_MODE_PROT64) > + rcx = (u32)rcx; > + if ((rcx & ~1UL) != 0) > + return emulate_gp(ctxt, 0); > + > + /* Accepting interrupt as break event regardless to cpuid */ > + printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); > + return X86EMUL_CONTINUE; > +} > + > static bool valid_cr(int nr) > { > switch (nr) { > @@ -3557,8 +3594,8 @@ static int check_perm_out(struct x86_emulate_ctxt *ctxt) > F2bv(((_f) & ~Lock) | DstAcc | SrcImm, _e) > > static const struct opcode group7_rm1[] = { > - DI(SrcNone | Priv, monitor), > - DI(SrcNone | Priv, mwait), > + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_monitor, monitor), > + II(SrcNone | Priv | NoBigReal | UDOnPriv, em_mwait, mwait), > N, N, N, N, N, N, > }; > > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index ec8366c..a524e04 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -3274,24 +3274,6 @@ static int pause_interception(struct vcpu_svm *svm) > return 1; > } > > -static int nop_interception(struct vcpu_svm *svm) > -{ > - skip_emulated_instruction(&(svm->vcpu)); > - return 1; > -} > - > -static int monitor_interception(struct vcpu_svm *svm) > -{ > - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as > NOP!\n"); > - return nop_interception(svm); > -} > - > -static int mwait_interception(struct vcpu_svm *svm) > -{ > - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); > - return nop_interception(svm); > -} > - > static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { > [SVM_EXIT_READ_CR0] = cr_interception, > [SVM_EXIT_READ_CR3] = cr_interception, > @@ -3349,8 +3331,8 @@ static int (*const svm_exit_handlers[])(struct vcpu_svm > 
*svm) = { > [SVM_EXIT_CLGI] = clgi_interception, > [SVM_EXIT_SKINIT] = skinit_interception, > [SVM_EXIT_WBINVD] = emulate_on_interception, > - [SVM_EXIT_MONITOR] = monitor_interception, > - [SVM_EXIT_MWAIT]= mwait_interception, > + [SVM_EXIT_MONITOR] = emulate_on_interception, > + [SVM_EXIT_MWAIT]= emulate_on_interception, > [SVM_EXIT_XSETBV] = xsetbv_interception, > [SVM_EXIT_NPF] = pf_interception, > }; > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 801332e..7023e71 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -5672,22 +5672,17 @@ static int handle_pause(struct kvm_vcpu *vcpu) > return 1; > } > > -static int handle_nop(struct kvm_vcpu *vcpu) > +static
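A fleshed-out sketch of the guard being requested (hedged: "ctxt->opt" above is presumably ctxt->ops, and the FFL() helper is assumed; the raw CPUID.01H:ECX bit is used here instead):

static bool emul_cpuid_has_mwait(struct x86_emulate_ctxt *ctxt)
{
	u32 eax = 1, ebx, ecx, edx;

	ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
	return ecx & (1 << 3);			/* CPUID.01H:ECX.MONITOR [bit 3] */
}

with em_monitor()/em_mwait() starting with:

	if (!emul_cpuid_has_mwait(ctxt))
		return emulate_ud(ctxt);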
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
On 6/18/14, 8:59 PM, Eric Northup wrote: On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit wrote: mwait and monitor are currently handled as nop. Considering this behavior, they should still be handled correctly, i.e., check execution conditions and generate exceptions when required. mwait and monitor may also be executed in real-mode and are not handled in that case. This patch performs the emulation of monitor-mwait according to Intel SDM (other than checking whether interrupt can be used as a break event). Signed-off-by: Nadav Amit --- arch/x86/kvm/emulate.c | 41 +++-- arch/x86/kvm/svm.c | 22 ++ arch/x86/kvm/vmx.c | 27 +++ 3 files changed, 52 insertions(+), 38 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index ef7a5a0..424b58d 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt) return X86EMUL_CONTINUE; } +static int em_monitor(struct x86_emulate_ctxt *ctxt) +{ + int rc; + struct segmented_address addr; + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); + u64 rax = reg_read(ctxt, VCPU_REGS_RAX); + u8 byte; I'd request: u32 ebx, ecx, edx, eax = 1; ctxt->opt->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx); if (!(ecx & FFL(MWAIT))) return emulate_ud(ctxt); and also in em_mwait. I had similar implementation on previous version, which also checked on mwait whether "interrupt as break event" matches ECX value. However, I was under the impression that it was decided that MWAIT will always be emulated as NOP to avoid misbehaving VMs that ignore CPUID (see the discussion at http://www.spinics.net/lists/kvm/msg102766.html ). Nadav -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
> >>However, I *do* agree with you that it's simpler to just squash this patch > >>into 01/11. > >Uh, did I say that? Oh, I said why not make it right the first time! > > > >I meant in terms of separating the slowpath (aka the bytelock on the pending > >bit) from the queue (MCS code). Or renaming the function to be called > >'complex' instead of 'slowpath' as it is getting quite hairy. > > > >The #1 patch is nice by itself - as it lays out the foundation of the > >MCS-similar code - and if Ingo decides he does not want this pending > >byte-lock bit business - it can be easily reverted or dropped. > > The pending bit code is needed for performance parity with ticket spinlock > for light load. My own measurement indicates that the queuing overhead will > cause the queue spinlock to be slower than ticket spinlock with 2-4 > contending tasks. The pending bit solves the performance problem with 2 Aha! > contending tasks, leaving only the 3-4 task cases a bit slower than the > ticket spinlock, which should be more than compensated by its superior > performance with heavy contention and slightly better performance with no > contention. That should be mentioned in the commit description as the rationale for the patch "qspinlock: Add pending bit" and also in the code. Thank you! -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
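For readers following the thread, the shape of the pending-bit fast path being discussed, as heavily simplified pseudo-C (not the actual patch code; the helper names are placeholders that only loosely follow the patch). The lock word is (tail, pending, locked):

void qspinlock_slowpath_sketch(struct qspinlock *lock)
{
	/* 0,0,1 -> 0,1,1 : become the single spinning waiter */
	if (try_set_pending(lock)) {
		while (locked_byte_set(lock))	/* wait for the owner to unlock */
			cpu_relax();
		/* 0,1,0 -> 0,0,1 : take the lock and clear pending */
		clear_pending_set_locked(lock);
		return;
	}
	/* pending already claimed: fall back to the MCS queue */
	queue_mcs_and_spin(lock);
}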
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
Quoting Gabriel's post http://www.spinics.net/lists/kvm/msg103792.html : [...] > E.g., OS X 10.5 *does* check CPUID, and panics if it doesn't find it. > It needs the MONITOR cpuid flag to be on, *and* the actual > instructions to work. On Wed, Jun 18, 2014 at 11:23 AM, Nadav Amit wrote: > On 6/18/14, 8:59 PM, Eric Northup wrote: >> >> On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit >> wrote: >>> >>> mwait and monitor are currently handled as nop. Considering this >>> behavior, they >>> should still be handled correctly, i.e., check execution conditions and >>> generate >>> exceptions when required. mwait and monitor may also be executed in >>> real-mode >>> and are not handled in that case. This patch performs the emulation of >>> monitor-mwait according to Intel SDM (other than checking whether >>> interrupt can >>> be used as a break event). >>> >>> Signed-off-by: Nadav Amit >>> --- >>> arch/x86/kvm/emulate.c | 41 +++-- >>> arch/x86/kvm/svm.c | 22 ++ >>> arch/x86/kvm/vmx.c | 27 +++ >>> 3 files changed, 52 insertions(+), 38 deletions(-) >>> >>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c >>> index ef7a5a0..424b58d 100644 >>> --- a/arch/x86/kvm/emulate.c >>> +++ b/arch/x86/kvm/emulate.c >>> @@ -3344,6 +3344,43 @@ static int em_bswap(struct x86_emulate_ctxt *ctxt) >>> return X86EMUL_CONTINUE; >>> } >>> >>> +static int em_monitor(struct x86_emulate_ctxt *ctxt) >>> +{ >>> + int rc; >>> + struct segmented_address addr; >>> + u64 rcx = reg_read(ctxt, VCPU_REGS_RCX); >>> + u64 rax = reg_read(ctxt, VCPU_REGS_RAX); >>> + u8 byte; >> >> >> I'd request: >> >> u32 ebx, ecx, edx, eax = 1; >> ctxt->opt->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx); >> if (!(ecx & FFL(MWAIT))) >> return emulate_ud(ctxt); >> >> and also in em_mwait. >> > > I had similar implementation on previous version, which also checked on > mwait whether "interrupt as break event" matches ECX value. However, I was > under the impression that it was decided that MWAIT will always be emulated > as NOP to avoid misbehaving VMs that ignore CPUID (see the discussion at > http://www.spinics.net/lists/kvm/msg102766.html ). > > Nadav -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote: > VFIO exposes BARs to user space as a byte stream so userspace can > read it using pread()/pwrite(). Since this is a byte stream, VFIO should > not do byte swapping and simply return values as it gets them from > PCI device. > > Instead, the existing code assumes that byte stream in read/write is > little-endian and it fixes endianness for values which it passes to > ioreadXX/iowriteXX helpers. This works for little-endian as PCI is > little endian and le32_to_cpu/... are stubs. vfio read32: val = cpu_to_le32(ioread32(io + off)); Where the typical x86 case, ioread32 is: #define ioread32(addr) readl(addr) and readl is: __le32_to_cpu(__raw_readl(addr)); So we do canceling byte swaps, which are both nops on x86, and end up returning device endian, which we assume is little endian. vfio write32 is similar: iowrite32(le32_to_cpu(val), io + off); The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel out, so input data is device endian, which is assumed little. > This also works for big endian but rather by an accident: it reads 4 bytes > from the stream (@val is big endian), converts to CPU format (which should > be big endian) as it was little endian (@val becomes actually little > endian) and calls iowrite32() which does not do swapping on big endian > system. Really? In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around writel(), which seems to use the generic implementation, which does include a cpu_to_le32. I also see other big endian archs like parisc doing cpu_to_le32 on iowrite32, so I don't think this statement is true. I imagine it's probably working for you because the swap cancel. > This removes byte swapping and makes use ioread32be/iowrite32be > (and 16bit versions) on big-endian systems. The "be" helpers take > native endian values and do swapping at the moment of writing to a PCI > register using one of "store byte-reversed" instructions. So now you want iowrite32() on little endian and iowrite32be() on big endian, the former does a cpu_to_le32 (which is a nop on little endian) and the latter does a cpu_to_be32 (which is a nop on big endian)... should we just be using __raw_writel() on both? There doesn't actually seem to be any change in behavior here, it just eliminates back-to-back byte swaps, which are a nop on x86, but not power, right? > Suggested-by: Benjamin Herrenschmidt > Signed-off-by: Alexey Kardashevskiy > --- > drivers/vfio/pci/vfio_pci_rdwr.c | 20 > 1 file changed, 16 insertions(+), 4 deletions(-) > > diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c > b/drivers/vfio/pci/vfio_pci_rdwr.c > index 210db24..f363b5a 100644 > --- a/drivers/vfio/pci/vfio_pci_rdwr.c > +++ b/drivers/vfio/pci/vfio_pci_rdwr.c > @@ -21,6 +21,18 @@ > > #include "vfio_pci_private.h" > > +#ifdef __BIG_ENDIAN__ > +#define ioread16_native ioread16be > +#define ioread32_native ioread32be > +#define iowrite16_native iowrite16be > +#define iowrite32_native iowrite32be > +#else > +#define ioread16_native ioread16 > +#define ioread32_native ioread32 > +#define iowrite16_native iowrite16 > +#define iowrite32_native iowrite32 > +#endif > + > /* > * Read or write from an __iomem region (MMIO or I/O port) with an excluded > * range which is inaccessible. 
The excluded range drops writes and fills > @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf, > if (copy_from_user(&val, buf, 4)) > return -EFAULT; > > - iowrite32(le32_to_cpu(val), io + off); > + iowrite32_native(val, io + off); > } else { > - val = cpu_to_le32(ioread32(io + off)); > + val = ioread32_native(io + off); > > if (copy_to_user(buf, &val, 4)) > return -EFAULT; > @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf, > if (copy_from_user(&val, buf, 2)) > return -EFAULT; > > - iowrite16(le16_to_cpu(val), io + off); > + iowrite16_native(val, io + off); > } else { > - val = cpu_to_le16(ioread16(io + off)); > + val = ioread16_native(io + off); > > if (copy_to_user(buf, &val, 2)) > return -EFAULT; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
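To illustrate the "no change in behavior, just fewer swap pairs" point, the __raw_writel() variant alluded to would look like this (sketch only; unlike iowrite32()/iowrite32be(), __raw_writel() gives no ordering guarantees, so it is not a drop-in replacement):

static void vfio_bar_write32(void __iomem *io, loff_t off, u32 val)
{
	/* val is a raw 4-byte copy from user space, i.e. device endian */
	__raw_writel(val, io + off);
}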
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
On Wed, Jun 18, 2014 at 10:59:14AM -0700, Eric Northup wrote: > On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit wrote: > > mwait and monitor are currently handled as nop. Considering this behavior, > > they > > should still be handled correctly, i.e., check execution conditions and > > generate > > exceptions when required. mwait and monitor may also be executed in > > real-mode > > and are not handled in that case. This patch performs the emulation of > > monitor-mwait according to Intel SDM (other than checking whether interrupt > > can > > be used as a break event). > > > > Signed-off-by: Nadav Amit How about this instead (details in the commit log below) ? Please let me know what you think, and if you'd prefer me to send it out as a separate patch rather than a reply to this thread. Thanks, --Gabriel >From 0375a0aceb54cdbc26a6c0e5b43c46324f830ec3 Mon Sep 17 00:00:00 2001 From: "Gabriel L. Somlo" Date: Wed, 18 Jun 2014 14:39:15 -0400 Subject: [PATCH] kvm: x86: revert "emulate monitor and mwait instructions as nop" This reverts commit 87c00572ba05aa8c9db118da75c608f47eb10b9e. OS X <= 10.7.* are the only known guests which realistically required this functionality. As it turns out, OS X can be told to forego using monitor/mwait by passing it "idlehalt=0" as a kernel argument, so we're better off removing this hack from KVM altogether. Signed-off-by: Gabriel L. Somlo --- arch/x86/kvm/cpuid.c | 2 -- arch/x86/kvm/svm.c | 8 +++- arch/x86/kvm/vmx.c | 10 -- 3 files changed, 7 insertions(+), 13 deletions(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 38a0afe..17b42fa 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -283,8 +283,6 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function, 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); /* cpuid 1.ecx */ const u32 kvm_supported_word4_x86_features = - /* NOTE: MONITOR (and MWAIT) are emulated as NOP, -* but *not* advertised to guests via CPUID ! 
*/ F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | 0 /* DS-CPL, VMX, SMX, EST */ | 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index ec8366c..0e8ef20 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -3274,7 +3274,7 @@ static int pause_interception(struct vcpu_svm *svm) return 1; } -static int nop_interception(struct vcpu_svm *svm) +static int invalid_op_interception(struct vcpu_svm *svm) { skip_emulated_instruction(&(svm->vcpu)); return 1; @@ -3282,14 +3282,12 @@ static int nop_interception(struct vcpu_svm *svm) static int monitor_interception(struct vcpu_svm *svm) { - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); - return nop_interception(svm); + return invalid_op_interception(svm); } static int mwait_interception(struct vcpu_svm *svm) { - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); - return nop_interception(svm); + return invalid_op_interception(svm); } static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 801332e..577c7df 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5672,22 +5672,20 @@ static int handle_pause(struct kvm_vcpu *vcpu) return 1; } -static int handle_nop(struct kvm_vcpu *vcpu) +static int handle_invalid_op(struct kvm_vcpu *vcpu) { - skip_emulated_instruction(vcpu); + kvm_queue_exception(vcpu, UD_VECTOR); return 1; } static int handle_mwait(struct kvm_vcpu *vcpu) { - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); - return handle_nop(vcpu); + return handle_invalid_op(vcpu); } static int handle_monitor(struct kvm_vcpu *vcpu) { - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); - return handle_nop(vcpu); + return handle_invalid_op(vcpu); } /* -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
On Wed, Jun 18, 2014 at 11:30:07AM -0700, Eric Northup wrote: > Quoting Gabriel's post http://www.spinics.net/lists/kvm/msg103792.html : > > [...] > > > E.g., OS X 10.5 *does* check CPUID, and panics if it doesn't find it. > > It needs the MONITOR cpuid flag to be on, *and* the actual > > instructions to work. That was an argument in favor of finding a mechanism to allow (qemu) users to enable an otherwise default-off monitor cpuid flag. We definitely don't want to advertise monitor/mwait availability to guests which would otherwise sanely fail back to a hlt-based idle loop when cpuid tells them monitor/mwait are not available :) However, check my earlier proposal of backing out of monitor/mwait entirely (as it turns out, there's a kernel command line to tell OS X not to use monitor/mwait, which is IMHO vastly preferable to creating "undocumented" KVM hacks :) Thanks much, --Gabriel -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] KVM: x86: correct mwait and monitor emulation
"Gabriel L. Somlo" writes: > On Wed, Jun 18, 2014 at 10:59:14AM -0700, Eric Northup wrote: >> On Wed, Jun 18, 2014 at 7:19 AM, Nadav Amit wrote: >> > mwait and monitor are currently handled as nop. Considering this behavior, >> > they >> > should still be handled correctly, i.e., check execution conditions and >> > generate >> > exceptions when required. mwait and monitor may also be executed in >> > real-mode >> > and are not handled in that case. This patch performs the emulation of >> > monitor-mwait according to Intel SDM (other than checking whether >> > interrupt can >> > be used as a break event). >> > >> > Signed-off-by: Nadav Amit > > How about this instead (details in the commit log below) ? Please let > me know what you think, and if you'd prefer me to send it out as a > separate patch rather than a reply to this thread. I am not saying we should "punish" guests that don't do the right thing. But if we can't emulate it the right way, we emulate it ? And the fact that a workaround exists for the guest is all the more better :) > Thanks, > --Gabriel > > > From 0375a0aceb54cdbc26a6c0e5b43c46324f830ec3 Mon Sep 17 00:00:00 2001 > From: "Gabriel L. Somlo" > Date: Wed, 18 Jun 2014 14:39:15 -0400 > Subject: [PATCH] kvm: x86: revert "emulate monitor and mwait instructions as > nop" > > This reverts commit 87c00572ba05aa8c9db118da75c608f47eb10b9e. > > OS X <= 10.7.* are the only known guests which realistically required > this functionality. As it turns out, OS X can be told to forego using > monitor/mwait by passing it "idlehalt=0" as a kernel argument, so we're > better off removing this hack from KVM altogether. > > Signed-off-by: Gabriel L. Somlo > --- > arch/x86/kvm/cpuid.c | 2 -- > arch/x86/kvm/svm.c | 8 +++- > arch/x86/kvm/vmx.c | 10 -- > 3 files changed, 7 insertions(+), 13 deletions(-) > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > index 38a0afe..17b42fa 100644 > --- a/arch/x86/kvm/cpuid.c > +++ b/arch/x86/kvm/cpuid.c > @@ -283,8 +283,6 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 > *entry, u32 function, > 0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW); > /* cpuid 1.ecx */ > const u32 kvm_supported_word4_x86_features = > - /* NOTE: MONITOR (and MWAIT) are emulated as NOP, > - * but *not* advertised to guests via CPUID ! 
*/ > F(XMM3) | F(PCLMULQDQ) | 0 /* DTES64, MONITOR */ | > 0 /* DS-CPL, VMX, SMX, EST */ | > 0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ | > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c > index ec8366c..0e8ef20 100644 > --- a/arch/x86/kvm/svm.c > +++ b/arch/x86/kvm/svm.c > @@ -3274,7 +3274,7 @@ static int pause_interception(struct vcpu_svm *svm) > return 1; > } > > -static int nop_interception(struct vcpu_svm *svm) > +static int invalid_op_interception(struct vcpu_svm *svm) > { > skip_emulated_instruction(&(svm->vcpu)); > return 1; > @@ -3282,14 +3282,12 @@ static int nop_interception(struct vcpu_svm *svm) > > static int monitor_interception(struct vcpu_svm *svm) > { > - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); > - return nop_interception(svm); > + return invalid_op_interception(svm); > } > > static int mwait_interception(struct vcpu_svm *svm) > { > - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); > - return nop_interception(svm); > + return invalid_op_interception(svm); > } > > static int (*const svm_exit_handlers[])(struct vcpu_svm *svm) = { > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 801332e..577c7df 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -5672,22 +5672,20 @@ static int handle_pause(struct kvm_vcpu *vcpu) > return 1; > } > > -static int handle_nop(struct kvm_vcpu *vcpu) > +static int handle_invalid_op(struct kvm_vcpu *vcpu) > { > - skip_emulated_instruction(vcpu); > + kvm_queue_exception(vcpu, UD_VECTOR); > return 1; > } > > static int handle_mwait(struct kvm_vcpu *vcpu) > { > - printk_once(KERN_WARNING "kvm: MWAIT instruction emulated as NOP!\n"); > - return handle_nop(vcpu); > + return handle_invalid_op(vcpu); > } > > static int handle_monitor(struct kvm_vcpu *vcpu) > { > - printk_once(KERN_WARNING "kvm: MONITOR instruction emulated as NOP!\n"); > - return handle_nop(vcpu); > + return handle_invalid_op(vcpu); > } > > /* -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/11] pvqspinlock, x86: Rename paravirt_ticketlocks_enabled
On Sun, Jun 15, 2014 at 02:47:06PM +0200, Peter Zijlstra wrote: > From: Waiman Long > > This patch renames the paravirt_ticketlocks_enabled static key to a > more generic paravirt_spinlocks_enabled name. > > Signed-off-by: Waiman Long > Signed-off-by: Peter Zijlstra Acked-by: Konrad Rzeszutek Wilk > --- > arch/x86/include/asm/spinlock.h |4 ++-- > arch/x86/kernel/kvm.c|2 +- > arch/x86/kernel/paravirt-spinlocks.c |4 ++-- > arch/x86/xen/spinlock.c |2 +- > 4 files changed, 6 insertions(+), 6 deletions(-) > > --- a/arch/x86/include/asm/spinlock.h > +++ b/arch/x86/include/asm/spinlock.h > @@ -39,7 +39,7 @@ > /* How long a lock should spin before we consider blocking */ > #define SPIN_THRESHOLD (1 << 15) > > -extern struct static_key paravirt_ticketlocks_enabled; > +extern struct static_key paravirt_spinlocks_enabled; > static __always_inline bool static_key_false(struct static_key *key); > > #ifdef CONFIG_QUEUE_SPINLOCK > @@ -150,7 +150,7 @@ static inline void __ticket_unlock_slowp > static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) > { > if (TICKET_SLOWPATH_FLAG && > - static_key_false(¶virt_ticketlocks_enabled)) { > + static_key_false(¶virt_spinlocks_enabled)) { > arch_spinlock_t prev; > > prev = *lock; > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -819,7 +819,7 @@ static __init int kvm_spinlock_init_jump > if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) > return 0; > > - static_key_slow_inc(¶virt_ticketlocks_enabled); > + static_key_slow_inc(¶virt_spinlocks_enabled); > printk(KERN_INFO "KVM setup paravirtual spinlock\n"); > > return 0; > --- a/arch/x86/kernel/paravirt-spinlocks.c > +++ b/arch/x86/kernel/paravirt-spinlocks.c > @@ -16,5 +16,5 @@ struct pv_lock_ops pv_lock_ops = { > }; > EXPORT_SYMBOL(pv_lock_ops); > > -struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE; > -EXPORT_SYMBOL(paravirt_ticketlocks_enabled); > +struct static_key paravirt_spinlocks_enabled = STATIC_KEY_INIT_FALSE; > +EXPORT_SYMBOL(paravirt_spinlocks_enabled); > --- a/arch/x86/xen/spinlock.c > +++ b/arch/x86/xen/spinlock.c > @@ -293,7 +293,7 @@ static __init int xen_init_spinlocks_jum > if (!xen_domain()) > return 0; > > - static_key_slow_inc(¶virt_ticketlocks_enabled); > + static_key_slow_inc(¶virt_spinlocks_enabled); > return 0; > } > early_initcall(xen_init_spinlocks_jump); > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/11] qspinlock: Revert to test-and-set on hypervisors
On Sun, Jun 15, 2014 at 02:47:05PM +0200, Peter Zijlstra wrote: > When we detect a hypervisor (!paravirt, see later patches), revert to Please spell out the name of the patches. > a simple test-and-set lock to avoid the horrors of queue preemption. Heheh. > > Signed-off-by: Peter Zijlstra > --- > arch/x86/include/asm/qspinlock.h | 14 ++ > include/asm-generic/qspinlock.h |7 +++ > kernel/locking/qspinlock.c |3 +++ > 3 files changed, 24 insertions(+) > > --- a/arch/x86/include/asm/qspinlock.h > +++ b/arch/x86/include/asm/qspinlock.h > @@ -1,6 +1,7 @@ > #ifndef _ASM_X86_QSPINLOCK_H > #define _ASM_X86_QSPINLOCK_H > > +#include > #include > > #if !defined(CONFIG_X86_OOSTORE) && !defined(CONFIG_X86_PPRO_FENCE) > @@ -20,6 +21,19 @@ static inline void queue_spin_unlock(str > > #endif /* !CONFIG_X86_OOSTORE && !CONFIG_X86_PPRO_FENCE */ > > +#define virt_queue_spin_lock virt_queue_spin_lock > + > +static inline bool virt_queue_spin_lock(struct qspinlock *lock) > +{ > + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > + return false; > + > + while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0) > + cpu_relax(); > + > + return true; > +} > + > #include > > #endif /* _ASM_X86_QSPINLOCK_H */ > --- a/include/asm-generic/qspinlock.h > +++ b/include/asm-generic/qspinlock.h > @@ -98,6 +98,13 @@ static __always_inline void queue_spin_u > } > #endif > > +#ifndef virt_queue_spin_lock > +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock) > +{ > + return false; > +} > +#endif > + > /* > * Initializier > */ > --- a/kernel/locking/qspinlock.c > +++ b/kernel/locking/qspinlock.c > @@ -247,6 +247,9 @@ void queue_spin_lock_slowpath(struct qsp > > BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); > > + if (virt_queue_spin_lock(lock)) > + return; > + > /* >* wait for in-progress pending->locked hand-overs >* > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/11] qspinlock: Use a simple write to grab the lock, if applicable
On Sun, Jun 15, 2014 at 02:47:04PM +0200, Peter Zijlstra wrote: > From: Waiman Long > > Currently, atomic_cmpxchg() is used to get the lock. However, this is > not really necessary if there is more than one task in the queue and > the queue head don't need to reset the queue code word. For that case, s/queue code word/tail {number,value}/ ? > a simple write to set the lock bit is enough as the queue head will > be the only one eligible to get the lock as long as it checks that > both the lock and pending bits are not set. The current pending bit > waiting code will ensure that the bit will not be set as soon as the > queue code word (tail) in the lock is set. Just use the same word as above. > > With that change, the are some slight improvement in the performance > of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket > Westere-EX machine as shown in the tables below. > > [Standalone/Embedded - same node] > # of tasks Before patchAfter patch %Change > -- --- -- --- >3 2324/2321 2248/2265-3%/-2% >4 2890/2896 2819/2831-2%/-2% >5 3611/3595 3522/3512-2%/-2% >6 4281/4276 4173/4160-3%/-3% >7 5018/5001 4875/4861-3%/-3% >8 5759/5750 5563/5568-3%/-3% > > [Standalone/Embedded - different nodes] > # of tasks Before patchAfter patch %Change > -- --- -- --- >3 12242/12237 12087/12093 -1%/-1% >4 10688/10696 10507/10521 -2%/-2% > > It was also found that this change produced a much bigger performance > improvement in the newer IvyBridge-EX chip and was essentially to close > the performance gap between the ticket spinlock and queue spinlock. > > The disk workload of the AIM7 benchmark was run on a 4-socket > Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users > on a 3.14 based kernel. The results of the test runs were: > > AIM7 XFS Disk Test > kernel JPMReal Time Sys TimeUsr Time > - ---- > ticketlock56782333.17 96.61 5.81 > qspinlock 57507993.13 94.83 5.97 > > AIM7 EXT4 Disk Test > kernel JPMReal Time Sys TimeUsr Time > - ---- > ticketlock1114551 16.15 509.72 7.11 > qspinlock 21844668.24 232.99 6.01 > > The ext4 filesystem run had a much higher spinlock contention than > the xfs filesystem run. > > The "ebizzy -m" test was also run with the following results: > > kernel records/s Real Time Sys TimeUsr Time > -- - > ticketlock 2075 10.00 216.35 3.49 > qspinlock 3023 10.00 198.20 4.80 > > Signed-off-by: Waiman Long > Signed-off-by: Peter Zijlstra > --- > kernel/locking/qspinlock.c | 59 > - > 1 file changed, 43 insertions(+), 16 deletions(-) > > --- a/kernel/locking/qspinlock.c > +++ b/kernel/locking/qspinlock.c > @@ -93,24 +93,33 @@ static inline struct mcs_spinlock *decod > * By using the whole 2nd least significant byte for the pending bit, we > * can allow better optimization of the lock acquisition for the pending > * bit holder. > + * > + * This internal structure is also used by the set_locked function which > + * is not restricted to _Q_PENDING_BITS == 8. > */ > -#if _Q_PENDING_BITS == 8 > - > struct __qspinlock { > union { > atomic_t val; > - struct { > #ifdef __LITTLE_ENDIAN > + u8 locked; > + struct { > u16 locked_pending; > u16 tail; > + }; > #else > + struct { > u16 tail; > u16 locked_pending; > -#endif > }; > + struct { > + u8 reserved[3]; > + u8 locked; > + }; > +#endif > }; > }; > > +#if _Q_PENDING_BITS == 8 > /** > * clear_pending_set_locked - take ownership and clear the pending bit. 
> * @lock: Pointer to queue spinlock structure > @@ -197,6 +206,19 @@ static __always_inline u32 xchg_tail(str > #endif /* _Q_PENDING_BITS == 8 */ > > /** > + * set_locked - Set the lock bit and own the lock Full stop missing. > + * @lock: Pointer to queue spinlock structure Ditto. > + * > + * *,*,0 -> *,0,1 > + */ > +static __always_inline void set_locked(struct qspinlock *lock) > +{ > + struct __qspinlock *l = (void *)lock; > +
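Going by the changelog, set_locked() presumably reduces to a single byte store on the locked byte, along these lines (sketch, assuming the struct __qspinlock overlay shown earlier in the patch):

static __always_inline void set_locked(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	ACCESS_ONCE(l->locked) = _Q_LOCKED_VAL;		/* *,*,0 -> *,0,1 */
}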
Re: [PATCH 05/11] qspinlock: Optimize for smaller NR_CPUS
On Sun, Jun 15, 2014 at 02:47:02PM +0200, Peter Zijlstra wrote: > From: Peter Zijlstra > > When we allow for a max NR_CPUS < 2^14 we can optimize the pending > wait-acquire and the xchg_tail() operations. > > By growing the pending bit to a byte, we reduce the tail to 16bit. > This means we can use xchg16 for the tail part and do away with all > the repeated compxchg() operations. > > This in turn allows us to unconditionally acquire; the locked state > as observed by the wait loops cannot change. And because both locked > and pending are now a full byte we can use simple stores for the > state transition, obviating one atomic operation entirely. I have to ask - how much more performance do you get from this? Is this extra atomic operation hurting that much? > > All this is horribly broken on Alpha pre EV56 (and any other arch that > cannot do single-copy atomic byte stores). > > Signed-off-by: Peter Zijlstra > --- > include/asm-generic/qspinlock_types.h | 13 > kernel/locking/qspinlock.c| 103 > ++ > 2 files changed, 106 insertions(+), 10 deletions(-) > > --- a/include/asm-generic/qspinlock_types.h > +++ b/include/asm-generic/qspinlock_types.h > @@ -38,6 +38,14 @@ typedef struct qspinlock { > /* > * Bitfields in the atomic value: > * > + * When NR_CPUS < 16K > + * 0- 7: locked byte > + * 8: pending > + * 9-15: not used > + * 16-17: tail index > + * 18-31: tail cpu (+1) > + * > + * When NR_CPUS >= 16K > * 0- 7: locked byte > * 8: pending > * 9-10: tail index > @@ -50,7 +58,11 @@ typedef struct qspinlock { > #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) > > #define _Q_PENDING_OFFSET(_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) > +#if CONFIG_NR_CPUS < (1U << 14) > +#define _Q_PENDING_BITS 8 > +#else > #define _Q_PENDING_BITS 1 > +#endif > #define _Q_PENDING_MASK _Q_SET_MASK(PENDING) > > #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) > @@ -61,6 +73,7 @@ typedef struct qspinlock { > #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) > #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) > > +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET > #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) > > #define _Q_LOCKED_VAL(1U << _Q_LOCKED_OFFSET) > --- a/kernel/locking/qspinlock.c > +++ b/kernel/locking/qspinlock.c > @@ -22,6 +22,7 @@ > #include > #include > #include > +#include > #include > > /* > @@ -48,6 +49,9 @@ > * We can further change the first spinner to spin on a bit in the lock word > * instead of its node; whereby avoiding the need to carry a node from lock > to > * unlock, and preserving API. > + * > + * N.B. The current implementation only supports architectures that allow > + * atomic operations on smaller 8-bit and 16-bit data types. > */ > > #include "mcs_spinlock.h" > @@ -85,6 +89,87 @@ static inline struct mcs_spinlock *decod > > #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) > > +/* > + * By using the whole 2nd least significant byte for the pending bit, we > + * can allow better optimization of the lock acquisition for the pending > + * bit holder. > + */ > +#if _Q_PENDING_BITS == 8 > + > +struct __qspinlock { > + union { > + atomic_t val; > + struct { > +#ifdef __LITTLE_ENDIAN > + u16 locked_pending; > + u16 tail; > +#else > + u16 tail; > + u16 locked_pending; > +#endif > + }; > + }; > +}; > + > +/** > + * clear_pending_set_locked - take ownership and clear the pending bit. 
> + * @lock: Pointer to queue spinlock structure > + * @val : Current value of the queue spinlock 32-bit word > + * > + * *,1,0 -> *,0,1 > + * > + * Lock stealing is not allowed if this function is used. > + */ > +static __always_inline void > +clear_pending_set_locked(struct qspinlock *lock, u32 val) > +{ > + struct __qspinlock *l = (void *)lock; > + > + ACCESS_ONCE(l->locked_pending) = _Q_LOCKED_VAL; > +} > + > +/* > + * xchg_tail - Put in the new queue tail code word & retrieve previous one Missing full stop. > + * @lock : Pointer to queue spinlock structure > + * @tail : The new queue tail code word > + * Return: The previous queue tail code word > + * > + * xchg(lock, tail) > + * > + * p,*,* -> n,*,* ; prev = xchg(lock, node) > + */ > +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) > +{ > + struct __qspinlock *l = (void *)lock; > + > + return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET; > +} > + > +#else /* _Q_PENDING_BITS == 8 */ > + > +/** > + * clear_pending_set_locked - take ownership and clear the pending bit. > + * @lock: Pointer to queue spinlock structure > + * @val : Current value of the queue spinlock 32-bit word > + * > + * *,1,0 -> *,0,1 > + */ > +sta
[qom-cpu PATCH 1/3] target-i386: Disable CPUID_ACPI by default on KVM mode
KVM never supported the CPUID_ACPI flag, so it doesn't make sense to have it enabled by default when KVM is enabled. The motivation here is exactly the same we had for the MONITOR flag. And like on the MONITOR flag case, we don't need machine-type compat code because it is currently impossible to run a KVM VM with the ACPI flag set. Signed-off-by: Eduardo Habkost --- target-i386/cpu.c | 1 + 1 file changed, 1 insertion(+) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index de09ca2..8de1566 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -461,6 +461,7 @@ static uint32_t kvm_default_features[FEATURE_WORDS] = { /* Features that are not added by default to any CPU model when KVM is enabled. */ static uint32_t kvm_default_unset_features[FEATURE_WORDS] = { +[FEAT_1_EDX] = CPUID_ACPI, [FEAT_1_ECX] = CPUID_EXT_MONITOR, }; -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[qom-cpu PATCH 3/3] target-i386: Don't enable nested VMX by default
TCG doesn't support VMX, and nested VMX is not enabled by default on the KVM kernel module. So, there's no reason to have VMX enabled by default on the core2duo and coreduo CPU models, today. Even the newer Intel CPU model definitions don't have it enabled. In this case, we need machine-type compat code, as people may be running the older machine-types on hosts that had VMX nesting enabled. Signed-off-by: Eduardo Habkost --- hw/i386/pc_piix.c | 2 ++ hw/i386/pc_q35.c | 2 ++ target-i386/cpu.c | 8 3 files changed, 8 insertions(+), 4 deletions(-) diff --git a/hw/i386/pc_piix.c b/hw/i386/pc_piix.c index a48e263..61882d5 100644 --- a/hw/i386/pc_piix.c +++ b/hw/i386/pc_piix.c @@ -267,6 +267,8 @@ static void pc_init_pci(MachineState *machine) static void pc_compat_2_0(MachineState *machine) { smbios_legacy_mode = true; +x86_cpu_compat_set_features("coreduo", FEAT_1_ECX, CPUID_EXT_VMX, 0); +x86_cpu_compat_set_features("core2duo", FEAT_1_ECX, CPUID_EXT_VMX, 0); } static void pc_compat_1_7(MachineState *machine) diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c index b3c02c1..3949267 100644 --- a/hw/i386/pc_q35.c +++ b/hw/i386/pc_q35.c @@ -245,6 +245,8 @@ static void pc_q35_init(MachineState *machine) static void pc_compat_2_0(MachineState *machine) { smbios_legacy_mode = true; +x86_cpu_compat_set_features("coreduo", FEAT_1_ECX, CPUID_EXT_VMX, 0); +x86_cpu_compat_set_features("core2duo", FEAT_1_ECX, CPUID_EXT_VMX, 0); } static void pc_compat_1_7(MachineState *machine) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 2f32d29..6bd44e1 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -719,10 +719,10 @@ static X86CPUDefinition builtin_x86_defs[] = { CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | CPUID_PSE36 | CPUID_VME | CPUID_ACPI | CPUID_SS, /* Missing: CPUID_EXT_DTES64, CPUID_EXT_DSCPL, CPUID_EXT_EST, - * CPUID_EXT_TM2, CPUID_EXT_XTPR, CPUID_EXT_PDCM */ + * CPUID_EXT_TM2, CPUID_EXT_XTPR, CPUID_EXT_PDCM, CPUID_EXT_VMX */ .features[FEAT_1_ECX] = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | -CPUID_EXT_VMX | CPUID_EXT_CX16, +CPUID_EXT_CX16, .features[FEAT_8000_0001_EDX] = CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, .features[FEAT_8000_0001_ECX] = @@ -803,9 +803,9 @@ static X86CPUDefinition builtin_x86_defs[] = { CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | CPUID_ACPI | CPUID_SS, /* Missing: CPUID_EXT_EST, CPUID_EXT_TM2 , CPUID_EXT_XTPR, - * CPUID_EXT_PDCM */ + * CPUID_EXT_PDCM, CPUID_EXT_VMX */ .features[FEAT_1_ECX] = -CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_VMX, +CPUID_EXT_SSE3 | CPUID_EXT_MONITOR, .features[FEAT_8000_0001_EDX] = CPUID_EXT2_NX, .xlevel = 0x8008, -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[qom-cpu PATCH 2/3] target-i386: Remove unsupported bits from all CPU models
The following CPU features were never supported by neither TCG or KVM, so they are useless on the CPU model definitions, today: * CPUID_DTS (DS) * CPUID_HT * CPUID_TM * CPUID_PBE * CPUID_EXT_DTES64 * CPUID_EXT_DSCPL * CPUID_EXT_EST * CPUID_EXT_TM2 * CPUID_EXT_XTPR * CPUID_EXT_PDCM * CPUID_SVM_LBRV As using "enforce" mode is the only way to ensure guest ABI doesn't change when moving to a different host, we should make "enforce" mode the default or at least encourage management software to always use it. In turn, to make "enforce" usable, we need CPU models that work without always requiring some features to be explicitly disabled. This patch removes the above features from all CPU model definitions. We won't need any machine-type compat code for those changes, because it is impossible to have existing VMs with those features enabled. Signed-off-by: Eduardo Habkost Cc: Aurelien Jarno --- target-i386/cpu.c | 33 - 1 file changed, 20 insertions(+), 13 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 8de1566..2f32d29 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -680,10 +680,11 @@ static X86CPUDefinition builtin_x86_defs[] = { .family = 16, .model = 2, .stepping = 3, +/* MIssing: CPUID_HT */ .features[FEAT_1_EDX] = PPRO_FEATURES | CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | -CPUID_PSE36 | CPUID_VME | CPUID_HT, +CPUID_PSE36 | CPUID_VME, .features[FEAT_1_ECX] = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_CX16 | CPUID_EXT_POPCNT, @@ -699,8 +700,9 @@ static X86CPUDefinition builtin_x86_defs[] = { .features[FEAT_8000_0001_ECX] = CPUID_EXT3_LAHF_LM | CPUID_EXT3_SVM | CPUID_EXT3_ABM | CPUID_EXT3_SSE4A, +/* Missing: CPUID_SVM_LBRV */ .features[FEAT_SVM] = -CPUID_SVM_NPT | CPUID_SVM_LBRV, +CPUID_SVM_NPT, .xlevel = 0x801A, .model_id = "AMD Phenom(tm) 9550 Quad-Core Processor" }, @@ -711,15 +713,16 @@ static X86CPUDefinition builtin_x86_defs[] = { .family = 6, .model = 15, .stepping = 11, +/* Missing: CPUID_DTS, CPUID_HT, CPUID_TM, CPUID_PBE */ .features[FEAT_1_EDX] = PPRO_FEATURES | CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | -CPUID_PSE36 | CPUID_VME | CPUID_DTS | CPUID_ACPI | CPUID_SS | -CPUID_HT | CPUID_TM | CPUID_PBE, +CPUID_PSE36 | CPUID_VME | CPUID_ACPI | CPUID_SS, +/* Missing: CPUID_EXT_DTES64, CPUID_EXT_DSCPL, CPUID_EXT_EST, + * CPUID_EXT_TM2, CPUID_EXT_XTPR, CPUID_EXT_PDCM */ .features[FEAT_1_ECX] = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | -CPUID_EXT_DTES64 | CPUID_EXT_DSCPL | CPUID_EXT_VMX | CPUID_EXT_EST | -CPUID_EXT_TM2 | CPUID_EXT_CX16 | CPUID_EXT_XTPR | CPUID_EXT_PDCM, +CPUID_EXT_VMX | CPUID_EXT_CX16, .features[FEAT_8000_0001_EDX] = CPUID_EXT2_LM | CPUID_EXT2_SYSCALL | CPUID_EXT2_NX, .features[FEAT_8000_0001_ECX] = @@ -794,13 +797,15 @@ static X86CPUDefinition builtin_x86_defs[] = { .family = 6, .model = 14, .stepping = 8, +/* Missing: CPUID_DTS, CPUID_HT, CPUID_TM, CPUID_PBE */ .features[FEAT_1_EDX] = PPRO_FEATURES | CPUID_VME | -CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | CPUID_DTS | CPUID_ACPI | -CPUID_SS | CPUID_HT | CPUID_TM | CPUID_PBE, +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | CPUID_ACPI | +CPUID_SS, +/* Missing: CPUID_EXT_EST, CPUID_EXT_TM2 , CPUID_EXT_XTPR, + * CPUID_EXT_PDCM */ .features[FEAT_1_ECX] = -CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_VMX | -CPUID_EXT_EST | CPUID_EXT_TM2 | CPUID_EXT_XTPR | CPUID_EXT_PDCM, +CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_VMX, .features[FEAT_8000_0001_EDX] = CPUID_EXT2_NX, .xlevel = 0x8008, @@ -873,14 +878,16 @@ static X86CPUDefinition builtin_x86_defs[] = { .family = 6, .model = 28, .stepping = 
2, +/* Missing: CPUID_DTS, CPUID_HT, CPUID_TM, CPUID_PBE */ .features[FEAT_1_EDX] = PPRO_FEATURES | -CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | CPUID_VME | CPUID_DTS | -CPUID_ACPI | CPUID_SS | CPUID_HT | CPUID_TM | CPUID_PBE, +CPUID_MTRR | CPUID_CLFLUSH | CPUID_MCA | CPUID_VME | +CPUID_ACPI | CPUID_SS, /* Some CPUs got no CPUID_SEP */ +/* Missing: CPUID_EXT_DSCPL, CPUID_EXT_EST, CPUID_EXT_TM2, + * CPUID_EXT_XTPR */ .features[FEAT_1_ECX] = CPUID_EXT_SSE3 | CPUID_EXT_MONITOR | CPUID_EXT_SSSE3 | -CPUID_EXT_DSCPL | CPUID_EXT_EST | CPUID_EXT_TM2 | CPUID_EXT_XTPR | CPU
[qom-cpu PATCH 0/3] target-i386: Make most CPU models work with "enforce" out of the box
Most of the bits that make "enforce" break were introduced in 2010 by commit 8560efed6a72a816c0115f41ddb9d79f7ce63f28. The intention behind that commit made sense; the only problem is that we can't guarantee guest ABI stability across hosts if we simply rely on trimming of CPU features based on host capabilities. So, this series removes CPUID bits from the CPU model definitions so they become defaults that: 1) won't unexpectedly stop working when we start using the "enforce" flag; 2) won't silently break the guest ABI when TCG or KVM start supporting new features. There's only one non-trivial case left: the qemu32/qemu64 models. The problem with them is that we have conflicting expectations about them from different users: TCG users expect the default CPU model to contain most TCG-supported features (and it makes sense). See, for example, commit f1e00a9cf326acc1f2386a72525af8859852e1df. KVM users expect the default CPU model to be a conservative choice which will work on most host CPUs (and will only contain features that are supported by KVM). We could solve the qemu32/qemu64 issue by having different defaults for TCG and KVM. But we have existing management code (libvirt) that already expects qemu32 or qemu64 to be the default, and changing the default would break that code. I will send an RFC to address that later. Cc: Aurelien Jarno Cc: Paolo Bonzini Cc: kvm@vger.kernel.org Eduardo Habkost (3): target-i386: Disable CPUID_ACPI by default on KVM mode target-i386: Remove unsupported bits from all CPU models target-i386: Don't enable nested VMX by default hw/i386/pc_piix.c | 2 ++ hw/i386/pc_q35.c | 2 ++ target-i386/cpu.c | 34 +- 3 files changed, 25 insertions(+), 13 deletions(-) -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
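For context, the behaviour this series is building towards is what the existing "enforce" CPU flag already provides; a typical invocation (the CPU model name is chosen only as an example, it is not taken from the series) would be:

    qemu-system-x86_64 -enable-kvm -cpu Nehalem,enforce ...

With "enforce", QEMU refuses to start instead of silently dropping CPU model features that the host CPU or KVM cannot provide, which is what keeps the guest ABI stable across hosts.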
Re: [PATCH v3 -next 4/9] DMA, CMA: support arbitrary bitmap granularity
On Mon, 16 Jun 2014 14:40:46 +0900 Joonsoo Kim wrote: > PPC KVM's CMA area management requires arbitrary bitmap granularity, > since they want to reserve very large memory and manage this region > with bitmap that one bit for several pages to reduce management overheads. > So support arbitrary bitmap granularity for following generalization. > > ... > > --- a/drivers/base/dma-contiguous.c > +++ b/drivers/base/dma-contiguous.c > @@ -38,6 +38,7 @@ struct cma { > unsigned long base_pfn; > unsigned long count; > unsigned long *bitmap; > + unsigned int order_per_bit; /* Order of pages represented by one bit */ > struct mutexlock; > }; > > @@ -157,9 +158,37 @@ void __init dma_contiguous_reserve(phys_addr_t limit) > > static DEFINE_MUTEX(cma_mutex); > > +static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int > align_order) > +{ > + return (1 << (align_order >> cma->order_per_bit)) - 1; > +} Might want a "1UL << ..." here. > +static unsigned long cma_bitmap_maxno(struct cma *cma) > +{ > + return cma->count >> cma->order_per_bit; > +} > + > +static unsigned long cma_bitmap_pages_to_bits(struct cma *cma, > + unsigned long pages) > +{ > + return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit; > +} Ditto. I'm not really sure what the compiler will do in these cases, but would prefer not to rely on it anyway! --- a/drivers/base/dma-contiguous.c~dma-cma-support-arbitrary-bitmap-granularity-fix +++ a/drivers/base/dma-contiguous.c @@ -160,7 +160,7 @@ static DEFINE_MUTEX(cma_mutex); static unsigned long cma_bitmap_aligned_mask(struct cma *cma, int align_order) { - return (1 << (align_order >> cma->order_per_bit)) - 1; + return (1UL << (align_order >> cma->order_per_bit)) - 1; } static unsigned long cma_bitmap_maxno(struct cma *cma) @@ -171,7 +171,7 @@ static unsigned long cma_bitmap_maxno(st static unsigned long cma_bitmap_pages_to_bits(struct cma *cma, unsigned long pages) { - return ALIGN(pages, 1 << cma->order_per_bit) >> cma->order_per_bit; + return ALIGN(pages, 1UL << cma->order_per_bit) >> cma->order_per_bit; } static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, int count) _ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
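The point of the 1UL suggestion above, shown as a small standalone user-space demo (this is not kernel code, and the shift count of 40 is only an example): the whole expression is evaluated in plain int before it is widened to the unsigned long it is assigned to, so shift counts of 32 and above are undefined behaviour unless the constant itself has unsigned long type.

#include <stdio.h>

int main(void)
{
	/* the arithmetic width is decided by the constant's type, not by
	 * the variable the result is assigned to */
	printf("sizeof(1 << 5)   = %zu\n", sizeof(1 << 5));   /* 4: int arithmetic */
	printf("sizeof(1UL << 5) = %zu\n", sizeof(1UL << 5)); /* 8 on 64-bit: long arithmetic */

	/* (1 << 40) - 1 would be undefined behaviour here, while the
	 * unsigned long version gives the intended mask */
	printf("(1UL << 40) - 1  = %#lx\n", (1UL << 40) - 1);
	return 0;
}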
Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code
On Tue, 17 Jun 2014 10:25:07 +0900 Joonsoo Kim wrote: > > >v2: > > > - Although this patchset looks very different with v1, the end result, > > > that is, mm/cma.c is same with v1's one. So I carry Ack to patch 6-7. > > > > > >This patchset is based on linux-next 20140610. > > > > Thanks for taking care of this. I will test it with my setup and if > > everything goes well, I will take it to my -next tree. If any branch > > is required for anyone to continue his works on top of those patches, > > let me know, I will also prepare it. > > Hello, > > I'm glad to hear that. :) > But, there is one concern. As you already know, I am preparing further > patches (Aggressively allocate the pages on CMA reserved memory). It > may be highly related to MM branch and also slightly depends on these CMA > changes. In this case, what is the best strategy to merge this > patchset? IMHO, Andrew's tree is a more appropriate branch. If there is > no issue in this case, I am willing to develop further patches based > on your tree. That's probably easier. Marek, I'll merge these into -mm (and hence -next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git) and shall hold them pending your review/ack/test/etc, OK? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [qom-cpu PATCH 2/3] target-i386: Remove unsupported bits from all CPU models
On 06/18/2014 01:55 PM, Eduardo Habkost wrote: > The following CPU features were never supported by neither TCG or KVM, > so they are useless on the CPU model definitions, today: > The overall idea of this series makes sense to me (yes, I'd love to get libvirt to the point that we can use enforce mode), but I decline to review the actual contents (it's a bit over my head how all the models work) and leave it to the experts. But here's a trivial finding: > diff --git a/target-i386/cpu.c b/target-i386/cpu.c > index 8de1566..2f32d29 100644 > --- a/target-i386/cpu.c > +++ b/target-i386/cpu.c > @@ -680,10 +680,11 @@ static X86CPUDefinition builtin_x86_defs[] = { > .family = 16, > .model = 2, > .stepping = 3, > +/* MIssing: CPUID_HT */ s/MIssing/Missing/ -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
[patch 5/5] KVM: MMU: pinned sps are not candidates for deletion.
Skip pinned shadow pages when selecting pages to zap. Signed-off-by: Marcelo Tosatti --- arch/x86/kvm/mmu.c | 26 ++ 1 file changed, 18 insertions(+), 8 deletions(-) Index: kvm/arch/x86/kvm/mmu.c === --- kvm.orig/arch/x86/kvm/mmu.c +++ kvm/arch/x86/kvm/mmu.c @@ -2279,16 +2279,24 @@ static void kvm_mmu_commit_zap_page(stru static bool prepare_zap_oldest_mmu_page(struct kvm *kvm, struct list_head *invalid_list) { - struct kvm_mmu_page *sp; - - if (list_empty(&kvm->arch.active_mmu_pages)) - return false; + struct kvm_mmu_page *sp, *nsp; + LIST_HEAD(pinned_list); - sp = list_entry(kvm->arch.active_mmu_pages.prev, - struct kvm_mmu_page, link); - kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); + list_for_each_entry_safe_reverse(sp, nsp, +&kvm->arch.active_mmu_pages, link) { + if (sp->pinned) { + list_move(&sp->link, &pinned_list); + continue; + } + if (!list_empty(&pinned_list)) + list_move(&pinned_list, &kvm->arch.active_mmu_pages); + kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); + return true; + } - return true; + if (!list_empty(&pinned_list)) + list_move(&pinned_list, &kvm->arch.active_mmu_pages); + return false; } /* @@ -4660,6 +4668,8 @@ void kvm_mmu_invalidate_zap_all_pages(st * Notify all vcpus to reload its shadow page table * and flush TLB. Then all vcpus will switch to new * shadow page table with the new mmu_valid_gen. +* MMU reload request also forces fault of +* sptes for pinned ranges. * * Note: we should do this under the protection of * mmu-lock, otherwise, vcpu would purge shadow page -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 3/5] KVM: MMU: notifiers support for pinned sptes
Request KVM_REQ_MMU_RELOAD when deleting sptes from MMU notifiers. Keep pinned sptes intact if page aging. Signed-off-by: Marcelo Tosatti --- arch/x86/kvm/mmu.c | 71 ++--- 1 file changed, 62 insertions(+), 9 deletions(-) Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c === --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c2014-06-18 17:28:24.339435654 -0300 +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c 2014-06-18 17:29:32.510225755 -0300 @@ -1184,6 +1184,42 @@ kvm_flush_remote_tlbs(vcpu->kvm); } +static void ack_flush(void *_completed) +{ +} + +static void mmu_reload_pinned_vcpus(struct kvm *kvm) +{ + int i, cpu, me; + cpumask_var_t cpus; + struct kvm_vcpu *vcpu; + unsigned int req = KVM_REQ_MMU_RELOAD; + + zalloc_cpumask_var(&cpus, GFP_ATOMIC); + + me = get_cpu(); + kvm_for_each_vcpu(i, vcpu, kvm) { + if (list_empty(&vcpu->arch.pinned_mmu_pages)) + continue; + kvm_make_request(req, vcpu); + cpu = vcpu->cpu; + + /* Set ->requests bit before we read ->mode */ + smp_mb(); + + if (cpus != NULL && cpu != -1 && cpu != me && + kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE) + cpumask_set_cpu(cpu, cpus); + } + if (unlikely(cpus == NULL)) + smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1); + else if (!cpumask_empty(cpus)) + smp_call_function_many(cpus, ack_flush, NULL, 1); + put_cpu(); + free_cpumask_var(cpus); + return; +} + /* * Write-protect on the specified @sptep, @pt_protect indicates whether * spte write-protection is caused by protecting shadow page table. @@ -1276,7 +1312,8 @@ } static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp, - struct kvm_memory_slot *slot, unsigned long data) + struct kvm_memory_slot *slot, unsigned long data, + bool age) { u64 *sptep; struct rmap_iterator iter; @@ -1286,6 +1323,14 @@ BUG_ON(!(*sptep & PT_PRESENT_MASK)); rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", sptep, *sptep); + if (is_pinned_spte(*sptep)) { + /* don't nuke pinned sptes if page aging: return +* young=yes instead. +*/ + if (age) + return 1; + mmu_reload_pinned_vcpus(kvm); + } drop_spte(kvm, sptep); need_tlb_flush = 1; } @@ -1294,7 +1339,8 @@ } static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp, -struct kvm_memory_slot *slot, unsigned long data) +struct kvm_memory_slot *slot, unsigned long data, +bool age) { u64 *sptep; struct rmap_iterator iter; @@ -1312,6 +1358,9 @@ need_flush = 1; + if (is_pinned_spte(*sptep)) + mmu_reload_pinned_vcpus(kvm); + if (pte_write(*ptep)) { drop_spte(kvm, sptep); sptep = rmap_get_first(*rmapp, &iter); @@ -1342,7 +1391,8 @@ int (*handler)(struct kvm *kvm, unsigned long *rmapp, struct kvm_memory_slot *slot, - unsigned long data)) + unsigned long data, + bool age)) { int j; int ret = 0; @@ -1382,7 +1432,7 @@ rmapp = __gfn_to_rmap(gfn_start, j, memslot); for (; idx <= idx_end; ++idx) - ret |= handler(kvm, rmapp++, memslot, data); + ret |= handler(kvm, rmapp++, memslot, data, false); } } @@ -1393,7 +1443,8 @@ unsigned long data, int (*handler)(struct kvm *kvm, unsigned long *rmapp, struct kvm_memory_slot *slot, -unsigned long data)) +unsigned long data, +bool age)) { return kvm_handle_hva_range(kvm, hva, hva + 1, data, handler); } @@ -1414,7 +1465,8 @@ } static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp, -struct kvm_memory_slot *slot, unsigned long data) +struct kvm_memory_slot *slot, unsigned long data, +
[patch 1/5] KVM: x86: add pinned parameter to page_fault methods
To be used by next patch. Signed-off-by: Marcelo Tosatti --- arch/x86/include/asm/kvm_host.h |2 +- arch/x86/kvm/mmu.c | 11 ++- arch/x86/kvm/paging_tmpl.h |2 +- arch/x86/kvm/x86.c |2 +- 4 files changed, 9 insertions(+), 8 deletions(-) Index: kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h === --- kvm.pinned-sptes.orig/arch/x86/include/asm/kvm_host.h 2014-06-18 17:27:47.579549247 -0300 +++ kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h2014-06-18 17:28:17.549456614 -0300 @@ -259,7 +259,7 @@ unsigned long (*get_cr3)(struct kvm_vcpu *vcpu); u64 (*get_pdptr)(struct kvm_vcpu *vcpu, int index); int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, - bool prefault); + bool prefault, bool pin, bool *pinned); void (*inject_page_fault)(struct kvm_vcpu *vcpu, struct x86_exception *fault); gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access, Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c === --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c2014-06-18 17:27:47.582549238 -0300 +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c 2014-06-18 17:28:17.550456611 -0300 @@ -2899,7 +2899,7 @@ static void make_mmu_pages_available(struct kvm_vcpu *vcpu); static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code, -gfn_t gfn, bool prefault) +gfn_t gfn, bool prefault, bool pin, bool *pinned) { int r; int level; @@ -3299,7 +3299,8 @@ } static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva, - u32 error_code, bool prefault) + u32 error_code, bool prefault, bool pin, + bool *pinned) { gfn_t gfn; int r; @@ -3323,7 +3324,7 @@ gfn = gva >> PAGE_SHIFT; return nonpaging_map(vcpu, gva & PAGE_MASK, -error_code, gfn, prefault); +error_code, gfn, prefault, pin, pinned); } static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn) @@ -3373,7 +3374,7 @@ } static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code, - bool prefault) + bool prefault, bool pin, bool *pinned) { pfn_t pfn; int r; @@ -4190,7 +4191,7 @@ int r, emulation_type = EMULTYPE_RETRY; enum emulation_result er; - r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false); + r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false, false, NULL); if (r < 0) goto out; Index: kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h === --- kvm.pinned-sptes.orig/arch/x86/kvm/paging_tmpl.h2014-06-18 17:27:47.583549234 -0300 +++ kvm.pinned-sptes/arch/x86/kvm/paging_tmpl.h 2014-06-18 17:28:17.550456611 -0300 @@ -687,7 +687,7 @@ * a negative value on error. */ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, -bool prefault) +bool prefault, bool pin, bool *pinned) { int write_fault = error_code & PFERR_WRITE_MASK; int user_fault = error_code & PFERR_USER_MASK; Index: kvm.pinned-sptes/arch/x86/kvm/x86.c === --- kvm.pinned-sptes.orig/arch/x86/kvm/x86.c2014-06-18 17:27:47.586549225 -0300 +++ kvm.pinned-sptes/arch/x86/kvm/x86.c 2014-06-18 17:28:17.552456605 -0300 @@ -7415,7 +7415,7 @@ work->arch.cr3 != vcpu->arch.mmu.get_cr3(vcpu)) return; - vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true); + vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true, false, NULL); } static inline u32 kvm_async_pf_hash_fn(gfn_t gfn) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)
Allow vcpus to pin spte translations by: 1) Creating a per-vcpu list of pinned ranges. 2) On mmu reload request: - Fault ranges. - Mark sptes with a pinned bit. - Mark shadow pages as pinned. 3) Then modify the following actions: - Page age => skip spte flush. - MMU notifiers => force mmu reload request (which kicks cpu out of guest mode). - GET_DIRTY_LOG => force mmu reload request. - SLAB shrinker => skip shadow page deletion. TDP-only. Signed-off-by: Marcelo Tosatti --- arch/x86/include/asm/kvm_host.h | 14 ++ arch/x86/kvm/mmu.c | 202 ++-- arch/x86/kvm/mmu.h |5 arch/x86/kvm/mmutrace.h | 23 arch/x86/kvm/paging_tmpl.h |2 arch/x86/kvm/x86.c |4 6 files changed, 241 insertions(+), 9 deletions(-) Index: kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h === --- kvm.pinned-sptes.orig/arch/x86/include/asm/kvm_host.h 2014-06-18 17:28:17.549456614 -0300 +++ kvm.pinned-sptes/arch/x86/include/asm/kvm_host.h2014-06-18 17:28:24.338435658 -0300 @@ -221,6 +221,8 @@ /* hold the gfn of each spte inside spt */ gfn_t *gfns; bool unsync; + bool pinned; + int root_count; /* Currently serving as active root */ unsigned int unsync_children; unsigned long parent_ptes; /* Reverse mapping for parent_pte */ @@ -337,6 +339,14 @@ KVM_DEBUGREG_WONT_EXIT = 2, }; +struct kvm_pinned_page_range { + gfn_t base_gfn; + unsigned long npages; + struct list_head link; +}; + +#define KVM_MAX_PER_VCPU_PINNED_RANGE 10 + struct kvm_vcpu_arch { /* * rip and regs accesses must go through @@ -392,6 +402,10 @@ struct kvm_mmu_memory_cache mmu_page_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; + struct list_head pinned_mmu_pages; + struct mutex pinned_mmu_mutex; + unsigned int nr_pinned_ranges; + struct fpu guest_fpu; u64 xcr0; u64 guest_supported_xcr0; Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c === --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c2014-06-18 17:28:17.550456611 -0300 +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c 2014-06-18 17:28:24.339435654 -0300 @@ -148,6 +148,9 @@ #define SPTE_HOST_WRITEABLE(1ULL << PT_FIRST_AVAIL_BITS_SHIFT) #define SPTE_MMU_WRITEABLE (1ULL << (PT_FIRST_AVAIL_BITS_SHIFT + 1)) +#define SPTE_PINNED(1ULL << (PT64_SECOND_AVAIL_BITS_SHIFT)) + +#define SPTE_PINNED_BIT PT64_SECOND_AVAIL_BITS_SHIFT #define SHADOW_PT_INDEX(addr, level) PT64_INDEX(addr, level) @@ -327,6 +330,11 @@ return pte & PT_PRESENT_MASK && !is_mmio_spte(pte); } +static int is_pinned_spte(u64 spte) +{ + return spte & SPTE_PINNED && is_shadow_present_pte(spte); +} + static int is_large_pte(u64 pte) { return pte & PT_PAGE_SIZE_MASK; @@ -2818,7 +2826,7 @@ * - false: let the real page fault path to fix it. 
*/ static bool fast_page_fault(struct kvm_vcpu *vcpu, gva_t gva, int level, - u32 error_code) + u32 error_code, bool pin) { struct kvm_shadow_walk_iterator iterator; struct kvm_mmu_page *sp; @@ -2828,6 +2836,9 @@ if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) return false; + if (pin) + return false; + if (!page_fault_can_be_fast(error_code)) return false; @@ -2895,9 +2906,55 @@ } static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn, -gva_t gva, pfn_t *pfn, bool write, bool *writable); +gva_t gva, pfn_t *pfn, bool write, bool *writable, +bool pin); static void make_mmu_pages_available(struct kvm_vcpu *vcpu); + +static int get_sptep_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes[4]) +{ + struct kvm_shadow_walk_iterator iterator; + int nr_sptes = 0; + + if (!VALID_PAGE(vcpu->arch.mmu.root_hpa)) + return nr_sptes; + + for_each_shadow_entry(vcpu, addr, iterator) { + sptes[iterator.level-1] = iterator.sptep; + nr_sptes++; + if (!is_shadow_present_pte(*iterator.sptep)) + break; + } + + return nr_sptes; +} + +static bool direct_pin_sptes(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + u64 *sptes[4]; + int r, i, level; + + r = get_sptep_hierarchy(vcpu, gfn << PAGE_SHIFT, sptes); + if (!r) + return false; + + level = 5 - r; + if (!is_last_spte(*sptes[r-1], level)) + return false; + if (!is_shadow_present_pte(*sptes[r-1])) + return false; + + for (i = 0; i
[patch 0/5] KVM: support for pinning sptes
Required by PEBS support as discussed at Subject: [patch 0/5] Implement PEBS virtualization for Silvermont Message-Id: <1401412327-14810-1-git-send-email-a...@firstfloor.org> Thread. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path
Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before deleting a pinned spte. Signed-off-by: Marcelo Tosatti --- arch/x86/kvm/mmu.c |3 +++ 1 file changed, 3 insertions(+) Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c === --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c2014-06-13 16:50:50.040140594 -0300 +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c 2014-06-13 16:51:05.620104451 -0300 @@ -1247,6 +1247,9 @@ spte &= ~SPTE_MMU_WRITEABLE; spte = spte & ~PT_WRITABLE_MASK; + if (is_pinned_spte(spte)) + mmu_reload_pinned_vcpus(kvm); + return mmu_spte_update(sptep, spte); } -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 06/19/2014 04:35 AM, Alex Williamson wrote: > On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote: >> VFIO exposes BARs to user space as a byte stream so userspace can >> read it using pread()/pwrite(). Since this is a byte stream, VFIO should >> not do byte swapping and simply return values as it gets them from >> PCI device. >> >> Instead, the existing code assumes that byte stream in read/write is >> little-endian and it fixes endianness for values which it passes to >> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is >> little endian and le32_to_cpu/... are stubs. > > vfio read32: > > val = cpu_to_le32(ioread32(io + off)); > > Where the typical x86 case, ioread32 is: > > #define ioread32(addr) readl(addr) > > and readl is: > > __le32_to_cpu(__raw_readl(addr)); > > So we do canceling byte swaps, which are both nops on x86, and end up > returning device endian, which we assume is little endian. > > vfio write32 is similar: > > iowrite32(le32_to_cpu(val), io + off); > > The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel > out, so input data is device endian, which is assumed little. > >> This also works for big endian but rather by an accident: it reads 4 bytes >> from the stream (@val is big endian), converts to CPU format (which should >> be big endian) as it was little endian (@val becomes actually little >> endian) and calls iowrite32() which does not do swapping on big endian >> system. > > Really? > > In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around > writel(), which seems to use the generic implementation, which does > include a cpu_to_le32. Ouch, wrong comment. iowrite32() does swapping. My bad. > > I also see other big endian archs like parisc doing cpu_to_le32 on > iowrite32, so I don't think this statement is true. I imagine it's > probably working for you because the swap cancel. > >> This removes byte swapping and makes use ioread32be/iowrite32be >> (and 16bit versions) on big-endian systems. The "be" helpers take >> native endian values and do swapping at the moment of writing to a PCI >> register using one of "store byte-reversed" instructions. > > So now you want iowrite32() on little endian and iowrite32be() on big > endian, the former does a cpu_to_le32 (which is a nop on little endian) > and the latter does a cpu_to_be32 (which is a nop on big endian)... > should we just be using __raw_writel() on both? We can do that too. The beauty of iowrite32be on ppc64 is that it does not swap and write separately, it is implemented via the "Store Word Byte-Reverse Indexed X-form" single instruction. And some archs (do not know which ones) may add memory barriers in their implementations of ioread/iowrite. __raw_writel is too raw :) > There doesn't actually > seem to be any change in behavior here, it just eliminates back-to-back > byte swaps, which are a nop on x86, but not power, right? Exactly. No dependency for QEMU. 
> >> Suggested-by: Benjamin Herrenschmidt >> Signed-off-by: Alexey Kardashevskiy >> --- >> drivers/vfio/pci/vfio_pci_rdwr.c | 20 >> 1 file changed, 16 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c >> b/drivers/vfio/pci/vfio_pci_rdwr.c >> index 210db24..f363b5a 100644 >> --- a/drivers/vfio/pci/vfio_pci_rdwr.c >> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c >> @@ -21,6 +21,18 @@ >> >> #include "vfio_pci_private.h" >> >> +#ifdef __BIG_ENDIAN__ >> +#define ioread16_native ioread16be >> +#define ioread32_native ioread32be >> +#define iowrite16_nativeiowrite16be >> +#define iowrite32_nativeiowrite32be >> +#else >> +#define ioread16_native ioread16 >> +#define ioread32_native ioread32 >> +#define iowrite16_nativeiowrite16 >> +#define iowrite32_nativeiowrite32 >> +#endif >> + >> /* >> * Read or write from an __iomem region (MMIO or I/O port) with an excluded >> * range which is inaccessible. The excluded range drops writes and fills >> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf, >> if (copy_from_user(&val, buf, 4)) >> return -EFAULT; >> >> -iowrite32(le32_to_cpu(val), io + off); >> +iowrite32_native(val, io + off); >> } else { >> -val = cpu_to_le32(ioread32(io + off)); >> +val = ioread32_native(io + off); >> >> if (copy_to_user(buf, &val, 4)) >> return -EFAULT; >> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf, >> if (copy_from_user(&val, buf, 2)) >> return -EFAULT; >> >> -iowrite16(le16_to_cpu(val), io + off); >> +
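To make the "canceling byte swaps" argument above easier to see outside the kernel, here is a small stand-alone sketch using the glibc <endian.h> helpers (the register value is made up; this is only an illustration, not part of the patch). A little-endian device register exported as a raw byte stream looks the same to userspace on any host, and an explicit swap followed by an accessor's implicit swap is a no-op on both endiannesses:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <endian.h>

int main(void)
{
	uint32_t dev_reg = 0x12345678;	/* value as the little-endian device sees it */
	uint8_t stream[4];
	uint32_t tmp;

	/* old code path in spirit: explicit swap + accessor swap cancel out */
	printf("double swap: %#x\n", le32toh(htole32(dev_reg)));	/* 0x12345678 everywhere */

	/* what pread() on the region gives userspace: raw LE bytes */
	tmp = htole32(dev_reg);
	memcpy(stream, &tmp, sizeof(stream));
	printf("byte stream: %02x %02x %02x %02x\n",
	       stream[0], stream[1], stream[2], stream[3]);		/* 78 56 34 12 on any host */
	return 0;
}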
[PATCH v8 3/4] arm: dirty log write protect management support
This patch adds support for keeping track of VM dirty pages. As dirty page log is retrieved, the pages that have been written are write protected again for next write and log read. For ARMv8 read of dirty log returns invalid operation. Signed-off-by: Mario Smarduch --- arch/arm/include/asm/kvm_host.h |3 ++ arch/arm/kvm/arm.c | 83 +++ arch/arm/kvm/mmu.c | 22 +++ 3 files changed, 108 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 586c467..dbf3d45 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -233,5 +233,8 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); void kvm_tlb_flush_vmid(struct kvm *kvm); void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot); +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask); #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index e11c2dd..cb3c090 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -783,10 +783,93 @@ long kvm_arch_vcpu_ioctl(struct file *filp, } } +#ifdef CONFIG_ARM +/** + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot + * @kvm: kvm instance + * @log: slot id and address to which we copy the log + * + * We need to keep it in mind that VCPU threads can write to the bitmap + * concurrently. So, to avoid losing data, we keep the following order for + * each bit: + * + * 1. Take a snapshot of the bit and clear it if needed. + * 2. Write protect the corresponding page. + * 3. Flush TLB's if needed. + * 4. Copy the snapshot to the userspace. + * + * Between 2 and 3, the guest may write to the page using the remaining TLB + * entry. This is not a problem because the page will be reported dirty at + * step 4 using the snapshot taken before and step 3 ensures that successive + * writes will be logged for the next call. 
+ */ +int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, + struct kvm_dirty_log *log) +{ + int r; + struct kvm_memory_slot *memslot; + unsigned long n, i; + unsigned long *dirty_bitmap; + unsigned long *dirty_bitmap_buffer; + bool is_dirty = false; + + mutex_lock(&kvm->slots_lock); + + r = -EINVAL; + if (log->slot >= KVM_USER_MEM_SLOTS) + goto out; + + memslot = id_to_memslot(kvm->memslots, log->slot); + + dirty_bitmap = memslot->dirty_bitmap; + r = -ENOENT; + if (!dirty_bitmap) + goto out; + + n = kvm_dirty_bitmap_bytes(memslot); + + dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long); + memset(dirty_bitmap_buffer, 0, n); + + spin_lock(&kvm->mmu_lock); + + for (i = 0; i < n / sizeof(long); i++) { + unsigned long mask; + gfn_t offset; + + if (!dirty_bitmap[i]) + continue; + + is_dirty = true; + + mask = xchg(&dirty_bitmap[i], 0); + dirty_bitmap_buffer[i] = mask; + + offset = i * BITS_PER_LONG; + kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask); + } + + spin_unlock(&kvm->mmu_lock); + + lockdep_assert_held(&kvm->slots_lock); + if (is_dirty) + kvm_tlb_flush_vmid(kvm); + + r = -EFAULT; + if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n)) + goto out; + + r = 0; +out: + mutex_unlock(&kvm->slots_lock); + return r; +} +#else int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { return -EINVAL; } +#endif static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm, struct kvm_arm_device_addr *dev_addr) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 37edcbe..1caf511 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -888,6 +888,28 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot) kvm_tlb_flush_vmid(kvm); spin_unlock(&kvm->mmu_lock); } + +/** + * kvm_mmu_write_protected_pt_masked() - write protect dirty pages set in mask + * @kvm: The KVM pointer + * @slot: The memory slot associated with mask + * @gfn_offset:The gfn offset in memory slot + * @mask: The mask of dirty pages at offset 'gfn_offset' in this memory + * slot to be write protected + * + * Walks bits set in mask write protects the associated pte's. Caller must + * acquire kvm_mmu_lock. + */ +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm, + struct kvm_memory_slot *slot, + gfn_t gfn_offset, unsigned long mask) +{ + phys_addr_t base_gfn = slot->base_gfn + gfn_offset; + phys_addr_t start = (base_gfn + __ffs
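For reference, a rough sketch of how the other side of this interface is driven from userspace (roughly what QEMU's migration code does; the vm_fd, slot number and slot size below are assumptions for the example, not taken from this patch): allocate a bitmap large enough for the memslot, issue KVM_GET_DIRTY_LOG, then walk the set bits and resend those pages.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* returns a caller-owned bitmap with one bit per dirtied page, or NULL */
static void *get_dirty_bitmap(int vm_fd, uint32_t slot, size_t slot_pages)
{
	struct kvm_dirty_log log;
	size_t bitmap_bytes = ((slot_pages + 63) / 64) * 8;	/* 64-bit aligned */
	void *bitmap = calloc(1, bitmap_bytes);

	if (!bitmap)
		return NULL;

	memset(&log, 0, sizeof(log));
	log.slot = slot;
	log.dirty_bitmap = bitmap;

	/* on ARM this lands in the kvm_vm_ioctl_get_dirty_log() added above */
	if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
		free(bitmap);
		return NULL;
	}
	return bitmap;
}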
[PATCH v8 4/4] arm: dirty page logging 2nd stage page fault handling support
This patch adds support for handling 2nd stage page faults during migration, it disables faulting in huge pages, and dissolves huge pages to page tables. In case migration is canceled huge pages will be used again. For ARMv8 logging is hardcoded to false. Signed-off-by: Mario Smarduch --- arch/arm/kvm/mmu.c | 31 +-- 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 1caf511..d49df28 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -641,7 +641,8 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache } static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, - phys_addr_t addr, const pte_t *new_pte, bool iomap) + phys_addr_t addr, const pte_t *new_pte, bool iomap, + bool logging_active) { pmd_t *pmd; pte_t *pte, old_pte; @@ -656,6 +657,15 @@ static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache, return 0; } + /* +* While dirty memory logging, clear PMD entry for huge page and split +* into smaller pages, to track dirty memory at page granularity. +*/ + if (logging_active && kvm_pmd_huge(*pmd)) { + phys_addr_t ipa = pmd_pfn(*pmd) << PAGE_SHIFT; + clear_pmd_entry(kvm, pmd, ipa); + } + /* Create stage-2 page mappings - Level 2 */ if (pmd_none(*pmd)) { if (!cache) @@ -708,7 +718,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa, if (ret) goto out; spin_lock(&kvm->mmu_lock); - ret = stage2_set_pte(kvm, &cache, addr, &pte, true); + ret = stage2_set_pte(kvm, &cache, addr, &pte, true, false); spin_unlock(&kvm->mmu_lock); if (ret) goto out; @@ -925,6 +935,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache; struct vm_area_struct *vma; pfn_t pfn; + /* Get logging status, if dirty_bitmap is not NULL then logging is on */ +#ifdef CONFIG_ARM + bool logging_active = !!memslot->dirty_bitmap; +#else + bool logging_active = false; +#endif write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu)); if (fault_status == FSC_PERM && !write_fault) { @@ -935,7 +951,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, /* Let's check if we will get back a huge page backed by hugetlbfs */ down_read(¤t->mm->mmap_sem); vma = find_vma_intersection(current->mm, hva, hva + 1); - if (is_vm_hugetlb_page(vma)) { + if (is_vm_hugetlb_page(vma) && !logging_active) { hugetlb = true; gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT; } else { @@ -978,7 +994,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, spin_lock(&kvm->mmu_lock); if (mmu_notifier_retry(kvm, mmu_seq)) goto out_unlock; - if (!hugetlb && !force_pte) + if (!hugetlb && !force_pte && !logging_active) hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa); if (hugetlb) { @@ -997,9 +1013,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, kvm_set_pfn_dirty(pfn); } coherent_cache_guest_page(vcpu, hva, PAGE_SIZE); - ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false); + ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, false, + logging_active); } + if (write_fault) + mark_page_dirty(kvm, gfn); out_unlock: spin_unlock(&kvm->mmu_lock); @@ -1150,7 +1169,7 @@ static void kvm_set_spte_handler(struct kvm *kvm, gpa_t gpa, void *data) { pte_t *pte = (pte_t *)data; - stage2_set_pte(kvm, NULL, gpa, pte, false); + stage2_set_pte(kvm, NULL, gpa, pte, false, false); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the 
body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 2/4] arm: dirty page logging initial mem region write protect (w/no huge PUD support)
Patch adds support for initial write protection VM memlsot. This patch series assumes that huge PUDs will not be used in 2nd stage tables. For ARMv8 nothing happens here. Signed-off-by: Mario Smarduch --- arch/arm/include/asm/kvm_host.h |1 + arch/arm/include/asm/kvm_mmu.h| 20 ++ arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/arm.c|9 +++ arch/arm/kvm/mmu.c| 128 + 5 files changed, 159 insertions(+) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index ac3bb65..586c467 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -232,5 +232,6 @@ u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid); int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); void kvm_tlb_flush_vmid(struct kvm *kvm); +void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot); #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 5cc0b0f..08ab5e8 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -114,6 +114,26 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd) pmd_val(*pmd) |= L_PMD_S2_RDWR; } +static inline void kvm_set_s2pte_readonly(pte_t *pte) +{ + pte_val(*pte) = (pte_val(*pte) & ~L_PTE_S2_RDWR) | L_PTE_S2_RDONLY; +} + +static inline bool kvm_s2pte_readonly(pte_t *pte) +{ + return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY; +} + +static inline void kvm_set_s2pmd_readonly(pmd_t *pmd) +{ + pmd_val(*pmd) = (pmd_val(*pmd) & ~L_PMD_S2_RDWR) | L_PMD_S2_RDONLY; +} + +static inline bool kvm_s2pmd_readonly(pmd_t *pmd) +{ + return (pmd_val(*pmd) & L_PMD_S2_RDWR) == L_PMD_S2_RDONLY; +} + /* Open coded p*d_addr_end that can deal with 64bit addresses */ #define kvm_pgd_addr_end(addr, end)\ ({ u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;\ diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h index 85c60ad..d8bb40b 100644 --- a/arch/arm/include/asm/pgtable-3level.h +++ b/arch/arm/include/asm/pgtable-3level.h @@ -129,6 +129,7 @@ #define L_PTE_S2_RDONLY(_AT(pteval_t, 1) << 6) /* HAP[1] */ #define L_PTE_S2_RDWR (_AT(pteval_t, 3) << 6) /* HAP[2:1] */ +#define L_PMD_S2_RDONLY(_AT(pteval_t, 1) << 6) /* HAP[1] */ #define L_PMD_S2_RDWR (_AT(pmdval_t, 3) << 6) /* HAP[2:1] */ /* diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 3c82b37..e11c2dd 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -242,6 +242,15 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, const struct kvm_memory_slot *old, enum kvm_mr_change change) { +#ifdef CONFIG_ARM + /* +* At this point memslot has been committed and there is an +* allocated dirty_bitmap[], dirty pages will be be tracked while the +* memory slot is write protected. 
+*/ + if ((change != KVM_MR_DELETE) && (mem->flags & KVM_MEM_LOG_DIRTY_PAGES)) + kvm_mmu_wp_memory_region(kvm, mem->slot); +#endif } void kvm_arch_flush_shadow_all(struct kvm *kvm) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index e90b9e4..37edcbe 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -762,6 +762,134 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, phys_addr_t *ipap) return false; } +#ifdef CONFIG_ARM +/** + * stage2_wp_pte_range - write protect PTE range + * @pmd: pointer to pmd entry + * @addr: range start address + * @end: range end address + */ +static void stage2_wp_pte_range(pmd_t *pmd, phys_addr_t addr, phys_addr_t end) +{ + pte_t *pte; + + pte = pte_offset_kernel(pmd, addr); + do { + if (!pte_none(*pte)) { + if (!kvm_s2pte_readonly(pte)) + kvm_set_s2pte_readonly(pte); + } + } while (pte++, addr += PAGE_SIZE, addr != end); +} + +/** + * stage2_wp_pmd_range - write protect PMD range + * @pud: pointer to pud entry + * @addr: range start address + * @end: range end address + */ +static void stage2_wp_pmd_range(pud_t *pud, phys_addr_t addr, phys_addr_t end) +{ + pmd_t *pmd; + phys_addr_t next; + + pmd = pmd_offset(pud, addr); + + do { + next = kvm_pmd_addr_end(addr, end); + if (!pmd_none(*pmd)) { + if (kvm_pmd_huge(*pmd)) { + if (!kvm_s2pmd_readonly(pmd)) + kvm_set_s2pmd_readonly(pmd); + } else + stage2_wp_pte_range(pmd, addr, next); + + } + } while (pmd++, a
[PATCH v8 1/4] arm: add ARMv7 HYP API to flush VM TLBs without address param
Patch adds HYP interface for global VM TLB invalidation without address parameter. Moved VM TLB flushing back to architecture layer. This patch depends on the unmap_range() patch, it needs to be applied first. No changes to ARMv8. Signed-off-by: Mario Smarduch --- arch/arm/include/asm/kvm_asm.h |1 + arch/arm/include/asm/kvm_host.h |2 ++ arch/arm/kvm/interrupts.S | 11 +++ arch/arm/kvm/mmu.c | 16 4 files changed, 30 insertions(+) diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h index 53b3c4a..21bc519 100644 --- a/arch/arm/include/asm/kvm_asm.h +++ b/arch/arm/include/asm/kvm_asm.h @@ -78,6 +78,7 @@ extern char __kvm_hyp_code_end[]; extern void __kvm_flush_vm_context(void); extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa); +extern void __kvm_tlb_flush_vmid(struct kvm *kvm); extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu); #endif diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index 193ceaf..ac3bb65 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -231,4 +231,6 @@ int kvm_perf_teardown(void); u64 kvm_arm_timer_get_reg(struct kvm_vcpu *, u64 regid); int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value); +void kvm_tlb_flush_vmid(struct kvm *kvm); + #endif /* __ARM_KVM_HOST_H__ */ diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S index 0d68d40..a3717b7 100644 --- a/arch/arm/kvm/interrupts.S +++ b/arch/arm/kvm/interrupts.S @@ -66,6 +66,17 @@ ENTRY(__kvm_tlb_flush_vmid_ipa) bx lr ENDPROC(__kvm_tlb_flush_vmid_ipa) +/** + * void __kvm_tlb_flush_vmid(struct kvm *kvm) - Flush per-VMID TLBs + * @kvm: pointer to kvm structure + * + * Reuses __kvm_tlb_flush_vmid_ipa() for ARMv7, without passing address + * parameter + */ +ENTRY(__kvm_tlb_flush_vmid) + b __kvm_tlb_flush_vmid_ipa +ENDPROC(__kvm_tlb_flush_vmid) + / * Flush TLBs and instruction caches of all CPUs inside the inner-shareable * domain, for all VMIDs diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 2ac9588..e90b9e4 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -56,6 +56,22 @@ static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, kvm, ipa); } +#ifdef CONFIG_ARM +/** + * kvm_tlb_flush_vmid() - flush all VM TLB entries + * @kvm: pointer to kvm structure. + * + * Interface to HYP function to flush all VM TLB entries without address + * parameter. In HYP mode reuses __kvm_tlb_flush_vmid_ipa() function used by + * kvm_tlb_flush_vmid_ipa(). + */ +void kvm_tlb_flush_vmid(struct kvm *kvm) +{ + if (kvm) + kvm_call_hyp(__kvm_tlb_flush_vmid, kvm); +} +#endif + static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, int min, int max) { -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 0/4] arm: dirty page logging support for ARMv7
This patch adds support for dirty page logging, so far tested only on ARMv7, and verified to compile on ARMv8. With dirty page logging, GICv2 vGIC and arch timer save/restore support, live migration is supported. Dirty page logging support - - initially write protects VM RAM memory regions - 2nd stage page tables - add support to read dirty page log and again write protect the dirty pages - second stage page table for next pass. - second stage huge pages are dissolved into page tables to keep track of dirty pages at page granularity. Tracking at huge page granularity limits migration to an almost idle system. - In the event migration is canceled, normal behavior is resumed and huge pages are rebuilt over time. - At this time reverse mappings are not used for write protecting of 2nd stage tables. - Future work - Enable dirty memory logging to work on ARMv8 FastModels. Test Environment: --- NOTE: RUNNING on FAST Models will hardly ever fail and mask bugs, in fact initially light loads were succeeding without dirty page logging support. --- - Will put all components on github, including test setup diagram - In short summary o Two ARM Exynos 5440 development platforms - 4-way 1.7 GHz, with 8GB, 256GB storage, 1GBs Ethernet, with swap enabled o NFS Server running Ubuntu 13.04 - both ARM boards mount shared file system - Shared file system includes - QEMU, Guest Kernel, DTB, multiple Ext3 root file systems. o Component versions: qemu-1.7.5, vexpress-a15, host/guest kernel 3.15-rc1, o Use QEMU Ctrl+A+C and migrate -d tcp:IP:port command - Destination command syntax: can change smp to 4, machine model outdated, but has been tested on virt by others (need to upgrade) /mnt/migration/qemu-system-arm -enable-kvm -smp 2 -kernel \ /mnt/migration/zImage -dtb /mnt/migration/guest-a15.dtb -m 1792 \ -M vexpress-a15 -cpu cortex-a15 -nographic \ -append "root=/dev/vda rw console=ttyAMA0 rootwait" \ -drive if=none,file=/mnt/migration/guest1.root,id=vm1 \ -device virtio-blk-device,drive=vm1 \ -netdev type=tap,id=net0,ifname=tap0 \ -device virtio-net-device,netdev=net0,mac="52:54:00:12:34:58" \ -incoming tcp:0:4321 - Source command syntax same except '-incoming' o Test migration of multiple VMs using tap0, tap1, ..., and guest0.root, ... has been tested as well. o On source run multiple copies of 'dirtyram.arm' - simple program to dirty pages periodically. ./dirtyram.arm Example: ./dirtyram.arm 102580 812 30 - dirty 102580 pages - 812 pages every 30ms with an incrementing counter - run anywhere from one to as many copies as VM resources can support. If the dirty rate is too high migration will run indefinitely - run date output loop, check date is picked up smoothly - place guest/host into page reclaim/swap mode - by whatever means in this case run multiple copies of 'dirtyram.arm' on host - issue migrate command(s) on source - Top result is 409600, 8192, 5 o QEMU is instrumented to save RAM memory regions on source and destination after memory is migrated, but before the guest is started. Later files are checksummed on both ends for correctness, given VMs are small this works. o Guest kernel is instrumented to capture current cycle counter - last cycle and compare to qemu down time to test arch timer accuracy. o Network failover is at L3 due to interface limitations, ping continues working transparently o Also tested 'migrate_cancel' to test reassembly of huge pages (inserted low level instrumentation code).
- Basic Network Test - Assuming one ethernet interface available Source host IP 192.168.10.101/24, VM tap0 192.168.2.1/24 and VM eth0 192.168.2.100/24 with default route 192.168.2.1 Destination host IP 192.168.10.100/24, VM same settings as above. Both VMs have identical MAC addresses. Initially NFS server route to 192.168.2.100 is via 192.168.10.101 - ssh 192.168.2.100 - start migration from source to destination - after migration ends - on NFS server switch routes. route add -host 192.168.2.100 gw 192.168.10.100 ssh should resume after route switch. ping as well should work seamlessly. Changes since v7: - Reworked write protection of dirty page mask - Moved generic code back to architecture layer, keep it there for the time being, until a KVM framework for architecture functions to override generic ones is defined. - Fixed condition bug for marking pages dirty Mario Smarduch (4): add ARMv7 HYP API to flush VM TLBs without address param dirty page logging initial mem region write protect (w/no huge PUD support) dirty log write protect management support dirty page logging 2nd stage page fault handling support arch/arm/include/asm/kvm_asm.h
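The dirtyram.arm source is not part of this posting, so here is a minimal sketch of a page dirtier with the behaviour described above (the argument handling, names and the use of a plain malloc'd buffer are assumptions for the example): touch a given number of pages per interval with an incrementing counter, wrapping around the buffer.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <total-pages> <pages-per-interval> <interval-ms>\n", argv[0]);
		return 1;
	}

	long total = atol(argv[1]);
	long chunk = atol(argv[2]);
	long interval_ms = atol(argv[3]);
	unsigned int counter = 0;
	long next = 0;

	char *mem = malloc(total * PAGE_SIZE);
	if (!mem)
		return 1;
	memset(mem, 0, total * PAGE_SIZE);	/* fault everything in once */

	for (;;) {
		/* dirty 'chunk' pages, then sleep for the interval */
		for (long i = 0; i < chunk; i++) {
			*(unsigned int *)(mem + next * PAGE_SIZE) = counter++;
			next = (next + 1) % total;
		}
		usleep(interval_ms * 1000);
	}
	return 0;
}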
Re: [patch 0/5] KVM: support for pinning sptes
On Wed, Jun 18, 2014 at 08:12:03PM -0300, mtosa...@redhat.com wrote: > Required by PEBS support as discussed at > > Subject: [patch 0/5] Implement PEBS virtualization for Silvermont > Message-Id: <1401412327-14810-1-git-send-email-a...@firstfloor.org> Thanks Marcelo. I'll give it a stress test here. -Andi -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html