Re: [KVM_AUTOTEST] unattended installs take 2
On 6/19/2009 12:50 AM, David Huff wrote:
> Second pass at the unattended install test. Both Linux and Windows
> guests are working; however, currently it just uses the existing boot
> test and extends timeouts. We still need a good way of determining if
> an unattended install completed without error. For a Linux guest we
> should be able to run something similar to a regular boot test: after
> the reboot, try to ssh in. For a Windows guest we still have to run
> setup in order to have SSH up.

The unattended script can run user-defined commands at the end. Those
could be "set up SSH and run it".

Y.

> Requires the processor patch as well as the patch to "strip and split"
> patches. The scripts still use loopback mounts for now; we can upgrade
> this in the future, but it is the most generic way of creating and
> manipulating disk images.
>
> 5 patches in this set:
>
> 0005-Modified-boot-test-in-kvm_test.py.patch
> 0004-Added-two-sample-unattended-config-files-Fedora-and.patch
> 0003-added-unattended.sh-script.patch
> 0002-modified-config-file-to-run-unattended-install.patch
> 0001-Added-floppy-and-tftp-options-to-qemu-command.patch
Re: [PATCH] Added two sample unattended config files, Fedora and Windows
On 6/19/2009 12:50 AM, David Huff wrote:

You must add your own product key to the Windows config file.
---
 client/tests/kvm/unattended/Fedora-11-i386.ks |   24 +++
 client/tests/kvm/unattended/WinXP-32.sif      |   52 +
 2 files changed, 76 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/unattended/Fedora-11-i386.ks
 create mode 100644 client/tests/kvm/unattended/WinXP-32.sif

diff --git a/client/tests/kvm/unattended/Fedora-11-i386.ks b/client/tests/kvm/unattended/Fedora-11-i386.ks
new file mode 100644
index 000..da2f1ed
--- /dev/null
+++ b/client/tests/kvm/unattended/Fedora-11-i386.ks
@@ -0,0 +1,24 @@
+install
+cdrom
+text
+reboot
+lang en_US
+keyboard us
+network --bootproto dhcp
+rootpw 123456
+firewall --enabled --ssh
+selinux --enforcing
+timezone --utc America/New_York
+firstboot --disable
+bootloader --location=mbr
+zerombr
+
+clearpart --all --initlabel
+part /boot --fstype=ext3 --size=100
+part / --fstype=ext3 --size=2000
+part swap --fstype=swap --size=512
+
+%packages
+@ Base
+%end
+
diff --git a/client/tests/kvm/unattended/WinXP-32.sif b/client/tests/kvm/unattended/WinXP-32.sif
new file mode 100644
index 000..196e193
--- /dev/null
+++ b/client/tests/kvm/unattended/WinXP-32.sif
@@ -0,0 +1,52 @@
+[Data]
+AutoPartition = 1
+MsDosInitiated = 0
+UnattendedInstall = Yes
+
+[Unattended]
+UnattendMode = FullUnattended
+OemSkipEula = Yes
+OemPreinstall = No
+UnattendSwitch = Yes
+CrashDumpSetting = 1
+DriverSigningPolicy = ignore
+WaitForReboot = no
+Repartition = yes
+
+[GuiUnattended]
+AdminPassword = 123456
+AutoLogon = Yes
+AutoLogonCount = 5
+OEMSkipRegional = 1
+TimeZone = 85
+OemSkipWelcome = 1
+
+[UserData]
+FullName = "QumranetUser"
+OrgName = "Qumranet"

Let's change this...

Y.

+ComputerName = *
+ProductKey =
+
+[Identification]
+
+[Networking]
+
+[Components]
+dialer = off
+media_clips = off
+media_utopia = off
+msnexplr = off
+netoc = off
+OEAccess = off
+templates = off
+WMAccess = off
+zonegames = off
+
+[TerminalServices]
+AllowConnections = 1
+
+[WindowsFirewall]
+Profiles = WindowsFirewall.TurnOffFirewall
+[WindowsFirewall.TurnOffFirewall]
+Mode = 0
+
[PATCH/RFC] virtio_test: A module for testing virtio via userspace
Hello Rusty,

this is the result of a two-month internship on virtio testing.

From: Adrian Schneider
From: Tim Hofmann
From: Christian Ehrhardt
From: Christian Borntraeger

This patch introduces a prototype for a virtio_test module. This module
can be bound to any virtio device via the sysfs bind/unbind feature,
e.g.:

$ echo virtio1 > /sys/bus/virtio/drivers/virtio_rng/unbind
$ modprobe virtio_test

On probe this module registers to all virtqueues and creates a character
device (/dev/viotest*) for every virtio device. The character device
offers ioctls to allow a userspace application to submit virtio
operations like add_buf, kick and get_buf. It also offers ioctls to get
information about the device and to query the number of callbacks that
have occurred (or to wait synchronously on callbacks).

The driver currently lacks the following planned features:
o userspace tooling for fuzzing (a prototype exists)
o feature bit support
o support for arbitrary pointer mode in add_buf (e.g. test how qemu
  deals with iovecs pointing beyond the guest memory size)
o priority binding with other virtio drivers (e.g. if virtio_blk and
  virtio_test are compiled into the kernel, virtio_blk should get all
  block devices by default on hotplug)

I would like to get feedback on
o the general idea of a virtio_test module
o the user interface ioctls
o further ideas and comments

Signed-off-by: Christian Borntraeger
---
 drivers/virtio/Kconfig       |   12 
 drivers/virtio/Makefile      |    2 
 drivers/virtio/virtio_test.c |  710 +++
 include/linux/Kbuild         |    1 
 include/linux/virtio_test.h  |  146 
 5 files changed, 871 insertions(+)

Index: linux-2.6/drivers/virtio/Kconfig
===
--- linux-2.6.orig/drivers/virtio/Kconfig
+++ linux-2.6/drivers/virtio/Kconfig
@@ -33,3 +33,15 @@ config VIRTIO_BALLOON
 
 	  If unsure, say M.
 
+config VIRTIO_TEST
+	tristate "Virtio test driver (EXPERIMENTAL)"
+	select VIRTIO
+	select VIRTIO_RING
+	---help---
+	  This driver supports testing arbitrary virtio devices. The driver
+	  offers IOCTLs to run add_buf/get_buf etc. from userspace. You can
+	  bind/unbind any unused virtio device to this driver via sysfs. Each
+	  bound device will get a /dev/viotest* device node.
+
+	  If unsure, say M.
+
Index: linux-2.6/drivers/virtio/Makefile
===
--- linux-2.6.orig/drivers/virtio/Makefile
+++ linux-2.6/drivers/virtio/Makefile
@@ -2,3 +2,5 @@ obj-$(CONFIG_VIRTIO) += virtio.o
 obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
+obj-$(CONFIG_VIRTIO_TEST) += virtio_test.o
+
Index: linux-2.6/drivers/virtio/virtio_test.c
===
--- /dev/null
+++ linux-2.6/drivers/virtio/virtio_test.c
@@ -0,0 +1,710 @@
+/*
+ * Test driver for the virtio bus
+ *
+ *	Copyright IBM Corp. 2009
+ *	Author(s): Adrian Schneider
+ *		   Tim Hofmann
+ *		   Christian Ehrhardt
+ *		   Christian Borntraeger
+ */
+
+
+#define KMSG_COMPONENT "virtio_test"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static u32 viotest_major = VIOTEST_MAJOR;
+static struct class *viotest_class;
+static LIST_HEAD(viotest_list);
+static spinlock_t viotest_list_lock = SPIN_LOCK_UNLOCKED;
+
+static void free_kvec(struct kvec *kiov, u32 index)
+{
+	u32 i;
+
+	for (i = 0; i < index; i++)
+		kfree(kiov[i].iov_base);
+
+	kfree(kiov);
+}
+
+/*
+ * This function copies a userspace iovec * array into a kernel kvec * array
+ */
+static int copy_iovec_from_user(struct kvec **kiov, struct iovec __user *uiov,
+				u32 uiov_num)
+{
+	u32 i;
+	u64 kiov_sz;
+	struct iovec uservec;
+
+	kiov_sz = sizeof(struct kvec) * uiov_num;
+	*kiov = kmalloc(kiov_sz, GFP_KERNEL);
+	if (!(*kiov))
+		return -ENOMEM;
+
+	for (i = 0; i < uiov_num; i++) {
+		if (copy_from_user(&uservec, &uiov[i], sizeof(struct iovec))) {
+			free_kvec(*kiov, i);
+			return -EFAULT;
+		}
+		(*kiov)[i].iov_base = kmalloc(uservec.iov_len, GFP_KERNEL);
+		if (!(*kiov)[i].iov_base) {
+			free_kvec(*kiov, i);
+			return -ENOMEM;
+		}
+
+		if (copy_from_user((*kiov)[i].iov_base, uservec.iov_base, uservec.iov_len)) {
+			free_kvec(*kiov, i);
+			return -EFAULT;
+		}
+		(*kiov)[i].iov_len = uservec.iov_len;
+	}
+
+	return 0;
+}
+
+static int copy_kvec_to_user(struct iovec
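The posted excerpt cuts off before include/linux/virtio_test.h, so the exact ioctl interface is not visible here. Purely to illustrate the described usage model (open the per-device node, then drive the device via ioctls), here is a hedged userspace sketch; every VIOTEST_* name and struct layout below is an assumption, not the patch's real ABI:

/*
 * viotest_demo.c - hypothetical userspace sketch for the proposed
 * /dev/viotest* interface. The ioctl numbers, names and argument
 * structs are invented for illustration; only the open/ioctl flow
 * follows the patch description above.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>

struct viotest_info {			/* assumed layout */
	unsigned int nvqs;		/* number of virtqueues */
};

#define VIOTEST_GET_INFO _IOR('V', 0x00, struct viotest_info)	/* assumed */
#define VIOTEST_KICK     _IOW('V', 0x01, unsigned int)		/* assumed */

int main(void)
{
	struct viotest_info info;
	unsigned int vq = 0;
	int fd = open("/dev/viotest0", O_RDWR);

	if (fd < 0) {
		perror("open /dev/viotest0");
		return 1;
	}
	if (ioctl(fd, VIOTEST_GET_INFO, &info) == 0)
		printf("device has %u virtqueues\n", info.nvqs);
	/* an add_buf ioctl would go here, then: */
	if (ioctl(fd, VIOTEST_KICK, &vq) < 0)
		perror("VIOTEST_KICK");
	close(fd);
	return 0;
}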
Re: [PATCH] kvm: device-assignment: Add PCI option ROM support
On Friday 19 June 2009 00:28:41 Alex Williamson wrote:
> On Thu, 2009-06-18 at 13:30 +0800, Yang, Sheng wrote:
> > On Tuesday 16 June 2009 00:29:17 Alex Williamson wrote:
> > > The PCI code already knows about option ROMs, so we just need to
> > > mmap some space for it, load it with a copy of the contents, and
> > > complete the plumbing to the generic code.
> > >
> > > With this a Linux guest can get access to the ROM contents via
> > > /sys/bus/pci/devices//rom. This might also enable the BIOS
> > > to execute ROMs by loading them dynamically from the device
> > > rather than hoping they all fit into RAM.
> >
> > The patch looks fine. One question: if the guest writes to the ROM, I
> > think the guest would be killed because QEmu would receive a SIGSEGV?
> > I am not sure if it's too severe...
>
> Hi Sheng,
>
> Good thought. I tested this with a simple program that mmaps the ROM
> address from /dev/mem and tries to write to it (using setpci to enable
> the ROM). The results are a little surprising. On the host, writing to
> the ROM causes an NMI and the system dies. On the KVM guest, the write
> is happily discarded, neither segfaulting from the mprotect nor
> affecting the contents of the ROM. So it seems that something above my
> mprotect is discarding the write, and if we did hit it, a qemu segfault
> isn't that far from what happens on bare metal.

Oh, that's good. :)

> The one oddity I noticed is that even when the enable bit is clear, the
> guest can read the ROM. I don't know that this is actually illegal, vs
> returning zeros or ones though. It seems like maybe the generic PCI
> code isn't tracking the enable bit. I think that's an independent
> problem from this patch though. Thanks,

That should be fine. I've taken a look at the code; it seems the Linux
kernel sets the enable bit when someone begins to read the ROM, copies
the ROM to a buffer, and then unmaps it. So the ROM can be read while
the enable bit is clear.

-- 
regards
Yang, Sheng

> Alex
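For reference, the host-side experiment Alex describes (mmap the ROM BAR through /dev/mem and poke it) can be sketched as below. ROM_ADDR and ROM_SIZE are placeholders to be taken from lspci; this is illustrative, not the actual test program:

/* rom_poke.c - minimal sketch of writing to a PCI ROM mapping.
 * Enable the ROM first, e.g. with setpci, and substitute the real
 * BAR address/size for the placeholders below. Run as root. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define ROM_ADDR 0xfe000000UL	/* placeholder: take from lspci -v */
#define ROM_SIZE 0x10000UL	/* placeholder */

int main(void)
{
	int fd = open("/dev/mem", O_RDWR | O_SYNC);
	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}
	volatile uint8_t *rom = mmap(NULL, ROM_SIZE, PROT_READ | PROT_WRITE,
				     MAP_SHARED, fd, ROM_ADDR);
	if (rom == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	uint8_t before = rom[0];
	rom[0] = (uint8_t)~before;	/* the write under test */
	printf("before=%02x after=%02x\n", before, rom[0]);
	munmap((void *)rom, ROM_SIZE);
	close(fd);
	return 0;
}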
Re: ESX on KVM requirements
On 19.06.2009, at 03:19, Ben Sanders wrote:

> > I needed nested paging to have things work, so Phenom was mandatory.
> > Also, because I was running things on a Phenom, I figured the safest
> > bet is to be truthful in telling the guest what it's running on. Also
> > ESX checked for specific cpuid revisions and I knew the CPU I was on
> > is safe. If you get it running in debug mode, do -serial stdio and
> > send me the output so I can see if I remember the cure ;)
> >
> > Alex
>
> I cut out the mailing list since people probably don't want my
> attachment. I switched to a Phenom 9950 to see if I'd have some better
> luck. I'm running kvm v86 with nested=1 and -enable-nesting on Ubuntu
> 9.04.
>
> Here's the command I'm running. Setting the NIC model was necessary to
> get it to install.
>
> qemu-system-x86_64 -cpu phenom -enable-nesting -hda ~/esx.img -m 1024 -net nic,model=e1000 -net user
>
> Attached is the debug output. Thanks for your help.

Looks like you're missing my VMware interface extension patches that
implement TSC fake value passing to the guest? The series was called
"VMware ESX guest bringup (partial)".

Also, try to keep the ML in CC and simply pastebin instead of attaching
the log file. There are probably more people out there who have the same
question ;-).

Alex
[PATCH] KVM: SVM: Pass through the host kernel's IO delay port
From: Paolo Bonzini

KVM's optimization of guest port 80 accesses was removed last May 11 in
commit 99f85a. However, this removal probably carries a speed penalty.
I don't have a machine to test, but the equivalent VMX patch (fdef3ad)
reported a speedup of 3-5%, and on the Xen mailing list it was mentioned
that on Xen passing port 80 through had positive effects on startup
speed. We can enable passthrough to the same port the host kernel uses
instead.

Signed-off-by: Paolo Bonzini
---
 arch/x86/kernel/io_delay.c |    1 +
 arch/x86/kvm/svm.c         |    5 +
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/io_delay.c b/arch/x86/kernel/io_delay.c
index a979b5b..a478cb2 100644
--- a/arch/x86/kernel/io_delay.c
+++ b/arch/x86/kernel/io_delay.c
@@ -129,3 +129,4 @@ static int __init io_delay_param(char *s)
 }
 
 early_param("io_delay", io_delay_param);
+EXPORT_SYMBOL_GPL(io_delay_type);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1f8510c..c49f4db 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -26,6 +26,7 @@
 
 #include
 #include
+#include
 #include
 #include
 
@@ -411,6 +412,10 @@ static __init int svm_hardware_setup(void)
 
 	iopm_va = page_address(iopm_pages);
 	memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER));
+	if (io_delay_type == CONFIG_IO_DELAY_TYPE_0X80)
+		clear_bit(0x80, iopm_va);
+	else if (io_delay_type == CONFIG_IO_DELAY_TYPE_0XED)
+		clear_bit(0xED, iopm_va);
 	iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
 
 	if (boot_cpu_has(X86_FEATURE_NX))
-- 
1.6.0.3
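For context, the "IO delay port" is the dummy port access the kernel issues between back-to-back I/O operations; on x86 it boils down to a single outb to port 0x80 (or 0xED, depending on the io_delay= parameter). A simplified illustration of the access this patch stops intercepting:

/* Simplified illustration of an io_delay access on x86; the real
 * kernel selects the port via io_delay= (see io_delay_type above).
 * With this patch, such accesses from a guest hit the IOPM hole
 * and no longer cause an exit to the hypervisor. */
static inline void io_delay_port80(void)
{
	asm volatile("outb %%al, $0x80" : : : "memory");
}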
Re: [KVM-AUTOTEST PATCH] stress_boot - Boot VMs until one of them becomes unresponsive - Version2
On Thu, 2009-06-18 at 17:16 +0800, Yolkfull Chow wrote:
> Hi Lucas, I also found the number counting problem later after sending
> the patch. I haven't been able to re-send the updated one since I got
> some other things to deal with these days. Sorry for that...
>
> Please see attachment for updated version. Thank you so much. :)

Ok, patch applied. Thank you very much for your work!
[PATCH 3/8] kvm/mmu: rename is_largepage_backed to mapping_level
With the new name and the corresponding backend changes, this function
can now support multiple hugepage sizes.

Signed-off-by: Joerg Roedel
---
 arch/x86/kvm/mmu.c         |  100 +--
 arch/x86/kvm/paging_tmpl.h |    4 +-
 2 files changed, 69 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1f24d88..3fa6009 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -390,37 +390,52 @@ static void mmu_free_rmap_desc(struct kvm_rmap_desc *rd)
  * Return the pointer to the largepage write count for a given
  * gfn, handling slots that are not large page aligned.
  */
-static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
+static int *slot_largepage_idx(gfn_t gfn,
+			       struct kvm_memory_slot *slot,
+			       int level)
 {
 	unsigned long idx;
 
-	idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
-	      (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
-	return &slot->lpage_info[0][idx].write_count;
+	idx = (gfn / KVM_PAGES_PER_HPAGE(level)) -
+	      (slot->base_gfn / KVM_PAGES_PER_HPAGE(level));
+	return &slot->lpage_info[level - 2][idx].write_count;
 }
 
 static void account_shadowed(struct kvm *kvm, gfn_t gfn)
 {
+	struct kvm_memory_slot *slot;
 	int *write_count;
+	int i;
 
 	gfn = unalias_gfn(kvm, gfn);
-	write_count = slot_largepage_idx(gfn,
-					 gfn_to_memslot_unaliased(kvm, gfn));
-	*write_count += 1;
+
+	for (i = PT_DIRECTORY_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		slot = gfn_to_memslot_unaliased(kvm, gfn);
+		write_count = slot_largepage_idx(gfn, slot, i);
+		*write_count += 1;
+	}
 }
 
 static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
 {
+	struct kvm_memory_slot *slot;
 	int *write_count;
+	int i;
 
 	gfn = unalias_gfn(kvm, gfn);
-	write_count = slot_largepage_idx(gfn,
-					 gfn_to_memslot_unaliased(kvm, gfn));
-	*write_count -= 1;
-	WARN_ON(*write_count < 0);
+	for (i = PT_DIRECTORY_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		slot = gfn_to_memslot_unaliased(kvm, gfn);
+		write_count = slot_largepage_idx(gfn, slot, i);
+		*write_count -= 1;
+		WARN_ON(*write_count < 0);
+	}
 }
 
-static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
+static int has_wrprotected_page(struct kvm *kvm,
+				gfn_t gfn,
+				int level)
 {
 	struct kvm_memory_slot *slot;
 	int *largepage_idx;
@@ -428,47 +443,67 @@ static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
 	gfn = unalias_gfn(kvm, gfn);
 	slot = gfn_to_memslot_unaliased(kvm, gfn);
 	if (slot) {
-		largepage_idx = slot_largepage_idx(gfn, slot);
+		largepage_idx = slot_largepage_idx(gfn, slot, level);
 		return *largepage_idx;
 	}
 
 	return 1;
 }
 
-static int host_largepage_backed(struct kvm *kvm, gfn_t gfn)
+static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
 {
+	unsigned long page_size = PAGE_SIZE;
 	struct vm_area_struct *vma;
 	unsigned long addr;
-	int ret = 0;
+	int i, ret = 0;
 
 	addr = gfn_to_hva(kvm, gfn);
 	if (kvm_is_error_hva(addr))
-		return ret;
+		return page_size;
 
 	down_read(&current->mm->mmap_sem);
 	vma = find_vma(current->mm, addr);
-	if (vma && is_vm_hugetlb_page(vma))
-		ret = 1;
+	if (!vma)
+		goto out;
+
+	page_size = vma_kernel_pagesize(vma);
+
+out:
 	up_read(&current->mm->mmap_sem);
 
+	for (i = PT_PAGE_TABLE_LEVEL;
+	     i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
+		if (page_size >= KVM_HPAGE_SIZE(i))
+			ret = i;
+		else
+			break;
+	}
+
 	return ret;
 }
 
-static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn)
+static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 {
 	struct kvm_memory_slot *slot;
-
-	if (has_wrprotected_page(vcpu->kvm, large_gfn))
-		return 0;
-
-	if (!host_largepage_backed(vcpu->kvm, large_gfn))
-		return 0;
+	int host_level;
+	int level = PT_PAGE_TABLE_LEVEL;
 
 	slot = gfn_to_memslot(vcpu->kvm, large_gfn);
 	if (slot && slot->dirty_bitmap)
-		return 0;
+		return PT_PAGE_TABLE_LEVEL;
 
-	return 1;
+	host_level = host_mapping_level(vcpu->kvm, large_gfn);
+
+	if (host_level == PT_PAGE_TABLE_LEVEL)
+		return host_level;
+
+	for (lev
[PATCH 7/8] kvm/mmu: enable gbpages by increasing nr of pagesizes
Signed-off-by: Joerg Roedel
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 397f992..26ea9b8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -54,7 +54,7 @@
 #define UNMAPPED_GVA (~(gpa_t)0)
 
 /* KVM Hugepage definitions for x86 */
-#define KVM_NR_PAGE_SIZES 2
+#define KVM_NR_PAGE_SIZES 3
 #define KVM_HPAGE_SHIFT(x) (PAGE_SHIFT + (((x) - 1) * 9))
 #define KVM_HPAGE_SIZE(x) (1UL << KVM_HPAGE_SHIFT(x))
 #define KVM_HPAGE_MASK(x) (~(KVM_HPAGE_SIZE(x) - 1))
-- 
1.6.3.1
[PATCH 2/8] kvm: change memslot data structures for multiple hugepage sizes
Signed-off-by: Joerg Roedel
---
 arch/ia64/include/asm/kvm_host.h    |    3 +-
 arch/powerpc/include/asm/kvm_host.h |    3 +-
 arch/x86/include/asm/kvm_host.h     |   12 
 arch/x86/kvm/mmu.c                  |   30 ++-
 arch/x86/kvm/paging_tmpl.h          |    3 +-
 include/linux/kvm_host.h            |    2 +-
 virt/kvm/kvm_main.c                 |   53 +++---
 7 files changed, 65 insertions(+), 41 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 9cf1c4b..d9b6325 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -235,7 +235,8 @@ struct kvm_vm_data {
 #define KVM_REQ_PTC_G		32
 #define KVM_REQ_RESUME		33
 
-#define KVM_PAGES_PER_HPAGE	1
+#define KVM_NR_PAGE_SIZES	1
+#define KVM_PAGES_PER_HPAGE(x)	1
 
 struct kvm;
 struct kvm_vcpu;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 3625424..10e884e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -34,7 +34,8 @@
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 
 /* We don't currently support large pages. */
-#define KVM_PAGES_PER_HPAGE (1<<31)
+#define KVM_NR_PAGE_SIZES 1
+#define KVM_PAGES_PER_HPAGE(x) (1<<31)
 
 struct kvm;
 struct kvm_run;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c7b0cc2..930bac2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -53,12 +53,12 @@
 #define INVALID_PAGE (~(hpa_t)0)
 #define UNMAPPED_GVA (~(gpa_t)0)
 
-/* shadow tables are PAE even on non-PAE hosts */
-#define KVM_HPAGE_SHIFT 21
-#define KVM_HPAGE_SIZE (1UL << KVM_HPAGE_SHIFT)
-#define KVM_HPAGE_MASK (~(KVM_HPAGE_SIZE - 1))
-
-#define KVM_PAGES_PER_HPAGE (KVM_HPAGE_SIZE / PAGE_SIZE)
+/* KVM Hugepage definitions for x86 */
+#define KVM_NR_PAGE_SIZES 2
+#define KVM_HPAGE_SHIFT(x) (PAGE_SHIFT + (((x) - 1) * 9))
+#define KVM_HPAGE_SIZE(x) (1UL << KVM_HPAGE_SHIFT(x))
+#define KVM_HPAGE_MASK(x) (~(KVM_HPAGE_SIZE(x) - 1))
+#define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / PAGE_SIZE)
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index edf3dea..1f24d88 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -394,9 +394,9 @@ static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
 {
 	unsigned long idx;
 
-	idx = (gfn / KVM_PAGES_PER_HPAGE) -
-	      (slot->base_gfn / KVM_PAGES_PER_HPAGE);
-	return &slot->lpage_info[idx].write_count;
+	idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
+	      (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
+	return &slot->lpage_info[0][idx].write_count;
 }
 
 static void account_shadowed(struct kvm *kvm, gfn_t gfn)
@@ -485,10 +485,10 @@ static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int lpage)
 	if (!lpage)
 		return &slot->rmap[gfn - slot->base_gfn];
 
-	idx = (gfn / KVM_PAGES_PER_HPAGE) -
-	      (slot->base_gfn / KVM_PAGES_PER_HPAGE);
+	idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
+	      (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
 
-	return &slot->lpage_info[idx].rmap_pde;
+	return &slot->lpage_info[0][idx].rmap_pde;
 }
 
 /*
@@ -724,11 +724,11 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
 		end = start + (memslot->npages << PAGE_SHIFT);
 		if (hva >= start && hva < end) {
 			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+			int idx = gfn_offset /
+				  KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL);
 			retval |= handler(kvm, &memslot->rmap[gfn_offset]);
 			retval |= handler(kvm,
-					&memslot->lpage_info[
-					gfn_offset /
-					KVM_PAGES_PER_HPAGE].rmap_pde);
+					&memslot->lpage_info[0][idx].rmap_pde);
 		}
 	}
 
@@ -1852,8 +1852,9 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 	pfn_t pfn;
 	unsigned long mmu_seq;
 
-	if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
-		gfn &= ~(KVM_PAGES_PER_HPAGE-1);
+	if (is_largepage_backed(vcpu, gfn &
+			~(KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL) - 1))) {
+		gfn &= ~(KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL) - 1);
 		largepage = 1;
 	}
 
@@ -2058,8 +2059,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
 	if (r)
 		return r;
 
-	if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
-		gfn &= ~(KVM_PAGES_PE
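As an aside, a quick worked expansion of the new macros may help. The following standalone program is illustrative only (PAGE_SHIFT is 12 on x86, and each paging level adds 9 bits of address):

/* hpage_demo.c - expand the new KVM hugepage macros for levels 1-3.
 * Level 1 is the 4K base page, level 2 is 2M, level 3 is 1G. */
#include <stdio.h>

#define PAGE_SHIFT 12
#define KVM_HPAGE_SHIFT(x) (PAGE_SHIFT + (((x) - 1) * 9))
#define KVM_HPAGE_SIZE(x)  (1UL << KVM_HPAGE_SHIFT(x))
#define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / (1UL << PAGE_SHIFT))

int main(void)
{
	int x;

	for (x = 1; x <= 3; x++)
		printf("level %d: shift %d, size %lu KB, %lu base pages\n",
		       x, KVM_HPAGE_SHIFT(x), KVM_HPAGE_SIZE(x) >> 10,
		       KVM_PAGES_PER_HPAGE(x));
	return 0;
}

Running it prints 4 KB / 1 page for level 1, 2048 KB / 512 pages for level 2, and 1048576 KB / 262144 pages for level 3, which is exactly why lpage_info gains one array per supported hugepage size.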
[PATCH 1/8] hugetlbfs: export vma_kernel_pagesize to modules
This function is required by KVM.

Signed-off-by: Joerg Roedel
---
 mm/hugetlb.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e83ad2c..e2016ef 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -234,6 +234,7 @@ unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
 
 	return 1UL << (hstate->order + PAGE_SHIFT);
 }
+EXPORT_SYMBOL_GPL(vma_kernel_pagesize);
 
 /*
  * Return the page size being used by the MMU to back a VMA. In the majority
-- 
1.6.3.1
[PATCH 6/8] kvm/mmu: add support for another level to page walker
The page walker may be used with nested paging too, when accessing mmio
areas. Make it support the additional page level too.

Signed-off-by: Joerg Roedel
---
 arch/x86/kvm/mmu.c         |    6 ++
 arch/x86/kvm/paging_tmpl.h |   16 
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ef2396d..fc0e2fc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -117,6 +117,11 @@ module_param(oos_shadow, bool, 0644);
 #define PT64_DIR_BASE_ADDR_MASK \
 	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
 
+#define PT64_PDPE_BASE_ADDR_MASK \
+	(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (2 * PT64_LEVEL_BITS))) - 1))
+#define PT64_PDPE_OFFSET_MASK \
+	(PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (2 * PT64_LEVEL_BITS))) - 1))
+
 #define PT32_BASE_ADDR_MASK PAGE_MASK
 #define PT32_DIR_BASE_ADDR_MASK \
 	(PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
@@ -130,6 +135,7 @@ module_param(oos_shadow, bool, 0644);
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
+#define PT_PDPE_LEVEL 3
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 8fbf4e7..54c77be 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -55,6 +55,7 @@
 
 #define gpte_to_gfn FNAME(gpte_to_gfn)
 #define gpte_to_gfn_pde FNAME(gpte_to_gfn_pde)
+#define gpte_to_gfn_pdpe FNAME(gpte_to_gfn_pdpe)
 
 /*
  * The guest_walker structure emulates the behavior of the hardware page
@@ -81,6 +82,11 @@ static gfn_t gpte_to_gfn_pde(pt_element_t gpte)
 	return (gpte & PT_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
+static gfn_t gpte_to_gfn_pdpe(pt_element_t gpte)
+{
+	return (gpte & PT64_PDPE_BASE_ADDR_MASK) >> PAGE_SHIFT;
+}
+
 static bool FNAME(cmpxchg_gpte)(struct kvm *kvm,
 				gfn_t table_gfn, unsigned index,
 				pt_element_t orig_pte, pt_element_t new_pte)
@@ -201,6 +207,15 @@ walk:
 			break;
 		}
 
+		if (walker->level == PT_PDPE_LEVEL &&
+		    (pte & PT_PAGE_SIZE_MASK) &&
+		    is_long_mode(vcpu)) {
+			walker->gfn = gpte_to_gfn_pdpe(pte);
+			walker->gfn += (addr & PT64_PDPE_OFFSET_MASK)
+					>> PAGE_SHIFT;
+			break;
+		}
+
 		pt_access = pte_access;
 		--walker->level;
 	}
@@ -609,4 +624,5 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 #undef PT_MAX_FULL_LEVELS
 #undef gpte_to_gfn
 #undef gpte_to_gfn_pde
+#undef gpte_to_gfn_pdpe
 #undef CMPXCHG
-- 
1.6.3.1
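To see what the new PT_PDPE_LEVEL branch computes, here is a hedged numeric sketch. It assumes PAGE_SHIFT == 12 and PT64_LEVEL_BITS == 9, so a 1GB PDPE mapping covers bits 0-29 of the address; the sample values are made up:

/* pdpe_gfn_demo.c - illustrative only: how walker->gfn is derived
 * for a 1GB guest mapping, mirroring the branch added above. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* gpte_to_gfn_pdpe(pte): 1GB-aligned base gfn (0x40000 * 4K = 1GB) */
	uint64_t base_gfn = 0x40000;
	/* guest address inside the mapping (sample value) */
	uint64_t addr = 0x12345678;
	/* PT64_PDPE_OFFSET_MASK keeps the bits below the PDPE level */
	uint64_t offset_mask = (1ULL << 30) - 1;

	uint64_t gfn = base_gfn + ((addr & offset_mask) >> 12);
	printf("gfn = 0x%llx\n", (unsigned long long)gfn);	/* 0x52345 */
	return 0;
}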
[PATCH 5/8] kvm/mmu: make direct mapping paths aware of mapping levels
Signed-off-by: Joerg Roedel
---
 arch/x86/include/asm/kvm_host.h |    2 +-
 arch/x86/kvm/mmu.c              |   60 ++-
 arch/x86/kvm/paging_tmpl.h      |    6 ++--
 3 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 930bac2..397f992 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -314,7 +314,7 @@ struct kvm_vcpu_arch {
 	struct {
 		gfn_t gfn;	/* presumed gfn during guest pte update */
 		pfn_t pfn;	/* pfn corresponding to that gfn */
-		int largepage;
+		int level;
 		unsigned long mmu_seq;
 	} update_pte;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0ef947d..ef2396d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -254,7 +254,7 @@ static int is_last_spte(u64 pte, int level)
 {
 	if (level == PT_PAGE_TABLE_LEVEL)
 		return 1;
-	if (level == PT_DIRECTORY_LEVEL && is_large_pte(pte))
+	if (is_large_pte(pte))
 		return 1;
 	return 0;
 }
@@ -1708,7 +1708,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		    unsigned pte_access, int user_fault,
-		    int write_fault, int dirty, int largepage,
+		    int write_fault, int dirty, int level,
 		    gfn_t gfn, pfn_t pfn, bool speculative,
 		    bool can_unsync)
 {
@@ -1731,7 +1731,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		spte |= shadow_nx_mask;
 	if (pte_access & ACC_USER_MASK)
 		spte |= shadow_user_mask;
-	if (largepage)
+	if (level > PT_PAGE_TABLE_LEVEL)
 		spte |= PT_PAGE_SIZE_MASK;
 	if (tdp_enabled)
 		spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
@@ -1742,7 +1742,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 	if ((pte_access & ACC_WRITE_MASK)
 	    || (write_fault && !is_write_protection(vcpu) && !user_fault)) {
 
-		if (largepage && has_wrprotected_page(vcpu->kvm, gfn, 1)) {
+		if (level > PT_PAGE_TABLE_LEVEL &&
+		    has_wrprotected_page(vcpu->kvm, gfn, level)) {
 			ret = 1;
 			spte = shadow_trap_nonpresent_pte;
 			goto set_pte;
@@ -1780,7 +1781,7 @@ set_pte:
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 			 unsigned pt_access, unsigned pte_access,
 			 int user_fault, int write_fault, int dirty,
-			 int *ptwrite, int largepage, gfn_t gfn,
+			 int *ptwrite, int level, gfn_t gfn,
 			 pfn_t pfn, bool speculative)
 {
 	int was_rmapped = 0;
@@ -1796,7 +1797,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
 		 * the parent of the now unreachable PTE.
 		 */
-		if (largepage && !is_large_pte(*sptep)) {
+		if (level > PT_PAGE_TABLE_LEVEL &&
+		    !is_large_pte(*sptep)) {
 			struct kvm_mmu_page *child;
 			u64 pte = *sptep;
 
@@ -1809,8 +1811,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		} else
 			was_rmapped = 1;
 	}
+
 	if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
-		      dirty, largepage, gfn, pfn, speculative, true)) {
+		      dirty, level, gfn, pfn, speculative, true)) {
 		if (write_fault)
 			*ptwrite = 1;
 		kvm_x86_ops->tlb_flush(vcpu);
@@ -1846,7 +1849,7 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-			int largepage, gfn_t gfn, pfn_t pfn)
+			int level, gfn_t gfn, pfn_t pfn)
 {
 	struct kvm_shadow_walk_iterator iterator;
 	struct kvm_mmu_page *sp;
@@ -1854,11 +1857,10 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 	gfn_t pseudo_gfn;
 
 	for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
-		if (iterator.level == PT_PAGE_TABLE_LEVEL
-		    || (largepage && iterator.level == PT_DIRECTORY_LEVEL)) {
+		if (iterator.level == level) {
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
 				     0, write, 1, &pt_write,
-				     largepage, gfn, pfn, false);
+				     level, gfn, pfn, false);
 			++vcpu->stat.pf_fixed;
 			break;
[PATCH 4/8] kvm/mmu: make rmap code aware of mapping levels
This patch removes the largepage parameter from the rmap_add function.
Together with rmap_remove, this function now uses the role.level field
to determine whether the page is a huge page.

Signed-off-by: Joerg Roedel
---
 arch/x86/kvm/mmu.c |   53 +++
 1 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3fa6009..0ef947d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -511,19 +511,19 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
  * Note: gfn must be unaliased before this function get called
  */
 
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int lpage)
+static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
 {
 	struct kvm_memory_slot *slot;
 	unsigned long idx;
 
 	slot = gfn_to_memslot(kvm, gfn);
-	if (!lpage)
+	if (likely(level == PT_PAGE_TABLE_LEVEL))
 		return &slot->rmap[gfn - slot->base_gfn];
 
-	idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
-	      (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
+	idx = (gfn / KVM_PAGES_PER_HPAGE(level)) -
+	      (slot->base_gfn / KVM_PAGES_PER_HPAGE(level));
 
-	return &slot->lpage_info[0][idx].rmap_pde;
+	return &slot->lpage_info[level - 2][idx].rmap_pde;
 }
 
 /*
@@ -535,7 +535,7 @@ static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
  * If rmapp bit zero is one, (then rmap & ~1) points to a struct kvm_rmap_desc
  * containing more mappings.
  */
-static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, int lpage)
+static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 {
 	struct kvm_mmu_page *sp;
 	struct kvm_rmap_desc *desc;
@@ -547,7 +547,7 @@ static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 	gfn = unalias_gfn(vcpu->kvm, gfn);
 	sp = page_header(__pa(spte));
 	sp->gfns[spte - sp->spt] = gfn;
-	rmapp = gfn_to_rmap(vcpu->kvm, gfn, lpage);
+	rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
 	if (!*rmapp) {
 		rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
 		*rmapp = (unsigned long)spte;
@@ -614,7 +614,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
 		kvm_release_pfn_dirty(pfn);
 	else
 		kvm_release_pfn_clean(pfn);
-	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], is_large_pte(*spte));
+	rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], sp->role.level);
 	if (!*rmapp) {
 		printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
 		BUG();
@@ -677,10 +677,10 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
 	unsigned long *rmapp;
 	u64 *spte;
-	int write_protected = 0;
+	int i, write_protected = 0;
 
 	gfn = unalias_gfn(kvm, gfn);
-	rmapp = gfn_to_rmap(kvm, gfn, 0);
+	rmapp = gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL);
 
 	spte = rmap_next(kvm, rmapp, NULL);
 	while (spte) {
@@ -702,21 +702,24 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 	}
 
 	/* check for huge page mappings */
-	rmapp = gfn_to_rmap(kvm, gfn, 1);
-	spte = rmap_next(kvm, rmapp, NULL);
-	while (spte) {
-		BUG_ON(!spte);
-		BUG_ON(!(*spte & PT_PRESENT_MASK));
-		BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
-		pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
-		if (is_writeble_pte(*spte)) {
-			rmap_remove(kvm, spte);
-			--kvm->stat.lpages;
-			__set_spte(spte, shadow_trap_nonpresent_pte);
-			spte = NULL;
-			write_protected = 1;
+	for (i = PT_DIRECTORY_LEVEL;
+	     i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+		rmapp = gfn_to_rmap(kvm, gfn, i);
+		spte = rmap_next(kvm, rmapp, NULL);
+		while (spte) {
+			BUG_ON(!spte);
+			BUG_ON(!(*spte & PT_PRESENT_MASK));
+			BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
+			pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
+			if (is_writeble_pte(*spte)) {
+				rmap_remove(kvm, spte);
+				--kvm->stat.lpages;
+				__set_spte(spte, shadow_trap_nonpresent_pte);
+				spte = NULL;
+				write_protected = 1;
+			}
+			spte = rmap_next(kvm, rmapp, spte);
 		}
-
[PATCH 8/8] kvm x86: report 1GB page support to userspace
If userspace knows that the kernel part supports 1GB pages, it can
enable the corresponding cpuid bit so that guests actually use GB pages.

Signed-off-by: Joerg Roedel
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/svm.c              |    6 ++
 arch/x86/kvm/vmx.c              |    6 ++
 arch/x86/kvm/x86.c              |    3 ++-
 4 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26ea9b8..c120882 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -527,6 +527,7 @@ struct kvm_x86_ops {
 	int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
 	int (*get_tdp_level)(void);
 	u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+	bool (*gb_page_enable)(void);
 };
 
 extern struct kvm_x86_ops *kvm_x86_ops;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c283201..ccb75fb 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2718,6 +2718,11 @@ static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	return 0;
 }
 
+static bool svm_gb_page_enable(void)
+{
+	return npt_enabled;
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
@@ -2779,6 +2784,7 @@ static struct kvm_x86_ops svm_x86_ops = {
 	.set_tss_addr = svm_set_tss_addr,
 	.get_tdp_level = get_npt_level,
 	.get_mt_mask = svm_get_mt_mask,
+	.gb_page_enable = svm_gb_page_enable,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9332938..4418ba6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3891,6 +3891,11 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	return ret;
 }
 
+static bool vmx_gb_page_enable(void)
+{
+	return false;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
 	.cpu_has_kvm_support = cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
@@ -3950,6 +3955,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
 	.set_tss_addr = vmx_set_tss_addr,
 	.get_tdp_level = get_ept_level,
 	.get_mt_mask = vmx_get_mt_mask,
+	.gb_page_enable = vmx_gb_page_enable,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 71090d2..ddfb77e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1371,6 +1371,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 			 u32 index, int *nent, int maxnent)
 {
 	unsigned f_nx = is_efer_nx() ? F(NX) : 0;
+	unsigned f_gbpages = kvm_x86_ops->gb_page_enable() ? F(GBPAGES) : 0;
 #ifdef CONFIG_X86_64
 	unsigned f_lm = F(LM);
 #else
@@ -1395,7 +1396,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 		F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
 		F(PAT) | F(PSE36) | 0 /* Reserved */ |
 		f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
-		F(FXSR) | F(FXSR_OPT) | 0 /* GBPAGES */ | 0 /* RDTSCP */ |
+		F(FXSR) | F(FXSR_OPT) | f_gbpages | 0 /* RDTSCP */ |
 		0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
 	/* cpuid 1.ecx */
 	const u32 kvm_supported_word4_x86_features =
-- 
1.6.3.1
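On the guest side the effect is visible through CPUID: bit 26 of leaf 0x80000001 EDX is the 1GB-pages (Page1GB/GBPAGES) flag. A small check program, for illustration:

/* gbpages_check.c - report whether the CPU (or the virtual CPU
 * presented by KVM) advertises 1GB page support. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx))
		return 1;
	printf("1GB pages %ssupported\n", (edx & (1u << 26)) ? "" : "not ");
	return 0;
}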
[PATCH 0/8 v3] KVM support for 1GB pages
Hi,

this is the third version of the patches for KVM to support 1GB pages.
Changes to the last version include:

- changed reporting of 1GB page support to userspace to use the
  SUPPORTED_CPUID ioctl
- added support for 1GB pages to the software page walker code too

This code can still only make use of 1GB pages with nested paging
enabled. I will give the shadow paging code another debug round soon.
Please comment on these patches, or consider applying them.

Thanks,
Joerg

Shortlog:

Joerg Roedel (8):
      hugetlbfs: export vma_kernel_pagesize to modules
      kvm: change memslot data structures for multiple hugepage sizes
      kvm/mmu: rename is_largepage_backed to mapping_level
      kvm/mmu: make rmap code aware of mapping levels
      kvm/mmu: make direct mapping paths aware of mapping levels
      kvm/mmu: add support for another level to page walker
      kvm/mmu: enable gbpages by increasing nr of pagesizes
      kvm x86: report 1GB page support to userspace

Diffstat:

 arch/ia64/include/asm/kvm_host.h    |    3 +-
 arch/powerpc/include/asm/kvm_host.h |    3 +-
 arch/x86/include/asm/kvm_host.h     |   15 ++-
 arch/x86/kvm/mmu.c                  |  219 ++-
 arch/x86/kvm/paging_tmpl.h          |   27 -
 arch/x86/kvm/svm.c                  |    6 +
 arch/x86/kvm/vmx.c                  |    6 +
 arch/x86/kvm/x86.c                  |    3 +-
 include/linux/kvm_host.h            |    2 +-
 mm/hugetlb.c                        |    1 +
 virt/kvm/kvm_main.c                 |   53 ++---
 11 files changed, 222 insertions(+), 116 deletions(-)
Re: [PATCH] kvm: device-assignment: Add PCI option ROM support
On Fri, 2009-06-19 at 15:27 +0800, Yang, Sheng wrote:
> On Friday 19 June 2009 00:28:41 Alex Williamson wrote:
> > The one oddity I noticed is that even when the enable bit is clear, the
> > guest can read the ROM. I don't know that this is actually illegal, vs
> > returning zeros or ones though. It seems like maybe the generic PCI
> > code isn't tracking the enable bit. I think that's an independent
> > problem from this patch though. Thanks,
>
> That should be fine. I've taken a look at code, seems Linux kernel set
> enable_bit when someone begin to read rom, and copy rom to buffer, then
> unmap the rom. So the rom can be read when enable bit clear.

For this testing, I used an mmap of the ROM address though, so the
kernel caching shouldn't have been involved. It looks to me like the
problem is that the map function provided via pci_register_io_region()
only knows how to create mappings, not tear them down. I think maybe
pci_update_mappings() should still call the map_func when new_addr is -1
to let the io space driver shut down the mapping. As it is, once we set
up the mapping, it lives until something else happens to overlap it,
regardless of the state of the PCI BAR. Thanks,

Alex
Re: [PATCH 2/2][RFC] Kernel changes for HPET legacy mode (v7)
Jan Kiszka wrote:
> Beth Kon wrote:
>> When kvm is in hpet_legacy_mode, the hpet is providing the timer
>> interrupt and the pit should not be. So in legacy mode, the pit timer
>> is destroyed, but the *state* of the pit is maintained. So if kvm or
>> the guest tries to modify the state of the pit, this modification is
>> accepted, *except* that the timer isn't actually started. When we exit
>> hpet_legacy_mode, the current state of the pit (which is up to date
>> since we've been accepting modifications) is used to restart the pit
>> timer.
>>
>> The saved_mode code in kvm_pit_load_count temporarily changes mode to
>> 0xff in order to destroy the timer, but then restores the actual
>> value, again maintaining "current" state of the pit for possible later
>> reenablement.
>>
>> changes from v6:
>> - Added ioctl interface for legacy mode in order not to break the abi.
>>
>> Signed-off-by: Beth Kon
>> ...
>> @@ -1986,7 +1987,24 @@ static int kvm_vm_ioctl_set_pit(struct kvm *kvm, struct kvm_pit_state *ps)
>>  	int r = 0;
>>  
>>  	memcpy(&kvm->arch.vpit->pit_state, ps, sizeof(struct kvm_pit_state));
>> -	kvm_pit_load_count(kvm, 0, ps->channels[0].count);
>> +	kvm_pit_load_count(kvm, 0, ps->channels[0].count, 0);
>> +	return r;
>> +}
>> +
>> +static int kvm_vm_ioctl_get_hpet_legacy_mode(struct kvm *kvm, u8 *mode)
>> +{
>> +	int r = 0;
>> +	*mode = kvm->arch.vpit->pit_state.hpet_legacy_mode;
>> +	return r;
>> +}
>
> This only applies if we go for a separate mode IOCTL: The legacy mode
> is not directly modifiable by the guest. Is it planned to add in-kernel
> hpet support? Otherwise get_hpet_legacy_mode looks a bit like overkill
> given that user space could easily track the state.

Assuming I will at least generalize the ioctl, I'll leave this question
for the time being.

>> +static int kvm_vm_ioctl_set_hpet_legacy_mode(struct kvm *kvm, u8 *mode)
>> +{
>> +	int r = 0, start = 0;
>> +	if (kvm->arch.vpit->pit_state.hpet_legacy_mode == 0 && *mode == 1)
>
> Here you check for mode == 1, but legacy_mode is only checked for != 0.
> I would make this consistent.

ok

>> +		start = 1;
>> +	kvm->arch.vpit->pit_state.hpet_legacy_mode = *mode;
>> +	kvm_pit_load_count(kvm, 0, kvm->arch.vpit->pit_state.channels[0].count, start);
>>  	return r;
>>  }
>>
>> @@ -2047,6 +2065,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
>>  	struct kvm_pit_state ps;
>>  	struct kvm_memory_alias alias;
>>  	struct kvm_pit_config pit_config;
>> +	u8 hpet_legacy_mode;
>
> Hmm, instead of introducing a new pair of single-purpose IOCTLs, why
> not add KVM_GET/SET_PIT2 which exchanges an extended kvm_pit_state2?
> And that struct should also include some flags field and enough padding
> to be potentially extended yet again in the future. In that case I see
> no problem having also a mode read-back interface.

I thought about that, but it seemed to add unnecessary complexity, since
this legacy control is really outside of normal PIT operation, which is
embodied by KVM_GET/SET_PIT. It might be worth making this ioctl more
general. Rather than SET_HPET_LEGACY, have SET_PIT_CONTROLS and pass a
bit field, one of which is HPET_LEGACY. But, if general consensus is
that it would be better to create a kvm_pit_state2, and get/set that, I
can do that.

> Jan
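For concreteness, a sketch of what the kvm_pit_state2 idea could look like; this layout is illustrative only, not an agreed ABI:

/* Hypothetical sketch of the extended state Jan suggests; field
 * names and sizes below are assumptions for illustration. */
struct kvm_pit_state2 {
	struct kvm_pit_channel_state channels[3];
	__u32 flags;			/* e.g. bit 0 = HPET legacy mode */
	__u32 reserved[9];		/* padding for future extension */
};

#define KVM_PIT_FLAGS_HPET_LEGACY	0x00000001	/* illustrative */

The flags word covers the HPET-legacy bit (and any future PIT controls), while the reserved padding lets the struct grow without introducing yet another ioctl pair.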
[patch] qemu-kvm introduce -maxcpus argument
Hi,

This one introduces a -maxcpus setting, allowing the user to specify the
maximum number of vCPUs the system can have, as discussed with Gleb
earlier in the week.

Cheers,
Jes

Introduce -maxcpus flag to QEMU and pass the value to the BIOS through
FW_CFG. A follow-on patch will use it to determine the size of the MADT.

Signed-off-by: Jes Sorensen
---
 hw/fw_cfg.c          |    1 +
 hw/fw_cfg.h          |    1 +
 kvm/bios/rombios32.c |   16 
 qemu-options.hx      |    9 +
 sysemu.h             |    1 +
 vl.c                 |    8 
 6 files changed, 36 insertions(+)

Index: qemu-kvm/hw/fw_cfg.c
===
--- qemu-kvm.orig/hw/fw_cfg.c
+++ qemu-kvm/hw/fw_cfg.c
@@ -279,6 +279,7 @@ void *fw_cfg_init(uint32_t ctl_port, uin
     fw_cfg_add_bytes(s, FW_CFG_UUID, qemu_uuid, 16);
     fw_cfg_add_i16(s, FW_CFG_NOGRAPHIC, (uint16_t)(display_type == DT_NOGRAPHIC));
     fw_cfg_add_i16(s, FW_CFG_NB_CPUS, (uint16_t)smp_cpus);
+    fw_cfg_add_i16(s, FW_CFG_MAX_CPUS, (uint16_t)max_cpus);
 
     register_savevm("fw_cfg", -1, 1, fw_cfg_save, fw_cfg_load, s);
     qemu_register_reset(fw_cfg_reset, 0, s);
Index: qemu-kvm/hw/fw_cfg.h
===
--- qemu-kvm.orig/hw/fw_cfg.h
+++ qemu-kvm/hw/fw_cfg.h
@@ -15,6 +15,7 @@
 #define FW_CFG_INITRD_SIZE 0x0b
 #define FW_CFG_BOOT_DEVICE 0x0c
 #define FW_CFG_NUMA 0x0d
+#define FW_CFG_MAX_CPUS 0x0e
 #define FW_CFG_MAX_ENTRY 0x10
 
 #define FW_CFG_WRITE_CHANNEL 0x4000
Index: qemu-kvm/kvm/bios/rombios32.c
===
--- qemu-kvm.orig/kvm/bios/rombios32.c
+++ qemu-kvm/kvm/bios/rombios32.c
@@ -441,6 +441,7 @@ void delay_ms(int n)
 }
 
 uint16_t smp_cpus;
+uint16_t max_cpus = MAX_CPUS;
 uint32_t cpuid_signature;
 uint32_t cpuid_features;
 uint32_t cpuid_ext_features;
@@ -484,6 +485,7 @@ void wrmsr_smp(uint32_t index, uint64_t
 #define QEMU_CFG_ID       0x01
 #define QEMU_CFG_UUID     0x02
 #define QEMU_CFG_NUMA     0x0D
+#define QEMU_CFG_MAX_CPUS 0x0E
 #define QEMU_CFG_ARCH_LOCAL     0x8000
 #define QEMU_CFG_ACPI_TABLES  (QEMU_CFG_ARCH_LOCAL + 0)
 #define QEMU_CFG_SMBIOS_ENTRIES (QEMU_CFG_ARCH_LOCAL + 1)
@@ -546,6 +548,19 @@ static uint16_t smbios_entries(void)
     return cnt;
 }
 
+static uint16_t get_max_cpus(void)
+{
+    uint16_t cnt;
+
+    qemu_cfg_select(QEMU_CFG_MAX_CPUS);
+    qemu_cfg_read((uint8_t*)&cnt, sizeof(cnt));
+
+    if (!cnt)
+        cnt = MAX_CPUS;
+
+    return cnt;
+}
+
 uint64_t qemu_cfg_get64 (void)
 {
     uint64_t ret;
@@ -1655,6 +1670,7 @@ void acpi_bios_init(void)
     addr += sizeof(SSDTCode);
 
 #ifdef BX_QEMU
+    max_cpus = get_max_cpus();
     qemu_cfg_select(QEMU_CFG_NUMA);
     nb_numa_nodes = qemu_cfg_get64();
 #else
Index: qemu-kvm/qemu-options.hx
===
--- qemu-kvm.orig/qemu-options.hx
+++ qemu-kvm/qemu-options.hx
@@ -47,6 +47,15 @@ CPUs are supported. On Sparc32 target, L
 to 4.
 ETEXI
 
+DEF("maxcpus", HAS_ARG, QEMU_OPTION_maxcpus,
+    "-maxcpus n      set maximumthe number of possibly CPUs to 'n'\n")
+STEXI
+@item -maxcpus @var{n}
+Set the maximum number of possible CPUs to @var(n). @var(n) has to be
+bigger or equal to the value of -smp. If @var(n) is equal to -smp,
+there will be no space for hotplug cpus to be added later.
+ETEXI
+
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
     "-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]\n")
 STEXI
Index: qemu-kvm/sysemu.h
===
--- qemu-kvm.orig/sysemu.h
+++ qemu-kvm/sysemu.h
@@ -119,6 +119,7 @@ extern int alt_grab;
 extern int usb_enabled;
 extern int no_virtio_balloon;
 extern int smp_cpus;
+extern int max_cpus;
 extern int cursor_hide;
 extern int graphic_rotate;
 extern int no_quit;
Index: qemu-kvm/vl.c
===
--- qemu-kvm.orig/vl.c
+++ qemu-kvm/vl.c
@@ -246,6 +246,7 @@ int singlestep = 0;
 const char *assigned_devices[MAX_DEV_ASSIGN_CMDLINE];
 int assigned_devices_index;
 int smp_cpus = 1;
+int max_cpus = 16;
 const char *vnc_display;
 int acpi_enabled = 1;
 int no_hpet = 0;
@@ -5666,6 +5667,13 @@ int main(int argc, char **argv, char **e
                 exit(1);
             }
             break;
+        case QEMU_OPTION_maxcpus:
+            max_cpus = atoi(optarg);
+            if ((max_cpus < 1) || (max_cpus > machine->max_cpus)) {
+                fprintf(stderr, "Invalid number of CPUs\n");
+                exit(1);
+            }
+            break;
         case QEMU_OPTION_vnc:
             display_type = DT_VNC;
             vnc_display = optarg;
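For reference, a rough guest-side illustration of the fw_cfg handshake that rombios32.c relies on: on x86 the selector is written to I/O port 0x510 and the payload is then read byte-wise from port 0x511 (little-endian for items added with fw_cfg_add_i16). Sketch only, simplified from the qemu_cfg_select/qemu_cfg_read pattern above:

/* fw_cfg_read16 - minimal guest-side sketch of reading a 16-bit
 * fw_cfg item such as FW_CFG_MAX_CPUS (0x0e). */
#include <stdint.h>

#define FW_CFG_CTL  0x510	/* selector port on x86 */
#define FW_CFG_DATA 0x511	/* data port */

static inline void outw(uint16_t val, uint16_t port)
{
	asm volatile("outw %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
	uint8_t val;
	asm volatile("inb %1, %0" : "=a"(val) : "Nd"(port));
	return val;
}

static uint16_t fw_cfg_read16(uint16_t key)
{
	outw(key, FW_CFG_CTL);
	return inb(FW_CFG_DATA) | ((uint16_t)inb(FW_CFG_DATA) << 8);
}

/* usage: uint16_t max_cpus = fw_cfg_read16(0x0e); */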
[patch] qemu-kvm, use max_cpus to size MADT table
Hi,

This one goes on top of the -maxcpus patch. It changes the BIOS to use
max_cpus to determine the size of the MADT table. When no -maxcpus
argument is given, we still fall back to the old default of 16.

Cheers,
Jes

Build madt table based on max_cpus count.

Signed-off-by: Jes Sorensen
---
 kvm/bios/rombios32.c |    8 
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: qemu-kvm/kvm/bios/rombios32.c
===
--- qemu-kvm.orig/kvm/bios/rombios32.c
+++ qemu-kvm/kvm/bios/rombios32.c
@@ -1168,13 +1168,13 @@ static void mptable_init(void)
     putstr(&q, "0.1 "); /* vendor id */
     putle32(&q, 0); /* OEM table ptr */
     putle16(&q, 0); /* OEM table size */
-    putle16(&q, MAX_CPUS + 18); /* entry count */
+    putle16(&q, max_cpus + 18); /* entry count */
     putle32(&q, 0xfee00000); /* local APIC addr */
     putle16(&q, 0); /* ext table length */
     putb(&q, 0); /* ext table checksum */
     putb(&q, 0); /* reserved */
 
-    for(i = 0; i < MAX_CPUS ; i++) {
+    for(i = 0; i < max_cpus ; i++) {
         putb(&q, 0); /* entry type = processor */
         putb(&q, i); /* APIC id */
         putb(&q, 0x11); /* local APIC version number */
@@ -1700,7 +1700,7 @@ void acpi_bios_init(void)
     addr = (addr + 7) & ~7;
     madt_addr = addr;
     madt_size = sizeof(*madt) +
-        sizeof(struct madt_processor_apic) * MAX_CPUS +
+        sizeof(struct madt_processor_apic) * max_cpus +
 #ifdef BX_QEMU
         sizeof(struct madt_io_apic) +
         sizeof(struct madt_int_override) * MAX_INT_OVERRIDES;
 #else
@@ -1778,7 +1778,7 @@ void acpi_bios_init(void)
     madt->local_apic_address = cpu_to_le32(0xfee00000);
     madt->flags = cpu_to_le32(1);
     *(uint32_t*)APIC_MADT_PTR = apic = (void *)(madt + 1);
-    for(i=0;i<MAX_CPUS;i++) {
+    for(i=0;i<max_cpus;i++) {
         apic->type = APIC_PROCESSOR;
         apic->length = sizeof(*apic);
         apic->processor_id = i;
Re: [patch] qemu-kvm introduce -maxcpus argument
On Fri, Jun 19, 2009 at 03:59:57PM +0200, Jes Sorensen wrote:
> Hi,
>
> This one introduces a -maxcpus setting, allowing the user to specify
> the maximum number of vCPUs the system can have, as discussed with Gleb
> earlier in the week.

What is the maximum value for the 'maxcpus' setting for KVM ? libvirt
currently does

  fd = open("/dev/kvm")
  r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);

to figure out what the maximum allowed vCPUs will be for KVM, and
currently it is returning 16 IIRC.

> @@ -5666,6 +5667,13 @@ int main(int argc, char **argv, char **e
>                  exit(1);
>              }
>              break;
> +        case QEMU_OPTION_maxcpus:
> +            max_cpus = atoi(optarg);
> +            if ((max_cpus < 1) || (max_cpus > machine->max_cpus)) {
> +                fprintf(stderr, "Invalid number of CPUs\n");
> +                exit(1);
> +            }
> +            break;
>          case QEMU_OPTION_vnc:
>              display_type = DT_VNC;
>              vnc_display = optarg;

This implies the limit (for the x86 pc machine at least) is now 255. Is
that the correct interpretation on my part ?

Regards,
Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-        http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
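Spelled out as a complete program, the probe Daniel describes looks like the following (KVM_CHECK_EXTENSION returns the capability's value, here the recommended vCPU limit):

/* kvm_nr_vcpus.c - query the KVM vCPU limit the way libvirt does. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int fd = open("/dev/kvm", O_RDWR);
	if (fd < 0) {
		perror("open /dev/kvm");
		return 1;
	}
	int r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);
	printf("max vcpus: %d\n", r);
	close(fd);
	return 0;
}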
Re: [patch] qemu-kvm introduce -maxcpus argument
On 06/19/2009 04:15 PM, Daniel P. Berrange wrote:
> On Fri, Jun 19, 2009 at 03:59:57PM +0200, Jes Sorensen wrote:
>> Hi,
>> This one introduces a -maxcpus setting, allowing the user to specify
>> the maximum number of vCPUs the system can have, as discussed with
>> Gleb earlier in the week.
>
> What is the maximum value for the 'maxcpus' setting for KVM ?

Right now it is still 16, however I plan to change this.

> libvirt currently does
>
>   fd = open("/dev/kvm")
>   r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);
>
> to figure out what the maximum allowed vCPUs will be for KVM, and
> currently it is returning 16 IIRC.

Interesting, this will need to be addressed as well. I have plans to
introduce a mechanism telling the kernel where the limit will be, in
order to allow it to allocate data structures in a reasonable manner.

>> @@ -5666,6 +5667,13 @@ int main(int argc, char **argv, char **e
>>                  exit(1);
>>              }
>>              break;
>> +        case QEMU_OPTION_maxcpus:
>> +            max_cpus = atoi(optarg);
>> +            if ((max_cpus < 1) || (max_cpus > machine->max_cpus)) {
>> +                fprintf(stderr, "Invalid number of CPUs\n");
>> +                exit(1);
>> +            }
>
> This implies the limit (for the x86 pc machine at least) is now 255.
> Is that the correct interpretation on my part ?

Actually the 255 limit is tied to ACPI. Once we support ACPI 3.0 and
x2apic, it will get much worse. Be afraid, be very afraid :-) To be
honest, that is going to be a while before we get to that, but I hope to
get to it eventually.

I strongly recommend you try not to impose static limits within libvirt
for the number of vCPUs. I guess it will become a tricky issue who is
telling whom what the limit is. Ideally I would like to see the kernel
limit becoming unlimited and the restrictions being set by userland.

Cheers,
Jes
-no-kvm broken since merge a5b526135d
JFYI: qemu-kvm commit de408d70d8 works fine with -no-kvm, a5b526135d
produces

qemu: fatal: Trying to execute code outside RAM or ROM at 0x0001000d5f4f

EAX=0001 EBX=0070 ECX=005f EDX=0001
ESI= EDI= EBP= ESP=
EIP=000e5f4f EFL=0002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES = 9200   CS =f000 9a00   SS = 9200
DS = 9200   FS = 9200   GS = 9200
LDT= 8200   TR = 8b00
GDT= IDT=
CR0=6010 CR2= CR3= CR4=
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
CCS=0001 CCD= CCO=SUBL
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0= FPR1= FPR2= FPR3=
FPR4= FPR5= FPR6= FPR7=
XMM00= XMM01= XMM02= XMM03=
XMM04= XMM05= XMM06= XMM07=

(addresses may vary)

during early BIOS boot. Probably a merge conflict, but I do not yet see
which one. Maybe someone has an immediate idea what could cause this.
Upstream does not show this regression.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux
Re: [patch] qemu-kvm introduce -maxcpus argument
On (Fri) Jun 19 2009 [15:59:57], Jes Sorensen wrote:
> Hi,
>
> This one introduces a -maxcpus setting, allowing the user to specify
> the maximum number of vCPUs the system can have, as discussed with Gleb
> earlier in the week.

ACK, but please fix this:

+DEF("maxcpus", HAS_ARG, QEMU_OPTION_maxcpus,
+    "-maxcpus n      set maximumthe number of possibly CPUs to 'n'\n")
+STEXI

		Amit
Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
>> On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
>>> Michael S. Tsirkin wrote:
>>>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>>>>> Michael S. Tsirkin wrote:
>>>>>> BTW, Gregory, this can be used to fix the race in the design: create a thread and let it drop the module reference with module_put_and_exit.
>>>>> I had thought of doing something like this initially too, but I think it's racy as well. Ultimately, you need to make sure the eventfd callback is completely out before it's safe to run, and deferring to a thread would not change this race. The only sane way I can see to do that is to have the caller infrastructure annotate the event somehow (either directly with a module_put(), or indirectly with some kind of state transition that can be tracked with something like synchronize_sched()).
>>>> Here's what one could do: create a thread for each irqfd, and increment the module ref count, put that thread to sleep. When done with the irqfd, don't delete it and don't decrement the module refcount; wake the thread instead. The thread kills the irqfd and calls module_put_and_exit. I don't think it's racy.
>>> I believe it is. How would you prevent the thread from doing the module_put_and_exit() before the eventfd callback thread is known to have exited the relevant .text section?
>> Right.
>>> All this talk does give me an idea, tho. I'll make a patch.
>> OK, but ask yourself whether this bag of tricks is worth it, and whether we'll find another hole later. Let's reserve the trickiness for the fast path, where it's needed, and keep at least the assign/deassign simple.
> Understood. OTOH, going back to the model where two steps are needed for close() is ugly too, so I don't want to just give up and revert that fix too easily. At some point we will call it one way or the other, but I am not there quite yet.
>> Which will work, but I guess at this point we should ask ourselves whether all the heartburn with srcu, threads and module references is better than just asking the user to call an ioctl.
> I am starting to agree with you, here. :) Note one thing: the SRCU stuff is mostly orthogonal from the rest of the conversation re: the module_put() races. I only tied it into the current thread because the eventfd_notifier_register() thread gave me a convenient way to hook some other context to do the module_put(). In the long term, the srcu changes are for the can_sleep() stuff. So on that note, let's see if I can convince Davide that the srcu stuff is not so evil before we revert the POLLHUP patches, since the module_put() fix is trivial once that is in place.
>> Can this help with DEASSIGN as well? We need it for migration.
> No, but afaict you do not need this for migration anyway. Migrate the GSI and re-call kvm_irqfd() on the other side. Would the fd even be relevant across a migration anyway? I would think not, but admittedly I know little about how qemu/kvm migration actually works.
>> Yes but that's not live migration. For live migration, the trick is that you are running locally but send changes to the remote guest. For that, we need to put qemu in the middle between the device and the guest, so it can detect activity and update the remote side. And the best way to do that is to take the poll eventfd that the device assigns and push the eventfd that kvm polls. To switch between this setup and the one where kvm polls the eventfd from the device directly, you need deassign.
> So it's still not clear why the distinction between deassign-the-gsi-but-leave-the-fd-valid is needed over a simple close(). Can you elaborate?

The fd needs to be left assigned to the device, so that we can poll the fd and get events, then forward them to kvm.

-- MST
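[Editor's note: to make the "qemu in the middle" idea concrete, here is a minimal userspace sketch of the forwarding step Michael describes: poll the eventfd the device signals, consume the counter, note the activity for the migration logic, and re-signal the eventfd that kvm polls. The fd names and the dirty-tracking hook are illustrative, not actual qemu code:

    #include <poll.h>
    #include <stdint.h>
    #include <unistd.h>

    static void forward_events(int dev_fd, int kvm_fd)
    {
        struct pollfd pfd = { .fd = dev_fd, .events = POLLIN };
        uint64_t cnt;

        while (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            /* consume the counter from the eventfd the device signals */
            if (read(dev_fd, &cnt, sizeof(cnt)) != sizeof(cnt))
                break;
            /* ...record the activity for the remote side here... */
            /* re-signal the eventfd that kvm is polling */
            if (write(kvm_fd, &cnt, sizeof(cnt)) != sizeof(cnt))
                break;
        }
    }
]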
Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl
Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
>> So it's still not clear why the distinction between deassign-the-gsi-but-leave-the-fd-valid is needed over a simple close(). Can you elaborate?
> The fd needs to be left assigned to the device, so that we can poll the fd and get events, then forward them to kvm.

Ah, ok. Now I get what you are trying to do. Well, per the PM I sent you this morning, I figured out the magic to resolve the locking issues. So we should be able to add DEASSIGN logic soon, I hope.

-Greg
[PATCH] qemu-kvm: x86: Fix CPU initialization
Fix regression in CPU initialization caused by merge a5b526135d and try to avoid this in the future by dropping qemu-kvm specific pc_new_cpu. If such refactoring is desired, it should go through upstream first. Signed-off-by: Jan Kiszka --- hw/pc.c | 44 ++-- 1 files changed, 18 insertions(+), 26 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index 6c19f55..c4117e5 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -839,31 +839,6 @@ int cpu_is_bsp(CPUState *env) return env->cpuid_apic_id == 0; } -CPUState *pc_new_cpu(int cpu, const char *cpu_model, int pci_enabled) -{ -CPUState *env = cpu_init(cpu_model); -if (!env) { -fprintf(stderr, "Unable to find x86 CPU definition\n"); -exit(1); -} -if (cpu != 0) -env->halted = 1; -if ((env->cpuid_features & CPUID_APIC) || smp_cpus > 1) { -env->cpuid_apic_id = env->cpu_index; -apic_init(env); -} -qemu_register_reset(main_cpu_reset, 0, env); -if (pci_enabled) { -apic_init(env); -} - -/* kvm needs this to run after the apic is initialized. Otherwise, - * it can access invalid state and crash. - */ -qemu_init_vcpu(env); - return env; -} - /* PC hardware initialisation */ static void pc_init1(ram_addr_t ram_size, const char *boot_device, @@ -906,7 +881,24 @@ static void pc_init1(ram_addr_t ram_size, } for(i = 0; i < smp_cpus; i++) { - env = pc_new_cpu(i, cpu_model, pci_enabled); +env = cpu_init(cpu_model); +if (!env) { +fprintf(stderr, "Unable to find x86 CPU definition\n"); +exit(1); +} +if (cpu != 0) +env->halted = 1; +if ((env->cpuid_features & CPUID_APIC) || smp_cpus > 1) { +env->cpuid_apic_id = env->cpu_index; +apic_init(env); +} +qemu_register_reset(main_cpu_reset, 0, env); + +/* + * FIXME: qemu-kvm needs this after apic_init as it accesses APIC + * structures. + */ +qemu_init_vcpu(env); } vmport_init();
[PATCH v2] qemu-kvm: x86: Fix CPU initialization
Jan Kiszka wrote: > Fix regression in CPU initialization caused by merge a5b526135d and try > to avoid this in the future by dropping qemu-kvm specific pc_new_cpu. If > such refactoring is desired, it should go through upstream first. F...ine. I'll write a hundred times: "Don't post while still compiling." Here is a version that only fixes the regression. ---> Fix regression in CPU initialization caused by merge a5b526135d. Signed-off-by: Jan Kiszka --- hw/pc.c |3 --- 1 files changed, 0 insertions(+), 3 deletions(-) diff --git a/hw/pc.c b/hw/pc.c index 6c19f55..cb5b4d0 100644 --- a/hw/pc.c +++ b/hw/pc.c @@ -853,9 +853,6 @@ CPUState *pc_new_cpu(int cpu, const char *cpu_model, int pci_enabled) apic_init(env); } qemu_register_reset(main_cpu_reset, 0, env); -if (pci_enabled) { -apic_init(env); -} /* kvm needs this to run after the apic is initialized. Otherwise, * it can access invalid state and crash.
Checking guest memory pages changes from host userspace
Hi list, I need to monitor some guest memory pages. I need to know if the information in these pages was changed. For this, I was thinking of marking the guest memory pages in some way (like write-protecting them) so that a page fault is generated, and then handling this fault inside qemu. Is there some API in libkvm that allows me to do this? Thanks a lot, Pablo
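[Editor's note: the write-protect idea as described maps onto plain POSIX primitives; a rough sketch is below, with a hypothetical mark_page_dirty() bookkeeping helper. One caveat: this only catches accesses that qemu itself makes through its mapping of guest RAM (e.g. device emulation); guest-mode writes are handled by KVM's own shadow/EPT page tables and will not, in general, raise a SIGSEGV in qemu, which is why the question of a kernel-side API comes up at all.

    #include <signal.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096UL

    extern void mark_page_dirty(void *page);  /* hypothetical bookkeeping hook */

    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SIZE - 1));

        mark_page_dirty(page);
        /* reopen the page so the faulting write can complete on return */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
    }

    static void watch_region(void *start, size_t len)
    {
        struct sigaction sa = {
            .sa_sigaction = segv_handler,
            .sa_flags     = SA_SIGINFO,
        };

        sigaction(SIGSEGV, &sa, NULL);
        mprotect(start, len, PROT_READ);  /* write-protect the region */
    }
]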
[PATCH] virtio_blk: don't bounce highmem requests
By default a block driver bounces highmem requests, but virtio-blk is perfectly fine with any request that fits into its 64-bit addressing scheme, mapped in the kernel virtual space or not. Besides improving performance on highmem systems, this also makes the reproducible oops in __bounce_end_io go away (though it only hides the real cause). Signed-off-by: Christoph Hellwig Index: linux-2.6/drivers/block/virtio_blk.c === --- linux-2.6.orig/drivers/block/virtio_blk.c 2009-06-15 16:28:24.225815322 +0200 +++ linux-2.6/drivers/block/virtio_blk.c2009-06-19 18:03:12.469805377 +0200 @@ -360,6 +360,9 @@ static int __devinit virtblk_probe(struc blk_queue_max_phys_segments(vblk->disk->queue, vblk->sg_elems-2); blk_queue_max_hw_segments(vblk->disk->queue, vblk->sg_elems-2); + /* No need to bounce any requests */ + blk_queue_bounce_limit(vblk->disk->queue, BLK_BOUNCE_ANY); + /* No real sector limit. */ blk_queue_max_sectors(vblk->disk->queue, -1U);
[PATCH 1/3] eventfd: Allow waiters to be notified about the eventfd file* going away
From: Davide Libenzi And give them a chance to unregister from the wait queue. This in turn allows eventfd users to use the eventfd file* w/out holding a live reference to it. After the eventfd user callback returns, any usage of the eventfd file* should be dropped. The eventfd user callback can acquire sleepy locks since it is invoked lockless. This is a feature, needed by KVM to avoid an awkward workaround when using eventfd. Signed-off-by: Davide Libenzi Tested-by: Gregory Haskins Signed-off-by: Andrew Morton Signed-off-by: Gregory Haskins Signed-off-by: Avi Kivity --- fs/eventfd.c | 10 +- 1 files changed, 9 insertions(+), 1 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 2a701d5..c71f51d 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -59,7 +59,15 @@ int eventfd_signal(struct file *file, int n) static int eventfd_release(struct inode *inode, struct file *file) { - kfree(file->private_data); + struct eventfd_ctx *ctx = file->private_data; + + /* +* No need to hold the lock here, since we are on the file cleanup +* path and the ones still attached to the wait queue will be +* serialized by wake_up_locked_poll(). +*/ + wake_up_locked_poll(&ctx->wqh, POLLHUP); + kfree(ctx); return 0; }
Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface
Davide Libenzi wrote:
> On Thu, 18 Jun 2009, Gregory Haskins wrote:
>> But the fact is: I do not see any way to actually use your referenceless POLLHUP release code in a race-free way without doing something like I propose in 3/4, 4/4. Let's keep the discussion focused on that for now, if we could.
> OK, since I got literally swamped by the amount of talks and patches over this theoretically simple topic, would you mind 1) posting the global patch over eventfd

Done. Should show up as replies to this email. These apply to v2.6.30 and are devoid of any KVM-isms. :) (Note-1: I've built and run these patches against additional kvm/irqfd patches and verified they all work properly) (Note-2: I included your POLLHUP patch as 1/3 since it currently only exists in kvm.git, and is a pre-req for the other two)

> 2) describe exactly what races you are talking about

Covered in the patch headers (3/3 probably has the most detail)

> 3) explain why this should be any concern of eventfd at all?

To support any in-kernel eventfd client that wants to get both "signal" and "release" notifications (based on using your POLLHUP patch) in a race-free way. Without 2/3, 3/3, I cannot see how this can be done. Yet your 1/3 patch is an important enhancement (to at least one client). I suspect any other in-kernel users will find this a good feature as well. Thanks Davide, -Greg
[PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a notifier->release() callback. This lets notification clients know if the eventfd is about to go away and is very useful particularly for in-kernel clients. However, as it stands today it is not possible to use the notification API in a race-free way. This patch adds some additional logic to the notification subsystem to rectify this problem. Background: --- Eventfd currently only has one reference count mechanism: fget/fput. This in and of itself is normally fine. However, if a client expects to be notified if the eventfd is closed, it cannot hold a fget() reference itself or the underlying f_ops->release() callback will never be invoked by VFS. Therefore we have this somewhat unusual situation where we may hold a pointer to an eventfd object (by virtue of having a waiter registered in its wait-queue), but no reference. This makes it nearly impossible to design a mutual decoupling algorithm: you cannot unhook one side from the other (or vice versa) without racing. The problem was first dealt with by essentially "donating" a surrogate object to eventfd. In other words, when a client attached to eventfd and then later detached, it would decouple internally in a race-free way and then leave part of the object still attached to the eventfd. This decoupled object would correctly detect the end-of-life of the eventfd object at some point in the future and be deallocated. However, we cannot guarantee that this operation would not race with a potential rmmod of the client, and it is therefore broken. Solution Details: --- 1) We add a private kref to the internal eventfd_ctx object. This reference can be (transparently) held by notification clients without affecting the ability for VFS to indicate ->release() notification. 2) We convert the current lockless POLLHUP to a more traditional locked variant (*) so that we can ensure a race-free mutual-decouple algorithm without requiring a surrogate object. 3) We guard the decouple algorithm with an atomic bit-clear to ensure mutual exclusion of the decoupling and reference-drop. 4) We hold a reference to the underlying eventfd_ctx until all paths have satisfactorily completed to ensure we do not race with eventfd going away. Between these points, we believe we now have a race-free release mechanism. [*] Clients that previously assumed the ->release() could sleep will need to be refactored. Signed-off-by: Gregory Haskins CC: Davide Libenzi CC: Michael S. Tsirkin --- fs/eventfd.c| 62 +-- include/linux/eventfd.h |3 ++ 2 files changed, 47 insertions(+), 18 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index 3d7fb16..934efee 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -16,8 +16,10 @@ #include #include #include +#include struct eventfd_ctx { + struct kref kref; wait_queue_head_t wqh; /* * Every time that a write(2) is performed on an eventfd, the @@ -57,17 +59,24 @@ int eventfd_signal(struct file *file, int n) return n; } +static void _eventfd_release(struct kref *kref) +{ + struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref); + + kfree(ctx); +} + +static void _eventfd_put(struct kref *kref) +{ + kref_put(kref, &_eventfd_release); +} + static int eventfd_release(struct inode *inode, struct file *file) { struct eventfd_ctx *ctx = file->private_data; - /* -* No need to hold the lock here, since we are on the file cleanup -* path and the ones still attached to the wait queue will be -* serialized by wake_up_locked_poll().
-*/ - wake_up_locked_poll(&ctx->wqh, POLLHUP); - kfree(ctx); + wake_up_poll(&ctx->wqh, POLLHUP); + _eventfd_put(&ctx->kref); return 0; } @@ -222,6 +231,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags) if (!ctx) return -ENOMEM; + kref_init(&ctx->kref); init_waitqueue_head(&ctx->wqh); ctx->count = count; ctx->flags = flags; @@ -242,6 +252,15 @@ SYSCALL_DEFINE1(eventfd, unsigned int, count) return sys_eventfd2(count, 0); } +enum { + eventfd_notifier_flag_active, +}; + +static int test_and_clear_active(struct eventfd_notifier *en) +{ + return test_and_clear_bit(eventfd_notifier_flag_active, &en->flags); +} + static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key) { @@ -251,19 +270,15 @@ static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode, en = container_of(wait, struct eventfd_notifier, wait); if (flags & POLLIN) - /* -* The POLLIN wake_up is called with interrupts disabled. -*/ en->ops->signal(en);
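[Editor's note: stripped of the surrounding plumbing, the mutual-exclusion trick in point 3 of the solution details above is a winner-takes-teardown pattern: both the release path and the unregister path race to clear the same bit, only whichever gets there first is allowed to decouple, and the kref keeps the context alive until both are done. A reduced illustration using the names from the diff, not a literal excerpt:

    /* called from both f_ops->release() and the client's unregister path */
    static void notifier_decouple(struct eventfd_notifier *en,
                                  struct eventfd_ctx *ctx)
    {
        if (test_and_clear_bit(eventfd_notifier_flag_active, &en->flags)) {
            /* only the winner of the race unhooks and notifies */
            remove_wait_queue(en->wqh, &en->wait);
            en->ops->release(en);
        }

        /* both callers drop their reference; the last put frees ctx */
        kref_put(&ctx->kref, _eventfd_release);
    }
]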
[PATCH 2/3] eventfd: add generalized notifier interface
Users that want to register for signal notifications with eventfd have several choices today: They can do a standard sleep+wakeup against a ->read(), or they can provide their own wakeup handling using the wait-queue callback mechanism coupled with the eventfd->poll() interface. In fact, Davide recently published a patch that allows eventfd to transmit a "release" event when the underlying eventfd is closed via a POLLHUP wakeup. This type of event is extremely useful for in-kernel notification clients. However the wait-queue based notification interface alone is not sufficient to use this new information race-free since it requires operating lockless and referenceless. We need to track some additional data that is independent of the file* pointer, since we need f_ops->release() to still function. Therefore, this patch lays the groundwork to try and fix these issues. It accomplishes this by abstracting eventfd's wait-queue based notification interface behind eventfd specific register()/unregister() verbs. It also provides an eventfd specific object (eventfd_notifier) that is intended to be embedded in the client, but used by eventfd to track proper state. We will use this interface later in the series to fix the current races. Signed-off-by: Gregory Haskins CC: Davide Libenzi CC: Michael S. Tsirkin --- fs/eventfd.c| 64 +++ include/linux/eventfd.h | 33 2 files changed, 97 insertions(+), 0 deletions(-) diff --git a/fs/eventfd.c b/fs/eventfd.c index c71f51d..3d7fb16 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -242,3 +242,67 @@ SYSCALL_DEFINE1(eventfd, unsigned int, count) return sys_eventfd2(count, 0); } +static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode, + int sync, void *key) +{ + struct eventfd_notifier *en; + unsigned long flags = (unsigned long)key; + + en = container_of(wait, struct eventfd_notifier, wait); + + if (flags & POLLIN) + /* +* The POLLIN wake_up is called with interrupts disabled. +*/ + en->ops->signal(en); + + if (flags & POLLHUP) { + /* +* The POLLHUP is called unlocked, so it theoretically should +* be safe to remove ourselves from the wqh using the locked +* variant of remove_wait_queue() +*/ + remove_wait_queue(en->wqh, &en->wait); + en->ops->release(en); + } + + return 0; +} + +static void eventfd_notifier_ptable_enqueue(struct file *file, + wait_queue_head_t *wqh, + poll_table *pt) +{ + struct eventfd_notifier *en; + + en = container_of(pt, struct eventfd_notifier, pt); + + en->wqh = wqh; + add_wait_queue(wqh, &en->wait); +} + +int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en) +{ + unsigned int events; + + if (file->f_op != &eventfd_fops) + return -EINVAL; + + /* +* Install our own custom wake-up handling so we are notified via +* a callback whenever someone signals the underlying eventfd +*/ + init_waitqueue_func_entry(&en->wait, eventfd_notifier_wakeup); + init_poll_funcptr(&en->pt, eventfd_notifier_ptable_enqueue); + + events = file->f_op->poll(file, &en->pt); + + return (events & POLLIN) ?
1 : 0; +} +EXPORT_SYMBOL_GPL(eventfd_notifier_register); + +void eventfd_notifier_unregister(struct eventfd_notifier *en) +{ + remove_wait_queue(en->wqh, &en->wait); +} +EXPORT_SYMBOL_GPL(eventfd_notifier_unregister); diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index f45a8ae..cb23969 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -8,6 +8,32 @@ #ifndef _LINUX_EVENTFD_H #define _LINUX_EVENTFD_H +#include +#include +#include +#include + +struct eventfd_notifier; + +struct eventfd_notifier_ops { + void (*signal)(struct eventfd_notifier *en); + void (*release)(struct eventfd_notifier *en); +}; + +struct eventfd_notifier { + poll_table pt; + wait_queue_head_t *wqh; + wait_queue_t wait; + const struct eventfd_notifier_ops *ops; +}; + +static inline void eventfd_notifier_init(struct eventfd_notifier *en, +const struct eventfd_notifier_ops *ops) +{ + memset(en, 0, sizeof(*en)); + en->ops = ops; +} + #ifdef CONFIG_EVENTFD /* For O_CLOEXEC and O_NONBLOCK */ @@ -29,12 +55,19 @@ struct file *eventfd_fget(int fd); int eventfd_signal(struct file *file, int n); +int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en); +void eventfd_notifier_unregister(struct eventfd_notifier *en); #else /* CONF
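[Editor's note: to see how the proposed verbs are meant to be consumed, here is a minimal hypothetical client pieced together from the declarations above; the my_* names are mine, not part of the patch:

    static void my_signal(struct eventfd_notifier *en)
    {
        /* the eventfd was signalled (POLLIN) */
    }

    static void my_release(struct eventfd_notifier *en)
    {
        /* the eventfd file* is going away (POLLHUP); stop using it */
    }

    static const struct eventfd_notifier_ops my_ops = {
        .signal  = my_signal,
        .release = my_release,
    };

    static struct eventfd_notifier my_notifier;

    static int my_attach(struct file *eventfd_file)
    {
        eventfd_notifier_init(&my_notifier, &my_ops);
        /* <0 on error; otherwise 1 if the eventfd is already signalled */
        return eventfd_notifier_register(eventfd_file, &my_notifier);
    }
]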
Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> Therefore we have this somewhat unusual situation where we may hold a pointer to an eventfd object (by virtue of having a waiter registered in its wait-queue), but no reference. This makes it nearly impossible to design a mutual decoupling algorithm: you cannot unhook one side from the other (or vice versa) without racing.

And why is that?

	struct xxx {
		struct mutex mtx;
		struct file *file;
		...
	};

	struct file *xxx_get_file(struct xxx *x) {
		struct file *file;

		mutex_lock(&x->mtx);
		file = x->file;
		if (!file)
			mutex_unlock(&x->mtx);
		return file;
	}

	void xxx_release_file(struct xxx *x) {
		mutex_unlock(&x->mtx);
	}

	void handle_POLLHUP(struct xxx *x) {
		struct file *file;

		file = xxx_get_file(x);
		if (file) {
			unhook_waitqueue(file, ...);
			x->file = NULL;
			xxx_release_file(x);
		}
	}

Every time you need to "use" file, you call xxx_get_file(), and if you get NULL, it means it's gone and you handle it accordingly to your IRQ fd policies. As soon as you're done with the file, you call xxx_release_file(). Replace "mtx" with the lock that fits your needs.

- Davide
kvm: emulation failure
I see this problem with an x86 SLES10 guest running on an x86_64 Intel host. If the guest is reset abruptly and rebooted, it hangs somewhere before the grub sequence and the following message is seen in the logs:

emulation failed (pagetable) rip 7ed5 66 60 ac 20

I located this instruction sequence in isolinux.bin on the ISO, if that is relevant. I did some analysis and found that there is an EPT violation, which is handled, and then the next instruction '66 60' is decoded and emulated, but the decode fails. KVM then loops in the kernel in __vcpu_run(). The code path is: kvm_run() -> __vcpu_run() -> vcpu_enter_guest() -> kvm_handle_exit() -> handle_ept_violation() -> kvm_mmu_page_fault() -> emulate_instruction() -> x86_decode_insn()

Any insights here on how to fix the problem are appreciated. And if a fix already exists, even better :) thanks, RP
Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
Davide Libenzi wrote:
> Every time you need to "use" file, you call xxx_get_file(), and if you get NULL, it means it's gone and you handle it accordingly to your IRQ fd policies. As soon as you're done with the file, you call xxx_release_file(). Replace "mtx" with the lock that fits your needs.

Consider what would happen if the f_ops->release() was preempted inside the wake_up_locked_poll() after it dereferenced the xxx from the list, but before it calls the callback(POLLHUP). The xxx object, and/or the .text for the xxx object, may be long gone by the time it comes back around. Afaict, there is no way to guard against that scenario unless you do something like 2/3+3/3. Or am I missing something?

-Greg
Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
On Fri, 19 Jun 2009, Gregory Haskins wrote:
> Consider what would happen if the f_ops->release() was preempted inside the wake_up_locked_poll() after it dereferenced the xxx from the list, but before it calls the callback(POLLHUP). The xxx object, and/or the .text for the xxx object, may be long gone by the time it comes back around. Afaict, there is no way to guard against that scenario unless you do something like 2/3+3/3. Or am I missing something?

Right. Don't you see an easier answer to that, instead of adding 300 lines of code to eventfd? For example, turning wake_up_locked() into a normal wake_up().

- Davide
Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
Davide Libenzi wrote:
> Right. Don't you see an easier answer to that, instead of adding 300 lines of code to eventfd?

I tried, but this problem made my head hurt and this was what I came up with that I felt closes the holes all the way. Also keep in mind that while I added X lines to eventfd, I took Y lines *out* of irqfd in the process, too. I just excluded the KVM portions in this thread per your request, so it's not apparent. But technically, any other clients that may come along can reuse that notification code instead of coding it again. One way or the other, *someone* has to do that ptable_proc stuff ;)

FYI: It's more like 133 lines, fwiw.

 fs/eventfd.c            | 104
 include/linux/eventfd.h |  36
 2 files changed, 133 insertions(+), 7 deletions(-)

In case you care, here's what the complete solution looks like when I include KVM:

 fs/eventfd.c            | 104 +--
 include/linux/eventfd.h |  36 +
 virt/kvm/eventfd.c      | 181 +---
 3 files changed, 228 insertions(+), 93 deletions(-)

> For example, turning wake_up_locked() into a normal wake_up().

I am fairly confident it is not that simple after having thought about this issue over the last few days. But I've been wrong in the past. Propose a patch and I will review it for races/correctness, if you like. Perhaps a combination of that plus your asymmetrical locking scheme would work. One of the challenges you will hit is avoiding ABBA between your "get" lock and the wqh, but good luck!

-Greg
Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
On Fri, 19 Jun 2009, Gregory Haskins wrote:
> I am fairly confident it is not that simple after having thought about this issue over the last few days. But I've been wrong in the past. Propose a patch and I will review it for races/correctness, if you like. Perhaps a combination of that plus your asymmetrical locking scheme would work. One of the challenges you will hit is avoiding ABBA between your "get" lock and the wqh, but good luck!

A patch for what? The eventfd patch is a one-liner. It seems hard to believe that the thing cannot be handled on your side. Once the wake_up_locked() is turned into a wake_up(), what other races are there? Let's not try to find the problem that fits and justify the "cool" solution, but let's see if a problem is there at all.

- Davide
Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
On Fri, 19 Jun 2009, Davide Libenzi wrote:
> A patch for what? The eventfd patch is a one-liner. It seems hard to believe that the thing cannot be handled on your side. Once the wake_up_locked() is turned into a wake_up(), what other races are there?

AFAICS, the IRQfd code simply registers the callback to ->poll() and waits for two events. In the POLLIN event, you schedule_work(&irqfd->inject) and there are no races there AFAICS (you basically do not care about anything eventfd-memory related at all). For POLLHUP, you do:

	spin_lock(irqfd->slock);
	if (irqfd->wqh)
		schedule_work(&irqfd->inject);
	irqfd->wqh = NULL;
	spin_unlock(irqfd->slock);

In your work function you notice the POLLHUP condition and take proper action (dunno what it is in your case). In your kvm_irqfd_release() function:

	spin_lock(irqfd->slock);
	if (irqfd->wqh)
		remove_wait_queue(irqfd->wqh, &irqfd->wait);
	irqfd->wqh = NULL;
	spin_unlock(irqfd->slock);

Any races in there?

- Davide
Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions
Davide Libenzi wrote:
> A patch for what? The eventfd patch is a one-liner. It seems hard to believe that the thing cannot be handled on your side.

Yes, understood. What I was trying to gently say is that the one-liner proposal alone is still broken afaict. However, if there is another solution that works that you like better than the 133-liner I posted, I am more than willing to help analyze it. In the end, I only care that this is fixed.

> For POLLHUP, you do:
>
>	spin_lock(irqfd->slock);
>	if (irqfd->wqh)
>		schedule_work(&irqfd->inject);
>	irqfd->wqh = NULL;
>	spin_unlock(irqfd->slock);
>
> In your kvm_irqfd_release() function:
>
>	spin_lock(irqfd->slock);
>	if (irqfd->wqh)
>		remove_wait_queue(irqfd->wqh, &irqfd->wait);
>	irqfd->wqh = NULL;
>	spin_unlock(irqfd->slock);
>
> Any races in there?

Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock (recall wqh has to be locked to fix that other race I mentioned). (As a hint, I think I fixed 4-5 races with these patches, so there are a few others still lurking as well.)

-Greg
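[Editor's note: the ABBA hazard Greg is referring to has the classic shape; this is a schematic reading of the proposal above, not a quote from either patch. With a plain wake_up(), the POLLHUP callback runs with wqh->lock already held and then takes irqfd->slock, while kvm_irqfd_release() takes irqfd->slock first and then needs wqh->lock inside remove_wait_queue():

    /* CPU0 (POLLHUP wakeup)          CPU1 (kvm_irqfd_release)        */
    /* spin_lock(&wqh->lock);         spin_lock(&irqfd->slock);       */
    /* spin_lock(&irqfd->slock);      remove_wait_queue(wqh, ...);    */
    /*   ...blocks on CPU1...           -> spin_lock(&wqh->lock);     */
    /*                                  ...blocks on CPU0: deadlock   */
]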
Luvalley-3 has been released, this release is for AMD CPU only
Hi, all, This Luvalley release is for machines with AMD CPUs only. It cannot work on Intel CPUs. Also, this release is for a Linux dom0 only. Please refer to Luvalley-2 if you have a machine with Intel CPUs. Due to the limitations of AMD CPUs, this Luvalley release has a new architecture, different from the one used on Intel CPUs. The limitation is that AMD CPUs cannot receive the NMI IPI; that is, sending an NMI IPI to the other CPUs in an SMP machine does not work. More precisely, the BSP (i.e., CPU #0) can receive the NMI IPI issued from the other APs (i.e., CPU #1, CPU #2, ...); the APs, however, cannot receive the NMI IPI from the BSP. I tried this on Linux 2.6.18 and Xen 3.1 and got the same problem. Ridiculously, the "AMD64 Architecture Programmer's Manual Volume 2: System Programming" clearly states that the NMI IPI is fully supported, just like Intel's. The same code works very well on Intel CPUs but cannot work on AMD CPUs. So, we have to trap all physical interrupts on machines with AMD CPUs and then inject them back into dom0; that is, the CPU is configured to vmexit to Luvalley when physical interrupts occur. Moreover, we have to use a normal interrupt vector for IPI sending and receiving. This caused many problems with APIC ACK and interrupt blocking. Although there are still many problems unsolved, Luvalley can indeed run Linux as its dom0 and a Windows domU as well. We have decided to pause development for AMD CPUs until AMD works around the NMI IPI bug (which may never happen). On the other side, Luvalley's architecture will be modified little by little in the future, while still preserving current features such as being easy to use and install, an arbitrary dom0 OS, hardware compatibility, etc. From a long-term perspective, Luvalley should trap physical interrupts and manage the APICs itself; at that point, AMD CPUs could be supported even if the NMI IPI bug is still there. More details on Luvalley: Luvalley is a Virtual Machine Monitor (VMM) spawned from the KVM project. Part of its source code is derived from KVM to virtualize CPU instructions and the memory management unit (MMU). However, its overall architecture is completely different from KVM's, and somewhat like Xen's. Luvalley runs outside of Linux, just like Xen. Any operating system, including Linux, could be used as Luvalley's scheduler, memory manager, physical device driver provider and virtual IO device emulator. Currently, Luvalley supports Linux and Windows. That is to say, one may run Luvalley to boot a Linux or Windows, and then run multiple virtualized operating systems on top of that Linux or Windows. If you are interested in the Luvalley project, you may download the source code from http://sourceforge.net/projects/luvalley/ Regards, Xiaodong
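[Editor's note: for context, the operation reported as failing here is an ordinary ICR write with NMI delivery mode. In Linux terms it amounts to something like the sketch below, simplified from the kernel's IPI helpers and shown only to illustrate what "sending an NMI IPI" means; it is not Luvalley source:

    #include <asm/apic.h>     /* apic_write() */
    #include <asm/apicdef.h>  /* APIC_ICR, APIC_ICR2, APIC_DM_NMI, ... */

    static void send_nmi_ipi(unsigned int dest_apicid)
    {
        /* destination goes in ICR high; writing ICR low sends the IPI */
        apic_write(APIC_ICR2, SET_APIC_DEST_FIELD(dest_apicid));
        /* the vector field is ignored for NMI delivery mode */
        apic_write(APIC_ICR, APIC_DEST_PHYSICAL | APIC_DM_NMI);
    }
]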
kernel backtrace pointing at KVM
Hello KVM team, I am using KVM for maintaining a mixture of Windows and Linux VMs. Recently I reinstalled the system with the latest kernel 2.6.30 and qemu-kvm-0.86. I kept using the KVM module from the 2.6.30 kernel, not the one shipped with qemu-kvm. Last night one of the Windows XP SP3 systems rebooted and I found the following trace in my logs:

[ cut here ]
WARNING: at arch/x86/kvm/x86.c:204 kvm_queue_exception_e+0x24/0x45 [kvm]()
Hardware name: IBM System x3650 -[7979B4G]-
Modules linked in: 8021q garp tun reiserfs st ide_gd_mod ide_cd_mod bridge stp kvm_intel kvm ipv6 af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq speedstep_lib loop dm_mod i5k_amb ibmpex i5000_edac ibmaem i2c_i801 iTCO_wdt iTCO_vendor_support rtc_cmos bnx2 sr_mod rtc_core ipmi_msghandler ses edac_core rtc_lib button pcspkr i2c_core cdrom serio_raw enclosure shpchp pci_hotplug joydev sg usbhid hid uhci_hcd ehci_hcd usbcore sd_mod crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic ata_piix libata thermal processor thermal_sys hwmon aacraid scsi_mod
Pid: 26872, comm: qemu-system-x86 Not tainted 2.6.30-9-pae #1
Call Trace:
 [] warn_slowpath_common+0x60/0x90
 [] warn_slowpath_null+0xd/0x10
 [] kvm_queue_exception_e+0x24/0x45 [kvm]
 [] kvm_task_switch+0xfb/0xada [kvm]
 [] ? _spin_unlock+0xf/0x23
 [] ? kvm_mmu_page_fault+0x16/0x75 [kvm]
 [] handle_task_switch+0x6d/0x96 [kvm_intel]
 [] kvm_handle_exit+0x1c3/0x1e0 [kvm_intel]
 [] kvm_arch_vcpu_ioctl_run+0x89d/0xab6 [kvm]
 [] ? do_sync_readv_writev+0xa1/0xdf
 [] kvm_vcpu_ioctl+0xec/0x602 [kvm]
 [] ? _spin_unlock+0xf/0x23
 [] ? kvm_vcpu_ioctl+0x0/0x602 [kvm]
 [] vfs_ioctl+0x22/0x69
 [] do_vfs_ioctl+0x439/0x472
 [] ? unmap_region+0x10a/0x124
 [] ? fget_light+0x8a/0xb1
 [] sys_ioctl+0x40/0x5a
 [] sysenter_do_call+0x12/0x28
---[ end trace 9cc5a56b9c7eda31 ]---

Here is some more info:

# uname -a
Linux pc187 2.6.30-9-pae #1 SMP PREEMPT Wed Jun 17 15:29:59 EEST 2009 i686 i686 i386 GNU/Linux
# qemu-system-x86_64 --version
QEMU PC emulator version 0.10.50 (qemu-kvm-devel-86), Copyright (c) 2003-2008 Fabrice Bellard
# modinfo /lib/modules/2.6.30-9-pae/kernel/arch/x86/kvm/kvm.ko
filename: /lib/modules/2.6.30-9-pae/kernel/arch/x86/kvm/kvm.ko
license: GPL
author: Qumranet
srcversion: 934CD9DB264501B0431438A
depends:
vermagic: 2.6.30-9-pae SMP preempt mod_unload modversions CORE2
parm: oos_shadow:bool
parm: msi2intx:bool

Do I do something wrong, or is it a bug in KVM? Thanks for the great work! Hope to hear from you soon. Best regards, Ivelin Ivanov

P.S. Please don't hesitate to ask me if you need any more info to trace this issue.
Re: Checking guest memory pages changes from host userspace
On (Fri) Jun 19 2009 [12:09:10], Passera, Pablo R wrote:
> Hi list, I need to monitor some guest memory pages. I need to know if the information in these pages was changed. For this, I was thinking of marking the guest memory pages in some way (like write-protecting them) so that a page fault is generated, and then handling this fault inside qemu. Is there some API in libkvm that allows me to do this?

If it's for debugging, you can use qemu without kvm. If you're only interested in a few pages (and are proposing a solution based on that), I don't think that can be done with EPT / NPT enabled, so it won't be something that will be accepted upstream. Amit
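[Editor's note: if coarser tracking is acceptable, the one mechanism KVM does expose to userspace for this is per-slot dirty logging, which live migration uses: register the memory slot with the KVM_MEM_LOG_DIRTY_PAGES flag and then harvest a one-bit-per-page bitmap. A rough sketch of the harvesting side, with the vm_fd and slot setup assumed to be done elsewhere:

    #include <linux/kvm.h>
    #include <string.h>
    #include <sys/ioctl.h>

    /* slot must have been registered with KVM_MEM_LOG_DIRTY_PAGES set */
    static int fetch_dirty_bitmap(int vm_fd, __u32 slot, void *bitmap)
    {
        struct kvm_dirty_log log;

        memset(&log, 0, sizeof(log));
        log.slot = slot;
        log.dirty_bitmap = bitmap;  /* one bit per page in the slot */

        /* fills the bitmap and clears KVM's internal dirty state */
        return ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log);
    }
]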