Re: [KVM_AUTOTEST] unattended installs take 2

2009-06-19 Thread Yaniv Kaul

On 6/19/2009 12:50 AM, David Huff wrote:

Second pass at the unattended install test.  Both Linux and Windows guests are
working; however, it currently just reuses the existing boot test with extended
timeouts.  We still need a good way of determining whether an unattended install
completed without error.  For Linux guests we should be able to run something
similar to a regular boot test: after the reboot, try to ssh in.  For Windows
guests we still have to run setup in order to have ssh up.
   
The unattended script can run user-defined commands at the end. Those
could be 'set up SSH and start it'.

Y.

Requires the processor patch as well as the patch to "strip and split" patches.

The scripts still use loopback mounts for now; we can upgrade this in the future,
but it is the most generic way of creating and manipulating disk images.

5 patches in this set
0005-Modified-boot-test-in-kvm_test.py.patch
0004-Added-two-sample-unattended-config-files-Fedora-and.patch
0003-added-unattended.sh-script.patch
0002-modified-config-file-to-run-unattended-install.patch
0001-Added-floppy-and-tftp-options-to-qemu-command.patch



Re: [PATCH] Added two sample unattended config files, Fedora and Windows

2009-06-19 Thread Yaniv Kaul

On 6/19/2009 12:50 AM, David Huff wrote:

You must add your own product key to the Windows config file.
---
  client/tests/kvm/unattended/Fedora-11-i386.ks |   24 +++
  client/tests/kvm/unattended/WinXP-32.sif  |   52 +
  2 files changed, 76 insertions(+), 0 deletions(-)
  create mode 100644 client/tests/kvm/unattended/Fedora-11-i386.ks
  create mode 100644 client/tests/kvm/unattended/WinXP-32.sif

diff --git a/client/tests/kvm/unattended/Fedora-11-i386.ks b/client/tests/kvm/unattended/Fedora-11-i386.ks
new file mode 100644
index 000..da2f1ed
--- /dev/null
+++ b/client/tests/kvm/unattended/Fedora-11-i386.ks
@@ -0,0 +1,24 @@
+install
+cdrom
+text
+reboot
+lang en_US
+keyboard us
+network --bootproto dhcp
+rootpw 123456
+firewall --enabled --ssh
+selinux --enforcing
+timezone --utc America/New_York
+firstboot --disable
+bootloader --location=mbr
+zerombr
+
+clearpart --all --initlabel
+part /boot  --fstype=ext3 --size=100
+part /  --fstype=ext3 --size=2000
+part swap   --fstype=swap --size=512
+
+%packages
+@ Base
+%end
+
diff --git a/client/tests/kvm/unattended/WinXP-32.sif b/client/tests/kvm/unattended/WinXP-32.sif
new file mode 100644
index 000..196e193
--- /dev/null
+++ b/client/tests/kvm/unattended/WinXP-32.sif
@@ -0,0 +1,52 @@
+[Data]
+AutoPartition = 1
+MsDosInitiated = 0
+UnattendedInstall = Yes
+
+[Unattended]
+UnattendMode = FullUnattended
+OemSkipEula = Yes
+OemPreinstall = No
+UnattendSwitch = Yes
+CrashDumpSetting = 1
+DriverSigningPolicy = ignore
+WaitForReboot = no
+Repartition = yes
+
+[GuiUnattended]
+AdminPassword = 123456
+AutoLogon = Yes
+AutoLogonCount = 5
+OEMSkipRegional = 1
+TimeZone = 85
+OemSkipWelcome = 1
+
+[UserData]
+FullName = "QumranetUser"
+OrgName = "Qumranet"
   

Let's change this...
Y.

+ComputerName = *
+ProductKey =
+
+[Identification]
+
+[Networking]
+
+[Components]
+dialer = off
+media_clips = off
+media_utopia = off
+msnexplr = off
+netoc = off
+OEAccess = off
+templates = off
+WMAccess = off
+zonegames = off
+
+[TerminalServices]
+AllowConnections = 1
+
+[WindowsFirewall]
+Profiles = WindowsFirewall.TurnOffFirewall
+[WindowsFirewall.TurnOffFirewall]
+Mode = 0
+
   




[PATCH/RFC] virtio_test: A module for testing virtio via userspace

2009-06-19 Thread Christian Bornträger
Hello Rusty,

this is the result of a two-month internship on virtio testing.

From: Adrian Schneider 
From: Tim Hofmann 
From: Christian Ehrhardt 
From: Christian Borntraeger 

This patch introduces a prototype for a virtio_test module. This module can
be bound to any virtio device via the sysfs bind/unbind feature, e.g.:
$ echo virtio1 > /sys/bus/virtio/drivers/virtio_rng/unbind
$ modprobe virtio_test

On probe this module registers to all virtqueues and creates a character
device (/dev/viotest*) for every virtio device.
The character device offers ioctls that allow a userspace application to submit
virtio operations like add_buf, kick and get_buf. It also offers ioctls to get
information about the device and to query the number of callbacks that have
occurred (or to wait synchronously on callbacks).
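
To make the interface concrete, here is a minimal usage sketch. The VIOTEST_*
ioctl names and numbers below are placeholders invented for illustration
(virtio_test.h is not quoted in this message) and are not the module's actual ABI:

    /* Hypothetical usage sketch -- the VIOTEST_* ioctls are stand-ins; see
     * virtio_test.h in the patch for the real interface. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/uio.h>
    #include <unistd.h>
    #include <linux/ioctl.h>

    /* Placeholder ioctl definitions, NOT the real ABI. */
    #define VIOTEST_ADDBUF  _IOW('V', 0, struct iovec)
    #define VIOTEST_KICK    _IO('V', 1)
    #define VIOTEST_WAIT_CB _IO('V', 2)
    #define VIOTEST_GETBUF  _IOR('V', 3, struct iovec)

    int main(void)
    {
        char buf[64] = "test data";
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        int fd = open("/dev/viotest0", O_RDWR); /* node created on probe */

        if (fd < 0) {
            perror("open");
            return 1;
        }
        ioctl(fd, VIOTEST_ADDBUF, &iov); /* queue a buffer (add_buf) */
        ioctl(fd, VIOTEST_KICK);         /* notify the host (kick) */
        ioctl(fd, VIOTEST_WAIT_CB);      /* block until a callback fires */
        ioctl(fd, VIOTEST_GETBUF, &iov); /* reclaim the buffer (get_buf) */
        close(fd);
        return 0;
    }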

The driver currently lacks the following planned features:
o  userspace tooling for fuzzing (a prototype exists)
o  feature bit support
o  support arbitrary pointer mode in add_buf (e.g. test how qemu deals with
   iovecs pointing beyond the guest memory size)
o  priority binding with other virtio drivers (e.g. if virtio_blk and 
   virtio_test are compiled into the kernel, virtio_blk should get all block
   devices by default on hotplug)

I would like to get feedback on 

o  the general idea of a virtio_test module
o  the user interface ioctls
o  further ideas and comments 

Signed-off-by: Christian Borntraeger 
---
 drivers/virtio/Kconfig   |   12 
 drivers/virtio/Makefile  |2 
 drivers/virtio/virtio_test.c |  710 +++
 include/linux/Kbuild |1 
 include/linux/virtio_test.h  |  146 
 5 files changed, 871 insertions(+)

Index: linux-2.6/drivers/virtio/Kconfig
===
--- linux-2.6.orig/drivers/virtio/Kconfig
+++ linux-2.6/drivers/virtio/Kconfig
@@ -33,3 +33,15 @@ config VIRTIO_BALLOON
 
 If unsure, say M.
 
+config VIRTIO_TEST
+   tristate "Virtio test driver (EXPERIMENTAL)"
+   select VIRTIO
+   select VIRTIO_RING
+   ---help---
+This driver supports testing arbitrary virtio devices. The driver
+offers IOCTLs to run add_buf/get_buf etc. from userspace. You can
+bind/unbind any unused virtio device to this driver via sysfs. Each
+bound device will get a /dev/viotest* device node.
+
+If unsure, say M.
+
Index: linux-2.6/drivers/virtio/Makefile
===
--- linux-2.6.orig/drivers/virtio/Makefile
+++ linux-2.6/drivers/virtio/Makefile
@@ -2,3 +2,5 @@ obj-$(CONFIG_VIRTIO) += virtio.o
 obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
 obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
 obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
+obj-$(CONFIG_VIRTIO_TEST) += virtio_test.o
+
Index: linux-2.6/drivers/virtio/virtio_test.c
===
--- /dev/null
+++ linux-2.6/drivers/virtio/virtio_test.c
@@ -0,0 +1,710 @@
+/*
+ *  Test driver for the virtio bus
+ *
+ *Copyright IBM Corp. 2009
+ *Author(s): Adrian Schneider 
+ *   Tim Hofmann 
+ *   Christian Ehrhardt 
+ *   Christian Borntraeger 
+ */
+
+
+#define KMSG_COMPONENT "virtio_test"
+#define pr_fmt(fmt) KMSG_COMPONENT ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static u32 viotest_major = VIOTEST_MAJOR;
+static struct class *viotest_class;
+static LIST_HEAD(viotest_list);
+static spinlock_t viotest_list_lock = SPIN_LOCK_UNLOCKED;
+
+static void free_kvec(struct kvec *kiov, u32 index)
+{
+   u32 i;
+
+   for (i = 0; i < index; i++)
+   kfree(kiov[i].iov_base);
+
+   kfree(kiov);
+}
+
+/*
+ * This function copies a userspace iovec * array into a kernel kvec * array
+ */
+static int copy_iovec_from_user(struct kvec **kiov, struct iovec __user *uiov,
+   u32 uiov_num)
+{
+   u32 i;
+   u64 kiov_sz;
+   struct iovec uservec;
+
+   kiov_sz = sizeof(struct kvec) * uiov_num;
+   *kiov = kmalloc(kiov_sz, GFP_KERNEL);
+   if (!(*kiov))
+   return -ENOMEM;
+
+   for (i = 0; i < uiov_num; i++) {
+   if (copy_from_user(&uservec, &uiov[i], sizeof(struct iovec))) {
+   free_kvec(*kiov, i);
+   return -EFAULT;
+   }
+   (*kiov)[i].iov_base = kmalloc(uservec.iov_len, GFP_KERNEL);
+   if (!(*kiov)[i].iov_base) {
+   free_kvec(*kiov, i);
+   return -ENOMEM;
+   }
+
+   if (copy_from_user((*kiov)[i].iov_base, uservec.iov_base, uservec.iov_len)) {
+   free_kvec(*kiov, i);
+   return -EFAULT;
+   }
+   (*kiov)[i].iov_len = uservec.iov_len;
+   }
+
+   return 0;
+}
+
+static int copy_kvec_to_user(struct iovec 

Re: [PATCH] kvm: device-assignment: Add PCI option ROM support

2009-06-19 Thread Yang, Sheng
On Friday 19 June 2009 00:28:41 Alex Williamson wrote:
> On Thu, 2009-06-18 at 13:30 +0800, Yang, Sheng wrote:
> > On Tuesday 16 June 2009 00:29:17 Alex Williamson wrote:
> > > The PCI code already knows about option ROMs, so we just need to
> > > mmap some space for it, load it with a copy of the contents, and
> > > complete the plumbing to the generic code.
> > >
> > > With this a Linux guest can get access to the ROM contents via
> > > /sys/bus/pci/devices//rom.  This might also enable the BIOS
> > > to execute ROMs by loading them dynamically from the device
> > > rather than hoping they all fit into RAM.
> >
> > The patch looks fine. One question: if the guest writes to the ROM, I think
> > the guest would be killed since QEmu would receive a SIGSEGV? I am not sure
> > if that's too severe...
>
> Hi Sheng,
>
> Good thought.  I tested this with a simple program that mmaps the ROM
> address from /dev/mem and tries to write to it (using setpci to enable
> the ROM).  The results are a little surprising.  On the host, writing to
> the ROM causes an NMI and the system dies.  On the KVM guest, the write
> is happily discarded, neither segfaulting from the mprotect nor
> affecting the contents of the ROM.  So it seems that something above my
> mprotect is discarding the write, and if we did hit it, a qemu segfault
> isn't that far from what happens on bare metal.
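
For reference, the test described above amounts to something like the
following sketch (ROM_ADDR and ROM_SIZE are placeholders for the device's
expansion ROM BAR values, which vary per device; setpci is used beforehand
to set the ROM enable bit):

    /* mmap the ROM BAR via /dev/mem and attempt a write. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define ROM_ADDR 0xfebe0000UL /* placeholder BAR address */
    #define ROM_SIZE 0x20000UL    /* placeholder ROM size */

    int main(void)
    {
        volatile unsigned char *rom;
        int fd = open("/dev/mem", O_RDWR | O_SYNC);

        if (fd < 0) {
            perror("open /dev/mem");
            return 1;
        }
        rom = mmap(NULL, ROM_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, ROM_ADDR);
        if (rom == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("rom[0] before write: %02x\n", rom[0]);
        rom[0] = ~rom[0]; /* the write under test */
        printf("rom[0] after write:  %02x\n", rom[0]);
        munmap((void *)rom, ROM_SIZE);
        close(fd);
        return 0;
    }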
>
Oh, that's good. :)

> The one oddity I noticed is that even when the enable bit is clear, the
> guest can read the ROM.  I don't know that this is actually illegal, vs
> returning zeros or ones though.  It seems like maybe the generic PCI
> code isn't tracking the enable bit.  I think that's an independent
> problem from this patch though.  Thanks,

That should be fine. I've taken a look at the code; it seems the Linux kernel
sets the enable bit when someone begins to read the ROM, copies the ROM into a
buffer, and then unmaps it. So the ROM can be read while the enable bit is clear.

-- 
regards
Yang, Sheng

>
> Alex
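
The sysfs path described above can be exercised from userspace like this
(a sketch; the device address in the path is an example, and per the
documented sysfs PCI ABI a write to the rom file enables it before reading):

    /* Read a device's option ROM via sysfs: writing to the "rom" file
     * enables the image, reading then returns its contents. */
    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/bus/pci/devices/0000:01:00.0/rom";
        unsigned char buf[4096];
        size_t n;
        FILE *f = fopen(path, "r+");

        if (!f) {
            perror("fopen");
            return 1;
        }
        fputs("1", f); /* enable the ROM image */
        fflush(f);
        rewind(f);
        n = fread(buf, 1, sizeof(buf), f);
        if (n >= 2) /* option ROMs start with the 0x55 0xaa signature */
            printf("read %zu bytes: %02x %02x\n", n, buf[0], buf[1]);
        fclose(f);
        return 0;
    }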




Re: ESX on KVM requirements

2009-06-19 Thread Alexander Graf


On 19.06.2009, at 03:19, Ben Sanders wrote:


I needed nested paging to have things work, so Phenom was mandatory.

Also, because I was running things on a Phenom, I figured the safest bet is
to be truthful in telling the guest what it's running on. Also, ESX checked
for specific cpuid revisions and I knew the CPU I was on was safe.

If you get it running in debug mode, do -serial stdio and send me the output
so I can see if I remember the cure ;)

Alex



I cut out the mailing list since people probably don't want my attachment.


I switched to a Phenom 9950 to see if I'd have some better luck.

I'm running kvm-86 with nested=1 and -enable-nesting on Ubuntu 9.04.

Here's the command I'm running.  Setting the NIC model was necessary
to get it to install.
qemu-system-x86_64 -cpu phenom -enable-nesting -hda ~/esx.img -m 1024
-net nic,model=e1000 -net user

Attached is the debug output.  Thanks for your help.


Looks like you're missing my VMware interface extension patches that
implement TSC fake value passing to the guest? The series was called
"VMware ESX guest bringup (partial)".

Also, try to keep the ML in CC and simply pastebin instead of attaching
the log file. There are probably more people out there who have the same
question ;-).


Alex



[PATCH] KVM: SVM: Pass through the host kernel's IO delay port

2009-06-19 Thread Paolo Bonzini
From: Paolo Bonzini 

KVM's optimization of guest port 80 accesses was removed last May 11
in commit 99f85a.  However, that removal probably carries a speed penalty.
I don't have a machine to test with, but the equivalent VMX patch (fdef3ad)
reported a speedup of 3-5%, and on the Xen mailing list it was mentioned that
passing port 80 through had positive effects on Xen's startup speed.

We can enable passthrough to the same port the host kernel uses instead.

Signed-off-by: Paolo Bonzini 
---
 arch/x86/kernel/io_delay.c |1 +
 arch/x86/kvm/svm.c |5 +
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/io_delay.c b/arch/x86/kernel/io_delay.c
index a979b5b..a478cb2 100644
--- a/arch/x86/kernel/io_delay.c
+++ b/arch/x86/kernel/io_delay.c
@@ -129,3 +129,4 @@ static int __init io_delay_param(char *s)
 }
 
 early_param("io_delay", io_delay_param);
+EXPORT_SYMBOL_GPL(io_delay_type);
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1f8510c..c49f4db 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 #include 
@@ -411,6 +412,10 @@ static __init int svm_hardware_setup(void)
 
iopm_va = page_address(iopm_pages);
memset(iopm_va, 0xff, PAGE_SIZE * (1 << IOPM_ALLOC_ORDER));
+   if (io_delay_type == CONFIG_IO_DELAY_TYPE_0X80)
+   clear_bit(0x80, iopm_va);
+   else if (io_delay_type == CONFIG_IO_DELAY_TYPE_0XED)
+   clear_bit(0xED, iopm_va);
iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
 
if (boot_cpu_has(X86_FEATURE_NX))
-- 
1.6.0.3



Re: [KVM-AUTOTEST PATCH] stress_boot - Boot VMs until one of them becomes unresponsive - Version2

2009-06-19 Thread Lucas Meneghel Rodrigues
On Thu, 2009-06-18 at 17:16 +0800, Yolkfull Chow wrote:

> Hi Lucas, I also found the number-counting problem after sending the
> patch. I haven't been able to re-send the updated one since I have had
> some other things to deal with these days.  Sorry for that...
> 
> Please see the attachment for the updated version.  Thank you so much. :)

Ok, patch applied. Thank you very much for your work!




[PATCH 3/8] kvm/mmu: rename is_largepage_backed to mapping_level

2009-06-19 Thread Joerg Roedel
With the new name and the corresponding backend changes this function
can now support multiple hugepage sizes.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/mmu.c |  100 +--
 arch/x86/kvm/paging_tmpl.h |4 +-
 2 files changed, 69 insertions(+), 35 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1f24d88..3fa6009 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -390,37 +390,52 @@ static void mmu_free_rmap_desc(struct kvm_rmap_desc *rd)
  * Return the pointer to the largepage write count for a given
  * gfn, handling slots that are not large page aligned.
  */
-static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
+static int *slot_largepage_idx(gfn_t gfn,
+  struct kvm_memory_slot *slot,
+  int level)
 {
unsigned long idx;
 
-   idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
- (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
-   return &slot->lpage_info[0][idx].write_count;
+   idx = (gfn / KVM_PAGES_PER_HPAGE(level)) -
+ (slot->base_gfn / KVM_PAGES_PER_HPAGE(level));
+   return &slot->lpage_info[level - 2][idx].write_count;
 }
 
 static void account_shadowed(struct kvm *kvm, gfn_t gfn)
 {
+   struct kvm_memory_slot *slot;
int *write_count;
+   int i;
 
gfn = unalias_gfn(kvm, gfn);
-   write_count = slot_largepage_idx(gfn,
-gfn_to_memslot_unaliased(kvm, gfn));
-   *write_count += 1;
+
+   for (i = PT_DIRECTORY_LEVEL;
+i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+   slot  = gfn_to_memslot_unaliased(kvm, gfn);
+   write_count   = slot_largepage_idx(gfn, slot, i);
+   *write_count += 1;
+   }
 }
 
 static void unaccount_shadowed(struct kvm *kvm, gfn_t gfn)
 {
+   struct kvm_memory_slot *slot;
int *write_count;
+   int i;
 
gfn = unalias_gfn(kvm, gfn);
-   write_count = slot_largepage_idx(gfn,
-gfn_to_memslot_unaliased(kvm, gfn));
-   *write_count -= 1;
-   WARN_ON(*write_count < 0);
+   for (i = PT_DIRECTORY_LEVEL;
+i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+   slot  = gfn_to_memslot_unaliased(kvm, gfn);
+   write_count   = slot_largepage_idx(gfn, slot, i);
+   *write_count -= 1;
+   WARN_ON(*write_count < 0);
+   }
 }
 
-static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
+static int has_wrprotected_page(struct kvm *kvm,
+   gfn_t gfn,
+   int level)
 {
struct kvm_memory_slot *slot;
int *largepage_idx;
@@ -428,47 +443,67 @@ static int has_wrprotected_page(struct kvm *kvm, gfn_t gfn)
gfn = unalias_gfn(kvm, gfn);
slot = gfn_to_memslot_unaliased(kvm, gfn);
if (slot) {
-   largepage_idx = slot_largepage_idx(gfn, slot);
+   largepage_idx = slot_largepage_idx(gfn, slot, level);
return *largepage_idx;
}
 
return 1;
 }
 
-static int host_largepage_backed(struct kvm *kvm, gfn_t gfn)
+static int host_mapping_level(struct kvm *kvm, gfn_t gfn)
 {
+   unsigned long page_size = PAGE_SIZE;
struct vm_area_struct *vma;
unsigned long addr;
-   int ret = 0;
+   int i, ret = 0;
 
addr = gfn_to_hva(kvm, gfn);
if (kvm_is_error_hva(addr))
-   return ret;
+   return page_size;
 
down_read(&current->mm->mmap_sem);
vma = find_vma(current->mm, addr);
-   if (vma && is_vm_hugetlb_page(vma))
-   ret = 1;
+   if (!vma)
+   goto out;
+
+   page_size = vma_kernel_pagesize(vma);
+
+out:
up_read(&current->mm->mmap_sem);
 
+   for (i = PT_PAGE_TABLE_LEVEL;
+i < (PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES); ++i) {
+   if (page_size >= KVM_HPAGE_SIZE(i))
+   ret = i;
+   else
+   break;
+   }
+
return ret;
 }
 
-static int is_largepage_backed(struct kvm_vcpu *vcpu, gfn_t large_gfn)
+static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
 {
struct kvm_memory_slot *slot;
-
-   if (has_wrprotected_page(vcpu->kvm, large_gfn))
-   return 0;
-
-   if (!host_largepage_backed(vcpu->kvm, large_gfn))
-   return 0;
+   int host_level;
+   int level = PT_PAGE_TABLE_LEVEL;
 
slot = gfn_to_memslot(vcpu->kvm, large_gfn);
if (slot && slot->dirty_bitmap)
-   return 0;
+   return PT_PAGE_TABLE_LEVEL;
 
-   return 1;
+   host_level = host_mapping_level(vcpu->kvm, large_gfn);
+
+   if (host_level == PT_PAGE_TABLE_LEVEL)
+   return host_level;
+
+   for (lev

[PATCH 7/8] kvm/mmu: enable gbpages by increasing nr of pagesizes

2009-06-19 Thread Joerg Roedel
Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/kvm_host.h |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 397f992..26ea9b8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -54,7 +54,7 @@
 #define UNMAPPED_GVA (~(gpa_t)0)
 
 /* KVM Hugepage definitions for x86 */
-#define KVM_NR_PAGE_SIZES  2
+#define KVM_NR_PAGE_SIZES  3
 #define KVM_HPAGE_SHIFT(x) (PAGE_SHIFT + (((x) - 1) * 9))
 #define KVM_HPAGE_SIZE(x)  (1UL << KVM_HPAGE_SHIFT(x))
 #define KVM_HPAGE_MASK(x)  (~(KVM_HPAGE_SIZE(x) - 1))
-- 
1.6.3.1
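
With PAGE_SHIFT == 12 on x86, the KVM_HPAGE_* macros introduced in patch 2/8
of this series then cover three page sizes. A quick standalone check of the
arithmetic (just the macro math reproduced in a userspace program, not kernel
code):

    #include <stdio.h>

    #define PAGE_SHIFT         12
    #define KVM_HPAGE_SHIFT(x) (PAGE_SHIFT + (((x) - 1) * 9))
    #define KVM_HPAGE_SIZE(x)  (1ULL << KVM_HPAGE_SHIFT(x))

    int main(void)
    {
        int level;

        /* level 1 = 4 KB, level 2 = 2 MB, level 3 = 1 GB */
        for (level = 1; level <= 3; level++)
            printf("level %d: shift %d, size %llu KB\n", level,
                   KVM_HPAGE_SHIFT(level),
                   (unsigned long long)(KVM_HPAGE_SIZE(level) >> 10));
        return 0;
    }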




[PATCH 2/8] kvm: change memslot data structures for multiple hugepage sizes

2009-06-19 Thread Joerg Roedel
Signed-off-by: Joerg Roedel 
---
 arch/ia64/include/asm/kvm_host.h|3 +-
 arch/powerpc/include/asm/kvm_host.h |3 +-
 arch/x86/include/asm/kvm_host.h |   12 
 arch/x86/kvm/mmu.c  |   30 ++-
 arch/x86/kvm/paging_tmpl.h  |3 +-
 include/linux/kvm_host.h|2 +-
 virt/kvm/kvm_main.c |   53 +++---
 7 files changed, 65 insertions(+), 41 deletions(-)

diff --git a/arch/ia64/include/asm/kvm_host.h b/arch/ia64/include/asm/kvm_host.h
index 9cf1c4b..d9b6325 100644
--- a/arch/ia64/include/asm/kvm_host.h
+++ b/arch/ia64/include/asm/kvm_host.h
@@ -235,7 +235,8 @@ struct kvm_vm_data {
 #define KVM_REQ_PTC_G  32
 #define KVM_REQ_RESUME 33
 
-#define KVM_PAGES_PER_HPAGE1
+#define KVM_NR_PAGE_SIZES  1
+#define KVM_PAGES_PER_HPAGE(x) 1
 
 struct kvm;
 struct kvm_vcpu;
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 3625424..10e884e 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -34,7 +34,8 @@
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 
 /* We don't currently support large pages. */
-#define KVM_PAGES_PER_HPAGE (1<<31)
+#define KVM_NR_PAGE_SIZES  1
+#define KVM_PAGES_PER_HPAGE(x) (1<<31)
 
 struct kvm;
 struct kvm_run;
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c7b0cc2..930bac2 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -53,12 +53,12 @@
 #define INVALID_PAGE (~(hpa_t)0)
 #define UNMAPPED_GVA (~(gpa_t)0)
 
-/* shadow tables are PAE even on non-PAE hosts */
-#define KVM_HPAGE_SHIFT 21
-#define KVM_HPAGE_SIZE (1UL << KVM_HPAGE_SHIFT)
-#define KVM_HPAGE_MASK (~(KVM_HPAGE_SIZE - 1))
-
-#define KVM_PAGES_PER_HPAGE (KVM_HPAGE_SIZE / PAGE_SIZE)
+/* KVM Hugepage definitions for x86 */
+#define KVM_NR_PAGE_SIZES  2
+#define KVM_HPAGE_SHIFT(x) (PAGE_SHIFT + (((x) - 1) * 9))
+#define KVM_HPAGE_SIZE(x)  (1UL << KVM_HPAGE_SHIFT(x))
+#define KVM_HPAGE_MASK(x)  (~(KVM_HPAGE_SIZE(x) - 1))
+#define KVM_PAGES_PER_HPAGE(x) (KVM_HPAGE_SIZE(x) / PAGE_SIZE)
 
 #define DE_VECTOR 0
 #define DB_VECTOR 1
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index edf3dea..1f24d88 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -394,9 +394,9 @@ static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot)
 {
unsigned long idx;
 
-   idx = (gfn / KVM_PAGES_PER_HPAGE) -
- (slot->base_gfn / KVM_PAGES_PER_HPAGE);
-   return &slot->lpage_info[idx].write_count;
+   idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
+ (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
+   return &slot->lpage_info[0][idx].write_count;
 }
 
 static void account_shadowed(struct kvm *kvm, gfn_t gfn)
@@ -485,10 +485,10 @@ static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int lpage)
if (!lpage)
return &slot->rmap[gfn - slot->base_gfn];
 
-   idx = (gfn / KVM_PAGES_PER_HPAGE) -
- (slot->base_gfn / KVM_PAGES_PER_HPAGE);
+   idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
+ (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
 
-   return &slot->lpage_info[idx].rmap_pde;
+   return &slot->lpage_info[0][idx].rmap_pde;
 }
 
 /*
@@ -724,11 +724,11 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva,
end = start + (memslot->npages << PAGE_SHIFT);
if (hva >= start && hva < end) {
gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+   int idx = gfn_offset /
+ KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL);
retval |= handler(kvm, &memslot->rmap[gfn_offset]);
retval |= handler(kvm,
- &memslot->lpage_info[
- gfn_offset /
- KVM_PAGES_PER_HPAGE].rmap_pde);
+   &memslot->lpage_info[0][idx].rmap_pde);
}
}
 
@@ -1852,8 +1852,9 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
pfn_t pfn;
unsigned long mmu_seq;
 
-   if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
-   gfn &= ~(KVM_PAGES_PER_HPAGE-1);
+   if (is_largepage_backed(vcpu, gfn &
+   ~(KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL) - 1))) {
+   gfn &= ~(KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL) - 1);
largepage = 1;
}
 
@@ -2058,8 +2059,9 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa,
if (r)
return r;
 
-   if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
-   gfn &= ~(KVM_PAGES_PE

[PATCH 1/8] hugetlbfs: export vma_kernel_pagesize to modules

2009-06-19 Thread Joerg Roedel
This function is required by KVM.

Signed-off-by: Joerg Roedel 
---
 mm/hugetlb.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e83ad2c..e2016ef 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -234,6 +234,7 @@ unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
 
return 1UL << (hstate->order + PAGE_SHIFT);
 }
+EXPORT_SYMBOL_GPL(vma_kernel_pagesize);
 
 /*
  * Return the page size being used by the MMU to back a VMA. In the majority
-- 
1.6.3.1




[PATCH 6/8] kvm/mmu: add support for another level to page walker

2009-06-19 Thread Joerg Roedel
The page walker may also be used with nested paging when accessing MMIO areas,
so make it support the additional page level too.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/mmu.c |6 ++
 arch/x86/kvm/paging_tmpl.h |   16 
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index ef2396d..fc0e2fc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -117,6 +117,11 @@ module_param(oos_shadow, bool, 0644);
 #define PT64_DIR_BASE_ADDR_MASK \
(PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + PT64_LEVEL_BITS)) - 1))
 
+#define PT64_PDPE_BASE_ADDR_MASK \
+   (PT64_BASE_ADDR_MASK & ~((1ULL << (PAGE_SHIFT + (2 * PT64_LEVEL_BITS))) - 1))
+#define PT64_PDPE_OFFSET_MASK \
+   (PT64_BASE_ADDR_MASK & ((1ULL << (PAGE_SHIFT + (2 * PT64_LEVEL_BITS))) - 1))
+
 #define PT32_BASE_ADDR_MASK PAGE_MASK
 #define PT32_DIR_BASE_ADDR_MASK \
(PAGE_MASK & ~((1ULL << (PAGE_SHIFT + PT32_LEVEL_BITS)) - 1))
@@ -130,6 +135,7 @@ module_param(oos_shadow, bool, 0644);
 #define PFERR_RSVD_MASK (1U << 3)
 #define PFERR_FETCH_MASK (1U << 4)
 
+#define PT_PDPE_LEVEL 3
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1
 
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 8fbf4e7..54c77be 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -55,6 +55,7 @@
 
 #define gpte_to_gfn FNAME(gpte_to_gfn)
 #define gpte_to_gfn_pde FNAME(gpte_to_gfn_pde)
+#define gpte_to_gfn_pdpe FNAME(gpte_to_gfn_pdpe)
 
 /*
  * The guest_walker structure emulates the behavior of the hardware page
@@ -81,6 +82,11 @@ static gfn_t gpte_to_gfn_pde(pt_element_t gpte)
return (gpte & PT_DIR_BASE_ADDR_MASK) >> PAGE_SHIFT;
 }
 
+static gfn_t gpte_to_gfn_pdpe(pt_element_t gpte)
+{
+   return (gpte & PT64_PDPE_BASE_ADDR_MASK) >> PAGE_SHIFT;
+}
+
 static bool FNAME(cmpxchg_gpte)(struct kvm *kvm,
 gfn_t table_gfn, unsigned index,
 pt_element_t orig_pte, pt_element_t new_pte)
@@ -201,6 +207,15 @@ walk:
break;
}
 
+   if (walker->level == PT_PDPE_LEVEL &&
+   (pte & PT_PAGE_SIZE_MASK) &&
+   is_long_mode(vcpu)) {
+   walker->gfn  = gpte_to_gfn_pdpe(pte);
+   walker->gfn += (addr & PT64_PDPE_OFFSET_MASK)
+   >> PAGE_SHIFT;
+   break;
+   }
+
pt_access = pte_access;
--walker->level;
}
@@ -609,4 +624,5 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 #undef PT_MAX_FULL_LEVELS
 #undef gpte_to_gfn
 #undef gpte_to_gfn_pde
+#undef gpte_to_gfn_pdpe
 #undef CMPXCHG
-- 
1.6.3.1




[PATCH 5/8] kvm/mmu: make direct mapping paths aware of mapping levels

2009-06-19 Thread Joerg Roedel
Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/kvm_host.h |2 +-
 arch/x86/kvm/mmu.c  |   60 ++-
 arch/x86/kvm/paging_tmpl.h  |6 ++--
 3 files changed, 38 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 930bac2..397f992 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -314,7 +314,7 @@ struct kvm_vcpu_arch {
struct {
gfn_t gfn;  /* presumed gfn during guest pte update */
pfn_t pfn;  /* pfn corresponding to that gfn */
-   int largepage;
+   int level;
unsigned long mmu_seq;
} update_pte;
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0ef947d..ef2396d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -254,7 +254,7 @@ static int is_last_spte(u64 pte, int level)
 {
if (level == PT_PAGE_TABLE_LEVEL)
return 1;
-   if (level == PT_DIRECTORY_LEVEL && is_large_pte(pte))
+   if (is_large_pte(pte))
return 1;
return 0;
 }
@@ -1708,7 +1708,7 @@ static int mmu_need_write_protect(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
unsigned pte_access, int user_fault,
-   int write_fault, int dirty, int largepage,
+   int write_fault, int dirty, int level,
gfn_t gfn, pfn_t pfn, bool speculative,
bool can_unsync)
 {
@@ -1731,7 +1731,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte |= shadow_nx_mask;
if (pte_access & ACC_USER_MASK)
spte |= shadow_user_mask;
-   if (largepage)
+   if (level > PT_PAGE_TABLE_LEVEL)
spte |= PT_PAGE_SIZE_MASK;
if (tdp_enabled)
spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
@@ -1742,7 +1742,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
if ((pte_access & ACC_WRITE_MASK)
|| (write_fault && !is_write_protection(vcpu) && !user_fault)) {
 
-   if (largepage && has_wrprotected_page(vcpu->kvm, gfn, 1)) {
+   if (level > PT_PAGE_TABLE_LEVEL &&
+   has_wrprotected_page(vcpu->kvm, gfn, level)) {
ret = 1;
spte = shadow_trap_nonpresent_pte;
goto set_pte;
@@ -1780,7 +1781,7 @@ set_pte:
 static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 unsigned pt_access, unsigned pte_access,
 int user_fault, int write_fault, int dirty,
-int *ptwrite, int largepage, gfn_t gfn,
+int *ptwrite, int level, gfn_t gfn,
 pfn_t pfn, bool speculative)
 {
int was_rmapped = 0;
@@ -1796,7 +1797,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 * If we overwrite a PTE page pointer with a 2MB PMD, unlink
 * the parent of the now unreachable PTE.
 */
-   if (largepage && !is_large_pte(*sptep)) {
+   if (level > PT_PAGE_TABLE_LEVEL &&
+   !is_large_pte(*sptep)) {
struct kvm_mmu_page *child;
u64 pte = *sptep;
 
@@ -1809,8 +1811,9 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
} else
was_rmapped = 1;
}
+
if (set_spte(vcpu, sptep, pte_access, user_fault, write_fault,
- dirty, largepage, gfn, pfn, speculative, true)) {
+ dirty, level, gfn, pfn, speculative, true)) {
if (write_fault)
*ptwrite = 1;
kvm_x86_ops->tlb_flush(vcpu);
@@ -1846,7 +1849,7 @@ static void nonpaging_new_cr3(struct kvm_vcpu *vcpu)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-   int largepage, gfn_t gfn, pfn_t pfn)
+   int level, gfn_t gfn, pfn_t pfn)
 {
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
@@ -1854,11 +1857,10 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
gfn_t pseudo_gfn;
 
for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
-   if (iterator.level == PT_PAGE_TABLE_LEVEL
-   || (largepage && iterator.level == PT_DIRECTORY_LEVEL)) {
+   if (iterator.level == level) {
mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, ACC_ALL,
 0, write, 1, &pt_write,
-largepage, gfn, pfn, false);
+level, gfn, pfn, false);
++vcpu->stat.pf_fixed;
break;
  

[PATCH 4/8] kvm/mmu: make rmap code aware of mapping levels

2009-06-19 Thread Joerg Roedel
This patch removes the largepage parameter from the rmap_add function.
Like rmap_remove, the function now uses the role.level field to determine
whether the page is a huge page.

Signed-off-by: Joerg Roedel 
---
 arch/x86/kvm/mmu.c |   53 +++
 1 files changed, 28 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 3fa6009..0ef947d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -511,19 +511,19 @@ static int mapping_level(struct kvm_vcpu *vcpu, gfn_t large_gfn)
  * Note: gfn must be unaliased before this function get called
  */
 
-static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int lpage)
+static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level)
 {
struct kvm_memory_slot *slot;
unsigned long idx;
 
slot = gfn_to_memslot(kvm, gfn);
-   if (!lpage)
+   if (likely(level == PT_PAGE_TABLE_LEVEL))
return &slot->rmap[gfn - slot->base_gfn];
 
-   idx = (gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL)) -
- (slot->base_gfn / KVM_PAGES_PER_HPAGE(PT_DIRECTORY_LEVEL));
+   idx = (gfn / KVM_PAGES_PER_HPAGE(level)) -
+   (slot->base_gfn / KVM_PAGES_PER_HPAGE(level));
 
-   return &slot->lpage_info[0][idx].rmap_pde;
+   return &slot->lpage_info[level - 2][idx].rmap_pde;
 }
 
 /*
@@ -535,7 +535,7 @@ static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int lpage)
  * If rmapp bit zero is one, (then rmap & ~1) points to a struct kvm_rmap_desc
  * containing more mappings.
  */
-static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, int lpage)
+static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn)
 {
struct kvm_mmu_page *sp;
struct kvm_rmap_desc *desc;
@@ -547,7 +547,7 @@ static void rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, int lpage)
gfn = unalias_gfn(vcpu->kvm, gfn);
sp = page_header(__pa(spte));
sp->gfns[spte - sp->spt] = gfn;
-   rmapp = gfn_to_rmap(vcpu->kvm, gfn, lpage);
+   rmapp = gfn_to_rmap(vcpu->kvm, gfn, sp->role.level);
if (!*rmapp) {
rmap_printk("rmap_add: %p %llx 0->1\n", spte, *spte);
*rmapp = (unsigned long)spte;
@@ -614,7 +614,7 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
kvm_release_pfn_dirty(pfn);
else
kvm_release_pfn_clean(pfn);
-   rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], is_large_pte(*spte));
+   rmapp = gfn_to_rmap(kvm, sp->gfns[spte - sp->spt], sp->role.level);
if (!*rmapp) {
printk(KERN_ERR "rmap_remove: %p %llx 0->BUG\n", spte, *spte);
BUG();
@@ -677,10 +677,10 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
 {
unsigned long *rmapp;
u64 *spte;
-   int write_protected = 0;
+   int i, write_protected = 0;
 
gfn = unalias_gfn(kvm, gfn);
-   rmapp = gfn_to_rmap(kvm, gfn, 0);
+   rmapp = gfn_to_rmap(kvm, gfn, PT_PAGE_TABLE_LEVEL);
 
spte = rmap_next(kvm, rmapp, NULL);
while (spte) {
@@ -702,21 +702,24 @@ static int rmap_write_protect(struct kvm *kvm, u64 gfn)
}
 
/* check for huge page mappings */
-   rmapp = gfn_to_rmap(kvm, gfn, 1);
-   spte = rmap_next(kvm, rmapp, NULL);
-   while (spte) {
-   BUG_ON(!spte);
-   BUG_ON(!(*spte & PT_PRESENT_MASK));
-   BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
-   pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
-   if (is_writeble_pte(*spte)) {
-   rmap_remove(kvm, spte);
-   --kvm->stat.lpages;
-   __set_spte(spte, shadow_trap_nonpresent_pte);
-   spte = NULL;
-   write_protected = 1;
+   for (i = PT_DIRECTORY_LEVEL;
+i < PT_PAGE_TABLE_LEVEL + KVM_NR_PAGE_SIZES; ++i) {
+   rmapp = gfn_to_rmap(kvm, gfn, i);
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!spte);
+   BUG_ON(!(*spte & PT_PRESENT_MASK));
+   BUG_ON((*spte & (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK)) != (PT_PAGE_SIZE_MASK|PT_PRESENT_MASK));
+   pgprintk("rmap_write_protect(large): spte %p %llx %lld\n", spte, *spte, gfn);
+   if (is_writeble_pte(*spte)) {
+   rmap_remove(kvm, spte);
+   --kvm->stat.lpages;
+   __set_spte(spte, shadow_trap_nonpresent_pte);
+   spte = NULL;
+   write_protected = 1;
+   }
+   spte = rmap_next(kvm, rmapp, spte);
}
-   

[PATCH 8/8] kvm x86: report 1GB page support to userspace

2009-06-19 Thread Joerg Roedel
If userspace knows that the kernel part supports 1GB pages it can enable
the corresponding cpuid bit so that guests actually use GB pages.

Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/kvm_host.h |1 +
 arch/x86/kvm/svm.c  |6 ++
 arch/x86/kvm/vmx.c  |6 ++
 arch/x86/kvm/x86.c  |3 ++-
 4 files changed, 15 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26ea9b8..c120882 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -527,6 +527,7 @@ struct kvm_x86_ops {
int (*set_tss_addr)(struct kvm *kvm, unsigned int addr);
int (*get_tdp_level)(void);
u64 (*get_mt_mask)(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+   bool (*gb_page_enable)(void);
 };
 
 extern struct kvm_x86_ops *kvm_x86_ops;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index c283201..ccb75fb 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2718,6 +2718,11 @@ static u64 svm_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
return 0;
 }
 
+static bool svm_gb_page_enable(void)
+{
+   return npt_enabled;
+}
+
 static struct kvm_x86_ops svm_x86_ops = {
.cpu_has_kvm_support = has_svm,
.disabled_by_bios = is_disabled,
@@ -2779,6 +2784,7 @@ static struct kvm_x86_ops svm_x86_ops = {
.set_tss_addr = svm_set_tss_addr,
.get_tdp_level = get_npt_level,
.get_mt_mask = svm_get_mt_mask,
+   .gb_page_enable = svm_gb_page_enable,
 };
 
 static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9332938..4418ba6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3891,6 +3891,11 @@ static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
return ret;
 }
 
+static bool vmx_gb_page_enable(void)
+{
+   return false;
+}
+
 static struct kvm_x86_ops vmx_x86_ops = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -3950,6 +3955,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
.set_tss_addr = vmx_set_tss_addr,
.get_tdp_level = get_ept_level,
.get_mt_mask = vmx_get_mt_mask,
+   .gb_page_enable = vmx_gb_page_enable,
 };
 
 static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 71090d2..ddfb77e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1371,6 +1371,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
 u32 index, int *nent, int maxnent)
 {
unsigned f_nx = is_efer_nx() ? F(NX) : 0;
+   unsigned f_gbpages = kvm_x86_ops->gb_page_enable() ? F(GBPAGES) : 0;
 #ifdef CONFIG_X86_64
unsigned f_lm = F(LM);
 #else
@@ -1395,7 +1396,7 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
F(MTRR) | F(PGE) | F(MCA) | F(CMOV) |
F(PAT) | F(PSE36) | 0 /* Reserved */ |
f_nx | 0 /* Reserved */ | F(MMXEXT) | F(MMX) |
-   F(FXSR) | F(FXSR_OPT) | 0 /* GBPAGES */ | 0 /* RDTSCP */ |
+   F(FXSR) | F(FXSR_OPT) | f_gbpages | 0 /* RDTSCP */ |
0 /* Reserved */ | f_lm | F(3DNOWEXT) | F(3DNOW);
/* cpuid 1.ecx */
const u32 kvm_supported_word4_x86_features =
-- 
1.6.3.1




[PATCH 0/8 v3] KVM support for 1GB pages

2009-06-19 Thread Joerg Roedel
Hi,

this is the third version of the patches for KVM to support 1GB pages. Changes
to the last version include:

- changed reporting of 1GB page support to userspace to use
  SUPPORTED_CPUID ioctl
- added support for 1GB pages to the software page walker code too

This code can still only make use of 1GB pages with nested paging enabled. I
will give the shadow paging code another debug round soon. Please comment, or
consider applying these patches.

Thanks,

Joerg

Shortlog:

Joerg Roedel (8):
  hugetlbfs: export vma_kernel_pagesize to modules
  kvm: change memslot data structures for multiple hugepage sizes
  kvm/mmu: rename is_largepage_backed to mapping_level
  kvm/mmu: make rmap code aware of mapping levels
  kvm/mmu: make direct mapping paths aware of mapping levels
  kvm/mmu: add support for another level to page walker
  kvm/mmu: enable gbpages by increasing nr of pagesizes
  kvm x86: report 1GB page support to userspace

Diffstat:

 arch/ia64/include/asm/kvm_host.h|3 +-
 arch/powerpc/include/asm/kvm_host.h |3 +-
 arch/x86/include/asm/kvm_host.h |   15 ++-
 arch/x86/kvm/mmu.c  |  219 ++-
 arch/x86/kvm/paging_tmpl.h  |   27 -
 arch/x86/kvm/svm.c  |6 +
 arch/x86/kvm/vmx.c  |6 +
 arch/x86/kvm/x86.c  |3 +-
 include/linux/kvm_host.h|2 +-
 mm/hugetlb.c|1 +
 virt/kvm/kvm_main.c |   53 ++---
 11 files changed, 222 insertions(+), 116 deletions(-)




Re: [PATCH] kvm: device-assignment: Add PCI option ROM support

2009-06-19 Thread Alex Williamson
On Fri, 2009-06-19 at 15:27 +0800, Yang, Sheng wrote:
> On Friday 19 June 2009 00:28:41 Alex Williamson wrote:
> > The one oddity I noticed is that even when the enable bit is clear, the
> > guest can read the ROM.  I don't know that this is actually illegal, vs
> > returning zeros or ones though.  It seems like maybe the generic PCI
> > code isn't tracking the enable bit.  I think that's an independent
> > problem from this patch though.  Thanks,
> 
> That should be fine. I've taken a look at code, seems Linux kernel set 
> enable_bit when someone begin to read rom, and copy rom to buffer, then unmap 
> the rom. So the rom can be read when enable bit clear.

For this testing, I used an mmap of the ROM address though, so the
kernel caching shouldn't have been involved.  It looks to me like the
problem is that the map function provided via pci_register_io_region()
only knows how to create mappings, not tear them down.  I think maybe
pci_update_mappings() should still call the map_func when new_addr is -1
to let the I/O space driver shut down the mapping.  As it is, once we set up
the mapping, it lives until something else happens to overlap it,
regardless of the state of the PCI BAR.  Thanks,

Alex
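
A rough, self-contained sketch of that teardown idea with stand-in types
(the real qemu PCI structures and map_func signature differ in detail, so
treat every name below as an assumption rather than a patch):

    #include <stdint.h>

    typedef struct PCIDevice PCIDevice; /* stand-in for qemu's PCIDevice */
    typedef void PCIMapIORegionFunc(PCIDevice *d, int region_num,
                                    uint32_t addr, uint32_t size, int type);

    typedef struct {
        uint32_t addr; /* current mapping; (uint32_t)-1 when unmapped */
        uint32_t size;
        int type;
        PCIMapIORegionFunc *map_func;
    } PCIIORegion;

    /* Proposed behaviour: also invoke map_func when the BAR is disabled
     * (new_addr == -1), so a backend such as the ROM mapping gets a chance
     * to drop its mapping instead of it living on indefinitely. */
    static void update_mapping(PCIDevice *d, PCIIORegion *r, int region_num,
                               uint32_t new_addr)
    {
        if (new_addr == r->addr)
            return; /* nothing changed */
        r->map_func(d, region_num, new_addr, r->size, r->type);
        r->addr = new_addr;
    }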



Re: [PATCH 2/2][RFC] Kernel changes for HPET legacy mode (v7)

2009-06-19 Thread Beth Kon

Jan Kiszka wrote:

Beth Kon wrote:
  
When kvm is in hpet_legacy_mode, the hpet is providing the timer interrupt
and the pit should not be. So in legacy mode, the pit timer is destroyed,
but the *state* of the pit is maintained. So if kvm or the guest tries to
modify the state of the pit, this modification is accepted, *except* that
the timer isn't actually started. When we exit hpet_legacy_mode, the
current state of the pit (which is up to date since we've been accepting
modifications) is used to restart the pit timer.


The saved_mode code in kvm_pit_load_count temporarily changes mode to 0xff
in order to destroy the timer, but then restores the actual value, again
maintaining the "current" state of the pit for possible later re-enablement.

changes from v6:

- Added ioctl interface for legacy mode in order not to break the abi.


Signed-off-by: Beth Kon 



...

  

@@ -1986,7 +1987,24 @@ static int kvm_vm_ioctl_set_pit(struct kvm *kvm, struct kvm_pit_state *ps)
int r = 0;
 
 	memcpy(&kvm->arch.vpit->pit_state, ps, sizeof(struct kvm_pit_state));

-   kvm_pit_load_count(kvm, 0, ps->channels[0].count);
+   kvm_pit_load_count(kvm, 0, ps->channels[0].count, 0);
+   return r;
+}
+
+static int kvm_vm_ioctl_get_hpet_legacy_mode(struct kvm *kvm, u8 *mode)
+{
+   int r = 0;
+   *mode = kvm->arch.vpit->pit_state.hpet_legacy_mode;
+   return r;
+}



This only applies if we go for a separate mode IOCTL:
The legacy mode is not directly modifiable by the guest. Is it planned
to add in-kernel hpet support? Otherwise get_hpet_legacy_mode looks a
bit like overkill given that user space could easily track the state.
  
Assuming I will at least generalize the ioctl, I'll leave this question 
for the time being.
  

+
+static int kvm_vm_ioctl_set_hpet_legacy_mode(struct kvm *kvm, u8 *mode)
+{
+   int r = 0, start = 0;
+   if (kvm->arch.vpit->pit_state.hpet_legacy_mode == 0 && *mode == 1)



Here you check for mode == 1, but legacy_mode is only checked for != 0.
I would make this consistent.

  

ok

+   start = 1;
+   kvm->arch.vpit->pit_state.hpet_legacy_mode = *mode;
+   kvm_pit_load_count(kvm, 0, kvm->arch.vpit->pit_state.channels[0].count, start);
return r;
 }
 
@@ -2047,6 +2065,7 @@ long kvm_arch_vm_ioctl(struct file *filp,

struct kvm_pit_state ps;
struct kvm_memory_alias alias;
struct kvm_pit_config pit_config;
+   u8 hpet_legacy_mode;



Hmm, instead of introducing a new pair of single-purpose IOCTLs, why not
add KVM_GET/SET_PIT2, which exchanges an extended kvm_pit_state2? That
struct should also include some flags field and enough padding to be
potentially extended yet again in the future. In that case I see no
problem having a mode read-back interface as well.
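
As a concrete illustration of that suggestion, such an extended struct might
look like the following. This is a sketch only: the field names, the flag
assignment and the padding size are guesses, not an existing interface:

    #include <linux/types.h>
    #include <linux/kvm.h> /* struct kvm_pit_channel_state */

    struct kvm_pit_state2 {
        struct kvm_pit_channel_state channels[3];
        __u32 flags;       /* e.g. bit 0 = HPET legacy mode */
        __u32 reserved[9]; /* room to grow without yet another ioctl */
    };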

  
I thought about that, but it seemed to add unnecessary complexity, since 
this legacy control is really outside of normal PIT operation, which is 
embodied by KVM_GET/SET_PIT. It might be worth making this ioctl more 
general. Rather than SET_HPET_LEGACY, have SET_PIT_CONTROLS and pass a 
bit field, one of which is HPET_LEGACY. But, if general consensus is 
that it would be better to create a kvm_pit_state2, and get/set that, I 
can do that.

Jan

  




[patch] qemu-kvm introduce -maxcpus argument

2009-06-19 Thread Jes Sorensen

Hi,

This one introduces a -maxcpus setting, allowing the user to specify
the maximum number of vCPUs the system can have, as discussed with Gleb
earlier in the week.

Cheers,
Jes

Introduce -maxcpus flag to QEMU and pass the value to the BIOS through
FW_CFG.

A follow-on patch will use it to determine the size of the MADT.

Signed-off-by: Jes Sorensen 

---
 hw/fw_cfg.c  |1 +
 hw/fw_cfg.h  |1 +
 kvm/bios/rombios32.c |   16 
 qemu-options.hx  |9 +
 sysemu.h |1 +
 vl.c |8 
 6 files changed, 36 insertions(+)

Index: qemu-kvm/hw/fw_cfg.c
===
--- qemu-kvm.orig/hw/fw_cfg.c
+++ qemu-kvm/hw/fw_cfg.c
@@ -279,6 +279,7 @@ void *fw_cfg_init(uint32_t ctl_port, uin
 fw_cfg_add_bytes(s, FW_CFG_UUID, qemu_uuid, 16);
 fw_cfg_add_i16(s, FW_CFG_NOGRAPHIC, (uint16_t)(display_type == DT_NOGRAPHIC));
 fw_cfg_add_i16(s, FW_CFG_NB_CPUS, (uint16_t)smp_cpus);
+fw_cfg_add_i16(s, FW_CFG_MAX_CPUS, (uint16_t)max_cpus);
 
 register_savevm("fw_cfg", -1, 1, fw_cfg_save, fw_cfg_load, s);
 qemu_register_reset(fw_cfg_reset, 0, s);
Index: qemu-kvm/hw/fw_cfg.h
===
--- qemu-kvm.orig/hw/fw_cfg.h
+++ qemu-kvm/hw/fw_cfg.h
@@ -15,6 +15,7 @@
 #define FW_CFG_INITRD_SIZE  0x0b
 #define FW_CFG_BOOT_DEVICE  0x0c
 #define FW_CFG_NUMA 0x0d
+#define FW_CFG_MAX_CPUS 0x0e
 #define FW_CFG_MAX_ENTRY0x10
 
 #define FW_CFG_WRITE_CHANNEL0x4000
Index: qemu-kvm/kvm/bios/rombios32.c
===
--- qemu-kvm.orig/kvm/bios/rombios32.c
+++ qemu-kvm/kvm/bios/rombios32.c
@@ -441,6 +441,7 @@ void delay_ms(int n)
 }
 
 uint16_t smp_cpus;
+uint16_t max_cpus = MAX_CPUS;
 uint32_t cpuid_signature;
 uint32_t cpuid_features;
 uint32_t cpuid_ext_features;
@@ -484,6 +485,7 @@ void wrmsr_smp(uint32_t index, uint64_t 
 #define QEMU_CFG_ID 0x01
 #define QEMU_CFG_UUID   0x02
 #define QEMU_CFG_NUMA   0x0D
+#define QEMU_CFG_MAX_CPUS   0x0E
 #define QEMU_CFG_ARCH_LOCAL 0x8000
 #define QEMU_CFG_ACPI_TABLES  (QEMU_CFG_ARCH_LOCAL + 0)
 #define QEMU_CFG_SMBIOS_ENTRIES  (QEMU_CFG_ARCH_LOCAL + 1)
@@ -546,6 +548,19 @@ static uint16_t smbios_entries(void)
 return cnt;
 }
 
+static uint16_t get_max_cpus(void)
+{
+uint16_t cnt;
+
+qemu_cfg_select(QEMU_CFG_MAX_CPUS);
+qemu_cfg_read((uint8_t*)&cnt, sizeof(cnt));
+
+if (!cnt)
+cnt = MAX_CPUS;
+
+return cnt;
+}
+
 uint64_t qemu_cfg_get64 (void)
 {
 uint64_t ret;
@@ -1655,6 +1670,7 @@ void acpi_bios_init(void)
 addr += sizeof(SSDTCode);
 
 #ifdef BX_QEMU
+max_cpus = get_max_cpus();
 qemu_cfg_select(QEMU_CFG_NUMA);
 nb_numa_nodes = qemu_cfg_get64();
 #else
Index: qemu-kvm/qemu-options.hx
===
--- qemu-kvm.orig/qemu-options.hx
+++ qemu-kvm/qemu-options.hx
@@ -47,6 +47,15 @@ CPUs are supported. On Sparc32 target, Linux limits the number of usable CPUs to 4.
 ETEXI
 
+DEF("maxcpus", HAS_ARG, QEMU_OPTION_maxcpus,
+"-maxcpus n  set maximumthe number of possibly CPUs to 'n'\n")
+STEXI
+@item -maxcpus @var{n}
+Set the maximum number of possible CPUs to @var{n}. @var{n} has to be
+greater than or equal to the value of -smp. If @var{n} is equal to -smp,
+there will be no room for hotplug CPUs to be added later.
+ETEXI
+
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
 "-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]\n")
 STEXI
Index: qemu-kvm/sysemu.h
===
--- qemu-kvm.orig/sysemu.h
+++ qemu-kvm/sysemu.h
@@ -119,6 +119,7 @@ extern int alt_grab;
 extern int usb_enabled;
 extern int no_virtio_balloon;
 extern int smp_cpus;
+extern int max_cpus;
 extern int cursor_hide;
 extern int graphic_rotate;
 extern int no_quit;
Index: qemu-kvm/vl.c
===
--- qemu-kvm.orig/vl.c
+++ qemu-kvm/vl.c
@@ -246,6 +246,7 @@ int singlestep = 0;
 const char *assigned_devices[MAX_DEV_ASSIGN_CMDLINE];
 int assigned_devices_index;
 int smp_cpus = 1;
+int max_cpus = 16;
 const char *vnc_display;
 int acpi_enabled = 1;
 int no_hpet = 0;
@@ -5666,6 +5667,13 @@ int main(int argc, char **argv, char **e
 exit(1);
 }
 break;
+case QEMU_OPTION_maxcpus:
+max_cpus = atoi(optarg);
+if ((max_cpus < 1) || (max_cpus > machine->max_cpus)) {
+fprintf(stderr, "Invalid number of CPUs\n");
+exit(1);
+}
+break;
 	case QEMU_OPTION_vnc:
 display_type = DT_VNC;
 		vnc_display = optarg;


[patch] qemu-kvm, use max_cpus to size MADT table

2009-06-19 Thread Jes Sorensen

Hi,

This one goes on top of the -maxcpus patch. It changes the bios to use 
max_cpus to determine the size of the MADT table.


When no -maxcpus argument is given, we still fall back to the old 
default of 16.


Cheers,
Jes
Build madt table based on max_cpus count.

Signed-off-by: Jes Sorensen 

---
 kvm/bios/rombios32.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

Index: qemu-kvm/kvm/bios/rombios32.c
===
--- qemu-kvm.orig/kvm/bios/rombios32.c
+++ qemu-kvm/kvm/bios/rombios32.c
@@ -1168,13 +1168,13 @@ static void mptable_init(void)
 putstr(&q, "0.1 "); /* vendor id */
 putle32(&q, 0); /* OEM table ptr */
 putle16(&q, 0); /* OEM table size */
-putle16(&q, MAX_CPUS + 18); /* entry count */
+putle16(&q, max_cpus + 18); /* entry count */
putle32(&q, 0xfee00000); /* local APIC addr */
 putle16(&q, 0); /* ext table length */
 putb(&q, 0); /* ext table checksum */
 putb(&q, 0); /* reserved */
 
-for(i = 0; i < MAX_CPUS ; i++) {
+for(i = 0; i < max_cpus ; i++) {
 putb(&q, 0); /* entry type = processor */
 putb(&q, i); /* APIC id */
 putb(&q, 0x11); /* local APIC version number */
@@ -1700,7 +1700,7 @@ void acpi_bios_init(void)
 addr = (addr + 7) & ~7;
 madt_addr = addr;
 madt_size = sizeof(*madt) +
-sizeof(struct madt_processor_apic) * MAX_CPUS +
+sizeof(struct madt_processor_apic) * max_cpus +
 #ifdef BX_QEMU
 sizeof(struct madt_io_apic)  + sizeof(struct madt_int_override) * MAX_INT_OVERRIDES;
 #else
@@ -1778,7 +1778,7 @@ void acpi_bios_init(void)
madt->local_apic_address = cpu_to_le32(0xfee00000);
 madt->flags = cpu_to_le32(1);
 *(uint32_t*)APIC_MADT_PTR = apic = (void *)(madt + 1);
-for(i=0;i<MAX_CPUS;i++) {
+for(i=0;i<max_cpus;i++) {
apic->type = APIC_PROCESSOR;
 apic->length = sizeof(*apic);
 apic->processor_id = i;


Re: [patch] qemu-kvm introduce -maxcpus argument

2009-06-19 Thread Daniel P. Berrange
On Fri, Jun 19, 2009 at 03:59:57PM +0200, Jes Sorensen wrote:
> Hi,
> 
> This one introduces a -maxcpus setting, allowing the user to specify
> the maximum number of vCPUs the system can have, as discussed with Gleb
> earlier in the week.

What is the maximum value for the 'maxcpus' setting for KVM ? 

libvirt currently does

fd = open("/dev/kvm")
r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);

to figure out what the maximum allowed vCPUs will be for KVM,
and currently it is returning 16 IIRC.
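
Fleshed out, that probe is just a few lines; KVM_CHECK_EXTENSION and
KVM_CAP_NR_VCPUS come straight from <linux/kvm.h>:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/kvm.h>

    int main(void)
    {
        int fd = open("/dev/kvm", O_RDWR);
        int n;

        if (fd < 0) {
            perror("open /dev/kvm");
            return 1;
        }
        /* positive result = recommended maximum number of vCPUs */
        n = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);
        printf("KVM_CAP_NR_VCPUS: %d\n", n);
        close(fd);
        return 0;
    }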

> @@ -5666,6 +5667,13 @@ int main(int argc, char **argv, char **e
>  exit(1);
>  }
>  break;
> +case QEMU_OPTION_maxcpus:
> +max_cpus = atoi(optarg);
> +if ((max_cpus < 1) || (max_cpus > machine->max_cpus)) {
> +fprintf(stderr, "Invalid number of CPUs\n");
> +exit(1);
> +}
> +break;
>   case QEMU_OPTION_vnc:
>  display_type = DT_VNC;
>   vnc_display = optarg;

This implies the limit (for x86 pc machine at least) is now 255. Is that
the correct interpretation on my part ?


Regards,
Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|


Re: [patch] qemu-kvm introduce -maxcpus argument

2009-06-19 Thread Jes Sorensen

On 06/19/2009 04:15 PM, Daniel P. Berrange wrote:

On Fri, Jun 19, 2009 at 03:59:57PM +0200, Jes Sorensen wrote:

Hi,

This one introduces a -maxcpus setting, allowing the user to specify
the maximum number of vCPUs the system can have, as discussed with Gleb
earlier in the week.


What is the maximum value for the 'maxcpus' setting for KVM ?


Right now it is still 16, however I plan to change this.


libvirt currently does

 fd = open("/dev/kvm")
 r = ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);

to figure out what the maximum allowed vCPUs will be for KVM,
and currently it is returning 16 IIRC.


Interesting, this will need to be addressed as well. I have plans to
introduce a mechanism telling the kernel where the limit will be, in
order to allow it to allocate data structures in a reasonable manner.


@@ -5666,6 +5667,13 @@ int main(int argc, char **argv, char **e
  exit(1);
  }
  break;
+case QEMU_OPTION_maxcpus:
+max_cpus = atoi(optarg);
+if ((max_cpus < 1) || (max_cpus > machine->max_cpus)) {
+fprintf(stderr, "Invalid number of CPUs\n");
+exit(1);
+}



This implies the limit (for x86 pc machine at least) is now 255. Is that
the correct interpretation on my part ?


Actually the 255 limit is tied to ACPI. Once we support ACPI 3.0 and
x2apic, it will get much worse. Be afraid, be very afraid :-) To be
honest, it is going to be a while before we get to that, but I hope
to get there eventually.

I strongly recommend you try not to impose static limits within libvirt
for the number of vCPUs.

I guess it will become a tricky question of who tells whom what the limit
is. Ideally I would like to see the kernel limit become unlimited and
the restrictions be set by userland.

Cheers,
Jes


-no-kvm broken since merge a5b526135d

2009-06-19 Thread Jan Kiszka
JFYI: qemu-kvm commit de408d70d8 works fine with -no-kvm, a5b526135d
produces

qemu: fatal: Trying to execute code outside RAM or ROM at 0x0001000d5f4f

EAX=0001 EBX=0070 ECX=005f EDX=0001
ESI= EDI= EBP= ESP=
EIP=000e5f4f EFL=0002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9200
CS =f000   9a00
SS =   9200
DS =   9200
FS =   9200
GS =   9200
LDT=   8200
TR =   8b00
GDT=  
IDT=  
CR0=6010 CR2= CR3= CR4=
DR0= DR1= DR2= DR3=
DR6=0ff0 DR7=0400
CCS=0001 CCD= CCO=SUBL
FCW=037f FSW= [ST=0] FTW=00 MXCSR=1f80
FPR0=  FPR1= 
FPR2=  FPR3= 
FPR4=  FPR5= 
FPR6=  FPR7= 
XMM00=
XMM01=
XMM02=
XMM03=
XMM04=
XMM05=
XMM06=
XMM07=

(addresses may vary)

during early BIOS boot. Probably a merge conflict, but I do not yet see
which one. Maybe someone has an immediate idea what could cause this.

Upstream does not show this regression.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2
Corporate Competence Center Embedded Linux


Re: [patch] qemu-kvm introduce -maxcpus argument

2009-06-19 Thread Amit Shah
On (Fri) Jun 19 2009 [15:59:57], Jes Sorensen wrote:
> Hi,
>
> This one introduces a -maxcpus setting, allowing the user to specify
> the maximum number of vCPUs the system can have, as discussed with Gleb
> earlier in the week.

ACK, but please fix this:

+DEF("maxcpus", HAS_ARG, QEMU_OPTION_maxcpus,
+"-maxcpus n  set maximumthe number of possibly CPUs to 'n'\n")
+STEXI

Amit


Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

2009-06-19 Thread Michael S. Tsirkin
On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
> >   
> >> Michael S. Tsirkin wrote:
> >> 
> >>> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
> >>>   
> >>>   
>  Michael S. Tsirkin wrote:
>  
> 
>  
> > BTW, Gregory, this can be used to fix the race in the design: create a
> > thread and let it drop the module reference with module_put_and_exit.
> >   
> >   
> >   
>  I had thought of doing something like this initially too, but I think
>  it's racy as well.  Ultimately, you need to make sure the eventfd
>  callback is completely out before it's safe to run, and deferring to a
>  thread would not change this race.  The only sane way I can see to do
>  that is to have the caller infrastructure annotate the event somehow
>  (either directly with a module_put(), or indirectly with some kind of
>  state transition that can be tracked with something like
>  synchronize_sched().
>  
>  
> >>> Here's what one could do: create a thread for each irqfd, and increment
> >>> module ref count, put that thread to sleep.  When done with
> >>> irqfd, don't delete it and don't decrement module refcount, wake thread
> >>> instead.  thread kills irqfd and calls module_put_and_exit.
> >>>
> >>> I don't think it's racy
> >>>   
> >> I believe it is. How would you prevent the thread from doing the
> >> module_put_and_exit() before the eventfd callback thread is known to
> >> have exited the relevant .text section?
> >> 
> >
> > Right.
> >
> >   
> >> All this talk does give me an idea, tho.  I'll make a patch.
> >> 
> >
> > OK, but ask yourself whether this bag of tricks is worth it, and whether
> > we'll find another hole later. Let's reserve the trickiness for
> > fast-path, where it's needed, and keep at least the assign/deassign simple.
> >   
> 
> Understood.  OTOH, going back to the model where two steps are needed
> for close() is ugly too, so I don't want to just give up and revert that
> fix too easily.  At some point we will call it one way or the other, but
> I am not there quite yet.
> >   
> >>>   
> >>>   
> > Which will work, but I guess at this point we should ask ourselves
> > whether all the heartburn with srcu, threads and module references is
> > better than just asking the user to call an ioctl.
> >   
> >   
> >   
>  I am starting to agree with you, here. :)
> 
>  Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>  conversation re: the module_put() races.  I only tied it into the
>  current thread because the eventfd_notifier_register() thread gave me a
>  convenient way to hook some other context to do the module_put().  In
>  the long term, the srcu changes are for the can_sleep() stuff.  So on
>  that note, let's see if I can convince Davide that the srcu stuff is not
>  so evil before we revert the POLLHUP patches, since the module_put() fix
>  is trivial once that is in place.
>  
>  
> >>> Can this help with DEASSIGN as well? We need it for migration.
> >>>   
> >>>   
> >> No, but afaict you do not need this for migration anyway.  Migrate the
> >> GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
> >> relevant across a migration anyway?  I would think not, but admittedly I
> >> know little about how qemu/kvm migration actually works.
> >> 
> >
> > Yes but that's not live migration. For live migration, the trick is that
> > you are running locally but send changes to remote guest.  For that, we
> > need to put qemu in the middle between the device and the guest, so it
> > can detect activity and update the remote side.
> >
> > And the best way to do that is to take poll eventfd that device assigns
> > and push eventfd that kvm polls. To switch between this setup
> > and the one where kvm polls the eventfd from the device directly,
> > you need deassign.
> >   
> 
> So it's still not clear why the distinction between
> deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
> close().  Can you elaborate?


The fd needs to be left assigned to the device, so that we can poll
the fd and get events, then forward them to kvm.
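
A rough userspace sketch of that interposition, assuming both sides are
plain eventfds (the function and fd names are illustrative, not from any
posted patch):

#include <poll.h>
#include <stdint.h>
#include <unistd.h>

/* Drain the device-side eventfd, note the activity (e.g. for migration
 * dirty tracking), then re-signal the eventfd that kvm polls. */
static void forward_events(int dev_efd, int kvm_efd)
{
    struct pollfd pfd = { .fd = dev_efd, .events = POLLIN };

    while (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
        uint64_t cnt;

        if (read(dev_efd, &cnt, sizeof(cnt)) != sizeof(cnt))
            break;  /* read returns the 8-byte counter and resets it */
        /* ... record the event for the remote side here ... */
        if (write(kvm_efd, &cnt, sizeof(cnt)) != sizeof(cnt))
            break;  /* write adds cnt to the kvm-side counter */
    }
}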

-- 
MST


Re: [KVM PATCH v2 2/2] kvm: use POLLHUP to close an irqfd instead of an explicit ioctl

2009-06-19 Thread Gregory Haskins
Michael S. Tsirkin wrote:
> On Thu, Jun 18, 2009 at 12:29:31PM -0400, Gregory Haskins wrote:
>   
>> Michael S. Tsirkin wrote:
>> 
>>> On Thu, Jun 18, 2009 at 10:03:36AM -0400, Gregory Haskins wrote:
>>>   
>>>   
 Michael S. Tsirkin wrote:
 
 
> On Thu, Jun 18, 2009 at 08:00:39AM -0400, Gregory Haskins wrote:
>   
>   
>   
>> Michael S. Tsirkin wrote:
>> 
>>
>> 
>> 
>>> BTW, Gregory, this can be used to fix the race in the design: create a
>>> thread and let it drop the module reference with module_put_and_exit.
>>>   
>>>   
>>>   
>>>   
>> I had thought of doing something like this initially too, but I think
>> it's racy as well.  Ultimately, you need to make sure the eventfd
>> callback is completely out before it's safe to run, and deferring to a
>> thread would not change this race.  The only sane way I can see to do
>> that is to have the caller infrastructure annotate the event somehow
>> (either directly with a module_put(), or indirectly with some kind of
>> state transition that can be tracked with something like
>> synchronize_sched().
>> 
>> 
>> 
> Here's what one could do: create a thread for each irqfd, and increment
> module ref count, put that thread to sleep.  When done with
> irqfd, don't delete it and don't decrement module refcount, wake thread
> instead.  thread kills irqfd and calls module_put_and_exit.
>
> I don't think it's racy
>   
>   
 I believe it is. How would you prevent the thread from doing the
 module_put_and_exit() before the eventfd callback thread is known to
 have exited the relevant .text section?
 
 
>>> Right.
>>>
>>>   
>>>   
 All this talk does give me an idea, tho.  I'll make a patch.
 
 
>>> OK, but ask yourself whether this bag of tricks is worth it, and whether
>>> we'll find another hole later. Let's reserve the trickiness for
>>> fast-path, where it's needed, and keep at least the assign/deassign simple.
>>>   
>>>   
>> Understood.  OTOH, going back to the model where two steps are needed
>> for close() is ugly too, so I don't want to just give up and revert that
>> fix too easily.  At some point we will call it one way or the other, but
>> I am not there quite yet.
>> 
>>>   
>>>   
>   
>   
>   
>>> Which will work, but I guess at this point we should ask ourselves
>>> whether all the heartburn with srcu, threads and module references is
>>> better than just asking the user to call an ioctl.
>>>   
>>>   
>>>   
>>>   
>> I am starting to agree with you, here. :)
>>
>> Note one thing: the SRCU stuff is mostly orthogonal from the rest of the
>> conversation re: the module_put() races.  I only tied it into the
>> current thread because the eventfd_notifier_register() thread gave me a
>> convenient way to hook some other context to do the module_put().  In
>> the long term, the srcu changes are for the can_sleep() stuff.  So on
>> that note, let's see if I can convince Davide that the srcu stuff is not
>> so evil before we revert the POLLHUP patches, since the module_put() fix
>> is trivial once that is in place.
>> 
>> 
>> 
> Can this help with DEASSIGN as well? We need it for migration.
>   
>   
>   
 No, but afaict you do not need this for migration anyway.  Migrate the
 GSI and re-call kvm_irqfd() on the other side.  Would the fd even be
 relevant across a migration anyway?  I would think not, but admittedly I
 know little about how qemu/kvm migration actually works.
 
 
>>> Yes but that's not live migration. For live migration, the trick is that
>>> you are running locally but send changes to remote guest.  For that, we
>>> need to put qemu in the middle between the device and the guest, so it
>>> can detect activity and update the remote side.
>>>
>>> And the best way to do that is to take poll eventfd that device assigns
>>> and push eventfd that kvm polls. To switch between this setup
>>> and the one where kvm polls the eventfd from the device directly,
>>> you need deassign.
>>>   
>>>   
>> So it's still not clear why the distinction between
>> deassign-the-gsi-but-leave-the-fd-valid is needed over a simple
>> close().  Can you elaborate?
>> 
>
>
> The fd needs to be left assigned to the device, so that we can poll
> the fd and get events, then forward them to kvm.
>   

Ah, ok.  Now I get what you are trying to do.

Well, per the PM I sent you this morning, I figured out the magic to
resolve the locking issues.  So we should be able to add DEASSIGN logic
soon, I hope.

-Greg





[PATCH] qemu-kvm: x86: Fix CPU initialization

2009-06-19 Thread Jan Kiszka
Fix regression in CPU initialization caused by merge a5b526135d and try
to avoid this in the future by dropping qemu-kvm specific pc_new_cpu. If
such refactoring is desired, it should go through upstream first.

Signed-off-by: Jan Kiszka 
---

 hw/pc.c |   44 ++--
 1 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 6c19f55..c4117e5 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -839,31 +839,6 @@ int cpu_is_bsp(CPUState *env)
return env->cpuid_apic_id == 0;
 }
 
-CPUState *pc_new_cpu(int cpu, const char *cpu_model, int pci_enabled)
-{
-CPUState *env = cpu_init(cpu_model);
-if (!env) {
-fprintf(stderr, "Unable to find x86 CPU definition\n");
-exit(1);
-}
-if (cpu != 0)
-env->halted = 1;
-if ((env->cpuid_features & CPUID_APIC) || smp_cpus > 1) {
-env->cpuid_apic_id = env->cpu_index;
-apic_init(env);
-}
-qemu_register_reset(main_cpu_reset, 0, env);
-if (pci_enabled) {
-apic_init(env);
-}
-
-/* kvm needs this to run after the apic is initialized. Otherwise,
- * it can access invalid state and crash.
- */
-qemu_init_vcpu(env);
-   return env;
-}
-
 /* PC hardware initialisation */
 static void pc_init1(ram_addr_t ram_size,
  const char *boot_device,
@@ -906,7 +881,24 @@ static void pc_init1(ram_addr_t ram_size,
 }
 
 for(i = 0; i < smp_cpus; i++) {
-   env = pc_new_cpu(i, cpu_model, pci_enabled);
+env = cpu_init(cpu_model);
+if (!env) {
+fprintf(stderr, "Unable to find x86 CPU definition\n");
+exit(1);
+}
+if (cpu != 0)
+env->halted = 1;
+if ((env->cpuid_features & CPUID_APIC) || smp_cpus > 1) {
+env->cpuid_apic_id = env->cpu_index;
+apic_init(env);
+}
+qemu_register_reset(main_cpu_reset, 0, env);
+
+/*
+ * FIXME: qemu-kvm needs this after apic_init as it accesses APIC
+ * structures.
+ */
+qemu_init_vcpu(env);
 }
 
 vmport_init();


[PATCH v2] qemu-kvm: x86: Fix CPU initialization

2009-06-19 Thread Jan Kiszka
Jan Kiszka wrote:
> Fix regression in CPU initialization caused by merge a5b526135d and try
> to avoid this in the future by dropping qemu-kvm specific pc_new_cpu. If
> such refactoring is desired, it should go through upstream first.

F...ine. I'll write a hundred times: "Don't post while still compiling."
Here is a version that only fixes the regression.

--->

Fix regression in CPU initialization caused by merge a5b526135d.

Signed-off-by: Jan Kiszka 
---

 hw/pc.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index 6c19f55..cb5b4d0 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -853,9 +853,6 @@ CPUState *pc_new_cpu(int cpu, const char *cpu_model, int pci_enabled)
 apic_init(env);
 }
 qemu_register_reset(main_cpu_reset, 0, env);
-if (pci_enabled) {
-apic_init(env);
-}
 
 /* kvm needs this to run after the apic is initialized. Otherwise,
  * it can access invalid state and crash.


Checking guest memory pages changes from host userspace

2009-06-19 Thread Passera, Pablo R
Hi list,
I need to monitor some guest memory pages and know whether the
information in them has changed. For this, I was thinking of marking the
guest memory pages in some way (like write-protecting them) so that a page
fault is generated, and then handling the fault inside qemu. Is there some
API in libkvm that allows me to do this?

Thanks a lot,
Pablo
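
One way to prototype the write-protect-and-trap idea above entirely in host
userspace is mprotect() plus a SIGSEGV handler.  A minimal sketch (this is
not a libkvm API; whether libkvm exposes one is exactly the question, and
all names here are illustrative):

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *page;
static const size_t page_len = 4096;

/* A write to the protected page lands here: log it, then unprotect so
 * the faulting store can retry and complete.  (Calling mprotect() from
 * a signal handler is common practice, though not formally
 * async-signal-safe.) */
static void on_write_fault(int sig, siginfo_t *si, void *uc)
{
    (void)sig; (void)uc;
    if ((char *)si->si_addr >= page &&
        (char *)si->si_addr < page + page_len) {
        const char msg[] = "watched page written\n";
        write(STDERR_FILENO, msg, sizeof(msg) - 1);
        mprotect(page, page_len, PROT_READ | PROT_WRITE);
    }
}

int main(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    page = mmap(NULL, page_len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(page, page_len, PROT_READ);   /* arm the write trap */
    page[0] = 1;                           /* faults once, then completes */
    return 0;
}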



[PATCH] virtio_blk: don't bounce highmem requests

2009-06-19 Thread Christoph Hellwig

By default a block driver bounces highmem requests, but virtio-blk is
perfectly fine with any request that fits into its 64-bit addressing scheme,
whether mapped into kernel virtual space or not.

Besides improving performance on highmem systems this also makes the
reproducible oops in __bounce_end_io go away (though it only hides the real cause).

Signed-off-by: Christoph Hellwig 

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c  2009-06-15 16:28:24.225815322 +0200
+++ linux-2.6/drivers/block/virtio_blk.c       2009-06-19 18:03:12.469805377 +0200
@@ -360,6 +360,9 @@ static int __devinit virtblk_probe(struc
blk_queue_max_phys_segments(vblk->disk->queue, vblk->sg_elems-2);
blk_queue_max_hw_segments(vblk->disk->queue, vblk->sg_elems-2);
 
+   /* No need to bounce any requests */
+   blk_queue_bounce_limit(vblk->disk->queue, BLK_BOUNCE_ANY);
+
/* No real sector limit. */
blk_queue_max_sectors(vblk->disk->queue, -1U);
 


[PATCH 1/3] eventfd: Allow waiters to be notified about the eventfd file* going away

2009-06-19 Thread Gregory Haskins
From: Davide Libenzi 

And give them a chance to unregister from the wait queue.

This in turn allows eventfd users to use the eventfd file* w/out holding a
live reference to it.

After the eventfd user callbacks returns, any usage of the eventfd file*
should be dropped.  The eventfd user callback can acquire sleepy locks
since it is invoked lockless.

This is a feature, needed by KVM to avoid an awkward workaround when using
eventfd.

Signed-off-by: Davide Libenzi 
Tested-by: Gregory Haskins 
Signed-off-by: Andrew Morton 
Signed-off-by: Gregory Haskins 
Signed-off-by: Avi Kivity 
---

 fs/eventfd.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 2a701d5..c71f51d 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -59,7 +59,15 @@ int eventfd_signal(struct file *file, int n)
 
 static int eventfd_release(struct inode *inode, struct file *file)
 {
-   kfree(file->private_data);
+   struct eventfd_ctx *ctx = file->private_data;
+
+   /*
+* No need to hold the lock here, since we are on the file cleanup
+* path and the ones still attached to the wait queue will be
+* serialized by wake_up_locked_poll().
+*/
+   wake_up_locked_poll(&ctx->wqh, POLLHUP);
+   kfree(ctx);
return 0;
 }
 



Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface

2009-06-19 Thread Gregory Haskins
Davide Libenzi wrote:
> On Thu, 18 Jun 2009, Gregory Haskins wrote:
>
>>
>> But the fact is: I do not see any way to actually use your referenceless
>> POLLHUP release code in a race free way without doing something like I
>> propose in 3/4, 4/4.   Lets keep the discussion focused on that for now,
>> if we could.
>
> OK, since I got literally swamped by the amount of talks and patches over 
> this theoretically simple topic, would you mind  1) posting the global 
> patch over eventfd

Done.  Should show up as replies to this email.  These apply to v2.6.30 and
are devoid of any KVM-isms. :)

(Note-1: I've built and run these patches against additional kvm/irqfd
patches and verified they all work properly)

(Note-2: I included your POLLHUP patch as 1/3 since it currently only exists
in kvm.git, and is a pre-req for the other two)

> 2) describe exactly what races are you talking about


Covered in the patch headers (3/3 probably has the most detail)

 
> 3) explain why this should be any concern of eventfd at all?
>

To support any in-kernel eventfd client that wants to get both "signal"
and "release" notifications (based on using your POLLHUP patch) in a
race-free way.  Without 2/3, 3/3, I cannot see how this can be done.  Yet
your 1/3 patch is an important enhancement (to at least one client).  I
suspect any other in-kernel users will find this a good feature as well.

Thanks Davide,
-Greg




[PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Gregory Haskins
eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
notifier->release() callback.  This lets notification clients know if
the eventfd is about to go away and is very useful particularly for
in-kernel clients.  However, as it stands today it is not possible to
use the notification API in a race-free way.  This patch adds some
additional logic to the notification subsystem to rectify this problem.

Background:
---
Eventfd currently only has one reference count mechanism: fget/fput.  This
in of itself is normally fine.  However, if a client expects to be
notified if the eventfd is closed, it cannot hold a fget() reference
itself or the underlying f_ops->release() callback will never be invoked
by VFS.  Therefore we have this somewhat unusual situation where we may
hold a pointer to an eventfd object (by virtue of having a waiter registered
in its wait-queue), but no reference.  This makes it nearly impossible to
design a mutual decoupling algorithm: you cannot unhook one side from the
other (or vice versa) without racing.

The first problem was dealt with by essentially "donating" a surrogate
object to eventfd.  In other words, when a client attached to eventfd
and then later detached, it would decouple internally in a race free way
and then leave part of the object still attached to the eventfd.  This
decoupled object would correctly detect the end-of-life of the eventfd
object at some point in the future and be deallocated: However, we cannot
guarantee that this operation would not race with a potential rmmod of the
client, and is therefore broken.

Solution Details:
---

1) We add a private kref to the internal eventfd_ctx object.  This
   reference can be (transparently) held by notification clients without
   affecting the ability for VFS to indicate ->release() notification.

2) We convert the current lockless POLLHUP to a more traditional locked
   variant (*) so that we can ensure a race free mutual-decouple
   algorithm without requiring a surrogate object.

3) We guard the decouple algorithm with an atomic bit-clear to ensure
   mutual exclusion of the decoupling and reference-drop.

4) We hold a reference to the underlying eventfd_ctx until all paths
   have satisfactorily completed to ensure we do not race with eventfd
   going away.

Between these points, we believe we now have a race-free release
mechanism.

[*] Clients that previously assumed the ->release() could sleep will
need to be refactored.

Signed-off-by: Gregory Haskins 
CC: Davide Libenzi 
CC: Michael S. Tsirkin 
---

 fs/eventfd.c            |   62 +--
 include/linux/eventfd.h |    3 ++
 2 files changed, 47 insertions(+), 18 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 3d7fb16..934efee 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -16,8 +16,10 @@
 #include 
 #include 
 #include 
+#include 
 
 struct eventfd_ctx {
+   struct kref kref;
wait_queue_head_t wqh;
/*
 * Every time that a write(2) is performed on an eventfd, the
@@ -57,17 +59,24 @@ int eventfd_signal(struct file *file, int n)
return n;
 }
 
+static void _eventfd_release(struct kref *kref)
+{
+   struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+   kfree(ctx);
+}
+
+static void _eventfd_put(struct kref *kref)
+{
+   kref_put(kref, &_eventfd_release);
+}
+
 static int eventfd_release(struct inode *inode, struct file *file)
 {
struct eventfd_ctx *ctx = file->private_data;
 
-   /*
-* No need to hold the lock here, since we are on the file cleanup
-* path and the ones still attached to the wait queue will be
-* serialized by wake_up_locked_poll().
-*/
-   wake_up_locked_poll(&ctx->wqh, POLLHUP);
-   kfree(ctx);
+   wake_up_poll(&ctx->wqh, POLLHUP);
+   _eventfd_put(&ctx->kref);
return 0;
 }
 
@@ -222,6 +231,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
if (!ctx)
return -ENOMEM;
 
+   kref_init(&ctx->kref);
init_waitqueue_head(&ctx->wqh);
ctx->count = count;
ctx->flags = flags;
@@ -242,6 +252,15 @@ SYSCALL_DEFINE1(eventfd, unsigned int, count)
return sys_eventfd2(count, 0);
 }
 
+enum {
+   eventfd_notifier_flag_active,
+};
+
+static int test_and_clear_active(struct eventfd_notifier *en)
+{
+   return test_and_clear_bit(eventfd_notifier_flag_active, &en->flags);
+}
+
 static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
   int sync, void *key)
 {
@@ -251,19 +270,15 @@ static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
en = container_of(wait, struct eventfd_notifier, wait);
 
if (flags & POLLIN)
-   /*
-* The POLLIN wake_up is called with interrupts disabled.
-*/
en->ops->signal(en);
 
  

[PATCH 2/3] eventfd: add generalized notifier interface

2009-06-19 Thread Gregory Haskins
Users that want to register for signal notifications with eventfd have
several choices today:  They can do a standard sleep+wakeup against a
->read(), or they can provide their own wakeup handling using the wait-queue
callback mechanism coupled with the eventfd->poll() interface.

In fact, Davide recently published a patch that allows eventfd to transmit
a "release" event when the underlying eventfd is closed via a POLLHUP
wakeup. This type of event is extremely useful for in-kernel notification
clients. However the wait-queue based notification interface alone is not
sufficient to use this new information race-free since it requires
operating lockless and referenceless.  We need to track some additional
data that is independent of the file* pointer, since we need
f_ops->release() to still function.

Therefore, this patch lays the groundwork to try and fix these issues. It
accomplishes this by abstracting eventfd's wait-queue based notification
interface behind eventfd specific register()/unregister() verbs.  It also
provides an eventfd specific object (eventfd_notifier) that is intended to
be embedded in the client, but used by eventfd to track proper state.

We will use this interface later in the series to fix the current races.

Signed-off-by: Gregory Haskins 
CC: Davide Libenzi 
CC: Michael S. Tsirkin 
---

 fs/eventfd.c            |   64 +++
 include/linux/eventfd.h |   33 
 2 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index c71f51d..3d7fb16 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -242,3 +242,67 @@ SYSCALL_DEFINE1(eventfd, unsigned int, count)
return sys_eventfd2(count, 0);
 }
 
+static int eventfd_notifier_wakeup(wait_queue_t *wait, unsigned mode,
+  int sync, void *key)
+{
+   struct eventfd_notifier *en;
+   unsigned long flags = (unsigned long)key;
+
+   en = container_of(wait, struct eventfd_notifier, wait);
+
+   if (flags & POLLIN)
+   /*
+* The POLLIN wake_up is called with interrupts disabled.
+*/
+   en->ops->signal(en);
+
+   if (flags & POLLHUP) {
+   /*
+* The POLLHUP is called unlocked, so it theoretically should
+* be safe to remove ourselves from the wqh using the locked
+* variant of remove_wait_queue()
+*/
+   remove_wait_queue(en->wqh, &en->wait);
+   en->ops->release(en);
+   }
+
+   return 0;
+}
+
+static void eventfd_notifier_ptable_enqueue(struct file *file,
+   wait_queue_head_t *wqh,
+   poll_table *pt)
+{
+   struct eventfd_notifier *en;
+
+   en = container_of(pt, struct eventfd_notifier, pt);
+
+   en->wqh = wqh;
+   add_wait_queue(wqh, &en->wait);
+}
+
+int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en)
+{
+   unsigned int events;
+
+   if (file->f_op != &eventfd_fops)
+   return -EINVAL;
+
+   /*
+* Install our own custom wake-up handling so we are notified via
+* a callback whenever someone signals the underlying eventfd
+*/
+   init_waitqueue_func_entry(&en->wait, eventfd_notifier_wakeup);
+   init_poll_funcptr(&en->pt, eventfd_notifier_ptable_enqueue);
+
+   events = file->f_op->poll(file, &en->pt);
+
+   return (events & POLLIN) ? 1 : 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_notifier_register);
+
+void eventfd_notifier_unregister(struct eventfd_notifier *en)
+{
+   remove_wait_queue(en->wqh, &en->wait);
+}
+EXPORT_SYMBOL_GPL(eventfd_notifier_unregister);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..cb23969 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,6 +8,32 @@
 #ifndef _LINUX_EVENTFD_H
 #define _LINUX_EVENTFD_H
 
+#include 
+#include 
+#include 
+#include 
+
+struct eventfd_notifier;
+
+struct eventfd_notifier_ops {
+   void (*signal)(struct eventfd_notifier *en);
+   void (*release)(struct eventfd_notifier *en);
+};
+
+struct eventfd_notifier {
+   poll_table pt;
+   wait_queue_head_t *wqh;
+   wait_queue_t   wait;
+   const struct eventfd_notifier_ops *ops;
+};
+
+static inline void eventfd_notifier_init(struct eventfd_notifier *en,
+const struct eventfd_notifier_ops *ops)
+{
+   memset(en, 0, sizeof(*en));
+   en->ops = ops;
+}
+
 #ifdef CONFIG_EVENTFD
 
 /* For O_CLOEXEC and O_NONBLOCK */
@@ -29,12 +55,19 @@
 
 struct file *eventfd_fget(int fd);
 int eventfd_signal(struct file *file, int n);
+int eventfd_notifier_register(struct file *file, struct eventfd_notifier *en);
+void eventfd_notifier_unregister(struct eventfd_notifier *en);
 
 #else /* CONF
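
To illustrate how a client consumes the interface above, a hypothetical
in-kernel user might look like the following sketch (my_client, my_attach
and the handlers are invented for illustration and appear in no posted
patch):

#include <linux/eventfd.h>
#include <linux/slab.h>

/* Hypothetical client embedding an eventfd_notifier. */
struct my_client {
    struct eventfd_notifier notifier;
};

static void my_signal(struct eventfd_notifier *en)
{
    /* eventfd was signalled; runs from wakeup context, so keep it short */
}

static void my_release(struct eventfd_notifier *en)
{
    struct my_client *c = container_of(en, struct my_client, notifier);

    /* last file reference is gone; eventfd already unhooked our waiter */
    kfree(c);
}

static const struct eventfd_notifier_ops my_ops = {
    .signal  = my_signal,
    .release = my_release,
};

static int my_attach(struct file *eventfd_file)
{
    struct my_client *c = kzalloc(sizeof(*c), GFP_KERNEL);
    int ret;

    if (!c)
        return -ENOMEM;
    eventfd_notifier_init(&c->notifier, &my_ops);
    ret = eventfd_notifier_register(eventfd_file, &c->notifier);
    if (ret < 0)
        kfree(c);
    return ret;
}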

Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Davide Libenzi
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> notifier->release() callback.  This lets notification clients know if
> the eventfd is about to go away and is very useful particularly for
> in-kernel clients.  However, as it stands today it is not possible to
> use the notification API in a race-free way.  This patch adds some
> additional logic to the notification subsystem to rectify this problem.
> 
> Background:
> ---
> Eventfd currently only has one reference count mechanism: fget/fput.  This
> in and of itself is normally fine.  However, if a client expects to be
> notified if the eventfd is closed, it cannot hold a fget() reference
> itself or the underlying f_ops->release() callback will never be invoked
> by VFS.  Therefore we have this somewhat unusual situation where we may
> hold a pointer to an eventfd object (by virtue of having a waiter registered
> in its wait-queue), but no reference.  This makes it nearly impossible to
> design a mutual decoupling algorithm: you cannot unhook one side from the
> other (or vice versa) without racing.

And why is that?

struct xxx {
struct mutex mtx;
struct file *file;
...
};

struct file *xxx_get_file(struct xxx *x) {
struct file *file;

mutex_lock(&x->mtx);
file = x->file;
if (!file)
mutex_unlock(&x->mtx);
return file;
}

void xxx_release_file(struct xxx *x) {
mutex_unlock(&x->mtx);
}

void handle_POLLHUP(struct xxx *x) {
struct file *file;

file = xxx_get_file(x);
if (file) {
unhook_waitqueue(file, ...);
x->file = NULL;
xxx_release_file(x);
}
}


Every time you need to "use" file, you call xxx_get_file(), and if you get 
NULL, it means it's gone and you handle it according to your IRQ fd 
policies. As soon as you are done with the file, you call xxx_release_file().
Replace "mtx" with the lock that fits your needs.



- Davide




kvm: emulation failure

2009-06-19 Thread Ram Pai
I see this problem with an x86 sles10 guest running on an x86_64 Intel host.
If the guest is reset abruptly and rebooted, it hangs somewhere
before the grub sequence, and the following message is seen in the
logs:

emulation failed (pagetable) rip 7ed5 66 60 ac 20.

I located this instruction sequence in isolinux.bin on the iso, if that
is relevant.


I did some analysis and found that there is an EPT violation, which is
handled, and then an attempt is made to decode and emulate the next
instruction '66 60'. But the decode fails, and kvm keeps looping in the
kernel in __vcpu_run().

The code path is:

kvm_run() -> __vcpu_run() -> vcpu_enter_guest() -> kvm_handle_exit() ->
handle_ept_violation() -> kvm_mmu_page_fault() -> emulate_instruction() ->
x86_decode_insn()


Any insights on how to fix the problem are appreciated. And if a fix 
already exists, even better :)

thanks,
RP



Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Gregory Haskins
Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>   
>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>> notifier->release() callback.  This lets notification clients know if
>> the eventfd is about to go away and is very useful particularly for
>> in-kernel clients.  However, as it stands today it is not possible to
>> use the notification API in a race-free way.  This patch adds some
>> additional logic to the notification subsystem to rectify this problem.
>>
>> Background:
>> ---
>> Eventfd currently only has one reference count mechanism: fget/fput.  This
>> in and of itself is normally fine.  However, if a client expects to be
>> notified if the eventfd is closed, it cannot hold a fget() reference
>> itself or the underlying f_ops->release() callback will never be invoked
>> by VFS.  Therefore we have this somewhat unusual situation where we may
>> hold a pointer to an eventfd object (by virtue of having a waiter registered
>> in its wait-queue), but no reference.  This makes it nearly impossible to
>> design a mutual decoupling algorithm: you cannot unhook one side from the
>> other (or vice versa) without racing.
>> 
>
> And why is that?
>
> struct xxx {
>   struct mutex mtx;
>   struct file *file;
>   ...
> };
>
> struct file *xxx_get_file(struct xxx *x) {
>   struct file *file;
>
>   mutex_lock(&x->mtx);
>   file = x->file;
>   if (!file)
>   mutex_unlock(&x->mtx);
>   return file;
> }
>
> void xxx_release_file(struct xxx *x) {
>   mutex_unlock(&x->mtx);
> }
>
> void handle_POLLHUP(struct xxx *x) {
>   struct file *file;
>
>   file = xxx_get_file(x);
>   if (file) {
>   unhook_waitqueue(file, ...);
>   x->file = NULL;
>   xxx_release_file(x);
>   }
> }
>
>
> Every time you need to "use" file, you call xxx_get_file(), and if you get 
> NULL, it means it's gone and you handle it according to your IRQ fd 
> policies. As soon as you are done with the file, you call xxx_release_file().
> Replace "mtx" with the lock that fits your needs.
>   

Consider what would happen if the f_ops->release() was preempted inside
the wake_up_locked_polled() after it dereferenced the xxx from the list,
but before it calls the callback(POLLHUP).  The xxx object, and/or the
.text for the xxx object may be long gone by the time it comes back
around.  Afaict, there is no way to guard against that scenario unless
you do something like 2/3+3/3.  Or am I missing something?

-Greg






Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Davide Libenzi
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Fri, 19 Jun 2009, Gregory Haskins wrote:
> >
> >   
> >> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
> >> notifier->release() callback.  This lets notification clients know if
> >> the eventfd is about to go away and is very useful particularly for
> >> in-kernel clients.  However, as it stands today it is not possible to
> >> use the notification API in a race-free way.  This patch adds some
> >> additional logic to the notification subsystem to rectify this problem.
> >>
> >> Background:
> >> ---
> >> Eventfd currently only has one reference count mechanism: fget/fput.  This
> >> in and of itself is normally fine.  However, if a client expects to be
> >> notified if the eventfd is closed, it cannot hold a fget() reference
> >> itself or the underlying f_ops->release() callback will never be invoked
> >> by VFS.  Therefore we have this somewhat unusual situation where we may
> >> hold a pointer to an eventfd object (by virtue of having a waiter 
> >> registered
> >> in its wait-queue), but no reference.  This makes it nearly impossible to
> >> design a mutual decoupling algorithm: you cannot unhook one side from the
> >> other (or vice versa) without racing.
> >> 
> >
> > And why is that?
> >
> > struct xxx {
> > struct mutex mtx;
> > struct file *file;
> > ...
> > };
> >
> > struct file *xxx_get_file(struct xxx *x) {
> > struct file *file;
> >
> > mutex_lock(&x->mtx);
> > file = x->file;
> > if (!file)
> > mutex_unlock(&x->mtx);
> > return file;
> > }
> >
> > void xxx_release_file(struct xxx *x) {
> > mutex_unlock(&x->mtx);
> > }
> >
> > void handle_POLLHUP(struct xxx *x) {
> > struct file *file;
> >
> > file = xxx_get_file(x);
> > if (file) {
> > unhook_waitqueue(file, ...);
> > x->file = NULL;
> > xxx_release_file(x);
> > }
> > }
> >
> >
> > Every time you need to "use" file, you call xxx_get_file(), and if you get 
> > NULL, it means it's gone and you handle it according to your IRQ fd 
> > policies. As soon as you are done with the file, you call xxx_release_file().
> > Replace "mtx" with the lock that fits your needs.
> >   
> 
> Consider what would happen if the f_ops->release() was preempted inside
> the wake_up_locked_polled() after it dereferenced the xxx from the list,
> but before it calls the callback(POLLHUP).  The xxx object, and/or the
> .text for the xxx object may be long gone by the time it comes back
> around.  Afaict, there is no way to guard against that scenario unless
> you do something like 2/3+3/3.  Or am I missing something?

Right. Don't you see an easier answer to that, instead of adding 300 lines 
of code to eventfd?
For example, turning wake_up_locked() into a normal wake_up().



- Davide




Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Gregory Haskins
Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>
>   
>> Davide Libenzi wrote:
>> 
>>> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>>>
>>>   
>>>   
 eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
 notifier->release() callback.  This lets notification clients know if
 the eventfd is about to go away and is very useful particularly for
 in-kernel clients.  However, as it stands today it is not possible to
 use the notification API in a race-free way.  This patch adds some
 additional logic to the notification subsystem to rectify this problem.

 Background:
 ---
 Eventfd currently only has one reference count mechanism: fget/fput.  This
 in and of itself is normally fine.  However, if a client expects to be
 notified if the eventfd is closed, it cannot hold a fget() reference
 itself or the underlying f_ops->release() callback will never be invoked
 by VFS.  Therefore we have this somewhat unusual situation where we may
 hold a pointer to an eventfd object (by virtue of having a waiter 
 registered
 in its wait-queue), but no reference.  This makes it nearly impossible to
 design a mutual decoupling algorithm: you cannot unhook one side from the
 other (or vice versa) without racing.
 
 
>>> And why is that?
>>>
>>> struct xxx {
>>> struct mutex mtx;
>>> struct file *file;
>>> ...
>>> };
>>>
>>> struct file *xxx_get_file(struct xxx *x) {
>>> struct file *file;
>>>
>>> mutex_lock(&x->mtx);
>>> file = x->file;
>>> if (!file)
>>> mutex_unlock(&x->mtx);
>>> return file;
>>> }
>>>
>>> void xxx_release_file(struct xxx *x) {
>>> mutex_unlock(&x->mtx);
>>> }
>>>
>>> void handle_POLLHUP(struct xxx *x) {
>>> struct file *file;
>>>
>>> file = xxx_get_file(x);
>>> if (file) {
>>> unhook_waitqueue(file, ...);
>>> x->file = NULL;
>>> xxx_release_file(x);
>>> }
>>> }
>>>
>>>
>>> Every time you need to "use" file, you call xxx_get_file(), and if you get 
>>> NULL, it means it's gone and you handle it according to your IRQ fd 
>>> policies. As soon as you are done with the file, you call xxx_release_file().
>>> Replace "mtx" with the lock that fits your needs.
>>>   
>>>   
>> Consider what would happen if the f_ops->release() was preempted inside
>> the wake_up_locked_polled() after it dereferenced the xxx from the list,
>> but before it calls the callback(POLLHUP).  The xxx object, and/or the
>> .text for the xxx object may be long gone by the time it comes back
>> around.  Afaict, there is no way to guard against that scenario unless
>> you do something like 2/3+3/3.  Or am I missing something?
>> 
>
> Right. Don't you see an easier answer to that, instead of adding 300 lines 
> of code to eventfd?
>   

I tried, but this problem made my head hurt and this was what I came up
with that I felt closes the holes all the way.  Also keep in mind that
while I added X lines to eventfd, I took Y lines *out* of irqfd in the
process, too.  I just excluded the KVM portions in this thread per your
request, so it's not apparent.  But technically, any other clients that
may come along can reuse that notification code instead of coding it
again.  One way or the other, *someone* has to do that ptable_proc stuff
;)  FYI: It's more like 133 lines, fwiw.

 fs/eventfd.c            |  104
 include/linux/eventfd.h |   36
 2 files changed, 133 insertions(+), 7 deletions(-)

In case you care, heres what the complete solution when I include KVM
currently looks like:

 fs/eventfd.c            |  104 +--
 include/linux/eventfd.h |   36 +
 virt/kvm/eventfd.c      |  181 +---
 3 files changed, 228 insertions(+), 93 deletions(-)

> For example, turning wake_up_locked() into a normal wake_up().
>   

I am fairly confident it is not that simple after having thought about
this issue over the last few days.  But I've been wrong in the past. 
Propose a patch and I will review it for races/correctness, if you
like.  Perhaps a combination of that plus your asymmetrical locking
scheme would work.  One of the challenges you will hit is avoiding ABBA
between your "get" lock and the wqh, but good luck!

-Greg






Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Davide Libenzi
On Fri, 19 Jun 2009, Gregory Haskins wrote:

> I am fairly confident it is not that simple after having thought about
> this issue over the last few days.  But I've been wrong in the past. 
> Propose a patch and I will review it for races/correctness, if you
> like.  Perhaps a combination of that plus your asymmetrical locking
> scheme would work.  One of the challenges you will hit is avoiding ABBA
> between your "get" lock and the wqh, but good luck!

A patch for what? The eventfd patch is a one-liner.
It seems hard to believe that the thing cannot be handled on your side. 
Once the wake_up_locked() is turned into a wake_up(), what other races are 
there?
Let's not try to find the problem that fits and justify the "cool" 
solution, but let's see if a problem is there at all.


- Davide




Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Davide Libenzi
On Fri, 19 Jun 2009, Davide Libenzi wrote:

> On Fri, 19 Jun 2009, Gregory Haskins wrote:
> 
> > I am fairly confident it is not that simple after having thought about
> > this issue over the last few days.  But I've been wrong in the past. 
> > Propose a patch and I will review it for races/correctness, if you
> > like.  Perhaps a combination of that plus your asymmetrical locking
> > scheme would work.  One of the challenges you will hit is avoiding ABBA
> > between your "get" lock and the wqh, but good luck!
> 
> A patch for what? The eventfd patch is a one-liner.
> It seems hard to believe that the thing cannot be handled on your side. 
> Once the wake_up_locked() is turned into a wake_up(), what other races are 
> there?

AFAICS, the IRQfd code simply registers the callback to ->poll() and waits 
for two events.
In the POLLIN event, you schedule_work(&irqfd->inject) and there are no 
races there AFAICS (you basically do not care of anything eventfd memory 
related at all).
For POLLHUP, you do:

spin_lock(irqfd->slock);
if (irqfd->wqh)
schedule_work(&irqfd->inject);
irqfd->wqh = NULL;
spin_unlock(irqfd->slock);

In your work function you notice the POLLHUP condition and take proper 
action (dunno what it is in your case).
In your kvm_irqfd_release() function:

spin_lock(irqfd->slock);
if (irqfd->wqh)
remove_wait_queue(irqfd->wqh, &irqfd->wait);
irqfd->wqh = NULL;
spin_unlock(irqfd->slock);

Any races in there?



- Davide




Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

2009-06-19 Thread Gregory Haskins
Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Davide Libenzi wrote:
>
>   
>> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>>
>> 
>>> I am fairly confident it is not that simple after having thought about
>>> this issue over the last few days.  But I've been wrong in the past. 
>>> Propose a patch and I will review it for races/correctness, if you
>>> like.  Perhaps a combination of that plus your asymmetrical locking
>>> scheme would work.  One of the challenges you will hit is avoiding ABBA
>>> between your "get" lock and the wqh, but good luck!
>>>   
>> A patch for what? The eventfd patch is a one-liner.
>> 

Yes, understood.  What I was trying to gently say is that the one-liner
proposal alone is still broken afaict.  However, if there is another
solution that works that you like better than 133-liner I posted, I am
more than willing to help analyze it.  In the end, I only care that this
is fixed.

>> It seems hard to believe that the thing cannot be handled on your side. 
>> Once the wake_up_locked() is turned into a wake_up(), what other races are 
>> there?
>> 
>
> AFAICS, the IRQfd code simply registers the callback to ->poll() and waits 
> for two events.
> In the POLLIN event, you schedule_work(&irqfd->inject) and there are no 
> races there AFAICS (you basically do not care of anything eventfd memory 
> related at all).
> For POLLHUP, you do:
>
>   spin_lock(irqfd->slock);
>   if (irqfd->wqh)
>   schedule_work(&irqfd->inject);
>   irqfd->wqh = NULL;
>   spin_unlock(irqfd->slock);
>
> In your work function you notice the POLLHUP condition and take proper 
> action (dunno what it is in your case).
> In your kvm_irqfd_release() function:
>
>   spin_lock(irqfd->slock);
>   if (irqfd->wqh)
>   remove_wait_queue(irqfd->wqh, &irqfd->wait);
>   irqfd->wqh = NULL;
>   spin_unlock(irqfd->slock);
>
> Any races in there?
>   

Yes, for one you have an ABBA deadlock on the irqfd->slock and wqh->lock
(recall wqh has to be locked to fix that other race I mentioned).

(As a hint, I think I fixed 4-5 races with these patches, so there are a
few others still lurking as well)

-Greg






Luvalley-3 has been released, this release is for AMD CPU only

2009-06-19 Thread Xiaodong Yi
Hi, all,

This Luvalley release is for machines with AMD CPUs only. It cannot
work on Intel CPUs. Also, this release is for Linux dom0 only. Please
refer to Luvalley-2 if you have a machine with Intel CPUs.

Due to the limitations of AMD CPUs, this Luvalley release has a new
architecture different from that on Intel CPUs. The limitation is that
AMD CPUs cannot receive the NMI IPI. That is to say, sending an NMI
IPI to other CPUs in the SMP computer does not work. More precisely,
the BSP (i.e., CPU #0) can receive the NMI IPI issued from the APs
(i.e., CPU #1, CPU #2, ...). The APs, however, cannot receive the NMI
IPI from the BSP. I tried this in Linux 2.6.18 and Xen 3.1 and got the
same problem.

Ridiculously, the "AMD64 Architecture Programmer's Manual Volume 2:
System Programming" clearly states that NMI IPI is fully supported,
just like Intel's. The same code works very well on Intel's CPUs, but
does not work on AMD's CPUs.
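
For concreteness, sending such an NMI IPI on xAPIC hardware amounts to a
single write to the local APIC's interrupt command register; a sketch of
the operation (register offsets and encodings per the architecture
manuals, with the usual 0xfee00000 MMIO mapping assumed):

#include <stdint.h>

#define APIC_BASE        0xfee00000UL   /* conventional xAPIC MMIO base */
#define APIC_ICR_LOW     0x300
#define APIC_DM_NMI      (4u << 8)      /* delivery mode 100b = NMI */
#define APIC_DEST_ALLBUT (3u << 18)     /* shorthand: all excluding self */

static inline void apic_write32(uint32_t reg, uint32_t val)
{
    *(volatile uint32_t *)(APIC_BASE + reg) = val;
}

/* Send an NMI IPI to every other CPU; the vector field is ignored for
 * NMI delivery.  This is the operation that, per the text above, is
 * received by the BSP but not by the APs on the affected AMD parts. */
static void send_nmi_ipi_allbutself(void)
{
    apic_write32(APIC_ICR_LOW, APIC_DEST_ALLBUT | APIC_DM_NMI);
}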

So, we have to trap all physical interrupts on machines with AMD CPUs
and then inject them back into dom0. That is to say, the CPU is
configured to vmexit to Luvalley when physical interrupts occur.
Moreover, we have to use a normal interrupt vector for sending and
receiving IPIs. This caused many problems with APIC ACK and interrupt
blocking.

Although there are still many unsolved problems, Luvalley can indeed
run Linux as its dom0 and a Windows domU as well. We decided to pause
development for AMD CPUs until AMD works around the NMI IPI bug
(which may never happen). On the other hand, Luvalley's architecture
will be modified little by little in the future, while still
preserving current features such as ease of use and installation, an
arbitrary dom0 OS, and hardware compatibility. In the long term,
Luvalley should trap physical interrupts and manage the APICs. At
that time, AMD CPUs could be supported even if the NMI IPI bug is
still there.


More details on Luvalley

Luvalley is a Virtual Machine Monitor (VMM) spawned from the KVM
project. Part of its source code is derived from KVM, to virtualize
CPU instructions and the memory management unit (MMU). However, its
overall architecture is completely different from KVM's and somewhat
like Xen's: Luvalley runs outside of Linux, just as Xen does. Any
operating system, including Linux, could be used as Luvalley's
scheduler, memory manager, physical device driver provider and
virtual IO device emulator. Currently, Luvalley supports Linux and
Windows. That is to say, one may run Luvalley to boot a Linux or
Windows, and then run multiple virtualized operating systems on top
of that Linux or Windows.

If you are interested in the Luvalley project, you may download the
source code from
   http://sourceforge.net/projects/luvalley/

Regards,

Xiaodong


kernel backtrace pointing at KVM

2009-06-19 Thread Ivelin Ivanov
Hello KVM team,

I am using KVM to maintain a mixture of Windows and Linux VMs.
Recently I reinstalled the system with the latest kernel, 2.6.30, and
qemu-kvm-0.86.
I kept using the KVM module from the 2.6.30 kernel, not the one packaged
with qemu-kvm.
Last night one of the Windows XP SP3 systems rebooted and I found the
following trace in my logs:

------------[ cut here ]------------
WARNING: at arch/x86/kvm/x86.c:204 kvm_queue_exception_e+0x24/0x45 [kvm]()
Hardware name: IBM System x3650 -[7979B4G]-
Modules linked in: 8021q garp tun reiserfs st ide_gd_mod ide_cd_mod
bridge stp kvm_intel kvm ipv6 af_packet cpufreq_conservative
cpufreq_userspace cpufreq_powersave acpi_cpufreq speedstep_lib loop
dm_mod i5k_amb ibmpex i5000_edac ibmaem i2c_i801 iTCO_wdt
iTCO_vendor_support rtc_cmos bnx2 sr_mod rtc_core ipmi_msghandler ses
edac_core rtc_lib button pcspkr i2c_core cdrom serio_raw enclosure
shpchp pci_hotplug joydev sg usbhid hid uhci_hcd ehci_hcd usbcore sd_mod
crc_t10dif edd ext3 mbcache jbd fan ide_pci_generic ide_core ata_generic
ata_piix libata thermal processor thermal_sys hwmon aacraid scsi_mod
Pid: 26872, comm: qemu-system-x86 Not tainted 2.6.30-9-pae #1
Call Trace:
 [] warn_slowpath_common+0x60/0x90
 [] warn_slowpath_null+0xd/0x10
 [] kvm_queue_exception_e+0x24/0x45 [kvm]
 [] kvm_task_switch+0xfb/0xada [kvm]
 [] ? _spin_unlock+0xf/0x23
 [] ? kvm_mmu_page_fault+0x16/0x75 [kvm]
 [] handle_task_switch+0x6d/0x96 [kvm_intel]
 [] kvm_handle_exit+0x1c3/0x1e0 [kvm_intel]
 [] kvm_arch_vcpu_ioctl_run+0x89d/0xab6 [kvm]
 [] ? do_sync_readv_writev+0xa1/0xdf
 [] kvm_vcpu_ioctl+0xec/0x602 [kvm]
 [] ? _spin_unlock+0xf/0x23
 [] ? kvm_vcpu_ioctl+0x0/0x602 [kvm]
 [] vfs_ioctl+0x22/0x69
 [] do_vfs_ioctl+0x439/0x472
 [] ? unmap_region+0x10a/0x124
 [] ? fget_light+0x8a/0xb1
 [] sys_ioctl+0x40/0x5a
 [] sysenter_do_call+0x12/0x28
---[ end trace 9cc5a56b9c7eda31 ]---

Here is some more info:
# uname -a
Linux pc187 2.6.30-9-pae #1 SMP PREEMPT Wed Jun 17 15:29:59 EEST 2009
i686 i686 i386 GNU/Linux
# qemu-system-x86_64  --version
QEMU PC emulator version 0.10.50 (qemu-kvm-devel-86), Copyright (c)
2003-2008 Fabrice Bellard
# modinfo /lib/modules/2.6.30-9-pae/kernel/arch/x86/kvm/kvm.ko
filename:   /lib/modules/2.6.30-9-pae/kernel/arch/x86/kvm/kvm.ko
license:GPL
author: Qumranet
srcversion: 934CD9DB264501B0431438A
depends:
vermagic:   2.6.30-9-pae SMP preempt mod_unload modversions CORE2
parm:   oos_shadow:bool
parm:   msi2intx:bool

Am I doing something wrong, or is it a bug in KVM?
Thanks for the great work!
Hope to hear from you soon.
Best regards,

Ivelin Ivanov

P.S. Please, don't hesitate to ask me if you need any more info to trace
this issue.





Re: Checking guest memory pages changes from host userspace

2009-06-19 Thread Amit Shah
On (Fri) Jun 19 2009 [12:09:10], Passera, Pablo R wrote:
> Hi list,
> I need to monitor some guest memory pages and know whether the
> information in them has changed. For this, I was thinking of marking the
> guest memory pages in some way (like write-protecting them) so that a page
> fault is generated, and then handling the fault inside qemu. Is there some
> API in libkvm that allows me to do this?

If it's for debugging, you can use qemu without kvm. If you're only
interested in a few pages (and are proposing a solution based on that),
I don't think that can be done with EPT / NPT enabled so it won't be
something that'll be accepted upstream.

Amit