from:"Gleb Natapov"

Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Gleb Natapov

On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote:
> Zhang Yanfei  writes:
> 
> > This patch adds an atomic notifier list named crash_notifier_list.
> > Currently, when loading kvm-intel module, a notifier will be registered
> > in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if
> > needed.
> 
> crash_notifier_list ick gag please no.  Effectively this makes the kexec
> on panic code path undebuggable.
> 
> Instead we need to use direct function calls to whatever you are doing.
> 
The code walks linked list in kvm-intel module and calls vmclear on
whatever it finds there. Since the function have to resides in kvm-intel
module it cannot be called directly. Is callback pointer that is set
by kvm-intel more acceptable?

> If a direct function call is too complex then the piece of code you want
> to call is almost certainly too complex to be calling on a code path
> like this.
> 
> Eric
> 
> > Signed-off-by: Zhang Yanfei 
> > ---
> >  arch/x86/include/asm/kexec.h |2 ++
> >  arch/x86/kernel/crash.c  |9 +
> >  2 files changed, 11 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> > index 317ff17..5e22b00 100644
> > --- a/arch/x86/include/asm/kexec.h
> > +++ b/arch/x86/include/asm/kexec.h
> > @@ -163,6 +163,8 @@ struct kimage_arch {
> >  };
> >  #endif
> >  
> > +extern struct atomic_notifier_head crash_notifier_list;
> > +
> >  #endif /* __ASSEMBLY__ */
> >  
> >  #endif /* _ASM_X86_KEXEC_H */
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index 13ad899..c5b2f70 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -16,6 +16,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -30,6 +32,9 @@
> >  
> >  int in_crash_kexec;
> >  
> > +ATOMIC_NOTIFIER_HEAD(crash_notifier_list);
> > +EXPORT_SYMBOL_GPL(crash_notifier_list);
> > +
> >  #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
> >  
> >  static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
> > @@ -46,6 +51,8 @@ static void kdump_nmi_callback(int cpu, struct pt_regs 
> > *regs)
> >  #endif
> > crash_save_cpu(regs, cpu);
> >  
> > +   atomic_notifier_call_chain(&crash_notifier_list, 0, NULL);
> > +
> > /* Disable VMX or SVM if needed.
> >  *
> >  * We need to disable virtualization on all CPUs.
> > @@ -88,6 +95,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >  
> > kdump_nmi_shootdown_cpus();
> >  
> > +   atomic_notifier_call_chain(&crash_notifier_list, 0, NULL);
> > +
> > /* Booting kdump kernel with VMX or SVM enabled won't work,
> >  * because (among other limitations) we can't disable paging
> >  * with the virt flags.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v8 1/2] x86/kexec: add a new atomic notifier list for kdump

2012-11-26 Thread Gleb Natapov

On Mon, Nov 26, 2012 at 11:43:10AM -0600, Eric W. Biederman wrote:
> Gleb Natapov  writes:
> 
> > On Mon, Nov 26, 2012 at 09:08:54AM -0600, Eric W. Biederman wrote:
> >> Zhang Yanfei  writes:
> >> 
> >> > This patch adds an atomic notifier list named crash_notifier_list.
> >> > Currently, when loading kvm-intel module, a notifier will be registered
> >> > in the list to enable vmcss loaded on all cpus to be VMCLEAR'd if
> >> > needed.
> >> 
> >> crash_notifier_list ick gag please no.  Effectively this makes the kexec
> >> on panic code path undebuggable.
> >> 
> >> Instead we need to use direct function calls to whatever you are doing.
> >> 
> > The code walks linked list in kvm-intel module and calls vmclear on
> > whatever it finds there. Since the function have to resides in kvm-intel
> > module it cannot be called directly. Is callback pointer that is set
> > by kvm-intel more acceptable?
> 
> Yes a specific callback function is more acceptable.  Looking a little
> deeper vmclear_local_loaded_vmcss is not particularly acceptable. It is
> doing a lot of work that is unnecessary to save the virtual registers
> on the kexec on panic path.
> 
What work are you referring to in particular that may not be acceptable?

> In fact I wonder if it might not just be easier to call vmcs_clear to a
> fixed per cpu buffer.
> 
There may be more than one vmcs loaded on a cpu, hence the list.

> Performing list walking in interrupt context without locking in
> vmclear_local_loaded vmcss looks a bit scary.  Not that locking would
> make it any better, as locking would simply add one more way to deadlock
> the system.  Only an rcu list walk is at all safe.  A list walk that
> modifies the list as vmclear_local_loaded_vmcss does is definitely not safe.
> 
The list vmclear_local_loaded walks is per cpu. Zhang's kvm patch
disables kexec callback while list is modified.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Qemu-devel] [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru devices assigned to KVM guests

2012-11-26 Thread Gleb Natapov

On Mon, Nov 26, 2012 at 09:46:12PM -0200, Marcelo Tosatti wrote:
> On Tue, Nov 20, 2012 at 02:09:46PM +, Pandarathil, Vijaymohan R wrote:
> > 
> > 
> > > -Original Message-
> > > From: Stefan Hajnoczi [mailto:stefa...@gmail.com]
> > > Sent: Tuesday, November 20, 2012 5:41 AM
> > > To: Pandarathil, Vijaymohan R
> > > Cc: k...@vger.kernel.org; linux-...@vger.kernel.org; 
> > > qemu-de...@nongnu.org;
> > > linux-kernel@vger.kernel.org
> > > Subject: Re: [PATCH 0/4] AER-KVM: Error containment of PCI pass-thru
> > > devices assigned to KVM guests
> > > 
> > > On Tue, Nov 20, 2012 at 06:31:48AM +, Pandarathil, Vijaymohan R wrote:
> > > > Add support for error containment when a PCI pass-thru device assigned 
> > > > to
> > > a KVM
> > > > guest encounters an error. This is for PCIe devices/drivers that support
> > > AER
> > > > functionality. When the OS is notified of an error in a device either
> > > > through the firmware first approach or through an interrupt handled by
> > > the AER
> > > > root port driver, concerned subsystems are notified by invoking 
> > > > callbacks
> > > > registered by these subsystems. The device is also marked as tainted 
> > > > till
> > > the
> > > > corresponding driver recovery routines are successful.
> > > >
> > > > KVM module registers for a notification of such errors. In the KVM
> > > callback
> > > > routine, a global counter is incremented to keep track of the error
> > > > notification. Before each CPU enters guest mode to execute guest code,
> > > > appropriate checks are done to see if the impacted device belongs to the
> > > guest
> > > > or not. If the device belongs to the guest, qemu hypervisor for the 
> > > > guest
> > > is
> > > > informed and the guest is immediately brought down, thus preventing or
> > > > minimizing chances of any bad data being written out by the guest driver
> > > > after the device has encountered an error.
> > > 
> > > I'm surprised that the hypervisor would shut down the guest when PCIe
> > > AER kicks in for a pass-through device.  Shouldn't we pass the AER event
> > > into the guest and deal with it there?
> > 
> > Agreed. That would be the ideal behavior and is planned in a future patch.
> > Lack of control over the capabilities/type of the OS/drivers running in 
> > the guest is also a concern in passing along the event to the guest.
> > 
> > My understanding is that in the current implementation of Linux/KVM, these 
> > errors are not handled at all and can potentially cause a guest hang or 
> > crash or even data corruption depending on the implementation of the guest
> > driver for the device. As a first step, these patches make the behavior 
> > better by doing error containment with a predictable behavior when such
> > errors occur. 
> 
> For both ACPI notifications and Linux PCI AER driver there is a way for
> the PCI driver to receive a notification, correct?
> 
> Can just have virt/kvm/assigned-dev.c code register such a notifier (as
> a "PCI driver") and then perform appropriate action?
> 
> Also the semantics of "tainted driver" is not entirely clear.
> 
> Is there any reason for not having this feature for VFIO only, as KVM
> device assigment is being phased out?
> 
Exactly. We shouldn't add checks to guest entry code and introduce new
userspace ABI to add minor feature to deprecated code. New userspace ABI
means that QEMU changes are needed, so the feature will be fully functional
only with latest QEMU which is capable of using VFIO anyway.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v9 1/2] x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary

2012-11-27 Thread Gleb Natapov

Eric, can you ACK it?

On Tue, Nov 27, 2012 at 11:26:02AM +0800, Zhang Yanfei wrote:
> This patch provides a way to VMCLEAR VMCSs related to guests
> on all cpus before executing the VMXOFF when doing kdump. This
> is used to ensure the VMCSs in the vmcore updated and
> non-corrupted.
> 
> Signed-off-by: Zhang Yanfei 
> ---
>  arch/x86/include/asm/kexec.h |2 ++
>  arch/x86/kernel/crash.c  |   25 +
>  2 files changed, 27 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
> index 317ff17..28feeba 100644
> --- a/arch/x86/include/asm/kexec.h
> +++ b/arch/x86/include/asm/kexec.h
> @@ -163,6 +163,8 @@ struct kimage_arch {
>  };
>  #endif
>  
> +extern void (*crash_vmclear_loaded_vmcss)(void);
> +
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* _ASM_X86_KEXEC_H */
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 13ad899..4a2a12f 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -29,6 +30,20 @@
>  #include 
>  
>  int in_crash_kexec;
> + 
> +/*
> + * This is used to VMCLEAR all VMCSs loaded on the
> + * processor. And when loading kvm_intel module, the
> + * callback function pointer will be assigned.
> + */
> +void (*crash_vmclear_loaded_vmcss)(void) = NULL;
> +EXPORT_SYMBOL_GPL(crash_vmclear_loaded_vmcss);
> +
> +static inline void cpu_emergency_vmclear_loaded_vmcss(void)
> +{
> + if (crash_vmclear_loaded_vmcss)
> + crash_vmclear_loaded_vmcss();
> +}
>  
>  #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
>  
> @@ -46,6 +61,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs 
> *regs)
>  #endif
>   crash_save_cpu(regs, cpu);
>  
> + /*
> +  * VMCLEAR VMCSs loaded on all cpus if needed.
> +  */
> + cpu_emergency_vmclear_loaded_vmcss();
> +
>   /* Disable VMX or SVM if needed.
>*
>* We need to disable virtualization on all CPUs.
> @@ -88,6 +108,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>  
>   kdump_nmi_shootdown_cpus();
>  
> + /*
> +  * VMCLEAR VMCSs loaded on this cpu if needed.
> +  */
> + cpu_emergency_vmclear_loaded_vmcss();
> +
>   /* Booting kdump kernel with VMX or SVM enabled won't work,
>* because (among other limitations) we can't disable paging
>* with the virt flags.
> -- 
> 1.7.1

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] Fix abnormal rcu dynticks_nesting values related to async page fault

2012-11-27 Thread Gleb Natapov

On Tue, Nov 27, 2012 at 01:15:25PM +0800, Li Zhong wrote:
> I noticed some warnings complaining about dynticks_nesting value, like 
> 
> [  267.545032] [ cut here ]
> [  267.545032] WARNING: at kernel/rcutree.c:382 rcu_eqs_enter+0xab/0xc0()
> [  267.545032] Hardware name: Bochs
> [  267.545032] Modules linked in:
> [  267.545032] Pid: 0, comm: swapper/2 Not tainted 3.7.0-rc5-next-20121115 #8
> [  267.545032] Call Trace:
> [  267.545032]  [] warn_slowpath_common+0x7f/0xc0
> [  267.545032]  [] warn_slowpath_null+0x1a/0x20
> [  267.545032]  [] rcu_eqs_enter+0xab/0xc0
> [  267.545032]  [] rcu_idle_enter+0x2b/0x70
> [  267.545032]  [] cpu_idle+0x6f/0x100
> [  267.545032]  [] start_secondary+0x205/0x20c
> [  267.545032] ---[ end trace 924ae80da035028d ]---
> 
> After enabling rcu-dyntick tracing, I got following abnormal
> dynticks_nesting values (13f, ff01,etc):
>   ...
>  1-0 [002] dN.2 18739.518567: rcu_dyntick: End 0 
> 140rcu_idle_exit
>  2  sshd-696   [002] d..1 18739.518675: rcu_dyntick: ++= 140 
> 141  rcu_irq_enter   - apf (not present)
> 
>  3-0 [002] d..2 18739.518705: rcu_dyntick: Start 
> 141 0  rcu_idle_enter
>  4-0 [002] d..2 18739.521252: rcu_dyntick: End 0 1  
> rcu_irq_enter   - apf (page ready)
>  5-0 [002] dN.2 18739.521261: rcu_dyntick: Start 1 0
> rcu_irq_exit- apf (page ready)
>  6-0 [002] dN.2 18739.521263: rcu_dyntick: End 0 
> 140rcu_idle_exit
> 
>  7  sshd-696   [002] d..1 18739.521299: rcu_dyntick: --= 140 
> 13f  rcu_irq_exit- apf (not present)
>  8  sshd-696   [002] d..1 18739.521302: rcu_dyntick: Start 
> 13f 0  rcu_user_enter
>  9  sshd-696   [002] d..1 18739.521330: rcu_dyntick: End 0 1  
> rcu_irq_enter   - apf (not present)
> 
> 10-0 [002] d..2 18739.521346: rcu_dyntick: Start 1 0
> rcu_idle_enter - old value 1, warning
> 11-0 [002] d..2 18739.530021: rcu_dyntick: ++= ff01 
> ff02
> 12-0 [002] dN.2 18739.530029: rcu_dyntick: --= ff02 
> ff01
>   ...
> 
> I added the functions I guess which printed the tracing after each
> line. 
> 
> Line #1, the idle-0 process calls rcu_idle_exit(), and finishes one
> loop, to switch to sshd-696
> 
> Line #2, sshd-696 calls rcu_irq_enter() because of async page fault(page
> not present), and puts itself to wait for page ready
> 
> Line #3, idle-0 is switched in, and clears the dynticks_nesting to 0
> 
> Line #4-5, I think the rcu_irq_enter/exit() is called because the page
> for sshd-696 is ready
> 
> Line #6, idle-0 calls rcu_idle_exit(), to switch to sshd-696
> 
> Line #7, sshd-696 calls rcu_irq_exit() in the apf (page not present)
> code path, decreasing dynticks_nesting to 13f. 
> 
> Line #8-9, sshd-696 calls rcu_user_enter() to start user eqs, and gets
> async page fault again. It puts itself sleep again, with
> dynticks_nesting value as 1.
> 
> Line #10, idle-0 switches in, as the dynticks_nesting value is 1, so
> warning is reported in rcu_idle_enter(), then the value is decreased to
> ff01. (In the tracing log, the new value is 0, that's
> because rcu hard-code the value to be 0. I will send another patch for
> this.)
> 
> This patch below tries to replace the rcu_irq_enter/exit() with
> rcu_idle_exit/enter(), if it is in rcu idle, and it is idle process;
> otherwise, rcu_user_exit() is called to exit user eqs if necessary. 
> 
> Signed-off-by: Li Zhong 
> ---
>  arch/x86/kernel/kvm.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 4180a87..f65648d 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -247,10 +247,17 @@ do_async_page_fault(struct pt_regs *regs, unsigned long 
> error_code)
>   break;
>   case KVM_PV_REASON_PAGE_NOT_PRESENT:
>   /* page is swapped out by the host. */
> - rcu_irq_enter();
> + if (is_idle_task(current) && rcu_is_cpu_idle())
> + rcu_idle_exit();
> + else
> + rcu_user_exit();
> +
>   exit_idle();
>   kvm_async_pf_task_wait((u32)read_cr2());
> - rcu_irq_exit();
> +
> + if (is_idle_task(current) && rcu_is_cpu_idle())
> + rcu_idle_enter();
> +
>   break;
>   case KVM_PV_REASON_PAGE_READY:
>   rcu_irq_enter();
Those rcu_irq_enter()/rcu_irq_exit() were introduced by commit
c5e015d4949aa665 "KVM guest: exit idleness when handling
KVM_PV_REASON_PAGE_NOT_PRESENT", but now I am starting to question this
commit. KVM_PV

Re: [RFC PATCH] Fix abnormal rcu dynticks_nesting values related to async page fault

2012-11-27 Thread Gleb Natapov

On Tue, Nov 27, 2012 at 03:38:14PM +0100, Frederic Weisbecker wrote:
> 2012/11/27 Li Zhong :
> > I noticed some warnings complaining about dynticks_nesting value, like
> >
> > [  267.545032] [ cut here ]
> > [  267.545032] WARNING: at kernel/rcutree.c:382 rcu_eqs_enter+0xab/0xc0()
> > [  267.545032] Hardware name: Bochs
> > [  267.545032] Modules linked in:
> > [  267.545032] Pid: 0, comm: swapper/2 Not tainted 3.7.0-rc5-next-20121115 
> > #8
> > [  267.545032] Call Trace:
> > [  267.545032]  [] warn_slowpath_common+0x7f/0xc0
> > [  267.545032]  [] warn_slowpath_null+0x1a/0x20
> > [  267.545032]  [] rcu_eqs_enter+0xab/0xc0
> > [  267.545032]  [] rcu_idle_enter+0x2b/0x70
> > [  267.545032]  [] cpu_idle+0x6f/0x100
> > [  267.545032]  [] start_secondary+0x205/0x20c
> > [  267.545032] ---[ end trace 924ae80da035028d ]---
> >
> > After enabling rcu-dyntick tracing, I got following abnormal
> > dynticks_nesting values (13f, ff01,etc):
> > ...
> >  1  -0 [002] dN.2 18739.518567: rcu_dyntick: End 0 
> > 140rcu_idle_exit
> >  2sshd-696   [002] d..1 18739.518675: rcu_dyntick: ++= 
> > 140 141  rcu_irq_enter   - apf (not present)
> 
> How did that happen? When I look at do_async_page_fault(),
> KVM_PV_REASON_PAGE_NOT_PRESENT doesn't do rcu_irq_enter().
> 
> >
> >  3  -0 [002] d..2 18739.518705: rcu_dyntick: Start 
> > 141 0  rcu_idle_enter
> >  4  -0 [002] d..2 18739.521252: rcu_dyntick: End 0 1  
> > rcu_irq_enter   - apf (page ready)
> >  5  -0 [002] dN.2 18739.521261: rcu_dyntick: Start 1 0
> > rcu_irq_exit- apf (page ready)
> >  6  -0 [002] dN.2 18739.521263: rcu_dyntick: End 0 
> > 140rcu_idle_exit
> >
> >  7sshd-696   [002] d..1 18739.521299: rcu_dyntick: --= 
> > 140 13f  rcu_irq_exit- apf (not present)
> 
> I'm confused for the same reason here.
> 
> >  8sshd-696   [002] d..1 18739.521302: rcu_dyntick: Start 
> > 13f 0  rcu_user_enter
> >  9sshd-696   [002] d..1 18739.521330: rcu_dyntick: End 0 1  
> > rcu_irq_enter   - apf (not present)
> 
> Same. Now we certainly need to add some rcu_user_exit() on
> do_async_page_fault(). Although I'm not quite sure when this function
> is called. Is it an exception or an irq?
> 
For KVM_PV_REASON_PAGE_NOT_PRESENT it behaves like an exception.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] Fix abnormal rcu dynticks_nesting values related to async page fault

2012-11-27 Thread Gleb Natapov

On Tue, Nov 27, 2012 at 04:56:30PM +0100, Frederic Weisbecker wrote:
> 2012/11/27 Gleb Natapov :
> > For KVM_PV_REASON_PAGE_NOT_PRESENT it behaves like an exception.
> 
> Ok.
> There seem to be a bug in kvm_async_pf_task_wait(). Using
> idle_cpu(cpu) to find out if the current task is the idle task may not
> work if there is pending wake up. Me may schedule another task but
> when that other task sleeps later we can't schedule back to idle until
> the fault is completed.
> 
> The right way is to use is_idle_task(current)
But if there is pending wake up then scheduling to the waked up task is
exactly what we want.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] Fix abnormal rcu dynticks_nesting values related to async page fault

2012-11-27 Thread Gleb Natapov

On Tue, Nov 27, 2012 at 05:51:12PM +0100, Frederic Weisbecker wrote:
> 2012/11/27 Gleb Natapov :
> > On Tue, Nov 27, 2012 at 04:56:30PM +0100, Frederic Weisbecker wrote:
> >> 2012/11/27 Gleb Natapov :
> >> > For KVM_PV_REASON_PAGE_NOT_PRESENT it behaves like an exception.
> >>
> >> Ok.
> >> There seem to be a bug in kvm_async_pf_task_wait(). Using
> >> idle_cpu(cpu) to find out if the current task is the idle task may not
> >> work if there is pending wake up. Me may schedule another task but
> >> when that other task sleeps later we can't schedule back to idle until
> >> the fault is completed.
> >>
> >> The right way is to use is_idle_task(current)
> > But if there is pending wake up then scheduling to the waked up task is
> > exactly what we want.
> 
> Ok, but what if that task goes to sleep soon after beeing scheduled
> and there is no other task on the runqueue and the page fault has not
> been handled yet? The only thing you can do is to schedule the idle
> task. But the idle task is waiting for the fault completion so you
> can't do that.
Yes, I see now. So even though we have runnable task we can't schedule
away from idle task. Wouldn't the patch below solve Sasha's and Li's
RCU problems then (not even compiled):


diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 4180a87..636800d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -113,7 +113,7 @@ void kvm_async_pf_task_wait(u32 token)
int cpu, idle;
 
cpu = get_cpu();
-   idle = idle_cpu(cpu);
+   idle = is_idle_task(current);
put_cpu();
 
spin_lock(&b->lock);
@@ -247,10 +247,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long 
error_code)
break;
case KVM_PV_REASON_PAGE_NOT_PRESENT:
/* page is swapped out by the host. */
-   rcu_irq_enter();
-   exit_idle();
kvm_async_pf_task_wait((u32)read_cr2());
-   rcu_irq_exit();
break;
case KVM_PV_REASON_PAGE_READY:
rcu_irq_enter();

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] Fix abnormal rcu dynticks_nesting values related to async page fault

2012-11-27 Thread Gleb Natapov

On Tue, Nov 27, 2012 at 06:30:32PM +0100, Frederic Weisbecker wrote:
> 2012/11/27 Gleb Natapov :
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index 4180a87..636800d 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -113,7 +113,7 @@ void kvm_async_pf_task_wait(u32 token)
> > int cpu, idle;
> >
> > cpu = get_cpu();
> > -   idle = idle_cpu(cpu);
> > +   idle = is_idle_task(current);
> 
> I suggest this part goes to a standalone patch.
> 
> > put_cpu();
> >
> > spin_lock(&b->lock);
> > @@ -247,10 +247,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned 
> > long error_code)
> > break;
> > case KVM_PV_REASON_PAGE_NOT_PRESENT:
> > /* page is swapped out by the host. */
> > -   rcu_irq_enter();
> > -   exit_idle();
> > kvm_async_pf_task_wait((u32)read_cr2());
> > -   rcu_irq_exit();
> 
> Hmm, we still need those above around. I believe we just need to add
> rcu_user_exit() in the beginning of that case.
The exception may happen in kernel space too. Is calling rcu_user_exit()
still OK? Also why calling exit_idle() if we are not exiting idle?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] Fix abnormal rcu dynticks_nesting values related to async page fault

2012-11-27 Thread Gleb Natapov

On Tue, Nov 27, 2012 at 07:12:40PM +0100, Frederic Weisbecker wrote:
> 2012/11/27 Gleb Natapov :
> > On Tue, Nov 27, 2012 at 06:30:32PM +0100, Frederic Weisbecker wrote:
> >> 2012/11/27 Gleb Natapov :
> >> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> >> > index 4180a87..636800d 100644
> >> > --- a/arch/x86/kernel/kvm.c
> >> > +++ b/arch/x86/kernel/kvm.c
> >> > @@ -113,7 +113,7 @@ void kvm_async_pf_task_wait(u32 token)
> >> > int cpu, idle;
> >> >
> >> > cpu = get_cpu();
> >> > -   idle = idle_cpu(cpu);
> >> > +   idle = is_idle_task(current);
> >>
> >> I suggest this part goes to a standalone patch.
> >>
> >> > put_cpu();
> >> >
> >> > spin_lock(&b->lock);
> >> > @@ -247,10 +247,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned 
> >> > long error_code)
> >> > break;
> >> > case KVM_PV_REASON_PAGE_NOT_PRESENT:
> >> > /* page is swapped out by the host. */
> >> > -   rcu_irq_enter();
> >> > -   exit_idle();
> >> > kvm_async_pf_task_wait((u32)read_cr2());
> >> > -   rcu_irq_exit();
> >>
> >> Hmm, we still need those above around. I believe we just need to add
> >> rcu_user_exit() in the beginning of that case.
> > The exception may happen in kernel space too. Is calling rcu_user_exit()
> > still OK? Also why calling exit_idle() if we are not exiting idle?
> 
> Yeah, rcu_user_exit() takes care of that. And exit_idle() also checks
> we are really idle before firing the notifier.
> 
> Now we should probably call back enter_idle() before resuming idle if
> needed. We disable irqs before calling enter_idle(). And exit_idle()
> is called from irqs. This way we ensure it's either called before we
> called local_irq_disable() or while the CPU is halt(). This provides
> the guarantee that enter_idle() is always called before the CPU goes
> to sleep. The fact we call exit_idle()  from an exception in idle
> breaks this guarantee. But that's another issue.

What is the semantics of enter_idle()/exit_idle(), what are they used for?
Not present fault happening in idle task does not mean we exit idle
task. If this happens exception handler will execute sti; hlt waiting
for missing page to be ready. Any interrupt happening during this hlt
will do exit_idle() by itself.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2] Add rcu user eqs exception hooks for async page fault

2012-11-28 Thread Gleb Natapov

On Wed, Nov 28, 2012 at 01:55:42PM +0100, Frederic Weisbecker wrote:
> > So I still want to remove it. And later if it shows that we really needs
> > rcu somewhere in this code path, maybe we could use RCU_NONIDLE() to
> > protect it. ( The suspicious RCU usage reported in commit
> > c5e015d4949aa665 seems related to schedule(), which is not in the code
> > path if we are in cpu idle eqs )
> 
> Yes but if rcu_irq_*() calls are fine to be called there, and I
> believe they are because exception_enter() exits the user mode, we
> should start to protect there right now instead of waiting for a
> potential future warning of illegal RCU use.
> 
Async page not present is not much different from regular page fault
exception when it happens not on idle task (regular #PF cannot happen
on idle task), but code have a special handling for idle task. So why
do you think rcu_irq_*() is required here, but not in page fault
handler?

> >
> > I think we still need Gleb's patch about the idle check in
> > kvm_async_pf_task_wait(), and maybe another patch for the
> > exit_idle()/enter_idle() issue.
> 
> Right.
> 
Done.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/3] KVM: x86: let reexecute_instruction work for tdp

2012-11-28 Thread Gleb Natapov

On Wed, Nov 28, 2012 at 11:15:13AM +0800, Xiao Guangrong wrote:
> On 11/28/2012 07:32 AM, Marcelo Tosatti wrote:
> > On Tue, Nov 27, 2012 at 11:13:11AM +0800, Xiao Guangrong wrote:
>  +static bool reexecute_instruction(struct kvm_vcpu *vcpu, unsigned long 
>  cr2)
>   {
>  -gpa_t gpa;
>  +gpa_t gpa = cr2;
>   pfn_t pfn;
> 
>  -if (tdp_enabled)
>  +if (!ACCESS_ONCE(vcpu->kvm->arch.indirect_shadow_pages))
>   return false;
> >>>
> >>> How is indirect_shadow_pages protected? Why is ACCESS_ONCE() being used
> >>> to read it?
> >>
> >> Hi Marcelo,
> >>
> >> It is protected by mmu-lock for it only be changed when mmu-lock is hold. 
> >> And
> >> ACCESS_ONCE is used on read path avoiding magic optimization from compiler.
> > 
> > Please switch to mmu_lock protection, there is no reason to have access
> > to this variable locklessly - not performance critical.
> > 
> > For example, there is no use of barriers when modifying the variable.
> 
> This is not bad, the worst case is, the direct mmu failed to unprotect the 
> shadow
> pages, (meet indirect_shadow_pages = 0, but there has shadow pages being 
> shadowed.),
> after enter to guest, we will go into reexecute_instruction again, then it 
> will
> remove shadow pages.
> 
Isn't the same scenario can happen even with mmu lock around
indirect_shadow_pages access?

> But, i do not have strong opinion on it, i respect your idea! :)
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/3] KVM: x86: improve reexecute_instruction

2012-11-28 Thread Gleb Natapov

On Tue, Nov 27, 2012 at 11:30:24AM +0800, Xiao Guangrong wrote:
> On 11/27/2012 06:41 AM, Marcelo Tosatti wrote:
> 
> >>
> >> -  return false;
> >> +again:
> >> +  page_fault_count = ACCESS_ONCE(vcpu->kvm->arch.page_fault_count);
> >> +
> >> +  /*
> >> +   * if emulation was due to access to shadowed page table
> >> +   * and it failed try to unshadow page and re-enter the
> >> +   * guest to let CPU execute the instruction.
> >> +   */
> >> +  kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
> >> +  emulate = vcpu->arch.mmu.page_fault(vcpu, cr3, PFERR_WRITE_MASK, false);
> > 
> > Can you explain what is the objective here?
> > 
> 
> Sure. :)
> 
> The instruction emulation is caused by fault access on cr3. After unprotect
> the target page, we call vcpu->arch.mmu.page_fault to fix the mapping of cr3.
> if it return 1, mmu can not fix the mapping, we should report the error,
> otherwise it is good to return to guest and let it re-execute the instruction
> again.
> 
> page_fault_count is used to avoid the race on other vcpus, since after we
> unprotect the target page, other cpu can enter page fault path and let the
> page be write-protected again.
> 
> This way can help us to detect all the case that mmu can not be fixed.
> 
Can you write this in a comment above vcpu->arch.mmu.page_fault()?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH 2/2] fix kvm's use of __pa() on percpu areas

2012-11-28 Thread Gleb Natapov

On Thu, Nov 29, 2012 at 12:53:56AM +, Dave Hansen wrote:
> 
> In short, it is illegal to call __pa() on an address holding
> a percpu variable.  The times when this actually matters are
> pretty obscure (certain 32-bit NUMA systems), but it _does_
> happen.  It is important to keep KVM guests working on these
> systems because the real hardware is getting harder and
> harder to find.
> 
> This bug manifested first by me seeing a plain hang at boot
> after this message:
> 
>   CPU 0 irqstacks, hard=f3018000 soft=f301a000
> 
> or, sometimes, it would actually make it out to the console:
> 
> [0.00] BUG: unable to handle kernel paging request at 
> 
> I eventually traced it down to the KVM async pagefault code.
> This can be worked around by disabling that code either at
> compile-time, or on the kernel command-line.
> 
> The kvm async pagefault code was injecting page faults in
> to the guest which the guest misinterpreted because its
> "reason" was not being properly sent from the host.
> 
> The guest passes a physical address of an per-cpu async page
> fault structure via an MSR to the host.  Since __pa() is
> broken on percpu data, the physical address it sent was
> bascially bogus and the host went scribbling on random data.
> The guest never saw the real reason for the page fault (it
> was injected by the host), assumed that the kernel had taken
> a _real_ page fault, and panic()'d.
> 
> Signed-off-by: Dave Hansen 
> ---
> 
>  linux-2.6.git-dave/arch/x86/kernel/kvm.c |9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 
> arch/x86/kernel/kvm.c
> --- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas  
> 2012-11-29 00:39:59.130213376 +
> +++ linux-2.6.git-dave/arch/x86/kernel/kvm.c  2012-11-29 00:51:55.428091802 
> +
> @@ -284,9 +284,9 @@ static void kvm_register_steal_time(void
>  
>   memset(st, 0, sizeof(*st));
>  
> - wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
> + wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
Where is this slow_virt_to_phys() coming from? I do not find it in
kvm.git.

>   printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
> - cpu, __pa(st));
> + cpu, slow_virt_to_phys(st));
>  }
>  
>  static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
> @@ -311,7 +311,7 @@ void __cpuinit kvm_guest_cpu_init(void)
>   return;
>  
>   if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
> - u64 pa = __pa(&__get_cpu_var(apf_reason));
> + u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
>  
>  #ifdef CONFIG_PREEMPT
>   pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> @@ -327,7 +327,8 @@ void __cpuinit kvm_guest_cpu_init(void)
>   /* Size alignment is implied but just to make it explicit. */
>   BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
>   __get_cpu_var(kvm_apic_eoi) = 0;
> - pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
> + pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
> + | KVM_MSR_ENABLED;
>   wrmsrl(MSR_KVM_PV_EOI_EN, pa);
>   }
>  
> _

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH 2/2] fix kvm's use of __pa() on percpu areas

2012-11-29 Thread Gleb Natapov

On Thu, Nov 29, 2012 at 09:47:30AM +0200, Gleb Natapov wrote:
> On Thu, Nov 29, 2012 at 12:53:56AM +, Dave Hansen wrote:
> > 
> > In short, it is illegal to call __pa() on an address holding
> > a percpu variable.  The times when this actually matters are
> > pretty obscure (certain 32-bit NUMA systems), but it _does_
> > happen.  It is important to keep KVM guests working on these
> > systems because the real hardware is getting harder and
> > harder to find.
> > 
> > This bug manifested first by me seeing a plain hang at boot
> > after this message:
> > 
> > CPU 0 irqstacks, hard=f3018000 soft=f301a000
> > 
> > or, sometimes, it would actually make it out to the console:
> > 
> > [0.00] BUG: unable to handle kernel paging request at 
> > 
> > I eventually traced it down to the KVM async pagefault code.
> > This can be worked around by disabling that code either at
> > compile-time, or on the kernel command-line.
> > 
> > The kvm async pagefault code was injecting page faults in
> > to the guest which the guest misinterpreted because its
> > "reason" was not being properly sent from the host.
> > 
> > The guest passes a physical address of an per-cpu async page
> > fault structure via an MSR to the host.  Since __pa() is
> > broken on percpu data, the physical address it sent was
> > bascially bogus and the host went scribbling on random data.
> > The guest never saw the real reason for the page fault (it
> > was injected by the host), assumed that the kernel had taken
> > a _real_ page fault, and panic()'d.
> > 
> > Signed-off-by: Dave Hansen 
> > ---
> > 
> >  linux-2.6.git-dave/arch/x86/kernel/kvm.c |9 +
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> > 
> > diff -puN arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas 
> > arch/x86/kernel/kvm.c
> > --- linux-2.6.git/arch/x86/kernel/kvm.c~fix-kvm-__pa-use-on-percpu-areas
> > 2012-11-29 00:39:59.130213376 +
> > +++ linux-2.6.git-dave/arch/x86/kernel/kvm.c2012-11-29 
> > 00:51:55.428091802 +
> > @@ -284,9 +284,9 @@ static void kvm_register_steal_time(void
> >  
> > memset(st, 0, sizeof(*st));
> >  
> > -   wrmsrl(MSR_KVM_STEAL_TIME, (__pa(st) | KVM_MSR_ENABLED));
> > +   wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
> Where is this slow_virt_to_phys() coming from? I do not find it in
> kvm.git.
> 
Disregard this.  1/2 didn't make it to my mailbox.

> > printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
> > -   cpu, __pa(st));
> > +   cpu, slow_virt_to_phys(st));
> >  }
> >  
> >  static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
> > @@ -311,7 +311,7 @@ void __cpuinit kvm_guest_cpu_init(void)
> > return;
> >  
> > if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF) && kvmapf) {
> > -   u64 pa = __pa(&__get_cpu_var(apf_reason));
> > +   u64 pa = slow_virt_to_phys(&__get_cpu_var(apf_reason));
> >  
> >  #ifdef CONFIG_PREEMPT
> > pa |= KVM_ASYNC_PF_SEND_ALWAYS;
> > @@ -327,7 +327,8 @@ void __cpuinit kvm_guest_cpu_init(void)
> > /* Size alignment is implied but just to make it explicit. */
> > BUILD_BUG_ON(__alignof__(kvm_apic_eoi) < 4);
> > __get_cpu_var(kvm_apic_eoi) = 0;
> > -   pa = __pa(&__get_cpu_var(kvm_apic_eoi)) | KVM_MSR_ENABLED;
> > +   pa = slow_virt_to_phys(&__get_cpu_var(kvm_apic_eoi))
> > +   | KVM_MSR_ENABLED;
> > wrmsrl(MSR_KVM_PV_EOI_EN, pa);
> > }
> >  
> > _
> 
> --
>   Gleb.
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: mailto:"d...@kvack.org";> em...@kvack.org 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2] Add rcu user eqs exception hooks for async page fault

2012-11-29 Thread Gleb Natapov

On Wed, Nov 28, 2012 at 03:25:07PM +0100, Frederic Weisbecker wrote:
> 2012/11/28 Gleb Natapov :
> > On Wed, Nov 28, 2012 at 01:55:42PM +0100, Frederic Weisbecker wrote:
> >> Yes but if rcu_irq_*() calls are fine to be called there, and I
> >> believe they are because exception_enter() exits the user mode, we
> >> should start to protect there right now instead of waiting for a
> >> potential future warning of illegal RCU use.
> >>
> > Async page not present is not much different from regular page fault
> > exception when it happens not on idle task (regular #PF cannot happen
> > on idle task), but code have a special handling for idle task. So why
> > do you think rcu_irq_*() is required here, but not in page fault
> > handler?
> 
> Because we are not supposed to fault in idle, at least I hope there
> are no case around. Except on cases like here with KVM I guess but we
> have that special async handler for that.
> 
As far as I understand rcu_irq_enter() should be called before entering
the mode in which read-side critical sections can occur, but async page
fault does not uses RCU in case of faulting in idle. Actually, looking
closer, it may call kfree() if page ready notification arrives ahead
of not present notification (with todays KVM it cannot happen) and we
have to assume that kfree() uses rcu read even if it currently does not.
So may be it is possible to move rcu_irq_*() calls to be around unlikely
kfree() call only.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V3 RFC 2/2] kvm: Handle yield_to failure return code for potential undercommit case

2012-11-29 Thread Gleb Natapov

On Wed, Nov 28, 2012 at 10:40:56AM +0530, Raghavendra K T wrote:
> On 11/28/2012 06:42 AM, Marcelo Tosatti wrote:
> >
> >Don't understand the reasoning behind why 3 is a good choice.
> 
> Here is where I came from. (explaining from scratch for
> completeness, forgive me :))
> In moderate overcommits, we can falsely exit from ple handler even when
> we have preempted task of same VM waiting on other cpus. To reduce this
> problem, we try few times before exiting.
> The problem boils down to:
> what is the probability that we exit ple handler even when we have more
> than 1 task in other cpus. Theoretical worst case should be around 1.5x
> overcommit (As also pointed by Andrew Theurer). [But practical
> worstcase may be around 2x,3x overcommits as indicated by the results
> for the patch series]
> 
> So if p is the probability of finding rq length one on a particular cpu,
> and if we do n tries, then probability of exiting ple handler is:
> 
>  p^(n+1) [ because we would have come across one source with rq length
> 1 and n target cpu rqs  with length 1 ]
> 
> so
> num tries: probability of aborting ple handler (1.5x overcommit)
>  1 1/4
>  2 1/8
>  3 1/16
> 
> We can increase this probability with more tries, but the problem is
> the overhead.
IIRC Avi (again) had an idea to track vcpu preemption. When vcpu thread
is preempted we do kvm->preempted_vcpus++, when it runs again we do
kvm->preempted_vcpus--. PLE handler can try harder if kvm->preempted_vcpus
is big or do not try at all if it is zero.

> Also, If we have tried three times that means we would have iterated
> over 3 good eligible vcpus along with many non-eligible candidates. In
> worst case if we iterate all the vcpus, we reduce 1x performance and
> overcommit performance get hit. [ as in results ].
> 
> I have tried num_tries = 1,2,3 and n already ( not 4 yet). So I
> concluded 3 is enough.
> 
> Infact I have also run kernbench and hackbench which are giving 5-20%
> improvement.
> 
> [ As a side note , I also thought how about having num_tries = f(n) =
> ceil ( log(num_online_cpus)/2 ) But I thought calculation is too much
> overhead and also there is no point in probably making it dependent on
> online cpus ]
> 
> Please let me know if you are happy with this rationale/ or correct me
> if you foresee some problem. (Infact Avi, Rik's concern about false
> exiting made me arrive at 'try' logic which I did not have earlier).
> 
> I am currently trying out the result for 1.5x overcommit will post the
> result.
> 
> >
> >On Mon, Nov 26, 2012 at 05:38:04PM +0530, Raghavendra K T wrote:
> >>From: Raghavendra K T 
> >>
> >>yield_to returns -ESRCH, When source and target of yield_to
> >>run queue length is one. When we see three successive failures of
> >>yield_to we assume we are in potential undercommit case and abort
> >>from PLE handler.
> >>The assumption is backed by low probability of wrong decision
> >>for even worst case scenarios such as average runqueue length
> >>between 1 and 2.
> >>
> >>note that we do not update last boosted vcpu in failure cases.
> >>Thank Avi for raising question on aborting after first fail from yield_to.
> >>
> >>Reviewed-by: Srikar Dronamraju 
> >>Signed-off-by: Raghavendra K T 
> [...]

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2] Add rcu user eqs exception hooks for async page fault

2012-11-29 Thread Gleb Natapov

On Thu, Nov 29, 2012 at 03:40:05PM +0100, Frederic Weisbecker wrote:
> 2012/11/29 Li Zhong :
> > On Wed, 2012-11-28 at 13:55 +0100, Frederic Weisbecker wrote:
> >> > With rcu_user_exit() at the beginning, now rcu_irq_enter() only protects
> >> > the cpu idle eqs, but it's not good to call rcu_irq_exit() after the cpu
> >> > halt and the page ready.
> >>
> >> Hmm, why is it not good?
> >
> > I think in this case, as cpu halts and waits for page ready, it is
> > really in idle, and it's better for rcu to think it as rcu idle. But
> > with rcu_irq_enter(), the rcu would not think this cpu as idle. And the
> > rcu_irq_exit() is only called after this idle period to mark this cpu as
> > rcu idle again.
> 
> 
> Indeed. How about calling rcu_irq_exit() before native_safe_halt() and
> rcu_irq_enter() right after?
> 
We can't. If exception happens in the middle of rcu read section while
preemption is disabled then calling rcu_irq_exit() before halt is
incorrect. We can do that only if exception happen during idle and this
should be rare enough for us to not care.

> >> > So I still want to remove it. And later if it shows that we really needs
> >> > rcu somewhere in this code path, maybe we could use RCU_NONIDLE() to
> >> > protect it. ( The suspicious RCU usage reported in commit
> >> > c5e015d4949aa665 seems related to schedule(), which is not in the code
> >> > path if we are in cpu idle eqs )
> >>
> >> Yes but if rcu_irq_*() calls are fine to be called there, and I
> >> believe they are because exception_enter() exits the user mode, we
> >> should start to protect there right now instead of waiting for a
> >> potential future warning of illegal RCU use.
> >
> > I agree, but I think by only protecting the necessary code avoids
> > marking the entire waiting period as rcu non idle.
> 
> If we use RCU_NONIDLE(), this assume we need to check all the code
> there deeply for potential RCU uses and ensure it will never be
> extended later to use RCU. We really don't scale enough for that.
> There are too much subsystems involved there: waitqueue, spinlocks,
> slab, per cpu, etc...
> 
> So I strongly suggest we use rcu_irq_*() APIs, and relax them around
> native_safe_halt() like I suggested above. This seems to me the safest
> solution.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v2] Add rcu user eqs exception hooks for async page fault

2012-11-30 Thread Gleb Natapov

On Fri, Nov 30, 2012 at 04:36:46PM +0800, Li Zhong wrote:
> On Thu, 2012-11-29 at 19:25 +0200, Gleb Natapov wrote:
> > On Thu, Nov 29, 2012 at 03:40:05PM +0100, Frederic Weisbecker wrote:
> > > 2012/11/29 Li Zhong :
> > > > On Wed, 2012-11-28 at 13:55 +0100, Frederic Weisbecker wrote:
> > > >> > With rcu_user_exit() at the beginning, now rcu_irq_enter() only 
> > > >> > protects
> > > >> > the cpu idle eqs, but it's not good to call rcu_irq_exit() after the 
> > > >> > cpu
> > > >> > halt and the page ready.
> > > >>
> > > >> Hmm, why is it not good?
> > > >
> > > > I think in this case, as cpu halts and waits for page ready, it is
> > > > really in idle, and it's better for rcu to think it as rcu idle. But
> > > > with rcu_irq_enter(), the rcu would not think this cpu as idle. And the
> > > > rcu_irq_exit() is only called after this idle period to mark this cpu as
> > > > rcu idle again.
> > > 
> > > 
> > > Indeed. How about calling rcu_irq_exit() before native_safe_halt() and
> > > rcu_irq_enter() right after?
> > > 
> > We can't. If exception happens in the middle of rcu read section while
> > preemption is disabled then calling rcu_irq_exit() before halt is
> > incorrect. We can do that only if exception happen during idle and this
> > should be rare enough for us to not care.
> 
> If I understand correctly, maybe it doesn't cause problems.
> 
> I guess you mean in other(not idle) task's rcu read critical section, we
> get an async page fault? In this case the rcu_idle_exit() won't make
> this cpu to enter rcu idle state. As the task environment is rcu non
> idle. This is the case we use rcu_irq_enter()/rcu_irq_exit() in code
> path where it might be in rcu idle or rcu non idle, and they don't
> change the rcu non-idle state if used in rcu non-idle state.
> 
> And it is also safe if it is rcu read critical section in idle, as it is
> most probably wrapped in an outer rcu_irq_enter/exit(), so we have 
> 
> rcu idle
>   rcu_irq_enter() - exit rcu idle, to protect the read section
> exception
> rcu_irq_enter() in the page not present path
>   rcu_irq_exit() before halt 
> 
Yes, I was thinking about this scenario and you are right, since
rcu_irq_enter() will be nested it should be safe to call rcu_irq_exit
before halt.

> so the halt is also safe here, as we have two rcu_irq_enter() and one
> rcu_irq_exit().
> 
> I agree with Frederic that this is the safest and simplest way now, and
> will try to update the code according to it.
> 
> Thanks, Zhong
> 
> > > >> > So I still want to remove it. And later if it shows that we really 
> > > >> > needs
> > > >> > rcu somewhere in this code path, maybe we could use RCU_NONIDLE() to
> > > >> > protect it. ( The suspicious RCU usage reported in commit
> > > >> > c5e015d4949aa665 seems related to schedule(), which is not in the 
> > > >> > code
> > > >> > path if we are in cpu idle eqs )
> > > >>
> > > >> Yes but if rcu_irq_*() calls are fine to be called there, and I
> > > >> believe they are because exception_enter() exits the user mode, we
> > > >> should start to protect there right now instead of waiting for a
> > > >> potential future warning of illegal RCU use.
> > > >
> > > > I agree, but I think by only protecting the necessary code avoids
> > > > marking the entire waiting period as rcu non idle.
> > > 
> > > If we use RCU_NONIDLE(), this assume we need to check all the code
> > > there deeply for potential RCU uses and ensure it will never be
> > > extended later to use RCU. We really don't scale enough for that.
> > > There are too much subsystems involved there: waitqueue, spinlocks,
> > > slab, per cpu, etc...
> > > 
> > > So I strongly suggest we use rcu_irq_*() APIs, and relax them around
> > > native_safe_halt() like I suggested above. This seems to me the safest
> > > solution.
> > 
> > --
> > Gleb.
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 
> 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH v3] Add rcu user eqs exception hooks for async page fault

2012-11-30 Thread Gleb Natapov

On Fri, Nov 30, 2012 at 05:18:41PM +0800, Li Zhong wrote:
> This patch adds user eqs exception hooks for async page fault page not
> present code path, to exit the user eqs and re-enter it as necessary. 
> 
> Async page fault is different from other exceptions that it may be
> triggered from idle process, so we still need rcu_irq_enter() and
> rcu_irq_exit() to exit cpu idle eqs when needed, to protect the code
> that needs use rcu.
> 
> As Frederic pointed out it would be safest and simplest to protect the
> whole kvm_async_pf_task_wait(). Otherwise, "we need to check all the
> code there deeply for potential RCU uses and ensure it will never be
> extended later to use RCU.".
> 
> However, We'd better re-enter the cpu idle eqs if we get the exception
> in cpu idle eqs, by calling rcu_irq_exit() before native_safe_halt(). 
> 
> So the patch does what Frederic suggested for rcu_irq_*() API usage
> here, except that I moved the rcu_irq_*() pair originally in
> do_async_page_fault() into kvm_async_pf_task_wait(). 
> 
> That's because, I think it's better to have rcu_irq_*() pairs to be in
> one function ( rcu_irq_exit() after rcu_irq_enter() ), especially here,
> kvm_async_pf_task_wait() has other callers, which might cause
> rcu_irq_exit() be called without a matching rcu_irq_enter() before it,
> which is illegal if the cpu happens to be in rcu idle state. 
> 
> Signed-off-by: Li Zhong 
> ---
>  arch/x86/kernel/kvm.c | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 4180a87..342b00b 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static int kvmapf = 1;
>  
> @@ -112,6 +113,8 @@ void kvm_async_pf_task_wait(u32 token)
>   DEFINE_WAIT(wait);
>   int cpu, idle;
>  
> + rcu_irq_enter();
> +
Why move rcu_irq_*() calls into kvm_async_pf_task_wait()?

>   cpu = get_cpu();
>   idle = idle_cpu(cpu);
>   put_cpu();
> @@ -123,6 +126,8 @@ void kvm_async_pf_task_wait(u32 token)
>   hlist_del(&e->link);
>   kfree(e);
>   spin_unlock(&b->lock);
> +
> + rcu_irq_exit();
We can skip that if  rcu_irq_*() will stay outside.

>   return;
>   }
>  
> @@ -147,13 +152,16 @@ void kvm_async_pf_task_wait(u32 token)
>   /*
>* We cannot reschedule. So halt.
>*/
> + rcu_irq_exit();
>   native_safe_halt();
> + rcu_irq_enter();
>   local_irq_disable();
>   }
>   }
>   if (!n.halted)
>   finish_wait(&n.wq, &wait);
>  
> + rcu_irq_exit();
>   return;
>  }
>  EXPORT_SYMBOL_GPL(kvm_async_pf_task_wait);
> @@ -247,10 +255,10 @@ do_async_page_fault(struct pt_regs *regs, unsigned long 
> error_code)
>   break;
>   case KVM_PV_REASON_PAGE_NOT_PRESENT:
>   /* page is swapped out by the host. */
> - rcu_irq_enter();
> + exception_enter(regs);
>   exit_idle();
>   kvm_async_pf_task_wait((u32)read_cr2());
> - rcu_irq_exit();
> + exception_exit(regs);
>   break;
>   case KVM_PV_REASON_PAGE_READY:
>   rcu_irq_enter();
> -- 
> 1.7.11.4

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes

2013-03-18 Thread Gleb Natapov

On Mon, Mar 18, 2013 at 04:08:50PM +0800, Xiao Guangrong wrote:
> On 03/17/2013 11:02 PM, Gleb Natapov wrote:
> > On Fri, Mar 15, 2013 at 11:29:53PM +0800, Xiao Guangrong wrote:
> >> This patch tries to introduce a very simple and scale way to invalid all
> >> mmio sptes - it need not walk any shadow pages and hold mmu-lock
> >>
> >> KVM maintains a global mmio invalid generation-number which is stored in
> >> kvm->arch.mmio_invalid_gen and every mmio spte stores the current global
> >> generation-number into his available bits when it is created
> >>
> >> When KVM need zap all mmio sptes, it just simply increase the global
> >> generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
> >> then it walks the shadow page table and get the mmio spte. If the
> >> generation-number on the spte does not equal the global generation-number,
> >> it will go to the normal #PF handler to update the mmio spte
> >>
> >> Since 19 bits are used to store generation-number on mmio spte, the
> >> generation-number can be round after 33554432 times. It is large enough
> >> for nearly all most cases, but making the code be more strong, we zap all
> >> shadow pages when the number is round
> >>
> > Very nice idea, but why drop Takuya patches instead of using
> > kvm_mmu_zap_mmio_sptes() when generation number overflows.
> 
> I am not sure whether it is still needed. Requesting to zap all mmio sptes for
> more than 50 times is really really rare, it nearly does not happen.
> (By the way, 33554432 is wrong in the changelog, i just copy that for my 
> origin
> implantation.) And, after my patch optimizing zapping all shadow pages,
> zap-all-sps should not be a problem anymore since it does not take too much 
> lock
> time.
> 
> Your idea?
> 
I expect 50 to become less since I already had plans to store some
information in mmio spte. Even if all zap-all-sptes becomes faster we
still needlessly zap all sptes while we can zap only mmio.

> > 
> > 
> >> Signed-off-by: Xiao Guangrong 
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |2 +
> >>  arch/x86/kvm/mmu.c  |   61 
> >> +--
> >>  arch/x86/kvm/mmutrace.h |   17 +++
> >>  arch/x86/kvm/paging_tmpl.h  |7 +++-
> >>  arch/x86/kvm/vmx.c  |4 ++
> >>  arch/x86/kvm/x86.c  |6 +--
> >>  6 files changed, 82 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h 
> >> b/arch/x86/include/asm/kvm_host.h
> >> index ef7f4a5..572398e 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -529,6 +529,7 @@ struct kvm_arch {
> >>unsigned int n_requested_mmu_pages;
> >>unsigned int n_max_mmu_pages;
> >>unsigned int indirect_shadow_pages;
> >> +  unsigned int mmio_invalid_gen;
> > Why invalid? Should be mmio_valid_gen or mmio_current_get.
> 
> mmio_invalid_gen is only updated in kvm_mmu_invalidate_mmio_sptes,
> so i named it as _invalid_. But mmio_valid_gen is good for me.
> 
It holds currently valid value though, so calling it "invalid" is
confusing.

> > 
> >>struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> >>/*
> >> * Hash table of struct kvm_mmu_page.
> >> @@ -765,6 +766,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
> >> int slot);
> >>  void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> >> struct kvm_memory_slot *slot,
> >> gfn_t gfn_offset, unsigned long mask);
> >> +void kvm_mmu_invalid_mmio_spte(struct kvm *kvm);
> > Agree with Takuya that kvm_mmu_invalidate_mmio_sptes() is a better name.
> 
> Me too.
> 
> > 
> >>  void kvm_mmu_zap_all(struct kvm *kvm);
> >>  unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
> >>  void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int 
> >> kvm_nr_mmu_pages);
> >> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> >> index 13626f4..7093a92 100644
> >> --- a/arch/x86/kvm/mmu.c
> >> +++ b/arch/x86/kvm/mmu.c
> >> @@ -234,12 +234,13 @@ static unsigned int get_mmio_spte_generation(u64 
> >> spte)
> >>  static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
> >>   unsigned access)
> >>  {
> >> -  u64 mask = generation_mmio_spte_mask

Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes

2013-03-18 Thread Gleb Natapov

On Mon, Mar 18, 2013 at 08:29:29PM +0800, Xiao Guangrong wrote:
> On 03/18/2013 05:13 PM, Gleb Natapov wrote:
> > On Mon, Mar 18, 2013 at 04:08:50PM +0800, Xiao Guangrong wrote:
> >> On 03/17/2013 11:02 PM, Gleb Natapov wrote:
> >>> On Fri, Mar 15, 2013 at 11:29:53PM +0800, Xiao Guangrong wrote:
> >>>> This patch tries to introduce a very simple and scale way to invalid all
> >>>> mmio sptes - it need not walk any shadow pages and hold mmu-lock
> >>>>
> >>>> KVM maintains a global mmio invalid generation-number which is stored in
> >>>> kvm->arch.mmio_invalid_gen and every mmio spte stores the current global
> >>>> generation-number into his available bits when it is created
> >>>>
> >>>> When KVM need zap all mmio sptes, it just simply increase the global
> >>>> generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
> >>>> then it walks the shadow page table and get the mmio spte. If the
> >>>> generation-number on the spte does not equal the global 
> >>>> generation-number,
> >>>> it will go to the normal #PF handler to update the mmio spte
> >>>>
> >>>> Since 19 bits are used to store generation-number on mmio spte, the
> >>>> generation-number can be round after 33554432 times. It is large enough
> >>>> for nearly all most cases, but making the code be more strong, we zap all
> >>>> shadow pages when the number is round
> >>>>
> >>> Very nice idea, but why drop Takuya patches instead of using
> >>> kvm_mmu_zap_mmio_sptes() when generation number overflows.
> >>
> >> I am not sure whether it is still needed. Requesting to zap all mmio sptes 
> >> for
> >> more than 50 times is really really rare, it nearly does not happen.
> >> (By the way, 33554432 is wrong in the changelog, i just copy that for my 
> >> origin
> >> implantation.) And, after my patch optimizing zapping all shadow pages,
> >> zap-all-sps should not be a problem anymore since it does not take too 
> >> much lock
> >> time.
> >>
> >> Your idea?
> >>
> > I expect 50 to become less since I already had plans to store some
> 
> Interesting, just curious, what are the plans? ;)
> 
Currently we uses pio to signal that work is pending to virtio devices. The
requirement is that signaling should be fast and PIO is fast since there
is not need to emulate instruction. PCIE though is not really designed
with PIO in mind, so we will have to use MMIO to do signaling. To avoid
instruction emulation I thought about making guest access these devices using
predefined variety of MOV instruction so that emulation can be skipped.
The idea is to mark mmio spte to know that emulation is not needed.

> > information in mmio spte. Even if all zap-all-sptes becomes faster we
> > still needlessly zap all sptes while we can zap only mmio.
> 
> Okay.
> 
> > 
> >>>
> >>>
> >>>> Signed-off-by: Xiao Guangrong 
> >>>> ---
> >>>>  arch/x86/include/asm/kvm_host.h |2 +
> >>>>  arch/x86/kvm/mmu.c  |   61 
> >>>> +--
> >>>>  arch/x86/kvm/mmutrace.h |   17 +++
> >>>>  arch/x86/kvm/paging_tmpl.h  |7 +++-
> >>>>  arch/x86/kvm/vmx.c  |4 ++
> >>>>  arch/x86/kvm/x86.c  |6 +--
> >>>>  6 files changed, 82 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/kvm_host.h 
> >>>> b/arch/x86/include/asm/kvm_host.h
> >>>> index ef7f4a5..572398e 100644
> >>>> --- a/arch/x86/include/asm/kvm_host.h
> >>>> +++ b/arch/x86/include/asm/kvm_host.h
> >>>> @@ -529,6 +529,7 @@ struct kvm_arch {
> >>>>  unsigned int n_requested_mmu_pages;
> >>>>  unsigned int n_max_mmu_pages;
> >>>>  unsigned int indirect_shadow_pages;
> >>>> +unsigned int mmio_invalid_gen;
> >>> Why invalid? Should be mmio_valid_gen or mmio_current_get.
> >>
> >> mmio_invalid_gen is only updated in kvm_mmu_invalidate_mmio_sptes,
> >> so i named it as _invalid_. But mmio_valid_gen is good for me.
> >>
> > It holds currently valid value though, so calling it "invalid" is
> > confusing.
> 
> I agree.
> 
> > 
> >>>
>

Re: [PATCH 4/5] KVM: MMU: store generation-number into mmio spte

2013-03-18 Thread Gleb Natapov

On Mon, Mar 18, 2013 at 08:42:09PM +0800, Xiao Guangrong wrote:
> On 03/18/2013 07:19 PM, Paolo Bonzini wrote:
> > Il 15/03/2013 16:29, Xiao Guangrong ha scritto:
> >> +/*
> >> + * spte bits of bit 3 ~ bit 11 are used as low 9 bits of
> >> + * generation, the bits of bits 52 ~ bit 61 are used as
> >> + * high 12 bits of generation.
> >> + */
> > 
> > High 10 bits.
> 
> Yes, i forgot to update the comments! Thank you, Paolo!
> 
> > 
> > How often does the generation number change?  Especially with Takuya's
> > patches, using just 10 bits for the generation might be enough and
> > simplifies the code.
> 
> I guess it's only updated when guest is initializing and doing hotplug.
> In my origin thought, if the number is big enough that it could not
> round on all most case of hotplug, the things introduced by Takuya is
> not necessary. But, if you guys want to keep the things, 10 bits should
> be enough. :)
I think you encapsulated the 19bit generation number handling good enough, so
I personally do not mind making it as big as we can.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes

2013-03-18 Thread Gleb Natapov

On Mon, Mar 18, 2013 at 09:09:43PM +0800, Xiao Guangrong wrote:
> On 03/18/2013 08:46 PM, Gleb Natapov wrote:
> > On Mon, Mar 18, 2013 at 08:29:29PM +0800, Xiao Guangrong wrote:
> >> On 03/18/2013 05:13 PM, Gleb Natapov wrote:
> >>> On Mon, Mar 18, 2013 at 04:08:50PM +0800, Xiao Guangrong wrote:
> >>>> On 03/17/2013 11:02 PM, Gleb Natapov wrote:
> >>>>> On Fri, Mar 15, 2013 at 11:29:53PM +0800, Xiao Guangrong wrote:
> >>>>>> This patch tries to introduce a very simple and scale way to invalid 
> >>>>>> all
> >>>>>> mmio sptes - it need not walk any shadow pages and hold mmu-lock
> >>>>>>
> >>>>>> KVM maintains a global mmio invalid generation-number which is stored 
> >>>>>> in
> >>>>>> kvm->arch.mmio_invalid_gen and every mmio spte stores the current 
> >>>>>> global
> >>>>>> generation-number into his available bits when it is created
> >>>>>>
> >>>>>> When KVM need zap all mmio sptes, it just simply increase the global
> >>>>>> generation-number. When guests do mmio access, KVM intercepts a MMIO 
> >>>>>> #PF
> >>>>>> then it walks the shadow page table and get the mmio spte. If the
> >>>>>> generation-number on the spte does not equal the global 
> >>>>>> generation-number,
> >>>>>> it will go to the normal #PF handler to update the mmio spte
> >>>>>>
> >>>>>> Since 19 bits are used to store generation-number on mmio spte, the
> >>>>>> generation-number can be round after 33554432 times. It is large enough
> >>>>>> for nearly all most cases, but making the code be more strong, we zap 
> >>>>>> all
> >>>>>> shadow pages when the number is round
> >>>>>>
> >>>>> Very nice idea, but why drop Takuya patches instead of using
> >>>>> kvm_mmu_zap_mmio_sptes() when generation number overflows.
> >>>>
> >>>> I am not sure whether it is still needed. Requesting to zap all mmio 
> >>>> sptes for
> >>>> more than 50 times is really really rare, it nearly does not happen.
> >>>> (By the way, 33554432 is wrong in the changelog, i just copy that for my 
> >>>> origin
> >>>> implantation.) And, after my patch optimizing zapping all shadow pages,
> >>>> zap-all-sps should not be a problem anymore since it does not take too 
> >>>> much lock
> >>>> time.
> >>>>
> >>>> Your idea?
> >>>>
> >>> I expect 50 to become less since I already had plans to store some
> >>
> >> Interesting, just curious, what are the plans? ;)
> >>
> > Currently we uses pio to signal that work is pending to virtio devices. The
> > requirement is that signaling should be fast and PIO is fast since there
> > is not need to emulate instruction. PCIE though is not really designed
> > with PIO in mind, so we will have to use MMIO to do signaling. To avoid
> > instruction emulation I thought about making guest access these devices 
> > using
> > predefined variety of MOV instruction so that emulation can be skipped.
> > The idea is to mark mmio spte to know that emulation is not needed.
> 
> How to know page-fault is caused by the predefined instruction?
> 
Only predefined phys address rages will be accessed that way. If page
fault is in a range we assume the knows instruction is used.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes

2013-03-18 Thread Gleb Natapov

On Mon, Mar 18, 2013 at 09:25:10PM +0800, Xiao Guangrong wrote:
> On 03/18/2013 09:19 PM, Gleb Natapov wrote:
> > On Mon, Mar 18, 2013 at 09:09:43PM +0800, Xiao Guangrong wrote:
> >> On 03/18/2013 08:46 PM, Gleb Natapov wrote:
> >>> On Mon, Mar 18, 2013 at 08:29:29PM +0800, Xiao Guangrong wrote:
> >>>> On 03/18/2013 05:13 PM, Gleb Natapov wrote:
> >>>>> On Mon, Mar 18, 2013 at 04:08:50PM +0800, Xiao Guangrong wrote:
> >>>>>> On 03/17/2013 11:02 PM, Gleb Natapov wrote:
> >>>>>>> On Fri, Mar 15, 2013 at 11:29:53PM +0800, Xiao Guangrong wrote:
> >>>>>>>> This patch tries to introduce a very simple and scale way to invalid 
> >>>>>>>> all
> >>>>>>>> mmio sptes - it need not walk any shadow pages and hold mmu-lock
> >>>>>>>>
> >>>>>>>> KVM maintains a global mmio invalid generation-number which is 
> >>>>>>>> stored in
> >>>>>>>> kvm->arch.mmio_invalid_gen and every mmio spte stores the current 
> >>>>>>>> global
> >>>>>>>> generation-number into his available bits when it is created
> >>>>>>>>
> >>>>>>>> When KVM need zap all mmio sptes, it just simply increase the global
> >>>>>>>> generation-number. When guests do mmio access, KVM intercepts a MMIO 
> >>>>>>>> #PF
> >>>>>>>> then it walks the shadow page table and get the mmio spte. If the
> >>>>>>>> generation-number on the spte does not equal the global 
> >>>>>>>> generation-number,
> >>>>>>>> it will go to the normal #PF handler to update the mmio spte
> >>>>>>>>
> >>>>>>>> Since 19 bits are used to store generation-number on mmio spte, the
> >>>>>>>> generation-number can be round after 33554432 times. It is large 
> >>>>>>>> enough
> >>>>>>>> for nearly all most cases, but making the code be more strong, we 
> >>>>>>>> zap all
> >>>>>>>> shadow pages when the number is round
> >>>>>>>>
> >>>>>>> Very nice idea, but why drop Takuya patches instead of using
> >>>>>>> kvm_mmu_zap_mmio_sptes() when generation number overflows.
> >>>>>>
> >>>>>> I am not sure whether it is still needed. Requesting to zap all mmio 
> >>>>>> sptes for
> >>>>>> more than 50 times is really really rare, it nearly does not 
> >>>>>> happen.
> >>>>>> (By the way, 33554432 is wrong in the changelog, i just copy that for 
> >>>>>> my origin
> >>>>>> implantation.) And, after my patch optimizing zapping all shadow pages,
> >>>>>> zap-all-sps should not be a problem anymore since it does not take too 
> >>>>>> much lock
> >>>>>> time.
> >>>>>>
> >>>>>> Your idea?
> >>>>>>
> >>>>> I expect 50 to become less since I already had plans to store some
> >>>>
> >>>> Interesting, just curious, what are the plans? ;)
> >>>>
> >>> Currently we uses pio to signal that work is pending to virtio devices. 
> >>> The
> >>> requirement is that signaling should be fast and PIO is fast since there
> >>> is not need to emulate instruction. PCIE though is not really designed
> >>> with PIO in mind, so we will have to use MMIO to do signaling. To avoid
> >>> instruction emulation I thought about making guest access these devices 
> >>> using
> >>> predefined variety of MOV instruction so that emulation can be skipped.
> >>> The idea is to mark mmio spte to know that emulation is not needed.
> >>
> >> How to know page-fault is caused by the predefined instruction?
> >>
> > Only predefined phys address rages will be accessed that way. If page
> > fault is in a range we assume the knows instruction is used.
> 
> That means the access can be identified by the gfn, why need cache
> other things into mmio spte?
Two not search through memory ranges on each access.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes

2013-03-19 Thread Gleb Natapov

On Tue, Mar 19, 2013 at 11:15:36AM +0800, Xiao Guangrong wrote:
> On 03/19/2013 06:16 AM, Eric Northup wrote:
> > On Fri, Mar 15, 2013 at 8:29 AM, Xiao Guangrong
> >  wrote:
> >> This patch tries to introduce a very simple and scale way to invalid all
> >> mmio sptes - it need not walk any shadow pages and hold mmu-lock
> >>
> >> KVM maintains a global mmio invalid generation-number which is stored in
> >> kvm->arch.mmio_invalid_gen and every mmio spte stores the current global
> >> generation-number into his available bits when it is created
> >>
> >> When KVM need zap all mmio sptes, it just simply increase the global
> >> generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
> >> then it walks the shadow page table and get the mmio spte. If the
> >> generation-number on the spte does not equal the global generation-number,
> >> it will go to the normal #PF handler to update the mmio spte
> >>
> >> Since 19 bits are used to store generation-number on mmio spte, the
> >> generation-number can be round after 33554432 times. It is large enough
> >> for nearly all most cases, but making the code be more strong, we zap all
> >> shadow pages when the number is round
> >>
> >> Signed-off-by: Xiao Guangrong 
> >> ---
> >>  arch/x86/include/asm/kvm_host.h |2 +
> >>  arch/x86/kvm/mmu.c  |   61 
> >> +--
> >>  arch/x86/kvm/mmutrace.h |   17 +++
> >>  arch/x86/kvm/paging_tmpl.h  |7 +++-
> >>  arch/x86/kvm/vmx.c  |4 ++
> >>  arch/x86/kvm/x86.c  |6 +--
> >>  6 files changed, 82 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/kvm_host.h 
> >> b/arch/x86/include/asm/kvm_host.h
> >> index ef7f4a5..572398e 100644
> >> --- a/arch/x86/include/asm/kvm_host.h
> >> +++ b/arch/x86/include/asm/kvm_host.h
> >> @@ -529,6 +529,7 @@ struct kvm_arch {
> >> unsigned int n_requested_mmu_pages;
> >> unsigned int n_max_mmu_pages;
> >> unsigned int indirect_shadow_pages;
> >> +   unsigned int mmio_invalid_gen;
> > 
> > Could this get initialized to something close to the wrap-around
> > value, so that the wrap-around case gets more real-world coverage?
> 
> I am afraid we can not. We cache the current mmio_invalid_gen into mmio spte 
> when
> it is created no matter what the initiation value is.
> 
> If you have a better way, please show me. ;)
> 
The idea is to initialize mmio_invalid_gen to value close to MAX_GEN in
order to exercise 

+   if (unlikely(kvm->arch.mmio_invalid_gen++ == MAX_GEN)) {
+   kvm->arch.mmio_invalid_gen = 0;
+   return kvm_mmu_zap_all(kvm);
+   }

path more often.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: correctly initialize the CS base on reset

2013-03-19 Thread Gleb Natapov

On Tue, Mar 19, 2013 at 04:30:26PM +0100, Paolo Bonzini wrote:
> The CS base was initialized to 0 on VMX (wrong, but usually overridden
> by userspace before starting) or 0xf on SVM.  The correct value is
> 0x, and VMX is able to emulate it now, so use it.
> 
> Signed-off-by: Paolo Bonzini 
Reviewed-by: Gleb Natapov 

> ---
>  arch/x86/kvm/svm.c | 8 +---
>  arch/x86/kvm/vmx.c | 1 +
>  2 files changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 7219a40..7a46c1f 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1131,17 +1131,11 @@ static void init_vmcb(struct vcpu_svm *svm)
>   init_seg(&save->gs);
>  
>   save->cs.selector = 0xf000;
> + save->cs.base = 0x;
>   /* Executable/Readable Code Segment */
>   save->cs.attrib = SVM_SELECTOR_READ_MASK | SVM_SELECTOR_P_MASK |
>   SVM_SELECTOR_S_MASK | SVM_SELECTOR_CODE_MASK;
>   save->cs.limit = 0x;
> - /*
> -  * cs.base should really be 0x, but vmx can't handle that, so
> -  * be consistent with it.
> -  *
> -  * Replace when we have real mode working for vmx.
> -  */
> - save->cs.base = 0xf;
>  
>   save->gdtr.limit = 0x;
>   save->idtr.limit = 0x;
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 4a0bafe..c75c25d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4133,6 +4133,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>  
>   seg_setup(VCPU_SREG_CS);
>   vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
> + vmcs_write32(GUEST_CS_BASE, 0x);
>  
>   seg_setup(VCPU_SREG_DS);
>   seg_setup(VCPU_SREG_ES);
> -- 
> 1.8.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: correctly initialize the CS base on reset

2013-03-19 Thread Gleb Natapov

On Tue, Mar 19, 2013 at 05:41:45PM +0100, Jan Kiszka wrote:
> On 2013-03-19 16:43, Gleb Natapov wrote:
> > On Tue, Mar 19, 2013 at 04:30:26PM +0100, Paolo Bonzini wrote:
> >> The CS base was initialized to 0 on VMX (wrong, but usually overridden
> >> by userspace before starting) or 0xf on SVM.  The correct value is
> >> 0x, and VMX is able to emulate it now, so use it.
> >>
> >> Signed-off-by: Paolo Bonzini 
> > Reviewed-by: Gleb Natapov 
> 
> Just for the history (as I was wondering where this came from): CS base
> used to be set to 0xf on VMX as well, but that was changed by
> b246dd5d to only affect guests when unrestricted mode is missing. That
> change actually left the base uninitialized.
> 
Looking at the code before b246dd5d it sets CS base to 0xf only if it was
0x. This is the same as with current code: fix_rmode_seg() does
it.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: let userspace inject interrupts into the local APIC

2013-03-19 Thread Gleb Natapov

On Tue, Mar 19, 2013 at 04:51:13PM +0100, Paolo Bonzini wrote:
> There is no way for userspace to inject interrupts into a VCPU's
> local APIC, which is important in order to inject INITs coming from
> the chipset.  KVM_INTERRUPT is currently disabled when the in-kernel
> local APIC is used, so we can repurpose it.  The shorthand destination
> field must contain APIC_DEST_SELF, which has a double effect: first,
> the ICR2 register is not used and the 32-bit field of KVM_INTERRUPT is
> enough; second, it ensures that the valid range of the irq field is
> distinct in the userspace-APIC and kernel-APIC cases.
> 
Init coming from triggering INIT# line should not be modeled as INIT coming from
APIC. In fact INIT cannot be send using SELF shorthand. If APIC code
will be fixed to check for that this code will break.

> Signed-off-by: Paolo Bonzini 
> ---
>  Documentation/virtual/kvm/api.txt | 13 ++---
>  arch/x86/kvm/lapic.c  | 20 +++-
>  arch/x86/kvm/lapic.h  |  1 +
>  arch/x86/kvm/x86.c|  6 +++---
>  4 files changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 4237c27..a69706b 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -396,8 +396,8 @@ Type: vcpu ioctl
>  Parameters: struct kvm_interrupt (in)
>  Returns: 0 on success, -1 on error
>  
> -Queues a hardware interrupt vector to be injected.  This is only
> -useful if in-kernel local APIC or equivalent is not used.
> +Queues a hardware interrupt vector to be injected into either the VCPU
> +or into the in-kernel local APIC or equivalent (if in use).
>  
>  /* for KVM_INTERRUPT */
>  struct kvm_interrupt {
> @@ -407,7 +407,14 @@ struct kvm_interrupt {
>  
>  X86:
>  
> -Note 'irq' is an interrupt vector, not an interrupt pin or line.
> +If the in-kernel local APIC is enabled, 'irq' is sent to the local APIC
> +as if it were written to the ICR register, except that it is not reflected
> +into ICR if the guest reads it.  The destination (bits 18 and 19) must be
> +set to APIC_DEST_SELF.
> +
> +If the in-kernel local APIC is disabled, 'irq' is an interrupt vector (not
> +an interrupt pin or line) that is injected synchronously into the VCPU.
> +
>  
>  PPC:
>  
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index a8e9369..efb67f7 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -826,12 +826,9 @@ void kvm_apic_set_eoi_accelerated(struct kvm_vcpu *vcpu, 
> int vector)
>  }
>  EXPORT_SYMBOL_GPL(kvm_apic_set_eoi_accelerated);
>  
> -static void apic_send_ipi(struct kvm_lapic *apic)
> +static void apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high)
>  {
> - u32 icr_low = kvm_apic_get_reg(apic, APIC_ICR);
> - u32 icr_high = kvm_apic_get_reg(apic, APIC_ICR2);
>   struct kvm_lapic_irq irq;
> -
>   irq.vector = icr_low & APIC_VECTOR_MASK;
>   irq.delivery_mode = icr_low & APIC_MODE_MASK;
>   irq.dest_mode = icr_low & APIC_DEST_MASK;
> @@ -855,6 +852,16 @@ static void apic_send_ipi(struct kvm_lapic *apic)
>   kvm_irq_delivery_to_apic(apic->vcpu->kvm, apic, &irq);
>  }
>  
> +int kvm_lapic_ioctl_interrupt(struct kvm_vcpu *vcpu, u32 value)
> +{
> + if ((value & APIC_SHORT_MASK) != APIC_DEST_SELF)
> + return -EINVAL;
> +
> + apic_send_ipi(vcpu->arch.apic, value, 0);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_ioctl_interrupt);
> +
>  static u32 apic_get_tmcct(struct kvm_lapic *apic)
>  {
>   ktime_t remaining;
> @@ -1155,7 +1162,10 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 
> reg, u32 val)
>   case APIC_ICR:
>   /* No delay here, so we always clear the pending bit */
>   apic_set_reg(apic, APIC_ICR, val & ~(1 << 12));
> - apic_send_ipi(apic);
> + apic_send_ipi(apic,
> +   kvm_apic_get_reg(apic, APIC_ICR),
> +   kvm_apic_get_reg(apic, APIC_ICR2));
> +
>   break;
>  
>   case APIC_ICR2:
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 2c721b9..0631b5c 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -49,6 +49,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu);
>  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
>  void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
> +int kvm_lapic_ioctl_interrupt(struct kvm_vcpu *vcpu, u32 value);
>  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3e0a8ba..ab4a401 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2703,11 +2703,11 @@ static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu 
> *vcpu,
>  static int kvm_vcpu_ioctl_int

Re: [PATCH] x86: let userspace inject interrupts into the local APIC

2013-03-19 Thread Gleb Natapov

On Tue, Mar 19, 2013 at 07:39:24PM +0100, Paolo Bonzini wrote:
> Il 19/03/2013 19:13, Gleb Natapov ha scritto:
> >> > There is no way for userspace to inject interrupts into a VCPU's
> >> > local APIC, which is important in order to inject INITs coming from
> >> > the chipset.  KVM_INTERRUPT is currently disabled when the in-kernel
> >> > local APIC is used, so we can repurpose it.  The shorthand destination
> >> > field must contain APIC_DEST_SELF, which has a double effect: first,
> >> > the ICR2 register is not used and the 32-bit field of KVM_INTERRUPT is
> >> > enough; second, it ensures that the valid range of the irq field is
> >> > distinct in the userspace-APIC and kernel-APIC cases.
> >> > 
> > Init coming from triggering INIT# line should not be modeled as INIT coming 
> > from
> > APIC.
> 
> Then Jan's patch was wrong, and INIT should not have been an apic event
> (perhaps SIPI should).
> 
If it goes through APIC it is. Also the problem with reusing KVM_INTERRUPT is
that it is synchronous interface but INIT# is asynchronous. Shouldn't be a big
deal though.

> > In fact INIT cannot be send using SELF shorthand.
> 
> Where does the SDM say that?
> 
Table 10-3.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: let userspace inject interrupts into the local APIC

2013-03-20 Thread Gleb Natapov

On Tue, Mar 19, 2013 at 09:22:33PM +0100, Paolo Bonzini wrote:
> Il 19/03/2013 19:50, Gleb Natapov ha scritto:
> > On Tue, Mar 19, 2013 at 07:39:24PM +0100, Paolo Bonzini wrote:
> >> Il 19/03/2013 19:13, Gleb Natapov ha scritto:
> >>>>> There is no way for userspace to inject interrupts into a VCPU's
> >>>>> local APIC, which is important in order to inject INITs coming from
> >>>>> the chipset.  KVM_INTERRUPT is currently disabled when the in-kernel
> >>>>> local APIC is used, so we can repurpose it.  The shorthand destination
> >>>>> field must contain APIC_DEST_SELF, which has a double effect: first,
> >>>>> the ICR2 register is not used and the 32-bit field of KVM_INTERRUPT is
> >>>>> enough; second, it ensures that the valid range of the irq field is
> >>>>> distinct in the userspace-APIC and kernel-APIC cases.
> >>>>>
> >>> Init coming from triggering INIT# line should not be modeled as INIT 
> >>> coming from
> >>> APIC.
> >>
> >> Then Jan's patch was wrong, and INIT should not have been an apic event
> >> (perhaps SIPI should).
> >>
> > If it goes through APIC it is.
> 
> Ok, I'll extract KVM_APIC_INIT handling into a separate function and
> call it synchronously from KVM_INTERRUPT, with irq = -1
> (KVM_INTERRUPT_INIT, similar to PPC's values of irq).
> KVM_INTERRUPT_INIT will be accessible even with in-kernel irqchip.
> 
Why should it be accessible with in-kernel irqchip? The only valid value
for mp_state is RUNNING with userspace irqchip. We even validate it in
kvm_arch_vcpu_ioctl_set_mpstate() now.

> >>> In fact INIT cannot be send using SELF shorthand.
> >>
> >> Where does the SDM say that?
> >>
> > Table 10-3.
> 
> Yeah, table 10-6 and 10-7 here.
> 
Hmm, somebody needs to update SDM. Mine is from January 2013.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-03-21 Thread Gleb Natapov

On Wed, Mar 20, 2013 at 06:58:41PM -0500, Scott Wood wrote:
> On 03/14/2013 07:13:46 PM, Kevin Hilman wrote:
> >The new context tracking subsystem unconditionally includes kvm_host.h
> >headers for the guest enter/exit macros.  This causes a compile
> >failure when KVM is not enabled.
> >
> >Fix by adding an IS_ENABLED(CONFIG_KVM) check to kvm_host so it can
> >be included/compiled even when KVM is not enabled.
> >
> >Cc: Frederic Weisbecker 
> >Signed-off-by: Kevin Hilman 
> >---
> >Applies on v3.9-rc2
> >
> > include/linux/kvm_host.h | 7 ++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> 
> This broke the PPC non-KVM build, which was relying on stub
> functions in kvm_ppc.h, which relies on "struct vcpu" in kvm_host.h.
> 
> Why can't the entirety kvm_host.h be included regardless of
> CONFIG_KVM, just like most other feature-specific headers?  Why
> can't the if/else just go around the functions that you want to stub
> out for non-KVM builds?
> 
Kevin,

 What compilation failure this patch fixes? I presume something ARM
related.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 4/7] KVM: MMU: delete shadow page from hash list in kvm_mmu_prepare_zap_page

2013-03-21 Thread Gleb Natapov

On Wed, Mar 20, 2013 at 04:30:24PM +0800, Xiao Guangrong wrote:
> Move deletion shadow page from the hash list from kvm_mmu_commit_zap_page to
> kvm_mmu_prepare_zap_page, we that we can free the shadow page out of mmu-lock.
> 
> Also, delete the invalid shadow page from the hash list since this page can
> not be reused anymore. This makes reset mmu-cache more easier - we do not need
> to care all hash entries after reset mmu-cache
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c |8 ++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index dc37512..5578c91 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1472,7 +1472,7 @@ static inline void kvm_mod_used_mmu_pages(struct kvm 
> *kvm, int nr)
>  static void kvm_mmu_free_page(struct kvm_mmu_page *sp)
>  {
>   ASSERT(is_empty_shadow_page(sp->spt));
> - hlist_del(&sp->hash_link);
> +
>   list_del(&sp->link);
>   free_page((unsigned long)sp->spt);
>   if (!sp->role.direct)
> @@ -1660,7 +1660,8 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
>  
>  #define for_each_gfn_indirect_valid_sp(_kvm, _sp, _gfn)  
> \
>   for_each_gfn_sp(_kvm, _sp, _gfn)\
> - if ((_sp)->role.direct || (_sp)->role.invalid) {} else
> + if ((_sp)->role.direct ||   \
> +   ((_sp)->role.invalid && WARN_ON(1))) {} else
>  
>  /* @sp->gfn should be write-protected at the call site */
>  static int __kvm_sync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> @@ -2079,6 +2080,9 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, 
> struct kvm_mmu_page *sp,
>   unaccount_shadowed(kvm, sp->gfn);
>   if (sp->unsync)
>   kvm_unlink_unsync_page(kvm, sp);
> +
> + hlist_del_init(&sp->hash_link);
> +
Now we delete roots from hash, but leave it on active_mmu_pages list. Is
this OK?

>   if (!sp->root_count) {
>   /* Count self */
>   ret++;
> -- 
> 1.7.7.6

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/3] AER-KVM: Error containment of VFIO devices assigned to KVM guests

2013-02-03 Thread Gleb Natapov

Please use "git send-email --thread --no-chain-reply-to" when sending
patch series.

On Sun, Feb 03, 2013 at 02:10:08PM +, Pandarathil, Vijaymohan R wrote:
> 
> Add support for error containment when a VFIO device assigned to a KVM
> guest encounters an error. This is for PCIe devices/drivers that support AER
> functionality. When the host OS is notified of an error in a device either
> through the firmware first approach or through an interrupt handled by the AER
> root port driver, the error handler registered by the vfio-pci driver gets
> invoked. The qemu process is signaled through an eventfd registered per
> VFIO device by the qemu process. In the eventfd handler, qemu decides on
> what action to take. In this implementation, guest is brought down to
> contain the error.
> 
> 
> v3:
>  - Removed PCI_AER* flags from device info ioctl.
>  - Incorporated feedback
> v2:
>  - Rebased to latest upstream stable bits
>  - Changed the new ioctl to be part of VFIO_SET_IRQs ioctl
>  - Added a new patch to get/put reference to a vfio device from struct device
>  - Incorporated all other feedback.
> 
> ---
> 
> Vijay Mohan Pandarathil(3):
> 
> [PATCH 1/3] VFIO: Wrapper for getting reference to vfio_device from device 
> [PATCH 2/3] VFIO-AER: Vfio-pci driver changes for supporting AER
> [PATCH 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
> 
> Kernel files changed
> 
>  drivers/vfio/vfio.c  | 41 -
>  include/linux/vfio.h |  3 +++
>  2 files changed, 35 insertions(+), 9 deletions(-)
> 
>  drivers/vfio/pci/vfio_pci.c | 43 
> -
>  drivers/vfio/pci/vfio_pci_intrs.c   | 30 ++
>  drivers/vfio/pci/vfio_pci_private.h |  1 +
>  include/uapi/linux/vfio.h   |  1 +
>  4 files changed, 74 insertions(+), 1 deletion(-)
> 
> Qemu files changed
> 
>  hw/vfio_pci.c  | 105 
> +
>  linux-headers/linux/vfio.h |   1 +
>  2 files changed, 106 insertions(+)

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
> On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
>  wrote:
> > - Create eventfd per vfio device assigned to a guest and register an
> >   event handler
> >
> > - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl
> >
> > - When the device encounters an error, the eventfd is signalled
> >   and the qemu eventfd handler gets invoked.
> >
> > - In the handler decide what action to take. Current action taken
> >   is to terminate the guest.
> 
> Usually this is not OK, but I guess this is not guest triggerable.
> 
Still not OK. Why not stop a guest with appropriate stop reason?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3] KVM: MMU: lazily drop large spte

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 03:11:09PM +0800, Xiao Guangrong wrote:
> Currently, kvm zaps the large spte if write-protected is needed, the later
> read can fault on that spte. Actually, we can make the large spte readonly
> instead of making them un-present, the page fault caused by read access can
> be avoid
> 
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter.  This removes the need for the return value.
> 
> Signed-off-by: Xiao Guangrong 
Reviewed-by: Gleb Natapov 

> ---
> Changelog:
> v3:
> - address Gleb's comments, we make the function return true if flush is
>   needed instead of returning it via pointer to a variable
> - improve the changelog
> 
>  arch/x86/kvm/mmu.c |   23 +++
>  1 files changed, 7 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 42ba85c..ff2fc80 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1106,8 +1106,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
> *sptep)
> 
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> - * @flush indicates whether tlb need be flushed.
> + * spte write-protection is caused by protecting shadow page table.
>   *
>   * Note: write protection is difference between drity logging and spte
>   * protection:
> @@ -1116,10 +1115,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
> *sptep)
>   * - for spte protection, the spte can be writable only after unsync-ing
>   *   shadow page.
>   *
> - * Return true if the spte is dropped.
> + * Return true if tlb need be flushed.
>   */
> -static bool
> -spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
> +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
>  {
>   u64 spte = *sptep;
> 
> @@ -1129,17 +1127,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool 
> *flush, bool pt_protect)
> 
>   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
> 
> - if (__drop_large_spte(kvm, sptep)) {
> - *flush |= true;
> - return true;
> - }
> -
>   if (pt_protect)
>   spte &= ~SPTE_MMU_WRITEABLE;
>   spte = spte & ~PT_WRITABLE_MASK;
> 
> - *flush |= mmu_spte_update(sptep, spte);
> - return false;
> + return mmu_spte_update(sptep, spte);
>  }
> 
>  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> @@ -1151,11 +1143,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
> unsigned long *rmapp,
> 
>   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>   BUG_ON(!(*sptep & PT_PRESENT_MASK));
> - if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
> - sptep = rmap_get_first(*rmapp, &iter);
> - continue;
> - }
> 
> + flush |= spte_write_protect(kvm, sptep, pt_protect);
>   sptep = rmap_get_next(&iter);
>   }
> 
> @@ -2611,6 +2600,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
> int write,
>   break;
>   }
> 
> + drop_large_spte(vcpu, iterator.sptep);
> +
>   if (!is_shadow_present_pte(*iterator.sptep)) {
>   u64 base_addr = iterator.addr;
> 
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote:
> 
> 
> > -Original Message-
> > From: Gleb Natapov [mailto:g...@redhat.com]
> > Sent: Tuesday, February 05, 2013 12:05 AM
> > To: Blue Swirl
> > Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance
> > E; k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
> > PCI devices
> > 
> > On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
> > > On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
> > >  wrote:
> > > > - Create eventfd per vfio device assigned to a guest and
> > register an
> > > >   event handler
> > > >
> > > > - This fd is passed to the vfio_pci driver through the SET_IRQ
> > ioctl
> > > >
> > > > - When the device encounters an error, the eventfd is signalled
> > > >   and the qemu eventfd handler gets invoked.
> > > >
> > > > - In the handler decide what action to take. Current action
> > taken
> > > >   is to terminate the guest.
> > >
> > > Usually this is not OK, but I guess this is not guest triggerable.
> > >
> > Still not OK. Why not stop a guest with appropriate stop reason?
> 
> The thinking was that since this is a hardware error, we would want to stop 
> the guest at the earliest. The hw_error() routine which aborts the qemu 
> process was suggested by Alex and that seemed appropriate. Earlier I was 
> using qemu_system_shutdown_request().  Any suggestions ?
> 
I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
involves sending IPIs to other cpus running guest's vcpus. Both exit()
and vm_stop() will do it, but former is implicitly in the kernel and
later is explicitly in QEMU.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:
> 
> 
> > -Original Message-
> > From: Gleb Natapov [mailto:g...@redhat.com]
> > Sent: Tuesday, February 05, 2013 1:21 AM
> > To: Pandarathil, Vijaymohan R
> > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
> > PCI devices
> > 
> > On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > Sent: Tuesday, February 05, 2013 12:05 AM
> > > > To: Blue Swirl
> > > > Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz,
> > Lance
> > > > E; k...@vger.kernel.org; qemu-de...@nongnu.org; linux-
> > p...@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org
> > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
> > VFIO-
> > > > PCI devices
> > > >
> > > > On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
> > > > > On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
> > > > >  wrote:
> > > > > > - Create eventfd per vfio device assigned to a guest and
> > > > register an
> > > > > >   event handler
> > > > > >
> > > > > > - This fd is passed to the vfio_pci driver through the
> > SET_IRQ
> > > > ioctl
> > > > > >
> > > > > > - When the device encounters an error, the eventfd is
> > signalled
> > > > > >   and the qemu eventfd handler gets invoked.
> > > > > >
> > > > > > - In the handler decide what action to take. Current action
> > > > taken
> > > > > >   is to terminate the guest.
> > > > >
> > > > > Usually this is not OK, but I guess this is not guest triggerable.
> > > > >
> > > > Still not OK. Why not stop a guest with appropriate stop reason?
> > >
> > > The thinking was that since this is a hardware error, we would want to
> > stop the guest at the earliest. The hw_error() routine which aborts the
> > qemu process was suggested by Alex and that seemed appropriate. Earlier I
> > was using qemu_system_shutdown_request().  Any suggestions ?
> > >
> > I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
> > involves sending IPIs to other cpus running guest's vcpus. Both exit()
> > and vm_stop() will do it, but former is implicitly in the kernel and
> > later is explicitly in QEMU.
> 
> I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, 
> guest ended up in a hang rather than exiting. There seems to some cleanup 
> work which is being done as part of vm_stop. In our case, we wanted the guest 
> to exit immediately. So use of hw_error() seemed appropriate.
> 
What makes you think it hang? It stopped, precisely what it should do if
you call vm_stop(). Now it is possible for vm user to investigate what
happened and even salvage some data from guest memory.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/3] KVM: MMU: simple cleanups

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 03:26:21PM +0800, Xiao Guangrong wrote:
> There are the simple cleanups for MMU, no function / logic changed.
> 
> Marcelo, Gleb, please apply them after applying
> "[PATCH v3] KVM: MMU: lazily drop large spte"
> 
> Changelog:
> no change, just split them from the previous patchset for good review.
> 
> Thanks!
Reviewed-by: Gleb Natapov 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 12:05:11PM +, Pandarathil, Vijaymohan R wrote:
> 
> 
> > -Original Message-
> > From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
> > ow...@vger.kernel.org] On Behalf Of Gleb Natapov
> > Sent: Tuesday, February 05, 2013 3:37 AM
> > To: Pandarathil, Vijaymohan R
> > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-
> > PCI devices
> > 
> > On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > Sent: Tuesday, February 05, 2013 1:21 AM
> > > > To: Pandarathil, Vijaymohan R
> > > > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > > > k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org
> > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
> > VFIO-
> > > > PCI devices
> > > >
> > > > On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
> > wrote:
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > > > Sent: Tuesday, February 05, 2013 12:05 AM
> > > > > > To: Blue Swirl
> > > > > > Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
> > Ortiz,
> > > > Lance
> > > > > > E; k...@vger.kernel.org; qemu-de...@nongnu.org; linux-
> > > > p...@vger.kernel.org;
> > > > > > linux-kernel@vger.kernel.org
> > > > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER
> > for
> > > > VFIO-
> > > > > > PCI devices
> > > > > >
> > > > > > On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
> > > > > > > On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
> > > > > > >  wrote:
> > > > > > > > - Create eventfd per vfio device assigned to a guest
> > and
> > > > > > register an
> > > > > > > >   event handler
> > > > > > > >
> > > > > > > > - This fd is passed to the vfio_pci driver through the
> > > > SET_IRQ
> > > > > > ioctl
> > > > > > > >
> > > > > > > > - When the device encounters an error, the eventfd is
> > > > signalled
> > > > > > > >   and the qemu eventfd handler gets invoked.
> > > > > > > >
> > > > > > > > - In the handler decide what action to take. Current
> > action
> > > > > > taken
> > > > > > > >   is to terminate the guest.
> > > > > > >
> > > > > > > Usually this is not OK, but I guess this is not guest
> > triggerable.
> > > > > > >
> > > > > > Still not OK. Why not stop a guest with appropriate stop reason?
> > > > >
> > > > > The thinking was that since this is a hardware error, we would want
> > to
> > > > stop the guest at the earliest. The hw_error() routine which aborts the
> > > > qemu process was suggested by Alex and that seemed appropriate. Earlier
> > I
> > > > was using qemu_system_shutdown_request().  Any suggestions ?
> > > > >
> > > > I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
> > > > involves sending IPIs to other cpus running guest's vcpus. Both exit()
> > > > and vm_stop() will do it, but former is implicitly in the kernel and
> > > > later is explicitly in QEMU.
> > >
> > > I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while
> > testing, guest ended up in a hang rather than exiting. There seems to some
> > cleanup work which is being done as part of vm_stop. In our case, we wanted
> > the guest to exit immediately. So use of hw_error() seemed appropriate.
> > >
> > What makes you think it hang? It stopped, precisely what it should do if
> > you call vm_stop(). Now it is possible for vm user to investigate what
> > happened and even salvage some data from guest memory.
> 
> That was ignorance on my part on the expected behavior of vm_stop(). 
> So what you are suggesting is to stop the guest displaying an appropriate 
> error/next-steps message and have the users do any 
> data-collection/investigation 
> and then manually kill the guest, if they so desire. Right ?
> 
Yes.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 0/5] KVM: MMU: unify and cleanup the code of walking pte list

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 04:52:35PM +0800, Xiao Guangrong wrote:
> Current code has two ways to walk pte_list, the one is pte_list_walk and
> the another way is rmap_get_first and rmap_get_next, they have the same logic.
> This patchset tries to unify the code and also make the code more tidy.
> 
> Patch 1: KVM: MMU: introduce mmu_spte_establish, which tries to eliminates
> the different between walking parent pte list and rmap, prepare for the
> later patch.
> 
> Patch 2: KVM: MMU: clarify the logic in kvm_set_pte_rmapp, which prepares for
> the next patch, no logic changed.
> 
> Patch 3: KVM: MMU: unify the code of walking pte list, unify the walking code.
> 
> Patch 4: KVM: MMU: fix spte assertion, fix a minor bug and remove the 
> duplicate
> code.
> 
> Patch 5: KVM: MMU: fast drop all spte on the pte_list, optimize for dropping
> all sptes on rmap and remove all the "goto restart" pattern introduced by
> the Patch 3.
> 
> Marcelo, Gleb, please apply them after applying the patchset of
> [PATCH v3 0/3] KVM: MMU: simple cleanups.
> 
> Changelog:
> v3:
> - address Gleb's comments, remove the remained "goto restart" in
>   kvm_set_pte_rmapp
> - improve the changelog
> 
Reviewed-by: Gleb Natapov 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

2013-02-05 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 06:37:35AM -0700, Alex Williamson wrote:
> On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote:
> > 
> > > -Original Message-
> > > From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci-
> > > ow...@vger.kernel.org] On Behalf Of Gleb Natapov
> > > Sent: Tuesday, February 05, 2013 3:37 AM
> > > To: Pandarathil, Vijaymohan R
> > > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > > k...@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org;
> > > linux-kernel@vger.kernel.org
> > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for 
> > > VFIO-
> > > PCI devices
> > > 
> > > On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > > Sent: Tuesday, February 05, 2013 1:21 AM
> > > > > To: Pandarathil, Vijaymohan R
> > > > > Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E;
> > > > > k...@vger.kernel.org; qemu-de...@nongnu.org; 
> > > > > linux-...@vger.kernel.org;
> > > > > linux-kernel@vger.kernel.org
> > > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for
> > > VFIO-
> > > > > PCI devices
> > > > >
> > > > > On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R
> > > wrote:
> > > > > >
> > > > > >
> > > > > > > -Original Message-
> > > > > > > From: Gleb Natapov [mailto:g...@redhat.com]
> > > > > > > Sent: Tuesday, February 05, 2013 12:05 AM
> > > > > > > To: Blue Swirl
> > > > > > > Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas;
> > > Ortiz,
> > > > > Lance
> > > > > > > E; k...@vger.kernel.org; qemu-de...@nongnu.org; linux-
> > > > > p...@vger.kernel.org;
> > > > > > > linux-kernel@vger.kernel.org
> > > > > > > Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER
> > > for
> > > > > VFIO-
> > > > > > > PCI devices
> > > > > > >
> > > > > > > On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote:
> > > > > > > > On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R
> > > > > > > >  wrote:
> > > > > > > > > - Create eventfd per vfio device assigned to a guest
> > > and
> > > > > > > register an
> > > > > > > > >   event handler
> > > > > > > > >
> > > > > > > > > - This fd is passed to the vfio_pci driver through the
> > > > > SET_IRQ
> > > > > > > ioctl
> > > > > > > > >
> > > > > > > > > - When the device encounters an error, the eventfd is
> > > > > signalled
> > > > > > > > >   and the qemu eventfd handler gets invoked.
> > > > > > > > >
> > > > > > > > > - In the handler decide what action to take. Current
> > > action
> > > > > > > taken
> > > > > > > > >   is to terminate the guest.
> > > > > > > >
> > > > > > > > Usually this is not OK, but I guess this is not guest
> > > triggerable.
> > > > > > > >
> > > > > > > Still not OK. Why not stop a guest with appropriate stop reason?
> > > > > >
> > > > > > The thinking was that since this is a hardware error, we would want
> > > to
> > > > > stop the guest at the earliest. The hw_error() routine which aborts 
> > > > > the
> > > > > qemu process was suggested by Alex and that seemed appropriate. 
> > > > > Earlier
> > > I
> > > > > was using qemu_system_shutdown_request().  Any suggestions ?
> > > > > >
> > > > > I am thinking vm_stop(). Stopping SMP guest (and UP too in fact)
> > > > > involves sending IPIs to other cpus running guest's vcpus. Both exit()
> > > > > and vm_stop() will do it, but f

Re: [PATCH linux-next] x86/kvm: Fix compile warning in kvm_register_steal_time()

2013-02-11 Thread Gleb Natapov

On Tue, Feb 05, 2013 at 07:57:22PM -0700, Shuah Khan wrote:
> Fix the following compile warning in kvm_register_steal_time():
>   CC  arch/x86/kernel/kvm.o
> arch/x86/kernel/kvm.c: In function ‘kvm_register_steal_time’:
> arch/x86/kernel/kvm.c:302:3: warning: format ‘%lx’ expects argument of type 
> ‘long unsigned int’, but argument 3 has type ‘phys_addr_t’ [-Wformat]
> 
Ingo, the warning is from the tip tree. Can you take the fix?

> Signed-off-by: Shuah Khan 
> ---
>  arch/x86/kernel/kvm.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index fe75a28..b686a90 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -298,8 +298,8 @@ static void kvm_register_steal_time(void)
>   memset(st, 0, sizeof(*st));
>  
>   wrmsrl(MSR_KVM_STEAL_TIME, (slow_virt_to_phys(st) | KVM_MSR_ENABLED));
> - printk(KERN_INFO "kvm-stealtime: cpu %d, msr %lx\n",
> - cpu, slow_virt_to_phys(st));
> + pr_info("kvm-stealtime: cpu %d, msr %llx\n",
> + cpu, (unsigned long long) slow_virt_to_phys(st));
>  }
>  
>  static DEFINE_PER_CPU(unsigned long, kvm_apic_eoi) = KVM_PV_EOI_DISABLED;
> -- 
> 1.7.9.5
> 
> 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH linux-next] x86/kvm: Fix compile warning in kvm_register_steal_time()

2013-02-11 Thread Gleb Natapov

On Mon, Feb 11, 2013 at 11:04:39AM +0100, Ingo Molnar wrote:
> 
> * Gleb Natapov  wrote:
> 
> > On Tue, Feb 05, 2013 at 07:57:22PM -0700, Shuah Khan wrote:
> > > Fix the following compile warning in kvm_register_steal_time():
> > >   CC  arch/x86/kernel/kvm.o
> > > arch/x86/kernel/kvm.c: In function ?kvm_register_steal_time?:
> > > arch/x86/kernel/kvm.c:302:3: warning: format ?%lx? expects argument of 
> > > type ?long unsigned int?, but argument 3 has type ?phys_addr_t? [-Wformat]
> > > 
> > Ingo, the warning is from the tip tree. Can you take the fix?
> 
> Yeah, it came via these x86 improvements to __pa():
> 
> 5dfd486c4750 x86, kvm: Fix kvm's use of __pa() on percpu areas
> d76565344512 x86, mm: Create slow_virt_to_phys()
> f3c4fbb68e93 x86, mm: Use new pagetable helpers in try_preserve_large_page()
> 4cbeb51b860c x86, mm: Pagetable level size/shift/mask helpers
> a25b9316841c x86, mm: Make DEBUG_VIRTUAL work earlier in boot
> 
> So up the fix - I added an Acked-by from you as well, is that 
> fine with you?
> 
Yes, of course. Thank you.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-04-02 Thread Gleb Natapov

On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote:
> Gleb Natapov  writes:
> 
> > On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote:
> >> 2013/3/21 Gleb Natapov :
> >> > Isn't is simpler for kernel/context_tracking.c to define empty
> >> > __guest_enter()/__guest_exit() if !CONFIG_KVM.
> >> 
> >> That doesn't look right. Off-cases are usually handled from the
> >> headers, right? So that we avoid iffdeffery ugliness in core code.
> > Lets put it in linux/context_tracking.h header then.
> 
> Here's a version to do that.
> 
Frederic, are you OK with this version?


> Kevin
> 
> >From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001
> From: Kevin Hilman 
> Date: Mon, 25 Mar 2013 14:12:41 -0700
> Subject: [PATCH v2] ontext_tracking: fix !CONFIG_KVM compile: add stub guest
>  enter/exit
> 
> When KVM is not enabled, or not available on a platform, the KVM
> headers should not be included.  Instead, just define stub
> __guest_[enter|exit] functions.
> 
> Cc: Frederic Weisbecker 
> Signed-off-by: Kevin Hilman 
> ---
>  include/linux/context_tracking.h | 7 +++
>  kernel/context_tracking.c| 1 -
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/context_tracking.h 
> b/include/linux/context_tracking.h
> index 365f4a6..9d0f242 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -3,6 +3,13 @@
>  
>  #include 
>  #include 
> +#if IS_ENABLED(CONFIG_KVM)
> +#include 
> +#else
> +#define __guest_enter()
> +#define __guest_exit()
> +#endif
> +
>  #include 
>  
>  struct context_tracking {
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 65349f0..85bdde1 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -15,7 +15,6 @@
>   */
>  
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> -- 
> 1.8.2

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> 
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> >  devices on the MMIO bus only get a valid address in write, this
> >  way we don't need to touch all devices to teach them handle
> >  an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> 
> This still uses page fault intercepts which are orders of magnitudes slower 
> than hypercalls. Why don't you just create a PV MMIO hypercall that the guest 
> can use to invoke MMIO accesses towards the host based on physical addresses 
> with explicit length encodings?
> 
It is slower, but not an order of magnitude slower. It become faster
with newer HW.

> That way you simplify and speed up all code paths, exceeding the speed of PIO 
> exits even. It should also be quite easily portable, as all other platforms 
> have hypercalls available as well.
> 
We are trying to avoid PV as much as possible (well this is also PV,
but not guest visible). We haven't replaced PIO with hypercall for the
same reason. My hope is that future HW will provide us with instruction
decode for basic mov instruction at which point this optimisation can be
dropped. And hypercall has its own set of problems with Windows guests.
When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
may also affect WHQLing Windows drivers since driver will talk to HW
bypassing Windows interfaces.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes
> >> slower than hypercalls.
> > 
> > Not really. Here's a test:
> > compare vmcall to portio:
> > 
> > vmcall 1519
> > ...
> > outl_to_kernel 1745
> > 
> > compare portio to mmio:
> > 
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> > 
> > So not orders of magnitude.
> 
> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
> 
> Check out page 41. Higher is better (number is number of loop cycles in a 
> second). My test system was an AMD Istanbul based box.
> 
Have you bypassed instruction emulation in your testing?

> > 
> >> Why don't you just create a PV MMIO hypercall
> >> that the guest can use to invoke MMIO accesses towards the host based
> >> on physical addresses with explicit length encodings?
> >> That way you simplify and speed up all code paths, exceeding the speed
> >> of PIO exits even. It should also be quite easily portable, as all
> >> other platforms have hypercalls available as well.
> >> 
> >> 
> >> Alex
> > 
> > I sent such a patch, but maintainers seem reluctant to add hypercalls.
> > Gleb, could you comment please?
> > 
> > A fast way to do MMIO is probably useful in any case ...
> 
> Yes, but at least according to my numbers optimizing anything that is not 
> hcalls is a waste of time.
> 
> 
> Alex

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible). We haven't replaced PIO with hypercall for the
> > same reason. My hope is that future HW will provide us with instruction
> > decode for basic mov instruction at which point this optimisation can be
> > dropped.
> 
> The same applies to an MMIO hypercall. Once the PV interface becomes 
> obsolete, we can drop the capability we expose to the guest.
> 
Disable it on newer HW is easy, but it is not that simple to get rid of the 
guest code.

> > And hypercall has its own set of problems with Windows guests.
> > When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> > hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
> 
> Can't we simply register a hypercall ID range with Microsoft?
Doubt it. This is not only about the rang though. The calling convention
is completely different.

> 
> > may also affect WHQLing Windows drivers since driver will talk to HW
> > bypassing Windows interfaces.
> 
> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
> 
All Windows drivers have to be WHQL'ed, saying that is like saying that
we do not care about Windows guest virtio speed.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
> 
> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The 
> comment above its definition indicates that the guest does so, so it is guest 
> visible.
> 
QEMU sets it.

> +/*
> + * PV_MMIO - Guest can promise us that all accesses touching this address
> + * are writes of specified length, starting at the specified address.
> + * If not - it's a Guest bug.
> + * Can not be used together with either PIO or DATAMATCH.
> + */
> 
Virtio spec will state that access to a kick register needs to be of
specific length. This is reasonable thing for HW to ask.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>> 
> >>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>> 
> >>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>> know the address from the VMCS so if the address is unique, we can look
> >>>>> up the eventfd directly, bypassing emulation.
> >>>>> 
> >>>>> Add an interface for userspace to specify this per-address, we can
> >>>>> use this e.g. for virtio.
> >>>>> 
> >>>>> The implementation adds a separate bus internally. This serves two
> >>>>> purposes:
> >>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>> - minimize disruption in other code (since we don't know the length,
> >>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>> way we don't need to touch all devices to teach them handle
> >>>>> an dinvalid length)
> >>>>> 
> >>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>> slowly.
> >>>>> 
> >>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>> 
> >>>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>>>> pre-review and suggestions.
> >>>>> 
> >>>>> Signed-off-by: Michael S. Tsirkin 
> >>>> 
> >>>> This still uses page fault intercepts which are orders of magnitudes 
> >>>> slower than hypercalls. Why don't you just create a PV MMIO hypercall 
> >>>> that the guest can use to invoke MMIO accesses towards the host based on 
> >>>> physical addresses with explicit length encodings?
> >>>> 
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>> 
> >>>> That way you simplify and speed up all code paths, exceeding the speed 
> >>>> of PIO exits even. It should also be quite easily portable, as all other 
> >>>> platforms have hypercalls available as well.
> >>>> 
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >> 
> >> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
> >> The comment above its definition indicates that the guest does so, so it 
> >> is guest visible.
> >> 
> > QEMU sets it.
> 
> How does QEMU know?
> 
Knows what? When to create such eventfd? virtio device knows.

> > 
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >> 
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
> 
> This is a spec change. So the guest would have to indicate that it adheres to 
> a newer spec. Thus it's a guest visible change.
> 
There is not virtio spec that has kick register in MMIO. The spec is in
the works AFAIK. Actually PIO will not be deprecated and my suggestion
is to move to MMIO only when PIO address space is exhausted. For PCI it
will be never, for PCI-e it will be after ~16 devices.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>> 
> >>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>> 
> >>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>> 
> >>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>> 
> >>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>> know the address from the VMCS so if the address is unique, we can 
> >>>>>>> look
> >>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>> 
> >>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>> use this e.g. for virtio.
> >>>>>>> 
> >>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>> purposes:
> >>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>> an dinvalid length)
> >>>>>>> 
> >>>>>>> At the moment, this optimization is only supported for EPT on x86 and
> >>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>> slowly.
> >>>>>>> 
> >>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>> 
> >>>>>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>>>>>> pre-review and suggestions.
> >>>>>>> 
> >>>>>>> Signed-off-by: Michael S. Tsirkin 
> >>>>>> 
> >>>>>> This still uses page fault intercepts which are orders of magnitudes 
> >>>>>> slower than hypercalls. Why don't you just create a PV MMIO hypercall 
> >>>>>> that the guest can use to invoke MMIO accesses towards the host based 
> >>>>>> on physical addresses with explicit length encodings?
> >>>>>> 
> >>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>> with newer HW.
> >>>>> 
> >>>>>> That way you simplify and speed up all code paths, exceeding the speed 
> >>>>>> of PIO exits even. It should also be quite easily portable, as all 
> >>>>>> other platforms have hypercalls available as well.
> >>>>>> 
> >>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>> but not guest visible
> >>>> 
> >>>> Also, how is this not guest visible? Who sets 
> >>>> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
> >>>> that the guest does so, so it is guest visible.
> >>>> 
> >>> QEMU sets it.
> >> 
> >> How does QEMU know?
> >> 
> > Knows what? When to create such eventfd? virtio device knows.
> 
> Where does it know from?
> 
It does it always.

> > 
> >>> 
> >>>> +/*
> >>>> + * PV_MMIO - Guest can promise us that all accesses touching this 
> >>>> address
> >>>> + * are writes of specified length, starting at the specified address.
> >>>> + * If not - it's a Guest bug.
> >>>> + * Can not be used together with either PIO or DATAMATCH.
> >>>> + */
> >>>> 
> >>> Virtio spec will state that access to a kick register needs to be of
> >>> specific length. This is reasonable thing for HW to ask.
> >> 
> >> This is a spec change. So the guest would have to indicate that it adheres 
> >> to a newer spec. Thus it's a guest visible change.
> >> 
> > There is not virtio spec that has kick register in MMIO. The spec is in
> > the works AFAIK. Actually PIO will not be deprecated and my suggestion
> 
> So the guest would indicate that it supports a newer revision of the spec (in 
> your case, that it supports MMIO). How is that any different from exposing 
> that it supports a PV MMIO hcall?
> 
Guest will indicate nothing. New driver will use MMIO if PIO is bar is
not configured. All driver will not work for virtio devices with MMIO
bar, but not PIO bar.

> > is to move to MMIO only when PIO address space is exhausted. For PCI it
> > will be never, for PCI-e it will be after ~16 devices.
> 
> Ok, let's go back a step here. Are you actually able to measure any speed in 
> performance with this patch applied and without when going through MMIO kicks?
> 
> 
That's the question for MST. I think he did only micro benchmarks till
now and he already posted his result here:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So the patch speedup mmio by almost 100% and it is almost the same as PIO.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:56, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >>>> 
> >>>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >>>> 
> >>>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >>>>>> 
> >>>>>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >>>>>> 
> >>>>>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >>>>>>>> 
> >>>>>>>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >>>>>>>> 
> >>>>>>>>> With KVM, MMIO is much slower than PIO, due to the need to
> >>>>>>>>> do page walk and emulation. But with EPT, it does not have to be: we
> >>>>>>>>> know the address from the VMCS so if the address is unique, we can 
> >>>>>>>>> look
> >>>>>>>>> up the eventfd directly, bypassing emulation.
> >>>>>>>>> 
> >>>>>>>>> Add an interface for userspace to specify this per-address, we can
> >>>>>>>>> use this e.g. for virtio.
> >>>>>>>>> 
> >>>>>>>>> The implementation adds a separate bus internally. This serves two
> >>>>>>>>> purposes:
> >>>>>>>>> - minimize overhead for old userspace that does not use PV MMIO
> >>>>>>>>> - minimize disruption in other code (since we don't know the length,
> >>>>>>>>> devices on the MMIO bus only get a valid address in write, this
> >>>>>>>>> way we don't need to touch all devices to teach them handle
> >>>>>>>>> an dinvalid length)
> >>>>>>>>> 
> >>>>>>>>> At the moment, this optimization is only supported for EPT on x86 
> >>>>>>>>> and
> >>>>>>>>> silently ignored for NPT and MMU, so everything works correctly but
> >>>>>>>>> slowly.
> >>>>>>>>> 
> >>>>>>>>> TODO: NPT, MMU and non x86 architectures.
> >>>>>>>>> 
> >>>>>>>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>>>>>>>> pre-review and suggestions.
> >>>>>>>>> 
> >>>>>>>>> Signed-off-by: Michael S. Tsirkin 
> >>>>>>>> 
> >>>>>>>> This still uses page fault intercepts which are orders of magnitudes 
> >>>>>>>> slower than hypercalls. Why don't you just create a PV MMIO 
> >>>>>>>> hypercall that the guest can use to invoke MMIO accesses towards the 
> >>>>>>>> host based on physical addresses with explicit length encodings?
> >>>>>>>> 
> >>>>>>> It is slower, but not an order of magnitude slower. It become faster
> >>>>>>> with newer HW.
> >>>>>>> 
> >>>>>>>> That way you simplify and speed up all code paths, exceeding the 
> >>>>>>>> speed of PIO exits even. It should also be quite easily portable, as 
> >>>>>>>> all other platforms have hypercalls available as well.
> >>>>>>>> 
> >>>>>>> We are trying to avoid PV as much as possible (well this is also PV,
> >>>>>>> but not guest visible
> >>>>>> 
> >>>>>> Also, how is this not guest visible? Who sets 
> >>>>>> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
> >>>>>> that the guest does so, so it is guest visible.
> >>>>>> 
> >>>>> QEMU sets it.
> >>>> 
> >>>> How does QEMU know?
> >>>> 
> >>> Knows what? When to create such eventfd? virtio device knows.
> >> 
> >> Where does it know from?
> >> 
> > It does it always.

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
> > 
> > #define GOAL (1ull << 30)
> > 
> >do {
> >iterations *= 2;
> >t1 = rdtsc();
> > 
> >for (i = 0; i < iterations; ++i)
> >func();
> >t2 = rdtsc();
> >} while ((t2 - t1) < GOAL);
> >printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
> 
> So it's the number of cycles per run.
> 
> That means translated my numbers are:
> 
>   MMIO: 4307
>   PIO: 3658
>   HCALL: 1756
> 
> MMIO - PIO = 649
> 
> which aligns roughly with your PV MMIO callback.
> 
> My MMIO benchmark was to poke the LAPIC version register. That does go 
> through instruction emulation, no?
> 
It should and PIO 'in' or string also goes through the emulator.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov

On Thu, Apr 04, 2013 at 06:36:30PM +0300, Michael S. Tsirkin wrote:
> > processor   : 0
> > vendor_id   : AuthenticAMD
> > cpu family  : 16
> > model   : 8
> > model name  : Six-Core AMD Opteron(tm) Processor 8435
> > stepping: 0
> > cpu MHz : 800.000
> > cache size  : 512 KB
> > physical id : 0
> > siblings: 6
> > core id : 0
> > cpu cores   : 6
> > apicid  : 8
> > initial apicid  : 0
> > fpu : yes
> > fpu_exception   : yes
> > cpuid level : 5
> > wp  : yes
> > flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> > mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> > pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc 
> > extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
> > cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv 
> > svm_lock nrip_save pausefilter
> > bogomips: 5199.87
> > TLB size: 1024 4K pages
> > clflush size: 64
> > cache_alignment : 64
> > address sizes   : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm stc 100mhzsteps hwpstate
> 
> Hmm, svm code seems less optimized for MMIO, but PIO
> is almost identical. Gleb says the unittest is broken
> on AMD so I'll wait until it's fixed to test.
> 
It's not unittest is broken, its my environment is broken :)

> Did you do PIO reads by chance?
> 
> > > 
> > > Or could be different software, this is on top of 3.9.0-rc5, what
> > > did you try?
> > 
> > 3.0 plus kvm-kmod of whatever was current back in autumn :).
> > 
> > > 
> > >> MST, could you please do a real world latency benchmark with virtio-net 
> > >> and
> > >> 
> > >>  * normal ioeventfd
> > >>  * mmio-pv eventfd
> > >>  * hcall eventfd
> > > 
> > > I can't do this right away, sorry.  For MMIO we are discussing the new
> > > layout on the virtio mailing list, guest and qemu need a patch for this
> > > too.  My hcall patches are stale and would have to be brought up to
> > > date.
> > > 
> > > 
> > >> to give us some idea how much performance we would gain from each 
> > >> approach? Thoughput should be completely unaffected anyway, since virtio 
> > >> just coalesces kicks internally.
> > > 
> > > Latency is dominated by the scheduling latency.
> > > This means virtio-net is not the best benchmark.
> > 
> > So what is a good benchmark?
> 
> E.g. ping pong stress will do but need to look at CPU utilization,
> that's what is affected, not latency.
> 
> > Is there any difference in speed at all? I strongly doubt it. One of 
> > virtio's main points is to reduce the number of kicks.
> 
> For this stage of the project I think microbenchmarks are more appropriate.
> Doubling the price of exit is likely to be measureable. 30 cycles likely
> not ...
> 
> > > 
> > >> I'm also slightly puzzled why the wildcard eventfd mechanism is so 
> > >> significantly slower, while it was only a few percent on my test system. 
> > >> What are the numbers you're listing above? Cycles? How many cycles do 
> > >> you execute in a second?
> > >> 
> > >> 
> > >> Alex
> > > 
> > > 
> > > It's the TSC divided by number of iterations.  kvm unittest this value, 
> > > here's
> > > what it does (removed some dead code):
> > > 
> > > #define GOAL (1ull << 30)
> > > 
> > >do {
> > >iterations *= 2;
> > >t1 = rdtsc();
> > > 
> > >for (i = 0; i < iterations; ++i)
> > >func();
> > >t2 = rdtsc();
> > >} while ((t2 - t1) < GOAL);
> > >printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
> > 
> > So it's the number of cycles per run.
> > 
> > That means translated my numbers are:
> > 
> >   MMIO: 4307
> >   PIO: 3658
> >   HCALL: 1756
> > 
> > MMIO - PIO = 649
> > 
> > which aligns roughly with your PV MMIO callback.
> > 
> > My MMIO benchmark was to poke the LAPIC version register. That does go 
> > through instruction emulation, no?
> > 
> > 
> > Alex
> 
> Why wouldn't it?
> 
Intel decodes access to apic page, but we use it only for fast eoi.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] kvm: IOMMU read-only mapping support

2013-01-28 Thread Gleb Natapov

On Tue, Jan 29, 2013 at 11:06:43AM +0800, Xiao Guangrong wrote:
> On 01/28/2013 06:59 PM, Gleb Natapov wrote:
> > On Fri, Jan 25, 2013 at 11:28:40AM +0800, Xiao Guangrong wrote:
> >> On 01/25/2013 09:17 AM, Takuya Yoshikawa wrote:
> >>> On Thu, 24 Jan 2013 15:03:57 -0700
> >>> Alex Williamson  wrote:
> >>>
> >>>> A couple patches to make KVM IOMMU support honor read-only mappings.
> >>>> This causes an un-map, re-map when the read-only flag changes and
> >>>> makes use of it when setting IOMMU attributes.  Thanks,
> >>>
> >>> Looks good to me.
> >>>
> >>> I think I can naturally update my patch after this gets merged.
> >>>
> >>
> >> Please wait.
> >>
> >> The commit c972f3b1 changed the write-protect behaviour - it does
> >> wirte-protection only when dirty flag is set.
> >> [ I did not see this commit when we discussed the problem before. ]
> >>
> >> Further more, i notice that write-protect is not enough, when do sync
> >> shadow page:
> >>
> >> FNAME(sync_page):
> >>
> >>host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
> >>
> >>set_spte(vcpu, &sp->spt[i], pte_access,
> >> PT_PAGE_TABLE_LEVEL, gfn,
> >> spte_to_pfn(sp->spt[i]), true, false,
> >> host_writable);
> >>
> >> It sets spte based on the old value that means the readonly flag check
> >> is missed. We need to call kvm_arch_flush_shadow_all under this case.
> > Why not just disallow changing memory region KVM_MEM_READONLY flag
> > without deleting the region?
> 
> It will introduce some restriction when VM-sharing-mem is being implemented,
> but we need to do some optimization for it, at least, properly write-protect
> readonly pages (fix sync_page()) instead of zap_all_page.
> 
What is VM-sharing-mem?

> So, i guess we can do the simple fix first.
> 
By simple fix you mean calling kvm_arch_flush_shadow_all() on READONLY
flag change?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/11] ksm: NUMA trees and page migration

2013-01-29 Thread Gleb Natapov

On Mon, Jan 28, 2013 at 05:07:15PM -0800, Hugh Dickins wrote:
> On Mon, 28 Jan 2013, Andrew Morton wrote:
> > On Fri, 25 Jan 2013 17:53:10 -0800 (PST)
> > Hugh Dickins  wrote:
> > 
> > > Here's a KSM series
> > 
> > Sanity check: do you have a feeling for how useful KSM is? 
> > Performance/space improvements for typical (or atypical) workloads? 
> > Are people using it?  Successfully?
> > 
> > IOW, is it justifying itself?
> 
> I have no idea!  To me it's simply a technical challenge - and I agree
> with your implication that that's not a good enough justification.
> 
> I've added Marcelo and Gleb and the KVM list to the Cc:
> my understanding is that it's the KVM guys who really appreciate KSM.
> 
KSM is used on all RH kvm deployments for memory overcommit. I asked
around for numbers and got the answer that it allows to squeeze anywhere
between 10% and 100% more VMs on the same machine depends on a type of
a guest OS and how similar workloads of VMs are. And management tries
to keep VMs with similar OSes/workloads on the same host to gain more
from KSM.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 06/12] KVM: MMU: introduce a static table to map guest access to spte access

2013-01-29 Thread Gleb Natapov

On Mon, Jan 28, 2013 at 11:07:58PM -0200, Marcelo Tosatti wrote:
> On Mon, Jan 28, 2013 at 10:07:15PM -0200, Marcelo Tosatti wrote:
> > On Fri, Jan 25, 2013 at 10:46:31AM +0800, Xiao Guangrong wrote:
> > > On 01/25/2013 08:15 AM, Marcelo Tosatti wrote:
> > > > On Wed, Jan 23, 2013 at 06:07:20PM +0800, Xiao Guangrong wrote:
> > > >> It makes set_spte more clean and reduces branch prediction
> > > >>
> > > >> Signed-off-by: Xiao Guangrong 
> > > >> ---
> > > >>  arch/x86/kvm/mmu.c |   37 ++---
> > > >>  1 files changed, 26 insertions(+), 11 deletions(-)
> > > > 
> > > > Don't see set_spte as being a performance problem?
> > > > IMO the current code is quite simple.
> > > 
> > > Yes, this is not a performance problem.
> > > 
> > > I just dislike this many continuous "if"-s in the function:
> > > 
> > > if (xxx)
> > >   xxx
> > > if (xxx)
> > >   xxx
> > > 
> > > 
> > > Totally, it has 7 "if"-s before this patch.
> > > 
> > > Okay, if you think this is unnecessary, i will drop this patch. :)
> > 
> > Yes, please (unless you can show set_spte is a performance problem).
> 
> Same thing for spte fast drop: is it a performance problem? 
> 
I like spte fast drop because it gets rid of "goto restart" pattern.
for_each_spte_in_rmap_safe() can be the alternative.

> Please try to group changes into smaller, less controversial sets with 
> a clear goal:
> 
> - Debated performance improvement.
> - Cleanups (eg mmu_set_spte argument removal).
> - Bug fixes.
> - Performance improvements.
> 
> Thanks.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V3 RESEND RFC 0/2] kvm: Improving undercommit scenarios

2013-01-29 Thread Gleb Natapov

On Tue, Jan 22, 2013 at 01:08:54PM +0530, Raghavendra K T wrote:
>  In some special scenarios like #vcpu <= #pcpu, PLE handler may
> prove very costly, because there is no need to iterate over vcpus
> and do unsuccessful yield_to burning CPU.
> 
>  The first patch optimizes all the yield_to by bailing out when there
>  is no need to continue in yield_to (i.e., when there is only one task 
>  in source and target rq).
> 
>  Second patch uses that in PLE handler. Further when a yield_to fails
>  we do not immediately go out of PLE handler instead we try thrice 
>  to have better statistical possibility of false return. Otherwise that
>  would affect moderate overcommit cases.
>  
>  Result on 3.7.0-rc6 kernel shows around 140% improvement for ebizzy 1x and
>  around 51% for dbench 1x  with 32 core PLE machine with 32 vcpu guest.
> 
> 
> base = 3.7.0-rc6 
> machine: 32 core mx3850 x5 PLE mc
> 
> --+---+---+---++---+
>ebizzy (rec/sec higher is beter)
> --+---+---+---++---+
> basestdev   patched stdev   %improve 
> --+---+---+---++---+
> 1x   2511.300021.54096051.8000   170.2592   140.98276   
> 2x   2679.4000   332.44822692.3000   251.4005 0.48145
> 3x   2253.5000   266.42432192.1667   178.9753-2.72169
> --+---+---+---++---+
> 
> --+---+---+---++---+
> dbench (throughput in MB/sec. higher is better)
> --+---+---+---++---+
> basestdev   patched stdev   %improve 
> --+---+---+---++---+
> 1x  6677.4080   638.504810098.0060   3449.7026 51.22643
> 2x  2012.676064.76422019.0440 62.6702   0.31639
> 3x  1302.078340.83361292.7517 27.0515  -0.71629
> --+---+---+---++---+
> 
> Here is the refernce of no ple result.
>  ebizzy-1x_nople 7592.6000 rec/sec
>  dbench_1x_nople 7853.6960 MB/sec
> 
> The result says we can still improve by 60% for ebizzy, but overall we are
> getting impressive performance with the patches.
> 
>  Changes Since V2:
>  - Dropped global measures usage patch (Peter Zilstra)
>  - Do not bail out on first failure (Avi Kivity)
>  - Try thrice for the failure of yield_to to get statistically more correct
>behaviour.
> 
>  Changes since V1:
>  - Discard the idea of exporting nrrunning and optimize in core scheduler 
> (Peter)
>  - Use yield() instead of schedule in overcommit scenarios (Rik)
>  - Use loadavg knowledge to detect undercommit/overcommit
> 
>  Peter Zijlstra (1):
>   Bail out of yield_to when source and target runqueue has one task
> 
>  Raghavendra K T (1):
>   Handle yield_to failure return for potential undercommit case
> 
>  Please let me know your comments and suggestions.
> 
>  Link for the discussion of V3 original:
>  https://lkml.org/lkml/2012/11/26/166
> 
>  Link for V2:
>  https://lkml.org/lkml/2012/10/29/287
> 
>  Link for V1:
>  https://lkml.org/lkml/2012/9/21/168
> 
>  kernel/sched/core.c | 25 +++--
>  virt/kvm/kvm_main.c | 26 --
>  2 files changed, 35 insertions(+), 16 deletions(-)
Applied, thanks.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/3] x86: KVM: Fix warnings by returning bool

2013-01-29 Thread Gleb Natapov

On Sun, Jan 27, 2013 at 05:40:00PM +0100, Emil Goode wrote:
> The function pointer x2apic_available in struct hypervisor_x86 is
> supposed to point to functions that return bool. This patch changes
> the return type of the kvm_para_available function to bool.
> 
> Sparse warnings:
> 
> arch/x86/kernel/kvm.c:508:35: warning:
>   incorrect type in initializer (different signedness)
>   arch/x86/kernel/kvm.c:508:35:
>   expected bool ( *x2apic_available )( ... )
>   arch/x86/kernel/kvm.c:508:35:
>   got int ( static inline [toplevel] * )( ... )
> 
> arch/x86/kernel/kvm.c:508:2: warning:
>   initialization from incompatible pointer type [enabled by default]
> 
> arch/x86/kernel/kvm.c:508:2: warning:
>   (near initialization for ‘x86_hyper_kvm.x2apic_available’)
>   [enabled by default]
> 
The breakage was in tip x86/apic tree only and it was fixed by
3b4a505821 there.

> Signed-off-by: Emil Goode 
> ---
>  arch/x86/include/asm/kvm_para.h |8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 0be7039..695399f 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -85,13 +85,13 @@ static inline long kvm_hypercall4(unsigned int nr, 
> unsigned long p1,
>   return ret;
>  }
>  
> -static inline int kvm_para_available(void)
> +static inline bool kvm_para_available(void)
>  {
>   unsigned int eax, ebx, ecx, edx;
>   char signature[13];
>  
>   if (boot_cpu_data.cpuid_level < 0)
> - return 0;   /* So we don't blow up on old processors */
> + return false;   /* So we don't blow up on old processors */
>  
>   if (cpu_has_hypervisor) {
>   cpuid(KVM_CPUID_SIGNATURE, &eax, &ebx, &ecx, &edx);
> @@ -101,10 +101,10 @@ static inline int kvm_para_available(void)
>   signature[12] = 0;
>  
>   if (strcmp(signature, "KVMKVMKVM") == 0)
> - return 1;
> + return true;
>   }
>  
> - return 0;
> + return false;
>  }
>  
>  static inline unsigned int kvm_arch_para_features(void)
> -- 
> 1.7.10.4

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86, x2apic: Only WARN on broken BIOSes inside a virtual guest

2013-01-31 Thread Gleb Natapov

On Thu, Jan 31, 2013 at 11:40:35AM -0500, Don Zickus wrote:
> In commit "41750d3 x86, x2apic: Enable the bios request for x2apic optout"
> it was explained how OEMs are trying to opt out of using x2apics.
> 
> That commit moved code around and screamed with a WARN if the BIOS
> opted out of x2apic mode.  Fair enough.
> 
> This code hit our RHEL distro and OEMs started complaining that the
> WARN is scaring their customers and asked we tone it down to a
> pr_warn().
> 
> Before we did that, we thought we should change it upstream too.
> Upstream complained that WARN was necessary due to a serious
> security threat, namely irq injections.  Hard to argue that.
> 
> This left us between a rock and a hard place.  We toned down the
> WARN in RHEL to keep our customers happy.  But this leaves us with
> a perpetual patch in RHEL and possibly covering up a security threat.
> 
> I poked around to understand the nature of the security threat and why
> OEMs would want to leave themselves vulnerable.  The only security
> threat I could find was this whitepaper talking about Xen and irq
> injections:
> 
> http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
> 
> After talking with folks, the threat of irq injections on virtual guests
> made sense.  However, when discussing if this was possible on bare metal
> machines, we could not come up with a plausible scenario.
> 
The irq injections is something that a guest with assigned device does
to attack a hypervisor it runs on. Interrupt remapping protects host
from this attack. According to pdf above if x2apic is disabled in a
hypervisor interrupt remapping can be bypassed and leave host vulnerable
to guest attack. This means that situation is exactly opposite: warning
has sense on a bare metal, but not in a guest. I am not sure that there is
a hypervisor that emulates interrupt remapping device though and without
it the warning cannot be triggered in a guest.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86, x2apic: Only WARN on broken BIOSes inside a virtual guest

2013-01-31 Thread Gleb Natapov

On Thu, Jan 31, 2013 at 02:34:27PM -0500, Don Zickus wrote:
> On Thu, Jan 31, 2013 at 08:52:00PM +0200, Gleb Natapov wrote:
> > > http://www.invisiblethingslab.com/resources/2011/Software%20Attacks%20on%20Intel%20VT-d.pdf
> > > 
> > > After talking with folks, the threat of irq injections on virtual guests
> > > made sense.  However, when discussing if this was possible on bare metal
> > > machines, we could not come up with a plausible scenario.
> > > 
> > The irq injections is something that a guest with assigned device does
> > to attack a hypervisor it runs on. Interrupt remapping protects host
> > from this attack. According to pdf above if x2apic is disabled in a
> > hypervisor interrupt remapping can be bypassed and leave host vulnerable
> > to guest attack. This means that situation is exactly opposite: warning
> > has sense on a bare metal, but not in a guest. I am not sure that there is
> > a hypervisor that emulates interrupt remapping device though and without
> > it the warning cannot be triggered in a guest.
> 
> Ah, it makes sense.  Not sure how I got it backwards then.  So my patch is
> pointless then?  I'll asked for it to be dropped.
Yes, it is backwards.

> 
> >From my previous discussions with folks, is that KVM was protected from
> this type of attack.  Is that still true?
> 
Copying Alex. He said that to use device assignment without interrupt
remapping customer needs to opt-in explicitly. Not sure what happens
with interrupt remapping but with x2apic disabled.

The problem is not limited to virtualization BTW. Any vfio user may
attack kernel without interrupt remapping so vfio has the same opt-in.
 
--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 05/12] KVM: MMU: introduce vcpu_adjust_access

2013-01-24 Thread Gleb Natapov

On Wed, Jan 23, 2013 at 06:06:36PM +0800, Xiao Guangrong wrote:
> Introduce it to split the code of adjusting pte_access from the large
> function of set_spte
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c |   63 
> +---
>  1 files changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index af8bcb2..43b7e0c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2324,25 +2324,18 @@ static int mmu_need_write_protect(struct kvm_vcpu 
> *vcpu, gfn_t gfn,
>   return 0;
>  }
> 
> -static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> - unsigned pte_access, int level,
> - gfn_t gfn, pfn_t pfn, bool speculative,
> - bool can_unsync, bool host_writable)
> +/*
> + * Return -1 if a race condition is detected, 1 if @gfn need to be
> + * write-protected, otherwise 0 is returned.
> + */
That's a little bit crafty.

Isn't it better to handle race condition in set_spte() explicitly?
Something like do:

 if (host_writable && (pte_access & ACC_WRITE_MASK) &&
level > PT_PAGE_TABLE_LEVEL && has_wrprotected_page(vcpu->kvm, gfn, 
level))
return 0;

before calling vcpu_adjust_access() in set_spte()?

Or even do:

 if ((pte_access & ACC_WRITE_MASK) && level > PT_PAGE_TABLE_LEVEL &&
   has_wrprotected_page(vcpu->kvm, gfn, level))
return 0;

After calling vcpu_adjust_access().

The later will create read only large page mapping where now it is not
created, but it shouldn't be a problem as far as I see.

> +static int vcpu_adjust_access(struct kvm_vcpu *vcpu, u64 *sptep,
> +   unsigned *pte_access, int level, gfn_t gfn,
> +   bool can_unsync, bool host_writable)
>  {
> - u64 spte;
> - int ret = 0;
> -
> - if (set_mmio_spte(sptep, gfn, pfn, pte_access))
> - return 0;
> + if (!host_writable)
> + *pte_access &= ~ACC_WRITE_MASK;
> 
> - spte = PT_PRESENT_MASK;
> -
> - if (host_writable)
> - spte |= SPTE_HOST_WRITEABLE;
> - else
> - pte_access &= ~ACC_WRITE_MASK;
> -
> - if (pte_access & ACC_WRITE_MASK) {
> + if (*pte_access & ACC_WRITE_MASK) {
>   /*
>* Other vcpu creates new sp in the window between
>* mapping_level() and acquiring mmu-lock. We can
> @@ -2351,7 +2344,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>*/
>   if (level > PT_PAGE_TABLE_LEVEL &&
> has_wrprotected_page(vcpu->kvm, gfn, level))
> - goto done;
> + return -1;
> 
>   /*
>* Optimization: for pte sync, if spte was writable the hash
> @@ -2360,17 +2353,41 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
>* Same reasoning can be applied to dirty page accounting.
>*/
>   if (!can_unsync && is_writable_pte(*sptep))
> - goto out_access_adjust;
> + return 0;
> 
>   if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
>   pgprintk("%s: found shadow page for %llx, marking ro\n",
>__func__, gfn);
> - ret = 1;
> - pte_access &= ~ACC_WRITE_MASK;
> +
> + *pte_access &= ~ACC_WRITE_MASK;
> + return 1;
>   }
>   }
> 
> -out_access_adjust:
> + return 0;
> +}
> +
> +static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
> + unsigned pte_access, int level,
> + gfn_t gfn, pfn_t pfn, bool speculative,
> + bool can_unsync, bool host_writable)
> +{
> + u64 spte;
> + int ret;
> +
> + if (set_mmio_spte(sptep, gfn, pfn, pte_access))
> + return 0;
> +
> + ret = vcpu_adjust_access(vcpu, sptep, &pte_access, level, gfn,
> +  can_unsync, host_writable);
> + if (ret < 0)
> + return 0;
> +
> + spte = PT_PRESENT_MASK;
> +
> + if (host_writable)
> + spte |= SPTE_HOST_WRITEABLE;
> +
>   if (!speculative)
>   spte |= shadow_accessed_mask;
> 
> @@ -2399,7 +2416,7 @@ out_access_adjust:
> 
>   if (mmu_spte_update(sptep, spte))
>   kvm_flush_remote_tlbs(vcpu->kvm);
> -done:
> +
>   return ret;
>  }
> 
> -- 
> 1.7.7.6

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 01/12] KVM: MMU: lazily drop large spte

2013-01-27 Thread Gleb Natapov

On Wed, Jan 23, 2013 at 06:04:17PM +0800, Xiao Guangrong wrote:
> Do not drop large spte until it can be insteaded by small pages so that
> the guest can happliy read memory through it
> 
> The idea is from Avi:
> | As I mentioned before, write-protecting a large spte is a good idea,
> | since it moves some work from protect-time to fault-time, so it reduces
> | jitter.  This removes the need for the return value.
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c |   21 ++---
>  1 files changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9f628f7..0f90269 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1105,7 +1105,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 
> *sptep)
> 
>  /*
>   * Write-protect on the specified @sptep, @pt_protect indicates whether
> - * spte writ-protection is caused by protecting shadow page table.
> + * spte write-protection is caused by protecting shadow page table.
>   * @flush indicates whether tlb need be flushed.
>   *
>   * Note: write protection is difference between drity logging and spte
> @@ -1114,31 +1114,23 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, 
> u64 *sptep)
>   *   its dirty bitmap is properly set.
>   * - for spte protection, the spte can be writable only after unsync-ing
>   *   shadow page.
> - *
> - * Return true if the spte is dropped.
>   */
> -static bool
> +static void
>  spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
Since return value is not longer used make the function return true if flush is 
needed
instead of returning it via pointer to a variable.

>  {
>   u64 spte = *sptep;
> 
>   if (!is_writable_pte(spte) &&
> !(pt_protect && spte_is_locklessly_modifiable(spte)))
> - return false;
> + return;
> 
>   rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);
> 
> - if (__drop_large_spte(kvm, sptep)) {
> - *flush |= true;
> - return true;
> - }
> -
>   if (pt_protect)
>   spte &= ~SPTE_MMU_WRITEABLE;
>   spte = spte & ~PT_WRITABLE_MASK;
> 
>   *flush |= mmu_spte_update(sptep, spte);
> - return false;
>  }
> 
>  static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
> @@ -1150,11 +1142,8 @@ static bool __rmap_write_protect(struct kvm *kvm, 
> unsigned long *rmapp,
> 
>   for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
>   BUG_ON(!(*sptep & PT_PRESENT_MASK));
> - if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
> - sptep = rmap_get_first(*rmapp, &iter);
> - continue;
> - }
> 
> + spte_write_protect(kvm, sptep, &flush, pt_protect);
>   sptep = rmap_get_next(&iter);
>   }
> 
> @@ -2611,6 +2600,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, 
> int write,
>   break;
>   }
> 
> + drop_large_spte(vcpu, iterator.sptep);
> +
>   if (!is_shadow_present_pte(*iterator.sptep)) {
>   u64 base_addr = iterator.addr;
> 
> -- 
> 1.7.7.6

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2 10/12] KVM: MMU: unify the code of walking pte list

2013-01-27 Thread Gleb Natapov

On Wed, Jan 23, 2013 at 06:09:34PM +0800, Xiao Guangrong wrote:
> Walking parent spte and walking ramp have same logic, this patch introduces
> for_each_spte_in_pte_list to integrate their code
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c   |  199 
> ++
>  arch/x86/kvm/mmu_audit.c |5 +-
>  2 files changed, 97 insertions(+), 107 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 55198a1..b7da3fb 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -968,26 +968,75 @@ static void pte_list_remove(u64 *spte, unsigned long 
> *pte_list)
>   }
>  }
> 
> -typedef void (*pte_list_walk_fn) (u64 *spte);
> -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn)
> +/*
> + * Used by the following functions to iterate through the sptes linked by a
> + * pte_list.  All fields are private and not assumed to be used outside.
> + */
> +struct pte_list_iterator {
> + /* private fields */
> + struct pte_list_desc *desc; /* holds the sptep if not NULL */
> + int pos;/* index of the sptep */
> +};
> +
> +/*
> + * Iteration must be started by this function.  This should also be used 
> after
> + * removing/dropping sptes from the pte_list link because in such cases the
> + * information in the itererator may not be valid.
> + *
> + * Returns sptep if found, NULL otherwise.
> + */
> +static u64 *pte_list_get_first(unsigned long pte_list,
> +struct pte_list_iterator *iter)
>  {
> - struct pte_list_desc *desc;
> - int i;
> + if (!pte_list)
> + return NULL;
> 
> - if (!*pte_list)
> - return;
> + if (!(pte_list & 1)) {
> + iter->desc = NULL;
> + return (u64 *)pte_list;
> + }
> 
> - if (!(*pte_list & 1))
> - return fn((u64 *)*pte_list);
> + iter->desc = (struct pte_list_desc *)(pte_list & ~1ul);
> + iter->pos = 0;
> + return iter->desc->sptes[iter->pos];
> +}
> 
> - desc = (struct pte_list_desc *)(*pte_list & ~1ul);
> - while (desc) {
> - for (i = 0; i < PTE_LIST_EXT && desc->sptes[i]; ++i)
> - fn(desc->sptes[i]);
> - desc = desc->more;
> +/*
> + * Must be used with a valid iterator: e.g. after pte_list_get_next().
> + *
> + * Returns sptep if found, NULL otherwise.
> + */
> +static u64 *pte_list_get_next(struct pte_list_iterator *iter)
> +{
> + if (iter->desc) {
> + if (iter->pos < PTE_LIST_EXT - 1) {
> + u64 *sptep;
> +
> + ++iter->pos;
> + sptep = iter->desc->sptes[iter->pos];
> + if (sptep)
> + return sptep;
> + }
> +
> + iter->desc = iter->desc->more;
> +
> + if (iter->desc) {
> + iter->pos = 0;
> + /* desc->sptes[0] cannot be NULL */
> + return iter->desc->sptes[iter->pos];
> + }
>   }
> +
> + return NULL;
>  }
> 
> +#define for_each_spte_in_pte_list(pte_list, iter, spte)  \
> +for (spte = pte_list_get_first(pte_list, &(iter));   \
> +   spte != NULL; spte = pte_list_get_next(&(iter)))
> +
> +#define for_each_spte_in_rmap(rmap, iter, spte)  \
> +for_each_spte_in_pte_list(rmap, iter, spte)
> +
>  static unsigned long *__gfn_to_rmap(gfn_t gfn, int level,
>   struct kvm_memory_slot *slot)
>  {
> @@ -1039,67 +1088,6 @@ static void rmap_remove(struct kvm *kvm, u64 *spte)
>   pte_list_remove(spte, rmapp);
>  }
> 
> -/*
> - * Used by the following functions to iterate through the sptes linked by a
> - * rmap.  All fields are private and not assumed to be used outside.
> - */
> -struct rmap_iterator {
> - /* private fields */
> - struct pte_list_desc *desc; /* holds the sptep if not NULL */
> - int pos;/* index of the sptep */
> -};
> -
> -/*
> - * Iteration must be started by this function.  This should also be used 
> after
> - * removing/dropping sptes from the rmap link because in such cases the
> - * information in the itererator may not be valid.
> - *
> - * Returns sptep if found, NULL otherwise.
> - */
> -static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter)
> -{
> - if (!rmap)
> - return NULL;
> -
> - if (!(rmap & 1)) {
> - iter->desc = NULL;
> - return (u64 *)rmap;
> - }
> -
> - iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
> - iter->pos = 0;
> - return iter->desc->sptes[iter->pos];
> -}
> -
> -/*
> - * Must be used with a valid iterator: e.g. after rmap_get_first().
> - *
> - * Returns sptep if found, NULL otherwise.
> - */
> -static u64 *rmap_get_next(struct rmap_iterator *iter)
> -{
> - if (iter->desc) {
> - if (

Re: [PATCH 02/18] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v5

2013-01-27 Thread Gleb Natapov

On Fri, Jan 25, 2013 at 02:32:56PM -0800, Andi Kleen wrote:
> From: Andi Kleen 
> 
> This is not arch perfmon, but older CPUs will just ignore it. This makes
> it possible to do at least some TSX measurements from a KVM guest
> 
> Cc: g...@redhat.com
> v2: Various fixes to address review feedback
> v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
> v4: Use reserved bits for #GP
> v5: Remove obsolete argument
> Signed-off-by: Andi Kleen 
Acked-by: Gleb Natapov 

I suppose it will go through tip tree with all the other patches this one
depends on.

> ---
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/pmu.c  |   25 -
>  2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index dc87b65..703a1f8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -320,6 +320,7 @@ struct kvm_pmu {
>   u64 global_ovf_ctrl;
>   u64 counter_bitmask[2];
>   u64 global_ctrl_mask;
> + u64 reserved_bits;
>   u8 version;
>   struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
>   struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index cfc258a..9317c43 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -160,7 +160,7 @@ static void stop_counter(struct kvm_pmc *pmc)
>  
>  static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>   unsigned config, bool exclude_user, bool exclude_kernel,
> - bool intr)
> + bool intr, bool intx, bool intx_cp)
>  {
>   struct perf_event *event;
>   struct perf_event_attr attr = {
> @@ -173,6 +173,10 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 
> type,
>   .exclude_kernel = exclude_kernel,
>   .config = config,
>   };
> + if (intx)
> + attr.config |= HSW_INTX;
> + if (intx_cp)
> + attr.config |= HSW_INTX_CHECKPOINTED;
>  
>   attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
>  
> @@ -226,7 +230,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 
> eventsel)
>  
>   if (!(eventsel & (ARCH_PERFMON_EVENTSEL_EDGE |
>   ARCH_PERFMON_EVENTSEL_INV |
> - ARCH_PERFMON_EVENTSEL_CMASK))) {
> + ARCH_PERFMON_EVENTSEL_CMASK |
> + HSW_INTX |
> + HSW_INTX_CHECKPOINTED))) {
>   config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
>   unit_mask);
>   if (config != PERF_COUNT_HW_MAX)
> @@ -239,7 +245,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 
> eventsel)
>   reprogram_counter(pmc, type, config,
>   !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
>   !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
> - eventsel & ARCH_PERFMON_EVENTSEL_INT);
> + eventsel & ARCH_PERFMON_EVENTSEL_INT,
> + (eventsel & HSW_INTX),
> + (eventsel & HSW_INTX_CHECKPOINTED));
>  }
>  
>  static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> @@ -256,7 +264,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, 
> u8 en_pmi, int idx)
>   arch_events[fixed_pmc_events[idx]].event_type,
>   !(en & 0x2), /* exclude user */
>   !(en & 0x1), /* exclude kernel */
> - pmi);
> + pmi, false, false);
>  }
>  
>  static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> @@ -400,7 +408,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 
> data)
>   } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
>   if (data == pmc->eventsel)
>   return 0;
> - if (!(data & 0x0020ull)) {
> + if (!(data & pmu->reserved_bits)) {
>   reprogram_gp_counter(pmc, data);
>   return 0;
>   }
> @@ -442,6 +450,7 @@ void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu)
>   pmu->counter_bitmask[KVM_PMC_GP] = 0;
>   pmu->counter_bitmask[KVM_PMC_FIXED] = 0;
>   pmu->version = 0;
> + pmu->reserved_bits = 0x0020ull;
>  
>   entry = kvm_find_cpuid_entry(vcpu, 0xa, 0);
>   if (!entry)
&

Re: [PATCH 0/2] kvm: IOMMU read-only mapping support

2013-01-27 Thread Gleb Natapov

On Thu, Jan 24, 2013 at 03:03:57PM -0700, Alex Williamson wrote:
> A couple patches to make KVM IOMMU support honor read-only mappings.
> This causes an un-map, re-map when the read-only flag changes and
> makes use of it when setting IOMMU attributes.  Thanks,
> 
> Alex
> 
Applied, thanks.

> ---
> 
> Alex Williamson (2):
>   kvm: Force IOMMU remapping on memory slot read-only flag changes
>   kvm: Obey read-only mappings in iommu
> 
> 
>  virt/kvm/iommu.c|4 +++-
>  virt/kvm/kvm_main.c |   28 
>  2 files changed, 27 insertions(+), 5 deletions(-)

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] kvm: IOMMU read-only mapping support

2013-01-28 Thread Gleb Natapov

On Fri, Jan 25, 2013 at 11:28:40AM +0800, Xiao Guangrong wrote:
> On 01/25/2013 09:17 AM, Takuya Yoshikawa wrote:
> > On Thu, 24 Jan 2013 15:03:57 -0700
> > Alex Williamson  wrote:
> > 
> >> A couple patches to make KVM IOMMU support honor read-only mappings.
> >> This causes an un-map, re-map when the read-only flag changes and
> >> makes use of it when setting IOMMU attributes.  Thanks,
> > 
> > Looks good to me.
> > 
> > I think I can naturally update my patch after this gets merged.
> > 
> 
> Please wait.
> 
> The commit c972f3b1 changed the write-protect behaviour - it does
> wirte-protection only when dirty flag is set.
> [ I did not see this commit when we discussed the problem before. ]
> 
> Further more, i notice that write-protect is not enough, when do sync
> shadow page:
> 
> FNAME(sync_page):
> 
>   host_writable = sp->spt[i] & SPTE_HOST_WRITEABLE;
> 
>   set_spte(vcpu, &sp->spt[i], pte_access,
>PT_PAGE_TABLE_LEVEL, gfn,
>spte_to_pfn(sp->spt[i]), true, false,
>host_writable);
> 
> It sets spte based on the old value that means the readonly flag check
> is missed. We need to call kvm_arch_flush_shadow_all under this case.
Why not just disallow changing memory region KVM_MEM_READONLY flag
without deleting the region?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/5] [v3] fix illegal use of __pa() in KVM code

2013-02-25 Thread Gleb Natapov

On Sun, Feb 24, 2013 at 04:28:58PM -0500, Peter Hurley wrote:
> 
> On Tue, 2013-01-22 at 13:24 -0800, Dave Hansen wrote:
> > This series fixes a hard-to-debug early boot hang on 32-bit
> > NUMA systems.  It adds coverage to the debugging code,
> > adds some helpers, and eventually fixes the original bug I
> > was hitting.
> 
> Hi Dave,
> 
> Now that the alloc_remap() has been/is being removed, is most/all of
> this being reverted?
> 
> I ask because I was fixing a different bug in KVM's para-virt clock and
> saw part of this series (from [PATCH 5/5] fix kvm's use of __pa() on
> percpu areas):
> 
> diff -puN arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 
> arch/x86/kernel/kvmclock.c
> --- linux-2.6.git/arch/x86/kernel/kvmclock.c~fix-kvm-__pa-use-on-percpu-areas 
>   2013-01-22 13:17:16.428317508 -0800
> +++ linux-2.6.git-dave/arch/x86/kernel/kvmclock.c   2013-01-22 
> 13:17:16.432317541 -0800
> @@ -162,8 +162,8 @@ int kvm_register_clock(char *txt)
> int low, high, ret;
> struct pvclock_vcpu_time_info *src = &hv_clock[cpu].pvti;
>  
> -   low = (int)__pa(src) | 1;
> -   high = ((u64)__pa(src) >> 32);
> +   low = (int)slow_virt_to_phys(src) | 1;
> +   high = ((u64)slow_virt_to_phys(src) >> 32);
> ret = native_write_msr_safe(msr_kvm_system_time, low, high);
> printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
>cpu, high, low, txt);
> 
I think this hunk was converted by mistake since the code does not use
per cpu allocations, but since __pa() and slow_virt_to_phys() should
return the same thing here it does not matter much.

> which confused me because hv_clock is the __va of allocated physical
> memory, not a per-cpu variable.
> 
>   mem = memblock_alloc(size, PAGE_SIZE);
>   if (!mem)
>   return;
>   hv_clock = __va(mem);
> 
> So in short, my questions are:
> 1) is the slow_virt_to_phys() necessary anymore?
> 2) if yes, does it apply to the code above?
> 3) if yes, would you explain in more detail what the 32-bit NUMA mm is
> doing, esp. wrt. when __va(__pa) is not identical across all cpus?
> 
> Regards,
> Peter Hurley
> 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[GIT PULL] KVM ARM compilation fixes for the 3.9 merge window

2013-02-25 Thread Gleb Natapov


Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

To receive the following KVM bug fixes. They fix ARM KVM compilation breakage
due to changes from kvm.git.


Marc Zyngier (3):
  ARM: KVM: fix kvm_arch_{prepare,commit}_memory_region
  ARM: KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
  ARM: KVM: fix compilation after removal of user_alloc from struct 
kvm_memory_slot

 arch/arm/include/asm/kvm_host.h |2 +-
 arch/arm/kvm/arm.c  |4 ++--
 arch/arm/kvm/mmu.c  |5 -
 3 files changed, 3 insertions(+), 8 deletions(-)

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[GIT PULL] KVM fixe for 3.9

2013-02-28 Thread Gleb Natapov


Linus,

Please pull from

git://git.kernel.org/pub/scm/virt/kvm/kvm.git master

To receive the following KVM bug fix.


Peter Hurley (1):
  x86/kvm: Fix pvclock vsyscall fixmap

 arch/x86/kernel/pvclock.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d

2013-02-19 Thread Gleb Natapov

On Mon, Feb 18, 2013 at 08:12:21PM -0500, Peter Hurley wrote:
> On Mon, 2013-02-18 at 19:59 -0300, Marcelo Tosatti wrote:
> > On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote:
> > > On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote:
> > > > On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote:
> > > > > On 02/12/2013 04:26 PM, Peter Hurley wrote:
> > > > > > With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA
> > > > > > device/console):
> > > > > > 
> > > > > > [0.666410] udevd[97]: starting version 175
> > > > > > [0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 
> > > > > > ip 7fff069e277f sp 7fff068c9ef8 error d
> > > > > > 
> > > > > > and boots to an initramfs prompt.
> > > > > > 
> > > > > > git bisect (log attached) blames:
> > > > > > 
> > > > > > commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
> > > > > > Merge: 3596f5b 949db15
> > > > > > Author: H. Peter Anvin 
> > > > > > Date:   Fri Jan 25 16:31:21 2013 -0800
> > > > > > 
> > > > > > Merge tag 'v3.8-rc5' into x86/mm
> > > > > > 
> > > > > > The __pa() fixup series that follows touches KVM code that is 
> > > > > > not
> > > > > > present in the existing branch based on v3.7-rc5, so merge in 
> > > > > > the
> > > > > > current upstream from Linus.
> > > > > > 
> > > > > > Signed-off-by: H. Peter Anvin 
> > > > > > 
> > > > > > 
> > > > > > This only happens with the VGA device/console but that is the 
> > > > > > default
> > > > > > configuration for Ubuntu/KVM because it blacklists pretty much 
> > > > > > every fb
> > > > > > driver.
> > > > > > 
> > > > > 
> > > > > I am guessing this is another bad use of __pa()... need to look into 
> > > > > that.
> > 
> > Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible
> > there?
> 
> This is in the linux-next repo (any git tag after 'next-20130204' will
> reproduce this). It's a pretty large merge commit.
> 
> This doesn't happen on 3.8-rc7.
> 
> I'll try to repro this on kvm.git sometime this week. Otherwise, we can
> wait for it to show up in 3.9.
> 
Can you also drop 5dfd486c4750c9278c63fa96e6e85bdd2fb58e9d from linux-next and 
reproduce?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: general protection fault in do_msgrcv [3.8]

2013-02-20 Thread Gleb Natapov

On Wed, Feb 20, 2013 at 12:23:22PM +0400, Stanislav Kinsbursky wrote:
> 19.02.2013 22:04, Dave Jones пишет:
> >general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> >Modules linked in: can af_rxrpc binfmt_misc scsi_transport_iscsi ax25 
> >ipt_ULOG decnet nfc appletalk x25 rds ipx p8023 psnap p8022 llc irda 
> >crc_ccitt atm lockd sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 
> >xt_conntrack nf_conntrack ip6table_filter ip6_tables btusb bluetooth 
> >snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_pcm edac_core 
> >snd_page_alloc snd_timer microcode rfkill usb_debug serio_raw pcspkr snd 
> >soundcore vhost_net r8169 mii tun macvtap macvlan kvm_amd kvm
> >CPU 2
> >Pid: 887, comm: trinity-child2 Not tainted 3.8.0+ #57 Gigabyte Technology 
> >Co., Ltd. GA-MA78GM-S2H/GA-MA78GM-S2H
> >RIP: 0010:[]  [] do_msgrcv+0x22a/0x670
> >RSP: 0018:88011892be88  EFLAGS: 00010297
> >RAX:  RBX: 0001 RCX: 4000
> >RDX: 7adea6f6 RSI: 6b6b6b6b6b6b6b6b RDI: 8801189ffb60
> >RBP: 88011892bf68 R08: 0001 R09: 
> >R10:  R11:  R12: 
> >R13: 8801189ffc10 R14: 8801189ffb60 R15: 6b6b6b6b6b6b6b6b
> >FS:  7f681e955740() GS:88012f20() knlGS:
> >CS:  0010 DS:  ES:  CR0: 80050033
> >CR2: 7f681e846064 CR3: 00012553d000 CR4: 07e0
> >DR0:  DR1:  DR2: 
> >DR3:  DR6: 0ff0 DR7: 0400
> >Process trinity-child2 (pid: 887, threadinfo 88011892a000, task 
> >88010bc82490)
> >Stack:
> >  88011892beb8 88010bc82490 88010bc82490 88010bc82490
> >  8801186d8000 812ad5f0 01aba000 81c688c0
> >  7adea6f6 001f 400046a9467e 6b6b6b6b6b6b6b6b
> >Call Trace:
> >  [] ? load_msg+0x180/0x180
> >  [] ? trace_hardirqs_on_caller+0x115/0x1a0
> >  [] ? trace_hardirqs_on_thunk+0x3a/0x3f
> >  [] sys_msgrcv+0x15/0x20
> >  [] system_call_fastpath+0x16/0x1b
> >Code: 84 14 01 00 00 8b 8d 74 ff ff ff 85 c9 0f 84 52 02 00 00 48 8b 95 60 
> >ff ff ff 48 39 55 80 0f 84 4d 02 00 00 4c 89 bd 78 ff ff ff <4d> 8b 3f 48 ff 
> >45 80 4d 39 ef 75 9a 66 90 48 81 bd 78 ff ff ff
> >RIP  [] do_msgrcv+0x22a/0x670
> >  RSP 
> >---[ end trace d3cc044a84b1d828 ]---
> >
> >oopsing instruction is..
> >
> >0:   4d 8b 3fmov(%r15),%r15
> >
> >Looks like a use-after-free.
> >
> >Disassembly of ipc/msg.o shows this happens here..
> >
> > msg = ERR_PTR(-EAGAIN);
> > tmp = msq->q_messages.next;
> > 1537:   4d 8b be b0 00 00 00mov0xb0(%r14),%r15
> > while (tmp != &msq->q_messages) {
> > 153e:   4d 8d ae b0 00 00 00lea0xb0(%r14),%r13
> > 1545:   4d 39 efcmp%r13,%r15
> > 1548:   0f 84 5f 03 00 00   je 18ad 
> > 154e:   48 c7 45 80 00 00 00movq   $0x0,-0x80(%rbp)
> > 1555:   00
> > 1556:   48 c7 85 78 ff ff ffmovq   
> > $0xfff5,-0x88(%rbp)
> > 155d:   f5 ff ff ff
> > 1561:   eb 0d   jmp1570 
> > 1563:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
> > }
> > } else
> > break;
> > msg_counter++;
> > }
> > tmp = tmp->next;
> > 1568:   4d 8b 3fmov(%r15),%r15
> > if (ipcperms(ns, &msq->q_perm, S_IRUGO))
> > goto out_unlock;
> >
> > msg = ERR_PTR(-EAGAIN);
> > tmp = msq->q_messages.next;
> > while (tmp != &msq->q_messages) {
> >
> >Looks like Stanislav recently changed this code, so problem was likely 
> >introduced
> >in those changes.
> >
> 
> Hello.
> Is it easy to reproduce? Do you use KVM?
Jugging by motherboard name in the OOPs it is not KVM. And since r15 is
6b6b6b6b6b6b6b6b you need DEBUG_PAGEALLOC to reproduce.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: kvm: reset the bootstrap processor when it gets an INIT

2013-03-12 Thread Gleb Natapov

On Tue, Mar 12, 2013 at 10:25:35AM +0100, Jan Kiszka wrote:
> On 2013-03-11 20:30, Gleb Natapov wrote:
> > On Mon, Mar 11, 2013 at 08:01:30PM +0100, Jan Kiszka wrote:
> >> On 2013-03-11 19:51, Gleb Natapov wrote:
> >>>>> On Intel:
> >>>>> CPU 1  CPU 2 in a guest mode
> >>>>> send INIT
> >>>>> send SIPI
> >>>>> INIT vmexit
> >>>>> vmxoff
> >>>>> reset and start from SIPI vector
> >>>>
> >>>> Is SIPI sticky as well, even if the CPU is not in the wait-for-SIPI
> >>>> state (but runnable and in vmxon) while receiving it?
> >>>>
> >>> That what they seams to be saying:
> >>>  However, an INIT and SIPI interrupts sent to a CPU during time when
> >>>  it is in a VMX mode are remembered and delivered, perhaps hours later,
> >>>  when the CPU exits the VMX mode
> >>>
> >>> Otherwise their exploit will not work.
> >>
> >> Very weird, specifically as SIPI is not just a binary event but carries
> >> payload. Will another SIPI event overwrite the previously "saved"
> >> vector? We are deep into an underspecified area...
> > My guess is that VMX INIT blocking is done by the same mechanism as
> > INIT blocking during SMM. Obviously after exit from SMM pending
> > INIT/SIPI have to be processed.
> 
> I think this should be further examined via a test case that can run on
> real HW. Is kvm-unit-test ready for this? Then we "just" need to
> implement what you were already asking for: minimalistic nVMX tests...
> 
I do not think kvm-unit-test will run on bare metal. I once implemented
framework for interrupt injection testing that ran on bare metal too. It
was very handy to compare KVM behaviour with real HW, but it was since
folded into kvm-unit-test.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 5/5] KVM: MMU: fast invalid all mmio sptes

2013-03-17 Thread Gleb Natapov

On Fri, Mar 15, 2013 at 11:29:53PM +0800, Xiao Guangrong wrote:
> This patch tries to introduce a very simple and scale way to invalid all
> mmio sptes - it need not walk any shadow pages and hold mmu-lock
> 
> KVM maintains a global mmio invalid generation-number which is stored in
> kvm->arch.mmio_invalid_gen and every mmio spte stores the current global
> generation-number into his available bits when it is created
> 
> When KVM need zap all mmio sptes, it just simply increase the global
> generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
> then it walks the shadow page table and get the mmio spte. If the
> generation-number on the spte does not equal the global generation-number,
> it will go to the normal #PF handler to update the mmio spte
> 
> Since 19 bits are used to store generation-number on mmio spte, the
> generation-number can be round after 33554432 times. It is large enough
> for nearly all most cases, but making the code be more strong, we zap all
> shadow pages when the number is round
> 
Very nice idea, but why drop Takuya patches instead of using
kvm_mmu_zap_mmio_sptes() when generation number overflows.


> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/include/asm/kvm_host.h |2 +
>  arch/x86/kvm/mmu.c  |   61 
> +--
>  arch/x86/kvm/mmutrace.h |   17 +++
>  arch/x86/kvm/paging_tmpl.h  |7 +++-
>  arch/x86/kvm/vmx.c  |4 ++
>  arch/x86/kvm/x86.c  |6 +--
>  6 files changed, 82 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ef7f4a5..572398e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -529,6 +529,7 @@ struct kvm_arch {
>   unsigned int n_requested_mmu_pages;
>   unsigned int n_max_mmu_pages;
>   unsigned int indirect_shadow_pages;
> + unsigned int mmio_invalid_gen;
Why invalid? Should be mmio_valid_gen or mmio_current_get.

>   struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>   /*
>* Hash table of struct kvm_mmu_page.
> @@ -765,6 +766,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, 
> int slot);
>  void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
>struct kvm_memory_slot *slot,
>gfn_t gfn_offset, unsigned long mask);
> +void kvm_mmu_invalid_mmio_spte(struct kvm *kvm);
Agree with Takuya that kvm_mmu_invalidate_mmio_sptes() is a better name.

>  void kvm_mmu_zap_all(struct kvm *kvm);
>  unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
>  void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int 
> kvm_nr_mmu_pages);
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 13626f4..7093a92 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -234,12 +234,13 @@ static unsigned int get_mmio_spte_generation(u64 spte)
>  static void mark_mmio_spte(struct kvm *kvm, u64 *sptep, u64 gfn,
>  unsigned access)
>  {
> - u64 mask = generation_mmio_spte_mask(0);
> + unsigned int gen = ACCESS_ONCE(kvm->arch.mmio_invalid_gen);
> + u64 mask = generation_mmio_spte_mask(gen);
> 
>   access &= ACC_WRITE_MASK | ACC_USER_MASK;
>   mask |= shadow_mmio_mask | access | gfn << PAGE_SHIFT;
> 
> - trace_mark_mmio_spte(sptep, gfn, access, 0);
> + trace_mark_mmio_spte(sptep, gfn, access, gen);
>   mmu_spte_set(sptep, mask);
>  }
> 
> @@ -269,6 +270,34 @@ static bool set_mmio_spte(struct kvm *kvm, u64 *sptep, 
> gfn_t gfn,
>   return false;
>  }
> 
> +static bool check_mmio_spte(struct kvm *kvm, u64 spte)
> +{
> + return get_mmio_spte_generation(spte) ==
> + ACCESS_ONCE(kvm->arch.mmio_invalid_gen);
> +}
> +
> +/*
> + * The caller should protect concurrent access on
> + * kvm->arch.mmio_invalid_gen. Currently, it is used by
> + * kvm_arch_commit_memory_region and protected by kvm->slots_lock.
> + */
> +void kvm_mmu_invalid_mmio_spte(struct kvm *kvm)
> +{
> + /* Ensure update memslot has been completed. */
> + smp_mb();
What barrier this one is paired with?

> +
> +  trace_kvm_mmu_invalid_mmio_spte(kvm);
Something wrong with indentation.

> +
> + /*
> +  * The very rare case: if the generation-number is round,
> +  * zap all shadow pages.
> +  */
> + if (unlikely(kvm->arch.mmio_invalid_gen++ == MAX_GEN)) {
> + kvm->arch.mmio_invalid_gen = 0;
> + return kvm_mmu_zap_all(kvm);
> + }
> +}
> +
>  static inline u64 rsvd_bits(int s, int e)
>  {
>   return ((1ULL << (e - s + 1)) - 1) << s;
> @@ -3183,9 +3212,12 @@ static u64 walk_shadow_page_get_mmio_spte(struct 
> kvm_vcpu *vcpu, u64 addr)
>  }
> 
>  /*
> - * If it is a real mmio page fault, return 1 and emulat the instruction
> - * directly, return 0 to let CPU fault again on the address, -1 is
> - * returned if bug is detected.
> + * Retu

Re: [PATCH RFC 1/1] kvm: Use vcpu_id as pivot instead of last boosted vcpu in PLE handler

2012-09-02 Thread Gleb Natapov

On Thu, Aug 30, 2012 at 12:51:01AM +0530, Raghavendra K T wrote:
>  The idea of starting from next vcpu (source of yield_to + 1) seem to work
>  well for overcomitted guest rather than using last boosted vcpu. We can also
>  remove per VM variable with this approach.
>  
>  Iteration for eligible candidate after this patch starts from vcpu source+1
>  and ends at source-1 (after wrapping)
>  
>  Thanks Nikunj for his quick verification of the patch.
>  
>  Please let me know if this patch is interesting and makes sense.
> 
This last_boosted_vcpu thing caused us trouble during attempt to
implement vcpu destruction. It is good to see it removed from this POV.

> 8<
> From: Raghavendra K T 
> 
>  Currently we use next vcpu to last boosted vcpu as starting point
>  while deciding eligible vcpu for directed yield.
> 
>  In overcomitted scenarios, if more vcpu try to do directed yield,
>  they start from same vcpu, resulting in wastage of cpu time (because of
>  failing yields and double runqueue lock).
>  
>  Since probability of same vcpu trying to do directed yield is already
>  prevented by improved PLE handler, we can start from next vcpu from source
>  of yield_to.
> 
> Suggested-by: Srikar Dronamraju 
> Signed-off-by: Raghavendra K T 
> ---
> 
>  include/linux/kvm_host.h |1 -
>  virt/kvm/kvm_main.c  |   12 
>  2 files changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index b70b48b..64a090d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -275,7 +275,6 @@ struct kvm {
>  #endif
>   struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>   atomic_t online_vcpus;
> - int last_boosted_vcpu;
>   struct list_head vm_list;
>   struct mutex lock;
>   struct kvm_io_bus *buses[KVM_NR_BUSES];
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2468523..65a6c83 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1584,7 +1584,6 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>   struct kvm *kvm = me->kvm;
>   struct kvm_vcpu *vcpu;
> - int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
>   int yielded = 0;
>   int pass;
>   int i;
> @@ -1594,21 +1593,18 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>* currently running, because it got preempted by something
>* else and called schedule in __vcpu_run.  Hopefully that
>* VCPU is holding the lock that we need and will release it.
> -  * We approximate round-robin by starting at the last boosted VCPU.
> +  * We approximate round-robin by starting at the next VCPU.
>*/
>   for (pass = 0; pass < 2 && !yielded; pass++) {
>   kvm_for_each_vcpu(i, vcpu, kvm) {
> - if (!pass && i <= last_boosted_vcpu) {
> - i = last_boosted_vcpu;
> + if (!pass && i <= me->vcpu_id) {
> + i = me->vcpu_id;
>   continue;
> - } else if (pass && i > last_boosted_vcpu)
> + } else if (pass && i >= me->vcpu_id)
>   break;
> - if (vcpu == me)
> - continue;
>   if (waitqueue_active(&vcpu->wq))
>   continue;
>   if (kvm_vcpu_yield_to(vcpu)) {
> - kvm->last_boosted_vcpu = i;
>   yielded = 1;
>   break;
>   }

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Gleb Natapov

On Wed, Oct 03, 2012 at 04:56:57PM +0200, Avi Kivity wrote:
> On 10/03/2012 04:17 PM, Raghavendra K T wrote:
> > * Avi Kivity  [2012-09-30 13:13:09]:
> > 
> >> On 09/30/2012 01:07 PM, Gleb Natapov wrote:
> >> > On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
> >> >> On 09/28/2012 08:16 AM, Raghavendra K T wrote:
> >> >> > 
> >> >> >>
> >> >> >> +struct pv_sched_info {
> >> >> >> +   unsigned long   sched_bitmap;
> >> >> > 
> >> >> > Thinking, whether we need something similar to cpumask here?
> >> >> > Only thing is we are representing guest (v)cpumask.
> >> >> > 
> >> >> 
> >> >> DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
> >> >> 
> >> > vcpu_id can be greater than KVM_MAX_VCPUS.
> >> 
> >> Use the index into the vcpu table as the bitmap index then.  In fact
> >> it's better because then the lookup to get the vcpu pointer is trivial.
> > 
> > Did you mean, while setting the bitmap,
> > 
> > we should do 
> > for (i = 1..n)
> > if (kvm->vcpus[i] == vcpu) set ith position in bitmap?
> 
> You can store i in the vcpu itself:
> 
>   set_bit(vcpu->index, &kvm->preempted);
> 
This will make the fact that vcpus are stored in an array into API
instead of implementation detail :( There were patches for vcpu
destruction that changed the array to rculist. Well, it will be still
possible to make the array rcu protected and copy it every time vcpu is
deleted/added I guess.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [06/31] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation

2012-10-04 Thread Gleb Natapov

On Tue, Oct 02, 2012 at 11:48:26PM -, Andi Kleen wrote:
> From: Andi Kleen 
> 
> This is not arch perfmon, but older CPUs will just ignore it. This makes
> it possible to do at least some TSX measurements from a KVM guest
> 
> Cc: a...@redhat.com
> Signed-off-by: Andi Kleen 
> 
> ---
> arch/x86/kvm/pmu.c |   13 ++---
>  1 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index 9b7ec11..f72a409 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -160,7 +160,7 @@ static void stop_counter(struct kvm_pmc *pmc)
>  
>  static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>   unsigned config, bool exclude_user, bool exclude_kernel,
> - bool intr)
> + bool intr, bool intx, bool intx_cp)
>  {
>   struct perf_event *event;
>   struct perf_event_attr attr = {
> @@ -173,6 +173,11 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 
> type,
>   .exclude_kernel = exclude_kernel,
>   .config = config,
>   };
> + /* Will be ignored on CPUs that don't support this. */
> + if (intx)
> + attr.config |= HSW_INTX;
> + if (intx_cp)
> + attr.config |= HSW_INTX_CHECKPOINTED;
>  
>   attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
>  
> @@ -239,7 +244,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 
> eventsel)
>   reprogram_counter(pmc, type, config,
>   !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
>   !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
> - eventsel & ARCH_PERFMON_EVENTSEL_INT);
> + eventsel & ARCH_PERFMON_EVENTSEL_INT,
> + !!(eventsel & HSW_INTX),
> + !!(eventsel & HSW_INTX_CHECKPOINTED));
Before reprogram_gp_counter() is called from kvm_pmu_set_msr() eventsel
is checked for reserved bits. Both HSW_INTX and HSW_INTX_CHECKPOINTED
are marked as reserved currently. Reserved bit mask should depend on
cpuid bits provided to a guest.

>  }
>  
>  static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> @@ -256,7 +263,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, 
> u8 en_pmi, int idx)
>   arch_events[fixed_pmc_events[idx]].event_type,
>   !(en & 0x2), /* exclude user */
>   !(en & 0x1), /* exclude kernel */
> - pmi);
> + pmi, false, false);
>  }
>  
>  static inline u8 fixed_en_pmi(u64 ctrl, int idx)

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 06/34] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v2

2012-10-23 Thread Gleb Natapov

On Tue, Oct 23, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-10-18 at 16:19 -0700, Andi Kleen wrote:
> > From: Andi Kleen 
> > 
> > This is not arch perfmon, but older CPUs will just ignore it. This makes
> > it possible to do at least some TSX measurements from a KVM guest
> 
> Please, always CC people who wrote the code as well, in this case that's
> Gleb.
> 
Yes, I missed the v2. Thanks Peter.

> > Cc: a...@redhat.com
> > v2: Various fixes to address review feedback
> > Signed-off-by: Andi Kleen 
> > ---
> >  arch/x86/kvm/pmu.c |   15 +++
> >  1 files changed, 11 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> > index cfc258a..81c1632 100644
> > --- a/arch/x86/kvm/pmu.c
> > +++ b/arch/x86/kvm/pmu.c
> 
> > @@ -173,6 +173,11 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 
> > type,
> > .exclude_kernel = exclude_kernel,
> > .config = config,
> > };
> > +   /* Will be ignored on CPUs that don't support this. */
> > +   if (intx)
> > +   attr.config |= HSW_INTX;
> > +   if (intx_cp)
> > +   attr.config |= HSW_INTX_CHECKPOINTED;
> >  
> > attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
> >  
> 
> So I forgot how all this worked, but will the KVM emulation not pass
> this straight down to the hardware?
KVM PMU emulation does not talk to HW directly. It creates perf_event to
emulate PMU counter. As far as I see those two will be dropped by
hsw_hw_config() if host HW does not support them.


> 
> Don't we have a problem where we're emulating arch perfmon v1 on AMD
> hardware? On AMD hardware those bits do have meaning.
PMU emulation does not support AMD yet.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [06/34] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v2

2012-10-23 Thread Gleb Natapov

On Thu, Oct 18, 2012 at 11:19:14PM -, Andi Kleen wrote:
>  static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> @@ -400,7 +407,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 
> data)
>   } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
>   if (data == pmc->eventsel)
>   return 0;
> - if (!(data & 0x0020ull)) {
> + if (!(data & 0xfffc0020ull)) {
>   reprogram_gp_counter(pmc, data);
>   return 0;
>   }

Mask should depend on cpuid bits provided to a guest. SDM says TSX is
available if CPUID.(EAX=7, ECX=0):RTM[bit 11]=1, or  if
CPUID.07H.EBX.HLE [bit 4] = 1, so we need to check for this in
kvm_pmu_cpuid_update() and initialize mask accordingly.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [06/34] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v2

2012-10-24 Thread Gleb Natapov

On Tue, Oct 23, 2012 at 03:20:39PM +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2012 at 03:05:09PM +0200, Gleb Natapov wrote:
> > On Thu, Oct 18, 2012 at 11:19:14PM -, Andi Kleen wrote:
> > >  static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> > > @@ -400,7 +407,7 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, 
> > > u64 data)
> > >   } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
> > >   if (data == pmc->eventsel)
> > >   return 0;
> > > - if (!(data & 0x0020ull)) {
> > > + if (!(data & 0xfffc0020ull)) {
> > >   reprogram_gp_counter(pmc, data);
> > >   return 0;
> > >   }
> > 
> > Mask should depend on cpuid bits provided to a guest. SDM says TSX is
> > available if CPUID.(EAX=7, ECX=0):RTM[bit 11]=1, or  if
> > CPUID.07H.EBX.HLE [bit 4] = 1, so we need to check for this in
> > kvm_pmu_cpuid_update() and initialize mask accordingly.
> 
> I think it will still error out or do nothing.
It will drop TSX related bits without any way for guest to know and count
something else as a result.  It is true that well behaved guest will check
cpuid bits by itself before setting them, but we are trying to emulate
HW as close as possible for the sake of not so well behaved. Some crazy
guest may try to check if those bits are supported by trying to set them
and checking for the #GP for instance. (And I noticed that performance
monitoring application writers tend to not check cpuid even for those
things that can be checked and go by CPU model alone.)

> So I have doubts the explicit check is worth it.
> That should be near enough the hardware.
> 
> BTW does the PMU feature still work? I couldn't get it to work at all
> with the latest kernel, just when I originally wrote it.
> 
Works for me here with 3.6.0. Have you ran qemu with "-cpu host"?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 05/32] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v3

2012-10-31 Thread Gleb Natapov

On Tue, Oct 30, 2012 at 05:33:56PM -0700, Andi Kleen wrote:
> From: Andi Kleen 
> 
> This is not arch perfmon, but older CPUs will just ignore it. This makes
> it possible to do at least some TSX measurements from a KVM guest
> 
You are ignoring my reviews.

> Cc: a...@redhat.com
> Cc: g...@redhat.com
> v2: Various fixes to address review feedback
> v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
> Cc: g...@redhat.com
> Signed-off-by: Andi Kleen 
> ---
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/pmu.c  |   34 ++
>  2 files changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b2e11f4..6783289 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -318,6 +318,7 @@ struct kvm_pmu {
>   u64 global_ovf_ctrl;
>   u64 counter_bitmask[2];
>   u64 global_ctrl_mask;
> + u64 cpuid_word9;
>   u8 version;
>   struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
>   struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index cfc258a..8bc954a 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -160,7 +160,7 @@ static void stop_counter(struct kvm_pmc *pmc)
>  
>  static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>   unsigned config, bool exclude_user, bool exclude_kernel,
> - bool intr)
> + bool intr, bool intx, bool intx_cp)
>  {
>   struct perf_event *event;
>   struct perf_event_attr attr = {
> @@ -173,6 +173,11 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 
> type,
>   .exclude_kernel = exclude_kernel,
>   .config = config,
>   };
> + /* Will be ignored on CPUs that don't support this. */
> + if (intx)
> + attr.config |= HSW_INTX;
> + if (intx_cp)
> + attr.config |= HSW_INTX_CHECKPOINTED;
>  
>   attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
>  
> @@ -206,7 +211,8 @@ static unsigned find_arch_event(struct kvm_pmu *pmu, u8 
> event_select,
>   return arch_events[i].event_type;
>  }
>  
> -static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
> +static void reprogram_gp_counter(struct kvm_pmu *pmu, struct kvm_pmc *pmc, 
> +  u64 eventsel)
>  {
>   unsigned config, type = PERF_TYPE_RAW;
>   u8 event_select, unit_mask;
> @@ -224,9 +230,16 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, 
> u64 eventsel)
>   event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
>   unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
>  
> + if (!(boot_cpu_has(X86_FEATURE_HLE) ||
> +   boot_cpu_has(X86_FEATURE_RTM)) ||
> + !(pmu->cpuid_word9 & (X86_FEATURE_HLE|X86_FEATURE_RTM)))
> + eventsel &= ~(HSW_INTX|HSW_INTX_CHECKPOINTED);
> +
>   if (!(eventsel & (ARCH_PERFMON_EVENTSEL_EDGE |
>   ARCH_PERFMON_EVENTSEL_INV |
> - ARCH_PERFMON_EVENTSEL_CMASK))) {
> + ARCH_PERFMON_EVENTSEL_CMASK |
> + HSW_INTX |
> + HSW_INTX_CHECKPOINTED))) {
>   config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
>   unit_mask);
>   if (config != PERF_COUNT_HW_MAX)
> @@ -239,7 +252,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 
> eventsel)
>   reprogram_counter(pmc, type, config,
>   !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
>   !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
> - eventsel & ARCH_PERFMON_EVENTSEL_INT);
> + eventsel & ARCH_PERFMON_EVENTSEL_INT,
> + (eventsel & HSW_INTX),
> + (eventsel & HSW_INTX_CHECKPOINTED));
>  }
>  
>  static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> @@ -256,7 +271,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, 
> u8 en_pmi, int idx)
>   arch_events[fixed_pmc_events[idx]].event_type,
>   !(en & 0x2), /* exclude user */
>   !(en & 0x1), /* exclude kernel */
> - pmi);
> + pmi, false, false);
>  }
>  
>  static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> @@ -289,7 +304,7 @@ static void reprogram_idx(struct kvm_pmu *pmu, int idx)
>   return;
>  
>   if (pmc_is_gp(pmc))
> - reprogram_gp_counter(pmc, pmc->eventsel);
> + reprogram_gp_counter(pmu, pmc, pmc->eventsel);
>   else {
>   int fidx = idx - INTEL_PMC_IDX_FIXED;
>   reprogram_fixed_counter(pmc,
> @@ -400,8 +415,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 
> data)
>   }

Re: [PATCH 05/32] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v3

2012-10-31 Thread Gleb Natapov

On Wed, Oct 31, 2012 at 11:32:48AM +0100, Andi Kleen wrote:
> On Wed, Oct 31, 2012 at 12:27:01PM +0200, Gleb Natapov wrote:
> > On Tue, Oct 30, 2012 at 05:33:56PM -0700, Andi Kleen wrote:
> > > From: Andi Kleen 
> > > 
> > > This is not arch perfmon, but older CPUs will just ignore it. This makes
> > > it possible to do at least some TSX measurements from a KVM guest
> > > 
> > You are ignoring my reviews.
> 
> Huh? I nearly rewrote the patch based on your feedback.
> It's much more complicated now just for the CPUID check you wanted.
> I hope all that code is worth it.
> 
> If you want more changes please let me know.
> 
I did comment on v3 yesterday. May be you missed it. 
https://lkml.org/lkml/2012/10/30/87

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [GIT PULL] Linux KVM tool for v3.7-rc0

2012-10-31 Thread Gleb Natapov

On Tue, Oct 23, 2012 at 10:20:34AM +0800, Asias He wrote:
> On Mon, Oct 22, 2012 at 6:16 PM, Gleb Natapov  wrote:
> > On Mon, Oct 22, 2012 at 11:24:19AM +0200, Avi Kivity wrote:
> >> On 10/21/2012 05:39 PM, Pekka Enberg wrote:
> >> >
> >> > On Sun, Oct 21, 2012 at 5:02 PM, richard -rw- weinberger
> >> >  wrote:
> >> >> qemu supports all these features.
> >> >> E.g. to access the host fs use:
> >> >> qemu ... \
> >> >>   -fsdev 
> >> >> local,security_model=passthrough,id=fsdev-root,path=/your/root/,readonly
> >> >> \
> >> >>   -device virtio-9p-pci,id=fs-root,fsdev=fsdev-root,mount_tag=rootshare
> >> >
> >> > IIRC, QEMU uses SLIRP non-root zero-config networking which is much
> >> > more limited than what LKVM offers out of the box.
> >>
> >> Curious, what are the differences?
> >>
> > Me too, especially as we discussed replacing SLIRP with lkvm code for
> > userspace networking and decided (for reasons I do not remember) that it
> > lacks futures SLIRP has. Was it host port redirection?
> 
> Yes.  Currently, there is no host to guest port forward support in lkvm.
> However, it's faster than slirp.
> e.g. tcp bandwidth in a 100Mb/s environment
> lkvm's userspace networking ~90Mb/s  v.s qemu's slirp  ~10Mb/s
> 
This is not entirely slirp fault BTW. If you run QEMU with "-usbdevice tablet", 
slirp
bandwidth goes up to 80-90Mb too.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86, add hypervisor name to dump_stack() [v4]

2012-11-01 Thread Gleb Natapov

On Tue, Oct 30, 2012 at 04:26:33PM -0400, Prarit Bhargava wrote:
> Debugging crash, panics, stack trace WARN_ONs, etc., from both virtual and
> bare-metal boots can get difficult very quickly.  While there are ways to
> decipher the output and determine if the output is from a virtual guest,
> the in-kernel hypervisors now have a single registration point
> and set x86_hyper.  We can use this to output additional debug
> information during a panic/oops/stack trace.
> 
> Signed-off-by: Prarit Bhargava 
> Cc: Avi Kivity 
> Cc: Gleb Natapov 
> Cc: Alex Williamson 
> Cc: Marcelo Tostatti 
> Cc: Ingo Molnar 
> Cc: k...@vger.kernel.org
> Cc: x...@kernel.org
> 
Acked-by: Gleb Natapov 

> [v2]: Modifications suggested by Ingo and added changes for similar output
>   from process.c
> 
> [v3]: Unify common code and move output to end of line
> ---
>  arch/x86/kernel/dumpstack.c |6 +-
>  arch/x86/kernel/process.c   |   14 --
>  2 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
> index ae42418b..96d40ed 100644
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -188,11 +188,7 @@ void dump_stack(void)
>   unsigned long stack;
>  
>   bp = stack_frame(current, NULL);
> - printk("Pid: %d, comm: %.20s %s %s %.*s\n",
> - current->pid, current->comm, print_tainted(),
> - init_utsname()->release,
> - (int)strcspn(init_utsname()->version, " "),
> - init_utsname()->version);
> + show_regs_common();
>   show_trace(NULL, NULL, &stack, bp);
>  }
>  EXPORT_SYMBOL(dump_stack);
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index b644e1c..7ea4692 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
> @@ -124,6 +125,13 @@ void exit_thread(void)
>  void show_regs_common(void)
>  {
>   const char *vendor, *product, *board;
> + const char *machine_name = "x86";
> + const char *kernel_type = "native";
> +
> + if (x86_hyper) {
> + machine_name = x86_hyper->name;
> + kernel_type = "guest";
> + }
>  
>   vendor = dmi_get_system_info(DMI_SYS_VENDOR);
>   if (!vendor)
> @@ -135,14 +143,16 @@ void show_regs_common(void)
>   /* Board Name is optional */
>   board = dmi_get_system_info(DMI_BOARD_NAME);
>  
> - printk(KERN_DEFAULT "Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s\n",
> + printk(KERN_DEFAULT
> +"Pid: %d, comm: %.20s %s %s %.*s %s %s%s%s [%s %s kernel]\n",
>  current->pid, current->comm, print_tainted(),
>  init_utsname()->release,
>  (int)strcspn(init_utsname()->version, " "),
>  init_utsname()->version,
>  vendor, product,
>  board ? "/" : "",
> -board ? board : "");
> +board ? board : "",
> +machine_name, kernel_type);
>  }
>  
>  void flush_thread(void)
> -- 
> 1.7.9.3

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 08/11] perf tool: precise mode requires exclude_guest

2012-07-25 Thread Gleb Natapov

On Wed, Jul 25, 2012 at 10:35:46PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-07-24 at 18:15 +0200, Robert Richter wrote:
> > David,
> > 
> > On 24.07.12 08:20:19, David Ahern wrote:
> > > On 7/23/12 12:13 PM, Arnaldo Carvalho de Melo wrote:
> > > > Em Fri, Jul 20, 2012 at 05:25:53PM -0600, David Ahern escreveu:
> > > >> PEBS cannot be used with guest mode. If user adds :p modifier set
> > > >> exclude_guest as well.
> > > >
> > > > Is this something Intel specific? Or can someone think of an arch where
> > > > this limitation wouldn't exist?
> > > 
> > > Good point. So far precise_ip is used by arch/x86 only. I don't have an 
> > > AMD based server so I don't know if there is a conflict between 
> > > virtualization and IBS. Added Robert for advice.
> > 
> > thanks for this hint.
> > 
> > On AMD cpus precise_ip maps to IBS, which does not support hardware
> > options as perfctrs do. Thus, following attr flags are not supported:
> > 
> >  exclude_user, exclude_kernel, exclude_host, exclude_guest
> > 
> > Counting in guest mode is possible with IBS, but not the exclusion of
> > a certain mode. If precise_ip counting is enabled on AMD we may not
> > set the exclude_guest flag.
> 
> IIRC we have SVM enter/exit hooks in kvm/perf already, we could use
> those to implement exclude_guest for IBS.
> 
> Now I've been trying to find the patches that introduced that, but I'm
> failing horridly. Gleb, got us a pointer here?
The commit is 144d31e6, but it introduces hook that is used on VMX only.
SVM does not need it to implement guest/host only counters since it
has HW support for that in the PMU.
 
--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 08/11] perf tool: precise mode requires exclude_guest

2012-07-26 Thread Gleb Natapov

On Thu, Jul 26, 2012 at 10:07:37AM +0200, Peter Zijlstra wrote:
> On Thu, 2012-07-26 at 08:50 +0300, Gleb Natapov wrote:
> > On Wed, Jul 25, 2012 at 10:35:46PM +0200, Peter Zijlstra wrote:
> > > On Tue, 2012-07-24 at 18:15 +0200, Robert Richter wrote:
> > > > David,
> > > > 
> > > > On 24.07.12 08:20:19, David Ahern wrote:
> > > > > On 7/23/12 12:13 PM, Arnaldo Carvalho de Melo wrote:
> > > > > > Em Fri, Jul 20, 2012 at 05:25:53PM -0600, David Ahern escreveu:
> > > > > >> PEBS cannot be used with guest mode. If user adds :p modifier set
> > > > > >> exclude_guest as well.
> > > > > >
> > > > > > Is this something Intel specific? Or can someone think of an arch 
> > > > > > where
> > > > > > this limitation wouldn't exist?
> > > > > 
> > > > > Good point. So far precise_ip is used by arch/x86 only. I don't have 
> > > > > an 
> > > > > AMD based server so I don't know if there is a conflict between 
> > > > > virtualization and IBS. Added Robert for advice.
> > > > 
> > > > thanks for this hint.
> > > > 
> > > > On AMD cpus precise_ip maps to IBS, which does not support hardware
> > > > options as perfctrs do. Thus, following attr flags are not supported:
> > > > 
> > > >  exclude_user, exclude_kernel, exclude_host, exclude_guest
> > > > 
> > > > Counting in guest mode is possible with IBS, but not the exclusion of
> > > > a certain mode. If precise_ip counting is enabled on AMD we may not
> > > > set the exclude_guest flag.
> > > 
> > > IIRC we have SVM enter/exit hooks in kvm/perf already, we could use
> > > those to implement exclude_guest for IBS.
> > > 
> > > Now I've been trying to find the patches that introduced that, but I'm
> > > failing horridly. Gleb, got us a pointer here?
> 
> > The commit is 144d31e6, but it introduces hook that is used on VMX only.
> > SVM does not need it to implement guest/host only counters since it
> > has HW support for that in the PMU.
> 
> Right, seems that has changed now that we support IBS.

Copying Joerg who wrote guest/host exclude code for AMD.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn

2012-09-23 Thread Gleb Natapov

On Fri, Sep 21, 2012 at 02:57:19PM +0800, Xiao Guangrong wrote:
> We can not directly call kvm_release_pfn_clean to release the pfn
> since we can meet noslot pfn which is used to cache mmio info into
> spte
> 
Wouldn't it be better to move the check into kvm_release_pfn_clean()?

> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c |6 --
>  arch/x86/kvm/paging_tmpl.h |6 --
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index aa0b469..0f56169 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2877,7 +2877,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t 
> v, u32 error_code,
> 
>  out_unlock:
>   spin_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(pfn);
> + if (likely(!is_noslot_pfn(pfn)))
> + kvm_release_pfn_clean(pfn);
>   return 0;
>  }
> 
> @@ -3345,7 +3346,8 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
> gpa, u32 error_code,
> 
>  out_unlock:
>   spin_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(pfn);
> + if (likely(!is_noslot_pfn(pfn)))
> + kvm_release_pfn_clean(pfn);
>   return 0;
>  }
> 
> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
> index bf8c42b..9ce6bc0 100644
> --- a/arch/x86/kvm/paging_tmpl.h
> +++ b/arch/x86/kvm/paging_tmpl.h
> @@ -544,7 +544,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t 
> addr,
>  out_gpte_changed:
>   if (sp)
>   kvm_mmu_put_page(sp, it.sptep);
> - kvm_release_pfn_clean(pfn);
> + if (likely(!is_noslot_pfn(pfn)))
> + kvm_release_pfn_clean(pfn);
>   return NULL;
>  }
> 
> @@ -645,7 +646,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
> addr, u32 error_code,
> 
>  out_unlock:
>   spin_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(pfn);
> + if (likely(!is_noslot_pfn(pfn)))
> + kvm_release_pfn_clean(pfn);
>   return 0;
>  }
> 
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn

2012-09-24 Thread Gleb Natapov

On Mon, Sep 24, 2012 at 12:59:32PM +0800, Xiao Guangrong wrote:
> On 09/23/2012 05:13 PM, Gleb Natapov wrote:
> > On Fri, Sep 21, 2012 at 02:57:19PM +0800, Xiao Guangrong wrote:
> >> We can not directly call kvm_release_pfn_clean to release the pfn
> >> since we can meet noslot pfn which is used to cache mmio info into
> >> spte
> >>
> > Wouldn't it be better to move the check into kvm_release_pfn_clean()?
> 
> I think there is no reason for us to prefer to adding this branch in
> the common code. :)

Is the function performance critical? Is function called without the check
on a hot path?  The function already contains much heavier kvm_is_mmio_pfn()
check. If most/all function invocation require check before call it's
better to move it inside.
 
--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 1/7] KVM: MMU: fix release noslot pfn

2012-09-24 Thread Gleb Natapov

On Mon, Sep 24, 2012 at 07:49:37PM +0800, Xiao Guangrong wrote:
> On 09/24/2012 07:24 PM, Gleb Natapov wrote:
> > On Mon, Sep 24, 2012 at 12:59:32PM +0800, Xiao Guangrong wrote:
> >> On 09/23/2012 05:13 PM, Gleb Natapov wrote:
> >>> On Fri, Sep 21, 2012 at 02:57:19PM +0800, Xiao Guangrong wrote:
> >>>> We can not directly call kvm_release_pfn_clean to release the pfn
> >>>> since we can meet noslot pfn which is used to cache mmio info into
> >>>> spte
> >>>>
> >>> Wouldn't it be better to move the check into kvm_release_pfn_clean()?
> >>
> >> I think there is no reason for us to prefer to adding this branch in
> >> the common code. :)
> > 
> > Is the function performance critical? Is function called without the check
> > on a hot path?  The function already contains much heavier kvm_is_mmio_pfn()
> > check. If most/all function invocation require check before call it's
> > better to move it inside.
> 
> It is not most/all functions need do this check - it is only needed on x86 mmu
> page-fault/prefetch path.
At least on x86 there 7 calls to kvm_release_pfn_clean(), 5 of them are
guarded by is_noslot_pfn() (after this patch) and one by even stronger
is_error_pfn(). I guess when/if other architectures will add MMIO MMU
caching they will need to guard kvm_release_pfn_clean() by is_noslot_pfn()
too in most cases. I am not insisting, but as this patch shows it is
easy to miss the check before calling the function.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 05/33] perf, kvm: Support the intx/intx_cp modifiers in KVM arch perfmon emulation v3

2012-10-30 Thread Gleb Natapov

On Fri, Oct 26, 2012 at 01:29:47PM -0700, Andi Kleen wrote:
> From: Andi Kleen 
> 
> This is not arch perfmon, but older CPUs will just ignore it. This makes
> it possible to do at least some TSX measurements from a KVM guest
> 
> Cc: a...@redhat.com
> Cc: g...@redhat.com
> v2: Various fixes to address review feedback
> v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
> Cc: g...@redhat.com
> Signed-off-by: Andi Kleen 
> ---
>  arch/x86/include/asm/kvm_host.h |1 +
>  arch/x86/kvm/pmu.c  |   34 ++
>  2 files changed, 27 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b2e11f4..6783289 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -318,6 +318,7 @@ struct kvm_pmu {
>   u64 global_ovf_ctrl;
>   u64 counter_bitmask[2];
>   u64 global_ctrl_mask;
> + u64 cpuid_word9;
>   u8 version;
>   struct kvm_pmc gp_counters[INTEL_PMC_MAX_GENERIC];
>   struct kvm_pmc fixed_counters[INTEL_PMC_MAX_FIXED];
> diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
> index cfc258a..8bc954a 100644
> --- a/arch/x86/kvm/pmu.c
> +++ b/arch/x86/kvm/pmu.c
> @@ -160,7 +160,7 @@ static void stop_counter(struct kvm_pmc *pmc)
>  
>  static void reprogram_counter(struct kvm_pmc *pmc, u32 type,
>   unsigned config, bool exclude_user, bool exclude_kernel,
> - bool intr)
> + bool intr, bool intx, bool intx_cp)
>  {
>   struct perf_event *event;
>   struct perf_event_attr attr = {
> @@ -173,6 +173,11 @@ static void reprogram_counter(struct kvm_pmc *pmc, u32 
> type,
>   .exclude_kernel = exclude_kernel,
>   .config = config,
>   };
> + /* Will be ignored on CPUs that don't support this. */
> + if (intx)
> + attr.config |= HSW_INTX;
> + if (intx_cp)
> + attr.config |= HSW_INTX_CHECKPOINTED;
>  
>   attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
>  
> @@ -206,7 +211,8 @@ static unsigned find_arch_event(struct kvm_pmu *pmu, u8 
> event_select,
>   return arch_events[i].event_type;
>  }
>  
> -static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 eventsel)
> +static void reprogram_gp_counter(struct kvm_pmu *pmu, struct kvm_pmc *pmc, 
> +  u64 eventsel)
>  {
>   unsigned config, type = PERF_TYPE_RAW;
>   u8 event_select, unit_mask;
> @@ -224,9 +230,16 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, 
> u64 eventsel)
>   event_select = eventsel & ARCH_PERFMON_EVENTSEL_EVENT;
>   unit_mask = (eventsel & ARCH_PERFMON_EVENTSEL_UMASK) >> 8;
>  
> + if (!(boot_cpu_has(X86_FEATURE_HLE) ||
> +   boot_cpu_has(X86_FEATURE_RTM)) ||
> + !(pmu->cpuid_word9 & (X86_FEATURE_HLE|X86_FEATURE_RTM)))
> + eventsel &= ~(HSW_INTX|HSW_INTX_CHECKPOINTED);
If you put this check into kvm_pmu_cpuid_update() and disallow guest to
set those bits in the first place by choosing appropriate reserved mask
you will not need this check here. This will simplify the code and will
make emulation more correct.

> +
>   if (!(eventsel & (ARCH_PERFMON_EVENTSEL_EDGE |
>   ARCH_PERFMON_EVENTSEL_INV |
> - ARCH_PERFMON_EVENTSEL_CMASK))) {
> + ARCH_PERFMON_EVENTSEL_CMASK |
> + HSW_INTX |
> + HSW_INTX_CHECKPOINTED))) {
>   config = find_arch_event(&pmc->vcpu->arch.pmu, event_select,
>   unit_mask);
>   if (config != PERF_COUNT_HW_MAX)
> @@ -239,7 +252,9 @@ static void reprogram_gp_counter(struct kvm_pmc *pmc, u64 
> eventsel)
>   reprogram_counter(pmc, type, config,
>   !(eventsel & ARCH_PERFMON_EVENTSEL_USR),
>   !(eventsel & ARCH_PERFMON_EVENTSEL_OS),
> - eventsel & ARCH_PERFMON_EVENTSEL_INT);
> + eventsel & ARCH_PERFMON_EVENTSEL_INT,
> + (eventsel & HSW_INTX),
> + (eventsel & HSW_INTX_CHECKPOINTED));
>  }
>  
>  static void reprogram_fixed_counter(struct kvm_pmc *pmc, u8 en_pmi, int idx)
> @@ -256,7 +271,7 @@ static void reprogram_fixed_counter(struct kvm_pmc *pmc, 
> u8 en_pmi, int idx)
>   arch_events[fixed_pmc_events[idx]].event_type,
>   !(en & 0x2), /* exclude user */
>   !(en & 0x1), /* exclude kernel */
> - pmi);
> + pmi, false, false);
>  }
>  
>  static inline u8 fixed_en_pmi(u64 ctrl, int idx)
> @@ -289,7 +304,7 @@ static void reprogram_idx(struct kvm_pmu *pmu, int idx)
>   return;
>  
>   if (pmc_is_gp(pmc))
> - reprogram_gp_counter(pmc, pmc->eventsel);
> + reprogram_gp_counter(pmu, pmc, pmc->eventsel);
>

Re: [PATCH V12 3/5] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration

2013-08-25 Thread Gleb Natapov

On Wed, Aug 07, 2013 at 12:40:36AM +0530, Raghavendra K T wrote:
> On 08/07/2013 12:02 AM, Eric Northup wrote:
> >
> >If this is about migration correctness, could it get folded into the
> >previous patch 2/5, so that there's not a broken commit which could
> >hurt bisection?
> 
> Yes. It could be. Only reason I maintained like that was,
> original author in the previous patch is different (Srivatsa) and I did
> not want to merge this hunk when the patch series got evolved to mix
> the sign-offs.
> 
> Gleb, Paolo please let me know.
> 
Yes please, do so.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH V13 0/4] Paravirtualized ticket spinlocks for KVM host

2013-08-26 Thread Gleb Natapov

On Mon, Aug 26, 2013 at 02:18:32PM +0530, Raghavendra K T wrote:
> 
>  This series forms the kvm host part of paravirtual spinlock
>  based against kvm tree.
> 
>  Please refer to https://lkml.org/lkml/2013/8/9/265 for
>  kvm guest and Xen, x86 part merged to -tip spinlocks.
> 
>  Please note that:
>  kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi is a common patch
>  for both guest and host.
> 
Thanks, applied. The patchset is not against kvm.git queue though, so I
had to fix one minor conflict manually.

>  Changes since V12:
>   fold the patch 3 into patch 2 for bisection. (Eric Northup)
> 
> Raghavendra K T (3):
>   kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi
>   kvm hypervisor: Simplify kvm_for_each_vcpu with
> kvm_irq_delivery_to_apic
>   Documentation/kvm : Add documentation on Hypercalls and features used
> for PV spinlock
> 
> Srivatsa Vaddagiri (1):
>   kvm hypervisor : Add a hypercall to KVM hypervisor to support
> pv-ticketlocks
> 
>  Documentation/virtual/kvm/cpuid.txt  |  4 
>  Documentation/virtual/kvm/hypercalls.txt | 14 ++
>  arch/x86/include/asm/kvm_host.h  |  5 +
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kvm/cpuid.c |  3 ++-
>  arch/x86/kvm/lapic.c |  5 -
>  arch/x86/kvm/x86.c   | 31 ++-
>  include/uapi/linux/kvm_para.h|  1 +
>  8 files changed, 61 insertions(+), 3 deletions(-)
> 
> -- 
> 1.7.11.7

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 0/2] kvm: use anon_inode_getfd() with O_CLOEXEC flag

2013-08-26 Thread Gleb Natapov

On Sat, Aug 24, 2013 at 10:14:06PM +0200, Yann Droneaud wrote:
> Hi,
> 
> Following a patchset asking to change calls to get_unused_flag() [1]
> to use O_CLOEXEC, Alex Williamson [2][3] decided to change VFIO
> to use the flag.
> 
> Since it's a related subsystem to KVM, using O_CLOEXEC for
> file descriptors created by KVM might be applicable too.
> 
> I'm suggesting to change calls to anon_inode_getfd() to use O_CLOEXEC
> as default flag.
> 
> This patchset should be reviewed to not break existing userspace program.
> 
> BTW, if it's not applicable, I would suggest that new ioctls be added to
> KVM subsystem, those ioctls would have a "flag" field added to their 
> arguments.
> Such "flag" would let userspace choose the open flag to use.
> See for example other APIs using anon_inode_getfd() such as fanotify,
> inotify, signalfd and timerfd.
> 
> You might be interested to read:
> 
> - Secure File Descriptor Handling (Ulrich Drepper, 2008)
>   http://udrepper.livejournal.com/20407.html
> 
> - Excuse me son, but your code is leaking !!! (Dan Walsh, March 2012) 
>   http://danwalsh.livejournal.com/53603.html
> 
Applied, thanks.

> Regards.
> 
> [1] http://lkml.kernel.org/r/cover.1376327678.git.ydrone...@opteya.com
> [2] http://lkml.kernel.org/r/1377186804.25163.17.ca...@ul30vt.home
> [3] http://lkml.kernel.org/r/20130822171744.1297.13711.st...@bling.home
> 
> Yann Droneaud (2):
>   kvm: use anon_inode_getfd() with O_CLOEXEC flag
>   ppc: kvm: use anon_inode_getfd() with O_CLOEXEC flag
> 
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 2 +-
>  arch/powerpc/kvm/book3s_64_vio.c| 2 +-
>  arch/powerpc/kvm/book3s_hv.c| 2 +-
>  virt/kvm/kvm_main.c | 6 +++---
>  4 files changed, 6 insertions(+), 6 deletions(-)
> 
> -- 
> 1.8.3.1

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v2] tile: support KVM for tilegx

2013-08-26 Thread Gleb Natapov

On Sun, Aug 25, 2013 at 09:26:47PM -0400, Chris Metcalf wrote:
> On 8/25/2013 7:39 AM, Gleb Natapov wrote:
> > On Mon, Aug 12, 2013 at 04:24:11PM -0400, Chris Metcalf wrote:
> >> This change provides the initial framework support for KVM on tilegx.
> >> Basic virtual disk and networking is supported.
> >>
> > This needs to be broken down to more reviewable patches.
> 
> I already broke out one pre-requisite patch that wasn't strictly KVM-related:
> 
> https://lkml.org/lkml/2013/8/12/339
> 
> In addition, we've separately arranged to support booting our kernels in a 
> way that is compatible with the Tilera booter running at the highest 
> privilege level, which enables multiple kernel privilege levels:
> 
> https://lkml.org/lkml/2013/5/2/468
> 
> How would you recommend further breaking down this patch?  It's pretty much 
> just the basic support for minimal KVM.  I suppose I could break out all the 
> I/O related stuff into a separate patch, though it wouldn't amount to much; 
> perhaps the console could also be broken out separately.  Any other 
> suggestions?
>
First of all please break out host and guest bits. Also I/O related stuff,
like you suggest (so that guest PV bits will be in separate patch) and
change to a common code (not much as far as I see) with explanation why
it is needed. (Why kvm_vcpu_kick() is not needed for instance?)
 
> > Also can you
> > describe the implementation a little bit? Does tile arch has vitalization
> > extension this implementation uses, or is it trap and emulate approach?
> > If later does it run unmodified guest kernels? What userspace are you
> > using with this implementation?
> 
> We could do full virtualization via trap and emulate, but we've elected to do 
> a para-virtualized approach.  Userspace runs at PL (privilege level) 0, the 
> guest kernel runs at PL1, and the host runs at PL2.  We have available per-PL 
> resources for various things, and take advantage of having two on-chip timers 
> (for example) to handle timing for the host and guest kernels.  We run the 
> same userspace with either the host or the guest.
> 
OK, thanks for explanation. Why have you decided to do PV over trap and
emulate?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-26 Thread Gleb Natapov

On Wed, Aug 14, 2013 at 10:51:14AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2013-08-01 at 14:44 +1000, Alexey Kardashevskiy wrote:
> > This is to reserve a capablity number for upcoming support
> > of H_PUT_TCE_INDIRECT and H_STUFF_TCE pseries hypercalls
> > which support mulptiple DMA map/unmap operations per one call.
> 
> Gleb, any chance you can put this (and the next one) into a tree to
> "lock in" the numbers ?
> 
Applied it. Sorry for slow response, was on vocation and still go
through the email backlog.

> I've been wanting to apply the whole series to powerpc-next, that's
> stuff has been simmering for way too long and is in a good enough shape
> imho, but I need the capabilities and ioctl numbers locked in your tree
> first.
> 
> Cheers,
> Ben.
> 
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> > Changes:
> > 2013/07/16:
> > * changed the number
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> >  include/uapi/linux/kvm.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index acccd08..99c2533 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_PPC_RTAS 91
> >  #define KVM_CAP_IRQ_XICS 92
> >  #define KVM_CAP_ARM_EL1_32BIT 93
> > +#define KVM_CAP_SPAPR_MULTITCE 94
> >  
> >  #ifdef KVM_CAP_IRQ_ROUTING
> >  
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-26 Thread Gleb Natapov

On Tue, Aug 27, 2013 at 02:19:58PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2013-08-26 at 15:37 +0300, Gleb Natapov wrote:
> > > Gleb, any chance you can put this (and the next one) into a tree to
> > > "lock in" the numbers ?
> > > 
> > Applied it. Sorry for slow response, was on vocation and still go
> > through the email backlog.
> 
> Thanks. Since it's not in a topic branch that I can pull, I'm going to
> just cherry-pick them. However, they are in your "queue" branch, not
> "next" branch. Should I still assume this is a stable branch and that
> the numbers aren't going to change ?
> 
Queue will become next after I will test it and if test will fail the
commit hash may change, but since you are going to cherry-pick and this
does not preserve commit hash it does not matter.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 02/10] KVM: PPC: reserve a capability number for multitce support

2013-08-26 Thread Gleb Natapov

On Tue, Aug 27, 2013 at 02:22:42PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-08-27 at 14:19 +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2013-08-26 at 15:37 +0300, Gleb Natapov wrote:
> > > > Gleb, any chance you can put this (and the next one) into a tree to
> > > > "lock in" the numbers ?
> > > > 
> > > Applied it. Sorry for slow response, was on vocation and still go
> > > through the email backlog.
> > 
> > Thanks. Since it's not in a topic branch that I can pull, I'm going to
> > just cherry-pick them. However, they are in your "queue" branch, not
> > "next" branch. Should I still assume this is a stable branch and that
> > the numbers aren't going to change ?
> 
> Oh and Alexey mentions that there are two capabilities and you only
> applied one :-)
> 
Another one is:
 [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-08-27 Thread Gleb Natapov

On Thu, Aug 15, 2013 at 05:49:26PM +1000, Alexey Kardashevskiy wrote:
> This is to reserve a capablity number for upcoming support
> of VFIO-IOMMU DMA operations in real mode.
> 
> The last ioctl in the group which KVM_CREATE_SPAPR_TCE_IOMMU is added to
> is 0xac, the next two numbers are taken - 0xad for KVM_KVMCLOCK_CTRL and
0xac was also taken by KVM_SET_ONE_REG :(

> 0xae for KVM_ARM_VCPU_INIT. So the KVM_CREATE_SPAPR_TCE_IOMMU ioclt gets
> 0xaf.
> 
> Signed-off-by: Alexey Kardashevskiy 
> 
> ---
> Changes:
> 2013/08/15 v8:
> * fixed comment again
> 
> 2013/08/15:
> * fixed mistype in comments
> * fixed commit message which says what uses ioctls 0xad and 0xae
> 
> 2013/07/16:
> * changed the number
> 
> 2013/07/11:
> * changed order in a file, added comment about a gap in ioctl number
> ---
>  include/uapi/linux/kvm.h | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 99c2533..bd94127 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_IRQ_XICS 92
>  #define KVM_CAP_ARM_EL1_32BIT 93
>  #define KVM_CAP_SPAPR_MULTITCE 94
> +#define KVM_CAP_SPAPR_TCE_IOMMU 95
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -933,6 +934,11 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_ARM_SET_DEVICE_ADDR_IOW(KVMIO,  0xab, struct 
> kvm_arm_device_addr)
>  /* Available with KVM_CAP_PPC_RTAS */
>  #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xac, struct 
> kvm_rtas_token_args)
> +/* 0xad is taken by KVM_KVMCLOCK_CTRL */
> +/* 0xae is taken by KVM_ARM_VCPU_INIT */
> +/* Available with KVM_CAP_SPAPR_TCE_IOMMU */
> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, \
> + struct kvm_create_spapr_tce_iommu)
>  
Why not use KVM_CREATE_DEVICE API for that?

>  /* ioctl for vm fd */
>  #define KVM_CREATE_DEVICE  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> -- 
> 1.8.3.2

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-08-27 Thread Gleb Natapov

On Tue, Aug 27, 2013 at 06:42:18PM +1000, Alexey Kardashevskiy wrote:
> On 08/27/2013 05:56 PM, Gleb Natapov wrote:
> > On Thu, Aug 15, 2013 at 05:49:26PM +1000, Alexey Kardashevskiy wrote:
> >> This is to reserve a capablity number for upcoming support
> >> of VFIO-IOMMU DMA operations in real mode.
> >>
> >> The last ioctl in the group which KVM_CREATE_SPAPR_TCE_IOMMU is added to
> >> is 0xac, the next two numbers are taken - 0xad for KVM_KVMCLOCK_CTRL and
> > 0xac was also taken by KVM_SET_ONE_REG :(
> > 
> >> 0xae for KVM_ARM_VCPU_INIT. So the KVM_CREATE_SPAPR_TCE_IOMMU ioclt gets
> >> 0xaf.
> >>
> >> Signed-off-by: Alexey Kardashevskiy 
> >>
> >> ---
> >> Changes:
> >> 2013/08/15 v8:
> >> * fixed comment again
> >>
> >> 2013/08/15:
> >> * fixed mistype in comments
> >> * fixed commit message which says what uses ioctls 0xad and 0xae
> >>
> >> 2013/07/16:
> >> * changed the number
> >>
> >> 2013/07/11:
> >> * changed order in a file, added comment about a gap in ioctl number
> >> ---
> >>  include/uapi/linux/kvm.h | 6 ++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 99c2533..bd94127 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> >>  #define KVM_CAP_IRQ_XICS 92
> >>  #define KVM_CAP_ARM_EL1_32BIT 93
> >>  #define KVM_CAP_SPAPR_MULTITCE 94
> >> +#define KVM_CAP_SPAPR_TCE_IOMMU 95
> >>  
> >>  #ifdef KVM_CAP_IRQ_ROUTING
> >>  
> >> @@ -933,6 +934,11 @@ struct kvm_s390_ucas_mapping {
> >>  #define KVM_ARM_SET_DEVICE_ADDR _IOW(KVMIO,  0xab, struct 
> >> kvm_arm_device_addr)
> >>  /* Available with KVM_CAP_PPC_RTAS */
> >>  #define KVM_PPC_RTAS_DEFINE_TOKEN _IOW(KVMIO,  0xac, struct 
> >> kvm_rtas_token_args)
> >> +/* 0xad is taken by KVM_KVMCLOCK_CTRL */
> >> +/* 0xae is taken by KVM_ARM_VCPU_INIT */
> >> +/* Available with KVM_CAP_SPAPR_TCE_IOMMU */
> >> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO,  0xaf, \
> >> +  struct kvm_create_spapr_tce_iommu)
> >>  
> > Why not use KVM_CREATE_DEVICE API for that?
> 
> 
> Because when I came up with my ioctl first time, it was not in upstream and
> since then nobody pointed me to this new ioctl :)
Sorry about that :(. The ioctl exists for a while now, but with v8 patch I
imaging your patch series predates it.

> So my stuff is not going to upstream again. Heh. Ok. I'll implement it.
> 
Thanks! Should I keep KVM_CAP_SPAPR_MULTITCE capability patch or can I
drop it for now?

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO

2013-08-27 Thread Gleb Natapov

On Wed, Aug 28, 2013 at 11:26:31AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2013-08-28 at 10:51 +1000, Alexey Kardashevskiy wrote:
> > The ioctl I made up is basically a copy of KVM_CREATE_SPAPR_TCE which does
> > the same thing for emulated devices and it is there for quite a while but
> > it is not really extensible. And these two ioctls share some bits of code.
> > Now we will have 2 pieces of code which do almost the same thing but in a
> > different way. Kinda sucks :(
> 
> Right. Thus the question, Gleb, we can either:
> 
>  - Keep Alexey patch as-is allowing us to *finally* merge that stuff
> that's been around for monthes
> 
>  - Convert *both* existing TCE objects to the new 
> KVM_CREATE_DEVICE, and have some backward compat code for the old one.
> 
> I don't think it makes sense to have the "emulated TCE" and "IOMMU TCE"
> objects use a fundamentally different API and infrastructure.
> 
As a general rule we are not going to mandate converting old devices to
new API, but if it make sense to do here I would much prefer that over
adding another special ioctl

> > >> So my stuff is not going to upstream again. Heh. Ok. I'll implement it.
> > >>
> > > Thanks! Should I keep KVM_CAP_SPAPR_MULTITCE capability patch or can I
> > > drop it for now?
> > 
> > Please keep it, it is unrelated to the IOMMU-VFIO thing.
> 

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

1 2 3 4 5 6 7 8 >

1 - 100 of 706 matches

Mail list logo