Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Paolo Bonzini

Il 31/03/2014 08:16, Wu, Feng ha scritto:


/*
 * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
 *
 * If CPL = 3, SMAP applies to all supervisor-mode data accesses
 * (these are implicit supervisor accesses) regardless of the value
 * of EFLAGS.AC.
 *
 * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving
 * the result in X86_EFLAGS_AC. We then insert it in place of
 * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec,
 * but it will be one in index if SMAP checks are being overridden.
 * It is important to keep this branchless.
 */
smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
index =
(pfec >> 1) +
(smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));

return (mmu->permissions[index] >> pte_access) & 1;

The direction of PFERR_RSVD_MASK is the opposite compared to your code.
---


Correct.


I am a little confused about some points of the above code:
1. "smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);"
"smap" equals 1 when it is overridden and it is 0 when being enforced.


Actually, smap equals X86_EFLAGS_AC when it is overridden.  Perhaps this 
is the source of the confusion.  Note that I'm using &, not &&.



So "index" will be (pfec >> 1) when SMAP is enforced, but in my understanding of
this case, we should use the index with the PFERR_RSVD_MASK bit being 1 in
mmu->permissions[] to check the fault.
2. "smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)"
I do not quite understand this line. BTW, I cannot find the definition of
"PFERR_RSVD_BIT". Do you mean PFERR_RSVD_BIT equals 3?


Yes.  You can similarly add PFERR_PRESENT_BIT (equal to 0) etc.


I think the basic idea is to use group 0 to check permission faults when
!((cpl - 3) & (rflags & X86_EFLAGS_AC)), that is, when SMAP is overridden,
while using group 1 to check faults when (cpl - 3) & (rflags & X86_EFLAGS_AC),
that is, when SMAP is enforced.

Here is the code based on your proposal, in my understanding:

---
smap = !((cpl - 3) & (rflags & X86_EFLAGS_AC));
index =
(pfec >> 1) + (smap << (PFERR_RSVD_BIT - 1)); /* assuming PFERR_RSVD_BIT == 3 */

return (mmu->permissions[index] >> pte_access) & 1;
---

Could you please have a look at it? Appreciate your help! :)


It is faster if you avoid the "!" and shift right from the AC bit into 
position PFERR_RSVD_BIT - 1.  In update_permission_bitmask you can 
invert the direction of the bit when you extract it from pfec.


Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Wu, Feng


> -----Original Message-----
> From: Paolo Bonzini [mailto:paolo.bonz...@gmail.com] On Behalf Of Paolo
> Bonzini
> Sent: Monday, March 31, 2014 3:29 PM
> To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
> Subject: Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4
> 
> Il 31/03/2014 08:16, Wu, Feng ha scritto:
> >
> 
> 
> > /*
> >  * If CPL < 3, SMAP protections are disabled if EFLAGS.AC = 1.
> >  *
> >  * If CPL = 3, SMAP applies to all supervisor-mode data accesses
> >  * (these are implicit supervisor accesses) regardless of the value
> >  * of EFLAGS.AC.
> >  *
> >  * This computes (cpl < 3) && (rflags & X86_EFLAGS_AC), leaving
> >  * the result in X86_EFLAGS_AC. We then insert it in place of
> >  * the PFERR_RSVD_MASK bit; this bit will always be zero in pfec,
> >  * but it will be one in index if SMAP checks are being overridden.
> >  * It is important to keep this branchless.
> >  */
> > smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);
> > index =
> > (pfec >> 1) +
> > (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
> >
> > return (mmu->permissions[index] >> pte_access) & 1;
> >
> > The direction of PFERR_RSVD_MASK is the opposite compared to your code.
> >
> 
> ---
> 
> Correct.
> 
> > I am a little confused about some points of the above code:
> > 1. "smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);"
> > "smap" equals 1 when it is overridden and it is 0 when being enforced.
> 
> Actually, smap equals X86_EFLAGS_AC when it is overridden.  Perhaps this
> is the source of the confusion.  Note that I'm using &, not &&.
> 

Thanks for your explanation; you are right, I didn't notice the *&* just now, I
was thinking it is a *&&*. Sorry for the confusion.

> > So "index"
> > will be (pfec >> 1) when SMAP is enforced, but in my understanding of this
> case, we
> > should use the index with PFERR_RSVD_MASK bit being 1 in mmu->
> permissions[]
> > to check the fault.
> > 2. " smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1)"
> > I am not quite understand this line. BTW, I cannot find the definition of
> "PFERR_RSVD_BIT",
> > Do you mean PFERR_RSVD_BIT equals 3?
> 
> Yes.  You can similarly add PFERR_PRESENT_BIT (equal to 0) etc.
> 
> > I think the basic idea is using group 0 to check permission faults when 
> > !((cpl -
> 3) & (rflags & X86_EFLAGS_AC)), that is SMAP is overridden
> > while using group 1 to check faults when (cpl - 3) & (rflags & 
> > X86_EFLAGS_AC),
> that is SMAP is enforced.
> >
> > Here is the code base on your proposal in my understanding:
> >
> >
> 
> ---
> > smap = !((cpl - 3) & (rflags & X86_EFLAGS_AC));
> > index =
> > (pfec >> 1) + (smap << (PFERR_RSVD_BIT - 1)); /*assuming
> PFERR_RSVD_BIT == 3*/
> >
> > return (mmu->permissions[index] >> pte_access) & 1;
> >
> 
> ---
> >
> > Could you please have a look at it? Appreciate your help! :)
> 
> It is faster if you avoid the "!" and shift right from the AC bit into
> position PFERR_RSVD_BIT - 1.  In update_permission_bitmask you can
> invert the direction of the bit when you extract it from pfec.

So in that case, we should set "smapf" in update_permission_bitmask() this way, 
right?

smapf = !(pfec & PFERR_RSVD_MASK);

> 
> Paolo



[GIT PULL] KVM changes for 3.15 merge window

2014-03-31 Thread Paolo Bonzini
Linus,

The following changes since commit 6d0abeca3242a88cab8232e4acd7e2bf088f3bc2:

  Linux 3.14-rc3 (2014-02-16 13:30:25 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.15-1

for you to fetch changes up to 7227fc006b0df2c0d2966a7f4859b01bdf74:

  Merge branch 'kvm-ppchv-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next 
(2014-03-29 15:44:05 +0100)

No other changes expected for this merge window.

Paolo



PPC and ARM do not have much going on this time.  Most of the cool stuff,
instead, is in s390 and (after a few releases) x86.

ARM has some caching fixes and PPC has transactional memory support
in guests.  MIPS has some fixes, with more probably coming in 3.16 as
QEMU will soon get support for MIPS KVM.

For x86 there are optimizations for debug registers, which trigger on
some Windows games, and other important fixes for Windows guests.  We now
expose to the guest Broadwell instruction set extensions and also Intel
MPX.  There's also a fix/workaround for OS X guests, nested virtualization
features (preemption timer), and a couple kvmclock refinements.

For s390, the main news is asynchronous page faults, together with
improvements to IRQs (floating irqs and adapter irqs) that speed up
virtio devices.


Andrew Honig (1):
  kvm: x86: fix emulator buffer overflow (CVE-2014-0049)

Andrew Jones (2):
  x86: kvm: rate-limit global clock updates
  x86: kvm: introduce periodic global clock updates

Anton Blanchard (1):
  KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n

Christian Borntraeger (4):
  KVM: s390: Provide access to program parameter
  KVM: s390: expose gbea register to userspace
  KVM: s390: Optimize ucontrol path
  KVM: s390: randomize sca address

Christoffer Dall (1):
  KVM: Specify byte order for KVM_EXIT_MMIO

Cornelia Huck (6):
  virtio-ccw: virtio-ccw adapter interrupt support.
  KVM: eventfd: Fix lock order inversion.
  KVM: Add per-vm capability enablement.
  KVM: s390: adapter interrupt sources
  KVM: s390: irq routing for adapter interrupts.
  KVM: Bump KVM_MAX_IRQ_ROUTES for s390

Dominik Dingel (7):
  KVM: s390: Add FAULT_FLAG_RETRY_NOWAIT for guest fault
  KVM: async_pf: Provide additional direct page notification
  KVM: async_pf: Allow to wait for outstanding work
  KVM: async_pf: Async page fault support on s390
  KVM: async_pf: Exploit one reg interface for pfault
  KVM: async_pf: Add missing call for async page present
  KVM: s390: Removing untriggerable BUG_ONs

Fernando Luis Vázquez Cao (1):
  kvm: remove redundant registration of BSP's hv_clock area

Gabriel L. Somlo (1):
  kvm: x86: ignore ioapic polarity

Greg Kurz (1):
  KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write

Heinz Graalfs (2):
  virtio_ccw: fix vcdev pointer handling issues
  virtio_ccw: fix hang in set offline processing

Igor Mammedov (2):
  KVM: x86 emulator: emulate MOVAPS
  KVM: x86 emulator: emulate MOVAPD

James Hogan (4):
  MIPS: KVM: asm/kvm_host.h: Clean up whitespace
  MIPS: KVM: Pass reserved instruction exceptions to guest
  MIPS: KVM: Consult HWREna before emulating RDHWR
  MIPS: KVM: Remove dead code in CP0 emulation

Jan Kiszka (4):
  KVM: nVMX: Rework interception of IRQs and NMIs
  KVM: nVMX: Fully emulate preemption timer
  KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
  KVM: x86: Remove return code from enable_irq/nmi_window

Jens Freimann (7):
  KVM: s390: add and extend interrupt information data structs
  KVM: s390: add floating irq controller
  KVM: s390: limit floating irqs
  KVM: s390: Simplify online vcpus counting for stsi
  KVM: s390: get rid of local_int array
  KVM: s390: fix calculation of idle_mask array size
  KVM: s390: clear local interrupts at cpu initial reset

Laurent Dufour (1):
  KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE

Liu, Jinsong (6):
  KVM: x86: expose new instruction RDSEED to guest
  KVM: x86: expose ADX feature to guest
  KVM: x86: Fix xsave cpuid exposing bug
  KVM: x86: Intel MPX vmx and msr handle
  KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save
  KVM: x86: Enable Intel MPX for guest

Marc Zyngier (13):
  arm/arm64: KVM: detect CPU reset on CPU_PM_EXIT
  arm64: KVM: force cache clean on page fault when caches are off
  arm64: KVM: allows discrimination of AArch32 sysreg access
  arm64: KVM: trap VM system registers until MMU and caches are ON
  ARM: KVM: introduce kvm_p*d_addr_end
  arm64: KVM: flush VM pages before letting the guest enable caches
  ARM: KVM: force cache clean on page fault when caches are off
  ARM: KVM: fix handling of trapp

Re: [PATCH 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Paolo Bonzini

Il 31/03/2014 10:06, Wu, Feng ha scritto:

>
> It is faster if you avoid the "!" and shift right from the AC bit into
> position PFERR_RSVD_BIT - 1.  In update_permission_bitmask you can
> invert the direction of the bit when you extract it from pfec.

So in that case, we should set "smapf" in update_permission_bitmask() this way, 
right?

smapf = !(pfec & PFERR_RSVD_MASK);



Yep.

Paolo


Re: [GIT PULL] KVM changes for 3.15 merge window

2014-03-31 Thread Paolo Bonzini

Il 31/03/2014 10:30, Paolo Bonzini ha scritto:

Linus,

The following changes since commit 6d0abeca3242a88cab8232e4acd7e2bf088f3bc2:

  Linux 3.14-rc3 (2014-02-16 13:30:25 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.15-1


Also at tags/for-linus, but request-pull chose this other tag for some 
reason.


Paolo


for you to fetch changes up to 7227fc006b0df2c0d2966a7f4859b01bdf74:

  Merge branch 'kvm-ppchv-next' of 
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-next 
(2014-03-29 15:44:05 +0100)

No other changes expected for this merge window.

Paolo




[PATCH v3 0/4] KVM: enable Intel SMAP for KVM

2014-03-31 Thread Feng Wu
Supervisor Mode Access Prevention (SMAP) is a new security feature
disclosed by Intel; please refer to the following document:

http://software.intel.com/sites/default/files/319433-014.pdf
 
Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear
addresses that are accessible in user mode. If CPL < 3, SMAP protections
are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode
data accesses (these are implicit supervisor accesses) regardless of the
value of EFLAGS.AC.

This patchset passes the SMAP feature through to guests, and lets guests
benefit from it.

Version 1:
  * Remove SMAP bit from CR4_RESERVED_BITS.
  * Add SMAP support when setting CR4
  * Disable SMAP for guests in EPT realmode and EPT unpaging mode
  * Expose SMAP feature to guest

Version 2:
  * Change the logic of updating mmu permission bitmap for SMAP violation
  * Expose SMAP feature to guest in the last patch of this series.

Version 3:
  * Changes in update_permission_bitmask().
  * Use a branchless way suggested by Paolo Bonzini to detect SMAP
violation in permission_fault(). 

Feng Wu (4):
  KVM: Remove SMAP bit from CR4_RESERVED_BITS.
  KVM: Add SMAP support when setting CR4
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: expose SMAP feature to guest

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/cpuid.c|  2 +-
 arch/x86/kvm/cpuid.h|  8 
 arch/x86/kvm/mmu.c  | 35 +---
 arch/x86/kvm/mmu.h  | 44 +
 arch/x86/kvm/paging_tmpl.h  |  2 +-
 arch/x86/kvm/vmx.c  | 11 ++-
 arch/x86/kvm/x86.c  |  9 -
 8 files changed, 93 insertions(+), 20 deletions(-)

-- 
1.8.3.1



[PATCH v3 1/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.

2014-03-31 Thread Feng Wu
This patch removes SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..4eeb049 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -60,7 +60,7 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
 | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
-- 
1.8.3.1



[PATCH v3 3/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

2014-03-31 Thread Feng Wu
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/vmx.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..e58cb5f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3452,13 +3452,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP is disabled if CPU is in non-paging mode in
-* hardware. However KVM always uses paging mode to
+* SMEP/SMAP is disabled if CPU is in non-paging mode
+* in hardware. However KVM always uses paging mode to
 * emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP needs to be manually
-* disabled when guest switches to non-paging mode.
+* To emulate this behavior, SMEP/SMAP needs to be
+* manually disabled when guest switches to non-paging
+* mode.
 */
-   hw_cr4 &= ~X86_CR4_SMEP;
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
1.8.3.1



[PATCH v3 4/4] KVM: expose SMAP feature to guest

2014-03-31 Thread Feng Wu
This patch exposes the SMAP feature to the guest.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..deb5f9b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
/* cpuid 7.0.ebx */
const u32 kvm_supported_word9_x86_features =
F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-   F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
+   F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP);
 
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
-- 
1.8.3.1



[PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Feng Wu
This patch adds SMAP handling logic when setting CR4 for guests

Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
way to detect SMAP violation.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/mmu.c | 35 ---
 arch/x86/kvm/mmu.h | 44 
 arch/x86/kvm/paging_tmpl.h |  2 +-
 arch/x86/kvm/x86.c |  9 -
 5 files changed, 85 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a2a1bb7..eeecbed 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu *vcpu)
return best && (best->ebx & bit(X86_FEATURE_SMEP));
 }
 
+static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ebx & bit(X86_FEATURE_SMAP));
+}
+
 static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b53135..5a1ed38 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,20 +3601,28 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu *vcpu,
}
 }
 
-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
 
smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   /*
+* PFERR_RSVD_MASK bit is used to detect SMAP violation.
+* We will check it in permission_fault(), this bit is
+* set in pfec for normal fault, while it is cleared for
+* SMAP violations.
+*/
+   smapf = !(pfec & PFERR_RSVD_MASK);
for (bit = 0; bit < 8; ++bit) {
x = bit & ACC_EXEC_MASK;
w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+* SMAP:kernel-mode data accesses from user-mode
+* mappings should fault. A fault is considered
+* as a SMAP violation if all of the following
+* conditions are true:
+*   - X86_CR4_SMAP is set in CR4
+*   - An user page is accessed
+*   - Page fault in kernel mode
+*   - !(CPL<3 && X86_EFLAGS_AC is set)
+*
+*   Here, we cover the first three conditions,
+*   The CPL and X86_EFLAGS_AC is in smapf,which
+*   permission_fault() computes dynamically.
+*
+*   Also, SMAP does not affect instruction
+*   fetches, add the !ff check here to make it
+*   clearer.
+*/
+   smap = cr4_smap && u && !uf && !ff;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
+   (smapf && smap);
map |= fault << bit;
}
mmu->permissions[byte] = map;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..822190f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,11 +44,17 @@
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1
 
-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_PRESENT_BIT 0
+#define PFERR_

KVM call agenda for 2014-04-01

2014-03-31 Thread Juan Quintela

Hi

Please, send any topic that you are interested in covering.

Thanks, Juan.

Call details:

10:00 AM to 11:00 AM EDT
Every two weeks

If you need phone number details,  contact me privately.


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Andreas Färber
Hi,

Am 31.03.2014 12:40, schrieb Juan Quintela:
> 
> Please, send any topic that you are interested in covering.

I would like to discuss the state of the QEMU release process, please:

* -rc1 has not been tagged.
* Who besides Anthony could upload a tarball if we tag and create it?
* make-release fix for SeaBIOS on the list. Ping, and are more affected?

Thanks,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Stefan Hajnoczi
On Mon, Mar 31, 2014 at 12:51:31PM +0200, Andreas Färber wrote:
> Am 31.03.2014 12:40, schrieb Juan Quintela:
> > 
> > Please, send any topic that you are interested in covering.
> 
> I would like to discuss the state of the QEMU release process, please:
> 
> * -rc1 has not been tagged.
> * Who besides Anthony could upload a tarball if we tag and create it?
> * make-release fix for SeaBIOS on the list. Ping, and are more affected?

+1

Stefan


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Peter Maydell
On 31 March 2014 14:21, Christian Borntraeger  wrote:
> Another thing might be the release process in general. Currently it seems
> that everybody tries to push everything just before the hard freeze.  I had
> to debug some problems introduced _after_ soft freeze. Is there some
> interest in having a Linux-like process (merge window + stabilization)? This
> would require shorter release cycles of course.

"merge window" has been suggested before. I think it would be
a terrible idea for QEMU, personally. We're not the kernel in
many ways, notably dev community size and a greater tendency
to changes that have effects across the whole tree.

Soft + hard freeze is our stabilization period currently.

thanks
-- PMM


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Christian Borntraeger
On 31/03/14 12:51, Andreas Färber wrote:
> Hi,
> 
> Am 31.03.2014 12:40, schrieb Juan Quintela:
>>
>> Please, send any topic that you are interested in covering.
> 
> I would like to discuss the state of the QEMU release process, please:
> 
> * -rc1 has not been tagged.
> * Who besides Anthony could upload a tarball if we tag and create it?
> * make-release fix for SeaBIOS on the list. Ping, and are more affected?

+1 for this one.

Another thing might be the release process in general. Currently it seems
that everybody tries to push everything just before the hard freeze.  I had
to debug some problems introduced _after_ soft freeze. Is there some
interest in having a Linux-like process (merge window + stabilization)? This
would require shorter release cycles of course.

Christian






Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Paolo Bonzini

Just a few comments...


-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;

smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);


Can you make an additional patch to rename this to cr4_smep?


+   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   /*
+* PFERR_RSVD_MASK bit is used to detect SMAP violation.
+* We will check it in permission_fault(), this bit is
+* set in pfec for normal fault, while it is cleared for
+* SMAP violations.
+*/


"This bit is set in PFEC if the access is _not_ subject to SMAP
restrictions, and cleared otherwise.  The bit is only meaningful if
the SMAP bit is set in CR4."


+   smapf = !(pfec & PFERR_RSVD_MASK);
for (bit = 0; bit < 8; ++bit) {
x = bit & ACC_EXEC_MASK;
w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+* SMAP:kernel-mode data accesses from user-mode
+* mappings should fault. A fault is considered
+* as a SMAP violation if all of the following
+* conditions are true:
+*   - X86_CR4_SMAP is set in CR4
+*   - An user page is accessed
+*   - Page fault in kernel mode
+*   - !(CPL<3 && X86_EFLAGS_AC is set)


- if CPL < 3, EFLAGS.AC is clear


+*   Here, we cover the first three conditions,
+*   The CPL and X86_EFLAGS_AC is in smapf,which
+*   permission_fault() computes dynamically.


The fourth is computed dynamically in permission_fault() and is in SMAPF.


+*   Also, SMAP does not affect instruction
+*   fetches, add the !ff check here to make it
+*   clearer.
+*/
+   smap = cr4_smap && u && !uf && !ff;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;

-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
+   (smapf && smap);
map |= fault << bit;
}
mmu->permissions[byte] = map;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..822190f 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,11 +44,17 @@
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1

-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_PRESENT_BIT 0
+#define PFERR_WRITE_BIT 1
+#define PFERR_USER_BIT 2
+#define PFERR_RSVD_BIT 3
+#define PFERR_FETCH_BIT 4
+
+#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
+#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
+#define PFERR_USER_MASK (1U << PFERR_USER_BIT)
+#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
+#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)

 int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
@@ -73,6 +79,8 @@ int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
 void kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
 void kvm_init_shadow_ept_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context,
bool execonly);
+void update_permission_bitmask(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
+   bool ept);

 static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
 {
@@ -110,10 +118,30 @@ static inline bool is_write_protection(struct kvm_vcpu *vcpu)
  * Will a fault with a given page-fault error co
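Paolo's branchless computation (quoted at the top of this thread) can be exercised in isolation. The sketch below is a standalone model, not the kernel code: the constants mirror the defines in this patch, and `smap_index()` is an illustration name.

```c
#include <assert.h>

#define X86_EFLAGS_AC_BIT 18
#define X86_EFLAGS_AC     (1UL << X86_EFLAGS_AC_BIT)
#define PFERR_WRITE_MASK  (1U << 1)
#define PFERR_RSVD_BIT    3

/*
 * Model of the index computation in permission_fault(): smap is
 * X86_EFLAGS_AC (not 1) when CPL < 3 and EFLAGS.AC is set, i.e. when
 * SMAP checks are overridden; the shift lands that bit at the
 * PFERR_RSVD position of the (pfec >> 1) index, selecting the group
 * built with smapf == 0 in update_permission_bitmask().
 */
static unsigned long smap_index(unsigned int pfec, int cpl,
				unsigned long rflags)
{
	unsigned long smap = (cpl - 3) & (rflags & X86_EFLAGS_AC);

	return (pfec >> 1) +
	       (smap >> (X86_EFLAGS_AC_BIT - PFERR_RSVD_BIT + 1));
}
```

For a supervisor write (pfec = PFERR_WRITE_MASK) the index is 1 while SMAP is enforced, and 5 — the slot whose PFERR_RSVD bit is set — when CPL < 3 with EFLAGS.AC set, matching the smapf = !(pfec & PFERR_RSVD_MASK) grouping above.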

Re: [PATCH 3/4] KVM: PPC: Alow kvmppc_get_last_inst() to fail

2014-03-31 Thread Alexander Graf

On 03/26/2014 09:52 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:

diff --git a/arch/powerpc/kvm/book3s_paired_singles.c b/arch/powerpc/kvm/book3s_paired_singles.c
index a59a25a..80c533e 100644
--- a/arch/powerpc/kvm/book3s_paired_singles.c
+++ b/arch/powerpc/kvm/book3s_paired_singles.c
@@ -640,19 +640,24 @@ static int kvmppc_ps_one_in(struct kvm_vcpu *vcpu, bool rc,
  
  int kvmppc_emulate_paired_single(struct kvm_run *run, struct kvm_vcpu *vcpu)

  {
-   u32 inst = kvmppc_get_last_inst(vcpu);
+   u32 inst;
enum emulation_result emulated = EMULATE_DONE;
-
-   int ax_rd = inst_get_field(inst, 6, 10);
-   int ax_ra = inst_get_field(inst, 11, 15);
-   int ax_rb = inst_get_field(inst, 16, 20);
-   int ax_rc = inst_get_field(inst, 21, 25);
-   short full_d = inst_get_field(inst, 16, 31);
-
-   u64 *fpr_d = &vcpu->arch.fpr[ax_rd];
-   u64 *fpr_a = &vcpu->arch.fpr[ax_ra];
-   u64 *fpr_b = &vcpu->arch.fpr[ax_rb];
-   u64 *fpr_c = &vcpu->arch.fpr[ax_rc];
+   int ax_rd, ax_ra, ax_rb, ax_rc;
+   short full_d;
+   u64 *fpr_d, *fpr_a, *fpr_b, *fpr_c;
+
+   kvmppc_get_last_inst(vcpu, &inst);

Should probably check for failure here and elsewhere -- even though it
can't currently fail on book3s, the interface now allows it.


diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 5b9e906..b0d884d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -624,9 +624,10 @@ void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr)
  static int kvmppc_read_inst(struct kvm_vcpu *vcpu)
  {
ulong srr0 = kvmppc_get_pc(vcpu);
-   u32 last_inst = kvmppc_get_last_inst(vcpu);
+   u32 last_inst;
int ret;
  
+	kvmppc_get_last_inst(vcpu, &last_inst);

ret = kvmppc_ld(vcpu, &srr0, sizeof(u32), &last_inst, false);

This isn't new, but this function looks odd to me -- calling
kvmppc_get_last_inst() but ignoring last_inst, then calling kvmppc_ld()
and ignoring anything but failure.  last_inst itself is never read.  And
no comments to explain the weirdness. :-)

I get that kvmppc_get_last_inst() is probably being used for the side
effect of filling in vcpu->arch.last_inst, but why store the return
value without using it?  Why pass the address of it to kvmppc_ld(),
which seems to be used only as an indirect way of determining whether
kvmppc_get_last_inst() failed?  And that whole mechanism becomes
stranger once it becomes possible for kvmppc_get_last_inst() to directly
return failure.


If you're interested in the history of this, here's the patch :)

https://github.com/mirrors/linux-2.6/commit/c7f38f46f2a98d232147e47284cb4e7363296a3e

The idea is that we need 2 things to be good after this function:

  1) vcpu->arch.last_inst is valid
  2) if the last instruction is not readable, return failure

Hence this weird construct. I don't think it's really necessary though - 
just remove the kvmppc_ld() call and only fail read_inst() when the 
caching didn't work and we can't translate the address.
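Scott's point — that callers must now check the result — can be sketched with a toy model. Everything here (the stub, the `_model` suffix, the constants) is illustrative, not the actual PPC code:

```c
#include <errno.h>

typedef unsigned int u32;

enum emulation_result { EMULATE_DONE, EMULATE_AGAIN, EMULATE_FAIL };

/* toy stand-in for the last-instruction cache */
static int last_inst_valid;

/* models the new interface: fill *inst, or fail like a missed fetch */
static int kvmppc_get_last_inst_model(u32 *inst)
{
	if (!last_inst_valid)
		return -EFAULT;		/* e.g. translation failed */
	*inst = 0x60000000;		/* placeholder: a nop */
	return 0;
}

/* the caller propagates failure instead of decoding a stale value */
static enum emulation_result emulate_one(void)
{
	u32 inst;

	if (kvmppc_get_last_inst_model(&inst) < 0)
		return EMULATE_AGAIN;	/* re-enter the guest and retry */

	(void)inst;	/* decoding ax_rd/ax_ra/... would happen here */
	return EMULATE_DONE;
}
```

The point of the sketch is only the control flow: once the fetch can fail, the stale cached value must never reach the decoder.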



Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Alexander Graf

On 03/26/2014 10:17 PM, Scott Wood wrote:

On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:

Load external pid (lwepx) instruction faults (when called from
KVM with guest context) need to be handled by KVM. This implies
additional code in the DO_KVM macro to identify the source of the
exception (which originates from the KVM host rather than the guest).
The hook requires checking the Exception Syndrome Register
ESR[EPID] and External PID Load Context Register EPLC[EGS] for
some exceptions (DTLB_MISS, DSI and LRAT). Doing this on the Data TLB
miss exception is obviously intrusive for the host.

Get rid of lwepx and acquire the last instruction in kvmppc_get_last_inst()
by searching for the physical address and kmapping it. This fixes an
infinite loop caused by lwepx's data TLB miss being handled in the host
and the TODO for TLB eviction and execute-but-not-read entries.

Signed-off-by: Mihai Caraman 
---
  arch/powerpc/kvm/bookehv_interrupts.S |   37 +++--
  arch/powerpc/kvm/e500_mmu_host.c  |   93 +
  2 files changed, 102 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index 20c7a54..c50490c 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -119,38 +119,14 @@
  1:
  
  	.if	\flags & NEED_EMU

-   /*
-* This assumes you have external PID support.
-* To support a bookehv CPU without external PID, you'll
-* need to look up the TLB entry and create a temporary mapping.
-*
-* FIXME: we don't currently handle if the lwepx faults.  PR-mode
-* booke doesn't handle it either.  Since Linux doesn't use
-* broadcast tlbivax anymore, the only way this should happen is
-* if the guest maps its memory execute-but-not-read, or if we
-* somehow take a TLB miss in the middle of this entry code and
-* evict the relevant entry.  On e500mc, all kernel lowmem is
-* bolted into TLB1 large page mappings, and we don't use
-* broadcast invalidates, so we should not take a TLB miss here.
-*
-* Later we'll need to deal with faults here.  Disallowing guest
-* mappings that are execute-but-not-read could be an option on
-* e500mc, but not on chips with an LRAT if it is used.
-*/
-
-   mfspr   r3, SPRN_EPLC   /* will already have correct ELPID and EGS */
PPC_STL r15, VCPU_GPR(R15)(r4)
PPC_STL r16, VCPU_GPR(R16)(r4)
PPC_STL r17, VCPU_GPR(R17)(r4)
PPC_STL r18, VCPU_GPR(R18)(r4)
PPC_STL r19, VCPU_GPR(R19)(r4)
-   mr  r8, r3
PPC_STL r20, VCPU_GPR(R20)(r4)
-   rlwimi  r8, r6, EPC_EAS_SHIFT - MSR_IR_LG, EPC_EAS
PPC_STL r21, VCPU_GPR(R21)(r4)
-   rlwimi  r8, r6, EPC_EPR_SHIFT - MSR_PR_LG, EPC_EPR
PPC_STL r22, VCPU_GPR(R22)(r4)
-   rlwimi  r8, r10, EPC_EPID_SHIFT, EPC_EPID
PPC_STL r23, VCPU_GPR(R23)(r4)
PPC_STL r24, VCPU_GPR(R24)(r4)
PPC_STL r25, VCPU_GPR(R25)(r4)
@@ -160,10 +136,15 @@
PPC_STL r29, VCPU_GPR(R29)(r4)
PPC_STL r30, VCPU_GPR(R30)(r4)
PPC_STL r31, VCPU_GPR(R31)(r4)
-   mtspr   SPRN_EPLC, r8
-   isync
-   lwepx   r9, 0, r5
-   mtspr   SPRN_EPLC, r3
+
+   /*
+* We don't use external PID support. lwepx faults would need to be
+* handled by KVM and this implies additional code in DO_KVM (for
+* DTLB_MISS, DSI and LRAT) to check ESR[EPID] and EPLC[EGS], which
+* is too intrusive for the host. Get the last instruction in
+* kvmppc_get_last_inst().
+*/
+   li  r9, KVM_INST_FETCH_FAILED
stw r9, VCPU_LAST_INST(r4)
.endif
  
diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c

index 6025cb7..1b4cb41 100644
--- a/arch/powerpc/kvm/e500_mmu_host.c
+++ b/arch/powerpc/kvm/e500_mmu_host.c
@@ -598,9 +598,102 @@ void kvmppc_mmu_map(struct kvm_vcpu *vcpu, u64 eaddr, gpa_t gpaddr,
}
  }
  
+#ifdef CONFIG_KVM_BOOKE_HV

+int kvmppc_ld_inst(struct kvm_vcpu *vcpu, u32 *instr)

It'd be interesting to see what the performance impact of doing this on
non-HV would be -- it would eliminate divergent code, eliminate the
MSR_DS hack, and make exec-only mappings work.


We hit the instruction emulation path a lot more often on non-HV, so 
even a slight performance impact that might not be a major bummer for HV 
would become critical for PR.


But I agree - it'd be interesting to see numbers.




+{
+   gva_t geaddr;
+   hpa_t addr;
+   hfn_t pfn;
+   hva_t eaddr;
+   u32 mas0, mas1, mas2, mas3;
+   u64 mas7_mas3;
+   struct page *page;
+   unsigned int addr_space, psize_shift;
+   bool pr;
+   unsigned long flags;
+
+   /* Search TLB for guest pc to get the real address */
+   geaddr = kvmppc_get_pc(vcpu);
+   addr_space = (vcpu->arch.shared->msr & MSR_IS) >> MSR_IR_LG;
+
+  

Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Anthony Liguori
On Mon, Mar 31, 2014 at 6:25 AM, Peter Maydell  wrote:
> On 31 March 2014 14:21, Christian Borntraeger  wrote:
>> Another thing might be the release process in general. Currently it seems
>> that everybody tries to push everything just before the hard freeze.  I had
>> to debug some problems introduced _after_ soft freeze. Is there some
>> interest in having a Linux-like process (merge window + stabilization)? This
>> would require shorter release cycles of course.
>
> "merge window" has been suggested before. I think it would be
> a terrible idea for QEMU, personally. We're not the kernel in
> many ways, notably dev community size and a greater tendency
> to changes that have effects across the whole tree.
>
> Soft + hard freeze is our stabilization period currently.

Peter, are you willing to do the tagging and announcement for the 2.0
rcs?  I sent instructions privately and between stefanha and I we can
get your permissions sorted out.

Regards,

Anthony Liguori

> thanks
> -- PMM


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Paolo Bonzini

On 31/03/2014 16:01, Anthony Liguori wrote:

> "merge window" has been suggested before. I think it would be
> a terrible idea for QEMU, personally. We're not the kernel in
> many ways, notably dev community size and a greater tendency
> to changes that have effects across the whole tree.
>
> Soft + hard freeze is our stabilization period currently.

Peter, are you willing to do the tagging and announcement for the 2.0
rcs?  I sent instructions privately and between stefanha and I we can
get your permissions sorted out.


I think it would be a good idea to separate the committer and release 
manager roles.  Peter is providing the community with a wonderful 
service, just like you were; putting too much work on his shoulders 
risks getting us in the same situation if anything were to affect his 
ability to provide it.


Paolo


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Peter Maydell
On 31 March 2014 15:28, Paolo Bonzini  wrote:
> I think it would be a good idea to separate the committer and release
> manager roles.  Peter is providing the community with a wonderful service,
> just like you were; putting too much work on his shoulders risks getting us
> in the same situation if anything were to affect his ability to provide it.

Yes, I strongly agree with this. I think we'll do much better
if we can manage to share out responsibilities among a wider
group of people.

thanks
-- PMM


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Andreas Färber
On 31.03.2014 16:32, Peter Maydell wrote:
> On 31 March 2014 15:28, Paolo Bonzini  wrote:
>> I think it would be a good idea to separate the committer and release
>> manager roles.  Peter is providing the community with a wonderful service,
>> just like you were; putting too much work on his shoulders risks getting us
>> in the same situation if anything were to affect his ability to provide it.
> 
> Yes, I strongly agree with this. I think we'll do much better
> if we can manage to share out responsibilities among a wider
> group of people.

May I propose Michael Roth, who is already experienced from the N-1
stable releases?

If we can enable him to upload the tarballs created from his tags that
would also streamline the stable workflow while at it.

Regards,
Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Anthony Liguori
On Mon, Mar 31, 2014 at 7:46 AM, Andreas Färber  wrote:
> On 31.03.2014 16:32, Peter Maydell wrote:
>> On 31 March 2014 15:28, Paolo Bonzini  wrote:
>>> I think it would be a good idea to separate the committer and release
>>> manager roles.  Peter is providing the community with a wonderful service,
>>> just like you were; putting too much work on his shoulders risks getting us
>>> in the same situation if anything were to affect his ability to provide it.
>>
>> Yes, I strongly agree with this. I think we'll do much better
>> if we can manage to share out responsibilities among a wider
>> group of people.
>
> May I propose Michael Roth, who is already experienced from the N-1
> stable releases?
>
> If we can enable him to upload the tarballs created from his tags that
> would also streamline the stable workflow while at it.

If mdroth is willing to take this on, I am very supportive.

Regards,

Anthony Liguori

>
> Regards,
> Andreas
>
> --
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Markus Armbruster
Anthony Liguori  writes:

> On Mon, Mar 31, 2014 at 7:46 AM, Andreas Färber  wrote:
>> On 31.03.2014 16:32, Peter Maydell wrote:
>>> On 31 March 2014 15:28, Paolo Bonzini  wrote:
 I think it would be a good idea to separate the committer and release
 manager roles.  Peter is providing the community with a wonderful service,
 just like you were; putting too much work on his shoulders risks getting us
 in the same situation if anything were to affect his ability to provide it.
>>>
>>> Yes, I strongly agree with this. I think we'll do much better
>>> if we can manage to share out responsibilities among a wider
>>> group of people.
>>
>> May I propose Michael Roth, who is already experienced from the N-1
>> stable releases?
>>
>> If we can enable him to upload the tarballs created from his tags that
>> would also streamline the stable workflow while at it.
>
> If mdroth is willing to take this on, I am very supportive.

Another vote of confidence from me.


Re: [RFC v2] ARM VM System Specification

2014-03-31 Thread Christoffer Dall
On Sun, Mar 30, 2014 at 03:10:50PM -0700, Olof Johansson wrote:
> On Fri, Mar 28, 2014 at 11:45 AM, Christoffer Dall
>  wrote:
> > ARM VM System Specification
> > ===
> 
> [not quoting the whole spec here]
> 
> This looks very sane to me, and aligns very well with non-virtualized images.
> 
> For what it's worth, even though it's not a patch:
> 
> Acked-by: Olof Johansson 

Thanks, appreciate it.

> 
> What's the plan on how to lock this in? Where's this document going to
> live? In the kernel sources under Documentation/, or external?
> 

There were talks during LCA14 about publishing it as a Linaro
white-paper, but I would like it to be hosted by the open-source
projects that relate to it, for example the Linux kernel under
Documentation/ to represent both the guest and KVM side, and the Xen
sources too.

However, that may be a pain to update, so the preferred method could be
to either host it in its current form (clear-text) or publish a
versioned PDF somewhere and simply point to it from the compliant
software pieces.

Suggestions or input welcome.

-Christoffer


Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-03-31 Thread Andy Lutomirski
On 03/29/2014 01:47 AM, Zhanghailiang wrote:
> Hi,
> I found when Guest is idle, VDSO pvclock may increase host consumption.
> We can calculate as follows. Correct me if I am wrong.
>   (Host)250 * update_pvclock_gtod = 1500 * gettimeofday(Guest)
> In Host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in 
> timekeeping.c. It consumes nearly 900 cycles per call. So at 250 Hz, it 
> may consume 225,000 cycles per second, even if no VM is created.
> In Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If 
> no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles per 
> call. The feature decreases consumption by 150 cycles per call. 
> When gettimeofday is called 1500 times, it saves 225,000 cycles, equal to 
> the host consumption.
> Both Host and Guest are linux-3.13.6.
> So, is the host cpu consumption a problem?

Does pvclock serve any real purpose on systems with fully-functional
TSCs?  The x86 guest implementation is awful, so it's about 2x slower
than TSC.  It could be improved a lot, but I'm not sure I understand why
it exists in the first place.

I certainly understand the goal of keeping the guest CLOCK_REALTIME in
sync with the host, but pvclock seems like overkill for that.

--Andy



Re: Demand paging for VM on KVM

2014-03-31 Thread Andrea Arcangeli
Hi Grigory,

On Thu, Mar 20, 2014 at 10:50:07AM -0700, Grigory Makarevich wrote:
> Andrea, Paolo,
> 
> Thanks a lot for the comments.
> 
> I like the idea of userfaultfd a lot.  For my prototype I had to solve a
> problem of accessing the ondemand page from paths where exiting is
> not safe (the emulator is one example). I solved it by sending a message
> over a netlink socket, blocking the calling thread, and waking it up once
> the page is delivered. userfaultfd might be a cleaner way to achieve the
> same goal.

I'm glad you like the idea of userfaultfd, the way the syscall works
tends to mirror the eventfd(2) syscall but the protocol talked on the
fd is clearly different.

So you also made the vcpu talk from the kernel to the "ondemand"
thread through some file descriptor and protocol. So the difference is
that it shouldn't be limited to the emulator but it shall work for all
syscalls that the IO thread may invoke and that could hit on missing
guest physical memory.

> My concerns regarding "general" mm solution are:
> 
> - Will it work with any memory mapping schema or only with anonymous memory
> ?

Currently I have the hooks only into anonymous memory but there's no
reason why this shouldn't work for other kind of page faults. The main
issue is not with the technicality in waiting in the page fault but
with the semantics of a page not being mapped.

If there's a hole in a filebacked mapping, things are different than
if there's a hole in anonymous memory, as the VM can unmap and free a
filebacked page at any time. But if you're ok with notifying the
migration thread even in such a case, it could work the same.

It would be more tricky if we had to differentiate an initial fault
after the vma is created (in order to notify the migration thread only
for initial faults), from faults triggered after the VM unmapped and
freed the page as result of VM pressure. That would require putting a
placeholder in the pagetable instead of keeping the VM code identical
to now (which zeroes a pagetable when a page is unmapped from a
filebacked mapping).

There is also some similarity between the userfault mechanism and the
volatile ranges but volatile ranges are handled in a finegrined way in
the pagetables, but it looked like there was no code to share in the
end. Volatile ranges provides a very different functionality from
userfault, for example they don't need to provide transparent behavior
if the memory is given as parameter to syscalls as far as I know. They
are used only to store data in memory accessed by userland through the
pagetables (not syscalls). The objective of userfault is also
fundamentally different from the objective of volatile ranges. We
cannot ever lose any data, while their whole point is to lose data if
there's VM pressure.

However if you want to extend the userfaultfd functionality to trap the
first access in the pagetables for filebacked pages, we would also
need to mangle pagetables with some placeholders to differentiate the
first fault and we may want to revisit if there's some code or
placeholder to share across the two features. Ideally support for
filebacked ranges could be a added at a second stage.

> - Before blocking the calling thread while serving the page-fault in
> host-kernel, one would need carefully release mmu semaphore (otherwise, the
> user-space might be in trouble serving the page-fault), which may be not
> that trivial.

Yes all locks must be dropped before waiting for the migration thread
and that includes the mmap_sem. I don't think we'll need to add much
complexity though, because we can rely on the behavior of page faults
that can be repeated endlessly until they succeed. It's certainly more
trivial for real faults than gup (because gup can work on more than 1
page at time so it may require some rolling back or special lock
retaking during the gup loop). With the real faults the only trick as
optimization will be to repeat it without returning to userland.

> Regarding qemu part of that:
> 
> -  Yes, indeed, user-space would need to be careful accessing "ondemand"
> pages. However,  that should not be a problem, considering that qemu would
> need to know in advance all "ondemand" regions.  Though, I would expect
> some  refactoring of the qemu internal api might be required.

I don't think the migration thread would risk messing with missing
memory, but the refactoring of some API and wire protocol will be
still needed to make the protocol bidirectional and to handle the
postcopy mechanism. David is working on it.

Paolo also pointed out one case that won't be entirely transparent. If
gdb would debug the migration thread breakpointing into it, and then
the gdb user tries to touch missing memory with ptrace after it
stopped the migration thread, gdb would then soft lockup. It'll
require a sigkill to unblock. There's no real way to fix it in my
view... and overall it looks quite reasonable to end up in a soft
lockup in such a scenario, I mean gdb has other ways to interfere in
bad w

RE: mechanism to allow a driver to bind to any device

2014-03-31 Thread Stuart Yoder


> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, March 26, 2014 5:09 PM
> To: Alexander Graf
> Cc: kvm@vger.kernel.org; jan.kis...@siemens.com; will.dea...@arm.com;
> Yoder Stuart-B08248; a.r...@virtualopensystems.com; Michal Hocko; Wood
> Scott-B07421; Sethi Varun-B16395; kvm...@lists.cs.columbia.edu; Rafael J.
> Wysocki; Guenter Roeck; Dmitry Kasatkin; Tejun Heo; Bjorn Helgaas;
> Antonios Motakis; t...@virtualopensystems.com; Toshi Kani; Greg KH;
> linux-ker...@vger.kernel.org; io...@lists.linux-foundation.org; Joe
> Perches; christoffer.d...@linaro.org
> Subject: Re: mechanism to allow a driver to bind to any device
> 
> On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote:
> > On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote:
> > >
> > > > On 26.03.2014 at 22:40, Konrad Rzeszutek Wilk wrote:
> > > >
> > > >> On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote:
> > > >> Hi Greg,
> > > >>
> > > >> We (Linaro, Freescale, Virtual Open Systems) are trying to get an
> > > >> issue closed that has been percolating for a while, around creating
> > > >> a mechanism that will allow kernel drivers like vfio to bind to
> > > >> devices of any type.
> > > >>
> > > >> This thread with you:
> > > >> http://www.spinics.net/lists/kvm-arm/msg08370.html
> > > >> ...seems to have died out, so am trying to get your response
> > > >> and will summarize again.  Vfio drivers in the kernel (regardless
> > > >> of bus type) need to bind to devices of any type.  The driver's
> > > >> function is to simply export hardware resources of any type to
> > > >> user space.
> > > >>
> > > >> There are several approaches that have been proposed:
> > > >
> > > > You seem to have missed the one I proposed.
> > > >>
> > > >>   1.  new_id -- (current approach) the user explicitly registers
> > > >>   each new device type with the vfio driver using the new_id
> > > >>   mechanism.
> > > >>
> > > >>   Problem: multiple drivers will be resident that handle the
> > > >>   same device type...and there is nothing user space hotplug
> > > >>   infrastructure can do to help.
> > > >>
> > > >>   2.  "any id" -- the vfio driver could specify a wildcard match
> > > >>   of some kind in its ID match table which would allow it to
> > > >>   match and bind to any possible device id.  However,
> > > >>   we don't want the vfio driver grabbing _all_ devices...just
> > > >>   the ones we explicitly want to pass to user space.
> > > >>
> > > >>   The proposed patch to support this was to create a new flag
> > > >>   "sysfs_bind_only" in struct device_driver.  When this flag
> > > >>   is set, the driver can only bind to devices via the sysfs
> > > >>   bind file.  This would allow the wildcard match to work.
> > > >>
> > > >>   Patch is here:
> > > >>   https://lkml.org/lkml/2013/12/3/253
> > > >>
> > > >>   3.  "Driver initiated explicit bind" -- with this approach the
> > > >>   vfio driver would create a private 'bind' sysfs object
> > > >>   and the user would echo the requested device into it:
> > > >>
> > > >>   echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind
> > > >>
> > > >>   In order to make that work, the driver would need to call
> > > >>   driver_probe_device() and thus we need this patch:
> > > >>   https://lkml.org/lkml/2014/2/8/175
> > > >
> > > > 4). Use the 'unbind' (from the original device) and 'bind' to vfio
> > > > driver.
> > >
> > > This is approach 2, no?
> > >
> > > >
> > > > Which I think is what is currently being done. Why is that not
> > > > sufficient?
> > >
> > > How would 'bind to vfio driver' look like?
> > >
> > > > The only thing I see in the URL is " That works, but it is ugly."
> > > > There is some mention of a race but I don't see how - if you do the
> > > > 'unbind' on the original driver and then bind the BDF to the VFIO
> > > > driver, how would you get a race?
> > >
> > > Typically on PCI, you do a
> > >
> > >   - add wildcard (pci id) match to vfio driver
> > >   - unbind driver
> > >   -> reprobe
> > >   -> device attaches to vfio driver because it is the least recent
> > >  match
> > >   - remove wildcard match from vfio driver
> > >
> > > If in between you hotplug add a card of the same type, it gets
> > > attached to vfio - even though the logical "default driver" would be
> > > the device specific driver.
> >
> > I've mentioned drivers_autoprobe in the past, but I'm not sure we're
> > really factoring it into the discussion.  drivers_autoprobe allows us
> > to toggle two points:
> >
> > a) When a new device is added whether we automatically give drivers a
> > try at binding to it
> >
> > b) When a new driver is added whether it gets to try to bind to
> > anything in the system
> >
> > So we do have a mechanism to avoid the race, but the problem is that it
> > becomes the responsibility of userspace to:
> >
> > 1) turn off drivers_autoprobe
> > 2) unbind/new_id/bind/re

[PATCH 1/2] kvm: support any-length wildcard ioeventfd

2014-03-31 Thread Michael S. Tsirkin
It is sometimes beneficial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set. In particular, this kind
of access can be optimized on VMX: there is no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.

However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean "ignore length for matching" in a more optimal way.

Signed-off-by: Michael S. Tsirkin 
---
 include/uapi/linux/kvm.h |  3 ++-
 arch/x86/kvm/x86.c   |  1 +
 virt/kvm/eventfd.c   | 27 ++-
 3 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 932d7f2..d73007e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -464,7 +464,7 @@ enum {
 struct kvm_ioeventfd {
__u64 datamatch;
__u64 addr;/* legal pio/mmio address */
-   __u32 len; /* 1, 2, 4, or 8 bytes*/
+   __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
__s32 fd;
__u32 flags;
__u8  pad[36];
@@ -675,6 +675,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_SPAPR_MULTITCE 94
 #define KVM_CAP_EXT_EMUL_CPUID 95
 #define KVM_CAP_HYPERV_TIME 96
+#define KVM_CAP_IOEVENTFD_NO_LENGTH 97
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c6218b..8ae1ff5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2602,6 +2602,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_IRQ_INJECT_STATUS:
case KVM_CAP_IRQFD:
case KVM_CAP_IOEVENTFD:
+   case KVM_CAP_IOEVENTFD_NO_LENGTH:
case KVM_CAP_PIT2:
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index abe4d60..9eadb59 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -600,7 +600,15 @@ ioeventfd_in_range(struct _ioeventfd *p, gpa_t addr, int len, const void *val)
 {
u64 _val;
 
-   if (!(addr == p->addr && len == p->length))
+   if (addr != p->addr)
+   /* address must be precise for a hit */
+   return false;
+
+   if (!p->length)
+   /* length = 0 means only look at the address, so always a hit */
+   return true;
+
+   if (len != p->length)
/* address-range must be precise for a hit */
return false;
 
@@ -671,9 +679,11 @@ ioeventfd_check_collision(struct kvm *kvm, struct _ioeventfd *p)
 
list_for_each_entry(_p, &kvm->ioeventfds, list)
if (_p->bus_idx == p->bus_idx &&
-   _p->addr == p->addr && _p->length == p->length &&
-   (_p->wildcard || p->wildcard ||
-_p->datamatch == p->datamatch))
+   _p->addr == p->addr &&
+   (!_p->length || !p->length ||
+(_p->length == p->length &&
+ (_p->wildcard || p->wildcard ||
+  _p->datamatch == p->datamatch
return true;
 
return false;
@@ -697,8 +707,9 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
int   ret;
 
bus_idx = ioeventfd_bus_from_flags(args->flags);
-   /* must be natural-word sized */
+   /* must be natural-word sized, or 0 to ignore length */
switch (args->len) {
+   case 0:
case 1:
case 2:
case 4:
@@ -716,6 +727,12 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
if (args->flags & ~KVM_IOEVENTFD_VALID_FLAG_MASK)
return -EINVAL;
 
+   /* ioeventfd with no length can't be combined with DATAMATCH */
+   if (!args->len &&
+   args->flags & (KVM_IOEVENTFD_FLAG_PIO |
+  KVM_IOEVENTFD_FLAG_DATAMATCH))
+   return -EINVAL;
+
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
-- 
MST
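The argument checks this patch adds can be modeled in userspace as follows
(the flag values below are illustrative stand-ins, not the real UAPI
constants):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative stand-ins for the UAPI flag bits. */
#define MODEL_FLAG_DATAMATCH (1u << 0)
#define MODEL_FLAG_PIO       (1u << 1)

/* Mirrors the patched kvm_assign_ioeventfd() validation: len must be
 * 0, 1, 2, 4 or 8, and len == 0 cannot be combined with PIO or
 * DATAMATCH.  Returns 0 on success, -EINVAL otherwise. */
static int model_validate_ioeventfd(uint32_t len, uint32_t flags)
{
    switch (len) {
    case 0: case 1: case 2: case 4: case 8:
        break;
    default:
        return -EINVAL;
    }
    if (!len && (flags & (MODEL_FLAG_PIO | MODEL_FLAG_DATAMATCH)))
        return -EINVAL;
    return 0;
}
```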

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] vmx: speed up wildcard MMIO EVENTFD

2014-03-31 Thread Michael S. Tsirkin
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Unfortunately, this only works if userspace does not need to match on
access length and data.  The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
- minimize overhead for old userspace that does not use eventfd with
  length = 0
- minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them to handle
  an invalid length)

At the moment, this optimization only has effect for EPT on x86.

It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.

With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measurable slowdown to non-eventfd MMIO.

Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.
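A rough userspace sketch of why a separate address-only bus makes the exit
path cheap: on an EPT misconfig the guest physical address is already known
from the VMCS, so the lookup reduces to a search over registered addresses
with no length or data comparison (all names here are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative model of the FAST_MMIO fast path: a hit on this bus
 * means the eventfd can be signaled and instruction emulation skipped. */
#define MAX_FAST_MMIO 16

static uint64_t fast_mmio_addrs[MAX_FAST_MMIO];
static size_t fast_mmio_count;

static void fast_mmio_register(uint64_t gpa)
{
    if (fast_mmio_count < MAX_FAST_MMIO)
        fast_mmio_addrs[fast_mmio_count++] = gpa;
}

/* true = handled on the fast bus, false = fall back to emulation */
static bool fast_mmio_write(uint64_t gpa)
{
    for (size_t i = 0; i < fast_mmio_count; i++)
        if (fast_mmio_addrs[i] == gpa)
            return true;
    return false;
}
```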

Signed-off-by: Michael S. Tsirkin 
---
 include/linux/kvm_host.h |  1 +
 include/uapi/linux/kvm.h |  1 +
 arch/x86/kvm/vmx.c   |  4 
 virt/kvm/eventfd.c   | 16 
 virt/kvm/kvm_main.c  |  1 +
 5 files changed, 23 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index b8e9a43..c2eaa9f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -163,6 +163,7 @@ enum kvm_bus {
KVM_MMIO_BUS,
KVM_PIO_BUS,
KVM_VIRTIO_CCW_NOTIFY_BUS,
+   KVM_FAST_MMIO_BUS,
KVM_NR_BUSES
 };
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index d73007e..fc75810 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -450,6 +450,7 @@ enum {
kvm_ioeventfd_flag_nr_pio,
kvm_ioeventfd_flag_nr_deassign,
kvm_ioeventfd_flag_nr_virtio_ccw_notify,
+   kvm_ioeventfd_flag_nr_fast_mmio,
kvm_ioeventfd_flag_nr_max,
 };
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..78a1932 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5493,6 +5493,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa_t gpa;
 
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+   if (!kvm_io_bus_write(vcpu->kvm, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
 
ret = handle_mmio_page_fault_common(vcpu, gpa, true);
if (likely(ret == RET_MMIO_PF_EMULATE))
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 9eadb59..c61f2eb 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -770,6 +770,16 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
if (ret < 0)
goto unlock_fail;
 
+   /* When length is ignored, MMIO is also put on a separate bus, for
+* faster lookups.
+*/
+   if (!args->len && !(args->flags & KVM_IOEVENTFD_FLAG_PIO)) {
+   ret = kvm_io_bus_register_dev(kvm, KVM_FAST_MMIO_BUS,
+ p->addr, 0, &p->dev);
+   if (ret < 0)
+   goto register_fail;
+   }
+
kvm->buses[bus_idx]->ioeventfd_count++;
list_add_tail(&p->list, &kvm->ioeventfds);
 
@@ -777,6 +787,8 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
 
return 0;
 
+register_fail:
+   kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
 unlock_fail:
mutex_unlock(&kvm->slots_lock);
 
@@ -816,6 +828,10 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
continue;
 
kvm_io_bus_unregister_dev(kvm, bus_idx, &p->dev);
+   if (!p->length) {
+   kvm_io_bus_unregister_dev(kvm, KVM_FAST_MMIO_BUS,
+ &p->dev);
+   }
kvm->buses[bus_idx]->ioeventfd_count--;
ioeventfd_release(p);
ret = 0;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 03a0381..22a4e1b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2920,6 +2920,7 @@ static int __kvm_io_bus_read(struct kvm_io_bus *bus, struct kvm_io_range *range,
 
return -EOPNOTSUPP;
 }
+EXPORT_SYMBOL_GPL(kvm_io_bus_write);
 
 /* kvm_io_bus_read - called under kvm->slots_lock */
 int kvm_io_bus_read(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
-- 
MST



Re: [Qemu-devel] KVM call agenda for 2014-04-01

2014-03-31 Thread Michael Roth
Quoting Andreas Färber (2014-03-31 09:46:45)
> Am 31.03.2014 16:32, schrieb Peter Maydell:
> > On 31 March 2014 15:28, Paolo Bonzini  wrote:
> >> I think it would be a good idea to separate the committer and release
> >> manager roles.  Peter is providing the community with a wonderful service,
> >> just like you were; putting too much work on his shoulders risks getting us
> >> in the same situation if anything were to affect his ability to provide it.
> > 
> > Yes, I strongly agree with this. I think we'll do much better
> > if we can manage to share out responsibilities among a wider
> > group of people.
> 
> May I propose Michael Roth, who is already experienced from the N-1
> stable releases?

Sure, I would be willing.

> 
> If we can enable him to upload the tarballs created from his tags that
> would also streamline the stable workflow while at it.

Agreed, though I feel a little weird about creating releases for tags that
aren't in the official repo. Would that be acceptable from a community
stand-point? I'm honestly not sure.

Otherwise I think Anthony/Peter would probably still need to process a pull
for stable-y.x branch in advance before we do the tarball/release. Would
still help simplify things a bit though by keeping tasks compartmentalized.

Anthony, Peter: in the past, prior to release, I just sent an email with
a pointer to my github branch with the stable release tagged. Would a proper
pull request (with a for-stable-x.y tag or somesuch) be preferable?

If we opt to align the stable repo updates with the actual release, what kind
of lead time would we need prior to actual release?

> 
> Regards,
> Andreas
> 
> -- 
> SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
> GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg



[PATCH v2 0/3] nVMX: Fixes to run Xen as L1

2014-03-31 Thread Bandan Das
Minor changes to enable Xen as an L1 hypervisor.

Tested with a Haswell host, Xen-4.3 L1 and debian6 L2

v2: 
* Remove advertising single context invalidation for emulated invept
  Patch "KVM: nVMX: check for null vmcs12 when L1 does invept" from v1 
  is now obsolete and is removed
* Reorder patches "KVM: nVMX: Advertise support for interrupt acknowledgement"
  and "nVMX: Ack and write vector info to intr_info if L1 asks us to"
* Add commit description to 2/3 and change comment for nested_exit_intr_ack_set

Jan, I will send a separate unit-test patch

Bandan Das (3):
  KVM: nVMX: Don't advertise single context invalidation for invept
  KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to
  KVM: nVMX: Advertise support for interrupt acknowledgement

 arch/x86/kvm/irq.c |  1 +
 arch/x86/kvm/vmx.c | 35 ---
 2 files changed, 25 insertions(+), 11 deletions(-)

-- 
1.8.3.1



[PATCH v2 1/3] KVM: nVMX: Don't advertise single context invalidation for invept

2014-03-31 Thread Bandan Das
For single context invalidation, we fall through to global
invalidation in handle_invept() except for one case - when
the operand supplied by L1 is different from what we have in
vmcs12. However, typically hypervisors will only call invept
for the currently loaded eptp, so the condition will
never be true.
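A toy model of the resulting behavior (a userspace sketch with illustrative
names): only the global extent is advertised, so that is the only type
handle_invept() needs to service; an unadvertised type is reported as an
error here rather than trapped with BUG() as in the kernel code.

```c
#include <assert.h>

enum invept_type { INVEPT_CONTEXT = 1, INVEPT_GLOBAL = 2 };

/* Bit we advertise to L1 in the modeled EPT capability MSR. */
#define ADVERTISED_EXTENTS (1 << INVEPT_GLOBAL)

/* Returns 1 if a (global) flush was performed, -1 for a type that was
 * never advertised, which a well-behaved L1 should not use. */
static int model_handle_invept(enum invept_type type)
{
    if (!(ADVERTISED_EXTENTS & (1 << type)))
        return -1;
    return 1;   /* sync roots + flush TLB in the real code */
}
```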

Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..3e7f60c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2331,12 +2331,11 @@ static __init void nested_vmx_setup_ctls_msrs(void)
 VMX_EPT_INVEPT_BIT;
nested_vmx_ept_caps &= vmx_capability.ept;
/*
-* Since invept is completely emulated we support both global
-* and context invalidation independent of what host cpu
-* supports
+* For nested guests, we don't do anything specific
+* for single context invalidation. Hence, only advertise
+* support for global context invalidation.
 */
-   nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT |
-   VMX_EPT_EXTENT_CONTEXT_BIT;
+   nested_vmx_ept_caps |= VMX_EPT_EXTENT_GLOBAL_BIT;
} else
nested_vmx_ept_caps = 0;
 
@@ -6383,7 +6382,6 @@ static int handle_invept(struct kvm_vcpu *vcpu)
struct {
u64 eptp, gpa;
} operand;
-   u64 eptp_mask = ((1ull << 51) - 1) & PAGE_MASK;
 
if (!(nested_vmx_secondary_ctls_high & SECONDARY_EXEC_ENABLE_EPT) ||
!(nested_vmx_ept_caps & VMX_EPT_INVEPT_BIT)) {
@@ -6423,16 +6421,13 @@ static int handle_invept(struct kvm_vcpu *vcpu)
}
 
switch (type) {
-   case VMX_EPT_EXTENT_CONTEXT:
-   if ((operand.eptp & eptp_mask) !=
-   (nested_ept_get_cr3(vcpu) & eptp_mask))
-   break;
case VMX_EPT_EXTENT_GLOBAL:
kvm_mmu_sync_roots(vcpu);
kvm_mmu_flush_tlb(vcpu);
nested_vmx_succeed(vcpu);
break;
default:
+   /* Trap single context invalidation invept calls */
BUG_ON(1);
break;
}
-- 
1.8.3.1



[PATCH v2 2/3] KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to

2014-03-31 Thread Bandan Das
This feature emulates the "Acknowledge interrupt on exit" behavior.
We can safely emulate it for L1 to run L2 even if L0 itself has it
disabled (to run L1).
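The VM-exit interruption-information value written into vmcs12 packs the
vector in bits 7:0, the event type in bits 10:8 (0 = external interrupt),
and a valid bit in bit 31. A minimal sketch of the composition, using the
same bit layout as the patch:

```c
#include <assert.h>
#include <stdint.h>

/* VM-exit interruption-information layout: vector in bits 7:0, event
 * type in bits 10:8 (0 = external interrupt), valid bit in bit 31. */
#define INTR_INFO_VALID_MASK 0x80000000u
#define INTR_TYPE_EXT_INTR   (0u << 8)

static uint32_t compose_exit_intr_info(int vector)
{
    return (uint32_t)vector | INTR_TYPE_EXT_INTR | INTR_INFO_VALID_MASK;
}
```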

Signed-off-by: Bandan Das 
---
 arch/x86/kvm/irq.c |  1 +
 arch/x86/kvm/vmx.c | 19 +++
 2 files changed, 20 insertions(+)

diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 484bc87..bd0da43 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -113,6 +113,7 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v)
 
return kvm_get_apic_interrupt(v);   /* APIC */
 }
+EXPORT_SYMBOL_GPL(kvm_cpu_get_interrupt);
 
 void kvm_inject_pending_timer_irqs(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3e7f60c..bdc8f2d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4489,6 +4489,18 @@ static bool nested_exit_on_intr(struct kvm_vcpu *vcpu)
PIN_BASED_EXT_INTR_MASK;
 }
 
+/*
+ * In nested virtualization, check if L1 has enabled
+ * interrupt acknowledgement that writes the interrupt vector
+ * info on vmexit
+ * 
+ */
+static bool nested_exit_intr_ack_set(struct kvm_vcpu *vcpu)
+{
+   return get_vmcs12(vcpu)->vm_exit_controls &
+   VM_EXIT_ACK_INTR_ON_EXIT;
+}
+
 static bool nested_exit_on_nmi(struct kvm_vcpu *vcpu)
 {
return get_vmcs12(vcpu)->pin_based_vm_exec_control &
@@ -8442,6 +8454,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
prepare_vmcs12(vcpu, vmcs12, exit_reason, exit_intr_info,
   exit_qualification);
 
+   if ((exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+   && nested_exit_intr_ack_set(vcpu)) {
+   int irq = kvm_cpu_get_interrupt(vcpu);
+   vmcs12->vm_exit_intr_info = irq |
+   INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR;
+   }
+
trace_kvm_nested_vmexit_inject(vmcs12->vm_exit_reason,
   vmcs12->exit_qualification,
   vmcs12->idt_vectoring_info_field,
-- 
1.8.3.1



[PATCH v2 3/3] KVM: nVMX: Advertise support for interrupt acknowledgement

2014-03-31 Thread Bandan Das
Some type 1 hypervisors, such as Xen, won't enable VMX without it present

Signed-off-by: Bandan Das 
---
 arch/x86/kvm/vmx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e864b7a..a2a03c5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2273,7 +2273,8 @@ static __init void nested_vmx_setup_ctls_msrs(void)
nested_vmx_pinbased_ctls_high &= 
~PIN_BASED_VMX_PREEMPTION_TIMER;
}
nested_vmx_exit_ctls_high |= (VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
-   VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER);
+   VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
+ VM_EXIT_ACK_INTR_ON_EXIT);
 
/* entry controls */
rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
-- 
1.8.3.1



Re: mechanism to allow a driver to bind to any device

2014-03-31 Thread Kim Phillips
On Mon, 31 Mar 2014 20:23:36 +
Stuart Yoder  wrote:

> > From: Greg KH [mailto:gre...@linuxfoundation.org]
> > Sent: Monday, March 31, 2014 2:47 PM
> > 
> > On Mon, Mar 31, 2014 at 06:47:51PM +, Stuart Yoder wrote:
> > > I also, was at the point where I thought we should perhaps just
> > > go with current mechanisms and implement new_id for the platform
> > > bus...but Greg's recent response is 'platform devices suck' and it
> > sounds
> > > like he would reject a new_id patch for the platform bus.  So it kind
> > > of feels like we are stuck.
> > 
> > ids mean nothing in the platform device model, so having a new_id file
> > for them makes no sense.
> 
> They don't have IDs like PCI, but platform drivers have to match on
> something.  Platform device match tables are based on compatible strings.
> 
> Example from Freescale DMA driver:
>   static const struct of_device_id fsldma_of_ids[] = {
> { .compatible = "fsl,elo3-dma", },
> { .compatible = "fsl,eloplus-dma", },
> { .compatible = "fsl,elo-dma", },
> {}
>   };
> 
> The process of unbinding, setting a new_id, and binding to vfio would work
> just like PCI:
> 
>echo ffe101300.dma > /sys/bus/platform/devices/ffe101300.dma/driver/unbind
>echo fsl,eloplus-dma > /sys/bus/platform/drivers/vfio-platform/new_id

In platform device land, we don't want to pursue the
new_id/match-by-compatible methodology: we know exactly which specific
device (not device types) we want bound to which driver, so we just
want to be able to simply:

echo fff51000.ethernet | sudo tee -a /sys/bus/platform/devices/fff51000.ethernet/driver/unbind
echo fff51000.ethernet | sudo tee -a /sys/bus/platform/drivers/vfio-platform/bind

and not get involved with how PCI "doesn't simply do that," independent
of autoprobe/hotplug.

Kim


Re: mechanism to allow a driver to bind to any device

2014-03-31 Thread Kim Phillips
On Fri, 28 Mar 2014 11:10:23 -0600
Alex Williamson  wrote:

> On Fri, 2014-03-28 at 12:58 -0400, Konrad Rzeszutek Wilk wrote:
> > On Wed, Mar 26, 2014 at 04:09:21PM -0600, Alex Williamson wrote:
> > > On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote:
> > > > On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote:
> > > > > 
> > > > > > Am 26.03.2014 um 22:40 schrieb Konrad Rzeszutek Wilk 
> > > > > > :
> > > > > > 
> > > > > >> On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote:
> > > > > >> Hi Greg,
> > > > > >> 
> > > > > >> We (Linaro, Freescale, Virtual Open Systems) are trying get an 
> > > > > >> issue
> > > > > >> closed that has been perculating for a while around creating a 
> > > > > >> mechanism
> > > > > >> that will allow kernel drivers like vfio can bind to devices of 
> > > > > >> any type.
> > > > > >> 
> > > > > >> This thread with you:
> > > > > >> http://www.spinics.net/lists/kvm-arm/msg08370.html
> > > > > >> ...seems to have died out, so am trying to get your response
> > > > > >> and will summarize again.  Vfio drivers in the kernel (regardless 
> > > > > >> of
> > > > > >> bus type) need to bind to devices of any type.  The driver's 
> > > > > >> function
> > > > > >> is to simply export hardware resources of any type to user space.
> > > > > >> 
> > > > > >> There are several approaches that have been proposed:
> > > > > > 
> > > > > > You seem to have missed the one I proposed.
> > > > > >> 
> > > > > >>   1.  new_id -- (current approach) the user explicitly registers
> > > > > >>   each new device type with the vfio driver using the new_id
> > > > > >>   mechanism.
> > > > > >> 
> > > > > >>   Problem: multiple drivers will be resident that handle the
> > > > > >>   same device type...and there is nothing user space hotplug
> > > > > >>   infrastructure can do to help.
> > > > > >> 
> > > > > >>   2.  "any id" -- the vfio driver could specify a wildcard match
> > > > > >>   of some kind in its ID match table which would allow it to
> > > > > >>   match and bind to any possible device id.  However,
> > > > > >>   we don't want the vfio driver grabbing _all_ devices...just 
> > > > > >> the ones we
> > > > > >>   explicitly want to pass to user space.
> > > > > >> 
> > > > > >>   The proposed patch to support this was to create a new flag
> > > > > >>   "sysfs_bind_only" in struct device_driver.  When this flag
> > > > > >>   is set, the driver can only bind to devices via the sysfs
> > > > > >>   bind file.  This would allow the wildcard match to work.
> > > > > >> 
> > > > > >>   Patch is here:
> > > > > >>   https://lkml.org/lkml/2013/12/3/253
> > > > > >> 
> > > > > >>   3.  "Driver initiated explicit bind" -- with this approach the
> > > > > >>   vfio driver would create a private 'bind' sysfs object
> > > > > >>   and the user would echo the requested device into it:
> > > > > >> 
> > > > > >>   echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind
> > > > > >> 
> > > > > >>   In order to make that work, the driver would need to call
> > > > > >>   driver_probe_device() and thus we need this patch:
> > > > > >>   https://lkml.org/lkml/2014/2/8/175
> > > > > > 
> > > > > > 4). Use the 'unbind' (from the original device) and 'bind' to vfio 
> > > > > > driver.
> > > > > 
> > > > > This is approach 2, no?
> > > > > 
> > > > > > 
> > > > > > Which I think is what is currently being done. Why is that not 
> > > > > > sufficient?
> > > > > 
> > > > > How would 'bind to vfio driver' look like?
> > > > > 
> > > > > > The only thing I see in the URL is " That works, but it is ugly."
> > > > > > There is some mention of race but I don't see how - if you do the 
> > > > > > 'unbind'
> > > > > > on the original driver and then bind the BDF to the VFIO how would 
> > > > > > you get
> > > > > > a race?
> > > > > 
> > > > > Typically on PCI, you do a
> > > > > 
> > > > >   - add wildcard (pci id) match to vfio driver
> > > > >   - unbind driver
> > > > >   -> reprobe
> > > > >   -> device attaches to vfio driver because it is the least recent 
> > > > > match
> > > > >   - remove wildcard match from vfio driver
> > > > > 
> > > > > If in between you hotplug add a card of the same type, it gets 
> > > > > attached to vfio - even though the logical "default driver" would be 
> > > > > the device specific driver.
> > > > 
> > > > I've mentioned drivers_autoprobe in the past, but I'm not sure we're
> > > > really factoring it into the discussion.  drivers_autoprobe allows us to
> > > > toggle two points:
> > > > 
> > > > a) When a new device is added whether we automatically give drivers a
> > > > try at binding to it
> > > > 
> > > > b) When a new driver is added whether it gets to try to bind to anything
> > > > in the system
> > > > 
> > > > So we do have a mechanism to avoid the race, but the problem is that it
> > > > becomes the responsibility of userspace to:
> > > > 
> > > > 1) t

Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Scott Wood
On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
> On 03/26/2014 10:17 PM, Scott Wood wrote:
> > On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
> >> +  /*
> >> +   * Another thread may rewrite the TLB entry in parallel, don't
> >> +   * execute from the address if the execute permission is not set
> >> +   */
> 
> What happens when another thread rewrites the TLB entry in parallel? 
> Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
> Are the contents of the MAS registers consistent at this point or 
> inconsistent?

If another thread rewrites the TLB entry, then we use the new TLB entry,
just as if it had raced in hardware.  This check ensures that we don't
execute from the new TLB entry if it doesn't have execute permissions
(just as hardware would).

What happens if the new TLB entry is valid and executable, and the
instruction pointed to is valid, but doesn't trap (and thus we don't
have emulation for it)?

> There has to be a good way to detect such a race and deal with it, no?

How would you detect it?  We don't get any information from the trap
about what physical address the instruction was fetched from, or what
the instruction was.

-Scott




Re: mechanism to allow a driver to bind to any device

2014-03-31 Thread Alex Williamson
On Mon, 2014-03-31 at 17:36 -0500, Kim Phillips wrote:
> On Fri, 28 Mar 2014 11:10:23 -0600
> Alex Williamson  wrote:
> 
> > On Fri, 2014-03-28 at 12:58 -0400, Konrad Rzeszutek Wilk wrote:
> > > On Wed, Mar 26, 2014 at 04:09:21PM -0600, Alex Williamson wrote:
> > > > On Wed, 2014-03-26 at 10:21 -0600, Alex Williamson wrote:
> > > > > On Wed, 2014-03-26 at 23:06 +0800, Alexander Graf wrote:
> > > > > > 
> > > > > > > Am 26.03.2014 um 22:40 schrieb Konrad Rzeszutek Wilk 
> > > > > > > :
> > > > > > > 
> > > > > > >> On Wed, Mar 26, 2014 at 01:40:32AM +, Stuart Yoder wrote:
> > > > > > >> Hi Greg,
> > > > > > >> 
> > > > > > >> We (Linaro, Freescale, Virtual Open Systems) are trying get an 
> > > > > > >> issue
> > > > > > >> closed that has been perculating for a while around creating a 
> > > > > > >> mechanism
> > > > > > >> that will allow kernel drivers like vfio can bind to devices of 
> > > > > > >> any type.
> > > > > > >> 
> > > > > > >> This thread with you:
> > > > > > >> http://www.spinics.net/lists/kvm-arm/msg08370.html
> > > > > > >> ...seems to have died out, so am trying to get your response
> > > > > > >> and will summarize again.  Vfio drivers in the kernel 
> > > > > > >> (regardless of
> > > > > > >> bus type) need to bind to devices of any type.  The driver's 
> > > > > > >> function
> > > > > > >> is to simply export hardware resources of any type to user space.
> > > > > > >> 
> > > > > > >> There are several approaches that have been proposed:
> > > > > > > 
> > > > > > > You seem to have missed the one I proposed.
> > > > > > >> 
> > > > > > >>   1.  new_id -- (current approach) the user explicitly registers
> > > > > > >>   each new device type with the vfio driver using the new_id
> > > > > > >>   mechanism.
> > > > > > >> 
> > > > > > >>   Problem: multiple drivers will be resident that handle the
> > > > > > >>   same device type...and there is nothing user space hotplug
> > > > > > >>   infrastructure can do to help.
> > > > > > >> 
> > > > > > >>   2.  "any id" -- the vfio driver could specify a wildcard match
> > > > > > >>   of some kind in its ID match table which would allow it to
> > > > > > >>   match and bind to any possible device id.  However,
> > > > > > >>   we don't want the vfio driver grabbing _all_ 
> > > > > > >> devices...just the ones we
> > > > > > >>   explicitly want to pass to user space.
> > > > > > >> 
> > > > > > >>   The proposed patch to support this was to create a new flag
> > > > > > >>   "sysfs_bind_only" in struct device_driver.  When this flag
> > > > > > >>   is set, the driver can only bind to devices via the sysfs
> > > > > > >>   bind file.  This would allow the wildcard match to work.
> > > > > > >> 
> > > > > > >>   Patch is here:
> > > > > > >>   https://lkml.org/lkml/2013/12/3/253
> > > > > > >> 
> > > > > > >>   3.  "Driver initiated explicit bind" -- with this approach the
> > > > > > >>   vfio driver would create a private 'bind' sysfs object
> > > > > > >>   and the user would echo the requested device into it:
> > > > > > >> 
> > > > > > >>   echo 0001:03:00.0 > /sys/bus/pci/drivers/vfio-pci/vfio_bind
> > > > > > >> 
> > > > > > >>   In order to make that work, the driver would need to call
> > > > > > >>   driver_probe_device() and thus we need this patch:
> > > > > > >>   https://lkml.org/lkml/2014/2/8/175
> > > > > > > 
> > > > > > > 4). Use the 'unbind' (from the original device) and 'bind' to 
> > > > > > > vfio driver.
> > > > > > 
> > > > > > This is approach 2, no?
> > > > > > 
> > > > > > > 
> > > > > > > Which I think is what is currently being done. Why is that not 
> > > > > > > sufficient?
> > > > > > 
> > > > > > How would 'bind to vfio driver' look like?
> > > > > > 
> > > > > > > The only thing I see in the URL is " That works, but it is ugly."
> > > > > > > There is some mention of race but I don't see how - if you do the 
> > > > > > > 'unbind'
> > > > > > > on the original driver and then bind the BDF to the VFIO how 
> > > > > > > would you get
> > > > > > > a race?
> > > > > > 
> > > > > > Typically on PCI, you do a
> > > > > > 
> > > > > >   - add wildcard (pci id) match to vfio driver
> > > > > >   - unbind driver
> > > > > >   -> reprobe
> > > > > >   -> device attaches to vfio driver because it is the least recent 
> > > > > > match
> > > > > >   - remove wildcard match from vfio driver
> > > > > > 
> > > > > > If in between you hotplug add a card of the same type, it gets 
> > > > > > attached to vfio - even though the logical "default driver" would 
> > > > > > be the device specific driver.
> > > > > 
> > > > > I've mentioned drivers_autoprobe in the past, but I'm not sure we're
> > > > > really factoring it into the discussion.  drivers_autoprobe allows us 
> > > > > to
> > > > > toggle two points:
> > > > > 
> > > > > a) When a new device is added whether we automatically give drivers a
> > > > > try at binding to it

Re: [Qemu-devel] Massive read only kvm guests when backing file was missing

2014-03-31 Thread Alejandro Comisario
Thanks Stefan and thanks Michael also.

That situation regarding IRC was very special: I didn't want to tell
Michael "hey, everyone on the mailing list got it, and I'm here chatting
with you and you didn't", so I assumed the IRC was 9 times more pro than
the mailing list and decided to keep my head down and assume the
communication error was on my side.

Still, IMHO, I really believe that if you are a user willing to give KVM
enough of a chance to ask a question on IRC, you might feel you are not
geek enough to be there; and I don't mean just being on IRC, but leaning
on the community to support you while you try KVM.

In my case, while it was very important to understand what my chances
were regarding this issue, I knew I would find my answer no matter what,
because I was determined to find it. I could have gone mad with 10.5K
guests running on my back; yes, my experience was more from the "virsh
stop; virsh start" side, but I still felt I needed you guys to help me
figure this out.

Again, thanks to everyone.

best.
Alejandro Comisario


On Fri, Mar 28, 2014 at 5:47 AM, Stefan Hajnoczi  wrote:
> On Fri, Mar 28, 2014 at 11:01:00AM +0400, Michael Tokarev wrote:
>> 27.03.2014 20:14, Alejandro Comisario wrote:
>> > Seems like virtio (kvm 1.0) doesn't expose a timeout on the guest side
>> > (ubuntu 12.04 on host and guest).
>> > So, how can I adjust the timeout on the guest?
>>
>> After a bit more talks on IRC yesterday, it turned out that the situation
>> is _much_ more "interesting" than originally described.  The OP claims to
>> have 10500 guests running off an NFS server, and that after NFS server
>> downtime, the "backing files" were disappeared (whatever it means), so
>> they had to restore those files.  More, the OP didn't even bother to look
>> at the guest's dmesg, being busy rebooting all 10500 guests.
>>
>> > This solution is the most logical one, but i cannot apply it!
>> > thanks for all the responses!
>>
>> I suggested the OP to actually describe the _real_ situation, instead of
>> giving random half-pictures, and actually take a look at the actual problem
>> as reported in various places (most importantly the guest kernel log), and
>> reoirt _those_ hints to the list.  I also mentioned that, at least for some
>> NFS servers, if a client has a file open on the server, and this file is
>> deleted, the server will report error to the client when client tries to
>> access that file, and this has nothing at all to do with timeouts of any
>> kind.
>
> Thanks for the update and for taking time to help on IRC.  I feel you're
> being harsh on Alejandro though.
>
> Improving the quality of bug reports is important but it shouldn't be at
> the expense of quality of communication.  We can't assume that everyone
> is an expert in troubleshooting KVM or Linux.  Therefore we can't blame
> them, which will only drive people away and detract from the community.
>
> TL;DR post logs and error messages +1, berate him -1
>
> Stefan


RE: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Wu, Feng


> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On
> Behalf Of Paolo Bonzini
> Sent: Monday, March 31, 2014 9:31 PM
> To: Wu, Feng; g...@redhat.com; h...@zytor.com; kvm@vger.kernel.org
> Subject: Re: [PATCH v3 2/4] KVM: Add SMAP support when setting CR4
> 
> Just a few comments...
> 
> > -static void update_permission_bitmask(struct kvm_vcpu *vcpu,
> > +void update_permission_bitmask(struct kvm_vcpu *vcpu,
> > struct kvm_mmu *mmu, bool ept)
> >  {
> > unsigned bit, byte, pfec;
> > u8 map;
> > -   bool fault, x, w, u, wf, uf, ff, smep;
> > +   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
> >
> > smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
> 
> Can you make an additional patch to rename this to cr4_smep?

Sure! I noticed your comments about this issue in the previous email; I was 
preparing to make a patch for it and will send it out today!
> 
> > +   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
> > for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
> > pfec = byte << 1;
> > map = 0;
> > wf = pfec & PFERR_WRITE_MASK;
> > uf = pfec & PFERR_USER_MASK;
> > ff = pfec & PFERR_FETCH_MASK;
> > +   /*
> > +* PFERR_RSVD_MASK bit is used to detect SMAP violation.
> > +* We will check it in permission_fault(), this bit is
> > +* set in pfec for normal fault, while it is cleared for
> > +* SMAP violations.
> > +*/
> 
> "This bit is set in PFEC if the access is _not_ subject to SMAP
> restrictions, and cleared otherwise.  The bit is only meaningful if
> the SMAP bit is set in CR4."
> 
> > +   smapf = !(pfec & PFERR_RSVD_MASK);
> > for (bit = 0; bit < 8; ++bit) {
> > x = bit & ACC_EXEC_MASK;
> > w = bit & ACC_WRITE_MASK;
> > @@ -3627,11 +3635,32 @@ static void update_permission_bitmask(struct
> kvm_vcpu *vcpu,
> > w |= !is_write_protection(vcpu) && !uf;
> > /* Disallow supervisor fetches of user code if 
> > cr4.smep */
> > x &= !(smep && u && !uf);
> > +
> > +   /*
> > +* SMAP: kernel-mode data accesses from user-mode
> > +* mappings should fault. A fault is considered
> > +* a SMAP violation if all of the following
> > +* conditions are true:
> > +*   - X86_CR4_SMAP is set in CR4
> > +*   - A user page is accessed
> > +*   - Page fault in kernel mode
> > +*   - !(CPL<3 && X86_EFLAGS_AC is set)
> 
> - if CPL < 3, EFLAGS.AC is clear

Should it be "if CPL = 3 or EFLAGS.AC is clear"?

> 
> > +*   Here, we cover the first three conditions,
> > +*   The CPL and X86_EFLAGS_AC is in smapf,which
> > +*   permission_fault() computes dynamically.
> 
> The fourth is computed dynamically in permission_fault() and is in SMAPF.
> 
> > +*   Also, SMAP does not affect instruction
> > +*   fetches; the !ff check here makes this
> > +*   explicit.
> > +*/
> > +   smap = cr4_smap && u && !uf && !ff;
> > } else
> > /* Not really needed: no U/S accesses on ept  */
> > u = 1;
> >
> > -   fault = (ff && !x) || (uf && !u) || (wf && !w);
> > +   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
> > +   (smapf && smap);
> > map |= fault << bit;
> > }
> > mmu->permissions[byte] = map;
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 2926152..822190f 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -44,11 +44,17 @@
> >  #define PT_DIRECTORY_LEVEL 2
> >  #define PT_PAGE_TABLE_LEVEL 1
> >
> > -#define PFERR_PRESENT_MASK (1U << 0)
> > -#define PFERR_WRITE_MASK (1U << 1)
> > -#define PFERR_USER_MASK (1U << 2)
> > -#define PFERR_RSVD_MASK (1U << 3)
> > -#define PFERR_FETCH_MASK (1U << 4)
> > +#define PFERR_PRESENT_BIT 0
> > +#define PFERR_WRITE_BIT 1
> > +#define PFERR_USER_BIT 2
> > +#define PFERR_RSVD_BIT 3
> > +#define PFERR_FETCH_BIT 4
> > +
> > +#define PFERR_PRESENT_MASK (1U << PFERR_PRESENT_BIT)
> > +#define PFERR_WRITE_MASK (1U << PFERR_WRITE_BIT)
> > +#define PFERR_USER_MASK (1U << PFERR_USER_BIT)
> > +#define PFERR_RSVD_MASK (1U << PFERR_RSVD_BIT)
> > +#define PFERR_FETCH_MASK (1U << PFERR_FETCH_BIT)
> >
> >  int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64
> sptes[

[PATCH v4 0/4] KVM: enable Intel SMAP for KVM

2014-03-31 Thread Feng Wu
Supervisor Mode Access Prevention (SMAP) is a new security feature 
disclosed by Intel; please refer to the following document: 

http://software.intel.com/sites/default/files/319433-014.pdf
 
Every access to a linear address is either a supervisor-mode access
or a user-mode access. All accesses performed while the current
privilege level (CPL) is less than 3 are supervisor-mode accesses.
If CPL = 3, accesses are generally user-mode accesses. However, some
operations implicitly access system data structures, and the resulting
accesses to those data structures are supervisor-mode accesses regardless
of CPL. Examples of such implicit supervisor accesses include the following:
accesses to the global descriptor table (GDT) or local descriptor table
(LDT) to load a segment descriptor; accesses to the interrupt descriptor
table (IDT) when delivering an interrupt or exception; and accesses to the
task-state segment (TSS) as part of a task switch or change of CPL.

If CR4.SMAP = 1, supervisor-mode data accesses are not allowed to linear
addresses that are accessible in user mode. If CPL < 3, SMAP protections
are disabled if EFLAGS.AC = 1. If CPL = 3, SMAP applies to all supervisor-mode
data accesses (these are implicit supervisor accesses) regardless of the
value of EFLAGS.AC.

This patchset passes the SMAP feature through to guests, letting
them benefit from it.

Version 1:
  * Remove SMAP bit from CR4_RESERVED_BITS.
  * Add SMAP support when setting CR4
  * Disable SMAP for guests in EPT realmode and EPT unpaging mode
  * Expose SMAP feature to guest

Version 2:
  * Change the logic of updating mmu permission bitmap for SMAP violation
  * Expose SMAP feature to guest in the last patch of this series.

Version 3:
  * Changes in update_permission_bitmask().
  * Use a branchless way suggested by Paolo Bonzini to detect SMAP
violation in permission_fault(). 

Version 4:
  * Changes to some comments and code style.

Feng Wu (4):
  KVM: Remove SMAP bit from CR4_RESERVED_BITS.
  KVM: Add SMAP support when setting CR4
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: expose SMAP feature to guest

 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/cpuid.c|  2 +-
 arch/x86/kvm/cpuid.h|  8 
 arch/x86/kvm/mmu.c  | 34 ---
 arch/x86/kvm/mmu.h  | 44 +
 arch/x86/kvm/paging_tmpl.h  |  2 +-
 arch/x86/kvm/vmx.c  | 11 ++-
 arch/x86/kvm/x86.c  |  9 -
 8 files changed, 92 insertions(+), 20 deletions(-)

-- 
1.8.3.1



[PATCH v4 4/4] KVM: expose SMAP feature to guest

2014-03-31 Thread Feng Wu
This patch exposes the SMAP feature to the guest.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/cpuid.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index c697625..deb5f9b 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -303,7 +303,7 @@ static inline int __do_cpuid_ent(struct kvm_cpuid_entry2 
*entry, u32 function,
/* cpuid 7.0.ebx */
const u32 kvm_supported_word9_x86_features =
F(FSGSBASE) | F(BMI1) | F(HLE) | F(AVX2) | F(SMEP) |
-   F(BMI2) | F(ERMS) | f_invpcid | F(RTM);
+   F(BMI2) | F(ERMS) | f_invpcid | F(RTM) | F(SMAP);
 
/* all calls to cpuid_count() should be made on the same cpu */
get_cpu();
-- 
1.8.3.1



[PATCH v4 3/4] KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode

2014-03-31 Thread Feng Wu
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/vmx.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3927528..e58cb5f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3452,13 +3452,14 @@ static int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned 
long cr4)
hw_cr4 &= ~X86_CR4_PAE;
hw_cr4 |= X86_CR4_PSE;
/*
-* SMEP is disabled if CPU is in non-paging mode in
-* hardware. However KVM always uses paging mode to
+* SMEP/SMAP is disabled if CPU is in non-paging mode
+* in hardware. However KVM always uses paging mode to
 * emulate guest non-paging mode with TDP.
-* To emulate this behavior, SMEP needs to be manually
-* disabled when guest switches to non-paging mode.
+* To emulate this behavior, SMEP/SMAP needs to be
+* manually disabled when guest switches to non-paging
+* mode.
 */
-   hw_cr4 &= ~X86_CR4_SMEP;
+   hw_cr4 &= ~(X86_CR4_SMEP | X86_CR4_SMAP);
} else if (!(cr4 & X86_CR4_PAE)) {
hw_cr4 &= ~X86_CR4_PAE;
}
-- 
1.8.3.1



[PATCH v4 1/4] KVM: Remove SMAP bit from CR4_RESERVED_BITS.

2014-03-31 Thread Feng Wu
This patch removes the SMAP bit from CR4_RESERVED_BITS.

Signed-off-by: Feng Wu 
---
 arch/x86/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fdf83af..4eeb049 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -60,7 +60,7 @@
  | X86_CR4_PSE | X86_CR4_PAE | X86_CR4_MCE \
  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | 
X86_CR4_PCIDE \
  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
- | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE))
+ | X86_CR4_OSXMMEXCPT | X86_CR4_VMXE | X86_CR4_SMAP))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
-- 
1.8.3.1



[PATCH v4 2/4] KVM: Add SMAP support when setting CR4

2014-03-31 Thread Feng Wu
This patch adds SMAP handling logic when setting CR4 for guests

Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
way to detect SMAP violation.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/cpuid.h   |  8 
 arch/x86/kvm/mmu.c | 34 +++---
 arch/x86/kvm/mmu.h | 44 
 arch/x86/kvm/paging_tmpl.h |  2 +-
 arch/x86/kvm/x86.c |  9 -
 5 files changed, 84 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index a2a1bb7..eeecbed 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -48,6 +48,14 @@ static inline bool guest_cpuid_has_smep(struct kvm_vcpu 
*vcpu)
return best && (best->ebx & bit(X86_FEATURE_SMEP));
 }
 
+static inline bool guest_cpuid_has_smap(struct kvm_vcpu *vcpu)
+{
+   struct kvm_cpuid_entry2 *best;
+
+   best = kvm_find_cpuid_entry(vcpu, 7, 0);
+   return best && (best->ebx & bit(X86_FEATURE_SMAP));
+}
+
 static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9b53135..a183783 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,20 +3601,27 @@ static void reset_rsvds_bits_mask_ept(struct kvm_vcpu 
*vcpu,
}
 }
 
-static void update_permission_bitmask(struct kvm_vcpu *vcpu,
+void update_permission_bitmask(struct kvm_vcpu *vcpu,
struct kvm_mmu *mmu, bool ept)
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smep;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
 
smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
map = 0;
wf = pfec & PFERR_WRITE_MASK;
uf = pfec & PFERR_USER_MASK;
ff = pfec & PFERR_FETCH_MASK;
+   /*
+* PFERR_RSVD_MASK bit is set in PFEC if the access is not
+* subject to SMAP restrictions, and cleared otherwise. The
+* bit is only meaningful if the SMAP bit is set in CR4.
+*/
+   smapf = !(pfec & PFERR_RSVD_MASK);
for (bit = 0; bit < 8; ++bit) {
x = bit & ACC_EXEC_MASK;
w = bit & ACC_WRITE_MASK;
@@ -3627,11 +3634,32 @@ static void update_permission_bitmask(struct kvm_vcpu 
*vcpu,
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
x &= !(smep && u && !uf);
+
+   /*
+* SMAP: kernel-mode data accesses from user-mode
+* mappings should fault. A fault is considered
+* a SMAP violation if all of the following
+* conditions are true:
+*   - X86_CR4_SMAP is set in CR4
+*   - A user page is accessed
+*   - Page fault in kernel mode
+*   - if CPL = 3 or X86_EFLAGS_AC is clear
+*
+*   Here, we cover the first three conditions.
+*   The fourth is computed dynamically in
+*   permission_fault() and is in smapf.
+*
+*   Also, SMAP does not affect instruction
+*   fetches; the !ff check here makes this
+*   explicit.
+*/
+   smap = cr4_smap && u && !uf && !ff;
} else
/* Not really needed: no U/S accesses on ept  */
u = 1;
 
-   fault = (ff && !x) || (uf && !u) || (wf && !w);
+   fault = (ff && !x) || (uf && !u) || (wf && !w) ||
+   (smapf && smap);
map |= fault << bit;
}
mmu->permissions[byte] = map;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 2926152..3842e70 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -44,11 +44,17 @@
 #define PT_DIRECTORY_LEVEL 2
 #define PT_PAGE_TABLE_LEVEL 1
 
-#define PFERR_PRESENT_MASK (1U << 0)
-#define PFERR_WRITE_MASK (1U << 1)
-#define PFERR_USER_MASK (1U << 2)
-#define PFERR_RSVD_MASK (1U << 3)
-#define PFERR_FETCH_MASK (1U << 4)
+#define PFERR_PRESENT_BIT 0
+#define PFERR_WRITE_BIT 1
+#define PFERR_USER_BIT 2

[PATCH] Rename variable smep to cr4_smep

2014-03-31 Thread Feng Wu
This patch is based on the smap patchset

Feng Wu (1):
  Rename variable smep to cr4_smep

 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

-- 
1.8.3.1



[PATCH] Rename variable smep to cr4_smep

2014-03-31 Thread Feng Wu
Rename the variable smep to cr4_smep, which better reflects its
meaning.

Signed-off-by: Feng Wu 
---
 arch/x86/kvm/mmu.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a183783..6000557 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3606,9 +3606,9 @@ void update_permission_bitmask(struct kvm_vcpu *vcpu,
 {
unsigned bit, byte, pfec;
u8 map;
-   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, smep, smap = 0;
+   bool fault, x, w, u, wf, uf, ff, smapf, cr4_smap, cr4_smep, smap = 0;
 
-   smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
+   cr4_smep = kvm_read_cr4_bits(vcpu, X86_CR4_SMEP);
cr4_smap = kvm_read_cr4_bits(vcpu, X86_CR4_SMAP);
for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
pfec = byte << 1;
@@ -3633,7 +3633,7 @@ void update_permission_bitmask(struct kvm_vcpu *vcpu,
/* Allow supervisor writes if !cr0.wp */
w |= !is_write_protection(vcpu) && !uf;
/* Disallow supervisor fetches of user code if 
cr4.smep */
-   x &= !(smep && u && !uf);
+   x &= !(cr4_smep && u && !uf);
 
/*
 * SMAP:kernel-mode data accesses from user-mode
-- 
1.8.3.1



Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-03-31 Thread Marcelo Tosatti
On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
> On 03/29/2014 01:47 AM, Zhanghailiang wrote:
> > Hi,
> > I found that when the Guest is idle, VDSO pvclock may increase host consumption.
> > We can calculate as follows; correct me if I am wrong:
> >   (Host) 250 * update_pvclock_gtod = 1500 * gettimeofday (Guest)
> > In the Host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in 
> > timekeeping.c. It consumes nearly 900 cycles per call, so at 250 Hz it may 
> > consume 225,000 cycles per second, even when no VM is created.
> > In the Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If 
> > no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles 
> > per call, so the feature saves 150 cycles per call. 
> > Calling gettimeofday 1500 times therefore saves 225,000 cycles, equal to the 
> > host consumption.
> > Both Host and Guest are linux-3.13.6.
> > So, is the host CPU consumption a problem?
> 
> Does pvclock serve any real purpose on systems with fully-functional
> TSCs?  The x86 guest implementation is awful, so it's about 2x slower
> than TSC.  It could be improved a lot, but I'm not sure I understand why
> it exists in the first place.

VM migration.

Can you explain why you consider it so bad ? How you think it could be
improved ?

> I certainly understand the goal of keeping the guest CLOCK_REALTIME in
> sync with the host, but pvclock seems like overkill for that.

VM migration.



Re: VDSO pvclock may increase host cpu consumption, is this a problem?

2014-03-31 Thread Andy Lutomirski
On Mar 31, 2014 8:45 PM, "Marcelo Tosatti"  wrote:
>
> On Mon, Mar 31, 2014 at 10:52:25AM -0700, Andy Lutomirski wrote:
> > On 03/29/2014 01:47 AM, Zhanghailiang wrote:
> > > Hi,
> > > I found that when the Guest is idle, VDSO pvclock may increase host
> > > consumption.
> > > We can calculate as follows; correct me if I am wrong:
> > >   (Host) 250 * update_pvclock_gtod = 1500 * gettimeofday (Guest)
> > > In the Host, VDSO pvclock introduces a notifier chain, pvclock_gtod_chain in 
> > > timekeeping.c. It consumes nearly 900 cycles per call, so at 250 Hz it may 
> > > consume 225,000 cycles per second, even when no VM is 
> > > created.
> > > In the Guest, gettimeofday consumes 220 cycles per call with VDSO pvclock. If 
> > > no-kvmclock-vsyscall is configured, gettimeofday consumes 370 cycles 
> > > per call, so the feature saves 150 cycles per call.
> > > Calling gettimeofday 1500 times therefore saves 225,000 cycles, equal to the 
> > > host consumption.
> > > Both Host and Guest are linux-3.13.6.
> > > So, is the host CPU consumption a problem?
> >
> > Does pvclock serve any real purpose on systems with fully-functional
> > TSCs?  The x86 guest implementation is awful, so it's about 2x slower
> > than TSC.  It could be improved a lot, but I'm not sure I understand why
> > it exists in the first place.
>
> VM migration.

Why does that need percpu stuff?  Wouldn't it be sufficient to
interrupt all CPUs (or at least all cpus running in userspace) on
migration and update the normal timing data structures?

Even better: have the VM offer to invalidate the physical page
containing the kernel's clock data on migration and interrupt one CPU.
 If another CPU races, it'll fault and wait for the guest kernel to
update its timing.

Does the current kvmclock stuff track CLOCK_MONOTONIC and
CLOCK_REALTIME separately?

>
> Can you explain why you consider it so bad ? How you think it could be
> improved ?

The second rdtsc_barrier looks unnecessary.  Even better, if rdtscp is
available, then rdtscp can replace rdtsc_barrier, rdtsc, and the
getcpu call.

It would also be nice to avoid having two sets of rescalings of the timing data.


>
> > I certainly understand the goal of keeping the guest CLOCK_REALTIME in
> > sync with the host, but pvclock seems like overkill for that.
>
> VM migration.
>


Re: [PATCH 4/4] KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

2014-03-31 Thread Alexander Graf


> On 01.04.2014 at 01:03, Scott Wood  wrote:
> 
>> On Mon, 2014-03-31 at 15:41 +0200, Alexander Graf wrote:
>>> On 03/26/2014 10:17 PM, Scott Wood wrote:
 On Thu, 2014-02-20 at 18:30 +0200, Mihai Caraman wrote:
 +/*
 + * Another thread may rewrite the TLB entry in parallel, don't
 + * execute from the address if the execute permission is not set
 + */
>> 
>> What happens when another thread rewrites the TLB entry in parallel? 
>> Does tlbsx succeed? Does it fail? Do we see failure indicated somehow? 
>> Are the contents of the MAS registers consistent at this point or 
>> inconsistent?
> 
> If another thread rewrites the TLB entry, then we use the new TLB entry,
> just as if it had raced in hardware.  This check ensures that we don't
> execute from the new TLB entry if it doesn't have execute permissions
> (just as hardware would).
> 
> What happens if the new TLB entry is valid and executable, and the
> instruction pointed to is valid, but doesn't trap (and thus we don't
> have emulation for it)?
> 
>> There has to be a good way to detect such a race and deal with it, no?
> 
> How would you detect it?  We don't get any information from the trap
> about what physical address the instruction was fetched from, or what
> the instruction was.

Ah, this is a guest race where the guest modifies exactly the TLB in question. 
I see.

Why would this ever happen in practice (in a case where we're not executing 
malicious code)?


Alex


> 
> -Scott
> 
> 


[Bug 73331] New: Nested Virtualization, L2 cannot boot up on Ivybridge and Haswell

2014-03-31 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=73331

Bug ID: 73331
   Summary: Nested Virtualization, L2 cannot boot up on Ivybridge
and Haswell
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.14.0-rc3
  Hardware: x86-64
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
  Assignee: virtualization_...@kernel-bugs.osdl.org
  Reporter: robert...@intel.com
CC: jan.kis...@web.de
Regression: No

Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux
kvm.git Commit:7227fc006b0df2c0d2966a7f4859b01bdf74
qemu.git Commit: 8648fcd52a9bcc2aa415cbe87b7c636e545acb38
Host Kernel Version:3.14.0-rc3
Hardware:Ivytown_EP, Grantley-EP

Bug detailed description:
--
When creating an L2 guest, the L2 guest cannot boot up on Ivytown.


Reproduce steps:

1.create L1 guests:
qemu-system-x86_64 --enable-kvm -m 10240 -smp 8 -net
nic,macaddr=00:12:45:67:2B:1C -net tap,script=/etc/kvm/qemu-ifup
nested-kvm-rhel6u4.qcow -cpu host
2. create L2 guests:
qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none rhel6u4.qcow


Current result:

L2 guest does not boot up.

Expected result:

L2 guest boots up fine.

Basic root-causing log:
--
[root@rhel6u4 ~]# qemu-system-x86_64 -enable-kvm -m 1024 -smp 2 -net none
rhel6u4.qcow
VNC server running on `::1:5900'
KVM: entry failed, hardware error 0x0
EAX= EBX= ECX= EDX=0623
ESI= EDI= EBP= ESP=
EIP=fff0 EFL=0002 [---] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  f300
SS =   f300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT=  
IDT=  
CR0=6010 CR2= CR3= CR4=
DR0= DR1= DR2=
DR3=
DR6=0ff0 DR7=0400
EFER=
Code=ff 66 89 c8 66 5b 66 5e 66 5f 66 5d 66 c3 fb f4 fa fc 66 c3  5b e0 00
f0 30 36 2f 32 33 2f 39 39 00 fc 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

-- 
You are receiving this mail because:
You are watching the assignee of the bug.