Re: market share??

2014-06-20 Thread Jidong Xiao
On Thu, Jun 12, 2014 at 12:48 PM, urgrue  wrote:
> Does anyone have ANY idea where I could find out about market share and
> adoption rates of linux hypervisors (kvm vs xen vs vmware, mainly)?
> I've not found anything despite extensive searching, which is really kind of
> problematic when you're trying to convince the bosses...
>
>

Hi,

Not sure whether these statistics are what you are looking for:

http://up2v.nl/2012/09/01/nexenta-server-hypervisor-market-share-survey-indicates-vsphere-remains-dominant-hypervisor/
http://www.businesswire.com/news/home/20120827005276/en/Nexenta-Releases-Server-Hypervisor-Market-Share-Survey#.U6PircakpfQ

The original link to the Nexenta website is no longer available, though...

-Jidong
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/14] KVM GICv3 emulation

2014-06-20 Thread Andre Przywara
On 20/06/14 02:39, Chalamarla, Tirumalesh wrote:

Hi Tirumalesh,

>  Is there a public repo where we can get these patches from easily?

Indeed, many thanks to Marc, who now hosts my patches on his kernel.org
repo. Simply check out the gicv3/kvm-guest branch from
git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git
to get all of the GICv3 host, vgic-dyn and GICv3 guest emulation patches
at once.

Regards,
Andre.

> 
> From: kvmarm-boun...@lists.cs.columbia.edu 
>  on behalf of Andre Przywara 
> 
> Sent: Thursday, June 19, 2014 3:15 PM
> To: linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu; 
> kvm@vger.kernel.org
> Cc: christoffer.d...@linaro.org
> Subject: [PATCH 00/14] KVM GICv3 emulation
>
> GICv3 is the ARM generic interrupt controller designed to overcome
> some limits of the prevalent GICv2. Most notably it lifts the 8-CPU
> limit. Though with recent patches from Marc there is support for
> hosts to use a GICv3, the CPU limitation still applies to KVM guests,
> since the current code emulates a GICv2 only.
> Also, GICv2 backward compatibility being optional in GICv3, a number
> of systems won't be able to run GICv2 guests.
>
> This patch series provides code to emulate a GICv3 distributor and
> redistributor for any KVM guest. It requires a GICv3 in the host to
> work. With those patches one can run guests efficiently on any GICv3
> host. It has the following features:
> - Affinity routing (support for up to 255 VCPUs, more possible)
> - System registers (as opposed to MMIO access)
> - No ITS
> - No priority support (as the GICv2 emulation)
> - No save / restore support so far (will be added soon)
>
> The first 10 patches actually refactor the current VGIC code to make
> room for a different VGIC model to be dropped in with Patch 11/14.
> The remaining patches connect the new model to the kernel backend and
> the userland facing code.
>
> The series goes on top of both Marc's GICv3 host support series as
> well as his vgic-dyn patches.
> The necessary patches for kvmtool to enable the guest's GICv3 will be
> posted here as well.
> There was some testing on the fast model with some I/O and interrupt
> affinity shuffling in a Linux guest with a varying number of VCPUs.
>
> Please review and test.
> I would be grateful for people to test for GICv2 regressions also
> (so on a GICv2 host with current kvmtool/qemu), as there is quite
> some refactoring on that front.
>
> Much of the code was inspired by Marc, so send all praises to him
> (while I take the blame).
>
> Cheers,
> Andre.
>
> Andre Przywara (14):
>   arm/arm64: KVM: rework MPIDR assignment and add accessors
>   arm/arm64: KVM: pass down user space provided GIC type into vGIC code
>   arm/arm64: KVM: refactor vgic_handle_mmio() function
>   arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 bit ones
>   arm/arm64: KVM: introduce per-VM ops
>   arm/arm64: KVM: make the maximum number of vCPUs a per-VM value
>   arm/arm64: KVM: make the value of ICC_SRE_EL1 a per-VM variable
>   arm/arm64: KVM: refactor MMIO accessors
>   arm/arm64: KVM: split GICv2 specific emulation code from vgic.c
>   arm/arm64: KVM: add opaque private pointer to MMIO accessors
>   arm/arm64: KVM: add virtual GICv3 distributor emulation
>   arm/arm64: KVM: add SGI system register trapping
>   arm/arm64: KVM: enable kernel side of GICv3 emulation
>   arm/arm64: KVM: allow userland to request a virtual GICv3
>
>  arch/arm/include/asm/kvm_emulate.h   |2 +-
>  arch/arm/include/asm/kvm_host.h  |3 +
>  arch/arm/kvm/Makefile|1 +
>  arch/arm/kvm/arm.c   |   23 +-
>  arch/arm/kvm/coproc.c|   19 +
>  arch/arm/kvm/psci.c  |   15 +-
>  arch/arm64/include/asm/kvm_emulate.h |3 +-
>  arch/arm64/include/asm/kvm_host.h|5 +
>  arch/arm64/include/uapi/asm/kvm.h|7 +
>  arch/arm64/kernel/asm-offsets.c  |1 +
>  arch/arm64/kvm/Makefile  |2 +
>  arch/arm64/kvm/sys_regs.c|   37 +-
>  arch/arm64/kvm/vgic-v3-switch.S  |   14 +-
>  include/kvm/arm_vgic.h   |   38 +-
>  include/linux/irqchip/arm-gic-v3.h   |   26 +
>  include/linux/kvm_host.h |1 +
>  include/uapi/linux/kvm.h |1 +
>  virt/kvm/arm/vgic-v2-emul.c  |  802 +++
>  virt/kvm/arm/vgic-v2.c   |   22 +-
>  virt/kvm/arm/vgic-v3-emul.c  |  898 ++
>  virt/kvm/arm/vgic-v3.c   |  157 +-
>  virt/kvm/arm/vgic.c  | 1017 +++---
>  virt/kvm/arm/vgic.h  |  117 
>  virt/kvm/kvm_main.c  |3 +
>  24 files changed, 2346 insertions(+), 868 deletions(-)
>  create mode 100644 virt/kvm/arm/vgic-v2-emul.c
>  create mode 100644 virt/kvm/arm/vgic-v3-emul.c
>  create mode 100644 virt/kvm/arm/vgic.h
>
> --
> 1.7.9.5
>
> 

Re: [PATCH 04/14] arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 bit ones

2014-06-20 Thread Andre Przywara

On 19/06/14 22:15, Chalamarla, Tirumalesh wrote:
>
>
> -Original Message-
> From: kvmarm-boun...@lists.cs.columbia.edu 
> [mailto:kvmarm-boun...@lists.cs.columbia.edu] On Behalf Of Andre Przywara
> Sent: Thursday, June 19, 2014 2:46 AM
> To: linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu; 
> kvm@vger.kernel.org
> Cc: christoffer.d...@linaro.org
> Subject: [PATCH 04/14] arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 
> bit ones
>
> Some GICv3 registers can and will be accessed as 64 bit registers.
> Currently the register handling code can only deal with 32 bit accesses, so 
> we do two consecutive calls to cover this.
>
> Signed-off-by: Andre Przywara 
> ---
>  virt/kvm/arm/vgic.c |   48 +---
>  1 file changed, 45 insertions(+), 3 deletions(-)
>
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c index 4c6b212..b3cf4c7 
> 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -906,6 +906,48 @@ static bool vgic_validate_access(const struct vgic_dist 
> *dist,  }
>
>  /*
> + * Call the respective handler function for the given range.
> + * We split up any 64 bit accesses into two consecutive 32 bit
> + * handler calls and merge the result afterwards.
> + */
> +static bool call_range_handler(struct kvm_vcpu *vcpu,
> +struct kvm_exit_mmio *mmio,
> +unsigned long offset,
> +const struct mmio_range *range) {
> + u32 *data32 = (void *)mmio->data;
> + struct kvm_exit_mmio mmio32;
> + bool ret;
> +
> + if (likely(mmio->len <= 4))
> + return range->handle_mmio(vcpu, mmio, offset);
> +
> + /*
> +  * We assume that any access greater than 4 bytes is actually
> +  * 8 bytes long, caused by a 64-bit access
> +  */
> +
> + mmio32.len = 4;
> + mmio32.is_write = mmio->is_write;
> +
> + mmio32.phys_addr = mmio->phys_addr + 4;
> + if (mmio->is_write)
> + *(u32 *)mmio32.data = data32[1];
> + ret = range->handle_mmio(vcpu, &mmio32, offset + 4);
> + if (!mmio->is_write)
> + data32[1] = *(u32 *)mmio32.data;
> +
> + mmio32.phys_addr = mmio->phys_addr;
> + if (mmio->is_write)
> + *(u32 *)mmio32.data = data32[0];
> + ret |= range->handle_mmio(vcpu, &mmio32, offset);
> + if (!mmio->is_write)
> + data32[0] = *(u32 *)mmio32.data;
> +
> + return ret;
> +}
>
> Any reason to use two 32-bit accesses instead of one 64-bit access? Maybe for AArch32 on ARMv8?

For the registers we care about right now we get along with this split.
And it seems to be less intrusive.
I have a patch to support native 64-bit accesses, but it needs some more
work. If there is high demand for it, I can post it (but Marc didn't
like the first version so much ;-) )

Regards,
Andre.

>
> +
> +/*
>   * vgic_handle_mmio_range - handle an in-kernel MMIO access
>   * @vcpu:pointer to the vcpu performing the access
>   * @run: pointer to the kvm_run structure
> @@ -936,10 +978,10 @@ static bool vgic_handle_mmio_range(struct kvm_vcpu 
> *vcpu, struct kvm_run *run,
>   spin_lock(&vcpu->kvm->arch.vgic.lock);
>   offset -= range->base;
>   if (vgic_validate_access(dist, range, offset)) {
> - updated_state = range->handle_mmio(vcpu, mmio, offset);
> + updated_state = call_range_handler(vcpu, mmio, offset, range);
>   } else {
> - vgic_reg_access(mmio, NULL, offset,
> - ACCESS_READ_RAZ | ACCESS_WRITE_IGNORED);
> + if (!mmio->is_write)
> + memset(mmio->data, 0, mmio->len);
>
>  What is the use of this memset?
>
>   updated_state = false;
>   }
>   spin_unlock(&vcpu->kvm->arch.vgic.lock);
> --
> 1.7.9.5
>
> ___
> kvmarm mailing list
> kvm...@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>




Re: [PATCH 13/14] arm/arm64: KVM: enable kernel side of GICv3 emulation

2014-06-20 Thread Andre Przywara


On 19/06/14 22:43, Chalamarla, Tirumalesh wrote:
>
>
> -Original Message-
> From: kvmarm-boun...@lists.cs.columbia.edu 
> [mailto:kvmarm-boun...@lists.cs.columbia.edu] On Behalf Of Andre Przywara
> Sent: Thursday, June 19, 2014 2:46 AM
> To: linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu; 
> kvm@vger.kernel.org
> Cc: christoffer.d...@linaro.org
> Subject: [PATCH 13/14] arm/arm64: KVM: enable kernel side of GICv3 emulation
>
> With all the necessary GICv3 emulation code in place, we can now connect the 
> code to the GICv3 backend in the kernel.
> The LR register handling is different depending on the emulated GIC model, so 
> provide different implementations for each.
> Also allow non-v2-compatible GICv3 implementations (which don't provide MMIO 
> regions for the virtual CPU interface in the DT), but restrict those hosts to 
> use GICv3 guests only.
>
> Signed-off-by: Andre Przywara 
> ---
>  virt/kvm/arm/vgic-v3.c |  138 ++--
>  virt/kvm/arm/vgic.c|2 +
>  2 files changed, 112 insertions(+), 28 deletions(-)
>
> diff --git a/virt/kvm/arm/vgic-v3.c b/virt/kvm/arm/vgic-v3.c index 
> 7d9c85e..d26d12f 100644
> --- a/virt/kvm/arm/vgic-v3.c
> +++ b/virt/kvm/arm/vgic-v3.c
> @@ -34,6 +34,7 @@
>  #define GICH_LR_VIRTUALID(0x3ffUL << 0)
>  #define GICH_LR_PHYSID_CPUID_SHIFT   (10)
>  #define GICH_LR_PHYSID_CPUID (7UL << GICH_LR_PHYSID_CPUID_SHIFT)
> +#define ICH_LR_VIRTUALID_MASK(BIT_ULL(32) - 1)
>
>  /*
>   * LRs are stored in reverse order in memory. make sure we index them @@ 
> -43,7 +44,35 @@
>
>  static u32 ich_vtr_el2;
>
> -static struct vgic_lr vgic_v3_get_lr(const struct kvm_vcpu *vcpu, int lr)
> +static u64 sync_lr_val(u8 state)
> +{
> + u64 lr_val = 0;
> +
> + if (state & LR_STATE_PENDING)
> + lr_val |= ICH_LR_PENDING_BIT;
> + if (state & LR_STATE_ACTIVE)
> + lr_val |= ICH_LR_ACTIVE_BIT;
> + if (state & LR_EOI_INT)
> + lr_val |= ICH_LR_EOI;
> +
> + return lr_val;
> +}
> +
> +static u8 sync_lr_state(u64 lr_val)
> +{
> + u8 state = 0;
> +
> + if (lr_val & ICH_LR_PENDING_BIT)
> + state |= LR_STATE_PENDING;
> + if (lr_val & ICH_LR_ACTIVE_BIT)
> + state |= LR_STATE_ACTIVE;
> + if (lr_val & ICH_LR_EOI)
> + state |= LR_EOI_INT;
> +
> + return state;
> +}
> +
> +static struct vgic_lr vgic_v2_on_v3_get_lr(const struct kvm_vcpu *vcpu,
> +int lr)
>  {
>   struct vgic_lr lr_desc;
>   u64 val = vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)];
> @@ -53,30 +82,53 @@ static struct vgic_lr vgic_v3_get_lr(const struct 
> kvm_vcpu *vcpu, int lr)
>   lr_desc.source  = (val >> GICH_LR_PHYSID_CPUID_SHIFT) & 0x7;
>   else
>   lr_desc.source = 0;
> - lr_desc.state   = 0;
> + lr_desc.state   = sync_lr_state(val);
>
> - if (val & ICH_LR_PENDING_BIT)
> - lr_desc.state |= LR_STATE_PENDING;
> - if (val & ICH_LR_ACTIVE_BIT)
> - lr_desc.state |= LR_STATE_ACTIVE;
> - if (val & ICH_LR_EOI)
> - lr_desc.state |= LR_EOI_INT;
> + return lr_desc;
> +}
> +
> +static struct vgic_lr vgic_v3_on_v3_get_lr(const struct kvm_vcpu *vcpu,
> +int lr) {
> + struct vgic_lr lr_desc;
> + u64 val = vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)];
> +
> + lr_desc.irq = val & ICH_LR_VIRTUALID_MASK;
> + lr_desc.source  = 0;
> + lr_desc.state   = sync_lr_state(val);
>
>   return lr_desc;
>  }
>
> -static void vgic_v3_set_lr(struct kvm_vcpu *vcpu, int lr,
> -struct vgic_lr lr_desc)
> +static void vgic_v3_on_v3_set_lr(struct kvm_vcpu *vcpu, int lr,
> +  struct vgic_lr lr_desc)
>  {
> - u64 lr_val = (((u32)lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT) |
> -   lr_desc.irq);
> + u64 lr_val;
>
> - if (lr_desc.state & LR_STATE_PENDING)
> - lr_val |= ICH_LR_PENDING_BIT;
> - if (lr_desc.state & LR_STATE_ACTIVE)
> - lr_val |= ICH_LR_ACTIVE_BIT;
> - if (lr_desc.state & LR_EOI_INT)
> - lr_val |= ICH_LR_EOI;
> + lr_val = lr_desc.irq;
> +
> + /*
> +  * currently all guest IRQs are Group1, as Group0 would result
> +  * in a FIQ in the guest, which it wouldn't expect.
> +  * Eventually we want to make this configurable, so we may revisit
> +  * this in the future.
> +  */
> + lr_val |= ICH_LR_GROUP;
> +
> + lr_val |= sync_lr_val(lr_desc.state);
> +
> + vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)] = lr_val; }
> +
> +static void vgic_v2_on_v3_set_lr(struct kvm_vcpu *vcpu, int lr,
> +  struct vgic_lr lr_desc)
> +{
> + u64 lr_val;
> +
> + lr_val = lr_desc.irq;
> +
> + lr_val |= (u32)lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT;
> +
> + lr_val |= sync_lr_val(lr_desc.state);
>
>   vcpu->arch.vgic_cpu.vgic_v3.vgic_lr

Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread James Hogan
Hi,

On 20/06/14 07:07, Paolo Bonzini wrote:
> - Messaggio originale -
>> Da: "Aurelien Jarno" 
>> A: "Sanjay Lal" 
>> Cc: "James Hogan" , qemu-de...@nongnu.org, "Peter 
>> Maydell" ,
>> kvm@vger.kernel.org, "Gleb Natapov" , "Paolo Bonzini" 
>> 
>> Inviato: Giovedì, 19 giugno 2014 23:47:34
>> Oggetto: Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support
>>
>> On Thu, Jun 19, 2014 at 12:34:24PM -0700, Sanjay Lal wrote:
>>>
>>> On Jun 19, 2014, at 9:27 AM, Aurelien Jarno  wrote:
>>>
 On Tue, Jun 17, 2014 at 11:10:35PM +0100, James Hogan wrote:
> In KVM mode the bootrom is loaded and executed from the last 1MB of
> DRAM.

 What is the reason for that? I am not opposed to that, but if it is
 really needed, it means that loading a bootloader into the flash area
 (for example YAMON) won't work and that this should be forbidden to the
 user.

>>>
>>> In trap and emulate mode, both the kernel and userland run in user mode on
>>> the processor. Virtual addresses >= 0x80000000 are only accessible in
>>> kernel mode, and the default flash area (VA: 0xbfc00000/PA: 0x1fc00000)
>>> falls in this range.
>>>
>>> We therefore decided to relocate the bootloader to the last 1MB of RAM.
>>> This area is excluded from the RAM ranges supplied to the kernel, so it
>>> should not be accessible to the user.

I did recently try relocating the bootloader to the reset address in the
T&E KSeg0 (i.e. PA=0x1fc00000, VA=0x5fc00000), but the current MIPS KVM
implementation in the kernel has some limitations when it comes to
memory regions. It allocates a linear guest_pmap array (for GPA->RPA
page mapping) based only on the first memory region committed, so if you
set e.g. mem=64MB then physical memory according to guest_pmap won't
reach the reset address and it fails to map it. The kernel really needs
fixing to use a more flexible physical page table structure first.

>> Thanks for the explanation. It means we should disable the support for
>> booting from the flash (using -pflash) in KVM mode, as it would simply
>> not work.
> 
> My idea was to add a machine-specific option umkernel=on, and require it
> in order to run KVM.  Later we can add umkernel=on support for TCG as well,

FYI I tried this and it was a fairly small change (fixing CP0_EBase
initialisation and switching a couple of kvm_enabled() checks to
something like mips_um_ksegs_enabled()). Needs more testing though.

> while umkernel=off with KVM requires virtualization extensions.
> 
> The same option can disable pflash boot.
> 
> What do you think?

I think with an executable flash region / reset address the pflash
option could be made to work, but of course you'd probably need a
relocated flash image too, which may make the option less useful (and it
presumably isn't like a kernel ELF where you can detect what address
it's linked at).

For now, disabling Malta non-kernel loads in KVM mode makes sense, I think.

Thanks
James


Re: [PATCH 04/14] arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 bit ones

2014-06-20 Thread Marc Zyngier
On 20/06/14 09:34, Andre Przywara wrote:
> 
> On 19/06/14 22:15, Chalamarla, Tirumalesh wrote:
>>
>>
>> -Original Message-
>> From: kvmarm-boun...@lists.cs.columbia.edu 
>> [mailto:kvmarm-boun...@lists.cs.columbia.edu] On Behalf Of Andre Przywara
>> Sent: Thursday, June 19, 2014 2:46 AM
>> To: linux-arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu; 
>> kvm@vger.kernel.org
>> Cc: christoffer.d...@linaro.org
>> Subject: [PATCH 04/14] arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 
>> bit ones
>>
>> Some GICv3 registers can and will be accessed as 64 bit registers.
>> Currently the register handling code can only deal with 32 bit accesses, so 
>> we do two consecutive calls to cover this.
>>
>> Signed-off-by: Andre Przywara 
>> ---
>>  virt/kvm/arm/vgic.c |   48 +---
>>  1 file changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c index 
>> 4c6b212..b3cf4c7 100644
>> --- a/virt/kvm/arm/vgic.c
>> +++ b/virt/kvm/arm/vgic.c
>> @@ -906,6 +906,48 @@ static bool vgic_validate_access(const struct vgic_dist 
>> *dist,  }
>>  
>>  /*
>> + * Call the respective handler function for the given range.
>> + * We split up any 64 bit accesses into two consecutive 32 bit
>> + * handler calls and merge the result afterwards.
>> + */
>> +static bool call_range_handler(struct kvm_vcpu *vcpu,
>> +   struct kvm_exit_mmio *mmio,
>> +   unsigned long offset,
>> +   const struct mmio_range *range) {
>> +u32 *data32 = (void *)mmio->data;
>> +struct kvm_exit_mmio mmio32;
>> +bool ret;
>> +
>> +if (likely(mmio->len <= 4))
>> +return range->handle_mmio(vcpu, mmio, offset);
>> +
>> +/*
>> + * We assume that any access greater than 4 bytes is actually
>> + * 8 bytes long, caused by a 64-bit access
>> + */
>> +
>> +mmio32.len = 4;
>> +mmio32.is_write = mmio->is_write;
>> +
>> +mmio32.phys_addr = mmio->phys_addr + 4;
>> +if (mmio->is_write)
>> +*(u32 *)mmio32.data = data32[1];
>> +ret = range->handle_mmio(vcpu, &mmio32, offset + 4);
>> +if (!mmio->is_write)
>> +data32[1] = *(u32 *)mmio32.data;
>> +
>> +mmio32.phys_addr = mmio->phys_addr;
>> +if (mmio->is_write)
>> +*(u32 *)mmio32.data = data32[0];
>> +ret |= range->handle_mmio(vcpu, &mmio32, offset);
>> +if (!mmio->is_write)
>> +data32[0] = *(u32 *)mmio32.data;
>> +
>> +return ret;
>> +}
>>
>> Any reason to use two 32-bit accesses instead of one 64-bit access? Maybe for AArch32 on ARMv8?
> 
> For the registers we care about right now we get along with this split.
> And it seems to be less intrusive.
> I have a patch to support native 64-bit accesses, but it needs some more
> work. If there is high demand for it, I can post it (but Marc didn't
> like the first version so much ;-) )

The main reason is that GICv3 doesn't mandate or guarantee any form of
64-bit atomicity. There is strictly no need to provide a guarantee that
the architecture doesn't/cannot offer (think AArch32 guests, for example).

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH v5 04/20] arm64: boot protocol documentation update for GICv3

2014-06-20 Thread Mark Rutland
On Thu, Jun 19, 2014 at 07:40:02PM +0100, Marc Zyngier wrote:
> Hi Mark,
> 
> On 19/06/14 15:01, Mark Rutland wrote:
> > Hi Marc,
> > 
> > On Thu, Jun 19, 2014 at 10:19:27AM +0100, Marc Zyngier wrote:
> >> Linux has some requirements that must be satisfied in order to boot
> >> on a system built with a GICv3.
> >>
> >> Acked-by: Christoffer Dall 
> >> Signed-off-by: Marc Zyngier 
> >> ---
> >>  Documentation/arm64/booting.txt | 6 ++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/Documentation/arm64/booting.txt 
> >> b/Documentation/arm64/booting.txt
> >> index 37fc4f6..e28ccec 100644
> >> --- a/Documentation/arm64/booting.txt
> >> +++ b/Documentation/arm64/booting.txt
> >> @@ -141,6 +141,12 @@ Before jumping into the kernel, the following 
> >> conditions must be met:
> >>the kernel image will be entered must be initialised by software at a
> >>higher exception level to prevent execution in an UNKNOWN state.
> >>  
> >> +  For systems with a GICv3 interrupt controller, it is expected that:
> >> +  - If EL3 is present, it must program ICC_SRE_EL3.Enable (bit 3) to
> >> +0b1 and ICC_SRE_EL3.SRE (bit 0) to 0b1.
> >> +  - If the kernel is entered at EL1, EL2 must set ICC_SRE_EL2.Enable
> >> +(bit 3) to 0b1 and ICC_SRE_EL2.SRE (bit 0) to 0b1.
> > 
> > Apologies for spotting this so late, but to me this sounds slightly
> > ambiguous. The use of "it is expected" doesn't read like a hard
> > requirement, and in the first point, it's ambiguous as to what "it" is.
> > 
> > I assume that if the GIC is communicated to the kernel as a GICv2 then
> > these points do not hold?
> 
> The first point always holds, especially if the kernel is entered at EL2
> (see patch #2 and the way we initialize System Registers in head.S). At
> this stage, we haven't looked at DT yet, and must setup EL2
> independently of what the platform will describe. The only source of
> information we have is whether or not this CPU implements GICv3 System
> Registers (id_aa64pfr0_el1).
>
> Assuming EL3 doesn't set these two bits, you will end up trapping back
> to EL3. You can hope that EL3 will do the right thing (do what is
> described above and restart the offending instruction at EL2). If it
> doesn't, you're dead.
> 
> > How about:
> > 
> >   For systems with a GICv3 interrupt controller, where the presence of
> >   GICv3 is communicated to the kernel:
> >   - If EL3 is present:
> >     ICC_SRE_EL3.Enable (bit 3) must be initialised to 0b1.
> >     ICC_SRE_EL3.SRE (bit 0) must be initialised to 0b1.
> >   - If the kernel is entered at EL1:
> >     ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1.
> >     ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1.
> 
> I'm happy with that change, provided that we get rid of the ", where the
> presence of GICv3 is communicated to the kernel".

Sure, given the above that sounds fine by me.

Thanks,
Mark.


Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread Aurelien Jarno
On Fri, Jun 20, 2014 at 02:07:05AM -0400, Paolo Bonzini wrote:
> 
> 
> - Messaggio originale -
> > Da: "Aurelien Jarno" 
> > A: "Sanjay Lal" 
> > Cc: "James Hogan" , qemu-de...@nongnu.org, "Peter 
> > Maydell" ,
> > kvm@vger.kernel.org, "Gleb Natapov" , "Paolo Bonzini" 
> > 
> > Inviato: Giovedì, 19 giugno 2014 23:47:34
> > Oggetto: Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support
> > 
> > On Thu, Jun 19, 2014 at 12:34:24PM -0700, Sanjay Lal wrote:
> > > 
> > > On Jun 19, 2014, at 9:27 AM, Aurelien Jarno  wrote:
> > > 
> > > > On Tue, Jun 17, 2014 at 11:10:35PM +0100, James Hogan wrote:
> > > >> In KVM mode the bootrom is loaded and executed from the last 1MB of
> > > >> DRAM.
> > > > 
> > > > What is the reason for that? I am not opposed to that, but if it is
> > > > really needed, it means that loading a bootloader into the flash area
> > > > (for example YAMON) won't work and that this should be forbidden to the
> > > > user.
> > > > 
> > > 
> > > In trap and emulate mode, both the kernel and userland run in user mode on
> > > the processor. Virtual addresses >= 0x80000000 are only accessible in
> > > kernel mode, and the default flash area (VA: 0xbfc00000/PA: 0x1fc00000)
> > > falls in this range.
> > > 
> > > We therefore decided to relocate the bootloader to the last 1MB of RAM.
> > > This area is excluded from the RAM ranges supplied to the kernel, so it
> > > should not be accessible to the user.
> > > 
> > 
> > Thanks for the explanation. It means we should disable the support for
> > booting from the flash (using -pflash) in KVM mode, as it would simply
> > not work.
> 
> My idea was to add a machine-specific option umkernel=on, and require it
> in order to run KVM.  Later we can add umkernel=on support for TCG as well,
> while umkernel=off with KVM requires virtualization extensions.
> 
> The same option can disable pflash boot.
> 
> What do you think?

From what I understand, the current KVM support in MIPS uses trap and
emulate and thus doesn't need hardware support, just a recent kernel
with the option enabled. That's why I do wonder if there is a real point
in supporting UM kernels in TCG mode.

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net


Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread James Hogan
On 19/06/14 22:47, Aurelien Jarno wrote:
> On Thu, Jun 19, 2014 at 12:34:24PM -0700, Sanjay Lal wrote:
>>
>> On Jun 19, 2014, at 9:27 AM, Aurelien Jarno  wrote:
>>
>>> On Tue, Jun 17, 2014 at 11:10:35PM +0100, James Hogan wrote:
 In KVM mode the bootrom is loaded and executed from the last 1MB of
 DRAM.
>>>
>>> What is the reason for that? I am not opposed to that, but if it is
>>> really needed, it means that loading a bootloader into the flash area
>>> (for example YAMON) won't work and that this should be forbidden to the
>>> user.
>>>
>>
>> In trap and emulate mode, both the kernel and userland run in user mode on 
>> the processor. Virtual addresses >= 0x80000000 are only accessible in kernel 
>> mode, and the default flash area (VA: 0xbfc00000/PA: 0x1fc00000) falls in 
>> this range.
>>
>> We therefore decided to relocate the bootloader to the last 1MB of RAM.  
>> This area is excluded from the RAM ranges supplied to the kernel, so it 
>> should not be accessible to the user.
>>
> 
> Thanks for the explanation. It means we should disable the support for
> booting from the flash (using -pflash) in KVM mode, as it would simply
> not work.
> 

Hi Aurelien,

Is this fixup to the malta patch the sort of thing you had in mind? If
so I'll generate a v6 patchset with it.

Cheers
James

diff --git a/hw/mips/mips_malta.c b/hw/mips/mips_malta.c
index 8bc5392b4223..91b0ce566111 100644
--- a/hw/mips/mips_malta.c
+++ b/hw/mips/mips_malta.c
@@ -1052,6 +1052,12 @@ void mips_malta_init(MachineState *machine)
  bootloader_run_addr, kernel_entry);
 }
 } else {
+/* The flash region isn't executable from a KVM T&E guest */
+if (kvm_enabled()) {
+error_report("KVM enabled but no -kernel argument was specified. "
+ "Booting from flash is not supported with KVM T&E.");
+exit(1);
+}
 /* Load firmware from flash. */
 if (!dinfo) {
 /* Load a BIOS image. */


Re: [PATCH v5 02/20] arm64: initial support for GICv3

2014-06-20 Thread Mark Rutland
Hi Marc,

I have some minor comments below.

On Thu, Jun 19, 2014 at 10:19:25AM +0100, Marc Zyngier wrote:
> The Generic Interrupt Controller (version 3) offers services that are
> similar to GICv2, with a number of additional features:
> - Affinity routing based on the CPU MPIDR (ARE)
> - System register for the CPU interfaces (SRE)
> - Support for more than 8 CPUs
> - Locality-specific Peripheral Interrupts (LPIs)
> - Interrupt Translation Services (ITS)
>
> This patch adds preliminary support for GICv3 with ARE and SRE,
> non-secure mode only. It relies on higher exception levels to grant ARE
> and SRE access.
>
> Support for LPI and ITS will be added at a later time.
>
> Cc: Thomas Gleixner 
> Cc: Jason Cooper 
> Reviewed-by: Zi Shen Lim 
> Reviewed-by: Christoffer Dall 
> Reviewed-by: Tirumalesh Chalamarla 
> Reviewed-by: Yun Wu 
> Reviewed-by: Zhen Lei 
> Tested-by: Tirumalesh Chalamarla
> Tested-by: Radha Mohan Chintakuntla 
> Acked-by: Radha Mohan Chintakuntla 
> Acked-by: Catalin Marinas 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/Kconfig |   1 +
>  arch/arm64/kernel/head.S   |  18 +
>  arch/arm64/kernel/hyp-stub.S   |   1 +
>  drivers/irqchip/Kconfig|   5 +
>  drivers/irqchip/Makefile   |   1 +
>  drivers/irqchip/irq-gic-v3.c   | 690 
> +
>  include/linux/irqchip/arm-gic-v3.h | 193 +++
>  7 files changed, 909 insertions(+)
>  create mode 100644 drivers/irqchip/irq-gic-v3.c
>  create mode 100644 include/linux/irqchip/arm-gic-v3.h

[...]

> +#ifdef CONFIG_ARM_GIC_V3
> +   /* GICv3 system register access */
> +   mrs x0, id_aa64pfr0_el1
> +   ubfx x0, x0, #24, #4
> +   cmp x0, #1
> +   b.ne3f
> +
> +   mrs x0, ICC_SRE_EL2
> +   orr x0, x0, #1  // Set ICC_SRE_EL2.SRE==1
> +   orr x0, x0, #(1 << 3)   // Set ICC_SRE_EL2.Enable==1

Could we use macros for these constants?

We already seem to have ICC_SRE_EL2_ENABLE in arm-gic-v3.h, so we'd just
need to add ICC_SRE_EL2_SRE.

The ubfx on the id_aa64pfr0_el1 value probably can't be made nicer with
macros, but I guess we can't have everything.

[...]

> +#define DEFAULT_PMR_VALUE  0xf0

Is this an arbitrary value, or chosen for a particular reason? Could we
have a comment either way?

> +static void gic_do_wait_for_rwp(void __iomem *base)
> +{
> +   u32 count = 1000000; /* 1s! */

I would suggest using USEC_PER_SEC, but it looks like to do so you'd
need to pull in all of <linux/time.h>, so I guess that's not worthwhile.

> +
> +   while (readl_relaxed(base + GICD_CTLR) & GICD_CTLR_RWP) {
> +   count--;
> +   if (!count) {
> +   pr_err_ratelimited("RWP timeout, gone fishing\n");
> +   return;
> +   }
> +   cpu_relax();
> +   udelay(1);
> +   };
> +}

[...]

> +/*
> + * Routines to acknowledge, disable and enable interrupts
> + */

This comment looks out of sync with the code; gic_read_iar was earlier.

[...]

> +static u64 gic_mpidr_to_affinity(u64 mpidr)
> +{
> +   u64 aff;
> +
> +   aff = (MPIDR_AFFINITY_LEVEL(mpidr, 3) << 32 |
> +  MPIDR_AFFINITY_LEVEL(mpidr, 2) << 16 |
> +  MPIDR_AFFINITY_LEVEL(mpidr, 1) << 8  |
> +  MPIDR_AFFINITY_LEVEL(mpidr, 0)) & ~GICD_IROUTER_SPI_MODE_ANY;

Surely GICD_IROUTER_SPI_MODE_ANY (bit 31) can't ever be set by the rest
of the expression above? Or am I being thick?

[...]

> +static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs 
> *regs)
> +{
> +   u64 irqnr;
> +
> +   do {
> +   irqnr = gic_read_iar();
> +
> +   if (likely(irqnr > 15 && irqnr < 1020)) {
> +   u64 irq = irq_find_mapping(gic_data.domain, irqnr);
> +   if (likely(irq)) {
> +   handle_IRQ(irq, regs);
> +   continue;
> +   }
> +
> +   WARN_ONCE(true, "Unexpected SPI received!\n");
> +   gic_write_eoir(irqnr);
> +   }
> +   if (irqnr < 16) {
> +   gic_write_eoir(irqnr);
> +#ifdef CONFIG_SMP
> +   handle_IPI(irqnr, regs);
> +#else
> +   WARN_ONCE(true, "Unexpected SGI received!\n");
> +#endif
> +   continue;
> +   }
> +   } while (irqnr != 0x3ff);

Any chance we could have a GIC_IRQNR_SPURIOUS macro or similar?

[...]

> +static void __init gic_dist_init(void)
> +{
> +   unsigned int i;
> +   u64 affinity;
> +   void __iomem *base = gic_data.dist_base;
> +
> +   /* Disable the distributor */
> +   writel_relaxed(0, base + GICD_CTLR);
> +   gic_dist_wait_for_rwp();
> +
> +   gic_dist_config(base, gic_data.irq_nr, gic_dist_wait_for_rwp);
> +
> +   /* Enable distributor with

Re: [patch 2/5] KVM: MMU: allow pinning spte translations (TDP-only)

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 04:22:57PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 10:21:16AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:05PM -0300, mtosa...@redhat.com wrote:
> > > Allow vcpus to pin spte translations by:
> > > 
> > > 1) Creating a per-vcpu list of pinned ranges.
> > What if memory slot containing pinned range is going away?
> 
> ->page_fault() should fail and guest abort. Will double check.
> 
> > > 2) On mmu reload request:
> > >   - Fault ranges.
> > >   - Mark sptes with a pinned bit.
> > Should also be marked "dirty" as per SDM:
> >  The three DS save area sections should be allocated from a non-paged pool, 
> > and marked accessed and dirty
> 
> This (SDM text) is about guest pagetable AFAICS.
> 
It's hard to say. The SDM does not mention virtualization or two-dimensional
paging in that section at all. My reading is that this section talks about
all translations that CPU should perform to get to the physical address,
otherwise why are we trying hard to make sure that EPT translations are
always present? Because the same paragraph say in the next sentence:

 It is the responsibility of the operating system to keep the pages that
 contain the buffer present and to mark them accessed and dirty

So if we take from it that the translation should be present, the same goes for
accessed and dirty. If Andi can clarify this within Intel it would be great.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 3/5] KVM: MMU: notifiers support for pinned sptes

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 03:28:25PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 09:48:50AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:06PM -0300, mtosa...@redhat.com wrote:
> > > Request KVM_REQ_MMU_RELOAD when deleting sptes from MMU notifiers.
> > > 
> > > Keep pinned sptes intact if page aging.
> > > 
> > > Signed-off-by: Marcelo Tosatti 
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.c |   71 
> > > ++---
> > >  1 file changed, 62 insertions(+), 9 deletions(-)
> > > 
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-06-18 
> > > 17:28:24.339435654 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-06-18 17:29:32.510225755 
> > > -0300
> > > @@ -1184,6 +1184,42 @@
> > >   kvm_flush_remote_tlbs(vcpu->kvm);
> > >  }
> > >  
> > > +static void ack_flush(void *_completed)
> > > +{
> > > +}
> > > +
> > > +static void mmu_reload_pinned_vcpus(struct kvm *kvm)
> > > +{
> > > + int i, cpu, me;
> > > + cpumask_var_t cpus;
> > > + struct kvm_vcpu *vcpu;
> > > + unsigned int req = KVM_REQ_MMU_RELOAD;
> > > +
> > > + zalloc_cpumask_var(&cpus, GFP_ATOMIC);
> > > +
> > > + me = get_cpu();
> > > + kvm_for_each_vcpu(i, vcpu, kvm) {
> > > + if (list_empty(&vcpu->arch.pinned_mmu_pages))
> > > + continue;
> > > + kvm_make_request(req, vcpu);
> > > + cpu = vcpu->cpu;
> > > +
> > > + /* Set ->requests bit before we read ->mode */
> > > + smp_mb();
> > > +
> > > + if (cpus != NULL && cpu != -1 && cpu != me &&
> > > +   kvm_vcpu_exiting_guest_mode(vcpu) != OUTSIDE_GUEST_MODE)
> > > + cpumask_set_cpu(cpu, cpus);
> > > + }
> > > + if (unlikely(cpus == NULL))
> > > + smp_call_function_many(cpu_online_mask, ack_flush, NULL, 1);
> > > + else if (!cpumask_empty(cpus))
> > > + smp_call_function_many(cpus, ack_flush, NULL, 1);
> > > + put_cpu();
> > > + free_cpumask_var(cpus);
> > > + return;
> > > +}
> > This is a c&p of make_all_cpus_request(), the only difference is checking
> > of vcpu->arch.pinned_mmu_pages.  You can add make_some_cpus_request(..., 
> > bool (*predicate)(struct kvm_vcpu *))
> > to kvm_main.c and rewrite make_all_cpus_request() to use it instead.
> 
> Half-way through it i decided it was better to c&p.
> 
> Can change make_all_cpus_request() though if it makes more sense to you.
> 
If I haven't missed anything and checking of pinned_mmu_pages is indeed the
only difference, then yes, reusing make_all_cpus_request() makes more sense.

--
Gleb.


Re: [PATCH v5 02/20] arm64: initial support for GICv3

2014-06-20 Thread Marc Zyngier
Hi Mark,

On 20/06/14 11:02, Mark Rutland wrote:
> Hi Marc,
> 
> I have some minor comments below.
> 
> On Thu, Jun 19, 2014 at 10:19:25AM +0100, Marc Zyngier wrote:
>> The Generic Interrupt Controller (version 3) offers services that are
>> similar to GICv2, with a number of additional features:
>> - Affinity routing based on the CPU MPIDR (ARE)
>> - System register for the CPU interfaces (SRE)
>> - Support for more that 8 CPUs
>> - Locality-specific Peripheral Interrupts (LPIs)
>> - Interrupt Translation Services (ITS)
>>
>> This patch adds preliminary support for GICv3 with ARE and SRE,
>> non-secure mode only. It relies on higher exception levels to grant ARE
>> and SRE access.
>>
>> Support for LPI and ITS will be added at a later time.
>>
>> Cc: Thomas Gleixner 
>> Cc: Jason Cooper 
>> Reviewed-by: Zi Shen Lim 
>> Reviewed-by: Christoffer Dall 
>> Reviewed-by: Tirumalesh Chalamarla 
>> Reviewed-by: Yun Wu 
>> Reviewed-by: Zhen Lei 
>> Tested-by: Tirumalesh Chalamarla
>> Tested-by: Radha Mohan Chintakuntla 
>> Acked-by: Radha Mohan Chintakuntla 
>> Acked-by: Catalin Marinas 
>> Signed-off-by: Marc Zyngier 
>> ---
>>  arch/arm64/Kconfig |   1 +
>>  arch/arm64/kernel/head.S   |  18 +
>>  arch/arm64/kernel/hyp-stub.S   |   1 +
>>  drivers/irqchip/Kconfig|   5 +
>>  drivers/irqchip/Makefile   |   1 +
>>  drivers/irqchip/irq-gic-v3.c   | 690 
>> +
>>  include/linux/irqchip/arm-gic-v3.h | 193 +++
>>  7 files changed, 909 insertions(+)
>>  create mode 100644 drivers/irqchip/irq-gic-v3.c
>>  create mode 100644 include/linux/irqchip/arm-gic-v3.h
> 
> [...]
> 
>> +#ifdef CONFIG_ARM_GIC_V3
>> +   /* GICv3 system register access */
>> +   mrs x0, id_aa64pfr0_el1
>> +   ubfxx0, x0, #24, #4
>> +   cmp x0, #1
>> +   b.ne3f
>> +
>> +   mrs x0, ICC_SRE_EL2
>> +   orr x0, x0, #1  // Set ICC_SRE_EL2.SRE==1
>> +   orr x0, x0, #(1 << 3)   // Set ICC_SRE_EL2.Enable==1
> 
> Could we use macros for these constants?
> 
> We already seem to have ICC_SRE_EL2_ENABLE in arm-gic-v3.h, so we'd just
> need to add ICC_SRE_EL2_SRE.

Sure.

> The ubfx on the id_aa64pfr0_el1 value probably can't be made nicer with
> macros, but I guess we can't have everything.

I suppose it would look nicer with shift and mask, but that'd be two
instructions, and this is such a critical path... ;-)

> [...]
> 
>> +#define DEFAULT_PMR_VALUE  0xf0
> 
> Is this an arbitrary value, or chosen for a particular reason? Could we
> have a comment either way?
> 
>> +static void gic_do_wait_for_rwp(void __iomem *base)
>> +{
>> +   u32 count = 100;/* 1s! */
> 
> I would suggest using USEC_PER_SEC, but it looks like to do so you'd
> need to pull in all of , so I guess that's not worthwhile.

I'll have a look anyway.

>> +
>> +   while (readl_relaxed(base + GICD_CTLR) & GICD_CTLR_RWP) {
>> +   count--;
>> +   if (!count) {
>> +   pr_err_ratelimited("RWP timeout, gone fishing\n");
>> +   return;
>> +   }
>> +   cpu_relax();
>> +   udelay(1);
>> +   };
>> +}
> 
> [...]
> 
>> +/*
>> + * Routines to acknowledge, disable and enable interrupts
>> + */
> 
> This comment looks out of sync with the code; gic_read_iar was earlier.

Sure, I'll move it around.

> [...]
> 
>> +static u64 gic_mpidr_to_affinity(u64 mpidr)
>> +{
>> +   u64 aff;
>> +
>> +   aff = (MPIDR_AFFINITY_LEVEL(mpidr, 3) << 32 |
>> +  MPIDR_AFFINITY_LEVEL(mpidr, 2) << 16 |
>> +  MPIDR_AFFINITY_LEVEL(mpidr, 1) << 8  |
>> +  MPIDR_AFFINITY_LEVEL(mpidr, 0)) & ~GICD_IROUTER_SPI_MODE_ANY;
> 
> Surely GICD_IROUTER_SPI_MODE_ANY (bit 31) can't ever be set by the rest
> of the expression above? Or am I being thick?

Probably a leftover from an early refactor. Thanks for noticing this.

> [...]
> 
>> +static asmlinkage void __exception_irq_entry gic_handle_irq(struct pt_regs 
>> *regs)
>> +{
>> +   u64 irqnr;
>> +
>> +   do {
>> +   irqnr = gic_read_iar();
>> +
>> +   if (likely(irqnr > 15 && irqnr < 1020)) {
>> +   u64 irq = irq_find_mapping(gic_data.domain, irqnr);
>> +   if (likely(irq)) {
>> +   handle_IRQ(irq, regs);
>> +   continue;
>> +   }
>> +
>> +   WARN_ONCE(true, "Unexpected SPI received!\n");
>> +   gic_write_eoir(irqnr);
>> +   }
>> +   if (irqnr < 16) {
>> +   gic_write_eoir(irqnr);
>> +#ifdef CONFIG_SMP
>> +   handle_IPI(irqnr, regs);
>> +#else
>> +   WARN_ONCE(true, "Unexpected SGI received!\n");
>> +#endif
>> +   continue;
>> +   }
>> +   } while (ir

Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread Paolo Bonzini

Il 20/06/2014 11:10, Aurelien Jarno ha scritto:

>> My idea was to add a machines-specific option umkernel=on, and require it
>> in order to run KVM.  Later we can add umkernel=on support for TCG as well,
>> while umkernel=off with KVM requires virtualization extensions.
>>
>> The same option can disable pflash boot.
>>
>> What do you think?
>
> From what I understand, the current KVM support in MIPS uses trap and
> emulate and thus doesn't need hardware support, just a recent kernel
> with the option enabled.

Yes, but work to support virtualization extensions is underway.  Patches
were posted a few months ago.

> That's why I do wonder if there is a real point
> in supporting UM kernels in TCG mode.


Debugging, mainly.  It is sometimes useful to compare TCG with KVM on 
x86, and I suppose it could be the same on MIPS.


Paolo


Re: [patch 4/5] KVM: MMU: reload request from GET_DIRTY_LOG path

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 03:40:31PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 11:17:19AM +0300, Gleb Natapov wrote:
> > On Wed, Jun 18, 2014 at 08:12:07PM -0300, mtosa...@redhat.com wrote:
> > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before
> > > deleting a pinned spte.
> > > 
> > > Signed-off-by: Marcelo Tosatti 
> > > 
> > > ---
> > >  arch/x86/kvm/mmu.c |3 +++
> > >  1 file changed, 3 insertions(+)
> > > 
> > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c
> > > ===
> > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c  2014-06-13 
> > > 16:50:50.040140594 -0300
> > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c   2014-06-13 16:51:05.620104451 
> > > -0300
> > > @@ -1247,6 +1247,9 @@
> > >   spte &= ~SPTE_MMU_WRITEABLE;
> > >   spte = spte & ~PT_WRITABLE_MASK;
> > >  
> > > + if (is_pinned_spte(spte))
> > > + mmu_reload_pinned_vcpus(kvm);
> > > +
> > Why write protect it at all? mmu_reload_pinned_vcpus() will unprotect it
> > anyway
> > on the next vmentry. Isn't it better to just report all pinned pages as
> > dirty always?
> 
> That was the initial plan, however its awkward to stop vcpus, execute
> get_dirty_log twice, and have pages marked as dirty on the second
> execution.
Indeed, but I think it may happen today with vhost (or even other devices
that emulate dma), so userspace should be able to deal with it already.

> 
> That is, it is "incorrect" to report pages as dirty when they are
> clean.
In case of PEBS area we cannot know. CPU writes there directly without
KVM even knowing about it so the only sane thing is to assume that after
a vmentry PEBS area is dirty. This is what happens with this patch BTW,
mmu_reload_pinned_vcpus() will force ->page_fault(FERR_WRITE) which will
mark all pinned pages as dirty even if pages are never written to. You
can achieve the same by having vcpu->pinned_page_dirty which is set to
true on each vmentry and to false on each GET_DIRTY_LOG.

> 
> Moreover, if the number of pinned pages is larger than the dirty
> threshold to stop VM and migrate, you'll never migrate. If vcpus are
> in HLT and don't VM-enter immediately, the pages should not be refaulted
> right away.
We should not allow that many pinned pages for security reasons. And
having a lot of pages to fault in on the next vmentry after each
GET_DIRTY_LOG will slow down a guest during migration considerably.

> 
> Do you think the optimization is worthwhile ?
> 
I think it's simpler, but even if we go with your current approach
it should be improved: there is no point sending an IPI to all vcpus in
spte_write_protect() like you do here, since the IPI will be sent anyway
at the end of write protect because of the ptes' writable->nonwritable
transition, so all you have to do here is set KVM_REQ_MMU_RELOAD; no need
for an IPI.

In fact this makes me think that write protecting a pinned page here is
incorrect, because the old translation may not be in the TLB, and if the CPU
tries to write a PEBS entry after the pte is write protected but before the
MMU is reloaded, it will encounter a non-writable pte and god knows what will
happen; the SDM does not tell. So spte_write_protect() should be something
like this IMO:

static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
{
        u64 spte = *sptep;

        if (!is_writable_pte(spte) &&
            !(pt_protect && spte_is_locklessly_modifiable(spte)))
                return false;

        if (!pt_protect && !is_pinned_spte(spte)) {
                for_each_vcpu()
                        kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
                return true;
        }

        rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

        if (pt_protect)
                spte &= ~SPTE_MMU_WRITEABLE;

        spte = spte & ~PT_WRITABLE_MASK;

        return mmu_spte_update(sptep, spte);
}



--
Gleb.


Re: BUG at mm/memory.c

2014-06-20 Thread Ortwin Glück

On 06/19/2014 06:52 PM, Kirill A. Shutemov wrote:

> http://marc.info/?l=linux-kernel&m=140319579508104&w=2


Yes, those symptoms look very familiar. The patch should really go in stable 
3.15.y.

Thanks.

Ortwin


Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread Paolo Bonzini

Il 20/06/2014 11:25, James Hogan ha scritto:

> On 19/06/14 22:47, Aurelien Jarno wrote:
>> On Thu, Jun 19, 2014 at 12:34:24PM -0700, Sanjay Lal wrote:
>>> On Jun 19, 2014, at 9:27 AM, Aurelien Jarno wrote:
>>>> On Tue, Jun 17, 2014 at 11:10:35PM +0100, James Hogan wrote:
>>>>> In KVM mode the bootrom is loaded and executed from the last 1MB of
>>>>> DRAM.
>>>>
>>>> What is the reason for that? I am not opposed to that, but if it is
>>>> really needed, it means that loading a bootloader into the flash area
>>>> (for example YAMON) won't work and that this should be forbidden to the
>>>> user.
>>>
>>> In trap and emulate mode, both the kernel and userland run in user mode
>>> on the processor. Virtual addresses >= 0x8000 are only accessible in
>>> kernel mode, and the default flash area (VA: 0xbfc0/PA: 0x1fc0) falls
>>> in this range.
>>>
>>> We therefore decided to relocate the bootloader to the last 1MB of RAM.
>>> This area is excluded from the RAM ranges supplied to the kernel, so it
>>> should not be accessible to the user.
>>
>> Thanks for the explanation. It means we should disable the support for
>> booting from the flash (using -pflash) in KVM mode, as it would simply
>> not work.
>
> Hi Aurelien,
>
> Is this fixup to the malta patch the sort of thing you had in mind? If
> so I'll generate a v6 patchset with it.


It looks like this.  No hurry, it can go in after the main series; just 
git-send-email it so it gets noticed and has a proper commit message.


Paolo


Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-20 Thread Gleb Natapov
On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > CCing Marcelo,
> > 
> > On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> > > Hi Gleb,
> > > 
> > > Thanks for the quick reply. Please see below.
> > > 
> > > On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> > > >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> > > >>[Questions]
> > > >>And by the way, would you guys please answer the following questions 
> > > >>for me ?
> > > >>
> > > >>1. What's the ept identity pagetable for ?  Only one page is enough ?
> > > >>
> > > >>2. Is the ept identity pagetable only used in realmode ?
> > > >>Can we free it once the guest is up (vcpu in protect mode)?
> > > >>
> > > >>3. Now, ept identity pagetable is allocated in qemu userspace.
> > > >>Can we allocate it in kernel space ?
> > > >What would be the benefit?
> > > 
> > > I think the benefit is we can hot-remove the host memory a kvm guest
> > > is using.
> > > 
> > > For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. And the
> > > kernel
> > > will never use ZONE_MOVABLE memory. So if we can allocate these two pages 
> > > in
> > > kernel space, we can pin them without any trouble. When doing memory
> > > hot-remove,
> > > the kernel will not try to migrate these two pages.
> > But we can do that by other means, no? The patch you've sent for instance.
> > 
> > > 
> > > >
> > > >>
> > > >>4. If I want to migrate these two pages, what do you think is the best 
> > > >>way ?
> > > >>
> > > >I answered most of those here: 
> > > >http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> > > 
> > > I'm sorry I must missed this email.
> > > 
> > > Seeing your advice, we can unpin these two pages and repin them in the 
> > > next
> > > EPT violation.
> > > So about this problem, which solution would you prefer, allocate these two
> > > pages in kernel
> > > space, or migrate them before memory hot-remove ?
> > > 
> > > I think the first solution is simpler. But I'm not quite sure if there is
> > > any other pages
> > > pinned in memory. If we have the same problem with other kvm pages, I 
> > > think
> > > it is better to
> > > solve it in the second way.
> > > 
> > > What do you think ?
> > Remove pinning is preferable. In fact looks like for identity pagetable
> > it should be trivial, just don't pin. APIC access page is a little bit
> > more complicated since its physical address needs to be tracked to be
> > updated in VMCS.
> 
> Yes, and there are new users of page pinning as well soon (see PEBS
> threads on kvm-devel).
> 
> Was thinking of notifiers scheme. Perhaps:
> 
> ->begin_page_unpin(struct page *page)
>   - Remove any possible access to page.
> 
> ->end_page_unpin(struct page *page)
>   - Reinstantiate any possible access to page.
> 
> For KVM:
> 
> ->begin_page_unpin()
>   - Remove APIC-access page address from VMCS.
> or
>   - Remove spte translation to pinned page.
>   
>   - Put vcpu in state where no VM-entries are allowed.
> 
> ->end_page_unpin()
>   - Setup APIC-access page, ...
>   - Allow vcpu to VM-entry.
> 
I believe that to handle identity page and APIC access page we do not
need any of those. We can use mmu notifiers to track when page begins
to be moved and we can find new page location on EPT violation.

> 
> Because allocating the APIC access page from a distant NUMA node can
> be a performance problem, I believe.
I do not think this is the case. The APIC access page is never written to,
and in fact the SDM advises sharing it between all vcpus.

--
Gleb.


Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread Aurelien Jarno
On Fri, Jun 20, 2014 at 12:38:30PM +0200, Paolo Bonzini wrote:
> Il 20/06/2014 11:10, Aurelien Jarno ha scritto:
> >> My idea was to add a machines-specific option umkernel=on, and require it
> >> in order to run KVM.  Later we can add umkernel=on support for TCG as well,
> >> while umkernel=off with KVM requires virtualization extensions.
> >>
> >> The same option can disable pflash boot.
> >>
> >> What do you think?
> >
> > From what I understand, the current KVM support in MIPS uses trap and
> > emulate and thus doesn't need hardware support, just a recent kernel
> > with the option enabled.
> 
> Yes, but work to support virtualization extensions is underway.
> Patches were posted a few months ago.
> 
> >That's why I do wonder if there is a real point
> >in supporting UM kernels in TCG mode.
> 
> Debugging, mainly.  It is sometimes useful to compare TCG with KVM
> on x86, and I suppose it could be the same on MIPS.

Ok, then we can indeed add a umkernel option, which is always enabled
with KVM, and which disables the flash (and possibly other devices) in
that case.

At some point it might be a good idea to add a specific machine for
emulation/virtualization, as is done on ARM, which would not have to
handle this kind of device and would not have all the current
limitations of the Malta board.

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net


Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread Aurelien Jarno
On Fri, Jun 20, 2014 at 10:25:24AM +0100, James Hogan wrote:
> On 19/06/14 22:47, Aurelien Jarno wrote:
> > On Thu, Jun 19, 2014 at 12:34:24PM -0700, Sanjay Lal wrote:
> >>
> >> On Jun 19, 2014, at 9:27 AM, Aurelien Jarno  wrote:
> >>
> >>> On Tue, Jun 17, 2014 at 11:10:35PM +0100, James Hogan wrote:
>  In KVM mode the bootrom is loaded and executed from the last 1MB of
>  DRAM.
> >>>
> >>> What is the reason for that? I am not opposed to that, but if it is
> >>> really needed, it means that loading a bootloader into the flash area
> >>> (for example YAMON) won't work and that this should be forbidden to the
> >>> user.
> >>>
> >>
> >> In trap and emulate mode, both the kernel and userland run in user mode on 
> >> the processor. Virtual addresses >= 0x8000 are only accessible in 
> >> kernel mode, and the default flash area (VA: 0xbfc0/PA: 0x1fc0) 
> >> falls in this range.
> >>
> >> We therefore decided to relocate the bootloader to the last 1MB of RAM.  
> >> This area is excluded from the RAM ranges supplied to the kernel, so it 
> >> should not be accessible to the user.
> >>
> > 
> > Thanks for the explanation. It means we should disable the support for
> > booting from the flash (using -pflash) in KVM mode, as it would simply
> > not work.
> > 
> 
> Hi Aurelien,
> 
> Is this fixup to the malta patch the sort of thing you had in mind? If
> so I'll generate a v6 patchset with it.
> 
> Cheers
> James
> 
> diff --git a/hw/mips/mips_malta.c b/hw/mips/mips_malta.c
> index 8bc5392b4223..91b0ce566111 100644
> --- a/hw/mips/mips_malta.c
> +++ b/hw/mips/mips_malta.c
> @@ -1052,6 +1052,12 @@ void mips_malta_init(MachineState *machine)
>   bootloader_run_addr, kernel_entry);
>  }
>  } else {
> +/* The flash region isn't executable from a KVM T&E guest */
> +if (kvm_enabled()) {
> +error_report("KVM enabled but no -kernel argument was specified. 
> "
> + "Booting from flash is not supported with KVM 
> T&E.");
> +exit(1);
> +}
>  /* Load firmware from flash. */
>  if (!dinfo) {
>  /* Load a BIOS image. */
> 

This looks fine to me.

-- 
Aurelien Jarno  GPG: 4096R/1DDD8C9B
aurel...@aurel32.net http://www.aurel32.net


Re: [Qemu-devel] [PATCH v5 10/12] hw/mips: malta: Add KVM support

2014-06-20 Thread James Hogan
On 20/06/14 12:19, Aurelien Jarno wrote:
> On Fri, Jun 20, 2014 at 12:38:30PM +0200, Paolo Bonzini wrote:
>> Il 20/06/2014 11:10, Aurelien Jarno ha scritto:
 My idea was to add a machines-specific option umkernel=on, and require it
 in order to run KVM.  Later we can add umkernel=on support for TCG as well,
 while umkernel=off with KVM requires virtualization extensions.

 The same option can disable pflash boot.

 What do you think?
>>>
> >>> From what I understand, the current KVM support in MIPS uses trap and
>>> emulate and thus doesn't need hardware support, just a recent kernel
>>> with the option enabled.
>>
>> Yes, but work to support virtualization extensions is underway.
>> Patches were posted a few months ago.
>>
>>> That's why I do wonder if there is a real point
>>> in supporting UM kernels in TCG mode.
>>
>> Debugging, mainly.  It is sometimes useful to compare TCG with KVM
>> on x86, and I suppose it could be the same on MIPS.
> 
> Ok, then we can indeed add a umkernel option, which is always enabled
> with KVM, and which disables the flash (and possibly other devices) in
> that case.
> 
> At some point it might be a good idea to add a specific machine for
> emulation/virtualization, as is done on ARM, which would not have to
> handle this kind of device and would not have all the current
> limitations of the Malta board.

FYI Cavium have been working on a para-virtualised machine which they
use with their VZ KVM implementation. They're using lkvm, but I expect
it will make sense to port that to QEMU too.

lkvm patchset (applied):
https://www.mail-archive.com/kvm%40vger.kernel.org/msg102792.html

linux kernel patchset (merged in v3.16-rc1):
https://www.mail-archive.com/kvm%40vger.kernel.org/msg102806.html

Cheers
James


[RFC v3] ARM: KVM: add irqfd and irq routing support

2014-06-20 Thread Eric Auger
This patch enables irqfd and irq routing on ARM.

It turns CONFIG_HAVE_KVM_EVENTFD and CONFIG_HAVE_KVM_IRQ_ROUTING on.

The irqfd framework makes it possible to inject a virtual IRQ into a guest
upon an eventfd trigger.

1) user-side first needs to set up a GSI routing table using
KVM_SET_GSI_ROUTING ioctl. A routing entry defines an association
between an IRQ (aka GSI) and an irqchip pin. On ARM there is a single
irqchip, ie. the GIC. On ARM, natural choice is to set gsi = irqchip.pin.

2) user-side uses KVM_IRQFD VM ioctl to provide KVM with a kvm_irqfd struct
that associates a VM, an eventfd, an IRQ number (aka. the GSI). When an actor
signals the eventfd (typically a VFIO platform driver), the irqfd subsystem
injects a virtual IRQ corresponding to the irqchip pin associated with that
GSI. irqchip.pin is looked up in the previously set routing table. On ARM it
is assumed to be an SPI only.

This RFC applies on top of
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-June/009979.html

All pieces can be found on git://git.linaro.org/people/eric.auger/linux.git
branch irqfd_integ_v3

Signed-off-by: Eric Auger 

---

GSI routing mostly is implemented in generic irqchip.c.
The tiny ARM specific part is directly implemented in the virtual interrupt
controller (vgic.c), as is done for powerpc for instance. This option was
preferred over adding more #ifdefs in irq_comm.c (x86 and ia64).
Hence irq_comm.c is not used at all.

MSI routing is not supported yet. Edge sensitive IRQ injection was not tested
but should be OK (KVM_USERSPACE_IRQ_SOURCE_ID path).

This work was tested with Calxeda Midway xgmac main interrupt with
qemu-system-arm and QEMU VFIO platform device.

Known issues:
- the static allocation of chip[][] in irqchip.c forces the number of IRQs
supported by the VGIC to be dimensioned statically.
KVM_IRQCHIP_NUM_PINS still currently is set to VGIC_NR_IRQS, which may become
VGIC_NR_IRQS_LEGACY with the advent of:
http://www.spinics.net/lists/arm-kernel/msg277415.html

- if for some reason the IRQ is never EOI'ed, the notifier is never called.
  If its job typically consists in unmasking the physical IRQ as it is for
  VFIO, the IRQ might stay masked.

v3:
- correct misc style issues
- remove notifier call in clear pending MMIO write, now fixed by
  Christoffer VGIC clear pending correction:
  https://lists.cs.columbia.edu/pipermail/kvmarm/2014-June/009979.html
- remove allocation of identity routing table. It is assumed to be
  user-side's job to set it.
- vgic_set_assigned_irq now handles both levels in a symmetrical way.
  dist lock issue fixed by defining finer lock regions in
  kvm_vgic_sync_hwstate()
- IRQFD implementation better documented in kvm/api.txt
- KVM_IRQCHIP_NUM_PINS set to VGIC_NR_IRQS_LEGACY as temporary solution
- check ue->u.irqchip.irqchip

v2:
2 fixes:
- v1 assumed gsi/irqchip.pin was already incremented by VGIC_NR_PRIVATE_IRQS.
  This is now vgic_set_assigned_irq that increments it before injection.
- v2 now handles the case where a pending assigned irq is cleared through
  MMIO access. The irq is properly acked allowing the resamplefd handler
  to possibly unmask the physical IRQ.
---
 Documentation/virtual/kvm/api.txt |  12 -
 arch/arm/include/uapi/asm/kvm.h   |   9 
 arch/arm/kvm/Kconfig  |   2 +
 arch/arm/kvm/Makefile |   1 +
 arch/arm/kvm/irq.h|  25 +
 virt/kvm/arm/vgic.c   | 108 --
 6 files changed, 151 insertions(+), 6 deletions(-)
 create mode 100644 arch/arm/kvm/irq.h

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index b4f5365..326e382 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1339,7 +1339,7 @@ KVM_ASSIGN_DEV_IRQ. Partial deassignment of host or guest 
IRQ is allowed.
 4.52 KVM_SET_GSI_ROUTING
 
 Capability: KVM_CAP_IRQ_ROUTING
-Architectures: x86 ia64 s390
+Architectures: x86 ia64 s390 arm
 Type: vm ioctl
 Parameters: struct kvm_irq_routing (in)
 Returns: 0 on success, -1 on error
@@ -2126,7 +2126,7 @@ into the hash PTE second double word).
 4.75 KVM_IRQFD
 
 Capability: KVM_CAP_IRQFD
-Architectures: x86 s390
+Architectures: x86 s390 arm
 Type: vm ioctl
 Parameters: struct kvm_irqfd (in)
 Returns: 0 on success, -1 on error
@@ -2152,6 +2152,14 @@ Note that closing the resamplefd is not sufficient to 
disable the
 irqfd.  The KVM_IRQFD_FLAG_RESAMPLE is only necessary on assignment
 and need not be specified with KVM_IRQFD_FLAG_DEASSIGN.
 
+On ARM/arm64 the virtual IRQ injection mandates the existence of a
+routing table, set through KVM_SET_GSI_ROUTING. The latter contains
+entries which associate a gsi with an irqchip pin. The injected virtual
+IRQ actually corresponds to the irqchip.pin. It is up to the user to
+define gsi = irqchip.pin or not. On ARM the single irqchip is the GIC.
+Then irqchip.pin is interpreted as the system shared peripheral
+interrupt number (SPI). Associated GIC interrupt ID is ir

[PATCH 0/2] ivshmem: update documentation, add client/server tools

2014-06-20 Thread David Marchand
Hello, 

(as suggested by Paolo, ccing Claudio and kvm mailing list)

Here is a patchset containing an update on ivshmem specs documentation and
importing ivshmem server and client tools.
These tools have been written from scratch and are not related to what is
available in the nahanni repository.
I put them in the contrib/ directory, as qemu-doc.texi was already stating the
server was supposed to be there.

-- 
David Marchand

David Marchand (2):
  docs: update ivshmem device spec
  contrib: add ivshmem client and server

 contrib/ivshmem-client/Makefile |   26 ++
 contrib/ivshmem-client/ivshmem-client.c |  418 ++
 contrib/ivshmem-client/ivshmem-client.h |  238 ++
 contrib/ivshmem-client/main.c   |  246 ++
 contrib/ivshmem-server/Makefile |   26 ++
 contrib/ivshmem-server/ivshmem-server.c |  420 +++
 contrib/ivshmem-server/ivshmem-server.h |  185 ++
 contrib/ivshmem-server/main.c   |  296 ++
 docs/specs/ivshmem_device_spec.txt  |   41 ++-
 qemu-doc.texi   |   10 +-
 10 files changed, 1897 insertions(+), 9 deletions(-)
 create mode 100644 contrib/ivshmem-client/Makefile
 create mode 100644 contrib/ivshmem-client/ivshmem-client.c
 create mode 100644 contrib/ivshmem-client/ivshmem-client.h
 create mode 100644 contrib/ivshmem-client/main.c
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/ivshmem-server.c
 create mode 100644 contrib/ivshmem-server/ivshmem-server.h
 create mode 100644 contrib/ivshmem-server/main.c

-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] contrib: add ivshmem client and server

2014-06-20 Thread David Marchand
When using ivshmem devices, notifications between guests can be sent as
interrupts using an ivshmem-server (typical use described in the documentation).
The client is provided as a debug tool.

Signed-off-by: David Marchand 
---
 contrib/ivshmem-client/Makefile |   26 ++
 contrib/ivshmem-client/ivshmem-client.c |  418 ++
 contrib/ivshmem-client/ivshmem-client.h |  238 ++
 contrib/ivshmem-client/main.c   |  246 ++
 contrib/ivshmem-server/Makefile |   26 ++
 contrib/ivshmem-server/ivshmem-server.c |  420 +++
 contrib/ivshmem-server/ivshmem-server.h |  185 ++
 contrib/ivshmem-server/main.c   |  296 ++
 qemu-doc.texi   |   10 +-
 9 files changed, 1862 insertions(+), 3 deletions(-)
 create mode 100644 contrib/ivshmem-client/Makefile
 create mode 100644 contrib/ivshmem-client/ivshmem-client.c
 create mode 100644 contrib/ivshmem-client/ivshmem-client.h
 create mode 100644 contrib/ivshmem-client/main.c
 create mode 100644 contrib/ivshmem-server/Makefile
 create mode 100644 contrib/ivshmem-server/ivshmem-server.c
 create mode 100644 contrib/ivshmem-server/ivshmem-server.h
 create mode 100644 contrib/ivshmem-server/main.c

diff --git a/contrib/ivshmem-client/Makefile b/contrib/ivshmem-client/Makefile
new file mode 100644
index 000..9e32409
--- /dev/null
+++ b/contrib/ivshmem-client/Makefile
@@ -0,0 +1,26 @@
+# Copyright 2014 6WIND S.A.
+# All rights reserved
+
+S ?= $(CURDIR)
+O ?= $(CURDIR)
+
+CFLAGS += -Wall -Wextra -Werror -g
+LDFLAGS +=
+LDLIBS += -lrt
+
+VPATH = $(S)
+PROG = ivshmem-client
+OBJS := $(O)/ivshmem-client.o
+OBJS += $(O)/main.o
+
+$(O)/%.o: %.c
+   $(CC) $(CFLAGS) -o $@ -c $<
+
+$(O)/$(PROG): $(OBJS)
+   $(CC) $(LDFLAGS) -o $@ $^ $(LDLIBS)
+
+.PHONY: all
+all: $(O)/$(PROG)
+
+clean:
+   rm -f $(OBJS) $(O)/$(PROG)
diff --git a/contrib/ivshmem-client/ivshmem-client.c b/contrib/ivshmem-client/ivshmem-client.c
new file mode 100644
index 000..32ef3ef
--- /dev/null
+++ b/contrib/ivshmem-client/ivshmem-client.c
@@ -0,0 +1,418 @@
+/*
+ * Copyright(c) 2014 6WIND S.A.
+ * All rights reserved.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "ivshmem-client.h"
+
+/* log a message on stdout if verbose=1 */
+#define debug_log(client, fmt, ...) do { \
+if ((client)->verbose) { \
+printf(fmt, ## __VA_ARGS__); \
+}\
+} while (0)
+
+/* read message from the unix socket */
+static int
+read_one_msg(struct ivshmem_client *client, long *index, int *fd)
+{
+int ret;
+struct msghdr msg;
+struct iovec iov[1];
+union {
+struct cmsghdr cmsg;
+char control[CMSG_SPACE(sizeof(int))];
+} msg_control;
+struct cmsghdr *cmsg;
+
+iov[0].iov_base = index;
+iov[0].iov_len = sizeof(*index);
+
+memset(&msg, 0, sizeof(msg));
+msg.msg_iov = iov;
+msg.msg_iovlen = 1;
+msg.msg_control = &msg_control;
+msg.msg_controllen = sizeof(msg_control);
+
+ret = recvmsg(client->sock_fd, &msg, 0);
+if (ret < 0) {
+debug_log(client, "cannot read message: %s\n", strerror(errno));
+return -1;
+}
+if (ret == 0) {
+debug_log(client, "lost connection to server\n");
+return -1;
+}
+
+*fd = -1;
+
+for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
+
+if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)) ||
+cmsg->cmsg_level != SOL_SOCKET ||
+cmsg->cmsg_type != SCM_RIGHTS) {
+continue;
+}
+
+memcpy(fd, CMSG_DATA(cmsg), sizeof(*fd));
+}
+
+return 0;
+}
+
+/* free a peer when the server advertise a disconnection or when the
+ * client is freed */
+static void
+free_peer(struct ivshmem_client *client, struct ivshmem_client_peer *peer)
+{
+unsigned vector;
+
+TAILQ_REMOVE(&client->peer_list, peer, next);
+for (vector = 0; vector < peer->vectors_count; vector++) {
+close(peer->vectors[vector]);
+}
+
+free(peer);
+}
+
+/* handle message coming from server (new peer, new vectors) */
+static int
+handle_server_msg(struct ivshmem_client *client)
+{
+struct ivshmem_client_peer *peer;
+long peer_id;
+int ret, fd;
+
+ret = read_one_msg(client, &peer_id, &fd);
+if (ret < 0) {
+return -1;
+}
+
+/* can return a peer or the local client */
+peer = ivshmem_client_search_peer(client, peer_id);
+
+/* delete peer */
+if (fd == -1) {
+
+if (peer == NULL || peer == &client->local) {
+debug_log(client, "receive delete for invalid peer %ld", peer_id);
+return -1;
+}
+
+de

[PATCH 1/2] docs: update ivshmem device spec

2014-06-20 Thread David Marchand
Add some notes on the parts needed to use ivshmem devices: more specifically,
explain the purpose of an ivshmem server and the basic concepts for using
ivshmem devices in guests.

Signed-off-by: David Marchand 
---
 docs/specs/ivshmem_device_spec.txt |   41 ++--
 1 file changed, 35 insertions(+), 6 deletions(-)

diff --git a/docs/specs/ivshmem_device_spec.txt b/docs/specs/ivshmem_device_spec.txt
index 667a862..7d2b73f 100644
--- a/docs/specs/ivshmem_device_spec.txt
+++ b/docs/specs/ivshmem_device_spec.txt
@@ -85,12 +85,41 @@ events have occurred.  The semantics of interrupt vectors are left to the
 user's discretion.
 
 
+IVSHMEM host services
+-
+
+This part is optional (see *Usage in the Guest* section below).
+
+To handle notifications between users of an ivshmem device, an ivshmem server
+has been added. This server is responsible for creating the shared memory and
+creating a set of eventfds for each user of the shared memory. It behaves as a
+proxy between the different ivshmem clients (QEMU): giving the shared memory fd
+to each client, allocating eventfds to new clients and broadcasting to all
+clients when a client disappears.
+
+Apart from the current ivshmem implementation in QEMU, an ivshmem client can be
+written for debug or development purposes, or to implement notifications
+between host and guests.
+
+
 Usage in the Guest
 --
 
-The shared memory device is intended to be used with the provided UIO driver.
-Very little configuration is needed.  The guest should map BAR0 to access the
-registers (an array of 32-bit ints allows simple writing) and map BAR2 to
-access the shared memory region itself.  The size of the shared memory region
-is specified when the guest (or shared memory server) is started.  A guest may
-map the whole shared memory region or only part of it.
+The guest should map BAR0 to access the registers (an array of 32-bit ints
+allows simple writing) and map BAR2 to access the shared memory region itself.
+The size of the shared memory region is specified when the guest (or shared
+memory server) is started.  A guest may map the whole shared memory region or
+only part of it.
+
+ivshmem provides an optional notification mechanism through eventfds handled by
+QEMU that will trigger interrupts in guests. This mechanism is enabled when
+using an ivshmem-server which must be started prior to VMs and which serves as a
+proxy for exchanging eventfds.
+
+It is your choice how to use the ivshmem device.
+- the simple way: you don't need anything other than what is already in QEMU.
+  You can map the shared memory in the guest, then use it in userland as you
+  see fit (memnic for example works this way http://dpdk.org/browse/memnic),
+- the more advanced way: if you want an event mechanism between the VMs using
+  your ivshmem device, you will most likely want to write a kernel driver that
+  will handle interrupts.
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-20 Thread Marcelo Tosatti
On Fri, Jun 20, 2014 at 02:15:10PM +0300, Gleb Natapov wrote:
> On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> > On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > > CCing Marcelo,
> > > 
> > > On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> > > > Hi Gleb,
> > > > 
> > > > Thanks for the quick reply. Please see below.
> > > > 
> > > > On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> > > > >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> > > > >>[Questions]
> > > > >>And by the way, would you guys please answer the following questions 
> > > > >>for me ?
> > > > >>
> > > > >>1. What's the ept identity pagetable for ?  Only one page is enough ?
> > > > >>
> > > > >>2. Is the ept identity pagetable only used in realmode ?
> > > > >>Can we free it once the guest is up (vcpu in protect mode)?
> > > > >>
> > > > >>3. Now, ept identity pagetable is allocated in qemu userspace.
> > > > >>Can we allocate it in kernel space ?
> > > > >What would be the benefit?
> > > > 
> > > > I think the benefit is we can hot-remove the host memory a kvm guest
> > > > is using.
> > > > 
> > > > For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. And 
> > > > the
> > > > kernel
> > > > will never use ZONE_MOVABLE memory. So if we can allocate these two 
> > > > pages in
> > > > kernel space, we can pin them without any trouble. When doing memory
> > > > hot-remove,
> > > > the kernel will not try to migrate these two pages.
> > > But we can do that by other means, no? The patch you've sent for instance.
> > > 
> > > > 
> > > > >
> > > > >>
> > > > >>4. If I want to migrate these two pages, what do you think is the 
> > > > >>best way ?
> > > > >>
> > > > >I answered most of those here: 
> > > > >http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> > > > 
> > > > I'm sorry I must missed this email.
> > > > 
> > > > Seeing your advice, we can unpin these two pages and repin them in the 
> > > > next
> > > > EPT violation.
> > > > So about this problem, which solution would you prefer, allocate these 
> > > > two
> > > > pages in kernel
> > > > space, or migrate them before memory hot-remove ?
> > > > 
> > > > I think the first solution is simpler. But I'm not quite sure if there 
> > > > is
> > > > any other pages
> > > > pinned in memory. If we have the same problem with other kvm pages, I 
> > > > think
> > > > it is better to
> > > > solve it in the second way.
> > > > 
> > > > What do you think ?
> > > Remove pinning is preferable. In fact looks like for identity pagetable
> > > it should be trivial, just don't pin. APIC access page is a little bit
> > > more complicated since its physical address needs to be tracked to be
> > > updated in VMCS.
> > 
> > Yes, and there are new users of page pinning as well soon (see PEBS
> > threads on kvm-devel).
> > 
> > Was thinking of notifiers scheme. Perhaps:
> > 
> > ->begin_page_unpin(struct page *page)
> > - Remove any possible access to page.
> > 
> > ->end_page_unpin(struct page *page)
> > - Reinstantiate any possible access to page.
> > 
> > For KVM:
> > 
> > ->begin_page_unpin()
> > - Remove APIC-access page address from VMCS.
> >   or
> > - Remove spte translation to pinned page.
> > 
> > - Put vcpu in state where no VM-entries are allowed.
> > 
> > ->end_page_unpin()
> > - Setup APIC-access page, ...
> > - Allow vcpu to VM-entry.
> > 
> I believe that to handle identity page and APIC access page we do not
> need any of those. 
> We can use mmu notifiers to track when page begins
> to be moved and we can find new page location on EPT violation.

Does page migration hook via mmu notifiers? I don't think so. 

It won't even attempt page migration because the page count is
increased (would have to confirm though). Tang?

The problem with identity page is this: its location is written into the
guest CR3. So you cannot allow it (the page which the guest CR3 points
to) to be reused before you remove the reference.

Where is the guarantee there will be an EPT violation, allowing a vcpu
to execute with guest CR3 pointing to page with random data?

Same with the APIC access page.

> > Because allocating APIC access page from distant NUMA node can
> > be a performance problem, i believe.
> I do not think this is the case. APIC access page is never written to,
> and in fact SDM advice to share it between all vcpus.

Right. 

But the point is not so much relevant as this should be handled for
PEBS pages which would be interesting to force to non-movable zones.



[PATCH v3 6/9] arm64: KVM: check ordering of all system register tables

2014-06-20 Thread Marc Zyngier
We now have multiple tables for the various system registers
we trap. Make sure we check the order of all of them, as it is
critical that we get the order right (been there, done that...).

Reviewed-by: Anup Patel 
Reviewed-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/sys_regs.c | 22 --
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 1fb1bff..9147b0c 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1299,14 +1299,32 @@ int kvm_arm_copy_sys_reg_indices(struct kvm_vcpu *vcpu, u64 __user *uindices)
return write_demux_regids(uindices);
 }
 
+static int check_sysreg_table(const struct sys_reg_desc *table, unsigned int n)
+{
+   unsigned int i;
+
+   for (i = 1; i < n; i++) {
+   if (cmp_sys_reg(&table[i-1], &table[i]) >= 0) {
+   kvm_err("sys_reg table %p out of order (%d)\n", table, i - 1);
+   return 1;
+   }
+   }
+
+   return 0;
+}
+
 void kvm_sys_reg_table_init(void)
 {
unsigned int i;
struct sys_reg_desc clidr;
 
/* Make sure tables are unique and in order. */
-   for (i = 1; i < ARRAY_SIZE(sys_reg_descs); i++)
-   BUG_ON(cmp_sys_reg(&sys_reg_descs[i-1], &sys_reg_descs[i]) >= 0);
+   BUG_ON(check_sysreg_table(sys_reg_descs, ARRAY_SIZE(sys_reg_descs)));
+   BUG_ON(check_sysreg_table(cp14_regs, ARRAY_SIZE(cp14_regs)));
+   BUG_ON(check_sysreg_table(cp14_64_regs, ARRAY_SIZE(cp14_64_regs)));
+   BUG_ON(check_sysreg_table(cp15_regs, ARRAY_SIZE(cp15_regs)));
+   BUG_ON(check_sysreg_table(cp15_64_regs, ARRAY_SIZE(cp15_64_regs)));
+   BUG_ON(check_sysreg_table(invariant_sys_regs, ARRAY_SIZE(invariant_sys_regs)));
 
/* We abuse the reset function to overwrite the table itself. */
for (i = 0; i < ARRAY_SIZE(invariant_sys_regs); i++)
-- 
1.8.3.4



[PATCH v3 2/9] arm64: move DBG_MDSCR_* to asm/debug-monitors.h

2014-06-20 Thread Marc Zyngier
In order to be able to use the DBG_MDSCR_* macros from the KVM code,
move the relevant definitions to the obvious include file.

Also move the debug_el enum to a portion of the file that is guarded
by #ifndef __ASSEMBLY__ in order to use that file from assembly code.

Acked-by: Will Deacon 
Reviewed-by: Anup Patel 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/include/asm/debug-monitors.h | 19 ++-
 arch/arm64/kernel/debug-monitors.c  |  9 -
 2 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/arm64/include/asm/debug-monitors.h b/arch/arm64/include/asm/debug-monitors.h
index 6e9b5b3..7fb3437 100644
--- a/arch/arm64/include/asm/debug-monitors.h
+++ b/arch/arm64/include/asm/debug-monitors.h
@@ -18,6 +18,15 @@
 
 #ifdef __KERNEL__
 
+/* Low-level stepping controls. */
+#define DBG_MDSCR_SS   (1 << 0)
+#define DBG_SPSR_SS(1 << 21)
+
+/* MDSCR_EL1 enabling bits */
+#define DBG_MDSCR_KDE  (1 << 13)
+#define DBG_MDSCR_MDE  (1 << 15)
+#define DBG_MDSCR_MASK ~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)
+
 #defineDBG_ESR_EVT(x)  (((x) >> 27) & 0x7)
 
 /* AArch64 */
@@ -73,11 +82,6 @@
 
 #define CACHE_FLUSH_IS_SAFE1
 
-enum debug_el {
-   DBG_ACTIVE_EL0 = 0,
-   DBG_ACTIVE_EL1,
-};
-
 /* AArch32 */
 #define DBG_ESR_EVT_BKPT   0x4
 #define DBG_ESR_EVT_VECC   0x5
@@ -115,6 +119,11 @@ void unregister_break_hook(struct break_hook *hook);
 
 u8 debug_monitors_arch(void);
 
+enum debug_el {
+   DBG_ACTIVE_EL0 = 0,
+   DBG_ACTIVE_EL1,
+};
+
 void enable_debug_monitors(enum debug_el el);
 void disable_debug_monitors(enum debug_el el);
 
diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index a7fb874..e022f87 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -30,15 +30,6 @@
 #include 
 #include 
 
-/* Low-level stepping controls. */
-#define DBG_MDSCR_SS   (1 << 0)
-#define DBG_SPSR_SS(1 << 21)
-
-/* MDSCR_EL1 enabling bits */
-#define DBG_MDSCR_KDE  (1 << 13)
-#define DBG_MDSCR_MDE  (1 << 15)
-#define DBG_MDSCR_MASK ~(DBG_MDSCR_KDE | DBG_MDSCR_MDE)
-
 /* Determine debug architecture. */
 u8 debug_monitors_arch(void)
 {
-- 
1.8.3.4



[PATCH v3 0/9] arm64: KVM: debug infrastructure support

2014-06-20 Thread Marc Zyngier
This patch series adds debug support, a key feature missing from the
KVM/arm64 port.

The main idea is to keep track of whether the debug registers are
"dirty" (changed by the guest) or not. In this case, perform the usual
save/restore dance, for one run only. It means we only have a penalty
if a guest is actively using the debug registers.

The amount of registers is properly frightening, but CPUs actually
only implement a subset of them. Also, there are a number of registers
we don't bother emulating (things having to do with external debug and
OSlock).

External debug is when you actually plug a physical JTAG into the CPU.
OSlock is a way to prevent "other software" from playing with the debug
registers. My understanding is that it is only useful in combination
with external debug. In both cases, implementing support for this
is probably not worth the effort, at least for the time being.

This has been tested on a Cortex-A53/A57 platform, running both 32 and
64bit guests, on top of 3.16-rc1. This code also lives in my tree in
the kvm-arm64/debug-trap branch.

From v2 [2]:
- Fixed a number of very stupid bugs in the macros generating the trap
  entries
- Added some documentation explaining why we don't bother emulating
  external debug and the OSlock stuff
- Other bits of documentation here and there

From v1 [1]:
- Renamed trap_wi_raz to trap_raz_wi
- Renamed skip_clean_debug_state to skip_debug_state
- Simplified debug state computing, moved to its own macro
- Added some comment to make the logic more obvious

[1]: https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009332.html
[2]: https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009534.html

Marc Zyngier (9):
  arm64: KVM: rename pm_fake handler to trap_raz_wi
  arm64: move DBG_MDSCR_* to asm/debug-monitors.h
  arm64: KVM: add trap handlers for AArch64 debug registers
  arm64: KVM: common infrastructure for handling AArch32 CP14/CP15
  arm64: KVM: use separate tables for AArch32 32 and 64bit traps
  arm64: KVM: check ordering of all system register tables
  arm64: KVM: add trap handlers for AArch32 debug registers
  arm64: KVM: implement lazy world switch for debug registers
  arm64: KVM: enable trapping of all debug registers

 arch/arm64/include/asm/debug-monitors.h |  19 +-
 arch/arm64/include/asm/kvm_asm.h|  39 ++-
 arch/arm64/include/asm/kvm_coproc.h |   3 +-
 arch/arm64/include/asm/kvm_host.h   |  12 +-
 arch/arm64/kernel/asm-offsets.c |   1 +
 arch/arm64/kernel/debug-monitors.c  |   9 -
 arch/arm64/kvm/handle_exit.c|   4 +-
 arch/arm64/kvm/hyp.S| 470 -
 arch/arm64/kvm/sys_regs.c   | 520 
 9 files changed, 978 insertions(+), 99 deletions(-)

-- 
1.8.3.4



[PATCH v3 1/9] arm64: KVM: rename pm_fake handler to trap_raz_wi

2014-06-20 Thread Marc Zyngier
pm_fake doesn't quite describe what the handler does (ignoring writes
and returning 0 for reads).

As we're about to use it (a lot) in a different context, rename it
with an (admittedly cryptic) name that makes sense for all users.

Reviewed-by: Anup Patel 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/sys_regs.c | 83 ---
 1 file changed, 43 insertions(+), 40 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index c59a1bd..4abd84e 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -163,18 +163,9 @@ static bool access_sctlr(struct kvm_vcpu *vcpu,
return true;
 }
 
-/*
- * We could trap ID_DFR0 and tell the guest we don't support performance
- * monitoring.  Unfortunately the patch to make the kernel check ID_DFR0 was
- * NAKed, so it will read the PMCR anyway.
- *
- * Therefore we tell the guest we have 0 counters.  Unfortunately, we
- * must always support PMCCNTR (the cycle counter): we just RAZ/WI for
- * all PM registers, which doesn't crash the guest kernel at least.
- */
-static bool pm_fake(struct kvm_vcpu *vcpu,
-   const struct sys_reg_params *p,
-   const struct sys_reg_desc *r)
+static bool trap_raz_wi(struct kvm_vcpu *vcpu,
+   const struct sys_reg_params *p,
+   const struct sys_reg_desc *r)
 {
if (p->is_write)
return ignore_write(vcpu, p);
@@ -201,6 +192,17 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
 /*
  * Architected system registers.
  * Important: Must be sorted ascending by Op0, Op1, CRn, CRm, Op2
+ *
+ * We could trap ID_DFR0 and tell the guest we don't support performance
+ * monitoring.  Unfortunately the patch to make the kernel check ID_DFR0 was
+ * NAKed, so it will read the PMCR anyway.
+ *
+ * Therefore we tell the guest we have 0 counters.  Unfortunately, we
+ * must always support PMCCNTR (the cycle counter): we just RAZ/WI for
+ * all PM registers, which doesn't crash the guest kernel at least.
+ *
+ * Same goes for the whole debug infrastructure, which probably breaks
+ * some guest functionality. This should be fixed.
  */
 static const struct sys_reg_desc sys_reg_descs[] = {
/* DC ISW */
@@ -260,10 +262,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 
/* PMINTENSET_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b001),
- pm_fake },
+ trap_raz_wi },
/* PMINTENCLR_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b010),
- pm_fake },
+ trap_raz_wi },
 
/* MAIR_EL1 */
{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0010), Op2(0b000),
@@ -292,43 +294,43 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 
/* PMCR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b000),
- pm_fake },
+ trap_raz_wi },
/* PMCNTENSET_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b001),
- pm_fake },
+ trap_raz_wi },
/* PMCNTENCLR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b010),
- pm_fake },
+ trap_raz_wi },
/* PMOVSCLR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b011),
- pm_fake },
+ trap_raz_wi },
/* PMSWINC_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b100),
- pm_fake },
+ trap_raz_wi },
/* PMSELR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b101),
- pm_fake },
+ trap_raz_wi },
/* PMCEID0_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b110),
- pm_fake },
+ trap_raz_wi },
/* PMCEID1_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1100), Op2(0b111),
- pm_fake },
+ trap_raz_wi },
/* PMCCNTR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
- pm_fake },
+ trap_raz_wi },
/* PMXEVTYPER_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
- pm_fake },
+ trap_raz_wi },
/* PMXEVCNTR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b010),
- pm_fake },
+ trap_raz_wi },
/* PMUSERENR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b000),
- pm_fake },
+ trap_raz_wi },
/* PMOVSSET_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1110), Op2(0b011),
- pm_fake },
+ trap_raz_wi },
 
/* TPIDR_EL0 */
{ Op0(0b11), Op1(0b011), CRn(0b1101), CRm(0b), Op2(0b010),
@@ -374,19 +376,20 @@ static const struct sys_reg_desc cp15_regs[] = {
{ Op1( 0), CRn( 7), CRm(10), Op2( 2), access_dcsw },
{ Op1( 0), CRn( 7), CRm(14), Op2( 2), access_d

[PATCH v3 7/9] arm64: KVM: add trap handlers for AArch32 debug registers

2014-06-20 Thread Marc Zyngier
Add handlers for all the AArch32 debug registers that are accessible
from EL0 or EL1. The code follow the same strategy as the AArch64
counterpart with regards to tracking the dirty state of the debug
registers.

Reviewed-by: Anup Patel 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/include/asm/kvm_asm.h |   9 +++
 arch/arm64/kvm/sys_regs.c| 144 ++-
 2 files changed, 151 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 12f9dd7..993a7db 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -93,6 +93,15 @@
 #define c10_AMAIR0 (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */
 #define c10_AMAIR1 (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */
 #define c14_CNTKCTL(CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */
+
+#define cp14_DBGDSCRext(MDSCR_EL1 * 2)
+#define cp14_DBGBCR0   (DBGBCR0_EL1 * 2)
+#define cp14_DBGBVR0   (DBGBVR0_EL1 * 2)
+#define cp14_DBGBXVR0  (cp14_DBGBVR0 + 1)
+#define cp14_DBGWCR0   (DBGWCR0_EL1 * 2)
+#define cp14_DBGWVR0   (DBGWVR0_EL1 * 2)
+#define cp14_DBGDCCINT (MDCCINT_EL1 * 2)
+
 #define NR_COPRO_REGS  (NR_SYS_REGS * 2)
 
 #define ARM_EXCEPTION_IRQ0
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 9147b0c..daa635e 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -483,12 +483,153 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  NULL, reset_val, FPEXC32_EL2, 0x70 },
 };
 
-/* Trapped cp14 registers */
+static bool trap_dbgidr(struct kvm_vcpu *vcpu,
+   const struct sys_reg_params *p,
+   const struct sys_reg_desc *r)
+{
+   if (p->is_write) {
+   return ignore_write(vcpu, p);
+   } else {
+   u64 dfr = read_cpuid(ID_AA64DFR0_EL1);
+   u64 pfr = read_cpuid(ID_AA64PFR0_EL1);
+   u32 el3 = !!((pfr >> 12) & 0xf);
+
+   *vcpu_reg(vcpu, p->Rt) = ((((dfr >> 20) & 0xf) << 28) |
+ (((dfr >> 12) & 0xf) << 24) |
+ (((dfr >> 28) & 0xf) << 20) |
+ (6 << 16) | (el3 << 14) | (el3 << 12));
+   return true;
+   }
+}
+
+static bool trap_debug32(struct kvm_vcpu *vcpu,
+const struct sys_reg_params *p,
+const struct sys_reg_desc *r)
+{
+   if (p->is_write) {
+   vcpu_cp14(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
+   vcpu->arch.debug_flags |= KVM_ARM64_DEBUG_DIRTY;
+   } else {
+   *vcpu_reg(vcpu, p->Rt) = vcpu_cp14(vcpu, r->reg);
+   }
+
+   return true;
+}
+
+#define DBG_BCR_BVR_WCR_WVR(n) \
+   /* DBGBVRn */   \
+   { Op1( 0), CRn( 0), CRm((n)), Op2( 4), trap_debug32,\
+ NULL, (cp14_DBGBVR0 + (n) * 2) }, \
+   /* DBGBCRn */   \
+   { Op1( 0), CRn( 0), CRm((n)), Op2( 5), trap_debug32,\
+ NULL, (cp14_DBGBCR0 + (n) * 2) }, \
+   /* DBGWVRn */   \
+   { Op1( 0), CRn( 0), CRm((n)), Op2( 6), trap_debug32,\
+ NULL, (cp14_DBGWVR0 + (n) * 2) }, \
+   /* DBGWCRn */   \
+   { Op1( 0), CRn( 0), CRm((n)), Op2( 7), trap_debug32,\
+ NULL, (cp14_DBGWCR0 + (n) * 2) }
+
+#define DBGBXVR(n) \
+   { Op1( 0), CRn( 1), CRm((n)), Op2( 1), trap_debug32,\
+ NULL, cp14_DBGBXVR0 + n * 2 }
+
+/*
+ * Trapped cp14 registers. We generally ignore most of the external
+ * debug, on the principle that they don't really make sense to a
+ * guest. Revisit this one day, should this principle change.
+ */
 static const struct sys_reg_desc cp14_regs[] = {
+   /* DBGIDR */
+   { Op1( 0), CRn( 0), CRm( 0), Op2( 0), trap_dbgidr },
+   /* DBGDTRRXext */
+   { Op1( 0), CRn( 0), CRm( 0), Op2( 2), trap_raz_wi },
+
+   DBG_BCR_BVR_WCR_WVR(0),
+   /* DBGDSCRint */
+   { Op1( 0), CRn( 0), CRm( 1), Op2( 0), trap_raz_wi },
+   DBG_BCR_BVR_WCR_WVR(1),
+   /* DBGDCCINT */
+   { Op1( 0), CRn( 0), CRm( 2), Op2( 0), trap_debug32 },
+   /* DBGDSCRext */
+   { Op1( 0), CRn( 0), CRm( 2), Op2( 2), trap_debug32 },
+   DBG_BCR_BVR_WCR_WVR(2),
+   /* DBGDTR[RT]Xint */
+   { Op1( 0), CRn( 0), CRm( 3), Op2( 0), trap_raz_wi },
+   /* DBGDTR[RT]Xext */
+   { Op1( 0), CRn( 0), CRm( 3), Op2( 2), trap_raz_wi },
+   DBG_BCR_BVR_WCR_WVR(3),
+   DBG_BCR_BVR_WCR_WVR(4),
+   DBG_BCR_BVR_WCR_WVR(5),
+   /* DBGWFAR */
+   { Op1( 0), CRn( 0), CRm( 6), Op2( 0), trap_raz_wi },
+   /* DBGOSECCR */
+   { Op1( 0), CRn( 0), 

[PATCH v3 8/9] arm64: KVM: implement lazy world switch for debug registers

2014-06-20 Thread Marc Zyngier
Implement switching of the debug registers. While the number
of registers is massive, CPUs usually don't implement them all
(A57 has 6 breakpoints and 4 watchpoints, which gives us a total
of 22 registers "only").

Also, we only save/restore them when MDSCR_EL1 has debug enabled,
or when we've flagged the debug registers as dirty. It means that
most of the time, we only save/restore MDSCR_EL1.

Reviewed-by: Anup Patel 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kernel/asm-offsets.c |   1 +
 arch/arm64/kvm/hyp.S| 462 +++-
 2 files changed, 457 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index 646f888..ae73a83 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -120,6 +120,7 @@ int main(void)
   DEFINE(VCPU_ESR_EL2, offsetof(struct kvm_vcpu, arch.fault.esr_el2));
   DEFINE(VCPU_FAR_EL2, offsetof(struct kvm_vcpu, arch.fault.far_el2));
   DEFINE(VCPU_HPFAR_EL2,   offsetof(struct kvm_vcpu, arch.fault.hpfar_el2));
+  DEFINE(VCPU_DEBUG_FLAGS, offsetof(struct kvm_vcpu, arch.debug_flags));
   DEFINE(VCPU_HCR_EL2, offsetof(struct kvm_vcpu, arch.hcr_el2));
   DEFINE(VCPU_IRQ_LINES,   offsetof(struct kvm_vcpu, arch.irq_lines));
   DEFINE(VCPU_HOST_CONTEXT,offsetof(struct kvm_vcpu, 
arch.host_cpu_context));
diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index b0d1512..727087c 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -215,6 +216,7 @@ __kvm_hyp_code_start:
mrs x22,amair_el1
mrs x23,cntkctl_el1
mrs x24,par_el1
+   mrs x25,mdscr_el1
 
stp x4, x5, [x3]
stp x6, x7, [x3, #16]
@@ -226,7 +228,202 @@ __kvm_hyp_code_start:
stp x18, x19, [x3, #112]
stp x20, x21, [x3, #128]
stp x22, x23, [x3, #144]
-   str x24, [x3, #160]
+   stp x24, x25, [x3, #160]
+.endm
+
+.macro save_debug
+   // x2: base address for cpu context
+   // x3: tmp register
+
+   mrs x26, id_aa64dfr0_el1
+   ubfx x24, x26, #12, #4  // Extract BRPs
+   ubfx x25, x26, #20, #4  // Extract WRPs
+   mov w26, #15
+   sub w24, w26, w24   // How many BPs to skip
+   sub w25, w26, w25   // How many WPs to skip
+
+   add x3, x2, #CPU_SYSREG_OFFSET(DBGBCR0_EL1)
+
+   adr x26, 1f
+   add x26, x26, x24, lsl #2
+   br  x26
+1:
+   mrs x20, dbgbcr15_el1
+   mrs x19, dbgbcr14_el1
+   mrs x18, dbgbcr13_el1
+   mrs x17, dbgbcr12_el1
+   mrs x16, dbgbcr11_el1
+   mrs x15, dbgbcr10_el1
+   mrs x14, dbgbcr9_el1
+   mrs x13, dbgbcr8_el1
+   mrs x12, dbgbcr7_el1
+   mrs x11, dbgbcr6_el1
+   mrs x10, dbgbcr5_el1
+   mrs x9, dbgbcr4_el1
+   mrs x8, dbgbcr3_el1
+   mrs x7, dbgbcr2_el1
+   mrs x6, dbgbcr1_el1
+   mrs x5, dbgbcr0_el1
+
+   adr x26, 1f
+   add x26, x26, x24, lsl #2
+   br  x26
+
+1:
+   str x20, [x3, #(15 * 8)]
+   str x19, [x3, #(14 * 8)]
+   str x18, [x3, #(13 * 8)]
+   str x17, [x3, #(12 * 8)]
+   str x16, [x3, #(11 * 8)]
+   str x15, [x3, #(10 * 8)]
+   str x14, [x3, #(9 * 8)]
+   str x13, [x3, #(8 * 8)]
+   str x12, [x3, #(7 * 8)]
+   str x11, [x3, #(6 * 8)]
+   str x10, [x3, #(5 * 8)]
+   str x9, [x3, #(4 * 8)]
+   str x8, [x3, #(3 * 8)]
+   str x7, [x3, #(2 * 8)]
+   str x6, [x3, #(1 * 8)]
+   str x5, [x3, #(0 * 8)]
+
+   add x3, x2, #CPU_SYSREG_OFFSET(DBGBVR0_EL1)
+
+   adr x26, 1f
+   add x26, x26, x24, lsl #2
+   br  x26
+1:
+   mrs x20, dbgbvr15_el1
+   mrs x19, dbgbvr14_el1
+   mrs x18, dbgbvr13_el1
+   mrs x17, dbgbvr12_el1
+   mrs x16, dbgbvr11_el1
+   mrs x15, dbgbvr10_el1
+   mrs x14, dbgbvr9_el1
+   mrs x13, dbgbvr8_el1
+   mrs x12, dbgbvr7_el1
+   mrs x11, dbgbvr6_el1
+   mrs x10, dbgbvr5_el1
+   mrs x9, dbgbvr4_el1
+   mrs x8, dbgbvr3_el1
+   mrs x7, dbgbvr2_el1
+   mrs x6, dbgbvr1_el1
+   mrs x5, dbgbvr0_el1
+
+   adr x26, 1f
+   add x26, x26, x24, lsl #2
+   br  x26
+
+1:
+   str x20, [x3, #(15 * 8)]
+   str x19, [x3, #(14 * 8)]
+   str x18, [x3, #(13 * 8)]
+   str x17, [x3, #(12 * 8)]
+   str x16, [x3, #(11 * 8)]
+   str x15, [x3, #(10 * 8)]
+   str x14, [x3, #(9 * 8)]
+   str x13, [x3, #(8 * 8)]
+   str x12, [x3, #(7 * 8)]
+   str x11, [x3, #(6 * 8)]
+   str   

[PATCH v3 9/9] arm64: KVM: enable trapping of all debug registers

2014-06-20 Thread Marc Zyngier
Enable trapping of the debug registers, preventing the guests to
mess with the host state (and allowing guests to use the debug
infrastructure as well).

Reviewed-by: Anup Patel 
Reviewed-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/hyp.S | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
index 727087c..e36ca91 100644
--- a/arch/arm64/kvm/hyp.S
+++ b/arch/arm64/kvm/hyp.S
@@ -777,6 +777,14 @@ __kvm_hyp_code_start:
mrs x2, mdcr_el2
and x2, x2, #MDCR_EL2_HPMN_MASK
orr x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
+   orr x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
+
+   // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
+   // if not dirty.
+   ldr x3, [x0, #VCPU_DEBUG_FLAGS]
+   tbnz x3, #KVM_ARM64_DEBUG_DIRTY_SHIFT, 1f
+   orr x2, x2, #MDCR_EL2_TDA
+1:
msr mdcr_el2, x2
 .endm
 
-- 
1.8.3.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 3/9] arm64: KVM: add trap handlers for AArch64 debug registers

2014-06-20 Thread Marc Zyngier
Add handlers for all the AArch64 debug registers that are accessible
from EL0 or EL1. The trapping code keeps track of the state of the
debug registers, allowing for the switch code to implement a lazy
switching strategy.

Reviewed-by: Anup Patel 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/include/asm/kvm_asm.h  |  28 ++--
 arch/arm64/include/asm/kvm_host.h |   3 +
 arch/arm64/kvm/sys_regs.c | 137 +-
 3 files changed, 159 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index 9fcd54b..e6b159a 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -43,14 +43,25 @@
 #define AMAIR_EL1   19  /* Aux Memory Attribute Indirection Register */
 #define CNTKCTL_EL1 20  /* Timer Control Register (EL1) */
 #define PAR_EL1     21  /* Physical Address Register */
+#define MDSCR_EL1   22  /* Monitor Debug System Control Register */
+#define DBGBCR0_EL1 23  /* Debug Breakpoint Control Registers (0-15) */
+#define DBGBCR15_EL1 38
+#define DBGBVR0_EL1 39  /* Debug Breakpoint Value Registers (0-15) */
+#define DBGBVR15_EL1 54
+#define DBGWCR0_EL1 55  /* Debug Watchpoint Control Registers (0-15) */
+#define DBGWCR15_EL1 70
+#define DBGWVR0_EL1 71  /* Debug Watchpoint Value Registers (0-15) */
+#define DBGWVR15_EL1 86
+#define MDCCINT_EL1 87  /* Monitor Debug Comms Channel Interrupt Enable Reg */
+
 /* 32bit specific registers. Keep them at the end of the range */
-#define DACR32_EL2  22  /* Domain Access Control Register */
-#define IFSR32_EL2  23  /* Instruction Fault Status Register */
-#define FPEXC32_EL2 24  /* Floating-Point Exception Control Register */
-#define DBGVCR32_EL2 25 /* Debug Vector Catch Register */
-#define TEECR32_EL1 26  /* ThumbEE Configuration Register */
-#define TEEHBR32_EL1 27 /* ThumbEE Handler Base Register */
-#define NR_SYS_REGS 28
+#define DACR32_EL2  88  /* Domain Access Control Register */
+#define IFSR32_EL2  89  /* Instruction Fault Status Register */
+#define FPEXC32_EL2 90  /* Floating-Point Exception Control Register */
+#define DBGVCR32_EL2 91 /* Debug Vector Catch Register */
+#define TEECR32_EL1 92  /* ThumbEE Configuration Register */
+#define TEEHBR32_EL1 93 /* ThumbEE Handler Base Register */
+#define NR_SYS_REGS 94
 
 /* 32bit mapping */
 #define c0_MPIDR   (MPIDR_EL1 * 2) /* MultiProcessor ID Register */
@@ -87,6 +98,9 @@
 #define ARM_EXCEPTION_IRQ    0
 #define ARM_EXCEPTION_TRAP   1
 
+#define KVM_ARM64_DEBUG_DIRTY_SHIFT0
+#define KVM_ARM64_DEBUG_DIRTY  (1 << KVM_ARM64_DEBUG_DIRTY_SHIFT)
+
 #ifndef __ASSEMBLY__
 struct kvm;
 struct kvm_vcpu;
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 92242ce..79573c86 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -101,6 +101,9 @@ struct kvm_vcpu_arch {
/* Exception Information */
struct kvm_vcpu_fault_info fault;
 
+   /* Debug state */
+   u64 debug_flags;
+
/* Pointer to host CPU context */
kvm_cpu_context_t *host_cpu_context;
 
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4abd84e..808e3b2 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "sys_regs.h"
@@ -173,6 +174,60 @@ static bool trap_raz_wi(struct kvm_vcpu *vcpu,
return read_zero(vcpu, p);
 }
 
+static bool trap_oslsr_el1(struct kvm_vcpu *vcpu,
+  const struct sys_reg_params *p,
+  const struct sys_reg_desc *r)
+{
+   if (p->is_write) {
+   return ignore_write(vcpu, p);
+   } else {
+   *vcpu_reg(vcpu, p->Rt) = (1 << 3);
+   return true;
+   }
+}
+
+static bool trap_dbgauthstatus_el1(struct kvm_vcpu *vcpu,
+  const struct sys_reg_params *p,
+  const struct sys_reg_desc *r)
+{
+   if (p->is_write) {
+   return ignore_write(vcpu, p);
+   } else {
+   u32 val;
+   asm volatile("mrs %0, dbgauthstatus_el1" : "=r" (val));
+   *vcpu_reg(vcpu, p->Rt) = val;
+   return true;
+   }
+}
+
+/*
+ * We want to avoid world-switching all the DBG registers all the
+ * time. For this, we use a DIRTY bit, indicating the guest has
+ * modified the debug registers, and only restore the registers once,
+ * disabling traps.
+ *
+ * The best thing to do would be to trap MDSCR_EL1 independently, test
+ * if DBG_MDSCR_KDE or DBG_MDSCR_MDE is getting 

[PATCH v3 4/9] arm64: KVM: common infrastructure for handling AArch32 CP14/CP15

2014-06-20 Thread Marc Zyngier
As we're about to trap a bunch of CP14 registers, let's rework
the CP15 handling so it can be generalized and work with multiple
tables.

Reviewed-by: Anup Patel 
Reviewed-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/include/asm/kvm_asm.h|   2 +-
 arch/arm64/include/asm/kvm_coproc.h |   3 +-
 arch/arm64/include/asm/kvm_host.h   |   9 ++-
 arch/arm64/kvm/handle_exit.c|   4 +-
 arch/arm64/kvm/sys_regs.c   | 133 +---
 5 files changed, 122 insertions(+), 29 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
index e6b159a..12f9dd7 100644
--- a/arch/arm64/include/asm/kvm_asm.h
+++ b/arch/arm64/include/asm/kvm_asm.h
@@ -93,7 +93,7 @@
 #define c10_AMAIR0 (AMAIR_EL1 * 2) /* Aux Memory Attr Indirection Reg */
 #define c10_AMAIR1 (c10_AMAIR0 + 1)/* Aux Memory Attr Indirection Reg */
 #define c14_CNTKCTL (CNTKCTL_EL1 * 2) /* Timer Control Register (PL1) */
-#define NR_CP15_REGS   (NR_SYS_REGS * 2)
+#define NR_COPRO_REGS  (NR_SYS_REGS * 2)
 
 #define ARM_EXCEPTION_IRQ    0
 #define ARM_EXCEPTION_TRAP   1
diff --git a/arch/arm64/include/asm/kvm_coproc.h 
b/arch/arm64/include/asm/kvm_coproc.h
index 9a59301..0b52377 100644
--- a/arch/arm64/include/asm/kvm_coproc.h
+++ b/arch/arm64/include/asm/kvm_coproc.h
@@ -39,7 +39,8 @@ void kvm_register_target_sys_reg_table(unsigned int target,
   struct kvm_sys_reg_target_table *table);
 
 int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, struct kvm_run *run);
-int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run);
+int kvm_handle_cp14_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
+int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
 int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct kvm_run *run);
 int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct kvm_run *run);
 int kvm_handle_sys_reg(struct kvm_vcpu *vcpu, struct kvm_run *run);
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index 79573c86..108a297 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -86,7 +86,7 @@ struct kvm_cpu_context {
struct kvm_regs gp_regs;
union {
u64 sys_regs[NR_SYS_REGS];
-   u32 cp15[NR_CP15_REGS];
+   u32 copro[NR_COPRO_REGS];
};
 };
 
@@ -141,7 +141,12 @@ struct kvm_vcpu_arch {
 
 #define vcpu_gp_regs(v)    (&(v)->arch.ctxt.gp_regs)
 #define vcpu_sys_reg(v,r)  ((v)->arch.ctxt.sys_regs[(r)])
-#define vcpu_cp15(v,r) ((v)->arch.ctxt.cp15[(r)])
+/*
+ * CP14 and CP15 live in the same array, as they are backed by the
+ * same system registers.
+ */
+#define vcpu_cp14(v,r) ((v)->arch.ctxt.copro[(r)])
+#define vcpu_cp15(v,r) ((v)->arch.ctxt.copro[(r)])
 
 struct kvm_vm_stat {
u32 remote_tlb_flush;
diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
index 182415e..e28be51 100644
--- a/arch/arm64/kvm/handle_exit.c
+++ b/arch/arm64/kvm/handle_exit.c
@@ -73,9 +73,9 @@ static exit_handle_fn arm_exit_handlers[] = {
 	[ESR_EL2_EC_WFI]	= kvm_handle_wfx,
 	[ESR_EL2_EC_CP15_32]	= kvm_handle_cp15_32,
 	[ESR_EL2_EC_CP15_64]	= kvm_handle_cp15_64,
-	[ESR_EL2_EC_CP14_MR]	= kvm_handle_cp14_access,
+	[ESR_EL2_EC_CP14_MR]	= kvm_handle_cp14_32,
 	[ESR_EL2_EC_CP14_LS]	= kvm_handle_cp14_load_store,
-	[ESR_EL2_EC_CP14_64]	= kvm_handle_cp14_access,
+	[ESR_EL2_EC_CP14_64]	= kvm_handle_cp14_64,
 	[ESR_EL2_EC_HVC32]	= handle_hvc,
 	[ESR_EL2_EC_SMC32]	= handle_smc,
 	[ESR_EL2_EC_HVC64]	= handle_hvc,
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 808e3b2..fb6eece 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -483,6 +483,10 @@ static const struct sys_reg_desc sys_reg_descs[] = {
  NULL, reset_val, FPEXC32_EL2, 0x70 },
 };
 
+/* Trapped cp14 registers */
+static const struct sys_reg_desc cp14_regs[] = {
+};
+
 /*
  * Trapped cp15 registers. TTBR0/TTBR1 get a double encoding,
  * depending on the way they are accessed (as a 32bit or a 64bit
@@ -590,26 +594,29 @@ int kvm_handle_cp14_load_store(struct kvm_vcpu *vcpu, 
struct kvm_run *run)
return 1;
 }
 
-int kvm_handle_cp14_access(struct kvm_vcpu *vcpu, struct kvm_run *run)
-{
-   kvm_inject_undefined(vcpu);
-   return 1;
-}
-
-static void emulate_cp15(struct kvm_vcpu *vcpu,
-const struct sys_reg_params *params)
+/*
+ * emulate_cp --  tries to match a sys_reg access in a handling table, and
+ *call the corresponding trap handler.
+ *
+ * @params: pointer to the descriptor of the access
+ * @table: array of trap descriptors
+ * @num: size of the trap descriptor array
+ *
+ * Return 0 if the access has been handled, and -1 if not.
+ */
+static int emulate_

[PATCH v3 5/9] arm64: KVM: use separate tables for AArch32 32 and 64bit traps

2014-06-20 Thread Marc Zyngier
An interesting "feature" of the CP14 encoding is that there is
an overlap between 32 and 64bit registers, meaning they cannot
live in the same table as we did for CP15.

Create separate tables for 64bit CP14 and CP15 registers, and
let the top level handler use the right one.

Reviewed-by: Anup Patel 
Reviewed-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
---
 arch/arm64/kvm/sys_regs.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index fb6eece..1fb1bff 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -487,13 +487,16 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 static const struct sys_reg_desc cp14_regs[] = {
 };
 
+/* Trapped cp14 64bit registers */
+static const struct sys_reg_desc cp14_64_regs[] = {
+};
+
 /*
  * Trapped cp15 registers. TTBR0/TTBR1 get a double encoding,
  * depending on the way they are accessed (as a 32bit or a 64bit
  * register).
  */
 static const struct sys_reg_desc cp15_regs[] = {
-   { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR0 },
{ Op1( 0), CRn( 1), CRm( 0), Op2( 0), access_sctlr, NULL, c1_SCTLR },
{ Op1( 0), CRn( 2), CRm( 0), Op2( 0), access_vm_reg, NULL, c2_TTBR0 },
{ Op1( 0), CRn( 2), CRm( 0), Op2( 1), access_vm_reg, NULL, c2_TTBR1 },
@@ -534,6 +537,10 @@ static const struct sys_reg_desc cp15_regs[] = {
{ Op1( 0), CRn(10), CRm( 3), Op2( 1), access_vm_reg, NULL, c10_AMAIR1 },
{ Op1( 0), CRn(13), CRm( 0), Op2( 1), access_vm_reg, NULL, c13_CID },
 
+};
+
+static const struct sys_reg_desc cp15_64_regs[] = {
+   { Op1( 0), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR0 },
{ Op1( 1), CRn( 0), CRm( 2), Op2( 0), access_vm_reg, NULL, c2_TTBR1 },
 };
 
@@ -759,7 +766,7 @@ int kvm_handle_cp15_64(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 
target_specific = get_target_table(vcpu->arch.target, false, &num);
return kvm_handle_cp_64(vcpu,
-   cp15_regs, ARRAY_SIZE(cp15_regs),
+   cp15_64_regs, ARRAY_SIZE(cp15_64_regs),
target_specific, num);
 }
 
@@ -777,7 +784,7 @@ int kvm_handle_cp15_32(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
 int kvm_handle_cp14_64(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
return kvm_handle_cp_64(vcpu,
-   cp14_regs, ARRAY_SIZE(cp14_regs),
+   cp14_64_regs, ARRAY_SIZE(cp14_64_regs),
NULL, 0);
 }
 
-- 
1.8.3.4



Re: [PATCH 10/11] qspinlock: Paravirt support

2014-06-20 Thread Konrad Rzeszutek Wilk
On Sun, Jun 15, 2014 at 02:47:07PM +0200, Peter Zijlstra wrote:
> Add minimal paravirt support.
> 
> The code aims for minimal impact on the native case.

Woot!
> 
> On the lock side we add one jump label (asm_goto) and 4 paravirt
> callee saved calls that default to NOPs. The only effects are the
> extra NOPs and some pointless MOVs to accommodate the calling
> convention. No register spills happen because of this (x86_64).
> 
> On the unlock side we have one paravirt callee saved call, which
> defaults to the actual unlock sequence: "movb $0, (%rdi)" and a NOP.
> 
> The actual paravirt code comes in 3 parts;
> 
>  - init_node; this initializes the extra data members required for PV
>state. PV state data is kept 1 cacheline ahead of the regular data.
> 
>  - link_and_wait_node/kick_node; these are paired with the regular MCS
>queueing and are placed resp. before/after the paired MCS ops.
> 
>  - wait_head/queue_unlock; the interesting part here is finding the
>head node to kick.
> 
> Tracking the head is done in two parts, firstly the pv_wait_head will
> store its cpu number in whichever node is pointed to by the tail part
> of the lock word. Secondly, pv_link_and_wait_node() will propagate the
> existing head from the old to the new tail node.

I dug in the code and I have some comments about it, but before
I post them I was wondering if you have any plans to run any performance
tests against the PV ticketlock with normal and over-committed scenarios?

Looking at this with a pen and paper, I see that, compared to the
PV ticketlock, the CPUs contending on the queue (so they go through
pv_link_and_wait_node, then progress to pv_wait_head) go to sleep
twice and get woken up twice. With the PV ticketlock the contending
CPUs would only go to sleep once and be woken up once, when it was
their turn.

That of course is the worst case scenario - where the CPU
that has the lock is taking forever to do its job and the
host is quite overcommitted.

Thanks!


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-20 Thread Alexey Kardashevskiy
On 06/20/2014 01:21 PM, Alex Williamson wrote:
> On Thu, 2014-06-19 at 13:48 +1000, Alexey Kardashevskiy wrote:
>> On 06/19/2014 11:50 AM, Alexey Kardashevskiy wrote:
>>> On 06/19/2014 10:50 AM, Alexey Kardashevskiy wrote:
 On 06/19/2014 04:35 AM, Alex Williamson wrote:
> On Wed, 2014-06-18 at 21:36 +1000, Alexey Kardashevskiy wrote:
>> VFIO exposes BARs to user space as a byte stream so userspace can
>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>> not do byte swapping and simply return values as it gets them from
>> PCI device.
>>
>> Instead, the existing code assumes that byte stream in read/write is
>> little-endian and it fixes endianness for values which it passes to
>> ioreadXX/iowriteXX helpers. This works for little-endian as PCI is
>> little endian and le32_to_cpu/... are stubs.
>
> vfio read32:
>
> val = cpu_to_le32(ioread32(io + off));
>
> Where the typical x86 case, ioread32 is:
>
> #define ioread32(addr)  readl(addr)
>
> and readl is:
>
> __le32_to_cpu(__raw_readl(addr));
>
> So we do canceling byte swaps, which are both nops on x86, and end up
> returning device endian, which we assume is little endian.
>
> vfio write32 is similar:
>
> iowrite32(le32_to_cpu(val), io + off);
>
> The implicit cpu_to_le32 of iowrite32() and our explicit swap cancel
> out, so input data is device endian, which is assumed little.
>
>> This also works for big endian but rather by an accident: it reads 4 
>> bytes
>> from the stream (@val is big endian), converts to CPU format (which 
>> should
>> be big endian) as it was little endian (@val becomes actually little
>> endian) and calls iowrite32() which does not do swapping on big endian
>> system.
>
> Really?
>
> In arch/powerpc/kernel/iomap.c iowrite32() is just a wrapper around
> writel(), which seems to use the generic implementation, which does
> include a cpu_to_le32.


 Ouch, wrong comment. iowrite32() does swapping. My bad.


>
> I also see other big endian archs like parisc doing cpu_to_le32 on
> iowrite32, so I don't think this statement is true.  I imagine it's
> probably working for you because the swap cancel.
>
>> This removes byte swapping and makes use ioread32be/iowrite32be
>> (and 16bit versions) on big-endian systems. The "be" helpers take
>> native endian values and do swapping at the moment of writing to a PCI
>> register using one of "store byte-reversed" instructions.
>
> So now you want iowrite32() on little endian and iowrite32be() on big
> endian, the former does a cpu_to_le32 (which is a nop on little endian)
> and the latter does a cpu_to_be32 (which is a nop on big endian)...
> should we just be using __raw_writel() on both?


 We can do that too. The beauty of iowrite32be on ppc64 is that it does not
 swap and write separately, it is implemented via the "Store Word
 Byte-Reverse Indexed X-form" single instruction.

 And some archs (do not know which ones) may add memory barriers in their
 implementations of ioread/iowrite. __raw_writel is too raw :)

>  There doesn't actually
> seem to be any change in behavior here, it just eliminates back-to-back
> byte swaps, which are a nop on x86, but not power, right?

 Exactly. No dependency for QEMU.
>>>
>>> How about that:
>>> ===
>>>
>>> VFIO exposes BARs to user space as a byte stream so userspace can
>>> read it using pread()/pwrite(). Since this is a byte stream, VFIO should
>>> not do byte swapping and simply return values as it gets them from
>>> PCI device.
>>>
>>> Instead, the existing code assumes that byte stream in read/write is
>>> little-endian and it fixes endianness for values which it passes to
>>> ioreadXX/iowriteXX helpers in native format. The IO helpers do swapping
>>> again. Since both byte swaps are nops on little-endian host, this works.
>>>
>>> This also works for big endian but rather by an accident: it reads 4 bytes
>>> from the stream (@val is big endian), converts to CPU format (which should
>>> be big endian) as it was little endian (and @val becomes actually little
>>> endian) and calls iowrite32() which does swapping on big endian
>>> system again. So byte swap gets cancelled, __raw_writel() receives
>>> a native value and then
>>> *(volatile unsigned int __force *)PCI_FIX_ADDR(addr) = v;
>>> just does the right thing.
>>
>> I am wrong here, sorry. This is what happens when you watch soccer between
>> 2am and 4am :)
>>
>>
>>>
>>> This removes byte swaps and makes use of ioread32be/iowrite32be
>>> (and 16bit versions) which do explicit byte swapping at the moment
>>> of write to a PCI register. PPC64 uses a special "Store Word
>>> Byte-Reverse Indexed X-form" instruction which does swap and store.
>>
>> No swappi
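The "cancelling swaps" point discussed in this thread can be checked with a tiny C model. Here `bswap32()` stands in for the `le32_to_cpu()`/`cpu_to_le32()` pair that a big-endian host would apply (both are nops on little endian), so a sketch like this only illustrates the arithmetic, not the actual accessor implementations:

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t v)
{
        return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
               ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Model of iowrite32(le32_to_cpu(val)) on a big-endian host: the
 * explicit swap and the accessor's implicit swap cancel, so the bytes
 * reach the device unchanged. */
static uint32_t double_swap(uint32_t val)
{
        return bswap32(bswap32(val));
}
```

This is why the old code "worked by accident" on big endian: the two swaps compose to the identity, and using `iowrite32be()` instead simply performs the single swap directly (on ppc64, as one store-byte-reversed instruction).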

Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-20 Thread Gleb Natapov
On Fri, Jun 20, 2014 at 09:53:26AM -0300, Marcelo Tosatti wrote:
> On Fri, Jun 20, 2014 at 02:15:10PM +0300, Gleb Natapov wrote:
> > On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> > > On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > > > CCing Marcelo,
> > > > 
> > > > On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> > > > > Hi Gleb,
> > > > > 
> > > > > Thanks for the quick reply. Please see below.
> > > > > 
> > > > > On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> > > > > >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> > > > > >>[Questions]
> > > > > >>And by the way, would you guys please answer the following 
> > > > > >>questions for me ?
> > > > > >>
> > > > > >>1. What's the ept identity pagetable for ?  Only one page is enough 
> > > > > >>?
> > > > > >>
> > > > > >>2. Is the ept identity pagetable only used in realmode ?
> > > > > >>Can we free it once the guest is up (vcpu in protect mode)?
> > > > > >>
> > > > > >>3. Now, ept identity pagetable is allocated in qemu userspace.
> > > > > >>Can we allocate it in kernel space ?
> > > > > >What would be the benefit?
> > > > > 
> > > > > I think the benefit is we can hot-remove the host memory a kvm guest
> > > > > is using.
> > > > > 
> > > > > For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. And 
> > > > > the
> > > > > kernel
> > > > > will never use ZONE_MOVABLE memory. So if we can allocate these two 
> > > > > pages in
> > > > > kernel space, we can pin them without any trouble. When doing memory
> > > > > hot-remove,
> > > > > the kernel will not try to migrate these two pages.
> > > > But we can do that by other means, no? The patch you've sent for 
> > > > instance.
> > > > 
> > > > > 
> > > > > >
> > > > > >>
> > > > > >>4. If I want to migrate these two pages, what do you think is the 
> > > > > >>best way ?
> > > > > >>
> > > > > >I answered most of those here: 
> > > > > >http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> > > > > 
> > > > > I'm sorry I must missed this email.
> > > > > 
> > > > > Seeing your advice, we can unpin these two pages and repin them in 
> > > > > the next
> > > > > EPT violation.
> > > > > So about this problem, which solution would you prefer, allocate 
> > > > > these two
> > > > > pages in kernel
> > > > > space, or migrate them before memory hot-remove ?
> > > > > 
> > > > > I think the first solution is simpler. But I'm not quite sure if 
> > > > > there is
> > > > > any other pages
> > > > > pinned in memory. If we have the same problem with other kvm pages, I 
> > > > > think
> > > > > it is better to
> > > > > solve it in the second way.
> > > > > 
> > > > > What do you think ?
> > > > Remove pinning is preferable. In fact looks like for identity pagetable
> > > > it should be trivial, just don't pin. APIC access page is a little bit
> > > > more complicated since its physical address needs to be tracked to be
> > > > updated in VMCS.
> > > 
> > > Yes, and there are new users of page pinning as well soon (see PEBS
> > > threads on kvm-devel).
> > > 
> > > Was thinking of notifiers scheme. Perhaps:
> > > 
> > > ->begin_page_unpin(struct page *page)
> > >   - Remove any possible access to page.
> > > 
> > > ->end_page_unpin(struct page *page)
> > >   - Reinstantiate any possible access to page.
> > > 
> > > For KVM:
> > > 
> > > ->begin_page_unpin()
> > >   - Remove APIC-access page address from VMCS.
> > > or
> > >   - Remove spte translation to pinned page.
> > >   
> > >   - Put vcpu in state where no VM-entries are allowed.
> > > 
> > > ->end_page_unpin()
> > >   - Setup APIC-access page, ...
> > >   - Allow vcpu to VM-entry.
> > > 
> > I believe that to handle identity page and APIC access page we do not
> > need any of those. 
> > We can use mmu notifiers to track when page begins
> > to be moved and we can find new page location on EPT violation.
> 
> Does page migration hook via mmu notifiers? I don't think so. 
> 
Both identity page and APIC access page are userspace pages which will
have to be unmap from process address space during migration. At this point
mmu notifiers will be called.

> It won't even attempt page migration because the page count is
> increased (would have to confirm though). Tang?
> 
Of course, we should not pin.
 
> The problem with identity page is this: its location is written into the
> guest CR3. So you cannot allow it (the page which the guest CR3 points
> to) to be reused before you remove the reference.
> 
> Where is the guarantee there will be an EPT violation, allowing a vcpu
> to execute with guest CR3 pointing to page with random data?
> 
A guest physical address is written into CR3 (0xfffbc000 usually),
not the physical address of the identity page directly. When the guest
tries to use CR3, KVM will get an EPT violation and the shadow page code
will find the page that backs guest address 0xfffbc000 and map it into
the EPT table. This is what happens on a first vmentry 
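The lazy remap flow described here can be sketched as a toy model: the EPT entry is filled on a fault by looking up the *current* backing page, so the backing page can move between faults as long as the stale entry is invalidated first (the mmu notifier's job). Names and data structures are purely illustrative, not actual KVM code:

```c
#define NPAGES 4

/* Userspace mapping: guest frame number -> current host page id. */
static int backing[NPAGES];
/* EPT: guest frame number -> mapped host page id, -1 = not present. */
static int ept[NPAGES];

/* EPT violation handler: resolve the backing page at fault time. */
static int ept_fault(int gfn)
{
        ept[gfn] = backing[gfn];
        return ept[gfn];
}

/* A guest access either hits the existing mapping or faults. */
static int guest_access(int gfn)
{
        if (ept[gfn] < 0)
                return ept_fault(gfn);  /* EPT violation path */
        return ept[gfn];
}
```

With no pinning, migrating the page only requires unmapping the EPT entry; the next access faults and picks up the new location.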

Re: Hang on reboot in FreeBSD guest on Linux KVM host

2014-06-20 Thread John Nielsen
On Jun 16, 2014, at 10:21 PM, Paolo Bonzini  wrote:

> Il 16/06/2014 18:47, John Nielsen ha scritto:
>> On Jun 16, 2014, at 10:39 AM, Paolo Bonzini  wrote:
>> 
>>> Il 16/06/2014 18:09, John Nielsen ha scritto:
>> The only substantial difference on the hardware side is the CPU.
>> The hosts where the problem occurs use "Intel(R) Xeon(R) CPU
>> E5-2650 v2 @ 2.60GHz", while the hosts that don't show the
>> problem use the prior revision, "Intel(R) Xeon(R) CPU E5-2650 0 @
>> 2.00GHz".
>>> 
>>> Can you do "grep . /sys/module/kvm_intel/parameters/*" on both hosts please?
>> 
>> No differences that I can see. Output below.
> 
> Not really:
> 
>> Working host:
>> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
>> # grep . /sys/module/kvm_intel/parameters/*
>> /sys/module/kvm_intel/parameters/enable_apicv:N
>> 
>> Problem host:
>> Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
>> # grep . /sys/module/kvm_intel/parameters/*
>> /sys/module/kvm_intel/parameters/enable_apicv:Y
> 
> So we have a clue.  Let me study the code more, I'll try to get back with a 
> suggestion.

Paolo, have you had an opportunity to look into this some more?

Thanks,

JN



[PATCH] kvmtool: ARM: timers: add "always-on" property to the device tree

2014-06-20 Thread Marc Zyngier
The new optional property "always-on" indicates that the timers
are, well, always on when used with KVM.

This allows for substantial performance improvement in the guest
(it switches to NOHZ instead of using a periodic tick per vcpu)
and removes a lot of burden from the host (no need to inject tons
of interrupts with the associated rescheduling overhead).

Old kernels that don't understand this property will simply ignore it.

Signed-off-by: Marc Zyngier 
---
 tools/kvm/arm/timer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/kvm/arm/timer.c b/tools/kvm/arm/timer.c
index d757c1d..209251e 100644
--- a/tools/kvm/arm/timer.c
+++ b/tools/kvm/arm/timer.c
@@ -33,6 +33,7 @@ void timer__generate_fdt_nodes(void *fdt, struct kvm *kvm, 
int *irqs)
_FDT(fdt_begin_node(fdt, "timer"));
_FDT(fdt_property(fdt, "compatible", compatible, sizeof(compatible)));
_FDT(fdt_property(fdt, "interrupts", irq_prop, sizeof(irq_prop)));
+   _FDT(fdt_property(fdt, "always-on", NULL, 0));
if (kvm->cfg.arch.force_cntfrq > 0)
_FDT(fdt_property_cell(fdt, "clock-frequency", 
kvm->cfg.arch.force_cntfrq));
_FDT(fdt_end_node(fdt));
-- 
1.8.3.4



Re: [PATCH 2/4] kvmtool: prepare for instantiating different IRQ chip devices

2014-06-20 Thread Will Deacon
Hi Andre,

On Thu, Jun 19, 2014 at 01:44:37PM +0100, Andre Przywara wrote:
> Extend the vGIC handling code to deal with different IRQ chip devices
> instead of hard-coding the GICv2 in.

Minor comment, but there are a few stray pr_infos in here which I don't
think are especially useful.

> + if (!ioctl(gic_fd, KVM_HAS_DEVICE_ATTR, &offset_attr)) {
> + err = ioctl(gic_fd, KVM_GET_DEVICE_ATTR, &offset_attr);
> + if (err)
> + return err;
> + }
> +
> + cpu_if_addr += offset;
> +
> + err = ioctl(gic_fd, KVM_SET_DEVICE_ATTR, &cpu_if_attr);
>   if (err)
>   return err;
> + pr_info("creating GICv2 KVM device");

One here...

> + switch (type) {
> + case KVM_DEV_TYPE_ARM_VGIC_V2:
> + compatible = "arm,cortex-a15-gic";
> + pr_info("creating FDT for a GICv2");

... and here.

> diff --git a/tools/kvm/virtio/mmio.c b/tools/kvm/virtio/mmio.c
> index afae6a7..35dc113 100644
> --- a/tools/kvm/virtio/mmio.c
> +++ b/tools/kvm/virtio/mmio.c
> @@ -293,7 +293,7 @@ int virtio_mmio_init(struct kvm *kvm, void *dev, struct 
> virtio_device *vdev,
>*
>* virtio_mmio.devices=0x200@0xd200:5,0x200@0xd2000200:6
>*/
> - pr_info("virtio-mmio.devices=0x%x@0x%x:%d\n", VIRTIO_MMIO_IO_SIZE, vmmio->addr, line);
> + pr_info("virtio-mmio.devices=0x%x@0x%x:%d", VIRTIO_MMIO_IO_SIZE, vmmio->addr, line);

Huh?

Anyway, the general idea looks ok to me.

Will


Re: [PATCH] kvmtool: ARM: timers: add "always-on" property to the device tree

2014-06-20 Thread Will Deacon
On Fri, Jun 20, 2014 at 05:38:14PM +0100, Marc Zyngier wrote:
> The new optional property "always-on" indicates that the timers
> are, well, always on when used with KVM.
> 
> This allows for substantial performance improvement in the guest
> (it switches to NOHZ instead of using a periodic tick per vcpu)
> and removes a lot of burden from the host (no need to inject tons
> of interrupts with the associated rescheduling overhead).
> 
> Old kernels that don't understand this property will simply ignore it.

  Acked-by: Will Deacon 

Will


Re: Hang on reboot in FreeBSD guest on Linux KVM host

2014-06-20 Thread Paolo Bonzini

On 20/06/2014 17:41, John Nielsen wrote:

>> So we have a clue.  Let me study the code more, I'll try to get back with a suggestion.
>
> Paolo, have you had an opportunity to look in to this some more?

Not yet, sorry.

One possibility is this though.  Can you try migrating (or 
saving/restoring) the guest when it's hung, and see if it resuscitates?


Paolo


Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-20 Thread Marcelo Tosatti
On Fri, Jun 20, 2014 at 05:26:22PM +0300, Gleb Natapov wrote:
> On Fri, Jun 20, 2014 at 09:53:26AM -0300, Marcelo Tosatti wrote:
> > On Fri, Jun 20, 2014 at 02:15:10PM +0300, Gleb Natapov wrote:
> > > On Thu, Jun 19, 2014 at 04:00:24PM -0300, Marcelo Tosatti wrote:
> > > > On Thu, Jun 19, 2014 at 12:20:32PM +0300, Gleb Natapov wrote:
> > > > > CCing Marcelo,
> > > > > 
> > > > > On Wed, Jun 18, 2014 at 02:50:44PM +0800, Tang Chen wrote:
> > > > > > Hi Gleb,
> > > > > > 
> > > > > > Thanks for the quick reply. Please see below.
> > > > > > 
> > > > > > On 06/18/2014 02:12 PM, Gleb Natapov wrote:
> > > > > > >On Wed, Jun 18, 2014 at 01:50:00PM +0800, Tang Chen wrote:
> > > > > > >>[Questions]
> > > > > > >>And by the way, would you guys please answer the following 
> > > > > > >>questions for me ?
> > > > > > >>
> > > > > > >>1. What's the ept identity pagetable for ?  Only one page is 
> > > > > > >>enough ?
> > > > > > >>
> > > > > > >>2. Is the ept identity pagetable only used in realmode ?
> > > > > > >>Can we free it once the guest is up (vcpu in protect mode)?
> > > > > > >>
> > > > > > >>3. Now, ept identity pagetable is allocated in qemu userspace.
> > > > > > >>Can we allocate it in kernel space ?
> > > > > > >What would be the benefit?
> > > > > > 
> > > > > > I think the benefit is we can hot-remove the host memory a kvm guest
> > > > > > is using.
> > > > > > 
> > > > > > For now, only memory in ZONE_MOVABLE can be migrated/hot-removed. 
> > > > > > And the
> > > > > > kernel
> > > > > > will never use ZONE_MOVABLE memory. So if we can allocate these two 
> > > > > > pages in
> > > > > > kernel space, we can pin them without any trouble. When doing memory
> > > > > > hot-remove,
> > > > > > the kernel will not try to migrate these two pages.
> > > > > But we can do that by other means, no? The patch you've sent for 
> > > > > instance.
> > > > > 
> > > > > > 
> > > > > > >
> > > > > > >>
> > > > > > >>4. If I want to migrate these two pages, what do you think is the 
> > > > > > >>best way ?
> > > > > > >>
> > > > > > >I answered most of those here: 
> > > > > > >http://www.mail-archive.com/kvm@vger.kernel.org/msg103718.html
> > > > > > 
> > > > > > I'm sorry I must missed this email.
> > > > > > 
> > > > > > Seeing your advice, we can unpin these two pages and repin them in 
> > > > > > the next
> > > > > > EPT violation.
> > > > > > So about this problem, which solution would you prefer, allocate 
> > > > > > these two
> > > > > > pages in kernel
> > > > > > space, or migrate them before memory hot-remove ?
> > > > > > 
> > > > > > I think the first solution is simpler. But I'm not quite sure if 
> > > > > > there is
> > > > > > any other pages
> > > > > > pinned in memory. If we have the same problem with other kvm pages, 
> > > > > > I think
> > > > > > it is better to
> > > > > > solve it in the second way.
> > > > > > 
> > > > > > What do you think ?
> > > > > Remove pinning is preferable. In fact looks like for identity 
> > > > > pagetable
> > > > > it should be trivial, just don't pin. APIC access page is a little bit
> > > > > more complicated since its physical address needs to be tracked to be
> > > > > updated in VMCS.
> > > > 
> > > > Yes, and there are new users of page pinning as well soon (see PEBS
> > > > threads on kvm-devel).
> > > > 
> > > > Was thinking of notifiers scheme. Perhaps:
> > > > 
> > > > ->begin_page_unpin(struct page *page)
> > > > - Remove any possible access to page.
> > > > 
> > > > ->end_page_unpin(struct page *page)
> > > > - Reinstantiate any possible access to page.
> > > > 
> > > > For KVM:
> > > > 
> > > > ->begin_page_unpin()
> > > > - Remove APIC-access page address from VMCS.
> > > >   or
> > > > - Remove spte translation to pinned page.
> > > > 
> > > > - Put vcpu in state where no VM-entries are allowed.
> > > > 
> > > > ->end_page_unpin()
> > > > - Setup APIC-access page, ...
> > > > - Allow vcpu to VM-entry.
> > > > 
> > > I believe that to handle identity page and APIC access page we do not
> > > need any of those. 
> > > We can use mmu notifiers to track when page begins
> > > to be moved and we can find new page location on EPT violation.
> > 
> > Does page migration hook via mmu notifiers? I don't think so. 
> > 
> Both identity page and APIC access page are userspace pages which will
> have to be unmap from process address space during migration. At this point
> mmu notifiers will be called.

Right.

> > It won't even attempt page migration because the page count is
> > increased (would have to confirm though). Tang?
> > 
> Of course, we should not pin.
>  
> > The problem with identity page is this: its location is written into the
> > guest CR3. So you cannot allow it (the page which the guest CR3 points
> > to) to be reused before you remove the reference.
> > 
> > Where is the guarantee there will be an EPT violation, allowing a vcpu
> > to execute with guest

Re: [RFC PATCH 1/1] Move two pinned pages to non-movable node in kvm.

2014-06-20 Thread Marcelo Tosatti
On Fri, Jun 20, 2014 at 05:31:46PM -0300, Marcelo Tosatti wrote:
> > IIRC your shadow page pinning patch series support flushing of ptes
> > by mmu notifier by forcing MMU reload and, as a result, faulting in of
> > pinned pages during next entry.  Your patch series does not pin pages
> > by elevating their page count.
> 
> No but PEBS series does and its required to stop swap-out
> of the page.

Well actually no because of mmu notifiers.

Tang, can you implement mmu notifiers for the other breaker of 
mem hotplug ?


Re: Hang on reboot in FreeBSD guest on Linux KVM host

2014-06-20 Thread John Nielsen
On Jun 20, 2014, at 1:53 PM, Paolo Bonzini  wrote:

> On 20/06/2014 17:41, John Nielsen wrote:
>>> >
>>> > So we have a clue.  Let me study the code more, I'll try to get back with 
>>> > a suggestion.
>> Paolo, have you had an opportunity to look in to this some more?
> 
> Not yet, sorry.
> 
> One possibility is this though.  Can you try migrating (or saving/restoring) 
> the guest when it's hung, and see if it resuscitates?

The guest is still hung after a save/restore.

# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
2,sockets=1,cores=1,threads=2 -drive 
file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
-device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none -monitor 
stdio
QEMU 2.0.50 monitor - type 'help' for more information
(qemu) stop
(qemu) savevm smphang
(qemu) q
# /usr/bin/qemu-system-x86_64 -machine accel=kvm -name bsdtest -m 512 -smp 
2,sockets=1,cores=1,threads=2 -drive 
file=./20140613_FreeBSD_9.2-RELEASE_ufs.qcow2,if=none,id=drive0,format=qcow2 
-device virtio-blk-pci,scsi=off,drive=drive0 -vnc 0.0.0.0:0 -net none -monitor 
stdio -loadvm smphang
QEMU 2.0.50 monitor - type 'help' for more information
(qemu) 

[The VNC console shows the same hung kernel screen as when I ran savevm]

JN



Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-20 Thread Benjamin Herrenschmidt
On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:

> Working on big endian being an accident may be a matter of perspective

 :-)

> The comment remains that this patch doesn't actually fix anything except
> the overhead on big endian systems doing redundant byte swapping and
> maybe the philosophy that vfio regions are little endian.

Yes, that works by accident because technically VFIO is a transport and
thus shouldn't perform any endian swapping of any sort, which remains
the responsibility of the end driver, which is the only one to know
whether a given BAR location is a register or some streaming data
and in the former case whether it's LE or BE (some PCI devices are BE
even ! :-)

But yes, in the end, it works with the dual "cancelling" swaps and the
overhead of those swaps is probably drowned in the noise of the syscall
overhead.

> I'm still not a fan of iowrite vs iowritebe, there must be something we
> can use that doesn't have an implicit swap.

Sadly there isn't ... In the old day we didn't even have the "be"
variant and readl/writel style accessors still don't have them either
for all archs.

There is __raw_readl/writel, but there the semantics are much more than
just "don't swap": they also don't have memory barriers (which means
they are essentially useless to most drivers, unless those are platform
specific drivers which know exactly what they are doing, or in the rare
cases such as accessing a framebuffer, which we know never has side
effects).

>  Calling it iowrite*_native is also an abuse of the namespace.


>  Next thing we know some common code
> will legitimately use that name. 

It might make sense to move those definitions into a common header. There have
been a handful of cases in the past that wanted that sort of "native
byte order" MMIO iirc (though don't ask me for examples, I can't really
remember).

>  If we do need to define an alias
> (which I'd like to avoid) it should be something like vfio_iowrite32.
> Thanks,

Cheers,
Ben.

> Alex
> 
> > > ===
> > > 
> > > any better?
> > > 
> > > 
> > > 
> > > 
> >  Suggested-by: Benjamin Herrenschmidt 
> >  Signed-off-by: Alexey Kardashevskiy 
> >  ---
> >   drivers/vfio/pci/vfio_pci_rdwr.c | 20 
> >   1 file changed, 16 insertions(+), 4 deletions(-)
> > 
> >  diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> >  index 210db24..f363b5a 100644
> >  --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> >  +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> >  @@ -21,6 +21,18 @@
> >   
> >   #include "vfio_pci_private.h"
> >   
> >  +#ifdef __BIG_ENDIAN__
> >  +#define ioread16_native   ioread16be
> >  +#define ioread32_native   ioread32be
> >  +#define iowrite16_native  iowrite16be
> >  +#define iowrite32_native  iowrite32be
> >  +#else
> >  +#define ioread16_native   ioread16
> >  +#define ioread32_native   ioread32
> >  +#define iowrite16_native  iowrite16
> >  +#define iowrite32_native  iowrite32
> >  +#endif
> >  +
> >   /*
> >    * Read or write from an __iomem region (MMIO or I/O port) with an excluded
> >    * range which is inaccessible.  The excluded range drops writes and fills
> >  @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> > if (copy_from_user(&val, buf, 4))
> > return -EFAULT;
> >   
> >  -  iowrite32(le32_to_cpu(val), io + off);
> >  +  iowrite32_native(val, io + off);
> > } else {
> >  -  val = cpu_to_le32(ioread32(io + off));
> >  +  val = ioread32_native(io + off);
> >   
> > if (copy_to_user(buf, &val, 4))
> > return -EFAULT;
> >  @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
> > if (copy_from_user(&val, buf, 2))
> > return -EFAULT;
> >   
> >  -  iowrite16(le16_to_cpu(val), io + off);
> >  +  iowrite16_native(val, io + off);
> > } else {
> >  -  val = cpu_to_le16(ioread16(io + off));
> >  +  val = ioread16_native(io + off);
> >   
> > if (copy_to_user(buf, &val, 2))
> > return -EFAULT;

Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-20 Thread Benjamin Herrenschmidt
On Sat, 2014-06-21 at 00:14 +1000, Alexey Kardashevskiy wrote:

> We can still use __raw_writel&co, would that be ok?

No unless you understand precisely what kind of memory barriers each
platform require for these.

Cheers,
Ben.

