date:20141113

Re: [PATCH v9 03/19] vfio: platform: add the VFIO PLATFORM module to Kconfig

2014-11-13 Thread Hongbo Zhang

On 12 November 2014 17:57, Antonios Motakis
 wrote:
> Hello Hongbo,
>
> On Wed, Nov 12, 2014 at 10:52 AM, Hongbo Zhang  
> wrote:
>> On 28 October 2014 02:07, Antonios Motakis
>>  wrote:
>>>
>>> Enable building the VFIO PLATFORM driver that allows to use Linux platform
>>> devices with VFIO.
>>>
>>> Signed-off-by: Antonios Motakis 
>>> ---
>>>  drivers/vfio/Kconfig   | 1 +
>>>  drivers/vfio/Makefile  | 1 +
>>>  drivers/vfio/platform/Kconfig  | 9 +
>>>  drivers/vfio/platform/Makefile | 4 
>>>  4 files changed, 15 insertions(+)
>>>  create mode 100644 drivers/vfio/platform/Kconfig
>>>  create mode 100644 drivers/vfio/platform/Makefile
>>>
>>> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
>>> index a0abe04..962fb80 100644
>>> --- a/drivers/vfio/Kconfig
>>> +++ b/drivers/vfio/Kconfig
>>> @@ -27,3 +27,4 @@ menuconfig VFIO
>>>   If you don't know what to do here, say N.
>>>
>>>  source "drivers/vfio/pci/Kconfig"
>>> +source "drivers/vfio/platform/Kconfig"
>>> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
>>> index 0b035b1..dadf0ca 100644
>>> --- a/drivers/vfio/Makefile
>>> +++ b/drivers/vfio/Makefile
>>> @@ -3,3 +3,4 @@ obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
>>>  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>>>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>>>  obj-$(CONFIG_VFIO_PCI) += pci/
>>> +obj-$(CONFIG_VFIO_PLATFORM) += platform/
>>> diff --git a/drivers/vfio/platform/Kconfig b/drivers/vfio/platform/Kconfig
>>> new file mode 100644
>>> index 000..c51af17
>>> --- /dev/null
>>> +++ b/drivers/vfio/platform/Kconfig
>>> @@ -0,0 +1,9 @@
>>> +config VFIO_PLATFORM
>>> +   tristate "VFIO support for platform devices"
>>> +   depends on VFIO && EVENTFD && ARM
>>
>> Hi Antonios,
>> Is this only for ARM? how about X86 and PowerPC?
>> On Freescale's PowerPC platform, the IOMMU is called PAMU (Peripheral
>> Access Management Unit), and I am trying to use this VFIO framework on
>> it.
>>
>
> In principle it should be working on any platform with such devices;
> as long as you have a VFIO IOMMU driver for the PAMU (on ARM we use
> VFIO PLATFORM for the device, with VFIO IOMMU TYPE1 for the IOMMU).
>

Antonios,
As far as you know, on which ARM platform can I apply your patches directly?
My purpose is to apply you patches[1], and then implement a user space
driver to evaluate the performance.

[1]  It is better without manually merging conflicts/dependencies etc,
I am vfio-platform user, not a iommu expert.

> So if you have a suitable IOMMU driver for your target, feel free to
> test it, and let us know of the results.
>
>>>
>>> +   help
>>> + Support for platform devices with VFIO. This is required to make
>>> + use of platform devices present on the system using the VFIO
>>> + framework.
>>> +
>>> + If you don't know what to do here, say N.
>>> diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
>>> new file mode 100644
>>> index 000..279862b
>>> --- /dev/null
>>> +++ b/drivers/vfio/platform/Makefile
>>> @@ -0,0 +1,4 @@
>>> +
>>> +vfio-platform-y := vfio_platform.o vfio_platform_common.o
>>> +
>>> +obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
>>> --
>>> 2.1.1
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Andrew Jones

On Thu, Nov 13, 2014 at 12:44:39AM -0200, Marcelo Tosatti wrote:
> 
> The periodic kvmclock sync can be an undesired source of latencies.
> 
> Signed-off-by: Marcelo Tosatti 
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0033df3..be56fd3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -98,6 +98,9 @@ module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
>  unsigned int min_timer_period_us = 500;
>  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
>  
> +static bool kvmclock_periodic_sync = 1;

Using 'true' would look nicer.

> +module_param(kvmclock_periodic_sync, bool, S_IRUGO | S_IWUSR);
> +
>  bool kvm_has_tsc_control;
>  EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
>  u32  kvm_max_guest_tsc_khz;
> @@ -1718,7 +1721,8 @@ static void kvmclock_sync_fn(struct work_struct *work)
>   struct kvm *kvm = container_of(ka, struct kvm, arch);
>  
>   schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> - schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> + if (kvmclock_periodic_sync)
> + schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
>   KVMCLOCK_SYNC_PERIOD);
>  }

The above hunk shouldn't be necessary, as we'll never get there if we
don't do the first scheduling with the below hunk.

>  
> @@ -6971,7 +6975,8 @@ int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
>   kvm_write_tsc(vcpu, &msr);
>   vcpu_put(vcpu);
>  
> - schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> + if (kvmclock_periodic_sync)
> + schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
>   KVMCLOCK_SYNC_PERIOD);
>  
>   return r;
> 
> 

I'm not opposed to making this optional, but just curious. Were
general use cases getting adversely affected? Or is this part of
some RT work trying to kill as many sources of asynchronous latency
as possible?

drew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Andrew Jones

On Thu, Nov 13, 2014 at 09:40:41AM +0100, Andrew Jones wrote:
> On Thu, Nov 13, 2014 at 12:44:39AM -0200, Marcelo Tosatti wrote:
> > 
> > The periodic kvmclock sync can be an undesired source of latencies.
> > 
> > Signed-off-by: Marcelo Tosatti 
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 0033df3..be56fd3 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -98,6 +98,9 @@ module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
> >  unsigned int min_timer_period_us = 500;
> >  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> >  
> > +static bool kvmclock_periodic_sync = 1;
> 
> Using 'true' would look nicer.

Ahh, disregard this comment. 1 matches what the user would input.

> 
> > +module_param(kvmclock_periodic_sync, bool, S_IRUGO | S_IWUSR);
> > +
> >  bool kvm_has_tsc_control;
> >  EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
> >  u32  kvm_max_guest_tsc_khz;
> > @@ -1718,7 +1721,8 @@ static void kvmclock_sync_fn(struct work_struct *work)
> > struct kvm *kvm = container_of(ka, struct kvm, arch);
> >  
> > schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> > -   schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > +   if (kvmclock_periodic_sync)
> > +   schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > KVMCLOCK_SYNC_PERIOD);
> >  }
> 
> The above hunk shouldn't be necessary, as we'll never get there if we
> don't do the first scheduling with the below hunk.

Disregard this comment too. I didn't pay enough attention to the module
param permissions. We definitely need this here to modify behaviour of
running VMs when the parameter gets updated with writes to sysfs.

> 
> >  
> > @@ -6971,7 +6975,8 @@ int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
> > kvm_write_tsc(vcpu, &msr);
> > vcpu_put(vcpu);
> >  
> > -   schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > +   if (kvmclock_periodic_sync)
> > +   schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > KVMCLOCK_SYNC_PERIOD);
> >  
> > return r;
> > 
> > 

But... if the kvmclock_periodic_sync is false here, then it won't matter
if we turn it on later. Maybe we don't care about that, but if we do,
then we should remove this hunk, and also change the hunk above to be

@@ -1717,6 +1717,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
   kvmclock_sync_work);
struct kvm *kvm = container_of(ka, struct kvm, arch);
 
+   if (!kvmclock_periodic_sync)
+   return;
+
schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
KVMCLOCK_SYNC_PERIOD);


> 
> I'm not opposed to making this optional, but just curious. Were
> general use cases getting adversely affected? Or is this part of
> some RT work trying to kill as many sources of asynchronous latency
> as possible?
> 
> drew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Nikolay Nikolaev

On Mon, Nov 10, 2014 at 6:27 PM, Christoffer Dall
 wrote:
> On Mon, Nov 10, 2014 at 05:09:07PM +0200, Nikolay Nikolaev wrote:
>> Hello,
>>
>> On Fri, Mar 28, 2014 at 9:09 PM, Christoffer Dall
>>  wrote:
>> >
>> > On Thu, Mar 13, 2014 at 04:57:26PM +0100, Antonios Motakis wrote:
>> > > On an unhandled IO memory abort, use the kvm_io_bus_* API in order to
>> > > handle the MMIO access through any registered read/write callbacks. This
>> > > is a dependency for eventfd support (ioeventfd and irqfd).
>> > >
>> > > However, accesses to the VGIC are still left implemented independently,
>> > > since the kvm_io_bus_* API doesn't pass the VCPU pointer doing the 
>> > > access.
>> > >
>> > > Signed-off-by: Antonios Motakis 
>> > > Signed-off-by: Nikolay Nikolaev 
>> > > ---
>> > >  arch/arm/kvm/mmio.c | 32 
>> > >  virt/kvm/arm/vgic.c |  5 -
>> > >  2 files changed, 36 insertions(+), 1 deletion(-)
>> > >
>> > > diff --git a/arch/arm/kvm/mmio.c b/arch/arm/kvm/mmio.c
>> > > index 4cb5a93..1d17831 100644
>> > > --- a/arch/arm/kvm/mmio.c
>> > > +++ b/arch/arm/kvm/mmio.c
>> > > @@ -162,6 +162,35 @@ static int decode_hsr(struct kvm_vcpu *vcpu, 
>> > > phys_addr_t fault_ipa,
>> > >   return 0;
>> > >  }
>> > >
>> > > +/**
>> > > + * kvm_handle_mmio - handle an in-kernel MMIO access
>> > > + * @vcpu:pointer to the vcpu performing the access
>> > > + * @run: pointer to the kvm_run structure
>> > > + * @mmio:pointer to the data describing the access
>> > > + *
>> > > + * returns true if the MMIO access has been performed in kernel space,
>> > > + * and false if it needs to be emulated in user space.
>> > > + */
>> > > +static bool handle_kernel_mmio(struct kvm_vcpu *vcpu, struct kvm_run 
>> > > *run,
>> > > + struct kvm_exit_mmio *mmio)
>> > > +{
>> > > + int ret;
>> > > + if (mmio->is_write) {
>> > > + ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, 
>> > > mmio->phys_addr,
>> > > + mmio->len, &mmio->data);
>> > > +
>> > > + } else {
>> > > + ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, 
>> > > mmio->phys_addr,
>> > > + mmio->len, &mmio->data);
>> > > + }
>> > > + if (!ret) {
>> > > + kvm_prepare_mmio(run, mmio);
>> > > + kvm_handle_mmio_return(vcpu, run);
>> > > + }
>> > > +
>> > > + return !ret;
>> > > +}
>> > > +
>> > >  int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
>> > >phys_addr_t fault_ipa)
>> > >  {
>> > > @@ -200,6 +229,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct 
>> > > kvm_run *run,
>> > >   if (vgic_handle_mmio(vcpu, run, &mmio))
>> > >   return 1;
>> > >
>> > > + if (handle_kernel_mmio(vcpu, run, &mmio))
>> > > + return 1;
>> > > +
>>
>>
>> We're reconsidering ioeventfds patchseries and we tried to evaluate
>> what you suggested here.
>>
>> >
>> > this special-casing of the vgic is now really terrible.  Is there
>> > anything holding you back from doing the necessary restructure of the
>> > kvm_bus_io_*() API instead?
>>
>> Restructuring the kvm_io_bus_ API is not a big thing (we actually did
>> it), but is not directly related to the these patches.
>> Of course it can be justified if we do it in the context of removing
>> vgic_handle_mmio and leaving only handle_kernel_mmio.
>>
>> >
>> > That would allow us to get rid of the ugly
>> > Fix it! in the vgic driver as well.
>>
>> Going through the vgic_handle_mmio we see that it will require large
>> refactoring:
>>  - there are 15 MMIO ranges for the vgic now - each should be
>> registered as a separate device
>>  - the handler of each range should be split into read and write
>>  - all handlers take 'struct kvm_exit_mmio', and pass it to
>> 'vgic_reg_access', 'mmio_data_read' and 'mmio_data_read'
>>
>> To sum up - if we do this refactoring of vgic's MMIO handling +
>> kvm_io_bus_ API getting 'vcpu" argument we'll get a 'much' cleaner
>> vgic code and as a bonus we'll get 'ioeventfd' capabilities.
>>
>> We have 3 questions:
>>  - is the kvm_io_bus_ getting 'vcpu' argument acceptable for the other
>> architectures too?
>>  - is this huge vgic MMIO handling redesign acceptable/desired (it
>> touches a lot of code)?
>>  - is there a way that ioeventfd is accepted leaving vgic.c in it's
>> current state?
>>
> Not sure how the latter question is relevant to this, but check with
> Andre who recently looked at this as well and decided that for GICv3 the
> only sane thing was to remove that comment for the gic.
@Andre - what's your experience with the GICv3 and MMIO handling,
anything specific?
>
> I don't recall the details of what you were trying to accomplish here
> (it's been 8 months or so) but the surely the vgic handling code should
> *somehow* be integrated into the handle_kernel_mmio (like Paolo
> suggested), unless you come back and tell me that that would involve a
> complete rewrite of the vgic

Re: [PATCH] kvm: svm: move WARN_ON in svm_adjust_tsc_offset

2014-11-13 Thread Paolo Bonzini

On 13/11/2014 04:00, Chris J Arges wrote:
> When running the tsc_adjust kvm-unit-test on an AMD processor with the
> IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be
> triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc
> is called; however it may trigger unnecessary warnings.
> 
> This patch moves the WARN_ON to trigger only if __scale_tsc will actually be
> called from svm_adjust_tsc_offset. In addition make adj in kvm_set_msr_common
> s64 since this can have signed values.
> 
> Signed-off-by: Chris J Arges 
> ---
>  arch/x86/kvm/svm.c | 8 +---
>  arch/x86/kvm/x86.c | 2 +-
>  2 files changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d4f3aaa..6b411ad 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1056,9 +1056,11 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu 
> *vcpu, s64 adjustment, bool ho
>  {
>   struct vcpu_svm *svm = to_svm(vcpu);
>  
> - WARN_ON(adjustment < 0);
> - if (host)
> - adjustment = svm_scale_tsc(vcpu, adjustment);
> + if (host) {
> + if (svm->tsc_ratio != TSC_RATIO_DEFAULT)
> + WARN_ON(adjustment < 0);
> + adjustment = svm_scale_tsc(vcpu, (u64)adjustment);
> + }
>  
>   svm->vmcb->control.tsc_offset += adjustment;
>   if (is_guest_mode(vcpu))
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f85da5c..1cd1376 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2141,7 +2141,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
> msr_data *msr_info)
>   case MSR_IA32_TSC_ADJUST:
>   if (guest_cpuid_has_tsc_adjust(vcpu)) {
>   if (!msr_info->host_initiated) {
> - u64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
> + s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
>   kvm_x86_ops->adjust_tsc_offset(vcpu, adj, true);
>   }
>   vcpu->arch.ia32_tsc_adjust_msr = data;
> 

Applied, thanks!

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Borislav Petkov

On Thu, Nov 13, 2014 at 12:31:30AM +, Luck, Tony wrote:
> > Is this something I can try under KVM?
> 
> I don't know if KVM has a way to simulate a machine check event.

I've been thinking about it recently too - adding MCA functionality to
qemu/kvm could be very useful, especially the thresholding stuff, for
testing RAS kernel code.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Christoffer Dall

On Thu, Nov 13, 2014 at 12:45:42PM +0200, Nikolay Nikolaev wrote:

[...]

> >>
> >> Going through the vgic_handle_mmio we see that it will require large
> >> refactoring:
> >>  - there are 15 MMIO ranges for the vgic now - each should be
> >> registered as a separate device
> >>  - the handler of each range should be split into read and write
> >>  - all handlers take 'struct kvm_exit_mmio', and pass it to
> >> 'vgic_reg_access', 'mmio_data_read' and 'mmio_data_read'
> >>
> >> To sum up - if we do this refactoring of vgic's MMIO handling +
> >> kvm_io_bus_ API getting 'vcpu" argument we'll get a 'much' cleaner
> >> vgic code and as a bonus we'll get 'ioeventfd' capabilities.
> >>
> >> We have 3 questions:
> >>  - is the kvm_io_bus_ getting 'vcpu' argument acceptable for the other
> >> architectures too?
> >>  - is this huge vgic MMIO handling redesign acceptable/desired (it
> >> touches a lot of code)?
> >>  - is there a way that ioeventfd is accepted leaving vgic.c in it's
> >> current state?
> >>
> > Not sure how the latter question is relevant to this, but check with
> > Andre who recently looked at this as well and decided that for GICv3 the
> > only sane thing was to remove that comment for the gic.
> @Andre - what's your experience with the GICv3 and MMIO handling,
> anything specific?
> >
> > I don't recall the details of what you were trying to accomplish here
> > (it's been 8 months or so) but the surely the vgic handling code should
> > *somehow* be integrated into the handle_kernel_mmio (like Paolo
> > suggested), unless you come back and tell me that that would involve a
> > complete rewrite of the vgic code.
> I'm experimenting now - it's not exactly rewrite of whole vgic code,
> but it will touch a lot of it  - all MMIO access handlers and the
> supporting functions.
> We're ready to spend the effort. My question is  - is this acceptable?
> 
I certainly appreciate the offer to do this work, but it's hard to say
at this point if it is worth it.

What I was trying to say above is that Andre looked at this, and came to
the conclusion that it is not worth it.

Marc, what are your thoughts?

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Christoffer Dall

[resending to Andre's actual e-mail address]

On Thu, Nov 13, 2014 at 12:20 PM, Christoffer Dall
 wrote:
> On Thu, Nov 13, 2014 at 12:45:42PM +0200, Nikolay Nikolaev wrote:
>
> [...]
>
>> >>
>> >> Going through the vgic_handle_mmio we see that it will require large
>> >> refactoring:
>> >>  - there are 15 MMIO ranges for the vgic now - each should be
>> >> registered as a separate device
>> >>  - the handler of each range should be split into read and write
>> >>  - all handlers take 'struct kvm_exit_mmio', and pass it to
>> >> 'vgic_reg_access', 'mmio_data_read' and 'mmio_data_read'
>> >>
>> >> To sum up - if we do this refactoring of vgic's MMIO handling +
>> >> kvm_io_bus_ API getting 'vcpu" argument we'll get a 'much' cleaner
>> >> vgic code and as a bonus we'll get 'ioeventfd' capabilities.
>> >>
>> >> We have 3 questions:
>> >>  - is the kvm_io_bus_ getting 'vcpu' argument acceptable for the other
>> >> architectures too?
>> >>  - is this huge vgic MMIO handling redesign acceptable/desired (it
>> >> touches a lot of code)?
>> >>  - is there a way that ioeventfd is accepted leaving vgic.c in it's
>> >> current state?
>> >>
>> > Not sure how the latter question is relevant to this, but check with
>> > Andre who recently looked at this as well and decided that for GICv3 the
>> > only sane thing was to remove that comment for the gic.
>> @Andre - what's your experience with the GICv3 and MMIO handling,
>> anything specific?
>> >
>> > I don't recall the details of what you were trying to accomplish here
>> > (it's been 8 months or so) but the surely the vgic handling code should
>> > *somehow* be integrated into the handle_kernel_mmio (like Paolo
>> > suggested), unless you come back and tell me that that would involve a
>> > complete rewrite of the vgic code.
>> I'm experimenting now - it's not exactly rewrite of whole vgic code,
>> but it will touch a lot of it  - all MMIO access handlers and the
>> supporting functions.
>> We're ready to spend the effort. My question is  - is this acceptable?
>>
> I certainly appreciate the offer to do this work, but it's hard to say
> at this point if it is worth it.
>
> What I was trying to say above is that Andre looked at this, and came to
> the conclusion that it is not worth it.
>
> Marc, what are your thoughts?
>
> -Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Andrew Jones

On Thu, Nov 13, 2014 at 11:44:02AM +0100, Andrew Jones wrote:
> On Thu, Nov 13, 2014 at 09:40:41AM +0100, Andrew Jones wrote:
> > On Thu, Nov 13, 2014 at 12:44:39AM -0200, Marcelo Tosatti wrote:
> > > 
> > > The periodic kvmclock sync can be an undesired source of latencies.
> > > 
> > > Signed-off-by: Marcelo Tosatti 
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 0033df3..be56fd3 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -98,6 +98,9 @@ module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
> > >  unsigned int min_timer_period_us = 500;
> > >  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> > >  
> > > +static bool kvmclock_periodic_sync = 1;
> > 
> > Using 'true' would look nicer.
> 
> Ahh, disregard this comment. 1 matches what the user would input.
> 
> > 
> > > +module_param(kvmclock_periodic_sync, bool, S_IRUGO | S_IWUSR);
> > > +
> > >  bool kvm_has_tsc_control;
> > >  EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
> > >  u32  kvm_max_guest_tsc_khz;
> > > @@ -1718,7 +1721,8 @@ static void kvmclock_sync_fn(struct work_struct 
> > > *work)
> > >   struct kvm *kvm = container_of(ka, struct kvm, arch);
> > >  
> > >   schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> > > - schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > > + if (kvmclock_periodic_sync)
> > > + schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > >   KVMCLOCK_SYNC_PERIOD);
> > >  }
> > 
> > The above hunk shouldn't be necessary, as we'll never get there if we
> > don't do the first scheduling with the below hunk.
> 
> Disregard this comment too. I didn't pay enough attention to the module
> param permissions. We definitely need this here to modify behaviour of
> running VMs when the parameter gets updated with writes to sysfs.
> 
> > 
> > >  
> > > @@ -6971,7 +6975,8 @@ int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
> > >   kvm_write_tsc(vcpu, &msr);
> > >   vcpu_put(vcpu);
> > >  
> > > - schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > > + if (kvmclock_periodic_sync)
> > > + schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
> > >   KVMCLOCK_SYNC_PERIOD);
> > >  
> > >   return r;
> > > 
> > > 
> 
> But... if the kvmclock_periodic_sync is false here, then it won't matter
> if we turn it on later. Maybe we don't care about that, but if we do,
> then we should remove this hunk, and also change the hunk above to be
> 
> @@ -1717,6 +1717,9 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  kvmclock_sync_work);
>   struct kvm *kvm = container_of(ka, struct kvm, arch);
>  
> + if (!kvmclock_periodic_sync)
> + return;
> +
>   schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
>   schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
>   KVMCLOCK_SYNC_PERIOD);
> 

Oh man... I'll reply correctly eventually. The above should of course be

@@ -1717,7 +1717,8 @@ static void kvmclock_sync_fn(struct work_struct *work)
   kvmclock_sync_work);
struct kvm *kvm = container_of(ka, struct kvm, arch);
 
-   schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
+   if (kvmclock_periodic_sync)
+   schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
KVMCLOCK_SYNC_PERIOD);
 }

> 
> > 
> > I'm not opposed to making this optional, but just curious. Were
> > general use cases getting adversely affected? Or is this part of
> > some RT work trying to kill as many sources of asynchronous latency
> > as possible?
> > 
> > drew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Andre Przywara

Hi Nikolay,

On 13/11/14 11:37, Marc Zyngier wrote:
> [fixing Andre's email address]
> 
> On 13/11/14 11:20, Christoffer Dall wrote:
>> On Thu, Nov 13, 2014 at 12:45:42PM +0200, Nikolay Nikolaev wrote:
>>
>> [...]
>>
>
> Going through the vgic_handle_mmio we see that it will require large
> refactoring:
>  - there are 15 MMIO ranges for the vgic now - each should be
> registered as a separate device
>  - the handler of each range should be split into read and write
>  - all handlers take 'struct kvm_exit_mmio', and pass it to
> 'vgic_reg_access', 'mmio_data_read' and 'mmio_data_read'
>
> To sum up - if we do this refactoring of vgic's MMIO handling +
> kvm_io_bus_ API getting 'vcpu" argument we'll get a 'much' cleaner
> vgic code and as a bonus we'll get 'ioeventfd' capabilities.
>
> We have 3 questions:
>  - is the kvm_io_bus_ getting 'vcpu' argument acceptable for the other
> architectures too?
>  - is this huge vgic MMIO handling redesign acceptable/desired (it
> touches a lot of code)?
>  - is there a way that ioeventfd is accepted leaving vgic.c in it's
> current state?
>
 Not sure how the latter question is relevant to this, but check with
 Andre who recently looked at this as well and decided that for GICv3 the
 only sane thing was to remove that comment for the gic.
>>> @Andre - what's your experience with the GICv3 and MMIO handling,
>>> anything specific?

 I don't recall the details of what you were trying to accomplish here
 (it's been 8 months or so) but the surely the vgic handling code should
 *somehow* be integrated into the handle_kernel_mmio (like Paolo
 suggested), unless you come back and tell me that that would involve a
 complete rewrite of the vgic code.
>>> I'm experimenting now - it's not exactly rewrite of whole vgic code,
>>> but it will touch a lot of it  - all MMIO access handlers and the
>>> supporting functions.
>>> We're ready to spend the effort. My question is  - is this acceptable?
>>>
>> I certainly appreciate the offer to do this work, but it's hard to say
>> at this point if it is worth it.
>>
>> What I was trying to say above is that Andre looked at this, and came to
>> the conclusion that it is not worth it.
>>
>> Marc, what are your thoughts?
> 
> Same here, I rely on Andre's view that it was not very useful. Now, it
> would be good to see a mock-up of the patches and find out:

Seconded, can you send a pointer to the VGIC rework patches mentioned?

> - if it is a major improvement for the general quality of the code
> - if that allow us to *delete* a lot of code (if it is just churn, I'm
> not really interested)
> - if it helps or hinders further developments that are currently in the
> pipeline
> 
> Andre, can you please share your findings? I don't remember the
> specifics of the discussion we had a few months ago...

1) Given the date in the reply I sense that your patches are from March
this year or earlier. So this is based on VGIC code from March, which
predates Marc's vgic_dyn changes that just went in 3.18-rc1? His patches
introduced another member to struct mmio_range to check validity of
accesses with a reduced number of SPIs supported (.bits_per_irq).
So is this covered in your rework?

2)
>>>  - there are 15 MMIO ranges for the vgic now - each should be

Well, the GICv3 emulation adds 41 new ranges. Not sure if this still fits.

>>> registered as a separate device

I found this fact a show-stopper when looking at this a month ago.
Somehow it feels wrong to register a bunch of pseudo-devices. I could go
with registering a small number of regions (one distributor, two
redistributor regions for instance), but not handling every single of
the 41 + 15 register "groups" as a device.

Also I wasn't sure if we had to expose some of the vGIC structures to
the other KVM code layers.

But I am open to any suggestions (as long as they go in _after_ my
vGICv3 series ;-)  - so looking forward to some repo to see how it looks
like.

Cheers,
Andre.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net] Revert "drivers/net: Disable UFO through virtio" in macvtap and tun

2014-11-13 Thread Jelle de Jong

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 12/11/14 20:02, Stefan Hajnoczi wrote:
> On Tue, Nov 11, 2014 at 05:12:58PM +, Ben Hutchings wrote:
>> This reverts commit 88e0e0e5aa722b193c8758c8b45d041de5316924 for 
>> the tap drivers, but leaves UFO disabled in virtio_net.
>> 
>> libvirt at least assumes that tap features will never be dropped 
>> in new kernel versions, and doing so prevents migration of VMs
>> to the never kernel version while they are running with virtio
>> net devices.
>> 
>> Fixes: 88e0e0e5aa7a ("drivers/net: Disable UFO through virtio") 
>> Signed-off-by: Ben Hutchings  --- 
>> Compile-tested only.
> 
> Jelle, care to test this and reply with "Tested-by: Jelle de Jong 
> " if it solves the live migration
> problem you reported?
> 
> It requires applying the patch to the host kernel on your virt01
> host.

If someone can provide an x86_64 kernel package for debian wheezy (or
put it in wheezy-backports I would be willing to test it and report (I
cant set-up a cross-build building system right now). I'm currently
planning on taking four virtualisation platform down this weekend to
reboot all systems and guests to work around the issue that already
hit production.

Many thanks for picking up the problem and working on a solution.

Kind regards,

Jelle de Jong
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.12 (GNU/Linux)

iJwEAQECAAYFAlRknWIACgkQ1WclBW9j5HlBLwP/aMoJy9Jgs6M6nuQXtWOPZZ12
N30le4kj+s6AP7Bt5j9vDjJf1/B0bdKbykEvUFlW0xbrD7YGFZm0HflM0lqwZ6lZ
lql9mqrcooumYZuDt/CxBHsUlRIbiR/Bzd4qdkCkpxHo7aXLRk0HyPsK4XG6OulN
4MVHRR8W8Yk87FBDLCw=
=DwU5
-END PGP SIGNATURE-
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Nikolay Nikolaev

On Thu, Nov 13, 2014 at 1:52 PM, Andre Przywara  wrote:
> Hi Nikolay,
>
> On 13/11/14 11:37, Marc Zyngier wrote:
>> [fixing Andre's email address]
>>
>> On 13/11/14 11:20, Christoffer Dall wrote:
>>> On Thu, Nov 13, 2014 at 12:45:42PM +0200, Nikolay Nikolaev wrote:
>>>
>>> [...]
>>>
>>
>> Going through the vgic_handle_mmio we see that it will require large
>> refactoring:
>>  - there are 15 MMIO ranges for the vgic now - each should be
>> registered as a separate device
>>  - the handler of each range should be split into read and write
>>  - all handlers take 'struct kvm_exit_mmio', and pass it to
>> 'vgic_reg_access', 'mmio_data_read' and 'mmio_data_read'
>>
>> To sum up - if we do this refactoring of vgic's MMIO handling +
>> kvm_io_bus_ API getting 'vcpu" argument we'll get a 'much' cleaner
>> vgic code and as a bonus we'll get 'ioeventfd' capabilities.
>>
>> We have 3 questions:
>>  - is the kvm_io_bus_ getting 'vcpu' argument acceptable for the other
>> architectures too?
>>  - is this huge vgic MMIO handling redesign acceptable/desired (it
>> touches a lot of code)?
>>  - is there a way that ioeventfd is accepted leaving vgic.c in it's
>> current state?
>>
> Not sure how the latter question is relevant to this, but check with
> Andre who recently looked at this as well and decided that for GICv3 the
> only sane thing was to remove that comment for the gic.
 @Andre - what's your experience with the GICv3 and MMIO handling,
 anything specific?
>
> I don't recall the details of what you were trying to accomplish here
> (it's been 8 months or so) but the surely the vgic handling code should
> *somehow* be integrated into the handle_kernel_mmio (like Paolo
> suggested), unless you come back and tell me that that would involve a
> complete rewrite of the vgic code.
 I'm experimenting now - it's not exactly rewrite of whole vgic code,
 but it will touch a lot of it  - all MMIO access handlers and the
 supporting functions.
 We're ready to spend the effort. My question is  - is this acceptable?

>>> I certainly appreciate the offer to do this work, but it's hard to say
>>> at this point if it is worth it.
>>>
>>> What I was trying to say above is that Andre looked at this, and came to
>>> the conclusion that it is not worth it.
>>>
>>> Marc, what are your thoughts?
>>
>> Same here, I rely on Andre's view that it was not very useful. Now, it
>> would be good to see a mock-up of the patches and find out:
>
> Seconded, can you send a pointer to the VGIC rework patches mentioned?
They are still in WiP state - not exactly working. I'm still exploring
what the status is.

Our major target is having ioeventfd suport in ARM. For this we need
to support kvm_io_bus_ mechanisms for MMIO access (cause ioevent fd
device is registered this way). Then this subject of integrating vgic
with the kvm_io_bus_ APIs came up.
My personal opinion - they should be able to coexist in peace.

>
>> - if it is a major improvement for the general quality of the code
>> - if that allow us to *delete* a lot of code (if it is just churn, I'm
>> not really interested)
>> - if it helps or hinders further developments that are currently in the
>> pipeline
>>
>> Andre, can you please share your findings? I don't remember the
>> specifics of the discussion we had a few months ago...
>
> 1) Given the date in the reply I sense that your patches are from March
> this year or earlier. So this is based on VGIC code from March, which
> predates Marc's vgic_dyn changes that just went in 3.18-rc1? His patches
> introduced another member to struct mmio_range to check validity of
> accesses with a reduced number of SPIs supported (.bits_per_irq).
> So is this covered in your rework?
Still no (rebased to 3.17) - didn't see it, but should not be an issue.
>
> 2)
  - there are 15 MMIO ranges for the vgic now - each should be
>
> Well, the GICv3 emulation adds 41 new ranges. Not sure if this still fits.
>
 registered as a separate device
>
> I found this fact a show-stopper when looking at this a month ago.
> Somehow it feels wrong to register a bunch of pseudo-devices. I could go
> with registering a small number of regions (one distributor, two
> redistributor regions for instance), but not handling every single of
> the 41 + 15 register "groups" as a device.
Do you sense performance issues, or just "it's not right"?
Maybe kvm_io_bus_ needs some extesion to hanlde a device with multiple regions?

>
> Also I wasn't sure if we had to expose some of the vGIC structures to
> the other KVM code layers.
I don't see such a need. Can you point some example?
>
> But I am open to any suggestions (as long as they go in _after_ my
> vGICv3 series ;-)  - so looking forward to some repo to see how it looks
> like.
There is still nothing much to show - but if there is interest we may
prepare something that shows the i

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Andre Przywara

Hi Nikolay,

On 13/11/14 12:29, Nikolay Nikolaev wrote:
> On Thu, Nov 13, 2014 at 1:52 PM, Andre Przywara  
> wrote:
>> Hi Nikolay,
>>
>> On 13/11/14 11:37, Marc Zyngier wrote:
>>> [fixing Andre's email address]
>>>
>>> On 13/11/14 11:20, Christoffer Dall wrote:
 On Thu, Nov 13, 2014 at 12:45:42PM +0200, Nikolay Nikolaev wrote:

 [...]

>>>
>>> Going through the vgic_handle_mmio we see that it will require large
>>> refactoring:
>>>  - there are 15 MMIO ranges for the vgic now - each should be
>>> registered as a separate device
>>>  - the handler of each range should be split into read and write
>>>  - all handlers take 'struct kvm_exit_mmio', and pass it to
>>> 'vgic_reg_access', 'mmio_data_read' and 'mmio_data_read'
>>>
>>> To sum up - if we do this refactoring of vgic's MMIO handling +
>>> kvm_io_bus_ API getting 'vcpu" argument we'll get a 'much' cleaner
>>> vgic code and as a bonus we'll get 'ioeventfd' capabilities.
>>>
>>> We have 3 questions:
>>>  - is the kvm_io_bus_ getting 'vcpu' argument acceptable for the other
>>> architectures too?
>>>  - is this huge vgic MMIO handling redesign acceptable/desired (it
>>> touches a lot of code)?
>>>  - is there a way that ioeventfd is accepted leaving vgic.c in it's
>>> current state?
>>>
>> Not sure how the latter question is relevant to this, but check with
>> Andre who recently looked at this as well and decided that for GICv3 the
>> only sane thing was to remove that comment for the gic.
> @Andre - what's your experience with the GICv3 and MMIO handling,
> anything specific?
>>
>> I don't recall the details of what you were trying to accomplish here
>> (it's been 8 months or so) but the surely the vgic handling code should
>> *somehow* be integrated into the handle_kernel_mmio (like Paolo
>> suggested), unless you come back and tell me that that would involve a
>> complete rewrite of the vgic code.
> I'm experimenting now - it's not exactly rewrite of whole vgic code,
> but it will touch a lot of it  - all MMIO access handlers and the
> supporting functions.
> We're ready to spend the effort. My question is  - is this acceptable?
>
 I certainly appreciate the offer to do this work, but it's hard to say
 at this point if it is worth it.

 What I was trying to say above is that Andre looked at this, and came to
 the conclusion that it is not worth it.

 Marc, what are your thoughts?
>>>
>>> Same here, I rely on Andre's view that it was not very useful. Now, it
>>> would be good to see a mock-up of the patches and find out:
>>
>> Seconded, can you send a pointer to the VGIC rework patches mentioned?
> They are still in WiP state - not exactly working. I'm still exploring
> what the status is.
> 
> Our major target is having ioeventfd suport in ARM. For this we need
> to support kvm_io_bus_ mechanisms for MMIO access (cause ioevent fd
> device is registered this way). Then this subject of integrating vgic
> with the kvm_io_bus_ APIs came up.
> My personal opinion - they should be able to coexist in peace.
> 
>>
>>> - if it is a major improvement for the general quality of the code
>>> - if that allow us to *delete* a lot of code (if it is just churn, I'm
>>> not really interested)
>>> - if it helps or hinders further developments that are currently in the
>>> pipeline
>>>
>>> Andre, can you please share your findings? I don't remember the
>>> specifics of the discussion we had a few months ago...
>>
>> 1) Given the date in the reply I sense that your patches are from March
>> this year or earlier. So this is based on VGIC code from March, which
>> predates Marc's vgic_dyn changes that just went in 3.18-rc1? His patches
>> introduced another member to struct mmio_range to check validity of
>> accesses with a reduced number of SPIs supported (.bits_per_irq).
>> So is this covered in your rework?
> Still no (rebased to 3.17) - didn't see it, but should not be an issue.
>>
>> 2)
>  - there are 15 MMIO ranges for the vgic now - each should be
>>
>> Well, the GICv3 emulation adds 41 new ranges. Not sure if this still fits.
>>
> registered as a separate device
>>
>> I found this fact a show-stopper when looking at this a month ago.
>> Somehow it feels wrong to register a bunch of pseudo-devices. I could go
>> with registering a small number of regions (one distributor, two
>> redistributor regions for instance), but not handling every single of
>> the 41 + 15 register "groups" as a device.
> Do you sense performance issues, or just "it's not right"?

Just "not right", since they are no 15 devices for a GICv2 emulation.

> Maybe kvm_io_bus_ needs some extesion to hanlde a device with multiple 
> regions?

Well, maybe a simple rename could fix this, but I am not sure it still
fits then. I am just afraid we end up with quite some code duplication
in each handler function

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Michael Tokarev

13.11.2014 05:44, Marcelo Tosatti wrote:
> The periodic kvmclock sync can be an undesired source of latencies.

Shouldn't this be a per-vm property, not global host property?
Maybe it's better to control frequency of syncs (with 0=disabled)?

Thanks,

/mjt
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] virt: kvm: arm: vgic: Process the failure case when kvm_register_device_ops() fails

2014-11-13 Thread Chen Gang

When kvm_register_device_ops() fails, also need call free_percpu_irq()
just like others have down within kvm_vgic_hyp_init().

Signed-off-by: Chen Gang 
---
 virt/kvm/arm/vgic.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index 3aaca49..b799f17 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -2470,8 +2470,14 @@ int kvm_vgic_hyp_init(void)
 
on_each_cpu(vgic_init_maintenance_interrupt, NULL, 1);
 
-   return kvm_register_device_ops(&kvm_arm_vgic_v2_ops,
-  KVM_DEV_TYPE_ARM_VGIC_V2);
+   ret = kvm_register_device_ops(&kvm_arm_vgic_v2_ops,
+ KVM_DEV_TYPE_ARM_VGIC_V2);
+   if (ret) {
+   kvm_err("Cannot register device ops\n");
+   goto out_free_irq;
+   }
+
+   return 0;
 
 out_free_irq:
free_percpu_irq(vgic->maint_irq, kvm_get_running_vcpus());
-- 
1.9.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Nikolay Nikolaev

On Thu, Nov 13, 2014 at 4:23 PM, Eric Auger  wrote:
> On 11/13/2014 03:16 PM, Eric Auger wrote:
>> On 11/13/2014 11:45 AM, Nikolay Nikolaev wrote:
>>> On Mon, Nov 10, 2014 at 6:27 PM, Christoffer Dall
>>>  wrote:
 On Mon, Nov 10, 2014 at 05:09:07PM +0200, Nikolay Nikolaev wrote:
> Hello,
>
> On Fri, Mar 28, 2014 at 9:09 PM, Christoffer Dall
>  wrote:
>>
>> On Thu, Mar 13, 2014 at 04:57:26PM +0100, Antonios Motakis wrote:
>>> On an unhandled IO memory abort, use the kvm_io_bus_* API in order to
>>> handle the MMIO access through any registered read/write callbacks. This
>>> is a dependency for eventfd support (ioeventfd and irqfd).
>>>
>>> However, accesses to the VGIC are still left implemented independently,
>>> since the kvm_io_bus_* API doesn't pass the VCPU pointer doing the 
>>> access.
>>>
>>> Signed-off-by: Antonios Motakis 
>>> Signed-off-by: Nikolay Nikolaev 
>>> ---
>>>  arch/arm/kvm/mmio.c | 32 
>>>  virt/kvm/arm/vgic.c |  5 -
>>>  2 files changed, 36 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm/kvm/mmio.c b/arch/arm/kvm/mmio.c
>>> index 4cb5a93..1d17831 100644
>>> --- a/arch/arm/kvm/mmio.c
>>> +++ b/arch/arm/kvm/mmio.c
>>> @@ -162,6 +162,35 @@ static int decode_hsr(struct kvm_vcpu *vcpu, 
>>> phys_addr_t fault_ipa,
>>>   return 0;
>>>  }
>>>
>>> +/**
>>> + * kvm_handle_mmio - handle an in-kernel MMIO access
>>> + * @vcpu:pointer to the vcpu performing the access
>>> + * @run: pointer to the kvm_run structure
>>> + * @mmio:pointer to the data describing the access
>>> + *
>>> + * returns true if the MMIO access has been performed in kernel space,
>>> + * and false if it needs to be emulated in user space.
>>> + */
>>> +static bool handle_kernel_mmio(struct kvm_vcpu *vcpu, struct kvm_run 
>>> *run,
>>> + struct kvm_exit_mmio *mmio)
>>> +{
>>> + int ret;
>>> + if (mmio->is_write) {
>>> + ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, 
>>> mmio->phys_addr,
>>> + mmio->len, &mmio->data);
>>> +
>>> + } else {
>>> + ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, 
>>> mmio->phys_addr,
>>> + mmio->len, &mmio->data);
>>> + }
>>> + if (!ret) {
>>> + kvm_prepare_mmio(run, mmio);
>>> + kvm_handle_mmio_return(vcpu, run);
>>> + }
>>> +
>>> + return !ret;
>>> +}
>>> +
>>>  int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
>>>phys_addr_t fault_ipa)
>>>  {
>>> @@ -200,6 +229,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct 
>>> kvm_run *run,
>>>   if (vgic_handle_mmio(vcpu, run, &mmio))
>>>   return 1;
>>>
>>> + if (handle_kernel_mmio(vcpu, run, &mmio))
>>> + return 1;
>>> +
>
>
> We're reconsidering ioeventfds patchseries and we tried to evaluate
> what you suggested here.
>
>>
>> this special-casing of the vgic is now really terrible.  Is there
>> anything holding you back from doing the necessary restructure of the
>> kvm_bus_io_*() API instead?
>
> Restructuring the kvm_io_bus_ API is not a big thing (we actually did
> it), but is not directly related to the these patches.
> Of course it can be justified if we do it in the context of removing
> vgic_handle_mmio and leaving only handle_kernel_mmio.
>
>>
>> That would allow us to get rid of the ugly
>> Fix it! in the vgic driver as well.
>
> Going through the vgic_handle_mmio we see that it will require large
> refactoring:
>  - there are 15 MMIO ranges for the vgic now - each should be
> registered as a separate device
> Re-correcting Andre's address, sorry:
> Hi Nikolay, Andre,
>
> what does mandate to register 15 devices? Isn't possible to register a
> single kvm_io_device covering the whole distributor range [base, base +
> KVM_VGIC_V2_DIST_SIZE] (current code) and in associated
> kvm_io_device_ops read/write locate the addressed range and do the same
> as what is done in current vgic_handle_mmio? Isn't it done that way for

Well, then we'll actually get slower mmio processing. Instead of calling
vgic_handle_mmio in io_mem_abort, we'll be calling kvm_io_bus_write.
This just adds another level of translation (i.e. find the kvm_io_ device)
and the underlying vgic code will remain almost the same.

> the ioapic? what do I miss?
I looked quickly in the ioapic code, and if I get it right there are no "ranges'
like what we have with the GIC. They have this regselect/regwindow concept
and they seem to have much less "registers" to handle. GIC seems a lot more
complex in terms of MMIO interface.

regards,
Nikolay Ni

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Christoffer Dall

Hi Nikolay,

On Thu, Nov 13, 2014 at 4:02 PM, Nikolay Nikolaev
 wrote:
[...]

>>
>>
>> We're reconsidering ioeventfds patchseries and we tried to evaluate
>> what you suggested here.
>>
>>>
>>> this special-casing of the vgic is now really terrible.  Is there
>>> anything holding you back from doing the necessary restructure of the
>>> kvm_bus_io_*() API instead?
>>
>> Restructuring the kvm_io_bus_ API is not a big thing (we actually did
>> it), but is not directly related to the these patches.
>> Of course it can be justified if we do it in the context of removing
>> vgic_handle_mmio and leaving only handle_kernel_mmio.
>>
>>>
>>> That would allow us to get rid of the ugly
>>> Fix it! in the vgic driver as well.
>>
>> Going through the vgic_handle_mmio we see that it will require large
>> refactoring:
>>  - there are 15 MMIO ranges for the vgic now - each should be
>> registered as a separate device
>> Re-correcting Andre's address, sorry:
>> Hi Nikolay, Andre,
>>
>> what does mandate to register 15 devices? Isn't possible to register a
>> single kvm_io_device covering the whole distributor range [base, base +
>> KVM_VGIC_V2_DIST_SIZE] (current code) and in associated
>> kvm_io_device_ops read/write locate the addressed range and do the same
>> as what is done in current vgic_handle_mmio? Isn't it done that way for
>
> Well, then we'll actually get slower mmio processing. Instead of calling
> vgic_handle_mmio in io_mem_abort, we'll be calling kvm_io_bus_write.
> This just adds another level of translation (i.e. find the kvm_io_ device)
> and the underlying vgic code will remain almost the same.
>

Define slower please.  Have you measured this?  With my ideas about
where we are spending overhead on a world-switch in this system,
looping through a few ranges is going to be infinitesimal, but as I
said, we would need to measure it before using it as an argument to
structure the code in a certain way, unless of course we're obviously
doing O(n^2) operations or something idiotic like that.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] virt: kvm: arm: vgic: Process the failure case when kvm_register_device_ops() fails

2014-11-13 Thread Marc Zyngier

On 13/11/14 15:04, Chen Gang wrote:
> When kvm_register_device_ops() fails, also need call free_percpu_irq()
> just like others have down within kvm_vgic_hyp_init().
> 
> Signed-off-by: Chen Gang 
> ---
>  virt/kvm/arm/vgic.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 3aaca49..b799f17 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -2470,8 +2470,14 @@ int kvm_vgic_hyp_init(void)
>  
>   on_each_cpu(vgic_init_maintenance_interrupt, NULL, 1);
>  
> - return kvm_register_device_ops(&kvm_arm_vgic_v2_ops,
> -KVM_DEV_TYPE_ARM_VGIC_V2);
> + ret = kvm_register_device_ops(&kvm_arm_vgic_v2_ops,
> +   KVM_DEV_TYPE_ARM_VGIC_V2);
> + if (ret) {
> + kvm_err("Cannot register device ops\n");
> + goto out_free_irq;
> + }
> +
> + return 0;
>  
>  out_free_irq:
>   free_percpu_irq(vgic->maint_irq, kvm_get_running_vcpus());
> 

Awesome. You're now freeing a per-cpu interrupt after just after having
enabled it on all CPUs. What could possibly go wrong?

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Andre Przywara



On 13/11/14 15:02, Nikolay Nikolaev wrote:
> On Thu, Nov 13, 2014 at 4:23 PM, Eric Auger  wrote:
>> On 11/13/2014 03:16 PM, Eric Auger wrote:
>>> On 11/13/2014 11:45 AM, Nikolay Nikolaev wrote:
 On Mon, Nov 10, 2014 at 6:27 PM, Christoffer Dall
  wrote:
> On Mon, Nov 10, 2014 at 05:09:07PM +0200, Nikolay Nikolaev wrote:
>> Hello,
>>
>> On Fri, Mar 28, 2014 at 9:09 PM, Christoffer Dall
>>  wrote:
>>>
>>> On Thu, Mar 13, 2014 at 04:57:26PM +0100, Antonios Motakis wrote:
 On an unhandled IO memory abort, use the kvm_io_bus_* API in order to
 handle the MMIO access through any registered read/write callbacks. 
 This
 is a dependency for eventfd support (ioeventfd and irqfd).

 However, accesses to the VGIC are still left implemented independently,
 since the kvm_io_bus_* API doesn't pass the VCPU pointer doing the 
 access.

 Signed-off-by: Antonios Motakis 
 Signed-off-by: Nikolay Nikolaev 
 ---
  arch/arm/kvm/mmio.c | 32 
  virt/kvm/arm/vgic.c |  5 -
  2 files changed, 36 insertions(+), 1 deletion(-)

 diff --git a/arch/arm/kvm/mmio.c b/arch/arm/kvm/mmio.c
 index 4cb5a93..1d17831 100644
 --- a/arch/arm/kvm/mmio.c
 +++ b/arch/arm/kvm/mmio.c
 @@ -162,6 +162,35 @@ static int decode_hsr(struct kvm_vcpu *vcpu, 
 phys_addr_t fault_ipa,
   return 0;
  }

 +/**
 + * kvm_handle_mmio - handle an in-kernel MMIO access
 + * @vcpu:pointer to the vcpu performing the access
 + * @run: pointer to the kvm_run structure
 + * @mmio:pointer to the data describing the access
 + *
 + * returns true if the MMIO access has been performed in kernel space,
 + * and false if it needs to be emulated in user space.
 + */
 +static bool handle_kernel_mmio(struct kvm_vcpu *vcpu, struct kvm_run 
 *run,
 + struct kvm_exit_mmio *mmio)
 +{
 + int ret;
 + if (mmio->is_write) {
 + ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, 
 mmio->phys_addr,
 + mmio->len, &mmio->data);
 +
 + } else {
 + ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, 
 mmio->phys_addr,
 + mmio->len, &mmio->data);
 + }
 + if (!ret) {
 + kvm_prepare_mmio(run, mmio);
 + kvm_handle_mmio_return(vcpu, run);
 + }
 +
 + return !ret;
 +}
 +
  int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
phys_addr_t fault_ipa)
  {
 @@ -200,6 +229,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct 
 kvm_run *run,
   if (vgic_handle_mmio(vcpu, run, &mmio))
   return 1;

 + if (handle_kernel_mmio(vcpu, run, &mmio))
 + return 1;
 +
>>
>>
>> We're reconsidering ioeventfds patchseries and we tried to evaluate
>> what you suggested here.
>>
>>>
>>> this special-casing of the vgic is now really terrible.  Is there
>>> anything holding you back from doing the necessary restructure of the
>>> kvm_bus_io_*() API instead?
>>
>> Restructuring the kvm_io_bus_ API is not a big thing (we actually did
>> it), but is not directly related to the these patches.
>> Of course it can be justified if we do it in the context of removing
>> vgic_handle_mmio and leaving only handle_kernel_mmio.
>>
>>>
>>> That would allow us to get rid of the ugly
>>> Fix it! in the vgic driver as well.
>>
>> Going through the vgic_handle_mmio we see that it will require large
>> refactoring:
>>  - there are 15 MMIO ranges for the vgic now - each should be
>> registered as a separate device
>> Re-correcting Andre's address, sorry:
>> Hi Nikolay, Andre,
>>
>> what does mandate to register 15 devices? Isn't possible to register a
>> single kvm_io_device covering the whole distributor range [base, base +
>> KVM_VGIC_V2_DIST_SIZE] (current code) and in associated
>> kvm_io_device_ops read/write locate the addressed range and do the same
>> as what is done in current vgic_handle_mmio? Isn't it done that way for
> 
> Well, then we'll actually get slower mmio processing. Instead of calling
> vgic_handle_mmio in io_mem_abort, we'll be calling kvm_io_bus_write.
> This just adds another level of translation (i.e. find the kvm_io_ device)
> and the underlying vgic code will remain almost the same.

Agreed. That was one possibility I came around also, but I think it
defeats the purpose of the rework, which is mostly to get rid of the
GIC's

Re: [RFC PATCH 1/4] ARM: KVM: on unhandled IO mem abort, route the call to the KVM MMIO bus

2014-11-13 Thread Eric Auger

On 11/13/2014 04:31 PM, Andre Przywara wrote:
> 
> 
> On 13/11/14 15:02, Nikolay Nikolaev wrote:
>> On Thu, Nov 13, 2014 at 4:23 PM, Eric Auger  wrote:
>>> On 11/13/2014 03:16 PM, Eric Auger wrote:
 On 11/13/2014 11:45 AM, Nikolay Nikolaev wrote:
> On Mon, Nov 10, 2014 at 6:27 PM, Christoffer Dall
>  wrote:
>> On Mon, Nov 10, 2014 at 05:09:07PM +0200, Nikolay Nikolaev wrote:
>>> Hello,
>>>
>>> On Fri, Mar 28, 2014 at 9:09 PM, Christoffer Dall
>>>  wrote:

 On Thu, Mar 13, 2014 at 04:57:26PM +0100, Antonios Motakis wrote:
> On an unhandled IO memory abort, use the kvm_io_bus_* API in order to
> handle the MMIO access through any registered read/write callbacks. 
> This
> is a dependency for eventfd support (ioeventfd and irqfd).
>
> However, accesses to the VGIC are still left implemented 
> independently,
> since the kvm_io_bus_* API doesn't pass the VCPU pointer doing the 
> access.
>
> Signed-off-by: Antonios Motakis 
> Signed-off-by: Nikolay Nikolaev 
> ---
>  arch/arm/kvm/mmio.c | 32 
>  virt/kvm/arm/vgic.c |  5 -
>  2 files changed, 36 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/kvm/mmio.c b/arch/arm/kvm/mmio.c
> index 4cb5a93..1d17831 100644
> --- a/arch/arm/kvm/mmio.c
> +++ b/arch/arm/kvm/mmio.c
> @@ -162,6 +162,35 @@ static int decode_hsr(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   return 0;
>  }
>
> +/**
> + * kvm_handle_mmio - handle an in-kernel MMIO access
> + * @vcpu:pointer to the vcpu performing the access
> + * @run: pointer to the kvm_run structure
> + * @mmio:pointer to the data describing the access
> + *
> + * returns true if the MMIO access has been performed in kernel 
> space,
> + * and false if it needs to be emulated in user space.
> + */
> +static bool handle_kernel_mmio(struct kvm_vcpu *vcpu, struct kvm_run 
> *run,
> + struct kvm_exit_mmio *mmio)
> +{
> + int ret;
> + if (mmio->is_write) {
> + ret = kvm_io_bus_write(vcpu->kvm, KVM_MMIO_BUS, 
> mmio->phys_addr,
> + mmio->len, &mmio->data);
> +
> + } else {
> + ret = kvm_io_bus_read(vcpu->kvm, KVM_MMIO_BUS, 
> mmio->phys_addr,
> + mmio->len, &mmio->data);
> + }
> + if (!ret) {
> + kvm_prepare_mmio(run, mmio);
> + kvm_handle_mmio_return(vcpu, run);
> + }
> +
> + return !ret;
> +}
> +
>  int io_mem_abort(struct kvm_vcpu *vcpu, struct kvm_run *run,
>phys_addr_t fault_ipa)
>  {
> @@ -200,6 +229,9 @@ int io_mem_abort(struct kvm_vcpu *vcpu, struct 
> kvm_run *run,
>   if (vgic_handle_mmio(vcpu, run, &mmio))
>   return 1;
>
> + if (handle_kernel_mmio(vcpu, run, &mmio))
> + return 1;
> +
>>>
>>>
>>> We're reconsidering ioeventfds patchseries and we tried to evaluate
>>> what you suggested here.
>>>

 this special-casing of the vgic is now really terrible.  Is there
 anything holding you back from doing the necessary restructure of the
 kvm_bus_io_*() API instead?
>>>
>>> Restructuring the kvm_io_bus_ API is not a big thing (we actually did
>>> it), but is not directly related to the these patches.
>>> Of course it can be justified if we do it in the context of removing
>>> vgic_handle_mmio and leaving only handle_kernel_mmio.
>>>

 That would allow us to get rid of the ugly
 Fix it! in the vgic driver as well.
>>>
>>> Going through the vgic_handle_mmio we see that it will require large
>>> refactoring:
>>>  - there are 15 MMIO ranges for the vgic now - each should be
>>> registered as a separate device
>>> Re-correcting Andre's address, sorry:
>>> Hi Nikolay, Andre,
>>>
>>> what does mandate to register 15 devices? Isn't possible to register a
>>> single kvm_io_device covering the whole distributor range [base, base +
>>> KVM_VGIC_V2_DIST_SIZE] (current code) and in associated
>>> kvm_io_device_ops read/write locate the addressed range and do the same
>>> as what is done in current vgic_handle_mmio? Isn't it done that way for
>>
>> Well, then we'll actually get slower mmio processing. Instead of calling
>> vgic_handle_mmio in io_mem_abort, we'll be calling kvm_io_bus_write.
>> This just adds another level of translation (i.e. find the kvm_io_ device)
>> and the unde

[PATCH] kvm: memslots: replace heap sort with insertion sort

2014-11-13 Thread Igor Mammedov

memslots is a sorted array, when slot changes in it
with current heapsort it would take O(n log n) time
to update array, while using insertion sort like
algorithm on array with 1 item out of order will
take only O(n) time.

Replace current heapsort with custom sort that
takes advantage of memslots usage pattern and known
position of changed slot.

performance change of 128 memslots insersions with
gradually increasing size (the worst case):
  heap sort   custom sort
max:  249747  15654 cycles
min:  52536   5562 cycles
with custom sort alg taking 90% less then original
update time.

Signed-off-by: Igor Mammedov 
---
 virt/kvm/kvm_main.c | 54 +
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ffac9..5fcbc45 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -668,31 +668,43 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot 
*memslot)
return 0;
 }
 
-static int cmp_memslot(const void *slot1, const void *slot2)
+static void swap_memslot(struct kvm_memory_slot *s1, struct kvm_memory_slot 
*s2)
 {
-   struct kvm_memory_slot *s1, *s2;
+   struct kvm_memory_slot tmp;
 
-   s1 = (struct kvm_memory_slot *)slot1;
-   s2 = (struct kvm_memory_slot *)slot2;
-
-   if (s1->npages < s2->npages)
-   return 1;
-   if (s1->npages > s2->npages)
-   return -1;
-
-   return 0;
+   tmp = *s2;
+   *s2 = *s1;
+   *s1 = tmp;
 }
 
 /*
- * Sort the memslots base on its size, so the larger slots
- * will get better fit.
+ * Insert memslot and re-sort memslots based on their size,
+ * so the larger slots will get better fit. Sorting algorithm
+ * takes advantage of having initially sorted array and
+ * known changed memslot position.
  */
-static void sort_memslots(struct kvm_memslots *slots)
+static void insert_memslot(struct kvm_memslots *slots,
+  struct kvm_memory_slot *new)
 {
-   int i;
+   int i = slots->id_to_index[new->id];
+   struct kvm_memory_slot *old = id_to_memslot(slots, new->id);
+   struct kvm_memory_slot *mslots = slots->memslots;
 
-   sort(slots->memslots, KVM_MEM_SLOTS_NUM,
- sizeof(struct kvm_memory_slot), cmp_memslot, NULL);
+   if (new->npages == old->npages)
+   return;
+
+   *old = *new;
+   while (1) {
+   if (i < (KVM_MEM_SLOTS_NUM - 1) &&
+   mslots[i].npages < mslots[i + 1].npages) {
+   swap_memslot(&mslots[i], &mslots[i + 1]);
+   i++;
+   } else if (i > 0 && mslots[i].npages > mslots[i - 1].npages) {
+   swap_memslot(&mslots[i], &mslots[i - 1]);
+   i--;
+   } else
+   break;
+   }
 
for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
slots->id_to_index[slots->memslots[i].id] = i;
@@ -702,13 +714,7 @@ static void update_memslots(struct kvm_memslots *slots,
struct kvm_memory_slot *new)
 {
if (new) {
-   int id = new->id;
-   struct kvm_memory_slot *old = id_to_memslot(slots, id);
-   unsigned long npages = old->npages;
-
-   *old = *new;
-   if (new->npages != npages)
-   sort_memslots(slots);
+   insert_memslot(slots, new);
}
 }
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

trying to use vfio to pass VGA card and getting operation not permitted error

2014-11-13 Thread Steven DuChene

I am trying to use VFIO and pci-bind to pass a NVidia VGA card and it's 
companion audio device through to a VM I am trying to start up.


I am trying this on two different hardware platforms, a HP zbook 15 and 
a HP z800 workstation. Both systems are running Ubuntu 14.04.


Each time I try to start the VM on the laptop using "virsh start vmname" 
I get the following errors:


error: Failed to start domain vmname
error: internal error: early end of file from monitor: possible problem:
qemu-system-x86_64: -device 
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0: 
vfio: error opening /dev/vfio/1: Operation not permitted
qemu-system-x86_64: -device 
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0: 
vfio: failed to get group 1
qemu-system-x86_64: -device 
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0: 
Device initialization failed.
qemu-system-x86_64: -device 
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0: 
Device 'vfio-pci' could not be initialized


If I start the VM from the command line on the z800 as root using the 
qemu-system-x86_64 command directly it works. The VM starts and both of 
the passed through devices show up in the VM. If I try to start the same 
VM as the libvirt-qemu user using the same qemu-system-x86_64 command I 
get the same above error (but with a different group number obviously).


My order of operations are:

blacklist the nouveau driver from loading (the closed source nvidia 
driver is not loaded on either system)


In the kernel command line passed through grub on the z800 workstation I 
have:


intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1

on the zbook15 laptop I have:

pci-stub.ids=10de:11fc,10de:0e0b intel_iommu=on

and on the laptop I have added the "allow_unsafe_interrupts=1" option to 
the modprobe of that module.


On the z800 I manually bind the nvidia video and audio devices to the 
pci-stub driver.


In both cases I see the following in the kernel dmesg:

(from the laptop)
 [6.342603] pci-stub: add 10DE:11FC sub=: 
cls=/

[6.342618] pci-stub :01:00.0: claimed by stub
[6.342625] pci-stub: add 10DE:0E0B sub=: 
cls=/

[6.342632] pci-stub :01:00.1: claimed by stub

(or from the z800 workstation)
[  115.116860] pci-stub :0f:00.1: claimed by stub
[  157.126503] pci-stub :0f:00.0: claimed by stub

I then use a vfio-bind script to bind both the video and audio devices 
to the vfio driver.

Once I do that I see the appropriate files under /dev/vfio

(on the laptop)
drwxr-xr-x  2 root root 80 Nov 13 08:15 ./
drwxr-xr-x 18 root root   4380 Nov 13 08:14 ../
crw---  1 root root 249, 1 Nov 13 08:15 1
crw-rw-rw-  1 root root 249, 0 Nov 13 08:14 vfio

(on the z800)
drwxr-xr-x  2 root root 80 Nov 13 10:26 ./
drwxr-xr-x 16 root root   4540 Nov 13 10:26 ../
crw---  1 root root 247, 1 Nov 13 10:26 14
crw-rw-rw-  1 root root 247, 0 Nov 13 10:26 vfio

I have confirmed that the only devices in the iommu group is the nvidia 
video and audio devices and I am attempting to pass both devices through 
to the VM when I invoke it.


On the laptop I was seeing messages in the logs from apparmor each time 
I tried to start the VM referring to a libvirt profile for this VM's 
uuid. After looking into that for a while, I finally set the 
security_driver in /etc/libvirt/qemu.conf to "none" and that stopped 
those messages from showing up in the logs. This also prevented the 
libvirt-UUID files from showing up for this VM in 
/etc/apparmor.d/libvirt/ each time I tried to start the VM. I did try to 
add the files in /dev/vfio to the apparmor profile and TEMPLATE files 
but that did not seem to have any effect on the log messages.


In both cases I am actually wanting to start the VMs though openstack 
which invokes the qemu-system-x86_64 process through libvirt. However if 
I try to start the VM on the z800 workstation as root from the command 
line using the following command:


qemu-system-x86_64 -enable-kvm -M q35 -m 1024 -cpu host -smp 
2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -device 
ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=sven.1 
-device 
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0 
-device vfio-pci,host=01:00.1,bus=sven.1,addr=00.1 -drive 
file=/root/ubuntu_withdriver.raw,id=disk,format=raw -device 
ide-hd,bus=ide.0,drive=disk -drive file=/root/my-seed.img,id=isocd 
-device ide-cd,bus=ide.1,drive=isocd -vnc 0.0.0.0:1 -k en-us


it works. If I try to start the VM on the z800 as the libvirt-qemu user 
(user id virsh and openstack use) I get the error I listed at the top of 
this E-mail. I have even tried changing the ownerships of the files in 
/dev/vfio to this user but that did not make any difference and I still 
get the same error.


I have spent quite a lot of time on this

Re: [PATCH] kvm: memslots: replace heap sort with insertion sort

2014-11-13 Thread Paolo Bonzini



On 13/11/2014 17:31, Igor Mammedov wrote:
> memslots is a sorted array, when slot changes in it
> with current heapsort it would take O(n log n) time
> to update array, while using insertion sort like
> algorithm on array with 1 item out of order will
> take only O(n) time.
> 
> Replace current heapsort with custom sort that
> takes advantage of memslots usage pattern and known
> position of changed slot.
> 
> performance change of 128 memslots insersions with
> gradually increasing size (the worst case):
>   heap sort   custom sort
> max:  249747  15654 cycles
> min:  52536   5562 cycles
> with custom sort alg taking 90% less then original
> update time.
> 
> Signed-off-by: Igor Mammedov 
> ---
>  virt/kvm/kvm_main.c | 54 
> +
>  1 file changed, 30 insertions(+), 24 deletions(-)

Nice!  I think strictly speaking it's not an insertion sort because
insertion sort doesn't use swaps; it's more similar to a bubble sort.
But the code is very readable and we do not need ultimate
performance---we are just trying to avoid doing something stupid.

Reviewed-by: Paolo Bonzini 

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: trying to use vfio to pass VGA card and getting operation not permitted error

2014-11-13 Thread Alex Williamson

On Thu, 2014-11-13 at 11:50 -0500, Steven DuChene wrote:
> I am trying to use VFIO and pci-bind to pass a NVidia VGA card and it's 
> companion audio device through to a VM I am trying to start up.
> 
> I am trying this on two different hardware platforms, a HP zbook 15 and 
> a HP z800 workstation. Both systems are running Ubuntu 14.04.
> 
> Each time I try to start the VM on the laptop using "virsh start vmname" 
> I get the following errors:
> 
> error: Failed to start domain vmname
> error: internal error: early end of file from monitor: possible problem:
> qemu-system-x86_64: -device 
> vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
>  
> vfio: error opening /dev/vfio/1: Operation not permitted
> qemu-system-x86_64: -device 
> vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
>  
> vfio: failed to get group 1
> qemu-system-x86_64: -device 
> vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
>  
> Device initialization failed.
> qemu-system-x86_64: -device 
> vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
>  
> Device 'vfio-pci' could not be initialized
> 
> If I start the VM from the command line on the z800 as root using the 
> qemu-system-x86_64 command directly it works. The VM starts and both of 
> the passed through devices show up in the VM. If I try to start the same 
> VM as the libvirt-qemu user using the same qemu-system-x86_64 command I 
> get the same above error (but with a different group number obviously).
> 
> My order of operations are:
> 
> blacklist the nouveau driver from loading (the closed source nvidia 
> driver is not loaded on either system)
> 
> In the kernel command line passed through grub on the z800 workstation I 
> have:
> 
> intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1
> 
> on the zbook15 laptop I have:
> 
> pci-stub.ids=10de:11fc,10de:0e0b intel_iommu=on
> 
> and on the laptop I have added the "allow_unsafe_interrupts=1" option to 
> the modprobe of that module.
> 
> On the z800 I manually bind the nvidia video and audio devices to the 
> pci-stub driver.
> 
> In both cases I see the following in the kernel dmesg:
> 
> (from the laptop)
>   [6.342603] pci-stub: add 10DE:11FC sub=: 
> cls=/
> [6.342618] pci-stub :01:00.0: claimed by stub
> [6.342625] pci-stub: add 10DE:0E0B sub=: 
> cls=/
> [6.342632] pci-stub :01:00.1: claimed by stub
> 
> (or from the z800 workstation)
> [  115.116860] pci-stub :0f:00.1: claimed by stub
> [  157.126503] pci-stub :0f:00.0: claimed by stub
> 
> I then use a vfio-bind script to bind both the video and audio devices 
> to the vfio driver.
> Once I do that I see the appropriate files under /dev/vfio
> 
> (on the laptop)
> drwxr-xr-x  2 root root 80 Nov 13 08:15 ./
> drwxr-xr-x 18 root root   4380 Nov 13 08:14 ../
> crw---  1 root root 249, 1 Nov 13 08:15 1
> crw-rw-rw-  1 root root 249, 0 Nov 13 08:14 vfio
> 
> (on the z800)
> drwxr-xr-x  2 root root 80 Nov 13 10:26 ./
> drwxr-xr-x 16 root root   4540 Nov 13 10:26 ../
> crw---  1 root root 247, 1 Nov 13 10:26 14
> crw-rw-rw-  1 root root 247, 0 Nov 13 10:26 vfio
> 
> I have confirmed that the only devices in the iommu group is the nvidia 
> video and audio devices and I am attempting to pass both devices through 
> to the VM when I invoke it.
> 
> On the laptop I was seeing messages in the logs from apparmor each time 
> I tried to start the VM referring to a libvirt profile for this VM's 
> uuid. After looking into that for a while, I finally set the 
> security_driver in /etc/libvirt/qemu.conf to "none" and that stopped 
> those messages from showing up in the logs. This also prevented the 
> libvirt-UUID files from showing up for this VM in 
> /etc/apparmor.d/libvirt/ each time I tried to start the VM. I did try to 
> add the files in /dev/vfio to the apparmor profile and TEMPLATE files 
> but that did not seem to have any effect on the log messages.
> 
> In both cases I am actually wanting to start the VMs though openstack 
> which invokes the qemu-system-x86_64 process through libvirt. However if 
> I try to start the VM on the z800 workstation as root from the command 
> line using the following command:
> 
> qemu-system-x86_64 -enable-kvm -M q35 -m 1024 -cpu host -smp 
> 2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -device 
> ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=sven.1 
> -device 
> vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0 
> -device vfio-pci,host=01:00.1,bus=sven.1,addr=00.1 -drive 
> file=/root/ubuntu_withdriver.raw,id=disk,format=raw -device 
> ide-hd,bus=ide.0,drive=disk -drive file=/root/my-seed.img,id=isocd 
> -device ide-cd,bus=ide.1,drive=isocd -vnc 0.0.0.0:1 -k en-us
> 
> it works. If I try to start the VM on the z800 as the libvirt-qemu user 
> (user

Re: trying to use vfio to pass VGA card and getting operation not permitted error

2014-11-13 Thread Steven DuChene


Alex:
Thanks for the quick reply.
Yes, I am using qemu:args in the xml file.
I looked around via google for an example of how to use hostdev in the 
xml file for the audio or video devices.
I seem to recall the only example I could find was a hostdev section in 
the libvirt wiki for a NIC card.
Do you know of any example xml files where hostdev is used for video and 
audio devices?

--
Steven DuChene

On 11/13/2014 12:24 PM, Alex Williamson wrote:

On Thu, 2014-11-13 at 11:50 -0500, Steven DuChene wrote:

I am trying to use VFIO and pci-bind to pass a NVidia VGA card and it's
companion audio device through to a VM I am trying to start up.

I am trying this on two different hardware platforms, a HP zbook 15 and
a HP z800 workstation. Both systems are running Ubuntu 14.04.

Each time I try to start the VM on the laptop using "virsh start vmname"
I get the following errors:

error: Failed to start domain vmname
error: internal error: early end of file from monitor: possible problem:
qemu-system-x86_64: -device
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
vfio: error opening /dev/vfio/1: Operation not permitted
qemu-system-x86_64: -device
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
vfio: failed to get group 1
qemu-system-x86_64: -device
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
Device initialization failed.
qemu-system-x86_64: -device
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0:
Device 'vfio-pci' could not be initialized

If I start the VM from the command line on the z800 as root using the
qemu-system-x86_64 command directly it works. The VM starts and both of
the passed through devices show up in the VM. If I try to start the same
VM as the libvirt-qemu user using the same qemu-system-x86_64 command I
get the same above error (but with a different group number obviously).

My order of operations are:

blacklist the nouveau driver from loading (the closed source nvidia
driver is not loaded on either system)

In the kernel command line passed through grub on the z800 workstation I
have:

intel_iommu=on vfio_iommu_type1.allow_unsafe_interrupts=1

on the zbook15 laptop I have:

pci-stub.ids=10de:11fc,10de:0e0b intel_iommu=on

and on the laptop I have added the "allow_unsafe_interrupts=1" option to
the modprobe of that module.

On the z800 I manually bind the nvidia video and audio devices to the
pci-stub driver.

In both cases I see the following in the kernel dmesg:

(from the laptop)
   [6.342603] pci-stub: add 10DE:11FC sub=:
cls=/
[6.342618] pci-stub :01:00.0: claimed by stub
[6.342625] pci-stub: add 10DE:0E0B sub=:
cls=/
[6.342632] pci-stub :01:00.1: claimed by stub

(or from the z800 workstation)
[  115.116860] pci-stub :0f:00.1: claimed by stub
[  157.126503] pci-stub :0f:00.0: claimed by stub

I then use a vfio-bind script to bind both the video and audio devices
to the vfio driver.
Once I do that I see the appropriate files under /dev/vfio

(on the laptop)
drwxr-xr-x  2 root root 80 Nov 13 08:15 ./
drwxr-xr-x 18 root root   4380 Nov 13 08:14 ../
crw---  1 root root 249, 1 Nov 13 08:15 1
crw-rw-rw-  1 root root 249, 0 Nov 13 08:14 vfio

(on the z800)
drwxr-xr-x  2 root root 80 Nov 13 10:26 ./
drwxr-xr-x 16 root root   4540 Nov 13 10:26 ../
crw---  1 root root 247, 1 Nov 13 10:26 14
crw-rw-rw-  1 root root 247, 0 Nov 13 10:26 vfio

I have confirmed that the only devices in the iommu group is the nvidia
video and audio devices and I am attempting to pass both devices through
to the VM when I invoke it.

On the laptop I was seeing messages in the logs from apparmor each time
I tried to start the VM referring to a libvirt profile for this VM's
uuid. After looking into that for a while, I finally set the
security_driver in /etc/libvirt/qemu.conf to "none" and that stopped
those messages from showing up in the logs. This also prevented the
libvirt-UUID files from showing up for this VM in
/etc/apparmor.d/libvirt/ each time I tried to start the VM. I did try to
add the files in /dev/vfio to the apparmor profile and TEMPLATE files
but that did not seem to have any effect on the log messages.

In both cases I am actually wanting to start the VMs though openstack
which invokes the qemu-system-x86_64 process through libvirt. However if
I try to start the VM on the z800 workstation as root from the command
line using the following command:

qemu-system-x86_64 -enable-kvm -M q35 -m 1024 -cpu host -smp
2,sockets=1,cores=2,threads=1 -bios /usr/share/qemu/bios.bin -device
ioh3420,bus=pcie.0,addr=1c.0,multifunction=on,port=1,chassis=1,id=sven.1
-device
vfio-pci,host=01:00.0,bus=sven.1,addr=00.0,multifunction=on,x-vga=on,rombar=0
-device vfio-pci,host=01:00.1,bus=sven.1,addr=00.1 -drive
file=/root/ubuntu_withdriver.raw,id=disk,format=raw -device
ide-hd,bus=ide.0,drive=disk

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Radim Krčmář

2014-11-13 12:32+0100, Andrew Jones:
> On Thu, Nov 13, 2014 at 11:44:02AM +0100, Andrew Jones wrote:
> > On Thu, Nov 13, 2014 at 09:40:41AM +0100, Andrew Jones wrote:
> > > On Thu, Nov 13, 2014 at 12:44:39AM -0200, Marcelo Tosatti wrote:
> > > > The periodic kvmclock sync can be an undesired source of latencies.
[...]
> > > > +static bool kvmclock_periodic_sync = 1;
> > > 
> > > Using 'true' would look nicer.
> > 
> > Ahh, disregard this comment. 1 matches what the user would input.

The output is always Y|N and the user can input that as well.
(bool = {true, false}, so I'd prefer 'true'.)

> @@ -1717,7 +1717,8 @@ static void kvmclock_sync_fn(struct work_struct *work)
>  kvmclock_sync_work);
>   struct kvm *kvm = container_of(ka, struct kvm, arch);
>  
> - schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
> + if (kvmclock_periodic_sync)
> + schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
>   schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
>   KVMCLOCK_SYNC_PERIOD);

Yes, or add callbacks to sysfs writes that would schedule/cancel this
work.  (But having a for_every_vm loop is quite ugly.)

I'd be happy with a 'const kvmclock_periodic_sync'.
(Having useless timers is weird if we care about latencies.)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Paolo Bonzini



On 13/11/2014 18:46, Radim Krčmář wrote:
> Yes, or add callbacks to sysfs writes that would schedule/cancel this
> work.  (But having a for_every_vm loop is quite ugly.)
> 
> I'd be happy with a 'const kvmclock_periodic_sync'.
> (Having useless timers is weird if we care about latencies.)

I agree.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: trying to use vfio to pass VGA card and getting operation not permitted error

2014-11-13 Thread Alex Williamson

On Thu, 2014-11-13 at 12:36 -0500, Steven DuChene wrote:
> Alex:
> Thanks for the quick reply.
> Yes, I am using qemu:args in the xml file.
> I looked around via google for an example of how to use hostdev in the 
> xml file for the audio or video devices.
> I seem to recall the only example I could find was a hostdev section in 
> the libvirt wiki for a NIC card.
> Do you know of any example xml files where hostdev is used for video and 
> audio devices?

virt-manager will happily create one:

In this case I've manually updated the xml so that the GPU+audio is
presented to the guest as a multifunction device.  This is not required.
Using separate slots in the guest also works.  The canonical reference
for libvirt syntax is here:

http://libvirt.org/formatdomain.html

Also note that we've had issues with the audio function on Quadro cards
and don't officially support assignment of them.  The INTx disable
support in the hardware seems to be broken.  If you do assign it, I
recommend forcing it to use MSI interrupts:

http://vfio.blogspot.com/2014/09/vfio-interrupts-and-how-to-coax-windows.html

Otherwise you can bind it to vfio-pci or pci-stub and only assign the
GPU function to the guest.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Andrew Jones

On Thu, Nov 13, 2014 at 06:47:40PM +0100, Paolo Bonzini wrote:
> 
> 
> On 13/11/2014 18:46, Radim Krčmář wrote:
> > Yes, or add callbacks to sysfs writes that would schedule/cancel this
> > work.  (But having a for_every_vm loop is quite ugly.)
> > 
> > I'd be happy with a 'const kvmclock_periodic_sync'.
> > (Having useless timers is weird if we care about latencies.)
> 
> I agree.
> 

yeah, I guess Marcelo's thought was that users may turn it
off with sysfs, but once off, it'll never get to come back.
In that case the original patch is perfectly fine as is.

drew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: KVM: x86: add module parameter to disable periodic kvmclock sync

2014-11-13 Thread Radim Krčmář

2014-11-13 18:57+0100, Andrew Jones:
> On Thu, Nov 13, 2014 at 06:47:40PM +0100, Paolo Bonzini wrote:
> > 
> > 
> > On 13/11/2014 18:46, Radim Krčmář wrote:
> > > Yes, or add callbacks to sysfs writes that would schedule/cancel this
> > > work.  (But having a for_every_vm loop is quite ugly.)
> > > 
> > > I'd be happy with a 'const kvmclock_periodic_sync'.
> > > (Having useless timers is weird if we care about latencies.)
> > 
> > I agree.
> > 
> 
> yeah, I guess Marcelo's thought was that users may turn it
> off with sysfs, but once off, it'll never get to come back.
> In that case the original patch is perfectly fine as is.

(Good point, I never thought of that.)

300 second "race", we'd want to have both under the 'if' at least.
Explicitly cancelling is better IMO.

(This would make more sense with the previously mentioned per-vm
 property, if we had those without ioctls, and if the toggle was turned
 read-only after it has been changed to 'N'.)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PULL] vhost: cleanups and fixes

2014-11-13 Thread Michael S. Tsirkin

The following changes since commit 206c5f60a3d902bc4b56dab2de3e88de5eb06108:

  Linux 3.18-rc4 (2014-11-09 14:55:29 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to 65eca3a20264a8999570c269406196bd1ae23be7:

  virtio_console: move early VQ enablement (2014-11-13 09:53:26 +0200)

It seems like a good idea to merge this bugfix now, as it's clearly
a regression and several people complained.


virtio: bugfix for 3.18

This fixes a crash in virtio console
multi-channel mode that got introduced in -rc1.

Signed-off-by: Michael S. Tsirkin 


Cornelia Huck (1):
  virtio_console: move early VQ enablement

 drivers/char/virtio_console.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] x86, entry: Switch stacks on a paranoid entry from userspace

2014-11-13 Thread Borislav Petkov

On Thu, Nov 13, 2014 at 11:59:37AM +0100, Borislav Petkov wrote:
> I've been thinking about it recently too - adding MCA functionality to
> qemu/kvm could be very useful, especially the thresholding stuff, for
> testing RAS kernel code.

Btw, qemu monitor has a mce injection command with which I was able
to tickle some response from the guest kernel. I'll play more with it
tomorrow and try to tickle a response from the memory failure code.

[  195.328466] Disabling lock debugging due to kernel taint
[  195.328466] [Hardware Error]: System Fatal error.
[  195.328466] [Hardware Error]: CPU:1 (10:2:3) 
MC4_STATUS[Over|UE|MiscV|PCC|AddrV|UECC]: 0xfe00201f012b
[  195.328466] [Hardware Error]: MC4_ADDR: 0x
[  195.328466] [Hardware Error]: MC4 Error (node 1): ECC Error in the Probe 
Filter directory.
[  195.328466] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: WR
[  195.328466] mce: [Hardware Error]: CPU 1: Machine Check Exception: 3 Bank 4: 
fe00201f012b
[  195.328466] mce: [Hardware Error]: RIP 10: 
{default_idle+0x25/0x240}
[  195.328466] mce: [Hardware Error]: TSC b9e2f56f95 MISC d1d1dad1deadbeef 
[  195.328466] mce: [Hardware Error]: PROCESSOR 2:100f23 TIME 1415915466 SOCKET 
1 APIC 1 microcode 165
[  195.328466] [Hardware Error]: System Fatal error.
[  195.328466] [Hardware Error]: CPU:1 (10:2:3) 
MC4_STATUS[Over|UE|MiscV|PCC|AddrV|UECC]: 0xfe00201f012b
[  195.328466] [Hardware Error]: MC4_ADDR: 0x
[  195.328466] [Hardware Error]: MC4 Error (node 1): ECC Error in the Probe 
Filter directory.
[  195.328466] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: WR
[  195.328466] mce: [Hardware Error]: Machine check: Invalid
[  195.328466] Kernel panic - not syncing: Fatal machine check on current CPU
[  195.328466] Kernel Offset: 0x0 from 0x8100 (relocation range: 
0x8000-0x9fff)
[  195.328466] ---[ end Kernel panic - not syncing: Fatal machine check on 
current CPU

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v2] kvm: memslots: replace heap sort with insertion sort

2014-11-13 Thread Igor Mammedov

memslots is a sorted array, when slot changes in it
with current heapsort it would take O(n log n) time
to update array, while using insertion sort like
algorithm on array with 1 item out of order will
take only O(n) time.

Replace current heapsort with custom sort that
takes advantage of memslots usage pattern and known
position of changed slot.

performance change of 128 memslots insersions with
gradually increasing size (the worst case):
  heap sort   custom sort
max:  249747  2500 cycles
with custom sort alg taking ~98% less then original
update time.

Signed-off-by: Igor Mammedov 
---
v2:
  - replace swap with slot shift, improves result 2x
  - reprofile original/swap based and swapless 15 times
  discarding spikes swap based takes ~5900 cycles max
  and swapless ~2500 cycles.
---
 virt/kvm/kvm_main.c | 54 ++---
 1 file changed, 26 insertions(+), 28 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 25ffac9..49f896a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -668,31 +668,35 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot 
*memslot)
return 0;
 }
 
-static int cmp_memslot(const void *slot1, const void *slot2)
-{
-   struct kvm_memory_slot *s1, *s2;
-
-   s1 = (struct kvm_memory_slot *)slot1;
-   s2 = (struct kvm_memory_slot *)slot2;
-
-   if (s1->npages < s2->npages)
-   return 1;
-   if (s1->npages > s2->npages)
-   return -1;
-
-   return 0;
-}
-
 /*
- * Sort the memslots base on its size, so the larger slots
- * will get better fit.
+ * Insert memslot and re-sort memslots based on their size,
+ * so the larger slots will get better fit. Sorting algorithm
+ * takes advantage of having initially sorted array and
+ * known changed memslot position.
  */
-static void sort_memslots(struct kvm_memslots *slots)
+static void insert_memslot(struct kvm_memslots *slots,
+  struct kvm_memory_slot *new)
 {
-   int i;
+   int i = slots->id_to_index[new->id];
+   struct kvm_memory_slot *old = id_to_memslot(slots, new->id);
+   struct kvm_memory_slot *mslots = slots->memslots;
+
+   if (new->npages == old->npages)
+   return;
 
-   sort(slots->memslots, KVM_MEM_SLOTS_NUM,
- sizeof(struct kvm_memory_slot), cmp_memslot, NULL);
+   while (1) {
+   if (i < (KVM_MEM_SLOTS_NUM - 1) &&
+   new->npages < mslots[i + 1].npages) {
+   mslots[i] = mslots[i + 1];
+   i++;
+   } else if (i > 0 && new->npages > mslots[i - 1].npages) {
+   mslots[i] = mslots[i - 1];
+   i--;
+   } else {
+   mslots[i] = *new;
+   break;
+   }
+   }
 
for (i = 0; i < KVM_MEM_SLOTS_NUM; i++)
slots->id_to_index[slots->memslots[i].id] = i;
@@ -702,13 +706,7 @@ static void update_memslots(struct kvm_memslots *slots,
struct kvm_memory_slot *new)
 {
if (new) {
-   int id = new->id;
-   struct kvm_memory_slot *old = id_to_memslot(slots, id);
-   unsigned long npages = old->npages;
-
-   *old = *new;
-   if (new->npages != npages)
-   sort_memslots(slots);
+   insert_memslot(slots, new);
}
 }
 
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions

2014-11-13 Thread Suresh E. Warrier

This patch adds trace points in the guest entry and exit code and also
for exceptions handled by the host in kernel mode - hypercalls and page
faults. The new events are added to /sys/kernel/debug/tracing/events
under a new subsystem called kvm_hv.

Acked-by: Paul Mackerras 
Signed-off-by: Suresh Warrier 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c |  12 +-
 arch/powerpc/kvm/book3s_hv.c|  19 ++
 arch/powerpc/kvm/trace_hv.h | 497 
 3 files changed, 525 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/kvm/trace_hv.h

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 70feb7b..20cbad1 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -38,6 +38,7 @@
 #include 
 
 #include "book3s_hv_cma.h"
+#include "trace_hv.h"
 
 /* POWER7 has 10-bit LPIDs, PPC970 has 6-bit LPIDs */
 #define MAX_LPID_970   63
@@ -627,6 +628,8 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
gfn = gpa >> PAGE_SHIFT;
memslot = gfn_to_memslot(kvm, gfn);
 
+   trace_kvm_page_fault_enter(vcpu, hpte, memslot, ea, dsisr);
+
/* No memslot means it's an emulated MMIO region */
if (!memslot || (memslot->flags & KVM_MEMSLOT_INVALID))
return kvmppc_hv_emulate_mmio(run, vcpu, gpa, ea,
@@ -639,6 +642,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
mmu_seq = kvm->mmu_notifier_seq;
smp_rmb();
 
+   ret = -EFAULT;
is_io = 0;
pfn = 0;
page = NULL;
@@ -662,7 +666,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
}
up_read(¤t->mm->mmap_sem);
if (!pfn)
-   return -EFAULT;
+   goto out_put;
} else {
page = pages[0];
if (PageHuge(page)) {
@@ -690,14 +694,14 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
pfn = page_to_pfn(page);
}
 
-   ret = -EFAULT;
if (psize > pte_size)
goto out_put;
 
/* Check WIMG vs. the actual page we're accessing */
if (!hpte_cache_flags_ok(r, is_io)) {
if (is_io)
-   return -EFAULT;
+   goto out_put;
+
/*
 * Allow guest to map emulated device memory as
 * uncacheable, but actually make it cacheable.
@@ -753,6 +757,8 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct 
kvm_vcpu *vcpu,
SetPageDirty(page);
 
  out_put:
+   trace_kvm_page_fault_exit(vcpu, hpte, ret);
+
if (page) {
/*
 * We drop pages[0] here, not page because page might
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 69d4085..5143d17 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -57,6 +57,9 @@
 
 #include "book3s.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace_hv.h"
+
 /* #define EXIT_DEBUG */
 /* #define EXIT_DEBUG_SIMPLE */
 /* #define EXIT_DEBUG_INT */
@@ -1679,6 +1682,7 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
list_for_each_entry(vcpu, &vc->runnable_threads, arch.run_list) {
kvmppc_start_thread(vcpu);
kvmppc_create_dtl_entry(vcpu, vc);
+   trace_kvm_guest_enter(vcpu);
}
 
/* Set this explicitly in case thread 0 doesn't have a vcpu */
@@ -1687,6 +1691,9 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
 
vc->vcore_state = VCORE_RUNNING;
preempt_disable();
+
+   trace_kvmppc_run_core(vc, 0);
+
spin_unlock(&vc->lock);
 
kvm_guest_enter();
@@ -1732,6 +1739,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
kvmppc_core_pending_dec(vcpu))
kvmppc_core_dequeue_dec(vcpu);
 
+   trace_kvm_guest_exit(vcpu);
+
ret = RESUME_GUEST;
if (vcpu->arch.trap)
ret = kvmppc_handle_exit_hv(vcpu->arch.kvm_run, vcpu,
@@ -1757,6 +1766,8 @@ static void kvmppc_run_core(struct kvmppc_vcore *vc)
wake_up(&vcpu->arch.cpu_run);
}
}
+
+   trace_kvmppc_run_core(vc, 1);
 }
 
 /*
@@ -1783,11 +1794,13 @@ static void kvmppc_vcore_blocked(struct kvmppc_vcore 
*vc)
 
prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE);
vc->vcore_state = VCORE_SLEEPING;
+   trace_kvmppc_vcore_blocked(vc, 0);
spin_unlock(&vc->lock);
schedule();
finish_wait(&vc->wq, &wait);
spin_lock(&vc->lock);
vc->vcore_state = VCORE_INACTIVE;
+   trace_kvmppc_vcore_blocked(vc, 1);
 }
 
 static int kvmppc_run_vcpu(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
@@ -1796,6 +1809,8 @@ static int kv

Re: [PATCH v4] KVM: x86: fix access memslots w/o hold srcu read lock

2014-11-13 Thread Wanpeng Li

Hi Tang,
On Tue, Nov 11, 2014 at 01:35:29PM +0800, Tang Chen wrote:
>Hi Wanpeng,
>

Sorry for the late.

>I think I have totally missed this thread.
>I opened lockdep and RCU debug, and tried on 3.18-rc1. But I didn't
>get the warning.

I also opened lockdep and RCU debug, and tried 3.18.0-rc2 on a Ivy
bridge, the warning will be triggered after run qemu immediately. There 
is no need to try any hotplug related stuff.

In addition, Paolo's patch is merged upstream to fix this.

commit a73896cb5bbdce672945745db8224352a689f580
Author: Paolo Bonzini 
Date:   Sun Nov 2 07:54:30 2014 +0100

KVM: vmx: defer load of APIC access page address during reset

Regards,
Wanpeng Li 

>My steps are:
>
>1. Use numactl to bind a qemu process to node1.
>2. Offline all node1 memory. And the qemu process is still running.
>
>Would you please tell me how did you reproduce it ?
>
>Thanks.
>
>On 11/02/2014 03:07 PM, Wanpeng Li wrote:
>>The srcu read lock must be held while accessing memslots (e.g.
>>when using gfn_to_* functions), however, commit c24ae0dcd3e8
>>("kvm: x86: Unpin and remove kvm_arch->apic_access_page") call
>>gfn_to_page() in kvm_vcpu_reload_apic_access_page() w/o hold it in
>>vmx_vcpu_reset() path which leads to suspicious rcu_dereference_check()
>>usage warning. This patch fix it by holding srcu read lock in all
>>kvm_vcpu_reset() call path.
>>
>>
>>[ INFO: suspicious RCU usage. ]
>>3.18.0-rc2-test2+ #70 Not tainted
>>---
>>include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!
>>
>>other info that might help us debug this:
>>
>>rcu_scheduler_active = 1, debug_locks = 0
>>1 lock held by qemu-system-x86/2371:
>>  #0:  (&vcpu->mutex){+.+...}, at: [] vcpu_load+0x20/0xd0 
>> [kvm]
>>
>>stack backtrace:
>>CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
>>Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
>>  0001 880209983ca8 816f514f 
>>  8802099b8990 880209983cd8 810bd687 000fee00
>>  880208a2c000 880208a1 88020ef50040 880209983d08
>>Call Trace:
>>  [] dump_stack+0x4e/0x71
>>  [] lockdep_rcu_suspicious+0xe7/0x120
>>  [] gfn_to_memslot+0xd5/0xe0 [kvm]
>>  [] __gfn_to_pfn+0x33/0x60 [kvm]
>>  [] gfn_to_page+0x25/0x90 [kvm]
>>  [] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]
>>  [] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
>>  [] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
>>  [] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
>>  [] kvm_vm_ioctl+0x1d0/0x780 [kvm]
>>  [] ? __lock_is_held+0x54/0x80
>>  [] do_vfs_ioctl+0x300/0x520
>>  [] ? __fget+0x5/0x250
>>  [] ? __fget_light+0x2a/0xe0
>>  [] SyS_ioctl+0x81/0xa0
>>  [] system_call_fastpath+0x16/0x1b
>>
>>Reported-by: Takashi Iwai 
>>Reported-by: Alexei Starovoitov 
>>Suggested-by: Paolo Bonzini 
>>Signed-off-by: Wanpeng Li 
>>---
>>v3 -> v4:
>>  * bypass the problem altoghter by kvm_make_request
>>v2 -> v3:
>>  * take care all vmx_vcpu_reset call path
>>v1 -> v2:
>>  * just fix hold the srcu read lock in vmx_vcpu_reset path
>>
>>  arch/x86/kvm/vmx.c |2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>>diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>index a0f78db..3e556c6 100644
>>--- a/arch/x86/kvm/vmx.c
>>+++ b/arch/x86/kvm/vmx.c
>>@@ -4579,7 +4579,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>>  vmcs_write32(TPR_THRESHOLD, 0);
>>  }
>>- kvm_vcpu_reload_apic_access_page(vcpu);
>>+ kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>>  if (vmx_vm_has_apicv(vcpu->kvm))
>>  memset(&vmx->pi_desc, 0, sizeof(struct pi_desc));
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v14 2/7] KVM: Add generic support for dirty page logging

2014-11-13 Thread Mario Smarduch

kvm_get_dirty_log() provides generic handling of dirty bitmap, currently reused
by several architectures. Building on that we intrdoduce 
kvm_get_dirty_log_protect() adding write protection to mark these pages dirty
for future write access, before next KVM_GET_DIRTY_LOG ioctl call from user
space.

Signed-off-by: Mario Smarduch 
---
 include/linux/kvm_host.h |9 ++
 virt/kvm/Kconfig |6 
 virt/kvm/kvm_main.c  |   80 ++
 3 files changed, 95 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a4c33b3..c55dd75 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -590,6 +590,15 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
ext);
 
 int kvm_get_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log, int *is_dirty);
+
+int kvm_get_dirty_log_protect(struct kvm *kvm,
+   struct kvm_dirty_log *log, bool *is_dirty);
+
+void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset,
+   unsigned long mask);
+
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
struct kvm_dirty_log *log);
 
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 3796a21..314950c 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -40,3 +40,9 @@ config KVM_VFIO
 
 config HAVE_KVM_ARCH_TLB_FLUSH_ALL
bool
+
+config HAVE_KVM_ARCH_DIRTY_LOG_PROTECT
+   bool
+
+config KVM_GENERIC_DIRTYLOG_READ_PROTECT
+   bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 887df87..f50909c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -981,6 +981,86 @@ out:
 }
 EXPORT_SYMBOL_GPL(kvm_get_dirty_log);
 
+#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
+/**
+ * kvm_get_dirty_log_protect - get a snapshot of dirty pages, and if any pages
+ * are dirty write protect them for next write.
+ * @kvm:   pointer to kvm instance
+ * @log:   slot id and address to which we copy the log
+ * @is_dirty:  flag set if any page is dirty
+ *
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently. So, to avoid losing track of dirty pages we keep the
+ * following order:
+ *
+ *1. Take a snapshot of the bit and clear it if needed.
+ *2. Write protect the corresponding page.
+ *3. Copy the snapshot to the userspace.
+ *4. Upon return caller flushes TLB's if needed.
+ *
+ * Between 2 and 4, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page is reported dirty using
+ * the snapshot taken before and step 4 ensures that writes done after
+ * exiting to userspace will be logged for the next call.
+ *
+ */
+int kvm_get_dirty_log_protect(struct kvm *kvm,
+   struct kvm_dirty_log *log, bool *is_dirty)
+{
+   struct kvm_memory_slot *memslot;
+   int r, i;
+   unsigned long n;
+   unsigned long *dirty_bitmap;
+   unsigned long *dirty_bitmap_buffer;
+
+   r = -EINVAL;
+   if (log->slot >= KVM_USER_MEM_SLOTS)
+   goto out;
+
+   memslot = id_to_memslot(kvm->memslots, log->slot);
+
+   dirty_bitmap = memslot->dirty_bitmap;
+   r = -ENOENT;
+   if (!dirty_bitmap)
+   goto out;
+
+   n = kvm_dirty_bitmap_bytes(memslot);
+
+   dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
+   memset(dirty_bitmap_buffer, 0, n);
+
+   spin_lock(&kvm->mmu_lock);
+   *is_dirty = false;
+   for (i = 0; i < n / sizeof(long); i++) {
+   unsigned long mask;
+   gfn_t offset;
+
+   if (!dirty_bitmap[i])
+   continue;
+
+   *is_dirty = true;
+
+   mask = xchg(&dirty_bitmap[i], 0);
+   dirty_bitmap_buffer[i] = mask;
+
+   offset = i * BITS_PER_LONG;
+   kvm_arch_mmu_write_protect_pt_masked(kvm, memslot, offset,
+   mask);
+   }
+
+   spin_unlock(&kvm->mmu_lock);
+
+   r = -EFAULT;
+   if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
+   goto out;
+
+   r = 0;
+out:
+   return r;
+}
+EXPORT_SYMBOL_GPL(kvm_get_dirty_log_protect);
+#endif
+
 bool kvm_largepages_enabled(void)
 {
return largepages_enabled;
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v14 1/7] KVM: Add architecture-defined TLB flush support

2014-11-13 Thread Mario Smarduch

Allow architectures to override the generic kvm_flush_remote_tlbs()
function via HAVE_KVM_ARCH_TLB_FLUSH_ALL. ARMv7 will need this to
provide its own TLB flush interface.

Reviewed-by: Marc Zyngier 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Mario Smarduch 
---
 virt/kvm/Kconfig|3 +++
 virt/kvm/kvm_main.c |2 ++
 2 files changed, 5 insertions(+)

diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index fc0c5e6..3796a21 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -37,3 +37,6 @@ config HAVE_KVM_CPU_RELAX_INTERCEPT
 
 config KVM_VFIO
bool
+
+config HAVE_KVM_ARCH_TLB_FLUSH_ALL
+   bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 33712fb..887df87 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -184,6 +184,7 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned 
int req)
return called;
 }
 
+#ifndef CONFIG_HAVE_KVM_ARCH_TLB_FLUSH_ALL
 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
long dirty_count = kvm->tlbs_dirty;
@@ -194,6 +195,7 @@ void kvm_flush_remote_tlbs(struct kvm *kvm)
cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_flush_remote_tlbs);
+#endif
 
 void kvm_reload_remote_mmus(struct kvm *kvm)
 {
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v14 0/7] KVM/arm/x86: dirty page logging for ARMv7 (3.17.0-rc1)

2014-11-13 Thread Mario Smarduch

Patch series adds support for ARMv7 and generic dirty page logging support. 
As we try to move towards generic dirty page logging additional logic is moved 
to generic code. Initially x86, armv7 KVM_GET_DIRTY_LOG reuses generic 
code, shortly followed by armv8 patches.

Testing:
- Generally live migration + checksumming of source/destination memory regions 
  is used to validate correctness.
- ARMv7 - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty 
  guest memory cycling. On FastModels you must set 'migrate_set_downtime=1'
  for higher dirty rate (for example 2000 pges/100mS).

See https://github.com/mjsmar/arm-dirtylog-tests for details.

In addition to ARMv7 patch series was compiled for:
- x86_64 - defconfig 
- ia64 - ia64-linux-gcc4.6.3 - defconfig, ia64 Kconfig defines BROKEN worked 
  around that to make sure new changes don't break build. Eventually build
  breaks due to other reasons.
- mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig
- ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig
- s390 - s390x-linux-gcc4.6.3 - defconfig
- armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig

ARMv7 Dirty page logging implementation overivew-
- initially write protects memory region pages 2nd stage page tables
- add support to read dirty page log and again write protect dirty pages 
  for next pass.
- second stage huge page are dissolved into small page tables to keep track of
  dirty pages at page granularity. Tracking at huge page granularity limits
  migration to an almost idle system. Small page size logging supports higher
  memory dirty rates.
- In the event migration is canceled, normal behavior is resumed huge pages
  are rebuilt over time.

Changes since v13:
- Addressed comments from Cornelia, Paolo, Marc, and Christoffer
- Most signifcant change is reduce number of arguments to stage2_set_pte
- Another is introduce Kconfig symbol for generic kvm_get_dirty_log_protect()

Changes since v12:
- Added Paolos and James Hogan's comments to extend kvm_get_dirty_log() to
  make it further generic by adding write protection in addition to dirty bit
  map handling. This led to new generic function kvm_get_dirty_log_protect().

Changes since v11:
- Implemented Alex's comments to simplify generic layer.

Changes since v10:
- addressed wanghaibin comments 
- addressed Christoffers comments

Changes since v9:
- Split patches into generic and architecture specific variants for TLB Flushing
  and dirty log read (patches 1,2 & 3,4,5,6)
- rebased to 3.16.0-rc1
- Applied Christoffers comments.

 

Mario Smarduch (6):
  KVM: Add architecture-defined TLB flush support
  KVM: Add generic support for dirty page logging
  KVM: arm: Add ARMv7 API to flush TLBs
  KVM: arm: Add initial dirty page locking support
  KVM: arm: dirty logging write protect support
  KVM: arm: page logging 2nd stage fault handling

Paolo Bonzini (1):
  KVM: x86: switch to kvm_get_dirty_log_protect

 arch/arm/include/asm/kvm_asm.h|1 +
 arch/arm/include/asm/kvm_host.h   |   14 +++
 arch/arm/include/asm/kvm_mmu.h|   20 +++
 arch/arm/include/asm/pgtable-3level.h |1 +
 arch/arm/kvm/Kconfig  |2 +
 arch/arm/kvm/arm.c|   46 +++
 arch/arm/kvm/interrupts.S |   11 ++
 arch/arm/kvm/mmu.c|  216 +++--
 arch/x86/include/asm/kvm_host.h   |3 -
 arch/x86/kvm/Kconfig  |1 +
 arch/x86/kvm/mmu.c|4 +-
 arch/x86/kvm/x86.c|   64 ++
 include/linux/kvm_host.h  |9 ++
 virt/kvm/Kconfig  |9 ++
 virt/kvm/kvm_main.c   |   82 +
 15 files changed, 415 insertions(+), 68 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v14 3/7] KVM: x86: switch to kvm_get_dirty_log_protect

2014-11-13 Thread Mario Smarduch

From: Paolo Bonzini 

We now have a generic function that does most of the work of
kvm_vm_ioctl_get_dirty_log, now use it.

Signed-off-by: Mario Smarduch 
---
 arch/x86/include/asm/kvm_host.h |3 --
 arch/x86/kvm/Kconfig|1 +
 arch/x86/kvm/mmu.c  |4 +--
 arch/x86/kvm/x86.c  |   64 ++-
 4 files changed, 12 insertions(+), 60 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7c492ed..934dc24 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -805,9 +805,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask,
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
 void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
-struct kvm_memory_slot *slot,
-gfn_t gfn_offset, unsigned long mask);
 void kvm_mmu_zap_all(struct kvm *kvm);
 void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
 unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index f9d16ff..d073594 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -39,6 +39,7 @@ config KVM
select PERF_EVENTS
select HAVE_KVM_MSI
select HAVE_KVM_CPU_RELAX_INTERCEPT
+   select KVM_GENERIC_DIRTYLOG_READ_PROTECT
select KVM_VFIO
---help---
  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 9314678..bf6b82c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1224,7 +1224,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
 }
 
 /**
- * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
+ * kvm_arch_mmu_write_protect_pt_masked - write protect selected PT level pages
  * @kvm: kvm instance
  * @slot: slot to protect
  * @gfn_offset: start of the BITS_PER_LONG pages we care about
@@ -1233,7 +1233,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
unsigned long *rmapp,
  * Used when we do not need to care about huge page mappings: e.g. during dirty
  * logging we do not have any such mappings.
  */
-void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
+void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
 struct kvm_memory_slot *slot,
 gfn_t gfn_offset, unsigned long mask)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f1e22d..9f8ae9a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3606,77 +3606,31 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
  *
  *   1. Take a snapshot of the bit and clear it if needed.
  *   2. Write protect the corresponding page.
- *   3. Flush TLB's if needed.
- *   4. Copy the snapshot to the userspace.
+ *   3. Copy the snapshot to the userspace.
+ *   4. Flush TLB's if needed.
  *
- * Between 2 and 3, the guest may write to the page using the remaining TLB
- * entry.  This is not a problem because the page will be reported dirty at
- * step 4 using the snapshot taken before and step 3 ensures that successive
- * writes will be logged for the next call.
+ * Between 2 and 4, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page is reported dirty using
+ * the snapshot taken before and step 4 ensures that writes done after
+ * exiting to userspace will be logged for the next call.
  */
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
-   int r;
-   struct kvm_memory_slot *memslot;
-   unsigned long n, i;
-   unsigned long *dirty_bitmap;
-   unsigned long *dirty_bitmap_buffer;
bool is_dirty = false;
+   int r;
 
mutex_lock(&kvm->slots_lock);
 
-   r = -EINVAL;
-   if (log->slot >= KVM_USER_MEM_SLOTS)
-   goto out;
-
-   memslot = id_to_memslot(kvm->memslots, log->slot);
-
-   dirty_bitmap = memslot->dirty_bitmap;
-   r = -ENOENT;
-   if (!dirty_bitmap)
-   goto out;
-
-   n = kvm_dirty_bitmap_bytes(memslot);
-
-   dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
-   memset(dirty_bitmap_buffer, 0, n);
-
-   spin_lock(&kvm->mmu_lock);
-
-   for (i = 0; i < n / sizeof(long); i++) {
-   unsigned long mask;
-   gfn_t offset;
-
-   if (!dirty_bitmap[i])
-   continue;
-
-   is_dirty = true;
-
-   mask = xchg(&dirty_bitmap[i], 0);
-   dirty_bitmap_buffer[i] = mask;
-
-   offset = i * BITS_PER_LONG;
-   kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
-   }
-
-   spin_unlock(&kvm->mmu_lock);
-
-   /* See the comments in kvm_mmu_slot_remove_write_access(). */

[PATCH v14 4/7] KVM: arm: Add ARMv7 API to flush TLBs

2014-11-13 Thread Mario Smarduch

This patch adds ARMv7 architecture TLB Flush function.

Acked-by: Marc Zyngier 
Acked-by: Christoffer Dall 
Signed-off-by: Mario Smarduch 
---
 arch/arm/include/asm/kvm_asm.h  |1 +
 arch/arm/include/asm/kvm_host.h |   12 
 arch/arm/kvm/Kconfig|1 +
 arch/arm/kvm/interrupts.S   |   11 +++
 4 files changed, 25 insertions(+)

diff --git a/arch/arm/include/asm/kvm_asm.h b/arch/arm/include/asm/kvm_asm.h
index 3a67bec..25410b2 100644
--- a/arch/arm/include/asm/kvm_asm.h
+++ b/arch/arm/include/asm/kvm_asm.h
@@ -96,6 +96,7 @@ extern char __kvm_hyp_code_end[];
 
 extern void __kvm_flush_vm_context(void);
 extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
+extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
 
 extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
 #endif
diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 6dfb404..3da6ea7 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -220,6 +220,18 @@ static inline void __cpu_init_hyp_mode(phys_addr_t 
boot_pgd_ptr,
kvm_call_hyp((void*)hyp_stack_ptr, vector_ptr, pgd_ptr);
 }
 
+/**
+ * kvm_flush_remote_tlbs() - flush all VM TLB entries
+ * @kvm:   pointer to kvm structure.
+ *
+ * Interface to HYP function to flush all VM TLB entries without address
+ * parameter.
+ */
+static inline void kvm_flush_remote_tlbs(struct kvm *kvm)
+{
+   kvm_call_hyp(__kvm_tlb_flush_vmid, kvm);
+}
+
 static inline int kvm_arch_dev_ioctl_check_extension(long ext)
 {
return 0;
diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index 466bd29..f27f336 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
select PREEMPT_NOTIFIERS
select ANON_INODES
select HAVE_KVM_CPU_RELAX_INTERCEPT
+   select HAVE_KVM_ARCH_TLB_FLUSH_ALL
select KVM_MMIO
select KVM_ARM_HOST
depends on ARM_VIRT_EXT && ARM_LPAE
diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
index 01dcb0e..79caf79 100644
--- a/arch/arm/kvm/interrupts.S
+++ b/arch/arm/kvm/interrupts.S
@@ -66,6 +66,17 @@ ENTRY(__kvm_tlb_flush_vmid_ipa)
bx  lr
 ENDPROC(__kvm_tlb_flush_vmid_ipa)
 
+/**
+ * void __kvm_tlb_flush_vmid(struct kvm *kvm) - Flush per-VMID TLBs
+ *
+ * Reuses __kvm_tlb_flush_vmid_ipa() for ARMv7, without passing address
+ * parameter
+ */
+
+ENTRY(__kvm_tlb_flush_vmid)
+   b   __kvm_tlb_flush_vmid_ipa
+ENDPROC(__kvm_tlb_flush_vmid)
+
 /
  * Flush TLBs and instruction caches of all CPUs inside the inner-shareable
  * domain, for all VMIDs
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v14 5/7] KVM: arm: Add initial dirty page locking support

2014-11-13 Thread Mario Smarduch

Add support for initial write protection of VM memslots. This patch
series assumes that huge PUDs will not be used in 2nd stage tables, which is
always valid on ARMv7.

Signed-off-by: Mario Smarduch 
---
 arch/arm/include/asm/kvm_host.h   |2 +
 arch/arm/include/asm/kvm_mmu.h|   20 +
 arch/arm/include/asm/pgtable-3level.h |1 +
 arch/arm/kvm/mmu.c|  138 +
 4 files changed, 161 insertions(+)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 3da6ea7..8fa6238 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -245,4 +245,6 @@ static inline void vgic_arch_setup(const struct vgic_params 
*vgic)
 int kvm_perf_init(void);
 int kvm_perf_teardown(void);
 
+void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot);
+
 #endif /* __ARM_KVM_HOST_H__ */
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 5cc0b0f..08ab5e8 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -114,6 +114,26 @@ static inline void kvm_set_s2pmd_writable(pmd_t *pmd)
pmd_val(*pmd) |= L_PMD_S2_RDWR;
 }
 
+static inline void kvm_set_s2pte_readonly(pte_t *pte)
+{
+   pte_val(*pte) = (pte_val(*pte) & ~L_PTE_S2_RDWR) | L_PTE_S2_RDONLY;
+}
+
+static inline bool kvm_s2pte_readonly(pte_t *pte)
+{
+   return (pte_val(*pte) & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY;
+}
+
+static inline void kvm_set_s2pmd_readonly(pmd_t *pmd)
+{
+   pmd_val(*pmd) = (pmd_val(*pmd) & ~L_PMD_S2_RDWR) | L_PMD_S2_RDONLY;
+}
+
+static inline bool kvm_s2pmd_readonly(pmd_t *pmd)
+{
+   return (pmd_val(*pmd) & L_PMD_S2_RDWR) == L_PMD_S2_RDONLY;
+}
+
 /* Open coded p*d_addr_end that can deal with 64bit addresses */
 #define kvm_pgd_addr_end(addr, end)\
 ({ u64 __boundary = ((addr) + PGDIR_SIZE) & PGDIR_MASK;\
diff --git a/arch/arm/include/asm/pgtable-3level.h 
b/arch/arm/include/asm/pgtable-3level.h
index 06e0bc0..d29c880 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -130,6 +130,7 @@
 #define L_PTE_S2_RDONLY(_AT(pteval_t, 1) << 6)   /* 
HAP[1]   */
 #define L_PTE_S2_RDWR  (_AT(pteval_t, 3) << 6)   /* HAP[2:1] */
 
+#define L_PMD_S2_RDONLY(_AT(pmdval_t, 1) << 6)   /* 
HAP[1]   */
 #define L_PMD_S2_RDWR  (_AT(pmdval_t, 3) << 6)   /* HAP[2:1] */
 
 /*
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 16e7994..1e8b6a9 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -45,6 +45,7 @@ static phys_addr_t hyp_idmap_vector;
 #define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
 
 #define kvm_pmd_huge(_x)   (pmd_huge(_x) || pmd_trans_huge(_x))
+#define kvm_pud_huge(_x)   pud_huge(_x)
 
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
@@ -746,6 +747,131 @@ static bool transparent_hugepage_adjust(pfn_t *pfnp, 
phys_addr_t *ipap)
return false;
 }
 
+#ifdef CONFIG_ARM
+/**
+ * stage2_wp_ptes - write protect PMD range
+ * @pmd:   pointer to pmd entry
+ * @addr:  range start address
+ * @end:   range end address
+ */
+static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
+{
+   pte_t *pte;
+
+   pte = pte_offset_kernel(pmd, addr);
+   do {
+   if (!pte_none(*pte)) {
+   if (!kvm_s2pte_readonly(pte))
+   kvm_set_s2pte_readonly(pte);
+   }
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+}
+
+/**
+ * stage2_wp_pmds - write protect PUD range
+ * @pud:   pointer to pud entry
+ * @addr:  range start address
+ * @end:   range end address
+ */
+static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
+{
+   pmd_t *pmd;
+   phys_addr_t next;
+
+   pmd = pmd_offset(pud, addr);
+
+   do {
+   next = kvm_pmd_addr_end(addr, end);
+   if (!pmd_none(*pmd)) {
+   if (kvm_pmd_huge(*pmd)) {
+   if (!kvm_s2pmd_readonly(pmd))
+   kvm_set_s2pmd_readonly(pmd);
+   } else {
+   stage2_wp_ptes(pmd, addr, next);
+   }
+   }
+   } while (pmd++, addr = next, addr != end);
+}
+
+/**
+  * stage2_wp_puds - write protect PGD range
+  * @pgd:  pointer to pgd entry
+  * @addr: range start address
+  * @end:  range end address
+  *
+  * Process PUD entries, for a huge PUD we cause a panic.
+  */
+static void  stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
+{
+   pud_t *pud;
+   phys_addr_t next;
+
+   pud = pud_offset(pgd, addr);
+   do {
+   next = kvm_pud_addr_end(addr, end);
+   if (!pud_none(*pud)) {
+   /* TODO:PU

[PATCH v14 6/7] KVM: arm: dirty logging write protect support

2014-11-13 Thread Mario Smarduch

Add support to track dirty pages between user space KVM_GET_DIRTY_LOG ioctl
calls. We call kvm_get_dirty_log_protect() function to do most of the work.

Reviewed-by: Marc Zyngier 
Signed-off-by: Mario Smarduch 
---
 arch/arm/kvm/Kconfig |1 +
 arch/arm/kvm/arm.c   |   46 ++
 arch/arm/kvm/mmu.c   |   22 ++
 3 files changed, 69 insertions(+)

diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index f27f336..a8d1ace 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -24,6 +24,7 @@ config KVM
select HAVE_KVM_ARCH_TLB_FLUSH_ALL
select KVM_MMIO
select KVM_ARM_HOST
+   select KVM_GENERIC_DIRTYLOG_READ_PROTECT
depends on ARM_VIRT_EXT && ARM_LPAE
---help---
  Support hosting virtualized guest machines. You will also
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index a99e0cd..040c0f3 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -737,9 +737,55 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
}
 }
 
+/**
+ * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
+ * @kvm:   kvm instance
+ * @log:   slot id and address to which we copy the log
+ *
+ * We need to keep it in mind that VCPU threads can write to the bitmap
+ * concurrently.  So, to avoid losing data, we keep the following order for
+ * each bit:
+ *
+ *   1. Take a snapshot of the bit and clear it if needed.
+ *   2. Write protect the corresponding page.
+ *   3. Copy the snapshot to the userspace.
+ *   4. Flush TLB's if needed.
+ *
+ * Steps 1,2,3 are handled by kvm_get_dirty_log_protect().
+ * Between 2 and 4, the guest may write to the page using the remaining TLB
+ * entry.  This is not a problem because the page is reported dirty using
+ * the snapshot taken before and step 4 ensures that writes done after
+ * exiting to userspace will be logged for the next call.
+ */
 int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
 {
+#ifdef CONFIG_ARM
+   int r;
+   bool is_dirty = false;
+
+   mutex_lock(&kvm->slots_lock);
+
+   r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);
+   if (r)
+   goto out;
+
+   /*
+* kvm_get_dirty_log_protect() may fail and we may skip TLB flush
+* leaving few stale spte TLB entries which is harmless, given we're
+* just write protecting spte's, so few stale TLB's will be left in
+* original R/W state. And since the bitmap is corrupt userspace will
+* error out anyway (i.e. during migration or dirty page loging for
+* other reasons) terminating dirty page logging.
+*/
+   if (is_dirty)
+   kvm_flush_remote_tlbs(kvm);
+out:
+   mutex_unlock(&kvm->slots_lock);
+
+   return r;
+#else /* ARM64 */
return -EINVAL;
+#endif
 }
 
 static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 1e8b6a9..8137455 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -870,6 +870,28 @@ void kvm_mmu_wp_memory_region(struct kvm *kvm, int slot)
spin_unlock(&kvm->mmu_lock);
kvm_flush_remote_tlbs(kvm);
 }
+
+/**
+ * kvm_arch_mmu_write_protect_pt_masked() - write protect dirty pages
+ * @kvm:   The KVM pointer
+ * @slot:  The memory slot associated with mask
+ * @gfn_offset:The gfn offset in memory slot
+ * @mask:  The mask of dirty pages at offset 'gfn_offset' in this memory
+ *  slot to be write protected
+ *
+ * Walks bits set in mask write protects the associated pte's. Caller must
+ * acquire kvm_mmu_lock.
+ */
+void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
+   struct kvm_memory_slot *slot,
+   gfn_t gfn_offset, unsigned long mask)
+{
+   phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
+   phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
+   phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
+
+   stage2_wp_range(kvm, start, end);
+}
 #endif
 
 static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v14 7/7] KVM: arm: page logging 2nd stage fault handling

2014-11-13 Thread Mario Smarduch

This patch adds support for handling 2nd stage page faults during migration,
it disables faulting in huge pages, and dissolves huge pages to page tables.
In case migration is canceled huge pages are used again.

Reviewed-by: Christoffer Dall 
Signed-off-by: Mario Smarduch 
---
 arch/arm/kvm/mmu.c |   56 
 1 file changed, 48 insertions(+), 8 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 8137455..ff88e5b 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -47,6 +47,20 @@ static phys_addr_t hyp_idmap_vector;
 #define kvm_pmd_huge(_x)   (pmd_huge(_x) || pmd_trans_huge(_x))
 #define kvm_pud_huge(_x)   pud_huge(_x)
 
+#define IOMAP_ATTR 0x1
+#define LOGGING_ACTIVE 0x2
+#define SET_SPTE_FLAGS(l, i)   ((l) << (LOGGING_ACTIVE - 1) | \
+(i) << (IOMAP_ATTR - 1))
+
+static bool kvm_get_logging_state(struct kvm_memory_slot *memslot)
+{
+#ifdef CONFIG_ARM
+   return !!memslot->dirty_bitmap;
+#else
+   return false;
+#endif
+}
+
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
 {
/*
@@ -626,10 +640,13 @@ static int stage2_set_pmd_huge(struct kvm *kvm, struct 
kvm_mmu_memory_cache
 }
 
 static int stage2_set_pte(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
- phys_addr_t addr, const pte_t *new_pte, bool iomap)
+ phys_addr_t addr, const pte_t *new_pte,
+ unsigned long flags)
 {
pmd_t *pmd;
pte_t *pte, old_pte;
+   bool iomap = flags & IOMAP_ATTR;
+   bool logging_active = flags & LOGGING_ACTIVE;
 
/* Create stage-2 page table mapping - Level 1 */
pmd = stage2_get_pmd(kvm, cache, addr);
@@ -641,6 +658,18 @@ static int stage2_set_pte(struct kvm *kvm, struct 
kvm_mmu_memory_cache *cache,
return 0;
}
 
+   /*
+* While dirty memory logging, clear PMD entry for huge page and split
+* into smaller pages, to track dirty memory at page granularity.
+*/
+   if (logging_active && kvm_pmd_huge(*pmd)) {
+   phys_addr_t ipa = pmd_pfn(*pmd) << PAGE_SHIFT;
+
+   pmd_clear(pmd);
+   kvm_tlb_flush_vmid_ipa(kvm, ipa);
+   put_page(virt_to_page(pmd));
+   }
+
/* Create stage-2 page mappings - Level 2 */
if (pmd_none(*pmd)) {
if (!cache)
@@ -693,7 +722,7 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t 
guest_ipa,
if (ret)
goto out;
spin_lock(&kvm->mmu_lock);
-   ret = stage2_set_pte(kvm, &cache, addr, &pte, true);
+   ret = stage2_set_pte(kvm, &cache, addr, &pte, IOMAP_ATTR);
spin_unlock(&kvm->mmu_lock);
if (ret)
goto out;
@@ -908,6 +937,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
struct vm_area_struct *vma;
pfn_t pfn;
pgprot_t mem_type = PAGE_S2;
+   bool logging_active = kvm_get_logging_state(memslot);
 
write_fault = kvm_is_write_fault(kvm_vcpu_get_hsr(vcpu));
if (fault_status == FSC_PERM && !write_fault) {
@@ -918,7 +948,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
/* Let's check if we will get back a huge page backed by hugetlbfs */
down_read(¤t->mm->mmap_sem);
vma = find_vma_intersection(current->mm, hva, hva + 1);
-   if (is_vm_hugetlb_page(vma)) {
+   if (is_vm_hugetlb_page(vma) && !logging_active) {
hugetlb = true;
gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
} else {
@@ -964,7 +994,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
spin_lock(&kvm->mmu_lock);
if (mmu_notifier_retry(kvm, mmu_seq))
goto out_unlock;
-   if (!hugetlb && !force_pte)
+   if (!hugetlb && !force_pte && !logging_active)
hugetlb = transparent_hugepage_adjust(&pfn, &fault_ipa);
 
if (hugetlb) {
@@ -978,16 +1008,18 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
phys_addr_t fault_ipa,
ret = stage2_set_pmd_huge(kvm, memcache, fault_ipa, &new_pmd);
} else {
pte_t new_pte = pfn_pte(pfn, mem_type);
+   unsigned long flags = SET_SPTE_FLAGS(logging_active,
+   mem_type == PAGE_S2_DEVICE);
if (writable) {
kvm_set_s2pte_writable(&new_pte);
kvm_set_pfn_dirty(pfn);
}
coherent_cache_guest_page(vcpu, hva, PAGE_SIZE);
-   ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte,
-mem_type == PAGE_S2_DEVICE);
+   ret = stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, flags);
}
 
-
+

Re: [PATCH v14 3/7] KVM: x86: switch to kvm_get_dirty_log_protect

2014-11-13 Thread Mario Smarduch

Hi Paolo,

  I changed your patch a little to use a Kconfig symbol,
hope that's fine with you.

- Mario

On 11/13/2014 05:57 PM, Mario Smarduch wrote:
> From: Paolo Bonzini 
> 
> We now have a generic function that does most of the work of
> kvm_vm_ioctl_get_dirty_log, now use it.
> 
> Signed-off-by: Mario Smarduch 
> ---
>  arch/x86/include/asm/kvm_host.h |3 --
>  arch/x86/kvm/Kconfig|1 +
>  arch/x86/kvm/mmu.c  |4 +--
>  arch/x86/kvm/x86.c  |   64 
> ++-
>  4 files changed, 12 insertions(+), 60 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7c492ed..934dc24 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -805,9 +805,6 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 
> accessed_mask,
>  
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
> -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> -  struct kvm_memory_slot *slot,
> -  gfn_t gfn_offset, unsigned long mask);
>  void kvm_mmu_zap_all(struct kvm *kvm);
>  void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm);
>  unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm);
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index f9d16ff..d073594 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -39,6 +39,7 @@ config KVM
>   select PERF_EVENTS
>   select HAVE_KVM_MSI
>   select HAVE_KVM_CPU_RELAX_INTERCEPT
> + select KVM_GENERIC_DIRTYLOG_READ_PROTECT
>   select KVM_VFIO
>   ---help---
> Support hosting fully virtualized guest machines using hardware
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 9314678..bf6b82c 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -1224,7 +1224,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
> unsigned long *rmapp,
>  }
>  
>  /**
> - * kvm_mmu_write_protect_pt_masked - write protect selected PT level pages
> + * kvm_arch_mmu_write_protect_pt_masked - write protect selected PT level 
> pages
>   * @kvm: kvm instance
>   * @slot: slot to protect
>   * @gfn_offset: start of the BITS_PER_LONG pages we care about
> @@ -1233,7 +1233,7 @@ static bool __rmap_write_protect(struct kvm *kvm, 
> unsigned long *rmapp,
>   * Used when we do not need to care about huge page mappings: e.g. during 
> dirty
>   * logging we do not have any such mappings.
>   */
> -void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> +void kvm_arch_mmu_write_protect_pt_masked(struct kvm *kvm,
>struct kvm_memory_slot *slot,
>gfn_t gfn_offset, unsigned long mask)
>  {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8f1e22d..9f8ae9a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3606,77 +3606,31 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
>   *
>   *   1. Take a snapshot of the bit and clear it if needed.
>   *   2. Write protect the corresponding page.
> - *   3. Flush TLB's if needed.
> - *   4. Copy the snapshot to the userspace.
> + *   3. Copy the snapshot to the userspace.
> + *   4. Flush TLB's if needed.
>   *
> - * Between 2 and 3, the guest may write to the page using the remaining TLB
> - * entry.  This is not a problem because the page will be reported dirty at
> - * step 4 using the snapshot taken before and step 3 ensures that successive
> - * writes will be logged for the next call.
> + * Between 2 and 4, the guest may write to the page using the remaining TLB
> + * entry.  This is not a problem because the page is reported dirty using
> + * the snapshot taken before and step 4 ensures that writes done after
> + * exiting to userspace will be logged for the next call.
>   */
>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  {
> - int r;
> - struct kvm_memory_slot *memslot;
> - unsigned long n, i;
> - unsigned long *dirty_bitmap;
> - unsigned long *dirty_bitmap_buffer;
>   bool is_dirty = false;
> + int r;
>  
>   mutex_lock(&kvm->slots_lock);
>  
> - r = -EINVAL;
> - if (log->slot >= KVM_USER_MEM_SLOTS)
> - goto out;
> -
> - memslot = id_to_memslot(kvm->memslots, log->slot);
> -
> - dirty_bitmap = memslot->dirty_bitmap;
> - r = -ENOENT;
> - if (!dirty_bitmap)
> - goto out;
> -
> - n = kvm_dirty_bitmap_bytes(memslot);
> -
> - dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> - memset(dirty_bitmap_buffer, 0, n);
> -
> - spin_lock(&kvm->mmu_lock);
> -
> - for (i = 0; i < n / sizeof(long); i++) {
> - unsigned long mask;
> - gfn_t offset;
> -
> - if (!dirty_bitmap[i])
> - continue;
> -
> - is_dirty = true;
> -
> -

Re: [PATCH v4] KVM: x86: fix access memslots w/o hold srcu read lock

2014-11-13 Thread Tang Chen



Thanks for the sharing. Will do more tests. :)

On 11/14/2014 07:39 AM, Wanpeng Li wrote:

Hi Tang,
On Tue, Nov 11, 2014 at 01:35:29PM +0800, Tang Chen wrote:

Hi Wanpeng,


Sorry for the late.


I think I have totally missed this thread.
I opened lockdep and RCU debug, and tried on 3.18-rc1. But I didn't
get the warning.

I also opened lockdep and RCU debug, and tried 3.18.0-rc2 on a Ivy
bridge, the warning will be triggered after run qemu immediately. There
is no need to try any hotplug related stuff.

In addition, Paolo's patch is merged upstream to fix this.

commit a73896cb5bbdce672945745db8224352a689f580
Author: Paolo Bonzini 
Date:   Sun Nov 2 07:54:30 2014 +0100

KVM: vmx: defer load of APIC access page address during reset

Regards,
Wanpeng Li


My steps are:

1. Use numactl to bind a qemu process to node1.
2. Offline all node1 memory. And the qemu process is still running.

Would you please tell me how did you reproduce it ?

Thanks.

On 11/02/2014 03:07 PM, Wanpeng Li wrote:

The srcu read lock must be held while accessing memslots (e.g.
when using gfn_to_* functions), however, commit c24ae0dcd3e8
("kvm: x86: Unpin and remove kvm_arch->apic_access_page") call
gfn_to_page() in kvm_vcpu_reload_apic_access_page() w/o hold it in
vmx_vcpu_reset() path which leads to suspicious rcu_dereference_check()
usage warning. This patch fix it by holding srcu read lock in all
kvm_vcpu_reset() call path.


[ INFO: suspicious RCU usage. ]
3.18.0-rc2-test2+ #70 Not tainted
---
include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/2371:
  #0:  (&vcpu->mutex){+.+...}, at: [] vcpu_load+0x20/0xd0 
[kvm]

stack backtrace:
CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70
Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
  0001 880209983ca8 816f514f 
  8802099b8990 880209983cd8 810bd687 000fee00
  880208a2c000 880208a1 88020ef50040 880209983d08
Call Trace:
  [] dump_stack+0x4e/0x71
  [] lockdep_rcu_suspicious+0xe7/0x120
  [] gfn_to_memslot+0xd5/0xe0 [kvm]
  [] __gfn_to_pfn+0x33/0x60 [kvm]
  [] gfn_to_page+0x25/0x90 [kvm]
  [] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm]
  [] vmx_vcpu_reset+0x20c/0x460 [kvm_intel]
  [] kvm_vcpu_reset+0x15e/0x1b0 [kvm]
  [] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
  [] kvm_vm_ioctl+0x1d0/0x780 [kvm]
  [] ? __lock_is_held+0x54/0x80
  [] do_vfs_ioctl+0x300/0x520
  [] ? __fget+0x5/0x250
  [] ? __fget_light+0x2a/0xe0
  [] SyS_ioctl+0x81/0xa0
  [] system_call_fastpath+0x16/0x1b

Reported-by: Takashi Iwai 
Reported-by: Alexei Starovoitov 
Suggested-by: Paolo Bonzini 
Signed-off-by: Wanpeng Li 
---
v3 -> v4:
  * bypass the problem altoghter by kvm_make_request
v2 -> v3:
  * take care all vmx_vcpu_reset call path
v1 -> v2:
  * just fix hold the srcu read lock in vmx_vcpu_reset path

  arch/x86/kvm/vmx.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index a0f78db..3e556c6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4579,7 +4579,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmcs_write32(TPR_THRESHOLD, 0);
}
-   kvm_vcpu_reload_apic_access_page(vcpu);
+   kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
if (vmx_vm_has_apicv(vcpu->kvm))
memset(&vmx->pi_desc, 0, sizeof(struct pi_desc));

.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

45 matches

Mail list logo