Re: [PATCH] powerpc/kvm: Fix build break with CONFIG_KVM_BOOK3S_64_HV=y

2014-06-24 Thread Michael Ellerman
On Tue, 2014-06-24 at 15:51 +0900, Joonsoo Kim wrote:
> On Tue, Jun 24, 2014 at 04:36:47PM +1000, Michael Ellerman wrote:
> > Commit e58e263 "PPC, KVM, CMA: use general CMA reserved area management
> > framework" in next-20140624 removed arch/powerpc/kvm/book3s_hv_cma.c but
> > neglected to update the Makefile, thus breaking the build.
> > 
> > Signed-off-by: Michael Ellerman 
> > ---
> > 
> > Hi Andrew,
> > 
> > This is in your akpm-current and is breaking some of the builds for
> > powerpc in linux-next. Squashing this fix into the original patch would
> > be best for us.
> > 
> 
> I sent exactly the same patch 10 minutes ago. :)

Haha, oh well, I should have had a coffee instead :)

cheers


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] KVM: x86: fix TSC matching

2014-06-24 Thread Tomasz Grabiec
I've observed kvmclock being marked as unstable on a modern
single-socket system with a stable TSC and qemu-1.6.2 or qemu-2.0.0.

The culprit was failure in TSC matching because of overflow of
kvm_arch::nr_vcpus_matched_tsc in case there were multiple TSC writes
in a single synchronization cycle.

It turns out that QEMU does multiple TSC writes during init; below is the
evidence of that (qemu-2.0.0):

The first one:

 0xa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm]
 0xa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm]

The second one:

 0xa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]

 #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1  kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2  kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3  kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641
 #4  cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330
 #5  cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521
 #6  main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390

The third one:

 0xa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]

 #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1  kvm_put_msrs  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2  kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3  kvm_cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635
 #4  cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323
 #5  cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512
 #6  main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482

The fix is to count each vCPU only once when matched, so that
nr_vcpus_matched_tsc holds the size of the matched set. This is
achieved by reusing generation counters: every vCPU with
this_tsc_generation == cur_tsc_generation is in the matched set. The
matched set is cleared by setting cur_tsc_generation to a value no
other vCPU holds (by incrementing it).

I needed to bump the counter size from u8 to u64 to ensure it never
overflows. Otherwise, if the TSC is not written the same number of
times on each vCPU, the counter could overflow and incorrectly indicate
some vCPUs as being in the matched set. This scenario seems unlikely,
but I'm not sure it can be disregarded.
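The generation-counter bookkeeping can be sketched in plain C (a
simplified userspace model of the idea, not the actual KVM code; the
struct and function names here are illustrative stand-ins):

```c
#include <assert.h>

/* Simplified model: a vCPU is "in the matched set" iff its generation
 * equals the current one; bumping cur_tsc_generation empties the set
 * in O(1) without touching the other vCPUs. */

#define NVCPUS 4

struct vcpu { unsigned long long this_tsc_generation; };

struct vm {
    unsigned long long cur_tsc_generation;
    int nr_vcpus_matched_tsc;
    struct vcpu vcpus[NVCPUS];
};

/* A TSC write that matches the current synchronization cycle: each
 * vCPU is counted at most once per generation, so repeated writes
 * from the same vCPU no longer inflate the counter. */
static void tsc_write_matched(struct vm *vm, struct vcpu *v)
{
    int already = (v->this_tsc_generation == vm->cur_tsc_generation);
    if (!already) {
        v->this_tsc_generation = vm->cur_tsc_generation;
        vm->nr_vcpus_matched_tsc++;
    }
}

/* A non-matching write starts a new generation: the matched set is
 * cleared simply by moving to a generation no vCPU holds yet. */
static void tsc_write_unmatched(struct vm *vm, struct vcpu *v)
{
    vm->cur_tsc_generation++;
    vm->nr_vcpus_matched_tsc = 1;   /* only the writer matches now */
    v->this_tsc_generation = vm->cur_tsc_generation;
}
```

With this scheme, three distinct vCPUs writing (one of them twice)
yields a matched-set size of 3, not 4.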

Signed-off-by: Tomasz Grabiec 
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/x86.c  | 11 +++
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63e020b..af36f89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -448,7 +448,7 @@ struct kvm_vcpu_arch {
u64 tsc_offset_adjustment;
u64 this_tsc_nsec;
u64 this_tsc_write;
-   u8  this_tsc_generation;
+   u64 this_tsc_generation;
bool tsc_catchup;
bool tsc_always_catchup;
s8 virtual_tsc_shift;
@@ -591,7 +591,7 @@ struct kvm_arch {
u64 cur_tsc_nsec;
u64 cur_tsc_write;
u64 cur_tsc_offset;
-   u8  cur_tsc_generation;
+   u64 cur_tsc_generation;
int nr_vcpus_matched_tsc;
 
spinlock_t pvclock_gtod_sync_lock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 874607a..d1330f3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1215,6 +1215,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
unsigned long flags;
s64 usdiff;
bool matched;
+   bool already_matched;
u64 data = msr->data;
 
raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
@@ -1279,6 +1280,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
}
matched = true;
+   already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
} else {
/*
 * We split periods of matched TSC writes into generations.
@

Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code

2014-06-24 Thread Joonsoo Kim
On Wed, Jun 18, 2014 at 01:51:44PM -0700, Andrew Morton wrote:
> On Tue, 17 Jun 2014 10:25:07 +0900 Joonsoo Kim  wrote:
> 
> > > >v2:
> > > >   - Although this patchset looks very different with v1, the end result,
> > > >   that is, mm/cma.c is same with v1's one. So I carry Ack to patch 6-7.
> > > >
> > > >This patchset is based on linux-next 20140610.
> > > 
> > > Thanks for taking care of this. I will test it with my setup and if
> > > everything goes well, I will take it to my -next tree. If any branch
> > > is required for anyone to continue his works on top of those patches,
> > > let me know, I will also prepare it.
> > 
> > Hello,
> > 
> > I'm glad to hear that. :)
> > But, there is one concern. As you already know, I am preparing further
> > patches (Aggressively allocate the pages on CMA reserved memory). It
> > may be highly related to MM branch and also slightly depends on this CMA
> > changes. In this case, what is the best strategy to merge this
> > patchset? IMHO, Andrew's tree is the more appropriate branch. If there is
> > no issue in this case, I am willing to develop further patches based
> > on your tree.
> 
> That's probably easier.  Marek, I'll merge these into -mm (and hence
> -next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
> and shall hold them pending you review/ack/test/etc, OK?

Hello, Marek.

Could you share your decision about this patchset?

Thanks.


Re: How to know that vPMU is enabled or disabled?

2014-06-24 Thread Jidong Xiao
On Mon, Jun 23, 2014 at 1:41 PM, Jidong Xiao  wrote:
> Hi, All,
>
> I am using a virtual machine in a cloud environment, which means I am
> in control of the Guest OS, but have no access to the Host OS. Is
> there a simple way to know whether or not the vPMU is enabled or
> disabled? Or, is there something I can control so as to turn its state
> from enable to disable, or vice versa? Thanks.
>

I think I have figured this out. According to this patch (as well as
the Intel SDM), it looks like the PMU is exposed via CPUID leaf
0xA.

https://github.com/torvalds/linux/commit/a6c06ed1a60aff77b27ba558c315c3fed4e35565

Therefore, in the guest OS, one can run the cpuid instruction with
leaf 0xA; if the vPMU is supported/enabled, the values returned in
eax/ebx/ecx/edx should be non-zero. In the cloud machine I am using,
these registers are all zero, so I conclude that the vPMU is not
supported in my virtual machine. Probably it is masked by QEMU.
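A minimal sketch of decoding leaf 0xA's EAX output (field layout per
the Intel SDM; on a real guest you would obtain EAX with the cpuid
instruction, e.g. GCC's __get_cpuid with leaf 0xA, rather than the
sample value used below):

```c
#include <assert.h>
#include <stdint.h>

/* Decode the EAX output of CPUID leaf 0xA (architectural performance
 * monitoring).  A version ID of 0 -- EAX[7:0] == 0 -- means no
 * architectural PMU is exposed, which is what an all-zero leaf in a
 * VM indicates. */

struct pmu_caps {
    unsigned version;       /* EAX[7:0]  : 0 => no vPMU              */
    unsigned num_counters;  /* EAX[15:8] : general-purpose counters  */
    unsigned counter_width; /* EAX[23:16]: counter bit width         */
};

static struct pmu_caps decode_leaf_0xa_eax(uint32_t eax)
{
    struct pmu_caps c;
    c.version       = eax & 0xff;
    c.num_counters  = (eax >> 8) & 0xff;
    c.counter_width = (eax >> 16) & 0xff;
    return c;
}
```

For example, an EAX of 0x07300403 (a value typical of recent Intel
parts) decodes to PMU version 3 with four 48-bit general-purpose
counters, while EAX == 0 means no vPMU.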

-Jidong


Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?

2014-06-24 Thread Paul Mackerras
On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote:
> 
> On 17.06.14 13:39, Eric Auger wrote:
> >Hello,
> >
> >I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl
> >relationship.
> >
> >Reading the KVM API documentation, I do not see any dependency between
> >KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the text, it seems only
> >the gsi field is used and interpreted as the irqchip pin.
> >
> >However irqchip.c kvm_set_irq code relies on an existing and not dummy
> >routing table.
> >
> >My question is: does anyone agree on the fact the user-side must set a
> >consistent routing table using KVM_SET_GSI_ROUTING before using
> >KVM_IRQFD? The other alternative would have been to build a default
> >identity GSI routing table in the kernel (gsi = irqchip.pin).
> 
> I untangled irqfd support from the x86 ioapic emulation a while back. When I
> looked at it, I didn't see any easy way to split it out from the routing
> too, so I kept that dependency in.
> 
> If you look at the code, you will see that the irq routing entry is used as
> token for an irqfd. So every irqfd only knows its routing table entry,
> nothing else.

As far as I can see, the routing table entry is used for only two
things: to tell whether the interrupt is an MSI or LSI, and to pass to
kvm_set_msi.

One way to tackle the problem would be to abstract out the routing
table lookup into a function in irqchip.c, rather than having it
open-coded in irqfd_update.  Then, if you don't want to have the
routing table you write an alternative function that creates a
routing entry from scratch.  It would need information from the
low-level irq chip code as to which interrupts are LSIs and which are
MSIs.  It also ties you down to having only one kind of interrupt
controller.

The other way I see to solve my particular problem, which is that
userspace tends to use hardware interrupt numbers starting at 4096, is
to replace the arrays in struct kvm_irq_routing_table with idr
structures or something similar.
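The array-replacement idea can be sketched with a sparse lookup
structure; here a toy open-addressing table stands in for the kernel's
idr, and all names are illustrative:

```c
#include <assert.h>

/* A flat chip[pin] array wastes space when interrupt numbers start
 * around 4096.  A sparse map keyed by GSI avoids that; this tiny
 * open-addressing hash table illustrates the lookup shape. */

#define TBL_SIZE 64   /* power of two, so "& (TBL_SIZE - 1)" wraps */

struct gsi_entry { int gsi; int irq; int used; };

struct gsi_map { struct gsi_entry slots[TBL_SIZE]; };

static void gsi_map_set(struct gsi_map *m, int gsi, int irq)
{
    unsigned h = (unsigned)gsi & (TBL_SIZE - 1);
    /* linear probing: skip occupied slots holding other keys */
    while (m->slots[h].used && m->slots[h].gsi != gsi)
        h = (h + 1) & (TBL_SIZE - 1);
    m->slots[h] = (struct gsi_entry){ .gsi = gsi, .irq = irq, .used = 1 };
}

static int gsi_map_get(const struct gsi_map *m, int gsi)
{
    unsigned h = (unsigned)gsi & (TBL_SIZE - 1);
    while (m->slots[h].used) {
        if (m->slots[h].gsi == gsi)
            return m->slots[h].irq;
        h = (h + 1) & (TBL_SIZE - 1);
    }
    return -1;   /* not mapped */
}
```

Storage is then proportional to the number of routed GSIs rather than
to the largest GSI number in use.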

> Splitting that dependency out is certainly quite tedious work. However, I'm
> sure the IBM folks will be grateful if you do it :).

Indeed. :)

Paul.


[Bug 67751] Stack trace with suspicious RCU usage, when starting ovs-switchd

2014-06-24 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=67751

--- Comment #5 from Kashyap Chamarthy  ---
I just saw this again when I attempted to restart Open vSwitch with kernel
3.16.0-0.rc1.git1.1.fc21.x86_64:

-
[ 9695.079715] 
[ 9695.080492] ===
[ 9695.082559] [ INFO: suspicious RCU usage. ]
[ 9695.085105] 3.16.0-0.rc1.git1.1.fc21.x86_64 #1 Not tainted
[ 9695.088505] ---
[ 9695.090413] net/openvswitch/flow_table.c:255 suspicious rcu_dereference_protected() usage!
[ 9695.094887] 
[ 9695.094887] other info that might help us debug this:
[ 9695.094887] 
[ 9695.100722] 
[ 9695.100722] rcu_scheduler_active = 1, debug_locks = 0
[ 9695.105699] 1 lock held by ovs-vswitchd/1910:
[ 9695.108199]  #0:  (cb_lock){++}, at: [] genl_rcv+0x19/0x40
[ 9695.113083] 
[ 9695.113083] stack backtrace:
[ 9695.115395] CPU: 0 PID: 1910 Comm: ovs-vswitchd Not tainted 3.16.0-0.rc1.git1.1.fc21.x86_64 #1
[ 9695.121638] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9695.124076]   2c60696c 8800b64ff968
81807520
[ 9695.127732]  88029ce24d70 8800b64ff998 810fd7f7

[ 9695.131321]  88029ccc6d20 8800b64ffa90 88029ccc6d00
8800b64ff9b8
[ 9695.134931] Call Trace:
[ 9695.136158]  [] dump_stack+0x4d/0x66
[ 9695.139031]  [] lockdep_rcu_suspicious+0xe7/0x120
[ 9695.141850]  [] ovs_flow_tbl_destroy+0x60/0x70 [openvswitch]
[ 9695.145034]  [] ovs_dp_cmd_new+0x311/0x4b0 [openvswitch]
[ 9695.148665]  [] genl_family_rcv_msg+0x1bc/0x3e0
[ 9695.152987]  [] ? sched_clock+0x9/0x10
[ 9695.157925]  [] ? sched_clock_local+0x1d/0x90
[ 9695.162486]  [] genl_rcv_msg+0x84/0xc0
[ 9695.166797]  [] ? genl_family_rcv_msg+0x3e0/0x3e0
[ 9695.171411]  [] netlink_rcv_skb+0xa9/0xd0
[ 9695.175809]  [] genl_rcv+0x28/0x40
[ 9695.179489]  [] netlink_unicast+0x125/0x1a0
[ 9695.184300]  [] netlink_sendmsg+0x35f/0x7d0
[ 9695.188510]  [] sock_sendmsg+0x9e/0xe0
[ 9695.192580]  [] ? might_fault+0x5e/0xc0
[ 9695.196850]  [] ? might_fault+0xb9/0xc0
[ 9695.200907]  [] ? might_fault+0x5e/0xc0
[ 9695.204862]  [] ___sys_sendmsg+0x408/0x420
[ 9695.208962]  [] ? __fget_light+0x13d/0x160
[ 9695.213088]  [] ? __fget_light+0x13d/0x160
[ 9695.217652]  [] __sys_sendmsg+0x51/0x90
[ 9695.220513]  [] SyS_sendmsg+0x12/0x20
[ 9695.223422]  [] system_call_fastpath+0x16/0x1b
-

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


vhost on mmio platform devices

2014-06-24 Thread GAUGUEY Rémy 228890
Hi,

I'm currently trying to make vhost-net work on an ARM Cortex-A15 Arndale board.
As irqfd is not yet upstream, I've used the irqfd and irq routing patches
recently posted on the kvmarm list.
So far I can properly generate IRQs to the guest using the vgic and irq routing,
but as far as I can see the virtio registers (interrupt status register at
offset 0x60) are not updated by the vhost kernel backend.
Does that mean the virtio virtual registers have to be managed by QEMU, even
when using a vhost kernel back-end?
I cannot find where those registers could be updated inside the vhost code, yet
I can clearly see that MMIO accesses from the guest to those registers cause a
QEMU virtio_mmio_read or virtio_mmio_write.

I thought vhost was added to prevent any QEMU interaction in the data path.
Could someone more familiar with vhost tell me if and where I'm wrong?
Thanks a lot in advance.
Best regards,
Rémy





Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexey Kardashevskiy
On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
> 
>> Working on big endian being an accident may be a matter of perspective
> 
>  :-)
> 
>> The comment remains that this patch doesn't actually fix anything except
>> the overhead on big endian systems doing redundant byte swapping and
>> maybe the philosophy that vfio regions are little endian.
> 
> Yes, that works by accident because technically VFIO is a transport and
> thus shouldn't perform any endian swapping of any sort, which remains
> the responsibility of the end driver which is the only one to know
> whether a given BAR location is a register or some streaming data
> and in the former case whether it's LE or BE (some PCI devices are BE
> even ! :-)
> 
> But yes, in the end, it works with the dual "cancelling" swaps and the
> overhead of those swaps is probably drowned in the noise of the syscall
> overhead.
> 
>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>> can use that doesn't have an implicit swap.
> 
> Sadly there isn't ... In the old days we didn't even have the "be"
> variant, and readl/writel style accessors still don't have them either
> for all archs.
> 
> There is __raw_readl/writel but here the semantics are much more than
> just "don't swap", they also don't have memory barriers (which means
> they are essentially useless to most drivers unless those are platform
> specific drivers which know exactly what they are doing, or in the rare
> cases such as accessing a framebuffer which we know never have side
> effects). 
> 
>>  Calling it iowrite*_native is also an abuse of the namespace.
> 
> 
>>  Next thing we know some common code
>> will legitimately use that name. 
> 
> It might make sense to move those definitions into a common header. There have
> been a handful of cases in the past that wanted that sort of "native
> byte order" MMIOs iirc (though don't ask me for examples, I can't really
> remember).
> 
>>  If we do need to define an alias
>> (which I'd like to avoid) it should be something like vfio_iowrite32.


Ping?

We need to make a decision: either move those xxx_native() helpers
somewhere (where?), or leave the patch as is (since we figured out that
the iowriteXX functions implement barriers and we cannot just use raw
accessors) and fix the commit log to explain everything.

Thanks!




>> Thanks,
> 
> Cheers,
> Ben.
> 
>> Alex
>>
 ===

 any better?




>>> Suggested-by: Benjamin Herrenschmidt 
>>> Signed-off-by: Alexey Kardashevskiy 
>>> ---
>>>  drivers/vfio/pci/vfio_pci_rdwr.c | 20 
>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> index 210db24..f363b5a 100644
>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> @@ -21,6 +21,18 @@
>>>  
>>>  #include "vfio_pci_private.h"
>>>  
>>> +#ifdef __BIG_ENDIAN__
>>> +#define ioread16_native    ioread16be
>>> +#define ioread32_native    ioread32be
>>> +#define iowrite16_native   iowrite16be
>>> +#define iowrite32_native   iowrite32be
>>> +#else
>>> +#define ioread16_native    ioread16
>>> +#define ioread32_native    ioread32
>>> +#define iowrite16_native   iowrite16
>>> +#define iowrite32_native   iowrite32
>>> +#endif
>>> +
>>>  /*
>>>   * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>>   * range which is inaccessible.  The excluded range drops writes and fills
>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>> if (copy_from_user(&val, buf, 4))
>>> return -EFAULT;
>>>  
>>> -   iowrite32(le32_to_cpu(val), io + off);
>>> +   iowrite32_native(val, io + off);
>>> } else {
>>> -   val = cpu_to_le32(ioread32(io + off));
>>> +   val = ioread32_native(io + off);
>>>  
>>> if (copy_to_user(buf, &val, 4))
>>> return -EFAULT;
>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>> if (copy_from_user(&val, buf, 2))
>>> return -EFAULT;
>>>  
>>> -   iowrite16(le16_to_cpu(val), io + off);
>>> +   iowrite16_native(val, io + off);
>>> } else {
>>> -   val = cpu_to_le16(ioread16(io + off));

Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-24 Thread Peter Zijlstra
On Tue, Jun 17, 2014 at 07:23:44PM -0400, Konrad Rzeszutek Wilk wrote:
> > Actually in my v11 patch, I subdivided the slowpath into a slowpath for 
> > the pending code and slowerpath for actual queuing. Perhaps, we could 
> > use quickpath and slowpath instead. Anyway, it is a minor detail that we 
> > can discuss after the core code get merged.

> Why not do it the right way the first time around?

Because I told him to not do this. There's the fast path; the inline
single trylock cmpxchg, and the slow path; the out-of-line thing doing
the rest.

Note that pretty much all other locking primitives are implemented
similarly, with fast and slow paths.
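The fast/slow split can be illustrated with a toy lock in C11 atomics.
This is only a sketch of the calling convention, not the qspinlock
algorithm: the real slow path implements the pending/queueing state
machine, whereas here it merely spins.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int lock_val;   /* 0 = unlocked, 1 = locked */

/* Out-of-line contended path: in qspinlock this would be the
 * pending/queueing state machine; here it just retries the cmpxchg. */
static void lock_slowpath(atomic_int *l)
{
    int expected;
    do {
        expected = 0;
    } while (!atomic_compare_exchange_weak(l, &expected, 1));
}

/* Fast path: the uncontended acquire is a single inline cmpxchg;
 * only on failure do we fall through to the slow path. */
static inline void lock(atomic_int *l)
{
    int expected = 0;
    if (atomic_compare_exchange_strong(l, &expected, 1))
        return;
    lock_slowpath(l);
}

static inline void unlock(atomic_int *l)
{
    atomic_store(l, 0);
}
```

Keeping the fast path this small is what makes it cheap to inline at
every lock site, while the slow path stays out of line.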

I find that having the entire state machine in a single function is
easier.

> That aside - these optimization - seem to make the code harder to
> read. And they do remind me of the scheduler code in 2.6.x which was
> based on heuristics - and eventually ripped out.

Well, it increases the states and thereby the complexity, nothing to be
done about that. Also, its not a random heuristic in the sense that it
has odd behaviour. Its behaviour is very well controlled.

Furthermore, without this the qspinlock performance is too far off the
ticket lock performance to be a possible replacement.

> So are these optimizations based on turning off certain hardware
> features? Say hardware prefetching?

We can try of course, but that doesn't help the code -- in fact, adding
the switch to turn it off _adds_ code on top.

> What I am getting at - can the hardware do this at some point (or
> perhaps already does on IvyBridge-EX?) - that is prefetch the per-cpu
> areas so they are always hot? And rendering this optimization not
> needed?

Got a ref to documentation on this new fancy stuff? I might have an
IVB-EX, but I've not tried it yet.

That said, memory fetches are 100s of cycles, and while prefetch can
hide some of that, I'm not sure we can hide all of it, there's not
_that_ much we do.

If we observe the pending and locked bit set, we immediately drop to the
queueing code and touch it. So there's only a few useful instructions to
do.


Re: [PATCH 03/11] qspinlock: Add pending bit

2014-06-24 Thread Peter Zijlstra
On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote:
> > We are trying to make the fastpath as simple as possible as it may be
> > inlined. The complexity of the queue spinlock is in the slowpath.
> 
> Sure, but then it shouldn't be called slowpath anymore as it is not
> slow.

Its common terminology to call the inline part the fast path and the
out-of-line call on failure the slow path.


Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping

2014-06-24 Thread Will Deacon
On Wed, May 07, 2014 at 03:55:57PM +0100, Christoffer Dall wrote:
> On Wed, May 07, 2014 at 10:00:21AM +0100, Marc Zyngier wrote:
> > On Tue, May 06 2014 at  7:04:48 pm BST, Christoffer Dall 
> >  wrote:
> > > On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote:
> > >> Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE.
> > >> 
> > >> Signed-off-by: Kim Phillips 
> > >> ---
> > >>  arch/arm/kvm/mmu.c | 11 ---
> > >>  1 file changed, 8 insertions(+), 3 deletions(-)

[...]

> > > I think this looks reasonable.
> > >
> > > Acked-by: Christoffer Dall 
> > 
> > I feel like I'm missing some context here, and the commit message is way
> > too terse for me to make sense of it.
> > 
> > So far, we can only get into user_mem_abort on a Stage-2 fault
> > (translation or permission) for memory. How can we suddenly get here for
> > a *device* fault? Do we get a special kind of memslot?
> > 
> > I'm not saying the patch does anything wrong, but I'd like to understand
> > the rationale behind it. On its own, it doesn't make much sense.
> > 
> Think device passthrough.  There's nothing preventing user space from
> setting up a memory region to point to device memory (through VFIO or
> /dev/mem).  If that's done, we should enforce device memory properties
> so writes don't linger around in the cache to be written some time later
> when that device memory potentially doesn't belong to the VM anymore.
> 
> This is just one tiny piece of all of them to make device passthrough
> work, and we could hold off with this patch until we have something more
> complete.  On the other hand, we need to start somewhere, and this is
> hardly intrusive and is functionally correct even though you don't have
> a full device passthrough setup.

Please can you queue this patch up? I need it for my VFIO work, where I'm
registering the PCI BARs using KVM_SET_USER_MEMORY_REGION.

Without this, I'd have to trap all accesses and do pread/pwrite from
kvmtool instead of mmaping the regions straight through.

Cheers,

Will


Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?

2014-06-24 Thread Alexander Graf


On 24.06.14 11:47, Paul Mackerras wrote:

On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote:

On 17.06.14 13:39, Eric Auger wrote:

Hello,

I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl
relationship.

Reading the KVM API documentation, I do not see any dependency between
KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the text, it seems only
the gsi field is used and interpreted as the irqchip pin.

However irqchip.c kvm_set_irq code relies on an existing and not dummy
routing table.

My question is: does anyone agree on the fact the user-side must set a
consistent routing table using KVM_SET_GSI_ROUTING before using
KVM_IRQFD? The other alternative would have been to build a default
identity GSI routing table in the kernel (gsi = irqchip.pin).

I untangled irqfd support from the x86 ioapic emulation a while back. When I
looked at it, I didn't see any easy way to split it out from the routing
too, so I kept that dependency in.

If you look at the code, you will see that the irq routing entry is used as
token for an irqfd. So every irqfd only knows its routing table entry,
nothing else.

As far as I can see, the routing table entry is used for only two
things: to tell whether the interrupt is an MSI or LSI, and to pass to
kvm_set_msi.

One way to tackle the problem would be to abstract out the routing
table lookup into a function in irqchip.c, rather than having it
open-coded in irqfd_update.  Then, if you don't want to have the


You could also create it along with the irqfd state struct. So instead of

  kzalloc(sizeof(*irqfd), GFP_KERNEL);

we could do

  kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry), 
GFP_KERNEL);


and point the routing entry to the inline irqfd one. We'd lose the 
ability to change any routings with that construct, but I think that's 
ok for irq controllers that are not configurable.
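The single-allocation trick can be sketched in userspace C; the struct
names below are illustrative stand-ins for the KVM types, not the real
definitions:

```c
#include <assert.h>
#include <stdlib.h>

struct routing_entry { int gsi; int type; };

struct irqfd {
    int gsi;
    struct routing_entry *entry;   /* normally found via the routing table */
};

/* Allocate the irqfd state and its routing entry in one block, as in
 * kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry)),
 * and point the irqfd at the trailing entry. */
static struct irqfd *irqfd_alloc(int gsi)
{
    struct irqfd *ifd = calloc(1, sizeof(*ifd) + sizeof(struct routing_entry));
    if (!ifd)
        return NULL;
    ifd->gsi = gsi;
    ifd->entry = (struct routing_entry *)(ifd + 1);  /* trailing entry */
    ifd->entry->gsi = gsi;                           /* identity routing */
    return ifd;
}
```

Since the entry lives inside the irqfd's own allocation, it can never
be changed behind the irqfd's back, which is exactly the "no routing
updates" trade-off described above.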


Alternatively, we could also have a single routing entry that is a 
wildcard match, ranging from say LSI [x...y]. The irqfd structure would 
then need to carry a local payload to define an offset inside that 
wildcard routing entry.



Alex



Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping

2014-06-24 Thread Marc Zyngier
On 24/06/14 11:23, Will Deacon wrote:
> On Wed, May 07, 2014 at 03:55:57PM +0100, Christoffer Dall wrote:
>> On Wed, May 07, 2014 at 10:00:21AM +0100, Marc Zyngier wrote:
>>> On Tue, May 06 2014 at  7:04:48 pm BST, Christoffer Dall 
>>>  wrote:
 On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote:
> Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE.
>
> Signed-off-by: Kim Phillips 
> ---
>  arch/arm/kvm/mmu.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> [...]
> 
 I think this looks reasonable.

 Acked-by: Christoffer Dall 
>>>
>>> I feel like I'm missing some context here, and the commit message is way
>>> too terse for me to make sense of it.
>>>
>>> So far, we can only get into user_mem_abort on a Stage-2 fault
>>> (translation or permission) for memory. How can we suddenly get here for
>>> a *device* fault? Do we get a special kind of memslot?
>>>
>>> I'm not saying the patch does anything wrong, but I'd like to understand
>>> the rationale behind it. On its own, it doesn't make much sense.
>>>
>> Think device passthrough.  There's nothing preventing user space from
>> setting up a memory region to point to device memory (through VFIO or
>> /dev/mem).  If that's done, we should enforce device memory properties
>> so writes don't linger around in the cache to be written some time later
>> when that device memory potentially doesn't belong to the VM anymore.
>>
>> This is just one tiny piece of all of them to make device passthrough
>> work, and we could hold off with this patch until we have something more
>> complete.  On the other hand, we need to start somewhere, and this is
>> hardly intrusive and is functionally correct even though you don't have
>> a full device passthrough setup.
> 
> Please can you queue this patch up? I need it for my VFIO work, where I'm
> registering the PCI BARs using KVM_SET_USER_MEMORY_REGION.
> 
> Without this, I'd have to trap all accesses and do pread/pwrite from
> kvmtool instead of mmaping the regions straight through.

I'm afraid there has been quite a bit of churn in this department, and
the patch doesn't apply any more.

Kim, any chance you could respin this patch on top of mainline?

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexander Graf


On 24.06.14 12:11, Alexey Kardashevskiy wrote:

On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:

On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:


Working on big endian being an accident may be a matter of perspective

  :-)


The comment remains that this patch doesn't actually fix anything except
the overhead on big endian systems doing redundant byte swapping and
maybe the philosophy that vfio regions are little endian.

Yes, that works by accident because technically VFIO is a transport and
thus shouldn't perform any endian swapping of any sort, which remains
the responsibility of the end driver which is the only one to know
whether a given BAR location is a register or some streaming data
and in the former case whether it's LE or BE (some PCI devices are BE
even ! :-)

But yes, in the end, it works with the dual "cancelling" swaps and the
overhead of those swaps is probably drowned in the noise of the syscall
overhead.


I'm still not a fan of iowrite vs iowritebe, there must be something we
can use that doesn't have an implicit swap.

Sadly there isn't ... In the old days we didn't even have the "be"
variant, and readl/writel style accessors still don't have them either
for all archs.

There is __raw_readl/writel but here the semantics are much more than
just "don't swap", they also don't have memory barriers (which means
they are essentially useless to most drivers unless those are platform
specific drivers which know exactly what they are doing, or in the rare
cases such as accessing a framebuffer which we know never have side
effects).


  Calling it iowrite*_native is also an abuse of the namespace.



  Next thing we know some common code
will legitimately use that name.

It might make sense to move those definitions into a common header. There have
been a handful of cases in the past that wanted that sort of "native
byte order" MMIOs iirc (though don't ask me for examples, I can't really
remember).


  If we do need to define an alias
(which I'd like to avoid) it should be something like vfio_iowrite32.


Ping?

We need to make a decision: either move those xxx_native() helpers
somewhere (where?), or leave the patch as is (since we figured out that
the iowriteXX functions implement barriers and we cannot just use raw
accessors) and fix the commit log to explain everything.


Is there actually any difference in generated code with this patch 
applied and without? I would hope that iowrite..() is inlined and 
cancels out the cpu_to_le..() calls that are also inlined?



Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word

2014-06-24 Thread Peter Zijlstra
On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
> However, I *do* agree with you that it's simpler to just squash this patch
> into 01/11.

So I explicitly broke out these optimizations into separate patches so
that we can see them independently and agree they're idempotent wrt the
state machine.

The initial patches by Waiman were totally unreadable, partly because
the optimizations made the code terribly complex.

Luckily waiman then dropped the most horrible optimizations
(optimization for the very large nr_cpus case, were we cannot have a
pending byte), so the end result isn't quite as complex as it used to
be.


Re: [PATCH v4] KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule

2014-06-24 Thread Alexander Graf


On 18.06.14 09:15, Mihai Caraman wrote:

On vcpu schedule, the condition checked for tlb pollution is too loose.
The tlb entries of a vcpu become polluted (vs stale) only when a different
vcpu within the same logical partition runs in-between. Optimize the tlb
invalidation condition by keeping last_vcpu per logical partition id.
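The new condition can be sketched as follows (a userspace model only — the flat per-lpid array and function names are illustrative, not the actual patch):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LPIDS 64

/* Last vcpu that ran on this physical cpu, tracked per logical partition. */
static const void *last_vcpu[MAX_LPIDS];
static int invalidations;

/* Called on vcpu schedule-in: the incoming vcpu's TLB entries are polluted
 * only if a *different* vcpu of the same logical partition ran in between. */
static bool sched_in(int lpid, const void *vcpu)
{
	bool polluted = last_vcpu[lpid] != NULL && last_vcpu[lpid] != vcpu;

	if (polluted)
		invalidations++;	/* stands in for the tlb invalidation */
	last_vcpu[lpid] = vcpu;
	return polluted;
}
```

With the old, looser condition, interleaving vcpus of *different* partitions would also trigger invalidations; here they no longer do, which is where the reported 4% comes from.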

With the new invalidation condition, a guest shows 4% performance improvement
on P5020DS while running a memory stress application with the cpu 
oversubscribed,
the other guest running a cpu intensive workload.

Guest - old invalidation condition
   real 3.89
   user 3.87
   sys 0.01

Guest - enhanced invalidation condition
   real 3.75
   user 3.73
   sys 0.01

Host
   real 3.70
   user 1.85
   sys 0.00

The memory stress application accesses 4KB pages backed by 75% of available
TLB0 entries:

char foo[ENTRIES][4096] __attribute__ ((aligned (4096)));

int main()
{
char bar;
int i, j;

for (i = 0; i < ITERATIONS; i++)
for (j = 0; j < ENTRIES; j++)
bar = foo[j][0];

return 0;
}

Signed-off-by: Mihai Caraman 
Cc: Scott Wood 


Thanks, applied to kvm-ppc-queue.


Alex



Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexey Kardashevskiy
On 06/24/2014 08:41 PM, Alexander Graf wrote:
> 
> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>>
 Working on big endian being an accident may be a matter of perspective
>>>   :-)
>>>
 The comment remains that this patch doesn't actually fix anything except
 the overhead on big endian systems doing redundant byte swapping and
 maybe the philosophy that vfio regions are little endian.
>>> Yes, that works by accident because technically VFIO is a transport and
>>> thus shouldn't perform any endian swapping of any sort, which remains
>>> the responsibility of the end driver which is the only one to know
>>> whether a given BAR location is a a register or some streaming data
>>> and in the former case whether it's LE or BE (some PCI devices are BE
>>> even ! :-)
>>>
>>> But yes, in the end, it works with the dual "cancelling" swaps and the
>>> overhead of those swaps is probably drowned in the noise of the syscall
>>> overhead.
>>>
 I'm still not a fan of iowrite vs iowritebe, there must be something we
 can use that doesn't have an implicit swap.
>>> Sadly there isn't ... In the old day we didn't even have the "be"
>>> variant and readl/writel style accessors still don't have them either
>>> for all archs.
>>>
>>> There is __raw_readl/writel but here the semantics are much more than
>>> just "don't swap", they also don't have memory barriers (which means
>>> they are essentially useless to most drivers unless those are platform
>>> specific drivers which know exactly what they are doing, or in the rare
>>> cases such as accessing a framebuffer which we know never have side
>>> effects).
>>>
   Calling it iowrite*_native is also an abuse of the namespace.
>>>
   Next thing we know some common code
 will legitimately use that name.
>>> I might make sense to those definitions into a common header. There have
>>> been a handful of cases in the past that wanted that sort of "native
>>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>>> remember).
>>>
   If we do need to define an alias
 (which I'd like to avoid) it should be something like vfio_iowrite32.
>>
>> Ping?
>>
>> We need to make a decision whether to move those xxx_native() helpers
>> somewhere (where?) or leave the patch as is (as we figured out that
>> iowriteXX functions implement barriers and we cannot just use raw
>> accessors) and fix commit log to explain everything.
> 
> Is there actually any difference in generated code with this patch applied
> and without? I would hope that iowrite..() is inlined and cancels out the
> cpu_to_le..() calls that are also inlined?

iowrite32 is a non-inline function, so the conversions do take place, and the
same goes for the others. And sorry but I fail to see why this matters. We are not trying to
accelerate things, we are removing redundant operations which confuse
people who read the code.


-- 
Alexey


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexander Graf


On 24.06.14 14:50, Alexey Kardashevskiy wrote:

On 06/24/2014 08:41 PM, Alexander Graf wrote:

On 24.06.14 12:11, Alexey Kardashevskiy wrote:

On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:

On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:


Working on big endian being an accident may be a matter of perspective

   :-)


The comment remains that this patch doesn't actually fix anything except
the overhead on big endian systems doing redundant byte swapping and
maybe the philosophy that vfio regions are little endian.

Yes, that works by accident because technically VFIO is a transport and
thus shouldn't perform any endian swapping of any sort, which remains
the responsibility of the end driver which is the only one to know
whether a given BAR location is a register or some streaming data
and in the former case whether it's LE or BE (some PCI devices are BE
even ! :-)

But yes, in the end, it works with the dual "cancelling" swaps and the
overhead of those swaps is probably drowned in the noise of the syscall
overhead.


I'm still not a fan of iowrite vs iowritebe, there must be something we
can use that doesn't have an implicit swap.

Sadly there isn't ... In the old days we didn't even have the "be"
variant, and readl/writel style accessors still don't have them either
for all archs.

There is __raw_readl/writel but here the semantics are much more than
just "don't swap", they also don't have memory barriers (which means
they are essentially useless to most drivers unless those are platform
specific drivers which know exactly what they are doing, or in the rare
cases such as accessing a framebuffer which we know never have side
effects).


   Calling it iowrite*_native is also an abuse of the namespace.
   Next thing we know some common code
will legitimately use that name.

It might make sense to move those definitions into a common header. There have
been a handful of cases in the past that wanted that sort of "native
byte order" MMIOs iirc (though don't ask me for examples, I can't really
remember).


   If we do need to define an alias
(which I'd like to avoid) it should be something like vfio_iowrite32.

Ping?

We need to make a decision whether to move those xxx_native() helpers
somewhere (where?) or leave the patch as is (as we figured out that
iowriteXX functions implement barriers and we cannot just use raw
accessors) and fix commit log to explain everything.

Is there actually any difference in generated code with this patch applied
and without? I would hope that iowrite..() is inlined and cancels out the
cpu_to_le..() calls that are also inlined?

iowrite32 is a non-inline function, so the conversions do take place, and the
same goes for the others. And sorry but I fail to see why this matters. We are not trying to
accelerate things, we are removing redundant operations which confuse
people who read the code.


The confusion depends on where you're coming from. If you happen to know 
that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.


I don't have a strong feeling either way though and will let Alex decide 
on the path forward :).



Alex



Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexey Kardashevskiy
On 06/24/2014 10:52 PM, Alexander Graf wrote:
> 
> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
 On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>
>> Working on big endian being an accident may be a matter of perspective
>:-)
>
>> The comment remains that this patch doesn't actually fix anything except
>> the overhead on big endian systems doing redundant byte swapping and
>> maybe the philosophy that vfio regions are little endian.
> Yes, that works by accident because technically VFIO is a transport and
> thus shouldn't perform any endian swapping of any sort, which remains
> the responsibility of the end driver which is the only one to know
> whether a given BAR location is a a register or some streaming data
> and in the former case whether it's LE or BE (some PCI devices are BE
> even ! :-)
>
> But yes, in the end, it works with the dual "cancelling" swaps and the
> overhead of those swaps is probably drowned in the noise of the syscall
> overhead.
>
>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>> can use that doesn't have an implicit swap.
> Sadly there isn't ... In the old day we didn't even have the "be"
> variant and readl/writel style accessors still don't have them either
> for all archs.
>
> There is __raw_readl/writel but here the semantics are much more than
> just "don't swap", they also don't have memory barriers (which means
> they are essentially useless to most drivers unless those are platform
> specific drivers which know exactly what they are doing, or in the rare
> cases such as accessing a framebuffer which we know never have side
> effects).
>
>>Calling it iowrite*_native is also an abuse of the namespace.
>>Next thing we know some common code
>> will legitimately use that name.
> I might make sense to those definitions into a common header. There have
> been a handful of cases in the past that wanted that sort of "native
> byte order" MMIOs iirc (though don't ask me for examples, I can't really
> remember).
>
>>If we do need to define an alias
>> (which I'd like to avoid) it should be something like vfio_iowrite32.
 Ping?

 We need to make a decision whether to move those xxx_native() helpers
 somewhere (where?) or leave the patch as is (as we figured out that
 iowriteXX functions implement barriers and we cannot just use raw
 accessors) and fix commit log to explain everything.
>>> Is there actually any difference in generated code with this patch applied
>>> and without? I would hope that iowrite..() is inlined and cancels out the
>>> cpu_to_le..() calls that are also inlined?
>> iowrite32 is a non-inline function so conversions take place so are the
>> others. And sorry but I fail to see why this matters. We are not trying to
>> accelerate things, we are removing redundant operations which confuse
>> people who read the code.
> 
> The confusion depends on where you're coming from. If you happen to know
> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.

It was like this (and this is just confusing):

iowrite32(le32_to_cpu(val), io + off);

What would make sense (according to you and I would understand this) is this:

iowrite32(cpu_to_le32(val), io + off);


Or I missed your point, did I?


> I don't have a strong feeling either way though and will let Alex decide on
> the path forward :)

It would probably help if you picked the side :)


-- 
Alexey


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexander Graf


On 24.06.14 15:01, Alexey Kardashevskiy wrote:

On 06/24/2014 10:52 PM, Alexander Graf wrote:

On 24.06.14 14:50, Alexey Kardashevskiy wrote:

On 06/24/2014 08:41 PM, Alexander Graf wrote:

On 24.06.14 12:11, Alexey Kardashevskiy wrote:

On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:

On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:


Working on big endian being an accident may be a matter of perspective

:-)


The comment remains that this patch doesn't actually fix anything except
the overhead on big endian systems doing redundant byte swapping and
maybe the philosophy that vfio regions are little endian.

Yes, that works by accident because technically VFIO is a transport and
thus shouldn't perform any endian swapping of any sort, which remains
the responsibility of the end driver which is the only one to know
whether a given BAR location is a register or some streaming data
and in the former case whether it's LE or BE (some PCI devices are BE
even ! :-)

But yes, in the end, it works with the dual "cancelling" swaps and the
overhead of those swaps is probably drowned in the noise of the syscall
overhead.


I'm still not a fan of iowrite vs iowritebe, there must be something we
can use that doesn't have an implicit swap.

Sadly there isn't ... In the old days we didn't even have the "be"
variant, and readl/writel style accessors still don't have them either
for all archs.

There is __raw_readl/writel but here the semantics are much more than
just "don't swap", they also don't have memory barriers (which means
they are essentially useless to most drivers unless those are platform
specific drivers which know exactly what they are doing, or in the rare
cases such as accessing a framebuffer which we know never have side
effects).


Calling it iowrite*_native is also an abuse of the namespace.
Next thing we know some common code
will legitimately use that name.

It might make sense to move those definitions into a common header. There have
been a handful of cases in the past that wanted that sort of "native
byte order" MMIOs iirc (though don't ask me for examples, I can't really
remember).


If we do need to define an alias
(which I'd like to avoid) it should be something like vfio_iowrite32.

Ping?

We need to make a decision whether to move those xxx_native() helpers
somewhere (where?) or leave the patch as is (as we figured out that
iowriteXX functions implement barriers and we cannot just use raw
accessors) and fix commit log to explain everything.

Is there actually any difference in generated code with this patch applied
and without? I would hope that iowrite..() is inlined and cancels out the
cpu_to_le..() calls that are also inlined?

iowrite32 is a non-inline function, so the conversions do take place, and the
same goes for the others. And sorry but I fail to see why this matters. We are not trying to
accelerate things, we are removing redundant operations which confuse
people who read the code.

The confusion depends on where you're coming from. If you happen to know
that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.

It was like this (and this is just confusing):

iowrite32(le32_to_cpu(val), io + off);

What would make sense (according to you and I would understand this) is this:

iowrite32(cpu_to_le32(val), io + off);


Or I missed your point, did I?


No, you didn't miss it. I think for people who know how iowrite32() 
works the above is obvious. I find the fact that iowrite32() writes in 
LE always pretty scary though ;).


So IMHO we should either create new, generic iowrite helpers that don't 
do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.



Alex



[PATCH v3 6/7] KVM: PPC: Book3S HV: Fix ABIv2 on LE

2014-06-24 Thread Alexander Graf
For code that doesn't live in modules we can just branch to the real function
names, giving us compatibility with ABIv1 and ABIv2.

Do this for the compiled-in code of HV KVM.

Signed-off-by: Alexander Graf 
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 1a71f60..a9de981 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -668,9 +668,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
 
mr  r31, r4
addir3, r31, VCPU_FPRS_TM
-   bl  .load_fp_state
+   bl  load_fp_state
addir3, r31, VCPU_VRS_TM
-   bl  .load_vr_state
+   bl  load_vr_state
mr  r4, r31
lwz r7, VCPU_VRSAVE_TM(r4)
mtspr   SPRN_VRSAVE, r7
@@ -1414,9 +1414,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
 
/* Save FP/VSX. */
addir3, r9, VCPU_FPRS_TM
-   bl  .store_fp_state
+   bl  store_fp_state
addir3, r9, VCPU_VRS_TM
-   bl  .store_vr_state
+   bl  store_vr_state
mfspr   r6, SPRN_VRSAVE
stw r6, VCPU_VRSAVE_TM(r9)
 1:
@@ -2405,11 +2405,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
mtmsrd  r8
isync
addir3,r3,VCPU_FPRS
-   bl  .store_fp_state
+   bl  store_fp_state
 #ifdef CONFIG_ALTIVEC
 BEGIN_FTR_SECTION
addir3,r31,VCPU_VRS
-   bl  .store_vr_state
+   bl  store_vr_state
 END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
 #endif
mfspr   r6,SPRN_VRSAVE
@@ -2441,11 +2441,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
mtmsrd  r8
isync
addir3,r4,VCPU_FPRS
-   bl  .load_fp_state
+   bl  load_fp_state
 #ifdef CONFIG_ALTIVEC
 BEGIN_FTR_SECTION
addir3,r31,VCPU_VRS
-   bl  .load_vr_state
+   bl  load_vr_state
 END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
 #endif
lwz r7,VCPU_VRSAVE(r31)
-- 
1.8.1.4



Re: [PATCH] KVM: PPC: Book3E: Unlock mmu_lock when setting caching atttribute

2014-06-24 Thread Alexander Graf


On 18.06.14 17:45, Mihai Caraman wrote:

The patch 08c9a188d0d0fc0f0c5e17d89a06bb59c493110f
kvm: powerpc: use caching attributes as per linux pte
does not handle the error case properly, leaving mmu_lock locked. The held
lock then triggers an RCU stall from the kvmppc_e500_emul_tlbwe() caller.

In case of an error, go to the out label.

Signed-off-by: Mihai Caraman 
Cc: Bharat Bhushan 


Thanks, applied to for-3.16.


Alex



Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alex Williamson
On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
> On 24.06.14 15:01, Alexey Kardashevskiy wrote:
> > On 06/24/2014 10:52 PM, Alexander Graf wrote:
> >> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
> >>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>  On 24.06.14 12:11, Alexey Kardashevskiy wrote:
> > On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
> >> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
> >>
> >>> Working on big endian being an accident may be a matter of perspective
> >> :-)
> >>
> >>> The comment remains that this patch doesn't actually fix anything 
> >>> except
> >>> the overhead on big endian systems doing redundant byte swapping and
> >>> maybe the philosophy that vfio regions are little endian.
> >> Yes, that works by accident because technically VFIO is a transport and
> >> thus shouldn't perform any endian swapping of any sort, which remains
> >> the responsibility of the end driver which is the only one to know
> >> whether a given BAR location is a a register or some streaming data
> >> and in the former case whether it's LE or BE (some PCI devices are BE
> >> even ! :-)
> >>
> >> But yes, in the end, it works with the dual "cancelling" swaps and the
> >> overhead of those swaps is probably drowned in the noise of the syscall
> >> overhead.
> >>
> >>> I'm still not a fan of iowrite vs iowritebe, there must be something 
> >>> we
> >>> can use that doesn't have an implicit swap.
> >> Sadly there isn't ... In the old day we didn't even have the "be"
> >> variant and readl/writel style accessors still don't have them either
> >> for all archs.
> >>
> >> There is __raw_readl/writel but here the semantics are much more than
> >> just "don't swap", they also don't have memory barriers (which means
> >> they are essentially useless to most drivers unless those are platform
> >> specific drivers which know exactly what they are doing, or in the rare
> >> cases such as accessing a framebuffer which we know never have side
> >> effects).
> >>
> >>> Calling it iowrite*_native is also an abuse of the namespace.
> >>> Next thing we know some common code
> >>> will legitimately use that name.
> >> I might make sense to those definitions into a common header. There 
> >> have
> >> been a handful of cases in the past that wanted that sort of "native
> >> byte order" MMIOs iirc (though don't ask me for examples, I can't 
> >> really
> >> remember).
> >>
> >>> If we do need to define an alias
> >>> (which I'd like to avoid) it should be something like vfio_iowrite32.
> > Ping?
> >
> > We need to make a decision whether to move those xxx_native() helpers
> > somewhere (where?) or leave the patch as is (as we figured out that
> > iowriteXX functions implement barriers and we cannot just use raw
> > accessors) and fix commit log to explain everything.
>  Is there actually any difference in generated code with this patch 
>  applied
>  and without? I would hope that iowrite..() is inlined and cancels out the
>  cpu_to_le..() calls that are also inlined?
> >>> iowrite32 is a non-inline function so conversions take place so are the
> >>> others. And sorry but I fail to see why this matters. We are not trying to
> >>> accelerate things, we are removing redundant operations which confuse
> >>> people who read the code.
> >> The confusion depends on where you're coming from. If you happen to know
> >> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.
> > It was like this (and this is just confusing):
> >
> > iowrite32(le32_to_cpu(val), io + off);
> >
> > What would make sense (according to you and I would understand this) is 
> > this:
> >
> > iowrite32(cpu_to_le32(val), io + off);
> >
> >
> > Or I missed your point, did I?
> 
> No, you didn't miss it. I think for people who know how iowrite32() 
> works the above is obvious. I find the fact that iowrite32() writes in 
> LE always pretty scary though ;).
> 
> So IMHO we should either create new, generic iowrite helpers that don't 
> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.

I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
and keeps the byte order consistent regardless of the platform, while
iowrite32(val) or iowrite32be(val) makes me scratch my head and try to
remember that the byte swaps are a nop on the given platforms.  As Ben
noted, a native, no-swap ioread/write doesn't exist, but perhaps should.
I'd prefer an attempt be made to make it exist before adding
vfio-specific macros.  vfio is arguably doing the right thing here given
the functions available.  Thanks,

Alex


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexey Kardashevskiy
On 06/25/2014 12:21 AM, Alex Williamson wrote:
> On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
>> On 24.06.14 15:01, Alexey Kardashevskiy wrote:
>>> On 06/24/2014 10:52 PM, Alexander Graf wrote:
 On 24.06.14 14:50, Alexey Kardashevskiy wrote:
> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
 On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:

> Working on big endian being an accident may be a matter of perspective
 :-)

> The comment remains that this patch doesn't actually fix anything 
> except
> the overhead on big endian systems doing redundant byte swapping and
> maybe the philosophy that vfio regions are little endian.
 Yes, that works by accident because technically VFIO is a transport and
 thus shouldn't perform any endian swapping of any sort, which remains
 the responsibility of the end driver which is the only one to know
 whether a given BAR location is a a register or some streaming data
 and in the former case whether it's LE or BE (some PCI devices are BE
 even ! :-)

 But yes, in the end, it works with the dual "cancelling" swaps and the
 overhead of those swaps is probably drowned in the noise of the syscall
 overhead.

> I'm still not a fan of iowrite vs iowritebe, there must be something 
> we
> can use that doesn't have an implicit swap.
 Sadly there isn't ... In the old day we didn't even have the "be"
 variant and readl/writel style accessors still don't have them either
 for all archs.

 There is __raw_readl/writel but here the semantics are much more than
 just "don't swap", they also don't have memory barriers (which means
 they are essentially useless to most drivers unless those are platform
 specific drivers which know exactly what they are doing, or in the rare
 cases such as accessing a framebuffer which we know never have side
 effects).

> Calling it iowrite*_native is also an abuse of the namespace.
> Next thing we know some common code
> will legitimately use that name.
 I might make sense to those definitions into a common header. There 
 have
 been a handful of cases in the past that wanted that sort of "native
 byte order" MMIOs iirc (though don't ask me for examples, I can't 
 really
 remember).

> If we do need to define an alias
> (which I'd like to avoid) it should be something like vfio_iowrite32.
>>> Ping?
>>>
>>> We need to make a decision whether to move those xxx_native() helpers
>>> somewhere (where?) or leave the patch as is (as we figured out that
>>> iowriteXX functions implement barriers and we cannot just use raw
>>> accessors) and fix commit log to explain everything.
>> Is there actually any difference in generated code with this patch 
>> applied
>> and without? I would hope that iowrite..() is inlined and cancels out the
>> cpu_to_le..() calls that are also inlined?
> iowrite32 is a non-inline function so conversions take place so are the
> others. And sorry but I fail to see why this matters. We are not trying to
> accelerate things, we are removing redundant operations which confuse
> people who read the code.
 The confusion depends on where you're coming from. If you happen to know
 that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.
>>> It was like this (and this is just confusing):
>>>
>>> iowrite32(le32_to_cpu(val), io + off);
>>>
>>> What would make sense (according to you and I would understand this) is 
>>> this:
>>>
>>> iowrite32(cpu_to_le32(val), io + off);
>>>
>>>
>>> Or I missed your point, did I?
>>
>> No, you didn't miss it. I think for people who know how iowrite32() 
>> works the above is obvious. I find the fact that iowrite32() writes in 
>> LE always pretty scary though ;).
>>
>> So IMHO we should either create new, generic iowrite helpers that don't 
>> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
> 
> I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense


I do not understand why @val is considered LE here and needs to be converted
to CPU. Really. I truly believe it should be cpu_to_le32().


> and keeps the byte order consistent regardless of the platform, while
> iowrite32(val) or iowrite32be(val) makes me scratch my head and try to
> remember that the byte swaps are a nop on the given platforms.  As Ben
> noted, a native, no-swap ioread/write doesn't exist, but perhaps should.
> I'd prefer an attempt be made to make it exist before adding
> vfio-specific macros.  vfio is arguably doing the right thing here given
> the functions available.

RE: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread David Laight
From: Alexey Kardashevskiy
...
> >> So IMHO we should either create new, generic iowrite helpers that don't
> >> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
> >
> > I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
> 
> 
> I do not understand why @val is considered LE here and need to be converted
> to CPU. Really. I truly believe it should be cpu_to_le32().

Think about it carefully...

Apparently iowrite32() is writing a 'cpu' value out 'le'.
So if you have a 'le' value you need to convert it to 'cpu' first.

I really won't ask how 'be' ppc managed to get 'le' on-chip peripherals.

David
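One fact that defuses part of this naming argument: on any given build, le32_to_cpu() and cpu_to_le32() compile to exactly the same operation (identity on LE hosts, a byte swap on BE hosts), so choosing between them only documents which side of the conversion the value sits on. A userspace sketch of the big-endian case (helper names are illustrative):

```c
#include <stdint.h>

static uint32_t swab32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* On a big-endian build both helpers reduce to the same byte swap;
 * on a little-endian build both are the identity. Only the name — and
 * hence the documented direction of conversion — differs. */
static uint32_t be_le32_to_cpu(uint32_t v) { return swab32(v); }
static uint32_t be_cpu_to_le32(uint32_t v) { return swab32(v); }
```

So the generated code is identical either way; the disagreement is purely about which spelling best documents the intent.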


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alex Williamson
On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
> On 06/25/2014 12:21 AM, Alex Williamson wrote:
> > On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
> >> On 24.06.14 15:01, Alexey Kardashevskiy wrote:
> >>> On 06/24/2014 10:52 PM, Alexander Graf wrote:
>  On 24.06.14 14:50, Alexey Kardashevskiy wrote:
> > On 06/24/2014 08:41 PM, Alexander Graf wrote:
> >> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
> >>> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>  On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
> 
> > Working on big endian being an accident may be a matter of 
> > perspective
>  :-)
> 
> > The comment remains that this patch doesn't actually fix anything 
> > except
> > the overhead on big endian systems doing redundant byte swapping and
> > maybe the philosophy that vfio regions are little endian.
>  Yes, that works by accident because technically VFIO is a transport 
>  and
>  thus shouldn't perform any endian swapping of any sort, which remains
>  the responsibility of the end driver which is the only one to know
>  whether a given BAR location is a a register or some streaming data
>  and in the former case whether it's LE or BE (some PCI devices are BE
>  even ! :-)
> 
>  But yes, in the end, it works with the dual "cancelling" swaps and 
>  the
>  overhead of those swaps is probably drowned in the noise of the 
>  syscall
>  overhead.
> 
> > I'm still not a fan of iowrite vs iowritebe, there must be 
> > something we
> > can use that doesn't have an implicit swap.
>  Sadly there isn't ... In the old day we didn't even have the "be"
>  variant and readl/writel style accessors still don't have them either
>  for all archs.
> 
>  There is __raw_readl/writel but here the semantics are much more than
>  just "don't swap", they also don't have memory barriers (which means
>  they are essentially useless to most drivers unless those are 
>  platform
>  specific drivers which know exactly what they are doing, or in the 
>  rare
>  cases such as accessing a framebuffer which we know never have side
>  effects).
> 
> > Calling it iowrite*_native is also an abuse of the namespace.
> > Next thing we know some common code
> > will legitimately use that name.
>  It might make sense to move those definitions into a common header. There 
>  have
>  been a handful of cases in the past that wanted that sort of "native
>  byte order" MMIOs iirc (though don't ask me for examples, I can't 
>  really
>  remember).
> 
> > If we do need to define an alias
> > (which I'd like to avoid) it should be something like 
> > vfio_iowrite32.
> >>> Ping?
> >>>
> >>> We need to make a decision whether to move those xxx_native() helpers
> >>> somewhere (where?) or leave the patch as is (as we figured out that
> >>> iowriteXX functions implement barriers and we cannot just use raw
> >>> accessors) and fix commit log to explain everything.
> >> Is there actually any difference in generated code with this patch 
> >> applied
> >> and without? I would hope that iowrite..() is inlined and cancels out 
> >> the
> >> cpu_to_le..() calls that are also inlined?
> iowrite32 is a non-inline function, so the conversions do take place, and
> so do the others. And sorry, but I fail to see why this matters. We are not
> trying to accelerate things, we are removing redundant operations which
> confuse people who read the code.
>  The confusion depends on where you're coming from. If you happen to know
>  that "iowrite32" writes in LE, then the LE conversion makes a lot of 
>  sense.
> >>> It was like this (and this is just confusing):
> >>>
> >>> iowrite32(le32_to_cpu(val), io + off);
> >>>
> >>> What would make sense (according to you and I would understand this) is 
> >>> this:
> >>>
> >>> iowrite32(cpu_to_le32(val), io + off);
> >>>
> >>>
> >>> Or I missed your point, did I?
> >>
> >> No, you didn't miss it. I think for people who know how iowrite32() 
> >> works the above is obvious. I find the fact that iowrite32() writes in 
> >> LE always pretty scary though ;).
> >>
> >> So IMHO we should either create new, generic iowrite helpers that don't 
> >> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
> > 
> > I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
> 
> 
> I do not understand why @val is considered LE here and needs to be converted
> to CPU. Really. I truly believe it should be cpu_to_le32().

Because iowrite32 is defined to take 

Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?

2014-06-24 Thread Eric Auger
On 06/24/2014 12:40 PM, Alexander Graf wrote:
> 
> On 24.06.14 11:47, Paul Mackerras wrote:
>> On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote:
>>> On 17.06.14 13:39, Eric Auger wrote:


 Hello,

 I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl
 relationship.

 When reading the KVM API documentation, I do not see any
 dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the
 text, it seems only the gsi field is used and interpreted as the
 irqchip pin.

 However, the irqchip.c kvm_set_irq code relies on an existing, non-dummy
 routing table.

 My question is: does anyone agree that the user side must set a
 consistent routing table using KVM_SET_GSI_ROUTING before using
 KVM_IRQFD? The other alternative would have been to build a default
 identity GSI routing table in the kernel (gsi = irqchip.pin).
>>> I untangled irqfd support from the x86 ioapic emulation a while back.
>>> When I
>>> looked at it, I didn't see any easy way to split it out from the routing
>>> too, so I kept that dependency in.
Hi Alex, Paul,

Thanks for your replies. Hence, don't you think the KVM API doc deserves
to be clarified then?
>>>
>>> If you look at the code, you will see that the irq routing entry is
>>> used as
>>> token for an irqfd. So every irqfd only knows its routing table entry,
>>> nothing else.
>> As far as I can see, the routing table entry is used for only two
>> things: to tell whether the interrupt is an MSI or LSI, and to pass to
>> kvm_set_msi.


Indeed, I noticed irq_entry in the _irqfd struct is only set for MSI routing.
For other IRQs - I guess what you call LSIs - the routing table map[] is
used to retrieve the irqchip.pin(s) associated with that gsi and call set().
>>
>> One way to tackle the problem would be to abstract out the routing
>> table lookup into a function in irqchip.c, rather than having it
>> open-coded in irqfd_update.  Then, if you don't want to have the
>> routing table you write an alternative function that creates a
>> routing entry from scratch.  It would need information from the
>> low-level irq chip code as to which interrupts are LSIs and which are
>> MSIs.  It also ties you down to having only one kind of interrupt
controller.
> 
> You could also create it along with the irqfd state struct. So instead of
> 
>   kzalloc(sizeof(*irqfd), GFP_KERNEL);
> 
> we could do
> 
>   kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry),
> GFP_KERNEL);


Wouldn't it make sense, in kvm_irqfd_assign, to test whether there is a
routing entry associated with that gsi? If there is none, create a default
one where irqchip.pin = gsi, or something architecture-specific?

That way we would not need a complete and big routing table to be set by
the user side?

> 
> and point the routing entry to the inline irqfd one. We'd lose the
> ability to change any routings with that construct, but I think that's
> ok for irq controllers that are not configurable.
> 
> Alternatively, we could also have a single routing entry that is a
> wildcard match, ranging from say LSI [x...y]. The irqfd structure would
> then need to carry a local payload to define an offset inside that
> wildcard routing entry.

Also, whatever the number of IRQs assigned, we have this big
chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS] array allocated. And on ARM we
have some difficulty defining KVM_IRQCHIP_NUM_PINS, which is
quite dynamic.


Best Regards

Eric
> 
> 
> Alex
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New improvements on KVM

2014-06-24 Thread Brian Jackson



On 6/23/2014 10:09 PM, Oscar Garcia wrote:

Hello,

I am planning to start a PhD next year, and I would like to
spend part of my time studying the KVM code in order to suggest new
improvements.
I know that this mailing list is used by KVM experts, so I would
like to ask you: what improvements are needed in KVM in order to
advance in that direction?



For starters I'd check out the KVM TODO page on the wiki and the GSOC 
pages on the Qemu wiki. Aside from that, just hang out and try to pick 
up what you can. Realistically though (unless you just love hypervisor 
code), all the interesting work is going on in Qemu and 
libvirt/openstack/other management tools.


http://www.linux-kvm.org/page/TODO
http://wiki.qemu.org/Google_Summer_of_Code_2014
http://wiki.qemu.org/Google_Summer_of_Code_2013
http://wiki.qemu.org/Google_Summer_of_Code_201



Thank you

Oscar


Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?

2014-06-24 Thread Alexander Graf


On 24.06.14 17:05, Eric Auger wrote:

On 06/24/2014 12:40 PM, Alexander Graf wrote:

On 24.06.14 11:47, Paul Mackerras wrote:

On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote:

On 17.06.14 13:39, Eric Auger wrote:



Hello,

I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl
relationship.

When reading the KVM API documentation, I do not see any
dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the
text, it seems only the gsi field is used and interpreted as the
irqchip pin.

However, the irqchip.c kvm_set_irq code relies on an existing, non-dummy
routing table.

My question is: does anyone agree that the user side must set a
consistent routing table using KVM_SET_GSI_ROUTING before using
KVM_IRQFD? The other alternative would have been to build a default
identity GSI routing table in the kernel (gsi = irqchip.pin).

I untangled irqfd support from the x86 ioapic emulation a while back.
When I
looked at it, I didn't see any easy way to split it out from the routing
too, so I kept that dependency in.

Hi Alex, Paul,

Thanks for your replies. Hence, don't you think the KVM API doc deserves
to be clarified then?

If you look at the code, you will see that the irq routing entry is
used as
token for an irqfd. So every irqfd only knows its routing table entry,
nothing else.

As far as I can see, the routing table entry is used for only two
things: to tell whether the interrupt is an MSI or LSI, and to pass to
kvm_set_msi.


Indeed, I noticed irq_entry in the _irqfd struct is only set for MSI routing.
For other IRQs - I guess what you call LSIs - the routing table map[] is
used to retrieve the irqchip.pin(s) associated with that gsi and call set().

One way to tackle the problem would be to abstract out the routing
table lookup into a function in irqchip.c, rather than having it
open-coded in irqfd_update.  Then, if you don't want to have the
routing table you write an alternative function that creates a
routing entry from scratch.  It would need information from the
low-level irq chip code as to which interrupts are LSIs and which are
MSIs.  It also ties you down to having only one kind of interrupt

controller.

You could also create it along with the irqfd state struct. So instead of

   kzalloc(sizeof(*irqfd), GFP_KERNEL);

we could do

   kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry),
GFP_KERNEL);


Wouldn't it make sense, in kvm_irqfd_assign, to test whether there is a
routing entry associated with that gsi? If there is none, create a default
one where irqchip.pin = gsi, or something architecture-specific?


"no entry" really should be "no entry". I think if we want some other 
mechanism, we should make it explicit.




That way we would not need a complete and big routing table to be set by
the user side?


and point the routing entry to the inline irqfd one. We'd lose the
ability to change any routings with that construct, but I think that's
ok for irq controllers that are not configurable.

Alternatively, we could also have a single routing entry that is a
wildcard match, ranging from say LSI [x...y]. The irqfd structure would
then need to carry a local payload to define an offset inside that
wildcard routing entry.

Also, whatever the number of IRQs assigned, we have this big
chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS] array allocated. And on ARM we
have some difficulty defining KVM_IRQCHIP_NUM_PINS, which is
quite dynamic.


Right, if you go with the ranged routing entries you should be able to 
get by with a very small number of NUM_PINS, no?



Alex



[PATCH 3.15 25/61] MIPS: KVM: Allocate at least 16KB for exception handlers

2014-06-24 Thread Greg Kroah-Hartman
3.15-stable review patch.  If anyone has any objections, please let me know.

--

From: James Hogan 

commit 7006e2dfda9adfa40251093604db76d7e44263b3 upstream.

Each MIPS KVM guest has its own copy of the KVM exception vector. This
contains the TLB refill exception handler at offset 0x000, the general
exception handler at offset 0x180, and interrupt exception handlers at
offset 0x200 in case Cause_IV=1. A common handler is copied to offset
0x2000 and offset 0x3000 is used for temporarily storing k1 during entry
from guest.

However the amount of memory allocated for this purpose is calculated as
0x200 rounded up to the next page boundary, which is insufficient if 4KB
pages are in use. This can lead to the common handler at offset 0x2000
being overwritten and infinitely recursive exceptions on the next exit
from the guest.

Increase the minimum size from 0x200 to 0x4000 to cover the full use of
the page.

Signed-off-by: James Hogan 
Cc: Paolo Bonzini 
Cc: Gleb Natapov 
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle 
Cc: linux-m...@linux-mips.org
Cc: Sanjay Lal 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/mips/kvm/kvm_mips.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -304,7 +304,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
if (cpu_has_veic || cpu_has_vint) {
size = 0x200 + VECTORSPACING * 64;
} else {
-   size = 0x200;
+   size = 0x4000;
}
 
/* Save Linux EBASE */




Re: [Qemu-devel] [PATCH 1/2] docs: update ivshmem device spec

2014-06-24 Thread Eric Blake
On 06/20/2014 06:15 AM, David Marchand wrote:
> Add some notes on the parts needed to use ivshmem devices: more specifically,
> explain the purpose of an ivshmem server and the basic concept to use the
> ivshmem devices in guests.
> 
> Signed-off-by: David Marchand 
> ---
>  docs/specs/ivshmem_device_spec.txt |   41 
> ++--
>  1 file changed, 35 insertions(+), 6 deletions(-)
> 
> diff --git a/docs/specs/ivshmem_device_spec.txt 
> b/docs/specs/ivshmem_device_spec.txt
> index 667a862..7d2b73f 100644
> --- a/docs/specs/ivshmem_device_spec.txt
> +++ b/docs/specs/ivshmem_device_spec.txt
> @@ -85,12 +85,41 @@ events have occurred.  The semantics of interrupt vectors 
> are left to the
>  user's discretion.
>  
>  
> +IVSHMEM host services
> +-
> +
> +This part is optional (see *Usage in the Guest* section below).
> +
> +To handle notifications between users of a ivshmem device, a ivshmem server 
> has

s/a ivshmem/an ivshmem/ (twice)

> +been added. This server is responsible for creating the shared memory and
> +creating a set of eventfds for each users of the shared memory. It behaves 
> as a
> +proxy between the different ivshmem clients (QEMU): giving the shared memory 
> fd
> +to each client, allocating eventfds to new clients and broadcasting to all
> +clients when a client disappears.
> +
> +Apart from the current ivshmem implementation in QEMU, a ivshmem client can 
> be

and again

> +written for debug, for development purposes, or to implement notifications
> +between host and guests.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature


Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?

2014-06-24 Thread Eric Auger
On 06/24/2014 05:47 PM, Alexander Graf wrote:
> 
> On 24.06.14 17:05, Eric Auger wrote:
>> On 06/24/2014 12:40 PM, Alexander Graf wrote:
>>> On 24.06.14 11:47, Paul Mackerras wrote:
 On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote:
> On 17.06.14 13:39, Eric Auger wrote:
>>
>> Hello,
>>
>> I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl
>> relationship.
>>
>> When reading the KVM API documentation, I do not see any
>> dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to
>> the
>> text, it seems only the gsi field is used and interpreted as the
>> irqchip pin.
>>
>> However, the irqchip.c kvm_set_irq code relies on an existing,
>> non-dummy
>> routing table.
>>
>> My question is: does anyone agree that the user side must
>> set a
>> consistent routing table using KVM_SET_GSI_ROUTING before using
>> KVM_IRQFD? The other alternative would have been to build a default
>> identity GSI routing table in the kernel (gsi = irqchip.pin).
> I untangled irqfd support from the x86 ioapic emulation a while back.
> When I
> looked at it, I didn't see any easy way to split it out from the
> routing
> too, so I kept that dependency in.
>> Hi Alex, Paul,
>>
>> Thanks for your replies. Hence, don't you think the KVM API doc deserves
>> to be clarified then?
> If you look at the code, you will see that the irq routing entry is
> used as
> token for an irqfd. So every irqfd only knows its routing table entry,
> nothing else.
 As far as I can see, the routing table entry is used for only two
 things: to tell whether the interrupt is an MSI or LSI, and to pass to
 kvm_set_msi.
>>
>> Indeed, I noticed irq_entry in the _irqfd struct is only set for MSI routing.
>> For other IRQs - I guess what you call LSIs - the routing table map[] is
>> used to retrieve the irqchip.pin(s) associated with that gsi and call set().
 One way to tackle the problem would be to abstract out the routing
 table lookup into a function in irqchip.c, rather than having it
 open-coded in irqfd_update.  Then, if you don't want to have the
 routing table you write an alternative function that creates a
 routing entry from scratch.  It would need information from the
 low-level irq chip code as to which interrupts are LSIs and which are
 MSIs.  It also ties you down to having only one kind of interrupt
>> controller.
>>> You could also create it along with the irqfd state struct. So
>>> instead of
>>>
>>>kzalloc(sizeof(*irqfd), GFP_KERNEL);
>>>
>>> we could do
>>>
>>>kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry),
>>> GFP_KERNEL);
>>
>> Wouldn't it make sense, in kvm_irqfd_assign, to test whether there is a
>> routing entry associated with that gsi? If there is none, create a default
>> one where irqchip.pin = gsi, or something architecture-specific?
> 
> "no entry" really should be "no entry". I think if we want some other
> mechanism, we should make it explicit.
Yep, I agree. But currently the usage of the routing mechanism is not
explicit in section 4.75 KVM_IRQFD either, and that was causing my doubts ;-)
On top of that, the doc says:
"kvm_irqfd.gsi specifies the irqchip pin toggled by this event", but if
no routing table has been set beforehand by the user side, the injection will fail.
> 
>>
>> That way we would not need a complete and big routing table to be set by
>> the user side?
>>
>>> and point the routing entry to the inline irqfd one. We'd lose the
>>> ability to change any routings with that construct, but I think that's
>>> ok for irq controllers that are not configurable.
>>>
>>> Alternatively, we could also have a single routing entry that is a
>>> wildcard match, ranging from say LSI [x...y]. The irqfd structure would
>>> then need to carry a local payload to define an offset inside that
>>> wildcard routing entry.
>> Also, whatever the number of IRQs assigned, we have this big
>> chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS] array allocated. And on ARM we
>> have some difficulty defining KVM_IRQCHIP_NUM_PINS, which is
>> quite dynamic.
> 
> Right, if you go with the ranged routing entries you should be able to
> get by with a very small number of NUM_PINS, no?
Yes, that's correct, but the effort to integrate such a feature into the code
needs to be further assessed.
Eric
> 
> 
> Alex
> 


[PATCH 3.14 38/68] MIPS: KVM: Allocate at least 16KB for exception handlers

2014-06-24 Thread Greg Kroah-Hartman
3.14-stable review patch.  If anyone has any objections, please let me know.

--

From: James Hogan 

commit 7006e2dfda9adfa40251093604db76d7e44263b3 upstream.

Each MIPS KVM guest has its own copy of the KVM exception vector. This
contains the TLB refill exception handler at offset 0x000, the general
exception handler at offset 0x180, and interrupt exception handlers at
offset 0x200 in case Cause_IV=1. A common handler is copied to offset
0x2000 and offset 0x3000 is used for temporarily storing k1 during entry
from guest.

However the amount of memory allocated for this purpose is calculated as
0x200 rounded up to the next page boundary, which is insufficient if 4KB
pages are in use. This can lead to the common handler at offset 0x2000
being overwritten and infinitely recursive exceptions on the next exit
from the guest.

Increase the minimum size from 0x200 to 0x4000 to cover the full use of
the page.

Signed-off-by: James Hogan 
Cc: Paolo Bonzini 
Cc: Gleb Natapov 
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle 
Cc: linux-m...@linux-mips.org
Cc: Sanjay Lal 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/mips/kvm/kvm_mips.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -304,7 +304,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
if (cpu_has_veic || cpu_has_vint) {
size = 0x200 + VECTORSPACING * 64;
} else {
-   size = 0x200;
+   size = 0x4000;
}
 
/* Save Linux EBASE */




Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexey Kardashevskiy
On 06/25/2014 12:43 AM, Alex Williamson wrote:
> On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
>> On 06/25/2014 12:21 AM, Alex Williamson wrote:
>>> On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
 On 24.06.14 15:01, Alexey Kardashevskiy wrote:
> On 06/24/2014 10:52 PM, Alexander Graf wrote:
>> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
>>> On 06/24/2014 08:41 PM, Alexander Graf wrote:
 On 24.06.14 12:11, Alexey Kardashevskiy wrote:
> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>
>>> Working on big endian being an accident may be a matter of 
>>> perspective
>> :-)
>>
>>> The comment remains that this patch doesn't actually fix anything 
>>> except
>>> the overhead on big endian systems doing redundant byte swapping and
>>> maybe the philosophy that vfio regions are little endian.
>> Yes, that works by accident because technically VFIO is a transport 
>> and
>> thus shouldn't perform any endian swapping of any sort, which remains
>> the responsibility of the end driver which is the only one to know
>> whether a given BAR location is a register or some streaming data
>> and in the former case whether it's LE or BE (some PCI devices are BE
>> even ! :-)
>>
>> But yes, in the end, it works with the dual "cancelling" swaps and 
>> the
>> overhead of those swaps is probably drowned in the noise of the 
>> syscall
>> overhead.
>>
>>> I'm still not a fan of iowrite vs iowritebe, there must be 
>>> something we
>>> can use that doesn't have an implicit swap.
>> Sadly there isn't ... In the old days we didn't even have the "be"
>> variant and readl/writel style accessors still don't have them either
>> for all archs.
>>
>> There is __raw_readl/writel but here the semantics are much more than
>> just "don't swap", they also don't have memory barriers (which means
>> they are essentially useless to most drivers unless those are 
>> platform
>> specific drivers which know exactly what they are doing, or in the 
>> rare
>> cases such as accessing a framebuffer which we know never have side
>> effects).
>>
>>> Calling it iowrite*_native is also an abuse of the namespace.
>>> Next thing we know some common code
>>> will legitimately use that name.
>> It might make sense to move those definitions into a common header. There 
>> have
>> been a handful of cases in the past that wanted that sort of "native
>> byte order" MMIOs iirc (though don't ask me for examples, I can't 
>> really
>> remember).
>>
>>> If we do need to define an alias
>>> (which I'd like to avoid) it should be something like 
>>> vfio_iowrite32.
> Ping?
>
> We need to make a decision whether to move those xxx_native() helpers
> somewhere (where?) or leave the patch as is (as we figured out that
> iowriteXX functions implement barriers and we cannot just use raw
> accessors) and fix commit log to explain everything.
 Is there actually any difference in generated code with this patch 
 applied
 and without? I would hope that iowrite..() is inlined and cancels out 
 the
 cpu_to_le..() calls that are also inlined?
>>> iowrite32 is a non-inline function, so the conversions do take place, and
>>> so do the others. And sorry, but I fail to see why this matters. We are not
>>> trying to accelerate things, we are removing redundant operations which
>>> confuse people who read the code.
>> The confusion depends on where you're coming from. If you happen to know
>> that "iowrite32" writes in LE, then the LE conversion makes a lot of 
>> sense.
> It was like this (and this is just confusing):
>
> iowrite32(le32_to_cpu(val), io + off);
>
> What would make sense (according to you and I would understand this) is 
> this:
>
> iowrite32(cpu_to_le32(val), io + off);
>
>
> Or I missed your point, did I?

 No, you didn't miss it. I think for people who know how iowrite32() 
 works the above is obvious. I find the fact that iowrite32() writes in 
 LE always pretty scary though ;).

 So IMHO we should either create new, generic iowrite helpers that don't 
 do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
>>>
>>> I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
>>
>>
>> I do not understand why @val is considered LE here and needs to be converted
>> to CPU. Really. I truly believe it should be cpu_to_le32().

Re: [PATCH 4/4] kvm: Implement PEBS virtualization

2014-06-24 Thread Marcelo Tosatti
On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote:
> > First, it's not sufficient to pin the debug store area, you also
> > have to pin the guest page tables that are used to map the debug
> > store.  But even if you do that, as soon as the guest fork()s, it
> > will create a new pgd which the host will be free to swap out.  The
> > processor can then attempt a PEBS store to an unmapped address which
> > will fail, even though the guest is configured correctly.
> 
> That's a good point. You're right of course.
> 
> The only way I can think around it would be to intercept CR3 writes
> while PEBS is active and always pin all the table pages leading 
> to the PEBS buffer. That's slow, but should be only needed
> while PEBS is running.
> 
> -Andi

I suppose that can be done separately from the pinned-spte patchset.
And it requires accounting against mlock limits as well, as noted.

One set of pagetables per pinned virtual address leading down to the
last translations is sufficient per-vcpu.




[PATCH 3.10 29/52] MIPS: KVM: Allocate at least 16KB for exception handlers

2014-06-24 Thread Greg Kroah-Hartman
3.10-stable review patch.  If anyone has any objections, please let me know.

--

From: James Hogan 

commit 7006e2dfda9adfa40251093604db76d7e44263b3 upstream.

Each MIPS KVM guest has its own copy of the KVM exception vector. This
contains the TLB refill exception handler at offset 0x000, the general
exception handler at offset 0x180, and interrupt exception handlers at
offset 0x200 in case Cause_IV=1. A common handler is copied to offset
0x2000 and offset 0x3000 is used for temporarily storing k1 during entry
from guest.

However the amount of memory allocated for this purpose is calculated as
0x200 rounded up to the next page boundary, which is insufficient if 4KB
pages are in use. This can lead to the common handler at offset 0x2000
being overwritten and infinitely recursive exceptions on the next exit
from the guest.

Increase the minimum size from 0x200 to 0x4000 to cover the full use of
the page.

Signed-off-by: James Hogan 
Cc: Paolo Bonzini 
Cc: Gleb Natapov 
Cc: kvm@vger.kernel.org
Cc: Ralf Baechle 
Cc: linux-m...@linux-mips.org
Cc: Sanjay Lal 
Signed-off-by: Paolo Bonzini 
Signed-off-by: Greg Kroah-Hartman 

---
 arch/mips/kvm/kvm_mips.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -299,7 +299,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st
if (cpu_has_veic || cpu_has_vint) {
size = 0x200 + VECTORSPACING * 64;
} else {
-   size = 0x200;
+   size = 0x4000;
}
 
/* Save Linux EBASE */




[PATCH v3 0/9] MIPS: KVM: Bugfixes and cleanups

2014-06-24 Thread Deng-Cheng Zhu
The patches are pretty straightforward.

Changes:
v3 - v2:
o In patch #2, change the use of kvm_[err|info|debug].
o In patch #3, add err removal in kvm_arch_commit_memory_region().
o In patch #3, revert the changes to kvm_arch_vm_ioctl().
o In patch #7, drop the merge of kvm_arch_vcpu_free() and pointer nullification.
o Add patch #9.
v2 - v1:
o In patch #1, don't change the opening comment mark for kernel-doc comments.
o In patch #1, to make long lines more readable, use local variables / macros.
o In patch #1, slight format adjustments are made.
o Use -M flag to generate patches (detect renames).
o Add patch #8.

Deng-Cheng Zhu (8):
  MIPS: KVM: Reformat code and comments
  MIPS: KVM: Use KVM internal logger
  MIPS: KVM: Simplify functions by removing redundancy
  MIPS: KVM: Remove unneeded volatile
  MIPS: KVM: Rename files to remove the prefix "kvm_" and "kvm_mips_"
  MIPS: KVM: Restore correct value for WIRED at TLB uninit
  MIPS: KVM: Fix memory leak on VCPU
  MIPS: KVM: Skip memory cleaning in kvm_mips_commpage_init()

James Hogan (1):
  MIPS: KVM: Remove dead code of TLB index error in
kvm_mips_emul_tlbwr()

 arch/mips/include/asm/kvm_host.h  |  12 +-
 arch/mips/include/asm/r4kcache.h  |   3 +
 arch/mips/kvm/Makefile|   8 +-
 arch/mips/kvm/{kvm_cb.c => callback.c}|   0
 arch/mips/kvm/commpage.c  |  33 ++
 arch/mips/kvm/commpage.h  |  24 +
 arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} |  40 +-
 arch/mips/kvm/{kvm_mips_emul.c => emulate.c}  | 539 +++---
 arch/mips/kvm/{kvm_mips_int.c => interrupt.c} |  47 +-
 arch/mips/kvm/{kvm_mips_int.h => interrupt.h} |  22 +-
 arch/mips/kvm/kvm_mips_comm.h |  23 -
 arch/mips/kvm/kvm_mips_commpage.c |  37 --
 arch/mips/kvm/kvm_mips_opcode.h   |  24 -
 arch/mips/kvm/{kvm_locore.S => locore.S}  |  55 ++-
 arch/mips/kvm/{kvm_mips.c => mips.c}  | 227 +
 arch/mips/kvm/opcode.h|  22 +
 arch/mips/kvm/{kvm_mips_stats.c => stats.c}   |  28 +-
 arch/mips/kvm/{kvm_tlb.c => tlb.c}| 258 +--
 arch/mips/kvm/trace.h |  18 +-
 arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c}| 109 +++--
 20 files changed, 750 insertions(+), 779 deletions(-)
 rename arch/mips/kvm/{kvm_cb.c => callback.c} (100%)
 create mode 100644 arch/mips/kvm/commpage.c
 create mode 100644 arch/mips/kvm/commpage.h
 rename arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} (79%)
 rename arch/mips/kvm/{kvm_mips_emul.c => emulate.c} (83%)
 rename arch/mips/kvm/{kvm_mips_int.c => interrupt.c} (85%)
 rename arch/mips/kvm/{kvm_mips_int.h => interrupt.h} (74%)
 delete mode 100644 arch/mips/kvm/kvm_mips_comm.h
 delete mode 100644 arch/mips/kvm/kvm_mips_commpage.c
 delete mode 100644 arch/mips/kvm/kvm_mips_opcode.h
 rename arch/mips/kvm/{kvm_locore.S => locore.S} (93%)
 rename arch/mips/kvm/{kvm_mips.c => mips.c} (83%)
 create mode 100644 arch/mips/kvm/opcode.h
 rename arch/mips/kvm/{kvm_mips_stats.c => stats.c} (63%)
 rename arch/mips/kvm/{kvm_tlb.c => tlb.c} (78%)
 rename arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c} (83%)

-- 
1.8.5.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 8/9] MIPS: KVM: Skip memory cleaning in kvm_mips_commpage_init()

2014-06-24 Thread Deng-Cheng Zhu
From: Deng-Cheng Zhu 

The commpage is allocated with kzalloc(), so there is no need to clear the
memory of the kvm_mips_commpage struct or its embedded mips_coproc again.

Signed-off-by: Deng-Cheng Zhu 
---
 arch/mips/kvm/commpage.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/mips/kvm/commpage.c b/arch/mips/kvm/commpage.c
index 61b9c04..2d6e976 100644
--- a/arch/mips/kvm/commpage.c
+++ b/arch/mips/kvm/commpage.c
@@ -28,9 +28,6 @@ void kvm_mips_commpage_init(struct kvm_vcpu *vcpu)
 {
struct kvm_mips_commpage *page = vcpu->arch.kseg0_commpage;
 
-   memset(page, 0, sizeof(struct kvm_mips_commpage));
-
/* Specific init values for fields */
vcpu->arch.cop0 = &page->cop0;
-   memset(vcpu->arch.cop0, 0, sizeof(struct mips_coproc));
 }
-- 
1.8.5.3



[PATCH v3 4/9] MIPS: KVM: Remove unneeded volatile

2014-06-24 Thread Deng-Cheng Zhu
From: Deng-Cheng Zhu 

The volatile qualifier on idx in the TLB functions is unnecessary: idx is an
ordinary local variable that nothing modifies asynchronously.

Signed-off-by: Deng-Cheng Zhu 
---
 arch/mips/kvm/kvm_tlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/mips/kvm/kvm_tlb.c b/arch/mips/kvm/kvm_tlb.c
index 29a5bdb..bbcd822 100644
--- a/arch/mips/kvm/kvm_tlb.c
+++ b/arch/mips/kvm/kvm_tlb.c
@@ -201,7 +201,7 @@ int kvm_mips_host_tlb_write(struct kvm_vcpu *vcpu, unsigned 
long entryhi,
 {
unsigned long flags;
unsigned long old_entryhi;
-   volatile int idx;
+   int idx;
 
local_irq_save(flags);
 
@@ -426,7 +426,7 @@ EXPORT_SYMBOL(kvm_mips_guest_tlb_lookup);
 int kvm_mips_host_tlb_lookup(struct kvm_vcpu *vcpu, unsigned long vaddr)
 {
unsigned long old_entryhi, flags;
-   volatile int idx;
+   int idx;
 
local_irq_save(flags);
 
-- 
1.8.5.3



[PATCH v3 6/9] MIPS: KVM: Restore correct value for WIRED at TLB uninit

2014-06-24 Thread Deng-Cheng Zhu
From: Deng-Cheng Zhu 

At TLB initialization, the commpage TLB entry is reserved on top of the
existing WIRED entries (whose count is not necessarily 0), so TLB uninit must
restore the previous WIRED value instead of resetting it to 0.

Signed-off-by: Deng-Cheng Zhu 
---
 arch/mips/kvm/mips.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 27250ee..3d53d34 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -170,7 +170,7 @@ void kvm_arch_sync_events(struct kvm *kvm)
 static void kvm_mips_uninit_tlbs(void *arg)
 {
/* Restore wired count */
-   write_c0_wired(0);
+   write_c0_wired(read_c0_wired() - 1);
mtc0_tlbw_hazard();
/* Clear out all the TLBs */
kvm_local_flush_tlb_all();
-- 
1.8.5.3



[PATCH v3 9/9] MIPS: KVM: Remove dead code of TLB index error in kvm_mips_emul_tlbwr()

2014-06-24 Thread Deng-Cheng Zhu
From: James Hogan 

After the TLB index has been masked with (KVM_MIPS_GUEST_TLB_SIZE - 1) it is
confined to the valid range, so the error handling can never be reached.
Remove the dead code.

Signed-off-by: James Hogan 
Signed-off-by: Deng-Cheng Zhu 
---
 arch/mips/kvm/emulate.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c
index 982f4af..5500fef 100644
--- a/arch/mips/kvm/emulate.c
+++ b/arch/mips/kvm/emulate.c
@@ -846,11 +846,6 @@ enum emulation_result kvm_mips_emul_tlbwr(struct kvm_vcpu 
*vcpu)
get_random_bytes(&index, sizeof(index));
index &= (KVM_MIPS_GUEST_TLB_SIZE - 1);
 
-   if (index < 0 || index >= KVM_MIPS_GUEST_TLB_SIZE) {
-   kvm_err("%s: illegal index: %d\n", __func__, index);
-   return EMULATE_FAIL;
-   }
-
tlb = &vcpu->arch.guest_tlb[index];
 
/*
-- 
1.8.5.3



[PATCH v3 7/9] MIPS: KVM: Fix memory leak on VCPU

2014-06-24 Thread Deng-Cheng Zhu
From: Deng-Cheng Zhu 

kvm_arch_vcpu_free() is called in 2 code paths:

1) kvm_vm_ioctl()
   kvm_vm_ioctl_create_vcpu()
   kvm_arch_vcpu_destroy()
   kvm_arch_vcpu_free()
2) kvm_put_kvm()
   kvm_destroy_vm()
   kvm_arch_destroy_vm()
   kvm_mips_free_vcpus()
   kvm_arch_vcpu_free()

Neither path frees the VCPU structure itself. Do that in
kvm_arch_vcpu_free(), matching the allocation in kvm_arch_vcpu_create().

Signed-off-by: Deng-Cheng Zhu 
---
Changes:
v3 - v2:
o Drop the merge of kvm_arch_vcpu_free() and pointer nullification.

 arch/mips/kvm/mips.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index 3d53d34..7f7fd76 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -384,6 +384,7 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 
kfree(vcpu->arch.guest_ebase);
kfree(vcpu->arch.kseg0_commpage);
+   kfree(vcpu);
 }
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
-- 
1.8.5.3



[PATCH v3 2/9] MIPS: KVM: Use KVM internal logger

2014-06-24 Thread Deng-Cheng Zhu
From: Deng-Cheng Zhu 

Replace printks with kvm_[err|info|debug].

Signed-off-by: Deng-Cheng Zhu 
---
Changes:
v3 - v2:
o Change the use of kvm_[err|info|debug].

 arch/mips/kvm/kvm_mips.c   |  23 -
 arch/mips/kvm/kvm_mips_emul.c  | 107 -
 arch/mips/kvm/kvm_mips_stats.c |   6 +--
 arch/mips/kvm/kvm_tlb.c|  60 +++
 arch/mips/kvm/kvm_trap_emul.c  |  31 +---
 5 files changed, 110 insertions(+), 117 deletions(-)

diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index 821e7e8..bdca619 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -817,8 +817,8 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct 
kvm_dirty_log *log)
ga = memslot->base_gfn << PAGE_SHIFT;
ga_end = ga + (memslot->npages << PAGE_SHIFT);
 
-   printk("%s: dirty, ga: %#lx, ga_end %#lx\n", __func__, ga,
-  ga_end);
+   kvm_info("%s: dirty, ga: %#lx, ga_end %#lx\n", __func__, ga,
+ga_end);
 
n = kvm_dirty_bitmap_bytes(memslot);
memset(memslot->dirty_bitmap, 0, n);
@@ -925,24 +925,25 @@ int kvm_arch_vcpu_dump_regs(struct kvm_vcpu *vcpu)
if (!vcpu)
return -1;
 
-   printk("VCPU Register Dump:\n");
-   printk("\tpc = 0x%08lx\n", vcpu->arch.pc);
-   printk("\texceptions: %08lx\n", vcpu->arch.pending_exceptions);
+   kvm_debug("VCPU Register Dump:\n");
+   kvm_debug("\tpc = 0x%08lx\n", vcpu->arch.pc);
+   kvm_debug("\texceptions: %08lx\n", vcpu->arch.pending_exceptions);
 
for (i = 0; i < 32; i += 4) {
-   printk("\tgpr%02d: %08lx %08lx %08lx %08lx\n", i,
+   kvm_debug("\tgpr%02d: %08lx %08lx %08lx %08lx\n", i,
   vcpu->arch.gprs[i],
   vcpu->arch.gprs[i + 1],
   vcpu->arch.gprs[i + 2], vcpu->arch.gprs[i + 3]);
}
-   printk("\thi: 0x%08lx\n", vcpu->arch.hi);
-   printk("\tlo: 0x%08lx\n", vcpu->arch.lo);
+   kvm_debug("\thi: 0x%08lx\n", vcpu->arch.hi);
+   kvm_debug("\tlo: 0x%08lx\n", vcpu->arch.lo);
 
cop0 = vcpu->arch.cop0;
-   printk("\tStatus: 0x%08lx, Cause: 0x%08lx\n",
-  kvm_read_c0_guest_status(cop0), kvm_read_c0_guest_cause(cop0));
+   kvm_debug("\tStatus: 0x%08lx, Cause: 0x%08lx\n",
+ kvm_read_c0_guest_status(cop0),
+ kvm_read_c0_guest_cause(cop0));
 
-   printk("\tEPC: 0x%08lx\n", kvm_read_c0_guest_epc(cop0));
+   kvm_debug("\tEPC: 0x%08lx\n", kvm_read_c0_guest_epc(cop0));
 
return 0;
 }
diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/kvm_mips_emul.c
index 993dd1c..262ce3e 100644
--- a/arch/mips/kvm/kvm_mips_emul.c
+++ b/arch/mips/kvm/kvm_mips_emul.c
@@ -183,18 +183,18 @@ unsigned long kvm_compute_return_epc(struct kvm_vcpu 
*vcpu,
 
/* And now the FPA/cp1 branch instructions. */
case cop1_op:
-   printk("%s: unsupported cop1_op\n", __func__);
+   kvm_err("%s: unsupported cop1_op\n", __func__);
break;
}
 
return nextpc;
 
 unaligned:
-   printk("%s: unaligned epc\n", __func__);
+   kvm_err("%s: unaligned epc\n", __func__);
return nextpc;
 
 sigill:
-   printk("%s: DSP branch but not DSP ASE\n", __func__);
+   kvm_err("%s: DSP branch but not DSP ASE\n", __func__);
return nextpc;
 }
 
@@ -751,8 +751,8 @@ enum emulation_result kvm_mips_emul_eret(struct kvm_vcpu 
*vcpu)
kvm_clear_c0_guest_status(cop0, ST0_ERL);
vcpu->arch.pc = kvm_read_c0_guest_errorepc(cop0);
} else {
-   printk("[%#lx] ERET when MIPS_SR_EXL|MIPS_SR_ERL == 0\n",
-  vcpu->arch.pc);
+   kvm_err("[%#lx] ERET when MIPS_SR_EXL|MIPS_SR_ERL == 0\n",
+   vcpu->arch.pc);
er = EMULATE_FAIL;
}
 
@@ -795,7 +795,7 @@ enum emulation_result kvm_mips_emul_tlbr(struct kvm_vcpu 
*vcpu)
enum emulation_result er = EMULATE_FAIL;
uint32_t pc = vcpu->arch.pc;
 
-   printk("[%#x] COP0_TLBR [%ld]\n", pc, kvm_read_c0_guest_index(cop0));
+   kvm_err("[%#x] COP0_TLBR [%ld]\n", pc, kvm_read_c0_guest_index(cop0));
return er;
 }
 
@@ -809,13 +809,12 @@ enum emulation_result kvm_mips_emul_tlbwi(struct kvm_vcpu 
*vcpu)
uint32_t pc = vcpu->arch.pc;
 
if (index < 0 || index >= KVM_MIPS_GUEST_TLB_SIZE) {
-   printk("%s: illegal index: %d\n", __func__, index);
-   printk
-   ("[%#x] COP0_TLBWI [%d] (entryhi: %#lx, entrylo0: %#lx 
entrylo1: %#lx, mask: %#lx)\n",
-pc, index, kvm_read_c0_guest_entryhi(cop0),
-kvm_read_c0_guest_entrylo0(cop0),
-kvm_read_c0_guest_entrylo1(cop0),
-kvm_read_c0_guest_pagemask(cop0));
+ 

[PATCH v3 5/9] MIPS: KVM: Rename files to remove the prefix "kvm_" and "kvm_mips_"

2014-06-24 Thread Deng-Cheng Zhu
From: Deng-Cheng Zhu 

Since all the files live in arch/mips/kvm/, there is no need for the prefixes
"kvm_" and "kvm_mips_".

Signed-off-by: Deng-Cheng Zhu 
---
 arch/mips/kvm/Makefile| 8 
 arch/mips/kvm/{kvm_cb.c => callback.c}| 0
 arch/mips/kvm/{kvm_mips_commpage.c => commpage.c} | 2 +-
 arch/mips/kvm/{kvm_mips_comm.h => commpage.h} | 0
 arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} | 2 +-
 arch/mips/kvm/{kvm_mips_emul.c => emulate.c}  | 6 +++---
 arch/mips/kvm/{kvm_mips_int.c => interrupt.c} | 2 +-
 arch/mips/kvm/{kvm_mips_int.h => interrupt.h} | 0
 arch/mips/kvm/{kvm_locore.S => locore.S}  | 0
 arch/mips/kvm/{kvm_mips.c => mips.c}  | 6 +++---
 arch/mips/kvm/{kvm_mips_opcode.h => opcode.h} | 0
 arch/mips/kvm/{kvm_mips_stats.c => stats.c}   | 0
 arch/mips/kvm/{kvm_tlb.c => tlb.c}| 0
 arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c}| 4 ++--
 14 files changed, 15 insertions(+), 15 deletions(-)
 rename arch/mips/kvm/{kvm_cb.c => callback.c} (100%)
 rename arch/mips/kvm/{kvm_mips_commpage.c => commpage.c} (97%)
 rename arch/mips/kvm/{kvm_mips_comm.h => commpage.h} (100%)
 rename arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} (99%)
 rename arch/mips/kvm/{kvm_mips_emul.c => emulate.c} (99%)
 rename arch/mips/kvm/{kvm_mips_int.c => interrupt.c} (99%)
 rename arch/mips/kvm/{kvm_mips_int.h => interrupt.h} (100%)
 rename arch/mips/kvm/{kvm_locore.S => locore.S} (100%)
 rename arch/mips/kvm/{kvm_mips.c => mips.c} (99%)
 rename arch/mips/kvm/{kvm_mips_opcode.h => opcode.h} (100%)
 rename arch/mips/kvm/{kvm_mips_stats.c => stats.c} (100%)
 rename arch/mips/kvm/{kvm_tlb.c => tlb.c} (100%)
 rename arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c} (99%)

diff --git a/arch/mips/kvm/Makefile b/arch/mips/kvm/Makefile
index 78d87bb..401fe02 100644
--- a/arch/mips/kvm/Makefile
+++ b/arch/mips/kvm/Makefile
@@ -5,9 +5,9 @@ common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o 
coalesced_mmio.o)
 
 EXTRA_CFLAGS += -Ivirt/kvm -Iarch/mips/kvm
 
-kvm-objs := $(common-objs) kvm_mips.o kvm_mips_emul.o kvm_locore.o \
-   kvm_mips_int.o kvm_mips_stats.o kvm_mips_commpage.o \
-   kvm_mips_dyntrans.o kvm_trap_emul.o
+kvm-objs := $(common-objs) mips.o emulate.o locore.o \
+   interrupt.o stats.o commpage.o \
+   dyntrans.o trap_emul.o
 
 obj-$(CONFIG_KVM)  += kvm.o
-obj-y  += kvm_cb.o kvm_tlb.o
+obj-y  += callback.o tlb.o
diff --git a/arch/mips/kvm/kvm_cb.c b/arch/mips/kvm/callback.c
similarity index 100%
rename from arch/mips/kvm/kvm_cb.c
rename to arch/mips/kvm/callback.c
diff --git a/arch/mips/kvm/kvm_mips_commpage.c b/arch/mips/kvm/commpage.c
similarity index 97%
rename from arch/mips/kvm/kvm_mips_commpage.c
rename to arch/mips/kvm/commpage.c
index 4b5612b..61b9c04 100644
--- a/arch/mips/kvm/kvm_mips_commpage.c
+++ b/arch/mips/kvm/commpage.c
@@ -22,7 +22,7 @@
 
 #include 
 
-#include "kvm_mips_comm.h"
+#include "commpage.h"
 
 void kvm_mips_commpage_init(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/mips/kvm/kvm_mips_comm.h b/arch/mips/kvm/commpage.h
similarity index 100%
rename from arch/mips/kvm/kvm_mips_comm.h
rename to arch/mips/kvm/commpage.h
diff --git a/arch/mips/kvm/kvm_mips_dyntrans.c b/arch/mips/kvm/dyntrans.c
similarity index 99%
rename from arch/mips/kvm/kvm_mips_dyntrans.c
rename to arch/mips/kvm/dyntrans.c
index fa7184d..521121b 100644
--- a/arch/mips/kvm/kvm_mips_dyntrans.c
+++ b/arch/mips/kvm/dyntrans.c
@@ -18,7 +18,7 @@
 #include 
 #include 
 
-#include "kvm_mips_comm.h"
+#include "commpage.h"
 
 #define SYNCI_TEMPLATE  0x041f
 #define SYNCI_BASE(x)   (((x) >> 21) & 0x1f)
diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/emulate.c
similarity index 99%
rename from arch/mips/kvm/kvm_mips_emul.c
rename to arch/mips/kvm/emulate.c
index e5862bc..982f4af 100644
--- a/arch/mips/kvm/kvm_mips_emul.c
+++ b/arch/mips/kvm/emulate.c
@@ -29,9 +29,9 @@
 #include 
 #define CONFIG_MIPS_MT
 
-#include "kvm_mips_opcode.h"
-#include "kvm_mips_int.h"
-#include "kvm_mips_comm.h"
+#include "opcode.h"
+#include "interrupt.h"
+#include "commpage.h"
 
 #include "trace.h"
 
diff --git a/arch/mips/kvm/kvm_mips_int.c b/arch/mips/kvm/interrupt.c
similarity index 99%
rename from arch/mips/kvm/kvm_mips_int.c
rename to arch/mips/kvm/interrupt.c
index d458c04..9b44459 100644
--- a/arch/mips/kvm/kvm_mips_int.c
+++ b/arch/mips/kvm/interrupt.c
@@ -20,7 +20,7 @@
 
 #include 
 
-#include "kvm_mips_int.h"
+#include "interrupt.h"
 
 void kvm_mips_queue_irq(struct kvm_vcpu *vcpu, uint32_t priority)
 {
diff --git a/arch/mips/kvm/kvm_mips_int.h b/arch/mips/kvm/interrupt.h
similarity index 100%
rename from arch/mips/kvm/kvm_mips_int.h
rename to arch/mips/kvm/interrupt.h
diff --git a/arch/mips/kvm/kvm_locore.S b/arch/mips/kvm/locore.S
similarity index 100%
rename from arch/mips/kvm/kvm_locore.S
rename to arch/mips/kvm/locore.S
diff --git a/arch/mips/kvm/kvm_mips.c b/arch/m

[PATCH v3 3/9] MIPS: KVM: Simplify functions by removing redundancy

2014-06-24 Thread Deng-Cheng Zhu
From: Deng-Cheng Zhu 

No functional changes.

Signed-off-by: Deng-Cheng Zhu 
---
Changes:
v3 - v2:
o Add err removal in kvm_arch_commit_memory_region().
o Revert the changes to kvm_arch_vm_ioctl().

 arch/mips/include/asm/kvm_host.h  |  2 +-
 arch/mips/kvm/kvm_mips.c  | 18 --
 arch/mips/kvm/kvm_mips_commpage.c |  2 --
 arch/mips/kvm/kvm_mips_emul.c | 34 +++---
 arch/mips/kvm/kvm_mips_stats.c|  4 +---
 5 files changed, 17 insertions(+), 43 deletions(-)

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index 3f813f2..7a3fc67 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -764,7 +764,7 @@ extern int kvm_mips_trans_mtc0(uint32_t inst, uint32_t *opc,
   struct kvm_vcpu *vcpu);
 
 /* Misc */
-extern int kvm_mips_dump_stats(struct kvm_vcpu *vcpu);
+extern void kvm_mips_dump_stats(struct kvm_vcpu *vcpu);
 extern unsigned long kvm_mips_get_ramsize(struct kvm *kvm);
 
 
diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
index bdca619..cabcac0a 100644
--- a/arch/mips/kvm/kvm_mips.c
+++ b/arch/mips/kvm/kvm_mips.c
@@ -97,9 +97,7 @@ void kvm_arch_hardware_unsetup(void)
 
 void kvm_arch_check_processor_compat(void *rtn)
 {
-   int *r = (int *)rtn;
-   *r = 0;
-   return;
+   *(int *)rtn = 0;
 }
 
 static void kvm_mips_init_tlbs(struct kvm *kvm)
@@ -225,7 +223,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
   enum kvm_mr_change change)
 {
unsigned long npages = 0;
-   int i, err = 0;
+   int i;
 
kvm_debug("%s: kvm: %p slot: %d, GPA: %llx, size: %llx, QVA: %llx\n",
  __func__, kvm, mem->slot, mem->guest_phys_addr,
@@ -243,8 +241,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
if (!kvm->arch.guest_pmap) {
kvm_err("Failed to allocate guest PMAP");
-   err = -ENOMEM;
-   goto out;
+   return;
}
 
kvm_debug("Allocated space for Guest PMAP Table (%ld 
pages) @ %p\n",
@@ -255,8 +252,6 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
kvm->arch.guest_pmap[i] = KVM_INVALID_PAGE;
}
}
-out:
-   return;
 }
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
@@ -845,16 +840,12 @@ long kvm_arch_vm_ioctl(struct file *filp, unsigned int 
ioctl, unsigned long arg)
 
 int kvm_arch_init(void *opaque)
 {
-   int ret;
-
if (kvm_mips_callbacks) {
kvm_err("kvm: module already exists\n");
return -EEXIST;
}
 
-   ret = kvm_mips_emulation_init(&kvm_mips_callbacks);
-
-   return ret;
+   return kvm_mips_emulation_init(&kvm_mips_callbacks);
 }
 
 void kvm_arch_exit(void)
@@ -1008,7 +999,6 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
-   return;
 }
 
 int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
diff --git a/arch/mips/kvm/kvm_mips_commpage.c 
b/arch/mips/kvm/kvm_mips_commpage.c
index ab7096e..4b5612b 100644
--- a/arch/mips/kvm/kvm_mips_commpage.c
+++ b/arch/mips/kvm/kvm_mips_commpage.c
@@ -33,6 +33,4 @@ void kvm_mips_commpage_init(struct kvm_vcpu *vcpu)
/* Specific init values for fields */
vcpu->arch.cop0 = &page->cop0;
memset(vcpu->arch.cop0, 0, sizeof(struct mips_coproc));
-
-   return;
 }
diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/kvm_mips_emul.c
index 262ce3e..e5862bc 100644
--- a/arch/mips/kvm/kvm_mips_emul.c
+++ b/arch/mips/kvm/kvm_mips_emul.c
@@ -761,8 +761,6 @@ enum emulation_result kvm_mips_emul_eret(struct kvm_vcpu 
*vcpu)
 
 enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu)
 {
-   enum emulation_result er = EMULATE_DONE;
-
kvm_debug("[%#lx] !!!WAIT!!! (%#lx)\n", vcpu->arch.pc,
  vcpu->arch.pending_exceptions);
 
@@ -782,7 +780,7 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu 
*vcpu)
}
}
 
-   return er;
+   return EMULATE_DONE;
 }
 
 /*
@@ -792,11 +790,10 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu 
*vcpu)
 enum emulation_result kvm_mips_emul_tlbr(struct kvm_vcpu *vcpu)
 {
struct mips_coproc *cop0 = vcpu->arch.cop0;
-   enum emulation_result er = EMULATE_FAIL;
uint32_t pc = vcpu->arch.pc;
 
kvm_err("[%#x] COP0_TLBR [%ld]\n", pc, kvm_read_c0_guest_index(cop0));
-   return er;
+   return EMULATE_FAIL;
 }
 
 /* Write Guest TLB Entry @ Index */
@@ -804,7 +801,6 @@ enum emulation_result kvm_mips_emul_tlbwi(struct kvm_vcpu 
*vcpu)
 {
struct mips_coproc *cop0 = vcpu->arch.cop0;
int index = kvm_read_c0_guest_index(cop0);
-   enum emulation_result er = EMULATE_DONE;
struct k

Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
> Howdy,
> 
> Ben reminded me a while back that we have a nasty race in our KVM PV code.
> 
> We replace a few instructions with longer streams of instructions to check
> whether it's necessary to trap out from it (like mtmsr, no need to trap if
> we only disable interrupts). During those replacement chunks we must not get
> any interrupts, because they might overwrite scratch space that we already
> used to save otherwise clobbered register state into.
> 
> So we have a thing called "critical sections" which allows us to atomically
> get in and out of "interrupt disabled" modes without touching MSR. When we
> are supposed to deliver an interrupt into the guest while we are in a critical
> section, we just don't inject the interrupt yet, but leave it be until the
> next trap.
> 
> However, we never really know when the next trap would be. For all we know it
> could be never. At this point we created a race that is a potential source
> for interrupt loss or at least deferral.
> 
> This patch set aims at solving the race. Instead of merely deferring an
> interrupt when we see such a situation, we go into a special instruction
> interpretation mode. In this mode, we interpret all PPC assembler instructions
> that happen until we are out of the critical section again, at which point
> we can now inject the interrupt.
> 
> This bug only affects KVM implementations that make use of the magic page, so
> e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

-Scott




Re: [PATCH v2 9/9] arm64: KVM: vgic: deal with GIC sub-page alignment

2014-06-24 Thread Joel Schopp


On 06/19/2014 04:21 AM, Marc Zyngier wrote:

The GIC CPU interface is always 4k aligned. If the host is using
64k pages, it is critical to place the guest's GICC interface at the
same relative alignment as the host's GICV. Failing to do so makes it
impossible for the guest to handle interrupts.

Add a KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET attribute for the VGIC, allowing
userspace to retrieve the GICV offset in a page. It becomes then trivial
to adjust the GICC base address for the guest.


Does this mean there is a corresponding patch for qemu?



Signed-off-by: Marc Zyngier 
---
  arch/arm/include/uapi/asm/kvm.h   | 1 +
  arch/arm64/include/uapi/asm/kvm.h | 1 +
  virt/kvm/arm/vgic.c   | 7 +++
  3 files changed, 9 insertions(+)

diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h
index 8b51c1a..056b782 100644
--- a/arch/arm/include/uapi/asm/kvm.h
+++ b/arch/arm/include/uapi/asm/kvm.h
@@ -174,6 +174,7 @@ struct kvm_arch_memory_slot {
  #define   KVM_DEV_ARM_VGIC_OFFSET_SHIFT   0
  #define   KVM_DEV_ARM_VGIC_OFFSET_MASK(0xULL << 
KVM_DEV_ARM_VGIC_OFFSET_SHIFT)
  #define KVM_DEV_ARM_VGIC_GRP_NR_IRQS  3
+#define KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET 4
  
  /* KVM_IRQ_LINE irq field index values */

  #define KVM_ARM_IRQ_TYPE_SHIFT24
diff --git a/arch/arm64/include/uapi/asm/kvm.h 
b/arch/arm64/include/uapi/asm/kvm.h
index b5cd6ed..5513de4 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -160,6 +160,7 @@ struct kvm_arch_memory_slot {
  #define   KVM_DEV_ARM_VGIC_OFFSET_SHIFT   0
  #define   KVM_DEV_ARM_VGIC_OFFSET_MASK(0xULL << 
KVM_DEV_ARM_VGIC_OFFSET_SHIFT)
  #define KVM_DEV_ARM_VGIC_GRP_NR_IRQS  3
+#define KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET 4
  
  /* KVM_IRQ_LINE irq field index values */

  #define KVM_ARM_IRQ_TYPE_SHIFT24
diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
index b0cd417..68ac9c6 100644
--- a/virt/kvm/arm/vgic.c
+++ b/virt/kvm/arm/vgic.c
@@ -2228,6 +2228,12 @@ static int vgic_get_attr(struct kvm_device *dev, struct 
kvm_device_attr *attr)
r = put_user(dev->kvm->arch.vgic.nr_irqs, uaddr);
break;
}
+   case KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET: {
+   u32 __user *uaddr = (u32 __user *)(long)attr->addr;
+   u32 val = vgic->vcpu_base & ~PAGE_MASK;
+   r = put_user(val, uaddr);
+   break;
+   }
  
  	}
  
@@ -2265,6 +2271,7 @@ static int vgic_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr)

offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK;
return vgic_has_attr_regs(vgic_cpu_ranges, offset);
case KVM_DEV_ARM_VGIC_GRP_NR_IRQS:
+   case KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET:
return 0;
}
return -ENXIO;




Re: How to know that vPMU is enabled or disabled?

2014-06-24 Thread Paolo Bonzini

On 24/06/2014 10:33, Jidong Xiao wrote:

I think I have figured this out. According to this patch (as well as the
Intel SDM), the PMU is exposed via CPUID leaf 0xAH.

https://github.com/torvalds/linux/commit/a6c06ed1a60aff77b27ba558c315c3fed4e35565

Therefore, in the guest OS, one can run the CPUID instruction with leaf 0xA;
if the vPMU is supported and enabled, the values returned in eax/ebx/ecx/edx
will be non-zero. On the cloud machine I am using, these registers are all
zero, so I think the vPMU is not supported in my virtual machine. It is
probably masked by QEMU.


This is correct.  QEMU only enables vPMU if you use "-cpu host".

Paolo


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Benjamin Herrenschmidt
On Tue, 2014-06-24 at 12:41 +0200, Alexander Graf wrote:
> Is there actually any difference in generated code with this patch 
> applied and without? I would hope that iowrite..() is inlined and 
> cancels out the cpu_to_le..() calls that are also inlined?

No, the former uses byteswapping asm, the compiler can't "cancel" it
out, but the overhead of the additional byteswap might not be
measurable.

Cheers,
Ben.




Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Benjamin Herrenschmidt
On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
> 
> I do not understand why @val is considered LE here and need to be
> converted
> to CPU. Really. I truly believe it should be cpu_to_le32().

No. Both are slightly wrong semantically but le32_to_cpu() is less
wrong :-)

iowrite32 supposedly takes a "cpu" value as argument and writes an "le"
value. So if anything, you need something that converts to a "cpu" value
before you call iowrite32.

Now it's still slightly wrong because the "input" to le32_to_cpu() is
supposed to be an "LE" value but of course here it's not, it's a "cpu"
value too :-)

But yes, I agree with aw here, either do nothing or stick a set of
iowriteXX_native or something equivalent in the generic iomap header,
define them in term of iowrite32be/iowrite32 based on the compile time
endian of the arch.

Hitting asm-generic/iomap.h I think will cover all archs except ARM.
For ARM, just hack arch/arm/asm/io.h

Cheers,
Ben.




Re: [PATCH v2 9/9] arm64: KVM: vgic: deal with GIC sub-page alignment

2014-06-24 Thread Peter Maydell
On 24 June 2014 20:28, Joel Schopp  wrote:
>
> On 06/19/2014 04:21 AM, Marc Zyngier wrote:
>>
>> The GIC CPU interface is always 4k aligned. If the host is using
>> 64k pages, it is critical to place the guest's GICC interface at the
>> same relative alignment as the host's GICV. Failing to do so makes it
>> impossible for the guest to handle interrupts.
>>
>> Add a KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET attribute for the VGIC, allowing
>> userspace to retrieve the GICV offset in a page. It becomes then trivial
>> to adjust the GICC base address for the guest.
>
>
> Does this mean there is a corresponding patch for qemu?

Not as far as I know. It's a bit awkward on the QEMU end because
we really want to provide the guest a consistent memory map
regardless of the host CPU. So at best we'd probably use it to
say "sorry, can't run on this CPU/host kernel".

(That said, if you think you can make QEMU usefully use the
information and want to write a QEMU patch I'm not averse
to the idea.)

kvmtool is probably better placed to take advantage of it since
it takes more of a "deal with what the host provides you"
philosophy.

thanks
-- PMM


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called "critical sections" which allows us to atomically
get in and out of "interrupt disabled" modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?


There are a few other alternatives to this implementation:

  1) Unmap the magic page, emulate all memory access to it while in critical and irq pending
  2) Trigger a timer that sends a request to the vcpu to wake it from potential sleep and inject the irq
  3) Single step until we're beyond the critical section
  4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds 
complexity to the MMU mapping code, since we need to make sure we don't 
map it back in on demand and treat faults to it specially.


The timer interrupt works, but I'm not fully convinced that it's a good 
idea for things like MC events which we also block during critical 
sections on e500v2.


Single stepping is hard enough to get right on interaction between QEMU, 
KVM and the guest. I didn't really want to make that stuff any more 
complicated.


This approach is really just one out of many - and it's one that's 
nicely self-contained and shouldn't have any impact at all on 
implementations that don't care about it ;).



Alex



Re: [PATCH v3 1/9] MIPS: KVM: Reformat code and comments

2014-06-24 Thread James Hogan
Hi Deng-Cheng,

On Tuesday 24 June 2014 10:31:02 Deng-Cheng Zhu wrote:
> diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c
> index cd5e4f5..821e7e8 100644
> --- a/arch/mips/kvm/kvm_mips.c
> +++ b/arch/mips/kvm/kvm_mips.c

> -#define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU
> +#define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x)
>  struct kvm_stats_debugfs_item debugfs_entries[] = {
> - { "wait", VCPU_STAT(wait_exits) },
> - { "cache", VCPU_STAT(cache_exits) },
> - { "signal", VCPU_STAT(signal_exits) },
> - { "interrupt", VCPU_STAT(int_exits) },
> - { "cop_unsuable", VCPU_STAT(cop_unusable_exits) },
> - { "tlbmod", VCPU_STAT(tlbmod_exits) },
> - { "tlbmiss_ld", VCPU_STAT(tlbmiss_ld_exits) },
> - { "tlbmiss_st", VCPU_STAT(tlbmiss_st_exits) },
> - { "addrerr_st", VCPU_STAT(addrerr_st_exits) },
> - { "addrerr_ld", VCPU_STAT(addrerr_ld_exits) },
> - { "syscall", VCPU_STAT(syscall_exits) },
> - { "resvd_inst", VCPU_STAT(resvd_inst_exits) },
> - { "break_inst", VCPU_STAT(break_inst_exits) },
> - { "flush_dcache", VCPU_STAT(flush_dcache_exits) },
> - { "halt_wakeup", VCPU_STAT(halt_wakeup) },
> + { "wait", VCPU_STAT(wait_exits), KVM_STAT_VCPU },
> + { "cache", VCPU_STAT(cache_exits), KVM_STAT_VCPU },
> + { "signal", VCPU_STAT(signal_exits), KVM_STAT_VCPU },
> + { "interrupt", VCPU_STAT(int_exits), KVM_STAT_VCPU },
> + { "cop_unsuable", VCPU_STAT(cop_unusable_exits), KVM_STAT_VCPU },
> + { "tlbmod", VCPU_STAT(tlbmod_exits), KVM_STAT_VCPU },
> + { "tlbmiss_ld", VCPU_STAT(tlbmiss_ld_exits), KVM_STAT_VCPU },
> + { "tlbmiss_st", VCPU_STAT(tlbmiss_st_exits), KVM_STAT_VCPU },
> + { "addrerr_st", VCPU_STAT(addrerr_st_exits), KVM_STAT_VCPU },
> + { "addrerr_ld", VCPU_STAT(addrerr_ld_exits), KVM_STAT_VCPU },
> + { "syscall", VCPU_STAT(syscall_exits), KVM_STAT_VCPU },
> + { "resvd_inst", VCPU_STAT(resvd_inst_exits), KVM_STAT_VCPU },
> + { "break_inst", VCPU_STAT(break_inst_exits), KVM_STAT_VCPU },
> + { "flush_dcache", VCPU_STAT(flush_dcache_exits), KVM_STAT_VCPU },
> + { "halt_wakeup", VCPU_STAT(halt_wakeup), KVM_STAT_VCPU },

IMO more important than making checkpatch happy here would be to put it in 
nicely tabulated columns ;-)
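
For illustration, a tabulated layout for the quoted array might look like the following. The types are minimal stand-ins so the fragment compiles on its own, and only a few of the quoted entries are shown:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Minimal stand-ins for the kernel types, just for this sketch. */
struct kvm_vcpu_stat {
	unsigned long wait_exits;
	unsigned long cache_exits;
	unsigned long signal_exits;
	unsigned long halt_wakeup;
};

struct kvm_vcpu {
	struct kvm_vcpu_stat stat;
};

enum { KVM_STAT_VCPU };

struct kvm_stats_debugfs_item {
	const char *name;
	size_t offset;
	int kind;
};

#define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x)

/* Tabulated columns, as suggested in the review. */
static struct kvm_stats_debugfs_item debugfs_entries[] = {
	{ "wait",        VCPU_STAT(wait_exits),   KVM_STAT_VCPU },
	{ "cache",       VCPU_STAT(cache_exits),  KVM_STAT_VCPU },
	{ "signal",      VCPU_STAT(signal_exits), KVM_STAT_VCPU },
	{ "halt_wakeup", VCPU_STAT(halt_wakeup),  KVM_STAT_VCPU },
};
```

Aligning the three columns makes it easy to scan the stat names and spot a missing `KVM_STAT_VCPU` at a glance.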

> diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/kvm_mips_emul.c
> index 8d48400..993dd1c 100644
> --- a/arch/mips/kvm/kvm_mips_emul.c
> +++ b/arch/mips/kvm/kvm_mips_emul.c

>   switch (insn.i_format.opcode) {
> - /*
> -  * jr and jalr are in r_format format.
> -  */
> +  /* jr and jalr are in r_format format. */

bad indentation.

> diff --git a/arch/mips/kvm/kvm_trap_emul.c b/arch/mips/kvm/kvm_trap_emul.c
> index 693f952..baf6577 100644
> --- a/arch/mips/kvm/kvm_trap_emul.c
> +++ b/arch/mips/kvm/kvm_trap_emul.c

> @@ -186,10 +185,12 @@ static int kvm_trap_emul_handle_tlb_ld_miss(struct
> kvm_vcpu *vcpu) vcpu->arch.pc, badvaddr);
> 
>   /* User Address (UA) fault, this could happen if

this comment could be fixed too

Otherwise this patch looks good. Thanks for doing this!

Cheers
James


Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
> On 24.06.14 20:53, Scott Wood wrote:
> > On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
> >> Howdy,
> >>
> >> Ben reminded me a while back that we have a nasty race in our KVM PV code.
> >>
> >> We replace a few instructions with longer streams of instructions to check
> >> whether it's necessary to trap out from it (like mtmsr, no need to trap if
> >> we only disable interrupts). During those replacement chunks we must not 
> >> get
> >> any interrupts, because they might overwrite scratch space that we already
> >> used to save otherwise clobbered register state into.
> >>
> >> So we have a thing called "critical sections" which allows us to atomically
> >> get in and out of "interrupt disabled" modes without touching MSR. When we
> >> are supposed to deliver an interrupt into the guest while we are in a 
> >> critical
> >> section, we just don't inject the interrupt yet, but leave it be until the
> >> next trap.
> >>
> >> However, we never really know when the next trap would be. For all we know 
> >> it
> >> could be never. At this point we created a race that is a potential source
> >> for interrupt loss or at least deferral.
> >>
> >> This patch set aims at solving the race. Instead of merely deferring an
> >> interrupt when we see such a situation, we go into a special instruction
> >> interpretation mode. In this mode, we interpret all PPC assembler 
> >> instructions
> >> that happen until we are out of the critical section again, at which point
> >> we can now inject the interrupt.
> >>
> >> This bug only affects KVM implementations that make use of the magic page, 
> >> so
> >> e500v2, book3s_32 and book3s_64 PR KVM.
> > Would it be possible to single step through the critical section
> > instead?  Or set a high res timer to expire very quickly?
> 
> There are a few other alternatives to this implementation:
> 
>1) Unmap the magic page, emulate all memory access to it while in 
> critical and irq pending
>2) Trigger a timer that sends a request to the vcpu to wake it from 
> potential sleep and inject the irq
>3) Single step until we're beyond the critical section
>4) Probably more that I can't think of right now :)
> 
> Each has their good and bad sides. Unmapping the magic page adds 
> complexity to the MMU mapping code, since we need to make sure we don't 
> map it back in on demand and treat faults to it specially.
> 
> The timer interrupt works, but I'm not fully convinced that it's a good 
> idea for things like MC events which we also block during critical 
> sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

Machine check interrupts are typically caused by a hardware or
memory subsystem failure or by an attempt to access an invalid
address. They may be caused indirectly by execution of an
instruction, but may not be recognized or reported until long
after the processor has executed past the instruction that
caused the machine check. As such, machine check interrupts are
not thought of as synchronous or asynchronous nor as precise or
imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.

> Single stepping is hard enough to get right on interaction between QEMU, 
> KVM and the guest. I didn't really want to make that stuff any more 
> complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page code wants DBCR0_IC on,
to determine whether to exit to userspace or not.

> This approach is really just one out of many - and it's one that's 
> nicely self-contained and shouldn't have any impact at all on 
> implementations that don't care about it ;).

"Nicely self-contained" is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.

-Scott




Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Alexander Graf


On 25.06.14 01:15, Scott Wood wrote:

On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:

On 24.06.14 20:53, Scott Wood wrote:

On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:

Howdy,

Ben reminded me a while back that we have a nasty race in our KVM PV code.

We replace a few instructions with longer streams of instructions to check
whether it's necessary to trap out from it (like mtmsr, no need to trap if
we only disable interrupts). During those replacement chunks we must not get
any interrupts, because they might overwrite scratch space that we already
used to save otherwise clobbered register state into.

So we have a thing called "critical sections" which allows us to atomically
get in and out of "interrupt disabled" modes without touching MSR. When we
are supposed to deliver an interrupt into the guest while we are in a critical
section, we just don't inject the interrupt yet, but leave it be until the
next trap.

However, we never really know when the next trap would be. For all we know it
could be never. At this point we created a race that is a potential source
for interrupt loss or at least deferral.

This patch set aims at solving the race. Instead of merely deferring an
interrupt when we see such a situation, we go into a special instruction
interpretation mode. In this mode, we interpret all PPC assembler instructions
that happen until we are out of the critical section again, at which point
we can now inject the interrupt.

This bug only affects KVM implementations that make use of the magic page, so
e500v2, book3s_32 and book3s_64 PR KVM.

Would it be possible to single step through the critical section
instead?  Or set a high res timer to expire very quickly?

There are a few other alternatives to this implementation:

1) Unmap the magic page, emulate all memory access to it while in
critical and irq pending
2) Trigger a timer that sends a request to the vcpu to wake it from
potential sleep and inject the irq
3) Single step until we're beyond the critical section
4) Probably more that I can't think of right now :)

Each has their good and bad sides. Unmapping the magic page adds
complexity to the MMU mapping code, since we need to make sure we don't
map it back in on demand and treat faults to it specially.

The timer interrupt works, but I'm not fully convinced that it's a good
idea for things like MC events which we also block during critical
sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition?  e500v2 machine checks are always
asynchronous.  From the core manual:

 Machine check interrupts are typically caused by a hardware or
 memory subsystem failure or by an attempt to access an invalid
 address. They may be caused indirectly by execution of an
 instruction, but may not be recognized or reported until long
 after the processor has executed past the instruction that
 caused the machine check. As such, machine check interrupts are
 not thought of as synchronous or asynchronous nor as precise or
 imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.


So what value would you set the timer to? If the value is too small, we 
never finish the critical section. If it's too big, we add lots of jitter.

Single stepping is hard enough to get right on interaction between QEMU,
KVM and the guest. I didn't really want to make that stuff any more
complicated.

I'm not sure that it would add much complexity.  We'd just need to check
whether any source other than the magic page code wants DBCR0_IC on,
to determine whether to exit to userspace or not.


What if the guest is single stepping itself? How do we determine when to 
unset the bit again? When we get out of the critical section? How do we 
know what the value was before we set it?

This approach is really just one out of many - and it's one that's
nicely self-contained and shouldn't have any impact at all on
implementations that don't care about it ;).

"Nicely self-contained" is not a phrase I'd associate with 33 patches,
including a bunch of new emulation that probably isn't getting great
test coverage.


It means that there's only a single entry point for when the code gets 
executed, not that it's very little code.


Eventually this emulation code should get merged with the already 
existing in-kernel emulation code. Paul had already started work to 
merge the emulators a while ago. He even measured speedups when he sent 
all real mode and split real mode code via the interpreter rather than 
the entry/exit dance we do today.



Alex



Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code

2014-06-24 Thread Scott Wood
On Wed, 2014-06-25 at 01:40 +0200, Alexander Graf wrote:
> On 25.06.14 01:15, Scott Wood wrote:
> > On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
> >> On 24.06.14 20:53, Scott Wood wrote:
> >>> The timer interrupt works, but I'm not fully convinced that it's a good
> >> idea for things like MC events which we also block during critical
> >> sections on e500v2.
> > Are you concerned about the guest seeing machine checks that are (more)
> > asynchronous with the error condition?  e500v2 machine checks are always
> > asynchronous.  From the core manual:
> >
> >  Machine check interrupts are typically caused by a hardware or
> >  memory subsystem failure or by an attempt to access an invalid
> >  address. They may be caused indirectly by execution of an
> >  instruction, but may not be recognized or reported until long
> >  after the processor has executed past the instruction that
> >  caused the machine check. As such, machine check interrupts are
> >  not thought of as synchronous or asynchronous nor as precise or
> >  imprecise.
> >
> > I don't think the lag would be a problem, and certainly it's better than
> > the current situation.
> 
> So what value would you set the timer to? If the value is too small, we 
> never finish the critical section. If it's too big, we add lots of jitter.

Maybe something like 100us?

Single stepping would be better, though.

> >> Single stepping is hard enough to get right on interaction between QEMU,
> >> KVM and the guest. I didn't really want to make that stuff any more
> >> complicated.
> > I'm not sure that it would add much complexity.  We'd just need to check
> > whether any source other than the magic page code wants DBCR0_IC on,
> > to determine whether to exit to userspace or not.
> 
> What if the guest is single stepping itself? How do we determine when to 
> unset the bit again? When we get out of the critical section? How do we 
> know what the value was before we set it?

Keep track of each requester of single stepping separately, and only
ever set the real bit by ORing them.
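
The bookkeeping Scott describes could be sketched like this: track each requester's single-step interest separately and derive the real DBCR0_IC setting by ORing them. The bit value, requester names, and helpers are all illustrative, not the real e500 register layout or KVM code:

```c
#include <assert.h>

#define DBCR0_IC 0x1u	/* illustrative bit value, not the real layout */

/* Hypothetical requesters of single stepping. */
enum ss_requester {
	SS_REQ_GUEST_DEBUG,	/* guest is single stepping itself */
	SS_REQ_MAGIC_PAGE,	/* stepping through a critical section */
	SS_REQ_MAX,
};

static unsigned int ss_requests;	/* one bit per requester */

static void request_single_step(enum ss_requester who)
{
	ss_requests |= 1u << who;
}

static void clear_single_step(enum ss_requester who)
{
	ss_requests &= ~(1u << who);
}

/* The "real" bit is the OR of every requester's interest. */
static unsigned int effective_dbcr0(void)
{
	return ss_requests ? DBCR0_IC : 0;
}

/* Exit to userspace only if someone other than the magic page wants it. */
static int should_exit_to_userspace(void)
{
	return (ss_requests & ~(1u << SS_REQ_MAGIC_PAGE)) != 0;
}
```

This answers the "how do we know what the value was before we set it" question: no requester ever reads or writes the hardware bit directly, so clearing one request cannot clobber another's.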

-Scott




[Bug 42980] BUG in gfn_to_pfn_prot

2014-06-24 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42980

xerofo...@gmail.com changed:

           What    |Removed |Added
----------------------------------------
                 CC|        |xerofo...@gmail.com

--- Comment #22 from xerofo...@gmail.com ---
Please retest against a newer kernel. This bug seems obsolete to me as of
the kernel versions released in the 2014 time frame.
Cheers Nick

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


Re: [PATCH] vfio: Fix endianness handling for emulated BARs

2014-06-24 Thread Alexey Kardashevskiy
On 06/25/2014 07:54 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
>>
>> I do not understand why @val is considered LE here and need to be
>> converted
>> to CPU. Really. I truly believe it should be cpu_to_le32().
> 
> No. Both are slightly wrong semantically but le32_to_cpu() is less
> wrong :-)
>
> iowrite32 supposedly takes a "cpu" value as argument and writes an "le"
> value. So if anything, you need something that converts to a "cpu" value
> before you call iowrite32.
> 
> Now it's still slightly wrong because the "input" to le32_to_cpu() is
> supposed to be an "LE" value but of course here it's not, it's a "cpu"
> value too :-)
> 
> But yes, I agree with aw here, either do nothing or stick a set of
> iowriteXX_native or something equivalent in the generic iomap header,
> define them in term of iowrite32be/iowrite32 based on the compile time
> endian of the arch.


Ok. I'll do nothing.


> Hitting asm-generic/iomap.h I think will cover all archs except ARM.
> For ARM, just hack arch/arm/asm/io.h
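
For reference, the `iowriteXX_native` idea from the quoted discussion could be sketched as a userspace model like the one below. The real helpers would live in asm-generic/iomap.h and simply alias `iowrite32`/`iowrite32be` based on compile-time endianness; everything here (`model_iowrite32`, `swap32`, etc.) is a stand-in for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Byte-swap helper, standing in for swab32(). */
static uint32_t swap32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0xff00u) |
	       ((v << 8) & 0xff0000u) | (v << 24);
}

static int host_is_big_endian(void)
{
	const union { uint32_t u; uint8_t b[4]; } probe = { 0x01020304 };
	return probe.b[0] == 0x01;
}

/* Models iowrite32(): stores the CPU value in little-endian layout. */
static void model_iowrite32(uint32_t val, uint32_t *addr)
{
	*addr = host_is_big_endian() ? swap32(val) : val;
}

/* Models iowrite32be(): stores the CPU value in big-endian layout. */
static void model_iowrite32be(uint32_t val, uint32_t *addr)
{
	*addr = host_is_big_endian() ? val : swap32(val);
}

/*
 * The proposed iowrite32_native: keep the CPU byte order, i.e. pick
 * the LE or BE variant according to the endianness of the arch, so
 * the stored bytes always match the host representation.
 */
static void model_iowrite32_native(uint32_t val, uint32_t *addr)
{
	if (host_is_big_endian())
		model_iowrite32be(val, addr);
	else
		model_iowrite32(val, addr);
}
```

The native variant never swaps, which is exactly the "do nothing" behavior wanted for emulated BARs that are already in guest byte order.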


-- 
Alexey


Re: How to know that vPMU is enabled or disabled?

2014-06-24 Thread Jidong Xiao
On Tue, Jun 24, 2014 at 4:38 PM, Paolo Bonzini  wrote:
> On 24/06/2014 10:33, Jidong Xiao wrote:
>
>> I think I have figured out this. According to this patch (as well as
>> the Intel SDM manual), it looks like pmu is exposed via the cpuid leaf
>> 0xah.
>>
>>
>> https://github.com/torvalds/linux/commit/a6c06ed1a60aff77b27ba558c315c3fed4e35565
>>
>> Therefore, in the guest os, one can run the cpuid instruction with
>> leaf 0xa, and if vpmu is supported/enabled, the return value in
>> eax/ebx/ecx/edx should be something non-zero, and in the cloud machine
>> which I am using I see these registers are all zero, therefore I think
>> vpmu is not supported in my virtual machine. Probably it is masked by
>> Qemu.
>
>
> This is correct.  QEMU only enables vPMU if you use "-cpu host".
>
> Paolo

Thanks Paolo.

-Jidong
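
As a side note, the CPUID leaf 0xA check discussed above can be decoded like this. The EAX bit layout (version ID in bits 7:0, number of general-purpose counters in 15:8, counter width in 23:16) follows the Intel SDM; the raw register value in the usage below is a made-up example, and on a guest without vPMU all four registers read as zero:

```c
#include <assert.h>
#include <stdint.h>

struct pmu_caps {
	unsigned int version;	/* 0 => no architectural PMU exposed */
	unsigned int num_gp_counters;
	unsigned int counter_width;
};

/*
 * Decode CPUID leaf 0xA, EAX register:
 *   bits  7:0  - architectural PMU version ID
 *   bits 15:8  - number of GP performance counters per logical CPU
 *   bits 23:16 - bit width of the GP counters
 */
static struct pmu_caps decode_cpuid_0a_eax(uint32_t eax)
{
	struct pmu_caps caps = {
		.version         = eax & 0xff,
		.num_gp_counters = (eax >> 8) & 0xff,
		.counter_width   = (eax >> 16) & 0xff,
	};
	return caps;
}

/* A zero version ID means the leaf is masked, e.g. by QEMU without -cpu host. */
static int vpmu_available(uint32_t eax)
{
	return decode_cpuid_0a_eax(eax).version != 0;
}
```

In a real guest you would feed this the EAX value returned by executing `cpuid` with EAX=0xA, e.g. via GCC's `__get_cpuid()` on x86.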