Re: [PATCH] powerpc/kvm: Fix build break with CONFIG_KVM_BOOK3S_64_HV=y
On Tue, 2014-06-24 at 15:51 +0900, Joonsoo Kim wrote:
> On Tue, Jun 24, 2014 at 04:36:47PM +1000, Michael Ellerman wrote:
> > Commit e58e263 "PPC, KVM, CMA: use general CMA reserved area management
> > framework" in next-20140624 removed arch/powerpc/kvm/book3s_hv_cma.c but
> > neglected to update the Makefile, thus breaking the build.
> >
> > Signed-off-by: Michael Ellerman
> > ---
> >
> > Hi Andrew,
> >
> > This is in your akpm-current and is breaking some of the builds for
> > powerpc in linux-next. Squashing this fix into the original patch would
> > be best for us.
>
> I sent really same patch 10 minutes ago. :)

Haha, oh well, I should have had a coffee instead :)

cheers

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: x86: fix TSC matching
I've observed kvmclock being marked as unstable on a modern single-socket
system with a stable TSC and qemu-1.6.2 or qemu-2.0.0. The culprit was a
failure in TSC matching caused by an overflow of
kvm_arch::nr_vcpus_matched_tsc when there were multiple TSC writes in a
single synchronization cycle. It turns out that qemu does multiple TSC
writes during init; below is the evidence of that (qemu-2.0.0):

The first one:

 0xa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm]
 0xa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm]

The second one:

 0xa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]
 #0 kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1 kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2 kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3 kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641
 #4 cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330
 #5 cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521
 #6 main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390

The third one:

 0xa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]
 #0 kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1 kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2 kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3 kvm_cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635
 #4 cpu_synchronize_post_reset at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323
 #5 cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512
 #6 main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482

The fix is to count each vCPU only once when matched, so that
nr_vcpus_matched_tsc holds the size of the matched set. This is achieved
by reusing the generation counters: every vCPU with this_tsc_generation ==
cur_tsc_generation is in the matched set. The matched set is cleared by
setting cur_tsc_generation to a value which no other vCPU holds, i.e. by
incrementing it.

I needed to bump the counter size from u8 to u64 to ensure it never
overflows. Otherwise, in cases where the TSC is not written the same
number of times on each vCPU, the counter could overflow and incorrectly
indicate some vCPUs as being in the matched set. This scenario seems
unlikely, but I'm not sure it can be disregarded.
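The wrap-around scenario described above can be sketched in a few lines of illustrative C (hypothetical helper names, not kernel code): with a u8 generation counter, 256 generation bumps wrap cur_tsc_generation back around, so a vCPU whose this_tsc_generation went stale 256 generations earlier would wrongly appear to be in the matched set, while a u64 counter never wraps in practice.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not kernel code: model a vCPU whose generation
 * went stale, followed by `bumps` increments of cur_tsc_generation
 * (one per unmatched TSC write). Returns 1 if the stale vCPU would be
 * (wrongly) counted as matched afterwards. */
static int stale_vcpu_matches_u8(uint8_t vcpu_gen, unsigned int bumps)
{
    uint8_t cur = vcpu_gen;

    while (bumps--)
        cur++;                 /* u8 wraps after 256 bumps */
    return cur == vcpu_gen;
}

static int stale_vcpu_matches_u64(uint64_t vcpu_gen, unsigned int bumps)
{
    uint64_t cur = vcpu_gen;

    while (bumps--)
        cur++;                 /* u64 never wraps in practice */
    return cur == vcpu_gen;
}
```

With the u8 counter, exactly 256 bumps make the stale vCPU look matched again; the u64 variant does not exhibit this.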
Signed-off-by: Tomasz Grabiec
---
 arch/x86/include/asm/kvm_host.h |  4 ++--
 arch/x86/kvm/x86.c              | 11 +++
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63e020b..af36f89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -448,7 +448,7 @@ struct kvm_vcpu_arch {
 	u64 tsc_offset_adjustment;
 	u64 this_tsc_nsec;
 	u64 this_tsc_write;
-	u8 this_tsc_generation;
+	u64 this_tsc_generation;
 	bool tsc_catchup;
 	bool tsc_always_catchup;
 	s8 virtual_tsc_shift;
@@ -591,7 +591,7 @@ struct kvm_arch {
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
-	u8 cur_tsc_generation;
+	u64 cur_tsc_generation;
 	int nr_vcpus_matched_tsc;

 	spinlock_t pvclock_gtod_sync_lock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 874607a..d1330f3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1215,6 +1215,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	unsigned long flags;
 	s64 usdiff;
 	bool matched;
+	bool already_matched;
 	u64 data = msr->data;

 	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
@@ -1279,6 +1280,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
 			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
 		}
 		matched = true;
+		already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
 	} else {
 		/*
 		 * We split periods of matched TSC writes into generations.
Re: [PATCH v3 -next 0/9] CMA: generalize CMA reserved area management code
On Wed, Jun 18, 2014 at 01:51:44PM -0700, Andrew Morton wrote:
> On Tue, 17 Jun 2014 10:25:07 +0900 Joonsoo Kim wrote:
> > > > v2:
> > > > - Although this patchset looks very different with v1, the end result,
> > > >   that is, mm/cma.c is same with v1's one. So I carry Ack to patch 6-7.
> > > >
> > > > This patchset is based on linux-next 20140610.
> > >
> > > Thanks for taking care of this. I will test it with my setup and if
> > > everything goes well, I will take it to my -next tree. If any branch
> > > is required for anyone to continue his works on top of those patches,
> > > let me know, I will also prepare it.
> >
> > Hello,
> >
> > I'm glad to hear that. :)
> > But, there is one concern. As you already know, I am preparing further
> > patches (Aggressively allocate the pages on CMA reserved memory). It
> > may be highly related to the MM branch and also slightly depends on
> > these CMA changes. In this case, what is the best strategy to merge this
> > patchset? IMHO, Andrew's tree is the more appropriate branch. If there
> > is no issue in this case, I am willing to develop further patches based
> > on your tree.
>
> That's probably easier. Marek, I'll merge these into -mm (and hence
> -next and git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git)
> and shall hold them pending your review/ack/test/etc, OK?

Hello, Marek.

Could you share your decision about this patchset?

Thanks.
Re: How to know that vPMU is enabled or disabled?
On Mon, Jun 23, 2014 at 1:41 PM, Jidong Xiao wrote:
> Hi, All,
>
> I am using a virtual machine in a cloud environment, which means I am
> in control of the Guest OS, but have no access to the Host OS. Is
> there a simple way to know whether or not the vPMU is enabled or
> disabled? Or, is there something I can control so as to turn its state
> from enable to disable, or vice versa? Thanks.

I think I have figured this out. According to this patch (as well as the
Intel SDM), it looks like the pmu is exposed via cpuid leaf 0xa:

https://github.com/torvalds/linux/commit/a6c06ed1a60aff77b27ba558c315c3fed4e35565

Therefore, in the guest os, one can run the cpuid instruction with leaf
0xa, and if the vpmu is supported/enabled, the return values in
eax/ebx/ecx/edx should be non-zero. On the cloud machine which I am using,
I see these registers are all zero, therefore I think the vpmu is not
supported in my virtual machine. Probably it is masked by Qemu.

-Jidong
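For anyone wanting to repeat the check, a minimal sketch of the CPUID probe could look like this. It assumes GCC/Clang's <cpuid.h>; the has_vpmu helper name is made up, and per the Intel SDM the PMU version ID lives in bits 7:0 of EAX for leaf 0xA (zero meaning no architectural PMU):

```c
#include <assert.h>

/* Hedged sketch: query CPUID leaf 0xA from inside a guest to check for
 * an architectural PMU, as described above. Non-x86 builds fall back to 0. */
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>

static int has_vpmu(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx))
        return 0;               /* CPUID leaf 0xA not available */
    return (eax & 0xff) != 0;   /* bits 7:0 of EAX = PMU version ID */
}
#else
static int has_vpmu(void)
{
    return 0;                   /* not an x86 guest */
}
#endif
```

On a guest where qemu masks the leaf (all registers zero, as observed above), this returns 0.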
Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?
On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote:
>
> On 17.06.14 13:39, Eric Auger wrote:
> > Hello,
> >
> > I have a question related to the KVM_IRQFD and KVM_SET_GSI_ROUTING
> > ioctl relationship.
> >
> > When reading the KVM API documentation I do not understand there is any
> > dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the
> > text it seems only the gsi field is used and interpreted as the irqchip pin.
> >
> > However irqchip.c kvm_set_irq code relies on an existing and not dummy
> > routing table.
> >
> > My question is: does anyone agree on the fact the user-side must set a
> > consistent routing table using KVM_SET_GSI_ROUTING before using
> > KVM_IRQFD? The other alternative would have been to build a default
> > identity GSI routing table in the kernel (gsi = irqchip.pin).
>
> I untangled irqfd support from the x86 ioapic emulation a while back. When
> I looked at it, I didn't see any easy way to split it out from the routing
> too, so I kept that dependency in.
>
> If you look at the code, you will see that the irq routing entry is used
> as a token for an irqfd. So every irqfd only knows its routing table
> entry, nothing else.

As far as I can see, the routing table entry is used for only two things:
to tell whether the interrupt is an MSI or LSI, and to pass to
kvm_set_msi.

One way to tackle the problem would be to abstract out the routing table
lookup into a function in irqchip.c, rather than having it open-coded in
irqfd_update. Then, if you don't want to have the routing table, you write
an alternative function that creates a routing entry from scratch. It
would need information from the low-level irq chip code as to which
interrupts are LSIs and which are MSIs. It also ties you down to having
only one kind of interrupt controller.
The other way I see to solve my particular problem, which is that
userspace tends to use hardware interrupt numbers starting at 4096, is to
replace the arrays in struct kvm_irq_routing_table with idr structures or
something similar.

> Splitting that dependency out is certainly quite tedious work. However,
> I'm sure the IBM folks will be grateful if you do it :).

Indeed. :)

Paul.
[Bug 67751] Stack trace with suspicious RCU usage, when starting ovs-switchd
https://bugzilla.kernel.org/show_bug.cgi?id=67751

--- Comment #5 from Kashyap Chamarthy ---

I just saw this again when I attempted to restart OpenvSwitch with
kernel 3.16.0-0.rc1.git1.1.fc21.x86_64:

[ 9695.079715]
[ 9695.080492] ===
[ 9695.082559] [ INFO: suspicious RCU usage. ]
[ 9695.085105] 3.16.0-0.rc1.git1.1.fc21.x86_64 #1 Not tainted
[ 9695.088505] ---
[ 9695.090413] net/openvswitch/flow_table.c:255 suspicious rcu_dereference_protected() usage!
[ 9695.094887]
[ 9695.094887] other info that might help us debug this:
[ 9695.094887]
[ 9695.100722]
[ 9695.100722] rcu_scheduler_active = 1, debug_locks = 0
[ 9695.105699] 1 lock held by ovs-vswitchd/1910:
[ 9695.108199]  #0: (cb_lock){++}, at: [] genl_rcv+0x19/0x40
[ 9695.113083]
[ 9695.113083] stack backtrace:
[ 9695.115395] CPU: 0 PID: 1910 Comm: ovs-vswitchd Not tainted 3.16.0-0.rc1.git1.1.fc21.x86_64 #1
[ 9695.121638] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 9695.124076]  2c60696c 8800b64ff968 81807520
[ 9695.127732]  88029ce24d70 8800b64ff998 810fd7f7
[ 9695.131321]  88029ccc6d20 8800b64ffa90 88029ccc6d00 8800b64ff9b8
[ 9695.134931] Call Trace:
[ 9695.136158]  [] dump_stack+0x4d/0x66
[ 9695.139031]  [] lockdep_rcu_suspicious+0xe7/0x120
[ 9695.141850]  [] ovs_flow_tbl_destroy+0x60/0x70 [openvswitch]
[ 9695.145034]  [] ovs_dp_cmd_new+0x311/0x4b0 [openvswitch]
[ 9695.148665]  [] genl_family_rcv_msg+0x1bc/0x3e0
[ 9695.152987]  [] ? sched_clock+0x9/0x10
[ 9695.157925]  [] ? sched_clock_local+0x1d/0x90
[ 9695.162486]  [] genl_rcv_msg+0x84/0xc0
[ 9695.166797]  [] ? genl_family_rcv_msg+0x3e0/0x3e0
[ 9695.171411]  [] netlink_rcv_skb+0xa9/0xd0
[ 9695.175809]  [] genl_rcv+0x28/0x40
[ 9695.179489]  [] netlink_unicast+0x125/0x1a0
[ 9695.184300]  [] netlink_sendmsg+0x35f/0x7d0
[ 9695.188510]  [] sock_sendmsg+0x9e/0xe0
[ 9695.192580]  [] ? might_fault+0x5e/0xc0
[ 9695.196850]  [] ? might_fault+0xb9/0xc0
[ 9695.200907]  [] ? might_fault+0x5e/0xc0
[ 9695.204862]  [] ___sys_sendmsg+0x408/0x420
[ 9695.208962]  [] ? __fget_light+0x13d/0x160
[ 9695.213088]  [] ? __fget_light+0x13d/0x160
[ 9695.217652]  [] __sys_sendmsg+0x51/0x90
[ 9695.220513]  [] SyS_sendmsg+0x12/0x20
[ 9695.223422]  [] system_call_fastpath+0x16/0x1b

--
You are receiving this mail because:
You are watching the assignee of the bug.
vhost on mmio platform devices
Hi,

I'm currently trying to make vhost-net work on an ARM Cortex-A15 Arndale
board. As irqfd is not yet upstream, I've used the irqfd and irq routing
patches recently posted on the kvmarm list. So far I can properly generate
irqs to the guest using the vgic and irq routing, but as far as I can see
the virtio registers (the interrupt status register at offset 0x60) are
not updated by the vhost kernel backend.

Does it mean that the virtio virtual registers have to be managed by Qemu,
even when using a vhost kernel back-end? I cannot find anywhere in the
vhost code where those registers could be updated, and I can clearly see
that mmio accesses from the guest to those registers cause a qemu
virtio_mmio_read or virtio_mmio_write. I thought vhost was added to
prevent any qemu interaction in the data path.

Could someone more familiar with vhost tell me if and where I'm wrong?

Thanks a lot in advance,

Best regards,
Rémy
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>
>> Working on big endian being an accident may be a matter of perspective
>
> :-)
>
>> The comment remains that this patch doesn't actually fix anything except
>> the overhead on big endian systems doing redundant byte swapping and
>> maybe the philosophy that vfio regions are little endian.
>
> Yes, that works by accident because technically VFIO is a transport and
> thus shouldn't perform any endian swapping of any sort, which remains
> the responsibility of the end driver, which is the only one to know
> whether a given BAR location is a register or some streaming data
> and, in the former case, whether it's LE or BE (some PCI devices are BE
> even! :-)
>
> But yes, in the end, it works with the dual "cancelling" swaps and the
> overhead of those swaps is probably drowned in the noise of the syscall
> overhead.
>
>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>> can use that doesn't have an implicit swap.
>
> Sadly there isn't ... In the old days we didn't even have the "be"
> variant, and readl/writel style accessors still don't have them either
> for all archs.
>
> There is __raw_readl/writel, but here the semantics are much more than
> just "don't swap"; they also don't have memory barriers (which means
> they are essentially useless to most drivers unless those are platform
> specific drivers which know exactly what they are doing, or in the rare
> cases such as accessing a framebuffer which we know never has side
> effects).
>
>> Calling it iowrite*_native is also an abuse of the namespace.
>> Next thing we know some common code will legitimately use that name.
>
> It might make sense to move those definitions into a common header. There
> have been a handful of cases in the past that wanted that sort of "native
> byte order" MMIOs iirc (though don't ask me for examples, I can't really
> remember).
>> If we do need to define an alias (which I'd like to avoid) it should be
>> something like vfio_iowrite32.

Ping?

We need to make a decision whether to move those xxx_native() helpers
somewhere (where?) or leave the patch as is (as we figured out that the
iowriteXX functions implement barriers and we cannot just use raw
accessors) and fix the commit log to explain everything.

Thanks!

>> Thanks,
>
> Cheers,
> Ben.
>
>> Alex
>> ===
>> any better?
>>
>>> Suggested-by: Benjamin Herrenschmidt
>>> Signed-off-by: Alexey Kardashevskiy
>>> ---
>>>  drivers/vfio/pci/vfio_pci_rdwr.c | 20
>>>  1 file changed, 16 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> index 210db24..f363b5a 100644
>>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>>> @@ -21,6 +21,18 @@
>>>
>>>  #include "vfio_pci_private.h"
>>>
>>> +#ifdef __BIG_ENDIAN__
>>> +#define ioread16_native		ioread16be
>>> +#define ioread32_native		ioread32be
>>> +#define iowrite16_native	iowrite16be
>>> +#define iowrite32_native	iowrite32be
>>> +#else
>>> +#define ioread16_native		ioread16
>>> +#define ioread32_native		ioread32
>>> +#define iowrite16_native	iowrite16
>>> +#define iowrite32_native	iowrite32
>>> +#endif
>>> +
>>>  /*
>>>   * Read or write from an __iomem region (MMIO or I/O port) with an excluded
>>>   * range which is inaccessible. The excluded range drops writes and fills
>>> @@ -50,9 +62,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>  			if (copy_from_user(&val, buf, 4))
>>>  				return -EFAULT;
>>>
>>> -			iowrite32(le32_to_cpu(val), io + off);
>>> +			iowrite32_native(val, io + off);
>>>  		} else {
>>> -			val = cpu_to_le32(ioread32(io + off));
>>> +			val = ioread32_native(io + off);
>>>
>>>  			if (copy_to_user(buf, &val, 4))
>>>  				return -EFAULT;
>>> @@ -66,9 +78,9 @@ static ssize_t do_io_rw(void __iomem *io, char __user *buf,
>>>  			if (copy_from_user(&val, buf, 2))
>>>  				return -EFAULT;
>>>
>>> -			iowrite16(le16_to_cpu(val), io + off);
>>> +			iowrite16_native(val, io + off);
>>>  		} else {
>>> -			val = cpu_to_le16(ioread16(io + off));
Re: [PATCH 03/11] qspinlock: Add pending bit
On Tue, Jun 17, 2014 at 07:23:44PM -0400, Konrad Rzeszutek Wilk wrote:
> > Actually in my v11 patch, I subdivided the slowpath into a slowpath for
> > the pending code and a slowerpath for actual queuing. Perhaps we could
> > use quickpath and slowpath instead. Anyway, it is a minor detail that we
> > can discuss after the core code gets merged.
>
> Why not do it the right way the first time around?

Because I told him not to do this. There's the fast path, the inline
single trylock cmpxchg, and the slow path, the out-of-line thing doing the
rest.

Note that pretty much all other locking primitives are implemented
similarly, with fast and slow paths. I find that having the entire state
machine in a single function is easier.

> That aside - these optimizations seem to make the code harder to
> read. And they do remind me of the scheduler code in 2.6.x which was
> based on heuristics - and eventually ripped out.

Well, it increases the states and thereby the complexity; nothing to be
done about that.

Also, it's not a random heuristic in the sense that it has odd behaviour.
Its behaviour is very well controlled. Furthermore, without this the
qspinlock performance is too far off the ticket lock performance to be a
possible replacement.

> So are these optimizations based on turning off certain hardware
> features? Say hardware prefetching?

We can try of course, but that doesn't help the code -- in fact, adding
the switch to turn it off _adds_ code on top.

> What I am getting at - can the hardware do this at some point (or
> perhaps already does on IvyBridge-EX?) - that is prefetch the per-cpu
> areas so they are always hot? And rendering this optimization not
> needed?

Got a ref to documentation on this new fancy stuff? I might have an
IVB-EX, but I've not tried it yet.

That said, memory fetches are 100s of cycles, and while prefetch can hide
some of that, I'm not sure we can hide all of it; there's not _that_ much
we do. If we observe the pending and locked bits set, we immediately drop
to the queueing code and touch it. So there are only a few useful
instructions to do.
Re: [PATCH 03/11] qspinlock: Add pending bit
On Tue, Jun 17, 2014 at 05:07:29PM -0400, Konrad Rzeszutek Wilk wrote:
> > We are trying to make the fastpath as simple as possible as it may be
> > inlined. The complexity of the queue spinlock is in the slowpath.
>
> Sure, but then it shouldn't be called slowpath anymore as it is not
> slow.

It's common terminology to call the inline part the fast path and the
out-of-line call on failure the slow path.
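The fast/slow path split being discussed can be sketched with C11 atomics. This is a toy test-and-set lock, not the actual qspinlock (whose slow path queues waiters instead of spinning); it only illustrates the terminology: an inline single trylock cmpxchg as the fast path, and an out-of-line retry loop as the slow path.

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy lock for illustration only; 0 = unlocked, 1 = locked. */
typedef struct {
    atomic_uint val;
} toy_spinlock_t;

/* Out-of-line slow path: keep retrying until the lock is acquired.
 * (The real qspinlock queues here rather than spinning on the word.) */
static void toy_lock_slowpath(toy_spinlock_t *lock)
{
    unsigned int expected;

    do {
        expected = 0;
    } while (!atomic_compare_exchange_weak(&lock->val, &expected, 1));
}

/* Fast path: one trylock cmpxchg; fall to the slow path on failure. */
static void toy_lock(toy_spinlock_t *lock)
{
    unsigned int expected = 0;

    if (atomic_compare_exchange_strong(&lock->val, &expected, 1))
        return;
    toy_lock_slowpath(lock);
}

static void toy_unlock(toy_spinlock_t *lock)
{
    atomic_store(&lock->val, 0);
}
```

In the uncontended case only the fast-path cmpxchg executes, which is why that part is kept simple enough to inline.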
Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
On Wed, May 07, 2014 at 03:55:57PM +0100, Christoffer Dall wrote:
> On Wed, May 07, 2014 at 10:00:21AM +0100, Marc Zyngier wrote:
> > On Tue, May 06 2014 at 7:04:48 pm BST, Christoffer Dall wrote:
> > > On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote:
> > > > Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE.
> > > >
> > > > Signed-off-by: Kim Phillips
> > > > ---
> > > >  arch/arm/kvm/mmu.c | 11 ---
> > > >  1 file changed, 8 insertions(+), 3 deletions(-)

[...]

> > > I think this looks reasonable.
> > >
> > > Acked-by: Christoffer Dall
> >
> > I feel like I'm missing some context here, and the commit message is way
> > too terse for me to make sense of it.
> >
> > So far, we can only get into user_mem_abort on a Stage-2 fault
> > (translation or permission) for memory. How can we suddenly get here for
> > a *device* fault? Do we get a special kind of memslot?
> >
> > I'm not saying the patch does anything wrong, but I'd like to understand
> > the rationale behind it. On its own, it doesn't make much sense.
>
> Think device passthrough. There's nothing preventing user space from
> setting up a memory region to point to device memory (through VFIO or
> /dev/mem). If that's done, we should enforce device memory properties
> so writes don't linger around in the cache to be written some time later
> when that device memory potentially doesn't belong to the VM anymore.
>
> This is just one tiny piece of all of them to make device passthrough
> work, and we could hold off with this patch until we have something more
> complete. On the other hand, we need to start somewhere, and this is
> hardly intrusive and is functionally correct even though you don't have
> a full device passthrough setup.

Please can you queue this patch up? I need it for my VFIO work, where I'm
registering the PCI BARs using KVM_SET_USER_MEMORY_REGION.

Without this, I'd have to trap all accesses and do pread/pwrite from
kvmtool instead of mmaping the regions straight through.

Cheers,

Will
Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?
On 24.06.14 11:47, Paul Mackerras wrote:
> On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote:
>> On 17.06.14 13:39, Eric Auger wrote:
>>> Hello,
>>>
>>> I have a question related to the KVM_IRQFD and KVM_SET_GSI_ROUTING
>>> ioctl relationship.
>>>
>>> When reading the KVM API documentation I do not understand there is any
>>> dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the
>>> text it seems only the gsi field is used and interpreted as the irqchip pin.
>>>
>>> However irqchip.c kvm_set_irq code relies on an existing and not dummy
>>> routing table.
>>>
>>> My question is: does anyone agree on the fact the user-side must set a
>>> consistent routing table using KVM_SET_GSI_ROUTING before using
>>> KVM_IRQFD? The other alternative would have been to build a default
>>> identity GSI routing table in the kernel (gsi = irqchip.pin).
>>
>> I untangled irqfd support from the x86 ioapic emulation a while back.
>> When I looked at it, I didn't see any easy way to split it out from the
>> routing too, so I kept that dependency in.
>>
>> If you look at the code, you will see that the irq routing entry is used
>> as a token for an irqfd. So every irqfd only knows its routing table
>> entry, nothing else.
>
> As far as I can see, the routing table entry is used for only two things:
> to tell whether the interrupt is an MSI or LSI, and to pass to
> kvm_set_msi.
>
> One way to tackle the problem would be to abstract out the routing table
> lookup into a function in irqchip.c, rather than having it open-coded in
> irqfd_update. Then, if you don't want to have the

You could also create it along with the irqfd state struct. So instead of

  kzalloc(sizeof(*irqfd), GFP_KERNEL);

we could do

  kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry),
          GFP_KERNEL);

and point the routing entry to the inline irqfd one. We'd lose the ability
to change any routings with that construct, but I think that's ok for irq
controllers that are not configurable.
Alternatively, we could also have a single routing entry that is a
wildcard match, ranging from say LSI [x...y]. The irqfd structure would
then need to carry a local payload to define an offset inside that
wildcard routing entry.

Alex
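The single-allocation trick described above (one kzalloc covering the irqfd state plus a trailing routing entry, with the entry pointer aimed at the block's tail) can be sketched in userspace C. The struct names here are stand-ins for illustration, not the kernel's:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in structs, not the kernel's kvm_kernel_irq_routing_entry. */
struct routing_entry {
    int gsi;
    int type;
};

struct irqfd_state {
    int fd;
    struct routing_entry *entry;   /* points into the same allocation */
};

static struct irqfd_state *irqfd_alloc(int fd, int gsi)
{
    struct irqfd_state *irqfd;

    /* One allocation instead of two, mirroring
     * kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry),
     *         GFP_KERNEL) */
    irqfd = calloc(1, sizeof(*irqfd) + sizeof(struct routing_entry));
    if (!irqfd)
        return NULL;
    irqfd->fd = fd;
    irqfd->entry = (struct routing_entry *)(irqfd + 1);  /* tail of block */
    irqfd->entry->gsi = gsi;
    return irqfd;
}
```

The trade-off noted in the thread applies here too: the entry lives and dies with the irqfd, so the routing it describes cannot be swapped out later.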
Re: [PATCH 2/2] ARM: KVM: user_mem_abort: support stage 2 MMIO page mapping
On 24/06/14 11:23, Will Deacon wrote:
> On Wed, May 07, 2014 at 03:55:57PM +0100, Christoffer Dall wrote:
>> On Wed, May 07, 2014 at 10:00:21AM +0100, Marc Zyngier wrote:
>>> On Tue, May 06 2014 at 7:04:48 pm BST, Christoffer Dall wrote:
>>>> On Tue, Mar 25, 2014 at 05:08:14PM -0500, Kim Phillips wrote:
>>>>> Use the correct memory type for device MMIO mappings: PAGE_S2_DEVICE.
>>>>>
>>>>> Signed-off-by: Kim Phillips
>>>>> ---
>>>>>  arch/arm/kvm/mmu.c | 11 ---
>>>>>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> [...]
>
>>>> I think this looks reasonable.
>>>>
>>>> Acked-by: Christoffer Dall
>>>
>>> I feel like I'm missing some context here, and the commit message is way
>>> too terse for me to make sense of it.
>>>
>>> So far, we can only get into user_mem_abort on a Stage-2 fault
>>> (translation or permission) for memory. How can we suddenly get here for
>>> a *device* fault? Do we get a special kind of memslot?
>>>
>>> I'm not saying the patch does anything wrong, but I'd like to understand
>>> the rationale behind it. On its own, it doesn't make much sense.
>>
>> Think device passthrough. There's nothing preventing user space from
>> setting up a memory region to point to device memory (through VFIO or
>> /dev/mem). If that's done, we should enforce device memory properties
>> so writes don't linger around in the cache to be written some time later
>> when that device memory potentially doesn't belong to the VM anymore.
>>
>> This is just one tiny piece of all of them to make device passthrough
>> work, and we could hold off with this patch until we have something more
>> complete. On the other hand, we need to start somewhere, and this is
>> hardly intrusive and is functionally correct even though you don't have
>> a full device passthrough setup.
>
> Please can you queue this patch up? I need it for my VFIO work, where I'm
> registering the PCI BARs using KVM_SET_USER_MEMORY_REGION.
>
> Without this, I'd have to trap all accesses and do pread/pwrite from
> kvmtool instead of mmaping the regions straight through.

I'm afraid there has been quite a bit of churn in this department, and the
patch doesn't apply any more.

Kim, any chance you could respin this patch on top of mainline?

Thanks,

M.

--
Jazz is not dead. It just smells funny...
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 24.06.14 12:11, Alexey Kardashevskiy wrote:
> On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote:
>> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote:
>>> Working on big endian being an accident may be a matter of perspective
>>
>> :-)
>>
>>> The comment remains that this patch doesn't actually fix anything except
>>> the overhead on big endian systems doing redundant byte swapping and
>>> maybe the philosophy that vfio regions are little endian.
>>
>> Yes, that works by accident because technically VFIO is a transport and
>> thus shouldn't perform any endian swapping of any sort, which remains
>> the responsibility of the end driver, which is the only one to know
>> whether a given BAR location is a register or some streaming data
>> and, in the former case, whether it's LE or BE (some PCI devices are BE
>> even! :-)
>>
>> But yes, in the end, it works with the dual "cancelling" swaps and the
>> overhead of those swaps is probably drowned in the noise of the syscall
>> overhead.
>>
>>> I'm still not a fan of iowrite vs iowritebe, there must be something we
>>> can use that doesn't have an implicit swap.
>>
>> Sadly there isn't ... In the old days we didn't even have the "be"
>> variant, and readl/writel style accessors still don't have them either
>> for all archs.
>>
>> There is __raw_readl/writel, but here the semantics are much more than
>> just "don't swap"; they also don't have memory barriers (which means
>> they are essentially useless to most drivers unless those are platform
>> specific drivers which know exactly what they are doing, or in the rare
>> cases such as accessing a framebuffer which we know never has side
>> effects).
>>
>>> Calling it iowrite*_native is also an abuse of the namespace. Next
>>> thing we know some common code will legitimately use that name.
>>
>> It might make sense to move those definitions into a common header. There
>> have been a handful of cases in the past that wanted that sort of "native
>> byte order" MMIOs iirc (though don't ask me for examples, I can't really
>> remember).
>>
>>> If we do need to define an alias (which I'd like to avoid) it should be
>>> something like vfio_iowrite32.
>
> Ping?
>
> We need to make a decision whether to move those xxx_native() helpers
> somewhere (where?) or leave the patch as is (as we figured out that the
> iowriteXX functions implement barriers and we cannot just use raw
> accessors) and fix the commit log to explain everything.

Is there actually any difference in the generated code with this patch
applied and without? I would hope that iowrite..() is inlined and cancels
out the cpu_to_le..() calls that are also inlined?

Alex
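The "cancelling swaps" the thread keeps referring to can be demonstrated in a few lines: on a big-endian host, cpu_to_le32() byte-swaps the value, and a little-endian iowrite32() swaps it again on the way out, so the net effect is the identity. This sketch models both swaps with an explicit bswap; it is illustrative, not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Plain 32-bit byte swap, standing in for the swap each accessor does. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* The double-swap write path a big-endian host takes in do_io_rw(). */
static uint32_t be_host_write_path(uint32_t val)
{
    uint32_t le = bswap32(val);    /* cpu_to_le32() on a BE host */
    return bswap32(le);            /* LE iowrite32() swaps once more */
}
```

Since be_host_write_path(x) == x for any x, the device sees the original value, which is why the pre-patch code "works by accident" on big endian, at the cost of two redundant swaps.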
Re: [PATCH 04/11] qspinlock: Extract out the exchange of tail code word
On Wed, Jun 18, 2014 at 01:37:45PM +0200, Paolo Bonzini wrote:
> However, I *do* agree with you that it's simpler to just squash this patch
> into 01/11.

So I explicitly broke out these optimizations into separate patches so
that we can see them independently and agree they're idempotent wrt the
state machine.

The initial patches by Waiman were totally unreadable, partly because the
optimizations made the code terribly complex. Luckily Waiman then dropped
the most horrible optimizations (the optimization for the very large
nr_cpus case, where we cannot have a pending byte), so the end result
isn't quite as complex as it used to be.
Re: [PATCH v4] KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
On 18.06.14 09:15, Mihai Caraman wrote: On vcpu schedule, the condition checked for tlb pollution is too loose. The tlb entries of a vcpu become polluted (vs stale) only when a different vcpu within the same logical partition runs in-between. Optimize the tlb invalidation condition by keeping last_vcpu per logical partition id.

With the new invalidation condition, a guest shows a 4% performance improvement on P5020DS while running a memory stress application with the cpu oversubscribed, the other guest running a cpu-intensive workload.

Guest - old invalidation condition
  real 3.89  user 3.87  sys 0.01
Guest - enhanced invalidation condition
  real 3.75  user 3.73  sys 0.01
Host
  real 3.70  user 1.85  sys 0.00

The memory stress application accesses 4KB pages backed by 75% of available TLB0 entries:

char foo[ENTRIES][4096] __attribute__ ((aligned (4096)));

int main()
{
	char bar;
	int i, j;

	for (i = 0; i < ITERATIONS; i++)
		for (j = 0; j < ENTRIES; j++)
			bar = foo[j][0];

	return 0;
}

Signed-off-by: Mihai Caraman
Cc: Scott Wood

Thanks, applied to kvm-ppc-queue.

Alex
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 06/24/2014 08:41 PM, Alexander Graf wrote:
> On 24.06.14 12:11, Alexey Kardashevskiy wrote:
>> [...]
>> We need to make a decision whether to move those xxx_native() helpers
>> somewhere (where?) or leave the patch as is (as we figured out that
>> iowriteXX functions implement barriers and we cannot just use raw
>> accessors) and fix the commit log to explain everything.
>
> Is there actually any difference in generated code with this patch applied
> and without? I would hope that iowrite..() is inlined and cancels out the
> cpu_to_le..() calls that are also inlined?

iowrite32 is a non-inline function, so the conversions do take place, and so do the others. And sorry, but I fail to see why this matters. We are not trying to accelerate things; we are removing redundant operations which confuse people who read the code.

--
Alexey
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 24.06.14 14:50, Alexey Kardashevskiy wrote:
> On 06/24/2014 08:41 PM, Alexander Graf wrote:
>> [...]
>> Is there actually any difference in generated code with this patch applied
>> and without? I would hope that iowrite..() is inlined and cancels out the
>> cpu_to_le..() calls that are also inlined?
>
> iowrite32 is a non-inline function, so the conversions do take place, and
> so do the others. And sorry, but I fail to see why this matters. We are
> not trying to accelerate things; we are removing redundant operations
> which confuse people who read the code.

The confusion depends on where you're coming from. If you happen to know that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.

I don't have a strong feeling either way though and will let Alex decide on the path forward :).

Alex
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 06/24/2014 10:52 PM, Alexander Graf wrote:
> On 24.06.14 14:50, Alexey Kardashevskiy wrote:
>> [...]
>> iowrite32 is a non-inline function, so the conversions do take place, and
>> so do the others. And sorry, but I fail to see why this matters. We are
>> not trying to accelerate things; we are removing redundant operations
>> which confuse people who read the code.
>
> The confusion depends on where you're coming from. If you happen to know
> that "iowrite32" writes in LE, then the LE conversion makes a lot of sense.

It was like this (and this is just confusing):

	iowrite32(le32_to_cpu(val), io + off);

What would make sense (according to you, and I would understand this) is this:

	iowrite32(cpu_to_le32(val), io + off);

Or I missed your point, did I?

> I don't have a strong feeling either way though and will let Alex decide on
> the path forward :)

It would probably help if you picked a side :)

--
Alexey
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 24.06.14 15:01, Alexey Kardashevskiy wrote:
> [...]
> It was like this (and this is just confusing):
>
> 	iowrite32(le32_to_cpu(val), io + off);
>
> What would make sense (according to you, and I would understand this) is this:
>
> 	iowrite32(cpu_to_le32(val), io + off);
>
> Or I missed your point, did I?

No, you didn't miss it. I think for people who know how iowrite32() works the above is obvious. I find the fact that iowrite32() always writes in LE pretty scary though ;).

So IMHO we should either create new, generic iowrite helpers that don't do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.

Alex
[PATCH v3 6/7] KVM: PPC: Book3S HV: Fix ABIv2 on LE
For code that doesn't live in modules we can just branch to the real function names, giving us compatibility with ABIv1 and ABIv2. Do this for the compiled-in code of HV KVM.

Signed-off-by: Alexander Graf
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 1a71f60..a9de981 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -668,9 +668,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
 	mr	r31, r4
 	addi	r3, r31, VCPU_FPRS_TM
-	bl	.load_fp_state
+	bl	load_fp_state
 	addi	r3, r31, VCPU_VRS_TM
-	bl	.load_vr_state
+	bl	load_vr_state
 	mr	r4, r31
 	lwz	r7, VCPU_VRSAVE_TM(r4)
 	mtspr	SPRN_VRSAVE, r7
@@ -1414,9 +1414,9 @@ END_FTR_SECTION_IFCLR(CPU_FTR_TM)
 	/* Save FP/VSX. */
 	addi	r3, r9, VCPU_FPRS_TM
-	bl	.store_fp_state
+	bl	store_fp_state
 	addi	r3, r9, VCPU_VRS_TM
-	bl	.store_vr_state
+	bl	store_vr_state
 	mfspr	r6, SPRN_VRSAVE
 	stw	r6, VCPU_VRSAVE_TM(r9)
 1:
@@ -2405,11 +2405,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 	mtmsrd	r8
 	isync
 	addi	r3,r3,VCPU_FPRS
-	bl	.store_fp_state
+	bl	store_fp_state
 #ifdef CONFIG_ALTIVEC
 BEGIN_FTR_SECTION
 	addi	r3,r31,VCPU_VRS
-	bl	.store_vr_state
+	bl	store_vr_state
 END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
 #endif
 	mfspr	r6,SPRN_VRSAVE
@@ -2441,11 +2441,11 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 	mtmsrd	r8
 	isync
 	addi	r3,r4,VCPU_FPRS
-	bl	.load_fp_state
+	bl	load_fp_state
 #ifdef CONFIG_ALTIVEC
 BEGIN_FTR_SECTION
 	addi	r3,r31,VCPU_VRS
-	bl	.load_vr_state
+	bl	load_vr_state
 END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
 #endif
 	lwz	r7,VCPU_VRSAVE(r31)
--
1.8.1.4
Re: [PATCH] KVM: PPC: Book3E: Unlock mmu_lock when setting caching attribute
On 18.06.14 17:45, Mihai Caraman wrote: Commit 08c9a188d0d0fc0f0c5e17d89a06bb59c493110f ("kvm: powerpc: use caching attributes as per linux pte") does not handle the error case properly, leaving mmu_lock held. The held lock will then trigger an RCU stall in the kvmppc_e500_emul_tlbwe() caller. In the error case, go to the out label.

Signed-off-by: Mihai Caraman
Cc: Bharat Bhushan

Thanks, applied to for-3.16.

Alex
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
> [...]
> No, you didn't miss it. I think for people who know how iowrite32()
> works the above is obvious. I find the fact that iowrite32() always
> writes in LE pretty scary though ;).
>
> So IMHO we should either create new, generic iowrite helpers that don't
> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.

I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense and keeps the byte order consistent regardless of the platform, while iowrite32(val) or iowrite32be(val) makes me scratch my head and try to remember that the byte swaps are a nop on the given platforms. As Ben noted, a native, no-swap ioread/write doesn't exist, but perhaps it should. I'd prefer an attempt be made to make it exist before adding vfio-specific macros. vfio is arguably doing the right thing here given the functions available. Thanks,

Alex
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 06/25/2014 12:21 AM, Alex Williamson wrote:
> On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote:
>> [...]
>> So IMHO we should either create new, generic iowrite helpers that don't
>> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
>
> I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense

I do not understand why @val is considered LE here and needs to be converted to CPU. Really. I truly believe it should be cpu_to_le32().

> and keeps the byte order consistent regardless of the platform, while
> iowrite32(val) or iowrite32be(val) makes me scratch my head and try to
> remember that the byte swaps are a nop on the given platforms. As Ben
> noted, a native, no-swap ioread/write doesn't exist, but perhaps it
> should. I'd prefer an attempt be made to make it exist before adding
> vfio-specific macros. vfio is arguably doing the right thing here given
> the functions available.
RE: [PATCH] vfio: Fix endianness handling for emulated BARs
From: Alexey Kardashevskiy
> >> So IMHO we should either create new, generic iowrite helpers that don't
> >> do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls.
> >
> > I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
>
> I do not understand why @val is considered LE here and needs to be
> converted to CPU. Really. I truly believe it should be cpu_to_le32().

Think about it carefully... Apparently iowrite32() is writing a 'cpu' value out 'le'. So if you have a 'le' value you need to convert it to 'cpu' first.

I really won't ask how 'be' ppc managed to get 'le' on-chip peripherals.

	David
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
> On 06/25/2014 12:21 AM, Alex Williamson wrote:
>> [...]
>> I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense
>
> I do not understand why @val is considered LE here and needs to be
> converted to CPU. Really. I truly believe it should be cpu_to_le32().

Because iowrite32 is defined to take
Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?
On 06/24/2014 12:40 PM, Alexander Graf wrote: > > On 24.06.14 11:47, Paul Mackerras wrote: >> On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote: >>> On 17.06.14 13:39, Eric Auger wrote: Hello, I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl relationship. When reading the KVM API documentation I do not understand there is any dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the text it seems only the gsi field is used and interpreted as the irqchip pin. However irqchip.c kvm_set_irq code relies on an existing and not dummy routing table. My question is: does anyone agree on the fact the user-side must set a consistent routing table using KVM_SET_GSI_ROUTING before using KVM_IRQFD? The other alternative would have been to build a default identity GSI routing table in the kernel (gsi = irqchip.pin). >>> I untangled irqfd support from the x86 ioapic emulation a while back. >>> When I >>> looked at it, I didn't see any easy way to split it out from the routing >>> too, so I kept that dependency in. Hi Alex, Paul, thanks for your replies. hence don't you think the KVM API doc deserves to be clarified then? >>> >>> If you look at the code, you will see that the irq routing entry is >>> used as >>> token for an irqfd. So every irqfd only knows its routing table entry, >>> nothing else. >> As far as I can see, the routing table entry is used for only two >> things: to tell whether the interrupt is an MSI or LSI, and to pass to >> kvm_set_msi. Indeed I noticed irq_entry in _irqfd struct only is set for MSI routing. For other IRQS - I guess what you call LSI - , the routing table map[] is used to retrieve the irqchip.pin(s) associated to that gsi and call set(). >> >> One way to tackle the problem would be to abstract out the routing >> table lookup into a function in irqchip.c, rather than having it >> open-coded in irqfd_update. 
Then, if you don't want to have the >> routing table you write an alternative function that creates a >> routing entry from scratch. It would need information from the >> low-level irq chip code as to which interrupts are LSIs and which are >> MSIs. It also ties you down to having only one kind of interrupt controller. > > You could also create it along with the irqfd state struct. So instead of > > kzalloc(sizeof(*irqfd), GFP_KERNEL); > > we could do > > kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry), > GFP_KERNEL); Wouldn't it make sense, in kvm_irqfd_assign, to test whether there is a routing entry associated to that gsi. In the negative, create a default one where irqchip.pin = gsi or something architecture specific?. That way would not need a complete & big routing table to be set by user-side? > > and point the routing entry to the inline irqfd one. We'd lose the > ability to change any routings with that construct, but I think that's > ok for irq controllers that are not configurable. > > Alternatively, we could also have a single routing entry that is a > wildcard match, ranging from say LSI [x...y]. The irqfd structure would > then need to carry a local payload to define an offset inside that > wildcard routing entry. Also whatever the number of IRQs assigned, we have this big chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS] allocated. Also on ARM we have some difficulties to define KVM_IRQCHIP_NUM_PINS which becomes quite dynamic. Best Regards Eric > > > Alex > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: New improvements on KVM
On 6/23/2014 10:09 PM, Oscar Garcia wrote: Hello, I am planning to study a PhD for the next year, and I would like to spend part of my time studying KVM code in order to suggest new improvements. I know that this mailing list is used by experts related to KVM, I would like to ask you, what improvements are needed in KVM in order to advance in that direction? For starters I'd check out the KVM TODO page on the wiki and the GSOC pages on the Qemu wiki. Aside from that, just hang out and try to pick up what you can. Realistically though (unless you just love hypervisor code), all the interesting work is going on in Qemu and libvirt/openstack/other management tools. http://www.linux-kvm.org/page/TODO http://wiki.qemu.org/Google_Summer_of_Code_2014 http://wiki.qemu.org/Google_Summer_of_Code_2013 http://wiki.qemu.org/Google_Summer_of_Code_201 Thank you Oscar -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?
On 24.06.14 17:05, Eric Auger wrote: On 06/24/2014 12:40 PM, Alexander Graf wrote: On 24.06.14 11:47, Paul Mackerras wrote: On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote: On 17.06.14 13:39, Eric Auger wrote: Hello, I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl relationship. When reading the KVM API documentation I do not understand there is any dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to the text it seems only the gsi field is used and interpreted as the irqchip pin. However irqchip.c kvm_set_irq code relies on an existing and not dummy routing table. My question is: does anyone agree on the fact the user-side must set a consistent routing table using KVM_SET_GSI_ROUTING before using KVM_IRQFD? The other alternative would have been to build a default identity GSI routing table in the kernel (gsi = irqchip.pin). I untangled irqfd support from the x86 ioapic emulation a while back. When I looked at it, I didn't see any easy way to split it out from the routing too, so I kept that dependency in. Hi Alex, Paul, thanks for your replies. hence don't you think the KVM API doc deserves to be clarified then? If you look at the code, you will see that the irq routing entry is used as token for an irqfd. So every irqfd only knows its routing table entry, nothing else. As far as I can see, the routing table entry is used for only two things: to tell whether the interrupt is an MSI or LSI, and to pass to kvm_set_msi. Indeed I noticed irq_entry in _irqfd struct only is set for MSI routing. For other IRQS - I guess what you call LSI - , routine table map[] is used to retrieve the irchip.pin(s) associated to that gsi and call set(). One way to tackle the problem would be to abstract out the routing table lookup into a function in irqchip.c, rather than having it open-coded in irqfd_update. Then, if you don't want to have the routing table you write an alternative function that creates a routing entry from scratch. 
It would need information from the low-level irq chip code as to which interrupts are LSIs and which are MSIs. It also ties you down to having only one kind of interrupt controller. You could also create it along with the irqfd state struct. So instead of kzalloc(sizeof(*irqfd), GFP_KERNEL); we could do kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry), GFP_KERNEL); Wouldn't it make sense, in kvm_irqfd_assign, to test whether there is a routing entry associated to that gsi. In the negative, create a default one where irqchip.pin = gsi or something architecture specific?. "no entry" really should be "no entry". I think if we want some other mechanism, we should make it be explicit. That way would not need a complete & big routing table to be set by user-side? and point the routing entry to the inline irqfd one. We'd lose the ability to change any routings with that construct, but I think that's ok for irq controllers that are not configurable. Alternatively, we could also have a single routing entry that is a wildcard match, ranging from say LSI [x...y]. The irqfd structure would then need to carry a local payload to define an offset inside that wildcard routing entry. Also whatever the number of IRQs assigned, we have this big chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS] allocated. Also on ARM we have some difficulties to define KVM_IRQCHIP_NUM_PINS which becomes quite dynamic. Right, if you go with the ranged routing entries you should be able to get by with a very small number of NUM_PINS, no? Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.15 25/61] MIPS: KVM: Allocate at least 16KB for exception handlers
3.15-stable review patch. If anyone has any objections, please let me know. -- From: James Hogan commit 7006e2dfda9adfa40251093604db76d7e44263b3 upstream. Each MIPS KVM guest has its own copy of the KVM exception vector. This contains the TLB refill exception handler at offset 0x000, the general exception handler at offset 0x180, and interrupt exception handlers at offset 0x200 in case Cause_IV=1. A common handler is copied to offset 0x2000 and offset 0x3000 is used for temporarily storing k1 during entry from guest. However the amount of memory allocated for this purpose is calculated as 0x200 rounded up to the next page boundary, which is insufficient if 4KB pages are in use. This can lead to the common handler at offset 0x2000 being overwritten and infinitely recursive exceptions on the next exit from the guest. Increase the minimum size from 0x200 to 0x4000 to cover the full use of the page. Signed-off-by: James Hogan Cc: Paolo Bonzini Cc: Gleb Natapov Cc: kvm@vger.kernel.org Cc: Ralf Baechle Cc: linux-m...@linux-mips.org Cc: Sanjay Lal Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman --- arch/mips/kvm/kvm_mips.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -304,7 +304,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st if (cpu_has_veic || cpu_has_vint) { size = 0x200 + VECTORSPACING * 64; } else { - size = 0x200; + size = 0x4000; } /* Save Linux EBASE */ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH 1/2] docs: update ivshmem device spec
On 06/20/2014 06:15 AM, David Marchand wrote: > Add some notes on the parts needed to use ivshmem devices: more specifically, > explain the purpose of an ivshmem server and the basic concept to use the > ivshmem devices in guests. > > Signed-off-by: David Marchand > --- > docs/specs/ivshmem_device_spec.txt | 41 > ++-- > 1 file changed, 35 insertions(+), 6 deletions(-) > > diff --git a/docs/specs/ivshmem_device_spec.txt > b/docs/specs/ivshmem_device_spec.txt > index 667a862..7d2b73f 100644 > --- a/docs/specs/ivshmem_device_spec.txt > +++ b/docs/specs/ivshmem_device_spec.txt > @@ -85,12 +85,41 @@ events have occurred. The semantics of interrupt vectors > are left to the > user's discretion. > > > +IVSHMEM host services > +- > + > +This part is optional (see *Usage in the Guest* section below). > + > +To handle notifications between users of a ivshmem device, a ivshmem server > has s/a ivshmem/an ivshmem/ (twice) > +been added. This server is responsible for creating the shared memory and > +creating a set of eventfds for each users of the shared memory. It behaves > as a > +proxy between the different ivshmem clients (QEMU): giving the shared memory > fd > +to each client, allocating eventfds to new clients and broadcasting to all > +clients when a client disappears. > + > +Apart from the current ivshmem implementation in QEMU, a ivshmem client can > be and again > +written for debug, for development purposes, or to implement notifications > +between host and guests. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
Re: Coupling between KVM_IRQFD and KVM_SET_GSI_ROUTING?
On 06/24/2014 05:47 PM, Alexander Graf wrote: > > On 24.06.14 17:05, Eric Auger wrote: >> On 06/24/2014 12:40 PM, Alexander Graf wrote: >>> On 24.06.14 11:47, Paul Mackerras wrote: On Mon, Jun 23, 2014 at 06:47:51PM +0200, Alexander Graf wrote: > On 17.06.14 13:39, Eric Auger wrote: >> >> Hello, >> >> I have a question related to KVM_IRQFD and KVM_SET_GSI_ROUTING ioctl >> relationship. >> >> When reading the KVM API documentation I do not understand there >> is any >> dependency between KVM_IRQFD and KVM_SET_GSI_ROUTING. According to >> the >> text it seems only the gsi field is used and interpreted as the >> irqchip pin. >> >> However irqchip.c kvm_set_irq code relies on an existing and not >> dummy >> routing table. >> >> My question is: does anyone agree on the fact the user-side must >> set a >> consistent routing table using KVM_SET_GSI_ROUTING before using >> KVM_IRQFD? The other alternative would have been to build a default >> identity GSI routing table in the kernel (gsi = irqchip.pin). > I untangled irqfd support from the x86 ioapic emulation a while back. > When I > looked at it, I didn't see any easy way to split it out from the > routing > too, so I kept that dependency in. >> Hi Alex, Paul, >> >> thanks for your replies. hence don't you think the KVM API doc deserves >> to be clarified then? > If you look at the code, you will see that the irq routing entry is > used as > token for an irqfd. So every irqfd only knows its routing table entry, > nothing else. As far as I can see, the routing table entry is used for only two things: to tell whether the interrupt is an MSI or LSI, and to pass to kvm_set_msi. >> >> Indeed I noticed irq_entry in _irqfd struct only is set for MSI routing. >> For other IRQS - I guess what you call LSI - , routine table map[] is >> used to retrieve the irchip.pin(s) associated to that gsi and call set(). 
One way to tackle the problem would be to abstract out the routing table lookup into a function in irqchip.c, rather than having it open-coded in irqfd_update. Then, if you don't want to have the routing table you write an alternative function that creates a routing entry from scratch. It would need information from the low-level irq chip code as to which interrupts are LSIs and which are MSIs. It also ties you down to having only one kind of interrupt >> controller. >>> You could also create it along with the irqfd state struct. So >>> instead of >>> >>>kzalloc(sizeof(*irqfd), GFP_KERNEL); >>> >>> we could do >>> >>>kzalloc(sizeof(*irqfd) + sizeof(struct kvm_kernel_irq_routing_entry), >>> GFP_KERNEL); >> >> Wouldn't it make sense, in kvm_irqfd_assign, to test whether there is a >> routing entry associated to that gsi. In the negative, create a default >> one where irqchip.pin = gsi or something architecture specific?. > > "no entry" really should be "no entry". I think if we want some other > mechanism, we should make it be explicit. Yep I agree. But currently the usage of the routing mechanism is not explicit in 4.75 KVM_IRQFD either, and that was causing my doubts ;-) On top of that the doc says: "kvm_irqfd.gsi specifies the irqchip pin toggled by this event" but if there is no routing table set before by user-side, the injection will fail. > >> >> That way would not need a complete & big routing table to be set by >> user-side? >> >>> and point the routing entry to the inline irqfd one. We'd lose the >>> ability to change any routings with that construct, but I think that's >>> ok for irq controllers that are not configurable. >>> >>> Alternatively, we could also have a single routing entry that is a >>> wildcard match, ranging from say LSI [x...y]. The irqfd structure would >>> then need to carry a local payload to define an offset inside that >>> wildcard routing entry. 
>> Also whatever the number of IRQs assigned, we have this big >> chip[KVM_NR_IRQCHIPS][KVM_IRQCHIP_NUM_PINS] allocated. Also on ARM we >> have some difficulties to define KVM_IRQCHIP_NUM_PINS which becomes >> quite dynamic. > > Right, if you go with the ranged routing entries you should be able to > get by with a very small number of NUM_PINS, no? Yes that's correct. but the effort to integrate such feature in the code need to be further assessed. Eric > > > Alex > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.14 38/68] MIPS: KVM: Allocate at least 16KB for exception handlers
3.14-stable review patch. If anyone has any objections, please let me know. -- From: James Hogan commit 7006e2dfda9adfa40251093604db76d7e44263b3 upstream. Each MIPS KVM guest has its own copy of the KVM exception vector. This contains the TLB refill exception handler at offset 0x000, the general exception handler at offset 0x180, and interrupt exception handlers at offset 0x200 in case Cause_IV=1. A common handler is copied to offset 0x2000 and offset 0x3000 is used for temporarily storing k1 during entry from guest. However the amount of memory allocated for this purpose is calculated as 0x200 rounded up to the next page boundary, which is insufficient if 4KB pages are in use. This can lead to the common handler at offset 0x2000 being overwritten and infinitely recursive exceptions on the next exit from the guest. Increase the minimum size from 0x200 to 0x4000 to cover the full use of the page. Signed-off-by: James Hogan Cc: Paolo Bonzini Cc: Gleb Natapov Cc: kvm@vger.kernel.org Cc: Ralf Baechle Cc: linux-m...@linux-mips.org Cc: Sanjay Lal Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman --- arch/mips/kvm/kvm_mips.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -304,7 +304,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st if (cpu_has_veic || cpu_has_vint) { size = 0x200 + VECTORSPACING * 64; } else { - size = 0x200; + size = 0x4000; } /* Save Linux EBASE */ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 06/25/2014 12:43 AM, Alex Williamson wrote: > On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote: >> On 06/25/2014 12:21 AM, Alex Williamson wrote: >>> On Tue, 2014-06-24 at 15:22 +0200, Alexander Graf wrote: On 24.06.14 15:01, Alexey Kardashevskiy wrote: > On 06/24/2014 10:52 PM, Alexander Graf wrote: >> On 24.06.14 14:50, Alexey Kardashevskiy wrote: >>> On 06/24/2014 08:41 PM, Alexander Graf wrote: On 24.06.14 12:11, Alexey Kardashevskiy wrote: > On 06/21/2014 09:12 AM, Benjamin Herrenschmidt wrote: >> On Thu, 2014-06-19 at 21:21 -0600, Alex Williamson wrote: >> >>> Working on big endian being an accident may be a matter of >>> perspective >> :-) >> >>> The comment remains that this patch doesn't actually fix anything >>> except >>> the overhead on big endian systems doing redundant byte swapping and >>> maybe the philosophy that vfio regions are little endian. >> Yes, that works by accident because technically VFIO is a transport >> and >> thus shouldn't perform any endian swapping of any sort, which remains >> the responsibility of the end driver which is the only one to know >> whether a given BAR location is a a register or some streaming data >> and in the former case whether it's LE or BE (some PCI devices are BE >> even ! :-) >> >> But yes, in the end, it works with the dual "cancelling" swaps and >> the >> overhead of those swaps is probably drowned in the noise of the >> syscall >> overhead. >> >>> I'm still not a fan of iowrite vs iowritebe, there must be >>> something we >>> can use that doesn't have an implicit swap. >> Sadly there isn't ... In the old day we didn't even have the "be" >> variant and readl/writel style accessors still don't have them either >> for all archs. 
>> >> There is __raw_readl/writel but here the semantics are much more than >> just "don't swap", they also don't have memory barriers (which means >> they are essentially useless to most drivers unless those are >> platform >> specific drivers which know exactly what they are doing, or in the >> rare >> cases such as accessing a framebuffer which we know never have side >> effects). >> >>> Calling it iowrite*_native is also an abuse of the namespace. >>> Next thing we know some common code >>> will legitimately use that name. >> I might make sense to those definitions into a common header. There >> have >> been a handful of cases in the past that wanted that sort of "native >> byte order" MMIOs iirc (though don't ask me for examples, I can't >> really >> remember). >> >>> If we do need to define an alias >>> (which I'd like to avoid) it should be something like >>> vfio_iowrite32. > Ping? > > We need to make a decision whether to move those xxx_native() helpers > somewhere (where?) or leave the patch as is (as we figured out that > iowriteXX functions implement barriers and we cannot just use raw > accessors) and fix commit log to explain everything. Is there actually any difference in generated code with this patch applied and without? I would hope that iowrite..() is inlined and cancels out the cpu_to_le..() calls that are also inlined? >>> iowrite32 is a non-inline function so conversions take place so are the >>> others. And sorry but I fail to see why this matters. We are not trying >>> to >>> accelerate things, we are removing redundant operations which confuse >>> people who read the code. >> The confusion depends on where you're coming from. If you happen to know >> that "iowrite32" writes in LE, then the LE conversion makes a lot of >> sense. 
> It was like this (and this is just confusing): > > iowrite32(le32_to_cpu(val), io + off); > > What would make sense (according to you and I would understand this) is > this: > > iowrite32(cpu_to_le32(val), io + off); > > > Or I missed your point, did I? No, you didn't miss it. I think for people who know how iowrite32() works the above is obvious. I find the fact that iowrite32() writes in LE always pretty scary though ;). So IMHO we should either create new, generic iowrite helpers that don't do any endian swapping at all or do iowrite32(cpu_to_le32(val)) calls. >>> >>> I'm one of those people for whom iowrite32(le32_to_cpu(val)) makes sense >> >> >> I do not understand why @val is considered LE here and need to be converted >> to CPU. Really. I truly believe it should be cp
Re: [PATCH 4/4] kvm: Implement PEBS virtualization
On Sun, Jun 22, 2014 at 09:02:25PM +0200, Andi Kleen wrote: > > First, it's not sufficient to pin the debug store area, you also > > have to pin the guest page tables that are used to map the debug > > store. But even if you do that, as soon as the guest fork()s, it > > will create a new pgd which the host will be free to swap out. The > > processor can then attempt a PEBS store to an unmapped address which > > will fail, even though the guest is configured correctly. > > That's a good point. You're right of course. > > The only way I can think around it would be to intercept CR3 writes > while PEBS is active and always pin all the table pages leading > to the PEBS buffer. That's slow, but should be only needed > while PEBS is running. > > -Andi Suppose that can be done separately from the pinned spte patchset. And it requires accounting into mlock limits as well, as noted. One set of pagetables per pinned virtual address leading down to the last translations is sufficient per-vcpu. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.10 29/52] MIPS: KVM: Allocate at least 16KB for exception handlers
3.10-stable review patch. If anyone has any objections, please let me know. -- From: James Hogan commit 7006e2dfda9adfa40251093604db76d7e44263b3 upstream. Each MIPS KVM guest has its own copy of the KVM exception vector. This contains the TLB refill exception handler at offset 0x000, the general exception handler at offset 0x180, and interrupt exception handlers at offset 0x200 in case Cause_IV=1. A common handler is copied to offset 0x2000 and offset 0x3000 is used for temporarily storing k1 during entry from guest. However the amount of memory allocated for this purpose is calculated as 0x200 rounded up to the next page boundary, which is insufficient if 4KB pages are in use. This can lead to the common handler at offset 0x2000 being overwritten and infinitely recursive exceptions on the next exit from the guest. Increase the minimum size from 0x200 to 0x4000 to cover the full use of the page. Signed-off-by: James Hogan Cc: Paolo Bonzini Cc: Gleb Natapov Cc: kvm@vger.kernel.org Cc: Ralf Baechle Cc: linux-m...@linux-mips.org Cc: Sanjay Lal Signed-off-by: Paolo Bonzini Signed-off-by: Greg Kroah-Hartman --- arch/mips/kvm/kvm_mips.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -299,7 +299,7 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st if (cpu_has_veic || cpu_has_vint) { size = 0x200 + VECTORSPACING * 64; } else { - size = 0x200; + size = 0x4000; } /* Save Linux EBASE */ -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/9] MIPS: KVM: Bugfixes and cleanups
The patches are pretty straightforward. Changes: v3 - v2: o In patch #2, change the use of kvm_[err|info|debug]. o In patch #3, add err removal in kvm_arch_commit_memory_region(). o In patch #3, revert the changes to kvm_arch_vm_ioctl(). o In patch #7, drop the merge of kvm_arch_vcpu_free() and pointer nullification. o Add patch #9. v2 - v1: o In patch #1, don't change the opening comment mark for kernel-doc comments. o In patch #1, to make long lines more readable, use local variables / macros. o In patch #1, slight format adjustments are made. o Use -M flag to generate patches (detect renames). o Add patch #8. Deng-Cheng Zhu (8): MIPS: KVM: Reformat code and comments MIPS: KVM: Use KVM internal logger MIPS: KVM: Simplify functions by removing redundancy MIPS: KVM: Remove unneeded volatile MIPS: KVM: Rename files to remove the prefix "kvm_" and "kvm_mips_" MIPS: KVM: Restore correct value for WIRED at TLB uninit MIPS: KVM: Fix memory leak on VCPU MIPS: KVM: Skip memory cleaning in kvm_mips_commpage_init() James Hogan (1): MIPS: KVM: Remove dead code of TLB index error in kvm_mips_emul_tlbwr() arch/mips/include/asm/kvm_host.h | 12 +- arch/mips/include/asm/r4kcache.h | 3 + arch/mips/kvm/Makefile| 8 +- arch/mips/kvm/{kvm_cb.c => callback.c}| 0 arch/mips/kvm/commpage.c | 33 ++ arch/mips/kvm/commpage.h | 24 + arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} | 40 +- arch/mips/kvm/{kvm_mips_emul.c => emulate.c} | 539 +++--- arch/mips/kvm/{kvm_mips_int.c => interrupt.c} | 47 +- arch/mips/kvm/{kvm_mips_int.h => interrupt.h} | 22 +- arch/mips/kvm/kvm_mips_comm.h | 23 - arch/mips/kvm/kvm_mips_commpage.c | 37 -- arch/mips/kvm/kvm_mips_opcode.h | 24 - arch/mips/kvm/{kvm_locore.S => locore.S} | 55 ++- arch/mips/kvm/{kvm_mips.c => mips.c} | 227 + arch/mips/kvm/opcode.h| 22 + arch/mips/kvm/{kvm_mips_stats.c => stats.c} | 28 +- arch/mips/kvm/{kvm_tlb.c => tlb.c}| 258 +-- arch/mips/kvm/trace.h | 18 +- arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c}| 109 +++-- 20 files changed, 750 
insertions(+), 779 deletions(-) rename arch/mips/kvm/{kvm_cb.c => callback.c} (100%) create mode 100644 arch/mips/kvm/commpage.c create mode 100644 arch/mips/kvm/commpage.h rename arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} (79%) rename arch/mips/kvm/{kvm_mips_emul.c => emulate.c} (83%) rename arch/mips/kvm/{kvm_mips_int.c => interrupt.c} (85%) rename arch/mips/kvm/{kvm_mips_int.h => interrupt.h} (74%) delete mode 100644 arch/mips/kvm/kvm_mips_comm.h delete mode 100644 arch/mips/kvm/kvm_mips_commpage.c delete mode 100644 arch/mips/kvm/kvm_mips_opcode.h rename arch/mips/kvm/{kvm_locore.S => locore.S} (93%) rename arch/mips/kvm/{kvm_mips.c => mips.c} (83%) create mode 100644 arch/mips/kvm/opcode.h rename arch/mips/kvm/{kvm_mips_stats.c => stats.c} (63%) rename arch/mips/kvm/{kvm_tlb.c => tlb.c} (78%) rename arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c} (83%) -- 1.8.5.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 8/9] MIPS: KVM: Skip memory cleaning in kvm_mips_commpage_init()
From: Deng-Cheng Zhu The commpage is allocated using kzalloc(), so there's no need to clear the memory of the kvm_mips_commpage struct and its internal mips_coproc. Signed-off-by: Deng-Cheng Zhu --- arch/mips/kvm/commpage.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/mips/kvm/commpage.c b/arch/mips/kvm/commpage.c index 61b9c04..2d6e976 100644 --- a/arch/mips/kvm/commpage.c +++ b/arch/mips/kvm/commpage.c @@ -28,9 +28,6 @@ void kvm_mips_commpage_init(struct kvm_vcpu *vcpu) { struct kvm_mips_commpage *page = vcpu->arch.kseg0_commpage; - memset(page, 0, sizeof(struct kvm_mips_commpage)); - /* Specific init values for fields */ vcpu->arch.cop0 = &page->cop0; - memset(vcpu->arch.cop0, 0, sizeof(struct mips_coproc)); } -- 1.8.5.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 4/9] MIPS: KVM: Remove unneeded volatile
From: Deng-Cheng Zhu The keyword volatile for idx in the TLB functions is unnecessary. Signed-off-by: Deng-Cheng Zhu --- arch/mips/kvm/kvm_tlb.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/mips/kvm/kvm_tlb.c b/arch/mips/kvm/kvm_tlb.c index 29a5bdb..bbcd822 100644 --- a/arch/mips/kvm/kvm_tlb.c +++ b/arch/mips/kvm/kvm_tlb.c @@ -201,7 +201,7 @@ int kvm_mips_host_tlb_write(struct kvm_vcpu *vcpu, unsigned long entryhi, { unsigned long flags; unsigned long old_entryhi; - volatile int idx; + int idx; local_irq_save(flags); @@ -426,7 +426,7 @@ EXPORT_SYMBOL(kvm_mips_guest_tlb_lookup); int kvm_mips_host_tlb_lookup(struct kvm_vcpu *vcpu, unsigned long vaddr) { unsigned long old_entryhi, flags; - volatile int idx; + int idx; local_irq_save(flags); -- 1.8.5.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 6/9] MIPS: KVM: Restore correct value for WIRED at TLB uninit
From: Deng-Cheng Zhu At TLB initialization, the commpage TLB entry is reserved on top of the existing WIRED entries (whose number is not necessarily 0). Signed-off-by: Deng-Cheng Zhu --- arch/mips/kvm/mips.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 27250ee..3d53d34 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -170,7 +170,7 @@ static void kvm_mips_uninit_tlbs(void *arg) { /* Restore wired count */ - write_c0_wired(0); + write_c0_wired(read_c0_wired() - 1); mtc0_tlbw_hazard(); /* Clear out all the TLBs */ kvm_local_flush_tlb_all(); -- 1.8.5.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 9/9] MIPS: KVM: Remove dead code of TLB index error in kvm_mips_emul_tlbwr()
From: James Hogan Once the TLB index has been masked with (KVM_MIPS_GUEST_TLB_SIZE - 1), it can never be out of range, so the error handling can never be reached. Remove the dead code. Signed-off-by: James Hogan Signed-off-by: Deng-Cheng Zhu --- arch/mips/kvm/emulate.c | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/mips/kvm/emulate.c b/arch/mips/kvm/emulate.c index 982f4af..5500fef 100644 --- a/arch/mips/kvm/emulate.c +++ b/arch/mips/kvm/emulate.c @@ -846,11 +846,6 @@ enum emulation_result kvm_mips_emul_tlbwr(struct kvm_vcpu *vcpu) get_random_bytes(&index, sizeof(index)); index &= (KVM_MIPS_GUEST_TLB_SIZE - 1); - if (index < 0 || index >= KVM_MIPS_GUEST_TLB_SIZE) { - kvm_err("%s: illegal index: %d\n", __func__, index); - return EMULATE_FAIL; - } - tlb = &vcpu->arch.guest_tlb[index]; /* -- 1.8.5.3
[PATCH v3 7/9] MIPS: KVM: Fix memory leak on VCPU
From: Deng-Cheng Zhu kvm_arch_vcpu_free() is called in 2 code paths: 1) kvm_vm_ioctl() kvm_vm_ioctl_create_vcpu() kvm_arch_vcpu_destroy() kvm_arch_vcpu_free() 2) kvm_put_kvm() kvm_destroy_vm() kvm_arch_destroy_vm() kvm_mips_free_vcpus() kvm_arch_vcpu_free() Neither path frees the VCPU structure itself, so it must be done in kvm_arch_vcpu_free(), matching the memory allocation in kvm_arch_vcpu_create(). Signed-off-by: Deng-Cheng Zhu --- Changes: v3 - v2: o Drop the merge of kvm_arch_vcpu_free() and pointer nullification. arch/mips/kvm/mips.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c index 3d53d34..7f7fd76 100644 --- a/arch/mips/kvm/mips.c +++ b/arch/mips/kvm/mips.c @@ -384,6 +384,7 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu) kfree(vcpu->arch.guest_ebase); kfree(vcpu->arch.kseg0_commpage); + kfree(vcpu); } void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) -- 1.8.5.3
[PATCH v3 2/9] MIPS: KVM: Use KVM internal logger
From: Deng-Cheng Zhu Replace printks with kvm_[err|info|debug]. Signed-off-by: Deng-Cheng Zhu --- Changes: v3 - v2: o Change the use of kvm_[err|info|debug]. arch/mips/kvm/kvm_mips.c | 23 - arch/mips/kvm/kvm_mips_emul.c | 107 - arch/mips/kvm/kvm_mips_stats.c | 6 +-- arch/mips/kvm/kvm_tlb.c| 60 +++ arch/mips/kvm/kvm_trap_emul.c | 31 +--- 5 files changed, 110 insertions(+), 117 deletions(-) diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c index 821e7e8..bdca619 100644 --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -817,8 +817,8 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) ga = memslot->base_gfn << PAGE_SHIFT; ga_end = ga + (memslot->npages << PAGE_SHIFT); - printk("%s: dirty, ga: %#lx, ga_end %#lx\n", __func__, ga, - ga_end); + kvm_info("%s: dirty, ga: %#lx, ga_end %#lx\n", __func__, ga, +ga_end); n = kvm_dirty_bitmap_bytes(memslot); memset(memslot->dirty_bitmap, 0, n); @@ -925,24 +925,25 @@ int kvm_arch_vcpu_dump_regs(struct kvm_vcpu *vcpu) if (!vcpu) return -1; - printk("VCPU Register Dump:\n"); - printk("\tpc = 0x%08lx\n", vcpu->arch.pc); - printk("\texceptions: %08lx\n", vcpu->arch.pending_exceptions); + kvm_debug("VCPU Register Dump:\n"); + kvm_debug("\tpc = 0x%08lx\n", vcpu->arch.pc); + kvm_debug("\texceptions: %08lx\n", vcpu->arch.pending_exceptions); for (i = 0; i < 32; i += 4) { - printk("\tgpr%02d: %08lx %08lx %08lx %08lx\n", i, + kvm_debug("\tgpr%02d: %08lx %08lx %08lx %08lx\n", i, vcpu->arch.gprs[i], vcpu->arch.gprs[i + 1], vcpu->arch.gprs[i + 2], vcpu->arch.gprs[i + 3]); } - printk("\thi: 0x%08lx\n", vcpu->arch.hi); - printk("\tlo: 0x%08lx\n", vcpu->arch.lo); + kvm_debug("\thi: 0x%08lx\n", vcpu->arch.hi); + kvm_debug("\tlo: 0x%08lx\n", vcpu->arch.lo); cop0 = vcpu->arch.cop0; - printk("\tStatus: 0x%08lx, Cause: 0x%08lx\n", - kvm_read_c0_guest_status(cop0), kvm_read_c0_guest_cause(cop0)); + kvm_debug("\tStatus: 0x%08lx, Cause: 0x%08lx\n", + kvm_read_c0_guest_status(cop0), + 
kvm_read_c0_guest_cause(cop0)); - printk("\tEPC: 0x%08lx\n", kvm_read_c0_guest_epc(cop0)); + kvm_debug("\tEPC: 0x%08lx\n", kvm_read_c0_guest_epc(cop0)); return 0; } diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/kvm_mips_emul.c index 993dd1c..262ce3e 100644 --- a/arch/mips/kvm/kvm_mips_emul.c +++ b/arch/mips/kvm/kvm_mips_emul.c @@ -183,18 +183,18 @@ unsigned long kvm_compute_return_epc(struct kvm_vcpu *vcpu, /* And now the FPA/cp1 branch instructions. */ case cop1_op: - printk("%s: unsupported cop1_op\n", __func__); + kvm_err("%s: unsupported cop1_op\n", __func__); break; } return nextpc; unaligned: - printk("%s: unaligned epc\n", __func__); + kvm_err("%s: unaligned epc\n", __func__); return nextpc; sigill: - printk("%s: DSP branch but not DSP ASE\n", __func__); + kvm_err("%s: DSP branch but not DSP ASE\n", __func__); return nextpc; } @@ -751,8 +751,8 @@ enum emulation_result kvm_mips_emul_eret(struct kvm_vcpu *vcpu) kvm_clear_c0_guest_status(cop0, ST0_ERL); vcpu->arch.pc = kvm_read_c0_guest_errorepc(cop0); } else { - printk("[%#lx] ERET when MIPS_SR_EXL|MIPS_SR_ERL == 0\n", - vcpu->arch.pc); + kvm_err("[%#lx] ERET when MIPS_SR_EXL|MIPS_SR_ERL == 0\n", + vcpu->arch.pc); er = EMULATE_FAIL; } @@ -795,7 +795,7 @@ enum emulation_result kvm_mips_emul_tlbr(struct kvm_vcpu *vcpu) enum emulation_result er = EMULATE_FAIL; uint32_t pc = vcpu->arch.pc; - printk("[%#x] COP0_TLBR [%ld]\n", pc, kvm_read_c0_guest_index(cop0)); + kvm_err("[%#x] COP0_TLBR [%ld]\n", pc, kvm_read_c0_guest_index(cop0)); return er; } @@ -809,13 +809,12 @@ enum emulation_result kvm_mips_emul_tlbwi(struct kvm_vcpu *vcpu) uint32_t pc = vcpu->arch.pc; if (index < 0 || index >= KVM_MIPS_GUEST_TLB_SIZE) { - printk("%s: illegal index: %d\n", __func__, index); - printk - ("[%#x] COP0_TLBWI [%d] (entryhi: %#lx, entrylo0: %#lx entrylo1: %#lx, mask: %#lx)\n", -pc, index, kvm_read_c0_guest_entryhi(cop0), -kvm_read_c0_guest_entrylo0(cop0), -kvm_read_c0_guest_entrylo1(cop0), 
-kvm_read_c0_guest_pagemask(cop0)); +
[PATCH v3 5/9] MIPS: KVM: Rename files to remove the prefix "kvm_" and "kvm_mips_"
From: Deng-Cheng Zhu Since all the files are in arch/mips/kvm/, there's no need of the prefixes "kvm_" and "kvm_mips_". Signed-off-by: Deng-Cheng Zhu --- arch/mips/kvm/Makefile| 8 arch/mips/kvm/{kvm_cb.c => callback.c}| 0 arch/mips/kvm/{kvm_mips_commpage.c => commpage.c} | 2 +- arch/mips/kvm/{kvm_mips_comm.h => commpage.h} | 0 arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} | 2 +- arch/mips/kvm/{kvm_mips_emul.c => emulate.c} | 6 +++--- arch/mips/kvm/{kvm_mips_int.c => interrupt.c} | 2 +- arch/mips/kvm/{kvm_mips_int.h => interrupt.h} | 0 arch/mips/kvm/{kvm_locore.S => locore.S} | 0 arch/mips/kvm/{kvm_mips.c => mips.c} | 6 +++--- arch/mips/kvm/{kvm_mips_opcode.h => opcode.h} | 0 arch/mips/kvm/{kvm_mips_stats.c => stats.c} | 0 arch/mips/kvm/{kvm_tlb.c => tlb.c}| 0 arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c}| 4 ++-- 14 files changed, 15 insertions(+), 15 deletions(-) rename arch/mips/kvm/{kvm_cb.c => callback.c} (100%) rename arch/mips/kvm/{kvm_mips_commpage.c => commpage.c} (97%) rename arch/mips/kvm/{kvm_mips_comm.h => commpage.h} (100%) rename arch/mips/kvm/{kvm_mips_dyntrans.c => dyntrans.c} (99%) rename arch/mips/kvm/{kvm_mips_emul.c => emulate.c} (99%) rename arch/mips/kvm/{kvm_mips_int.c => interrupt.c} (99%) rename arch/mips/kvm/{kvm_mips_int.h => interrupt.h} (100%) rename arch/mips/kvm/{kvm_locore.S => locore.S} (100%) rename arch/mips/kvm/{kvm_mips.c => mips.c} (99%) rename arch/mips/kvm/{kvm_mips_opcode.h => opcode.h} (100%) rename arch/mips/kvm/{kvm_mips_stats.c => stats.c} (100%) rename arch/mips/kvm/{kvm_tlb.c => tlb.c} (100%) rename arch/mips/kvm/{kvm_trap_emul.c => trap_emul.c} (99%) diff --git a/arch/mips/kvm/Makefile b/arch/mips/kvm/Makefile index 78d87bb..401fe02 100644 --- a/arch/mips/kvm/Makefile +++ b/arch/mips/kvm/Makefile @@ -5,9 +5,9 @@ common-objs = $(addprefix ../../../virt/kvm/, kvm_main.o coalesced_mmio.o) EXTRA_CFLAGS += -Ivirt/kvm -Iarch/mips/kvm -kvm-objs := $(common-objs) kvm_mips.o kvm_mips_emul.o kvm_locore.o \ - kvm_mips_int.o 
kvm_mips_stats.o kvm_mips_commpage.o \ - kvm_mips_dyntrans.o kvm_trap_emul.o +kvm-objs := $(common-objs) mips.o emulate.o locore.o \ + interrupt.o stats.o commpage.o \ + dyntrans.o trap_emul.o obj-$(CONFIG_KVM) += kvm.o -obj-y += kvm_cb.o kvm_tlb.o +obj-y += callback.o tlb.o diff --git a/arch/mips/kvm/kvm_cb.c b/arch/mips/kvm/callback.c similarity index 100% rename from arch/mips/kvm/kvm_cb.c rename to arch/mips/kvm/callback.c diff --git a/arch/mips/kvm/kvm_mips_commpage.c b/arch/mips/kvm/commpage.c similarity index 97% rename from arch/mips/kvm/kvm_mips_commpage.c rename to arch/mips/kvm/commpage.c index 4b5612b..61b9c04 100644 --- a/arch/mips/kvm/kvm_mips_commpage.c +++ b/arch/mips/kvm/commpage.c @@ -22,7 +22,7 @@ #include -#include "kvm_mips_comm.h" +#include "commpage.h" void kvm_mips_commpage_init(struct kvm_vcpu *vcpu) { diff --git a/arch/mips/kvm/kvm_mips_comm.h b/arch/mips/kvm/commpage.h similarity index 100% rename from arch/mips/kvm/kvm_mips_comm.h rename to arch/mips/kvm/commpage.h diff --git a/arch/mips/kvm/kvm_mips_dyntrans.c b/arch/mips/kvm/dyntrans.c similarity index 99% rename from arch/mips/kvm/kvm_mips_dyntrans.c rename to arch/mips/kvm/dyntrans.c index fa7184d..521121b 100644 --- a/arch/mips/kvm/kvm_mips_dyntrans.c +++ b/arch/mips/kvm/dyntrans.c @@ -18,7 +18,7 @@ #include #include -#include "kvm_mips_comm.h" +#include "commpage.h" #define SYNCI_TEMPLATE 0x041f #define SYNCI_BASE(x) (((x) >> 21) & 0x1f) diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/emulate.c similarity index 99% rename from arch/mips/kvm/kvm_mips_emul.c rename to arch/mips/kvm/emulate.c index e5862bc..982f4af 100644 --- a/arch/mips/kvm/kvm_mips_emul.c +++ b/arch/mips/kvm/emulate.c @@ -29,9 +29,9 @@ #include #define CONFIG_MIPS_MT -#include "kvm_mips_opcode.h" -#include "kvm_mips_int.h" -#include "kvm_mips_comm.h" +#include "opcode.h" +#include "interrupt.h" +#include "commpage.h" #include "trace.h" diff --git a/arch/mips/kvm/kvm_mips_int.c b/arch/mips/kvm/interrupt.c 
similarity index 99% rename from arch/mips/kvm/kvm_mips_int.c rename to arch/mips/kvm/interrupt.c index d458c04..9b44459 100644 --- a/arch/mips/kvm/kvm_mips_int.c +++ b/arch/mips/kvm/interrupt.c @@ -20,7 +20,7 @@ #include -#include "kvm_mips_int.h" +#include "interrupt.h" void kvm_mips_queue_irq(struct kvm_vcpu *vcpu, uint32_t priority) { diff --git a/arch/mips/kvm/kvm_mips_int.h b/arch/mips/kvm/interrupt.h similarity index 100% rename from arch/mips/kvm/kvm_mips_int.h rename to arch/mips/kvm/interrupt.h diff --git a/arch/mips/kvm/kvm_locore.S b/arch/mips/kvm/locore.S similarity index 100% rename from arch/mips/kvm/kvm_locore.S rename to arch/mips/kvm/locore.S diff --git a/arch/mips/kvm/kvm_mips.c b/arch/m
[PATCH v3 3/9] MIPS: KVM: Simplify functions by removing redundancy
From: Deng-Cheng Zhu No logic changes inside. Signed-off-by: Deng-Cheng Zhu --- Changes: v3 - v2: o Add err removal in kvm_arch_commit_memory_region(). o Revert the changes to kvm_arch_vm_ioctl(). arch/mips/include/asm/kvm_host.h | 2 +- arch/mips/kvm/kvm_mips.c | 18 -- arch/mips/kvm/kvm_mips_commpage.c | 2 -- arch/mips/kvm/kvm_mips_emul.c | 34 +++--- arch/mips/kvm/kvm_mips_stats.c| 4 +--- 5 files changed, 17 insertions(+), 43 deletions(-) diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 3f813f2..7a3fc67 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -764,7 +764,7 @@ extern int kvm_mips_trans_mtc0(uint32_t inst, uint32_t *opc, struct kvm_vcpu *vcpu); /* Misc */ -extern int kvm_mips_dump_stats(struct kvm_vcpu *vcpu); +extern void kvm_mips_dump_stats(struct kvm_vcpu *vcpu); extern unsigned long kvm_mips_get_ramsize(struct kvm *kvm); diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c index bdca619..cabcac0a 100644 --- a/arch/mips/kvm/kvm_mips.c +++ b/arch/mips/kvm/kvm_mips.c @@ -97,9 +97,7 @@ void kvm_arch_hardware_unsetup(void) void kvm_arch_check_processor_compat(void *rtn) { - int *r = (int *)rtn; - *r = 0; - return; + *(int *)rtn = 0; } static void kvm_mips_init_tlbs(struct kvm *kvm) @@ -225,7 +223,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, enum kvm_mr_change change) { unsigned long npages = 0; - int i, err = 0; + int i; kvm_debug("%s: kvm: %p slot: %d, GPA: %llx, size: %llx, QVA: %llx\n", __func__, kvm, mem->slot, mem->guest_phys_addr, @@ -243,8 +241,7 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, if (!kvm->arch.guest_pmap) { kvm_err("Failed to allocate guest PMAP"); - err = -ENOMEM; - goto out; + return; } kvm_debug("Allocated space for Guest PMAP Table (%ld pages) @ %p\n", @@ -255,8 +252,6 @@ void kvm_arch_commit_memory_region(struct kvm *kvm, kvm->arch.guest_pmap[i] = KVM_INVALID_PAGE; } } -out: - return; } void kvm_arch_flush_shadow_all(struct 
kvm *kvm) @@ -845,16 +840,12 @@ long kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) int kvm_arch_init(void *opaque) { - int ret; - if (kvm_mips_callbacks) { kvm_err("kvm: module already exists\n"); return -EEXIST; } - ret = kvm_mips_emulation_init(&kvm_mips_callbacks); - - return ret; + return kvm_mips_emulation_init(&kvm_mips_callbacks); } void kvm_arch_exit(void) @@ -1008,7 +999,6 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu) { - return; } int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu, diff --git a/arch/mips/kvm/kvm_mips_commpage.c b/arch/mips/kvm/kvm_mips_commpage.c index ab7096e..4b5612b 100644 --- a/arch/mips/kvm/kvm_mips_commpage.c +++ b/arch/mips/kvm/kvm_mips_commpage.c @@ -33,6 +33,4 @@ void kvm_mips_commpage_init(struct kvm_vcpu *vcpu) /* Specific init values for fields */ vcpu->arch.cop0 = &page->cop0; memset(vcpu->arch.cop0, 0, sizeof(struct mips_coproc)); - - return; } diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/kvm_mips_emul.c index 262ce3e..e5862bc 100644 --- a/arch/mips/kvm/kvm_mips_emul.c +++ b/arch/mips/kvm/kvm_mips_emul.c @@ -761,8 +761,6 @@ enum emulation_result kvm_mips_emul_eret(struct kvm_vcpu *vcpu) enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu) { - enum emulation_result er = EMULATE_DONE; - kvm_debug("[%#lx] !!!WAIT!!! 
(%#lx)\n", vcpu->arch.pc, vcpu->arch.pending_exceptions); @@ -782,7 +780,7 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu) } } - return er; + return EMULATE_DONE; } /* @@ -792,11 +790,10 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu) enum emulation_result kvm_mips_emul_tlbr(struct kvm_vcpu *vcpu) { struct mips_coproc *cop0 = vcpu->arch.cop0; - enum emulation_result er = EMULATE_FAIL; uint32_t pc = vcpu->arch.pc; kvm_err("[%#x] COP0_TLBR [%ld]\n", pc, kvm_read_c0_guest_index(cop0)); - return er; + return EMULATE_FAIL; } /* Write Guest TLB Entry @ Index */ @@ -804,7 +801,6 @@ enum emulation_result kvm_mips_emul_tlbwi(struct kvm_vcpu *vcpu) { struct mips_coproc *cop0 = vcpu->arch.cop0; int index = kvm_read_c0_guest_index(cop0); - enum emulation_result er = EMULATE_DONE; struct k
Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code
On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote: > Howdy, > > Ben reminded me a while back that we have a nasty race in our KVM PV code. > > We replace a few instructions with longer streams of instructions to check > whether it's necessary to trap out from it (like mtmsr, no need to trap if > we only disable interrupts). During those replacement chunks we must not get > any interrupts, because they might overwrite scratch space that we already > used to save otherwise clobbered register state into. > > So we have a thing called "critical sections" which allows us to atomically > get in and out of "interrupt disabled" modes without touching MSR. When we > are supposed to deliver an interrupt into the guest while we are in a critical > section, we just don't inject the interrupt yet, but leave it be until the > next trap. > > However, we never really know when the next trap would be. For all we know it > could be never. At this point we created a race that is a potential source > for interrupt loss or at least deferral. > > This patch set aims at solving the race. Instead of merely deferring an > interrupt when we see such a situation, we go into a special instruction > interpretation mode. In this mode, we interpret all PPC assembler instructions > that happen until we are out of the critical section again, at which point > we can now inject the interrupt. > > This bug only affects KVM implementations that make use of the magic page, so > e500v2, book3s_32 and book3s_64 PR KVM. Would it be possible to single step through the critical section instead? Or set a high res timer to expire very quickly? -Scott
Re: [PATCH v2 9/9] arm64: KVM: vgic: deal with GIC sub-page alignment
On 06/19/2014 04:21 AM, Marc Zyngier wrote: The GIC CPU interface is always 4k aligned. If the host is using 64k pages, it is critical to place the guest's GICC interface at the same relative alignment as the host's GICV. Failure to do so results in an impossibility for the guest to deal with interrupts. Add a KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET attribute for the VGIC, allowing userspace to retrieve the GICV offset in a page. It becomes then trivial to adjust the GICC base address for the guest. Does this mean there is a corresponding patch for qemu? Signed-off-by: Marc Zyngier --- arch/arm/include/uapi/asm/kvm.h | 1 + arch/arm64/include/uapi/asm/kvm.h | 1 + virt/kvm/arm/vgic.c | 7 +++ 3 files changed, 9 insertions(+) diff --git a/arch/arm/include/uapi/asm/kvm.h b/arch/arm/include/uapi/asm/kvm.h index 8b51c1a..056b782 100644 --- a/arch/arm/include/uapi/asm/kvm.h +++ b/arch/arm/include/uapi/asm/kvm.h @@ -174,6 +174,7 @@ struct kvm_arch_memory_slot { #define KVM_DEV_ARM_VGIC_OFFSET_SHIFT 0 #define KVM_DEV_ARM_VGIC_OFFSET_MASK(0xULL << KVM_DEV_ARM_VGIC_OFFSET_SHIFT) #define KVM_DEV_ARM_VGIC_GRP_NR_IRQS 3 +#define KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET 4 /* KVM_IRQ_LINE irq field index values */ #define KVM_ARM_IRQ_TYPE_SHIFT24 diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h index b5cd6ed..5513de4 100644 --- a/arch/arm64/include/uapi/asm/kvm.h +++ b/arch/arm64/include/uapi/asm/kvm.h @@ -160,6 +160,7 @@ struct kvm_arch_memory_slot { #define KVM_DEV_ARM_VGIC_OFFSET_SHIFT 0 #define KVM_DEV_ARM_VGIC_OFFSET_MASK(0xULL << KVM_DEV_ARM_VGIC_OFFSET_SHIFT) #define KVM_DEV_ARM_VGIC_GRP_NR_IRQS 3 +#define KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET 4 /* KVM_IRQ_LINE irq field index values */ #define KVM_ARM_IRQ_TYPE_SHIFT24 diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c index b0cd417..68ac9c6 100644 --- a/virt/kvm/arm/vgic.c +++ b/virt/kvm/arm/vgic.c @@ -2228,6 +2228,12 @@ static int vgic_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr) r = 
put_user(dev->kvm->arch.vgic.nr_irqs, uaddr); break; } + case KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET: { + u32 __user *uaddr = (u32 __user *)(long)attr->addr; + u32 val = vgic->vcpu_base & ~PAGE_MASK; + r = put_user(val, uaddr); + break; + } } @@ -2265,6 +2271,7 @@ static int vgic_has_attr(struct kvm_device *dev, struct kvm_device_attr *attr) offset = attr->attr & KVM_DEV_ARM_VGIC_OFFSET_MASK; return vgic_has_attr_regs(vgic_cpu_ranges, offset); case KVM_DEV_ARM_VGIC_GRP_NR_IRQS: + case KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET: return 0; } return -ENXIO;
Re: How to know that vPMU is enabled or disabled?
On 24/06/2014 10:33, Jidong Xiao wrote: I think I have figured this out. According to this patch (as well as the Intel SDM manual), it looks like the PMU is exposed via CPUID leaf 0xa. https://github.com/torvalds/linux/commit/a6c06ed1a60aff77b27ba558c315c3fed4e35565 Therefore, in the guest OS, one can run the cpuid instruction with leaf 0xa, and if vPMU is supported/enabled, the return value in eax/ebx/ecx/edx should be something non-zero, and in the cloud machine which I am using I see these registers are all zero, therefore I think vPMU is not supported in my virtual machine. Probably it is masked by Qemu. This is correct. QEMU only enables vPMU if you use "-cpu host". Paolo
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Tue, 2014-06-24 at 12:41 +0200, Alexander Graf wrote: > Is there actually any difference in generated code with this patch > applied and without? I would hope that iowrite..() is inlined and > cancels out the cpu_to_le..() calls that are also inlined? No, the former uses byteswapping asm, the compiler can't "cancel" it out, but the overhead of the additional byteswap might not be measurable. Cheers, Ben.
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote: > > I do not understand why @val is considered LE here and need to be > converted > to CPU. Really. I truly believe it should be cpu_to_le32(). No. Both are slightly wrong semantically but le32_to_cpu() is less wrong :-) iowrite32 supposedly takes a "cpu" value as argument and writes an "le" value. So if anything, you need something that converts to a "cpu" value before you call iowrite32. Now it's still slightly wrong because the "input" to le32_to_cpu() is supposed to be an "LE" value but of course here it's not, it's a "cpu" value too :-) But yes, I agree with aw here, either do nothing or stick a set of iowriteXX_native or something equivalent in the generic iomap header, define them in terms of iowrite32be/iowrite32 based on the compile time endian of the arch. Hitting asm-generic/iomap.h I think will cover all archs except ARM. For ARM, just hack arch/arm/asm/io.h Cheers, Ben.
Re: [PATCH v2 9/9] arm64: KVM: vgic: deal with GIC sub-page alignment
On 24 June 2014 20:28, Joel Schopp wrote: > > On 06/19/2014 04:21 AM, Marc Zyngier wrote: >> >> The GIC CPU interface is always 4k aligned. If the host is using >> 64k pages, it is critical to place the guest's GICC interface at the >> same relative alignment as the host's GICV. Failure to do so results >> in an impossibility for the guest to deal with interrupts. >> >> Add a KVM_DEV_ARM_VGIC_GRP_ADDR_OFFSET attribute for the VGIC, allowing >> userspace to retrieve the GICV offset in a page. It becomes then trivial >> to adjust the GICC base address for the guest. > > > Does this mean there is a corresponding patch for qemu? Not as far as I know. It's a bit awkward on the QEMU end because we really want to provide the guest a consistent memory map regardless of the host CPU. So at best we'd probably use it to say "sorry, can't run on this CPU/host kernel". (That said, if you think you can make QEMU usefully use the information and want to write a QEMU patch I'm not averse to the idea.) kvmtool is probably better placed to take advantage of it since it takes more of a "deal with what the host provides you" philosophy. thanks -- PMM
Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code
On 24.06.14 20:53, Scott Wood wrote: On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote: Howdy, Ben reminded me a while back that we have a nasty race in our KVM PV code. We replace a few instructions with longer streams of instructions to check whether it's necessary to trap out from it (like mtmsr, no need to trap if we only disable interrupts). During those replacement chunks we must not get any interrupts, because they might overwrite scratch space that we already used to save otherwise clobbered register state into. So we have a thing called "critical sections" which allows us to atomically get in and out of "interrupt disabled" modes without touching MSR. When we are supposed to deliver an interrupt into the guest while we are in a critical section, we just don't inject the interrupt yet, but leave it be until the next trap. However, we never really know when the next trap would be. For all we know it could be never. At this point we created a race that is a potential source for interrupt loss or at least deferral. This patch set aims at solving the race. Instead of merely deferring an interrupt when we see such a situation, we go into a special instruction interpretation mode. In this mode, we interpret all PPC assembler instructions that happen until we are out of the critical section again, at which point we can now inject the interrupt. This bug only affects KVM implementations that make use of the magic page, so e500v2, book3s_32 and book3s_64 PR KVM. Would it be possible to single step through the critical section instead? Or set a high res timer to expire very quickly? 
There are a few other alternatives to this implementation: 1) Unmap the magic page, emulate all memory access to it while in critical and irq pending 2) Trigger a timer that sends a request to the vcpu to wake it from potential sleep and inject the irq 3) Single step until we're beyond the critical section 4) Probably more that I can't think of right now :) Each has their good and bad sides. Unmapping the magic page adds complexity to the MMU mapping code, since we need to make sure we don't map it back in on demand and treat faults to it specially. The timer interrupt works, but I'm not fully convinced that it's a good idea for things like MC events which we also block during critical sections on e500v2. Single stepping is hard enough to get right on interaction between QEMU, KVM and the guest. I didn't really want to make that stuff any more complicated. This approach is really just one out of many - and it's one that's nicely self-contained and shouldn't have any impact at all on implementations that don't care about it ;). Alex
Re: [PATCH v3 1/9] MIPS: KVM: Reformat code and comments
Hi Deng-Cheng, On Tuesday 24 June 2014 10:31:02 Deng-Cheng Zhu wrote: > diff --git a/arch/mips/kvm/kvm_mips.c b/arch/mips/kvm/kvm_mips.c > index cd5e4f5..821e7e8 100644 > --- a/arch/mips/kvm/kvm_mips.c > +++ b/arch/mips/kvm/kvm_mips.c > -#define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU > +#define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x) > struct kvm_stats_debugfs_item debugfs_entries[] = { > - { "wait", VCPU_STAT(wait_exits) }, > - { "cache", VCPU_STAT(cache_exits) }, > - { "signal", VCPU_STAT(signal_exits) }, > - { "interrupt", VCPU_STAT(int_exits) }, > - { "cop_unsuable", VCPU_STAT(cop_unusable_exits) }, > - { "tlbmod", VCPU_STAT(tlbmod_exits) }, > - { "tlbmiss_ld", VCPU_STAT(tlbmiss_ld_exits) }, > - { "tlbmiss_st", VCPU_STAT(tlbmiss_st_exits) }, > - { "addrerr_st", VCPU_STAT(addrerr_st_exits) }, > - { "addrerr_ld", VCPU_STAT(addrerr_ld_exits) }, > - { "syscall", VCPU_STAT(syscall_exits) }, > - { "resvd_inst", VCPU_STAT(resvd_inst_exits) }, > - { "break_inst", VCPU_STAT(break_inst_exits) }, > - { "flush_dcache", VCPU_STAT(flush_dcache_exits) }, > - { "halt_wakeup", VCPU_STAT(halt_wakeup) }, > + { "wait", VCPU_STAT(wait_exits), KVM_STAT_VCPU }, > + { "cache", VCPU_STAT(cache_exits), KVM_STAT_VCPU }, > + { "signal", VCPU_STAT(signal_exits), KVM_STAT_VCPU }, > + { "interrupt", VCPU_STAT(int_exits), KVM_STAT_VCPU }, > + { "cop_unsuable", VCPU_STAT(cop_unusable_exits), KVM_STAT_VCPU }, > + { "tlbmod", VCPU_STAT(tlbmod_exits), KVM_STAT_VCPU }, > + { "tlbmiss_ld", VCPU_STAT(tlbmiss_ld_exits), KVM_STAT_VCPU }, > + { "tlbmiss_st", VCPU_STAT(tlbmiss_st_exits), KVM_STAT_VCPU }, > + { "addrerr_st", VCPU_STAT(addrerr_st_exits), KVM_STAT_VCPU }, > + { "addrerr_ld", VCPU_STAT(addrerr_ld_exits), KVM_STAT_VCPU }, > + { "syscall", VCPU_STAT(syscall_exits), KVM_STAT_VCPU }, > + { "resvd_inst", VCPU_STAT(resvd_inst_exits), KVM_STAT_VCPU }, > + { "break_inst", VCPU_STAT(break_inst_exits), KVM_STAT_VCPU }, > + { "flush_dcache", VCPU_STAT(flush_dcache_exits), 
KVM_STAT_VCPU }, > + { "halt_wakeup", VCPU_STAT(halt_wakeup), KVM_STAT_VCPU }, IMO more important than making checkpatch happy here would be to put it in nicely tabulated columns ;-) > diff --git a/arch/mips/kvm/kvm_mips_emul.c b/arch/mips/kvm/kvm_mips_emul.c > index 8d48400..993dd1c 100644 > --- a/arch/mips/kvm/kvm_mips_emul.c > +++ b/arch/mips/kvm/kvm_mips_emul.c > switch (insn.i_format.opcode) { > - /* > - * jr and jalr are in r_format format. > - */ > + /* jr and jalr are in r_format format. */ bad indentation. > diff --git a/arch/mips/kvm/kvm_trap_emul.c b/arch/mips/kvm/kvm_trap_emul.c > index 693f952..baf6577 100644 > --- a/arch/mips/kvm/kvm_trap_emul.c > +++ b/arch/mips/kvm/kvm_trap_emul.c > @@ -186,10 +185,12 @@ static int kvm_trap_emul_handle_tlb_ld_miss(struct > kvm_vcpu *vcpu) vcpu->arch.pc, badvaddr); > > /* User Address (UA) fault, this could happen if this comment could be fixed too Otherwise this patch looks good. Thanks for doing this! Cheers James
Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code
On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote: > On 24.06.14 20:53, Scott Wood wrote: > > On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote: > >> Howdy, > >> > >> Ben reminded me a while back that we have a nasty race in our KVM PV code. > >> > >> We replace a few instructions with longer streams of instructions to check > >> whether it's necessary to trap out from it (like mtmsr, no need to trap if > >> we only disable interrupts). During those replacement chunks we must not > >> get > >> any interrupts, because they might overwrite scratch space that we already > >> used to save otherwise clobbered register state into. > >> > >> So we have a thing called "critical sections" which allows us to atomically > >> get in and out of "interrupt disabled" modes without touching MSR. When we > >> are supposed to deliver an interrupt into the guest while we are in a > >> critical > >> section, we just don't inject the interrupt yet, but leave it be until the > >> next trap. > >> > >> However, we never really know when the next trap would be. For all we know > >> it > >> could be never. At this point we created a race that is a potential source > >> for interrupt loss or at least deferral. > >> > >> This patch set aims at solving the race. Instead of merely deferring an > >> interrupt when we see such a situation, we go into a special instruction > >> interpretation mode. In this mode, we interpret all PPC assembler > >> instructions > >> that happen until we are out of the critical section again, at which point > >> we can now inject the interrupt. > >> > >> This bug only affects KVM implementations that make use of the magic page, > >> so > >> e500v2, book3s_32 and book3s_64 PR KVM. > > Would it be possible to single step through the critical section > > instead? Or set a high res timer to expire very quickly? 
> There are a few other alternatives to this implementation:
>
>   1) Unmap the magic page, emulate all memory access to it while in
>      critical and irq pending
>   2) Trigger a timer that sends a request to the vcpu to wake it from
>      potential sleep and inject the irq
>   3) Single step until we're beyond the critical section
>   4) Probably more that I can't think of right now :)
>
> Each has their good and bad sides. Unmapping the magic page adds
> complexity to the MMU mapping code, since we need to make sure we don't
> map it back in on demand and treat faults to it specially.
>
> The timer interrupt works, but I'm not fully convinced that it's a good
> idea for things like MC events which we also block during critical
> sections on e500v2.

Are you concerned about the guest seeing machine checks that are (more)
asynchronous with the error condition? e500v2 machine checks are always
asynchronous. From the core manual:

    Machine check interrupts are typically caused by a hardware or
    memory subsystem failure or by an attempt to access an invalid
    address. They may be caused indirectly by execution of an
    instruction, but may not be recognized or reported until long
    after the processor has executed past the instruction that
    caused the machine check. As such, machine check interrupts are
    not thought of as synchronous or asynchronous nor as precise or
    imprecise.

I don't think the lag would be a problem, and certainly it's better than
the current situation.

> Single stepping is hard enough to get right on interaction between QEMU,
> KVM and the guest. I didn't really want to make that stuff any more
> complicated.

I'm not sure that it would add much complexity. We'd just need to check
whether any source other than the magic page wants DBCR0_IC on, to
determine whether to exit to userspace or not.

> This approach is really just one out of many - and it's one that's
> nicely self-contained and shouldn't have any impact at all on
> implementations that don't care about it ;).
"Nicely self-contained" is not a phrase I'd associate with 33 patches, including a bunch of new emulation that probably isn't getting great test coverage. -Scott -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code
On 25.06.14 01:15, Scott Wood wrote:
> On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
>> On 24.06.14 20:53, Scott Wood wrote:
>>> On Sun, 2014-06-22 at 23:23 +0200, Alexander Graf wrote:
>>>> Howdy,
>>>>
>>>> Ben reminded me a while back that we have a nasty race in our KVM PV code.
>>>>
>>>> We replace a few instructions with longer streams of instructions to check
>>>> whether it's necessary to trap out from it (like mtmsr, no need to trap if
>>>> we only disable interrupts). During those replacement chunks we must not get
>>>> any interrupts, because they might overwrite scratch space that we already
>>>> used to save otherwise clobbered register state into.
>>>>
>>>> So we have a thing called "critical sections" which allows us to atomically
>>>> get in and out of "interrupt disabled" modes without touching MSR. When we
>>>> are supposed to deliver an interrupt into the guest while we are in a critical
>>>> section, we just don't inject the interrupt yet, but leave it be until the
>>>> next trap.
>>>>
>>>> However, we never really know when the next trap would be. For all we know it
>>>> could be never. At this point we created a race that is a potential source
>>>> for interrupt loss or at least deferral.
>>>>
>>>> This patch set aims at solving the race. Instead of merely deferring an
>>>> interrupt when we see such a situation, we go into a special instruction
>>>> interpretation mode. In this mode, we interpret all PPC assembler instructions
>>>> that happen until we are out of the critical section again, at which point
>>>> we can now inject the interrupt.
>>>>
>>>> This bug only affects KVM implementations that make use of the magic page, so
>>>> e500v2, book3s_32 and book3s_64 PR KVM.
>>> Would it be possible to single step through the critical section
>>> instead? Or set a high res timer to expire very quickly?
>> There are a few other alternatives to this implementation:
>>
>>   1) Unmap the magic page, emulate all memory access to it while in
>>      critical and irq pending
>>   2) Trigger a timer that sends a request to the vcpu to wake it from
>>      potential sleep and inject the irq
>>   3) Single step until we're beyond the critical section
>>   4) Probably more that I can't think of right now :)
>>
>> Each has their good and bad sides. Unmapping the magic page adds
>> complexity to the MMU mapping code, since we need to make sure we don't
>> map it back in on demand and treat faults to it specially.
>>
>> The timer interrupt works, but I'm not fully convinced that it's a good
>> idea for things like MC events which we also block during critical
>> sections on e500v2.
> Are you concerned about the guest seeing machine checks that are (more)
> asynchronous with the error condition? e500v2 machine checks are always
> asynchronous. From the core manual:
>
>     Machine check interrupts are typically caused by a hardware or
>     memory subsystem failure or by an attempt to access an invalid
>     address. They may be caused indirectly by execution of an
>     instruction, but may not be recognized or reported until long
>     after the processor has executed past the instruction that
>     caused the machine check. As such, machine check interrupts are
>     not thought of as synchronous or asynchronous nor as precise or
>     imprecise.
>
> I don't think the lag would be a problem, and certainly it's better than
> the current situation.

So what value would you set the timer to? If the value is too small, we
never finish the critical section. If it's too big, we add lots of jitter.

>> Single stepping is hard enough to get right on interaction between QEMU,
>> KVM and the guest. I didn't really want to make that stuff any more
>> complicated.
> I'm not sure that it would add much complexity. We'd just need to check
> whether any source other than the magic page wants DBCR0_IC on, to
> determine whether to exit to userspace or not.

What if the guest is single stepping itself?
How do we determine when to unset the bit again? When we get out of the
critical section? How do we know what the value was before we set it?

>> This approach is really just one out of many - and it's one that's
>> nicely self-contained and shouldn't have any impact at all on
>> implementations that don't care about it ;).
> "Nicely self-contained" is not a phrase I'd associate with 33 patches,
> including a bunch of new emulation that probably isn't getting great
> test coverage.

It means that there's only a single entry point for when the code gets
executed, not that it's very little code.

Eventually this emulation code should get merged with the already existing
in-kernel emulation code. Paul had already started work to merge the
emulators a while ago. He even measured speedups when he sent all real mode
and split real mode code via the interpreter rather than the entry/exit
dance we do today.


Alex
Re: [PATCH 00/33] KVM: PPC: Fix IRQ race in magic page code
On Wed, 2014-06-25 at 01:40 +0200, Alexander Graf wrote:
> On 25.06.14 01:15, Scott Wood wrote:
>> On Wed, 2014-06-25 at 00:41 +0200, Alexander Graf wrote:
>>> On 24.06.14 20:53, Scott Wood wrote:
>>> The timer interrupt works, but I'm not fully convinced that it's a good
>>> idea for things like MC events which we also block during critical
>>> sections on e500v2.
>> Are you concerned about the guest seeing machine checks that are (more)
>> asynchronous with the error condition? e500v2 machine checks are always
>> asynchronous. From the core manual:
>>
>>     Machine check interrupts are typically caused by a hardware or
>>     memory subsystem failure or by an attempt to access an invalid
>>     address. They may be caused indirectly by execution of an
>>     instruction, but may not be recognized or reported until long
>>     after the processor has executed past the instruction that
>>     caused the machine check. As such, machine check interrupts are
>>     not thought of as synchronous or asynchronous nor as precise or
>>     imprecise.
>>
>> I don't think the lag would be a problem, and certainly it's better than
>> the current situation.
> So what value would you set the timer to? If the value is too small, we
> never finish the critical section. If it's too big, we add lots of jitter.

Maybe something like 100us? Single stepping would be better, though.

>>> Single stepping is hard enough to get right on interaction between QEMU,
>>> KVM and the guest. I didn't really want to make that stuff any more
>>> complicated.
>> I'm not sure that it would add much complexity. We'd just need to check
>> whether any source other than the magic page wants DBCR0_IC on, to
>> determine whether to exit to userspace or not.
> What if the guest is single stepping itself? How do we determine when to
> unset the bit again? When we get out of the critical section? How do we
> know what the value was before we set it?
Keep track of each requester of single stepping separately, and only ever
set the real bit by ORing them.

-Scott
[Bug 42980] BUG in gfn_to_pfn_prot
https://bugzilla.kernel.org/show_bug.cgi?id=42980

xerofo...@gmail.com changed:

           What    |Removed |Added
----------------------------------------------------------------------
                CC |        |xerofo...@gmail.com

--- Comment #22 from xerofo...@gmail.com ---
Please test against a newer kernel. This bug seems obsolete to me as of
kernel versions released in the 2014 time frame.

Cheers
Nick

-- 
You are receiving this mail because:
You are watching the assignee of the bug.
Re: [PATCH] vfio: Fix endianness handling for emulated BARs
On 06/25/2014 07:54 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2014-06-25 at 00:33 +1000, Alexey Kardashevskiy wrote:
>>
>> I do not understand why @val is considered LE here and needs to be
>> converted to CPU. Really. I truly believe it should be cpu_to_le32().
>
> No. Both are slightly wrong semantically but le32_to_cpu() is less
> wrong :-)
>
> iowrite32 supposedly takes a "cpu" value as argument and writes an "le"
> value. So if anything, you need something that converts to a "cpu" value
> before you call iowrite32.
>
> Now it's still slightly wrong because the "input" to le32_to_cpu() is
> supposed to be an "LE" value but of course here it's not, it's a "cpu"
> value too :-)
>
> But yes, I agree with aw here, either do nothing or stick a set of
> iowriteXX_native or something equivalent in the generic iomap header,
> define them in terms of iowrite32be/iowrite32 based on the compile time
> endian of the arch.

Ok. I'll do nothing.

> Hitting asm-generic/iomap.h I think will cover all archs except ARM.
> For ARM, just hack arch/arm/asm/io.h


-- 
Alexey
Re: How to know that vPMU is enabled or disabled?
On Tue, Jun 24, 2014 at 4:38 PM, Paolo Bonzini wrote:
> Il 24/06/2014 10:33, Jidong Xiao ha scritto:
>
>> I think I have figured this out. According to this patch (as well as
>> the Intel SDM manual), it looks like the PMU is exposed via cpuid
>> leaf 0xa.
>>
>> https://github.com/torvalds/linux/commit/a6c06ed1a60aff77b27ba558c315c3fed4e35565
>>
>> Therefore, in the guest OS, one can run the cpuid instruction with
>> leaf 0xa, and if vPMU is supported/enabled, the return value in
>> eax/ebx/ecx/edx should be something non-zero. In the cloud machine
>> which I am using I see these registers are all zero, therefore I think
>> vPMU is not supported in my virtual machine. Probably it is masked by
>> Qemu.
>
> This is correct. QEMU only enables vPMU if you use "-cpu host".
>
> Paolo

Thanks Paolo.

-Jidong