Some tutorial initiative for beginners
Hi experts, I have been reading through the KVM code as part of my work. I would like to make it useful for beginners; as part of that effort I created a blog with some material. I would appreciate feedback, so that I can decide whether to continue enhancing the effort or drop it altogether. Once again, I really appreciate your time... http://linux-kvm-internals.blogspot.in/2014/08/kvm-linux-kernel-virtual-machine.html Thanks, Ratheesh -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 07/17] mm: madvise MADV_USERFAULT: prepare vm_flags to allow more than 32bits
On Fri, Oct 03, 2014 at 07:07:57PM +0200, Andrea Arcangeli wrote:
> We run out of 32bits in vm_flags, noop change for 64bit archs.
>
> Signed-off-by: Andrea Arcangeli
> ---
>  fs/proc/task_mmu.c       | 4 ++--
>  include/linux/huge_mm.h  | 4 ++--
>  include/linux/ksm.h      | 4 ++--
>  include/linux/mm_types.h | 2 +-
>  mm/huge_memory.c         | 2 +-
>  mm/ksm.c                 | 2 +-
>  mm/madvise.c             | 2 +-
>  mm/mremap.c              | 2 +-
>  8 files changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index c341568..ee1c3a2 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -532,11 +532,11 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  	/*
>  	 * Don't forget to update Documentation/ on changes.
>  	 */
> -	static const char mnemonics[BITS_PER_LONG][2] = {
> +	static const char mnemonics[BITS_PER_LONG+1][2] = {

I believe here and below it should be BITS_PER_LONG_LONG instead: it will catch unknown vm_flags. And the +1 is not needed on 64-bit systems.

>  		/*
>  		 * In case if we meet a flag we don't know about.
>  		 */
> -		[0 ... (BITS_PER_LONG-1)] = "??",
> +		[0 ... (BITS_PER_LONG)] = "??",
>
>  		[ilog2(VM_READ)]	= "rd",
>  		[ilog2(VM_WRITE)]	= "wr",

-- Kirill A. Shutemov
Re: [PATCH 3/5] KVM: x86: Decoding guest instructions which cross page boundary may fail
On Oct 6, 2014, at 11:50 PM, Radim Krčmář wrote: > 2014-10-03 01:10+0300, Nadav Amit: >> Once an instruction crosses a page boundary, the size read from the second page >> disregards the common case that part of the operand resides on the first page. >> As a result, fetches of long instructions may fail, and thereby cause the >> decoding to fail as well. >> >> Signed-off-by: Nadav Amit >> --- > > Good catch, was it thanks to an exhaustive test-suite? > > Reviewed-by: Radim Krčmář It was caught in a test environment. However, I keep wondering how it did not happen in a real guest OS. I think it is due to pure luck, so I recommend putting it in -stable. Nadav
Re: [PATCH v2 0/3] arm/arm64: KVM: Host 48-bit VA support and IPA limits
On Mon, Oct 06, 2014 at 09:30:24PM +0100, Christoffer Dall wrote: > The following host configurations have been tested with KVM on APM > Mustang: [...] > 3) 64KB + 39 bits VA space That would be 42-bit VA space. -- Catalin
Re: [PATCH 5/5] KVM: x86: Using TSC deadline may cause multiple interrupts by user writes
Thanks for reviewing this patch and the rest of the gang. On Oct 6, 2014, at 11:57 PM, Radim Krčmář wrote: > 2014-10-03 01:10+0300, Nadav Amit: >> Setting the TSC deadline MSR from the host (using ioctls) may cause a >> superfluous interrupt. This occurs in the following case: >> >> 1. A TSC deadline timer interrupt is pending. >> 2. The TSC deadline was still not cleared (which happens during vcpu_run). >> 3. Userspace uses KVM_GET_MSRS/KVM_SET_MSRS to load the same deadline MSR. >> >> To solve this situation, ignore host-initiated TSC deadline writes that do not >> change the deadline value. > > I find this change slightly dubious … Why? I see similar handling of MSR_TSC_ADJUST. > - why does the userspace do that? It seems qemu's kvm_cpu_exec does so when it calls kvm_arch_put_registers. It is pretty much done after every exit to userspace. > - why is host_initiated required? Because if the guest writes to the MSR, it means it wants to rearm the TSC deadline. Even if the deadline passed, an interrupt should be triggered. If the guest writes the same value to the deadline MSR twice, it might expect two interrupts. > > Thanks. > > It seems like a performance improvement, so why shouldn't it return when > 'data <= tscdeadline && pending()' > or even > 'data <= now() && pending()' > > (Sorry, I ran out of time to search for answers today.) The bug I encountered is not a performance issue, but a superfluous interrupt (a functional bug). As I said, the guest may write a new deadline MSR value which is in the past and expect an interrupt. Nadav
Re: [PATCH v2 0/3] arm/arm64: KVM: Host 48-bit VA support and IPA limits
On Tue, Oct 07, 2014 at 10:24:58AM +0100, Catalin Marinas wrote: > On Mon, Oct 06, 2014 at 09:30:24PM +0100, Christoffer Dall wrote: > > The following host configurations have been tested with KVM on APM > > Mustang: > [...] > > 3) 64KB + 39 bits VA space > > That would be 42-bit VA space. > Yeah, -ECUTNPASTE, sorry about that. -Christoffer
Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if > userland touches a still unmapped virtual address, a sigbus signal is > sent instead of allocating a new page. The sigbus signal handler will > then resolve the page fault in userland by calling the > remap_anon_pages syscall. Hm. I wonder if this functionality really fits the madvise(2) interface: as far as I understand it, it provides a way to give a *hint* to the kernel which may or may not trigger an action from the kernel side. I don't think an application will behave reasonably if the kernel ignores the *advice* and does not send SIGBUS, but allocates memory. I would suggest considering some other interface for the functionality: a new syscall or, perhaps, mprotect(). -- Kirill A. Shutemov
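For readers following the thread, the proposed flow might look roughly like the pseudocode below. MADV_USERFAULT and remap_anon_pages are the patch set's proposed (never-merged) names, and fill_page_from_network is a hypothetical application helper, so this is an illustrative sketch, not code that runs against any mainline kernel:

```c
/* Illustrative pseudocode only: the MADV_USERFAULT flow as described
 * in the patch. None of the flagged constants/syscalls exist in
 * mainline kernels; helper names are hypothetical. */
static char *region;	/* anonymous mapping with VM_USERFAULT set */
static char *spare;	/* page filled by the application elsewhere */

static void sigbus_handler(int sig, siginfo_t *si, void *uc)
{
	char *fault_addr =
		(char *)((unsigned long)si->si_addr & ~(PAGE_SIZE - 1));

	/* Resolve the fault in userland: materialize the page contents
	 * somewhere else, then atomically move them into place with the
	 * proposed syscall. */
	fill_page_from_network(spare);			/* hypothetical */
	remap_anon_pages(fault_addr, spare, PAGE_SIZE);	/* proposed syscall */
}

/* Setup: */
madvise(region, len, MADV_USERFAULT);	/* proposed flag */
madvise(region, len, MADV_DONTFORK);	/* keep mapcount == 1 across fork */
```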
Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
* Kirill A. Shutemov (kir...@shutemov.name) wrote: > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if > > userland touches a still unmapped virtual address, a sigbus signal is > > sent instead of allocating a new page. The sigbus signal handler will > > then resolve the page fault in userland by calling the > > remap_anon_pages syscall. > > Hm. I wonder if this functionality really fits the madvise(2) interface: as > far as I understand it, it provides a way to give a *hint* to the kernel which > may or may not trigger an action from the kernel side. I don't think an > application will behave reasonably if the kernel ignores the *advice* and does > not send SIGBUS, but allocates memory. Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are expected to do what they say? > I would suggest considering some other interface for the > functionality: a new syscall or, perhaps, mprotect(). Dave > -- > Kirill A. Shutemov -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH v2 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
On Mon, Oct 06, 2014 at 09:30:25PM +0100, Christoffer Dall wrote:
> +/**
> + * kvm_prealloc_hwpgd - allocate initial table for VTTBR
> + * @kvm: The KVM struct pointer for the VM.
> + * @pgd: The kernel pseudo pgd
> + *
> + * When the kernel uses more levels of page tables than the guest, we allocate
> + * a fake PGD and pre-populate it to point to the next-level page table, which
> + * will be the real initial page table pointed to by the VTTBR.
> + *
> + * When KVM_PREALLOC_LEVEL==2, we allocate a single page for the PMD and
> + * the kernel will use folded pud. When KVM_PREALLOC_LEVEL==1, we
> + * allocate 2 consecutive PUD pages.
> + */
> +#if defined(CONFIG_ARM64_64K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 3
> +#define KVM_PREALLOC_LEVEL 2
> +#define PTRS_PER_S2_PGD 1
> +#define S2_PGD_ORDER get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))

I agree that my magic equation wasn't readable ;) (I had trouble re-understanding it as well), but you also have some constants here where it is not immediately obvious where you got them from. IIUC, KVM_PREALLOC_LEVEL == 2 here means that the hardware only understands stage 2 pmd and pte. I guess you could look into the ARM ARM tables but it's still not clear.

Let's look at PTRS_PER_S2_PGD as I think it's simpler. My proposal was:

#if PGDIR_SHIFT > KVM_PHYS_SHIFT
#define PTRS_PER_S2_PGD		(1)
#else
#define PTRS_PER_S2_PGD		(1 << (KVM_PHYS_SHIFT - PGDIR_SHIFT))
#endif

In this case PGDIR_SHIFT is 42, so we get PTRS_PER_S2_PGD == 1. The 4K and 4 levels case below is also correct.

For the KVM start level calculation, we could assume that KVM needs either the host levels or the host levels - 1 (unless we go for some weirdly small KVM_PHYS_SHIFT). So we could define KVM_PREALLOC_LEVEL as:

#if PTRS_PER_S2_PGD <= 16
#define KVM_PREALLOC_LEVEL	(4 - CONFIG_ARM64_PGTABLE_LEVELS + 1)
#else
#define KVM_PREALLOC_LEVEL	(0)
#endif

Basically, if you can concatenate 16 or fewer pages at the level below the top, the architecture does not allow a small top level. In this case, (4 - CONFIG_ARM64_PGTABLE_LEVELS) represents the first level for the host and we add 1 to go to the next level for KVM stage 2 when PTRS_PER_S2_PGD is 16 or fewer. We use 0 when we don't need to preallocate.

> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
> +{
> +	pud_t *pud;
> +	pmd_t *pmd;
> +
> +	pud = pud_offset(pgd, 0);
> +	pmd = (pmd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 0);
> +
> +	if (!pmd)
> +		return -ENOMEM;
> +	pud_populate(NULL, pud, pmd);
> +
> +	return 0;
> +}
> +
> +static inline void kvm_free_hwpgd(struct kvm *kvm)
> +{
> +	pgd_t *pgd = kvm->arch.pgd;
> +	pud_t *pud = pud_offset(pgd, 0);
> +	pmd_t *pmd = pmd_offset(pud, 0);
> +	free_pages((unsigned long)pmd, 0);
> +}
> +
> +static inline phys_addr_t kvm_get_hwpgd(struct kvm *kvm)
> +{
> +	pgd_t *pgd = kvm->arch.pgd;
> +	pud_t *pud = pud_offset(pgd, 0);
> +	pmd_t *pmd = pmd_offset(pud, 0);
> +	return virt_to_phys(pmd);
> +}
> +#elif defined(CONFIG_ARM64_4K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 4
> +#define KVM_PREALLOC_LEVEL 1
> +#define PTRS_PER_S2_PGD 2
> +#define S2_PGD_ORDER get_order(PTRS_PER_S2_PGD * sizeof(pgd_t))

Here PGDIR_SHIFT is 39, so we get PTRS_PER_S2_PGD == (1 << (40 - 39)), which is 2, and KVM_PREALLOC_LEVEL == 1.

> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
> +{
> +	pud_t *pud;
> +
> +	pud = (pud_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1);
> +	if (!pud)
> +		return -ENOMEM;
> +	pgd_populate(NULL, pgd, pud);
> +	pgd_populate(NULL, pgd + 1, pud + PTRS_PER_PUD);
> +
> +	return 0;
> +}

You still need to define these functions, but you can make their implementation depend solely on KVM_PREALLOC_LEVEL rather than on 64K/4K and level combinations. If KVM_PREALLOC_LEVEL is 1, you allocate a pud and populate the pgds (in a loop based on PTRS_PER_S2_PGD). If it is 2, you allocate the pmd and populate the pud (still in a loop, though it would probably be 1 iteration). We know based on the assumption above that you can't get KVM_PREALLOC_LEVEL == 2 and CONFIG_ARM64_PGTABLE_LEVELS == 4.

-- Catalin
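A generalized version along the lines suggested here might look roughly like the following. This is an untested, kernel-internal sketch (it cannot compile standalone, and it assumes the PTRS_PER_S2_PGD and KVM_PREALLOC_LEVEL definitions from the proposal above), not the actual patch:

```c
/* Sketch only: one kvm_prealloc_hwpgd() driven by KVM_PREALLOC_LEVEL
 * instead of per-configuration #ifdef variants. */
static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd)
{
	unsigned int i;
	void *table;

	if (KVM_PREALLOC_LEVEL == 0)
		return 0;	/* nothing to preallocate */

	/* One allocation covering the concatenated next-level tables:
	 * 2 PUD pages when KVM_PREALLOC_LEVEL == 1 (PTRS_PER_S2_PGD == 2),
	 * a single PMD page when KVM_PREALLOC_LEVEL == 2. */
	table = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
					 get_order(PTRS_PER_S2_PGD * PAGE_SIZE));
	if (!table)
		return -ENOMEM;

	for (i = 0; i < PTRS_PER_S2_PGD; i++) {
		if (KVM_PREALLOC_LEVEL == 1)
			/* host pgd -> preallocated pud page(s) */
			pgd_populate(NULL, pgd + i,
				     (pud_t *)table + i * PTRS_PER_PUD);
		else	/* KVM_PREALLOC_LEVEL == 2, folded pud */
			pud_populate(NULL, pud_offset(pgd + i, 0),
				     (pmd_t *)table);
	}
	return 0;
}
```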
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote: > * Kirill A. Shutemov (kir...@shutemov.name) wrote: > > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: > > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the > > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if > > > userland touches a still unmapped virtual address, a sigbus signal is > > > sent instead of allocating a new page. The sigbus signal handler will > > > then resolve the page fault in userland by calling the > > > remap_anon_pages syscall. > > > > Hm. I wonder if this functionality really fits the madvise(2) interface: as > > far as I understand it, it provides a way to give a *hint* to the kernel which > > may or may not trigger an action from the kernel side. I don't think an > > application will behave reasonably if the kernel ignores the *advice* and does > > not send SIGBUS, but allocates memory. > > Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are > expected to do what they say? No. If the kernel ignored MADV_DONTNEED or MADV_DONTDUMP it would not affect correctness; the behaviour would just be suboptimal: more memory used than needed, or wasted space in the coredump. -- Kirill A. Shutemov
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
* Kirill A. Shutemov (kir...@shutemov.name) wrote: > On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote: > > * Kirill A. Shutemov (kir...@shutemov.name) wrote: > > > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: > > > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the > > > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if > > > > userland touches a still unmapped virtual address, a sigbus signal is > > > > sent instead of allocating a new page. The sigbus signal handler will > > > > then resolve the page fault in userland by calling the > > > > remap_anon_pages syscall. > > > > > > Hm. I wonder if this functionality really fits the madvise(2) interface: as > > > far as I understand it, it provides a way to give a *hint* to the kernel > > > which may or may not trigger an action from the kernel side. I don't think an > > > application will behave reasonably if the kernel ignores the *advice* and > > > does not send SIGBUS, but allocates memory. > > > > > > Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are > > > expected to do what they say? > > > > No. If the kernel ignored MADV_DONTNEED or MADV_DONTDUMP it would not > > affect correctness; the behaviour would just be suboptimal: more memory > > used than needed, or wasted space in the coredump. That's not how the manpage reads for DONTNEED; it calls it out as a special case near the top, and explicitly says what will happen if you read an area marked as DONTNEED. It looks like there are openssl patches that use DONTDUMP to explicitly make sure keys etc. don't land in cores. Dave > > -- > Kirill A. Shutemov -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote: > There's one constraint enforced to allow this simplification: the > source pages passed to remap_anon_pages must be mapped only in one > vma, but this is not a limitation when used to handle userland page > faults with MADV_USERFAULT. The source addresses passed to > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to > avoid any risk of the mapcount of the pages increasing, if fork runs > in parallel in another thread, before or while remap_anon_pages runs. Have you considered triggering COW instead of adding a limitation on the pages' mapcount? The limitation looks artificial from an interface POV. -- Kirill A. Shutemov
Re: [Qemu-devel] [PATCH 08/17] mm: madvise MADV_USERFAULT
On Tue, Oct 07, 2014 at 12:01:02PM +0100, Dr. David Alan Gilbert wrote: > * Kirill A. Shutemov (kir...@shutemov.name) wrote: > > On Tue, Oct 07, 2014 at 11:46:04AM +0100, Dr. David Alan Gilbert wrote: > > > Aren't DONTNEED and DONTDUMP similar cases of madvise operations that are > > > expected to do what they say? > > > > No. If the kernel ignored MADV_DONTNEED or MADV_DONTDUMP it would not > > affect correctness; the behaviour would just be suboptimal: more memory > > used than needed, or wasted space in the coredump. > > That's not how the manpage reads for DONTNEED; it calls it out as a special > case near the top, and explicitly says what will happen if you read an > area marked as DONTNEED. You are right. MADV_DONTNEED doesn't fit the interface either. That's bad and we can't fix it. But it's not a reason to make this mistake again. Read the next sentence: "The kernel is free to ignore the advice." Note, POSIX_MADV_DONTNEED has totally different semantics. 
> It looks like there are openssl patches that use DONTDUMP to explicitly > make sure keys etc. don't land in cores. That's nice to have. But openssl works on systems without the interface, meaning it's not essential for functionality. -- Kirill A. Shutemov
Re: [PATCH v2 01/15] virtio_pci: fix virtio spec compliance on restore
On (Mon) 06 Oct 2014 [18:10:40], Michael S. Tsirkin wrote: > On restore, virtio pci does the following: > + set features > + init vqs etc - device can be used at this point! > + set ACKNOWLEDGE, DRIVER and DRIVER_OK status bits > > This is in violation of the virtio spec, which > requires the following order: > - ACKNOWLEDGE > - DRIVER > - init vqs > - DRIVER_OK > > This behaviour will break with hypervisors that assume spec compliant > behaviour. It seems like a good idea to have this patch applied to > stable branches to reduce the support burden for the hypervisors. > > Cc: sta...@vger.kernel.org > Cc: Amit Shah > Signed-off-by: Michael S. Tsirkin I didn't see my previous questions answered from the initial posting -- can you please respond to them? Amit
Re: Possible to backport this vhost-net fix to 3.10?
On 07/10/14 04:47, Dmitry Petuhov wrote: > 07.10.2014 7:42, Dmitry Petuhov wrote: >> 05.10.2014 19:44, Michael S. Tsirkin wrote: >>> OK but pls cleanup indentation, it's all scrambled. You'll also need to add proper attribution (using From: header), your signature etc. >> I don't understand what's wrong with the indentation. > Oh. I didn't realize which mailing list this is. OK, I'll read https://www.kernel.org/doc/Documentation/SubmittingPatches and submit it correctly. Hello Dmitry, That's great that you will do it, thanks :-) Eddie
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli wrote: > > Of course if somebody has better ideas on how to resolve an anonymous > userfault they're welcome. So I'd *much* rather have a "write()" style interface (ie _copying_ bytes from user space into a newly allocated page that gets mapped) than a "remap page" style interface. Remapping anonymous pages involves page-table games that really aren't necessarily a good idea, and TLB invalidates for the old page etc. Just don't do it. Linus
Re: [PATCH v2 01/15] virtio_pci: fix virtio spec compliance on restore
On Tue, Oct 07, 2014 at 05:35:13PM +0530, Amit Shah wrote: > On (Mon) 06 Oct 2014 [18:10:40], Michael S. Tsirkin wrote: > > On restore, virtio pci does the following: > > + set features > > + init vqs etc - device can be used at this point! > > + set ACKNOWLEDGE, DRIVER and DRIVER_OK status bits > > > > This is in violation of the virtio spec, which > > requires the following order: > > - ACKNOWLEDGE > > - DRIVER > > - init vqs > > - DRIVER_OK > > > > This behaviour will break with hypervisors that assume spec compliant > > behaviour. It seems like a good idea to have this patch applied to > > stable branches to reduce the support burden for the hypervisors. > > > > Cc: sta...@vger.kernel.org > > Cc: Amit Shah > > Signed-off-by: Michael S. Tsirkin > > I didn't see my previous questions answered from the initial posting > -- can you please respond to them? > > Amit Oops, missed your mail. Will respond now.
Re: [PATCH v2 01/15] virtio_pci: fix virtio spec compliance on restore
On Mon, Oct 06, 2014 at 06:10:40PM +0300, Michael S. Tsirkin wrote:
> On restore, virtio pci does the following:
> + set features
> + init vqs etc - device can be used at this point!
> + set ACKNOWLEDGE, DRIVER and DRIVER_OK status bits
>
> This is in violation of the virtio spec, which requires the following order:
> - ACKNOWLEDGE
> - DRIVER
> - init vqs
> - DRIVER_OK
>
> This behaviour will break with hypervisors that assume spec compliant
> behaviour. It seems like a good idea to have this patch applied to
> stable branches to reduce the support burden for the hypervisors.

Tested suspend to RAM with virtio net and blk.

> Cc: sta...@vger.kernel.org
> Cc: Amit Shah
> Signed-off-by: Michael S. Tsirkin
> ---
>  drivers/virtio/virtio_pci.c | 36 
>  1 file changed, 32 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
> index 3d1463c..0f2db51 100644
> --- a/drivers/virtio/virtio_pci.c
> +++ b/drivers/virtio/virtio_pci.c
> @@ -789,6 +789,7 @@ static int virtio_pci_restore(struct device *dev)
>  	struct pci_dev *pci_dev = to_pci_dev(dev);
>  	struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
>  	struct virtio_driver *drv;
> +	unsigned status = 0;
>  	int ret;
>
>  	drv = container_of(vp_dev->vdev.dev.driver,
> @@ -799,14 +800,41 @@ static int virtio_pci_restore(struct device *dev)
>  		return ret;
>
>  	pci_set_master(pci_dev);
> +	/* We always start by resetting the device, in case a previous
> +	 * driver messed it up. */
> +	vp_reset(&vp_dev->vdev);
> +
> +	/* Acknowledge that we've seen the device. */
> +	status |= VIRTIO_CONFIG_S_ACKNOWLEDGE;
> +	vp_set_status(&vp_dev->vdev, status);
> +
> +	/* Maybe driver failed before freeze.
> +	 * Restore the failed status, for debugging. */
> +	status |= vp_dev->saved_status & VIRTIO_CONFIG_S_FAILED;
> +	vp_set_status(&vp_dev->vdev, status);
> +
> +	if (!drv)
> +		return 0;
> +
> +	/* We have a driver! */
> +	status |= VIRTIO_CONFIG_S_DRIVER;
> +	vp_set_status(&vp_dev->vdev, status);
> +
>  	vp_finalize_features(&vp_dev->vdev);
>
> -	if (drv && drv->restore)
> -		ret = drv->restore(&vp_dev->vdev);
> +	if (!drv->restore)
> +		return 0;

As Amit points out, this is a bug: for a driver without a restore callback we should still set DRIVER_OK. Will post v3.

> +
> +	ret = drv->restore(&vp_dev->vdev);
> +	if (ret) {
> +		status |= VIRTIO_CONFIG_S_FAILED;
> +		vp_set_status(&vp_dev->vdev, status);
> +		return ret;
> +	}
>
>  	/* Finally, tell the device we're all set */
> -	if (!ret)
> -		vp_set_status(&vp_dev->vdev, vp_dev->saved_status);
> +	status |= VIRTIO_CONFIG_S_DRIVER_OK;
> +	vp_set_status(&vp_dev->vdev, status);
>
>  	return ret;
> }
> --
> MST
Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
Hi Kirill, On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote: > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if > > userland touches a still unmapped virtual address, a sigbus signal is > > sent instead of allocating a new page. The sigbus signal handler will > > then resolve the page fault in userland by calling the > > remap_anon_pages syscall. > > Hm. I wonder if this functionality really fits the madvise(2) interface: as > far as I understand it, it provides a way to give a *hint* to the kernel which > may or may not trigger an action from the kernel side. I don't think an > application will behave reasonably if the kernel ignores the *advice* and does > not send SIGBUS, but allocates memory. > > I would suggest considering some other interface for the > functionality: a new syscall or, perhaps, mprotect(). I didn't feel like adding PROT_USERFAULT to mprotect, which looks hardwired to just these flags: PROT_NONE (the memory cannot be accessed at all), PROT_READ (the memory can be read), PROT_WRITE (the memory can be modified), PROT_EXEC (the memory can be executed). Normally mprotect doesn't just alter the vmas; it also alters pte/hugepmd protection bits, which is something that is never needed with VM_USERFAULT, so I don't consider VM_USERFAULT a protection change to the VMA. mprotect is also hardwired to mangle only the VM_READ|WRITE|EXEC flags, while madvise is ideal for setting arbitrary vma flags. From an implementation standpoint, the perfect place to set a flag in a vma is madvise. This is what MADV_DONTFORK (which sets VM_DONTCOPY) already does too, in an identical way to MADV_USERFAULT/VM_USERFAULT. 
MADV_DONTFORK is as critical as MADV_USERFAULT because people depend on it, for example, to prevent the O_DIRECT vs fork race condition that results in silent data corruption during I/O with threads that may fork. The other reason why MADV_DONTFORK is critical is that fork() would otherwise fail with OOM unless full overcommit is enabled (i.e. pci hotplug crashes the guest if you forget to set MADV_DONTFORK). Another madvise that would generate a failure if not obeyed by the kernel is MADV_DONTNEED, which, if it did nothing, could lead to OOM killing. We don't inflate virt balloons using munmap, just to make an example. Various other apps (maybe JVM garbage collection too) make extensive use of MADV_DONTNEED and depend on it. That said, I can change it to mprotect; the only thing I don't like is that it will result in a less clean patch, and I can't see a practical risk in keeping it simpler with madvise, as long as we always return -EINVAL whenever we encounter a vma type that cannot raise userfaults yet (which is something I already enforced). Yet another option would be to drop MADV_USERFAULT and vm_flags&VM_USERFAULT entirely, and in turn the ability to handle userfaults with SIGBUS, and retain only the userfaultfd. The new userfaultfd protocol requires registering each created userfaultfd into its own private virtual memory ranges (that is to allow an unlimited number of userfaultfds per process). Currently the userfaultfd engages iff the fault address intersects both the MADV_USERFAULT range and the userfaultfd registered ranges. So I could drop MADV_USERFAULT and VM_USERFAULT and just check for vma->vm_userfaultfd_ctx != NULL to know if the userfaultfd protocol needs to be engaged during the first page fault for a still unmapped virtual address. I just thought it would be more flexible to also allow SIGBUS without forcing people to use userfaultfd (which is in fact the only reason to still retain madvise(MADV_USERFAULT)!). 
The earlier volatile pages patches, for example, only supported the SIGBUS behavior... and I didn't intend to force them to use userfaultfd if they're guaranteed to access the memory with the CPU and never through a kernel syscall (which is something the app can enforce by design). userfaultfd becomes necessary the moment you want to handle userfaults through syscalls/gup etc... qemu obviously requires userfaultfd and it never uses the userfaultfd-less SIGBUS behavior, as it touches the memory in all possible ways (first and foremost with the KVM page fault, which uses almost all variants of gup..). So here somebody should comment and choose between:

1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of the current madvise(MADV_USERFAULT)

2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the userfaultfd protocol as the only way for userland to catch userfaults (each userfaultfd must already register itself into its own virtual memory ranges, so it's a trivial change for userfaultfd users that deletes just 1 or 2 lines of userland code, but it would prevent using the SIGBUS behavior with info->si_addr=faultaddr for other users)

3) keep
Re: [PATCH v2 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
On 07/10/14 11:48, Catalin Marinas wrote: > On Mon, Oct 06, 2014 at 09:30:25PM +0100, Christoffer Dall wrote: >> +/** >> + * kvm_prealloc_hwpgd - allocate inital table for VTTBR >> + * @kvm: The KVM struct pointer for the VM. >> + * @pgd: The kernel pseudo pgd >> + * >> + * When the kernel uses more levels of page tables than the guest, we >> allocate >> + * a fake PGD and pre-populate it to point to the next-level page table, >> which >> + * will be the real initial page table pointed to by the VTTBR. >> + * >> + * When KVM_PREALLOC_LEVEL==2, we allocate a single page for the PMD and >> + * the kernel will use folded pud. When KVM_PREALLOC_LEVEL==1, we >> + * allocate 2 consecutive PUD pages. >> + */ >> +#if defined(CONFIG_ARM64_64K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 3 >> +#define KVM_PREALLOC_LEVEL 2 >> +#define PTRS_PER_S2_PGD1 >> +#define S2_PGD_ORDER get_order(PTRS_PER_S2_PGD * sizeof(pgd_t)) > > I agree that my magic equation wasn't readable ;) (I had troubles > re-understanding it as well), but you also have some constants here that > are not immediately obvious where you got to them from. IIUC, > KVM_PREALLOC_LEVEL == 2 here means that the hardware only understands > stage 2 pmd and pte. I guess you could look into the ARM ARM tables but > it's still not clear. > > Let's look at PTRS_PER_S2_PGD as I think it's simpler. My proposal was: > > #if PGDIR_SHIFT > KVM_PHYS_SHIFT > #define PTRS_PER_S2_PGD (1) > #else > #define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - > PGDIR_SHIFT)) > #endif > > In this case PGDIR_SHIFT is 42, so we get PTRS_PER_S2_PGD == 1. The 4K > and 4 levels case below is also correct. > > The KVM start level calculation, we could assume that KVM needs either > host levels or host levels - 1 (unless we go for some weirdly small > KVM_PHYS_SHIFT). 
So we could define KVM_PREALLOC_LEVEL as: > > #if PTRS_PER_S2_PGD <= 16 > #define KVM_PREALLOC_LEVEL (4 - CONFIG_ARM64_PGTABLE_LEVELS + 1) > #else > #define KVM_PREALLOC_LEVEL (0) > #endif > > Basically if you can concatenate 16 or fewer pages at the level below the > top, the architecture does not allow a small top level. In this case, > (4 - CONFIG_ARM64_PGTABLE_LEVELS) represents the first level for the > host and we add 1 to go to the next level for KVM stage 2 when > PTRS_PER_S2_PGD is 16 or less. We use 0 when we don't need to > preallocate. I think this makes the whole thing clearer (at least for me), as it makes the relationship between KVM_PREALLOC_LEVEL and CONFIG_ARM64_PGTABLE_LEVELS explicit (it wasn't completely obvious to me initially). >> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd) >> +{ >> + pud_t *pud; >> + pmd_t *pmd; >> + >> + pud = pud_offset(pgd, 0); >> + pmd = (pmd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 0); >> + >> + if (!pmd) >> + return -ENOMEM; >> + pud_populate(NULL, pud, pmd); >> + >> + return 0; >> +} >> + >> +static inline void kvm_free_hwpgd(struct kvm *kvm) >> +{ >> + pgd_t *pgd = kvm->arch.pgd; >> + pud_t *pud = pud_offset(pgd, 0); >> + pmd_t *pmd = pmd_offset(pud, 0); >> + free_pages((unsigned long)pmd, 0); >> +} >> + >> +static inline phys_addr_t kvm_get_hwpgd(struct kvm *kvm) >> +{ >> + pgd_t *pgd = kvm->arch.pgd; >> + pud_t *pud = pud_offset(pgd, 0); >> + pmd_t *pmd = pmd_offset(pud, 0); >> + return virt_to_phys(pmd); >> + >> +} >> +#elif defined(CONFIG_ARM64_4K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 4 >> +#define KVM_PREALLOC_LEVEL 1 >> +#define PTRS_PER_S2_PGD 2 >> +#define S2_PGD_ORDER get_order(PTRS_PER_S2_PGD * sizeof(pgd_t)) > > Here PGDIR_SHIFT is 39, so we get PTRS_PER_S2_PGD == (1 << (40 - 39)) > which is 2 and KVM_PREALLOC_LEVEL == 1. 
> >> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd) >> +{ >> + pud_t *pud; >> + >> + pud = (pud_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1); >> + if (!pud) >> + return -ENOMEM; >> + pgd_populate(NULL, pgd, pud); >> + pgd_populate(NULL, pgd + 1, pud + PTRS_PER_PUD); >> + >> + return 0; >> +} > > You still need to define these functions but you can make their > implementation dependent solely on KVM_PREALLOC_LEVEL rather than > 64K/4K and levels combinations. If KVM_PREALLOC_LEVEL is 1, you > allocate a pud and populate the pgds (in a loop based on > PTRS_PER_S2_PGD). If it is 2, you allocate the pmd and populate the pud > (still in a loop, though it would probably be 1 iteration). We know based > on the assumption above that you can't get KVM_PREALLOC_LEVEL == 2 and > CONFIG_ARM64_PGTABLE_LEVELS == 4. > Also agreed. Most of what you wrote here could also be gathered as comments in the patch. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [Qemu-devel] [PATCH 10/17] mm: rmap preparation for remap_anon_pages
Hi Kirill, On Tue, Oct 07, 2014 at 02:10:26PM +0300, Kirill A. Shutemov wrote: > On Fri, Oct 03, 2014 at 07:08:00PM +0200, Andrea Arcangeli wrote: > > There's one constraint enforced to allow this simplification: the > > source pages passed to remap_anon_pages must be mapped only in one > > vma, but this is not a limitation when used to handle userland page > > faults with MADV_USERFAULT. The source addresses passed to > > remap_anon_pages should be set as VM_DONTCOPY with MADV_DONTFORK to > > avoid any risk of the mapcount of the pages increasing, if fork runs > > in parallel in another thread, before or while remap_anon_pages runs. > > Have you considered triggering COW instead of adding a limitation on the > pages' mapcount? The limitation looks artificial from an interface POV. I haven't considered it, mostly because I see it as a feature that it returns -EBUSY. I prefer to avoid the risk of userland getting a successful retval but internally the kernel silently behaving non-zerocopy by mistake, because some userland bug forgot to set MADV_DONTFORK on the src_vma. COW would not be zerocopy, so it's not ok. We get sub-1msec latency for userfaults through 10gbit and we don't want to risk wasting CPU caches. I did, however, consider allowing the strict behavior (i.e. the feature) to be extended later in a backwards compatible way. We could provide a non-zerocopy behavior with a RAP_ALLOW_COW flag that would then turn the -EBUSY error into a copy. It's also more complex to implement the COW now, and it would make the code that really matters harder to review. So it may be preferable to extend this later in a backwards compatible way with a new RAP_ALLOW_COW flag. The current handling of the flags is already written in a way that should allow backwards compatible extension with RAP_ALLOW_*: #define RAP_ALLOW_SRC_HOLES (1UL<<0) SYSCALL_DEFINE4(remap_anon_pages, unsigned long, dst_start, unsigned long, src_start, unsigned long, len, unsigned long, flags) [..] long err = -EINVAL; [..] 
if (flags & ~RAP_ALLOW_SRC_HOLES) return err; -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
Hi Christoffer, On 06/10/14 21:30, Christoffer Dall wrote: > This patch adds the necessary support for all host kernel PGSIZE and > VA_SPACE configuration options for both EL2 and the Stage-2 page tables. > > However, for 40bit and 42bit PARange systems, the architecture mandates > that VTCR_EL2.SL0 is maximum 1, resulting in fewer levels of stage-2 > page tables than levels of host kernel page tables. At the same time, for > systems with a PARange > 42bit, we limit the IPA range by always setting > VTCR_EL2.T0SZ to 24. > > To solve the situation with different levels of page tables for Stage-2 > translation than the host kernel page tables, we allocate a dummy PGD > with pointers to our actual initial level Stage-2 page table, in order > for us to reuse the kernel pgtable manipulation primitives. Reproducing > all these in KVM does not look pretty and unnecessarily complicates the > 32-bit side. > > Systems with a PARange < 40bits are not yet supported. > > [ I have reworked this patch from its original form submitted by > Jungseok to take the architecture constraints into consideration. > There were too many changes from the original patch for me to > preserve the authorship. Thanks to Catalin Marinas for his help in > figuring out a good solution to this challenge. I have also fixed > various bugs and missing error code handling from the original > patch. - Christoffer ] > > Cc: Marc Zyngier > Cc: Catalin Marinas > Signed-off-by: Jungseok Lee > Signed-off-by: Christoffer Dall On top of Catalin's review, I have the following comments: [...] 
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c > index bb06f76..3b3e18f 100644 > --- a/arch/arm/kvm/mmu.c > +++ b/arch/arm/kvm/mmu.c > @@ -42,7 +42,7 @@ static unsigned long hyp_idmap_start; > static unsigned long hyp_idmap_end; > static phys_addr_t hyp_idmap_vector; > > -#define pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) > +#define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t)) > > #define kvm_pmd_huge(_x) (pmd_huge(_x) || pmd_trans_huge(_x)) > > @@ -158,7 +158,7 @@ static void unmap_pmds(struct kvm *kvm, pud_t *pud, > } > } while (pmd++, addr = next, addr != end); > > - if (kvm_pmd_table_empty(start_pmd)) > + if (kvm_pmd_table_empty(start_pmd) && (!kvm || KVM_PREALLOC_LEVEL < 2)) This really feels clunky. Can we fold the additional tests inside kvm_pmd_table_empty(), taking kvm as an additional parameter? > clear_pud_entry(kvm, pud, start_addr); > } > > @@ -182,7 +182,7 @@ static void unmap_puds(struct kvm *kvm, pgd_t *pgd, > } > } while (pud++, addr = next, addr != end); > > - if (kvm_pud_table_empty(start_pud)) > + if (kvm_pud_table_empty(start_pud) && (!kvm || KVM_PREALLOC_LEVEL < 1)) Same here. Thanks, M. -- Jazz is not dead. It just smells funny...
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
Hello, On Tue, Oct 07, 2014 at 08:47:59AM -0400, Linus Torvalds wrote: > On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli wrote: > > > > Of course if somebody has better ideas on how to resolve an anonymous > > userfault they're welcome. > > So I'd *much* rather have a "write()" style interface (ie _copying_ > bytes from user space into a newly allocated page that gets mapped) > than a "remap page" style interface > > remapping anonymous pages involves page table games that really aren't > necessarily a good idea, and tlb invalidates for the old page etc. > Just don't do it. I see what you mean. The only con I see is that we then couldn't use recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr, PAGE_SIZE, ..) and retain the zerocopy behavior. Or how could we? There's no recvfile(userfaultfd, socketfd, PAGE_SIZE). Ideally, if we could prevent the page data coming from the network from ever becoming visible in the kernel, we could avoid the TLB flush and also be zerocopy, but I can't see how we could achieve that. The page data could come through an ssh pipe or anything (qemu supports all kinds of network transports for live migration), which is why leaving the network protocol in userland is preferable. As things stand now, I'm afraid with a write() syscall we couldn't do it zerocopy. We'd still need to receive the memory in a temporary page and then copy it to a kernel page (invisible to userland while we write to it) to later map into the userfault address. If it wasn't for the TLB flush of the old page, the remap_anon_pages variant would be more optimal than doing a copy through a write syscall. Is the copy cheaper than a TLB flush? I probably naively assumed the TLB flush was always cheaper. Another idea that comes to mind, to add the ability to switch between copy and TLB flush, is a RAP_FORCE_COPY flag, which would then do a copy inside remap_anon_pages and leave the original page mapped in place... 
(and such a flag would also disable the -EBUSY error if page_mapcount is > 1). So then, if the RAP_FORCE_COPY flag is set, remap_anon_pages would behave like you suggested (but with a mremap-like interface, instead of a write syscall) and we could benchmark the difference between copy and TLB flush too. We could even periodically benchmark it at runtime and switch over to the faster method (the more CPUs there are in the host and the more threads the process has, the faster the copy will be compared to the TLB flush). Of course in terms of API I could implement the exact same mechanism as described above for remap_anon_pages inside a write() to the userfaultfd (it's a pseudo inode). It'd need two different commands to prepare for the coming write (with a len multiple of PAGE_SIZE), to know the address the page should be mapped at and whether to behave zerocopy or to skip the TLB flush and copy. Because the copy vs TLB flush trade-off is possible to achieve with both interfaces, I think it really boils down to choosing between an mremap-like interface or a file+commands protocol interface. I tend to like mremap more, that's why I opted for a remap_anon_pages syscall kept orthogonal to the userfaultfd functionality (remap_anon_pages could also be used standalone as an accelerated mremap in some circumstances) but nothing prevents just embedding the same mechanism inside userfaultfd if a file+commands API is preferable. Or we could add a different syscall (separated from userfaultfd) that creates another pseudofd to write a command plus the page data into it. I just wouldn't see the point of creating a pseudofd only to copy a page atomically; the write() syscall would look better if the userfaultfd is already open for other reasons and the pseudofd overhead is required anyway. 
Last thing to keep in mind is that if using userfaults with SIGBUS and without userfaultfd, remap_anon_pages would still have been useful, so if we retain the SIGBUS behavior for volatile pages and we don't force the usage of userfaultfd, it may be cleaner not to use userfaultfd but a separate pseudofd to do the write() syscall though. Otherwise the app would need to open the userfaultfd to resolve the fault even though it's not using the userfaultfd protocol, which doesn't look like an intuitive interface to me. Comments welcome. Thanks, Andrea
Re: [PATCH 08/17] mm: madvise MADV_USERFAULT
On Tue, Oct 07, 2014 at 03:24:58PM +0200, Andrea Arcangeli wrote: > Hi Kirill, > > On Tue, Oct 07, 2014 at 01:36:45PM +0300, Kirill A. Shutemov wrote: > > On Fri, Oct 03, 2014 at 07:07:58PM +0200, Andrea Arcangeli wrote: > > > MADV_USERFAULT is a new madvise flag that will set VM_USERFAULT in the > > > vma flags. Whenever VM_USERFAULT is set in an anonymous vma, if > > > userland touches a still unmapped virtual address, a sigbus signal is > > > sent instead of allocating a new page. The sigbus signal handler will > > > then resolve the page fault in userland by calling the > > > remap_anon_pages syscall. > > > > Hm. I wonder if this functionality really fits the madvise(2) interface: as > > far as I understand it, it provides a way to give a *hint* to the kernel which > > may or may not trigger an action from the kernel side. I don't think an > > application will behave reasonably if the kernel ignores the *advice* and > > allocates memory instead of sending SIGBUS. > > > > I would suggest considering some other interface for the > > functionality: a new syscall or, perhaps, mprotect(). > > I didn't feel like adding PROT_USERFAULT to mprotect, which looks > hardwired to just these flags: PROT_NOALLOC, maybe? > > PROT_NONE The memory cannot be accessed at all. > > PROT_READ The memory can be read. > > PROT_WRITE The memory can be modified. > > PROT_EXEC The memory can be executed. To be complete: PROT_GROWSDOWN, PROT_GROWSUP and the unused PROT_SEM. 
> So here somebody should comment and choose between: > > 1) set VM_USERFAULT with mprotect(PROT_USERFAULT) instead of > the current madvise(MADV_USERFAULT) > > 2) drop MADV_USERFAULT and VM_USERFAULT and force the usage of the > userfaultfd protocol as the only way for userland to catch > userfaults (each userfaultfd must already register itself into its > own virtual memory ranges, so it's a trivial change for userfaultfd > users that deletes just 1 or 2 lines of userland code, but it would > prevent using the SIGBUS behavior with info->si_addr=faultaddr for > other users) > > 3) keep things as they are now: use MADV_USERFAULT for SIGBUS > userfaults, with optional intersection between the > vm_flags&VM_USERFAULT ranges and the userfaultfd registered ranges > with vma->vm_userfaultfd_ctx!=NULL to know whether to engage the > userfaultfd protocol instead of the plain SIGBUS 4) new syscall? > I will update the code according to feedback, so please comment. I don't have a strong opinion on this. I just *feel* it doesn't fit advice semantics. The only userspace interface I've designed has not been proven good by time. I would listen to what the senior maintainers say. :) -- Kirill A. Shutemov
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote: > an mremap-like interface or a file+commands protocol interface. I tend to > like mremap more, that's why I opted for a remap_anon_pages syscall > kept orthogonal to the userfaultfd functionality (remap_anon_pages > could also be used standalone as an accelerated mremap in some > circumstances) but nothing prevents just embedding the same mechanism Sorry for the self-followup, but something else comes to mind to elaborate this further. In terms of interfaces, the most efficient I could think of, to minimize the kernel enter/exit overhead, would be to append the "source address" of the data received from the network transport to the userfaultfd_write() command (by appending 8 bytes to the wakeup command). That said, mixing the mechanism to be notified about userfaults with the mechanism to resolve a userfault looks like a complication to me. I kind of liked keeping the userfaultfd protocol very simple, doing just its thing. The userfaultfd doesn't need to know how the userfault was resolved; even mremap would work theoretically (until we run out of vmas). I thought it was simpler to keep it that way. However, if we want to resolve the fault with a "write()" syscall this may be the most efficient way to do it, as we're already doing a write() into the pseudofd to wake up the page fault that contains the destination address; I just need to append the source address to the wakeup command. I probably grossly overestimated the benefits of resolving the userfault with a zerocopy page move, sorry. So if we entirely drop the zerocopy behavior and the TLB flush of the old page like you suggested, the way to keep the userfaultfd mechanism decoupled from the userfault resolution mechanism would be to implement an atomic-copy syscall. That would work for SIGBUS userfaults too without requiring a pseudofd then. 
It would be enough then to call mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraint that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic wouldn't page fault or call GUP into the destination address (it can't, otherwise the in-flight partial copy would be visible to the process, breaking the atomicity of the copy), but it would fill in the pte/trans_huge_pmd with the same strict behavior that remap_anon_pages currently has (in turn it would by design bypass the VM_USERFAULT check and be ideal for resolving userfaults). mcopy_atomic could then also be extended to tmpfs, and it would work without requiring the source page to be a tmpfs page and without having to convert page types on the fly. If I add mcopy_atomic, the patch in subject (10/17) can be dropped of course, so it'd be even less intrusive than the current remap_anon_pages and it would require zero TLB flushes during its runtime (it would just require an atomic copy). So should I try to embed mcopy_atomic inside userfault_write, or can I expose it to userland as a standalone new syscall? Or should I do something different? Comments? Thanks, Andrea
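For clarity, the proposed call could be summarized as the following prototype. This is illustrative only: mcopy_atomic was a proposal at this point in the thread, not a merged kernel interface, so the name and signature are tentative.

```
/* Illustrative prototype only -- not a merged kernel interface. */
long mcopy_atomic(unsigned long dst_addr,  /* the faulting address to fill */
                  unsigned long src_addr,  /* temporary buffer with the data */
                  unsigned long len);      /* multiple of PAGE_SIZE */

/*
 * Semantics sketched in the mail above:
 *  - copy len bytes into freshly allocated pages that userland cannot
 *    yet see, then install the pte/trans_huge_pmd in a single step, so
 *    the in-flight partial copy is never visible to the process;
 *  - never fault or GUP into dst_addr while copying;
 *  - bypass the VM_USERFAULT check by design, resolving the userfault.
 */
```

Whether this surfaces as a standalone syscall or as a command written into the userfaultfd pseudofd is exactly the open question posed at the end of the mail.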
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Tue, Oct 7, 2014 at 8:52 AM, Andrea Arcangeli wrote: > On Tue, Oct 07, 2014 at 04:19:13PM +0200, Andrea Arcangeli wrote: >> an mremap-like interface or a file+commands protocol interface. I tend to >> like mremap more, that's why I opted for a remap_anon_pages syscall >> kept orthogonal to the userfaultfd functionality (remap_anon_pages >> could also be used standalone as an accelerated mremap in some >> circumstances) but nothing prevents just embedding the same mechanism > > Sorry for the self-followup, but something else comes to mind to > elaborate this further. > > In terms of interfaces, the most efficient I could think of, to minimize > the kernel enter/exit overhead, would be to append the "source address" of the > data received from the network transport to the userfaultfd_write() > command (by appending 8 bytes to the wakeup command). That said, > mixing the mechanism to be notified about userfaults with the > mechanism to resolve a userfault looks like a complication to me. I kind > of liked keeping the userfaultfd protocol very simple, doing > just its thing. The userfaultfd doesn't need to know how the userfault > was resolved; even mremap would work theoretically (until we run out > of vmas). I thought it was simpler to keep it that way. However, if we > want to resolve the fault with a "write()" syscall this may be the > most efficient way to do it, as we're already doing a write() into the > pseudofd to wake up the page fault that contains the destination > address; I just need to append the source address to the wakeup command. > > I probably grossly overestimated the benefits of resolving the > userfault with a zerocopy page move, sorry. So if we entirely drop the > zerocopy behavior and the TLB flush of the old page like you > suggested, the way to keep the userfaultfd mechanism decoupled from > the userfault resolution mechanism would be to implement an > atomic-copy syscall. That would work for SIGBUS userfaults too without > requiring a pseudofd then. 
It would be enough then to call > mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraint > that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic > wouldn't page fault or call GUP into the destination address (it can't, > otherwise the in-flight partial copy would be visible to the process, > breaking the atomicity of the copy), but it would fill in the > pte/trans_huge_pmd with the same strict behavior that remap_anon_pages > currently has (in turn it would by design bypass the VM_USERFAULT > check and be ideal for resolving userfaults). At the risk of asking a possibly useless question, would it make sense to splice data into a userfaultfd? --Andy > > mcopy_atomic could then also be extended to tmpfs, and it would work > without requiring the source page to be a tmpfs page and without > having to convert page types on the fly. > > If I add mcopy_atomic, the patch in subject (10/17) can be dropped of > course, so it'd be even less intrusive than the current > remap_anon_pages and it would require zero TLB flushes during its > runtime (it would just require an atomic copy). > > So should I try to embed mcopy_atomic inside userfault_write, or can > I expose it to userland as a standalone new syscall? Or should I do > something different? Comments? > > Thanks, > Andrea -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Tue, Oct 07, 2014 at 05:52:47PM +0200, Andrea Arcangeli wrote: > I probably grossly overestimated the benefits of resolving the > userfault with a zerocopy page move, sorry. [...] For posterity, I think it's worth noting that the most expensive aspect of a TLB shootdown is the interprocessor interrupt necessary to flush other CPUs' TLBs. On a many-core machine, copying 4K of data looks pretty cheap compared to taking an interrupt and invalidating TLBs on many cores :-) > [...] So if we entirely drop the > zerocopy behavior and the TLB flush of the old page like you > suggested, the way to keep the userfaultfd mechanism decoupled from > the userfault resolution mechanism would be to implement an > atomic-copy syscall. That would work for SIGBUS userfaults too without > requiring a pseudofd then. It would be enough then to call > mcopy_atomic(userfault_addr,tmp_addr,len) with the only constraint > that len must be a multiple of PAGE_SIZE. Of course mcopy_atomic > wouldn't page fault or call GUP into the destination address (it can't, > otherwise the in-flight partial copy would be visible to the process, > breaking the atomicity of the copy), but it would fill in the > pte/trans_huge_pmd with the same strict behavior that remap_anon_pages > currently has (in turn it would by design bypass the VM_USERFAULT > check and be ideal for resolving userfaults). > > mcopy_atomic could then also be extended to tmpfs, and it would work > without requiring the source page to be a tmpfs page and without > having to convert page types on the fly. > > If I add mcopy_atomic, the patch in subject (10/17) can be dropped of > course, so it'd be even less intrusive than the current > remap_anon_pages and it would require zero TLB flushes during its > runtime (it would just require an atomic copy). I like this new approach. It will be good to have a single interface for resolving anon and tmpfs userfaults. 
> So should I try to embed mcopy_atomic inside userfault_write, or can > I expose it to userland as a standalone new syscall? Or should I do > something different? Comments? One interesting (ab)use of userfault_write would be that the faulting process and the fault-handling process could be different, which would be necessary for post-copy live migration in CRIU (http://criu.org). Aside from the aesthetic difference, I can't think of any advantage in favor of a syscall. Peter
Re: [PATCH v2 01/15] virtio_pci: fix virtio spec compliance on restore
On (Tue) 07 Oct 2014 [15:53:55], Michael S. Tsirkin wrote: > On Mon, Oct 06, 2014 at 06:10:40PM +0300, Michael S. Tsirkin wrote: > > On restore, virtio pci does the following: > > + set features > > + init vqs etc - device can be used at this point! > > + set ACKNOWLEDGE, DRIVER and DRIVER_OK status bits > > > > This is in violation of the virtio spec, which > > requires the following order: > > - ACKNOWLEDGE > > - DRIVER > > - init vqs > > - DRIVER_OK > > > > This behaviour will break with hypervisors that assume spec compliant > > behaviour. It seems like a good idea to have this patch applied to > > stable branches to reduce the support burden for the hypervisors. > > Tested suspend to ram with virtio net and blk. I'd recommend running a continuous ping for virtio-net and a dd process running across s3/s4 for testing. That's what I had run while doing the original set of patches. Amit
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On Tue, Oct 7, 2014 at 10:19 AM, Andrea Arcangeli wrote: > > I see what you mean. The only con I see is that we then couldn't use > recv(tmp_addr, PAGE_SIZE), remap_anon_pages(faultaddr, tmp_addr, > PAGE_SIZE, ..) and retain the zerocopy behavior. Or how could we? > There's no recvfile(userfaultfd, socketfd, PAGE_SIZE). You're doing completely faulty math, and you haven't thought it through. Your "zero-copy" case is no such thing. Who cares if some packet receive is zero-copy, when you need to set up the anonymous page to *receive* the zero copy into, which involves page allocation, page zeroing, page table setup with VM and page table locking, etc etc. The thing is, the whole concept of "zero-copy" is pure and utter bullshit. Sun made a big deal about the whole concept back in the nineties, and IT DOES NOT WORK. It's a scam. Don't buy into it. It's crap. It's made-up and not real. Then, once you've allocated and cleared the page and mapped it in, your "zero-copy" model involves looking up the page in the page tables again (locking etc), then doing that zero-copy to the page. Then, when you remap it, you look it up in the page tables AGAIN, with locking, move it around, have to clear the old page table entry (which involves a locked cmpxchg64), a TLB flush with most likely a cross-CPU IPI - since the people who do this are all threaded and want many CPU's - and then you insert the page into the new place. That's *insane*. It's crap. All just to try to avoid one page copy. Don't do it. remapping games really are complete BS. They never beat just copying the data. It's that simple. > As things stand now, I'm afraid with a write() syscall we couldn't do > it zerocopy. Really, you need to rethink your whole "zerocopy" model. It's broken. Nobody sane cares. You've bought into a model that Sun already showed doesn't work. 
The only time zero-copy works is in random benchmarks that are very careful to not touch the data at all at any point, and also try to make sure that the benchmark is very much single-threaded so that you never have the whole cross-CPU IPI issue for the TLB invalidate. Then, and only then, can zero-copy win. And it's just not a realistic scenario. > If it wasn't for the TLB flush of the old page, the remap_anon_pages > variant would be more optimal than doing a copy through a write > syscall. Is the copy cheaper than a TLB flush? I probably naively > assumed the TLB flush was always cheaper. A local TLB flush is cheap. That's not the problem. The problem is the setup of the page, and the clearing of the page, and the cross-CPU TLB flush. And the page table locking, etc etc. So no, I'm not AT ALL worried about a single "invlpg" instruction. That's nothing. Local CPU TLB flushes of single pages are basically free. But that really isn't where the costs are. Quite frankly, the *only* time page remapping has ever made sense is when it is used for realloc() kind of purposes, where you need to remap pages not because of zero-copy, but because you need to change the virtual address space layout. And then you make sure it's not a common operation, because you're not doing it as a performance win, you're doing it because you're changing your virtual layout. Really. Zerocopy is for benchmarks, and for things like "splice()" when you can avoid the page tables *entirely*. But the notion that page remapping of user pages is faster than a copy is pure and utter garbage. It's simply not true. So I really think you should aim for a "write()": kind of interface. With write, you may not get the zero-copy, but on the other hand it allows you to re-use the source buffer many times without having to allocate new pages and map it in etc. 
So a "read()+write()" loop (or, quite commonly a "generate data computationally from a compressed source + write()" loop) is actually much more efficient than the zero-copy remapping, because you don't have all the complexities and overheads in creating the source page. It is possible that that could involve "splice()" too, although I don't really think the source data tends to be in page-aligned chunks. But hey, splice() at least *can* be faster than copying (and then we have vmsplice() not because it's magically faster, but because it can under certain circumstances be worth it, and it kind of made sense to allow the interface, but I really don't think it's used very much or very useful). Linus -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
On 07/10/2014 19:07, Dr. David Alan Gilbert wrote: >> > >> > So I'd *much* rather have a "write()" style interface (ie _copying_ >> > bytes from user space into a newly allocated page that gets mapped) >> > than a "remap page" style interface > Something like that might work for the postcopy case; it doesn't work > for some of the other uses that need to stop a page being changed by the > guest, but then need to somehow get a copy of that page internally to QEMU, > and perhaps provide it back later. I cannot parse this. Which uses do you have in mind? Is it QEMU-specific or is it for other applications of userfaults? As long as the page is atomically mapped, I'm not sure what the difference from remap_anon_pages is (as far as the destination page is concerned). Are you thinking of having userfaults enabled on the source as well? Paolo > remap_anon_pages worked for those cases > as well; I can't think of another current way of doing it in userspace.
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
* Paolo Bonzini (pbonz...@redhat.com) wrote: > On 07/10/2014 19:07, Dr. David Alan Gilbert wrote: > >> > > >> > So I'd *much* rather have a "write()" style interface (ie _copying_ > >> > bytes from user space into a newly allocated page that gets mapped) > >> > than a "remap page" style interface > > Something like that might work for the postcopy case; it doesn't work > > for some of the other uses that need to stop a page being changed by the > > guest, but then need to somehow get a copy of that page internally to QEMU, > > and perhaps provide it back later. > > I cannot parse this. Which uses do you have in mind? Is it for > QEMU-specific or is it for other applications of userfaults? > As long as the page is atomically mapped, I'm not sure what the > difference from remap_anon_pages are (as far as the destination page is > concerned). Are you thinking of having userfaults enabled on the source > as well? What I'm talking about here is when I want to stop a page being accessed by the guest, do something with the data in QEMU, and give it back to the guest sometime later. The main example is: memory pools for guests, where you swap RAM between a series of VM hosts. You have to take the page out and send it over the wire; sometime later, if the guest tries to access it, you userfault and pull it back. (There are at least one or two companies selling something like this, and at least one Linux-based implementation with its own much more involved kernel hacks.) Dave -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH 10/17] mm: rmap preparation for remap_anon_pages
* Linus Torvalds (torva...@linux-foundation.org) wrote: > On Mon, Oct 6, 2014 at 12:41 PM, Andrea Arcangeli wrote: > > > > Of course if somebody has better ideas on how to resolve an anonymous > > userfault they're welcome. > > So I'd *much* rather have a "write()" style interface (ie _copying_ > bytes from user space into a newly allocated page that gets mapped) > than a "remap page" style interface Something like that might work for the postcopy case; it doesn't work for some of the other uses that need to stop a page being changed by the guest, but then need to somehow get a copy of that page internally to QEMU, and perhaps provide it back later. remap_anon_pages worked for those cases as well; I can't think of another current way of doing it in userspace. I'm thinking here of systems for making VMs with memory larger than a single host's; that's something that's not as well thought out. I've also seen people writing emulation that wants to trap and emulate some page accesses while still having the original data available to the emulator itself. So yes, OK for now, but the result is less general. Dave > remapping anonymous pages involves page table games that really aren't > necessarily a good idea, and tlb invalidates for the old page etc. > Just don't do it. > >Linus -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
[PATCH RFC 02/11] virtio: add support for 64 bit features.
From: Rusty Russell Change the u32 to a u64, and make sure to use 1ULL everywhere! Cc: Brian Swetland Cc: Christian Borntraeger [Thomas Huth: fix up virtio-ccw get_features] Signed-off-by: Rusty Russell Signed-off-by: Cornelia Huck Acked-by: Pawel Moll Acked-by: Ohad Ben-Cohen --- drivers/char/virtio_console.c |2 +- drivers/lguest/lguest_device.c | 10 +- drivers/remoteproc/remoteproc_virtio.c |5 - drivers/s390/kvm/kvm_virtio.c | 10 +- drivers/s390/kvm/virtio_ccw.c | 29 - drivers/virtio/virtio.c| 12 ++-- drivers/virtio/virtio_mmio.c | 14 +- drivers/virtio/virtio_pci.c|5 ++--- drivers/virtio/virtio_ring.c |2 +- include/linux/virtio.h |2 +- include/linux/virtio_config.h |8 tools/virtio/linux/virtio.h|2 +- tools/virtio/linux/virtio_config.h |2 +- 13 files changed, 64 insertions(+), 39 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index c4a437e..f9c6288 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -355,7 +355,7 @@ static inline bool use_multiport(struct ports_device *portdev) */ if (!portdev->vdev) return 0; - return portdev->vdev->features & (1 << VIRTIO_CONSOLE_F_MULTIPORT); + return portdev->vdev->features & (1ULL << VIRTIO_CONSOLE_F_MULTIPORT); } static DEFINE_SPINLOCK(dma_bufs_lock); diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c index c831c47..4d29bcd 100644 --- a/drivers/lguest/lguest_device.c +++ b/drivers/lguest/lguest_device.c @@ -94,17 +94,17 @@ static unsigned desc_size(const struct lguest_device_desc *desc) } /* This gets the device's feature bits. */ -static u32 lg_get_features(struct virtio_device *vdev) +static u64 lg_get_features(struct virtio_device *vdev) { unsigned int i; - u32 features = 0; + u64 features = 0; struct lguest_device_desc *desc = to_lgdev(vdev)->desc; u8 *in_features = lg_features(desc); /* We do this the slow but generic way. 
*/ - for (i = 0; i < min(desc->feature_len * 8, 32); i++) + for (i = 0; i < min(desc->feature_len * 8, 64); i++) if (in_features[i / 8] & (1 << (i % 8))) - features |= (1 << i); + features |= (1ULL << i); return features; } @@ -144,7 +144,7 @@ static void lg_finalize_features(struct virtio_device *vdev) memset(out_features, 0, desc->feature_len); bits = min_t(unsigned, desc->feature_len, sizeof(vdev->features)) * 8; for (i = 0; i < bits; i++) { - if (vdev->features & (1 << i)) + if (vdev->features & (1ULL << i)) out_features[i / 8] |= (1 << (i % 8)); } diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c index dafaf38..627737e 100644 --- a/drivers/remoteproc/remoteproc_virtio.c +++ b/drivers/remoteproc/remoteproc_virtio.c @@ -207,7 +207,7 @@ static void rproc_virtio_reset(struct virtio_device *vdev) } /* provide the vdev features as retrieved from the firmware */ -static u32 rproc_virtio_get_features(struct virtio_device *vdev) +static u64 rproc_virtio_get_features(struct virtio_device *vdev) { struct rproc_vdev *rvdev = vdev_to_rvdev(vdev); struct fw_rsc_vdev *rsc; @@ -227,6 +227,9 @@ static void rproc_virtio_finalize_features(struct virtio_device *vdev) /* Give virtio_ring a chance to accept features */ vring_transport_features(vdev); + /* Make sure we don't have any features > 32 bits! */ + BUG_ON((u32)vdev->features != vdev->features); + /* * Remember the finalized features of our vdev, and provide it * to the remote processor once it is powered on. diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c index d747ca4..6d4cbea 100644 --- a/drivers/s390/kvm/kvm_virtio.c +++ b/drivers/s390/kvm/kvm_virtio.c @@ -80,16 +80,16 @@ static unsigned desc_size(const struct kvm_device_desc *desc) } /* This gets the device's feature bits. 
*/ -static u32 kvm_get_features(struct virtio_device *vdev) +static u64 kvm_get_features(struct virtio_device *vdev) { unsigned int i; - u32 features = 0; + u64 features = 0; struct kvm_device_desc *desc = to_kvmdev(vdev)->desc; u8 *in_features = kvm_vq_features(desc); - for (i = 0; i < min(desc->feature_len * 8, 32); i++) + for (i = 0; i < min(desc->feature_len * 8, 64); i++) if (in_features[i / 8] & (1 << (i % 8))) - features |= (1 << i); + features |= (1ULL << i); return features; } @@ -106,7 +106,7 @@ static void kvm_finalize_feat
[PATCH RFC 01/11] linux-headers/virtio_config: Update with VIRTIO_F_VERSION_1
From: Thomas Huth Add the new VIRTIO_F_VERSION_1 definition to the virtio_config.h linux header. Signed-off-by: Thomas Huth Signed-off-by: Cornelia Huck --- linux-headers/linux/virtio_config.h |3 +++ 1 file changed, 3 insertions(+) diff --git a/linux-headers/linux/virtio_config.h b/linux-headers/linux/virtio_config.h index 75dc20b..16aa289 100644 --- a/linux-headers/linux/virtio_config.h +++ b/linux-headers/linux/virtio_config.h @@ -54,4 +54,7 @@ /* Can the device handle any descriptor layout? */ #define VIRTIO_F_ANY_LAYOUT27 +/* v1.0 compliant. */ +#define VIRTIO_F_VERSION_1 32 + #endif /* _LINUX_VIRTIO_CONFIG_H */ -- 1.7.9.5
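VIRTIO_F_VERSION_1 being bit 32 is exactly why the sibling series changes `features` to u64 and uses `1ULL` everywhere: with a plain `int` constant, `1 << 32` is undefined behaviour. A minimal standalone illustration (the helper name is made up, not from the patches):

```c
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32  /* bit from the header patch above */

/* Test a feature bit in a 64-bit feature word. The shift constant must
 * be 1ULL: a 32-bit (1 << 32) would be undefined behaviour and could
 * never reach VIRTIO_F_VERSION_1. */
static inline int virtio_has_feature_bit(uint64_t features, unsigned int bit)
{
    return (features & (1ULL << bit)) != 0;
}
```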
[PATCH RFC 04/11] virtio_ring: implement endian reversal based on VERSION_1 feature.
From: Rusty Russell [Cornelia Huck: we don't need the vq->vring.num -> vq->ring_mask change] Signed-off-by: Rusty Russell Signed-off-by: Cornelia Huck --- drivers/virtio/virtio_ring.c | 195 ++ 1 file changed, 138 insertions(+), 57 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 1cfb5ba..350c39b 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -145,42 +145,54 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq, i = 0; for (n = 0; n < out_sgs; n++) { for (sg = sgs[n]; sg; sg = next(sg, &total_out)) { - desc[i].flags = VRING_DESC_F_NEXT; - desc[i].addr = sg_phys(sg); - desc[i].len = sg->length; - desc[i].next = i+1; + desc[i].flags = cpu_to_virtio_u16(vq->vq.vdev, + VRING_DESC_F_NEXT); + desc[i].addr = cpu_to_virtio_u64(vq->vq.vdev, +sg_phys(sg)); + desc[i].len = cpu_to_virtio_u32(vq->vq.vdev, + sg->length); + desc[i].next = cpu_to_virtio_u16(vq->vq.vdev, +i+1); i++; } } for (; n < (out_sgs + in_sgs); n++) { for (sg = sgs[n]; sg; sg = next(sg, &total_in)) { - desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE; - desc[i].addr = sg_phys(sg); - desc[i].len = sg->length; - desc[i].next = i+1; + desc[i].flags = cpu_to_virtio_u16(vq->vq.vdev, + VRING_DESC_F_NEXT| + VRING_DESC_F_WRITE); + desc[i].addr = cpu_to_virtio_u64(vq->vq.vdev, +sg_phys(sg)); + desc[i].len = cpu_to_virtio_u32(vq->vq.vdev, + sg->length); + desc[i].next = cpu_to_virtio_u16(vq->vq.vdev, i+1); i++; } } - BUG_ON(i != total_sg); /* Last one doesn't continue. 
*/ - desc[i-1].flags &= ~VRING_DESC_F_NEXT; + desc[i-1].flags &= ~cpu_to_virtio_u16(vq->vq.vdev, VRING_DESC_F_NEXT); desc[i-1].next = 0; - /* We're about to use a buffer */ - vq->vq.num_free--; - /* Use a single buffer which doesn't continue */ head = vq->free_head; - vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT; - vq->vring.desc[head].addr = virt_to_phys(desc); + vq->vring.desc[head].flags = + cpu_to_virtio_u16(vq->vq.vdev, VRING_DESC_F_INDIRECT); + vq->vring.desc[head].addr = + cpu_to_virtio_u64(vq->vq.vdev, virt_to_phys(desc)); /* kmemleak gives a false positive, as it's hidden by virt_to_phys */ kmemleak_ignore(desc); - vq->vring.desc[head].len = i * sizeof(struct vring_desc); + vq->vring.desc[head].len = + cpu_to_virtio_u32(vq->vq.vdev, i * sizeof(struct vring_desc)); - /* Update free pointer */ + BUG_ON(i != total_sg); + + /* Update free pointer (we store this in native endian) */ vq->free_head = vq->vring.desc[head].next; + /* We've just used a buffer */ + vq->vq.num_free--; + return head; } @@ -199,6 +211,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, struct scatterlist *sg; unsigned int i, n, avail, uninitialized_var(prev), total_sg; int head; + u16 nexti; START_USE(vq); @@ -253,26 +266,46 @@ static inline int virtqueue_add(struct virtqueue *_vq, vq->vq.num_free -= total_sg; head = i = vq->free_head; + for (n = 0; n < out_sgs; n++) { for (sg = sgs[n]; sg; sg = next(sg, &total_out)) { - vq->vring.desc[i].flags = VRING_DESC_F_NEXT; - vq->vring.desc[i].addr = sg_phys(sg); - vq->vring.desc[i].len = sg->length; + vq->vring.desc[i].flags = + cpu_to_virtio_u16(vq->vq.vdev, + VRING_DESC_F_NEXT); + vq->vring.desc[i].addr = + cpu_to_virtio_u64(vq->vq.vdev, sg_phys(sg)); + vq->vring.desc[i].len = + cpu_to_virtio_u32(vq->vq.vdev, sg->length); + + /* We chained .next in native: fix endian. */ + nexti = vq->vring.d
[PATCH RFC 07/11] virtio_net: use v1.0 endian.
From: Rusty Russell [Cornelia Huck: converted some missed fields] Signed-off-by: Rusty Russell Signed-off-by: Cornelia Huck --- drivers/net/virtio_net.c | 31 +++ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 59caa06..cd18946 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -353,13 +353,14 @@ err: } static struct sk_buff *receive_mergeable(struct net_device *dev, +struct virtnet_info *vi, struct receive_queue *rq, unsigned long ctx, unsigned int len) { void *buf = mergeable_ctx_to_buf_address(ctx); struct skb_vnet_hdr *hdr = buf; - int num_buf = hdr->mhdr.num_buffers; + u16 num_buf = virtio_to_cpu_u16(rq->vq->vdev, hdr->mhdr.num_buffers); struct page *page = virt_to_head_page(buf); int offset = buf - page_address(page); unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx)); @@ -375,7 +376,9 @@ static struct sk_buff *receive_mergeable(struct net_device *dev, ctx = (unsigned long)virtqueue_get_buf(rq->vq, &len); if (unlikely(!ctx)) { pr_debug("%s: rx error: %d buffers out of %d missing\n", -dev->name, num_buf, hdr->mhdr.num_buffers); +dev->name, num_buf, +virtio_to_cpu_u16(rq->vq->vdev, + hdr->mhdr.num_buffers)); dev->stats.rx_length_errors++; goto err_buf; } @@ -460,7 +463,7 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len) } if (vi->mergeable_rx_bufs) - skb = receive_mergeable(dev, rq, (unsigned long)buf, len); + skb = receive_mergeable(dev, vi, rq, (unsigned long)buf, len); else if (vi->big_packets) skb = receive_big(dev, rq, buf, len); else @@ -479,8 +482,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len) if (hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { pr_debug("Needs csum!\n"); if (!skb_partial_csum_set(skb, - hdr->hdr.csum_start, - hdr->hdr.csum_offset)) + virtio_to_cpu_u16(vi->vdev, hdr->hdr.csum_start), + virtio_to_cpu_u16(vi->vdev, hdr->hdr.csum_offset))) goto frame_err; } else if 
(hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) { skb->ip_summed = CHECKSUM_UNNECESSARY; @@ -511,7 +514,8 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len) if (hdr->hdr.gso_type & VIRTIO_NET_HDR_GSO_ECN) skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN; - skb_shinfo(skb)->gso_size = hdr->hdr.gso_size; + skb_shinfo(skb)->gso_size = virtio_to_cpu_u16(vi->vdev, + hdr->hdr.gso_size); if (skb_shinfo(skb)->gso_size == 0) { net_warn_ratelimited("%s: zero gso size.\n", dev->name); goto frame_err; @@ -871,16 +875,19 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) if (skb->ip_summed == CHECKSUM_PARTIAL) { hdr->hdr.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; - hdr->hdr.csum_start = skb_checksum_start_offset(skb); - hdr->hdr.csum_offset = skb->csum_offset; + hdr->hdr.csum_start = cpu_to_virtio_u16(vi->vdev, + skb_checksum_start_offset(skb)); + hdr->hdr.csum_offset = cpu_to_virtio_u16(vi->vdev, +skb->csum_offset); } else { hdr->hdr.flags = 0; hdr->hdr.csum_offset = hdr->hdr.csum_start = 0; } if (skb_is_gso(skb)) { - hdr->hdr.hdr_len = skb_headlen(skb); - hdr->hdr.gso_size = skb_shinfo(skb)->gso_size; + hdr->hdr.hdr_len = cpu_to_virtio_u16(vi->vdev, skb_headlen(skb)); + hdr->hdr.gso_size = cpu_to_virtio_u16(vi->vdev, + skb_shinfo(skb)->gso_size); if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV4) hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) @@ -1181,7 +1188,7 @@ static void virtnet_set_rx_mode(struct net_device *dev) sg_init_table(sg, 2); /* Store the unicast
[PATCH RFC 01/11] virtio: use u32, not bitmap for struct virtio_device's features
From: Rusty Russell It seemed like a good idea, but it's actually a pain when we get more than 32 feature bits. Just change it to a u32 for now. Cc: Brian Swetland Cc: Christian Borntraeger Signed-off-by: Rusty Russell Signed-off-by: Cornelia Huck Acked-by: Pawel Moll Acked-by: Ohad Ben-Cohen --- drivers/char/virtio_console.c |2 +- drivers/lguest/lguest_device.c |8 drivers/remoteproc/remoteproc_virtio.c |2 +- drivers/s390/kvm/kvm_virtio.c |2 +- drivers/s390/kvm/virtio_ccw.c | 23 +-- drivers/virtio/virtio.c| 10 +- drivers/virtio/virtio_mmio.c |8 ++-- drivers/virtio/virtio_pci.c|3 +-- drivers/virtio/virtio_ring.c |2 +- include/linux/virtio.h |3 +-- include/linux/virtio_config.h |2 +- tools/virtio/linux/virtio.h| 22 +- tools/virtio/linux/virtio_config.h |2 +- tools/virtio/virtio_test.c |5 ++--- tools/virtio/vringh_test.c | 16 15 files changed, 39 insertions(+), 71 deletions(-) diff --git a/drivers/char/virtio_console.c b/drivers/char/virtio_console.c index b585b47..c4a437e 100644 --- a/drivers/char/virtio_console.c +++ b/drivers/char/virtio_console.c @@ -355,7 +355,7 @@ static inline bool use_multiport(struct ports_device *portdev) */ if (!portdev->vdev) return 0; - return portdev->vdev->features[0] & (1 << VIRTIO_CONSOLE_F_MULTIPORT); + return portdev->vdev->features & (1 << VIRTIO_CONSOLE_F_MULTIPORT); } static DEFINE_SPINLOCK(dma_bufs_lock); diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c index d0a1d8a..c831c47 100644 --- a/drivers/lguest/lguest_device.c +++ b/drivers/lguest/lguest_device.c @@ -137,14 +137,14 @@ static void lg_finalize_features(struct virtio_device *vdev) vring_transport_features(vdev); /* -* The vdev->feature array is a Linux bitmask: this isn't the same as a -* the simple array of bits used by lguest devices for features. So we -* do this slow, manual conversion which is completely general. +* Since lguest is currently x86-only, we're little-endian. That +* means we could just memcpy. 
But it's not time critical, and in +* case someone copies this code, we do it the slow, obvious way. */ memset(out_features, 0, desc->feature_len); bits = min_t(unsigned, desc->feature_len, sizeof(vdev->features)) * 8; for (i = 0; i < bits; i++) { - if (test_bit(i, vdev->features)) + if (vdev->features & (1 << i)) out_features[i / 8] |= (1 << (i % 8)); } diff --git a/drivers/remoteproc/remoteproc_virtio.c b/drivers/remoteproc/remoteproc_virtio.c index a34b506..dafaf38 100644 --- a/drivers/remoteproc/remoteproc_virtio.c +++ b/drivers/remoteproc/remoteproc_virtio.c @@ -231,7 +231,7 @@ static void rproc_virtio_finalize_features(struct virtio_device *vdev) * Remember the finalized features of our vdev, and provide it * to the remote processor once it is powered on. */ - rsc->gfeatures = vdev->features[0]; + rsc->gfeatures = vdev->features; } static void rproc_virtio_get(struct virtio_device *vdev, unsigned offset, diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c index a134965..d747ca4 100644 --- a/drivers/s390/kvm/kvm_virtio.c +++ b/drivers/s390/kvm/kvm_virtio.c @@ -106,7 +106,7 @@ static void kvm_finalize_features(struct virtio_device *vdev) memset(out_features, 0, desc->feature_len); bits = min_t(unsigned, desc->feature_len, sizeof(vdev->features)) * 8; for (i = 0; i < bits; i++) { - if (test_bit(i, vdev->features)) + if (vdev->features & (1 << i)) out_features[i / 8] |= (1 << (i % 8)); } } diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c index d2c0b44..c5acd19 100644 --- a/drivers/s390/kvm/virtio_ccw.c +++ b/drivers/s390/kvm/virtio_ccw.c @@ -701,7 +701,6 @@ static void virtio_ccw_finalize_features(struct virtio_device *vdev) { struct virtio_ccw_device *vcdev = to_vc_device(vdev); struct virtio_feature_desc *features; - int i; struct ccw1 *ccw; ccw = kzalloc(sizeof(*ccw), GFP_DMA | GFP_KERNEL); @@ -715,19 +714,15 @@ static void virtio_ccw_finalize_features(struct virtio_device *vdev) /* Give virtio_ring a chance to 
accept features. */ vring_transport_features(vdev); - for (i = 0; i < sizeof(*vdev->features) / sizeof(features->features); -i++) { - int highbits = i % 2 ? 32 : 0; - features->index = i; - features->features = cpu_to_le32(vdev->features[i / 2] -
[PATCH RFC 00/11] linux: towards virtio-1 guest support
This patchset tries to go towards implementing both virtio-1 compliant and transitional virtio drivers in Linux. Branch available at git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux virtio-1 This is based on some old patches by Rusty to handle extended feature bits and endianness conversions. Thomas implemented the new virtio-ccw transport revision command, and I hacked up some further endianness stuff and virtio-ccw enablement. Probably a lot still missing, but I can run a virtio-ccw guest that enables virtio-1 accesses if the host supports it (via the qemu host patchset) - virtio-net and virtio-blk only so far. I consider this patchset a starting point for further discussions. Cornelia Huck (5): virtio: endianess conversion helpers virtio: allow transports to get avail/used addresses virtio_blk: use virtio v1.0 endian KVM: s390: virtio-ccw revision 1 SET_VQ KVM: s390: enable virtio-ccw revision 1 Rusty Russell (5): virtio: use u32, not bitmap for struct virtio_device's features virtio: add support for 64 bit features. virtio_ring: implement endian reversal based on VERSION_1 feature. virtio_config: endian conversion for v1.0. virtio_net: use v1.0 endian. 
Thomas Huth (1): KVM: s390: Set virtio-ccw transport revision drivers/block/virtio_blk.c |4 + drivers/char/virtio_console.c |2 +- drivers/lguest/lguest_device.c | 16 +-- drivers/net/virtio_net.c | 31 +++-- drivers/remoteproc/remoteproc_virtio.c |7 +- drivers/s390/kvm/kvm_virtio.c | 10 +- drivers/s390/kvm/virtio_ccw.c | 165 - drivers/virtio/virtio.c| 22 ++-- drivers/virtio/virtio_mmio.c | 20 +-- drivers/virtio/virtio_pci.c|8 +- drivers/virtio/virtio_ring.c | 213 +++- include/linux/virtio.h | 46 ++- include/linux/virtio_config.h | 17 +-- include/uapi/linux/virtio_config.h |3 + tools/virtio/linux/virtio.h| 22 +--- tools/virtio/linux/virtio_config.h |2 +- tools/virtio/virtio_test.c |5 +- tools/virtio/vringh_test.c | 16 +-- 18 files changed, 428 insertions(+), 181 deletions(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC 02/11] virtio: cull virtio_bus_set_vdev_features
The only user of this function was virtio-ccw, and it should use virtio_set_features() like everybody else: We need to make sure that bad features are masked out properly, which this function did not do. Reviewed-by: Thomas Huth Signed-off-by: Cornelia Huck --- hw/s390x/virtio-ccw.c |3 +-- hw/virtio/virtio-bus.c | 14 -- include/hw/virtio/virtio-bus.h |3 --- 3 files changed, 1 insertion(+), 19 deletions(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index e7d3ea1..7833dd2 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -400,8 +400,7 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw) ccw.cda + sizeof(features.features)); features.features = ldl_le_phys(&address_space_memory, ccw.cda); if (features.index < ARRAY_SIZE(dev->host_features)) { -virtio_bus_set_vdev_features(&dev->bus, features.features); -vdev->guest_features = features.features; +virtio_set_features(vdev, features.features); } else { /* * If the guest supports more feature bits, assert that it diff --git a/hw/virtio/virtio-bus.c b/hw/virtio/virtio-bus.c index eb77019..a8ffa07 100644 --- a/hw/virtio/virtio-bus.c +++ b/hw/virtio/virtio-bus.c @@ -109,20 +109,6 @@ uint32_t virtio_bus_get_vdev_features(VirtioBusState *bus, return k->get_features(vdev, requested_features); } -/* Set the features of the plugged device. */ -void virtio_bus_set_vdev_features(VirtioBusState *bus, - uint32_t requested_features) -{ -VirtIODevice *vdev = virtio_bus_get_device(bus); -VirtioDeviceClass *k; - -assert(vdev != NULL); -k = VIRTIO_DEVICE_GET_CLASS(vdev); -if (k->set_features != NULL) { -k->set_features(vdev, requested_features); -} -} - /* Get bad features of the plugged device. 
*/ uint32_t virtio_bus_get_vdev_bad_features(VirtioBusState *bus) { diff --git a/include/hw/virtio/virtio-bus.h b/include/hw/virtio/virtio-bus.h index 0756545..0d2e7b4 100644 --- a/include/hw/virtio/virtio-bus.h +++ b/include/hw/virtio/virtio-bus.h @@ -84,9 +84,6 @@ size_t virtio_bus_get_vdev_config_len(VirtioBusState *bus); /* Get the features of the plugged device. */ uint32_t virtio_bus_get_vdev_features(VirtioBusState *bus, uint32_t requested_features); -/* Set the features of the plugged device. */ -void virtio_bus_set_vdev_features(VirtioBusState *bus, - uint32_t requested_features); /* Get bad features of the plugged device. */ uint32_t virtio_bus_get_vdev_bad_features(VirtioBusState *bus); /* Get config of the plugged device. */ -- 1.7.9.5
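The masking behaviour that virtio_set_features() provides, and that the removed virtio_bus_set_vdev_features() lacked, boils down to ANDing the guest's request against what the host offers. A toy model, not the QEMU code (names here are made up):

```c
#include <stdint.h>

/* Toy model of feature negotiation: bits the guest requests but the
 * host never offered must be dropped before they are stored as the
 * device's guest_features -- this is the masking the patch restores
 * by routing through virtio_set_features(). */
static inline uint32_t negotiate_features(uint32_t host_features,
                                          uint32_t guest_request)
{
    return host_features & guest_request;
}
```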
[PATCH RFC 06/11] virtio: allow virtio-1 queue layout
For virtio-1 devices, we allow a more complex queue layout that doesn't require the descriptor table and rings on a physically contiguous memory area: add virtio_queue_set_rings() to allow transports to set this up. Signed-off-by: Cornelia Huck --- hw/virtio/virtio.c | 16 include/hw/virtio/virtio.h |2 ++ 2 files changed, 18 insertions(+) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index e6ae3a0..147d227 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -96,6 +96,13 @@ static void virtqueue_init(VirtQueue *vq) { hwaddr pa = vq->pa; +if (pa == -1ULL) { +/* + * This is a virtio-1 style vq that has already been setup + * in virtio_queue_set. + */ +return; +} vq->vring.desc = pa; vq->vring.avail = pa + vq->vring.num * sizeof(VRingDesc); vq->vring.used = vring_align(vq->vring.avail + @@ -719,6 +726,15 @@ hwaddr virtio_queue_get_addr(VirtIODevice *vdev, int n) return vdev->vq[n].pa; } +void virtio_queue_set_rings(VirtIODevice *vdev, int n, hwaddr desc, +hwaddr avail, hwaddr used) +{ +vdev->vq[n].pa = -1ULL; +vdev->vq[n].vring.desc = desc; +vdev->vq[n].vring.avail = avail; +vdev->vq[n].vring.used = used; +} + void virtio_queue_set_num(VirtIODevice *vdev, int n, int num) { /* Don't allow guest to flip queue between existent and diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index 40e567c..f840320 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -227,6 +227,8 @@ void virtio_queue_set_addr(VirtIODevice *vdev, int n, hwaddr addr); hwaddr virtio_queue_get_addr(VirtIODevice *vdev, int n); void virtio_queue_set_num(VirtIODevice *vdev, int n, int num); int virtio_queue_get_num(VirtIODevice *vdev, int n); +void virtio_queue_set_rings(VirtIODevice *vdev, int n, hwaddr desc, +hwaddr avail, hwaddr used); void virtio_queue_set_align(VirtIODevice *vdev, int n, int align); void virtio_queue_notify(VirtIODevice *vdev, int n); uint16_t virtio_queue_vector(VirtIODevice *vdev, int n); -- 1.7.9.5
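For contrast with virtio_queue_set_rings(), the legacy contiguous layout that virtqueue_init() computes from a single base address can be modelled like this. This is a simplified sketch, not the QEMU code: the 16-byte descriptor size comes from the vring layout, and the helper names stand in for sizeof(VRingDesc) and vring_align().

```c
#include <stdint.h>

#define DESC_ENTRY_SIZE 16u  /* size of one vring descriptor entry */

static inline uint64_t align_up(uint64_t x, uint64_t a)
{
    return (x + a - 1) & ~(a - 1);
}

/* Legacy (pre-1.0) layout: the avail ring immediately follows the
 * descriptor table. */
static inline uint64_t legacy_avail_addr(uint64_t pa, unsigned int num)
{
    return pa + (uint64_t)num * DESC_ENTRY_SIZE;
}

/* The used ring follows the avail ring (flags + idx, two u16 fields,
 * plus num u16 ring entries) at the next alignment boundary. */
static inline uint64_t legacy_used_addr(uint64_t pa, unsigned int num,
                                        uint64_t align)
{
    return align_up(legacy_avail_addr(pa, num) + 4 + 2ull * num, align);
}
```

virtio-1 drops this derivation: the transport hands over three independent addresses, which is what virtio_queue_set_rings() stores.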
[PATCH RFC 08/11] s390x/css: Add a callback for when subchannel gets disabled
From: Thomas Huth We need a possibility to run code when a subchannel gets disabled. This patch adds the necessary infrastructure. Signed-off-by: Thomas Huth Signed-off-by: Cornelia Huck --- hw/s390x/css.c | 12 hw/s390x/css.h |1 + 2 files changed, 13 insertions(+) diff --git a/hw/s390x/css.c b/hw/s390x/css.c index b67c039..735ec55 100644 --- a/hw/s390x/css.c +++ b/hw/s390x/css.c @@ -588,6 +588,7 @@ int css_do_msch(SubchDev *sch, SCHIB *orig_schib) { SCSW *s = &sch->curr_status.scsw; PMCW *p = &sch->curr_status.pmcw; +uint16_t oldflags; int ret; SCHIB schib; @@ -610,6 +611,7 @@ int css_do_msch(SubchDev *sch, SCHIB *orig_schib) copy_schib_from_guest(&schib, orig_schib); /* Only update the program-modifiable fields. */ p->intparm = schib.pmcw.intparm; +oldflags = p->flags; p->flags &= ~(PMCW_FLAGS_MASK_ISC | PMCW_FLAGS_MASK_ENA | PMCW_FLAGS_MASK_LM | PMCW_FLAGS_MASK_MME | PMCW_FLAGS_MASK_MP); @@ -625,6 +627,12 @@ int css_do_msch(SubchDev *sch, SCHIB *orig_schib) (PMCW_CHARS_MASK_MBFC | PMCW_CHARS_MASK_CSENSE); sch->curr_status.mba = schib.mba; +/* Has the channel been disabled? 
*/ +if (sch->disable_cb && (oldflags & PMCW_FLAGS_MASK_ENA) != 0 +&& (p->flags & PMCW_FLAGS_MASK_ENA) == 0) { +sch->disable_cb(sch); +} + ret = 0; out: @@ -1443,6 +1451,10 @@ void css_reset_sch(SubchDev *sch) { PMCW *p = &sch->curr_status.pmcw; +if ((p->flags & PMCW_FLAGS_MASK_ENA) != 0 && sch->disable_cb) { +sch->disable_cb(sch); +} + p->intparm = 0; p->flags &= ~(PMCW_FLAGS_MASK_ISC | PMCW_FLAGS_MASK_ENA | PMCW_FLAGS_MASK_LM | PMCW_FLAGS_MASK_MME | diff --git a/hw/s390x/css.h b/hw/s390x/css.h index 33104ac..7fa807b 100644 --- a/hw/s390x/css.h +++ b/hw/s390x/css.h @@ -81,6 +81,7 @@ struct SubchDev { uint8_t ccw_no_data_cnt; /* transport-provided data: */ int (*ccw_cb) (SubchDev *, CCW1); +void (*disable_cb)(SubchDev *); SenseId id; void *driver_data; }; -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC 07/11] dataplane: allow virtio-1 devices
Handle endianness conversion for virtio-1 virtqueues correctly. Note that dataplane now needs to be built per-target. Signed-off-by: Cornelia Huck --- hw/block/dataplane/virtio-blk.c |3 +- hw/scsi/virtio-scsi-dataplane.c |2 +- hw/virtio/Makefile.objs |2 +- hw/virtio/dataplane/Makefile.objs |2 +- hw/virtio/dataplane/vring.c | 85 +++ include/hw/virtio/dataplane/vring.h | 64 -- 6 files changed, 113 insertions(+), 45 deletions(-) diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c index 5458f9d..eb45a3d 100644 --- a/hw/block/dataplane/virtio-blk.c +++ b/hw/block/dataplane/virtio-blk.c @@ -16,6 +16,7 @@ #include "qemu/iov.h" #include "qemu/thread.h" #include "qemu/error-report.h" +#include "hw/virtio/virtio-access.h" #include "hw/virtio/dataplane/vring.h" #include "block/block.h" #include "hw/virtio/virtio-blk.h" @@ -75,7 +76,7 @@ static void complete_request_vring(VirtIOBlockReq *req, unsigned char status) VirtIOBlockDataPlane *s = req->dev->dataplane; stb_p(&req->in->status, status); -vring_push(&req->dev->dataplane->vring, &req->elem, +vring_push(s->vdev, &req->dev->dataplane->vring, &req->elem, req->qiov.size + sizeof(*req->in)); /* Suppress notification to guest by BH and its scheduled diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c index b778e05..3e2b706 100644 --- a/hw/scsi/virtio-scsi-dataplane.c +++ b/hw/scsi/virtio-scsi-dataplane.c @@ -81,7 +81,7 @@ VirtIOSCSIReq *virtio_scsi_pop_req_vring(VirtIOSCSI *s, void virtio_scsi_vring_push_notify(VirtIOSCSIReq *req) { -vring_push(&req->vring->vring, &req->elem, +vring_push((VirtIODevice *)req->dev, &req->vring->vring, &req->elem, req->qsgl.size + req->resp_iov.size); event_notifier_set(&req->vring->guest_notifier); } diff --git a/hw/virtio/Makefile.objs b/hw/virtio/Makefile.objs index d21c397..19b224a 100644 --- a/hw/virtio/Makefile.objs +++ b/hw/virtio/Makefile.objs @@ -2,7 +2,7 @@ common-obj-y += virtio-rng.o common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o 
common-obj-y += virtio-bus.o common-obj-y += virtio-mmio.o -common-obj-$(CONFIG_VIRTIO) += dataplane/ +obj-$(CONFIG_VIRTIO) += dataplane/ obj-y += virtio.o virtio-balloon.o obj-$(CONFIG_LINUX) += vhost.o vhost-backend.o vhost-user.o diff --git a/hw/virtio/dataplane/Makefile.objs b/hw/virtio/dataplane/Makefile.objs index 9a8cfc0..753a9ca 100644 --- a/hw/virtio/dataplane/Makefile.objs +++ b/hw/virtio/dataplane/Makefile.objs @@ -1 +1 @@ -common-obj-y += vring.o +obj-y += vring.o diff --git a/hw/virtio/dataplane/vring.c b/hw/virtio/dataplane/vring.c index b84957f..4624521 100644 --- a/hw/virtio/dataplane/vring.c +++ b/hw/virtio/dataplane/vring.c @@ -18,6 +18,7 @@ #include "hw/hw.h" #include "exec/memory.h" #include "exec/address-spaces.h" +#include "hw/virtio/virtio-access.h" #include "hw/virtio/dataplane/vring.h" #include "qemu/error-report.h" @@ -83,7 +84,7 @@ bool vring_setup(Vring *vring, VirtIODevice *vdev, int n) vring_init(&vring->vr, virtio_queue_get_num(vdev, n), vring_ptr, 4096); vring->last_avail_idx = virtio_queue_get_last_avail_idx(vdev, n); -vring->last_used_idx = vring->vr.used->idx; +vring->last_used_idx = vring_get_used_idx(vdev, vring); vring->signalled_used = 0; vring->signalled_used_valid = false; @@ -104,7 +105,7 @@ void vring_teardown(Vring *vring, VirtIODevice *vdev, int n) void vring_disable_notification(VirtIODevice *vdev, Vring *vring) { if (!(vdev->guest_features[0] & (1 << VIRTIO_RING_F_EVENT_IDX))) { -vring->vr.used->flags |= VRING_USED_F_NO_NOTIFY; +vring_set_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY); } } @@ -117,10 +118,10 @@ bool vring_enable_notification(VirtIODevice *vdev, Vring *vring) if (vdev->guest_features[0] & (1 << VIRTIO_RING_F_EVENT_IDX)) { vring_avail_event(&vring->vr) = vring->vr.avail->idx; } else { -vring->vr.used->flags &= ~VRING_USED_F_NO_NOTIFY; +vring_clear_used_flags(vdev, vring, VRING_USED_F_NO_NOTIFY); } smp_mb(); /* ensure update is seen before reading avail_idx */ -return !vring_more_avail(vring); +return 
!vring_more_avail(vdev, vring); } /* This is stolen from linux/drivers/vhost/vhost.c:vhost_notify() */ @@ -134,12 +135,13 @@ bool vring_should_notify(VirtIODevice *vdev, Vring *vring) smp_mb(); if ((vdev->guest_features[0] & VIRTIO_F_NOTIFY_ON_EMPTY) && -unlikely(vring->vr.avail->idx == vring->last_avail_idx)) { +unlikely(!vring_more_avail(vdev, vring))) { return true; } if (!(vdev->guest_features[0] & VIRTIO_RING_F_EVENT_IDX)) { -return !(vring->vr.avail->flags & VRING_AVAIL_F_NO_INTERRUPT); +return !(vring_get_avail_flags(vdev, vring) & + VRING_AVAIL_F_NO_INTERRUPT); }
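The hunk above (truncated here) feeds into the EVENT_IDX decision in vring_should_notify(). The comparison that decision ultimately relies on is the virtio spec's wrap-safe vring_need_event() test: notify only if the guest's event index was passed between the old and new used index. A minimal, self-contained sketch of that test (illustrative, not the dataplane code itself):

```c
#include <assert.h>
#include <stdint.h>

/*
 * EVENT_IDX notification test as specified by virtio (vring_need_event()):
 * return nonzero if event_idx was crossed while moving the used index from
 * old_idx to new_idx. The u16 subtraction makes it safe across wrap-around.
 */
static int vring_need_event(uint16_t event_idx, uint16_t new_idx,
                            uint16_t old_idx)
{
	return (uint16_t)(new_idx - event_idx - 1) <
	       (uint16_t)(new_idx - old_idx);
}
```

The point of the u16 casts is that the check keeps working when the indices wrap past 0xffff, which real rings do continuously.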
[PATCH RFC 08/11] virtio_blk: use virtio v1.0 endian
Note that we care only about the fields still in use for virtio v1.0.

Reviewed-by: Thomas Huth
Reviewed-by: David Hildenbrand
Signed-off-by: Cornelia Huck
---
 drivers/block/virtio_blk.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 0a58140..08a8012 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -119,6 +119,10 @@ static int __virtblk_add_req(struct virtqueue *vq,
 	sg_init_one(&status, &vbr->status, sizeof(vbr->status));
 	sgs[num_out + num_in++] = &status;
 
+	/* we only care about fields valid for virtio-1 */
+	vbr->out_hdr.type = cpu_to_virtio_u32(vq->vdev, vbr->out_hdr.type);
+	vbr->out_hdr.sector = cpu_to_virtio_u64(vq->vdev, vbr->out_hdr.sector);
+
 	return virtqueue_add_sgs(vq, sgs, num_out, num_in, vbr, GFP_ATOMIC);
 }
-- 
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC 03/11] virtio: endianness conversion helpers
Provide helper functions that convert from/to LE for virtio devices that are not operating in legacy mode. We check for the VERSION_1 feature bit to determine that. Based on original patches by Rusty Russell and Thomas Huth. Reviewed-by: David Hildenbrand Signed-off-by: Cornelia Huck --- drivers/virtio/virtio.c|4 include/linux/virtio.h | 40 include/uapi/linux/virtio_config.h |3 +++ 3 files changed, 47 insertions(+) diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c index cfd5d00..8f74cd6 100644 --- a/drivers/virtio/virtio.c +++ b/drivers/virtio/virtio.c @@ -144,6 +144,10 @@ static int virtio_dev_probe(struct device *_d) if (device_features & (1ULL << i)) dev->features |= (1ULL << i); + /* Version 1.0 compliant devices set the VIRTIO_F_VERSION_1 bit */ + if (device_features & (1ULL << VIRTIO_F_VERSION_1)) + dev->features |= (1ULL << VIRTIO_F_VERSION_1); + dev->config->finalize_features(dev); err = drv->probe(dev); diff --git a/include/linux/virtio.h b/include/linux/virtio.h index a24b41f..68cadd4 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -9,6 +9,7 @@ #include #include #include +#include /** * virtqueue - a queue to register buffers for sending or receiving. @@ -102,6 +103,11 @@ static inline struct virtio_device *dev_to_virtio(struct device *_dev) return container_of(_dev, struct virtio_device, dev); } +static inline bool virtio_device_legacy(const struct virtio_device *dev) +{ + return !(dev->features & (1ULL << VIRTIO_F_VERSION_1)); +} + int register_virtio_device(struct virtio_device *dev); void unregister_virtio_device(struct virtio_device *dev); @@ -149,4 +155,38 @@ void unregister_virtio_driver(struct virtio_driver *drv); #define module_virtio_driver(__virtio_driver) \ module_driver(__virtio_driver, register_virtio_driver, \ unregister_virtio_driver) + +/* + * v1.0 specifies LE headers, legacy was native endian. Therefore, we must + * convert from/to LE if and only if vdev is not legacy. 
+ */ +static inline u16 virtio_to_cpu_u16(const struct virtio_device *vdev, u16 v) +{ + return virtio_device_legacy(vdev) ? v : le16_to_cpu(v); +} + +static inline u32 virtio_to_cpu_u32(const struct virtio_device *vdev, u32 v) +{ + return virtio_device_legacy(vdev) ? v : le32_to_cpu(v); +} + +static inline u64 virtio_to_cpu_u64(const struct virtio_device *vdev, u64 v) +{ + return virtio_device_legacy(vdev) ? v : le64_to_cpu(v); +} + +static inline u16 cpu_to_virtio_u16(const struct virtio_device *vdev, u16 v) +{ + return virtio_device_legacy(vdev) ? v : cpu_to_le16(v); +} + +static inline u32 cpu_to_virtio_u32(const struct virtio_device *vdev, u32 v) +{ + return virtio_device_legacy(vdev) ? v : cpu_to_le32(v); +} + +static inline u64 cpu_to_virtio_u64(const struct virtio_device *vdev, u64 v) +{ + return virtio_device_legacy(vdev) ? v : cpu_to_le64(v); +} #endif /* _LINUX_VIRTIO_H */ diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h index 3ce768c..80e7381 100644 --- a/include/uapi/linux/virtio_config.h +++ b/include/uapi/linux/virtio_config.h @@ -54,4 +54,7 @@ /* Can the device handle any descriptor layout? */ #define VIRTIO_F_ANY_LAYOUT27 +/* v1.0 compliant. */ +#define VIRTIO_F_VERSION_1 32 + #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */ -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
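The pattern these helpers implement — pass the value through unchanged for legacy devices, convert from/to little-endian for VERSION_1 devices — can be sketched in a host-portable way. Everything below (`fake_virtio_device`, `demo_cpu_to_virtio_u16`, the explicit `cpu_is_big_endian` flag) is illustrative scaffolding, not the kernel API; the real helpers use `le16_to_cpu()`/`cpu_to_le16()`, which are no-ops on little-endian hosts:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32

/* Hypothetical stand-in for struct virtio_device's feature word. */
struct fake_virtio_device {
	uint64_t features;
};

/* Mirrors virtio_device_legacy(): legacy means VERSION_1 was not set. */
static bool fake_virtio_device_legacy(const struct fake_virtio_device *dev)
{
	return !(dev->features & (1ULL << VIRTIO_F_VERSION_1));
}

/* Explicit byte swap so the demo behaves identically on any build host. */
static uint16_t le16_swap(uint16_t v)
{
	return (uint16_t)((v >> 8) | (v << 8));
}

/*
 * v1.0 headers are LE, legacy headers are guest-native: convert only for
 * non-legacy devices, and only when the CPU itself is big-endian.
 */
static uint16_t demo_cpu_to_virtio_u16(const struct fake_virtio_device *dev,
                                       uint16_t v, bool cpu_is_big_endian)
{
	if (fake_virtio_device_legacy(dev))
		return v;
	return cpu_is_big_endian ? le16_swap(v) : v;
}
```

On s390 (big-endian), this is exactly why the conversion matters: a legacy device sees native big-endian fields, while a VERSION_1 device must see little-endian ones.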
[PATCH RFC 05/11] virtio_config: endian conversion for v1.0.
From: Rusty Russell Reviewed-by: David Hildenbrand Signed-off-by: Rusty Russell Signed-off-by: Cornelia Huck --- include/linux/virtio_config.h |9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h index a0e16d8..ca22e3a 100644 --- a/include/linux/virtio_config.h +++ b/include/linux/virtio_config.h @@ -222,12 +222,13 @@ static inline u16 virtio_cread16(struct virtio_device *vdev, { u16 ret; vdev->config->get(vdev, offset, &ret, sizeof(ret)); - return ret; + return virtio_to_cpu_u16(vdev, ret); } static inline void virtio_cwrite16(struct virtio_device *vdev, unsigned int offset, u16 val) { + val = cpu_to_virtio_u16(vdev, val); vdev->config->set(vdev, offset, &val, sizeof(val)); } @@ -236,12 +237,13 @@ static inline u32 virtio_cread32(struct virtio_device *vdev, { u32 ret; vdev->config->get(vdev, offset, &ret, sizeof(ret)); - return ret; + return virtio_to_cpu_u32(vdev, ret); } static inline void virtio_cwrite32(struct virtio_device *vdev, unsigned int offset, u32 val) { + val = cpu_to_virtio_u32(vdev, val); vdev->config->set(vdev, offset, &val, sizeof(val)); } @@ -250,12 +252,13 @@ static inline u64 virtio_cread64(struct virtio_device *vdev, { u64 ret; vdev->config->get(vdev, offset, &ret, sizeof(ret)); - return ret; + return virtio_to_cpu_u64(vdev, ret); } static inline void virtio_cwrite64(struct virtio_device *vdev, unsigned int offset, u64 val) { + val = cpu_to_virtio_u64(vdev, val); vdev->config->set(vdev, offset, &val, sizeof(val)); } -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC 11/11] KVM: s390: enable virtio-ccw revision 1
Now that virtio-ccw has everything needed to support virtio 1.0 in
place, try to enable it if the host supports it.

Reviewed-by: David Hildenbrand
Signed-off-by: Cornelia Huck
---
 drivers/s390/kvm/virtio_ccw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
index f97d3fb..a2e0c33 100644
--- a/drivers/s390/kvm/virtio_ccw.c
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -103,7 +103,7 @@ struct virtio_rev_info {
 };
 
 /* the highest virtio-ccw revision we support */
-#define VIRTIO_CCW_REV_MAX 0
+#define VIRTIO_CCW_REV_MAX 1
 
 struct virtio_ccw_vq_info {
 	struct virtqueue *vq;
-- 
1.7.9.5
[PATCH RFC 09/11] KVM: s390: Set virtio-ccw transport revision
From: Thomas Huth With the new SET-VIRTIO-REVISION command of the virtio 1.0 standard, we can now negotiate the virtio-ccw revision after setting a channel online. Note that we don't negotiate version 1 yet. [Cornelia Huck: reworked revision loop a bit] Reviewed-by: David Hildenbrand Signed-off-by: Thomas Huth Signed-off-by: Cornelia Huck --- drivers/s390/kvm/virtio_ccw.c | 63 + 1 file changed, 63 insertions(+) diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c index 4173b59..cbe2ba8 100644 --- a/drivers/s390/kvm/virtio_ccw.c +++ b/drivers/s390/kvm/virtio_ccw.c @@ -55,6 +55,7 @@ struct virtio_ccw_device { struct ccw_device *cdev; __u32 curr_io; int err; + unsigned int revision; /* Transport revision */ wait_queue_head_t wait_q; spinlock_t lock; struct list_head virtqueues; @@ -86,6 +87,15 @@ struct virtio_thinint_area { u8 isc; } __packed; +struct virtio_rev_info { + __u16 revision; + __u16 length; + __u8 data[]; +}; + +/* the highest virtio-ccw revision we support */ +#define VIRTIO_CCW_REV_MAX 0 + struct virtio_ccw_vq_info { struct virtqueue *vq; int num; @@ -122,6 +132,7 @@ static struct airq_info *airq_areas[MAX_AIRQ_AREAS]; #define CCW_CMD_WRITE_STATUS 0x31 #define CCW_CMD_READ_VQ_CONF 0x32 #define CCW_CMD_SET_IND_ADAPTER 0x73 +#define CCW_CMD_SET_VIRTIO_REV 0x83 #define VIRTIO_CCW_DOING_SET_VQ 0x0001 #define VIRTIO_CCW_DOING_RESET 0x0004 @@ -134,6 +145,7 @@ static struct airq_info *airq_areas[MAX_AIRQ_AREAS]; #define VIRTIO_CCW_DOING_READ_VQ_CONF 0x0200 #define VIRTIO_CCW_DOING_SET_CONF_IND 0x0400 #define VIRTIO_CCW_DOING_SET_IND_ADAPTER 0x0800 +#define VIRTIO_CCW_DOING_SET_VIRTIO_REV 0x1000 #define VIRTIO_CCW_INTPARM_MASK 0x static struct virtio_ccw_device *to_vc_device(struct virtio_device *vdev) @@ -934,6 +946,7 @@ static void virtio_ccw_int_handler(struct ccw_device *cdev, case VIRTIO_CCW_DOING_RESET: case VIRTIO_CCW_DOING_READ_VQ_CONF: case VIRTIO_CCW_DOING_SET_IND_ADAPTER: + case VIRTIO_CCW_DOING_SET_VIRTIO_REV: vcdev->curr_io &= 
~activity; wake_up(&vcdev->wait_q); break; @@ -1053,6 +1066,51 @@ static int virtio_ccw_offline(struct ccw_device *cdev) return 0; } +static int virtio_ccw_set_transport_rev(struct virtio_ccw_device *vcdev) +{ + struct virtio_rev_info *rev; + struct ccw1 *ccw; + int ret; + + ccw = kzalloc(sizeof(*ccw), GFP_DMA | GFP_KERNEL); + if (!ccw) + return -ENOMEM; + rev = kzalloc(sizeof(*rev), GFP_DMA | GFP_KERNEL); + if (!rev) { + kfree(ccw); + return -ENOMEM; + } + + /* Set transport revision */ + ccw->cmd_code = CCW_CMD_SET_VIRTIO_REV; + ccw->flags = 0; + ccw->count = sizeof(*rev); + ccw->cda = (__u32)(unsigned long)rev; + + vcdev->revision = VIRTIO_CCW_REV_MAX; + do { + rev->revision = vcdev->revision; + /* none of our supported revisions carry payload */ + rev->length = 0; + ret = ccw_io_helper(vcdev, ccw, + VIRTIO_CCW_DOING_SET_VIRTIO_REV); + if (ret == -EOPNOTSUPP) { + if (vcdev->revision == 0) + /* +* The host device does not support setting +* the revision: let's operate it in legacy +* mode. +*/ + ret = 0; + else + vcdev->revision--; + } + } while (ret == -EOPNOTSUPP); + + kfree(ccw); + kfree(rev); + return ret; +} static int virtio_ccw_online(struct ccw_device *cdev) { @@ -1093,6 +1151,11 @@ static int virtio_ccw_online(struct ccw_device *cdev) spin_unlock_irqrestore(get_ccwdev_lock(cdev), flags); vcdev->vdev.id.vendor = cdev->id.cu_type; vcdev->vdev.id.device = cdev->id.cu_model; + + ret = virtio_ccw_set_transport_rev(vcdev); + if (ret) + goto out_free; + ret = register_virtio_device(&vcdev->vdev); if (ret) { dev_warn(&cdev->dev, "Failed to register virtio device: %d\n", -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
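The negotiation loop in virtio_ccw_set_transport_rev() can be modeled in isolation: start at the driver's maximum revision, step down each time the host rejects with -EOPNOTSUPP, and if even revision 0 is rejected, treat the host as a legacy device that predates SET_VIRTIO_REV. Below, `fake_host` and `fake_set_rev` are stand-ins for the CCW I/O path; only the loop body mirrors the patch:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical host model: accepts any revision up to its own maximum. */
struct fake_host {
	unsigned int max_rev;
	int supports_set_rev;
};

static int fake_set_rev(const struct fake_host *h, unsigned int rev)
{
	if (!h->supports_set_rev || rev > h->max_rev)
		return -EOPNOTSUPP;
	return 0;
}

/* Mirrors the do/while loop in virtio_ccw_set_transport_rev(). */
static int negotiate(const struct fake_host *h, unsigned int drv_rev_max,
                     unsigned int *out_rev)
{
	unsigned int rev = drv_rev_max;
	int ret;

	do {
		ret = fake_set_rev(h, rev);
		if (ret == -EOPNOTSUPP) {
			if (rev == 0) {
				/* Host can't set a revision: legacy mode. */
				ret = 0;
				break;
			}
			rev--;
		}
	} while (ret == -EOPNOTSUPP);

	*out_rev = rev;
	return ret;
}
```

The design choice here is that a host rejecting every revision is not an error: it simply means the device must be driven in legacy (revision 0) mode.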
[PATCH RFC 10/11] KVM: s390: virtio-ccw revision 1 SET_VQ
The CCW_CMD_SET_VQ command has a different format for revision 1+ devices, allowing to specify a more complex virtqueue layout. For now, we stay however with the old layout and simply use the new command format for virtio-1 devices. Signed-off-by: Cornelia Huck --- drivers/s390/kvm/virtio_ccw.c | 54 - 1 file changed, 42 insertions(+), 12 deletions(-) diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c index cbe2ba8..f97d3fb 100644 --- a/drivers/s390/kvm/virtio_ccw.c +++ b/drivers/s390/kvm/virtio_ccw.c @@ -68,13 +68,22 @@ struct virtio_ccw_device { void *airq_info; }; -struct vq_info_block { +struct vq_info_block_legacy { __u64 queue; __u32 align; __u16 index; __u16 num; } __packed; +struct vq_info_block { + __u64 desc; + __u32 res0; + __u16 index; + __u16 num; + __u64 avail; + __u64 used; +} __packed; + struct virtio_feature_desc { __u32 features; __u8 index; @@ -100,7 +109,10 @@ struct virtio_ccw_vq_info { struct virtqueue *vq; int num; void *queue; - struct vq_info_block *info_block; + union { + struct vq_info_block s; + struct vq_info_block_legacy l; + } *info_block; int bit_nr; struct list_head node; long cookie; @@ -411,13 +423,22 @@ static void virtio_ccw_del_vq(struct virtqueue *vq, struct ccw1 *ccw) spin_unlock_irqrestore(&vcdev->lock, flags); /* Release from host. 
*/ - info->info_block->queue = 0; - info->info_block->align = 0; - info->info_block->index = index; - info->info_block->num = 0; + if (vcdev->revision == 0) { + info->info_block->l.queue = 0; + info->info_block->l.align = 0; + info->info_block->l.index = index; + info->info_block->l.num = 0; + ccw->count = sizeof(info->info_block->l); + } else { + info->info_block->s.desc = 0; + info->info_block->s.index = index; + info->info_block->s.num = 0; + info->info_block->s.avail = 0; + info->info_block->s.used = 0; + ccw->count = sizeof(info->info_block->s); + } ccw->cmd_code = CCW_CMD_SET_VQ; ccw->flags = 0; - ccw->count = sizeof(*info->info_block); ccw->cda = (__u32)(unsigned long)(info->info_block); ret = ccw_io_helper(vcdev, ccw, VIRTIO_CCW_DOING_SET_VQ | index); @@ -500,13 +521,22 @@ static struct virtqueue *virtio_ccw_setup_vq(struct virtio_device *vdev, } /* Register it with the host. */ - info->info_block->queue = (__u64)info->queue; - info->info_block->align = KVM_VIRTIO_CCW_RING_ALIGN; - info->info_block->index = i; - info->info_block->num = info->num; + if (vcdev->revision == 0) { + info->info_block->l.queue = (__u64)info->queue; + info->info_block->l.align = KVM_VIRTIO_CCW_RING_ALIGN; + info->info_block->l.index = i; + info->info_block->l.num = info->num; + ccw->count = sizeof(info->info_block->l); + } else { + info->info_block->s.desc = (__u64)info->queue; + info->info_block->s.index = i; + info->info_block->s.num = info->num; + info->info_block->s.avail = (__u64)virtqueue_get_avail(vq); + info->info_block->s.used = (__u64)virtqueue_get_used(vq); + ccw->count = sizeof(info->info_block->s); + } ccw->cmd_code = CCW_CMD_SET_VQ; ccw->flags = 0; - ccw->count = sizeof(*info->info_block); ccw->cda = (__u32)(unsigned long)(info->info_block); err = ccw_io_helper(vcdev, ccw, VIRTIO_CCW_DOING_SET_VQ | i); if (err) { -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at 
http://vger.kernel.org/majordomo-info.html
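The two SET_VQ payload formats and the per-revision `ccw->count` selection can be checked with a small sketch. The struct layouts are copied from the patch; `vq_info_ccw_count` is an illustrative helper, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Legacy (revision 0) SET_VQ payload: one contiguous ring. */
struct vq_info_block_legacy {
	uint64_t queue;
	uint32_t align;
	uint16_t index;
	uint16_t num;
} __attribute__((packed));

/* Revision 1+ SET_VQ payload: descriptor, avail and used addresses
 * can be specified independently. */
struct vq_info_block {
	uint64_t desc;
	uint32_t res0;
	uint16_t index;
	uint16_t num;
	uint64_t avail;
	uint64_t used;
} __attribute__((packed));

/* One buffer serves both formats, as in struct virtio_ccw_vq_info. */
union vq_info {
	struct vq_info_block s;
	struct vq_info_block_legacy l;
};

/* Pick the CCW payload length for the negotiated revision, as the
 * patched virtio_ccw_setup_vq()/virtio_ccw_del_vq() do. */
static size_t vq_info_ccw_count(unsigned int revision)
{
	return revision == 0 ? sizeof(struct vq_info_block_legacy)
	                     : sizeof(struct vq_info_block);
}
```

Sizing `ccw->count` to the active union member (rather than `sizeof(*info_block)`) is the crux of the patch: the host distinguishes the formats by the transferred length together with the negotiated revision.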
[PATCH RFC 11/11] s390x/virtio-ccw: enable virtio 1.0
virtio-ccw should now have everything in place to operate virtio 1.0
devices, so let's enable revision 1.

Signed-off-by: Cornelia Huck
---
 hw/s390x/virtio-ccw.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/s390x/virtio-ccw.h b/hw/s390x/virtio-ccw.h
index 03d5955..08edd8d 100644
--- a/hw/s390x/virtio-ccw.h
+++ b/hw/s390x/virtio-ccw.h
@@ -73,7 +73,7 @@ typedef struct VirtIOCCWDeviceClass {
 #define VIRTIO_CCW_FEATURE_SIZE NR_VIRTIO_FEATURE_WORDS
 
 /* The maximum virtio revision we support. */
-#define VIRTIO_CCW_REV_MAX 0
+#define VIRTIO_CCW_REV_MAX 1
 
 /* Performance improves when virtqueue kick processing is decoupled from the
  * vcpu thread using ioeventfd for some devices. */
-- 
1.7.9.5
[PATCH RFC 04/11] s390x/virtio-ccw: fix check for WRITE_FEAT
We need to check the guest feature size, not the host feature size, to
find out whether we should call virtio_set_features(). This check is
possible now that vdev->guest_features is an array.

Reviewed-by: Thomas Huth
Signed-off-by: Cornelia Huck
---
 hw/s390x/virtio-ccw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c
index 8aa79a7..69add47 100644
--- a/hw/s390x/virtio-ccw.c
+++ b/hw/s390x/virtio-ccw.c
@@ -399,7 +399,7 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw)
         features.index = ldub_phys(&address_space_memory,
                                    ccw.cda + sizeof(features.features));
         features.features = ldl_le_phys(&address_space_memory, ccw.cda);
-        if (features.index < ARRAY_SIZE(dev->host_features)) {
+        if (features.index < ARRAY_SIZE(vdev->guest_features)) {
             virtio_set_features(vdev, features.index, features.features);
         } else {
             /*
-- 
1.7.9.5
[PATCH RFC 09/11] s390x/virtio-ccw: add virtio set-revision call
From: Thomas Huth Handle the virtio-ccw revision according to what the guest sets. When revision 1 is selected, we have a virtio-1 standard device with byteswapping for the virtio rings. When a channel gets disabled, we have to revert to the legacy behavior in case the next user of the device does not negotiate the revision 1 anymore (e.g. the boot firmware uses revision 1, but the operating system only uses the legacy mode). Note that revisions > 0 are still disabled; but we still extend the feature bit size to be able to handle the VERSION_1 bit. Signed-off-by: Thomas Huth Signed-off-by: Cornelia Huck --- hw/s390x/virtio-ccw.c | 54 + hw/s390x/virtio-ccw.h |7 ++- 2 files changed, 60 insertions(+), 1 deletion(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index 69add47..0d414f6 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -20,9 +20,11 @@ #include "hw/virtio/virtio-net.h" #include "hw/sysbus.h" #include "qemu/bitops.h" +#include "hw/virtio/virtio-access.h" #include "hw/virtio/virtio-bus.h" #include "hw/s390x/adapter.h" #include "hw/s390x/s390_flic.h" +#include "linux/virtio_config.h" #include "ioinst.h" #include "css.h" @@ -260,6 +262,12 @@ typedef struct VirtioThinintInfo { uint8_t isc; } QEMU_PACKED VirtioThinintInfo; +typedef struct VirtioRevInfo { +uint16_t revision; +uint16_t length; +uint8_t data[0]; +} QEMU_PACKED VirtioRevInfo; + /* Specify where the virtqueues for the subchannel are in guest memory. 
*/ static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t addr, uint32_t align, uint16_t index, uint16_t num) @@ -299,6 +307,7 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw) { int ret; VqInfoBlock info; +VirtioRevInfo revinfo; uint8_t status; VirtioFeatDesc features; void *config; @@ -373,6 +382,13 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw) ccw.cda + sizeof(features.features)); if (features.index < ARRAY_SIZE(dev->host_features)) { features.features = dev->host_features[features.index]; +/* + * Don't offer version 1 to the guest if it did not + * negotiate at least revision 1. + */ +if (features.index == 1 && dev->revision <= 0) { +features.features &= ~(1 << (VIRTIO_F_VERSION_1 - 32)); +} } else { /* Return zeroes if the guest supports more feature bits. */ features.features = 0; @@ -400,6 +416,13 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw) ccw.cda + sizeof(features.features)); features.features = ldl_le_phys(&address_space_memory, ccw.cda); if (features.index < ARRAY_SIZE(vdev->guest_features)) { +/* + * The guest should not set version 1 if it didn't + * negotiate a revision >= 1. 
+ */ +if (features.index == 1 && dev->revision <= 0) { +features.features &= ~(1 << (VIRTIO_F_VERSION_1 - 32)); +} virtio_set_features(vdev, features.index, features.features); } else { /* @@ -600,6 +623,25 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw) } } break; +case CCW_CMD_SET_VIRTIO_REV: +len = sizeof(revinfo); +if (ccw.count < len || (check_len && ccw.count > len)) { +ret = -EINVAL; +break; +} +if (!ccw.cda) { +ret = -EFAULT; +break; +} +cpu_physical_memory_read(ccw.cda, &revinfo, len); +if (dev->revision >= 0 || +revinfo.revision > VIRTIO_CCW_REV_MAX) { +ret = -ENOSYS; +break; +} +ret = 0; +dev->revision = revinfo.revision; +break; default: ret = -ENOSYS; break; @@ -607,6 +649,13 @@ static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw) return ret; } +static void virtio_sch_disable_cb(SubchDev *sch) +{ +VirtioCcwDevice *dev = sch->driver_data; + +dev->revision = -1; +} + static int virtio_ccw_device_init(VirtioCcwDevice *dev, VirtIODevice *vdev) { unsigned int cssid = 0; @@ -733,6 +782,7 @@ static int virtio_ccw_device_init(VirtioCcwDevice *dev, VirtIODevice *vdev) css_sch_build_virtual_schib(sch, 0, VIRTIO_CCW_CHPID_TYPE); sch->ccw_cb = virtio_ccw_cb; +sch->disable_cb = virtio_sch_disable_cb; /* Build senseid data. */ memset(&sch->id, 0, sizeof(SenseId)); @@ -740,6 +790,8 @@ static int virtio_ccw_device_init(VirtioCcwDevice *dev, VirtIODevice *vdev) sch->id.cu_type = VIRTIO_CCW_CU_TYPE; sch->id.cu_model = vdev->device_id; +dev->revision = -1; +
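The word-1 masking of VIRTIO_F_VERSION_1 used in both the READ_FEAT and WRITE_FEAT branches above can be isolated into a tiny sketch. `filter_features` is a hypothetical helper mirroring those branches, not a QEMU function: feature bit 32 lands in 32-bit word 1 as bit 0, and must stay hidden until the guest has negotiated transport revision >= 1:

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1 32

/*
 * Filter one 32-bit feature word for a guest that has (or has not)
 * negotiated a virtio-ccw revision. dev->revision is -1 before
 * negotiation, so revision <= 0 means "legacy or not yet negotiated".
 */
static uint32_t filter_features(unsigned int index, uint32_t features,
                                int revision)
{
	if (index == 1 && revision <= 0)
		features &= ~(1u << (VIRTIO_F_VERSION_1 - 32));
	return features;
}
```

Word 0 is never touched, so legacy feature bits remain visible regardless of the negotiated revision.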
[PATCH RFC 10/11] s390x/virtio-ccw: support virtio-1 set_vq format
Support the new CCW_CMD_SET_VQ format for virtio-1 devices. While we're at it, refactor the code a bit and enforce big endian fields (which had always been required, even for legacy). Reviewed-by: Thomas Huth Signed-off-by: Cornelia Huck --- hw/s390x/virtio-ccw.c | 114 ++--- 1 file changed, 80 insertions(+), 34 deletions(-) diff --git a/hw/s390x/virtio-ccw.c b/hw/s390x/virtio-ccw.c index 0d414f6..80efe88 100644 --- a/hw/s390x/virtio-ccw.c +++ b/hw/s390x/virtio-ccw.c @@ -238,11 +238,20 @@ VirtualCssBus *virtual_css_bus_init(void) } /* Communication blocks used by several channel commands. */ -typedef struct VqInfoBlock { +typedef struct VqInfoBlockLegacy { uint64_t queue; uint32_t align; uint16_t index; uint16_t num; +} QEMU_PACKED VqInfoBlockLegacy; + +typedef struct VqInfoBlock { +uint64_t desc; +uint32_t res0; +uint16_t index; +uint16_t num; +uint64_t avail; +uint64_t used; } QEMU_PACKED VqInfoBlock; typedef struct VqConfigBlock { @@ -269,17 +278,20 @@ typedef struct VirtioRevInfo { } QEMU_PACKED VirtioRevInfo; /* Specify where the virtqueues for the subchannel are in guest memory. */ -static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t addr, uint32_t align, - uint16_t index, uint16_t num) +static int virtio_ccw_set_vqs(SubchDev *sch, VqInfoBlock *info, + VqInfoBlockLegacy *linfo) { VirtIODevice *vdev = virtio_ccw_get_vdev(sch); +uint16_t index = info ? info->index : linfo->index; +uint16_t num = info ? info->num : linfo->num; +uint64_t desc = info ? info->desc : linfo->queue; if (index > VIRTIO_PCI_QUEUE_MAX) { return -EINVAL; } /* Current code in virtio.c relies on 4K alignment. 
*/ -if (addr && (align != 4096)) { +if (linfo && desc && (linfo->align != 4096)) { return -EINVAL; } @@ -287,8 +299,12 @@ static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t addr, uint32_t align, return -EINVAL; } -virtio_queue_set_addr(vdev, index, addr); -if (!addr) { +if (info) { +virtio_queue_set_rings(vdev, index, desc, info->avail, info->used); +} else { +virtio_queue_set_addr(vdev, index, desc); +} +if (!desc) { virtio_queue_set_vector(vdev, index, 0); } else { /* Fail if we don't have a big enough queue. */ @@ -303,10 +319,66 @@ static int virtio_ccw_set_vqs(SubchDev *sch, uint64_t addr, uint32_t align, return 0; } -static int virtio_ccw_cb(SubchDev *sch, CCW1 ccw) +static int virtio_ccw_handle_set_vq(SubchDev *sch, CCW1 ccw, bool check_len, +bool is_legacy) { int ret; VqInfoBlock info; +VqInfoBlockLegacy linfo; +size_t info_len = is_legacy ? sizeof(linfo) : sizeof(info); + +if (check_len) { +if (ccw.count != info_len) { +return -EINVAL; +} +} else if (ccw.count < info_len) { +/* Can't execute command. 
*/ +return -EINVAL; +} +if (!ccw.cda) { +return -EFAULT; +} +if (is_legacy) { +linfo.queue = ldq_be_phys(&address_space_memory, ccw.cda); +linfo.align = ldl_be_phys(&address_space_memory, + ccw.cda + sizeof(linfo.queue)); +linfo.index = lduw_be_phys(&address_space_memory, + ccw.cda + sizeof(linfo.queue) + + sizeof(linfo.align)); +linfo.num = lduw_be_phys(&address_space_memory, + ccw.cda + sizeof(linfo.queue) + + sizeof(linfo.align) + + sizeof(linfo.index)); +ret = virtio_ccw_set_vqs(sch, NULL, &linfo); +} else { +info.desc = ldq_be_phys(&address_space_memory, ccw.cda); +info.index = lduw_be_phys(&address_space_memory, + ccw.cda + sizeof(info.desc) + + sizeof(info.res0)); +info.num = lduw_be_phys(&address_space_memory, +ccw.cda + sizeof(info.desc) + + sizeof(info.res0) + + sizeof(info.index)); +info.avail = ldq_be_phys(&address_space_memory, + ccw.cda + sizeof(info.desc) + + sizeof(info.res0) + + sizeof(info.index) + + sizeof(info.num)); +info.used = ldq_be_phys(&address_space_memory, +ccw.cda + sizeof(info.desc) ++ sizeof(info.res0) ++ sizeof(info.index) ++ sizeof(info.num) ++ sizeof(info.avail)); +ret = virtio_ccw_set_vqs(sch, &info, NULL); +} +
[PATCH RFC 00/11] qemu: towards virtio-1 host support
This patchset aims to get us some way towards implementing virtio-1
compliant and transitional devices in qemu. Branch available at

git://github.com/cohuck/qemu virtio-1

I've mainly focused on:
- endianness handling
- extended feature bits
- virtio-ccw new/changed commands

Thanks go to Thomas for some preliminary work in this area.

I've been able to start guests both with and without the virtio-1
patches in the linux guest patchset, with virtio-net and virtio-blk
devices (with and without dataplane). virtio-ccw only :) vhost,
migration and loads of other things have been ignored for now.

I'd like to know whether I'm heading in the right direction, so let's
consider this a starting point.

Cornelia Huck (8):
  virtio: cull virtio_bus_set_vdev_features
  virtio: support more feature bits
  s390x/virtio-ccw: fix check for WRITE_FEAT
  virtio: introduce legacy virtio devices
  virtio: allow virtio-1 queue layout
  dataplane: allow virtio-1 devices
  s390x/virtio-ccw: support virtio-1 set_vq format
  s390x/virtio-ccw: enable virtio 1.0

Thomas Huth (3):
  linux-headers/virtio_config: Update with VIRTIO_F_VERSION_1
  s390x/css: Add a callback for when subchannel gets disabled
  s390x/virtio-ccw: add virtio set-revision call

 hw/9pfs/virtio-9p-device.c          |    7 +-
 hw/block/dataplane/virtio-blk.c     |    3 +-
 hw/block/virtio-blk.c               |    9 +-
 hw/char/virtio-serial-bus.c         |    9 +-
 hw/net/virtio-net.c                 |   38 ---
 hw/s390x/css.c                      |   12 +++
 hw/s390x/css.h                      |    1 +
 hw/s390x/s390-virtio-bus.c          |    9 +-
 hw/s390x/virtio-ccw.c               |  188 +++
 hw/s390x/virtio-ccw.h               |    7 +-
 hw/scsi/vhost-scsi.c                |    7 +-
 hw/scsi/virtio-scsi-dataplane.c     |    2 +-
 hw/scsi/virtio-scsi.c               |    8 +-
 hw/virtio/Makefile.objs             |    2 +-
 hw/virtio/dataplane/Makefile.objs   |    2 +-
 hw/virtio/dataplane/vring.c         |   95 ++
 hw/virtio/virtio-balloon.c          |    8 +-
 hw/virtio/virtio-bus.c              |   23 +
 hw/virtio/virtio-mmio.c             |    9 +-
 hw/virtio/virtio-pci.c              |   13 +--
 hw/virtio/virtio-rng.c              |    2 +-
 hw/virtio/virtio.c                  |   51 +++---
 include/hw/virtio/dataplane/vring.h |   64 +++-
 include/hw/virtio/virtio-access.h   |    4 +
 include/hw/virtio/virtio-bus.h      |   10 +-
 include/hw/virtio/virtio.h          |   34 +--
 linux-headers/linux/virtio_config.h |    3 +
 27 files changed, 442 insertions(+), 178 deletions(-)

-- 
1.7.9.5
[PATCH RFC 03/11] virtio: support more feature bits
With virtio-1, we support more than 32 feature bits. Let's make vdev->guest_features depend on the number of supported feature bits, allowing us to grow the feature bits automatically. We also need to enhance the internal functions dealing with getting and setting features with an additional index field, so that all feature bits may be accessed (in chunks of 32 bits). vhost and migration have been ignored for now. Reviewed-by: Thomas Huth Signed-off-by: Cornelia Huck --- hw/9pfs/virtio-9p-device.c |7 ++- hw/block/virtio-blk.c |9 +++-- hw/char/virtio-serial-bus.c|9 +++-- hw/net/virtio-net.c| 38 ++ hw/s390x/s390-virtio-bus.c |9 + hw/s390x/virtio-ccw.c | 17 ++--- hw/scsi/vhost-scsi.c |7 +-- hw/scsi/virtio-scsi.c |8 hw/virtio/dataplane/vring.c| 10 +- hw/virtio/virtio-balloon.c |8 ++-- hw/virtio/virtio-bus.c |9 + hw/virtio/virtio-mmio.c|9 + hw/virtio/virtio-pci.c | 13 +++-- hw/virtio/virtio-rng.c |2 +- hw/virtio/virtio.c | 29 + include/hw/virtio/virtio-bus.h |7 --- include/hw/virtio/virtio.h | 19 ++- 17 files changed, 134 insertions(+), 76 deletions(-) diff --git a/hw/9pfs/virtio-9p-device.c b/hw/9pfs/virtio-9p-device.c index 2572747..c29c8c8 100644 --- a/hw/9pfs/virtio-9p-device.c +++ b/hw/9pfs/virtio-9p-device.c @@ -21,8 +21,13 @@ #include "virtio-9p-coth.h" #include "hw/virtio/virtio-access.h" -static uint32_t virtio_9p_get_features(VirtIODevice *vdev, uint32_t features) +static uint32_t virtio_9p_get_features(VirtIODevice *vdev, unsigned int index, + uint32_t features) { +if (index > 0) { +return features; +} + features |= 1 << VIRTIO_9P_MOUNT_TAG; return features; } diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c index 45e0c8f..5abc327 100644 --- a/hw/block/virtio-blk.c +++ b/hw/block/virtio-blk.c @@ -561,10 +561,15 @@ static void virtio_blk_set_config(VirtIODevice *vdev, const uint8_t *config) aio_context_release(bdrv_get_aio_context(s->bs)); } -static uint32_t virtio_blk_get_features(VirtIODevice *vdev, uint32_t features) +static uint32_t 
virtio_blk_get_features(VirtIODevice *vdev, unsigned int index, +uint32_t features) { VirtIOBlock *s = VIRTIO_BLK(vdev); +if (index > 0) { +return features; +} + features |= (1 << VIRTIO_BLK_F_SEG_MAX); features |= (1 << VIRTIO_BLK_F_GEOMETRY); features |= (1 << VIRTIO_BLK_F_TOPOLOGY); @@ -597,7 +602,7 @@ static void virtio_blk_set_status(VirtIODevice *vdev, uint8_t status) return; } -features = vdev->guest_features; +features = vdev->guest_features[0]; /* A guest that supports VIRTIO_BLK_F_CONFIG_WCE must be able to send * cache flushes. Thus, the "auto writethrough" behavior is never diff --git a/hw/char/virtio-serial-bus.c b/hw/char/virtio-serial-bus.c index 3931085..0d843fe 100644 --- a/hw/char/virtio-serial-bus.c +++ b/hw/char/virtio-serial-bus.c @@ -75,7 +75,7 @@ static VirtIOSerialPort *find_port_by_name(char *name) static bool use_multiport(VirtIOSerial *vser) { VirtIODevice *vdev = VIRTIO_DEVICE(vser); -return vdev->guest_features & (1 << VIRTIO_CONSOLE_F_MULTIPORT); +return vdev->guest_features[0] & (1 << VIRTIO_CONSOLE_F_MULTIPORT); } static size_t write_to_port(VirtIOSerialPort *port, @@ -467,10 +467,15 @@ static void handle_input(VirtIODevice *vdev, VirtQueue *vq) { } -static uint32_t get_features(VirtIODevice *vdev, uint32_t features) +static uint32_t get_features(VirtIODevice *vdev, unsigned int index, + uint32_t features) { VirtIOSerial *vser; +if (index > 0) { +return features; +} + vser = VIRTIO_SERIAL(vdev); if (vser->bus.max_nr_ports > 1) { diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c index 2040eac..67f91c0 100644 --- a/hw/net/virtio-net.c +++ b/hw/net/virtio-net.c @@ -86,7 +86,7 @@ static void virtio_net_set_config(VirtIODevice *vdev, const uint8_t *config) memcpy(&netcfg, config, n->config_size); -if (!(vdev->guest_features >> VIRTIO_NET_F_CTRL_MAC_ADDR & 1) && +if (!(vdev->guest_features[0] >> VIRTIO_NET_F_CTRL_MAC_ADDR & 1) && memcmp(netcfg.mac, n->mac, ETH_ALEN)) { memcpy(n->mac, netcfg.mac, ETH_ALEN); 
qemu_format_nic_info_str(qemu_get_queue(n->nic), n->mac); @@ -305,7 +305,7 @@ static RxFilterInfo *virtio_net_query_rxfilter(NetClientState *nc) info->multicast_table = str_list; info->vlan_table = get_vlan_table(n); -if (!((1 << VIRTIO_NET_F_CTRL_VLAN) & vdev->guest_features)) { +if (!((1 << VIRTIO_NET_F_CTRL_VLAN) & vdev->gue
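The chunked access to `guest_features[]` that this patch introduces boils down to indexing by `fbit / 32` and testing bit `fbit % 32`. A sketch with illustrative names (`fake_vdev`, `has_feature`, `set_feature` are not the QEMU API):

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for the demo: 64 feature bits held in 32-bit words. */
#define FEATURE_WORDS 2

struct fake_vdev {
	uint32_t guest_features[FEATURE_WORDS];
};

/* Test one feature bit, addressing the right 32-bit word. */
static int has_feature(const struct fake_vdev *vdev, unsigned int fbit)
{
	return (vdev->guest_features[fbit / 32] >> (fbit % 32)) & 1;
}

/* Set one feature bit across word boundaries. */
static void set_feature(struct fake_vdev *vdev, unsigned int fbit)
{
	vdev->guest_features[fbit / 32] |= 1u << (fbit % 32);
}
```

This is exactly why VIRTIO_F_VERSION_1 (bit 32) becomes reachable once the feature storage grows past a single 32-bit word.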
RE: Windows NIC Teaming over SRIOV-VF devices.
I have a Windows 2012R2 guest hosted on top of KVM; the guest has 2 VFs (from different PFs) assigned to it. I can team them together via Windows NIC Teaming, but when I test failover (pulling one cable) I lose connectivity until I restart the guest. Do you know where I can find instructions on how to configure KVM to allow Windows NIC Teaming? I have turned off spoof checking on the VFs, and I haven't set the MAC addresses in the host, so that the VFs can change them. David Kershner -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
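On the host side, the usual knobs for this are the per-VF spoof-checking flag and leaving the VF MAC unset. A hedged sketch with iproute2 — the PF names (enp3s0f0/enp3s0f1) and VF index 0 are placeholders for your system:

```shell
# Placeholder device names: adjust the PF interfaces and VF index for your host.
# Disable spoof checking so the teaming driver's failover MAC changes are not
# dropped by the PF:
ip link set dev enp3s0f0 vf 0 spoofchk off
ip link set dev enp3s0f1 vf 0 spoofchk off

# Verify: the vf line in the output should show "spoof checking off".
ip link show dev enp3s0f0
```

Not assigning a MAC with `ip link set ... vf N mac ...` keeps the administrative MAC zero, which lets the guest's teaming driver reprogram the VF MAC during failover.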
[PATCH RFC 06/11] virtio: allow transports to get avail/used addresses
For virtio-1, we can theoretically have a more complex virtqueue layout with avail and used buffers not on a contiguous memory area with the descriptor table. For now, it's fine for a transport driver to stay with the old layout: It needs, however, a way to access the locations of the avail/used rings so it can register them with the host. Reviewed-by: David Hildenbrand Signed-off-by: Cornelia Huck --- drivers/virtio/virtio_ring.c | 16 include/linux/virtio.h |3 +++ 2 files changed, 19 insertions(+) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 350c39b..dd0d4ec 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -961,4 +961,20 @@ void virtio_break_device(struct virtio_device *dev) } EXPORT_SYMBOL_GPL(virtio_break_device); +void *virtqueue_get_avail(struct virtqueue *_vq) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + + return vq->vring.avail; +} +EXPORT_SYMBOL_GPL(virtqueue_get_avail); + +void *virtqueue_get_used(struct virtqueue *_vq) +{ + struct vring_virtqueue *vq = to_vvq(_vq); + + return vq->vring.used; +} +EXPORT_SYMBOL_GPL(virtqueue_get_used); + MODULE_LICENSE("GPL"); diff --git a/include/linux/virtio.h b/include/linux/virtio.h index 68cadd4..f10e6e7 100644 --- a/include/linux/virtio.h +++ b/include/linux/virtio.h @@ -76,6 +76,9 @@ unsigned int virtqueue_get_vring_size(struct virtqueue *vq); bool virtqueue_is_broken(struct virtqueue *vq); +void *virtqueue_get_avail(struct virtqueue *vq); +void *virtqueue_get_used(struct virtqueue *vq); + /** * virtio_device - representation of a device using virtio * @index: unique position on the virtio bus -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC 05/11] virtio: introduce legacy virtio devices
Introduce a helper function to indicate whether a virtio device is operating in legacy or virtio standard mode. It may be used to make decisions about the endianess of virtio accesses and other virtio-1 specific changes, enabling us to support transitional devices. Reviewed-by: Thomas Huth Signed-off-by: Cornelia Huck --- hw/virtio/virtio.c|6 +- include/hw/virtio/virtio-access.h |4 include/hw/virtio/virtio.h| 13 +++-- 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 7aaa953..e6ae3a0 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -883,7 +883,11 @@ static bool virtio_device_endian_needed(void *opaque) VirtIODevice *vdev = opaque; assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN); -return vdev->device_endian != virtio_default_endian(); +if (virtio_device_is_legacy(vdev)) { +return vdev->device_endian != virtio_default_endian(); +} +/* Devices conforming to VIRTIO 1.0 or later are always LE. */ +return vdev->device_endian != VIRTIO_DEVICE_ENDIAN_LITTLE; } static const VMStateDescription vmstate_virtio_device_endian = { diff --git a/include/hw/virtio/virtio-access.h b/include/hw/virtio/virtio-access.h index 46456fd..c123ee0 100644 --- a/include/hw/virtio/virtio-access.h +++ b/include/hw/virtio/virtio-access.h @@ -19,6 +19,10 @@ static inline bool virtio_access_is_big_endian(VirtIODevice *vdev) { +if (!virtio_device_is_legacy(vdev)) { +/* Devices conforming to VIRTIO 1.0 or later are always LE. 
*/ +return false; +} #if defined(TARGET_IS_BIENDIAN) return virtio_is_big_endian(vdev); #elif defined(TARGET_WORDS_BIGENDIAN) diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h index b408166..40e567c 100644 --- a/include/hw/virtio/virtio.h +++ b/include/hw/virtio/virtio.h @@ -275,9 +275,18 @@ void virtio_queue_set_host_notifier_fd_handler(VirtQueue *vq, bool assign, void virtio_queue_notify_vq(VirtQueue *vq); void virtio_irq(VirtQueue *vq); +static inline bool virtio_device_is_legacy(VirtIODevice *vdev) +{ +return !(vdev->guest_features[1] & (1 << (VIRTIO_F_VERSION_1 - 32))); +} + static inline bool virtio_is_big_endian(VirtIODevice *vdev) { -assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN); -return vdev->device_endian == VIRTIO_DEVICE_ENDIAN_BIG; +if (virtio_device_is_legacy(vdev)) { +assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN); +return vdev->device_endian == VIRTIO_DEVICE_ENDIAN_BIG; +} +/* Devices conforming to VIRTIO 1.0 or later are always LE. */ +return false; } #endif -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2
On Tue, Oct 07, 2014 at 02:28:43PM +0100, Marc Zyngier wrote: > On 07/10/14 11:48, Catalin Marinas wrote: > > On Mon, Oct 06, 2014 at 09:30:25PM +0100, Christoffer Dall wrote: > >> +/** > >> + * kvm_prealloc_hwpgd - allocate inital table for VTTBR > >> + * @kvm: The KVM struct pointer for the VM. > >> + * @pgd: The kernel pseudo pgd > >> + * > >> + * When the kernel uses more levels of page tables than the guest, we > >> allocate > >> + * a fake PGD and pre-populate it to point to the next-level page table, > >> which > >> + * will be the real initial page table pointed to by the VTTBR. > >> + * > >> + * When KVM_PREALLOC_LEVEL==2, we allocate a single page for the PMD and > >> + * the kernel will use folded pud. When KVM_PREALLOC_LEVEL==1, we > >> + * allocate 2 consecutive PUD pages. > >> + */ > >> +#if defined(CONFIG_ARM64_64K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 3 > >> +#define KVM_PREALLOC_LEVEL 2 > >> +#define PTRS_PER_S2_PGD1 > >> +#define S2_PGD_ORDER get_order(PTRS_PER_S2_PGD * sizeof(pgd_t)) > > > > I agree that my magic equation wasn't readable ;) (I had troubles > > re-understanding it as well), but you also have some constants here that > > are not immediately obvious where you got to them from. IIUC, > > KVM_PREALLOC_LEVEL == 2 here means that the hardware only understands > > stage 2 pmd and pte. I guess you could look into the ARM ARM tables but > > it's still not clear. > > > > Let's look at PTRS_PER_S2_PGD as I think it's simpler. My proposal was: > > > > #if PGDIR_SHIFT > KVM_PHYS_SHIFT > > #define PTRS_PER_S2_PGD (1) > > #else > > #define PTRS_PER_S2_PGD (1 << (KVM_PHYS_SHIFT - > > PGDIR_SHIFT)) > > #endif > > > > In this case PGDIR_SHIFT is 42, so we get PTRS_PER_S2_PGD == 1. The 4K > > and 4 levels case below is also correct. > > > > The KVM start level calculation, we could assume that KVM needs either > > host levels or host levels - 1 (unless we go for some weirdly small > > KVM_PHYS_SHIFT). 
So we could define them KVM_PREALLOC_LEVEL as: > > > > #if PTRS_PER_S2_PGD <= 16 > > #define KVM_PREALLOC_LEVEL (4 - CONFIG_ARM64_PGTABLE_LEVELS + 1) > > #else > > #define KVM_PREALLOC_LEVEL (0) > > #endif > > > > Basically if you can concatenate 16 or less pages at the level below the > > top, the architecture does not allow a small top level. In this case, > > (4 - CONFIG_ARM64_PGTABLE_LEVELS) represents the first level for the > > host and we add 1 to go to the next level for KVM stage 2 when > > PTRS_PER_S2_PGD is 16 or less. We use 0 when we don't need to > > preallocate. > > I think this makes the whole thing clearer (at least for me), as it > makes the relationship between KVM_PREALLOC_LEVEL and > CONFIG_ARM64_PGTABLE_LEVELS explicit (it wasn't completely obvious to me > initially). Agreed. > > >> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd) > >> +{ > >> + pud_t *pud; > >> + pmd_t *pmd; > >> + > >> + pud = pud_offset(pgd, 0); > >> + pmd = (pmd_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 0); > >> + > >> + if (!pmd) > >> + return -ENOMEM; > >> + pud_populate(NULL, pud, pmd); > >> + > >> + return 0; > >> +} > >> + > >> +static inline void kvm_free_hwpgd(struct kvm *kvm) > >> +{ > >> + pgd_t *pgd = kvm->arch.pgd; > >> + pud_t *pud = pud_offset(pgd, 0); > >> + pmd_t *pmd = pmd_offset(pud, 0); > >> + free_pages((unsigned long)pmd, 0); > >> +} > >> + > >> +static inline phys_addr_t kvm_get_hwpgd(struct kvm *kvm) > >> +{ > >> + pgd_t *pgd = kvm->arch.pgd; > >> + pud_t *pud = pud_offset(pgd, 0); > >> + pmd_t *pmd = pmd_offset(pud, 0); > >> + return virt_to_phys(pmd); > >> + > >> +} > >> +#elif defined(CONFIG_ARM64_4K_PAGES) && CONFIG_ARM64_PGTABLE_LEVELS == 4 > >> +#define KVM_PREALLOC_LEVEL 1 > >> +#define PTRS_PER_S2_PGD2 > >> +#define S2_PGD_ORDER get_order(PTRS_PER_S2_PGD * sizeof(pgd_t)) > > > > Here PGDIR_SHIFT is 39, so we get PTRS_PER_S2_PGD == (1 << (40 - 39)) > > which is 2 and KVM_PREALLOC_LEVEL == 1. 
> > > >> +static inline int kvm_prealloc_hwpgd(struct kvm *kvm, pgd_t *pgd) > >> +{ > >> + pud_t *pud; > >> + > >> + pud = (pud_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, 1); > >> + if (!pud) > >> + return -ENOMEM; > >> + pgd_populate(NULL, pgd, pud); > >> + pgd_populate(NULL, pgd + 1, pud + PTRS_PER_PUD); > >> + > >> + return 0; > >> +} > > > > You still need to define these functions but you can make their > > implementation dependent solely on the KVM_PREALLOC_LEVEL rather than > > 64K/4K and levels combinations. If it is KVM_PREALLOC_LEVEL is 1, you > > allocate pud and populate the pgds (in a loop based on the > > PTRS_PER_S2_PGD). If it is 2, you allocate the pmd and populate the pud > > (still in a loop though it would probably be 1 iteration). We know based > > on the a
[PATCH] x86,kvm,vmx: Don't trap writes to CR4.TSD
CR4.TSD is guest-owned; don't trap writes to it in VMX guests. This avoids a VM exit on context switches into or out of a PR_TSC_SIGSEGV task. I think that this fixes an unintentional side-effect of: 4c38609ac569 KVM: VMX: Make guest cr4 mask more conservative Signed-off-by: Andy Lutomirski --- arch/x86/kvm/vmx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 54ca301f8415..0653233fcc82 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -105,7 +105,7 @@ module_param(nested, bool, S_IRUGO); (KVM_VM_CR0_ALWAYS_ON_UNRESTRICTED_GUEST | X86_CR0_PG | X86_CR0_PE) #define KVM_CR4_GUEST_OWNED_BITS \ (X86_CR4_PVI | X86_CR4_DE | X86_CR4_PCE | X86_CR4_OSFXSR \ -| X86_CR4_OSXMMEXCPT) +| X86_CR4_OSXMMEXCPT | X86_CR4_TSD) #define KVM_PMODE_VM_CR4_ALWAYS_ON (X86_CR4_PAE | X86_CR4_VMXE) #define KVM_RMODE_VM_CR4_ALWAYS_ON (X86_CR4_VME | X86_CR4_PAE | X86_CR4_VMXE) -- 1.9.3
Re: [PATCH RFC 00/11] qemu: towards virtio-1 host support
On 10/07/2014 07:39 AM, Cornelia Huck wrote: > This patchset aims to get us some way to implement virtio-1 compliant > and transitional devices in qemu. Branch available at > > git://github.com/cohuck/qemu virtio-1 > > I've mainly focused on: > - endianness handling > - extended feature bits > - virtio-ccw new/changed commands At the risk of some distraction, would it be worth thinking about a solution to the IOMMU bypassing mess as part of this? --Andy
Re: Query with respect to VCPU scheduling
On Fri, Oct 03, 2014 at 01:40:29PM -0400, Mohan Kumar wrote: >Hello all, > >I am new to KVM and look at a particular case in KVM. >I know that KVM uses CFS. > >The question I have is as follows: > >- I have two VMs with 2 VCPU each. And the actual > CPUs are also 2 and both the VMs use the 2 CPUs. > >I want one VM to use only 25 % of the actual CPU and the >remaining to be used by the other VM. > >Since KVM uses CFS, I am thinking that with CFS I can >provide these guarantee as mentioned above. I know >we can do these shares with cgroups. > >Is cgroups the only option?. Any pointers to this would help. How about CFS bandwidth control? Regards, Wanpeng Li > >Thanks, >Mohan
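A hedged sketch of Wanpeng's suggestion using the cpu cgroup's CFS bandwidth knobs — the paths and the group name vm1 are illustrative and vary with distribution and libvirt setup. With 2 host CPUs, 25% of total capacity is 50 ms of runtime per 100 ms period:

```shell
# Illustrative paths; requires CONFIG_CFS_BANDWIDTH and a mounted cpu cgroup.
mkdir -p /sys/fs/cgroup/cpu/vm1
echo 100000 > /sys/fs/cgroup/cpu/vm1/cpu.cfs_period_us  # 100 ms period
echo 50000  > /sys/fs/cgroup/cpu/vm1/cpu.cfs_quota_us   # 50 ms = 25% of 2 CPUs
# Move the VM's qemu process (and with it the vcpu threads) into the group:
echo "$QEMU_PID" > /sys/fs/cgroup/cpu/vm1/cgroup.procs
```

Unlike cpu.shares, which only sets a proportional weight when CPUs are contended, cfs_quota_us enforces a hard cap even when the other VM is idle.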
Re: [PATCH v2] KVM: x86: some apic broadcast modes does not work
On Fri, Oct 03, 2014 at 12:30:52AM +0300, Nadav Amit wrote: >KVM does not deliver x2APIC broadcast messages with physical mode. Intel SDM >(10.12.9 ICR Operation in x2APIC Mode) states: "A destination ID value of >_H is used for broadcast of interrupts in both logical destination and >physical destination modes." > >In addition, the local-apic enables cluster mode broadcast. As Intel SDM >10.6.2.2 says: "Broadcast to all local APICs is achieved by setting all >destination bits to one." This patch enables cluster mode broadcast. > >The fix tries to combine broadcast in different modes through a unified code. > >One rare case occurs when the source of IPI has its APIC disabled. In such >case, the source can still issue IPIs, but since the source is not obliged to >have the same LAPIC mode as the enabled ones, we cannot rely on it. >Since it is a rare case, it is unoptimized and done on the slow-path. > >--- > >Changes v1->v2: >- Follow Radim's review: setting constants, preferring simplicity to marginal > performance gain, etc. 
>- Combine the cluster mode and x2apic mode patches >Signed-off-by: Nadav Amit >--- Reviewed-by: Wanpeng Li > arch/x86/include/asm/kvm_host.h | 2 +- > arch/x86/kvm/lapic.c| 28 ++-- > arch/x86/kvm/lapic.h| 4 ++-- > virt/kvm/ioapic.h | 2 +- > 4 files changed, 26 insertions(+), 10 deletions(-) >diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h >index 7d603a7..6f2be83 100644 >--- a/arch/x86/include/asm/kvm_host.h >+++ b/arch/x86/include/asm/kvm_host.h >@@ -542,7 +542,7 @@ struct kvm_apic_map { > struct rcu_head rcu; > u8 ldr_bits; > /* fields bellow are used to decode ldr values in different modes */ >- u32 cid_shift, cid_mask, lid_mask; >+ u32 cid_shift, cid_mask, lid_mask, broadcast; > struct kvm_lapic *phys_map[256]; > /* first index is cluster id second is cpu id in a cluster */ > struct kvm_lapic *logical_map[16][16]; >diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c >index b8345dd..ee04adf 100644 >--- a/arch/x86/kvm/lapic.c >+++ b/arch/x86/kvm/lapic.c >@@ -68,6 +68,9 @@ > #define MAX_APIC_VECTOR 256 > #define APIC_VECTORS_PER_REG 32 > >+#define APIC_BROADCAST 0xFF >+#define X2APIC_BROADCAST 0xfffffffful >+ > #define VEC_POS(v) ((v) & (32 - 1)) > #define REG_POS(v) (((v) >> 5) << 4) > >@@ -149,6 +152,7 @@ static void recalculate_apic_map(struct kvm *kvm) > new->cid_shift = 8; > new->cid_mask = 0; > new->lid_mask = 0xff; >+ new->broadcast = APIC_BROADCAST; > > kvm_for_each_vcpu(i, vcpu, kvm) { > struct kvm_lapic *apic = vcpu->arch.apic; >@@ -170,6 +174,7 @@ static void recalculate_apic_map(struct kvm *kvm) > new->cid_shift = 16; > new->cid_mask = (1 << KVM_X2APIC_CID_BITS) - 1; > new->lid_mask = 0xffff; >+ new->broadcast = X2APIC_BROADCAST; > } else if (kvm_apic_sw_enabled(apic) && > !new->cid_mask /* flat mode */ && > kvm_apic_get_reg(apic, APIC_DFR) == > APIC_DFR_CLUSTER) { >@@ -558,16 +563,25 @@ static void apic_set_tpr(struct kvm_lapic *apic, u32 tpr) > apic_update_ppr(apic); > } > >-int kvm_apic_match_physical_addr(struct kvm_lapic
*apic, u16 dest) >+static int kvm_apic_broadcast(struct kvm_lapic *apic, u32 dest) >+{ >+ return dest == (apic_x2apic_mode(apic) ? >+ X2APIC_BROADCAST : APIC_BROADCAST); >+} >+ >+int kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 dest) > { >- return dest == 0xff || kvm_apic_id(apic) == dest; >+ return kvm_apic_id(apic) == dest || kvm_apic_broadcast(apic, dest); > } > >-int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 mda) >+int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda) > { > int result = 0; > u32 logical_id; > >+ if (kvm_apic_broadcast(apic, mda)) >+ return 1; >+ > if (apic_x2apic_mode(apic)) { > logical_id = kvm_apic_get_reg(apic, APIC_LDR); > return logical_id & mda; >@@ -595,7 +609,7 @@ int kvm_apic_match_logical_addr(struct kvm_lapic *apic, u8 >mda) > } > > int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source, >- int short_hand, int dest, int dest_mode) >+ int short_hand, unsigned int dest, int dest_mode) > { > int result = 0; > struct kvm_lapic *target = vcpu->arch.apic; >@@ -657,9 +671,11 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, >struct kvm_lapic *src, > if (!map) > goto out; > >+ if (irq->dest_id == map->broadcast) >+ goto out; >+ > if (irq->dest_mode == 0) { /* physical mode */ >- if (irq->delivery_mode == APIC_DM_
Re: [PATCH v2 00/15] KVM GICv3 emulation
On 2014/8/21 21:06, Andre Przywara wrote: > GICv3 is the ARM generic interrupt controller designed to overcome > some limits of the prevalent GICv2. Most notably it lifts the 8-CPU > limit. Though with recent patches from Marc there is support for > hosts to use a GICv3, the CPU limitation still applies to KVM guests, > since the current code emulates a GICv2 only. > Also, GICv2 backward compatibility being optional in GICv3, a number > of systems won't be able to run GICv2 guests. > > This patch series provides code to emulate a GICv3 distributor and > redistributor for any KVM guest. It requires a GICv3 in the host to > work. With those patches one can run guests efficiently on any GICv3 > host. It has the following features: > - Affinity routing (support for up to 255 VCPUs, more possible) Hi Andre, in struct vgic_dist { /* Bitmap indicating which CPU has something pending */ unsigned long irq_pending_on_cpu; } the irq_pending_on_cpu field is just a single **unsigned long**. See the vgic_update_state func: void vgic_update_state(struct kvm *kvm) { ... kvm_for_each_vcpu(c, vcpu, kvm) { if (compute_pending_for_cpu(vcpu)) { pr_debug("CPU%d has pending interrupts\n", c); set_bit(c, &dist->irq_pending_on_cpu); } } } When a VCPU id is 64 or greater and that VCPU has a pending interrupt, the set_bit(c, &dist->irq_pending_on_cpu) here would write beyond the field. Isn't that wrong? > - System registers (as opposed to MMIO access) > - No ITS > - No priority support (as the GICv2 emulation) > - No save / restore support so far (will be added soon) > > The first 11 patches actually refactor the current VGIC code to make > room for a different VGIC model to be dropped in with Patch 12/15. > The remaining patches connect the new model to the kernel backend and > the userland facing code. > > The series goes on top of v3.17-rc1 and Marc's vgic-dyn series[2].
> The necessary patches for kvmtool to enable the guest's GICv3 have > been posted here before [3], an updated version will follow as soon > as the kvmtools tree has been updated. > > There was some testing on the fast model with some I/O and interrupt > affinity shuffling in a Linux guest with a varying number of VCPUs as > well as some testing on a Juno board (GICv2 only, to spot > regressions). > > Please review and test. > I would be grateful for people to test for GICv2 regressions also > (so on a GICv2 host with current kvmtool/qemu), as there is quite > some refactoring on that front. > > Much of the code was inspired by MarcZ, also kudos to him for doing > the rather painful rebase on top of v3.17-rc1. > > Cheers, > Andre. > > [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-June/010060.html > [2] > https://git.kernel.org/cgit/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/vgic-dyn > [3] https://lists.cs.columbia.edu/pipermail/kvmarm/2014-June/010086.html > > Changes v1 ... 
v2: > * rebase to v3.17-rc1, caused quite some changes to the init code > * new 9/15 patch to make 10/15 smaller > * fix wrongly ordered cp15 register trap entry (MarcZ) > * fix SGI broadcast (thanks to wanghaibin for spotting) > * fix broken bailout path in kvm_vgic_create (wanghaibin) > * check return value of init_emulation_ops() (wanghaibin) > * fix return value check in vgic_[sg]et_attr() > * add header inclusion guards > * remove double definition of VCPU_NOT_ALLOCATED > * some code move-around > * whitespace fixes > > > Andre Przywara (15): > arm/arm64: KVM: rework MPIDR assignment and add accessors > arm/arm64: KVM: pass down user space provided GIC type into vGIC code > arm/arm64: KVM: refactor vgic_handle_mmio() function > arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 bit ones > arm/arm64: KVM: introduce per-VM ops > arm/arm64: KVM: make the maximum number of vCPUs a per-VM value > arm/arm64: KVM: make the value of ICC_SRE_EL1 a per-VM variable > arm/arm64: KVM: refactor MMIO accessors > arm/arm64: KVM: refactor/wrap vgic_set/get_attr() > arm/arm64: KVM: split GICv2 specific emulation code from vgic.c > arm/arm64: KVM: add opaque private pointer to MMIO accessors > arm/arm64: KVM: add virtual GICv3 distributor emulation > arm/arm64: KVM: add SGI system register trapping > arm/arm64: KVM: enable kernel side of GICv3 emulation > arm/arm64: KVM: allow userland to request a virtual GICv3 > > arch/arm/include/asm/kvm_emulate.h |2 +- > arch/arm/include/asm/kvm_host.h |3 + > arch/arm/kvm/Makefile|1 + > arch/arm/kvm/arm.c | 23 +- > arch/arm/kvm/coproc.c| 19 + > arch/arm/kvm/psci.c | 15 +- > arch/arm64/include/asm/kvm_emulate.h
Re: [patch 3/4] KVM: MMU: reload request from GET_DIRTY_LOG path
On Mon, Oct 06, 2014 at 02:19:32PM -0300, Marcelo Tosatti wrote: > On Sat, Oct 04, 2014 at 10:23:32AM +0300, Gleb Natapov wrote: > > On Tue, Sep 09, 2014 at 12:28:11PM -0300, Marcelo Tosatti wrote: > > > On Mon, Jul 21, 2014 at 04:14:24PM +0300, Gleb Natapov wrote: > > > > On Wed, Jul 09, 2014 at 04:12:53PM -0300, mtosa...@redhat.com wrote: > > > > > Reload remote vcpus MMU from GET_DIRTY_LOG codepath, before > > > > > deleting a pinned spte. > > > > > > > > > > Signed-off-by: Marcelo Tosatti > > > > > > > > > > --- > > > > > arch/x86/kvm/mmu.c | 29 +++-- > > > > > 1 file changed, 23 insertions(+), 6 deletions(-) > > > > > > > > > > Index: kvm.pinned-sptes/arch/x86/kvm/mmu.c > > > > > === > > > > > --- kvm.pinned-sptes.orig/arch/x86/kvm/mmu.c 2014-07-09 > > > > > 11:23:59.290744490 -0300 > > > > > +++ kvm.pinned-sptes/arch/x86/kvm/mmu.c 2014-07-09 > > > > > 11:24:58.449632435 -0300 > > > > > @@ -1208,7 +1208,8 @@ > > > > > * > > > > > * Return true if tlb need be flushed. > > > > > */ > > > > > -static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool > > > > > pt_protect) > > > > > +static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool > > > > > pt_protect, > > > > > +bool skip_pinned) > > > > > { > > > > > u64 spte = *sptep; > > > > > > > > > > @@ -1218,6 +1219,22 @@ > > > > > > > > > > rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, > > > > > *sptep); > > > > > > > > > > + if (is_pinned_spte(spte)) { > > > > > + /* keep pinned spte intact, mark page dirty again */ > > > > > + if (skip_pinned) { > > > > > + struct kvm_mmu_page *sp; > > > > > + gfn_t gfn; > > > > > + > > > > > + sp = page_header(__pa(sptep)); > > > > > + gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt); > > > > > + > > > > > + mark_page_dirty(kvm, gfn); > > > > > + return false; > > > > Why not mark all pinned gfns as dirty in kvm_vm_ioctl_get_dirty_log() > > > > while > > > > populating dirty_bitmap_buffer? > > > > > > The pinned gfns are per-vcpu. 
Requires looping all vcpus (not > > > scalable). > > > > > OK, but do they really have to be per-cpu? What's the advantage? > > The guest configures pinning on a per-cpu basis (that is, enabling PEBS > is done on a per-cpu basis). Is it a problem to maintain global pinned pages list for each memslot too? > > > > > > > > > + } else > > > > > + mmu_reload_pinned_vcpus(kvm); > > > > Can you explain why do you need this? > > > > > > Because if skip_pinned = false, we want vcpus to exit (called > > > from enable dirty logging codepath). > > > > > I guess what I wanted to ask is why do we need skip_pinned? As far as I see > > it > > is set to false in two cases: > > 1: page is write protected for shadow MMU needs, should not happen since > > the feature > > Correct. > > >is not supported with shadow mmu (can happen with nested EPT, but page > > will be marked > >is accessed during next vcpu entry anyway, so how will it work)? > > PEBS is not supported on nested EPT. > OK, so for this case we do not need skip_pinned. Assert if it happens. > > 2: when slot is marked as read only: such slot cannot have PEBS pages and > > if it will guest will die > >anyway during next guest entry, so why not maintain list of pinned pages > > per slot and kill aguest > >if slot with pinned pages is marked read only. > > 2: when slots pages have dirty logging enabled. In that case, the page > is marked dirty immediately. This is the call from > kvm_mmu_slot_remove_write_access. > Right, my 2 is incorrect. > So when enabling dirty logging, for pinned sptes: > > - maintain pinned spte intact. > - mark gfn for which pinned spte represents as dirty in the dirty > log. > But you set skip_pinned to false in kvm_mmu_slot_remove_write_access(), so this is not what is happening. Did you mean to set it to true there? -- Gleb. 