Re: [PATCH kernel v7 04/31] vfio: powerpc/spapr: Use it_page_size
On 04/02/2015 08:48 AM, Alex Williamson wrote:
On Sat, 2015-03-28 at 01:54 +1100, Alexey Kardashevskiy wrote:
This makes use of the it_page_size from the iommu_table struct as page size can differ. This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code as recently introduced IOMMU_PAGE_XXX macros do not include IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index f835e63..8bbee22 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -91,7 +91,7 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * enforcing the limit based on the max that the guest can map.
 	 */
 	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 	locked = current->mm->locked_vm + npages;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -120,7 +120,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	down_write(&current->mm->mmap_sem);
 	current->mm->locked_vm -= (container->tbl->it_size <<
-			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+			container->tbl->it_page_shift) >> PAGE_SHIFT;
 	up_write(&current->mm->mmap_sem);
 }
@@ -222,7 +222,7 @@ static long tce_iommu_build(struct tce_container *container,
 				tce, ret);
 			break;
 		}
-		tce += IOMMU_PAGE_SIZE_4K;
+		tce += IOMMU_PAGE_SIZE(tbl);

Is PAGE_SIZE ever smaller than IOMMU_PAGE_SIZE(tbl)? IOW, can the page we got from get_user_pages_fast() ever not completely fill the tce entry?

Yes. IOMMU_PAGE_SIZE is 4K/64K/16M (16M is with huge pages enabled in QEMU with -mem-path), PAGE_SIZE is 4K/64K (normally 64K).

(Have I asked this before? Sorry if so)

:)

-- 
Alexey
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
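For readers following the accounting change in this patch, here is a minimal userspace sketch of the arithmetic, not kernel code. Only the expression `(it_size << it_page_shift) >> PAGE_SHIFT` comes from the diff; the function name `tce_table_locked_pages` and its parameters are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Userspace model of the accounting in tce_iommu_enable(): a TCE table
 * with it_size entries, each mapping (1 << it_page_shift) bytes, is
 * converted into a count of system pages of (1 << page_shift) bytes.
 * tce_table_locked_pages() is a hypothetical name, not a kernel function.
 */
static uint64_t tce_table_locked_pages(uint64_t it_size,
				       unsigned int it_page_shift,
				       unsigned int page_shift)
{
	return (it_size << it_page_shift) >> page_shift;
}
```

With these assumptions, a 1024-entry table of 4K IOMMU pages on a 64K-page kernel accounts for 64 system pages, while the same table with 64K IOMMU pages accounts for 1024. Hardcoding the 4K shift for the second table would undercount locked memory by a factor of 16.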
Re: [PATCH kernel v7 12/31] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
On 04/02/2015 08:48 AM, Alex Williamson wrote: On Sat, 2015-03-28 at 01:54 +1100, Alexey Kardashevskiy wrote: Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds a iommu_table_group container for TCE tables. Right now just one table is supported. Signed-off-by: Alexey Kardashevskiy --- Documentation/vfio.txt | 23 ++ arch/powerpc/include/asm/iommu.h| 18 +++-- arch/powerpc/kernel/iommu.c | 34 arch/powerpc/platforms/powernv/pci-ioda.c | 38 + arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++-- arch/powerpc/platforms/powernv/pci.c| 2 +- arch/powerpc/platforms/powernv/pci.h| 4 +- arch/powerpc/platforms/pseries/iommu.c | 9 ++- drivers/vfio/vfio_iommu_spapr_tce.c | 120 9 files changed, 183 insertions(+), 82 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..94328c8 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed: +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/ +VFIO_IOMMU_DISABLE and implements 2 new ioctls: +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY +(which are unsupported in v1 IOMMU). + +PPC64 paravirtualized guests generate a lot of map/unmap requests, +and the handling of those includes pinning/unpinning pages and updating +mm::locked_vm counter to make sure we do not exceed the rlimit. +The v2 IOMMU splits accounting and pinning into separate operations: + +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls +receive a user space address and size of the block to be pinned. +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to +be called with the exact address and size used for registering +the memory block. The userspace is not expected to call these often. +The ranges are stored in a linked list in a VFIO container. 
+ +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual +IOMMU table and do not do pinning; instead these check that the userspace +address is from pre-registered range. + +This separation helps in optimizing DMA for guests. + --- [1] VFIO was originally an acronym for "Virtual Function I/O" in its diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index eb75726..667aa1a 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -90,9 +90,7 @@ struct iommu_table { struct iommu_pool pools[IOMMU_NR_POOLS]; unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ -#ifdef CONFIG_IOMMU_API - struct iommu_group *it_group; -#endif + struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; void (*set_bypass)(struct iommu_table *tbl, bool enable); }; @@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); */ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); + +#define IOMMU_TABLE_GROUP_MAX_TABLES 1 + +struct iommu_table_group { #ifdef CONFIG_IOMMU_API -extern void iommu_register_group(struct iommu_table *tbl, + struct iommu_group *group; +#endif + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; +}; + +#ifdef CONFIG_IOMMU_API +extern void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num); extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); #else -static inline void iommu_register_group(struct iommu_table *tbl, +static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num) Not a new problem, but there's some awfully liberal use of the namespace with function names here. IOMMU API uses iommu_foo() functions. 
IOMMU group related interfaces within the IOMMU API include "group" somewhere in that name. powerpc-specific functions should include a tag to avoid causing conflicts there.

Cannot argue with that, but it is kind of late for this patchset, no? And iommu_table is way too generic a name for a powerpc/spapr-specific thing. I can replace it with something better; should I do this now?

(sorry for commenting twice on the same patch)

-- 
Alexey
Re: [PATCH kernel v7 28/31] powerpc/mmu: Add userspace-to-physical addresses translation cache
On 04/02/2015 08:48 AM, Alex Williamson wrote: On Sat, 2015-03-28 at 01:55 +1100, Alexey Kardashevskiy wrote: We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory and in-kernel acceleration of DMA requests. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage a mapped pages counter; when it is zero, it is safe to unregister the region. Multiple registration of the same region is allowed, kref is used to track the number of registrations. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/mmu-hash64.h | 3 + arch/powerpc/include/asm/mmu_context.h | 16 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/mmu_context_hash64.c | 6 + arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 + 5 files changed, 241 insertions(+) create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h index 4f13c3e..83214c4 100644 --- a/arch/powerpc/include/asm/mmu-hash64.h +++ b/arch/powerpc/include/asm/mmu-hash64.h @@ -535,6 +535,9 @@ typedef struct { /* for 4K PTE fragment support */ void *pte_frag; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + struct list_head iommu_group_mem_list; +#endif } mm_context_t; diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 73382eb..3461c91 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ b/arch/powerpc/include/asm/mmu_context.h @@ -16,6 +16,22 @@ */ extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm); extern void destroy_context(struct mm_struct *mm); +#ifdef CONFIG_SPAPR_TCE_IOMMU +typedef struct mm_iommu_table_group_mem_t mm_iommu_table_group_mem_t; + +extern bool mm_iommu_preregistered(void); +extern long mm_iommu_alloc(unsigned long ua, unsigned long entries, + mm_iommu_table_group_mem_t **pmem); +extern mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua, + unsigned long entries); +extern long mm_iommu_put(mm_iommu_table_group_mem_t *mem); +extern void mm_iommu_cleanup(mm_context_t *ctx); +extern mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua, + unsigned long size); +extern long mm_iommu_ua_to_hpa(mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa); +extern long mm_iommu_mapped_update(mm_iommu_table_group_mem_t *mem, bool inc); +#endif extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next); extern void switch_slb(struct task_struct *tsk, struct 
mm_struct *mm); diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 438dcd3..49fbfc7 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -35,3 +35,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT)+= subpage-prot.o obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o +obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c index 178876ae..eb3080c 100644 --- a/arch/powerpc/mm/mmu_context_hash64.c +++ b/arch/powerpc/mm/mmu_context_hash64.c @@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) #ifdef CONFIG_PPC_64K_PAGES mm->context.pte_frag = NULL; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list); +#endif return 0; } @@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm) void destroy_context(struct mm_struct *mm) { +#ifdef CONFIG_SPAPR_TCE_IOMMU + mm_iommu_cleanup(&mm->context); +#endif #ifdef CONFIG_PPC_ICSWX drop_cop(mm->context.acop, mm); diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c new file mode 100644 index 000..c268c4d --- /dev/null +++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c @@ -0,0 +1,215 @@ +/* + * IOMMU helpers in MMU context. + * + * Copyright (C) 2015 IBM Corp. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of
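The patch description lists userspace-to-physical address translation (item 3) as part of the new API, but the implementation file is cut off in this archive. The following is a userspace sketch of how such a lookup could work over "a header and a list of physical addresses". Every name here (`region_sketch`, `region_ua_to_hpa`, `SKETCH_PAGE_SHIFT`) is hypothetical, and the 64K page size is an assumption; this is not the code from mmu_context_hash64_iommu.c.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 16u	/* assumed 64K system pages */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Hypothetical model of one pre-registered region. */
struct region_sketch {
	uint64_t ua;		/* userspace start address, page aligned */
	size_t entries;		/* number of pages in the region */
	const uint64_t *hpas;	/* one physical address per page */
};

/*
 * Translate a userspace address inside the region to a physical address.
 * Returns 0 and stores the result in *hpa, or -EFAULT if ua is outside
 * the region.
 */
static long region_ua_to_hpa(const struct region_sketch *mem,
			     uint64_t ua, uint64_t *hpa)
{
	uint64_t entry;

	if (ua < mem->ua)
		return -EFAULT;

	entry = (ua - mem->ua) >> SKETCH_PAGE_SHIFT;
	if (entry >= mem->entries)
		return -EFAULT;

	/* Page frame from the table, offset within the page from ua. */
	*hpa = mem->hpas[entry] | (ua & (SKETCH_PAGE_SIZE - 1));
	return 0;
}
```

The point of the design is that the per-request path becomes a bounds check and an array index, with no pinning involved, because the pages were pinned once at registration time.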
Re: [PATCH kernel v7 04/31] vfio: powerpc/spapr: Use it_page_size
On 04/02/2015 01:50 PM, Alex Williamson wrote:
On Thu, 2015-04-02 at 13:30 +1100, Alexey Kardashevskiy wrote:
On 04/02/2015 08:48 AM, Alex Williamson wrote:
On Sat, 2015-03-28 at 01:54 +1100, Alexey Kardashevskiy wrote:
This makes use of the it_page_size from the iommu_table struct as page size can differ. This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code as recently introduced IOMMU_PAGE_XXX macros do not include IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index f835e63..8bbee22 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -91,7 +91,7 @@ static int tce_iommu_enable(struct tce_container *container)
 	 * enforcing the limit based on the max that the guest can map.
 	 */
 	down_write(&current->mm->mmap_sem);
-	npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+	npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 	locked = current->mm->locked_vm + npages;
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
@@ -120,7 +120,7 @@ static void tce_iommu_disable(struct tce_container *container)
 	down_write(&current->mm->mmap_sem);
 	current->mm->locked_vm -= (container->tbl->it_size <<
-			IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+			container->tbl->it_page_shift) >> PAGE_SHIFT;
 	up_write(&current->mm->mmap_sem);
 }
@@ -222,7 +222,7 @@ static long tce_iommu_build(struct tce_container *container,
 				tce, ret);
 			break;
 		}
-		tce += IOMMU_PAGE_SIZE_4K;
+		tce += IOMMU_PAGE_SIZE(tbl);

Is PAGE_SIZE ever smaller than IOMMU_PAGE_SIZE(tbl)? IOW, can the page we got from get_user_pages_fast() ever not completely fill the tce entry?

Yes. IOMMU_PAGE_SIZE is 4K/64K/16M (16M is with huge pages enabled in QEMU with -mem-path), PAGE_SIZE is 4K/64K (normally 64K).

Isn't that a problem then that you're filling the tce with processor page sizes via get_user_pages_fast(), but incrementing the tce by the IOMMU page size? For example, if PAGE_SIZE = 4K and IOMMU_PAGE_SIZE != 4K, have we really pinned all of the memory backed by the tce? Where do you make sure the 4K page is really contiguous for the IOMMU page?

Aaaah. This is just not supported. Instead, after the previous patch ("vfio: powerpc/spapr: Check that TCE page size is equal to it_page_size", which needs its subject fixed), tce_page_is_contained(page4K, 64K) will return false and the caller - tce_iommu_build() - will return -EPERM.

-- 
Alexey
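The behaviour described in the last reply can be pictured with a small userspace model. This is a sketch under stated assumptions: `backing_shift` stands in for the size of the (possibly compound) host page returned by get_user_pages_fast(), and `page_is_contained()` and `build_one()` are hypothetical names that do not reproduce the signature of the kernel's tce_page_is_contained().

```c
#include <errno.h>
#include <stdbool.h>

/*
 * True when one host page of (1 << backing_shift) bytes is large enough
 * to back a whole IOMMU page of (1 << iommu_page_shift) bytes.
 */
static bool page_is_contained(unsigned int backing_shift,
			      unsigned int iommu_page_shift)
{
	return backing_shift >= iommu_page_shift;
}

/*
 * Model of the caller: refuse the mapping with -EPERM when the pinned
 * host page cannot cover the IOMMU page.
 */
static long build_one(unsigned int backing_shift,
		      unsigned int iommu_page_shift)
{
	if (!page_is_contained(backing_shift, iommu_page_shift))
		return -EPERM;

	return 0;
}
```

In this model a 4K host page (shift 12) behind a 64K IOMMU page (shift 16) is rejected, while a 64K or 16M host page behind a 64K IOMMU page is accepted, which matches the "just not supported" answer above.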
Re: [PATCH kernel v7 26/31] powerpc/iommu: Add userspace view of TCE table
On 04/03/2015 07:50 AM, Alex Williamson wrote: Should have sent this with the other comments, but found it hiding on my desktop... On Sat, 2015-03-28 at 01:55 +1100, Alexey Kardashevskiy wrote: In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 7 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 23 ++- 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 2c08c91..a768a4d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -106,9 +106,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_group; + unsigned long *it_userspace; /* userspace view of the table */ struct iommu_table_ops *it_ops; }; +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ + ((tbl)->it_userspace ? 
\ + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \ + NULL) + /* Pure 2^n version of get_order */ static inline __attribute_const__ int get_iommu_order(unsigned long size, struct iommu_table *tbl) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 0bcd988..82102d1 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include @@ -1069,6 +1070,9 @@ static int iommu_table_take_ownership(struct iommu_table *tbl) spin_unlock(&tbl->pools[i].lock); spin_unlock_irqrestore(&tbl->large_pool.lock, flags); + BUG_ON(tbl->it_userspace); + tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size); + -ENOMEM? return 0; } @@ -1102,6 +1106,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + vfree(tbl->it_userspace); + tbl->it_userspace = NULL; + spin_lock_irqsave(&tbl->large_pool.lock, flags); for (i = 0; i < tbl->nr_pools; i++) spin_lock(&tbl->pools[i].lock); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index bc36cf1..036f3c1 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include @@ -1469,6 +1470,9 @@ static void pnv_pci_free_table(struct iommu_table *tbl) if (!tbl->it_size) return; + if (tbl->it_userspace) Not necessary Out of curiosity - why? Is every single implementation is known for checking the argument? + vfree(tbl->it_userspace); + Why no NULL setting this time? iommu_reset_table() (2 lines below) will do memset(0) on the entire struct. 
pnv_free_tce_table(tbl->it_base, size, tbl->it_indirect_levels); iommu_reset_table(tbl, "ioda2"); } @@ -1656,9 +1660,26 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group, pnv_pci_ioda2_set_bypass(pe, !enable); } +static long pnv_pci_ioda2_create_table_with_uas( + struct iommu_table_group *table_group, + int num, __u32 page_shift, __u64 window_size, __u32 levels, + struct iommu_table *tbl) +{ + long ret = pnv_pci_ioda2_create_table(table_group, num, + page_shift, window_size, levels, tbl); + + if (ret) + return ret; + + BUG_ON(tbl->it_userspace); + tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size); -ENOMEM + + return 0; +} + static struct iommu_table_group_ops pnv_pci_ioda2_ops = { .set_ownership = pnv_ioda2_set_ownership, - .create_table = pnv_pci_ioda2_create_table, + .create_table = pnv_pci_ioda2_create_table_with_uas, .set_window = pnv_pci_ioda2_set_window, .unset_window = pnv_pci_ioda2_unset_window, }; Thanks for the review! What is overall resume? Another respin? -- Alexey _
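The two review remarks in this message ("-ENOMEM?" and "Not necessary") can be illustrated with a userspace analogue. This is a sketch, not the kernel code: calloc()/free() stand in for vzalloc()/vfree(), the struct and function names are hypothetical, and returning -EBUSY where the patch uses BUG_ON() is an assumption made so the example can be exercised.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two iommu_table fields involved. */
struct table_sketch {
	unsigned long it_size;		/* number of TCE entries */
	unsigned long *it_userspace;	/* userspace view, one slot per entry */
};

/* Allocate the zeroed userspace view and report allocation failure. */
static int table_alloc_userspace_view(struct table_sketch *tbl)
{
	if (tbl->it_userspace)
		return -EBUSY;	/* the patch uses BUG_ON() here */

	tbl->it_userspace = calloc(tbl->it_size, sizeof(*tbl->it_userspace));
	if (!tbl->it_userspace)
		return -ENOMEM;

	return 0;
}

/* No NULL check is needed before freeing: free(NULL) is a no-op. */
static void table_free_userspace_view(struct table_sketch *tbl)
{
	free(tbl->it_userspace);
	tbl->it_userspace = NULL;
}
```

Keeping the allocation and the release in one pair of helpers is one way to address the question of which file should own the free; where the kernel code should actually place it is for the thread to decide.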
Re: [PATCH kernel v7 26/31] powerpc/iommu: Add userspace view of TCE table
On 04/09/2015 01:43 AM, Alex Williamson wrote: On Wed, 2015-04-08 at 13:22 +1000, Alexey Kardashevskiy wrote: On 04/03/2015 07:50 AM, Alex Williamson wrote: Should have sent this with the other comments, but found it hiding on my desktop... On Sat, 2015-03-28 at 01:55 +1100, Alexey Kardashevskiy wrote: In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 7 +++ arch/powerpc/platforms/powernv/pci-ioda.c | 23 ++- 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 2c08c91..a768a4d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -106,9 +106,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_group; + unsigned long *it_userspace; /* userspace view of the table */ struct iommu_table_ops *it_ops; }; +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ + ((tbl)->it_userspace ? 
\ + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \ + NULL) + /* Pure 2^n version of get_order */ static inline __attribute_const__ int get_iommu_order(unsigned long size, struct iommu_table *tbl) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 0bcd988..82102d1 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include @@ -1069,6 +1070,9 @@ static int iommu_table_take_ownership(struct iommu_table *tbl) spin_unlock(&tbl->pools[i].lock); spin_unlock_irqrestore(&tbl->large_pool.lock, flags); + BUG_ON(tbl->it_userspace); + tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size); + -ENOMEM? return 0; } @@ -1102,6 +1106,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + vfree(tbl->it_userspace); + tbl->it_userspace = NULL; + spin_lock_irqsave(&tbl->large_pool.lock, flags); for (i = 0; i < tbl->nr_pools; i++) spin_lock(&tbl->pools[i].lock); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index bc36cf1..036f3c1 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include @@ -1469,6 +1470,9 @@ static void pnv_pci_free_table(struct iommu_table *tbl) if (!tbl->it_size) return; + if (tbl->it_userspace) Not necessary Out of curiosity - why? Is every single implementation is known for checking the argument? AFAIK, all flavors of free in the kernel accept NULL pointers and do the right thing. I verified this one does too. + vfree(tbl->it_userspace); + Why no NULL setting this time? iommu_reset_table() (2 lines below) will do memset(0) on the entire struct. So then should iommu_reset_table() handle the vfree() as well? I wanted to keep vfree() in the same file with vzalloc(). Bad idea? 
But I'll move vfree() to iommu_reset_table() anyway. pnv_free_tce_table(tbl->it_base, size, tbl->it_indirect_levels); iommu_reset_table(tbl, "ioda2"); } @@ -1656,9 +1660,26 @@ static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group, pnv_pci_ioda2_set_bypass(pe, !enable); } +static long pnv_pci_ioda2_create_table_with_uas( + struct iommu_table_group *table_group, + int num, __u32 page_shift, __u64 window_size, __u32 levels, + struct iommu_table *tbl) +{ + long ret = pnv_pci_ioda2_create_table(table_group, num, + page_shift, window_size, levels, tbl); + + if (ret) + return ret; + + BUG_ON(tbl->it_userspace); + tbl->it_userspace = vzalloc(sizeof(*tbl->it_userspace) * tbl->it_size); -E
[PATCH kernel] powerpc/pseries: Fix compile of memory hotplug without CONFIG_MEMORY_HOTREMOVE
51925fb3c5 "powerpc/pseries: Implement memory hotplug remove in the kernel" broke compile when CONFIG_MEMORY_HOTREMOVE is not defined due to missing symbols. This fixes the issue by adding the missing symbols.

Signed-off-by: Alexey Kardashevskiy
---
This is made against ad30cb99465 (mpe/next) and can be squashed into 51925fb3c5 if not too late.
---
 arch/powerpc/platforms/pseries/hotplug-memory.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 5cefcad..0ced387 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -369,6 +369,19 @@ static inline int dlpar_memory_remove(struct pseries_hp_errorlog *hp_elog)
 {
 	return -EOPNOTSUPP;
 }
+static int dlpar_remove_lmb(struct of_drconf_cell *lmb)
+{
+	return -EOPNOTSUPP;
+}
+static int dlpar_memory_remove_by_count(u32 lmbs_to_remove,
+					struct property *prop)
+{
+	return -EOPNOTSUPP;
+}
+static int dlpar_memory_remove_by_index(u32 drc_index, struct property *prop)
+{
+	return -EOPNOTSUPP;
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
-- 
2.0.0
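The fix follows a common pattern: when a feature is compiled out, provide stubs with the same names so shared callers still compile. Below is a minimal standalone sketch of that pattern. The macro `SKETCH_HOTREMOVE` and the `sketch_*` functions are hypothetical stand-ins for CONFIG_MEMORY_HOTREMOVE and the dlpar_* helpers, not the kernel code itself.

```c
#include <errno.h>

/* Define SKETCH_HOTREMOVE at build time to model CONFIG_MEMORY_HOTREMOVE=y. */
#ifdef SKETCH_HOTREMOVE
static int sketch_remove_by_index(unsigned int drc_index)
{
	(void)drc_index;	/* a real implementation would remove the LMB */
	return 0;
}
#else
/* Stub: same name and signature, so callers build without the feature. */
static int sketch_remove_by_index(unsigned int drc_index)
{
	(void)drc_index;
	return -EOPNOTSUPP;
}
#endif

/* Shared caller: compiles in both configurations. */
static int sketch_handle_remove(unsigned int drc_index)
{
	return sketch_remove_by_index(drc_index);
}
```

Built without `SKETCH_HOTREMOVE`, the caller simply reports that the operation is not supported instead of failing at compile or link time.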
Re: [PATCH kernel v8 07/31] vfio: powerpc/spapr: Moving pinning/unpinning to helpers
On 04/15/2015 05:10 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:49PM +1000, Alexey Kardashevskiy wrote: This is a pretty mechanical patch to make next patches simpler. New tce_iommu_unuse_page() helper does put_page() now but it might skip that after the memory registering patch applied. As we are here, this removes unnecessary checks for a value returned by pfn_to_page() as it cannot possibly return NULL. This moves tce_iommu_disable() later to let tce_iommu_clear() know if the container has been enabled because if it has not been, then put_page() must not be called on TCEs from the TCE table. This situation is not yet possible but it will after KVM acceleration patchset is applied. Signed-off-by: Alexey Kardashevskiy --- Changes: v6: * tce_get_hva() returns hva via a pointer --- drivers/vfio/vfio_iommu_spapr_tce.c | 68 +++-- 1 file changed, 50 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index c137bb3..ec5ee83 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -196,7 +196,6 @@ static void tce_iommu_release(void *iommu_data) struct iommu_table *tbl = container->tbl; WARN_ON(tbl && !tbl->it_group); - tce_iommu_disable(container); if (tbl) { tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size); @@ -204,63 +203,96 @@ static void tce_iommu_release(void *iommu_data) if (tbl->it_group) tce_iommu_detach_group(iommu_data, tbl->it_group); } + + tce_iommu_disable(container); + mutex_destroy(&container->lock); kfree(container); } +static void tce_iommu_unuse_page(struct tce_container *container, + unsigned long oldtce) +{ + struct page *page; + + if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE))) + return; + + /* +* VFIO cannot map/unmap when a container is not enabled so +* we would not need this check but KVM could map/unmap and if +* this happened, we must not put pages as KVM does not get them as +* it expects memory pre-registation to do 
this part.
+	 */
+	if (!container->enabled)
+		return;

This worries me a bit. How can whether the container is enabled now safely tell you whether get_page() was called at some earlier point in time?

This is a leftover, I'll remove it as after the "iommu v2" patch there will be tce_iommu_unuse_page_v2().

+
+	page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);
+
+	if (oldtce & TCE_PCI_WRITE)
+		SetPageDirty(page);
+
+	put_page(page);
+}
+
 static int tce_iommu_clear(struct tce_container *container,
 		struct iommu_table *tbl,
 		unsigned long entry, unsigned long pages)
 {
 	unsigned long oldtce;
-	struct page *page;
 	for ( ; pages; --pages, ++entry) {
 		oldtce = iommu_clear_tce(tbl, entry);
 		if (!oldtce)
 			continue;
-		page = pfn_to_page(oldtce >> PAGE_SHIFT);
-		WARN_ON(!page);
-		if (page) {
-			if (oldtce & TCE_PCI_WRITE)
-				SetPageDirty(page);
-			put_page(page);
-		}
+		tce_iommu_unuse_page(container, (unsigned long) __va(oldtce));
 	}
 	return 0;
 }
+static int tce_get_hva(unsigned long tce, unsigned long *hva)
+{
+	struct page *page = NULL;
+	enum dma_data_direction direction = iommu_tce_direction(tce);
+
+	if (get_user_pages_fast(tce & PAGE_MASK, 1,
+			direction != DMA_TO_DEVICE, &page) != 1)
+		return -EFAULT;
+
+	*hva = (unsigned long) page_address(page);
+
+	return 0;
+}

I'd prefer to see this called tce_iommu_use_page() for symmetry.

If I rename this one, then what would I call tce_get_hva_cached() from "vfio: powerpc/spapr: Register memory and define IOMMU v2"?

+
 static long tce_iommu_build(struct tce_container *container,
 		struct iommu_table *tbl,
 		unsigned long entry, unsigned long tce, unsigned long pages)
 {
 	long i, ret = 0;
-	struct page *page = NULL;
+	struct page *page;
 	unsigned long hva;
 	enum dma_data_direction direction = iommu_tce_direction(tce);
 	for (i = 0; i < pages; ++i) {
-		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
-				direction != DMA_TO_DEVICE, &page);
-		if (unlikely(ret != 1)) {
-			ret = -EFAULT;
+		ret = tce_get_hva(tce, &hva);
+		if (ret)
 			break;
-		}
+
Re: [PATCH kernel v8 12/31] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
On 04/16/2015 03:55 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:54PM +1000, Alexey Kardashevskiy wrote: Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds a iommu_table_group container for TCE tables. Right now just one table is supported. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 18 +++-- arch/powerpc/kernel/iommu.c | 34 arch/powerpc/platforms/powernv/pci-ioda.c | 38 + arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++-- arch/powerpc/platforms/powernv/pci.c| 2 +- arch/powerpc/platforms/powernv/pci.h| 4 +- arch/powerpc/platforms/pseries/iommu.c | 9 ++- drivers/vfio/vfio_iommu_spapr_tce.c | 120 8 files changed, 160 insertions(+), 82 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index eb75726..667aa1a 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -90,9 +90,7 @@ struct iommu_table { struct iommu_pool pools[IOMMU_NR_POOLS]; unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ -#ifdef CONFIG_IOMMU_API - struct iommu_group *it_group; -#endif + struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; void (*set_bypass)(struct iommu_table *tbl, bool enable); }; @@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); */ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); + +#define IOMMU_TABLE_GROUP_MAX_TABLES 1 + +struct iommu_table_group { #ifdef CONFIG_IOMMU_API -extern void iommu_register_group(struct iommu_table *tbl, + struct iommu_group *group; +#endif + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; There's nothing to indicate which of the tables are in use at the current time. I mean, it doesn't matter now because there's only one, but the patch doesn't make a whole lot of sense without that. 
+}; + +#ifdef CONFIG_IOMMU_API +extern void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num); extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); #else -static inline void iommu_register_group(struct iommu_table *tbl, +static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num) { diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b39d00a..fd49c8e 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) struct iommu_table *iommu_table_alloc(int node) { - struct iommu_table *tbl; + struct iommu_table_group *table_group; - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node); + table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL, + node); + table_group->tables[0].it_group = table_group; - return tbl; + return &table_group->tables[0]; } void iommu_free_table(struct iommu_table *tbl, const char *node_name) Surely the free function should take a table group rather than a table as argument. No, it should not. Tables lifetime is not the same even within the same group. 
{ unsigned long bitmap_sz; unsigned int order; + struct iommu_table_group *table_group = tbl->it_group; if (!tbl || !tbl->it_map) { printk(KERN_ERR "%s: expected TCE map for %s\n", __func__, @@ -738,9 +741,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) clear_bit(0, tbl->it_map); #ifdef CONFIG_IOMMU_API - if (tbl->it_group) { - iommu_group_put(tbl->it_group); - BUG_ON(tbl->it_group); + if (table_group->group) { + iommu_group_put(table_group->group); + BUG_ON(table_group->group); } #endif @@ -756,7 +759,7 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) free_pages((unsigned long) tbl->it_map, order); /* free table */ - kfree(tbl); + kfree(table_group); } /* Creates TCEs for a user provided buffer. The user buffer must be @@ -903,11 +906,12 @@ EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm); */ static void group_release(void *iommu_data) { -
Re: [PATCH kernel v8 21/31] powerpc/iommu: Split iommu_free_table into 2 helpers
On 04/16/2015 04:46 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:31:03PM +1000, Alexey Kardashevskiy wrote: The iommu_free_table helper release memory it is using (the TCE table and @it_map) and release the iommu_table struct as well. We might not want the very last step as we store iommu_table in parent structures. Yeah, as I commented on the earlier patch, freeing the surrounding group from a function taking just the individual table is wrong. This is iommu tables created by the old code which stores these iommu_table struct pointers in device nodes. I believe there is a plan to get rid of iommu tables there and when this is done, this workaround will be gone. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 1 + arch/powerpc/kernel/iommu.c | 57 2 files changed, 35 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index bde7ee7..8ed4648 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -127,6 +127,7 @@ static inline void *get_iommu_table_base(struct device *dev) extern struct iommu_table *iommu_table_alloc(int node); /* Frees table for an individual device node */ +extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name); extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); /* Initializes an iommu_table based in values set in the passed-in diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 501e8ee..0bcd988 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -721,24 +721,46 @@ struct iommu_table *iommu_table_alloc(int node) return &table_group->tables[0]; } +void iommu_reset_table(struct iommu_table *tbl, const char *node_name) +{ + if (!tbl) + return; + + if (tbl->it_map) { + unsigned long bitmap_sz; + unsigned int order; + + /* +* In case we have reserved the first bit, we should not emit +* the warning below. 
+*/ + if (tbl->it_offset == 0) + clear_bit(0, tbl->it_map); + + /* verify that table contains no entries */ + if (!bitmap_empty(tbl->it_map, tbl->it_size)) + pr_warn("%s: Unexpected TCEs for %s\n", __func__, + node_name); + + /* calculate bitmap size in bytes */ + bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long); + + /* free bitmap */ + order = get_order(bitmap_sz); + free_pages((unsigned long) tbl->it_map, order); + } + + memset(tbl, 0, sizeof(*tbl)); +} + void iommu_free_table(struct iommu_table *tbl, const char *node_name) { - unsigned long bitmap_sz; - unsigned int order; struct iommu_table_group *table_group = tbl->it_group; - if (!tbl || !tbl->it_map) { - printk(KERN_ERR "%s: expected TCE map for %s\n", __func__, - node_name); + if (!tbl) return; - } - /* -* In case we have reserved the first bit, we should not emit -* the warning below. -*/ - if (tbl->it_offset == 0) - clear_bit(0, tbl->it_map); + iommu_reset_table(tbl, node_name); #ifdef CONFIG_IOMMU_API if (table_group->group) { @@ -747,17 +769,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) } #endif - /* verify that table contains no entries */ - if (!bitmap_empty(tbl->it_map, tbl->it_size)) - pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name); - - /* calculate bitmap size in bytes */ - bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long); - - /* free bitmap */ - order = get_order(bitmap_sz); - free_pages((unsigned long) tbl->it_map, order); - /* free table */ kfree(table_group); } -- Alexey ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
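[Editor's sketch] The split discussed above separates "release what the table owns" from "release the storage holding the table". Below is a minimal userspace model of that idea, not the kernel code: every model_* name is hypothetical, calloc()/free() stand in for the kernel allocators, and the group pointer is read only after the NULL check, which is an assumption about the intended ordering.

```c
#include <limits.h>
#include <stdlib.h>

#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_BITS_TO_LONGS(n) \
	(((n) + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

struct model_group;

/* Hypothetical stand-in for struct iommu_table. */
struct model_table {
	unsigned long it_size;		/* number of entries */
	unsigned long *it_map;		/* allocation bitmap */
	struct model_group *it_group;	/* owning group, if heap allocated */
};

/* Hypothetical stand-in for struct iommu_table_group. */
struct model_group {
	struct model_table tables[1];
};

/* Allocate the bitmap for a table whose storage already exists. */
static int model_init_table(struct model_table *tbl, unsigned long size)
{
	tbl->it_map = calloc(MODEL_BITS_TO_LONGS(size), sizeof(unsigned long));
	if (!tbl->it_map)
		return -1;
	tbl->it_size = size;
	return 0;
}

/* Release what the table owns, but not the memory the table lives in. */
static void model_reset_table(struct model_table *tbl)
{
	if (!tbl)
		return;
	free(tbl->it_map);		/* free(NULL) is a no-op */
	*tbl = (struct model_table){ 0 };
}

/* Heap-allocate a group and hand out its first table. */
static struct model_table *model_table_alloc(unsigned long size)
{
	struct model_group *group = calloc(1, sizeof(*group));

	if (!group)
		return NULL;
	if (model_init_table(&group->tables[0], size)) {
		free(group);
		return NULL;
	}
	group->tables[0].it_group = group;
	return &group->tables[0];
}

/* Full teardown: reset the table, then free the surrounding group. */
static void model_free_table(struct model_table *tbl)
{
	struct model_group *group;

	if (!tbl)
		return;
	group = tbl->it_group;	/* read before the reset wipes the table */
	model_reset_table(tbl);
	free(group);
}
```

A table embedded in a longer-lived structure would only ever see model_reset_table(); model_free_table() is for tables obtained from model_table_alloc().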
Re: [PATCH kernel v8 12/31] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
On 04/16/2015 03:55 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:54PM +1000, Alexey Kardashevskiy wrote: Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds a iommu_table_group container for TCE tables. Right now just one table is supported. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 18 +++-- arch/powerpc/kernel/iommu.c | 34 arch/powerpc/platforms/powernv/pci-ioda.c | 38 + arch/powerpc/platforms/powernv/pci-p5ioc2.c | 17 ++-- arch/powerpc/platforms/powernv/pci.c| 2 +- arch/powerpc/platforms/powernv/pci.h| 4 +- arch/powerpc/platforms/pseries/iommu.c | 9 ++- drivers/vfio/vfio_iommu_spapr_tce.c | 120 8 files changed, 160 insertions(+), 82 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index eb75726..667aa1a 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -90,9 +90,7 @@ struct iommu_table { struct iommu_pool pools[IOMMU_NR_POOLS]; unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ -#ifdef CONFIG_IOMMU_API - struct iommu_group *it_group; -#endif + struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; void (*set_bypass)(struct iommu_table *tbl, bool enable); }; @@ -126,14 +124,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); */ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); + +#define IOMMU_TABLE_GROUP_MAX_TABLES 1 + +struct iommu_table_group { #ifdef CONFIG_IOMMU_API -extern void iommu_register_group(struct iommu_table *tbl, + struct iommu_group *group; +#endif + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; There's nothing to indicate which of the tables are in use at the current time. I mean, it doesn't matter now because there's only one, but the patch doesn't make a whole lot of sense without that. 
Later in the patchset, the code will look at @it_size to know if the table is in use. +}; + +#ifdef CONFIG_IOMMU_API +extern void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num); extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); #else -static inline void iommu_register_group(struct iommu_table *tbl, +static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, unsigned long pe_num) { diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b39d00a..fd49c8e 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -712,17 +712,20 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid) struct iommu_table *iommu_table_alloc(int node) { - struct iommu_table *tbl; + struct iommu_table_group *table_group; - tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node); + table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL, + node); + table_group->tables[0].it_group = table_group; - return tbl; + return &table_group->tables[0]; } void iommu_free_table(struct iommu_table *tbl, const char *node_name) Surely the free function should take a table group rather than a table as argument. Please ignore my other response to your reply; I reworked the whole thing to store iommu_table_group in the pci device node. Thanks. -- Alexey ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
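[Editor's sketch] The reply above says later patches look at @it_size to tell whether a table is in use. A minimal model of that convention follows. It assumes that it_size == 0 means "unused" and uses a two-slot group purely for illustration (the quoted patch still defines one slot); all model_* names are hypothetical.

```c
#define MODEL_MAX_TABLES 2	/* illustration only; the quoted patch uses 1 */

/* Hypothetical stand-in for struct iommu_table. */
struct model_table {
	unsigned long it_size;	/* 0 is assumed to mean "slot not in use" */
};

/* Hypothetical stand-in for struct iommu_table_group. */
struct model_group {
	struct model_table tables[MODEL_MAX_TABLES];
};

/* Return the index of the first unused slot, or -1 if all are taken. */
static int model_find_free_table(const struct model_group *group)
{
	int i;

	for (i = 0; i < MODEL_MAX_TABLES; i++)
		if (group->tables[i].it_size == 0)
			return i;
	return -1;
}
```

Under this convention no separate "in use" flag is needed, at the cost of making a zero-sized table impossible to represent.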
Re: [PATCH kernel v8 14/31] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control
On 04/16/2015 04:07 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote: At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. The set_bypass() callback is not really an iommu_table function but IOMMU/PE function. This introduces a iommu_table_group_ops struct and adds a set_ownership() callback to it which is called when an external user takes control over the IOMMU. Do you really need separate ops structures at both the single table and table group level? The different tables in a group will all belong to the same basic iommu won't they? IOMMU tables exist alone in VIO. Also, the platform code uses just a table (or it is in bypass mode) and does not care about table groups. It looked more clean for myself to keep them separated. Should I still merge those? This renames set_bypass() to set_ownership() as it is not necessarily just enabling bypassing, it can be something else/more so let's give it more generic name. The bool parameter is inverted. The callback is implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 14 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++ drivers/vfio/vfio_iommu_spapr_tce.c | 25 + 3 files changed, 56 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index b9e50d3..d1f8c6c 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -92,7 +92,6 @@ struct iommu_table { unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, #define IOMMU_TABLE_GROUP_MAX_TABLES 1 +struct iommu_table_group; + +struct iommu_table_group_ops { + /* +* Switches ownership from the kernel itself to an external +* user. While onwership is enabled, the kernel cannot use IOMMU +* for itself. +*/ + void (*set_ownership)(struct iommu_table_group *table_group, + bool enable); The meaning of "enable" in a function called "set_ownership" is entirely obscure. Suggest something better please :) I have nothing better... 
+}; + struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct iommu_table_group_ops *ops; }; #ifdef CONFIG_IOMMU_API diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a964c50..9687731 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe, - table_group); uint16_t window_id = (pe->pe_number << 1 ) + 1; int64_t rc; @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) * host side. */ if (pe->pdev) - set_iommu_table_base(&pe->pdev->dev, tbl); + set_iommu_table_base(&pe->pdev->dev, + &pe->table_group.tables[0]); else pnv_ioda_setup_bus_dma(pe, pe->pbus, false); } @@ -1302,13 +1301,27 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, /* TVE #1 is selected by PCI address bit 59 */ pe->tce_bypass_base = 1ull << 59; - /* Install set_bypass callback for VFIO */ - pe->table_group.tables[0].set_bypass = pnv_pci_ioda2_set_bypass; - /* Enable bypass by default */ - pnv_pci_ioda2_set_bypass(&pe->table_group.tables[0], true); + pnv_pci_ioda2_set_bypass(pe, true); } +static void pnv_ioda2_set_ownership(struct iommu_table_group *table_group, +
Re: [PATCH kernel v8 15/31] powerpc/iommu: Fix IOMMU ownership control functions
On 04/16/2015 04:10 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:57PM +1000, Alexey Kardashevskiy wrote: This adds missing locks in iommu_take_ownership()/ iommu_release_ownership(). This marks all pages busy in iommu_table::it_map in order to catch errors if there is an attempt to use this table while ownership over it is taken. This only clears TCE content if there is no page marked busy in it_map. Clearing must be done outside of the table locks as iommu_clear_tce() called from iommu_clear_tces_and_put_pages() does this. Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * do not store bit#0 value, it has to be set for zero-based table anyway * removed test_and_clear_bit --- arch/powerpc/kernel/iommu.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 7d6089b..068fe4ff 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build); static int iommu_table_take_ownership(struct iommu_table *tbl) { - unsigned long sz = (tbl->it_size + 7) >> 3; + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + int ret = 0; + + spin_lock_irqsave(&tbl->large_pool.lock, flags); + for (i = 0; i < tbl->nr_pools; i++) + spin_lock(&tbl->pools[i].lock); if (tbl->it_offset == 0) clear_bit(0, tbl->it_map); if (!bitmap_empty(tbl->it_map, tbl->it_size)) { pr_err("iommu_tce: it_map is not empty"); - return -EBUSY; + ret = -EBUSY; + if (tbl->it_offset == 0) + set_bit(0, tbl->it_map); This really needs a comment. Why on earth are you changing the it_map on a failure case? Does this explain? /* * The platform code reserves zero address in iommu_init_table(). * As we cleared busy bit for page @0 before using bitmap_empty(), * we are restoring it now. 
*/ + } else { + memset(tbl->it_map, 0xff, sz); } - memset(tbl->it_map, 0xff, sz); + for (i = 0; i < tbl->nr_pools; i++) + spin_unlock(&tbl->pools[i].lock); + spin_unlock_irqrestore(&tbl->large_pool.lock, flags); return 0; } @@ -1095,7 +1106,11 @@ EXPORT_SYMBOL_GPL(iommu_take_ownership); static void iommu_table_release_ownership(struct iommu_table *tbl) { - unsigned long sz = (tbl->it_size + 7) >> 3; + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + + spin_lock_irqsave(&tbl->large_pool.lock, flags); + for (i = 0; i < tbl->nr_pools; i++) + spin_lock(&tbl->pools[i].lock); memset(tbl->it_map, 0, sz); @@ -1103,6 +1118,9 @@ static void iommu_table_release_ownership(struct iommu_table *tbl) if (tbl->it_offset == 0) set_bit(0, tbl->it_map); + for (i = 0; i < tbl->nr_pools; i++) + spin_unlock(&tbl->pools[i].lock); + spin_unlock_irqrestore(&tbl->large_pool.lock, flags); } extern void iommu_release_ownership(struct iommu_table_group *table_group) -- Alexey ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
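[Editor's sketch] The bit #0 handling questioned above can be modelled in userspace as follows. This is a single-threaded sketch with the pool locks omitted and hypothetical model_* names. One deliberate difference: the quoted hunk keeps a "return 0;" context line after setting ret, while the sketch returns ret, on the assumption that propagating -EBUSY is the intent.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_TABLE_SIZE 64	/* entries; a multiple of the long width */
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_LONGS \
	((MODEL_TABLE_SIZE + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG)

/* Hypothetical stand-in for the relevant parts of struct iommu_table. */
struct model_table {
	unsigned long it_offset;
	unsigned long it_map[MODEL_LONGS];
};

static void model_set_bit(struct model_table *t, unsigned int n)
{
	t->it_map[n / MODEL_BITS_PER_LONG] |= 1UL << (n % MODEL_BITS_PER_LONG);
}

static void model_clear_bit(struct model_table *t, unsigned int n)
{
	t->it_map[n / MODEL_BITS_PER_LONG] &= ~(1UL << (n % MODEL_BITS_PER_LONG));
}

static bool model_map_empty(const struct model_table *t)
{
	size_t i;

	for (i = 0; i < MODEL_LONGS; i++)
		if (t->it_map[i])
			return false;
	return true;
}

static int model_take_ownership(struct model_table *t)
{
	int ret = 0;

	/* Bit 0 is reserved on zero-based tables; ignore it for the check. */
	if (t->it_offset == 0)
		model_clear_bit(t, 0);

	if (!model_map_empty(t)) {
		ret = -EBUSY;
		/* Undo the temporary clear so a failed call changes nothing. */
		if (t->it_offset == 0)
			model_set_bit(t, 0);
	} else {
		/* Mark every entry busy so stray allocations are caught. */
		memset(t->it_map, 0xff, sizeof(t->it_map));
	}
	return ret;
}

static void model_release_ownership(struct model_table *t)
{
	memset(t->it_map, 0, sizeof(t->it_map));
	/* Re-reserve entry 0 for zero-based tables. */
	if (t->it_offset == 0)
		model_set_bit(t, 0);
}
```

The point of restoring bit 0 on the failure path is that the bitmap after a failed take is identical to the bitmap before the call.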
Re: [PATCH kernel v8 17/31] powerpc/iommu/powernv: Release replaced TCE
On 04/16/2015 04:26 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:59PM +1000, Alexey Kardashevskiy wrote: At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE and DMA direction so the caller can release the pages afterwards. The returned old TCE value is a virtual address as the new TCE value. This is different from tce_clear() which returns a physical address. This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO. This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg(). This makes sure that TCE permission bits are not set in TCE passed to IOMMU API as those are to be calculated by platform code from DMA direction. This moves SetPageDirty() to the IOMMU code to make it work for both VFIO ioctl interface in in-kernel TCE acceleration (when it becomes available later). 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 17 ++-- arch/powerpc/kernel/iommu.c | 53 +--- arch/powerpc/platforms/powernv/pci-ioda.c | 38 ++ arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++ arch/powerpc/platforms/powernv/pci.c| 17 arch/powerpc/platforms/powernv/pci.h| 2 + drivers/vfio/vfio_iommu_spapr_tce.c | 62 ++--- 7 files changed, 130 insertions(+), 62 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index d1f8c6c..bde7ee7 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -44,11 +44,22 @@ extern int iommu_is_off; extern int iommu_force_on; struct iommu_table_ops { + /* When called with direction==DMA_NONE, it is equal to clear() */ int (*set)(struct iommu_table *tbl, long index, long npages, unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs); +#ifdef CONFIG_IOMMU_API + /* +* Exchanges existing TCE with new TCE plus direction bits; +* returns old TCE and DMA direction mask +*/ + int (*exchange)(struct iommu_table *tbl, + long index, + unsigned long *tce, + enum dma_data_direction *direction); +#endif void (*clear)(struct iommu_table *tbl, long index, long npages); unsigned long (*get)(struct iommu_table *tbl, long index); @@ -152,6 +163,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group, extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); +extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry, + unsigned long *tce, enum dma_data_direction *direction); #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, @@ -231,10 +244,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, unsigned long npages); extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); -extern int 
iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); extern void iommu_flush_tce(struct iommu_table *tbl); extern int iommu_take_ownership(struct iommu_table_group *table_group); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 068fe4ff..501e8ee 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -982,9 +982,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check); int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce) { - if (!(tce & (TCE_PCI_WRITE | TCE_PCI_READ))) - return -EINVAL; - if (tce & ~(IOMMU_PAGE_MASK(tbl) | TCE_PCI_WRITE | TCE_PCI_READ)) return -EINVAL; @@ -1002,44 +999,20 @@ int iommu_tce_put_param_chec
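[Editor's sketch] The exchange() semantics described in the commit message (write the new TCE, hand back the old TCE and its direction) can be illustrated with a small model. This is a single-threaded sketch: a plain array store stands in for the atomic exchange the real code would need, and the permission bit values, enum and function names are all hypothetical.

```c
/* Model permission bits kept in the low bits of an entry (values assumed). */
#define MODEL_TCE_READ	0x1UL
#define MODEL_TCE_WRITE	0x2UL
#define MODEL_TCE_PERM	(MODEL_TCE_READ | MODEL_TCE_WRITE)

enum model_dma_dir {
	MODEL_DMA_BIDIRECTIONAL,
	MODEL_DMA_TO_DEVICE,
	MODEL_DMA_FROM_DEVICE,
	MODEL_DMA_NONE,
};

/* Permission bits are derived from the direction, never passed in by callers. */
static unsigned long model_dir_to_perm(enum model_dma_dir dir)
{
	switch (dir) {
	case MODEL_DMA_BIDIRECTIONAL:
		return MODEL_TCE_READ | MODEL_TCE_WRITE;
	case MODEL_DMA_TO_DEVICE:
		return MODEL_TCE_READ;	/* device reads memory */
	case MODEL_DMA_FROM_DEVICE:
		return MODEL_TCE_WRITE;	/* device writes memory */
	default:
		return 0;
	}
}

static enum model_dma_dir model_perm_to_dir(unsigned long tce)
{
	switch (tce & MODEL_TCE_PERM) {
	case MODEL_TCE_READ | MODEL_TCE_WRITE:
		return MODEL_DMA_BIDIRECTIONAL;
	case MODEL_TCE_READ:
		return MODEL_DMA_TO_DEVICE;
	case MODEL_TCE_WRITE:
		return MODEL_DMA_FROM_DEVICE;
	default:
		return MODEL_DMA_NONE;
	}
}

/*
 * Store the new entry and return the previous one through the same
 * pointers, so the caller can release the page the old entry referred to.
 */
static int model_tce_xchg(unsigned long *table, unsigned long index,
			  unsigned long *tce, enum model_dma_dir *dir)
{
	unsigned long newtce = *tce | model_dir_to_perm(*dir);
	unsigned long oldtce = table[index];

	table[index] = newtce;	/* not atomic in this model */
	*tce = oldtce & ~MODEL_TCE_PERM;
	*dir = model_perm_to_dir(oldtce);
	return 0;
}
```

Writing over a valid entry succeeds here and simply reports what was replaced, which is the behaviour the commit message says PAPR allows.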
Re: [PATCH kernel v8 15/31] powerpc/iommu: Fix IOMMU ownership control functions
On 04/20/2015 12:46 PM, David Gibson wrote: On Fri, Apr 17, 2015 at 08:16:13PM +1000, Alexey Kardashevskiy wrote: On 04/16/2015 04:10 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:57PM +1000, Alexey Kardashevskiy wrote: This adds missing locks in iommu_take_ownership()/ iommu_release_ownership(). This marks all pages busy in iommu_table::it_map in order to catch errors if there is an attempt to use this table while ownership over it is taken. This only clears TCE content if there is no page marked busy in it_map. Clearing must be done outside of the table locks as iommu_clear_tce() called from iommu_clear_tces_and_put_pages() does this. Signed-off-by: Alexey Kardashevskiy --- Changes: v5: * do not store bit#0 value, it has to be set for zero-based table anyway * removed test_and_clear_bit --- arch/powerpc/kernel/iommu.c | 26 ++ 1 file changed, 22 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 7d6089b..068fe4ff 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1052,17 +1052,28 @@ EXPORT_SYMBOL_GPL(iommu_tce_build); static int iommu_table_take_ownership(struct iommu_table *tbl) { - unsigned long sz = (tbl->it_size + 7) >> 3; + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + int ret = 0; + + spin_lock_irqsave(&tbl->large_pool.lock, flags); + for (i = 0; i < tbl->nr_pools; i++) + spin_lock(&tbl->pools[i].lock); if (tbl->it_offset == 0) clear_bit(0, tbl->it_map); if (!bitmap_empty(tbl->it_map, tbl->it_size)) { pr_err("iommu_tce: it_map is not empty"); - return -EBUSY; + ret = -EBUSY; + if (tbl->it_offset == 0) + set_bit(0, tbl->it_map); This really needs a comment. Why on earth are you changing the it_map on a failure case? Does this explain? /* * The platform code reserves zero address in iommu_init_table(). * As we cleared busy bit for page @0 before using bitmap_empty(), * we are restoring it now. */ Only partly. 
What's it reserved for, and why do you know it was always set on entry?

Because it is only handled in this file and I can see it in the code. Or I did not understand the question here...

--
Alexey
Re: [PATCH kernel v8 14/31] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control
On 04/20/2015 12:44 PM, David Gibson wrote: On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote: On 04/16/2015 04:07 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote: At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. The set_bypass() callback is not really an iommu_table function but IOMMU/PE function. This introduces a iommu_table_group_ops struct and adds a set_ownership() callback to it which is called when an external user takes control over the IOMMU. Do you really need separate ops structures at both the single table and table group level? The different tables in a group will all belong to the same basic iommu won't they? IOMMU tables exist alone in VIO. Also, the platform code uses just a table (or it is in bypass mode) and does not care about table groups. It looked more clean for myself to keep them separated. Should I still merge those? Ok, that sounds like a reasonable argument for keeping them separate, at least for now. This renames set_bypass() to set_ownership() as it is not necessarily just enabling bypassing, it can be something else/more so let's give it more generic name. The bool parameter is inverted. The callback is implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 14 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++ drivers/vfio/vfio_iommu_spapr_tce.c | 25 + 3 files changed, 56 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index b9e50d3..d1f8c6c 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -92,7 +92,6 @@ struct iommu_table { unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, #define IOMMU_TABLE_GROUP_MAX_TABLES 1 +struct iommu_table_group; + +struct iommu_table_group_ops { + /* +* Switches ownership from the kernel itself to an external +* user. While onwership is enabled, the kernel cannot use IOMMU +* for itself. +*/ + void (*set_ownership)(struct iommu_table_group *table_group, + bool enable); The meaning of "enable" in a function called "set_ownership" is entirely obscure. Suggest something better please :) I have nothing better... Well, given it's "set_ownershuip" you could have "owner" - that would want to be an enum with OWNER_KERNEL and OWNER_VFIO or something rather than a bool. It is iommu_take_ownership() in upstream and it is assumed that the owner is anything but the platform code (for now and probably for ever - VFIO). I am not changing this now, just using same naming approach when adding a callback with a similar name. Or you could leave it a bool but call it "allow_bypass". Commented below. 
+}; + struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct iommu_table_group_ops *ops; }; #ifdef CONFIG_IOMMU_API diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a964c50..9687731 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe, - table_group); uint16_t window_id = (pe->pe_number << 1 ) + 1; int64_t rc; @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) * host side. */ if (pe->pdev) - set_iommu_table_base(&pe->pdev->dev, tbl); + set_iommu_table_base(&pe->pdev->dev, + &pe->table_group.tables[0]);
Re: [PATCH kernel v8 14/31] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control
On 04/21/2015 07:43 PM, David Gibson wrote: On Mon, Apr 20, 2015 at 04:55:32PM +1000, Alexey Kardashevskiy wrote: On 04/20/2015 12:44 PM, David Gibson wrote: On Fri, Apr 17, 2015 at 08:09:29PM +1000, Alexey Kardashevskiy wrote: On 04/16/2015 04:07 PM, David Gibson wrote: On Fri, Apr 10, 2015 at 04:30:56PM +1000, Alexey Kardashevskiy wrote: At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. The set_bypass() callback is not really an iommu_table function but IOMMU/PE function. This introduces a iommu_table_group_ops struct and adds a set_ownership() callback to it which is called when an external user takes control over the IOMMU. Do you really need separate ops structures at both the single table and table group level? The different tables in a group will all belong to the same basic iommu won't they? IOMMU tables exist alone in VIO. Also, the platform code uses just a table (or it is in bypass mode) and does not care about table groups. It looked more clean for myself to keep them separated. Should I still merge those? Ok, that sounds like a reasonable argument for keeping them separate, at least for now. This renames set_bypass() to set_ownership() as it is not necessarily just enabling bypassing, it can be something else/more so let's give it more generic name. The bool parameter is inverted. The callback is implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 14 +- arch/powerpc/platforms/powernv/pci-ioda.c | 30 ++ drivers/vfio/vfio_iommu_spapr_tce.c | 25 + 3 files changed, 56 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index b9e50d3..d1f8c6c 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -92,7 +92,6 @@ struct iommu_table { unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_group; struct iommu_table_ops *it_ops; - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ @@ -127,11 +126,24 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, #define IOMMU_TABLE_GROUP_MAX_TABLES 1 +struct iommu_table_group; + +struct iommu_table_group_ops { + /* +* Switches ownership from the kernel itself to an external +* user. While onwership is enabled, the kernel cannot use IOMMU +* for itself. +*/ + void (*set_ownership)(struct iommu_table_group *table_group, + bool enable); The meaning of "enable" in a function called "set_ownership" is entirely obscure. Suggest something better please :) I have nothing better... Well, given it's "set_ownershuip" you could have "owner" - that would want to be an enum with OWNER_KERNEL and OWNER_VFIO or something rather than a bool. It is iommu_take_ownership() in upstream and it is assumed that the owner is anything but the platform code (for now and probably for ever - VFIO). I am not changing this now, just using same naming approach when adding a callback with a similar name. So "enabled" is actually that non kernel ownership is enabled. That is totally non-obvious. Or you could leave it a bool but call it "allow_bypass". Commented below. 
+}; + struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct iommu_table_group_ops *ops; }; #ifdef CONFIG_IOMMU_API diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a964c50..9687731 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1255,10 +1255,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_ioda_pe *pe = container_of(tbl->it_group, struct pnv_ioda_pe, - table_group); uint16_t window_id = (pe->pe_number << 1 ) + 1; int64_t rc; @@ -1286,7 +1284,8 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) * host side. */ if (pe->pdev) - set_iom
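[Editor's sketch] For the naming question above, this is roughly what the "owner enum instead of bool" suggestion could look like. It is a userspace model only: the enum, struct and function names are hypothetical, and the rule that bypass is on exactly when the kernel owns the group is an assumption drawn from the "bool parameter is inverted" remark in the commit message.

```c
/* Hypothetical owner type replacing the "bool enable" argument. */
enum model_owner {
	MODEL_OWNER_KERNEL,
	MODEL_OWNER_VFIO,
};

struct model_group;

/* Hypothetical stand-in for struct iommu_table_group_ops. */
struct model_group_ops {
	void (*set_owner)(struct model_group *group, enum model_owner owner);
};

/* Hypothetical stand-in for struct iommu_table_group. */
struct model_group {
	const struct model_group_ops *ops;
	int bypass_enabled;
	enum model_owner owner;
};

/* Assumed IODA2-like behaviour: DMA bypass only while the kernel owns the group. */
static void model_ioda2_set_owner(struct model_group *group,
				  enum model_owner owner)
{
	group->owner = owner;
	group->bypass_enabled = (owner == MODEL_OWNER_KERNEL);
}

static const struct model_group_ops model_ioda2_ops = {
	.set_owner = model_ioda2_set_owner,
};
```

With an enum the call site reads set_owner(group, MODEL_OWNER_VFIO) rather than set_ownership(group, true), which addresses the "what does enable mean" objection without changing behaviour.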
Re: [PATCH 2/2] pci: Use Qemu created PCI device nodes
On 04/25/2015 05:30 AM, Thomas Huth wrote:
> Hi Nikunj,
>
> On Wed, 22 Apr 2015 16:27:20 +0530 Nikunj A Dadhania wrote:
>> PCI Enumeration has been part of SLOF. Now with hotplug code addition
>> in Qemu, it makes more sense to have this code a one place, i.e. Qemu.
>
> s/Qemu/QEMU/ and s/code a one place/code in one place/ ?
>
>> Adding routines to walk through the device nodes created by Qemu.
>> SLOF will configure the device/bridges and program the BARs for
>> communicating with the devices.
>
> I wonder whether it would make more sense to also set up the BARs etc.
> in QEMU instead of SLOF?

We need BAR setup in two cases: when SLOF needs to boot from a PCI device (SLOF can do the BAR setup itself), and on PCI hotplug, where the BARs are set by the guest; otherwise we hit races (Michael Roth can tell more). So as of today there is no reason to do this in QEMU.

--
Alexey
[PATCH kernel v9 04/32] vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page
This checks that the TCE table page size is not bigger than the size of
the page we just pinned and whose physical address we are going to put
into the table.

Otherwise the hardware gets unwanted access to physical memory between
the end of the actual page and the end of the aligned up TCE page.

Since compound_order() and compound_head() work correctly on non-huge
pages, there is no need for an additional check whether the page is huge.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
Reviewed-by: David Gibson
---
Changes:
v8: changed subject

v6:
* the helper is simplified to one line

v4:
* s/tce_check_page_size/tce_page_is_contained/
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index b95fa2b..735b308 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -47,6 +47,16 @@ struct tce_container {
 	bool enabled;
 };
 
+static bool tce_page_is_contained(struct page *page, unsigned page_shift)
+{
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk than it should.
+	 */
+	return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift;
+}
+
 static int tce_iommu_enable(struct tce_container *container)
 {
 	int ret = 0;
@@ -189,6 +199,12 @@ static long tce_iommu_build(struct tce_container *container,
 			ret = -EFAULT;
 			break;
 		}
+
+		if (!tce_page_is_contained(page, tbl->it_page_shift)) {
+			ret = -EPERM;
+			break;
+		}
+
 		hva = (unsigned long) page_address(page) + offset;
 
 		ret = iommu_tce_build(tbl, entry + i, hva, direction);
--
2.0.0
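[Editor's sketch] The containment rule in this patch is a comparison of two shifts: the pinned page covers 2^(PAGE_SHIFT + compound order) bytes, and that must be at least one IOMMU page. A userspace model follows, assuming a 64K system page and using hypothetical model_* names in place of struct page and the kernel helpers.

```c
#include <stdbool.h>

/* Assumed system page size for the model: 64K (shift 16). */
#define MODEL_PAGE_SHIFT 16

/* Hypothetical stand-in for struct page: only the compound order matters. */
struct model_page {
	unsigned int compound_order;	/* 0 for a normal page, >0 for a huge page */
};

/*
 * True when one IOMMU page of 2^iommu_page_shift bytes fits entirely
 * inside the pinned (possibly compound) page.
 */
static bool model_page_is_contained(const struct model_page *page,
				    unsigned int iommu_page_shift)
{
	return (MODEL_PAGE_SHIFT + page->compound_order) >= iommu_page_shift;
}
```

With these assumptions a normal 64K page can back a 4K or 64K IOMMU page but not a 16M one, while an order-8 compound page (64K << 8 = 16M) can back all three.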
[PATCH kernel v9 02/32] Revert "powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically"
This reverts commit 9e8d4a19ab66ec9e132d405357b9108a4f26efd3 as tce32_table has exactly the same life time as the whole PE. This makes use of a new iommu_reset_table() helper instead. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h | 3 --- arch/powerpc/platforms/powernv/pci-ioda.c | 35 +-- arch/powerpc/platforms/powernv/pci.h | 2 +- 3 files changed, 15 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index e2cef38..9d320e0 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -79,9 +79,6 @@ struct iommu_table { struct iommu_group *it_group; #endif void (*set_bypass)(struct iommu_table *tbl, bool enable); -#ifdef CONFIG_PPC_POWERNV - void *data; -#endif }; /* Pure 2^n version of get_order */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 920c252..eff26ed 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1086,10 +1086,6 @@ static void pnv_ioda_setup_bus_PE(struct pci_bus *bus, int all) return; } - pe->tce32_table = kzalloc_node(sizeof(struct iommu_table), - GFP_KERNEL, hose->node); - pe->tce32_table->data = pe; - /* Associate it with all child devices */ pnv_ioda_setup_same_PE(bus, pe); @@ -1295,7 +1291,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe bus = dev->bus; hose = pci_bus_to_host(bus); phb = hose->private_data; - tbl = pe->tce32_table; + tbl = &pe->tce32_table; addr = tbl->it_base; opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, @@ -1310,9 +1306,8 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe if (rc) pe_warn(pe, "OPAL error %ld release DMA window\n", rc); - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); + iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node)); free_pages(addr, get_order(TCE32_TABLE_SIZE)); - pe->tce32_table = 
NULL; } static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs) @@ -1460,10 +1455,6 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs) continue; } - pe->tce32_table = kzalloc_node(sizeof(struct iommu_table), - GFP_KERNEL, hose->node); - pe->tce32_table->data = pe; - /* Put PE to the list */ mutex_lock(&phb->ioda.pe_list_mutex); list_add_tail(&pe->list, &phb->ioda.pe_list); @@ -1598,7 +1589,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev pe = &phb->ioda.pe_array[pdn->pe_number]; WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops); - set_iommu_table_base_and_group(&pdev->dev, pe->tce32_table); + set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table); } static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb, @@ -1625,7 +1616,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb, } else { dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n"); set_dma_ops(&pdev->dev, &dma_iommu_ops); - set_iommu_table_base(&pdev->dev, pe->tce32_table); + set_iommu_table_base(&pdev->dev, &pe->tce32_table); } *pdev->dev.dma_mask = dma_mask; return 0; @@ -1662,9 +1653,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, list_for_each_entry(dev, &bus->devices, bus_list) { if (add_to_iommu_group) set_iommu_table_base_and_group(&dev->dev, - pe->tce32_table); + &pe->tce32_table); else - set_iommu_table_base(&dev->dev, pe->tce32_table); + set_iommu_table_base(&dev->dev, &pe->tce32_table); if (dev->subordinate) pnv_ioda_setup_bus_dma(pe, dev->subordinate, @@ -1754,7 +1745,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl, __be64 *startp, __be64 *endp, bool rm) { - struct pnv_ioda_pe *pe = tbl->data; + struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe, + tce32_tab
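For illustration only (not part of the patch): once the table is embedded in the PE rather than pointed to, the back pointer (the removed ->data field) can be replaced by pointer arithmetic. Below is a minimal userspace sketch with a simplified container_of() and hypothetical, trimmed-down structures.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, trimmed-down versions of the real structures. */
struct iommu_table_model {
	unsigned long it_size;
};

struct pe_model {
	int pe_number;
	struct iommu_table_model tce32_table;	/* embedded, not a pointer */
};

/* Recover the owning PE from a pointer to its embedded table. */
static struct pe_model *table_to_pe(struct iommu_table_model *tbl)
{
	return container_of(tbl, struct pe_model, tce32_table);
}
```

This only works while the table really is a member of the PE; a table allocated separately would need an explicit back pointer again.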
[PATCH kernel v9 01/32] powerpc/iommu: Split iommu_free_table into 2 helpers
The iommu_free_table helper releases the memory it is using (the TCE table
and @it_map) and releases the iommu_table struct as well. We might not
want the very last step as we store iommu_table in parent structures.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c      | 58 +++-
 2 files changed, 35 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1e27d63..e2cef38 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -105,6 +105,7 @@ static inline void *get_iommu_table_base(struct device *dev)
 }
 
 /* Frees table for an individual device node */
+extern void iommu_reset_table(struct iommu_table *tbl, const char *node_name);
 extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
 
 /* Initializes an iommu_table based in values set in the passed-in
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b054f33..5c154e1 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -708,23 +708,44 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
 	return tbl;
 }
 
+void iommu_reset_table(struct iommu_table *tbl, const char *node_name)
+{
+	if (!tbl)
+		return;
+
+	if (tbl->it_map) {
+		unsigned long bitmap_sz;
+		unsigned int order;
+
+		/*
+		 * In case we have reserved the first bit, we should not emit
+		 * the warning below.
+		 */
+		if (tbl->it_offset == 0)
+			clear_bit(0, tbl->it_map);
+
+		/* verify that table contains no entries */
+		if (!bitmap_empty(tbl->it_map, tbl->it_size))
+			pr_warn("%s: Unexpected TCEs for %s\n", __func__,
+					node_name);
+
+		/* calculate bitmap size in bytes */
+		bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
+
+		/* free bitmap */
+		order = get_order(bitmap_sz);
+		free_pages((unsigned long) tbl->it_map, order);
+	}
+
+	memset(tbl, 0, sizeof(*tbl));
+}
+
 void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 {
-	unsigned long bitmap_sz;
-	unsigned int order;
-
-	if (!tbl || !tbl->it_map) {
-		printk(KERN_ERR "%s: expected TCE map for %s\n", __func__,
-				node_name);
+	if (!tbl)
 		return;
-	}
 
-	/*
-	 * In case we have reserved the first bit, we should not emit
-	 * the warning below.
-	 */
-	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+	iommu_reset_table(tbl, node_name);
 
 #ifdef CONFIG_IOMMU_API
 	if (tbl->it_group) {
@@ -733,17 +754,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 	}
 #endif
 
-	/* verify that table contains no entries */
-	if (!bitmap_empty(tbl->it_map, tbl->it_size))
-		pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
-
-	/* calculate bitmap size in bytes */
-	bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
-
-	/* free bitmap */
-	order = get_order(bitmap_sz);
-	free_pages((unsigned long) tbl->it_map, order);
-
 	/* free table */
 	kfree(tbl);
 }
-- 
2.0.0
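For illustration only (not part of the patch): the reset/free split can be modelled in userspace roughly as below. The structure, the helper names and the use of calloc()/free() in place of the kernel page allocator are all assumptions made for the sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define BITS_PER_LONG_MODEL (sizeof(unsigned long) * CHAR_BIT)

struct table_model {
	unsigned long it_size;		/* number of entries */
	unsigned long *it_map;		/* allocation bitmap, may be NULL */
};

/* Bytes needed for a bitmap of `bits` bits, rounded up to whole longs. */
static size_t bitmap_bytes(unsigned long bits)
{
	return ((bits + BITS_PER_LONG_MODEL - 1) / BITS_PER_LONG_MODEL) *
		sizeof(unsigned long);
}

static int table_init(struct table_model *tbl, unsigned long entries)
{
	tbl->it_size = entries;
	tbl->it_map = calloc(1, bitmap_bytes(entries));
	return tbl->it_map ? 0 : -1;
}

/* Release what the table owns and zero it; the struct itself survives. */
static void table_reset(struct table_model *tbl)
{
	if (!tbl)
		return;
	free(tbl->it_map);
	memset(tbl, 0, sizeof(*tbl));
}

/* Full teardown for a heap-allocated table: reset, then free the struct. */
static void table_free(struct table_model *tbl)
{
	if (!tbl)
		return;
	table_reset(tbl);
	free(tbl);
}
```

A table embedded in a parent structure would only ever see table_reset(); table_free() is for tables that were allocated on their own.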
[PATCH kernel v9 03/32] vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver
This moves page pinning (get_user_pages_fast()/put_page()) code out of
the platform IOMMU code and puts it into the VFIO IOMMU driver, where it
belongs, as the platform code does not deal with page pinning.

This makes iommu_take_ownership()/iommu_release_ownership() deal with
the IOMMU table bitmap only.

This removes page unpinning from iommu_take_ownership() as the actual
TCE table might contain garbage and doing put_page() on it is undefined
behaviour.

Besides the last part, the rest of the patch is mechanical.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
Reviewed-by: David Gibson
---
Changes:
v9:
* added missing tce_iommu_clear call after iommu_release_ownership()
* brought @offset (a local variable) back to make patch even more mechanical
v4:
* s/iommu_tce_build(tbl, entry + 1/iommu_tce_build(tbl, entry + i/
---
 arch/powerpc/include/asm/iommu.h    |  4 --
 arch/powerpc/kernel/iommu.c         | 55 -
 drivers/vfio/vfio_iommu_spapr_tce.c | 80 +++--
 3 files changed, 67 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 9d320e0..4955233 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -199,10 +199,6 @@ extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
 		unsigned long hwaddr, enum dma_data_direction direction);
 extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
 		unsigned long entry);
-extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
-		unsigned long entry, unsigned long pages);
-extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
-		unsigned long entry, unsigned long tce);
 
 extern void iommu_flush_tce(struct iommu_table *tbl);
 extern int iommu_take_ownership(struct iommu_table *tbl);
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 5c154e1..fc8b253 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1001,30 +1001,6 @@
unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) } EXPORT_SYMBOL_GPL(iommu_clear_tce); -int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, - unsigned long entry, unsigned long pages) -{ - unsigned long oldtce; - struct page *page; - - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce >> PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce & TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } - } - - return 0; -} -EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); - /* * hwaddr is a kernel virtual address here (0xc... bazillion), * tce_build converts it to a physical address. @@ -1054,35 +1030,6 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, } EXPORT_SYMBOL_GPL(iommu_tce_build); -int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry, - unsigned long tce) -{ - int ret; - struct page *page = NULL; - unsigned long hwaddr, offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK; - enum dma_data_direction direction = iommu_tce_direction(tce); - - ret = get_user_pages_fast(tce & PAGE_MASK, 1, - direction != DMA_TO_DEVICE, &page); - if (unlikely(ret != 1)) { - /* pr_err("iommu_tce: get_user_pages_fast failed tce=%lx ioba=%lx ret=%d\n", - tce, entry << tbl->it_page_shift, ret); */ - return -EFAULT; - } - hwaddr = (unsigned long) page_address(page) + offset; - - ret = iommu_tce_build(tbl, entry, hwaddr, direction); - if (ret) - put_page(page); - - if (ret < 0) - pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n", - __func__, entry << tbl->it_page_shift, tce, ret); - - return ret; -} -EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode); - int iommu_take_ownership(struct iommu_table *tbl) { unsigned long sz = (tbl->it_size + 7) >> 3; @@ -1096,7 +1043,6 @@ int iommu_take_ownership(struct iommu_table *tbl) } memset(tbl->it_map, 0xff, sz); - iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size); /* * Disable 
iommu bypass, otherwise the user can DMA to all of @@ -1114,7 +1060,6 @@ void iommu_release_ownership(struct iommu_table *tbl) { unsigned long sz = (tbl->it_size + 7) >> 3; - iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl-
[PATCH kernel v9 00/32] powerpc/iommu/vfio: Enable Dynamic DMA windows
This enables the sPAPR-defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus where devices are allowed to do DMA. These ranges are called DMA windows. By default, there is a single DMA window, 1 or 2GB big, mapped at zero on a PCI bus.

High-speed devices may suffer from the limited size of the window. Recent host kernels use a TCE bypass window on POWER8 CPUs, which implements direct mapping of a PCI bus address range (with an offset of 1<<59) to the host memory.

For guests, PAPR defines a DDW RTAS API which allows pseries guests to query the hypervisor about DDW support and capabilities (page size mask for now). A pseries guest may request DMA windows in addition to the default one using this RTAS API. The existing pseries Linux guests request an additional window as big as the guest RAM and map the entire guest window, which effectively creates a direct mapping of the guest memory to a PCI bus.

The multiple DMA windows feature is supported by POWER7/POWER8 CPUs; however, this patchset only adds support for POWER8, as TCE tables are implemented in POWER7 in a quite different way and POWER7 is not the highest priority.

This patchset reworks PPC64 IOMMU code and adds necessary structures to support big windows.

Once a Linux guest discovers the presence of DDW, it does:
1. query hypervisor about number of available windows and page size masks;
2. create a window with the biggest possible page size (today 4K/64K/16M);
3. map the entire guest RAM via H_PUT_TCE* hypercalls;
4. switch dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore for 64bit devices and the guest does not waste time on DMA map/unmap operations.

Note that 32bit devices won't use DDW and will keep using the default DMA window so KVM optimizations will be required (to be posted later).
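For illustration only (not part of the series): a small userspace sketch of why step 2 above prefers the biggest page size. It counts how many TCEs are needed to map a given amount of guest RAM, assuming 8 bytes per TCE (the series computes table size as it_size << 3). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of TCEs needed to map `ram_bytes` with 2^page_shift-byte IOMMU pages. */
static uint64_t tce_entries(uint64_t ram_bytes, unsigned page_shift)
{
	uint64_t page_size = 1ULL << page_shift;

	return (ram_bytes + page_size - 1) >> page_shift;
}

/* Table size in bytes, assuming 8 bytes per TCE. */
static uint64_t tce_table_bytes(uint64_t ram_bytes, unsigned page_shift)
{
	return tce_entries(ram_bytes, page_shift) * 8;
}
```

For a 4GB guest this gives 1048576 entries with 4K pages, 65536 with 64K pages and 256 with 16M pages, so the number of H_PUT_TCE* updates needed in step 3 shrinks accordingly.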
This is pushed to g...@github.com:aik/linux.git
 + d9b711d...4d0247b 4d0247b -> vfio-for-github (forced update)

Changes:
v9:
* rebased on top of SRIOV (which is in upstream now)
* fixed multiple comments from David
* reworked ownership patches
* removed vfio: powerpc/spapr: Do cleanup when releasing the group (used to be #2) as updated #1 should do this
* moved "powerpc/powernv: Implement accessor to TCE entry" to a separate patch
* added a patch which moves TCE Kill register address to PE from IOMMU table

v8:
* fixed a bug in error fallback in "powerpc/mmu: Add userspace-to-physical addresses translation cache"
* fixed subject in "vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page"
* moved v2 documentation to the correct patch
* added checks for failed vzalloc() in "powerpc/iommu: Add userspace view of TCE table"

v7:
* moved memory preregistration to the current process's MMU context
* added code preventing unregistration if some pages are still mapped; for this, a userspace view of the table is stored in iommu_table
* added locked_vm counting for DDW tables (including userspace view of those)

v6:
* fixed a bunch of errors in "vfio: powerpc/spapr: Support Dynamic DMA windows"
* moved static IOMMU properties from iommu_table_group to iommu_table_group_ops

v5:
* added SPAPR_TCE_IOMMU_v2 to tell the userspace that there is a memory pre-registration feature
* added backward compatibility
* renamed few things (mostly powerpc_iommu -> iommu_table_group)

v4:
* moved patches around to have VFIO and PPC patches separated as much as possible
* now works with the existing upstream QEMU

v3:
* redesigned the whole thing
* multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the guest -> no problems with locked_vm counting; also we save memory on actual tables
* guest RAM preregistration is required for DDW
* PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so we do not bother with iommu_table::it_map anymore
* added multilevel TCE tables support to support really huge guests

v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise

Alexey Kardashevskiy (32):
  powerpc/iommu: Split iommu_free_table into 2 helpers
  Revert "powerpc/powernv: Allocate struct pnv_ioda_pe iommu_table dynamically"
  vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver
  vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page
  vfio: powerpc/spapr: Use it_page_size
  vfio: powerpc/spapr: Move locked_vm accounting to helpers
  vfio: powerpc/spapr: Disable DMA mappings on disabled container
  vfio: powerpc/spapr: Moving pinning/unpinning to helpers
  vfio: powerpc/spapr: Rework groups attaching
  powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
  powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
  vfio: powerp
[PATCH kernel v9 19/32] powerpc/powernv/ioda2: Rework iommu_table creation
This moves iommu_table creation to the beginning to make following changes easier to review. This starts using table parameters from the iommu_table struct. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * updated commit log and did minor cleanup --- arch/powerpc/platforms/powernv/pci-ioda.c | 33 +++ 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index fb765af..a80be34 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2041,7 +2041,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, { struct page *tce_mem = NULL; void *addr; - struct iommu_table *tbl; + struct iommu_table *tbl = &pe->table_group.tables[0]; unsigned int tce_table_size, end; int64_t rc; @@ -2068,13 +2068,26 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, addr = page_address(tce_mem); memset(addr, 0, tce_table_size); + /* Setup iommu */ + tbl->it_table_group = &pe->table_group; + + /* Setup linux iommu table */ + pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, + IOMMU_PAGE_SHIFT_4K); + + tbl->it_ops = &pnv_ioda2_iommu_ops; + iommu_init_table(tbl, phb->hose->node); +#ifdef CONFIG_IOMMU_API + pe->table_group.ops = &pnv_pci_ioda2_ops; +#endif + /* * Map TCE table through TVT. The TVE index is the PE number * shifted by 1 bit for 32-bits DMA space. 
*/ rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, - pe->pe_number << 1, 1, __pa(addr), - tce_table_size, 0x1000); + pe->pe_number << 1, 1, __pa(tbl->it_base), + tbl->it_size << 3, 1ULL << tbl->it_page_shift); if (rc) { pe_err(pe, "Failed to configure 32-bit TCE table," " err %ld\n", rc); @@ -2083,24 +2096,10 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pnv_pci_ioda2_tvt_invalidate(pe); - /* Setup iommu */ - pe->table_group.tables[0].it_table_group = &pe->table_group; - - /* Setup linux iommu table */ - tbl = &pe->table_group.tables[0]; - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, - IOMMU_PAGE_SHIFT_4K); - /* OPAL variant of PHB3 invalidated TCEs */ if (pe->tce_inval_reg) tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); - tbl->it_ops = &pnv_ioda2_iommu_ops; - iommu_init_table(tbl, phb->hose->node); -#ifdef CONFIG_IOMMU_API - pe->table_group.ops = &pnv_pci_ioda2_ops; -#endif - if (pe->flags & PNV_IODA_PE_DEV) { iommu_register_group(&pe->table_group, phb->hose->global_number, pe->pe_number); -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH kernel v9 07/32] vfio: powerpc/spapr: Disable DMA mappings on disabled container
At the moment DMA map/unmap requests are handled irrespective of
the container's state. This allows userspace to pin memory which
it might not be allowed to pin.

This adds checks to MAP/UNMAP that the container is enabled, otherwise
-EPERM is returned.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
Reviewed-by: David Gibson
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 40583f9..e21479c 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -318,6 +318,9 @@ static long tce_iommu_ioctl(void *iommu_data,
 		struct iommu_table *tbl = container->tbl;
 		unsigned long tce;
 
+		if (!container->enabled)
+			return -EPERM;
+
 		if (!tbl)
 			return -ENXIO;
 
@@ -362,6 +365,9 @@ static long tce_iommu_ioctl(void *iommu_data,
 		struct vfio_iommu_type1_dma_unmap param;
 		struct iommu_table *tbl = container->tbl;
 
+		if (!container->enabled)
+			return -EPERM;
+
 		if (WARN_ON(!tbl))
 			return -ENXIO;
 
-- 
2.0.0
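For illustration only (not part of the patch): the ordering of the two guards can be modelled in userspace as below. The structures and the function name are hypothetical stand-ins for tce_container and the MAP ioctl path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct table_model {
	unsigned long it_size;
};

struct container_model {
	bool enabled;
	struct table_model *tbl;
};

/* Reject the request before any pinning if the container is not enabled. */
static long container_dma_map(const struct container_model *c)
{
	if (!c->enabled)
		return -EPERM;
	if (!c->tbl)
		return -ENXIO;
	/* Pinning and TCE programming would happen here. */
	return 0;
}
```

The enabled check comes first, so a disabled container reports -EPERM regardless of whether a table is attached.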
[PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table
This is a part of moving TCE table allocation into an iommu_ops callback to support multiple IOMMU groups per one VFIO container. This moves a table creation window to the file with common powernv-pci helpers as it does not do anything IODA2-specific. This adds pnv_pci_free_table() helper to release the actual TCE table. This enforces window size to be a power of two. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v9: * moved helpers to the common powernv pci.c file from pci-ioda.c * moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages() --- arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++ arch/powerpc/platforms/powernv/pci.c | 61 +++ arch/powerpc/platforms/powernv/pci.h | 4 ++ 3 files changed, 76 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a80be34..b9b3773 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe if (rc) pe_warn(pe, "OPAL error %ld release DMA window\n", rc); - iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node)); - free_pages(addr, get_order(TCE32_TABLE_SIZE)); + pnv_pci_free_table(tbl); } static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs) @@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = { static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe) { - struct page *tce_mem = NULL; - void *addr; struct iommu_table *tbl = &pe->table_group.tables[0]; - unsigned int tce_table_size, end; int64_t rc; /* We shouldn't already have a 32-bit DMA associated */ @@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* The PE will reserve all possible 32-bits space */ pe->tce32_seg = 0; - end = (1 << ilog2(phb->ioda.m32_pci_base)); - 
tce_table_size = (end / 0x1000) * 8; pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n", - end); + phb->ioda.m32_pci_base); - /* Allocate TCE table */ - tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL, - get_order(tce_table_size)); - if (!tce_mem) { - pe_err(pe, "Failed to allocate a 32-bit TCE memory\n"); - goto fail; + rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node, + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl); + if (rc) { + pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc); + return; } - addr = page_address(tce_mem); - memset(addr, 0, tce_table_size); - - /* Setup iommu */ - tbl->it_table_group = &pe->table_group; - - /* Setup linux iommu table */ - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, - IOMMU_PAGE_SHIFT_4K); tbl->it_ops = &pnv_ioda2_iommu_ops; + + /* Setup iommu */ + tbl->it_table_group = &pe->table_group; iommu_init_table(tbl, phb->hose->node); #ifdef CONFIG_IOMMU_API pe->table_group.ops = &pnv_pci_ioda2_ops; @@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, fail: if (pe->tce32_seg >= 0) pe->tce32_seg = -1; - if (tce_mem) - __free_pages(tce_mem, get_order(tce_table_size)); + pnv_pci_free_table(tbl); } static void pnv_ioda_setup_dma(struct pnv_phb *phb) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index e8802ac..6bcfad5 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -20,7 +20,9 @@ #include #include #include +#include +#include #include #include #include @@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl, tbl->it_type = TCE_PCI; } +static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift, + unsigned long *tce_table_allocated) +{ + struct page *tce_mem = NULL; + __be64 *addr; + unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_SHIFT; + unsigned long local_allocated = 1UL << (order + PAGE_SHIFT); + + tce_mem = alloc_pages_node(nid, 
GFP_KERNEL, order); + if (!tce_mem) { + pr_err("Failed to allocate a TCE memory, order=%d\n", order); + return NULL; + } + addr = page_addres
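For illustration only (not part of the patch): the size arithmetic visible in pnv_alloc_tce_table_pages() can be sketched in userspace as below, assuming 64K system pages. MODEL_PAGE_SHIFT and the function names are assumptions made for the sketch.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT 16	/* assumption: 64K system pages */

/* Allocation order for a table of 2^shift bytes, never below one page. */
static unsigned tce_alloc_order(unsigned shift)
{
	unsigned s = shift > MODEL_PAGE_SHIFT ? shift : MODEL_PAGE_SHIFT;

	return s - MODEL_PAGE_SHIFT;
}

/* Bytes actually allocated for that order. */
static unsigned long tce_alloc_bytes(unsigned shift)
{
	return 1UL << (tce_alloc_order(shift) + MODEL_PAGE_SHIFT);
}

/* The commit log says the window size must be a power of two. */
static int is_power_of_two(unsigned long x)
{
	return x && !(x & (x - 1));
}
```

A table smaller than a system page still consumes a whole page, which is why the allocated size is tracked separately from the requested one.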
[PATCH kernel v9 21/32] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
This is a part of moving DMA window programming to an iommu_ops callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as a first parameter (not pnv_ioda_pe) as it is going to be used as a callback for VFIO DDW code. This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is a good thing to do. It does not have immediate effect now as the table is never recreated after reboot but it will in the following patches. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v9: * initialize pe->table_group.tables[0] at the very end when tbl is fully initialized * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch --- arch/powerpc/platforms/powernv/pci-ioda.c | 67 +++ 1 file changed, 51 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index b9b3773..59baa15 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1960,6 +1960,52 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, + struct iommu_table *tbl) +{ + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, + table_group); + struct pnv_phb *phb = pe->phb; + int64_t rc; + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; + const __u64 win_size = tbl->it_size << tbl->it_page_shift; + + pe_info(pe, "Setting up window at %llx..%llx " + "pgsize=0x%x tablesize=0x%lx\n", + start_addr, start_addr + win_size - 1, + 1UL << tbl->it_page_shift, tbl->it_size << 3); + + tbl->it_table_group = &pe->table_group; + + /* +* Map TCE table through TVT. The TVE index is the PE number +* shifted by 1 bit for 32-bits DMA space. 
+*/ + rc = opal_pci_map_pe_dma_window(phb->opal_id, + pe->pe_number, + pe->pe_number << 1, + 1, + __pa(tbl->it_base), + tbl->it_size << 3, + 1ULL << tbl->it_page_shift); + if (rc) { + pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); + goto fail; + } + + pnv_pci_ioda2_tvt_invalidate(pe); + + /* Store fully initialized *tbl (may be external) in PE */ + pe->table_group.tables[0] = *tbl; + + return 0; +fail: + if (pe->tce32_seg >= 0) + pe->tce32_seg = -1; + + return rc; +} + static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { uint16_t window_id = (pe->pe_number << 1 ) + 1; @@ -2068,21 +2114,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pe->table_group.ops = &pnv_pci_ioda2_ops; #endif - /* -* Map TCE table through TVT. The TVE index is the PE number -* shifted by 1 bit for 32-bits DMA space. -*/ - rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, - pe->pe_number << 1, 1, __pa(tbl->it_base), - tbl->it_size << 3, 1ULL << tbl->it_page_shift); + rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl); if (rc) { pe_err(pe, "Failed to configure 32-bit TCE table," " err %ld\n", rc); - goto fail; + pnv_pci_free_table(tbl); + if (pe->tce32_seg >= 0) + pe->tce32_seg = -1; + return; } - pnv_pci_ioda2_tvt_invalidate(pe); - /* OPAL variant of PHB3 invalidated TCEs */ if (pe->tce_inval_reg) tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); @@ -2103,12 +2144,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* Also create a bypass window */ if (!pnv_iommu_bypass_disabled) pnv_pci_ioda2_setup_bypass_pe(phb, pe); - - return; -fail: - if (pe->tce32_seg >= 0) - pe->tce32_seg = -1; - pnv_pci_free_table(tbl); } static void pnv_ioda_setup_dma(struct pnv_phb *phb) -- 2.0.0 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
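For illustration only (not part of the patch): the window geometry that pnv_pci_ioda2_set_window() derives from the table fields can be checked with a userspace sketch. The struct is a hypothetical subset of iommu_table, and 8 bytes per TCE is assumed (it_size << 3 in the patch).

```c
#include <assert.h>
#include <stdint.h>

struct window_model {
	uint64_t it_offset;	/* first entry, in IOMMU pages */
	uint64_t it_size;	/* number of entries */
	unsigned it_page_shift;	/* log2 of the IOMMU page size */
};

/* Bus address at which the window starts. */
static uint64_t window_start(const struct window_model *w)
{
	return w->it_offset << w->it_page_shift;
}

/* Size of the bus address range covered by the window. */
static uint64_t window_size(const struct window_model *w)
{
	return w->it_size << w->it_page_shift;
}

/* Size of the TCE table backing the window, 8 bytes per entry. */
static uint64_t window_table_bytes(const struct window_model *w)
{
	return w->it_size << 3;
}
```

For the default 2GB window with 4K pages this gives 0x80000 entries and a 4MB table.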
[PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE
At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property
which contains the TCE kill register address. Writes to this register
invalidate the TCE cache on the IODA/IODA2 hub.

This moves the register address from iommu_table to pnv_ioda_pe as
later there will be 2 tables per PE and it will be used for both tables.

This moves the property reading/remapping code to a helper to reduce
code duplication.

This adds a new pnv_pci_ioda2_tvt_invalidate() helper which invalidates
the entire table. It should be called after every call to
opal_pci_map_pe_dma_window(). It was not required before because
there is just a single TCE table and 64bit DMA is handled via the bypass
window (which has no table so no cache is used) but this is going
to change with Dynamic DMA windows (DDW).

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v9:
* new in the series
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 69 +++
 arch/powerpc/platforms/powernv/pci.h      |  1 +
 2 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index f070c44..b22b3ca 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1672,7 +1672,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl,
 			struct pnv_ioda_pe, table_group);
 	__be64 __iomem *invalidate = rm ?
(__be64 __iomem *)pe->tce_inval_reg_phys : - (__be64 __iomem *)tbl->it_index; + pe->tce_inval_reg; unsigned long start, end, inc; const unsigned shift = tbl->it_page_shift; @@ -1743,6 +1743,18 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = { .get = pnv_tce_get, }; +static inline void pnv_pci_ioda2_tvt_invalidate(struct pnv_ioda_pe *pe) +{ + /* 01xb - invalidate TCEs that match the specified PE# */ + unsigned long addr = (0x4ull << 60) | (pe->pe_number & 0xFF); + + if (!pe->tce_inval_reg) + return; + +mb(); /* Ensure above stores are visible */ + __raw_writeq(cpu_to_be64(addr), pe->tce_inval_reg); +} + static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, unsigned long index, unsigned long npages, bool rm) { @@ -1751,7 +1763,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, unsigned long start, end, inc; __be64 __iomem *invalidate = rm ? (__be64 __iomem *)pe->tce_inval_reg_phys : - (__be64 __iomem *)tbl->it_index; + pe->tce_inval_reg; const unsigned shift = tbl->it_page_shift; /* We'll invalidate DMA address in PE scope */ @@ -1803,13 +1815,31 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { .get = pnv_tce_get, }; +static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, + struct pnv_ioda_pe *pe) +{ + const __be64 *swinvp; + + /* OPAL variant of PHB3 invalidated TCEs */ + swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL); + if (!swinvp) + return; + + /* We need a couple more fields -- an address and a data +* to or. Since the bus is only printed out on table free +* errors, and on the first pass the data will be a relative +* bus number, print that out instead. 
+*/ + pe->tce_inval_reg_phys = be64_to_cpup(swinvp); + pe->tce_inval_reg = ioremap(pe->tce_inval_reg_phys, 8); +} + static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe, unsigned int base, unsigned int segs) { struct page *tce_mem = NULL; - const __be64 *swinvp; struct iommu_table *tbl; unsigned int i; int64_t rc; @@ -1823,6 +1853,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, if (WARN_ON(pe->tce32_seg >= 0)) return; + pnv_pci_ioda_setup_opal_tce_kill(phb, pe); + /* Grab a 32-bit TCE table */ pe->tce32_seg = base; pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n", @@ -1865,20 +1897,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, base << 28, IOMMU_PAGE_SHIFT_4K); /* OPAL variant of P7IOC SW invalidated TCEs */ - swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL); - if (swinvp) { - /* We need a couple more fields -- an address and a data -* to or. Since the bus is only printed out on table free -* errors, and on the first pass the data will be a relative -* bus number, print that out instead. -
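For illustration only (not part of the patch): the value that pnv_pci_ioda2_tvt_invalidate() composes before the byte swap and MMIO write can be reproduced in userspace. The function name is hypothetical; the constant and the mask are taken from the hunk above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compose the "invalidate TCEs that match the specified PE#" command:
 * the operation code goes into the top bits, the low 8 bits carry the
 * PE number.
 */
static uint64_t tvt_invalidate_value(unsigned pe_number)
{
	return (0x4ULL << 60) | (pe_number & 0xFF);
}
```

In the patch this value is converted with cpu_to_be64() and written to the remapped TCE kill register; the sketch stops before that step.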
[PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
At the moment only one group per container is supported. POWER8 CPUs have a more flexible design and allow having 2 TCE tables per IOMMU group, so we can relax this limitation and support multiple groups per container. This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups. Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v7: * updated doc --- Documentation/vfio.txt | 8 +- drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++-- 2 files changed, 199 insertions(+), 77 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 94328c8..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). +Newer systems (POWER8 with IODA2) have improved hardware design which allows +to remove this limitation and have multiple IOMMU groups per a VFIO container. 
2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed, any attempt to access address space diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index a7d6729..970e3a2 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages) * into DMA'ble space using the IOMMU */ +struct tce_iommu_group { + struct list_head next; + struct iommu_group *grp; +}; + /* * The container descriptor supports only a single group per container. * Required by the API as the container is not supplied with the IOMMU group @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages) */ struct tce_container { struct mutex lock; - struct iommu_group *grp; bool enabled; unsigned long locked_pages; bool v2; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct list_head group_list; }; static long tce_unregister_pages(struct tce_container *container, @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift; } +static inline bool tce_groups_attached(struct tce_container *container) +{ + return !list_empty(&container->group_list); +} + static struct iommu_table *spapr_tce_find_table( struct tce_container *container, phys_addr_t ioba) { long i; struct iommu_table *ret = NULL; - struct iommu_table_group *table_group; - - table_group = iommu_group_get_iommudata(container->grp); - if (!table_group) - return NULL; for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { - struct iommu_table *tbl = &table_group->tables[i]; + struct iommu_table *tbl = &container->tables[i]; unsigned long entry = ioba >> tbl->it_page_shift; unsigned long start = tbl->it_offset; unsigned long end = start + tbl->it_size; @@ -186,9 +192,7 @@ static int tce_iommu_enable(struct tce_container *container) int ret = 0; unsigned 
long locked; struct iommu_table_group *table_group; - - if (!container->grp) - return -ENXIO; + struct tce_iommu_group *tcegrp; if (!current->mm) return -ESRCH; /* process exited */ @@ -225,7 +229,12 @@ static int tce_iommu_enable(struct tce_container *container) * as there is no way to know how much we should increment * the locked_vm counter. */ - table_group = iommu_group_get_iommudata(container->grp); + if (!tce_groups_attached(container)) + return -ENODEV; + + tcegrp = list_first_entry(&container->group_list, + struct tce_iommu_group, next); + table_group = iommu_group_get_iommudata(tcegrp->grp); if (!table_group) return -ENODEV; @@ -257,6 +266,48 @@ static void tce_iommu_disable(struct tce_container *container) decrement_locked_vm(container->locked_pages); } +static long tce_iommu_create_table(struct iommu_table_group *table_group, +
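A rough userspace sketch of the per-container table lookup that the reworked spapr_tce_find_table() performs may help when reading the hunk above. The toy_* names are hypothetical stand-ins, not kernel structures, and the range test (start <= entry < end) is an assumption, since the hunk ends before the comparison is shown.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TABLES 2	/* hypothetical, mirrors IOMMU_TABLE_GROUP_MAX_TABLES */

/* Hypothetical stand-in for the few iommu_table fields the lookup reads. */
struct toy_tbl {
	unsigned long it_offset;	/* first IOMMU page number of the window */
	unsigned long it_size;		/* window size in IOMMU pages, 0 = unused */
	unsigned it_page_shift;		/* log2 of the IOMMU page size */
};

struct toy_container {
	struct toy_tbl tables[TOY_MAX_TABLES];
};

/* Return the table whose window contains bus address ioba, or NULL. */
static struct toy_tbl *toy_find_table(struct toy_container *c, uint64_t ioba)
{
	int i;

	for (i = 0; i < TOY_MAX_TABLES; ++i) {
		struct toy_tbl *tbl = &c->tables[i];
		uint64_t entry = ioba >> tbl->it_page_shift;
		uint64_t start = tbl->it_offset;
		uint64_t end = start + tbl->it_size;

		/* Assumed check: an unused slot has start == end and never matches. */
		if (start <= entry && entry < end)
			return tbl;
	}

	return NULL;
}
```

Because each table carries its own page shift, the same bus address is converted to a different entry number for each candidate window.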
[PATCH kernel v9 17/32] powerpc/powernv: Implement accessor to TCE entry
This replaces direct accesses to the TCE table with a helper which returns a TCE entry address. This makes no difference now but will when multi-level TCE tables get introduced. No change in behavior is expected. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * new patch in the series to separate this mechanical change from functional changes; this is not right before "powerpc/powernv: Implement multilevel TCE tables" but here in order to let the next patch - "powerpc/iommu/powernv: Release replaced TCE" - use pnv_tce() and avoid changing the same code twice --- arch/powerpc/platforms/powernv/pci.c | 34 +- 1 file changed, 21 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index 84b4ea4..ba75aa5 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -572,38 +572,46 @@ struct pci_ops pnv_pci_ops = { .write = pnv_pci_write_config, }; +static __be64 *pnv_tce(struct iommu_table *tbl, long idx) +{ + __be64 *tmp = ((__be64 *)tbl->it_base); + + return tmp + idx; +} + int pnv_tce_build(struct iommu_table *tbl, long index, long npages, unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs) { u64 proto_tce = iommu_direction_to_tce_perm(direction); - __be64 *tcep; - u64 rpn; + u64 rpn = __pa(uaddr) >> tbl->it_page_shift; + long i; - tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset; - rpn = __pa(uaddr) >> tbl->it_page_shift; - - while (npages--) - *(tcep++) = cpu_to_be64(proto_tce | - (rpn++ << tbl->it_page_shift)); + for (i = 0; i < npages; i++) { + unsigned long newtce = proto_tce | + ((rpn + i) << tbl->it_page_shift); + unsigned long idx = index - tbl->it_offset + i; + *(pnv_tce(tbl, idx)) = cpu_to_be64(newtce); + } return 0; } void pnv_tce_free(struct iommu_table *tbl, long index, long npages) { - __be64 *tcep; + long i; - tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset; + for (i = 0; i < npages; i++) { + 
unsigned long idx = index - tbl->it_offset + i; - while (npages--) - *(tcep++) = cpu_to_be64(0); + *(pnv_tce(tbl, idx)) = cpu_to_be64(0); + } } unsigned long pnv_tce_get(struct iommu_table *tbl, long index) { - return ((u64 *)tbl->it_base)[index - tbl->it_offset]; + return *(pnv_tce(tbl, index - tbl->it_offset)); } void pnv_pci_setup_iommu_table(struct iommu_table *tbl, -- 2.0.0
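To illustrate why routing every access through one accessor is useful, here is a minimal userspace sketch of the same pattern. All toy_* names are hypothetical, the table is a plain array, and the cpu_to_be64() byte swapping of the real code is deliberately left out. Only toy_tce() knows how entries are laid out, so a later switch to a multi-level layout would touch that one function.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct iommu_table; not the kernel structure. */
struct toy_table {
	uint64_t *base;		/* backing storage, plays the role of it_base */
	unsigned long offset;	/* first index served by the table (it_offset) */
	unsigned long size;	/* number of entries (it_size) */
	unsigned page_shift;	/* log2 of the IOMMU page size (it_page_shift) */
};

/* The single place that knows where entry idx lives in memory. */
static uint64_t *toy_tce(struct toy_table *tbl, long idx)
{
	return tbl->base + idx;
}

/* Write npages consecutive entries mapping physical address pa onwards. */
static void toy_build(struct toy_table *tbl, long index, long npages,
		      uint64_t pa, uint64_t perm)
{
	uint64_t rpn = pa >> tbl->page_shift;
	long i;

	for (i = 0; i < npages; i++) {
		uint64_t newtce = perm | ((rpn + i) << tbl->page_shift);

		*toy_tce(tbl, index - tbl->offset + i) = newtce;
	}
}

/* Clear npages consecutive entries. */
static void toy_free(struct toy_table *tbl, long index, long npages)
{
	long i;

	for (i = 0; i < npages; i++)
		*toy_tce(tbl, index - tbl->offset + i) = 0;
}

/* Read back one entry. */
static uint64_t toy_get(struct toy_table *tbl, long index)
{
	return *toy_tce(tbl, index - tbl->offset);
}
```

Callers pass global indexes (including the table offset), as in the patch, and the helpers subtract the offset before calling the accessor.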
[PATCH kernel v9 05/32] vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct as page size can differ. This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code as recently introduced IOMMU_PAGE_XXX macros do not include IOMMU_PAGE_SHIFT. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson [aw: for the vfio related changes] Acked-by: Alex Williamson --- drivers/vfio/vfio_iommu_spapr_tce.c | 26 +- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 735b308..64300cc 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -91,7 +91,7 @@ static int tce_iommu_enable(struct tce_container *container) * enforcing the limit based on the max that the guest can map. */ down_write(&current->mm->mmap_sem); - npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; + npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT; locked = current->mm->locked_vm + npages; lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { @@ -120,7 +120,7 @@ static void tce_iommu_disable(struct tce_container *container) down_write(&current->mm->mmap_sem); current->mm->locked_vm -= (container->tbl->it_size << - IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; + container->tbl->it_page_shift) >> PAGE_SHIFT; up_write(&current->mm->mmap_sem); } @@ -215,7 +215,7 @@ static long tce_iommu_build(struct tce_container *container, tce, ret); break; } - tce += IOMMU_PAGE_SIZE_4K; + tce += IOMMU_PAGE_SIZE(tbl); } if (ret) @@ -260,8 +260,8 @@ static long tce_iommu_ioctl(void *iommu_data, if (info.argsz < minsz) return -EINVAL; - info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K; - info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K; + info.dma32_window_start = tbl->it_offset << tbl->it_page_shift; + info.dma32_window_size = tbl->it_size << tbl->it_page_shift; info.flags = 0; if (copy_to_user((void __user *)arg, &info, minsz)) @@ 
-291,8 +291,8 @@ static long tce_iommu_ioctl(void *iommu_data, VFIO_DMA_MAP_FLAG_WRITE)) return -EINVAL; - if ((param.size & ~IOMMU_PAGE_MASK_4K) || - (param.vaddr & ~IOMMU_PAGE_MASK_4K)) + if ((param.size & ~IOMMU_PAGE_MASK(tbl)) || + (param.vaddr & ~IOMMU_PAGE_MASK(tbl))) return -EINVAL; /* iova is checked by the IOMMU API */ @@ -307,8 +307,8 @@ static long tce_iommu_ioctl(void *iommu_data, return ret; ret = tce_iommu_build(container, tbl, - param.iova >> IOMMU_PAGE_SHIFT_4K, - tce, param.size >> IOMMU_PAGE_SHIFT_4K); + param.iova >> tbl->it_page_shift, + tce, param.size >> tbl->it_page_shift); iommu_flush_tce(tbl); @@ -334,17 +334,17 @@ static long tce_iommu_ioctl(void *iommu_data, if (param.flags) return -EINVAL; - if (param.size & ~IOMMU_PAGE_MASK_4K) + if (param.size & ~IOMMU_PAGE_MASK(tbl)) return -EINVAL; ret = iommu_tce_clear_param_check(tbl, param.iova, 0, - param.size >> IOMMU_PAGE_SHIFT_4K); + param.size >> tbl->it_page_shift); if (ret) return ret; ret = tce_iommu_clear(container, tbl, - param.iova >> IOMMU_PAGE_SHIFT_4K, - param.size >> IOMMU_PAGE_SHIFT_4K); + param.iova >> tbl->it_page_shift, + param.size >> tbl->it_page_shift); iommu_flush_tce(tbl); return ret; -- 2.0.0
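For readers following the arithmetic, the two calculations this patch converts can be sketched in plain C as below. The function names are hypothetical and the shifts are passed in explicitly instead of being read from an iommu_table, so this only illustrates the page-size math and is not the driver code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of system pages covered by a table of `entries` IOMMU pages,
 * i.e. (it_size << it_page_shift) >> PAGE_SHIFT in the patch.
 */
static uint64_t toy_table_to_system_pages(uint64_t entries,
					  unsigned iommu_page_shift,
					  unsigned sys_page_shift)
{
	return (entries << iommu_page_shift) >> sys_page_shift;
}

/*
 * True if value is a multiple of the IOMMU page size; the same shape of
 * check as `value & ~IOMMU_PAGE_MASK(tbl)` being zero.
 */
static bool toy_iommu_page_aligned(uint64_t value, unsigned iommu_page_shift)
{
	uint64_t mask = ~((UINT64_C(1) << iommu_page_shift) - 1);

	return (value & ~mask) == 0;
}
```

With a hardcoded 4K shift, a request aligned to 4K would pass the check even when the table uses 64K IOMMU pages, which is the case the per-table shift is meant to catch.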
[PATCH kernel v9 08/32] vfio: powerpc/spapr: Moving pinning/unpinning to helpers
This is a pretty mechanical patch to make the next patches simpler. The new tce_iommu_unuse_page() helper does put_page() now but it might skip that once the memory registering patch is applied. As we are here, this removes unnecessary checks for a value returned by pfn_to_page() as it cannot possibly return NULL. This moves tce_iommu_disable() later to let tce_iommu_clear() know if the container has been enabled because if it has not been, then put_page() must not be called on TCEs from the TCE table. This situation is not yet possible but it will be after the KVM acceleration patchset is applied. This changes code to work with physical addresses rather than linear mapping addresses for better code readability. Following patches will add an xchg() callback for an IOMMU table which will accept/return physical addresses (unlike current tce_build()) which will eliminate redundant conversions. Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v9: * changed helpers to work with physical addresses rather than linear (for simplicity - later ::xchg() will receive physical and avoid additional conversions) v6: * tce_get_hva() returns hva via a pointer --- drivers/vfio/vfio_iommu_spapr_tce.c | 61 + 1 file changed, 41 insertions(+), 20 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index e21479c..115d5e6 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -191,69 +191,90 @@ static void tce_iommu_release(void *iommu_data) struct tce_container *container = iommu_data; WARN_ON(container->tbl && !container->tbl->it_group); - tce_iommu_disable(container); if (container->tbl && container->tbl->it_group) tce_iommu_detach_group(iommu_data, container->tbl->it_group); + tce_iommu_disable(container); mutex_destroy(&container->lock); kfree(container); } +static void tce_iommu_unuse_page(struct tce_container *container, + unsigned long oldtce) +{ + 
struct page *page; + + if (!(oldtce & (TCE_PCI_READ | TCE_PCI_WRITE))) + return; + + page = pfn_to_page(oldtce >> PAGE_SHIFT); + + if (oldtce & TCE_PCI_WRITE) + SetPageDirty(page); + + put_page(page); +} + static int tce_iommu_clear(struct tce_container *container, struct iommu_table *tbl, unsigned long entry, unsigned long pages) { unsigned long oldtce; - struct page *page; for ( ; pages; --pages, ++entry) { oldtce = iommu_clear_tce(tbl, entry); if (!oldtce) continue; - page = pfn_to_page(oldtce >> PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce & TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } + tce_iommu_unuse_page(container, oldtce); } return 0; } +static int tce_iommu_use_page(unsigned long tce, unsigned long *hpa) +{ + struct page *page = NULL; + enum dma_data_direction direction = iommu_tce_direction(tce); + + if (get_user_pages_fast(tce & PAGE_MASK, 1, + direction != DMA_TO_DEVICE, &page) != 1) + return -EFAULT; + + *hpa = __pa((unsigned long) page_address(page)); + + return 0; +} + static long tce_iommu_build(struct tce_container *container, struct iommu_table *tbl, unsigned long entry, unsigned long tce, unsigned long pages) { long i, ret = 0; - struct page *page = NULL; - unsigned long hva; + struct page *page; + unsigned long hpa; enum dma_data_direction direction = iommu_tce_direction(tce); for (i = 0; i < pages; ++i) { unsigned long offset = tce & IOMMU_PAGE_MASK(tbl) & ~PAGE_MASK; - ret = get_user_pages_fast(tce & PAGE_MASK, 1, - direction != DMA_TO_DEVICE, &page); - if (unlikely(ret != 1)) { - ret = -EFAULT; + ret = tce_iommu_use_page(tce, &hpa); + if (ret) break; - } + page = pfn_to_page(hpa >> PAGE_SHIFT); if (!tce_page_is_contained(page, tbl->it_page_shift)) { ret = -EPERM; break; } - hva = (unsigned long) page_address(page) + offset; - - ret = iommu_tce_build(tbl, entry + i, hva, direction); + hpa |= offset; + ret = iommu_tce_build(tbl, entry + i, (unsigned long) __v
[PATCH kernel v9 15/32] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()
The pnv_pci_ioda_tce_invalidate() helper invalidates TCE cache. It is supposed to be called on IODA1/2 and not called on p5ioc2. It receives start and end host addresses of TCE table. IODA2 actually needs PCI addresses to invalidate the cache. Those can be calculated from host addresses but since we are going to implement multi-level TCE tables, calculating PCI address from a host address might get either tricky or ugly as TCE table remains flat on PCI bus but not in RAM. This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/ pnv_tce_free and defines IODA1/2-specific callbacks which call generic ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps using generic callbacks as before. This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and number of pages which are PCI addresses shifted by IOMMU page shift. No change in behaviour is expected. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * removed confusing comment from commit log about unintentional calling of pnv_pci_ioda_tce_invalidate() * moved mechanical changes away to "powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table" * fixed bug with broken invalidation in pnv_pci_ioda2_tce_invalidate - @index includes @tbl->it_offset but old code added it anyway which later broke DDW --- arch/powerpc/platforms/powernv/pci-ioda.c | 86 +-- arch/powerpc/platforms/powernv/pci.c | 17 ++ 2 files changed, 64 insertions(+), 39 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 718d5cc..f070c44 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1665,18 +1665,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, } } -static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, -struct iommu_table *tbl, -__be64 *startp, __be64 *endp, bool rm) +static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, + unsigned long index, 
unsigned long npages, bool rm) { + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group, + struct pnv_ioda_pe, table_group); __be64 __iomem *invalidate = rm ? (__be64 __iomem *)pe->tce_inval_reg_phys : (__be64 __iomem *)tbl->it_index; unsigned long start, end, inc; const unsigned shift = tbl->it_page_shift; - start = __pa(startp); - end = __pa(endp); + start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset); + end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset + + npages - 1); /* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */ if (tbl->it_busno) { @@ -1712,16 +1714,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, */ } +static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index, + long npages, unsigned long uaddr, + enum dma_data_direction direction, + struct dma_attrs *attrs) +{ + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction, + attrs); + + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE)) + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false); + + return ret; +} + +static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index, + long npages) +{ + pnv_tce_free(tbl, index, npages); + + if (tbl->it_type & TCE_PCI_SWINV_FREE) + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false); +} + static struct iommu_table_ops pnv_ioda1_iommu_ops = { - .set = pnv_tce_build, - .clear = pnv_tce_free, + .set = pnv_ioda1_tce_build, + .clear = pnv_ioda1_tce_free, .get = pnv_tce_get, }; -static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, -struct iommu_table *tbl, -__be64 *startp, __be64 *endp, bool rm) +static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, + unsigned long index, unsigned long npages, bool rm) { + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group, + struct pnv_ioda_pe, table_group); unsigned long start, end, inc; __be64 __iomem *invalidate = rm ? 
(__be64 __iomem *)pe->tce_inval_reg_phys : @@ -1734,10 +1760,8 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, end = start; /* Figure out the start, end and step */ - inc = tbl->it_offset + (((u64)startp - tbl->it_base) / sizeof(u64)); - start |= (inc << shift); - inc = tbl->it_offset + (((u64)endp - tbl->it
[PATCH kernel v9 06/32] vfio: powerpc/spapr: Move locked_vm accounting to helpers
This moves locked pages accounting to helpers. Later they will be reused for Dynamic DMA windows (DDW). This reworks debug messages to show the current value and the limit. This stores the locked pages number in the container so when unlocking the iommu table pointer won't be needed. This does not have an effect now but it will with the multiple tables per container as then we will allow attaching/detaching groups on the fly and we may end up having a container with no group attached but with the counter incremented. While we are here, update the comment explaining why RLIMIT_MEMLOCK might be required to be bigger than the guest RAM. This also prints pid of the current process in pr_warn/pr_debug. Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson Reviewed-by: David Gibson --- Changes: v4: * new helpers do nothing if @npages == 0 * tce_iommu_disable() now can decrement the counter if the group was detached (not possible now but will be in the future) --- drivers/vfio/vfio_iommu_spapr_tce.c | 82 - 1 file changed, 63 insertions(+), 19 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 64300cc..40583f9 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -29,6 +29,51 @@ static void tce_iommu_detach_group(void *iommu_data, struct iommu_group *iommu_group); +static long try_increment_locked_vm(long npages) +{ + long ret = 0, locked, lock_limit; + + if (!current || !current->mm) + return -ESRCH; /* process exited */ + + if (!npages) + return 0; + + down_write(&current->mm->mmap_sem); + locked = current->mm->locked_vm + npages; + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) + ret = -ENOMEM; + else + current->mm->locked_vm += npages; + + pr_debug("[%d] RLIMIT_MEMLOCK +%ld %ld/%ld%s\n", current->pid, + npages << PAGE_SHIFT, + current->mm->locked_vm << PAGE_SHIFT, + 
rlimit(RLIMIT_MEMLOCK), + ret ? " - exceeded" : ""); + + up_write(&current->mm->mmap_sem); + + return ret; +} + +static void decrement_locked_vm(long npages) +{ + if (!current || !current->mm || !npages) + return; /* process exited */ + + down_write(&current->mm->mmap_sem); + if (npages > current->mm->locked_vm) + npages = current->mm->locked_vm; + current->mm->locked_vm -= npages; + pr_debug("[%d] RLIMIT_MEMLOCK -%ld %ld/%ld\n", current->pid, + npages << PAGE_SHIFT, + current->mm->locked_vm << PAGE_SHIFT, + rlimit(RLIMIT_MEMLOCK)); + up_write(&current->mm->mmap_sem); +} + /* * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation * @@ -45,6 +90,7 @@ struct tce_container { struct mutex lock; struct iommu_table *tbl; bool enabled; + unsigned long locked_pages; }; static bool tce_page_is_contained(struct page *page, unsigned page_shift) @@ -60,7 +106,7 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) static int tce_iommu_enable(struct tce_container *container) { int ret = 0; - unsigned long locked, lock_limit, npages; + unsigned long locked; struct iommu_table *tbl = container->tbl; if (!container->tbl) @@ -89,21 +135,22 @@ static int tce_iommu_enable(struct tce_container *container) * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits, * that would effectively kill the guest at random points, much better * enforcing the limit based on the max that the guest can map. +* +* Unfortunately at the moment it counts whole tables, no matter how +* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups +* each with 2GB DMA window, 8GB will be counted here. The reason for +* this is that we cannot tell here the amount of RAM used by the guest +* as this information is only available from KVM and VFIO is +* KVM agnostic. 
*/ - down_write(&current->mm->mmap_sem); - npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT; - locked = current->mm->locked_vm + npages; - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { - pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n", - rlimit(RLIMIT_MEMLOCK)); - ret = -E
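The accounting rules of the two new helpers can be tried out in userspace with a sketch along these lines. struct toy_mm and both function names are hypothetical; the limit and the CAP_IPC_LOCK result are plain fields, and the mmap_sem locking and debug output of the real helpers are omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of current->mm used by the helpers. */
struct toy_mm {
	unsigned long locked_vm;	/* pages currently accounted */
	unsigned long limit_pages;	/* RLIMIT_MEMLOCK expressed in pages */
	bool can_ipc_lock;		/* stands in for capable(CAP_IPC_LOCK) */
};

/* Account npages, failing with -ENOMEM if the limit would be exceeded. */
static long toy_try_increment_locked_vm(struct toy_mm *mm, long npages)
{
	if (!mm)
		return -ESRCH;	/* process exited */

	if (!npages)
		return 0;

	if (mm->locked_vm + npages > mm->limit_pages && !mm->can_ipc_lock)
		return -ENOMEM;

	mm->locked_vm += npages;

	return 0;
}

/* Unaccount npages, clamping so the counter never goes below zero. */
static void toy_decrement_locked_vm(struct toy_mm *mm, long npages)
{
	if (!mm || !npages)
		return;

	if ((unsigned long)npages > mm->locked_vm)
		npages = mm->locked_vm;

	mm->locked_vm -= npages;
}
```

A failed increment leaves the counter untouched, which is why the container can simply remember how many pages it accounted and hand that number back on disable.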
[PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_create_table() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * reimplemented the whole patch --- arch/powerpc/include/asm/iommu.h | 5 + arch/powerpc/platforms/powernv/pci-ioda.c | 14 arch/powerpc/platforms/powernv/pci.c | 36 +++ arch/powerpc/platforms/powernv/pci.h | 2 ++ 4 files changed, 57 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 1472de3..9844c106 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -99,6 +99,7 @@ struct iommu_table { unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_indirect_levels; unsigned long it_level_size; + unsigned long it_allocated_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + unsigned long (*get_table_size)( + __u32 page_shift, + __u64 window_size, + __u32 levels); long (*create_table)(struct iommu_table_group *table_group, int num, __u32 page_shift, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index e0be556..7f548b4 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, } #ifdef CONFIG_IOMMU_API +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, + 
__u64 window_size, __u32 levels) +{ + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels); + + if (!ret) + return ret; + + /* Add size of it_userspace */ + return ret + (window_size >> page_shift) * sizeof(unsigned long); +} + static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, int num, __u32 page_shift, __u64 window_size, __u32 levels, struct iommu_table *tbl) @@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, BUG_ON(tbl->it_userspace); tbl->it_userspace = uas; + tbl->it_allocated_size += uas_cb; tbl->it_ops = &pnv_ioda2_iommu_ops; if (pe->tce_inval_reg) tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); @@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group) } static struct iommu_table_group_ops pnv_pci_ioda2_ops = { + .get_table_size = pnv_pci_ioda2_get_table_size, .create_table = pnv_pci_ioda2_create_table, .set_window = pnv_pci_ioda2_set_window, .unset_window = pnv_pci_ioda2_unset_window, diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index fc129c4..1b5b48a 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl, tbl->it_type = TCE_PCI; } +unsigned long pnv_get_table_size(__u32 page_shift, + __u64 window_size, __u32 levels) +{ + unsigned long bytes = 0; + const unsigned window_shift = ilog2(window_size); + unsigned entries_shift = window_shift - page_shift; + unsigned table_shift = entries_shift + 3; + unsigned long tce_table_size = max(0x1000UL, 1UL << table_shift); + unsigned long direct_table_size; + + if (!levels || (levels > POWERNV_IOMMU_MAX_LEVELS) || + (window_size > memory_hotplug_max()) || + !is_power_of_2(window_size)) + return 0; + + /* Calculate a direct table size from window_size and levels */ + entries_shift = ROUND_UP(entries_shift, levels) / 
levels; + table_shift = entries_shift + 3; + table_shift = max_t(unsigned, table_shift, PAGE_SHIFT); + direct_table_size = 1UL << table_shift; + + for ( ; levels; --levels) { + bytes += ROUND_UP(tce_table_size, direct_table_size); + + tce_table_size /= direct
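As a sanity check on the numbers, the single-level case of the size calculation can be sketched in userspace as follows. The function names are hypothetical, only the flat-table part visible in the patch is covered (8 bytes per TCE, 4K minimum), and the multi-level loop is not reproduced because it is cut off above. The guard against a window smaller than one IOMMU page is an addition of this sketch, and sizeof(uint64_t) stands in for sizeof(unsigned long) on a 64-bit host.

```c
#include <stdint.h>

/* Integer log2 for a non-zero value (userspace stand-in for ilog2()). */
static unsigned toy_ilog2(uint64_t v)
{
	unsigned r = 0;

	while (v >>= 1)
		r++;

	return r;
}

static int toy_is_power_of_2(uint64_t v)
{
	return v && !(v & (v - 1));
}

/* Bytes needed for a flat TCE table covering window_size; 0 if invalid. */
static uint64_t toy_flat_table_bytes(unsigned page_shift, uint64_t window_size)
{
	unsigned entries_shift;
	uint64_t bytes;

	if (!toy_is_power_of_2(window_size) ||
	    toy_ilog2(window_size) < page_shift)
		return 0;

	entries_shift = toy_ilog2(window_size) - page_shift;
	bytes = UINT64_C(1) << (entries_shift + 3);	/* 8 bytes per TCE */

	return bytes < 0x1000 ? 0x1000 : bytes;		/* at least 4K */
}

/* Same, plus one word per entry for the it_userspace array. */
static uint64_t toy_flat_table_bytes_with_uas(unsigned page_shift,
					      uint64_t window_size)
{
	uint64_t ret = toy_flat_table_bytes(page_shift, window_size);

	if (!ret)
		return ret;

	return ret + (window_size >> page_shift) * sizeof(uint64_t);
}
```

For a 2GB window with 4K IOMMU pages this gives 4MB for the TCE table and another 4MB for the userspace address array.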
[PATCH kernel v9 32/32] vfio: powerpc/spapr: Support Dynamic DMA windows
This adds create/remove window ioctls to create and remove DMA windows. sPAPR defines a Dynamic DMA windows capability which allows para-virtualized guests to create additional DMA windows on a PCI bus. Existing Linux kernels use this new window to map the entire guest memory and switch to the direct DMA operations saving time on map/unmap requests which would normally happen in large numbers. This adds 2 ioctl handlers - VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE - to create and remove windows. Up to 2 windows are supported now by the hardware and by this driver. This changes VFIO_IOMMU_SPAPR_TCE_GET_INFO handler to return additional information such as a number of supported windows and maximum number of levels of TCE tables. DDW is added as a capability, not as a SPAPR TCE IOMMU v2 unique feature as we still want to support v2 on platforms which cannot do DDW for the sake of TCE acceleration in KVM (coming soon). Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v7: * s/VFIO_IOMMU_INFO_DDW/VFIO_IOMMU_SPAPR_INFO_DDW/ * fixed typos in and updated vfio.txt * fixed VFIO_IOMMU_SPAPR_TCE_GET_INFO handler * moved ddw properties to vfio_iommu_spapr_tce_ddw_info v6: * added explicit VFIO_IOMMU_INFO_DDW flag to vfio_iommu_spapr_tce_info, it used to be page mask flags from platform code * added explicit pgsizes field * added cleanup if tce_iommu_create_window() failed in a middle * added checks for callbacks in tce_iommu_create_window and remove those from tce_iommu_remove_window when it is too late to test anyway * spapr_tce_find_free_table returns sensible error code now * updated description of VFIO_IOMMU_SPAPR_TCE_CREATE/ VFIO_IOMMU_SPAPR_TCE_REMOVE v4: * moved code to tce_iommu_create_window()/tce_iommu_remove_window() helpers * added docs --- Documentation/vfio.txt | 19 arch/powerpc/include/asm/iommu.h| 2 +- drivers/vfio/vfio_iommu_spapr_tce.c | 197 +++- include/uapi/linux/vfio.h | 61 ++- 4 
files changed, 274 insertions(+), 5 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 7dcf2b5..8b1ec51 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -452,6 +452,25 @@ address is from pre-registered range. This separation helps in optimizing DMA for guests. +6) sPAPR specification allows guests to have an additional DMA window(s) on +a PCI bus with a variable page size. Two ioctls have been added to support +this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE. +The platform has to support the functionality or error will be returned to +the userspace. The existing hardware supports up to 2 DMA windows, one is +2GB long, uses 4K pages and called "default 32bit window"; the other can +be as big as entire RAM, use different page size, it is optional - guests +create those in run-time if the guest driver supports 64bit DMA. + +VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and +a number of TCE table levels (if a TCE table is going to be big enough and +the kernel may not be able to allocate enough of physically contiguous memory). +It creates a new window in the available slot and returns the bus address where +the new window starts. Due to hardware limitation, the user space cannot choose +the location of DMA windows. + +VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window +and removes it. 
+ --- [1] VFIO was originally an acronym for "Virtual Function I/O" in its diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 9844c106..282767f 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -151,7 +151,7 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); -#define IOMMU_TABLE_GROUP_MAX_TABLES 1 +#define IOMMU_TABLE_GROUP_MAX_TABLES 2 struct iommu_table_group; diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 970e3a2..f04c6f5 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -266,6 +266,20 @@ static void tce_iommu_disable(struct tce_container *container) decrement_locked_vm(container->locked_pages); } +static int spapr_tce_find_free_table(struct tce_container *container) +{ + int i; + + for (i = 0; i < IOMMU_TABLE_GROUP_MAX_TABLES; ++i) { + struct iommu_table *tbl = &container->tables[i]; + + if (!tbl->it_size) + return i; + } + + return -ENOSPC; +} + static long tce_iommu_create_table(struct iommu_table_group *table_gr
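The slot search used when a new window is created is small enough to show in isolation. The sketch below uses hypothetical toy_* types in place of tce_container and iommu_table and keeps only the convention visible in the patch: a table with it_size == 0 counts as a free slot.

```c
#include <errno.h>

#define TOY_MAX_TABLES 2	/* hypothetical, mirrors IOMMU_TABLE_GROUP_MAX_TABLES */

/* Hypothetical stand-ins: only the field the search looks at. */
struct toy_tbl {
	unsigned long it_size;	/* 0 means the slot holds no window */
};

struct toy_container {
	struct toy_tbl tables[TOY_MAX_TABLES];
};

/* Return the index of the first unused table slot, or -ENOSPC. */
static int toy_find_free_table(const struct toy_container *container)
{
	int i;

	for (i = 0; i < TOY_MAX_TABLES; ++i) {
		if (!container->tables[i].it_size)
			return i;
	}

	return -ENOSPC;
}
```

Returning -ENOSPC instead of a bare -1 lets a caller pass the error straight back to userspace, in line with the v6 changelog note that the function now returns a sensible error code.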
[PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2
The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would require additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm.

New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires 1) guest pages to be registered first 2) subsequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work.

The accounting is done per user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/

v7:
* now memory is registered per mm (i.e.
process) * moved memory registration code to powerpc/mmu * merged "vfio: powerpc/spapr: Define v2 IOMMU" into this * limited new ioctls to v2 IOMMU * updated doc * unsupported ioclts return -ENOTTY instead of -EPERM v6: * tce_get_hva_cached() returns hva via a pointer v4: * updated docs * s/kzmalloc/vzalloc/ * in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index * renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory --- Documentation/vfio.txt | 23 drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++- include/uapi/linux/vfio.h | 27 + 3 files changed, 274 insertions(+), 6 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..94328c8 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed: +5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/ +VFIO_IOMMU_DISABLE and implements 2 new ioctls: +VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY +(which are unsupported in v1 IOMMU). + +PPC64 paravirtualized guests generate a lot of map/unmap requests, +and the handling of those includes pinning/unpinning pages and updating +mm::locked_vm counter to make sure we do not exceed the rlimit. +The v2 IOMMU splits accounting and pinning into separate operations: + +- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls +receive a user space address and size of the block to be pinned. +Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to +be called with the exact address and size used for registering +the memory block. The userspace is not expected to call these often. +The ranges are stored in a linked list in a VFIO container. 
+ +- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual +IOMMU table and do not do pinning; instead these check that the userspace +address is from pre-registered range. + +This separation helps in optimizing DMA for guests. + --- [1] VFIO was originally an acronym for "Virtual Function I/O" in its diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 892a584..4cfc2c1 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -21,6 +21,7 @@ #include #include #include +#include #define DRIVER_VERSION "0.1" #define DRIVER_AUTHOR "a...@ozlabs.ru" @@ -91,8 +92,58 @@ struct tce_container { struct iommu_group *grp; bool enabled; unsigned long locked_pages; + bool v2; }; +static long tce_unregister_pages(struct tce_container *container, + __u64 vaddr, __u64 s
[PATCH kernel v9 09/32] vfio: powerpc/spapr: Rework groups attaching
This is to make extended ownership and multiple groups support patches
simpler for review.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
Reviewed-by: David Gibson
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 40 ++---
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 115d5e6..0fbe03e 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -460,16 +460,21 @@ static int tce_iommu_attach_group(void *iommu_data,
 				iommu_group_id(container->tbl->it_group),
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
-	} else if (container->enabled) {
+		goto unlock_exit;
+	}
+
+	if (container->enabled) {
 		pr_err("tce_vfio: attaching group #%u to enabled container\n",
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
-	} else {
-		ret = iommu_take_ownership(tbl);
-		if (!ret)
-			container->tbl = tbl;
+		goto unlock_exit;
 	}
 
+	ret = iommu_take_ownership(tbl);
+	if (!ret)
+		container->tbl = tbl;
+
+unlock_exit:
 	mutex_unlock(&container->lock);
 
 	return ret;
@@ -487,19 +492,22 @@ static void tce_iommu_detach_group(void *iommu_data,
 		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
 				iommu_group_id(iommu_group),
 				iommu_group_id(tbl->it_group));
-	} else {
-		if (container->enabled) {
-			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(tbl->it_group));
-			tce_iommu_disable(container);
-		}
+		goto unlock_exit;
+	}
 
-		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
-				iommu_group_id(iommu_group), iommu_group); */
-		container->tbl = NULL;
-		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-		iommu_release_ownership(tbl);
+	if (container->enabled) {
+		pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
+				iommu_group_id(tbl->it_group));
+		tce_iommu_disable(container);
 	}
+
+	/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+	   iommu_group_id(iommu_group), iommu_group); */
+	container->tbl = NULL;
+	tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+	iommu_release_ownership(tbl);
+
+unlock_exit:
 	mutex_unlock(&container->lock);
 }
 
--
2.0.0
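The rework in this patch replaces nested if/else branches with early exits through a single unlock label. A minimal userspace sketch of that control flow follows; the struct, the fake lock and every model_* name are hypothetical stand-ins invented for illustration, not the driver's code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct tce_container; not the kernel type. */
struct model_container {
	int locked;     /* 1 while the fake lock is held */
	int enabled;    /* container already enabled */
	int has_table;  /* a group is already attached */
};

static void model_lock(struct model_container *c)
{
	assert(!c->locked);  /* the model does not support recursive locking */
	c->locked = 1;
}

static void model_unlock(struct model_container *c)
{
	assert(c->locked);
	c->locked = 0;
}

/* Every failure path jumps to one label, so the lock is dropped exactly once. */
int model_attach_group(struct model_container *c)
{
	int ret = 0;

	model_lock(c);

	if (c->has_table) {
		ret = -EBUSY;
		goto unlock_exit;
	}

	if (c->enabled) {
		ret = -EBUSY;
		goto unlock_exit;
	}

	c->has_table = 1;

unlock_exit:
	model_unlock(c);

	return ret;
}
```

With this shape, adding another precondition later only needs one more check and goto ahead of the success path, which is presumably why the patch describes itself as making the following patches simpler to review.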
[PATCH kernel v9 13/32] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
This adds tce_iommu_take_ownership() and tce_iommu_release_ownership() which call iommu_take_ownership()/iommu_release_ownership() in a loop for every table in the group. As there is just one now, no change in behaviour is expected.

At the moment the iommu_table struct has a set_bypass() callback which enables/disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to take over a PHB.

The set_bypass() callback is not really an iommu_table function but an IOMMU/PE function. This introduces an iommu_table_group_ops struct and adds take_ownership()/release_ownership() callbacks to it which are called when an external user takes/releases control over the IOMMU.

This replaces set_bypass() with ownership callbacks as it is not necessarily just bypass enabling, it can be something else/more so let's give it a more generic name.

The callbacks are implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. The following patches will replace iommu_take_ownership/iommu_release_ownership calls in IODA2 with full IOMMU table release/create.

Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v9: * squashed "vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control" and "vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control" into a single patch * moved helpers with a loop through tables in a group to vfio_iommu_spapr_tce.c to keep the platform code free of IOMMU table groups as much as possible * added missing tce_iommu_clear() to tce_iommu_release_ownership() * replaced the set_ownership(enable) callback with take_ownership() and release_ownership() --- arch/powerpc/include/asm/iommu.h | 13 +- arch/powerpc/kernel/iommu.c | 11 -- arch/powerpc/platforms/powernv/pci-ioda.c | 40 +++ drivers/vfio/vfio_iommu_spapr_tce.c | 66 +++ 4 files changed, 103 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index fa37519..e63419e 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -93,7 +93,6 @@ struct iommu_table { unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_table_group; struct iommu_table_ops *it_ops; - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ @@ -128,11 +127,23 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, #define IOMMU_TABLE_GROUP_MAX_TABLES 1 +struct iommu_table_group; + +struct iommu_table_group_ops { + /* +* Switches ownership from the kernel itself to an external +* user. While onwership is taken, the kernel cannot use IOMMU itself. 
+*/ + void (*take_ownership)(struct iommu_table_group *table_group); + void (*release_ownership)(struct iommu_table_group *table_group); +}; + struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct iommu_table_group_ops *ops; }; #ifdef CONFIG_IOMMU_API diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 005146b..2856d27 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1057,13 +1057,6 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl->it_map, 0xff, sz); - /* -* Disable iommu bypass, otherwise the user can DMA to all of -* our physical memory via the bypass window instead of just -* the pages that has been explicitly mapped into the iommu -*/ - if (tbl->set_bypass) - tbl->set_bypass(tbl, false); return 0; } @@ -1078,10 +1071,6 @@ void iommu_release_ownership(struct iommu_table *tbl) /* Restore bit#0 set by iommu_init_table() */ if (tbl->it_offset == 0) set_bit(0, tbl->it_map); - - /* The kernel owns the device now, we can restore the iommu bypass */ - if (tbl->set_bypass) - tbl->set_bypass(tbl, true); } EXPORT_SYMBOL_GPL(iommu_release_ownership); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 88472cb..718d5cc 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1870,10 +1870,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_
[PATCH kernel v9 11/32] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
This adds an iommu_table_ops struct and puts a pointer to it into the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush callbacks from ppc_md to the new struct where they really belong.

This adds the requirement for @it_ops to be initialized before calling iommu_init_table() to make sure that we do not leave any IOMMU table with iommu_table_ops uninitialized. This is not a parameter of iommu_init_table() though as there will be cases when iommu_init_table() will not be called on TCE tables, for example - VFIO.

This does s/tce_build/set/, s/tce_free/clear/ and removes redundant "tce_" prefixes.

This removes tce_xxx_rm handlers from ppc_md but does not add them to iommu_table_ops as this will be done later if we decide to support TCE hypercalls in real mode. This removes _vm callbacks as only virtual mode is supported for now so this also removes the @rm parameter.

For pSeries, this always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP. This changes the multi callbacks to fall back to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not present. The reason for this is we still have to support the "multitce=off" boot parameter in disable_multitce() and we do not want to walk through all IOMMU tables in the system and replace "multi" callbacks with single ones.

For powernv, this defines _ops per PHB type which are P5IOC2/IODA1/IODA2. This makes the callbacks for them public. Later patches will extend callbacks for IODA1/2.

No change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v9: * pnv_tce_build/pnv_tce_free/pnv_tce_get have been made public and lost "rm" parameters to make following patches simpler (realmode is not supported here anyway) * got rid of _vm versions of callbacks --- arch/powerpc/include/asm/iommu.h| 17 +++ arch/powerpc/include/asm/machdep.h | 25 --- arch/powerpc/kernel/iommu.c | 46 ++-- arch/powerpc/kernel/vio.c | 5 +++ arch/powerpc/platforms/cell/iommu.c | 8 +++-- arch/powerpc/platforms/pasemi/iommu.c | 7 +++-- arch/powerpc/platforms/powernv/pci-ioda.c | 14 + arch/powerpc/platforms/powernv/pci-p5ioc2.c | 7 + arch/powerpc/platforms/powernv/pci.c| 47 + arch/powerpc/platforms/powernv/pci.h| 5 +++ arch/powerpc/platforms/pseries/iommu.c | 34 - arch/powerpc/sysdev/dart_iommu.c| 12 +--- 12 files changed, 116 insertions(+), 111 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 5eb6e76..f0cab49 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -44,6 +44,22 @@ extern int iommu_is_off; extern int iommu_force_on; +struct iommu_table_ops { + int (*set)(struct iommu_table *tbl, + long index, long npages, + unsigned long uaddr, + enum dma_data_direction direction, + struct dma_attrs *attrs); + void (*clear)(struct iommu_table *tbl, + long index, long npages); + unsigned long (*get)(struct iommu_table *tbl, long index); + void (*flush)(struct iommu_table *tbl); +}; + +/* These are used by VIO */ +extern struct iommu_table_ops iommu_table_lpar_multi_ops; +extern struct iommu_table_ops iommu_table_pseries_ops; + /* * IOMAP_MAX_ORDER defines the largest contiguous block * of dma space we can get. 
IOMAP_MAX_ORDER = 13 @@ -78,6 +94,7 @@ struct iommu_table { #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; #endif + struct iommu_table_ops *it_ops; void (*set_bypass)(struct iommu_table *tbl, bool enable); }; diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index ef889943..ab721b4 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -65,31 +65,6 @@ struct machdep_calls { * destroyed as well */ void(*hpte_clear_all)(void); - int (*tce_build)(struct iommu_table *tbl, -long index, -long npages, -unsigned long uaddr, -enum dma_data_direction direction, -struct dma_attrs *attrs); - void(*tce_free)(struct iommu_table *tbl, - long index, - long npages); - unsigned long (*tce_get)(struct iommu_table *tbl, - long index); - void(*tce_flush)(struct iommu_table *tbl); - - /* _rm versions are for real mode use only */ - int
[PATCH kernel v9 10/32] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
Normally a bitmap from the iommu_table is used to track what TCE entry
is in use. Since we are going to use iommu_table without its locks and
do xchg() instead, it becomes essential not to put bits which are not
implied in the direction flag as the old TCE value (more precisely -
the permission bits) will be used to decide whether to put the page or not.

This adds iommu_direction_to_tce_perm() (its counterpart is there already)
and uses it for powernv's pnv_tce_build().

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
Changes:
v9:
* added comment why we must put only valid permission bits
---
 arch/powerpc/include/asm/iommu.h     |  1 +
 arch/powerpc/kernel/iommu.c          | 15 +++
 arch/powerpc/platforms/powernv/pci.c |  7 +--
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 4955233..5eb6e76 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -205,6 +205,7 @@ extern int iommu_take_ownership(struct iommu_table *tbl);
 extern void iommu_release_ownership(struct iommu_table *tbl);
 
 extern enum dma_data_direction iommu_tce_direction(unsigned long tce);
+extern unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir);
 
 #endif /* __KERNEL__ */
 #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index fc8b253..e0e94c7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -881,6 +881,21 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
 	}
 }
 
+unsigned long iommu_direction_to_tce_perm(enum dma_data_direction dir)
+{
+	switch (dir) {
+	case DMA_BIDIRECTIONAL:
+		return TCE_PCI_READ | TCE_PCI_WRITE;
+	case DMA_FROM_DEVICE:
+		return TCE_PCI_WRITE;
+	case DMA_TO_DEVICE:
+		return TCE_PCI_READ;
+	default:
+		return 0;
+	}
+}
+EXPORT_SYMBOL_GPL(iommu_direction_to_tce_perm);
+
 #ifdef CONFIG_IOMMU_API
 /*
  * SPAPR TCE API
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index bca2aeb..b7ea245 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -576,15 +576,10 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction,
 		struct dma_attrs *attrs, bool rm)
 {
-	u64 proto_tce;
+	u64 proto_tce = iommu_direction_to_tce_perm(direction);
 	__be64 *tcep, *tces;
 	u64 rpn;
 
-	proto_tce = TCE_PCI_READ; // Read allowed
-
-	if (direction != DMA_TO_DEVICE)
-		proto_tce |= TCE_PCI_WRITE;
-
 	tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
 	rpn = __pa(uaddr) >> tbl->it_page_shift;
 
--
2.0.0
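To make the behavioural difference in this patch concrete, here is a small userspace model comparing the old proto_tce computation with the new direction-to-permission mapping. The enum, the MODEL_* bit values and the function names are assumptions chosen for illustration; they are not the kernel's TCE_PCI_* or dma_data_direction definitions.

```c
#include <assert.h>

/* Illustrative stand-ins; the values are assumptions, not kernel constants. */
enum model_dma_dir {
	MODEL_DMA_BIDIRECTIONAL,
	MODEL_DMA_TO_DEVICE,
	MODEL_DMA_FROM_DEVICE,
	MODEL_DMA_NONE
};

#define MODEL_TCE_READ  0x1UL  /* device may read from memory */
#define MODEL_TCE_WRITE 0x2UL  /* device may write to memory */

/* Mirrors the switch added by the patch: only bits implied by the direction. */
unsigned long model_direction_to_perm(enum model_dma_dir dir)
{
	switch (dir) {
	case MODEL_DMA_BIDIRECTIONAL:
		return MODEL_TCE_READ | MODEL_TCE_WRITE;
	case MODEL_DMA_FROM_DEVICE:
		return MODEL_TCE_WRITE;
	case MODEL_DMA_TO_DEVICE:
		return MODEL_TCE_READ;
	default:
		return 0;
	}
}

/* Mirrors the removed code: "read" was set unconditionally. */
unsigned long model_old_proto_tce(enum model_dma_dir dir)
{
	unsigned long perm = MODEL_TCE_READ;

	if (dir != MODEL_DMA_TO_DEVICE)
		perm |= MODEL_TCE_WRITE;
	return perm;
}

/* Self-check: the two versions disagree for FROM_DEVICE and NONE only. */
void model_check_permissions(void)
{
	assert(model_direction_to_perm(MODEL_DMA_NONE) == 0);
	assert(model_old_proto_tce(MODEL_DMA_NONE) != 0);
	assert(model_direction_to_perm(MODEL_DMA_FROM_DEVICE) == MODEL_TCE_WRITE);
	assert(model_direction_to_perm(MODEL_DMA_TO_DEVICE) ==
	       model_old_proto_tce(MODEL_DMA_TO_DEVICE));
}
```

In this model a cleared entry (direction "none") ends up with no permission bits, which is what lets a later reader of the old TCE value decide from the bits alone whether a page was referenced.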
[PATCH kernel v9 12/32] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
Modern IBM POWERPC systems support multiple (currently two) TCE tables
per IOMMU group (a.k.a. PE). This adds an iommu_table_group container
for TCE tables. Right now just one table is supported.

For P5IOC2 and IODA, iommu_table_group is embedded into PE struct
(pnv_ioda_pe and pnv_phb) and does not require iommu_free_table(), only
iommu_reset_table().

For pSeries, this replaces multiple calls of kzalloc_node() with a new
iommu_pseries_group_alloc() helper and stores the table group struct
pointer into the pci_dn struct. For release, an iommu_table_group_free()
helper is added.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
---
Changes:
v9:
* s/it_group/it_table_group/
* added and used iommu_table_group_free(), from now iommu_free_table()
is only used for VIO
* added iommu_pseries_group_alloc()
* squashed "powerpc/iommu: Introduce iommu_table_alloc() helper" into this
---
 arch/powerpc/include/asm/iommu.h            |  18 +++--
 arch/powerpc/include/asm/pci-bridge.h       |   2 +-
 arch/powerpc/kernel/eeh.c                   |   2 +-
 arch/powerpc/kernel/iommu.c                 |  24 +++---
 arch/powerpc/platforms/powernv/pci-ioda.c   |  46 ++-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  19 +++--
 arch/powerpc/platforms/powernv/pci.h        |   4 +-
 arch/powerpc/platforms/pseries/iommu.c      | 104 +
 drivers/vfio/vfio_iommu_spapr_tce.c         | 114
 9 files changed, 222 insertions(+), 111 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index f0cab49..fa37519 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -91,9 +91,7 @@ struct iommu_table {
 	struct iommu_pool pools[IOMMU_NR_POOLS];
 	unsigned long *it_map; /* A simple allocation bitmap for now */
 	unsigned long it_page_shift;/* table iommu page size */
-#ifdef CONFIG_IOMMU_API
-	struct iommu_group *it_group;
-#endif
+	struct iommu_table_group *it_table_group;
 	struct iommu_table_ops *it_ops;
 	void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
@@ -127,14 +125,24 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 					    int nid);
+
+#define IOMMU_TABLE_GROUP_MAX_TABLES	1
+
+struct iommu_table_group {
 #ifdef CONFIG_IOMMU_API
-extern void iommu_register_group(struct iommu_table *tbl,
+	struct iommu_group *group;
+#endif
+	struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES];
+};
+
+#ifdef CONFIG_IOMMU_API
+extern void iommu_register_group(struct iommu_table_group *table_group,
 				 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 extern int __init tce_iommu_bus_notifier_init(void);
 #else
-static inline void iommu_register_group(struct iommu_table *tbl,
+static inline void iommu_register_group(struct iommu_table_group *table_group,
 					int pci_domain_number,
 					unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
index 1811c44..e2d7479 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -185,7 +185,7 @@ struct pci_dn {
 
 	struct pci_dn *parent;
 	struct pci_controller *phb;/* for pci devices */
-	struct iommu_table *iommu_table; /* for phb's or bridges */
+	struct iommu_table_group *table_group; /* for phb's or bridges */
 	struct device_node *node; /* back-pointer to the device_node */
 
 	int pci_ext_config_space; /* for pci devices */
diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index a4c62eb..6bab695 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1407,7 +1407,7 @@ static int dev_has_iommu_table(struct device *dev, void *data)
 		return 0;
 
 	tbl = get_iommu_table_base(dev);
-	if (tbl && tbl->it_group) {
+	if (tbl && tbl->it_table_group) {
 		*ppdev = pdev;
 		return 1;
 	}
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index e289f91..005146b 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -749,12 +749,8 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
 
 	iommu_reset_table(tbl, node_name);
 
-#ifdef CONFIG_IOMMU_API
-	if (tbl->it_group) {
-		iommu_group_put(tbl->it_group);
-		BUG_ON(tbl->it_group);
-	}
-#endif
+	/* iommu_free_tab
[PATCH kernel v9 14/32] powerpc/iommu: Fix IOMMU ownership control functions
This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears TCE content if there is no page marked busy in it_map.
Clearing must be done outside of the table locks as iommu_clear_tce()
called from iommu_clear_tces_and_put_pages() does this.

In order to use bitmap_empty(), the existing code clears bit#0 which
is set even in an empty table if it is bus-mapped at 0 as
iommu_init_table() reserves page#0 to prevent buggy drivers
from crashing when allocated page is bus-mapped at zero
(which is correct). This restores the bit in the case of failure
to bring the it_map to the state it was in when we called
iommu_take_ownership().

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v9:
* iommu_table_take_ownership() did not return @ret (and ignored EBUSY),
now it does return correct error.
* updated commit log about setting bit#0 in the case of failure

v5:
* do not store bit#0 value, it has to be set for zero-based table
anyway
* removed test_and_clear_bit
---
 arch/powerpc/kernel/iommu.c | 31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2856d27..ea2c8ba 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1045,32 +1045,51 @@ EXPORT_SYMBOL_GPL(iommu_tce_build);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+	int ret = 0;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	if (tbl->it_offset == 0)
 		clear_bit(0, tbl->it_map);
 
 	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
 		pr_err("iommu_tce: it_map is not empty");
-		return -EBUSY;
+		ret = -EBUSY;
+		/* Restore bit#0 set by iommu_init_table() */
+		if (tbl->it_offset == 0)
+			set_bit(0, tbl->it_map);
+	} else {
+		memset(tbl->it_map, 0xff, sz);
 	}
 
-	memset(tbl->it_map, 0xff, sz);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-	unsigned long sz = (tbl->it_size + 7) >> 3;
+	unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+
+	spin_lock_irqsave(&tbl->large_pool.lock, flags);
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_lock(&tbl->pools[i].lock);
 
 	memset(tbl->it_map, 0, sz);
 
 	/* Restore bit#0 set by iommu_init_table() */
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
+
+	for (i = 0; i < tbl->nr_pools; i++)
+		spin_unlock(&tbl->pools[i].lock);
+	spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
--
2.0.0
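The bit#0 handling described in this commit log can be followed in a small userspace model of the it_map bookkeeping. This is only a sketch under stated assumptions: the bitmap is a fixed byte array rather than the kernel's unsigned long bitmap, the pool locks are left out entirely, and struct model_table and the model_* functions are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the relevant iommu_table fields. */
struct model_table {
	unsigned long it_offset;   /* first bus page covered by the table */
	unsigned long it_size;     /* number of entries, at most 128 here */
	unsigned char it_map[16];  /* one bit per entry, bit i = byte i/8, bit i%8 */
};

static int model_map_is_empty(const struct model_table *t)
{
	unsigned long i;

	for (i = 0; i < t->it_size; i++)
		if (t->it_map[i / 8] & (1u << (i % 8)))
			return 0;
	return 1;
}

/* Locking is omitted; the real functions take the pool locks around this. */
int model_take_ownership(struct model_table *t)
{
	size_t sz = (t->it_size + 7) >> 3;
	int ret = 0;

	assert(t->it_size <= sizeof(t->it_map) * 8);

	/* Drop the reserved bit#0 so that an otherwise unused table looks empty */
	if (t->it_offset == 0)
		t->it_map[0] &= 0xfe;

	if (!model_map_is_empty(t)) {
		ret = -EBUSY;
		/* Put bit#0 back so the map is unchanged on failure */
		if (t->it_offset == 0)
			t->it_map[0] |= 0x01;
	} else {
		/* Mark every entry busy to catch use while ownership is taken */
		memset(t->it_map, 0xff, sz);
	}

	return ret;
}

void model_release_ownership(struct model_table *t)
{
	size_t sz = (t->it_size + 7) >> 3;

	assert(t->it_size <= sizeof(t->it_map) * 8);

	memset(t->it_map, 0, sz);

	/* Restore the reservation of page#0 for a zero-based table */
	if (t->it_offset == 0)
		t->it_map[0] |= 0x01;
}
```

The failure branch is the part the v9 changelog calls out: the model returns -EBUSY and leaves the map exactly as it found it, including the reserved bit.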
[PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache
We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory and in-kernel acceleration of DMA requests. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage a mapped pages counter; when it is zero, it is safe to unregister the region. Multiple registration of the same region is allowed, kref is used to track the number of registrations. Signed-off-by: Alexey Kardashevskiy --- Changes: v8: * s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/ * fixed error fallback look (s/[i]/[j]/) --- arch/powerpc/include/asm/mmu-hash64.h | 3 + arch/powerpc/include/asm/mmu_context.h | 17 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/mmu_context_hash64.c | 6 + arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 + 5 files changed, 242 insertions(+) create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h index 1da6a81..a82f534 100644 --- a/arch/powerpc/include/asm/mmu-hash64.h +++ b/arch/powerpc/include/asm/mmu-hash64.h @@ -536,6 +536,9 @@ typedef struct { /* for 4K PTE fragment support */ void *pte_frag; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + struct list_head iommu_group_mem_list; +#endif } mm_context_t; diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h index 73382eb..d6116ca 100644 --- a/arch/powerpc/include/asm/mmu_context.h +++ 
b/arch/powerpc/include/asm/mmu_context.h @@ -16,6 +16,23 @@ */ extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm); extern void destroy_context(struct mm_struct *mm); +#ifdef CONFIG_SPAPR_TCE_IOMMU +struct mm_iommu_table_group_mem_t; + +extern bool mm_iommu_preregistered(void); +extern long mm_iommu_alloc(unsigned long ua, unsigned long entries, + struct mm_iommu_table_group_mem_t **pmem); +extern struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua, + unsigned long entries); +extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem); +extern void mm_iommu_cleanup(mm_context_t *ctx); +extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua, + unsigned long size); +extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem, + unsigned long ua, unsigned long *hpa); +extern long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem, + bool inc); +#endif extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next); extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm); diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile index 9c8770b..e216704 100644 --- a/arch/powerpc/mm/Makefile +++ b/arch/powerpc/mm/Makefile @@ -36,3 +36,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT)+= subpage-prot.o obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o obj-$(CONFIG_HIGHMEM) += highmem.o obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o +obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c index 178876ae..eb3080c 100644 --- a/arch/powerpc/mm/mmu_context_hash64.c +++ b/arch/powerpc/mm/mmu_context_hash64.c @@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) #ifdef CONFIG_PPC_64K_PAGES mm->context.pte_frag = NULL; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list); +#endif return 0; } @@ -132,6 +135,9 @@ 
static inline void destroy_pagetable_page(struct mm_struct *mm) void destroy_context(struct mm_struct *mm) { +#ifdef CONFIG_SPAPR_TCE_IOMMU + mm_iommu_cleanup(&mm->context); +#endif #ifdef CONFIG_PPC_ICSWX drop_cop(mm->context.acop, mm); diff --git a/arch/powerpc/mm/mmu_context_hash64_iommu.c b/arch/powerpc/mm/mmu_context_hash64_iommu.c new file mode 100644 index 000..af7668c --- /dev/null +++ b/arch/powerpc/mm/mmu_context_hash64_iommu.c @@ -0,0 +1,215 @@ +/* + * IOMMU helpers in MMU context. + * + * Copyright (C) 2015 IBM Corp. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either v
[PATCH kernel v9 18/32] powerpc/iommu/powernv: Release replaced TCE
At the moment writing a new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However the PAPR specification allows the guest to write a new TCE value without clearing it first.

Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect the DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns the replaced TCE and DMA direction so the caller can release the pages afterwards. The exchange() receives a physical address unlike set() which receives a linear mapping address; and returns a physical address as the clear() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in the TCE passed to the IOMMU API as those are to be calculated by platform code from the DMA direction.

This moves SetPageDirty() to the IOMMU code to make it work for both the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes available later).

Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v9: * changed exchange() to work with physical addresses as these addresses are never accessed by the code and physical addresses are actual values we put into the IOMMU table --- arch/powerpc/include/asm/iommu.h| 22 +-- arch/powerpc/kernel/iommu.c | 57 +--- arch/powerpc/platforms/powernv/pci-ioda.c | 34 + arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 ++ arch/powerpc/platforms/powernv/pci.c| 17 + arch/powerpc/platforms/powernv/pci.h| 2 + drivers/vfio/vfio_iommu_spapr_tce.c | 58 ++--- 7 files changed, 128 insertions(+), 65 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index e63419e..7e7ca0a 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -45,13 +45,29 @@ extern int iommu_is_off; extern int iommu_force_on; struct iommu_table_ops { + /* +* When called with direction==DMA_NONE, it is equal to clear(). +* uaddr is a linear map address. +*/ int (*set)(struct iommu_table *tbl, long index, long npages, unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs); +#ifdef CONFIG_IOMMU_API + /* +* Exchanges existing TCE with new TCE plus direction bits; +* returns old TCE and DMA direction mask. +* @tce is a physical address. 
+*/ + int (*exchange)(struct iommu_table *tbl, + long index, + unsigned long *tce, + enum dma_data_direction *direction); +#endif void (*clear)(struct iommu_table *tbl, long index, long npages); + /* get() returns a physical address */ unsigned long (*get)(struct iommu_table *tbl, long index); void (*flush)(struct iommu_table *tbl); }; @@ -152,6 +168,8 @@ extern void iommu_register_group(struct iommu_table_group *table_group, extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); extern int __init tce_iommu_bus_notifier_init(void); +extern long iommu_tce_xchg(struct iommu_table *tbl, unsigned long entry, + unsigned long *tce, enum dma_data_direction *direction); #else static inline void iommu_register_group(struct iommu_table_group *table_group, int pci_domain_number, @@ -231,10 +249,6 @@ extern int iommu_tce_clear_param_check(struct iommu_table *tbl, unsigned long npages); extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); -extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, - unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); extern void iommu_flush_tce(struct iommu_table *tbl); extern int iommu_take_ownership(struct iommu_table *tbl); diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index ea2c8ba..2eaba0c 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -975,9 +975,6 @@ EXPORT_SYMBOL_GPL(iommu_tce_clear_param_check); int iommu_tce_put_param_check(struct iommu_table
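The exchange() semantics described in the commit message above can be modelled in plain userspace C. This is only a rough sketch under stated assumptions, not the powernv implementation: toy_tce_xchg(), enum toy_dir and the TOY_TCE_* bits are hypothetical names, and a C11 atomic_exchange() stands in for whatever primitive the platform code really uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical permission bits kept in the low bits of a toy TCE. */
#define TOY_TCE_READ  0x1ULL
#define TOY_TCE_WRITE 0x2ULL
#define TOY_TCE_PERM  (TOY_TCE_READ | TOY_TCE_WRITE)

/* Hypothetical stand-in for enum dma_data_direction. */
enum toy_dir {
	TOY_DMA_BIDIRECTIONAL,
	TOY_DMA_TO_DEVICE,
	TOY_DMA_FROM_DEVICE,
	TOY_DMA_NONE
};

/* Permission bits are derived from the direction, never passed in by the caller. */
uint64_t toy_dir_to_perm(enum toy_dir dir)
{
	switch (dir) {
	case TOY_DMA_BIDIRECTIONAL: return TOY_TCE_READ | TOY_TCE_WRITE;
	case TOY_DMA_TO_DEVICE:     return TOY_TCE_READ;
	case TOY_DMA_FROM_DEVICE:   return TOY_TCE_WRITE;
	default:                    return 0;
	}
}

enum toy_dir toy_perm_to_dir(uint64_t tce)
{
	switch (tce & TOY_TCE_PERM) {
	case TOY_TCE_READ | TOY_TCE_WRITE: return TOY_DMA_BIDIRECTIONAL;
	case TOY_TCE_READ:                 return TOY_DMA_TO_DEVICE;
	case TOY_TCE_WRITE:                return TOY_DMA_FROM_DEVICE;
	default:                           return TOY_DMA_NONE;
	}
}

/*
 * Swap a new entry in. On entry *pa and *dir describe the new mapping
 * (TOY_DMA_NONE means "clear"); on return they describe the entry that
 * was replaced, so the caller can release the old page.
 */
void toy_tce_xchg(_Atomic uint64_t *table, long index,
		  uint64_t *pa, enum toy_dir *dir)
{
	uint64_t newtce = 0;
	uint64_t oldtce;

	assert(index >= 0);
	if (*dir != TOY_DMA_NONE)
		newtce = (*pa & ~TOY_TCE_PERM) | toy_dir_to_perm(*dir);

	oldtce = atomic_exchange(&table[index], newtce);

	*pa = oldtce & ~TOY_TCE_PERM;
	*dir = toy_perm_to_dir(oldtce);
}
```

A caller would pass the new physical address and direction in, and on return check whether the direction is anything other than TOY_DMA_NONE to decide whether an old page needs to be released.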
[PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table
In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is no longer in use. So we need a way to tell which region a just-cleared TCE came from. This adds a userspace view of the TCE table to the iommu_table struct. It contains userspace addresses, one per TCE entry. The table is only allocated when ownership of an IOMMU group is taken, which means it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * fixed code flow in error cases added in v8 v8: * added ENOMEM on failed vzalloc() --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 18 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 22 -- 3 files changed, 44 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7694546..1472de3 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -111,9 +111,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_table_group; + unsigned long *it_userspace; /* userspace view of the table */ struct iommu_table_ops *it_ops; }; +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ + ((tbl)->it_userspace ? 
\ + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \ + NULL) + /* Pure 2^n version of get_order */ static inline __attribute_const__ int get_iommu_order(unsigned long size, struct iommu_table *tbl) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 2eaba0c..74a3f52 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include @@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char *node_name) free_pages((unsigned long) tbl->it_map, order); } + WARN_ON(tbl->it_userspace); + memset(tbl, 0, sizeof(*tbl)); } @@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; int ret = 0; + unsigned long *uas; /* * VFIO does not control TCE entries allocation and the guest @@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl) if (!tbl->it_ops->exchange) return -EINVAL; + uas = vzalloc(sizeof(*uas) * tbl->it_size); + if (!uas) + return -ENOMEM; + spin_lock_irqsave(&tbl->large_pool.lock, flags); for (i = 0; i < tbl->nr_pools; i++) spin_lock(&tbl->pools[i].lock); @@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl->it_map, 0xff, sz); } + if (ret) { + vfree(uas); + } else { + BUG_ON(tbl->it_userspace); + tbl->it_userspace = uas; + } + for (i = 0; i < tbl->nr_pools; i++) spin_unlock(&tbl->pools[i].lock); spin_unlock_irqrestore(&tbl->large_pool.lock, flags); @@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + vfree(tbl->it_userspace); + tbl->it_userspace = NULL; + spin_lock_irqsave(&tbl->large_pool.lock, flags); for (i = 0; i < tbl->nr_pools; i++) spin_lock(&tbl->pools[i].lock); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 45bc131..e0be556 100644 --- 
a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index, pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false); } +void pnv_pci_ioda2_free_table(struct iommu_table *tbl) +{ + vfree(tbl->it_userspace); + tbl->it_userspace = NULL; + + pnv_pci_free_table(tbl); +} + static struct iommu_table_ops pnv_ioda2_iommu_ops = { .set = pnv_ioda2_tce_build, #ifdef CONFIG_IOMMU_API @@ -1834,7 +1843,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { #endif .clear = pnv_ioda2_tce_free, .get = pnv_tce_get, - .free = pnv_pci_free_table, + .free = pnv_pci_ioda2_free_table, }; static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, @@ -2062,12 +
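To illustrate the idea of the userspace view (one userspace address per TCE entry, allocated only while ownership is taken), here is a small userspace model. It is a sketch under assumptions: struct uview_table and the uview_*() helpers are hypothetical names, and calloc()/free() stand in for vzalloc()/vfree().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cut-down table: only the fields the lookup needs. */
struct uview_table {
	unsigned long it_offset;     /* number of the first entry in this table */
	unsigned long it_size;       /* number of entries */
	unsigned long *it_userspace; /* one userspace address per entry, or NULL */
};

/* Mirrors the macro's shape: NULL when no view is allocated. */
unsigned long *uview_entry(struct uview_table *tbl, unsigned long entry)
{
	if (!tbl->it_userspace)
		return NULL;

	assert(entry >= tbl->it_offset &&
	       entry - tbl->it_offset < tbl->it_size);
	return &tbl->it_userspace[entry - tbl->it_offset];
}

/* Allocate the zeroed view when ownership is taken; -1 models -ENOMEM. */
int uview_take_ownership(struct uview_table *tbl)
{
	unsigned long *uas = calloc(tbl->it_size, sizeof(*uas));

	if (!uas)
		return -1;

	tbl->it_userspace = uas;
	return 0;
}

/* Drop the view when ownership goes back to the platform code. */
void uview_release_ownership(struct uview_table *tbl)
{
	free(tbl->it_userspace);
	tbl->it_userspace = NULL;
}
```

The point of subtracting it_offset is that entry numbers are bus-relative while the view is indexed from zero.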
[PATCH kernel v9 24/32] powerpc/powernv/ioda2: Use new helpers to do proper cleanup on PE release
The existing code programmed TVT#0 with some address and then immediately released that memory. This makes use of pnv_pci_ioda2_unset_window() and pnv_pci_ioda2_set_bypass() which do correct resource release and TVT update. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/platforms/powernv/pci-ioda.c | 33 ++- 1 file changed, 10 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 4828837..2a4b2b2 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1281,34 +1281,21 @@ m64_failed: return -EBUSY; } +static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group, + int num); +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable); + static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe) { - struct pci_bus*bus; - struct pci_controller *hose; - struct pnv_phb*phb; - struct iommu_table*tbl; - unsigned long addr; - int64_t rc; + long rc; - bus = dev->bus; - hose = pci_bus_to_host(bus); - phb = hose->private_data; - tbl = &pe->table_group.tables[0]; - addr = tbl->it_base; - - opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, - pe->pe_number << 1, 1, __pa(addr), - 0, 0x1000); - - rc = opal_pci_map_pe_dma_window_real(pe->phb->opal_id, - pe->pe_number, - (pe->pe_number << 1) + 1, - pe->tce_bypass_base, - 0); + rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0); if (rc) - pe_warn(pe, "OPAL error %ld release DMA window\n", rc); + pe_warn(pe, "OPAL error %ld release default DMA window\n", rc); - pnv_pci_free_table(tbl); + pnv_pci_ioda2_set_bypass(pe, false); + + pnv_pci_free_table(&pe->table_group.tables[0]); } static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs) -- 2.0.0
[PATCH kernel v9 25/32] vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework ownership
Previously, the IOMMU user (VFIO) would take control over the IOMMU table belonging to a specific IOMMU group. This approach did not allow sharing tables between IOMMU groups attached to the same container.

This introduces a new IOMMU ownership flavour in which the user can not only control the existing IOMMU table but also remove/create tables on demand. If an IOMMU implements take/release_ownership() callbacks, this lets the user have full control over the IOMMU group. When the ownership is taken, the platform code removes all the windows so the caller must create them. Before returning the ownership back to the platform code, VFIO unprograms and removes all the tables it created.

This changes IODA2's ownership handler to remove the existing table rather than manipulating it. From now on, iommu_take_ownership() and iommu_release_ownership() are only called from the vfio_iommu_spapr_tce driver.

In tce_iommu_detach_group(), this copies an iommu_table descriptor onto the stack, as IODA2's unset_window() will clear the descriptor embedded into the PE and we would not be able to free the table afterwards. This is a transitional hack and following patches will replace this code anyway.

Old-style ownership is still supported, allowing VFIO to run on older P5IOC2 and IODA IO controllers.

No change in userspace-visible behaviour is expected. Since it recreates TCE tables on each ownership change, related kernel traces will appear more often.
Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v9: * fixed crash in tce_iommu_detach_group() on tbl->it_ops->free as tce_iommu_attach_group() used to initialize the table from a descriptor on stack (it does not matter for the series as this bit is changed later anyway but it ruins bisectability) v6: * fixed commit log that VFIO removes tables before passing ownership back to the platform code, not userspace 1 --- arch/powerpc/platforms/powernv/pci-ioda.c | 27 +++-- drivers/vfio/vfio_iommu_spapr_tce.c | 33 +-- 2 files changed, 56 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 2a4b2b2..45bc131 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2105,16 +2105,39 @@ static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group) struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, table_group); - iommu_take_ownership(&table_group->tables[0]); pnv_pci_ioda2_set_bypass(pe, false); + pnv_pci_ioda2_unset_window(&pe->table_group, 0); + pnv_pci_free_table(&pe->table_group.tables[0]); } static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group) { struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, table_group); + struct iommu_table *tbl = &pe->table_group.tables[0]; + int64_t rc; + + rc = pnv_pci_ioda2_create_table(&pe->table_group, 0, + IOMMU_PAGE_SHIFT_4K, + pe->phb->ioda.m32_pci_base, + POWERNV_IOMMU_DEFAULT_LEVELS, tbl); + if (rc) { + pe_err(pe, "Failed to create 32-bit TCE table, err %ld", + rc); + return; + } + + tbl->it_table_group = &pe->table_group; + iommu_init_table(tbl, pe->phb->hose->node); + + rc = pnv_pci_ioda2_set_window(&pe->table_group, 0, tbl); + if (rc) { + pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", + rc); + pnv_pci_free_table(tbl); + return; + } - 
iommu_release_ownership(&table_group->tables[0]); pnv_pci_ioda2_set_bypass(pe, true); } diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 2d51bbf..892a584 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -569,6 +569,10 @@ static int tce_iommu_attach_group(void *iommu_data, if (!table_group->ops || !table_group->ops->take_ownership || !table_group->ops->release_ownership) { ret = tce_iommu_take_ownership(table_group); + } else if (!table_group->ops->create_table || + !table_group->ops->set_window) { + WARN_ON_ONCE(1); + ret = -EFAULT; } else { /* * Disable iommu bypass, otherwise the user can DMA to all of @@ -576,7 +580,15 @@ static int tce_iommu_attach_group(void *iommu_data, *
[PATCH kernel v9 30/32] vfio: powerpc/spapr: Use 32bit DMA window properties from table_group
A table group might not have a table but it always has the default 32bit window parameters so use these. No change in behavior is expected. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * new in the series - to make the next patch simpler --- drivers/vfio/vfio_iommu_spapr_tce.c | 19 +++ 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 4cfc2c1..a7d6729 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -185,7 +185,6 @@ static int tce_iommu_enable(struct tce_container *container) { int ret = 0; unsigned long locked; - struct iommu_table *tbl; struct iommu_table_group *table_group; if (!container->grp) @@ -221,13 +220,19 @@ static int tce_iommu_enable(struct tce_container *container) * this is that we cannot tell here the amount of RAM used by the guest * as this information is only available from KVM and VFIO is * KVM agnostic. +* +* So we do not allow enabling a container without a group attached +* as there is no way to know how much we should increment +* the locked_vm counter. 
*/ table_group = iommu_group_get_iommudata(container->grp); if (!table_group) return -ENODEV; - tbl = &table_group->tables[0]; - locked = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT; + if (!table_group->tce32_size) + return -EPERM; + + locked = table_group->tce32_size >> PAGE_SHIFT; ret = try_increment_locked_vm(locked); if (ret) return ret; @@ -504,7 +509,6 @@ static long tce_iommu_ioctl(void *iommu_data, case VFIO_IOMMU_SPAPR_TCE_GET_INFO: { struct vfio_iommu_spapr_tce_info info; - struct iommu_table *tbl; struct iommu_table_group *table_group; if (WARN_ON(!container->grp)) @@ -512,8 +516,7 @@ static long tce_iommu_ioctl(void *iommu_data, table_group = iommu_group_get_iommudata(container->grp); - tbl = &table_group->tables[0]; - if (WARN_ON_ONCE(!tbl)) + if (!table_group) return -ENXIO; minsz = offsetofend(struct vfio_iommu_spapr_tce_info, @@ -525,8 +528,8 @@ static long tce_iommu_ioctl(void *iommu_data, if (info.argsz < minsz) return -EINVAL; - info.dma32_window_start = tbl->it_offset << tbl->it_page_shift; - info.dma32_window_size = tbl->it_size << tbl->it_page_shift; + info.dma32_window_start = table_group->tce32_start; + info.dma32_window_size = table_group->tce32_size; info.flags = 0; if (copy_to_user((void __user *)arg, &info, minsz)) -- 2.0.0
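The accounting in the patch above boils down to "window size in bytes, shifted right by the system page shift, gives the number of pages to charge against the locked memory limit". A toy model of that arithmetic follows. It is a sketch only: toy_window_locked_pages() and toy_try_increment_locked() are hypothetical helpers and do not reproduce the kernel's try_increment_locked_vm(), whose body is not shown in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Number of system pages the default 32-bit window can pin in the worst case. */
uint64_t toy_window_locked_pages(uint32_t tce32_size, unsigned int page_shift)
{
	assert(page_shift < 32);
	return (uint64_t)tce32_size >> page_shift;
}

/*
 * Charge npages against a limit expressed in pages.
 * Returns 0 on success and -1 (standing in for an errno) if over the limit.
 */
int toy_try_increment_locked(uint64_t *locked_vm, uint64_t npages,
			     uint64_t limit_pages)
{
	if (!npages)
		return 0;

	if (*locked_vm + npages > limit_pages)
		return -1;

	*locked_vm += npages;
	return 0;
}
```

For instance, a 1GB window with 64K system pages (page shift 16) charges 16384 pages, while a 2GB window with 4K pages (page shift 12) charges 524288.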
[PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables
TCE tables might get too big in the case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM), so the kernel might be unable to allocate a contiguous chunk of physical memory to store the TCE table. To address this, the POWER8 CPU (actually, IODA2) supports multi-level TCE tables of up to 5 levels, which split the table into a tree of smaller subtables. This adds multi-level TCE table support to the pnv_pci_create_table() and pnv_pci_free_table() helpers.

Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * moved from ioda2 to common powernv pci code * fixed cleanup if allocation fails in the middle * removed check for the size - all boundary checks happen in the calling code anyway --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/platforms/powernv/pci-ioda.c | 15 +++-- arch/powerpc/platforms/powernv/pci.c | 94 +-- arch/powerpc/platforms/powernv/pci.h | 4 +- 4 files changed, 104 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e7ca0a..0f50ee2 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -96,6 +96,8 @@ struct iommu_pool { struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ + unsigned long it_indirect_levels; + unsigned long it_level_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 59baa15..cc1d09c 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1967,13 +1967,17 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, table_group); struct pnv_phb *phb = pe->phb; int64_t rc; + const unsigned long size = tbl->it_indirect_levels ? 
+ tbl->it_level_size : tbl->it_size; const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; const __u64 win_size = tbl->it_size << tbl->it_page_shift; pe_info(pe, "Setting up window at %llx..%llx " - "pgsize=0x%x tablesize=0x%lx\n", + "pgsize=0x%x tablesize=0x%lx " + "levels=%d levelsize=%x\n", start_addr, start_addr + win_size - 1, - 1UL << tbl->it_page_shift, tbl->it_size << 3); + 1UL << tbl->it_page_shift, tbl->it_size << 3, + tbl->it_indirect_levels + 1, tbl->it_level_size << 3); tbl->it_table_group = &pe->table_group; @@ -1984,9 +1988,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, pe->pe_number << 1, - 1, + tbl->it_indirect_levels + 1, __pa(tbl->it_base), - tbl->it_size << 3, + size << 3, 1ULL << tbl->it_page_shift); if (rc) { pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); @@ -2099,7 +2103,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, phb->ioda.m32_pci_base); rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node, - 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl); + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, + POWERNV_IOMMU_DEFAULT_LEVELS, tbl); if (rc) { pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc); return; diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index 6bcfad5..fc129c4 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -46,6 +46,8 @@ #define cfg_dbg(fmt...)do { } while(0) //#define cfg_dbg(fmt...) printk(fmt) +#define ROUND_UP(x, n) (((x) + (n) - 1ULL) & ~((n) - 1ULL)) + #ifdef CONFIG_PCI_MSI static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type) { @@ -577,6 +579,19 @@ struct pci_ops pnv_pci_ops = { static __be64 *pnv_tce(struct iommu_table *tbl, long idx) { __be64 *tmp = ((__be64 *)tbl->it_base); + int level = tbl->it_indirect_levels; + const long shift = ilo
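The pnv_tce() hunk is cut off above, so as a rough illustration of how a multi-level table turns one flat entry number into a path through the tree, here is a standalone sketch. It makes assumptions: every level holds the same power-of-two number of entries, and toy_tce_split_index() is a hypothetical helper, not the code from the patch.

```c
#include <assert.h>

/*
 * Split a flat TCE index into one slot number per level, top level first.
 * level_shift is log2 of the number of entries per level, and levels is the
 * total number of levels (1 means a plain single-level table).
 */
void toy_tce_split_index(unsigned long idx, unsigned int levels,
			 unsigned int level_shift, unsigned long out[])
{
	const unsigned long mask = (1UL << level_shift) - 1;

	assert(levels > 0);
	for (unsigned int i = 0; i < levels; i++) {
		/* The top level uses the highest bits, the leaf the lowest. */
		unsigned int shift = (levels - 1 - i) * level_shift;

		out[i] = (idx >> shift) & mask;
	}
}
```

With two levels of 512 entries each (level_shift 9), index 1543 resolves to slot 3 in the top-level table and slot 7 in the leaf it points to, so no single allocation has to cover all 262144 entries.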
[PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
This extends iommu_table_group_ops with a set of callbacks to support dynamic DMA window management.

create_table() creates a TCE table with specific parameters. It receives iommu_table_group to know the node ID in order to allocate TCE table memory closer to the PHB. The exact format of the allocated multi-level table might also be specific to the PHB model (not the case now though). This callback calculates the DMA window offset on a PCI bus from @num and stores it in the just-created table.

set_window() sets the window at the specified TVT index + @num on the PHB.

unset_window() unsets the window from the specified TVT.

This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table.

create_table() and free() are supposed to be called once per VFIO container, and set_window()/unset_window() are supposed to be called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default 32bit window parameters and others.

Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 19 arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++--- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++-- 3 files changed, 96 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 0f50ee2..7694546 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -70,6 +70,7 @@ struct iommu_table_ops { /* get() returns a physical address */ unsigned long (*get)(struct iommu_table *tbl, long index); void (*flush)(struct iommu_table *tbl); + void (*free)(struct iommu_table *tbl); }; /* These are used by VIO */ @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + long (*create_table)(struct iommu_table_group *table_group, + int num, + __u32 page_shift, + __u64 window_size, + __u32 levels, + struct iommu_table *tbl); + long 
(*set_window)(struct iommu_table_group *table_group, + int num, + struct iommu_table *tblnew); + long (*unset_window)(struct iommu_table_group *table_group, + int num); /* * Switches ownership from the kernel itself to an external * user. While onwership is taken, the kernel cannot use IOMMU itself. @@ -160,6 +172,13 @@ struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif + /* Some key properties of IOMMU */ + __u32 tce32_start; + __u32 tce32_size; + __u64 pgsizes; /* Bitmap of supported page sizes */ + __u32 max_dynamic_windows_supported; + __u32 max_levels; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; struct iommu_table_group_ops *ops; }; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index cc1d09c..4828837 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { #endif .clear = pnv_ioda2_tce_free, .get = pnv_tce_get, + .free = pnv_pci_free_table, }; static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, TCE_PCI_SWINV_PAIR); tbl->it_ops = &pnv_ioda1_iommu_ops; + pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift; + pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift; iommu_init_table(tbl, phb->hose->node); if (pe->flags & PNV_IODA_PE_DEV) { @@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, } static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, - struct iommu_table *tbl) + int num, struct iommu_table *tbl) { struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, table_group); @@ -1972,9 +1976,10 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, const __u64 
start_addr = tbl->it_offset << tbl->it_page_shift; const __u64 win_size = tbl->it_size << tbl->it_page_shift; - pe_info(pe, "Setting up window at %llx..%llx " + pe_info(pe, "Setting up window#%d at %llx..%llx " "pgsize=0x%x tablesize=0x%lx " "levels=%
[PATCH kernel] commit 4fbdf9cb ("lpfc: Fix for lun discovery issue with saturn adapter.")
This reverts 4fbdf9cb as it breaks LPFC on a POWER7 machine running a big endian kernel. This is the hardware used for verification: 0005:01:00.0 Fibre Channel [0c04]: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter [10df:f100] (rev 03) 0005:01:00.1 Fibre Channel [0c04]: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter [10df:f100] (rev 03)

Signed-off-by: Alexey Kardashevskiy --- drivers/scsi/lpfc/lpfc_scsi.c | 41 + 1 file changed, 21 insertions(+), 20 deletions(-) diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c index cb73cf9..c140f99 100644 --- a/drivers/scsi/lpfc/lpfc_scsi.c +++ b/drivers/scsi/lpfc/lpfc_scsi.c @@ -1130,25 +1130,6 @@ lpfc_release_scsi_buf(struct lpfc_hba *phba, struct lpfc_scsi_buf *psb) } /** - * lpfc_fcpcmd_to_iocb - copy the fcp_cmd data into the IOCB - * @data: A pointer to the immediate command data portion of the IOCB. - * @fcp_cmnd: The FCP Command that is provided by the SCSI layer. - * - * The routine copies the entire FCP command from @fcp_cmnd to @data while - * byte swapping the data to big endian format for transmission on the wire. - **/ -static void -lpfc_fcpcmd_to_iocb(uint8_t *data, struct fcp_cmnd *fcp_cmnd) -{ - int i, j; - - for (i = 0, j = 0; i < sizeof(struct fcp_cmnd); -i += sizeof(uint32_t), j++) { - ((uint32_t *)data)[j] = cpu_to_be32(((uint32_t *)fcp_cmnd)[j]); - } -} - -/** * lpfc_scsi_prep_dma_buf_s3 - DMA mapping for scsi buffer to SLI3 IF spec * @phba: The Hba for which this call is being executed. * @lpfc_cmd: The scsi buffer which is going to be mapped. 
@@ -1283,7 +1264,6 @@ lpfc_scsi_prep_dma_buf_s3(struct lpfc_hba *phba, struct lpfc_scsi_buf *lpfc_cmd) * we need to set word 4 of IOCB here */ iocb_cmd->un.fcpi.fcpi_parm = scsi_bufflen(scsi_cmnd); - lpfc_fcpcmd_to_iocb(iocb_cmd->unsli3.fcp_ext.icd, fcp_cmnd); return 0; } @@ -4147,6 +4127,24 @@ lpfc_scsi_cmd_iocb_cmpl(struct lpfc_hba *phba, struct lpfc_iocbq *pIocbIn, } /** + * lpfc_fcpcmd_to_iocb - copy the fcp_cmd data into the IOCB + * @data: A pointer to the immediate command data portion of the IOCB. + * @fcp_cmnd: The FCP Command that is provided by the SCSI layer. + * + * The routine copies the entire FCP command from @fcp_cmnd to @data while + * byte swapping the data to big endian format for transmission on the wire. + **/ +static void +lpfc_fcpcmd_to_iocb(uint8_t *data, struct fcp_cmnd *fcp_cmnd) +{ + int i, j; + for (i = 0, j = 0; i < sizeof(struct fcp_cmnd); +i += sizeof(uint32_t), j++) { + ((uint32_t *)data)[j] = cpu_to_be32(((uint32_t *)fcp_cmnd)[j]); + } +} + +/** * lpfc_scsi_prep_cmnd - Wrapper func for convert scsi cmnd to FCP info unit * @vport: The virtual port for which this call is being executed. * @lpfc_cmd: The scsi command which needs to send. @@ -4225,6 +4223,9 @@ lpfc_scsi_prep_cmnd(struct lpfc_vport *vport, struct lpfc_scsi_buf *lpfc_cmd, fcp_cmnd->fcpCntl3 = 0; phba->fc4ControlRequests++; } + if (phba->sli_rev == 3 && + !(phba->sli3_options & LPFC_SLI3_BG_ENABLED)) + lpfc_fcpcmd_to_iocb(iocb_cmd->unsli3.fcp_ext.icd, fcp_cmnd); /* * Finish initializing those IOCB fields that are independent * of the scsi_cmnd request_buffer -- 2.0.0
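For readers not familiar with what lpfc_fcpcmd_to_iocb() does, the word-by-word swap to big endian can be sketched in userspace as below. This is an approximation under assumptions: toy_copy_words_be32() is a hypothetical name, POSIX htonl() stands in for the kernel's cpu_to_be32(), and memcpy() is used instead of pointer casts to stay clear of alignment and aliasing problems.

```c
#include <arpa/inet.h> /* htonl(): host order to big endian, 32 bits */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes from src to dst one 32-bit word at a time, converting each
 * word from host byte order to big endian. len must be a multiple of 4.
 */
void toy_copy_words_be32(uint8_t *dst, const void *src, size_t len)
{
	const uint8_t *s = src;

	assert(len % sizeof(uint32_t) == 0);
	for (size_t i = 0; i < len; i += sizeof(uint32_t)) {
		uint32_t word;

		memcpy(&word, s + i, sizeof(word));
		word = htonl(word);
		memcpy(dst + i, &word, sizeof(word));
	}
}
```

On a big endian host the conversion is a no-op and this degenerates into a plain copy, whereas on little endian every word is byte-reversed, which is one reason why moving such a call around can show different symptoms on POWER7 BE and x86.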
Re: [PATCH kernel] commit 4fbdf9cb ("lpfc: Fix for lun discovery issue with saturn adapter.")
On 04/28/2015 07:18 PM, Sebastian Herbszt wrote:

Alexey Kardashevskiy wrote:

This reverts 4fbdf9cb as it breaks LPFC on a POWER7 machine with a big endian kernel.

This is the hardware used for verification:
0005:01:00.0 Fibre Channel [0c04]: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter [10df:f100] (rev 03)
0005:01:00.1 Fibre Channel [0c04]: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter [10df:f100] (rev 03)

Signed-off-by: Alexey Kardashevskiy

This issue is not specific to POWER7. I hit it on x86 [1] and James promised to look at it.

[1] http://marc.info/?l=linux-scsi&m=142938432414173

Sebastian

Well, I hope so, I just wanted to be more specific and the fault looks much different (and much cooler! :) ) on my hardware (it actually enters an infinite loop of oops'es):

Welcome to Fedora 20 (Heisenbug)!
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched self-detected stall on CPU
1: (2100 ticks this GP) idle=981/141/0 softirq=234/234 fqs =2083
2: (2100 ticks this GP) idle=c3d/141/0 softirq=259/259 fqs =2083
(t=2100 jiffies g=-7 c=-8 q=11820)
(t=2100 jiffies g=-7 c=-8 q=11820)
Task dump for CPU 0:
kworker/u97:0 R running task 8192 7 2 0x0804
Workqueue: events_unbound .async_run_entry_fn
Call Trace:
[c00ffa29ef80] [c00ffa29f060] 0xc00ffa29f060 (unreliable)
Task dump for CPU 1:
kworker/u97:2 R running task 10304 1636 2 0x0804
Workqueue: events_unbound .async_run_entry_fn
Call Trace:
[c00ff2fd2f80] [c00ff2fd3060] 0xc00ff2fd3060 (unreliable)
Task dump for CPU 2:
kworker/u97:1 R running task 8288 1633 2 0x0804
Workqueue: events_unbound .async_run_entry_fn
Call Trace:
[c00ff2f92eb0] [c00cf610] .sched_show_task+0xf0/0x180 (unreliable)
[c00ff2f92f30] [c01041d8] .rcu_dump_cpu_stacks+0xd8/0x150
[c00ff2f92fd0] [c0108794] .rcu_check_callbacks+0x674/0x990
[c00ff2f93110] [c010e994] .update_process_times+0x44/0x90
[c00ff2f93190] [c01223f0] .tick_sched_handle.isra.16+0x20/0xa0
[c00ff2f93210] [c01224cc] .tick_sched_timer+0x5c/0xb0
[c00ff2f932b0] [c010f108] .__run_hrtimer+0x98/0x260
[c00ff2f93350] [c010fff8] .hrtimer_interrupt+0x138/0x2f0
[c00ff2f93460] [c001be1c] .__timer_interrupt+0x8c/0x230
[c00ff2f93500] [c001c488] .timer_interrupt+0x98/0xd0
[c00ff2f93580] [c00025d0] decrementer_common+0x150/0x180
--- interrupt: 901 at .string_get_size+0x120/0x250
    LR = .sd_revalidate_disk+0x57c/0x1c10
[c00ff2f93870] [c048f84c] .string_get_size+0x18c/0x250 (unreliable)
[c00ff2f93940] [c05e7c1c] .sd_revalidate_disk+0x57c/0x1c10
[c00ff2f93a70] [c05e951c] .sd_probe_async+0xac/0x230
[c00ff2f93b00] [c00c28ec] .async_run_entry_fn+0x6c/0x180
[c00ff2f93ba0] [c00b7b78] .process_one_work+0x1a8/0x4a0
[c00ff2f93c40] [c00b7ff0] .worker_thread+0x180/0x5a0
[c00ff2f93d30] [c00bee08] .kthread+0x108/0x130
[c00ff2f93e30] [c0009590] .ret_from_kernel_thread+0x58/0xc8
Task dump for CPU 0:
kworker/u97:0 R running task 8192 7 2 0x0804
Workqueue: events_unbound .async_run_entry_fn
Call Trace:
[c00ffa29ef80] [c00ffa29f060] 0xc00ffa29f060 (unreliable)
Task dump for CPU 1:
kworker/u97:2 R running task 9488 1636 2 0x0804
Workqueue: events_unbound .async_run_entry_fn
Call Trace:
[c00ff2fd2eb0] [c00cf610] .sched_show_task+0xf0/0x180 (unreliable)
[c00ff2fd2f30] [c01041d8] .rcu_dump_cpu_stacks+0xd8/0x150
[c00ff2fd2fd0] [c0108794] .rcu_check_callbacks+0x674/0x990
[c00ff2fd3110] [c010e994] .update_process_times+0x44/0x90
[c00ff2fd3190] [c01223f0] .tick_sched_handle.isra.16+0x20/0xa0
[c00ff2fd3210] [c01224cc] .tick_sched_timer+0x5c/0xb0
[c00ff2fd32b0] [c010f108] .__run_hrtimer+0x98/0x260
[c00ff2fd3350] [c010fff8] .hrtimer_interrupt+0x138/0x2f0
[c00ff2fd3460] [c001be1c] .__timer_interrupt+0x8c/0x230
[c00ff2fd3500] [c001c488] .timer_interrupt+0x98/0xd0
[c00ff2fd3580] [c00025d0] decrementer_common+0x150/0x180
--- interrupt: 901 at .string_get_size+0x110/0x250
    LR = .sd_revalidate_disk+0x57c/0x1c10
[c00ff2fd3870] [c048f84c] .string_get_size+0x18c/0x250 (unreliable)
[c00ff2fd3940] [c05e7c1c] .sd_revalidate_disk+0x57c/0x1c10
[c00ff2fd3a70] [c05e951c] .sd_probe_async+0xac/0x230
[c00ff2fd3b00] [c00c28ec] .async_run_entry_fn+0x6c/0x180
[c00ff2fd3ba0] [c00b7b78] .process_one_work+0x1a8/0x4a0
[c00ff2fd3c40] [c00b7ff0] .worker_thread+0x180/0x5a0
[c00ff2fd3d30] [c00bee08] .kthread+0x108/0x130
[c00ff2fd3e30] [c0009590] .ret_from_kernel_thread+0x58/
Re: [PATCH v2 0/4] PCI: Generic fixes and moving enumeration back to QEMU
On 04/27/2015 06:02 PM, Nikunj A Dadhania wrote:

patch 1: Fixes a bug in pci-to-pci bridge while programming the limit registers during probe
patch 2,3: Support 64-bit address translation and usb devices can have 64-bit pci memory BARs
patch 4: Use QEMU created device tree nodes and program the BARs

I taught my "git send-email" wrapper script to add "kernel/qemu/slof" in the subject, otherwise I believe it is hard for busy people to tell quickly which patchset is for what. I'd recommend doing the same, especially when you mention "QEMU" in a subject :)

Nikunj A Dadhania (4):
  pci: program correct bridge limit registers during probe
  pci: Support 64-bit address translation
  usb: support 64-bit pci bars
  pci: Use QEMU created PCI device nodes

 board-qemu/slof/pci-phb.fs | 44 -
 slof/fs/devices/pci-class_0c.fs | 10 --
 slof/fs/pci-properties.fs | 6 +-
 slof/fs/pci-scan.fs | 6 +++---
 slof/fs/translate.fs | 6 ++
 5 files changed, 61 insertions(+), 11 deletions(-)

--
Alexey
Re: [PATCH kernel v9 16/32] powerpc/powernv/ioda: Move TCE kill register address to PE
On 04/29/2015 01:25 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:40PM +1000, Alexey Kardashevskiy wrote: At the moment the DMA setup code looks for the "ibm,opal-tce-kill" property which contains the TCE kill register address. Writes to this register invalidates TCE cache on IODA/IODA2 hub. This moves the register address from iommu_table to pnv_ioda_pe as later there will be 2 tables per PE and it will be used for both tables. This moves the property reading/remapping code to a helper to reduce code duplication. This adds a new pnv_pci_ioda2_tvt_invalidate() helper which invalidates the entire table. It should be called after every call to opal_pci_map_pe_dma_window(). It was not required before because there is just a single TCE table and 64bit DMA is handled via bypass window (which has no table so no chache is used) but this is going to change with Dynamic DMA windows (DDW). Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * new in the series --- arch/powerpc/platforms/powernv/pci-ioda.c | 69 +++ arch/powerpc/platforms/powernv/pci.h | 1 + 2 files changed, 44 insertions(+), 26 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index f070c44..b22b3ca 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1672,7 +1672,7 @@ static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, struct pnv_ioda_pe, table_group); __be64 __iomem *invalidate = rm ? 
(__be64 __iomem *)pe->tce_inval_reg_phys : - (__be64 __iomem *)tbl->it_index; + pe->tce_inval_reg; unsigned long start, end, inc; const unsigned shift = tbl->it_page_shift; @@ -1743,6 +1743,18 @@ static struct iommu_table_ops pnv_ioda1_iommu_ops = { .get = pnv_tce_get, }; +static inline void pnv_pci_ioda2_tvt_invalidate(struct pnv_ioda_pe *pe) +{ + /* 01xb - invalidate TCEs that match the specified PE# */ + unsigned long addr = (0x4ull << 60) | (pe->pe_number & 0xFF); This doesn't really look like an address, but rather the data you're writing to the register. This thing is made of "invalidate operation" (0x4 here), "invalidate address" (pci address but it is zero here as we reset everything, most bits are here) and "invalidate PE number". So what should I call it? :) + if (!pe->tce_inval_reg) + return; + +mb(); /* Ensure above stores are visible */ + __raw_writeq(cpu_to_be64(addr), pe->tce_inval_reg); +} + static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, unsigned long index, unsigned long npages, bool rm) { @@ -1751,7 +1763,7 @@ static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, unsigned long start, end, inc; __be64 __iomem *invalidate = rm ? (__be64 __iomem *)pe->tce_inval_reg_phys : - (__be64 __iomem *)tbl->it_index; + pe->tce_inval_reg; const unsigned shift = tbl->it_page_shift; /* We'll invalidate DMA address in PE scope */ @@ -1803,13 +1815,31 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { .get = pnv_tce_get, }; +static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, + struct pnv_ioda_pe *pe) +{ + const __be64 *swinvp; + + /* OPAL variant of PHB3 invalidated TCEs */ + swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL); + if (!swinvp) + return; + + /* We need a couple more fields -- an address and a data +* to or. Since the bus is only printed out on table free +* errors, and on the first pass the data will be a relative +* bus number, print that out instead. 
+*/ The comment above appears to have nothing to do with the surrounding code. I'll just remove it. + pe->tce_inval_reg_phys = be64_to_cpup(swinvp); + pe->tce_inval_reg = ioremap(pe->tce_inval_reg_phys, 8); +} + static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe, unsigned int base, unsigned int segs) { struct page *tce_mem = NULL; - const __be64 *swinvp; struct iommu_table *tbl; unsigned int i; int64_t rc; @@ -1823,6 +1853,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, if (WARN_ON(pe->tce32_seg >= 0)) return; + pnv_pci_ioda_setup_opal_tce_kill(phb, pe); + /* Grab a 32-bit TCE table */ pe->tce32_seg = base; pe_info(pe, " Setting up 32-bit TCE table at %08x..%08x\n", @@ -1865,20 +1897,11 @@ static void pnv_p
Re: [PATCH kernel v9 17/32] powerpc/powernv: Implement accessor to TCE entry
On 04/29/2015 02:04 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:41PM +1000, Alexey Kardashevskiy wrote:

This replaces direct accesses to TCE table with a helper which returns a TCE entry address. This makes no difference now but will when multi-level TCE tables get introduced.

No change in behavior is expected.

Signed-off-by: Alexey Kardashevskiy
Reviewed-by: David Gibson
---
Changes:
v9:
* new patch in the series to separate this mechanical change from functional changes; this is not right before "powerpc/powernv: Implement multilevel TCE tables" but here in order to let the next patch - "powerpc/iommu/powernv: Release replaced TCE" - use pnv_tce() and avoid changing the same code twice
---
 arch/powerpc/platforms/powernv/pci.c | 34 +-
 1 file changed, 21 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 84b4ea4..ba75aa5 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -572,38 +572,46 @@ struct pci_ops pnv_pci_ops = {
 	.write = pnv_pci_write_config,
 };

+static __be64 *pnv_tce(struct iommu_table *tbl, long idx)
+{
+	__be64 *tmp = ((__be64 *)tbl->it_base);
+
+	return tmp + idx;
+}
+
 int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction,
 		struct dma_attrs *attrs)
 {
 	u64 proto_tce = iommu_direction_to_tce_perm(direction);
-	__be64 *tcep;
-	u64 rpn;
+	u64 rpn = __pa(uaddr) >> tbl->it_page_shift;

I guess this was a problem in the existing code, not this patch. But "uaddr" is a really bad name (and unsigned long is a bad type) for what must actually be a kernel linear mapping address.

Yes, and maybe one day I'll clean this up. s/uaddr/linear/ and s/hwaddr/hpa/ are the first things to do globally but not in this patchset.

+	long i;

-	tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
-	rpn = __pa(uaddr) >> tbl->it_page_shift;
-
-	while (npages--)
-		*(tcep++) = cpu_to_be64(proto_tce |
-				(rpn++ << tbl->it_page_shift));
+	for (i = 0; i < npages; i++) {
+		unsigned long newtce = proto_tce |
+			((rpn + i) << tbl->it_page_shift);
+		unsigned long idx = index - tbl->it_offset + i;
+		*(pnv_tce(tbl, idx)) = cpu_to_be64(newtce);
+	}

 	return 0;
 }

 void pnv_tce_free(struct iommu_table *tbl, long index, long npages)
 {
-	__be64 *tcep;
+	long i;

-	tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
+	for (i = 0; i < npages; i++) {
+		unsigned long idx = index - tbl->it_offset + i;

-	while (npages--)
-		*(tcep++) = cpu_to_be64(0);
+		*(pnv_tce(tbl, idx)) = cpu_to_be64(0);
+	}
 }

 unsigned long pnv_tce_get(struct iommu_table *tbl, long index)
 {
-	return ((u64 *)tbl->it_base)[index - tbl->it_offset];
+	return *(pnv_tce(tbl, index - tbl->it_offset));
 }

 void pnv_pci_setup_iommu_table(struct iommu_table *tbl,

--
Alexey
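The accessor idea in the patch above can be illustrated with a small user-space sketch. This is not the kernel code: `toy_table`, `toy_tce()`, `toy_tce_build()`, `toy_tce_free()` and `toy_tce_get()` are hypothetical names, plain `uint64_t` stands in for `__be64` (no endianness conversion is modelled), and the caller passes a physical address directly. The point it tries to show is that only one helper knows how an entry is located, so a later change of table layout would touch that helper alone.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct iommu_table; only the fields used here. */
struct toy_table {
	uint64_t *it_base;          /* flat array of entries */
	unsigned long it_offset;    /* index of the first entry in the window */
	unsigned long it_size;      /* number of entries */
	unsigned int it_page_shift; /* log2 of the IOMMU page size */
};

/* The only function that knows how an entry is located in memory. */
static uint64_t *toy_tce(struct toy_table *tbl, unsigned long idx)
{
	assert(idx < tbl->it_size);
	return tbl->it_base + idx;
}

/* Map npages consecutive IOMMU pages starting at table index "index". */
static void toy_tce_build(struct toy_table *tbl, unsigned long index,
			  unsigned long npages, uint64_t pa, uint64_t perm)
{
	uint64_t rpn = pa >> tbl->it_page_shift;
	unsigned long i;

	for (i = 0; i < npages; i++) {
		uint64_t newtce = perm | ((rpn + i) << tbl->it_page_shift);

		*toy_tce(tbl, index - tbl->it_offset + i) = newtce;
	}
}

/* Clear npages entries starting at table index "index". */
static void toy_tce_free(struct toy_table *tbl, unsigned long index,
			 unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		*toy_tce(tbl, index - tbl->it_offset + i) = 0;
}

/* Read back one entry. */
static uint64_t toy_tce_get(struct toy_table *tbl, unsigned long index)
{
	return *toy_tce(tbl, index - tbl->it_offset);
}
```

With an 8-entry table whose window starts at index 16 and a 4K page shift, building two pages at index 18 from physical address 0x5000 fills the array slots 2 and 3; swapping `toy_tce()` for a multi-level lookup would leave the three callers unchanged.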
Re: [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table
On 04/29/2015 02:39 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:44PM +1000, Alexey Kardashevskiy wrote: This is a part of moving TCE table allocation into an iommu_ops callback to support multiple IOMMU groups per one VFIO container. This moves a table creation window to the file with common powernv-pci helpers as it does not do anything IODA2-specific. This adds pnv_pci_free_table() helper to release the actual TCE table. This enforces window size to be a power of two. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v9: * moved helpers to the common powernv pci.c file from pci-ioda.c * moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages() --- arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++ arch/powerpc/platforms/powernv/pci.c | 61 +++ arch/powerpc/platforms/powernv/pci.h | 4 ++ 3 files changed, 76 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a80be34..b9b3773 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe if (rc) pe_warn(pe, "OPAL error %ld release DMA window\n", rc); - iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node)); - free_pages(addr, get_order(TCE32_TABLE_SIZE)); + pnv_pci_free_table(tbl); } static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs) @@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = { static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe) { - struct page *tce_mem = NULL; - void *addr; struct iommu_table *tbl = &pe->table_group.tables[0]; - unsigned int tce_table_size, end; int64_t rc; /* We shouldn't already have a 32-bit DMA associated */ @@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* 
The PE will reserve all possible 32-bits space */ pe->tce32_seg = 0; - end = (1 << ilog2(phb->ioda.m32_pci_base)); - tce_table_size = (end / 0x1000) * 8; pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n", - end); + phb->ioda.m32_pci_base); - /* Allocate TCE table */ - tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL, - get_order(tce_table_size)); - if (!tce_mem) { - pe_err(pe, "Failed to allocate a 32-bit TCE memory\n"); - goto fail; + rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node, + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl); + if (rc) { + pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc); + return; } - addr = page_address(tce_mem); - memset(addr, 0, tce_table_size); - - /* Setup iommu */ - tbl->it_table_group = &pe->table_group; - - /* Setup linux iommu table */ - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, - IOMMU_PAGE_SHIFT_4K); tbl->it_ops = &pnv_ioda2_iommu_ops; + + /* Setup iommu */ + tbl->it_table_group = &pe->table_group; iommu_init_table(tbl, phb->hose->node); #ifdef CONFIG_IOMMU_API pe->table_group.ops = &pnv_pci_ioda2_ops; @@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, fail: if (pe->tce32_seg >= 0) pe->tce32_seg = -1; - if (tce_mem) - __free_pages(tce_mem, get_order(tce_table_size)); + pnv_pci_free_table(tbl); } static void pnv_ioda_setup_dma(struct pnv_phb *phb) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index e8802ac..6bcfad5 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -20,7 +20,9 @@ #include #include #include +#include +#include #include #include #include @@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl, tbl->it_type = TCE_PCI; } +static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift, + unsigned long *tce_table_allocated) I'm a bit confused by the tce_table_allocated parameter. 
What's the circumstance where more memory is requested than required, and why does it matter to the caller? It does not make much sense here but it does for "powerpc/powernv: Implement multilevel TCE tables" - I was trying to avoid changing same
Re: [PATCH kernel v9 13/32] vfio: powerpc/spapr/iommu/powernv/ioda2: Rework IOMMU ownership control
On 04/29/2015 01:02 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:37PM +1000, Alexey Kardashevskiy wrote: This adds tce_iommu_take_ownership() and tce_iommu_release_ownership which call in a loop iommu_take_ownership()/iommu_release_ownership() for every table on the group. As there is just one now, no change in behaviour is expected. At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. The set_bypass() callback is not really an iommu_table function but IOMMU/PE function. This introduces a iommu_table_group_ops struct and adds take_ownership()/release_ownership() callbacks to it which are called when an external user takes/releases control over the IOMMU. This replaces set_bypass() with ownership callbacks as it is not necessarily just bypass enabling, it can be something else/more so let's give it more generic name. The callbacks is implemented for IODA2 only. Other platforms (P5IOC2, IODA1) will use the old iommu_take_ownership/iommu_release_ownership API. The following patches will replace iommu_take_ownership/ iommu_release_ownership calls in IODA2 with full IOMMU table release/ create. 
Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v9: * squashed "vfio: powerpc/spapr: powerpc/iommu: Rework IOMMU ownership control" and "vfio: powerpc/spapr: powerpc/powernv/ioda2: Rework IOMMU ownership control" into a single patch * moved helpers with a loop through tables in a group to vfio_iommu_spapr_tce.c to keep the platform code free of IOMMU table groups as much as possible * added missing tce_iommu_clear() to tce_iommu_release_ownership() * replaced the set_ownership(enable) callback with take_ownership() and release_ownership() --- arch/powerpc/include/asm/iommu.h | 13 +- arch/powerpc/kernel/iommu.c | 11 -- arch/powerpc/platforms/powernv/pci-ioda.c | 40 +++ drivers/vfio/vfio_iommu_spapr_tce.c | 66 +++ 4 files changed, 103 insertions(+), 27 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index fa37519..e63419e 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -93,7 +93,6 @@ struct iommu_table { unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_table_group; struct iommu_table_ops *it_ops; - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ @@ -128,11 +127,23 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, #define IOMMU_TABLE_GROUP_MAX_TABLES 1 +struct iommu_table_group; + +struct iommu_table_group_ops { + /* +* Switches ownership from the kernel itself to an external +* user. While onwership is taken, the kernel cannot use IOMMU itself. Typo in "onwership". I'd also like to see this be even more explicit that "take" is the "core kernel -> vfio/whatever" transition and release is the reverse. Will this work? /* * Switches ownership from the kernel itself to an external * user. 
* The ownership is taken when VFIO starts using the IOMMU group * and released when the platform code gets the control over the group back. * While ownership is taken, the platform code cannot use IOMMU itself. */ +*/ + void (*take_ownership)(struct iommu_table_group *table_group); + void (*release_ownership)(struct iommu_table_group *table_group); +}; + struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; + struct iommu_table_group_ops *ops; }; #ifdef CONFIG_IOMMU_API diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 005146b..2856d27 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1057,13 +1057,6 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl->it_map, 0xff, sz); - /* -* Disable iommu bypass, otherwise the user can DMA to all of -* our physical memory via the bypass window instead of just -* the pages that has been explicitly mapped into the iommu -*/ - if (tbl->set_bypass) - tbl->set_bypass(tbl, false); return 0; } @@ -1078,10 +1071,6 @@ void iommu_release_ownership(struct iommu_table *tbl) /* Restore bit#0 set by iommu_init_table() */ if (tbl->it_offset == 0) set_bit(0, tbl->it_map); - - /* The kernel owns the device now, we can restore the iommu bypass */ - if (tbl->set_bypass) - tbl->se
Re: [PATCH kernel v9 21/32] powerpc/powernv/ioda2: Introduce pnv_pci_ioda2_set_window
On 04/29/2015 02:45 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:45PM +1000, Alexey Kardashevskiy wrote: This is a part of moving DMA window programming to an iommu_ops callback. pnv_pci_ioda2_set_window() takes an iommu_table_group as a first parameter (not pnv_ioda_pe) as it is going to be used as a callback for VFIO DDW code. This adds pnv_pci_ioda2_tvt_invalidate() to invalidate TVT as it is a good thing to do. What's the TVT and why is invalidating it a good thing? "TCE Validation Table". Yeah, I need to rephrase it. Will do. Also, it looks like it didn't add it, just move it. Agrh. Lost it in rebases. Will fix. It does not have immediate effect now as the table is never recreated after reboot but it will in the following patches. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson Really? I don't remember this one. Message-ID: <20150416064351.gk3...@voom.redhat.com> :) But I believe it did not have TVT stuff then so I should have removed your RB from here. 
--- Changes: v9: * initialize pe->table_group.tables[0] at the very end when tbl is fully initialized * moved pnv_pci_ioda2_tvt_invalidate() from earlier patch --- arch/powerpc/platforms/powernv/pci-ioda.c | 67 +++ 1 file changed, 51 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index b9b3773..59baa15 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1960,6 +1960,52 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } +static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, + struct iommu_table *tbl) +{ + struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe, + table_group); + struct pnv_phb *phb = pe->phb; + int64_t rc; + const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; + const __u64 win_size = tbl->it_size << tbl->it_page_shift; + + pe_info(pe, "Setting up window at %llx..%llx " + "pgsize=0x%x tablesize=0x%lx\n", + start_addr, start_addr + win_size - 1, + 1UL << tbl->it_page_shift, tbl->it_size << 3); + + tbl->it_table_group = &pe->table_group; + + /* +* Map TCE table through TVT. The TVE index is the PE number +* shifted by 1 bit for 32-bits DMA space. +*/ + rc = opal_pci_map_pe_dma_window(phb->opal_id, + pe->pe_number, + pe->pe_number << 1, + 1, + __pa(tbl->it_base), + tbl->it_size << 3, + 1ULL << tbl->it_page_shift); + if (rc) { + pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); + goto fail; + } + + pnv_pci_ioda2_tvt_invalidate(pe); + + /* Store fully initialized *tbl (may be external) in PE */ + pe->table_group.tables[0] = *tbl; Hrm, a non-atomic copy of a whole structure into the array. Is that really what you want? set_window is called from VFIO (protected by mutex there) and the platform code which I believe is not racy (or hotplug takes care of it anyway). Or I am missing something else? 
+ return 0; +fail: + if (pe->tce32_seg >= 0) + pe->tce32_seg = -1; + + return rc; +} + static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { uint16_t window_id = (pe->pe_number << 1 ) + 1; @@ -2068,21 +2114,16 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, pe->table_group.ops = &pnv_pci_ioda2_ops; #endif - /* -* Map TCE table through TVT. The TVE index is the PE number -* shifted by 1 bit for 32-bits DMA space. -*/ - rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, - pe->pe_number << 1, 1, __pa(tbl->it_base), - tbl->it_size << 3, 1ULL << tbl->it_page_shift); + rc = pnv_pci_ioda2_set_window(&pe->table_group, tbl); if (rc) { pe_err(pe, "Failed to configure 32-bit TCE table," " err %ld\n", rc); - goto fail; + pnv_pci_free_table(tbl); + if (pe->tce32_seg >= 0) + pe->tce32_seg = -1; + return; } - pnv_pci_ioda2_tvt_invalidate(pe); - /* OPAL variant of PHB3 invalidat
Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
On 04/29/2015 03:30 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote: This extends iommu_table_group_ops by a set of callbacks to support dynamic DMA windows management. create_table() creates a TCE table with specific parameters. it receives iommu_table_group to know nodeid in order to allocate TCE table memory closer to the PHB. The exact format of allocated multi-level table might be also specific to the PHB model (not the case now though). This callback calculated the DMA window offset on a PCI bus from @num and stores it in a just created table. set_window() sets the window at specified TVT index + @num on PHB. unset_window() unsets the window from specified TVT. This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table. Doesn't the free callback belong with the previous patch introducing multi-level tables? If I did that, you would say "why is it here if nothing calls it" on "multilevel" patch and "I see the allocation but I do not see memory release" ;) I need some rule of thumb here. I think it is a bit cleaner if the same patch adds a callback for memory allocation and its counterpart, no? create_table() and free() are supposed to be called once per VFIO container and set_window()/unset_window() are supposed to be called for every group in a container. This adds IOMMU capabilities to iommu_table_group such as default 32bit window parameters and others. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 19 arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++--- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++-- 3 files changed, 96 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 0f50ee2..7694546 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -70,6 +70,7 @@ struct iommu_table_ops { /* get() returns a physical address */ unsigned long (*get)(struct iommu_table *tbl, long index); void (*flush)(struct iommu_table *tbl); + void (*free)(struct iommu_table *tbl); }; /* These are used by VIO */ @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + long (*create_table)(struct iommu_table_group *table_group, + int num, + __u32 page_shift, + __u64 window_size, + __u32 levels, + struct iommu_table *tbl); + long (*set_window)(struct iommu_table_group *table_group, + int num, + struct iommu_table *tblnew); + long (*unset_window)(struct iommu_table_group *table_group, + int num); /* * Switches ownership from the kernel itself to an external * user. While onwership is taken, the kernel cannot use IOMMU itself. @@ -160,6 +172,13 @@ struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif + /* Some key properties of IOMMU */ + __u32 tce32_start; + __u32 tce32_size; + __u64 pgsizes; /* Bitmap of supported page sizes */ + __u32 max_dynamic_windows_supported; + __u32 max_levels; With this information, table_group seems even more like a bad name. "iommu_state" maybe? Please, no. We will never come to agreement then :( And "iommu_state" is too general anyway, it won't pass. 
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; struct iommu_table_group_ops *ops; }; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index cc1d09c..4828837 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { #endif .clear = pnv_ioda2_tce_free, .get = pnv_tce_get, + .free = pnv_pci_free_table, }; static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, TCE_PCI_SWINV_PAIR); tbl->it_ops = &pnv_ioda1_iommu_ops; + pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift; + pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift; iommu_init_table(tbl, phb->hose->node); if (pe->flags & PNV_IODA_PE_DEV) { @@ -1961,7 +1965,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, } stati
Re: [PATCH kernel v9 18/32] powerpc/iommu/powernv: Release replaced TCE
On 04/29/2015 02:18 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:42PM +1000, Alexey Kardashevskiy wrote:

At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first.

Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE and DMA direction so the caller can release the pages afterwards. The exchange() receives a physical address unlike set() which receives linear mapping address; and returns a physical address as the clear() does.

This implements exchange() for P5IOC2/IODA/IODA2. This adds a requirement for a platform to have exchange() implemented in order to support VFIO.

This replaces iommu_tce_build() and iommu_clear_tce() with a single iommu_tce_xchg().

This makes sure that TCE permission bits are not set in TCE passed to IOMMU API as those are to be calculated by platform code from DMA direction.

This moves SetPageDirty() to the IOMMU code to make it work for both the VFIO ioctl interface and in-kernel TCE acceleration (when it becomes available later).

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson

This looks mostly good, but there are a couple of details that need fixing.

[...]

diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index ba75aa5..e8802ac 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -598,6 +598,23 @@ int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
 	return 0;
 }

+#ifdef CONFIG_IOMMU_API
+int pnv_tce_xchg(struct iommu_table *tbl, long index,
+		unsigned long *tce, enum dma_data_direction *direction)
+{
+	u64 proto_tce = iommu_direction_to_tce_perm(*direction);
+	unsigned long newtce = *tce | proto_tce;
+	unsigned long idx = index - tbl->it_offset;

Should this have a BUG_ON or WARN_ON if the supplied tce has bits set below the page mask?

Why? The caller checks these bits, do we really need to duplicate it here?

+	*tce = xchg(pnv_tce(tbl, idx), cpu_to_be64(newtce));
+	*tce = be64_to_cpu(*tce);
+	*direction = iommu_tce_direction(*tce);
+	*tce &= ~(TCE_PCI_READ | TCE_PCI_WRITE);
+
+	return 0;
+}
+#endif

--
Alexey
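The exchange semantics described in the commit message above (write the new entry, hand the old address and direction back so the caller can release the old page) can be sketched in user space roughly as follows. Everything here is an assumption made for illustration: C11 `atomic_exchange()` stands in for the kernel's `xchg()`, the `TOY_TCE_READ`/`TOY_TCE_WRITE` bits and the `toy_dir` enum are hypothetical, and endianness conversion is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical permission bits kept in the low bits of an entry. */
#define TOY_TCE_READ  0x1ull
#define TOY_TCE_WRITE 0x2ull

/* Hypothetical DMA directions. */
enum toy_dir { TOY_NONE, TOY_TO_DEVICE, TOY_FROM_DEVICE, TOY_BIDIRECTIONAL };

static uint64_t toy_dir_to_perm(enum toy_dir dir)
{
	switch (dir) {
	case TOY_BIDIRECTIONAL: return TOY_TCE_READ | TOY_TCE_WRITE;
	case TOY_TO_DEVICE:     return TOY_TCE_READ;
	case TOY_FROM_DEVICE:   return TOY_TCE_WRITE;
	default:                return 0;
	}
}

static enum toy_dir toy_perm_to_dir(uint64_t tce)
{
	switch (tce & (TOY_TCE_READ | TOY_TCE_WRITE)) {
	case TOY_TCE_READ | TOY_TCE_WRITE: return TOY_BIDIRECTIONAL;
	case TOY_TCE_READ:                 return TOY_TO_DEVICE;
	case TOY_TCE_WRITE:                return TOY_FROM_DEVICE;
	default:                           return TOY_NONE;
	}
}

/*
 * In: *tce is the new address (no permission bits), *dir the new direction.
 * Out: *tce is the replaced address, *dir the replaced direction.
 */
static void toy_tce_xchg(_Atomic uint64_t *entry, uint64_t *tce,
			 enum toy_dir *dir)
{
	uint64_t newtce = *tce | toy_dir_to_perm(*dir);
	uint64_t oldtce;

	/* Permission bits are derived from the direction, never passed in. */
	assert((*tce & (TOY_TCE_READ | TOY_TCE_WRITE)) == 0);

	oldtce = atomic_exchange(entry, newtce);
	*dir = toy_perm_to_dir(oldtce);
	*tce = oldtce & ~(TOY_TCE_READ | TOY_TCE_WRITE);
}
```

Because the swap is a single atomic operation, two callers replacing the same entry each get back a distinct old value, which is the property the commit message relies on instead of the pool locks.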
Re: [PATCH kernel v9 09/32] vfio: powerpc/spapr: Rework groups attaching
On 04/29/2015 12:16 PM, David Gibson wrote:

On Sat, Apr 25, 2015 at 10:14:33PM +1000, Alexey Kardashevskiy wrote:

This is to make extended ownership and multiple groups support patches simpler for review.

This should cause no behavioural change.

Um.. this doesn't appear to be true. Previously removing a group from an enabled container would fail with EBUSY, now it forces a disable.

This is the original tce_iommu_detach_group() where I cannot find the EBUSY you are referring to; it did and does enforce disable. What am I missing here?

static void tce_iommu_detach_group(void *iommu_data,
		struct iommu_group *iommu_group)
{
	struct tce_container *container = iommu_data;
	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);

	BUG_ON(!tbl);
	mutex_lock(&container->lock);
	if (tbl != container->tbl) {
		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
				iommu_group_id(iommu_group),
				iommu_group_id(tbl->it_group));
	} else {
		if (container->enabled) {
			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
					iommu_group_id(tbl->it_group));
			tce_iommu_disable(container);
		}

		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
				iommu_group_id(iommu_group), iommu_group); */
		container->tbl = NULL;
		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
		iommu_release_ownership(tbl);
	}
	mutex_unlock(&container->lock);
}

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
Reviewed-by: David Gibson
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 40 ++---
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
index 115d5e6..0fbe03e 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -460,16 +460,21 @@ static int tce_iommu_attach_group(void *iommu_data,
 				iommu_group_id(container->tbl->it_group),
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
-	} else if (container->enabled) {
+		goto unlock_exit;
+	}
+
+	if (container->enabled) {
 		pr_err("tce_vfio: attaching group #%u to enabled container\n",
 				iommu_group_id(iommu_group));
 		ret = -EBUSY;
-	} else {
-		ret = iommu_take_ownership(tbl);
-		if (!ret)
-			container->tbl = tbl;
+		goto unlock_exit;
 	}

+	ret = iommu_take_ownership(tbl);
+	if (!ret)
+		container->tbl = tbl;
+
+unlock_exit:
 	mutex_unlock(&container->lock);

 	return ret;
@@ -487,19 +492,22 @@ static void tce_iommu_detach_group(void *iommu_data,
 		pr_warn("tce_vfio: detaching group #%u, expected group is #%u\n",
 				iommu_group_id(iommu_group),
 				iommu_group_id(tbl->it_group));
-	} else {
-		if (container->enabled) {
-			pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
-					iommu_group_id(tbl->it_group));
-			tce_iommu_disable(container);
-		}
+		goto unlock_exit;
+	}

-		/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
-				iommu_group_id(iommu_group), iommu_group); */
-		container->tbl = NULL;
-		tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
-		iommu_release_ownership(tbl);
+	if (container->enabled) {
+		pr_warn("tce_vfio: detaching group #%u from enabled container, forcing disable\n",
+				iommu_group_id(tbl->it_group));
+		tce_iommu_disable(container);
 	}
+
+	/* pr_debug("tce_vfio: detaching group #%u from iommu %p\n",
+			iommu_group_id(iommu_group), iommu_group); */
+	container->tbl = NULL;
+	tce_iommu_clear(container, tbl, tbl->it_offset, tbl->it_size);
+	iommu_release_ownership(tbl);
+
+unlock_exit:
 	mutex_unlock(&container->lock);
 }

--
Alexey
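The restructuring in this patch (flat early exits through a single `unlock_exit` label instead of nested `else` branches) can be shown with a toy model. This is only a sketch under stated assumptions: `toy_container`, `toy_lock()`, `toy_unlock()`, `toy_attach()` and `toy_detach()` are hypothetical, an integer flag stands in for the mutex, and taking or releasing table ownership is reduced to a pointer assignment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tce_container. */
struct toy_container {
	int locked;    /* stand-in for a mutex: 1 while held */
	bool enabled;
	void *tbl;     /* attached table, NULL when none */
};

static void toy_lock(struct toy_container *c)
{
	assert(!c->locked);
	c->locked = 1;
}

static void toy_unlock(struct toy_container *c)
{
	assert(c->locked);
	c->locked = 0;
}

static int toy_attach(struct toy_container *c, void *tbl)
{
	int ret = 0;

	toy_lock(c);

	if (c->tbl) {           /* a group is already attached */
		ret = -EBUSY;
		goto unlock_exit;
	}

	if (c->enabled) {       /* cannot attach to an enabled container */
		ret = -EBUSY;
		goto unlock_exit;
	}

	c->tbl = tbl;           /* the patch takes ownership of the table here */

unlock_exit:
	toy_unlock(c);
	return ret;
}

static void toy_detach(struct toy_container *c, void *tbl)
{
	toy_lock(c);

	if (tbl != c->tbl)      /* not the attached group: nothing to undo */
		goto unlock_exit;

	if (c->enabled)         /* forced disable, not an error */
		c->enabled = false;

	c->tbl = NULL;

unlock_exit:
	toy_unlock(c);
}
```

As in the thread, only the attach path can return `-EBUSY`; detaching from an enabled container disables it and carries on, and every path leaves the lock released.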
Re: [PATCH kernel v9 12/32] powerpc/spapr: vfio: Switch from iommu_table to new iommu_table_group
On 04/29/2015 12:49 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:36PM +1000, Alexey Kardashevskiy wrote:

Modern IBM POWERPC systems support multiple (currently two) TCE tables per IOMMU group (a.k.a. PE). This adds an iommu_table_group container for TCE tables. Right now just one table is supported.

For P5IOC2 and IODA, iommu_table_group is embedded into the PE struct (pnv_ioda_pe and pnv_phb) and does not require iommu_free_table(), only iommu_reset_table().

For pSeries, this replaces multiple calls of kzalloc_node() with a new iommu_pseries_group_alloc() helper and stores the table group struct pointer into the pci_dn struct. For release, an iommu_table_group_free() helper is added.

This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson

I'm not particularly fond of the "table_group" name, but I can't really think of a better name for now.

So, I asked Ben again. iommu_state is not much better either. I'd stick to iommu_table_group.

Reviewed-by: David Gibson

-- Alexey
Re: [PATCH kernel v9 15/32] powerpc/powernv/ioda/ioda2: Rework TCE invalidation in tce_build()/tce_free()
On 04/29/2015 01:18 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:39PM +1000, Alexey Kardashevskiy wrote:

The pnv_pci_ioda_tce_invalidate() helper invalidates the TCE cache. It is supposed to be called on IODA1/2 and not called on p5ioc2. It receives start and end host addresses of the TCE table.

IODA2 actually needs PCI addresses to invalidate the cache. Those can be calculated from host addresses but since we are going to implement multi-level TCE tables, calculating a PCI address from a host address might get either tricky or ugly as the TCE table remains flat on the PCI bus but not in RAM.

This moves pnv_pci_ioda_tce_invalidate() from generic pnv_tce_build/pnv_tce_free and defines IODA1/2-specific callbacks which call generic ones and do PHB-model-specific TCE cache invalidation. P5IOC2 keeps using generic callbacks as before.

This changes pnv_pci_ioda2_tce_invalidate() to receive a TCE index and a number of pages, which are PCI addresses shifted by the IOMMU page shift.

No change in behaviour is expected.

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v9:
* removed confusing comment from commit log about unintentional calling of pnv_pci_ioda_tce_invalidate()
* moved mechanical changes away to "powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table"
* fixed bug with broken invalidation in pnv_pci_ioda2_tce_invalidate - @index includes @tbl->it_offset but old code added it anyway which later broke DDW
---
arch/powerpc/platforms/powernv/pci-ioda.c | 86 +-- arch/powerpc/platforms/powernv/pci.c | 17 ++ 2 files changed, 64 insertions(+), 39 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 718d5cc..f070c44 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1665,18 +1665,20 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, } } -static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, -struct iommu_table *tbl, -__be64 *startp, __be64 *endp, bool rm) +static void pnv_pci_ioda1_tce_invalidate(struct iommu_table *tbl, + unsigned long index, unsigned long npages, bool rm) { + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group, + struct pnv_ioda_pe, table_group); __be64 __iomem *invalidate = rm ? (__be64 __iomem *)pe->tce_inval_reg_phys : (__be64 __iomem *)tbl->it_index; unsigned long start, end, inc; const unsigned shift = tbl->it_page_shift; - start = __pa(startp); - end = __pa(endp); + start = __pa((__be64 *)tbl->it_base + index - tbl->it_offset); + end = __pa((__be64 *)tbl->it_base + index - tbl->it_offset + + npages - 1);

This doesn't look right. The arguments to __pa don't appear to be addresses (since index and it_offset are in units of (TCE) pages, not bytes).

tbl->it_base is an address and it is cast to __be64*, which means: (char*)tbl->it_base + (index - tbl->it_offset)*sizeof(__be64). That seems to be correct (I just removed extra parentheses compared to the old code), no?
/* BML uses this case for p6/p7/galaxy2: Shift addr and put in node */ if (tbl->it_busno) { @@ -1712,16 +1714,40 @@ static void pnv_pci_ioda1_tce_invalidate(struct pnv_ioda_pe *pe, */ } +static int pnv_ioda1_tce_build(struct iommu_table *tbl, long index, + long npages, unsigned long uaddr, + enum dma_data_direction direction, + struct dma_attrs *attrs) +{ + long ret = pnv_tce_build(tbl, index, npages, uaddr, direction, + attrs); + + if (!ret && (tbl->it_type & TCE_PCI_SWINV_CREATE)) + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false); + + return ret; +} + +static void pnv_ioda1_tce_free(struct iommu_table *tbl, long index, + long npages) +{ + pnv_tce_free(tbl, index, npages); + + if (tbl->it_type & TCE_PCI_SWINV_FREE) + pnv_pci_ioda1_tce_invalidate(tbl, index, npages, false); +} + static struct iommu_table_ops pnv_ioda1_iommu_ops = { - .set = pnv_tce_build, - .clear = pnv_tce_free, + .set = pnv_ioda1_tce_build, + .clear = pnv_ioda1_tce_free, .get = pnv_tce_get, }; -static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, -struct iommu_table *tbl, -__be64 *startp, __be64 *endp, bool rm) +static void pnv_pci_ioda2_tce_invalidate(struct iommu_table *tbl, + unsigned long index, unsigned long npages, bool rm) { + struct pnv_ioda_pe *pe = container_of(tbl->it_table_group, + str
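The question about the __pa() arguments in the message above comes down to C pointer arithmetic: once the base is cast to a pointer to an 8-byte type, adding an index advances by index * 8 bytes. A small userspace sketch of that point follows, assuming a hypothetical trimmed-down struct table and uint64_t in place of __be64; it does not model __pa() or any kernel type.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct iommu_table. */
struct table {
	uintptr_t it_base;		/* address of the first TCE */
	unsigned long it_offset;	/* bus index of the first TCE */
};

/* Address of the TCE for bus index @index; each TCE is 8 bytes. */
static uintptr_t tce_addr(const struct table *tbl, unsigned long index)
{
	assert(index >= tbl->it_offset);
	/* Casting first makes "+" count in uint64_t units, not bytes. */
	return (uintptr_t)((uint64_t *)tbl->it_base +
			   (index - tbl->it_offset));
}
```

With it_offset = 4, index 6 resolves to the third element of the backing array, 16 bytes past it_base.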
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On 04/30/2015 05:22 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:

At the moment only one group per container is supported. POWER8 CPUs have a more flexible design and allow having 2 TCE tables per IOMMU group so we can relax this limitation and support multiple groups per container.

It's not obvious why allowing multiple TCE tables per PE has any bearing on allowing multiple groups per container.

This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 outcomes:
1. reusing the same IOMMU table for multiple groups - patch 31;
2. allowing dynamic create/remove of IOMMU tables - patch 32.

I can remove this one from the patchset and post it separately later but since 1..30 aim to support both 1) and 2), I think I had better keep them all together (might explain some of the changes I do in 1..30).

This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
---
Changes:
v7:
* updated doc
---
Documentation/vfio.txt | 8 +- drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++-- 2 files changed, 199 insertions(+), 77 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 94328c8..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics:
-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
(PE is often a PCI domain but not always).

I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space.

Sorry, I am not following you here. By duplicating put_tce, I can have multiple IOMMU groups on the same virtual PHB in QEMU; "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container" does this, and the address ranges will be the same.

What I cannot do on p5ioc2 is programming the same table to multiple physical PHBs (or I could but it is very different than IODA2 and pretty ugly and might not always be possible because I would have to allocate these pages from some common pool and face problems like fragmentation).

+Newer systems (POWER8 with IODA2) have improved hardware design which allows
+to remove this limitation and have multiple IOMMU groups per a VFIO container.
2) The hardware supports so called DMA windows - the PCI address range within which DMA transfer is allowed, any attempt to access address space diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index a7d6729..970e3a2 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -82,6 +82,11 @@ static void decrement_locked_vm(long npages) * into DMA'ble space using the IOMMU */ +struct tce_iommu_group { + struct list_head next; + struct iommu_group *grp; +}; + /* * The container descriptor supports only a single group per container. * Required by the API as the container is not supplied with the IOMMU group @@ -89,10 +94,11 @@ static void decrement_locked_vm(long npages) */ struct tce_container { struct mutex lock; - struct iommu_group *grp; bool enabled; unsigned long locked_pages; bool v2; + struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; Hrm, so here we have more copies of the full iommu_table structures, which again muddies the lifetime. The table_group pointer is presumably meaningless in these copies, which seems dangerously confusing. Ouch. This is bad. No, table_group is not pointless here as it is used to get to the PE number to invalidate TCE cache. I just realized although I need to update just a single table, I still have to invalidate TCE cache for every attached group/PE so I need a list of iommu_table_group's here, not a single pointer... + struct list_head group_list; }; static long tce_unregister_pages(struct tce_container *container, @@ -154,20 +160,20 @@ static bool tce_page_is_contained(struct page *page, unsigned page_shift) return (PAGE_SHIFT + compound_order(compound_head(page))) >= page_shift; } +static inline bool tce_groups_attached(struct tce_container *container) +{ + return !list_empty(
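The tce_page_is_contained() check visible in the diff above compares the size of the backing page (base page shift plus compound order) with the IOMMU page size. A userspace model of just that comparison is sketched below; the function and parameter names are illustrative, and the kernel's struct page, compound_order() and PAGE_SHIFT are replaced by plain integers.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the comparison in tce_page_is_contained(): the
 * backing page (base page shift plus compound order) must be at least
 * as large as one IOMMU page. Parameter names are illustrative.
 */
static bool page_is_contained(unsigned int base_page_shift,
			      unsigned int compound_order,
			      unsigned int iommu_page_shift)
{
	assert(base_page_shift + compound_order < 64);
	return (base_page_shift + compound_order) >= iommu_page_shift;
}
```

For example, a single 64K system page (shift 16, order 0) covers a 4K IOMMU page (shift 12) but not a 16M one (shift 24); a compound page of order 8 built from 64K pages does cover 16M.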
Re: [PATCH kernel v9 23/32] powerpc/powernv/ioda: Define and implement DMA table/window management callbacks
On 04/30/2015 02:37 PM, David Gibson wrote:
On Wed, Apr 29, 2015 at 07:44:20PM +1000, Alexey Kardashevskiy wrote:
On 04/29/2015 03:30 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:47PM +1000, Alexey Kardashevskiy wrote:

This extends iommu_table_group_ops by a set of callbacks to support dynamic DMA windows management.

create_table() creates a TCE table with specific parameters. It receives iommu_table_group to know nodeid in order to allocate TCE table memory closer to the PHB. The exact format of allocated multi-level table might be also specific to the PHB model (not the case now though). This callback calculates the DMA window offset on a PCI bus from @num and stores it in the just created table.

set_window() sets the window at specified TVT index + @num on PHB.

unset_window() unsets the window from specified TVT.

This adds a free() callback to iommu_table_ops to free the memory (potentially a tree of tables) allocated for the TCE table.

Doesn't the free callback belong with the previous patch introducing multi-level tables?

If I did that, you would say "why is it here if nothing calls it" on "multilevel" patch and "I see the allocation but I do not see memory release" ;)

Yeah, fair enough ;)

I need some rule of thumb here. I think it is a bit cleaner if the same patch adds a callback for memory allocation and its counterpart, no?

On further consideration, yes, I think you're right.

create_table() and free() are supposed to be called once per VFIO container and set_window()/unset_window() are supposed to be called for every group in a container.

This adds IOMMU capabilities to iommu_table_group such as default 32bit window parameters and others.
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 19 arch/powerpc/platforms/powernv/pci-ioda.c | 75 ++--- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 12 +++-- 3 files changed, 96 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 0f50ee2..7694546 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -70,6 +70,7 @@ struct iommu_table_ops { /* get() returns a physical address */ unsigned long (*get)(struct iommu_table *tbl, long index); void (*flush)(struct iommu_table *tbl); + void (*free)(struct iommu_table *tbl); }; /* These are used by VIO */ @@ -148,6 +149,17 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + long (*create_table)(struct iommu_table_group *table_group, + int num, + __u32 page_shift, + __u64 window_size, + __u32 levels, + struct iommu_table *tbl); + long (*set_window)(struct iommu_table_group *table_group, + int num, + struct iommu_table *tblnew); + long (*unset_window)(struct iommu_table_group *table_group, + int num); /* * Switches ownership from the kernel itself to an external * user. While onwership is taken, the kernel cannot use IOMMU itself. @@ -160,6 +172,13 @@ struct iommu_table_group { #ifdef CONFIG_IOMMU_API struct iommu_group *group; #endif + /* Some key properties of IOMMU */ + __u32 tce32_start; + __u32 tce32_size; + __u64 pgsizes; /* Bitmap of supported page sizes */ + __u32 max_dynamic_windows_supported; + __u32 max_levels; With this information, table_group seems even more like a bad name. "iommu_state" maybe? Please, no. We will never come to agreement then :( And "iommu_state" is too general anyway, it won't pass. 
struct iommu_table tables[IOMMU_TABLE_GROUP_MAX_TABLES]; struct iommu_table_group_ops *ops; }; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index cc1d09c..4828837 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include @@ -1846,6 +1847,7 @@ static struct iommu_table_ops pnv_ioda2_iommu_ops = { #endif .clear = pnv_ioda2_tce_free, .get = pnv_tce_get, + .free = pnv_pci_free_table, }; static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, @@ -1936,6 +1938,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, TCE_PCI_SWINV_PAIR); tbl->it_ops = &pnv_ioda1_iommu_ops; + pe->table_group.tce32_start = tbl->it_offset << tbl->it_page_shift; + pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
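The two assignments at the end of the diff above derive the default 32-bit window's bus start and length from the table's first index, entry count and IOMMU page shift. The arithmetic can be sketched in userspace as below; window_start() and window_size() are hypothetical helper names, and 64-bit integers are used so the intermediate values cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Bus address where a window starts: first TCE index times IOMMU page size. */
static uint64_t window_start(uint64_t it_offset, unsigned int it_page_shift)
{
	assert(it_page_shift < 64);
	return it_offset << it_page_shift;
}

/* Window length in bytes: number of TCEs times IOMMU page size. */
static uint64_t window_size(uint64_t it_size, unsigned int it_page_shift)
{
	assert(it_page_shift < 64);
	return it_size << it_page_shift;
}
```

A table of 2^19 entries with 4K IOMMU pages gives a 2 GB window (0x80000000), which still fits in a 32-bit field such as tce32_size; 2^20 entries at the same page size would not.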
Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table
On 04/29/2015 04:31 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote: In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * fixed code flow in error cases added in v8 v8: * added ENOMEM on failed vzalloc() --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 18 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 22 -- 3 files changed, 44 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7694546..1472de3 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -111,9 +111,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_table_group; + unsigned long *it_userspace; /* userspace view of the table */ A single unsigned long doesn't seem like enough. Why single? This is an array. How do you know which process's address space this address refers to? It is a current task. Multiple userspaces cannot use the same container/tables. struct iommu_table_ops *it_ops; }; +#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ + ((tbl)->it_userspace ? 
\ + &((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \ + NULL) + /* Pure 2^n version of get_order */ static inline __attribute_const__ int get_iommu_order(unsigned long size, struct iommu_table *tbl) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 2eaba0c..74a3f52 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include #include @@ -739,6 +740,8 @@ void iommu_reset_table(struct iommu_table *tbl, const char *node_name) free_pages((unsigned long) tbl->it_map, order); } + WARN_ON(tbl->it_userspace); + memset(tbl, 0, sizeof(*tbl)); } @@ -1016,6 +1019,7 @@ int iommu_take_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; int ret = 0; + unsigned long *uas; /* * VFIO does not control TCE entries allocation and the guest @@ -1027,6 +1031,10 @@ int iommu_take_ownership(struct iommu_table *tbl) if (!tbl->it_ops->exchange) return -EINVAL; + uas = vzalloc(sizeof(*uas) * tbl->it_size); + if (!uas) + return -ENOMEM; + spin_lock_irqsave(&tbl->large_pool.lock, flags); for (i = 0; i < tbl->nr_pools; i++) spin_lock(&tbl->pools[i].lock); @@ -1044,6 +1052,13 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl->it_map, 0xff, sz); } + if (ret) { + vfree(uas); + } else { + BUG_ON(tbl->it_userspace); + tbl->it_userspace = uas; + } + for (i = 0; i < tbl->nr_pools; i++) spin_unlock(&tbl->pools[i].lock); spin_unlock_irqrestore(&tbl->large_pool.lock, flags); @@ -1056,6 +1071,9 @@ void iommu_release_ownership(struct iommu_table *tbl) { unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + vfree(tbl->it_userspace); + tbl->it_userspace = NULL; + spin_lock_irqsave(&tbl->large_pool.lock, flags); for (i = 0; i < tbl->nr_pools; i++) spin_lock(&tbl->pools[i].lock); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 45bc131..e0be556 100644 --- 
a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -25,6 +25,7 @@ #include #include #include +#include #include #include @@ -1827,6 +1828,14 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, long index, pnv_pci_ioda2_tce_invalidate(tbl, index, npages, false); } +void pnv_pci_ioda2_free_table(struct iommu_table *tbl) +{ + vfree(tbl->it_userspace); + tbl->it_userspace = NULL; + + pnv_pci_free_table(tbl); +} + static struct iommu_table_ops pnv_ioda2_iommu_ops
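The IOMMU_TABLE_USERSPACE_ENTRY() macro introduced in this patch indexes the it_userspace array relative to it_offset and yields NULL when the array has not been allocated. A userspace sketch of how such a lookup behaves follows; struct table, USERSPACE_ENTRY and remember_ua() are hypothetical stand-ins written for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct iommu_table. */
struct table {
	unsigned long it_offset;	/* index of the first TCE */
	unsigned long it_size;		/* number of TCEs */
	unsigned long *it_userspace;	/* one userspace address per TCE, or NULL */
};

/* Same shape as the macro in the patch, applied to the stand-in struct. */
#define USERSPACE_ENTRY(tbl, entry) \
	((tbl)->it_userspace ? \
		&((tbl)->it_userspace[(entry) - (tbl)->it_offset]) : \
		NULL)

/* Record the userspace address behind TCE @entry; returns 0 or -1. */
static int remember_ua(struct table *tbl, unsigned long entry,
		       unsigned long ua)
{
	unsigned long *pua;

	assert(entry >= tbl->it_offset &&
	       entry - tbl->it_offset < tbl->it_size);
	pua = USERSPACE_ENTRY(tbl, entry);
	if (!pua)
		return -1;	/* no userspace view allocated, nothing to track */
	*pua = ua;
	return 0;
}
```

The subtraction of it_offset is what lets callers pass the same index they use for the TCE table itself.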
Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
On 04/29/2015 04:40 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote: This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_create_table() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * reimplemented the whole patch --- arch/powerpc/include/asm/iommu.h | 5 + arch/powerpc/platforms/powernv/pci-ioda.c | 14 arch/powerpc/platforms/powernv/pci.c | 36 +++ arch/powerpc/platforms/powernv/pci.h | 2 ++ 4 files changed, 57 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 1472de3..9844c106 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -99,6 +99,7 @@ struct iommu_table { unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_indirect_levels; unsigned long it_level_size; + unsigned long it_allocated_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + unsigned long (*get_table_size)( + __u32 page_shift, + __u64 window_size, + __u32 levels); long (*create_table)(struct iommu_table_group *table_group, int num, __u32 page_shift, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index e0be556..7f548b4 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2062,6 +2062,18 @@ static void 
pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, } #ifdef CONFIG_IOMMU_API +static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift, + __u64 window_size, __u32 levels) +{ + unsigned long ret = pnv_get_table_size(page_shift, window_size, levels); + + if (!ret) + return ret; + + /* Add size of it_userspace */ + return ret + (window_size >> page_shift) * sizeof(unsigned long); This doesn't make much sense. The userspace view can't possibly be a property of the specific low-level IOMMU model. This it_userspace thing is all about memory preregistration. I need some way to track how many actual mappings the mm_iommu_table_group_mem_t has in order to decide whether to allow unregistering or not. When I clear TCE, I can read the old value which is host physical address which I cannot use to find the preregistered region and adjust the mappings counter; I can only use userspace addresses for this (not even guest physical addresses as it is VFIO and probably no KVM). So I have to keep userspace addresses somewhere, one per IOMMU page, and the iommu_table seems a natural place for this. 
+} + static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, int num, __u32 page_shift, __u64 window_size, __u32 levels, struct iommu_table *tbl) @@ -2086,6 +2098,7 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, BUG_ON(tbl->it_userspace); tbl->it_userspace = uas; + tbl->it_allocated_size += uas_cb; tbl->it_ops = &pnv_ioda2_iommu_ops; if (pe->tce_inval_reg) tbl->it_type |= (TCE_PCI_SWINV_CREATE | TCE_PCI_SWINV_FREE); @@ -2160,6 +2173,7 @@ static void pnv_ioda2_release_ownership(struct iommu_table_group *table_group) } static struct iommu_table_group_ops pnv_pci_ioda2_ops = { + .get_table_size = pnv_pci_ioda2_get_table_size, .create_table = pnv_pci_ioda2_create_table, .set_window = pnv_pci_ioda2_set_window, .unset_window = pnv_pci_ioda2_unset_window, diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index fc129c4..1b5b48a 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -662,6 +662,38 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl, tbl->it_type = TCE_PCI; } +unsigned long pnv_get_table_size(__u32 page_shift, + __u64 window_size, __u32 levels) +{ + unsigned long bytes = 0; + const unsigned window_shift = ilog2(window_size); + unsigned entries_shift = window_shift -
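The extra term added in pnv_pci_ioda2_get_table_size() above is one unsigned long per IOMMU page in the window. A userspace sketch of that term alone is given below; userspace_view_bytes() is a hypothetical name, and uint64_t is used on the assumption that unsigned long is 64 bits wide, as on ppc64. The TCE table's own size, computed by pnv_get_table_size(), is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for the userspace view alone: one 64-bit address per
 * IOMMU page in the window. Assumes a 64-bit unsigned long.
 */
static uint64_t userspace_view_bytes(unsigned int page_shift,
				     uint64_t window_size)
{
	assert(page_shift < 64);
	return (window_size >> page_shift) * sizeof(uint64_t);
}
```

For a 2 GB window this is 4 MB with 4K IOMMU pages and 256 KB with 64K pages, which is the amount that would be added to the locked_vm accounting for the userspace view.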
Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2
On 04/30/2015 04:55 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would require additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm.

New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires 1) guest pages to be registered first 2) consequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work.

The accounting is done per the user process.

This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/
v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioctls return -ENOTTY instead of -EPERM
v6:
* tce_get_hva_cached() returns hva via a pointer
v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory
---
Documentation/vfio.txt | 23 drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++- include/uapi/linux/vfio.h | 27 + 3 files changed, 274 insertions(+), 6 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 96978ec..94328c8 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).

A summary of the semantic differences between v1 and v2 would be nice. At this point it's not really clear to me if there's a case for creating v2, or if this could just be done by adding (optional) functionality to v1.

v1: memory preregistration is not supported; explicit enable/disable ioctls are required.
v2: memory preregistration is required; explicit enable/disable are prohibited (as they are not needed).

Mixing these in one IOMMU type caused a lot of problems, such as: should I increment locked_vm by the 32bit window size on enable() or not; what do I do about page pinning on map/unmap (check if it is from registered memory and do not pin?).

Having 2 IOMMU models makes everything a lot simpler.

+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and the handling of those includes pinning/unpinning pages and updating
+mm::locked_vm counter to make sure we do not exceed the rlimit.
+The v2 IOMMU splits accounting and pinning into separate operations:
+
+- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
+receive a user space address and size of the block to be pinned.
+Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
+be called with the exact address and size used for registering
+the memory block. The userspace is not expected to call these often.
+The ranges are stored in a linked list in a VFIO container.
+
+- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
+IOMMU table and do not do pinning; instead these check that the userspace
+address is from pre-registered range
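The v2 semantics described in the documentation text above (register a block once, unregister only with the exact same address and size, and have map requests check that the address lies in a registered range) can be modelled in userspace roughly as follows. Everything here is a hypothetical sketch: struct mem_region, the three helpers and the -ENOENT error code are assumptions made for illustration, and no pinning or locked_vm accounting is performed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model of one pre-registered region; not the kernel struct. */
struct mem_region {
	struct mem_region *next;
	uint64_t ua;	/* userspace address of the block */
	uint64_t size;	/* length in bytes */
};

/* Register a block; a real implementation would also pin and account pages. */
static int region_register(struct mem_region **head, uint64_t ua,
			   uint64_t size)
{
	struct mem_region *r;

	assert(size > 0);
	r = malloc(sizeof(*r));
	if (!r)
		return -ENOMEM;
	r->ua = ua;
	r->size = size;
	r->next = *head;
	*head = r;
	return 0;
}

/* Unregister only on an exact (address, size) match; no bisecting. */
static int region_unregister(struct mem_region **head, uint64_t ua,
			     uint64_t size)
{
	for (struct mem_region **p = head; *p; p = &(*p)->next) {
		if ((*p)->ua == ua && (*p)->size == size) {
			struct mem_region *dead = *p;

			*p = dead->next;
			free(dead);
			return 0;
		}
	}
	return -ENOENT;	/* assumed error code for "no such region" */
}

/* Map-time check: the whole range must lie inside one registered region. */
static bool region_covers(const struct mem_region *head, uint64_t ua,
			  uint64_t size)
{
	for (; head; head = head->next)
		if (ua >= head->ua && size <= head->size &&
		    ua - head->ua <= head->size - size)
			return true;
	return false;
}
```

The containment test is written with subtractions so that ua + size is never computed and cannot wrap around.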
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On 05/01/2015 02:33 PM, David Gibson wrote: On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote: On 04/30/2015 05:22 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote: At the moment only one group per container is supported. POWER8 CPUs have more flexible design and allows naving 2 TCE tables per IOMMU group so we can relax this limitation and support multiple groups per container. It's not obvious why allowing multiple TCE tables per PE has any pearing on allowing multiple groups per container. This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 outcomes: 1. reusing the same IOMMU table for multiple groups - patch 31; 2. allowing dynamic create/remove of IOMMU tables - patch 32. I can remove this one from the patchset and post it separately later but since 1..30 aim to support both 1) and 2), I'd think I better keep them all together (might explain some of changes I do in 1..30). The combined patchset is fine. My comment is because your commit message says that multiple groups are possible *because* 2 TCE tables per group are allowed, and it's not at all clear why one follows from the other. Ah. That's wrong indeed, I'll fix it. This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups. 
Signed-off-by: Alexey Kardashevskiy [aw: for the vfio related changes] Acked-by: Alex Williamson --- Changes: v7: * updated doc --- Documentation/vfio.txt | 8 +- drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++-- 2 files changed, 199 insertions(+), 77 deletions(-) diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt index 94328c8..7dcf2b5 100644 --- a/Documentation/vfio.txt +++ b/Documentation/vfio.txt @@ -289,10 +289,12 @@ PPC64 sPAPR implementation note This implementation has some specifics: -1) Only one IOMMU group per container is supported as an IOMMU group -represents the minimal entity which isolation can be guaranteed for and -groups are allocated statically, one per a Partitionable Endpoint (PE) +1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per +container is supported as an IOMMU table is allocated at the boot time, +one table per a IOMMU group which is a Partitionable Endpoint (PE) (PE is often a PCI domain but not always). I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space. Sorry, I am not following you here. By duplicating put_tce, I can have multiple IOMMU groups on the same virtual PHB in QEMU, "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container" does this, the address ranges will the same. Oh, ok. For some reason I thought that (at least on the older machines) the different PEs used different and not easily changeable DMA windows in bus addresses space. They do use different tables (which VFIO does not get to remove/create and uses these old helpers - iommu_take/release_ownership), correct. But all these windows are mapped at zero on a PE's PCI bus and nothing prevents me from updating all these tables with the same TCE values when handling H_PUT_TCE. Yes it is slow but it works (bit more details below). 
What I cannot do on p5ioc2 is program the same table to multiple physical PHBs (or I could but it is very different than IODA2 and pretty ugly and might not always be possible because I would have to allocate these pages from some common pool and face problems like fragmentation).

So allowing multiple groups per container should be possible (at the kernel rather than qemu level) by writing the same value to multiple TCE tables. I guess it's not worth doing for just the almost-obsolete IOMMUs though.

It is done at QEMU level though. As it works now, QEMU opens a group, walks through all existing containers and tries attaching a new group there. If it succeeded (x86 always; POWER8 after this patch), a TCE table is shared. If it failed, QEMU creates another container, attaches it to the same VFIO/PHB address space and attaches a group there. Then the only thing left is repeating ioctl() in vfio_container_ioctl() for every container in the VFIO address space; this is what that QEMU patch does (the first version of that patch called ioctl() only for the first container in the address space). From the kernel perspective there are 2 isolated containers; I'd like to keep it this way.

btw thanks for the detailed review :)

-- Alexey
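Editorial sketch: the "write the same TCE value to several tables" idea from this exchange can be illustrated in plain user-space C. This is not the kernel or QEMU code; the struct, the table size and the function name (`sketch_tce_table`, `SKETCH_MAX_ENTRIES`, `sketch_put_tce_all`) are hypothetical stand-ins, and a real H_PUT_TCE handler would also deal with permissions, pinning and cache invalidation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ENTRIES 16	/* hypothetical, tiny table for illustration */

/* Hypothetical stand-in for one per-group TCE table. */
struct sketch_tce_table {
	uint64_t entries[SKETCH_MAX_ENTRIES];
};

/*
 * Mirror one TCE update into every table so that all groups see the
 * same bus-address-to-page mapping.
 * Returns 0 on success, -1 if the index is out of range (nothing written).
 */
static int sketch_put_tce_all(struct sketch_tce_table **tables, size_t ntables,
			      size_t index, uint64_t tce)
{
	size_t i;

	assert(tables != NULL || ntables == 0);
	if (index >= SKETCH_MAX_ENTRIES)
		return -1;

	for (i = 0; i < ntables; i++)
		tables[i]->entries[index] = tce;
	return 0;
}
```

The cost is one write per table per request, which matches the "slow but it works" remark in the thread.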
Re: [PATCH kernel v9 29/32] vfio: powerpc/spapr: Register memory and define IOMMU v2
On 05/01/2015 03:23 PM, David Gibson wrote: On Fri, May 01, 2015 at 02:35:23PM +1000, Alexey Kardashevskiy wrote: On 04/30/2015 04:55 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:53PM +1000, Alexey Kardashevskiy wrote:

The existing implementation accounts the whole DMA window in the locked_vm counter. This is going to be worse with multiple containers and huge DMA windows. Also, real-time accounting would require additional tracking of accounted pages due to the page size difference - IOMMU uses 4K pages and system uses 4K or 64K pages.

Another issue is that actual pages pinning/unpinning happens on every DMA map/unmap request. This does not affect the performance much now as we spend way too much time now on switching context between guest/userspace/host but this will start to matter when we add in-kernel DMA map/unmap acceleration.

This introduces a new IOMMU type for SPAPR - VFIO_SPAPR_TCE_v2_IOMMU. New IOMMU deprecates VFIO_IOMMU_ENABLE/VFIO_IOMMU_DISABLE and introduces 2 new ioctls to register/unregister DMA memory - VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY - which receive user space address and size of a memory region which needs to be pinned/unpinned and counted in locked_vm. New IOMMU splits physical pages pinning and TCE table update into 2 different operations. It requires 1) guest pages to be registered first 2) subsequent map/unmap requests to work only with pre-registered memory. For the default single window case this means that the entire guest (instead of 2GB) needs to be pinned before using VFIO. When a huge DMA window is added, no additional pinning will be required, otherwise it would be guest RAM + 2GB.

The new memory registration ioctls are not supported by VFIO_SPAPR_TCE_IOMMU. Dynamic DMA window and in-kernel acceleration will require memory to be preregistered in order to work.

The accounting is done per the user process.
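Editorial sketch: the whole-window accounting described above can be worked through with the expression used in tce_iommu_enable() earlier in this series, `(it_size << it_page_shift) >> PAGE_SHIFT`. The helper name below is hypothetical and the code is ordinary user-space C, intended only to show the arithmetic for a 2GB window with 4K IOMMU pages.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of system pages charged to locked_vm when a whole DMA window
 * is accounted at once: window bytes divided by the system page size.
 *   it_size       - number of TCE entries in the window
 *   it_page_shift - log2 of the IOMMU page size
 *   page_shift    - log2 of the system page size
 */
static uint64_t sketch_window_locked_pages(uint64_t it_size,
					   unsigned it_page_shift,
					   unsigned page_shift)
{
	assert(it_page_shift < 64 && page_shift < 64);
	return (it_size << it_page_shift) >> page_shift;
}
```

For a 2GB window with 4K IOMMU pages there are 1 << 19 entries; on a 64K-page kernel this charges 32768 system pages, on a 4K-page kernel 524288.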
This advertises v2 SPAPR TCE IOMMU and restricts what the userspace can do with v1 or v2 IOMMUs.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
---
Changes:
v9:
* s/tce_get_hva_cached/tce_iommu_use_page_v2/
v7:
* now memory is registered per mm (i.e. process)
* moved memory registration code to powerpc/mmu
* merged "vfio: powerpc/spapr: Define v2 IOMMU" into this
* limited new ioctls to v2 IOMMU
* updated doc
* unsupported ioctls return -ENOTTY instead of -EPERM
v6:
* tce_get_hva_cached() returns hva via a pointer
v4:
* updated docs
* s/kzmalloc/vzalloc/
* in tce_pin_pages()/tce_unpin_pages() removed @vaddr, @size and replaced offset with index
* renamed vfio_iommu_type_register_memory to vfio_iommu_spapr_register_memory and removed duplicating vfio_iommu_spapr_register_memory
---
 Documentation/vfio.txt | 23
 drivers/vfio/vfio_iommu_spapr_tce.c | 230 +++-
 include/uapi/linux/vfio.h | 27 +
 3 files changed, 274 insertions(+), 6 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 96978ec..94328c8 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -427,6 +427,29 @@ The code flow from the example above should be slightly changed:
+5) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
+VFIO_IOMMU_DISABLE and implements 2 new ioctls:
+VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
+(which are unsupported in v1 IOMMU).

A summary of the semantic differences between v1 and v2 would be nice. At this point it's not really clear to me if there's a case for creating v2, or if this could just be done by adding (optional) functionality to v1.

v1: memory preregistration is not supported; explicit enable/disable ioctls are required
v2: memory preregistration is required; explicit enable/disable are prohibited (as they are not needed).
Mixing these in one IOMMU type caused a lot of problems like should I increment locked_vm by the 32bit window size on enable() or not; what do I do about pages pinning when map/unmap (check if it is from registered memory and do not pin?). Having 2 IOMMU models makes everything a lot simpler.

Ok. Would it simplify it further if you made v2 only usable on IODA2 hardware?

Very little. V2 addresses memory pinning issue which is handled the same way on ioda2 and older hardware, including KVM acceleration. Whether to enable DDW or not - this is handled just fine via extra properties in the GET_INFO ioctl(). IODA2 and others are different in handling multiple groups per container but this does not require changes to userspace API. And remember, the only machine I can use 100% of time is POWER7/P5IOC2 so it is really useful if at least some bits of the patchset can be tested there; if it was a bit less different from IODA2, I would have even implemented DDW there too :)

+PPC64 paravirtualized guests generate a lot of map/unmap requests,
+and t
Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
On 05/01/2015 03:12 PM, David Gibson wrote: On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote: On 04/29/2015 04:40 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote: This adds a way for the IOMMU user to know how much a new table will use so it can be accounted in the locked_vm limit before allocation happens. This stores the allocated table size in pnv_pci_create_table() so the locked_vm counter can be updated correctly when a table is being disposed. This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * reimplemented the whole patch --- arch/powerpc/include/asm/iommu.h | 5 + arch/powerpc/platforms/powernv/pci-ioda.c | 14 arch/powerpc/platforms/powernv/pci.c | 36 +++ arch/powerpc/platforms/powernv/pci.h | 2 ++ 4 files changed, 57 insertions(+) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 1472de3..9844c106 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -99,6 +99,7 @@ struct iommu_table { unsigned long it_size; /* Size of iommu table in entries */ unsigned long it_indirect_levels; unsigned long it_level_size; + unsigned long it_allocated_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ @@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, struct iommu_table_group; struct iommu_table_group_ops { + unsigned long (*get_table_size)( + __u32 page_shift, + __u64 window_size, + __u32 levels); long (*create_table)(struct iommu_table_group *table_group, int num, __u32 page_shift, diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index e0be556..7f548b4 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ 
b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 }
 
 #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+		__u64 window_size, __u32 levels)
+{
+	unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+	if (!ret)
+		return ret;
+
+	/* Add size of it_userspace */
+	return ret + (window_size >> page_shift) * sizeof(unsigned long);

This doesn't make much sense. The userspace view can't possibly be a property of the specific low-level IOMMU model.

This it_userspace thing is all about memory preregistration. I need some way to track how many actual mappings the mm_iommu_table_group_mem_t has in order to decide whether to allow unregistering or not. When I clear TCE, I can read the old value which is host physical address which I cannot use to find the preregistered region and adjust the mappings counter; I can only use userspace addresses for this (not even guest physical addresses as it is VFIO and probably no KVM). So I have to keep userspace addresses somewhere, one per IOMMU page, and the iommu_table seems a natural place for this.

Well.. sort of. But as noted elsewhere this pulls VFIO specific constraints into a platform code structure. And whether you get this table depends on the platform IOMMU type rather than on what VFIO wants to do with it, which doesn't make sense. What might make more sense is an opaque pointer in iommu_table for use by the table "owner" (in the take_ownership sense). The pointer would be stored in iommu_table, but VFIO is responsible for populating and managing its contents. Or you could just put the userspace mappings in the container. Although you might want a different data structure in that case.

Nope. I need this table in in-kernel acceleration to update the mappings counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only have IOMMU tables, not containers or groups.
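Editorial sketch: the extra memory accounted for the userspace view in the quoted hunk is one unsigned long per IOMMU page of the window. The user-space helpers below only reproduce that arithmetic; their names are hypothetical, and the size of the hardware table itself (what pnv_get_table_size() returns, which is not shown in this thread) is simply passed in as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for one userspace address per IOMMU page of the window. */
static uint64_t sketch_userspace_view_bytes(uint64_t window_size,
					    unsigned page_shift)
{
	assert(page_shift < 64);
	return (window_size >> page_shift) * sizeof(unsigned long);
}

/*
 * Total bytes to account: hardware table plus userspace view.
 * A zero hardware table size is treated as "invalid parameters" and
 * propagated unchanged, as in the quoted hunk.
 */
static uint64_t sketch_total_table_bytes(uint64_t hw_table_bytes,
					 uint64_t window_size,
					 unsigned page_shift)
{
	if (!hw_table_bytes)
		return 0;
	return hw_table_bytes +
		sketch_userspace_view_bytes(window_size, page_shift);
}
```

With 8-byte longs, a 2GB window of 4K pages needs 4MB for the userspace view, the same as a flat table of 8-byte TCEs for that window.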
QEMU creates a guest view of the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE tables to it via a set of ioctls (one per IOMMU group) to the VFIO KVM device.

So if I call it it_opaque (instead of it_userspace), I will still need a common place (visible to VFIO and PowerKVM) for this to put:
#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)
So far this place was arch/powerpc/include/asm/iommu.h and the iommu_table struct.

The other thing to bear in mind is that registered regions are likely to be large contiguous blocks in user addresses, though obviously not
Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table
On 05/01/2015 02:23 PM, David Gibson wrote: On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote: On 04/29/2015 04:31 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote: In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell from what region the just cleared TCE was from. This adds a userspace view of the TCE table into iommu_table struct. It contains userspace address, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken which means it is only used from outside of the powernv code (such as VFIO). Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * fixed code flow in error cases added in v8 v8: * added ENOMEM on failed vzalloc() --- arch/powerpc/include/asm/iommu.h | 6 ++ arch/powerpc/kernel/iommu.c | 18 ++ arch/powerpc/platforms/powernv/pci-ioda.c | 22 -- 3 files changed, 44 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7694546..1472de3 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -111,9 +111,15 @@ struct iommu_table { unsigned long *it_map; /* A simple allocation bitmap for now */ unsigned long it_page_shift;/* table iommu page size */ struct iommu_table_group *it_table_group; + unsigned long *it_userspace; /* userspace view of the table */ A single unsigned long doesn't seem like enough. Why single? This is an array. As in single per page. Sorry, I am not following you here. It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed with either system page or a huge page. How do you know which process's address space this address refers to? It is a current task. Multiple userspaces cannot use the same container/tables. Where is that enforced? 
It is accessed from VFIO DMA map/unmap which are ioctls() to a container's fd which is per a process. Same for KVM - when it registers IOMMU groups in KVM, fd's of opened IOMMU groups are passed there. Or I did not understand the question... More to the point, that's a VFIO constraint, but it's here affecting the design of a structure owned by the platform code. Right. But keeping in mind KVM, I cannot think of any better design here. [snip] static void pnv_pci_ioda_setup_opal_tce_kill(struct pnv_phb *phb, @@ -2062,12 +2071,21 @@ static long pnv_pci_ioda2_create_table(struct iommu_table_group *table_group, int nid = pe->phb->hose->node; __u64 bus_offset = num ? pe->tce_bypass_base : 0; long ret; + unsigned long *uas, uas_cb = sizeof(*uas) * (window_size >> page_shift); + + uas = vzalloc(uas_cb); + if (!uas) + return -ENOMEM; I don't see why this is allocated both here as well as in take_ownership. Where else? The only alternative is vfio_iommu_spapr_tce but I really do not want to touch iommu_table fields there. Well to put it another way, why isn't take_ownership calling create itself (or at least a common helper). I am trying to keep DDW stuff away from platform-oriented arch/powerpc/kernel/iommu.c which main purpose is to implement iommu_alloc()&co. It already has I'd rather move it_userspace allocation completely to vfio_iommu_spapr_tce (should have done earlier, actually), would this be ok? Clearly the it_userspace table needs to have lifetime which matches the TCE table itself, so there should be a single function that marks the beginning of that joint lifetime. No. it_userspace lives as long as the platform code does not control the table. For IODA2 it is equal for the lifetime of the table, for IODA1/P5IOC2 it is not. Isn't this function used for core-kernel users of the iommu as well, in which case it shouldn't need the it_userspace. No. 
This is an iommu_table_group_ops callback which calls what the platform code calls (pnv_pci_create_table()) plus allocates this it_userspace thing. The callback is only called from VFIO. Ok. As touched on above it seems more like this should be owned by VFIO code than the platform code. Agree now :) I'll move the allocation to VFIO. Thanks! -- Alexey
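Editorial sketch: the purpose of the userspace view discussed in this thread is to let a cleared TCE be traced back to its preregistered region so that a per-region "mapped" counter can gate unregistration. The user-space C below illustrates only that bookkeeping; the struct and function names are hypothetical, and this is not the mm_iommu_* implementation (which, per the patch description, also keeps physical addresses and a kref).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one preregistered memory region. */
struct sketch_mem_region {
	unsigned long ua;	/* userspace start address */
	unsigned long size;	/* length in bytes */
	unsigned long mapped;	/* live TCEs that point into this region */
};

static bool sketch_region_contains(const struct sketch_mem_region *mem,
				   unsigned long ua)
{
	assert(mem != NULL);
	return ua >= mem->ua && ua - mem->ua < mem->size;
}

/*
 * Called with the userspace address stored for a TCE when it is set
 * (inc == true) or cleared (inc == false).
 * Returns 0 on success, -1 if the address is outside the region or the
 * counter would underflow.
 */
static int sketch_mapped_update(struct sketch_mem_region *mem,
				unsigned long ua, bool inc)
{
	if (!sketch_region_contains(mem, ua))
		return -1;
	if (inc)
		mem->mapped++;
	else if (mem->mapped)
		mem->mapped--;
	else
		return -1;
	return 0;
}

/* Unregistration is only safe once nothing is mapped from the region. */
static bool sketch_can_unregister(const struct sketch_mem_region *mem)
{
	assert(mem != NULL);
	return mem->mapped == 0;
}
```

The host physical address read back from a cleared TCE cannot be used for this lookup, which is why the thread keeps one userspace address per IOMMU page.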
Re: [PATCH kernel v9 22/32] powerpc/powernv: Implement multilevel TCE tables
On 04/29/2015 03:04 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:46PM +1000, Alexey Kardashevskiy wrote: TCE tables might get too big in case of 4K IOMMU pages and DDW enabled on huge guests (hundreds of GB of RAM) so the kernel might be unable to allocate contiguous chunk of physical memory to store the TCE table. To address this, POWER8 CPU (actually, IODA2) supports multi-level TCE tables, up to 5 levels which splits the table into a tree of smaller subtables. This adds multi-level TCE tables support to pnv_pci_create_table() and pnv_pci_free_table() helpers. Signed-off-by: Alexey Kardashevskiy --- Changes: v9: * moved from ioda2 to common powernv pci code * fixed cleanup if allocation fails in a middle * removed check for the size - all boundary checks happen in the calling code anyway --- arch/powerpc/include/asm/iommu.h | 2 + arch/powerpc/platforms/powernv/pci-ioda.c | 15 +++-- arch/powerpc/platforms/powernv/pci.c | 94 +-- arch/powerpc/platforms/powernv/pci.h | 4 +- 4 files changed, 104 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 7e7ca0a..0f50ee2 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -96,6 +96,8 @@ struct iommu_pool { struct iommu_table { unsigned long it_busno; /* Bus number this table belongs to */ unsigned long it_size; /* Size of iommu table in entries */ + unsigned long it_indirect_levels; + unsigned long it_level_size; unsigned long it_offset;/* Offset into global table */ unsigned long it_base; /* mapped address of tce table */ unsigned long it_index; /* which iommu table this is */ diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 59baa15..cc1d09c 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1967,13 +1967,17 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, table_group); struct 
pnv_phb *phb = pe->phb; int64_t rc; + const unsigned long size = tbl->it_indirect_levels ? + tbl->it_level_size : tbl->it_size; const __u64 start_addr = tbl->it_offset << tbl->it_page_shift; const __u64 win_size = tbl->it_size << tbl->it_page_shift; pe_info(pe, "Setting up window at %llx..%llx " - "pgsize=0x%x tablesize=0x%lx\n", + "pgsize=0x%x tablesize=0x%lx " + "levels=%d levelsize=%x\n", start_addr, start_addr + win_size - 1, - 1UL << tbl->it_page_shift, tbl->it_size << 3); + 1UL << tbl->it_page_shift, tbl->it_size << 3, + tbl->it_indirect_levels + 1, tbl->it_level_size << 3); tbl->it_table_group = &pe->table_group; @@ -1984,9 +1988,9 @@ static long pnv_pci_ioda2_set_window(struct iommu_table_group *table_group, rc = opal_pci_map_pe_dma_window(phb->opal_id, pe->pe_number, pe->pe_number << 1, - 1, + tbl->it_indirect_levels + 1, __pa(tbl->it_base), - tbl->it_size << 3, + size << 3, 1ULL << tbl->it_page_shift); if (rc) { pe_err(pe, "Failed to configure TCE table, err %ld\n", rc); @@ -2099,7 +2103,8 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, phb->ioda.m32_pci_base); rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node, - 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl); + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, + POWERNV_IOMMU_DEFAULT_LEVELS, tbl); if (rc) { pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc); return; diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index 6bcfad5..fc129c4 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -46,6 +46,8 @@ #define cfg_dbg(fmt...) do { } while(0) //#define cfg_dbg(fmt...) printk(fmt) +#define ROUND_UP(x, n) (((x) + (n) - 1ULL) & ~((n) - 1ULL)) Use the existing ALIGN_UP macro instead of creating a new one. Ok. I knew it existed, it is just _ALIGN_UP (with an underscore) and PPC-only - this is why I did not find it :) #ifdef CONFIG_PCI_MSI static int pnv_setup_ms
Re: [PATCH kernel v9 20/32] powerpc/powernv/ioda2: Introduce pnv_pci_create_table/pnv_pci_free_table
On 04/29/2015 02:39 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:44PM +1000, Alexey Kardashevskiy wrote: This is a part of moving TCE table allocation into an iommu_ops callback to support multiple IOMMU groups per one VFIO container. This moves a table creation window to the file with common powernv-pci helpers as it does not do anything IODA2-specific. This adds pnv_pci_free_table() helper to release the actual TCE table. This enforces window size to be a power of two. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy Reviewed-by: David Gibson --- Changes: v9: * moved helpers to the common powernv pci.c file from pci-ioda.c * moved bits from pnv_pci_create_table() to pnv_alloc_tce_table_pages() --- arch/powerpc/platforms/powernv/pci-ioda.c | 36 ++ arch/powerpc/platforms/powernv/pci.c | 61 +++ arch/powerpc/platforms/powernv/pci.h | 4 ++ 3 files changed, 76 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a80be34..b9b3773 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1307,8 +1307,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe if (rc) pe_warn(pe, "OPAL error %ld release DMA window\n", rc); - iommu_reset_table(tbl, of_node_full_name(dev->dev.of_node)); - free_pages(addr, get_order(TCE32_TABLE_SIZE)); + pnv_pci_free_table(tbl); } static void pnv_ioda_release_vf_PE(struct pci_dev *pdev, u16 num_vfs) @@ -2039,10 +2038,7 @@ static struct iommu_table_group_ops pnv_pci_ioda2_ops = { static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe) { - struct page *tce_mem = NULL; - void *addr; struct iommu_table *tbl = &pe->table_group.tables[0]; - unsigned int tce_table_size, end; int64_t rc; /* We shouldn't already have a 32-bit DMA associated */ @@ -2053,29 +2049,20 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, /* 
The PE will reserve all possible 32-bits space */ pe->tce32_seg = 0; - end = (1 << ilog2(phb->ioda.m32_pci_base)); - tce_table_size = (end / 0x1000) * 8; pe_info(pe, "Setting up 32-bit TCE table at 0..%08x\n", - end); + phb->ioda.m32_pci_base); - /* Allocate TCE table */ - tce_mem = alloc_pages_node(phb->hose->node, GFP_KERNEL, - get_order(tce_table_size)); - if (!tce_mem) { - pe_err(pe, "Failed to allocate a 32-bit TCE memory\n"); - goto fail; + rc = pnv_pci_create_table(&pe->table_group, pe->phb->hose->node, + 0, IOMMU_PAGE_SHIFT_4K, phb->ioda.m32_pci_base, tbl); + if (rc) { + pe_err(pe, "Failed to create 32-bit TCE table, err %ld", rc); + return; } - addr = page_address(tce_mem); - memset(addr, 0, tce_table_size); - - /* Setup iommu */ - tbl->it_table_group = &pe->table_group; - - /* Setup linux iommu table */ - pnv_pci_setup_iommu_table(tbl, addr, tce_table_size, 0, - IOMMU_PAGE_SHIFT_4K); tbl->it_ops = &pnv_ioda2_iommu_ops; + + /* Setup iommu */ + tbl->it_table_group = &pe->table_group; iommu_init_table(tbl, phb->hose->node); #ifdef CONFIG_IOMMU_API pe->table_group.ops = &pnv_pci_ioda2_ops; @@ -2121,8 +2108,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, fail: if (pe->tce32_seg >= 0) pe->tce32_seg = -1; - if (tce_mem) - __free_pages(tce_mem, get_order(tce_table_size)); + pnv_pci_free_table(tbl); } static void pnv_ioda_setup_dma(struct pnv_phb *phb) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index e8802ac..6bcfad5 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -20,7 +20,9 @@ #include #include #include +#include +#include #include #include #include @@ -645,6 +647,65 @@ void pnv_pci_setup_iommu_table(struct iommu_table *tbl, tbl->it_type = TCE_PCI; } +static __be64 *pnv_alloc_tce_table_pages(int nid, unsigned shift, + unsigned long *tce_table_allocated) I'm a bit confused by the tce_table_allocated parameter. 
What's the circumstance where more memory is requested than required, and why does it matter to the caller? +{ + struct page *tce_mem = NULL; + __be64 *addr; + unsigned order = max_t(unsigned, shift, PAGE_SHIFT) - PAGE_S
Re: [PATCH kernel v9 28/32] powerpc/mmu: Add userspace-to-physical addresses translation cache
On 04/29/2015 05:01 PM, David Gibson wrote: On Sat, Apr 25, 2015 at 10:14:52PM +1000, Alexey Kardashevskiy wrote: We are adding support for DMA memory pre-registration to be used in conjunction with VFIO. The idea is that the userspace which is going to run a guest may want to pre-register a user space memory region so it all gets pinned once and never goes away. Having this done, a hypervisor will not have to pin/unpin pages on every DMA map/unmap request. This is going to help with multiple pinning of the same memory and in-kernel acceleration of DMA requests. This adds a list of memory regions to mm_context_t. Each region consists of a header and a list of physical addresses. This adds API to: 1. register/unregister memory regions; 2. do final cleanup (which puts all pre-registered pages); 3. do userspace to physical address translation; 4. manage a mapped pages counter; when it is zero, it is safe to unregister the region. Multiple registration of the same region is allowed, kref is used to track the number of registrations. Signed-off-by: Alexey Kardashevskiy --- Changes: v8: * s/mm_iommu_table_group_mem_t/struct mm_iommu_table_group_mem_t/ * fixed error fallback look (s/[i]/[j]/) --- arch/powerpc/include/asm/mmu-hash64.h | 3 + arch/powerpc/include/asm/mmu_context.h | 17 +++ arch/powerpc/mm/Makefile | 1 + arch/powerpc/mm/mmu_context_hash64.c | 6 + arch/powerpc/mm/mmu_context_hash64_iommu.c | 215 + 5 files changed, 242 insertions(+) create mode 100644 arch/powerpc/mm/mmu_context_hash64_iommu.c diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h index 1da6a81..a82f534 100644 --- a/arch/powerpc/include/asm/mmu-hash64.h +++ b/arch/powerpc/include/asm/mmu-hash64.h @@ -536,6 +536,9 @@ typedef struct { /* for 4K PTE fragment support */ void *pte_frag; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + struct list_head iommu_group_mem_list; +#endif Urgh. 
I know I'm not one to talk, having done the hugepage crap in there, but man mm_context_t has grown to a bloated mess from originally being just intended as a context ID integer :/.

Where else to put it then?... The other way to go would be some global map of pid<->iommu_group_mem_list which needs to be available from both VFIO and KVM.

 } mm_context_t;
diff --git a/arch/powerpc/include/asm/mmu_context.h b/arch/powerpc/include/asm/mmu_context.h
index 73382eb..d6116ca 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -16,6 +16,23 @@
  */
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
 extern void destroy_context(struct mm_struct *mm);
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+struct mm_iommu_table_group_mem_t;
+
+extern bool mm_iommu_preregistered(void);
+extern long mm_iommu_alloc(unsigned long ua, unsigned long entries,
+		struct mm_iommu_table_group_mem_t **pmem);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_get(unsigned long ua,
+		unsigned long entries);
+extern long mm_iommu_put(struct mm_iommu_table_group_mem_t *mem);
+extern void mm_iommu_cleanup(mm_context_t *ctx);
+extern struct mm_iommu_table_group_mem_t *mm_iommu_lookup(unsigned long ua,
+		unsigned long size);
+extern long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
+		unsigned long ua, unsigned long *hpa);
+extern long mm_iommu_mapped_update(struct mm_iommu_table_group_mem_t *mem,
+		bool inc);
+#endif
 extern void switch_mmu_context(struct mm_struct *prev, struct mm_struct *next);
 extern void switch_slb(struct task_struct *tsk, struct mm_struct *mm);
diff --git a/arch/powerpc/mm/Makefile b/arch/powerpc/mm/Makefile
index 9c8770b..e216704 100644
--- a/arch/powerpc/mm/Makefile
+++ b/arch/powerpc/mm/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_PPC_SUBPAGE_PROT)+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM) += highmem.o
 obj-$(CONFIG_PPC_COPRO_BASE) += copro_fault.o
+obj-$(CONFIG_SPAPR_TCE_IOMMU) += mmu_context_hash64_iommu.o diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c index 178876ae..eb3080c 100644 --- a/arch/powerpc/mm/mmu_context_hash64.c +++ b/arch/powerpc/mm/mmu_context_hash64.c @@ -89,6 +89,9 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm) #ifdef CONFIG_PPC_64K_PAGES mm->context.pte_frag = NULL; #endif +#ifdef CONFIG_SPAPR_TCE_IOMMU + INIT_LIST_HEAD_RCU(&mm->context.iommu_group_mem_list); +#endif return 0; } @@ -132,6 +135,9 @@ static inline void destroy_pagetable_page(struct mm_struct *mm) void destroy_context(struct mm_struct *mm) { +#ifdef CONFIG_SPAPR_TCE_IOMMU + mm_iommu_cleanup(&mm->context); +#endif #ifdef CONFIG_PPC_ICSWX drop_cop(mm->context.acop, mm
Re: [PATCH v4 00/21] PowerPC/PowerNV: PCI Slot Management
On 05/01/2015 04:02 PM, Gavin Shan wrote:

The series of patches intends to support PCI slots for the PowerPC PowerNV platform, which is running on top of skiboot firmware. The patchset requires corresponding changes from skiboot firmware, which is sent to skib...@lists.ozlabs.org for review. The PCI slots are exposed by skiboot with device node properties, and kernel utilizes those properties to populate PCI slots accordingly.

The original PCI infrastructure on PowerNV platform can't support hotplug because the PE is assigned during PHB fixup time, which is called once during system boot time. For this, the PCI infrastructure on PowerNV platform has been reworked a lot. After that, the PE and its corresponding resources (IODT, M32DT, M64 segments, DMA32 and bypass window) are assigned upon updating PCI bridge's resources, which might decide PE# assigned to the PE (e.g. M64 resources, on P8 strictly speaking).

Out of curiosity - does this PCI scan happen when memory subsystem is initialized? More precisely, after these changes, won't pnv_pci_ioda2_setup_dma_pe() be called too early after boot so I won't be able to use kmalloc() to allocate iommu_table's? Also, checkpatch.pl failed multiple times on the series. Please fix.

Each PE will maintain a reference count, which is (number of child PCI devices + 1). That indicates when last child PCI device leaves the PE, the PE and its included resources will be released and put back into free pool again. With this design, the PE will be released when EEH PE is released. PATCH[1 - 8] are related to this part.

From skiboot perspective, PCI slot is providing (hot/fundamental/complete) resets to EEH. The kernel gets to know if skiboot supports various reset on one particular PCI slot through device-tree node. If it does, EEH will utilize the functionality provided by skiboot. Besides, the device-tree nodes have to change in order to support PCI hotplug.
For example, when one PCI adapter is inserted to one slot, its device-tree node should be added to the system dynamically. Conversely, the device-tree node should be removed from the system when the PCI adapter is going to be offline. Since pci_dn and eeh_dev have same life cycle as PCI device nodes, they should be added/removed accordingly during PCI hotplug. Patch[9 - 20] are doing the related work.

The last patch is the standalone PCI hotplug driver for PowerNV platform. When removing PCI adapter from one PCI slot, which is invoked by command in userland, the skiboot will power off the slot to save power and remove all device-tree nodes for all PCI devices behind the slot. Conversely, the power to the slot is turned on, the PCI devices behind the slot are rescanned, and the device-tree nodes for those newly detected PCI devices will be built in skiboot. For both cases, one message will be sent to kernel by skiboot so that the kernel can adjust the device-tree accordingly. At the same time, the kernel also has to deallocate or allocate PE# and its related resources (PE# and so on) for the removed/added PCI devices.

Changelog
=========
v4:
* Rebased to 4.1.RC1
* Added API to unflatten FDT blob to device node sub-tree, which is attached the indicated parent device node. The original mechanism based on formatted string stream has been dropped.
* The PATCH[v3 09/21] ("powerpc/eeh: Delay probing EEH device during hotplug") was picked up and sent to linux-ppc@ separately for review as Richard's "VF EEH Support" depends on that.
v3:
* Rebased to 4.1.RC0
* PowerNV PCI infrastructure is totally refactored in order to support PCI hotplug. The PowerNV hotplug driver is also reworked a lot because of the changes in skiboot in order to support PCI hotplug.
Gavin Shan (21): pci: Add pcibios_setup_bridge() powerpc/powernv: Enable M64 on P7IOC powerpc/powernv: M64 support improvement powerpc/powernv: Improve IO and M32 mapping powerpc/powernv: Improve DMA32 segment assignment powerpc/powernv: Create PEs dynamically powerpc/powernv: Release PEs dynamically powerpc/powernv: Drop pnv_ioda_setup_dev_PE() powerpc/powernv: Use PCI slot reset infrastructure powerpc/powernv: Fundamental reset for PCI bus reset powerpc/pci: Don't scan empty slot powerpc/pci: Move pcibios_find_pci_bus() around powerpc/powernv: Introduce pnv_pci_poll() powerpc/powernv: Functions to get/reset PCI slot status powerpc/pci: Delay creating pci_dn powerpc/pci: Create eeh_dev while creating pci_dn powerpc/pci: Export traverse_pci_device_nodes() powerpc/pci: Update bridge windows on PCI plugging drivers/of: Support adding sub-tree powerpc/powernv: Select OF_DYNAMIC pci/hotplug: PowerPC PowerNV PCI hotplug driver arch/powerpc/include/asm/eeh.h |7 +- arch/powerpc/include/asm/opal-api.h|7 +- arch/powerpc/include/asm/opal.h|7 +- arch/powerpc/include/asm/pci-bridge.h
Re: [PATCH v4 02/21] powerpc/powernv: Enable M64 on P7IOC
On 05/01/2015 04:02 PM, Gavin Shan wrote: The patch enables M64 window on P7IOC, which has been enabled on PHB3. Comparing to PHB3, there are 16 M64 BARs and each of them are divided to 8 segments. "compared to something" means you will tell about PHB3 too :) Do I understand correctly that IODA==IODA1==P7IOC and P7IOC != IODA2? The code does not use "PHB3" or "P7IOC" acronym so it is a bit confusing. So each PHB can support 128 M64 segments. Also, P7IOC has M64DT, which helps mapping one particular M64 segment# to arbitrary PE#. However, we just provide 128 M64 (16 BARs) segments and fixed mapping between PE# and M64 segment# in order to keep same logic to support M64 for PHB3 and P7IOC. In turn, we just need different phb->init_m64() hooks for P7IOC and PHB3. Signed-off-by: Gavin Shan --- arch/powerpc/platforms/powernv/pci-ioda.c | 115 ++ 1 file changed, 103 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index f8bc950..646962f 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -165,6 +165,67 @@ static void pnv_ioda_free_pe(struct pnv_phb *phb, int pe) clear_bit(pe, phb->ioda.pe_alloc); } +static int pnv_ioda1_init_m64(struct pnv_phb *phb) +{ + struct resource *r; + int seg; + s64 rc; Here @rc is of the "s64" type. + + /* Each PHB supports 16 separate M64 BARs, each of which are +* divided into 8 segments. So there are number of M64 segments +* as total PE#, which is 128. +*/ "there are as many M64 segments as a maximum number of PEs which is 128"? 
+ for (seg = 0; seg < phb->ioda.total_pe; seg += 8) { + unsigned long base; + + base = phb->ioda.m64_base + seg * phb->ioda.m64_segsize; + rc = opal_pci_set_phb_mem_window(phb->opal_id, +OPAL_M64_WINDOW_TYPE, +seg / 8, +base, +0, /* unused */ +8 * phb->ioda.m64_segsize); + if (rc != OPAL_SUCCESS) { + pr_warn(" Failure %lld configuring M64 BAR#%d on PHB#%d\n", + rc, seg / 8, phb->hose->global_number); + goto fail; + } + + rc = opal_pci_phb_mmio_enable(phb->opal_id, + OPAL_M64_WINDOW_TYPE, + seg / 8, + OPAL_ENABLE_M64_SPLIT); + if (rc != OPAL_SUCCESS) { + pr_warn(" Failure %lld enabling M64 BAR#%d on PHB#%d\n", + rc, seg / 8, phb->hose->global_number); + goto fail; + } + } + + /* Strip of the segment used by the reserved PE, which +* is expected to be 0 or last supported PE# +*/ + r = &phb->hose->mem_resources[1]; mem_resources[0] is IO, mem_resources[1] is MMIO, mem_resources[2] is for what? Would be nice to have this commented somewhere. + if (phb->ioda.reserved_pe == 0) + r->start += phb->ioda.m64_segsize; + else if (phb->ioda.reserved_pe == (phb->ioda.total_pe - 1)) + r->end -= phb->ioda.m64_segsize; + else + pr_warn(" Cannot strip M64 segment for reserved PE#%d\n", + phb->ioda.reserved_pe); + + return 0; + +fail: + for ( ; seg >= 0; seg -= 8) + opal_pci_phb_mmio_enable(phb->opal_id, +OPAL_M64_WINDOW_TYPE, +seg / 8, +OPAL_DISABLE_M64); Out of curiosity - is not there a counterpart for opal_pci_set_phb_mem_window() for cleanup? 
+ + return -EIO; +} + /* The default M64 BAR is shared by all PEs */ static int pnv_ioda2_init_m64(struct pnv_phb *phb) { @@ -222,7 +283,7 @@ fail: return -EIO; } -static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb) +static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb) { resource_size_t sgsz = phb->ioda.m64_segsize; struct pci_dev *pdev; @@ -248,8 +309,8 @@ static void pnv_ioda2_reserve_m64_pe(struct pnv_phb *phb) } } -static int pnv_ioda2_pick_m64_pe(struct pnv_phb *phb, -struct pci_bus *bus, int all) +static int pnv_ioda_pick_m64_pe(struct pnv_phb *phb, + struct pci_bus *bus, int all) { resource_size_t segsz = phb->ioda.m64_segsize; struct pci_dev *pdev; @@ -346,6 +407,28 @@ done: pe->master = master_pe;
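The segment arithmetic described in the commit log and used by pnv_ioda1_init_m64() can be sketched in plain C. This is only an illustration under stated assumptions: struct m64_layout and the m64_*() helpers are hypothetical names, not kernel API, and the 16 BARs x 8 segments figures are taken from the commit log above.

```c
#include <assert.h>
#include <stdint.h>

#define M64_BARS          16  /* M64 BARs per PHB on P7IOC, per the commit log */
#define M64_SEGS_PER_BAR   8  /* segments per BAR, per the commit log */

/* Hypothetical stand-in for the two phb->ioda fields used by the loop. */
struct m64_layout {
	uint64_t m64_base;     /* start of the PHB's M64 window */
	uint64_t m64_segsize;  /* size of one M64 segment */
};

/* Total number of M64 segments: 16 * 8 = 128, the same as the PE count. */
static inline unsigned int m64_total_segments(void)
{
	return M64_BARS * M64_SEGS_PER_BAR;
}

/* BAR covering a given segment; this is the "seg / 8" in the patch. */
static inline unsigned int m64_seg_to_bar(unsigned int seg)
{
	assert(seg < m64_total_segments());
	return seg / M64_SEGS_PER_BAR;
}

/* Base address programmed for a BAR: m64_base + (first segment) * segsize. */
static inline uint64_t m64_bar_base(const struct m64_layout *l,
				    unsigned int bar)
{
	assert(bar < M64_BARS);
	return l->m64_base +
	       (uint64_t)bar * M64_SEGS_PER_BAR * l->m64_segsize;
}

/* Size programmed for every BAR: eight segments. */
static inline uint64_t m64_bar_size(const struct m64_layout *l)
{
	return M64_SEGS_PER_BAR * l->m64_segsize;
}
```

With a fixed PE# to segment# mapping, segment N always belongs to PE#N, so the only per-BAR inputs are the base and size computed above.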
Re: [PATCH v4 03/21] powerpc/powernv: M64 support improvement
On 05/01/2015 04:02 PM, Gavin Shan wrote: We're having the hardware or enforced (on P7IOC) limitation: M64 I would think if it is enforced, then it is enforced by hardware but you say "hardware OR enforced" :) segment#x can only be assigned to PE#x. IO and M32 segment can be mapped to arbitrary PE# via IODT and M32DT. It means the PE number should be x if M64 segment#x has been assigned to the PE. Also, each PE own one M64 segment at most. Currently, we are reserving PE# according to root port's M64 window. It won't be reliable once we extend M64 windows of root port, or the upstream port of the PCIE switch behind root port to PHB's M64 window, in order to support PCI hotplug in future. The patch reserves PE# for M64 segments according to the M64 resources of the PCI devices (not bridges) contained in the PE. Besides, it's always worthy to trace the M64 segments consumed by the PE, which can be released at PCI unplugging time. Signed-off-by: Gavin Shan --- arch/powerpc/platforms/powernv/pci-ioda.c | 190 ++ arch/powerpc/platforms/powernv/pci.h | 10 +- 2 files changed, 122 insertions(+), 78 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 646962f..a994882 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -283,28 +283,78 @@ fail: return -EIO; } -static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb) +/* We extend the M64 window of root port, or the upstream bridge port + * of the PCIE switch behind root port. So we shouldn't reserve PEs + * for M64 resources because there are no (normal) PCI devices consuming "PCI devices"? Not "root ports or PCI bridges"? + * M64 resources on the PCI buses leading from root port, or the upstream + * bridge port.The function returns true if the indicated PCI bus needs + * reserved PEs because of M64 resources in advance. Otherwise, the + * function returns false. 
+ */ +static bool pnv_ioda_need_m64_pe(struct pnv_phb *phb, +struct pci_bus *bus) { - resource_size_t sgsz = phb->ioda.m64_segsize; + /* Root bus */ The comment is too obvious as the call below is called "pci_is_root_bus" :) + if (!bus || pci_is_root_bus(bus)) + return false; + + /* Bus leading from root port. We need check what types of PCI +* devices on the bus. If it's connecting PCI bridge, we don't +* need reserve M64 PEs for it. Otherwise, we still need to do +* that. +*/ + if (pci_is_root_bus(bus->self->bus)) { + struct pci_dev *pdev; + + list_for_each_entry(pdev, &bus->devices, bus_list) { + if (pdev->hdr_type == PCI_HEADER_TYPE_NORMAL) + return true; + } + + return false; + } + + /* Bus leading from the upstream bridge port on top level */ + if (pci_is_root_bus(bus->self->bus->self->bus)) Is it for second level bridges? Like root->bridge->bridge? And for 3 levels you will need a PE? + return false; + + return true; +} + +static void pnv_ioda_reserve_m64_pe(struct pnv_phb *phb, + struct pci_bus *bus) +{ + resource_size_t segsz = phb->ioda.m64_segsize; struct pci_dev *pdev; struct resource *r; - int base, step, i; + unsigned long pe_no, limit; + int i; - /* -* Root bus always has full M64 range and root port has -* M64 range used in reality. So we're checking root port -* instead of root bus. + if (!pnv_ioda_need_m64_pe(phb, bus)) + return; + + /* The bridge's M64 window might have been extended to the +* PHB's M64 window in order to support PCI hotplug. So the +* bridge's M64 window isn't reliable to be used for picking +* PE# for its leading PCI bus. We have to check the M64 +* resources consumed by the PCI devices, which seat on the +* PCI bus. 
*/ - list_for_each_entry(pdev, &phb->hose->bus->devices, bus_list) { - for (i = 0; i < PCI_BRIDGE_RESOURCE_NUM; i++) { - r = &pdev->resource[PCI_BRIDGE_RESOURCES + i]; - if (!r->parent || - !pnv_pci_is_mem_pref_64(r->flags)) + list_for_each_entry(pdev, &bus->devices, bus_list) { + for (i = 0; i < PCI_NUM_RESOURCES; i++) { +#ifdef CONFIG_PCI_IOV + if (i >= PCI_IOV_RESOURCES && i <= PCI_IOV_RESOURCE_END) + continue; +#endif + r = &pdev->resource[i]; + if (!r->flags || r->start >= r->end || + !r->parent || !pnv_pci_is_mem_pref_64(r->flags)) continue; - bas
Re: [PATCH v4 04/21] powerpc/powernv: Improve IO and M32 mapping
On 05/01/2015 04:02 PM, Gavin Shan wrote: The PHB's IO or M32 window is divided evenly to segments, each of them can be mapped to arbitrary PE# by IODT or M32DT. Current code figures out the consumed IO and M32 segments by one particular PE from the windows of the PE's upstream bridge. It won't be reliable once we extend M64 windows of root port, or the upstream port of the PCIE switch behind root port to PHB's IO or M32 window, in order to support PCI hotplug in future. The patch improves pnv_ioda_setup_pe_seg() to calculate PE's consumed IO or M32 segments from its contained devices, no bridge involved any more. Also, the logic to mapping IO and M32 segments are combined to simplify the code. Besides, it's always worthy to trace the IO and M32 segments consumed by one PE, which can be released at PCI unplugging time. Signed-off-by: Gavin Shan --- arch/powerpc/platforms/powernv/pci-ioda.c | 150 -- arch/powerpc/platforms/powernv/pci.h | 13 +-- 2 files changed, 85 insertions(+), 78 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index a994882..7e6e266 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -2543,77 +2543,92 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev) } #endif /* CONFIG_PCI_IOV */ -/* - * This function is supposed to be called on basis of PE from top - * to bottom style. So the the I/O or MMIO segment assigned to - * parent PE could be overrided by its child PEs if necessary. 
- */ -static void pnv_ioda_setup_pe_seg(struct pci_controller *hose, - struct pnv_ioda_pe *pe) +static int pnv_ioda_map_pe_one_res(struct pci_controller *hose, + struct pnv_ioda_pe *pe, + struct resource *res) { struct pnv_phb *phb = hose->private_data; struct pci_bus_region region; - struct resource *res; - int i, index; - int rc; + unsigned int segsize, index; + unsigned long *segmap, *pe_segmap; + uint16_t win_type; + int64_t rc; - /* -* NOTE: We only care PCI bus based PE for now. For PCI -* device based PE, for example SRIOV sensitive VF should -* be figured out later. -*/ - BUG_ON(!(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL))); + /* Check if we need map the resource */ + if (!res->parent || !res->flags || + res->start > res->end || + pnv_pci_is_mem_pref_64(res->flags)) + return 0; - pci_bus_for_each_resource(pe->pbus, res, i) { - if (!res || !res->flags || - res->start > res->end) - continue; + if (res->flags & IORESOURCE_IO) { + segmap = phb->ioda.io_segmap; + pe_segmap = pe->io_segmap; + region.start = res->start - phb->ioda.io_pci_base; + region.end = res->end - phb->ioda.io_pci_base; + segsize = phb->ioda.io_segsize; + win_type = OPAL_IO_WINDOW_TYPE; + } else { + segmap = phb->ioda.m32_segmap; + pe_segmap = pe->m32_segmap; + region.start = res->start - + hose->mem_offset[0] - + phb->ioda.m32_pci_base; + region.end = res->end - +hose->mem_offset[0] - +phb->ioda.m32_pci_base; + segsize = phb->ioda.m32_segsize; + win_type = OPAL_M32_WINDOW_TYPE; + } + + index = region.start / segsize; + while (index < phb->ioda.total_pe && + region.start <= region.end) { + rc = opal_pci_map_pe_mmio_window(phb->opal_id, + pe->pe_number, win_type, 0, index); + if (rc != OPAL_SUCCESS) { + pr_warn("%s: Error %lld mapping (%d) seg#%d to PE#%d\n", + __func__, rc, win_type, index, pe->pe_number); + return -EIO; + } - if (res->flags & IORESOURCE_IO) { - region.start = res->start - phb->ioda.io_pci_base; - region.end = res->end - phb->ioda.io_pci_base; - index = region.start 
/ phb->ioda.io_segsize; + set_bit(index, segmap); + set_bit(index, pe_segmap); + region.start += segsize; + index++; + } - while (index < phb->ioda.total_pe && - region.start <= region.end) { - phb->ioda.io_segmap[index] = pe->pe_number; - rc = opal_pci_map_pe_mmio_window(phb->opal_id, - pe->pe_number, OPAL_IO_WIND
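The segment bookkeeping in pnv_ioda_map_pe_one_res() can be illustrated with a small userspace sketch. Assumptions are labeled: struct segmap and the segmap_*() helpers are hypothetical, the count of 128 segments is borrowed from total_pe earlier in the thread, and the sketch derives the first and last index directly instead of stepping region.start by segsize as the patch does. For segment-aligned resources both approaches mark the same segments.

```c
#include <assert.h>
#include <stdint.h>

#define TOTAL_SEGS 128u  /* assumed segment count, matching total_pe above */

/* Hypothetical segment map; the kernel uses set_bit() on unsigned long[]. */
struct segmap {
	uint64_t bits[TOTAL_SEGS / 64];
};

static inline void segmap_set(struct segmap *m, unsigned int idx)
{
	assert(idx < TOTAL_SEGS);
	m->bits[idx / 64] |= (uint64_t)1 << (idx % 64);
}

static inline int segmap_test(const struct segmap *m, unsigned int idx)
{
	assert(idx < TOTAL_SEGS);
	return (int)((m->bits[idx / 64] >> (idx % 64)) & 1);
}

/*
 * Mark every segment touched by the window-relative range [start, end]
 * (end inclusive). Returns the number of segments marked.
 */
static inline unsigned int segmap_mark_range(struct segmap *m, uint64_t start,
					     uint64_t end, uint64_t segsize)
{
	uint64_t first, last, i;
	unsigned int n = 0;

	assert(segsize != 0);
	if (start > end)
		return 0;

	first = start / segsize;
	last = end / segsize;
	for (i = first; i <= last && i < TOTAL_SEGS; i++) {
		segmap_set(m, (unsigned int)i);
		n++;
	}
	return n;
}
```

Keeping one such map per PHB and one per PE, as the patch does with segmap and pe_segmap, is what would let the unplug path find and release exactly the segments a PE consumed.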
Re: [PATCH v4 06/21] powerpc/powernv: Create PEs dynamically
On 05/01/2015 04:02 PM, Gavin Shan wrote: Currently, the PEs and their associated resources are assigned in ppc_md.pcibios_fixup(). The function is called for once after PCI probing and resources assignment are finished. Obviously, it's not hotplug friendly. The patch creates PEs dynamically by ppc_md.pcibios_setup_bridge(), which is called on the event during system bootup and PCI hotplug: updating PCI bridge's windows after resource assignment/reassignment are finished. For partial hotplug case, where not all PCI devices belonging to the PE are unplugged and plugged again, we just need unbinding/binding the affected PCI devices with the corresponding PE without creating new one. Some PEs are already created dynamically (SRIOV). I'd suggest to make subject more specific. Besides, it might require addtional resources (e.g. M32) to the windows of the PCI bridge when unplugging current adapter, and insert a different adapter if there is one PCI slot, which is assumed behind root port, or the downstream bridge of the PCIE switch behind root port. The parent bridge of the newly plugged adapter would reject the request to add more resources, leading to hotplug failure. For the issue, the patch extends the windows of root port, or the upstream port of the PCIe switch behind root port to PHB's windows when ppc_md.pcibios_setup_bridge() is called. There is no upstream bridge for root bus, so we have to reserve PE#, which is next to the reserved PE# in advance and fixing the PE for root bus in ppc_md.pcibios_setup_bridge(). The patch also changes the rule assigning PE#: PE# reserved for prefetchable 64-bits memory resource and SRIOV VFs starts from zero while PE# for dynamic allocations starts from ioda.total_pe reversely. It's because PE# for prefetchable 64-bits memory resource, which is ually allocated begining with the PHB's aperatus and PE# s/aperatus/apertures/? 
Maybe it is just me, but it looks like the patch both moves existing bits and adds this dynamic PE creation. Can it not be separated somehow into smaller patches? It is really hard to track all the changes you are making here.

--
Alexey
Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
On 05/01/2015 04:02 PM, Gavin Shan wrote: The original code doesn't support releasing PEs dynamically, meaning that PE and the associated resources (IO, M32, M64 and DMA) can't be released when unplugging a PCI adapter from one hotpluggable slot. The patch takes object oriented methodology, introducs reference count to PE, which is initialized to 1 and increased with 1 when a new PCI device joins the PE. Once the last PCI device leaves the PE, the PE is going to be release together with its associated (IO, M32, M64, DMA) resources. Too little commit log for non-trivial non-cut-n-paste 30KB patch... Signed-off-by: Gavin Shan --- arch/powerpc/include/asm/pci-bridge.h | 3 + arch/powerpc/kernel/pci-hotplug.c | 5 + arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++--- arch/powerpc/platforms/powernv/pci.h | 4 +- 4 files changed, 432 insertions(+), 238 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 5367eb3..a6ad4b1 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -31,6 +31,9 @@ struct pci_controller_ops { resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type); void(*setup_bridge)(struct pci_bus *, unsigned long); void(*reset_secondary_bus)(struct pci_dev *dev); + + /* Called when PCI device is released */ + void(*release_device)(struct pci_dev *); }; /* diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c index 7ed85a6..0040343 100644 --- a/arch/powerpc/kernel/pci-hotplug.c +++ b/arch/powerpc/kernel/pci-hotplug.c @@ -29,6 +29,11 @@ */ void pcibios_release_device(struct pci_dev *dev) { + struct pci_controller *hose = pci_bus_to_host(dev->bus); + + if (hose->controller_ops.release_device) + hose->controller_ops.release_device(dev); + eeh_remove_device(dev); } diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 910fb67..ef8c216 100644 --- 
a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -12,6 +12,8 @@ #undef DEBUG #include +#include +#include #include #include #include @@ -47,6 +49,8 @@ /* 256M DMA window, 4K TCE pages, 8 bytes TCE */ #define TCE32_TABLE_SIZE ((0x1000 / 0x1000) * 8) +static void pnv_ioda_release_pe(struct kref *kref); + static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, const char *fmt, ...) { @@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags) (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH)); } -static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no) +static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe) { - if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) { - pr_warn("%s: Invalid PE %d on PHB#%x\n", - __func__, pe_no, phb->hose->global_number); + if (!pe) + return; + + kref_get(&pe->kref); +} + +static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe) +{ + unsigned int count; + + if (!pe) return; + + /* +* The count is initialized to 1 and increased with 1 when +* a new PCI device is bound with the PE. Once the last PCI +* device is leaving from the PE, the PE is going to be +* released. +*/ + count = atomic_read(&pe->kref.refcount); + if (count == 2) + kref_sub(&pe->kref, 2, pnv_ioda_release_pe); + else + kref_put(&pe->kref, pnv_ioda_release_pe); What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()? 
+} + +static void pnv_pci_release_device(struct pci_dev *pdev) +{ + struct pci_controller *hose = pci_bus_to_host(pdev->bus); + struct pnv_phb *phb = hose->private_data; + struct pci_dn *pdn = pci_get_pdn(pdev); + struct pnv_ioda_pe *pe; + + if (pdn && pdn->pe_number != IODA_INVALID_PE) { + pe = &phb->ioda.pe_array[pdn->pe_number]; + pnv_ioda_pe_put(pe); + pdn->pe_number = IODA_INVALID_PE; } +} - if (test_and_set_bit(pe_no, phb->ioda.pe_alloc)) { - pr_warn("%s: PE %d was assigned on PHB#%x\n", - __func__, pe_no, phb->hose->global_number); +static void pnv_ioda_release_pe_dma(struct pnv_ioda_pe *pe) +{ + struct pnv_phb *phb = pe->phb; + int index, count; + unsigned long tbl_addr, tbl_size; + + /* No DMA capability for slave PEs */ + if (pe->flags & PNV_IODA_PE_SLAVE) + return; + + /* Bypass DMA window */ + if (phb->type == PNV_PHB_IODA2 && + pe->tce_bypass_enabled && + pe->tce32_table && +
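To make the race asked about above concrete: reading the count and then subtracting are two separate steps, so a concurrent get can slip in between them. One possible way to close that window is to do the "last device leaves, drop the base reference too" decision inside a single compare-and-swap loop. The sketch below is userspace C11 with <stdatomic.h>, not the kernel kref API, and struct pe_ref and its helpers are hypothetical names; it is not a proposal for how the patch should be fixed. (In the kernel, kref_get_unless_zero() plays a role similar to pe_ref_get() here.)

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical PE reference: 1 base reference + 1 per bound device. */
struct pe_ref {
	atomic_uint count;
	bool released;  /* set once the release action has run */
};

static inline void pe_ref_init(struct pe_ref *r)
{
	atomic_init(&r->count, 1);
	r->released = false;
}

/* Take a reference for a new device; fails if the PE is already gone. */
static inline bool pe_ref_get(struct pe_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));

	return true;
}

/*
 * Drop one reference. When only the base reference would remain
 * (old count of 2), the base reference is dropped in the same atomic
 * step. Returns true if this call released the PE.
 */
static inline bool pe_ref_put(struct pe_ref *r)
{
	unsigned int old = atomic_load(&r->count);
	unsigned int next;

	do {
		assert(old != 0);
		next = (old <= 2) ? 0 : old - 1;
	} while (!atomic_compare_exchange_weak(&r->count, &old, next));

	if (next == 0) {
		r->released = true;  /* stand-in for the real release work */
		return true;
	}
	return false;
}
```

If a get lands between the load and the exchange, the exchange fails, old is refreshed, and the decision is recomputed, so the count can never be taken from 3 to 0 by mistake.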
Re: [PATCH v4 08/21] powerpc/powernv: Drop pnv_ioda_setup_dev_PE()
On 05/01/2015 04:02 PM, Gavin Shan wrote:
> Nobody is using this function. The patch drops it.
>
> Signed-off-by: Gavin Shan

Yay! :) I would move this patch, along with other mechanical changes, to the beginning of the patchset.

Reviewed-by: Alexey Kardashevskiy

> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 71 ---
>  1 file changed, 71 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index ef8c216..5cd8298 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1302,77 +1302,6 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  }
>  #endif /* CONFIG_PCI_IOV */
>  
> -#if 0
> -static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
> -{
> -	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> -	struct pci_dn *pdn = pci_get_pdn(dev);
> -	struct pnv_ioda_pe *pe;
> -	int pe_num;
> -
> -	if (!pdn) {
> -		pr_err("%s: Device tree node not associated properly\n",
> -			   pci_name(dev));
> -		return NULL;
> -	}
> -	if (pdn->pe_number != IODA_INVALID_PE)
> -		return NULL;
> -
> -	/* PE#0 has been pre-set */
> -	if (dev->bus->number == 0)
> -		pe_num = 0;
> -	else
> -		pe_num = pnv_ioda_alloc_pe(phb);
> -	if (pe_num == IODA_INVALID_PE) {
> -		pr_warning("%s: Not enough PE# available, disabling device\n",
> -			   pci_name(dev));
> -		return NULL;
> -	}
> -
> -	/* NOTE: We get only one ref to the pci_dev for the pdn, not for the
> -	 * pointer in the PE data structure, both should be destroyed at the
> -	 * same time. However, this needs to be looked at more closely again
> -	 * once we actually start removing things (Hotplug, SR-IOV, ...)
> -	 *
> -	 * At some point we want to remove the PDN completely anyways
> -	 */
> -	pe = &phb->ioda.pe_array[pe_num];
> -	pci_dev_get(dev);
> -	pdn->pcidev = dev;
> -	pdn->pe_number = pe_num;
> -	pe->pdev = dev;
> -	pe->pbus = NULL;
> -	pe->tce32_seg = -1;
> -	pe->mve_number = -1;
> -	pe->rid = dev->bus->number << 8 | pdn->devfn;
> -
> -	pe_info(pe, "Associated device to PE\n");
> -
> -	if (pnv_ioda_configure_pe(phb, pe)) {
> -		/* XXX What do we do here ? */
> -		if (pe_num)
> -			pnv_ioda_free_pe(phb, pe_num);
> -		pdn->pe_number = IODA_INVALID_PE;
> -		pe->pdev = NULL;
> -		pci_dev_put(dev);
> -		return NULL;
> -	}
> -
> -	/* Assign a DMA weight to the device */
> -	pe->dma_weight = pnv_ioda_dma_weight(dev);
> -	if (pe->dma_weight != 0) {
> -		phb->ioda.dma_weight += pe->dma_weight;
> -		phb->ioda.dma_pe_count++;
> -	}
> -
> -	/* Link the PE */
> -	pnv_ioda_link_pe_by_weight(phb, pe);
> -
> -	return pe;
> -}
> -#endif /* Useful for SRIOV case */
> -
>  static void pnv_ioda_setup_same_PE(struct pci_bus *bus, struct pnv_ioda_pe *pe)
>  {
>  	struct pci_dev *dev;

--
Alexey
Re: [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
On 05/01/2015 04:02 PM, Gavin Shan wrote: For PowerNV platform, running on top of skiboot, all PE level reset should be routed to firmware if the bridge of the PE primary bus has device-node property "ibm,reset-by-firmware". Otherwise, the kernel has to issue hot reset on PE's primary bus despite the requested reset types, which is the behaviour before the firmware supports PCI slot reset. So the changes don't depend on the PCI slot reset capability exposed from the firmware. Signed-off-by: Gavin Shan --- arch/powerpc/include/asm/eeh.h | 1 + arch/powerpc/include/asm/opal.h | 4 +- arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +-- 3 files changed, 102 insertions(+), 109 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index c5eb86f..2793d24 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -190,6 +190,7 @@ enum { #define EEH_RESET_DEACTIVATE 0 /* Deactivate the PE reset */ #define EEH_RESET_HOT 1 /* Hot reset*/ #define EEH_RESET_FUNDAMENTAL 3 /* Fundamental reset*/ +#define EEH_RESET_COMPLETE 4 /* PHB complete reset */ #define EEH_LOG_TEMP 1 /* EEH temporary error log */ #define EEH_LOG_PERM 2 /* EEH permanent error log */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 042af1a..6d467df 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number, uint16_t dma_window_number, uint64_t pci_start_addr, uint64_t pci_mem_size); -int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state); +int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state); int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void *diag_buffer, uint64_t diag_buffer_len); @@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status); int64_t 
opal_set_system_attention_led(uint8_t led_action); int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe, __be16 *pci_error_type, __be16 *severity); -int64_t opal_pci_poll(uint64_t phb_id); +int64_t opal_pci_poll(uint64_t id, uint8_t *val); int64_t opal_return_cpu(void); int64_t opal_check_token(uint64_t token); int64_t opal_reinit_cpus(uint64_t flags); diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index ce738ab..3c01095 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay) return ret; } -static s64 pnv_eeh_phb_poll(struct pnv_phb *phb) +static s64 pnv_eeh_poll(uint64_t id) { s64 rc = OPAL_HARDWARE; while (1) { - rc = opal_pci_poll(phb->opal_id); + rc = opal_pci_poll(id, NULL); if (rc <= 0) break; @@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb) int pnv_eeh_phb_reset(struct pci_controller *hose, int option) { struct pnv_phb *phb = hose->private_data; + uint8_t scope; s64 rc = OPAL_HARDWARE; pr_debug("%s: Reset PHB#%x, option=%d\n", __func__, hose->global_number, option); - - /* Issue PHB complete reset request */ - if (option == EEH_RESET_FUNDAMENTAL || - option == EEH_RESET_HOT) - rc = opal_pci_reset(phb->opal_id, - OPAL_RESET_PHB_COMPLETE, - OPAL_ASSERT_RESET); - else if (option == EEH_RESET_DEACTIVATE) - rc = opal_pci_reset(phb->opal_id, - OPAL_RESET_PHB_COMPLETE, - OPAL_DEASSERT_RESET); - if (rc < 0) - goto out; - - /* -* Poll state of the PHB until the request is done -* successfully. The PHB reset is usually PHB complete -* reset followed by hot reset on root bus. So we also -* need the PCI bus settlement delay. 
-*/ - rc = pnv_eeh_phb_poll(phb); - if (option == EEH_RESET_DEACTIVATE) { - if (system_state < SYSTEM_RUNNING) - udelay(1000 * EEH_PE_RST_SETTLE_TIME); - else - msleep(EEH_PE_RST_SETTLE_TIME); These udelay() and msleep() are gone. How come they are not needed anymore? Worth commenting in the commit log or remove those in a separate patch. I just remember you
Re: [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset
On 05/01/2015 04:02 PM, Gavin Shan wrote:
> Function pnv_pci_reset_secondary_bus() is used to reset the specified PCI
> bus, which is led by a root complex or PCI bridge. That means the function
> shouldn't be called on the PCI root bus, and the patch removes the logic
> for that case.
>
> Also, some adapters beneath the indicated PCI bus may require fundamental
> reset in order to successfully reload their firmware after the reset. The
> patch translates hot reset to fundamental reset for that case.
>
> Signed-off-by: Gavin Shan
> ---
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 35 +---
>  1 file changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 3c01095..58e4dcf 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -888,18 +888,35 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
>  	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
>  }
>  
> -void pnv_pci_reset_secondary_bus(struct pci_dev *dev)

Why changing dev to pdev? Keeping "dev" could make the patch simpler.

> +static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
>  {
> -	struct pci_controller *hose;
> +	int *freset = data;
>  
> -	if (pci_is_root_bus(dev->bus)) {
> -		hose = pci_bus_to_host(dev->bus);
> -		pnv_eeh_phb_reset(hose, EEH_RESET_HOT);
> -		pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE);
> -	} else {
> -		pnv_eeh_bridge_reset(dev, EEH_RESET_HOT);
> -		pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE);
> +	/*
> +	 * Stop the iteration immediately if there is any
> +	 * one PCI device requesting fundamental reset
> +	 */
> +	*freset |= pdev->needs_freset;
> +	return *freset;
> +}
> +
> +void pnv_pci_reset_secondary_bus(struct pci_dev *pdev)
> +{
> +	int option = EEH_RESET_HOT;
> +	int freset = 0;
> +
> +	/* Check if there're any PCI devices asking for fundamental reset */
> +	if (pdev->subordinate) {
> +		pci_walk_bus(pdev->subordinate,
> +			     pnv_pci_dev_reset_type,
> +			     &freset);
> +		if (freset)
> +			option = EEH_RESET_FUNDAMENTAL;
>  	}
> +
> +	/* Issue the requested type of reset */
> +	pnv_eeh_bridge_reset(pdev, option);
> +	pnv_eeh_bridge_reset(pdev, EEH_RESET_DEACTIVATE);
>  }
>  
>  /**

--
Alexey
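The callback in this patch relies on the walker stopping as soon as the callback returns non-zero. A minimal userspace sketch of that pattern follows. Assumptions: struct fake_dev, walk_devs() and pick_reset() are hypothetical, the walk is over a flat array rather than a bus hierarchy with subordinate buses, and the two enum values mirror EEH_RESET_HOT (1) and EEH_RESET_FUNDAMENTAL (3) from the eeh.h hunk quoted in 09/21.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device: only the one flag the callback looks at. */
struct fake_dev {
	int needs_freset;
};

typedef int (*walk_cb)(struct fake_dev *dev, void *data);

/*
 * Visit devices in order and stop as soon as the callback returns
 * non-zero. Returns the number of devices visited.
 */
static inline size_t walk_devs(struct fake_dev *devs, size_t n,
			       walk_cb cb, void *data)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (cb(&devs[i], data))
			return i + 1;

	return n;
}

/* Same shape as pnv_pci_dev_reset_type(): accumulate, then stop if set. */
static inline int dev_reset_type(struct fake_dev *dev, void *data)
{
	int *freset = data;

	*freset |= dev->needs_freset;
	return *freset;
}

enum reset_option { RESET_HOT = 1, RESET_FUNDAMENTAL = 3 };

/* Pick fundamental reset if any device below the bridge asks for it. */
static inline enum reset_option pick_reset(struct fake_dev *devs, size_t n)
{
	int freset = 0;

	walk_devs(devs, n, dev_reset_type, &freset);
	return freset ? RESET_FUNDAMENTAL : RESET_HOT;
}
```

Returning the accumulated flag doubles as the stop condition, so the walk ends at the first device that needs a fundamental reset.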
Re: [PATCH v4 13/21] powerpc/powernv: Introduce pnv_pci_poll()
On 05/01/2015 04:03 PM, Gavin Shan wrote:

We might not get some PCI slot information (e.g. power status) immediately from the OPAL API. Instead, opal_pci_poll() needs to be called for the required information. The patch introduces pnv_pci_poll(), which is based on the original pnv_eeh_poll(), to cover the above case.

Signed-off-by: Gavin Shan
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 28 ++--
 arch/powerpc/platforms/powernv/pci.c | 16
 arch/powerpc/platforms/powernv/pci.h | 1 +
 3 files changed, 19 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 58e4dcf..9253b9e 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -742,24 +742,6 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay)
 	return ret;
 }

-static s64 pnv_eeh_poll(uint64_t id)
-{
-	s64 rc = OPAL_HARDWARE;
-
-	while (1) {
-		rc = opal_pci_poll(id, NULL);
-		if (rc <= 0)
-			break;
-
-		if (system_state < SYSTEM_RUNNING)
-			udelay(1000 * rc);
-		else
-			msleep(rc);
-	}
-
-	return rc;
-}
-
 int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
 {
 	struct pnv_phb *phb = hose->private_data;
@@ -788,10 +770,7 @@ int pnv_eeh_phb_reset(struct pci_controller *hose, int option)
 	/* Issue reset and poll until it's completed */
 	rc = opal_pci_reset(phb->opal_id, scope, OPAL_ASSERT_RESET);
-	if (rc > 0)
-		rc = pnv_eeh_poll(phb->opal_id);
-
-	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
+	return pnv_pci_poll(phb->opal_id, rc, NULL);

You are carrying a negative value to the new helper too? Looks complicated. Also, before you only cared whether opal_pci_reset() returned a negative value; now you treat it as a timeout. Is this a new change to OPAL, or has it always been there?

 }

 static int __pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
@@ -882,10 +861,7 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option)
 	phb = hose->private_data;
 	id |= (dev->bus->number << 24) | (dev->devfn << 16) | phb->opal_id;
 	rc = opal_pci_reset(id, scope, OPAL_ASSERT_RESET);
-	if (rc > 0)
-		rc = pnv_eeh_poll(id);
-
-	return (rc == OPAL_SUCCESS) ? 0 : -EIO;
+	return pnv_pci_poll(id, rc, NULL);
 }

 static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data)
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index bca2aeb..a2da9a3 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -44,6 +44,22 @@
 #define cfg_dbg(fmt...)	do { } while(0)
 //#define cfg_dbg(fmt...)	printk(fmt)

+int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval)
+{
+	while (rval > 0) {
+		if (system_state < SYSTEM_RUNNING)
+			udelay(1000 * rval);
+		else
+			msleep(rval);

Are these delays the ones removed by "[PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure"? If so, I would merge this patch into 09/21 or move this one before that one, for bisectability.

+
+		rval = opal_pci_poll(id, pval);
+		if (rval == OPAL_SUCCESS && pval)
+			rval = opal_pci_poll(id, pval);

Why call it twice?

+	}
+
+	return rval ? -EIO : 0;
+}
+
 #ifdef CONFIG_PCI_MSI
 static int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
 {
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 8b10f01..82c5539 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -202,6 +202,7 @@ struct pnv_phb {
 extern struct pci_ops pnv_pci_ops;

+int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval);
 void pnv_pci_dump_phb_diag_data(struct pci_controller *hose,
 				unsigned char *log_buff);
 int pnv_pci_cfg_read(struct pci_dn *pdn,

--
Alexey
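For readers following the review questions above, the loop can be modelled outside the kernel. This is a userspace illustration only, not the patch's code: struct fake_fw and fake_poll() are hypothetical stand-ins for the firmware and for opal_pci_poll(), and the udelay()/msleep() call is replaced by a counter. It follows the shape of the proposed pnv_pci_poll() for the pval == NULL case, where a positive value means "wait that many milliseconds and poll again", zero means done, and anything else is reported as -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical firmware model (not a real OPAL interface): answers
 * "busy, retry in 10 ms" a fixed number of times, then a final status.
 */
struct fake_fw {
	int busy_polls;        /* remaining "busy" answers */
	int64_t final_status;  /* 0 for success, negative for an error */
	int64_t slept_ms;      /* total delay the caller has requested */
};

static int64_t fake_poll(struct fake_fw *fw)
{
	if (fw->busy_polls > 0) {
		fw->busy_polls--;
		return 10;
	}
	return fw->final_status;
}

/*
 * rval is the status of the call that started the operation.
 * Positive: delay, then poll again. Zero: success. Negative: error.
 */
int poll_until_done(struct fake_fw *fw, int64_t rval)
{
	while (rval > 0) {
		fw->slept_ms += rval;  /* stands in for udelay()/msleep() */
		rval = fake_poll(fw);
	}

	/* The loop can only exit with a final (non-positive) status. */
	assert(rval <= 0);
	return rval == 0 ? 0 : -EIO;
}
```

Passing a negative initial status skips the loop entirely and yields -EIO, which is the behaviour the first review comment asks about.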
Re: [PATCH v4 14/21] powerpc/powernv: Functions to get/reset PCI slot status
On 05/01/2015 04:03 PM, Gavin Shan wrote: The patch exports 3 functions, which base on corresponding OPAL APIs to get or set PCI slot status. Those functions are going to be used by PCI hotplug module in subsequent patches: pnv_pci_get_presence_status() opal_pci_get_presence_status() pnv_pci_get_power_status() opal_pci_get_power_status() pnv_pci_set_power_status() opal_pci_set_power_status() Besides, the patch also exports pnv_pci_hotplug_notifier() to allow registering PCI hotplug notifier, which will be used to receive PCI hotplug message from skiboot firmware. Signed-off-by: Gavin Shan --- arch/powerpc/include/asm/opal-api.h| 7 +++- arch/powerpc/include/asm/opal.h| 3 ++ arch/powerpc/include/asm/pnv-pci.h | 5 +++ arch/powerpc/platforms/powernv/opal-wrappers.S | 3 ++ arch/powerpc/platforms/powernv/pci.c | 45 ++ 5 files changed, 62 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 0321a90..29b407d 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -153,7 +153,10 @@ #define OPAL_FLASH_READ 110 #define OPAL_FLASH_WRITE 111 #define OPAL_FLASH_ERASE 112 -#define OPAL_LAST 112 +#define OPAL_PCI_GET_PRESENCE_STATUS 116 +#define OPAL_PCI_GET_POWER_STATUS 117 +#define OPAL_PCI_SET_POWER_STATUS 118 +#define OPAL_LAST 118 /* Device tree flags */ @@ -352,6 +355,8 @@ enum opal_msg_type { OPAL_MSG_SHUTDOWN, /* params[0] = 1 reboot, 0 shutdown */ OPAL_MSG_HMI_EVT, OPAL_MSG_DPO, + OPAL_MSG_PRD, + OPAL_MSG_PCI_HOTPLUG, OPAL_MSG_TYPE_MAX, }; diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 6d467df..a0eb206 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -200,6 +200,9 @@ int64_t opal_flash_write(uint64_t id, uint64_t offset, uint64_t buf, uint64_t size, uint64_t token); int64_t opal_flash_erase(uint64_t id, uint64_t offset, uint64_t size, uint64_t token); +int64_t 
opal_pci_get_presence_status(uint64_t id, uint8_t *status); +int64_t opal_pci_get_power_status(uint64_t id, uint8_t *status); +int64_t opal_pci_set_power_status(uint64_t id, uint8_t status); /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h index f9b4982..50d92a4 100644 --- a/arch/powerpc/include/asm/pnv-pci.h +++ b/arch/powerpc/include/asm/pnv-pci.h @@ -13,6 +13,11 @@ #include #include +extern int pnv_pci_get_presence_status(uint64_t id, uint8_t *status); +extern int pnv_pci_get_power_status(uint64_t id, uint8_t *status); +extern int pnv_pci_set_power_status(uint64_t id, uint8_t status); +extern int pnv_pci_hotplug_notifier(struct notifier_block *nb, bool enable); + int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode); int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq, unsigned int virq); diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S index a7ade94..aa95dcb 100644 --- a/arch/powerpc/platforms/powernv/opal-wrappers.S +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S @@ -295,3 +295,6 @@ OPAL_CALL(opal_i2c_request, OPAL_I2C_REQUEST); OPAL_CALL(opal_flash_read,OPAL_FLASH_READ); OPAL_CALL(opal_flash_write, OPAL_FLASH_WRITE); OPAL_CALL(opal_flash_erase, OPAL_FLASH_ERASE); +OPAL_CALL(opal_pci_get_presence_status, OPAL_PCI_GET_PRESENCE_STATUS); +OPAL_CALL(opal_pci_get_power_status, OPAL_PCI_GET_POWER_STATUS); +OPAL_CALL(opal_pci_set_power_status, OPAL_PCI_SET_POWER_STATUS); diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index a2da9a3..60e6d65 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -60,6 +60,51 @@ int pnv_pci_poll(uint64_t id, int64_t rval, uint8_t *pval) return rval ? 
-EIO : 0; } +int pnv_pci_get_presence_status(uint64_t id, uint8_t *status) +{ + long rc; + + if (!opal_check_token(OPAL_PCI_GET_PRESENCE_STATUS)) I got a question about the style (i.e. I do not mean the patch is wrong :) ) Everywhere else you use int64_t or s64 for the value returned by OPAL but not with opal_check_point(). And you would compare it to OPAL_SUCCESS rather than plain zero. What does opal_check_token() return when succeeded? 1, -1,...? OPAL_SUCCESS means here an error, right? + return -ENXIO; + + rc
Re: [PATCH v4 15/21] powerpc/pci: Delay creating pci_dn
On 05/01/2015 04:03 PM, Gavin Shan wrote: The pci_dn instances are allocated from memblock or bootmem when creating PCI controller (hoses) in setup_arch(). The PCI hotplug, which will be supported by proceeding patches, will release PCI device nodes and their corresponding pci_dn on unplugging event. The pci_dn instance memory chunks alloed from memblock or bootmem are hard to reused after being released. The patch delay creating pci_dn so that they can be allocated from slab. In turn, the memory chunks for them can be reused after being released without problem. The creation of eeh_dev instances, which depends on pci_dn, is delayed a bit as well. Signed-off-by: Gavin Shan --- arch/powerpc/include/asm/ppc-pci.h | 1 - arch/powerpc/kernel/eeh_dev.c | 2 +- arch/powerpc/kernel/pci_dn.c | 40 +++--- arch/powerpc/platforms/maple/pci.c | 35 + arch/powerpc/platforms/pasemi/pci.c| 3 --- arch/powerpc/platforms/powermac/pci.c | 39 - arch/powerpc/platforms/powernv/pci.c | 3 --- arch/powerpc/platforms/pseries/setup.c | 1 - 8 files changed, 68 insertions(+), 56 deletions(-) diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h index 4122a86..7388316 100644 --- a/arch/powerpc/include/asm/ppc-pci.h +++ b/arch/powerpc/include/asm/ppc-pci.h @@ -40,7 +40,6 @@ void *traverse_pci_dn(struct pci_dn *root, void *(*fn)(struct pci_dn *, void *), void *data); -extern void pci_devs_phb_init(void); extern void pci_devs_phb_init_dynamic(struct pci_controller *phb); /* From rtas_pci.h */ diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c index aabba94..f33ce5b 100644 --- a/arch/powerpc/kernel/eeh_dev.c +++ b/arch/powerpc/kernel/eeh_dev.c @@ -110,4 +110,4 @@ static int __init eeh_dev_phb_init(void) return 0; } -core_initcall(eeh_dev_phb_init); +core_initcall_sync(eeh_dev_phb_init); diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c index b3b4df9..d3833af 100644 --- a/arch/powerpc/kernel/pci_dn.c +++ 
b/arch/powerpc/kernel/pci_dn.c @@ -277,7 +277,7 @@ void *update_dn_pci_info(struct device_node *dn, void *data) struct device_node *parent; struct pci_dn *pdn; - pdn = zalloc_maybe_bootmem(sizeof(*pdn), GFP_KERNEL); + pdn = kzalloc(sizeof(*pdn), GFP_KERNEL); if (pdn == NULL) return NULL; dn->data = pdn; @@ -442,33 +442,37 @@ void pci_devs_phb_init_dynamic(struct pci_controller *phb) traverse_pci_devices(dn, update_dn_pci_info, phb); } -/** +static void pci_dev_pdn_setup(struct pci_dev *pdev) +{ + struct pci_dn *pdn; + + if (pdev->dev.archdata.pci_data) + return; + + /* Setup the fast path */ + pdn = pci_get_pdn(pdev); + pdev->dev.archdata.pci_data = pdn; +} +DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup); How does moving of the chunk above help to "Delay creating pci_dn"? + +/* * pci_devs_phb_init - Initialize phbs and pci devs under them. - * - * This routine walks over all phb's (pci-host bridges) on the - * system, and sets up assorted pci-related structures + * + * This routine walks over all phb's (pci-host bridges) on + * the system, and sets up assorted pci-related structures * (including pci info in the device node structs) for each * pci device found underneath. This routine runs once, * early in the boot sequence. */ -void __init pci_devs_phb_init(void) +static int __init pci_devs_phb_init(void) { struct pci_controller *phb, *tmp; /* This must be done first so the device nodes have valid pci info! 
*/ list_for_each_entry_safe(phb, tmp, &hose_list, list_node) pci_devs_phb_init_dynamic(phb); -} - -static void pci_dev_pdn_setup(struct pci_dev *pdev) -{ - struct pci_dn *pdn; - if (pdev->dev.archdata.pci_data) - return; - - /* Setup the fast path */ - pdn = pci_get_pdn(pdev); - pdev->dev.archdata.pci_data = pdn; + return 0; } -DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pci_dev_pdn_setup); + +core_initcall(pci_devs_phb_init); diff --git a/arch/powerpc/platforms/maple/pci.c b/arch/powerpc/platforms/maple/pci.c index a923230..04a69a8 100644 --- a/arch/powerpc/platforms/maple/pci.c +++ b/arch/powerpc/platforms/maple/pci.c @@ -568,6 +568,26 @@ void maple_pci_irq_fixup(struct pci_dev *dev) DBG(" <- maple_pci_irq_fixup\n"); } +static int maple_pci_root_bridge_prepare(struct pci_host_bridge *bridge) +{ + struct pci_controller *hose = pci_bus_to_host(bridge->bus); + struct device_node *np, *child; + + if (hose != u3_agp) + return 0; + + /* Fixup the PCI<->OF mapping for U3 AGP due to bus renumbering. We +* assume there is no P2P bridge on the AGP bus, w
Re: [PATCH v4 16/21] powerpc/pci: Create eeh_dev while creating pci_dn
On 05/01/2015 04:03 PM, Gavin Shan wrote: The eeh_dev is always created based on pci_dn, but with initcall supported by core_initcall_sync(). The patch creates eeh_dev when pci_dn is created, indicating they have same life cycle. Signed-off-by: Gavin Shan --- arch/powerpc/include/asm/eeh.h | 6 -- arch/powerpc/kernel/eeh_dev.c | 18 -- arch/powerpc/kernel/pci_dn.c | 12 arch/powerpc/platforms/pseries/setup.c | 6 +- 4 files changed, 21 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index 2793d24..4ed88f6 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -269,7 +269,8 @@ void eeh_pe_restore_bars(struct eeh_pe *pe); const char *eeh_pe_loc_get(struct eeh_pe *pe); struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe); -void *eeh_dev_init(struct pci_dn *pdn, void *data); +struct eeh_dev *eeh_dev_init(struct pci_dn *pdn, +struct pci_controller *phb); Everywhere else (?) you name these pci_controller pointer variables "hose" but not in this patch. void eeh_dev_phb_init_dynamic(struct pci_controller *phb); int eeh_init(void); int __init eeh_ops_register(struct eeh_ops *ops); @@ -322,7 +323,8 @@ static inline int eeh_init(void) return 0; } -static inline void *eeh_dev_init(struct pci_dn *pdn, void *data) +static inline struct eeh_dev *eeh_dev_init(struct pci_dn *pdn, + struct pci_controller *phb) { return NULL; } diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c index f33ce5b..7486932 100644 --- a/arch/powerpc/kernel/eeh_dev.c +++ b/arch/powerpc/kernel/eeh_dev.c @@ -44,14 +44,14 @@ /** * eeh_dev_init - Create EEH device according to OF node * @pdn: PCI device node - * @data: PHB + * @phb: PCI controller * * It will create EEH device according to the given OF node. The function * might be called by PCI emunation, DR, PHB hotplug. 
*/ -void *eeh_dev_init(struct pci_dn *pdn, void *data) +struct eeh_dev *eeh_dev_init(struct pci_dn *pdn, +struct pci_controller *phb) { - struct pci_controller *phb = data; struct eeh_dev *edev; /* Allocate EEH device */ @@ -68,7 +68,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data) edev->phb = phb; INIT_LIST_HEAD(&edev->list); - return NULL; + return edev; } /** @@ -80,16 +80,8 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data) */ void eeh_dev_phb_init_dynamic(struct pci_controller *phb) { - struct pci_dn *root = phb->pci_data; - /* EEH PE for PHB */ eeh_phb_pe_create(phb); - - /* EEH device for PHB */ - eeh_dev_init(root, phb); - - /* EEH devices for children OF nodes */ - traverse_pci_dn(root, eeh_dev_init, phb); } /** @@ -105,8 +97,6 @@ static int __init eeh_dev_phb_init(void) list_for_each_entry_safe(phb, tmp, &hose_list, list_node) eeh_dev_phb_init_dynamic(phb); - pr_info("EEH: devices created\n"); - return 0; } diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c index d3833af..abc81fa 100644 --- a/arch/powerpc/kernel/pci_dn.c +++ b/arch/powerpc/kernel/pci_dn.c @@ -276,6 +276,9 @@ void *update_dn_pci_info(struct device_node *dn, void *data) const __be32 *regs; struct device_node *parent; struct pci_dn *pdn; +#ifdef CONFIG_EEH + struct eeh_dev *edev; +#endif pdn = kzalloc(sizeof(*pdn), GFP_KERNEL); if (pdn == NULL) @@ -306,6 +309,15 @@ void *update_dn_pci_info(struct device_node *dn, void *data) /* Extended config space */ pdn->pci_ext_config_space = (type && of_read_number(type, 1) == 1); + /* Initialize EEH device */ +#ifdef CONFIG_EEH You do not need this #ifdef - you have a stub for eeh_dev_init() in arch/powerpc/include/asm/eeh.h + edev = eeh_dev_init(pdn, phb); + if (!edev) { s/!edev/eeh_dev_init(pdn, phb)/ and get rid of @edev local variable at all - you do not use it anyway? 
+ kfree(pdn); + return NULL; + } +#endif + /* Attach to parent node */ INIT_LIST_HEAD(&pdn->child_list); INIT_LIST_HEAD(&pdn->list); diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 5f80758..92974aa 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -261,12 +261,8 @@ static int pci_dn_reconfig_notifier(struct notifier_block *nb, unsigned long act switch (action) { case OF_RECONFIG_ATTACH_NODE: pci = np->parent->data; - if (pci) { + if (pci) update_dn_pci_info(np, pci->phb); - - /* Create EEH device for the OF node */ -
Re: [PATCH v4 21/21] pci/hotplug: PowerPC PowerNV PCI hotplug driver
On 05/01/2015 04:03 PM, Gavin Shan wrote: The patch intends to add standalone driver to support PCI hotplug for PowerPC PowerNV platform, which runs on top of skiboot firmware. The firmware identified hotpluggable slots and marked their device tree node with proper "ibm,slot-pluggable" and "ibm,reset-by-firmware". The driver simply scans device-tree to create/register PCI hotplug slot accordingly. If the skiboot firmware doesn't support slot status retrieval, the PCI slot device node shouldn't have property "ibm,reset-by-firmware". In that case, none of valid PCI slots will be detected from device tree. The skiboot firmware doesn't export the capability to access attention LEDs yet and it's something for TBD. Signed-off-by: Gavin Shan --- drivers/pci/hotplug/Kconfig| 12 + drivers/pci/hotplug/Makefile | 4 + drivers/pci/hotplug/powernv_php.c | 146 drivers/pci/hotplug/powernv_php.h | 78 drivers/pci/hotplug/powernv_php_slot.c | 643 + 5 files changed, 883 insertions(+) create mode 100644 drivers/pci/hotplug/powernv_php.c create mode 100644 drivers/pci/hotplug/powernv_php.h create mode 100644 drivers/pci/hotplug/powernv_php_slot.c diff --git a/drivers/pci/hotplug/Kconfig b/drivers/pci/hotplug/Kconfig index df8caec..ef55dae 100644 --- a/drivers/pci/hotplug/Kconfig +++ b/drivers/pci/hotplug/Kconfig @@ -113,6 +113,18 @@ config HOTPLUG_PCI_SHPC When in doubt, say N. +config HOTPLUG_PCI_POWERNV + tristate "PowerPC PowerNV PCI Hotplug driver" + depends on PPC_POWERNV && EEH + help + Say Y here if you run PowerPC PowerNV platform that supports + PCI Hotplug + + To compile this driver as a module, choose M here: the + module will be called powernv-php. + + When in doubt, say N. 
+ config HOTPLUG_PCI_RPA tristate "RPA PCI Hotplug driver" depends on PPC_PSERIES && EEH diff --git a/drivers/pci/hotplug/Makefile b/drivers/pci/hotplug/Makefile index 4a9aa08..a69665e 100644 --- a/drivers/pci/hotplug/Makefile +++ b/drivers/pci/hotplug/Makefile @@ -14,6 +14,7 @@ obj-$(CONFIG_HOTPLUG_PCI_PCIE)+= pciehp.o obj-$(CONFIG_HOTPLUG_PCI_CPCI_ZT5550) += cpcihp_zt5550.o obj-$(CONFIG_HOTPLUG_PCI_CPCI_GENERIC)+= cpcihp_generic.o obj-$(CONFIG_HOTPLUG_PCI_SHPC)+= shpchp.o +obj-$(CONFIG_HOTPLUG_PCI_POWERNV) += powernv-php.o obj-$(CONFIG_HOTPLUG_PCI_RPA) += rpaphp.o obj-$(CONFIG_HOTPLUG_PCI_RPA_DLPAR) += rpadlpar_io.o obj-$(CONFIG_HOTPLUG_PCI_SGI) += sgi_hotplug.o @@ -50,6 +51,9 @@ ibmphp-objs := ibmphp_core.o \ acpiphp-objs := acpiphp_core.o \ acpiphp_glue.o +powernv-php-objs := powernv_php.o \ + powernv_php_slot.o + rpaphp-objs := rpaphp_core.o \ rpaphp_pci.o\ rpaphp_slot.o diff --git a/drivers/pci/hotplug/powernv_php.c b/drivers/pci/hotplug/powernv_php.c new file mode 100644 index 000..5cf9e717 --- /dev/null +++ b/drivers/pci/hotplug/powernv_php.c @@ -0,0 +1,146 @@ +/* + * PCI Hotplug Driver for PowerPC PowerNV platform. + * + * Copyright Gavin Shan, IBM Corporation 2015. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "powernv_php.h" Compiles without linux/kernel.h, linux/sysfs.h, linux/string.h, linux/slab.h. Sure you need all of these? 
+ +#define DRIVER_VERSION "0.1" +#define DRIVER_AUTHOR "Gavin Shan, IBM Corporation" +#define DRIVER_DESC"PowerPC PowerNV PCI Hotplug Driver" + +static struct notifier_block php_msg_nb = { + .notifier_call = powernv_php_msg_handler, + .next = NULL, + .priority = 0, +}; + +static int powernv_php_register_one(struct device_node *dn) +{ + struct powernv_php_slot *slot; + const __be32 *prop32; + int ret; + + /* Check if it's hotpluggable slot */ + prop32 = of_get_property(dn, "ibm,slot-pluggable", NULL); + if (!prop32 || !of_read_number(prop32, 1)) + return 0; Although nobody checks the return code, this should be -ENXIO or something but zero. And the check below too. + + prop32 = of_get_property(dn, "ibm,reset-by-firmware", NULL); + if (!prop32 || !of_read_number(prop32, 1)) + return 0; + + /* Allocate slot */ + slot = powernv_php_slot_alloc(dn); + if (!slot) + return -ENODEV; + + /* Register it */ + ret = powernv_php_slot_register(slot); + if (ret)
Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table
On 05/05/2015 10:02 PM, David Gibson wrote:
On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
On 05/01/2015 02:23 PM, David Gibson wrote:
On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
On 04/29/2015 04:31 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:

In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell which region the just-cleared TCE came from.

This adds a userspace view of the TCE table into the iommu_table struct. It contains userspace addresses, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken, which means it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v9:
* fixed code flow in error cases added in v8
v8:
* added ENOMEM on failed vzalloc()
---
 arch/powerpc/include/asm/iommu.h | 6 ++
 arch/powerpc/kernel/iommu.c | 18 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7694546..1472de3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,9 +111,15 @@ struct iommu_table {
 	unsigned long *it_map; /* A simple allocation bitmap for now */
 	unsigned long it_page_shift;/* table iommu page size */
 	struct iommu_table_group *it_table_group;
+	unsigned long *it_userspace; /* userspace view of the table */

A single unsigned long doesn't seem like enough.

Why single? This is an array.

As in single per page.

Sorry, I am not following you here. It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed with either a system page or a huge page.

How do you know which process's address space this address refers to?

It is the current task. Multiple userspaces cannot use the same container/tables.

Where is that enforced?

It is accessed from VFIO DMA map/unmap, which are ioctls() on a container's fd, which is per process.

Usually, but what enforces that? If you open a container fd, then fork(), and attempt to map from both parent and child, what happens?

vfio_group_fops::open() checks if the group is already opened, and I want to believe open() is called from fork() for the new fd so no mapping can happen later.

--
Alexey
Re: [PATCH kernel v9 27/32] powerpc/iommu/ioda2: Add get_table_size() to calculate the size of future table
On 05/05/2015 09:58 PM, David Gibson wrote:
On Fri, May 01, 2015 at 04:53:08PM +1000, Alexey Kardashevskiy wrote:
On 05/01/2015 03:12 PM, David Gibson wrote:
On Fri, May 01, 2015 at 02:10:58PM +1000, Alexey Kardashevskiy wrote:
On 04/29/2015 04:40 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:51PM +1000, Alexey Kardashevskiy wrote:

This adds a way for the IOMMU user to know how much memory a new table will use so it can be accounted in the locked_vm limit before allocation happens.

This stores the allocated table size in pnv_pci_create_table() so the locked_vm counter can be updated correctly when a table is being disposed.

This defines an iommu_table_group_ops callback to let VFIO know how much memory will be locked if a table is created.

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v9:
* reimplemented the whole patch
---
 arch/powerpc/include/asm/iommu.h | 5 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 14
 arch/powerpc/platforms/powernv/pci.c | 36 +++
 arch/powerpc/platforms/powernv/pci.h | 2 ++
 4 files changed, 57 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 1472de3..9844c106 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -99,6 +99,7 @@ struct iommu_table {
 	unsigned long it_size; /* Size of iommu table in entries */
 	unsigned long it_indirect_levels;
 	unsigned long it_level_size;
+	unsigned long it_allocated_size;
 	unsigned long it_offset;/* Offset into global table */
 	unsigned long it_base; /* mapped address of tce table */
 	unsigned long it_index; /* which iommu table this is */
@@ -155,6 +156,10 @@ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
 struct iommu_table_group;

 struct iommu_table_group_ops {
+	unsigned long (*get_table_size)(
+			__u32 page_shift,
+			__u64 window_size,
+			__u32 levels);
 	long (*create_table)(struct iommu_table_group *table_group,
 			int num,
 			__u32 page_shift,
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index e0be556..7f548b4 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2062,6 +2062,18 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb,
 }

 #ifdef CONFIG_IOMMU_API
+static unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
+		__u64 window_size, __u32 levels)
+{
+	unsigned long ret = pnv_get_table_size(page_shift, window_size, levels);
+
+	if (!ret)
+		return ret;
+
+	/* Add size of it_userspace */
+	return ret + (window_size >> page_shift) * sizeof(unsigned long);

This doesn't make much sense. The userspace view can't possibly be a property of the specific low-level IOMMU model.

This it_userspace thing is all about memory preregistration.

I need some way to track how many actual mappings the mm_iommu_table_group_mem_t has in order to decide whether to allow unregistering or not.

When I clear a TCE, I can read the old value, which is a host physical address that I cannot use to find the preregistered region and adjust the mappings counter; I can only use userspace addresses for this (not even guest physical addresses, as it is VFIO and probably no KVM).

So I have to keep userspace addresses somewhere, one per IOMMU page, and the iommu_table seems a natural place for this.

Well.. sort of. But as noted elsewhere this pulls VFIO specific constraints into a platform code structure. And whether you get this table depends on the platform IOMMU type rather than on what VFIO wants to do with it, which doesn't make sense.

What might make more sense is an opaque pointer in iommu_table for use by the table "owner" (in the take_ownership sense). The pointer would be stored in iommu_table, but VFIO is responsible for populating and managing its contents.

Or you could just put the userspace mappings in the container. Although you might want a different data structure in that case.

Nope. I need this table in in-kernel acceleration to update the mappings counter per mm_iommu_table_group_mem_t. In KVM's real mode handlers, I only have IOMMU tables, not containers or groups. QEMU creates a guest view of the table (KVM_CREATE_SPAPR_TCE) specifying a LIOBN, and then attaches TCE tables to it via a set of ioctls (one per IOMMU group) to the VFIO KVM device.

So if I call it it_opaque (instead of it_userspace), I will still need a common place (visible to VFIO and PowerKVM) to put this:

#define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry)

I think it should be in a VFIO header. If I'm understanding right this part of the PowerKVM code is explicitly VFIO
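As a worked example of the extra accounting discussed in this message, the term added for it_userspace in the quoted hunk is (window_size >> page_shift) * sizeof(unsigned long). The helper below is a hypothetical userspace sketch of that arithmetic only; it assumes a 64-bit unsigned long (as on ppc64) and does not include whatever pnv_get_table_size() returns for the TCE table itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for one userspace address per IOMMU page.
 * userspace_view_bytes() is an illustrative name, not a kernel function.
 */
uint64_t userspace_view_bytes(uint32_t page_shift, uint64_t window_size)
{
	/* A shift of 64 or more would be undefined behaviour in C. */
	assert(page_shift < 64);

	/* Number of IOMMU pages in the window, times 8 bytes per entry. */
	return (window_size >> page_shift) * sizeof(uint64_t);
}
```

For a 2 GiB window this gives 256 KiB with 64K IOMMU pages (page_shift 16) and 4 MiB with 4K IOMMU pages (page_shift 12), so the userspace view is a noticeable part of what gets charged to locked_vm for small page sizes.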
Re: [PATCH kernel v9 31/32] vfio: powerpc/spapr: Support multiple groups in one container if possible
On 05/05/2015 09:50 PM, David Gibson wrote:
On Fri, May 01, 2015 at 04:05:24PM +1000, Alexey Kardashevskiy wrote:
On 05/01/2015 02:33 PM, David Gibson wrote:
On Thu, Apr 30, 2015 at 07:33:09PM +1000, Alexey Kardashevskiy wrote:
On 04/30/2015 05:22 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:55PM +1000, Alexey Kardashevskiy wrote:

At the moment only one group per container is supported. POWER8 CPUs have a more flexible design and allow having 2 TCE tables per IOMMU group, so we can relax this limitation and support multiple groups per container.

It's not obvious why allowing multiple TCE tables per PE has any bearing on allowing multiple groups per container.

This patchset is a global TCE tables rework (patches 1..30, roughly) with 2 outcomes:
1. reusing the same IOMMU table for multiple groups - patch 31;
2. allowing dynamic create/remove of IOMMU tables - patch 32.

I can remove this one from the patchset and post it separately later, but since 1..30 aim to support both 1) and 2), I think I had better keep them all together (it might explain some of the changes I do in 1..30).

The combined patchset is fine. My comment is because your commit message says that multiple groups are possible *because* 2 TCE tables per group are allowed, and it's not at all clear why one follows from the other.

Ah. That's wrong indeed, I'll fix it.

This adds TCE table descriptors to a container and uses iommu_table_group_ops to create/set DMA windows on IOMMU groups so the same TCE tables will be shared between several IOMMU groups.

Signed-off-by: Alexey Kardashevskiy
[aw: for the vfio related changes]
Acked-by: Alex Williamson
---
Changes:
v7:
* updated doc
---
 Documentation/vfio.txt | 8 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 268 ++--
 2 files changed, 199 insertions(+), 77 deletions(-)

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 94328c8..7dcf2b5 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -289,10 +289,12 @@ PPC64 sPAPR implementation note
 This implementation has some specifics:

-1) Only one IOMMU group per container is supported as an IOMMU group
-represents the minimal entity which isolation can be guaranteed for and
-groups are allocated statically, one per a Partitionable Endpoint (PE)
+1) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
+container is supported as an IOMMU table is allocated at the boot time,
+one table per a IOMMU group which is a Partitionable Endpoint (PE)
 (PE is often a PCI domain but not always).

I thought the more fundamental problem was that different PEs tended to use disjoint bus address ranges, so even by duplicating put_tce across PEs you couldn't have a common address space.

Sorry, I am not following you here.

By duplicating put_tce, I can have multiple IOMMU groups on the same virtual PHB in QEMU; "[PATCH qemu v7 04/14] spapr_pci_vfio: Enable multiple groups per container" does this, and the address ranges will be the same.

Oh, ok. For some reason I thought that (at least on the older machines) the different PEs used different and not easily changeable DMA windows in bus address space.

They do use different tables (which VFIO does not get to remove/create and uses these old helpers - iommu_take/release_ownership), correct. But all these windows are mapped at zero on a PE's PCI bus and nothing prevents me from updating all these tables with the same TCE values when handling H_PUT_TCE. Yes, it is slow, but it works (a bit more detail below).

Um.. I'm pretty sure that contradicts what Ben was saying on the thread.

True, it does contradict, I do not know why he said what he said :)

--
Alexey
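The "duplicating put_tce" idea from this exchange can be sketched in plain C. This is an illustration under simplifying assumptions, not the VFIO or powernv code: struct tce_table and put_tce_all() are hypothetical names, each table is modelled as a flat array, and every window is assumed to start at the same bus address so that one index is valid for all of them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table: a flat array of TCE values. */
struct tce_table {
	uint64_t *entries;
	size_t nr_entries;
};

/*
 * Write the same TCE value at the same index in every group's table.
 * Returns 0 on success, -1 if the index does not fit one of the tables.
 */
int put_tce_all(struct tce_table *tables, size_t nr_tables,
		size_t index, uint64_t tce)
{
	size_t i;

	assert(tables != NULL || nr_tables == 0);

	/* Check every table first so the update is all-or-nothing. */
	for (i = 0; i < nr_tables; i++)
		if (index >= tables[i].nr_entries)
			return -1;

	/* One store per table: simple, but the cost grows with the group count. */
	for (i = 0; i < nr_tables; i++)
		tables[i].entries[index] = tce;

	return 0;
}
```

The second loop is where the "slow but it works" remark comes from: each guest update turns into one write per attached group, whereas a shared table needs a single write.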
Re: [PATCH kernel v9 26/32] powerpc/iommu: Add userspace view of TCE table
On 05/11/2015 12:11 PM, Alexey Kardashevskiy wrote:
On 05/05/2015 10:02 PM, David Gibson wrote:
On Fri, May 01, 2015 at 05:12:45PM +1000, Alexey Kardashevskiy wrote:
On 05/01/2015 02:23 PM, David Gibson wrote:
On Fri, May 01, 2015 at 02:01:17PM +1000, Alexey Kardashevskiy wrote:
On 04/29/2015 04:31 PM, David Gibson wrote:
On Sat, Apr 25, 2015 at 10:14:50PM +1000, Alexey Kardashevskiy wrote:

In order to support memory pre-registration, we need a way to track the use of every registered memory region and only allow unregistration if a region is not in use anymore. So we need a way to tell which region the just-cleared TCE came from.

This adds a userspace view of the TCE table into the iommu_table struct. It contains userspace addresses, one per TCE entry. The table is only allocated when the ownership over an IOMMU group is taken, which means it is only used from outside of the powernv code (such as VFIO).

Signed-off-by: Alexey Kardashevskiy
---
Changes:
v9:
* fixed code flow in error cases added in v8
v8:
* added ENOMEM on failed vzalloc()
---
 arch/powerpc/include/asm/iommu.h | 6 ++
 arch/powerpc/kernel/iommu.c | 18 ++
 arch/powerpc/platforms/powernv/pci-ioda.c | 22 --
 3 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 7694546..1472de3 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -111,9 +111,15 @@ struct iommu_table {
 	unsigned long *it_map; /* A simple allocation bitmap for now */
 	unsigned long it_page_shift;/* table iommu page size */
 	struct iommu_table_group *it_table_group;
+	unsigned long *it_userspace; /* userspace view of the table */

A single unsigned long doesn't seem like enough.

Why single? This is an array.

As in single per page.

Sorry, I am not following you here. It is per IOMMU page. MAP/UNMAP work with IOMMU pages which are fully backed with either a system page or a huge page.

How do you know which process's address space this address refers to?

It is the current task. Multiple userspaces cannot use the same container/tables.

Where is that enforced?

It is accessed from VFIO DMA map/unmap, which are ioctls() on a container's fd, which is per process.

Usually, but what enforces that? If you open a container fd, then fork(), and attempt to map from both parent and child, what happens?

vfio_group_fops::open() checks if the group is already opened, and I want to believe open() is called from fork() for the new fd so no mapping can happen later.

I am wrong here. Nothing prevents multiple userspaces from using the same container. It still does not seem really dangerous, as in order to use VFIO, someone with root privilege has to set the right permissions on /dev/vfio* first anyway, and that person knows what QEMU does and what QEMU does not :)

I could add a pid into iommu_table, next to it_userspace, and fail when another pid is trying to change the it_userspace table. Not sure if I want to do this check in realmode though (performance). Or make sure somehow that fork() closes container and group fd's (but how?).

In the worst case, a wrong userspace page will be put and there will be random backtraces on the host kernel.

What would you do?

--
Alexey
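The correction above rests on standard POSIX behaviour: fork() copies the parent's open file descriptors into the child and no open() handler runs for those copies. A minimal userspace sketch follows; a pipe stands in for the VFIO container fd (an assumption made purely for illustration), and child_can_use_inherited_fd() is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child created by fork() can use a descriptor that only
 * the parent created, 0 if it cannot, -1 if the setup itself failed.
 */
int child_can_use_inherited_fd(void)
{
	int fds[2];
	char buf[8] = { 0 };
	int status;
	pid_t pid;
	ssize_t n;

	if (pipe(fds) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: no open() here, the descriptor came with fork(). */
		ssize_t written = write(fds[1], "child", 5);

		_exit(written == 5 ? 0 : 1);
	}

	/* Parent: drop its write end, wait for the child, read the data. */
	close(fds[1]);
	if (waitpid(pid, &status, 0) < 0) {
		close(fds[0]);
		return -1;
	}
	n = read(fds[0], buf, sizeof(buf) - 1);
	close(fds[0]);

	/* buf stays NUL-terminated because at most 7 bytes were read. */
	assert(n < (ssize_t)sizeof(buf));
	return WIFEXITED(status) && WEXITSTATUS(status) == 0 &&
	       n == 5 && strcmp(buf, "child") == 0;
}
```

On the "(but how?)" question: the close-on-exec flag (O_CLOEXEC, FD_CLOEXEC) only takes effect on exec(), not on fork(), so it would not by itself keep a forked child from issuing ioctls on an inherited container fd.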
Re: [PATCH v4 07/21] powerpc/powernv: Release PEs dynamically
On 05/11/2015 04:25 PM, Gavin Shan wrote: On Sat, May 09, 2015 at 10:43:23PM +1000, Alexey Kardashevskiy wrote: On 05/01/2015 04:02 PM, Gavin Shan wrote: The original code doesn't support releasing PEs dynamically, meaning that PE and the associated resources (IO, M32, M64 and DMA) can't be released when unplugging a PCI adapter from one hotpluggable slot. The patch takes object oriented methodology, introduces reference count to PE, which is initialized to 1 and increased with 1 when a new PCI device joins the PE. Once the last PCI device leaves the PE, the PE is going to be released together with its associated (IO, M32, M64, DMA) resources. Too little commit log for non-trivial non-cut-n-paste 30KB patch... Ok. I'll add more details in next revision. Signed-off-by: Gavin Shan --- arch/powerpc/include/asm/pci-bridge.h | 3 + arch/powerpc/kernel/pci-hotplug.c | 5 + arch/powerpc/platforms/powernv/pci-ioda.c | 658 +++--- arch/powerpc/platforms/powernv/pci.h | 4 +- 4 files changed, 432 insertions(+), 238 deletions(-) diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h index 5367eb3..a6ad4b1 100644 --- a/arch/powerpc/include/asm/pci-bridge.h +++ b/arch/powerpc/include/asm/pci-bridge.h @@ -31,6 +31,9 @@ struct pci_controller_ops { resource_size_t (*window_alignment)(struct pci_bus *, unsigned long type); void(*setup_bridge)(struct pci_bus *, unsigned long); void(*reset_secondary_bus)(struct pci_dev *dev); + + /* Called when PCI device is released */ + void(*release_device)(struct pci_dev *); }; /* diff --git a/arch/powerpc/kernel/pci-hotplug.c b/arch/powerpc/kernel/pci-hotplug.c index 7ed85a6..0040343 100644 --- a/arch/powerpc/kernel/pci-hotplug.c +++ b/arch/powerpc/kernel/pci-hotplug.c @@ -29,6 +29,11 @@ */ void pcibios_release_device(struct pci_dev *dev) { + struct pci_controller *hose = pci_bus_to_host(dev->bus); + + if (hose->controller_ops.release_device) + hose->controller_ops.release_device(dev); + eeh_remove_device(dev); } 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 910fb67..ef8c216 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -12,6 +12,8 @@ #undef DEBUG #include +#include +#include #include #include #include @@ -47,6 +49,8 @@ /* 256M DMA window, 4K TCE pages, 8 bytes TCE */ #define TCE32_TABLE_SIZE ((0x1000 / 0x1000) * 8) +static void pnv_ioda_release_pe(struct kref *kref); + static void pe_level_printk(const struct pnv_ioda_pe *pe, const char *level, const char *fmt, ...) { @@ -123,25 +127,400 @@ static inline bool pnv_pci_is_mem_pref_64(unsigned long flags) (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH)); } -static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int pe_no) +static inline void pnv_ioda_pe_get(struct pnv_ioda_pe *pe) { - if (!(pe_no >= 0 && pe_no < phb->ioda.total_pe)) { - pr_warn("%s: Invalid PE %d on PHB#%x\n", - __func__, pe_no, phb->hose->global_number); + if (!pe) + return; + + kref_get(&pe->kref); +} + +static inline void pnv_ioda_pe_put(struct pnv_ioda_pe *pe) +{ + unsigned int count; + + if (!pe) return; + + /* +* The count is initialized to 1 and increased with 1 when +* a new PCI device is bound with the PE. Once the last PCI +* device is leaving from the PE, the PE is going to be +* released. +*/ + count = atomic_read(&pe->kref.refcount); + if (count == 2) + kref_sub(&pe->kref, 2, pnv_ioda_release_pe); + else + kref_put(&pe->kref, pnv_ioda_release_pe); What if pnv_ioda_pe_get() gets called between atomic_read() and kref_sub()? Yeah, that would have problem. But it shouldn't happen because the PCI devices are joining the parent PE# in strictly serialized mode. Same thing happens when detaching PCI devices from its parent PE. oookay. Another thing then - why is this kref counter initialized to 1? It would make sense if you did something special when the counter becomes 1 after decrement but you do not. 
Also, this kref thing makes sense if you do kref_put() in multiple places and do not know which one will be the last one so you pass the callback to all of them. Here you do kref_put/sub in one place and you read the counter - so you can call pnv_ioda_release_pe() directly. And it feels like a simple atomic_t would do the job just fine. If you still feel that the counter should start from 1, there are atomic_dec_if_positive() and atomic_inc_not_zero() and othe
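To illustrate the point about a simple atomic counter: the window between atomic_read() and kref_sub() disappears if the decrement and the "was that the last one?" test are a single atomic operation. Below is a minimal userspace sketch using C11 <stdatomic.h>, not the kernel's kref or atomic_t API. struct pe_model, pe_model_get() and pe_model_put() are hypothetical names, and starting the counter at 0 so that it counts attached devices only is an assumption, not what the patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct pe_model {
	atomic_int device_count; /* devices currently attached to the PE */
	bool released;           /* set once the resources are torn down */
};

/* A device joins the PE. */
void pe_model_get(struct pe_model *pe)
{
	atomic_fetch_add(&pe->device_count, 1);
}

/* A device leaves the PE. Returns true if it was the last one. */
bool pe_model_put(struct pe_model *pe)
{
	/* fetch_sub returns the previous value, so 1 means we reached 0. */
	int prev = atomic_fetch_sub(&pe->device_count, 1);

	assert(prev > 0);
	if (prev != 1)
		return false;

	pe->released = true; /* stand-in for freeing IO/M32/M64/DMA */
	return true;
}
```

A get that arrives after the count has already dropped to zero would still be a bug in this model; that case needs either the strict serialization described above or an "increment unless zero" primitive.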
Re: [PATCH v4 09/21] powerpc/powernv: Use PCI slot reset infrastructure
On 05/11/2015 04:45 PM, Gavin Shan wrote: On Sat, May 09, 2015 at 11:41:05PM +1000, Alexey Kardashevskiy wrote: On 05/01/2015 04:02 PM, Gavin Shan wrote: For PowerNV platform, running on top of skiboot, all PE level reset should be routed to firmware if the bridge of the PE primary bus has device-node property "ibm,reset-by-firmware". Otherwise, the kernel has to issue hot reset on PE's primary bus despite the requested reset types, which is the behaviour before the firmware supports PCI slot reset. So the changes don't depend on the PCI slot reset capability exposed from the firmware. Signed-off-by: Gavin Shan --- arch/powerpc/include/asm/eeh.h | 1 + arch/powerpc/include/asm/opal.h | 4 +- arch/powerpc/platforms/powernv/eeh-powernv.c | 206 +-- 3 files changed, 102 insertions(+), 109 deletions(-) diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h index c5eb86f..2793d24 100644 --- a/arch/powerpc/include/asm/eeh.h +++ b/arch/powerpc/include/asm/eeh.h @@ -190,6 +190,7 @@ enum { #define EEH_RESET_DEACTIVATE 0 /* Deactivate the PE reset */ #define EEH_RESET_HOT 1 /* Hot reset*/ #define EEH_RESET_FUNDAMENTAL 3 /* Fundamental reset*/ +#define EEH_RESET_COMPLETE 4 /* PHB complete reset */ #define EEH_LOG_TEMP 1 /* EEH temporary error log */ #define EEH_LOG_PERM 2 /* EEH permanent error log */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 042af1a..6d467df 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -129,7 +129,7 @@ int64_t opal_pci_map_pe_dma_window(uint64_t phb_id, uint16_t pe_number, uint16_t int64_t opal_pci_map_pe_dma_window_real(uint64_t phb_id, uint16_t pe_number, uint16_t dma_window_number, uint64_t pci_start_addr, uint64_t pci_mem_size); -int64_t opal_pci_reset(uint64_t phb_id, uint8_t reset_scope, uint8_t assert_state); +int64_t opal_pci_reset(uint64_t id, uint8_t reset_scope, uint8_t assert_state); int64_t opal_pci_get_hub_diag_data(uint64_t hub_id, void 
*diag_buffer, uint64_t diag_buffer_len); @@ -145,7 +145,7 @@ int64_t opal_get_epow_status(__be64 *status); int64_t opal_set_system_attention_led(uint8_t led_action); int64_t opal_pci_next_error(uint64_t phb_id, __be64 *first_frozen_pe, __be16 *pci_error_type, __be16 *severity); -int64_t opal_pci_poll(uint64_t phb_id); +int64_t opal_pci_poll(uint64_t id, uint8_t *val); int64_t opal_return_cpu(void); int64_t opal_check_token(uint64_t token); int64_t opal_reinit_cpus(uint64_t flags); diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index ce738ab..3c01095 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -742,12 +742,12 @@ static int pnv_eeh_get_state(struct eeh_pe *pe, int *delay) return ret; } -static s64 pnv_eeh_phb_poll(struct pnv_phb *phb) +static s64 pnv_eeh_poll(uint64_t id) { s64 rc = OPAL_HARDWARE; while (1) { - rc = opal_pci_poll(phb->opal_id); + rc = opal_pci_poll(id, NULL); if (rc <= 0) break; @@ -763,84 +763,38 @@ static s64 pnv_eeh_phb_poll(struct pnv_phb *phb) int pnv_eeh_phb_reset(struct pci_controller *hose, int option) { struct pnv_phb *phb = hose->private_data; + uint8_t scope; s64 rc = OPAL_HARDWARE; pr_debug("%s: Reset PHB#%x, option=%d\n", __func__, hose->global_number, option); - - /* Issue PHB complete reset request */ - if (option == EEH_RESET_FUNDAMENTAL || - option == EEH_RESET_HOT) - rc = opal_pci_reset(phb->opal_id, - OPAL_RESET_PHB_COMPLETE, - OPAL_ASSERT_RESET); - else if (option == EEH_RESET_DEACTIVATE) - rc = opal_pci_reset(phb->opal_id, - OPAL_RESET_PHB_COMPLETE, - OPAL_DEASSERT_RESET); - if (rc < 0) - goto out; - - /* -* Poll state of the PHB until the request is done -* successfully. The PHB reset is usually PHB complete -* reset followed by hot reset on root bus. So we also -* need the PCI bus settlement delay. 
-*/ - rc = pnv_eeh_phb_poll(phb); - if (option == EEH_RESET_DEACTIVATE) { - if (system_state < SYSTEM_RUNNING) - udelay(1000 * EEH_PE_RST_SETTLE_TIME); - else - msleep(EEH_PE_RST_SETTLE_TIME); T
Re: [PATCH v4 10/21] powerpc/powernv: Fundamental reset for PCI bus reset
On 05/11/2015 04:47 PM, Gavin Shan wrote: On Sun, May 10, 2015 at 12:12:18AM +1000, Alexey Kardashevskiy wrote: On 05/01/2015 04:02 PM, Gavin Shan wrote: Function pnv_pci_reset_secondary_bus() is used to reset the specified PCI bus, which is led by a root complex or PCI bridge. That means the function shouldn't be called on a PCI root bus and the patch removes the logic for that case. Also, some adapters beneath the indicated PCI bus may require fundamental reset in order to successfully reload their firmware after the reset. The patch translates hot reset to fundamental reset for that case. Signed-off-by: Gavin Shan --- arch/powerpc/platforms/powernv/eeh-powernv.c | 35 +--- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index 3c01095..58e4dcf 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -888,18 +888,35 @@ static int pnv_eeh_bridge_reset(struct pci_dev *dev, int option) return (rc == OPAL_SUCCESS) ? 0 : -EIO; } -void pnv_pci_reset_secondary_bus(struct pci_dev *dev) Why changing dev to pdev? Keeping "dev" could make the patch simpler. In the early stage when I wrote the EEH code, I had "dev" to refer to a PCI device, which isn't precise enough. Actually, "dev" means "struct device" while "pdev" stands for "struct pci_dev". That's why I changed it. The rest of the file and the kernel overall use "dev" for pci_dev just fine. I would not bother. 
+static int pnv_pci_dev_reset_type(struct pci_dev *pdev, void *data) { - struct pci_controller *hose; + int *freset = data; - if (pci_is_root_bus(dev->bus)) { - hose = pci_bus_to_host(dev->bus); - pnv_eeh_phb_reset(hose, EEH_RESET_HOT); - pnv_eeh_phb_reset(hose, EEH_RESET_DEACTIVATE); - } else { - pnv_eeh_bridge_reset(dev, EEH_RESET_HOT); - pnv_eeh_bridge_reset(dev, EEH_RESET_DEACTIVATE); + /* +* Stop the iteration immediately if there is any +* one PCI device requesting fundamental reset +*/ + *freset |= pdev->needs_freset; + return *freset; +} + +void pnv_pci_reset_secondary_bus(struct pci_dev *pdev) +{ + int option = EEH_RESET_HOT; + int freset = 0; + + /* Check if there're any PCI devices asking for fundamental reset */ + if (pdev->subordinate) { + pci_walk_bus(pdev->subordinate, +pnv_pci_dev_reset_type, +&freset); + if (freset) + option = EEH_RESET_FUNDAMENTAL; } + + /* Issue the requested type of reset */ + pnv_eeh_bridge_reset(pdev, option); + pnv_eeh_bridge_reset(pdev, EEH_RESET_DEACTIVATE); } /** Thanks, Gavin -- Alexey
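The "stop the iteration immediately" comment in the patch relies on the walker aborting as soon as the callback returns nonzero. A small userspace model of that pattern is sketched below. struct dev_model, model_reset_type() and model_walk() are hypothetical names; the walker here is a flat loop, whereas the real pci_walk_bus() also descends into subordinate buses.

```c
#include <assert.h>
#include <stddef.h>

struct dev_model {
	int needs_freset; /* mirrors the needs_freset bit used in the patch */
};

/* Callback: accumulate the flag; a nonzero return stops the walk. */
int model_reset_type(struct dev_model *dev, void *data)
{
	int *freset = data;

	*freset |= dev->needs_freset;
	return *freset;
}

/*
 * Simplified flat walker. Calls @cb for each device until it returns
 * nonzero, and reports how many devices were visited.
 */
size_t model_walk(struct dev_model *devs, size_t n,
		  int (*cb)(struct dev_model *, void *), void *data)
{
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		visited++;
		if (cb(&devs[i], data))
			break;
	}
	return visited;
}
```

If the second of four devices asks for a fundamental reset, the walk visits two devices and leaves the flag set; if none ask, every device is visited and the flag stays zero, so the caller falls back to a hot reset.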
[PATCH kernel v10 00/34] powerpc/iommu/vfio: Enable Dynamic DMA windows
This enables the sPAPR-defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (IOMMU group) has an address range on a PCI bus where devices are allowed to do DMA. These ranges are called DMA windows. By default, there is a single DMA window, 1 or 2GB big, mapped at zero on a PCI bus.

Hi-speed devices may suffer from the limited size of the window. The recent host kernels use a TCE bypass window on POWER8 CPU which implements direct PCI bus address range mapping (with offset of 1<<59) to the host memory.

For guests, PAPR defines a DDW RTAS API which allows pseries guests to query the hypervisor about DDW support and capabilities (page size mask for now). A pseries guest may request additional (to the default) DMA windows using this RTAS API. The existing pseries Linux guests request an additional window as big as the guest RAM and map the entire guest window, which effectively creates a direct mapping of the guest memory to a PCI bus.

The multiple DMA windows feature is supported by POWER7/POWER8 CPUs; however this patchset only adds support for POWER8 as TCE tables are implemented in POWER7 in a quite different way and POWER7 is not the highest priority.

This patchset reworks PPC64 IOMMU code and adds necessary structures to support big windows.

Once a Linux guest discovers the presence of DDW, it does:
1. query hypervisor about number of available windows and page size masks;
2. create a window with the biggest possible page size (today 4K/64K/16M);
3. map the entire guest RAM via H_PUT_TCE* hypercalls;
4. switch dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore for 64bit devices and the guest does not waste time on DMA map/unmap operations.

Note that 32bit devices won't use DDW and will keep using the default DMA window so KVM optimizations will be required (to be posted later).
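The page size choice and the cost of mapping the whole guest RAM in the steps above can be sketched with a little arithmetic. This is an illustration only: the mask encoding (bit N set means a page size of 1 << N is supported) is an assumption and not the PAPR format, and model_biggest_page_shift() and model_tce_count() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask encoding: bit N set means page size (1 << N). */
#define MODEL_PGSZ_4K  (1ULL << 12)
#define MODEL_PGSZ_64K (1ULL << 16)
#define MODEL_PGSZ_16M (1ULL << 24)

/* Return the shift of the largest supported page size, or -1 if none. */
int model_biggest_page_shift(uint64_t mask)
{
	for (int shift = 63; shift >= 0; shift--)
		if (mask & (1ULL << shift))
			return shift;
	return -1;
}

/* Number of TCE entries needed to cover ram_bytes, rounding up. */
uint64_t model_tce_count(uint64_t ram_bytes, int page_shift)
{
	uint64_t page_size;

	assert(page_shift >= 0 && page_shift < 64);
	page_size = 1ULL << page_shift;

	/* Whole pages, plus one more if a partial page remains. */
	return (ram_bytes >> page_shift) +
	       ((ram_bytes & (page_size - 1)) != 0);
}
```

For a 4GB guest, a window with 16M pages needs 256 TCEs, while the same window with 4K pages needs 1048576, which is why the guest goes for the biggest page size it is offered.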
This is pushed to g...@github.com:aik/linux.git + 4d0247b...3a5fb80 vfio-for-github -> vfio-for-github (forced update) The pushed branch contains all patches from this patchset and KVM acceleration patches as well to give an idea about the current state of in-kernel acceleration support. Changes: v10: * fixed&tested on SRIOV system * fixed multiple comments from David; started thinking that I might have to remove all acks and get them again :) * added bunch of iommu device attachment reworks v9: * rebased on top of SRIOV (which is in upstream now) * fixed multiple comments from David * reworked ownership patches * removed vfio: powerpc/spapr: Do cleanup when releasing the group (used to be #2) as updated #1 should do this * moved "powerpc/powernv: Implement accessor to TCE entry" to a separate patch * added a patch which moves TCE Kill register address to PE from IOMMU table v8: * fixed a bug in error fallback in "powerpc/mmu: Add userspace-to-physical addresses translation cache" * fixed subject in "vfio: powerpc/spapr: Check that IOMMU page is fully contained by system page" * moved v2 documentation to the correct patch * added checks for failed vzalloc() in "powerpc/iommu: Add userspace view of TCE table" v7: * moved memory preregistration to the current process's MMU context * added code preventing unregistration if some pages are still mapped; for this, there is a userspace view of the table is stored in iommu_table * added locked_vm counting for DDW tables (including userspace view of those) v6: * fixed a bunch of errors in "vfio: powerpc/spapr: Support Dynamic DMA windows" * moved static IOMMU properties from iommu_table_group to iommu_table_group_ops v5: * added SPAPR_TCE_IOMMU_v2 to tell the userspace that there is a memory pre-registration feature * added backward compatibility * renamed few things (mostly powerpc_iommu -> iommu_table_group) v4: * moved patches around to have VFIO and PPC patches separated as much as possible * now works with the existing 
upstream QEMU v3: * redesigned the whole thing * multiple IOMMU groups per PHB -> one PHB is needed for VFIO in the guest -> no problems with locked_vm counting; also we save memory on actual tables * guest RAM preregistration is required for DDW * PEs (IOMMU groups) are passed to VFIO with no DMA windows at all so we do not bother with iommu_table::it_map anymore * added multilevel TCE tables support to support really huge guests v2: * added missing __pa() in "powerpc/powernv: Release replaced TCE" * reposted to make some noise Alexey Kardashevskiy (34): powerpc/eeh/ioda2: Use device::iommu_group to check IOMMU group powerpc/iommu/powernv: Get rid of set_iommu_table_base_and_group powerpc/powernv/ioda: Clean up IOMMU group registration powerpc/iommu: Put IOMMU group explicitly powerpc/iommu: Always release iommu_table in iommu_free_table() vfio: powerpc/spapr: Move page pinning from arch code to VFIO IOMMU driver vfio: powerpc/spapr: Check that IOMMU page is full
[PATCH kernel v10 04/34] powerpc/iommu: Put IOMMU group explicitly
So far an iommu_table lifetime was the same as PE. Dynamic DMA windows will change this and iommu_free_table() will not always require the group to be released. This moves iommu_group_put() out of iommu_free_table(). This adds an iommu_pseries_free_table() helper which does iommu_group_put() and iommu_free_table(). Later it will be changed to receive a table_group and we will have to change fewer lines then. This should cause no behavioural change. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/iommu.c | 7 --- arch/powerpc/platforms/powernv/pci-ioda.c | 5 + arch/powerpc/platforms/pseries/iommu.c| 14 +- 3 files changed, 18 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index b054f33..3d47eb3 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -726,13 +726,6 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name) if (tbl->it_offset == 0) clear_bit(0, tbl->it_map); -#ifdef CONFIG_IOMMU_API - if (tbl->it_group) { - iommu_group_put(tbl->it_group); - BUG_ON(tbl->it_group); - } -#endif - /* verify that table contains no entries */ if (!bitmap_empty(tbl->it_map, tbl->it_size)) pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 8ca7abd..8c3c4bf 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -23,6 +23,7 @@ #include #include #include +#include #include #include @@ -1310,6 +1311,10 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe if (rc) pe_warn(pe, "OPAL error %ld release DMA window\n", rc); + if (tbl->it_group) { + iommu_group_put(tbl->it_group); + BUG_ON(tbl->it_group); + } iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); free_pages(addr, get_order(TCE32_TABLE_SIZE)); pe->tce32_table = NULL; diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c index 05ab06d..89f557b 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -36,6 +36,7 @@ #include #include #include +#include #include #include #include @@ -51,6 +52,16 @@ #include "pseries.h" +static void iommu_pseries_free_table(struct iommu_table *tbl, + const char *node_name) +{ + if (tbl->it_group) { + iommu_group_put(tbl->it_group); + BUG_ON(tbl->it_group); + } + iommu_free_table(tbl, node_name); +} + static void tce_invalidate_pSeries_sw(struct iommu_table *tbl, __be64 *startp, __be64 *endp) { @@ -1271,7 +1282,8 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti */ remove_ddw(np, false); if (pci && pci->iommu_table) - iommu_free_table(pci->iommu_table, np->full_name); + iommu_pseries_free_table(pci->iommu_table, + np->full_name); spin_lock(&direct_window_list_lock); list_for_each_entry(window, &direct_window_list, list) { -- 2.4.0.rc3.8.gfb3e7d5
[PATCH kernel v10 01/34] powerpc/eeh/ioda2: Use device::iommu_group to check IOMMU group
This relies on the fact that a PCI device always has an IOMMU table which may not be the case when we get dynamic DMA windows so let's use more reliable check for IOMMU group here. As we do not rely on the table presence here, remove the workaround from pnv_pci_ioda2_set_bypass(); also remove the @add_to_iommu_group parameter from pnv_ioda_setup_bus_dma(). Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/eeh.c | 4 +--- arch/powerpc/platforms/powernv/pci-ioda.c | 27 +-- 2 files changed, 6 insertions(+), 25 deletions(-) diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c index 9ee61d1..defd874 100644 --- a/arch/powerpc/kernel/eeh.c +++ b/arch/powerpc/kernel/eeh.c @@ -1412,13 +1412,11 @@ static int dev_has_iommu_table(struct device *dev, void *data) { struct pci_dev *pdev = to_pci_dev(dev); struct pci_dev **ppdev = data; - struct iommu_table *tbl; if (!dev) return 0; - tbl = get_iommu_table_base(dev); - if (tbl && tbl->it_group) { + if (dev->iommu_group) { *ppdev = pdev; return 1; } diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index f8bc950..2f092bb 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1654,21 +1654,15 @@ static u64 pnv_pci_ioda_dma_get_required_mask(struct pnv_phb *phb, } static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, - struct pci_bus *bus, - bool add_to_iommu_group) + struct pci_bus *bus) { struct pci_dev *dev; list_for_each_entry(dev, &bus->devices, bus_list) { - if (add_to_iommu_group) - set_iommu_table_base_and_group(&dev->dev, - pe->tce32_table); - else - set_iommu_table_base(&dev->dev, pe->tce32_table); + set_iommu_table_base_and_group(&dev->dev, pe->tce32_table); if (dev->subordinate) - pnv_ioda_setup_bus_dma(pe, dev->subordinate, - add_to_iommu_group); + pnv_ioda_setup_bus_dma(pe, dev->subordinate); } } @@ -1845,7 +1839,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, } else if 
(pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) { iommu_register_group(tbl, phb->hose->global_number, pe->pe_number); - pnv_ioda_setup_bus_dma(pe, pe->pbus, true); + pnv_ioda_setup_bus_dma(pe, pe->pbus); } else if (pe->flags & PNV_IODA_PE_VF) { iommu_register_group(tbl, phb->hose->global_number, pe->pe_number); @@ -1882,17 +1876,6 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) window_id, pe->tce_bypass_base, 0); - - /* -* EEH needs the mapping between IOMMU table and group -* of those VFIO/KVM pass-through devices. We can postpone -* resetting DMA ops until the DMA mask is configured in -* host side. -*/ - if (pe->pdev) - set_iommu_table_base(&pe->pdev->dev, tbl); - else - pnv_ioda_setup_bus_dma(pe, pe->pbus, false); } if (rc) pe_err(pe, "OPAL error %lld configuring bypass window\n", rc); @@ -1984,7 +1967,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, } else if (pe->flags & (PNV_IODA_PE_BUS | PNV_IODA_PE_BUS_ALL)) { iommu_register_group(tbl, phb->hose->global_number, pe->pe_number); - pnv_ioda_setup_bus_dma(pe, pe->pbus, true); + pnv_ioda_setup_bus_dma(pe, pe->pbus); } else if (pe->flags & PNV_IODA_PE_VF) { iommu_register_group(tbl, phb->hose->global_number, pe->pe_number); -- 2.4.0.rc3.8.gfb3e7d5