[PATCH V3 0/3] Numabalancing preserve write fix
This patch series addresses an issue with THP migration and the autonuma preserve-write feature. migrate_misplaced_transhuge_page() cannot deal with concurrent modification of the page: it does a page copy without following the migration pte sequence. IIUC, this was done to keep the migration simpler, and at the time of implementation we didn't have THP page cache, which would have required a more elaborate migration scheme. That means THP autonuma migration expects the protnone-with-saved-write conversion to be done such that neither the kernel nor userspace can update the page content. This patch series enables archs like ppc64 to do that. We are fine in hash translation mode with the current code, because we never create a hardware page table entry for a protnone pte.

Changes from V2:
* Fix kvm crashes due to ksm not clearing the savedwrite bit.

Changes from V1:
* Update the patch so that it applies cleanly to upstream.
* Add acked-by from Michael Neuling

Aneesh Kumar K.V (3):
  mm/autonuma: Let architecture override how the write bit should be stashed in a protnone pte.
  mm/ksm: Handle protnone saved writes when making page write protect
  powerpc/mm/autonuma: Switch ppc64 to its own implementation of saved write

 arch/powerpc/include/asm/book3s/64/pgtable.h | 52
 include/asm-generic/pgtable.h                | 24 +
 mm/huge_memory.c                             |  6 ++--
 mm/ksm.c                                     |  9 +++--
 mm/memory.c                                  |  2 +-
 mm/mprotect.c                                |  4 +--
 6 files changed, 82 insertions(+), 15 deletions(-)

--
2.7.4
[PATCH V3 1/3] mm/autonuma: Let architecture override how the write bit should be stashed in a protnone pte.
Autonuma preserves the write permission across numa fault to avoid taking a writefault after a numa fault (Commit: b191f9b106ea " mm: numa: preserve PTE write permissions across a NUMA hinting fault"). Architecture can implement protnone in different ways and some may choose to implement that by clearing Read/ Write/Exec bit of pte. Setting the write bit on such pte can result in wrong behaviour. Fix this up by allowing arch to override how to save the write bit on a protnone pte. Acked-By: Michael Neuling Signed-off-by: Aneesh Kumar K.V --- include/asm-generic/pgtable.h | 16 mm/huge_memory.c | 6 +++--- mm/memory.c | 2 +- mm/mprotect.c | 4 ++-- 4 files changed, 22 insertions(+), 6 deletions(-) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index 18af2bcefe6a..b6f3a8a4b738 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -192,6 +192,22 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres } #endif +#ifndef pte_savedwrite +#define pte_savedwrite pte_write +#endif + +#ifndef pte_mk_savedwrite +#define pte_mk_savedwrite pte_mkwrite +#endif + +#ifndef pmd_savedwrite +#define pmd_savedwrite pmd_write +#endif + +#ifndef pmd_mk_savedwrite +#define pmd_mk_savedwrite pmd_mkwrite +#endif + #ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline void pmdp_set_wrprotect(struct mm_struct *mm, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8f1d93257fb9..e6de801fa477 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1253,7 +1253,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) } /* See similar comment in do_numa_page for explanation */ - if (!pmd_write(pmd)) + if (!pmd_savedwrite(pmd)) flags |= TNF_NO_GROUP; /* @@ -1316,7 +1316,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd) goto out; clear_pmdnuma: BUG_ON(!PageLocked(page)); - was_writable = pmd_write(pmd); + was_writable = pmd_savedwrite(pmd); pmd = pmd_modify(pmd, vma->vm_page_prot); pmd = pmd_mkyoung(pmd); if (was_writable) @@ -1571,7 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd); entry = pmd_modify(entry, newprot); if (preserve_write) - entry = pmd_mkwrite(entry); + entry = pmd_mk_savedwrite(entry); ret = HPAGE_PMD_NR; set_pmd_at(mm, addr, pmd, entry); BUG_ON(vma_is_anonymous(vma) && !preserve_write && diff --git a/mm/memory.c b/mm/memory.c index 6bf2b471e30c..641b83dbff60 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3388,7 +3388,7 @@ static int do_numa_page(struct vm_fault *vmf) int target_nid; bool migrated = false; pte_t pte = vmf->orig_pte; - bool was_writable = pte_write(pte); + bool was_writable = pte_savedwrite(pte); int flags = 0; /* diff --git a/mm/mprotect.c b/mm/mprotect.c index f9c07f54dd62..15f5c174a7c1 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -113,13 +113,13 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd, ptent = ptep_modify_prot_start(mm, addr, pte); ptent = pte_modify(ptent, newprot); if (preserve_write) - ptent = pte_mkwrite(ptent); + ptent = pte_mk_savedwrite(ptent); /* Avoid taking write faults for known dirty pages */ if (dirty_accountable && pte_dirty(ptent) && (pte_soft_dirty(ptent) || !(vma->vm_flags & VM_SOFTDIRTY))) { - ptent = pte_mkwrite(ptent); + ptent = pte_mk_savedwrite(ptent); } ptep_modify_prot_commit(mm, addr, pte, ptent); pages++; -- 2.7.4
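A side note on the generic fallbacks above: an architecture opts out simply by providing its own definitions before asm-generic/pgtable.h is pulled in, which is exactly what patch 3 does for ppc64. A minimal sketch of that pattern, where arch_pte_has_saved_write() and arch_pte_stash_write() are hypothetical arch-internal helpers used only for illustration:

	/* in arch/<arch>/include/asm/pgtable.h, before asm-generic/pgtable.h */

	#define pte_savedwrite pte_savedwrite
	static inline bool pte_savedwrite(pte_t pte)
	{
		/* hypothetical helper: report the write bit stashed in a protnone pte */
		return arch_pte_has_saved_write(pte);
	}

	#define pte_mk_savedwrite pte_mk_savedwrite
	static inline pte_t pte_mk_savedwrite(pte_t pte)
	{
		/* hypothetical helper: stash write permission without making the pte writable */
		return arch_pte_stash_write(pte);
	}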
[PATCH V3 2/3] mm/ksm: Handle protnone saved writes when making page write protect
Without this KSM will consider the page write protected, but a numa fault can later mark the page writable. This can result in memory corruption. Signed-off-by: Aneesh Kumar K.V --- include/asm-generic/pgtable.h | 8 mm/ksm.c | 9 +++-- 2 files changed, 15 insertions(+), 2 deletions(-) diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h index b6f3a8a4b738..8c8ba48bef0b 100644 --- a/include/asm-generic/pgtable.h +++ b/include/asm-generic/pgtable.h @@ -200,6 +200,10 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres #define pte_mk_savedwrite pte_mkwrite #endif +#ifndef pte_clear_savedwrite +#define pte_clear_savedwrite pte_wrprotect +#endif + #ifndef pmd_savedwrite #define pmd_savedwrite pmd_write #endif @@ -208,6 +212,10 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres #define pmd_mk_savedwrite pmd_mkwrite #endif +#ifndef pmd_clear_savedwrite +#define pmd_clear_savedwrite pmd_wrprotect +#endif + #ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline void pmdp_set_wrprotect(struct mm_struct *mm, diff --git a/mm/ksm.c b/mm/ksm.c index 9ae6011a41f8..768202831578 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, if (!ptep) goto out_mn; - if (pte_write(*ptep) || pte_dirty(*ptep)) { + if (pte_write(*ptep) || pte_dirty(*ptep) || + (pte_protnone(*ptep) && pte_savedwrite(*ptep))) { pte_t entry; swapped = PageSwapCache(page); @@ -897,7 +898,11 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, } if (pte_dirty(entry)) set_page_dirty(page); - entry = pte_mkclean(pte_wrprotect(entry)); + + if (pte_protnone(entry)) + entry = pte_mkclean(pte_clear_savedwrite(entry)); + else + entry = pte_mkclean(pte_wrprotect(entry)); set_pte_at_notify(mm, addr, ptep, entry); } *orig_pte = *ptep; -- 2.7.4
[PATCH V3 3/3] powerpc/mm/autonuma: Switch ppc64 to its own implementation of saved write
With this our protnone becomes a present pte with READ/WRITE/EXEC bit cleared. By default we also set _PAGE_PRIVILEGED on such pte. This is now used to help us identify a protnone pte that as saved write bit. For such pte, we will clear the _PAGE_PRIVILEGED bit. The pte still remain non-accessible from both user and kernel. Acked-By: Michael Neuling Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 52 1 file changed, 45 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 6a55bbe91556..d87bee85fc44 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1,6 +1,9 @@ #ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ #define _ASM_POWERPC_BOOK3S_64_PGTABLE_H_ +#ifndef __ASSEMBLY__ +#include +#endif /* * Common bits between hash and Radix page table */ @@ -428,15 +431,47 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte) #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ #ifdef CONFIG_NUMA_BALANCING -/* - * These work without NUMA balancing but the kernel does not care. See the - * comment in include/asm-generic/pgtable.h . On powerpc, this will only - * work for user pages and always return true for kernel pages. - */ static inline int pte_protnone(pte_t pte) { - return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED)) == - cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED); + return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) == + cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE); +} + +#define pte_mk_savedwrite pte_mk_savedwrite +static inline pte_t pte_mk_savedwrite(pte_t pte) +{ + /* +* Used by Autonuma subsystem to preserve the write bit +* while marking the pte PROT_NONE. Only allow this +* on PROT_NONE pte +*/ + VM_BUG_ON((pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_RWX | _PAGE_PRIVILEGED)) != + cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED)); + return __pte(pte_val(pte) & ~_PAGE_PRIVILEGED); +} + +#define pte_clear_savedwrite pte_clear_savedwrite +static inline pte_t pte_clear_savedwrite(pte_t pte) +{ + /* +* Used by KSM subsystem to make a protnone pte readonly. +*/ + VM_BUG_ON(!pte_protnone(pte)); + return __pte(pte_val(pte) | _PAGE_PRIVILEGED); +} + +#define pte_savedwrite pte_savedwrite +static inline bool pte_savedwrite(pte_t pte) +{ + /* +* Saved write ptes are prot none ptes that doesn't have +* privileged bit sit. We mark prot none as one which has +* present and pviliged bit set and RWX cleared. To mark +* protnone which used to have _PAGE_WRITE set we clear +* the privileged bit. +*/ + VM_BUG_ON(!pte_protnone(pte)); + return !(pte_raw(pte) & cpu_to_be64(_PAGE_RWX | _PAGE_PRIVILEGED)); } #endif /* CONFIG_NUMA_BALANCING */ @@ -867,6 +902,8 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd) #define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd))) #define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd))) #define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd))) +#define pmd_mk_savedwrite(pmd) pte_pmd(pte_mk_savedwrite(pmd_pte(pmd))) +#define pmd_clear_savedwrite(pmd) pte_pmd(pte_clear_savedwrite(pmd_pte(pmd))) #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY #define pmd_soft_dirty(pmd)pte_soft_dirty(pmd_pte(pmd)) @@ -883,6 +920,7 @@ static inline int pmd_protnone(pmd_t pmd) #define __HAVE_ARCH_PMD_WRITE #define pmd_write(pmd) pte_write(pmd_pte(pmd)) +#define pmd_savedwrite(pmd)pte_savedwrite(pmd_pte(pmd)) #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); -- 2.7.4
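To summarize the encodings the helpers above rely on (a rough reading of the checks in this patch, hash mode only, not an authoritative table):

	/*
	 * protnone (default)      : _PAGE_PRESENT | _PAGE_PTE | _PAGE_PRIVILEGED, RWX clear
	 * protnone + saved write  : _PAGE_PRESENT | _PAGE_PTE, RWX and _PAGE_PRIVILEGED clear
	 * ordinary accessible pte : _PAGE_PRESENT | _PAGE_PTE plus whichever of RWX apply
	 *
	 * So pte_mk_savedwrite() clears _PAGE_PRIVILEGED, pte_clear_savedwrite()
	 * sets it back, and pte_savedwrite() tests that both RWX and
	 * _PAGE_PRIVILEGED are clear.
	 */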
[PATCH V3 00/10] powerpc/mm/ppc64: Add 128TB support
This patch series increases the effective virtual address range of applications from 64TB to 128TB. We do that by supporting a 68-bit virtual address. On platforms that can only do a 65-bit virtual address we limit the max contexts to a 16-bit value instead of 19 bits. The patch series also switches the page table layout so that we can cover a 512TB effective address range, but we still limit TASK_SIZE to 128TB. This was done to make sure we don't break applications that make assumptions about the max address returned by the OS. We can switch to 128TB without a Linux personality value because other architectures already use 128TB as their max address.

Changes from V2:
* Handle hugepage size correctly.

Aneesh Kumar K.V (10):
  powerpc/mm/slice: Convert slice_mask high slice to a bitmap
  powerpc/mm/slice: Update the function prototype
  powerpc/mm/hash: Move kernel context to the starting of context range
  powerpc/mm/hash: Support 68 bit VA
  powerpc/mm: Move copy_mm_to_paca to paca.c
  powerpc/mm: Remove redundant TASK_SIZE_USER64 checks
  powerpc/mm/slice: Use mm task_size as max value of slice index
  powerpc/mm/hash: Increase VA range to 128TB
  powerpc/mm/slice: Move slice_mask struct definition to slice.c
  powerpc/mm/slice: Update slice mask printing to use bitmap printing.

 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   2 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   2 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 160 -
 arch/powerpc/include/asm/mmu.h                |  19 ++-
 arch/powerpc/include/asm/mmu_context.h        |   2 +-
 arch/powerpc/include/asm/paca.h               |  18 +--
 arch/powerpc/include/asm/page_64.h            |  14 --
 arch/powerpc/include/asm/processor.h          |  22 ++-
 arch/powerpc/kernel/paca.c                    |  26
 arch/powerpc/kvm/book3s_64_mmu_host.c         |  10 +-
 arch/powerpc/mm/hash_utils_64.c               |   9 +-
 arch/powerpc/mm/init_64.c                     |   4 -
 arch/powerpc/mm/mmu_context_book3s64.c        |  96 +
 arch/powerpc/mm/pgtable_64.c                  |   5 -
 arch/powerpc/mm/slb.c                         |   2 +-
 arch/powerpc/mm/slb_low.S                     |  74 ++
 arch/powerpc/mm/slice.c                       | 195 +++---
 17 files changed, 394 insertions(+), 266 deletions(-)

--
2.7.4
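For reference, the rough address-space arithmetic behind the series (a back-of-the-envelope check against the constants introduced in patches 04 and 08, not text from the posting):

	/*
	 * SID_SHIFT    = 28                 (256MB segments)
	 * CONTEXT_BITS = 19
	 * ESID_BITS    = 68 - (28 + 19) = 21
	 * one context  = 2^(ESID_BITS + SID_SHIFT) = 2^49 bytes = 512TB of EA
	 * TASK_SIZE    stays capped at 2^47 = 128TB for now
	 * 65-bit-VA platforms: usable context bits = 65 - (28 + 21) = 16
	 */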
[PATCH V3 01/10] powerpc/mm/slice: Convert slice_mask high slice to a bitmap
In followup patch we want to increase the va range which will result in us requiring high_slices to have more than 64 bits. To enable this convert high_slices to bitmap. We keep the number bits same in this patch and later change that to higher value Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/page_64.h | 15 ++--- arch/powerpc/mm/slice.c| 110 + 2 files changed, 80 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h index dd5f0712afa2..7f72659b7999 100644 --- a/arch/powerpc/include/asm/page_64.h +++ b/arch/powerpc/include/asm/page_64.h @@ -98,19 +98,16 @@ extern u64 ppc64_pft_size; #define GET_LOW_SLICE_INDEX(addr) ((addr) >> SLICE_LOW_SHIFT) #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT) +#ifndef __ASSEMBLY__ /* - * 1 bit per slice and we have one slice per 1TB - * Right now we support only 64TB. - * IF we change this we will have to change the type - * of high_slices + * One bit per slice. We have lower slices which cover 256MB segments + * upto 4G range. That gets us 16 low slices. For the rest we track slices + * in 1TB size. + * 64 below is actually SLICE_NUM_HIGH to fixup complie errros */ -#define SLICE_MASK_SIZE 8 - -#ifndef __ASSEMBLY__ - struct slice_mask { u16 low_slices; - u64 high_slices; + DECLARE_BITMAP(high_slices, 64); }; struct mm_struct; diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 2b27458902ee..c4e718e38a03 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -36,11 +36,6 @@ #include #include -/* some sanity checks */ -#if (H_PGTABLE_RANGE >> 43) > SLICE_MASK_SIZE -#error H_PGTABLE_RANGE exceeds slice_mask high_slices size -#endif - static DEFINE_SPINLOCK(slice_convert_lock); @@ -49,7 +44,7 @@ int _slice_debug = 1; static void slice_print_mask(const char *label, struct slice_mask mask) { - char*p, buf[16 + 3 + 64 + 1]; + char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1]; int i; if (!_slice_debug) @@ -60,8 +55,12 @@ static void slice_print_mask(const char *label, struct slice_mask mask) *(p++) = ' '; *(p++) = '-'; *(p++) = ' '; - for (i = 0; i < SLICE_NUM_HIGH; i++) - *(p++) = (mask.high_slices & (1ul << i)) ? 
'1' : '0'; + for (i = 0; i < SLICE_NUM_HIGH; i++) { + if (test_bit(i, mask.high_slices)) + *(p++) = '1'; + else + *(p++) = '0'; + } *(p++) = 0; printk(KERN_DEBUG "%s:%s\n", label, buf); @@ -80,7 +79,10 @@ static struct slice_mask slice_range_to_mask(unsigned long start, unsigned long len) { unsigned long end = start + len - 1; - struct slice_mask ret = { 0, 0 }; + struct slice_mask ret; + + ret.low_slices = 0; + bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); if (start < SLICE_LOW_TOP) { unsigned long mend = min(end, SLICE_LOW_TOP); @@ -90,10 +92,13 @@ static struct slice_mask slice_range_to_mask(unsigned long start, - (1u << GET_LOW_SLICE_INDEX(mstart)); } - if ((start + len) > SLICE_LOW_TOP) - ret.high_slices = (1ul << (GET_HIGH_SLICE_INDEX(end) + 1)) - - (1ul << GET_HIGH_SLICE_INDEX(start)); + if ((start + len) > SLICE_LOW_TOP) { + unsigned long start_index = GET_HIGH_SLICE_INDEX(start); + unsigned long align_end = ALIGN(end, (1UL> (i * 4)) & 0xf) == psize) @@ -165,7 +176,7 @@ static struct slice_mask slice_mask_for_size(struct mm_struct *mm, int psize) mask_index = i & 0x1; index = i >> 1; if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize) - ret.high_slices |= 1ul << i; + __set_bit(i, ret.high_slices); } return ret; @@ -173,8 +184,13 @@ static struct slice_mask slice_mask_for_size(struct mm_struct *mm, int psize) static int slice_check_fit(struct slice_mask mask, struct slice_mask available) { + DECLARE_BITMAP(result, SLICE_NUM_HIGH); + + bitmap_and(result, mask.high_slices, + available.high_slices, SLICE_NUM_HIGH); + return (mask.low_slices & available.low_slices) == mask.low_slices && - (mask.high_slices & available.high_slices) == mask.high_slices; + bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH); } static void slice_flush_segments(void *parm) @@ -221,7 +237,7 @@ static void slice_convert(struct mm_struct *mm, struct slice_mask mask, int psiz for (i = 0; i < SLICE_NUM_HIGH; i++) { mask_index = i & 0x1;
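The conversion itself is mechanical: every open-coded `high_slices & (1ul << i)` style test turns into a bitmap operation. A small self-contained illustration of the new representation (kernel bitmap API only, not the actual slice code; SLICE_NUM_HIGH is still 64 at this point in the series):

	#include <linux/types.h>
	#include <linux/bitmap.h>
	#include <linux/bitops.h>

	#define EXAMPLE_NUM_HIGH 64

	struct example_slice_mask {
		u16 low_slices;
		DECLARE_BITMAP(high_slices, EXAMPLE_NUM_HIGH);
	};

	static bool example_mark_and_test(struct example_slice_mask *m, int i)
	{
		bitmap_zero(m->high_slices, EXAMPLE_NUM_HIGH);
		__set_bit(i, m->high_slices);		/* was: m->high_slices |= 1ul << i   */
		return test_bit(i, m->high_slices);	/* was: m->high_slices & (1ul << i)  */
	}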
[PATCH V3 02/10] powerpc/mm/slice: Update the function prototype
This avoid copying the slice_mask struct as function return value Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slice.c | 62 ++--- 1 file changed, 28 insertions(+), 34 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index c4e718e38a03..1cb0e98e70c0 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -75,20 +75,19 @@ static void slice_print_mask(const char *label, struct slice_mask mask) {} #endif -static struct slice_mask slice_range_to_mask(unsigned long start, -unsigned long len) +static void slice_range_to_mask(unsigned long start, unsigned long len, + struct slice_mask *ret) { unsigned long end = start + len - 1; - struct slice_mask ret; - ret.low_slices = 0; - bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); + ret->low_slices = 0; + bitmap_zero(ret->high_slices, SLICE_NUM_HIGH); if (start < SLICE_LOW_TOP) { unsigned long mend = min(end, SLICE_LOW_TOP); unsigned long mstart = min(start, SLICE_LOW_TOP); - ret.low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1)) + ret->low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1)) - (1u << GET_LOW_SLICE_INDEX(mstart)); } @@ -97,9 +96,8 @@ static struct slice_mask slice_range_to_mask(unsigned long start, unsigned long align_end = ALIGN(end, (1ULlow_slices = 0; + bitmap_zero(ret->high_slices, SLICE_NUM_HIGH); for (i = 0; i < SLICE_NUM_LOW; i++) if (!slice_low_has_vma(mm, i)) - ret.low_slices |= 1u << i; + ret->low_slices |= 1u << i; if (mm->task_size <= SLICE_LOW_TOP) - return ret; + return; for (i = 0; i < SLICE_NUM_HIGH; i++) if (!slice_high_has_vma(mm, i)) - __set_bit(i, ret.high_slices); - - return ret; + __set_bit(i, ret->high_slices); } -static struct slice_mask slice_mask_for_size(struct mm_struct *mm, int psize) +static void slice_mask_for_size(struct mm_struct *mm, int psize, struct slice_mask *ret) { unsigned char *hpsizes; int index, mask_index; - struct slice_mask ret; unsigned long i; u64 lpsizes; - ret.low_slices = 0; - bitmap_zero(ret.high_slices, SLICE_NUM_HIGH); + ret->low_slices = 0; + bitmap_zero(ret->high_slices, SLICE_NUM_HIGH); lpsizes = mm->context.low_slices_psize; for (i = 0; i < SLICE_NUM_LOW; i++) if (((lpsizes >> (i * 4)) & 0xf) == psize) - ret.low_slices |= 1u << i; + ret->low_slices |= 1u << i; hpsizes = mm->context.high_slices_psize; for (i = 0; i < SLICE_NUM_HIGH; i++) { mask_index = i & 0x1; index = i >> 1; if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize) - __set_bit(i, ret.high_slices); + __set_bit(i, ret->high_slices); } - - return ret; } static int slice_check_fit(struct slice_mask mask, struct slice_mask available) @@ -461,7 +453,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, /* First make up a "good" mask of slices that have the right size * already */ - good_mask = slice_mask_for_size(mm, psize); + slice_mask_for_size(mm, psize, &good_mask); slice_print_mask(" good_mask", good_mask); /* @@ -486,7 +478,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, #ifdef CONFIG_PPC_64K_PAGES /* If we support combo pages, we can allow 64k pages in 4k slices */ if (psize == MMU_PAGE_64K) { - compat_mask = slice_mask_for_size(mm, MMU_PAGE_4K); + slice_mask_for_size(mm, MMU_PAGE_4K, &compat_mask); if (fixed) slice_or_mask(&good_mask, &compat_mask); } @@ -495,7 +487,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, /* First check hint if it's valid or if we have MAP_FIXED */ if (addr !=
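In other words, the patch only changes the calling convention; the mask computation itself is untouched. Roughly, at the call sites:

	struct slice_mask good_mask;

	/* before: the whole struct (including the bitmap) was copied back on return */
	/* good_mask = slice_mask_for_size(mm, psize); */

	/* after: caller-provided storage, filled in place */
	slice_mask_for_size(mm, psize, &good_mask);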
[PATCH V3 03/10] powerpc/mm/hash: Move kernel context to the starting of context range
With current kernel, we use the top 4 context for the kernel. Kernel VSIDs are built using these top context values and effective segemnt ID. In the following patches, we want to increase the max effective address to 512TB. We achieve that by increasing the effective segments IDs there by increasing virtual address range. We will be switching to a 68bit virtual address in the following patch. But for platforms like p4 and p5, which only support a 65 bit va, we want to limit the virtual addrress to a 65 bit value. We do that by limiting the context bits to 16 instead of 19. That means we will have different max context values on different platforms. To make this simpler. we move the kernel context to the starting of the range. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 39 ++-- arch/powerpc/include/asm/mmu_context.h| 2 +- arch/powerpc/kvm/book3s_64_mmu_host.c | 2 +- arch/powerpc/mm/hash_utils_64.c | 5 -- arch/powerpc/mm/mmu_context_book3s64.c| 88 ++- arch/powerpc/mm/slb_low.S | 20 ++ 6 files changed, 84 insertions(+), 72 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 0735d5a8049f..014a9bb197cd 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -493,10 +493,10 @@ extern void slb_set_size(u16 size); * For user processes max context id is limited to ((1ul << 19) - 5) * for kernel space, we use the top 4 context ids to map address as below * NOTE: each context only support 64TB now. - * 0x7fffc - [ 0xc000 - 0xc0003fff ] - * 0x7fffd - [ 0xd000 - 0xd0003fff ] - * 0x7fffe - [ 0xe000 - 0xe0003fff ] - * 0x7 - [ 0xf000 - 0xf0003fff ] + * 0x0 - [ 0xc000 - 0xc0003fff ] + * 0x1 - [ 0xd000 - 0xd0003fff ] + * 0x2 - [ 0xe000 - 0xe0003fff ] + * 0x3 - [ 0xf000 - 0xf0003fff ] * * The proto-VSIDs are then scrambled into real VSIDs with the * multiplicative hash: @@ -510,15 +510,9 @@ extern void slb_set_size(u16 size); * robust scattering in the hash table (at least based on some initial * results). * - * We also consider VSID 0 special. We use VSID 0 for slb entries mapping - * bad address. This enables us to consolidate bad address handling in - * hash_page. - * * We also need to avoid the last segment of the last context, because that * would give a protovsid of 0x1f. That will result in a VSID 0 - * because of the modulo operation in vsid scramble. But the vmemmap - * (which is what uses region 0xf) will never be close to 64TB in size - * (it's 56 bytes per page of system memory). + * because of the modulo operation in vsid scramble. */ #define CONTEXT_BITS 19 @@ -530,12 +524,15 @@ extern void slb_set_size(u16 size); /* * 256MB segment * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments - * available for user + kernel mapping. The top 4 contexts are used for + * available for user + kernel mapping. The bottom 4 contexts are used for * kernel mapping. Each segment contains 2^28 bytes. Each - * context maps 2^46 bytes (64TB) so we can support 2^19-1 contexts - * (19 == 37 + 28 - 46). + * context maps 2^46 bytes (64TB). + * + * We also need to avoid the last segment of the last context, because that + * would give a protovsid of 0x1f. That will result in a VSID 0 + * because of the modulo operation in vsid scramble. 
*/ -#define MAX_USER_CONTEXT ((ASM_CONST(1) << CONTEXT_BITS) - 5) +#define MAX_USER_CONTEXT ((ASM_CONST(1) << CONTEXT_BITS) - 2) /* * This should be computed such that protovosid * vsid_mulitplier @@ -671,19 +668,19 @@ static inline unsigned long get_vsid(unsigned long context, unsigned long ea, * This is only valid for addresses >= PAGE_OFFSET * * For kernel space, we use the top 4 context ids to map address as below - * 0x7fffc - [ 0xc000 - 0xc0003fff ] - * 0x7fffd - [ 0xd000 - 0xd0003fff ] - * 0x7fffe - [ 0xe000 - 0xe0003fff ] - * 0x7 - [ 0xf000 - 0xf0003fff ] + * 0x0 - [ 0xc000 - 0xc0003fff ] + * 0x1 - [ 0xd000 - 0xd0003fff ] + * 0x2 - [ 0xe000 - 0xe0003fff ] + * 0x3 - [ 0xf000 - 0xf0003fff ] */ static inline unsigned long get_kernel_vsid(unsigned long ea, int ssize) { unsigned long context; /* -* kernel take the top 4 context from the available range +* kernel take the first 4 context from the available range */ - context = (MAX_USER_CONTEXT) + ((ea >> 60) - 0xc) + 1; + context = (ea >> 60) -
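With the kernel contexts now at the bottom of the range, the kernel VSID context comes straight from the top nibble of the effective address. A worked example of the new get_kernel_vsid() computation, assuming the line truncated above completes as `context = (ea >> 60) - 0xc;` (which is what the updated comment block implies):

	/*
	 * context = (ea >> 60) - 0xc
	 *   ea = 0xc000....  ->  0xc - 0xc = context 0
	 *   ea = 0xd000....  ->  0xd - 0xc = context 1
	 *   ea = 0xe000....  ->  0xe - 0xc = context 2
	 *   ea = 0xf000....  ->  0xf - 0xc = context 3
	 */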
[PATCH V3 04/10] powerpc/mm/hash: Support 68 bit VA
Inorder to support large effective address range (512TB), we want to increase the virtual address bits to 68. But we do have platforms like p4 and p5 that can only do 65 bit VA. We support those platforms by limiting context bits on them to 16. The protovsid -> vsid conversion is verified to work with both 65 and 68 bit va values. I also documented the restrictions in a table format as part of code comments. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/mmu-hash.h | 123 -- arch/powerpc/include/asm/mmu.h| 19 ++-- arch/powerpc/kvm/book3s_64_mmu_host.c | 8 +- arch/powerpc/mm/mmu_context_book3s64.c| 8 +- arch/powerpc/mm/slb_low.S | 54 +-- 5 files changed, 150 insertions(+), 62 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h b/arch/powerpc/include/asm/book3s/64/mmu-hash.h index 014a9bb197cd..97ccd8ae6c75 100644 --- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h +++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h @@ -39,6 +39,7 @@ /* Bits in the SLB VSID word */ #define SLB_VSID_SHIFT 12 +#define SLB_VSID_SHIFT_256M12 #define SLB_VSID_SHIFT_1T 24 #define SLB_VSID_SSIZE_SHIFT 62 #define SLB_VSID_B ASM_CONST(0xc000) @@ -515,9 +516,19 @@ extern void slb_set_size(u16 size); * because of the modulo operation in vsid scramble. */ +/* + * Max Va bits we support as of now is 68 bits. We want 19 bit + * context ID. + * Restrictions: + * GPU has restrictions of not able to access beyond 128TB + * (47 bit effective address). We also cannot do more than 20bit PID. + * For p4 and p5 which can only do 65 bit VA, we restrict our CONTEXT_BITS + * to 16 bits (ie, we can only have 2^16 pids at the same time). + */ +#define VA_BITS68 #define CONTEXT_BITS 19 -#define ESID_BITS 18 -#define ESID_BITS_1T 6 +#define ESID_BITS (VA_BITS - (SID_SHIFT + CONTEXT_BITS)) +#define ESID_BITS_1T (VA_BITS - (SID_SHIFT_1T + CONTEXT_BITS)) #define ESID_BITS_MASK ((1 << ESID_BITS) - 1) #define ESID_BITS_1T_MASK ((1 << ESID_BITS_1T) - 1) @@ -526,62 +537,54 @@ extern void slb_set_size(u16 size); * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments * available for user + kernel mapping. The bottom 4 contexts are used for * kernel mapping. Each segment contains 2^28 bytes. Each - * context maps 2^46 bytes (64TB). + * context maps 2^49 bytes (512TB). * * We also need to avoid the last segment of the last context, because that * would give a protovsid of 0x1f. That will result in a VSID 0 * because of the modulo operation in vsid scramble. */ #define MAX_USER_CONTEXT ((ASM_CONST(1) << CONTEXT_BITS) - 2) +/* + * For platforms that support on 65bit VA we limit the context bits + */ +#define MAX_USER_CONTEXT_65BIT_VA ((ASM_CONST(1) << (65 - (SID_SHIFT + ESID_BITS))) - 2) /* * This should be computed such that protovosid * vsid_mulitplier * doesn't overflow 64 bits. It should also be co-prime to vsid_modulus + * We also need to make sure that number of bits in divisor is less + * than twice the number of protovsid bits for our modulus optmization to work. + * The below table shows the current values used. 
+ * + * |---++++--| + * | | Prime Bits | VSID_BITS_65VA | Total Bits | 2* VSID_BITS | + * |---++++--| + * | 1T| 24 | 25 | 49 | 50 | + * |---++++--| + * | 256MB | 24 | 37 | 61 | 74 | + * |---++++--| + * + * |---++++--| + * | | Prime Bits | VSID_BITS_68VA | Total Bits | 2* VSID_BITS | + * |---++++--| + * | 1T| 24 | 28 | 52 | 56 | + * |---++++--| + * | 256MB | 24 | 40 | 64 | 80 | + * |---++++--| + * */ #define VSID_MULTIPLIER_256M ASM_CONST(12538073) /* 24-bit prime */ -#define VSID_BITS_256M (CONTEXT_BITS + ESID_BITS) +#define VSID_BITS_256M (VA_BITS - SID_SHIFT) #define VSID_MODULUS_256M ((1UL<= \ -* 2^36-1, then r3+1 has the 2^36 bit set. So, if r3+1 has \ -* the bit clear, r3 already has the answer we want, if it \ -* doesn't, the answer is the low 36 bits of r3+1. So in all \ -* cases the answer is the low 36 bits of (r3 + ((r3+1) >> 36))*/\ - addirx,rt,1;\ -
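As a quick sanity check of the "divisor bits less than twice the protovsid bits" constraint tabulated above, take the 68-bit VA, 256MB segment case:

	/*
	 * VSID_BITS_256M (68-bit VA) = VA_BITS - SID_SHIFT = 68 - 28 = 40
	 * multiplier                 = 24-bit prime
	 * protovsid * multiplier     needs 40 + 24 = 64 bits (still fits in a u64)
	 * and 64 < 2 * 40 = 80, so the modulo-reduction trick in vsid scramble
	 * remains valid, matching the last row of the table.
	 */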
[PATCH V3 05/10] powerpc/mm: Move copy_mm_to_paca to paca.c
We will be updating this later to use struct mm_struct. Move this so that function finds the definition of struct mm_struct; Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/paca.h | 18 +- arch/powerpc/kernel/paca.c | 19 +++ arch/powerpc/mm/hash_utils_64.c | 4 ++-- arch/powerpc/mm/slb.c | 2 +- arch/powerpc/mm/slice.c | 2 +- 5 files changed, 24 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 6a6792bb39fb..f25d3c93a30f 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -207,23 +207,7 @@ struct paca_struct { #endif }; -#ifdef CONFIG_PPC_BOOK3S -static inline void copy_mm_to_paca(mm_context_t *context) -{ - get_paca()->mm_ctx_id = context->id; -#ifdef CONFIG_PPC_MM_SLICES - get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize; - memcpy(&get_paca()->mm_ctx_high_slices_psize, - &context->high_slices_psize, SLICE_ARRAY_SIZE); -#else - get_paca()->mm_ctx_user_psize = context->user_psize; - get_paca()->mm_ctx_sllp = context->sllp; -#endif -} -#else -static inline void copy_mm_to_paca(mm_context_t *context){} -#endif - +extern void copy_mm_to_paca(struct mm_struct *mm); extern struct paca_struct *paca; extern void initialise_paca(struct paca_struct *new_paca, int cpu); extern void setup_paca(struct paca_struct *new_paca); diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index fa20060ff7a5..b64daf124fee 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -244,3 +244,22 @@ void __init free_unused_pacas(void) free_lppacas(); } + +void copy_mm_to_paca(struct mm_struct *mm) +{ +#ifdef CONFIG_PPC_BOOK3S + mm_context_t *context = &mm->context; + + get_paca()->mm_ctx_id = context->id; +#ifdef CONFIG_PPC_MM_SLICES + get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize; + memcpy(&get_paca()->mm_ctx_high_slices_psize, + &context->high_slices_psize, SLICE_ARRAY_SIZE); +#else /* CONFIG_PPC_MM_SLICES */ + get_paca()->mm_ctx_user_psize = context->user_psize; + get_paca()->mm_ctx_sllp = context->sllp; +#endif +#else /* CONFIG_PPC_BOOK3S */ + return; +#endif +} diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 978314b6b8d7..67937a6eb541 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -1084,7 +1084,7 @@ void demote_segment_4k(struct mm_struct *mm, unsigned long addr) copro_flush_all_slbs(mm); if ((get_paca_psize(addr) != MMU_PAGE_4K) && (current->mm == mm)) { - copy_mm_to_paca(&mm->context); + copy_mm_to_paca(mm); slb_flush_and_rebolt(); } } @@ -1156,7 +1156,7 @@ static void check_paca_psize(unsigned long ea, struct mm_struct *mm, { if (user_region) { if (psize != get_paca_psize(ea)) { - copy_mm_to_paca(&mm->context); + copy_mm_to_paca(mm); slb_flush_and_rebolt(); } } else if (get_paca()->vmalloc_sllp != diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c index 48fc28bab544..15157b14b0b6 100644 --- a/arch/powerpc/mm/slb.c +++ b/arch/powerpc/mm/slb.c @@ -227,7 +227,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm) asm volatile("slbie %0" : : "r" (slbie_data)); get_paca()->slb_cache_ptr = 0; - copy_mm_to_paca(&mm->context); + copy_mm_to_paca(mm); /* * preload some userspace segments into the SLB. 
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 1cb0e98e70c0..da67b91f46d3 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -193,7 +193,7 @@ static void slice_flush_segments(void *parm) if (mm != current->active_mm) return; - copy_mm_to_paca(¤t->active_mm->context); + copy_mm_to_paca(current->active_mm); local_irq_save(flags); slb_flush_and_rebolt(); -- 2.7.4
[PATCH V3 06/10] powerpc/mm: Remove redundant TASK_SIZE_USER64 checks
The check against VSID range is implied when we check task size against hash and radix pgtable range[1], because we make sure page table range cannot exceed vsid range. [1] BUILD_BUG_ON(TASK_SIZE_USER64 > H_PGTABLE_RANGE); BUILD_BUG_ON(TASK_SIZE_USER64 > RADIX_PGTABLE_RANGE); The check for smaller task size is also removed here, because the follow up patch will support a tasksize smaller than pgtable range. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/init_64.c| 4 arch/powerpc/mm/pgtable_64.c | 5 - 2 files changed, 9 deletions(-) diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 93abf8a9813d..f3e856e6ee23 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -69,10 +69,6 @@ #if H_PGTABLE_RANGE > USER_VSID_RANGE #warning Limited user VSID range means pagetable space is wasted #endif - -#if (TASK_SIZE_USER64 < H_PGTABLE_RANGE) && (TASK_SIZE_USER64 < USER_VSID_RANGE) -#warning TASK_SIZE is smaller than it needs to be. -#endif #endif /* CONFIG_PPC_STD_MMU_64 */ phys_addr_t memstart_addr = ~0; diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 8bca7f58afc4..06e23e0b1b81 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -55,11 +55,6 @@ #include "mmu_decl.h" -#ifdef CONFIG_PPC_STD_MMU_64 -#if TASK_SIZE_USER64 > (1UL << (ESID_BITS + SID_SHIFT)) -#error TASK_SIZE_USER64 exceeds user VSID range -#endif -#endif #ifdef CONFIG_PPC_BOOK3S_64 /* -- 2.7.4
[PATCH V3 07/10] powerpc/mm/slice: Use mm task_size as max value of slice index
In the followup patch, we will increase the slice array sice to handle 512TB range, but will limit the task size to 128TB. Avoid doing uncessary computation and avoid doing slice mask related operation above task_size. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slice.c | 22 -- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index da67b91f46d3..f286b7839a12 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -145,7 +145,7 @@ static void slice_mask_for_free(struct mm_struct *mm, struct slice_mask *ret) if (mm->task_size <= SLICE_LOW_TOP) return; - for (i = 0; i < SLICE_NUM_HIGH; i++) + for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) if (!slice_high_has_vma(mm, i)) __set_bit(i, ret->high_slices); } @@ -166,7 +166,7 @@ static void slice_mask_for_size(struct mm_struct *mm, int psize, struct slice_ma ret->low_slices |= 1u << i; hpsizes = mm->context.high_slices_psize; - for (i = 0; i < SLICE_NUM_HIGH; i++) { + for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) { mask_index = i & 0x1; index = i >> 1; if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize) @@ -174,15 +174,17 @@ static void slice_mask_for_size(struct mm_struct *mm, int psize, struct slice_ma } } -static int slice_check_fit(struct slice_mask mask, struct slice_mask available) +static int slice_check_fit(struct mm_struct *mm, + struct slice_mask mask, struct slice_mask available) { DECLARE_BITMAP(result, SLICE_NUM_HIGH); + unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->task_size); bitmap_and(result, mask.high_slices, - available.high_slices, SLICE_NUM_HIGH); + available.high_slices, slice_count); return (mask.low_slices & available.low_slices) == mask.low_slices && - bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH); + bitmap_equal(result, mask.high_slices, slice_count); } static void slice_flush_segments(void *parm) @@ -226,7 +228,7 @@ static void slice_convert(struct mm_struct *mm, struct slice_mask mask, int psiz mm->context.low_slices_psize = lpsizes; hpsizes = mm->context.high_slices_psize; - for (i = 0; i < SLICE_NUM_HIGH; i++) { + for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) { mask_index = i & 0x1; index = i >> 1; if (test_bit(i, mask.high_slices)) @@ -493,7 +495,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, /* Check if we fit in the good mask. 
If we do, we just return, * nothing else to do */ - if (slice_check_fit(mask, good_mask)) { + if (slice_check_fit(mm, mask, good_mask)) { slice_dbg(" fits good !\n"); return addr; } @@ -518,7 +520,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, slice_or_mask(&potential_mask, &good_mask); slice_print_mask(" potential", potential_mask); - if ((addr != 0 || fixed) && slice_check_fit(mask, potential_mask)) { + if ((addr != 0 || fixed) && slice_check_fit(mm, mask, potential_mask)) { slice_dbg(" fits potential !\n"); goto convert; } @@ -666,7 +668,7 @@ void slice_set_user_psize(struct mm_struct *mm, unsigned int psize) mm->context.low_slices_psize = lpsizes; hpsizes = mm->context.high_slices_psize; - for (i = 0; i < SLICE_NUM_HIGH; i++) { + for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) { mask_index = i & 0x1; index = i >> 1; if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == old_psize) @@ -743,6 +745,6 @@ int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr, slice_print_mask(" mask", mask); slice_print_mask(" available", available); #endif - return !slice_check_fit(mask, available); + return !slice_check_fit(mm, mask, available); } #endif -- 2.7.4
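The new loop bound is just GET_HIGH_SLICE_INDEX() applied to the task size. Assuming the usual SLICE_HIGH_SHIFT of 40 (1TB high slices), the numbers work out as:

	/*
	 * GET_HIGH_SLICE_INDEX(addr) = addr >> SLICE_HIGH_SHIFT   (shift of 40)
	 *   task_size =  64TB = 2^46  ->  64 high slices walked
	 *   task_size = 128TB = 2^47  -> 128 high slices walked
	 * instead of unconditionally iterating SLICE_NUM_HIGH entries.
	 */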
[PATCH V3 08/10] powerpc/mm/hash: Increase VA range to 128TB
We update the hash linux page table layout such that we can support 512TB. But we limit the TASK_SIZE to 128TB. We can switch to 128TB by default without conditional because that is the max virtual address supported by other architectures. We will later add a mechanism to on-demand increase the application's effective address range to 512TB. Having the page table layout changed to accommodate 512TB makes testing large memory configuration easier with less code changes to kernel Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/hash-4k.h | 2 +- arch/powerpc/include/asm/book3s/64/hash-64k.h | 2 +- arch/powerpc/include/asm/page_64.h| 2 +- arch/powerpc/include/asm/processor.h | 22 ++ arch/powerpc/kernel/paca.c| 9 - arch/powerpc/mm/slice.c | 2 ++ 6 files changed, 31 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h index 0c4e470571ca..b4b5e6b671ca 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h @@ -8,7 +8,7 @@ #define H_PTE_INDEX_SIZE 9 #define H_PMD_INDEX_SIZE 7 #define H_PUD_INDEX_SIZE 9 -#define H_PGD_INDEX_SIZE 9 +#define H_PGD_INDEX_SIZE 12 #ifndef __ASSEMBLY__ #define H_PTE_TABLE_SIZE (sizeof(pte_t) << H_PTE_INDEX_SIZE) diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h index b39f0b86405e..682c4eb28fa4 100644 --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h @@ -4,7 +4,7 @@ #define H_PTE_INDEX_SIZE 8 #define H_PMD_INDEX_SIZE 5 #define H_PUD_INDEX_SIZE 5 -#define H_PGD_INDEX_SIZE 12 +#define H_PGD_INDEX_SIZE 15 /* * 64k aligned address free up few of the lower bits of RPN for us diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h index 7f72659b7999..9b60e9455c6e 100644 --- a/arch/powerpc/include/asm/page_64.h +++ b/arch/powerpc/include/asm/page_64.h @@ -107,7 +107,7 @@ extern u64 ppc64_pft_size; */ struct slice_mask { u16 low_slices; - DECLARE_BITMAP(high_slices, 64); + DECLARE_BITMAP(high_slices, 512); }; struct mm_struct; diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h index 1ba814436c73..1d4e34f9004d 100644 --- a/arch/powerpc/include/asm/processor.h +++ b/arch/powerpc/include/asm/processor.h @@ -102,11 +102,25 @@ void release_thread(struct task_struct *); #endif #ifdef CONFIG_PPC64 -/* 64-bit user address space is 46-bits (64TB user VM) */ -#define TASK_SIZE_USER64 (0x4000UL) +/* + * 64-bit user address space can have multiple limits + * For now supported values are: + */ +#define TASK_SIZE_64TB (0x4000UL) +#define TASK_SIZE_128TB (0x8000UL) +#define TASK_SIZE_512TB (0x0002UL) -/* - * 32-bit user address space is 4GB - 1 page +#ifdef CONFIG_PPC_BOOK3S_64 +/* + * MAx value currently used: + */ +#define TASK_SIZE_USER64 TASK_SIZE_128TB +#else +#define TASK_SIZE_USER64 TASK_SIZE_64TB +#endif + +/* + * 32-bit user address space is 4GB - 1 page * (this 1 page is needed so referencing of 0x generates EFAULT */ #define TASK_SIZE_USER32 (0x0001UL - (1*PAGE_SIZE)) diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c index b64daf124fee..c7ca70dc3ba5 100644 --- a/arch/powerpc/kernel/paca.c +++ b/arch/powerpc/kernel/paca.c @@ -253,8 +253,15 @@ void copy_mm_to_paca(struct mm_struct *mm) get_paca()->mm_ctx_id = context->id; #ifdef CONFIG_PPC_MM_SLICES get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize; + /* +* We support upto 
128TB for now. Hence copy only 128/2 bytes. +* Later when we support tasks with different max effective +* address, we can optimize this based on mm->task_size. +*/ + BUILD_BUG_ON(TASK_SIZE_USER64 != TASK_SIZE_128TB); memcpy(&get_paca()->mm_ctx_high_slices_psize, - &context->high_slices_psize, SLICE_ARRAY_SIZE); + &context->high_slices_psize, TASK_SIZE_128TB >> 41); + #else /* CONFIG_PPC_MM_SLICES */ get_paca()->mm_ctx_user_psize = context->user_psize; get_paca()->mm_ctx_sllp = context->sllp; diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index f286b7839a12..fd2c85e951bd 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -412,6 +412,8 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, struct mm_struct *mm = current->mm; unsigned long newaddr; + /* Make sure high_slices bitmap size is same as we expected */ + BUILD_BUG_ON(512 != SLICE_NUM_HIGH); /* * init different masks */ -- 2.7.4
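The `TASK_SIZE_128TB >> 41` used for the paca copy above works out as follows (just checking the arithmetic: high_slices_psize stores 4 bits per 1TB slice, i.e. one byte per 2TB):

	/*
	 * 128TB >> 41 = 2^47 / 2^41 = 64 bytes copied,
	 * which matches 128 high slices * 4 bits = 512 bits = 64 bytes.
	 */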
[PATCH V3 09/10] powerpc/mm/slice: Move slice_mask struct definition to slice.c
This structure definition need not be in a header since this is used only by slice.c file. So move it to slice.c. This also allow us to use SLICE_NUM_HIGH instead of 512 and also helps in getting rid of one BUILD_BUG_ON(). I also switch the low_slices type to u64 from u16. This doesn't have an impact on size of struct due to padding added with u16 type. This helps in using bitmap printing function for printing slice mask. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/page_64.h | 11 --- arch/powerpc/mm/slice.c| 13 ++--- 2 files changed, 10 insertions(+), 14 deletions(-) diff --git a/arch/powerpc/include/asm/page_64.h b/arch/powerpc/include/asm/page_64.h index 9b60e9455c6e..3ecfc2734b50 100644 --- a/arch/powerpc/include/asm/page_64.h +++ b/arch/powerpc/include/asm/page_64.h @@ -99,17 +99,6 @@ extern u64 ppc64_pft_size; #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT) #ifndef __ASSEMBLY__ -/* - * One bit per slice. We have lower slices which cover 256MB segments - * upto 4G range. That gets us 16 low slices. For the rest we track slices - * in 1TB size. - * 64 below is actually SLICE_NUM_HIGH to fixup complie errros - */ -struct slice_mask { - u16 low_slices; - DECLARE_BITMAP(high_slices, 512); -}; - struct mm_struct; extern unsigned long slice_get_unmapped_area(unsigned long addr, diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index fd2c85e951bd..8eedb7382942 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -37,7 +37,16 @@ #include static DEFINE_SPINLOCK(slice_convert_lock); - +/* + * One bit per slice. We have lower slices which cover 256MB segments + * upto 4G range. That gets us 16 low slices. For the rest we track slices + * in 1TB size. + * 64 below is actually SLICE_NUM_HIGH to fixup complie errros + */ +struct slice_mask { + u64 low_slices; + DECLARE_BITMAP(high_slices, SLICE_NUM_HIGH); +}; #ifdef DEBUG int _slice_debug = 1; @@ -412,8 +421,6 @@ unsigned long slice_get_unmapped_area(unsigned long addr, unsigned long len, struct mm_struct *mm = current->mm; unsigned long newaddr; - /* Make sure high_slices bitmap size is same as we expected */ - BUILD_BUG_ON(512 != SLICE_NUM_HIGH); /* * init different masks */ -- 2.7.4
[PATCH V3 10/10] powerpc/mm/slice: Update slice mask printing to use bitmap printing.
We now get output like below which is much better. [0.935306] good_mask low_slice: 0-15 [0.935360] good_mask high_slice: 0-511 Compared to [0.953414] good_mask: - 1. I also fixed an error with slice_dbg printing. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/slice.c | 30 +++--- 1 file changed, 7 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c index 8eedb7382942..fce1734ab8a3 100644 --- a/arch/powerpc/mm/slice.c +++ b/arch/powerpc/mm/slice.c @@ -53,29 +53,13 @@ int _slice_debug = 1; static void slice_print_mask(const char *label, struct slice_mask mask) { - char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1]; - int i; - if (!_slice_debug) return; - p = buf; - for (i = 0; i < SLICE_NUM_LOW; i++) - *(p++) = (mask.low_slices & (1 << i)) ? '1' : '0'; - *(p++) = ' '; - *(p++) = '-'; - *(p++) = ' '; - for (i = 0; i < SLICE_NUM_HIGH; i++) { - if (test_bit(i, mask.high_slices)) - *(p++) = '1'; - else - *(p++) = '0'; - } - *(p++) = 0; - - printk(KERN_DEBUG "%s:%s\n", label, buf); + pr_devel("%s low_slice: %*pbl\n", label, (int)SLICE_NUM_LOW, &mask.low_slices); + pr_devel("%s high_slice: %*pbl\n", label, (int)SLICE_NUM_HIGH, mask.high_slices); } -#define slice_dbg(fmt...) do { if (_slice_debug) pr_debug(fmt); } while(0) +#define slice_dbg(fmt...) do { if (_slice_debug) pr_devel(fmt); } while (0) #else @@ -247,8 +231,8 @@ static void slice_convert(struct mm_struct *mm, struct slice_mask mask, int psiz } slice_dbg(" lsps=%lx, hsps=%lx\n", - mm->context.low_slices_psize, - mm->context.high_slices_psize); + (unsigned long)mm->context.low_slices_psize, + (unsigned long)mm->context.high_slices_psize); spin_unlock_irqrestore(&slice_convert_lock, flags); @@ -690,8 +674,8 @@ void slice_set_user_psize(struct mm_struct *mm, unsigned int psize) slice_dbg(" lsps=%lx, hsps=%lx\n", - mm->context.low_slices_psize, - mm->context.high_slices_psize); + (unsigned long)mm->context.low_slices_psize, + (unsigned long)mm->context.high_slices_psize); bail: spin_unlock_irqrestore(&slice_convert_lock, flags); -- 2.7.4
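For readers unfamiliar with the format specifier: `%*pbl` takes the bitmap length in bits as its width argument and prints the set bits as a range list, which is where the `0-15` / `0-511` output above comes from. A tiny self-contained usage sketch (not from the patch):

	#include <linux/types.h>
	#include <linux/bitmap.h>
	#include <linux/printk.h>

	static void print_mask_example(void)
	{
		DECLARE_BITMAP(bits, 16);

		bitmap_fill(bits, 16);
		/* width argument = number of valid bits; prints "mask: 0-15" */
		pr_info("mask: %*pbl\n", 16, bits);
	}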
[PATCH] powerpc/mm: Add translation mode information in /proc/cpuinfo
With this we have on powernv and pseries /proc/cpuinfo reporting timebase: 51200 platform: PowerNV model : 8247-22L machine : PowerNV 8247-22L firmware: OPAL translation : Hash Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/platforms/powernv/setup.c | 4 arch/powerpc/platforms/pseries/setup.c | 4 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c index d50c7d99baaf..d38571e289bb 100644 --- a/arch/powerpc/platforms/powernv/setup.c +++ b/arch/powerpc/platforms/powernv/setup.c @@ -95,6 +95,10 @@ static void pnv_show_cpuinfo(struct seq_file *m) else seq_printf(m, "firmware\t: BML\n"); of_node_put(root); + if (radix_enabled()) + seq_printf(m, "translation\t: Radix\n"); + else + seq_printf(m, "translation\t: Hash\n"); } static void pnv_prepare_going_down(void) diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 7736352f7279..6576fe306561 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -86,6 +86,10 @@ static void pSeries_show_cpuinfo(struct seq_file *m) model = of_get_property(root, "model", NULL); seq_printf(m, "machine\t\t: CHRP %s\n", model); of_node_put(root); + if (radix_enabled()) + seq_printf(m, "translation\t: Radix\n"); + else + seq_printf(m, "translation\t: Hash\n"); } /* Initialize firmware assisted non-maskable interrupts if -- 2.7.4
[PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
Without this if firmware reports 1MB page size support we will crash trying to use 1MB as hugetlb page size. echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19! . [c000e2c27b30] c029dae8 .hugetlb_fault+0x638/0xda0 [c000e2c27c30] c026fb64 .handle_mm_fault+0x844/0x1d70 [c000e2c27d70] c004805c .do_page_fault+0x3dc/0x7c0 [c000e2c27e30] c000ac98 handle_page_fault+0x10/0x30 With fix, we don't enable 1MB as hugepage size. bash-4.2# cd /sys/kernel/mm/hugepages/ bash-4.2# ls hugepages-16384kB hugepages-16777216kB Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/hugetlbpage.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 8c3389cbcd12..a4f33de4008e 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -753,6 +753,24 @@ static int __init add_huge_page_size(unsigned long long size) if ((mmu_psize = shift_to_mmu_psize(shift)) < 0) return -EINVAL; +#ifdef CONFIG_PPC_BOOK3S_64 + /* +* We need to make sure that for different page sizes reported by +* firmware we only add hugetlb support for page sizes that can be +* supported by linux page table layout. +* For now we have +* Radix: 2M +* Hash: 16M and 16G +*/ + if (radix_enabled()) { + if (mmu_psize != MMU_PAGE_2M) + return -EINVAL; + } else { + if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) + return -EINVAL; + } +#endif + BUG_ON(mmu_psize_defs[mmu_psize].shift != shift); /* Return if huge page size has already been setup */ -- 2.7.4
Re: [PATCH V3 0/3] Numabalancing preserve write fix
I am not sure whether we want to merge this debug patch. This will help us in identifying wrong pte_wrprotect usage in the kernel. >From a0fb302fd204159a1327b67decb8f14ffa21 Mon Sep 17 00:00:00 2001 From: "Aneesh Kumar K.V" Date: Sat, 18 Feb 2017 10:39:47 +0530 Subject: [PATCH] powerpc/autonuma: Add debug check for wrong writable pte check With ppc64, protnone ptes don't use _PAGE_WRITE bit for savedwrite. Hence we need to make sure we don't do pte_write* functions on protnone ptes. Add debug check to catch wrong usage. This should be only used for debugging and can give wrong results w.r.t change bit on radix. Even on hash with kvm we will insert the page table entry in guest hash page table with write bit set, even if the pte is marked protnone. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 130 +-- 1 file changed, 85 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index d87bee85fc44..1c99deac3966 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -341,10 +341,36 @@ static inline int __ptep_test_and_clear_young(struct mm_struct *mm, __r;\ }) +#undef SAVED_WRITE_DEBUG +#ifdef CONFIG_NUMA_BALANCING +static inline int pte_protnone(pte_t pte) +{ + /* +* We want to catch wrong usage of pte_write w.r.t protnone ptes. +* The way we do that is to make saved write as _PAGE_WRITE for hash +* translation mode. This only will work with hash translation mode. +*/ +#ifdef SAVED_WRITE_DEBUG + if (!radix_enabled()) + return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED)) == + cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED); +#endif + return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) == + cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE); +} +#endif + #define __HAVE_ARCH_PTEP_SET_WRPROTECT static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) { +#ifdef SAVED_WRITE_DEBUG + /* +* Cannot use this with protnone pte, For protnone, writes +* will be marked via savedwrite bit. +*/ + VM_WARN_ON(pte_protnone(*ptep)); +#endif if ((pte_raw(*ptep) & cpu_to_be64(_PAGE_WRITE)) == 0) return; @@ -430,51 +456,6 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte) } #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ -#ifdef CONFIG_NUMA_BALANCING -static inline int pte_protnone(pte_t pte) -{ - return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) == - cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE); -} - -#define pte_mk_savedwrite pte_mk_savedwrite -static inline pte_t pte_mk_savedwrite(pte_t pte) -{ - /* -* Used by Autonuma subsystem to preserve the write bit -* while marking the pte PROT_NONE. Only allow this -* on PROT_NONE pte -*/ - VM_BUG_ON((pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_RWX | _PAGE_PRIVILEGED)) != - cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED)); - return __pte(pte_val(pte) & ~_PAGE_PRIVILEGED); -} - -#define pte_clear_savedwrite pte_clear_savedwrite -static inline pte_t pte_clear_savedwrite(pte_t pte) -{ - /* -* Used by KSM subsystem to make a protnone pte readonly. -*/ - VM_BUG_ON(!pte_protnone(pte)); - return __pte(pte_val(pte) | _PAGE_PRIVILEGED); -} - -#define pte_savedwrite pte_savedwrite -static inline bool pte_savedwrite(pte_t pte) -{ - /* -* Saved write ptes are prot none ptes that doesn't have -* privileged bit sit. We mark prot none as one which has -* present and pviliged bit set and RWX cleared. 
To mark -* protnone which used to have _PAGE_WRITE set we clear -* the privileged bit. -*/ - VM_BUG_ON(!pte_protnone(pte)); - return !(pte_raw(pte) & cpu_to_be64(_PAGE_RWX | _PAGE_PRIVILEGED)); -} -#endif /* CONFIG_NUMA_BALANCING */ - static inline int pte_present(pte_t pte) { return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT)); @@ -500,6 +481,14 @@ static inline unsigned long pte_pfn(pte_t pte) /* Generic modifiers for PTE bits */ static inline pte_t pte_wrprotect(pte_t pte) { + +#ifdef SAVED_WRITE_DEBUG + /* +* Cannot use this with protnone pte, For protnone, writes +* will be marked via savedwrite bit. +*/ + VM_WARN_ON(pte_protnone(pte)); +#endif return __pte(pte_val(pte) & ~_PAGE_WRITE); } @@ -552,6 +541,57 @@ static inline bool pte_user(pte_t pte) return !(pte_raw(pte) & cpu_to_be64(_PAGE_PRIVILEGED)); } +#ifdef CONFIG_NUMA_BALANCING +#define pte_mk_savedwrite pte_mk_savedwrite +static inline
Re: [PATCH] cxl: Enable PCI device ID for future IBM CXL adapter
On 17/02/17 14:45, Uma Krishnan wrote:
> From: "Matthew R. Ochs"
>
> Add support for a future IBM Coherent Accelerator (CXL) device with an
> ID of 0x0623.
>
> Signed-off-by: Matthew R. Ochs
> Signed-off-by: Uma Krishnan

Is this a CAIA 1 or CAIA 2 device?

--
Andrew Donnellan              OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited
Re: powerpc/perf: use is_kernel_addr macro in perf_get_misc_flags()
On Sat, 2016-12-24 at 06:05:49 UTC, Madhavan Srinivasan wrote: > Cleanup to use is_kernel_addr macro. > > Signed-off-by: Madhavan Srinivasan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/a2391b35f1d9d5b51d43a9150c7239 cheers
Re: powerpc: implement clear_bit_unlock_is_negative_byte()
On Tue, 2017-01-03 at 18:58:28 UTC, Nicholas Piggin wrote: > Commit b91e1302ad9b8 ("mm: optimize PageWaiters bit use for > unlock_page()") added a special bitop function to speed up > unlock_page(). Implement this for powerpc. ... > > Signed-off-by: Nicholas Piggin Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/d11914b21c4c21a294fe8937d66c1a cheers
Re: powerpc/powernv: Remove unused variable in pnv_pci_sriov_disable()
On Wed, 2017-01-11 at 01:09:05 UTC, Gavin Shan wrote: > The local variable @iov isn't used, to remove it. > > Signed-off-by: Gavin Shan > Reviewed-by: Andrew Donnellan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/02983449c87b1dfd9b75af4c8a2a80 cheers
Re: [v2] powerpc/kernel: Remove error message in pcibios_setup_phb_resources()
On Wed, 2017-02-08 at 03:11:03 UTC, Gavin Shan wrote: > The CAPI driver creates virtual PHB (vPHB) from the CAPI adapter. > The vPHB's IO and memory windows aren't built from device-tree node > as we do for normal PHBs. A error message is thrown in below path > when trying to probe AFUs contained in the adapter. The error message > is confusing and unnecessary. > > cxl_probe() > pci_init_afu() > cxl_pci_vphb_add() > pcibios_scan_phb() > pcibios_setup_phb_resources() > > This removes the error message. We might have the case where the > first memory window on real PHB isn't populated properly because > of error in "ranges" property in the device-tree node. We can check > the device-tree instead for that. This also removes one unnecessary > blank line in the function. > > Signed-off-by: Gavin Shan > Reviewed-by: Andrew Donnellan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/727597d12140b342a3deef10348b5e cheers
Re: [v2] powerpc/mm: Fix typo in set_pte_at()
On Wed, 2017-02-08 at 03:16:50 UTC, Gavin Shan wrote: > This fixes the typo about the _PAGE_PTE in set_pte_at() by changing > "tryint" to "trying to". > > Signed-off-by: Gavin Shan > Acked-by: Balbir Singh Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/c618f6b188a9170f67e4abd478d250 cheers
Re: [v2,1/6] powerpc/perf: Factor out event_alternative function
On Sun, 2017-02-12 at 17:03:10 UTC, Madhavan Srinivasan wrote: > Factor out the power8 event_alternative function to share > the code with power9. > > Signed-off-by: Madhavan Srinivasan Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/efe881afdd9996ccbcd2a09c93b724 cheers
Re: powerpc/perf: Avoid FAB_*_MATCH checks for power9
On Mon, 2017-02-13 at 11:32:54 UTC, Madhavan Srinivasan wrote: > Since power9 does not support FAB_*_MATCH bits in MMCR1, > avoid these checks for power9. For this, the patch factors out > code in isa207_get_constraint() to retain these checks > only for power8. > > The patch also updates the comment in the power9-pmu raw event > encoding layout to remove FAB_*_MATCH. > > Finally, for power9, the patch adds an additional check for > threshold events when adding the thresh mask and value in > isa207_get_constraint(). > > Fixes: 7ffd948fae4c ('powerpc/perf: factor out power8 pmu functions') > Fixes: 18201b204286 ('powerpc/perf: power9 raw event format encoding') > Signed-off-by: Ravi Bangoria > Signed-off-by: Madhavan Srinivasan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/78a16d9fc1206e1a484b6ac9634875 cheers
Re: [v7, 3/4] powerpc/pseries: Implement indexed-count hotplug memory add
On Wed, 2017-02-15 at 18:45:56 UTC, Nathan Fontenot wrote: > From: Sahil Mehta > > Indexed-count add for memory hotplug guarantees that a contiguous block > of lmbs beginning at a specified will be assigned, > any LMBs in this range that are not already assigned will be DLPAR added. > Because of Qemu's per-DIMM memory management, the addition of a contiguous > block of memory currently requires a series of individual calls to add > each LMB in the block. Indexed-count add reduces this series of calls to > a single call for the entire block. > > Signed-off-by: Sahil Mehta > Signed-off-by: Nathan Fontenot Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/333f7b76865bec24c66710cf352f89 cheers
Re: [v7, 4/4] powerpc/pseries: Implement indexed-count hotplug memory remove
On Wed, 2017-02-15 at 18:46:18 UTC, Nathan Fontenot wrote: > From: Sahil Mehta > > Indexed-count remove for memory hotplug guarantees that a contiguous block > of lmbs beginning at a specified will be unassigned (NOT > that lmbs will be removed). Because of Qemu's per-DIMM memory > management, the removal of a contiguous block of memory currently > requires a series of individual calls. Indexed-count remove reduces > this series into a single call. > > Signed-off-by: Sahil Mehta > Signed-off-by: Nathan Fontenot Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/753843471cbbaeca25a5cab51981ee cheers
Re: [1/3] pci/hotplug/pnv-php: Remove WARN_ON() in pnv_php_put_slot()
On Wed, 2017-02-15 at 23:22:32 UTC, Gavin Shan wrote: > The WARN_ON() causes unnecessary backtrace when putting the parent > slot, which is likely to be NULL. > > WARNING: CPU: 2 PID: 1071 at drivers/pci/hotplug/pnv_php.c:85 \ > pnv_php_release+0xcc/0x150 [pnv_php] > : > Call Trace: > [c003bc007c10] [dad613c4] pnv_php_release+0x144/0x150 [pnv_php] > [c003bc007c40] [c06641d8] pci_hp_deregister+0x238/0x330 > [c003bc007cd0] [dad61440] pnv_php_unregister_one+0x70/0xa0 > [pnv_php] > [c003bc007d10] [dad614c0] pnv_php_unregister+0x50/0x80 [pnv_php] > [c003bc007d40] [dad61e84] pnv_php_exit+0x50/0xcb4 [pnv_php] > [c003bc007d70] [c019499c] SyS_delete_module+0x1fc/0x2a0 > [c003bc007e30] [c000b184] system_call+0x38/0xe0 > > Cc: # v4.8+ > Fixes: 66725152fb9f ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver") > Signed-off-by: Gavin Shan > Reviewed-by: Andrew Donnellan > Tested-by: Vaibhav Jain Series applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/36c7c9da40c408a71e5e6bfe12e57d cheers
Re: [PATCHv3,4/4] MAINTAINERS: Remove powerpc's opal match
On Thu, 2017-02-16 at 00:37:15 UTC, Stewart Smith wrote: > Remove OPAL regex in powerpc to avoid false match > > Signed-off-by: Stewart Smith > Reviewed-by: Andrew Donnellan Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/a42715830d552d7c0e3be709383ece cheers
Re: [1/2] powerpc/mm: Convert slb_finish_load[_1T] to local symbols
On Thu, 2017-02-16 at 05:38:44 UTC, Michael Ellerman wrote: > slb_finish_load and slb_finish_load_1T are both only used within > slb_low.S, so make them local symbols. > > This makes the code a little clearer, as it's more obvious neither is > intended to be an entry point from arbitrary other code, only the uses > in this file. > > It also prevents them being used with kprobes and other tracing tools, > which is good because we're not able to safely take traps at these > locations, so making them local symbols avoids us needing to blacklist > them. > > Signed-off-by: Naveen N. Rao > Signed-off-by: Michael Ellerman Series applied to powerpc next. https://git.kernel.org/powerpc/c/e471c393dfafff54c65979cbda7d5a cheers
Re: [v2] powerpc: Add POWER9 architected mode to cputable
On Fri, 2017-02-17 at 02:01:35 UTC, Russell Currey wrote: > PVR value of 0x0F05 means we are arch v3.00 compliant (i.e. POWER9). > > Acked-by: Michael Neuling > Signed-off-by: Russell Currey Applied to powerpc next, thanks. https://git.kernel.org/powerpc/c/6ae3f8ad2017079292cb49c8959b52 cheers
next-20170217 boot on POWER8 LPAR : WARNING @kernel/jump_label.c:287
While booting next-20170217 on a POWER8 LPAR following warning is displayed. Reverting the following commit helps boot cleanly. commit 3821fd35b5 : jump_label: Reduce the size of struct static_key [ 11.393008] [ cut here ] [ 11.393031] WARNING: CPU: 5 PID: 2890 at kernel/jump_label.c:287 static_key_set_entries.isra.10+0x3c/0x50 [ 11.393035] Modules linked in: nfsd(+) ip_tables x_tables autofs4 [ 11.393043] CPU: 5 PID: 2890 Comm: modprobe Not tainted 4.10.0-rc8-next-20170217-autotest #1 [ 11.393047] task: c003a5692500 task.stack: c003a7774000 [ 11.393051] NIP: c17bcffc LR: c17bd46c CTR: [ 11.393054] REGS: c003a800 TRAP: 0700 Not tainted (4.10.0-rc8-next-20170217-autotest) [ 11.393058] MSR: 8282b033 [ 11.393065] CR: 48248282 XER: 0001 [ 11.393070] CFAR: c17bcfcc SOFTE: 1 GPR00: c17bd42c c003aa80 c262ce00 d3fdd580 GPR04: d3fe07df 00010017 c17bcd50 GPR08: 00053a09 0001 c254ce00 0001 GPR12: c1b56c40 cea81400 0020 d5081098 GPR16: c003ada0 c003adec 84a8 GPR20: d3fef000 d3fe2b28 c252dc90 0001 GPR24: c254d314 c25338f8 d3fe089f GPR28: d3fe1400 d3fdd578 d3fe07df [ 11.393115] NIP [c17bcffc] static_key_set_entries.isra.10+0x3c/0x50 [ 11.393119] LR [c17bd46c] jump_label_module_notify+0x20c/0x420 [ 11.393122] Call Trace: [ 11.393125] [c003aa80] [c17bd42c] jump_label_module_notify+0x1cc/0x420 (unreliable) [ 11.393132] [c003ab40] [c16b38e0] notifier_call_chain+0x90/0x100 [ 11.393137] [c003ab90] [c16b3db0] __blocking_notifier_call_chain+0x60/0x90 [ 11.393142] [c003abe0] [c17357bc] load_module+0x1c1c/0x2750 [ 11.393147] [c003ad70] [c1736550] SyS_finit_module+0xc0/0xf0 [ 11.393152] [c003ae30] [c15cb8e0] system_call+0x38/0xfc [ 11.393156] Instruction dump: [ 11.393158] 40c20018 e923 792907a0 7c844b78 f883 4e800020 3d42fff2 892a0514 [ 11.393166] 2f89 40feffe0 3921 992a0514 <0fe0> 4bd0 6000 6000 [ 11.393173] ---[ end trace a5f8fbc5d8226aec ]--- Have attached boot log. Thanks -Sachin dmesg_next_20170217.log Description: Binary data
[next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140
While booting next-20170217 on a POWER6 box, I ran into following warning. This is a full system lpar. Previous next tree was good. I will try a bisect tomorrow. ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015) ipr 0200:00:01.0: Found IOA with IRQ: 305 [ cut here ] WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140 Modules linked in: CPU: 12 PID: 1 Comm: swapper/14 Not tainted 4.10.0-rc8-next-20170217-autotest #1 task: c002b2a4a580 task.stack: c002b2a5c000 NIP: c00731b0 LR: c01389f8 CTR: c0073170 REGS: c002b2a5f050 TRAP: 0700 Not tainted (4.10.0-rc8-next-20170217-autotest) MSR: 80029032 CR: 28004082 XER: 2004 CFAR: c01389e0 SOFTE: 0 GPR00: c01389f8 c002b2a5f2d0 c1025800 c002b203f498 GPR04: 0064 0131 GPR08: 0001 c000d3104cb8 0009b1f8 GPR12: 48004082 cedc2400 c000dad0 GPR16: 3c007efc c0a9e848 GPR20: d8008008 c002af4d47f0 c11efda8 c0a9ea10 GPR24: c0a9e848 c002af4d4fb8 GPR28: c002b203f498 c0ef8928 c002b203f400 NIP [c00731b0] .icp_hv_eoi+0x40/0x140 LR [c01389f8] .handle_fasteoi_irq+0x1e8/0x270 Call Trace: [c002b2a5f2d0] [c002b2a5f360] 0xc002b2a5f360 (unreliable) [c002b2a5f360] [c01389f8] .handle_fasteoi_irq+0x1e8/0x270 [c002b2a5f3e0] [c0136a08] .request_threaded_irq+0x298/0x370 [c002b2a5f490] [c05895c0] .ipr_probe_ioa+0x1110/0x1390 [c002b2a5f5c0] [c058d030] .ipr_probe+0x30/0x3e0 [c002b2a5f670] [c0466860] .local_pci_probe+0x60/0x130 [c002b2a5f710] [c0467658] .pci_device_probe+0x148/0x1e0 [c002b2a5f7c0] [c0527524] .driver_probe_device+0x2d4/0x5b0 [c002b2a5f860] [c052796c] .__driver_attach+0x16c/0x190 [c002b2a5f8f0] [c05242c4] .bus_for_each_dev+0x84/0xf0 [c002b2a5f990] [c0526af4] .driver_attach+0x24/0x40 [c002b2a5fa00] [c0526318] .bus_add_driver+0x2a8/0x370 [c002b2a5faa0] [c0528a5c] .driver_register+0x8c/0x170 [c002b2a5fb20] [c0465a54] .__pci_register_driver+0x44/0x60 [c002b2a5fb90] [c0b8efc8] .ipr_init+0x58/0x70 [c002b2a5fc10] [c000d20c] .do_one_initcall+0x5c/0x1c0 [c002b2a5fce0] [c0b44738] .kernel_init_freeable+0x280/0x360 [c002b2a5fdb0] [c000daec] .kernel_init+0x1c/0x130 [c002b2a5fe30] [c000baa0] .ret_from_kernel_thread+0x58/0xb8 Instruction dump: f8010010 f821ff71 80e3000c 7c0004ac e94d0030 3d02ffbc 3928f4b8 7d295214 81090004 3948 7d484378 79080fe2 <0b08> 2fa8 40de0050 91490004 ---[ end trace 5e18ae409f46392c ]--- ipr 0200:00:01.0: Initializing IOA. Thanks -Sachin
Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
On Sun, 2017-02-19 at 15:48 +0530, Aneesh Kumar K.V wrote: > +#ifdef CONFIG_PPC_BOOK3S_64 > + /* > + * We need to make sure that for different page sizes reported by > + * firmware we only add hugetlb support for page sizes that can be > + * supported by linux page table layout. > + * For now we have > + * Radix: 2M > + * Hash: 16M and 16G > + */ > + if (radix_enabled()) { > + if (mmu_psize != MMU_PAGE_2M) > + return -EINVAL; > + } else { > + if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) > + return -EINVAL; > + } Hash could support others... Same with radix and PUD level pages. Why do we need that ? Won't FW provide separate properties for hash and radix page sizes anyway ? Ben.
Re: [PATCH] powerpc/powernv: Make PCI non-optional
On Fri, Feb 17, 2017 at 05:34:13PM +1100, Michael Ellerman wrote: >Bare metal systems without PCI don't exist, so there's no real point in >making PCI optional, it just breaks the build from time to time. In fact >the build is broken now if you turn off PCI_MSI but enable KVM. > >Using select for PCI is OK because we (powerpc) define config PCI, and it >has no dependencies. Selecting PCI_MSI is slightly fishy, because it's >in drivers/pci and it is user-visible, but its only dependency is PCI, >so selecting it can't actually lead to breakage. > >Signed-off-by: Michael Ellerman Acked-by: Gavin Shan
[PATCH 1/2] powerpc/mm: Refactor page table allocation
Introduce a helper pgtable_get_gfp_flags() which just returns the current gfp flags. In a future patch, we can enable __GFP_ACCOUNT based on the calling context. Signed-off-by: Balbir Singh --- arch/powerpc/include/asm/book3s/64/pgalloc.h | 22 -- arch/powerpc/mm/pgtable_64.c | 3 ++- 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h index cd5e7aa..d0a9ca6 100644 --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h @@ -50,13 +50,19 @@ extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift); extern void __tlb_remove_table(void *_table); #endif +static inline gfp_t pgtable_get_gfp_flags(struct mm_struct *mm, gfp_t gfp) +{ + return gfp; +} + static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm) { #ifdef CONFIG_PPC_64K_PAGES - return (pgd_t *)__get_free_page(PGALLOC_GFP); + return (pgd_t *)__get_free_page(pgtable_get_gfp_flags(mm, PGALLOC_GFP)); #else struct page *page; - page = alloc_pages(PGALLOC_GFP | __GFP_REPEAT, 4); + page = alloc_pages(pgtable_get_gfp_flags(mm, + PGALLOC_GFP | __GFP_REPEAT), 4); if (!page) return NULL; return (pgd_t *) page_address(page); @@ -76,7 +82,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm) { if (radix_enabled()) return radix__pgd_alloc(mm); - return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL); + return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), + pgtable_get_gfp_flags(mm, GFP_KERNEL)); } static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd) @@ -93,7 +100,8 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud) static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr) { - return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE), GFP_KERNEL); + return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE), + pgtable_get_gfp_flags(mm, GFP_KERNEL)); } static inline void pud_free(struct mm_struct *mm, pud_t *pud) @@ -119,7 +127,8 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud, static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) { - return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX), GFP_KERNEL); + return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX), + pgtable_get_gfp_flags(mm, GFP_KERNEL)); } static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd) @@ -159,7 +168,8 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd) static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long address) { - return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO); + return (pte_t *)__get_free_page( + pgtable_get_gfp_flags(mm, PGALLOC_GFP)); } static inline pgtable_t pte_alloc_one(struct mm_struct *mm, diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c index 8bca7f5..9f416ee 100644 --- a/arch/powerpc/mm/pgtable_64.c +++ b/arch/powerpc/mm/pgtable_64.c @@ -350,7 +350,8 @@ static pte_t *get_from_cache(struct mm_struct *mm) static pte_t *__alloc_for_cache(struct mm_struct *mm, int kernel) { void *ret = NULL; - struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO); + struct page *page = alloc_page(pgtable_get_gfp_flags(mm, + GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)); if (!page) return NULL; if (!kernel && !pgtable_page_ctor(page)) { -- 2.9.3
[PATCH 2/2] powerpc/mm: Enable page table accounting
Enabled __GFP_ACCOUNT in pgtable_get_gfp_flags(). This allows accounting of page table allocation via kmem to the correct cgroup. Basic testing was done to see if the accounting reflects in 1. perf record tracing 2. memory.kmem.slabinfo Signed-off-by: Balbir Singh --- arch/powerpc/include/asm/book3s/64/pgalloc.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h index d0a9ca6..9207213 100644 --- a/arch/powerpc/include/asm/book3s/64/pgalloc.h +++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h @@ -52,7 +52,9 @@ extern void __tlb_remove_table(void *_table); static inline gfp_t pgtable_get_gfp_flags(struct mm_struct *mm, gfp_t gfp) { - return gfp; + if (mm == &init_mm) + return gfp; + return gfp | __GFP_ACCOUNT; } static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm) -- 2.9.3
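A quick illustration of what the two patches combine to may help; the lines below are not from the patches and only show what the helper resolves to for the two interesting callers once __GFP_ACCOUNT is enabled.

/* Illustrative only: effective flags after patch 2/2 is applied */
gfp_t kernel_gfp = pgtable_get_gfp_flags(&init_mm, GFP_KERNEL);
				/* -> GFP_KERNEL, never charged */
gfp_t user_gfp = pgtable_get_gfp_flags(current->mm, GFP_KERNEL);
				/* -> GFP_KERNEL | __GFP_ACCOUNT,
				 *    charged to the task's memcg */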
Re: [RFC PATCH 4/9] powerpc/4xx: Create 4xx pseudo-platform in platforms/4xx
On Fri, 17 Feb 2017 17:32:14 +1100 Michael Ellerman wrote: > We have a lot of code in sysdev for supporting 4xx, ie. either 40x or > 44x. Instead it would be cleaner if it was all in platforms/4xx. > > This is slightly odd in that we don't actually define any machines in > the 4xx platform, as is usual for a platform directory. But still it > seems like a better result to have all this related code in a directory > by itself. What about the other things in sysdev that support multiple platforms? Why not just put the new 4xx subdirectory under sysdev? The other patches all seem okay to me. Do you have any grand plan for further breaking up traps.c? Thanks, Nick
Re: [next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140
Sachin Sant writes: > While booting next-20170217 on a POWER6 box, I ran into following > warning. This is a full system lpar. Previous next tree was good. > I will try a bisect tomorrow. Do you have CONFIG_DEBUG_SHIRQ=y ? cheers > ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015) > ipr 0200:00:01.0: Found IOA with IRQ: 305 > [ cut here ] > WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 > .icp_hv_eoi+0x40/0x140 > Modules linked in: > CPU: 12 PID: 1 Comm: swapper/14 Not tainted 4.10.0-rc8-next-20170217-autotest > #1 > task: c002b2a4a580 task.stack: c002b2a5c000 > NIP: c00731b0 LR: c01389f8 CTR: c0073170 > REGS: c002b2a5f050 TRAP: 0700 Not tainted > (4.10.0-rc8-next-20170217-autotest) > MSR: 80029032 > CR: 28004082 XER: 2004 > CFAR: c01389e0 SOFTE: 0 > GPR00: c01389f8 c002b2a5f2d0 c1025800 c002b203f498 > GPR04: 0064 0131 > GPR08: 0001 c000d3104cb8 0009b1f8 > GPR12: 48004082 cedc2400 c000dad0 > GPR16: 3c007efc c0a9e848 > GPR20: d8008008 c002af4d47f0 c11efda8 c0a9ea10 > GPR24: c0a9e848 c002af4d4fb8 > GPR28: c002b203f498 c0ef8928 c002b203f400 > NIP [c00731b0] .icp_hv_eoi+0x40/0x140 > LR [c01389f8] .handle_fasteoi_irq+0x1e8/0x270 > Call Trace: > [c002b2a5f2d0] [c002b2a5f360] 0xc002b2a5f360 (unreliable) > [c002b2a5f360] [c01389f8] .handle_fasteoi_irq+0x1e8/0x270 > [c002b2a5f3e0] [c0136a08] .request_threaded_irq+0x298/0x370 > [c002b2a5f490] [c05895c0] .ipr_probe_ioa+0x1110/0x1390 > [c002b2a5f5c0] [c058d030] .ipr_probe+0x30/0x3e0 > [c002b2a5f670] [c0466860] .local_pci_probe+0x60/0x130 > [c002b2a5f710] [c0467658] .pci_device_probe+0x148/0x1e0 > [c002b2a5f7c0] [c0527524] .driver_probe_device+0x2d4/0x5b0 > [c002b2a5f860] [c052796c] .__driver_attach+0x16c/0x190 > [c002b2a5f8f0] [c05242c4] .bus_for_each_dev+0x84/0xf0 > [c002b2a5f990] [c0526af4] .driver_attach+0x24/0x40 > [c002b2a5fa00] [c0526318] .bus_add_driver+0x2a8/0x370 > [c002b2a5faa0] [c0528a5c] .driver_register+0x8c/0x170 > [c002b2a5fb20] [c0465a54] .__pci_register_driver+0x44/0x60 > [c002b2a5fb90] [c0b8efc8] .ipr_init+0x58/0x70 > [c002b2a5fc10] [c000d20c] .do_one_initcall+0x5c/0x1c0 > [c002b2a5fce0] [c0b44738] .kernel_init_freeable+0x280/0x360 > [c002b2a5fdb0] [c000daec] .kernel_init+0x1c/0x130 > [c002b2a5fe30] [c000baa0] .ret_from_kernel_thread+0x58/0xb8 > Instruction dump: > f8010010 f821ff71 80e3000c 7c0004ac e94d0030 3d02ffbc 3928f4b8 7d295214 > 81090004 3948 7d484378 79080fe2 <0b08> 2fa8 40de0050 91490004 > ---[ end trace 5e18ae409f46392c ]--- > ipr 0200:00:01.0: Initializing IOA. > > Thanks > -Sachin
[PATCH v4 00/10] IMC Instrumentation Support
Power 9 has In-Memory-Collection (IMC) infrastructure which contains various Performance Monitoring Units (PMUs) at Nest level (these are on-chip but off-core), Core level and Thread level. The Nest PMU counters are handled by a Nest IMC microcode which runs in the OCC (On-Chip Controller) complex. The microcode collects the counter data and moves the nest IMC counter data to memory. The Core and Thread IMC PMU counters are handled in the core. Core level PMU counters give us the IMC counters' data per core and thread level PMU counters give us the IMC counters' data per CPU thread. This patchset enables the nest IMC, core IMC and thread IMC PMUs and is based on the initial work done by Madhavan Srinivasan. "Nest Instrumentation Support" : https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132078.html v1 for this patchset can be found here : https://lwn.net/Articles/705475/ Nest events: Per-chip nest instrumentation provides various per-chip metrics such as memory, powerbus, Xlink and Alink bandwidth. Core events: Per-core IMC instrumentation provides various per-core metrics such as non-idle cycles, non-idle instructions, various cache and memory related metrics etc. Thread events: All the events for thread level are same as core level with the difference being in the domain. These are per-cpu metrics. PMU Events' Information: OPAL obtains the IMC PMU and event information from the IMC Catalog and passes on to the kernel via the device tree. The events' information contains : - Event name - Event Offset - Event description and, maybe : - Event scale - Event unit Some PMUs may have a common scale and unit values for all their supported events. For those cases, the scale and unit properties for those events must be inherited from the PMU. The event offset in the memory is where the counter data gets accumulated. The OPAL-side patches are posted upstream : https://lists.ozlabs.org/pipermail/skiboot/2017-January/005979.html The kernel discovers the IMC counters information in the device tree at the "imc-counters" device node which has a compatible field "ibm,opal-in-memory-counters". Parsing of the Events' information: To parse the IMC PMUs and events information, the kernel has to discover the "imc-counters" node and walk through the pmu and event nodes. Here is an excerpt of the dt showing the imc-counters with mcs0 (nest), core and thread node: /dts-v1/; [...] /dts-v1/; / { name = ""; compatible = "ibm,opal-in-memory-counters"; #address-cells = <0x1>; #size-cells = <0x1>; imc-nest-offset = <0x32>; imc-nest-size = <0x3>; version-id = ""; NEST_MCS: nest-mcs-events { #address-cells = <0x1>; #size-cells = <0x1>; event@0 { event-name = "RRTO_QFULL_NO_DISP" ; reg = <0x0 0x8>; desc = "RRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid RRTO op is not dispatched due to a command list full condition" ; }; event@8 { event-name = "WRTO_QFULL_NO_DISP" ; reg = <0x8 0x8>; desc = "WRTO not dispatched in MCS0 due to capacity - pulses once for each time a valid WRTO op is not dispatched due to a command list full condition" ; }; [...] mcs0 { compatible = "ibm,imc-counters-nest"; events-prefix = "PM_MCS0_"; unit = ""; scale = ""; reg = <0x118 0x8>; events = < &NEST_MCS >; }; mcs1 { compatible = "ibm,imc-counters-nest"; events-prefix = "PM_MCS1_"; unit = ""; scale = ""; reg = <0x198 0x8>; events = < &NEST_MCS >; }; [...] 
CORE_EVENTS: core-events { #address-cells = <0x1>; #size-cells = <0x1>; event@e0 { event-name = "0THRD_NON_IDLE_PCYC" ; reg = <0xe0 0x8>; desc = "The number of processor cycles when all threads are idle" ; }; event@120 { event-name = "1THRD_NON_IDLE_PCYC" ; reg = <0x120 0x8>; desc = "The number of processor cycles when exactly one SMT thread is executing non-idle code" ; }; [...] core { compatible = "ibm,imc-counters-core"; events-prefix = "CPM_"; unit = ""; scale = ""; reg = <0x0 0x8>; events = < &CORE_EVENTS >; }; thread { compatible = "ibm,imc-counters-core"; events-prefix
[PATCH v4 01/10] powerpc/powernv: Data structure and macros definitions
Create new header file "imc-pmu.h" to add the data structures and macros needed for IMC pmu support. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/imc-pmu.h | 73 ++ 1 file changed, 73 insertions(+) create mode 100644 arch/powerpc/include/asm/imc-pmu.h diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h new file mode 100644 index 000..3232322 --- /dev/null +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -0,0 +1,73 @@ +#ifndef PPC_POWERNV_IMC_PMU_DEF_H +#define PPC_POWERNV_IMC_PMU_DEF_H + +/* + * IMC Nest Performance Monitor counter support. + * + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. + * (C) 2016 Hemant K Shaw, IBM Corporation. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. + */ + +#include +#include +#include +#include +#include + +#define IMC_MAX_CHIPS 32 +#define IMC_MAX_PMUS 32 +#define IMC_MAX_PMU_NAME_LEN 256 + +#define NEST_IMC_ENGINE_START 1 +#define NEST_IMC_ENGINE_STOP 0 +#define NEST_MAX_PAGES 16 + +#define NEST_IMC_PRODUCTION_MODE 1 + +#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" +#define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" + +/* + * Structure to hold per chip specific memory address + * information for nest pmus. Nest Counter data are exported + * in per-chip reserved memory region by the PORE Engine. + */ +struct perchip_nest_info { + u32 chip_id; + u64 pbase; + u64 vbase[NEST_MAX_PAGES]; + u64 size; +}; + +/* + * Place holder for nest pmu events and values. + */ +struct imc_events { + char *ev_name; + char *ev_value; +}; + +/* + * Device tree parser code detects IMC pmu support and + * registers new IMC pmus. This structure will + * hold the pmu functions and attrs for each imc pmu and + * will be referenced at the time of pmu registration. + */ +struct imc_pmu { + struct pmu pmu; + int domain; + const struct attribute_group *attr_groups[4]; +}; + +/* + * Domains for IMC PMUs + */ +#define IMC_DOMAIN_NEST1 + +#define UNKNOWN_DOMAIN -1 + +#endif /* PPC_POWERNV_IMC_PMU_DEF_H */ -- 2.7.4
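As a rough sketch of how the later patches in the series are expected to use this wrapper (the snippet below is not part of this patch, and the PMU name is purely illustrative): the embedded struct pmu is what eventually gets handed to perf_pmu_register(), while the domain and attr_groups fields carry the IMC-specific state.

/* Illustrative sketch only */
struct imc_pmu *pmu_ptr = kzalloc(sizeof(*pmu_ptr), GFP_KERNEL);

pmu_ptr->domain = IMC_DOMAIN_NEST;		/* nest PMU in this example */
pmu_ptr->pmu.name = "nest_mcs0";		/* name made up for the example */
perf_pmu_register(&pmu_ptr->pmu, pmu_ptr->pmu.name, -1);	/* -1: dynamic type */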
[PATCH v4 02/10] powerpc/powernv: Autoload IMC device driver module
This patch does three things : - Enables "opal.c" to create a platform device for the IMC interface according to the appropriate compatibility string. - Find the reserved-memory region details from the system device tree and get the base address of HOMER region address for each chip. - We also get the Nest PMU counter data offsets (in the HOMER region) and their sizes. The offsets for the counters' data are fixed and won't change from chip to chip. The device tree parsing logic is separated from the PMU creation functions (which is done in subsequent patches). Right now, only Nest units are taken care of. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/platforms/powernv/Makefile | 2 +- arch/powerpc/platforms/powernv/opal-imc.c | 117 ++ arch/powerpc/platforms/powernv/opal.c | 13 3 files changed, 131 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c diff --git a/arch/powerpc/platforms/powernv/Makefile b/arch/powerpc/platforms/powernv/Makefile index b5d98cb..44909fe 100644 --- a/arch/powerpc/platforms/powernv/Makefile +++ b/arch/powerpc/platforms/powernv/Makefile @@ -2,7 +2,7 @@ obj-y += setup.o opal-wrappers.o opal.o opal-async.o idle.o obj-y += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o obj-y += rng.o opal-elog.o opal-dump.o opal-sysparam.o opal-sensor.o obj-y += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o -obj-y += opal-kmsg.o +obj-y += opal-kmsg.o opal-imc.o obj-$(CONFIG_SMP) += smp.o subcore.o subcore-asm.o obj-$(CONFIG_PCI) += pci.o pci-ioda.o npu-dma.o diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c new file mode 100644 index 000..ee2ae45 --- /dev/null +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -0,0 +1,117 @@ +/* + * OPAL IMC interface detection driver + * Supported on POWERNV platform + * + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. + *(C) 2016 Hemant K Shaw, IBM Corporation. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; + +static int opal_imc_counters_probe(struct platform_device *pdev) +{ + struct device_node *child, *imc_dev, *rm_node = NULL; + struct perchip_nest_info *pcni; + u32 reg[4], pages, nest_offset, nest_size, idx; + int i = 0; + const char *node_name; + + if (!pdev || !pdev->dev.of_node) + return -ENODEV; + + imc_dev = pdev->dev.of_node; + + /* +* nest_offset : where the nest-counters' data start. 
+* size : size of the entire nest-counters region +*/ + if (of_property_read_u32(imc_dev, "imc-nest-offset", &nest_offset)) + goto err; + if (of_property_read_u32(imc_dev, "imc-nest-size", &nest_size)) + goto err; + + /* Find the "homer region" for each chip */ + rm_node = of_find_node_by_path("/reserved-memory"); + if (!rm_node) + goto err; + + for_each_child_of_node(rm_node, child) { + if (of_property_read_string_index(child, "name", 0, + &node_name)) + continue; + if (strncmp("ibm,homer-image", node_name, + strlen("ibm,homer-image"))) + continue; + + /* Get the chip id to which the above homer region belongs to */ + if (of_property_read_u32(child, "ibm,chip-id", &idx)) + goto err; + + /* reg property will have four u32 cells. */ + if (of_property_read_u32_array(child, "reg", reg, 4)) + goto err; + + pcni = &nest_perchip_info[idx]; + + /* Fetch the homer region base address */ + pcni->pbase = reg[0]; + pcni->pbase = pcni->pbase << 32 | reg[1]; + /* Add the nest IMC Base offset */ + pcni->pbase
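A worked example of the address arithmetic in the probe routine may make the changelog above easier to follow; the reg cells are made up, only the computation mirrors the code.

/* Illustrative values only: an ibm,homer-image node for chip N with
 * reg = <0x2 0x00000000 0x0 0x00300000> gives
 */
pcni->pbase = (u64)reg[0] << 32 | reg[1];	/* HOMER base: 0x2_0000_0000 */
pcni->pbase += nest_offset;			/* + imc-nest-offset from the DT */
/* the nest counter data for chip N therefore starts imc-nest-offset bytes
 * into that chip's HOMER region; pcni->vbase[] maps it page by page. */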
[PATCH v4 03/10] powerpc/powernv: Detect supported IMC units and its events
Parse device tree to detect IMC units. Traverse through each IMC unit node to find supported events and corresponding unit/scale files (if any). The device tree for IMC counters starts at the node : "imc-counters". This node contains all the IMC PMU nodes and event nodes for these IMC PMUs. The PMU nodes have an "events" property which has a phandle value for the actual events node. The events are separated from the PMU nodes to abstract out the common events. For example, PMU node "mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since, the events are common between these PMUs. These events have a different prefix based on their relation to different PMUs, and hence, the PMU nodes themselves contain an "events-prefix" property. The value for this property concatenated to the event name, forms the actual event name. Also, the PMU have a "reg" field as the base offset for the events which belong to this PMU. This "reg" field is added to an event in the "events" node, which gives us the location of the counter data. Kernel code uses this offset as event configuration value. Device tree parser code also looks for scale/unit property in the event node and passes on the value as an event attr for perf interface to use in the post processing by the perf tool. Some PMUs may have common scale and unit properties which implies that all events supported by this PMU inherit the scale and unit properties of the PMU itself. For those events, we need to set the common unit and scale values. For failure to initialize any unit or any event, disable that unit and continue setting up the rest of them. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Signed-off-by: Hemant Kumar Signed-off-by: Anju T Sudhakar --- arch/powerpc/platforms/powernv/opal-imc.c | 385 ++ 1 file changed, 385 insertions(+) diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index ee2ae45..c58b893 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -32,6 +32,390 @@ #include struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; + +static int imc_event_info(char *name, struct imc_events *events) +{ + char *buf; + + /* memory for content */ + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + events->ev_name = name; + events->ev_value = buf; + return 0; +} + +static int imc_event_info_str(struct property *pp, char *name, + struct imc_events *events) +{ + int ret; + + ret = imc_event_info(name, events); + if (ret) + return ret; + + if (!pp->value || (strnlen(pp->value, pp->length) == pp->length) || + (pp->length > IMC_MAX_PMU_NAME_LEN)) + return -EINVAL; + strncpy(events->ev_value, (const char *)pp->value, pp->length); + + return 0; +} + +static int imc_event_info_val(char *name, u32 val, + struct imc_events *events) +{ + int ret; + + ret = imc_event_info(name, events); + if (ret) + return ret; + sprintf(events->ev_value, "event=0x%x", val); + + return 0; +} + +static int set_event_property(struct property *pp, char *event_prop, + struct imc_events *events, char *ev_name) +{ + char *buf; + int ret; + + buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL); + if (!buf) + return -ENOMEM; + + sprintf(buf, "%s.%s", ev_name, event_prop); + ret = imc_event_info_str(pp, buf, 
events); + if (ret) { + kfree(events->ev_name); + kfree(events->ev_value); + } + + return ret; +} + +/* + * imc_events_node_parser: Parse the event node "dev" and assign the parsed + * information to event "events". + * + * Parses the "reg" property of this event. "reg" gives us the event offset. + * Also, parse the "scale" and "unit" properties, if any. + */ +static int imc_events_node_parser(struct device_node *dev, + struct imc_events *events, + struct property *event_scale, + struct property *event_unit, + struct property *name_prefix, + u32 reg) +{ + struct property *name, *pp; + char *ev_name; + u32 val; + int idx = 0, ret; + + if (!dev) + return -EINVAL; + + /* +* Loop through each property of an event node +*/ + name = of_find_property(dev, "event-name", NULL); +
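To make the parsing above concrete, here is roughly what the helpers shown in this patch produce for one event from the cover letter's device-tree excerpt (a sketch; the exact call sequence sits in the truncated part of the hunk, and events/idx come from the surrounding parser context).

/* Sketch: "mcs0" PMU node (events-prefix = "PM_MCS0_", reg = <0x118 0x8>)
 * plus its event node "RRTO_QFULL_NO_DISP" (reg = <0x0 0x8>)
 */
char *ev_name = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
u32 pmu_offset = 0x118;			/* "reg" of the PMU node   */
u32 ev_offset = 0x0;			/* "reg" of the event node */

sprintf(ev_name, "%s%s", "PM_MCS0_", "RRTO_QFULL_NO_DISP");
imc_event_info_val(ev_name, pmu_offset + ev_offset, &events[idx]);
/* result: an event named PM_MCS0_RRTO_QFULL_NO_DISP whose value string is
 * "event=0x118", i.e. the offset of its counter data in the nest region. */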
[PATCH v4 04/10] powerpc/perf: Add event attribute and group to IMC pmus
Device tree IMC driver code parses the IMC units and their events. It passes the information to IMC pmu code which is placed in powerpc/perf as "imc-pmu.c". This patch creates only event attributes and attribute groups for the IMC pmus. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/perf/Makefile| 6 +- arch/powerpc/perf/imc-pmu.c | 96 +++ arch/powerpc/platforms/powernv/opal-imc.c | 12 +++- 3 files changed, 111 insertions(+), 3 deletions(-) create mode 100644 arch/powerpc/perf/imc-pmu.c diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile index 4d606b9..d0d1f04 100644 --- a/arch/powerpc/perf/Makefile +++ b/arch/powerpc/perf/Makefile @@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror obj-$(CONFIG_PERF_EVENTS) += callchain.o perf_regs.o +imc-$(CONFIG_PPC_POWERNV) += imc-pmu.o + obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o obj64-$(CONFIG_PPC_PERF_CTRS) += power4-pmu.o ppc970-pmu.o power5-pmu.o \ power5+-pmu.o power6-pmu.o power7-pmu.o \ - isa207-common.o power8-pmu.o power9-pmu.o + isa207-common.o power8-pmu.o power9-pmu.o \ + $(imc-y) + obj32-$(CONFIG_PPC_PERF_CTRS) += mpc7450-pmu.o obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c new file mode 100644 index 000..7b6ce50 --- /dev/null +++ b/arch/powerpc/perf/imc-pmu.c @@ -0,0 +1,96 @@ +/* + * Nest Performance Monitor counter support. + * + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. + * (C) 2016 Hemant K Shaw, IBM Corporation. + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License version 2 as published + * by the Free Software Foundation. + */ +#include +#include +#include +#include +#include +#include + +struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; +struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; + +/* dev_str_attr : Populate event "name" and string "str" in attribute */ +static struct attribute *dev_str_attr(const char *name, const char *str) +{ + struct perf_pmu_events_attr *attr; + + attr = kzalloc(sizeof(*attr), GFP_KERNEL); + + sysfs_attr_init(&attr->attr.attr); + + attr->event_str = str; + attr->attr.attr.name = name; + attr->attr.attr.mode = 0444; + attr->attr.show = perf_event_sysfs_show; + + return &attr->attr.attr; +} + +/* + * update_events_in_group: Update the "events" information in an attr_group + * and assign the attr_group to the pmu "pmu". + */ +static int update_events_in_group(struct imc_events *events, + int idx, struct imc_pmu *pmu) +{ + struct attribute_group *attr_group; + struct attribute **attrs; + int i; + + /* Allocate memory for attribute group */ + attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL); + if (!attr_group) + return -ENOMEM; + + /* Allocate memory for attributes */ + attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL); + if (!attrs) { + kfree(attr_group); + return -ENOMEM; + } + + attr_group->name = "events"; + attr_group->attrs = attrs; + for (i = 0; i < idx; i++, events++) { + attrs[i] = dev_str_attr((char *)events->ev_name, + (char *)events->ev_value); + } + + pmu->attr_groups[0] = attr_group; + return 0; +} + +/* + * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events + *"events". 
+ * Setup the cpu mask information for these pmus and setup the state machine + * hotplug notifiers as well. + */ +int init_imc_pmu(struct imc_events *events, int idx, +struct imc_pmu *pmu_ptr) +{ + int ret = -ENODEV; + + ret = update_events_in_group(events, idx, pmu_ptr); + if (ret) + goto err_free; + + return 0; + +err_free: + /* Only free the attr_groups which are dynamically allocated */ + if (pmu_ptr->attr_groups[0]) { + kfree(pmu_ptr->attr_groups[0]->attrs); + kfree(pmu_ptr->attr_groups[0]); + } + + return ret; +} diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index c58b893..ed1e091 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -31,8 +31,11 @@ #include #include -s
[PATCH v4 05/10] powerpc/perf: Generic imc pmu event functions
Since, the IMC counters' data are periodically fed to a memory location, the functions to read/update, start/stop, add/del can be generic and can be used by all IMC PMU units. This patch adds a set of generic imc pmu related event functions to be used by each imc pmu unit. Add code to setup format attribute and to register imc pmus. Add a event_init function for nest_imc events. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/imc-pmu.h| 1 + arch/powerpc/perf/imc-pmu.c | 121 ++ arch/powerpc/platforms/powernv/opal-imc.c | 30 +++- 3 files changed, 148 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 3232322..7b58721 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -70,4 +70,5 @@ struct imc_pmu { #define UNKNOWN_DOMAIN -1 +int imc_get_domain(struct device_node *pmu_dev); #endif /* PPC_POWERNV_IMC_PMU_DEF_H */ diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 7b6ce50..f6f1ef9 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -17,6 +17,116 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; +/* Needed for sanity check */ +extern u64 nest_max_offset; + +PMU_FORMAT_ATTR(event, "config:0-20"); +static struct attribute *imc_format_attrs[] = { + &format_attr_event.attr, + NULL, +}; + +static struct attribute_group imc_format_group = { + .name = "format", + .attrs = imc_format_attrs, +}; + +static int nest_imc_event_init(struct perf_event *event) +{ + int chip_id; + u32 config = event->attr.config; + struct perchip_nest_info *pcni; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* Sampling not supported */ + if (event->hw.sample_period) + return -EINVAL; + + /* unsupported modes and filters */ + if (event->attr.exclude_user || + event->attr.exclude_kernel || + event->attr.exclude_hv || + event->attr.exclude_idle || + event->attr.exclude_host || + event->attr.exclude_guest) + return -EINVAL; + + if (event->cpu < 0) + return -EINVAL; + + /* Sanity check for config (event offset) */ + if (config > nest_max_offset) + return -EINVAL; + + chip_id = topology_physical_package_id(event->cpu); + pcni = &nest_perchip_info[chip_id]; + event->hw.event_base = pcni->vbase[config/PAGE_SIZE] + + (config & ~PAGE_MASK); + + return 0; +} + +static void imc_read_counter(struct perf_event *event) +{ + u64 *addr, data; + + addr = (u64 *)event->hw.event_base; + data = __be64_to_cpu(*addr); + local64_set(&event->hw.prev_count, data); +} + +static void imc_perf_event_update(struct perf_event *event) +{ + u64 counter_prev, counter_new, final_count, *addr; + + addr = (u64 *)event->hw.event_base; + counter_prev = local64_read(&event->hw.prev_count); + counter_new = __be64_to_cpu(*addr); + final_count = counter_new - counter_prev; + + local64_set(&event->hw.prev_count, counter_new); + local64_add(final_count, &event->count); +} + +static void imc_event_start(struct perf_event *event, int flags) +{ + imc_read_counter(event); +} + +static void imc_event_stop(struct perf_event *event, int flags) +{ + imc_perf_event_update(event); +} + +static int imc_event_add(struct perf_event *event, int flags) +{ + if (flags & PERF_EF_START) + 
imc_event_start(event, flags); + + return 0; +} + +/* update_pmu_ops : Populate the appropriate operations for "pmu" */ +static int update_pmu_ops(struct imc_pmu *pmu) +{ + if (!pmu) + return -EINVAL; + + pmu->pmu.task_ctx_nr = perf_invalid_context; + pmu->pmu.event_init = nest_imc_event_init; + pmu->pmu.add = imc_event_add; + pmu->pmu.del = imc_event_stop; + pmu->pmu.start = imc_event_start; + pmu->pmu.stop = imc_event_stop; + pmu->pmu.read = imc_perf_event_update; + pmu->attr_groups[1] = &imc_format_group; + pmu->pmu.attr_groups = pmu->attr_groups; + + return 0; +} + /* dev_str_attr : Populate event "name" and string "str" in attribute */ static struct attribute *dev_str_attr(const char *name, const char *str) { @@ -83,6 +193,17 @@ int init_imc_pmu(struct imc_events *events, int idx, if (ret) goto err_free; + ret = update_p
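A short worked example of the init/read path above (numbers are illustrative and a 64K page size is assumed):

/* Illustrative: event with config (offset) 0x118 opened on a CPU of chip 0 */
struct perchip_nest_info *pcni = &nest_perchip_info[0];
u64 config = 0x118;
u64 base = pcni->vbase[config / PAGE_SIZE] + (config & ~PAGE_MASK);
					/* = vbase[0] + 0x118 with 64K pages */
u64 prev = __be64_to_cpu(*(u64 *)base);	/* ->start: snapshot prev_count */
/* ... the nest microcode keeps updating the location in the background ... */
u64 now = __be64_to_cpu(*(u64 *)base);	/* ->read/->stop: snapshot again */
u64 delta = now - prev;			/* what gets added to event->count */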
[PATCH v4 06/10] powerpc/perf: IMC pmu cpumask and cpu hotplug support
Adds cpumask attribute to be used by each IMC pmu. Only one cpu (any online CPU) from each chip for nest PMUs is designated to read counters. On CPU hotplug, dying CPU is checked to see whether it is one of the designated cpus, if yes, next online cpu from the same chip (for nest units) is designated as new cpu to read counters. For this purpose, we introduce a new state : CPUHP_AP_PERF_POWERPC_NEST_ONLINE. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/opal-api.h| 3 +- arch/powerpc/include/asm/opal.h| 3 + arch/powerpc/perf/imc-pmu.c| 163 - arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + include/linux/cpuhotplug.h | 1 + 5 files changed, 169 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index a0aa285..e15fb20 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -168,7 +168,8 @@ #define OPAL_INT_SET_MFRR 125 #define OPAL_PCI_TCE_KILL 126 #define OPAL_NMMU_SET_PTCR 127 -#define OPAL_LAST 127 +#define OPAL_NEST_IMC_COUNTERS_CONTROL 128 +#define OPAL_LAST 128 /* Device tree flags */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 1ff03a6..d93d082 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t kill_type, uint64_t dma_addr, uint32_t npages); int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr); +int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1, + uint64_t value2, uint64_t value3); + /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, int depth, void *data); diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index f6f1ef9..e46ff6d 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -16,6 +16,7 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; +static cpumask_t nest_imc_cpumask; /* Needed for sanity check */ extern u64 nest_max_offset; @@ -31,6 +32,160 @@ static struct attribute_group imc_format_group = { .attrs = imc_format_attrs, }; +/* Get the cpumask printed to a buffer "buf" */ +static ssize_t imc_pmu_cpumask_get_attr(struct device *dev, + struct device_attribute *attr, char *buf) +{ + cpumask_t *active_mask; + + active_mask = &nest_imc_cpumask; + return cpumap_print_to_pagebuf(true, buf, active_mask); +} + +static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL); + +static struct attribute *imc_pmu_cpumask_attrs[] = { + &dev_attr_cpumask.attr, + NULL, +}; + +static struct attribute_group imc_pmu_cpumask_attr_group = { + .attrs = imc_pmu_cpumask_attrs, +}; + +/* + * nest_init : Initializes the nest imc engine for the current chip. 
+ */ +static void nest_init(int *loc) +{ + int rc; + + rc = opal_nest_imc_counters_control(NEST_IMC_PRODUCTION_MODE, + NEST_IMC_ENGINE_START, 0, 0); + if (rc) + loc[smp_processor_id()] = 1; +} + +static void nest_change_cpu_context(int old_cpu, int new_cpu) +{ + int i; + + for (i = 0; +(per_nest_pmu_arr[i] != NULL) && (i < IMC_MAX_PMUS); i++) + perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu, + old_cpu, new_cpu); +} + +static int ppc_nest_imc_cpu_online(unsigned int cpu) +{ + int nid, fcpu, ncpu; + struct cpumask *l_cpumask, tmp_mask; + + /* Fint the cpumask of this node */ + nid = cpu_to_node(cpu); + l_cpumask = cpumask_of_node(nid); + + /* +* If any of the cpu from this node is already present in the mask, +* just return, if not, then set this cpu in the mask. +*/ + if (!cpumask_and(&tmp_mask, l_cpumask, &nest_imc_cpumask)) { + cpumask_set_cpu(cpu, &nest_imc_cpumask); + return 0; + } + + fcpu = cpumask_first(l_cpumask); + ncpu = cpumask_next(cpu, l_cpumask); + if (cpu == fcpu) { + if (cpumask_test_and_clear_cpu(ncpu, &nest_imc_cpumask)) { + cpumask_set_cpu(cpu, &nest_imc_cpumask); + nest_change_cpu_context(ncp
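The registration of the new hotplug state is in the part of the patch not quoted here; a sketch of how such a state is typically wired up follows (the state name string and the offline handler name are assumptions, the online handler is the one shown above).

/* Sketch only: hook the online/offline handlers to the new state */
static int nest_pmu_cpumask_init(void)
{
	return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_NEST_ONLINE,
				 "perf/powerpc/imc:online",
				 ppc_nest_imc_cpu_online,
				 ppc_nest_imc_cpu_offline);
}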
[PATCH v4 07/10] powerpc/powernv: Core IMC events detection
This patch adds support for detection of core IMC events along with the Nest IMC events. It adds a new domain IMC_DOMAIN_CORE and its determined with the help of the compatibility string "ibm,imc-counters-core" based on the IMC device tree. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/imc-pmu.h| 2 ++ arch/powerpc/perf/imc-pmu.c | 3 +++ arch/powerpc/platforms/powernv/opal-imc.c | 18 -- 3 files changed, 21 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 7b58721..59de083 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -30,6 +30,7 @@ #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" +#define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core" /* * Structure to hold per chip specific memory address @@ -67,6 +68,7 @@ struct imc_pmu { * Domains for IMC PMUs */ #define IMC_DOMAIN_NEST1 +#define IMC_DOMAIN_CORE2 #define UNKNOWN_DOMAIN -1 diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index e46ff6d..9a0e3bc 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -18,8 +18,11 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; static cpumask_t nest_imc_cpumask; +struct imc_pmu *core_imc_pmu; + /* Needed for sanity check */ extern u64 nest_max_offset; +extern u64 core_max_offset; PMU_FORMAT_ATTR(event, "config:0-20"); static struct attribute *imc_format_attrs[] = { diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index a65aa2d..67ce873 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -33,10 +33,12 @@ extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; +extern struct imc_pmu *core_imc_pmu; extern int init_imc_pmu(struct imc_events *events, int idx, struct imc_pmu *pmu_ptr); u64 nest_max_offset; +u64 core_max_offset; static int imc_event_info(char *name, struct imc_events *events) { @@ -80,6 +82,10 @@ static void update_max_value(u32 value, int pmu_domain) if (nest_max_offset < value) nest_max_offset = value; break; + case IMC_DOMAIN_CORE: + if (core_max_offset < value) + core_max_offset = value; + break; default: /* Unknown domain, return */ return; @@ -231,6 +237,8 @@ int imc_get_domain(struct device_node *pmu_dev) { if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT)) return IMC_DOMAIN_NEST; + if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT)) + return IMC_DOMAIN_CORE; else return UNKNOWN_DOMAIN; } @@ -298,7 +306,10 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index) goto free_pmu; /* Needed for hotplug/migration */ - per_nest_pmu_arr[pmu_index] = pmu_ptr; + if (pmu_ptr->domain == IMC_DOMAIN_CORE) + core_imc_pmu = pmu_ptr; + else if (pmu_ptr->domain == IMC_DOMAIN_NEST) + per_nest_pmu_arr[pmu_index] = pmu_ptr; /* * "events" property inside a PMU node contains the phandle value @@ -354,7 +365,10 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index) } /* Save the name to register it later */ - sprintf(buf, "nest_%s", (char *)pp->value); + if (pmu_ptr->domain == 
IMC_DOMAIN_NEST) + sprintf(buf, "nest_%s", (char *)pp->value); + else + sprintf(buf, "%s_imc", (char *)pp->value); pmu_ptr->pmu.name = (char *)buf; /* -- 2.7.4
[PATCH v4 08/10] powerpc/perf: PMU functions for Core IMC and hotplugging
This patch adds the PMU function to initialize a core IMC event. It also adds cpumask initialization function for core IMC PMU. For initialization, a page of memory is allocated per core where the data for core IMC counters will be accumulated. The base address for this page is sent to OPAL via an OPAL call which initializes various SCOMs related to Core IMC initialization. Upon any errors, the pages are free'ed and core IMC counters are disabled using the same OPAL call. For CPU hotplugging, a cpumask is initialized which contains an online CPU from each core. If a cpu goes offline, we check whether that cpu belongs to the core imc cpumask, if yes, then, we migrate the PMU context to any other online cpu (if available) in that core. If a cpu comes back online, then this cpu will be added to the core imc cpumask only if there was no other cpu from that core in the previous cpumask. To register the hotplug functions for core_imc, a new state CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE is added to the list of existing states. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/imc-pmu.h | 1 + arch/powerpc/include/asm/opal-api.h| 10 +- arch/powerpc/include/asm/opal.h| 2 + arch/powerpc/perf/imc-pmu.c| 248 - arch/powerpc/platforms/powernv/opal-imc.c | 4 +- arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + include/linux/cpuhotplug.h | 1 + 7 files changed, 257 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 59de083..5e76cd0 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -21,6 +21,7 @@ #define IMC_MAX_CHIPS 32 #define IMC_MAX_PMUS 32 #define IMC_MAX_PMU_NAME_LEN 256 +#define IMC_MAX_CORES 256 #define NEST_IMC_ENGINE_START 1 #define NEST_IMC_ENGINE_STOP 0 diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index e15fb20..4ee52e8 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -169,7 +169,8 @@ #define OPAL_PCI_TCE_KILL 126 #define OPAL_NMMU_SET_PTCR 127 #define OPAL_NEST_IMC_COUNTERS_CONTROL 128 -#define OPAL_LAST 128 +#define OPAL_CORE_IMC_COUNTERS_CONTROL 129 +#define OPAL_LAST 129 /* Device tree flags */ @@ -929,6 +930,13 @@ enum { OPAL_PCI_TCE_KILL_ALL, }; +/* Operation argument to Core IMC */ +enum { + OPAL_CORE_IMC_DISABLE, + OPAL_CORE_IMC_ENABLE, + OPAL_CORE_IMC_INIT, +}; + #endif /* __ASSEMBLY__ */ #endif /* __OPAL_API_H */ diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index d93d082..c4baa6d 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -229,6 +229,8 @@ int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr); int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1, uint64_t value2, uint64_t value3); +int64_t opal_core_imc_counters_control(uint64_t operation, uint64_t addr, + uint64_t value2, uint64_t value3); /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 9a0e3bc..61d99c7 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -1,5 +1,5 @@ /* - * Nest Performance Monitor counter support. 
+ * IMC Performance Monitor counter support. * * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation. * (C) 2016 Hemant K Shaw, IBM Corporation. @@ -18,6 +18,9 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS]; struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS]; static cpumask_t nest_imc_cpumask; +/* Maintains base addresses for all the cores */ +static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES]; +static cpumask_t core_imc_cpumask; struct imc_pmu *core_imc_pmu; /* Needed for sanity check */ @@ -37,11 +40,18 @@ static struct attribute_group imc_format_group = { /* Get the cpumask printed to a buffer "buf" */ static ssize_t imc_pmu_cpumask_get_attr(struct device *dev, - struct device_attribute *attr, char *buf) + struct device_attribute *attr, + char *buf) { + struct pmu *pmu = dev_get_drvdata(dev); cpumask_t *active_mask; -
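The per-core setup described in the changelog is largely in the truncated part of the patch; the sketch below shows the general shape it takes. The helper name and the error handling are assumptions; the OPAL call and the pdbar table are taken from the patch.

/* Rough sketch of initialising core IMC for one core */
static int core_imc_init_one(int chip_id, int core_id)
{
	unsigned long addr = __get_free_pages(GFP_KERNEL | __GFP_ZERO, 0);
	int rc;

	if (!addr)
		return -ENOMEM;

	per_core_pdbar_add[chip_id][core_id] = (u64)addr;
	rc = opal_core_imc_counters_control(OPAL_CORE_IMC_INIT,
					    __pa(addr), 0, 0);
	if (rc) {
		/* on failure, disable the counters and free the page */
		opal_core_imc_counters_control(OPAL_CORE_IMC_DISABLE, 0, 0, 0);
		free_pages(addr, 0);
	}
	return rc;
}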
[PATCH v4 09/10] powerpc/powernv: Thread IMC events detection
Patch adds support for detection of thread IMC events. It adds a new domain IMC_DOMAIN_THREAD and it is determined with the help of the compatibility string "ibm,imc-counters-thread" based on the IMC device tree. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/imc-pmu.h| 2 ++ arch/powerpc/perf/imc-pmu.c | 1 + arch/powerpc/platforms/powernv/opal-imc.c | 11 +-- 3 files changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index 5e76cd0..f2b4f12 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -32,6 +32,7 @@ #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters" #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest" #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core" +#define IMC_DTB_THREAD_COMPAT "ibm,imc-counters-thread" /* * Structure to hold per chip specific memory address @@ -70,6 +71,7 @@ struct imc_pmu { */ #define IMC_DOMAIN_NEST1 #define IMC_DOMAIN_CORE2 +#define IMC_DOMAIN_THREAD 3 #define UNKNOWN_DOMAIN -1 diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index 61d99c7..a48c5be 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -26,6 +26,7 @@ struct imc_pmu *core_imc_pmu; /* Needed for sanity check */ extern u64 nest_max_offset; extern u64 core_max_offset; +extern u64 thread_max_offset; PMU_FORMAT_ATTR(event, "config:0-20"); static struct attribute *imc_format_attrs[] = { diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c index 6db3c5f..a5565e7 100644 --- a/arch/powerpc/platforms/powernv/opal-imc.c +++ b/arch/powerpc/platforms/powernv/opal-imc.c @@ -39,6 +39,7 @@ extern int init_imc_pmu(struct imc_events *events, int idx, struct imc_pmu *pmu_ptr); u64 nest_max_offset; u64 core_max_offset; +u64 thread_max_offset; static int imc_event_info(char *name, struct imc_events *events) { @@ -86,6 +87,10 @@ static void update_max_value(u32 value, int pmu_domain) if (core_max_offset < value) core_max_offset = value; break; + case IMC_DOMAIN_THREAD: + if (thread_max_offset < value) + thread_max_offset = value; + break; default: /* Unknown domain, return */ return; @@ -239,6 +244,8 @@ int imc_get_domain(struct device_node *pmu_dev) return IMC_DOMAIN_NEST; if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT)) return IMC_DOMAIN_CORE; + if (of_device_is_compatible(pmu_dev, IMC_DTB_THREAD_COMPAT)) + return IMC_DOMAIN_THREAD; else return UNKNOWN_DOMAIN; } @@ -277,7 +284,7 @@ static void imc_free_events(struct imc_events *events, int nr_entries) /* * imc_pmu_create : Takes the parent device which is the pmu unit and a * pmu_index as the inputs. - * Allocates memory for the pmu, sets up its domain (NEST or CORE), and + * Allocates memory for the pmu, sets up its domain (NEST/CORE/THREAD), and * allocates memory for the events supported by this pmu. Assigns a name for * the pmu. Calls imc_events_node_parser() to setup the individual events. 
* If everything goes fine, it calls, init_imc_pmu() to setup the pmu device @@ -305,7 +312,7 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index) if (pmu_ptr->domain == UNKNOWN_DOMAIN) goto free_pmu; - /* Needed for hotplug/migration */ + /* Needed for hotplug/migration for nest and core IMC PMUs */ if (pmu_ptr->domain == IMC_DOMAIN_CORE) core_imc_pmu = pmu_ptr; else if (pmu_ptr->domain == IMC_DOMAIN_NEST) -- 2.7.4
[PATCH v4 10/10] powerpc/perf: Thread IMC PMU functions
This patch adds the PMU functions required for event initialization, read, update, add, del, etc. for the thread IMC PMU. Thread IMC PMUs are used for per-task monitoring. These PMUs don't need any hotplug support. For each CPU, a page of memory is allocated and kept static, i.e. these pages will exist till the machine shuts down. The base address of this page is written to the LDBAR of that CPU. As soon as we do that, the thread IMC counters start running for that CPU and their data is written to the allocated page. But we use this for per-task monitoring. Whenever we start monitoring a task, the event is added to that task and we read the initial value of the event. Whenever we stop monitoring the task, the final value is read and the difference is the event data. Now, a task can move to a different CPU. Suppose a task X moves from CPU A to CPU B. When the task is scheduled out of A, we get an event_del for A, the event data is updated, and we stop updating X's event data. As soon as X moves on to B, event_add is called for B and we update the event data again. This is how the event data keeps getting updated even as the task is scheduled across different CPUs. Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Anton Blanchard Cc: Sukadev Bhattiprolu Cc: Michael Neuling Cc: Stewart Smith Cc: Daniel Axtens Cc: Stephane Eranian Cc: Balbir Singh Cc: Anju T Sudhakar Signed-off-by: Hemant Kumar --- arch/powerpc/include/asm/imc-pmu.h | 4 + arch/powerpc/perf/imc-pmu.c | 161 - 2 files changed, 164 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h index f2b4f12..8b7141b 100644 --- a/arch/powerpc/include/asm/imc-pmu.h +++ b/arch/powerpc/include/asm/imc-pmu.h @@ -22,6 +22,7 @@ #define IMC_MAX_PMUS 32 #define IMC_MAX_PMU_NAME_LEN 256 #define IMC_MAX_CORES 256 +#define IMC_MAX_CPUS 2048 #define NEST_IMC_ENGINE_START 1 #define NEST_IMC_ENGINE_STOP 0 @@ -34,6 +35,9 @@ #define IMC_DTB_CORE_COMPAT "ibm,imc-counters-core" #define IMC_DTB_THREAD_COMPAT "ibm,imc-counters-thread" +#define THREAD_IMC_LDBAR_MASK 0x0003e000 +#define THREAD_IMC_ENABLE 0x8000 + /* * Structure to hold per chip specific memory address * information for nest pmus.
Nest Counter data are exported diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c index a48c5be..4033b2d 100644 --- a/arch/powerpc/perf/imc-pmu.c +++ b/arch/powerpc/perf/imc-pmu.c @@ -23,6 +23,9 @@ static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES]; static cpumask_t core_imc_cpumask; struct imc_pmu *core_imc_pmu; +/* Maintains base address for all the cpus */ +static u64 per_cpu_add[IMC_MAX_CPUS]; + /* Needed for sanity check */ extern u64 nest_max_offset; extern u64 core_max_offset; @@ -443,6 +446,56 @@ static int core_imc_event_init(struct perf_event *event) return 0; } +static int thread_imc_event_init(struct perf_event *event) +{ + struct task_struct *target; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* Sampling not supported */ + if (event->hw.sample_period) + return -EINVAL; + + event->hw.idx = -1; + + /* Sanity check for config (event offset) */ + if (event->attr.config > thread_max_offset) + return -EINVAL; + + target = event->hw.target; + + if (!target) + return -EINVAL; + + event->pmu->task_ctx_nr = perf_sw_context; + return 0; +} + +static void thread_imc_read_counter(struct perf_event *event) +{ + u64 *addr, data; + int cpu_id = smp_processor_id(); + + addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config); + data = __be64_to_cpu(*addr); + local64_set(&event->hw.prev_count, data); +} + +static void thread_imc_perf_event_update(struct perf_event *event) +{ + u64 counter_prev, counter_new, final_count, *addr; + int cpu_id = smp_processor_id(); + + addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config); + counter_prev = local64_read(&event->hw.prev_count); + counter_new = __be64_to_cpu(*addr); + final_count = counter_new - counter_prev; + + local64_set(&event->hw.prev_count, counter_new); + local64_add(final_count, &event->count); +} + static void imc_read_counter(struct perf_event *event) { u64 *addr, data; @@ -483,6 +536,53 @@ static int imc_event_add(struct perf_event *event, int flags) return 0; } +static void thread_imc_event_start(struct perf_event *event, int flags) +{ + thread_imc_read_counter(event); +} + +static void thread_imc_event_stop(struct perf_event *event, int flags) +{ + thread_
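As a rough illustration of the prev_count/delta scheme described in the commit message above, here is a minimal stand-alone sketch in plain C; it is an illustration only, not kernel code, and read_hw_counter() just stands in for reading the 64-bit counter from the per-CPU page at the event's config offset (thread_imc_perf_event_update() in the patch follows the same pattern):

/*
 * Minimal stand-alone sketch of the per-task accounting described above;
 * illustration only, not kernel code.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t hw_counter;		/* stands in for the per-CPU IMC counter */

static uint64_t read_hw_counter(void)
{
	return hw_counter;
}

struct fake_event {
	uint64_t prev_count;		/* snapshot taken when the task is scheduled in */
	uint64_t count;			/* accumulated per-task event data */
};

/* event_add(): task scheduled in on this CPU -> snapshot the counter */
static void event_add(struct fake_event *ev)
{
	ev->prev_count = read_hw_counter();
}

/* event_del()/read: task scheduled out -> fold the delta into the count */
static void event_update(struct fake_event *ev)
{
	uint64_t now = read_hw_counter();

	ev->count += now - ev->prev_count;
	ev->prev_count = now;
}

int main(void)
{
	struct fake_event ev = { 0, 0 };

	hw_counter = 100; event_add(&ev);	/* task X starts on CPU A */
	hw_counter = 150; event_update(&ev);	/* scheduled out of A: +50 */
	hw_counter = 400; event_add(&ev);	/* task X starts on CPU B */
	hw_counter = 470; event_update(&ev);	/* scheduled out of B: +70 */

	printf("event data: %llu\n", (unsigned long long)ev.count);	/* prints 120 */
	return 0;
}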
Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
On Monday 20 February 2017 02:35 AM, Benjamin Herrenschmidt wrote: On Sun, 2017-02-19 at 15:48 +0530, Aneesh Kumar K.V wrote: +#ifdef CONFIG_PPC_BOOK3S_64 + /* +* We need to make sure that for different page sizes reported by +* firmware we only add hugetlb support for page sizes that can be +* supported by linux page table layout. +* For now we have +* Radix: 2M +* Hash: 16M and 16G +*/ + if (radix_enabled()) { + if (mmu_psize != MMU_PAGE_2M) + return -EINVAL; + } else { + if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G) + return -EINVAL; + } Hash could support others... On book3s 64 ? I had the above within #ifdef. Same with radix and PUD level pages. Yes, but gigantic hugepage is not yet supported. Once we add that we will add MMU_PAGE_1G here. Why do we need that ? Won't FW provide separate properties for hash and radix page sizes anyway ? To avoid crashes like the one reported in the commit message due to buggy firmware ? Also, it can serve as an easy way to understand which hugepage sizes are supported by different platforms. I have yet to figure out what the FSL_BOOK3E and PPC_8xx #ifdefs above that hunk are all about; having the supported hugepage sizes explicitly verified against makes that easier, doesn't it? -aneesh
Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
On Mon, 2017-02-20 at 09:02 +0530, Aneesh Kumar K.V wrote: > To avoid crashes like the one reported in the commit message due to > buggy firmware ? I don't want Linux to make those assumptions. We should fix the FW. Think of backward compat for example. > Also > It can serve as an easy way to understand what hugepage sizes are > supported by different platforms. > I am yet to figure out what the FSL_BOOK3E and PPC_8xx #ifdef above > that > hunk is all about. Having > the supported hugepage size clearly verified against makes it easy ? > > -aneesh
Re: [next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140
On Sun, 2017-02-19 at 20:39 +0530, Sachin Sant wrote: > While booting next-20170217 on a POWER6 box, I ran into following > warning. This is a full system lpar. Previous next tree was good. > I will try a bisect tomorrow. > > ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015) > ipr 0200:00:01.0: Found IOA with IRQ: 305 > [ cut here ] > WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 > .icp_hv_eoi+0x40/0x140 This indicates that the CPPR stack underflowed (we don't know the CPPR value at the time of the interrupt that we are going to EOI). The problem could have occurred elsewhere, but it shows up at the first interrupt after the real cause. Could you paste the full dmesg and config, and follow Michael's suggestion for debugging SHIRQs? Balbir
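For context, the warning comes from the CPPR stack pop helper in arch/powerpc/include/asm/xics.h, which looks roughly like the following (paraphrased, not a verbatim copy of the kernel source, so names and the exact line number may differ slightly):

static inline unsigned int xics_pop_cppr(void)
{
	struct xics_cppr *os_cppr = this_cpu_ptr(&xics_cppr);

	/*
	 * An EOI is being done without a matching interrupt having pushed
	 * a CPPR value, so the stack is empty and we don't know which
	 * priority to restore -- this is the WARN seen at xics.h:124.
	 */
	if (WARN_ON(os_cppr->index < 1))
		return LOWEST_PRIORITY;

	return os_cppr->stack[--os_cppr->index];
}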
Re: [PATCH 06/35] powerpc: Convert remaining uses of pr_warning to pr_warn
Joe Perches writes: > To enable eventual removal of pr_warning > > This makes pr_warn use consistent for arch/powerpc > > Prior to this patch, there were 36 uses of pr_warning and > 217 uses of pr_warn in arch/powerpc > > Signed-off-by: Joe Perches Can I take this via the powerpc tree, or do you want to merge them as a series? cheers
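For readers not following the series, the conversion is purely mechanical; a made-up example hunk (not taken from the actual patches) looks like:

-	pr_warning("%s: no interrupt found\n", np->full_name);
+	pr_warn("%s: no interrupt found\n", np->full_name);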
Re: [PATCH 06/35] powerpc: Convert remaining uses of pr_warning to pr_warn
On Mon, 2017-02-20 at 15:40 +1100, Michael Ellerman wrote: > Joe Perches writes: > > > To enable eventual removal of pr_warning > > > > This makes pr_warn use consistent for arch/powerpc > > > > Prior to this patch, there were 36 uses of pr_warning and > > 217 uses of pr_warn in arch/powerpc > > > > Signed-off-by: Joe Perches > > Can I take this via the powerpc tree, or do you want to merge them as a > series? Well, I expect it'd be better if you merge it.
Re: [PATCH] powerpc/xmon: Fix an unexpected xmon onoff state change
Pan Xinhui writes: > On 2017/2/17 14:05, Michael Ellerman wrote: >> Pan Xinhui writes: >>> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c >>> index 9c0e17c..f6e5c3d 100644 >>> --- a/arch/powerpc/xmon/xmon.c >>> +++ b/arch/powerpc/xmon/xmon.c >>> @@ -76,6 +76,7 @@ static int xmon_gate; >>> #endif /* CONFIG_SMP */ >>> >>> static unsigned long in_xmon __read_mostly = 0; >>> +static int xmon_off = !IS_ENABLED(CONFIG_XMON_DEFAULT); >> >> I think the logic would probably be clearer if we invert this to become >> xmon_on. >> > yep, makes sense. > >>> @@ -3266,16 +3269,16 @@ static int __init setup_xmon_sysrq(void) >>> __initcall(setup_xmon_sysrq); >>> #endif /* CONFIG_MAGIC_SYSRQ */ >>> >>> -static int __initdata xmon_early, xmon_off; >>> +static int __initdata xmon_early; >>> >>> static int __init early_parse_xmon(char *p) >>> { >>> if (!p || strncmp(p, "early", 5) == 0) { >>> /* just "xmon" is equivalent to "xmon=early" */ >>> - xmon_init(1); >>> xmon_early = 1; >>> + xmon_off = 0; >>> } else if (strncmp(p, "on", 2) == 0) >>> - xmon_init(1); >>> + xmon_off = 0; >> >> You've just changed the timing of when xmon gets enabled for the above >> two cases, from here which is called very early, to xmon_setup() which >> is called much later in boot. >> >> That effectively disables xmon for most of the boot, which we do not >> want to do. >> > Although it is not often that the kernel gets stuck during boot. I hope you're joking! :) cheers
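To illustrate the timing point, one way to keep the early enablement while still inverting the flag could look like the following sketch (illustration only, not the actual respin of the patch; other xmon= options omitted for brevity):

static int xmon_on = IS_ENABLED(CONFIG_XMON_DEFAULT);

static int __init early_parse_xmon(char *p)
{
	if (!p || strncmp(p, "early", 5) == 0) {
		/* just "xmon" is equivalent to "xmon=early" */
		xmon_init(1);		/* still enabled right here, very early in boot */
		xmon_early = 1;
		xmon_on = 1;
	} else if (strncmp(p, "on", 2) == 0) {
		xmon_init(1);
		xmon_on = 1;
	} else if (strncmp(p, "off", 3) == 0) {
		xmon_on = 0;
	} else {
		return 1;
	}

	return 0;
}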
Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails
Stewart Smith writes: > Vipin K Parashar writes: >> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote: >>> Vipin K Parashar writes: >>> OPAL returns OPAL_WRONG_STATE for XSCOM operations done to read the FIR of any core which is sleeping or offline. >>> OK. >>> >>> Do we know why Linux is causing that to happen? >> >> This issue was originally seen upon running STAF (Software Test >> Automation Framework) stress tests and off-lining some cores >> with the stress tests running. >> >> It can also be re-created after off-lining a few cores and following >> one of the methods below. >> 1. Executing the Linux "sensors" command >> 2. Reading the contents of file /sys/class/hwmon/hwmon0/tempX_input, >> where X is an offline CPU. >> >> It's the "opal_get_sensor_data" Linux API that triggers the >> OPAL call "opal_sensor_read", which performs the XSCOM ops here. >> If the core is found sleeping/offline, Linux throws up an >> "opal_error_code: Unexpected OPAL error" error onto the console. >> >> Currently Linux isn't aware of the OPAL_WRONG_STATE return code >> from OPAL. Thus it prints the "Unexpected OPAL error" message, the same >> as it would log for any unknown OPAL return code. >> >> Seeing this error on the console has been a concern for test and >> would puzzle real users as well. This patch makes Linux aware of the >> OPAL_WRONG_STATE return code from OPAL and stops printing the >> "Unexpected OPAL error" message onto the console for OPAL failures >> with OPAL_WRONG_STATE. > > Ahh... so this is a DTS sensor, which indeed is just XSCOMs, and we > return the xscom_read return code in the event of error. > > I would argue that converting to EIO in that instance is probably > correct... or EAGAIN? EAGAIN may be more correct in the situation where > the core is just sleeping. > > What kind of offlining are you doing? > > Arguably, the correct behaviour would be to remove said sensors when the > core is offline. Right, that would be ideal. There appear to be at least two other hwmon drivers that are CPU hotplug aware (coretemp and via-cputemp). But perhaps it's not possible to work out which sensors are attached to which CPU etc.; I haven't looked in detail. In that case changing just opal_get_sensor_data() to handle OPAL_WRONG_STATE would be OK, with a comment explaining that we might be asked to read a sensor on an offline CPU and we aren't able to detect that. cheers
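A sketch of the kind of handling being suggested (the helper name sensor_ret_to_errno() is made up for illustration; the real code in arch/powerpc/platforms/powernv/opal-sensor.c is structured around the async token machinery):

static int sensor_ret_to_errno(s64 rc)
{
	switch (rc) {
	case OPAL_SUCCESS:
		return 0;
	case OPAL_WRONG_STATE:
		/*
		 * We might be asked to read a sensor on an offline or
		 * sleeping core and cannot detect that here, so fail
		 * quietly with -EIO (or -EAGAIN) instead of logging
		 * "Unexpected OPAL error".
		 */
		return -EIO;
	default:
		return opal_error_code(rc);	/* existing generic mapping */
	}
}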
Re: [next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140
>> While booting next-20170217 on a POWER6 box, I ran into following >> warning. This is a full system lpar. Previous next tree was good. >> I will try a bisect tomorrow. > > Do you have CONFIG_DEBUG_SHIRQ=y ? > Yes. CONFIG_DEBUG_SHIRQ is enabled. As suggested by you reverting following commit allows a clean boot. f91f694540f3 ("genirq: Reenable shared irq debugging in request_*_irq()”) >> ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015) >> ipr 0200:00:01.0: Found IOA with IRQ: 305 >> [ cut here ] >> WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 >> .icp_hv_eoi+0x40/0x140 >> Modules linked in: >> CPU: 12 PID: 1 Comm: swapper/14 Not tainted >> 4.10.0-rc8-next-20170217-autotest #1 >> task: c002b2a4a580 task.stack: c002b2a5c000 >> NIP: c00731b0 LR: c01389f8 CTR: c0073170 >> REGS: c002b2a5f050 TRAP: 0700 Not tainted >> (4.10.0-rc8-next-20170217-autotest) >> MSR: 80029032 >> CR: 28004082 XER: 2004 >> CFAR: c01389e0 SOFTE: 0 >> GPR00: c01389f8 c002b2a5f2d0 c1025800 c002b203f498 >> GPR04: 0064 0131 >> GPR08: 0001 c000d3104cb8 0009b1f8 >> GPR12: 48004082 cedc2400 c000dad0 >> GPR16: 3c007efc c0a9e848 >> GPR20: d8008008 c002af4d47f0 c11efda8 c0a9ea10 >> GPR24: c0a9e848 c002af4d4fb8 >> GPR28: c002b203f498 c0ef8928 c002b203f400 >> NIP [c00731b0] .icp_hv_eoi+0x40/0x140 >> LR [c01389f8] .handle_fasteoi_irq+0x1e8/0x270 >> Call Trace: >> [c002b2a5f2d0] [c002b2a5f360] 0xc002b2a5f360 (unreliable) >> [c002b2a5f360] [c01389f8] .handle_fasteoi_irq+0x1e8/0x270 >> [c002b2a5f3e0] [c0136a08] .request_threaded_irq+0x298/0x370 >> [c002b2a5f490] [c05895c0] .ipr_probe_ioa+0x1110/0x1390 >> [c002b2a5f5c0] [c058d030] .ipr_probe+0x30/0x3e0 >> [c002b2a5f670] [c0466860] .local_pci_probe+0x60/0x130 >> [c002b2a5f710] [c0467658] .pci_device_probe+0x148/0x1e0 >> [c002b2a5f7c0] [c0527524] .driver_probe_device+0x2d4/0x5b0 >> [c002b2a5f860] [c052796c] .__driver_attach+0x16c/0x190 >> [c002b2a5f8f0] [c05242c4] .bus_for_each_dev+0x84/0xf0 >> [c002b2a5f990] [c0526af4] .driver_attach+0x24/0x40 >> [c002b2a5fa00] [c0526318] .bus_add_driver+0x2a8/0x370 >> [c002b2a5faa0] [c0528a5c] .driver_register+0x8c/0x170 >> [c002b2a5fb20] [c0465a54] .__pci_register_driver+0x44/0x60 >> [c002b2a5fb90] [c0b8efc8] .ipr_init+0x58/0x70 >> [c002b2a5fc10] [c000d20c] .do_one_initcall+0x5c/0x1c0 >> [c002b2a5fce0] [c0b44738] .kernel_init_freeable+0x280/0x360 >> [c002b2a5fdb0] [c000daec] .kernel_init+0x1c/0x130 >> [c002b2a5fe30] [c000baa0] .ret_from_kernel_thread+0x58/0xb8 >> Instruction dump: >> f8010010 f821ff71 80e3000c 7c0004ac e94d0030 3d02ffbc 3928f4b8 7d295214 >> 81090004 3948 7d484378 79080fe2 <0b08> 2fa8 40de0050 91490004 >> ---[ end trace 5e18ae409f46392c ]--- >> ipr 0200:00:01.0: Initializing IOA. >> >> Thanks >> -Sachin >
Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout
Benjamin Herrenschmidt writes: > On Mon, 2017-02-20 at 09:02 +0530, Aneesh Kumar K.V wrote: >> To avoid crashes like the one reported in the commit message due to >> buggy firmware ? > > I don't want Linux to make those assumptions. We should fix the FW. > I was not suggesting that we not fix the FW. The idea was two-fold. First, we cannot support arbitrary hugetlb page sizes; they need to be supported at the Linux page table level, so a generic check like is_power_of_2/4() may not be what we want. Second, it documents clearly which page sizes are supported by each platform. -aneesh