[PATCH V3 0/3] Numabalancing preserve write fix

2017-02-19 Thread Aneesh Kumar K.V
This patch series addresses an issue w.r.t. THP migration and the autonuma
preserve write feature. migrate_misplaced_transhuge_page() cannot deal with
concurrent modification of the page. It does a page copy without
following the migration pte sequence. IIUC, this was done to keep the
migration simpler, and at the time of implementation we didn't have THP
page cache, which would have required a more elaborate migration scheme.
That means THP autonuma migration expects the protnone with saved write
to be set up such that neither the kernel nor userspace can update
the page content. This patch series enables archs like ppc64 to do that.
We are good with the hash translation mode with the current code,
because we never create a hardware page table entry for a protnone pte.

Changes from V2:
* Fix kvm crashes due to ksm not clearing savedwrite bit.

Changes from V1:
* Update the patch so that it applies cleanly to upstream.
* Add Acked-by from Michael Neuling

Aneesh Kumar K.V (3):
  mm/autonuma: Let architecture override how the write bit should be
stashed in a protnone pte.
  mm/ksm: Handle protnone saved writes when making page write protect
  powerpc/mm/autonuma: Switch ppc64 to its own implementation of saved
write

 arch/powerpc/include/asm/book3s/64/pgtable.h | 52 
 include/asm-generic/pgtable.h| 24 +
 mm/huge_memory.c |  6 ++--
 mm/ksm.c |  9 +++--
 mm/memory.c  |  2 +-
 mm/mprotect.c|  4 +--
 6 files changed, 82 insertions(+), 15 deletions(-)

-- 
2.7.4



[PATCH V3 1/3] mm/autonuma: Let architecture override how the write bit should be stashed in a protnone pte.

2017-02-19 Thread Aneesh Kumar K.V
Autonuma preserves the write permission across a numa fault to avoid taking
a write fault after a numa fault (commit b191f9b106ea "mm: numa: preserve PTE
write permissions across a NUMA hinting fault"). Architectures can implement
protnone in different ways, and some may choose to implement it by clearing the
Read/Write/Exec bits of the pte. Setting the write bit on such a pte can result
in wrong behaviour. Fix this up by allowing the arch to override how to save the
write bit on a protnone pte.
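
For illustration, here is a minimal userspace sketch -- not kernel code, and the
bit positions are invented -- of why the generic fallback (which reuses
pte_mkwrite()) cannot work for an arch whose protnone clears Read/Write/Exec:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t pte_t;

#define MY_PRESENT (1ull << 0)  /* hypothetical bit layout, illustration only */
#define MY_WRITE   (1ull << 1)
#define MY_PRIV    (1ull << 2)  /* stands in for _PAGE_PRIVILEGED */

/* Generic fallback added by this patch: saved write is just the write bit. */
static pte_t generic_mk_savedwrite(pte_t p) { return p | MY_WRITE; }
static bool  generic_savedwrite(pte_t p)    { return p & MY_WRITE; }

/*
 * Arch-style override (modelled loosely on patch 3 of this series): protnone
 * keeps R/W/X clear, so saved write is stashed by clearing the privileged
 * bit instead of touching the write bit.
 */
static pte_t arch_mk_savedwrite(pte_t p) { return p & ~MY_PRIV; }
static bool  arch_savedwrite(pte_t p)    { return !(p & MY_PRIV); }

int main(void)
{
        pte_t protnone = MY_PRESENT | MY_PRIV;  /* R/W/X all clear */
        pte_t g = generic_mk_savedwrite(protnone);
        pte_t a = arch_mk_savedwrite(protnone);

        /* the generic path sets the write bit on a protnone pte -- the very
           thing an arch that implements protnone by clearing RWX must avoid */
        printf("generic: savedwrite=%d write bit set=%d\n",
               generic_savedwrite(g), !!(g & MY_WRITE));
        printf("arch:    savedwrite=%d write bit set=%d\n",
               arch_savedwrite(a), !!(a & MY_WRITE));
        return 0;
}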

Acked-By: Michael Neuling 
Signed-off-by: Aneesh Kumar K.V 
---
 include/asm-generic/pgtable.h | 16 
 mm/huge_memory.c  |  6 +++---
 mm/memory.c   |  2 +-
 mm/mprotect.c |  4 ++--
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 18af2bcefe6a..b6f3a8a4b738 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -192,6 +192,22 @@ static inline void ptep_set_wrprotect(struct mm_struct 
*mm, unsigned long addres
 }
 #endif
 
+#ifndef pte_savedwrite
+#define pte_savedwrite pte_write
+#endif
+
+#ifndef pte_mk_savedwrite
+#define pte_mk_savedwrite pte_mkwrite
+#endif
+
+#ifndef pmd_savedwrite
+#define pmd_savedwrite pmd_write
+#endif
+
+#ifndef pmd_mk_savedwrite
+#define pmd_mk_savedwrite pmd_mkwrite
+#endif
+
 #ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f1d93257fb9..e6de801fa477 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1253,7 +1253,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
}
 
/* See similar comment in do_numa_page for explanation */
-   if (!pmd_write(pmd))
+   if (!pmd_savedwrite(pmd))
flags |= TNF_NO_GROUP;
 
/*
@@ -1316,7 +1316,7 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
goto out;
 clear_pmdnuma:
BUG_ON(!PageLocked(page));
-   was_writable = pmd_write(pmd);
+   was_writable = pmd_savedwrite(pmd);
pmd = pmd_modify(pmd, vma->vm_page_prot);
pmd = pmd_mkyoung(pmd);
if (was_writable)
@@ -1571,7 +1571,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
*pmd,
entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
if (preserve_write)
-   entry = pmd_mkwrite(entry);
+   entry = pmd_mk_savedwrite(entry);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
diff --git a/mm/memory.c b/mm/memory.c
index 6bf2b471e30c..641b83dbff60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3388,7 +3388,7 @@ static int do_numa_page(struct vm_fault *vmf)
int target_nid;
bool migrated = false;
pte_t pte = vmf->orig_pte;
-   bool was_writable = pte_write(pte);
+   bool was_writable = pte_savedwrite(pte);
int flags = 0;
 
/*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f9c07f54dd62..15f5c174a7c1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -113,13 +113,13 @@ static unsigned long change_pte_range(struct 
vm_area_struct *vma, pmd_t *pmd,
ptent = ptep_modify_prot_start(mm, addr, pte);
ptent = pte_modify(ptent, newprot);
if (preserve_write)
-   ptent = pte_mkwrite(ptent);
+   ptent = pte_mk_savedwrite(ptent);
 
/* Avoid taking write faults for known dirty pages */
if (dirty_accountable && pte_dirty(ptent) &&
(pte_soft_dirty(ptent) ||
 !(vma->vm_flags & VM_SOFTDIRTY))) {
-   ptent = pte_mkwrite(ptent);
+   ptent = pte_mk_savedwrite(ptent);
}
ptep_modify_prot_commit(mm, addr, pte, ptent);
pages++;
-- 
2.7.4



[PATCH V3 2/3] mm/ksm: Handle protnone saved writes when making page write protect

2017-02-19 Thread Aneesh Kumar K.V
Without this, KSM will consider the page write protected, but a numa fault can
later mark the page writable. This can result in memory corruption.

Signed-off-by: Aneesh Kumar K.V 
---
 include/asm-generic/pgtable.h | 8 
 mm/ksm.c  | 9 +++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b6f3a8a4b738..8c8ba48bef0b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -200,6 +200,10 @@ static inline void ptep_set_wrprotect(struct mm_struct 
*mm, unsigned long addres
 #define pte_mk_savedwrite pte_mkwrite
 #endif
 
+#ifndef pte_clear_savedwrite
+#define pte_clear_savedwrite pte_wrprotect
+#endif
+
 #ifndef pmd_savedwrite
 #define pmd_savedwrite pmd_write
 #endif
@@ -208,6 +212,10 @@ static inline void ptep_set_wrprotect(struct mm_struct 
*mm, unsigned long addres
 #define pmd_mk_savedwrite pmd_mkwrite
 #endif
 
+#ifndef pmd_clear_savedwrite
+#define pmd_clear_savedwrite pmd_wrprotect
+#endif
+
 #ifndef __HAVE_ARCH_PMDP_SET_WRPROTECT
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pmdp_set_wrprotect(struct mm_struct *mm,
diff --git a/mm/ksm.c b/mm/ksm.c
index 9ae6011a41f8..768202831578 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -872,7 +872,8 @@ static int write_protect_page(struct vm_area_struct *vma, 
struct page *page,
if (!ptep)
goto out_mn;
 
-   if (pte_write(*ptep) || pte_dirty(*ptep)) {
+   if (pte_write(*ptep) || pte_dirty(*ptep) ||
+   (pte_protnone(*ptep) && pte_savedwrite(*ptep))) {
pte_t entry;
 
swapped = PageSwapCache(page);
@@ -897,7 +898,11 @@ static int write_protect_page(struct vm_area_struct *vma, 
struct page *page,
}
if (pte_dirty(entry))
set_page_dirty(page);
-   entry = pte_mkclean(pte_wrprotect(entry));
+
+   if (pte_protnone(entry))
+   entry = pte_mkclean(pte_clear_savedwrite(entry));
+   else
+   entry = pte_mkclean(pte_wrprotect(entry));
set_pte_at_notify(mm, addr, ptep, entry);
}
*orig_pte = *ptep;
-- 
2.7.4



[PATCH V3 3/3] powerpc/mm/autonuma: Switch ppc64 to its own implementation of saved write

2017-02-19 Thread Aneesh Kumar K.V
With this, our protnone becomes a present pte with the READ/WRITE/EXEC bits cleared.
By default we also set _PAGE_PRIVILEGED on such a pte. This is now used to help
us identify a protnone pte that has the saved write bit. For such a pte, we will
clear the _PAGE_PRIVILEGED bit. The pte still remains non-accessible from both user
and kernel.
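
For quick reference, the three pte states of interest, as I read the helpers
below (_PAGE_PRESENT and _PAGE_PTE are set in all three; only the bits that
differ are shown):

  state                        _PAGE_RWX    _PAGE_PRIVILEGED
  ordinary user pte            some set     clear
  protnone, no saved write     clear        set
  protnone with saved write    clear        clear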

Acked-By: Michael Neuling 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 52 
 1 file changed, 45 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6a55bbe91556..d87bee85fc44 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1,6 +1,9 @@
 #ifndef _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 #define _ASM_POWERPC_BOOK3S_64_PGTABLE_H_
 
+#ifndef __ASSEMBLY__
+#include 
+#endif
 /*
  * Common bits between hash and Radix page table
  */
@@ -428,15 +431,47 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte)
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
 #ifdef CONFIG_NUMA_BALANCING
-/*
- * These work without NUMA balancing but the kernel does not care. See the
- * comment in include/asm-generic/pgtable.h . On powerpc, this will only
- * work for user pages and always return true for kernel pages.
- */
 static inline int pte_protnone(pte_t pte)
 {
-   return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED)) ==
-   cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED);
+   return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | 
_PAGE_RWX)) ==
+   cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
+}
+
+#define pte_mk_savedwrite pte_mk_savedwrite
+static inline pte_t pte_mk_savedwrite(pte_t pte)
+{
+   /*
+* Used by Autonuma subsystem to preserve the write bit
+* while marking the pte PROT_NONE. Only allow this
+* on PROT_NONE pte
+*/
+   VM_BUG_ON((pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_RWX | 
_PAGE_PRIVILEGED)) !=
+ cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED));
+   return __pte(pte_val(pte) & ~_PAGE_PRIVILEGED);
+}
+
+#define pte_clear_savedwrite pte_clear_savedwrite
+static inline pte_t pte_clear_savedwrite(pte_t pte)
+{
+   /*
+* Used by KSM subsystem to make a protnone pte readonly.
+*/
+   VM_BUG_ON(!pte_protnone(pte));
+   return __pte(pte_val(pte) | _PAGE_PRIVILEGED);
+}
+
+#define pte_savedwrite pte_savedwrite
+static inline bool pte_savedwrite(pte_t pte)
+{
+   /*
+* Saved write ptes are prot none ptes that don't have the
+* privileged bit set. We mark prot none as one which has the
+* present and privileged bits set and RWX cleared. To mark a
+* protnone pte which used to have _PAGE_WRITE set, we clear
+* the privileged bit.
+*/
+   VM_BUG_ON(!pte_protnone(pte));
+   return !(pte_raw(pte) & cpu_to_be64(_PAGE_RWX | _PAGE_PRIVILEGED));
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
@@ -867,6 +902,8 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #define pmd_mkclean(pmd)   pte_pmd(pte_mkclean(pmd_pte(pmd)))
 #define pmd_mkyoung(pmd)   pte_pmd(pte_mkyoung(pmd_pte(pmd)))
 #define pmd_mkwrite(pmd)   pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mk_savedwrite(pmd) pte_pmd(pte_mk_savedwrite(pmd_pte(pmd)))
+#define pmd_clear_savedwrite(pmd)  
pte_pmd(pte_clear_savedwrite(pmd_pte(pmd)))
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 #define pmd_soft_dirty(pmd)pte_soft_dirty(pmd_pte(pmd))
@@ -883,6 +920,7 @@ static inline int pmd_protnone(pmd_t pmd)
 
 #define __HAVE_ARCH_PMD_WRITE
 #define pmd_write(pmd) pte_write(pmd_pte(pmd))
+#define pmd_savedwrite(pmd)pte_savedwrite(pmd_pte(pmd))
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
-- 
2.7.4



[PATCH V3 00/10] powerpc/mm/ppc64: Add 128TB support

2017-02-19 Thread Aneesh Kumar K.V
This patch series increases the effective virtual address range of
applications from 64TB to 128TB. We do that by supporting a
68-bit virtual address. On platforms that can only do a 65-bit virtual
address, we limit the max contexts to a 16-bit value instead of 19 bits.

The patch series also switches the page table layout such that we can
do a 512TB effective address. But we still limit TASK_SIZE to
128TB. This was done to make sure we don't break applications
that make assumptions regarding the max address returned by the
OS. We can switch to 128TB without a Linux personality value because
other architectures already do 128TB as the max address.

Changes from V2:
* Handle hugepage size correctly.

Aneesh Kumar K.V (10):
  powerpc/mm/slice: Convert slice_mask high slice to a bitmap
  powerpc/mm/slice: Update the function prototype
  powerpc/mm/hash: Move kernel context to the starting of context range
  powerpc/mm/hash: Support 68 bit VA
  powerpc/mm: Move copy_mm_to_paca to paca.c
  powerpc/mm: Remove redundant TASK_SIZE_USER64 checks
  powerpc/mm/slice: Use mm task_size as max value of slice index
  powerpc/mm/hash: Increase VA range to 128TB
  powerpc/mm/slice: Move slice_mask struct definition to slice.c
  powerpc/mm/slice: Update slice mask printing to use bitmap printing.

 arch/powerpc/include/asm/book3s/64/hash-4k.h  |   2 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |   2 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 160 -
 arch/powerpc/include/asm/mmu.h|  19 ++-
 arch/powerpc/include/asm/mmu_context.h|   2 +-
 arch/powerpc/include/asm/paca.h   |  18 +--
 arch/powerpc/include/asm/page_64.h|  14 --
 arch/powerpc/include/asm/processor.h  |  22 ++-
 arch/powerpc/kernel/paca.c|  26 
 arch/powerpc/kvm/book3s_64_mmu_host.c |  10 +-
 arch/powerpc/mm/hash_utils_64.c   |   9 +-
 arch/powerpc/mm/init_64.c |   4 -
 arch/powerpc/mm/mmu_context_book3s64.c|  96 +
 arch/powerpc/mm/pgtable_64.c  |   5 -
 arch/powerpc/mm/slb.c |   2 +-
 arch/powerpc/mm/slb_low.S |  74 ++
 arch/powerpc/mm/slice.c   | 195 +++---
 17 files changed, 394 insertions(+), 266 deletions(-)

-- 
2.7.4



[PATCH V3 01/10] powerpc/mm/slice: Convert slice_mask high slice to a bitmap

2017-02-19 Thread Aneesh Kumar K.V
In a followup patch we want to increase the va range, which will result
in us requiring high_slices to have more than 64 bits. To enable this,
convert high_slices to a bitmap. We keep the number of bits the same in this
patch and later change it to a higher value.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page_64.h |  15 ++---
 arch/powerpc/mm/slice.c| 110 +
 2 files changed, 80 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index dd5f0712afa2..7f72659b7999 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -98,19 +98,16 @@ extern u64 ppc64_pft_size;
 #define GET_LOW_SLICE_INDEX(addr)  ((addr) >> SLICE_LOW_SHIFT)
 #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT)
 
+#ifndef __ASSEMBLY__
 /*
- * 1 bit per slice and we have one slice per 1TB
- * Right now we support only 64TB.
- * IF we change this we will have to change the type
- * of high_slices
+ * One bit per slice. We have lower slices which cover 256MB segments
+ * up to the 4G range. That gets us 16 low slices. For the rest we track
+ * slices in 1TB size.
+ * 64 below is actually SLICE_NUM_HIGH, written out to fix up compile errors
  */
-#define SLICE_MASK_SIZE 8
-
-#ifndef __ASSEMBLY__
-
 struct slice_mask {
u16 low_slices;
-   u64 high_slices;
+   DECLARE_BITMAP(high_slices, 64);
 };
 
 struct mm_struct;
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 2b27458902ee..c4e718e38a03 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -36,11 +36,6 @@
 #include 
 #include 
 
-/* some sanity checks */
-#if (H_PGTABLE_RANGE >> 43) > SLICE_MASK_SIZE
-#error H_PGTABLE_RANGE exceeds slice_mask high_slices size
-#endif
-
 static DEFINE_SPINLOCK(slice_convert_lock);
 
 
@@ -49,7 +44,7 @@ int _slice_debug = 1;
 
 static void slice_print_mask(const char *label, struct slice_mask mask)
 {
-   char*p, buf[16 + 3 + 64 + 1];
+   char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1];
int i;
 
if (!_slice_debug)
@@ -60,8 +55,12 @@ static void slice_print_mask(const char *label, struct 
slice_mask mask)
*(p++) = ' ';
*(p++) = '-';
*(p++) = ' ';
-   for (i = 0; i < SLICE_NUM_HIGH; i++)
-   *(p++) = (mask.high_slices & (1ul << i)) ? '1' : '0';
+   for (i = 0; i < SLICE_NUM_HIGH; i++) {
+   if (test_bit(i, mask.high_slices))
+   *(p++) = '1';
+   else
+   *(p++) = '0';
+   }
*(p++) = 0;
 
printk(KERN_DEBUG "%s:%s\n", label, buf);
@@ -80,7 +79,10 @@ static struct slice_mask slice_range_to_mask(unsigned long 
start,
 unsigned long len)
 {
unsigned long end = start + len - 1;
-   struct slice_mask ret = { 0, 0 };
+   struct slice_mask ret;
+
+   ret.low_slices = 0;
+   bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);
 
if (start < SLICE_LOW_TOP) {
unsigned long mend = min(end, SLICE_LOW_TOP);
@@ -90,10 +92,13 @@ static struct slice_mask slice_range_to_mask(unsigned long 
start,
- (1u << GET_LOW_SLICE_INDEX(mstart));
}
 
-   if ((start + len) > SLICE_LOW_TOP)
-   ret.high_slices = (1ul << (GET_HIGH_SLICE_INDEX(end) + 1))
-   - (1ul << GET_HIGH_SLICE_INDEX(start));
+   if ((start + len) > SLICE_LOW_TOP) {
+   unsigned long start_index = GET_HIGH_SLICE_INDEX(start);
+   unsigned long align_end = ALIGN(end, (1UL > (i * 4)) & 0xf) == psize)
@@ -165,7 +176,7 @@ static struct slice_mask slice_mask_for_size(struct 
mm_struct *mm, int psize)
mask_index = i & 0x1;
index = i >> 1;
if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize)
-   ret.high_slices |= 1ul << i;
+   __set_bit(i, ret.high_slices);
}
 
return ret;
@@ -173,8 +184,13 @@ static struct slice_mask slice_mask_for_size(struct 
mm_struct *mm, int psize)
 
 static int slice_check_fit(struct slice_mask mask, struct slice_mask available)
 {
+   DECLARE_BITMAP(result, SLICE_NUM_HIGH);
+
+   bitmap_and(result, mask.high_slices,
+  available.high_slices, SLICE_NUM_HIGH);
+
return (mask.low_slices & available.low_slices) == mask.low_slices &&
-   (mask.high_slices & available.high_slices) == mask.high_slices;
+   bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH);
 }
 
 static void slice_flush_segments(void *parm)
@@ -221,7 +237,7 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
for (i = 0; i < SLICE_NUM_HIGH; i++) {
mask_index = i & 0x1;
   

[PATCH V3 02/10] powerpc/mm/slice: Update the function prototype

2017-02-19 Thread Aneesh Kumar K.V
This avoids copying the slice_mask struct as a function return value.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/slice.c | 62 ++---
 1 file changed, 28 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index c4e718e38a03..1cb0e98e70c0 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -75,20 +75,19 @@ static void slice_print_mask(const char *label, struct 
slice_mask mask) {}
 
 #endif
 
-static struct slice_mask slice_range_to_mask(unsigned long start,
-unsigned long len)
+static void slice_range_to_mask(unsigned long start, unsigned long len,
+   struct slice_mask *ret)
 {
unsigned long end = start + len - 1;
-   struct slice_mask ret;
 
-   ret.low_slices = 0;
-   bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);
+   ret->low_slices = 0;
+   bitmap_zero(ret->high_slices, SLICE_NUM_HIGH);
 
if (start < SLICE_LOW_TOP) {
unsigned long mend = min(end, SLICE_LOW_TOP);
unsigned long mstart = min(start, SLICE_LOW_TOP);
 
-   ret.low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1))
+   ret->low_slices = (1u << (GET_LOW_SLICE_INDEX(mend) + 1))
- (1u << GET_LOW_SLICE_INDEX(mstart));
}
 
@@ -97,9 +96,8 @@ static struct slice_mask slice_range_to_mask(unsigned long 
start,
unsigned long align_end = ALIGN(end, (1UL low_slices = 0;
+   bitmap_zero(ret->high_slices, SLICE_NUM_HIGH);
 
for (i = 0; i < SLICE_NUM_LOW; i++)
if (!slice_low_has_vma(mm, i))
-   ret.low_slices |= 1u << i;
+   ret->low_slices |= 1u << i;
 
if (mm->task_size <= SLICE_LOW_TOP)
-   return ret;
+   return;
 
for (i = 0; i < SLICE_NUM_HIGH; i++)
if (!slice_high_has_vma(mm, i))
-   __set_bit(i, ret.high_slices);
-
-   return ret;
+   __set_bit(i, ret->high_slices);
 }
 
-static struct slice_mask slice_mask_for_size(struct mm_struct *mm, int psize)
+static void slice_mask_for_size(struct mm_struct *mm, int psize, struct 
slice_mask *ret)
 {
unsigned char *hpsizes;
int index, mask_index;
-   struct slice_mask ret;
unsigned long i;
u64 lpsizes;
 
-   ret.low_slices = 0;
-   bitmap_zero(ret.high_slices, SLICE_NUM_HIGH);
+   ret->low_slices = 0;
+   bitmap_zero(ret->high_slices, SLICE_NUM_HIGH);
 
lpsizes = mm->context.low_slices_psize;
for (i = 0; i < SLICE_NUM_LOW; i++)
if (((lpsizes >> (i * 4)) & 0xf) == psize)
-   ret.low_slices |= 1u << i;
+   ret->low_slices |= 1u << i;
 
hpsizes = mm->context.high_slices_psize;
for (i = 0; i < SLICE_NUM_HIGH; i++) {
mask_index = i & 0x1;
index = i >> 1;
if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize)
-   __set_bit(i, ret.high_slices);
+   __set_bit(i, ret->high_slices);
}
-
-   return ret;
 }
 
 static int slice_check_fit(struct slice_mask mask, struct slice_mask available)
@@ -461,7 +453,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
/* First make up a "good" mask of slices that have the right size
 * already
 */
-   good_mask = slice_mask_for_size(mm, psize);
+   slice_mask_for_size(mm, psize, &good_mask);
slice_print_mask(" good_mask", good_mask);
 
/*
@@ -486,7 +478,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
 #ifdef CONFIG_PPC_64K_PAGES
/* If we support combo pages, we can allow 64k pages in 4k slices */
if (psize == MMU_PAGE_64K) {
-   compat_mask = slice_mask_for_size(mm, MMU_PAGE_4K);
+   slice_mask_for_size(mm, MMU_PAGE_4K, &compat_mask);
if (fixed)
slice_or_mask(&good_mask, &compat_mask);
}
@@ -495,7 +487,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
/* First check hint if it's valid or if we have MAP_FIXED */
if (addr != 

[PATCH V3 03/10] powerpc/mm/hash: Move kernel context to the starting of context range

2017-02-19 Thread Aneesh Kumar K.V
With the current kernel, we use the top 4 contexts for the kernel. Kernel VSIDs
are built using these top context values and the effective segment ID. In the
following patches, we want to increase the max effective address to 512TB. We
achieve that by increasing the effective segment IDs, thereby increasing the
virtual address range.

We will be switching to a 68-bit virtual address in the following patch. But for
platforms like p4 and p5, which only support a 65-bit VA, we want to limit the
virtual address to a 65-bit value. We do that by limiting the context bits to 16
instead of 19. That means we will have different max context values on different
platforms.

To make this simpler, we move the kernel context to the start of the range.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 39 ++--
 arch/powerpc/include/asm/mmu_context.h|  2 +-
 arch/powerpc/kvm/book3s_64_mmu_host.c |  2 +-
 arch/powerpc/mm/hash_utils_64.c   |  5 --
 arch/powerpc/mm/mmu_context_book3s64.c| 88 ++-
 arch/powerpc/mm/slb_low.S | 20 ++
 6 files changed, 84 insertions(+), 72 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 0735d5a8049f..014a9bb197cd 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -493,10 +493,10 @@ extern void slb_set_size(u16 size);
  * For user processes max context id is limited to ((1ul << 19) - 5)
  * for kernel space, we use the top 4 context ids to map address as below
  * NOTE: each context only support 64TB now.
- * 0x7fffc -  [ 0xc000 - 0xc0003fff ]
- * 0x7fffd -  [ 0xd000 - 0xd0003fff ]
- * 0x7fffe -  [ 0xe000 - 0xe0003fff ]
- * 0x7 -  [ 0xf000 - 0xf0003fff ]
+ * 0x0 -  [ 0xc000 - 0xc0003fff ]
+ * 0x1 -  [ 0xd000 - 0xd0003fff ]
+ * 0x2 -  [ 0xe000 - 0xe0003fff ]
+ * 0x3 -  [ 0xf000 - 0xf0003fff ]
  *
  * The proto-VSIDs are then scrambled into real VSIDs with the
  * multiplicative hash:
@@ -510,15 +510,9 @@ extern void slb_set_size(u16 size);
  * robust scattering in the hash table (at least based on some initial
  * results).
  *
- * We also consider VSID 0 special. We use VSID 0 for slb entries mapping
- * bad address. This enables us to consolidate bad address handling in
- * hash_page.
- *
  * We also need to avoid the last segment of the last context, because that
  * would give a protovsid of 0x1f. That will result in a VSID 0
- * because of the modulo operation in vsid scramble. But the vmemmap
- * (which is what uses region 0xf) will never be close to 64TB in size
- * (it's 56 bytes per page of system memory).
+ * because of the modulo operation in vsid scramble.
  */
 
 #define CONTEXT_BITS   19
@@ -530,12 +524,15 @@ extern void slb_set_size(u16 size);
 /*
  * 256MB segment
  * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments
- * available for user + kernel mapping. The top 4 contexts are used for
+ * available for user + kernel mapping. The bottom 4 contexts are used for
  * kernel mapping. Each segment contains 2^28 bytes. Each
- * context maps 2^46 bytes (64TB) so we can support 2^19-1 contexts
- * (19 == 37 + 28 - 46).
+ * context maps 2^46 bytes (64TB).
+ *
+ * We also need to avoid the last segment of the last context, because that
+ * would give a protovsid of 0x1f. That will result in a VSID 0
+ * because of the modulo operation in vsid scramble.
  */
-#define MAX_USER_CONTEXT   ((ASM_CONST(1) << CONTEXT_BITS) - 5)
+#define MAX_USER_CONTEXT   ((ASM_CONST(1) << CONTEXT_BITS) - 2)
 
 /*
  * This should be computed such that protovosid * vsid_mulitplier
@@ -671,19 +668,19 @@ static inline unsigned long get_vsid(unsigned long 
context, unsigned long ea,
  * This is only valid for addresses >= PAGE_OFFSET
  *
  * For kernel space, we use the top 4 context ids to map address as below
- * 0x7fffc -  [ 0xc000 - 0xc0003fff ]
- * 0x7fffd -  [ 0xd000 - 0xd0003fff ]
- * 0x7fffe -  [ 0xe000 - 0xe0003fff ]
- * 0x7 -  [ 0xf000 - 0xf0003fff ]
+ * 0x0 -  [ 0xc000 - 0xc0003fff ]
+ * 0x1 -  [ 0xd000 - 0xd0003fff ]
+ * 0x2 -  [ 0xe000 - 0xe0003fff ]
+ * 0x3 -  [ 0xf000 - 0xf0003fff ]
  */
 static inline unsigned long get_kernel_vsid(unsigned long ea, int ssize)
 {
unsigned long context;
 
/*
-* kernel take the top 4 context from the available range
+* kernel takes the first 4 contexts from the available range
 */
-   context = (MAX_USER_CONTEXT) + ((ea >> 60) - 0xc) + 1;
+   context = (ea >> 60) - 

[PATCH V3 04/10] powerpc/mm/hash: Support 68 bit VA

2017-02-19 Thread Aneesh Kumar K.V
In order to support a large effective address range (512TB), we want to increase
the virtual address bits to 68. But we do have platforms like p4 and p5 that can
only do a 65-bit VA. We support those platforms by limiting the context bits on
them to 16.

The protovsid -> vsid conversion is verified to work with both 65- and 68-bit
va values. I also documented the restrictions in a table format as part of code
comments.
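
A quick standalone sanity check of the numbers in those tables. It assumes
SID_SHIFT = 28 (256MB segments) and SID_SHIFT_1T = 40 (1TB segments), which is
what 2^28- and 2^40-byte segments imply; the printed values should match the
VSID_BITS_65VA / VSID_BITS_68VA columns in the diff below:

#include <stdio.h>

int main(void)
{
        const int sid_shift = 28, sid_shift_1t = 40;    /* assumed segment shifts */
        const int context_bits = 19;
        const int va[] = { 65, 68 };

        for (int i = 0; i < 2; i++) {
                int esid      = va[i] - (sid_shift + context_bits);
                int esid_1t   = va[i] - (sid_shift_1t + context_bits);
                int vsid_256m = va[i] - sid_shift;      /* = context_bits + esid    */
                int vsid_1t   = va[i] - sid_shift_1t;   /* = context_bits + esid_1t */

                printf("%d-bit VA: ESID_BITS=%d ESID_BITS_1T=%d VSID_BITS_256M=%d VSID_BITS_1T=%d\n",
                       va[i], esid, esid_1t, vsid_256m, vsid_1t);
        }
        return 0;
}

The 65-bit case reproduces the old ESID_BITS=18 / ESID_BITS_1T=6 values, so the
new formulas are backward consistent.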

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 123 --
 arch/powerpc/include/asm/mmu.h|  19 ++--
 arch/powerpc/kvm/book3s_64_mmu_host.c |   8 +-
 arch/powerpc/mm/mmu_context_book3s64.c|   8 +-
 arch/powerpc/mm/slb_low.S |  54 +--
 5 files changed, 150 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 014a9bb197cd..97ccd8ae6c75 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -39,6 +39,7 @@
 
 /* Bits in the SLB VSID word */
 #define SLB_VSID_SHIFT 12
+#define SLB_VSID_SHIFT_256M12
 #define SLB_VSID_SHIFT_1T  24
 #define SLB_VSID_SSIZE_SHIFT   62
 #define SLB_VSID_B ASM_CONST(0xc000)
@@ -515,9 +516,19 @@ extern void slb_set_size(u16 size);
  * because of the modulo operation in vsid scramble.
  */
 
+/*
+ * Max VA bits we support as of now is 68 bits. We want a 19-bit
+ * context ID.
+ * Restrictions:
+ * GPU has a restriction of not being able to access beyond 128TB
+ * (47-bit effective address). We also cannot do more than a 20-bit PID.
+ * For p4 and p5, which can only do a 65-bit VA, we restrict our CONTEXT_BITS
+ * to 16 bits (i.e., we can only have 2^16 PIDs at the same time).
+ */
+#define VA_BITS68
 #define CONTEXT_BITS   19
-#define ESID_BITS  18
-#define ESID_BITS_1T   6
+#define ESID_BITS  (VA_BITS - (SID_SHIFT + CONTEXT_BITS))
+#define ESID_BITS_1T   (VA_BITS - (SID_SHIFT_1T + CONTEXT_BITS))
 
 #define ESID_BITS_MASK ((1 << ESID_BITS) - 1)
 #define ESID_BITS_1T_MASK  ((1 << ESID_BITS_1T) - 1)
@@ -526,62 +537,54 @@ extern void slb_set_size(u16 size);
  * The proto-VSID space has 2^(CONTEX_BITS + ESID_BITS) - 1 segments
  * available for user + kernel mapping. The bottom 4 contexts are used for
  * kernel mapping. Each segment contains 2^28 bytes. Each
- * context maps 2^46 bytes (64TB).
+ * context maps 2^49 bytes (512TB).
  *
  * We also need to avoid the last segment of the last context, because that
  * would give a protovsid of 0x1f. That will result in a VSID 0
  * because of the modulo operation in vsid scramble.
  */
 #define MAX_USER_CONTEXT   ((ASM_CONST(1) << CONTEXT_BITS) - 2)
+/*
+ * For platforms that support only a 65-bit VA we limit the context bits
+ */
+#define MAX_USER_CONTEXT_65BIT_VA ((ASM_CONST(1) << (65 - (SID_SHIFT + 
ESID_BITS))) - 2)
 
 /*
  * This should be computed such that protovosid * vsid_mulitplier
  * doesn't overflow 64 bits. It should also be co-prime to vsid_modulus
+ * We also need to make sure that number of bits in divisor is less
+ * than twice the number of protovsid bits for our modulus optimization to work.
+ * The below table shows the current values used.
+ *
+ * |---++++--|
+ * |   | Prime Bits | VSID_BITS_65VA | Total Bits | 2* VSID_BITS |
+ * |---++++--|
+ * | 1T| 24 | 25 | 49 |   50 |
+ * |---++++--|
+ * | 256MB | 24 | 37 | 61 |   74 |
+ * |---++++--|
+ *
+ * |---++++--|
+ * |   | Prime Bits | VSID_BITS_68VA | Total Bits | 2* VSID_BITS |
+ * |---++++--|
+ * | 1T| 24 | 28 | 52 |   56 |
+ * |---++++--|
+ * | 256MB | 24 | 40 | 64 |   80 |
+ * |---++++--|
+ *
  */
 #define VSID_MULTIPLIER_256M   ASM_CONST(12538073) /* 24-bit prime */
-#define VSID_BITS_256M (CONTEXT_BITS + ESID_BITS)
+#define VSID_BITS_256M (VA_BITS - SID_SHIFT)
 #define VSID_MODULUS_256M  ((1UL<=   \
-* 2^36-1, then r3+1 has the 2^36 bit set.  So, if r3+1 has \
-* the bit clear, r3 already has the answer we want, if it  \
-* doesn't, the answer is the low 36 bits of r3+1.  So in all   \
-* cases the answer is the low 36 bits of (r3 + ((r3+1) >> 36))*/\
-   addirx,rt,1;\
-

[PATCH V3 05/10] powerpc/mm: Move copy_mm_to_paca to paca.c

2017-02-19 Thread Aneesh Kumar K.V
We will be updating this later to use struct mm_struct. Move this so that the
function finds the definition of struct mm_struct.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/paca.h | 18 +-
 arch/powerpc/kernel/paca.c  | 19 +++
 arch/powerpc/mm/hash_utils_64.c |  4 ++--
 arch/powerpc/mm/slb.c   |  2 +-
 arch/powerpc/mm/slice.c |  2 +-
 5 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 6a6792bb39fb..f25d3c93a30f 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -207,23 +207,7 @@ struct paca_struct {
 #endif
 };
 
-#ifdef CONFIG_PPC_BOOK3S
-static inline void copy_mm_to_paca(mm_context_t *context)
-{
-   get_paca()->mm_ctx_id = context->id;
-#ifdef CONFIG_PPC_MM_SLICES
-   get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
-   memcpy(&get_paca()->mm_ctx_high_slices_psize,
-  &context->high_slices_psize, SLICE_ARRAY_SIZE);
-#else
-   get_paca()->mm_ctx_user_psize = context->user_psize;
-   get_paca()->mm_ctx_sllp = context->sllp;
-#endif
-}
-#else
-static inline void copy_mm_to_paca(mm_context_t *context){}
-#endif
-
+extern void copy_mm_to_paca(struct mm_struct *mm);
 extern struct paca_struct *paca;
 extern void initialise_paca(struct paca_struct *new_paca, int cpu);
 extern void setup_paca(struct paca_struct *new_paca);
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index fa20060ff7a5..b64daf124fee 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -244,3 +244,22 @@ void __init free_unused_pacas(void)
 
free_lppacas();
 }
+
+void copy_mm_to_paca(struct mm_struct *mm)
+{
+#ifdef CONFIG_PPC_BOOK3S
+   mm_context_t *context = &mm->context;
+
+   get_paca()->mm_ctx_id = context->id;
+#ifdef CONFIG_PPC_MM_SLICES
+   get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
+   memcpy(&get_paca()->mm_ctx_high_slices_psize,
+  &context->high_slices_psize, SLICE_ARRAY_SIZE);
+#else /* CONFIG_PPC_MM_SLICES */
+   get_paca()->mm_ctx_user_psize = context->user_psize;
+   get_paca()->mm_ctx_sllp = context->sllp;
+#endif
+#else /* CONFIG_PPC_BOOK3S */
+   return;
+#endif
+}
diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 978314b6b8d7..67937a6eb541 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1084,7 +1084,7 @@ void demote_segment_4k(struct mm_struct *mm, unsigned 
long addr)
copro_flush_all_slbs(mm);
if ((get_paca_psize(addr) != MMU_PAGE_4K) && (current->mm == mm)) {
 
-   copy_mm_to_paca(&mm->context);
+   copy_mm_to_paca(mm);
slb_flush_and_rebolt();
}
 }
@@ -1156,7 +1156,7 @@ static void check_paca_psize(unsigned long ea, struct 
mm_struct *mm,
 {
if (user_region) {
if (psize != get_paca_psize(ea)) {
-   copy_mm_to_paca(&mm->context);
+   copy_mm_to_paca(mm);
slb_flush_and_rebolt();
}
} else if (get_paca()->vmalloc_sllp !=
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 48fc28bab544..15157b14b0b6 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -227,7 +227,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct 
*mm)
asm volatile("slbie %0" : : "r" (slbie_data));
 
get_paca()->slb_cache_ptr = 0;
-   copy_mm_to_paca(&mm->context);
+   copy_mm_to_paca(mm);
 
/*
 * preload some userspace segments into the SLB.
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 1cb0e98e70c0..da67b91f46d3 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -193,7 +193,7 @@ static void slice_flush_segments(void *parm)
if (mm != current->active_mm)
return;
 
-   copy_mm_to_paca(&current->active_mm->context);
+   copy_mm_to_paca(current->active_mm);
 
local_irq_save(flags);
slb_flush_and_rebolt();
-- 
2.7.4



[PATCH V3 06/10] powerpc/mm: Remove redundant TASK_SIZE_USER64 checks

2017-02-19 Thread Aneesh Kumar K.V
The check against the VSID range is implied when we check task size against
the hash and radix pgtable ranges [1], because we make sure the page table range
cannot exceed the vsid range.

[1] BUILD_BUG_ON(TASK_SIZE_USER64 > H_PGTABLE_RANGE);
BUILD_BUG_ON(TASK_SIZE_USER64 > RADIX_PGTABLE_RANGE);

The check for a smaller task size is also removed here, because the follow-up
patch will support a task size smaller than the pgtable range.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c| 4 
 arch/powerpc/mm/pgtable_64.c | 5 -
 2 files changed, 9 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 93abf8a9813d..f3e856e6ee23 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -69,10 +69,6 @@
 #if H_PGTABLE_RANGE > USER_VSID_RANGE
 #warning Limited user VSID range means pagetable space is wasted
 #endif
-
-#if (TASK_SIZE_USER64 < H_PGTABLE_RANGE) && (TASK_SIZE_USER64 < 
USER_VSID_RANGE)
-#warning TASK_SIZE is smaller than it needs to be.
-#endif
 #endif /* CONFIG_PPC_STD_MMU_64 */
 
 phys_addr_t memstart_addr = ~0;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 8bca7f58afc4..06e23e0b1b81 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -55,11 +55,6 @@
 
 #include "mmu_decl.h"
 
-#ifdef CONFIG_PPC_STD_MMU_64
-#if TASK_SIZE_USER64 > (1UL << (ESID_BITS + SID_SHIFT))
-#error TASK_SIZE_USER64 exceeds user VSID range
-#endif
-#endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
 /*
-- 
2.7.4



[PATCH V3 07/10] powerpc/mm/slice: Use mm task_size as max value of slice index

2017-02-19 Thread Aneesh Kumar K.V
In the followup patch, we will increase the slice array size to handle the 512TB
range, but will limit the task size to 128TB. Avoid doing unnecessary computation
and avoid doing slice mask related operations above task_size.
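
As a rough idea of the savings, assuming 1TB high slices (SLICE_HIGH_SHIFT of
40, consistent with the slice comments elsewhere in this series), a 128TB
task_size gives GET_HIGH_SLICE_INDEX(task_size) = 128, so these loops walk 128
entries instead of all 512 once SLICE_NUM_HIGH grows in the next patch:

#include <stdio.h>

int main(void)
{
        const unsigned long long slice_high_shift = 40;         /* assumed: 1TB slices */
        const unsigned long long task_size_128tb  = 128ULL << 40;
        const unsigned long long pgtable_512tb    = 512ULL << 40;

        printf("high slices walked for a 128TB task: %llu\n",
               task_size_128tb >> slice_high_shift);            /* 128 */
        printf("SLICE_NUM_HIGH for the 512TB layout:  %llu\n",
               pgtable_512tb >> slice_high_shift);              /* 512 */
        return 0;
}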

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/slice.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index da67b91f46d3..f286b7839a12 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -145,7 +145,7 @@ static void slice_mask_for_free(struct mm_struct *mm, 
struct slice_mask *ret)
if (mm->task_size <= SLICE_LOW_TOP)
return;
 
-   for (i = 0; i < SLICE_NUM_HIGH; i++)
+   for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++)
if (!slice_high_has_vma(mm, i))
__set_bit(i, ret->high_slices);
 }
@@ -166,7 +166,7 @@ static void slice_mask_for_size(struct mm_struct *mm, int 
psize, struct slice_ma
ret->low_slices |= 1u << i;
 
hpsizes = mm->context.high_slices_psize;
-   for (i = 0; i < SLICE_NUM_HIGH; i++) {
+   for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) {
mask_index = i & 0x1;
index = i >> 1;
if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == psize)
@@ -174,15 +174,17 @@ static void slice_mask_for_size(struct mm_struct *mm, int 
psize, struct slice_ma
}
 }
 
-static int slice_check_fit(struct slice_mask mask, struct slice_mask available)
+static int slice_check_fit(struct mm_struct *mm,
+  struct slice_mask mask, struct slice_mask available)
 {
DECLARE_BITMAP(result, SLICE_NUM_HIGH);
+   unsigned long slice_count = GET_HIGH_SLICE_INDEX(mm->task_size);
 
bitmap_and(result, mask.high_slices,
-  available.high_slices, SLICE_NUM_HIGH);
+  available.high_slices, slice_count);
 
return (mask.low_slices & available.low_slices) == mask.low_slices &&
-   bitmap_equal(result, mask.high_slices, SLICE_NUM_HIGH);
+   bitmap_equal(result, mask.high_slices, slice_count);
 }
 
 static void slice_flush_segments(void *parm)
@@ -226,7 +228,7 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
mm->context.low_slices_psize = lpsizes;
 
hpsizes = mm->context.high_slices_psize;
-   for (i = 0; i < SLICE_NUM_HIGH; i++) {
+   for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) {
mask_index = i & 0x1;
index = i >> 1;
if (test_bit(i, mask.high_slices))
@@ -493,7 +495,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
/* Check if we fit in the good mask. If we do, we just return,
 * nothing else to do
 */
-   if (slice_check_fit(mask, good_mask)) {
+   if (slice_check_fit(mm, mask, good_mask)) {
slice_dbg(" fits good !\n");
return addr;
}
@@ -518,7 +520,7 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
slice_or_mask(&potential_mask, &good_mask);
slice_print_mask(" potential", potential_mask);
 
-   if ((addr != 0 || fixed) && slice_check_fit(mask, potential_mask)) {
+   if ((addr != 0 || fixed) && slice_check_fit(mm, mask, potential_mask)) {
slice_dbg(" fits potential !\n");
goto convert;
}
@@ -666,7 +668,7 @@ void slice_set_user_psize(struct mm_struct *mm, unsigned 
int psize)
mm->context.low_slices_psize = lpsizes;
 
hpsizes = mm->context.high_slices_psize;
-   for (i = 0; i < SLICE_NUM_HIGH; i++) {
+   for (i = 0; i < GET_HIGH_SLICE_INDEX(mm->task_size); i++) {
mask_index = i & 0x1;
index = i >> 1;
if (((hpsizes[index] >> (mask_index * 4)) & 0xf) == old_psize)
@@ -743,6 +745,6 @@ int is_hugepage_only_range(struct mm_struct *mm, unsigned 
long addr,
slice_print_mask(" mask", mask);
slice_print_mask(" available", available);
 #endif
-   return !slice_check_fit(mask, available);
+   return !slice_check_fit(mm, mask, available);
 }
 #endif
-- 
2.7.4



[PATCH V3 08/10] powerpc/mm/hash: Increase VA range to 128TB

2017-02-19 Thread Aneesh Kumar K.V
We update the hash Linux page table layout such that we can support 512TB. But
we limit TASK_SIZE to 128TB. We can switch to 128TB by default without a
conditional because that is the max virtual address supported by other
architectures. We will later add a mechanism to increase the application's
effective address range to 512TB on demand.

Having the page table layout changed to accommodate 512TB makes testing large
memory configurations easier, with fewer code changes to the kernel.
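
To double-check that the new layout really covers 512TB: the virtual bits
resolved by the page table are the sum of the index sizes plus the page shift.
A standalone sketch, assuming page shifts of 12 and 16 for 4K and 64K pages:

#include <stdio.h>

int main(void)
{
        /* index sizes from the hash-4k.h / hash-64k.h hunks below */
        int bits_4k  = 12 /* PGD */ + 9 /* PUD */ + 7 /* PMD */ + 9 /* PTE */ + 12 /* page shift */;
        int bits_64k = 15 /* PGD */ + 5 /* PUD */ + 5 /* PMD */ + 8 /* PTE */ + 16 /* page shift */;

        printf("4K  pages: %d address bits = %lluTB\n", bits_4k,  1ULL << (bits_4k  - 40));
        printf("64K pages: %d address bits = %lluTB\n", bits_64k, 1ULL << (bits_64k - 40));
        return 0;
}

Both come out to 49 bits, i.e. 512TB, while TASK_SIZE_USER64 stays at 128TB.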

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h  |  2 +-
 arch/powerpc/include/asm/book3s/64/hash-64k.h |  2 +-
 arch/powerpc/include/asm/page_64.h|  2 +-
 arch/powerpc/include/asm/processor.h  | 22 ++
 arch/powerpc/kernel/paca.c|  9 -
 arch/powerpc/mm/slice.c   |  2 ++
 6 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index 0c4e470571ca..b4b5e6b671ca 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  9
+#define H_PGD_INDEX_SIZE  12
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE   (sizeof(pte_t) << H_PTE_INDEX_SIZE)
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h 
b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index b39f0b86405e..682c4eb28fa4 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -4,7 +4,7 @@
 #define H_PTE_INDEX_SIZE  8
 #define H_PMD_INDEX_SIZE  5
 #define H_PUD_INDEX_SIZE  5
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  15
 
 /*
  * 64k aligned address free up few of the lower bits of RPN for us
diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index 7f72659b7999..9b60e9455c6e 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -107,7 +107,7 @@ extern u64 ppc64_pft_size;
  */
 struct slice_mask {
u16 low_slices;
-   DECLARE_BITMAP(high_slices, 64);
+   DECLARE_BITMAP(high_slices, 512);
 };
 
 struct mm_struct;
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 1ba814436c73..1d4e34f9004d 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -102,11 +102,25 @@ void release_thread(struct task_struct *);
 #endif
 
 #ifdef CONFIG_PPC64
-/* 64-bit user address space is 46-bits (64TB user VM) */
-#define TASK_SIZE_USER64 (0x0000400000000000UL)
+/*
+ * 64-bit user address space can have multiple limits
+ * For now supported values are:
+ */
+#define TASK_SIZE_64TB  (0x0000400000000000UL)
+#define TASK_SIZE_128TB (0x0000800000000000UL)
+#define TASK_SIZE_512TB (0x0002000000000000UL)
 
-/* 
- * 32-bit user address space is 4GB - 1 page 
+#ifdef CONFIG_PPC_BOOK3S_64
+/*
+ * Max value currently used:
+ */
+#define TASK_SIZE_USER64 TASK_SIZE_128TB
+#else
+#define TASK_SIZE_USER64 TASK_SIZE_64TB
+#endif
+
+/*
+ * 32-bit user address space is 4GB - 1 page
+ * (this 1 page is needed so referencing of 0xFFFFFFFF generates EFAULT
  */
#define TASK_SIZE_USER32 (0x0000000100000000UL - (1*PAGE_SIZE))
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index b64daf124fee..c7ca70dc3ba5 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -253,8 +253,15 @@ void copy_mm_to_paca(struct mm_struct *mm)
get_paca()->mm_ctx_id = context->id;
 #ifdef CONFIG_PPC_MM_SLICES
get_paca()->mm_ctx_low_slices_psize = context->low_slices_psize;
+   /*
+* We support up to 128TB for now. Hence copy only 128/2 bytes.
+* Later when we support tasks with different max effective
+* address, we can optimize this based on mm->task_size.
+*/
+   BUILD_BUG_ON(TASK_SIZE_USER64 != TASK_SIZE_128TB);
memcpy(&get_paca()->mm_ctx_high_slices_psize,
-  &context->high_slices_psize, SLICE_ARRAY_SIZE);
+  &context->high_slices_psize, TASK_SIZE_128TB >> 41);
+
 #else /* CONFIG_PPC_MM_SLICES */
get_paca()->mm_ctx_user_psize = context->user_psize;
get_paca()->mm_ctx_sllp = context->sllp;
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index f286b7839a12..fd2c85e951bd 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -412,6 +412,8 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
struct mm_struct *mm = current->mm;
unsigned long newaddr;
 
+   /* Make sure high_slices bitmap size is same as we expected */
+   BUILD_BUG_ON(512 != SLICE_NUM_HIGH);
/*
 * init different masks
 */
-- 
2.7.4



[PATCH V3 09/10] powerpc/mm/slice: Move slice_mask struct definition to slice.c

2017-02-19 Thread Aneesh Kumar K.V
This structure definition need not be in a header since it is used only by the
slice.c file. So move it to slice.c. This also allows us to use SLICE_NUM_HIGH
instead of 512 and also helps in getting rid of one BUILD_BUG_ON().

I also switch the low_slices type to u64 from u16. This doesn't have an impact
on the size of the struct due to the padding added with the u16 type. This helps
in using the bitmap printing function for printing the slice mask.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/page_64.h | 11 ---
 arch/powerpc/mm/slice.c| 13 ++---
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index 9b60e9455c6e..3ecfc2734b50 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -99,17 +99,6 @@ extern u64 ppc64_pft_size;
 #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT)
 
 #ifndef __ASSEMBLY__
-/*
- * One bit per slice. We have lower slices which cover 256MB segments
- * upto 4G range. That gets us 16 low slices. For the rest we track slices
- * in 1TB size.
- * 64 below is actually SLICE_NUM_HIGH to fixup complie errros
- */
-struct slice_mask {
-   u16 low_slices;
-   DECLARE_BITMAP(high_slices, 512);
-};
-
 struct mm_struct;
 
 extern unsigned long slice_get_unmapped_area(unsigned long addr,
diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index fd2c85e951bd..8eedb7382942 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -37,7 +37,16 @@
 #include 
 
 static DEFINE_SPINLOCK(slice_convert_lock);
-
+/*
+ * One bit per slice. We have lower slices which cover 256MB segments
+ * up to the 4G range. That gets us 16 low slices. For the rest we track
+ * slices in 1TB size.
+ * 64 below is actually SLICE_NUM_HIGH, written out to fix up compile errors
+ */
+struct slice_mask {
+   u64 low_slices;
+   DECLARE_BITMAP(high_slices, SLICE_NUM_HIGH);
+};
 
 #ifdef DEBUG
 int _slice_debug = 1;
@@ -412,8 +421,6 @@ unsigned long slice_get_unmapped_area(unsigned long addr, 
unsigned long len,
struct mm_struct *mm = current->mm;
unsigned long newaddr;
 
-   /* Make sure high_slices bitmap size is same as we expected */
-   BUILD_BUG_ON(512 != SLICE_NUM_HIGH);
/*
 * init different masks
 */
-- 
2.7.4



[PATCH V3 10/10] powerpc/mm/slice: Update slice mask printing to use bitmap printing.

2017-02-19 Thread Aneesh Kumar K.V
We now get output like below which is much better.

[0.935306]  good_mask low_slice: 0-15
[0.935360]  good_mask high_slice: 0-511

Compared to

[0.953414]  good_mask: - 1.

I also fixed an error with slice_dbg printing.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/slice.c | 30 +++---
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 8eedb7382942..fce1734ab8a3 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -53,29 +53,13 @@ int _slice_debug = 1;
 
 static void slice_print_mask(const char *label, struct slice_mask mask)
 {
-   char*p, buf[SLICE_NUM_LOW + 3 + SLICE_NUM_HIGH + 1];
-   int i;
-
if (!_slice_debug)
return;
-   p = buf;
-   for (i = 0; i < SLICE_NUM_LOW; i++)
-   *(p++) = (mask.low_slices & (1 << i)) ? '1' : '0';
-   *(p++) = ' ';
-   *(p++) = '-';
-   *(p++) = ' ';
-   for (i = 0; i < SLICE_NUM_HIGH; i++) {
-   if (test_bit(i, mask.high_slices))
-   *(p++) = '1';
-   else
-   *(p++) = '0';
-   }
-   *(p++) = 0;
-
-   printk(KERN_DEBUG "%s:%s\n", label, buf);
+   pr_devel("%s low_slice: %*pbl\n", label, (int)SLICE_NUM_LOW, 
&mask.low_slices);
+   pr_devel("%s high_slice: %*pbl\n", label, (int)SLICE_NUM_HIGH, 
mask.high_slices);
 }
 
-#define slice_dbg(fmt...) do { if (_slice_debug) pr_debug(fmt); } while(0)
+#define slice_dbg(fmt...) do { if (_slice_debug) pr_devel(fmt); } while (0)
 
 #else
 
@@ -247,8 +231,8 @@ static void slice_convert(struct mm_struct *mm, struct 
slice_mask mask, int psiz
}
 
slice_dbg(" lsps=%lx, hsps=%lx\n",
- mm->context.low_slices_psize,
- mm->context.high_slices_psize);
+ (unsigned long)mm->context.low_slices_psize,
+ (unsigned long)mm->context.high_slices_psize);
 
spin_unlock_irqrestore(&slice_convert_lock, flags);
 
@@ -690,8 +674,8 @@ void slice_set_user_psize(struct mm_struct *mm, unsigned 
int psize)
 
 
slice_dbg(" lsps=%lx, hsps=%lx\n",
- mm->context.low_slices_psize,
- mm->context.high_slices_psize);
+ (unsigned long)mm->context.low_slices_psize,
+ (unsigned long)mm->context.high_slices_psize);
 
  bail:
spin_unlock_irqrestore(&slice_convert_lock, flags);
-- 
2.7.4



[PATCH] powerpc/mm: Add translation mode information in /proc/cpuinfo

2017-02-19 Thread Aneesh Kumar K.V
With this, we have /proc/cpuinfo on powernv and pseries reporting:

timebase: 51200
platform: PowerNV
model   : 8247-22L
machine : PowerNV 8247-22L
firmware: OPAL
translation : Hash

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/platforms/powernv/setup.c | 4 
 arch/powerpc/platforms/pseries/setup.c | 4 
 2 files changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/setup.c 
b/arch/powerpc/platforms/powernv/setup.c
index d50c7d99baaf..d38571e289bb 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -95,6 +95,10 @@ static void pnv_show_cpuinfo(struct seq_file *m)
else
seq_printf(m, "firmware\t: BML\n");
of_node_put(root);
+   if (radix_enabled())
+   seq_printf(m, "translation\t: Radix\n");
+   else
+   seq_printf(m, "translation\t: Hash\n");
 }
 
 static void pnv_prepare_going_down(void)
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 7736352f7279..6576fe306561 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -86,6 +86,10 @@ static void pSeries_show_cpuinfo(struct seq_file *m)
model = of_get_property(root, "model", NULL);
seq_printf(m, "machine\t\t: CHRP %s\n", model);
of_node_put(root);
+   if (radix_enabled())
+   seq_printf(m, "translation\t: Radix\n");
+   else
+   seq_printf(m, "translation\t: Hash\n");
 }
 
 /* Initialize firmware assisted non-maskable interrupts if
-- 
2.7.4



[PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout

2017-02-19 Thread Aneesh Kumar K.V
Without this, if firmware reports 1MB page size support, we will crash
trying to use 1MB as the hugetlb page size.

echo 300 > /sys/kernel/mm/hugepages/hugepages-1024kB/nr_hugepages

kernel BUG at ./arch/powerpc/include/asm/hugetlb.h:19!
.

[c000e2c27b30] c029dae8 .hugetlb_fault+0x638/0xda0
[c000e2c27c30] c026fb64 .handle_mm_fault+0x844/0x1d70
[c000e2c27d70] c004805c .do_page_fault+0x3dc/0x7c0
[c000e2c27e30] c000ac98 handle_page_fault+0x10/0x30

With the fix, we don't enable 1MB as a hugepage size.

bash-4.2# cd /sys/kernel/mm/hugepages/
bash-4.2# ls
hugepages-16384kB  hugepages-16777216kB

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/hugetlbpage.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 8c3389cbcd12..a4f33de4008e 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -753,6 +753,24 @@ static int __init add_huge_page_size(unsigned long long 
size)
if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
return -EINVAL;
 
+#ifdef CONFIG_PPC_BOOK3S_64
+   /*
+* We need to make sure that for different page sizes reported by
+* firmware we only add hugetlb support for page sizes that can be
+* supported by linux page table layout.
+* For now we have
+* Radix: 2M
+* Hash: 16M and 16G
+*/
+   if (radix_enabled()) {
+   if (mmu_psize != MMU_PAGE_2M)
+   return -EINVAL;
+   } else {
+   if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
+   return -EINVAL;
+   }
+#endif
+
BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
 
/* Return if huge page size has already been setup */
-- 
2.7.4



Re: [PATCH V3 0/3] Numabalancing preserve write fix

2017-02-19 Thread Aneesh Kumar K.V

I am not sure whether we want to merge this debug patch. This will help
us in identifying wrong pte_wrprotect usage in the kernel.

From a0fb302fd204159a1327b67decb8f14ffa21 Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" 
Date: Sat, 18 Feb 2017 10:39:47 +0530
Subject: [PATCH] powerpc/autonuma: Add debug check for wrong writable pte
 check

With ppc64, protnone ptes don't use the _PAGE_WRITE bit for savedwrite. Hence
we need to make sure we don't call pte_write* functions on protnone ptes.
Add a debug check to catch wrong usage.

This should only be used for debugging and can give wrong results w.r.t. the
change bit on radix. Even on hash with KVM, we will insert the page table entry
in the guest hash page table with the write bit set, even if the pte is marked
protnone.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 130 +--
 1 file changed, 85 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index d87bee85fc44..1c99deac3966 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -341,10 +341,36 @@ static inline int __ptep_test_and_clear_young(struct 
mm_struct *mm,
__r;\
 })
 
+#undef SAVED_WRITE_DEBUG
+#ifdef CONFIG_NUMA_BALANCING
+static inline int pte_protnone(pte_t pte)
+{
+   /*
+* We want to catch wrong usage of pte_write w.r.t protnone ptes.
+* The way we do that is to make saved write as _PAGE_WRITE for hash
+* translation mode. This only will work with hash translation mode.
+*/
+#ifdef SAVED_WRITE_DEBUG
+   if (!radix_enabled())
+   return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | 
_PAGE_PRIVILEGED)) ==
+   cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED);
+#endif
+   return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | 
_PAGE_RWX)) ==
+   cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
+}
+#endif
+
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep)
 {
+#ifdef SAVED_WRITE_DEBUG
+   /*
+* Cannot use this with a protnone pte. For protnone, writes
+* will be marked via the savedwrite bit.
+*/
+   VM_WARN_ON(pte_protnone(*ptep));
+#endif
if ((pte_raw(*ptep) & cpu_to_be64(_PAGE_WRITE)) == 0)
return;
 
@@ -430,51 +456,6 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte)
 }
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
-#ifdef CONFIG_NUMA_BALANCING
-static inline int pte_protnone(pte_t pte)
-{
-   return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | 
_PAGE_RWX)) ==
-   cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
-}
-
-#define pte_mk_savedwrite pte_mk_savedwrite
-static inline pte_t pte_mk_savedwrite(pte_t pte)
-{
-   /*
-* Used by Autonuma subsystem to preserve the write bit
-* while marking the pte PROT_NONE. Only allow this
-* on PROT_NONE pte
-*/
-   VM_BUG_ON((pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_RWX | 
_PAGE_PRIVILEGED)) !=
- cpu_to_be64(_PAGE_PRESENT | _PAGE_PRIVILEGED));
-   return __pte(pte_val(pte) & ~_PAGE_PRIVILEGED);
-}
-
-#define pte_clear_savedwrite pte_clear_savedwrite
-static inline pte_t pte_clear_savedwrite(pte_t pte)
-{
-   /*
-* Used by KSM subsystem to make a protnone pte readonly.
-*/
-   VM_BUG_ON(!pte_protnone(pte));
-   return __pte(pte_val(pte) | _PAGE_PRIVILEGED);
-}
-
-#define pte_savedwrite pte_savedwrite
-static inline bool pte_savedwrite(pte_t pte)
-{
-   /*
-* Saved write ptes are prot none ptes that don't have the
-* privileged bit set. We mark prot none as one which has the
-* present and privileged bits set and RWX cleared. To mark a
-* protnone pte which used to have _PAGE_WRITE set, we clear
-* the privileged bit.
-*/
-   VM_BUG_ON(!pte_protnone(pte));
-   return !(pte_raw(pte) & cpu_to_be64(_PAGE_RWX | _PAGE_PRIVILEGED));
-}
-#endif /* CONFIG_NUMA_BALANCING */
-
 static inline int pte_present(pte_t pte)
 {
return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT));
@@ -500,6 +481,14 @@ static inline unsigned long pte_pfn(pte_t pte)
 /* Generic modifiers for PTE bits */
 static inline pte_t pte_wrprotect(pte_t pte)
 {
+
+#ifdef SAVED_WRITE_DEBUG
+   /*
+* Cannot use this with a protnone pte. For protnone, writes
+* will be marked via the savedwrite bit.
+*/
+   VM_WARN_ON(pte_protnone(pte));
+#endif
return __pte(pte_val(pte) & ~_PAGE_WRITE);
 }
 
@@ -552,6 +541,57 @@ static inline bool pte_user(pte_t pte)
return !(pte_raw(pte) & cpu_to_be64(_PAGE_PRIVILEGED));
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+#define pte_mk_savedwrite pte_mk_savedwrite
+static inline 

Re: [PATCH] cxl: Enable PCI device ID for future IBM CXL adapter

2017-02-19 Thread Andrew Donnellan

On 17/02/17 14:45, Uma Krishnan wrote:

From: "Matthew R. Ochs" 

Add support for a future IBM Coherent Accelerator (CXL) device
with an ID of 0x0623.

Signed-off-by: Matthew R. Ochs 
Signed-off-by: Uma Krishnan 


Is this a CAIA 1 or CAIA 2 device?

--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



Re: powerpc/perf: use is_kernel_addr macro in perf_get_misc_flags()

2017-02-19 Thread Michael Ellerman
On Sat, 2016-12-24 at 06:05:49 UTC, Madhavan Srinivasan wrote:
> Cleanup to use is_kernel_addr macro.
> 
> Signed-off-by: Madhavan Srinivasan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a2391b35f1d9d5b51d43a9150c7239

cheers


Re: powerpc: implement clear_bit_unlock_is_negative_byte()

2017-02-19 Thread Michael Ellerman
On Tue, 2017-01-03 at 18:58:28 UTC, Nicholas Piggin wrote:
> Commit b91e1302ad9b8 ("mm: optimize PageWaiters bit use for
> unlock_page()") added a special bitop function to speed up
> unlock_page(). Implement this for powerpc.
...
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d11914b21c4c21a294fe8937d66c1a

cheers


Re: powerpc/powernv: Remove unused variable in pnv_pci_sriov_disable()

2017-02-19 Thread Michael Ellerman
On Wed, 2017-01-11 at 01:09:05 UTC, Gavin Shan wrote:
> The local variable @iov isn't used, so remove it.
> 
> Signed-off-by: Gavin Shan 
> Reviewed-by: Andrew Donnellan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/02983449c87b1dfd9b75af4c8a2a80

cheers


Re: [v2] powerpc/kernel: Remove error message in pcibios_setup_phb_resources()

2017-02-19 Thread Michael Ellerman
On Wed, 2017-02-08 at 03:11:03 UTC, Gavin Shan wrote:
> The CAPI driver creates virtual PHB (vPHB) from the CAPI adapter.
> The vPHB's IO and memory windows aren't built from a device-tree node
> as we do for normal PHBs. An error message is thrown in the path below
> when trying to probe AFUs contained in the adapter. The error message
> is confusing and unnecessary.
> 
> cxl_probe()
> pci_init_afu()
> cxl_pci_vphb_add()
> pcibios_scan_phb()
> pcibios_setup_phb_resources()
> 
> This removes the error message. We might have the case where the
> first memory window on a real PHB isn't populated properly because
> of an error in the "ranges" property in the device-tree node. We can check
> the device-tree instead for that. This also removes one unnecessary
> blank line in the function.
> 
> Signed-off-by: Gavin Shan 
> Reviewed-by: Andrew Donnellan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/727597d12140b342a3deef10348b5e

cheers


Re: [v2] powerpc/mm: Fix typo in set_pte_at()

2017-02-19 Thread Michael Ellerman
On Wed, 2017-02-08 at 03:16:50 UTC, Gavin Shan wrote:
> This fixes the typo about the _PAGE_PTE in set_pte_at() by changing
> "tryint" to "trying to".
> 
> Signed-off-by: Gavin Shan 
> Acked-by: Balbir Singh 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/c618f6b188a9170f67e4abd478d250

cheers


Re: [v2,1/6] powerpc/perf: Factor of event_alternative function

2017-02-19 Thread Michael Ellerman
On Sun, 2017-02-12 at 17:03:10 UTC, Madhavan Srinivasan wrote:
> Factor out the power8 event_alternative function to share
> the code with power9.
> 
> Signed-off-by: Madhavan Srinivasan 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/efe881afdd9996ccbcd2a09c93b724

cheers


Re: powerpc/perf: Avoid FAB_*_MATCH checks for power9

2017-02-19 Thread Michael Ellerman
On Mon, 2017-02-13 at 11:32:54 UTC, Madhavan Srinivasan wrote:
> Since power9 does not support FAB_*_MATCH bits in MMCR1,
> avoid these checks for power9. For this, the patch factors out
> code in isa207_get_constraint() to retain these checks
> only for power8.
> 
> Patch also updates the comment in power9-pmu raw event
> encode layout to remove FAB_*_MATCH.
> 
> Finally for power9, patch adds additional check for
> threshold events when adding the thresh mask and value in
> isa207_get_constraint().
> 
> fixes: 7ffd948fae4c ('powerpc/perf: factor out power8 pmu functions')
> fixes: 18201b204286 ('powerpc/perf: power9 raw event format encoding')
> Signed-off-by: Ravi Bangoria 
> Signed-off-by: Madhavan Srinivasan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/78a16d9fc1206e1a484b6ac9634875

cheers


Re: [v7, 3/4] powerpc/pseries: Implement indexed-count hotplug memory add

2017-02-19 Thread Michael Ellerman
On Wed, 2017-02-15 at 18:45:56 UTC, Nathan Fontenot wrote:
> From: Sahil Mehta 
> 
> Indexed-count add for memory hotplug guarantees that a contiguous block
> of <count> LMBs beginning at a specified <drc index> will be assigned;
> any LMBs in this range that are not already assigned will be DLPAR added.
> Because of Qemu's per-DIMM memory management, the addition of a contiguous
> block of memory currently requires a series of individual calls to add
> each LMB in the block. Indexed-count add reduces this series of calls to
> a single call for the entire block.
> 
> Signed-off-by: Sahil Mehta 
> Signed-off-by: Nathan Fontenot 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/333f7b76865bec24c66710cf352f89

cheers


Re: [v7, 4/4] powerpc/pseries: Implement indexed-count hotplug memory remove

2017-02-19 Thread Michael Ellerman
On Wed, 2017-02-15 at 18:46:18 UTC, Nathan Fontenot wrote:
> From: Sahil Mehta 
> 
> Indexed-count remove for memory hotplug guarantees that a contiguous block
> of <count> LMBs beginning at a specified <drc index> will be unassigned (NOT
> that <count> LMBs will be removed). Because of Qemu's per-DIMM memory
> management, the removal of a contiguous block of memory currently
> requires a series of individual calls. Indexed-count remove reduces
> this series into a single call.
> 
> Signed-off-by: Sahil Mehta 
> Signed-off-by: Nathan Fontenot 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/753843471cbbaeca25a5cab51981ee

cheers


Re: [1/3] pci/hotplug/pnv-php: Remove WARN_ON() in pnv_php_put_slot()

2017-02-19 Thread Michael Ellerman
On Wed, 2017-02-15 at 23:22:32 UTC, Gavin Shan wrote:
> The WARN_ON() causes unnecessary backtrace when putting the parent
> slot, which is likely to be NULL.
> 
>  WARNING: CPU: 2 PID: 1071 at drivers/pci/hotplug/pnv_php.c:85 \
>   pnv_php_release+0xcc/0x150 [pnv_php]
> :
>  Call Trace:
>  [c003bc007c10] [dad613c4] pnv_php_release+0x144/0x150 [pnv_php]
>  [c003bc007c40] [c06641d8] pci_hp_deregister+0x238/0x330
>  [c003bc007cd0] [dad61440] pnv_php_unregister_one+0x70/0xa0 
> [pnv_php]
>  [c003bc007d10] [dad614c0] pnv_php_unregister+0x50/0x80 [pnv_php]
>  [c003bc007d40] [dad61e84] pnv_php_exit+0x50/0xcb4 [pnv_php]
>  [c003bc007d70] [c019499c] SyS_delete_module+0x1fc/0x2a0
>  [c003bc007e30] [c000b184] system_call+0x38/0xe0
> 
> Cc:  # v4.8+
> Fixes: 66725152fb9f ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver")
> Signed-off-by: Gavin Shan 
> Reviewed-by: Andrew Donnellan 
> Tested-by: Vaibhav Jain 

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/36c7c9da40c408a71e5e6bfe12e57d

cheers


Re: [PATCHv3,4/4] MAINTAINERS: Remove powerpc's opal match

2017-02-19 Thread Michael Ellerman
On Thu, 2017-02-16 at 00:37:15 UTC, Stewart Smith wrote:
> Remove OPAL regex in powerpc to avoid false match
> 
> Signed-off-by: Stewart Smith 
> Reviewed-by: Andrew Donnellan 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/a42715830d552d7c0e3be709383ece

cheers


Re: [1/2] powerpc/mm: Convert slb_finish_load[_1T] to local symbols

2017-02-19 Thread Michael Ellerman
On Thu, 2017-02-16 at 05:38:44 UTC, Michael Ellerman wrote:
> slb_finish_load and slb_finish_load_1T are both only used within
> slb_low.S, so make them local symbols.
> 
> This makes the code a little clearer, as it's more obvious neither is
> intended to be an entry point from arbitrary other code, only the uses
> in this file.
> 
> It also prevents them being used with kprobes and other tracing tools,
> which is good because we're not able to safely take traps at these
> locations, so making them local symbols avoids us needing to blacklist
> them.
> 
> Signed-off-by: Naveen N. Rao 
> Signed-off-by: Michael Ellerman 

Series applied to powerpc next.

https://git.kernel.org/powerpc/c/e471c393dfafff54c65979cbda7d5a

cheers


Re: [v2] powerpc: Add POWER9 architected mode to cputable

2017-02-19 Thread Michael Ellerman
On Fri, 2017-02-17 at 02:01:35 UTC, Russell Currey wrote:
> PVR value of 0x0F05 means we are arch v3.00 compliant (i.e. POWER9).
> 
> Acked-by: Michael Neuling 
> Signed-off-by: Russell Currey 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/6ae3f8ad2017079292cb49c8959b52

cheers


next-20170217 boot on POWER8 LPAR : WARNING @kernel/jump_label.c:287

2017-02-19 Thread Sachin Sant
While booting next-20170217 on a POWER8 LPAR, the following
warning is displayed.

Reverting the following commit helps boot cleanly.
commit 3821fd35b5 :  jump_label: Reduce the size of struct static_key

[   11.393008] [ cut here ]
[   11.393031] WARNING: CPU: 5 PID: 2890 at kernel/jump_label.c:287 
static_key_set_entries.isra.10+0x3c/0x50
[   11.393035] Modules linked in: nfsd(+) ip_tables x_tables autofs4
[   11.393043] CPU: 5 PID: 2890 Comm: modprobe Not tainted 
4.10.0-rc8-next-20170217-autotest #1
[   11.393047] task: c003a5692500 task.stack: c003a7774000
[   11.393051] NIP: c17bcffc LR: c17bd46c CTR: 
[   11.393054] REGS: c003a800 TRAP: 0700   Not tainted  
(4.10.0-rc8-next-20170217-autotest)
[   11.393058] MSR: 8282b033 
[   11.393065]   CR: 48248282  XER: 0001
[   11.393070] CFAR: c17bcfcc SOFTE: 1
GPR00: c17bd42c c003aa80 c262ce00 d3fdd580
GPR04: d3fe07df 00010017 c17bcd50 
GPR08: 00053a09 0001 c254ce00 0001
GPR12: c1b56c40 cea81400 0020 d5081098
GPR16: c003ada0 c003adec  84a8
GPR20: d3fef000 d3fe2b28 c252dc90 0001
GPR24: c254d314  c25338f8 d3fe089f
GPR28:  d3fe1400 d3fdd578 d3fe07df
[   11.393115] NIP [c17bcffc] static_key_set_entries.isra.10+0x3c/0x50
[   11.393119] LR [c17bd46c] jump_label_module_notify+0x20c/0x420
[   11.393122] Call Trace:
[   11.393125] [c003aa80] [c17bd42c] 
jump_label_module_notify+0x1cc/0x420 (unreliable)
[   11.393132] [c003ab40] [c16b38e0] 
notifier_call_chain+0x90/0x100
[   11.393137] [c003ab90] [c16b3db0] 
__blocking_notifier_call_chain+0x60/0x90
[   11.393142] [c003abe0] [c17357bc] load_module+0x1c1c/0x2750
[   11.393147] [c003ad70] [c1736550] SyS_finit_module+0xc0/0xf0
[   11.393152] [c003ae30] [c15cb8e0] system_call+0x38/0xfc
[   11.393156] Instruction dump:
[   11.393158] 40c20018 e923 792907a0 7c844b78 f883 4e800020 3d42fff2 
892a0514
[   11.393166] 2f89 40feffe0 3921 992a0514 <0fe0> 4bd0 6000 
6000
[   11.393173] ---[ end trace a5f8fbc5d8226aec ]---

Have attached boot log.

Thanks
-Sachin

dmesg_next_20170217.log
Description: Binary data


[next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140

2017-02-19 Thread Sachin Sant
While booting next-20170217 on a POWER6 box, I ran into the following
warning. This is a full system lpar. Previous next tree was good.
I will try a bisect tomorrow.

ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015)
ipr 0200:00:01.0: Found IOA with IRQ: 305
[ cut here ]
WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 
.icp_hv_eoi+0x40/0x140
Modules linked in:
CPU: 12 PID: 1 Comm: swapper/14 Not tainted 4.10.0-rc8-next-20170217-autotest #1
task: c002b2a4a580 task.stack: c002b2a5c000
NIP: c00731b0 LR: c01389f8 CTR: c0073170
REGS: c002b2a5f050 TRAP: 0700   Not tainted  
(4.10.0-rc8-next-20170217-autotest)
MSR: 80029032 
  CR: 28004082  XER: 2004
CFAR: c01389e0 SOFTE: 0 
GPR00: c01389f8 c002b2a5f2d0 c1025800 c002b203f498 
GPR04:   0064 0131 
GPR08: 0001 c000d3104cb8  0009b1f8 
GPR12: 48004082 cedc2400 c000dad0  
GPR16:  3c007efc c0a9e848  
GPR20: d8008008 c002af4d47f0 c11efda8 c0a9ea10 
GPR24: c0a9e848  c002af4d4fb8  
GPR28:  c002b203f498 c0ef8928 c002b203f400 
NIP [c00731b0] .icp_hv_eoi+0x40/0x140
LR [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
Call Trace:
[c002b2a5f2d0] [c002b2a5f360] 0xc002b2a5f360 (unreliable)
[c002b2a5f360] [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
[c002b2a5f3e0] [c0136a08] .request_threaded_irq+0x298/0x370
[c002b2a5f490] [c05895c0] .ipr_probe_ioa+0x1110/0x1390
[c002b2a5f5c0] [c058d030] .ipr_probe+0x30/0x3e0
[c002b2a5f670] [c0466860] .local_pci_probe+0x60/0x130
[c002b2a5f710] [c0467658] .pci_device_probe+0x148/0x1e0
[c002b2a5f7c0] [c0527524] .driver_probe_device+0x2d4/0x5b0
[c002b2a5f860] [c052796c] .__driver_attach+0x16c/0x190
[c002b2a5f8f0] [c05242c4] .bus_for_each_dev+0x84/0xf0
[c002b2a5f990] [c0526af4] .driver_attach+0x24/0x40
[c002b2a5fa00] [c0526318] .bus_add_driver+0x2a8/0x370
[c002b2a5faa0] [c0528a5c] .driver_register+0x8c/0x170
[c002b2a5fb20] [c0465a54] .__pci_register_driver+0x44/0x60
[c002b2a5fb90] [c0b8efc8] .ipr_init+0x58/0x70
[c002b2a5fc10] [c000d20c] .do_one_initcall+0x5c/0x1c0
[c002b2a5fce0] [c0b44738] .kernel_init_freeable+0x280/0x360
[c002b2a5fdb0] [c000daec] .kernel_init+0x1c/0x130
[c002b2a5fe30] [c000baa0] .ret_from_kernel_thread+0x58/0xb8
Instruction dump:
f8010010 f821ff71 80e3000c 7c0004ac e94d0030 3d02ffbc 3928f4b8 7d295214 
81090004 3948 7d484378 79080fe2 <0b08> 2fa8 40de0050 91490004 
---[ end trace 5e18ae409f46392c ]---
ipr 0200:00:01.0: Initializing IOA.

Thanks
-Sachin


Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout

2017-02-19 Thread Benjamin Herrenschmidt
On Sun, 2017-02-19 at 15:48 +0530, Aneesh Kumar K.V wrote:
> +#ifdef CONFIG_PPC_BOOK3S_64
> +   /*
> +    * We need to make sure that for different page sizes reported by
> +    * firmware we only add hugetlb support for page sizes that can be
> +    * supported by linux page table layout.
> +    * For now we have
> +    * Radix: 2M
> +    * Hash: 16M and 16G
> +    */
> +   if (radix_enabled()) {
> +   if (mmu_psize != MMU_PAGE_2M)
> +   return -EINVAL;
> +   } else {
> +   if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
> +   return -EINVAL;
> +   }

Hash could support others... Same with radix and PUD level pages.

Why do we need that ? Won't FW provide separate properties for hash and
radix page sizes anyway ?

Ben.



Re: [PATCH] powerpc/powernv: Make PCI non-optional

2017-02-19 Thread Gavin Shan
On Fri, Feb 17, 2017 at 05:34:13PM +1100, Michael Ellerman wrote:
>Bare metal systems without PCI don't exist, so there's no real point in
>making PCI optional, it just breaks the build from time to time. In fact
>the build is broken now if you turn off PCI_MSI but enable KVM.
>
>Using select for PCI is OK because we (powerpc) define config PCI, and it
>has no dependencies. Selecting PCI_MSI is slightly fishy, because it's
>in drivers/pci and it is user-visible, but its only dependency is PCI,
>so selecting it can't actually lead to breakage.
>
>Signed-off-by: Michael Ellerman 

Acked-by: Gavin Shan 



[PATCH 1/2] powerpc/mm: Refactor page table allocation

2017-02-19 Thread Balbir Singh
Introduce a helper pgtable_get_gfp_flags() which
just returns the current gfp flags. In a future
patch, we can enable __GFP_ACCOUNT based on the
calling context.

Signed-off-by: Balbir Singh 
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 22 --
 arch/powerpc/mm/pgtable_64.c |  3 ++-
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index cd5e7aa..d0a9ca6 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -50,13 +50,19 @@ extern void pgtable_free_tlb(struct mmu_gather *tlb, void 
*table, int shift);
 extern void __tlb_remove_table(void *_table);
 #endif
 
+static inline gfp_t pgtable_get_gfp_flags(struct mm_struct *mm, gfp_t gfp)
+{
+   return gfp;
+}
+
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
 #ifdef CONFIG_PPC_64K_PAGES
-   return (pgd_t *)__get_free_page(PGALLOC_GFP);
+   return (pgd_t *)__get_free_page(pgtable_get_gfp_flags(mm, PGALLOC_GFP));
 #else
struct page *page;
-   page = alloc_pages(PGALLOC_GFP | __GFP_REPEAT, 4);
+   page = alloc_pages(pgtable_get_gfp_flags(mm,
+   PGALLOC_GFP | __GFP_REPEAT), 4);
if (!page)
return NULL;
return (pgd_t *) page_address(page);
@@ -76,7 +82,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
if (radix_enabled())
return radix__pgd_alloc(mm);
-   return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
+   return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE),
+   pgtable_get_gfp_flags(mm, GFP_KERNEL));
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
@@ -93,7 +100,8 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t 
*pgd, pud_t *pud)
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE), GFP_KERNEL);
+   return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
+   pgtable_get_gfp_flags(mm, GFP_KERNEL));
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -119,7 +127,8 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, 
pud_t *pud,
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX), GFP_KERNEL);
+   return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX),
+   pgtable_get_gfp_flags(mm, GFP_KERNEL));
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -159,7 +168,8 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
  unsigned long address)
 {
-   return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+   return (pte_t *)__get_free_page(
+   pgtable_get_gfp_flags(mm, PGALLOC_GFP));
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 8bca7f5..9f416ee 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -350,7 +350,8 @@ static pte_t *get_from_cache(struct mm_struct *mm)
 static pte_t *__alloc_for_cache(struct mm_struct *mm, int kernel)
 {
void *ret = NULL;
-   struct page *page = alloc_page(GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+   struct page *page = alloc_page(pgtable_get_gfp_flags(mm,
+   GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO));
if (!page)
return NULL;
if (!kernel && !pgtable_page_ctor(page)) {
-- 
2.9.3



[PATCH 2/2] powerpc/mm: Enable page table accounting

2017-02-19 Thread Balbir Singh
Enabled __GFP_ACCOUNT in pgtable_get_gfp_flags(). This
allows accounting of page table allocation via kmem to
the correct cgroup. Basic testing was done to see if
the accounting is reflected in:

1. perf record tracing
2. memory.kmem.slabinfo

Signed-off-by: Balbir Singh 
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index d0a9ca6..9207213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -52,7 +52,9 @@ extern void __tlb_remove_table(void *_table);
 
 static inline gfp_t pgtable_get_gfp_flags(struct mm_struct *mm, gfp_t gfp)
 {
-   return gfp;
+   if (mm == &init_mm)
+   return gfp;
+   return gfp | __GFP_ACCOUNT;
 }
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
-- 
2.9.3



Re: [RFC PATCH 4/9] powerpc/4xx: Create 4xx pseudo-platform in platforms/4xx

2017-02-19 Thread Nicholas Piggin
On Fri, 17 Feb 2017 17:32:14 +1100
Michael Ellerman  wrote:

> We have a lot of code in sysdev for supporting 4xx, ie. either 40x or
> 44x. Instead it would be cleaner if it was all in platforms/4xx.
> 
> This is slightly odd in that we don't actually define any machines in
> the 4xx platform, as is usual for a platform directory. But still it
> seems like a better result to have all this related code in a directory
> by itself.

What about the other things in sysdev that support multiple platforms?
Why not just put the new 4xx subdirectory under sysdev?

The other patches all seem okay to me. Do you have any grand plan for
further breaking up traps.c?

Thanks,
Nick


Re: [next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140

2017-02-19 Thread Michael Ellerman
Sachin Sant  writes:

> While booting next-20170217 on a POWER6 box, I ran into the following
> warning. This is a full system lpar. Previous next tree was good.
> I will try a bisect tomorrow.

Do you have CONFIG_DEBUG_SHIRQ=y ?

cheers

> ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015)
> ipr 0200:00:01.0: Found IOA with IRQ: 305
> [ cut here ]
> WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 
> .icp_hv_eoi+0x40/0x140
> Modules linked in:
> CPU: 12 PID: 1 Comm: swapper/14 Not tainted 4.10.0-rc8-next-20170217-autotest 
> #1
> task: c002b2a4a580 task.stack: c002b2a5c000
> NIP: c00731b0 LR: c01389f8 CTR: c0073170
> REGS: c002b2a5f050 TRAP: 0700   Not tainted  
> (4.10.0-rc8-next-20170217-autotest)
> MSR: 80029032 
>   CR: 28004082  XER: 2004
> CFAR: c01389e0 SOFTE: 0 
> GPR00: c01389f8 c002b2a5f2d0 c1025800 c002b203f498 
> GPR04:   0064 0131 
> GPR08: 0001 c000d3104cb8  0009b1f8 
> GPR12: 48004082 cedc2400 c000dad0  
> GPR16:  3c007efc c0a9e848  
> GPR20: d8008008 c002af4d47f0 c11efda8 c0a9ea10 
> GPR24: c0a9e848  c002af4d4fb8  
> GPR28:  c002b203f498 c0ef8928 c002b203f400 
> NIP [c00731b0] .icp_hv_eoi+0x40/0x140
> LR [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
> Call Trace:
> [c002b2a5f2d0] [c002b2a5f360] 0xc002b2a5f360 (unreliable)
> [c002b2a5f360] [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
> [c002b2a5f3e0] [c0136a08] .request_threaded_irq+0x298/0x370
> [c002b2a5f490] [c05895c0] .ipr_probe_ioa+0x1110/0x1390
> [c002b2a5f5c0] [c058d030] .ipr_probe+0x30/0x3e0
> [c002b2a5f670] [c0466860] .local_pci_probe+0x60/0x130
> [c002b2a5f710] [c0467658] .pci_device_probe+0x148/0x1e0
> [c002b2a5f7c0] [c0527524] .driver_probe_device+0x2d4/0x5b0
> [c002b2a5f860] [c052796c] .__driver_attach+0x16c/0x190
> [c002b2a5f8f0] [c05242c4] .bus_for_each_dev+0x84/0xf0
> [c002b2a5f990] [c0526af4] .driver_attach+0x24/0x40
> [c002b2a5fa00] [c0526318] .bus_add_driver+0x2a8/0x370
> [c002b2a5faa0] [c0528a5c] .driver_register+0x8c/0x170
> [c002b2a5fb20] [c0465a54] .__pci_register_driver+0x44/0x60
> [c002b2a5fb90] [c0b8efc8] .ipr_init+0x58/0x70
> [c002b2a5fc10] [c000d20c] .do_one_initcall+0x5c/0x1c0
> [c002b2a5fce0] [c0b44738] .kernel_init_freeable+0x280/0x360
> [c002b2a5fdb0] [c000daec] .kernel_init+0x1c/0x130
> [c002b2a5fe30] [c000baa0] .ret_from_kernel_thread+0x58/0xb8
> Instruction dump:
> f8010010 f821ff71 80e3000c 7c0004ac e94d0030 3d02ffbc 3928f4b8 7d295214 
> 81090004 3948 7d484378 79080fe2 <0b08> 2fa8 40de0050 91490004 
> ---[ end trace 5e18ae409f46392c ]---
> ipr 0200:00:01.0: Initializing IOA.
>
> Thanks
> -Sachin


[PATCH v4 00/10] IMC Instrumentation Support

2017-02-19 Thread Hemant Kumar
Power 9 has In-Memory-Collection (IMC) infrastructure which contains
various Performance Monitoring Units (PMUs) at Nest level (these are
on-chip but off-core), Core level and Thread level.

The Nest PMU counters are handled by a Nest IMC microcode which runs
in the OCC (On-Chip Controller) complex. The microcode collects the
counter data and moves the nest IMC counter data to memory.

The Core and Thread IMC PMU counters are handled in the core. Core
level PMU counters give us the IMC counters' data per core and thread
level PMU counters give us the IMC counters' data per CPU thread.

This patchset enables the nest IMC, core IMC and thread IMC
PMUs and is based on the initial work done by Madhavan Srinivasan.
"Nest Instrumentation Support" :
https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-August/132078.html

v1 for this patchset can be found here :
https://lwn.net/Articles/705475/

Nest events:
Per-chip nest instrumentation provides various per-chip metrics
such as memory, powerbus, Xlink and Alink bandwidth.

Core events:
Per-core IMC instrumentation provides various per-core metrics
such as non-idle cycles, non-idle instructions, various cache and
memory related metrics etc.

Thread events:
All the events for thread level are the same as core level, with the
difference being in the domain. These are per-cpu metrics.

PMU Events' Information:
OPAL obtains the IMC PMU and event information from the IMC Catalog
and passes on to the kernel via the device tree. The events' information
contains :
 - Event name
 - Event Offset
 - Event description
and, maybe :
 - Event scale
 - Event unit

Some PMUs may have common scale and unit values for all their
supported events. For those cases, the scale and unit properties for
those events must be inherited from the PMU.

The event offset in the memory is where the counter data gets
accumulated.
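
As a minimal sketch of what that means on the reader side (illustrative
only; "vbase" and "offset" stand in for the values the driver derives
from the device tree), each counter is simply a big-endian 64-bit value
accumulated at a fixed byte offset into the mapped counter region:

#include <linux/types.h>
#include <linux/compiler.h>
#include <asm/byteorder.h>

/* Read one IMC counter: a big-endian u64 at "offset" into the region. */
static u64 read_imc_counter(void *vbase, unsigned long offset)
{
        __be64 *addr = vbase + offset;

        return be64_to_cpu(READ_ONCE(*addr));
}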

The OPAL-side patches are posted upstream :
https://lists.ozlabs.org/pipermail/skiboot/2017-January/005979.html

The kernel discovers the IMC counters information in the device tree
at the "imc-counters" device node which has a compatible field
"ibm,opal-in-memory-counters".

Parsing of the Events' information:
To parse the IMC PMUs and events information, the kernel has to
discover the "imc-counters" node and walk through the pmu and event
nodes.
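
As a rough illustration of that discovery step (an assumed shape, not
the driver's exact code), the walk amounts to finding the node by its
compatible string and visiting each child PMU node:

#include <linux/of.h>
#include <linux/printk.h>

/* Find the "imc-counters" node and visit each child PMU node. */
static void imc_walk_example(void)
{
        struct device_node *imc, *pmu;

        imc = of_find_compatible_node(NULL, NULL,
                                      "ibm,opal-in-memory-counters");
        if (!imc)
                return;

        for_each_child_of_node(imc, pmu) {
                if (of_device_is_compatible(pmu, "ibm,imc-counters-nest"))
                        pr_info("nest IMC unit: %s\n", pmu->name);
        }
        of_node_put(imc);
}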

Here is an excerpt of the dt showing the imc-counters with
mcs0 (nest), core and thread node:
/dts-v1/;

[...]
 
/dts-v1/;
 
/ {
name = "";
compatible = "ibm,opal-in-memory-counters";
#address-cells = <0x1>;
#size-cells = <0x1>;
imc-nest-offset = <0x32>;
imc-nest-size = <0x3>;
version-id = "";
 
NEST_MCS: nest-mcs-events {
#address-cells = <0x1>;
#size-cells = <0x1>;
 
event@0 {
event-name = "RRTO_QFULL_NO_DISP" ;
reg = <0x0 0x8>;
desc = "RRTO not dispatched in MCS0 due to capacity - 
pulses once for each time a valid RRTO op is not dispatched due to a command 
list full condition" ;
};
event@8 {
event-name = "WRTO_QFULL_NO_DISP" ;
reg = <0x8 0x8>;
desc = "WRTO not dispatched in MCS0 due to capacity - 
pulses once for each time a valid WRTO op is not dispatched due to a command 
list full condition" ;
};
[...]
mcs0 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS0_";
unit = "";
scale = "";
reg = <0x118 0x8>;
events = < &NEST_MCS >;
};
 
mcs1 {
compatible = "ibm,imc-counters-nest";
events-prefix = "PM_MCS1_";
unit = "";
scale = "";
reg = <0x198 0x8>;
events = < &NEST_MCS >;
};
[...]

CORE_EVENTS: core-events {
#address-cells = <0x1>;
#size-cells = <0x1>;
 
event@e0 {
event-name = "0THRD_NON_IDLE_PCYC" ;
reg = <0xe0 0x8>;
desc = "The number of processor cycles when all threads 
are idle" ;
};
event@120 {
event-name = "1THRD_NON_IDLE_PCYC" ;
reg = <0x120 0x8>;
desc = "The number of processor cycles when exactly one 
SMT thread is executing non-idle code" ;
};
[...]
   core {
compatible = "ibm,imc-counters-core";
events-prefix = "CPM_";
unit = "";
scale = "";
reg = <0x0 0x8>;
events = < &CORE_EVENTS >;
};
 
thread {
compatible = "ibm,imc-counters-core";
events-prefix

[PATCH v4 01/10] powerpc/powernv: Data structure and macros definitions

2017-02-19 Thread Hemant Kumar
Create new header file "imc-pmu.h" to add the data structures
and macros needed for IMC pmu support.

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/imc-pmu.h | 73 ++
 1 file changed, 73 insertions(+)
 create mode 100644 arch/powerpc/include/asm/imc-pmu.h

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
new file mode 100644
index 000..3232322
--- /dev/null
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -0,0 +1,73 @@
+#ifndef PPC_POWERNV_IMC_PMU_DEF_H
+#define PPC_POWERNV_IMC_PMU_DEF_H
+
+/*
+ * IMC Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ *   (C) 2016 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define IMC_MAX_CHIPS  32
+#define IMC_MAX_PMUS   32
+#define IMC_MAX_PMU_NAME_LEN   256
+
+#define NEST_IMC_ENGINE_START  1
+#define NEST_IMC_ENGINE_STOP   0
+#define NEST_MAX_PAGES 16
+
+#define NEST_IMC_PRODUCTION_MODE   1
+
+#define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
+#define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
+
+/*
+ * Structure to hold per chip specific memory address
+ * information for nest pmus. Nest Counter data are exported
+ * in per-chip reserved memory region by the PORE Engine.
+ */
+struct perchip_nest_info {
+   u32 chip_id;
+   u64 pbase;
+   u64 vbase[NEST_MAX_PAGES];
+   u64 size;
+};
+
+/*
+ * Place holder for nest pmu events and values.
+ */
+struct imc_events {
+   char *ev_name;
+   char *ev_value;
+};
+
+/*
+ * Device tree parser code detects IMC pmu support and
+ * registers new IMC pmus. This structure will
+ * hold the pmu functions and attrs for each imc pmu and
+ * will be referenced at the time of pmu registration.
+ */
+struct imc_pmu {
+   struct pmu pmu;
+   int domain;
+   const struct attribute_group *attr_groups[4];
+};
+
+/*
+ * Domains for IMC PMUs
+ */
+#define IMC_DOMAIN_NEST1
+
+#define UNKNOWN_DOMAIN -1
+
+#endif /* PPC_POWERNV_IMC_PMU_DEF_H */
-- 
2.7.4



[PATCH v4 02/10] powerpc/powernv: Autoload IMC device driver module

2017-02-19 Thread Hemant Kumar
This patch does three things:
 - Enables "opal.c" to create a platform device for the IMC interface
   according to the appropriate compatible string.
 - Finds the reserved-memory region details from the system device tree
   and gets the base address of the HOMER region for each chip (see the
   sketch below).
 - Gets the Nest PMU counter data offsets (in the HOMER region)
   and their sizes. The offsets for the counters' data are fixed and
   won't change from chip to chip.

The device tree parsing logic is separated from the PMU creation
functions (which is done in subsequent patches). Right now, only Nest
units are taken care of.
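
A minimal sketch of the address math mentioned in the second item above
(helper name is illustrative; the real code fills a per-chip structure):
the reserved-memory "reg" property carries four u32 cells, the first two
of which form the 64-bit HOMER base, to which the "imc-nest-offset"
value is added.

#include <linux/of.h>
#include <linux/types.h>

/* Combine the first two "reg" cells into a base and add the nest offset. */
static u64 homer_nest_base(struct device_node *homer, u32 nest_offset)
{
        u32 reg[4];

        if (of_property_read_u32_array(homer, "reg", reg, 4))
                return 0;

        return (((u64)reg[0] << 32) | reg[1]) + nest_offset;
}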

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/platforms/powernv/Makefile   |   2 +-
 arch/powerpc/platforms/powernv/opal-imc.c | 117 ++
 arch/powerpc/platforms/powernv/opal.c |  13 
 3 files changed, 131 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/powernv/opal-imc.c

diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index b5d98cb..44909fe 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -2,7 +2,7 @@ obj-y   += setup.o opal-wrappers.o opal.o 
opal-async.o idle.o
 obj-y  += opal-rtc.o opal-nvram.o opal-lpc.o opal-flash.o
 obj-y  += rng.o opal-elog.o opal-dump.o opal-sysparam.o 
opal-sensor.o
 obj-y  += opal-msglog.o opal-hmi.o opal-power.o opal-irqchip.o
-obj-y  += opal-kmsg.o
+obj-y  += opal-kmsg.o opal-imc.o
 
 obj-$(CONFIG_SMP)  += smp.o subcore.o subcore-asm.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
new file mode 100644
index 000..ee2ae45
--- /dev/null
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -0,0 +1,117 @@
+/*
+ * OPAL IMC interface detection driver
+ * Supported on POWERNV platform
+ *
+ * Copyright  (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ *(C) 2016 Hemant K Shaw, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+
+static int opal_imc_counters_probe(struct platform_device *pdev)
+{
+   struct device_node *child, *imc_dev, *rm_node = NULL;
+   struct perchip_nest_info *pcni;
+   u32 reg[4], pages, nest_offset, nest_size, idx;
+   int i = 0;
+   const char *node_name;
+
+   if (!pdev || !pdev->dev.of_node)
+   return -ENODEV;
+
+   imc_dev = pdev->dev.of_node;
+
+   /*
+* nest_offset : where the nest-counters' data start.
+* size : size of the entire nest-counters region
+*/
+   if (of_property_read_u32(imc_dev, "imc-nest-offset", &nest_offset))
+   goto err;
+   if (of_property_read_u32(imc_dev, "imc-nest-size", &nest_size))
+   goto err;
+
+   /* Find the "homer region" for each chip */
+   rm_node = of_find_node_by_path("/reserved-memory");
+   if (!rm_node)
+   goto err;
+
+   for_each_child_of_node(rm_node, child) {
+   if (of_property_read_string_index(child, "name", 0,
+ &node_name))
+   continue;
+   if (strncmp("ibm,homer-image", node_name,
+   strlen("ibm,homer-image")))
+   continue;
+
+   /* Get the chip id to which the above homer region belongs to */
+   if (of_property_read_u32(child, "ibm,chip-id", &idx))
+   goto err;
+
+   /* reg property will have four u32 cells. */
+   if (of_property_read_u32_array(child, "reg", reg, 4))
+   goto err;
+
+   pcni = &nest_perchip_info[idx];
+
+   /* Fetch the homer region base address */
+   pcni->pbase = reg[0];
+   pcni->pbase = pcni->pbase << 32 | reg[1];
+   /* Add the nest IMC Base offset */
+   pcni->pbase

[PATCH v4 03/10] powerpc/powernv: Detect supported IMC units and its events

2017-02-19 Thread Hemant Kumar
Parse device tree to detect IMC units. Traverse through each IMC unit
node to find supported events and corresponding unit/scale files (if any).

The device tree for IMC counters starts at the node :
"imc-counters". This node contains all the IMC PMU nodes and event nodes
for these IMC PMUs. The PMU nodes have an "events" property which has a
phandle value for the actual events node. The events are separated from
the PMU nodes to abstract out the common events. For example, PMU node
"mcs0", "mcs1" etc. will contain a pointer to "nest-mcs-events" since,
the events are common between these PMUs. These events have a different
prefix based on their relation to different PMUs, and hence, the PMU
nodes themselves contain an "events-prefix" property. The value of this
property, concatenated with the event name, forms the actual event
name. Also, each PMU node has a "reg" field giving the base offset for the
events which belong to this PMU. This "reg" field is added to an event's
offset in the "events" node, which gives us the location of the counter
data. Kernel code uses this offset as the event configuration value.

Device tree parser code also looks for scale/unit property in the event
node and passes on the value as an event attr for perf interface to use
in the post processing by the perf tool. Some PMUs may have common scale
and unit properties which implies that all events supported by this PMU
inherit the scale and unit properties of the PMU itself. For those
events, we need to set the common unit and scale values.

For failure to initialize any unit or any event, disable that unit and
continue setting up the rest of them.
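
A condensed sketch of how those pieces combine (the helper name and
error handling are illustrative, not the driver's exact code): the
exported name is the PMU's "events-prefix" concatenated with the event's
"event-name", and the event config value is the PMU node's "reg" base
plus the event node's "reg" offset.

#include <linux/of.h>
#include <linux/errno.h>
#include <linux/kernel.h>

static int build_imc_event(struct device_node *pmu_node,
                           struct device_node *ev_node,
                           char *name, size_t len, u32 *config)
{
        const char *prefix, *ev_name;
        u32 pmu_base, ev_off;

        if (of_property_read_string(pmu_node, "events-prefix", &prefix) ||
            of_property_read_string(ev_node, "event-name", &ev_name) ||
            of_property_read_u32(pmu_node, "reg", &pmu_base) ||
            of_property_read_u32(ev_node, "reg", &ev_off))
                return -EINVAL;

        snprintf(name, len, "%s%s", prefix, ev_name);
        *config = pmu_base + ev_off;
        return 0;
}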

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Signed-off-by: Hemant Kumar 
Signed-off-by: Anju T Sudhakar 
---
 arch/powerpc/platforms/powernv/opal-imc.c | 385 ++
 1 file changed, 385 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index ee2ae45..c58b893 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -32,6 +32,390 @@
 #include 
 
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+static int imc_event_info(char *name, struct imc_events *events)
+{
+   char *buf;
+
+   /* memory for content */
+   buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   events->ev_name = name;
+   events->ev_value = buf;
+   return 0;
+}
+
+static int imc_event_info_str(struct property *pp, char *name,
+  struct imc_events *events)
+{
+   int ret;
+
+   ret = imc_event_info(name, events);
+   if (ret)
+   return ret;
+
+   if (!pp->value || (strnlen(pp->value, pp->length) == pp->length) ||
+  (pp->length > IMC_MAX_PMU_NAME_LEN))
+   return -EINVAL;
+   strncpy(events->ev_value, (const char *)pp->value, pp->length);
+
+   return 0;
+}
+
+static int imc_event_info_val(char *name, u32 val,
+ struct imc_events *events)
+{
+   int ret;
+
+   ret = imc_event_info(name, events);
+   if (ret)
+   return ret;
+   sprintf(events->ev_value, "event=0x%x", val);
+
+   return 0;
+}
+
+static int set_event_property(struct property *pp, char *event_prop,
+ struct imc_events *events, char *ev_name)
+{
+   char *buf;
+   int ret;
+
+   buf = kzalloc(IMC_MAX_PMU_NAME_LEN, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   sprintf(buf, "%s.%s", ev_name, event_prop);
+   ret = imc_event_info_str(pp, buf, events);
+   if (ret) {
+   kfree(events->ev_name);
+   kfree(events->ev_value);
+   }
+
+   return ret;
+}
+
+/*
+ * imc_events_node_parser: Parse the event node "dev" and assign the parsed
+ * information to event "events".
+ *
+ * Parses the "reg" property of this event. "reg" gives us the event offset.
+ * Also, parse the "scale" and "unit" properties, if any.
+ */
+static int imc_events_node_parser(struct device_node *dev,
+ struct imc_events *events,
+ struct property *event_scale,
+ struct property *event_unit,
+ struct property *name_prefix,
+ u32 reg)
+{
+   struct property *name, *pp;
+   char *ev_name;
+   u32 val;
+   int idx = 0, ret;
+
+   if (!dev)
+   return -EINVAL;
+
+   /*
+* Loop through each property of an event node
+*/
+   name = of_find_property(dev, "event-name", NULL);
+

[PATCH v4 04/10] powerpc/perf: Add event attribute and group to IMC pmus

2017-02-19 Thread Hemant Kumar
Device tree IMC driver code parses the IMC units and their events. It
passes the information to IMC pmu code which is placed in powerpc/perf
as "imc-pmu.c".

This patch creates only event attributes and attribute groups for the
IMC pmus.

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/perf/Makefile|  6 +-
 arch/powerpc/perf/imc-pmu.c   | 96 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 12 +++-
 3 files changed, 111 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/perf/imc-pmu.c

diff --git a/arch/powerpc/perf/Makefile b/arch/powerpc/perf/Makefile
index 4d606b9..d0d1f04 100644
--- a/arch/powerpc/perf/Makefile
+++ b/arch/powerpc/perf/Makefile
@@ -2,10 +2,14 @@ subdir-ccflags-$(CONFIG_PPC_WERROR) := -Werror
 
 obj-$(CONFIG_PERF_EVENTS)  += callchain.o perf_regs.o
 
+imc-$(CONFIG_PPC_POWERNV)   += imc-pmu.o
+
 obj-$(CONFIG_PPC_PERF_CTRS)+= core-book3s.o bhrb.o
 obj64-$(CONFIG_PPC_PERF_CTRS)  += power4-pmu.o ppc970-pmu.o power5-pmu.o \
   power5+-pmu.o power6-pmu.o power7-pmu.o \
-  isa207-common.o power8-pmu.o power9-pmu.o
+  isa207-common.o power8-pmu.o power9-pmu.o \
+  $(imc-y)
+
 obj32-$(CONFIG_PPC_PERF_CTRS)  += mpc7450-pmu.o
 
 obj-$(CONFIG_FSL_EMB_PERF_EVENT) += core-fsl-emb.o
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
new file mode 100644
index 000..7b6ce50
--- /dev/null
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -0,0 +1,96 @@
+/*
+ * Nest Performance Monitor counter support.
+ *
+ * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
+ *  (C) 2016 Hemant K Shaw, IBM Corporation.
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
+struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+
+/* dev_str_attr : Populate event "name" and string "str" in attribute */
+static struct attribute *dev_str_attr(const char *name, const char *str)
+{
+   struct perf_pmu_events_attr *attr;
+
+   attr = kzalloc(sizeof(*attr), GFP_KERNEL);
+
+   sysfs_attr_init(&attr->attr.attr);
+
+   attr->event_str = str;
+   attr->attr.attr.name = name;
+   attr->attr.attr.mode = 0444;
+   attr->attr.show = perf_event_sysfs_show;
+
+   return &attr->attr.attr;
+}
+
+/*
+ * update_events_in_group: Update the "events" information in an attr_group
+ * and assign the attr_group to the pmu "pmu".
+ */
+static int update_events_in_group(struct imc_events *events,
+ int idx, struct imc_pmu *pmu)
+{
+   struct attribute_group *attr_group;
+   struct attribute **attrs;
+   int i;
+
+   /* Allocate memory for attribute group */
+   attr_group = kzalloc(sizeof(*attr_group), GFP_KERNEL);
+   if (!attr_group)
+   return -ENOMEM;
+
+   /* Allocate memory for attributes */
+   attrs = kzalloc((sizeof(struct attribute *) * (idx + 1)), GFP_KERNEL);
+   if (!attrs) {
+   kfree(attr_group);
+   return -ENOMEM;
+   }
+
+   attr_group->name = "events";
+   attr_group->attrs = attrs;
+   for (i = 0; i < idx; i++, events++) {
+   attrs[i] = dev_str_attr((char *)events->ev_name,
+   (char *)events->ev_value);
+   }
+
+   pmu->attr_groups[0] = attr_group;
+   return 0;
+}
+
+/*
+ * init_imc_pmu : Setup the IMC pmu device in "pmu_ptr" and its events
+ *"events".
+ * Setup the cpu mask information for these pmus and setup the state machine
+ * hotplug notifiers as well.
+ */
+int init_imc_pmu(struct imc_events *events, int idx,
+struct imc_pmu *pmu_ptr)
+{
+   int ret = -ENODEV;
+
+   ret = update_events_in_group(events, idx, pmu_ptr);
+   if (ret)
+   goto err_free;
+
+   return 0;
+
+err_free:
+   /* Only free the attr_groups which are dynamically allocated  */
+   if (pmu_ptr->attr_groups[0]) {
+   kfree(pmu_ptr->attr_groups[0]->attrs);
+   kfree(pmu_ptr->attr_groups[0]);
+   }
+
+   return ret;
+}
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index c58b893..ed1e091 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -31,8 +31,11 @@
 #include 
 #include 
 
-s

[PATCH v4 05/10] powerpc/perf: Generic imc pmu event functions

2017-02-19 Thread Hemant Kumar
Since the IMC counters' data are periodically fed to a memory location,
the functions to read/update, start/stop, add/del can be generic and can
be used by all IMC PMU units.

This patch adds a set of generic imc pmu related event functions to be
used by each imc pmu unit. Add code to set up the format attribute and to
register imc pmus. Add an event_init function for nest_imc events.
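
For context, this is roughly how such a free-running counter would be
consumed from userspace once the PMU is registered (a hedged sketch: the
PMU "type" number and the config value are placeholders that would
normally be read from sysfs, e.g. /sys/bus/event_source/devices/<pmu>/type
and the events directory):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = 25;         /* placeholder: value of .../nest_mcs0/type */
        attr.config = 0x118;    /* placeholder: one event's counter offset  */

        /* CPU-wide counting on CPU 0, no task filtering. */
        fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
        if (fd < 0)
                return 1;

        sleep(1);
        if (read(fd, &count, sizeof(count)) == sizeof(count))
                printf("count: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
}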

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/imc-pmu.h|   1 +
 arch/powerpc/perf/imc-pmu.c   | 121 ++
 arch/powerpc/platforms/powernv/opal-imc.c |  30 +++-
 3 files changed, 148 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 3232322..7b58721 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -70,4 +70,5 @@ struct imc_pmu {
 
 #define UNKNOWN_DOMAIN -1
 
+int imc_get_domain(struct device_node *pmu_dev);
 #endif /* PPC_POWERNV_IMC_PMU_DEF_H */
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 7b6ce50..f6f1ef9 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -17,6 +17,116 @@
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 
+/* Needed for sanity check */
+extern u64 nest_max_offset;
+
+PMU_FORMAT_ATTR(event, "config:0-20");
+static struct attribute *imc_format_attrs[] = {
+   &format_attr_event.attr,
+   NULL,
+};
+
+static struct attribute_group imc_format_group = {
+   .name = "format",
+   .attrs = imc_format_attrs,
+};
+
+static int nest_imc_event_init(struct perf_event *event)
+{
+   int chip_id;
+   u32 config = event->attr.config;
+   struct perchip_nest_info *pcni;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* Sampling not supported */
+   if (event->hw.sample_period)
+   return -EINVAL;
+
+   /* unsupported modes and filters */
+   if (event->attr.exclude_user   ||
+   event->attr.exclude_kernel ||
+   event->attr.exclude_hv ||
+   event->attr.exclude_idle   ||
+   event->attr.exclude_host   ||
+   event->attr.exclude_guest)
+   return -EINVAL;
+
+   if (event->cpu < 0)
+   return -EINVAL;
+
+   /* Sanity check for config (event offset) */
+   if (config > nest_max_offset)
+   return -EINVAL;
+
+   chip_id = topology_physical_package_id(event->cpu);
+   pcni = &nest_perchip_info[chip_id];
+   event->hw.event_base = pcni->vbase[config/PAGE_SIZE] +
+   (config & ~PAGE_MASK);
+
+   return 0;
+}
+
+static void imc_read_counter(struct perf_event *event)
+{
+   u64 *addr, data;
+
+   addr = (u64 *)event->hw.event_base;
+   data = __be64_to_cpu(*addr);
+   local64_set(&event->hw.prev_count, data);
+}
+
+static void imc_perf_event_update(struct perf_event *event)
+{
+   u64 counter_prev, counter_new, final_count, *addr;
+
+   addr = (u64 *)event->hw.event_base;
+   counter_prev = local64_read(&event->hw.prev_count);
+   counter_new = __be64_to_cpu(*addr);
+   final_count = counter_new - counter_prev;
+
+   local64_set(&event->hw.prev_count, counter_new);
+   local64_add(final_count, &event->count);
+}
+
+static void imc_event_start(struct perf_event *event, int flags)
+{
+   imc_read_counter(event);
+}
+
+static void imc_event_stop(struct perf_event *event, int flags)
+{
+   imc_perf_event_update(event);
+}
+
+static int imc_event_add(struct perf_event *event, int flags)
+{
+   if (flags & PERF_EF_START)
+   imc_event_start(event, flags);
+
+   return 0;
+}
+
+/* update_pmu_ops : Populate the appropriate operations for "pmu" */
+static int update_pmu_ops(struct imc_pmu *pmu)
+{
+   if (!pmu)
+   return -EINVAL;
+
+   pmu->pmu.task_ctx_nr = perf_invalid_context;
+   pmu->pmu.event_init = nest_imc_event_init;
+   pmu->pmu.add = imc_event_add;
+   pmu->pmu.del = imc_event_stop;
+   pmu->pmu.start = imc_event_start;
+   pmu->pmu.stop = imc_event_stop;
+   pmu->pmu.read = imc_perf_event_update;
+   pmu->attr_groups[1] = &imc_format_group;
+   pmu->pmu.attr_groups = pmu->attr_groups;
+
+   return 0;
+}
+
 /* dev_str_attr : Populate event "name" and string "str" in attribute */
 static struct attribute *dev_str_attr(const char *name, const char *str)
 {
@@ -83,6 +193,17 @@ int init_imc_pmu(struct imc_events *events, int idx,
if (ret)
goto err_free;
 
+   ret = update_p

[PATCH v4 06/10] powerpc/perf: IMC pmu cpumask and cpu hotplug support

2017-02-19 Thread Hemant Kumar
Adds a cpumask attribute to be used by each IMC pmu. Only one cpu (any
online CPU) from each chip is designated to read counters for the nest PMUs.

On CPU hotplug, the dying CPU is checked to see whether it is one of the
designated cpus; if yes, the next online cpu from the same chip (for nest
units) is designated as the new cpu to read counters. For this purpose, we
introduce a new state: CPUHP_AP_PERF_POWERPC_NEST_ONLINE.
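
A minimal sketch of the offline side of that scheme (illustrative; it
reuses nest_imc_cpumask and nest_change_cpu_context() from the patch
below, and the real handler also covers the engine stop path): if the
CPU going down is the designated reader for its node, pick another
online CPU from the same node, update the mask and migrate the perf
context.

static int nest_imc_cpu_offline_sketch(unsigned int cpu)
{
        int target;

        /* Not the designated reader for this chip: nothing to do. */
        if (!cpumask_test_and_clear_cpu(cpu, &nest_imc_cpumask))
                return 0;

        /* Pick any other CPU of the same node as the new reader. */
        target = cpumask_any_but(cpumask_of_node(cpu_to_node(cpu)), cpu);
        if (target < nr_cpu_ids) {
                cpumask_set_cpu(target, &nest_imc_cpumask);
                nest_change_cpu_context(cpu, target);
        }
        return 0;
}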

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/opal-api.h|   3 +-
 arch/powerpc/include/asm/opal.h|   3 +
 arch/powerpc/perf/imc-pmu.c| 163 -
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 include/linux/cpuhotplug.h |   1 +
 5 files changed, 169 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h 
b/arch/powerpc/include/asm/opal-api.h
index a0aa285..e15fb20 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -168,7 +168,8 @@
 #define OPAL_INT_SET_MFRR  125
 #define OPAL_PCI_TCE_KILL  126
 #define OPAL_NMMU_SET_PTCR 127
-#define OPAL_LAST  127
+#define OPAL_NEST_IMC_COUNTERS_CONTROL 128
+#define OPAL_LAST  128
 
 /* Device tree flags */
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 1ff03a6..d93d082 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -227,6 +227,9 @@ int64_t opal_pci_tce_kill(uint64_t phb_id, uint32_t 
kill_type,
  uint64_t dma_addr, uint32_t npages);
 int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
 
+int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
+   uint64_t value2, uint64_t value3);
+
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
   int depth, void *data);
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index f6f1ef9..e46ff6d 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -16,6 +16,7 @@
 
 struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+static cpumask_t nest_imc_cpumask;
 
 /* Needed for sanity check */
 extern u64 nest_max_offset;
@@ -31,6 +32,160 @@ static struct attribute_group imc_format_group = {
.attrs = imc_format_attrs,
 };
 
+/* Get the cpumask printed to a buffer "buf" */
+static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   cpumask_t *active_mask;
+
+   active_mask = &nest_imc_cpumask;
+   return cpumap_print_to_pagebuf(true, buf, active_mask);
+}
+
+static DEVICE_ATTR(cpumask, S_IRUGO, imc_pmu_cpumask_get_attr, NULL);
+
+static struct attribute *imc_pmu_cpumask_attrs[] = {
+   &dev_attr_cpumask.attr,
+   NULL,
+};
+
+static struct attribute_group imc_pmu_cpumask_attr_group = {
+   .attrs = imc_pmu_cpumask_attrs,
+};
+
+/*
+ * nest_init : Initializes the nest imc engine for the current chip.
+ */
+static void nest_init(int *loc)
+{
+   int rc;
+
+   rc = opal_nest_imc_counters_control(NEST_IMC_PRODUCTION_MODE,
+   NEST_IMC_ENGINE_START, 0, 0);
+   if (rc)
+   loc[smp_processor_id()] = 1;
+}
+
+static void nest_change_cpu_context(int old_cpu, int new_cpu)
+{
+   int i;
+
+   for (i = 0;
+(i < IMC_MAX_PMUS) && (per_nest_pmu_arr[i] != NULL); i++)
+   perf_pmu_migrate_context(&per_nest_pmu_arr[i]->pmu,
+   old_cpu, new_cpu);
+}
+
+static int ppc_nest_imc_cpu_online(unsigned int cpu)
+{
+   int nid, fcpu, ncpu;
+   struct cpumask *l_cpumask, tmp_mask;
+
+   /* Find the cpumask of this node */
+   nid = cpu_to_node(cpu);
+   l_cpumask = cpumask_of_node(nid);
+
+   /*
+* If any of the cpu from this node is already present in the mask,
+* just return, if not, then set this cpu in the mask.
+*/
+   if (!cpumask_and(&tmp_mask, l_cpumask, &nest_imc_cpumask)) {
+   cpumask_set_cpu(cpu, &nest_imc_cpumask);
+   return 0;
+   }
+
+   fcpu = cpumask_first(l_cpumask);
+   ncpu = cpumask_next(cpu, l_cpumask);
+   if (cpu == fcpu) {
+   if (cpumask_test_and_clear_cpu(ncpu, &nest_imc_cpumask)) {
+   cpumask_set_cpu(cpu, &nest_imc_cpumask);
+   nest_change_cpu_context(ncp

[PATCH v4 07/10] powerpc/powernv: Core IMC events detection

2017-02-19 Thread Hemant Kumar
This patch adds support for detection of core IMC events along with the
Nest IMC events. It adds a new domain IMC_DOMAIN_CORE, which is determined
with the help of the compatible string "ibm,imc-counters-core" in
the IMC device tree.

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/imc-pmu.h|  2 ++
 arch/powerpc/perf/imc-pmu.c   |  3 +++
 arch/powerpc/platforms/powernv/opal-imc.c | 18 --
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 7b58721..59de083 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -30,6 +30,7 @@
 
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
+#define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
 
 /*
  * Structure to hold per chip specific memory address
@@ -67,6 +68,7 @@ struct imc_pmu {
  * Domains for IMC PMUs
  */
 #define IMC_DOMAIN_NEST1
+#define IMC_DOMAIN_CORE2
 
 #define UNKNOWN_DOMAIN -1
 
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index e46ff6d..9a0e3bc 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -18,8 +18,11 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 static cpumask_t nest_imc_cpumask;
 
+struct imc_pmu *core_imc_pmu;
+
 /* Needed for sanity check */
 extern u64 nest_max_offset;
+extern u64 core_max_offset;
 
 PMU_FORMAT_ATTR(event, "config:0-20");
 static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c 
b/arch/powerpc/platforms/powernv/opal-imc.c
index a65aa2d..67ce873 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -33,10 +33,12 @@
 
 extern struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 extern struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
+extern struct imc_pmu *core_imc_pmu;
 
 extern int init_imc_pmu(struct imc_events *events,
int idx, struct imc_pmu *pmu_ptr);
 u64 nest_max_offset;
+u64 core_max_offset;
 
 static int imc_event_info(char *name, struct imc_events *events)
 {
@@ -80,6 +82,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (nest_max_offset < value)
nest_max_offset = value;
break;
+   case IMC_DOMAIN_CORE:
+   if (core_max_offset < value)
+   core_max_offset = value;
+   break;
default:
/* Unknown domain, return */
return;
@@ -231,6 +237,8 @@ int imc_get_domain(struct device_node *pmu_dev)
 {
if (of_device_is_compatible(pmu_dev, IMC_DTB_NEST_COMPAT))
return IMC_DOMAIN_NEST;
+   if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT))
+   return IMC_DOMAIN_CORE;
else
return UNKNOWN_DOMAIN;
 }
@@ -298,7 +306,10 @@ static int imc_pmu_create(struct device_node *parent, int 
pmu_index)
goto free_pmu;
 
/* Needed for hotplug/migration */
-   per_nest_pmu_arr[pmu_index] = pmu_ptr;
+   if (pmu_ptr->domain == IMC_DOMAIN_CORE)
+   core_imc_pmu = pmu_ptr;
+   else if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+   per_nest_pmu_arr[pmu_index] = pmu_ptr;
 
/*
 * "events" property inside a PMU node contains the phandle value
@@ -354,7 +365,10 @@ static int imc_pmu_create(struct device_node *parent, int 
pmu_index)
}
 
/* Save the name to register it later */
-   sprintf(buf, "nest_%s", (char *)pp->value);
+   if (pmu_ptr->domain == IMC_DOMAIN_NEST)
+   sprintf(buf, "nest_%s", (char *)pp->value);
+   else
+   sprintf(buf, "%s_imc", (char *)pp->value);
pmu_ptr->pmu.name = (char *)buf;
 
/*
-- 
2.7.4



[PATCH v4 08/10] powerpc/perf: PMU functions for Core IMC and hotplugging

2017-02-19 Thread Hemant Kumar
This patch adds the PMU function to initialize a core IMC event. It also
adds a cpumask initialization function for the core IMC PMU. For
initialization, a page of memory is allocated per core where the data
for core IMC counters will be accumulated. The base address for this
page is sent to OPAL via an OPAL call which initializes various SCOMs
related to Core IMC initialization. Upon any errors, the pages are
freed and core IMC counters are disabled using the same OPAL call.

For CPU hotplugging, a cpumask is initialized which contains an online
CPU from each core. If a cpu goes offline, we check whether that cpu
belongs to the core imc cpumask; if yes, we migrate the PMU
context to any other online cpu (if available) in that core. If a cpu
comes back online, it will be added to the core imc cpumask
only if there was no other cpu from that core in the previous cpumask.

To register the hotplug functions for core_imc, a new state
CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE is added to the list of existing
states.
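
A minimal sketch of those callbacks, with names assumed for illustration
(ppc_core_imc_cpu_online/offline and core_imc_cpumask_init are not from
the patch):

static int ppc_core_imc_cpu_online(unsigned int cpu)
{
	/* Track this CPU only if no sibling of its core is tracked already */
	if (!cpumask_intersects(cpu_sibling_mask(cpu), &core_imc_cpumask))
		cpumask_set_cpu(cpu, &core_imc_cpumask);
	return 0;
}

static int ppc_core_imc_cpu_offline(unsigned int cpu)
{
	cpumask_t tmp;
	unsigned int target;

	if (!cpumask_test_and_clear_cpu(cpu, &core_imc_cpumask))
		return 0;

	/* Pick any other online CPU in the same core, if one exists */
	cpumask_and(&tmp, cpu_sibling_mask(cpu), cpu_online_mask);
	cpumask_clear_cpu(cpu, &tmp);
	target = cpumask_any(&tmp);
	if (target < nr_cpu_ids) {
		cpumask_set_cpu(target, &core_imc_cpumask);
		perf_pmu_migrate_context(&core_imc_pmu->pmu, cpu, target);
	}
	return 0;
}

/* Hooked up via the new CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE state */
static int core_imc_cpumask_init(void)
{
	return cpuhp_setup_state(CPUHP_AP_PERF_POWERPC_COREIMC_ONLINE,
				 "perf/powerpc/imc:core-online",
				 ppc_core_imc_cpu_online,
				 ppc_core_imc_cpu_offline);
}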

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/imc-pmu.h |   1 +
 arch/powerpc/include/asm/opal-api.h|  10 +-
 arch/powerpc/include/asm/opal.h|   2 +
 arch/powerpc/perf/imc-pmu.c| 248 -
 arch/powerpc/platforms/powernv/opal-imc.c  |   4 +-
 arch/powerpc/platforms/powernv/opal-wrappers.S |   1 +
 include/linux/cpuhotplug.h |   1 +
 7 files changed, 257 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 59de083..5e76cd0 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -21,6 +21,7 @@
 #define IMC_MAX_CHIPS  32
 #define IMC_MAX_PMUS   32
 #define IMC_MAX_PMU_NAME_LEN   256
+#define IMC_MAX_CORES  256
 
 #define NEST_IMC_ENGINE_START  1
 #define NEST_IMC_ENGINE_STOP   0
diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index e15fb20..4ee52e8 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -169,7 +169,8 @@
 #define OPAL_PCI_TCE_KILL  126
 #define OPAL_NMMU_SET_PTCR 127
 #define OPAL_NEST_IMC_COUNTERS_CONTROL 128
-#define OPAL_LAST  128
+#define OPAL_CORE_IMC_COUNTERS_CONTROL 129
+#define OPAL_LAST  129
 
 /* Device tree flags */
 
@@ -929,6 +930,13 @@ enum {
OPAL_PCI_TCE_KILL_ALL,
 };
 
+/* Operation argument to Core IMC */
+enum {
+   OPAL_CORE_IMC_DISABLE,
+   OPAL_CORE_IMC_ENABLE,
+   OPAL_CORE_IMC_INIT,
+};
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* __OPAL_API_H */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index d93d082..c4baa6d 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -229,6 +229,8 @@ int64_t opal_nmmu_set_ptcr(uint64_t chip_id, uint64_t ptcr);
 
 int64_t opal_nest_imc_counters_control(uint64_t mode, uint64_t value1,
uint64_t value2, uint64_t value3);
+int64_t opal_core_imc_counters_control(uint64_t operation, uint64_t addr,
+   uint64_t value2, uint64_t value3);
 
 /* Internal functions */
 extern int early_init_dt_scan_opal(unsigned long node, const char *uname,
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 9a0e3bc..61d99c7 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1,5 +1,5 @@
 /*
- * Nest Performance Monitor counter support.
+ * IMC Performance Monitor counter support.
  *
  * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
  *  (C) 2016 Hemant K Shaw, IBM Corporation.
@@ -18,6 +18,9 @@ struct perchip_nest_info nest_perchip_info[IMC_MAX_CHIPS];
 struct imc_pmu *per_nest_pmu_arr[IMC_MAX_PMUS];
 static cpumask_t nest_imc_cpumask;
 
+/* Maintains base addresses for all the cores */
+static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES];
+static cpumask_t core_imc_cpumask;
 struct imc_pmu *core_imc_pmu;
 
 /* Needed for sanity check */
@@ -37,11 +40,18 @@ static struct attribute_group imc_format_group = {
 
 /* Get the cpumask printed to a buffer "buf" */
 static ssize_t imc_pmu_cpumask_get_attr(struct device *dev,
-   struct device_attribute *attr, char *buf)
+   struct device_attribute *attr,
+   char *buf)
 {
+   struct pmu *pmu = dev_get_drvdata(dev);
cpumask_t *active_mask;
 
-

[PATCH v4 09/10] powerpc/powernv: Thread IMC events detection

2017-02-19 Thread Hemant Kumar
This patch adds support for detection of thread IMC events. It adds a new
domain, IMC_DOMAIN_THREAD, which is determined with the help of the
compatibility string "ibm,imc-counters-thread" from the IMC device
tree.

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/imc-pmu.h|  2 ++
 arch/powerpc/perf/imc-pmu.c   |  1 +
 arch/powerpc/platforms/powernv/opal-imc.c | 11 +--
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 5e76cd0..f2b4f12 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -32,6 +32,7 @@
 #define IMC_DTB_COMPAT "ibm,opal-in-memory-counters"
 #define IMC_DTB_NEST_COMPAT"ibm,imc-counters-nest"
 #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
+#define IMC_DTB_THREAD_COMPAT   "ibm,imc-counters-thread"
 
 /*
  * Structure to hold per chip specific memory address
@@ -70,6 +71,7 @@ struct imc_pmu {
  */
 #define IMC_DOMAIN_NEST        1
 #define IMC_DOMAIN_CORE        2
+#define IMC_DOMAIN_THREAD   3
 
 #define UNKNOWN_DOMAIN -1
 
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index 61d99c7..a48c5be 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -26,6 +26,7 @@ struct imc_pmu *core_imc_pmu;
 /* Needed for sanity check */
 extern u64 nest_max_offset;
 extern u64 core_max_offset;
+extern u64 thread_max_offset;
 
 PMU_FORMAT_ATTR(event, "config:0-20");
 static struct attribute *imc_format_attrs[] = {
diff --git a/arch/powerpc/platforms/powernv/opal-imc.c b/arch/powerpc/platforms/powernv/opal-imc.c
index 6db3c5f..a5565e7 100644
--- a/arch/powerpc/platforms/powernv/opal-imc.c
+++ b/arch/powerpc/platforms/powernv/opal-imc.c
@@ -39,6 +39,7 @@ extern int init_imc_pmu(struct imc_events *events,
int idx, struct imc_pmu *pmu_ptr);
 u64 nest_max_offset;
 u64 core_max_offset;
+u64 thread_max_offset;
 
 static int imc_event_info(char *name, struct imc_events *events)
 {
@@ -86,6 +87,10 @@ static void update_max_value(u32 value, int pmu_domain)
if (core_max_offset < value)
core_max_offset = value;
break;
+   case IMC_DOMAIN_THREAD:
+   if (thread_max_offset < value)
+   thread_max_offset = value;
+   break;
default:
/* Unknown domain, return */
return;
@@ -239,6 +244,8 @@ int imc_get_domain(struct device_node *pmu_dev)
return IMC_DOMAIN_NEST;
if (of_device_is_compatible(pmu_dev, IMC_DTB_CORE_COMPAT))
return IMC_DOMAIN_CORE;
+   if (of_device_is_compatible(pmu_dev, IMC_DTB_THREAD_COMPAT))
+   return IMC_DOMAIN_THREAD;
else
return UNKNOWN_DOMAIN;
 }
@@ -277,7 +284,7 @@ static void imc_free_events(struct imc_events *events, int nr_entries)
 /*
  * imc_pmu_create : Takes the parent device which is the pmu unit and a
  *  pmu_index as the inputs.
- * Allocates memory for the pmu, sets up its domain (NEST or CORE), and
+ * Allocates memory for the pmu, sets up its domain (NEST/CORE/THREAD), and
  * allocates memory for the events supported by this pmu. Assigns a name for
  * the pmu. Calls imc_events_node_parser() to setup the individual events.
  * If everything goes fine, it calls, init_imc_pmu() to setup the pmu device
@@ -305,7 +312,7 @@ static int imc_pmu_create(struct device_node *parent, int pmu_index)
if (pmu_ptr->domain == UNKNOWN_DOMAIN)
goto free_pmu;
 
-   /* Needed for hotplug/migration */
+   /* Needed for hotplug/migration for nest and core IMC PMUs */
if (pmu_ptr->domain == IMC_DOMAIN_CORE)
core_imc_pmu = pmu_ptr;
else if (pmu_ptr->domain == IMC_DOMAIN_NEST)
-- 
2.7.4



[PATCH v4 10/10] powerpc/perf: Thread IMC PMU functions

2017-02-19 Thread Hemant Kumar
This patch adds the PMU functions required for event initialization,
read, update, add, del etc. for thread IMC PMU. Thread IMC PMUs are used
for per-task monitoring. These PMUs don't need any hotplugging support.

For each CPU, a page of memory is allocated and kept static, i.e.,
these pages will exist till the machine shuts down. The base address of
this page is written to the LDBAR of that CPU. As soon as we do that,
the thread IMC counters start running for that CPU and their data is
accumulated in the allocated page. But we use this for per-task
monitoring: whenever we start monitoring a task, the event is added to
the task, and at that point we read the initial value of the event.
Whenever we stop monitoring the task, the final value is read and the
difference is the event data.
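
A minimal sketch of that LDBAR setup, assuming an SPRN_LDBAR definition
for the LDBAR SPR and a helper name that is illustrative only (both are
my assumptions, not part of the quoted patch):

static void thread_imc_start_counters(int cpu_id)
{
	u64 ldbar_value;

	/* per_cpu_add[] holds the base address of the page for this CPU */
	ldbar_value = (per_cpu_add[cpu_id] & THREAD_IMC_LDBAR_MASK) |
		      THREAD_IMC_ENABLE;

	/* Writing LDBAR starts the thread IMC counters updating the page */
	mtspr(SPRN_LDBAR, ldbar_value);
}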

Now, a task can move to a different CPU. Suppose a task X moves from
CPU A to CPU B. When the task is scheduled out of A, we get an
event_del for A, so the event data is updated and we stop updating X's
event data. As soon as X moves onto B, event_add is called for B, and
we resume updating the event data. This is how the event data keeps
getting updated even as the task is scheduled onto different CPUs.

Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Sukadev Bhattiprolu 
Cc: Michael Neuling 
Cc: Stewart Smith 
Cc: Daniel Axtens 
Cc: Stephane Eranian 
Cc: Balbir Singh 
Cc: Anju T Sudhakar 
Signed-off-by: Hemant Kumar 
---
 arch/powerpc/include/asm/imc-pmu.h |   4 +
 arch/powerpc/perf/imc-pmu.c| 161 -
 2 files changed, 164 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index f2b4f12..8b7141b 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -22,6 +22,7 @@
 #define IMC_MAX_PMUS   32
 #define IMC_MAX_PMU_NAME_LEN   256
 #define IMC_MAX_CORES  256
+#define IMC_MAX_CPUS   2048
 
 #define NEST_IMC_ENGINE_START  1
 #define NEST_IMC_ENGINE_STOP   0
@@ -34,6 +35,9 @@
 #define IMC_DTB_CORE_COMPAT"ibm,imc-counters-core"
 #define IMC_DTB_THREAD_COMPAT   "ibm,imc-counters-thread"
 
+#define THREAD_IMC_LDBAR_MASK   0x0003e000
+#define THREAD_IMC_ENABLE   0x8000
+
 /*
  * Structure to hold per chip specific memory address
  * information for nest pmus. Nest Counter data are exported
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index a48c5be..4033b2d 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -23,6 +23,9 @@ static u64 per_core_pdbar_add[IMC_MAX_CHIPS][IMC_MAX_CORES];
 static cpumask_t core_imc_cpumask;
 struct imc_pmu *core_imc_pmu;
 
+/* Maintains base address for all the cpus */
+static u64 per_cpu_add[IMC_MAX_CPUS];
+
 /* Needed for sanity check */
 extern u64 nest_max_offset;
 extern u64 core_max_offset;
@@ -443,6 +446,56 @@ static int core_imc_event_init(struct perf_event *event)
return 0;
 }
 
+static int thread_imc_event_init(struct perf_event *event)
+{
+   struct task_struct *target;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* Sampling not supported */
+   if (event->hw.sample_period)
+   return -EINVAL;
+
+   event->hw.idx = -1;
+
+   /* Sanity check for config (event offset) */
+   if (event->attr.config > thread_max_offset)
+   return -EINVAL;
+
+   target = event->hw.target;
+
+   if (!target)
+   return -EINVAL;
+
+   event->pmu->task_ctx_nr = perf_sw_context;
+   return 0;
+}
+
+static void thread_imc_read_counter(struct perf_event *event)
+{
+   u64 *addr, data;
+   int cpu_id = smp_processor_id();
+
+   addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+   data = __be64_to_cpu(*addr);
+   local64_set(&event->hw.prev_count, data);
+}
+
+static void thread_imc_perf_event_update(struct perf_event *event)
+{
+   u64 counter_prev, counter_new, final_count, *addr;
+   int cpu_id = smp_processor_id();
+
+   addr = (u64 *)(per_cpu_add[cpu_id] + event->attr.config);
+   counter_prev = local64_read(&event->hw.prev_count);
+   counter_new = __be64_to_cpu(*addr);
+   final_count = counter_new - counter_prev;
+
+   local64_set(&event->hw.prev_count, counter_new);
+   local64_add(final_count, &event->count);
+}
+
 static void imc_read_counter(struct perf_event *event)
 {
u64 *addr, data;
@@ -483,6 +536,53 @@ static int imc_event_add(struct perf_event *event, int flags)
return 0;
 }
 
+static void thread_imc_event_start(struct perf_event *event, int flags)
+{
+   thread_imc_read_counter(event);
+}
+
+static void thread_imc_event_stop(struct perf_event *event, int flags)
+{
+   thread_

Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout

2017-02-19 Thread Aneesh Kumar K.V



On Monday 20 February 2017 02:35 AM, Benjamin Herrenschmidt wrote:

On Sun, 2017-02-19 at 15:48 +0530, Aneesh Kumar K.V wrote:

+#ifdef CONFIG_PPC_BOOK3S_64
+   /*
+* We need to make sure that for different page sizes reported by
+* firmware we only add hugetlb support for page sizes that can be
+* supported by linux page table layout.
+* For now we have
+* Radix: 2M
+* Hash: 16M and 16G
+*/
+   if (radix_enabled()) {
+   if (mmu_psize != MMU_PAGE_2M)
+   return -EINVAL;
+   } else {
+   if (mmu_psize != MMU_PAGE_16M && mmu_psize != MMU_PAGE_16G)
+   return -EINVAL;
+   }

Hash could support others...


On book3s 64? I had the above within the #ifdef.


Same with radix and PUD level pages.


Yes, but gigantic hugepage is not yet supported. Once we add that we will
add MMU_PAGE_1G here.




Why do we need that ? Won't FW provide separate properties for hash and
radix page sizes anyway ?



To avoid crashes like the one reported in the commit message, caused by
buggy firmware? It also serves as an easy way to understand which
hugepage sizes are supported by the different platforms. I have yet to
figure out what the FSL_BOOK3E and PPC_8xx #ifdefs above that hunk are
all about. Having the supported hugepage sizes clearly verified against
makes that easier?

-aneesh



Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout

2017-02-19 Thread Benjamin Herrenschmidt
On Mon, 2017-02-20 at 09:02 +0530, Aneesh Kumar K.V wrote:
> To avoid crashes like the one reported in the commit message due to 
> buggy firmware ? 

I don't want Linux to make those assumptions. We should fix the FW.

Think of backward compat for example.

> Also
> It can serve as an easy way to understand what hugepage sizes are 
> supported by different platforms.
> I am yet to figure out what the FSL_BOOK3E and PPC_8xx #ifdef above
> that 
> hunk is all about. Having
> the supported hugepage size clearly verified against makes it easy ?
> 
> -aneesh


Re: [next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140

2017-02-19 Thread Balbir Singh
On Sun, 2017-02-19 at 20:39 +0530, Sachin Sant wrote:
> While booting next-20170217 on a POWER6 box, I ran into following
> warning. This is a full system lpar. Previous next tree was good.
> I will try a bisect tomorrow.
> 
> ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015)
> ipr 0200:00:01.0: Found IOA with IRQ: 305
> [ cut here ]
> WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 
> .icp_hv_eoi+0x40/0x140


This indicates that the CPPR stack underflowed (we don't know the CPPR value
at the time of the interrupt that we are about to EOI). The problem
could have occurred elsewhere, but it shows up at the first interrupt after
the real cause. Could you paste the full dmesg and config, and follow Michael's
suggestion for debugging SHIRQs?
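
For reference, the check that fires here is the CPPR pop path in
asm/xics.h, roughly like the following (paraphrased from memory, details
may differ between trees):

static inline unsigned int xics_pop_cppr(void)
{
	struct xics_cppr *os_cppr = this_cpu_ptr(&xics_cppr);

	/* Underflow: an EOI with no matching entry on the CPPR stack */
	if (WARN_ON(os_cppr->index < 1))
		return LOWEST_PRIORITY;

	return os_cppr->stack[--os_cppr->index];
}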

Balbir




Re: [PATCH 06/35] powerpc: Convert remaining uses of pr_warning to pr_warn

2017-02-19 Thread Michael Ellerman
Joe Perches  writes:

> To enable eventual removal of pr_warning
>
> This makes pr_warn use consistent for arch/powerpc
>
> Prior to this patch, there were 36 uses of pr_warning and
> 217 uses of pr_warn in arch/powerpc
>
> Signed-off-by: Joe Perches 

Can I take this via the powerpc tree, or do you want to merge them as a
series?

cheers


Re: [PATCH 06/35] powerpc: Convert remaining uses of pr_warning to pr_warn

2017-02-19 Thread Joe Perches
On Mon, 2017-02-20 at 15:40 +1100, Michael Ellerman wrote:
> Joe Perches  writes:
> 
> > To enable eventual removal of pr_warning
> > 
> > This makes pr_warn use consistent for arch/powerpc
> > 
> > Prior to this patch, there were 36 uses of pr_warning and
> > 217 uses of pr_warn in arch/powerpc
> > 
> > Signed-off-by: Joe Perches 
> 
> Can I take this via the powerpc tree, or do you want to merge them as a
> series?

Well, I expect it'd be better if you merge it.



Re: [PATCH] powerpc/xmon: Fix an unexpected xmon onoff state change

2017-02-19 Thread Michael Ellerman
Pan Xinhui  writes:

> 在 2017/2/17 14:05, Michael Ellerman 写道:
>> Pan Xinhui  writes:
>>> diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
>>> index 9c0e17c..f6e5c3d 100644
>>> --- a/arch/powerpc/xmon/xmon.c
>>> +++ b/arch/powerpc/xmon/xmon.c
>>> @@ -76,6 +76,7 @@ static int xmon_gate;
>>>   #endif /* CONFIG_SMP */
>>>
>>>   static unsigned long in_xmon __read_mostly = 0;
>>> +static int xmon_off = !IS_ENABLED(CONFIG_XMON_DEFAULT);
>>
>> I think the logic would probably clearer if we invert this to become
>> xmon_on.
>>
> yep, make sense.
>
>>> @@ -3266,16 +3269,16 @@ static int __init setup_xmon_sysrq(void)
>>>   __initcall(setup_xmon_sysrq);
>>>   #endif /* CONFIG_MAGIC_SYSRQ */
>>>
>>> -static int __initdata xmon_early, xmon_off;
>>> +static int __initdata xmon_early;
>>>
>>>   static int __init early_parse_xmon(char *p)
>>>   {
>>> if (!p || strncmp(p, "early", 5) == 0) {
>>> /* just "xmon" is equivalent to "xmon=early" */
>>> -   xmon_init(1);
>>> xmon_early = 1;
>>> +   xmon_off = 0;
>>> } else if (strncmp(p, "on", 2) == 0)
>>> -   xmon_init(1);
>>> +   xmon_off = 0;
>>
>> You've just changed the timing of when xmon gets enabled for the above
>> two cases, from here which is called very early, to xmon_setup() which
>> is called much later in boot.
>>
>> That effectively disables xmon for most of the boot, which we do not
>> want to do.
>>
> Although it is not often that kernel got stucked during boot.

I hope you're joking! :)

cheers


Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails

2017-02-19 Thread Michael Ellerman
Stewart Smith  writes:

> Vipin K Parashar  writes:
>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>> Vipin K Parashar  writes:
>>>
 OPAL returns OPAL_WRONG_STATE for XSCOM operations

 done to read any core FIR which is sleeping, offline.
>>> OK.
>>>
>>> Do we know why Linux is causing that to happen?
>>
>> This issue is originally seen upon running STAF (Software Test
>> Automation Framework) stress tests and off-lining some cores
>> with stress tests running.
>>
>> It can also be re-created after off-lining few cores and following
>> one of below methods.
>> 1. Executing Linux "sensors" command
>> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>> where X is offline CPU.
>>
>> Its "opal_get_sensor_data" Linux API that that triggers
>> OPAL call "opal_sensor_read", performing XSCOM ops here.
>> If core is found sleeping/offline Linux throws up
>> "opal_error_code: Unexpected OPAL error" error onto console.
>>
>> Currently Linux isn't aware about OPAL_WRONG_STATE return code
>> from OPAL. Thus it prints "Unexpected OPAL error" message, same
>> as it would log for any unknown OPAL return codes.
>>
>> Seeing this error over console has been a concern for Test and
>> would puzzle real user as well. This patch makes Linux aware about
>> OPAL_WRONG_STATE return code from OPAL and stops printing
>> "Unexpected OPAL error" message onto console for OPAL fails
>> with OPAL_WRONG_STATE
>
> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
> return the xscom_read return code in event of error.
>
> I would argue that converting to EIO in that instance is probably
> correct... or EAGAIN? EAGAIN may be more correct in the situation where
> the core is just sleeping.
>
> What kind of offlining are you doing?
>
> Arguably, the correct behaviour would be to remove said sensors when the
> core is offline.

Right, that would be ideal. There appear to be at least two other hwmon
drivers that are CPU hotplug aware (coretemp and via-cputemp).

But perhaps it's not possible to work out which sensors are attached to
which CPU etc., I haven't looked in detail.

In that case changing just opal_get_sensor_data() to handle
OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
asked to read a sensor on an offline CPU and we aren't able to detect
that.
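
One way to do that (a sketch of the idea only, not the submitted patch;
the helper name is made up) would be to special-case the return value
before handing it to opal_error_code():

static int sensor_opal_ret_to_errno(s64 rc)
{
	/*
	 * OPAL returns OPAL_WRONG_STATE when the core backing the sensor
	 * is sleeping or offline; we can't detect that here, so fail
	 * quietly with -EIO (or -EAGAIN) instead of logging
	 * "Unexpected OPAL error".
	 */
	if (rc == OPAL_WRONG_STATE)
		return -EIO;

	return opal_error_code(rc);
}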

cheers


Re: [next-20170217] WARN @/arch/powerpc/include/asm/xics.h:124 .icp_hv_eoi+0x40/0x140

2017-02-19 Thread Sachin Sant

>> While booting next-20170217 on a POWER6 box, I ran into following
>> warning. This is a full system lpar. Previous next tree was good.
>> I will try a bisect tomorrow.
> 
> Do you have CONFIG_DEBUG_SHIRQ=y ?
> 

Yes. CONFIG_DEBUG_SHIRQ is enabled.

As suggested by you reverting following commit allows a clean boot.
f91f694540f3 ("genirq: Reenable shared irq debugging in request_*_irq()”)

>> ipr: IBM Power RAID SCSI Device Driver version: 2.6.3 (October 17, 2015)
>> ipr 0200:00:01.0: Found IOA with IRQ: 305
>> [ cut here ]
>> WARNING: CPU: 12 PID: 1 at ./arch/powerpc/include/asm/xics.h:124 
>> .icp_hv_eoi+0x40/0x140
>> Modules linked in:
>> CPU: 12 PID: 1 Comm: swapper/14 Not tainted 
>> 4.10.0-rc8-next-20170217-autotest #1
>> task: c002b2a4a580 task.stack: c002b2a5c000
>> NIP: c00731b0 LR: c01389f8 CTR: c0073170
>> REGS: c002b2a5f050 TRAP: 0700   Not tainted  
>> (4.10.0-rc8-next-20170217-autotest)
>> MSR: 80029032 
>>  CR: 28004082  XER: 2004
>> CFAR: c01389e0 SOFTE: 0 
>> GPR00: c01389f8 c002b2a5f2d0 c1025800 c002b203f498 
>> GPR04:   0064 0131 
>> GPR08: 0001 c000d3104cb8  0009b1f8 
>> GPR12: 48004082 cedc2400 c000dad0  
>> GPR16:  3c007efc c0a9e848  
>> GPR20: d8008008 c002af4d47f0 c11efda8 c0a9ea10 
>> GPR24: c0a9e848  c002af4d4fb8  
>> GPR28:  c002b203f498 c0ef8928 c002b203f400 
>> NIP [c00731b0] .icp_hv_eoi+0x40/0x140
>> LR [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
>> Call Trace:
>> [c002b2a5f2d0] [c002b2a5f360] 0xc002b2a5f360 (unreliable)
>> [c002b2a5f360] [c01389f8] .handle_fasteoi_irq+0x1e8/0x270
>> [c002b2a5f3e0] [c0136a08] .request_threaded_irq+0x298/0x370
>> [c002b2a5f490] [c05895c0] .ipr_probe_ioa+0x1110/0x1390
>> [c002b2a5f5c0] [c058d030] .ipr_probe+0x30/0x3e0
>> [c002b2a5f670] [c0466860] .local_pci_probe+0x60/0x130
>> [c002b2a5f710] [c0467658] .pci_device_probe+0x148/0x1e0
>> [c002b2a5f7c0] [c0527524] .driver_probe_device+0x2d4/0x5b0
>> [c002b2a5f860] [c052796c] .__driver_attach+0x16c/0x190
>> [c002b2a5f8f0] [c05242c4] .bus_for_each_dev+0x84/0xf0
>> [c002b2a5f990] [c0526af4] .driver_attach+0x24/0x40
>> [c002b2a5fa00] [c0526318] .bus_add_driver+0x2a8/0x370
>> [c002b2a5faa0] [c0528a5c] .driver_register+0x8c/0x170
>> [c002b2a5fb20] [c0465a54] .__pci_register_driver+0x44/0x60
>> [c002b2a5fb90] [c0b8efc8] .ipr_init+0x58/0x70
>> [c002b2a5fc10] [c000d20c] .do_one_initcall+0x5c/0x1c0
>> [c002b2a5fce0] [c0b44738] .kernel_init_freeable+0x280/0x360
>> [c002b2a5fdb0] [c000daec] .kernel_init+0x1c/0x130
>> [c002b2a5fe30] [c000baa0] .ret_from_kernel_thread+0x58/0xb8
>> Instruction dump:
>> f8010010 f821ff71 80e3000c 7c0004ac e94d0030 3d02ffbc 3928f4b8 7d295214 
>> 81090004 3948 7d484378 79080fe2 <0b08> 2fa8 40de0050 91490004 
>> ---[ end trace 5e18ae409f46392c ]---
>> ipr 0200:00:01.0: Initializing IOA.
>> 
>> Thanks
>> -Sachin
> 



Re: [PATCH] powerpc/mm/hugetlb: Filter out hugepage size not supported by page table layout

2017-02-19 Thread Aneesh Kumar K.V
Benjamin Herrenschmidt  writes:

> On Mon, 2017-02-20 at 09:02 +0530, Aneesh Kumar K.V wrote:
>> To avoid crashes like the one reported in the commit message due to 
>> buggy firmware ? 
>
> I don't want Linux to make those assumptions. We should fix the FW.
>

I was not suggesting that we not fix the FW. The idea was twofold.

We cannot support arbitrary hugetlb page sizes; they need to be
supported at the Linux page table level. So a generic check like
is_power_of_2/4() may not be what we want. The second point is to
document clearly which page sizes are supported by a platform.

-aneesh