Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Tue 11-07-17 12:32:57, Ram Pai wrote:
> On Tue, Jul 11, 2017 at 04:52:46PM +0200, Michal Hocko wrote:
> > On Wed 05-07-17 14:21:37, Ram Pai wrote:
> > > Memory protection keys enable applications to protect its
> > > address space from inadvertent access or corruption from
> > > itself.
> > >
> > > The overall idea:
> > >
> > > A process allocates a key and associates it with
> > > an address range within its address space.
> > > The process then can dynamically set read/write
> > > permissions on the key without involving the
> > > kernel. Any code that violates the permissions
> > > of the address space, as defined by its associated
> > > key, will receive a segmentation fault.
> > >
> > > This patch series enables the feature on PPC64 HPTE
> > > platform.
> > >
> > > ISA3.0 section 5.7.13 describes the detailed specifications.
> >
> > Could you describe the high-level design of this feature in the cover
> > letter.
>
> Yes it can be hard to understand without the big picture. I will
> provide the high level design and the rationale behind the patch split
> towards the end. Also I will have it in the cover letter for my next
> revision of the patchset.

Thanks!

> > I have tried to get some idea from the patchset but it was
> > really far from trivial. Patches are not very well split up (many
> > helpers are added without their users etc..).
>
> I see your point. Earlier, I had the patches split such a way that the
> users of the helpers were in the same patch as that of the helper.
> But then comments from others lead to the current split.

It is not my call here, obviously. I cannot review arch specific parts
due to lack of familiarity but it is a general good practice to include
helpers along with their users to make the usage clear. Also, as much as
I like small patches because they are easier to review, having very many
of them can lead to a harder review in the end because you easily lose a
higher level overview.

> > > Testing:
> > > This patch series has passed all the protection key
> > > tests available in the selftests directory.
> > > The tests are updated to work on both x86 and powerpc.
> > >
> > > version v5:
> > > (1) reverted back to the old design -- store the
> > >     key in the pte, instead of bypassing it.
> > >     The v4 design slowed down the hash page path.
> >
> > This surprised me a lot but I couldn't find the respective code. Why do
> > you need to store anything in the pte? My understanding of PKEYs is that
> > the setup and teardown should be very cheap and so no page tables have
> > to be updated. Or do I just misunderstand what you wrote here?
>
> Ideally the MMU looks at the PTE for keys, in order to enforce
> protection. This is the case with x86 and is the case with power9 Radix
> page table. Hence the keys have to be programmed into the PTE.

But x86 doesn't update ptes for PKEYs, that would be just too expensive.
You could use standard mprotect to do the same...

> However with HPT on power, these keys do not necessarily have to be
> programmed into the PTE. We could bypass the Linux Page Table Entry (PTE)
> and instead just program them into the Hash Page Table (HPTE), since
> the MMU does not refer to the PTE but to the HPTE. The last version
> of the patchset attempted to do that. It worked as follows:
>
> a) when an address range is requested to be associated with a key; by the
>    application through key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range.
>
> b) Whenever there is a hash page fault for that address, the fault
>    handler reads the key from the VMA and programs the key into the
>    HPTE. __hash_page() is the function that does that.

What causes the fault here?

> c) Once the hpte is programmed, the MMU can sense key violations and
>    generate key-faults.
>
> The problem is with step (b). This step is really a very critical
> path which is performance sensitive. We don't want to add any delays.
> However if we want to access the key from the vma, we will have to
> hold the vma semaphore, and that is a big NO-NO. As a result, this
> design had to be dropped.
>
> I reverted back to the old design, i.e. the design in the v4 version. In this
> version we do the following:
>
> a) when an address range is requested to be associated with a key; by the
>    application through key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range. Also the kernel programs the key into the Linux PTE corresponding
>    to all the pages associated with the address range.

OK, so how is this any different from the regular mprotect then?

> b) Whenever there is a hash page fault for that address, the fault
>    handler reads the key from the Linux PTE and programs the key into
>    the HPTE.
>
> c) Once the HPTE is programmed, the MMU can sense key violations and
>    generate key-faults.
>
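For readers coming to the thread cold, here is a minimal user-space sketch of the flow described above. It assumes a libc that exposes the pkey_*() wrappers (otherwise the raw syscalls have to be used) and omits error handling; it is illustrative only and not part of the series.

#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	int pkey = pkey_alloc(0, 0);		/* ask the kernel for a key */

	pkey_mprotect(p, len, PROT_READ | PROT_WRITE, pkey); /* tag the range */
	pkey_set(pkey, PKEY_DISABLE_WRITE);	/* drop write rights, no syscall */
	/* a store through p here would now take a key fault (SIGSEGV) */
	pkey_set(pkey, 0);			/* restore rights, no syscall */
	pkey_free(pkey);
	munmap(p, len);
	return 0;
}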
Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Wed 12-07-17 09:23:37, Michal Hocko wrote: > On Tue 11-07-17 12:32:57, Ram Pai wrote: [...] > > Ideally the MMU looks at the PTE for keys, in order to enforce > > protection. This is the case with x86 and is the case with power9 Radix > > page table. Hence the keys have to be programmed into the PTE. > > But x86 doesn't update ptes for PKEYs, that would be just too expensive. > You could use standard mprotect to do the same... OK, this seems to be a misunderstanding and confusion on my end. do_mprotect_pkey does mprotect_fixup even for the pkey path which is quite surprising to me. I guess my misunderstanding comes from Documentation/x86/protection-keys.txt " Memory Protection Keys provides a mechanism for enforcing page-based protections, but without requiring modification of the page tables when an application changes protection domains. It works by dedicating 4 previously ignored bits in each page table entry to a "protection key", giving 16 possible keys. " So please disregard my previous comments about page tables and sorry about the confusion. -- Michal Hocko SUSE Labs
Re: [PATCH] powerpc: Handle page faults for bad paste and bad AMO
Benjamin Herrenschmidt writes: > On POWER9 and bad paste instruction (targeting the wrong memory > type) or an invalid opcode in an AMO (atomic memory operation) > will result in specific DSISR bits to be set. > > We currently don't understand those bits and thus just "hang" > the process taking constant page faults. > > Additionally in the case of paste, it appears that we don't > always get a valid DAR value when the error is a wrong memory > type. > > So we need to test those errors early in do_page_fault(), > I chose to generate a SIGBUS which is more correct than a SIGSEGV. This is true even for hash right ? If so do we want to update do_hash_page: #ifdef CONFIG_PPC_STD_MMU_64 andis. r0,r4,0xa450/* weird error? */ bne-handle_page_fault /* if not, try to insert a HPTE */ that 0xa450 such that we add these errors and call hand_page_fault in case of these DSISR values ? > > Signed-off-by: Benjamin Herrenschmidt > --- > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index 0266c664014a..5dfce2022f74 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -337,6 +337,21 @@ int do_page_fault(struct pt_regs *regs, unsigned long > address, > } > #endif /* CONFIG_PPC_ICSWX */ > > +#ifdef CONFIG_PPC_STD_MMU_64 > + /* > + * These faults indicate a copy/paste on an invalid memory type > + * or an incorrect AMO operation. They have been observed as not > + * properly updating the DAR, so handle them early > + */ > + if (error_code & (DSISR_BAD_COPYPASTE | DSISR_BAD_AMO)) { > + if (user_mode(regs)) > + _exception(SIGBUS, regs, BUS_OBJERR, address); > + else > + rc = SIGBUS; > + goto bail; > + } > +#endif /* CONFIG_PPC_STD_MMU_64 */ > + > if (notify_page_fault(regs)) > goto bail; > > diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h > index 7e50e47375d6..71f524f2b109 100644 > --- a/arch/powerpc/include/asm/reg.h > +++ b/arch/powerpc/include/asm/reg.h > @@ -282,6 +282,8 @@ > #define DSISR_UNSUPP_MMU 0x0008 /* Unsupported MMU config */ > #define DSISR_SET_RC 0x0004 /* Failed setting of > R/C bits */ > #define DSISR_PGDIRFAULT 0x0002 /* Fault on page directory */ > +#define DSISR_BAD_COPYPASTE 0x0008 /* Copy/Paste on wrong mem > type */ > +#define DSISR_BAD_AMO 0x0004 /* Incorrect AMO opcode > */ > #define SPRN_TBRL0x10C /* Time Base Read Lower Register (user, R/O) */ > #define SPRN_TBRU0x10D /* Time Base Read Upper Register (user, R/O) */ > #define SPRN_CIR 0x11B /* Chip Information Register (hyper, R/0) */ -aneesh
[PATCH 0/7] Prepare 8xx for CONFIG_STRICT_KERNEL_RWX
This series makes the pinning of ITLBs optional on the 8xx in order to
allow STRICT_KERNEL_RWX to work properly.

Christophe Leroy (7):
  powerpc/8xx: Ensures RAM mapped with LTLB is seen as block mapped on 8xx.
  powerpc/8xx: Remove macro that checks kernel address
  powerpc/32: Avoid risk of unrecoverable TLBmiss inside entry_32.S
  powerpc/8xx: Make pinning of ITLBs optional
  powerpc/8xx: Do not allow Pinned TLBs with STRICT_KERNEL_RWX or DEBUG_PAGEALLOC
  powerpc/8xx: mark init functions with __init
  powerpc/8xx: Reduce DTLB miss handler by one insn

 arch/powerpc/Kconfig           | 13 +-
 arch/powerpc/kernel/entry_32.S |  7 +++
 arch/powerpc/kernel/head_8xx.S | 96 +-
 arch/powerpc/mm/8xx_mmu.c      | 29 ++---
 4 files changed, 107 insertions(+), 38 deletions(-)

-- 
2.12.0
[PATCH 1/7] powerpc/8xx: Ensures RAM mapped with LTLB is seen as block mapped on 8xx.
On the 8xx, the RAM mapped with LTLBs must be seen as block mapped, just like areas mapped with BATs on standard PPC32. Signed-off-by: Christophe Leroy --- arch/powerpc/mm/8xx_mmu.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/mm/8xx_mmu.c b/arch/powerpc/mm/8xx_mmu.c index f4c6472f2fc4..f3a00cef9c34 100644 --- a/arch/powerpc/mm/8xx_mmu.c +++ b/arch/powerpc/mm/8xx_mmu.c @@ -22,8 +22,11 @@ extern int __map_without_ltlbs; +static unsigned long block_mapped_ram; + /* - * Return PA for this VA if it is in IMMR area, or 0 + * Return PA for this VA if it is in an area mapped with LTLBs. + * Otherwise, returns 0 */ phys_addr_t v_block_mapped(unsigned long va) { @@ -33,11 +36,13 @@ phys_addr_t v_block_mapped(unsigned long va) return 0; if (va >= VIRT_IMMR_BASE && va < VIRT_IMMR_BASE + IMMR_SIZE) return p + va - VIRT_IMMR_BASE; + if (va >= PAGE_OFFSET && va < PAGE_OFFSET + block_mapped_ram) + return __pa(va); return 0; } /* - * Return VA for a given PA or 0 if not mapped + * Return VA for a given PA mapped with LTLBs or 0 if not mapped */ unsigned long p_block_mapped(phys_addr_t pa) { @@ -47,6 +52,8 @@ unsigned long p_block_mapped(phys_addr_t pa) return 0; if (pa >= p && pa < p + IMMR_SIZE) return VIRT_IMMR_BASE + pa - p; + if (pa < block_mapped_ram) + return (unsigned long)__va(pa); return 0; } @@ -133,6 +140,8 @@ unsigned long __init mmu_mapin_ram(unsigned long top) if (mapped) memblock_set_current_limit(mapped); + block_mapped_ram = mapped; + return mapped; } -- 2.12.0
[PATCH 2/7] powerpc/8xx: Remove macro that checks kernel address
The macro to check if an address is a kernel address or not is not used anymore in DTLBmiss handler. It is used in ITLB miss handler and in DTLB error handler. DTLB error handler is not a hot path, it doesn't need such optimisation. In order to simplify a following patch which will rework ITLB miss handler, we remove the macros and reintroduce them inside the handler. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 29 - 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index c032fe8c2d26..02671e33905c 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -50,16 +50,9 @@ mtspr spr, reg #endif -/* Macro to test if an address is a kernel address */ #if CONFIG_TASK_SIZE <= 0x8000 && CONFIG_PAGE_OFFSET >= 0x8000 -#define IS_KERNEL(tmp, addr) \ - andis. tmp, addr, 0x8000 /* Address >= 0x8000 */ -#define BRANCH_UNLESS_KERNEL(label)beq label -#else -#define IS_KERNEL(tmp, addr) \ - rlwinm tmp, addr, 16, 16, 31; \ - cmpli cr0, tmp, PAGE_OFFSET >> 16 -#define BRANCH_UNLESS_KERNEL(label)blt label +/* By simply checking Address >= 0x8000, we know if its a kernel address */ +#define SIMPLE_KERNEL_ADDRESS 1 #endif @@ -347,11 +340,20 @@ InstructionTLBMiss: mfcrr3 #endif #if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) - IS_KERNEL(r11, r10) +#ifdef SIMPLE_KERNEL_ADDRESS + andis. r11, r10, 0x8000/* Address >= 0x8000 */ +#else + rlwinm r11, r10, 16, 0xfff8 + cmpli cr0, r11, PAGE_OFFSET@h +#endif #endif mfspr r11, SPRN_M_TW /* Get level 1 table */ #if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) - BRANCH_UNLESS_KERNEL(3f) +#ifdef SIMPLE_KERNEL_ADDRESS + beq+3f +#else + blt+3f +#endif lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha 3: #endif @@ -705,9 +707,10 @@ FixupDAR:/* Entry point for dcbx workaround. */ mtspr SPRN_SPRG_SCRATCH2, r10 /* fetch instruction from memory. */ mfspr r10, SPRN_SRR0 - IS_KERNEL(r11, r10) + rlwinm r11, r10, 16, 0xfff8 + cmpli cr0, r11, PAGE_OFFSET@h mfspr r11, SPRN_M_TW /* Get level 1 table */ - BRANCH_UNLESS_KERNEL(3f) + blt+3f rlwinm r11, r10, 16, 0xfff8 _ENTRY(FixupDAR_cmp) cmpli cr7, r11, (PAGE_OFFSET + 0x180)@h -- 2.12.0
[PATCH 3/7] powerpc/32: Avoid risk of unrecoverable TLBmiss inside entry_32.S
By default, the 8xx pins an ITLB on the first 8M of memory in order to avoid any ITLB miss on kernel code. However, with some debug functions like DEBUG_PAGEALLOC and DEBUG_RODATA, pinning TLBs is contradictory. In order to avoid any ITLB miss in a critical section without pinning TLBs, we have to ensure that there is no page boundary crossed between the setup of a new value in SRR0/SRR1 and the associated RFI. The functions modifying srr0/srr1 are all located in setup_32.S. They are spread over almost 4kbytes. The patch forces a 12 bits (4kbytes) alignment for those functions. This garanties that the functions remain in a single 4k page. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/entry_32.S | 7 +++ 1 file changed, 7 insertions(+) diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index 8587059ad848..4e9a359ceff6 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -43,6 +43,13 @@ #define LOAD_MSR_KERNEL(r, x) li r,(x) #endif +/* + * Align to 4k in order to ensure that all functions modyfing srr0/srr1 + * fit into one page in order to not encounter a TLB miss between the + * modification of srr0/srr1 and the associated rfi. + */ + .align 12 + #ifdef CONFIG_BOOKE .globl mcheck_transfer_to_handler mcheck_transfer_to_handler: -- 2.12.0
[PATCH 4/7] powerpc/8xx: Make pinning of ITLBs optional
As stated in a comment in head_8xx.S, today we "Always pin the first 8 MB ITLB to prevent ITLB misses while mucking around with SRR0/SRR1 in asm". This issue has just been cleared by the preceding patch, therefore we can make this pinning optional (on by default) and independent of DATA pinning. This patch also makes pinning of IMMR independent of pinning of DATA. Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 10 arch/powerpc/kernel/head_8xx.S | 57 +- arch/powerpc/mm/8xx_mmu.c | 8 +- 3 files changed, 62 insertions(+), 13 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 36f858c37ca7..d09b259d3621 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -1167,10 +1167,20 @@ config PIN_TLB bool "Pinned Kernel TLBs (860 ONLY)" depends on ADVANCED_OPTIONS && 8xx +config PIN_TLB_DATA + bool "Pinned TLB for DATA" + depends on PIN_TLB + default y + config PIN_TLB_IMMR bool "Pinned TLB for IMMR" depends on PIN_TLB default y + +config PIN_TLB_TEXT + bool "Pinned TLB for TEXT" + depends on PIN_TLB + default y endmenu if PPC64 diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index 02671e33905c..b889b5812274 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -55,6 +55,15 @@ #define SIMPLE_KERNEL_ADDRESS 1 #endif +/* + * We need an ITLB miss handler for kernel addresses if: + * - Either we have modules + * - Or we have not pinned the first 8M + */ +#if defined(CONFIG_MODULES) || !defined(CONFIG_PIN_TLB_TEXT) || \ +defined(CONFIG_DEBUG_PAGEALLOC) +#define ITLB_MISS_KERNEL 1 +#endif /* * Value for the bits that have fixed value in RPN entries. @@ -318,7 +327,7 @@ SystemCall: #endif InstructionTLBMiss: -#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(CONFIG_8xx_CPU6) || defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mtspr SPRN_SPRG_SCRATCH2, r3 #endif EXCEPTION_PROLOG_0 @@ -336,24 +345,32 @@ InstructionTLBMiss: INVALIDATE_ADJACENT_PAGES_CPU15(r11, r10) /* Only modules will cause ITLB Misses as we always * pin the first 8MB of kernel memory */ -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mfcrr3 #endif -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) -#ifdef SIMPLE_KERNEL_ADDRESS +#ifdef ITLB_MISS_KERNEL +#if defined(SIMPLE_KERNEL_ADDRESS) && defined(CONFIG_PIN_TLB_TEXT) andis. 
r11, r10, 0x8000/* Address >= 0x8000 */ #else rlwinm r11, r10, 16, 0xfff8 cmpli cr0, r11, PAGE_OFFSET@h +#ifndef CONFIG_PIN_TLB_TEXT + /* It is assumed that kernel code fits into the first 8M page */ +_ENTRY(ITLBMiss_cmp) + cmpli cr7, r11, (PAGE_OFFSET + 0x080)@h +#endif #endif #endif mfspr r11, SPRN_M_TW /* Get level 1 table */ -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) -#ifdef SIMPLE_KERNEL_ADDRESS +#ifdef ITLB_MISS_KERNEL +#if defined(SIMPLE_KERNEL_ADDRESS) && defined(CONFIG_PIN_TLB_TEXT) beq+3f #else blt+3f #endif +#ifndef CONFIG_PIN_TLB_TEXT + blt cr7, ITLBMissLinear +#endif lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha 3: #endif @@ -371,7 +388,7 @@ InstructionTLBMiss: rlwimi r10, r11, 0, 0, 32 - PAGE_SHIFT - 1 /* Add level 2 base */ lwz r10, 0(r10) /* Get the pte */ 4: -#if defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mtcrr3 #endif /* Insert the APG into the TWC from the Linux PTE. */ @@ -402,7 +419,7 @@ InstructionTLBMiss: MTSPR_CPU6(SPRN_MI_RPN, r10, r3)/* Update TLB entry */ /* Restore registers */ -#if defined(CONFIG_8xx_CPU6) || defined(CONFIG_MODULES) || defined (CONFIG_DEBUG_PAGEALLOC) || defined (CONFIG_HUGETLB_PAGE) +#if defined(CONFIG_8xx_CPU6) || defined(ITLB_MISS_KERNEL) || defined(CONFIG_HUGETLB_PAGE) mfspr r3, SPRN_SPRG_SCRATCH2 #endif EXCEPTION_EPILOG_0 @@ -697,6 +714,22 @@ DTLBMissLinear: EXCEPTION_EPILOG_0 rfi +#ifndef CONFIG_PIN_TLB_TEXT +ITLBMissLinear: + mtcrr3 + /* Set 8M byte page and mark it valid */ + li r11, MI_PS8MEG | MI_SVALID | _PAGE_EXEC + MTSPR_CPU6(SPRN_MI_TWC, r11, r3) + rlwinm r10, r10, 0, 0x0f80 /* 8xx supports max 256Mb RAM */ + ori r10, r10, 0xf0 | MI_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \ + _PAGE_PRESENT + MTSPR_CPU6(SPRN_MI_RPN, r10, r11) /* Update TLB entry */ + + mfspr
[PATCH 5/7] powerpc/8xx: Do not allow Pinned TLBs with STRICT_KERNEL_RWX or DEBUG_PAGEALLOC
Pinning TLBs bypasses STRICT_KERNEL_RWX or DEBUG_PAGEALLOC protections so it should only be allowed when those are not selected Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index d09b259d3621..28608275d7c0 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -1165,7 +1165,8 @@ config CONSISTENT_SIZE config PIN_TLB bool "Pinned Kernel TLBs (860 ONLY)" - depends on ADVANCED_OPTIONS && 8xx + depends on ADVANCED_OPTIONS && PPC_8xx && \ + !DEBUG_PAGEALLOC && !STRICT_KERNEL_RWX config PIN_TLB_DATA bool "Pinned TLB for DATA" -- 2.12.0
[PATCH 6/7] powerpc/8xx: mark init functions with __init
setup_initial_memory_limit() is only called during init. mmu_patch_cmp_limit() is only called from 8xx_mmu.c Signed-off-by: Christophe Leroy --- arch/powerpc/mm/8xx_mmu.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/mm/8xx_mmu.c b/arch/powerpc/mm/8xx_mmu.c index ab3b10746f36..f29212e40f40 100644 --- a/arch/powerpc/mm/8xx_mmu.c +++ b/arch/powerpc/mm/8xx_mmu.c @@ -87,7 +87,7 @@ void __init MMU_init_hw(void) #endif } -static void mmu_mapin_immr(void) +static void __init mmu_mapin_immr(void) { unsigned long p = PHYS_IMMR_BASE; unsigned long v = VIRT_IMMR_BASE; @@ -107,7 +107,7 @@ extern unsigned int DTLBMiss_cmp, FixupDAR_cmp; extern unsigned int ITLBMiss_cmp; #endif -void mmu_patch_cmp_limit(unsigned int *addr, unsigned long mapped) +static void __init mmu_patch_cmp_limit(unsigned int *addr, unsigned long mapped) { unsigned int instr = *addr; @@ -151,8 +151,8 @@ unsigned long __init mmu_mapin_ram(unsigned long top) return mapped; } -void setup_initial_memory_limit(phys_addr_t first_memblock_base, - phys_addr_t first_memblock_size) +void __init setup_initial_memory_limit(phys_addr_t first_memblock_base, + phys_addr_t first_memblock_size) { /* We don't currently support the first MEMBLOCK not mapping 0 * physical on those processors -- 2.12.0
[PATCH 7/7] powerpc/8xx: Reduce DTLB miss handler by one insn
This reduces the DTLB miss handler hot path (user address path) by one instruction by preserving r10. Signed-off-by: Christophe Leroy --- arch/powerpc/kernel/head_8xx.S | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S index b889b5812274..7365148219fd 100644 --- a/arch/powerpc/kernel/head_8xx.S +++ b/arch/powerpc/kernel/head_8xx.S @@ -466,23 +466,23 @@ DataStoreTLBMiss: * kernel page tables. */ mfspr r10, SPRN_MD_EPN - rlwinm r10, r10, 16, 0xfff8 - cmpli cr0, r10, PAGE_OFFSET@h + rlwinm r11, r10, 16, 0xfff8 + cmpli cr0, r11, PAGE_OFFSET@h mfspr r11, SPRN_M_TW /* Get level 1 table */ blt+3f + rlwinm r11, r10, 16, 0xfff8 #ifndef CONFIG_PIN_TLB_IMMR - cmpli cr0, r10, VIRT_IMMR_BASE@h + cmpli cr0, r11, VIRT_IMMR_BASE@h #endif _ENTRY(DTLBMiss_cmp) - cmpli cr7, r10, (PAGE_OFFSET + 0x180)@h - lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha + cmpli cr7, r11, (PAGE_OFFSET + 0x180)@h #ifndef CONFIG_PIN_TLB_IMMR _ENTRY(DTLBMiss_jmp) beq-DTLBMissIMMR #endif blt cr7, DTLBMissLinear + lis r11, (swapper_pg_dir-PAGE_OFFSET)@ha 3: - mfspr r10, SPRN_MD_EPN /* Insert level 1 index */ rlwimi r11, r10, 32 - ((PAGE_SHIFT - 2) << 1), (PAGE_SHIFT - 2) << 1, 29 @@ -703,7 +703,7 @@ DTLBMissLinear: /* Set 8M byte page and mark it valid */ li r11, MD_PS8MEG | MD_SVALID MTSPR_CPU6(SPRN_MD_TWC, r11, r3) - rlwinm r10, r10, 16, 0x0f80/* 8xx supports max 256Mb RAM */ + rlwinm r10, r10, 0, 0x0f80 /* 8xx supports max 256Mb RAM */ ori r10, r10, 0xf0 | MD_SPS16K | _PAGE_SHARED | _PAGE_DIRTY | \ _PAGE_PRESENT MTSPR_CPU6(SPRN_MD_RPN, r10, r11) /* Update TLB entry */ -- 2.12.0
Re: [PATCH v5 2/2] powerpc/fadump: update documentation about 'fadump_append=' parameter
Hello, On Wed, 12 Jul 2017 00:00:57 +0530 Hari Bathini wrote: > Hi Michal, > > > Thanks for the review.. > > > On Monday 26 June 2017 05:45 PM, Michal Suchánek wrote: > > Hello, > > > > On Tue, 20 Jun 2017 21:14:08 +0530 > > Hari Bathini wrote: > > > I would prefer documenting over a complex implementation. Actually, I > am considering a simple approach of replacing every occurrence of > "fadump_extra_args=" with "fadump_extra_args " in fadump capture > kernel. The cmdline > >"root=/dev/sda2 ro fadump_extra_args="a b c" crashkernel=512M > fadump_extra_args=d" > > becomes > >"root=/dev/sda2 ro fadump_extra_args "a b c" crashkernel=512M > fadump_extra_args d" which is totally broken > > in fadump capture kernel. This must take care of the pitfalls with > the current approach and also, > doesn't rely on parse_args() which was not designed for this scenario > to start with..? It was designed for parsing arguments. To handle replacing arguments you have to extend it. You need to get more information from it for this case. Best regards Michal
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Sukadev Bhattiprolu writes: > Hi Arnaldo, > > Please pull the JSON files for POWER9 PMU events. > > The following changes since commit 07d306c838c5c30196619baae36107d0615e459b: > > Merge git://www.linux-watchdog.org/linux-watchdog (2017-07-11 09:59:37 > -0700) > > are available in the git repository at: > > https://github.com/sukadev/linux/ p9-json > > for you to fetch changes up to f94a86dbb13e18bd7ad566cabe42df63d3bd1039: > > perf vendor events: Add POWER9 PVRs to mapfile (2017-07-11 19:30:40 -0500) > > > Sukadev Bhattiprolu (2): > perf vendor events: Add POWER9 PMU events > perf vendor events: Add POWER9 PVRs to mapfile I think the PVRs need work. You have: +004e0100,1,power9.json,core +004e,1,power9.json,core The first is P9 DD1, but the second doesn't exist. We have it in the kernel, but with a mask of 0x. From memory the perf code doesn't do any masking or anything fancy, it looks for an exact match. So for starters you should probably drop that one and add 0x004e1200 and 0x004e0200. cheers
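To make the exact-match point concrete, here is a rough stand-alone sketch of the difference between an exact cpuid-string match (how the mapfile lookup behaves, as described above) and the masked compare the kernel's cputable uses. None of this is perf's actual code; the function names and the "%08x" formatting are assumptions for illustration only.

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Exact match: the PVR is formatted as hex text and compared verbatim,
 * so a short entry like "004e" can never equal "004e1200". */
static bool exact_match(const char *entry, unsigned int pvr)
{
	char cpuid[9];

	snprintf(cpuid, sizeof(cpuid), "%08x", pvr);
	return strcmp(entry, cpuid) == 0;
}

/* Masked match: one (value, mask) pair covers every revision of a chip. */
static bool masked_match(unsigned int value, unsigned int mask, unsigned int pvr)
{
	return (pvr & mask) == value;
}

int main(void)
{
	unsigned int pvr = 0x004e1200;	/* one of the PVRs suggested above */

	printf("%d %d\n", exact_match("004e", pvr),		/* prints 0 */
	       masked_match(0x004e0000, 0xffff0000, pvr));	/* prints 1 */
	return 0;
}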
Re: [PATCH v5 2/2] powerpc/fadump: update documentation about 'fadump_append=' parameter
On Wednesday 12 July 2017 05:01 PM, msuchanek wrote:
> Hello,
>
> On Wed, 12 Jul 2017 00:00:57 +0530 Hari Bathini wrote:
>> Hi Michal,
>>
>> Thanks for the review..
>>
>> On Monday 26 June 2017 05:45 PM, Michal Suchánek wrote:
>>> Hello,
>>>
>>> On Tue, 20 Jun 2017 21:14:08 +0530 Hari Bathini wrote:
>>
>> I would prefer documenting over a complex implementation. Actually, I
>> am considering a simple approach of replacing every occurrence of
>> "fadump_extra_args=" with "fadump_extra_args " in the fadump capture
>> kernel. The cmdline
>>
>>    "root=/dev/sda2 ro fadump_extra_args="a b c" crashkernel=512M fadump_extra_args=d"
>>
>> becomes
>>
>>    "root=/dev/sda2 ro fadump_extra_args "a b c" crashkernel=512M fadump_extra_args d"
>
> which is totally broken

My bad! I don't know what I was thinking when I expected that to work.. :-[

>> in the fadump capture kernel. This must take care of the pitfalls with
>> the current approach and also, doesn't rely on parse_args() which was
>> not designed for this scenario to start with..?
>
> It was designed for parsing arguments. To handle replacing arguments
> you have to extend it. You need to get more information from it for
> this case.

I will try to work on this. Also, I want to replace in-place without moving
other parameters. I guess I could do that by replacing the leftover length
after processing with spaces..

Thanks
Hari
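A rough sketch of the blanking half of that in-place idea is below; the names are made up, and the quoting rules that parse_args() implements are only approximated here, so this is an illustration rather than a proposed implementation.

#include <string.h>

/* Blank every "fadump_extra_args=..." occurrence in place, padding with
 * spaces so the rest of the command line keeps its layout. */
static void blank_fadump_args(char *cmdline)
{
	const char *key = "fadump_extra_args=";
	char *p = cmdline;

	while ((p = strstr(p, key)) != NULL) {
		char *end = p + strlen(key);

		if (*end == '"') {		/* quoted value: skip to closing quote */
			char *q = strchr(end + 1, '"');

			end = q ? q + 1 : p + strlen(p);
		} else {			/* unquoted value: skip to next space */
			while (*end && *end != ' ')
				end++;
		}
		memset(p, ' ', end - p);	/* other parameters never move */
		p = end;
	}
}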
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: > Sukadev Bhattiprolu writes: > > Please pull the JSON files for POWER9 PMU events. > > perf vendor events: Add POWER9 PMU events > > perf vendor events: Add POWER9 PVRs to mapfile > I think the PVRs need work. > You have: > +004e0100,1,power9.json,core > +004e,1,power9.json,core > The first is P9 DD1, but the second doesn't exist. We have it in the > kernel, but with a mask of 0x. From memory the perf code doesn't > do any masking or anything fancy, it looks for an exact match. > So for starters you should probably drop that one and add 0x004e1200 and > 0x004e0200. Could this check be somehow done automatically? Using whatever is in the kernel sources or some IBM released document? - Arnaldo
[PATCH] powerpc/time: use get_tb instead of get_vtb in running_clock
Virtual time base(vtb) is a register which increases only in guest. Any exit from guest to host will stop the vtb(saved and restored by kvm). But if there is an IO causes guest exits to host, the guest's watchdog (watchdog_timer_fn -> is_softlockup -> get_timestamp -> running_clock) needs to also include the time elapsed in host. get_vtb is not correct in this case. Also, the TB_OFFSET is well saved and restored by qemu after commit [1]. So we can use get_tb here. [1] http://git.qemu.org/?p=qemu.git;a=commit;h=42043e4f1 Signed-off-by: Jia He --- arch/powerpc/kernel/time.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index fe6f3a2..c542dd3 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -695,16 +695,15 @@ notrace unsigned long long sched_clock(void) unsigned long long running_clock(void) { /* -* Don't read the VTB as a host since KVM does not switch in host -* timebase into the VTB when it takes a guest off the CPU, reading the -* VTB would result in reading 'last switched out' guest VTB. +* Use get_tb instead of get_vtb for guest since the TB_OFFSET has been +* well saved/restored when qemu does suspend/resume. * * Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it * would be unsafe to rely only on the #ifdef above. */ if (firmware_has_feature(FW_FEATURE_LPAR) && cpu_has_feature(CPU_FTR_ARCH_207S)) - return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; + return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; /* * This is a next best approximation without a VTB. -- 2.9.3
[PATCH] powerpc/hugetlb: fix page rights verification in gup_hugepte()
gup_hugepte() checks if pages are present and readable, and when 'write' is set, also checks if the pages are writable. Initially this was done by checking if _PAGE_PRESENT and _PAGE_READ were set. In addition, _PAGE_WRITE was verified for write accesses. The problem is that we have to handle the three following cases: 1/ The target defines __PAGE_READ and __PAGE_WRITE 2/ The target defines __PAGE_RW 3/ The target defines __PAGE_RO In case 1/, this is obvious In case 2/, __PAGE_READ is defined as 0 and __PAGE_WRITE as __PAGE_RW so it works as well. But in case 3, __PAGE_RW is defined as 0, which means __PAGE_WRITE is 0 and then the test returns true (page writable) in all cases. A first correction was attempted in commit 6b8cb66a6a7cc ("powerpc: Fix usage of _PAGE_RO in hugepage"), but that fix is wrong: instead of checking that the page is writable when write is requested, it checks that the page is NOT writable when write is NOT requested. This patch adds a new pte_read() helper to check whether a page is readable or not. This avoids handling all possible cases in gup_hugepte(). Then gup_hugepte() is modified to use pte_present(), pte_read() and pte_write() instead of the raw flags. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/book3s/32/pgtable.h | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 5 + arch/powerpc/include/asm/nohash/pgtable.h| 1 + arch/powerpc/mm/hugetlbpage.c| 15 +++ 4 files changed, 10 insertions(+), 12 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 7fb755880409..2a67b861e6e1 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -301,6 +301,7 @@ int map_kernel_page(unsigned long va, phys_addr_t pa, int flags); /* Generic accessors to PTE bits */ static inline int pte_write(pte_t pte) { return !!(pte_val(pte) & _PAGE_RW);} +static inline int pte_read(pte_t pte) { return 1; } static inline int pte_dirty(pte_t pte) { return !!(pte_val(pte) & _PAGE_DIRTY); } static inline int pte_young(pte_t pte) { return !!(pte_val(pte) & _PAGE_ACCESSED); } static inline int pte_special(pte_t pte) { return !!(pte_val(pte) & _PAGE_SPECIAL); } diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index c0737c86a362..83cb73c8c3bb 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -409,6 +409,11 @@ static inline int pte_write(pte_t pte) return __pte_write(pte) || pte_savedwrite(pte); } +static inline int pte_read(pte_t pte) +{ + return !!(pte_raw(pte) & cpu_to_be64(_PAGE_READ)); +} + #define __HAVE_ARCH_PTEP_SET_WRPROTECT static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep) diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index e5805ad78e12..17989c3d9a24 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -14,6 +14,7 @@ static inline int pte_write(pte_t pte) { return (pte_val(pte) & (_PAGE_RW | _PAGE_RO)) != _PAGE_RO; } +static inline int pte_read(pte_t pte) { return 1; } static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY; } static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED; } static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL; } diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 
e1bf5ca397fe..1226932579a0 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -977,7 +977,6 @@ EXPORT_SYMBOL_GPL(__find_linux_pte_or_hugepte); int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, unsigned long end, int write, struct page **pages, int *nr) { - unsigned long mask; unsigned long pte_end; struct page *head, *page; pte_t pte; @@ -988,18 +987,10 @@ int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr, end = pte_end; pte = READ_ONCE(*ptep); - mask = _PAGE_PRESENT | _PAGE_READ; - - /* -* On some CPUs like the 8xx, _PAGE_RW hence _PAGE_WRITE is defined -* as 0 and _PAGE_RO has to be set when a page is not writable -*/ - if (write) - mask |= _PAGE_WRITE; - else - mask |= _PAGE_RO; - if ((pte_val(pte) & mask) != mask) + if (!pte_present(pte) || !pte_read(pte)) + return 0; + if (write && !pte_write(pte)) return 0; /* hugepages are never "special" */ -- 2.12.0
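To spell out why case 3/ defeats the old test, here is a small stand-alone illustration; the flag values below are hypothetical, chosen to mimic a _PAGE_RO-only target, and this is not kernel code.

/* On a _PAGE_RO-only target, _PAGE_RW and therefore _PAGE_WRITE are 0. */
#define _PAGE_PRESENT	0x001
#define _PAGE_READ	0x000	/* not defined by the target, falls back to 0 */
#define _PAGE_WRITE	0x000	/* _PAGE_RW is 0, so this is 0 as well */
#define _PAGE_RO	0x002

static int old_gup_check(unsigned long pte, int write)
{
	unsigned long mask = _PAGE_PRESENT | _PAGE_READ;

	if (write)
		mask |= _PAGE_WRITE;	/* ORs in 0: writes are never refused */
	return (pte & mask) == mask;
}

/*
 * old_gup_check(_PAGE_PRESENT | _PAGE_RO, 1) returns 1, i.e. a read-only
 * page passes a write fast-GUP, which is the bug the pte_read()/pte_write()
 * helpers above avoid.
 */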
Re: [PATCH] powerpc/mm: Implemented default_hugepagesz verification for powerpc
Em 2017-07-05 13:03, victora escreveu: Em 2017-07-05 01:26, Aneesh Kumar K.V escreveu: On Tuesday 04 July 2017 01:35 AM, Victor Aoqui wrote: Implemented default hugepage size verification (default_hugepagesz=) in order to allow allocation of defined number of pages (hugepages=) only for supported hugepage sizes. Signed-off-by: Victor Aoqui --- arch/powerpc/mm/hugetlbpage.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index a4f33de..464e72e 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -797,6 +797,21 @@ static int __init hugepage_setup_sz(char *str) } __setup("hugepagesz=", hugepage_setup_sz); +static int __init default_hugepage_setup_sz(char *str) +{ +unsigned long long size; + +size = memparse(str, &str); + +if (add_huge_page_size(size) != 0) { +hugetlb_bad_size(); +pr_err("Invalid default huge page size specified(%llu)\n", size); +} + +return 1; +} +__setup("default_hugepagesz=", default_hugepage_setup_sz); isn't that a behavior change in what we have now ? . Right now if size specified is not supported, we fallback to HPAGE_SIZE. Yes, it is. However, is this a correct behavior? If we specify an unsupported value, for example default_hugepagesz=1M and hugepages=1000, 1M will be ignored and 1000 pages of 16M (arch default) will be allocated. This could lead to non-expected out of of memory/performance issue. mm/hugetlb.c if (!size_to_hstate(default_hstate_size)) { default_hstate_size = HPAGE_SIZE; if (!size_to_hstate(default_hstate_size)) hugetlb_add_hstate(HUGETLB_PAGE_ORDER); } + struct kmem_cache *hugepte_cache; static int __init hugetlbpage_init(void) { Even if we want to do this, this should be done in generic code and should not be powerpc specific The verification of supported powerpc hugepage size (hugepagesz=) is being performed on add_huge_page_size(), which is currently defined in arch/powerpc/mm/hugetlbpage.c. I think it makes more sense to implement default_hugepagesz= verification on arch/powerpc, don't you think? -aneesh Hi Aneesh, Did you have time to review the comments above? Thanks Victor
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Arnaldo Carvalho de Melo [a...@kernel.org] wrote: > Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: > > Sukadev Bhattiprolu writes: > > > Please pull the JSON files for POWER9 PMU events. > > > > perf vendor events: Add POWER9 PMU events > > > perf vendor events: Add POWER9 PVRs to mapfile > > > I think the PVRs need work. > > > You have: > > > +004e0100,1,power9.json,core > > +004e,1,power9.json,core > > > The first is P9 DD1, but the second doesn't exist. We have it in the > > kernel, but with a mask of 0x. From memory the perf code doesn't > > do any masking or anything fancy, it looks for an exact match. > > > So for starters you should probably drop that one and add 0x004e1200 and > > 0x004e0200. Yes, thanks. > > Could this check be somehow done automatically? Using whatever is in the > kernel sources or some IBM released document? Yes, had a recent discussion offline with Michael Petlan and Jiri about generalizing/simplifying this. For now, I have added the above PVRs and will look into the broader fix. The following changes since commit 07d306c838c5c30196619baae36107d0615e459b: Merge git://www.linux-watchdog.org/linux-watchdog (2017-07-11 09:59:37 -0700) are available in the git repository at: https://github.com/sukadev/linux/ p9-json for you to fetch changes up to 8870b54739866c6eea23c36a0125af8045fbd572: perf vendor events: Add POWER9 PVRs to mapfile (2017-07-12 11:11:00 -0500) Sukadev Bhattiprolu (2): perf vendor events: Add POWER9 PMU events perf vendor events: Add POWER9 PVRs to mapfile tools/perf/pmu-events/arch/powerpc/mapfile.csv | 3 + .../perf/pmu-events/arch/powerpc/power9/cache.json | 176 + .../arch/powerpc/power9/floating-point.json| 44 ++ .../pmu-events/arch/powerpc/power9/frontend.json | 446 +++ .../pmu-events/arch/powerpc/power9/marked.json | 782 +++ .../pmu-events/arch/powerpc/power9/memory.json | 158 .../perf/pmu-events/arch/powerpc/power9/other.json | 836 + .../pmu-events/arch/powerpc/power9/pipeline.json | 680 + tools/perf/pmu-events/arch/powerpc/power9/pmc.json | 146 .../arch/powerpc/power9/translation.json | 272 +++ 10 files changed, 3543 insertions(+) create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/cache.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/floating-point.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/frontend.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/marked.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/memory.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/other.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/pipeline.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/pmc.json create mode 100644 tools/perf/pmu-events/arch/powerpc/power9/translation.json
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Em Wed, Jul 12, 2017 at 09:40:08AM -0700, Sukadev Bhattiprolu escreveu: > Arnaldo Carvalho de Melo [a...@kernel.org] wrote: > > Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: > > > So for starters you should probably drop that one and add 0x004e1200 and > > > 0x004e0200. > Yes, thanks. > > Could this check be somehow done automatically? Using whatever is in the > > kernel sources or some IBM released document? > Yes, had a recent discussion offline with Michael Petlan and Jiri about > generalizing/simplifying this. For now, I have added the above PVRs and > will look into the broader fix. > The following changes since commit 07d306c838c5c30196619baae36107d0615e459b: > Merge git://www.linux-watchdog.org/linux-watchdog (2017-07-11 09:59:37 > -0700) > are available in the git repository at: > https://github.com/sukadev/linux/ p9-json Can you please base your branches out ouf tip/perf/core or acme/perf/core when you submit it with the fixes discussed here? thanks, - Arnaldo
Re: [RFC v5 36/38] selftest: PowerPC specific test updates to memory protection keys
On Tue, Jul 11, 2017 at 10:33:09AM -0700, Dave Hansen wrote: > On 07/05/2017 02:22 PM, Ram Pai wrote: > > Abstracted out the arch specific code into the header file, and > > added powerpc specific changes. > > > > a) added 4k-backed hpte, memory allocator, powerpc specific. > > b) added three test case where the key is associated after the page is > > accessed/allocated/mapped. > > c) cleaned up the code to make checkpatch.pl happy > > There's a *lot* of churn here. If it breaks, I'm going to have a heck > of a time figuring out which hunk broke. Is there any way to break this > up into a series of things that we have a chance at bisecting? Just finished breaking down the changes into 20 gradual increments. I have pushed it to my github tree at https://github.com/rampai/memorykeys.git branch is memkey.v6-rc3 See if it works for you. I am sure I would have broken something on x86 since I dont have a x86 platform to test. Let me know, Thanks, RP
Re: [RFC v5 11/38] mm: introduce an additional vma bit for powerpc pkey
On Tue, Jul 11, 2017 at 11:10:46AM -0700, Dave Hansen wrote: > On 07/05/2017 02:21 PM, Ram Pai wrote: > > Currently there are only 4bits in the vma flags to support 16 keys > > on x86. powerpc supports 32 keys, which needs 5bits. This patch > > introduces an addition bit in the vma flags. > > > > Signed-off-by: Ram Pai > > --- > > fs/proc/task_mmu.c |6 +- > > include/linux/mm.h | 18 +- > > 2 files changed, 18 insertions(+), 6 deletions(-) > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > > index f0c8b33..2ddc298 100644 > > --- a/fs/proc/task_mmu.c > > +++ b/fs/proc/task_mmu.c > > @@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, > > struct vm_area_struct *vma) > > [ilog2(VM_MERGEABLE)] = "mg", > > [ilog2(VM_UFFD_MISSING)]= "um", > > [ilog2(VM_UFFD_WP)] = "uw", > > -#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS > > +#ifdef CONFIG_ARCH_HAS_PKEYS > > /* These come out via ProtectionKey: */ > > [ilog2(VM_PKEY_BIT0)] = "", > > [ilog2(VM_PKEY_BIT1)] = "", > > [ilog2(VM_PKEY_BIT2)] = "", > > [ilog2(VM_PKEY_BIT3)] = "", > > +#endif /* CONFIG_ARCH_HAS_PKEYS */ > > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS > > + /* Additional bit in ProtectionKey: */ > > + [ilog2(VM_PKEY_BIT4)] = "", > > #endif > > I'd probably just leave the #ifdef out and eat the byte or whatever of > storage that this costs us on x86. fine with me. > > > }; > > size_t i; > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index 7cb17c6..3d35bcc 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -208,21 +208,29 @@ extern int overcommit_kbytes_handler(struct ctl_table > > *, int, void __user *, > > #define VM_HIGH_ARCH_BIT_1 33 /* bit only usable on 64-bit > > architectures */ > > #define VM_HIGH_ARCH_BIT_2 34 /* bit only usable on 64-bit > > architectures */ > > #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit > > architectures */ > > +#define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit arch */ > > Please just copy the above lines. Just copying over makes checkpatch.pl unhappy. It exceeds 80 columns. > > > #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) > > #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) > > #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) > > #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) > > +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) > > #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ > > > > -#if defined(CONFIG_X86) > > -# define VM_PATVM_ARCH_1 /* PAT reserves whole VMA at > > once (x86) */ > > -#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) > > +#ifdef CONFIG_ARCH_HAS_PKEYS > > # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 > > -# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit > > value */ > > +# define VM_PKEY_BIT0 VM_HIGH_ARCH_0 > > # define VM_PKEY_BIT1 VM_HIGH_ARCH_1 > > # define VM_PKEY_BIT2 VM_HIGH_ARCH_2 > > # define VM_PKEY_BIT3 VM_HIGH_ARCH_3 > > -#endif > > +#endif /* CONFIG_ARCH_HAS_PKEYS */ > > We have the space here, so can we just say that it's 4-bits on x86 and 5 > on ppc? sure. > > > +#if defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS) > > +# define VM_PKEY_BIT4 VM_HIGH_ARCH_4 /* additional key bit used on > > ppc64 */ > > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */ > > Why bother #ifdef'ing a #define? ok. RP
Re: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer
On Mon, 2017-06-26 at 20:08 +0200, Frederic Barrat wrote: > + if (desc & OPAL_PCI_P2P_ENABLE) { > + pe_init->p2p_initiator_count++; > + } else { > + if (pe_init->p2p_initiator_count > 0) { > + pe_init->p2p_initiator_count--; > + if (!pe_init->p2p_initiator_count) > + pnv_pci_ioda2_set_bypass(pe_init, true); > + } So you have the initiator refcounting in Linux and the target refcounting in OPAL ... any reason for that ? Cheers, Ben.
Re: [RFC v5 11/38] mm: introduce an additional vma bit for powerpc pkey
On Wed, 2017-07-12 at 15:23 -0700, Ram Pai wrote: > Just copying over makes checkpatch.pl unhappy. It exceeds 80 columns. Which is fine to ignore in a case like that where you remain consistent with the existing code. Ben.
Re: [PATCH] powerpc/time: use get_tb instead of get_vtb in running_clock
On Wed, 2017-07-12 at 23:01 +0800, Jia He wrote: > Virtual time base(vtb) is a register which increases only in guest. > Any exit from guest to host will stop the vtb(saved and restored by kvm). > But if there is an IO causes guest exits to host, the guest's watchdog > (watchdog_timer_fn -> is_softlockup -> get_timestamp -> running_clock) > needs to also include the time elapsed in host. get_vtb is not correct in > this case. > > Also, the TB_OFFSET is well saved and restored by qemu after commit [1]. > So we can use get_tb here. That completely defeats the purpose here... This was done specifically to exploit the VTB which doesn't count in hypervisor mode. > > [1] http://git.qemu.org/?p=qemu.git;a=commit;h=42043e4f1 > > Signed-off-by: Jia He > --- > arch/powerpc/kernel/time.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c > index fe6f3a2..c542dd3 100644 > --- a/arch/powerpc/kernel/time.c > +++ b/arch/powerpc/kernel/time.c > @@ -695,16 +695,15 @@ notrace unsigned long long sched_clock(void) > unsigned long long running_clock(void) > { > /* > - * Don't read the VTB as a host since KVM does not switch in host > - * timebase into the VTB when it takes a guest off the CPU, reading the > - * VTB would result in reading 'last switched out' guest VTB. > + * Use get_tb instead of get_vtb for guest since the TB_OFFSET has been > + * well saved/restored when qemu does suspend/resume. >* >* Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it >* would be unsafe to rely only on the #ifdef above. >*/ > if (firmware_has_feature(FW_FEATURE_LPAR) && > cpu_has_feature(CPU_FTR_ARCH_207S)) > - return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << > tb_to_ns_shift; > + return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << > tb_to_ns_shift; > > /* >* This is a next best approximation without a VTB.
Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Wed, 2017-07-12 at 09:23 +0200, Michal Hocko wrote: > > > > > Ideally the MMU looks at the PTE for keys, in order to enforce > > protection. This is the case with x86 and is the case with power9 Radix > > page table. Hence the keys have to be programmed into the PTE. > > But x86 doesn't update ptes for PKEYs, that would be just too expensive. > You could use standard mprotect to do the same... What do you mean ? x86 ends up in mprotect_fixup -> change_protection() which will update the PTEs just the same as we do. Changing the key for a page is a form mprotect. Changing the access permissions for keys is different, for us it's a special register (AMR). I don't understand why you think we are doing any differently than x86 here. > > However with HPT on power, these keys do not necessarily have to be > > programmed into the PTE. We could bypass the Linux Page Table Entry(PTE) > > and instead just program them into the Hash Page Table(HPTE), since > > the MMU does not refer the PTE but refers the HPTE. The last version > > of the page attempted to do that. It worked as follows: > > > > a) when a address range is requested to be associated with a key; by the > >application through key_mprotect() system call, the kernel > >stores that key in the vmas corresponding to that address > >range. > > > > b) Whenever there is a hash page fault for that address, the fault > >handler reads the key from the VMA and programs the key into the > >HPTE. __hash_page() is the function that does that. > > What causes the fault here? The hardware. With the hash MMU, the HW walks a hash table which is effectively a large in-memory TLB extension. When a page isn't found there, a "hash fault" is generated allowing Linux to populate that hash table with the content of the corresponding PTE. > > c) Once the hpte is programmed, the MMU can sense key violations and > >generate key-faults. > > > > The problem is with step (b). This step is really a very critical > > path which is performance sensitive. We dont want to add any delays. > > However if we want to access the key from the vma, we will have to > > hold the vma semaphore, and that is a big NO-NO. As a result, this > > design had to be dropped. > > > > > > > > I reverted back to the old design i.e the design in v4 version. In this > > version we do the following: > > > > a) when a address range is requested to be associated with a key; by the > >application through key_mprotect() system call, the kernel > >stores that key in the vmas corresponding to that address > >range. Also the kernel programs the key into Linux PTE coresponding to > > all the > >pages associated with the address range. > > OK, so how is this any different from the regular mprotect then? It takes the key argument. This is nothing new. This was done for x86 already, we are just re-using the infrastructure. Look at do_mprotect_pkey() in mm/mprotect.c today. It's all the same code, pkey_mprotect() is just mprotect with an added key argument. > > b) Whenever there is a hash page fault for that address, the fault > >handler reads the key from the Linux PTE and programs the key into > >the HPTE. > > > > c) Once the HPTE is programmed, the MMU can sense key violations and > >generate key-faults. > > > > > > Since step (b) in this case has easy access to the Linux PTE, and hence > > to the key, it is fast to access it and program the HPTE. Thus we avoid > > taking any performance hit on this critical path. 
> > > > Hope this explains the rationale, > > > > > > As promised here is the high level design: > > I will read through that later > [...]
Re: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer
On 12/07/2017 at 17:39, Benjamin Herrenschmidt wrote:
> On Mon, 2017-06-26 at 20:08 +0200, Frederic Barrat wrote:
>> +	if (desc & OPAL_PCI_P2P_ENABLE) {
>> +		pe_init->p2p_initiator_count++;
>> +	} else {
>> +		if (pe_init->p2p_initiator_count > 0) {
>> +			pe_init->p2p_initiator_count--;
>> +			if (!pe_init->p2p_initiator_count)
>> +				pnv_pci_ioda2_set_bypass(pe_init, true);
>> +		}
>
> So you have the initiator refcounting in Linux and the target
> refcounting in OPAL ... any reason for that ?
>
> Cheers,
> Ben.

The initiator refcount is per PE and skiboot doesn't track PEs. Also
when the initiator refcount falls back to 0, we should restore the
default bypass setting for the TVE, and that's more easily done from
Linux, since it knows the start and end address of the memory. So for
those reasons, I don't really have the choice, the initiator refcount
seems better suited in Linux.

The target refcount is per PHB and it just seemed easier to keep it in OPAL.

Does the lack of symmetry bother you?

  Fred
Re: [PATCH v2] powerpc/powernv: Enable PCI peer-to-peer
On Wed, 2017-07-12 at 18:01 -0500, Frederic Barrat wrote: > > So you have the initiator refcounting in Linux and the target > > refcounting in OPAL ... any reason for that ? > > The initiator refcount is per PE and skiboot doesn't track PEs. Also > when the initiator refcount falls back to 0, we should restore the > default bypass setting for the TVE, and that's more easily done from > linux, since it knows the start and end address of the memory. So for > those reasons, I don't really have the choice, the initiator refcount > seems better suited in linux. > > The target refcount is per PHB and it just seemed easier to keep it in opal. > > Does the lack of symmetry bother you? Only mildly. I would have put the whole refcount in Linux... Cheers, Ben.
Re: [PATCH v2] powerpc/powernv: Use darn instr for random_seed on p9
On Tue, Jul 11, 2017 at 7:34 PM, Daniel Axtens wrote: > Hi Matt, > >> Currently ppc_md.get_random_seed uses the powernv_get_random_long function. >> A guest calling this function would have to go through the hypervisor. The >> 'darn' instruction, introduced in POWER9, allows us to bypass this by >> directly obtaining a value from the mmio region. >> >> This patch adds a function for ppc_md.get_random_seed on p9, >> utilising the darn instruction. > > This patch looks pretty good - I'm not set up to test it but I have one > code-style nit: > >> diff --git a/arch/powerpc/platforms/powernv/rng.c >> b/arch/powerpc/platforms/powernv/rng.c >> index 5dcbdea..ab6f411 100644 >> --- a/arch/powerpc/platforms/powernv/rng.c >> +++ b/arch/powerpc/platforms/powernv/rng.c >> @@ -8,6 +8,7 @@ >> */ >> >> #define pr_fmt(fmt) "powernv-rng: " fmt >> +#define DARN_ERR 0xul >> >> #include >> #include >> @@ -16,6 +17,7 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -67,6 +69,21 @@ int powernv_get_random_real_mode(unsigned long *v) >> return 1; >> } >> >> +int powernv_get_random_darn(unsigned long *v) > > This is only referenced in this file so it should probably be labelled > as 'static'. > That's true, my thinking was to then use the get_random_darn where the get_random_long was being used. Looks like there is only one other caller of it at the moment, kvmppc_h_random (in arch/powerpc/kvm/book3s_hv_builtin.c). We may want to change that to get_random_darn. >> +{ >> + unsigned long val; >> + >> + /* Using DARN with L=1 - conditioned random number */ >> + asm (PPC_DARN(%0, 1)"\n" : "=r"(val) :); >> + >> + if (val == DARN_ERR) >> + return 0; >> + >> + *v = val; >> + >> + return 1; > > I was a bit confused to see 1 representing success - I think I have been > in userspace too long. But I checked against pseries_get_random_long and > it is in fact correct, so good for you! > > An excellent followup patch would be changing the type of this function > to be bool rather than int, but no pressure :) That's probably a good idea! Thanks, Matt > > Regards, > Daniel > >> +} >> + >> int powernv_get_random_long(unsigned long *v) >> { >> struct powernv_rng *rng; >> @@ -136,6 +153,7 @@ static __init int rng_create(struct device_node *dn) >> static __init int rng_init(void) >> { >> struct device_node *dn; >> + unsigned long drn_test; >> int rc; >> >> for_each_compatible_node(dn, NULL, "ibm,power-rng") { >> @@ -150,6 +168,10 @@ static __init int rng_init(void) >> of_platform_device_create(dn, NULL, NULL); >> } >> >> + if (cpu_has_feature(CPU_FTR_ARCH_300) && >> + powernv_get_random_darn(&drn_test)) >> + ppc_md.get_random_seed = powernv_get_random_darn; >> + >> return 0; >> } >> machine_subsys_initcall(powernv, rng_init); >> -- >> 2.9.3
Re: [PATCH v2] powerpc/powernv: Use darn instr for random_seed on p9
Matt Brown writes: > On Tue, Jul 11, 2017 at 7:34 PM, Daniel Axtens wrote: >>> @@ -67,6 +69,21 @@ int powernv_get_random_real_mode(unsigned long *v) >>> return 1; >>> } >>> >>> +int powernv_get_random_darn(unsigned long *v) >> >> This is only referenced in this file so it should probably be labelled >> as 'static'. >> > > That's true, my thinking was to then use the get_random_darn where the > get_random_long was being used. > Looks like there is only one other caller of it at the moment, > kvmppc_h_random (in arch/powerpc/kvm/book3s_hv_builtin.c). > We may want to change that to get_random_darn. It needs to use the right one depending on whether it's running on P8 or P9. It should really just use ppc_md.get_random_seed(), except that it can be called in real mode, which means it needs to be more careful. So as usual it's complicated :) cheers
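A hedged kernel-side sketch of the kind of dispatch described above follows; the wrapper name and the real_mode flag are assumptions, and only cpu_has_feature() and the powernv_* helpers quoted earlier in the thread come from the existing code and patch.

/* Pick the cheapest safe source: DARN is an instruction, so it needs no
 * MMIO mapping and can be used from real mode on P9; otherwise fall back
 * to the existing device-backed paths. */
static int powernv_get_random(unsigned long *v, bool real_mode)
{
	if (cpu_has_feature(CPU_FTR_ARCH_300))
		return powernv_get_random_darn(v);
	if (real_mode)
		return powernv_get_random_real_mode(v);
	return powernv_get_random_long(v);
}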
[PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
This patch adds emulation of the cmpb instruction, enabling xmon to emulate this instruction. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 24 1 file changed, 24 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 33117f8..f3e9ba8 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -596,6 +596,23 @@ static nokprobe_inline void do_cmp_unsigned(struct pt_regs *regs, unsigned long regs->ccr = (regs->ccr & ~(0xf << shift)) | (crval << shift); } +static nokprobe_inline void do_cmpb(struct pt_regs *regs, unsigned long v1, + unsigned long v2, int rd) +{ + unsigned long out_val, mask; + int i; + + out_val = 0; + for (i = 0; i < 8; i++) { + mask = 0xff << (i * 8); + + if ((v1 & mask) == (v2 & mask)) + out_val |= mask; + } + + regs->gpr[rd] = out_val; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1049,6 +1066,13 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, do_cmp_unsigned(regs, val, val2, rd >> 2); goto instr_done; + case 19173952: /* cmpb */ + val = regs->gpr[rd]; + val2 = regs->gpr[rb]; + + do_cmpb(regs, val, val2, ra); + goto instr_done; + /* * Arithmetic instructions */ -- 2.9.3
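A small user-space reference for the byte-compare semantics being emulated may help when testing the patch (it is not part of the patch; note the 0xffUL literal so the shift by 56 stays in unsigned long).

#include <stdio.h>

static unsigned long cmpb_ref(unsigned long a, unsigned long b)
{
	unsigned long out = 0;
	int i;

	for (i = 0; i < 8; i++) {
		unsigned long mask = 0xffUL << (i * 8);

		if ((a & mask) == (b & mask))
			out |= mask;	/* matching byte lane becomes all ones */
	}
	return out;
}

int main(void)
{
	/* bytes that match between the two operands become 0xff in the result */
	printf("%016lx\n", cmpb_ref(0x1122334455667788UL, 0xff223344ff66ff88UL));
	/* prints 00ffffff00ff00ff */
	return 0;
}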
[PATCH 2/5] powerpc/lib/sstep: Add popcnt instruction emulation
This adds emulations for the popcntb, popcntw, and popcntd instructions. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 39 +++ 1 file changed, 39 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index f3e9ba8..cf69987 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -613,6 +613,30 @@ static nokprobe_inline void do_cmpb(struct pt_regs *regs, unsigned long v1, regs->gpr[rd] = out_val; } +/* + * The size parameter is used to ajust the equivalent popcnt instruction. + * popcntb = 8, popcntw = 32, popcntd = 64 + */ +static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1, + int size, int ra) +{ + unsigned int out_val, mask, n; + int i, j; + + out_val = 0; + + for (i = 0; i < (64 / size); i++) { + n = 0; + for (j = 0; j < size; j++) { + mask = 1 << (j + (i * size)); + if (v1 & mask) + n++; + } + out_val |= n << (i * size); + } + regs->gpr[ra] = out_val; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1234,6 +1258,21 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, regs->gpr[ra] = (signed int) regs->gpr[rd]; goto logical_done; #endif + case 299528:/* popcntb */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 8, ra); + goto logical_done; + + case 17076744: /* popcntw */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 32, ra); + goto logical_done; +#ifdef __powerpc64__ + case 19173896: /* popcntd */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 64, ra); + goto logical_done; +#endif /* * Shift instructions -- 2.9.3
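(For reference, a worked example of the three variants per the ISA, independent of the helper above: popcntb counts set bits within each byte, popcntw within each 32-bit word, popcntd across the whole doubleword.)

static const unsigned long popcnt_in = 0xff0f000000000003UL;
/* popcntb: per-byte counts  -> 0x0804000000000002
 * popcntw: per-word counts  -> 0x0000000c00000002
 * popcntd: whole doubleword -> 0x000000000000000e (14) */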
[PATCH 3/5] powerpc/lib/sstep: Add bpermd instruction emulation
This adds emulation for the bpermd instruction. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index cf69987..603654d 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -637,6 +637,21 @@ static nokprobe_inline void do_popcnt(struct pt_regs *regs, unsigned long v1, regs->gpr[ra] = out_val; } +static nokprobe_inline void do_bpermd(struct pt_regs *regs, unsigned long v1, + unsigned long v2, int ra) +{ + unsigned int idx, i; + unsigned char perm; + + perm = 0x0; + for (i = 0; i < 8; i++) { + idx = (v1 >> (i * 8)) & 0xff; + if (idx < 64) + perm |= (v2 & (1 << idx)) >> (idx - i); + } + regs->gpr[ra] = 0 | perm; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1274,6 +1289,14 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, goto logical_done; #endif +#ifdef __powerpc64__ + case 2396736: /* bpermd */ + val = regs->gpr[rd]; + val2 = regs->gpr[rb]; + do_bpermd(regs, val, val2, ra); + goto logical_done; +#endif + /* * Shift instructions */ -- 2.9.3
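(For reference, bpermd's ISA semantics as a worked example, independent of the helper above: each of the eight bytes of RS is an index into RB, counted in IBM bit order where bit 0 is the most significant bit; an index of 64 or more selects 0; the eight selected bits are gathered, MSB first, into the low byte of RA.)

static const unsigned long bpermd_rs = 0x003f003f003f003fUL; /* indices 0,63,0,63,... */
static const unsigned long bpermd_rb = 0x8000000000000001UL; /* bits 0 and 63 set (IBM numbering) */
static const unsigned long bpermd_ra = 0x00000000000000ffUL; /* every selected bit is 1 */
/* with rb = 0x8000000000000000UL only the index-0 picks hit a set bit,
 * so the result would be 0x00000000000000aa */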
[PATCH 4/5] powerpc/lib/sstep: Add prty instruction emulation
This add emulation for the prtyw and prtyd instructions. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 58 +++- 1 file changed, 52 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 603654d..3228783 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -652,6 +652,42 @@ static nokprobe_inline void do_bpermd(struct pt_regs *regs, unsigned long v1, regs->gpr[ra] = 0 | perm; } +static nokprobe_inline void do_prtyw(struct pt_regs *regs, unsigned long v, + int ra) +{ + unsigned long low, high, out; + unsigned int i; + + high = 0; + low = 0; + out = 0; + + for (i = 0; i < 8; i++) { + if (v & (1 << (i * 8))) + (i < 4) ? (low++) : (high++); + } + + if (low % 2) + out |= low; + if (high % 2) + out |= (high << 32); + + regs->gpr[ra] = out; +} + +static nokprobe_inline void do_prtyd(struct pt_regs *regs, unsigned long v, + int ra) +{ + unsigned int count, i; + + count = 0; + for (i = 0; i < 8; i++) { + if (v & (1 << (i * 8))) + count++; + } + regs->gpr[ra] = count % 2; +} + static nokprobe_inline int trap_compare(long v1, long v2) { int ret = 0; @@ -1278,16 +1314,15 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, do_popcnt(regs, val, 8, ra); goto logical_done; - case 17076744: /* popcntw */ + case 2101768: /* prtyw */ val = regs->gpr[rd]; - do_popcnt(regs, val, 32, ra); + do_prtyw(regs, val, ra); goto logical_done; -#ifdef __powerpc64__ - case 19173896: /* popcntd */ + + case 2134536: /* prtyd */ val = regs->gpr[rd]; - do_popcnt(regs, val, 64, ra); + do_prtyd(regs, val, ra); goto logical_done; -#endif #ifdef __powerpc64__ case 2396736: /* bpermd */ @@ -1297,6 +1332,17 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, goto logical_done; #endif + case 17076744: /* popcntw */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 32, ra); + goto logical_done; +#ifdef __powerpc64__ + case 19173896: /* popcntd */ + val = regs->gpr[rd]; + do_popcnt(regs, val, 64, ra); + goto logical_done; +#endif + /* * Shift instructions */ -- 2.9.3
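(For reference, the ISA semantics as a worked example, independent of the helpers above: prtyw places the parity of the least significant bits of each word's four bytes into the low bit of that word of RA; prtyd does the same over all eight bytes into RA's low bit.)

static const unsigned long prty_in = 0x0101010001010101UL;
/* byte lsbs, high word: 1,1,1,0 (parity 1); low word: 1,1,1,1 (parity 0)
 * prtyw -> 0x0000000100000000
 * prtyd -> 0x0000000000000001 (seven set lsbs in total, which is odd) */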
[PATCH 5/5] powerpc/lib/sstep: Add isel instruction emulation
This add emulation for the isel instruction. Signed-off-by: Matt Brown --- arch/powerpc/lib/sstep.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c index 3228783..bb0e301 100644 --- a/arch/powerpc/lib/sstep.c +++ b/arch/powerpc/lib/sstep.c @@ -1297,6 +1297,16 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, regs->gpr[ra] = ~(regs->gpr[rd] & regs->gpr[rb]); goto logical_done; + case 585: /* isel */ + mb = (instr >> 6) & 0x1f; /* bc */ + val = (regs->ccr >> (mb + 32)) & 1; + + if (val) + regs->gpr[rd] = regs->gpr[ra]; + else + regs->gpr[rd] = regs->gpr[rb]; + goto logical_done; + case 922: /* extsh */ regs->gpr[ra] = (signed short) regs->gpr[rd]; goto logical_done; -- 2.9.3
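(For reference, the ISA definition of isel RT,RA,RB,BC written out as a small reference function; illustration only, not a drop-in for the hunk above. Note that BC counts CR bits from the most significant end, so BC=0 is CR0[LT], and that an RA field of 0 means the literal value 0 rather than GPR[0].)

static unsigned long isel_ref(unsigned long cr, int bc,
			      int ra, unsigned long ra_val,
			      unsigned long rb_val)
{
	/* CR bit "bc", counted from the MSB of the 32-bit CR */
	int cond = (cr >> (31 - bc)) & 1;

	/* (RA|0): an RA field of 0 selects the value 0, not GPR[0] */
	return cond ? (ra ? ra_val : 0) : rb_val;
}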
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Arnaldo Carvalho de Melo writes: > Em Wed, Jul 12, 2017 at 10:09:12PM +1000, Michael Ellerman escreveu: >> Sukadev Bhattiprolu writes: >> > Please pull the JSON files for POWER9 PMU events. > >> > perf vendor events: Add POWER9 PMU events >> > perf vendor events: Add POWER9 PVRs to mapfile > >> I think the PVRs need work. > >> You have: > >> +004e0100,1,power9.json,core >> +004e,1,power9.json,core > >> The first is P9 DD1, but the second doesn't exist. We have it in the >> kernel, but with a mask of 0x. From memory the perf code doesn't >> do any masking or anything fancy, it looks for an exact match. > >> So for starters you should probably drop that one and add 0x004e1200 and >> 0x004e0200. > > Could this check be somehow done automatically? Using whatever is in the > kernel sources or some IBM released document? Not really no. The kernel defines the major versions in arch/powerpc/include/asm/reg.h. eg: #define PVR_POWER8E 0x004B #define PVR_POWER8NVL 0x004C #define PVR_POWER8 0x004D #define PVR_POWER9 0x004E But that's only the top part of the value perf is looking for. To match the full value you also need the minor revision. I don't actually have an exhaustive list of all the Power8 revisions that were/will ever be made. And for Power9 I definitely can't give you a full list yet. We do want to be able to match against the full PVR, in case we need a different set of events for a particularly CPU revision. But we could make it easier to maintain by making the search iterative. ie. instead of just looking for an exact match, we could say that the mapfile is sorted, and the first entry that matches will be used. With a mapfile like: 004e0100,1,power9.json,core 004e,1,power9.json,core That would match Power9 DD1 on the first line, but all other Power9 CPUs would hit the second line. Obviously get_cpuid_str() would need to be reworked to be match_cpuid_str() or something. cheers
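(A rough sketch of the iterative, first-match lookup being suggested; illustrative code with made-up structure names, not the existing perf tooling.)

#include <string.h>

struct pvr_map {
	const char *cpuid;	/* e.g. "004e0100" (P9 DD1) or "004e" (any P9) */
	const char *json;	/* e.g. "power9.json" */
};

/* the first entry whose cpuid is a prefix of the runtime PVR string wins,
 * so a sorted mapfile lists specific revisions before the bare "004e" */
static const char *match_cpuid_str(const struct pvr_map *map, int n,
				   const char *pvr)
{
	int i;

	for (i = 0; i < n; i++)
		if (!strncmp(pvr, map[i].cpuid, strlen(map[i].cpuid)))
			return map[i].json;

	return NULL;
}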
Re: [GIT PULL] Please pull JSON files for Power9 PMU events
Sukadev Bhattiprolu writes: ...
>
>  tools/perf/pmu-events/arch/powerpc/mapfile.csv      |   3 +
>  .../perf/pmu-events/arch/powerpc/power9/cache.json  | 176 +
>  .../arch/powerpc/power9/floating-point.json         |  44 ++
>  .../pmu-events/arch/powerpc/power9/frontend.json    | 446 +++
>  .../pmu-events/arch/powerpc/power9/marked.json      | 782 +++
>  .../pmu-events/arch/powerpc/power9/memory.json      | 158

In the map file we have "power9.json", but the files are power9/x.json. How does that work? cheers
Re: [PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
On 13/07/17 13:25, Matt Brown wrote: > @@ -1049,6 +1066,13 @@ int analyse_instr(struct instruction_op *op, struct pt_regs *regs, > do_cmp_unsigned(regs, val, val2, rd >> 2); > goto instr_done; > + case 19173952: /* cmpb */ This looks wrong and should never trigger, given that the switch statement is comparing against ((instr >> 1) & 0x3ff). How did you get this value? -- Andrew Donnellan OzLabs, ADL Canberra andrew.donnel...@au1.ibm.com IBM Australia Limited
Re: [RFC v5 00/38] powerpc: Memory Protection Keys
On Thu 13-07-17 08:53:52, Benjamin Herrenschmidt wrote: > On Wed, 2017-07-12 at 09:23 +0200, Michal Hocko wrote: > > > > > > > > Ideally the MMU looks at the PTE for keys, in order to enforce > > > protection. This is the case with x86 and is the case with power9 Radix > > > page table. Hence the keys have to be programmed into the PTE. > > > > But x86 doesn't update ptes for PKEYs, that would be just too expensive. > > You could use standard mprotect to do the same... > > What do you mean ? x86 ends up in mprotect_fixup -> change_protection() > which will update the PTEs just the same as we do. > > Changing the key for a page is a form mprotect. Changing the access > permissions for keys is different, for us it's a special register > (AMR). > > I don't understand why you think we are doing any differently than x86 > here. That was a misunderstanding on my side as explained in other reply. > > > However with HPT on power, these keys do not necessarily have to be > > > programmed into the PTE. We could bypass the Linux Page Table Entry(PTE) > > > and instead just program them into the Hash Page Table(HPTE), since > > > the MMU does not refer the PTE but refers the HPTE. The last version > > > of the page attempted to do that. It worked as follows: > > > > > > a) when a address range is requested to be associated with a key; by the > > >application through key_mprotect() system call, the kernel > > >stores that key in the vmas corresponding to that address > > >range. > > > > > > b) Whenever there is a hash page fault for that address, the fault > > >handler reads the key from the VMA and programs the key into the > > >HPTE. __hash_page() is the function that does that. > > > > What causes the fault here? > > The hardware. With the hash MMU, the HW walks a hash table which is > effectively a large in-memory TLB extension. When a page isn't found > there, a "hash fault" is generated allowing Linux to populate that > hash table with the content of the corresponding PTE. Thanks for the clarification -- Michal Hocko SUSE Labs
Re: [PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
On Thu, Jul 13, 2017 at 01:51:30PM +1000, Andrew Donnellan wrote: > On 13/07/17 13:25, Matt Brown wrote: > >@@ -1049,6 +1066,13 @@ int analyse_instr(struct instruction_op *op, struct > >pt_regs *regs, > > do_cmp_unsigned(regs, val, val2, rd >> 2); > > goto instr_done; > > > >+case 19173952: /* cmpb */ > > This looks wrong and should never trigger, given that the switch > statement is comparing against ((instr >> 1) & 0x3ff). > > How did you get this value? The correct number is 508, and 19173952 = 37744*508. How to get 37744 is a mystery though :-) Segher
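(For completeness: cmpb is an X-form instruction with primary opcode 31 and extended opcode 508, and the inner switch already masks with ((instr >> 1) & 0x3ff), so the case simply becomes the following, the body unchanged from the patch.)

	case 508:	/* cmpb */
		val = regs->gpr[rd];
		val2 = regs->gpr[rb];

		do_cmpb(regs, val, val2, ra);
		goto instr_done;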
Re: [PATCH 1/5] powerpc/lib/sstep: Add cmpb instruction emulation
On Thu, Jul 13, 2017 at 01:25:44PM +1000, Matt Brown wrote: > +static nokprobe_inline void do_cmpb(struct pt_regs *regs, unsigned long v1, > + unsigned long v2, int rd) > +{ > + unsigned long out_val, mask; > + int i; > + > + out_val = 0; > + for (i = 0; i < 8; i++) { > + mask = 0xff << (i * 8); 0xffUL ? > + > + if ((v1 & mask) == (v2 & mask)) > + out_val |= mask; > + } > + > + regs->gpr[rd] = out_val; > +} Segher
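(The reason the suffix matters: 0xff is a 32-bit int, so once i reaches 4 the shift amount is 32 or more, which is undefined for a 32-bit type and in any case cannot produce a mask covering the upper bytes of the doubleword. Widening the constant fixes the loop.)

		mask = 0xffUL << (i * 8);	/* unsigned long mask reaches all 8 bytes */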
Re: [PATCH] powerpc/time: use get_tb instead of get_vtb in running_clock
Hi Ben I add some printk logs in watchdog_timer_fn in the guest [ 16.025222] get_vtb=8236291881, get_tb=13756711357, get_timestamp=4 [ 20.025624] get_vtb=9745285807, get_tb=15804711283, get_timestamp=7 [ 24.025042] get_vtb=11518119641, get_tb=17852711085, get_timestamp=10 [ 28.024074] get_vtb=13192704319, get_tb=19900711071, get_timestamp=13 [ 32.024086] get_vtb=14856516982, get_tb=21948711066, get_timestamp=16 [ 36.024075] get_vtb=16569127618, get_tb=23996711078, get_timestamp=20 [ 40.024138] get_vtb=17008865823, get_tb=26044718418, get_timestamp=20 [ 44.023993] get_vtb=17020637241, get_tb=28092716383, get_timestamp=20 [ 48.023996] get_vtb=17022857170, get_tb=30140718472, get_timestamp=20 [ 52.023996] get_vtb=17024268541, get_tb=32188718432, get_timestamp=20 [ 56.023996] get_vtb=17036577783, get_tb=34236718077, get_timestamp=20 [ 60.023996] get_vtb=17037829743, get_tb=36284718437, get_timestamp=20 [ 64.023992] get_vtb=17039846747, get_tb=38332716609, get_timestamp=20 [ 68.023991] get_vtb=17041448345, get_tb=40380715903, get_timestamp=20 The get_timestamp(use get_vtb(),unit is second) is slower down compared with printk time. You also can obviously watch the get_vtb increment is slowly less than get_tb increment. Without this patch, I thought there might be some softlockup warnings missed in guest. -Jia On 13/07/2017 6:45 AM, Benjamin Herrenschmidt wrote: On Wed, 2017-07-12 at 23:01 +0800, Jia He wrote: Virtual time base(vtb) is a register which increases only in guest. Any exit from guest to host will stop the vtb(saved and restored by kvm). But if there is an IO causes guest exits to host, the guest's watchdog (watchdog_timer_fn -> is_softlockup -> get_timestamp -> running_clock) needs to also include the time elapsed in host. get_vtb is not correct in this case. Also, the TB_OFFSET is well saved and restored by qemu after commit [1]. So we can use get_tb here. That completely defeats the purpose here... This was done specifically to exploit the VTB which doesn't count in hypervisor mode. [1] http://git.qemu.org/?p=qemu.git;a=commit;h=42043e4f1 Signed-off-by: Jia He --- arch/powerpc/kernel/time.c | 7 +++ 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c index fe6f3a2..c542dd3 100644 --- a/arch/powerpc/kernel/time.c +++ b/arch/powerpc/kernel/time.c @@ -695,16 +695,15 @@ notrace unsigned long long sched_clock(void) unsigned long long running_clock(void) { /* -* Don't read the VTB as a host since KVM does not switch in host -* timebase into the VTB when it takes a guest off the CPU, reading the -* VTB would result in reading 'last switched out' guest VTB. +* Use get_tb instead of get_vtb for guest since the TB_OFFSET has been +* well saved/restored when qemu does suspend/resume. * * Host kernels are often compiled with CONFIG_PPC_PSERIES checked, it * would be unsafe to rely only on the #ifdef above. */ if (firmware_has_feature(FW_FEATURE_LPAR) && cpu_has_feature(CPU_FTR_ARCH_207S)) - return mulhdu(get_vtb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; + return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift; /* * This is a next best approximation without a VTB.
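(A back-of-the-envelope reading of the log above, assuming the usual 512 MHz timebase, which shows why get_timestamp() sticks at 20 once the guest starts exiting heavily.)

/* between the 36 s and 40 s prints: */
unsigned long tb_delta  = 26044718418UL - 23996711078UL; /* ~2.048e9 ticks, ~4.0 s  */
unsigned long vtb_delta = 17008865823UL - 16569127618UL; /* ~0.440e9 ticks, ~0.86 s */
/* the timebase tracks wall-clock time, but the VTB only advances while the
 * vCPU actually runs inside the guest, so a VTB-based running_clock() (and
 * hence get_timestamp()) barely moves during this interval */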