Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-14 Thread Madhavan Srinivasan



On Monday 13 March 2017 06:20 PM, Peter Zijlstra wrote:

On Mon, Mar 13, 2017 at 04:45:51PM +0530, Madhavan Srinivasan wrote:

  - should you not have fixed this in the tool only? This patch
effectively breaks ABI on big-endian architectures.

IIUC, we are the first big-endian user of this feature
(kindly correct me if I am wrong), so technically we are
not breaking the ABI here :). But let me also look at
the dynamic conversion part.

Huh? PPC hasn't yet implemented this? Then why are you fixing it?


Yes, PPC hasn't implemented this (until now).
And I did not understand "Then why are you fixing it?"

Maddy
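
For context, a minimal sketch of the layout question above (field names abbreviated from include/uapi/linux/perf_event.h as I recall them; the exact final form of the fix may differ): C bitfields are packed from opposite ends of the word by little-endian and big-endian compilers, so a single declaration cannot give tools the same bit positions on both, which is why a byte-order-dependent variant is being proposed.

    #include <linux/types.h>
    #include <asm/byteorder.h>

    union mem_data_src_sketch {
            __u64 val;
    #if defined(__LITTLE_ENDIAN_BITFIELD)
            struct {
                    __u64   mem_op:5,       /* type of opcode */
                            mem_lvl:14,     /* memory hierarchy level */
                            mem_snoop:5,    /* snoop mode */
                            mem_lock:2,     /* lock instruction */
                            mem_dtlb:7,     /* TLB access */
                            mem_rsvd:31;
            };
    #elif defined(__BIG_ENDIAN_BITFIELD)
            struct {                        /* same fields, reversed order */
                    __u64   mem_rsvd:31,
                            mem_dtlb:7,
                            mem_lock:2,
                            mem_snoop:5,
                            mem_lvl:14,
                            mem_op:5;
            };
    #endif
    };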






[PATCH 0/8] idle fixes and changes for POWER8 and POWER9

2017-03-14 Thread Nicholas Piggin
Hi,

This is a resend of the previous idle series, updated to account for
Gautham's feedback, and also re-introducing the feature that avoids a
full state restore by counting winkles rather than using the special
HSPRG0 bit.

The two big things we get from this are no longer messing with HSPRG0
in the idle wakeup path on POWER8, and not crashing on a POWER9 machine
check from power-saving mode (tested in mambo according to the ISA specs,
but not yet triggered on real hardware).

I only added the reviewed-by for patches which were not significantly
changed since last time.

Thanks,
Nick

Nicholas Piggin (8):
  powerpc/64s: move remaining system reset idle code into idle_book3s.S
  powerpc/64s: stop using bit in HSPRG0 to test winkle
  powerpc/64s: use alternative feature patching
  powerpc/64s: fix POWER9 machine check handler from stop state
  powerpc/64s: use PACA_THREAD_IDLE_STATE only in POWER8
  powerpc/64s: idle expand usable core idle state bits
  powerpc/64s: idle do not hold reservation longer than required
  powerpc/64s: idle POWER8 avoid full state loss recovery when possible

 arch/powerpc/include/asm/cpuidle.h   |  32 +-
 arch/powerpc/include/asm/exception-64s.h |  13 +--
 arch/powerpc/include/asm/paca.h  |  12 +-
 arch/powerpc/include/asm/reg.h   |   1 +
 arch/powerpc/kernel/exceptions-64s.S | 112 ++
 arch/powerpc/kernel/idle_book3s.S| 189 ---
 arch/powerpc/platforms/powernv/idle.c|  13 ---
 7 files changed, 202 insertions(+), 170 deletions(-)

-- 
2.11.0



[PATCH 1/8] powerpc/64s: move remaining system reset idle code into idle_book3s.S

2017-03-14 Thread Nicholas Piggin
Should be no functional change.

Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/exceptions-64s.S | 26 +---
 arch/powerpc/kernel/idle_book3s.S| 76 
 2 files changed, 51 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 857bf7c5b946..2f837a4a78a2 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -126,31 +126,7 @@ EXC_VIRT_NONE(0x4100, 0x100)
 
 #ifdef CONFIG_PPC_P7_NAP
 EXC_COMMON_BEGIN(system_reset_idle_common)
-BEGIN_FTR_SECTION
-   GET_PACA(r13) /* Restore HSPRG0 to get the winkle bit in r13 */
-END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
-   bl  pnv_restore_hyp_resource
-
-   li  r0,PNV_THREAD_RUNNING
-   stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
-
-#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
-   li  r0,KVM_HWTHREAD_IN_KERNEL
-   stb r0,HSTATE_HWTHREAD_STATE(r13)
-   /* Order setting hwthread_state vs. testing hwthread_req */
-   sync
-   lbz r0,HSTATE_HWTHREAD_REQ(r13)
-   cmpwi   r0,0
-   beq 1f
-   BRANCH_TO_KVM(r10, kvm_start_guest)
-1:
-#endif
-
-   /* Return SRR1 from power7_nap() */
-   mfspr   r3,SPRN_SRR1
-   blt cr3,2f
-   b   pnv_wakeup_loss
-2: b   pnv_wakeup_noloss
+   b   pnv_powersave_wakeup
 #endif
 
 EXC_COMMON(system_reset_common, 0x100, system_reset_exception)
diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 995728736677..4313c107da5d 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -113,7 +114,7 @@ core_idle_lock_held:
  *
  * Address to 'rfid' to in r5
  */
-_GLOBAL(pnv_powersave_common)
+pnv_powersave_common:
/* Use r3 to pass state nap/sleep/winkle */
/* NAP is a state loss, we create a regs frame on the
 * stack, fill it up with the state we care about and
@@ -188,8 +189,8 @@ pnv_enter_arch207_idle_mode:
/* The following store to HSTATE_HWTHREAD_STATE(r13)  */
/* MUST occur in real mode, i.e. with the MMU off,*/
/* and the MMU must stay off until we clear this flag */
-   /* and test HSTATE_HWTHREAD_REQ(r13) in the system*/
-   /* reset interrupt vector in exceptions-64s.S.*/
+   /* and test HSTATE_HWTHREAD_REQ(r13) in   */
+   /* pnv_powersave_wakeup in this file. */
/* The reason is that another thread can switch the   */
/* MMU to a guest context whenever this flag is set   */
/* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on,*/
@@ -375,15 +376,42 @@ _GLOBAL(power9_idle_stop)
li  r4,1
b   pnv_powersave_common
/* No return */
+
+.global pnv_powersave_wakeup
+pnv_powersave_wakeup:
+BEGIN_FTR_SECTION
+   GET_PACA(r13) /* Restore HSPRG0 to get the winkle bit in r13 */
+END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
+   bl  pnv_restore_hyp_resource
+
+   li  r0,PNV_THREAD_RUNNING
+   stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
+
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   li  r0,KVM_HWTHREAD_IN_KERNEL
+   stb r0,HSTATE_HWTHREAD_STATE(r13)
+   /* Order setting hwthread_state vs. testing hwthread_req */
+   sync
+   lbz r0,HSTATE_HWTHREAD_REQ(r13)
+   cmpwi   r0,0
+   beq 1f
+   BRANCH_TO_KVM(r10, kvm_start_guest)
+1:
+#endif
+
+   /* Return SRR1 from power7_nap() */
+   mfspr   r3,SPRN_SRR1
+   blt cr3,pnv_wakeup_noloss
+   b   pnv_wakeup_loss
+
 /*
- * Called from reset vector. Check whether we have woken up with
- * hypervisor state loss. If yes, restore hypervisor state and return
- * back to reset vector.
+ * Check whether we have woken up with hypervisor state loss.
+ * If yes, restore hypervisor state and return back to link.
  *
  * r13 - Contents of HSPRG0
  * cr3 - set to gt if waking up with partial/complete hypervisor state loss
  */
-_GLOBAL(pnv_restore_hyp_resource)
+pnv_restore_hyp_resource:
 BEGIN_FTR_SECTION
ld  r2,PACATOC(r13);
/*
@@ -400,12 +428,9 @@ BEGIN_FTR_SECTION
 */
rldicl  r5,r5,4,60
cmpdcr4,r5,r4
-   bge cr4,pnv_wakeup_tb_loss
-   /*
-* Waking up without hypervisor state loss. Return to
-* reset vector
-*/
-   blr
+   bge cr4,pnv_wakeup_tb_loss /* returns to caller */
+
+   blr /* Waking up without hypervisor state loss. */
 
 END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 
@@ -433,8 +458,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
 */
bgt cr3,.
 
-   blr /* Return back to System Reset vector from where
-  pnv_restore_hyp_resource was invoked */
+

[PATCH 2/8] powerpc/64s: stop using bit in HSPRG0 to test winkle

2017-03-14 Thread Nicholas Piggin
The POWER8 idle code has a neat trick of programming the power-on engine
to restore a low bit into HSPRG0, so the idle wakeup code can test whether
it has been programmed this way and therefore lost all state, avoiding
the expensive full restore if not.

However this messes with our r13 PACA pointer, and requires HSPRG0 to
be written to throughout the exception handlers and idle wakeup, rather
than just once on kernel entry.

Remove this complexity and assume winkle sleeps always require a state
restore. This speedup is later re-introduced by counting per-core winkles
and setting a bitmap of threads with state loss when all are in winkle.
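
As a reminder of what is going away, here is a standalone C illustration of the low-bit trick (my own sketch, not kernel code; the real test is the clrrdi on r13 in the exceptions-64s.S hunks below):

    #include <stdbool.h>
    #include <stdint.h>

    /* HSPRG0 normally holds the (aligned) PACA pointer; the power-on
     * engine was programmed to restore it with the low bit set after a
     * winkle, so that bit doubled as a "full state lost" flag. */
    static inline bool woke_from_winkle(uintptr_t hsprg0)
    {
            return hsprg0 & 1;
    }

    static inline void *paca_from_hsprg0(uintptr_t hsprg0)
    {
            return (void *)(hsprg0 & ~(uintptr_t)1);    /* clrrdi r13,r13,1 */
    }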

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/exception-64s.h | 13 ++---
 arch/powerpc/kernel/exceptions-64s.S | 21 +++--
 arch/powerpc/kernel/idle_book3s.S| 23 ---
 arch/powerpc/platforms/powernv/idle.c| 13 -
 4 files changed, 13 insertions(+), 57 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 14752eee3d0c..f5fe7c901b37 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -167,17 +167,14 @@ BEGIN_FTR_SECTION_NESTED(943) 
\
std ra,offset(r13); \
 END_FTR_SECTION_NESTED(ftr,ftr,943)
 
-#define EXCEPTION_PROLOG_0_PACA(area)  \
+#define EXCEPTION_PROLOG_0(area)   \
+   GET_PACA(r13);  \
std r9,area+EX_R9(r13); /* save r9 */   \
OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR); \
HMT_MEDIUM; \
std r10,area+EX_R10(r13);   /* save r10 - r12 */\
OPT_GET_SPR(r10, SPRN_CFAR, CPU_FTR_CFAR)
 
-#define EXCEPTION_PROLOG_0(area)   \
-   GET_PACA(r13);  \
-   EXCEPTION_PROLOG_0_PACA(area)
-
 #define __EXCEPTION_PROLOG_1(area, extra, vec) \
OPT_SAVE_REG_TO_PACA(area+EX_PPR, r9, CPU_FTR_HAS_PPR); \
OPT_SAVE_REG_TO_PACA(area+EX_CFAR, r10, CPU_FTR_CFAR);  \
@@ -208,12 +205,6 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
EXCEPTION_PROLOG_1(area, extra, vec);   \
EXCEPTION_PROLOG_PSERIES_1(label, h);
 
-/* Have the PACA in r13 already */
-#define EXCEPTION_PROLOG_PSERIES_PACA(area, label, h, extra, vec)  \
-   EXCEPTION_PROLOG_0_PACA(area);  \
-   EXCEPTION_PROLOG_1(area, extra, vec);   \
-   EXCEPTION_PROLOG_PSERIES_1(label, h);
-
 #define __KVMTEST(h, n)
\
lbz r10,HSTATE_IN_GUEST(r13);   \
cmpwi   r10,0;  \
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 2f837a4a78a2..e390fcd04bcb 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -116,9 +116,7 @@ EXC_VIRT_NONE(0x4000, 0x100)
 
 EXC_REAL_BEGIN(system_reset, 0x100, 0x100)
SET_SCRATCH0(r13)
-   GET_PACA(r13)
-   clrrdi  r13,r13,1 /* Last bit of HSPRG0 is set if waking from winkle */
-   EXCEPTION_PROLOG_PSERIES_PACA(PACA_EXGEN, system_reset_common, EXC_STD,
+   EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, EXC_STD,
 IDLETEST, 0x100)
 
 EXC_REAL_END(system_reset, 0x100, 0x100)
@@ -148,14 +146,6 @@ EXC_REAL_BEGIN(machine_check, 0x200, 0x100)
 * vector
 */
SET_SCRATCH0(r13)   /* save r13 */
-   /*
-* Running native on arch 2.06 or later, we may wakeup from winkle
-* inside machine check. If yes, then last bit of HSPRG0 would be set
-* to 1. Hence clear it unconditionally.
-*/
-   GET_PACA(r13)
-   clrrdi  r13,r13,1
-   SET_PACA(r13)
EXCEPTION_PROLOG_0(PACA_EXMC)
 BEGIN_FTR_SECTION
b   machine_check_powernv_early
@@ -339,7 +329,7 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
 * Go back to nap/sleep/winkle mode again if (b) is true.
 */
rlwinm. r11,r12,47-31,30,31 /* Was it in power saving mode? */
-   beq 4f  /* No, it wasn;t */
+   beq 4f  /* No, it wasn't */
/* Thread was in power saving mode. Go back to nap again. */
cmpwi   r11,2
blt 3f
@@ -369,13 +359,8 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
/*
 * Go back to winkle. Please note that this thread was woken up in
 * machine chec

[PATCH 3/8] powerpc/64s: use alternative feature patching

2017-03-14 Thread Nicholas Piggin
This reduces the number of nops for POWER8.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/idle_book3s.S | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 405631b2c229..9284ea0762b1 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -383,7 +383,11 @@ _GLOBAL(power9_idle_stop)
  */
 .global pnv_powersave_wakeup
 pnv_powersave_wakeup:
-   bl  pnv_restore_hyp_resource
+BEGIN_FTR_SECTION
+   bl  pnv_restore_hyp_resource_arch300
+FTR_SECTION_ELSE
+   bl  pnv_restore_hyp_resource_arch207
+ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 
li  r0,PNV_THREAD_RUNNING
stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
@@ -411,14 +415,13 @@ pnv_powersave_wakeup:
  *
  * cr3 - set to gt if waking up with partial/complete hypervisor state loss
  */
-pnv_restore_hyp_resource:
-   ld  r2,PACATOC(r13);
-
-BEGIN_FTR_SECTION
+pnv_restore_hyp_resource_arch300:
/*
 * POWER ISA 3. Use PSSCR to determine if we
 * are waking up from deep idle state
 */
+   ld  r2,PACATOC(r13);
+
LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state)
ld  r4,ADDROFF(pnv_first_deep_stop_state)(r5)
 
@@ -433,12 +436,14 @@ BEGIN_FTR_SECTION
 
blr /* Waking up without hypervisor state loss. */
 
-END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
-
+/* Same calling convention as arch300 */
+pnv_restore_hyp_resource_arch207:
/*
 * POWER ISA 2.07 or less.
 * Check if we slept with winkle.
 */
+   ld  r2,PACATOC(r13);
+
lbz r0,PACA_THREAD_IDLE_STATE(r13)
cmpwi   cr2,r0,PNV_THREAD_NAP
cmpwi   cr4,r0,PNV_THREAD_WINKLE
-- 
2.11.0



[PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state

2017-03-14 Thread Nicholas Piggin
The ISA specifies power save wakeup can cause a machine check interrupt.
The machine check handler currently has code to handle that for POWER8,
but POWER9 crashes when trying to execute the P8 style sleep
instructions.

So queue up the machine check, then call into the idle code to wake up
as the system reset interrupt does, rather than attempting to sleep
again without going through the main idle path.

Reviewed-by: Gautham R. Shenoy 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/reg.h   |  1 +
 arch/powerpc/kernel/exceptions-64s.S | 69 ++--
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index fc879fd6bdae..8bbdfacce970 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -656,6 +656,7 @@
 #define   SRR1_ISI_PROT0x0800 /* ISI: Other protection 
fault */
 #define   SRR1_WAKEMASK0x0038 /* reason for wakeup */
 #define   SRR1_WAKEMASK_P8 0x003c /* reason for wakeup on POWER8 and 9 
*/
+#define   SRR1_WAKEMCE_RESVD   0x003c /* Unused/reserved value used by MCE 
wakeup to indicate cause to idle wakeup handler */
 #define   SRR1_WAKESYSERR  0x0030 /* System error */
 #define   SRR1_WAKEEE  0x0020 /* External interrupt */
 #define   SRR1_WAKEHVI 0x0024 /* Hypervisor Virtualization 
Interrupt (P9) */
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index e390fcd04bcb..5779d2d6a192 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common)
/* restore original r1. */  \
ld  r1,GPR1(r1)
 
+#ifdef CONFIG_PPC_P7_NAP
+EXC_COMMON_BEGIN(machine_check_idle_common)
+   bl  machine_check_queue_event
+   /*
+* Queue the machine check, then reload SRR1 and use it to set
+* CR3 according to pnv_powersave_wakeup convention.
+*/
+   ld  r12,_MSR(r1)
+   rlwinm  r11,r12,47-31,30,31
+   cmpwi   cr3,r11,2
+
+   /*
+* Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the
+* system reset wakeup code.
+*/
+   orisr12,r12,SRR1_WAKEMCE_RESVD@h
+   mtspr   SPRN_SRR1,r12
+   std r12,_MSR(r1)
+
+   /*
+* Decrement MCE nesting after finishing with the stack.
+*/
+   lhz r11,PACA_IN_MCE(r13)
+   subir11,r11,1
+   sth r11,PACA_IN_MCE(r13)
+   b   pnv_powersave_wakeup
+#endif
/*
 * Handle machine check early in real mode. We come here with
 * ME=1, MMU (IR=0 and DR=0) off and using MC emergency stack.
@@ -318,6 +345,7 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
bl  machine_check_early
std r3,RESULT(r1)   /* Save result */
ld  r12,_MSR(r1)
+
 #ifdef CONFIG_PPC_P7_NAP
/*
 * Check if thread was in power saving mode. We come here when any
@@ -328,43 +356,14 @@ EXC_COMMON_BEGIN(machine_check_handle_early)
 *
 * Go back to nap/sleep/winkle mode again if (b) is true.
 */
-   rlwinm. r11,r12,47-31,30,31 /* Was it in power saving mode? */
-   beq 4f  /* No, it wasn't */
-   /* Thread was in power saving mode. Go back to nap again. */
-   cmpwi   r11,2
-   blt 3f
-   /* Supervisor/Hypervisor state loss */
-   li  r0,1
-   stb r0,PACA_NAPSTATELOST(r13)
-3: bl  machine_check_queue_event
-   MACHINE_CHECK_HANDLER_WINDUP
-   GET_PACA(r13)
-   ld  r1,PACAR1(r13)
-   /*
-* Check what idle state this CPU was in and go back to same mode
-* again.
-*/
-   lbz r3,PACA_THREAD_IDLE_STATE(r13)
-   cmpwi   r3,PNV_THREAD_NAP
-   bgt 10f
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
-   /* No return */
-10:
-   cmpwi   r3,PNV_THREAD_SLEEP
-   bgt 2f
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
-   /* No return */
-
-2:
-   /*
-* Go back to winkle. Please note that this thread was woken up in
-* machine check from winkle and have not restored the per-subcore
-* state.
-*/
-   IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE)
-   /* No return */
+   BEGIN_FTR_SECTION
+   rlwinm. r11,r12,47-31,30,31
+   beq-4f
+   BRANCH_TO_COMMON(r10, machine_check_idle_common)
 4:
+   END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #endif
+
/*
 * Check if we are coming from hypervisor userspace. If yes then we
 * continue in host kernel in V mode to deliver the MC event.
-- 
2.11.0



[PATCH 5/8] powerpc/64s: use PACA_THREAD_IDLE_STATE only in POWER8

2017-03-14 Thread Nicholas Piggin
POWER9 does not use this field, so it should be moved into the POWER8
code. Update the documentation in the paca struct too.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/paca.h   | 12 ++--
 arch/powerpc/kernel/idle_book3s.S | 13 +++--
 2 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 708c3e592eeb..bbb59e226a9f 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -165,11 +165,19 @@ struct paca_struct {
 #endif
 
 #ifdef CONFIG_PPC_POWERNV
-   /* Per-core mask tracking idle threads and a lock bit-[L][] */
+   /* CPU idle fields */
+
+   /*
+* Per-core word used to synchronize between threads. See
+* asm/cpuidle.h, PNV_CORE_IDLE_*
+*/
u32 *core_idle_state_ptr;
-   u8 thread_idle_state;   /* PNV_THREAD_RUNNING/NAP/SLEEP */
/* Mask to indicate thread id in core */
u8 thread_mask;
+
+   /* POWER8 specific fields */
+   /* PNV_THREAD_RUNNING/NAP/SLEEP */
+   u8 thread_idle_state;
/* Mask to denote subcore sibling threads */
u8 subcore_sibling_mask;
 #endif
diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 9284ea0762b1..9bdfba75a5e7 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -389,9 +389,6 @@ FTR_SECTION_ELSE
bl  pnv_restore_hyp_resource_arch207
 ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 
-   li  r0,PNV_THREAD_RUNNING
-   stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
-
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
li  r0,KVM_HWTHREAD_IN_KERNEL
stb r0,HSTATE_HWTHREAD_STATE(r13)
@@ -445,9 +442,13 @@ pnv_restore_hyp_resource_arch207:
ld  r2,PACATOC(r13);
 
lbz r0,PACA_THREAD_IDLE_STATE(r13)
-   cmpwi   cr2,r0,PNV_THREAD_NAP
-   cmpwi   cr4,r0,PNV_THREAD_WINKLE
-   bgt cr2,pnv_wakeup_tb_loss  /* Either sleep or Winkle */
+   cmpwi   cr2,r0,PNV_THREAD_NAP
+   cmpwi   cr4,r0,PNV_THREAD_WINKLE
+   li  r0,PNV_THREAD_RUNNING
+   stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
+
+   bgt cr2,pnv_wakeup_tb_loss  /* Either sleep or Winkle */
+
 
/*
 * We fall through here if PACA_THREAD_IDLE_STATE shows we are waking
-- 
2.11.0



[PATCH 6/8] powerpc/64s: idle expand usable core idle state bits

2017-03-14 Thread Nicholas Piggin
In preparation for adding more bits to the core idle state word,
move the lock bit up, and unlock by flipping the lock bit rather
than masking off all but the thread bits.

Add branch hints for atomic operations while we're here.
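
A small C-level sketch of the unlock change (my illustration, not the kernel code; the bit value is assumed, since the archive has truncated the hex literals in the cpuidle.h hunk below):

    #include <stdint.h>

    /* Assumed value: the lock bit now lives in the upper halfword. */
    #define PNV_CORE_IDLE_LOCK_BIT  (1u << 28)

    /* Old scheme: unlock by masking the word down to the thread bits,
     * which would also wipe any extra state added later.
     * New scheme: flip only the lock bit and leave other fields intact. */
    static inline uint32_t core_idle_unlock(uint32_t state)
    {
            return state ^ PNV_CORE_IDLE_LOCK_BIT;      /* xoris ...,@h in the asm */
    }

Because the bit now sits above bit 15, the 16-bit immediate instruction forms change too, which is why the asm below switches from andi./ori to andis./oris/xoris with the @h (high halfword) constants.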

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cpuidle.h |  4 ++--
 arch/powerpc/kernel/idle_book3s.S  | 33 +
 2 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h 
b/arch/powerpc/include/asm/cpuidle.h
index 155731557c9b..b9d9f960dffd 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -7,8 +7,8 @@
 #define PNV_THREAD_NAP  1
 #define PNV_THREAD_SLEEP2
 #define PNV_THREAD_WINKLE   3
-#define PNV_CORE_IDLE_LOCK_BIT  0x100
-#define PNV_CORE_IDLE_THREAD_BITS   0x0FF
+#define PNV_CORE_IDLE_LOCK_BIT  0x1000
+#define PNV_CORE_IDLE_THREAD_BITS   0x00FF
 
 /*
  *  NOTE =
diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 9bdfba75a5e7..1c91dc35c559 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -95,12 +95,12 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300)
 core_idle_lock_held:
HMT_LOW
 3: lwz r15,0(r14)
-   andi.   r15,r15,PNV_CORE_IDLE_LOCK_BIT
+   andis.  r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
bne 3b
HMT_MEDIUM
lwarx   r15,0,r14
-   andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
-   bne core_idle_lock_held
+   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
+   bne-core_idle_lock_held
blr
 
 /*
@@ -213,8 +213,8 @@ pnv_enter_arch207_idle_mode:
 lwarx_loop1:
lwarx   r15,0,r14
 
-   andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
-   bnelcore_idle_lock_held
+   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
+   bnel-   core_idle_lock_held
 
andcr15,r15,r7  /* Clear thread bit */
 
@@ -241,7 +241,7 @@ common_enter: /* common code for all the threads entering 
sleep or winkle */
IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP)
 
 fastsleep_workaround_at_entry:
-   ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
+   orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
stwcx.  r15,0,r14
bne-lwarx_loop1
isync
@@ -251,10 +251,10 @@ fastsleep_workaround_at_entry:
li  r4,1
bl  opal_config_cpu_idle_state
 
-   /* Clear Lock bit */
-   li  r0,0
+   /* Unlock */
+   xoris   r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
lwsync
-   stw r0,0(r14)
+   stw r15,0(r14)
b   common_enter
 
 enter_winkle:
@@ -302,8 +302,8 @@ power_enter_stop:
 
 lwarx_loop_stop:
lwarx   r15,0,r14
-   andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
-   bnelcore_idle_lock_held
+   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
+   bnel-   core_idle_lock_held
andcr15,r15,r7  /* Clear thread bit */
 
stwcx.  r15,0,r14
@@ -492,7 +492,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
 lwarx_loop2:
lwarx   r15,0,r14
-   andi.   r9,r15,PNV_CORE_IDLE_LOCK_BIT
+   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
/*
 * Lock bit is set in one of the 2 cases-
 * a. In the sleep/winkle enter path, the last thread is executing
@@ -501,9 +501,10 @@ lwarx_loop2:
 * workaround undo code or resyncing timebase or restoring context
 * In either case loop until the lock bit is cleared.
 */
-   bnelcore_idle_lock_held
+   bnel-   core_idle_lock_held
 
-   cmpwi   cr2,r15,0
+   andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
+   cmpwi   cr2,r9,0
 
/*
 * At this stage
@@ -512,7 +513,7 @@ lwarx_loop2:
 * cr4 - gt or eq if waking up from complete hypervisor state loss.
 */
 
-   ori r15,r15,PNV_CORE_IDLE_LOCK_BIT
+   orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
stwcx.  r15,0,r14
bne-lwarx_loop2
isync
@@ -602,7 +603,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
mtspr   SPRN_WORC,r4
 
 clear_lock:
-   andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
+   xoris   r15,r15,PNV_CORE_IDLE_LOCK_BIT@h
lwsync
stw r15,0(r14)
 
-- 
2.11.0



[PATCH 7/8] powerpc/64s: idle do not hold reservation longer than required

2017-03-14 Thread Nicholas Piggin
When taking the core idle state lock, grab it immediately like a
regular lock, rather than adding more tests in there. Holding the lock
keeps the state stable, so there is no need to do the tests while
holding the reservation.
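
Roughly, in C terms, the new sequence looks like the sketch below (my analogy for the lwarx/stwcx. loop in the hunk; the lock bit value is assumed, and memory-ordering details are omitted):

    #include <stdatomic.h>
    #include <stdint.h>

    #define PNV_CORE_IDLE_LOCK_BIT  (1u << 28)          /* assumed value */

    /* Take the per-core lock up front, like a regular spinlock, then work
     * on a stable snapshot instead of doing the tests under the
     * reservation. */
    static uint32_t core_idle_lock(_Atomic uint32_t *state)
    {
            uint32_t old = atomic_load(state);

            for (;;) {
                    if (old & PNV_CORE_IDLE_LOCK_BIT) { /* core_idle_lock_held */
                            old = atomic_load(state);
                            continue;
                    }
                    if (atomic_compare_exchange_weak(state, &old,
                                                     old | PNV_CORE_IDLE_LOCK_BIT))
                            return old | PNV_CORE_IDLE_LOCK_BIT;
                    /* CAS failed: 'old' was refreshed, just retry */
            }
    }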

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/idle_book3s.S | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 1c91dc35c559..3cb75907c5c5 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -488,12 +488,12 @@ BEGIN_FTR_SECTION
CHECK_HMI_INTERRUPT
 END_FTR_SECTION_IFSET(CPU_FTR_HVMODE)
 
-   lbz r7,PACA_THREAD_MASK(r13)
ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
-lwarx_loop2:
-   lwarx   r15,0,r14
-   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
+   lbz r7,PACA_THREAD_MASK(r13)
+
/*
+* Take the core lock to synchronize against other threads.
+*
 * Lock bit is set in one of the 2 cases-
 * a. In the sleep/winkle enter path, the last thread is executing
 * fastsleep workaround code.
@@ -501,7 +501,14 @@ lwarx_loop2:
 * workaround undo code or resyncing timebase or restoring context
 * In either case loop until the lock bit is cleared.
 */
+1:
+   lwarx   r15,0,r14
+   andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
bnel-   core_idle_lock_held
+   orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
+   stwcx.  r15,0,r14
+   bne-1b
+   isync
 
andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
cmpwi   cr2,r9,0
@@ -513,11 +520,6 @@ lwarx_loop2:
 * cr4 - gt or eq if waking up from complete hypervisor state loss.
 */
 
-   orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h
-   stwcx.  r15,0,r14
-   bne-lwarx_loop2
-   isync
-
 BEGIN_FTR_SECTION
lbz r4,PACA_SUBCORE_SIBLING_MASK(r13)
and r4,r4,r15
-- 
2.11.0



[PATCH 8/8] powerpc/64s: idle POWER8 avoid full state loss recovery when possible

2017-03-14 Thread Nicholas Piggin
If not all threads were in winkle, full state loss recovery is not
necessary and can be avoided. A previous patch removed this optimisation
due to some complexity with the implementation. Re-implement it by
counting the number of threads in winkle in the per-core idle state word.
Only do the full state-loss restore if all threads were in winkle.

This has a small window of false positives right before threads execute
winkle and just after they wake up, when the winkle count does not
reflect the true number of threads in winkle. This is not a significant
problem in comparison with even the minimum winkle duration. For
correctness, a false positive is not a problem (only false negatives
would be).
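
A C-level sketch of the accounting described above (my illustration; the field values are assumed, since the archive truncated the hex literals in the cpuidle.h hunk below, and in the real code the update happens inside the lwarx/stwcx. loop while the core lock is held):

    #include <stdbool.h>
    #include <stdint.h>

    #define PNV_CORE_IDLE_WINKLE_COUNT          0x00010000u  /* one thread in winkle */
    #define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT  0x00080000u  /* set once all 8 threads are */

    static inline uint32_t winkle_enter(uint32_t state)
    {
            return state + PNV_CORE_IDLE_WINKLE_COUNT;  /* "Add if winkle" */
    }

    static inline uint32_t winkle_exit(uint32_t state)
    {
            return state - PNV_CORE_IDLE_WINKLE_COUNT;
    }

    /* Full per-core state loss only if every thread made it into winkle;
     * a transiently high count (false positive) is safe, a low one is not. */
    static inline bool core_lost_full_state(uint32_t state)
    {
            return state & PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT;
    }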

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/cpuidle.h | 32 ---
 arch/powerpc/kernel/idle_book3s.S  | 45 +-
 2 files changed, 68 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/cpuidle.h 
b/arch/powerpc/include/asm/cpuidle.h
index b9d9f960dffd..b68a5cd75ae8 100644
--- a/arch/powerpc/include/asm/cpuidle.h
+++ b/arch/powerpc/include/asm/cpuidle.h
@@ -2,13 +2,39 @@
 #define _ASM_POWERPC_CPUIDLE_H
 
 #ifdef CONFIG_PPC_POWERNV
-/* Used in powernv idle state management */
+/* Thread state used in powernv idle state management */
 #define PNV_THREAD_RUNNING  0
 #define PNV_THREAD_NAP  1
 #define PNV_THREAD_SLEEP2
 #define PNV_THREAD_WINKLE   3
-#define PNV_CORE_IDLE_LOCK_BIT  0x1000
-#define PNV_CORE_IDLE_THREAD_BITS   0x00FF
+
+/*
+ * Core state used in powernv idle for POWER8.
+ *
+ * The lock bit synchronizes updates to the state, as well as parts of the
+ * sleep/wake code (see kernel/idle_book3s.S).
+ *
+ * Bottom 8 bits track the idle state of each thread. Bit is cleared before
+ * the thread executes an idle instruction (nap/sleep/winkle).
+ *
+ * Then there is winkle tracking. A core does not lose complete state
+ * until every thread is in winkle. So the winkle count field counts the
+ * number of threads in winkle (small window of false positives is okay
+ * around the sleep/wake, so long as there are no false negatives).
+ *
+ * When the winkle count reaches 8 (the COUNT_ALL_BIT becomes set), then
+ * the THREAD_WINKLE_BITS are set, which indicate which threads have not
+ * yet woken from the winkle state.
+ */
+#define PNV_CORE_IDLE_LOCK_BIT 0x1000
+
+#define PNV_CORE_IDLE_WINKLE_COUNT 0x0001
+#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT 0x0008
+#define PNV_CORE_IDLE_WINKLE_COUNT_BITS0x000F
+#define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT 8
+#define PNV_CORE_IDLE_THREAD_WINKLE_BITS   0xFF00
+
+#define PNV_CORE_IDLE_THREAD_BITS  0x00FF
 
 /*
  *  NOTE =
diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 3cb75907c5c5..87518a1dca50 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -210,15 +210,20 @@ pnv_enter_arch207_idle_mode:
/* Sleep or winkle */
lbz r7,PACA_THREAD_MASK(r13)
ld  r14,PACA_CORE_IDLE_STATE_PTR(r13)
+   li  r5,0
+   beq cr3,3f
+   lis r5,PNV_CORE_IDLE_WINKLE_COUNT@h
+3:
 lwarx_loop1:
lwarx   r15,0,r14
 
andis.  r9,r15,PNV_CORE_IDLE_LOCK_BIT@h
bnel-   core_idle_lock_held
 
+   add r15,r15,r5  /* Add if winkle */
andcr15,r15,r7  /* Clear thread bit */
 
-   andi.   r15,r15,PNV_CORE_IDLE_THREAD_BITS
+   andi.   r9,r15,PNV_CORE_IDLE_THREAD_BITS
 
 /*
  * If cr0 = 0, then current thread is the last thread of the core entering
@@ -437,16 +442,14 @@ pnv_restore_hyp_resource_arch300:
 pnv_restore_hyp_resource_arch207:
/*
 * POWER ISA 2.07 or less.
-* Check if we slept with winkle.
+* Check if we slept with sleep or winkle.
 */
ld  r2,PACATOC(r13);
 
-   lbz r0,PACA_THREAD_IDLE_STATE(r13)
-   cmpwi   cr2,r0,PNV_THREAD_NAP
-   cmpwi   cr4,r0,PNV_THREAD_WINKLE
+   lbz r4,PACA_THREAD_IDLE_STATE(r13)
li  r0,PNV_THREAD_RUNNING
stb r0,PACA_THREAD_IDLE_STATE(r13)  /* Clear thread state */
-
+   cmpwi   cr2,r4,PNV_THREAD_NAP
bgt cr2,pnv_wakeup_tb_loss  /* Either sleep or Winkle */
 
 
@@ -467,7 +470,12 @@ pnv_restore_hyp_resource_arch207:
  *
  * r13 - PACA
  * cr3 - gt if waking up with partial/complete hypervisor state loss
+ *
+ * If ISA300:
  * cr4 - gt or eq if waking up from complete hypervisor state loss.
+ *
+ * If ISA207:
+ * r4 - PACA_THREAD_IDLE_STATE
  */
 pnv_wakeup_tb_loss:
ld  r1,PACAR1(r13)
@@ -482,6 +490,7 @@ pnv_wakeup_tb_loss:
 * required to return back to caller after hypervisor state restore is
 * complete.
 */
+ 

Linux 4.11: Reported regressions as of Tuesday, 2017-03-14

2017-03-14 Thread Thorsten Leemhuis
Hi! Find below my first regression report for Linux 4.11. It lists 9
regressions I'm currently aware of.

As always: Are you aware of any other regressions? Then please let me
know (simply CC regressi...@leemhuis.info). And please tell me if there
is anything in the report that shouldn't be there.

Ciao, Thorsten

P.S.: Sorry, I didn't compile any regression reports for 4.10: I 
didn't find time due to various reasons (vacation, a cold, regular 
work, and attending two conferences). Reminder: compiling these reports
has nothing to do with my paid job and I'm doing it in my spare time
just because I think someone should do it.

P.P.S.: Dear Gmane mailing list archive web interface, please come
back soon. I really really miss you. It hurts every single day.
Don't you miss me, too? ;-)

== Current regressions ==

Desc: PowerPC crashes on boot, bisected to commit 5657933dbb6e
Repo: 2017-03-02 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1343553.html
Stat: 2017-03-09 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349568.html
Note: patch hopefully heading mainline

Desc: thinkpad x220: GPU hang
Repo: 2017-03-05 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1345689.html
Stat: n/a 
Note: poked discussion for a status update

Desc: e1000e: __pci_enable_msi_range fails before/after resume
Repo: 2017-03-06 https://bugzilla.kernel.org/show_bug.cgi?id=194801
Stat: 2017-03-06 https://bugzilla.kernel.org/show_bug.cgi?id=194801#c1
Note: poked bug for status; might need to get forwarded to network people

Desc: Two batteries are detected on DEXP Ursus 7W tablet instead of one
Repo: 2017-03-07 https://bugzilla.kernel.org/show_bug.cgi?id=194811
Stat: 2017-03-12 https://bugzilla.kernel.org/show_bug.cgi?id=194811#c8
Note: patch likely heading mainline

Desc: [lkp-robot] [f2fs] 4ac912427c: -33.7% aim7.jobs-per-min regression
Repo: 2017-03-08 https://www.spinics.net/lists/kernel/msg2459239.html
Stat: 2017-03-13 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1353085.html
Note: patch discussed and heading mainline

Desc: VM with virtio-scsi drive often crashes during boot with kernel 4.11rc1
Repo: 2017-03-09 https://bugzilla.kernel.org/show_bug.cgi?id=194837
Stat: n/a 
Note: will forward this to scsi & virtio & kvm people

Desc: general protection fault: inet6_fill_ifaddr+0x6c/0x230
Repo: 2017-03-11 https://bugzilla.kernel.org/show_bug.cgi?id=194849
Stat: n/a 
Note: poked bug for status; might need to get forwarded to network people

Desc: Synaptics RMI4 touchpad regression in 4.11-rc1
Repo: 2017-03-11 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1351561.html
Stat: 2017-03-13 
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1352399.html
Note: solution discussed, no patch yet

Desc: DRM BUG while initializing cape verde (2nd card)
Repo: 2017-03-13 https://bugzilla.kernel.org/show_bug.cgi?id=194867
Stat: n/a 
Note: patch proposed by reporter



Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c

2017-03-14 Thread Michael Ellerman
Michael Ellerman  writes:

> We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it
> from electra_cf.c which is a powerpc-only driver.
>
> Signed-off-by: Michael Ellerman 
> ---
>  drivers/pcmcia/electra_cf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Ping anyone?

Or should I merge this via the powerpc tree?

cheers

> diff --git a/drivers/pcmcia/electra_cf.c b/drivers/pcmcia/electra_cf.c
> index 4d7bc3f4124a..c6fe2a4a7a6a 100644
> --- a/drivers/pcmcia/electra_cf.c
> +++ b/drivers/pcmcia/electra_cf.c
> @@ -207,7 +207,7 @@ static int electra_cf_probe(struct platform_device *ofdev)
>   return -ENOMEM;
>  
>   setup_timer(&cf->timer, electra_cf_timer, (unsigned long)cf);
> - cf->irq = NO_IRQ;
> + cf->irq = 0;
>  
>   cf->ofdev = ofdev;
>   cf->mem_phys = mem.start;
> @@ -313,7 +313,7 @@ fail3:
>  fail2:
>   release_mem_region(cf->mem_phys, cf->mem_size);
>  fail1:
> - if (cf->irq != NO_IRQ)
> + if (cf->irq)
>   free_irq(cf->irq, cf);
>  
>   if (cf->io_virt)
> -- 
> 2.7.4


[PATCH V2 1/6] cxl: Remove unused values in bare-metal environment.

2017-03-14 Thread Christophe Lombard
The two fields pid and tid of the structure cxl_irq_info are only used
in the guest environment. To avoid confusion, don't fill them in the
bare-metal environment.
The PSL Process and Thread Identification Register is only used when
attaching a dedicated process, and only for PSL8.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/native.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 7ae7105..7257e8b 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -859,8 +859,6 @@ static int native_detach_process(struct cxl_context *ctx)
 
 static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info)
 {
-   u64 pidtid;
-
/* If the adapter has gone away, we can't get any meaningful
 * information.
 */
@@ -870,9 +868,6 @@ static int native_get_irq_info(struct cxl_afu *afu, struct 
cxl_irq_info *info)
info->dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An);
info->dar = cxl_p2n_read(afu, CXL_PSL_DAR_An);
info->dsr = cxl_p2n_read(afu, CXL_PSL_DSR_An);
-   pidtid = cxl_p2n_read(afu, CXL_PSL_PID_TID_An);
-   info->pid = pidtid >> 32;
-   info->tid = pidtid & 0x;
info->afu_err = cxl_p2n_read(afu, CXL_AFU_ERR_An);
info->errstat = cxl_p2n_read(afu, CXL_PSL_ErrStat_An);
info->proc_handle = 0;
-- 
2.7.4



[PATCH V2 3/6] cxl: Update implementation service layer

2017-03-14 Thread Christophe Lombard
The service layer API (in cxl.h) lists some low-level functions whose
implementation is different on PSL8, PSL9 and XSL:
- Init implementation for the adapter and the afu.
- Invalidate TLB/SLB.
- Attach process for dedicated/directed models.
- Handle psl interrupts.
- Debug registers for the adapter and the afu.
- Traces.
Each environment implements its own functions, and the common code uses
them through function pointers, defined in cxl_service_layer_ops.
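
As an illustration of the dispatch pattern (a sketch only; the ops-table field names come from the struct in the hunk below, while the adapter->native->sl_ops path and the wrapper name are my assumptions about what a call site looks like):

    /* Common code calls the environment-specific implementation through
     * the function pointers in cxl_service_layer_ops. */
    static int sl_invalidate_all(struct cxl *adapter)
    {
            const struct cxl_service_layer_ops *ops = adapter->native->sl_ops;

            if (ops->invalidate_all)
                    return ops->invalidate_all(adapter);
            return 0;
    }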

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxl.h | 34 ++-
 drivers/misc/cxl/debugfs.c | 16 +++
 drivers/misc/cxl/guest.c   |  2 +-
 drivers/misc/cxl/irq.c |  2 +-
 drivers/misc/cxl/native.c  | 50 +++---
 drivers/misc/cxl/pci.c | 47 +--
 6 files changed, 97 insertions(+), 54 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 4d1b704..3e03a66 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -553,13 +553,23 @@ struct cxl_context {
struct mm_struct *mm;
 };
 
+struct cxl_irq_info;
+
 struct cxl_service_layer_ops {
int (*adapter_regs_init)(struct cxl *adapter, struct pci_dev *dev);
+   int (*invalidate_all)(struct cxl *adapter);
int (*afu_regs_init)(struct cxl_afu *afu);
+   int (*sanitise_afu_regs)(struct cxl_afu *afu);
int (*register_serr_irq)(struct cxl_afu *afu);
void (*release_serr_irq)(struct cxl_afu *afu);
-   void (*debugfs_add_adapter_sl_regs)(struct cxl *adapter, struct dentry 
*dir);
-   void (*debugfs_add_afu_sl_regs)(struct cxl_afu *afu, struct dentry 
*dir);
+   irqreturn_t (*handle_interrupt)(int irq, struct cxl_context *ctx, 
struct cxl_irq_info *irq_info);
+   irqreturn_t (*fail_irq)(struct cxl_afu *afu, struct cxl_irq_info 
*irq_info);
+   int (*activate_dedicated_process)(struct cxl_afu *afu);
+   int (*attach_afu_directed)(struct cxl_context *ctx, u64 wed, u64 amr);
+   int (*attach_dedicated_process)(struct cxl_context *ctx, u64 wed, u64 
amr);
+   void (*update_dedicated_ivtes)(struct cxl_context *ctx);
+   void (*debugfs_add_adapter_regs)(struct cxl *adapter, struct dentry 
*dir);
+   void (*debugfs_add_afu_regs)(struct cxl_afu *afu, struct dentry *dir);
void (*psl_irq_dump_registers)(struct cxl_context *ctx);
void (*err_irq_dump_registers)(struct cxl *adapter);
void (*debugfs_stop_trace)(struct cxl *adapter);
@@ -805,16 +815,21 @@ void afu_irq_name_free(struct cxl_context *ctx);
 
 #ifdef CONFIG_DEBUG_FS
 
+int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr);
+int cxl_activate_dedicated_process_psl(struct cxl_afu *afu);
+int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 
amr);
+void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx);
+
 int cxl_debugfs_init(void);
 void cxl_debugfs_exit(void);
 int cxl_debugfs_adapter_add(struct cxl *adapter);
 void cxl_debugfs_adapter_remove(struct cxl *adapter);
 int cxl_debugfs_afu_add(struct cxl_afu *afu);
 void cxl_debugfs_afu_remove(struct cxl_afu *afu);
-void cxl_stop_trace(struct cxl *cxl);
-void cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct dentry *dir);
+void cxl_stop_trace_psl(struct cxl *cxl);
+void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir);
+void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
+void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir);
 
 #else /* CONFIG_DEBUG_FS */
 
@@ -916,19 +931,20 @@ struct cxl_irq_info {
 };
 
 void cxl_assign_psn_space(struct cxl_context *ctx);
-irqreturn_t cxl_irq(int irq, struct cxl_context *ctx, struct cxl_irq_info 
*irq_info);
+int cxl_invalidate_all_psl(struct cxl *adapter);
+irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info 
*irq_info);
+irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info 
*irq_info);
 int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler,
void *cookie, irq_hw_number_t *dest_hwirq,
unsigned int *dest_virq, const char *name);
 
 int cxl_check_error(struct cxl_afu *afu);
 int cxl_afu_slbia(struct cxl_afu *afu);
-int cxl_tlb_slb_invalidate(struct cxl *adapter);
 int cxl_data_cache_flush(struct cxl *adapter);
 int cxl_afu_disable(struct cxl_afu *afu);
 int cxl_psl_purge(struct cxl_afu *afu);
 
-void cxl_native_psl_irq_dump_regs(struct cxl_context *ctx);
+void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx);
 void cxl_native_err_irq_dump_regs(struct cxl *adapter);
 int cxl_pci_vphb_add(struct cxl_afu *afu);
 void cxl_pci_vphb_remove(struct cxl_afu *afu);
diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs

[PATCH V2 5/6] cxl: Isolate few psl8 specific calls

2017-03-14 Thread Christophe Lombard
Point out the registers that are specific to the Coherent Accelerator
Interface Architecture, level 1 (CAIA1).
Code and functions specific to PSL8 (CAIA1) must be framed as such.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/context.c | 28 +++-
 drivers/misc/cxl/cxl.h | 35 +++--
 drivers/misc/cxl/debugfs.c |  6 +++--
 drivers/misc/cxl/fault.c   | 14 +-
 drivers/misc/cxl/native.c  | 58 ++---
 drivers/misc/cxl/pci.c | 64 +++---
 6 files changed, 136 insertions(+), 69 deletions(-)

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index ed0a447..0fd3cf8 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -39,23 +39,26 @@ int cxl_context_init(struct cxl_context *ctx, struct 
cxl_afu *afu, bool master)
 {
int i;
 
-   spin_lock_init(&ctx->sste_lock);
+   if (cxl_is_psl8(afu))
+   spin_lock_init(&ctx->sste_lock);
ctx->afu = afu;
ctx->master = master;
ctx->pid = NULL; /* Set in start work ioctl */
mutex_init(&ctx->mapping_lock);
ctx->mapping = NULL;
 
-   /*
-* Allocate the segment table before we put it in the IDR so that we
-* can always access it when dereferenced from IDR. For the same
-* reason, the segment table is only destroyed after the context is
-* removed from the IDR.  Access to this in the IOCTL is protected by
-* Linux filesytem symantics (can't IOCTL until open is complete).
-*/
-   i = cxl_alloc_sst(ctx);
-   if (i)
-   return i;
+   if (cxl_is_psl8(afu)) {
+   /*
+* Allocate the segment table before we put it in the IDR so 
that we
+* can always access it when dereferenced from IDR. For the same
+* reason, the segment table is only destroyed after the 
context is
+* removed from the IDR.  Access to this in the IOCTL is 
protected by
+* Linux filesytem symantics (can't IOCTL until open is 
complete).
+*/
+   i = cxl_alloc_sst(ctx);
+   if (i)
+   return i;
+   }
 
INIT_WORK(&ctx->fault_work, cxl_handle_fault);
 
@@ -307,7 +310,8 @@ static void reclaim_ctx(struct rcu_head *rcu)
 {
struct cxl_context *ctx = container_of(rcu, struct cxl_context, rcu);
 
-   free_page((u64)ctx->sstp);
+   if (cxl_is_psl8(ctx->afu))
+   free_page((u64)ctx->sstp);
if (ctx->ff_page)
__free_page(ctx->ff_page);
ctx->sstp = NULL;
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 1f04238..f6a3a34 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -73,7 +73,7 @@ static const cxl_p1_reg_t CXL_PSL_Control = {0x0020};
 static const cxl_p1_reg_t CXL_PSL_DLCNTL  = {0x0060};
 static const cxl_p1_reg_t CXL_PSL_DLADDR  = {0x0068};
 
-/* PSL Lookaside Buffer Management Area */
+/* PSL Lookaside Buffer Management Area - CAIA 1 */
 static const cxl_p1_reg_t CXL_PSL_LBISEL  = {0x0080};
 static const cxl_p1_reg_t CXL_PSL_SLBIE   = {0x0088};
 static const cxl_p1_reg_t CXL_PSL_SLBIA   = {0x0090};
@@ -82,7 +82,7 @@ static const cxl_p1_reg_t CXL_PSL_TLBIA   = {0x00A8};
 static const cxl_p1_reg_t CXL_PSL_AFUSEL  = {0x00B0};
 
 /* 0x00C0:7EFF Implementation dependent area */
-/* PSL registers */
+/* PSL registers - CAIA 1 */
 static const cxl_p1_reg_t CXL_PSL_FIR1  = {0x0100};
 static const cxl_p1_reg_t CXL_PSL_FIR2  = {0x0108};
 static const cxl_p1_reg_t CXL_PSL_Timebase  = {0x0110};
@@ -109,7 +109,7 @@ static const cxl_p1n_reg_t CXL_PSL_AMBAR_An   = {0x10};
 static const cxl_p1n_reg_t CXL_PSL_SPOffset_An= {0x18};
 static const cxl_p1n_reg_t CXL_PSL_ID_An  = {0x20};
 static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28};
-/* Memory Management and Lookaside Buffer Management */
+/* Memory Management and Lookaside Buffer Management - CAIA 1*/
 static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30};
 static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38};
 /* Pointer Area */
@@ -124,6 +124,7 @@ static const cxl_p1n_reg_t CXL_PSL_IVTE_Limit_An  = {0xB8};
 /* 0xC0:FF Implementation Dependent Area */
 static const cxl_p1n_reg_t CXL_PSL_FIR_SLICE_An   = {0xC0};
 static const cxl_p1n_reg_t CXL_AFU_DEBUG_An   = {0xC8};
+/* 0xC0:FF Implementation Dependent Area - CAIA 1 */
 static const cxl_p1n_reg_t CXL_PSL_APCALLOC_A = {0xD0};
 static const cxl_p1n_reg_t CXL_PSL_COALLOC_A  = {0xD8};
 static const cxl_p1n_reg_t CXL_PSL_RXCTL_A= {0xE0};
@@ -133,12 +134,14 @@ static const cxl_p1n_reg_t CXL_PSL_SLICE_TRACE= 
{0xE8};
 /* Configuration and Control Area */
 static const cxl_p2n_reg_t CXL_PSL_PID_TID_An = {0x000};
 static const cxl_p2n_reg_t CXL_CSRP_An= {0x008};
+/* Configuration and Control Area - CAIA 1 */
 static const cxl_p2n_reg_t CXL_AURP0_

[PATCH V2 0/6] cxl: Add support for Coherent Accelerator Interface Architecture 2.0

2017-03-14 Thread Christophe Lombard
This series adds support for a cxl card which supports the Coherent
Accelerator Interface Architecture 2.0.

It requires an IBM POWER9 system and the Power Service Layer, version 9.
The PSL provides the address translation and system memory cache for
CAIA compliant Accelerators.
The PSL attaches to the IBM processor chip through the PCIe link using
the PSL-specific “CAPI Protocol” Transaction Layer Packets.
The PSL and CAPP communicate using PowerBus packets. 
When using a PCIe link the PCIe Host Bridge (PHB) decodes the CAPI
Protocol Packets from the PSL and forwards them as PowerBus data
packets. The PSL also has an optional DMA feature which allows the AFU
to send native PCIe reads and writes to the Processor.

CAIA 2 introduces new features: 
* There are several similarities between the two programming models:
Dedicated-Process and shared models.
* DMA support
* Nest MMU to handle address translation.
* ...

It builds on top of the existing cxl driver for the first version of
CAIA. Today only the bare-metal environment supports these new features.

Compatibility with the CAIA, version 1, allows applications and system
software to migrate from one implementation to another with minor
changes.
Most of the differences are:
* Power Service Layer registers: p1 and p2 registers. These new
registers require reworking the service layer API (in cxl.h).
* Support of Radix mode. POWER9 supports multiple memory management
models, so we need to select the right translation mechanism.
* Dedicated-Shared Process Programming Model
* Process element entry. Structure cxl_process_element_common is
redefined.
* Translation Fault Handling. Only a page fault is now handled by the
cxl driver when a translation fault occurs.

Roughly 3/4 of the code is common between the two CAIA versions. When
the code needs to call a specific implementation, it does so
through an API. The PSL8 and PSL9 implementations each provide
their own definition. See struct cxl_service_layer_ops.

The first 3 patches are mostly cleanup and fixes, separating the
psl8-specific code from the code which will also be used for psl9,
and restructuring existing code to make it easy to add the psl9
implementation.
Patches 4 and 5 rename and isolate implementation-specific code.
Patch 6 introduces the core of the PSL9-specific code.

Tested in a simulation environment.

Changelog:
v1->v2:
 - integrate comments from Andrew Donnellan and Frederic Barrat 


Christophe Lombard (6):
  cxl: Remove unused values in bare-metal environment.
  cxl: Keep track of mm struct associated with a context
  cxl: Update implementation service layer
  cxl: Rename some psl8 specific functions
  cxl: Isolate few psl8 specific calls
  cxl: Add psl9 specific code

 drivers/misc/cxl/api.c |  17 ++-
 drivers/misc/cxl/context.c |  67 +++--
 drivers/misc/cxl/cxl.h | 206 ++-
 drivers/misc/cxl/debugfs.c |  41 --
 drivers/misc/cxl/fault.c   | 139 ++
 drivers/misc/cxl/file.c|  15 +-
 drivers/misc/cxl/guest.c   |  10 +-
 drivers/misc/cxl/irq.c |  55 +++-
 drivers/misc/cxl/native.c  | 331 ---
 drivers/misc/cxl/pci.c | 345 +++--
 drivers/misc/cxl/trace.h   |  43 ++
 11 files changed, 1004 insertions(+), 265 deletions(-)

-- 
2.7.4



[PATCH V2 4/6] cxl: Rename some psl8 specific functions

2017-03-14 Thread Christophe Lombard
Rename a few functions, changing the '_psl' suffix to '_psl8', to make
clear that the implementation is psl8 specific.
Those functions will have an equivalent implementation for the psl9 in
a later patch.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/cxl.h | 22 ++---
 drivers/misc/cxl/debugfs.c |  6 +++---
 drivers/misc/cxl/guest.c   |  2 +-
 drivers/misc/cxl/irq.c |  2 +-
 drivers/misc/cxl/native.c  | 14 +++---
 drivers/misc/cxl/pci.c | 48 +++---
 6 files changed, 47 insertions(+), 47 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 3e03a66..1f04238 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -815,10 +815,10 @@ void afu_irq_name_free(struct cxl_context *ctx);
 
 #ifdef CONFIG_DEBUG_FS
 
-int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr);
-int cxl_activate_dedicated_process_psl(struct cxl_afu *afu);
-int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 
amr);
-void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx);
+int cxl_attach_afu_directed_psl8(struct cxl_context *ctx, u64 wed, u64 amr);
+int cxl_activate_dedicated_process_psl8(struct cxl_afu *afu);
+int cxl_attach_dedicated_process_psl8(struct cxl_context *ctx, u64 wed, u64 
amr);
+void cxl_update_dedicated_ivtes_psl8(struct cxl_context *ctx);
 
 int cxl_debugfs_init(void);
 void cxl_debugfs_exit(void);
@@ -826,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter);
 void cxl_debugfs_adapter_remove(struct cxl *adapter);
 int cxl_debugfs_afu_add(struct cxl_afu *afu);
 void cxl_debugfs_afu_remove(struct cxl_afu *afu);
-void cxl_stop_trace_psl(struct cxl *cxl);
-void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir);
+void cxl_stop_trace_psl8(struct cxl *cxl);
+void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry 
*dir);
 void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir);
-void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir);
+void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir);
 
 #else /* CONFIG_DEBUG_FS */
 
@@ -931,9 +931,9 @@ struct cxl_irq_info {
 };
 
 void cxl_assign_psn_space(struct cxl_context *ctx);
-int cxl_invalidate_all_psl(struct cxl *adapter);
-irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info 
*irq_info);
-irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info 
*irq_info);
+int cxl_invalidate_all_psl8(struct cxl *adapter);
+irqreturn_t cxl_irq_psl8(int irq, struct cxl_context *ctx, struct cxl_irq_info 
*irq_info);
+irqreturn_t cxl_fail_irq_psl8(struct cxl_afu *afu, struct cxl_irq_info 
*irq_info);
 int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler,
void *cookie, irq_hw_number_t *dest_hwirq,
unsigned int *dest_virq, const char *name);
@@ -944,7 +944,7 @@ int cxl_data_cache_flush(struct cxl *adapter);
 int cxl_afu_disable(struct cxl_afu *afu);
 int cxl_psl_purge(struct cxl_afu *afu);
 
-void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx);
+void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx);
 void cxl_native_err_irq_dump_regs(struct cxl *adapter);
 int cxl_pci_vphb_add(struct cxl_afu *afu);
 void cxl_pci_vphb_remove(struct cxl_afu *afu);
diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c
index 4848ebf..2ff10a9 100644
--- a/drivers/misc/cxl/debugfs.c
+++ b/drivers/misc/cxl/debugfs.c
@@ -15,7 +15,7 @@
 
 static struct dentry *cxl_debugfs;
 
-void cxl_stop_trace_psl(struct cxl *adapter)
+void cxl_stop_trace_psl8(struct cxl *adapter)
 {
int slice;
 
@@ -53,7 +53,7 @@ static struct dentry *debugfs_create_io_x64(const char *name, 
umode_t mode,
  (void __force *)value, &fops_io_x64);
 }
 
-void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir)
+void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir)
 {
debugfs_create_io_x64("fir1", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_PSL_FIR1));
debugfs_create_io_x64("fir2", S_IRUSR, dir, _cxl_p1_addr(adapter, 
CXL_PSL_FIR2));
@@ -92,7 +92,7 @@ void cxl_debugfs_adapter_remove(struct cxl *adapter)
debugfs_remove_recursive(adapter->debugfs);
 }
 
-void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir)
+void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir)
 {
debugfs_create_io_x64("fir", S_IRUSR, dir, _cxl_p1n_addr(afu, 
CXL_PSL_FIR_SLICE_An));
debugfs_create_io_x64("serr", S_IRUSR, dir, _cxl_p1n_addr(afu, 
CXL_PSL_SERR_An));
diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index f6ba698..3ad7381 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -169,7 +169,7 @@ static irqreturn_t guest_psl_irq(int irq, void *data)
return IR

[PATCH V2 6/6] cxl: Add psl9 specific code

2017-03-14 Thread Christophe Lombard
The new Coherent Accelerator Interface Architecture, level 2, for the
IBM POWER9 brings new content and features:
- POWER9 Service Layer
- Registers
- Radix mode
- Process element entry
- Dedicated-Shared Process Programming Model
- Translation Fault Handling
- CAPP
- Memory Context ID
If a valid mm_struct is found the memory context id is used for each
transaction associated with the process handle. The PSL uses the
context ID to find the corresponding process element.

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/context.c |  13 +++
 drivers/misc/cxl/cxl.h | 124 
 drivers/misc/cxl/debugfs.c |  19 
 drivers/misc/cxl/fault.c   |  48 ++
 drivers/misc/cxl/guest.c   |   8 +-
 drivers/misc/cxl/irq.c |  53 +++
 drivers/misc/cxl/native.c  | 218 +--
 drivers/misc/cxl/pci.c | 228 +++--
 drivers/misc/cxl/trace.h   |  43 +
 9 files changed, 700 insertions(+), 54 deletions(-)

diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 0fd3cf8..8d41fc5 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -205,6 +205,19 @@ int cxl_context_iomap(struct cxl_context *ctx, struct 
vm_area_struct *vma)
return -EBUSY;
}
 
+   if ((ctx->afu->current_mode == CXL_MODE_DEDICATED) &&
+ (cxl_is_psl9(ctx->afu))) {
+   /* make sure there is a valid problem state area space for this 
AFU */
+   if (ctx->master && !ctx->afu->psa) {
+   pr_devel("AFU doesn't support mmio space\n");
+   return -EINVAL;
+   }
+
+   /* Can't mmap until the AFU is enabled */
+   if (!ctx->afu->enabled)
+   return -EBUSY;
+   }
+
pr_devel("%s: mmio physical: %llx pe: %i master:%i\n", __func__,
 ctx->psn_phys, ctx->pe , ctx->master);
 
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index f6a3a34..fbdc511 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -63,7 +63,7 @@ typedef struct {
 /* Memory maps. Ref CXL Appendix A */
 
 /* PSL Privilege 1 Memory Map */
-/* Configuration and Control area */
+/* Configuration and Control area - CAIA 1&2 */
 static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x};
 static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008};
 static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010};
@@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase  = {0x0100};
 static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108};
 static const cxl_p1_reg_t CXL_XSL_FEC   = {0x0158};
 static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168};
+/* PSL registers - CAIA 2 */
+static const cxl_p1_reg_t CXL_PSL9_CONTROL  = {0x0020};
+static const cxl_p1_reg_t CXL_XSL9_DSNCTL   = {0x0168};
+static const cxl_p1_reg_t CXL_PSL9_FIR1 = {0x0300};
+static const cxl_p1_reg_t CXL_PSL9_FIR2 = {0x0308}; /* TBD NML CAIA 2 */
+static const cxl_p1_reg_t CXL_PSL9_Timebase = {0x0310};
+static const cxl_p1_reg_t CXL_PSL9_DEBUG= {0x0320};
+static const cxl_p1_reg_t CXL_PSL9_FIR_CNTL = {0x0348};
+static const cxl_p1_reg_t CXL_PSL9_DSNDCTL  = {0x0350};
+static const cxl_p1_reg_t CXL_PSL9_TB_CTLSTAT = {0x0340};
+static const cxl_p1_reg_t CXL_PSL9_TRACECFG = {0x0368};
+static const cxl_p1_reg_t CXL_PSL9_APCDEDALLOC = {0x0378};
+static const cxl_p1_reg_t CXL_PSL9_APCDEDTYPE = {0x0380};
+static const cxl_p1_reg_t CXL_PSL9_TNR_ADDR = {0x0388};
+static const cxl_p1_reg_t CXL_PSL9_GP_CT = {0x0398};
+static const cxl_p1_reg_t CXL_XSL9_IERAT = {0x0588};
+static const cxl_p1_reg_t CXL_XSL9_ILPP  = {0x0590};
+
 /* 0x7F00:7FFF Reserved PCIe MSI-X Pending Bit Array area */
 /* 0x8000: Reserved PCIe MSI-X Table Area */
 
 /* PSL Slice Privilege 1 Memory Map */
-/* Configuration Area */
+/* Configuration Area - CAIA 1&2 */
 static const cxl_p1n_reg_t CXL_PSL_SR_An  = {0x00};
 static const cxl_p1n_reg_t CXL_PSL_LPID_An= {0x08};
 static const cxl_p1n_reg_t CXL_PSL_AMBAR_An   = {0x10};
@@ -111,17 +129,18 @@ static const cxl_p1n_reg_t CXL_PSL_ID_An  = 
{0x20};
 static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28};
 /* Memory Management and Lookaside Buffer Management - CAIA 1*/
 static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30};
+/* Memory Management and Lookaside Buffer Management - CAIA 1&2 */
 static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38};
-/* Pointer Area */
+/* Pointer Area - CAIA 1&2 */
 static const cxl_p1n_reg_t CXL_HAURP_An   = {0x80};
 static const cxl_p1n_reg_t CXL_PSL_SPAP_An= {0x88};
 static const cxl_p1n_reg_t CXL_PSL_LLCMD_An   = {0x90};
-/* Control Area */
+/* Control Area - CAIA 1&2 */
 static const cxl_p1n_reg_t CXL_PSL_SCNTL_An   = {0xA0};
 static const cxl_p1n_reg_t CXL_PSL_CtxTime_An = {0xA8};
 static const cxl_p1n_reg_t CXL_PSL_IVTE_Offset_An = {0xB

[PATCH V2 2/6] cxl: Keep track of mm struct associated with a context

2017-03-14 Thread Christophe Lombard
The mm_struct corresponding to the current task is acquired each time
an interrupt is raised. So to simplify the code, we only get the
mm_struct when attaching an AFU context to the process.
The mm_count reference is increased to ensure that the mm_struct can't
be freed. The mm_struct will be released when the context is detached.
The reference (use count) on the struct mm is not kept to avoid a
circular dependency if the process mmaps its cxl mmio and forgets to
unmap before exiting.
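
Condensed, the attach/detach pattern added below amounts to the following
(a sketch mirroring the api.c and context.c hunks in this patch; error
handling omitted):

	/* attach: pin the mm_struct allocation, but not its address space */
	struct mm_struct *mm = get_task_mm(current);	/* takes a mm_users reference */
	if (mm) {
		atomic_inc(&mm->mm_count);	/* keep the mm_struct itself alive */
		mmput(mm);			/* drop mm_users so exit_mmap() is not blocked */
	}
	ctx->mm = mm;

	/* detach: release the allocation reference */
	if (ctx->mm)
		mmdrop(ctx->mm);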

Signed-off-by: Christophe Lombard 
---
 drivers/misc/cxl/api.c | 17 --
 drivers/misc/cxl/context.c | 26 ++--
 drivers/misc/cxl/cxl.h | 13 ++--
 drivers/misc/cxl/fault.c   | 77 +-
 drivers/misc/cxl/file.c| 15 +++--
 5 files changed, 68 insertions(+), 80 deletions(-)

diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index bcc030e..1a138c8 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "cxl.h"
 
@@ -321,19 +322,29 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 
if (task) {
ctx->pid = get_task_pid(task, PIDTYPE_PID);
-   ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID);
kernel = false;
ctx->real_mode = false;
+
+   /* acquire a reference to the task's mm */
+   ctx->mm = get_task_mm(current);
+
+   /* ensure this mm_struct can't be freed */
+   cxl_context_mm_count_get(ctx);
+
+   /* decrement the use count */
+   if (ctx->mm)
+   mmput(ctx->mm);
}
 
cxl_ctx_get();
 
if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
-   put_pid(ctx->glpid);
put_pid(ctx->pid);
-   ctx->glpid = ctx->pid = NULL;
+   ctx->pid = NULL;
cxl_adapter_context_put(ctx->afu->adapter);
cxl_ctx_put();
+   if (task)
+   cxl_context_mm_count_put(ctx);
goto out;
}
 
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index 062bf6c..ed0a447 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,7 +42,7 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu 
*afu, bool master)
spin_lock_init(&ctx->sste_lock);
ctx->afu = afu;
ctx->master = master;
-   ctx->pid = ctx->glpid = NULL; /* Set in start work ioctl */
+   ctx->pid = NULL; /* Set in start work ioctl */
mutex_init(&ctx->mapping_lock);
ctx->mapping = NULL;
 
@@ -242,12 +243,15 @@ int __detach_context(struct cxl_context *ctx)
 
/* release the reference to the group leader and mm handling pid */
put_pid(ctx->pid);
-   put_pid(ctx->glpid);
 
cxl_ctx_put();
 
/* Decrease the attached context count on the adapter */
cxl_adapter_context_put(ctx->afu->adapter);
+
+   /* Decrease the mm count on the context */
+   cxl_context_mm_count_put(ctx);
+
return 0;
 }
 
@@ -325,3 +329,21 @@ void cxl_context_free(struct cxl_context *ctx)
mutex_unlock(&ctx->afu->contexts_lock);
call_rcu(&ctx->rcu, reclaim_ctx);
 }
+
+void cxl_context_mm_count_get(struct cxl_context *ctx)
+{
+   if (ctx->mm)
+   atomic_inc(&ctx->mm->mm_count);
+}
+
+void cxl_context_mm_count_put(struct cxl_context *ctx)
+{
+   if (ctx->mm)
+   mmdrop(ctx->mm);
+}
+
+void cxl_context_mm_users_get(struct cxl_context *ctx)
+{
+   if (ctx->mm)
+   atomic_inc(&ctx->mm->mm_users);
+}
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 79e60ec..4d1b704 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -482,8 +482,6 @@ struct cxl_context {
unsigned int sst_size, sst_lru;
 
wait_queue_head_t wq;
-   /* pid of the group leader associated with the pid */
-   struct pid *glpid;
/* use mm context associated with this pid for ds faults */
struct pid *pid;
spinlock_t lock; /* Protects pending_irq_mask, pending_fault and 
fault_addr */
@@ -551,6 +549,8 @@ struct cxl_context {
 * CX4 only:
 */
struct list_head extra_irq_contexts;
+
+   struct mm_struct *mm;
 };
 
 struct cxl_service_layer_ops {
@@ -1024,4 +1024,13 @@ int cxl_adapter_context_lock(struct cxl *adapter);
 /* Unlock the contexts-lock if taken. Warn and force unlock otherwise */
 void cxl_adapter_context_unlock(struct cxl *adapter);
 
+/* Increases the reference count to "struct mm_struct" */
+void cxl_context_mm_count_get(struct cxl_context *ctx);
+
+/* Decrements the reference count to "struct mm_struct" */
+void cxl_context_mm_count_put(struct cxl_context *ctx);
+

RE: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'

2017-03-14 Thread David Laight
From: Linuxppc-dev Till Smejkal
> Sent: 13 March 2017 22:14
> The only way until now to create a new memory map was via the exported
> function 'mm_alloc'. Unfortunately, this function not only allocates a new
> memory map, but also completely initializes it. However, with the
> introduction of first class virtual address spaces, some initialization
> steps done in 'mm_alloc' are not applicable to the memory maps needed for
> this feature and hence would lead to errors in the kernel code.
> 
> Instead of introducing a new function that can allocate and initialize
> memory maps for first class virtual address spaces and potentially
> duplicate some code, I decided to split the mm_alloc function as well as
> the 'mm_init' function that it uses.
> 
> Now there are four functions exported instead of only one. The new
> 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
> wants the old behavior of mm_alloc, one can use the newly introduced
> function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
> also fully initializes it.
...

That looks like bugs waiting to happen.
You need unchanged code to fail to compile.

David




ioctl structs differ from x86_64?

2017-03-14 Thread Harshal Patil
Hello,
 
I am looking into a bug, https://bugzilla.linux.ibm.com/show_bug.cgi?id=152493 (an external mirror is at https://github.com/opencontainers/runc/issues/1364).
 
Recently in runc code, they added this code https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR155. As you can see they set -onlcr to get rid of \r (line no. 164). Golang, in which runc is written, doesn't have any bindings for ioctls.  This means you have to invoke C code directly (that's what they are doing there).
 
Our guess is that the ioctls on ppc64le differ from those on x86_64, and that's why the code which clears the onlcr bit (https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164) fails on ppc64le but works fine on x86_64.
 
Any pointers on the possible solution would be really helpful. 
 
Thanks,
Harshal Patil
E-mail: harshal.pa...@in.ibm.com
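
One portable option here is to go through the POSIX termios API from the C
side rather than hard-coding ioctl numbers or termios layouts; glibc maps
tcgetattr()/tcsetattr() onto whatever encoding the architecture uses, so the
same source behaves identically on ppc64le and x86_64. A minimal sketch, not
taken from the runc code:

	#include <termios.h>
	#include <stdio.h>

	/* clear ONLCR on the given tty fd without using TCGETS/TCSETS directly */
	static int clear_onlcr(int fd)
	{
		struct termios tios;

		if (tcgetattr(fd, &tios) != 0) {
			perror("tcgetattr");
			return -1;
		}
		tios.c_oflag &= ~ONLCR;
		if (tcsetattr(fd, TCSANOW, &tios) != 0) {
			perror("tcsetattr");
			return -1;
		}
		return 0;
	}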



Re: ioctl structs differ from x86_64?

2017-03-14 Thread Harshal Patil
Adding Pradipta and Christy.
Harshal Patil
E-mail: harshal.pa...@in.ibm.com
 
 
- Original message -
From: Harshal Patil/India/IBM
To: linuxppc-dev@lists.ozlabs.org
Cc: Sudipto Ghosh/India/IBM@IBMIN
Subject: ioctl structs differ from x86_64?
Date: Tue, Mar 14, 2017 4:07 PM
Hello,
 
I am looking into a bug, https://bugzilla.linux.ibm.com/show_bug.cgi?id=152493 (an external mirror is at https://github.com/opencontainers/runc/issues/1364).
 
Recently in runc code, they added this code https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR155. As you can see they set -onlcr to get rid of \r (line no. 164). Golang, in which runc is written, doesn't have any bindings for ioctls.  This means you have to invoke C code directly (that's what they are doing there).
 
Our guess is that the ioctls on ppc64le differ from those on x86_64, and that's why the code which clears the onlcr bit (https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164) fails on ppc64le but works fine on x86_64.
 
Any pointers on the possible solution would be really helpful. 
 
Thanks,
Harshal Patil
E-mail: harshal.pa...@in.ibm.com
 



Re: powerpc/perf: Fix perf_get_data_addr() for power9 DD1

2017-03-14 Thread Michael Ellerman
On Mon, 2017-02-20 at 13:56:30 UTC, Madhavan Srinivasan wrote:
> Power9 DD1 does not support the PPMU_HAS_SIER flag, so sdsync
> in perf_get_data_addr() defaults to MMCRA_SDSYNC, which
> is wrong. Since the power9 MMCRA does not support the SDSYNC bit,
> the patch adds the PPMU_NO_SIAR flag to the check and sets
> sdsync to MMCRA_SAMPLE_ENABLE;
> 
> Signed-off-by: Madhavan Srinivasan 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/f04d108029063a8a67528a88449c41

cheers


Re: powerpc/perf: Handle sdar_mode for marked event in power9

2017-03-14 Thread Michael Ellerman
On Mon, 2017-02-20 at 13:59:03 UTC, Madhavan Srinivasan wrote:
> MMCRA[SDAR_MODE] specifies how the SDAR should be updated in
> continuous sampling mode. On P9 it must be set to 0b00 when
> MMCRA[63] is set.
> 
> Fixes: c7c3f568beff2 ('powerpc/perf: macros for power9 format encoding')
> Signed-off-by: Madhavan Srinivasan 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/78b4416aa249365dd3c1b64da4d3a2

cheers


Re: [kernel] powerpc/powernv/ioda2: Update iommu table base on ownership change

2017-03-14 Thread Michael Ellerman
On Tue, 2017-02-21 at 02:41:31 UTC, Alexey Kardashevskiy wrote:
> On POWERNV platform, in order to do DMA via IOMMU (i.e. 32bit DMA in
> our case), a device needs an iommu_table pointer set via
> set_iommu_table_base().
> 
> The codeflow is:
> - pnv_pci_ioda2_setup_dma_pe()
>   - pnv_pci_ioda2_setup_default_config()
>   - pnv_ioda_setup_bus_dma() [1]
> 
> pnv_pci_ioda2_setup_dma_pe() creates IOMMU groups,
> pnv_pci_ioda2_setup_default_config() does default DMA setup,
> pnv_ioda_setup_bus_dma() takes a bus PE (on IODA2, all physical function
> PEs act as bus PEs except NPU), walks through all underlying buses and
> devices, adds all devices to an IOMMU group and sets iommu_table.
> 
> On IODA2, when VFIO is used, it takes ownership over a PE which means it
> removes all tables and creates new ones (with a possibility of sharing
> them among PEs). So when the ownership is returned from VFIO to
> the kernel, the iommu_table pointer written to a device at [1] is
> stale and needs an update.
> 
> This adds an "add_to_group" parameter to pnv_ioda_setup_bus_dma()
> (in fact re-adds as it used to be there a while ago for different
> reasons) to tell the helper if a device needs to be added to
> an IOMMU group with an iommu_table update or just the latter.
> 
> This calls pnv_ioda_setup_bus_dma(..., false) from
> pnv_ioda2_release_ownership() so when the ownership is restored,
> 32bit DMA can work again for a device. This does the same thing
> on obtaining ownership as the iommu_table pointer is stale at this point
> anyway and it is safer to have NULL there.
> 
> We did not hit this earlier as all tested devices in recent years were
> only using 64bit DMA; the rare exception for this is MPT3 SAS adapter
> which uses both 32bit and 64bit DMA access and it has not been tested
> with VFIO much.
> 
> Cc: Gavin Shan 
> Signed-off-by: Alexey Kardashevskiy 
> Acked-by: Gavin Shan 
> Reviewed-by: David Gibson 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/db08e1d53034a54fe177ced70476fd

cheers


Re: [kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested

2017-03-14 Thread Michael Ellerman
On Wed, 2017-02-22 at 04:43:59 UTC, Alexey Kardashevskiy wrote:
> The IODA2 specification says that a 64-bit DMA address cannot use the top 4 bits
> (3 are reserved and one is a "TVE select"); bottom page_shift bits
> cannot be used for multilevel table addressing either.
> 
> The existing IODA2 table allocation code aligns the minimum TCE table
> size to PAGE_SIZE so in the case of 64K system pages and 4K IOMMU pages,
> we have 64-4-12=48 bits. Since 64K page stores 8192 TCEs, i.e. needs
> 13 bits, the maximum number of levels is 48/13 = 3 so we physically
> cannot address more and EEH happens on DMA accesses.
> 
> This adds a check that too many levels were requested.
> 
> It is still possible to have 5 levels in the case of 4K system page size.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Acked-by: Gavin Shan 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/7aafac11e308d37ed3c509829bb43d

cheers


Re: [1/3] powerpc/64s: fix handling of non-synchronous machine checks

2017-03-14 Thread Michael Ellerman
On Tue, 2017-02-28 at 02:00:46 UTC, Nicholas Piggin wrote:
> A synchronous machine check is an exception raised by the attempt to
> execute the current instruction. If the error can't be corrected, it
> can make sense to SIGBUS the currently running process.
> 
> In other cases, the error condition is not related to the current
> instruction, so killing the current process is not the right thing to
> do.
> 
> Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no
> practical change. It will be used to handle POWER9 asynchronous
> machine checks.
> 
> Signed-off-by: Nicholas Piggin 

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/1363875bdb6317a2d0798284d7aaf3

cheers


Re: selftests/powerpc: Replace stxvx and lxvx with their equivalent instruction

2017-03-14 Thread Michael Ellerman
On Tue, 2017-03-07 at 00:39:31 UTC, Cyril Bur wrote:
> On POWER8 (ISA 2.07) lxvx and stxvx are defined to be extended mnemonics
> of lxvd2x and stxvd2x. For POWER9 (ISA 3.0) the HW architects in their
> infinite wisdom made lxvx and stxvx instructions in their own right.
> 
> POWER9 aware GCC will use the POWER9 instruction for lxvx and stxvx
> causing these selftests to fail on POWER8. Further compounding the
> issue, because of the way -mvsx works it will cause the power9
> instructions to be used regardless of -mcpu=power8 to GCC or -mpower8 to
> AS.
> 
> The safest way to address the problem for now is to not use the extended
> mnemonic. These tests only perform register comparisons; the big-endian-only
> byte ordering for stxvd2x and lxvd2x does not impact the test.
> 
> Signed-off-by: Cyril Bur 
> Signed-off-by: Cyril Bur 
> Acked-by: Balbir Singh

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/605df8d674ac65e044a0bf4998b28c

cheers


Re: powerpc/boot: Fix zImage TOC alignment

2017-03-14 Thread Michael Ellerman
On Tue, 2017-03-07 at 23:40:21 UTC, Michael Ellerman wrote:
> Recent toolchains force the TOC to be 256 byte aligned. We need to
> enforce this alignment in the zImage linker script, otherwise pointers
> to our TOC variables (__toc_start) could be incorrect. If the actual
> start of the TOC and __toc_start don't have the same value we crash
> early in the zImage wrapper.
> 
> Cc: sta...@vger.kernel.org
> Suggested-by: Alan Modra 
> Signed-off-by: Michael Ellerman 

Applied to powerpc fixes.

https://git.kernel.org/powerpc/c/97ee351b50a49717543533cfb85b4b

cheers


Re: powerpc: Fix crash introduced with commit 5657933dbb6e

2017-03-14 Thread Michael Ellerman
On Fri, 2017-03-10 at 02:33:51 UTC, Larry Finger wrote:
> Code inserted during the merge window between kernels 4.10 and 4.11-rc1
> caused an early panic quickly followed by a complete shutdown for
> PowerPC. The traceback was not displayed long enough to read or
> photograph, thus it is not reproduced here.
> 
> The problem was bisected to commit 5657933dbb6e ("treewide: Move dma_ops
> from struct dev_archdata into struct device"). The problem was a missing
> copy of critical information from struct dev_archdata into struct device,
> leading to the dereference of a null pointer.
> 
> This fix was suggested by Benjamin Herrenschmidt.
> 
> Fixes: commit 5657933dbb6e ("treewide: Move dma_ops from struct dev_archdata 
> into struct device")
> Signed-off-by: Larry Finger 
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: Benjamin Herrenschmidt 
> Cc: regressi...@leemhuis.info 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/46f401c4297a2232a037ad8801b6c8

cheers


Re: syscall statx not implemented on powerpc

2017-03-14 Thread Michael Ellerman
Chris Packham  writes:

> On 13/03/17 21:52, Chandan Rajendra wrote:
>> On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote:
>>> Hi,
>>>
>>> I've just attempted to build a powerpc kernel from 4.11-rc2 using a
>>> custom defconfig (available on request) and I'm hitting the following
>>> error in the early stages of compilation.
>>>
>>> :1325:2: error: #warning syscall statx not implemented [-Werror=cpp]
>>>
>>> Same thing seems to happen with mpc85xx_basic_defconfig.
>>>
>>> I don't actually need this syscall so I'd be happy to turn something off
>>> to get things building. I did a quick search and couldn't see anything
>>> on linuxppc-dev but google keeps correcting "statx" to "stats" so I
>>> could have missed it.
>>>
>>
>> The upstream commit
>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f)
>> that introduces the statx syscall provides a test program. I will wire-up the
>> syscall on ppc64, run that test program and post the patch if the test 
>> program
>> works well.
>>
>
> Thanks, I'd be happy to test a patch here.
>
> In the meantime I worked around the build issue by adding __IGNORE_statx 
> to checksyscalls.sh.

Is it actually breaking the build? It's meant to be a warning, and
that's all I get.

cheers
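
For reference, the eventual wire-up on powerpc boils down to entries along
these lines (the syscall number below is illustrative only; the real value
comes from the wire-up patch):

	/* arch/powerpc/include/asm/systbl.h */
	SYSCALL(statx)

	/* arch/powerpc/include/asm/unistd.h */
	#define NR_syscalls	384	/* bumped by one; illustrative */

	/* arch/powerpc/include/uapi/asm/unistd.h */
	#define __NR_statx	383	/* illustrative */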


Re: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t

2017-03-14 Thread Michael Ellerman
Elena Reshetova  writes:

> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.
>
> Signed-off-by: Elena Reshetova 
> Signed-off-by: Hans Liljestrand 
> Signed-off-by: Kees Cook 
> Signed-off-by: David Windsor 
> ---
>  drivers/md/md.c | 6 +++---
>  drivers/md/md.h | 3 ++-
>  2 files changed, 5 insertions(+), 4 deletions(-)

When booting linux-next (specifically 5be4921c9958ec) I'm seeing the
backtrace below. I suspect this patch is just exposing an existing
issue?

cheers


[0.230738] md: Waiting for all devices to be available before autodetect
[0.230742] md: If you don't use raid, use raid=noautodetect
[0.230962] refcount_t: increment on 0; use-after-free.
[0.230988] [ cut here ]
[0.230996] WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 
.refcount_inc+0x5c/0x70
[0.231001] Modules linked in:
[0.231006] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.11.0-rc1-gccN-next-20170310-g5be4921 #1
[0.231012] task: c0004940 task.stack: c0004944
[0.231016] NIP: c05ac6bc LR: c05ac6b8 CTR: c0743390
[0.231021] REGS: c00049443160 TRAP: 0700   Not tainted  
(4.11.0-rc1-gccN-next-20170310-g5be4921)
[0.231026] MSR: 80029032 
[0.231033]   CR: 24024422  XER: 000c
[0.231038] CFAR: c0a5356c SOFTE: 1 
[0.231038] GPR00: c05ac6b8 c000494433e0 c1079d00 
002b 
[0.231038] GPR04:  00ef  
c10418a0 
[0.231038] GPR08: 4af8 c0ecc9a8 c0ecc9a8 
 
[0.231038] GPR12: 28024824 c6bb  
c00049443a00 
[0.231038] GPR16:  c00049443a10  
 
[0.231038] GPR20:   c0f7dd20 
 
[0.231038] GPR24: 014080c0 c12060b8 c1206080 
0009 
[0.231038] GPR28: c0f7dde0 0090  
c000461ae800 
[0.231100] NIP [c05ac6bc] .refcount_inc+0x5c/0x70
[0.231104] LR [c05ac6b8] .refcount_inc+0x58/0x70
[0.231108] Call Trace:
[0.231112] [c000494433e0] [c05ac6b8] .refcount_inc+0x58/0x70 
(unreliable)
[0.231120] [c00049443450] [c086c008] .mddev_find+0x1e8/0x430
[0.231125] [c00049443530] [c0872b6c] .md_open+0x2c/0x140
[0.231132] [c000494435c0] [c03962a4] .__blkdev_get+0xd4/0x520
[0.231138] [c00049443690] [c0396cc0] .blkdev_get+0x1c0/0x4f0
[0.231145] [c00049443790] [c0336d64] 
.do_dentry_open.isra.1+0x2a4/0x410
[0.231152] [c00049443830] [c03523f4] .path_openat+0x624/0x1580
[0.231157] [c00049443990] [c0354ce4] .do_filp_open+0x84/0x120
[0.231163] [c00049443b10] [c0338d74] .do_sys_open+0x214/0x300
[0.231170] [c00049443be0] [c0da69ac] .md_run_setup+0xa0/0xec
[0.231176] [c00049443c60] [c0da4fbc] 
.prepare_namespace+0x60/0x240
[0.231182] [c00049443ce0] [c0da47a8] 
.kernel_init_freeable+0x330/0x36c
[0.231190] [c00049443db0] [c000dc44] .kernel_init+0x24/0x160
[0.231197] [c00049443e30] [c000badc] 
.ret_from_kernel_thread+0x58/0x7c
[0.231202] Instruction dump:
[0.231206] 6000 3d22ffee 89296bfb 2f89 409effdc 3c62ffc6 3921 
3d42ffee 
[0.231216] 38630928 992a6bfb 484a6e79 6000 <0fe0> 4bb8 6000 
6000 
[0.231226] ---[ end trace 8c51f269ad91ffc2 ]---
[0.231233] md: Autodetecting RAID arrays.
[0.231236] md: autorun ...
[0.231239] md: ... autorun DONE.
[0.234188] EXT4-fs (sda4): mounting ext3 file system using the ext4 
subsystem
[0.250506] refcount_t: underflow; use-after-free.
[0.250531] [ cut here ]
[0.250537] WARNING: CPU: 0 PID: 3 at lib/refcount.c:207 
.refcount_dec_not_one+0x104/0x120
[0.250542] Modules linked in:
[0.250546] CPU: 0 PID: 3 Comm: kworker/0:0 Tainted: GW   
4.11.0-rc1-gccN-next-20170310-g5be4921 #1
[0.250553] Workqueue: events .delayed_fput
[0.250557] task: c00049404900 task.stack: c00049448000
[0.250562] NIP: c05ac964 LR: c05ac960 CTR: c0743390
[0.250567] REGS: c0004944b530 TRAP: 0700   Tainted: GW
(4.11.0-rc1-gccN-next-20170310-g5be4921)
[0.250572] MSR: 80029032 
[0.250578]   CR: 24002422  XER: 0007
[0.250584] CFAR: c0a5356c SOFTE: 1 
[0.250584] GPR00: c05ac960 c0004944b7b0 c1079d00 
0026 
[0.250584] GPR04:  0113  
c10418a0 
[0.250584] GPR08: 00

[PATCH 0/6] table-based machine check handlers

2017-03-14 Thread Nicholas Piggin
Hi,

This is a rebase of the table based MCE patches on top of the POWER9
handler that was merged upstream. Plus a couple of misc things (patches
1 and 6).

The plan is to have OPAL firmware do most of this recovery and parsing
in future, but I think it's worth tidying this up. IMO it's easier to
see what's going on and less error prone to change or add things.

Thanks,
Nick

Nicholas Piggin (6):
  powerpc/64s: machine check print NIP
  powerpc/64s: clean up machine check recovery flushing
  powerpc/64s: move POWER machine check defines into mce_power.c
  powerpc/64s: data driven machine check evaluation
  powerpc/64s: data driven machine check handling
  powerpc/64s: POWER8 add missing machine check definitions

 arch/powerpc/include/asm/mce.h  |  91 -
 arch/powerpc/kernel/mce.c   |   3 +-
 arch/powerpc/kernel/mce_power.c | 776 +++-
 3 files changed, 369 insertions(+), 501 deletions(-)

-- 
2.11.0



[PATCH 1/6] powerpc/64s: machine check print NIP

2017-03-14 Thread Nicholas Piggin
Print the faulting address of the machine check, which may help with
debugging. The effective address reported can be a target memory address
rather than the faulting instruction address.

Fix up a dangling bracket while here.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index a1475e6aef3a..399aeafb7dd4 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -310,7 +310,8 @@ void machine_check_print_event_info(struct 
machine_check_event *evt)
 
printk("%s%s Machine check interrupt [%s]\n", level, sevstr,
   evt->disposition == MCE_DISPOSITION_RECOVERED ?
-  "Recovered" : "[Not recovered");
+  "Recovered" : "Not recovered");
+   printk("%s  NIP: %016llx\n", level, evt->srr0);
printk("%s  Initiator: %s\n", level,
   evt->initiator == MCE_INITIATOR_CPU ? "CPU" : "Unknown");
switch (evt->error_type) {
-- 
2.11.0



[PATCH 2/6] powerpc/64s: clean up machine check recovery flushing

2017-03-14 Thread Nicholas Piggin
Use the flush function introduced with the POWER9 machine check handler
for POWER7 and 8, rather than open coding it multiple times in callers.

There is a specific ERAT flush type introduced for POWER9, but the
POWER7-8 ERAT errors continue to do SLB flushing (which also flushes
ERAT), so as not to introduce functional changes with this cleanup
patch.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 97 ++---
 1 file changed, 23 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 763d6f58caa8..0f35a88e3655 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -161,81 +161,27 @@ static int mce_handle_flush_derrors(uint64_t dsisr, 
uint64_t slb, uint64_t tlb,
return 1;
 }
 
-static long mce_handle_derror(uint64_t dsisr, uint64_t slb_error_bits)
-{
-   long handled = 1;
-
-   /*
-* flush and reload SLBs for SLB errors and flush TLBs for TLB errors.
-* reset the error bits whenever we handle them so that at the end
-* we can check whether we handled all of them or not.
-* */
-#ifdef CONFIG_PPC_STD_MMU_64
-   if (dsisr & slb_error_bits) {
-   flush_and_reload_slb();
-   /* reset error bits */
-   dsisr &= ~(slb_error_bits);
-   }
-   if (dsisr & P7_DSISR_MC_TLB_MULTIHIT_MFTLB) {
-   if (cur_cpu_spec && cur_cpu_spec->flush_tlb)
-   cur_cpu_spec->flush_tlb(TLB_INVAL_SCOPE_GLOBAL);
-   /* reset error bits */
-   dsisr &= ~P7_DSISR_MC_TLB_MULTIHIT_MFTLB;
-   }
-#endif
-   /* Any other errors we don't understand? */
-   if (dsisr & 0xUL)
-   handled = 0;
-
-   return handled;
-}
-
 static long mce_handle_derror_p7(uint64_t dsisr)
 {
-   return mce_handle_derror(dsisr, P7_DSISR_MC_SLB_ERRORS);
+   return mce_handle_flush_derrors(dsisr,
+   P7_DSISR_MC_SLB_ERRORS,
+   P7_DSISR_MC_TLB_MULTIHIT_MFTLB,
+   0);
 }
 
-static long mce_handle_common_ierror(uint64_t srr1)
+static long mce_handle_ierror_p7(uint64_t srr1)
 {
-   long handled = 0;
-
switch (P7_SRR1_MC_IFETCH(srr1)) {
-   case 0:
-   break;
-#ifdef CONFIG_PPC_STD_MMU_64
case P7_SRR1_MC_IFETCH_SLB_PARITY:
case P7_SRR1_MC_IFETCH_SLB_MULTIHIT:
-   /* flush and reload SLBs for SLB errors. */
-   flush_and_reload_slb();
-   handled = 1;
-   break;
+   case P7_SRR1_MC_IFETCH_SLB_BOTH:
+   return mce_flush(MCE_FLUSH_SLB);
+
case P7_SRR1_MC_IFETCH_TLB_MULTIHIT:
-   if (cur_cpu_spec && cur_cpu_spec->flush_tlb) {
-   cur_cpu_spec->flush_tlb(TLB_INVAL_SCOPE_GLOBAL);
-   handled = 1;
-   }
-   break;
-#endif
+   return mce_flush(MCE_FLUSH_TLB);
default:
-   break;
-   }
-
-   return handled;
-}
-
-static long mce_handle_ierror_p7(uint64_t srr1)
-{
-   long handled = 0;
-
-   handled = mce_handle_common_ierror(srr1);
-
-#ifdef CONFIG_PPC_STD_MMU_64
-   if (P7_SRR1_MC_IFETCH(srr1) == P7_SRR1_MC_IFETCH_SLB_BOTH) {
-   flush_and_reload_slb();
-   handled = 1;
+   return 0;
}
-#endif
-   return handled;
 }
 
 static void mce_get_common_ierror(struct mce_error_info *mce_err, uint64_t 
srr1)
@@ -376,22 +322,25 @@ static void mce_get_derror_p8(struct mce_error_info 
*mce_err, uint64_t dsisr)
 
 static long mce_handle_ierror_p8(uint64_t srr1)
 {
-   long handled = 0;
-
-   handled = mce_handle_common_ierror(srr1);
+   switch (P7_SRR1_MC_IFETCH(srr1)) {
+   case P7_SRR1_MC_IFETCH_SLB_PARITY:
+   case P7_SRR1_MC_IFETCH_SLB_MULTIHIT:
+   case P8_SRR1_MC_IFETCH_ERAT_MULTIHIT:
+   return mce_flush(MCE_FLUSH_SLB);
 
-#ifdef CONFIG_PPC_STD_MMU_64
-   if (P7_SRR1_MC_IFETCH(srr1) == P8_SRR1_MC_IFETCH_ERAT_MULTIHIT) {
-   flush_and_reload_slb();
-   handled = 1;
+   case P7_SRR1_MC_IFETCH_TLB_MULTIHIT:
+   return mce_flush(MCE_FLUSH_TLB);
+   default:
+   return 0;
}
-#endif
-   return handled;
 }
 
 static long mce_handle_derror_p8(uint64_t dsisr)
 {
-   return mce_handle_derror(dsisr, P8_DSISR_MC_SLB_ERRORS);
+   return mce_handle_flush_derrors(dsisr,
+   P8_DSISR_MC_SLB_ERRORS,
+   P7_DSISR_MC_TLB_MULTIHIT_MFTLB,
+   0);
 }
 
 long __machine_check_early_realmode_p8(struct pt_regs *regs)
-- 
2.11.0



[PATCH 3/6] powerpc/64s: move POWER machine check defines into mce_power.c

2017-03-14 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/mce.h  | 91 
 arch/powerpc/kernel/mce_power.c | 92 +
 2 files changed, 92 insertions(+), 91 deletions(-)

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index ed62efe01e49..e3498b446788 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -24,97 +24,6 @@
 
 #include 
 
-/*
- * Machine Check bits on power7 and power8
- */
-#define P7_SRR1_MC_LOADSTORE(srr1) ((srr1) & PPC_BIT(42)) /* P8 too */
-
-/* SRR1 bits for machine check (On Power7 and Power8) */
-#define P7_SRR1_MC_IFETCH(srr1)((srr1) & PPC_BITMASK(43, 45)) /* P8 
too */
-
-#define P7_SRR1_MC_IFETCH_UE   (0x1 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_SLB_PARITY   (0x2 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_SLB_MULTIHIT (0x3 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_SLB_BOTH (0x4 << PPC_BITLSHIFT(45))
-#define P7_SRR1_MC_IFETCH_TLB_MULTIHIT (0x5 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_UE_TLB_RELOAD(0x6 << PPC_BITLSHIFT(45)) /* 
P8 too */
-#define P7_SRR1_MC_IFETCH_UE_IFU_INTERNAL  (0x7 << PPC_BITLSHIFT(45))
-
-/* SRR1 bits for machine check (On Power8) */
-#define P8_SRR1_MC_IFETCH_ERAT_MULTIHIT(0x4 << PPC_BITLSHIFT(45))
-
-/* DSISR bits for machine check (On Power7 and Power8) */
-#define P7_DSISR_MC_UE (PPC_BIT(48))   /* P8 too */
-#define P7_DSISR_MC_UE_TABLEWALK   (PPC_BIT(49))   /* P8 too */
-#define P7_DSISR_MC_ERAT_MULTIHIT  (PPC_BIT(52))   /* P8 too */
-#define P7_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53))   /* P8 too */
-#define P7_DSISR_MC_SLB_PARITY_MFSLB   (PPC_BIT(55))   /* P8 too */
-#define P7_DSISR_MC_SLB_MULTIHIT   (PPC_BIT(56))   /* P8 too */
-#define P7_DSISR_MC_SLB_MULTIHIT_PARITY(PPC_BIT(57))   /* P8 too */
-
-/*
- * DSISR bits for machine check (Power8) in addition to above.
- * Secondary DERAT Multihit
- */
-#define P8_DSISR_MC_ERAT_MULTIHIT_SEC  (PPC_BIT(54))
-
-/* SLB error bits */
-#define P7_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_ERAT_MULTIHIT | \
-P7_DSISR_MC_SLB_PARITY_MFSLB | \
-P7_DSISR_MC_SLB_MULTIHIT | \
-P7_DSISR_MC_SLB_MULTIHIT_PARITY)
-
-#define P8_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_SLB_ERRORS | \
-P8_DSISR_MC_ERAT_MULTIHIT_SEC)
-
-/*
- * Machine Check bits on power9
- */
-#define P9_SRR1_MC_LOADSTORE(srr1) (((srr1) >> PPC_BITLSHIFT(42)) & 1)
-
-#define P9_SRR1_MC_IFETCH(srr1)(   \
-   PPC_BITEXTRACT(srr1, 45, 0) |   \
-   PPC_BITEXTRACT(srr1, 44, 1) |   \
-   PPC_BITEXTRACT(srr1, 43, 2) |   \
-   PPC_BITEXTRACT(srr1, 36, 3) )
-
-/* 0 is reserved */
-#define P9_SRR1_MC_IFETCH_UE   1
-#define P9_SRR1_MC_IFETCH_SLB_PARITY   2
-#define P9_SRR1_MC_IFETCH_SLB_MULTIHIT 3
-#define P9_SRR1_MC_IFETCH_ERAT_MULTIHIT4
-#define P9_SRR1_MC_IFETCH_TLB_MULTIHIT 5
-#define P9_SRR1_MC_IFETCH_UE_TLB_RELOAD6
-/* 7 is reserved */
-#define P9_SRR1_MC_IFETCH_LINK_TIMEOUT 8
-#define P9_SRR1_MC_IFETCH_LINK_TABLEWALK_TIMEOUT   9
-/* 10 ? */
-#define P9_SRR1_MC_IFETCH_RA   11
-#define P9_SRR1_MC_IFETCH_RA_TABLEWALK 12
-#define P9_SRR1_MC_IFETCH_RA_ASYNC_STORE   13
-#define P9_SRR1_MC_IFETCH_LINK_ASYNC_STORE_TIMEOUT 14
-#define P9_SRR1_MC_IFETCH_RA_TABLEWALK_FOREIGN 15
-
-/* DSISR bits for machine check (On Power9) */
-#define P9_DSISR_MC_UE (PPC_BIT(48))
-#define P9_DSISR_MC_UE_TABLEWALK   (PPC_BIT(49))
-#define P9_DSISR_MC_LINK_LOAD_TIMEOUT  (PPC_BIT(50))
-#define P9_DSISR_MC_LINK_TABLEWALK_TIMEOUT (PPC_BIT(51))
-#define P9_DSISR_MC_ERAT_MULTIHIT  (PPC_BIT(52))
-#define P9_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53))
-#define P9_DSISR_MC_USER_TLBIE (PPC_BIT(54))
-#define P9_DSISR_MC_SLB_PARITY_MFSLB   (PPC_BIT(55))
-#define P9_DSISR_MC_SLB_MULTIHIT_MFSLB (PPC_BIT(56))
-#define P9_DSISR_MC_RA_LOAD(PPC_BIT(57))
-#define P9_DSISR_MC_RA_TABLEWALK   (PPC_BIT(58))
-#define P9_DSISR_MC_RA_TABLEWALK_FOREIGN   (PPC_BIT(59))
-#define P9_DSISR_MC_RA_FOREIGN (PPC_BIT(60))
-
-/* SLB error bits */
-#define P9_DSISR_MC_SLB_ERRORS (P9_DSISR_MC_ERAT_MULTIHIT | \
-P9_DSISR_MC_SLB_PARITY_MFSLB | \
-P9_DSISR_MC_SLB_MULTIHIT_MFSLB)
-
 enum MCE_Version {
MCE_V1 = 1,
 };
diff --git a/arch/p

[PATCH 4/6] powerpc/64s: data driven machine check evaluation

2017-03-14 Thread Nicholas Piggin
Have machine types define i-side and d-side tables to describe their
machine check encodings, and match entries to evaluate (for reporting)
machine checks.

Functionality is mostly unchanged (tested with a userspace harness), but
it does make a change in that it no longer records DAR as the effective
address for those errors where it is specified to be invalid (which is a
reporting change only).
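
The evaluation then becomes a first-match walk over the per-CPU table, along
these lines (a sketch of the approach only; the helper name is made up, and
the real routine below also fills in the MCE event and effective address):

	static const struct mce_ierror_table *mce_find_ierror(
			const struct mce_ierror_table table[], uint64_t srr1)
	{
		int i;

		for (i = 0; table[i].srr1_mask; i++) {
			if ((srr1 & table[i].srr1_mask) != table[i].srr1_value)
				continue;
			return &table[i];	/* first matching encoding wins */
		}
		return NULL;			/* unrecognised machine check */
	}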

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 571 +++-
 1 file changed, 328 insertions(+), 243 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 9b3bcd1213bf..1b38fabdd2a0 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -253,91 +253,304 @@ static int mce_handle_flush_derrors(uint64_t dsisr, 
uint64_t slb, uint64_t tlb,
 P9_DSISR_MC_SLB_PARITY_MFSLB | \
 P9_DSISR_MC_SLB_MULTIHIT_MFSLB)
 
-static long mce_handle_derror_p7(uint64_t dsisr)
-{
-   return mce_handle_flush_derrors(dsisr,
-   P7_DSISR_MC_SLB_ERRORS,
-   P7_DSISR_MC_TLB_MULTIHIT_MFTLB,
-   0);
-}
-
-static long mce_handle_ierror_p7(uint64_t srr1)
+struct mce_ierror_table {
+   unsigned long srr1_mask;
+   unsigned long srr1_value;
+   bool nip_valid; /* nip is a valid indicator of faulting address */
+   unsigned int error_type;
+   unsigned int error_subtype;
+   unsigned int initiator;
+   unsigned int severity;
+};
+
+static const struct mce_ierror_table mce_p7_ierror_table[] = {
+{ 0x001c, 0x0004, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0008, true,
+  MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x000c, true,
+  MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0010, true,
+  MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_INDETERMINATE, /* BOTH */
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0014, true,
+  MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0018, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x001c, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0, 0, 0, 0, 0, 0 } };
+
+static const struct mce_ierror_table mce_p8_ierror_table[] = {
+{ 0x001c, 0x0004, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0008, true,
+  MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x000c, true,
+  MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0010, true,
+  MCE_ERROR_TYPE_ERAT,MCE_ERAT_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0014, true,
+  MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x0018, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x001c, 0x001c, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0, 0, 0, 0, 0, 0 } };
+
+static const struct mce_ierror_table mce_p9_ierror_table[] = {
+{ 0x081c, 0x0004, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0008, true,
+  MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x000c, true,
+  MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0010, true,
+  MCE_ERROR_TYPE_ERAT,MCE_ERAT_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0014, true,
+  MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0018, true,
+  MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0800, true,
+  MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_IFETCH_TIMEOUT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0804, true,
+  MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_PAGE_TABLE_WALK_IFETCH_TIMEOUT

[PATCH 5/6] powerpc/64s: data driven machine check handling

2017-03-14 Thread Nicholas Piggin
Move the handling (corrective action) of machine checks to the table
based evaluation.

This changes P7 and P8 ERAT flushing from using SLB flush to using ERAT
flush.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 328 +---
 1 file changed, 74 insertions(+), 254 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 1b38fabdd2a0..95ed9d85ebcf 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -147,111 +147,7 @@ static int mce_flush(int what)
return 0;
 }
 
-static int mce_handle_flush_derrors(uint64_t dsisr, uint64_t slb, uint64_t 
tlb, uint64_t erat)
-{
-   if ((dsisr & slb) && mce_flush(MCE_FLUSH_SLB))
-   dsisr &= ~slb;
-   if ((dsisr & erat) && mce_flush(MCE_FLUSH_ERAT))
-   dsisr &= ~erat;
-   if ((dsisr & tlb) && mce_flush(MCE_FLUSH_TLB))
-   dsisr &= ~tlb;
-   /* Any other errors we don't understand? */
-   if (dsisr)
-   return 0;
-   return 1;
-}
-
-
-/*
- * Machine Check bits on power7 and power8
- */
-#define P7_SRR1_MC_LOADSTORE(srr1) ((srr1) & PPC_BIT(42)) /* P8 too */
-
-/* SRR1 bits for machine check (On Power7 and Power8) */
-#define P7_SRR1_MC_IFETCH(srr1)((srr1) & PPC_BITMASK(43, 45)) /* P8 
too */
-
-#define P7_SRR1_MC_IFETCH_UE   (0x1 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_SLB_PARITY   (0x2 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_SLB_MULTIHIT (0x3 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_SLB_BOTH (0x4 << PPC_BITLSHIFT(45))
-#define P7_SRR1_MC_IFETCH_TLB_MULTIHIT (0x5 << PPC_BITLSHIFT(45)) /* P8 too */
-#define P7_SRR1_MC_IFETCH_UE_TLB_RELOAD(0x6 << PPC_BITLSHIFT(45)) /* 
P8 too */
-#define P7_SRR1_MC_IFETCH_UE_IFU_INTERNAL  (0x7 << PPC_BITLSHIFT(45))
-
-/* SRR1 bits for machine check (On Power8) */
-#define P8_SRR1_MC_IFETCH_ERAT_MULTIHIT(0x4 << PPC_BITLSHIFT(45))
-
-/* DSISR bits for machine check (On Power7 and Power8) */
-#define P7_DSISR_MC_UE (PPC_BIT(48))   /* P8 too */
-#define P7_DSISR_MC_UE_TABLEWALK   (PPC_BIT(49))   /* P8 too */
-#define P7_DSISR_MC_ERAT_MULTIHIT  (PPC_BIT(52))   /* P8 too */
-#define P7_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53))   /* P8 too */
-#define P7_DSISR_MC_SLB_PARITY_MFSLB   (PPC_BIT(55))   /* P8 too */
-#define P7_DSISR_MC_SLB_MULTIHIT   (PPC_BIT(56))   /* P8 too */
-#define P7_DSISR_MC_SLB_MULTIHIT_PARITY(PPC_BIT(57))   /* P8 too */
-
-/*
- * DSISR bits for machine check (Power8) in addition to above.
- * Secondary DERAT Multihit
- */
-#define P8_DSISR_MC_ERAT_MULTIHIT_SEC  (PPC_BIT(54))
-
-/* SLB error bits */
-#define P7_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_ERAT_MULTIHIT | \
-P7_DSISR_MC_SLB_PARITY_MFSLB | \
-P7_DSISR_MC_SLB_MULTIHIT | \
-P7_DSISR_MC_SLB_MULTIHIT_PARITY)
-
-#define P8_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_SLB_ERRORS | \
-P8_DSISR_MC_ERAT_MULTIHIT_SEC)
-
-/*
- * Machine Check bits on power9
- */
-#define P9_SRR1_MC_LOADSTORE(srr1) (((srr1) >> PPC_BITLSHIFT(42)) & 1)
-
-#define P9_SRR1_MC_IFETCH(srr1)(   \
-   PPC_BITEXTRACT(srr1, 45, 0) |   \
-   PPC_BITEXTRACT(srr1, 44, 1) |   \
-   PPC_BITEXTRACT(srr1, 43, 2) |   \
-   PPC_BITEXTRACT(srr1, 36, 3) )
-
-/* 0 is reserved */
-#define P9_SRR1_MC_IFETCH_UE   1
-#define P9_SRR1_MC_IFETCH_SLB_PARITY   2
-#define P9_SRR1_MC_IFETCH_SLB_MULTIHIT 3
-#define P9_SRR1_MC_IFETCH_ERAT_MULTIHIT4
-#define P9_SRR1_MC_IFETCH_TLB_MULTIHIT 5
-#define P9_SRR1_MC_IFETCH_UE_TLB_RELOAD6
-/* 7 is reserved */
-#define P9_SRR1_MC_IFETCH_LINK_TIMEOUT 8
-#define P9_SRR1_MC_IFETCH_LINK_TABLEWALK_TIMEOUT   9
-/* 10 ? */
-#define P9_SRR1_MC_IFETCH_RA   11
-#define P9_SRR1_MC_IFETCH_RA_TABLEWALK 12
-#define P9_SRR1_MC_IFETCH_RA_ASYNC_STORE   13
-#define P9_SRR1_MC_IFETCH_LINK_ASYNC_STORE_TIMEOUT 14
-#define P9_SRR1_MC_IFETCH_RA_TABLEWALK_FOREIGN 15
-
-/* DSISR bits for machine check (On Power9) */
-#define P9_DSISR_MC_UE (PPC_BIT(48))
-#define P9_DSISR_MC_UE_TABLEWALK   (PPC_BIT(49))
-#define P9_DSISR_MC_LINK_LOAD_TIMEOUT  (PPC_BIT(50))
-#define P9_DSISR_MC_LINK_TABLEWALK_TIMEOUT (PPC_BIT(51))
-#define P9_DSISR_MC_ERAT_MULTIHIT  (PPC_BIT(52))
-#define P9_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53))
-#define P9_DSISR_MC_USER_TLBIE (PPC_BIT(54))
-#define P9_DSISR_MC_SLB_PARITY_MFSLB   (PPC_BIT(55))
-#define P9_DSISR_MC_SLB_MULTIHIT_MFSLB

[PATCH 6/6] powerpc/64s: POWER8 add missing machine check definitions

2017-03-14 Thread Nicholas Piggin
Like POWER9, POWER8 uses bit 36 in SRR1 for i-side machine checks, and it
has several link timeout conditions that are not currently
handled.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index 95ed9d85ebcf..736530219bc3 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -184,27 +184,33 @@ static const struct mce_ierror_table 
mce_p7_ierror_table[] = {
 { 0, 0, 0, 0, 0, 0 } };
 
 static const struct mce_ierror_table mce_p8_ierror_table[] = {
-{ 0x001c, 0x0004, true,
+{ 0x081c, 0x0004, true,
   MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
   MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
-{ 0x001c, 0x0008, true,
+{ 0x081c, 0x0008, true,
   MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY,
   MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
-{ 0x001c, 0x000c, true,
+{ 0x081c, 0x000c, true,
   MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT,
   MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
-{ 0x001c, 0x0010, true,
+{ 0x081c, 0x0010, true,
   MCE_ERROR_TYPE_ERAT,MCE_ERAT_ERROR_MULTIHIT,
   MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
-{ 0x001c, 0x0014, true,
+{ 0x081c, 0x0014, true,
   MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT,
   MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
-{ 0x001c, 0x0018, true,
+{ 0x081c, 0x0018, true,
   MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH,
   MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
-{ 0x001c, 0x001c, true,
+{ 0x081c, 0x001c, true,
   MCE_ERROR_TYPE_UE,  MCE_UE_ERROR_IFETCH,
   MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0800, true,
+  MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_IFETCH_TIMEOUT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
+{ 0x081c, 0x0804, true,
+  MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_PAGE_TABLE_WALK_IFETCH_TIMEOUT,
+  MCE_INITIATOR_CPU,  MCE_SEV_ERROR_SYNC, },
 { 0, 0, 0, 0, 0, 0 } };
 
 static const struct mce_ierror_table mce_p9_ierror_table[] = {
@@ -289,6 +295,12 @@ static const struct mce_derror_table mce_p8_derror_table[] 
= {
 { 0x4000, true,
   MCE_ERROR_TYPE_UE,   MCE_UE_ERROR_PAGE_TABLE_WALK_LOAD_STORE,
   MCE_INITIATOR_CPU,   MCE_SEV_ERROR_SYNC, },
+{ 0x2000, true,
+  MCE_ERROR_TYPE_LINK, MCE_LINK_ERROR_LOAD_TIMEOUT,
+  MCE_INITIATOR_CPU,   MCE_SEV_ERROR_SYNC, },
+{ 0x1000, true,
+  MCE_ERROR_TYPE_LINK, MCE_LINK_ERROR_PAGE_TABLE_WALK_LOAD_STORE_TIMEOUT,
+  MCE_INITIATOR_CPU,   MCE_SEV_ERROR_SYNC, },
 { 0x0800, true,
   MCE_ERROR_TYPE_ERAT, MCE_ERAT_ERROR_MULTIHIT,
   MCE_INITIATOR_CPU,   MCE_SEV_ERROR_SYNC, },
-- 
2.11.0



Re: [PATCH kernel] powerpc/powernv/ioda2: Update iommu table base on ownership change

2017-03-14 Thread Mauricio Faria de Oliveira

On 02/20/2017 11:41 PM, Alexey Kardashevskiy wrote:

Cc: Gavin Shan 
Signed-off-by: Alexey Kardashevskiy 


Tested-by: Mauricio Faria de Oliveira 

P.S.: sorry, late, but for the record.


--
Mauricio Faria de Oliveira
IBM Linux Technology Center



Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c

2017-03-14 Thread Arnd Bergmann
On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman  wrote:
> Michael Ellerman  writes:
>
>> We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it
>> from electra_cf.c which is a powerpc-only driver.
>>
>> Signed-off-by: Michael Ellerman 
>> ---
>>  drivers/pcmcia/electra_cf.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Ping anyone?
>
> Or should I merge this via the powerpc tree?

That's what I would recommend for a powerpc specific pcmcia driver, yes.

Looking at the bigger picture of powerpc drivers using NO_IRQ, I also
see these others:

drivers/ata/pata_mpc52xx.c: if (ata_irq == NO_IRQ) {
drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ
drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0
drivers/ata/sata_dwc_460ex.c:   if (hsdev->dma->irq == NO_IRQ) {
drivers/ata/sata_dwc_460ex.c:   if (irq == NO_IRQ) {
drivers/iommu/fsl_pamu.c:   if (irq == NO_IRQ) {
drivers/iommu/fsl_pamu.c:   if (irq != NO_IRQ)
drivers/media/platform/fsl-viu.c:   if (viu_irq == NO_IRQ) {
drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) {
drivers/pcmcia/electra_cf.c:cf->irq = NO_IRQ;
drivers/pcmcia/electra_cf.c:if (cf->irq != NO_IRQ)
drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
no interrupt is pending. */
drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
no interrupt is pending. */
drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_low == NO_IRQ) {
drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_high != NO_IRQ &&
drivers/spi/spi-mpc52xx.c:  if (status && (irq != NO_IRQ))
drivers/tty/ehv_bytechan.c: if (stdout_irq == NO_IRQ) {
drivers/tty/ehv_bytechan.c: if ((bc->rx_irq == NO_IRQ) ||
(bc->tx_irq == NO_IRQ)) {
drivers/tty/serial/cpm_uart/cpm_uart_core.c:if (pinfo->port.irq == NO_IRQ) {
drivers/uio/uio_fsl_elbc_gpcm.c:if (irq != NO_IRQ) {
drivers/uio/uio_fsl_elbc_gpcm.c:irq = NO_IRQ;
drivers/uio/uio_fsl_elbc_gpcm.c: irq != NO_IRQ ? irq : -1);
drivers/usb/host/ehci-grlib.c:  if (irq == NO_IRQ) {
drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) {
drivers/usb/host/fhci-hcd.c:if (usb_irq == NO_IRQ) {
drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) {
drivers/usb/host/uhci-grlib.c:  if (irq == NO_IRQ) {
drivers/video/fbdev/mb862xx/mb862xxfbdrv.c: if (par->irq == NO_IRQ) {
drivers/virt/fsl_hypervisor.c:  if (!handle || (irq == NO_IRQ)) {
include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h: if (cascade_irq == NO_IRQ)
include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
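
For reference, since NO_IRQ is 0 on powerpc, the conversions are mechanical;
an illustrative before/after (the variable names are made up):

	/* before: relies on the powerpc-only NO_IRQ constant */
	if (irq == NO_IRQ) {
		dev_err(dev, "no IRQ found\n");
		return -ENODEV;
	}

	/* after: 0 is the generic "no interrupt" value */
	if (!irq) {
		dev_err(dev, "no IRQ found\n");
		return -ENODEV;
	}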

Did you have other pending patches for those?

  Arnd


Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-14 Thread Peter Zijlstra
On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote:

> >Huh? PPC hasn't yet implemented this? Then why are you fixing it?
> 
> yes, PPC hasn't implemented this (until now).

until now where?

> And did not understand "Then why are you fixing it?"

I see no implementation; so why are you poking at it.


Re: [PATCH v5 2/5] powerpc: kretprobes: override default function entry offset

2017-03-14 Thread Arnaldo Carvalho de Melo
Em Thu, Mar 09, 2017 at 05:37:38PM +1100, Michael Ellerman escreveu:
> "Naveen N. Rao"  writes:
> > On 2017/03/08 11:29AM, Arnaldo Carvalho de Melo wrote:
> >> > I wasn't sure if you were planning on picking up KPROBES_ON_FTRACE for 
> >> > v4.11. If so, it would be good to take this patch through the powerpc 
> >> > tree. Otherwise, this can go via Ingo's tree.
> >> 
> >> If you guys convince Ingo that this should go _now_, then just cherry
> >> pick what was merged into tip/perf/core that is needed for the arch
> >> specific stuff and go from there.
> >
> > Ok, in hindsight, I think Michael's concern was actually for v4.12 
> 
> Yes I was talking about 4.12, sorry I thought that was implied :)
> 
> > itself, in which case this particular patch can go via powerpc tree, 
> > while the rest of the patches in this series can go via your tree.
> >
> > Michael?
> 
> Yeah I think that's the easiest option. The function will be temporarily
> unused until the two trees are merged, but I think that's fine.

Ok, done that, now compile testing building it in my
multi-distro/x-build containers.

- Arnaldo


Re: [PATCH 1/3] powernv:smp: Add busy-wait loop as fall back for CPU-Hotplug

2017-03-14 Thread Nicholas Piggin
On Mon, 13 Mar 2017 11:31:26 +0530
"Gautham R. Shenoy"  wrote:

> [Changelog written with inputs from sva...@linux.vnet.ibm.com]
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Nicholas Piggin 

> ---
>  arch/powerpc/platforms/powernv/smp.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/smp.c 
> b/arch/powerpc/platforms/powernv/smp.c
> index e39e6c4..8d5b99e 100644
> --- a/arch/powerpc/platforms/powernv/smp.c
> +++ b/arch/powerpc/platforms/powernv/smp.c
> @@ -192,8 +192,16 @@ static void pnv_smp_cpu_kill_self(void)
>   } else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
>  (idle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
>   srr1 = power7_sleep();
> - } else {
> + } else if (idle_states & OPAL_PM_NAP_ENABLED) {
>   srr1 = power7_nap(1);
> + } else {
> + /* This is the fallback method. We emulate snooze */
> + while (!generic_check_cpu_restart(cpu)) {
> + HMT_low();
> + HMT_very_low();
> + }
> + srr1 = 0;
> + HMT_medium();
>   }
>  
>   ppc64_runlatch_on();



RE: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t

2017-03-14 Thread Reshetova, Elena
> Elena Reshetova  writes:
> 
> > refcount_t type and corresponding API should be
> > used instead of atomic_t when the variable is used as
> > a reference counter. This allows to avoid accidental
> > refcounter overflows that might lead to use-after-free
> > situations.
> >
> > Signed-off-by: Elena Reshetova 
> > Signed-off-by: Hans Liljestrand 
> > Signed-off-by: Kees Cook 
> > Signed-off-by: David Windsor 
> > ---
> >  drivers/md/md.c | 6 +++---
> >  drivers/md/md.h | 3 ++-
> >  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> When booting linux-next (specifically 5be4921c9958ec) I'm seeing the
> backtrace below. I suspect this patch is just exposing an existing
> issue?

Yes, we have actually been following this issue in another thread. 
It looks like the object is re-used somehow, but I can't quite understand how 
just by reading the code. 
This was what I put into the previous thread:

"The log below indicates that you are using your refcounter in a bit weird way 
in mddev_find(). 
However, I can't find the place (just by reading the code) where you would 
increment refcounter from zero (vs. setting it to one).
It looks like you either iterate over existing nodes (and increment their 
counters, which should be >= 1 at the time of increment) or create a new node, 
but then mddev_init() sets the counter to 1. "
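
For comparison, the lookup-or-create pattern that refcount_t expects looks
roughly like this (purely illustrative, with a hypothetical "foo" object, not
the md code):

	#include <linux/list.h>
	#include <linux/refcount.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct foo {
		struct list_head node;
		refcount_t active;
	};

	static struct foo *foo_find_or_create(struct list_head *all, spinlock_t *lock)
	{
		struct foo *f;

		spin_lock(lock);
		list_for_each_entry(f, all, node) {
			/* only take a reference while the object is still live */
			if (refcount_inc_not_zero(&f->active)) {
				spin_unlock(lock);
				return f;
			}
		}
		spin_unlock(lock);

		f = kzalloc(sizeof(*f), GFP_KERNEL);
		if (!f)
			return NULL;
		refcount_set(&f->active, 1);	/* creation sets the count, never inc from 0 */
		/* ... insert into the list under the lock, re-checking for races ... */
		return f;
	}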

If you can help us understand what is going on with the object 
creation/destruction, it would be appreciated!

Also, Shaohua Li has held this patch back from his tree since the issue was 
caught at that time, so we are not going to merge this until we figure it out. 

Best Regards,
Elena.

> 
> cheers
> 
> 
> [0.230738] md: Waiting for all devices to be available before autodetect
> [0.230742] md: If you don't use raid, use raid=noautodetect
> [0.230962] refcount_t: increment on 0; use-after-free.
> [0.230988] [ cut here ]
> [0.230996] WARNING: CPU: 0 PID: 1 at lib/refcount.c:114
> .refcount_inc+0x5c/0x70
> [0.231001] Modules linked in:
> [0.231006] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-gccN-next-
> 20170310-g5be4921 #1
> [0.231012] task: c0004940 task.stack: c0004944
> [0.231016] NIP: c05ac6bc LR: c05ac6b8 CTR:
> c0743390
> [0.231021] REGS: c00049443160 TRAP: 0700   Not tainted  (4.11.0-rc1-
> gccN-next-20170310-g5be4921)
> [0.231026] MSR: 80029032 
> [0.231033]   CR: 24024422  XER: 000c
> [0.231038] CFAR: c0a5356c SOFTE: 1
> [0.231038] GPR00: c05ac6b8 c000494433e0 c1079d00
> 002b
> [0.231038] GPR04:  00ef 
> c10418a0
> [0.231038] GPR08: 4af8 c0ecc9a8 c0ecc9a8
> 
> [0.231038] GPR12: 28024824 c6bb 
> c00049443a00
> [0.231038] GPR16:  c00049443a10 
> 
> [0.231038] GPR20:   c0f7dd20
> 
> [0.231038] GPR24: 014080c0 c12060b8 c1206080
> 0009
> [0.231038] GPR28: c0f7dde0 0090 
> c000461ae800
> [0.231100] NIP [c05ac6bc] .refcount_inc+0x5c/0x70
> [0.231104] LR [c05ac6b8] .refcount_inc+0x58/0x70
> [0.231108] Call Trace:
> [0.231112] [c000494433e0] [c05ac6b8] .refcount_inc+0x58/0x70
> (unreliable)
> [0.231120] [c00049443450] [c086c008]
> .mddev_find+0x1e8/0x430
> [0.231125] [c00049443530] [c0872b6c] .md_open+0x2c/0x140
> [0.231132] [c000494435c0] [c03962a4]
> .__blkdev_get+0xd4/0x520
> [0.231138] [c00049443690] [c0396cc0] .blkdev_get+0x1c0/0x4f0
> [0.231145] [c00049443790] [c0336d64]
> .do_dentry_open.isra.1+0x2a4/0x410
> [0.231152] [c00049443830] [c03523f4]
> .path_openat+0x624/0x1580
> [0.231157] [c00049443990] [c0354ce4]
> .do_filp_open+0x84/0x120
> [0.231163] [c00049443b10] [c0338d74]
> .do_sys_open+0x214/0x300
> [0.231170] [c00049443be0] [c0da69ac]
> .md_run_setup+0xa0/0xec
> [0.231176] [c00049443c60] [c0da4fbc]
> .prepare_namespace+0x60/0x240
> [0.231182] [c00049443ce0] [c0da47a8]
> .kernel_init_freeable+0x330/0x36c
> [0.231190] [c00049443db0] [c000dc44] .kernel_init+0x24/0x160
> [0.231197] [c00049443e30] [c000badc]
> .ret_from_kernel_thread+0x58/0x7c
> [0.231202] Instruction dump:
> [0.231206] 6000 3d22ffee 89296bfb 2f89 409effdc 3c62ffc6 3921
> 3d42ffee
> [0.231216] 38630928 992a6bfb 484a6e79 6000 <0fe0> 4bb8
> 6000 6000
> [0.231226] ---[ end trace 8c51f269ad91ffc2 ]---
> [0.231233] md: Autodetecting RAID arrays.
> [0.231236] md: autorun ...
> [0.231239] 

Re: [PATCH 2/3] powernv:idle: Don't override default/deepest directly in kernel

2017-03-14 Thread Nicholas Piggin
On Mon, 13 Mar 2017 11:31:27 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> Currently during idle-init on power9, if we don't find suitable stop
> states in the device tree that can be used as the
> default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default
> stop state psscr to be used by power9_idle and deepest stop state
> which is used by CPU-Hotplug.
> 
> However, if the platform firmware has not configured or enabled a stop
> state, the kernel should not make any assumptions and fallback to a
> default choice.
> 
> If the kernel uses a stop state that is not configured by the platform
> firmware, it may lead to further failures which should be avoided.
> 
> In this patch, we modify the init code to ensure that the kernel uses
> only the stop states exposed by the firmware through the device
> tree. When a suitable default stop state isn't found, we disable
> ppc_md.power_save for power9. Similarly, when a suitable
> deepest_stop_state is not found in the device tree exported by the
> firmware, fall back to the default busy-wait loop in the CPU-Hotplug
> code.

Seems reasonable. I have a few comments that you may consider. Nothing
too major.

Btw., it would be nice to move this hotplug idling selection code to
idle.c. Have the hotplug just ask to enter the best available idle mode
and that's it. I'm not asking you to do that for this series, but perhaps
consider it for the future.


> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 4ee837e..9fde6e4 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -147,7 +147,6 @@ u32 pnv_get_supported_cpuidle_states(void)
>  }
>  EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states);
>  
> -
>  static void pnv_fastsleep_workaround_apply(void *info)
>  
>  {
> @@ -243,6 +242,7 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
>   */
>  u64 pnv_default_stop_val;
>  u64 pnv_default_stop_mask;
> +bool default_stop_found;
>  
>  /*
>   * Used for ppc_md.power_save which needs a function with no parameters
> @@ -264,6 +264,7 @@ static void power9_idle(void)
>   */
>  u64 pnv_deepest_stop_psscr_val;
>  u64 pnv_deepest_stop_psscr_mask;
> +bool deepest_stop_found;
>  
>  /*
>   * Power ISA 3.0 idle initialization.

If the hotplug idle code was in idle.c, then all this deepest/default stop
logic and register settings would be static to idle.c, which would be nice.

If you have a function to check if deepest stop is found, then you don't need
a non-static variable here (or for default_stop_found AFAIKS).
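
For example (helper name made up):

	/* in idle.c, keeping the flag file-local */
	static bool deepest_stop_found;

	bool pnv_deepest_stop_available(void)
	{
		return deepest_stop_found;
	}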


> @@ -352,7 +353,6 @@ static int __init pnv_power9_idle_init(struct device_node 
> *np, u32 *flags,
>   u32 *residency_ns = NULL;
>   u64 max_residency_ns = 0;
>   int rc = 0, i;
> - bool default_stop_found = false, deepest_stop_found = false;
>  
>   psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL);
>   psscr_mask = kcalloc(dt_idle_states, sizeof(*psscr_mask), GFP_KERNEL);
> @@ -433,20 +433,22 @@ static int __init pnv_power9_idle_init(struct 
> device_node *np, u32 *flags,
>   }
>  
>   if (!default_stop_found) {
> - pnv_default_stop_val = PSSCR_HV_DEFAULT_VAL;
> - pnv_default_stop_mask = PSSCR_HV_DEFAULT_MASK;
> - pr_warn("Setting default stop psscr 
> val=0x%016llx,mask=0x%016llx\n",
> + pr_warn("powernv:idle: Default stop not found. Disabling 
> ppcmd.powersave\n");
> + } else {
> + pr_info("powernv:idle: Default stop: psscr = 
> 0x%016llx,mask=0x%016llx\n",
>   pnv_default_stop_val, pnv_default_stop_mask);
>   }
>  
>   if (!deepest_stop_found) {
> - pnv_deepest_stop_psscr_val = PSSCR_HV_DEFAULT_VAL;
> - pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK;
> - pr_warn("Setting default stop psscr 
> val=0x%016llx,mask=0x%016llx\n",
> + pr_warn("powernv:idle: Deepest stop not found. CPU-Hotplug is 
> affected\n");

I guess these warnings are meant for developers, but it might be nice
to make them slightly more meaningful? Prefix of choice seems to be
"cpuidle-powernv:"

Then you could have "no suitable stop state provided by firmware. Disabling
idle power saving" and "no suitable stop state provided by firmware. Offlined
CPUs will busy-wait", perhaps?

Just a suggestion.
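
Roughly, i.e. (wording only, nothing here is meant as the final text):

  if (!default_stop_found)
          pr_warn("cpuidle-powernv: no suitable stop state provided by firmware. Disabling idle power saving\n");

  if (!deepest_stop_found)
          pr_warn("cpuidle-powernv: no suitable stop state provided by firmware. Offlined CPUs will busy-wait\n");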

> + } else {
> + pr_info("powernv:idle: Deepest stop: psscr = 
> 0x%016llx,mask=0x%016llx\n",
>   pnv_deepest_stop_psscr_val,
>   pnv_deepest_stop_psscr_mask);
>   }
>  
> + pr_info("powernv:idle: RL value of first deep stop = 0x%llx\n",
> + pnv_first_deep_stop_state);

cpuidle-powernv: prefix for these too?

>  out:
>   kfree(psscr_val);
>   kfree(psscr_mask);
> @@ -454,6 +456,12 @@ static int __init pnv_power9_idle_init(struct 
> device_node *np, u32 *flags,
>   return rc;
>  }
>  
> +

Re: [PATCH 3/3] powernv:Recover correct PACA on wakeup from a stop on P9 DD1

2017-03-14 Thread Nicholas Piggin
On Mon, 13 Mar 2017 11:31:28 +0530
"Gautham R. Shenoy"  wrote:

> From: "Gautham R. Shenoy" 
> 
> POWER9 platform can be configured to rebalance per-thread resources
> within a core in order to improve SMT performance.  Certain STOP
> states can be configured to relinquish resources, including some
> hypervisor SPRs in order to enable SMT thread folding.
> 
> Due to relinquishing of per-thread resources under certain platform
> configuration, certain SPR context could be lost within a core that
> needs to be recovered.  This state loss is due to reconfiguration of
> SMT threads and not due to actual electrical power loss.

Interesting. This will clash slightly with my patches, but that's
no big deal.


> This patch implements a context recovery framework within threads of a
> core, by provisioning space in paca_struct for saving every sibling
> threads's paca pointers. Basically, we should be able to arrive at the
> right paca pointer from any of the thread's existing paca pointer.
> 
> At bootup, during powernv idle-init, we save the paca address of every
> CPU in each one its siblings paca_struct in the slot corresponding to
> this CPU's index in the core.
> 
> On wakeup from a stop, the thread will determine its index in the core
> from the lower 2 bits of the PIR register and recover its PACA pointer
> by indexing into the correct slot in the provisioned space in the
> current PACA.
> 
> [Changelog written with inputs from sva...@linux.vnet.ibm.com]
> 
> Signed-off-by: Gautham R. Shenoy 
> ---
>  arch/powerpc/include/asm/paca.h   |  5 
>  arch/powerpc/kernel/asm-offsets.c |  1 +
>  arch/powerpc/kernel/idle_book3s.S | 43 
> ++-
>  arch/powerpc/platforms/powernv/idle.c | 22 ++
>  4 files changed, 70 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> index 708c3e5..4405630 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -172,6 +172,11 @@ struct paca_struct {
>   u8 thread_mask;
>   /* Mask to denote subcore sibling threads */
>   u8 subcore_sibling_mask;
> + /*
> +  * Pointer to an array which contains pointer
> +  * to the sibling threads' paca.
> +  */
> + struct paca_struct *thread_sibling_pacas[8];
>  #endif
>  
>  #ifdef CONFIG_PPC_BOOK3S_64
> diff --git a/arch/powerpc/kernel/asm-offsets.c 
> b/arch/powerpc/kernel/asm-offsets.c
> index 4367e7d..6ec5016 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -727,6 +727,7 @@ int main(void)
>   OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state);
>   OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
>   OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
> + OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
>  #endif
>  
>   DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> diff --git a/arch/powerpc/kernel/idle_book3s.S 
> b/arch/powerpc/kernel/idle_book3s.S
> index 9957287..5a90f2c 100644
> --- a/arch/powerpc/kernel/idle_book3s.S
> +++ b/arch/powerpc/kernel/idle_book3s.S
> @@ -385,7 +385,48 @@ _GLOBAL(power9_idle_stop)
>   */
>  _GLOBAL(pnv_restore_hyp_resource)
>  BEGIN_FTR_SECTION
> - ld  r2,PACATOC(r13);
> +BEGIN_FTR_SECTION_NESTED(70)
> +/* Save our LR in r17 */
> + mflrr17
> +/*
> + * On entering certain stop states, the thread might relinquish its
> + * per-thread resources due to some reconfiguration for improved SMT
> + * performance. This would result in certain SPR context such as
> + * HSPRG0 (which contains the paca pointer) to be lost within the core.
> + *
> + * Fortunately, the PIR is invariant to thread reconfiguration. Since
> + * this thread's paca pointer is recorded in all its sibling's
> + * paca, we can correctly recover this thread's paca pointer if we
> + * know the index of this thread in the core.
> + * This index can be obtained from the lower two bits of the PIR.
> + *
> + * i.e, thread's position in the core = PIR[62:63].
> + * If this value is i, then this thread's paca is
> + * paca->thread_sibling_pacas[i].
> + */
> + mfspr   r4, SPRN_PIR
> + andi.   r4, r4, 0x3
> +/*
> + * Since each entry in thread_sibling_pacas is 8 bytes
> + * we need to left-shift by 3 bits. Thus r4 = i * 8
> + */
> + sldir4, r4, 3
> +/* Get &paca->thread_sibling_pacas[0] in r5 */
> + addir5, r13, PACA_SIBLING_PACA_PTRS

I don't actually understand how this works. We have in r13 a paca
pointer already, and it must be a pointer to one of the threads in
this core, but it may not be a pointer to this thread's paca?

Can you explain how that works a bit more? Where does that r13 come
from?

The exception code before we reach here already saves registers into
the paca save area (PACA_EXGEN), so this is going to go bad if we
are saving into another thread's paca. I have a patch that makes the
0x100 vector use a a new

Re: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t

2017-03-14 Thread James Bottomley
On Tue, 2017-03-14 at 12:29 +, Reshetova, Elena wrote:
> > Elena Reshetova  writes:
> > 
> > > refcount_t type and corresponding API should be
> > > used instead of atomic_t when the variable is used as
> > > a reference counter. This allows us to avoid accidental
> > > refcounter overflows that might lead to use-after-free
> > > situations.
> > > 
> > > Signed-off-by: Elena Reshetova 
> > > Signed-off-by: Hans Liljestrand 
> > > Signed-off-by: Kees Cook 
> > > Signed-off-by: David Windsor 
> > > ---
> > >  drivers/md/md.c | 6 +++---
> > >  drivers/md/md.h | 3 ++-
> > >  2 files changed, 5 insertions(+), 4 deletions(-)
> > 
> > When booting linux-next (specifically 5be4921c9958ec) I'm seeing
> > the
> > backtrace below. I suspect this patch is just exposing an existing
> > issue?
> 
> Yes, we have actually been following this issue in the another
> thread. 
> It looks like the object is re-used somehow, but I can't quite
> understand how just by reading the code. 
> This was what I put into the previous thread:
> 
> "The log below indicates that you are using your refcounter in a bit
> weird way in mddev_find(). 
> However, I can't find the place (just by reading the code) where you
> would increment refcounter from zero (vs. setting it to one).
> It looks like you either iterate over existing nodes (and increment
> their counters, which should be >= 1 at the time of increment) or
> create a new node, but then mddev_init() sets the counter to 1. "
> 
> If you can help to understand what is going on with the object
> creation/destruction, would be appreciated!
> 
> Also Shaohua Li stopped this patch coming from his tree since the 
> issue was caught at that time, so we are not going to merge this 
> until we figure it out. 

Asking on the correct list (dm-devel) would have got you the easy
answer:  The refcount behind mddev->active is a genuine atomic.  It has
refcount properties but only if the array fails to initialise (in that
case, final put kills it).  Once it's added to the system as a gendisk,
it cannot be freed until md_free().  Thus its ->active count can go to
zero (when it becomes inactive; usually because of an unmount). On a
simple allocation regardless of outcome, the last executed statement in
md_alloc is mddev_put(): that destroys the device if we didn't manage
to create it or returns 0 and adds an inactive device to the system
which the user can get with mddev_find().
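
To spell out the pattern (a rough illustration of the semantics, not a
verbatim quote of drivers/md): ->active legitimately reaches zero while the
object stays allocated, and a later lookup increments it again, which is
exactly what refcount_t refuses to allow.

  /* illustrative only, not md.c */
  static atomic_t active = ATOMIC_INIT(1);

  static void obj_deactivate(void)
  {
          atomic_dec(&active);    /* 1 -> 0: object is kept, just inactive */
  }

  static void obj_find(void)
  {
          atomic_inc(&active);    /* 0 -> 1: fine for atomic_t,            */
  }                               /* refcount_inc() warns on inc-from-0    */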

James




Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration

2017-03-14 Thread Alex Williamson
On Tue, 14 Mar 2017 11:55:33 +1100
David Gibson  wrote:

> On Tue, Mar 14, 2017 at 11:54:03AM +1100, Alexey Kardashevskiy wrote:
> > On 10/03/17 15:48, David Gibson wrote:  
> > > On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote:  
> > >> This is my current queue of patches to add acceleration of TCE
> > >> updates in KVM.
> > >>
> > >> This is based on Linus'es tree sha1 c1aa905a304e.  

Hmm, sure about that?  03/10 doesn't apply.

> > > 
> > > I think we're finally there - I've now sent an R-b for all patches.  
> > 
> > Thanks for the patience.
> > 
> > 
> > I supposed in order to proceed now I need an ack from Alex, correct?  
> 
> That, or simply for him to merge it.

Given the diffstat, I'd guess you're looking for acks from me and maybe
Paolo, but it looks like it should be merged through ppc trees.  Thanks,

Alex

> > >>
> > >> Please comment. Thanks.
> > >>
> > >> Changes:
> > >> v8:
> > >> * kept fixing oddities with error handling in 10/10
> > >>
> > >> v7:
> > >> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c
> > >>
> > >> v6:
> > >> * reworked the last patch in terms of error handling and parameters 
> > >> checking
> > >>
> > >> v5:
> > >> * replaced "KVM: PPC: Separate TCE validation from update" with
> > >> "KVM: PPC: iommu: Unify TCE checking"
> > >> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup 
> > >> iommu_table disposal"
> > >> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
> > >> * more details in individual commit logs
> > >>
> > >> v4:
> > >> * addressed comments from v3
> > >> * updated subject lines with correct component names
> > >> * regrouped the patchset in order:
> > >>  - powerpc fixes;
> > >>  - vfio_spapr_tce driver fixes;
> > >>  - KVM/PPC fixes;
> > >>  - KVM+PPC+VFIO;
> > >> * everything except last 2 patches have "Reviewed-By: David"
> > >>
> > >> v3:
> > >> * there was no full repost, only last patch was posted
> > >>
> > >> v2:
> > >> * 11/11 reworked to use new notifiers, it is rather RFC as it still has
> > >> a issue;
> > >> * got 09/11, 10/11 to use notifiers in 11/11;
> > >> * added rb: David to most of patches and added a comment in 05/11.
> > >>
> > >> Alexey Kardashevskiy (10):
> > >>   powerpc/mmu: Add real mode support for IOMMU preregistered memory
> > >>   powerpc/powernv/iommu: Add real mode version of
> > >> iommu_table_ops::exchange()
> > >>   powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
> > >>   powerpc/vfio_spapr_tce: Add reference counting to iommu_table
> > >>   KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
> > >>   KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
> > >>   KVM: PPC: Pass kvm* to kvmppc_find_table()
> > >>   KVM: PPC: Use preregistered memory API to access TCE list
> > >>   KVM: PPC: iommu: Unify TCE checking
> > >>   KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
> > >>
> > >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> > >>  arch/powerpc/include/asm/iommu.h   |  32 ++-
> > >>  arch/powerpc/include/asm/kvm_host.h|   8 +
> > >>  arch/powerpc/include/asm/kvm_ppc.h |  12 +-
> > >>  arch/powerpc/include/asm/mmu_context.h |   4 +
> > >>  include/uapi/linux/kvm.h   |   9 +
> > >>  arch/powerpc/kernel/iommu.c|  86 +---
> > >>  arch/powerpc/kvm/book3s_64_vio.c   | 330 
> > >> -
> > >>  arch/powerpc/kvm/book3s_64_vio_hv.c| 303 
> > >> ++
> > >>  arch/powerpc/kvm/powerpc.c |   2 +
> > >>  arch/powerpc/mm/mmu_context_iommu.c|  39 
> > >>  arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
> > >>  arch/powerpc/platforms/powernv/pci.c   |   1 +
> > >>  arch/powerpc/platforms/pseries/iommu.c |   3 +-
> > >>  arch/powerpc/platforms/pseries/vio.c   |   2 +-
> > >>  drivers/vfio/vfio_iommu_spapr_tce.c|   2 +-
> > >>  virt/kvm/vfio.c|  60 ++
> > >>  arch/powerpc/kvm/Kconfig   |   1 +
> > >>  18 files changed, 855 insertions(+), 107 deletions(-)
> > >>  
> > >   
> > 
> >   
> 
> 
> 
> 



Re: [PATCH kernel v8 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal

2017-03-14 Thread Alex Williamson
On Fri, 10 Mar 2017 14:53:30 +1100
Alexey Kardashevskiy  wrote:

> At the moment iommu_table can be disposed by either calling
> iommu_table_free() directly or it_ops::free(); the only implementation
> of free() is in IODA2 - pnv_ioda2_table_free() - and it calls
> iommu_table_free() anyway.
> 
> As we are going to have reference counting on tables, we need a unified
> way of disposing tables.
> 
> This moves it_ops::free() call into iommu_free_table() and makes use
> of the latter. The free() callback now handles only platform-specific
> data.
> 
> As from now on the iommu_free_table() calls it_ops->free(), we need
> to have it_ops initialized before calling iommu_free_table() so this
> moves this initialization in pnv_pci_ioda2_create_table().
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 
> ---
> Changes:
> v5:
> * moved "tbl->it_ops = &pnv_ioda2_iommu_ops" earlier and updated
> the commit log
> ---
>  arch/powerpc/kernel/iommu.c   |  4 
>  arch/powerpc/platforms/powernv/pci-ioda.c | 10 --
>  drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
>  3 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9bace5df05d5..bc142d87130f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char 
> *node_name)
>   if (!tbl)
>   return;
>  
> + if (tbl->it_ops->free)
> + tbl->it_ops->free(tbl);
> +
>   if (!tbl->it_map) {
>   kfree(tbl);
>   return;
> @@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
> *node_name)
>   /* free table */
>   kfree(tbl);
>  }
> +EXPORT_SYMBOL_GPL(iommu_free_table);

A slightly cringe worthy generically named export in arch code.

>  
>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 69c40b43daa3..7916d0cb05fe 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1425,7 +1425,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
> *dev, struct pnv_ioda_pe
>   iommu_group_put(pe->table_group.group);
>   BUG_ON(pe->table_group.group);
>   }
> - pnv_pci_ioda2_table_free_pages(tbl);
>   iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
>  }
>  
> @@ -2041,7 +2040,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, 
> long index,
>  static void pnv_ioda2_table_free(struct iommu_table *tbl)
>  {
>   pnv_pci_ioda2_table_free_pages(tbl);
> - iommu_free_table(tbl, "pnv");
>  }
>  
>  static struct iommu_table_ops pnv_ioda2_iommu_ops = {
> @@ -2318,6 +2316,8 @@ static long pnv_pci_ioda2_create_table(struct 
> iommu_table_group *table_group,
>   if (!tbl)
>   return -ENOMEM;
>  
> + tbl->it_ops = &pnv_ioda2_iommu_ops;
> +
>   ret = pnv_pci_ioda2_table_alloc_pages(nid,
>   bus_offset, page_shift, window_size,
>   levels, tbl);
> @@ -2326,8 +2326,6 @@ static long pnv_pci_ioda2_create_table(struct 
> iommu_table_group *table_group,
>   return ret;
>   }
>  
> - tbl->it_ops = &pnv_ioda2_iommu_ops;
> -
>   *ptbl = tbl;
>  
>   return 0;
> @@ -2368,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct 
> pnv_ioda_pe *pe)
>   if (rc) {
>   pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n",
>   rc);
> - pnv_ioda2_table_free(tbl);
> + iommu_free_table(tbl, "");
>   return rc;
>   }
>  
> @@ -2456,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct 
> iommu_table_group *table_group)
>   pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>   if (pe->pbus)
>   pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
> - pnv_ioda2_table_free(tbl);
> + iommu_free_table(tbl, "pnv");
>  }
>  
>  static void pnv_ioda2_release_ownership(struct iommu_table_group 
> *table_group)
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index cf3de91fbfe7..fbec7348a7e5 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container 
> *container,
>   unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT;
>  
>   tce_iommu_userspace_view_free(tbl, container->mm);
> - tbl->it_ops->free(tbl);
> + iommu_free_table(tbl, "");
>   decrement_locked_vm(container->mm, pages);
>  }
>  

Acked-by: Alex Williamson 


[PATCH 17/19] perf powerpc: Choose local entry point with kretprobes

2017-03-14 Thread Arnaldo Carvalho de Melo
From: "Naveen N. Rao" 

perf now uses an offset from _text/_stext for kretprobes if the kernel
supports it, rather than the actual function name. As such, let's choose
the LEP for powerpc ABIv2 so as to ensure the probe gets hit. Do it only
if the kernel supports specifying offsets with kretprobes.

Signed-off-by: Naveen N. Rao 
Acked-by: Masami Hiramatsu 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Steven Rostedt 
Cc: linuxppc-dev@lists.ozlabs.org
Link: 
http://lkml.kernel.org/r/7445b5334673ef5404ac1d12609bad4d73d2b567.1488961018.git.naveen.n@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/arch/powerpc/util/sym-handling.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/tools/perf/arch/powerpc/util/sym-handling.c 
b/tools/perf/arch/powerpc/util/sym-handling.c
index 1030a6e504bb..39dbe512b9fc 100644
--- a/tools/perf/arch/powerpc/util/sym-handling.c
+++ b/tools/perf/arch/powerpc/util/sym-handling.c
@@ -10,6 +10,7 @@
 #include "symbol.h"
 #include "map.h"
 #include "probe-event.h"
+#include "probe-file.h"
 
 #ifdef HAVE_LIBELF_SUPPORT
 bool elf__needs_adjust_symbols(GElf_Ehdr ehdr)
@@ -79,13 +80,18 @@ void arch__fix_tev_from_maps(struct perf_probe_event *pev,
 * However, if the user specifies an offset, we fall back to using the
 * GEP since all userspace applications (objdump/readelf) show function
 * disassembly with offsets from the GEP.
-*
-* In addition, we shouldn't specify an offset for kretprobes.
 */
-   if (pev->point.offset || (!pev->uprobes && pev->point.retprobe) ||
-   !map || !sym)
+   if (pev->point.offset || !map || !sym)
return;
 
+   /* For kretprobes, add an offset only if the kernel supports it */
+   if (!pev->uprobes && pev->point.retprobe) {
+#ifdef HAVE_LIBELF_SUPPORT
+   if (!kretprobe_offset_is_supported())
+#endif
+   return;
+   }
+
lep_offset = PPC64_LOCAL_ENTRY_OFFSET(sym->arch_sym);
 
if (map->dso->symtab_type == DSO_BINARY_TYPE__KALLSYMS)
-- 
2.9.3



[PATCH 15/19] perf probe: Factor out the ftrace README scanning

2017-03-14 Thread Arnaldo Carvalho de Melo
From: "Naveen N. Rao" 

Simplify and separate out the ftrace README scanning logic into a
separate helper. This is used subsequently to scan for all patterns of
interest and to cache the result.

Since we are only interested in availability of probe argument type x,
we will only scan for that.

Signed-off-by: Naveen N. Rao 
Acked-by: Masami Hiramatsu 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Steven Rostedt 
Cc: linuxppc-dev@lists.ozlabs.org
Link: 
http://lkml.kernel.org/r/6dc30edc747ba82a236593be6cf3a046fa9453b5.1488961018.git.naveen.n@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/probe-file.c | 70 +++-
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 1a62daceb028..8a219cd831b7 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -877,35 +877,31 @@ int probe_cache__show_all_caches(struct strfilter *filter)
return 0;
 }
 
+enum ftrace_readme {
+   FTRACE_README_PROBE_TYPE_X = 0,
+   FTRACE_README_END,
+};
+
 static struct {
const char *pattern;
-   boolavail;
-   boolchecked;
-} probe_type_table[] = {
-#define DEFINE_TYPE(idx, pat, def_avail)   \
-   [idx] = {.pattern = pat, .avail = (def_avail)}
-   DEFINE_TYPE(PROBE_TYPE_U, "* u8/16/32/64,*", true),
-   DEFINE_TYPE(PROBE_TYPE_S, "* s8/16/32/64,*", true),
-   DEFINE_TYPE(PROBE_TYPE_X, "* x8/16/32/64,*", false),
-   DEFINE_TYPE(PROBE_TYPE_STRING, "* string,*", true),
-   DEFINE_TYPE(PROBE_TYPE_BITFIELD,
-   "* b@/", true),
+   bool avail;
+} ftrace_readme_table[] = {
+#define DEFINE_TYPE(idx, pat)  \
+   [idx] = {.pattern = pat, .avail = false}
+   DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
 };
 
-bool probe_type_is_available(enum probe_type type)
+static bool scan_ftrace_readme(enum ftrace_readme type)
 {
+   int fd;
FILE *fp;
char *buf = NULL;
size_t len = 0;
-   bool target_line = false;
-   bool ret = probe_type_table[type].avail;
-   int fd;
+   bool ret = false;
+   static bool scanned = false;
 
-   if (type >= PROBE_TYPE_END)
-   return false;
-   /* We don't have to check the type which supported by default */
-   if (ret || probe_type_table[type].checked)
-   return ret;
+   if (scanned)
+   goto result;
 
fd = open_trace_file("README", false);
if (fd < 0)
@@ -917,21 +913,29 @@ bool probe_type_is_available(enum probe_type type)
return ret;
}
 
-   while (getline(&buf, &len, fp) > 0 && !ret) {
-   if (!target_line) {
-   target_line = !!strstr(buf, " type: ");
-   if (!target_line)
-   continue;
-   } else if (strstr(buf, "\t  ") != buf)
-   break;
-   ret = strglobmatch(buf, probe_type_table[type].pattern);
-   }
-   /* Cache the result */
-   probe_type_table[type].checked = true;
-   probe_type_table[type].avail = ret;
+   while (getline(&buf, &len, fp) > 0)
+   for (enum ftrace_readme i = 0; i < FTRACE_README_END; i++)
+   if (!ftrace_readme_table[i].avail)
+   ftrace_readme_table[i].avail =
+   strglobmatch(buf, 
ftrace_readme_table[i].pattern);
+   scanned = true;
 
fclose(fp);
free(buf);
 
-   return ret;
+result:
+   if (type >= FTRACE_README_END)
+   return false;
+
+   return ftrace_readme_table[type].avail;
+}
+
+bool probe_type_is_available(enum probe_type type)
+{
+   if (type >= PROBE_TYPE_END)
+   return false;
+   else if (type == PROBE_TYPE_X)
+   return scan_ftrace_readme(FTRACE_README_PROBE_TYPE_X);
+
+   return true;
 }
-- 
2.9.3



[PATCH 16/19] perf kretprobes: Offset from reloc_sym if kernel supports it

2017-03-14 Thread Arnaldo Carvalho de Melo
From: "Naveen N. Rao" 

We indicate support for accepting sym+offset with kretprobes through a
line in ftrace README. Parse the same to identify support and choose the
appropriate format for kprobe_events.

As an example, without this perf patch, but with the ftrace changes:

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/tracing/README | 
grep kretprobe
  place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>
  naveen@ubuntu:~/linux/tools/perf$
  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7d8]
  Probe point found: do_open+0
  Matched function: do_open [35d76b5]
  found inline addr: 0xc04ba984
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open do_open+0
  Writing event: r:probe/do_open_1 do_open+0
  Added new events:
probe:do_open(on do_open%return)
probe:do_open_1  (on do_open%return)

  You can now use it in all perf tools, such as:

  perf record -e probe:do_open_1 -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c0041370  k  kretprobe_trampoline+0x0[OPTIMIZED]
  c04433d0  r  do_open+0x0[DISABLED]
  c04433d0  r  do_open+0x0[DISABLED]

And after this patch (and the subsequent powerpc patch):

  naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return
  probe-definition(0): do_open%return
  symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null)
  0 arguments
  Looking at the vmlinux_path (8 entries long)
  Using /boot/vmlinux for symbols
  Open Debuginfo file: /boot/vmlinux
  Try to find probe point from debuginfo.
  Matched function: do_open [2d0c7d8]
  Probe point found: do_open+0
  Matched function: do_open [35d76b5]
  found inline addr: 0xc04ba984
  Failed to find "do_open%return",
   because do_open is an inlined function and has no return point.
  An error occurred in debuginfo analysis (-22).
  Trying to use symbols.
  Opening /sys/kernel/debug/tracing//README write=0
  Opening /sys/kernel/debug/tracing//kprobe_events write=1
  Writing event: r:probe/do_open _text+4469712
  Writing event: r:probe/do_open_1 _text+4956248
  Added new events:
probe:do_open(on do_open%return)
probe:do_open_1  (on do_open%return)

  You can now use it in all perf tools, such as:

  perf record -e probe:do_open_1 -aR sleep 1

  naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list
  c0041370  k  kretprobe_trampoline+0x0[OPTIMIZED]
  c04433d0  r  do_open+0x0[DISABLED]
  c04ba058  r  do_open+0x8[DISABLED]

Signed-off-by: Naveen N. Rao 
Acked-by: Masami Hiramatsu 
Cc: Ananth N Mavinakayanahalli 
Cc: Michael Ellerman 
Cc: Steven Rostedt 
Cc: linuxppc-dev@lists.ozlabs.org
Link: 
http://lkml.kernel.org/r/496ef9f33c1ab16286ece9dd62aa672807aef91c.1488961018.git.naveen.n@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/probe-event.c | 12 +---
 tools/perf/util/probe-file.c  |  7 +++
 tools/perf/util/probe-file.h  |  1 +
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 28fb62c32678..c9bdc9ded0c3 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -757,7 +757,9 @@ post_process_kernel_probe_trace_events(struct 
probe_trace_event *tevs,
}
 
for (i = 0; i < ntevs; i++) {
-   if (!tevs[i].point.address || tevs[i].point.retprobe)
+   if (!tevs[i].point.address)
+   continue;
+   if (tevs[i].point.retprobe && !kretprobe_offset_is_supported())
continue;
/* If we found a wrong one, mark it by NULL symbol */
if (kprobe_warn_out_range(tevs[i].point.symbol,
@@ -1528,11 +1530,6 @@ static int parse_perf_probe_point(char *arg, struct 
perf_probe_event *pev)
return -EINVAL;
}
 
-   if (pp->retprobe && !pp->function) {
-   semantic_error("Return probe requires an entry function.\n");
-   return -EINVAL;
-   }
-
if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
semantic_error("Offset/Line/Lazy pattern can't be used with "
   "return probe.\n");
@@ -2841,7 +2838,8 @@ static int find_probe_trace_events_from_map(struct 
perf_probe_event *pev,
}
 
/* Note that the symbols in t

[GIT PULL 00/19] perf/core improvements and fixes

2017-03-14 Thread Arnaldo Carvalho de Melo
Hi Ingo,

Please consider pulling,

- Arnaldo

Test results at the end of this message, as usual.

The following changes since commit 84e5b549214f2160c12318aac549de85f600c79a:

  Merge tag 'perf-core-for-mingo-4.11-20170306' of 
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core 
(2017-03-07 08:14:14 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
tags/perf-core-for-mingo-4.12-20170314

for you to fetch changes up to 5f6bee34707973ea7879a7857fd63ddccc92fff3:

  kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL (2017-03-14 
15:17:40 -0300)


perf/core improvements and fixes:

New features:

- Add PERF_RECORD_NAMESPACES so that the kernel can record information
  required to associate samples to namespaces, helping in container
  problem characterization.

  Now 'perf record' has a --namespace option to ask for such info,
  and when present, it can be used, initially, via a new sort order,
  'cgroup_id', allowing histogram entry bucketization by a (device, inode)
  based cgroup identifier (Hari Bathini)

- Add --next option to 'perf sched timehist', showing what is the next
  thread to run (Brendan Gregg)

Fixes:

- Fix segfault with basic block 'cycles' sort dimension (Changbin Du)

- Add c2c to command-list.txt, making it appear in the 'perf help'
  output (Changbin Du)

- Fix zeroing of 'abs_path' variable in the perf hists browser switch
  file code (Changbin Du)

- Hide tips messages when -q/--quiet is given to 'perf report' (Namhyung Kim)

Infrastructure:

- Use ref_reloc_sym + offset to setup kretprobes (Naveen Rao)

- Ignore generated files pmu-events/{jevents,pmu-events.c} for git (Changbin Du)

Documentation:

- Document +field style argument support for --field option (Changbin Du)

- Clarify 'perf c2c --stats' help message (Namhyung Kim)

Signed-off-by: Arnaldo Carvalho de Melo 


Brendan Gregg (1):
  perf sched timehist: Add --next option

Changbin Du (5):
  perf tools: Missing c2c command in command-list
  perf tools: Ignore generated files pmu-events/{jevents,pmu-events.c} for 
git
  perf sort: Fix segfault with basic block 'cycles' sort dimension
  perf report: Document +field style argument support for --field option
  perf hists browser: Fix typo in function switch_data_file

Hari Bathini (5):
  perf: Add PERF_RECORD_NAMESPACES to include namespaces related info
  perf tools: Add PERF_RECORD_NAMESPACES to include namespaces related info
  perf record: Synthesize namespace events for current processes
  perf script: Add script print support for namespace events
  perf tools: Add 'cgroup_id' sort order keyword

Namhyung Kim (3):
  perf report: Hide tip message when -q option is given
  perf c2c: Clarify help message of --stats option
  perf c2c: Fix display bug when using pipe

Naveen N. Rao (5):
  perf probe: Factor out the ftrace README scanning
  perf kretprobes: Offset from reloc_sym if kernel supports it
  perf powerpc: Choose local entry point with kretprobes
  doc: trace/kprobes: add information about NOKPROBE_SYMBOL
  kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL

 Documentation/trace/kprobetrace.txt |   5 +-
 include/linux/perf_event.h  |   2 +
 include/uapi/linux/perf_event.h |  32 +-
 kernel/events/core.c| 139 ++
 kernel/fork.c   |   2 +
 kernel/kprobes.c|   5 +-
 kernel/nsproxy.c|   3 +
 tools/include/uapi/linux/perf_event.h   |  32 +-
 tools/perf/.gitignore   |   2 +
 tools/perf/Documentation/perf-record.txt|   3 +
 tools/perf/Documentation/perf-report.txt|   7 +-
 tools/perf/Documentation/perf-sched.txt |   4 +
 tools/perf/Documentation/perf-script.txt|   3 +
 tools/perf/arch/powerpc/util/sym-handling.c |  14 ++-
 tools/perf/builtin-annotate.c   |   1 +
 tools/perf/builtin-c2c.c|   4 +-
 tools/perf/builtin-diff.c   |   1 +
 tools/perf/builtin-inject.c |  13 +++
 tools/perf/builtin-kmem.c   |   1 +
 tools/perf/builtin-kvm.c|   2 +
 tools/perf/builtin-lock.c   |   1 +
 tools/perf/builtin-mem.c|   1 +
 tools/perf/builtin-record.c |  35 ++-
 tools/perf/builtin-report.c |   4 +-
 tools/perf/builtin-sched.c  |  26 -
 tools/perf/builtin-script.c |  41 
 tools/perf/builtin-trace.c  |   3 +-
 tools/per

Re: Linux 4.11: Reported regressions as of Tuesday, 2017-03-14

2017-03-14 Thread Linus Torvalds
On Tue, Mar 14, 2017 at 3:02 AM, Thorsten Leemhuis
 wrote:
>
> Desc: PowerPC crashes on boot, bisected to commit 5657933dbb6e
> Repo: 2017-03-02 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1343553.html
> Stat: 2017-03-09 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349568.html
> Note: patch hopefully heading mainline

Should already be mainline as commit 46f401c4297a ("powerpc/pmac: Fix
crash in dma-mapping.h with NULL dma_ops") but didn't make rc2.

   Linus


Re: syscall statx not implemented on powerpc

2017-03-14 Thread Chris Packham
On 15/03/17 00:57, Michael Ellerman wrote:
> Chris Packham  writes:
>
>> On 13/03/17 21:52, Chandan Rajendra wrote:
>>> On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote:
 Hi,

 I've just attempted to build a powerpc kernel from 4.11-rc2 using a
 custom defconfig (available on request) and I'm hitting the following
 error in the early stages of compilation.

<stdin>:1325:2: error: #warning syscall statx not implemented [-Werror=cpp]

 Same thing seems to happen with mpc85xx_basic_defconfig.

 I don't actually need this syscall so I'd be happy to turn something off
 to get things building. I did a quick search and couldn't see anything
 on linuxppc-dev but google keeps correcting "statx" to "stats" so I
 could have missed it.

>>>
>>> The upstream commit
>>> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f)
>>> that introduces the statx syscall provides a test program. I will wire-up 
>>> the
>>> syscall on ppc64, run that test program and post the patch if the test 
>>> program
>>> works well.
>>>
>>
>> Thanks, I'd be happy to test a patch here.
>>
>> In the meantime I worked around the build issue by adding __IGNORE_statx
>> to checksyscalls.sh.
>
> Is it actually breaking the build? It's meant to be a warning, and
> that's all I get.
>
> cheers
>

It breaks my builds for ppc and arm. We do have KCFLAGS=-Werror in our 
environment but KCPPFLAGS= is not set so I'm not sure why it's stopping 
compilation.
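
For the record, the workaround mentioned above amounts to adding the syscall
to the ignore list that scripts/checksyscalls.sh builds, following its
existing __IGNORE_<name> convention (sketch only, exact placement in the
script may differ):

  /* scripts/checksyscalls.sh, in the list of ignored syscalls */
  #define __IGNORE_statx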


Re: [PATCH kernel v8 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table

2017-03-14 Thread Alex Williamson
On Fri, 10 Mar 2017 14:53:31 +1100
Alexey Kardashevskiy  wrote:

> So far iommu_table objects were only used in virtual mode and had
> a single owner. We are going to change this by implementing in-kernel
> acceleration of DMA mapping requests. The proposed acceleration
> will handle requests in real mode and KVM will keep references to tables.
> 
> This adds a kref to iommu_table and defines new helpers to update it.
> This replaces iommu_free_table() with iommu_table_put() and makes
> iommu_free_table() static. iommu_table_get() is not used in this patch
> but it will be in the following patch.
> 
> Since this touches prototypes, this also removes @node_name parameter as
> it has never been really useful on powernv and carrying it for
> the pseries platform code to iommu_free_table() seems to be quite
> useless as well.
> 
> This should cause no behavioral change.
> 
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: David Gibson 
> ---
>  arch/powerpc/include/asm/iommu.h  |  5 +++--
>  arch/powerpc/kernel/iommu.c   | 24 +++-
>  arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++---
>  arch/powerpc/platforms/powernv/pci.c  |  1 +
>  arch/powerpc/platforms/pseries/iommu.c|  3 ++-
>  arch/powerpc/platforms/pseries/vio.c  |  2 +-
>  drivers/vfio/vfio_iommu_spapr_tce.c   |  2 +-
>  7 files changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index 4554699aec02..82e77ebf85f4 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -119,6 +119,7 @@ struct iommu_table {
>   struct list_head it_group_list;/* List of iommu_table_group_link */
>   unsigned long *it_userspace; /* userspace view of the table */
>   struct iommu_table_ops *it_ops;
> + struct krefit_kref;
>  };
>  
>  #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \
> @@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device 
> *dev)
>  
>  extern int dma_iommu_dma_supported(struct device *dev, u64 mask);
>  
> -/* Frees table for an individual device node */
> -extern void iommu_free_table(struct iommu_table *tbl, const char *node_name);
> +extern void iommu_table_get(struct iommu_table *tbl);
> +extern void iommu_table_put(struct iommu_table *tbl);
>  
>  /* Initializes an iommu_table based in values set in the passed-in
>   * structure
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index bc142d87130f..d02b8d22fb50 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table 
> *tbl, int nid)
>   return tbl;
>  }
>  
> -void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> +static void iommu_table_free(struct kref *kref)
>  {
>   unsigned long bitmap_sz;
>   unsigned int order;
> + struct iommu_table *tbl;
>  
> - if (!tbl)
> - return;
> + tbl = container_of(kref, struct iommu_table, it_kref);
>  
>   if (tbl->it_ops->free)
>   tbl->it_ops->free(tbl);
> @@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char 
> *node_name)
>  
>   /* verify that table contains no entries */
>   if (!bitmap_empty(tbl->it_map, tbl->it_size))
> - pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name);
> + pr_warn("%s: Unexpected TCEs\n", __func__);
>  
>   /* calculate bitmap size in bytes */
>   bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long);
> @@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const 
> char *node_name)
>   /* free table */
>   kfree(tbl);
>  }
> -EXPORT_SYMBOL_GPL(iommu_free_table);
> +
> +void iommu_table_get(struct iommu_table *tbl)
> +{
> + kref_get(&tbl->it_kref);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_get);
> +
> +void iommu_table_put(struct iommu_table *tbl)
> +{
> + if (!tbl)
> + return;
> +
> + kref_put(&tbl->it_kref, iommu_table_free);
> +}
> +EXPORT_SYMBOL_GPL(iommu_table_put);
>  


Maybe an opportunity for less cringe worthy generic names exported from
arch code.  iommu_tce_table_get/put perhaps?


>  /* Creates TCEs for a user provided buffer.  The user buffer must be
>   * contiguous real kernel storage (not vmalloc).  The address passed here
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 7916d0cb05fe..ec3e565de511 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1425,7 +1425,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev 
> *dev, struct pnv_ioda_pe
>   iommu_group_put(pe->table_group.group);
>   BUG_ON(pe->table_group.group);
>   }
> - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> + iommu_table_put(tbl);

Re: [FIX PATCH v1] powerpc/pseries: Fix reference count leak during CPU unplug

2017-03-14 Thread Tyrel Datwyler
On 03/13/2017 03:29 AM, Bharata B Rao wrote:
> On Thu, Mar 09, 2017 at 01:34:00PM -0800, Tyrel Datwyler wrote:
>> On 03/08/2017 08:37 PM, Bharata B Rao wrote:
>>> The following warning is seen when a CPU is hot unplugged on a PowerKVM
>>> guest:
>>
>> Is this the case with cpus present at boot? What about cpus hotplugged
>> after boot?
> 
> I have observed this for CPUs that are hotplugged.

If removing a cpu present at boot works, but removing one that has been
hotplugged after boot reproduces the problem it is more likely the case
that we failed to take a reference during hotplug or released a
reference we shouldn't have. I'd have to go look at the hot add path.
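
For reference, the pairing argued for in the quoted comments below looks
roughly like this (simplified sketch, not the exact hotplug-cpu.c code):

  struct device_node *dn;
  int rc;

  dn = cpu_drc_index_to_dn(drc_index);  /* returns dn with a reference held */
  if (!dn)
          return -ENODEV;

  rc = dlpar_cpu_remove(dn, drc_index);
  of_node_put(dn);                      /* balance the lookup on every path */
  return rc;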

> 
>>
>> My suspicion is that the refcount was wrong to begin with. See my
>> comments below. The use of the of_node_put() calls is correct as in each
>> case we incremented the ref count earlier in the same function.
>>
>>>
>>> refcount_t: underflow; use-after-free.
>>> [ cut here ]
>>> WARNING: CPU: 0 PID: 53 at lib/refcount.c:128 
>>> refcount_sub_and_test+0xd8/0xf0
>>> Modules linked in:
>>> CPU: 0 PID: 53 Comm: kworker/u510:1 Not tainted 4.11.0-rc1 #3
>>> Workqueue: pseries hotplug workque pseries_hp_work_fn
>>> task: c000fb475000 task.stack: c000fb81c000
>>> NIP: c06f0808 LR: c06f0804 CTR: c07b98c0
>>> REGS: c000fb81f710 TRAP: 0700   Not tainted  (4.11.0-rc1)
>>> MSR: 8282b033 
>>>   CR: 4800  XER: 2000
>>> CFAR: c0c438e0 SOFTE: 1
>>> GPR00: c06f0804 c000fb81f990 c1573b00 0026
>>> GPR04:  016c 667265652e0d0a73 652d61667465722d
>>> GPR08: 0007 0007 0001 0006
>>> GPR12: 2200 cff4 c010c578 c001f11b9f40
>>> GPR16: c001fe0312a8 c001fe031078 c001fe031020 0001
>>> GPR20:   c1454808 fef7
>>> GPR24:  c001f1677648  
>>> GPR28: 1008 c0e4d3d8  c001eaae07d8
>>> NIP [c06f0808] refcount_sub_and_test+0xd8/0xf0
>>> LR [c06f0804] refcount_sub_and_test+0xd4/0xf0
>>> Call Trace:
>>> [c000fb81f990] [c06f0804] refcount_sub_and_test+0xd4/0xf0 
>>> (unreliable)
>>> [c000fb81f9f0] [c06d04b4] kobject_put+0x44/0x2a0
>>> [c000fb81fa70] [c09d5284] of_node_put+0x34/0x50
>>> [c000fb81faa0] [c00aceb8] dlpar_cpu_remove_by_index+0x108/0x130
>>> [c000fb81fb30] [c00ae128] dlpar_cpu+0x78/0x550
>>> [c000fb81fbe0] [c00a7b40] handle_dlpar_errorlog+0xc0/0x160
>>> [c000fb81fc50] [c00a7c74] pseries_hp_work_fn+0x94/0xa0
>>> [c000fb81fc80] [c0102cec] process_one_work+0x23c/0x540
>>> [c000fb81fd20] [c010309c] worker_thread+0xac/0x620
>>> [c000fb81fdc0] [c010c6c4] kthread+0x154/0x1a0
>>> [c000fb81fe30] [c000bbe0] ret_from_kernel_thread+0x5c/0x7c
>>>
>>> Fix this by ensuring that of_node_put() is called only from the
>>> error path in dlpar_cpu_remove_by_index(). In the normal path,
>>> of_node_put() happens as part of dlpar_detach_node().
>>>
>>> Signed-off-by: Bharata B Rao 
>>> Cc: Nathan Fontenot 
>>> ---
>>> Changes in v1:
>>> - Fixed the refcount problem in the userspace driven unplug path
>>>   in addition to in-kernel unplug path. (Sachin Sant)
>>>
>>> v0: https://patchwork.ozlabs.org/patch/736547/
>>>
>>>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 
>>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
>>> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> index 7bc0e91..c5ed510 100644
>>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>>> @@ -619,7 +619,8 @@ static int dlpar_cpu_remove_by_index(u32 drc_index)
>>> }
>>>
>>> rc = dlpar_cpu_remove(dn, drc_index);
>>> -   of_node_put(dn);
>>> +   if (rc)
>>> +   of_node_put(dn);
>>
>> I think there is another issue at play here because this is wrong.
>> Regardless of whether the dlpar_cpu_remove() succeeds or fails we still
>> need of_node_put() for both cases because we incremented the ref count
>> earlier in this function with a call to cpu_drc_index_to_dn(). That
>> call doesn't, but should, document that it returns a device_node with
>> incremented refcount.
>>
>>> return rc;
>>>  }
>>>
>>> @@ -856,9 +857,12 @@ static ssize_t dlpar_cpu_release(const char *buf, 
>>> size_t count)
>>> }
>>>
>>> rc = dlpar_cpu_remove(dn, drc_index);
>>> -   of_node_put(dn);
>>> -
>>> -   return rc ? rc : count;
>>> +   if (rc) {
>>> +   of_node_put(dn);
>>> +   return rc;
>>> +   } else {
>>> +   return count;
>>> +   }
>>
>> Same comment as above. The call earlier in the function to
>> of_find_node_by_path() returne

Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-14 Thread Alex Williamson
On Fri, 10 Mar 2017 14:53:37 +1100
Alexey Kardashevskiy  wrote:

> This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> without passing them to user space which saves time on switching
> to user space and back.
> 
> This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> KVM tries to handle a TCE request in the real mode, if failed
> it passes the request to the virtual mode to complete the operation.
> If the virtual mode handler fails, the request is passed to
> the user space; this is not expected to happen though.
> 
> To avoid dealing with page use counters (which is tricky in real mode),
> this only accelerates SPAPR TCE IOMMU v2 clients which are required
> to pre-register the userspace memory. The very first TCE request will
> be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> of the TCE table (iommu_table::it_userspace) is not allocated till
> the very first mapping happens and we cannot call vmalloc in real mode.
> 
> If we fail to update a hardware IOMMU table for an unexpected reason, we just
> clear it and move on as there is nothing really we can do about it -
> for example, if we hot plug a VFIO device to a guest, existing TCE tables
> will be mirrored automatically to the hardware and there is no interface
> to report to the guest about possible failures.
> 
> This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> and associates a physical IOMMU table with the SPAPR TCE table (which
> is a guest view of the hardware IOMMU table). The iommu_table object
> is cached and referenced so we do not have to look up for it in real mode.
> 
> This does not implement the UNSET counterpart as there is no use for it -
> once the acceleration is enabled, the existing userspace won't
> disable it unless a VFIO container is destroyed; this adds necessary
> cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> 
> As this creates a descriptor per IOMMU table-LIOBN couple (called
> kvmppc_spapr_tce_iommu_table), it is possible to have several
> descriptors with the same iommu_table (hardware IOMMU table) attached
> to the same LIOBN; we do not remove duplicates though as
> iommu_table_ops::exchange not just update a TCE entry (which is
> shared among IOMMU groups) but also invalidates the TCE cache
> (one per IOMMU group).
> 
> This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> space.
> 
> This adds real mode version of WARN_ON_ONCE() as the generic version
> causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> returns in the code, this also adds a check for already existing
> vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> 
> This finally makes use of vfio_external_user_iommu_id() which was
> introduced quite some time ago and was considered for removal.
> 
> Tests show that this patch increases transmission speed from 220MB/s
> to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
> Changes:
> v8:
> * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> to handle them
> * changed vmalloc_to_phys() callers to return H_HARDWARE
> * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> and added a comment about this in the code
> * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> and do WARN_ON
> * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> have all vmalloc_to_phys() callsites covered
> 
> v7:
> * added realmode-friendly WARN_ON_ONCE_RM
> 
> v6:
> * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> * moved kvmppc_gpa_to_ua() to TCE validation
> 
> v5:
> * changed error codes in multiple places
> * added bunch of WARN_ON() in places which should not really happen
> * adde a check that an iommu table is not attached already to LIOBN
> * dropped explicit calls to iommu_tce_clear_param_check/
> iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> call them anyway (since the previous patch)
> * if we fail to update a hardware IOMMU table for unexpected reason,
> this just clears the entry
> 
> v4:
> * added note to the commit log about allowing multiple updates of
> the same IOMMU table;
> * instead of checking whether any memory was preregistered, this
> returns H_TOO_HARD if a specific page was not;
> * fixed comments from v3 about error handling in many places;
> * simplified TCE handlers and merged IOMMU parts inline - for example,
> there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> kvmppc_h_put_tce(); this allows to check IOBA boundaries against
> the first attached table only (makes the code simpler);
> 
> v3:
> * simplified not to use VFIO group notifiers
> * reworked cleanup, should be cleaner/simpler now
> 
> v2:
> * reworked to use new VFIO notifiers
> * now same iommu

Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-14 Thread Till Smejkal
On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
>  wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> >> This sounds rather complicated.  Getting TLB flushing right seems
> >> tricky.  Why not just map the same thing into multiple mms?
> >
> > This is exactly what happens at the end. The memory region that is 
> > described by the
> > VAS segment will be mapped in the ASes that use the segment.
> 
> So why is this kernel feature better than just doing MAP_SHARED
> manually in userspace?

One advantage of VAS segments is that they can be globally queried by user
programs, which means that VAS segments can be shared by applications that do
not necessarily have to be related. If I am not mistaken, MAP_SHARED of pure
in-memory data will only work if the tasks that share the memory region are
related (aka. have a common parent that initialized the shared mapping).
Otherwise, the shared mapping has to be backed by a file. VAS segments, on the
other hand, allow sharing of pure in-memory data by arbitrary, not necessarily
related, tasks without the need of a file. This becomes especially interesting
if one combines VAS segments with non-volatile memory, since one can keep data
structures in the NVM and still be able to share them between multiple tasks.

> >> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> >> and not make it look magically different depending on which process
> >> maps it?  If you need a trampoline (which you do, of course), just
> >> write a trampoline in regular user code and map it manually.
> >
> > Did I understand you correctly that you are proposing that the switching 
> > thread
> > should make sure by itself that its code, stack, … memory regions are 
> > properly setup
> > in the new AS before/after switching into it? I think, this would make 
> > using first
> > class virtual address spaces much more difficult for user applications to 
> > the extend
> > that I am not even sure if they can be used at all. At the moment, 
> > switching into a
> > VAS is a very simple operation for an application because the kernel will 
> > just simply
> > do the right thing.
> 
> Yes.  I think that having the same mm_struct look different from
> different tasks is problematic.  Getting it right in the arch code is
> going to be nasty.  The heuristics of what to share are also tough --
> why would text + data + stack or whatever you're doing be adequate?
> What if you're in a thread?  What if two tasks have their stacks in
> the same place?

The different ASes that a task now can have when it uses first class virtual 
address
spaces are not realized in the kernel by using only one mm_struct per task that 
just
looks different, but by using multiple mm_structs - one for each AS that the
task
can execute in. When a task attaches a first class virtual address space to 
itself to
be able to use another AS, the kernel adds a temporary mm_struct to this task 
that
contains the mappings of the first class virtual address space and the one 
shared
with the task's original AS. If a thread now wants to switch into this attached 
first
class virtual address space the kernel only changes the 'mm' and 'active_mm' 
pointers
in the task_struct of the thread to the temporary mm_struct and performs the
corresponding mm_switch operation. The original mm_struct of the thread will 
not be
changed.
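
As a rough sketch of what that switch amounts to (names such as vas_mm are
placeholders, and the locking/TLB details of the actual patches are omitted):

  /* thread switches from its original mm to the VAS's temporary mm */
  struct mm_struct *old_mm = tsk->active_mm;

  task_lock(tsk);
  tsk->mm = vas_mm;                     /* temporary mm built for this VAS */
  tsk->active_mm = vas_mm;
  switch_mm(old_mm, vas_mm, tsk);       /* activate the new page tables */
  task_unlock(tsk);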

Accordingly, I do not magically make mm_structs look different depending on the
task that uses them, but create temporary mm_structs that only contain mappings
to the
same memory regions.

I agree that finding a good heuristic of what to share is difficult. At the
moment,
all memory regions that are available in the task's original AS will also be
available when a thread switches into an attached first class virtual address 
space
(aka. are shared). That means that VAS can mainly be used to extend the AS of a 
task
in the current state of the implementation. The reason why I implemented the 
sharing
in this way is that I didn't want to break shared libraries. If I only share
code+heap+stack, shared libraries would not work anymore after switching into a 
VAS.

> I could imagine something like a sigaltstack() mode that lets you set
> a signal up to also switch mm could be useful.

This is a very interesting idea. I will keep it in mind for future use cases of
multiple virtual address spaces per task.

Thanks
Till


Re: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'

2017-03-14 Thread Till Smejkal
On Tue, 14 Mar 2017, David Laight wrote:
> From: Linuxppc-dev Till Smejkal
> > Sent: 13 March 2017 22:14
> > The only way until now to create a new memory map was via the exported
> > function 'mm_alloc'. Unfortunately, this function not only allocates a new
> > memory map, but also completely initializes it. However, with the
> > introduction of first class virtual address spaces, some initialization
> > steps done in 'mm_alloc' are not applicable to the memory maps needed for
> > this feature and hence would lead to errors in the kernel code.
> > 
> > Instead of introducing a new function that can allocate and initialize
> > memory maps for first class virtual address spaces and potentially
> > duplicate some code, I decided to split the mm_alloc function as well as
> > the 'mm_init' function that it uses.
> > 
> > Now there are four functions exported instead of only one. The new
> > 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
> > want to have the old behavior of mm_alloc one can use the newly introduced
> > function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
> > also fully initializes it.
> ...
> 
> That looks like bugs waiting to happen.
> You need unchanged code to fail to compile.

Thank you for this hint. I can give the new mm_alloc function a different name 
so
that code that uses the *old* mm_alloc function will fail to compile. I just 
reused
the old name when I wrote the code, because mm_alloc was only used in very few
locations in the kernel (2 times in the whole kernel source) which made 
identifying
and changing them very easy. I also don't think that there will be many users 
in the
kernel for mm_alloc in the future because it is a relatively low level data
structure. But if it is better to use a different name for the new function, I 
am
very happy to change this.
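
Concretely, the two exported entry points could end up looking something like
this (mm_alloc_bare is a made-up name, only meant to illustrate the point that
the old symbol should disappear rather than silently change behaviour):

  /* bare allocation for first class virtual address spaces (new) */
  struct mm_struct *mm_alloc_bare(void);

  /* allocate and fully initialize, the historical mm_alloc() behaviour */
  struct mm_struct *mm_alloc_and_setup(void);

  /* no mm_alloc() any more, so stale callers fail to compile and must pick one */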

Till


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-14 Thread Chris Metcalf

On 3/14/2017 12:12 PM, Till Smejkal wrote:

> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
>>  wrote:
>>> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>>>> This sounds rather complicated.  Getting TLB flushing right seems
>>>> tricky.  Why not just map the same thing into multiple mms?
>>>
>>> This is exactly what happens at the end. The memory region that is
>>> described by the VAS segment will be mapped in the ASes that use the
>>> segment.
>>
>> So why is this kernel feature better than just doing MAP_SHARED
>> manually in userspace?
>
> One advantage of VAS segments is that they can be globally queried by user
> programs, which means that VAS segments can be shared by applications that
> do not necessarily have to be related. If I am not mistaken, MAP_SHARED of
> pure in-memory data will only work if the tasks that share the memory
> region are related (aka. have a common parent that initialized the shared
> mapping). Otherwise, the shared mapping has to be backed by a file.


True, but why is this bad?  The shared mapping will be memory resident
regardless, even if backed by a file (unless swapped out under heavy
memory pressure, but arguably that's a feature anyway).  More importantly,
having a file name is a simple and consistent way of identifying such
shared memory segments.

With a little work, you can also arrange to map such files into memory
at a fixed address in all participating processes, thus making internal
pointers work correctly.
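
As a concrete illustration of the fixed-address file mapping just described, a
minimal userspace sketch (the shared memory name, the fixed address, and the
size are arbitrary choices for this example; every participating process would
run essentially the same code; link with -lrt on older glibc):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_NAME "/demo_segment"              /* arbitrary example name */
#define SHARED_ADDR ((void *)0x700000000000UL)   /* agreed-upon fixed address */
#define SHARED_SIZE (1UL << 20)

int main(void)
{
        int fd = shm_open(SHARED_NAME, O_CREAT | O_RDWR, 0600);

        if (fd < 0 || ftruncate(fd, SHARED_SIZE) < 0)
                return 1;

        /*
         * MAP_SHARED makes stores visible to every process mapping the same
         * object; MAP_FIXED places it at the agreed address so that pointers
         * stored inside the region stay valid everywhere.  Note that MAP_FIXED
         * silently replaces whatever was mapped at that address, so the
         * address has to be chosen with care.
         */
        void *p = mmap(SHARED_ADDR, SHARED_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        strcpy(p, "visible to any process that maps " SHARED_NAME " here");
        printf("mapped at %p\n", p);
        return 0;
}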


VAS segments, on the other hand, allow sharing of purely in-memory data by
arbitrary, not necessarily related, tasks without the need of a file. This
becomes especially interesting if one combines VAS segments with non-volatile
memory, since one can keep data structures in the NVM and still be able to
share them between multiple tasks.


I am not fully up to speed on NV/pmem stuff, but isn't that exactly what
the DAX mode is supposed to allow you to do?  If so, isn't sharing a
mapped file on a DAX filesystem on top of pmem equivalent to what
you're proposing?

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-14 Thread Till Smejkal
On Tue, 14 Mar 2017, Chris Metcalf wrote:
> On 3/14/2017 12:12 PM, Till Smejkal wrote:
> > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
> > >  wrote:
> > > > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > > > This sounds rather complicated.  Getting TLB flushing right seems
> > > > > tricky.  Why not just map the same thing into multiple mms?
> > > > This is exactly what happens at the end. The memory region that is 
> > > > described by the
> > > > VAS segment will be mapped in the ASes that use the segment.
> > > So why is this kernel feature better than just doing MAP_SHARED
> > > manually in userspace?
> > One advantage of VAS segments is that they can be globally queried by user
> > programs, which means that VAS segments can be shared by applications that
> > do not necessarily have to be related. If I am not mistaken, MAP_SHARED of
> > purely in-memory data will only work if the tasks that share the memory
> > region are related (i.e. have a common parent that initialized the shared
> > mapping). Otherwise, the shared mapping has to be backed by a file.
> 
> True, but why is this bad?  The shared mapping will be memory resident
> regardless, even if backed by a file (unless swapped out under heavy
> memory pressure, but arguably that's a feature anyway).  More importantly,
> having a file name is a simple and consistent way of identifying such
> shared memory segments.
> 
> With a little work, you can also arrange to map such files into memory
> at a fixed address in all participating processes, thus making internal
> pointers work correctly.

I don't want to say that the interface provided by MAP_SHARED is bad. I am only
arguing that VAS segments and the interface that they provide have, in my
opinion, an advantage over the existing ones. However, Matthew Wilcox also
suggested in an earlier mail that VAS segments could be exported to user space
via a special purpose filesystem. This would enable users of VAS segments to
also just use some special files to set up the shared memory regions. But since
the VAS segment itself already knows where it has to be mapped in the virtual
address space of the process, establishing the shared memory region would be
very easy for the user.

> > VAS segments, on the other hand, allow sharing of purely in-memory data by
> > arbitrary, not necessarily related, tasks without the need of a file. This
> > becomes especially interesting if one combines VAS segments with
> > non-volatile memory, since one can keep data structures in the NVM and
> > still be able to share them between multiple tasks.
> 
> I am not fully up to speed on NV/pmem stuff, but isn't that exactly what
> the DAX mode is supposed to allow you to do?  If so, isn't sharing a
> mapped file on a DAX filesystem on top of pmem equivalent to what
> you're proposing?

If I read the documentation for DAX filesystems correctly, it is indeed possible
to use them to create files that live purely in NVM. I wasn't fully aware of
this feature. Thanks for the pointer.

However, the main contribution of this patchset is actually the idea of first
class virtual address spaces and that they can be used to allow processes to
have multiple different views of the system's main memory. For us, VAS segments
were another logical step in the same direction (from first class virtual
address spaces to first class address space segments). But if there is already
functionality in the Linux kernel to achieve the exact same behavior, there is
no real need to add VAS segments. I will continue thinking about them and either
find a different situation where the currently available interface is not
sufficient or too complicated, or drop VAS segments from future versions of the
patch set.

Till


Re: syscall statx not implemented on powerpc

2017-03-14 Thread Michael Ellerman
Chris Packham  writes:

> On 15/03/17 00:57, Michael Ellerman wrote:
>> Chris Packham  writes:
>>> On 13/03/17 21:52, Chandan Rajendra wrote:
 On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote:
> I've just attempted to build a powerpc kernel from 4.11-rc2 using a
> custom defconfig (available on request) and I'm hitting the following
> error in the early stages of compilation.
>
> <stdin>:1325:2: error: #warning syscall statx not implemented [-Werror=cpp]
>
> Same thing seems to happen with mpc85xx_basic_defconfig.
>
> I don't actually need this syscall so I'd be happy to turn something off
> to get things building. I did a quick search and couldn't see anything
> on linuxppc-dev but google keeps correcting "statx" to "stats" so I
> could have missed it.

 The upstream commit
 (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f)
 that introduces the statx syscall provides a test program. I will wire-up 
 the
 syscall on ppc64, run that test program and post the patch if the test 
 program
 works well.
>>>
>>> Thanks, I'd be happy to test a patch here.
>>>
>>> In the meantime I worked around the build issue by adding __IGNORE_statx
>>> to checksyscalls.sh.
>>
>> Is it actually breaking the build? It's meant to be a warning, and
>> that's all I get.
>
> It breaks my builds for ppc and arm. We do have KCFLAGS=-Werror in our 
> environment but KCPPFLAGS= is not set so I'm not sure why it's stopping 
> compilation.

OK, I guess that must be it.

We turn on -Werror for all of arch/powerpc by default (unless
CONFIG_PPC_DISABLE_WERROR=y), but the syscall checking script is not
under arch/powerpc so shouldn't be affected by that.

cheers


Re: Linux 4.11: Reported regressions as of Tuesday, 2017-03-14

2017-03-14 Thread Michael Ellerman
Thorsten Leemhuis  writes:

> Hi! Find below my first regression report for Linux 4.11. It lists 9
> regressions I'm currently aware of.
>
> As always: Are you aware of any other regressions? Then please let me
> know (simply CC regressi...@leemhuis.info). And please tell me if there
> is anything in the report that shouldn't be there.
>
> Ciao, Thorsten
>
> P.S.: Sorry, I didn't compile any regression reports for 4.10: I 
> didn't find time due to various reasons (vacation, a cold, regular 
> work, and attending two conferences). Reminder: compiling these reports
> has nothing to do with my paid job and I'm doing it in my spare time
> just because I think someone should do it.
>
> P.P.S: Dear Gmane mainling list archive webinterface, please come
> back soon. I really, really miss you. It hurts every single day.
> Don't you miss me, too? ;-)
>
> == Current regressions ==
>
> Desc: PowerPC crashes on boot, bisected to commit 5657933dbb6e
> Repo: 2017-03-02 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1343553.html
> Stat: 2017-03-09 
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349568.html
> Note: patch hopefully heading mainline

Now fixed in Linus' tree by commit 46f401c4297a.

cheers


Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings

2017-03-14 Thread Michael Ellerman
Geert Uytterhoeven  writes:

> Submitters of device tree binding documentation may forget to CC
> the subsystem maintainer if this is missing.
>
> Signed-off-by: Geert Uytterhoeven 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
> Please apply this patch directly if you want to be involved in device
> tree binding documentation for your subsystem.
>
> v2:
>   - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove
> powerpc's "opal" pattern match"), which just added "powerpc/opal",
> while obviously the whole "powerpc" hierarchy is of interest.
>
> Impact on next-20170310:

Actual diff ignoring ordering etc:

 +Benjamin Herrenschmidt  (supporter:LINUX FOR 
POWERPC (32-BIT AND 64-BIT))
 +Paul Mackerras  (supporter:LINUX FOR POWERPC (32-BIT AND 
64-BIT))
 +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 
64-BIT))
 -Scott Wood  (commit_signer:5/11=45%)
 -Zhao Qiang  (commit_signer:4/11=36%,authored:4/11=36%)
 -Christian Lamparter  (commit_signer:1/11=9%)
 -yangbo lu  (authored:1/11=9%)
 -"Otto Kekäläinen"  (authored:1/11=9%)
 -Chris Packham  (authored:1/11=9%)
 -York Sun  (authored:1/11=9%)

Which looks bad as all the NXP folks get dropped, but they should be
reading linuxppc-dev. So I think I'll merge this, unless anyone
disagrees.

cheers


Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO

2017-03-14 Thread David Gibson
On Tue, Mar 14, 2017 at 03:05:27PM -0600, Alex Williamson wrote:
> On Fri, 10 Mar 2017 14:53:37 +1100
> Alexey Kardashevskiy  wrote:
> 
> > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
> > and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
> > without passing them to user space which saves time on switching
> > to user space and back.
> > 
> > This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
> > KVM tries to handle a TCE request in real mode; if that fails,
> > it passes the request to virtual mode to complete the operation.
> > If the virtual mode handler also fails, the request is passed to
> > user space; this is not expected to happen though.
> > 
> > To avoid dealing with page use counters (which is tricky in real mode),
> > this only accelerates SPAPR TCE IOMMU v2 clients which are required
> > to pre-register the userspace memory. The very first TCE request will
> > be handled in the VFIO SPAPR TCE driver anyway as the userspace view
> > of the TCE table (iommu_table::it_userspace) is not allocated till
> > the very first mapping happens and we cannot call vmalloc in real mode.
> > 
> > If we fail to update a hardware IOMMU table for an unexpected reason, we just
> > clear it and move on as there is nothing really we can do about it -
> > for example, if we hot plug a VFIO device to a guest, existing TCE tables
> > will be mirrored automatically to the hardware and there is no interface
> > to report to the guest about possible failures.
> > 
> > This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
> > the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
> > and associates a physical IOMMU table with the SPAPR TCE table (which
> > is a guest view of the hardware IOMMU table). The iommu_table object
> > is cached and referenced so we do not have to look up for it in real mode.
> > 
> > This does not implement the UNSET counterpart as there is no use for it -
> > once the acceleration is enabled, the existing userspace won't
> > disable it unless a VFIO container is destroyed; this adds necessary
> > cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
> > 
> > As this creates a descriptor per IOMMU table-LIOBN couple (called
> > kvmppc_spapr_tce_iommu_table), it is possible to have several
> > descriptors with the same iommu_table (hardware IOMMU table) attached
> > to the same LIOBN; we do not remove duplicates though as
> > iommu_table_ops::exchange not just update a TCE entry (which is
> > shared among IOMMU groups) but also invalidates the TCE cache
> > (one per IOMMU group).
> > 
> > This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
> > space.
> > 
> > This adds real mode version of WARN_ON_ONCE() as the generic version
> > causes problems with rcu_sched. Since we are testing what vmalloc_to_phys()
> > returns in the code, this also adds a check for already existing
> > vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
> > 
> > This finally makes use of vfio_external_user_iommu_id() which was
> > introduced quite some time ago and was considered for removal.
> > 
> > Tests show that this patch increases transmission speed from 220MB/s
> > to 750..1020MB/s on a 10Gb network (Chelsio CXGB3 10Gb ethernet card).
> > 
> > Signed-off-by: Alexey Kardashevskiy 
> > ---
> > Changes:
> > v8:
> > * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed
> > to handle them
> > * changed vmalloc_to_phys() callers to return H_HARDWARE
> > * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD
> > and added a comment about this in the code
> > * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE
> > and do WARN_ON
> > * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to
> > have all vmalloc_to_phys() callsites covered
> > 
> > v7:
> > * added realmode-friendly WARN_ON_ONCE_RM
> > 
> > v6:
> > * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map()
> > * moved kvmppc_gpa_to_ua() to TCE validation
> > 
> > v5:
> > * changed error codes in multiple places
> > * added bunch of WARN_ON() in places which should not really happen
> > * adde a check that an iommu table is not attached already to LIOBN
> > * dropped explicit calls to iommu_tce_clear_param_check/
> > iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate
> > call them anyway (since the previous patch)
> > * if we fail to update a hardware IOMMU table for unexpected reason,
> > this just clears the entry
> > 
> > v4:
> > * added note to the commit log about allowing multiple updates of
> > the same IOMMU table;
> > * instead of checking for if any memory was preregistered, this
> > returns H_TOO_HARD if a specific page was not;
> > * fixed comments from v3 about error handling in many places;
> > * simplified TCE handlers and merged IOMMU parts inline - for example,
> > there used to be kvmppc_h_put_tce_iommu(), now it is merged into
> > kvmppc_h_put_tce(); this allows 

Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c

2017-03-14 Thread Michael Ellerman
Arnd Bergmann  writes:

> On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman  
> wrote:
>> Michael Ellerman  writes:
>>
>>> We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it
>>> from electra_cf.c which is a powerpc-only driver.
>>>
>>> Signed-off-by: Michael Ellerman 
>>> ---
>>>  drivers/pcmcia/electra_cf.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> Ping anyone?
>>
>> Or should I merge this via the powerpc tree?
>
> That's what I would recommend for a powerpc specific pcmcia driver, yes.

Suits me.

> Looking at the bigger picture of powerpc drivers using NO_IRQ, I also
> see these others:
>
> drivers/ata/pata_mpc52xx.c: if (ata_irq == NO_IRQ) {
> drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ
> drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0
> drivers/ata/sata_dwc_460ex.c:   if (hsdev->dma->irq == NO_IRQ) {
> drivers/ata/sata_dwc_460ex.c:   if (irq == NO_IRQ) {
> drivers/iommu/fsl_pamu.c:   if (irq == NO_IRQ) {
> drivers/iommu/fsl_pamu.c:   if (irq != NO_IRQ)
> drivers/media/platform/fsl-viu.c:   if (viu_irq == NO_IRQ) {
> drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) {
> drivers/pcmcia/electra_cf.c:cf->irq = NO_IRQ;
> drivers/pcmcia/electra_cf.c:if (cf->irq != NO_IRQ)
> drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
> no interrupt is pending. */
> drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
> drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if
> no interrupt is pending. */
> drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ;
> drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_low == NO_IRQ) {
> drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_high != NO_IRQ &&
> drivers/spi/spi-mpc52xx.c:  if (status && (irq != NO_IRQ))
> drivers/tty/ehv_bytechan.c: if (stdout_irq == NO_IRQ) {
> drivers/tty/ehv_bytechan.c: if ((bc->rx_irq == NO_IRQ) ||
> (bc->tx_irq == NO_IRQ)) {
> drivers/tty/serial/cpm_uart/cpm_uart_core.c:if (pinfo->port.irq == 
> NO_IRQ) {
> drivers/uio/uio_fsl_elbc_gpcm.c:if (irq != NO_IRQ) {
> drivers/uio/uio_fsl_elbc_gpcm.c:irq = NO_IRQ;
> drivers/uio/uio_fsl_elbc_gpcm.c: irq != NO_IRQ ? irq : -1);
> drivers/usb/host/ehci-grlib.c:  if (irq == NO_IRQ) {
> drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) {
> drivers/usb/host/fhci-hcd.c:if (usb_irq == NO_IRQ) {
> drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) {
> drivers/usb/host/uhci-grlib.c:  if (irq == NO_IRQ) {
> drivers/video/fbdev/mb862xx/mb862xxfbdrv.c: if (par->irq == NO_IRQ) {
> drivers/virt/fsl_hypervisor.c:  if (!handle || (irq == NO_IRQ)) {
> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
> include/soc/fsl/qe/qe_ic.h: if (cascade_irq == NO_IRQ)
> include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ)
>
> Did you have other pending patches for those?

No. I stayed away from anything FSL related as I was under the
impression some of them were being ported to arch/arm, which uses -1 for
NO_IRQ IIUIC.

eg. all of include/soc/fsl and drivers/soc/fsl was moved from
arch/powerpc in commit 7aa1aa6ecec2, which said:

QE: Move QE from arch/powerpc to drivers/soc

ls1 has qe and ls1 has arm cpu.
move qe from arch/powerpc to drivers/soc/fsl
to adapt to powerpc and arm

But looking at the Kconfigs it looks like they're still only selectable
on PPC. So that's a bit annoying.

I'll do patches for everything above that's not drivers/soc or
include/soc and hopefully we can hear from someone at NXP on the plans
for getting the soc parts enabled on arm.
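
(For reference, the conversion involved is mechanical. A hedged sketch of the
before/after pattern in a driver probe path; illustrative code only, not the
actual electra_cf.c diff, and 'np' stands for the device's struct device_node:)

/* Before: legacy powerpc style, comparing against NO_IRQ (0 on powerpc). */
unsigned int irq = irq_of_parse_and_map(np, 0);
if (irq == NO_IRQ)
        return -ENODEV;

/* After: irq_of_parse_and_map() returns 0 when no mapping exists, so the
 * portable form simply tests for zero and the NO_IRQ macro goes away. */
unsigned int irq = irq_of_parse_and_map(np, 0);
if (!irq)
        return -ENODEV;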

cheers


Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-14 Thread Michael Ellerman
Hi Peter,

Peter Zijlstra  writes:
> On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote:
>
>> >Huh? PPC hasn't yet implemented this? Then why are you fixing it?
>> 
>> yes, PPC hasn't implemented this (until now).
>
> until now where?

On powerpc there is currently no kernel support for filling the data_src
value with anything meaningful.

A user can still request PERF_SAMPLE_DATA_SRC (perf report -d), but they
just get the default value from perf_sample_data_init(), which is
PERF_MEM_NA.

Though even that is currently broken with a big endian perf tool.

>> And did not understand "Then why are you fixing it?"
>
> I see no implementation; so why are you poking at it.

Maddy has posted an implementation of the kernel part for powerpc in
patch 2 of this series, but maybe you're not on Cc?


Regardless of us wanting to do the kernel side on powerpc, the current
API is broken on big endian.

That's because in the kernel the PERF_MEM_NA value is constructed using
shifts:

  /* TLB access */
  #define PERF_MEM_TLB_NA   0x01 /* not available */
  ...
  #define PERF_MEM_TLB_SHIFT26
  
  #define PERF_MEM_S(a, s) \
(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)
  
  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
PERF_MEM_S(LVL, NA)   |\
PERF_MEM_S(SNOOP, NA) |\
PERF_MEM_S(LOCK, NA)  |\
PERF_MEM_S(TLB, NA))

Which works out as:

  ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))


Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021
in CPU endian.

But then in the perf tool, the code uses the bitfields to inspect the
value, and currently the bitfields are defined using little endian
ordering.

So eg. in perf_mem__tlb_scnprintf() we see:
  data_src->val = 0x5080021
 op = 0x0
lvl = 0x0
  snoop = 0x0
   lock = 0x0
   dtlb = 0x0
   rsvd = 0x5080021


So this patch does what I think is the minimal fix, of changing the
definition of the bitfields to match the values that are already
exported by the kernel on big endian. And it makes no change on little
endian.
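
For illustration, a minimal userspace sketch of the mismatch (the field names
and widths mirror the perf_mem_data_src layout in
include/uapi/linux/perf_event.h, but this is a standalone toy, not the perf
tool's actual code):

#include <stdio.h>
#include <stdint.h>

union mem_data_src {
        uint64_t val;
        struct {
                /* Little-endian declaration order: lowest bits first. */
                uint64_t mem_op:5;      /* type of opcode */
                uint64_t mem_lvl:14;    /* memory hierarchy level */
                uint64_t mem_snoop:5;   /* snoop mode */
                uint64_t mem_lock:2;    /* lock instruction */
                uint64_t mem_dtlb:7;    /* TLB access */
                uint64_t mem_rsvd:31;
        };
};

int main(void)
{
        union mem_data_src d = { .val = 0x5080021ULL }; /* PERF_MEM_NA */

        /*
         * On a little-endian host this prints op/lvl/snoop/lock/dtlb all as 1
         * and rsvd as 0.  On a typical big-endian ABI the first-declared
         * field occupies the most significant bits, so every set bit of
         * 0x5080021 lands in mem_rsvd and the named fields all read as 0,
         * which matches the perf_mem__tlb_scnprintf() output shown above.
         */
        printf("op=%llx lvl=%llx snoop=%llx lock=%llx dtlb=%llx rsvd=%llx\n",
               (unsigned long long)d.mem_op,
               (unsigned long long)d.mem_lvl,
               (unsigned long long)d.mem_snoop,
               (unsigned long long)d.mem_lock,
               (unsigned long long)d.mem_dtlb,
               (unsigned long long)d.mem_rsvd);
        return 0;
}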

cheers