Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src
On Monday 13 March 2017 06:20 PM, Peter Zijlstra wrote:
> On Mon, Mar 13, 2017 at 04:45:51PM +0530, Madhavan Srinivasan wrote:
>>> - should you not have fixed this in the tool only? This patch
>>> effectively breaks ABI on big-endian architectures.
>>
>> IIUC, we are the first BE user for this feature (kindly correct me if
>> I am wrong), so technically we are not breaking ABI here :). But let
>> me also look at the dynamic conversion part.
>
> Huh? PPC hasn't yet implemented this? Then why are you fixing it?

Yes, PPC hasn't implemented this (until now). And I did not understand
"Then why are you fixing it?"

Maddy
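For readers following the ABI argument, the crux is C bitfield allocation order: compilers fill bitfields from the low-order end of the word on little-endian targets and from the high-order end on big-endian ones. The sketch below is illustrative only — the field names and widths are loosely modeled on `perf_mem_data_src`, not copied from the uapi header:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model of the problem under discussion: a union of a u64
 * with C bitfields.  The same field maps to different bits of `val`
 * depending on the target's endianness, so a layout written with only
 * little-endian in mind breaks when a big-endian producer appears.
 */
union mem_data_src {
	uint64_t val;
	struct {
		uint64_t mem_op:5,   /* type of opcode */
			 mem_lvl:14, /* memory hierarchy level */
			 mem_rsvd:45;
	} fields;
};

static uint64_t mem_op_encoding(void)
{
	union mem_data_src d = { .val = 0 };

	d.fields.mem_op = 0x1f;
	/*
	 * Little-endian: mem_op occupies bits 0..4, so val == 0x1f.
	 * Big-endian: the same store lands in the top bits of val,
	 * which is why either the kernel must define a BE-specific
	 * layout or the tool must convert at read time.
	 */
	return d.val;
}
```

This is why "fix it in the tool" versus "define a big-endian layout in the kernel" are the two options being debated above: both sides must agree on which bits mean what.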
[PATCH 0/8] idle fixes and changes for POWER8 and POWER9
Hi,

This is a resend of the previous idle series, with the addition that I
accounted for Gautham's feedback, and also re-introduced the feature to
avoid full state restore by counting winkles rather than a special HSPRG0
bit.

The two big things we get from this are no longer messing with HSPRG0 in
the idle wakeup path on POWER8, and not crashing on POWER9 machine check
from power-saving mode (tested in mambo according to ISA specs, but I have
not triggered it on real hardware).

I only added the reviewed-by for patches which were not significantly
changed since last time.

Thanks,
Nick

Nicholas Piggin (8):
  powerpc/64s: move remaining system reset idle code into idle_book3s.S
  powerpc/64s: stop using bit in HSPRG0 to test winkle
  powerpc/64s: use alternative feature patching
  powerpc/64s: fix POWER9 machine check handler from stop state
  powerpc/64s: use PACA_THREAD_IDLE_STATE only in POWER8
  powerpc/64s: idle expand usable core idle state bits
  powerpc/64s: idle do not hold reservation longer than required
  powerpc/64s: idle POWER8 avoid full state loss recovery when possible

 arch/powerpc/include/asm/cpuidle.h       | 32 +-
 arch/powerpc/include/asm/exception-64s.h | 13 +--
 arch/powerpc/include/asm/paca.h          | 12 +-
 arch/powerpc/include/asm/reg.h           | 1 +
 arch/powerpc/kernel/exceptions-64s.S     | 112 ++
 arch/powerpc/kernel/idle_book3s.S        | 189 ---
 arch/powerpc/platforms/powernv/idle.c    | 13 ---
 7 files changed, 202 insertions(+), 170 deletions(-)

-- 
2.11.0
[PATCH 1/8] powerpc/64s: move remaining system reset idle code into idle_book3s.S
Should be no functional change. Reviewed-by: Gautham R. Shenoy Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/exceptions-64s.S | 26 +--- arch/powerpc/kernel/idle_book3s.S| 76 2 files changed, 51 insertions(+), 51 deletions(-) diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 857bf7c5b946..2f837a4a78a2 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -126,31 +126,7 @@ EXC_VIRT_NONE(0x4100, 0x100) #ifdef CONFIG_PPC_P7_NAP EXC_COMMON_BEGIN(system_reset_idle_common) -BEGIN_FTR_SECTION - GET_PACA(r13) /* Restore HSPRG0 to get the winkle bit in r13 */ -END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300) - bl pnv_restore_hyp_resource - - li r0,PNV_THREAD_RUNNING - stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */ - -#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE - li r0,KVM_HWTHREAD_IN_KERNEL - stb r0,HSTATE_HWTHREAD_STATE(r13) - /* Order setting hwthread_state vs. testing hwthread_req */ - sync - lbz r0,HSTATE_HWTHREAD_REQ(r13) - cmpwi r0,0 - beq 1f - BRANCH_TO_KVM(r10, kvm_start_guest) -1: -#endif - - /* Return SRR1 from power7_nap() */ - mfspr r3,SPRN_SRR1 - blt cr3,2f - b pnv_wakeup_loss -2: b pnv_wakeup_noloss + b pnv_powersave_wakeup #endif EXC_COMMON(system_reset_common, 0x100, system_reset_exception) diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 995728736677..4313c107da5d 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -113,7 +114,7 @@ core_idle_lock_held: * * Address to 'rfid' to in r5 */ -_GLOBAL(pnv_powersave_common) +pnv_powersave_common: /* Use r3 to pass state nap/sleep/winkle */ /* NAP is a state loss, we create a regs frame on the * stack, fill it up with the state we care about and @@ -188,8 +189,8 @@ pnv_enter_arch207_idle_mode: /* The following store to HSTATE_HWTHREAD_STATE(r13) */ /* MUST occur in real mode, 
i.e. with the MMU off,*/ /* and the MMU must stay off until we clear this flag */ - /* and test HSTATE_HWTHREAD_REQ(r13) in the system*/ - /* reset interrupt vector in exceptions-64s.S.*/ + /* and test HSTATE_HWTHREAD_REQ(r13) in */ + /* pnv_powersave_wakeup in this file. */ /* The reason is that another thread can switch the */ /* MMU to a guest context whenever this flag is set */ /* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on,*/ @@ -375,15 +376,42 @@ _GLOBAL(power9_idle_stop) li r4,1 b pnv_powersave_common /* No return */ + +.global pnv_powersave_wakeup +pnv_powersave_wakeup: +BEGIN_FTR_SECTION + GET_PACA(r13) /* Restore HSPRG0 to get the winkle bit in r13 */ +END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300) + bl pnv_restore_hyp_resource + + li r0,PNV_THREAD_RUNNING + stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */ + +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE + li r0,KVM_HWTHREAD_IN_KERNEL + stb r0,HSTATE_HWTHREAD_STATE(r13) + /* Order setting hwthread_state vs. testing hwthread_req */ + sync + lbz r0,HSTATE_HWTHREAD_REQ(r13) + cmpwi r0,0 + beq 1f + BRANCH_TO_KVM(r10, kvm_start_guest) +1: +#endif + + /* Return SRR1 from power7_nap() */ + mfspr r3,SPRN_SRR1 + blt cr3,pnv_wakeup_noloss + b pnv_wakeup_loss + /* - * Called from reset vector. Check whether we have woken up with - * hypervisor state loss. If yes, restore hypervisor state and return - * back to reset vector. + * Check whether we have woken up with hypervisor state loss. + * If yes, restore hypervisor state and return back to link. * * r13 - Contents of HSPRG0 * cr3 - set to gt if waking up with partial/complete hypervisor state loss */ -_GLOBAL(pnv_restore_hyp_resource) +pnv_restore_hyp_resource: BEGIN_FTR_SECTION ld r2,PACATOC(r13); /* @@ -400,12 +428,9 @@ BEGIN_FTR_SECTION */ rldicl r5,r5,4,60 cmpdcr4,r5,r4 - bge cr4,pnv_wakeup_tb_loss - /* -* Waking up without hypervisor state loss. 
Return to -* reset vector -*/ - blr + bge cr4,pnv_wakeup_tb_loss /* returns to caller */ + + blr /* Waking up without hypervisor state loss. */ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) @@ -433,8 +458,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) */ bgt cr3,. - blr /* Return back to System Reset vector from where - pnv_restore_hyp_resource was invoked */ +
[PATCH 2/8] powerpc/64s: stop using bit in HSPRG0 to test winkle
The POWER8 idle code has a neat trick of programming the power on engine to restore a low bit into HSPRG0, so idle wakeup code can test and see if it has been programmed this way and therefore lost all state, and avoiding the expensive full restore if not. However this messes with our r13 PACA pointer, and requires HSPRG0 to be written to throughout the exception handlers and idle wakeup, rather than just once on kernel entry. Remove this complexity and assume winkle sleeps always require a state restore. This speedup is later re-introduced by counting per-core winkles and setting a bitmap of threads with state loss when all are in winkle. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/exception-64s.h | 13 ++--- arch/powerpc/kernel/exceptions-64s.S | 21 +++-- arch/powerpc/kernel/idle_book3s.S| 23 --- arch/powerpc/platforms/powernv/idle.c| 13 - 4 files changed, 13 insertions(+), 57 deletions(-) diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h index 14752eee3d0c..f5fe7c901b37 100644 --- a/arch/powerpc/include/asm/exception-64s.h +++ b/arch/powerpc/include/asm/exception-64s.h @@ -167,17 +167,14 @@ BEGIN_FTR_SECTION_NESTED(943) \ std ra,offset(r13); \ END_FTR_SECTION_NESTED(ftr,ftr,943) -#define EXCEPTION_PROLOG_0_PACA(area) \ +#define EXCEPTION_PROLOG_0(area) \ + GET_PACA(r13); \ std r9,area+EX_R9(r13); /* save r9 */ \ OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR); \ HMT_MEDIUM; \ std r10,area+EX_R10(r13); /* save r10 - r12 */\ OPT_GET_SPR(r10, SPRN_CFAR, CPU_FTR_CFAR) -#define EXCEPTION_PROLOG_0(area) \ - GET_PACA(r13); \ - EXCEPTION_PROLOG_0_PACA(area) - #define __EXCEPTION_PROLOG_1(area, extra, vec) \ OPT_SAVE_REG_TO_PACA(area+EX_PPR, r9, CPU_FTR_HAS_PPR); \ OPT_SAVE_REG_TO_PACA(area+EX_CFAR, r10, CPU_FTR_CFAR); \ @@ -208,12 +205,6 @@ END_FTR_SECTION_NESTED(ftr,ftr,943) EXCEPTION_PROLOG_1(area, extra, vec); \ EXCEPTION_PROLOG_PSERIES_1(label, h); -/* Have the PACA in r13 already */ -#define 
EXCEPTION_PROLOG_PSERIES_PACA(area, label, h, extra, vec) \ - EXCEPTION_PROLOG_0_PACA(area); \ - EXCEPTION_PROLOG_1(area, extra, vec); \ - EXCEPTION_PROLOG_PSERIES_1(label, h); - #define __KVMTEST(h, n) \ lbz r10,HSTATE_IN_GUEST(r13); \ cmpwi r10,0; \ diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index 2f837a4a78a2..e390fcd04bcb 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -116,9 +116,7 @@ EXC_VIRT_NONE(0x4000, 0x100) EXC_REAL_BEGIN(system_reset, 0x100, 0x100) SET_SCRATCH0(r13) - GET_PACA(r13) - clrrdi r13,r13,1 /* Last bit of HSPRG0 is set if waking from winkle */ - EXCEPTION_PROLOG_PSERIES_PACA(PACA_EXGEN, system_reset_common, EXC_STD, + EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, system_reset_common, EXC_STD, IDLETEST, 0x100) EXC_REAL_END(system_reset, 0x100, 0x100) @@ -148,14 +146,6 @@ EXC_REAL_BEGIN(machine_check, 0x200, 0x100) * vector */ SET_SCRATCH0(r13) /* save r13 */ - /* -* Running native on arch 2.06 or later, we may wakeup from winkle -* inside machine check. If yes, then last bit of HSPRG0 would be set -* to 1. Hence clear it unconditionally. -*/ - GET_PACA(r13) - clrrdi r13,r13,1 - SET_PACA(r13) EXCEPTION_PROLOG_0(PACA_EXMC) BEGIN_FTR_SECTION b machine_check_powernv_early @@ -339,7 +329,7 @@ EXC_COMMON_BEGIN(machine_check_handle_early) * Go back to nap/sleep/winkle mode again if (b) is true. */ rlwinm. r11,r12,47-31,30,31 /* Was it in power saving mode? */ - beq 4f /* No, it wasn;t */ + beq 4f /* No, it wasn't */ /* Thread was in power saving mode. Go back to nap again. */ cmpwi r11,2 blt 3f @@ -369,13 +359,8 @@ EXC_COMMON_BEGIN(machine_check_handle_early) /* * Go back to winkle. Please note that this thread was woken up in * machine chec
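The trick this patch removes is an instance of tagging the low bit of an aligned pointer. A minimal C model of the pattern (names here are hypothetical; the kernel does this in assembly with HSPRG0 and the r13 PACA pointer):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the removed trick: the PACA is aligned, so its address
 * always has the low bit clear, and the power-on engine could be
 * programmed to restore HSPRG0 with that bit set, flagging "woke
 * from winkle, full state lost".  Structure and names are made up
 * for illustration.
 */
struct paca { uint64_t some_state; } __attribute__((aligned(64)));

static uintptr_t tag_winkle(struct paca *p)
{
	return (uintptr_t)p | 1;	/* set the flag bit */
}

static struct paca *untag(uintptr_t v, int *winkled)
{
	*winkled = (int)(v & 1);	/* test the flag... */
	return (struct paca *)(v & ~(uintptr_t)1); /* ...then clear it */
}
```

The cost the changelog describes follows directly: every consumer of the register must clear the bit before dereferencing, which on POWER8 meant extra masking in hot exception-entry paths rather than a single write on kernel entry.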
[PATCH 3/8] powerpc/64s: use alternative feature patching
This reduces the number of nops for POWER8. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/idle_book3s.S | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 405631b2c229..9284ea0762b1 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -383,7 +383,11 @@ _GLOBAL(power9_idle_stop) */ .global pnv_powersave_wakeup pnv_powersave_wakeup: - bl pnv_restore_hyp_resource +BEGIN_FTR_SECTION + bl pnv_restore_hyp_resource_arch300 +FTR_SECTION_ELSE + bl pnv_restore_hyp_resource_arch207 +ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300) li r0,PNV_THREAD_RUNNING stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */ @@ -411,14 +415,13 @@ pnv_powersave_wakeup: * * cr3 - set to gt if waking up with partial/complete hypervisor state loss */ -pnv_restore_hyp_resource: - ld r2,PACATOC(r13); - -BEGIN_FTR_SECTION +pnv_restore_hyp_resource_arch300: /* * POWER ISA 3. Use PSSCR to determine if we * are waking up from deep idle state */ + ld r2,PACATOC(r13); + LOAD_REG_ADDRBASE(r5,pnv_first_deep_stop_state) ld r4,ADDROFF(pnv_first_deep_stop_state)(r5) @@ -433,12 +436,14 @@ BEGIN_FTR_SECTION blr /* Waking up without hypervisor state loss. */ -END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) - +/* Same calling convention as arch300 */ +pnv_restore_hyp_resource_arch207: /* * POWER ISA 2.07 or less. * Check if we slept with winkle. */ + ld r2,PACATOC(r13); + lbz r0,PACA_THREAD_IDLE_STATE(r13) cmpwi cr2,r0,PNV_THREAD_NAP cmpwi cr4,r0,PNV_THREAD_WINKLE -- 2.11.0
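For readers unfamiliar with feature sections: at boot, the kernel patches one of two alternative instruction sequences into place, so the selected path runs without a runtime test and (after this patch) without padding nops. A rough C analogy, with hypothetical names, is choosing one implementation once at init instead of branching on every call:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for CPU_FTR_ARCH_300 detection and the two
 * wakeup-restore paths selected by the FTR_SECTION_ELSE alternative.
 */
static int restore_arch300(void) { return 300; }
static int restore_arch207(void) { return 207; }

static int (*restore_hyp_resource)(void);

/*
 * Done once at boot.  The real kernel goes further and rewrites the
 * instruction stream itself, so even this indirect-call overhead
 * disappears.
 */
static void patch_alternatives(bool cpu_has_arch_300)
{
	restore_hyp_resource = cpu_has_arch_300 ? restore_arch300
						: restore_arch207;
}
```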
[PATCH 4/8] powerpc/64s: fix POWER9 machine check handler from stop state
The ISA specifies power save wakeup can cause a machine check interrupt. The machine check handler currently has code to handle that for POWER8, but POWER9 crashes when trying to execute the P8 style sleep instructions. So queue up the machine check, then call into the idle code to wake up as the system reset interrupt does, rather than attempting to sleep again without going through the main idle path. Reviewed-by: Gautham R. Shenoy Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/reg.h | 1 + arch/powerpc/kernel/exceptions-64s.S | 69 ++-- 2 files changed, 35 insertions(+), 35 deletions(-) diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h index fc879fd6bdae..8bbdfacce970 100644 --- a/arch/powerpc/include/asm/reg.h +++ b/arch/powerpc/include/asm/reg.h @@ -656,6 +656,7 @@ #define SRR1_ISI_PROT0x0800 /* ISI: Other protection fault */ #define SRR1_WAKEMASK0x0038 /* reason for wakeup */ #define SRR1_WAKEMASK_P8 0x003c /* reason for wakeup on POWER8 and 9 */ +#define SRR1_WAKEMCE_RESVD 0x003c /* Unused/reserved value used by MCE wakeup to indicate cause to idle wakeup handler */ #define SRR1_WAKESYSERR 0x0030 /* System error */ #define SRR1_WAKEEE 0x0020 /* External interrupt */ #define SRR1_WAKEHVI 0x0024 /* Hypervisor Virtualization Interrupt (P9) */ diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S index e390fcd04bcb..5779d2d6a192 100644 --- a/arch/powerpc/kernel/exceptions-64s.S +++ b/arch/powerpc/kernel/exceptions-64s.S @@ -306,6 +306,33 @@ EXC_COMMON_BEGIN(machine_check_common) /* restore original r1. */ \ ld r1,GPR1(r1) +#ifdef CONFIG_PPC_P7_NAP +EXC_COMMON_BEGIN(machine_check_idle_common) + bl machine_check_queue_event + /* +* Queue the machine check, then reload SRR1 and use it to set +* CR3 according to pnv_powersave_wakeup convention. 
+*/ + ld r12,_MSR(r1) + rlwinm r11,r12,47-31,30,31 + cmpwi cr3,r11,2 + + /* +* Now put SRR1_WAKEMCE_RESVD into SRR1, allows it to follow the +* system reset wakeup code. +*/ + orisr12,r12,SRR1_WAKEMCE_RESVD@h + mtspr SPRN_SRR1,r12 + std r12,_MSR(r1) + + /* +* Decrement MCE nesting after finishing with the stack. +*/ + lhz r11,PACA_IN_MCE(r13) + subir11,r11,1 + sth r11,PACA_IN_MCE(r13) + b pnv_powersave_wakeup +#endif /* * Handle machine check early in real mode. We come here with * ME=1, MMU (IR=0 and DR=0) off and using MC emergency stack. @@ -318,6 +345,7 @@ EXC_COMMON_BEGIN(machine_check_handle_early) bl machine_check_early std r3,RESULT(r1) /* Save result */ ld r12,_MSR(r1) + #ifdef CONFIG_PPC_P7_NAP /* * Check if thread was in power saving mode. We come here when any @@ -328,43 +356,14 @@ EXC_COMMON_BEGIN(machine_check_handle_early) * * Go back to nap/sleep/winkle mode again if (b) is true. */ - rlwinm. r11,r12,47-31,30,31 /* Was it in power saving mode? */ - beq 4f /* No, it wasn't */ - /* Thread was in power saving mode. Go back to nap again. */ - cmpwi r11,2 - blt 3f - /* Supervisor/Hypervisor state loss */ - li r0,1 - stb r0,PACA_NAPSTATELOST(r13) -3: bl machine_check_queue_event - MACHINE_CHECK_HANDLER_WINDUP - GET_PACA(r13) - ld r1,PACAR1(r13) - /* -* Check what idle state this CPU was in and go back to same mode -* again. -*/ - lbz r3,PACA_THREAD_IDLE_STATE(r13) - cmpwi r3,PNV_THREAD_NAP - bgt 10f - IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP) - /* No return */ -10: - cmpwi r3,PNV_THREAD_SLEEP - bgt 2f - IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP) - /* No return */ - -2: - /* -* Go back to winkle. Please note that this thread was woken up in -* machine check from winkle and have not restored the per-subcore -* state. -*/ - IDLE_STATE_ENTER_SEQ_NORET(PPC_WINKLE) - /* No return */ + BEGIN_FTR_SECTION + rlwinm. 
r11,r12,47-31,30,31 + beq-4f + BRANCH_TO_COMMON(r10, machine_check_idle_common) 4: + END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206) #endif + /* * Check if we are coming from hypervisor userspace. If yes then we * continue in host kernel in V mode to deliver the MC event. -- 2.11.0
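The `rlwinm r11,r12,47-31,30,31` / `cmpwi cr3,r11,2` sequence above extracts the two-bit power-saving state-loss field from SRR1. A C sketch of that decoding, per my reading of the rotate-and-mask (bit positions are an assumption drawn from the asm, field 16–17 counting from the least-significant end):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the SRR1 decoding done by
 *   rlwinm r11,r12,47-31,30,31   (rotate left 16, keep low 2 bits)
 *   cmpwi  cr3,r11,2
 * Roughly: 0 means the interrupt did not wake the thread from a
 * power-saving mode, 1 means state was preserved, and values >= 2
 * mean at least partial state loss.
 */
static unsigned int srr1_wake_state_loss(uint64_t srr1)
{
	return (unsigned int)(srr1 >> 16) & 0x3;
}

static int woke_with_state_loss(uint64_t srr1)
{
	return srr1_wake_state_loss(srr1) >= 2;
}
```

This is the value the machine-check path copies into cr3 so it can follow the same `pnv_powersave_wakeup` convention as the system reset path.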
[PATCH 5/8] powerpc/64s: use PACA_THREAD_IDLE_STATE only in POWER8
POWER9 does not use this field, so it should be moved into the POWER8 code. Update the documentation in the paca struct too. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paca.h | 12 ++-- arch/powerpc/kernel/idle_book3s.S | 13 +++-- 2 files changed, 17 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h index 708c3e592eeb..bbb59e226a9f 100644 --- a/arch/powerpc/include/asm/paca.h +++ b/arch/powerpc/include/asm/paca.h @@ -165,11 +165,19 @@ struct paca_struct { #endif #ifdef CONFIG_PPC_POWERNV - /* Per-core mask tracking idle threads and a lock bit-[L][] */ + /* CPU idle fields */ + + /* +* Per-core word used to synchronize between threads. See +* asm/cpuidle.h, PNV_CORE_IDLE_* +*/ u32 *core_idle_state_ptr; - u8 thread_idle_state; /* PNV_THREAD_RUNNING/NAP/SLEEP */ /* Mask to indicate thread id in core */ u8 thread_mask; + + /* POWER8 specific fields */ + /* PNV_THREAD_RUNNING/NAP/SLEEP */ + u8 thread_idle_state; /* Mask to denote subcore sibling threads */ u8 subcore_sibling_mask; #endif diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 9284ea0762b1..9bdfba75a5e7 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -389,9 +389,6 @@ FTR_SECTION_ELSE bl pnv_restore_hyp_resource_arch207 ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300) - li r0,PNV_THREAD_RUNNING - stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */ - #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE li r0,KVM_HWTHREAD_IN_KERNEL stb r0,HSTATE_HWTHREAD_STATE(r13) @@ -445,9 +442,13 @@ pnv_restore_hyp_resource_arch207: ld r2,PACATOC(r13); lbz r0,PACA_THREAD_IDLE_STATE(r13) - cmpwi cr2,r0,PNV_THREAD_NAP - cmpwi cr4,r0,PNV_THREAD_WINKLE - bgt cr2,pnv_wakeup_tb_loss /* Either sleep or Winkle */ + cmpwi cr2,r0,PNV_THREAD_NAP + cmpwi cr4,r0,PNV_THREAD_WINKLE + li r0,PNV_THREAD_RUNNING + stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */ + + bgt cr2,pnv_wakeup_tb_loss 
/* Either sleep or Winkle */ + /* * We fall through here if PACA_THREAD_IDLE_STATE shows we are waking -- 2.11.0
[PATCH 6/8] powerpc/64s: idle expand usable core idle state bits
In preparation for adding more bits to the core idle state word, move the lock bit up, and unlock by flipping the lock bit rather than masking off all but the thread bits. Add branch hints for atomic operations while we're here. Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/cpuidle.h | 4 ++-- arch/powerpc/kernel/idle_book3s.S | 33 + 2 files changed, 19 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h index 155731557c9b..b9d9f960dffd 100644 --- a/arch/powerpc/include/asm/cpuidle.h +++ b/arch/powerpc/include/asm/cpuidle.h @@ -7,8 +7,8 @@ #define PNV_THREAD_NAP 1 #define PNV_THREAD_SLEEP2 #define PNV_THREAD_WINKLE 3 -#define PNV_CORE_IDLE_LOCK_BIT 0x100 -#define PNV_CORE_IDLE_THREAD_BITS 0x0FF +#define PNV_CORE_IDLE_LOCK_BIT 0x1000 +#define PNV_CORE_IDLE_THREAD_BITS 0x00FF /* * NOTE = diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 9bdfba75a5e7..1c91dc35c559 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -95,12 +95,12 @@ ALT_FTR_SECTION_END_IFSET(CPU_FTR_ARCH_300) core_idle_lock_held: HMT_LOW 3: lwz r15,0(r14) - andi. r15,r15,PNV_CORE_IDLE_LOCK_BIT + andis. r15,r15,PNV_CORE_IDLE_LOCK_BIT@h bne 3b HMT_MEDIUM lwarx r15,0,r14 - andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT - bne core_idle_lock_held + andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h + bne-core_idle_lock_held blr /* @@ -213,8 +213,8 @@ pnv_enter_arch207_idle_mode: lwarx_loop1: lwarx r15,0,r14 - andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT - bnelcore_idle_lock_held + andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h + bnel- core_idle_lock_held andcr15,r15,r7 /* Clear thread bit */ @@ -241,7 +241,7 @@ common_enter: /* common code for all the threads entering sleep or winkle */ IDLE_STATE_ENTER_SEQ_NORET(PPC_SLEEP) fastsleep_workaround_at_entry: - ori r15,r15,PNV_CORE_IDLE_LOCK_BIT + orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h stwcx. 
r15,0,r14 bne-lwarx_loop1 isync @@ -251,10 +251,10 @@ fastsleep_workaround_at_entry: li r4,1 bl opal_config_cpu_idle_state - /* Clear Lock bit */ - li r0,0 + /* Unlock */ + xoris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h lwsync - stw r0,0(r14) + stw r15,0(r14) b common_enter enter_winkle: @@ -302,8 +302,8 @@ power_enter_stop: lwarx_loop_stop: lwarx r15,0,r14 - andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT - bnelcore_idle_lock_held + andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h + bnel- core_idle_lock_held andcr15,r15,r7 /* Clear thread bit */ stwcx. r15,0,r14 @@ -492,7 +492,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HVMODE) ld r14,PACA_CORE_IDLE_STATE_PTR(r13) lwarx_loop2: lwarx r15,0,r14 - andi. r9,r15,PNV_CORE_IDLE_LOCK_BIT + andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h /* * Lock bit is set in one of the 2 cases- * a. In the sleep/winkle enter path, the last thread is executing @@ -501,9 +501,10 @@ lwarx_loop2: * workaround undo code or resyncing timebase or restoring context * In either case loop until the lock bit is cleared. */ - bnelcore_idle_lock_held + bnel- core_idle_lock_held - cmpwi cr2,r15,0 + andi. r9,r15,PNV_CORE_IDLE_THREAD_BITS + cmpwi cr2,r9,0 /* * At this stage @@ -512,7 +513,7 @@ lwarx_loop2: * cr4 - gt or eq if waking up from complete hypervisor state loss. */ - ori r15,r15,PNV_CORE_IDLE_LOCK_BIT + orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h stwcx. r15,0,r14 bne-lwarx_loop2 isync @@ -602,7 +603,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300) mtspr SPRN_WORC,r4 clear_lock: - andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS + xoris r15,r15,PNV_CORE_IDLE_LOCK_BIT@h lwsync stw r15,0(r14) -- 2.11.0
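The unlock change is worth spelling out: clearing the word down to the thread bits would also wipe any new state stored alongside them, while XOR-ing out just the lock bit leaves everything else intact. A small C model (the full-width constant values are assumed, since a later patch's fields motivate the exact layout):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the per-core idle state word after this patch: the lock bit
 * now lives in the upper halfword (hence the asm's andis./oris/xoris
 * with @h), leaving room between it and the low thread-idle byte for
 * fields added later in the series.  Values are assumed.
 */
#define CORE_IDLE_LOCK_BIT	0x10000000u
#define CORE_IDLE_THREAD_BITS	0x000000ffu

static uint32_t core_lock(uint32_t w)
{
	return w | CORE_IDLE_LOCK_BIT;
}

/*
 * Unlock by flipping the lock bit: unlike masking with THREAD_BITS,
 * this preserves every other field in the word.
 */
static uint32_t core_unlock(uint32_t w)
{
	return w ^ CORE_IDLE_LOCK_BIT;
}
```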
[PATCH 7/8] powerpc/64s: idle do not hold reservation longer than required
When taking the core idle state lock, grab it immediately like a regular lock, rather than adding more tests in there. Holding the lock keeps it stable, so there is no need to do it whole holding the reservation. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/idle_book3s.S | 20 +++- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 1c91dc35c559..3cb75907c5c5 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -488,12 +488,12 @@ BEGIN_FTR_SECTION CHECK_HMI_INTERRUPT END_FTR_SECTION_IFSET(CPU_FTR_HVMODE) - lbz r7,PACA_THREAD_MASK(r13) ld r14,PACA_CORE_IDLE_STATE_PTR(r13) -lwarx_loop2: - lwarx r15,0,r14 - andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h + lbz r7,PACA_THREAD_MASK(r13) + /* +* Take the core lock to synchronize against other threads. +* * Lock bit is set in one of the 2 cases- * a. In the sleep/winkle enter path, the last thread is executing * fastsleep workaround code. @@ -501,7 +501,14 @@ lwarx_loop2: * workaround undo code or resyncing timebase or restoring context * In either case loop until the lock bit is cleared. */ +1: + lwarx r15,0,r14 + andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h bnel- core_idle_lock_held + orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h + stwcx. r15,0,r14 + bne-1b + isync andi. r9,r15,PNV_CORE_IDLE_THREAD_BITS cmpwi cr2,r9,0 @@ -513,11 +520,6 @@ lwarx_loop2: * cr4 - gt or eq if waking up from complete hypervisor state loss. */ - orisr15,r15,PNV_CORE_IDLE_LOCK_BIT@h - stwcx. r15,0,r14 - bne-lwarx_loop2 - isync - BEGIN_FTR_SECTION lbz r4,PACA_SUBCORE_SIBLING_MASK(r13) and r4,r4,r15 -- 2.11.0
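The before/after can be modeled with a compare-and-swap loop: the old code did its field tests between `lwarx` and `stwcx.` (inside the reservation), while the new code acquires the lock first and inspects the now-stable word afterwards, keeping the reserved window as short as possible. A sketch using GCC atomic builtins (illustrative; the kernel uses larx/stcx. assembly, and the constant value is assumed):

```c
#include <stdbool.h>
#include <stdint.h>

#define CORE_IDLE_LOCK_BIT 0x10000000u	/* assumed value */

/*
 * Acquire the core idle-state lock the way the patched code does:
 * retry only while the lock bit is observed set or the exchange
 * fails, with no other work done while holding the "reservation"
 * (here, between the load and the CAS).
 */
static void core_idle_lock(uint32_t *state)
{
	uint32_t old, locked;

	do {
		old = __atomic_load_n(state, __ATOMIC_RELAXED);
		while (old & CORE_IDLE_LOCK_BIT)	/* held: spin */
			old = __atomic_load_n(state, __ATOMIC_RELAXED);
		locked = old | CORE_IDLE_LOCK_BIT;
	} while (!__atomic_compare_exchange_n(state, &old, locked, false,
					      __ATOMIC_ACQUIRE,
					      __ATOMIC_RELAXED));
	/* Only now, with the lock held, examine the stable word. */
}
```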
[PATCH 8/8] powerpc/64s: idle POWER8 avoid full state loss recovery when possible
If not all threads were in winkle, full state loss recovery is not necessary and can be avoided. A previous patch removed this optimisation due to some complexity with the implementation. Re-implement it by counting the number of threads in winkle with the per-core idle state. Only restore full state loss if all threads were in winkle. This has a small window of false positives right before threads execute winkle and just after they wake up, when the winkle count does not reflect the true number of threads in winkle. This is not a significant problem in comparison with even the minimum winkle duration. For correctness, a false positive is not a problem (only false negatives would be). Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/cpuidle.h | 32 --- arch/powerpc/kernel/idle_book3s.S | 45 +- 2 files changed, 68 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/include/asm/cpuidle.h b/arch/powerpc/include/asm/cpuidle.h index b9d9f960dffd..b68a5cd75ae8 100644 --- a/arch/powerpc/include/asm/cpuidle.h +++ b/arch/powerpc/include/asm/cpuidle.h @@ -2,13 +2,39 @@ #define _ASM_POWERPC_CPUIDLE_H #ifdef CONFIG_PPC_POWERNV -/* Used in powernv idle state management */ +/* Thread state used in powernv idle state management */ #define PNV_THREAD_RUNNING 0 #define PNV_THREAD_NAP 1 #define PNV_THREAD_SLEEP2 #define PNV_THREAD_WINKLE 3 -#define PNV_CORE_IDLE_LOCK_BIT 0x1000 -#define PNV_CORE_IDLE_THREAD_BITS 0x00FF + +/* + * Core state used in powernv idle for POWER8. + * + * The lock bit synchronizes updates to the state, as well as parts of the + * sleep/wake code (see kernel/idle_book3s.S). + * + * Bottom 8 bits track the idle state of each thread. Bit is cleared before + * the thread executes an idle instruction (nap/sleep/winkle). + * + * Then there is winkle tracking. A core does not lose complete state + * until every thread is in winkle. 
So the winkle count field counts the + * number of threads in winkle (small window of false positives is okay + * around the sleep/wake, so long as there are no false negatives). + * + * When the winkle count reaches 8 (the COUNT_ALL_BIT becomes set), then + * the THREAD_WINKLE_BITS are set, which indicate which threads have not + * yet woken from the winkle state. + */ +#define PNV_CORE_IDLE_LOCK_BIT 0x1000 + +#define PNV_CORE_IDLE_WINKLE_COUNT 0x0001 +#define PNV_CORE_IDLE_WINKLE_COUNT_ALL_BIT 0x0008 +#define PNV_CORE_IDLE_WINKLE_COUNT_BITS0x000F +#define PNV_CORE_IDLE_THREAD_WINKLE_BITS_SHIFT 8 +#define PNV_CORE_IDLE_THREAD_WINKLE_BITS 0xFF00 + +#define PNV_CORE_IDLE_THREAD_BITS 0x00FF /* * NOTE = diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S index 3cb75907c5c5..87518a1dca50 100644 --- a/arch/powerpc/kernel/idle_book3s.S +++ b/arch/powerpc/kernel/idle_book3s.S @@ -210,15 +210,20 @@ pnv_enter_arch207_idle_mode: /* Sleep or winkle */ lbz r7,PACA_THREAD_MASK(r13) ld r14,PACA_CORE_IDLE_STATE_PTR(r13) + li r5,0 + beq cr3,3f + lis r5,PNV_CORE_IDLE_WINKLE_COUNT@h +3: lwarx_loop1: lwarx r15,0,r14 andis. r9,r15,PNV_CORE_IDLE_LOCK_BIT@h bnel- core_idle_lock_held + add r15,r15,r5 /* Add if winkle */ andcr15,r15,r7 /* Clear thread bit */ - andi. r15,r15,PNV_CORE_IDLE_THREAD_BITS + andi. r9,r15,PNV_CORE_IDLE_THREAD_BITS /* * If cr0 = 0, then current thread is the last thread of the core entering @@ -437,16 +442,14 @@ pnv_restore_hyp_resource_arch300: pnv_restore_hyp_resource_arch207: /* * POWER ISA 2.07 or less. -* Check if we slept with winkle. +* Check if we slept with sleep or winkle. 
*/ ld r2,PACATOC(r13); - lbz r0,PACA_THREAD_IDLE_STATE(r13) - cmpwi cr2,r0,PNV_THREAD_NAP - cmpwi cr4,r0,PNV_THREAD_WINKLE + lbz r4,PACA_THREAD_IDLE_STATE(r13) li r0,PNV_THREAD_RUNNING stb r0,PACA_THREAD_IDLE_STATE(r13) /* Clear thread state */ - + cmpwi cr2,r4,PNV_THREAD_NAP bgt cr2,pnv_wakeup_tb_loss /* Either sleep or Winkle */ @@ -467,7 +470,12 @@ pnv_restore_hyp_resource_arch207: * * r13 - PACA * cr3 - gt if waking up with partial/complete hypervisor state loss + * + * If ISA300: * cr4 - gt or eq if waking up from complete hypervisor state loss. + * + * If ISA207: + * r4 - PACA_THREAD_IDLE_STATE */ pnv_wakeup_tb_loss: ld r1,PACAR1(r13) @@ -482,6 +490,7 @@ pnv_wakeup_tb_loss: * required to return back to caller after hypervisor state restore is * complete. */ +
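The counting scheme described in the cpuidle.h comment can be sketched in C: the winkle count occupies its own field within the core state word, and "all eight threads in winkle" is detected without a compare because incrementing the count to 8 sets the COUNT_ALL bit. Constants follow the field names above, with full widths assumed:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed full-width values for the fields named in cpuidle.h above. */
#define CORE_IDLE_WINKLE_COUNT		0x00010000u /* increment unit */
#define CORE_IDLE_WINKLE_COUNT_ALL_BIT	0x00080000u /* set when count==8 */
#define CORE_IDLE_WINKLE_COUNT_BITS	0x000f0000u

/* Models the asm's "add r15,r15,r5" with r5 preloaded only for winkle. */
static uint32_t winkle_enter(uint32_t w)
{
	return w + CORE_IDLE_WINKLE_COUNT;
}

/*
 * Full state loss recovery is needed only when every thread of the
 * core wound up in winkle.  As the changelog notes, false positives
 * in the entry/exit window are tolerated; false negatives would be
 * a correctness bug.
 */
static int core_fully_winkled(uint32_t w)
{
	return (w & CORE_IDLE_WINKLE_COUNT_ALL_BIT) != 0;
}
```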
Linux 4.11: Reported regressions as of Tuesday, 2017-03-14
Hi! Find below my first regression report for Linux 4.11. It lists 9
regressions I'm currently aware of. As always: Are you aware of any other
regressions? Then please let me know (simply CC regressi...@leemhuis.info).
And please tell me if there is anything in the report that shouldn't be
there.

Ciao, Thorsten

P.S.: Sorry, I didn't compile any regression reports for 4.10: I didn't
find time due to various reasons (vacation, a cold, regular work, and
attending two conferences). Reminder: compiling these reports has nothing
to do with my paid job and I'm doing it in my spare time just because I
think someone should do it.

P.P.S.: Dear Gmane mailing list archive webinterface, please come back
soon. I really, really miss you. It hurts every single day. Don't you
miss me, too? ;-)

== Current regressions ==

Desc: PowerPC crashes on boot, bisected to commit 5657933dbb6e
Repo: 2017-03-02 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1343553.html
Stat: 2017-03-09 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349568.html
Note: patch hopefully heading mainline

Desc: thinkpad x220: GPU hang
Repo: 2017-03-05 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1345689.html
Stat: n/a
Note: poked discussion for a status update

Desc: e1000e: __pci_enable_msi_range fails before/after resume
Repo: 2017-03-06 https://bugzilla.kernel.org/show_bug.cgi?id=194801
Stat: 2017-03-06 https://bugzilla.kernel.org/show_bug.cgi?id=194801#c1
Note: poked bug for status; might need to get forwarded to network people

Desc: Two batteries are detected on DEXP Ursus 7W tablet instead of one
Repo: 2017-03-07 https://bugzilla.kernel.org/show_bug.cgi?id=194811
Stat: 2017-03-12 https://bugzilla.kernel.org/show_bug.cgi?id=194811#c8
Note: patch likely heading mainline

Desc: [lkp-robot] [f2fs] 4ac912427c: -33.7% aim7.jobs-per-min regression
Repo: 2017-03-08 https://www.spinics.net/lists/kernel/msg2459239.html
Stat: 2017-03-13
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1353085.html
Note: patch discussed and heading mainline

Desc: VM with virtio-scsi drive often crashes during boot with kernel 4.11rc1
Repo: 2017-03-09 https://bugzilla.kernel.org/show_bug.cgi?id=194837
Stat: n/a
Note: will forward this to scsi & virtio & kvm people

Desc: general protection fault: inet6_fill_ifaddr+0x6c/0x230
Repo: 2017-03-11 https://bugzilla.kernel.org/show_bug.cgi?id=194849
Stat: n/a
Note: poked bug for status; might need to get forwarded to network people

Desc: Synaptics RMI4 touchpad regression in 4.11-rc1
Repo: 2017-03-11 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1351561.html
Stat: 2017-03-13 https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1352399.html
Note: solution discussed, no patch yet

Desc: DRM BUG while initializing cape verde (2nd card)
Repo: 2017-03-13 https://bugzilla.kernel.org/show_bug.cgi?id=194867
Stat: n/a
Note: patch proposed by reporter
Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c
Michael Ellerman writes: > We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it > from electra_cf.c which is a powerpc-only driver. > > Signed-off-by: Michael Ellerman > --- > drivers/pcmcia/electra_cf.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) Ping anyone? Or should I merge this via the powerpc tree? cheers > diff --git a/drivers/pcmcia/electra_cf.c b/drivers/pcmcia/electra_cf.c > index 4d7bc3f4124a..c6fe2a4a7a6a 100644 > --- a/drivers/pcmcia/electra_cf.c > +++ b/drivers/pcmcia/electra_cf.c > @@ -207,7 +207,7 @@ static int electra_cf_probe(struct platform_device *ofdev) > return -ENOMEM; > > setup_timer(&cf->timer, electra_cf_timer, (unsigned long)cf); > - cf->irq = NO_IRQ; > + cf->irq = 0; > > cf->ofdev = ofdev; > cf->mem_phys = mem.start; > @@ -313,7 +313,7 @@ fail3: > fail2: > release_mem_region(cf->mem_phys, cf->mem_size); > fail1: > - if (cf->irq != NO_IRQ) > + if (cf->irq) > free_irq(cf->irq, cf); > > if (cf->io_virt) > -- > 2.7.4
[PATCH V2 1/6] cxl: Remove unused values in bare-metal environment.
The two fields pid and tid of the structure cxl_irq_info are only used in the guest environment. To avoid confusion, do not fill these fields in the bare-metal environment. The PSL Process and Thread Identification Register is only used when attaching a dedicated process on PSL8. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/native.c | 5 - 1 file changed, 5 deletions(-) diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c index 7ae7105..7257e8b 100644 --- a/drivers/misc/cxl/native.c +++ b/drivers/misc/cxl/native.c @@ -859,8 +859,6 @@ static int native_detach_process(struct cxl_context *ctx) static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info) { - u64 pidtid; - /* If the adapter has gone away, we can't get any meaningful * information. */ @@ -870,9 +868,6 @@ static int native_get_irq_info(struct cxl_afu *afu, struct cxl_irq_info *info) info->dsisr = cxl_p2n_read(afu, CXL_PSL_DSISR_An); info->dar = cxl_p2n_read(afu, CXL_PSL_DAR_An); info->dsr = cxl_p2n_read(afu, CXL_PSL_DSR_An); - pidtid = cxl_p2n_read(afu, CXL_PSL_PID_TID_An); - info->pid = pidtid >> 32; - info->tid = pidtid & 0x; info->afu_err = cxl_p2n_read(afu, CXL_AFU_ERR_An); info->errstat = cxl_p2n_read(afu, CXL_PSL_ErrStat_An); info->proc_handle = 0; -- 2.7.4
[PATCH V2 3/6] cxl: Update implementation service layer
The service layer API (in cxl.h) lists some low-level functions whose implementation is different on PSL8, PSL9 and XSL: - Init implementation for the adapter and the afu. - Invalidate TLB/SLB. - Attach process for dedicated/directed models. - Handle psl interrupts. - Debug registers for the adapter and the afu. - Traces. Each environment implements its own functions, and the common code uses them through function pointers, defined in cxl_service_layer_ops. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/cxl.h | 34 ++- drivers/misc/cxl/debugfs.c | 16 +++ drivers/misc/cxl/guest.c | 2 +- drivers/misc/cxl/irq.c | 2 +- drivers/misc/cxl/native.c | 50 +++--- drivers/misc/cxl/pci.c | 47 +-- 6 files changed, 97 insertions(+), 54 deletions(-) diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 4d1b704..3e03a66 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -553,13 +553,23 @@ struct cxl_context { struct mm_struct *mm; }; +struct cxl_irq_info; + struct cxl_service_layer_ops { int (*adapter_regs_init)(struct cxl *adapter, struct pci_dev *dev); + int (*invalidate_all)(struct cxl *adapter); int (*afu_regs_init)(struct cxl_afu *afu); + int (*sanitise_afu_regs)(struct cxl_afu *afu); int (*register_serr_irq)(struct cxl_afu *afu); void (*release_serr_irq)(struct cxl_afu *afu); - void (*debugfs_add_adapter_sl_regs)(struct cxl *adapter, struct dentry *dir); - void (*debugfs_add_afu_sl_regs)(struct cxl_afu *afu, struct dentry *dir); + irqreturn_t (*handle_interrupt)(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); + irqreturn_t (*fail_irq)(struct cxl_afu *afu, struct cxl_irq_info *irq_info); + int (*activate_dedicated_process)(struct cxl_afu *afu); + int (*attach_afu_directed)(struct cxl_context *ctx, u64 wed, u64 amr); + int (*attach_dedicated_process)(struct cxl_context *ctx, u64 wed, u64 amr); + void (*update_dedicated_ivtes)(struct cxl_context *ctx); + void (*debugfs_add_adapter_regs)(struct cxl *adapter, struct dentry 
*dir); + void (*debugfs_add_afu_regs)(struct cxl_afu *afu, struct dentry *dir); void (*psl_irq_dump_registers)(struct cxl_context *ctx); void (*err_irq_dump_registers)(struct cxl *adapter); void (*debugfs_stop_trace)(struct cxl *adapter); @@ -805,16 +815,21 @@ void afu_irq_name_free(struct cxl_context *ctx); #ifdef CONFIG_DEBUG_FS +int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr); +int cxl_activate_dedicated_process_psl(struct cxl_afu *afu); +int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 amr); +void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx); + int cxl_debugfs_init(void); void cxl_debugfs_exit(void); int cxl_debugfs_adapter_add(struct cxl *adapter); void cxl_debugfs_adapter_remove(struct cxl *adapter); int cxl_debugfs_afu_add(struct cxl_afu *afu); void cxl_debugfs_afu_remove(struct cxl_afu *afu); -void cxl_stop_trace(struct cxl *cxl); -void cxl_debugfs_add_adapter_psl_regs(struct cxl *adapter, struct dentry *dir); -void cxl_debugfs_add_adapter_xsl_regs(struct cxl *adapter, struct dentry *dir); -void cxl_debugfs_add_afu_psl_regs(struct cxl_afu *afu, struct dentry *dir); +void cxl_stop_trace_psl(struct cxl *cxl); +void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir); +void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir); +void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir); #else /* CONFIG_DEBUG_FS */ @@ -916,19 +931,20 @@ struct cxl_irq_info { }; void cxl_assign_psn_space(struct cxl_context *ctx); -irqreturn_t cxl_irq(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); +int cxl_invalidate_all_psl(struct cxl *adapter); +irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); +irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info *irq_info); int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler, void *cookie, irq_hw_number_t *dest_hwirq, unsigned int 
*dest_virq, const char *name); int cxl_check_error(struct cxl_afu *afu); int cxl_afu_slbia(struct cxl_afu *afu); -int cxl_tlb_slb_invalidate(struct cxl *adapter); int cxl_data_cache_flush(struct cxl *adapter); int cxl_afu_disable(struct cxl_afu *afu); int cxl_psl_purge(struct cxl_afu *afu); -void cxl_native_psl_irq_dump_regs(struct cxl_context *ctx); +void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx); void cxl_native_err_irq_dump_regs(struct cxl *adapter); int cxl_pci_vphb_add(struct cxl_afu *afu); void cxl_pci_vphb_remove(struct cxl_afu *afu); diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs
[PATCH V2 5/6] cxl: Isolate few psl8 specific calls
Point out the specific Coherent Accelerator Interface Architecture, level 1, registers. Code and functions specific to PSL8 (CAIA1) must be framed. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/context.c | 28 +++- drivers/misc/cxl/cxl.h | 35 +++-- drivers/misc/cxl/debugfs.c | 6 +++-- drivers/misc/cxl/fault.c | 14 +- drivers/misc/cxl/native.c | 58 ++--- drivers/misc/cxl/pci.c | 64 +++--- 6 files changed, 136 insertions(+), 69 deletions(-) diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index ed0a447..0fd3cf8 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -39,23 +39,26 @@ int cxl_context_init(struct cxl_context *ctx, struct cxl_afu *afu, bool master) { int i; - spin_lock_init(&ctx->sste_lock); + if (cxl_is_psl8(afu)) + spin_lock_init(&ctx->sste_lock); ctx->afu = afu; ctx->master = master; ctx->pid = NULL; /* Set in start work ioctl */ mutex_init(&ctx->mapping_lock); ctx->mapping = NULL; - /* -* Allocate the segment table before we put it in the IDR so that we -* can always access it when dereferenced from IDR. For the same -* reason, the segment table is only destroyed after the context is -* removed from the IDR. Access to this in the IOCTL is protected by -* Linux filesytem symantics (can't IOCTL until open is complete). -*/ - i = cxl_alloc_sst(ctx); - if (i) - return i; + if (cxl_is_psl8(afu)) { + /* +* Allocate the segment table before we put it in the IDR so that we +* can always access it when dereferenced from IDR. For the same +* reason, the segment table is only destroyed after the context is +* removed from the IDR. Access to this in the IOCTL is protected by +* Linux filesytem symantics (can't IOCTL until open is complete). 
+*/ + i = cxl_alloc_sst(ctx); + if (i) + return i; + } INIT_WORK(&ctx->fault_work, cxl_handle_fault); @@ -307,7 +310,8 @@ static void reclaim_ctx(struct rcu_head *rcu) { struct cxl_context *ctx = container_of(rcu, struct cxl_context, rcu); - free_page((u64)ctx->sstp); + if (cxl_is_psl8(ctx->afu)) + free_page((u64)ctx->sstp); if (ctx->ff_page) __free_page(ctx->ff_page); ctx->sstp = NULL; diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 1f04238..f6a3a34 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -73,7 +73,7 @@ static const cxl_p1_reg_t CXL_PSL_Control = {0x0020}; static const cxl_p1_reg_t CXL_PSL_DLCNTL = {0x0060}; static const cxl_p1_reg_t CXL_PSL_DLADDR = {0x0068}; -/* PSL Lookaside Buffer Management Area */ +/* PSL Lookaside Buffer Management Area - CAIA 1 */ static const cxl_p1_reg_t CXL_PSL_LBISEL = {0x0080}; static const cxl_p1_reg_t CXL_PSL_SLBIE = {0x0088}; static const cxl_p1_reg_t CXL_PSL_SLBIA = {0x0090}; @@ -82,7 +82,7 @@ static const cxl_p1_reg_t CXL_PSL_TLBIA = {0x00A8}; static const cxl_p1_reg_t CXL_PSL_AFUSEL = {0x00B0}; /* 0x00C0:7EFF Implementation dependent area */ -/* PSL registers */ +/* PSL registers - CAIA 1 */ static const cxl_p1_reg_t CXL_PSL_FIR1 = {0x0100}; static const cxl_p1_reg_t CXL_PSL_FIR2 = {0x0108}; static const cxl_p1_reg_t CXL_PSL_Timebase = {0x0110}; @@ -109,7 +109,7 @@ static const cxl_p1n_reg_t CXL_PSL_AMBAR_An = {0x10}; static const cxl_p1n_reg_t CXL_PSL_SPOffset_An= {0x18}; static const cxl_p1n_reg_t CXL_PSL_ID_An = {0x20}; static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28}; -/* Memory Management and Lookaside Buffer Management */ +/* Memory Management and Lookaside Buffer Management - CAIA 1*/ static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30}; static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38}; /* Pointer Area */ @@ -124,6 +124,7 @@ static const cxl_p1n_reg_t CXL_PSL_IVTE_Limit_An = {0xB8}; /* 0xC0:FF Implementation Dependent Area */ static const cxl_p1n_reg_t CXL_PSL_FIR_SLICE_An 
= {0xC0}; static const cxl_p1n_reg_t CXL_AFU_DEBUG_An = {0xC8}; +/* 0xC0:FF Implementation Dependent Area - CAIA 1 */ static const cxl_p1n_reg_t CXL_PSL_APCALLOC_A = {0xD0}; static const cxl_p1n_reg_t CXL_PSL_COALLOC_A = {0xD8}; static const cxl_p1n_reg_t CXL_PSL_RXCTL_A= {0xE0}; @@ -133,12 +134,14 @@ static const cxl_p1n_reg_t CXL_PSL_SLICE_TRACE= {0xE8}; /* Configuration and Control Area */ static const cxl_p2n_reg_t CXL_PSL_PID_TID_An = {0x000}; static const cxl_p2n_reg_t CXL_CSRP_An= {0x008}; +/* Configuration and Control Area - CAIA 1 */ static const cxl_p2n_reg_t CXL_AURP0_
[PATCH V2 0/6] cxl: Add support for Coherent Accelerator Interface Architecture 2.0
This series adds support for a cxl card which supports the Coherent Accelerator Interface Architecture 2.0. It requires an IBM Power9 system and the Power Service Layer, version 9. The PSL provides the address translation and system memory cache for CAIA-compliant Accelerators. The PSL attaches to the IBM Processor chip through the PCIe link using the PSL-specific “CAPI Protocol” Transaction Layer Packets. The PSL and CAPP communicate using PowerBus packets. When using a PCIe link, the PCIe Host Bridge (PHB) decodes the CAPI Protocol Packets from the PSL and forwards them as PowerBus data packets. The PSL also has an optional DMA feature which allows the AFU to send native PCIe reads and writes to the Processor. CAIA 2 introduces new features: * There are several similarities among the two programming models: Dedicated-Process and shared models. * DMA support * Nest MMU to handle address translation. * ... It builds on top of the existing cxl driver for the first version of CAIA. Today only the bare-metal environment supports these new features. Compatibility with the CAIA, version 1, allows applications and system software to migrate from one implementation to another with minor changes. Most of the differences are: * Power Service Layer registers: p1 and p2 registers. These new registers require reworking the service layer API (in cxl.h). * Support of Radix mode. Power9 supports multiple memory management models, so we need to select the right translation mechanism. * Dedicated-Shared Process Programming Model * Process element entry. Structure cxl_process_element_common is redefined. * Translation Fault Handling. Only page faults are now handled by the cxl driver when a translation fault occurs. Roughly 3/4 of the code is common between the two CAIA versions. When the code needs to call a specific implementation, it does so through an API. The PSL8 and PSL9 implementations each provide their own definition. See struct cxl_service_layer_ops. 
The first two patches are mostly cleanup and fixes, separating the psl8-specific code from the code which will also be used for psl9. Patch 3 restructures the existing code to make it easy to add the psl9 implementation. Patches 4 and 5 rename and isolate implementation-specific code. Patch 6 introduces the core of the PSL9-specific code. Tested in a simulation environment. Changelog: v1->v2: - integrate comments from Andrew Donnellan and Frederic Barrat Christophe Lombard (6): cxl: Remove unused values in bare-metal environment. cxl: Keep track of mm struct associated with a context cxl: Update implementation service layer cxl: Rename some psl8 specific functions cxl: Isolate few psl8 specific calls cxl: Add psl9 specific code drivers/misc/cxl/api.c | 17 ++- drivers/misc/cxl/context.c | 67 +++-- drivers/misc/cxl/cxl.h | 206 ++- drivers/misc/cxl/debugfs.c | 41 -- drivers/misc/cxl/fault.c | 139 ++ drivers/misc/cxl/file.c| 15 +- drivers/misc/cxl/guest.c | 10 +- drivers/misc/cxl/irq.c | 55 +++- drivers/misc/cxl/native.c | 331 --- drivers/misc/cxl/pci.c | 345 +++-- drivers/misc/cxl/trace.h | 43 ++ 11 files changed, 1004 insertions(+), 265 deletions(-) -- 2.7.4
[PATCH V2 4/6] cxl: Rename some psl8 specific functions
Rename a few functions, changing the '_psl' suffix to '_psl8', to make clear that the implementation is psl8 specific. Those functions will have an equivalent implementation for the psl9 in a later patch. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/cxl.h | 22 ++--- drivers/misc/cxl/debugfs.c | 6 +++--- drivers/misc/cxl/guest.c | 2 +- drivers/misc/cxl/irq.c | 2 +- drivers/misc/cxl/native.c | 14 +++--- drivers/misc/cxl/pci.c | 48 +++--- 6 files changed, 47 insertions(+), 47 deletions(-) diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 3e03a66..1f04238 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -815,10 +815,10 @@ void afu_irq_name_free(struct cxl_context *ctx); #ifdef CONFIG_DEBUG_FS -int cxl_attach_afu_directed_psl(struct cxl_context *ctx, u64 wed, u64 amr); -int cxl_activate_dedicated_process_psl(struct cxl_afu *afu); -int cxl_attach_dedicated_process_psl(struct cxl_context *ctx, u64 wed, u64 amr); -void cxl_update_dedicated_ivtes_psl(struct cxl_context *ctx); +int cxl_attach_afu_directed_psl8(struct cxl_context *ctx, u64 wed, u64 amr); +int cxl_activate_dedicated_process_psl8(struct cxl_afu *afu); +int cxl_attach_dedicated_process_psl8(struct cxl_context *ctx, u64 wed, u64 amr); +void cxl_update_dedicated_ivtes_psl8(struct cxl_context *ctx); int cxl_debugfs_init(void); void cxl_debugfs_exit(void); @@ -826,10 +826,10 @@ int cxl_debugfs_adapter_add(struct cxl *adapter); void cxl_debugfs_adapter_remove(struct cxl *adapter); int cxl_debugfs_afu_add(struct cxl_afu *afu); void cxl_debugfs_afu_remove(struct cxl_afu *afu); -void cxl_stop_trace_psl(struct cxl *cxl); -void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir); +void cxl_stop_trace_psl8(struct cxl *cxl); +void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir); void cxl_debugfs_add_adapter_regs_xsl(struct cxl *adapter, struct dentry *dir); -void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry 
*dir); +void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir); #else /* CONFIG_DEBUG_FS */ @@ -931,9 +931,9 @@ struct cxl_irq_info { }; void cxl_assign_psn_space(struct cxl_context *ctx); -int cxl_invalidate_all_psl(struct cxl *adapter); -irqreturn_t cxl_irq_psl(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); -irqreturn_t cxl_fail_irq_psl(struct cxl_afu *afu, struct cxl_irq_info *irq_info); +int cxl_invalidate_all_psl8(struct cxl *adapter); +irqreturn_t cxl_irq_psl8(int irq, struct cxl_context *ctx, struct cxl_irq_info *irq_info); +irqreturn_t cxl_fail_irq_psl8(struct cxl_afu *afu, struct cxl_irq_info *irq_info); int cxl_register_one_irq(struct cxl *adapter, irq_handler_t handler, void *cookie, irq_hw_number_t *dest_hwirq, unsigned int *dest_virq, const char *name); @@ -944,7 +944,7 @@ int cxl_data_cache_flush(struct cxl *adapter); int cxl_afu_disable(struct cxl_afu *afu); int cxl_psl_purge(struct cxl_afu *afu); -void cxl_native_irq_dump_regs_psl(struct cxl_context *ctx); +void cxl_native_irq_dump_regs_psl8(struct cxl_context *ctx); void cxl_native_err_irq_dump_regs(struct cxl *adapter); int cxl_pci_vphb_add(struct cxl_afu *afu); void cxl_pci_vphb_remove(struct cxl_afu *afu); diff --git a/drivers/misc/cxl/debugfs.c b/drivers/misc/cxl/debugfs.c index 4848ebf..2ff10a9 100644 --- a/drivers/misc/cxl/debugfs.c +++ b/drivers/misc/cxl/debugfs.c @@ -15,7 +15,7 @@ static struct dentry *cxl_debugfs; -void cxl_stop_trace_psl(struct cxl *adapter) +void cxl_stop_trace_psl8(struct cxl *adapter) { int slice; @@ -53,7 +53,7 @@ static struct dentry *debugfs_create_io_x64(const char *name, umode_t mode, (void __force *)value, &fops_io_x64); } -void cxl_debugfs_add_adapter_regs_psl(struct cxl *adapter, struct dentry *dir) +void cxl_debugfs_add_adapter_regs_psl8(struct cxl *adapter, struct dentry *dir) { debugfs_create_io_x64("fir1", S_IRUSR, dir, _cxl_p1_addr(adapter, CXL_PSL_FIR1)); debugfs_create_io_x64("fir2", S_IRUSR, dir, 
_cxl_p1_addr(adapter, CXL_PSL_FIR2)); @@ -92,7 +92,7 @@ void cxl_debugfs_adapter_remove(struct cxl *adapter) debugfs_remove_recursive(adapter->debugfs); } -void cxl_debugfs_add_afu_regs_psl(struct cxl_afu *afu, struct dentry *dir) +void cxl_debugfs_add_afu_regs_psl8(struct cxl_afu *afu, struct dentry *dir) { debugfs_create_io_x64("fir", S_IRUSR, dir, _cxl_p1n_addr(afu, CXL_PSL_FIR_SLICE_An)); debugfs_create_io_x64("serr", S_IRUSR, dir, _cxl_p1n_addr(afu, CXL_PSL_SERR_An)); diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c index f6ba698..3ad7381 100644 --- a/drivers/misc/cxl/guest.c +++ b/drivers/misc/cxl/guest.c @@ -169,7 +169,7 @@ static irqreturn_t guest_psl_irq(int irq, void *data) return IR
[PATCH V2 6/6] cxl: Add psl9 specific code
The new Coherent Accelerator Interface Architecture, level 2, for the IBM POWER9 brings new content and features: - POWER9 Service Layer - Registers - Radix mode - Process element entry - Dedicated-Shared Process Programming Model - Translation Fault Handling - CAPP - Memory Context ID If a valid mm_struct is found the memory context id is used for each transaction associated with the process handle. The PSL uses the context ID to find the corresponding process element. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/context.c | 13 +++ drivers/misc/cxl/cxl.h | 124 drivers/misc/cxl/debugfs.c | 19 drivers/misc/cxl/fault.c | 48 ++ drivers/misc/cxl/guest.c | 8 +- drivers/misc/cxl/irq.c | 53 +++ drivers/misc/cxl/native.c | 218 +-- drivers/misc/cxl/pci.c | 228 +++-- drivers/misc/cxl/trace.h | 43 + 9 files changed, 700 insertions(+), 54 deletions(-) diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index 0fd3cf8..8d41fc5 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -205,6 +205,19 @@ int cxl_context_iomap(struct cxl_context *ctx, struct vm_area_struct *vma) return -EBUSY; } + if ((ctx->afu->current_mode == CXL_MODE_DEDICATED) && + (cxl_is_psl9(ctx->afu))) { + /* make sure there is a valid problem state area space for this AFU */ + if (ctx->master && !ctx->afu->psa) { + pr_devel("AFU doesn't support mmio space\n"); + return -EINVAL; + } + + /* Can't mmap until the AFU is enabled */ + if (!ctx->afu->enabled) + return -EBUSY; + } + pr_devel("%s: mmio physical: %llx pe: %i master:%i\n", __func__, ctx->psn_phys, ctx->pe , ctx->master); diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index f6a3a34..fbdc511 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -63,7 +63,7 @@ typedef struct { /* Memory maps. 
Ref CXL Appendix A */ /* PSL Privilege 1 Memory Map */ -/* Configuration and Control area */ +/* Configuration and Control area - CAIA 1&2 */ static const cxl_p1_reg_t CXL_PSL_CtxTime = {0x}; static const cxl_p1_reg_t CXL_PSL_ErrIVTE = {0x0008}; static const cxl_p1_reg_t CXL_PSL_KEY1= {0x0010}; @@ -98,11 +98,29 @@ static const cxl_p1_reg_t CXL_XSL_Timebase = {0x0100}; static const cxl_p1_reg_t CXL_XSL_TB_CTLSTAT = {0x0108}; static const cxl_p1_reg_t CXL_XSL_FEC = {0x0158}; static const cxl_p1_reg_t CXL_XSL_DSNCTL= {0x0168}; +/* PSL registers - CAIA 2 */ +static const cxl_p1_reg_t CXL_PSL9_CONTROL = {0x0020}; +static const cxl_p1_reg_t CXL_XSL9_DSNCTL = {0x0168}; +static const cxl_p1_reg_t CXL_PSL9_FIR1 = {0x0300}; +static const cxl_p1_reg_t CXL_PSL9_FIR2 = {0x0308}; /* TBD NML CAIA 2 */ +static const cxl_p1_reg_t CXL_PSL9_Timebase = {0x0310}; +static const cxl_p1_reg_t CXL_PSL9_DEBUG= {0x0320}; +static const cxl_p1_reg_t CXL_PSL9_FIR_CNTL = {0x0348}; +static const cxl_p1_reg_t CXL_PSL9_DSNDCTL = {0x0350}; +static const cxl_p1_reg_t CXL_PSL9_TB_CTLSTAT = {0x0340}; +static const cxl_p1_reg_t CXL_PSL9_TRACECFG = {0x0368}; +static const cxl_p1_reg_t CXL_PSL9_APCDEDALLOC = {0x0378}; +static const cxl_p1_reg_t CXL_PSL9_APCDEDTYPE = {0x0380}; +static const cxl_p1_reg_t CXL_PSL9_TNR_ADDR = {0x0388}; +static const cxl_p1_reg_t CXL_PSL9_GP_CT = {0x0398}; +static const cxl_p1_reg_t CXL_XSL9_IERAT = {0x0588}; +static const cxl_p1_reg_t CXL_XSL9_ILPP = {0x0590}; + /* 0x7F00:7FFF Reserved PCIe MSI-X Pending Bit Array area */ /* 0x8000: Reserved PCIe MSI-X Table Area */ /* PSL Slice Privilege 1 Memory Map */ -/* Configuration Area */ +/* Configuration Area - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_SR_An = {0x00}; static const cxl_p1n_reg_t CXL_PSL_LPID_An= {0x08}; static const cxl_p1n_reg_t CXL_PSL_AMBAR_An = {0x10}; @@ -111,17 +129,18 @@ static const cxl_p1n_reg_t CXL_PSL_ID_An = {0x20}; static const cxl_p1n_reg_t CXL_PSL_SERR_An= {0x28}; /* Memory Management and 
Lookaside Buffer Management - CAIA 1*/ static const cxl_p1n_reg_t CXL_PSL_SDR_An = {0x30}; +/* Memory Management and Lookaside Buffer Management - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_AMOR_An= {0x38}; -/* Pointer Area */ +/* Pointer Area - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_HAURP_An = {0x80}; static const cxl_p1n_reg_t CXL_PSL_SPAP_An= {0x88}; static const cxl_p1n_reg_t CXL_PSL_LLCMD_An = {0x90}; -/* Control Area */ +/* Control Area - CAIA 1&2 */ static const cxl_p1n_reg_t CXL_PSL_SCNTL_An = {0xA0}; static const cxl_p1n_reg_t CXL_PSL_CtxTime_An = {0xA8}; static const cxl_p1n_reg_t CXL_PSL_IVTE_Offset_An = {0xB
[PATCH V2 2/6] cxl: Keep track of mm struct associated with a context
The mm_struct corresponding to the current task is acquired each time an interrupt is raised. So to simplify the code, we only get the mm_struct when attaching an AFU context to the process. The mm_count reference is increased to ensure that the mm_struct can't be freed. The mm_struct will be released when the context is detached. The reference (use count) on the struct mm is not kept to avoid a circular dependency if the process mmaps its cxl mmio and forget to unmap before exiting. Signed-off-by: Christophe Lombard --- drivers/misc/cxl/api.c | 17 -- drivers/misc/cxl/context.c | 26 ++-- drivers/misc/cxl/cxl.h | 13 ++-- drivers/misc/cxl/fault.c | 77 +- drivers/misc/cxl/file.c| 15 +++-- 5 files changed, 68 insertions(+), 80 deletions(-) diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c index bcc030e..1a138c8 100644 --- a/drivers/misc/cxl/api.c +++ b/drivers/misc/cxl/api.c @@ -14,6 +14,7 @@ #include #include #include +#include #include "cxl.h" @@ -321,19 +322,29 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed, if (task) { ctx->pid = get_task_pid(task, PIDTYPE_PID); - ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID); kernel = false; ctx->real_mode = false; + + /* acquire a reference to the task's mm */ + ctx->mm = get_task_mm(current); + + /* ensure this mm_struct can't be freed */ + cxl_context_mm_count_get(ctx); + + /* decrement the use count */ + if (ctx->mm) + mmput(ctx->mm); } cxl_ctx_get(); if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) { - put_pid(ctx->glpid); put_pid(ctx->pid); - ctx->glpid = ctx->pid = NULL; + ctx->pid = NULL; cxl_adapter_context_put(ctx->afu->adapter); cxl_ctx_put(); + if (task) + cxl_context_mm_count_put(ctx); goto out; } diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c index 062bf6c..ed0a447 100644 --- a/drivers/misc/cxl/context.c +++ b/drivers/misc/cxl/context.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -41,7 +42,7 @@ int 
cxl_context_init(struct cxl_context *ctx, struct cxl_afu *afu, bool master) spin_lock_init(&ctx->sste_lock); ctx->afu = afu; ctx->master = master; - ctx->pid = ctx->glpid = NULL; /* Set in start work ioctl */ + ctx->pid = NULL; /* Set in start work ioctl */ mutex_init(&ctx->mapping_lock); ctx->mapping = NULL; @@ -242,12 +243,15 @@ int __detach_context(struct cxl_context *ctx) /* release the reference to the group leader and mm handling pid */ put_pid(ctx->pid); - put_pid(ctx->glpid); cxl_ctx_put(); /* Decrease the attached context count on the adapter */ cxl_adapter_context_put(ctx->afu->adapter); + + /* Decrease the mm count on the context */ + cxl_context_mm_count_put(ctx); + return 0; } @@ -325,3 +329,21 @@ void cxl_context_free(struct cxl_context *ctx) mutex_unlock(&ctx->afu->contexts_lock); call_rcu(&ctx->rcu, reclaim_ctx); } + +void cxl_context_mm_count_get(struct cxl_context *ctx) +{ + if (ctx->mm) + atomic_inc(&ctx->mm->mm_count); +} + +void cxl_context_mm_count_put(struct cxl_context *ctx) +{ + if (ctx->mm) + mmdrop(ctx->mm); +} + +void cxl_context_mm_users_get(struct cxl_context *ctx) +{ + if (ctx->mm) + atomic_inc(&ctx->mm->mm_users); +} diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h index 79e60ec..4d1b704 100644 --- a/drivers/misc/cxl/cxl.h +++ b/drivers/misc/cxl/cxl.h @@ -482,8 +482,6 @@ struct cxl_context { unsigned int sst_size, sst_lru; wait_queue_head_t wq; - /* pid of the group leader associated with the pid */ - struct pid *glpid; /* use mm context associated with this pid for ds faults */ struct pid *pid; spinlock_t lock; /* Protects pending_irq_mask, pending_fault and fault_addr */ @@ -551,6 +549,8 @@ struct cxl_context { * CX4 only: */ struct list_head extra_irq_contexts; + + struct mm_struct *mm; }; struct cxl_service_layer_ops { @@ -1024,4 +1024,13 @@ int cxl_adapter_context_lock(struct cxl *adapter); /* Unlock the contexts-lock if taken. 
Warn and force unlock otherwise */ void cxl_adapter_context_unlock(struct cxl *adapter); +/* Increases the reference count to "struct mm_struct" */ +void cxl_context_mm_count_get(struct cxl_context *ctx); + +/* Decrements the reference count to "struct mm_struct" */ +void cxl_context_mm_count_put(struct cxl_context *ctx); +
RE: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'
From: Linuxppc-dev Till Smejkal > Sent: 13 March 2017 22:14 > The only way until now to create a new memory map was via the exported > function 'mm_alloc'. Unfortunately, this function not only allocates a new > memory map, but also completely initializes it. However, with the > introduction of first class virtual address spaces, some initialization > steps done in 'mm_alloc' are not applicable to the memory maps needed for > this feature and hence would lead to errors in the kernel code. > > Instead of introducing a new function that can allocate and initialize > memory maps for first class virtual address spaces and potentially > duplicate some code, I decided to split the mm_alloc function as well as > the 'mm_init' function that it uses. > > Now there are four functions exported instead of only one. The new > 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one > wants the old behavior of mm_alloc one can use the newly introduced > function 'mm_alloc_and_setup' which not only allocates a new mm_struct but > also fully initializes it. ... That looks like bugs waiting to happen. You need unchanged code to fail to compile. David
ioctl structs differ from x86_64?
Hello, I am looking into a bug, https://bugzilla.linux.ibm.com/show_bug.cgi?id=152493 (external mirror: https://github.com/opencontainers/runc/issues/1364). Recently in runc code, they added this code https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR155. As you can see they set -onlcr to get rid of \r (line no. 164). Golang, in which runc is written, doesn't have any bindings for ioctls. This means you have to invoke C code directly (that's what they are doing there). Our guess is that the ioctls on ppc64le differ from x86_64, and that's why the code which is clearing the ONLCR bit (https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164) is failing on ppc64le but works fine on x86_64. Any pointers on the possible solution would be really helpful. Thanks, Harshal Patil E-mail: harshal.pa...@in.ibm.com
Re: ioctl structs differ from x86_64?
Adding Pradipta and Christy. Harshal Patil E-mail: harshal.pa...@in.ibm.com - Original message -From: Harshal Patil/India/IBMTo: linuxppc-dev@lists.ozlabs.orgCc: Sudipto Ghosh/India/IBM@IBMINSubject: ioctl structs differ from x86_64?Date: Tue, Mar 14, 2017 4:07 PM Hello, I am looking into a bug, https://bugzilla.linux.ibm.com/show_bug.cgi?id=152493 ( external mirror is at, https://github.com/opencontainers/runc/issues/1364) Recently in runc code, they added this code https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR155. As you can see they set -onlcr to get rid of \r (line no. 164). Golang, in which runc is written, doesn't have any bindings for ioctls. This means you have to invoke C code directly (that's what they are doing there). Our guess is the ioctls in ppc64le differ than x86_64, and thats why the code which is clearing onclr bit (https://github.com/opencontainers/runc/commit/eea28f480db435dbef4a275de9776b9934818b8c#diff-5f5c07d0cab3ce2086437d3d43c0d25fR164) is failing on ppc64le but works fine on x86_64. Any pointers on the possible solution would be really helpful. Thanks, Harshal Patil E-mail: harshal.pa...@in.ibm.com
Re: powerpc/perf: Fix perf_get_data_addr() for power9 DD1
On Mon, 2017-02-20 at 13:56:30 UTC, Madhavan Srinivasan wrote: > Power9 DD1 does not support the PMU_HAS_SIER flag, and sdsync > in perf_get_data_addr() defaults to MMCRA_SDSYNC, which > is wrong. Since the power9 MMCRA does not support the SDSYNC bit, > this patch adds the PPMU_NO_SIAR flag to the check and sets > sdsync to MMCRA_SAMPLE_ENABLE. > > Signed-off-by: Madhavan Srinivasan Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/f04d108029063a8a67528a88449c41 cheers
Re: powerpc/perf: Handle sdar_mode for marked event in power9
On Mon, 2017-02-20 at 13:59:03 UTC, Madhavan Srinivasan wrote: > MMCRA[SDAR_MODE] specifies how the SDAR should be updated in > continuous sampling mode. On P9 it must be set to 0b00 when > MMCRA[63] is set. > > Fixes: c7c3f568beff2 ('powerpc/perf: macros for power9 format encoding') > Signed-off-by: Madhavan Srinivasan Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/78b4416aa249365dd3c1b64da4d3a2 cheers
Re: [kernel] powerpc/powernv/ioda2: Update iommu table base on ownership change
On Tue, 2017-02-21 at 02:41:31 UTC, Alexey Kardashevskiy wrote: > On POWERNV platform, in order to do DMA via IOMMU (i.e. 32bit DMA in > our case), a device needs an iommu_table pointer set via > set_iommu_table_base(). > > The codeflow is: > - pnv_pci_ioda2_setup_dma_pe() > - pnv_pci_ioda2_setup_default_config() > - pnv_ioda_setup_bus_dma() [1] > > pnv_pci_ioda2_setup_dma_pe() creates IOMMU groups, > pnv_pci_ioda2_setup_default_config() does default DMA setup, > pnv_ioda_setup_bus_dma() takes a bus PE (on IODA2, all physical function > PEs as bus PEs except NPU), walks through all underlying buses and > devices, adds all devices to an IOMMU group and sets iommu_table. > > On IODA2, when VFIO is used, it takes ownership over a PE which means it > removes all tables and creates new ones (with a possibility of sharing > them among PEs). So when the ownership is returned from VFIO to > the kernel, the iommu_table pointer written to a device at [1] is > stale and needs an update. > > This adds an "add_to_group" parameter to pnv_ioda_setup_bus_dma() > (in fact re-adds as it used to be there a while ago for different > reasons) to tell the helper if a device needs to be added to > an IOMMU group with an iommu_table update or just the latter. > > This calls pnv_ioda_setup_bus_dma(..., false) from > pnv_ioda2_release_ownership() so when the ownership is restored, > 32bit DMA can work again for a device. This does the same thing > on obtaining ownership as the iommu_table point is stale at this point > anyway and it is safer to have NULL there. > > We did not hit this earlier as all tested devices in recent years were > only using 64bit DMA; the rare exception for this is MPT3 SAS adapter > which uses both 32bit and 64bit DMA access and it has not been tested > with VFIO much. > > Cc: Gavin Shan > Signed-off-by: Alexey Kardashevskiy > Acked-by: Gavin Shan > Reviewed-by: David Gibson Applied to powerpc fixes, thanks. 
https://git.kernel.org/powerpc/c/db08e1d53034a54fe177ced70476fd cheers
Re: [kernel] powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
On Wed, 2017-02-22 at 04:43:59 UTC, Alexey Kardashevskiy wrote: > The IODA2 specification says that a 64-bit DMA address cannot use the top 4 bits > (3 are reserved and one is a "TVE select"); the bottom page_shift bits > cannot be used for multilevel table addressing either. > > The existing IODA2 table allocation code aligns the minimum TCE table > size to PAGE_SIZE, so in the case of 64K system pages and 4K IOMMU pages, > we have 64-4-12=48 bits. Since a 64K page stores 8192 TCEs, i.e. needs > 13 bits, the maximum number of levels is 48/13 = 3, so we physically > cannot address more and an EEH happens on DMA accesses. > > This adds a check that rejects requests for too many levels. > > It is still possible to have 5 levels in the case of 4K system page size. > > Signed-off-by: Alexey Kardashevskiy > Acked-by: Gavin Shan Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/7aafac11e308d37ed3c509829bb43d cheers
Re: [1/3] powerpc/64s: fix handling of non-synchronous machine checks
On Tue, 2017-02-28 at 02:00:46 UTC, Nicholas Piggin wrote: > A synchronous machine check is an exception raised by the attempt to > execute the current instruction. If the error can't be corrected, it > can make sense to SIGBUS the currently running process. > > In other cases, the error condition is not related to the current > instruction, so killing the current process is not the right thing to > do. > > Today, all machine checks are MCE_SEV_ERROR_SYNC, so this has no > practical change. It will be used to handle POWER9 asynchronous > machine checks. > > Signed-off-by: Nicholas Piggin Series applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/1363875bdb6317a2d0798284d7aaf3 cheers
Re: selftests/powerpc: Replace stxvx and lxvx with their equivalent instruction
On Tue, 2017-03-07 at 00:39:31 UTC, Cyril Bur wrote: > On POWER8 (ISA 2.07) lxvx and stxvx are defined to be extended mnemonics > of lxvd2x and stxvd2x. For POWER9 (ISA 3.0) the HW architects in their > infinite wisdom made lxvx and stxvx instructions in their own right. > > POWER9-aware GCC will use the POWER9 instruction for lxvx and stxvx, > causing these selftests to fail on POWER8. Further compounding the > issue, because of the way -mvsx works it will cause the power9 > instructions to be used regardless of -mcpu=power8 to GCC or -mpower8 to > AS. > > The safest way to address the problem for now is to not use the extended > mnemonics. These tests only perform register comparisons; the big-endian-only > byte ordering of stxvd2x and lxvd2x does not impact the test. > > Signed-off-by: Cyril Bur > Acked-by: Balbir Singh Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/605df8d674ac65e044a0bf4998b28c cheers
Re: powerpc/boot: Fix zImage TOC alignment
On Tue, 2017-03-07 at 23:40:21 UTC, Michael Ellerman wrote: > Recent toolchains force the TOC to be 256 byte aligned. We need to > enforce this alignment in the zImage linker script, otherwise pointers > to our TOC variables (__toc_start) could be incorrect. If the actual > start of the TOC and __toc_start don't have the same value we crash > early in the zImage wrapper. > > Cc: sta...@vger.kernel.org > Suggested-by: Alan Modra > Signed-off-by: Michael Ellerman Applied to powerpc fixes. https://git.kernel.org/powerpc/c/97ee351b50a49717543533cfb85b4b cheers
Re: powerpc: Fix crash introduced with commit 5657933dbb6e
On Fri, 2017-03-10 at 02:33:51 UTC, Larry Finger wrote: > Code merged between kernels 4.10 and 4.11-rc1 > caused an early panic quickly followed by a complete shutdown for > PowerPC. The traceback was not displayed long enough to read or > photograph, thus it is not reproduced here. > > The problem was bisected to commit 5657933dbb6e ("treewide: Move dma_ops > from struct dev_archdata into struct device"). The problem was a missing > copy of critical information from struct dev_archdata into struct device, > leading to the dereference of a null pointer. > > This fix was suggested by Benjamin Herrenschmidt. > > Fixes: commit 5657933dbb6e ("treewide: Move dma_ops from struct dev_archdata > into struct device") > Signed-off-by: Larry Finger > Cc: linuxppc-dev@lists.ozlabs.org > Cc: Benjamin Herrenschmidt > Cc: regressi...@leemhuis.info Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/46f401c4297a2232a037ad8801b6c8 cheers
Re: syscall statx not implemented on powerpc
Chris Packham writes: > On 13/03/17 21:52, Chandan Rajendra wrote: >> On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote: >>> Hi, >>> >>> I've just attempted to build a powerpc kernel from 4.11-rc2 using a >>> custom defconfig (available on request) and I'm hitting the following >>> error in the early stages of compilation. >>> >>> :1325:2: error: #warning syscall statx not implemented [-Werror=cpp] >>> >>> Same thing seems to happen with mpc85xx_basic_defconfig. >>> >>> I don't actually need this syscall so I'd be happy to turn something off >>> to get things building. I did a quick search and couldn't see anything >>> on linuxppc-dev but google keeps correcting "statx" to "stats" so I >>> could have missed it. >>> >> >> The upstream commit >> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f) >> that introduces the statx syscall provides a test program. I will wire up the >> syscall on ppc64, run that test program and post the patch if the test >> program >> works well. >> > > Thanks, I'd be happy to test a patch here. > > In the meantime I worked around the build issue by adding __IGNORE_statx > to checksyscalls.sh. Is it actually breaking the build? It's meant to be a warning, and that's all I get. cheers
Re: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t
Elena Reshetova writes: > refcount_t type and corresponding API should be > used instead of atomic_t when the variable is used as > a reference counter. This allows to avoid accidental > refcounter overflows that might lead to use-after-free > situations. > > Signed-off-by: Elena Reshetova > Signed-off-by: Hans Liljestrand > Signed-off-by: Kees Cook > Signed-off-by: David Windsor > --- > drivers/md/md.c | 6 +++--- > drivers/md/md.h | 3 ++- > 2 files changed, 5 insertions(+), 4 deletions(-) When booting linux-next (specifically 5be4921c9958ec) I'm seeing the backtrace below. I suspect this patch is just exposing an existing issue? cheers [0.230738] md: Waiting for all devices to be available before autodetect [0.230742] md: If you don't use raid, use raid=noautodetect [0.230962] refcount_t: increment on 0; use-after-free. [0.230988] [ cut here ] [0.230996] WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 .refcount_inc+0x5c/0x70 [0.231001] Modules linked in: [0.231006] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-gccN-next-20170310-g5be4921 #1 [0.231012] task: c0004940 task.stack: c0004944 [0.231016] NIP: c05ac6bc LR: c05ac6b8 CTR: c0743390 [0.231021] REGS: c00049443160 TRAP: 0700 Not tainted (4.11.0-rc1-gccN-next-20170310-g5be4921) [0.231026] MSR: 80029032 [0.231033] CR: 24024422 XER: 000c [0.231038] CFAR: c0a5356c SOFTE: 1 [0.231038] GPR00: c05ac6b8 c000494433e0 c1079d00 002b [0.231038] GPR04: 00ef c10418a0 [0.231038] GPR08: 4af8 c0ecc9a8 c0ecc9a8 [0.231038] GPR12: 28024824 c6bb c00049443a00 [0.231038] GPR16: c00049443a10 [0.231038] GPR20: c0f7dd20 [0.231038] GPR24: 014080c0 c12060b8 c1206080 0009 [0.231038] GPR28: c0f7dde0 0090 c000461ae800 [0.231100] NIP [c05ac6bc] .refcount_inc+0x5c/0x70 [0.231104] LR [c05ac6b8] .refcount_inc+0x58/0x70 [0.231108] Call Trace: [0.231112] [c000494433e0] [c05ac6b8] .refcount_inc+0x58/0x70 (unreliable) [0.231120] [c00049443450] [c086c008] .mddev_find+0x1e8/0x430 [0.231125] [c00049443530] [c0872b6c] .md_open+0x2c/0x140 [0.231132] 
[c000494435c0] [c03962a4] .__blkdev_get+0xd4/0x520 [0.231138] [c00049443690] [c0396cc0] .blkdev_get+0x1c0/0x4f0 [0.231145] [c00049443790] [c0336d64] .do_dentry_open.isra.1+0x2a4/0x410 [0.231152] [c00049443830] [c03523f4] .path_openat+0x624/0x1580 [0.231157] [c00049443990] [c0354ce4] .do_filp_open+0x84/0x120 [0.231163] [c00049443b10] [c0338d74] .do_sys_open+0x214/0x300 [0.231170] [c00049443be0] [c0da69ac] .md_run_setup+0xa0/0xec [0.231176] [c00049443c60] [c0da4fbc] .prepare_namespace+0x60/0x240 [0.231182] [c00049443ce0] [c0da47a8] .kernel_init_freeable+0x330/0x36c [0.231190] [c00049443db0] [c000dc44] .kernel_init+0x24/0x160 [0.231197] [c00049443e30] [c000badc] .ret_from_kernel_thread+0x58/0x7c [0.231202] Instruction dump: [0.231206] 6000 3d22ffee 89296bfb 2f89 409effdc 3c62ffc6 3921 3d42ffee [0.231216] 38630928 992a6bfb 484a6e79 6000 <0fe0> 4bb8 6000 6000 [0.231226] ---[ end trace 8c51f269ad91ffc2 ]--- [0.231233] md: Autodetecting RAID arrays. [0.231236] md: autorun ... [0.231239] md: ... autorun DONE. [0.234188] EXT4-fs (sda4): mounting ext3 file system using the ext4 subsystem [0.250506] refcount_t: underflow; use-after-free. [0.250531] [ cut here ] [0.250537] WARNING: CPU: 0 PID: 3 at lib/refcount.c:207 .refcount_dec_not_one+0x104/0x120 [0.250542] Modules linked in: [0.250546] CPU: 0 PID: 3 Comm: kworker/0:0 Tainted: GW 4.11.0-rc1-gccN-next-20170310-g5be4921 #1 [0.250553] Workqueue: events .delayed_fput [0.250557] task: c00049404900 task.stack: c00049448000 [0.250562] NIP: c05ac964 LR: c05ac960 CTR: c0743390 [0.250567] REGS: c0004944b530 TRAP: 0700 Tainted: GW (4.11.0-rc1-gccN-next-20170310-g5be4921) [0.250572] MSR: 80029032 [0.250578] CR: 24002422 XER: 0007 [0.250584] CFAR: c0a5356c SOFTE: 1 [0.250584] GPR00: c05ac960 c0004944b7b0 c1079d00 0026 [0.250584] GPR04: 0113 c10418a0 [0.250584] GPR08: 00
[PATCH 0/6] table-based machine check handlers
Hi, This is a rebase of the table based MCE patches on top of the POWER9 handler that was merged upstream. Plus a couple of misc things (patches 1 and 6). The plan is to have OPAL firmware do most of this recovery and parsing in future, but I think it's worth tidying this up. IMO it's easier to see what's going on and less error prone to change or add things. Thanks, Nick Nicholas Piggin (6): powerpc/64s: machine check print NIP powerpc/64s: clean up machine check recovery flushing powerpc/64s: move POWER machine check defines into mce_power.c powerpc/64s: data driven machine check evaluation powerpc/64s: data driven machine check handling powerpc/64s: POWER8 add missing machine check definitions arch/powerpc/include/asm/mce.h | 91 - arch/powerpc/kernel/mce.c | 3 +- arch/powerpc/kernel/mce_power.c | 776 +++- 3 files changed, 369 insertions(+), 501 deletions(-) -- 2.11.0
[PATCH 1/6] powerpc/64s: machine check print NIP
Print the faulting address of the machine check, which may help with debugging. The effective address reported can be a target memory address rather than the faulting instruction address. Fix up a dangling bracket while here. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/mce.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c index a1475e6aef3a..399aeafb7dd4 100644 --- a/arch/powerpc/kernel/mce.c +++ b/arch/powerpc/kernel/mce.c @@ -310,7 +310,8 @@ void machine_check_print_event_info(struct machine_check_event *evt) printk("%s%s Machine check interrupt [%s]\n", level, sevstr, evt->disposition == MCE_DISPOSITION_RECOVERED ? - "Recovered" : "[Not recovered"); + "Recovered" : "Not recovered"); + printk("%s NIP: %016llx\n", level, evt->srr0); printk("%s Initiator: %s\n", level, evt->initiator == MCE_INITIATOR_CPU ? "CPU" : "Unknown"); switch (evt->error_type) { -- 2.11.0
[PATCH 2/6] powerpc/64s: clean up machine check recovery flushing
Use the flush function introduced with the POWER9 machine check handler for POWER7 and 8, rather than open coding it multiple times in callers. There is a specific ERAT flush type introduced for POWER9, but the POWER7-8 ERAT errors continue to do SLB flushing (which also flushes ERAT), so as not to introduce functional changes with this cleanup patch. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/mce_power.c | 97 ++--- 1 file changed, 23 insertions(+), 74 deletions(-) diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c index 763d6f58caa8..0f35a88e3655 100644 --- a/arch/powerpc/kernel/mce_power.c +++ b/arch/powerpc/kernel/mce_power.c @@ -161,81 +161,27 @@ static int mce_handle_flush_derrors(uint64_t dsisr, uint64_t slb, uint64_t tlb, return 1; } -static long mce_handle_derror(uint64_t dsisr, uint64_t slb_error_bits) -{ - long handled = 1; - - /* -* flush and reload SLBs for SLB errors and flush TLBs for TLB errors. -* reset the error bits whenever we handle them so that at the end -* we can check whether we handled all of them or not. -* */ -#ifdef CONFIG_PPC_STD_MMU_64 - if (dsisr & slb_error_bits) { - flush_and_reload_slb(); - /* reset error bits */ - dsisr &= ~(slb_error_bits); - } - if (dsisr & P7_DSISR_MC_TLB_MULTIHIT_MFTLB) { - if (cur_cpu_spec && cur_cpu_spec->flush_tlb) - cur_cpu_spec->flush_tlb(TLB_INVAL_SCOPE_GLOBAL); - /* reset error bits */ - dsisr &= ~P7_DSISR_MC_TLB_MULTIHIT_MFTLB; - } -#endif - /* Any other errors we don't understand? 
*/ - if (dsisr & 0xUL) - handled = 0; - - return handled; -} - static long mce_handle_derror_p7(uint64_t dsisr) { - return mce_handle_derror(dsisr, P7_DSISR_MC_SLB_ERRORS); + return mce_handle_flush_derrors(dsisr, + P7_DSISR_MC_SLB_ERRORS, + P7_DSISR_MC_TLB_MULTIHIT_MFTLB, + 0); } -static long mce_handle_common_ierror(uint64_t srr1) +static long mce_handle_ierror_p7(uint64_t srr1) { - long handled = 0; - switch (P7_SRR1_MC_IFETCH(srr1)) { - case 0: - break; -#ifdef CONFIG_PPC_STD_MMU_64 case P7_SRR1_MC_IFETCH_SLB_PARITY: case P7_SRR1_MC_IFETCH_SLB_MULTIHIT: - /* flush and reload SLBs for SLB errors. */ - flush_and_reload_slb(); - handled = 1; - break; + case P7_SRR1_MC_IFETCH_SLB_BOTH: + return mce_flush(MCE_FLUSH_SLB); + case P7_SRR1_MC_IFETCH_TLB_MULTIHIT: - if (cur_cpu_spec && cur_cpu_spec->flush_tlb) { - cur_cpu_spec->flush_tlb(TLB_INVAL_SCOPE_GLOBAL); - handled = 1; - } - break; -#endif + return mce_flush(MCE_FLUSH_TLB); default: - break; - } - - return handled; -} - -static long mce_handle_ierror_p7(uint64_t srr1) -{ - long handled = 0; - - handled = mce_handle_common_ierror(srr1); - -#ifdef CONFIG_PPC_STD_MMU_64 - if (P7_SRR1_MC_IFETCH(srr1) == P7_SRR1_MC_IFETCH_SLB_BOTH) { - flush_and_reload_slb(); - handled = 1; + return 0; } -#endif - return handled; } static void mce_get_common_ierror(struct mce_error_info *mce_err, uint64_t srr1) @@ -376,22 +322,25 @@ static void mce_get_derror_p8(struct mce_error_info *mce_err, uint64_t dsisr) static long mce_handle_ierror_p8(uint64_t srr1) { - long handled = 0; - - handled = mce_handle_common_ierror(srr1); + switch (P7_SRR1_MC_IFETCH(srr1)) { + case P7_SRR1_MC_IFETCH_SLB_PARITY: + case P7_SRR1_MC_IFETCH_SLB_MULTIHIT: + case P8_SRR1_MC_IFETCH_ERAT_MULTIHIT: + return mce_flush(MCE_FLUSH_SLB); -#ifdef CONFIG_PPC_STD_MMU_64 - if (P7_SRR1_MC_IFETCH(srr1) == P8_SRR1_MC_IFETCH_ERAT_MULTIHIT) { - flush_and_reload_slb(); - handled = 1; + case P7_SRR1_MC_IFETCH_TLB_MULTIHIT: + return mce_flush(MCE_FLUSH_TLB); + default: + 
return 0; } -#endif - return handled; } static long mce_handle_derror_p8(uint64_t dsisr) { - return mce_handle_derror(dsisr, P8_DSISR_MC_SLB_ERRORS); + return mce_handle_flush_derrors(dsisr, + P8_DSISR_MC_SLB_ERRORS, + P7_DSISR_MC_TLB_MULTIHIT_MFTLB, + 0); } long __machine_check_early_realmode_p8(struct pt_regs *regs) -- 2.11.0
[PATCH 3/6] powerpc/64s: move POWER machine check defines into mce_power.c
Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/mce.h | 91 arch/powerpc/kernel/mce_power.c | 92 + 2 files changed, 92 insertions(+), 91 deletions(-) diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h index ed62efe01e49..e3498b446788 100644 --- a/arch/powerpc/include/asm/mce.h +++ b/arch/powerpc/include/asm/mce.h @@ -24,97 +24,6 @@ #include -/* - * Machine Check bits on power7 and power8 - */ -#define P7_SRR1_MC_LOADSTORE(srr1) ((srr1) & PPC_BIT(42)) /* P8 too */ - -/* SRR1 bits for machine check (On Power7 and Power8) */ -#define P7_SRR1_MC_IFETCH(srr1)((srr1) & PPC_BITMASK(43, 45)) /* P8 too */ - -#define P7_SRR1_MC_IFETCH_UE (0x1 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_SLB_PARITY (0x2 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_SLB_MULTIHIT (0x3 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_SLB_BOTH (0x4 << PPC_BITLSHIFT(45)) -#define P7_SRR1_MC_IFETCH_TLB_MULTIHIT (0x5 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_UE_TLB_RELOAD(0x6 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_UE_IFU_INTERNAL (0x7 << PPC_BITLSHIFT(45)) - -/* SRR1 bits for machine check (On Power8) */ -#define P8_SRR1_MC_IFETCH_ERAT_MULTIHIT(0x4 << PPC_BITLSHIFT(45)) - -/* DSISR bits for machine check (On Power7 and Power8) */ -#define P7_DSISR_MC_UE (PPC_BIT(48)) /* P8 too */ -#define P7_DSISR_MC_UE_TABLEWALK (PPC_BIT(49)) /* P8 too */ -#define P7_DSISR_MC_ERAT_MULTIHIT (PPC_BIT(52)) /* P8 too */ -#define P7_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53)) /* P8 too */ -#define P7_DSISR_MC_SLB_PARITY_MFSLB (PPC_BIT(55)) /* P8 too */ -#define P7_DSISR_MC_SLB_MULTIHIT (PPC_BIT(56)) /* P8 too */ -#define P7_DSISR_MC_SLB_MULTIHIT_PARITY(PPC_BIT(57)) /* P8 too */ - -/* - * DSISR bits for machine check (Power8) in addition to above. 
- * Secondary DERAT Multihit - */ -#define P8_DSISR_MC_ERAT_MULTIHIT_SEC (PPC_BIT(54)) - -/* SLB error bits */ -#define P7_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_ERAT_MULTIHIT | \ -P7_DSISR_MC_SLB_PARITY_MFSLB | \ -P7_DSISR_MC_SLB_MULTIHIT | \ -P7_DSISR_MC_SLB_MULTIHIT_PARITY) - -#define P8_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_SLB_ERRORS | \ -P8_DSISR_MC_ERAT_MULTIHIT_SEC) - -/* - * Machine Check bits on power9 - */ -#define P9_SRR1_MC_LOADSTORE(srr1) (((srr1) >> PPC_BITLSHIFT(42)) & 1) - -#define P9_SRR1_MC_IFETCH(srr1)( \ - PPC_BITEXTRACT(srr1, 45, 0) | \ - PPC_BITEXTRACT(srr1, 44, 1) | \ - PPC_BITEXTRACT(srr1, 43, 2) | \ - PPC_BITEXTRACT(srr1, 36, 3) ) - -/* 0 is reserved */ -#define P9_SRR1_MC_IFETCH_UE 1 -#define P9_SRR1_MC_IFETCH_SLB_PARITY 2 -#define P9_SRR1_MC_IFETCH_SLB_MULTIHIT 3 -#define P9_SRR1_MC_IFETCH_ERAT_MULTIHIT4 -#define P9_SRR1_MC_IFETCH_TLB_MULTIHIT 5 -#define P9_SRR1_MC_IFETCH_UE_TLB_RELOAD6 -/* 7 is reserved */ -#define P9_SRR1_MC_IFETCH_LINK_TIMEOUT 8 -#define P9_SRR1_MC_IFETCH_LINK_TABLEWALK_TIMEOUT 9 -/* 10 ? 
*/ -#define P9_SRR1_MC_IFETCH_RA 11 -#define P9_SRR1_MC_IFETCH_RA_TABLEWALK 12 -#define P9_SRR1_MC_IFETCH_RA_ASYNC_STORE 13 -#define P9_SRR1_MC_IFETCH_LINK_ASYNC_STORE_TIMEOUT 14 -#define P9_SRR1_MC_IFETCH_RA_TABLEWALK_FOREIGN 15 - -/* DSISR bits for machine check (On Power9) */ -#define P9_DSISR_MC_UE (PPC_BIT(48)) -#define P9_DSISR_MC_UE_TABLEWALK (PPC_BIT(49)) -#define P9_DSISR_MC_LINK_LOAD_TIMEOUT (PPC_BIT(50)) -#define P9_DSISR_MC_LINK_TABLEWALK_TIMEOUT (PPC_BIT(51)) -#define P9_DSISR_MC_ERAT_MULTIHIT (PPC_BIT(52)) -#define P9_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53)) -#define P9_DSISR_MC_USER_TLBIE (PPC_BIT(54)) -#define P9_DSISR_MC_SLB_PARITY_MFSLB (PPC_BIT(55)) -#define P9_DSISR_MC_SLB_MULTIHIT_MFSLB (PPC_BIT(56)) -#define P9_DSISR_MC_RA_LOAD(PPC_BIT(57)) -#define P9_DSISR_MC_RA_TABLEWALK (PPC_BIT(58)) -#define P9_DSISR_MC_RA_TABLEWALK_FOREIGN (PPC_BIT(59)) -#define P9_DSISR_MC_RA_FOREIGN (PPC_BIT(60)) - -/* SLB error bits */ -#define P9_DSISR_MC_SLB_ERRORS (P9_DSISR_MC_ERAT_MULTIHIT | \ -P9_DSISR_MC_SLB_PARITY_MFSLB | \ -P9_DSISR_MC_SLB_MULTIHIT_MFSLB) - enum MCE_Version { MCE_V1 = 1, }; diff --git a/arch/p
[PATCH 4/6] powerpc/64s: data driven machine check evaluation
Have machine types define i-side and d-side tables to describe their machine check encodings, and match entries to evaluate (for reporting) machine checks. Functionality is mostly unchanged (tested with a userspace harness), but it does make a change in that it no longer records DAR as the effective address for those errors where it is specified to be invalid (which is a reporting change only). Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/mce_power.c | 571 +++- 1 file changed, 328 insertions(+), 243 deletions(-) diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c index 9b3bcd1213bf..1b38fabdd2a0 100644 --- a/arch/powerpc/kernel/mce_power.c +++ b/arch/powerpc/kernel/mce_power.c @@ -253,91 +253,304 @@ static int mce_handle_flush_derrors(uint64_t dsisr, uint64_t slb, uint64_t tlb, P9_DSISR_MC_SLB_PARITY_MFSLB | \ P9_DSISR_MC_SLB_MULTIHIT_MFSLB) -static long mce_handle_derror_p7(uint64_t dsisr) -{ - return mce_handle_flush_derrors(dsisr, - P7_DSISR_MC_SLB_ERRORS, - P7_DSISR_MC_TLB_MULTIHIT_MFTLB, - 0); -} - -static long mce_handle_ierror_p7(uint64_t srr1) +struct mce_ierror_table { + unsigned long srr1_mask; + unsigned long srr1_value; + bool nip_valid; /* nip is a valid indicator of faulting address */ + unsigned int error_type; + unsigned int error_subtype; + unsigned int initiator; + unsigned int severity; +}; + +static const struct mce_ierror_table mce_p7_ierror_table[] = { +{ 0x001c, 0x0004, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0008, true, + MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x000c, true, + MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0010, true, + MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_INDETERMINATE, /* BOTH */ + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0014, true, + MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, 
MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0018, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x001c, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0, 0, 0, 0, 0, 0 } }; + +static const struct mce_ierror_table mce_p8_ierror_table[] = { +{ 0x001c, 0x0004, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0008, true, + MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x000c, true, + MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0010, true, + MCE_ERROR_TYPE_ERAT,MCE_ERAT_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0014, true, + MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x0018, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x001c, 0x001c, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0, 0, 0, 0, 0, 0 } }; + +static const struct mce_ierror_table mce_p9_ierror_table[] = { +{ 0x081c, 0x0004, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0008, true, + MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x000c, true, + MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0010, true, + MCE_ERROR_TYPE_ERAT,MCE_ERAT_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0014, true, + MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0018, true, + MCE_ERROR_TYPE_UE, MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0800, true, + 
MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_IFETCH_TIMEOUT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0804, true, + MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_PAGE_TABLE_WALK_IFETCH_TIMEOUT
[PATCH 5/6] powerpc/64s: data driven machine check handling
Move the handling (corrective action) of machine checks to the table based evaluation. This changes P7 and P8 ERAT flushing from using SLB flush to using ERAT flush. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/mce_power.c | 328 +--- 1 file changed, 74 insertions(+), 254 deletions(-) diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c index 1b38fabdd2a0..95ed9d85ebcf 100644 --- a/arch/powerpc/kernel/mce_power.c +++ b/arch/powerpc/kernel/mce_power.c @@ -147,111 +147,7 @@ static int mce_flush(int what) return 0; } -static int mce_handle_flush_derrors(uint64_t dsisr, uint64_t slb, uint64_t tlb, uint64_t erat) -{ - if ((dsisr & slb) && mce_flush(MCE_FLUSH_SLB)) - dsisr &= ~slb; - if ((dsisr & erat) && mce_flush(MCE_FLUSH_ERAT)) - dsisr &= ~erat; - if ((dsisr & tlb) && mce_flush(MCE_FLUSH_TLB)) - dsisr &= ~tlb; - /* Any other errors we don't understand? */ - if (dsisr) - return 0; - return 1; -} - - -/* - * Machine Check bits on power7 and power8 - */ -#define P7_SRR1_MC_LOADSTORE(srr1) ((srr1) & PPC_BIT(42)) /* P8 too */ - -/* SRR1 bits for machine check (On Power7 and Power8) */ -#define P7_SRR1_MC_IFETCH(srr1)((srr1) & PPC_BITMASK(43, 45)) /* P8 too */ - -#define P7_SRR1_MC_IFETCH_UE (0x1 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_SLB_PARITY (0x2 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_SLB_MULTIHIT (0x3 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_SLB_BOTH (0x4 << PPC_BITLSHIFT(45)) -#define P7_SRR1_MC_IFETCH_TLB_MULTIHIT (0x5 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_UE_TLB_RELOAD(0x6 << PPC_BITLSHIFT(45)) /* P8 too */ -#define P7_SRR1_MC_IFETCH_UE_IFU_INTERNAL (0x7 << PPC_BITLSHIFT(45)) - -/* SRR1 bits for machine check (On Power8) */ -#define P8_SRR1_MC_IFETCH_ERAT_MULTIHIT(0x4 << PPC_BITLSHIFT(45)) - -/* DSISR bits for machine check (On Power7 and Power8) */ -#define P7_DSISR_MC_UE (PPC_BIT(48)) /* P8 too */ -#define P7_DSISR_MC_UE_TABLEWALK 
(PPC_BIT(49)) /* P8 too */ -#define P7_DSISR_MC_ERAT_MULTIHIT (PPC_BIT(52)) /* P8 too */ -#define P7_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53)) /* P8 too */ -#define P7_DSISR_MC_SLB_PARITY_MFSLB (PPC_BIT(55)) /* P8 too */ -#define P7_DSISR_MC_SLB_MULTIHIT (PPC_BIT(56)) /* P8 too */ -#define P7_DSISR_MC_SLB_MULTIHIT_PARITY(PPC_BIT(57)) /* P8 too */ - -/* - * DSISR bits for machine check (Power8) in addition to above. - * Secondary DERAT Multihit - */ -#define P8_DSISR_MC_ERAT_MULTIHIT_SEC (PPC_BIT(54)) - -/* SLB error bits */ -#define P7_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_ERAT_MULTIHIT | \ -P7_DSISR_MC_SLB_PARITY_MFSLB | \ -P7_DSISR_MC_SLB_MULTIHIT | \ -P7_DSISR_MC_SLB_MULTIHIT_PARITY) - -#define P8_DSISR_MC_SLB_ERRORS (P7_DSISR_MC_SLB_ERRORS | \ -P8_DSISR_MC_ERAT_MULTIHIT_SEC) - -/* - * Machine Check bits on power9 - */ -#define P9_SRR1_MC_LOADSTORE(srr1) (((srr1) >> PPC_BITLSHIFT(42)) & 1) - -#define P9_SRR1_MC_IFETCH(srr1)( \ - PPC_BITEXTRACT(srr1, 45, 0) | \ - PPC_BITEXTRACT(srr1, 44, 1) | \ - PPC_BITEXTRACT(srr1, 43, 2) | \ - PPC_BITEXTRACT(srr1, 36, 3) ) - -/* 0 is reserved */ -#define P9_SRR1_MC_IFETCH_UE 1 -#define P9_SRR1_MC_IFETCH_SLB_PARITY 2 -#define P9_SRR1_MC_IFETCH_SLB_MULTIHIT 3 -#define P9_SRR1_MC_IFETCH_ERAT_MULTIHIT4 -#define P9_SRR1_MC_IFETCH_TLB_MULTIHIT 5 -#define P9_SRR1_MC_IFETCH_UE_TLB_RELOAD6 -/* 7 is reserved */ -#define P9_SRR1_MC_IFETCH_LINK_TIMEOUT 8 -#define P9_SRR1_MC_IFETCH_LINK_TABLEWALK_TIMEOUT 9 -/* 10 ? 
*/ -#define P9_SRR1_MC_IFETCH_RA 11 -#define P9_SRR1_MC_IFETCH_RA_TABLEWALK 12 -#define P9_SRR1_MC_IFETCH_RA_ASYNC_STORE 13 -#define P9_SRR1_MC_IFETCH_LINK_ASYNC_STORE_TIMEOUT 14 -#define P9_SRR1_MC_IFETCH_RA_TABLEWALK_FOREIGN 15 - -/* DSISR bits for machine check (On Power9) */ -#define P9_DSISR_MC_UE (PPC_BIT(48)) -#define P9_DSISR_MC_UE_TABLEWALK (PPC_BIT(49)) -#define P9_DSISR_MC_LINK_LOAD_TIMEOUT (PPC_BIT(50)) -#define P9_DSISR_MC_LINK_TABLEWALK_TIMEOUT (PPC_BIT(51)) -#define P9_DSISR_MC_ERAT_MULTIHIT (PPC_BIT(52)) -#define P9_DSISR_MC_TLB_MULTIHIT_MFTLB (PPC_BIT(53)) -#define P9_DSISR_MC_USER_TLBIE (PPC_BIT(54)) -#define P9_DSISR_MC_SLB_PARITY_MFSLB (PPC_BIT(55)) -#define P9_DSISR_MC_SLB_MULTIHIT_MFSLB
[PATCH 6/6] powerpc/64s: POWER8 add missing machine check definitions
POWER8 uses bit 36 in SRR1 like POWER9 for i-side machine checks, and contains several conditions for link timeouts that are not currently handled. Signed-off-by: Nicholas Piggin --- arch/powerpc/kernel/mce_power.c | 26 +++--- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c index 95ed9d85ebcf..736530219bc3 100644 --- a/arch/powerpc/kernel/mce_power.c +++ b/arch/powerpc/kernel/mce_power.c @@ -184,27 +184,33 @@ static const struct mce_ierror_table mce_p7_ierror_table[] = { { 0, 0, 0, 0, 0, 0 } }; static const struct mce_ierror_table mce_p8_ierror_table[] = { -{ 0x001c, 0x0004, true, +{ 0x081c, 0x0004, true, MCE_ERROR_TYPE_UE, MCE_UE_ERROR_IFETCH, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, -{ 0x001c, 0x0008, true, +{ 0x081c, 0x0008, true, MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_PARITY, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, -{ 0x001c, 0x000c, true, +{ 0x081c, 0x000c, true, MCE_ERROR_TYPE_SLB, MCE_SLB_ERROR_MULTIHIT, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, -{ 0x001c, 0x0010, true, +{ 0x081c, 0x0010, true, MCE_ERROR_TYPE_ERAT,MCE_ERAT_ERROR_MULTIHIT, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, -{ 0x001c, 0x0014, true, +{ 0x081c, 0x0014, true, MCE_ERROR_TYPE_TLB, MCE_TLB_ERROR_MULTIHIT, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, -{ 0x001c, 0x0018, true, +{ 0x081c, 0x0018, true, MCE_ERROR_TYPE_UE, MCE_UE_ERROR_PAGE_TABLE_WALK_IFETCH, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, -{ 0x001c, 0x001c, true, +{ 0x081c, 0x001c, true, MCE_ERROR_TYPE_UE, MCE_UE_ERROR_IFETCH, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0800, true, + MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_IFETCH_TIMEOUT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x081c, 0x0804, true, + MCE_ERROR_TYPE_LINK,MCE_LINK_ERROR_PAGE_TABLE_WALK_IFETCH_TIMEOUT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, { 0, 0, 0, 0, 0, 0 } }; static const struct mce_ierror_table mce_p9_ierror_table[] = { @@ -289,6 +295,12 @@ static const struct 
mce_derror_table mce_p8_derror_table[] = { { 0x4000, true, MCE_ERROR_TYPE_UE, MCE_UE_ERROR_PAGE_TABLE_WALK_LOAD_STORE, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x2000, true, + MCE_ERROR_TYPE_LINK, MCE_LINK_ERROR_LOAD_TIMEOUT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, +{ 0x1000, true, + MCE_ERROR_TYPE_LINK, MCE_LINK_ERROR_PAGE_TABLE_WALK_LOAD_STORE_TIMEOUT, + MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, { 0x0800, true, MCE_ERROR_TYPE_ERAT, MCE_ERAT_ERROR_MULTIHIT, MCE_INITIATOR_CPU, MCE_SEV_ERROR_SYNC, }, -- 2.11.0
Re: [PATCH kernel] powerpc/powernv/ioda2: Update iommu table base on ownership change
On 02/20/2017 11:41 PM, Alexey Kardashevskiy wrote: Cc: Gavin Shan Signed-off-by: Alexey Kardashevskiy Tested-by: Mauricio Faria de Oliveira P.S.: sorry, late, but for the record. -- Mauricio Faria de Oliveira IBM Linux Technology Center
Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c
On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman wrote:
> Michael Ellerman writes:
>
>> We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it
>> from electra_cf.c which is a powerpc-only driver.
>>
>> Signed-off-by: Michael Ellerman
>> ---
>>  drivers/pcmcia/electra_cf.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Ping anyone?
>
> Or should I merge this via the powerpc tree?

That's what I would recommend for a powerpc-specific pcmcia driver, yes.

Looking at the bigger picture of powerpc drivers using NO_IRQ, I also see these others:

drivers/ata/pata_mpc52xx.c:     if (ata_irq == NO_IRQ) {
drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ
drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0
drivers/ata/sata_dwc_460ex.c:   if (hsdev->dma->irq == NO_IRQ) {
drivers/ata/sata_dwc_460ex.c:   if (irq == NO_IRQ) {
drivers/iommu/fsl_pamu.c:       if (irq == NO_IRQ) {
drivers/iommu/fsl_pamu.c:       if (irq != NO_IRQ)
drivers/media/platform/fsl-viu.c:       if (viu_irq == NO_IRQ) {
drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) {
drivers/pcmcia/electra_cf.c:    cf->irq = NO_IRQ;
drivers/pcmcia/electra_cf.c:    if (cf->irq != NO_IRQ)
drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if no interrupt is pending. */
drivers/soc/fsl/qe/qe_ic.c:             return NO_IRQ;
drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if no interrupt is pending. */
drivers/soc/fsl/qe/qe_ic.c:             return NO_IRQ;
drivers/soc/fsl/qe/qe_ic.c:     if (qe_ic->virq_low == NO_IRQ) {
drivers/soc/fsl/qe/qe_ic.c:     if (qe_ic->virq_high != NO_IRQ &&
drivers/spi/spi-mpc52xx.c:      if (status && (irq != NO_IRQ))
drivers/tty/ehv_bytechan.c:     if (stdout_irq == NO_IRQ) {
drivers/tty/ehv_bytechan.c:     if ((bc->rx_irq == NO_IRQ) || (bc->tx_irq == NO_IRQ)) {
drivers/tty/serial/cpm_uart/cpm_uart_core.c:    if (pinfo->port.irq == NO_IRQ) {
drivers/uio/uio_fsl_elbc_gpcm.c:        if (irq != NO_IRQ) {
drivers/uio/uio_fsl_elbc_gpcm.c:        irq = NO_IRQ;
drivers/uio/uio_fsl_elbc_gpcm.c:                 irq != NO_IRQ ? irq : -1);
drivers/usb/host/ehci-grlib.c:  if (irq == NO_IRQ) {
drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) {
drivers/usb/host/fhci-hcd.c:    if (usb_irq == NO_IRQ) {
drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) {
drivers/usb/host/uhci-grlib.c:  if (irq == NO_IRQ) {
drivers/video/fbdev/mb862xx/mb862xxfbdrv.c:     if (par->irq == NO_IRQ) {
drivers/virt/fsl_hypervisor.c:  if (!handle || (irq == NO_IRQ)) {
include/soc/fsl/qe/qe_ic.h:     if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h:     if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h:     if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h:     if (cascade_irq != NO_IRQ)
include/soc/fsl/qe/qe_ic.h:     if (cascade_irq == NO_IRQ)
include/soc/fsl/qe/qe_ic.h:     if (cascade_irq != NO_IRQ)

Did you have other pending patches for those?

Arnd
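The conversion pattern behind that grep list is mechanical: on powerpc (and most platforms today) "no interrupt" is represented by 0, so `irq == NO_IRQ` checks become plain zero checks. The sketch below illustrates the before/after shape; `probe_old`/`probe_new` are hypothetical stand-ins, not real driver code.

```c
#include <assert.h>

/* The constant being removed; on powerpc it is 0. */
#define NO_IRQ 0

/* Before: explicit NO_IRQ comparison. */
static int probe_old(unsigned int irq)
{
	if (irq == NO_IRQ)
		return -1;	/* a real driver would return -ENODEV etc. */
	return 0;
}

/* After: 0 simply means "no interrupt". */
static int probe_new(unsigned int irq)
{
	if (!irq)
		return -1;
	return 0;
}
```

Both forms behave identically wherever NO_IRQ is already 0, which is why the removal can be done driver by driver.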
Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src
On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote: > >Huh? PPC hasn't yet implemented this? Then why are you fixing it? > > yes, PPC hasn't implemented this (until now). until now where? > And did not understand "Then why are you fixing it?" I see no implementation; so why are you poking at it.
Re: [PATCH v5 2/5] powerpc: kretprobes: override default function entry offset
Em Thu, Mar 09, 2017 at 05:37:38PM +1100, Michael Ellerman escreveu: > "Naveen N. Rao" writes: > > On 2017/03/08 11:29AM, Arnaldo Carvalho de Melo wrote: > >> > I wasn't sure if you were planning on picking up KPROBES_ON_FTRACE for > >> > v4.11. If so, it would be good to take this patch through the powerpc > >> > tree. Otherwise, this can go via Ingo's tree. > >> > >> If you guys convince Ingo that this should go _now_, then just cherry > >> pick what was merged into tip/perf/core that is needed for the arch > >> specific stuff and go from there. > > > > Ok, in hindsight, I think Michael's concern was actually for v4.12 > > Yes I was talking about 4.12, sorry I thought that was implied :) > > > itself, in which case this particular patch can go via powerpc tree, > > while the rest of the patches in this series can go via your tree. > > > > Michael? > > Yeah I think that's the easiest option. The function will be temporarily > unused until the two trees are merged, but I think that's fine. Ok, done that, now compile testing building it in my multi-distro/x-build containers. - Arnaldo
Re: [PATCH 1/3] powernv:smp: Add busy-wait loop as fall back for CPU-Hotplug
On Mon, 13 Mar 2017 11:31:26 +0530 "Gautham R. Shenoy" wrote:
> [Changelog written with inputs from sva...@linux.vnet.ibm.com]
> Signed-off-by: Gautham R. Shenoy

Reviewed-by: Nicholas Piggin

> ---
>  arch/powerpc/platforms/powernv/smp.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c
> index e39e6c4..8d5b99e 100644
> --- a/arch/powerpc/platforms/powernv/smp.c
> +++ b/arch/powerpc/platforms/powernv/smp.c
> @@ -192,8 +192,16 @@ static void pnv_smp_cpu_kill_self(void)
> 		} else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
> 			   (idle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
> 			srr1 = power7_sleep();
> -		} else {
> +		} else if (idle_states & OPAL_PM_NAP_ENABLED) {
> 			srr1 = power7_nap(1);
> +		} else {
> +			/* This is the fallback method. We emulate snooze */
> +			while (!generic_check_cpu_restart(cpu)) {
> +				HMT_low();
> +				HMT_very_low();
> +			}
> +			srr1 = 0;
> +			HMT_medium();
> 		}
>
> 		ppc64_runlatch_on();
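The fallback branch added above is simple enough to model on the host. In this sketch `hmt_low()`, `hmt_very_low()`, `hmt_medium()` and `check_cpu_restart()` are stand-ins for the kernel's SMT-priority hints and `generic_check_cpu_restart()`; the restart condition is faked to fire after a few polls so the control flow can be exercised.

```c
#include <assert.h>
#include <stdbool.h>

static int poll_count;

static void hmt_low(void)      { }	/* would lower SMT thread priority */
static void hmt_very_low(void) { }
static void hmt_medium(void)   { }	/* would restore normal priority */

/* Pretend the offlined CPU is asked to come back after a few polls. */
static bool check_cpu_restart(void)
{
	return ++poll_count >= 5;
}

/* Shape of the snooze-emulation fallback: spin at low priority,
 * report srr1 = 0 since no real power-saving state was entered. */
static unsigned long fallback_offline_loop(void)
{
	unsigned long srr1;

	while (!check_cpu_restart()) {
		hmt_low();
		hmt_very_low();
	}
	srr1 = 0;
	hmt_medium();
	return srr1;
}
```

Returning 0 in srr1 matches the patch: there is no wakeup reason to decode because the thread never left the core.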
RE: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t
> Elena Reshetova writes: > > > refcount_t type and corresponding API should be > > used instead of atomic_t when the variable is used as > > a reference counter. This allows to avoid accidental > > refcounter overflows that might lead to use-after-free > > situations. > > > > Signed-off-by: Elena Reshetova > > Signed-off-by: Hans Liljestrand > > Signed-off-by: Kees Cook > > Signed-off-by: David Windsor > > --- > > drivers/md/md.c | 6 +++--- > > drivers/md/md.h | 3 ++- > > 2 files changed, 5 insertions(+), 4 deletions(-) > > When booting linux-next (specifically 5be4921c9958ec) I'm seeing the > backtrace below. I suspect this patch is just exposing an existing > issue? Yes, we have actually been following this issue in another thread. It looks like the object is re-used somehow, but I can't quite understand how just by reading the code. This was what I put into the previous thread: "The log below indicates that you are using your refcounter in a bit weird way in mddev_find(). However, I can't find the place (just by reading the code) where you would increment the refcounter from zero (vs. setting it to one). It looks like you either iterate over existing nodes (and increment their counters, which should be >= 1 at the time of increment) or create a new node, but then mddev_init() sets the counter to 1." If you can help us understand what is going on with the object creation/destruction, it would be appreciated! Also, Shaohua Li stopped this patch from going in via his tree when the issue was caught, so we are not going to merge it until we figure this out. Best Regards, Elena. > > cheers > > > [0.230738] md: Waiting for all devices to be available before autodetect > [0.230742] md: If you don't use raid, use raid=noautodetect > [0.230962] refcount_t: increment on 0; use-after-free. 
> [0.230988] [ cut here ] > [0.230996] WARNING: CPU: 0 PID: 1 at lib/refcount.c:114 > .refcount_inc+0x5c/0x70 > [0.231001] Modules linked in: > [0.231006] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-gccN-next- > 20170310-g5be4921 #1 > [0.231012] task: c0004940 task.stack: c0004944 > [0.231016] NIP: c05ac6bc LR: c05ac6b8 CTR: > c0743390 > [0.231021] REGS: c00049443160 TRAP: 0700 Not tainted (4.11.0-rc1- > gccN-next-20170310-g5be4921) > [0.231026] MSR: 80029032 > [0.231033] CR: 24024422 XER: 000c > [0.231038] CFAR: c0a5356c SOFTE: 1 > [0.231038] GPR00: c05ac6b8 c000494433e0 c1079d00 > 002b > [0.231038] GPR04: 00ef > c10418a0 > [0.231038] GPR08: 4af8 c0ecc9a8 c0ecc9a8 > > [0.231038] GPR12: 28024824 c6bb > c00049443a00 > [0.231038] GPR16: c00049443a10 > > [0.231038] GPR20: c0f7dd20 > > [0.231038] GPR24: 014080c0 c12060b8 c1206080 > 0009 > [0.231038] GPR28: c0f7dde0 0090 > c000461ae800 > [0.231100] NIP [c05ac6bc] .refcount_inc+0x5c/0x70 > [0.231104] LR [c05ac6b8] .refcount_inc+0x58/0x70 > [0.231108] Call Trace: > [0.231112] [c000494433e0] [c05ac6b8] .refcount_inc+0x58/0x70 > (unreliable) > [0.231120] [c00049443450] [c086c008] > .mddev_find+0x1e8/0x430 > [0.231125] [c00049443530] [c0872b6c] .md_open+0x2c/0x140 > [0.231132] [c000494435c0] [c03962a4] > .__blkdev_get+0xd4/0x520 > [0.231138] [c00049443690] [c0396cc0] .blkdev_get+0x1c0/0x4f0 > [0.231145] [c00049443790] [c0336d64] > .do_dentry_open.isra.1+0x2a4/0x410 > [0.231152] [c00049443830] [c03523f4] > .path_openat+0x624/0x1580 > [0.231157] [c00049443990] [c0354ce4] > .do_filp_open+0x84/0x120 > [0.231163] [c00049443b10] [c0338d74] > .do_sys_open+0x214/0x300 > [0.231170] [c00049443be0] [c0da69ac] > .md_run_setup+0xa0/0xec > [0.231176] [c00049443c60] [c0da4fbc] > .prepare_namespace+0x60/0x240 > [0.231182] [c00049443ce0] [c0da47a8] > .kernel_init_freeable+0x330/0x36c > [0.231190] [c00049443db0] [c000dc44] .kernel_init+0x24/0x160 > [0.231197] [c00049443e30] [c000badc] > .ret_from_kernel_thread+0x58/0x7c > [0.231202] 
Instruction dump: > [0.231206] 6000 3d22ffee 89296bfb 2f89 409effdc 3c62ffc6 3921 > 3d42ffee > [0.231216] 38630928 992a6bfb 484a6e79 6000 <0fe0> 4bb8 > 6000 6000 > [0.231226] ---[ end trace 8c51f269ad91ffc2 ]--- > [0.231233] md: Autodetecting RAID arrays. > [0.231236] md: autorun ... > [0.231239]
Re: [PATCH 2/3] powernv:idle: Don't override default/deepest directly in kernel
On Mon, 13 Mar 2017 11:31:27 +0530 "Gautham R. Shenoy" wrote: > From: "Gautham R. Shenoy" > > Currently during idle-init on power9, if we don't find suitable stop > states in the device tree that can be used as the > default_stop/deepest_stop, we set stop0 (ESL=1,EC=1) as the default > stop state psscr to be used by power9_idle and deepest stop state > which is used by CPU-Hotplug. > > However, if the platform firmware has not configured or enabled a stop > state, the kernel should not make any assumptions and fallback to a > default choice. > > If the kernel uses a stop state that is not configured by the platform > firmware, it may lead to further failures which should be avoided. > > In this patch, we modify the init code to ensure that the kernel uses > only the stop states exposed by the firmware through the device > tree. When a suitable default stop state isn't found, we disable > ppc_md.power_save for power9. Similarly, when a suitable > deepest_stop_state is not found in the device tree exported by the > firmware, fall back to the default busy-wait loop in the CPU-Hotplug > code. Seems reasonable. I have a few comments that you may consider. Nothing too major. Btw., it would be nice to move this hotplug idling selection code to idle.c. Have the hotplug just ask to enter the best available idle mode and that's it. I'm not asking you to do that for this series, but perhaps consider it for the future. 
> diff --git a/arch/powerpc/platforms/powernv/idle.c > b/arch/powerpc/platforms/powernv/idle.c > index 4ee837e..9fde6e4 100644 > --- a/arch/powerpc/platforms/powernv/idle.c > +++ b/arch/powerpc/platforms/powernv/idle.c > @@ -147,7 +147,6 @@ u32 pnv_get_supported_cpuidle_states(void) > } > EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states); > > - > static void pnv_fastsleep_workaround_apply(void *info) > > { > @@ -243,6 +242,7 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600, > */ > u64 pnv_default_stop_val; > u64 pnv_default_stop_mask; > +bool default_stop_found; > > /* > * Used for ppc_md.power_save which needs a function with no parameters > @@ -264,6 +264,7 @@ static void power9_idle(void) > */ > u64 pnv_deepest_stop_psscr_val; > u64 pnv_deepest_stop_psscr_mask; > +bool deepest_stop_found; > > /* > * Power ISA 3.0 idle initialization. If the hotplug idle code was in idle.c, then all this deepest/default stop logic and register settings would be static to idle.c, which would be nice. If you have a function to check if deepest stop is found, then you don't need a non-static variable here (or for default_stop_found AFAIKS). > @@ -352,7 +353,6 @@ static int __init pnv_power9_idle_init(struct device_node > *np, u32 *flags, > u32 *residency_ns = NULL; > u64 max_residency_ns = 0; > int rc = 0, i; > - bool default_stop_found = false, deepest_stop_found = false; > > psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL); > psscr_mask = kcalloc(dt_idle_states, sizeof(*psscr_mask), GFP_KERNEL); > @@ -433,20 +433,22 @@ static int __init pnv_power9_idle_init(struct > device_node *np, u32 *flags, > } > > if (!default_stop_found) { > - pnv_default_stop_val = PSSCR_HV_DEFAULT_VAL; > - pnv_default_stop_mask = PSSCR_HV_DEFAULT_MASK; > - pr_warn("Setting default stop psscr > val=0x%016llx,mask=0x%016llx\n", > + pr_warn("powernv:idle: Default stop not found. 
Disabling > ppcmd.powersave\n"); > + } else { > + pr_info("powernv:idle: Default stop: psscr = > 0x%016llx,mask=0x%016llx\n", > pnv_default_stop_val, pnv_default_stop_mask); > } > > if (!deepest_stop_found) { > - pnv_deepest_stop_psscr_val = PSSCR_HV_DEFAULT_VAL; > - pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK; > - pr_warn("Setting default stop psscr > val=0x%016llx,mask=0x%016llx\n", > + pr_warn("powernv:idle: Deepest stop not found. CPU-Hotplug is > affected\n"); I guess these warnings are meant for developers, but it might be nice to make them slightly more meaningful? Prefix of choice seems to be "cpuidle-powernv:" Then you could have "no suitable stop state provided by firmware. Disabling idle power saving" and "no suitable stop state provided by firmware. Offlined CPUs will busy-wait", perhaps? Just a suggestion. > + } else { > + pr_info("powernv:idle: Deepest stop: psscr = > 0x%016llx,mask=0x%016llx\n", > pnv_deepest_stop_psscr_val, > pnv_deepest_stop_psscr_mask); > } > > + pr_info("powernv:idle: RL value of first deep stop = 0x%llx\n", > + pnv_first_deep_stop_state); cpuidle-powernv: prefix for these too? > out: > kfree(psscr_val); > kfree(psscr_mask); > @@ -454,6 +456,12 @@ static int __init pnv_power9_idle_init(struct > device_node *np, u32 *flags, > return rc; > } > > +
Re: [PATCH 3/3] powernv:Recover correct PACA on wakeup from a stop on P9 DD1
On Mon, 13 Mar 2017 11:31:28 +0530 "Gautham R. Shenoy" wrote: > From: "Gautham R. Shenoy" > > The POWER9 platform can be configured to rebalance per-thread resources > within a core in order to improve SMT performance. Certain STOP > states can be configured to relinquish resources, including some > hypervisor SPRs, in order to enable SMT thread folding. > > Due to the relinquishing of per-thread resources under certain platform > configurations, certain SPR context could be lost within a core and > needs to be recovered. This state loss is due to reconfiguration of > SMT threads and not due to actual electrical power loss. Interesting. This will clash slightly with my patches, but that's no big deal. > This patch implements a context recovery framework within threads of a > core, by provisioning space in paca_struct for saving every sibling > thread's paca pointer. Basically, we should be able to arrive at the > right paca pointer from any of the threads' existing paca pointers. > > At bootup, during powernv idle-init, we save the paca address of every > CPU in each one of its siblings' paca_struct, in the slot corresponding to > this CPU's index in the core. > > On wakeup from a stop, the thread will determine its index in the core > from the lower 2 bits of the PIR register and recover its PACA pointer > by indexing into the correct slot in the provisioned space in the > current PACA. > > [Changelog written with inputs from sva...@linux.vnet.ibm.com] > > Signed-off-by: Gautham R. 
Shenoy > --- > arch/powerpc/include/asm/paca.h | 5 > arch/powerpc/kernel/asm-offsets.c | 1 + > arch/powerpc/kernel/idle_book3s.S | 43 > ++- > arch/powerpc/platforms/powernv/idle.c | 22 ++ > 4 files changed, 70 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h > index 708c3e5..4405630 100644 > --- a/arch/powerpc/include/asm/paca.h > +++ b/arch/powerpc/include/asm/paca.h > @@ -172,6 +172,11 @@ struct paca_struct { > u8 thread_mask; > /* Mask to denote subcore sibling threads */ > u8 subcore_sibling_mask; > + /* > + * Pointer to an array which contains pointer > + * to the sibling threads' paca. > + */ > + struct paca_struct *thread_sibling_pacas[8]; > #endif > > #ifdef CONFIG_PPC_BOOK3S_64 > diff --git a/arch/powerpc/kernel/asm-offsets.c > b/arch/powerpc/kernel/asm-offsets.c > index 4367e7d..6ec5016 100644 > --- a/arch/powerpc/kernel/asm-offsets.c > +++ b/arch/powerpc/kernel/asm-offsets.c > @@ -727,6 +727,7 @@ int main(void) > OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state); > OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask); > OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask); > + OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas); > #endif > > DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER); > diff --git a/arch/powerpc/kernel/idle_book3s.S > b/arch/powerpc/kernel/idle_book3s.S > index 9957287..5a90f2c 100644 > --- a/arch/powerpc/kernel/idle_book3s.S > +++ b/arch/powerpc/kernel/idle_book3s.S > @@ -385,7 +385,48 @@ _GLOBAL(power9_idle_stop) > */ > _GLOBAL(pnv_restore_hyp_resource) > BEGIN_FTR_SECTION > - ld r2,PACATOC(r13); > +BEGIN_FTR_SECTION_NESTED(70) > +/* Save our LR in r17 */ > + mflrr17 > +/* > + * On entering certain stop states, the thread might relinquish its > + * per-thread resources due to some reconfiguration for improved SMT > + * performance. 
This would result in certain SPR context such as > + * HSPRG0 (which contains the paca pointer) to be lost within the core. > + * > + * Fortunately, the PIR is invariant to thread reconfiguration. Since > + * this thread's paca pointer is recorded in all its sibling's > + * paca, we can correctly recover this thread's paca pointer if we > + * know the index of this thread in the core. > + * This index can be obtained from the lower two bits of the PIR. > + * > + * i.e, thread's position in the core = PIR[62:63]. > + * If this value is i, then this thread's paca is > + * paca->thread_sibling_pacas[i]. > + */ > + mfspr r4, SPRN_PIR > + andi. r4, r4, 0x3 > +/* > + * Since each entry in thread_sibling_pacas is 8 bytes > + * we need to left-shift by 3 bits. Thus r4 = i * 8 > + */ > + sldir4, r4, 3 > +/* Get &paca->thread_sibling_pacas[0] in r5 */ > + addir5, r13, PACA_SIBLING_PACA_PTRS I don't actually understand how this works. We have in r13 a paca pointer already, and it must be a pointer to one of the threads in this core, but it may not be a pointer to this thread's paca? Can you explain how that works a bit more? Where does that r13 come from? The exception code before we reach here already saves registers into the paca save area (PACA_EXGEN), so this is going to go bad if we are saving into another thread's paca. I have a patch that makes the 0x100 vector use a a new
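The recovery scheme under discussion can be modeled in a few lines of host C. Each thread's paca records the paca pointers of all siblings in its core; a thread that wakes up with a possibly stale paca pointer takes PIR[62:63] (the low two bits) as its index into that array to find its own paca again. Names and the 4-thread core size here are illustrative, not the kernel's.

```c
#include <assert.h>

#define THREADS_PER_CORE 4

struct paca {
	int cpu_id;
	/* Every sibling's paca pointer, indexed by thread position. */
	struct paca *thread_sibling_pacas[THREADS_PER_CORE];
};

/* What the wakeup asm does: idx = PIR & 0x3, then index into the
 * sibling array of whatever (sibling) paca we currently hold. */
static struct paca *recover_paca(struct paca *any_sibling_paca,
				 unsigned long pir)
{
	unsigned long idx = pir & 0x3;

	return any_sibling_paca->thread_sibling_pacas[idx];
}
```

The key property is that the result is the same no matter which sibling's paca the lookup starts from, since every sibling carries an identical copy of the array.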
Re: [PATCH 08/29] drivers, md: convert mddev.active from atomic_t to refcount_t
On Tue, 2017-03-14 at 12:29 +, Reshetova, Elena wrote: > > Elena Reshetova writes: > > > > > refcount_t type and corresponding API should be > > > used instead of atomic_t when the variable is used as > > > a reference counter. This allows to avoid accidental > > > refcounter overflows that might lead to use-after-free > > > situations. > > > > > > Signed-off-by: Elena Reshetova > > > Signed-off-by: Hans Liljestrand > > > Signed-off-by: Kees Cook > > > Signed-off-by: David Windsor > > > --- > > > drivers/md/md.c | 6 +++--- > > > drivers/md/md.h | 3 ++- > > > 2 files changed, 5 insertions(+), 4 deletions(-) > > > > When booting linux-next (specifically 5be4921c9958ec) I'm seeing > > the > > backtrace below. I suspect this patch is just exposing an existing > > issue? > > Yes, we have actually been following this issue in the another > thread. > It looks like the object is re-used somehow, but I can't quite > understand how just by reading the code. > This was what I put into the previous thread: > > "The log below indicates that you are using your refcounter in a bit > weird way in mddev_find(). > However, I can't find the place (just by reading the code) where you > would increment refcounter from zero (vs. setting it to one). > It looks like you either iterate over existing nodes (and increment > their counters, which should be >= 1 at the time of increment) or > create a new node, but then mddev_init() sets the counter to 1. " > > If you can help to understand what is going on with the object > creation/destruction, would be appreciated! > > Also Shaohua Li stopped this patch coming from his tree since the > issue was caught at that time, so we are not going to merge this > until we figure it out. Asking on the correct list (dm-devel) would have got you the easy answer: The refcount behind mddev->active is a genuine atomic. It has refcount properties but only if the array fails to initialise (in that case, final put kills it). 
Once it's added to the system as a gendisk, it cannot be freed until md_free(). Thus its ->active count can go to zero (when it becomes inactive; usually because of an unmount). On a simple allocation regardless of outcome, the last executed statement in md_alloc is mddev_put(): that destroys the device if we didn't manage to create it or returns 0 and adds an inactive device to the system which the user can get with mddev_find(). James
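The mismatch James describes can be shown with a toy counter: refcount_t treats increment-on-zero as use-after-free, while mddev->active legitimately drops to zero for an allocated-but-inactive gendisk that mddev_find() may later revive. All names below are illustrative, not the kernel's refcount implementation.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_refcount {
	int val;
	bool warned;	/* records the "increment on 0" WARN */
};

static void toy_refcount_set(struct toy_refcount *r, int v)
{
	r->val = v;
}

/* refcount_t semantics: incrementing from 0 is flagged, not performed. */
static bool toy_refcount_inc(struct toy_refcount *r)
{
	if (r->val == 0) {
		r->warned = true;	/* "refcount_t: increment on 0" */
		return false;
	}
	r->val++;
	return true;
}
```

With a genuine atomic, the 0 → 1 transition in mddev_find() is fine; with these semantics it trips the warning seen in the boot log, which is why the mddev.active conversion was held back.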
Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration
On Tue, 14 Mar 2017 11:55:33 +1100 David Gibson wrote: > On Tue, Mar 14, 2017 at 11:54:03AM +1100, Alexey Kardashevskiy wrote: > > On 10/03/17 15:48, David Gibson wrote: > > > On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote: > > >> This is my current queue of patches to add acceleration of TCE > > >> updates in KVM. > > >> > > >> This is based on Linus'es tree sha1 c1aa905a304e. Hmm, sure about that? 03/10 doesn't apply. > > > > > > I think we're finally there - I've now sent an R-b for all patches. > > > > Thanks for the patience. > > > > > > I supposed in order to proceed now I need an ack from Alex, correct? > > That, or simply for him to merge it. Given the diffstat, I'd guess you're looking for acks from me and maybe Paolo, but it looks like it should be merged through ppc trees. Thanks, Alex > > >> > > >> Please comment. Thanks. > > >> > > >> Changes: > > >> v8: > > >> * kept fixing oddities with error handling in 10/10 > > >> > > >> v7: > > >> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c > > >> > > >> v6: > > >> * reworked the last patch in terms of error handling and parameters > > >> checking > > >> > > >> v5: > > >> * replaced "KVM: PPC: Separate TCE validation from update" with > > >> "KVM: PPC: iommu: Unify TCE checking" > > >> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup > > >> iommu_table disposal" > > >> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO" > > >> * more details in individual commit logs > > >> > > >> v4: > > >> * addressed comments from v3 > > >> * updated subject lines with correct component names > > >> * regrouped the patchset in order: > > >> - powerpc fixes; > > >> - vfio_spapr_tce driver fixes; > > >> - KVM/PPC fixes; > > >> - KVM+PPC+VFIO; > > >> * everything except last 2 patches have "Reviewed-By: David" > > >> > > >> v3: > > >> * there was no full repost, only last patch was posted > > >> > > >> v2: > > >> * 11/11 reworked to use new 
notifiers, it is rather RFC as it still has > > >> a issue; > > >> * got 09/11, 10/11 to use notifiers in 11/11; > > >> * added rb: David to most of patches and added a comment in 05/11. > > >> > > >> Alexey Kardashevskiy (10): > > >> powerpc/mmu: Add real mode support for IOMMU preregistered memory > > >> powerpc/powernv/iommu: Add real mode version of > > >> iommu_table_ops::exchange() > > >> powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal > > >> powerpc/vfio_spapr_tce: Add reference counting to iommu_table > > >> KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number > > >> KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently > > >> KVM: PPC: Pass kvm* to kvmppc_find_table() > > >> KVM: PPC: Use preregistered memory API to access TCE list > > >> KVM: PPC: iommu: Unify TCE checking > > >> KVM: PPC: VFIO: Add in-kernel acceleration for VFIO > > >> > > >> Documentation/virtual/kvm/devices/vfio.txt | 22 +- > > >> arch/powerpc/include/asm/iommu.h | 32 ++- > > >> arch/powerpc/include/asm/kvm_host.h| 8 + > > >> arch/powerpc/include/asm/kvm_ppc.h | 12 +- > > >> arch/powerpc/include/asm/mmu_context.h | 4 + > > >> include/uapi/linux/kvm.h | 9 + > > >> arch/powerpc/kernel/iommu.c| 86 +--- > > >> arch/powerpc/kvm/book3s_64_vio.c | 330 > > >> - > > >> arch/powerpc/kvm/book3s_64_vio_hv.c| 303 > > >> ++ > > >> arch/powerpc/kvm/powerpc.c | 2 + > > >> arch/powerpc/mm/mmu_context_iommu.c| 39 > > >> arch/powerpc/platforms/powernv/pci-ioda.c | 46 ++-- > > >> arch/powerpc/platforms/powernv/pci.c | 1 + > > >> arch/powerpc/platforms/pseries/iommu.c | 3 +- > > >> arch/powerpc/platforms/pseries/vio.c | 2 +- > > >> drivers/vfio/vfio_iommu_spapr_tce.c| 2 +- > > >> virt/kvm/vfio.c| 60 ++ > > >> arch/powerpc/kvm/Kconfig | 1 + > > >> 18 files changed, 855 insertions(+), 107 deletions(-) > > >> > > > > > > > > > > >
Re: [PATCH kernel v8 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
On Fri, 10 Mar 2017 14:53:30 +1100 Alexey Kardashevskiy wrote: > At the moment iommu_table can be disposed by either calling > iommu_table_free() directly or it_ops::free(); the only implementation > of free() is in IODA2 - pnv_ioda2_table_free() - and it calls > iommu_table_free() anyway. > > As we are going to have reference counting on tables, we need an unified > way of disposing tables. > > This moves it_ops::free() call into iommu_free_table() and makes use > of the latter. The free() callback now handles only platform-specific > data. > > As from now on the iommu_free_table() calls it_ops->free(), we need > to have it_ops initialized before calling iommu_free_table() so this > moves this initialization in pnv_pci_ioda2_create_table(). > > This should cause no behavioral change. > > Signed-off-by: Alexey Kardashevskiy > Reviewed-by: David Gibson > --- > Changes: > v5: > * moved "tbl->it_ops = &pnv_ioda2_iommu_ops" earlier and updated > the commit log > --- > arch/powerpc/kernel/iommu.c | 4 > arch/powerpc/platforms/powernv/pci-ioda.c | 10 -- > drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- > 3 files changed, 9 insertions(+), 7 deletions(-) > > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c > index 9bace5df05d5..bc142d87130f 100644 > --- a/arch/powerpc/kernel/iommu.c > +++ b/arch/powerpc/kernel/iommu.c > @@ -719,6 +719,9 @@ void iommu_free_table(struct iommu_table *tbl, const char > *node_name) > if (!tbl) > return; > > + if (tbl->it_ops->free) > + tbl->it_ops->free(tbl); > + > if (!tbl->it_map) { > kfree(tbl); > return; > @@ -745,6 +748,7 @@ void iommu_free_table(struct iommu_table *tbl, const char > *node_name) > /* free table */ > kfree(tbl); > } > +EXPORT_SYMBOL_GPL(iommu_free_table); A slightly cringe worthy generically named export in arch code. > > /* Creates TCEs for a user provided buffer. The user buffer must be > * contiguous real kernel storage (not vmalloc). 
The address passed here > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > b/arch/powerpc/platforms/powernv/pci-ioda.c > index 69c40b43daa3..7916d0cb05fe 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -1425,7 +1425,6 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev > *dev, struct pnv_ioda_pe > iommu_group_put(pe->table_group.group); > BUG_ON(pe->table_group.group); > } > - pnv_pci_ioda2_table_free_pages(tbl); > iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); > } > > @@ -2041,7 +2040,6 @@ static void pnv_ioda2_tce_free(struct iommu_table *tbl, > long index, > static void pnv_ioda2_table_free(struct iommu_table *tbl) > { > pnv_pci_ioda2_table_free_pages(tbl); > - iommu_free_table(tbl, "pnv"); > } > > static struct iommu_table_ops pnv_ioda2_iommu_ops = { > @@ -2318,6 +2316,8 @@ static long pnv_pci_ioda2_create_table(struct > iommu_table_group *table_group, > if (!tbl) > return -ENOMEM; > > + tbl->it_ops = &pnv_ioda2_iommu_ops; > + > ret = pnv_pci_ioda2_table_alloc_pages(nid, > bus_offset, page_shift, window_size, > levels, tbl); > @@ -2326,8 +2326,6 @@ static long pnv_pci_ioda2_create_table(struct > iommu_table_group *table_group, > return ret; > } > > - tbl->it_ops = &pnv_ioda2_iommu_ops; > - > *ptbl = tbl; > > return 0; > @@ -2368,7 +2366,7 @@ static long pnv_pci_ioda2_setup_default_config(struct > pnv_ioda_pe *pe) > if (rc) { > pe_err(pe, "Failed to configure 32-bit TCE table, err %ld\n", > rc); > - pnv_ioda2_table_free(tbl); > + iommu_free_table(tbl, ""); > return rc; > } > > @@ -2456,7 +2454,7 @@ static void pnv_ioda2_take_ownership(struct > iommu_table_group *table_group) > pnv_pci_ioda2_unset_window(&pe->table_group, 0); > if (pe->pbus) > pnv_ioda_setup_bus_dma(pe, pe->pbus, false); > - pnv_ioda2_table_free(tbl); > + iommu_free_table(tbl, "pnv"); > } > > static void pnv_ioda2_release_ownership(struct iommu_table_group > *table_group) > diff --git 
a/drivers/vfio/vfio_iommu_spapr_tce.c > b/drivers/vfio/vfio_iommu_spapr_tce.c > index cf3de91fbfe7..fbec7348a7e5 100644 > --- a/drivers/vfio/vfio_iommu_spapr_tce.c > +++ b/drivers/vfio/vfio_iommu_spapr_tce.c > @@ -680,7 +680,7 @@ static void tce_iommu_free_table(struct tce_container > *container, > unsigned long pages = tbl->it_allocated_size >> PAGE_SHIFT; > > tce_iommu_userspace_view_free(tbl, container->mm); > - tbl->it_ops->free(tbl); > + iommu_free_table(tbl, ""); > decrement_locked_vm(container->mm, pages); > } > Acked-by: Alex Williamson
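The shape of the disposal unification in this patch is easy to sketch: iommu_free_table() now invokes the optional platform free() hook itself, so every caller has a single entry point. The types and toy_* names below are simplified stand-ins for the kernel structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct iommu_table;

struct iommu_table_ops {
	/* Optional platform-specific teardown (e.g. TCE page freeing). */
	void (*free)(struct iommu_table *tbl);
};

struct iommu_table {
	const struct iommu_table_ops *it_ops;
	bool pages_freed;	/* stand-in for freed TCE pages */
};

/* Models pnv_ioda2_table_free() after the patch: platform data only. */
static void toy_ioda2_table_free(struct iommu_table *tbl)
{
	tbl->pages_freed = true;
}

static const struct iommu_table_ops toy_ioda2_ops = {
	.free = toy_ioda2_table_free,
};

/* Unified disposal: call the hook if present, then release the table
 * (kfree() in the kernel; ownership stays with the caller here). */
static void toy_iommu_free_table(struct iommu_table *tbl)
{
	if (!tbl)
		return;
	if (tbl->it_ops && tbl->it_ops->free)
		tbl->it_ops->free(tbl);
}
```

This also shows why the patch moves the it_ops assignment before any error path that calls iommu_free_table(): the unified function dereferences it_ops, so it must be initialized first.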
[PATCH 17/19] perf powerpc: Choose local entry point with kretprobes
From: "Naveen N. Rao" perf now uses an offset from _text/_stext for kretprobes if the kernel supports it, rather than the actual function name. As such, let's choose the LEP for powerpc ABIv2 so as to ensure the probe gets hit. Do it only if the kernel supports specifying offsets with kretprobes. Signed-off-by: Naveen N. Rao Acked-by: Masami Hiramatsu Cc: Ananth N Mavinakayanahalli Cc: Michael Ellerman Cc: Steven Rostedt Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/7445b5334673ef5404ac1d12609bad4d73d2b567.1488961018.git.naveen.n@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/arch/powerpc/util/sym-handling.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/tools/perf/arch/powerpc/util/sym-handling.c b/tools/perf/arch/powerpc/util/sym-handling.c index 1030a6e504bb..39dbe512b9fc 100644 --- a/tools/perf/arch/powerpc/util/sym-handling.c +++ b/tools/perf/arch/powerpc/util/sym-handling.c @@ -10,6 +10,7 @@ #include "symbol.h" #include "map.h" #include "probe-event.h" +#include "probe-file.h" #ifdef HAVE_LIBELF_SUPPORT bool elf__needs_adjust_symbols(GElf_Ehdr ehdr) @@ -79,13 +80,18 @@ void arch__fix_tev_from_maps(struct perf_probe_event *pev, * However, if the user specifies an offset, we fall back to using the * GEP since all userspace applications (objdump/readelf) show function * disassembly with offsets from the GEP. -* -* In addition, we shouldn't specify an offset for kretprobes. */ - if (pev->point.offset || (!pev->uprobes && pev->point.retprobe) || - !map || !sym) + if (pev->point.offset || !map || !sym) return; + /* For kretprobes, add an offset only if the kernel supports it */ + if (!pev->uprobes && pev->point.retprobe) { +#ifdef HAVE_LIBELF_SUPPORT + if (!kretprobe_offset_is_supported()) +#endif + return; + } + lep_offset = PPC64_LOCAL_ENTRY_OFFSET(sym->arch_sym); if (map->dso->symtab_type == DSO_BINARY_TYPE__KALLSYMS) -- 2.9.3
[PATCH 15/19] perf probe: Factor out the ftrace README scanning
From: "Naveen N. Rao"

Simplify and separate out the ftrace README scanning logic into a separate helper. This is used subsequently to scan for all patterns of interest and to cache the result. Since we are only interested in availability of probe argument type x, we will only scan for that.

Signed-off-by: Naveen N. Rao
Acked-by: Masami Hiramatsu
Cc: Ananth N Mavinakayanahalli
Cc: Michael Ellerman
Cc: Steven Rostedt
Cc: linuxppc-dev@lists.ozlabs.org
Link: http://lkml.kernel.org/r/6dc30edc747ba82a236593be6cf3a046fa9453b5.1488961018.git.naveen.n@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/perf/util/probe-file.c | 70 +++-
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index 1a62daceb028..8a219cd831b7 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -877,35 +877,31 @@ int probe_cache__show_all_caches(struct strfilter *filter)
 	return 0;
 }
 
+enum ftrace_readme {
+	FTRACE_README_PROBE_TYPE_X = 0,
+	FTRACE_README_END,
+};
+
 static struct {
 	const char *pattern;
-	bool	avail;
-	bool	checked;
-} probe_type_table[] = {
-#define DEFINE_TYPE(idx, pat, def_avail)	\
-	[idx] = {.pattern = pat, .avail = (def_avail)}
-	DEFINE_TYPE(PROBE_TYPE_U, "* u8/16/32/64,*", true),
-	DEFINE_TYPE(PROBE_TYPE_S, "* s8/16/32/64,*", true),
-	DEFINE_TYPE(PROBE_TYPE_X, "* x8/16/32/64,*", false),
-	DEFINE_TYPE(PROBE_TYPE_STRING, "* string,*", true),
-	DEFINE_TYPE(PROBE_TYPE_BITFIELD,
-		    "* b@/", true),
+	bool avail;
+} ftrace_readme_table[] = {
+#define DEFINE_TYPE(idx, pat)	\
+	[idx] = {.pattern = pat, .avail = false}
+	DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
 };
 
-bool probe_type_is_available(enum probe_type type)
+static bool scan_ftrace_readme(enum ftrace_readme type)
 {
+	int fd;
 	FILE *fp;
 	char *buf = NULL;
 	size_t len = 0;
-	bool target_line = false;
-	bool ret = probe_type_table[type].avail;
-	int fd;
+	bool ret = false;
+	static bool scanned = false;
 
-	if (type >= PROBE_TYPE_END)
-		return false;
-	/* We don't have to check the type which supported by default */
-	if (ret || probe_type_table[type].checked)
-		return ret;
+	if (scanned)
+		goto result;
 
 	fd = open_trace_file("README", false);
 	if (fd < 0)
@@ -917,21 +913,29 @@ bool probe_type_is_available(enum probe_type type)
 		return ret;
 	}
 
-	while (getline(&buf, &len, fp) > 0 && !ret) {
-		if (!target_line) {
-			target_line = !!strstr(buf, " type: ");
-			if (!target_line)
-				continue;
-		} else if (strstr(buf, "\t ") != buf)
-			break;
-		ret = strglobmatch(buf, probe_type_table[type].pattern);
-	}
-	/* Cache the result */
-	probe_type_table[type].checked = true;
-	probe_type_table[type].avail = ret;
+	while (getline(&buf, &len, fp) > 0)
+		for (enum ftrace_readme i = 0; i < FTRACE_README_END; i++)
+			if (!ftrace_readme_table[i].avail)
+				ftrace_readme_table[i].avail =
+					strglobmatch(buf, ftrace_readme_table[i].pattern);
+	scanned = true;
 
 	fclose(fp);
 	free(buf);
-	return ret;
+result:
+	if (type >= FTRACE_README_END)
+		return false;
+
+	return ftrace_readme_table[type].avail;
+}
+
+bool probe_type_is_available(enum probe_type type)
+{
+	if (type >= PROBE_TYPE_END)
+		return false;
+	else if (type == PROBE_TYPE_X)
+		return scan_ftrace_readme(FTRACE_README_PROBE_TYPE_X);
+
+	return true;
+}
-- 
2.9.3
[PATCH 16/19] perf kretprobes: Offset from reloc_sym if kernel supports it
From: "Naveen N. Rao" We indicate support for accepting sym+offset with kretprobes through a line in ftrace README. Parse the same to identify support and choose the appropriate format for kprobe_events. As an example, without this perf patch, but with the ftrace changes: naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/tracing/README | grep kretprobe place (kretprobe): [:][+]| naveen@ubuntu:~/linux/tools/perf$ naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return probe-definition(0): do_open%return symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /boot/vmlinux for symbols Open Debuginfo file: /boot/vmlinux Try to find probe point from debuginfo. Matched function: do_open [2d0c7d8] Probe point found: do_open+0 Matched function: do_open [35d76b5] found inline addr: 0xc04ba984 Failed to find "do_open%return", because do_open is an inlined function and has no return point. An error occurred in debuginfo analysis (-22). Trying to use symbols. Opening /sys/kernel/debug/tracing//kprobe_events write=1 Writing event: r:probe/do_open do_open+0 Writing event: r:probe/do_open_1 do_open+0 Added new events: probe:do_open(on do_open%return) probe:do_open_1 (on do_open%return) You can now use it in all perf tools, such as: perf record -e probe:do_open_1 -aR sleep 1 naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list c0041370 k kretprobe_trampoline+0x0[OPTIMIZED] c04433d0 r do_open+0x0[DISABLED] c04433d0 r do_open+0x0[DISABLED] And after this patch (and the subsequent powerpc patch): naveen@ubuntu:~/linux/tools/perf$ sudo ./perf probe -v do_open%return probe-definition(0): do_open%return symbol:do_open file:(null) line:0 offset:0 return:1 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /boot/vmlinux for symbols Open Debuginfo file: /boot/vmlinux Try to find probe point from debuginfo. 
Matched function: do_open [2d0c7d8] Probe point found: do_open+0 Matched function: do_open [35d76b5] found inline addr: 0xc04ba984 Failed to find "do_open%return", because do_open is an inlined function and has no return point. An error occurred in debuginfo analysis (-22). Trying to use symbols. Opening /sys/kernel/debug/tracing//README write=0 Opening /sys/kernel/debug/tracing//kprobe_events write=1 Writing event: r:probe/do_open _text+4469712 Writing event: r:probe/do_open_1 _text+4956248 Added new events: probe:do_open(on do_open%return) probe:do_open_1 (on do_open%return) You can now use it in all perf tools, such as: perf record -e probe:do_open_1 -aR sleep 1 naveen@ubuntu:~/linux/tools/perf$ sudo cat /sys/kernel/debug/kprobes/list c0041370 k kretprobe_trampoline+0x0[OPTIMIZED] c04433d0 r do_open+0x0[DISABLED] c04ba058 r do_open+0x8[DISABLED] Signed-off-by: Naveen N. Rao Acked-by: Masami Hiramatsu Cc: Ananth N Mavinakayanahalli Cc: Michael Ellerman Cc: Steven Rostedt Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/496ef9f33c1ab16286ece9dd62aa672807aef91c.1488961018.git.naveen.n@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo --- tools/perf/util/probe-event.c | 12 +--- tools/perf/util/probe-file.c | 7 +++ tools/perf/util/probe-file.h | 1 + 3 files changed, 13 insertions(+), 7 deletions(-) diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c index 28fb62c32678..c9bdc9ded0c3 100644 --- a/tools/perf/util/probe-event.c +++ b/tools/perf/util/probe-event.c @@ -757,7 +757,9 @@ post_process_kernel_probe_trace_events(struct probe_trace_event *tevs, } for (i = 0; i < ntevs; i++) { - if (!tevs[i].point.address || tevs[i].point.retprobe) + if (!tevs[i].point.address) + continue; + if (tevs[i].point.retprobe && !kretprobe_offset_is_supported()) continue; /* If we found a wrong one, mark it by NULL symbol */ if (kprobe_warn_out_range(tevs[i].point.symbol, @@ -1528,11 +1530,6 @@ static int parse_perf_probe_point(char *arg, 
struct perf_probe_event *pev) return -EINVAL; } - if (pp->retprobe && !pp->function) { - semantic_error("Return probe requires an entry function.\n"); - return -EINVAL; - } - if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) { semantic_error("Offset/Line/Lazy pattern can't be used with " "return probe.\n"); @@ -2841,7 +2838,8 @@ static int find_probe_trace_events_from_map(struct perf_probe_event *pev, } /* Note that the symbols in t
[GIT PULL 00/19] perf/core improvements and fixes
Hi Ingo,

Please consider pulling,

- Arnaldo

Test results at the end of this message, as usual.

The following changes since commit 84e5b549214f2160c12318aac549de85f600c79a:

  Merge tag 'perf-core-for-mingo-4.11-20170306' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2017-03-07 08:14:14 +0100)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo-4.12-20170314

for you to fetch changes up to 5f6bee34707973ea7879a7857fd63ddccc92fff3:

  kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL (2017-03-14 15:17:40 -0300)

perf/core improvements and fixes:

New features:

- Add PERF_RECORD_NAMESPACES so that the kernel can record information required to associate samples to namespaces, helping in container problem characterization. Now 'perf record' has a --namespaces option to ask for such info, and when present, it can be used, initially, via a new sort order, 'cgroup_id', allowing histogram entry bucketization by a (device, inode) based cgroup identifier (Hari Bathini)

- Add --next option to 'perf sched timehist', showing what is the next thread to run (Brendan Gregg)

Fixes:

- Fix segfault with basic block 'cycles' sort dimension (Changbin Du)

- Add c2c to command-list.txt, making it appear in the 'perf help' output (Changbin Du)

- Fix zeroing of 'abs_path' variable in the perf hists browser switch file code (Changbin Du)

- Hide tips messages when -q/--quiet is given to 'perf report' (Namhyung Kim)

Infrastructure:

- Use ref_reloc_sym + offset to setup kretprobes (Naveen Rao)

- Ignore generated files pmu-events/{jevents,pmu-events.c} for git (Changbin Du)

Documentation:

- Document +field style argument support for --field option (Changbin Du)

- Clarify 'perf c2c --stats' help message (Namhyung Kim)

Signed-off-by: Arnaldo Carvalho de Melo

Brendan Gregg (1):
  perf sched timehist: Add --next option

Changbin Du (5):
  perf tools: Missing c2c command in command-list
  perf
tools: Ignore generated files pmu-events/{jevents,pmu-events.c} for git perf sort: Fix segfault with basic block 'cycles' sort dimension perf report: Document +field style argument support for --field option perf hists browser: Fix typo in function switch_data_file Hari Bathini (5): perf: Add PERF_RECORD_NAMESPACES to include namespaces related info perf tools: Add PERF_RECORD_NAMESPACES to include namespaces related info perf record: Synthesize namespace events for current processes perf script: Add script print support for namespace events perf tools: Add 'cgroup_id' sort order keyword Namhyung Kim (3): perf report: Hide tip message when -q option is given perf c2c: Clarify help message of --stats option perf c2c: Fix display bug when using pipe Naveen N. Rao (5): perf probe: Factor out the ftrace README scanning perf kretprobes: Offset from reloc_sym if kernel supports it perf powerpc: Choose local entry point with kretprobes doc: trace/kprobes: add information about NOKPROBE_SYMBOL kprobes: Convert kprobe_exceptions_notify to use NOKPROBE_SYMBOL Documentation/trace/kprobetrace.txt | 5 +- include/linux/perf_event.h | 2 + include/uapi/linux/perf_event.h | 32 +- kernel/events/core.c| 139 ++ kernel/fork.c | 2 + kernel/kprobes.c| 5 +- kernel/nsproxy.c| 3 + tools/include/uapi/linux/perf_event.h | 32 +- tools/perf/.gitignore | 2 + tools/perf/Documentation/perf-record.txt| 3 + tools/perf/Documentation/perf-report.txt| 7 +- tools/perf/Documentation/perf-sched.txt | 4 + tools/perf/Documentation/perf-script.txt| 3 + tools/perf/arch/powerpc/util/sym-handling.c | 14 ++- tools/perf/builtin-annotate.c | 1 + tools/perf/builtin-c2c.c| 4 +- tools/perf/builtin-diff.c | 1 + tools/perf/builtin-inject.c | 13 +++ tools/perf/builtin-kmem.c | 1 + tools/perf/builtin-kvm.c| 2 + tools/perf/builtin-lock.c | 1 + tools/perf/builtin-mem.c| 1 + tools/perf/builtin-record.c | 35 ++- tools/perf/builtin-report.c | 4 +- tools/perf/builtin-sched.c | 26 - tools/perf/builtin-script.c | 41 
tools/perf/builtin-trace.c | 3 +- tools/per
Re: Linux 4.11: Reported regressions as of Tuesday, 2017-03-14
On Tue, Mar 14, 2017 at 3:02 AM, Thorsten Leemhuis wrote: > > Desc: PowerPC crashes on boot, bisected to commit 5657933dbb6e > Repo: 2017-03-02 > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1343553.html > Stat: 2017-03-09 > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349568.html > Note: patch hopefully heading mainline Should already be mainline as commit 46f401c4297a ("powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops") but didn't make rc2. Linus
Re: syscall statx not implemented on powerpc
On 15/03/17 00:57, Michael Ellerman wrote:
> Chris Packham writes:
>> On 13/03/17 21:52, Chandan Rajendra wrote:
>>> On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote:
>>>> Hi, I've just attempted to build a powerpc kernel from 4.11-rc2 using a custom defconfig (available on request) and I'm hitting the following error in the early stages of compilation.
>>>>
>>>> :1325:2: error: #warning syscall statx not implemented [-Werror=cpp]
>>>>
>>>> Same thing seems to happen with mpc85xx_basic_defconfig. I don't actually need this syscall so I'd be happy to turn something off to get things building. I did a quick search and couldn't see anything on linuxppc-dev but google keeps correcting "statx" to "stats" so I could have missed it.
>>>
>>> The upstream commit (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f) that introduces the statx syscall provides a test program. I will wire-up the syscall on ppc64, run that test program and post the patch if the test program works well.
>>>
>>
>> Thanks, I'd be happy to test a patch here.
>>
>> In the meantime I worked around the build issue by adding __IGNORE_statx to checksyscalls.sh.
>
> Is it actually breaking the build? It's meant to be a warning, and that's all I get.
>
> cheers
>

It breaks my builds for ppc and arm. We do have KCFLAGS=-Werror in our environment but KCPPFLAGS= is not set so I'm not sure why it's stopping compilation.
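A quick way to see why KCFLAGS=-Werror turns this #warning into a build failure (this assumes a host C compiler is available as `cc`; paths are arbitrary):

```shell
# A '#warning' directive is non-fatal by default, but -Werror promotes
# it to an error -- which is why a build that is warning-only for
# everyone else breaks in an environment passing KCFLAGS=-Werror.
cat > /tmp/statx_warn.c <<'EOF'
#warning syscall statx not implemented
int main(void) { return 0; }
EOF

cc /tmp/statx_warn.c -o /tmp/statx_warn && echo "default: builds with a warning"
if ! cc -Werror /tmp/statx_warn.c -o /tmp/statx_warn 2>/dev/null; then
    echo "-Werror: build fails"
fi
```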
Re: [PATCH kernel v8 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table
On Fri, 10 Mar 2017 14:53:31 +1100 Alexey Kardashevskiy wrote: > So far iommu_table obejcts were only used in virtual mode and had > a single owner. We are going to change this by implementing in-kernel > acceleration of DMA mapping requests. The proposed acceleration > will handle requests in real mode and KVM will keep references to tables. > > This adds a kref to iommu_table and defines new helpers to update it. > This replaces iommu_free_table() with iommu_table_put() and makes > iommu_free_table() static. iommu_table_get() is not used in this patch > but it will be in the following patch. > > Since this touches prototypes, this also removes @node_name parameter as > it has never been really useful on powernv and carrying it for > the pseries platform code to iommu_free_table() seems to be quite > useless as well. > > This should cause no behavioral change. > > Signed-off-by: Alexey Kardashevskiy > Reviewed-by: David Gibson > --- > arch/powerpc/include/asm/iommu.h | 5 +++-- > arch/powerpc/kernel/iommu.c | 24 +++- > arch/powerpc/platforms/powernv/pci-ioda.c | 14 +++--- > arch/powerpc/platforms/powernv/pci.c | 1 + > arch/powerpc/platforms/pseries/iommu.c| 3 ++- > arch/powerpc/platforms/pseries/vio.c | 2 +- > drivers/vfio/vfio_iommu_spapr_tce.c | 2 +- > 7 files changed, 34 insertions(+), 17 deletions(-) > > diff --git a/arch/powerpc/include/asm/iommu.h > b/arch/powerpc/include/asm/iommu.h > index 4554699aec02..82e77ebf85f4 100644 > --- a/arch/powerpc/include/asm/iommu.h > +++ b/arch/powerpc/include/asm/iommu.h > @@ -119,6 +119,7 @@ struct iommu_table { > struct list_head it_group_list;/* List of iommu_table_group_link */ > unsigned long *it_userspace; /* userspace view of the table */ > struct iommu_table_ops *it_ops; > + struct krefit_kref; > }; > > #define IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry) \ > @@ -151,8 +152,8 @@ static inline void *get_iommu_table_base(struct device > *dev) > > extern int dma_iommu_dma_supported(struct device *dev, u64 mask); > > -/* 
Frees table for an individual device node */ > -extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); > +extern void iommu_table_get(struct iommu_table *tbl); > +extern void iommu_table_put(struct iommu_table *tbl); > > /* Initializes an iommu_table based in values set in the passed-in > * structure > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c > index bc142d87130f..d02b8d22fb50 100644 > --- a/arch/powerpc/kernel/iommu.c > +++ b/arch/powerpc/kernel/iommu.c > @@ -711,13 +711,13 @@ struct iommu_table *iommu_init_table(struct iommu_table > *tbl, int nid) > return tbl; > } > > -void iommu_free_table(struct iommu_table *tbl, const char *node_name) > +static void iommu_table_free(struct kref *kref) > { > unsigned long bitmap_sz; > unsigned int order; > + struct iommu_table *tbl; > > - if (!tbl) > - return; > + tbl = container_of(kref, struct iommu_table, it_kref); > > if (tbl->it_ops->free) > tbl->it_ops->free(tbl); > @@ -736,7 +736,7 @@ void iommu_free_table(struct iommu_table *tbl, const char > *node_name) > > /* verify that table contains no entries */ > if (!bitmap_empty(tbl->it_map, tbl->it_size)) > - pr_warn("%s: Unexpected TCEs for %s\n", __func__, node_name); > + pr_warn("%s: Unexpected TCEs\n", __func__); > > /* calculate bitmap size in bytes */ > bitmap_sz = BITS_TO_LONGS(tbl->it_size) * sizeof(unsigned long); > @@ -748,7 +748,21 @@ void iommu_free_table(struct iommu_table *tbl, const > char *node_name) > /* free table */ > kfree(tbl); > } > -EXPORT_SYMBOL_GPL(iommu_free_table); > + > +void iommu_table_get(struct iommu_table *tbl) > +{ > + kref_get(&tbl->it_kref); > +} > +EXPORT_SYMBOL_GPL(iommu_table_get); > + > +void iommu_table_put(struct iommu_table *tbl) > +{ > + if (!tbl) > + return; > + > + kref_put(&tbl->it_kref, iommu_table_free); > +} > +EXPORT_SYMBOL_GPL(iommu_table_put); > Maybe an opportunity for less cringe worthy generic names exported from arch code. iommu_tce_table_get/put perhaps? 
> /* Creates TCEs for a user provided buffer. The user buffer must be > * contiguous real kernel storage (not vmalloc). The address passed here > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > b/arch/powerpc/platforms/powernv/pci-ioda.c > index 7916d0cb05fe..ec3e565de511 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -1425,7 +1425,7 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev > *dev, struct pnv_ioda_pe > iommu_group_put(pe->table_group.group); > BUG_ON(pe->table_group.group); > } > - iommu_free_table(tbl, of_node_full_name(dev->dev.of_node)); > + iommu_table_put(tbl);
Re: [FIX PATCH v1] powerpc/pseries: Fix reference count leak during CPU unplug
On 03/13/2017 03:29 AM, Bharata B Rao wrote: > On Thu, Mar 09, 2017 at 01:34:00PM -0800, Tyrel Datwyler wrote: >> On 03/08/2017 08:37 PM, Bharata B Rao wrote: >>> The following warning is seen when a CPU is hot unplugged on a PowerKVM >>> guest: >> >> Is this the case with cpus present at boot? What about cpus hotplugged >> after boot? > > I have observed this for CPUs that are hotplugged. If removing a cpu present at boot works, but removing one that has been hotplugged after boot reproduces the problem it is more likely the case that we failed to take a reference during hotplug or released a reference we shouldn't have. I'd have to go look at the hot add path. > >> >> My suspicion is that the refcount was wrong to begin with. See my >> comments below. The use of the of_node_put() calls is correct as in each >> case we incremented the ref count earlier in the same function. >> >>> >>> refcount_t: underflow; use-after-free. >>> [ cut here ] >>> WARNING: CPU: 0 PID: 53 at lib/refcount.c:128 >>> refcount_sub_and_test+0xd8/0xf0 >>> Modules linked in: >>> CPU: 0 PID: 53 Comm: kworker/u510:1 Not tainted 4.11.0-rc1 #3 >>> Workqueue: pseries hotplug workque pseries_hp_work_fn >>> task: c000fb475000 task.stack: c000fb81c000 >>> NIP: c06f0808 LR: c06f0804 CTR: c07b98c0 >>> REGS: c000fb81f710 TRAP: 0700 Not tainted (4.11.0-rc1) >>> MSR: 8282b033 >>> CR: 4800 XER: 2000 >>> CFAR: c0c438e0 SOFTE: 1 >>> GPR00: c06f0804 c000fb81f990 c1573b00 0026 >>> GPR04: 016c 667265652e0d0a73 652d61667465722d >>> GPR08: 0007 0007 0001 0006 >>> GPR12: 2200 cff4 c010c578 c001f11b9f40 >>> GPR16: c001fe0312a8 c001fe031078 c001fe031020 0001 >>> GPR20: c1454808 fef7 >>> GPR24: c001f1677648 >>> GPR28: 1008 c0e4d3d8 c001eaae07d8 >>> NIP [c06f0808] refcount_sub_and_test+0xd8/0xf0 >>> LR [c06f0804] refcount_sub_and_test+0xd4/0xf0 >>> Call Trace: >>> [c000fb81f990] [c06f0804] refcount_sub_and_test+0xd4/0xf0 >>> (unreliable) >>> [c000fb81f9f0] [c06d04b4] kobject_put+0x44/0x2a0 >>> [c000fb81fa70] 
[c09d5284] of_node_put+0x34/0x50 >>> [c000fb81faa0] [c00aceb8] dlpar_cpu_remove_by_index+0x108/0x130 >>> [c000fb81fb30] [c00ae128] dlpar_cpu+0x78/0x550 >>> [c000fb81fbe0] [c00a7b40] handle_dlpar_errorlog+0xc0/0x160 >>> [c000fb81fc50] [c00a7c74] pseries_hp_work_fn+0x94/0xa0 >>> [c000fb81fc80] [c0102cec] process_one_work+0x23c/0x540 >>> [c000fb81fd20] [c010309c] worker_thread+0xac/0x620 >>> [c000fb81fdc0] [c010c6c4] kthread+0x154/0x1a0 >>> [c000fb81fe30] [c000bbe0] ret_from_kernel_thread+0x5c/0x7c >>> >>> Fix this by ensuring that of_node_put() is called only from the >>> error path in dlpar_cpu_remove_by_index(). In the normal path, >>> of_node_put() happens as part of dlpar_detach_node(). >>> >>> Signed-off-by: Bharata B Rao >>> Cc: Nathan Fontenot >>> --- >>> Changes in v1: >>> - Fixed the refcount problem in the userspace driven unplug path >>> in addition to in-kernel unplug path. (Sachin Sant) >>> >>> v0: https://patchwork.ozlabs.org/patch/736547/ >>> >>> arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 >>> 1 file changed, 8 insertions(+), 4 deletions(-) >>> >>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c >>> b/arch/powerpc/platforms/pseries/hotplug-cpu.c >>> index 7bc0e91..c5ed510 100644 >>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c >>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c >>> @@ -619,7 +619,8 @@ static int dlpar_cpu_remove_by_index(u32 drc_index) >>> } >>> >>> rc = dlpar_cpu_remove(dn, drc_index); >>> - of_node_put(dn); >>> + if (rc) >>> + of_node_put(dn); >> >> I think there is another issue at play here because this is wrong. >> Regardless of whether the dlpar_cpu_remove() succeeds or fails we still >> need of_node_put() for both cases because we incremented the ref count >> earlier in this function with a call to cpu_drc_index_to_dn() call. That >> call doesn't, but shoul, document that it returns a device_node with >> incremented refcount. 
>> >>> return rc; >>> } >>> >>> @@ -856,9 +857,12 @@ static ssize_t dlpar_cpu_release(const char *buf, >>> size_t count) >>> } >>> >>> rc = dlpar_cpu_remove(dn, drc_index); >>> - of_node_put(dn); >>> - >>> - return rc ? rc : count; >>> + if (rc) { >>> + of_node_put(dn); >>> + return rc; >>> + } else { >>> + return count; >>> + } >> >> Same comment as above. The call earlier in the function to >> of_find_node_by_path() returne
Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On Fri, 10 Mar 2017 14:53:37 +1100 Alexey Kardashevskiy wrote: > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO > without passing them to user space which saves time on switching > to user space and back. > > This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. > KVM tries to handle a TCE request in the real mode, if failed > it passes the request to the virtual mode to complete the operation. > If it a virtual mode handler fails, the request is passed to > the user space; this is not expected to happen though. > > To avoid dealing with page use counters (which is tricky in real mode), > this only accelerates SPAPR TCE IOMMU v2 clients which are required > to pre-register the userspace memory. The very first TCE request will > be handled in the VFIO SPAPR TCE driver anyway as the userspace view > of the TCE table (iommu_table::it_userspace) is not allocated till > the very first mapping happens and we cannot call vmalloc in real mode. > > If we fail to update a hardware IOMMU table unexpected reason, we just > clear it and move on as there is nothing really we can do about it - > for example, if we hot plug a VFIO device to a guest, existing TCE tables > will be mirrored automatically to the hardware and there is no interface > to report to the guest about possible failures. > > This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to > the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd > and associates a physical IOMMU table with the SPAPR TCE table (which > is a guest view of the hardware IOMMU table). The iommu_table object > is cached and referenced so we do not have to look up for it in real mode. 
> > This does not implement the UNSET counterpart as there is no use for it - > once the acceleration is enabled, the existing userspace won't > disable it unless a VFIO container is destroyed; this adds necessary > cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. > > As this creates a descriptor per IOMMU table-LIOBN couple (called > kvmppc_spapr_tce_iommu_table), it is possible to have several > descriptors with the same iommu_table (hardware IOMMU table) attached > to the same LIOBN; we do not remove duplicates though as > iommu_table_ops::exchange not just update a TCE entry (which is > shared among IOMMU groups) but also invalidates the TCE cache > (one per IOMMU group). > > This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user > space. > > This adds real mode version of WARN_ON_ONCE() as the generic version > causes problems with rcu_sched. Since we testing what vmalloc_to_phys() > returns in the code, this also adds a check for already existing > vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect(). > > This finally makes use of vfio_external_user_iommu_id() which was > introduced quite some time ago and was considered for removal. > > Tests show that this patch increases transmission speed from 220MB/s > to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card). 
> > Signed-off-by: Alexey Kardashevskiy > --- > Changes: > v8: > * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed > to handle them > * changed vmalloc_to_phys() callers to return H_HARDWARE > * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD > and added a comment about this in the code > * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE > and do WARN_ON > * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to > have all vmalloc_to_phys() callsites covered > > v7: > * added realmode-friendly WARN_ON_ONCE_RM > > v6: > * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map() > * moved kvmppc_gpa_to_ua() to TCE validation > > v5: > * changed error codes in multiple places > * added bunch of WARN_ON() in places which should not really happen > * adde a check that an iommu table is not attached already to LIOBN > * dropped explicit calls to iommu_tce_clear_param_check/ > iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate > call them anyway (since the previous patch) > * if we fail to update a hardware IOMMU table for unexpected reason, > this just clears the entry > > v4: > * added note to the commit log about allowing multiple updates of > the same IOMMU table; > * instead of checking for if any memory was preregistered, this > returns H_TOO_HARD if a specific page was not; > * fixed comments from v3 about error handling in many places; > * simplified TCE handlers and merged IOMMU parts inline - for example, > there used to be kvmppc_h_put_tce_iommu(), now it is merged into > kvmppc_h_put_tce(); this allows to check IOBA boundaries against > the first attached table only (makes the code simpler); > > v3: > * simplified not to use VFIO group notifiers > * reworked cleanup, should be cleaner/simpler now > > v2: > * reworked to use new VFIO notifiers > * now same iommu
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Mon, 13 Mar 2017, Andy Lutomirski wrote: > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal > wrote: > > On Mon, 13 Mar 2017, Andy Lutomirski wrote: > >> This sounds rather complicated. Getting TLB flushing right seems > >> tricky. Why not just map the same thing into multiple mms? > > > > This is exactly what happens at the end. The memory region that is > > described by the > > VAS segment will be mapped in the ASes that use the segment. > > So why is this kernel feature better than just doing MAP_SHARED > manually in userspace? One advantage of VAS segments is that they can be globally queried by user programs which means that VAS segments can be shared by applications that not necessarily have to be related. If I am not mistaken, MAP_SHARED of pure in memory data will only work if the tasks that share the memory region are related (aka. have a common parent that initialized the shared mapping). Otherwise, the shared mapping have to be backed by a file. VAS segments on the other side allow sharing of pure in memory data by arbitrary related tasks without the need of a file. This becomes especially interesting if one combines VAS segments with non-volatile memory since one can keep data structures in the NVM and still be able to share them between multiple tasks. > >> Ick. Please don't do this. Can we please keep an mm as just an mm > >> and not make it look magically different depending on which process > >> maps it? If you need a trampoline (which you do, of course), just > >> write a trampoline in regular user code and map it manually. > > > > Did I understand you correctly that you are proposing that the switching > > thread > > should make sure by itself that its code, stack, … memory regions are > > properly setup > > in the new AS before/after switching into it? I think, this would make > > using first > > class virtual address spaces much more difficult for user applications to > > the extend > > that I am not even sure if they can be used at all. 
At the moment, > > switching into a > > VAS is a very simple operation for an application because the kernel will > > just simply > > do the right thing. > > Yes. I think that having the same mm_struct look different from > different tasks is problematic. Getting it right in the arch code is > going to be nasty. The heuristics of what to share are also tough -- > why would text + data + stack or whatever you're doing be adequate? > What if you're in a thread? What if two tasks have their stacks in > the same place? The different ASes that a task now can have when it uses first class virtual address spaces are not realized in the kernel by using only one mm_struct per task that just looks differently but by using multiple mm_structs - one for each AS that the task can execute in. When a task attaches a first class virtual address space to itself to be able to use another AS, the kernel adds a temporary mm_struct to this task that contains the mappings of the first class virtual address space and the one shared with the task's original AS. If a thread now wants to switch into this attached first class virtual address space the kernel only changes the 'mm' and 'active_mm' pointers in the task_struct of the thread to the temporary mm_struct and performs the corresponding mm_switch operation. The original mm_struct of the thread will not be changed. Accordingly, I do not magically make mm_structs look differently depending on the task that uses it, but create temporary mm_structs that only contain mappings to the same memory regions. I agree that finding a good heuristics of what to share is difficult. At the moment, all memory regions that are available in the task's original AS will also be available when a thread switches into an attached first class virtual address space (aka. are shared). That means that VAS can mainly be used to extend the AS of a task in the current state of the implementation. 
The reason why I implemented the sharing in this way is that I didn't want to break shared libraries. If I only share code+heap+stack, shared libraries would not work anymore after switching into a VAS. > I could imagine something like a sigaltstack() mode that lets you set > a signal up to also switch mm could be useful. This is a very interesting idea. I will keep it in mind for future use cases of multiple virtual address spaces per task. Thanks Till
Re: [RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'
On Tue, 14 Mar 2017, David Laight wrote: > From: Linuxppc-dev Till Smejkal > > Sent: 13 March 2017 22:14 > > The only way until now to create a new memory map was via the exported > > function 'mm_alloc'. Unfortunately, this function not only allocates a new > > memory map, but also completely initializes it. However, with the > > introduction of first class virtual address spaces, some initialization > > steps done in 'mm_alloc' are not applicable to the memory maps needed for > > this feature and hence would lead to errors in the kernel code. > > > > Instead of introducing a new function that can allocate and initialize > > memory maps for first class virtual address spaces and potentially > > duplicate some code, I decided to split the mm_alloc function as well as > > the 'mm_init' function that it uses. > > > > Now there are four functions exported instead of only one. The new > > 'mm_alloc' function only allocates a new mm_struct and zeros it out. If one > > want to have the old behavior of mm_alloc one can use the newly introduced > > function 'mm_alloc_and_setup' which not only allocates a new mm_struct but > > also fully initializes it. > ... > > That looks like bugs waiting to happen. > You need unchanged code to fail to compile. Thank you for this hint. I can give the new mm_alloc function a different name so that code that uses the *old* mm_alloc function will fail to compile. I just reused the old name when I wrote the code, because mm_alloc was only used in very few locations in the kernel (2 times in the whole kernel source) which made identifying and changing them very easy. I also don't think that there will be many users in the kernel for mm_alloc in the future because it is a relatively low level data structure. But if it is better to use a different name for the new function, I am very happy to change this. Till
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On 3/14/2017 12:12 PM, Till Smejkal wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal wrote:
> > > On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> > > > This sounds rather complicated. Getting TLB flushing right seems
> > > > tricky. Why not just map the same thing into multiple mms?
> > > This is exactly what happens at the end. The memory region that is
> > > described by the VAS segment will be mapped in the ASes that use the
> > > segment.
> > So why is this kernel feature better than just doing MAP_SHARED
> > manually in userspace?
> One advantage of VAS segments is that they can be globally queried by
> user programs which means that VAS segments can be shared by
> applications that not necessarily have to be related. If I am not
> mistaken, MAP_SHARED of pure in memory data will only work if the tasks
> that share the memory region are related (aka. have a common parent
> that initialized the shared mapping). Otherwise, the shared mapping
> have to be backed by a file.

True, but why is this bad? The shared mapping will be memory resident regardless, even if backed by a file (unless swapped out under heavy memory pressure, but arguably that's a feature anyway). More importantly, having a file name is a simple and consistent way of identifying such shared memory segments.

With a little work, you can also arrange to map such files into memory at a fixed address in all participating processes, thus making internal pointers work correctly.

> VAS segments on the other side allow sharing of pure in memory data by
> arbitrary related tasks without the need of a file. This becomes
> especially interesting if one combines VAS segments with non-volatile
> memory since one can keep data structures in the NVM and still be able
> to share them between multiple tasks.

I am not fully up to speed on NV/pmem stuff, but isn't that exactly what the DAX mode is supposed to allow you to do? If so, isn't sharing a mapped file on a DAX filesystem on top of pmem equivalent to what you're proposing?
-- Chris Metcalf, Mellanox Technologies http://www.mellanox.com
Re: [RFC PATCH 00/13] Introduce first class virtual address spaces
On Tue, 14 Mar 2017, Chris Metcalf wrote: > On 3/14/2017 12:12 PM, Till Smejkal wrote: > > On Mon, 13 Mar 2017, Andy Lutomirski wrote: > > > On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal > > > wrote: > > > > On Mon, 13 Mar 2017, Andy Lutomirski wrote: > > > > > This sounds rather complicated. Getting TLB flushing right seems > > > > > tricky. Why not just map the same thing into multiple mms? > > > > This is exactly what happens at the end. The memory region that is > > > > described by the > > > > VAS segment will be mapped in the ASes that use the segment. > > > So why is this kernel feature better than just doing MAP_SHARED > > > manually in userspace? > > One advantage of VAS segments is that they can be globally queried by user > > programs > > which means that VAS segments can be shared by applications that not > > necessarily have > > to be related. If I am not mistaken, MAP_SHARED of pure in memory data will > > only work > > if the tasks that share the memory region are related (aka. have a common > > parent that > > initialized the shared mapping). Otherwise, the shared mapping have to be > > backed by a > > file. > > True, but why is this bad? The shared mapping will be memory resident > regardless, even if backed by a file (unless swapped out under heavy > memory pressure, but arguably that's a feature anyway). More importantly, > having a file name is a simple and consistent way of identifying such > shared memory segments. > > With a little work, you can also arrange to map such files into memory > at a fixed address in all participating processes, thus making internal > pointers work correctly. I don't want to say that the interface provided by MAP_SHARED is bad. I am only arguing that VAS segments and the interface that they provide have an advantage over the existing ones in my opinion. However, Matthew Wilcox also suggested in some earlier mail that VAS segments could be exported to user space via a special purpose filesystem. 
This would enable users of VAS segments to also just use some special files to set up the shared memory regions. But since the VAS segment itself already knows where it has to be mapped in the virtual address space of the process, establishing the shared memory region would be very easy for the user. > > VAS segments on the other side allow sharing of pure in memory data by > > arbitrary related tasks without the need of a file. This becomes especially > > interesting if one combines VAS segments with non-volatile memory since one > > can keep > > data structures in the NVM and still be able to share them between multiple > > tasks. > > I am not fully up to speed on NV/pmem stuff, but isn't that exactly what > the DAX mode is supposed to allow you to do? If so, isn't sharing a > mapped file on a DAX filesystem on top of pmem equivalent to what > you're proposing? If I read the documentation on DAX filesystems correctly, it is indeed possible to use them to create files that live purely in NVM. I wasn't fully aware of this feature. Thanks for the pointer. However, the main contribution of this patchset is actually the idea of first class virtual address spaces and that they can be used to allow processes to have multiple different views on the system's main memory. For us, VAS segments were another logical step in the same direction (from first class virtual address spaces to first class address space segments). However, if there is already functionality in the Linux kernel to achieve the exact same behavior, there is no real need to add VAS segments. I will continue thinking about them and either find a different situation where the currently available interface is not sufficient/too complicated or drop VAS segments from future versions of the patch set. Till
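For comparison, the file-backed MAP_SHARED route the thread keeps coming back to is short in practice. The sketch below maps the same temporary file twice, with the two mappings standing in for two unrelated processes that each opened the file by name (illustrative only; error handling is minimal):

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one file twice with MAP_SHARED and check that a write through one
 * mapping is visible through the other. The file name is the rendezvous
 * point that lets unrelated tasks find the same memory. */
int share_via_file(void)
{
    char path[] = "/tmp/vas-demo-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                 /* name only needed for rendezvous */
    if (ftruncate(fd, 4096) < 0)
        return -1;

    int *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return -1;

    a[0] = 42;                    /* write through one view... */
    int seen = b[0];              /* ...read back through the other */

    munmap(a, 4096);
    munmap(b, 4096);
    close(fd);
    return seen;
}
```

On a DAX filesystem backed by pmem, the same few calls would hand both mappings direct access to the persistent memory, which is the equivalence being argued in the thread.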
Re: syscall statx not implemented on powerpc
Chris Packham writes:
> On 15/03/17 00:57, Michael Ellerman wrote:
>> Chris Packham writes:
>>> On 13/03/17 21:52, Chandan Rajendra wrote:
>>>> On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote:
>>>>> I've just attempted to build a powerpc kernel from 4.11-rc2 using a
>>>>> custom defconfig (available on request) and I'm hitting the following
>>>>> error in the early stages of compilation.
>>>>>
>>>>> :1325:2: error: #warning syscall statx not implemented [-Werror=cpp]
>>>>>
>>>>> Same thing seems to happen with mpc85xx_basic_defconfig.
>>>>>
>>>>> I don't actually need this syscall so I'd be happy to turn something off
>>>>> to get things building. I did a quick search and couldn't see anything
>>>>> on linuxppc-dev but google keeps correcting "statx" to "stats" so I
>>>>> could have missed it.
>>>>
>>>> The upstream commit (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f) that introduces the statx syscall provides a test program. I will wire up the syscall on ppc64, run that test program and post the patch if the test program works well.
>>>
>>> Thanks, I'd be happy to test a patch here.
>>>
>>> In the meantime I worked around the build issue by adding __IGNORE_statx
>>> to checksyscalls.sh.
>>
>> Is it actually breaking the build? It's meant to be a warning, and
>> that's all I get.
>
> It breaks my builds for ppc and arm. We do have KCFLAGS=-Werror in our
> environment but KCPPFLAGS= is not set so I'm not sure why it's stopping
> compilation.

OK, I guess that must be it. We turn on -Werror for all of arch/powerpc by default (unless CONFIG_PPC_DISABLE_WERROR=y), but the syscall checking script is not under arch/powerpc so shouldn't be affected by that.

cheers
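For anyone wanting to poke at the syscall from user space before their libc grows a wrapper (glibc only added statx() in 2.28), it can be invoked through syscall(2) directly. This is a hypothetical probe, not the test program referenced in the upstream commit; it assumes kernel headers new enough to provide __NR_statx and struct statx (4.11+):

```c
#define _GNU_SOURCE
#include <fcntl.h>          /* AT_FDCWD */
#include <linux/stat.h>     /* struct statx, STATX_SIZE (4.11+ headers) */
#include <sys/syscall.h>    /* __NR_statx */
#include <unistd.h>         /* syscall() */

/* Call statx directly via the raw syscall; return the file size, or -1
 * on error (including ENOSYS on kernels without statx wired up). */
long long statx_size(const char *path)
{
    struct statx stx;

    if (syscall(__NR_statx, AT_FDCWD, path, 0, STATX_SIZE, &stx) != 0)
        return -1;
    return (long long)stx.stx_size;
}
```

On an architecture where the syscall is not wired up yet, the call fails with ENOSYS rather than crashing, so the probe doubles as a quick "is statx implemented here" check.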
Re: Linux 4.11: Reported regressions as of Tuesday, 2017-03-14
Thorsten Leemhuis writes: > Hi! Find below my first regression report for Linux 4.11. It lists 9 > regressions I'm currently aware of. > > As always: Are you aware of any other regressions? Then please let me > know (simply CC regressi...@leemhuis.info). And please tell me if there > is anything in the report that shouldn't be there. > > Ciao, Thorsten > > P.S.: Sorry, I didn't compile any regression reports for 4.10: I > didn't find time due to various reasons (vacation, a cold, regular > work, and attending two conferences). Reminder: compiling these reports > has nothing to do with my paid job and I'm doing it in my spare time > just because I think someone should do it. > > P.P.S: Dear Gmane mailing list archive web interface, please come > back soon. I really really miss you. It hurts every single day. > Don't you miss me, too? ;-) > > == Current regressions == > > Desc: PowerPC crashes on boot, bisected to commit 5657933dbb6e > Repo: 2017-03-02 > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1343553.html > Stat: 2017-03-09 > https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1349568.html > Note: patch hopefully heading mainline Now fixed in Linus' tree by commit 46f401c4297a. cheers
Re: [PATCH v2 16/23] MAINTAINERS: Add file patterns for powerpc device tree bindings
Geert Uytterhoeven writes: > Submitters of device tree binding documentation may forget to CC > the subsystem maintainer if this is missing. > > Signed-off-by: Geert Uytterhoeven > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Michael Ellerman > Cc: linuxppc-dev@lists.ozlabs.org > --- > Please apply this patch directly if you want to be involved in device > tree binding documentation for your subsystem. > > v2: > - Rebased on top of commit a42715830d552d7c ("MAINTAINERS: Remove > powerpc's "opal" pattern match), which just added "powerpc/opal", > while obviously the whole "powerpc" hierarchy is of interest. > > Impact on next-20170310: Actual diff ignoring ordering etc: +Benjamin Herrenschmidt (supporter:LINUX FOR POWERPC (32-BIT AND 64-BIT)) +Paul Mackerras (supporter:LINUX FOR POWERPC (32-BIT AND 64-BIT)) +linuxppc-dev@lists.ozlabs.org (open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)) -Scott Wood (commit_signer:5/11=45%) -Zhao Qiang (commit_signer:4/11=36%,authored:4/11=36%) -Christian Lamparter (commit_signer:1/11=9%) -yangbo lu (authored:1/11=9%) -"Otto Kekäläinen" (authored:1/11=9%) -Chris Packham (authored:1/11=9%) -York Sun (authored:1/11=9%) Which looks bad as all the NXP folks get dropped, but they should be reading linuxppc-dev. So I think I'll merge this, unless anyone disagrees. cheers
Re: [PATCH kernel v8 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
On Tue, Mar 14, 2017 at 03:05:27PM -0600, Alex Williamson wrote: > On Fri, 10 Mar 2017 14:53:37 +1100 > Alexey Kardashevskiy wrote: > > > This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT > > and H_STUFF_TCE requests targeted at an IOMMU TCE table used for VFIO > > without passing them to user space, which saves time on switching > > to user space and back. > > > > This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM. > > KVM tries to handle a TCE request in real mode; if that fails, > > it passes the request to the virtual mode to complete the operation. > > If the virtual mode handler fails, the request is passed to > > user space; this is not expected to happen though. > > > > To avoid dealing with page use counters (which is tricky in real mode), > > this only accelerates SPAPR TCE IOMMU v2 clients which are required > > to pre-register the userspace memory. The very first TCE request will > > be handled in the VFIO SPAPR TCE driver anyway as the userspace view > > of the TCE table (iommu_table::it_userspace) is not allocated till > > the very first mapping happens and we cannot call vmalloc in real mode. > > > > If we fail to update a hardware IOMMU table for an unexpected reason, we just > > clear it and move on as there is nothing really we can do about it - > > for example, if we hot plug a VFIO device to a guest, existing TCE tables > > will be mirrored automatically to the hardware and there is no interface > > to report to the guest about possible failures. > > > > This adds a new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to > > the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd > > and associates a physical IOMMU table with the SPAPR TCE table (which > > is a guest view of the hardware IOMMU table). The iommu_table object > > is cached and referenced so we do not have to look it up in real mode. 
> > > > This does not implement the UNSET counterpart as there is no use for it - > > once the acceleration is enabled, the existing userspace won't > > disable it unless a VFIO container is destroyed; this adds necessary > > cleanup to the KVM_DEV_VFIO_GROUP_DEL handler. > > > > As this creates a descriptor per IOMMU table-LIOBN couple (called > > kvmppc_spapr_tce_iommu_table), it is possible to have several > > descriptors with the same iommu_table (hardware IOMMU table) attached > > to the same LIOBN; we do not remove duplicates though as > > iommu_table_ops::exchange does not just update a TCE entry (which is > > shared among IOMMU groups) but also invalidates the TCE cache > > (one per IOMMU group). > > > > This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user > > space. > > > > This adds a real mode version of WARN_ON_ONCE() as the generic version > > causes problems with rcu_sched. Since we are testing what vmalloc_to_phys() > > returns in the code, this also adds a check for already existing > > vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect(). > > > > This finally makes use of vfio_external_user_iommu_id() which was > > introduced quite some time ago and was considered for removal. > > > > Tests show that this patch increases transmission speed from 220MB/s > > to 750..1020MB/s on 10Gb network (Chelsio CXGB3 10Gb ethernet card). 
> > > > Signed-off-by: Alexey Kardashevskiy > > --- > > Changes: > > v8: > > * changed all (!pua) checks to return H_TOO_HARD as ioctl() is supposed > > to handle them > > * changed vmalloc_to_phys() callers to return H_HARDWARE > > * changed real mode iommu_tce_xchg_rm() callers to return H_TOO_HARD > > and added a comment about this in the code > > * changed virtual mode iommu_tce_xchg() callers to return H_HARDWARE > > and do WARN_ON > > * added WARN_ON_ONCE_RM(!rmap) in kvmppc_rm_h_put_tce_indirect() to > > have all vmalloc_to_phys() callsites covered > > > > v7: > > * added realmode-friendly WARN_ON_ONCE_RM > > > > v6: > > * changed handling of errors returned by kvmppc_(rm_)tce_iommu_(un)map() > > * moved kvmppc_gpa_to_ua() to TCE validation > > > > v5: > > * changed error codes in multiple places > > * added a bunch of WARN_ON() in places which should not really happen > > * added a check that an iommu table is not attached already to LIOBN > > * dropped explicit calls to iommu_tce_clear_param_check/ > > iommu_tce_put_param_check as kvmppc_tce_validate/kvmppc_ioba_validate > > call them anyway (since the previous patch) > > * if we fail to update a hardware IOMMU table for an unexpected reason, > > this just clears the entry > > > > v4: > > * added note to the commit log about allowing multiple updates of > > the same IOMMU table; > > * instead of checking if any memory was preregistered, this > > returns H_TOO_HARD if a specific page was not; > > * fixed comments from v3 about error handling in many places; > > * simplified TCE handlers and merged IOMMU parts inline - for example, > > there used to be kvmppc_h_put_tce_iommu(), now it is merged into > > kvmppc_h_put_tce(); this allows
Re: [PATCH] drivers/pcmcia: NO_IRQ removal for electra_cf.c
Arnd Bergmann writes: > On Tue, Mar 14, 2017 at 11:51 AM, Michael Ellerman > wrote: >> Michael Ellerman writes: >> >>> We'd like to eventually remove NO_IRQ on powerpc, so remove usages of it >>> from electra_cf.c which is a powerpc-only driver. >>> >>> Signed-off-by: Michael Ellerman >>> --- >>> drivers/pcmcia/electra_cf.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >> >> Ping anyone? >> >> Or should I merge this via the powerpc tree? > > That's what I would recommend for a powerpc specific pcmcia driver, yes. Suits me. > Looking at the bigger picture of powerpc drivers using NO_IRQ, I also > see these others: > > drivers/ata/pata_mpc52xx.c: if (ata_irq == NO_IRQ) { > drivers/ata/sata_dwc_460ex.c:#ifndef NO_IRQ > drivers/ata/sata_dwc_460ex.c:#define NO_IRQ 0 > drivers/ata/sata_dwc_460ex.c: if (hsdev->dma->irq == NO_IRQ) { > drivers/ata/sata_dwc_460ex.c: if (irq == NO_IRQ) { > drivers/iommu/fsl_pamu.c: if (irq == NO_IRQ) { > drivers/iommu/fsl_pamu.c: if (irq != NO_IRQ) > drivers/media/platform/fsl-viu.c: if (viu_irq == NO_IRQ) { > drivers/mtd/nand/mpc5121_nfc.c: if (prv->irq == NO_IRQ) { > drivers/pcmcia/electra_cf.c:cf->irq = NO_IRQ; > drivers/pcmcia/electra_cf.c:if (cf->irq != NO_IRQ) > drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if > no interrupt is pending. */ > drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ; > drivers/soc/fsl/qe/qe_ic.c:/* Return an interrupt vector or NO_IRQ if > no interrupt is pending. 
*/ > drivers/soc/fsl/qe/qe_ic.c: return NO_IRQ; > drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_low == NO_IRQ) { > drivers/soc/fsl/qe/qe_ic.c: if (qe_ic->virq_high != NO_IRQ && > drivers/spi/spi-mpc52xx.c: if (status && (irq != NO_IRQ)) > drivers/tty/ehv_bytechan.c: if (stdout_irq == NO_IRQ) { > drivers/tty/ehv_bytechan.c: if ((bc->rx_irq == NO_IRQ) || > (bc->tx_irq == NO_IRQ)) { > drivers/tty/serial/cpm_uart/cpm_uart_core.c:if (pinfo->port.irq == > NO_IRQ) { > drivers/uio/uio_fsl_elbc_gpcm.c:if (irq != NO_IRQ) { > drivers/uio/uio_fsl_elbc_gpcm.c:irq = NO_IRQ; > drivers/uio/uio_fsl_elbc_gpcm.c: irq != NO_IRQ ? irq : -1); > drivers/usb/host/ehci-grlib.c: if (irq == NO_IRQ) { > drivers/usb/host/ehci-ppc-of.c: if (irq == NO_IRQ) { > drivers/usb/host/fhci-hcd.c:if (usb_irq == NO_IRQ) { > drivers/usb/host/ohci-ppc-of.c: if (irq == NO_IRQ) { > drivers/usb/host/uhci-grlib.c: if (irq == NO_IRQ) { > drivers/video/fbdev/mb862xx/mb862xxfbdrv.c: if (par->irq == NO_IRQ) { > drivers/virt/fsl_hypervisor.c: if (!handle || (irq == NO_IRQ)) { > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > include/soc/fsl/qe/qe_ic.h: if (cascade_irq == NO_IRQ) > include/soc/fsl/qe/qe_ic.h: if (cascade_irq != NO_IRQ) > > Did you have other pending patches for those? No. I stayed away from anything FSL related as I was under the impression some of them were being ported to arch/arm, which uses -1 for NO_IRQ IIUIC. eg. all of include/soc/fsl and drivers/soc/fsl was moved from arch/powerpc in commit 7aa1aa6ecec2, which said: QE: Move QE from arch/powerpc to drivers/soc ls1 has qe and ls1 has arm cpu. move qe from arch/powerpc to drivers/soc/fsl to adapt to powerpc and arm But looking at the Kconfigs it looks like they're still only selectable on PPC. So that's a bit annoying. 
I'll do patches for everything above that's not drivers/soc or include/soc and hopefully we can hear from someone at NXP on the plans for getting the soc parts enabled on arm. cheers
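The mechanical shape of each conversion is small. Since NO_IRQ is 0 on powerpc while some legacy arm code used -1 (which is why a blanket replacement needs per-driver care), the before/after idiom looks roughly like this, with toy functions rather than code from any driver listed above:

```c
/* powerpc's historical definition; legacy arm code sometimes used -1,
 * so the two NO_IRQ conventions are not interchangeable. */
#define NO_IRQ 0

/* Old style: explicit comparison against NO_IRQ. */
int request_line_old(int irq)
{
    if (irq == NO_IRQ)
        return -1;   /* stand-in for -ENODEV */
    return 0;
}

/* Preferred style: zero itself means "no interrupt". */
int request_line_new(int irq)
{
    if (!irq)
        return -1;
    return 0;
}
```

The behavior is identical where NO_IRQ is 0; the danger cases are drivers that grew their own `#define NO_IRQ -1` (as sata_dwc_460ex.c above does), where the same textual substitution would change the logic.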
Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src
Hi Peter,

Peter Zijlstra writes:
> On Tue, Mar 14, 2017 at 02:31:51PM +0530, Madhavan Srinivasan wrote:
>
>> >Huh? PPC hasn't yet implemented this? Then why are you fixing it?
>>
>> yes, PPC hasn't implemented this (until now).
>
> until now where?

On powerpc there is currently no kernel support for filling the data_src value with anything meaningful. A user can still request PERF_SAMPLE_DATA_SRC (perf report -d), but they just get the default value from perf_sample_data_init(), which is PERF_MEM_NA. Though even that is currently broken with a big endian perf tool.

>> And did not understand "Then why are you fixing it?"
>
> I see no implementation; so why are you poking at it.

Maddy has posted an implementation of the kernel part for powerpc in patch 2 of this series, but maybe you're not on Cc?

Regardless of us wanting to do the kernel side on powerpc, the current API is broken on big endian. That's because in the kernel the PERF_MEM_NA value is constructed using shifts:

  /* TLB access */
  #define PERF_MEM_TLB_NA		0x01 /* not available */
  ...
  #define PERF_MEM_TLB_SHIFT	26

  #define PERF_MEM_S(a, s) \
  	(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)    |\
  		     PERF_MEM_S(LVL, NA)   |\
  		     PERF_MEM_S(SNOOP, NA) |\
  		     PERF_MEM_S(LOCK, NA)  |\
  		     PERF_MEM_S(TLB, NA))

Which works out as:

  ((0x01 << 0) | (0x01 << 5) | (0x01 << 19) | (0x01 << 24) | (0x01 << 26))

Which means the PERF_MEM_NA value comes out of the kernel as 0x5080021 in CPU endian.

But then in the perf tool, the code uses the bitfields to inspect the value, and currently the bitfields are defined using little endian ordering. So eg. in perf_mem__tlb_scnprintf() we see:

  data_src->val = 0x5080021
             op = 0x0
            lvl = 0x0
          snoop = 0x0
           lock = 0x0
           dtlb = 0x0
           rsvd = 0x5080021

So this patch does what I think is the minimal fix, of changing the definition of the bitfields to match the values that are already exported by the kernel on big endian. And it makes no change on little endian.

cheers
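The arithmetic in the mail can be checked standalone. This mirrors the shift-based construction quoted above, with the remaining field values filled in from include/uapi/linux/perf_event.h and __u64 replaced by uint64_t so it builds in user space:

```c
#include <stdint.h>

/* Field values and shifts as in the perf UAPI header: each field's
 * "not available" encoding is 0x01, placed at the field's bit offset. */
#define PERF_MEM_OP_NA       0x01
#define PERF_MEM_OP_SHIFT    0
#define PERF_MEM_LVL_NA      0x01
#define PERF_MEM_LVL_SHIFT   5
#define PERF_MEM_SNOOP_NA    0x01
#define PERF_MEM_SNOOP_SHIFT 19
#define PERF_MEM_LOCK_NA     0x01
#define PERF_MEM_LOCK_SHIFT  24
#define PERF_MEM_TLB_NA      0x01
#define PERF_MEM_TLB_SHIFT   26

#define PERF_MEM_S(a, s) \
    (((uint64_t)PERF_MEM_##a##_##s) << PERF_MEM_##a##_SHIFT)

#define PERF_MEM_NA (PERF_MEM_S(OP, NA)    | \
                     PERF_MEM_S(LVL, NA)   | \
                     PERF_MEM_S(SNOOP, NA) | \
                     PERF_MEM_S(LOCK, NA)  | \
                     PERF_MEM_S(TLB, NA))

uint64_t perf_mem_na(void)
{
    return PERF_MEM_NA;
}
```

The value is byte-order independent as an integer; the breakage described in the mail comes purely from the consumer side, where the perf tool's little-endian bitfield layout slices the same 64-bit value at the wrong bit positions on a big endian host.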