Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
* Peter Zijlstra (pet...@infradead.org) wrote:
[...]

Hi Peter,

Looking at this simplified version of perf's ring buffer synchronization,
I get concerned about the following issue:

> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
>
> struct buffer {
>         u64 size;
>         u64 tail;
>         u64 head;
>         void *data;
> };
>
> struct buffer *kbuf, *ubuf;
>
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
>         u64 tail, head, offset;
>
>         do {
>                 tail = ACCESS_ONCE(ubuf->tail);
>                 offset = head = kbuf->head;
>                 if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
>                         /* discard @buf */
>                         return;
>                 }
>                 head += sz;
>         } while (local_cmpxchg(&kbuf->head, offset, head) != offset)
>

Let's suppose we have a thread executing kbuf_write(), interrupted by an
IRQ or NMI right after a successful local_cmpxchg() (space reservation in
the buffer). If the nested execution context also calls kbuf_write(), it
will therefore update ubuf->head (below) with the second reserved space,
and only after that will it return to the original thread context and
continue executing kbuf_write(), thus overwriting ubuf->head with the
prior-to-last reserved offset.

All this probably works OK most of the time, when we have an event flow
guaranteeing that a following event will fix things up, but there appears
to be a risk of losing events near the end of the trace when those are in
nested execution contexts.

Thoughts?

Thanks,

Mathieu

>         /*
>          * Ensure that if we see the userspace tail (ubuf->tail) such
>          * that there is space to write @buf without overwriting data
>          * userspace hasn't seen yet, we won't in fact store data before
>          * that read completes.
>          */
>
>         smp_mb(); /* A, matches with D */
>
>         memcpy(kbuf->data + offset, buf, sz);
>
>         /*
>          * Ensure that we write all the @buf data before we update the
>          * userspace visible ubuf->head pointer.
>          */
>         smp_wmb(); /* B, matches with C */
>
>         ubuf->head = kbuf->head;
> }
>
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
>         u64 head, tail;
>
>         tail = ACCESS_ONCE(ubuf->tail);
>         head = ACCESS_ONCE(ubuf->head);
>
>         /*
>          * Ensure we read the buffer boundaries before the actual buffer
>          * data...
>          */
>         smp_rmb(); /* C, matches with B */
>
>         while (tail != head) {
>                 obj = ubuf->data + tail;
>                 /* process obj */
>                 tail += obj->size;
>                 tail %= ubuf->size;
>         }
>
>         /*
>          * Ensure all data reads are complete before we issue the
>          * ubuf->tail update; once that update hits, kbuf_write() can
>          * observe and overwrite data.
>          */
>         smp_mb(); /* D, matches with A */
>
>         ubuf->tail = tail;
> }

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
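For illustration, the interleaving described above can be written out step by
step. The record sizes and offsets below are hypothetical, and the functions
referenced are the simplified versions quoted from Peter's mail.

/*
 * Hypothetical interleaving (record sizes and offsets invented for
 * illustration); kbuf->head == ubuf->head == 0 initially.
 *
 *   Thread context                      Nested IRQ/NMI context
 *   --------------                      ----------------------
 *   kbuf_write(64, a)
 *     local_cmpxchg() succeeds:
 *       offset = 0, kbuf->head = 64
 *                                        kbuf_write(64, b)
 *                                          local_cmpxchg() succeeds:
 *                                            offset = 64, kbuf->head = 128
 *                                          memcpy(), smp_wmb()
 *                                          ubuf->head update -> 128
 *     memcpy(), smp_wmb()
 *     ubuf->head update runs last
 *
 * If the interrupted context's final store publishes its own (older)
 * reservation rather than the current kbuf->head, ubuf->head moves
 * backwards from 128 to 64 and the consumer will not see record 'b'
 * until a later event pushes ubuf->head forward again -- the "lost
 * events near the end of the trace" scenario raised above.
 */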
[RFC PATCH 1/9] powerpc: allocate sys_membarrier system call number
Allow it to be used from SPU, since it should not have unwanted side-effects. [ Untested on this architecture. To try it out: fetch linux-next/akpm, apply this patch, build/run a membarrier-enabled kernel, and do make kselftest. ] Signed-off-by: Mathieu Desnoyers CC: Andrew Morton CC: linux-...@vger.kernel.org CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 2 +- arch/powerpc/include/uapi/asm/unistd.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 4d65499..126d0c4 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -369,3 +369,4 @@ SYSCALL_SPU(bpf) COMPAT_SYS(execveat) PPC64ONLY(switch_endian) SYSCALL_SPU(userfaultfd) +SYSCALL_SPU(membarrier) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index 4a055b6..13411be 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -12,7 +12,7 @@ #include -#define __NR_syscalls 365 +#define __NR_syscalls 366 #define __NR__exit __NR_exit #define NR_syscalls__NR_syscalls diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 6ad58d4..6337738 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -387,5 +387,6 @@ #define __NR_execveat 362 #define __NR_switch_endian 363 #define __NR_userfaultfd 364 +#define __NR_membarrier365 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH 1/9] powerpc: allocate sys_membarrier system call number
- On Aug 31, 2015, at 2:54 AM, Michael Ellerman m...@ellerman.id.au wrote:

> On Thu, 2015-08-27 at 13:56 -0400, Mathieu Desnoyers wrote:
>> Allow it to be used from SPU, since it should not have unwanted
>> side-effects.
>>
>> [ Untested on this architecture. To try it out: fetch linux-next/akpm,
>> apply this patch, build/run a membarrier-enabled kernel, and do make
>> kselftest. ]
>>
>> Signed-off-by: Mathieu Desnoyers
>> CC: Andrew Morton
>> CC: linux-...@vger.kernel.org
>> CC: Benjamin Herrenschmidt
>> CC: Paul Mackerras
>> CC: Michael Ellerman
>> CC: linuxppc-dev@lists.ozlabs.org
>
> Thanks.
>
> I get:
>
> $ ./membarrier_test
> membarrier MEMBARRIER_CMD_QUERY syscall available.
> membarrier: MEMBARRIER_CMD_SHARED success.
> membarrier: tests done!
>
> Which looks good.
>
> Assuming the membarrier support hits 4.3, I'll take this via my tree for 4.3
> also.

Great! I'll take care of your comments about self-tests,

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
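For context, a minimal user-space check in the spirit of the selftest output
quoted above might look as follows. This is not the actual kselftest source;
it assumes linux/membarrier.h and __NR_membarrier are available on the build
system.

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int sys_membarrier(int cmd, int flags)
{
	/* No glibc wrapper at the time; go through syscall(2). */
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0) {
		perror("membarrier MEMBARRIER_CMD_QUERY");
		return 1;
	}
	printf("membarrier MEMBARRIER_CMD_QUERY syscall available.\n");
	if ((mask & MEMBARRIER_CMD_SHARED) &&
	    sys_membarrier(MEMBARRIER_CMD_SHARED, 0) == 0)
		printf("membarrier: MEMBARRIER_CMD_SHARED success.\n");
	return 0;
}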
[PATCH v2 1/9] [picked] powerpc: allocate sys_membarrier system call number
Allow it to be used from SPU, since it should not have unwanted side-effects. [ Picked-by: Michael Ellerman ] Signed-off-by: Mathieu Desnoyers CC: Andrew Morton CC: linux-...@vger.kernel.org CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 2 +- arch/powerpc/include/uapi/asm/unistd.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 4d65499..126d0c4 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -369,3 +369,4 @@ SYSCALL_SPU(bpf) COMPAT_SYS(execveat) PPC64ONLY(switch_endian) SYSCALL_SPU(userfaultfd) +SYSCALL_SPU(membarrier) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index 4a055b6..13411be 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -12,7 +12,7 @@ #include -#define __NR_syscalls 365 +#define __NR_syscalls 366 #define __NR__exit __NR_exit #define NR_syscalls__NR_syscalls diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 6ad58d4..6337738 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -387,5 +387,6 @@ #define __NR_execveat 362 #define __NR_switch_endian 363 #define __NR_userfaultfd 364 +#define __NR_membarrier365 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH for 4.21 09/16] powerpc: Wire up cpu_opv system call
Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Boqun Feng CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/uapi/asm/unistd.h | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 01b5171ea189..8f58710f5e8b 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -394,3 +394,4 @@ SYSCALL(pkey_free) SYSCALL(pkey_mprotect) SYSCALL(rseq) COMPAT_SYS(io_pgetevents) +SYSCALL(cpu_opv) diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 985534d0b448..112e2c54750a 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -400,5 +400,6 @@ #define __NR_pkey_mprotect 386 #define __NR_rseq 387 #define __NR_io_pgetevents 388 +#define __NR_cpu_opv 389 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 2.11.0
Re: [PATCH 07/14] powerpc: Add support for restartable sequences
- On May 24, 2018, at 3:03 AM, Michael Ellerman m...@ellerman.id.au wrote: > Mathieu Desnoyers writes: >> - On May 23, 2018, at 4:14 PM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: > ... >>> >>> Hi Boqun, >>> >>> I tried your patch in a ppc64 le environment, and it does not survive boot >>> with CONFIG_DEBUG_RSEQ=y. init gets killed right away. > > > Sorry this code is super gross and hard to deal with. > >> The following fixup gets ppc64 to work: >> >> --- a/arch/powerpc/kernel/entry_64.S >> +++ b/arch/powerpc/kernel/entry_64.S >> @@ -208,6 +208,7 @@ system_call_exit: >> /* Check whether the syscall is issued inside a restartable sequence >> */ >> addir3,r1,STACK_FRAME_OVERHEAD >> bl rseq_syscall >> + ld r3,RESULT(r1) >> #endif >> /* >> * Disable interrupts so current_thread_info()->flags can't change, > > I don't think that's safe. > > If you look above that, we have r3, r8 and r12 all live: > > .Lsyscall_exit: > std r3,RESULT(r1) > CURRENT_THREAD_INFO(r12, r1) > > ld r8,_MSR(r1) > #ifdef CONFIG_PPC_BOOK3S > /* No MSR:RI on BookE */ > andi. r10,r8,MSR_RI > beq-.Lunrecov_restore > #endif > > > They're all volatile across function calls: > > > http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655240_68174.html > > > The system_call_exit symbol is actually there for kprobes and cosmetic > purposes. The actual syscall return flow starts at .Lsyscall_exit. > > So I think this would work: > > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > index db4df061c33a..e19f377a25e0 100644 > --- a/arch/powerpc/kernel/entry_64.S > +++ b/arch/powerpc/kernel/entry_64.S > @@ -184,6 +184,14 @@ system_call: /* label this so stack > traces look sane */ > > .Lsyscall_exit: > std r3,RESULT(r1) > + > +#ifdef CONFIG_DEBUG_RSEQ > + /* Check whether the syscall is issued inside a restartable sequence */ > + addir3,r1,STACK_FRAME_OVERHEAD > + bl rseq_syscall > + ld r3,RESULT(r1) > +#endif > + > CURRENT_THREAD_INFO(r12, r1) > > ld r8,_MSR(r1) > > > I'll try and get this series into my test setup at some point, been a > bit busy lately :) Yes, this was needed. I had this in my tree already, but there is still a kernel OOPS when running the rseq selftests on ppc64 with CONFIG_DEBUG_RSEQ=y. My current dev tree is at: https://github.com/compudj/linux-percpu-dev/tree/rseq/dev-local So considering we are at rc7 now, should I plan to removing the powerpc bits for merge window submission, or is there someone planning to spend time on fixing and testing ppc integration before the merge window opens ? Thanks, Mathieu > > cheers -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
[RFC PATCH for 4.18 08/16] powerpc: Add support for restartable sequences
From: Boqun Feng Call the rseq_handle_notify_resume() function on return to userspace if TIF_NOTIFY_RESUME thread flag is set. Perform fixup on the pre-signal when a signal is delivered on top of a restartable sequence critical section. Signed-off-by: Boqun Feng Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/Kconfig | 1 + arch/powerpc/kernel/signal.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index c32a181a7cbb..ed21a777e8c6 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -223,6 +223,7 @@ config PPC select HAVE_SYSCALL_TRACEPOINTS select HAVE_VIRT_CPU_ACCOUNTING select HAVE_IRQ_TIME_ACCOUNTING + select HAVE_RSEQ select IRQ_DOMAIN select IRQ_FORCED_THREADING select MODULES_USE_ELF_RELA diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index 61db86ecd318..d3bb3aaaf5ac 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk) /* Re-enable the breakpoints for the signal stack */ thread_change_pc(tsk, tsk->thread.regs); + rseq_signal_deliver(tsk->thread.regs); + if (is32) { if (ksig.ka.sa.sa_flags & SA_SIGINFO) ret = handle_rt_signal32(&ksig, oldset, tsk); @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) if (thread_info_flags & _TIF_NOTIFY_RESUME) { clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume(regs); + rseq_handle_notify_resume(regs); } user_enter(); -- 2.11.0
[RFC PATCH for 4.18 09/16] powerpc: Add syscall detection for restartable sequences
From: Boqun Feng Syscalls are not allowed inside restartable sequences, so add a call to rseq_syscall() at the very beginning of system call exiting path for CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there is a syscall issued inside restartable sequences. [ Tested on 64-bit powerpc kernel by Mathieu Desnoyers. Still needs to be tested on 32-bit powerpc kernel. ] Signed-off-by: Boqun Feng Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/kernel/entry_32.S | 7 +++ arch/powerpc/kernel/entry_64.S | 8 2 files changed, 15 insertions(+) diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index eb8d01bae8c6..973577f2141c 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -365,6 +365,13 @@ syscall_dotrace_cont: blrl/* Call handler */ .globl ret_from_syscall ret_from_syscall: +#ifdef CONFIG_DEBUG_RSEQ + /* Check whether the syscall is issued inside a restartable sequence */ + stw r3,GPR3(r1) + addir3,r1,STACK_FRAME_OVERHEAD + bl rseq_syscall + lwz r3,GPR3(r1) +#endif mr r6,r3 CURRENT_THREAD_INFO(r12, r1) /* disable interrupts so current_thread_info()->flags can't change */ diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 51695608c68b..1c374387656a 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -184,6 +184,14 @@ system_call: /* label this so stack traces look sane */ .Lsyscall_exit: std r3,RESULT(r1) + +#ifdef CONFIG_DEBUG_RSEQ + /* Check whether the syscall is issued inside a restartable sequence */ + addir3,r1,STACK_FRAME_OVERHEAD + bl rseq_syscall + ld r3,RESULT(r1) +#endif + CURRENT_THREAD_INFO(r12, r1) ld r8,_MSR(r1) -- 2.11.0
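Conceptually, the rseq_syscall() debug hook wired up by this patch boils down
to the check sketched below. This is a paraphrase, not the kernel
implementation; in_rseq_critical_section() is a hypothetical helper standing
in for the real rseq_cs range check.

/* Sketch only -- not the actual kernel code. */
void rseq_syscall_sketch(struct pt_regs *regs)
{
	if (!current->rseq)
		return;
	/* Hypothetical helper: was this syscall issued from within a
	 * registered rseq critical section? */
	if (in_rseq_critical_section(current, instruction_pointer(regs)))
		force_sig(SIGSEGV);	/* syscalls are forbidden there;
					 * exact call varies by kernel version */
}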
[RFC PATCH for 4.18 10/16] powerpc: Wire up restartable sequences system call
From: Boqun Feng Wire up the rseq system call on powerpc. This provides an ABI improving the speed of a user-space getcpu operation on powerpc by skipping the getcpu system call on the fast path, as well as improving the speed of user-space operations on per-cpu data compared to using load-reservation/store-conditional atomics. Signed-off-by: Boqun Feng Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 2 +- arch/powerpc/include/uapi/asm/unistd.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index d61f9c96d916..45d4d37495fd 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -392,3 +392,4 @@ SYSCALL(statx) SYSCALL(pkey_alloc) SYSCALL(pkey_free) SYSCALL(pkey_mprotect) +SYSCALL(rseq) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index daf1ba97a00c..1e9708632dce 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -12,7 +12,7 @@ #include -#define NR_syscalls387 +#define NR_syscalls388 #define __NR__exit __NR_exit diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 389c36fd8299..ac5ba55066dd 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -398,5 +398,6 @@ #define __NR_pkey_alloc384 #define __NR_pkey_free 385 #define __NR_pkey_mprotect 386 +#define __NR_rseq 387 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 2.11.0
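To make the getcpu fast path concrete, a rough user-space sketch follows.
Assumptions: linux/rseq.h provides struct rseq, __NR_rseq is defined, no other
component (e.g. libc) has already registered rseq for the thread, and the
signature value is an arbitrary application-chosen constant.

#include <linux/rseq.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RSEQ_SIG	0x53053053	/* application-chosen signature */

static __thread struct rseq rseq_area __attribute__((aligned(32)));

static int rseq_register_current_thread(void)
{
	return syscall(__NR_rseq, &rseq_area, sizeof(rseq_area), 0, RSEQ_SIG);
}

static int rseq_current_cpu(void)
{
	/* A plain load of the kernel-maintained field replaces the
	 * getcpu/sched_getcpu system call on the fast path. */
	return __atomic_load_n(&rseq_area.cpu_id, __ATOMIC_RELAXED);
}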
Re: [RFC PATCH for 4.18 09/16] powerpc: Add syscall detection for restartable sequences
- On Jun 5, 2018, at 1:21 AM, Michael Ellerman m...@ellerman.id.au wrote:

> Mathieu Desnoyers writes:
>> From: Boqun Feng
>>
>> Syscalls are not allowed inside restartable sequences, so add a call to
>> rseq_syscall() at the very beginning of system call exiting path for
>> CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
>> is a syscall issued inside restartable sequences.
>>
>> [ Tested on 64-bit powerpc kernel by Mathieu Desnoyers. Still needs to
>> be tested on 32-bit powerpc kernel. ]
>>
>> Signed-off-by: Boqun Feng
>> Signed-off-by: Mathieu Desnoyers
>> CC: Benjamin Herrenschmidt
>> CC: Paul Mackerras
>> CC: Michael Ellerman
>> CC: Peter Zijlstra
>> CC: "Paul E. McKenney"
>> CC: linuxppc-dev@lists.ozlabs.org
>> ---
>> arch/powerpc/kernel/entry_32.S | 7 +++
>> arch/powerpc/kernel/entry_64.S | 8
>> 2 files changed, 15 insertions(+)
>
> I don't _love_ the #ifdefs in here, but they look correct and there's
> not really a better option until we rewrite the syscall handler in C.
>
> The rseq selftests passed for me with this applied and enabled. So if
> you like here's some tags:
>
> Tested-by: Michael Ellerman
> Acked-by: Michael Ellerman
>

Adding your ack to the series. Thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH for 4.18 10/16] powerpc: Wire up restartable sequences system call
- On Jun 5, 2018, at 1:18 AM, Michael Ellerman m...@ellerman.id.au wrote: > Mathieu Desnoyers writes: > >> From: Boqun Feng >> >> Wire up the rseq system call on powerpc. >> >> This provides an ABI improving the speed of a user-space getcpu >> operation on powerpc by skipping the getcpu system call on the fast >> path, as well as improving the speed of user-space operations on per-cpu >> data compared to using load-reservation/store-conditional atomics. >> >> Signed-off-by: Boqun Feng >> Signed-off-by: Mathieu Desnoyers >> CC: Benjamin Herrenschmidt >> CC: Paul Mackerras >> CC: Michael Ellerman >> CC: Peter Zijlstra >> CC: "Paul E. McKenney" >> CC: linuxppc-dev@lists.ozlabs.org >> --- >> arch/powerpc/include/asm/systbl.h | 1 + >> arch/powerpc/include/asm/unistd.h | 2 +- >> arch/powerpc/include/uapi/asm/unistd.h | 1 + >> 3 files changed, 3 insertions(+), 1 deletion(-) > > Looks fine to me. > > I don't have any other new syscalls in my next, so this should not > conflict with anything for 4.18. > > Acked-by: Michael Ellerman (powerpc) Added your ack to the series, thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Appropriate liburcu cache line size for Power
Hi, In the powerpc architecture support within the liburcu project [1] we have a cache line size defined as 256 bytes with the following comment: /* Include size of POWER5+ L3 cache lines: 256 bytes */ #define CAA_CACHE_LINE_SIZE 256 I recently received a pull request on github [2] asking to change this to 128 bytes. All the material provided supports that the cache line sizes on powerpc are 128 bytes or less (even L3 on POWER7, POWER8, and POWER9) [3]. I wonder where the 256 bytes L3 cache line size for POWER5+ we have in liburcu comes from, and I wonder if it's the right choice for a cache line size on all powerpc, considering that the Linux kernel cache line size appear to use 128 bytes on recent Power architectures. I recall some benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line size, and I suppose this is why we came up with this value, but I don't have the detailed specs of that machine. Any feedback on this matter would be appreciated. Thanks! Mathieu [1] https://liburcu.org [2] https://github.com/urcu/userspace-rcu/pull/22 [3] https://www.7-cpu.com/ -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
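For readers less familiar with liburcu, the constant matters because it is
baked into data-structure layouts, roughly as sketched below (illustrative
only; the exact header providing CAA_CACHE_LINE_SIZE may vary across liburcu
versions). Changing the value changes structure sizes and offsets, which is
why it is effectively part of the library ABI.

#include <urcu/arch.h>	/* assumed to provide CAA_CACHE_LINE_SIZE */

/* Per-thread counter padded so that concurrently updated instances never
 * share a cache line; a too-small constant reintroduces false sharing,
 * a too-large one only wastes memory. */
struct per_thread_counter {
	unsigned long count;
} __attribute__((aligned(CAA_CACHE_LINE_SIZE)));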
Re: Appropriate liburcu cache line size for Power
On 2024-03-26 03:19, Michael Ellerman wrote:
> Mathieu Desnoyers writes:
>> Hi,
>
> Hi Mathieu,
>
>> In the powerpc architecture support within the liburcu project [1]
>> we have a cache line size defined as 256 bytes with the following
>> comment:
>>
>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>> #define CAA_CACHE_LINE_SIZE 256
>>
>> I recently received a pull request on github [2] asking to change
>> this to 128 bytes. All the material provided supports that the cache
>> line sizes on powerpc are 128 bytes or less (even L3 on POWER7,
>> POWER8, and POWER9) [3].
>>
>> I wonder where the 256 bytes L3 cache line size for POWER5+ we have
>> in liburcu comes from, and I wonder if it's the right choice for a
>> cache line size on all powerpc, considering that the Linux kernel
>> cache line size appear to use 128 bytes on recent Power architectures.
>>
>> I recall some benchmark experiments Paul and I did on a 64-core
>> 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line
>> size, and I suppose this is why we came up with this value, but I
>> don't have the detailed specs of that machine.
>>
>> Any feedback on this matter would be appreciated.
>
> The ISA doesn't specify the cache line size, other than it is smaller
> than a page.
>
> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
> bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T
> and Freescale e6500.
>
> It is possible to discover at runtime via AUXV headers. But that's no
> use if you want a compile-time constant.

Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
so changing this would require a soname bump, which I don't want to do
without really good reasons.

> I'm happy to run some benchmarks if you can point me at what to run.
> I had a poke around the repository and found short_bench, but it
> seemed to run for a very long time.

I've created a dedicated test program for this, see:

https://github.com/compudj/userspace-rcu-dev/tree/false-sharing

Example use:

On an AMD Ryzen 7 PRO 6850U with Radeon Graphics:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 21320
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 22657
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 47599
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 531364
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 523634
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 519402
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 520651
1..1

Would point to false-sharing starting with cache lines smaller than 64
bytes. I get similar results (false-sharing under 64 bytes) with an AMD
EPYC 9654 96-Core Processor.

The test program runs 4 threads by default, which can be overridden with
"-t N". This may be needed if you want it to use all cores of a larger
machine. See "-h" for options.

On a POWER9 (architected), altivec supported:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 12264
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 12276
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 25638
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 39934
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 53971
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 53599
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 53962
1..1

This points at false-sharing below a 128-byte stride.

On an e6500, altivec supported, Model 2.0 (pvr 8040 0120):

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9049
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 9054
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 18643
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 37417
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 37906
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 37870
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 37899
1..1

Which points at false-sharing below 64 bytes.

I prefer to be cautious about this cache line size value and aim for a
value which takes into account the largest known cache line size for an
architecture, rather than use one that is too small, given the large
overhead caused by false sharing.

Feedback is welcome.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
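For readers who cannot fetch the branch above, the measurement amounts to
something like the following sketch. It is not the actual
tests/unit/test_false_sharing source; thread count, run time, and reporting
are simplified. Each thread increments its own counter, counters are placed
'stride' bytes apart, and per-thread throughput collapses once two counters
share a cache line.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_THREADS	4
#define MAX_STRIDE	512

static char area[NR_THREADS * MAX_STRIDE] __attribute__((aligned(MAX_STRIDE)));
static size_t stride = 64;
static volatile int stop;

static void *worker(void *arg)
{
	/* Each thread owns one counter, 'stride' bytes away from its neighbour. */
	volatile uintptr_t *counter =
		(uintptr_t *)(area + (uintptr_t)arg * stride);

	while (!stop)
		(*counter)++;
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NR_THREADS];
	uintptr_t i, total = 0;

	if (argc > 1)
		stride = strtoul(argv[1], NULL, 0);
	if (stride < sizeof(uintptr_t) || stride > MAX_STRIDE ||
	    stride % sizeof(uintptr_t)) {
		fprintf(stderr, "invalid stride\n");
		return 1;
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	sleep(1);
	stop = 1;
	for (i = 0; i < NR_THREADS; i++) {
		pthread_join(tid[i], NULL);
		total += *(uintptr_t *)(area + i * stride);
	}
	printf("stride %zu bytes: %lu increments per thread per second\n",
	       stride, (unsigned long)(total / NR_THREADS));
	return 0;
}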
Re: Appropriate liburcu cache line size for Power
On 2024-03-25 16:34, Nathan Lynch wrote: Mathieu Desnoyers writes: In the powerpc architecture support within the liburcu project [1] we have a cache line size defined as 256 bytes with the following comment: /* Include size of POWER5+ L3 cache lines: 256 bytes */ #define CAA_CACHE_LINE_SIZE 256 I recently received a pull request on github [2] asking to change this to 128 bytes. All the material provided supports that the cache line sizes on powerpc are 128 bytes or less (even L3 on POWER7, POWER8, and POWER9) [3]. I wonder where the 256 bytes L3 cache line size for POWER5+ we have in liburcu comes from, and I wonder if it's the right choice for a cache line size on all powerpc, considering that the Linux kernel cache line size appear to use 128 bytes on recent Power architectures. I recall some benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line size, and I suppose this is why we came up with this value, but I don't have the detailed specs of that machine. Any feedback on this matter would be appreciated. For what it's worth, I found a copy of an IBM Journal of Research & Development article confirming that POWER5's L3 had a 256-byte line size: Each slice [of the L3] is 12-way set-associative, with 4,096 congruence classes of 256-byte lines managed as two 128-byte sectors to match the L2 line size. https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf I don't know of any reason to prefer 256 over 128 for current Power processors though. Thanks for the pointer. I will add a reference to it in the liburcu source code to explain the cache line size choice. Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH printk v3 00/40] reduce console_lock scope
On 2022-11-07 09:15, John Ogness wrote:
[...]
> The base commit for this series is from Paul McKenney's RCU tree and
> provides an NMI-safe SRCU implementation [1]. Without the NMI-safe
> SRCU implementation, this series is not less safe than mainline. But
> we will need the NMI-safe SRCU implementation for atomic consoles
> anyway, so we might as well get it in now. Especially since it _does_
> increase the reliability for mainline in the panic path.

So, your email got me to review the SRCU nmi-safe series:

[1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/log/?h=srcunmisafe.2022.10.21a

Especially this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=srcunmisafe.2022.10.21a&id=5d0f5953b60f5f7a278085b55ddc73e2932f4c33

I disagree with the overall approach taken there, which is to create yet
another SRCU flavor, this time with explicit "nmi-safe" read-locks. This
adds complexity to the kernel APIs and I think we can be clever about
this and make SRCU nmi-safe without requiring a whole new incompatible
API.

You can find the basic idea needed to achieve this in the libside RCU
user-space implementation. I needed to introduce a split-counter concept
to support rseq vs atomics to keep track of per-cpu grace period
counters. The "rseq" counter is the fast-path, but if rseq fails, the
abort handler uses the atomic counter instead.

https://github.com/compudj/side/blob/main/src/rcu.h#L23

struct side_rcu_percpu_count {
        uintptr_t begin;
        uintptr_t rseq_begin;
        uintptr_t end;
        uintptr_t rseq_end;
} __attribute__((__aligned__(SIDE_CACHE_LINE_SIZE)));

The idea is to "split" each percpu counter into two counters, one for
rseq, and the other for atomics. When a grace period wants to observe
the value of a percpu counter, it simply sums the two counters:

https://github.com/compudj/side/blob/main/src/rcu.c#L112

The same idea can be applied to SRCU in the kernel: one counter for
percpu ops, and the other counter for nmi context, so basically:

srcu_read_lock()

  if (likely(!in_nmi()))
    increment the percpu-ops lock counter
  else
    increment the atomic lock counter

srcu_read_unlock()

  if (likely(!in_nmi()))
    increment the percpu-ops unlock counter
  else
    increment the atomic unlock counter

Then in the grace period sum the percpu-ops and the atomic values
whenever each counter value is read.

This would allow SRCU to be NMI-safe without requiring the callers to
explicitly state whether they need to be nmi-safe or not, and would only
take the overhead of the atomics in the NMI handlers rather than for all
users which happen to use SRCU read locks shared with nmi handlers.

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
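To make the proposal concrete, a rough kernel-side sketch of the split-counter
idea follows. The type and function names are illustrative, not taken from the
SRCU code; treat this as a sketch of the summing scheme, not a working patch.

/* One counter pair per CPU: the first half is updated with cheap per-CPU
 * ops from normal context, the second with atomics from NMI context. */
struct srcu_split_count {
	unsigned long	srcu_lock_count[2];
	atomic_long_t	srcu_lock_count_nmi[2];
	unsigned long	srcu_unlock_count[2];
	atomic_long_t	srcu_unlock_count_nmi[2];
};

/* Reader side: per-CPU increment in normal context, atomic only in NMI. */
static void srcu_split_lock(struct srcu_split_count __percpu *cnt, int idx)
{
	if (likely(!in_nmi())) {
		this_cpu_inc(cnt->srcu_lock_count[idx]);
	} else {
		struct srcu_split_count *c = this_cpu_ptr(cnt);

		atomic_long_inc(&c->srcu_lock_count_nmi[idx]);
	}
}

/* Grace period side: sum both halves of every CPU's counter. */
static unsigned long srcu_split_sum_locks(struct srcu_split_count __percpu *cnt,
					  int idx)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct srcu_split_count *c = per_cpu_ptr(cnt, cpu);

		sum += READ_ONCE(c->srcu_lock_count[idx]);
		sum += atomic_long_read(&c->srcu_lock_count_nmi[idx]);
	}
	return sum;
}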
Re: [PATCH] powerpc/ftrace: fix syscall tracing on PPC64_ELF_ABI_V1
On 2022-12-05 17:50, Michael Ellerman wrote: Michael Jeanson writes: On 2022-12-05 15:11, Michael Jeanson wrote: Michael Jeanson writes: In v5.7 the powerpc syscall entry/exit logic was rewritten in C, on PPC64_ELF_ABI_V1 this resulted in the symbols in the syscall table changing from their dot prefixed variant to the non-prefixed ones. Since ftrace prefixes a dot to the syscall names when matching them to build its syscall event list, this resulted in no syscall events being available. Remove the PPC64_ELF_ABI_V1 specific version of arch_syscall_match_sym_name to have the same behavior across all powerpc variants. This doesn't seem to work for me. Event with it applied I still don't see anything in /sys/kernel/debug/tracing/events/syscalls Did we break it in some other way recently? cheers I did some further testing, my config also enabled KALLSYMS_ALL, when I remove it there is indeed no syscall events. Aha, OK that explains it I guess. I was using ppc64_guest_defconfig which has ABI_V1 and FTRACE_SYSCALLS, but does not have KALLSYMS_ALL. So I guess there's some other bug lurking in there. I don't have the setup handy to validate it, but I suspect it is caused by the way scripts/kallsyms.c:symbol_valid() checks whether a symbol entry needs to be integrated into the assembler output when --all-symbols is not specified. It only keeps symbols which addresses are in the text range. On PPC64_ELF_ABI_V1, this means only the dot-prefixed symbols will be kept (those point to the function begin), leaving out the non-dot-prefixed symbols (those point to the function descriptors). So I see two possible solutions there: either we ensure that FTRACE_SYSCALLS selects KALLSYMS_ALL on PPC64_ELF_ABI_V1, or we modify scripts/kallsyms.c:symbol_valid() to also include function descriptor symbols. This would mean accepting symbols pointing into the .opd ELF section. IMHO the second option would be better because it does not increase the kernel image size as much as KALLSYMS_ALL. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] powerpc/ftrace: fix syscall tracing on PPC64_ELF_ABI_V1
On 2022-12-06 21:09, Michael Ellerman wrote: Mathieu Desnoyers writes: On 2022-12-05 17:50, Michael Ellerman wrote: Michael Jeanson writes: On 2022-12-05 15:11, Michael Jeanson wrote: Michael Jeanson writes: In v5.7 the powerpc syscall entry/exit logic was rewritten in C, on PPC64_ELF_ABI_V1 this resulted in the symbols in the syscall table changing from their dot prefixed variant to the non-prefixed ones. Since ftrace prefixes a dot to the syscall names when matching them to build its syscall event list, this resulted in no syscall events being available. Remove the PPC64_ELF_ABI_V1 specific version of arch_syscall_match_sym_name to have the same behavior across all powerpc variants. This doesn't seem to work for me. Event with it applied I still don't see anything in /sys/kernel/debug/tracing/events/syscalls Did we break it in some other way recently? cheers I did some further testing, my config also enabled KALLSYMS_ALL, when I remove it there is indeed no syscall events. Aha, OK that explains it I guess. I was using ppc64_guest_defconfig which has ABI_V1 and FTRACE_SYSCALLS, but does not have KALLSYMS_ALL. So I guess there's some other bug lurking in there. I don't have the setup handy to validate it, but I suspect it is caused by the way scripts/kallsyms.c:symbol_valid() checks whether a symbol entry needs to be integrated into the assembler output when --all-symbols is not specified. It only keeps symbols which addresses are in the text range. On PPC64_ELF_ABI_V1, this means only the dot-prefixed symbols will be kept (those point to the function begin), leaving out the non-dot-prefixed symbols (those point to the function descriptors). OK. So I guess it never worked without KALLSYMS_ALL. I suspect it worked prior to kernel v5.7, because back then the PPC64 ABIv1 system call table contained pointers to the text section (beginning of functions) rather than function descriptors. So changing this system call table to point to C functions introduced the dot-prefix match issue for syscall tracing as well as a dependency on KALLSYMS_ALL. It seems like most distros enable KALLSYMS_ALL, so I guess that's why we've never noticed. Or very few people run recent PPC64 ABIv1 kernels :) So I see two possible solutions there: either we ensure that FTRACE_SYSCALLS selects KALLSYMS_ALL on PPC64_ELF_ABI_V1, or we modify scripts/kallsyms.c:symbol_valid() to also include function descriptor symbols. This would mean accepting symbols pointing into the .opd ELF section. My only worry is that will cause some other breakage, because .opd symbols are not really "text" in the normal sense, ie. you can't execute them directly. AFAIU adding the .opd section to scripts/kallsyms.c text_ranges will only affect the result of symbol_valid(), which decides which symbols get pulled into a KALLSYMS=n/KALLSYMS_ALL=n kernel build. Considering that a KALLSYMS_ALL=y build pulls those function descriptor symbols into the image without breaking anything, I don't see why adding the .opd section to this script text_ranges would have any ill side-effect. On the other hand the help for KALLSYMS_ALL says: "Normally kallsyms only contains the symbols of functions" But without .opd included that's not really true. In practice it probably doesn't really matter, because eg. backtraces will point to dot symbols which can be resolved. Indeed, I don't see this affecting backtraces, but as soon as a lookup depends on comparing the C function pointer to a function descriptor, the .opd symbols are needed. 
Not having those function descriptor symbols in KALLSYMS_ALL=n builds seems rather error-prone. IMHO the second option would be better because it does not increase the kernel image size as much as KALLSYMS_ALL. Yes I agree. Even if that did break something, any breakage would be limited to arches which uses function descriptors, which are now all rare. Yes, it would only impact those arches using function descriptors, which are broken today with respect to system call tracing. Are you aware of other architectures other than PPC64 ELF ABI v1 supported by the Linux kernel that use function descriptors ? Relatedly we have a patch in next to optionally use ABIv2 for 64-bit big endian builds. Interesting. Does it require a matching user-space ? (built with PPC64 ABIv2 ?) Does it handle legacy PPC32 executables ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
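For readers unfamiliar with ELF ABIv1 function descriptors, the reason the
.opd symbols matter can be sketched as follows (struct and helper names are
illustrative, not the kernel's): a C function pointer such as a syscall-table
entry points into .opd, not into .text, so a symbol table restricted to the
text range cannot resolve it back to a name.

/* Illustrative only -- names do not match the kernel's definitions. */
struct elfv1_func_desc {
	unsigned long entry;	/* address of the dot-symbol in .text */
	unsigned long toc;	/* TOC pointer value for the callee */
	unsigned long env;	/* environment pointer (unused by C) */
};

/* Given a C function pointer (e.g. a sys_call_table entry), recover the
 * .text address that a dot-prefixed kallsyms entry would describe. */
static unsigned long text_entry_of(void *func)
{
	return ((struct elfv1_func_desc *)func)->entry;
}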
Re: [powerpc] Boot failure kernel BUG at mm/usercopy.c:102
On 2023-01-10 05:42, Sachin Sant wrote: 6.2.0-rc3-next-20230109 fails to boot on powerpc with following: [ 0.444834] [ cut here ] [ 0.444838] kernel BUG at mm/usercopy.c:102! [ 0.444842] Oops: Exception in kernel mode, sig: 5 [#1] [ 0.444845] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries [ 0.444849] Modules linked in: [ 0.444853] CPU: 23 PID: 201 Comm: modprobe Not tainted 6.2.0-rc3-next-20230109 #1 [ 0.444858] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1030.00 (NH1030_026) hv:phyp pSeries [ 0.444862] NIP: c055a934 LR: c055a930 CTR: 00725d90 [ 0.444866] REGS: c7f937c0 TRAP: 0700 Not tainted (6.2.0-rc3-next-20230109) [ 0.444871] MSR: 80029033 CR: 28088822 XER: 0007 [ 0.444879] CFAR: c02012a8 IRQMASK: 0 [ 0.444879] GPR00: c055a930 c7f93a60 c13b0800 0066 [ 0.444879] GPR04: 7fff c7f93880 c7f93878 [ 0.444879] GPR08: 7fff c25e7150 c29672b8 [ 0.444879] GPR12: 48088824 c00e87bf6300 c017f458 c34b8100 [ 0.444879] GPR16: 00012f18eac0 7fffc5c095d0 7fffc5c095d8 00012f140040 [ 0.444879] GPR20: fcff 001f 5455 0008 [ 0.444879] GPR24: c723a0c0 7fffc5c09368 7fffc5c09370 [ 0.444879] GPR28: 0250 0001 c3017000 c723a0c0 [ 0.444922] NIP [c055a934] usercopy_abort+0xa4/0xb0 [ 0.444928] LR [c055a930] usercopy_abort+0xa0/0xb0 [ 0.444932] Call Trace: [ 0.444933] [c7f93a60] [c055a930] usercopy_abort+0xa0/0xb0 (unreliable) [ 0.444939] [c7f93ad0] [c050eeb8] __check_heap_object+0x198/0x1d0 [ 0.444945] [c7f93b10] [c055a7e0] __check_object_size+0x290/0x340 [ 0.444949] [c7f93b50] [c060eba4] create_elf_tables.isra.20+0xc04/0xc90 [ 0.444956] [c7f93c10] [c0610b2c] load_elf_binary+0xdac/0x1320 [ 0.444962] [c7f93d00] [c0571cf0] bprm_execve+0x3d0/0x7c0 [ 0.444966] [c7f93dc0] [c0572b9c] kernel_execve+0x1ac/0x270 [ 0.444971] [c7f93e10] [c017f5cc] call_usermodehelper_exec_async+0x17c/0x250 [ 0.444978] [c7f93e50] [c000e054] ret_from_kernel_thread+0x5c/0x64 [ 0.444983] --- interrupt: 0 at 0x0 [ 0.444986] NIP: LR: CTR: [ 0.444990] REGS: c7f93e80 TRAP: Not tainted (6.2.0-rc3-next-20230109) [ 0.444994] MSR: <> CR: XER: [ 0.444998] CFAR: IRQMASK: 0 [ 0.444998] GPR00: c7f94000 [ 0.444998] GPR04: [ 0.444998] GPR08: [ 0.444998] GPR12: c017f458 c34b8100 [ 0.444998] GPR16: [ 0.444998] GPR20: [ 0.444998] GPR24: [ 0.444998] GPR28: [ 0.445039] NIP [] 0x0 [ 0.445042] LR [] 0x0 [ 0.445044] --- interrupt: 0 [ 0.445046] Code: 392990f8 4814 3d02ffe9 390827f0 7d074378 7d094378 7c661b78 3c62ffe7 f9610060 386319f0 4bca6935 6000 <0fe0> 7c0802a6 [ 0.445061] ---[ end trace ]— Git bisect points to following patch: commit 317c8194e6aeb8b3b573ad139fc2a0635856498e rseq: Introduce feature size and alignment ELF auxiliary vector entries Reverting the patch helps boot the kernel. Can you try with the following fix ? https://lore.kernel.org/r/20230104192054.34046-1-mathieu.desnoy...@efficios.com "rseq: Fix: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries" Peter, I suspect it would be good to merge that fix into tip/master though sched/core. Thanks, Mathieu Thanks -Sachin -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [2.6.24 patch] restore Cell OProfile support
* Adrian Bunk ([EMAIL PROTECTED]) wrote: > This patch restores the Cell OProfile support that was killed by > commit 09cadedbdc01f1a4bea1f427d4fb4642eaa19da9. > > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> > Same here : why isn't this on by default when SPU_FS and OPROFILE are enabled ? > --- > > kernel/Kconfig.instrumentation |8 > 1 file changed, 8 insertions(+) > > 6d1446d35d367fdbdd2a0a29e0d156646ff40630 > diff --git a/kernel/Kconfig.instrumentation b/kernel/Kconfig.instrumentation > index a76635d..9d143e6 100644 > --- a/kernel/Kconfig.instrumentation > +++ b/kernel/Kconfig.instrumentation > @@ -53,2 +53,10 @@ config HARDWARE_PM > > +config OPROFILE_CELL > + bool "OProfile for Cell Broadband Engine" > + depends on PPC && (SPU_FS = y && OPROFILE = m) || (SPU_FS = y && > OPROFILE = y) > + default y > + help > + Profiling of Cell BE SPUs requires special support enabled > + by this option. > + > config KPROBES > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[2.6.24 patch] Fix Cell OProfile support
This patch restores the Cell OProfile support that was killed by commit 09cadedbdc01f1a4bea1f427d4fb4642eaa19da9. It puts it in arch/powerpc/Kconfig. Since I don't see any good reason to leave this as a supplementary user-selectable option, it is now automatically enabled whenever SPU_FS and OPROFILE are enabled. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Adrian Bunk <[EMAIL PROTECTED]> CC: Randy Dunlap <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org CC: [EMAIL PROTECTED] --- arch/powerpc/Kconfig |4 1 file changed, 4 insertions(+) Index: linux-2.6-lttng/arch/powerpc/Kconfig === --- linux-2.6-lttng.orig/arch/powerpc/Kconfig 2007-12-28 17:00:24.0 -0500 +++ linux-2.6-lttng/arch/powerpc/Kconfig2007-12-28 17:00:39.0 -0500 @@ -163,6 +163,10 @@ config PPC_OF_PLATFORM_PCI depends on PPC64 # not supported on 32 bits yet default n +config OPROFILE_CELL + def_bool y + depends on (SPU_FS = y && OPROFILE = m) || (SPU_FS = y && OPROFILE = y) + source "init/Kconfig" source "arch/powerpc/platforms/Kconfig" -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [2.6.24 patch] Fix Cell OProfile support
* Arnd Bergmann ([EMAIL PROTECTED]) wrote: > On Saturday 29 December 2007, Mathieu Desnoyers wrote: > > This patch restores the Cell OProfile support that was killed by > > commit 09cadedbdc01f1a4bea1f427d4fb4642eaa19da9. > > > > It puts it in arch/powerpc/Kconfig. Since I don't see any good reason to > > leave > > this as a supplementary user-selectable option, it is now automatically > > enabled > > whenever SPU_FS and OPROFILE are enabled. > > This one has already been superceded by the fix in > aed3a8c9bb1a8623a618232087c5ff62718e3b9a, which made CONFIG_OPROFILE_CELL an > automatically selected option, from arch/powerpc/platforms/cell/Kconfig. > Great, so my patch should be dropped. Thanks! Mathieu > Arnd <>< > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 05/28] Add cmpxchg64 and cmpxchg64_local to powerpc
Make sure that at least cmpxchg64_local is available on all architectures to use for unsigned long long values. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Acked-by: Paul Mackerras <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org --- include/asm-powerpc/system.h |6 ++ 1 file changed, 6 insertions(+) Index: linux-2.6-lttng/include/asm-powerpc/system.h === --- linux-2.6-lttng.orig/include/asm-powerpc/system.h 2007-09-24 10:50:11.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/system.h2007-09-24 11:01:04.0 -0400 @@ -488,6 +488,12 @@ __cmpxchg_local(volatile void *ptr, unsi */ #define NET_IP_ALIGN 0 #define NET_SKB_PADL1_CACHE_BYTES + +#define cmpxchg64 cmpxchg +#define cmpxchg64_localcmpxchg_local +#else +#include +#define cmpxchg64_local(ptr,o,n) __cmpxchg64_local_generic((ptr), (o), (n)) #endif #define arch_align_stack(x) (x) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 05/28] Add cmpxchg64 and cmpxchg64_local to powerpc
Make sure that at least cmpxchg64_local is available on all architectures to use for unsigned long long values. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Acked-by: Paul Mackerras <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org --- include/asm-powerpc/system.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) Index: linux-2.6-lttng/include/asm-powerpc/system.h === --- linux-2.6-lttng.orig/include/asm-powerpc/system.h 2007-10-31 17:32:44.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/system.h2007-10-31 17:56:25.0 -0400 @@ -461,7 +461,7 @@ __cmpxchg_local(volatile void *ptr, unsi return old; } -#define cmpxchg(ptr,o,n)\ +#define cmpxchg(ptr, o, n) \ ({\ __typeof__(*(ptr)) _o_ = (o); \ __typeof__(*(ptr)) _n_ = (n); \ @@ -470,7 +470,7 @@ __cmpxchg_local(volatile void *ptr, unsi }) -#define cmpxchg_local(ptr,o,n) \ +#define cmpxchg_local(ptr, o, n)\ ({\ __typeof__(*(ptr)) _o_ = (o); \ __typeof__(*(ptr)) _n_ = (n); \ @@ -490,6 +490,20 @@ __cmpxchg_local(volatile void *ptr, unsi */ #define NET_IP_ALIGN 0 #define NET_SKB_PADL1_CACHE_BYTES + +#define cmpxchg64(ptr, o, n) \ + ({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 8); \ + cmpxchg((ptr), (o), (n)); \ + }) +#define cmpxchg64_local(ptr, o, n) \ + ({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 8); \ + cmpxchg_local((ptr), (o), (n)); \ + }) +#else +#include +#define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n)) #endif #define arch_align_stack(x) (x) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
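As a usage illustration (not taken from any in-tree caller), cmpxchg64_local()
permits a 64-bit read-modify-write that only needs to be atomic with respect
to the local CPU, for example against interrupts, without paying for a fully
SMP-safe cmpxchg64():

/* Hypothetical example: bump a per-CPU 64-bit statistics counter that
 * can also be updated from interrupt context on the same CPU. */
static inline void stats_add_local(u64 *counter, u64 val)
{
	u64 old, new;

	do {
		old = *counter;
		new = old + val;
	} while (cmpxchg64_local(counter, old, new) != old);
}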
2.6.23-rc2-mm2 build error on MIPS and ARM
Hi Andrew, I got the following errors when building 2.6.23-rc2-mm2 on both mips and arm. Both errors are very much alike. MIPS: /opt/crosstool/gcc-3.4.5-glibc-2.3.6/mips-unknown-linux-gnu/lib/gcc/mips-unknown-linux-gnu/3.4.5/include -D__KERNEL__ -Iinclude -Iinclude2 -I/home/compudj/git/linux-2.6-lttng/include -include include/linux/autoconf.h -I/home/compudj/git/linux-2.6-lttng/. -I. -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -mabi=32 -G 0 -mno-abicalls -fno-pic -pipe -msoft-float -ffreestanding -march=r5000 -Wa,--trap -I/home/compudj/git/linux-2.6-lttng/include/asm-mips/mach-ip22 -Iinclude/asm-mips/mach-ip22 -I/home/compudj/git/linux-2.6-lttng/include/asm-mips/mach-generic -Iinclude/asm-mips/mach-generic -D"VMLINUX_LOAD_ADDRESS=0x88002000" -fomit-frame-pointer -Wdeclaration-after-statement -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(asm_offsets)" -D"KBUILD_MODNAME=KBUILD_STR(asm_offsets)" -fverbose-asm -S -o arch/mips/kernel/asm-offsets.s /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c In file included from /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:58, from /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:13: /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:115: error: syntax error before "pgprot_t" /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:115: warning: no semicolon at end of struct or union /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:161: error: syntax error before '}' token /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:175: error: syntax error before "pgd_t" /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:175: warning: no semicolon at end of struct or union /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:229: error: syntax error before '}' token In file included from /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:13: /home/compudj/git/linux-2.6-lttng/include/linux/sched.h: In function `mmdrop': /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1509: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/sched.h: In function `arch_pick_mmap_layout': /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1762: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1763: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1764: error: dereferencing pointer to incomplete type In file included from /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:14: /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: In function `vma_nonlinear_insert': /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:968: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:969: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: In function `find_vma_intersection': /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:1078: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: In function `vma_pages': /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:1085: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:1085: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c: In 
function `output_mm_defines': /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:220: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:221: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:222: error: dereferencing pointer to incomplete type make[2]: *** [arch/mips/kernel/asm-offsets.s] Error 1 make[1]: *** [prepare0] Error 2 make: *** [_all] Error 2 ARM: /opt/crosstool/gcc-4.0.2-glibc-2.3.6/arm-unknown-linux-gnu/bin/arm-unknown-linux-gnu-gcc -Wp,-MD,arch/arm/kernel/.asm-offsets.s.d -nostdinc -isystem /opt/crosstool/gcc-4.0.2-glibc-2.3.6/arm-unknown-linux-gnu/lib/gcc/arm-unknown-linux-gnu/4.0.2/include -D__KERNEL__ -Iinclude -Iinclude2 -I/home/compudj/git/linux-2.6-lttng/include -include include/linux/autoconf.h -mlittle-endian -I/home/compudj/git/linux-2.6-lttng/. -I. -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -marm -fno-omit-frame-pointer -mapcs -mno-sched-prolog -mabi=apcs-gnu -mno-thumb-interwork -D__LINUX_ARM_ARCH__=4 -march=armv4 -mtune=strongarm110 -msoft-float -Uarm -fno-omit-frame-pointer -fno-optimize-sibling-calls -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_S
Re: 2.6.23-rc2-mm2 build error on MIPS and ARM
* Martin Schwidefsky ([EMAIL PROTECTED]) wrote: > On Fri, 2007-08-10 at 22:58 -0400, Mathieu Desnoyers wrote: > > I got the following errors when building 2.6.23-rc2-mm2 on both mips and > > arm. Both errors are very much alike. > > That attached patch should fix it for arm and mips. I'll try a few more > architectures until I'm fed up with compiling cross-compilers.. > It fixes the problem for me, thanks! Tested-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Mathieu > -- > blue skies, > Martin. > > "Reality continues to ruin my life." - Calvin. > --- > Subject: [PATCH] move mm_struct and vm_area_struct, compile fix. > > From: Martin Schwidefsky <[EMAIL PROTECTED]> > > Include in to make mips and arm > compile again after mm_struct and vm_area_struct have been moved. > > Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]> > --- > > include/linux/mm_types.h |1 + > 1 file changed, 1 insertion(+) > > diff -urpN linux-2.6/include/linux/mm_types.h > linux-2.6-patched/include/linux/mm_types.h > --- linux-2.6/include/linux/mm_types.h2007-08-11 11:30:10.0 > +0200 > +++ linux-2.6-patched/include/linux/mm_types.h2007-08-11 > 11:32:42.0 +0200 > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > > struct address_space; > > > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 04/28] Add cmpxchg64 and cmpxchg64_local to powerpc
Make sure that at least cmpxchg64_local is available on all architectures to use for unsigned long long values. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org --- include/asm-powerpc/system.h |6 ++ 1 file changed, 6 insertions(+) Index: linux-2.6-lttng/include/asm-powerpc/system.h === --- linux-2.6-lttng.orig/include/asm-powerpc/system.h 2007-08-27 11:42:08.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/system.h2007-08-27 11:42:43.0 -0400 @@ -485,6 +485,12 @@ __cmpxchg_local(volatile void *ptr, unsi */ #define NET_IP_ALIGN 0 #define NET_SKB_PADL1_CACHE_BYTES + +#define cmpxchg64 cmpxchg +#define cmpxchg64_localcmpxchg_local +#else +#include +#define cmpxchg64_local(ptr,o,n) __cmpxchg64_local_generic((ptr), (o), (n)) #endif #define arch_align_stack(x) (x) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [patch 04/28] Add cmpxchg64 and cmpxchg64_local to powerpc
* Paul Mackerras ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers writes: > > > Make sure that at least cmpxchg64_local is available on all architectures > > to use > > for unsigned long long values. > > > > Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> > > Acked-by: Paul Mackerras <[EMAIL PROTECTED]> Thanks Paul, Just make sure add-cmpxchg-local-to-generic-for-up.patch gets merged before this one, since it needs asm-generic/cmpxchg-local.h. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC patch 6/9] POWERPC remove -traditional
Subject: [RFC patch 6/9] Re: [PATCH] Stringify support commas > This is a no-no for those archs that still use -traditional. > > I dunno if this is a problem for you at the moment and the > > right fix is anyway to nuke -traditional. > > > > Sam Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Sam Ravnborg <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org --- arch/powerpc/boot/Makefile |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6-lttng/arch/powerpc/boot/Makefile === --- linux-2.6-lttng.orig/arch/powerpc/boot/Makefile 2008-04-21 10:03:05.0 -0400 +++ linux-2.6-lttng/arch/powerpc/boot/Makefile 2008-04-21 10:03:17.0 -0400 @@ -23,7 +23,7 @@ BOOTCFLAGS:= -Wall -Wundef -Wstrict- -fno-strict-aliasing -Os -msoft-float -pipe \ -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \ -isystem $(shell $(CROSS32CC) -print-file-name=include) -BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAGS) -traditional -nostdinc +BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAGS) -nostdinc ifdef CONFIG_DEBUG_INFO BOOTCFLAGS += -g -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC patch 6/9] POWERPC remove -traditional
* Paul Mackerras ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers writes: > > Subject: [RFC patch 6/9] Re: [PATCH] Stringify support commas > > > This is a no-no for those archs that still use -traditional. > > > > I dunno if this is a problem for you at the moment and the > > > > right fix is anyway to nuke -traditional. > > > > > > > > Sam > > > > Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> > > CC: Sam Ravnborg <[EMAIL PROTECTED]> > > CC: [EMAIL PROTECTED] > > CC: linuxppc-dev@ozlabs.org > > --- > > arch/powerpc/boot/Makefile |2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > Looks OK. I'll put it in my tree. > > Paul. OK, although it won't be needed with v2 of the stringify with commas. There seems to be some important details which differ between -traditional and stdc gcc operation which makes it non-trivial to remove the -traditional, at least on sparc. Maybe Kyle could tell us a bit more about the nature of these limitations ? I would wait a bit before merging until we at least identify what those limitations are. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
;> ---[ end trace ffb4a6e37d78370a ]--- >> >> --- >> Oops: Exception in kernel mode, sig: 11 [#1] >> Exsw1600 >> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P) >> shm_freescale(P) ath_remote_regs_freescale(P) exsw >> load_local_fpga_freescale 8021q poe_fi >> NIP: c00bd6fc LR: c00bd6fc CTR: c00bd6c8 >> REGS: decd5cf0 TRAP: 0700 Tainted: P (2.6.27-rc2) >> MSR: 00029000 CR: 8882 XER: 2000 >> TASK = d9f5a9a0[6897] 'sh' THREAD: decd4000 >> GPR00: decd5da0 d9f5a9a0 dd801180 decd5de8 0004 >> fff9 >> GPR08: fffa 00d5ca28 decd5d90 2ccc1686 100ad874 1fffb200 >> >> GPR16: 007fff00 decd5ec4 0008 c02fe298 00200200 c031aa34 >> decd5de8 >> GPR24: decd5df4 1af2 dd507d00 c038 dd801180 decd5de8 >> decd5da0 >> NIP [c00bd6fc] d_lookup+0x40/0x90 >> LR [c00bd6fc] d_lookup+0x40/0x90 >> Call Trace: >> [decd5dc0] [c00bd7d8] d_hash_and_lookup+0x8c/0xc4 >> [decd5de0] [c00ed728] proc_flush_task+0xb4/0x268 >> [decd5e40] [c0032894] release_task+0x6c/0x39c >> [decd5e70] [c0033238] wait_consider_task+0x674/0x920 >> [decd5eb0] [c0033610] do_wait+0x12c/0x3d4 >> [decd5f20] [c0033954] sys_wait4+0x9c/0xe8 >> [decd5f40] [c0010554] ret_from_syscall+0x0/0x3c >> Instruction dump: >> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0 >> 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b52a0 >> ---[ end trace 00106ff4ec3b9c22 ]--- >> >> >> >> > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers wrote: >> >> Steve ? I just noticed this : >> >> >> ftrace_modify_code(unsigned long ip, unsigned char *old_code, >>unsigned char *new_code) >> { >> unsigned replaced; >> unsigned old = *(unsigned *)old_code; >> unsigned new = *(unsigned *)new_code; >> int faulted = 0; >> >> /* >> * Note: Due to modules and __init, code can >> * disappear and change, we need to protect against faulting >> * as well as code changing. >> * >> * No real locking needed, this code is run through >> * kstop_machine. >> */ >> asm volatile ( >> "1: lwz %1, 0(%2)\n" >> " cmpw%1, %5\n" >> " bne 2f\n" >> " stwu%3, 0(%2)\n" >> "2:\n" >> ".section .fixup, \"ax\"\n" >> "3: li %0, 1\n" >> " b 2b\n" >> ".previous\n" >> ".section __ex_table,\"a\"\n" >> _ASM_ALIGN "\n" >> _ASM_PTR "1b, 3b\n" >> ".previous" >> : "=r"(faulted), "=r"(replaced) >> : "r"(ip), "r"(new), >> "0"(faulted), "r"(old) >> : "memory"); >> >> if (replaced != old && replaced != new) >> faulted = 2; >> >> if (!faulted) >> flush_icache_range(ip, ip + 8); >> >> return faulted; >> } >> >> What happens if you are really unlucky and get the following execution >> sequence ? >> >> >> Load module A at address 0xddfc3e00 >> Populate ftrace function table while the system runs >> Unload module A >> Load module B at address 0xddfc3e00 >> Activate ftrace >> -> At that point, since there is another module loaded in the same >> address space, it won't fault. If there happens to be an instruction >> which looks exactly like the instruction we are replacing at the same >> location, we are introducing a bug. True both on x86 ans powerpc... >> >> > > Hi Mathieu, > > Yep I know of this issue, and it is very unlikely it would happen. But > that's not good enough, so this is why I did this: > > http://marc.info/?l=linux-kernel&m=121876928124384&w=2 > http://marc.info/?l=linux-kernel&m=121876938524523&w=2 > > ;-) > > On powerpc it would be less likely an issue since the code segments are all > 4 bytes aligned. And a call being replaced would be a call to mcount > (relative jump). A call to mcount would be the same for both. Then again, > we could be replacing a nop, but most likely that wouldn't hurt either, > since nops are seldom used, and if we did call mcount, it would probably > not hurt. But it would be an issue if Module B happen to put a data section > that matched the code to jmp to mcount or a nop. > Ah, I did not see this one passing by :) That's the right solution indeed. I guess it's not in 2.6.27-rc2/rc3 though ? I wonder if the bug can be repeated with a module-free kernel ? Mathieu > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Eran Liberty ([EMAIL PROTECTED]) wrote: > Steven Rostedt wrote: >> Eran Liberty wrote: >>> After compiling a kernel with ftrace I started to experience all sorts of >>> crashes. >> >> Just to make sure... >> >> ftrace enables markers too, and RCU has tracing with the markers. This may >> not be the problem, but I just want to eliminate as many variables as >> possible. >> Could you disable ftrace, but keep the markers on too. Also, could you >> enable ftrace again and turn on the FTRACE_STARTUP_TEST. > > for the fun of it I took out all my propriety modules; so now its a non > tainted kernel. > > Here is the matrix: > > !FTRACE x !MARKERS => stable > !FTRACE x MARKERS => stable > FTRACE x !MARKERS => n/a (FTRACE forces MARKERS) > FTRACE x MARKERS => unstable > FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed > > Testing tracer sched_switch: PASSED > Testing tracer ftrace: PASSED > Testing dynamic ftrace: PASSED > > Oops: Exception in kernel mode, sig: 11 [#1] > Exsw1600 > Modules linked in: > NIP: c00bbb20 LR: c00bbb20 CTR: > REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2) > MSR: 00029000 CR: 24082282 XER: 2000 > TASK = ddcce060[1707] 'find' THREAD: dd5b > GPR00: dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b > 100234ec > GPR08: c080 00019330 dd5b1d20 24000288 100ad874 100936f8 > 1008a1d0 > GPR16: 10083f80 dd5b1e2c dd5b1d68 fff4 c038 dd5b1d60 dd5b1d58 > dd802084 > GPR24: dc3d7700 dd802018 dd5b1d68 c038 dd801180 dd5b1d68 > dd5b1d00 > NIP [c00bbb20] d_lookup+0x40/0x90 > LR [c00bbb20] d_lookup+0x40/0x90 > Call Trace: > [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable) Can you check if, at some point during the system execution (starting from boot), 0xdd5b1d58 is an address where a module is loaded ? (the module can be later unloaded, what I wonder is if this address would appear to have had a loaded+unloaded module). Actually, could you try to compile your kernel without "MODULE_UNLOAD" ? Mathieu > [dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220 > [dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54 > [dd5b1dc0] [c00b1288] path_walk+0x58/0xe0 > [dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c > [dd5b1e20] [c00b20f4] user_path_at+0x64/0xac > [dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74 > [dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48 > [dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c > [dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c > Instruction dump: > 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 > 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 > ---[ end trace 1eb8fd7adac2bb65 ]--- > > Liberty > > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote: > > > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but > > > that > > > is probably not needed since all of this is done in kstop_machine. Also, > > > only the "get" is needed. If we don't fault there, we wont fault on the > > > put (unless we have permissions wrong, and that would be a bug). > > > > Would it ? How do we make sure the kernel text is mapped writeable ? > > We map it writeable if FTRACE is enabled. > Argh. See text_poke(). It's there exactly for this purpose on x86. Mathieu > > > > > So are you recommending something like > > > > > > int cmd; > > > > > > if (__get_user(cmd, ip)) > > > goto fault; > > > > > > if (cmd != old) > > > goto not_same; > > > > > > WARN_ON_ONCE(__put_user(cmd, ip)); > > > > > > If we did this, we could probably put this into the generic code: > > > > That would work I suppose, I'll give it a try. > > > > > if (copy_from_user(cmd, ip, ARCH_CALL_SIZE)) > > > goto fault; > > > > > > if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0) > > > goto not_same; > > > > > > WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE)); > > > > You need the __ variants or the access_ok() checks will bite > > you bad. > > Ah, good point ;-) > > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
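For reference, a rough sketch of the compare-then-write flow being discussed, expressed with generic helpers instead of inline asm; this is only an illustration under the assumption that probe_kernel_read()/probe_kernel_write() style accessors (which ftrace later moved to) are available, and MCOUNT_INSN_SIZE stands in for the architecture's call-site size:

static int ftrace_modify_code_sketch(unsigned long ip,
				     unsigned char *old_code,
				     unsigned char *new_code)
{
	unsigned char cur[MCOUNT_INSN_SIZE];

	/* Code may have vanished (module/__init); a fault here is not fatal. */
	if (probe_kernel_read(cur, (void *)ip, MCOUNT_INSN_SIZE))
		return -EFAULT;

	/* Only patch sites that still contain exactly what we expect. */
	if (memcmp(cur, old_code, MCOUNT_INSN_SIZE) != 0)
		return -EINVAL;

	/* Runs under kstop_machine, so no concurrent execution of the site. */
	if (probe_kernel_write((void *)ip, new_code, MCOUNT_INSN_SIZE))
		return -EPERM;

	flush_icache_range(ip, ip + MCOUNT_INSN_SIZE);
	return 0;
}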
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Mon, 18 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote: > > > > > > > > > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but > > > > > that > > > > > is probably not needed since all of this is done in kstop_machine. > > > > > Also, > > > > > only the "get" is needed. If we don't fault there, we wont fault on > > > > > the > > > > > put (unless we have permissions wrong, and that would be a bug). > > > > > > > > Would it ? How do we make sure the kernel text is mapped writeable ? > > > > > > We map it writeable if FTRACE is enabled. > > > > > > > Argh. See text_poke(). It's there exactly for this purpose on x86. > > Ouch, I just did. text_poke is quite heavy. It would be interesting to see > that performed on 20,000 locations at one time. I could play with it, but > I'm a bit nervous. > It's already used to modify the LOCK prefixes in alternative.c and did not seem to be too slow for that.. it should therefore be ok. Mathieu > -- Steve -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Eran Liberty ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers wrote: >> Can you check if, at some point during the system execution (starting >> from boot), 0xdd5b1d58 is an address where a module is loaded ? (the >> module can be later unloaded, what I wonder is if this address would >> appear to have had a loaded+unloaded module). >> >> Actually, could you try to compile your kernel without "MODULE_UNLOAD" ? >> >> Mathieu >> > No modules... > > ~ # uname -a > Linux 2.6.27-rc2 #6 Tue Aug 19 12:33:34 IDT 2008 ppc unknown > ~ # gunzip /proc/config.gz -c | grep UNLOAD > # CONFIG_MODULE_UNLOAD is not set > ~ # gunzip /proc/config.gz -c | grep FTRACE > CONFIG_HAVE_FTRACE=y > CONFIG_HAVE_DYNAMIC_FTRACE=y > CONFIG_FTRACE=y > CONFIG_DYNAMIC_FTRACE=y > CONFIG_FTRACE_SELFTEST=y > CONFIG_FTRACE_STARTUP_TEST=y > ~ # gunzip /proc/config.gz -c | grep MARKERS > CONFIG_MARKERS=y > ~ # lsmod > Module Size Used by > ~ # while [ 1 ] ; do find / > /dev/null ; echo . ; done > . > Oops: Exception in kernel mode, sig: 11 [#1] > Exsw1600 > Modules linked in: > NIP: c00bb724 LR: c00bb724 CTR: > REGS: d8b59c50 TRAP: 0700 Not tainted (2.6.27-rc2) > MSR: 00029000 CR: 24082282 XER: 2000 > TASK = dbc68de0[1712] 'find' THREAD: d8b58000 > GPR00: d8b59d00 dbc68de0 dd801180 d8b59d68 d8b59d58 dd8019db > 100234ec > GPR08: c080 00019330 d8b59d20 24000288 100ad874 100936f8 > 1008a1d0 > GPR16: 10083f80 d8b59e2c d8b59d68 fff4 c038 d8b59d60 d8b59d58 > dd802084 Register GPR30 is always zero... (also true for the other oops) > GPR24: ddc69500 dd802018 d8b59d68 c038 dd801180 d8b59d68 > d8b59d00 > NIP [c00bb724] d_lookup+0x40/0x90 > LR [c00bb724] d_lookup+0x40/0x90 > Call Trace: > [d8b59d00] [d8b59d58] 0xd8b59d58 (unreliable) > [d8b59d20] [c00ae7c8] do_lookup+0xe8/0x220 > [d8b59d50] [c00b0684] __link_path_walk+0x5a4/0xd54 > [d8b59dc0] [c00b0e8c] path_walk+0x58/0xe0 > [d8b59df0] [c00b0ffc] do_path_lookup+0x78/0x13c > [d8b59e20] [c00b1cf8] user_path_at+0x64/0xac > [d8b59e90] [c00a8c98] vfs_lstat_fd+0x34/0x74 > [d8b59ec0] [c00a8d6c] vfs_lstat+0x30/0x48 > [d8b59ed0] [c00a8db4] sys_lstat64+0x30/0x5c > [d8b59f40] [c0010554] ret_from_syscall+0x0/0x3c > Instruction dump: > 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 Just like this instruction... maybe related ? Mathieu > 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 > ---[ end trace 7766edd310cd3442 ]--- > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
t;, nd=0xd99b3e28) at > fs/namei.c:1041 > #6 0xc00b0ffc in do_path_lookup (dfd=, > name=0xddc23000 "/proc/mtd", flags=, nd=0xd99b3e28) at > fs/namei.c:1091 > #7 0xc00b1cf8 in user_path_at (dfd=-100, name=, > flags=0, path=0xd99b3e98) at fs/namei.c:1334 > #8 0xc00a8c98 in vfs_lstat_fd (dfd=-578809472, name=0xd99b3d68 "", > stat=0xd99b3ed8) at fs/stat.c:83 > #9 0xc00a8d6c in vfs_lstat (name=0xd99b3d68 "", stat=0xd99b3d58) at > fs/stat.c:93 > #10 0xc00a8db4 in sys_lstat64 (filename=0xdd801180 "", statbuf=0xbfe285d8) > at fs/stat.c:381 > #11 0xc0010554 in syscall_dotrace_cont () > > both cp & lr points to 0xC00BB724 > (gdb) info registers > r0 0x00 > r1 0xd99b3d003650829568 > r2 0xddccf2e03721196256 > r3 0xdd8011803716157824 > r4 0xd99b3d683650829672 > r5 0xd99b3d583650829656 > r6 0xdd822a5b3716295259 > r7 0x100234ec268580076 > r8 0xc0803229614080 > r9 0x19330103216 > r100x65535 > r110xd99b3d203650829600 > r120x24000288603980424 > r130x100ad874269146228 > r140x100936f8269039352 > r150x1008a1d0269001168 > r160x10083f80268976000 > r170xd99b3e2c3650829868 > r180xd99b3d683650829672 > r190xfff44294967284 > r200xc0383224895488 > r210xd99b3d603650829664 > r220xd99b3d583650829656 > r230xdd8020843716161668 > r240xdc3fb2803695161984 > r250xdd8020183716161560 > r260xd99b3d683650829672 > r270xc0383224895488 > r280xdd8011803716157824 > r290xd99b3d683650829672 > r300x00 > r310xd99b3d003650829568 > pc 0xc00bb7243221993252 > cr 0x24082282604512898 > lr 0xc00bb7243221993252 > ctr0x00 > xer0x2000536870912 > > (gdb) li *0xC00BB724 > Line 1361 of "fs/dcache.c" starts at address 0xc00bb724 and > ends at 0xc00bb728 . > > (gdb) disassemble 0xC00BB724 > Dump of assembler code for function d_lookup: > 0xc00bb6e4 :mflrr0 > 0xc00bb6e8 :stw r0,4(r1) > 0xc00bb6ec :nop > 0xc00bb6f0 :stwur1,-32(r1) > 0xc00bb6f4 :mflrr0 > 0xc00bb6f8 :stmwr27,12(r1) > 0xc00bb6fc :lis r27,-16328 > 0xc00bb700 :mr r31,r1 > 0xc00bb704 :stw r0,36(r1) > 0xc00bb708 :mr r28,r3 > 0xc00bb70c :mr r29,r4 > 0xc00bb710 : lwz r30,12960(r27) > 0xc00bb714 :andi. r0,r30,1 > 0xc00bb718 :mr r3,r28 > 0xc00bb71c :mr r4,r29 > 0xc00bb720 :bnela- 0x2c > 0xc00bb724 :.long 0x0 > 0xc00bb728 :cmpwi cr7,r3,0 > 0xc00bb72c :bne-cr7,0xc00bb75c > 0xc00bb730 :lwz r0,12960(r27) > 0xc00bb734 :cmpwcr7,r0,r30 > 0xc00bb738 :beq-cr7,0xc00bb75c > 0xc00bb73c :mr r30,r0 > 0xc00bb740 :andi. r0,r30,1 > 0xc00bb744 :mr r3,r28 > 0xc00bb748 :mr r4,r29 > 0xc00bb74c :beq+0xc00bb724 > 0xc00bb750 :lwz r0,12960(r27) > 0xc00bb754 :mr r30,r0 > 0xc00bb758 :b 0xc00bb740 > 0xc00bb75c :lwz r11,0(r1) > 0xc00bb760 :lwz r0,4(r11) > 0xc00bb764 :lmw r27,-20(r11) > 0xc00bb768 :mtlrr0 > 0xc00bb76c :mr r1,r11 > 0xc00bb770 :blr > End of assembler dump. > (gdb) > > Hope this is helpfull > Can you also give us objdump -S --start-address=0xC00BB724 vmlinux | head 20 ? Then we could compare the result with the OOPS instruction dump : 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 Mathieu > Liberty > > > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Eran Liberty ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers wrote: >> Can you also give us >> >> objdump -S --start-address=0xC00BB724 vmlinux | head 20 >> >> ? >> >> Then we could compare the result with the OOPS instruction dump : >> >> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 >> 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 >> >> Mathieu >> >> > > to give you more context I have run: > > powerpc-linux-gnu-objdump -S --start-address=0xC00BB700 vmlinux | head -n > 60 > > the discrepancy starts at address: > c00bb720: 40 82 00 30 <=> 40 82 00 2f > c00bb724: 4b ff fe 61 <=> 00 00 00 00 > Ah ! I think I see what could be wrong : First we have : static unsigned int ftrace_nop = 0x6000; We probably replace the original function call by this nop. Then we do : notrace unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr) { static unsigned int op; /* * It would be nice to just use create_function_call, but that will * update the code itself. Here we need to just return the * instruction that is going to be modified, without modifying the * code. */ addr = GET_ADDR(addr); /* Set to "bl addr" */ op = 0x4801 | (ftrace_calc_offset(ip, addr) & 0x03fc); /* * No locking needed, this must be called via kstop_machine * which in essence is like running on a uniprocessor machine. */ return (unsigned char *)&op; } And I guess we must be doing a 0x4801 | 0x0; or something ? Also, we have to consider that POWERPC 64 functions are : /* PowerPC64's functions are data that points to the functions */ And this does not seem to hold for ppc32. Therefore, it is strange to me that the same code is used for the update.. are we updating the correct site ? Mathieu > vmlinux: file format elf32-powerpc > > Disassembly of section .text: > > c00bb700 : > * d_lookup() is protected against the concurrent renames in some unrelated > * directory using the seqlockt_t rename_lock. > */ > > struct dentry * d_lookup(struct dentry * parent, struct qstr * name) > { > c00bb700: 7c 3f 0b 78 mr r31,r1 > c00bb704: 90 01 00 24 stw r0,36(r1) > c00bb708: 7c 7c 1b 78 mr r28,r3 > c00bb70c: 7c 9d 23 78 mr r29,r4 > c00bb710: 83 db 32 a0 lwz r30,12960(r27) > { >unsigned ret; > > repeat: >ret = sl->sequence; >smp_rmb(); > c00bb714: 73 c0 00 01 andi. r0,r30,1 >struct dentry * dentry = NULL; >unsigned long seq; > >do { >seq = read_seqbegin(&rename_lock); >dentry = __d_lookup(parent, name); > c00bb718: 7f 83 e3 78 mr r3,r28 > c00bb71c: 7f a4 eb 78 mr r4,r29 >if (unlikely(ret & 1)) { > c00bb720: 40 82 00 30 bne-c00bb750 > c00bb724: 4b ff fe 61 bl c00bb584 <__d_lookup> >if (dentry) > c00bb728: 2f 83 00 00 cmpwi cr7,r3,0 > c00bb72c: 40 9e 00 30 bne-cr7,c00bb75c > * > * If sequence value changed then writer changed data while in section. > */ > static __always_inline int read_seqretry(const seqlock_t *sl, unsigned > start) > { >smp_rmb(); > c00bb730: 80 1b 32 a0 lwz r0,12960(r27) >break; >} while (read_seqretry(&rename_lock, seq)); > c00bb734: 7f 80 f0 00 cmpwcr7,r0,r30 > c00bb738: 41 9e 00 24 beq-cr7,c00bb75c > /* Start of read calculation -- fetch last complete writer token */ > static __always_inline unsigned read_seqbegin(const seqlock_t *sl) > { >unsigned ret; > > repeat: > c00bb73c: 7c 1e 03 78 mr r30,r0 >ret = sl->sequence; >smp_rmb(); > c00bb740: 73 c0 00 01 andi. 
r0,r30,1 >struct dentry * dentry = NULL; > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
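To make the "0x4...01 | 0x0" suspicion concrete, here is a small worked sketch of the call-site encodings, assuming the standard powerpc instruction formats (nop is 0x60000000, and a relative "bl" is 0x48000001 or'ed with the branch offset masked to 0x03fffffc); the addresses are taken from the d_lookup() disassembly above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t ip   = 0xc00bb724;	/* patched site inside d_lookup()  */
	uint32_t addr = 0xc00bb584;	/* __d_lookup, the branch target   */
	uint32_t nop  = 0x60000000;	/* what dynamic ftrace writes      */
	uint32_t bl   = 0x48000001 | ((addr - ip) & 0x03fffffc);

	/* prints 0x4bfffe61, i.e. the "4b ff fe 61  bl __d_lookup" above;
	 * the 0x00000000 word in the Oops dump is neither the nop nor the
	 * bl, which is exactly the anomaly under discussion. */
	printf("nop=0x%08x bl=0x%08x\n", nop, bl);
	return 0;
}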
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > On Mon, 18 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote: > > > > > > > > > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but > > > > > that > > > > > is probably not needed since all of this is done in kstop_machine. > > > > > Also, > > > > > only the "get" is needed. If we don't fault there, we wont fault on > > > > > the > > > > > put (unless we have permissions wrong, and that would be a bug). > > > > > > > > Would it ? How do we make sure the kernel text is mapped writeable ? > > > > > > We map it writeable if FTRACE is enabled. > > > > > > > Argh. See text_poke(). It's there exactly for this purpose on x86. > > > > OK, I just tried text_poke and it unfortunately fails. The problem is that > it requires that the text you are changing is aligned and fits on one > page. We have no control over that. > > -- Steve > Ok, there are two cases where it's ok : 1 - in stop_machine, considering we are not touching code executed in NMI handlers. 2 - when using my replace_instruction_safe() which uses a temporary breakpoint when doing the instruction replacement. In those cases you could use text_poke_early(). See http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=7789e2c75bf03e645f15759d5dff0c1698493f92;hb=HEAD For a use example. Basically it looks like : 360 pages[0] = virt_to_page((void *)bypass_eip); 361 vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL); 362 BUG_ON(!vaddr); 363 text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK], 364 (void *)addr, size); 365 /* 366 * Fill the rest with nops. 367 */ 368 len = NR_NOPS - size; 369 add_nops((void *) 370 &vaddr[(bypass_eip & ~PAGE_MASK) + size], 371 len); 372 print_dbg_bytes("inserted nops", 373 &vaddr[(bypass_eip & ~PAGE_MASK) + size], len); 374 vunmap(vaddr); Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] sputrace : use marker_synchronize_unregister()
We need a marker_synchronize_unregister() before the end of exit() to make sure all probe callers have exited the non-preemptible section and thus are not executing the probe code anymore. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Ingo Molnar <[EMAIL PROTECTED]> CC: Jeremy Kerr <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org CC: Christoph Hellwig <[EMAIL PROTECTED]> --- arch/powerpc/platforms/cell/spufs/sputrace.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c === --- linux-2.6-lttng.orig/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:34:58.0 -0400 +++ linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:35:15.0 -0400 @@ -233,6 +233,7 @@ static void __exit sputrace_exit(void) remove_proc_entry("sputrace", NULL); kfree(sputrace_log); + marker_synchronize_unregister(); } module_init(sputrace_init); -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 5/5] sputrace : use marker_synchronize_unregister()
We need a marker_synchronize_unregister() before the end of exit() to make sure all probe callers have exited the non-preemptible section and thus are not executing the probe code anymore. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Ingo Molnar <[EMAIL PROTECTED]> CC: Jeremy Kerr <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org CC: Christoph Hellwig <[EMAIL PROTECTED]> --- arch/powerpc/platforms/cell/spufs/sputrace.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c === --- linux-2.6-lttng.orig/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:34:58.0 -0400 +++ linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:35:15.0 -0400 @@ -233,6 +233,7 @@ static void __exit sputrace_exit(void) remove_proc_entry("sputrace", NULL); kfree(sputrace_log); + marker_synchronize_unregister(); } module_init(sputrace_init); -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
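For context, the generic shape of the tear-down ordering these two patches are about (a hedged sketch with made-up probe/data names, not the sputrace code; marker_probe_unregister() and marker_synchronize_unregister() are the markers API of that period):

static void __exit my_tracer_exit(void)
{
	/* my_event/my_probe/my_log are hypothetical. */
	marker_probe_unregister("my_event", my_probe, NULL);

	/*
	 * Wait until every cpu has left the preempt-disabled section in
	 * which probe callbacks run, so no cpu can still be executing
	 * my_probe() or touching my_log ...
	 */
	marker_synchronize_unregister();

	/* ... before the resources the probe uses are released. */
	kfree(my_log);
}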
[patch 2/3] Add missing DATA_DATA in powerpc
Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org -- arch/powerpc/kernel/vmlinux.lds.S |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S === --- linux-2.6-lttng.orig/arch/powerpc/kernel/vmlinux.lds.S 2007-06-16 22:17:53.0 -0400 +++ linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S 2007-06-16 22:18:32.0 -0400 @@ -173,7 +173,9 @@ } #else .data : { - *(.data .data.rel* .toc1) + DATA_DATA + *(.data.rel*) + *(.toc1) *(.branch_lt) } -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: + add-missing-data_data-in-powerpc.patch added to -mm tree
Please add to patch header: Newer asm-generic/vmlinux.lds.h provides a DATA_DATA macro which defines the content of what is normally referred to by *(.data). It permits adding more "subsections" (declared by gcc with the section attribute or in assembly), whichwill be linked in the .data section. * [EMAIL PROTECTED] ([EMAIL PROTECTED]) wrote: > > The patch titled > powerpc: add missing DATA_DATA > has been added to the -mm tree. Its filename is > add-missing-data_data-in-powerpc.patch > > *** Remember to use Documentation/SubmitChecklist when testing your code *** > > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find > out what to do about this > > -- > Subject: powerpc: add missing DATA_DATA > From: Mathieu Desnoyers <[EMAIL PROTECTED]> > > Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> > Cc: <[EMAIL PROTECTED]> > Cc: > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> > --- > > arch/powerpc/kernel/vmlinux.lds.S |4 +++- > 1 files changed, 3 insertions(+), 1 deletion(-) > > diff -puN arch/powerpc/kernel/vmlinux.lds.S~add-missing-data_data-in-powerpc > arch/powerpc/kernel/vmlinux.lds.S > --- a/arch/powerpc/kernel/vmlinux.lds.S~add-missing-data_data-in-powerpc > +++ a/arch/powerpc/kernel/vmlinux.lds.S > @@ -174,7 +174,9 @@ SECTIONS > } > #else > .data : { > - *(.data .data.rel* .toc1) > + DATA_DATA > + *(.data.rel*) > + *(.toc1) > *(.branch_lt) > } > > _ > > Patches currently in -mm which might be from [EMAIL PROTECTED] are > > powerpc-promc-remove-undef-printk.patch > i386-text-edit-lock.patch > i386-text-edit-lock-alternative-instructions.patch > i386-text-edit-lock-kprobes.patch > immediate-values-global-modules-list-and-module-mutex.patch > immediate-value-architecture-independent-code.patch > immediate-values-non-optimized-architectures.patch > immediate-value-add-kconfig-menus.patch > immediate-values-kprobe-header-fix.patch > immediate-value-i386-optimization.patch > immediate-value-powerpc-optimization.patch > immediate-value-documentation.patch > f00f-bug-fixup-for-i386-use-immediate-values.patch > scheduler-profiling-use-immediate-values.patch > use-data_data-in-cris.patch > add-missing-data_data-in-powerpc.patch > use-data_data-in-xtensa.patch > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Powerpc - Include pagemap.h in asm/powerpc/tlb.h
Powerpc - Include pagemap.h in asm/powerpc/tlb.h Fixes this powerpc build error in 2.6.22-rc6-mm1 for powerpc 64 : In file included from include2/asm/tlb.h:60, from /home/compudj/git/linux-2.6-lttng/arch/powerpc/mm/init_64. c:56: /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h: In function 'tlb_fl ush_mmu': /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h:76: error: implicit declaration of function 'release_pages' /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h: In function 'tlb_re move_page': /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h:105: error: implicit declaration of function 'page_cache_release' make[2]: *** [arch/powerpc/mm/init_64.o] Error 1 release_pages is declared in linux/pagemap.h, but cannot be included in linux/swap.h because of a sparc related comment: /* only sparc can not include linux/pagemap.h in this file * so leave page_cache_release and release_pages undeclared... */ #define free_page_and_swap_cache(page) \ page_cache_release(page) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr), 0); Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org CC: Paul Mackerras <[EMAIL PROTECTED]> --- include/asm-powerpc/tlb.h |2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6-lttng/include/asm-powerpc/tlb.h === --- linux-2.6-lttng.orig/include/asm-powerpc/tlb.h 2007-07-13 11:30:54.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/tlb.h 2007-07-13 11:31:22.0 -0400 @@ -23,6 +23,8 @@ #include #endif +#include + struct mmu_gather; #define tlb_start_vma(tlb, vma)do { } while (0) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 2/3] Add missing DATA_DATA in powerpc
Newer asm-generic/vmlinux.lds.h provides a DATA_DATA macro which defines the content of what is normally referred to by *(.data). It permits adding more "subsections" (declared by gcc with the section attribute or in assembly), which will be linked in the .data section. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org -- arch/powerpc/kernel/vmlinux.lds.S |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S === --- linux-2.6-lttng.orig/arch/powerpc/kernel/vmlinux.lds.S 2007-07-11 11:55:48.0 -0400 +++ linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S 2007-07-11 15:48:42.0 -0400 @@ -174,7 +174,9 @@ } #else .data : { - *(.data .data.rel* .toc1) + DATA_DATA + *(.data.rel*) + *(.toc1) *(.branch_lt) } -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
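To illustrate the "subsections" this description refers to, a minimal hedged sketch of C code emitting an object into a named .data subsection via the gcc section attribute; the section and symbol names are made up, and it assumes the central DATA_DATA definition in asm-generic/vmlinux.lds.h is the place that lists such subsections for every architecture:

struct my_probe_desc {
	const char *name;
	const char *format;
};

/* Lands in the ".data.my_subsection" input section of this object file. */
static struct my_probe_desc
__attribute__((used, section(".data.my_subsection")))
my_probe_point = {
	.name	= "my_event",
	.format	= "arg %d",
};

/*
 * With the arch linker script using DATA_DATA instead of a literal
 * *(.data), a single *(.data.my_subsection) entry added to the central
 * macro is enough for all architectures to link the object into .data.
 */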
Re: Powerpc - Include pagemap.h in asm/powerpc/tlb.h
7;: /home/compudj/git/linux-2.6-lttng/include/linux/pagemap.h:207: error: implicit declaration of function 'PageWriteback' /home/compudj/git/linux-2.6-lttng/include/linux/pagemap.h:208: error: 'PG_writeback' undeclared (first use in this function) In file included from /home/compudj/git/linux-2.6-lttng/include/linux/rmap.h:9, from /home/compudj/git/linux-2.6-lttng/init/main.c:49: /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: At top level: /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:605: error: conflicting types for 'page_address' include2/asm/highmem.h:61: error: previous implicit declaration of 'page_address' was here PageHighMem is declared in linux/page-flags.h, which is included in linux/mm.h with the following comment: /* * FIXME: take this include out, include page-flags.h in * files which need it (119 of them) */ #include (linux/mm.h is included from linux/highmem.h) Actually, we get a circular inclusion there: In file included from /home/compudj/git/linux-2.6-lttng/include/linux/highmem.h:24, from /home/compudj/git/linux-2.6-lttng/include/linux/pagemap.h:10, from /home/compudj/git/linux-2.6-lttng/include/linux/swap.h:9, from include2/asm/pgtable.h:15, from /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:38, from /home/compudj/git/linux-2.6-lttng/include/linux/rmap.h:9, from /home/compudj/git/linux-2.6-lttng/init/main.c:49: mm.h includes asm-sparc/pgtable.h includes linux/swap.h includes linux/pagemap.h (which I have added myself) includes linux/highmem.h includes mm.h. Is it me or it all looks like a fubarish mess ? :( Why is asm-sparc/pgtable.h including linux/swap.h ? Perharps the sparc devs will be able to enlighten us... Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
- On Jul 7, 2020, at 1:50 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Christophe Leroy's message of July 6, 2020 7:53 pm: >> >> >> Le 06/07/2020 à 04:18, Nicholas Piggin a écrit : >>> diff --git a/arch/powerpc/include/asm/exception-64s.h >>> b/arch/powerpc/include/asm/exception-64s.h >>> index 47bd4ea0837d..b88cb3a989b6 100644 >>> --- a/arch/powerpc/include/asm/exception-64s.h >>> +++ b/arch/powerpc/include/asm/exception-64s.h >>> @@ -68,6 +68,10 @@ >>>* >>>* The nop instructions allow us to insert one or more instructions to >>> flush the >>>* L1-D cache when returning to userspace or a guest. >>> + * >>> + * powerpc relies on return from interrupt/syscall being context >>> synchronising >>> + * (which hrfid, rfid, and rfscv are) to support >>> ARCH_HAS_MEMBARRIER_SYNC_CORE >>> + * without additional additional synchronisation instructions. >> >> This file is dedicated to BOOK3S/64. What about other ones ? >> >> On 32 bits, this is also valid as 'rfi' is also context synchronising, >> but then why just add some comment in exception-64s.h and only there ? > > Yeah you're right, I basically wanted to keep a note there just in case, > because it's possible we would get a less synchronising return (maybe > unlikely with meltdown) or even return from a kernel interrupt using a > something faster (e.g., bctar if we don't use tar register in the kernel > anywhere). > > So I wonder where to add the note, entry_32.S and 64e.h as well? > For 64-bit powerpc, I would be tempted to either place the comment in the header implementing the RFI_TO_USER and RFI_TO_USER_OR_KERNEL macros or the .S files using them, e.g. either: arch/powerpc/include/asm/exception-64e.h arch/powerpc/include/asm/exception-64s.h or arch/powerpc/kernel/exceptions-64s.S arch/powerpc/kernel/entry_64.S And for 32-bit powerpc, AFAIU arch/powerpc/kernel/entry_32.S uses SYNC + RFI to return to user-space. RFI is defined in arch/powerpc/include/asm/ppc_asm.h So a comment either near the RFI define and its uses should work. > I should actually change the comment for 64-bit because soft masked > interrupt replay is an interesting case. I thought it was okay (because > the IPI would cause a hard interrupt which does do the rfi) but that > should at least be written. Yes. > The context synchronisation happens before > the Linux IPI function is called, but for the purpose of membarrier I > think that is okay (the membarrier just needs to have caused a memory > barrier + context synchronistaion by the time it has done). Can you point me to the code implementing this logic ? Thanks, Mathieu > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Failure to build librseq on ppc
Hi Boqun, I'm trying to build librseq at: https://git.kernel.org/pub/scm/libs/librseq/librseq.git on powerpc, and I get these errors when building the rseq basic test mirrored from the kernel selftests code: /tmp/ccieEWxU.s: Assembler messages: /tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:118: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:121: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:626: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:629: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:735: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:738: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:741: Error: junk at end of line: `,8' Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed I am using this compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12) Target: powerpc-linux-gnu So far, I got things to build by changing "m" operands to "Q" operands. Based on https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints it seems that "Q" means "A memory operand addressed by just a base register." I suspect that lwz and stw don't expect some kind of immediate offset which can be kept with "m", and "Q" fixes this. Is that the right fix ? And should we change all operands passed to lwz and stw to a "Q" operand ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 7, 2020, at 8:59 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > Hi! > > On Tue, Jul 07, 2020 at 03:17:10PM -0400, Mathieu Desnoyers wrote: >> I'm trying to build librseq at: >> >> https://git.kernel.org/pub/scm/libs/librseq/librseq.git >> >> on powerpc, and I get these errors when building the rseq basic >> test mirrored from the kernel selftests code: >> >> /tmp/ccieEWxU.s: Assembler messages: >> /tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:118: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:121: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:626: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:629: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:735: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:738: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:741: Error: junk at end of line: `,8' >> Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed > > You'll have to show the actual failing machine code, and with enough > context that we can relate this to the source code. > > -save-temps helps, or use -S instead of -c, etc. Sure, see attached .S file. > >> I am using this compiler: >> >> gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12) >> Target: powerpc-linux-gnu >> >> So far, I got things to build by changing "m" operands to "Q" operands. >> Based on >> https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints >> it seems that "Q" means "A memory operand addressed by just a base register." > > Yup. > >> I suspect that lwz and stw don't expect some kind of immediate offset which >> can be kept with "m", and "Q" fixes this. Is that the right fix ? >> >> And should we change all operands passed to lwz and stw to a "Q" operand ? > > No, lwz and stw exactly *do* take an immediate offset. > > It sounds like the compiler passed memory addressed by indexed > addressing, instead. Which is fine for "m", and also fine for those > insns... well, you need lwzx and stwx. > > So perhaps you have code like > > int *p; > int x; > ... > asm ("lwz %0,%1" : "=r"(x) : "m"(*p)); We indeed have explicit "lwz" and "stw" instructions in there. > > where that last line should actually read > > asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p)); Indeed, turning those into "lwzx" and "stwx" seems to fix the issue. There has been some level of extra CPP macro coating around those instructions to support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is not trivial. Let me see what can be done here. Thanks, Mathieu > > ? > > > Segher -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com basic_percpu_ops_test.S Description: Binary data
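A self-contained sketch of the suggested fix (the helper name is made up): with a plain "m" constraint the compiler may hand the asm an indexed (reg+reg) address, which bare lwz/stw cannot encode, and the %X (and %U) operand modifiers make gcc append the x/u suffix only when the chosen address needs it:

static inline int load_int(int *p)
{
	int x;

	/* "lwz %0,%1" would break on an indexed memory operand;
	 * "lwz%U1%X1" becomes lwzx/lwzu/lwz as appropriate. */
	asm ("lwz%U1%X1 %0,%1" : "=r" (x) : "m" (*p));
	return x;
}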
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 8:33 AM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 7, 2020, at 8:59 PM, Segher Boessenkool > seg...@kernel.crashing.org > wrote: [...] >> >> So perhaps you have code like >> >> int *p; >> int x; >> ... >> asm ("lwz %0,%1" : "=r"(x) : "m"(*p)); > > We indeed have explicit "lwz" and "stw" instructions in there. > >> >> where that last line should actually read >> >> asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p)); > > Indeed, turning those into "lwzx" and "stwx" seems to fix the issue. > > There has been some level of extra CPP macro coating around those instructions > to > support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is not > trivial. > Let me see what can be done here. I did the following changes which appear to generate valid asm. See attached corresponding .S output. I grepped for uses of "m" asm operand in Linux powerpc code and noticed it's pretty much always used with e.g. "lwz%U1%X1". I could find one blog post discussing that %U is about update flag, and nothing about %X. Are those documented ? Although it appears to generate valid asm, I have the feeling I'm relying on undocumented features here. :-/ Here is the diff on https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/rseq-ppc.h It's only compile-tested on powerpc32 so far: diff --git a/include/rseq/rseq-ppc.h b/include/rseq/rseq-ppc.h index eb53953..f689fe9 100644 --- a/include/rseq/rseq-ppc.h +++ b/include/rseq/rseq-ppc.h @@ -47,9 +47,9 @@ do { \ #ifdef __PPC64__ -#define STORE_WORD "std " -#define LOAD_WORD "ld " -#define LOADX_WORD "ldx " +#define STORE_WORD(arg)"std%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* To memory ("m" constraint) */ +#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " /* From memory ("m" constraint) */ +#define LOADX_WORD "ldx " /* From base register ("b" constraint) */ #define CMP_WORD "cmpd " #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \ @@ -89,9 +89,9 @@ do { \ #else /* #ifdef __PPC64__ */ -#define STORE_WORD "stw " -#define LOAD_WORD "lwz " -#define LOADX_WORD "lwzx " +#define STORE_WORD(arg)"stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* To memory ("m" constraint) */ +#define LOAD_WORD(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " /* From memory ("m" constraint) */ +#define LOADX_WORD "lwzx " /* From base register ("b" constraint) */ #define CMP_WORD "cmpw " #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \ @@ -125,7 +125,7 @@ do { \ RSEQ_INJECT_ASM(1) \ "lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t" \ "addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t" \ - "stw %%r17, %[" __rseq_str(rseq_cs) "]\n\t" \ + "stw%U[" __rseq_str(rseq_cs) "]%X[" __rseq_str(rseq_cs) "] %%r17, %[" __rseq_str(rseq_cs) "]\n\t" \ __rseq_str(label) ":\n\t" #endif /* #ifdef __PPC64__ */ @@ -136,7 +136,7 @@ do { \ #define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \ RSEQ_INJECT_ASM(2) \ - "lwz %%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \ + "lwz%U[" __rseq_str(current_cpu_id) "]%X[" __rseq_str(current_cpu_id) "] %%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \ "cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t" \ "bne- cr7, " __rseq_str(label) "\n\t" @@ -153,25 +153,25 @@ do {
Re: [PATCH] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
- On Jul 8, 2020, at 1:17 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 7, 2020 9:25 pm: >> - On Jul 7, 2020, at 1:50 AM, Nicholas Piggin npig...@gmail.com wrote: >> [...] >>> I should actually change the comment for 64-bit because soft masked >>> interrupt replay is an interesting case. I thought it was okay (because >>> the IPI would cause a hard interrupt which does do the rfi) but that >>> should at least be written. >> >> Yes. >> >>> The context synchronisation happens before >>> the Linux IPI function is called, but for the purpose of membarrier I >>> think that is okay (the membarrier just needs to have caused a memory >>> barrier + context synchronistaion by the time it has done). >> >> Can you point me to the code implementing this logic ? > > It's mostly in arch/powerpc/kernel/exception-64s.S and > powerpc/kernel/irq.c, but a lot of asm so easier to explain. > > When any Linux code does local_irq_disable(), we set interrupts as > software-masked in a per-cpu flag. When interrupts (including IPIs) come > in, the first thing we do is check that flag and if we are masked, then > record that the interrupt needs to be "replayed" in another per-cpu > flag. The interrupt handler then exits back using RFI (which is context > synchronising the CPU). Later, when the kernel code does > local_irq_enable(), it checks the replay flag to see if anything needs > to be done. At that point we basically just call the interrupt handler > code like a normal function, and when that returns there is no context > synchronising instruction. AFAIU this can only happen for interrupts nesting over irqoff sections, therefore over kernel code, never userspace, right ? > > So membarrier IPI will always cause target CPUs to perform a context > synchronising instruction, but sometimes it happens before the IPI > handler function runs. If my understanding is correct, the replayed interrupt handler logic only nests over kernel code, which will eventually need to issue a context synchronizing instruction before returning to user-space. All we care about is that starting from the membarrier, each core either: - interrupt user-space to issue the context synchronizing instruction if they were running userspace, or - _eventually_ issue a context synchronizing instruction before returning to user-space if they were running kernel code. So your earlier statement "the membarrier just needs to have caused a memory barrier + context synchronistaion by the time it has done" is not strictly correct: the context synchronizing instruction does not strictly need to happen on each core before membarrier returns. A similar line of thoughts can be followed for memory barriers. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
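For readers joining the thread, a hedged userspace sketch of what consumes the SYNC_CORE guarantee being discussed, typically a process patching code that its other threads execute; the MEMBARRIER_CMD_* names are the membarrier(2) uapi, everything else is illustrative and error handling is elided:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

static void jit_init(void)
{
	/* Once per process, before relying on the expedited command. */
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

static void publish_code(void)
{
	/* ... new instructions have been written to the code buffer ... */

	/*
	 * Every thread of this process will execute a core serializing
	 * (context synchronising) instruction before running further
	 * userspace code -- on powerpc, the rfid/hrfid/rfscv on the
	 * return-to-user path is what provides it.
	 */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);

	/* Other threads may now be pointed at the freshly written code. */
}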
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 10:21 AM, Christophe Leroy christophe.le...@csgroup.eu wrote: > Le 08/07/2020 à 16:00, Mathieu Desnoyers a écrit : >> - On Jul 8, 2020, at 8:33 AM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: >> >>> - On Jul 7, 2020, at 8:59 PM, Segher Boessenkool >>> seg...@kernel.crashing.org >>> wrote: >> [...] >>>> >>>> So perhaps you have code like >>>> >>>> int *p; >>>> int x; >>>> ... >>>> asm ("lwz %0,%1" : "=r"(x) : "m"(*p)); >>> >>> We indeed have explicit "lwz" and "stw" instructions in there. >>> >>>> >>>> where that last line should actually read >>>> >>>> asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p)); >>> >>> Indeed, turning those into "lwzx" and "stwx" seems to fix the issue. >>> >>> There has been some level of extra CPP macro coating around those >>> instructions >>> to >>> support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is >>> not >>> trivial. >>> Let me see what can be done here. >> >> I did the following changes which appear to generate valid asm. >> See attached corresponding .S output. >> >> I grepped for uses of "m" asm operand in Linux powerpc code and noticed it's >> pretty much >> always used with e.g. "lwz%U1%X1". I could find one blog post discussing >> that %U >> is about >> update flag, and nothing about %X. Are those documented ? > > As far as I can see, %U is mentioned in > https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html in the > powerpc subpart, at the "m" constraint. Yep, I did notice it, but mistakenly thought it was only needed for "m<>" operand, not "m". Thanks, Mathieu > > For the %X I don't know. > > Christophe > >> >> Although it appears to generate valid asm, I have the feeling I'm relying on >> undocumented >> features here. :-/ -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
powerpc: Incorrect stw operand modifier in __set_pte_at
Hi, Reviewing use of the patterns "Un%Xn" with lwz and stw instructions (where n should be the operand number) within the Linux kernel led me to spot those 2 weird cases: arch/powerpc/include/asm/nohash/pgtable.h:__set_pte_at() __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ stw%U0%X0 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); I would have expected the stw to be: stw%U1%X1 %L2,%1" and: arch/powerpc/include/asm/book3s/32/pgtable.h:__set_pte_at() __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ stw%U0%X0 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); where I would have expected: stw%U1%X1 %L2,%1" Is it a bug or am I missing something ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
[PATCH 1/1] powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at
The placeholder for instruction selection should use the second argument's operand, which is %1, not %0. This could generate incorrect assembly code if the instruction selection for argument %0 ever differs from argument %1. Fixes: 9bf2b5cdc5fe ("powerpc: Fixes for CONFIG_PTE_64BIT for SMP support") Signed-off-by: Mathieu Desnoyers Cc: Christophe Leroy Cc: Kumar Gala Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Cc: # v2.6.28+ --- arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +- arch/powerpc/include/asm/nohash/pgtable.h| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 224912432821..f1467b3c417a 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -529,7 +529,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ - stw%U0%X0 %L2,%1" + stw%U1%X1 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index 4b7c3472eab1..a00e4c1746d6 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -199,7 +199,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ - stw%U0%X0 %L2,%1" + stw%U1%X1 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); return; -- 2.11.0
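To spell out why the operand number matters, a standalone hedged sketch of the same two-store pattern (store_pair() is a made-up name): each %U/%X modifier must describe the memory operand it decorates, so the second store has to be driven by operand 1, otherwise gcc may pick an indexed or pre-update address for %1 that the emitted mnemonic does not match:

static inline void store_pair(unsigned int *lo, unsigned int *hi,
			      unsigned long long val)
{
	/* %2 is the high register of the 64-bit pair, %L2 the low one. */
	asm volatile("stw%U0%X0 %2,%0\n\t"
		     "eieio\n\t"
		     "stw%U1%X1 %L2,%1"
		     : "=m" (*lo), "=m" (*hi)
		     : "r" (val)
		     : "memory");
}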
Re: Failure to build librseq on ppc
- Segher Boessenkool wrote: > Hi! > > On Wed, Jul 08, 2020 at 10:27:27PM +1000, Michael Ellerman wrote: > > Segher Boessenkool writes: > > > You'll have to show the actual failing machine code, and with enough > > > context that we can relate this to the source code. > > > > > > -save-temps helps, or use -S instead of -c, etc. > > > > Attached below. > > Thanks! > > > I think that's from: > > > > #define LOAD_WORD "ld " > > > > #define RSEQ_ASM_OP_CMPEQ(var, expect, label) > > \ > > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t" > > \ > > The way this hardcodes r17 *will* break, btw. The compiler will not > likely want to use r17 as long as your code (after inlining etc.!) stays > small, but there is Murphy's law. r17 is in the clobber list, so it should be ok. > > Anyway... something in rseq_str is wrong, missing %X. This may > have to do with the abuse of inline asm here, making a fix harder :-( I just committed a fix which enhances the macros. Thanks for your help! Mathieu > > > Segher -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 8:10 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > Hi! > > On Wed, Jul 08, 2020 at 10:00:01AM -0400, Mathieu Desnoyers wrote: [...] > >> -#define STORE_WORD "std " >> -#define LOAD_WORD "ld " >> -#define LOADX_WORD "ldx " >> +#define STORE_WORD(arg)"std%U[" __rseq_str(arg) "]%X[" >> __rseq_str(arg) >> "] "/* To memory ("m" constraint) */ >> +#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " >> /* From memory ("m" constraint) */ > > That cannot work (you typoed "ld" here). Indeed, I noticed it before pushing to master (lwd -> ld). > > Some more advice about this code, pretty generic stuff: Let's take an example to support the discussion here. I'm taking it from master branch (after a cleanup changing e.g. LOAD_WORD into RSEQ_LOAD_LONG). So for powerpc32 we have (code edited to remove testing instrumentation): #define __rseq_str_1(x) #x #define __rseq_str(x) __rseq_str_1(x) #define RSEQ_STORE_LONG(arg)"stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* To memory ("m" constraint) */ #define RSEQ_STORE_INT(arg) RSEQ_STORE_LONG(arg) /* To memory ("m" constraint) */ #define RSEQ_LOAD_LONG(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* From memory ("m" constraint) */ #define RSEQ_LOAD_INT(arg) RSEQ_LOAD_LONG(arg) /* From memory ("m" constraint) */ #define RSEQ_LOADX_LONG "lwzx " /* From base register ("b" constraint) */ #define RSEQ_CMP_LONG "cmpw " #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \ start_ip, post_commit_offset, abort_ip) \ ".pushsection __rseq_cs, \"aw\"\n\t" \ ".balign 32\n\t" \ __rseq_str(label) ":\n\t" \ ".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \ /* 32-bit only supported on BE */ \ ".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) "\n\t" \ ".popsection\n\t" \ ".pushsection __rseq_cs_ptr_array, \"aw\"\n\t" \ ".long 0x0, " __rseq_str(label) "b\n\t" \ ".popsection\n\t" /* * Exit points of a rseq critical section consist of all instructions outside * of the critical section where a critical section can either branch to or * reach through the normal course of its execution. The abort IP and the * post-commit IP are already part of the __rseq_cs section and should not be * explicitly defined as additional exit points. Knowing all exit points is * useful to assist debuggers stepping over the critical section. */ #define RSEQ_ASM_DEFINE_EXIT_POINT(start_ip, exit_ip) \ ".pushsection __rseq_exit_point_array, \"aw\"\n\t" \ /* 32-bit only supported on BE */ \ ".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(exit_ip) "\n\t" \ ".popsection\n\t" #define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \ "lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t" \ "addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t" \ RSEQ_STORE_INT(rseq_cs) "%%r17, %[" __rseq_str(rseq_cs) "]\n\t" \ __rseq_str(label) ":\n\t" #define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip) \ __RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip, \ (post_commit_ip - start_ip), abort_ip) #define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \ RSEQ_LOAD_INT(current_cpu_id) "%%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \ "cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 8:18 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > On Wed, Jul 08, 2020 at 08:01:23PM -0400, Mathieu Desnoyers wrote: >> > > #define RSEQ_ASM_OP_CMPEQ(var, expect, label) >> > > \ >> > > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t" >> > >\ >> > >> > The way this hardcodes r17 *will* break, btw. The compiler will not >> > likely want to use r17 as long as your code (after inlining etc.!) stays >> > small, but there is Murphy's law. >> >> r17 is in the clobber list, so it should be ok. > > What protects r17 *after* this asm statement? As discussed in the other leg of the thread (with the code example), r17 is in the clobber list of all asm statements using this macro, and is used as a temporary register within each inline asm. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 9, 2020, at 1:37 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote: >> > What protects r17 *after* this asm statement? >> >> As discussed in the other leg of the thread (with the code example), >> r17 is in the clobber list of all asm statements using this macro, and >> is used as a temporary register within each inline asm. > > That works fine then, for a testcase. Using r17 is not a great idea for > performance (it increases the active register footprint, and causes more > registers to be saved in the prologue of the functions, esp. on older > compilers), and it is easier to just let the compiler choose a good > register to use. But maybe you want to see r17 in the generated > testcases, as eyecatcher or something, dunno :-) Just to make sure I understand your recommendation. So rather than hard-coding r17 as the temporary register, we could explicitly declare the temporary register as a C variable, pass it as an input operand to the inline asm, and then refer to it by operand name in the macros using it. This way the compiler would be free to perform its own register allocation. If that is what you have in mind, then yes, I think it makes a lot of sense. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 9, 2020, at 1:42 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 9, 2020, at 1:37 PM, Segher Boessenkool > seg...@kernel.crashing.org > wrote: > >> On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote: >>> > What protects r17 *after* this asm statement? >>> >>> As discussed in the other leg of the thread (with the code example), >>> r17 is in the clobber list of all asm statements using this macro, and >>> is used as a temporary register within each inline asm. >> >> That works fine then, for a testcase. Using r17 is not a great idea for >> performance (it increases the active register footprint, and causes more >> registers to be saved in the prologue of the functions, esp. on older >> compilers), and it is easier to just let the compiler choose a good >> register to use. But maybe you want to see r17 in the generated >> testcases, as eyecatcher or something, dunno :-) > > Just to make sure I understand your recommendation. So rather than > hard-coding r17 as the temporary register, we could explicitly > declare the temporary register as a C variable, pass it as an > input operand to the inline asm, and then refer to it by operand > name in the macros using it. This way the compiler would be free > to perform its own register allocation. > > If that is what you have in mind, then yes, I think it makes a > lot of sense. Except that asm goto statements have this limitation with gcc: they cannot have any output operands, only inputs, clobbers and target labels. We cannot modify a temporary register received as an input operand. So I don't see how to get a temporary register allocated by the compiler considering this limitation. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
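To illustrate the constraint being discussed, a rough sketch (assuming GCC on powerpc64; the names are invented, this is not the librseq code): a plain asm statement can ask the compiler for a scratch register through an early-clobber output operand, but an asm goto accepts no outputs at all, so the scratch has to be a fixed register named in the clobber list.

/* Plain asm: the compiler hands out a scratch register via "=&r". */
static inline unsigned long load_long(unsigned long *p)
{
        unsigned long tmp;

        asm ("ld%U1%X1 %0, %1" : "=&r" (tmp) : "m" (*p));
        return tmp;
}

/* asm goto: no output operands are accepted, so a compiler-chosen
 * scratch is not available; hard-code a register and clobber it. */
#define BRANCH_IF_ZERO(p, label)                                \
        asm goto ("ld%U0%X0 %%r17, %0\n\t"                      \
                  "cmpdi %%r17, 0\n\t"                          \
                  "beq %l[" #label "]"                          \
                  : /* no outputs allowed */                    \
                  : "m" (*(p))                                  \
                  : "r17", "cr0"                                \
                  : label)

Usage would look like BRANCH_IF_ZERO(&head, empty); followed by an "empty:" label in the caller.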
Re: Failure to build librseq on ppc
- On Jul 9, 2020, at 4:46 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > On Thu, Jul 09, 2020 at 01:56:19PM -0400, Mathieu Desnoyers wrote: >> > Just to make sure I understand your recommendation. So rather than >> > hard-coding r17 as the temporary register, we could explicitly >> > declare the temporary register as a C variable, pass it as an >> > input operand to the inline asm, and then refer to it by operand >> > name in the macros using it. This way the compiler would be free >> > to perform its own register allocation. >> > >> > If that is what you have in mind, then yes, I think it makes a >> > lot of sense. >> >> Except that asm goto statements have this limitation with gcc: they cannot >> have any output operands, only inputs, clobbers and target labels. >> We cannot modify a temporary register received as an input operand. So I don't >> see how to get a temporary register allocated by the compiler considering >> this limitation. > > Heh, yet another reason not to obfuscate your inline asm: it didn't > register that this is asm goto. > > A clobber is one way, yes (those *are* allowed in asm goto). Another > way is to not actually change that register: move the original value > back into there at the end of the asm! (That isn't always easy to do, > it depends on your code). So something like > > long start = ...; > long tmp = start; > asm("stuff that modifies %0; ...; mr %0,%1" : : "r"(tmp), "r"(start)); > > is just fine: %0 isn't actually modified at all, as far as GCC is > concerned, and this isn't lying to it! It appears to be at the cost of adding one extra instruction on the fast-path to restore the register to its original value. I'll leave it to Boqun, who authored the original rseq-ppc code, to figure out what works best performance-wise (when he finds time). Thanks for the pointers! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
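A slightly more concrete rendering of that trick (hypothetical, not the actual rseq-ppc code; here the restore is done by undoing the arithmetic rather than with an mr from a second operand, to keep the sketch self-contained): the register is declared input-only, temporarily modified inside the asm, and put back to its original value before the asm ends; the final instruction is the extra fast-path cost mentioned above.

static inline void poke(unsigned long *p, unsigned long v)
{
        /* %0 is used as scratch inside the asm but holds its original
         * value again on exit, so an input-only operand is not lying
         * to the compiler; the "memory" clobber covers the store, which
         * keeps the asm in an asm goto compatible shape (no outputs). */
        asm ("addi %0, %0, 8\n\t"       /* temporarily modify the register */
             "std%U1%X1 %0, %1\n\t"     /* ... do some work with it ... */
             "addi %0, %0, -8"          /* restore the original value */
             : : "r" (v), "m" (*p)
             : "memory");
}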
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
> /* >* When switching through a kernel thread, the loop in >* membarrier_{private,global}_expedited() may have observed that >* kernel thread and not issued an IPI. It is therefore possible to >* schedule between user->kernel->user threads without passing though > - * switch_mm(). Membarrier requires a barrier after storing to > - * rq->curr, before returning to userspace, so provide them here: > - * > - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly > - * provided by mmdrop(), > - * - a sync_core for SYNC_CORE. > + * switch_mm(). Membarrier requires a full barrier after storing to > + * rq->curr, before returning to userspace, for > + * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop(). >*/ > - if (mm) { > - membarrier_mm_sync_core_before_usermode(mm); > + if (mm) > mmdrop(mm); > - } > + > if (unlikely(prev_state == TASK_DEAD)) { > if (prev->sched_class->task_dead) > prev->sched_class->task_dead(prev); > @@ -6292,6 +6289,7 @@ void idle_task_exit(void) > BUG_ON(current != this_rq()->idle); > > if (mm != &init_mm) { > + /* enter_lazy_tlb is not done because we're about to go down */ > switch_mm(mm, &init_mm, current); > finish_arch_post_lock_switch(); > } > -- > 2.23.0 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>> serialize. There are older kernels for which it does not promise to >>> serialize. And I have plans to make it stop serializing in the >>> nearish future. >> >> You mean x86's return from interrupt? Sounds fun... you'll know where to >> update the membarrier sync code, at least :) > > Oh, I should actually say Mathieu recently clarified a return from > interrupt doesn't fundamentally need to serialize in order to support > membarrier sync core. Clarification to your statement: Return from interrupt to kernel code does not need to be context serializing as long as the kernel serializes before returning to user-space. However, return from interrupt to user-space needs to be context serializing. Thanks, Mathieu > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html > > So you may not need to do anything more if you relaxed it. > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
- On Jul 15, 2020, at 5:48 AM, Nicholas Piggin npig...@gmail.com wrote: [...] > index 47bd4ea0837d..a4704f405e8d 100644 > --- a/arch/powerpc/include/asm/exception-64s.h > +++ b/arch/powerpc/include/asm/exception-64s.h > @@ -68,6 +68,13 @@ > * > * The nop instructions allow us to insert one or more instructions to flush > the > * L1-D cache when returning to userspace or a guest. > + * > + * powerpc relies on return from interrupt/syscall being context > synchronising > + * (which hrfid, rfid, and rfscv are) to support > ARCH_HAS_MEMBARRIER_SYNC_CORE > + * without additional synchronisation instructions. soft-masked > + * interrupt replay does not include a context-synchronising rfid, but those > + * always return to kernel, the context sync is only required for IPIs which > + * return to user. > */ > #define RFI_FLUSH_SLOT > \ > RFI_FLUSH_FIXUP_SECTION;\ I suspect the statement "the context sync is only required for IPIs which return to user." is misleading. As I recall, we need more than just a context sync after the IPI. We need a context sync in the return path of any trap/interrupt/system call which returns to user-space, else we'd need to add the proper core serializing barriers in the scheduler, as we had to do for lazy tlb on x86. Or am I missing something ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 7:00 AM, Peter Zijlstra pet...@infradead.org wrote: > On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote: >> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: >> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin wrote: > >> >> But I’m wondering if all this deferred sync stuff is wrong. In the >> >> brave new world of io_uring and such, perhaps kernel access matter >> >> too. Heck, even: >> > >> > IIRC the membarrier SYNC_CORE use-case is about user-space >> > self-modifying code. >> > >> > Userspace re-uses a text address and needs to SYNC_CORE before it can be >> > sure the old text is forgotten. Nothing the kernel does matters there. >> > >> > I suppose the manpage could be more clear there. >> >> True, but memory ordering of kernel stores from kernel threads for >> regular mem barrier is the concern here. >> >> Does io_uring update completion queue from kernel thread or interrupt, >> for example? If it does, then membarrier will not order such stores >> with user memory accesses. > > So we're talking about regular membarrier() then? Not the SYNC_CORE > variant per-se. > > Even there, I'll argue we don't care, but perhaps Mathieu has a > different opinion. I agree with Peter that we don't care about accesses to user-space memory performed concurrently with membarrier. What we'd care about in terms of accesses to user-space memory from the kernel is something that would be clearly ordered as happening before or after the membarrier call, for instance a read(2) followed by membarrier(2) after the read returns, or a read(2) issued after return from membarrier(2). The other scenario we'd care about is with the compiler barrier paired with membarrier: e.g. read(2) returns, compiler barrier, followed by a store. Or load, compiler barrier, followed by write(2). All those scenarios imply before/after ordering wrt either membarrier or the compiler barrier. I notice that io_uring has a "completion" queue. Let's try to come up with realistic usage scenarios. So the dependency chain would be provided by e.g.: * Infrequent read / Frequent write, communicating read completion through variable X wait for io_uring read request completion -> membarrier -> store X=1 with matching load from X (waiting for X==1) -> asm volatile (::: "memory") -> submit io_uring write request or this other scenario: * Frequent read / Infrequent write, communicating read completion through variable X load from X (waiting for X==1) -> membarrier -> submit io_uring write request with matching wait for io_uring read request completion -> asm volatile (::: "memory") -> store X=1 Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
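As a concrete (if simplified) user-space rendering of the first scenario, here is a sketch under the assumptions that liburing-style helpers are used, that the process has already registered MEMBARRIER_CMD_PRIVATE_EXPEDITED, and that X is an ordinary shared variable; the function names are invented for illustration.

#include <liburing.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile int X;

/* Infrequent reader: wait for the io_uring read completion, then publish X = 1. */
static void infrequent_reader(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;

        io_uring_wait_cqe(ring, &cqe);          /* wait for read request completion */
        io_uring_cqe_seen(ring, cqe);
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        X = 1;                                  /* store X=1 */
}

/* Frequent writer: wait for X == 1, then submit the io_uring write request. */
static void frequent_writer(struct io_uring *ring, int fd, const void *buf, unsigned len)
{
        struct io_uring_sqe *sqe;

        while (!X)
                ;                               /* load from X, waiting for X==1 */
        asm volatile ("" ::: "memory");         /* compiler barrier matching the membarrier */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, buf, len, 0);
        io_uring_submit(ring);
}

The second scenario is the mirror image: the membarrier moves to the thread that loads from X, and the compiler barrier to the thread that stores to it.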
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote: > I should be more complete here, especially since I was complaining > about unclear barrier comment :) > > > CPU0 CPU1 > a. user stuff 1. user stuff > b. membarrier() 2. enter kernel > c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > d. read rq->curr 4. rq->curr switched to kthread > e. is kthread, skip IPI 5. switch_to kthread > f. return to user 6. rq->curr switched to user thread > g. user stuff 7. switch_to user thread > 8. exit kernel > 9. more user stuff > > What you're really ordering is a, g vs 1, 9 right? > > In other words, 9 must see a if it sees g, g must see 1 if it saw 9, > etc. > > Userspace does not care where the barriers are exactly or what kernel > memory accesses might be being ordered by them, so long as there is a > mb somewhere between a and g, and 1 and 9. Right? This is correct. Note that the accesses to user-space memory can be done either by user-space code or kernel code, it doesn't matter. However, in order to be considered as happening before/after either membarrier or the matching compiler barrier, kernel code needs to have a causality relationship with user-space execution, e.g. user-space does a system call, or returns from a system call. In the case of io_uring, submitting a request or returning from waiting on request completion appear to provide this causality relationship. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote: >> I should be more complete here, especially since I was complaining >> about unclear barrier comment :) >> >> >> CPU0 CPU1 >> a. user stuff1. user stuff >> b. membarrier() 2. enter kernel >> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> d. read rq->curr 4. rq->curr switched to kthread >> e. is kthread, skip IPI 5. switch_to kthread >> f. return to user6. rq->curr switched to user thread >> g. user stuff7. switch_to user thread >> 8. exit kernel >> 9. more user stuff >> >> What you're really ordering is a, g vs 1, 9 right? >> >> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> etc. >> >> Userspace does not care where the barriers are exactly or what kernel >> memory accesses might be being ordered by them, so long as there is a >> mb somewhere between a and g, and 1 and 9. Right? > > This is correct. Actually, sorry, the above is not quite right. It's been a while since I looked into the details of membarrier. The smp_mb() at the beginning of membarrier() needs to be paired with a smp_mb() _after_ rq->curr is switched back to the user thread, so the memory barrier is between store to rq->curr and following user-space accesses. The smp_mb() at the end of membarrier() needs to be paired with the smp_mb__after_spinlock() at the beginning of schedule, which is between accesses to userspace memory and switching rq->curr to kthread. As to *why* this ordering is needed, I'd have to dig through additional scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? Thanks, Mathieu > Note that the accesses to user-space memory can be > done either by user-space code or kernel code, it doesn't matter. > However, in order to be considered as happening before/after > either membarrier or the matching compiler barrier, kernel code > needs to have causality relationship with user-space execution, > e.g. user-space does a system call, or returns from a system call. > > In the case of io_uring, submitting a request or returning from waiting > on request completion appear to provide this causality relationship. > > Thanks, > > Mathieu > > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > mathieu.desnoy...@efficios.com wrote: > >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote: >>> I should be more complete here, especially since I was complaining >>> about unclear barrier comment :) >>> >>> >>> CPU0 CPU1 >>> a. user stuff1. user stuff >>> b. membarrier() 2. enter kernel >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >>> d. read rq->curr 4. rq->curr switched to kthread >>> e. is kthread, skip IPI 5. switch_to kthread >>> f. return to user6. rq->curr switched to user thread >>> g. user stuff7. switch_to user thread >>> 8. exit kernel >>> 9. more user stuff >>> >>> What you're really ordering is a, g vs 1, 9 right? >>> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >>> etc. >>> >>> Userspace does not care where the barriers are exactly or what kernel >>> memory accesses might be being ordered by them, so long as there is a >>> mb somewhere between a and g, and 1 and 9. Right? >> >> This is correct. > > Actually, sorry, the above is not quite right. It's been a while > since I looked into the details of membarrier. > > The smp_mb() at the beginning of membarrier() needs to be paired with a > smp_mb() _after_ rq->curr is switched back to the user thread, so the > memory barrier is between store to rq->curr and following user-space > accesses. > > The smp_mb() at the end of membarrier() needs to be paired with the > smp_mb__after_spinlock() at the beginning of schedule, which is > between accesses to userspace memory and switching rq->curr to kthread. > > As to *why* this ordering is needed, I'd have to dig through additional > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? Thinking further about this, I'm beginning to consider that maybe we have been overly cautious by requiring memory barriers before and after store to rq->curr. If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current) while running the membarrier system call, it necessarily means that CPU1 had to issue smp_mb__after_spinlock when entering the scheduler, between any user-space loads/stores and update of rq->curr. Requiring a memory barrier between update of rq->curr (back to current process's thread) and following user-space memory accesses does not seem to guarantee anything more than what the initial barrier at the beginning of __schedule already provides, because the guarantees are only about accesses to user-space memory. Therefore, with the memory barrier at the beginning of __schedule, just observing that CPU1's rq->curr differs from current should guarantee that a memory barrier was issued between any sequentially consistent instructions belonging to the current process on CPU1. Or am I missing/misremembering an important point here ? Thanks, Mathieu > > Thanks, > > Mathieu > > >> Note that the accesses to user-space memory can be >> done either by user-space code or kernel code, it doesn't matter. >> However, in order to be considered as happening before/after >> either membarrier or the matching compiler barrier, kernel code >> needs to have causality relationship with user-space execution, >> e.g. user-space does a system call, or returns from a system call. >> >> In the case of io_uring, submitting a request or returning from waiting >> on request completion appear to provide this causality relationship. 
>> >> Thanks, >> >> Mathieu >> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> http://www.efficios.com > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu wrote: > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: >> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> > mathieu.desnoy...@efficios.com wrote: >> > >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com >> >> wrote: >> >>> I should be more complete here, especially since I was complaining >> >>> about unclear barrier comment :) >> >>> >> >>> >> >>> CPU0 CPU1 >> >>> a. user stuff1. user stuff >> >>> b. membarrier() 2. enter kernel >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >>> f. return to user6. rq->curr switched to user thread >> >>> g. user stuff7. switch_to user thread >> >>> 8. exit kernel >> >>> 9. more user stuff >> >>> >> >>> What you're really ordering is a, g vs 1, 9 right? >> >>> >> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> >>> etc. >> >>> >> >>> Userspace does not care where the barriers are exactly or what kernel >> >>> memory accesses might be being ordered by them, so long as there is a >> >>> mb somewhere between a and g, and 1 and 9. Right? >> >> >> >> This is correct. >> > >> > Actually, sorry, the above is not quite right. It's been a while >> > since I looked into the details of membarrier. >> > >> > The smp_mb() at the beginning of membarrier() needs to be paired with a >> > smp_mb() _after_ rq->curr is switched back to the user thread, so the >> > memory barrier is between store to rq->curr and following user-space >> > accesses. >> > >> > The smp_mb() at the end of membarrier() needs to be paired with the >> > smp_mb__after_spinlock() at the beginning of schedule, which is >> > between accesses to userspace memory and switching rq->curr to kthread. >> > >> > As to *why* this ordering is needed, I'd have to dig through additional >> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? >> >> Thinking further about this, I'm beginning to consider that maybe we have >> been >> overly cautious by requiring memory barriers before and after store to >> rq->curr. >> >> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process >> (current) >> while running the membarrier system call, it necessarily means that CPU1 had >> to issue smp_mb__after_spinlock when entering the scheduler, between any >> user-space >> loads/stores and update of rq->curr. >> >> Requiring a memory barrier between update of rq->curr (back to current >> process's >> thread) and following user-space memory accesses does not seem to guarantee >> anything more than what the initial barrier at the beginning of __schedule >> already >> provides, because the guarantees are only about accesses to user-space >> memory. >> >> Therefore, with the memory barrier at the beginning of __schedule, just >> observing that >> CPU1's rq->curr differs from current should guarantee that a memory barrier >> was >> issued >> between any sequentially consistent instructions belonging to the current >> process on >> CPU1. >> >> Or am I missing/misremembering an important point here ? > > Is it correct to say that the switch_to operations in 5 and 7 include > memory barriers? If they do, then skipping the IPI should be okay. 
> > The reason is as follows: The guarantee you need to enforce is that > anything written by CPU0 before the membarrier() will be visible to CPU1 > after it returns to user mode. Let's say that a writes to X and 9 > reads from X. > > Then we have an instance of the Store Buffer pattern: > > CPU0CPU1 > a. Write X 6. Write rq->curr for user thread > c. smp_mb() 7. switch_to memory barrier > d. Read rq->curr9. Read X > > In this pattern, the memory barriers make it impossible for both reads > to miss their corresponding writes. Since d does fail to read 6 (it > sees the earlier value stored by 4), 9 must read a. > > The other guarantee you need is that g on CPU0 will observe anything > written by CPU1 in 1. This is easier to see, using the fact that 3 is a > memory barrier and d reads from 4. Right, and Nick's reply involving pairs of loads/stores on each side clarifies the situation even further. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
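Alan's store-buffer argument can also be written down as an LKMM litmus test (a sketch only: rq_curr here is a plain variable standing in for rq->curr, and the switch_to barrier is modelled as smp_mb()):

C membarrier-sb

{}

P0(int *x, int *rq_curr)
{
	int r0;

	WRITE_ONCE(*x, 1);		/* a: write X */
	smp_mb();			/* c: smp_mb() in membarrier() */
	r0 = READ_ONCE(*rq_curr);	/* d: read rq->curr */
}

P1(int *x, int *rq_curr)
{
	int r1;

	WRITE_ONCE(*rq_curr, 1);	/* 6: rq->curr switched to user thread */
	smp_mb();			/* 7: switch_to memory barrier */
	r1 = READ_ONCE(*x);		/* 9: read X from user-space */
}

exists (0:r0=0 /\ 1:r1=0)

herd7 should report that the exists clause is never satisfied, matching the claim that if d fails to read 6, then 9 must read a.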
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote: [...] > > membarrier does replace barrier instructions on remote CPUs, which do > order accesses performed by the kernel on the user address space. So > membarrier should too I guess. > > Normal process context accesses like read(2) will do so because they > don't get filtered out from IPIs, but kernel threads using the mm may > not. But it should not be an issue, because membarrier's ordering is only with respect to submit and completion of io_uring requests, which are performed through system calls from the context of user-space threads, which are called from the right mm. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 17, 2020, at 10:51 AM, Alan Stern st...@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote: >> - On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu >> wrote: >> >> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> >> mathieu.desnoy...@efficios.com wrote: >> >> >> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> >> > mathieu.desnoy...@efficios.com wrote: >> >> > >> >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com >> >> >> wrote: >> >> >>> I should be more complete here, especially since I was complaining >> >> >>> about unclear barrier comment :) >> >> >>> >> >> >>> >> >> >>> CPU0 CPU1 >> >> >>> a. user stuff1. user stuff >> >> >>> b. membarrier() 2. enter kernel >> >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >> >>> f. return to user6. rq->curr switched to user thread >> >> >>> g. user stuff7. switch_to user thread >> >> >>> 8. exit kernel >> >> >>> 9. more user stuff > > ... > >> >> Requiring a memory barrier between update of rq->curr (back to current >> >> process's >> >> thread) and following user-space memory accesses does not seem to >> >> guarantee >> >> anything more than what the initial barrier at the beginning of __schedule >> >> already >> >> provides, because the guarantees are only about accesses to user-space >> >> memory. > > ... > >> > Is it correct to say that the switch_to operations in 5 and 7 include >> > memory barriers? If they do, then skipping the IPI should be okay. >> > >> > The reason is as follows: The guarantee you need to enforce is that >> > anything written by CPU0 before the membarrier() will be visible to CPU1 >> > after it returns to user mode. Let's say that a writes to X and 9 >> > reads from X. >> > >> > Then we have an instance of the Store Buffer pattern: >> > >> >CPU0CPU1 >> >a. Write X 6. Write rq->curr for user thread >> >c. smp_mb() 7. switch_to memory barrier >> >d. Read rq->curr9. Read X >> > >> > In this pattern, the memory barriers make it impossible for both reads >> > to miss their corresponding writes. Since d does fail to read 6 (it >> > sees the earlier value stored by 4), 9 must read a. >> > >> > The other guarantee you need is that g on CPU0 will observe anything >> > written by CPU1 in 1. This is easier to see, using the fact that 3 is a >> > memory barrier and d reads from 4. >> >> Right, and Nick's reply involving pairs of loads/stores on each side >> clarifies the situation even further. > > The key part of my reply was the question: "Is it correct to say that > the switch_to operations in 5 and 7 include memory barriers?" From the > text quoted above and from Nick's reply, it seems clear that they do > not. I remember that switch_mm implies it, but not switch_to. The scenario that triggered this discussion is when the scheduler does a lazy tlb entry/exit, which is basically switch from a user task to a kernel thread without changing the mm, and usually switching back afterwards. This optimization means the rq->curr mm temporarily differs, which prevent IPIs from being sent by membarrier, but without involving a switch_mm. This requires explicit memory barriers either on entry/exit of lazy tlb mode, or explicit barriers in the scheduler for those special-cases. 
> I agree with Nick: A memory barrier is needed somewhere between the > assignment at 6 and the return to user mode at 8. Otherwise you end up > with the Store Buffer pattern having a memory barrier on only one side, > and it is well known that this arrangement does not guarantee any > ordering. Yes, I see this now. I'm still trying to wrap my head around why the memory barrier at the end of membarrier() needs to be paired with a scheduler barrier though. > One thing I don't understand about all this: Any context switch has to > include a memory barrier somewhere, but both you and Nick seem to be > saying that steps 6 and 7 don't include (or don't need) any memory > barriers. What am I missing? All context switches have the smp_mb__after_spinlock() at the beginning of __schedule(), which I suspect is what you refer to. However, this barrier is before the store to rq->curr, not after. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu wrote: >> > I agree with Nick: A memory barrier is needed somewhere between the >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> > with the Store Buffer pattern having a memory barrier on only one side, >> > and it is well known that this arrangement does not guarantee any >> > ordering. >> >> Yes, I see this now. I'm still trying to wrap my head around why the memory >> barrier at the end of membarrier() needs to be paired with a scheduler >> barrier though. > > The memory barrier at the end of membarrier() on CPU0 is necessary in > order to enforce the guarantee that any writes occurring on CPU1 before > the membarrier() is executed will be visible to any code executing on > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: > > CPU0CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > b: send IPI IPI-induced mb > c: smp_mb() > r1 = x > > The writes to x and y are unordered by the hardware, so it's possible to > have r2 = 1 even though the write to x doesn't execute until b. If the > memory barrier at c is omitted then "r1 = x" can be reordered before b > (although not before a), so we get r1 = 0. This violates the guarantee > that membarrier() is supposed to provide. > > The timing of the memory barrier at c has to ensure that it executes > after the IPI-induced memory barrier on CPU1. If it happened before > then we could still end up with r1 = 0. That's why the pairing matters. > > I hope this helps your head get properly wrapped. :-) It does help a bit! ;-) This explains this part of the comment near the smp_mb at the end of membarrier: * Memory barrier on the caller thread _after_ we finished * waiting for the last IPI. [...] However, it does not explain why it needs to be paired with a barrier in the scheduler, clearly for the case where the IPI is skipped. I wonder whether this part of the comment is factually correct: * [...] Matches memory barriers around rq->curr modification in scheduler. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 17, 2020, at 1:44 PM, Alan Stern st...@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote: >> - On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu >> wrote: >> >> >> > I agree with Nick: A memory barrier is needed somewhere between the >> >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> >> > with the Store Buffer pattern having a memory barrier on only one side, >> >> > and it is well known that this arrangement does not guarantee any >> >> > ordering. >> >> >> >> Yes, I see this now. I'm still trying to wrap my head around why the >> >> memory >> >> barrier at the end of membarrier() needs to be paired with a scheduler >> >> barrier though. >> > >> > The memory barrier at the end of membarrier() on CPU0 is necessary in >> > order to enforce the guarantee that any writes occurring on CPU1 before >> > the membarrier() is executed will be visible to any code executing on >> > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: >> > >> >CPU0CPU1 >> >x = 1 >> >barrier() >> >y = 1 >> >r2 = y >> >membarrier(): >> > a: smp_mb() >> > b: send IPI IPI-induced mb >> > c: smp_mb() >> >r1 = x >> > >> > The writes to x and y are unordered by the hardware, so it's possible to >> > have r2 = 1 even though the write to x doesn't execute until b. If the >> > memory barrier at c is omitted then "r1 = x" can be reordered before b >> > (although not before a), so we get r1 = 0. This violates the guarantee >> > that membarrier() is supposed to provide. >> > >> > The timing of the memory barrier at c has to ensure that it executes >> > after the IPI-induced memory barrier on CPU1. If it happened before >> > then we could still end up with r1 = 0. That's why the pairing matters. >> > >> > I hope this helps your head get properly wrapped. :-) >> >> It does help a bit! ;-) >> >> This explains this part of the comment near the smp_mb at the end of >> membarrier: >> >> * Memory barrier on the caller thread _after_ we finished >> * waiting for the last IPI. [...] >> >> However, it does not explain why it needs to be paired with a barrier in the >> scheduler, clearly for the case where the IPI is skipped. I wonder whether >> this >> part >> of the comment is factually correct: >> >> * [...] Matches memory barriers around rq->curr modification in >> scheduler. > > The reasoning is pretty much the same as above: > > CPU0CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > switch to kthread (includes mb) > b: read rq->curr == kthread > switch to user (includes mb) > c: smp_mb() > r1 = x > > Once again, it is possible that x = 1 doesn't become visible to CPU0 > until shortly before b. But if c is omitted then "r1 = x" can be > reordered before b (to any time after a), so we can have r1 = 0. > > Here the timing requirement is that c executes after the first memory > barrier on CPU1 -- which is one of the ones around the rq->curr > modification. (In fact, in this scenario CPU1's switch back to the user > process is irrelevant.) That indeed covers the last scenario I was wondering about. Thanks Alan! Mathieu > > Alan Stern -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
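The skipped-IPI case Alan describes reduces to a message-passing shape, which can likewise be sketched as an LKMM litmus test (rq_curr again stands in for rq->curr, with the value 1 meaning "kthread"; the barrier in the switch to the kthread is modelled as smp_mb()):

C membarrier-skip-ipi-mp

{}

P0(int *x, int *rq_curr)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*rq_curr);	/* b: membarrier() observes the kthread */
	smp_mb();			/* c: smp_mb() at the end of membarrier() */
	r1 = READ_ONCE(*x);		/* later read of X on CPU0 */
}

P1(int *x, int *rq_curr)
{
	WRITE_ONCE(*x, 1);		/* x = 1 in user-space on CPU1 */
	smp_mb();			/* barrier in the switch to the kthread */
	WRITE_ONCE(*rq_curr, 1);	/* rq->curr = kthread */
}

exists (0:r0=1 /\ 0:r1=0)

With both barriers present the outcome is forbidden; removing the smp_mb() at c makes the exists clause reachable, which is exactly the reordering Alan points out.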
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm: >> - On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote: >> [...] >>> >>> membarrier does replace barrier instructions on remote CPUs, which do >>> order accesses performed by the kernel on the user address space. So >>> membarrier should too I guess. >>> >>> Normal process context accesses like read(2) will do so because they >>> don't get filtered out from IPIs, but kernel threads using the mm may >>> not. >> >> But it should not be an issue, because membarrier's ordering is only with >> respect >> to submit and completion of io_uring requests, which are performed through >> system calls from the context of user-space threads, which are called from >> the >> right mm. > > Is that true? Can io completions be written into an address space via a > kernel thread? I don't know the io_uring code well but it looks like > that's asynchonously using the user mm context. Indeed, the io completion appears to be signaled asynchronously between kernel and user-space. Therefore, both kernel and userspace code need to have proper memory barriers in place to signal completion, otherwise user-space could read garbage after it notices completion of a read. I did not review the entire io_uring implementation, but the publish side for completion appears to be: static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_rings *rings = ctx->rings; /* order cqe stores with ring update */ smp_store_release(&rings->cq.tail, ctx->cached_cq_tail); if (wq_has_sleeper(&ctx->cq_wait)) { wake_up_interruptible(&ctx->cq_wait); kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN); } } The store-release on tail should be paired with a load_acquire on the reader-side (it's called "read_barrier()" in the code): tools/io_uring/queue.c: static int __io_uring_get_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr, int wait) { struct io_uring_cq *cq = &ring->cq; const unsigned mask = *cq->kring_mask; unsigned head; int ret; *cqe_ptr = NULL; head = *cq->khead; do { /* * It's necessary to use a read_barrier() before reading * the CQ tail, since the kernel updates it locklessly. The * kernel has the matching store barrier for the update. The * kernel also ensures that previous stores to CQEs are ordered * with the tail update. */ read_barrier(); if (head != *cq->ktail) { *cqe_ptr = &cq->cqes[head & mask]; break; } if (!wait) break; ret = io_uring_enter(ring->ring_fd, 0, 1, IORING_ENTER_GETEVENTS, NULL); if (ret < 0) return -errno; } while (1); return 0; } So as far as membarrier memory ordering dependencies are concerned, it relies on the store-release/load-acquire dependency chain in the completion queue to order against anything that was done prior to the completed requests. What is in-flight while the requests are being serviced provides no memory ordering guarantee whatsoever. > How about other memory accesses via kthread_use_mm? Presumably there is > still ordering requirement there for membarrier, Please provide an example case with memory accesses via kthread_use_mm where ordering matters to support your concern. > so I really think > it's a fragile interface with no real way for the user to know how > kernel threads may use its mm for any particular reason, so membarrier > should synchronize all possible kernel users as well. I strongly doubt so, but perhaps something should be clarified in the documentation if you have that feeling. 
Thanks, Mathieu > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am: [...] > > Yeah you're probably right in this case I think. Quite likely most kernel > tasks that asynchronously write to user memory would at least have some > kind of producer-consumer barriers. > > But is that restriction of all async modifications documented and enforced > anywhere? > >>> How about other memory accesses via kthread_use_mm? Presumably there is >>> still ordering requirement there for membarrier, >> >> Please provide an example case with memory accesses via kthread_use_mm where >> ordering matters to support your concern. > > I think the concern Andy raised with io_uring was less a specific > problem he saw and more a general concern that we have these memory > accesses which are not synchronized with membarrier. > >>> so I really think >>> it's a fragile interface with no real way for the user to know how >>> kernel threads may use its mm for any particular reason, so membarrier >>> should synchronize all possible kernel users as well. >> >> I strongly doubt so, but perhaps something should be clarified in the >> documentation >> if you have that feeling. > > I'd rather go the other way and say if you have reasoning or numbers for > why PF_KTHREAD is an important optimisation above rq->curr == rq->idle > then we could think about keeping this subtlety with appropriate > documentation added, otherwise we can just kill it and remove all doubt. > > That being said, the x86 sync core gap that I imagined could be fixed > by changing to rq->curr == rq->idle test does not actually exist because > the global membarrier does not have a sync core option. So fixing the > exit_lazy_tlb points that this series does *should* fix that. So > PF_KTHREAD may be less problematic than I thought from implementation > point of view, only semantics. Today, the membarrier global expedited command explicitly skips kernel threads, but it happens that membarrier private expedited considers those with the same mm as targets for the IPI. So we already implement a semantic which differs between private and global expedited membarriers. This can be explained in part by the fact that kthread_use_mm was introduced after 4.16, which is when the most recent membarrier commands were introduced. It seems that the effect on membarrier was not considered when kthread_use_mm was introduced. Looking at membarrier(2) documentation, it states that IPIs are only sent to threads belonging to the same process as the calling thread. If my understanding of the notion of process is correct, this should rule out sending the IPI to kernel threads, given they are not "part" of the same process, only borrowing the mm. But I agree that the distinction is moot, and should be clarified. Without a clear use-case to justify adding a constraint on membarrier, I am tempted to simply clarify the documentation of the current membarrier commands, stating clearly that they are not guaranteed to affect kernel threads. Then, if we have a compelling use-case to implement a different behavior which covers kthreads, this could be added consistently across membarrier commands with a flag (or by adding new commands). Does this approach make sense ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org wrote: > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: > >> That being said, the x86 sync core gap that I imagined could be fixed >> by changing to rq->curr == rq->idle test does not actually exist because >> the global membarrier does not have a sync core option. So fixing the >> exit_lazy_tlb points that this series does *should* fix that. So >> PF_KTHREAD may be less problematic than I thought from implementation >> point of view, only semantics. > > So I've been trying to figure out where that PF_KTHREAD comes from, > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. > > So the first version: > > > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com > > appears to unconditionally send the IPI and checks p->mm in the IPI > context, but then v2: > > > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com > > has the current code. But I've been unable to find the reason the > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. Looking back at my inbox, it seems like you are the one who proposed to skip all kthreads: https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net > > The comment doesn't really help either; sure we have the whole lazy mm > thing, but that's ->active_mm, not ->mm. > > Possibly it is because {,un}use_mm() do not have sufficient barriers to > make the remote p->mm test work? Or were we over-eager with the !p->mm > doesn't imply kthread 'cleanups' at the time? The nice thing about adding back kthreads to the threads considered for membarrier IPI is that it has no observable effect on the user-space ABI. No pre-existing kthread rely on this, and we just provide an additional guarantee for future kthread implementations. > Also, I just realized, I still have a fix for use_mm() now > kthread_use_mm() that seems to have been lost. I suspect we need to at least document the memory barriers in kthread_use_mm and kthread_unuse_mm to state that they are required by membarrier if we want to ipi kthreads as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra pet...@infradead.org wrote: > On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote: >> - On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org >> wrote: >> >> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: >> > >> >> That being said, the x86 sync core gap that I imagined could be fixed >> >> by changing to rq->curr == rq->idle test does not actually exist because >> >> the global membarrier does not have a sync core option. So fixing the >> >> exit_lazy_tlb points that this series does *should* fix that. So >> >> PF_KTHREAD may be less problematic than I thought from implementation >> >> point of view, only semantics. >> > >> > So I've been trying to figure out where that PF_KTHREAD comes from, >> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy >> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. >> > >> > So the first version: >> > >> > >> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com >> > >> > appears to unconditionally send the IPI and checks p->mm in the IPI >> > context, but then v2: >> > >> > >> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com >> > >> > has the current code. But I've been unable to find the reason the >> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. >> >> Looking back at my inbox, it seems like you are the one who proposed to >> skip all kthreads: >> >> https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net > > I had a feeling it might've been me ;-) I just couldn't find the email. > >> > The comment doesn't really help either; sure we have the whole lazy mm >> > thing, but that's ->active_mm, not ->mm. >> > >> > Possibly it is because {,un}use_mm() do not have sufficient barriers to >> > make the remote p->mm test work? Or were we over-eager with the !p->mm >> > doesn't imply kthread 'cleanups' at the time? >> >> The nice thing about adding back kthreads to the threads considered for >> membarrier >> IPI is that it has no observable effect on the user-space ABI. No >> pre-existing >> kthread >> rely on this, and we just provide an additional guarantee for future kthread >> implementations. >> >> > Also, I just realized, I still have a fix for use_mm() now >> > kthread_use_mm() that seems to have been lost. >> >> I suspect we need to at least document the memory barriers in kthread_use_mm >> and >> kthread_unuse_mm to state that they are required by membarrier if we want to >> ipi kthreads as well. > > Right, so going by that email you found it was mostly a case of being > lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add > any other bits that might be needed, covering kthreads should be > possible. > > No objections from me for making it so. I'm OK on making membarrier cover kthreads using mm as well, provided we audit kthread_{,un}use_mm() to make sure the proper barriers are in place after setting task->mm and before clearing it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com wrote: > On Mon, Aug 23, 2021, Mathieu Desnoyers wrote: >> [ re-send to Darren Hart ] >> >> - On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: >> >> > - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com >> > wrote: >> > >> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is >> >> migrated while the kernel is handling KVM_RUN. This is a regression test >> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer >> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM >> >> without updating rseq, leading to a stale CPU ID and other badness. >> >> >> > >> > [...] >> > >> > +#define RSEQ_SIG 0xdeadbeef >> > >> > Is there any reason for defining a custom signature rather than including >> > tools/testing/selftests/rseq/rseq.h ? This should take care of including >> > the proper architecture header which will define the appropriate signature. >> > >> > Arguably you don't define rseq critical sections in this test per se, but >> > I'm wondering why the custom signature here. > > Partly to avoid taking a dependency on rseq.h, and partly to try to call out > that > the test doesn't actually do any rseq critical sections. It might be good to add a comment near this define stating this then, so nobody attempts to copy this as an example. > >> > [...] >> > >> >> + >> >> +static void *migration_worker(void *ign) >> >> +{ >> >> + cpu_set_t allowed_mask; >> >> + int r, i, nr_cpus, cpu; >> >> + >> >> + CPU_ZERO(&allowed_mask); >> >> + >> >> + nr_cpus = CPU_COUNT(&possible_mask); >> >> + >> >> + for (i = 0; i < 2; i++) { >> >> + cpu = i % nr_cpus; >> >> + if (!CPU_ISSET(cpu, &possible_mask)) >> >> + continue; >> >> + >> >> + CPU_SET(cpu, &allowed_mask); >> >> + >> >> + /* >> >> + * Bump the sequence count twice to allow the reader to detect >> >> + * that a migration may have occurred in between rseq and sched >> >> + * CPU ID reads. An odd sequence count indicates a migration >> >> + * is in-progress, while a completely different count indicates >> >> + * a migration occurred since the count was last read. >> >> + */ >> >> + atomic_inc(&seq_cnt); >> > >> > So technically this atomic_inc contains the required barriers because the >> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But >> > it's rather odd that the semantic differs from the kernel implementation in >> > terms of memory barriers: the kernel implementation of atomic_inc >> > guarantees no memory barriers, but this one happens to provide full >> > barriers pretty much by accident (selftests futex/include/atomic.h >> > documents no such guarantee). > > Yeah, I got quite lost trying to figure out what atomics the test would > actually > end up with. At the very least, until things are clarified in the selftests atomic header, I would recommend adding a comment stating which memory barriers are required alongside each use of atomic_inc here. I would even prefer that we add extra (currently unneeded) write barriers to make extra sure that this stays documented. Performance of the write-side does not matter much here. > >> > If this full barrier guarantee is indeed provided by the selftests atomic.h >> > header, I would really like a comment stating that in the atomic.h header >> > so the carpet is not pulled from under our feet by a future optimization. 
>> > >> > >> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask); >> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", >> >> + errno, strerror(errno)); >> >> + atomic_inc(&seq_cnt); >> >> + >> >> + CPU_CLR(cpu, &allowed_mask); >> >> + >> >> + /* >> >> + * Let the read-side get back into KVM_RUN to improve the odds >> >> + * of task migration coinc
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 26, 2021, at 7:54 PM, Sean Christopherson sea...@google.com wrote: > On Thu, Aug 26, 2021, Mathieu Desnoyers wrote: >> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com >> wrote: >> >> >> + r = sched_setaffinity(0, sizeof(allowed_mask), >> >> >> &allowed_mask); >> >> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d >> >> >> (%s)", >> >> >> + errno, strerror(errno)); >> >> >> + atomic_inc(&seq_cnt); >> >> >> + >> >> >> + CPU_CLR(cpu, &allowed_mask); >> >> >> + >> >> >> + /* >> >> >> + * Let the read-side get back into KVM_RUN to improve >> >> >> the odds >> >> >> + * of task migration coinciding with KVM's run loop. >> >> > >> >> > This comment should be about increasing the odds of letting the seqlock >> >> > read-side complete. Otherwise, the delay between the two back-to-back >> >> > atomic_inc is so small that the seqlock read-side may never have time to >> >> > complete the reading the rseq cpu id and the sched_getcpu() call, and >> >> > can >> >> > retry forever. >> > >> > Hmm, but that's not why there's a delay. I'm not arguing that a livelock >> > isn't >> > possible (though that syscall would have to be screaming fast), >> >> I don't think we have the same understanding of the livelock scenario. AFAIU >> the >> livelock >> would be caused by a too-small delay between the two consecutive atomic_inc() >> from one >> loop iteration to the next compared to the time it takes to perform a >> read-side >> critical >> section of the seqlock. Back-to-back atomic_inc can be performed very >> quickly, >> so I >> doubt that the sched_getcpu implementation have good odds to be fast enough >> to >> complete >> in that narrow window, leading to lots of read seqlock retry. > > Ooooh, yeah, brain fart on my side. I was thinking of the two atomic_inc() in > the > same loop iteration and completely ignoring the next iteration. Yes, I 100% > agree > that a delay to ensure forward progress is needed. An assertion in main() > that > the > reader complete at least some reasonable number of KVM_RUNs is also probably a > good > idea, e.g. to rule out a false pass due to the reader never making forward > progress. Agreed. > > FWIW, the do-while loop does make forward progress without a delay, but at > ~50% > throughput, give or take. I did not expect absolutely no progress, but a significant slow down of the read-side. > >> > but the primary motivation is very much to allow the read-side enough time >> > to get back into KVM proper. >> >> I'm puzzled by your statement. AFAIU, let's say we don't have the delay, >> then we >> have: >> >> migration thread KVM_RUN/read-side thread >> --- >> - ioctl(KVM_RUN) >> - atomic_inc_seq_cst(&seqcnt) >> - sched_setaffinity >> - atomic_inc_seq_cst(&seqcnt) >> - a = atomic_load(&seqcnt) & ~1 >> - smp_rmb() >> - b = >> LOAD_ONCE(__rseq_abi->cpu_id); >> - sched_getcpu() >> - smp_rmb() >> - re-load seqcnt/compare >> (succeeds) >>- Can only succeed if entire >> read-side happens while the seqcnt >> is in an even numbered >> state. >> - if (a != b) abort() >> /* no delay. Even counter state is very >> short. */ >> - atomic_inc_seq_cst(&seqcnt) >> /* Let's suppose the lack of delay causes the >> setaffinity to complete too early compared >> with KVM_RUN ioctl */ >> - sched_setaffinity >> - atomic_inc_seq_cst(&seqcnt) >> >> /* no delay. Even counter state is very >>
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 27, 2021, at 7:23 PM, Sean Christopherson sea...@google.com wrote: > On Fri, Aug 27, 2021, Mathieu Desnoyers wrote: [...] >> Does it reproduce if we randomize the delay to have it picked randomly from >> 0us >> to 100us (with 1us step) ? It would remove a lot of the needs for >> arch-specific >> magic delay value. > > My less-than-scientific testing shows that it can reproduce at delays up to > ~500us, > but above ~10us the reproducibility starts to drop. The bug still reproduces > reliably, it just takes more iterations, and obviously the test runs a bit > slower. > > Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)? Works for me, thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
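For reference, the seqcount handshake discussed in this sub-thread boils down to something like the following reader-side sketch; the names (seq_cnt, __rseq, atomic_read, smp_rmb, TEST_ASSERT) are assumed to come from the selftest environment, and the real code lives in tools/testing/selftests/kvm/rseq_test.c. The writer brackets each sched_setaffinity() call with two atomic_inc() calls, so an odd count means a migration is in flight and a changed count means one happened during the reads.

static void check_rseq_cpu_id(void)
{
	int snapshot, rseq_cpu, sched_cpu;

	do {
		/* Round down to an even value; an odd seq_cnt means a
		 * migration is in progress and forces a retry. */
		snapshot = atomic_read(&seq_cnt) & ~1;
		smp_rmb();
		rseq_cpu = READ_ONCE(__rseq.cpu_id);
		sched_cpu = sched_getcpu();
		smp_rmb();
	} while (snapshot != atomic_read(&seq_cnt));

	TEST_ASSERT(rseq_cpu == sched_cpu,
		    "rseq cpu_id (%d) != sched_getcpu() (%d)", rseq_cpu, sched_cpu);
}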
Re: [PATCH v3 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Sep 1, 2021, at 4:30 PM, Sean Christopherson sea...@google.com wrote: > Add a test to verify an rseq's CPU ID is updated correctly if the task is > migrated while the kernel is handling KVM_RUN. This is a regression test > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM > without updating rseq, leading to a stale CPU ID and other badness. > > Signed-off-by: Sean Christopherson Thanks! Acked-by: Mathieu Desnoyers > --- > tools/testing/selftests/kvm/.gitignore | 1 + > tools/testing/selftests/kvm/Makefile| 3 + > tools/testing/selftests/kvm/rseq_test.c | 236 > 3 files changed, 240 insertions(+) > create mode 100644 tools/testing/selftests/kvm/rseq_test.c > > diff --git a/tools/testing/selftests/kvm/.gitignore > b/tools/testing/selftests/kvm/.gitignore > index 0709af0144c8..6d031ff6b68e 100644 > --- a/tools/testing/selftests/kvm/.gitignore > +++ b/tools/testing/selftests/kvm/.gitignore > @@ -47,6 +47,7 @@ > /kvm_page_table_test > /memslot_modification_stress_test > /memslot_perf_test > +/rseq_test > /set_memory_region_test > /steal_time > /kvm_binary_stats_test > diff --git a/tools/testing/selftests/kvm/Makefile > b/tools/testing/selftests/kvm/Makefile > index 5832f510a16c..0756e79cb513 100644 > --- a/tools/testing/selftests/kvm/Makefile > +++ b/tools/testing/selftests/kvm/Makefile > @@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus > TEST_GEN_PROGS_x86_64 += kvm_page_table_test > TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test > TEST_GEN_PROGS_x86_64 += memslot_perf_test > +TEST_GEN_PROGS_x86_64 += rseq_test > TEST_GEN_PROGS_x86_64 += set_memory_region_test > TEST_GEN_PROGS_x86_64 += steal_time > TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test > @@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test > TEST_GEN_PROGS_aarch64 += dirty_log_perf_test > TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus > TEST_GEN_PROGS_aarch64 += kvm_page_table_test > +TEST_GEN_PROGS_aarch64 += rseq_test > TEST_GEN_PROGS_aarch64 += set_memory_region_test > TEST_GEN_PROGS_aarch64 += steal_time > TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test > @@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test > TEST_GEN_PROGS_s390x += dirty_log_test > TEST_GEN_PROGS_s390x += kvm_create_max_vcpus > TEST_GEN_PROGS_s390x += kvm_page_table_test > +TEST_GEN_PROGS_s390x += rseq_test > TEST_GEN_PROGS_s390x += set_memory_region_test > TEST_GEN_PROGS_s390x += kvm_binary_stats_test > > diff --git a/tools/testing/selftests/kvm/rseq_test.c > b/tools/testing/selftests/kvm/rseq_test.c > new file mode 100644 > index ..060538bd405a > --- /dev/null > +++ b/tools/testing/selftests/kvm/rseq_test.c > @@ -0,0 +1,236 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +#define _GNU_SOURCE /* for program_invocation_short_name */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "kvm_util.h" > +#include "processor.h" > +#include "test_util.h" > + > +#define VCPU_ID 0 > + > +static __thread volatile struct rseq __rseq = { > + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, > +}; > + > +/* > + * Use an arbitrary, bogus signature for configuring rseq, this test does not > + * actually enter an rseq critical section. 
> + */ > +#define RSEQ_SIG 0xdeadbeef > + > +/* > + * Any bug related to task migration is likely to be timing-dependent; > perform > + * a large number of migrations to reduce the odds of a false negative. > + */ > +#define NR_TASK_MIGRATIONS 10 > + > +static pthread_t migration_thread; > +static cpu_set_t possible_mask; > +static bool done; > + > +static atomic_t seq_cnt; > + > +static void guest_code(void) > +{ > + for (;;) > + GUEST_SYNC(0); > +} > + > +static void sys_rseq(int flags) > +{ > + int r; > + > + r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG); > + TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno)); > +} > + > +static void *migration_worker(void *ign) > +{ > + cpu_set_t allowed_mask; > + int r, i, nr_cpus, cpu; > + > + CPU_ZERO(&allowed_mask); > + > + nr_cpus = CPU_COUNT(&possible_mask); > + > + for (i = 0; i < NR_TASK_MIGRATIONS; i++) { > + cpu = i % nr_cpus; > + if (!CPU_ISSET(cpu, &possible_mask)) > +
Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Nov 28, 2020, at 11:01 AM, Nicholas Piggin npig...@gmail.com wrote: > And get rid of the generic sync_core_before_usermode facility. This is > functionally a no-op in the core scheduler code, but it also catches This sentence is incomplete. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. Ideally yes this complexity should sit within the x86 architecture code if only that architecture requires it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code
I. It is therefore possible to >* schedule between user->kernel->user threads without passing though >* switch_mm(). Membarrier requires a barrier after storing to > - * rq->curr, before returning to userspace, so provide them here: > + * rq->curr, before returning to userspace, and mmdrop() provides > + * this barrier. >* > - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly > - * provided by mmdrop(), > - * - a sync_core for SYNC_CORE. > + * XXX: I don't think mmdrop() actually does this. There's no > + * smp_mb__before/after_atomic() in there. I recall mmdrop providing a memory barrier. It looks like I event went though the trouble of documenting it myself. ;-) static inline void mmdrop(struct mm_struct *mm) { /* * The implicit full barrier implied by atomic_dec_and_test() is * required by the membarrier system call before returning to * user-space, after storing to rq->curr. */ if (unlikely(atomic_dec_and_test(&mm->mm_count))) __mmdrop(mm); } >*/ > - if (mm) { > - membarrier_mm_sync_core_before_usermode(mm); OK so here is the meat. The current code is using the (possibly incomplete) lazy TLB state known by the scheduler to sync core, and it appears it may be a bit more heavy that what is strictly needed. Your change instead rely on the internal knowledge of lazy TLB within x86 switch_mm_irqs_off to achieve this, which overall seems like an improvement. I agree with Nick's comment that it should go on top of his exit_lazy_mm patches. Thanks, Mathieu > + if (mm) > mmdrop(mm); > - } > + > if (unlikely(prev_state == TASK_DEAD)) { > if (prev->sched_class->task_dead) > prev->sched_class->task_dead(prev); > -- > 2.28.0 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code
- On Dec 4, 2020, at 3:17 AM, Nadav Amit nadav.a...@gmail.com wrote: > I am not very familiar with membarrier, but here are my 2 cents while trying > to answer your questions. > >> On Dec 3, 2020, at 9:26 PM, Andy Lutomirski wrote: >> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct >> mm_struct *next, >> * from one thread in a process to another thread in the same >> * process. No TLB flush required. >> */ >> + >> +// XXX: why is this okay wrt membarrier? >> if (!was_lazy) >> return; > > I am confused. > > On one hand, it seems that membarrier_private_expedited() would issue an IPI > to that core, as it would find that this core’s cpu_rq(cpu)->curr->mm is the > same as the one that the membarrier applies to. If the scheduler switches from one thread to another which both have the same mm, it means cpu_rq(cpu)->curr->mm is invariant, even though ->curr changes. So there is no need to issue a memory barrier or sync core for membarrier in this case, because there is no way the IPI can be missed. > But… (see below) > > >> @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct >> mm_struct *next, >> smp_mb(); >> next_tlb_gen = atomic64_read(&next->context.tlb_gen); >> if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == >> -next_tlb_gen) >> +next_tlb_gen) { >> +/* >> + * We're reactivating an mm, and membarrier might >> + * need to serialize. Tell membarrier. >> + */ >> + >> +// XXX: I can't understand the logic in >> +// membarrier_mm_sync_core_before_usermode(). What's >> +// the mm check for? >> +membarrier_mm_sync_core_before_usermode(next); > > On the other hand the reason for this mm check that you mention contradicts > my previous understanding as the git log says: > > commit 2840cf02fae627860156737e83326df354ee4ec6 > Author: Mathieu Desnoyers > Date: Thu Sep 19 13:37:01 2019 -0400 > >sched/membarrier: Call sync_core only before usermode for same mm > >When the prev and next task's mm change, switch_mm() provides the core >serializing guarantees before returning to usermode. The only case >where an explicit core serialization is needed is when the scheduler >keeps the same mm for prev and next. Hrm, so your point here is that if the scheduler keeps the same mm for prev and next, it means membarrier will have observed the same rq->curr->mm, and therefore the IPI won't be missed. I wonder if that membarrier_mm_sync_core_before_usermode is needed at all then or if we have just been too careful and did not consider that all the scenarios which need to be core-sync'd are indeed taken care of ? I see here that my prior commit message indeed discusses prev and next task's mm, but in reality, we are comparing current->mm with rq->prev_mm. So from a lazy TLB perspective, this probably matters, and we may still need a core sync in some lazy TLB scenarios. > >> /* >> * When switching through a kernel thread, the loop in >> * membarrier_{private,global}_expedited() may have observed that >> * kernel thread and not issued an IPI. It is therefore possible to >> * schedule between user->kernel->user threads without passing though >> * switch_mm(). Membarrier requires a barrier after storing to >> - * rq->curr, before returning to userspace, so provide them here: >> + * rq->curr, before returning to userspace, and mmdrop() provides >> + * this barrier. >> * >> - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly >> - * provided by mmdrop(), >> - * - a sync_core for SYNC_CORE. >> + * XXX: I don't think mmdrop() actually does this. 
There's no >> + * smp_mb__before/after_atomic() in there. > > I presume that since x86 is the only one that needs > membarrier_mm_sync_core_before_usermode(), nobody noticed the missing > smp_mb__before/after_atomic(). These are anyhow a compiler barrier in x86, > and such a barrier would take place before the return to userspace. mmdrop already provides the memory barriers for membarrer, as I documented within the function. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
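For context, here is a simplified (and possibly out-of-date) sketch of how membarrier_private_expedited() selects its IPI targets; tmpmask allocation and error handling are omitted. The point made above is that the p->mm == mm test below cannot miss a CPU whose rq->curr changed between two threads sharing the same mm, since the mm it inspects is invariant across such a switch.

	struct mm_struct *mm = current->mm;
	int cpu;

	cpus_read_lock();
	rcu_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		p = rcu_dereference(cpu_rq(cpu)->curr);
		/* Only CPUs currently running a thread of this mm get an IPI. */
		if (p && p->mm == mm)
			__cpumask_set_cpu(cpu, tmpmask);
	}
	rcu_read_unlock();

	/* ipi_func is ipi_mb, or ipi_sync_core for SYNC_CORE requests. */
	on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
	cpus_read_unlock();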
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org wrote: > The old sync_core_before_usermode() comments said that a non-icache-syncing > return-to-usermode instruction is x86-specific and that all other > architectures automatically notice cross-modified code on return to > userspace. Based on my general understanding of how CPUs work and based on > my atttempt to read the ARM manual, this is not true at all. In fact, x86 > seems to be a bit of an anomaly in the other direction: x86's IRET is > unusually heavyweight for a return-to-usermode instruction. > > So let's drop any pretense that we can have a generic way implementation > behind membarrier's SYNC_CORE flush and require all architectures that opt > in to supply their own. Removing the generic implementation is OK with me, as this will really require architecture maintainers to think hard about it when porting this feature. > This means x86, arm64, and powerpc for now. Let's > also rename the function from sync_core_before_usermode() to > membarrier_sync_core_before_usermode() because the precise flushing details > may very well be specific to membarrier, and even the concept of > "sync_core" in the kernel is mostly an x86-ism. Work for me too. > > I admit that I'm rather surprised that the code worked at all on arm64, > and I'm suspicious that it has never been very well tested. My apologies > for not reviewing this more carefully in the first place. Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt It clearly states that only arm, arm64, powerpc and x86 support the membarrier sync core feature as of now: # Architecture requirements # # * arm/arm64/powerpc # # Rely on implicit context synchronization as a result of exception return # when returning from IPI handler, and when returning to user-space. # # * x86 # # x86-32 uses IRET as return from interrupt, which takes care of the IPI. # However, it uses both IRET and SYSEXIT to go back to user-space. The IRET # instruction is core serializing, but not SYSEXIT. # # x86-64 uses IRET as return from interrupt, which takes care of the IPI. # However, it can return to user-space through either SYSRETL (compat code), # SYSRETQ, or IRET. # # Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely # instead on write_cr3() performed by switch_mm() to provide core serialization # after changing the current mm, and deal with the special case of kthread -> # uthread (temporarily keeping current mm into active_mm) by issuing a # sync_core_before_usermode() in that specific case. This is based on direct feedback from the architecture maintainers. You seem to have noticed odd cases on arm64 where this guarantee does not match reality. Where exactly can we find this in the code, and which part of the architecture manual can you point us to which supports your concern ? Based on the notes I have, use of `eret` on aarch64 guarantees a context synchronizing instruction when returning to user-space. 
Thanks, Mathieu > > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: linuxppc-dev@lists.ozlabs.org > Cc: Nicholas Piggin > Cc: Catalin Marinas > Cc: Will Deacon > Cc: linux-arm-ker...@lists.infradead.org > Cc: Mathieu Desnoyers > Cc: x...@kernel.org > Cc: sta...@vger.kernel.org > Fixes: 70216e18e519 ("membarrier: Provide core serializing command, > *_SYNC_CORE") > Signed-off-by: Andy Lutomirski > --- > > Hi arm64 and powerpc people- > > This is part of a series here: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes > > Before I send out the whole series, I'm hoping that some arm64 and powerpc > people can help me verify that I did this patch right. Once I get > some feedback on this patch, I'll send out the whole pile. And once > *that's* done, I'll start giving the mm lazy stuff some serious thought. > > The x86 part is already fixed in Linus' tree. > > Thanks, > Andy > > arch/arm64/include/asm/sync_core.h | 21 + > arch/powerpc/include/asm/sync_core.h | 20 > arch/x86/Kconfig | 1 - > arch/x86/include/asm/sync_core.h | 7 +++ > include/linux/sched/mm.h | 1 - > include/linux/sync_core.h| 21 - > init/Kconfig | 3 --- > kernel/sched/membarrier.c| 15 +++ > 8 files changed, 55 insertions(+), 34 deletions(-) > create mode 100644 arch/arm64/include/asm/sync_core.h > create mode 100644 arch/powerpc/include/asm/sync_core.h > delete mode 100644 include/linux/sync_core.h > > diff --git a/arch/arm64/include/asm/sync_core.h >
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
er does not do anything wrt cache coherency. We should perhaps explicitly point it out though. >> >> So, either Andy has a misunderstanding, or the man page is wrong, or >> my rudimentary understanding of what membarrier is supposed to be >> doing is wrong... > > Look at the latest man page: > > https://man7.org/linux/man-pages/man2/membarrier.2.html > > for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE. The result may not be > all that enlightening. As described in the membarrier(2) man page and in include/uapi/linux/membarrier.h, membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE guarantees core serializing instructions in addition to the memory ordering guaranteed by MEMBARRIER_CMD_PRIVATE_EXPEDITED. It does not guarantee anything wrt i/d cache coherency. I initially considered adding such guarantees when we introduced membarrier sync-core, but decided against it because it would basically replicate what architectures already expose to user-space, e.g. flush_icache_user_range on arm32. So between code modification and allowing other threads to jump to that code, it should be expected that architectures without coherent i/d cache will need to flush caches to ensure coherency *and* to issue membarrier to make sure core serializing instructions will be issued by every thread acting on the same mm either immediately by means of the IPI, or before they return to user-space if they do not happen to be currently running when the membarrier system call is invoked. Hoping this clarifies things. I suspect we will need to clarify documentation about what membarrier *does not* guarantee, given that you mistakenly expected membarrier to take care of cache flushing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
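As a concrete, hedged illustration of that division of responsibility, a user-space JIT on an architecture without coherent i/d caches would do something along these lines. The helper names are made up for the example, the cache-maintenance step could equally be an arch-specific syscall such as cacheflush(2) on arm32, and it can be skipped entirely on x86.

#include <linux/membarrier.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* One-time setup, before the first use of the SYNC_CORE command. */
static void jit_init(void)
{
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/* Publish new code at @dst so any thread of the process may execute it. */
static void jit_publish(void *dst, const void *src, size_t len)
{
	/* 1) Code modification. */
	memcpy(dst, src, len);

	/* 2) Arch-specific cache maintenance (i/d coherency), if required. */
	__builtin___clear_cache((char *)dst, (char *)dst + len);

	/* 3) Core serialization on all threads acting on the same mm. */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}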
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
hread siblings have executed a core serializing instruc- > tion. This guarantee is provided only for threads in the same > process as the calling thread. > > The "expedited" commands complete faster than the non-expedited > ones, they never block, but have the downside of causing extra > overhead. > > A process must register its intent to use the private expedited > sync core command prior to using it. > > This just says that the siblings have executed a serialising > instruction, in other words a barrier. It makes no claims concerning > cache coherency - and without some form of cache maintenance, there > can be no expectation that the I and D streams to be coherent with > each other. Right, membarrier is not doing anything wrt I/D caches. On architectures without coherent caches, users should use other system calls or instructions provided by the architecture to synchronize the appropriate address ranges. > This description is also weird in another respect. "guarantee that > all its running thread siblings have executed a core serializing > instruction" ... "The expedited commands ... never block". > > So, the core executing this call is not allowed to block, but the > other part indicates that the other CPUs _have_ executed a serialising > instruction before this call returns... one wonders how that happens > without blocking. Maybe the CPU spins waiting for completion instead? Membarrier expedited sync-core issues IPIs to all CPUs running sibling threads. AFAIR the IPI mechanism uses the "csd lock" which is basically busy waiting. So it does not "block", it busy-waits. For completeness of the explanation, other (non-running) threads acting on the same mm will eventually issue the context synchronizing instruction before returning to user-space whenever they are scheduled back. Thanks, Mathieu > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last! -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski l...@kernel.org wrote: [...] >> You seem to have noticed odd cases on arm64 where this guarantee does not >> match reality. Where exactly can we find this in the code, and which part >> of the architecture manual can you point us to which supports your concern ? >> >> Based on the notes I have, use of `eret` on aarch64 guarantees a context >> synchronizing >> instruction when returning to user-space. > > Based on my reading of the manual, ERET on ARM doesn't synchronize > anything at all. I can't find any evidence that it synchronizes data > or instructions, and I've seen reports that the CPU will happily > speculate right past it. Reading [1], there appear to be three kinds of context synchronization events: - Taking an exception, - Returning from an exception, - ISB. This other source [2] adds (search for Context synchronization operation): - Exit from Debug state - Executing a DCPS instruction - Executing a DRPS instruction "ERET" falls into the second kind of event, and AFAIU should be context synchronizing. That was confirmed to me by Will Deacon when membarrier sync-core was implemented for aarch64. If the architecture reference manuals are wrong, is there an erratum ? As for the algorithm to use on ARMv8 to update instructions, see [2] B2.3.4 Implication of caches for the application programmer "Synchronization and coherency issues between data and instruction accesses" Membarrier only takes care of making sure the "ISB" part of the algorithm can be done easily and efficiently on multiprocessor systems. Thanks, Mathieu [1] https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/ISB-in-more-detail [2] https://montcs.bloomu.edu/Information/ARMv8/ARMv8-A_Architecture_Reference_Manual_(Issue_A.a).pdf -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
- On Dec 28, 2020, at 4:06 PM, Andy Lutomirski l...@kernel.org wrote: > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers > wrote: >> >> - On Dec 28, 2020, at 2:44 PM, Andy Lutomirski l...@kernel.org wrote: >> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin >> > wrote: >> >> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: >> >> > After chatting with rmk about this (but without claiming that any of >> >> > this is his opinion), based on the manpage, I think membarrier() >> >> > currently doesn't really claim to be synchronizing caches? It just >> >> > serializes cores. So arguably if userspace wants to use membarrier() >> >> > to synchronize code changes, userspace should first do the code >> >> > change, then flush icache as appropriate for the architecture, and >> >> > then do the membarrier() to ensure that the old code is unused? >> >> ^ exactly, yes. >> >> >> > >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() >> >> > syscall. That might cause you to end up with two IPIs instead of one >> >> > in total, but we probably don't care _that_ much about extra IPIs on >> >> > 32-bit arm? >> >> This was the original thinking, yes. The cacheflush IPI will flush specific >> regions of code, and the membarrier IPI issues context synchronizing >> instructions. >> >> Architectures with coherent i/d caches don't need the cacheflush step. > > There are different levels of coherency -- VIVT architectures may have > differing requirements compared to PIPT, etc. > > In any case, I feel like the approach taken by the documentation is > fundamentally confusing. Architectures don't all speak the same > language Agreed. > How about something like: I dislike the wording "barrier" and the association between "write" and "instruction fetch" done in the descriptions below. It leads to think that this behaves like a memory barrier, when in fact my understanding of a context synchronizing instruction is that it simply flushes internal CPU state, which would cause coherency issues if the CPU observes both the old and then the new code without having this state flushed. [ Sorry if I take more time to reply and if my replies are a bit more concise than usual. I'm currently on parental leave, so I have non-maskable interrupts to attend to. ;-) ] Thanks, Mathieu > > The SYNC_CORE operation causes all threads in the caller's address > space (including the caller) to execute an architecture-defined > barrier operation. membarrier() will ensure that this barrier is > executed at a time such that all data writes done by the calling > thread before membarrier() are made visible by the barrier. > Additional architecture-dependent cache management operations may be > required to use this for JIT code. > > x86: SYNC_CORE executes a barrier that will cause subsequent > instruction fetches to observe prior writes. Currently this will be a > "serializing" instruction, but, if future improved CPU documentation > becomes available and relaxes this requirement, the barrier may > change. The kernel guarantees that writing new or modified > instructions to normal memory (and issuing SFENCE if the writes were > non-temporal) then doing a membarrier SYNC_CORE operation is > sufficient to cause all threads in the caller's address space to > execute the new or modified instructions. This is true regardless of > whether or not those instructions are written at the same virtual > address from which they are subsequently executed. 
No additional > cache management is required on x86. > > arm: Something about the cache management syscalls. > > arm64: Ditto > > powerpc: I have no idea. -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski l...@kernel.org wrote: > The old sync_core_before_usermode() comments suggested that a > non-icache-syncing > return-to-usermode instruction is x86-specific and that all other > architectures automatically notice cross-modified code on return to > userspace. > > This is misleading. The incantation needed to modify code from one > CPU and execute it on another CPU is highly architecture dependent. > On x86, according to the SDM, one must modify the code, issue SFENCE > if the modification was WC or nontemporal, and then issue a "serializing > instruction" on the CPU that will execute the code. membarrier() can do > the latter. > > On arm64 and powerpc, one must flush the icache and then flush the pipeline > on the target CPU, although the CPU manuals don't necessarily use this > language. > > So let's drop any pretense that we can have a generic way to define or > implement membarrier's SYNC_CORE operation and instead require all > architectures to define the helper and supply their own documentation as to > how to use it. Agreed. Documentation of the sequence of operations that need to be performed when cross-modifying code on SMP should be per-architecture. The documentation of the architectural effects of membarrier sync-core should be per-arch as well. > This means x86, arm64, and powerpc for now. And also arm32, as discussed in the other leg of the patchset's email thread. > Let's also > rename the function from sync_core_before_usermode() to > membarrier_sync_core_before_usermode() because the precise flushing details > may very well be specific to membarrier, and even the concept of > "sync_core" in the kernel is mostly an x86-ism. OK > [...] > > static void ipi_rseq(void *info) > { > @@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int > cpu_id) > smp_call_func_t ipi_func = ipi_mb; > > if (flags == MEMBARRIER_FLAG_SYNC_CORE) { > - if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) > +#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE > return -EINVAL; > +#else > if (!(atomic_read(&mm->membarrier_state) & > MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY)) > return -EPERM; > ipi_func = ipi_sync_core; > +#endif Please change back this #ifndef / #else / #endif within function for if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) { ... } else { ... } I don't think mixing up preprocessor and code logic makes it more readable. Thanks, Mathieu > } else if (flags == MEMBARRIER_FLAG_RSEQ) { > if (!IS_ENABLED(CONFIG_RSEQ)) > return -EINVAL; > -- > 2.31.1 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
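Concretely, the requested form simply restores the pre-patch shape of that branch; a sketch, with the context lines taken from the diff above:

	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
			return -EINVAL;
		if (!(atomic_read(&mm->membarrier_state) &
		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
			return -EPERM;
		ipi_func = ipi_sync_core;
	}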
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski l...@kernel.org wrote: [...] > +# An architecture that wants to support > +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what > it > +# is supposed to do and implement membarrier_sync_core_before_usermode() to > +# make it do that. Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via > +# Kconfig.Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a > +# fantastic API and may not make sense on all architectures. Once an > +# architecture meets these requirements, Can we please remove the editorial comment about the quality of the membarrier sync-core's API ? At least it's better than having all userspace rely on mprotect() undocumented side-effects to perform something which typically works, until it won't, or until this prevents mprotect's implementation to be improved because it will start breaking JITs all over the place. We can simply state that the definition of what membarrier sync-core does is defined per-architecture, and document the sequence of operations to perform when doing cross-modifying code specifically for each architecture. Now if there are weird architectures where membarrier is an odd fit (I've heard that riscv might need address ranges to which the core sync needs to apply), then those might need to implement their own arch-specific system call, which is all fine. > +# > +# On x86, a program can safely modify code, issue > +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via > +# the modified address or an alias, from any thread in the calling process. > +# > +# On arm64, a program can modify code, flush the icache as needed, and issue > +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context > synchronizing > +# event", aka pipeline flush on all CPUs that might run the calling process. > +# Then the program can execute the modified code as long as it is executed > +# from an address consistent with the icache flush and the CPU's cache type. > +# > +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE > +# similarly to arm64. It would be nice if the powerpc maintainers could > +# add a more clear explanantion. We should document the requirements on ARMv7 as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote: > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote: > >> Please change back this #ifndef / #else / #endif within function for >> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) { >> ... >> } else { >> ... >> } >> >> I don't think mixing up preprocessor and code logic makes it more readable. > > I agree, but I don't know how to make the result work well. > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED > case, so either I need to fake up a definition or use #ifdef. > > If I faked up a definition, I would want to assert, at build time, that > it isn't called. I don't think we can do: > > static void membarrier_sync_core_before_usermode() > { >BUILD_BUG_IF_REACHABLE(); > } Let's look at the context here: static void ipi_sync_core(void *info) { [] membarrier_sync_core_before_usermode() } ^ this can be within #ifdef / #endif static int membarrier_private_expedited(int flags, int cpu_id) [...] if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) return -EINVAL; if (!(atomic_read(&mm->membarrier_state) & MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY)) return -EPERM; ipi_func = ipi_sync_core; All we need to make the line above work is to define an empty ipi_sync_core function in the #else case after the ipi_sync_core() function definition. Or am I missing your point ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
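A minimal sketch of that suggestion, assuming the only goal is to keep the IS_ENABLED() form compiling (the stub comment wording is mine; whether an empty stub is acceptable is exactly what is debated in the follow-up below):

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static void ipi_sync_core(void *info)
{
	/* [...] existing body, elided as above. */
	membarrier_sync_core_before_usermode();
}
#else
static void ipi_sync_core(void *info)
{
	/*
	 * Never used: membarrier_private_expedited() returns -EINVAL via
	 * the IS_ENABLED() check before ipi_func can be set to this.
	 */
}
#endif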
Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
- On Jun 18, 2021, at 1:13 PM, Christophe Leroy christophe.le...@csgroup.eu wrote: [...] > > I don't understand all that complexity to just replace a simple > 'smp_mb__after_unlock_lock()'. > > #define smp_mb__after_unlock_lock() smp_mb() > #define smp_mb() barrier() > # define barrier() __asm__ __volatile__("": : :"memory") > > > Am I missing some subtility ? On powerpc CONFIG_SMP, smp_mb() is actually defined as: #define smp_mb() __smp_mb() #define __smp_mb() mb() #define mb() __asm__ __volatile__ ("sync" : : : "memory") So the original motivation here was to skip a "sync" instruction whenever switching between threads which are part of the same process. But based on recent discussions, I suspect my implementation may be doing so inaccurately. Thanks, Mathieu > > Thanks > Christophe -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski l...@kernel.org wrote: > On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote: >> - On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote: >> >> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote: >> > >> >> Please change back this #ifndef / #else / #endif within function for >> >> >> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) { >> >> ... >> >> } else { >> >> ... >> >> } >> >> >> >> I don't think mixing up preprocessor and code logic makes it more >> >> readable. >> > >> > I agree, but I don't know how to make the result work well. >> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED >> > case, so either I need to fake up a definition or use #ifdef. >> > >> > If I faked up a definition, I would want to assert, at build time, that >> > it isn't called. I don't think we can do: >> > >> > static void membarrier_sync_core_before_usermode() >> > { >> >BUILD_BUG_IF_REACHABLE(); >> > } >> >> Let's look at the context here: >> >> static void ipi_sync_core(void *info) >> { >> [] >> membarrier_sync_core_before_usermode() >> } >> >> ^ this can be within #ifdef / #endif >> >> static int membarrier_private_expedited(int flags, int cpu_id) >> [...] >>if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) >> return -EINVAL; >> if (!(atomic_read(&mm->membarrier_state) & >> MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY)) >> return -EPERM; >> ipi_func = ipi_sync_core; >> >> All we need to make the line above work is to define an empty ipi_sync_core >> function in the #else case after the ipi_sync_core() function definition. >> >> Or am I missing your point ? > > Maybe? > > My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the > core. > I would be fine with that if I could have the compiler statically verify that > it’s not called, but I’m uncomfortable having it there if the implementation > is > actively incorrect. I see. Another approach would be to implement a "setter" function to populate "ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE implementation. Would that be better ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
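A rough sketch of that setter idea, under the assumption that the function name and exact placement are free to vary:

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static int membarrier_get_ipi_sync_core(smp_call_func_t *ipi_func)
{
	*ipi_func = ipi_sync_core;
	return 0;
}
#else
static int membarrier_get_ipi_sync_core(smp_call_func_t *ipi_func)
{
	return -EINVAL;		/* Arch does not implement SYNC_CORE. */
}
#endif

	/* ...and in membarrier_private_expedited(): */
	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
		int ret = membarrier_get_ipi_sync_core(&ipi_func);

		if (ret)
			return ret;
		if (!(atomic_read(&mm->membarrier_state) &
		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
			return -EPERM;
	}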
Re: [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume()
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote: > Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now > that the two function are always called back-to-back by architectures > that have rseq. The rseq helper is stubbed out for architectures that > don't support rseq, i.e. this is a nop across the board. > > Note, tracehook_notify_resume() is horribly named and arguably does not > belong in tracehook.h as literally every line of code in it has nothing > to do with tracing. But, that's been true since commit a42c6ded827d > ("move key_repace_session_keyring() into tracehook_notify_resume()") > first usurped tracehook_notify_resume() back in 2012. Punt cleaning that > mess up to future patches. > > No functional change intended. This will make it harder to introduce new code paths which consume the NOTIFY_RESUME without calling the rseq callback, which introduces issues. Agreed. Acked-by: Mathieu Desnoyers > > Signed-off-by: Sean Christopherson > --- > arch/arm/kernel/signal.c | 1 - > arch/arm64/kernel/signal.c | 1 - > arch/csky/kernel/signal.c| 4 +--- > arch/mips/kernel/signal.c| 4 +--- > arch/powerpc/kernel/signal.c | 4 +--- > arch/s390/kernel/signal.c| 1 - > include/linux/tracehook.h| 2 ++ > kernel/entry/common.c| 4 +--- > kernel/entry/kvm.c | 4 +--- > 9 files changed, 7 insertions(+), 18 deletions(-) > > diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c > index a3a38d0a4c85..9df68d139965 100644 > --- a/arch/arm/kernel/signal.c > +++ b/arch/arm/kernel/signal.c > @@ -670,7 +670,6 @@ do_work_pending(struct pt_regs *regs, unsigned int > thread_flags, int syscall) > uprobe_notify_resume(regs); > } else { > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > } > } > local_irq_disable(); > diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c > index 23036334f4dc..22b55db13da6 100644 > --- a/arch/arm64/kernel/signal.c > +++ b/arch/arm64/kernel/signal.c > @@ -951,7 +951,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, > > if (thread_flags & _TIF_NOTIFY_RESUME) { > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > > /* >* If we reschedule after checking the affinity > diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c > index 312f046d452d..bc4238b9f709 100644 > --- a/arch/csky/kernel/signal.c > +++ b/arch/csky/kernel/signal.c > @@ -260,8 +260,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, > if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) > do_signal(regs); > > - if (thread_info_flags & _TIF_NOTIFY_RESUME) { > + if (thread_info_flags & _TIF_NOTIFY_RESUME) > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > - } > } > diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c > index f1e985109da0..c9b2a75563e1 100644 > --- a/arch/mips/kernel/signal.c > +++ b/arch/mips/kernel/signal.c > @@ -906,10 +906,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, > void > *unused, > if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) > do_signal(regs); > > - if (thread_info_flags & _TIF_NOTIFY_RESUME) { > + if (thread_info_flags & _TIF_NOTIFY_RESUME) > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > - } > > user_enter(); > } > diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c > index e600764a926c..b93b87df499d 100644 > --- a/arch/powerpc/kernel/signal.c > +++ b/arch/powerpc/kernel/signal.c > @@ -293,10 +293,8 @@ void 
do_notify_resume(struct pt_regs *regs, unsigned long > thread_info_flags) > do_signal(current); > } > > - if (thread_info_flags & _TIF_NOTIFY_RESUME) { > + if (thread_info_flags & _TIF_NOTIFY_RESUME) > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > - } > } > > static unsigned long get_tm_stackpointer(struct task_struct *tsk) > diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c > index 78ef53b29958..b307db26bf2d 100644 > --- a/arch/s390/kernel/signal.c > +++ b/arch/s390/kernel/sign
Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote: > Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to > transferring to a KVM guest, which is roughly equivalent to an exit to > userspace and processes many of the same pending actions. While the task > cannot be in an rseq critical section as the KVM path is reachable only > via ioctl(KVM_RUN), the side effects that apply to rseq outside of a > critical section still apply, e.g. the CPU ID needs to be updated if the > task is migrated. > > Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults > and other badness in userspace VMMs that use rseq in combination with KVM, > e.g. due to the CPU ID being stale after task migration. I agree with the problem assessment, but I would recommend a small change to this fix. > > Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") > Reported-by: Peter Foley > Bisected-by: Doug Evans > Cc: Shakeel Butt > Cc: Thomas Gleixner > Cc: sta...@vger.kernel.org > Signed-off-by: Sean Christopherson > --- > kernel/entry/kvm.c | 4 +++- > kernel/rseq.c | 4 ++-- > 2 files changed, 5 insertions(+), 3 deletions(-) > > diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c > index 49972ee99aff..049fd06b4c3d 100644 > --- a/kernel/entry/kvm.c > +++ b/kernel/entry/kvm.c > @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, > unsigned long ti_work) > if (ti_work & _TIF_NEED_RESCHED) > schedule(); > > - if (ti_work & _TIF_NOTIFY_RESUME) > + if (ti_work & _TIF_NOTIFY_RESUME) { > tracehook_notify_resume(NULL); > + rseq_handle_notify_resume(NULL, NULL); > + } > > ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); > if (ret) > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 35f7bd0fced0..58c79a7918cd 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs > *rseq_cs) > > static int rseq_ip_fixup(struct pt_regs *regs) > { > - unsigned long ip = instruction_pointer(regs); > + unsigned long ip = regs ? instruction_pointer(regs) : 0; > struct task_struct *t = current; > struct rseq_cs rseq_cs; > int ret; > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs) >* If not nested over a rseq critical section, restart is useless. >* Clear the rseq_cs pointer and return. >*/ > - if (!in_rseq_cs(ip, &rseq_cs)) > + if (!regs || !in_rseq_cs(ip, &rseq_cs)) I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work. If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns. However, clearing the thread's rseq_cs will prevent this from happening. So I would favor an approach where we simply do: if (!regs) return 0; Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in a ioctl system call. Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section. Thanks for looking into this ! 
Mathieu > return clear_rseq_cs(t); > ret = rseq_need_restart(t, rseq_cs.flags); > if (ret <= 0) > -- > 2.33.0.rc1.237.g0d66db33f3-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
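Spelled out, the early return suggested above would sit at the very top of rseq_ip_fixup(), roughly as in this sketch; note that the v2 patch later in the thread ends up doing the equivalent check one level up, in __rseq_handle_notify_resume().

static int rseq_ip_fixup(struct pt_regs *regs)
{
	unsigned long ip;
	struct task_struct *t = current;
	struct rseq_cs rseq_cs;
	int ret;

	/*
	 * Invoked from a syscall path (e.g. ioctl(KVM_RUN)): there are no
	 * user registers to fix up.  Leave t->rseq_cs untouched so that
	 * CONFIG_DEBUG_RSEQ can still kill a caller that issued the ioctl
	 * from within a rseq critical section.
	 */
	if (!regs)
		return 0;

	ip = instruction_pointer(regs);

	/* ... rest of the fixup logic unchanged ... */
}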
Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote: > Add a test to verify an rseq's CPU ID is updated correctly if the task is > migrated while the kernel is handling KVM_RUN. This is a regression test > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM > without updating rseq, leading to a stale CPU ID and other badness. > > Signed-off-by: Sean Christopherson > --- [...] > + > +static void *migration_worker(void *ign) > +{ > + cpu_set_t allowed_mask; > + int r, i, nr_cpus, cpu; > + > + CPU_ZERO(&allowed_mask); > + > + nr_cpus = CPU_COUNT(&possible_mask); > + > + for (i = 0; i < 2; i++) { > + cpu = i % nr_cpus; > + if (!CPU_ISSET(cpu, &possible_mask)) > + continue; > + > + CPU_SET(cpu, &allowed_mask); > + > + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask); > + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", > errno, > + strerror(errno)); > + > + CPU_CLR(cpu, &allowed_mask); > + > + usleep(10); > + } > + done = true; > + return NULL; > +} > + > +int main(int argc, char *argv[]) > +{ > + struct kvm_vm *vm; > + u32 cpu, rseq_cpu; > + int r; > + > + /* Tell stdout not to buffer its content */ > + setbuf(stdout, NULL); > + > + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask); > + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno, > + strerror(errno)); > + > + if (CPU_COUNT(&possible_mask) < 2) { > + print_skip("Only one CPU, task migration not possible\n"); > + exit(KSFT_SKIP); > + } > + > + sys_rseq(0); > + > + /* > + * Create and run a dummy VM that immediately exits to userspace via > + * GUEST_SYNC, while concurrently migrating the process by setting its > + * CPU affinity. > + */ > + vm = vm_create_default(VCPU_ID, 0, guest_code); > + > + pthread_create(&migration_thread, NULL, migration_worker, 0); > + > + while (!done) { > + vcpu_run(vm, VCPU_ID); > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, > + "Guest failed?"); > + > + cpu = sched_getcpu(); > + rseq_cpu = READ_ONCE(__rseq.cpu_id); > + > + /* > + * Verify rseq's CPU matches sched's CPU, and that sched's CPU > + * is stable. This doesn't handle the case where the task is > + * migrated between sched_getcpu() and reading rseq, and again > + * between reading rseq and sched_getcpu(), but in practice no > + * false positives have been observed, while on the other hand > + * blocking migration while this thread reads CPUs messes with > + * the timing and prevents hitting failures on a buggy kernel. > + */ I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id if you add a pthread mutex to protect: sched_getcpu and __rseq_abi.cpu_id reads vs sched_setaffinity calls within the migration thread. Thoughts ? Thanks, Mathieu > + TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(), > + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); > + } > + > + pthread_join(migration_thread, NULL); > + > + kvm_vm_free(vm); > + > + sys_rseq(RSEQ_FLAG_UNREGISTER); > + > + return 0; > +} > -- > 2.33.0.rc1.237.g0d66db33f3-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
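For reference, the mutex idea would look roughly like this sketch, using the selftest's existing variables; as the follow-up below explains, serializing the two sides this way also serializes the timing enough to hide the bug, which is why it was not pursued.

static pthread_mutex_t affinity_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Migration thread: exclude CPU-id readers while changing affinity. */
	pthread_mutex_lock(&affinity_lock);
	r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
	pthread_mutex_unlock(&affinity_lock);

	/* vCPU thread: snapshot both CPU ids without a racing migration. */
	pthread_mutex_lock(&affinity_lock);
	cpu = sched_getcpu();
	rseq_cpu = READ_ONCE(__rseq.cpu_id);
	pthread_mutex_unlock(&affinity_lock);

	TEST_ASSERT(rseq_cpu == cpu,
		    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);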
Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 19, 2021, at 7:33 PM, Sean Christopherson sea...@google.com wrote: > On Thu, Aug 19, 2021, Mathieu Desnoyers wrote: >> - On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com >> wrote: >> >> > Add a test to verify an rseq's CPU ID is updated correctly if the task is >> > migrated while the kernel is handling KVM_RUN. This is a regression test >> > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer >> > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM >> > without updating rseq, leading to a stale CPU ID and other badness. >> > >> > Signed-off-by: Sean Christopherson >> > --- >> >> [...] >> >> > + while (!done) { >> > + vcpu_run(vm, VCPU_ID); >> > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, >> > + "Guest failed?"); >> > + >> > + cpu = sched_getcpu(); >> > + rseq_cpu = READ_ONCE(__rseq.cpu_id); >> > + >> > + /* >> > + * Verify rseq's CPU matches sched's CPU, and that sched's CPU >> > + * is stable. This doesn't handle the case where the task is >> > + * migrated between sched_getcpu() and reading rseq, and again >> > + * between reading rseq and sched_getcpu(), but in practice no >> > + * false positives have been observed, while on the other hand >> > + * blocking migration while this thread reads CPUs messes with >> > + * the timing and prevents hitting failures on a buggy kernel. >> > + */ >> >> I think you could get a stable cpu id between sched_getcpu and >> __rseq_abi.cpu_id >> if you add a pthread mutex to protect: >> >> sched_getcpu and __rseq_abi.cpu_id reads >> >> vs >> >> sched_setaffinity calls within the migration thread. >> >> Thoughts ? > > I tried that and couldn't reproduce the bug. That's what I attempted to call > out > in the blurb "blocking migration while this thread reads CPUs ... prevents > hitting > failures on a buggy kernel". > > I considered adding arbitrary delays around the mutex to try and hit the bug, > but > I was worried that even if I got it "working" for this bug, the test would be > too > tailored to this bug and potentially miss future regression. Letting the two > threads run wild seemed like it would provide the best coverage, at the cost > of > potentially causing to false failures. OK, so your point is that using mutual exclusion to ensure stability of the cpu id changes the timings too much, to a point where the issues don't reproduce. I understand that this mutex ties the migration thread timings to the vcpu thread's use of the mutex, which will reduce timings randomness, which is unwanted here. I still really hate flakiness in tests, because then people stop caring when they fail once in a while. And with the nature of rseq, a once-in-a-while failure is a big deal. Let's see if we can use other tricks to ensure stability of the cpu id without changing timings too much. One idea would be to use a seqcount lock. But even if we use that, I'm concerned that the very long writer critical section calling sched_setaffinity would need to be alternated with a sleep to ensure the read-side progresses. The sleep delay could be relatively small compared to the duration of the sched_setaffinity call, e.g. ratio 1:10. static volatile uint64_t seqcnt; The thread responsible for setting the affinity would do something like: for (;;) { atomic_inc_seq_cst(&seqcnt); sched_setaffinity(..., n++ % nr_cpus); atomic_inc_seq_cst(&seqcnt); usleep(1); /* this is where read-side is allowed to progress. 
*/ } And the thread reading the rseq cpu id and calling sched_getcpu(): uint64_t snapshot; do { snapshot = atomic_load(&seqcnt) & ~1; /* force retry if odd */ smp_rmb(); cpu = sched_getcpu(); rseq_cpu = READ_ONCE(__rseq.cpu_id); smp_rmb(); } while (snapshot != atomic_load(&seqcnt)); So the reader retry the cpu id reads whenever sched_setaffinity is being called by the migration thread, and whenever it is preempted for more than one migration thread loop. That should achieve our goal of providing cpu id stability without significantly changing the timings of the migration thread, given that it never blocks waiting for the reader. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
- On Aug 19, 2021, at 7:48 PM, Sean Christopherson sea...@google.com wrote: > On Thu, Aug 19, 2021, Mathieu Desnoyers wrote: >> - On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com >> wrote: >> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs) >> > * If not nested over a rseq critical section, restart is useless. >> > * Clear the rseq_cs pointer and return. >> > */ >> > - if (!in_rseq_cs(ip, &rseq_cs)) >> > + if (!regs || !in_rseq_cs(ip, &rseq_cs)) >> >> I think clearing the thread's rseq_cs unconditionally here when regs is NULL >> is not the behavior we want when this is called from xfer_to_guest_mode_work. >> >> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) >> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to >> kill this application in the rseq_syscall handler when exiting back to >> usermode >> when the ioctl eventually returns. >> >> However, clearing the thread's rseq_cs will prevent this from happening. >> >> So I would favor an approach where we simply do: >> >> if (!regs) >> return 0; >> >> Immediately at the beginning of rseq_ip_fixup, before getting the instruction >> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, >> it >> is not relevant to do any fixup here, because it is nested in a ioctl system >> call. >> >> Effectively, this would preserve the SIGSEGV behavior when this ioctl is >> erroneously called by user-space from a rseq critical section. > > Ha, that's effectively what I implemented first, but I changed it because of > the > comment in clear_rseq_cs() that says: > > The rseq_cs field is set to NULL on preemption or signal delivery ... as well > as well as on top of code outside of the rseq assembly block. > > Which makes it sound like something might rely on clearing rseq_cs? This comment is describing succinctly the lazy clear scheme for rseq_cs. Without the lazy clear scheme, a rseq c.s. would look like: * init(rseq_cs) * cpu = TLS->rseq::cpu_id_start * [1] TLS->rseq::rseq_cs = rseq_cs * [start_ip] * [2] if (cpu != TLS->rseq::cpu_id) * goto abort_ip; * [3] * [post_commit_ip] * [4] TLS->rseq::rseq_cs = NULL But as a fast-path optimization, [4] is not entirely needed because the rseq_cs descriptor contains information about the instruction pointer range of the critical section. Therefore, userspace can omit [4], but if the kernel never clears it, it means that it will have to re-read the rseq_cs descriptor's content each time it needs to check it to confirm that it is not nested over a rseq c.s.. So making the kernel lazily clear the rseq_cs pointer is just an optimization which ensures that the kernel won't do useless work the next time it needs to check rseq_cs, given that it has already validated that the userspace code is currently not within the rseq c.s. currently advertised by the rseq_cs field. > > Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in > an > rseq critical section, and because syscalls in critical sections are illegal, > by > definition clearing rseq_cs is a nop unless userspace is misbehaving. Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ code executed when returning from ioctl to userspace will be able to validate that it is not nested within a rseq critical section. > > If that's true, what about explicitly checking that at NOTIFY_RESUME? Or is > it > not worth the extra code to detect an error that will likely be caught > anyways? 
The error will indeed already be caught on return from ioctl to userspace, so I don't see any added value in duplicating this check. Thanks, Mathieu > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 35f7bd0fced0..28b8342290b0 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, > struct pt_regs *regs) > >if (unlikely(t->flags & PF_EXITING)) >return; > + if (!regs) { > +#ifdef CONFIG_DEBUG_RSEQ > + if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs)) > + goto error; > +#endif > + return; > + } >ret = rseq_ip_fixup(regs); >if (unlikely(ret < 0)) >goto error; > >> Thanks for looking into this ! >> >> Mathieu >> >> >return clear_rseq_cs(t); >> >ret = rseq_need_restart(t, rseq_cs.flags); >> >if (ret <= 0) >> > -- >> > 2.33.0.rc1.237.g0d66db33f3-goog >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. > > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
- On Aug 20, 2021, at 6:49 PM, Sean Christopherson sea...@google.com wrote: > Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to > transferring to a KVM guest, which is roughly equivalent to an exit to > userspace and processes many of the same pending actions. While the task > cannot be in an rseq critical section as the KVM path is reachable only > by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a > critical section still apply, e.g. the current CPU needs to be updated if > the task is migrated. > > Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults > and other badness in userspace VMMs that use rseq in combination with KVM, > e.g. due to the CPU ID being stale after task migration. Acked-by: Mathieu Desnoyers > > Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") > Reported-by: Peter Foley > Bisected-by: Doug Evans > Cc: Shakeel Butt > Cc: Thomas Gleixner > Cc: sta...@vger.kernel.org > Signed-off-by: Sean Christopherson > --- > kernel/entry/kvm.c | 4 +++- > kernel/rseq.c | 14 +++--- > 2 files changed, 14 insertions(+), 4 deletions(-) > > diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c > index 49972ee99aff..049fd06b4c3d 100644 > --- a/kernel/entry/kvm.c > +++ b/kernel/entry/kvm.c > @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, > unsigned long ti_work) > if (ti_work & _TIF_NEED_RESCHED) > schedule(); > > - if (ti_work & _TIF_NOTIFY_RESUME) > + if (ti_work & _TIF_NOTIFY_RESUME) { > tracehook_notify_resume(NULL); > + rseq_handle_notify_resume(NULL, NULL); > + } > > ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); > if (ret) > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 35f7bd0fced0..6d45ac3dae7f 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -282,9 +282,17 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, > struct pt_regs *regs) > > if (unlikely(t->flags & PF_EXITING)) > return; > - ret = rseq_ip_fixup(regs); > - if (unlikely(ret < 0)) > - goto error; > + > + /* > + * regs is NULL if and only if the caller is in a syscall path. Skip > + * fixup and leave rseq_cs as is so that rseq_sycall() will detect and > + * kill a misbehaving userspace on debug kernels. > + */ > + if (regs) { > + ret = rseq_ip_fixup(regs); > + if (unlikely(ret < 0)) > + goto error; > + } > if (unlikely(rseq_update_cpu_id(t))) > goto error; > return; > -- > 2.33.0.rc2.250.ged5fa647cd-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
_code); > + > + pthread_create(&migration_thread, NULL, migration_worker, 0); > + > + while (!done) { > + vcpu_run(vm, VCPU_ID); > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, > + "Guest failed?"); > + > + /* > + * Verify rseq's CPU matches sched's CPU. Ensure migration > + * doesn't occur between sched_getcpu() and reading the rseq > + * cpu_id by rereading both if the sequence count changes, or > + * if the count is odd (migration in-progress). > + */ > + do { > + /* > + * Drop bit 0 to force a mismatch if the count is odd, > + * i.e. if a migration is in-progress. > + */ > + snapshot = atomic_read(&seq_cnt) & ~1; > + smp_rmb(); > + cpu = sched_getcpu(); > + rseq_cpu = READ_ONCE(__rseq.cpu_id); > + smp_rmb(); > + } while (snapshot != atomic_read(&seq_cnt)); > + > + TEST_ASSERT(rseq_cpu == cpu, > + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); > + } > + > + pthread_join(migration_thread, NULL); > + > + kvm_vm_free(vm); > + > + sys_rseq(RSEQ_FLAG_UNREGISTER); > + > + return 0; > +} > -- > 2.33.0.rc2.250.ged5fa647cd-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com