Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
* Peter Zijlstra (pet...@infradead.org) wrote:
[...]

Hi Peter,

Looking at this simplified version of perf's ring buffer synchronization,
I get concerned about the following issue:

> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
>
> struct buffer {
>         u64 size;
>         u64 tail;
>         u64 head;
>         void *data;
> };
>
> struct buffer *kbuf, *ubuf;
>
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
>         u64 tail, head, offset;
>
>         do {
>                 tail = ACCESS_ONCE(ubuf->tail);
>                 offset = head = kbuf->head;
>                 if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
>                         /* discard @buf */
>                         return;
>                 }
>                 head += sz;
>         } while (local_cmpxchg(&kbuf->head, offset, head) != offset)
>

Let's suppose we have a thread executing kbuf_write(), interrupted by an
IRQ or NMI right after a successful local_cmpxchg() (space reservation in
the buffer). If the nested execution context also calls kbuf_write(), it
will therefore update ubuf->head (below) with the second reserved space,
and only after that will it return to the original thread context and
continue executing kbuf_write(), thus overwriting ubuf->head with the
prior-to-last reserved offset.

All this probably works OK most of the time, when we have an event flow
guaranteeing that a following event will fix things up, but there appears
to be a risk of losing events near the end of the trace when those are in
nested execution contexts.

Thoughts?

Thanks,

Mathieu

>         /*
>          * Ensure that if we see the userspace tail (ubuf->tail) such
>          * that there is space to write @buf without overwriting data
>          * userspace hasn't seen yet, we won't in fact store data before
>          * that read completes.
>          */
>
>         smp_mb(); /* A, matches with D */
>
>         memcpy(kbuf->data + offset, buf, sz);
>
>         /*
>          * Ensure that we write all the @buf data before we update the
>          * userspace visible ubuf->head pointer.
>          */
>         smp_wmb(); /* B, matches with C */
>
>         ubuf->head = kbuf->head;
> }
>
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
>         u64 head, tail;
>
>         tail = ACCESS_ONCE(ubuf->tail);
>         head = ACCESS_ONCE(ubuf->head);
>
>         /*
>          * Ensure we read the buffer boundaries before the actual buffer
>          * data...
>          */
>         smp_rmb(); /* C, matches with B */
>
>         while (tail != head) {
>                 obj = ubuf->data + tail;
>                 /* process obj */
>                 tail += obj->size;
>                 tail %= ubuf->size;
>         }
>
>         /*
>          * Ensure all data reads are complete before we issue the
>          * ubuf->tail update; once that update hits, kbuf_write() can
>          * observe and overwrite data.
>          */
>         smp_mb(); /* D, matches with A */
>
>         ubuf->tail = tail;
> }

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
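For illustration, the interleaving described above can be written out step by
step. The record sizes and offsets below are hypothetical, and the functions
referenced are the simplified versions quoted from Peter's mail.

/*
 * Hypothetical interleaving (record sizes and offsets invented for
 * illustration); kbuf->head == ubuf->head == 0 initially.
 *
 *   Thread context                      Nested IRQ/NMI context
 *   --------------                      ----------------------
 *   kbuf_write(64, a)
 *     local_cmpxchg() succeeds:
 *       offset = 0, kbuf->head = 64
 *                                        kbuf_write(64, b)
 *                                          local_cmpxchg() succeeds:
 *                                            offset = 64, kbuf->head = 128
 *                                          memcpy(), smp_wmb()
 *                                          ubuf->head update -> 128
 *     memcpy(), smp_wmb()
 *     ubuf->head update runs last
 *
 * If the interrupted context's final store publishes its own (older)
 * reservation rather than the current kbuf->head, ubuf->head moves
 * backwards from 128 to 64 and the consumer will not see record 'b'
 * until a later event pushes ubuf->head forward again -- the "lost
 * events near the end of the trace" scenario raised above.
 */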
[RFC PATCH 1/9] powerpc: allocate sys_membarrier system call number
Allow it to be used from SPU, since it should not have unwanted side-effects. [ Untested on this architecture. To try it out: fetch linux-next/akpm, apply this patch, build/run a membarrier-enabled kernel, and do make kselftest. ] Signed-off-by: Mathieu Desnoyers CC: Andrew Morton CC: linux-...@vger.kernel.org CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 2 +- arch/powerpc/include/uapi/asm/unistd.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 4d65499..126d0c4 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -369,3 +369,4 @@ SYSCALL_SPU(bpf) COMPAT_SYS(execveat) PPC64ONLY(switch_endian) SYSCALL_SPU(userfaultfd) +SYSCALL_SPU(membarrier) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index 4a055b6..13411be 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -12,7 +12,7 @@ #include -#define __NR_syscalls 365 +#define __NR_syscalls 366 #define __NR__exit __NR_exit #define NR_syscalls__NR_syscalls diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 6ad58d4..6337738 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -387,5 +387,6 @@ #define __NR_execveat 362 #define __NR_switch_endian 363 #define __NR_userfaultfd 364 +#define __NR_membarrier365 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [RFC PATCH 1/9] powerpc: allocate sys_membarrier system call number
- On Aug 31, 2015, at 2:54 AM, Michael Ellerman m...@ellerman.id.au wrote:

> On Thu, 2015-08-27 at 13:56 -0400, Mathieu Desnoyers wrote:
>> Allow it to be used from SPU, since it should not have unwanted
>> side-effects.
>>
>> [ Untested on this architecture. To try it out: fetch linux-next/akpm,
>> apply this patch, build/run a membarrier-enabled kernel, and do make
>> kselftest. ]
>>
>> Signed-off-by: Mathieu Desnoyers
>> CC: Andrew Morton
>> CC: linux-...@vger.kernel.org
>> CC: Benjamin Herrenschmidt
>> CC: Paul Mackerras
>> CC: Michael Ellerman
>> CC: linuxppc-dev@lists.ozlabs.org
>
> Thanks.
>
> I get:
>
> $ ./membarrier_test
> membarrier MEMBARRIER_CMD_QUERY syscall available.
> membarrier: MEMBARRIER_CMD_SHARED success.
> membarrier: tests done!
>
> Which looks good.
>
> Assuming the membarrier support hits 4.3, I'll take this via my tree for 4.3
> also.

Great! I'll take care of your comments about self-tests,

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
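For context, a minimal user-space check in the spirit of the selftest output
quoted above might look as follows. This is not the actual kselftest source;
it assumes linux/membarrier.h and __NR_membarrier are available on the build
system.

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int sys_membarrier(int cmd, int flags)
{
	/* No glibc wrapper at the time; go through syscall(2). */
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	int mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);

	if (mask < 0) {
		perror("membarrier MEMBARRIER_CMD_QUERY");
		return 1;
	}
	printf("membarrier MEMBARRIER_CMD_QUERY syscall available.\n");
	if ((mask & MEMBARRIER_CMD_SHARED) &&
	    sys_membarrier(MEMBARRIER_CMD_SHARED, 0) == 0)
		printf("membarrier: MEMBARRIER_CMD_SHARED success.\n");
	return 0;
}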
[PATCH v2 1/9] [picked] powerpc: allocate sys_membarrier system call number
Allow it to be used from SPU, since it should not have unwanted side-effects. [ Picked-by: Michael Ellerman ] Signed-off-by: Mathieu Desnoyers CC: Andrew Morton CC: linux-...@vger.kernel.org CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 2 +- arch/powerpc/include/uapi/asm/unistd.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 4d65499..126d0c4 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -369,3 +369,4 @@ SYSCALL_SPU(bpf) COMPAT_SYS(execveat) PPC64ONLY(switch_endian) SYSCALL_SPU(userfaultfd) +SYSCALL_SPU(membarrier) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index 4a055b6..13411be 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -12,7 +12,7 @@ #include -#define __NR_syscalls 365 +#define __NR_syscalls 366 #define __NR__exit __NR_exit #define NR_syscalls__NR_syscalls diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 6ad58d4..6337738 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -387,5 +387,6 @@ #define __NR_execveat 362 #define __NR_switch_endian 363 #define __NR_userfaultfd 364 +#define __NR_membarrier365 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 1.9.1 ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[RFC PATCH for 4.21 09/16] powerpc: Wire up cpu_opv system call
Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Boqun Feng CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/uapi/asm/unistd.h | 1 + 2 files changed, 2 insertions(+) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index 01b5171ea189..8f58710f5e8b 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -394,3 +394,4 @@ SYSCALL(pkey_free) SYSCALL(pkey_mprotect) SYSCALL(rseq) COMPAT_SYS(io_pgetevents) +SYSCALL(cpu_opv) diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 985534d0b448..112e2c54750a 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -400,5 +400,6 @@ #define __NR_pkey_mprotect 386 #define __NR_rseq 387 #define __NR_io_pgetevents 388 +#define __NR_cpu_opv 389 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 2.11.0
Re: [PATCH 07/14] powerpc: Add support for restartable sequences
- On May 24, 2018, at 3:03 AM, Michael Ellerman m...@ellerman.id.au wrote: > Mathieu Desnoyers writes: >> - On May 23, 2018, at 4:14 PM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: > ... >>> >>> Hi Boqun, >>> >>> I tried your patch in a ppc64 le environment, and it does not survive boot >>> with CONFIG_DEBUG_RSEQ=y. init gets killed right away. > > > Sorry this code is super gross and hard to deal with. > >> The following fixup gets ppc64 to work: >> >> --- a/arch/powerpc/kernel/entry_64.S >> +++ b/arch/powerpc/kernel/entry_64.S >> @@ -208,6 +208,7 @@ system_call_exit: >> /* Check whether the syscall is issued inside a restartable sequence >> */ >> addir3,r1,STACK_FRAME_OVERHEAD >> bl rseq_syscall >> + ld r3,RESULT(r1) >> #endif >> /* >> * Disable interrupts so current_thread_info()->flags can't change, > > I don't think that's safe. > > If you look above that, we have r3, r8 and r12 all live: > > .Lsyscall_exit: > std r3,RESULT(r1) > CURRENT_THREAD_INFO(r12, r1) > > ld r8,_MSR(r1) > #ifdef CONFIG_PPC_BOOK3S > /* No MSR:RI on BookE */ > andi. r10,r8,MSR_RI > beq-.Lunrecov_restore > #endif > > > They're all volatile across function calls: > > > http://openpowerfoundation.org/wp-content/uploads/resources/leabi/content/dbdoclet.50655240_68174.html > > > The system_call_exit symbol is actually there for kprobes and cosmetic > purposes. The actual syscall return flow starts at .Lsyscall_exit. > > So I think this would work: > > diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S > index db4df061c33a..e19f377a25e0 100644 > --- a/arch/powerpc/kernel/entry_64.S > +++ b/arch/powerpc/kernel/entry_64.S > @@ -184,6 +184,14 @@ system_call: /* label this so stack > traces look sane */ > > .Lsyscall_exit: > std r3,RESULT(r1) > + > +#ifdef CONFIG_DEBUG_RSEQ > + /* Check whether the syscall is issued inside a restartable sequence */ > + addir3,r1,STACK_FRAME_OVERHEAD > + bl rseq_syscall > + ld r3,RESULT(r1) > +#endif > + > CURRENT_THREAD_INFO(r12, r1) > > ld r8,_MSR(r1) > > > I'll try and get this series into my test setup at some point, been a > bit busy lately :) Yes, this was needed. I had this in my tree already, but there is still a kernel OOPS when running the rseq selftests on ppc64 with CONFIG_DEBUG_RSEQ=y. My current dev tree is at: https://github.com/compudj/linux-percpu-dev/tree/rseq/dev-local So considering we are at rc7 now, should I plan to removing the powerpc bits for merge window submission, or is there someone planning to spend time on fixing and testing ppc integration before the merge window opens ? Thanks, Mathieu > > cheers -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
[RFC PATCH for 4.18 08/16] powerpc: Add support for restartable sequences
From: Boqun Feng Call the rseq_handle_notify_resume() function on return to userspace if TIF_NOTIFY_RESUME thread flag is set. Perform fixup on the pre-signal when a signal is delivered on top of a restartable sequence critical section. Signed-off-by: Boqun Feng Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/Kconfig | 1 + arch/powerpc/kernel/signal.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index c32a181a7cbb..ed21a777e8c6 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -223,6 +223,7 @@ config PPC select HAVE_SYSCALL_TRACEPOINTS select HAVE_VIRT_CPU_ACCOUNTING select HAVE_IRQ_TIME_ACCOUNTING + select HAVE_RSEQ select IRQ_DOMAIN select IRQ_FORCED_THREADING select MODULES_USE_ELF_RELA diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c index 61db86ecd318..d3bb3aaaf5ac 100644 --- a/arch/powerpc/kernel/signal.c +++ b/arch/powerpc/kernel/signal.c @@ -133,6 +133,8 @@ static void do_signal(struct task_struct *tsk) /* Re-enable the breakpoints for the signal stack */ thread_change_pc(tsk, tsk->thread.regs); + rseq_signal_deliver(tsk->thread.regs); + if (is32) { if (ksig.ka.sa.sa_flags & SA_SIGINFO) ret = handle_rt_signal32(&ksig, oldset, tsk); @@ -164,6 +166,7 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags) if (thread_info_flags & _TIF_NOTIFY_RESUME) { clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume(regs); + rseq_handle_notify_resume(regs); } user_enter(); -- 2.11.0
[RFC PATCH for 4.18 09/16] powerpc: Add syscall detection for restartable sequences
From: Boqun Feng Syscalls are not allowed inside restartable sequences, so add a call to rseq_syscall() at the very beginning of system call exiting path for CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there is a syscall issued inside restartable sequences. [ Tested on 64-bit powerpc kernel by Mathieu Desnoyers. Still needs to be tested on 32-bit powerpc kernel. ] Signed-off-by: Boqun Feng Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/kernel/entry_32.S | 7 +++ arch/powerpc/kernel/entry_64.S | 8 2 files changed, 15 insertions(+) diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index eb8d01bae8c6..973577f2141c 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -365,6 +365,13 @@ syscall_dotrace_cont: blrl/* Call handler */ .globl ret_from_syscall ret_from_syscall: +#ifdef CONFIG_DEBUG_RSEQ + /* Check whether the syscall is issued inside a restartable sequence */ + stw r3,GPR3(r1) + addir3,r1,STACK_FRAME_OVERHEAD + bl rseq_syscall + lwz r3,GPR3(r1) +#endif mr r6,r3 CURRENT_THREAD_INFO(r12, r1) /* disable interrupts so current_thread_info()->flags can't change */ diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S index 51695608c68b..1c374387656a 100644 --- a/arch/powerpc/kernel/entry_64.S +++ b/arch/powerpc/kernel/entry_64.S @@ -184,6 +184,14 @@ system_call: /* label this so stack traces look sane */ .Lsyscall_exit: std r3,RESULT(r1) + +#ifdef CONFIG_DEBUG_RSEQ + /* Check whether the syscall is issued inside a restartable sequence */ + addir3,r1,STACK_FRAME_OVERHEAD + bl rseq_syscall + ld r3,RESULT(r1) +#endif + CURRENT_THREAD_INFO(r12, r1) ld r8,_MSR(r1) -- 2.11.0
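Conceptually, the rseq_syscall() debug hook wired up by this patch boils down
to the check sketched below. This is a paraphrase, not the kernel
implementation; in_rseq_critical_section() is a hypothetical helper standing
in for the real rseq_cs range check.

/* Sketch only -- not the actual kernel code. */
void rseq_syscall_sketch(struct pt_regs *regs)
{
	if (!current->rseq)
		return;
	/* Hypothetical helper: was this syscall issued from within a
	 * registered rseq critical section? */
	if (in_rseq_critical_section(current, instruction_pointer(regs)))
		force_sig(SIGSEGV);	/* syscalls are forbidden there;
					 * exact call varies by kernel version */
}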
[RFC PATCH for 4.18 10/16] powerpc: Wire up restartable sequences system call
From: Boqun Feng Wire up the rseq system call on powerpc. This provides an ABI improving the speed of a user-space getcpu operation on powerpc by skipping the getcpu system call on the fast path, as well as improving the speed of user-space operations on per-cpu data compared to using load-reservation/store-conditional atomics. Signed-off-by: Boqun Feng Signed-off-by: Mathieu Desnoyers CC: Benjamin Herrenschmidt CC: Paul Mackerras CC: Michael Ellerman CC: Peter Zijlstra CC: "Paul E. McKenney" CC: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/include/asm/systbl.h | 1 + arch/powerpc/include/asm/unistd.h | 2 +- arch/powerpc/include/uapi/asm/unistd.h | 1 + 3 files changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h index d61f9c96d916..45d4d37495fd 100644 --- a/arch/powerpc/include/asm/systbl.h +++ b/arch/powerpc/include/asm/systbl.h @@ -392,3 +392,4 @@ SYSCALL(statx) SYSCALL(pkey_alloc) SYSCALL(pkey_free) SYSCALL(pkey_mprotect) +SYSCALL(rseq) diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h index daf1ba97a00c..1e9708632dce 100644 --- a/arch/powerpc/include/asm/unistd.h +++ b/arch/powerpc/include/asm/unistd.h @@ -12,7 +12,7 @@ #include -#define NR_syscalls387 +#define NR_syscalls388 #define __NR__exit __NR_exit diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h index 389c36fd8299..ac5ba55066dd 100644 --- a/arch/powerpc/include/uapi/asm/unistd.h +++ b/arch/powerpc/include/uapi/asm/unistd.h @@ -398,5 +398,6 @@ #define __NR_pkey_alloc384 #define __NR_pkey_free 385 #define __NR_pkey_mprotect 386 +#define __NR_rseq 387 #endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */ -- 2.11.0
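To make the getcpu fast path concrete, a rough user-space sketch follows.
Assumptions: linux/rseq.h provides struct rseq, __NR_rseq is defined, no other
component (e.g. libc) has already registered rseq for the thread, and the
signature value is an arbitrary application-chosen constant.

#include <linux/rseq.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RSEQ_SIG	0x53053053	/* application-chosen signature */

static __thread struct rseq rseq_area __attribute__((aligned(32)));

static int rseq_register_current_thread(void)
{
	return syscall(__NR_rseq, &rseq_area, sizeof(rseq_area), 0, RSEQ_SIG);
}

static int rseq_current_cpu(void)
{
	/* A plain load of the kernel-maintained field replaces the
	 * getcpu/sched_getcpu system call on the fast path. */
	return __atomic_load_n(&rseq_area.cpu_id, __ATOMIC_RELAXED);
}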
Re: [RFC PATCH for 4.18 09/16] powerpc: Add syscall detection for restartable sequences
- On Jun 5, 2018, at 1:21 AM, Michael Ellerman m...@ellerman.id.au wrote:

> Mathieu Desnoyers writes:
>> From: Boqun Feng
>>
>> Syscalls are not allowed inside restartable sequences, so add a call to
>> rseq_syscall() at the very beginning of system call exiting path for
>> CONFIG_DEBUG_RSEQ=y kernel. This could help us to detect whether there
>> is a syscall issued inside restartable sequences.
>>
>> [ Tested on 64-bit powerpc kernel by Mathieu Desnoyers. Still needs to
>> be tested on 32-bit powerpc kernel. ]
>>
>> Signed-off-by: Boqun Feng
>> Signed-off-by: Mathieu Desnoyers
>> CC: Benjamin Herrenschmidt
>> CC: Paul Mackerras
>> CC: Michael Ellerman
>> CC: Peter Zijlstra
>> CC: "Paul E. McKenney"
>> CC: linuxppc-dev@lists.ozlabs.org
>> ---
>> arch/powerpc/kernel/entry_32.S | 7 +++
>> arch/powerpc/kernel/entry_64.S | 8
>> 2 files changed, 15 insertions(+)
>
> I don't _love_ the #ifdefs in here, but they look correct and there's
> not really a better option until we rewrite the syscall handler in C.
>
> The rseq selftests passed for me with this applied and enabled. So if
> you like here's some tags:
>
> Tested-by: Michael Ellerman
> Acked-by: Michael Ellerman
>

Adding your ack to the series. Thanks!

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH for 4.18 10/16] powerpc: Wire up restartable sequences system call
- On Jun 5, 2018, at 1:18 AM, Michael Ellerman m...@ellerman.id.au wrote: > Mathieu Desnoyers writes: > >> From: Boqun Feng >> >> Wire up the rseq system call on powerpc. >> >> This provides an ABI improving the speed of a user-space getcpu >> operation on powerpc by skipping the getcpu system call on the fast >> path, as well as improving the speed of user-space operations on per-cpu >> data compared to using load-reservation/store-conditional atomics. >> >> Signed-off-by: Boqun Feng >> Signed-off-by: Mathieu Desnoyers >> CC: Benjamin Herrenschmidt >> CC: Paul Mackerras >> CC: Michael Ellerman >> CC: Peter Zijlstra >> CC: "Paul E. McKenney" >> CC: linuxppc-dev@lists.ozlabs.org >> --- >> arch/powerpc/include/asm/systbl.h | 1 + >> arch/powerpc/include/asm/unistd.h | 2 +- >> arch/powerpc/include/uapi/asm/unistd.h | 1 + >> 3 files changed, 3 insertions(+), 1 deletion(-) > > Looks fine to me. > > I don't have any other new syscalls in my next, so this should not > conflict with anything for 4.18. > > Acked-by: Michael Ellerman (powerpc) Added your ack to the series, thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Appropriate liburcu cache line size for Power
Hi, In the powerpc architecture support within the liburcu project [1] we have a cache line size defined as 256 bytes with the following comment: /* Include size of POWER5+ L3 cache lines: 256 bytes */ #define CAA_CACHE_LINE_SIZE 256 I recently received a pull request on github [2] asking to change this to 128 bytes. All the material provided supports that the cache line sizes on powerpc are 128 bytes or less (even L3 on POWER7, POWER8, and POWER9) [3]. I wonder where the 256 bytes L3 cache line size for POWER5+ we have in liburcu comes from, and I wonder if it's the right choice for a cache line size on all powerpc, considering that the Linux kernel cache line size appear to use 128 bytes on recent Power architectures. I recall some benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line size, and I suppose this is why we came up with this value, but I don't have the detailed specs of that machine. Any feedback on this matter would be appreciated. Thanks! Mathieu [1] https://liburcu.org [2] https://github.com/urcu/userspace-rcu/pull/22 [3] https://www.7-cpu.com/ -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
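For readers less familiar with liburcu, the constant matters because it is
baked into data-structure layouts, roughly as sketched below (illustrative
only; the exact header providing CAA_CACHE_LINE_SIZE may vary across liburcu
versions). Changing the value changes structure sizes and offsets, which is
why it is effectively part of the library ABI.

#include <urcu/arch.h>	/* assumed to provide CAA_CACHE_LINE_SIZE */

/* Per-thread counter padded so that concurrently updated instances never
 * share a cache line; a too-small constant reintroduces false sharing,
 * a too-large one only wastes memory. */
struct per_thread_counter {
	unsigned long count;
} __attribute__((aligned(CAA_CACHE_LINE_SIZE)));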
Re: Appropriate liburcu cache line size for Power
On 2024-03-26 03:19, Michael Ellerman wrote:
> Mathieu Desnoyers writes:
>> Hi,
>
> Hi Mathieu,
>
>> In the powerpc architecture support within the liburcu project [1]
>> we have a cache line size defined as 256 bytes with the following
>> comment:
>>
>> /* Include size of POWER5+ L3 cache lines: 256 bytes */
>> #define CAA_CACHE_LINE_SIZE 256
>>
>> I recently received a pull request on github [2] asking to change
>> this to 128 bytes. All the material provided supports that the cache
>> line sizes on powerpc are 128 bytes or less (even L3 on POWER7,
>> POWER8, and POWER9) [3].
>>
>> I wonder where the 256 bytes L3 cache line size for POWER5+ we have
>> in liburcu comes from, and I wonder if it's the right choice for a
>> cache line size on all powerpc, considering that the Linux kernel
>> cache line size appear to use 128 bytes on recent Power architectures.
>>
>> I recall some benchmark experiments Paul and I did on a 64-core
>> 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line
>> size, and I suppose this is why we came up with this value, but I
>> don't have the detailed specs of that machine.
>>
>> Any feedback on this matter would be appreciated.
>
> The ISA doesn't specify the cache line size, other than it is smaller
> than a page.
>
> In practice all the 64-bit IBM server CPUs I'm aware of have used 128
> bytes. There are some 64-bit CPUs that use 64 bytes, eg. pasemi PA6T
> and Freescale e6500.
>
> It is possible to discover at runtime via AUXV headers. But that's no
> use if you want a compile-time constant.

Indeed, and this CAA_CACHE_LINE_SIZE is part of the liburcu powerpc ABI,
so changing this would require a soname bump, which I don't want to do
without really good reasons.

> I'm happy to run some benchmarks if you can point me at what to run.
> I had a poke around the repository and found short_bench, but it
> seemed to run for a very long time.

I've created a dedicated test program for this, see:

https://github.com/compudj/userspace-rcu-dev/tree/false-sharing

Example use:

On an AMD Ryzen 7 PRO 6850U with Radeon Graphics:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 21320
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 22657
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 47599
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 531364
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 523634
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 519402
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 520651
1..1

Would point to false-sharing starting with cache lines smaller than 64
bytes. I get similar results (false-sharing under 64 bytes) with an AMD
EPYC 9654 96-Core Processor.

The test program runs 4 threads by default, which can be overridden with
"-t N". This may be needed if you want it to use all cores of a larger
machine. See "-h" for options.

On a POWER9 (architected), altivec supported:

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 12264
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 12276
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 25638
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 39934
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 53971
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 53599
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 53962
1..1

This points at false-sharing below a 128-byte stride.

On an e6500, altivec supported, Model 2.0 (pvr 8040 0120):

for a in 8 16 32 64 128 256 512; do tests/unit/test_false_sharing -s $a; done
ok 1 - Stride 8 bytes, increments per ms per thread: 9049
1..1
ok 1 - Stride 16 bytes, increments per ms per thread: 9054
1..1
ok 1 - Stride 32 bytes, increments per ms per thread: 18643
1..1
ok 1 - Stride 64 bytes, increments per ms per thread: 37417
1..1
ok 1 - Stride 128 bytes, increments per ms per thread: 37906
1..1
ok 1 - Stride 256 bytes, increments per ms per thread: 37870
1..1
ok 1 - Stride 512 bytes, increments per ms per thread: 37899
1..1

Which points at false-sharing below 64 bytes.

I prefer to be cautious about this cache line size value and aim for a
value which takes into account the largest known cache line size for an
architecture, rather than use one that is too small, given the large
overhead caused by false sharing.

Feedback is welcome.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
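For readers who cannot fetch the branch above, the measurement amounts to
something like the following sketch. It is not the actual
tests/unit/test_false_sharing source; thread count, run time, and reporting
are simplified. Each thread increments its own counter, counters are placed
'stride' bytes apart, and per-thread throughput collapses once two counters
share a cache line.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_THREADS	4
#define MAX_STRIDE	512

static char area[NR_THREADS * MAX_STRIDE] __attribute__((aligned(MAX_STRIDE)));
static size_t stride = 64;
static volatile int stop;

static void *worker(void *arg)
{
	/* Each thread owns one counter, 'stride' bytes away from its neighbour. */
	volatile uintptr_t *counter =
		(uintptr_t *)(area + (uintptr_t)arg * stride);

	while (!stop)
		(*counter)++;
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NR_THREADS];
	uintptr_t i, total = 0;

	if (argc > 1)
		stride = strtoul(argv[1], NULL, 0);
	if (stride < sizeof(uintptr_t) || stride > MAX_STRIDE ||
	    stride % sizeof(uintptr_t)) {
		fprintf(stderr, "invalid stride\n");
		return 1;
	}
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid[i], NULL, worker, (void *)i);
	sleep(1);
	stop = 1;
	for (i = 0; i < NR_THREADS; i++) {
		pthread_join(tid[i], NULL);
		total += *(uintptr_t *)(area + i * stride);
	}
	printf("stride %zu bytes: %lu increments per thread per second\n",
	       stride, (unsigned long)(total / NR_THREADS));
	return 0;
}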
Re: Appropriate liburcu cache line size for Power
On 2024-03-25 16:34, Nathan Lynch wrote: Mathieu Desnoyers writes: In the powerpc architecture support within the liburcu project [1] we have a cache line size defined as 256 bytes with the following comment: /* Include size of POWER5+ L3 cache lines: 256 bytes */ #define CAA_CACHE_LINE_SIZE 256 I recently received a pull request on github [2] asking to change this to 128 bytes. All the material provided supports that the cache line sizes on powerpc are 128 bytes or less (even L3 on POWER7, POWER8, and POWER9) [3]. I wonder where the 256 bytes L3 cache line size for POWER5+ we have in liburcu comes from, and I wonder if it's the right choice for a cache line size on all powerpc, considering that the Linux kernel cache line size appear to use 128 bytes on recent Power architectures. I recall some benchmark experiments Paul and I did on a 64-core 1.9GHz POWER5+ machine that benefited from a 256 bytes cache line size, and I suppose this is why we came up with this value, but I don't have the detailed specs of that machine. Any feedback on this matter would be appreciated. For what it's worth, I found a copy of an IBM Journal of Research & Development article confirming that POWER5's L3 had a 256-byte line size: Each slice [of the L3] is 12-way set-associative, with 4,096 congruence classes of 256-byte lines managed as two 128-byte sectors to match the L2 line size. https://www.eecg.utoronto.ca/~moshovos/ACA08/readings/power5.pdf I don't know of any reason to prefer 256 over 128 for current Power processors though. Thanks for the pointer. I will add a reference to it in the liburcu source code to explain the cache line size choice. Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH printk v3 00/40] reduce console_lock scope
On 2022-11-07 09:15, John Ogness wrote:
[...]
> The base commit for this series is from Paul McKenney's RCU tree and
> provides an NMI-safe SRCU implementation [1]. Without the NMI-safe
> SRCU implementation, this series is not less safe than mainline. But
> we will need the NMI-safe SRCU implementation for atomic consoles
> anyway, so we might as well get it in now. Especially since it _does_
> increase the reliability for mainline in the panic path.

So, your email got me to review the SRCU nmi-safe series:

[1] https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/log/?h=srcunmisafe.2022.10.21a

Especially this commit:

https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git/commit/?h=srcunmisafe.2022.10.21a&id=5d0f5953b60f5f7a278085b55ddc73e2932f4c33

I disagree with the overall approach taken there, which is to create yet
another SRCU flavor, this time with explicit "nmi-safe" read-locks. This
adds complexity to the kernel APIs and I think we can be clever about
this and make SRCU nmi-safe without requiring a whole new incompatible
API.

You can find the basic idea needed to achieve this in the libside RCU
user-space implementation. I needed to introduce a split-counter concept
to support rseq vs atomics to keep track of per-cpu grace period
counters. The "rseq" counter is the fast-path, but if rseq fails, the
abort handler uses the atomic counter instead.

https://github.com/compudj/side/blob/main/src/rcu.h#L23

struct side_rcu_percpu_count {
        uintptr_t begin;
        uintptr_t rseq_begin;
        uintptr_t end;
        uintptr_t rseq_end;
} __attribute__((__aligned__(SIDE_CACHE_LINE_SIZE)));

The idea is to "split" each percpu counter into two counters, one for
rseq, and the other for atomics. When a grace period wants to observe
the value of a percpu counter, it simply sums the two counters:

https://github.com/compudj/side/blob/main/src/rcu.c#L112

The same idea can be applied to SRCU in the kernel: one counter for
percpu ops, and the other counter for nmi context, so basically:

srcu_read_lock()

  if (likely(!in_nmi()))
    increment the percpu-ops lock counter
  else
    increment the atomic lock counter

srcu_read_unlock()

  if (likely(!in_nmi()))
    increment the percpu-ops unlock counter
  else
    increment the atomic unlock counter

Then in the grace period sum the percpu-ops and the atomic values
whenever each counter value is read.

This would allow SRCU to be NMI-safe without requiring the callers to
explicitly state whether they need to be nmi-safe or not, and would only
take the overhead of the atomics in the NMI handlers rather than for all
users which happen to use SRCU read locks shared with nmi handlers.

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
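To make the proposal concrete, a rough kernel-side sketch of the split-counter
idea follows. The type and function names are illustrative, not taken from the
SRCU code; treat this as a sketch of the summing scheme, not a working patch.

/* One counter pair per CPU: the first half is updated with cheap per-CPU
 * ops from normal context, the second with atomics from NMI context. */
struct srcu_split_count {
	unsigned long	srcu_lock_count[2];
	atomic_long_t	srcu_lock_count_nmi[2];
	unsigned long	srcu_unlock_count[2];
	atomic_long_t	srcu_unlock_count_nmi[2];
};

/* Reader side: per-CPU increment in normal context, atomic only in NMI. */
static void srcu_split_lock(struct srcu_split_count __percpu *cnt, int idx)
{
	if (likely(!in_nmi())) {
		this_cpu_inc(cnt->srcu_lock_count[idx]);
	} else {
		struct srcu_split_count *c = this_cpu_ptr(cnt);

		atomic_long_inc(&c->srcu_lock_count_nmi[idx]);
	}
}

/* Grace period side: sum both halves of every CPU's counter. */
static unsigned long srcu_split_sum_locks(struct srcu_split_count __percpu *cnt,
					  int idx)
{
	unsigned long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct srcu_split_count *c = per_cpu_ptr(cnt, cpu);

		sum += READ_ONCE(c->srcu_lock_count[idx]);
		sum += atomic_long_read(&c->srcu_lock_count_nmi[idx]);
	}
	return sum;
}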
Re: [PATCH] powerpc/ftrace: fix syscall tracing on PPC64_ELF_ABI_V1
On 2022-12-05 17:50, Michael Ellerman wrote: Michael Jeanson writes: On 2022-12-05 15:11, Michael Jeanson wrote: Michael Jeanson writes: In v5.7 the powerpc syscall entry/exit logic was rewritten in C, on PPC64_ELF_ABI_V1 this resulted in the symbols in the syscall table changing from their dot prefixed variant to the non-prefixed ones. Since ftrace prefixes a dot to the syscall names when matching them to build its syscall event list, this resulted in no syscall events being available. Remove the PPC64_ELF_ABI_V1 specific version of arch_syscall_match_sym_name to have the same behavior across all powerpc variants. This doesn't seem to work for me. Event with it applied I still don't see anything in /sys/kernel/debug/tracing/events/syscalls Did we break it in some other way recently? cheers I did some further testing, my config also enabled KALLSYMS_ALL, when I remove it there is indeed no syscall events. Aha, OK that explains it I guess. I was using ppc64_guest_defconfig which has ABI_V1 and FTRACE_SYSCALLS, but does not have KALLSYMS_ALL. So I guess there's some other bug lurking in there. I don't have the setup handy to validate it, but I suspect it is caused by the way scripts/kallsyms.c:symbol_valid() checks whether a symbol entry needs to be integrated into the assembler output when --all-symbols is not specified. It only keeps symbols which addresses are in the text range. On PPC64_ELF_ABI_V1, this means only the dot-prefixed symbols will be kept (those point to the function begin), leaving out the non-dot-prefixed symbols (those point to the function descriptors). So I see two possible solutions there: either we ensure that FTRACE_SYSCALLS selects KALLSYMS_ALL on PPC64_ELF_ABI_V1, or we modify scripts/kallsyms.c:symbol_valid() to also include function descriptor symbols. This would mean accepting symbols pointing into the .opd ELF section. IMHO the second option would be better because it does not increase the kernel image size as much as KALLSYMS_ALL. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [PATCH] powerpc/ftrace: fix syscall tracing on PPC64_ELF_ABI_V1
On 2022-12-06 21:09, Michael Ellerman wrote: Mathieu Desnoyers writes: On 2022-12-05 17:50, Michael Ellerman wrote: Michael Jeanson writes: On 2022-12-05 15:11, Michael Jeanson wrote: Michael Jeanson writes: In v5.7 the powerpc syscall entry/exit logic was rewritten in C, on PPC64_ELF_ABI_V1 this resulted in the symbols in the syscall table changing from their dot prefixed variant to the non-prefixed ones. Since ftrace prefixes a dot to the syscall names when matching them to build its syscall event list, this resulted in no syscall events being available. Remove the PPC64_ELF_ABI_V1 specific version of arch_syscall_match_sym_name to have the same behavior across all powerpc variants. This doesn't seem to work for me. Event with it applied I still don't see anything in /sys/kernel/debug/tracing/events/syscalls Did we break it in some other way recently? cheers I did some further testing, my config also enabled KALLSYMS_ALL, when I remove it there is indeed no syscall events. Aha, OK that explains it I guess. I was using ppc64_guest_defconfig which has ABI_V1 and FTRACE_SYSCALLS, but does not have KALLSYMS_ALL. So I guess there's some other bug lurking in there. I don't have the setup handy to validate it, but I suspect it is caused by the way scripts/kallsyms.c:symbol_valid() checks whether a symbol entry needs to be integrated into the assembler output when --all-symbols is not specified. It only keeps symbols which addresses are in the text range. On PPC64_ELF_ABI_V1, this means only the dot-prefixed symbols will be kept (those point to the function begin), leaving out the non-dot-prefixed symbols (those point to the function descriptors). OK. So I guess it never worked without KALLSYMS_ALL. I suspect it worked prior to kernel v5.7, because back then the PPC64 ABIv1 system call table contained pointers to the text section (beginning of functions) rather than function descriptors. So changing this system call table to point to C functions introduced the dot-prefix match issue for syscall tracing as well as a dependency on KALLSYMS_ALL. It seems like most distros enable KALLSYMS_ALL, so I guess that's why we've never noticed. Or very few people run recent PPC64 ABIv1 kernels :) So I see two possible solutions there: either we ensure that FTRACE_SYSCALLS selects KALLSYMS_ALL on PPC64_ELF_ABI_V1, or we modify scripts/kallsyms.c:symbol_valid() to also include function descriptor symbols. This would mean accepting symbols pointing into the .opd ELF section. My only worry is that will cause some other breakage, because .opd symbols are not really "text" in the normal sense, ie. you can't execute them directly. AFAIU adding the .opd section to scripts/kallsyms.c text_ranges will only affect the result of symbol_valid(), which decides which symbols get pulled into a KALLSYMS=n/KALLSYMS_ALL=n kernel build. Considering that a KALLSYMS_ALL=y build pulls those function descriptor symbols into the image without breaking anything, I don't see why adding the .opd section to this script text_ranges would have any ill side-effect. On the other hand the help for KALLSYMS_ALL says: "Normally kallsyms only contains the symbols of functions" But without .opd included that's not really true. In practice it probably doesn't really matter, because eg. backtraces will point to dot symbols which can be resolved. Indeed, I don't see this affecting backtraces, but as soon as a lookup depends on comparing the C function pointer to a function descriptor, the .opd symbols are needed. 
Not having those function descriptor symbols in KALLSYMS_ALL=n builds seems rather error-prone. IMHO the second option would be better because it does not increase the kernel image size as much as KALLSYMS_ALL. Yes I agree. Even if that did break something, any breakage would be limited to arches which uses function descriptors, which are now all rare. Yes, it would only impact those arches using function descriptors, which are broken today with respect to system call tracing. Are you aware of other architectures other than PPC64 ELF ABI v1 supported by the Linux kernel that use function descriptors ? Relatedly we have a patch in next to optionally use ABIv2 for 64-bit big endian builds. Interesting. Does it require a matching user-space ? (built with PPC64 ABIv2 ?) Does it handle legacy PPC32 executables ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
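For readers unfamiliar with ELF ABIv1 function descriptors, the reason the
.opd symbols matter can be sketched as follows (struct and helper names are
illustrative, not the kernel's): a C function pointer such as a syscall-table
entry points into .opd, not into .text, so a symbol table restricted to the
text range cannot resolve it back to a name.

/* Illustrative only -- names do not match the kernel's definitions. */
struct elfv1_func_desc {
	unsigned long entry;	/* address of the dot-symbol in .text */
	unsigned long toc;	/* TOC pointer value for the callee */
	unsigned long env;	/* environment pointer (unused by C) */
};

/* Given a C function pointer (e.g. a sys_call_table entry), recover the
 * .text address that a dot-prefixed kallsyms entry would describe. */
static unsigned long text_entry_of(void *func)
{
	return ((struct elfv1_func_desc *)func)->entry;
}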
Re: [powerpc] Boot failure kernel BUG at mm/usercopy.c:102
On 2023-01-10 05:42, Sachin Sant wrote: 6.2.0-rc3-next-20230109 fails to boot on powerpc with following: [ 0.444834] [ cut here ] [ 0.444838] kernel BUG at mm/usercopy.c:102! [ 0.444842] Oops: Exception in kernel mode, sig: 5 [#1] [ 0.444845] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries [ 0.444849] Modules linked in: [ 0.444853] CPU: 23 PID: 201 Comm: modprobe Not tainted 6.2.0-rc3-next-20230109 #1 [ 0.444858] Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1030.00 (NH1030_026) hv:phyp pSeries [ 0.444862] NIP: c055a934 LR: c055a930 CTR: 00725d90 [ 0.444866] REGS: c7f937c0 TRAP: 0700 Not tainted (6.2.0-rc3-next-20230109) [ 0.444871] MSR: 80029033 CR: 28088822 XER: 0007 [ 0.444879] CFAR: c02012a8 IRQMASK: 0 [ 0.444879] GPR00: c055a930 c7f93a60 c13b0800 0066 [ 0.444879] GPR04: 7fff c7f93880 c7f93878 [ 0.444879] GPR08: 7fff c25e7150 c29672b8 [ 0.444879] GPR12: 48088824 c00e87bf6300 c017f458 c34b8100 [ 0.444879] GPR16: 00012f18eac0 7fffc5c095d0 7fffc5c095d8 00012f140040 [ 0.444879] GPR20: fcff 001f 5455 0008 [ 0.444879] GPR24: c723a0c0 7fffc5c09368 7fffc5c09370 [ 0.444879] GPR28: 0250 0001 c3017000 c723a0c0 [ 0.444922] NIP [c055a934] usercopy_abort+0xa4/0xb0 [ 0.444928] LR [c055a930] usercopy_abort+0xa0/0xb0 [ 0.444932] Call Trace: [ 0.444933] [c7f93a60] [c055a930] usercopy_abort+0xa0/0xb0 (unreliable) [ 0.444939] [c7f93ad0] [c050eeb8] __check_heap_object+0x198/0x1d0 [ 0.444945] [c7f93b10] [c055a7e0] __check_object_size+0x290/0x340 [ 0.444949] [c7f93b50] [c060eba4] create_elf_tables.isra.20+0xc04/0xc90 [ 0.444956] [c7f93c10] [c0610b2c] load_elf_binary+0xdac/0x1320 [ 0.444962] [c7f93d00] [c0571cf0] bprm_execve+0x3d0/0x7c0 [ 0.444966] [c7f93dc0] [c0572b9c] kernel_execve+0x1ac/0x270 [ 0.444971] [c7f93e10] [c017f5cc] call_usermodehelper_exec_async+0x17c/0x250 [ 0.444978] [c7f93e50] [c000e054] ret_from_kernel_thread+0x5c/0x64 [ 0.444983] --- interrupt: 0 at 0x0 [ 0.444986] NIP: LR: CTR: [ 0.444990] REGS: c7f93e80 TRAP: Not tainted (6.2.0-rc3-next-20230109) [ 0.444994] MSR: <> CR: XER: [ 0.444998] CFAR: IRQMASK: 0 [ 0.444998] GPR00: c7f94000 [ 0.444998] GPR04: [ 0.444998] GPR08: [ 0.444998] GPR12: c017f458 c34b8100 [ 0.444998] GPR16: [ 0.444998] GPR20: [ 0.444998] GPR24: [ 0.444998] GPR28: [ 0.445039] NIP [] 0x0 [ 0.445042] LR [] 0x0 [ 0.445044] --- interrupt: 0 [ 0.445046] Code: 392990f8 4814 3d02ffe9 390827f0 7d074378 7d094378 7c661b78 3c62ffe7 f9610060 386319f0 4bca6935 6000 <0fe0> 7c0802a6 [ 0.445061] ---[ end trace ]— Git bisect points to following patch: commit 317c8194e6aeb8b3b573ad139fc2a0635856498e rseq: Introduce feature size and alignment ELF auxiliary vector entries Reverting the patch helps boot the kernel. Can you try with the following fix ? https://lore.kernel.org/r/20230104192054.34046-1-mathieu.desnoy...@efficios.com "rseq: Fix: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries" Peter, I suspect it would be good to merge that fix into tip/master though sched/core. Thanks, Mathieu Thanks -Sachin -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [2.6.24 patch] restore Cell OProfile support
* Adrian Bunk ([EMAIL PROTECTED]) wrote: > This patch restores the Cell OProfile support that was killed by > commit 09cadedbdc01f1a4bea1f427d4fb4642eaa19da9. > > Signed-off-by: Adrian Bunk <[EMAIL PROTECTED]> > Same here : why isn't this on by default when SPU_FS and OPROFILE are enabled ? > --- > > kernel/Kconfig.instrumentation |8 > 1 file changed, 8 insertions(+) > > 6d1446d35d367fdbdd2a0a29e0d156646ff40630 > diff --git a/kernel/Kconfig.instrumentation b/kernel/Kconfig.instrumentation > index a76635d..9d143e6 100644 > --- a/kernel/Kconfig.instrumentation > +++ b/kernel/Kconfig.instrumentation > @@ -53,2 +53,10 @@ config HARDWARE_PM > > +config OPROFILE_CELL > + bool "OProfile for Cell Broadband Engine" > + depends on PPC && (SPU_FS = y && OPROFILE = m) || (SPU_FS = y && > OPROFILE = y) > + default y > + help > + Profiling of Cell BE SPUs requires special support enabled > + by this option. > + > config KPROBES > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[2.6.24 patch] Fix Cell OProfile support
This patch restores the Cell OProfile support that was killed by commit 09cadedbdc01f1a4bea1f427d4fb4642eaa19da9. It puts it in arch/powerpc/Kconfig. Since I don't see any good reason to leave this as a supplementary user-selectable option, it is now automatically enabled whenever SPU_FS and OPROFILE are enabled. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Adrian Bunk <[EMAIL PROTECTED]> CC: Randy Dunlap <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org CC: [EMAIL PROTECTED] --- arch/powerpc/Kconfig |4 1 file changed, 4 insertions(+) Index: linux-2.6-lttng/arch/powerpc/Kconfig === --- linux-2.6-lttng.orig/arch/powerpc/Kconfig 2007-12-28 17:00:24.0 -0500 +++ linux-2.6-lttng/arch/powerpc/Kconfig2007-12-28 17:00:39.0 -0500 @@ -163,6 +163,10 @@ config PPC_OF_PLATFORM_PCI depends on PPC64 # not supported on 32 bits yet default n +config OPROFILE_CELL + def_bool y + depends on (SPU_FS = y && OPROFILE = m) || (SPU_FS = y && OPROFILE = y) + source "init/Kconfig" source "arch/powerpc/platforms/Kconfig" -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [2.6.24 patch] Fix Cell OProfile support
* Arnd Bergmann ([EMAIL PROTECTED]) wrote: > On Saturday 29 December 2007, Mathieu Desnoyers wrote: > > This patch restores the Cell OProfile support that was killed by > > commit 09cadedbdc01f1a4bea1f427d4fb4642eaa19da9. > > > > It puts it in arch/powerpc/Kconfig. Since I don't see any good reason to > > leave > > this as a supplementary user-selectable option, it is now automatically > > enabled > > whenever SPU_FS and OPROFILE are enabled. > > This one has already been superceded by the fix in > aed3a8c9bb1a8623a618232087c5ff62718e3b9a, which made CONFIG_OPROFILE_CELL an > automatically selected option, from arch/powerpc/platforms/cell/Kconfig. > Great, so my patch should be dropped. Thanks! Mathieu > Arnd <>< > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 05/28] Add cmpxchg64 and cmpxchg64_local to powerpc
Make sure that at least cmpxchg64_local is available on all architectures to use for unsigned long long values. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Acked-by: Paul Mackerras <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org --- include/asm-powerpc/system.h |6 ++ 1 file changed, 6 insertions(+) Index: linux-2.6-lttng/include/asm-powerpc/system.h === --- linux-2.6-lttng.orig/include/asm-powerpc/system.h 2007-09-24 10:50:11.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/system.h2007-09-24 11:01:04.0 -0400 @@ -488,6 +488,12 @@ __cmpxchg_local(volatile void *ptr, unsi */ #define NET_IP_ALIGN 0 #define NET_SKB_PADL1_CACHE_BYTES + +#define cmpxchg64 cmpxchg +#define cmpxchg64_localcmpxchg_local +#else +#include +#define cmpxchg64_local(ptr,o,n) __cmpxchg64_local_generic((ptr), (o), (n)) #endif #define arch_align_stack(x) (x) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 05/28] Add cmpxchg64 and cmpxchg64_local to powerpc
Make sure that at least cmpxchg64_local is available on all architectures to use for unsigned long long values. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Acked-by: Paul Mackerras <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org --- include/asm-powerpc/system.h | 18 -- 1 file changed, 16 insertions(+), 2 deletions(-) Index: linux-2.6-lttng/include/asm-powerpc/system.h === --- linux-2.6-lttng.orig/include/asm-powerpc/system.h 2007-10-31 17:32:44.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/system.h2007-10-31 17:56:25.0 -0400 @@ -461,7 +461,7 @@ __cmpxchg_local(volatile void *ptr, unsi return old; } -#define cmpxchg(ptr,o,n)\ +#define cmpxchg(ptr, o, n) \ ({\ __typeof__(*(ptr)) _o_ = (o); \ __typeof__(*(ptr)) _n_ = (n); \ @@ -470,7 +470,7 @@ __cmpxchg_local(volatile void *ptr, unsi }) -#define cmpxchg_local(ptr,o,n) \ +#define cmpxchg_local(ptr, o, n)\ ({\ __typeof__(*(ptr)) _o_ = (o); \ __typeof__(*(ptr)) _n_ = (n); \ @@ -490,6 +490,20 @@ __cmpxchg_local(volatile void *ptr, unsi */ #define NET_IP_ALIGN 0 #define NET_SKB_PADL1_CACHE_BYTES + +#define cmpxchg64(ptr, o, n) \ + ({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 8); \ + cmpxchg((ptr), (o), (n)); \ + }) +#define cmpxchg64_local(ptr, o, n) \ + ({ \ + BUILD_BUG_ON(sizeof(*(ptr)) != 8); \ + cmpxchg_local((ptr), (o), (n)); \ + }) +#else +#include +#define cmpxchg64_local(ptr, o, n) __cmpxchg64_local_generic((ptr), (o), (n)) #endif #define arch_align_stack(x) (x) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
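As a usage illustration (not taken from any in-tree caller), cmpxchg64_local()
permits a 64-bit read-modify-write that only needs to be atomic with respect
to the local CPU, for example against interrupts, without paying for a fully
SMP-safe cmpxchg64():

/* Hypothetical example: bump a per-CPU 64-bit statistics counter that
 * can also be updated from interrupt context on the same CPU. */
static inline void stats_add_local(u64 *counter, u64 val)
{
	u64 old, new;

	do {
		old = *counter;
		new = old + val;
	} while (cmpxchg64_local(counter, old, new) != old);
}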
2.6.23-rc2-mm2 build error on MIPS and ARM
Hi Andrew, I got the following errors when building 2.6.23-rc2-mm2 on both mips and arm. Both errors are very much alike. MIPS: /opt/crosstool/gcc-3.4.5-glibc-2.3.6/mips-unknown-linux-gnu/lib/gcc/mips-unknown-linux-gnu/3.4.5/include -D__KERNEL__ -Iinclude -Iinclude2 -I/home/compudj/git/linux-2.6-lttng/include -include include/linux/autoconf.h -I/home/compudj/git/linux-2.6-lttng/. -I. -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -mabi=32 -G 0 -mno-abicalls -fno-pic -pipe -msoft-float -ffreestanding -march=r5000 -Wa,--trap -I/home/compudj/git/linux-2.6-lttng/include/asm-mips/mach-ip22 -Iinclude/asm-mips/mach-ip22 -I/home/compudj/git/linux-2.6-lttng/include/asm-mips/mach-generic -Iinclude/asm-mips/mach-generic -D"VMLINUX_LOAD_ADDRESS=0x88002000" -fomit-frame-pointer -Wdeclaration-after-statement -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(asm_offsets)" -D"KBUILD_MODNAME=KBUILD_STR(asm_offsets)" -fverbose-asm -S -o arch/mips/kernel/asm-offsets.s /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c In file included from /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:58, from /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:13: /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:115: error: syntax error before "pgprot_t" /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:115: warning: no semicolon at end of struct or union /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:161: error: syntax error before '}' token /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:175: error: syntax error before "pgd_t" /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:175: warning: no semicolon at end of struct or union /home/compudj/git/linux-2.6-lttng/include/linux/mm_types.h:229: error: syntax error before '}' token In file included from /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:13: /home/compudj/git/linux-2.6-lttng/include/linux/sched.h: In function `mmdrop': /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1509: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/sched.h: In function `arch_pick_mmap_layout': /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1762: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1763: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/sched.h:1764: error: dereferencing pointer to incomplete type In file included from /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:14: /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: In function `vma_nonlinear_insert': /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:968: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:969: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: In function `find_vma_intersection': /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:1078: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: In function `vma_pages': /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:1085: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:1085: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c: In 
function `output_mm_defines': /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:220: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:221: error: dereferencing pointer to incomplete type /home/compudj/git/linux-2.6-lttng/arch/mips/kernel/asm-offsets.c:222: error: dereferencing pointer to incomplete type make[2]: *** [arch/mips/kernel/asm-offsets.s] Error 1 make[1]: *** [prepare0] Error 2 make: *** [_all] Error 2 ARM: /opt/crosstool/gcc-4.0.2-glibc-2.3.6/arm-unknown-linux-gnu/bin/arm-unknown-linux-gnu-gcc -Wp,-MD,arch/arm/kernel/.asm-offsets.s.d -nostdinc -isystem /opt/crosstool/gcc-4.0.2-glibc-2.3.6/arm-unknown-linux-gnu/lib/gcc/arm-unknown-linux-gnu/4.0.2/include -D__KERNEL__ -Iinclude -Iinclude2 -I/home/compudj/git/linux-2.6-lttng/include -include include/linux/autoconf.h -mlittle-endian -I/home/compudj/git/linux-2.6-lttng/. -I. -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Os -marm -fno-omit-frame-pointer -mapcs -mno-sched-prolog -mabi=apcs-gnu -mno-thumb-interwork -D__LINUX_ARM_ARCH__=4 -march=armv4 -mtune=strongarm110 -msoft-float -Uarm -fno-omit-frame-pointer -fno-optimize-sibling-calls -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_S
Re: 2.6.23-rc2-mm2 build error on MIPS and ARM
* Martin Schwidefsky ([EMAIL PROTECTED]) wrote: > On Fri, 2007-08-10 at 22:58 -0400, Mathieu Desnoyers wrote: > > I got the following errors when building 2.6.23-rc2-mm2 on both mips and > > arm. Both errors are very much alike. > > That attached patch should fix it for arm and mips. I'll try a few more > architectures until I'm fed up with compiling cross-compilers.. > It fixes the problem for me, thanks! Tested-by: Mathieu Desnoyers <[EMAIL PROTECTED]> Mathieu > -- > blue skies, > Martin. > > "Reality continues to ruin my life." - Calvin. > --- > Subject: [PATCH] move mm_struct and vm_area_struct, compile fix. > > From: Martin Schwidefsky <[EMAIL PROTECTED]> > > Include in to make mips and arm > compile again after mm_struct and vm_area_struct have been moved. > > Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]> > --- > > include/linux/mm_types.h |1 + > 1 file changed, 1 insertion(+) > > diff -urpN linux-2.6/include/linux/mm_types.h > linux-2.6-patched/include/linux/mm_types.h > --- linux-2.6/include/linux/mm_types.h2007-08-11 11:30:10.0 > +0200 > +++ linux-2.6-patched/include/linux/mm_types.h2007-08-11 > 11:32:42.0 +0200 > @@ -11,6 +11,7 @@ > #include > #include > #include > +#include > > struct address_space; > > > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 04/28] Add cmpxchg64 and cmpxchg64_local to powerpc
Make sure that at least cmpxchg64_local is available on all architectures to use for unsigned long long values. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org --- include/asm-powerpc/system.h |6 ++ 1 file changed, 6 insertions(+) Index: linux-2.6-lttng/include/asm-powerpc/system.h === --- linux-2.6-lttng.orig/include/asm-powerpc/system.h 2007-08-27 11:42:08.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/system.h2007-08-27 11:42:43.0 -0400 @@ -485,6 +485,12 @@ __cmpxchg_local(volatile void *ptr, unsi */ #define NET_IP_ALIGN 0 #define NET_SKB_PADL1_CACHE_BYTES + +#define cmpxchg64 cmpxchg +#define cmpxchg64_localcmpxchg_local +#else +#include +#define cmpxchg64_local(ptr,o,n) __cmpxchg64_local_generic((ptr), (o), (n)) #endif #define arch_align_stack(x) (x) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [patch 04/28] Add cmpxchg64 and cmpxchg64_local to powerpc
* Paul Mackerras ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers writes: > > > Make sure that at least cmpxchg64_local is available on all architectures > > to use > > for unsigned long long values. > > > > Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> > > Acked-by: Paul Mackerras <[EMAIL PROTECTED]> Thanks Paul, Just make sure add-cmpxchg-local-to-generic-for-up.patch gets merged before this one, since it needs asm-generic/cmpxchg-local.h. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[RFC patch 6/9] POWERPC remove -traditional
Subject: [RFC patch 6/9] Re: [PATCH] Stringify support commas > This is a no-no for those archs that still use -traditional. > > I dunno if this is a problem for you at the moment and the > > right fix is anyway to nuke -traditional. > > > > Sam Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Sam Ravnborg <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org --- arch/powerpc/boot/Makefile |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6-lttng/arch/powerpc/boot/Makefile === --- linux-2.6-lttng.orig/arch/powerpc/boot/Makefile 2008-04-21 10:03:05.0 -0400 +++ linux-2.6-lttng/arch/powerpc/boot/Makefile 2008-04-21 10:03:17.0 -0400 @@ -23,7 +23,7 @@ BOOTCFLAGS:= -Wall -Wundef -Wstrict- -fno-strict-aliasing -Os -msoft-float -pipe \ -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \ -isystem $(shell $(CROSS32CC) -print-file-name=include) -BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAGS) -traditional -nostdinc +BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAGS) -nostdinc ifdef CONFIG_DEBUG_INFO BOOTCFLAGS += -g -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [RFC patch 6/9] POWERPC remove -traditional
* Paul Mackerras ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers writes: > > Subject: [RFC patch 6/9] Re: [PATCH] Stringify support commas > > > This is a no-no for those archs that still use -traditional. > > > > I dunno if this is a problem for you at the moment and the > > > > right fix is anyway to nuke -traditional. > > > > > > > > Sam > > > > Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> > > CC: Sam Ravnborg <[EMAIL PROTECTED]> > > CC: [EMAIL PROTECTED] > > CC: linuxppc-dev@ozlabs.org > > --- > > arch/powerpc/boot/Makefile |2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > Looks OK. I'll put it in my tree. > > Paul. OK, although it won't be needed with v2 of the stringify with commas. There seems to be some important details which differ between -traditional and stdc gcc operation which makes it non-trivial to remove the -traditional, at least on sparc. Maybe Kyle could tell us a bit more about the nature of these limitations ? I would wait a bit before merging until we at least identify what those limitations are. Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
;> ---[ end trace ffb4a6e37d78370a ]--- >> >> --- >> Oops: Exception in kernel mode, sig: 11 [#1] >> Exsw1600 >> Modules linked in: bridge stp llc tun beacon_freescale(P) 80211(P) >> shm_freescale(P) ath_remote_regs_freescale(P) exsw >> load_local_fpga_freescale 8021q poe_fi >> NIP: c00bd6fc LR: c00bd6fc CTR: c00bd6c8 >> REGS: decd5cf0 TRAP: 0700 Tainted: P (2.6.27-rc2) >> MSR: 00029000 CR: 8882 XER: 2000 >> TASK = d9f5a9a0[6897] 'sh' THREAD: decd4000 >> GPR00: decd5da0 d9f5a9a0 dd801180 decd5de8 0004 >> fff9 >> GPR08: fffa 00d5ca28 decd5d90 2ccc1686 100ad874 1fffb200 >> >> GPR16: 007fff00 decd5ec4 0008 c02fe298 00200200 c031aa34 >> decd5de8 >> GPR24: decd5df4 1af2 dd507d00 c038 dd801180 decd5de8 >> decd5da0 >> NIP [c00bd6fc] d_lookup+0x40/0x90 >> LR [c00bd6fc] d_lookup+0x40/0x90 >> Call Trace: >> [decd5dc0] [c00bd7d8] d_hash_and_lookup+0x8c/0xc4 >> [decd5de0] [c00ed728] proc_flush_task+0xb4/0x268 >> [decd5e40] [c0032894] release_task+0x6c/0x39c >> [decd5e70] [c0033238] wait_consider_task+0x674/0x920 >> [decd5eb0] [c0033610] do_wait+0x12c/0x3d4 >> [decd5f20] [c0033954] sys_wait4+0x9c/0xe8 >> [decd5f40] [c0010554] ret_from_syscall+0x0/0x3c >> Instruction dump: >> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db52a0 >> 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b52a0 >> ---[ end trace 00106ff4ec3b9c22 ]--- >> >> >> >> > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers wrote: >> >> Steve ? I just noticed this : >> >> >> ftrace_modify_code(unsigned long ip, unsigned char *old_code, >>unsigned char *new_code) >> { >> unsigned replaced; >> unsigned old = *(unsigned *)old_code; >> unsigned new = *(unsigned *)new_code; >> int faulted = 0; >> >> /* >> * Note: Due to modules and __init, code can >> * disappear and change, we need to protect against faulting >> * as well as code changing. >> * >> * No real locking needed, this code is run through >> * kstop_machine. >> */ >> asm volatile ( >> "1: lwz %1, 0(%2)\n" >> " cmpw%1, %5\n" >> " bne 2f\n" >> " stwu%3, 0(%2)\n" >> "2:\n" >> ".section .fixup, \"ax\"\n" >> "3: li %0, 1\n" >> " b 2b\n" >> ".previous\n" >> ".section __ex_table,\"a\"\n" >> _ASM_ALIGN "\n" >> _ASM_PTR "1b, 3b\n" >> ".previous" >> : "=r"(faulted), "=r"(replaced) >> : "r"(ip), "r"(new), >> "0"(faulted), "r"(old) >> : "memory"); >> >> if (replaced != old && replaced != new) >> faulted = 2; >> >> if (!faulted) >> flush_icache_range(ip, ip + 8); >> >> return faulted; >> } >> >> What happens if you are really unlucky and get the following execution >> sequence ? >> >> >> Load module A at address 0xddfc3e00 >> Populate ftrace function table while the system runs >> Unload module A >> Load module B at address 0xddfc3e00 >> Activate ftrace >> -> At that point, since there is another module loaded in the same >> address space, it won't fault. If there happens to be an instruction >> which looks exactly like the instruction we are replacing at the same >> location, we are introducing a bug. True both on x86 ans powerpc... >> >> > > Hi Mathieu, > > Yep I know of this issue, and it is very unlikely it would happen. But > that's not good enough, so this is why I did this: > > http://marc.info/?l=linux-kernel&m=121876928124384&w=2 > http://marc.info/?l=linux-kernel&m=121876938524523&w=2 > > ;-) > > On powerpc it would be less likely an issue since the code segments are all > 4 bytes aligned. And a call being replaced would be a call to mcount > (relative jump). A call to mcount would be the same for both. Then again, > we could be replacing a nop, but most likely that wouldn't hurt either, > since nops are seldom used, and if we did call mcount, it would probably > not hurt. But it would be an issue if Module B happen to put a data section > that matched the code to jmp to mcount or a nop. > Ah, I did not see this one passing by :) That's the right solution indeed. I guess it's not in 2.6.27-rc2/rc3 though ? I wonder if the bug can be repeated with a module-free kernel ? Mathieu > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Eran Liberty ([EMAIL PROTECTED]) wrote: > Steven Rostedt wrote: >> Eran Liberty wrote: >>> After compiling a kernel with ftrace I started to experience all sorts of >>> crashes. >> >> Just to make sure... >> >> ftrace enables markers too, and RCU has tracing with the markers. This may >> not be the problem, but I just want to eliminate as many variables as >> possible. >> Could you disable ftrace, but keep the markers on too. Also, could you >> enable ftrace again and turn on the FTRACE_STARTUP_TEST. > > for the fun of it I took out all my propriety modules; so now its a non > tainted kernel. > > Here is the matrix: > > !FTRACE x !MARKERS => stable > !FTRACE x MARKERS => stable > FTRACE x !MARKERS => n/a (FTRACE forces MARKERS) > FTRACE x MARKERS => unstable > FTRACE x FTRACE_STARTUP_TEST x MARKERS => unstable + tests passed > > Testing tracer sched_switch: PASSED > Testing tracer ftrace: PASSED > Testing dynamic ftrace: PASSED > > Oops: Exception in kernel mode, sig: 11 [#1] > Exsw1600 > Modules linked in: > NIP: c00bbb20 LR: c00bbb20 CTR: > REGS: dd5b1c50 TRAP: 0700 Not tainted (2.6.27-rc2) > MSR: 00029000 CR: 24082282 XER: 2000 > TASK = ddcce060[1707] 'find' THREAD: dd5b > GPR00: dd5b1d00 ddcce060 dd801180 dd5b1d68 dd5b1d58 dd80125b > 100234ec > GPR08: c080 00019330 dd5b1d20 24000288 100ad874 100936f8 > 1008a1d0 > GPR16: 10083f80 dd5b1e2c dd5b1d68 fff4 c038 dd5b1d60 dd5b1d58 > dd802084 > GPR24: dc3d7700 dd802018 dd5b1d68 c038 dd801180 dd5b1d68 > dd5b1d00 > NIP [c00bbb20] d_lookup+0x40/0x90 > LR [c00bbb20] d_lookup+0x40/0x90 > Call Trace: > [dd5b1d00] [dd5b1d58] 0xdd5b1d58 (unreliable) Can you check if, at some point during the system execution (starting from boot), 0xdd5b1d58 is an address where a module is loaded ? (the module can be later unloaded, what I wonder is if this address would appear to have had a loaded+unloaded module). Actually, could you try to compile your kernel without "MODULE_UNLOAD" ? Mathieu > [dd5b1d20] [c00aebc4] do_lookup+0xe8/0x220 > [dd5b1d50] [c00b0a80] __link_path_walk+0x5a4/0xd54 > [dd5b1dc0] [c00b1288] path_walk+0x58/0xe0 > [dd5b1df0] [c00b13f8] do_path_lookup+0x78/0x13c > [dd5b1e20] [c00b20f4] user_path_at+0x64/0xac > [dd5b1e90] [c00a9028] vfs_lstat_fd+0x34/0x74 > [dd5b1ec0] [c00a90fc] vfs_lstat+0x30/0x48 > [dd5b1ed0] [c00a9144] sys_lstat64+0x30/0x5c > [dd5b1f40] [c0010554] ret_from_syscall+0x0/0x3c > Instruction dump: > 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 > 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 > ---[ end trace 1eb8fd7adac2bb65 ]--- > > Liberty > > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote: > > > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but > > > that > > > is probably not needed since all of this is done in kstop_machine. Also, > > > only the "get" is needed. If we don't fault there, we wont fault on the > > > put (unless we have permissions wrong, and that would be a bug). > > > > Would it ? How do we make sure the kernel text is mapped writeable ? > > We map it writeable if FTRACE is enabled. > Argh. See text_poke(). It's there exactly for this purpose on x86. Mathieu > > > > > So are you recommending something like > > > > > > int cmd; > > > > > > if (__get_user(cmd, ip)) > > > goto fault; > > > > > > if (cmd != old) > > > goto not_same; > > > > > > WARN_ON_ONCE(__put_user(cmd, ip)); > > > > > > If we did this, we could probably put this into the generic code: > > > > That would work I suppose, I'll give it a try. > > > > > if (copy_from_user(cmd, ip, ARCH_CALL_SIZE)) > > > goto fault; > > > > > > if (memcmp(cmd, old, ARCH_CALL_SIZE) != 0) > > > goto not_same; > > > > > > WARN_ON_ONCE(copy_to_user(cmd, ip, ARCH_CALL_SIZE)); > > > > You need the __ variants or the access_ok() checks will bite > > you bad. > > Ah, good point ;-) > > -- Steve > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
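For reference, a rough sketch of the compare-then-write flow being discussed, expressed with generic helpers instead of inline asm; this is only an illustration under the assumption that probe_kernel_read()/probe_kernel_write() style accessors (which ftrace later moved to) are available, and MCOUNT_INSN_SIZE stands in for the architecture's call-site size:

static int ftrace_modify_code_sketch(unsigned long ip,
				     unsigned char *old_code,
				     unsigned char *new_code)
{
	unsigned char cur[MCOUNT_INSN_SIZE];

	/* Code may have vanished (module/__init); a fault here is not fatal. */
	if (probe_kernel_read(cur, (void *)ip, MCOUNT_INSN_SIZE))
		return -EFAULT;

	/* Only patch sites that still contain exactly what we expect. */
	if (memcmp(cur, old_code, MCOUNT_INSN_SIZE) != 0)
		return -EINVAL;

	/* Runs under kstop_machine, so no concurrent execution of the site. */
	if (probe_kernel_write((void *)ip, new_code, MCOUNT_INSN_SIZE))
		return -EPERM;

	flush_icache_range(ip, ip + MCOUNT_INSN_SIZE);
	return 0;
}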
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > On Mon, 18 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote: > > > > > > > > > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but > > > > > that > > > > > is probably not needed since all of this is done in kstop_machine. > > > > > Also, > > > > > only the "get" is needed. If we don't fault there, we wont fault on > > > > > the > > > > > put (unless we have permissions wrong, and that would be a bug). > > > > > > > > Would it ? How do we make sure the kernel text is mapped writeable ? > > > > > > We map it writeable if FTRACE is enabled. > > > > > > > Argh. See text_poke(). It's there exactly for this purpose on x86. > > Ouch, I just did. text_poke is quite heavy. It would be interesting to see > that performed on 20,000 locations at one time. I could play with it, but > I'm a bit nervous. > It's already used to modify the LOCK prefixes in alternative.c and did not seem to be too slow for that.. it should therefore be ok. Mathieu > -- Steve -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Eran Liberty ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers wrote: >> Can you check if, at some point during the system execution (starting >> from boot), 0xdd5b1d58 is an address where a module is loaded ? (the >> module can be later unloaded, what I wonder is if this address would >> appear to have had a loaded+unloaded module). >> >> Actually, could you try to compile your kernel without "MODULE_UNLOAD" ? >> >> Mathieu >> > No modules... > > ~ # uname -a > Linux 2.6.27-rc2 #6 Tue Aug 19 12:33:34 IDT 2008 ppc unknown > ~ # gunzip /proc/config.gz -c | grep UNLOAD > # CONFIG_MODULE_UNLOAD is not set > ~ # gunzip /proc/config.gz -c | grep FTRACE > CONFIG_HAVE_FTRACE=y > CONFIG_HAVE_DYNAMIC_FTRACE=y > CONFIG_FTRACE=y > CONFIG_DYNAMIC_FTRACE=y > CONFIG_FTRACE_SELFTEST=y > CONFIG_FTRACE_STARTUP_TEST=y > ~ # gunzip /proc/config.gz -c | grep MARKERS > CONFIG_MARKERS=y > ~ # lsmod > Module Size Used by > ~ # while [ 1 ] ; do find / > /dev/null ; echo . ; done > . > Oops: Exception in kernel mode, sig: 11 [#1] > Exsw1600 > Modules linked in: > NIP: c00bb724 LR: c00bb724 CTR: > REGS: d8b59c50 TRAP: 0700 Not tainted (2.6.27-rc2) > MSR: 00029000 CR: 24082282 XER: 2000 > TASK = dbc68de0[1712] 'find' THREAD: d8b58000 > GPR00: d8b59d00 dbc68de0 dd801180 d8b59d68 d8b59d58 dd8019db > 100234ec > GPR08: c080 00019330 d8b59d20 24000288 100ad874 100936f8 > 1008a1d0 > GPR16: 10083f80 d8b59e2c d8b59d68 fff4 c038 d8b59d60 d8b59d58 > dd802084 Register GPR30 is always zero... (also true for the other oops) > GPR24: ddc69500 dd802018 d8b59d68 c038 dd801180 d8b59d68 > d8b59d00 > NIP [c00bb724] d_lookup+0x40/0x90 > LR [c00bb724] d_lookup+0x40/0x90 > Call Trace: > [d8b59d00] [d8b59d58] 0xd8b59d58 (unreliable) > [d8b59d20] [c00ae7c8] do_lookup+0xe8/0x220 > [d8b59d50] [c00b0684] __link_path_walk+0x5a4/0xd54 > [d8b59dc0] [c00b0e8c] path_walk+0x58/0xe0 > [d8b59df0] [c00b0ffc] do_path_lookup+0x78/0x13c > [d8b59e20] [c00b1cf8] user_path_at+0x64/0xac > [d8b59e90] [c00a8c98] vfs_lstat_fd+0x34/0x74 > [d8b59ec0] [c00a8d6c] vfs_lstat+0x30/0x48 > [d8b59ed0] [c00a8db4] sys_lstat64+0x30/0x5c > [d8b59f40] [c0010554] ret_from_syscall+0x0/0x3c > Instruction dump: > 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 Just like this instruction... maybe related ? Mathieu > 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 > ---[ end trace 7766edd310cd3442 ]--- > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
t;, nd=0xd99b3e28) at > fs/namei.c:1041 > #6 0xc00b0ffc in do_path_lookup (dfd=, > name=0xddc23000 "/proc/mtd", flags=, nd=0xd99b3e28) at > fs/namei.c:1091 > #7 0xc00b1cf8 in user_path_at (dfd=-100, name=, > flags=0, path=0xd99b3e98) at fs/namei.c:1334 > #8 0xc00a8c98 in vfs_lstat_fd (dfd=-578809472, name=0xd99b3d68 "", > stat=0xd99b3ed8) at fs/stat.c:83 > #9 0xc00a8d6c in vfs_lstat (name=0xd99b3d68 "", stat=0xd99b3d58) at > fs/stat.c:93 > #10 0xc00a8db4 in sys_lstat64 (filename=0xdd801180 "", statbuf=0xbfe285d8) > at fs/stat.c:381 > #11 0xc0010554 in syscall_dotrace_cont () > > both cp & lr points to 0xC00BB724 > (gdb) info registers > r0 0x00 > r1 0xd99b3d003650829568 > r2 0xddccf2e03721196256 > r3 0xdd8011803716157824 > r4 0xd99b3d683650829672 > r5 0xd99b3d583650829656 > r6 0xdd822a5b3716295259 > r7 0x100234ec268580076 > r8 0xc0803229614080 > r9 0x19330103216 > r100x65535 > r110xd99b3d203650829600 > r120x24000288603980424 > r130x100ad874269146228 > r140x100936f8269039352 > r150x1008a1d0269001168 > r160x10083f80268976000 > r170xd99b3e2c3650829868 > r180xd99b3d683650829672 > r190xfff44294967284 > r200xc0383224895488 > r210xd99b3d603650829664 > r220xd99b3d583650829656 > r230xdd8020843716161668 > r240xdc3fb2803695161984 > r250xdd8020183716161560 > r260xd99b3d683650829672 > r270xc0383224895488 > r280xdd8011803716157824 > r290xd99b3d683650829672 > r300x00 > r310xd99b3d003650829568 > pc 0xc00bb7243221993252 > cr 0x24082282604512898 > lr 0xc00bb7243221993252 > ctr0x00 > xer0x2000536870912 > > (gdb) li *0xC00BB724 > Line 1361 of "fs/dcache.c" starts at address 0xc00bb724 and > ends at 0xc00bb728 . > > (gdb) disassemble 0xC00BB724 > Dump of assembler code for function d_lookup: > 0xc00bb6e4 :mflrr0 > 0xc00bb6e8 :stw r0,4(r1) > 0xc00bb6ec :nop > 0xc00bb6f0 :stwur1,-32(r1) > 0xc00bb6f4 :mflrr0 > 0xc00bb6f8 :stmwr27,12(r1) > 0xc00bb6fc :lis r27,-16328 > 0xc00bb700 :mr r31,r1 > 0xc00bb704 :stw r0,36(r1) > 0xc00bb708 :mr r28,r3 > 0xc00bb70c :mr r29,r4 > 0xc00bb710 : lwz r30,12960(r27) > 0xc00bb714 :andi. r0,r30,1 > 0xc00bb718 :mr r3,r28 > 0xc00bb71c :mr r4,r29 > 0xc00bb720 :bnela- 0x2c > 0xc00bb724 :.long 0x0 > 0xc00bb728 :cmpwi cr7,r3,0 > 0xc00bb72c :bne-cr7,0xc00bb75c > 0xc00bb730 :lwz r0,12960(r27) > 0xc00bb734 :cmpwcr7,r0,r30 > 0xc00bb738 :beq-cr7,0xc00bb75c > 0xc00bb73c :mr r30,r0 > 0xc00bb740 :andi. r0,r30,1 > 0xc00bb744 :mr r3,r28 > 0xc00bb748 :mr r4,r29 > 0xc00bb74c :beq+0xc00bb724 > 0xc00bb750 :lwz r0,12960(r27) > 0xc00bb754 :mr r30,r0 > 0xc00bb758 :b 0xc00bb740 > 0xc00bb75c :lwz r11,0(r1) > 0xc00bb760 :lwz r0,4(r11) > 0xc00bb764 :lmw r27,-20(r11) > 0xc00bb768 :mtlrr0 > 0xc00bb76c :mr r1,r11 > 0xc00bb770 :blr > End of assembler dump. > (gdb) > > Hope this is helpfull > Can you also give us objdump -S --start-address=0xC00BB724 vmlinux | head 20 ? Then we could compare the result with the OOPS instruction dump : 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 Mathieu > Liberty > > > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Eran Liberty ([EMAIL PROTECTED]) wrote: > Mathieu Desnoyers wrote: >> Can you also give us >> >> objdump -S --start-address=0xC00BB724 vmlinux | head 20 >> >> ? >> >> Then we could compare the result with the OOPS instruction dump : >> >> 7c0802a6 bf61000c 3f60c038 7c3f0b78 90010024 7c7c1b78 7c9d2378 83db32a0 >> 73c1 7f83e378 7fa4eb78 4082002f <> 2f83 409e0030 801b32a0 >> >> Mathieu >> >> > > to give you more context I have run: > > powerpc-linux-gnu-objdump -S --start-address=0xC00BB700 vmlinux | head -n > 60 > > the discrepancy starts at address: > c00bb720: 40 82 00 30 <=> 40 82 00 2f > c00bb724: 4b ff fe 61 <=> 00 00 00 00 > Ah ! I think I see what could be wrong : First we have : static unsigned int ftrace_nop = 0x6000; We probably replace the original function call by this nop. Then we do : notrace unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr) { static unsigned int op; /* * It would be nice to just use create_function_call, but that will * update the code itself. Here we need to just return the * instruction that is going to be modified, without modifying the * code. */ addr = GET_ADDR(addr); /* Set to "bl addr" */ op = 0x4801 | (ftrace_calc_offset(ip, addr) & 0x03fc); /* * No locking needed, this must be called via kstop_machine * which in essence is like running on a uniprocessor machine. */ return (unsigned char *)&op; } And I guess we must be doing a 0x4801 | 0x0; or something ? Also, we have to consider that POWERPC 64 functions are : /* PowerPC64's functions are data that points to the functions */ And this does not seem to hold for ppc32. Therefore, it is strange to me that the same code is used for the update.. are we updating the correct site ? Mathieu > vmlinux: file format elf32-powerpc > > Disassembly of section .text: > > c00bb700 : > * d_lookup() is protected against the concurrent renames in some unrelated > * directory using the seqlockt_t rename_lock. > */ > > struct dentry * d_lookup(struct dentry * parent, struct qstr * name) > { > c00bb700: 7c 3f 0b 78 mr r31,r1 > c00bb704: 90 01 00 24 stw r0,36(r1) > c00bb708: 7c 7c 1b 78 mr r28,r3 > c00bb70c: 7c 9d 23 78 mr r29,r4 > c00bb710: 83 db 32 a0 lwz r30,12960(r27) > { >unsigned ret; > > repeat: >ret = sl->sequence; >smp_rmb(); > c00bb714: 73 c0 00 01 andi. r0,r30,1 >struct dentry * dentry = NULL; >unsigned long seq; > >do { >seq = read_seqbegin(&rename_lock); >dentry = __d_lookup(parent, name); > c00bb718: 7f 83 e3 78 mr r3,r28 > c00bb71c: 7f a4 eb 78 mr r4,r29 >if (unlikely(ret & 1)) { > c00bb720: 40 82 00 30 bne-c00bb750 > c00bb724: 4b ff fe 61 bl c00bb584 <__d_lookup> >if (dentry) > c00bb728: 2f 83 00 00 cmpwi cr7,r3,0 > c00bb72c: 40 9e 00 30 bne-cr7,c00bb75c > * > * If sequence value changed then writer changed data while in section. > */ > static __always_inline int read_seqretry(const seqlock_t *sl, unsigned > start) > { >smp_rmb(); > c00bb730: 80 1b 32 a0 lwz r0,12960(r27) >break; >} while (read_seqretry(&rename_lock, seq)); > c00bb734: 7f 80 f0 00 cmpwcr7,r0,r30 > c00bb738: 41 9e 00 24 beq-cr7,c00bb75c > /* Start of read calculation -- fetch last complete writer token */ > static __always_inline unsigned read_seqbegin(const seqlock_t *sl) > { >unsigned ret; > > repeat: > c00bb73c: 7c 1e 03 78 mr r30,r0 >ret = sl->sequence; >smp_rmb(); > c00bb740: 73 c0 00 01 andi. 
r0,r30,1 >struct dentry * dentry = NULL; > -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
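To make the "0x4...01 | 0x0" suspicion concrete, here is a small worked sketch of the call-site encodings, assuming the standard powerpc instruction formats (nop is 0x60000000, and a relative "bl" is 0x48000001 or'ed with the branch offset masked to 0x03fffffc); the addresses are taken from the d_lookup() disassembly above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t ip   = 0xc00bb724;	/* patched site inside d_lookup()  */
	uint32_t addr = 0xc00bb584;	/* __d_lookup, the branch target   */
	uint32_t nop  = 0x60000000;	/* what dynamic ftrace writes      */
	uint32_t bl   = 0x48000001 | ((addr - ip) & 0x03fffffc);

	/* prints 0x4bfffe61, i.e. the "4b ff fe 61  bl __d_lookup" above;
	 * the 0x00000000 word in the Oops dump is neither the nop nor the
	 * bl, which is exactly the anomaly under discussion. */
	printf("nop=0x%08x bl=0x%08x\n", nop, bl);
	return 0;
}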
Re: ftrace introduces instability into kernel 2.6.27(-rc2,-rc3)
* Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > On Mon, 18 Aug 2008, Mathieu Desnoyers wrote: > > > * Steven Rostedt ([EMAIL PROTECTED]) wrote: > > > > > > On Tue, 19 Aug 2008, Benjamin Herrenschmidt wrote: > > > > > > > > > > > > Hmm, this was originally copied from x86, where we did a cmpxchg, but > > > > > that > > > > > is probably not needed since all of this is done in kstop_machine. > > > > > Also, > > > > > only the "get" is needed. If we don't fault there, we wont fault on > > > > > the > > > > > put (unless we have permissions wrong, and that would be a bug). > > > > > > > > Would it ? How do we make sure the kernel text is mapped writeable ? > > > > > > We map it writeable if FTRACE is enabled. > > > > > > > Argh. See text_poke(). It's there exactly for this purpose on x86. > > > > OK, I just tried text_poke and it unfortunately fails. The problem is that > it requires that the text you are changing is aligned and fits on one > page. We have no control over that. > > -- Steve > Ok, there are two cases where it's ok : 1 - in stop_machine, considering we are not touching code executed in NMI handlers. 2 - when using my replace_instruction_safe() which uses a temporary breakpoint when doing the instruction replacement. In those cases you could use text_poke_early(). See http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=arch/x86/kernel/immediate.c;h=7789e2c75bf03e645f15759d5dff0c1698493f92;hb=HEAD For a use example. Basically it looks like : 360 pages[0] = virt_to_page((void *)bypass_eip); 361 vaddr = vmap(pages, 1, VM_MAP, PAGE_KERNEL); 362 BUG_ON(!vaddr); 363 text_poke_early(&vaddr[bypass_eip & ~PAGE_MASK], 364 (void *)addr, size); 365 /* 366 * Fill the rest with nops. 367 */ 368 len = NR_NOPS - size; 369 add_nops((void *) 370 &vaddr[(bypass_eip & ~PAGE_MASK) + size], 371 len); 372 print_dbg_bytes("inserted nops", 373 &vaddr[(bypass_eip & ~PAGE_MASK) + size], len); 374 vunmap(vaddr); Mathieu -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] sputrace : use marker_synchronize_unregister()
We need a marker_synchronize_unregister() before the end of exit() to make sure all probe callers have exited the non-preemptible section and thus are not executing the probe code anymore. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Ingo Molnar <[EMAIL PROTECTED]> CC: Jeremy Kerr <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org CC: Christoph Hellwig <[EMAIL PROTECTED]> --- arch/powerpc/platforms/cell/spufs/sputrace.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c === --- linux-2.6-lttng.orig/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:34:58.0 -0400 +++ linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:35:15.0 -0400 @@ -233,6 +233,7 @@ static void __exit sputrace_exit(void) remove_proc_entry("sputrace", NULL); kfree(sputrace_log); + marker_synchronize_unregister(); } module_init(sputrace_init); -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 5/5] sputrace : use marker_synchronize_unregister()
We need a marker_synchronize_unregister() before the end of exit() to make sure all probe callers have exited the non-preemptible section and thus are not executing the probe code anymore. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: Ingo Molnar <[EMAIL PROTECTED]> CC: Jeremy Kerr <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org CC: Christoph Hellwig <[EMAIL PROTECTED]> --- arch/powerpc/platforms/cell/spufs/sputrace.c |1 + 1 file changed, 1 insertion(+) Index: linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c === --- linux-2.6-lttng.orig/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:34:58.0 -0400 +++ linux-2.6-lttng/arch/powerpc/platforms/cell/spufs/sputrace.c 2008-07-31 09:35:15.0 -0400 @@ -233,6 +233,7 @@ static void __exit sputrace_exit(void) remove_proc_entry("sputrace", NULL); kfree(sputrace_log); + marker_synchronize_unregister(); } module_init(sputrace_init); -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
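For context, the generic shape of the tear-down ordering these two patches are about (a hedged sketch with made-up probe/data names, not the sputrace code; marker_probe_unregister() and marker_synchronize_unregister() are the markers API of that period):

static void __exit my_tracer_exit(void)
{
	/* my_event/my_probe/my_log are hypothetical. */
	marker_probe_unregister("my_event", my_probe, NULL);

	/*
	 * Wait until every cpu has left the preempt-disabled section in
	 * which probe callbacks run, so no cpu can still be executing
	 * my_probe() or touching my_log ...
	 */
	marker_synchronize_unregister();

	/* ... before the resources the probe uses are released. */
	kfree(my_log);
}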
[patch 2/3] Add missing DATA_DATA in powerpc
Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org -- arch/powerpc/kernel/vmlinux.lds.S |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S === --- linux-2.6-lttng.orig/arch/powerpc/kernel/vmlinux.lds.S 2007-06-16 22:17:53.0 -0400 +++ linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S 2007-06-16 22:18:32.0 -0400 @@ -173,7 +173,9 @@ } #else .data : { - *(.data .data.rel* .toc1) + DATA_DATA + *(.data.rel*) + *(.toc1) *(.branch_lt) } -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: + add-missing-data_data-in-powerpc.patch added to -mm tree
Please add to patch header: Newer asm-generic/vmlinux.lds.h provides a DATA_DATA macro which defines the content of what is normally referred to by *(.data). It permits adding more "subsections" (declared by gcc with the section attribute or in assembly), whichwill be linked in the .data section. * [EMAIL PROTECTED] ([EMAIL PROTECTED]) wrote: > > The patch titled > powerpc: add missing DATA_DATA > has been added to the -mm tree. Its filename is > add-missing-data_data-in-powerpc.patch > > *** Remember to use Documentation/SubmitChecklist when testing your code *** > > See http://www.zip.com.au/~akpm/linux/patches/stuff/added-to-mm.txt to find > out what to do about this > > -- > Subject: powerpc: add missing DATA_DATA > From: Mathieu Desnoyers <[EMAIL PROTECTED]> > > Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> > Cc: <[EMAIL PROTECTED]> > Cc: > Signed-off-by: Andrew Morton <[EMAIL PROTECTED]> > --- > > arch/powerpc/kernel/vmlinux.lds.S |4 +++- > 1 files changed, 3 insertions(+), 1 deletion(-) > > diff -puN arch/powerpc/kernel/vmlinux.lds.S~add-missing-data_data-in-powerpc > arch/powerpc/kernel/vmlinux.lds.S > --- a/arch/powerpc/kernel/vmlinux.lds.S~add-missing-data_data-in-powerpc > +++ a/arch/powerpc/kernel/vmlinux.lds.S > @@ -174,7 +174,9 @@ SECTIONS > } > #else > .data : { > - *(.data .data.rel* .toc1) > + DATA_DATA > + *(.data.rel*) > + *(.toc1) > *(.branch_lt) > } > > _ > > Patches currently in -mm which might be from [EMAIL PROTECTED] are > > powerpc-promc-remove-undef-printk.patch > i386-text-edit-lock.patch > i386-text-edit-lock-alternative-instructions.patch > i386-text-edit-lock-kprobes.patch > immediate-values-global-modules-list-and-module-mutex.patch > immediate-value-architecture-independent-code.patch > immediate-values-non-optimized-architectures.patch > immediate-value-add-kconfig-menus.patch > immediate-values-kprobe-header-fix.patch > immediate-value-i386-optimization.patch > immediate-value-powerpc-optimization.patch > immediate-value-documentation.patch > f00f-bug-fixup-for-i386-use-immediate-values.patch > scheduler-profiling-use-immediate-values.patch > use-data_data-in-cris.patch > add-missing-data_data-in-powerpc.patch > use-data_data-in-xtensa.patch > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Powerpc - Include pagemap.h in asm/powerpc/tlb.h
Powerpc - Include pagemap.h in asm/powerpc/tlb.h Fixes this powerpc build error in 2.6.22-rc6-mm1 for powerpc 64 : In file included from include2/asm/tlb.h:60, from /home/compudj/git/linux-2.6-lttng/arch/powerpc/mm/init_64. c:56: /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h: In function 'tlb_fl ush_mmu': /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h:76: error: implicit declaration of function 'release_pages' /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h: In function 'tlb_re move_page': /home/compudj/git/linux-2.6-lttng/include/asm-generic/tlb.h:105: error: implicit declaration of function 'page_cache_release' make[2]: *** [arch/powerpc/mm/init_64.o] Error 1 release_pages is declared in linux/pagemap.h, but cannot be included in linux/swap.h because of a sparc related comment: /* only sparc can not include linux/pagemap.h in this file * so leave page_cache_release and release_pages undeclared... */ #define free_page_and_swap_cache(page) \ page_cache_release(page) #define free_pages_and_swap_cache(pages, nr) \ release_pages((pages), (nr), 0); Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: linuxppc-dev@ozlabs.org CC: Paul Mackerras <[EMAIL PROTECTED]> --- include/asm-powerpc/tlb.h |2 ++ 1 file changed, 2 insertions(+) Index: linux-2.6-lttng/include/asm-powerpc/tlb.h === --- linux-2.6-lttng.orig/include/asm-powerpc/tlb.h 2007-07-13 11:30:54.0 -0400 +++ linux-2.6-lttng/include/asm-powerpc/tlb.h 2007-07-13 11:31:22.0 -0400 @@ -23,6 +23,8 @@ #include #endif +#include + struct mmu_gather; #define tlb_start_vma(tlb, vma)do { } while (0) -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[patch 2/3] Add missing DATA_DATA in powerpc
Newer asm-generic/vmlinux.lds.h provides a DATA_DATA macro which defines the content of what is normally referred to by *(.data). It permits adding more "subsections" (declared by gcc with the section attribute or in assembly), which will be linked in the .data section. Signed-off-by: Mathieu Desnoyers <[EMAIL PROTECTED]> CC: [EMAIL PROTECTED] CC: linuxppc-dev@ozlabs.org -- arch/powerpc/kernel/vmlinux.lds.S |4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) Index: linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S === --- linux-2.6-lttng.orig/arch/powerpc/kernel/vmlinux.lds.S 2007-07-11 11:55:48.0 -0400 +++ linux-2.6-lttng/arch/powerpc/kernel/vmlinux.lds.S 2007-07-11 15:48:42.0 -0400 @@ -174,7 +174,9 @@ } #else .data : { - *(.data .data.rel* .toc1) + DATA_DATA + *(.data.rel*) + *(.toc1) *(.branch_lt) } -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
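To illustrate the "subsections" this description refers to, a minimal hedged sketch of C code emitting an object into a named .data subsection via the gcc section attribute; the section and symbol names are made up, and it assumes the central DATA_DATA definition in asm-generic/vmlinux.lds.h is the place that lists such subsections for every architecture:

struct my_probe_desc {
	const char *name;
	const char *format;
};

/* Lands in the ".data.my_subsection" input section of this object file. */
static struct my_probe_desc
__attribute__((used, section(".data.my_subsection")))
my_probe_point = {
	.name	= "my_event",
	.format	= "arg %d",
};

/*
 * With the arch linker script using DATA_DATA instead of a literal
 * *(.data), a single *(.data.my_subsection) entry added to the central
 * macro is enough for all architectures to link the object into .data.
 */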
Re: Powerpc - Include pagemap.h in asm/powerpc/tlb.h
7;: /home/compudj/git/linux-2.6-lttng/include/linux/pagemap.h:207: error: implicit declaration of function 'PageWriteback' /home/compudj/git/linux-2.6-lttng/include/linux/pagemap.h:208: error: 'PG_writeback' undeclared (first use in this function) In file included from /home/compudj/git/linux-2.6-lttng/include/linux/rmap.h:9, from /home/compudj/git/linux-2.6-lttng/init/main.c:49: /home/compudj/git/linux-2.6-lttng/include/linux/mm.h: At top level: /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:605: error: conflicting types for 'page_address' include2/asm/highmem.h:61: error: previous implicit declaration of 'page_address' was here PageHighMem is declared in linux/page-flags.h, which is included in linux/mm.h with the following comment: /* * FIXME: take this include out, include page-flags.h in * files which need it (119 of them) */ #include (linux/mm.h is included from linux/highmem.h) Actually, we get a circular inclusion there: In file included from /home/compudj/git/linux-2.6-lttng/include/linux/highmem.h:24, from /home/compudj/git/linux-2.6-lttng/include/linux/pagemap.h:10, from /home/compudj/git/linux-2.6-lttng/include/linux/swap.h:9, from include2/asm/pgtable.h:15, from /home/compudj/git/linux-2.6-lttng/include/linux/mm.h:38, from /home/compudj/git/linux-2.6-lttng/include/linux/rmap.h:9, from /home/compudj/git/linux-2.6-lttng/init/main.c:49: mm.h includes asm-sparc/pgtable.h includes linux/swap.h includes linux/pagemap.h (which I have added myself) includes linux/highmem.h includes mm.h. Is it me or it all looks like a fubarish mess ? :( Why is asm-sparc/pgtable.h including linux/swap.h ? Perharps the sparc devs will be able to enlighten us... Mathieu -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [PATCH] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
- On Jul 7, 2020, at 1:50 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Christophe Leroy's message of July 6, 2020 7:53 pm: >> >> >> Le 06/07/2020 à 04:18, Nicholas Piggin a écrit : >>> diff --git a/arch/powerpc/include/asm/exception-64s.h >>> b/arch/powerpc/include/asm/exception-64s.h >>> index 47bd4ea0837d..b88cb3a989b6 100644 >>> --- a/arch/powerpc/include/asm/exception-64s.h >>> +++ b/arch/powerpc/include/asm/exception-64s.h >>> @@ -68,6 +68,10 @@ >>>* >>>* The nop instructions allow us to insert one or more instructions to >>> flush the >>>* L1-D cache when returning to userspace or a guest. >>> + * >>> + * powerpc relies on return from interrupt/syscall being context >>> synchronising >>> + * (which hrfid, rfid, and rfscv are) to support >>> ARCH_HAS_MEMBARRIER_SYNC_CORE >>> + * without additional additional synchronisation instructions. >> >> This file is dedicated to BOOK3S/64. What about other ones ? >> >> On 32 bits, this is also valid as 'rfi' is also context synchronising, >> but then why just add some comment in exception-64s.h and only there ? > > Yeah you're right, I basically wanted to keep a note there just in case, > because it's possible we would get a less synchronising return (maybe > unlikely with meltdown) or even return from a kernel interrupt using a > something faster (e.g., bctar if we don't use tar register in the kernel > anywhere). > > So I wonder where to add the note, entry_32.S and 64e.h as well? > For 64-bit powerpc, I would be tempted to either place the comment in the header implementing the RFI_TO_USER and RFI_TO_USER_OR_KERNEL macros or the .S files using them, e.g. either: arch/powerpc/include/asm/exception-64e.h arch/powerpc/include/asm/exception-64s.h or arch/powerpc/kernel/exceptions-64s.S arch/powerpc/kernel/entry_64.S And for 32-bit powerpc, AFAIU arch/powerpc/kernel/entry_32.S uses SYNC + RFI to return to user-space. RFI is defined in arch/powerpc/include/asm/ppc_asm.h So a comment either near the RFI define and its uses should work. > I should actually change the comment for 64-bit because soft masked > interrupt replay is an interesting case. I thought it was okay (because > the IPI would cause a hard interrupt which does do the rfi) but that > should at least be written. Yes. > The context synchronisation happens before > the Linux IPI function is called, but for the purpose of membarrier I > think that is okay (the membarrier just needs to have caused a memory > barrier + context synchronistaion by the time it has done). Can you point me to the code implementing this logic ? Thanks, Mathieu > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Failure to build librseq on ppc
Hi Boqun, I'm trying to build librseq at: https://git.kernel.org/pub/scm/libs/librseq/librseq.git on powerpc, and I get these errors when building the rseq basic test mirrored from the kernel selftests code: /tmp/ccieEWxU.s: Assembler messages: /tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:118: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:121: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:626: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:629: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:735: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:738: Error: junk at end of line: `,8' /tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `(' /tmp/ccieEWxU.s:741: Error: junk at end of line: `,8' Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed I am using this compiler: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12) Target: powerpc-linux-gnu So far, I got things to build by changing "m" operands to "Q" operands. Based on https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints it seems that "Q" means "A memory operand addressed by just a base register." I suspect that lwz and stw don't expect some kind of immediate offset which can be kept with "m", and "Q" fixes this. Is that the right fix ? And should we change all operands passed to lwz and stw to a "Q" operand ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 7, 2020, at 8:59 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > Hi! > > On Tue, Jul 07, 2020 at 03:17:10PM -0400, Mathieu Desnoyers wrote: >> I'm trying to build librseq at: >> >> https://git.kernel.org/pub/scm/libs/librseq/librseq.git >> >> on powerpc, and I get these errors when building the rseq basic >> test mirrored from the kernel selftests code: >> >> /tmp/ccieEWxU.s: Assembler messages: >> /tmp/ccieEWxU.s:118: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:118: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:121: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:121: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:626: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:626: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:629: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:629: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:735: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:735: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:738: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:738: Error: junk at end of line: `,8' >> /tmp/ccieEWxU.s:741: Error: syntax error; found `,', expected `(' >> /tmp/ccieEWxU.s:741: Error: junk at end of line: `,8' >> Makefile:581: recipe for target 'basic_percpu_ops_test.o' failed > > You'll have to show the actual failing machine code, and with enough > context that we can relate this to the source code. > > -save-temps helps, or use -S instead of -c, etc. Sure, see attached .S file. > >> I am using this compiler: >> >> gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12) >> Target: powerpc-linux-gnu >> >> So far, I got things to build by changing "m" operands to "Q" operands. >> Based on >> https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints >> it seems that "Q" means "A memory operand addressed by just a base register." > > Yup. > >> I suspect that lwz and stw don't expect some kind of immediate offset which >> can be kept with "m", and "Q" fixes this. Is that the right fix ? >> >> And should we change all operands passed to lwz and stw to a "Q" operand ? > > No, lwz and stw exactly *do* take an immediate offset. > > It sounds like the compiler passed memory addressed by indexed > addressing, instead. Which is fine for "m", and also fine for those > insns... well, you need lwzx and stwx. > > So perhaps you have code like > > int *p; > int x; > ... > asm ("lwz %0,%1" : "=r"(x) : "m"(*p)); We indeed have explicit "lwz" and "stw" instructions in there. > > where that last line should actually read > > asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p)); Indeed, turning those into "lwzx" and "stwx" seems to fix the issue. There has been some level of extra CPP macro coating around those instructions to support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is not trivial. Let me see what can be done here. Thanks, Mathieu > > ? > > > Segher -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com basic_percpu_ops_test.S Description: Binary data
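A self-contained sketch of the suggested fix (the helper name is made up): with a plain "m" constraint the compiler may hand the asm an indexed (reg+reg) address, which bare lwz/stw cannot encode, and the %X (and %U) operand modifiers make gcc append the x/u suffix only when the chosen address needs it:

static inline int load_int(int *p)
{
	int x;

	/* "lwz %0,%1" would break on an indexed memory operand;
	 * "lwz%U1%X1" becomes lwzx/lwzu/lwz as appropriate. */
	asm ("lwz%U1%X1 %0,%1" : "=r" (x) : "m" (*p));
	return x;
}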
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 8:33 AM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 7, 2020, at 8:59 PM, Segher Boessenkool > seg...@kernel.crashing.org > wrote: [...] >> >> So perhaps you have code like >> >> int *p; >> int x; >> ... >> asm ("lwz %0,%1" : "=r"(x) : "m"(*p)); > > We indeed have explicit "lwz" and "stw" instructions in there. > >> >> where that last line should actually read >> >> asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p)); > > Indeed, turning those into "lwzx" and "stwx" seems to fix the issue. > > There has been some level of extra CPP macro coating around those instructions > to > support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is not > trivial. > Let me see what can be done here. I did the following changes which appear to generate valid asm. See attached corresponding .S output. I grepped for uses of "m" asm operand in Linux powerpc code and noticed it's pretty much always used with e.g. "lwz%U1%X1". I could find one blog post discussing that %U is about update flag, and nothing about %X. Are those documented ? Although it appears to generate valid asm, I have the feeling I'm relying on undocumented features here. :-/ Here is the diff on https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/include/rseq/rseq-ppc.h It's only compile-tested on powerpc32 so far: diff --git a/include/rseq/rseq-ppc.h b/include/rseq/rseq-ppc.h index eb53953..f689fe9 100644 --- a/include/rseq/rseq-ppc.h +++ b/include/rseq/rseq-ppc.h @@ -47,9 +47,9 @@ do { \ #ifdef __PPC64__ -#define STORE_WORD "std " -#define LOAD_WORD "ld " -#define LOADX_WORD "ldx " +#define STORE_WORD(arg)"std%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* To memory ("m" constraint) */ +#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " /* From memory ("m" constraint) */ +#define LOADX_WORD "ldx " /* From base register ("b" constraint) */ #define CMP_WORD "cmpd " #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \ @@ -89,9 +89,9 @@ do { \ #else /* #ifdef __PPC64__ */ -#define STORE_WORD "stw " -#define LOAD_WORD "lwz " -#define LOADX_WORD "lwzx " +#define STORE_WORD(arg)"stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* To memory ("m" constraint) */ +#define LOAD_WORD(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " /* From memory ("m" constraint) */ +#define LOADX_WORD "lwzx " /* From base register ("b" constraint) */ #define CMP_WORD "cmpw " #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \ @@ -125,7 +125,7 @@ do { \ RSEQ_INJECT_ASM(1) \ "lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t" \ "addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t" \ - "stw %%r17, %[" __rseq_str(rseq_cs) "]\n\t" \ + "stw%U[" __rseq_str(rseq_cs) "]%X[" __rseq_str(rseq_cs) "] %%r17, %[" __rseq_str(rseq_cs) "]\n\t" \ __rseq_str(label) ":\n\t" #endif /* #ifdef __PPC64__ */ @@ -136,7 +136,7 @@ do { \ #define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \ RSEQ_INJECT_ASM(2) \ - "lwz %%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \ + "lwz%U[" __rseq_str(current_cpu_id) "]%X[" __rseq_str(current_cpu_id) "] %%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \ "cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t" \ "bne- cr7, " __rseq_str(label) "\n\t" @@ -153,25 +153,25 @@ do {
Re: [PATCH] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
- On Jul 8, 2020, at 1:17 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 7, 2020 9:25 pm: >> - On Jul 7, 2020, at 1:50 AM, Nicholas Piggin npig...@gmail.com wrote: >> [...] >>> I should actually change the comment for 64-bit because soft masked >>> interrupt replay is an interesting case. I thought it was okay (because >>> the IPI would cause a hard interrupt which does do the rfi) but that >>> should at least be written. >> >> Yes. >> >>> The context synchronisation happens before >>> the Linux IPI function is called, but for the purpose of membarrier I >>> think that is okay (the membarrier just needs to have caused a memory >>> barrier + context synchronistaion by the time it has done). >> >> Can you point me to the code implementing this logic ? > > It's mostly in arch/powerpc/kernel/exception-64s.S and > powerpc/kernel/irq.c, but a lot of asm so easier to explain. > > When any Linux code does local_irq_disable(), we set interrupts as > software-masked in a per-cpu flag. When interrupts (including IPIs) come > in, the first thing we do is check that flag and if we are masked, then > record that the interrupt needs to be "replayed" in another per-cpu > flag. The interrupt handler then exits back using RFI (which is context > synchronising the CPU). Later, when the kernel code does > local_irq_enable(), it checks the replay flag to see if anything needs > to be done. At that point we basically just call the interrupt handler > code like a normal function, and when that returns there is no context > synchronising instruction. AFAIU this can only happen for interrupts nesting over irqoff sections, therefore over kernel code, never userspace, right ? > > So membarrier IPI will always cause target CPUs to perform a context > synchronising instruction, but sometimes it happens before the IPI > handler function runs. If my understanding is correct, the replayed interrupt handler logic only nests over kernel code, which will eventually need to issue a context synchronizing instruction before returning to user-space. All we care about is that starting from the membarrier, each core either: - interrupt user-space to issue the context synchronizing instruction if they were running userspace, or - _eventually_ issue a context synchronizing instruction before returning to user-space if they were running kernel code. So your earlier statement "the membarrier just needs to have caused a memory barrier + context synchronistaion by the time it has done" is not strictly correct: the context synchronizing instruction does not strictly need to happen on each core before membarrier returns. A similar line of thoughts can be followed for memory barriers. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
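For readers joining the thread, a hedged userspace sketch of what consumes the SYNC_CORE guarantee being discussed, typically a process patching code that its other threads execute; the MEMBARRIER_CMD_* names are the membarrier(2) uapi, everything else is illustrative and error handling is elided:

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

static void jit_init(void)
{
	/* Once per process, before relying on the expedited command. */
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

static void publish_code(void)
{
	/* ... new instructions have been written to the code buffer ... */

	/*
	 * Every thread of this process will execute a core serializing
	 * (context synchronising) instruction before running further
	 * userspace code -- on powerpc, the rfid/hrfid/rfscv on the
	 * return-to-user path is what provides it.
	 */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);

	/* Other threads may now be pointed at the freshly written code. */
}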
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 10:21 AM, Christophe Leroy christophe.le...@csgroup.eu wrote: > Le 08/07/2020 à 16:00, Mathieu Desnoyers a écrit : >> - On Jul 8, 2020, at 8:33 AM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: >> >>> - On Jul 7, 2020, at 8:59 PM, Segher Boessenkool >>> seg...@kernel.crashing.org >>> wrote: >> [...] >>>> >>>> So perhaps you have code like >>>> >>>> int *p; >>>> int x; >>>> ... >>>> asm ("lwz %0,%1" : "=r"(x) : "m"(*p)); >>> >>> We indeed have explicit "lwz" and "stw" instructions in there. >>> >>>> >>>> where that last line should actually read >>>> >>>> asm ("lwz%X1 %0,%1" : "=r"(x) : "m"(*p)); >>> >>> Indeed, turning those into "lwzx" and "stwx" seems to fix the issue. >>> >>> There has been some level of extra CPP macro coating around those >>> instructions >>> to >>> support both ppc32 and ppc64 with the same assembly. So adding %X[arg] is >>> not >>> trivial. >>> Let me see what can be done here. >> >> I did the following changes which appear to generate valid asm. >> See attached corresponding .S output. >> >> I grepped for uses of "m" asm operand in Linux powerpc code and noticed it's >> pretty much >> always used with e.g. "lwz%U1%X1". I could find one blog post discussing >> that %U >> is about >> update flag, and nothing about %X. Are those documented ? > > As far as I can see, %U is mentioned in > https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html in the > powerpc subpart, at the "m" constraint. Yep, I did notice it, but mistakenly thought it was only needed for "m<>" operand, not "m". Thanks, Mathieu > > For the %X I don't know. > > Christophe > >> >> Although it appears to generate valid asm, I have the feeling I'm relying on >> undocumented >> features here. :-/ -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
powerpc: Incorrect stw operand modifier in __set_pte_at
Hi, Reviewing use of the patterns "Un%Xn" with lwz and stw instructions (where n should be the operand number) within the Linux kernel led me to spot those 2 weird cases: arch/powerpc/include/asm/nohash/pgtable.h:__set_pte_at() __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ stw%U0%X0 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); I would have expected the stw to be: stw%U1%X1 %L2,%1" and: arch/powerpc/include/asm/book3s/32/pgtable.h:__set_pte_at() __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ stw%U0%X0 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); where I would have expected: stw%U1%X1 %L2,%1" Is it a bug or am I missing something ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
[PATCH 1/1] powerpc: Fix incorrect stw{, ux, u, x} instructions in __set_pte_at
The placeholder for instruction selection should use the second argument's operand, which is %1, not %0. This could generate incorrect assembly code if the instruction selection for argument %0 ever differs from argument %1. Fixes: 9bf2b5cdc5fe ("powerpc: Fixes for CONFIG_PTE_64BIT for SMP support") Signed-off-by: Mathieu Desnoyers Cc: Christophe Leroy Cc: Kumar Gala Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Cc: # v2.6.28+ --- arch/powerpc/include/asm/book3s/32/pgtable.h | 2 +- arch/powerpc/include/asm/nohash/pgtable.h| 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h b/arch/powerpc/include/asm/book3s/32/pgtable.h index 224912432821..f1467b3c417a 100644 --- a/arch/powerpc/include/asm/book3s/32/pgtable.h +++ b/arch/powerpc/include/asm/book3s/32/pgtable.h @@ -529,7 +529,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ - stw%U0%X0 %L2,%1" + stw%U1%X1 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h index 4b7c3472eab1..a00e4c1746d6 100644 --- a/arch/powerpc/include/asm/nohash/pgtable.h +++ b/arch/powerpc/include/asm/nohash/pgtable.h @@ -199,7 +199,7 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, __asm__ __volatile__("\ stw%U0%X0 %2,%0\n\ eieio\n\ - stw%U0%X0 %L2,%1" + stw%U1%X1 %L2,%1" : "=m" (*ptep), "=m" (*((unsigned char *)ptep+4)) : "r" (pte) : "memory"); return; -- 2.11.0
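To spell out why the operand number matters, a standalone hedged sketch of the same two-store pattern (store_pair() is a made-up name): each %U/%X modifier must describe the memory operand it decorates, so the second store has to be driven by operand 1, otherwise gcc may pick an indexed or pre-update address for %1 that the emitted mnemonic does not match:

static inline void store_pair(unsigned int *lo, unsigned int *hi,
			      unsigned long long val)
{
	/* %2 is the high register of the 64-bit pair, %L2 the low one. */
	asm volatile("stw%U0%X0 %2,%0\n\t"
		     "eieio\n\t"
		     "stw%U1%X1 %L2,%1"
		     : "=m" (*lo), "=m" (*hi)
		     : "r" (val)
		     : "memory");
}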
Re: Failure to build librseq on ppc
- Segher Boessenkool wrote: > Hi! > > On Wed, Jul 08, 2020 at 10:27:27PM +1000, Michael Ellerman wrote: > > Segher Boessenkool writes: > > > You'll have to show the actual failing machine code, and with enough > > > context that we can relate this to the source code. > > > > > > -save-temps helps, or use -S instead of -c, etc. > > > > Attached below. > > Thanks! > > > I think that's from: > > > > #define LOAD_WORD "ld " > > > > #define RSEQ_ASM_OP_CMPEQ(var, expect, label) > > \ > > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t" > > \ > > The way this hardcodes r17 *will* break, btw. The compiler will not > likely want to use r17 as long as your code (after inlining etc.!) stays > small, but there is Murphy's law. r17 is in the clobber list, so it should be ok. > > Anyway... something in rseq_str is wrong, missing %X. This may > have to do with the abuse of inline asm here, making a fix harder :-( I just committed a fix which enhances the macros. Thanks for your help! Mathieu > > > Segher -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 8:10 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > Hi! > > On Wed, Jul 08, 2020 at 10:00:01AM -0400, Mathieu Desnoyers wrote: [...] > >> -#define STORE_WORD "std " >> -#define LOAD_WORD "ld " >> -#define LOADX_WORD "ldx " >> +#define STORE_WORD(arg)"std%U[" __rseq_str(arg) "]%X[" >> __rseq_str(arg) >> "] "/* To memory ("m" constraint) */ >> +#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] " >> /* From memory ("m" constraint) */ > > That cannot work (you typoed "ld" here). Indeed, I noticed it before pushing to master (lwd -> ld). > > Some more advice about this code, pretty generic stuff: Let's take an example to support the discussion here. I'm taking it from master branch (after a cleanup changing e.g. LOAD_WORD into RSEQ_LOAD_LONG). So for powerpc32 we have (code edited to remove testing instrumentation): #define __rseq_str_1(x) #x #define __rseq_str(x) __rseq_str_1(x) #define RSEQ_STORE_LONG(arg)"stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* To memory ("m" constraint) */ #define RSEQ_STORE_INT(arg) RSEQ_STORE_LONG(arg) /* To memory ("m" constraint) */ #define RSEQ_LOAD_LONG(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "/* From memory ("m" constraint) */ #define RSEQ_LOAD_INT(arg) RSEQ_LOAD_LONG(arg) /* From memory ("m" constraint) */ #define RSEQ_LOADX_LONG "lwzx " /* From base register ("b" constraint) */ #define RSEQ_CMP_LONG "cmpw " #define __RSEQ_ASM_DEFINE_TABLE(label, version, flags, \ start_ip, post_commit_offset, abort_ip) \ ".pushsection __rseq_cs, \"aw\"\n\t" \ ".balign 32\n\t" \ __rseq_str(label) ":\n\t" \ ".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t" \ /* 32-bit only supported on BE */ \ ".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) "\n\t" \ ".popsection\n\t" \ ".pushsection __rseq_cs_ptr_array, \"aw\"\n\t" \ ".long 0x0, " __rseq_str(label) "b\n\t" \ ".popsection\n\t" /* * Exit points of a rseq critical section consist of all instructions outside * of the critical section where a critical section can either branch to or * reach through the normal course of its execution. The abort IP and the * post-commit IP are already part of the __rseq_cs section and should not be * explicitly defined as additional exit points. Knowing all exit points is * useful to assist debuggers stepping over the critical section. */ #define RSEQ_ASM_DEFINE_EXIT_POINT(start_ip, exit_ip) \ ".pushsection __rseq_exit_point_array, \"aw\"\n\t" \ /* 32-bit only supported on BE */ \ ".long 0x0, " __rseq_str(start_ip) ", 0x0, " __rseq_str(exit_ip) "\n\t" \ ".popsection\n\t" #define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs) \ "lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t" \ "addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t" \ RSEQ_STORE_INT(rseq_cs) "%%r17, %[" __rseq_str(rseq_cs) "]\n\t" \ __rseq_str(label) ":\n\t" #define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip) \ __RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip, \ (post_commit_ip - start_ip), abort_ip) #define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label) \ RSEQ_LOAD_INT(current_cpu_id) "%%r17, %[" __rseq_str(current_cpu_id) "]\n\t" \ "cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t
Re: Failure to build librseq on ppc
- On Jul 8, 2020, at 8:18 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > On Wed, Jul 08, 2020 at 08:01:23PM -0400, Mathieu Desnoyers wrote: >> > > #define RSEQ_ASM_OP_CMPEQ(var, expect, label) >> > > \ >> > > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t" >> > >\ >> > >> > The way this hardcodes r17 *will* break, btw. The compiler will not >> > likely want to use r17 as long as your code (after inlining etc.!) stays >> > small, but there is Murphy's law. >> >> r17 is in the clobber list, so it should be ok. > > What protects r17 *after* this asm statement? As discussed in the other leg of the thread (with the code example), r17 is in the clobber list of all asm statements using this macro, and is used as a temporary register within each inline asm. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 9, 2020, at 1:37 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote: >> > What protects r17 *after* this asm statement? >> >> As discussed in the other leg of the thread (with the code example), >> r17 is in the clobber list of all asm statements using this macro, and >> is used as a temporary register within each inline asm. > > That works fine then, for a testcase. Using r17 is not a great idea for > performance (it increases the active register footprint, and causes more > registers to be saved in the prologue of the functions, esp. on older > compilers), and it is easier to just let the compiler choose a good > register to use. But maybe you want to see r17 in the generated > testcases, as eyecatcher or something, dunno :-) Just to make sure I understand your recommendation. So rather than hard-coding r17 as the temporary register, we could explicitly declare the temporary register as a C variable, pass it as an input operand to the inline asm, and then refer to it by operand name in the macros using it. This way the compiler would be free to perform its own register allocation. If that is what you have in mind, then yes, I think it makes a lot of sense. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: Failure to build librseq on ppc
- On Jul 9, 2020, at 1:42 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 9, 2020, at 1:37 PM, Segher Boessenkool > seg...@kernel.crashing.org > wrote: > >> On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote: >>> > What protects r17 *after* this asm statement? >>> >>> As discussed in the other leg of the thread (with the code example), >>> r17 is in the clobber list of all asm statements using this macro, and >>> is used as a temporary register within each inline asm. >> >> That works fine then, for a testcase. Using r17 is not a great idea for >> performance (it increases the active register footprint, and causes more >> registers to be saved in the prologue of the functions, esp. on older >> compilers), and it is easier to just let the compiler choose a good >> register to use. But maybe you want to see r17 in the generated >> testcases, as eyecatcher or something, dunno :-) > > Just to make sure I understand your recommendation. So rather than > hard-coding r17 as the temporary register, we could explicitly > declare the temporary register as a C variable, pass it as an > input operand to the inline asm, and then refer to it by operand > name in the macros using it. This way the compiler would be free > to perform its own register allocation. > > If that is what you have in mind, then yes, I think it makes a > lot of sense. Except that asm goto statements have this limitation with gcc: they cannot have any output operands, only inputs, clobbers and target labels. We cannot modify a temporary register received as an input operand. So I don't see how to get a temporary register allocated by the compiler considering this limitation. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
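To illustrate the constraint being discussed, a rough sketch (assuming GCC on powerpc64; the names are invented, this is not the librseq code): a plain asm statement can ask the compiler for a scratch register through an early-clobber output operand, but an asm goto accepts no outputs at all, so the scratch has to be a fixed register named in the clobber list.

/* Plain asm: the compiler hands out a scratch register via "=&r". */
static inline unsigned long load_long(unsigned long *p)
{
        unsigned long tmp;

        asm ("ld%U1%X1 %0, %1" : "=&r" (tmp) : "m" (*p));
        return tmp;
}

/* asm goto: no output operands are accepted, so a compiler-chosen
 * scratch is not available; hard-code a register and clobber it. */
#define BRANCH_IF_ZERO(p, label)                                \
        asm goto ("ld%U0%X0 %%r17, %0\n\t"                      \
                  "cmpdi %%r17, 0\n\t"                          \
                  "beq %l[" #label "]"                          \
                  : /* no outputs allowed */                    \
                  : "m" (*(p))                                  \
                  : "r17", "cr0"                                \
                  : label)

Usage would look like BRANCH_IF_ZERO(&head, empty); followed by an "empty:" label in the caller.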
Re: Failure to build librseq on ppc
- On Jul 9, 2020, at 4:46 PM, Segher Boessenkool seg...@kernel.crashing.org wrote: > On Thu, Jul 09, 2020 at 01:56:19PM -0400, Mathieu Desnoyers wrote: >> > Just to make sure I understand your recommendation. So rather than >> > hard-coding r17 as the temporary register, we could explicitly >> > declare the temporary register as a C variable, pass it as an >> > input operand to the inline asm, and then refer to it by operand >> > name in the macros using it. This way the compiler would be free >> > to perform its own register allocation. >> > >> > If that is what you have in mind, then yes, I think it makes a >> > lot of sense. >> >> Except that asm goto statements have this limitation with gcc: they cannot >> have any output operands, only inputs, clobbers and target labels. >> We cannot modify a temporary register received as an input operand. So I don't >> see how to get a temporary register allocated by the compiler considering >> this limitation. > > Heh, yet another reason not to obfuscate your inline asm: it didn't > register that this is asm goto. > > A clobber is one way, yes (those *are* allowed in asm goto). Another > way is to not actually change that register: move the original value > back into there at the end of the asm! (That isn't always easy to do, > it depends on your code). So something like > > long start = ...; > long tmp = start; > asm("stuff that modifies %0; ...; mr %0,%1" : : "r"(tmp), "r"(start)); > > is just fine: %0 isn't actually modified at all, as far as GCC is > concerned, and this isn't lying to it! It appears to be at the cost of adding one extra instruction on the fast-path to restore the register to its original value. I'll leave it to Boqun, who authored the original rseq-ppc code, to figure out what works best performance-wise (when he finds time). Thanks for the pointers! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
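A slightly more concrete rendering of that trick (hypothetical, not the actual rseq-ppc code; here the restore is done by undoing the arithmetic rather than with an mr from a second operand, to keep the sketch self-contained): the register is declared input-only, temporarily modified inside the asm, and put back to its original value before the asm ends; the final instruction is the extra fast-path cost mentioned above.

static inline void poke(unsigned long *p, unsigned long v)
{
        /* %0 is used as scratch inside the asm but holds its original
         * value again on exit, so an input-only operand is not lying
         * to the compiler; the "memory" clobber covers the store, which
         * keeps the asm in an asm goto compatible shape (no outputs). */
        asm ("addi %0, %0, 8\n\t"       /* temporarily modify the register */
             "std%U1%X1 %0, %1\n\t"     /* ... do some work with it ... */
             "addi %0, %0, -8"          /* restore the original value */
             : : "r" (v), "m" (*p)
             : "memory");
}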
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
> /* >* When switching through a kernel thread, the loop in >* membarrier_{private,global}_expedited() may have observed that >* kernel thread and not issued an IPI. It is therefore possible to >* schedule between user->kernel->user threads without passing though > - * switch_mm(). Membarrier requires a barrier after storing to > - * rq->curr, before returning to userspace, so provide them here: > - * > - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly > - * provided by mmdrop(), > - * - a sync_core for SYNC_CORE. > + * switch_mm(). Membarrier requires a full barrier after storing to > + * rq->curr, before returning to userspace, for > + * {PRIVATE,GLOBAL}_EXPEDITED. This is implicitly provided by mmdrop(). >*/ > - if (mm) { > - membarrier_mm_sync_core_before_usermode(mm); > + if (mm) > mmdrop(mm); > - } > + > if (unlikely(prev_state == TASK_DEAD)) { > if (prev->sched_class->task_dead) > prev->sched_class->task_dead(prev); > @@ -6292,6 +6289,7 @@ void idle_task_exit(void) > BUG_ON(current != this_rq()->idle); > > if (mm != &init_mm) { > + /* enter_lazy_tlb is not done because we're about to go down */ > switch_mm(mm, &init_mm, current); > finish_arch_post_lock_switch(); > } > -- > 2.23.0 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 13, 2020, at 9:47 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Nicholas Piggin's message of July 13, 2020 2:45 pm: >> Excerpts from Andy Lutomirski's message of July 11, 2020 3:04 am: >>> Also, as it stands, I can easily see in_irq() ceasing to promise to >>> serialize. There are older kernels for which it does not promise to >>> serialize. And I have plans to make it stop serializing in the >>> nearish future. >> >> You mean x86's return from interrupt? Sounds fun... you'll know where to >> update the membarrier sync code, at least :) > > Oh, I should actually say Mathieu recently clarified a return from > interrupt doesn't fundamentally need to serialize in order to support > membarrier sync core. Clarification to your statement: Return from interrupt to kernel code does not need to be context serializing as long as the kernel serializes before returning to user-space. However, return from interrupt to user-space needs to be context serializing. Thanks, Mathieu > > https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-July/214171.html > > So you may not need to do anything more if you relaxed it. > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE
- On Jul 15, 2020, at 5:48 AM, Nicholas Piggin npig...@gmail.com wrote: [...] > index 47bd4ea0837d..a4704f405e8d 100644 > --- a/arch/powerpc/include/asm/exception-64s.h > +++ b/arch/powerpc/include/asm/exception-64s.h > @@ -68,6 +68,13 @@ > * > * The nop instructions allow us to insert one or more instructions to flush > the > * L1-D cache when returning to userspace or a guest. > + * > + * powerpc relies on return from interrupt/syscall being context > synchronising > + * (which hrfid, rfid, and rfscv are) to support > ARCH_HAS_MEMBARRIER_SYNC_CORE > + * without additional synchronisation instructions. soft-masked > + * interrupt replay does not include a context-synchronising rfid, but those > + * always return to kernel, the context sync is only required for IPIs which > + * return to user. > */ > #define RFI_FLUSH_SLOT > \ > RFI_FLUSH_FIXUP_SECTION;\ I suspect the statement "the context sync is only required for IPIs which return to user." is misleading. As I recall, we need more than just a context sync after the IPI. We need a context sync in the return path of any trap/interrupt/system call which returns to user-space, else we'd need to add the proper core serializing barriers in the scheduler, as we had to do for lazy tlb on x86. Or am I missing something ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 7:00 AM, Peter Zijlstra pet...@infradead.org wrote: > On Thu, Jul 16, 2020 at 08:03:36PM +1000, Nicholas Piggin wrote: >> Excerpts from Peter Zijlstra's message of July 16, 2020 6:50 pm: >> > On Wed, Jul 15, 2020 at 10:18:20PM -0700, Andy Lutomirski wrote: >> >> > On Jul 15, 2020, at 9:15 PM, Nicholas Piggin wrote: > >> >> But I’m wondering if all this deferred sync stuff is wrong. In the >> >> brave new world of io_uring and such, perhaps kernel access matter >> >> too. Heck, even: >> > >> > IIRC the membarrier SYNC_CORE use-case is about user-space >> > self-modifying code. >> > >> > Userspace re-uses a text address and needs to SYNC_CORE before it can be >> > sure the old text is forgotten. Nothing the kernel does matters there. >> > >> > I suppose the manpage could be more clear there. >> >> True, but memory ordering of kernel stores from kernel threads for >> regular mem barrier is the concern here. >> >> Does io_uring update completion queue from kernel thread or interrupt, >> for example? If it does, then membarrier will not order such stores >> with user memory accesses. > > So we're talking about regular membarrier() then? Not the SYNC_CORE > variant per-se. > > Even there, I'll argue we don't care, but perhaps Mathieu has a > different opinion. I agree with Peter that we don't care about accesses to user-space memory performed concurrently with membarrier. What we'd care about in terms of accesses to user-space memory from the kernel is something that would be clearly ordered as happening before or after the membarrier call, for instance a read(2) followed by membarrier(2) after the read returns, or a read(2) issued after return from membarrier(2). The other scenario we'd care about is with the compiler barrier paired with membarrier: e.g. read(2) returns, compiler barrier, followed by a store. Or load, compiler barrier, followed by write(2). All those scenarios imply before/after ordering wrt either membarrier or the compiler barrier. I notice that io_uring has a "completion" queue. Let's try to come up with realistic usage scenarios. So the dependency chain would be provided by e.g.: * Infrequent read / Frequent write, communicating read completion through variable X wait for io_uring read request completion -> membarrier -> store X=1 with matching load from X (waiting for X==1) -> asm volatile (::: "memory") -> submit io_uring write request or this other scenario: * Frequent read / Infrequent write, communicating read completion through variable X load from X (waiting for X==1) -> membarrier -> submit io_uring write request with matching wait for io_uring read request completion -> asm volatile (::: "memory") -> store X=1 Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
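As a concrete (if simplified) user-space rendering of the first scenario, here is a sketch under the assumptions that liburing-style helpers are used, that the process has already registered MEMBARRIER_CMD_PRIVATE_EXPEDITED, and that X is an ordinary shared variable; the function names are invented for illustration.

#include <liburing.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

static volatile int X;

/* Infrequent reader: wait for the io_uring read completion, then publish X = 1. */
static void infrequent_reader(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;

        io_uring_wait_cqe(ring, &cqe);          /* wait for read request completion */
        io_uring_cqe_seen(ring, cqe);
        syscall(__NR_membarrier, MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        X = 1;                                  /* store X=1 */
}

/* Frequent writer: wait for X == 1, then submit the io_uring write request. */
static void frequent_writer(struct io_uring *ring, int fd, const void *buf, unsigned len)
{
        struct io_uring_sqe *sqe;

        while (!X)
                ;                               /* load from X, waiting for X==1 */
        asm volatile ("" ::: "memory");         /* compiler barrier matching the membarrier */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_write(sqe, fd, buf, len, 0);
        io_uring_submit(ring);
}

The second scenario is the mirror image: the membarrier moves to the thread that loads from X, and the compiler barrier to the thread that stores to it.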
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote: > I should be more complete here, especially since I was complaining > about unclear barrier comment :) > > > CPU0 CPU1 > a. user stuff 1. user stuff > b. membarrier() 2. enter kernel > c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule > d. read rq->curr 4. rq->curr switched to kthread > e. is kthread, skip IPI 5. switch_to kthread > f. return to user 6. rq->curr switched to user thread > g. user stuff 7. switch_to user thread > 8. exit kernel > 9. more user stuff > > What you're really ordering is a, g vs 1, 9 right? > > In other words, 9 must see a if it sees g, g must see 1 if it saw 9, > etc. > > Userspace does not care where the barriers are exactly or what kernel > memory accesses might be being ordered by them, so long as there is a > mb somewhere between a and g, and 1 and 9. Right? This is correct. Note that the accesses to user-space memory can be done either by user-space code or kernel code, it doesn't matter. However, in order to be considered as happening before/after either membarrier or the matching compiler barrier, kernel code needs to have a causality relationship with user-space execution, e.g. user-space does a system call, or returns from a system call. In the case of io_uring, submitting a request or returning from waiting on request completion appear to provide this causality relationship. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote: >> I should be more complete here, especially since I was complaining >> about unclear barrier comment :) >> >> >> CPU0 CPU1 >> a. user stuff1. user stuff >> b. membarrier() 2. enter kernel >> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> d. read rq->curr 4. rq->curr switched to kthread >> e. is kthread, skip IPI 5. switch_to kthread >> f. return to user6. rq->curr switched to user thread >> g. user stuff7. switch_to user thread >> 8. exit kernel >> 9. more user stuff >> >> What you're really ordering is a, g vs 1, 9 right? >> >> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> etc. >> >> Userspace does not care where the barriers are exactly or what kernel >> memory accesses might be being ordered by them, so long as there is a >> mb somewhere between a and g, and 1 and 9. Right? > > This is correct. Actually, sorry, the above is not quite right. It's been a while since I looked into the details of membarrier. The smp_mb() at the beginning of membarrier() needs to be paired with a smp_mb() _after_ rq->curr is switched back to the user thread, so the memory barrier is between store to rq->curr and following user-space accesses. The smp_mb() at the end of membarrier() needs to be paired with the smp_mb__after_spinlock() at the beginning of schedule, which is between accesses to userspace memory and switching rq->curr to kthread. As to *why* this ordering is needed, I'd have to dig through additional scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? Thanks, Mathieu > Note that the accesses to user-space memory can be > done either by user-space code or kernel code, it doesn't matter. > However, in order to be considered as happening before/after > either membarrier or the matching compiler barrier, kernel code > needs to have causality relationship with user-space execution, > e.g. user-space does a system call, or returns from a system call. > > In the case of io_uring, submitting a request or returning from waiting > on request completion appear to provide this causality relationship. > > Thanks, > > Mathieu > > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers mathieu.desnoy...@efficios.com wrote: > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers > mathieu.desnoy...@efficios.com wrote: > >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com wrote: >>> I should be more complete here, especially since I was complaining >>> about unclear barrier comment :) >>> >>> >>> CPU0 CPU1 >>> a. user stuff1. user stuff >>> b. membarrier() 2. enter kernel >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >>> d. read rq->curr 4. rq->curr switched to kthread >>> e. is kthread, skip IPI 5. switch_to kthread >>> f. return to user6. rq->curr switched to user thread >>> g. user stuff7. switch_to user thread >>> 8. exit kernel >>> 9. more user stuff >>> >>> What you're really ordering is a, g vs 1, 9 right? >>> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >>> etc. >>> >>> Userspace does not care where the barriers are exactly or what kernel >>> memory accesses might be being ordered by them, so long as there is a >>> mb somewhere between a and g, and 1 and 9. Right? >> >> This is correct. > > Actually, sorry, the above is not quite right. It's been a while > since I looked into the details of membarrier. > > The smp_mb() at the beginning of membarrier() needs to be paired with a > smp_mb() _after_ rq->curr is switched back to the user thread, so the > memory barrier is between store to rq->curr and following user-space > accesses. > > The smp_mb() at the end of membarrier() needs to be paired with the > smp_mb__after_spinlock() at the beginning of schedule, which is > between accesses to userspace memory and switching rq->curr to kthread. > > As to *why* this ordering is needed, I'd have to dig through additional > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? Thinking further about this, I'm beginning to consider that maybe we have been overly cautious by requiring memory barriers before and after store to rq->curr. If CPU0 observes a CPU1's rq->curr->mm which differs from its own process (current) while running the membarrier system call, it necessarily means that CPU1 had to issue smp_mb__after_spinlock when entering the scheduler, between any user-space loads/stores and update of rq->curr. Requiring a memory barrier between update of rq->curr (back to current process's thread) and following user-space memory accesses does not seem to guarantee anything more than what the initial barrier at the beginning of __schedule already provides, because the guarantees are only about accesses to user-space memory. Therefore, with the memory barrier at the beginning of __schedule, just observing that CPU1's rq->curr differs from current should guarantee that a memory barrier was issued between any sequentially consistent instructions belonging to the current process on CPU1. Or am I missing/misremembering an important point here ? Thanks, Mathieu > > Thanks, > > Mathieu > > >> Note that the accesses to user-space memory can be >> done either by user-space code or kernel code, it doesn't matter. >> However, in order to be considered as happening before/after >> either membarrier or the matching compiler barrier, kernel code >> needs to have causality relationship with user-space execution, >> e.g. user-space does a system call, or returns from a system call. >> >> In the case of io_uring, submitting a request or returning from waiting >> on request completion appear to provide this causality relationship. 
>> >> Thanks, >> >> Mathieu >> >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> http://www.efficios.com > > -- > Mathieu Desnoyers > EfficiOS Inc. > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu wrote: > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> - On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: >> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> > mathieu.desnoy...@efficios.com wrote: >> > >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com >> >> wrote: >> >>> I should be more complete here, especially since I was complaining >> >>> about unclear barrier comment :) >> >>> >> >>> >> >>> CPU0 CPU1 >> >>> a. user stuff1. user stuff >> >>> b. membarrier() 2. enter kernel >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >>> f. return to user6. rq->curr switched to user thread >> >>> g. user stuff7. switch_to user thread >> >>> 8. exit kernel >> >>> 9. more user stuff >> >>> >> >>> What you're really ordering is a, g vs 1, 9 right? >> >>> >> >>> In other words, 9 must see a if it sees g, g must see 1 if it saw 9, >> >>> etc. >> >>> >> >>> Userspace does not care where the barriers are exactly or what kernel >> >>> memory accesses might be being ordered by them, so long as there is a >> >>> mb somewhere between a and g, and 1 and 9. Right? >> >> >> >> This is correct. >> > >> > Actually, sorry, the above is not quite right. It's been a while >> > since I looked into the details of membarrier. >> > >> > The smp_mb() at the beginning of membarrier() needs to be paired with a >> > smp_mb() _after_ rq->curr is switched back to the user thread, so the >> > memory barrier is between store to rq->curr and following user-space >> > accesses. >> > >> > The smp_mb() at the end of membarrier() needs to be paired with the >> > smp_mb__after_spinlock() at the beginning of schedule, which is >> > between accesses to userspace memory and switching rq->curr to kthread. >> > >> > As to *why* this ordering is needed, I'd have to dig through additional >> > scenarios from https://lwn.net/Articles/573436/. Or maybe Paul remembers ? >> >> Thinking further about this, I'm beginning to consider that maybe we have >> been >> overly cautious by requiring memory barriers before and after store to >> rq->curr. >> >> If CPU0 observes a CPU1's rq->curr->mm which differs from its own process >> (current) >> while running the membarrier system call, it necessarily means that CPU1 had >> to issue smp_mb__after_spinlock when entering the scheduler, between any >> user-space >> loads/stores and update of rq->curr. >> >> Requiring a memory barrier between update of rq->curr (back to current >> process's >> thread) and following user-space memory accesses does not seem to guarantee >> anything more than what the initial barrier at the beginning of __schedule >> already >> provides, because the guarantees are only about accesses to user-space >> memory. >> >> Therefore, with the memory barrier at the beginning of __schedule, just >> observing that >> CPU1's rq->curr differs from current should guarantee that a memory barrier >> was >> issued >> between any sequentially consistent instructions belonging to the current >> process on >> CPU1. >> >> Or am I missing/misremembering an important point here ? > > Is it correct to say that the switch_to operations in 5 and 7 include > memory barriers? If they do, then skipping the IPI should be okay. 
> > The reason is as follows: The guarantee you need to enforce is that > anything written by CPU0 before the membarrier() will be visible to CPU1 > after it returns to user mode. Let's say that a writes to X and 9 > reads from X. > > Then we have an instance of the Store Buffer pattern: > > CPU0CPU1 > a. Write X 6. Write rq->curr for user thread > c. smp_mb() 7. switch_to memory barrier > d. Read rq->curr9. Read X > > In this pattern, the memory barriers make it impossible for both reads > to miss their corresponding writes. Since d does fail to read 6 (it > sees the earlier value stored by 4), 9 must read a. > > The other guarantee you need is that g on CPU0 will observe anything > written by CPU1 in 1. This is easier to see, using the fact that 3 is a > memory barrier and d reads from 4. Right, and Nick's reply involving pairs of loads/stores on each side clarifies the situation even further. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
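Alan's store-buffer argument can also be written down as an LKMM litmus test (a sketch only: rq_curr here is a plain variable standing in for rq->curr, and the switch_to barrier is modelled as smp_mb()):

C membarrier-sb

{}

P0(int *x, int *rq_curr)
{
	int r0;

	WRITE_ONCE(*x, 1);		/* a: write X */
	smp_mb();			/* c: smp_mb() in membarrier() */
	r0 = READ_ONCE(*rq_curr);	/* d: read rq->curr */
}

P1(int *x, int *rq_curr)
{
	int r1;

	WRITE_ONCE(*rq_curr, 1);	/* 6: rq->curr switched to user thread */
	smp_mb();			/* 7: switch_to memory barrier */
	r1 = READ_ONCE(*x);		/* 9: read X from user-space */
}

exists (0:r0=0 /\ 1:r1=0)

herd7 should report that the exists clause is never satisfied, matching the claim that if d fails to read 6, then 9 must read a.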
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote: [...] > > membarrier does replace barrier instructions on remote CPUs, which do > order accesses performed by the kernel on the user address space. So > membarrier should too I guess. > > Normal process context accesses like read(2) will do so because they > don't get filtered out from IPIs, but kernel threads using the mm may > not. But it should not be an issue, because membarrier's ordering is only with respect to submit and completion of io_uring requests, which are performed through system calls from the context of user-space threads, which are called from the right mm. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 17, 2020, at 10:51 AM, Alan Stern st...@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 09:39:25AM -0400, Mathieu Desnoyers wrote: >> - On Jul 16, 2020, at 5:24 PM, Alan Stern st...@rowland.harvard.edu >> wrote: >> >> > On Thu, Jul 16, 2020 at 02:58:41PM -0400, Mathieu Desnoyers wrote: >> >> ----- On Jul 16, 2020, at 12:03 PM, Mathieu Desnoyers >> >> mathieu.desnoy...@efficios.com wrote: >> >> >> >> > - On Jul 16, 2020, at 11:46 AM, Mathieu Desnoyers >> >> > mathieu.desnoy...@efficios.com wrote: >> >> > >> >> >> - On Jul 16, 2020, at 12:42 AM, Nicholas Piggin npig...@gmail.com >> >> >> wrote: >> >> >>> I should be more complete here, especially since I was complaining >> >> >>> about unclear barrier comment :) >> >> >>> >> >> >>> >> >> >>> CPU0 CPU1 >> >> >>> a. user stuff1. user stuff >> >> >>> b. membarrier() 2. enter kernel >> >> >>> c. smp_mb() 3. smp_mb__after_spinlock(); // in __schedule >> >> >>> d. read rq->curr 4. rq->curr switched to kthread >> >> >>> e. is kthread, skip IPI 5. switch_to kthread >> >> >>> f. return to user6. rq->curr switched to user thread >> >> >>> g. user stuff7. switch_to user thread >> >> >>> 8. exit kernel >> >> >>> 9. more user stuff > > ... > >> >> Requiring a memory barrier between update of rq->curr (back to current >> >> process's >> >> thread) and following user-space memory accesses does not seem to >> >> guarantee >> >> anything more than what the initial barrier at the beginning of __schedule >> >> already >> >> provides, because the guarantees are only about accesses to user-space >> >> memory. > > ... > >> > Is it correct to say that the switch_to operations in 5 and 7 include >> > memory barriers? If they do, then skipping the IPI should be okay. >> > >> > The reason is as follows: The guarantee you need to enforce is that >> > anything written by CPU0 before the membarrier() will be visible to CPU1 >> > after it returns to user mode. Let's say that a writes to X and 9 >> > reads from X. >> > >> > Then we have an instance of the Store Buffer pattern: >> > >> >CPU0CPU1 >> >a. Write X 6. Write rq->curr for user thread >> >c. smp_mb() 7. switch_to memory barrier >> >d. Read rq->curr9. Read X >> > >> > In this pattern, the memory barriers make it impossible for both reads >> > to miss their corresponding writes. Since d does fail to read 6 (it >> > sees the earlier value stored by 4), 9 must read a. >> > >> > The other guarantee you need is that g on CPU0 will observe anything >> > written by CPU1 in 1. This is easier to see, using the fact that 3 is a >> > memory barrier and d reads from 4. >> >> Right, and Nick's reply involving pairs of loads/stores on each side >> clarifies the situation even further. > > The key part of my reply was the question: "Is it correct to say that > the switch_to operations in 5 and 7 include memory barriers?" From the > text quoted above and from Nick's reply, it seems clear that they do > not. I remember that switch_mm implies it, but not switch_to. The scenario that triggered this discussion is when the scheduler does a lazy tlb entry/exit, which is basically switch from a user task to a kernel thread without changing the mm, and usually switching back afterwards. This optimization means the rq->curr mm temporarily differs, which prevent IPIs from being sent by membarrier, but without involving a switch_mm. This requires explicit memory barriers either on entry/exit of lazy tlb mode, or explicit barriers in the scheduler for those special-cases. 
> I agree with Nick: A memory barrier is needed somewhere between the > assignment at 6 and the return to user mode at 8. Otherwise you end up > with the Store Buffer pattern having a memory barrier on only one side, > and it is well known that this arrangement does not guarantee any > ordering. Yes, I see this now. I'm still trying to wrap my head around why the memory barrier at the end of membarrier() needs to be paired with a scheduler barrier though. > One thing I don't understand about all this: Any context switch has to > include a memory barrier somewhere, but both you and Nick seem to be > saying that steps 6 and 7 don't include (or don't need) any memory > barriers. What am I missing? All context switches have the smp_mb__after_spinlock() at the beginning of __schedule(), which I suspect is what you refer to. However, this barrier is before the store to rq->curr, not after. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu wrote: >> > I agree with Nick: A memory barrier is needed somewhere between the >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> > with the Store Buffer pattern having a memory barrier on only one side, >> > and it is well known that this arrangement does not guarantee any >> > ordering. >> >> Yes, I see this now. I'm still trying to wrap my head around why the memory >> barrier at the end of membarrier() needs to be paired with a scheduler >> barrier though. > > The memory barrier at the end of membarrier() on CPU0 is necessary in > order to enforce the guarantee that any writes occurring on CPU1 before > the membarrier() is executed will be visible to any code executing on > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: > > CPU0CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > b: send IPI IPI-induced mb > c: smp_mb() > r1 = x > > The writes to x and y are unordered by the hardware, so it's possible to > have r2 = 1 even though the write to x doesn't execute until b. If the > memory barrier at c is omitted then "r1 = x" can be reordered before b > (although not before a), so we get r1 = 0. This violates the guarantee > that membarrier() is supposed to provide. > > The timing of the memory barrier at c has to ensure that it executes > after the IPI-induced memory barrier on CPU1. If it happened before > then we could still end up with r1 = 0. That's why the pairing matters. > > I hope this helps your head get properly wrapped. :-) It does help a bit! ;-) This explains this part of the comment near the smp_mb at the end of membarrier: * Memory barrier on the caller thread _after_ we finished * waiting for the last IPI. [...] However, it does not explain why it needs to be paired with a barrier in the scheduler, clearly for the case where the IPI is skipped. I wonder whether this part of the comment is factually correct: * [...] Matches memory barriers around rq->curr modification in scheduler. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 17, 2020, at 1:44 PM, Alan Stern st...@rowland.harvard.edu wrote: > On Fri, Jul 17, 2020 at 12:22:49PM -0400, Mathieu Desnoyers wrote: >> - On Jul 17, 2020, at 12:11 PM, Alan Stern st...@rowland.harvard.edu >> wrote: >> >> >> > I agree with Nick: A memory barrier is needed somewhere between the >> >> > assignment at 6 and the return to user mode at 8. Otherwise you end up >> >> > with the Store Buffer pattern having a memory barrier on only one side, >> >> > and it is well known that this arrangement does not guarantee any >> >> > ordering. >> >> >> >> Yes, I see this now. I'm still trying to wrap my head around why the >> >> memory >> >> barrier at the end of membarrier() needs to be paired with a scheduler >> >> barrier though. >> > >> > The memory barrier at the end of membarrier() on CPU0 is necessary in >> > order to enforce the guarantee that any writes occurring on CPU1 before >> > the membarrier() is executed will be visible to any code executing on >> > CPU0 after the membarrier(). Ignoring the kthread issue, we can have: >> > >> >CPU0CPU1 >> >x = 1 >> >barrier() >> >y = 1 >> >r2 = y >> >membarrier(): >> > a: smp_mb() >> > b: send IPI IPI-induced mb >> > c: smp_mb() >> >r1 = x >> > >> > The writes to x and y are unordered by the hardware, so it's possible to >> > have r2 = 1 even though the write to x doesn't execute until b. If the >> > memory barrier at c is omitted then "r1 = x" can be reordered before b >> > (although not before a), so we get r1 = 0. This violates the guarantee >> > that membarrier() is supposed to provide. >> > >> > The timing of the memory barrier at c has to ensure that it executes >> > after the IPI-induced memory barrier on CPU1. If it happened before >> > then we could still end up with r1 = 0. That's why the pairing matters. >> > >> > I hope this helps your head get properly wrapped. :-) >> >> It does help a bit! ;-) >> >> This explains this part of the comment near the smp_mb at the end of >> membarrier: >> >> * Memory barrier on the caller thread _after_ we finished >> * waiting for the last IPI. [...] >> >> However, it does not explain why it needs to be paired with a barrier in the >> scheduler, clearly for the case where the IPI is skipped. I wonder whether >> this >> part >> of the comment is factually correct: >> >> * [...] Matches memory barriers around rq->curr modification in >> scheduler. > > The reasoning is pretty much the same as above: > > CPU0CPU1 > x = 1 > barrier() > y = 1 > r2 = y > membarrier(): > a: smp_mb() > switch to kthread (includes mb) > b: read rq->curr == kthread > switch to user (includes mb) > c: smp_mb() > r1 = x > > Once again, it is possible that x = 1 doesn't become visible to CPU0 > until shortly before b. But if c is omitted then "r1 = x" can be > reordered before b (to any time after a), so we can have r1 = 0. > > Here the timing requirement is that c executes after the first memory > barrier on CPU1 -- which is one of the ones around the rq->curr > modification. (In fact, in this scenario CPU1's switch back to the user > process is irrelevant.) That indeed covers the last scenario I was wondering about. Thanks Alan! Mathieu > > Alan Stern -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
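The skipped-IPI case Alan describes reduces to a message-passing shape, which can likewise be sketched as an LKMM litmus test (rq_curr again stands in for rq->curr, with the value 1 meaning "kthread"; the barrier in the switch to the kthread is modelled as smp_mb()):

C membarrier-skip-ipi-mp

{}

P0(int *x, int *rq_curr)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*rq_curr);	/* b: membarrier() observes the kthread */
	smp_mb();			/* c: smp_mb() at the end of membarrier() */
	r1 = READ_ONCE(*x);		/* later read of X on CPU0 */
}

P1(int *x, int *rq_curr)
{
	WRITE_ONCE(*x, 1);		/* x = 1 in user-space on CPU1 */
	smp_mb();			/* barrier in the switch to the kthread */
	WRITE_ONCE(*rq_curr, 1);	/* rq->curr = kthread */
}

exists (0:r0=1 /\ 0:r1=0)

With both barriers present the outcome is forbidden; removing the smp_mb() at c makes the exists clause reachable, which is exactly the reordering Alan points out.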
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 19, 2020, at 11:03 PM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 17, 2020 11:42 pm: >> - On Jul 16, 2020, at 7:26 PM, Nicholas Piggin npig...@gmail.com wrote: >> [...] >>> >>> membarrier does replace barrier instructions on remote CPUs, which do >>> order accesses performed by the kernel on the user address space. So >>> membarrier should too I guess. >>> >>> Normal process context accesses like read(2) will do so because they >>> don't get filtered out from IPIs, but kernel threads using the mm may >>> not. >> >> But it should not be an issue, because membarrier's ordering is only with >> respect >> to submit and completion of io_uring requests, which are performed through >> system calls from the context of user-space threads, which are called from >> the >> right mm. > > Is that true? Can io completions be written into an address space via a > kernel thread? I don't know the io_uring code well but it looks like > that's asynchonously using the user mm context. Indeed, the io completion appears to be signaled asynchronously between kernel and user-space. Therefore, both kernel and userspace code need to have proper memory barriers in place to signal completion, otherwise user-space could read garbage after it notices completion of a read. I did not review the entire io_uring implementation, but the publish side for completion appears to be: static void __io_commit_cqring(struct io_ring_ctx *ctx) { struct io_rings *rings = ctx->rings; /* order cqe stores with ring update */ smp_store_release(&rings->cq.tail, ctx->cached_cq_tail); if (wq_has_sleeper(&ctx->cq_wait)) { wake_up_interruptible(&ctx->cq_wait); kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN); } } The store-release on tail should be paired with a load_acquire on the reader-side (it's called "read_barrier()" in the code): tools/io_uring/queue.c: static int __io_uring_get_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr, int wait) { struct io_uring_cq *cq = &ring->cq; const unsigned mask = *cq->kring_mask; unsigned head; int ret; *cqe_ptr = NULL; head = *cq->khead; do { /* * It's necessary to use a read_barrier() before reading * the CQ tail, since the kernel updates it locklessly. The * kernel has the matching store barrier for the update. The * kernel also ensures that previous stores to CQEs are ordered * with the tail update. */ read_barrier(); if (head != *cq->ktail) { *cqe_ptr = &cq->cqes[head & mask]; break; } if (!wait) break; ret = io_uring_enter(ring->ring_fd, 0, 1, IORING_ENTER_GETEVENTS, NULL); if (ret < 0) return -errno; } while (1); return 0; } So as far as membarrier memory ordering dependencies are concerned, it relies on the store-release/load-acquire dependency chain in the completion queue to order against anything that was done prior to the completed requests. What is in-flight while the requests are being serviced provides no memory ordering guarantee whatsoever. > How about other memory accesses via kthread_use_mm? Presumably there is > still ordering requirement there for membarrier, Please provide an example case with memory accesses via kthread_use_mm where ordering matters to support your concern. > so I really think > it's a fragile interface with no real way for the user to know how > kernel threads may use its mm for any particular reason, so membarrier > should synchronize all possible kernel users as well. I strongly doubt so, but perhaps something should be clarified in the documentation if you have that feeling. 
Thanks, Mathieu > > Thanks, > Nick -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 21, 2020, at 6:04 AM, Nicholas Piggin npig...@gmail.com wrote: > Excerpts from Mathieu Desnoyers's message of July 21, 2020 2:46 am: [...] > > Yeah you're probably right in this case I think. Quite likely most kernel > tasks that asynchronously write to user memory would at least have some > kind of producer-consumer barriers. > > But is that restriction of all async modifications documented and enforced > anywhere? > >>> How about other memory accesses via kthread_use_mm? Presumably there is >>> still ordering requirement there for membarrier, >> >> Please provide an example case with memory accesses via kthread_use_mm where >> ordering matters to support your concern. > > I think the concern Andy raised with io_uring was less a specific > problem he saw and more a general concern that we have these memory > accesses which are not synchronized with membarrier. > >>> so I really think >>> it's a fragile interface with no real way for the user to know how >>> kernel threads may use its mm for any particular reason, so membarrier >>> should synchronize all possible kernel users as well. >> >> I strongly doubt so, but perhaps something should be clarified in the >> documentation >> if you have that feeling. > > I'd rather go the other way and say if you have reasoning or numbers for > why PF_KTHREAD is an important optimisation above rq->curr == rq->idle > then we could think about keeping this subtlety with appropriate > documentation added, otherwise we can just kill it and remove all doubt. > > That being said, the x86 sync core gap that I imagined could be fixed > by changing to rq->curr == rq->idle test does not actually exist because > the global membarrier does not have a sync core option. So fixing the > exit_lazy_tlb points that this series does *should* fix that. So > PF_KTHREAD may be less problematic than I thought from implementation > point of view, only semantics. Today, the membarrier global expedited command explicitly skips kernel threads, but it happens that membarrier private expedited considers those with the same mm as targets for the IPI. So we already implement a semantic which differs between private and global expedited membarriers. This can be explained in part by the fact that kthread_use_mm was introduced after 4.16, which is when the most recent membarrier commands were introduced. It seems that the effect on membarrier was not considered when kthread_use_mm was introduced. Looking at membarrier(2) documentation, it states that IPIs are only sent to threads belonging to the same process as the calling thread. If my understanding of the notion of process is correct, this should rule out sending the IPI to kernel threads, given they are not "part" of the same process, only borrowing the mm. But I agree that the distinction is moot, and should be clarified. Without a clear use-case to justify adding a constraint on membarrier, I am tempted to simply clarify the documentation of the current membarrier commands, stating clearly that they are not guaranteed to affect kernel threads. Then, if we have a compelling use-case to implement a different behavior which covers kthreads, this could be added consistently across membarrier commands with a flag (or by adding new commands). Does this approach make sense ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org wrote: > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: > >> That being said, the x86 sync core gap that I imagined could be fixed >> by changing to rq->curr == rq->idle test does not actually exist because >> the global membarrier does not have a sync core option. So fixing the >> exit_lazy_tlb points that this series does *should* fix that. So >> PF_KTHREAD may be less problematic than I thought from implementation >> point of view, only semantics. > > So I've been trying to figure out where that PF_KTHREAD comes from, > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. > > So the first version: > > > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com > > appears to unconditionally send the IPI and checks p->mm in the IPI > context, but then v2: > > > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com > > has the current code. But I've been unable to find the reason the > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. Looking back at my inbox, it seems like you are the one who proposed to skip all kthreads: https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net > > The comment doesn't really help either; sure we have the whole lazy mm > thing, but that's ->active_mm, not ->mm. > > Possibly it is because {,un}use_mm() do not have sufficient barriers to > make the remote p->mm test work? Or were we over-eager with the !p->mm > doesn't imply kthread 'cleanups' at the time? The nice thing about adding back kthreads to the threads considered for membarrier IPI is that it has no observable effect on the user-space ABI. No pre-existing kthread rely on this, and we just provide an additional guarantee for future kthread implementations. > Also, I just realized, I still have a fix for use_mm() now > kthread_use_mm() that seems to have been lost. I suspect we need to at least document the memory barriers in kthread_use_mm and kthread_unuse_mm to state that they are required by membarrier if we want to ipi kthreads as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Jul 21, 2020, at 11:19 AM, Peter Zijlstra pet...@infradead.org wrote: > On Tue, Jul 21, 2020 at 11:15:13AM -0400, Mathieu Desnoyers wrote: >> - On Jul 21, 2020, at 11:06 AM, Peter Zijlstra pet...@infradead.org >> wrote: >> >> > On Tue, Jul 21, 2020 at 08:04:27PM +1000, Nicholas Piggin wrote: >> > >> >> That being said, the x86 sync core gap that I imagined could be fixed >> >> by changing to rq->curr == rq->idle test does not actually exist because >> >> the global membarrier does not have a sync core option. So fixing the >> >> exit_lazy_tlb points that this series does *should* fix that. So >> >> PF_KTHREAD may be less problematic than I thought from implementation >> >> point of view, only semantics. >> > >> > So I've been trying to figure out where that PF_KTHREAD comes from, >> > commit 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy >> > load") changed 'p->mm' to '!(p->flags & PF_KTHREAD)'. >> > >> > So the first version: >> > >> > >> > https://lkml.kernel.org/r/20190906031300.1647-5-mathieu.desnoy...@efficios.com >> > >> > appears to unconditionally send the IPI and checks p->mm in the IPI >> > context, but then v2: >> > >> > >> > https://lkml.kernel.org/r/20190908134909.12389-1-mathieu.desnoy...@efficios.com >> > >> > has the current code. But I've been unable to find the reason the >> > 'p->mm' test changed into '!(p->flags & PF_KTHREAD)'. >> >> Looking back at my inbox, it seems like you are the one who proposed to >> skip all kthreads: >> >> https://lkml.kernel.org/r/20190904124333.gq2...@hirez.programming.kicks-ass.net > > I had a feeling it might've been me ;-) I just couldn't find the email. > >> > The comment doesn't really help either; sure we have the whole lazy mm >> > thing, but that's ->active_mm, not ->mm. >> > >> > Possibly it is because {,un}use_mm() do not have sufficient barriers to >> > make the remote p->mm test work? Or were we over-eager with the !p->mm >> > doesn't imply kthread 'cleanups' at the time? >> >> The nice thing about adding back kthreads to the threads considered for >> membarrier >> IPI is that it has no observable effect on the user-space ABI. No >> pre-existing >> kthread >> rely on this, and we just provide an additional guarantee for future kthread >> implementations. >> >> > Also, I just realized, I still have a fix for use_mm() now >> > kthread_use_mm() that seems to have been lost. >> >> I suspect we need to at least document the memory barriers in kthread_use_mm >> and >> kthread_unuse_mm to state that they are required by membarrier if we want to >> ipi kthreads as well. > > Right, so going by that email you found it was mostly a case of being > lazy, but yes, if we audit the kthread_{,un}use_mm() barriers and add > any other bits that might be needed, covering kthreads should be > possible. > > No objections from me for making it so. I'm OK on making membarrier cover kthreads using mm as well, provided we audit kthread_{,un}use_mm() to make sure the proper barriers are in place after setting task->mm and before clearing it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com wrote: > On Mon, Aug 23, 2021, Mathieu Desnoyers wrote: >> [ re-send to Darren Hart ] >> >> - On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers >> mathieu.desnoy...@efficios.com wrote: >> >> > - On Aug 20, 2021, at 6:50 PM, Sean Christopherson sea...@google.com >> > wrote: >> > >> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is >> >> migrated while the kernel is handling KVM_RUN. This is a regression test >> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer >> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM >> >> without updating rseq, leading to a stale CPU ID and other badness. >> >> >> > >> > [...] >> > >> > +#define RSEQ_SIG 0xdeadbeef >> > >> > Is there any reason for defining a custom signature rather than including >> > tools/testing/selftests/rseq/rseq.h ? This should take care of including >> > the proper architecture header which will define the appropriate signature. >> > >> > Arguably you don't define rseq critical sections in this test per se, but >> > I'm wondering why the custom signature here. > > Partly to avoid taking a dependency on rseq.h, and partly to try to call out > that > the test doesn't actually do any rseq critical sections. It might be good to add a comment near this define stating this then, so nobody attempts to copy this as an example. > >> > [...] >> > >> >> + >> >> +static void *migration_worker(void *ign) >> >> +{ >> >> + cpu_set_t allowed_mask; >> >> + int r, i, nr_cpus, cpu; >> >> + >> >> + CPU_ZERO(&allowed_mask); >> >> + >> >> + nr_cpus = CPU_COUNT(&possible_mask); >> >> + >> >> + for (i = 0; i < 2; i++) { >> >> + cpu = i % nr_cpus; >> >> + if (!CPU_ISSET(cpu, &possible_mask)) >> >> + continue; >> >> + >> >> + CPU_SET(cpu, &allowed_mask); >> >> + >> >> + /* >> >> + * Bump the sequence count twice to allow the reader to detect >> >> + * that a migration may have occurred in between rseq and sched >> >> + * CPU ID reads. An odd sequence count indicates a migration >> >> + * is in-progress, while a completely different count indicates >> >> + * a migration occurred since the count was last read. >> >> + */ >> >> + atomic_inc(&seq_cnt); >> > >> > So technically this atomic_inc contains the required barriers because the >> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But >> > it's rather odd that the semantic differs from the kernel implementation in >> > terms of memory barriers: the kernel implementation of atomic_inc >> > guarantees no memory barriers, but this one happens to provide full >> > barriers pretty much by accident (selftests futex/include/atomic.h >> > documents no such guarantee). > > Yeah, I got quite lost trying to figure out what atomics the test would > actually > end up with. At the very least, until things are clarified in the selftests atomic header, I would recommend adding a comment stating which memory barriers are required alongside each use of atomic_inc here. I would even prefer that we add extra (currently unneeded) write barriers to make extra sure that this stays documented. Performance of the write-side does not matter much here. > >> > If this full barrier guarantee is indeed provided by the selftests atomic.h >> > header, I would really like a comment stating that in the atomic.h header >> > so the carpet is not pulled from under our feet by a future optimization. 
>> > >> > >> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask); >> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", >> >> + errno, strerror(errno)); >> >> + atomic_inc(&seq_cnt); >> >> + >> >> + CPU_CLR(cpu, &allowed_mask); >> >> + >> >> + /* >> >> + * Let the read-side get back into KVM_RUN to improve the odds >> >> + * of task migration coinc
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 26, 2021, at 7:54 PM, Sean Christopherson sea...@google.com wrote: > On Thu, Aug 26, 2021, Mathieu Desnoyers wrote: >> - On Aug 25, 2021, at 8:51 PM, Sean Christopherson sea...@google.com >> wrote: >> >> >> + r = sched_setaffinity(0, sizeof(allowed_mask), >> >> >> &allowed_mask); >> >> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d >> >> >> (%s)", >> >> >> + errno, strerror(errno)); >> >> >> + atomic_inc(&seq_cnt); >> >> >> + >> >> >> + CPU_CLR(cpu, &allowed_mask); >> >> >> + >> >> >> + /* >> >> >> + * Let the read-side get back into KVM_RUN to improve >> >> >> the odds >> >> >> + * of task migration coinciding with KVM's run loop. >> >> > >> >> > This comment should be about increasing the odds of letting the seqlock >> >> > read-side complete. Otherwise, the delay between the two back-to-back >> >> > atomic_inc is so small that the seqlock read-side may never have time to >> >> > complete the reading the rseq cpu id and the sched_getcpu() call, and >> >> > can >> >> > retry forever. >> > >> > Hmm, but that's not why there's a delay. I'm not arguing that a livelock >> > isn't >> > possible (though that syscall would have to be screaming fast), >> >> I don't think we have the same understanding of the livelock scenario. AFAIU >> the >> livelock >> would be caused by a too-small delay between the two consecutive atomic_inc() >> from one >> loop iteration to the next compared to the time it takes to perform a >> read-side >> critical >> section of the seqlock. Back-to-back atomic_inc can be performed very >> quickly, >> so I >> doubt that the sched_getcpu implementation have good odds to be fast enough >> to >> complete >> in that narrow window, leading to lots of read seqlock retry. > > Ooooh, yeah, brain fart on my side. I was thinking of the two atomic_inc() in > the > same loop iteration and completely ignoring the next iteration. Yes, I 100% > agree > that a delay to ensure forward progress is needed. An assertion in main() > that > the > reader complete at least some reasonable number of KVM_RUNs is also probably a > good > idea, e.g. to rule out a false pass due to the reader never making forward > progress. Agreed. > > FWIW, the do-while loop does make forward progress without a delay, but at > ~50% > throughput, give or take. I did not expect absolutely no progress, but a significant slow down of the read-side. > >> > but the primary motivation is very much to allow the read-side enough time >> > to get back into KVM proper. >> >> I'm puzzled by your statement. AFAIU, let's say we don't have the delay, >> then we >> have: >> >> migration thread KVM_RUN/read-side thread >> --- >> - ioctl(KVM_RUN) >> - atomic_inc_seq_cst(&seqcnt) >> - sched_setaffinity >> - atomic_inc_seq_cst(&seqcnt) >> - a = atomic_load(&seqcnt) & ~1 >> - smp_rmb() >> - b = >> LOAD_ONCE(__rseq_abi->cpu_id); >> - sched_getcpu() >> - smp_rmb() >> - re-load seqcnt/compare >> (succeeds) >>- Can only succeed if entire >> read-side happens while the seqcnt >> is in an even numbered >> state. >> - if (a != b) abort() >> /* no delay. Even counter state is very >> short. */ >> - atomic_inc_seq_cst(&seqcnt) >> /* Let's suppose the lack of delay causes the >> setaffinity to complete too early compared >> with KVM_RUN ioctl */ >> - sched_setaffinity >> - atomic_inc_seq_cst(&seqcnt) >> >> /* no delay. Even counter state is very >>
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 27, 2021, at 7:23 PM, Sean Christopherson sea...@google.com wrote: > On Fri, Aug 27, 2021, Mathieu Desnoyers wrote: [...] >> Does it reproduce if we randomize the delay to have it picked randomly from >> 0us >> to 100us (with 1us step) ? It would remove a lot of the needs for >> arch-specific >> magic delay value. > > My less-than-scientific testing shows that it can reproduce at delays up to > ~500us, > but above ~10us the reproducibility starts to drop. The bug still reproduces > reliably, it just takes more iterations, and obviously the test runs a bit > slower. > > Any objection to using a 1-10us delay, e.g. a simple usleep((i % 10) + 1)? Works for me, thanks! Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
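For reference, the seqcount handshake discussed in this sub-thread boils down to something like the following reader-side sketch; the names (seq_cnt, __rseq, atomic_read, smp_rmb, TEST_ASSERT) are assumed to come from the selftest environment, and the real code lives in tools/testing/selftests/kvm/rseq_test.c. The writer brackets each sched_setaffinity() call with two atomic_inc() calls, so an odd count means a migration is in flight and a changed count means one happened during the reads.

static void check_rseq_cpu_id(void)
{
	int snapshot, rseq_cpu, sched_cpu;

	do {
		/* Round down to an even value; an odd seq_cnt means a
		 * migration is in progress and forces a retry. */
		snapshot = atomic_read(&seq_cnt) & ~1;
		smp_rmb();
		rseq_cpu = READ_ONCE(__rseq.cpu_id);
		sched_cpu = sched_getcpu();
		smp_rmb();
	} while (snapshot != atomic_read(&seq_cnt));

	TEST_ASSERT(rseq_cpu == sched_cpu,
		    "rseq cpu_id (%d) != sched_getcpu() (%d)", rseq_cpu, sched_cpu);
}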
Re: [PATCH v3 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Sep 1, 2021, at 4:30 PM, Sean Christopherson sea...@google.com wrote: > Add a test to verify an rseq's CPU ID is updated correctly if the task is > migrated while the kernel is handling KVM_RUN. This is a regression test > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM > without updating rseq, leading to a stale CPU ID and other badness. > > Signed-off-by: Sean Christopherson Thanks! Acked-by: Mathieu Desnoyers > --- > tools/testing/selftests/kvm/.gitignore | 1 + > tools/testing/selftests/kvm/Makefile| 3 + > tools/testing/selftests/kvm/rseq_test.c | 236 > 3 files changed, 240 insertions(+) > create mode 100644 tools/testing/selftests/kvm/rseq_test.c > > diff --git a/tools/testing/selftests/kvm/.gitignore > b/tools/testing/selftests/kvm/.gitignore > index 0709af0144c8..6d031ff6b68e 100644 > --- a/tools/testing/selftests/kvm/.gitignore > +++ b/tools/testing/selftests/kvm/.gitignore > @@ -47,6 +47,7 @@ > /kvm_page_table_test > /memslot_modification_stress_test > /memslot_perf_test > +/rseq_test > /set_memory_region_test > /steal_time > /kvm_binary_stats_test > diff --git a/tools/testing/selftests/kvm/Makefile > b/tools/testing/selftests/kvm/Makefile > index 5832f510a16c..0756e79cb513 100644 > --- a/tools/testing/selftests/kvm/Makefile > +++ b/tools/testing/selftests/kvm/Makefile > @@ -80,6 +80,7 @@ TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus > TEST_GEN_PROGS_x86_64 += kvm_page_table_test > TEST_GEN_PROGS_x86_64 += memslot_modification_stress_test > TEST_GEN_PROGS_x86_64 += memslot_perf_test > +TEST_GEN_PROGS_x86_64 += rseq_test > TEST_GEN_PROGS_x86_64 += set_memory_region_test > TEST_GEN_PROGS_x86_64 += steal_time > TEST_GEN_PROGS_x86_64 += kvm_binary_stats_test > @@ -92,6 +93,7 @@ TEST_GEN_PROGS_aarch64 += dirty_log_test > TEST_GEN_PROGS_aarch64 += dirty_log_perf_test > TEST_GEN_PROGS_aarch64 += kvm_create_max_vcpus > TEST_GEN_PROGS_aarch64 += kvm_page_table_test > +TEST_GEN_PROGS_aarch64 += rseq_test > TEST_GEN_PROGS_aarch64 += set_memory_region_test > TEST_GEN_PROGS_aarch64 += steal_time > TEST_GEN_PROGS_aarch64 += kvm_binary_stats_test > @@ -103,6 +105,7 @@ TEST_GEN_PROGS_s390x += demand_paging_test > TEST_GEN_PROGS_s390x += dirty_log_test > TEST_GEN_PROGS_s390x += kvm_create_max_vcpus > TEST_GEN_PROGS_s390x += kvm_page_table_test > +TEST_GEN_PROGS_s390x += rseq_test > TEST_GEN_PROGS_s390x += set_memory_region_test > TEST_GEN_PROGS_s390x += kvm_binary_stats_test > > diff --git a/tools/testing/selftests/kvm/rseq_test.c > b/tools/testing/selftests/kvm/rseq_test.c > new file mode 100644 > index ..060538bd405a > --- /dev/null > +++ b/tools/testing/selftests/kvm/rseq_test.c > @@ -0,0 +1,236 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +#define _GNU_SOURCE /* for program_invocation_short_name */ > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "kvm_util.h" > +#include "processor.h" > +#include "test_util.h" > + > +#define VCPU_ID 0 > + > +static __thread volatile struct rseq __rseq = { > + .cpu_id = RSEQ_CPU_ID_UNINITIALIZED, > +}; > + > +/* > + * Use an arbitrary, bogus signature for configuring rseq, this test does not > + * actually enter an rseq critical section. 
> + */ > +#define RSEQ_SIG 0xdeadbeef > + > +/* > + * Any bug related to task migration is likely to be timing-dependent; > perform > + * a large number of migrations to reduce the odds of a false negative. > + */ > +#define NR_TASK_MIGRATIONS 10 > + > +static pthread_t migration_thread; > +static cpu_set_t possible_mask; > +static bool done; > + > +static atomic_t seq_cnt; > + > +static void guest_code(void) > +{ > + for (;;) > + GUEST_SYNC(0); > +} > + > +static void sys_rseq(int flags) > +{ > + int r; > + > + r = syscall(__NR_rseq, &__rseq, sizeof(__rseq), flags, RSEQ_SIG); > + TEST_ASSERT(!r, "rseq failed, errno = %d (%s)", errno, strerror(errno)); > +} > + > +static void *migration_worker(void *ign) > +{ > + cpu_set_t allowed_mask; > + int r, i, nr_cpus, cpu; > + > + CPU_ZERO(&allowed_mask); > + > + nr_cpus = CPU_COUNT(&possible_mask); > + > + for (i = 0; i < NR_TASK_MIGRATIONS; i++) { > + cpu = i % nr_cpus; > + if (!CPU_ISSET(cpu, &possible_mask)) > +
Re: [PATCH 2/8] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode
- On Nov 28, 2020, at 11:01 AM, Nicholas Piggin npig...@gmail.com wrote: > And get rid of the generic sync_core_before_usermode facility. This is > functionally a no-op in the core scheduler code, but it also catches This sentence is incomplete. > > This helper is the wrong way around I think. The idea that membarrier > state requires a core sync before returning to user is the easy one > that does not need hiding behind membarrier calls. The gap in core > synchronization due to x86's sysret/sysexit and lazy tlb mode, is the > tricky detail that is better put in x86 lazy tlb code. Ideally yes this complexity should sit within the x86 architecture code if only that architecture requires it. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code
I. It is therefore possible to >* schedule between user->kernel->user threads without passing though >* switch_mm(). Membarrier requires a barrier after storing to > - * rq->curr, before returning to userspace, so provide them here: > + * rq->curr, before returning to userspace, and mmdrop() provides > + * this barrier. >* > - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly > - * provided by mmdrop(), > - * - a sync_core for SYNC_CORE. > + * XXX: I don't think mmdrop() actually does this. There's no > + * smp_mb__before/after_atomic() in there. I recall mmdrop providing a memory barrier. It looks like I event went though the trouble of documenting it myself. ;-) static inline void mmdrop(struct mm_struct *mm) { /* * The implicit full barrier implied by atomic_dec_and_test() is * required by the membarrier system call before returning to * user-space, after storing to rq->curr. */ if (unlikely(atomic_dec_and_test(&mm->mm_count))) __mmdrop(mm); } >*/ > - if (mm) { > - membarrier_mm_sync_core_before_usermode(mm); OK so here is the meat. The current code is using the (possibly incomplete) lazy TLB state known by the scheduler to sync core, and it appears it may be a bit more heavy that what is strictly needed. Your change instead rely on the internal knowledge of lazy TLB within x86 switch_mm_irqs_off to achieve this, which overall seems like an improvement. I agree with Nick's comment that it should go on top of his exit_lazy_mm patches. Thanks, Mathieu > + if (mm) > mmdrop(mm); > - } > + > if (unlikely(prev_state == TASK_DEAD)) { > if (prev->sched_class->task_dead) > prev->sched_class->task_dead(prev); > -- > 2.28.0 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC v2 1/2] [NEEDS HELP] x86/mm: Handle unlazying membarrier core sync in the arch code
- On Dec 4, 2020, at 3:17 AM, Nadav Amit nadav.a...@gmail.com wrote: > I am not very familiar with membarrier, but here are my 2 cents while trying > to answer your questions. > >> On Dec 3, 2020, at 9:26 PM, Andy Lutomirski wrote: >> @@ -496,6 +497,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct >> mm_struct *next, >> * from one thread in a process to another thread in the same >> * process. No TLB flush required. >> */ >> + >> +// XXX: why is this okay wrt membarrier? >> if (!was_lazy) >> return; > > I am confused. > > On one hand, it seems that membarrier_private_expedited() would issue an IPI > to that core, as it would find that this core’s cpu_rq(cpu)->curr->mm is the > same as the one that the membarrier applies to. If the scheduler switches from one thread to another which both have the same mm, it means cpu_rq(cpu)->curr->mm is invariant, even though ->curr changes. So there is no need to issue a memory barrier or sync core for membarrier in this case, because there is no way the IPI can be missed. > But… (see below) > > >> @@ -508,12 +511,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct >> mm_struct *next, >> smp_mb(); >> next_tlb_gen = atomic64_read(&next->context.tlb_gen); >> if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == >> -next_tlb_gen) >> +next_tlb_gen) { >> +/* >> + * We're reactivating an mm, and membarrier might >> + * need to serialize. Tell membarrier. >> + */ >> + >> +// XXX: I can't understand the logic in >> +// membarrier_mm_sync_core_before_usermode(). What's >> +// the mm check for? >> +membarrier_mm_sync_core_before_usermode(next); > > On the other hand the reason for this mm check that you mention contradicts > my previous understanding as the git log says: > > commit 2840cf02fae627860156737e83326df354ee4ec6 > Author: Mathieu Desnoyers > Date: Thu Sep 19 13:37:01 2019 -0400 > >sched/membarrier: Call sync_core only before usermode for same mm > >When the prev and next task's mm change, switch_mm() provides the core >serializing guarantees before returning to usermode. The only case >where an explicit core serialization is needed is when the scheduler >keeps the same mm for prev and next. Hrm, so your point here is that if the scheduler keeps the same mm for prev and next, it means membarrier will have observed the same rq->curr->mm, and therefore the IPI won't be missed. I wonder if that membarrier_mm_sync_core_before_usermode is needed at all then or if we have just been too careful and did not consider that all the scenarios which need to be core-sync'd are indeed taken care of ? I see here that my prior commit message indeed discusses prev and next task's mm, but in reality, we are comparing current->mm with rq->prev_mm. So from a lazy TLB perspective, this probably matters, and we may still need a core sync in some lazy TLB scenarios. > >> /* >> * When switching through a kernel thread, the loop in >> * membarrier_{private,global}_expedited() may have observed that >> * kernel thread and not issued an IPI. It is therefore possible to >> * schedule between user->kernel->user threads without passing though >> * switch_mm(). Membarrier requires a barrier after storing to >> - * rq->curr, before returning to userspace, so provide them here: >> + * rq->curr, before returning to userspace, and mmdrop() provides >> + * this barrier. >> * >> - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly >> - * provided by mmdrop(), >> - * - a sync_core for SYNC_CORE. >> + * XXX: I don't think mmdrop() actually does this. 
There's no >> + * smp_mb__before/after_atomic() in there. > > I presume that since x86 is the only one that needs > membarrier_mm_sync_core_before_usermode(), nobody noticed the missing > smp_mb__before/after_atomic(). These are anyhow a compiler barrier in x86, > and such a barrier would take place before the return to userspace. mmdrop already provides the memory barriers for membarrer, as I documented within the function. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
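For context, here is a simplified (and possibly out-of-date) sketch of how membarrier_private_expedited() selects its IPI targets; tmpmask allocation and error handling are omitted. The point made above is that the p->mm == mm test below cannot miss a CPU whose rq->curr changed between two threads sharing the same mm, since the mm it inspects is invariant across such a switch.

	struct mm_struct *mm = current->mm;
	int cpu;

	cpus_read_lock();
	rcu_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		p = rcu_dereference(cpu_rq(cpu)->curr);
		/* Only CPUs currently running a thread of this mm get an IPI. */
		if (p && p->mm == mm)
			__cpumask_set_cpu(cpu, tmpmask);
	}
	rcu_read_unlock();

	/* ipi_func is ipi_mb, or ipi_sync_core for SYNC_CORE requests. */
	on_each_cpu_mask(tmpmask, ipi_func, NULL, true);
	cpus_read_unlock();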
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
- On Dec 27, 2020, at 1:28 PM, Andy Lutomirski l...@kernel.org wrote: > The old sync_core_before_usermode() comments said that a non-icache-syncing > return-to-usermode instruction is x86-specific and that all other > architectures automatically notice cross-modified code on return to > userspace. Based on my general understanding of how CPUs work and based on > my atttempt to read the ARM manual, this is not true at all. In fact, x86 > seems to be a bit of an anomaly in the other direction: x86's IRET is > unusually heavyweight for a return-to-usermode instruction. > > So let's drop any pretense that we can have a generic way implementation > behind membarrier's SYNC_CORE flush and require all architectures that opt > in to supply their own. Removing the generic implementation is OK with me, as this will really require architecture maintainers to think hard about it when porting this feature. > This means x86, arm64, and powerpc for now. Let's > also rename the function from sync_core_before_usermode() to > membarrier_sync_core_before_usermode() because the precise flushing details > may very well be specific to membarrier, and even the concept of > "sync_core" in the kernel is mostly an x86-ism. Work for me too. > > I admit that I'm rather surprised that the code worked at all on arm64, > and I'm suspicious that it has never been very well tested. My apologies > for not reviewing this more carefully in the first place. Please refer to Documentation/features/sched/membarrier-sync-core/arch-support.txt It clearly states that only arm, arm64, powerpc and x86 support the membarrier sync core feature as of now: # Architecture requirements # # * arm/arm64/powerpc # # Rely on implicit context synchronization as a result of exception return # when returning from IPI handler, and when returning to user-space. # # * x86 # # x86-32 uses IRET as return from interrupt, which takes care of the IPI. # However, it uses both IRET and SYSEXIT to go back to user-space. The IRET # instruction is core serializing, but not SYSEXIT. # # x86-64 uses IRET as return from interrupt, which takes care of the IPI. # However, it can return to user-space through either SYSRETL (compat code), # SYSRETQ, or IRET. # # Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely # instead on write_cr3() performed by switch_mm() to provide core serialization # after changing the current mm, and deal with the special case of kthread -> # uthread (temporarily keeping current mm into active_mm) by issuing a # sync_core_before_usermode() in that specific case. This is based on direct feedback from the architecture maintainers. You seem to have noticed odd cases on arm64 where this guarantee does not match reality. Where exactly can we find this in the code, and which part of the architecture manual can you point us to which supports your concern ? Based on the notes I have, use of `eret` on aarch64 guarantees a context synchronizing instruction when returning to user-space. 
Thanks, Mathieu > > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: linuxppc-dev@lists.ozlabs.org > Cc: Nicholas Piggin > Cc: Catalin Marinas > Cc: Will Deacon > Cc: linux-arm-ker...@lists.infradead.org > Cc: Mathieu Desnoyers > Cc: x...@kernel.org > Cc: sta...@vger.kernel.org > Fixes: 70216e18e519 ("membarrier: Provide core serializing command, > *_SYNC_CORE") > Signed-off-by: Andy Lutomirski > --- > > Hi arm64 and powerpc people- > > This is part of a series here: > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/fixes > > Before I send out the whole series, I'm hoping that some arm64 and powerpc > people can help me verify that I did this patch right. Once I get > some feedback on this patch, I'll send out the whole pile. And once > *that's* done, I'll start giving the mm lazy stuff some serious thought. > > The x86 part is already fixed in Linus' tree. > > Thanks, > Andy > > arch/arm64/include/asm/sync_core.h | 21 + > arch/powerpc/include/asm/sync_core.h | 20 > arch/x86/Kconfig | 1 - > arch/x86/include/asm/sync_core.h | 7 +++ > include/linux/sched/mm.h | 1 - > include/linux/sync_core.h| 21 - > init/Kconfig | 3 --- > kernel/sched/membarrier.c| 15 +++ > 8 files changed, 55 insertions(+), 34 deletions(-) > create mode 100644 arch/arm64/include/asm/sync_core.h > create mode 100644 arch/powerpc/include/asm/sync_core.h > delete mode 100644 include/linux/sync_core.h > > diff --git a/arch/arm64/include/asm/sync_core.h >
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
er does not do anything wrt cache coherency. We should perhaps explicitly point it out though. >> >> So, either Andy has a misunderstanding, or the man page is wrong, or >> my rudimentary understanding of what membarrier is supposed to be >> doing is wrong... > > Look at the latest man page: > > https://man7.org/linux/man-pages/man2/membarrier.2.html > > for MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE. The result may not be > all that enlightening. As described in the membarrier(2) man page and in include/uapi/linux/membarrier.h, membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE guarantees core serializing instructions in addition to the memory ordering guaranteed by MEMBARRIER_CMD_PRIVATE_EXPEDITED. It does not guarantee anything wrt i/d cache coherency. I initially considered adding such guarantees when we introduced membarrier sync-core, but decided against it because it would basically replicate what architectures already expose to user-space, e.g. flush_icache_user_range on arm32. So between code modification and allowing other threads to jump to that code, it should be expected that architectures without coherent i/d cache will need to flush caches to ensure coherency *and* to issue membarrier to make sure core serializing instructions will be issued by every thread acting on the same mm either immediately by means of the IPI, or before they return to user-space if they do not happen to be currently running when the membarrier system call is invoked. Hoping this clarifies things. I suspect we will need to clarify documentation about what membarrier *does not* guarantee, given that you mistakenly expected membarrier to take care of cache flushing. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
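As a concrete, hedged illustration of that division of responsibility, a user-space JIT on an architecture without coherent i/d caches would do something along these lines. The helper names are made up for the example, the cache-maintenance step could equally be an arch-specific syscall such as cacheflush(2) on arm32, and it can be skipped entirely on x86.

#include <linux/membarrier.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* One-time setup, before the first use of the SYNC_CORE command. */
static void jit_init(void)
{
	membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}

/* Publish new code at @dst so any thread of the process may execute it. */
static void jit_publish(void *dst, const void *src, size_t len)
{
	/* 1) Code modification. */
	memcpy(dst, src, len);

	/* 2) Arch-specific cache maintenance (i/d coherency), if required. */
	__builtin___clear_cache((char *)dst, (char *)dst + len);

	/* 3) Core serialization on all threads acting on the same mm. */
	membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
}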
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
hread siblings have executed a core serializing instruc- > tion. This guarantee is provided only for threads in the same > process as the calling thread. > > The "expedited" commands complete faster than the non-expedited > ones, they never block, but have the downside of causing extra > overhead. > > A process must register its intent to use the private expedited > sync core command prior to using it. > > This just says that the siblings have executed a serialising > instruction, in other words a barrier. It makes no claims concerning > cache coherency - and without some form of cache maintenance, there > can be no expectation that the I and D streams to be coherent with > each other. Right, membarrier is not doing anything wrt I/D caches. On architectures without coherent caches, users should use other system calls or instructions provided by the architecture to synchronize the appropriate address ranges. > This description is also weird in another respect. "guarantee that > all its running thread siblings have executed a core serializing > instruction" ... "The expedited commands ... never block". > > So, the core executing this call is not allowed to block, but the > other part indicates that the other CPUs _have_ executed a serialising > instruction before this call returns... one wonders how that happens > without blocking. Maybe the CPU spins waiting for completion instead? Membarrier expedited sync-core issues IPIs to all CPUs running sibling threads. AFAIR the IPI mechanism uses the "csd lock" which is basically busy waiting. So it does not "block", it busy-waits. For completeness of the explanation, other (non-running) threads acting on the same mm will eventually issue the context synchronizing instruction before returning to user-space whenever they are scheduled back. Thanks, Mathieu > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last! -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
- On Dec 27, 2020, at 4:36 PM, Andy Lutomirski l...@kernel.org wrote: [...] >> You seem to have noticed odd cases on arm64 where this guarantee does not >> match reality. Where exactly can we find this in the code, and which part >> of the architecture manual can you point us to which supports your concern ? >> >> Based on the notes I have, use of `eret` on aarch64 guarantees a context >> synchronizing >> instruction when returning to user-space. > > Based on my reading of the manual, ERET on ARM doesn't synchronize > anything at all. I can't find any evidence that it synchronizes data > or instructions, and I've seen reports that the CPU will happily > speculate right past it. Reading [1], there appear to be three kinds of context synchronization events: - Taking an exception, - Returning from an exception, - ISB. This other source [2] adds (search for Context synchronization operation): - Exit from Debug state - Executing a DCPS instruction - Executing a DRPS instruction "ERET" falls into the second kind of event, and AFAIU should be context synchronizing. That was confirmed to me by Will Deacon when membarrier sync-core was implemented for aarch64. If the architecture reference manuals are wrong, is there an erratum ? As for the algorithm to use on ARMv8 to update instructions, see [2] B2.3.4 Implication of caches for the application programmer "Synchronization and coherency issues between data and instruction accesses" Membarrier only takes care of making sure the "ISB" part of the algorithm can be done easily and efficiently on multiprocessor systems. Thanks, Mathieu [1] https://developer.arm.com/documentation/den0024/a/Memory-Ordering/Barriers/ISB-in-more-detail [2] https://montcs.bloomu.edu/Information/ARMv8/ARMv8-A_Architecture_Reference_Manual_(Issue_A.a).pdf -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [RFC please help] membarrier: Rewrite sync_core_before_usermode()
- On Dec 28, 2020, at 4:06 PM, Andy Lutomirski l...@kernel.org wrote: > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers > wrote: >> >> - On Dec 28, 2020, at 2:44 PM, Andy Lutomirski l...@kernel.org wrote: >> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin >> > wrote: >> >> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: >> >> > After chatting with rmk about this (but without claiming that any of >> >> > this is his opinion), based on the manpage, I think membarrier() >> >> > currently doesn't really claim to be synchronizing caches? It just >> >> > serializes cores. So arguably if userspace wants to use membarrier() >> >> > to synchronize code changes, userspace should first do the code >> >> > change, then flush icache as appropriate for the architecture, and >> >> > then do the membarrier() to ensure that the old code is unused? >> >> ^ exactly, yes. >> >> >> > >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() >> >> > syscall. That might cause you to end up with two IPIs instead of one >> >> > in total, but we probably don't care _that_ much about extra IPIs on >> >> > 32-bit arm? >> >> This was the original thinking, yes. The cacheflush IPI will flush specific >> regions of code, and the membarrier IPI issues context synchronizing >> instructions. >> >> Architectures with coherent i/d caches don't need the cacheflush step. > > There are different levels of coherency -- VIVT architectures may have > differing requirements compared to PIPT, etc. > > In any case, I feel like the approach taken by the documentation is > fundamentally confusing. Architectures don't all speak the same > language Agreed. > How about something like: I dislike the wording "barrier" and the association between "write" and "instruction fetch" done in the descriptions below. It leads to think that this behaves like a memory barrier, when in fact my understanding of a context synchronizing instruction is that it simply flushes internal CPU state, which would cause coherency issues if the CPU observes both the old and then the new code without having this state flushed. [ Sorry if I take more time to reply and if my replies are a bit more concise than usual. I'm currently on parental leave, so I have non-maskable interrupts to attend to. ;-) ] Thanks, Mathieu > > The SYNC_CORE operation causes all threads in the caller's address > space (including the caller) to execute an architecture-defined > barrier operation. membarrier() will ensure that this barrier is > executed at a time such that all data writes done by the calling > thread before membarrier() are made visible by the barrier. > Additional architecture-dependent cache management operations may be > required to use this for JIT code. > > x86: SYNC_CORE executes a barrier that will cause subsequent > instruction fetches to observe prior writes. Currently this will be a > "serializing" instruction, but, if future improved CPU documentation > becomes available and relaxes this requirement, the barrier may > change. The kernel guarantees that writing new or modified > instructions to normal memory (and issuing SFENCE if the writes were > non-temporal) then doing a membarrier SYNC_CORE operation is > sufficient to cause all threads in the caller's address space to > execute the new or modified instructions. This is true regardless of > whether or not those instructions are written at the same virtual > address from which they are subsequently executed. 
No additional > cache management is required on x86. > > arm: Something about the cache management syscalls. > > arm64: Ditto > > powerpc: I have no idea. -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski l...@kernel.org wrote: > The old sync_core_before_usermode() comments suggested that a > non-icache-syncing > return-to-usermode instruction is x86-specific and that all other > architectures automatically notice cross-modified code on return to > userspace. > > This is misleading. The incantation needed to modify code from one > CPU and execute it on another CPU is highly architecture dependent. > On x86, according to the SDM, one must modify the code, issue SFENCE > if the modification was WC or nontemporal, and then issue a "serializing > instruction" on the CPU that will execute the code. membarrier() can do > the latter. > > On arm64 and powerpc, one must flush the icache and then flush the pipeline > on the target CPU, although the CPU manuals don't necessarily use this > language. > > So let's drop any pretense that we can have a generic way to define or > implement membarrier's SYNC_CORE operation and instead require all > architectures to define the helper and supply their own documentation as to > how to use it. Agreed. Documentation of the sequence of operations that need to be performed when cross-modifying code on SMP should be per-architecture. The documentation of the architectural effects of membarrier sync-core should be per-arch as well. > This means x86, arm64, and powerpc for now. And also arm32, as discussed in the other leg of the patchset's email thread. > Let's also > rename the function from sync_core_before_usermode() to > membarrier_sync_core_before_usermode() because the precise flushing details > may very well be specific to membarrier, and even the concept of > "sync_core" in the kernel is mostly an x86-ism. OK > [...] > > static void ipi_rseq(void *info) > { > @@ -368,12 +373,14 @@ static int membarrier_private_expedited(int flags, int > cpu_id) > smp_call_func_t ipi_func = ipi_mb; > > if (flags == MEMBARRIER_FLAG_SYNC_CORE) { > - if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) > +#ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE > return -EINVAL; > +#else > if (!(atomic_read(&mm->membarrier_state) & > MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY)) > return -EPERM; > ipi_func = ipi_sync_core; > +#endif Please change back this #ifndef / #else / #endif within function for if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) { ... } else { ... } I don't think mixing up preprocessor and code logic makes it more readable. Thanks, Mathieu > } else if (flags == MEMBARRIER_FLAG_RSEQ) { > if (!IS_ENABLED(CONFIG_RSEQ)) > return -EINVAL; > -- > 2.31.1 -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
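Concretely, the requested form simply restores the pre-patch shape of that branch; a sketch, with the context lines taken from the diff above:

	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
		if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE))
			return -EINVAL;
		if (!(atomic_read(&mm->membarrier_state) &
		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
			return -EPERM;
		ipi_func = ipi_sync_core;
	}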
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 15, 2021, at 11:21 PM, Andy Lutomirski l...@kernel.org wrote: [...] > +# An architecture that wants to support > +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what > it > +# is supposed to do and implement membarrier_sync_core_before_usermode() to > +# make it do that. Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via > +# Kconfig.Unfortunately, MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is not a > +# fantastic API and may not make sense on all architectures. Once an > +# architecture meets these requirements, Can we please remove the editorial comment about the quality of the membarrier sync-core's API ? At least it's better than having all userspace rely on mprotect() undocumented side-effects to perform something which typically works, until it won't, or until this prevents mprotect's implementation to be improved because it will start breaking JITs all over the place. We can simply state that the definition of what membarrier sync-core does is defined per-architecture, and document the sequence of operations to perform when doing cross-modifying code specifically for each architecture. Now if there are weird architectures where membarrier is an odd fit (I've heard that riscv might need address ranges to which the core sync needs to apply), then those might need to implement their own arch-specific system call, which is all fine. > +# > +# On x86, a program can safely modify code, issue > +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via > +# the modified address or an alias, from any thread in the calling process. > +# > +# On arm64, a program can modify code, flush the icache as needed, and issue > +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context > synchronizing > +# event", aka pipeline flush on all CPUs that might run the calling process. > +# Then the program can execute the modified code as long as it is executed > +# from an address consistent with the icache flush and the CPU's cache type. > +# > +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE > +# similarly to arm64. It would be nice if the powerpc maintainers could > +# add a more clear explanantion. We should document the requirements on ARMv7 as well. Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote: > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote: > >> Please change back this #ifndef / #else / #endif within function for >> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) { >> ... >> } else { >> ... >> } >> >> I don't think mixing up preprocessor and code logic makes it more readable. > > I agree, but I don't know how to make the result work well. > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED > case, so either I need to fake up a definition or use #ifdef. > > If I faked up a definition, I would want to assert, at build time, that > it isn't called. I don't think we can do: > > static void membarrier_sync_core_before_usermode() > { >BUILD_BUG_IF_REACHABLE(); > } Let's look at the context here: static void ipi_sync_core(void *info) { [] membarrier_sync_core_before_usermode() } ^ this can be within #ifdef / #endif static int membarrier_private_expedited(int flags, int cpu_id) [...] if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) return -EINVAL; if (!(atomic_read(&mm->membarrier_state) & MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY)) return -EPERM; ipi_func = ipi_sync_core; All we need to make the line above work is to define an empty ipi_sync_core function in the #else case after the ipi_sync_core() function definition. Or am I missing your point ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
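A minimal sketch of that suggestion, assuming the only goal is to keep the IS_ENABLED() form compiling (the stub comment wording is mine; whether an empty stub is acceptable is exactly what is debated in the follow-up below):

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static void ipi_sync_core(void *info)
{
	/* [...] existing body, elided as above. */
	membarrier_sync_core_before_usermode();
}
#else
static void ipi_sync_core(void *info)
{
	/*
	 * Never used: membarrier_private_expedited() returns -EINVAL via
	 * the IS_ENABLED() check before ipi_func can be set to this.
	 */
}
#endif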
Re: [PATCH for 4.16 v7 02/11] powerpc: membarrier: Skip memory barrier in switch_mm()
- On Jun 18, 2021, at 1:13 PM, Christophe Leroy christophe.le...@csgroup.eu wrote: [...] > > I don't understand all that complexity to just replace a simple > 'smp_mb__after_unlock_lock()'. > > #define smp_mb__after_unlock_lock() smp_mb() > #define smp_mb() barrier() > # define barrier() __asm__ __volatile__("": : :"memory") > > > Am I missing some subtility ? On powerpc CONFIG_SMP, smp_mb() is actually defined as: #define smp_mb() __smp_mb() #define __smp_mb() mb() #define mb() __asm__ __volatile__ ("sync" : : : "memory") So the original motivation here was to skip a "sync" instruction whenever switching between threads which are part of the same process. But based on recent discussions, I suspect my implementation may be doing so inaccurately. Thanks, Mathieu > > Thanks > Christophe -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 8/8] membarrier: Rewrite sync_core_before_usermode() and improve documentation
- On Jun 18, 2021, at 3:58 PM, Andy Lutomirski l...@kernel.org wrote: > On Fri, Jun 18, 2021, at 9:31 AM, Mathieu Desnoyers wrote: >> - On Jun 17, 2021, at 8:12 PM, Andy Lutomirski l...@kernel.org wrote: >> >> > On 6/17/21 7:47 AM, Mathieu Desnoyers wrote: >> > >> >> Please change back this #ifndef / #else / #endif within function for >> >> >> >> if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) { >> >> ... >> >> } else { >> >> ... >> >> } >> >> >> >> I don't think mixing up preprocessor and code logic makes it more >> >> readable. >> > >> > I agree, but I don't know how to make the result work well. >> > membarrier_sync_core_before_usermode() isn't defined in the !IS_ENABLED >> > case, so either I need to fake up a definition or use #ifdef. >> > >> > If I faked up a definition, I would want to assert, at build time, that >> > it isn't called. I don't think we can do: >> > >> > static void membarrier_sync_core_before_usermode() >> > { >> >BUILD_BUG_IF_REACHABLE(); >> > } >> >> Let's look at the context here: >> >> static void ipi_sync_core(void *info) >> { >> [] >> membarrier_sync_core_before_usermode() >> } >> >> ^ this can be within #ifdef / #endif >> >> static int membarrier_private_expedited(int flags, int cpu_id) >> [...] >>if (!IS_ENABLED(CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE)) >> return -EINVAL; >> if (!(atomic_read(&mm->membarrier_state) & >> MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY)) >> return -EPERM; >> ipi_func = ipi_sync_core; >> >> All we need to make the line above work is to define an empty ipi_sync_core >> function in the #else case after the ipi_sync_core() function definition. >> >> Or am I missing your point ? > > Maybe? > > My objection is that an empty ipi_sync_core is a lie — it doesn’t sync the > core. > I would be fine with that if I could have the compiler statically verify that > it’s not called, but I’m uncomfortable having it there if the implementation > is > actively incorrect. I see. Another approach would be to implement a "setter" function to populate "ipi_func". That setter function would return -EINVAL in its #ifndef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE implementation. Would that be better ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
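A rough sketch of that setter idea, under the assumption that the function name and exact placement are free to vary:

#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE
static int membarrier_get_ipi_sync_core(smp_call_func_t *ipi_func)
{
	*ipi_func = ipi_sync_core;
	return 0;
}
#else
static int membarrier_get_ipi_sync_core(smp_call_func_t *ipi_func)
{
	return -EINVAL;		/* Arch does not implement SYNC_CORE. */
}
#endif

	/* ...and in membarrier_private_expedited(): */
	if (flags == MEMBARRIER_FLAG_SYNC_CORE) {
		int ret = membarrier_get_ipi_sync_core(&ipi_func);

		if (ret)
			return ret;
		if (!(atomic_read(&mm->membarrier_state) &
		      MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY))
			return -EPERM;
	}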
Re: [PATCH 2/5] entry: rseq: Call rseq_handle_notify_resume() in tracehook_notify_resume()
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote: > Invoke rseq_handle_notify_resume() from tracehook_notify_resume() now > that the two function are always called back-to-back by architectures > that have rseq. The rseq helper is stubbed out for architectures that > don't support rseq, i.e. this is a nop across the board. > > Note, tracehook_notify_resume() is horribly named and arguably does not > belong in tracehook.h as literally every line of code in it has nothing > to do with tracing. But, that's been true since commit a42c6ded827d > ("move key_repace_session_keyring() into tracehook_notify_resume()") > first usurped tracehook_notify_resume() back in 2012. Punt cleaning that > mess up to future patches. > > No functional change intended. This will make it harder to introduce new code paths which consume the NOTIFY_RESUME without calling the rseq callback, which introduces issues. Agreed. Acked-by: Mathieu Desnoyers > > Signed-off-by: Sean Christopherson > --- > arch/arm/kernel/signal.c | 1 - > arch/arm64/kernel/signal.c | 1 - > arch/csky/kernel/signal.c| 4 +--- > arch/mips/kernel/signal.c| 4 +--- > arch/powerpc/kernel/signal.c | 4 +--- > arch/s390/kernel/signal.c| 1 - > include/linux/tracehook.h| 2 ++ > kernel/entry/common.c| 4 +--- > kernel/entry/kvm.c | 4 +--- > 9 files changed, 7 insertions(+), 18 deletions(-) > > diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c > index a3a38d0a4c85..9df68d139965 100644 > --- a/arch/arm/kernel/signal.c > +++ b/arch/arm/kernel/signal.c > @@ -670,7 +670,6 @@ do_work_pending(struct pt_regs *regs, unsigned int > thread_flags, int syscall) > uprobe_notify_resume(regs); > } else { > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > } > } > local_irq_disable(); > diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c > index 23036334f4dc..22b55db13da6 100644 > --- a/arch/arm64/kernel/signal.c > +++ b/arch/arm64/kernel/signal.c > @@ -951,7 +951,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, > > if (thread_flags & _TIF_NOTIFY_RESUME) { > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > > /* >* If we reschedule after checking the affinity > diff --git a/arch/csky/kernel/signal.c b/arch/csky/kernel/signal.c > index 312f046d452d..bc4238b9f709 100644 > --- a/arch/csky/kernel/signal.c > +++ b/arch/csky/kernel/signal.c > @@ -260,8 +260,6 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, > if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) > do_signal(regs); > > - if (thread_info_flags & _TIF_NOTIFY_RESUME) { > + if (thread_info_flags & _TIF_NOTIFY_RESUME) > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > - } > } > diff --git a/arch/mips/kernel/signal.c b/arch/mips/kernel/signal.c > index f1e985109da0..c9b2a75563e1 100644 > --- a/arch/mips/kernel/signal.c > +++ b/arch/mips/kernel/signal.c > @@ -906,10 +906,8 @@ asmlinkage void do_notify_resume(struct pt_regs *regs, > void > *unused, > if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) > do_signal(regs); > > - if (thread_info_flags & _TIF_NOTIFY_RESUME) { > + if (thread_info_flags & _TIF_NOTIFY_RESUME) > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > - } > > user_enter(); > } > diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c > index e600764a926c..b93b87df499d 100644 > --- a/arch/powerpc/kernel/signal.c > +++ b/arch/powerpc/kernel/signal.c > @@ -293,10 +293,8 @@ void 
do_notify_resume(struct pt_regs *regs, unsigned long > thread_info_flags) > do_signal(current); > } > > - if (thread_info_flags & _TIF_NOTIFY_RESUME) { > + if (thread_info_flags & _TIF_NOTIFY_RESUME) > tracehook_notify_resume(regs); > - rseq_handle_notify_resume(NULL, regs); > - } > } > > static unsigned long get_tm_stackpointer(struct task_struct *tsk) > diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c > index 78ef53b29958..b307db26bf2d 100644 > --- a/arch/s390/kernel/signal.c > +++ b/arch/s390/kernel/sign
Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote: > Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to > transferring to a KVM guest, which is roughly equivalent to an exit to > userspace and processes many of the same pending actions. While the task > cannot be in an rseq critical section as the KVM path is reachable only > via ioctl(KVM_RUN), the side effects that apply to rseq outside of a > critical section still apply, e.g. the CPU ID needs to be updated if the > task is migrated. > > Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults > and other badness in userspace VMMs that use rseq in combination with KVM, > e.g. due to the CPU ID being stale after task migration. I agree with the problem assessment, but I would recommend a small change to this fix. > > Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") > Reported-by: Peter Foley > Bisected-by: Doug Evans > Cc: Shakeel Butt > Cc: Thomas Gleixner > Cc: sta...@vger.kernel.org > Signed-off-by: Sean Christopherson > --- > kernel/entry/kvm.c | 4 +++- > kernel/rseq.c | 4 ++-- > 2 files changed, 5 insertions(+), 3 deletions(-) > > diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c > index 49972ee99aff..049fd06b4c3d 100644 > --- a/kernel/entry/kvm.c > +++ b/kernel/entry/kvm.c > @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, > unsigned long ti_work) > if (ti_work & _TIF_NEED_RESCHED) > schedule(); > > - if (ti_work & _TIF_NOTIFY_RESUME) > + if (ti_work & _TIF_NOTIFY_RESUME) { > tracehook_notify_resume(NULL); > + rseq_handle_notify_resume(NULL, NULL); > + } > > ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); > if (ret) > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 35f7bd0fced0..58c79a7918cd 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -236,7 +236,7 @@ static bool in_rseq_cs(unsigned long ip, struct rseq_cs > *rseq_cs) > > static int rseq_ip_fixup(struct pt_regs *regs) > { > - unsigned long ip = instruction_pointer(regs); > + unsigned long ip = regs ? instruction_pointer(regs) : 0; > struct task_struct *t = current; > struct rseq_cs rseq_cs; > int ret; > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs) >* If not nested over a rseq critical section, restart is useless. >* Clear the rseq_cs pointer and return. >*/ > - if (!in_rseq_cs(ip, &rseq_cs)) > + if (!regs || !in_rseq_cs(ip, &rseq_cs)) I think clearing the thread's rseq_cs unconditionally here when regs is NULL is not the behavior we want when this is called from xfer_to_guest_mode_work. If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to kill this application in the rseq_syscall handler when exiting back to usermode when the ioctl eventually returns. However, clearing the thread's rseq_cs will prevent this from happening. So I would favor an approach where we simply do: if (!regs) return 0; Immediately at the beginning of rseq_ip_fixup, before getting the instruction pointer, so effectively skip all side-effects of the ip fixup code. Indeed, it is not relevant to do any fixup here, because it is nested in a ioctl system call. Effectively, this would preserve the SIGSEGV behavior when this ioctl is erroneously called by user-space from a rseq critical section. Thanks for looking into this ! 
Mathieu > return clear_rseq_cs(t); > ret = rseq_need_restart(t, rseq_cs.flags); > if (ret <= 0) > -- > 2.33.0.rc1.237.g0d66db33f3-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
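Spelled out, the early return suggested above would sit at the very top of rseq_ip_fixup(), roughly as in this sketch; note that the v2 patch later in the thread ends up doing the equivalent check one level up, in __rseq_handle_notify_resume().

static int rseq_ip_fixup(struct pt_regs *regs)
{
	unsigned long ip;
	struct task_struct *t = current;
	struct rseq_cs rseq_cs;
	int ret;

	/*
	 * Invoked from a syscall path (e.g. ioctl(KVM_RUN)): there are no
	 * user registers to fix up.  Leave t->rseq_cs untouched so that
	 * CONFIG_DEBUG_RSEQ can still kill a caller that issued the ioctl
	 * from within a rseq critical section.
	 */
	if (!regs)
		return 0;

	ip = instruction_pointer(regs);

	/* ... rest of the fixup logic unchanged ... */
}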
Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com wrote: > Add a test to verify an rseq's CPU ID is updated correctly if the task is > migrated while the kernel is handling KVM_RUN. This is a regression test > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM > without updating rseq, leading to a stale CPU ID and other badness. > > Signed-off-by: Sean Christopherson > --- [...] > + > +static void *migration_worker(void *ign) > +{ > + cpu_set_t allowed_mask; > + int r, i, nr_cpus, cpu; > + > + CPU_ZERO(&allowed_mask); > + > + nr_cpus = CPU_COUNT(&possible_mask); > + > + for (i = 0; i < 2; i++) { > + cpu = i % nr_cpus; > + if (!CPU_ISSET(cpu, &possible_mask)) > + continue; > + > + CPU_SET(cpu, &allowed_mask); > + > + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask); > + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)", > errno, > + strerror(errno)); > + > + CPU_CLR(cpu, &allowed_mask); > + > + usleep(10); > + } > + done = true; > + return NULL; > +} > + > +int main(int argc, char *argv[]) > +{ > + struct kvm_vm *vm; > + u32 cpu, rseq_cpu; > + int r; > + > + /* Tell stdout not to buffer its content */ > + setbuf(stdout, NULL); > + > + r = sched_getaffinity(0, sizeof(possible_mask), &possible_mask); > + TEST_ASSERT(!r, "sched_getaffinity failed, errno = %d (%s)", errno, > + strerror(errno)); > + > + if (CPU_COUNT(&possible_mask) < 2) { > + print_skip("Only one CPU, task migration not possible\n"); > + exit(KSFT_SKIP); > + } > + > + sys_rseq(0); > + > + /* > + * Create and run a dummy VM that immediately exits to userspace via > + * GUEST_SYNC, while concurrently migrating the process by setting its > + * CPU affinity. > + */ > + vm = vm_create_default(VCPU_ID, 0, guest_code); > + > + pthread_create(&migration_thread, NULL, migration_worker, 0); > + > + while (!done) { > + vcpu_run(vm, VCPU_ID); > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, > + "Guest failed?"); > + > + cpu = sched_getcpu(); > + rseq_cpu = READ_ONCE(__rseq.cpu_id); > + > + /* > + * Verify rseq's CPU matches sched's CPU, and that sched's CPU > + * is stable. This doesn't handle the case where the task is > + * migrated between sched_getcpu() and reading rseq, and again > + * between reading rseq and sched_getcpu(), but in practice no > + * false positives have been observed, while on the other hand > + * blocking migration while this thread reads CPUs messes with > + * the timing and prevents hitting failures on a buggy kernel. > + */ I think you could get a stable cpu id between sched_getcpu and __rseq_abi.cpu_id if you add a pthread mutex to protect: sched_getcpu and __rseq_abi.cpu_id reads vs sched_setaffinity calls within the migration thread. Thoughts ? Thanks, Mathieu > + TEST_ASSERT(rseq_cpu == cpu || cpu != sched_getcpu(), > + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); > + } > + > + pthread_join(migration_thread, NULL); > + > + kvm_vm_free(vm); > + > + sys_rseq(RSEQ_FLAG_UNREGISTER); > + > + return 0; > +} > -- > 2.33.0.rc1.237.g0d66db33f3-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
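For reference, the mutex idea would look roughly like this sketch, using the selftest's existing variables; as the follow-up below explains, serializing the two sides this way also serializes the timing enough to hide the bug, which is why it was not pursued.

static pthread_mutex_t affinity_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Migration thread: exclude CPU-id readers while changing affinity. */
	pthread_mutex_lock(&affinity_lock);
	r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
	pthread_mutex_unlock(&affinity_lock);

	/* vCPU thread: snapshot both CPU ids without a racing migration. */
	pthread_mutex_lock(&affinity_lock);
	cpu = sched_getcpu();
	rseq_cpu = READ_ONCE(__rseq.cpu_id);
	pthread_mutex_unlock(&affinity_lock);

	TEST_ASSERT(rseq_cpu == cpu,
		    "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu);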
Re: [PATCH 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
- On Aug 19, 2021, at 7:33 PM, Sean Christopherson sea...@google.com wrote: > On Thu, Aug 19, 2021, Mathieu Desnoyers wrote: >> - On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com >> wrote: >> >> > Add a test to verify an rseq's CPU ID is updated correctly if the task is >> > migrated while the kernel is handling KVM_RUN. This is a regression test >> > for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer >> > to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM >> > without updating rseq, leading to a stale CPU ID and other badness. >> > >> > Signed-off-by: Sean Christopherson >> > --- >> >> [...] >> >> > + while (!done) { >> > + vcpu_run(vm, VCPU_ID); >> > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, >> > + "Guest failed?"); >> > + >> > + cpu = sched_getcpu(); >> > + rseq_cpu = READ_ONCE(__rseq.cpu_id); >> > + >> > + /* >> > + * Verify rseq's CPU matches sched's CPU, and that sched's CPU >> > + * is stable. This doesn't handle the case where the task is >> > + * migrated between sched_getcpu() and reading rseq, and again >> > + * between reading rseq and sched_getcpu(), but in practice no >> > + * false positives have been observed, while on the other hand >> > + * blocking migration while this thread reads CPUs messes with >> > + * the timing and prevents hitting failures on a buggy kernel. >> > + */ >> >> I think you could get a stable cpu id between sched_getcpu and >> __rseq_abi.cpu_id >> if you add a pthread mutex to protect: >> >> sched_getcpu and __rseq_abi.cpu_id reads >> >> vs >> >> sched_setaffinity calls within the migration thread. >> >> Thoughts ? > > I tried that and couldn't reproduce the bug. That's what I attempted to call > out > in the blurb "blocking migration while this thread reads CPUs ... prevents > hitting > failures on a buggy kernel". > > I considered adding arbitrary delays around the mutex to try and hit the bug, > but > I was worried that even if I got it "working" for this bug, the test would be > too > tailored to this bug and potentially miss future regression. Letting the two > threads run wild seemed like it would provide the best coverage, at the cost > of > potentially causing to false failures. OK, so your point is that using mutual exclusion to ensure stability of the cpu id changes the timings too much, to a point where the issues don't reproduce. I understand that this mutex ties the migration thread timings to the vcpu thread's use of the mutex, which will reduce timings randomness, which is unwanted here. I still really hate flakiness in tests, because then people stop caring when they fail once in a while. And with the nature of rseq, a once-in-a-while failure is a big deal. Let's see if we can use other tricks to ensure stability of the cpu id without changing timings too much. One idea would be to use a seqcount lock. But even if we use that, I'm concerned that the very long writer critical section calling sched_setaffinity would need to be alternated with a sleep to ensure the read-side progresses. The sleep delay could be relatively small compared to the duration of the sched_setaffinity call, e.g. ratio 1:10. static volatile uint64_t seqcnt; The thread responsible for setting the affinity would do something like: for (;;) { atomic_inc_seq_cst(&seqcnt); sched_setaffinity(..., n++ % nr_cpus); atomic_inc_seq_cst(&seqcnt); usleep(1); /* this is where read-side is allowed to progress. 
*/ } And the thread reading the rseq cpu id and calling sched_getcpu(): uint64_t snapshot; do { snapshot = atomic_load(&seqcnt) & ~1; /* force retry if odd */ smp_rmb(); cpu = sched_getcpu(); rseq_cpu = READ_ONCE(__rseq.cpu_id); smp_rmb(); } while (snapshot != atomic_load(&seqcnt)); So the reader retry the cpu id reads whenever sched_setaffinity is being called by the migration thread, and whenever it is preempted for more than one migration thread loop. That should achieve our goal of providing cpu id stability without significantly changing the timings of the migration thread, given that it never blocks waiting for the reader. Thoughts ? Thanks, Mathieu -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
- On Aug 19, 2021, at 7:48 PM, Sean Christopherson sea...@google.com wrote: > On Thu, Aug 19, 2021, Mathieu Desnoyers wrote: >> - On Aug 17, 2021, at 8:12 PM, Sean Christopherson sea...@google.com >> wrote: >> > @@ -250,7 +250,7 @@ static int rseq_ip_fixup(struct pt_regs *regs) >> > * If not nested over a rseq critical section, restart is useless. >> > * Clear the rseq_cs pointer and return. >> > */ >> > - if (!in_rseq_cs(ip, &rseq_cs)) >> > + if (!regs || !in_rseq_cs(ip, &rseq_cs)) >> >> I think clearing the thread's rseq_cs unconditionally here when regs is NULL >> is not the behavior we want when this is called from xfer_to_guest_mode_work. >> >> If we have a scenario where userspace ends up calling this ioctl(KVM_RUN) >> from within a rseq c.s., we really want a CONFIG_DEBUG_RSEQ=y kernel to >> kill this application in the rseq_syscall handler when exiting back to >> usermode >> when the ioctl eventually returns. >> >> However, clearing the thread's rseq_cs will prevent this from happening. >> >> So I would favor an approach where we simply do: >> >> if (!regs) >> return 0; >> >> Immediately at the beginning of rseq_ip_fixup, before getting the instruction >> pointer, so effectively skip all side-effects of the ip fixup code. Indeed, >> it >> is not relevant to do any fixup here, because it is nested in a ioctl system >> call. >> >> Effectively, this would preserve the SIGSEGV behavior when this ioctl is >> erroneously called by user-space from a rseq critical section. > > Ha, that's effectively what I implemented first, but I changed it because of > the > comment in clear_rseq_cs() that says: > > The rseq_cs field is set to NULL on preemption or signal delivery ... as well > as well as on top of code outside of the rseq assembly block. > > Which makes it sound like something might rely on clearing rseq_cs? This comment is describing succinctly the lazy clear scheme for rseq_cs. Without the lazy clear scheme, a rseq c.s. would look like: * init(rseq_cs) * cpu = TLS->rseq::cpu_id_start * [1] TLS->rseq::rseq_cs = rseq_cs * [start_ip] * [2] if (cpu != TLS->rseq::cpu_id) * goto abort_ip; * [3] * [post_commit_ip] * [4] TLS->rseq::rseq_cs = NULL But as a fast-path optimization, [4] is not entirely needed because the rseq_cs descriptor contains information about the instruction pointer range of the critical section. Therefore, userspace can omit [4], but if the kernel never clears it, it means that it will have to re-read the rseq_cs descriptor's content each time it needs to check it to confirm that it is not nested over a rseq c.s.. So making the kernel lazily clear the rseq_cs pointer is just an optimization which ensures that the kernel won't do useless work the next time it needs to check rseq_cs, given that it has already validated that the userspace code is currently not within the rseq c.s. currently advertised by the rseq_cs field. > > Ah, or is it the case that rseq_cs is non-NULL if and only if userspace is in > an > rseq critical section, and because syscalls in critical sections are illegal, > by > definition clearing rseq_cs is a nop unless userspace is misbehaving. Not quite, as I described above. But we want it to stay set so the CONFIG_DEBUG_RSEQ code executed when returning from ioctl to userspace will be able to validate that it is not nested within a rseq critical section. > > If that's true, what about explicitly checking that at NOTIFY_RESUME? Or is > it > not worth the extra code to detect an error that will likely be caught > anyways? 
The error will indeed already be caught on return from ioctl to userspace, so I don't see any added value in duplicating this check. Thanks, Mathieu > > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 35f7bd0fced0..28b8342290b0 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -282,6 +282,13 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, > struct pt_regs *regs) > >if (unlikely(t->flags & PF_EXITING)) >return; > + if (!regs) { > +#ifdef CONFIG_DEBUG_RSEQ > + if (t->rseq && rseq_get_rseq_cs(t, &rseq_cs)) > + goto error; > +#endif > + return; > + } >ret = rseq_ip_fixup(regs); >if (unlikely(ret < 0)) >goto error; > >> Thanks for looking into this ! >> >> Mathieu >> >> >return clear_rseq_cs(t); >> >ret = rseq_need_restart(t, rseq_cs.flags); >> >if (ret <= 0) >> > -- >> > 2.33.0.rc1.237.g0d66db33f3-goog >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. > > http://www.efficios.com -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 1/5] KVM: rseq: Update rseq when processing NOTIFY_RESUME on xfer to KVM guest
- On Aug 20, 2021, at 6:49 PM, Sean Christopherson sea...@google.com wrote: > Invoke rseq's NOTIFY_RESUME handler when processing the flag prior to > transferring to a KVM guest, which is roughly equivalent to an exit to > userspace and processes many of the same pending actions. While the task > cannot be in an rseq critical section as the KVM path is reachable only > by via ioctl(KVM_RUN), the side effects that apply to rseq outside of a > critical section still apply, e.g. the current CPU needs to be updated if > the task is migrated. > > Clearing TIF_NOTIFY_RESUME without informing rseq can lead to segfaults > and other badness in userspace VMMs that use rseq in combination with KVM, > e.g. due to the CPU ID being stale after task migration. Acked-by: Mathieu Desnoyers > > Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") > Reported-by: Peter Foley > Bisected-by: Doug Evans > Cc: Shakeel Butt > Cc: Thomas Gleixner > Cc: sta...@vger.kernel.org > Signed-off-by: Sean Christopherson > --- > kernel/entry/kvm.c | 4 +++- > kernel/rseq.c | 14 +++--- > 2 files changed, 14 insertions(+), 4 deletions(-) > > diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c > index 49972ee99aff..049fd06b4c3d 100644 > --- a/kernel/entry/kvm.c > +++ b/kernel/entry/kvm.c > @@ -19,8 +19,10 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, > unsigned long ti_work) > if (ti_work & _TIF_NEED_RESCHED) > schedule(); > > - if (ti_work & _TIF_NOTIFY_RESUME) > + if (ti_work & _TIF_NOTIFY_RESUME) { > tracehook_notify_resume(NULL); > + rseq_handle_notify_resume(NULL, NULL); > + } > > ret = arch_xfer_to_guest_mode_handle_work(vcpu, ti_work); > if (ret) > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 35f7bd0fced0..6d45ac3dae7f 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -282,9 +282,17 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, > struct pt_regs *regs) > > if (unlikely(t->flags & PF_EXITING)) > return; > - ret = rseq_ip_fixup(regs); > - if (unlikely(ret < 0)) > - goto error; > + > + /* > + * regs is NULL if and only if the caller is in a syscall path. Skip > + * fixup and leave rseq_cs as is so that rseq_sycall() will detect and > + * kill a misbehaving userspace on debug kernels. > + */ > + if (regs) { > + ret = rseq_ip_fixup(regs); > + if (unlikely(ret < 0)) > + goto error; > + } > if (unlikely(rseq_update_cpu_id(t))) > goto error; > return; > -- > 2.33.0.rc2.250.ged5fa647cd-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com
Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
_code); > + > + pthread_create(&migration_thread, NULL, migration_worker, 0); > + > + while (!done) { > + vcpu_run(vm, VCPU_ID); > + TEST_ASSERT(get_ucall(vm, VCPU_ID, NULL) == UCALL_SYNC, > + "Guest failed?"); > + > + /* > + * Verify rseq's CPU matches sched's CPU. Ensure migration > + * doesn't occur between sched_getcpu() and reading the rseq > + * cpu_id by rereading both if the sequence count changes, or > + * if the count is odd (migration in-progress). > + */ > + do { > + /* > + * Drop bit 0 to force a mismatch if the count is odd, > + * i.e. if a migration is in-progress. > + */ > + snapshot = atomic_read(&seq_cnt) & ~1; > + smp_rmb(); > + cpu = sched_getcpu(); > + rseq_cpu = READ_ONCE(__rseq.cpu_id); > + smp_rmb(); > + } while (snapshot != atomic_read(&seq_cnt)); > + > + TEST_ASSERT(rseq_cpu == cpu, > + "rseq CPU = %d, sched CPU = %d\n", rseq_cpu, cpu); > + } > + > + pthread_join(migration_thread, NULL); > + > + kvm_vm_free(vm); > + > + sys_rseq(RSEQ_FLAG_UNREGISTER); > + > + return 0; > +} > -- > 2.33.0.rc2.250.ged5fa647cd-goog -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com