[PATCH] tracing: Add new_exec tracepoint

2024-04-08 Thread Marco Elver
Add "new_exec" tracepoint, which is run right after the point of no
return but before the current task assumes its new exec identity.

Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
runs before flushing the old exec, i.e. while the task still has the
original state (such as original MM), but when the new exec either
succeeds or crashes (but never returns to the original exec).

Being able to trace this event can be helpful in a number of use cases:

  * allowing tracing eBPF programs access to the original MM on exec,
before current->mm is replaced;
  * counting exec in the original task (via perf event);
  * profiling flush time ("new_exec" to "sched_process_exec").

Example of tracing output ("new_exec" and "sched_process_exec"):

  $ cat /sys/kernel/debug/tracing/trace_pipe
  <...>-379 [003] .   179.626921: new_exec: filename=/usr/bin/sshd pid=379 comm=sshd
  <...>-379 [003] .   179.629131: sched_process_exec: filename=/usr/bin/sshd pid=379 old_pid=379
  <...>-381 [002] .   180.048580: new_exec: filename=/bin/bash pid=381 comm=sshd
  <...>-381 [002] .   180.053122: sched_process_exec: filename=/bin/bash pid=381 old_pid=381
  <...>-385 [001] .   180.068277: new_exec: filename=/usr/bin/tty pid=385 comm=bash
  <...>-385 [001] .   180.069485: sched_process_exec: filename=/usr/bin/tty pid=385 old_pid=385
  <...>-389 [006] .   192.020147: new_exec: filename=/usr/bin/dmesg pid=389 comm=bash
   bash-389     [006] .   192.021377: sched_process_exec: filename=/usr/bin/dmesg pid=389 old_pid=389

Signed-off-by: Marco Elver 
---
 fs/exec.c   |  2 ++
 include/trace/events/task.h | 30 ++
 2 files changed, 32 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 38bf71cbdf5e..ab778ae1fc06 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1268,6 +1268,8 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
 
+   trace_new_exec(current, bprm);
+
/*
 * Ensure all future errors are fatal.
 */
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 47b527464d1a..8853dc44783d 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -56,6 +56,36 @@ TRACE_EVENT(task_rename,
__entry->newcomm, __entry->oom_score_adj)
 );
 
+/**
+ * new_exec - called before setting up new exec
+ * @task:  pointer to the current task
+ * @bprm:  pointer to linux_binprm used for new exec
+ *
+ * Called before flushing the old exec, but at the point of no return during
+ * switching to the new exec.
+ */
+TRACE_EVENT(new_exec,
+
+   TP_PROTO(struct task_struct *task, struct linux_binprm *bprm),
+
+   TP_ARGS(task, bprm),
+
+   TP_STRUCT__entry(
+   __string(   filename,   bprm->filename  )
+   __field(pid_t,  pid )
+   __string(   comm,   task->comm  )
+   ),
+
+   TP_fast_assign(
+   __assign_str(filename, bprm->filename);
+   __entry->pid = task->pid;
+   __assign_str(comm, task->comm);
+   ),
+
+   TP_printk("filename=%s pid=%d comm=%s",
+ __get_str(filename), __entry->pid, __get_str(comm))
+);
+
 #endif
 
 /* This part must be outside protection */
-- 
2.44.0.478.gd926399ef9-goog




Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Marco Elver
On Tue, 9 Apr 2024 at 16:31, Steven Rostedt  wrote:
>
> On Mon,  8 Apr 2024 11:01:54 +0200
> Marco Elver  wrote:
>
> > Add "new_exec" tracepoint, which is run right after the point of no
> > return but before the current task assumes its new exec identity.
> >
> > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> > runs before flushing the old exec, i.e. while the task still has the
> > original state (such as original MM), but when the new exec either
> > succeeds or crashes (but never returns to the original exec).
> >
> > Being able to trace this event can be helpful in a number of use cases:
> >
> >   * allowing tracing eBPF programs access to the original MM on exec,
> > before current->mm is replaced;
> >   * counting exec in the original task (via perf event);
> >   * profiling flush time ("new_exec" to "sched_process_exec").
> >
> > Example of tracing output ("new_exec" and "sched_process_exec"):
>
> How common is this? And can't you just do the same with adding a kprobe?

Our main use case would be to use this in BPF programs to become
exec-aware, where using the sched_process_exec hook is too late. This
is particularly important where the BPF program must stop inspecting
the user space's VM when the task does exec to become a new process.

kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s
permission check can still return an error which returns to the
original task without crashing. Only at the point of no return are we
guaranteed that the exec either succeeds, or the task is terminated on
failure.

I don't know if "common" is the right question here, because it's a
chicken-and-egg problem: without the tracepoint, we give up; with the
tracepoint, it unlocks a range of new use cases (that require a robust
way to make BPF programs exec-aware, and a tracepoint is the only
option IMHO).
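
To sketch the idea (hypothetical, minimal BPF program; it assumes the
tracepoint is exposed as tracepoint:task/new_exec as in this patch, and
the map and section names are illustrative only):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 10240);
          __type(key, __u32);     /* tgid of a task whose VM we inspect */
          __type(value, __u8);
  } inspected SEC(".maps");

  SEC("tracepoint/task/new_exec")
  int stop_inspecting_on_exec(void *ctx)
  {
          __u32 tgid = bpf_get_current_pid_tgid() >> 32;

          /*
           * current->mm is still the old image here; from this point on the
           * exec either completes or the task dies, so stop touching this
           * task's user space memory now.
           */
          bpf_map_delete_elem(&inspected, &tgid);
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";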

Thanks,
-- Marco



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-09 Thread Marco Elver
On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote:
[...]
> > +   trace_new_exec(current, bprm);
> > +
> 
> All other steps in this function have explicit comments about
> what/why/etc. Please add some kind of comment describing why the
> tracepoint is where it is, etc.

I beefed up the tracepoint documentation, and wrote a little paragraph
above where it's called to reinforce what we want.

[...]
> What about binfmt_misc, and binfmt_script? You may want bprm->interp
> too?

Good points. I'll make the below changes for v2:

diff --git a/fs/exec.c b/fs/exec.c
index ab778ae1fc06..472b9f7b40e8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
 
+   /*
+* This tracepoint marks the point before flushing the old exec where
+* the current task is still unchanged, but errors are fatal (point of
+* no return). The later "sched_process_exec" tracepoint is called after
+* the current task has successfully switched to the new exec.
+*/
trace_new_exec(current, bprm);
 
/*
diff --git a/include/trace/events/task.h b/include/trace/events/task.h
index 8853dc44783d..623d9af777c1 100644
--- a/include/trace/events/task.h
+++ b/include/trace/events/task.h
@@ -61,8 +61,11 @@ TRACE_EVENT(task_rename,
  * @task:  pointer to the current task
  * @bprm:  pointer to linux_binprm used for new exec
  *
- * Called before flushing the old exec, but at the point of no return during
- * switching to the new exec.
+ * Called before flushing the old exec, where @task is still unchanged, but at
+ * the point of no return during switching to the new exec. At the point it is
+ * called the exec will either succeed, or on failure terminate the task. Also
+ * see the "sched_process_exec" tracepoint, which is called right after @task
+ * has successfully switched to the new exec.
  */
 TRACE_EVENT(new_exec,
 
@@ -71,19 +74,22 @@ TRACE_EVENT(new_exec,
TP_ARGS(task, bprm),
 
TP_STRUCT__entry(
+   __string(   interp, bprm->interp)
__string(   filename,   bprm->filename  )
__field(pid_t,  pid )
__string(   comm,   task->comm  )
),
 
TP_fast_assign(
+   __assign_str(interp, bprm->interp);
__assign_str(filename, bprm->filename);
__entry->pid = task->pid;
__assign_str(comm, task->comm);
),
 
-   TP_printk("filename=%s pid=%d comm=%s",
- __get_str(filename), __entry->pid, __get_str(comm))
+   TP_printk("interp=%s filename=%s pid=%d comm=%s",
+ __get_str(interp), __get_str(filename),
+ __entry->pid, __get_str(comm))
 );
 
 #endif



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-10 Thread Marco Elver
On Wed, 10 Apr 2024 at 01:54, Masami Hiramatsu  wrote:
>
> On Tue, 9 Apr 2024 16:45:47 +0200
> Marco Elver  wrote:
>
> > On Tue, 9 Apr 2024 at 16:31, Steven Rostedt  wrote:
> > >
> > > On Mon,  8 Apr 2024 11:01:54 +0200
> > > Marco Elver  wrote:
> > >
> > > > Add "new_exec" tracepoint, which is run right after the point of no
> > > > return but before the current task assumes its new exec identity.
> > > >
> > > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> > > > runs before flushing the old exec, i.e. while the task still has the
> > > > original state (such as original MM), but when the new exec either
> > > > succeeds or crashes (but never returns to the original exec).
> > > >
> > > > Being able to trace this event can be helpful in a number of use cases:
> > > >
> > > >   * allowing tracing eBPF programs access to the original MM on exec,
> > > > before current->mm is replaced;
> > > >   * counting exec in the original task (via perf event);
> > > >   * profiling flush time ("new_exec" to "sched_process_exec").
> > > >
> > > > Example of tracing output ("new_exec" and "sched_process_exec"):
> > >
> > > How common is this? And can't you just do the same with adding a kprobe?
> >
> > Our main use case would be to use this in BPF programs to become
> > exec-aware, where using the sched_process_exec hook is too late. This
> > is particularly important where the BPF program must stop inspecting
> > the user space's VM when the task does exec to become a new process.
>
> Just out of curiousity, would you like to audit that the user-program
> is not malformed? (security tracepoint?) I think that is an interesting
> idea. What kind of information you need?

I didn't have that in mind. If the BPF program reads (or even writes)
to user space memory, it must stop doing so before current->mm is
switched, otherwise it will lead to random results or memory
corruption. The new process may reallocate the memory that we want to
inspect, but the user space process must explicitly opt in to being
inspected or being manipulated. Just like the kernel "flushes" various
old state on exec since it's becoming a new process, a BPF program
that has per-process state needs to do the same.



Re: [PATCH] tracing: Add new_exec tracepoint

2024-04-10 Thread Marco Elver
On Wed, 10 Apr 2024 at 15:56, Masami Hiramatsu  wrote:
>
> On Mon,  8 Apr 2024 11:01:54 +0200
> Marco Elver  wrote:
>
> > Add "new_exec" tracepoint, which is run right after the point of no
> > return but before the current task assumes its new exec identity.
> >
> > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint
> > runs before flushing the old exec, i.e. while the task still has the
> > original state (such as original MM), but when the new exec either
> > succeeds or crashes (but never returns to the original exec).
> >
> > Being able to trace this event can be helpful in a number of use cases:
> >
> >   * allowing tracing eBPF programs access to the original MM on exec,
> > before current->mm is replaced;
> >   * counting exec in the original task (via perf event);
> >   * profiling flush time ("new_exec" to "sched_process_exec").
> >
> > Example of tracing output ("new_exec" and "sched_process_exec"):
>
> nit: the "new_exec" name stands out a bit compared to other events, and it is
> hard to tell whether it comes before or after "sched_process_exec". Since
> "begin_new_exec" is an internal implementation name, IMHO it should not be
> exposed to users. What do you think about calling this "sched_prepare_exec"?

I like it, I'll rename it to sched_prepare_exec.

Thanks!



[PATCH v2] tracing: Add sched_prepare_exec tracepoint

2024-04-11 Thread Marco Elver
Add "sched_prepare_exec" tracepoint, which is run right after the point
of no return but before the current task assumes its new exec identity.

Unlike the tracepoint "sched_process_exec", the "sched_prepare_exec"
tracepoint runs before flushing the old exec, i.e. while the task still
has the original state (such as original MM), but when the new exec
either succeeds or crashes (but never returns to the original exec).

Being able to trace this event can be helpful in a number of use cases:

  * allowing tracing eBPF programs access to the original MM on exec,
before current->mm is replaced;
  * counting exec in the original task (via perf event);
  * profiling flush time ("sched_prepare_exec" to "sched_process_exec").

Example of tracing output:

 $ cat /sys/kernel/debug/tracing/trace_pipe
<...>-379  [003] .  179.626921: sched_prepare_exec: interp=/usr/bin/sshd filename=/usr/bin/sshd pid=379 comm=sshd
<...>-381  [002] .  180.048580: sched_prepare_exec: interp=/bin/bash filename=/bin/bash pid=381 comm=sshd
<...>-385  [001] .  180.068277: sched_prepare_exec: interp=/usr/bin/tty filename=/usr/bin/tty pid=385 comm=bash
<...>-389  [006] .  192.020147: sched_prepare_exec: interp=/usr/bin/dmesg filename=/usr/bin/dmesg pid=389 comm=bash
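
A hypothetical user-space sketch of the perf-event counting use case
(the tracepoint id would be read from
/sys/kernel/tracing/events/sched/sched_prepare_exec/id; names are
illustrative only):

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int open_prepare_exec_counter(pid_t pid, __u64 tracepoint_id)
  {
          struct perf_event_attr attr = {
                  .type    = PERF_TYPE_TRACEPOINT,
                  .size    = sizeof(attr),
                  .config  = tracepoint_id,   /* sched:sched_prepare_exec */
                  .inherit = 1,               /* also count in children */
          };

          /* Counts execs of @pid (and children) on any CPU. */
          return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
  }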

Signed-off-by: Marco Elver 
---
v2:
* Add more documentation.
* Also show bprm->interp in trace.
* Rename to sched_prepare_exec.
---
 fs/exec.c|  8 
 include/trace/events/sched.h | 35 +++
 2 files changed, 43 insertions(+)

diff --git a/fs/exec.c b/fs/exec.c
index 38bf71cbdf5e..57fee729dd92 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1268,6 +1268,14 @@ int begin_new_exec(struct linux_binprm * bprm)
if (retval)
return retval;
 
+   /*
+* This tracepoint marks the point before flushing the old exec where
+* the current task is still unchanged, but errors are fatal (point of
+* no return). The later "sched_process_exec" tracepoint is called after
+* the current task has successfully switched to the new exec.
+*/
+   trace_sched_prepare_exec(current, bprm);
+
/*
 * Ensure all future errors are fatal.
 */
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index dbb01b4b7451..226f47c6939c 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -420,6 +420,41 @@ TRACE_EVENT(sched_process_exec,
  __entry->pid, __entry->old_pid)
 );
 
+/**
+ * sched_prepare_exec - called before setting up new exec
+ * @task:  pointer to the current task
+ * @bprm:  pointer to linux_binprm used for new exec
+ *
+ * Called before flushing the old exec, where @task is still unchanged, but at
+ * the point of no return during switching to the new exec. At the point it is
+ * called the exec will either succeed, or on failure terminate the task. Also
+ * see the "sched_process_exec" tracepoint, which is called right after @task
+ * has successfully switched to the new exec.
+ */
+TRACE_EVENT(sched_prepare_exec,
+
+   TP_PROTO(struct task_struct *task, struct linux_binprm *bprm),
+
+   TP_ARGS(task, bprm),
+
+   TP_STRUCT__entry(
+   __string(   interp, bprm->interp)
+   __string(   filename,   bprm->filename  )
+   __field(pid_t,  pid )
+   __string(   comm,   task->comm  )
+   ),
+
+   TP_fast_assign(
+   __assign_str(interp, bprm->interp);
+   __assign_str(filename, bprm->filename);
+   __entry->pid = task->pid;
+   __assign_str(comm, task->comm);
+   ),
+
+   TP_printk("interp=%s filename=%s pid=%d comm=%s",
+ __get_str(interp), __get_str(filename),
+ __entry->pid, __get_str(comm))
+);
 
 #ifdef CONFIG_SCHEDSTATS
 #define DEFINE_EVENT_SCHEDSTAT DEFINE_EVENT
-- 
2.44.0.478.gd926399ef9-goog




Re: [PATCH v3 1/3] kasan: switch kunit tests to console tracepoints

2023-12-11 Thread Marco Elver
On Mon, 11 Dec 2023 at 23:48, Paul Heidekrüger  wrote:
>
> On 11.12.2023 21:51, Andrey Konovalov wrote:
> > On Mon, Dec 11, 2023 at 7:59 PM Paul Heidekrüger
> >  wrote:
> > >
> > > > Hi Paul,
> > > >
> > > > I've been successfully running KASAN tests with CONFIG_TRACEPOINTS
> > > > enabled on arm64 since this patch landed.
> > >
> > > Interesting ...
> > >
> > > > What happens when you try running the tests with .kunitconfig? Does
> > > > CONFIG_TRACEPOINTS or CONFIG_KASAN_KUNIT_TEST get disabled during
> > > > kernel building?
> > >
> > > Yes, exactly, that's what's happening.
> > >
> > > Here's the output kunit.py is giving me. I replaced CONFIG_DEBUG_KERNEL 
> > > with
> > > CONFIG_TRACEPOINTS in my .kunitconfig. Otherwise, it's identical with the 
> > > one I
> > > posted above.
> > >
> > > ➜   ./tools/testing/kunit/kunit.py run 
> > > --kunitconfig=mm/kasan/.kunitconfig --arch=arm64
> > > Configuring KUnit Kernel ...
> > > Regenerating .config ...
> > > Populating config with:
> > > $ make ARCH=arm64 O=.kunit olddefconfig
> > > ERROR:root:Not all Kconfig options selected in kunitconfig were 
> > > in the generated .config.
> > > This is probably due to unsatisfied dependencies.
> > > Missing: CONFIG_KASAN_KUNIT_TEST=y, CONFIG_TRACEPOINTS=y
> > >
> > > Does CONFIG_TRACEPOINTS have some dependency I'm not seeing? I couldn't 
> > > find a
> > > reason why it would get disabled, but I could definitely be wrong.
> >
> > Does your .kunitconfig include CONFIG_TRACEPOINTS=y? I don't see it in
> > the listing that you sent earlier.
>
> Yes. For the kunit.py output from my previous email, I replaced
> CONFIG_DEBUG_KERNEL=y with CONFIG_TRACEPOINTS=y. So, the .kunitconfig I used 
> to
> produce the output above was:
>
> CONFIG_KUNIT=y
> CONFIG_KUNIT_ALL_TESTS=n
> CONFIG_TRACEPOINTS=y
> CONFIG_KASAN=y
> CONFIG_KASAN_GENERIC=y
> CONFIG_KASAN_KUNIT_TEST=y
>
> This more or less mirrors what mm/kfence/.kunitconfig is doing, which also 
> isn't
> working on my side; kunit.py reports the same error.

mm/kfence/.kunitconfig sets CONFIG_FTRACE=y. CONFIG_TRACEPOINTS is not
user-selectable. I don't think any of this has changed since the initial
discussion above, so CONFIG_FTRACE=y is still needed.



Re: [PATCH v3 1/3] kasan: switch kunit tests to console tracepoints

2023-12-12 Thread Marco Elver
On Tue, 12 Dec 2023 at 10:19, Paul Heidekrüger  wrote:
>
> On 12.12.2023 00:37, Andrey Konovalov wrote:
> > On Tue, Dec 12, 2023 at 12:35 AM Paul Heidekrüger
> >  wrote:
> > >
> > > Using CONFIG_FTRACE=y instead of CONFIG_TRACEPOINTS=y produces the same 
> > > error
> > > for me.
> > >
> > > So
> > >
> > > CONFIG_KUNIT=y
> > > CONFIG_KUNIT_ALL_TESTS=n
> > > CONFIG_FTRACE=y
> > > CONFIG_KASAN=y
> > > CONFIG_KASAN_GENERIC=y
> > > CONFIG_KASAN_KUNIT_TEST=y
> > >
> > > produces
> > >
> > > ➜   ./tools/testing/kunit/kunit.py run 
> > > --kunitconfig=mm/kasan/.kunitconfig --arch=arm64
> > > Configuring KUnit Kernel ...
> > > Regenerating .config ...
> > > Populating config with:
> > > $ make ARCH=arm64 O=.kunit olddefconfig CC=clang
> > > ERROR:root:Not all Kconfig options selected in kunitconfig were 
> > > in the generated .config.
> > > This is probably due to unsatisfied dependencies.
> > > Missing: CONFIG_KASAN_KUNIT_TEST=y
> > >
> > > By that error message, CONFIG_FTRACE appears to be present in the 
> > > generated
> > > config, but CONFIG_KASAN_KUNIT_TEST still isn't. Presumably,
> > > CONFIG_KASAN_KUNIT_TEST is missing because of an unsatisfied dependency, 
> > > which
> > > must be CONFIG_TRACEPOINTS, unless I'm missing something ...
> > >
> > > If I just generate an arm64 defconfig and select CONFIG_FTRACE=y,
> > > CONFIG_TRACEPOINTS=y shows up in my .config. So, maybe this is 
> > > kunit.py-related
> > > then?
> > >
> > > Andrey, you said that the tests have been working for you; are you 
> > > running them
> > > with kunit.py?
> >
> > No, I just run the kernel built with a config file that I put together
> > based on defconfig.
>
> Ah. I believe I've figured it out.
>
> When I add CONFIG_STACK_TRACER=y in addition to CONFIG_FTRACE=y, it works.

CONFIG_FTRACE should be enough - maybe also check x86 vs. arm64 to debug more.

> CONFIG_STACK_TRACER selects CONFIG_FUNCTION_TRACER, CONFIG_FUNCTION_TRACER
> selects CONFIG_GENERIC_TRACER, CONFIG_GENERIC_TRACER selects CONFIG_TRACING, 
> and
> CONFIG_TRACING selects CONFIG_TRACEPOINTS.
>
> CONFIG_BLK_DEV_IO_TRACE=y also works instead of CONFIG_STACK_TRACER=y, as it
> directly selects CONFIG_TRACEPOINTS.
>
> CONFIG_FTRACE=y on its own does not appear suffice for kunit.py on arm64.

When you build manually with just CONFIG_FTRACE, is CONFIG_TRACEPOINTS enabled?

> I believe the reason my .kunitconfig as well as the existing
> mm/kfence/.kunitconfig work on X86 is because CONFIG_TRACEPOINTS=y is present 
> in
> an X86 defconfig.
>
> Does this make sense?
>
> Would you welcome a patch addressing this for the existing
> mm/kfence/.kunitconfig?
>
> I would also like to submit a patch for an mm/kasan/.kunitconfig. Do you think
> that would be helpful too?
>
> FWICT, kernel/kcsan/.kunitconfig might also be affected since
> CONFIG_KCSAN_KUNIT_TEST also depends on CONFIG_TRACEPOINTS, but I would have 
> to
> test that. That could be a third patch.

I'd support figuring out the minimal config (CONFIG_FTRACE or
something else?) that satisfies the TRACEPOINTS dependency. I always
thought CONFIG_FTRACE ought to be the one config option, but maybe
something changed.

Also maybe one of the tracing maintainers can help untangle what's
going on here.

Thanks,
-- Marco



Re: [RFC] Printk deadlock in bpf trace called from scheduler context

2024-07-29 Thread Marco Elver
On Mon, 29 Jul 2024 at 14:27, Peter Zijlstra  wrote:
>
> On Mon, Jul 29, 2024 at 01:46:09PM +0200, Radoslaw Zielonek wrote:
> > I am currently working on a syzbot-reported bug where bpf
> > is called from trace_sched_switch. In this scenario, we are still within
> > the scheduler context, and calling printk can create a deadlock.
> >
> > I am uncertain about the best approach to fix this issue.
>
> It's been like this forever, it doesn't need fixing, because tracepoints
> shouldn't be doing printk() in the first place.
>
> > Should we simply forbid such calls, or perhaps we should replace printk
> > with printk_deferred in the bpf where we are still in scheduler context?
>
> Not doing printk() is best.

And teaching more debugging tools to behave.

This particular case originates from fault injection:

> [   60.265518][ T8343]  should_fail_ex+0x383/0x4d0
> [   60.265547][ T8343]  strncpy_from_user+0x36/0x2d0
> [   60.265601][ T8343]  strncpy_from_user_nofault+0x70/0x140
> [   60.265637][ T8343]  bpf_probe_read_user_str+0x2a/0x70

Probably the fail_dump() function in lib/fault-inject.c being a little
too verbose in this case.

Radoslaw, the fix should be in lib/fault-inject.c. Similar to other
debugging tools (like KFENCE, which you discovered), adding
lockdep_off()/lockdep_on(), printk_deferred(), or not being as verbose in
this context may be more appropriate. Fault injection does not need to
print a message to inject a fault - the message is for debugging
purposes. Probably a reasonable compromise is to use printk_deferred()
in fail_dump() when in this context, to still help with debugging on a
best-effort basis. You also need to take care to avoid dumping the
stack in fail_dump().
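
Something along these lines, as a rough sketch (this is not the actual
lib/fault-inject.c code, and in_printk_unsafe_context() is a placeholder
for whatever predicate ends up deciding that a full printk()/dump_stack()
is not safe in the current context):

  static void fail_dump(struct fault_attr *attr)
  {
          if (!attr->verbose)
                  return;

          if (in_printk_unsafe_context()) {
                  /* Best effort: defer the message, skip the stack dump. */
                  printk_deferred(KERN_NOTICE
                                  "FAULT_INJECTION: forcing a failure\n");
                  return;
          }

          printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure\n");
          if (attr->verbose > 1)
                  dump_stack();
  }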



Re: [syzbot] [trace?] linux-next test error: WARNING in rcu_core

2024-08-02 Thread Marco Elver
On Fri, 2 Aug 2024 at 09:58, syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:f524a5e4dfb7 Add linux-next specific files for 20240802
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=174c896d98
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a66a5509e9947c4c
> dashboard link: https://syzkaller.appspot.com/bug?extid=263726e59eab6b442723
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
>
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/8c0255b9a6ad/disk-f524a5e4.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/71d89466ea60/vmlinux-f524a5e4.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/ba8fcf059463/bzImage-f524a5e4.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+263726e59eab6b442...@syzkaller.appspotmail.com
>
> [ cut here ]
> WARNING: CPU: 0 PID: 1 at mm/slub.c:4550 
> slab_free_after_rcu_debug+0x18b/0x270 mm/slub.c:4550

See https://lore.kernel.org/all/zqyths-o85nqu...@elver.google.com/T/#u



Re: [syzbot] [virt?] KCSAN: data-race in virtqueue_disable_cb / vring_interrupt (4)

2024-09-12 Thread Marco Elver
On Thu, 12 Sept 2024 at 13:03, Michael S. Tsirkin  wrote:
>
> On Thu, Sep 12, 2024 at 01:11:21AM -0700, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:7c6a3a65ace7 minmax: reduce min/max macro expansion in ato..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1608e49f98
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=1e7d02549be622b2
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8a02104389c2e0ef5049
> > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > Downloadable assets:
> > disk image: 
> > https://storage.googleapis.com/syzbot-assets/a1f7496fa21f/disk-7c6a3a65.raw.xz
> > vmlinux: 
> > https://storage.googleapis.com/syzbot-assets/f423739e51a9/vmlinux-7c6a3a65.xz
> > kernel image: 
> > https://storage.googleapis.com/syzbot-assets/b65a0f38cbd7/bzImage-7c6a3a65.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+8a02104389c2e0ef5...@syzkaller.appspotmail.com
> >
> > ==
> > BUG: KCSAN: data-race in virtqueue_disable_cb / vring_interrupt
> >
> > write to 0x88810285ef52 of 1 bytes by interrupt on cpu 0:
> >  vring_interrupt+0x12b/0x180 drivers/virtio/virtio_ring.c:2591
>
>
> Yes, it's racy!
>
> 2589:/* Just a hint for performance: so it's ok that this can be 
> racy! */
> 2590:if (vq->event)
> 2591:vq->event_triggered = true;
>
>
> Question: is there a way to annotate code to tell syzbot it's ok?

In this case, "if (data_race(vq->event))" might be the right choice.

This is a quick guide on which access primitive to use in concurrent
code: 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt

Thanks,
-- Marco



Re: [syzbot] [virt?] KCSAN: data-race in virtqueue_disable_cb / vring_interrupt (4)

2024-09-12 Thread Marco Elver
On Thu, 12 Sept 2024 at 16:34, Michael S. Tsirkin  wrote:
>
> On Thu, Sep 12, 2024 at 03:48:32PM +0200, Marco Elver wrote:
> > On Thu, 12 Sept 2024 at 13:03, Michael S. Tsirkin  wrote:
> > >
> > > On Thu, Sep 12, 2024 at 01:11:21AM -0700, syzbot wrote:
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit:7c6a3a65ace7 minmax: reduce min/max macro expansion in 
> > > > ato..
> > > > git tree:   upstream
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=1608e49f98
> > > > kernel config:  
> > > > https://syzkaller.appspot.com/x/.config?x=1e7d02549be622b2
> > > > dashboard link: 
> > > > https://syzkaller.appspot.com/bug?extid=8a02104389c2e0ef5049
> > > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > > Debian) 2.40
> > > >
> > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > >
> > > > Downloadable assets:
> > > > disk image: 
> > > > https://storage.googleapis.com/syzbot-assets/a1f7496fa21f/disk-7c6a3a65.raw.xz
> > > > vmlinux: 
> > > > https://storage.googleapis.com/syzbot-assets/f423739e51a9/vmlinux-7c6a3a65.xz
> > > > kernel image: 
> > > > https://storage.googleapis.com/syzbot-assets/b65a0f38cbd7/bzImage-7c6a3a65.xz
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > > commit:
> > > > Reported-by: syzbot+8a02104389c2e0ef5...@syzkaller.appspotmail.com
> > > >
> > > > ==
> > > > BUG: KCSAN: data-race in virtqueue_disable_cb / vring_interrupt
> > > >
> > > > write to 0x88810285ef52 of 1 bytes by interrupt on cpu 0:
> > > >  vring_interrupt+0x12b/0x180 drivers/virtio/virtio_ring.c:2591
> > >
> > >
> > > Yes, it's racy!
> > >
> > > 2589:/* Just a hint for performance: so it's ok that this can be 
> > > racy! */
> > > 2590:if (vq->event)
> > > 2591:vq->event_triggered = true;
> > >
> > >
> > > Question: is there a way to annotate code to tell syzbot it's ok?
> >
> > In this case, "if (data_race(vq->event))" might be the right choice.
>
> No, vq->event is not racy.

Oops - yes.

> The race is between a write and a read of event_triggered.
> I think data_race tags a read, it can not tag a write, correct?

data_race() takes an expression, so either read or write can be
enclosed - e.g. "data_race(vq->event_triggered = true);" works as
well.



Re: [PATCH] virtio_ring: tag event_triggered as racy for KCSAN

2024-09-12 Thread Marco Elver
On Thu, 12 Sept 2024 at 16:45, Michael S. Tsirkin  wrote:
>
> event_triggered is fundamentally racy. There are races of 2 types:
> 1. vq processing can read false value while interrupt
>triggered and set it to true.
>result will be a bit of extra work when disabling cbs, no big deal.
>
> 2. vq processing can set false value then interrupt
>immediately sets true value
>since interrupt then triggers a callback which will
>process buffers, this is also not an issue.
>
> However, looks like KCSAN isn't smart enough to figure this out.
> Tag the field __data_racy for now.
> We should probably look at ways to make this more straight-forwardly
> correct.
>
> Cc: Marco Elver 
> Reported-by: syzbot+8a02104389c2e0ef5...@syzkaller.appspotmail.com
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/virtio/virtio_ring.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index be7309b1e860..724aa9c27c6b 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -194,7 +194,7 @@ struct vring_virtqueue {
> u16 last_used_idx;
>
> /* Hint for event idx: already triggered no need to disable. */
> -   bool event_triggered;
> +   bool __data_racy event_triggered;

I guess if you don't care about any data races on this variable, this
is reasonable. Although note that a data race is more subtle than just
a "race": https://lwn.net/Articles/816850/

Acked-by: Marco Elver 



Re: [PATCH v2] virtio_ring: tag event_triggered as racy for KCSAN

2024-09-12 Thread Marco Elver
On Thu, 12 Sept 2024 at 17:02, Michael S. Tsirkin  wrote:
>
> Setting event_triggered from the interrupt handler
> is fundamentally racy. There are races of 2 types:
> 1. vq processing can read false value while interrupt
>triggered and set it to true.
>result will be a bit of extra work when disabling cbs, no big deal.
>
> 2. vq processing can set false value then interrupt
>immediately sets true value
>since interrupt then triggers a callback which will
>process buffers, this is also not an issue.
>
> However, looks like KCSAN can not figure all this out, and warns about
> the race between the write and the read.  Tag the access data_racy for
> now.  We should probably look at ways to make this more
> straight-forwardly correct.
>
> Cc: Marco Elver 
> Reported-by: syzbot+8a02104389c2e0ef5...@syzkaller.appspotmail.com
> Signed-off-by: Michael S. Tsirkin 

Probably more conservative than the __data_racy hammer:

Acked-by: Marco Elver 

> ---
>  drivers/virtio/virtio_ring.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index be7309b1e860..98374ed7c577 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -2588,7 +2588,7 @@ irqreturn_t vring_interrupt(int irq, void *_vq)
>
> /* Just a hint for performance: so it's ok that this can be racy! */
> if (vq->event)
> -   vq->event_triggered = true;
> +   data_race(vq->event_triggered = true);
>
> pr_debug("virtqueue callback for %p (%p)\n", vq, vq->vq.callback);
> if (vq->vq.callback)
> --
> MST
>



Re: [PATCH v4 05/10] signal: Introduce TRAP_PERF si_code and si_perf to siginfo

2021-04-20 Thread Marco Elver
On Tue, 20 Apr 2021 at 23:26, Marek Szyprowski  wrote:
>
> Hi Marco,
>
> On 08.04.2021 12:36, Marco Elver wrote:
> > Introduces the TRAP_PERF si_code, and associated siginfo_t field
> > si_perf. These will be used by the perf event subsystem to send signals
> > (if requested) to the task where an event occurred.
> >
> > Acked-by: Geert Uytterhoeven  # m68k
> > Acked-by: Arnd Bergmann  # asm-generic
> > Signed-off-by: Marco Elver 
>
> This patch landed in linux-next as commit fb6cc127e0b6 ("signal:
> Introduce TRAP_PERF si_code and si_perf to siginfo"). It causes
> regression on my test systems (arm 32bit and 64bit). Most systems fails
> to boot in the given time frame. I've observed that there is a timeout
> waiting for udev to populate /dev and then also during the network
> interfaces configuration. Reverting this commit, together with
> 97ba62b27867 ("perf: Add support for SIGTRAP on perf events") to let it
> compile, on top of next-20210420 fixes the issue.

Thanks, this is weird for sure and nothing in particular stands out.

I have questions:
-- Can you please share your config?
-- Also, can you share how you run this? Can it be reproduced in qemu?
-- How did you derive this patch to be at fault? Why not just
97ba62b27867, given you also need to revert it?

If you are unsure which patch exactly it is, can you try just
reverting 97ba62b27867 and see what happens?

Thanks,
-- Marco

> > ---
> >   arch/m68k/kernel/signal.c  |  3 +++
> >   arch/x86/kernel/signal_compat.c|  5 -
> >   fs/signalfd.c  |  4 
> >   include/linux/compat.h |  2 ++
> >   include/linux/signal.h |  1 +
> >   include/uapi/asm-generic/siginfo.h |  6 +-
> >   include/uapi/linux/signalfd.h  |  4 +++-
> >   kernel/signal.c| 11 +++
> >   8 files changed, 33 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/m68k/kernel/signal.c b/arch/m68k/kernel/signal.c
> > index 349570f16a78..a4b7ee1df211 100644
> > --- a/arch/m68k/kernel/signal.c
> > +++ b/arch/m68k/kernel/signal.c
> > @@ -622,6 +622,9 @@ static inline void siginfo_build_tests(void)
> >   /* _sigfault._addr_pkey */
> >   BUILD_BUG_ON(offsetof(siginfo_t, si_pkey) != 0x12);
> >
> > + /* _sigfault._perf */
> > + BUILD_BUG_ON(offsetof(siginfo_t, si_perf) != 0x10);
> > +
> >   /* _sigpoll */
> >   BUILD_BUG_ON(offsetof(siginfo_t, si_band)   != 0x0c);
> >   BUILD_BUG_ON(offsetof(siginfo_t, si_fd) != 0x10);
> > diff --git a/arch/x86/kernel/signal_compat.c 
> > b/arch/x86/kernel/signal_compat.c
> > index a5330ff498f0..0e5d0a7e203b 100644
> > --- a/arch/x86/kernel/signal_compat.c
> > +++ b/arch/x86/kernel/signal_compat.c
> > @@ -29,7 +29,7 @@ static inline void signal_compat_build_tests(void)
> >   BUILD_BUG_ON(NSIGFPE  != 15);
> >   BUILD_BUG_ON(NSIGSEGV != 9);
> >   BUILD_BUG_ON(NSIGBUS  != 5);
> > - BUILD_BUG_ON(NSIGTRAP != 5);
> > + BUILD_BUG_ON(NSIGTRAP != 6);
> >   BUILD_BUG_ON(NSIGCHLD != 6);
> >   BUILD_BUG_ON(NSIGSYS  != 2);
> >
> > @@ -138,6 +138,9 @@ static inline void signal_compat_build_tests(void)
> >   BUILD_BUG_ON(offsetof(siginfo_t, si_pkey) != 0x20);
> >   BUILD_BUG_ON(offsetof(compat_siginfo_t, si_pkey) != 0x14);
> >
> > + BUILD_BUG_ON(offsetof(siginfo_t, si_perf) != 0x18);
> > + BUILD_BUG_ON(offsetof(compat_siginfo_t, si_perf) != 0x10);
> > +
> >   CHECK_CSI_OFFSET(_sigpoll);
> >   CHECK_CSI_SIZE  (_sigpoll, 2*sizeof(int));
> >   CHECK_SI_SIZE   (_sigpoll, 4*sizeof(int));
> > diff --git a/fs/signalfd.c b/fs/signalfd.c
> > index 456046e15873..040a1142915f 100644
> > --- a/fs/signalfd.c
> > +++ b/fs/signalfd.c
> > @@ -134,6 +134,10 @@ static int signalfd_copyinfo(struct signalfd_siginfo 
> > __user *uinfo,
> >   #endif
> >   new.ssi_addr_lsb = (short) kinfo->si_addr_lsb;
> >   break;
> > + case SIL_PERF_EVENT:
> > + new.ssi_addr = (long) kinfo->si_addr;
> > + new.ssi_perf = kinfo->si_perf;
> > + break;
> >   case SIL_CHLD:
> >   new.ssi_pid= kinfo->si_pid;
> >   new.ssi_uid= kinfo->si_uid;
> > diff --git a/include/linux/compat.h b/include/linux/compat.h
> > index 6e65be753603..c8821d966812 100644
> > --- a/include/linux/compat.h
> > +++ b/include/linux/compat.h
> > @@ -236,6 +236,8 @@ typedef struct compat_siginfo {
> > 

Re: [PATCH] slub: Introduce CONFIG_SLUB_RCU_DEBUG

2023-09-11 Thread Marco Elver
On Fri, 25 Aug 2023 at 23:15, 'Jann Horn' via kasan-dev
 wrote:
>
> Currently, KASAN is unable to catch use-after-free in SLAB_TYPESAFE_BY_RCU
> slabs because use-after-free is allowed within the RCU grace period by
> design.
>
> Add a SLUB debugging feature which RCU-delays every individual
> kmem_cache_free() before either actually freeing the object or handing it
> off to KASAN, and change KASAN to poison freed objects as normal when this
> option is enabled.
>
> Note that this creates a 16-byte unpoisoned area in the middle of the
> slab metadata area, which kinda sucks but seems to be necessary in order
> to be able to store an rcu_head in there without triggering an ASAN
> splat during RCU callback processing.
>
> For now I've configured Kconfig.kasan to always enable this feature in the
> GENERIC and SW_TAGS modes; I'm not forcibly enabling it in HW_TAGS mode
> because I'm not sure if it might have unwanted performance degradation
> effects there.
>
> Signed-off-by: Jann Horn 
> ---
> can I get a review from the KASAN folks of this?
> I have been running it on my laptop for a bit and it seems to be working
> fine.
>
> Notes:
> With this patch, a UAF on a TYPESAFE_BY_RCU will splat with an error
> like this (tested by reverting a security bugfix).
> Note that, in the ASAN memory state dump, we can see the little
> unpoisoned 16-byte areas storing the rcu_head.
>
> BUG: KASAN: slab-use-after-free in folio_lock_anon_vma_read+0x129/0x4c0
> Read of size 8 at addr 888004e85b00 by task forkforkfork/592
>
> CPU: 0 PID: 592 Comm: forkforkfork Not tainted 
> 6.5.0-rc7-00105-gae70c1e1f6f5-dirty #334
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.16.2-debian-1.16.2-1 04/01/2014
> Call Trace:
>  
>  dump_stack_lvl+0x4a/0x80
>  print_report+0xcf/0x660
>  kasan_report+0xd4/0x110
>  folio_lock_anon_vma_read+0x129/0x4c0
>  rmap_walk_anon+0x1cc/0x290
>  folio_referenced+0x277/0x2a0
>  shrink_folio_list+0xb8c/0x1680
>  reclaim_folio_list+0xdc/0x1f0
>  reclaim_pages+0x211/0x280
>  madvise_cold_or_pageout_pte_range+0x812/0xb70
>  walk_pgd_range+0x70b/0xce0
>  __walk_page_range+0x343/0x360
>  walk_page_range+0x227/0x280
>  madvise_pageout+0x1cd/0x2d0
>  do_madvise+0x552/0x15a0
>  __x64_sys_madvise+0x62/0x70
>  do_syscall_64+0x3b/0x90
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [...]
>  
>
> Allocated by task 574:
>  kasan_save_stack+0x33/0x60
>  kasan_set_track+0x25/0x30
>  __kasan_slab_alloc+0x6e/0x70
>  kmem_cache_alloc+0xfd/0x2b0
>  anon_vma_fork+0x88/0x270
>  dup_mmap+0x87c/0xc10
>  copy_process+0x3399/0x3590
>  kernel_clone+0x10e/0x480
>  __do_sys_clone+0xa1/0xe0
>  do_syscall_64+0x3b/0x90
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> Freed by task 0:
>  kasan_save_stack+0x33/0x60
>  kasan_set_track+0x25/0x30
>  kasan_save_free_info+0x2b/0x50
>  __kasan_slab_free+0xfe/0x180
>  slab_free_after_rcu_debug+0xad/0x200
>  rcu_core+0x638/0x1620
>  __do_softirq+0x14c/0x581
>
> Last potentially related work creation:
>  kasan_save_stack+0x33/0x60
>  __kasan_record_aux_stack+0x94/0xa0
>  __call_rcu_common.constprop.0+0x47/0x730
>  __put_anon_vma+0x6e/0x150
>  unlink_anon_vmas+0x277/0x2e0
>  vma_complete+0x341/0x580
>  vma_merge+0x613/0xff0
>  mprotect_fixup+0x1c0/0x510
>  do_mprotect_pkey+0x5a7/0x710
>  __x64_sys_mprotect+0x47/0x60
>  do_syscall_64+0x3b/0x90
>  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> Second to last potentially related work creation:
> [...]
>
> The buggy address belongs to the object at 888004e85b00
>  which belongs to the cache anon_vma of size 192
> The buggy address is located 0 bytes inside of
>  freed 192-byte region [888004e85b00, 888004e85bc0)
>
> The buggy address belongs to the physical page:
> [...]
>
> Memory state around the buggy address:
>  888004e85a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  888004e85a80: 00 00 00 00 00 00 00 00 fc 00 00 fc fc fc fc fc
> >888004e85b00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>^
>  888004e85b80: fb fb fb fb fb fb fb fb fc 00 00 fc fc fc fc fc
>  888004e85c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
>  include/linux/kasan.h|  6 
>  include/linux/slub_def.h |  3 ++
>  lib/Kconfig.kasan|  2 ++
>  mm/Kconfig.debug | 21 +
>  mm/kasan/common.c| 15 -
>  mm/slub.c| 66 +---

Nice!

It'd be good to add a test case to lib/test_kasan module. I think you
could just copy/adjust the test case "test_memcache_typesafe_by_rcu"
from the KFENCE KUnit test suite.
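
Roughly along these lines (a sketch modelled on the KFENCE test; names
and details are illustrative, not a final test):

  static void kmem_cache_rcu_uaf(struct kunit *test)
  {
          struct kmem_cache *cache;
          char *p;

          cache = kmem_cache_create("test_cache", 16, 0,
                                    SLAB_TYPESAFE_BY_RCU, NULL);
          KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache);

          p = kmem_cache_alloc(cache, GFP_KERNEL);
          KUNIT_ASSERT_NOT_ERR_OR_NULL(test, p);

          kmem_cache_free(cache, p);
          rcu_barrier();  /* wait for the RCU-delayed free to be processed */

          /*
           * With CONFIG_SLUB_RCU_DEBUG the object should now be poisoned,
           * and this access reported as a use-after-free.
           */
          KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)p)[0]);

          kmem_cache_destroy(cache);
  }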

>  6 files changed, 107 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/kasan.h b/incl

Re: [PATCH v3 1/2] kunit: add a KUnit test for SLUB debugging functionality

2021-04-08 Thread Marco Elver
On Tue, 6 Apr 2021 at 12:57, Vlastimil Babka  wrote:
>
>
> On 4/1/21 11:24 PM, Marco Elver wrote:
> > On Thu, 1 Apr 2021 at 21:04, Daniel Latypov  wrote:
> >> > }
> >> > #else
> >> > static inline bool slab_add_kunit_errors(void) { return false; }
> >> > #endif
> >> >
> >> > And anywhere you want to increase the error count, you'd call
> >> > slab_add_kunit_errors().
> >> >
> >> > Another benefit of this approach is that if KUnit is disabled, there is
> >> > zero overhead and no additional code generated (vs. the current
> >> > approach).
> >>
> >> The resource approach looks really good, but...
> >> You'd be picking up a dependency on
> >> https://lore.kernel.org/linux-kselftest/20210311152314.3814916-2-dlaty...@google.com/
> >> current->kunit_test will always be NULL unless CONFIG_KASAN=y &&
> >> CONFIG_KUNIT=y at the moment.
> >> My patch drops the CONFIG_KASAN requirement and opens it up to all tests.
> >
> > Oh, that's a shame, but hopefully it'll be in -next soon.
> >
> >> At the moment, it's just waiting another look over from Brendan or David.
> >> Any ETA on that, folks? :)
> >>
> >> So if you don't want to get blocked on that for now, I think it's fine to 
> >> add:
> >>   #ifdef CONFIG_SLUB_KUNIT_TEST
> >>   int errors;
> >>   #endif
> >
> > Until kunit fixes setting current->kunit_test, a cleaner workaround
> > that would allow to do the patch with kunit_resource, is to just have
> > an .init/.exit function that sets it ("current->kunit_test = test;").
> > And then perhaps add a note ("FIXME: ...") to remove it once the above
> > patch has landed.
> >
> > At least that way we get the least intrusive change for mm/slub.c, and
> > the test is the only thing that needs a 2-line patch to clean up
> > later.
>
> So when testing internally Oliver's new version with your suggestions (thanks
> again for those), I got lockdep splats because slab_add_kunit_errors is called
> also from irq disabled contexts, and kunit_find_named_resource will call
> spin_lock(&test->lock) that's not irq safe. Can we make the lock irq safe? I
> tried the change below and it made the problem go away. If you agree, the
> question is how to proceed - make it part of Oliver's patch series and let
> Andrew pick it all with eventually kunit team's acks on this patch, or 
> whatnot.

From what I can tell it should be fine to make it irq safe (ack for
your patch below). Regarding patch logistics, I'd probably add it to
the series. If that ends up not working, we'll find out sooner or
later.

(FYI, the prerequisite patch for current->kunit_test is in -next now.)

KUnit maintainers, do you have any preferences?

> 8<
>
> commit ab28505477892e9824c57ac338c88aec2ec0abce
> Author: Vlastimil Babka 
> Date:   Tue Apr 6 12:28:07 2021 +0200
>
> kunit: make test->lock irq safe
>
> diff --git a/include/kunit/test.h b/include/kunit/test.h
> index 49601c4b98b8..524d4789af22 100644
> --- a/include/kunit/test.h
> +++ b/include/kunit/test.h
> @@ -515,8 +515,9 @@ kunit_find_resource(struct kunit *test,
> void *match_data)
>  {
> struct kunit_resource *res, *found = NULL;
> +   unsigned long flags;
>
> -   spin_lock(&test->lock);
> +   spin_lock_irqsave(&test->lock, flags);
>
> list_for_each_entry_reverse(res, &test->resources, node) {
> if (match(test, res, (void *)match_data)) {
> @@ -526,7 +527,7 @@ kunit_find_resource(struct kunit *test,
> }
> }
>
> -   spin_unlock(&test->lock);
> +   spin_unlock_irqrestore(&test->lock, flags);
>
> return found;
>  }
> diff --git a/lib/kunit/test.c b/lib/kunit/test.c
> index ec9494e914ef..2c62eeb45b82 100644
> --- a/lib/kunit/test.c
> +++ b/lib/kunit/test.c
> @@ -442,6 +442,7 @@ int kunit_add_resource(struct kunit *test,
>void *data)
>  {
> int ret = 0;
> +   unsigned long flags;
>
> res->free = free;
> kref_init(&res->refcount);
> @@ -454,10 +455,10 @@ int kunit_add_resource(struct kunit *test,
> res->data = data;
> }
>
> -   spin_lock(&test->lock);
> +   spin_lock_irqsave(&test->lock, flags);
> list_add_tail(&res->node, &test->resources);
>

[PATCH v4 00/10] Add support for synchronous signals on perf events

2021-04-08 Thread Marco Elver
es. The
approach taken in "Add support for SIGTRAP on perf events" to trigger
the signal was suggested by Peter Zijlstra in [3].

[2] 
https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZD=-x...@mail.gmail.com/

[3] 
https://lore.kernel.org/lkml/ybv3rat566k+6...@hirez.programming.kicks-ass.net/

Marco Elver (9):
  perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children
  perf: Support only inheriting events if cloned with CLONE_THREAD
  perf: Add support for event removal on exec
  signal: Introduce TRAP_PERF si_code and si_perf to siginfo
  perf: Add support for SIGTRAP on perf events
  selftests/perf_events: Add kselftest for process-wide sigtrap handling
  selftests/perf_events: Add kselftest for remove_on_exec
  tools headers uapi: Sync tools/include/uapi/linux/perf_event.h
  perf test: Add basic stress test for sigtrap handling

Peter Zijlstra (1):
  perf: Rework perf_event_exit_event()

 arch/m68k/kernel/signal.c |   3 +
 arch/x86/kernel/signal_compat.c   |   5 +-
 fs/signalfd.c |   4 +
 include/linux/compat.h|   2 +
 include/linux/perf_event.h|   9 +-
 include/linux/signal.h|   1 +
 include/uapi/asm-generic/siginfo.h|   6 +-
 include/uapi/linux/perf_event.h   |  12 +-
 include/uapi/linux/signalfd.h |   4 +-
 kernel/events/core.c  | 302 +-
 kernel/fork.c |   2 +-
 kernel/signal.c   |  11 +
 tools/include/uapi/linux/perf_event.h |  12 +-
 tools/perf/tests/Build|   1 +
 tools/perf/tests/builtin-test.c   |   5 +
 tools/perf/tests/sigtrap.c| 150 +
 tools/perf/tests/tests.h  |   1 +
 .../testing/selftests/perf_events/.gitignore  |   3 +
 tools/testing/selftests/perf_events/Makefile  |   6 +
 tools/testing/selftests/perf_events/config|   1 +
 .../selftests/perf_events/remove_on_exec.c| 260 +++
 tools/testing/selftests/perf_events/settings  |   1 +
 .../selftests/perf_events/sigtrap_threads.c   | 210 
 23 files changed, 924 insertions(+), 87 deletions(-)
 create mode 100644 tools/perf/tests/sigtrap.c
 create mode 100644 tools/testing/selftests/perf_events/.gitignore
 create mode 100644 tools/testing/selftests/perf_events/Makefile
 create mode 100644 tools/testing/selftests/perf_events/config
 create mode 100644 tools/testing/selftests/perf_events/remove_on_exec.c
 create mode 100644 tools/testing/selftests/perf_events/settings
 create mode 100644 tools/testing/selftests/perf_events/sigtrap_threads.c

-- 
2.31.0.208.g409f899ff0-goog



[PATCH v4 01/10] perf: Rework perf_event_exit_event()

2021-04-08 Thread Marco Elver
From: Peter Zijlstra 

Make perf_event_exit_event() more robust, such that we can use it from
other contexts. Specifically the up and coming remove_on_exec.

For this to work we need to address a few issues. Remove_on_exec will
not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to
disable event_function_call() and we thus have to use
perf_remove_from_context().

When using perf_remove_from_context(), there's two races to consider.
The first is against close(), where we can have concurrent tear-down
of the event. The second is against child_list iteration, which should
not find a half baked event.

To address this, teach perf_remove_from_context() to special case
!ctx->is_active and about DETACH_CHILD.

Signed-off-by: Peter Zijlstra (Intel) 
[ el...@google.com: fix racing parent/child exit in sync_child_event(). ]
Signed-off-by: Marco Elver 
---
v4:
* Fix for parent and child racing to exit in sync_child_event().

v3:
* New dependency for series:
  https://lkml.kernel.org/r/YFn/i3akf+toj...@hirez.programming.kicks-ass.net
---
 include/linux/perf_event.h |   1 +
 kernel/events/core.c   | 142 +
 2 files changed, 80 insertions(+), 63 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3f7f89ea5e51..3d478abf411c 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -607,6 +607,7 @@ struct swevent_hlist {
 #define PERF_ATTACH_TASK_DATA  0x08
 #define PERF_ATTACH_ITRACE 0x10
 #define PERF_ATTACH_SCHED_CB   0x20
+#define PERF_ATTACH_CHILD  0x40
 
 struct perf_cgroup;
 struct perf_buffer;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 03db40f6cba9..e77294c7e654 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2204,6 +2204,26 @@ static void perf_group_detach(struct perf_event *event)
perf_event__header_size(leader);
 }
 
+static void sync_child_event(struct perf_event *child_event);
+
+static void perf_child_detach(struct perf_event *event)
+{
+   struct perf_event *parent_event = event->parent;
+
+   if (!(event->attach_state & PERF_ATTACH_CHILD))
+   return;
+
+   event->attach_state &= ~PERF_ATTACH_CHILD;
+
+   if (WARN_ON_ONCE(!parent_event))
+   return;
+
+   lockdep_assert_held(&parent_event->child_mutex);
+
+   sync_child_event(event);
+   list_del_init(&event->child_list);
+}
+
 static bool is_orphaned_event(struct perf_event *event)
 {
return event->state == PERF_EVENT_STATE_DEAD;
@@ -2311,6 +2331,7 @@ group_sched_out(struct perf_event *group_event,
 }
 
 #define DETACH_GROUP   0x01UL
+#define DETACH_CHILD   0x02UL
 
 /*
  * Cross CPU call to remove a performance event
@@ -2334,6 +2355,8 @@ __perf_remove_from_context(struct perf_event *event,
event_sched_out(event, cpuctx, ctx);
if (flags & DETACH_GROUP)
perf_group_detach(event);
+   if (flags & DETACH_CHILD)
+   perf_child_detach(event);
list_del_event(event, ctx);
 
if (!ctx->nr_events && ctx->is_active) {
@@ -2362,25 +2385,21 @@ static void perf_remove_from_context(struct perf_event 
*event, unsigned long fla
 
lockdep_assert_held(&ctx->mutex);
 
-   event_function_call(event, __perf_remove_from_context, (void *)flags);
-
/*
-* The above event_function_call() can NO-OP when it hits
-* TASK_TOMBSTONE. In that case we must already have been detached
-* from the context (by perf_event_exit_event()) but the grouping
-* might still be in-tact.
+* Because of perf_event_exit_task(), perf_remove_from_context() ought
+* to work in the face of TASK_TOMBSTONE, unlike every other
+* event_function_call() user.
 */
-   WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT);
-   if ((flags & DETACH_GROUP) &&
-   (event->attach_state & PERF_ATTACH_GROUP)) {
-   /*
-* Since in that case we cannot possibly be scheduled, simply
-* detach now.
-*/
-   raw_spin_lock_irq(&ctx->lock);
-   perf_group_detach(event);
+   raw_spin_lock_irq(&ctx->lock);
+   if (!ctx->is_active) {
+   __perf_remove_from_context(event, __get_cpu_context(ctx),
+  ctx, (void *)flags);
raw_spin_unlock_irq(&ctx->lock);
+   return;
}
+   raw_spin_unlock_irq(&ctx->lock);
+
+   event_function_call(event, __perf_remove_from_context, (void *)flags);
 }
 
 /*
@@ -12373,14 +12392,17 @@ void perf_pmu_migrate_context(struct pmu *pmu, int 
src_cpu, int dst_cpu)
 }
 EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
 
-static void sync_child_event(struct perf_event *child_event,
-  struct task_struc

[PATCH v4 02/10] perf: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children

2021-04-08 Thread Marco Elver
As with other ioctls (such as PERF_EVENT_IOC_{ENABLE,DISABLE}), fix up
handling of PERF_EVENT_IOC_MODIFY_ATTRIBUTES to also apply to children.
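
For illustration, a hypothetical user-space sketch (event_fd and
watched_var are assumed to exist; needs <linux/perf_event.h>,
<linux/hw_breakpoint.h> and <sys/ioctl.h>). With this change, modifying a
breakpoint like this also updates the inherited child events:

  struct perf_event_attr attr = {
          .type    = PERF_TYPE_BREAKPOINT,
          .size    = sizeof(attr),
          .bp_type = HW_BREAKPOINT_W,
          .bp_addr = (__u64)(unsigned long)&watched_var,
          .bp_len  = HW_BREAKPOINT_LEN_8,
  };

  if (ioctl(event_fd, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, &attr))
          perror("PERF_EVENT_IOC_MODIFY_ATTRIBUTES");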

Suggested-by: Dmitry Vyukov 
Reviewed-by: Dmitry Vyukov 
Signed-off-by: Marco Elver 
---
 kernel/events/core.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index e77294c7e654..a9a0a46909af 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3199,16 +3199,36 @@ static int perf_event_modify_breakpoint(struct 
perf_event *bp,
 static int perf_event_modify_attr(struct perf_event *event,
  struct perf_event_attr *attr)
 {
+   int (*func)(struct perf_event *, struct perf_event_attr *);
+   struct perf_event *child;
+   int err;
+
if (event->attr.type != attr->type)
return -EINVAL;
 
switch (event->attr.type) {
case PERF_TYPE_BREAKPOINT:
-   return perf_event_modify_breakpoint(event, attr);
+   func = perf_event_modify_breakpoint;
+   break;
default:
/* Place holder for future additions. */
return -EOPNOTSUPP;
}
+
+   WARN_ON_ONCE(event->ctx->parent_ctx);
+
+   mutex_lock(&event->child_mutex);
+   err = func(event, attr);
+   if (err)
+   goto out;
+   list_for_each_entry(child, &event->child_list, child_list) {
+   err = func(child, attr);
+   if (err)
+   goto out;
+   }
+out:
+   mutex_unlock(&event->child_mutex);
+   return err;
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,
-- 
2.31.0.208.g409f899ff0-goog



[PATCH v4 03/10] perf: Support only inheriting events if cloned with CLONE_THREAD

2021-04-08 Thread Marco Elver
Add the bit perf_event_attr::inherit_thread, to restrict inheriting
events to only those cases where the child was cloned with CLONE_THREAD.

This option supports the case where an event is supposed to be
process-wide only (including subthreads), but should not propagate
beyond the current process's shared environment.

Link: 
https://lore.kernel.org/lkml/ybvj6ejr%2fdy2t...@hirez.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra 
Signed-off-by: Marco Elver 
---
v2:
* Add patch to series.
---
 include/linux/perf_event.h  |  5 +++--
 include/uapi/linux/perf_event.h |  3 ++-
 kernel/events/core.c| 21 ++---
 kernel/fork.c   |  2 +-
 4 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 3d478abf411c..1660039199b2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -958,7 +958,7 @@ extern void __perf_event_task_sched_in(struct task_struct 
*prev,
   struct task_struct *task);
 extern void __perf_event_task_sched_out(struct task_struct *prev,
struct task_struct *next);
-extern int perf_event_init_task(struct task_struct *child);
+extern int perf_event_init_task(struct task_struct *child, u64 clone_flags);
 extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
 extern void perf_event_delayed_put(struct task_struct *task);
@@ -1449,7 +1449,8 @@ perf_event_task_sched_in(struct task_struct *prev,
 static inline void
 perf_event_task_sched_out(struct task_struct *prev,
  struct task_struct *next) { }
-static inline int perf_event_init_task(struct task_struct *child)  { 
return 0; }
+static inline int perf_event_init_task(struct task_struct *child,
+  u64 clone_flags) { 
return 0; }
 static inline void perf_event_exit_task(struct task_struct *child) { }
 static inline void perf_event_free_task(struct task_struct *task)  { }
 static inline void perf_event_delayed_put(struct task_struct *task){ }
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..813efb65fea8 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -389,7 +389,8 @@ struct perf_event_attr {
cgroup :  1, /* include cgroup events */
text_poke  :  1, /* include text poke 
events */
build_id   :  1, /* use build id in mmap2 
events */
-   __reserved_1   : 29;
+   inherit_thread :  1, /* children only inherit 
if cloned with CLONE_THREAD */
+   __reserved_1   : 28;
 
union {
__u32   wakeup_events;/* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a9a0a46909af..de2917b3c59e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -11649,6 +11649,9 @@ static int perf_copy_attr(struct perf_event_attr __user 
*uattr,
(attr->sample_type & PERF_SAMPLE_WEIGHT_STRUCT))
return -EINVAL;
 
+   if (!attr->inherit && attr->inherit_thread)
+   return -EINVAL;
+
 out:
return ret;
 
@@ -12869,12 +12872,13 @@ static int
 inherit_task_group(struct perf_event *event, struct task_struct *parent,
   struct perf_event_context *parent_ctx,
   struct task_struct *child, int ctxn,
-  int *inherited_all)
+  u64 clone_flags, int *inherited_all)
 {
int ret;
struct perf_event_context *child_ctx;
 
-   if (!event->attr.inherit) {
+   if (!event->attr.inherit ||
+   (event->attr.inherit_thread && !(clone_flags & CLONE_THREAD))) {
*inherited_all = 0;
return 0;
}
@@ -12906,7 +12910,8 @@ inherit_task_group(struct perf_event *event, struct 
task_struct *parent,
 /*
  * Initialize the perf_event context in task_struct
  */
-static int perf_event_init_context(struct task_struct *child, int ctxn)
+static int perf_event_init_context(struct task_struct *child, int ctxn,
+  u64 clone_flags)
 {
struct perf_event_context *child_ctx, *parent_ctx;
struct perf_event_context *cloned_ctx;
@@ -12946,7 +12951,8 @@ static int perf_event_init_context(struct task_struct 
*child, int ctxn)
 */
perf_event_groups_for_each(event, &parent_ctx->pinned_groups) {
ret = inherit_task_group(event, parent, parent_ctx,
-child, ctxn, &inherited_all);
+child, ctxn, clone_flags,
+ 

[PATCH v4 04/10] perf: Add support for event removal on exec

2021-04-08 Thread Marco Elver
Adds bit perf_event_attr::remove_on_exec, to support removing an event
from a task on exec.

This option supports the case where an event is supposed to be
process-wide only, and should not propagate beyond exec, to limit
monitoring to the original process image only.
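
As a rough user-space illustration (not part of the patch; assumes the
updated uapi header, and the helper is hypothetical), a monitor could ask
for its counter to be torn down as soon as the task execs:

  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Sketch: an inherited counter that is removed once a task calls execve(). */
  static int open_counter_until_exec(void)
  {
          struct perf_event_attr attr = {
                  .type           = PERF_TYPE_HARDWARE,
                  .size           = sizeof(attr),
                  .config         = PERF_COUNT_HW_INSTRUCTIONS,
                  .disabled       = 1,
                  .inherit        = 1, /* follow into children ...        */
                  .remove_on_exec = 1, /* ... but drop the event on exec  */
          };

          /* Note: perf_copy_attr() rejects remove_on_exec + enable_on_exec. */
          return syscall(__NR_perf_event_open, &attr, 0, -1, -1,
                         PERF_FLAG_FD_CLOEXEC);
  }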

Suggested-by: Peter Zijlstra 
Signed-off-by: Marco Elver 
---
v3:
* Rework based on Peter's "perf: Rework perf_event_exit_event()" added
  to the beginning of the series. Intermediate attempts between v2 and
  this v3 can be found here:
  https://lkml.kernel.org/r/yfm6aaksrlf2n...@elver.google.com

v2:
* Add patch to series.
---
 include/uapi/linux/perf_event.h |  3 +-
 kernel/events/core.c| 70 +
 2 files changed, 64 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 813efb65fea8..8c5b9f5ad63f 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -390,7 +390,8 @@ struct perf_event_attr {
text_poke  :  1, /* include text poke 
events */
build_id   :  1, /* use build id in mmap2 
events */
inherit_thread :  1, /* children only inherit 
if cloned with CLONE_THREAD */
-   __reserved_1   : 28;
+   remove_on_exec :  1, /* event is removed from 
task on exec */
+   __reserved_1   : 27;
 
union {
__u32   wakeup_events;/* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index de2917b3c59e..19c045ff2b9c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4247,6 +4247,57 @@ static void perf_event_enable_on_exec(int ctxn)
put_ctx(clone_ctx);
 }
 
+static void perf_remove_from_owner(struct perf_event *event);
+static void perf_event_exit_event(struct perf_event *event,
+ struct perf_event_context *ctx);
+
+/*
+ * Removes all events from the current task that have been marked
+ * remove-on-exec, and feeds their values back to parent events.
+ */
+static void perf_event_remove_on_exec(int ctxn)
+{
+   struct perf_event_context *ctx, *clone_ctx = NULL;
+   struct perf_event *event, *next;
+   LIST_HEAD(free_list);
+   unsigned long flags;
+   bool modified = false;
+
+   ctx = perf_pin_task_context(current, ctxn);
+   if (!ctx)
+   return;
+
+   mutex_lock(&ctx->mutex);
+
+   if (WARN_ON_ONCE(ctx->task != current))
+   goto unlock;
+
+   list_for_each_entry_safe(event, next, &ctx->event_list, event_entry) {
+   if (!event->attr.remove_on_exec)
+   continue;
+
+   if (!is_kernel_event(event))
+   perf_remove_from_owner(event);
+
+   modified = true;
+
+   perf_event_exit_event(event, ctx);
+   }
+
+   raw_spin_lock_irqsave(&ctx->lock, flags);
+   if (modified)
+   clone_ctx = unclone_ctx(ctx);
+   --ctx->pin_count;
+   raw_spin_unlock_irqrestore(&ctx->lock, flags);
+
+unlock:
+   mutex_unlock(&ctx->mutex);
+
+   put_ctx(ctx);
+   if (clone_ctx)
+   put_ctx(clone_ctx);
+}
+
 struct perf_read_data {
struct perf_event *event;
bool group;
@@ -7559,18 +7610,18 @@ void perf_event_exec(void)
struct perf_event_context *ctx;
int ctxn;
 
-   rcu_read_lock();
for_each_task_context_nr(ctxn) {
-   ctx = current->perf_event_ctxp[ctxn];
-   if (!ctx)
-   continue;
-
perf_event_enable_on_exec(ctxn);
+   perf_event_remove_on_exec(ctxn);
 
-   perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL,
-  true);
+   rcu_read_lock();
+   ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
+   if (ctx) {
+   perf_iterate_ctx(ctx, perf_event_addr_filters_exec,
+NULL, true);
+   }
+   rcu_read_unlock();
}
-   rcu_read_unlock();
 }
 
 struct remote_output {
@@ -11652,6 +11703,9 @@ static int perf_copy_attr(struct perf_event_attr __user 
*uattr,
if (!attr->inherit && attr->inherit_thread)
return -EINVAL;
 
+   if (attr->remove_on_exec && attr->enable_on_exec)
+   return -EINVAL;
+
 out:
return ret;
 
-- 
2.31.0.208.g409f899ff0-goog



[PATCH v4 05/10] signal: Introduce TRAP_PERF si_code and si_perf to siginfo

2021-04-08 Thread Marco Elver
Introduces the TRAP_PERF si_code, and associated siginfo_t field
si_perf. These will be used by the perf event subsystem to send signals
(if requested) to the task where an event occurred.
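
For context, a hedged sketch of the user-space side (requires headers with
this patch applied; the handler below is illustrative only and not part of
the change): a SIGTRAP handler can use the new si_code to tell
perf-generated traps apart from other SIGTRAP sources.

  #include <signal.h>
  #include <string.h>

  static siginfo_t last_perf_siginfo;

  static void sigtrap_handler(int sig, siginfo_t *info, void *ucontext)
  {
          if (info->si_code != TRAP_PERF)
                  return; /* debugger breakpoint, single-step, ... */

          /* si_perf (and si_addr, where set) identify the event. */
          last_perf_siginfo = *info;
  }

  static void install_sigtrap_handler(void)
  {
          struct sigaction sa;

          memset(&sa, 0, sizeof(sa));
          sa.sa_sigaction = sigtrap_handler;
          sa.sa_flags = SA_SIGINFO;
          sigemptyset(&sa.sa_mask);
          sigaction(SIGTRAP, &sa, NULL);
  }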

Acked-by: Geert Uytterhoeven  # m68k
Acked-by: Arnd Bergmann  # asm-generic
Signed-off-by: Marco Elver 
---
 arch/m68k/kernel/signal.c  |  3 +++
 arch/x86/kernel/signal_compat.c|  5 -
 fs/signalfd.c  |  4 
 include/linux/compat.h |  2 ++
 include/linux/signal.h |  1 +
 include/uapi/asm-generic/siginfo.h |  6 +-
 include/uapi/linux/signalfd.h  |  4 +++-
 kernel/signal.c| 11 +++
 8 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/m68k/kernel/signal.c b/arch/m68k/kernel/signal.c
index 349570f16a78..a4b7ee1df211 100644
--- a/arch/m68k/kernel/signal.c
+++ b/arch/m68k/kernel/signal.c
@@ -622,6 +622,9 @@ static inline void siginfo_build_tests(void)
/* _sigfault._addr_pkey */
BUILD_BUG_ON(offsetof(siginfo_t, si_pkey) != 0x12);
 
+   /* _sigfault._perf */
+   BUILD_BUG_ON(offsetof(siginfo_t, si_perf) != 0x10);
+
/* _sigpoll */
BUILD_BUG_ON(offsetof(siginfo_t, si_band)   != 0x0c);
BUILD_BUG_ON(offsetof(siginfo_t, si_fd) != 0x10);
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a5330ff498f0..0e5d0a7e203b 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -29,7 +29,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGFPE  != 15);
BUILD_BUG_ON(NSIGSEGV != 9);
BUILD_BUG_ON(NSIGBUS  != 5);
-   BUILD_BUG_ON(NSIGTRAP != 5);
+   BUILD_BUG_ON(NSIGTRAP != 6);
BUILD_BUG_ON(NSIGCHLD != 6);
BUILD_BUG_ON(NSIGSYS  != 2);
 
@@ -138,6 +138,9 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(offsetof(siginfo_t, si_pkey) != 0x20);
BUILD_BUG_ON(offsetof(compat_siginfo_t, si_pkey) != 0x14);
 
+   BUILD_BUG_ON(offsetof(siginfo_t, si_perf) != 0x18);
+   BUILD_BUG_ON(offsetof(compat_siginfo_t, si_perf) != 0x10);
+
CHECK_CSI_OFFSET(_sigpoll);
CHECK_CSI_SIZE  (_sigpoll, 2*sizeof(int));
CHECK_SI_SIZE   (_sigpoll, 4*sizeof(int));
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 456046e15873..040a1142915f 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -134,6 +134,10 @@ static int signalfd_copyinfo(struct signalfd_siginfo 
__user *uinfo,
 #endif
new.ssi_addr_lsb = (short) kinfo->si_addr_lsb;
break;
+   case SIL_PERF_EVENT:
+   new.ssi_addr = (long) kinfo->si_addr;
+   new.ssi_perf = kinfo->si_perf;
+   break;
case SIL_CHLD:
new.ssi_pid= kinfo->si_pid;
new.ssi_uid= kinfo->si_uid;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 6e65be753603..c8821d966812 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -236,6 +236,8 @@ typedef struct compat_siginfo {
char 
_dummy_pkey[__COMPAT_ADDR_BND_PKEY_PAD];
u32 _pkey;
} _addr_pkey;
+   /* used when si_code=TRAP_PERF */
+   compat_u64 _perf;
};
} _sigfault;
 
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 205526c4003a..1e98548d7cf6 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -43,6 +43,7 @@ enum siginfo_layout {
SIL_FAULT_MCEERR,
SIL_FAULT_BNDERR,
SIL_FAULT_PKUERR,
+   SIL_PERF_EVENT,
SIL_CHLD,
SIL_RT,
SIL_SYS,
diff --git a/include/uapi/asm-generic/siginfo.h 
b/include/uapi/asm-generic/siginfo.h
index d2597000407a..d0bb9125c853 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -91,6 +91,8 @@ union __sifields {
char _dummy_pkey[__ADDR_BND_PKEY_PAD];
__u32 _pkey;
} _addr_pkey;
+   /* used when si_code=TRAP_PERF */
+   __u64 _perf;
};
} _sigfault;
 
@@ -155,6 +157,7 @@ typedef struct siginfo {
 #define si_lower   _sifields._sigfault._addr_bnd._lower
 #define si_upper   _sifields._sigfault._addr_bnd._upper
 #define si_pkey_sifields._sigfault._addr_pkey._pkey
+#define si_perf_sifields._sigfault._perf
 #define si_band_sifields._sigpoll._band
 #define si_fd  _sifields._sigpoll._fd
 #define si_call_addr   _sifields._sigsys._call_addr
@@ -253,7 +256,8 @@ typedef struct siginfo {
 #define TRAP_BRANCH 3  /* process taken branch trap */
 #define TRAP_HWBKPT 4  /* hardware breakpoint/watchpoint */
 #define TRAP_UNK   5  

[PATCH v4 06/10] perf: Add support for SIGTRAP on perf events

2021-04-08 Thread Marco Elver
Adds bit perf_event_attr::sigtrap, which can be set to cause events to
send SIGTRAP (with si_code TRAP_PERF) to the task where the event
occurred. The primary motivation is to support synchronous signals on
perf events in the task where an event (such as a breakpoint) triggered.

To distinguish perf events based on the event type, the type is set in
si_errno. For events that are associated with an address, si_addr is
copied from perf_sample_data.

The new field perf_event_attr::sig_data is copied to si_perf, which
allows user space to disambiguate which event (of the same type)
triggered the signal. For example, user space could encode the relevant
information it cares about in sig_data.

We note that the choice of an opaque u64 provides the simplest and most
flexible option. Alternatives where a reference to some user space data
is passed back suffer from the problem that modification of referenced
data (be it the event fd, or the perf_event_attr) can race with the
signal being delivered (of course, the same caveat applies if user space
decides to store a pointer in sig_data, but the ABI explicitly avoids
prescribing such a design).
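
To make the round trip concrete, a hedged sketch (the helper is
hypothetical; assumes the uapi and siginfo headers from this series): user
space stores an opaque value in sig_data and receives it back in si_perf
when the SIGTRAP fires.

  #include <linux/hw_breakpoint.h>
  #include <linux/perf_event.h>
  #include <stdint.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Watch one byte at 'addr'; 'tag' is returned to the handler in si_perf. */
  static int open_watchpoint(volatile void *addr, uint64_t tag)
  {
          struct perf_event_attr attr = {
                  .type           = PERF_TYPE_BREAKPOINT,
                  .size           = sizeof(attr),
                  .sample_period  = 1,
                  .bp_addr        = (unsigned long)addr,
                  .bp_type        = HW_BREAKPOINT_RW,
                  .bp_len         = HW_BREAKPOINT_LEN_1,
                  .inherit        = 1,
                  .inherit_thread = 1,
                  .remove_on_exec = 1, /* required by sigtrap */
                  .sigtrap        = 1, /* SIGTRAP/TRAP_PERF on each event */
                  .sig_data       = tag, /* surfaces as si_perf */
          };

          return syscall(__NR_perf_event_open, &attr, 0, -1, -1,
                         PERF_FLAG_FD_CLOEXEC);
  }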

Link: https://lore.kernel.org/lkml/ybv3rat566k+6...@hirez.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra 
Acked-by: Dmitry Vyukov 
Signed-off-by: Marco Elver 
---
v4:
* Generalize setting si_perf and si_addr independent of event type;
  introduces perf_event_attr::sig_data, which can be set by user space to
  be propagated to si_perf.
* Fix race between irq_work running and task's sighand being released by
  release_task().
* Warning in perf_sigtrap() if ctx->task and current mismatch; we expect
  this on architectures that do not properly implement
  arch_irq_work_raise().
* Require events that want sigtrap to be associated with a task.

v2:
* Use atomic_set(&event_count, 1), since it must always be 0 in
  perf_pending_event_disable().
* Implicitly restrict inheriting events if sigtrap, but the child was
  cloned with CLONE_CLEAR_SIGHAND, because it is not generally safe if
  the child cleared all signal handlers to continue sending SIGTRAP.
---
 include/linux/perf_event.h  |  3 ++
 include/uapi/linux/perf_event.h | 10 ++-
 kernel/events/core.c| 49 -
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1660039199b2..18ba1282c5c7 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -778,6 +778,9 @@ struct perf_event {
void *security;
 #endif
struct list_headsb_list;
+
+   /* Address associated with event, which can be passed to siginfo_t. */
+   u64 sig_addr;
 #endif /* CONFIG_PERF_EVENTS */
 };
 
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 8c5b9f5ad63f..31b00e3b69c9 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -311,6 +311,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5112 /* add: aux_watermark */
 #define PERF_ATTR_SIZE_VER6120 /* add: aux_sample_size */
+#define PERF_ATTR_SIZE_VER7128 /* add: sig_data */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -391,7 +392,8 @@ struct perf_event_attr {
build_id   :  1, /* use build id in mmap2 
events */
inherit_thread :  1, /* children only inherit 
if cloned with CLONE_THREAD */
remove_on_exec :  1, /* event is removed from 
task on exec */
-   __reserved_1   : 27;
+   sigtrap:  1, /* send synchronous 
SIGTRAP on event */
+   __reserved_1   : 26;
 
union {
__u32   wakeup_events;/* wakeup every n events */
@@ -443,6 +445,12 @@ struct perf_event_attr {
__u16   __reserved_2;
__u32   aux_sample_size;
__u32   __reserved_3;
+
+   /*
+* User provided data if sigtrap=1, passed back to user via
+* siginfo_t::si_perf, e.g. to permit user to identify the event.
+*/
+   __u64   sig_data;
 };
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 19c045ff2b9c..1d2077389c0c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6391,6 +6391,33 @@ void perf_event_wakeup(struct perf_event *event)
}
 }
 
+static void perf_sigtrap(struct perf_event *event)
+{
+   struct kernel_siginfo info;
+
+   /*
+* We'd expect this to only occur if the irq_work is delayed and either
+* ctx->task or current has changed in the meantime. This can be the
+* case on architectures that do not implement arch_irq_work_raise().
+*/
+   if (WARN_ON_ONCE(event-&

[PATCH v4 07/10] selftests/perf_events: Add kselftest for process-wide sigtrap handling

2021-04-08 Thread Marco Elver
Add a kselftest for testing process-wide perf events with synchronous
SIGTRAP on events (using breakpoints). In particular, we want to test
that changes to the event propagate to all children, and the SIGTRAPs
are in fact synchronously sent to the thread where the event occurred.

Note: The "signal_stress" test case is also added later in the series to
perf tool's built-in tests. The test here is more elaborate in that
respect, which on one hand avoids bloating the perf tool unnecessarily,
but we also benefit from structured tests with TAP-compliant output that
the kselftest framework provides.

Signed-off-by: Marco Elver 
---
v4:
* Update for new perf_event_attr::sig_data / si_perf handling.

v3:
* Fix for latest libc signal.h.

v2:
* Patch added to series.
---
 .../testing/selftests/perf_events/.gitignore  |   2 +
 tools/testing/selftests/perf_events/Makefile  |   6 +
 tools/testing/selftests/perf_events/config|   1 +
 tools/testing/selftests/perf_events/settings  |   1 +
 .../selftests/perf_events/sigtrap_threads.c   | 210 ++
 5 files changed, 220 insertions(+)
 create mode 100644 tools/testing/selftests/perf_events/.gitignore
 create mode 100644 tools/testing/selftests/perf_events/Makefile
 create mode 100644 tools/testing/selftests/perf_events/config
 create mode 100644 tools/testing/selftests/perf_events/settings
 create mode 100644 tools/testing/selftests/perf_events/sigtrap_threads.c

diff --git a/tools/testing/selftests/perf_events/.gitignore 
b/tools/testing/selftests/perf_events/.gitignore
new file mode 100644
index ..4dc43e1bd79c
--- /dev/null
+++ b/tools/testing/selftests/perf_events/.gitignore
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+sigtrap_threads
diff --git a/tools/testing/selftests/perf_events/Makefile 
b/tools/testing/selftests/perf_events/Makefile
new file mode 100644
index ..973a2c39ca83
--- /dev/null
+++ b/tools/testing/selftests/perf_events/Makefile
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: GPL-2.0
+CFLAGS += -Wl,-no-as-needed -Wall -I../../../../usr/include
+LDFLAGS += -lpthread
+
+TEST_GEN_PROGS := sigtrap_threads
+include ../lib.mk
diff --git a/tools/testing/selftests/perf_events/config 
b/tools/testing/selftests/perf_events/config
new file mode 100644
index ..ba58ff2203e4
--- /dev/null
+++ b/tools/testing/selftests/perf_events/config
@@ -0,0 +1 @@
+CONFIG_PERF_EVENTS=y
diff --git a/tools/testing/selftests/perf_events/settings 
b/tools/testing/selftests/perf_events/settings
new file mode 100644
index ..6091b45d226b
--- /dev/null
+++ b/tools/testing/selftests/perf_events/settings
@@ -0,0 +1 @@
+timeout=120
diff --git a/tools/testing/selftests/perf_events/sigtrap_threads.c 
b/tools/testing/selftests/perf_events/sigtrap_threads.c
new file mode 100644
index ..9c0fd442da60
--- /dev/null
+++ b/tools/testing/selftests/perf_events/sigtrap_threads.c
@@ -0,0 +1,210 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test for perf events with SIGTRAP across all threads.
+ *
+ * Copyright (C) 2021, Google LLC.
+ */
+
+#define _GNU_SOURCE
+
+/* We need the latest siginfo from the kernel repo. */
+#include 
+#include 
+#define __have_siginfo_t 1
+#define __have_sigval_t 1
+#define __have_sigevent_t 1
+#define __siginfo_t_defined
+#define __sigval_t_defined
+#define __sigevent_t_defined
+#define _BITS_SIGINFO_CONSTS_H 1
+#define _BITS_SIGEVENT_CONSTS_H 1
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest_harness.h"
+
+#define NUM_THREADS 5
+
+/* Data shared between test body, threads, and signal handler. */
+static struct {
+   int tids_want_signal;   /* Which threads still want a signal. */
+   int signal_count;   /* Sanity check number of signals 
received. */
+   volatile int iterate_on;/* Variable to set breakpoint on. */
+   siginfo_t first_siginfo;/* First observed siginfo_t. */
+} ctx;
+
+/* Unique value to check si_perf is correctly set from 
perf_event_attr::sig_data. */
+#define TEST_SIG_DATA(addr) (~(uint64_t)(addr))
+
+static struct perf_event_attr make_event_attr(bool enabled, volatile void 
*addr)
+{
+   struct perf_event_attr attr = {
+   .type   = PERF_TYPE_BREAKPOINT,
+   .size   = sizeof(attr),
+   .sample_period  = 1,
+   .disabled   = !enabled,
+   .bp_addr= (unsigned long)addr,
+   .bp_type= HW_BREAKPOINT_RW,
+   .bp_len = HW_BREAKPOINT_LEN_1,
+   .inherit= 1, /* Children inherit events ... */
+   .inherit_thread = 1, /* ... but only cloned with CLONE_THREAD. 
*/
+   .remove_on_exec = 1, /* Required by sigtrap. */
+   .sigtrap= 1, /* Request synchronous SIGTRAP on event. */
+   .sig_data  

[PATCH v4 08/10] selftests/perf_events: Add kselftest for remove_on_exec

2021-04-08 Thread Marco Elver
Add kselftest to test that remove_on_exec removes inherited events from
child tasks.

Signed-off-by: Marco Elver 
---
v3:
* Fix for latest libc signal.h.

v2:
* Add patch to series.
---
 .../testing/selftests/perf_events/.gitignore  |   1 +
 tools/testing/selftests/perf_events/Makefile  |   2 +-
 .../selftests/perf_events/remove_on_exec.c| 260 ++
 3 files changed, 262 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/perf_events/remove_on_exec.c

diff --git a/tools/testing/selftests/perf_events/.gitignore 
b/tools/testing/selftests/perf_events/.gitignore
index 4dc43e1bd79c..790c47001e77 100644
--- a/tools/testing/selftests/perf_events/.gitignore
+++ b/tools/testing/selftests/perf_events/.gitignore
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0-only
 sigtrap_threads
+remove_on_exec
diff --git a/tools/testing/selftests/perf_events/Makefile 
b/tools/testing/selftests/perf_events/Makefile
index 973a2c39ca83..fcafa5f0d34c 100644
--- a/tools/testing/selftests/perf_events/Makefile
+++ b/tools/testing/selftests/perf_events/Makefile
@@ -2,5 +2,5 @@
 CFLAGS += -Wl,-no-as-needed -Wall -I../../../../usr/include
 LDFLAGS += -lpthread
 
-TEST_GEN_PROGS := sigtrap_threads
+TEST_GEN_PROGS := sigtrap_threads remove_on_exec
 include ../lib.mk
diff --git a/tools/testing/selftests/perf_events/remove_on_exec.c 
b/tools/testing/selftests/perf_events/remove_on_exec.c
new file mode 100644
index ..5814611a1dc7
--- /dev/null
+++ b/tools/testing/selftests/perf_events/remove_on_exec.c
@@ -0,0 +1,260 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Test for remove_on_exec.
+ *
+ * Copyright (C) 2021, Google LLC.
+ */
+
+#define _GNU_SOURCE
+
+/* We need the latest siginfo from the kernel repo. */
+#include 
+#include 
+#define __have_siginfo_t 1
+#define __have_sigval_t 1
+#define __have_sigevent_t 1
+#define __siginfo_t_defined
+#define __sigval_t_defined
+#define __sigevent_t_defined
+#define _BITS_SIGINFO_CONSTS_H 1
+#define _BITS_SIGEVENT_CONSTS_H 1
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest_harness.h"
+
+static volatile int signal_count;
+
+static struct perf_event_attr make_event_attr(void)
+{
+   struct perf_event_attr attr = {
+   .type   = PERF_TYPE_HARDWARE,
+   .size   = sizeof(attr),
+   .config = PERF_COUNT_HW_INSTRUCTIONS,
+   .sample_period  = 1000,
+   .exclude_kernel = 1,
+   .exclude_hv = 1,
+   .disabled   = 1,
+   .inherit= 1,
+   /*
+* Children normally retain their inherited event on exec; with
+* remove_on_exec, we'll remove their event, but the parent and
+* any other non-exec'd children will keep their events.
+*/
+   .remove_on_exec = 1,
+   .sigtrap= 1,
+   };
+   return attr;
+}
+
+static void sigtrap_handler(int signum, siginfo_t *info, void *ucontext)
+{
+   if (info->si_code != TRAP_PERF) {
+   fprintf(stderr, "%s: unexpected si_code %d\n", __func__, 
info->si_code);
+   return;
+   }
+
+   signal_count++;
+}
+
+FIXTURE(remove_on_exec)
+{
+   struct sigaction oldact;
+   int fd;
+};
+
+FIXTURE_SETUP(remove_on_exec)
+{
+   struct perf_event_attr attr = make_event_attr();
+   struct sigaction action = {};
+
+   signal_count = 0;
+
+   /* Initialize sigtrap handler. */
+   action.sa_flags = SA_SIGINFO | SA_NODEFER;
+   action.sa_sigaction = sigtrap_handler;
+   sigemptyset(&action.sa_mask);
+   ASSERT_EQ(sigaction(SIGTRAP, &action, &self->oldact), 0);
+
+   /* Initialize perf event. */
+   self->fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 
PERF_FLAG_FD_CLOEXEC);
+   ASSERT_NE(self->fd, -1);
+}
+
+FIXTURE_TEARDOWN(remove_on_exec)
+{
+   close(self->fd);
+   sigaction(SIGTRAP, &self->oldact, NULL);
+}
+
+/* Verify event propagates to fork'd child. */
+TEST_F(remove_on_exec, fork_only)
+{
+   int status;
+   pid_t pid = fork();
+
+   if (pid == 0) {
+   ASSERT_EQ(signal_count, 0);
+   ASSERT_EQ(ioctl(self->fd, PERF_EVENT_IOC_ENABLE, 0), 0);
+   while (!signal_count);
+   _exit(42);
+   }
+
+   while (!signal_count); /* Child enables event. */
+   EXPECT_EQ(waitpid(pid, &status, 0), pid);
+   EXPECT_EQ(WEXITSTATUS(status), 42);
+}
+
+/*
+ * Verify that event does _not_ propagate to fork+exec'd child; event enabled
+ * after fork+exec.
+ */
+TEST_F(remove_on_exec, fork_exec_then_enable)
+{
+   pid_t pid_exec, pid_only_fork;
+   int pipefd[2];
+   int tmp;
+
+   /*
+* Non-exec child, to ensure exec does not affect

[PATCH v4 09/10] tools headers uapi: Sync tools/include/uapi/linux/perf_event.h

2021-04-08 Thread Marco Elver
Sync tool's uapi to pick up the changes adding inherit_thread,
remove_on_exec, and sigtrap fields to perf_event_attr.

Signed-off-by: Marco Elver 
---
v4:
* Update for new perf_event_attr::sig_data.

v3:
* Added to series.
---
 tools/include/uapi/linux/perf_event.h | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index ad15e40d7f5d..31b00e3b69c9 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -311,6 +311,7 @@ enum perf_event_read_format {
 #define PERF_ATTR_SIZE_VER4104 /* add: sample_regs_intr */
 #define PERF_ATTR_SIZE_VER5112 /* add: aux_watermark */
 #define PERF_ATTR_SIZE_VER6120 /* add: aux_sample_size */
+#define PERF_ATTR_SIZE_VER7128 /* add: sig_data */
 
 /*
  * Hardware event_id to monitor via a performance monitoring event:
@@ -389,7 +390,10 @@ struct perf_event_attr {
cgroup :  1, /* include cgroup events */
text_poke  :  1, /* include text poke 
events */
build_id   :  1, /* use build id in mmap2 
events */
-   __reserved_1   : 29;
+   inherit_thread :  1, /* children only inherit 
if cloned with CLONE_THREAD */
+   remove_on_exec :  1, /* event is removed from 
task on exec */
+   sigtrap:  1, /* send synchronous 
SIGTRAP on event */
+   __reserved_1   : 26;
 
union {
__u32   wakeup_events;/* wakeup every n events */
@@ -441,6 +445,12 @@ struct perf_event_attr {
__u16   __reserved_2;
__u32   aux_sample_size;
__u32   __reserved_3;
+
+   /*
+* User provided data if sigtrap=1, passed back to user via
+* siginfo_t::si_perf, e.g. to permit user to identify the event.
+*/
+   __u64   sig_data;
 };
 
 /*
-- 
2.31.0.208.g409f899ff0-goog



[PATCH v4 10/10] perf test: Add basic stress test for sigtrap handling

2021-04-08 Thread Marco Elver
Add basic stress test for sigtrap handling as a perf tool built-in test.
This allows sanity checking the basic sigtrap functionality from within
the perf tool.

Note: A more elaborate kselftest version of this test can also be found
in tools/testing/selftests/perf_events/sigtrap_threads.c.

Signed-off-by: Marco Elver 
---
v4:
* Update for new perf_event_attr::sig_data / si_perf handling.

v3:
* Added to series (per suggestion from Ian Rogers).
---
 tools/perf/tests/Build  |   1 +
 tools/perf/tests/builtin-test.c |   5 ++
 tools/perf/tests/sigtrap.c  | 150 
 tools/perf/tests/tests.h|   1 +
 4 files changed, 157 insertions(+)
 create mode 100644 tools/perf/tests/sigtrap.c

diff --git a/tools/perf/tests/Build b/tools/perf/tests/Build
index 650aec19d490..a429c7a02b37 100644
--- a/tools/perf/tests/Build
+++ b/tools/perf/tests/Build
@@ -64,6 +64,7 @@ perf-y += parse-metric.o
 perf-y += pe-file-parsing.o
 perf-y += expand-cgroup.o
 perf-y += perf-time-to-tsc.o
+perf-y += sigtrap.o
 
 $(OUTPUT)tests/llvm-src-base.c: tests/bpf-script-example.c tests/Build
$(call rule_mkdir)
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index c4b888f18e9c..28a1cb5eaa77 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -359,6 +359,11 @@ static struct test generic_tests[] = {
.func = test__perf_time_to_tsc,
.is_supported = test__tsc_is_supported,
},
+   {
+   .desc = "Sigtrap support",
+   .func = test__sigtrap,
+   .is_supported = test__wp_is_supported, /* uses wp for test */
+   },
{
.func = NULL,
},
diff --git a/tools/perf/tests/sigtrap.c b/tools/perf/tests/sigtrap.c
new file mode 100644
index ..c367cc2f64d5
--- /dev/null
+++ b/tools/perf/tests/sigtrap.c
@@ -0,0 +1,150 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Basic test for sigtrap support.
+ *
+ * Copyright (C) 2021, Google LLC.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "cloexec.h"
+#include "debug.h"
+#include "event.h"
+#include "tests.h"
+#include "../perf-sys.h"
+
+#define NUM_THREADS 5
+
+static struct {
+   int tids_want_signal;   /* Which threads still want a signal. */
+   int signal_count;   /* Sanity check number of signals 
received. */
+   volatile int iterate_on;/* Variable to set breakpoint on. */
+   siginfo_t first_siginfo;/* First observed siginfo_t. */
+} ctx;
+
+#define TEST_SIG_DATA (~(uint64_t)(&ctx.iterate_on))
+
+static struct perf_event_attr make_event_attr(void)
+{
+   struct perf_event_attr attr = {
+   .type   = PERF_TYPE_BREAKPOINT,
+   .size   = sizeof(attr),
+   .sample_period  = 1,
+   .disabled   = 1,
+   .bp_addr= (unsigned long)&ctx.iterate_on,
+   .bp_type= HW_BREAKPOINT_RW,
+   .bp_len = HW_BREAKPOINT_LEN_1,
+   .inherit= 1, /* Children inherit events ... */
+   .inherit_thread = 1, /* ... but only cloned with CLONE_THREAD. 
*/
+   .remove_on_exec = 1, /* Required by sigtrap. */
+   .sigtrap= 1, /* Request synchronous SIGTRAP on event. */
+   .sig_data   = TEST_SIG_DATA,
+   };
+   return attr;
+}
+
+static void
+sigtrap_handler(int signum __maybe_unused, siginfo_t *info, void *ucontext 
__maybe_unused)
+{
+   if (!__atomic_fetch_add(&ctx.signal_count, 1, __ATOMIC_RELAXED))
+   ctx.first_siginfo = *info;
+   __atomic_fetch_sub(&ctx.tids_want_signal, syscall(SYS_gettid), 
__ATOMIC_RELAXED);
+}
+
+static void *test_thread(void *arg)
+{
+   pthread_barrier_t *barrier = (pthread_barrier_t *)arg;
+   pid_t tid = syscall(SYS_gettid);
+   int i;
+
+   pthread_barrier_wait(barrier);
+
+   __atomic_fetch_add(&ctx.tids_want_signal, tid, __ATOMIC_RELAXED);
+   for (i = 0; i < ctx.iterate_on - 1; i++)
+   __atomic_fetch_add(&ctx.tids_want_signal, tid, 
__ATOMIC_RELAXED);
+
+   return NULL;
+}
+
+static int run_test_threads(pthread_t *threads, pthread_barrier_t *barrier)
+{
+   int i;
+
+   pthread_barrier_wait(barrier);
+   for (i = 0; i < NUM_THREADS; i++)
+   TEST_ASSERT_EQUAL("pthread_join() failed", 
pthread_join(threads[i], NULL), 0);
+
+   return TEST_OK;
+}
+
+static int run_stress_test(int fd, pthread_t *threads, pthread_barrier_t 
*barrier)
+{
+   int ret;
+
+   ctx.iterate_on = 3000;
+
+   TEST_ASSERT_EQUAL("misfired signal?", ctx.signal_count, 0);
+   TEST_ASSERT_EQUAL("enable failed", ioctl(fd, PERF_EVENT_IOC_ENABLE, 0), 
0);
+ 

[PATCH mm] kfence: report sensitive information based on no_hash_pointers

2021-02-23 Thread Marco Elver
We cannot rely on CONFIG_DEBUG_KERNEL to decide if we're running a
"debug kernel" where we can safely show potentially sensitive
information in the kernel log.

Instead, simply rely on the newly introduced "no_hash_pointers" boot
parameter to print unhashed kernel pointers, as well as to decide whether
our reports can include other potentially sensitive information such as
registers and corrupted bytes.
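
The resulting pattern in the report code looks roughly like this (a sketch
only, not the verbatim hunk below; print_diff_byte() is a made-up helper
for illustration):

  /* Sketch: gate potentially sensitive report details on no_hash_pointers. */
  extern bool no_hash_pointers; /* lib/vsprintf.c */

  static void print_diff_byte(struct seq_file *seq, u8 value, bool corrupted)
  {
          if (!corrupted)
                  seq_con_printf(seq, ".");             /* untouched byte */
          else if (no_hash_pointers)
                  seq_con_printf(seq, "0x%02x", value); /* show real value */
          else
                  seq_con_printf(seq, "!");             /* redacted */
  }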

Cc: Timur Tabi 
Signed-off-by: Marco Elver 
---

Depends on "lib/vsprintf: no_hash_pointers prints all addresses as
unhashed", which was merged into mainline yesterday:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b2bec7d8a42a3885d525e821d9354b6b08fd6adf

---
 Documentation/dev-tools/kfence.rst |  8 
 mm/kfence/core.c   | 10 +++---
 mm/kfence/kfence.h |  7 ---
 mm/kfence/kfence_test.c|  2 +-
 mm/kfence/report.c | 18 ++
 5 files changed, 18 insertions(+), 27 deletions(-)

diff --git a/Documentation/dev-tools/kfence.rst 
b/Documentation/dev-tools/kfence.rst
index 58a0a5fa1ddc..fdf04e741ea5 100644
--- a/Documentation/dev-tools/kfence.rst
+++ b/Documentation/dev-tools/kfence.rst
@@ -88,8 +88,8 @@ A typical out-of-bounds access looks like this::
 
 The header of the report provides a short summary of the function involved in
 the access. It is followed by more detailed information about the access and
-its origin. Note that, real kernel addresses are only shown for
-``CONFIG_DEBUG_KERNEL=y`` builds.
+its origin. Note that, real kernel addresses are only shown when using the
+kernel command line option ``no_hash_pointers``.
 
 Use-after-free accesses are reported as::
 
@@ -184,8 +184,8 @@ invalidly written bytes (offset from the address) are 
shown; in this
 representation, '.' denote untouched bytes. In the example above ``0xac`` is
 the value written to the invalid address at offset 0, and the remaining '.'
 denote that no following bytes have been touched. Note that, real values are
-only shown for ``CONFIG_DEBUG_KERNEL=y`` builds; to avoid information
-disclosure for non-debug builds, '!' is used instead to denote invalidly
+only shown if the kernel was booted with ``no_hash_pointers``; to avoid
+information disclosure otherwise, '!' is used instead to denote invalidly
 written bytes.
 
 And finally, KFENCE may also report on invalid accesses to any protected page
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index cfe3d32ac5b7..3b8ec938470a 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -646,13 +646,9 @@ void __init kfence_init(void)
 
WRITE_ONCE(kfence_enabled, true);
schedule_delayed_work(&kfence_timer, 0);
-   pr_info("initialized - using %lu bytes for %d objects", 
KFENCE_POOL_SIZE,
-   CONFIG_KFENCE_NUM_OBJECTS);
-   if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
-   pr_cont(" at 0x%px-0x%px\n", (void *)__kfence_pool,
-   (void *)(__kfence_pool + KFENCE_POOL_SIZE));
-   else
-   pr_cont("\n");
+   pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", 
KFENCE_POOL_SIZE,
+   CONFIG_KFENCE_NUM_OBJECTS, (void *)__kfence_pool,
+   (void *)(__kfence_pool + KFENCE_POOL_SIZE));
 }
 
 void kfence_shutdown_cache(struct kmem_cache *s)
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index 1accc840dbbe..24065321ff8a 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -16,13 +16,6 @@
 
 #include "../slab.h" /* for struct kmem_cache */
 
-/* For non-debug builds, avoid leaking kernel pointers into dmesg. */
-#ifdef CONFIG_DEBUG_KERNEL
-#define PTR_FMT "%px"
-#else
-#define PTR_FMT "%p"
-#endif
-
 /*
  * Get the canary byte pattern for @addr. Use a pattern that varies based on 
the
  * lower 3 bits of the address, to detect memory corruptions with higher
diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index db1bb596acaf..4acf4251ee04 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -146,7 +146,7 @@ static bool report_matches(const struct expect_report *r)
break;
}
 
-   cur += scnprintf(cur, end - cur, " 0x" PTR_FMT, (void *)r->addr);
+   cur += scnprintf(cur, end - cur, " 0x%p", (void *)r->addr);
 
spin_lock_irqsave(&observed.lock, flags);
if (!report_available())
diff --git a/mm/kfence/report.c b/mm/kfence/report.c
index 901bd7ee83d8..4a424de44e2d 100644
--- a/mm/kfence/report.c
+++ b/mm/kfence/report.c
@@ -19,6 +19,8 @@
 
 #include "kfence.h"
 
+extern bool no_hash_pointers;
+
 /* Helper function to either print to a seq_file or to console. */
 __printf(2, 3)
 static void seq_con_printf(struct seq_file *seq, const char *fmt, ...)
@@ -118,7 +120,7 @@ void kfence_print_object(struct seq_file *

[PATCH RFC 0/4] Add support for synchronous signals on perf events

2021-02-23 Thread Marco Elver
The perf subsystem today unifies various tracing and monitoring
features, from both software and hardware. One benefit of the perf
subsystem is automatically inheriting events to child tasks, which
enables process-wide events monitoring with low overheads. By default
perf events are non-intrusive, not affecting behaviour of the tasks
being monitored.

For certain use-cases, however, it makes sense to leverage the
generality of the perf events subsystem and optionally allow the tasks
being monitored to receive signals on events they are interested in.
This patch series adds the option to synchronously signal user space on
events.

The discussion at [1] led to the changes proposed in this series. The
approach taken in patch 3/4 to use 'event_limit' to trigger the signal
was kindly suggested by Peter Zijlstra in [2].

[1] https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZD=-x...@mail.gmail.com/
[2] https://lore.kernel.org/lkml/ybv3rat566k+6...@hirez.programming.kicks-ass.net/

Motivation and example uses:

1.  Our immediate motivation is low-overhead sampling-based race
detection for user-space [3]. By using perf_event_open() at
process initialization, we can create hardware
breakpoint/watchpoint events that are propagated automatically
to all threads in a process. As far as we are aware, today no
existing kernel facility (such as ptrace) allows us to set up
process-wide watchpoints with minimal overheads (that are
comparable to mprotect() of whole pages).

[3] https://llvm.org/devmtg/2020-09/slides/Morehouse-GWP-Tsan.pdf 

2.  Other low-overhead error detectors that rely on detecting
accesses to certain memory locations or code, process-wide and
also only in a specific set of subtasks or threads.

Other example use-cases we found potentially interesting:

3.  Code hot patching without a full stop-the-world. Specifically, by
setting a code breakpoint at the entry of the patched routine, then
sending signals to threads and checking that they are not in the
routine, without stopping them further. If any of the threads
enters the routine, it will receive SIGTRAP and pause.

4.  Safepoints without mprotect(). Some Java implementations use
"load from a known memory location" as a safepoint. When threads
need to be stopped, the page containing the location is
mprotect()ed and threads get a signal. This can be replaced with
a watchpoint, which does not require a whole page nor DTLB
shootdowns.

5.  Tracking data flow globally.

6.  Threads receiving signals on performance events to
throttle/unthrottle themselves.


Marco Elver (4):
  perf/core: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children
  signal: Introduce TRAP_PERF si_code and si_perf to siginfo
  perf/core: Add support for SIGTRAP on perf events
  perf/core: Add breakpoint information to siginfo on SIGTRAP

 arch/m68k/kernel/signal.c  |  3 ++
 arch/x86/kernel/signal_compat.c|  5 ++-
 fs/signalfd.c  |  4 +++
 include/linux/compat.h |  2 ++
 include/linux/signal.h |  1 +
 include/uapi/asm-generic/siginfo.h |  6 +++-
 include/uapi/linux/perf_event.h|  3 +-
 include/uapi/linux/signalfd.h  |  4 ++-
 kernel/events/core.c   | 54 +-
 kernel/signal.c| 11 ++
 10 files changed, 88 insertions(+), 5 deletions(-)

-- 
2.30.0.617.g56c4b15f3c-goog



[PATCH RFC 1/4] perf/core: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children

2021-02-23 Thread Marco Elver
As with other ioctls (such as PERF_EVENT_IOC_{ENABLE,DISABLE}), fix up
handling of PERF_EVENT_IOC_MODIFY_ATTRIBUTES to also apply to children.
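
For context, a hedged user-space sketch of the ioctl affected here
(move_watchpoint() is a hypothetical wrapper): with this change, the
modification also takes effect in all inherited child events, so callers no
longer have to track per-thread fds.

  #include <linux/hw_breakpoint.h>
  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/ioctl.h>

  /* Re-point an existing breakpoint event (and, with this patch, all of its
   * inherited children) at a new address. */
  static int move_watchpoint(int fd, const void *new_addr)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.type    = PERF_TYPE_BREAKPOINT;
          attr.size    = sizeof(attr);
          attr.bp_addr = (unsigned long)new_addr;
          attr.bp_type = HW_BREAKPOINT_RW;
          attr.bp_len  = HW_BREAKPOINT_LEN_1;

          return ioctl(fd, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, &attr);
  }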

Link: https://lkml.kernel.org/r/ybqvay8atmyto...@hirez.programming.kicks-ass.net
Suggested-by: Dmitry Vyukov 
Signed-off-by: Marco Elver 
---
 kernel/events/core.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 129dee540a8b..37a8297be164 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3179,16 +3179,36 @@ static int perf_event_modify_breakpoint(struct 
perf_event *bp,
 static int perf_event_modify_attr(struct perf_event *event,
  struct perf_event_attr *attr)
 {
+   int (*func)(struct perf_event *, struct perf_event_attr *);
+   struct perf_event *child;
+   int err;
+
if (event->attr.type != attr->type)
return -EINVAL;
 
switch (event->attr.type) {
case PERF_TYPE_BREAKPOINT:
-   return perf_event_modify_breakpoint(event, attr);
+   func = perf_event_modify_breakpoint;
+   break;
default:
/* Place holder for future additions. */
return -EOPNOTSUPP;
}
+
+   WARN_ON_ONCE(event->ctx->parent_ctx);
+
+   mutex_lock(&event->child_mutex);
+   err = func(event, attr);
+   if (err)
+   goto out;
+   list_for_each_entry(child, &event->child_list, child_list) {
+   err = func(child, attr);
+   if (err)
+   goto out;
+   }
+out:
+   mutex_unlock(&event->child_mutex);
+   return err;
 }
 
 static void ctx_sched_out(struct perf_event_context *ctx,
-- 
2.30.0.617.g56c4b15f3c-goog



[PATCH RFC 2/4] signal: Introduce TRAP_PERF si_code and si_perf to siginfo

2021-02-23 Thread Marco Elver
Introduces the TRAP_PERF si_code, and associated siginfo_t field
si_perf. These will be used by the perf event subsystem to send signals
(if requested) to the task where an event occurred.

Signed-off-by: Marco Elver 
---
 arch/m68k/kernel/signal.c  |  3 +++
 arch/x86/kernel/signal_compat.c|  5 -
 fs/signalfd.c  |  4 
 include/linux/compat.h |  2 ++
 include/linux/signal.h |  1 +
 include/uapi/asm-generic/siginfo.h |  6 +-
 include/uapi/linux/signalfd.h  |  4 +++-
 kernel/signal.c| 11 +++
 8 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/m68k/kernel/signal.c b/arch/m68k/kernel/signal.c
index 349570f16a78..a4b7ee1df211 100644
--- a/arch/m68k/kernel/signal.c
+++ b/arch/m68k/kernel/signal.c
@@ -622,6 +622,9 @@ static inline void siginfo_build_tests(void)
/* _sigfault._addr_pkey */
BUILD_BUG_ON(offsetof(siginfo_t, si_pkey) != 0x12);
 
+   /* _sigfault._perf */
+   BUILD_BUG_ON(offsetof(siginfo_t, si_perf) != 0x10);
+
/* _sigpoll */
BUILD_BUG_ON(offsetof(siginfo_t, si_band)   != 0x0c);
BUILD_BUG_ON(offsetof(siginfo_t, si_fd) != 0x10);
diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index a5330ff498f0..0e5d0a7e203b 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -29,7 +29,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGFPE  != 15);
BUILD_BUG_ON(NSIGSEGV != 9);
BUILD_BUG_ON(NSIGBUS  != 5);
-   BUILD_BUG_ON(NSIGTRAP != 5);
+   BUILD_BUG_ON(NSIGTRAP != 6);
BUILD_BUG_ON(NSIGCHLD != 6);
BUILD_BUG_ON(NSIGSYS  != 2);
 
@@ -138,6 +138,9 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(offsetof(siginfo_t, si_pkey) != 0x20);
BUILD_BUG_ON(offsetof(compat_siginfo_t, si_pkey) != 0x14);
 
+   BUILD_BUG_ON(offsetof(siginfo_t, si_perf) != 0x18);
+   BUILD_BUG_ON(offsetof(compat_siginfo_t, si_perf) != 0x10);
+
CHECK_CSI_OFFSET(_sigpoll);
CHECK_CSI_SIZE  (_sigpoll, 2*sizeof(int));
CHECK_SI_SIZE   (_sigpoll, 4*sizeof(int));
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 456046e15873..040a1142915f 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -134,6 +134,10 @@ static int signalfd_copyinfo(struct signalfd_siginfo 
__user *uinfo,
 #endif
new.ssi_addr_lsb = (short) kinfo->si_addr_lsb;
break;
+   case SIL_PERF_EVENT:
+   new.ssi_addr = (long) kinfo->si_addr;
+   new.ssi_perf = kinfo->si_perf;
+   break;
case SIL_CHLD:
new.ssi_pid= kinfo->si_pid;
new.ssi_uid= kinfo->si_uid;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 6e65be753603..c8821d966812 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -236,6 +236,8 @@ typedef struct compat_siginfo {
char 
_dummy_pkey[__COMPAT_ADDR_BND_PKEY_PAD];
u32 _pkey;
} _addr_pkey;
+   /* used when si_code=TRAP_PERF */
+   compat_u64 _perf;
};
} _sigfault;
 
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 205526c4003a..1e98548d7cf6 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -43,6 +43,7 @@ enum siginfo_layout {
SIL_FAULT_MCEERR,
SIL_FAULT_BNDERR,
SIL_FAULT_PKUERR,
+   SIL_PERF_EVENT,
SIL_CHLD,
SIL_RT,
SIL_SYS,
diff --git a/include/uapi/asm-generic/siginfo.h 
b/include/uapi/asm-generic/siginfo.h
index d2597000407a..d0bb9125c853 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -91,6 +91,8 @@ union __sifields {
char _dummy_pkey[__ADDR_BND_PKEY_PAD];
__u32 _pkey;
} _addr_pkey;
+   /* used when si_code=TRAP_PERF */
+   __u64 _perf;
};
} _sigfault;
 
@@ -155,6 +157,7 @@ typedef struct siginfo {
 #define si_lower   _sifields._sigfault._addr_bnd._lower
 #define si_upper   _sifields._sigfault._addr_bnd._upper
 #define si_pkey_sifields._sigfault._addr_pkey._pkey
+#define si_perf_sifields._sigfault._perf
 #define si_band_sifields._sigpoll._band
 #define si_fd  _sifields._sigpoll._fd
 #define si_call_addr   _sifields._sigsys._call_addr
@@ -253,7 +256,8 @@ typedef struct siginfo {
 #define TRAP_BRANCH 3  /* process taken branch trap */
 #define TRAP_HWBKPT 4  /* hardware breakpoint/watchpoint */
 #define TRAP_UNK   5   /* undiagnosed trap */
-#define NSIGTRAP   5
+#define TRAP_PERF  

[PATCH RFC 3/4] perf/core: Add support for SIGTRAP on perf events

2021-02-23 Thread Marco Elver
Adds bit perf_event_attr::sigtrap, which can be set to cause events to
send SIGTRAP (with si_code TRAP_PERF) to the task where the event
occurred. To distinguish perf events and allow user space to decode
si_perf (if set), the event type is set in si_errno.

The primary motivation is to support synchronous signals on perf events
in the task where an event (such as a breakpoint) triggered.

Link: https://lore.kernel.org/lkml/ybv3rat566k+6...@hirez.programming.kicks-ass.net/
Suggested-by: Peter Zijlstra 
Signed-off-by: Marco Elver 
---
 include/uapi/linux/perf_event.h |  3 ++-
 kernel/events/core.c| 21 +
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index ad15e40d7f5d..b9cc6829a40c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -389,7 +389,8 @@ struct perf_event_attr {
cgroup :  1, /* include cgroup events */
text_poke  :  1, /* include text poke 
events */
build_id   :  1, /* use build id in mmap2 
events */
-   __reserved_1   : 29;
+   sigtrap:  1, /* send synchronous 
SIGTRAP on event */
+   __reserved_1   : 28;
 
union {
__u32   wakeup_events;/* wakeup every n events */
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 37a8297be164..8718763045fd 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6288,6 +6288,17 @@ void perf_event_wakeup(struct perf_event *event)
}
 }
 
+static void perf_sigtrap(struct perf_event *event)
+{
+   struct kernel_siginfo info;
+
+   clear_siginfo(&info);
+   info.si_signo = SIGTRAP;
+   info.si_code = TRAP_PERF;
+   info.si_errno = event->attr.type;
+   force_sig_info(&info);
+}
+
 static void perf_pending_event_disable(struct perf_event *event)
 {
int cpu = READ_ONCE(event->pending_disable);
@@ -6297,6 +6308,13 @@ static void perf_pending_event_disable(struct perf_event 
*event)
 
if (cpu == smp_processor_id()) {
WRITE_ONCE(event->pending_disable, -1);
+
+   if (event->attr.sigtrap) {
+   atomic_inc(&event->event_limit); /* rearm event */
+   perf_sigtrap(event);
+   return;
+   }
+
perf_event_disable_local(event);
return;
}
@@ -11325,6 +11343,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
event->state= PERF_EVENT_STATE_INACTIVE;
 
+   if (event->attr.sigtrap)
+   atomic_set(&event->event_limit, 1);
+
if (task) {
event->attach_state = PERF_ATTACH_TASK;
/*
-- 
2.30.0.617.g56c4b15f3c-goog



[PATCH RFC 4/4] perf/core: Add breakpoint information to siginfo on SIGTRAP

2021-02-23 Thread Marco Elver
Encode information from breakpoint attributes into siginfo_t, which
helps disambiguate which breakpoint fired.

Note, providing the event fd may be unreliable, since the event may have
been modified (via PERF_EVENT_IOC_MODIFY_ATTRIBUTES) between the event
triggering and the signal being delivered to user space.
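
A hedged sketch of how a user-space handler might decode this particular
encoding (illustrative only; assumes headers with the series applied):

  #include <linux/perf_event.h>
  #include <signal.h>
  #include <stdint.h>

  static void sigtrap_handler(int sig, siginfo_t *info, void *ucontext)
  {
          uint64_t bp_len;
          uint32_t bp_type;

          if (info->si_code != TRAP_PERF ||
              info->si_errno != PERF_TYPE_BREAKPOINT)
                  return;

          /* Per this RFC: si_perf = (bp_len << 16) | bp_type. */
          bp_len  = (uint64_t)info->si_perf >> 16;
          bp_type = info->si_perf & 0xffff;
          /* info->si_addr holds the breakpoint's attr.bp_addr. */
          (void)bp_len;
          (void)bp_type;
  }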

Signed-off-by: Marco Elver 
---
 kernel/events/core.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 8718763045fd..d7908322d796 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -6296,6 +6296,17 @@ static void perf_sigtrap(struct perf_event *event)
info.si_signo = SIGTRAP;
info.si_code = TRAP_PERF;
info.si_errno = event->attr.type;
+
+   switch (event->attr.type) {
+   case PERF_TYPE_BREAKPOINT:
+   info.si_addr = (void *)(unsigned long)event->attr.bp_addr;
+   info.si_perf = (event->attr.bp_len << 16) | 
(u64)event->attr.bp_type;
+   break;
+   default:
+   /* No additional info set. */
+   break;
+   }
+
force_sig_info(&info);
 }
 
-- 
2.30.0.617.g56c4b15f3c-goog



Re: [PATCH RFC 4/4] perf/core: Add breakpoint information to siginfo on SIGTRAP

2021-02-23 Thread Marco Elver
On Tue, 23 Feb 2021 at 16:01, Dmitry Vyukov  wrote:
>
> On Tue, Feb 23, 2021 at 3:34 PM Marco Elver  wrote:
> >
> > Encode information from breakpoint attributes into siginfo_t, which
> > helps disambiguate which breakpoint fired.
> >
> > Note, providing the event fd may be unreliable, since the event may have
> > been modified (via PERF_EVENT_IOC_MODIFY_ATTRIBUTES) between the event
> > triggering and the signal being delivered to user space.
> >
> > Signed-off-by: Marco Elver 
> > ---
> >  kernel/events/core.c | 11 +++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 8718763045fd..d7908322d796 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -6296,6 +6296,17 @@ static void perf_sigtrap(struct perf_event *event)
> > info.si_signo = SIGTRAP;
> > info.si_code = TRAP_PERF;
> > info.si_errno = event->attr.type;
> > +
> > +   switch (event->attr.type) {
> > +   case PERF_TYPE_BREAKPOINT:
> > +   info.si_addr = (void *)(unsigned long)event->attr.bp_addr;
> > +   info.si_perf = (event->attr.bp_len << 16) | 
> > (u64)event->attr.bp_type;
> > +   break;
> > +   default:
> > +   /* No additional info set. */
>
> Should we prohibit using attr.sigtrap for !PERF_TYPE_BREAKPOINT if we
> don't know what info to pass yet?

I don't think it's necessary. This way, by default we get support for
other perf events. If user space observes si_perf==0, then there's no
information available. That would require that any event type that
sets si_perf in future must ensure that it sets si_perf != 0.

I can add a comment to document the requirement here (and user space
facing documentation should get a copy of how the info is encoded,
too).

Alternatively, we could set si_errno to 0 if no info is available, at
the cost of losing the type information for events not explicitly
listed here.

What do you prefer?

> > +   break;
> > +   }
> > +
> > force_sig_info(&info);
> >  }
> >
> > --
> > 2.30.0.617.g56c4b15f3c-goog
> >


Re: [PATCH RFC 4/4] perf/core: Add breakpoint information to siginfo on SIGTRAP

2021-02-23 Thread Marco Elver
On Tue, 23 Feb 2021 at 16:16, Dmitry Vyukov  wrote:
>
> On Tue, Feb 23, 2021 at 4:10 PM 'Marco Elver' via kasan-dev
>  wrote:
> > > > Encode information from breakpoint attributes into siginfo_t, which
> > > > helps disambiguate which breakpoint fired.
> > > >
> > > > Note, providing the event fd may be unreliable, since the event may have
> > > > been modified (via PERF_EVENT_IOC_MODIFY_ATTRIBUTES) between the event
> > > > triggering and the signal being delivered to user space.
> > > >
> > > > Signed-off-by: Marco Elver 
> > > > ---
> > > >  kernel/events/core.c | 11 +++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > > > index 8718763045fd..d7908322d796 100644
> > > > --- a/kernel/events/core.c
> > > > +++ b/kernel/events/core.c
> > > > @@ -6296,6 +6296,17 @@ static void perf_sigtrap(struct perf_event 
> > > > *event)
> > > > info.si_signo = SIGTRAP;
> > > > info.si_code = TRAP_PERF;
> > > > info.si_errno = event->attr.type;
> > > > +
> > > > +   switch (event->attr.type) {
> > > > +   case PERF_TYPE_BREAKPOINT:
> > > > +   info.si_addr = (void *)(unsigned 
> > > > long)event->attr.bp_addr;
> > > > +   info.si_perf = (event->attr.bp_len << 16) | 
> > > > (u64)event->attr.bp_type;
> > > > +   break;
> > > > +   default:
> > > > +   /* No additional info set. */
> > >
> > > Should we prohibit using attr.sigtrap for !PERF_TYPE_BREAKPOINT if we
> > > don't know what info to pass yet?
> >
> > I don't think it's necessary. This way, by default we get support for
> > other perf events. If user space observes si_perf==0, then there's no
> > information available. That would require that any event type that
> > sets si_perf in future, must ensure that it sets si_perf!=0.
> >
> > I can add a comment to document the requirement here (and user space
> > facing documentation should get a copy of how the info is encoded,
> > too).
> >
> > Alternatively, we could set si_errno to 0 if no info is available, at
> > the cost of losing the type information for events not explicitly
> > listed here.

Note that PERF_TYPE_HARDWARE == 0, so setting si_errno to 0 does not
work. Which leaves us with:

1. Ensure si_perf==0 (or some other magic value) if no info is
available and !=0 otherwise.

2. Return error for events where we do not officially support
requesting sigtrap.

I'm currently leaning towards (1).

> > What do you prefer?
>
> Ah, I see.
> Let's wait for the opinions of other people. There are a number of
> options for how to approach this.


Re: [PATCH RFC 0/4] Add support for synchronous signals on perf events

2021-02-23 Thread Marco Elver
On Tue, 23 Feb 2021 at 15:34, Marco Elver  wrote:
>
> The perf subsystem today unifies various tracing and monitoring
> features, from both software and hardware. One benefit of the perf
> subsystem is automatically inheriting events to child tasks, which
> enables process-wide events monitoring with low overheads. By default
> perf events are non-intrusive, not affecting behaviour of the tasks
> being monitored.
>
> For certain use-cases, however, it makes sense to leverage the
> generality of the perf events subsystem and optionally allow the tasks
> being monitored to receive signals on events they are interested in.
> This patch series adds the option to synchronously signal user space on
> events.
>
> The discussion at [1] led to the changes proposed in this series. The
> approach taken in patch 3/4 to use 'event_limit' to trigger the signal
> was kindly suggested by Peter Zijlstra in [2].
>
> [1] 
> https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZD=-x...@mail.gmail.com/
> [2] 
> https://lore.kernel.org/lkml/ybv3rat566k+6...@hirez.programming.kicks-ass.net/
>
> Motivation and example uses:
>
> 1.  Our immediate motivation is low-overhead sampling-based race
> detection for user-space [3]. By using perf_event_open() at
> process initialization, we can create hardware
> breakpoint/watchpoint events that are propagated automatically
> to all threads in a process. As far as we are aware, today no
> existing kernel facility (such as ptrace) allows us to set up
> process-wide watchpoints with minimal overheads (that are
> comparable to mprotect() of whole pages).
>
> [3] https://llvm.org/devmtg/2020-09/slides/Morehouse-GWP-Tsan.pdf
>
> 2.  Other low-overhead error detectors that rely on detecting
> accesses to certain memory locations or code, process-wide and
> also only in a specific set of subtasks or threads.
>
> Other example use-cases we found potentially interesting:
>
> 3.  Code hot patching without full stop-the-world. Specifically, by
> setting a code breakpoint to entry to the patched routine, then
> send signals to threads and check that they are not in the
> routine, but without stopping them further. If any of the
> threads will enter the routine, it will receive SIGTRAP and
> pause.
>
> 4.  Safepoints without mprotect(). Some Java implementations use
> "load from a known memory location" as a safepoint. When threads
> need to be stopped, the page containing the location is
> mprotect()ed and threads get a signal. This can be replaced with
> a watchpoint, which does not require a whole page nor DTLB
> shootdowns.
>
> 5.  Tracking data flow globally.
>
> 6.  Threads receiving signals on performance events to
> throttle/unthrottle themselves.
>
>
> Marco Elver (4):
>   perf/core: Apply PERF_EVENT_IOC_MODIFY_ATTRIBUTES to children
>   signal: Introduce TRAP_PERF si_code and si_perf to siginfo
>   perf/core: Add support for SIGTRAP on perf events
>   perf/core: Add breakpoint information to siginfo on SIGTRAP

Note that we're currently pondering fork + exec, and suggestions would
be appreciated. We think we'll need some restrictions, like Peter
proposed here:
https://lore.kernel.org/lkml/ybvj6ejr%2fdy2t...@hirez.programming.kicks-ass.net/

We think what we want is to inherit the events to children only if
cloned with CLONE_SIGHAND. If there's space for an 'inherit_mask' in
perf_event_attr, that'd be most flexible, but perhaps we do not have
the space.

Thanks,
-- Marco

>
>  arch/m68k/kernel/signal.c  |  3 ++
>  arch/x86/kernel/signal_compat.c|  5 ++-
>  fs/signalfd.c  |  4 +++
>  include/linux/compat.h |  2 ++
>  include/linux/signal.h |  1 +
>  include/uapi/asm-generic/siginfo.h |  6 +++-
>  include/uapi/linux/perf_event.h|  3 +-
>  include/uapi/linux/signalfd.h  |  4 ++-
>  kernel/events/core.c   | 54 +-
>  kernel/signal.c| 11 ++
>  10 files changed, 88 insertions(+), 5 deletions(-)
>
> --
> 2.30.0.617.g56c4b15f3c-goog
>


Re: [PATCH RFC 0/4] Add support for synchronous signals on perf events

2021-02-23 Thread Marco Elver
On Tue, 23 Feb 2021 at 21:27, Andy Lutomirski  wrote:
> > On Feb 23, 2021, at 6:34 AM, Marco Elver  wrote:
> >
> > The perf subsystem today unifies various tracing and monitoring
> > features, from both software and hardware. One benefit of the perf
> > subsystem is automatically inheriting events to child tasks, which
> > enables process-wide events monitoring with low overheads. By default
> > perf events are non-intrusive, not affecting behaviour of the tasks
> > being monitored.
> >
> > For certain use-cases, however, it makes sense to leverage the
> > generality of the perf events subsystem and optionally allow the tasks
> > being monitored to receive signals on events they are interested in.
> > This patch series adds the option to synchronously signal user space on
> > events.
>
> Unless I missed some machinations, which is entirely possible, you can’t call 
> force_sig_info() from NMI context. Not only am I not convinced that the core 
> signal code is NMI safe, but at least x86 can’t correctly deliver signals on 
> NMI return. You probably need an IPI-to-self.

force_sig_info() is called from an irq_work only: perf_pending_event
-> perf_pending_event_disable -> perf_sigtrap -> force_sig_info. What
did I miss?

> > The discussion at [1] led to the changes proposed in this series. The
> > approach taken in patch 3/4 to use 'event_limit' to trigger the signal
> > was kindly suggested by Peter Zijlstra in [2].
> >
> > [1] 
> > https://lore.kernel.org/lkml/CACT4Y+YPrXGw+AtESxAgPyZ84TYkNZdP0xpocX2jwVAbZD=-x...@mail.gmail.com/
> > [2] 
> > https://lore.kernel.org/lkml/ybv3rat566k+6...@hirez.programming.kicks-ass.net/
> >
> > Motivation and example uses:
> >
> > 1.Our immediate motivation is low-overhead sampling-based race
> >detection for user-space [3]. By using perf_event_open() at
> >process initialization, we can create hardware
> >breakpoint/watchpoint events that are propagated automatically
> >to all threads in a process. As far as we are aware, today no
> >existing kernel facility (such as ptrace) allows us to set up
> >process-wide watchpoints with minimal overheads (that are
> >comparable to mprotect() of whole pages).
>
> This would be doable much more simply with an API to set a breakpoint.  All 
> the machinery exists except the actual user API.

Isn't perf_event_open() that API?

A new user API implementation will either be a thin wrapper around
perf events or reinvent half of perf events to deal with managing
watchpoints across a set of tasks (process-wide or some subset).

It's not just breakpoints though.
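
For reference, this is roughly what the setup under discussion looks
like from userspace with the existing perf_event_open() API (a minimal
sketch using only current, documented attributes; the SIGTRAP-on-event
behaviour proposed by this series is intentionally left out):

#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>

/* Set up an inherited hardware watchpoint on @addr for the calling
 * task and, via attr.inherit, all of its future children and threads.
 */
static int watchpoint_fd(void *addr)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_BREAKPOINT;
	attr.bp_type = HW_BREAKPOINT_W;		/* fire on writes */
	attr.bp_addr = (unsigned long)addr;
	attr.bp_len = HW_BREAKPOINT_LEN_8;
	attr.sample_period = 1;			/* count every hit */
	attr.inherit = 1;			/* propagate to children */

	/* pid == 0, cpu == -1: this task, on any CPU. */
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}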

> >[3] https://llvm.org/devmtg/2020-09/slides/Morehouse-GWP-Tsan.pdf
> >
> > 2.Other low-overhead error detectors that rely on detecting
> >accesses to certain memory locations or code, process-wide and
> >also only in a specific set of subtasks or threads.
> >
> > Other example use-cases we found potentially interesting:
> >
> > 3.Code hot patching without full stop-the-world. Specifically, by
> >setting a code breakpoint to entry to the patched routine, then
> >send signals to threads and check that they are not in the
> >routine, but without stopping them further. If any of the
> >threads will enter the routine, it will receive SIGTRAP and
> >pause.
>
> Cute.
>
> >
> > 4.Safepoints without mprotect(). Some Java implementations use
> >"load from a known memory location" as a safepoint. When threads
> >need to be stopped, the page containing the location is
> >mprotect()ed and threads get a signal. This can be replaced with
> >a watchpoint, which does not require a whole page nor DTLB
> >shootdowns.
>
> I’m skeptical. Propagating a hardware breakpoint to all threads involves IPIs 
> and horribly slow writes to DR1 (or 2, 3, or 4) and DR7.  A TLB flush can be 
> accelerated using paravirt or hypothetical future hardware. Or real live 
> hardware on ARM64.
>
> (The hypothetical future hardware is almost present on Zen 3.  A bit of work 
> is needed on the hardware end to make it useful.)

Fair enough. Although watchpoints can be much more fine-grained than
an mprotect(), which in turn has its own downside: we'd have to check
whether the accessed memory was actually the bytes we're interested in.
Maybe we should also ask CPU vendors to give us better watchpoints
(perhaps start with more of them, and make them easier to set in
batch)? We still need a user space API...

Thanks,
-- Marco



> >
> > 5.Tracking data flow globally.
> >
> > 6.Threads receiving signals on performance events to
> >throttle/unthrottle themselves.


Re: [PATCH mm] kfence: make reporting sensitive information configurable

2021-02-09 Thread Marco Elver
On Tue, 9 Feb 2021 at 19:06, Vlastimil Babka  wrote:
> On 2/9/21 4:13 PM, Marco Elver wrote:
> > We cannot rely on CONFIG_DEBUG_KERNEL to decide if we're running a
> > "debug kernel" where we can safely show potentially sensitive
> > information in the kernel log.
> >
> > Therefore, add the option CONFIG_KFENCE_REPORT_SENSITIVE to decide if we
> > should add potentially sensitive information to KFENCE reports. The
> > default behaviour remains unchanged.
> >
> > Signed-off-by: Marco Elver 
>
> Hi,
>
> could we drop this kconfig approach in favour of the boot option proposed 
> here?
> [1] Just do the prints with %p unconditionally and the boot option takes care 
> of
> it? Also Linus mentioned dislike of controlling potential memory leak to be a
> config option [2]
>
> Thanks,
> Vlastimil
>
> [1] https://lore.kernel.org/linux-mm/20210202213633.755469-1-ti...@kernel.org/
> [2]
> https://lore.kernel.org/linux-mm/CAHk-=wgaK4cz=k-jb4p-kpxbv73m9bja2w1w1lr3iu8+nep...@mail.gmail.com/

Is the patch at [1] already in -next? If not I'll wait until it is,
because otherwise KFENCE reports will be pretty useless.

I think it is reasonable to switch to '%p' once we have the boot
option, but doing so while we do not yet have the option doesn't work
for us. We can potentially drop this patch if the boot option patch
will make it into mainline soon. Otherwise my preference would be to
take this patch and revert it with the switch to '%p' when the boot
option has landed.

Thanks,
-- Marco


Re: [PATCH 0/3][RESEND] add support for never printing hashed addresses

2021-02-10 Thread Marco Elver
On Tue, Feb 09, 2021 at 11:18PM -0600, Timur Tabi wrote:
> [accidentally sent from the wrong email address, so resending]
> 
> [The list of email addresses on CC: is getting quite lengthy,
> so I hope I've included everyone.]
> 
> Although hashing addresses printed via printk does make the
> kernel more secure, it interferes with debugging, especially
> with some functions like print_hex_dump() which always uses
> hashed addresses.
> 
> To avoid having to choose between %p and %px, it's easier to
> add a kernel command line that treats all %p as %px.  This
> encourages developers to use %p more without making debugging
> more difficult.
> 
> Patches #1 and #2 upgrade the kselftest framework so that
> it can report on tests that were skipped outright.  This
> is needed for the test_printf module which will now skip
> %p hashing tests if hashing is disabled.
> 
> Patch #2 upgrades the printf library to check the command
> line.  It also updates test_printf().
> 
> Timur Tabi (3):
>   lib/test_printf: use KSTM_MODULE_GLOBALS macro
>   kselftest: add support for skipped tests
>   [v2] lib/vsprintf: make-printk-non-secret printks all addresses as
> unhashed
> 
>  .../admin-guide/kernel-parameters.txt | 15 +++
>  lib/test_printf.c | 12 +-
>  lib/vsprintf.c| 40 ++-
>  tools/testing/selftests/kselftest_module.h| 18 ++---
>  4 files changed, 75 insertions(+), 10 deletions(-)

I wanted to test this for deciding if we can show sensitive info in
KFENCE reports, which works just fine now that debug_never_hash_pointers
is non-static. FWIW,

Acked-by: Marco Elver 

But unfortunately this series broke some other test:

| In file included from lib/test_bitmap.c:17:
| lib/test_bitmap.c: In function ‘test_bitmap_init’:
| lib/../tools/testing/selftests/kselftest_module.h:45:48: error: 
‘skipped_tests’ undeclared (first use in this function); did you mean 
‘failed_tests’?
|45 |  return kstm_report(total_tests, failed_tests, skipped_tests); \
|   |^
| lib/test_bitmap.c:637:1: note: in expansion of macro ‘KSTM_MODULE_LOADERS’
|   637 | KSTM_MODULE_LOADERS(test_bitmap);
|   | ^~~
| lib/../tools/testing/selftests/kselftest_module.h:45:48: note: each 
undeclared identifier is reported only once for each function it appears in
|45 |  return kstm_report(total_tests, failed_tests, skipped_tests); \
|   |^
| lib/test_bitmap.c:637:1: note: in expansion of macro ‘KSTM_MODULE_LOADERS’
|   637 | KSTM_MODULE_LOADERS(test_bitmap);
|   | ^~~
| lib/../tools/testing/selftests/kselftest_module.h:46:1: error: control 
reaches end of non-void function [-Werror=return-type]
|46 | }   \
|   | ^
| lib/test_bitmap.c:637:1: note: in expansion of macro ‘KSTM_MODULE_LOADERS’
|   637 | KSTM_MODULE_LOADERS(test_bitmap);
|   | ^~~

My allyesconfig build suggests test_bitmap.c is the only one, so it
should probably be fixed up in this series.
 
Thanks,
-- Marco


Re: [PATCH 01/12] kasan, mm: don't save alloc stacks twice

2021-02-02 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> Currently KASAN saves allocation stacks in both kasan_slab_alloc() and
> kasan_kmalloc() annotations. This patch changes KASAN to save allocation
> stacks for slab objects from kmalloc caches in kasan_kmalloc() only,
> and stacks for other slab objects in kasan_slab_alloc() only.
> 
> This change requires kasan_kmalloc() knowing whether the object
> belongs to a kmalloc cache. This is implemented by adding a flag field
> to the kasan_info structure. That flag is only set for kmalloc caches
> via a new kasan_cache_create_kmalloc() annotation.
> 
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

> ---
>  include/linux/kasan.h |  9 +
>  mm/kasan/common.c | 18 ++
>  mm/slab_common.c  |  1 +
>  3 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index 6d8f3227c264..2d5de4092185 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -83,6 +83,7 @@ static inline void kasan_disable_current(void) {}
>  struct kasan_cache {
>   int alloc_meta_offset;
>   int free_meta_offset;
> + bool is_kmalloc;
>  };
>  
>  #ifdef CONFIG_KASAN_HW_TAGS
> @@ -143,6 +144,13 @@ static __always_inline void kasan_cache_create(struct 
> kmem_cache *cache,
>   __kasan_cache_create(cache, size, flags);
>  }
>  
> +void __kasan_cache_create_kmalloc(struct kmem_cache *cache);
> +static __always_inline void kasan_cache_create_kmalloc(struct kmem_cache 
> *cache)
> +{
> + if (kasan_enabled())
> + __kasan_cache_create_kmalloc(cache);
> +}
> +
>  size_t __kasan_metadata_size(struct kmem_cache *cache);
>  static __always_inline size_t kasan_metadata_size(struct kmem_cache *cache)
>  {
> @@ -278,6 +286,7 @@ static inline void kasan_free_pages(struct page *page, 
> unsigned int order) {}
>  static inline void kasan_cache_create(struct kmem_cache *cache,
> unsigned int *size,
> slab_flags_t *flags) {}
> +static inline void kasan_cache_create_kmalloc(struct kmem_cache *cache) {}
>  static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 
> 0; }
>  static inline void kasan_poison_slab(struct page *page) {}
>  static inline void kasan_unpoison_object_data(struct kmem_cache *cache,
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index fe852f3cfa42..374049564ea3 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -210,6 +210,11 @@ void __kasan_cache_create(struct kmem_cache *cache, 
> unsigned int *size,
>   *size = optimal_size;
>  }
>  
> +void __kasan_cache_create_kmalloc(struct kmem_cache *cache)
> +{
> + cache->kasan_info.is_kmalloc = true;
> +}
> +
>  size_t __kasan_metadata_size(struct kmem_cache *cache)
>  {
>   if (!kasan_stack_collection_enabled())
> @@ -394,17 +399,22 @@ void __kasan_slab_free_mempool(void *ptr, unsigned long 
> ip)
>   }
>  }
>  
> -static void set_alloc_info(struct kmem_cache *cache, void *object, gfp_t 
> flags)
> +static void set_alloc_info(struct kmem_cache *cache, void *object,
> + gfp_t flags, bool kmalloc)
>  {
>   struct kasan_alloc_meta *alloc_meta;
>  
> + /* Don't save alloc info for kmalloc caches in kasan_slab_alloc(). */
> + if (cache->kasan_info.is_kmalloc && !kmalloc)
> + return;
> +
>   alloc_meta = kasan_get_alloc_meta(cache, object);
>   if (alloc_meta)
>   kasan_set_track(&alloc_meta->alloc_track, flags);
>  }
>  
>  static void *kasan_kmalloc(struct kmem_cache *cache, const void *object,
> - size_t size, gfp_t flags, bool keep_tag)
> + size_t size, gfp_t flags, bool kmalloc)
>  {
>   unsigned long redzone_start;
>   unsigned long redzone_end;
> @@ -423,7 +433,7 @@ static void *kasan_kmalloc(struct kmem_cache *cache, 
> const void *object,
>   KASAN_GRANULE_SIZE);
>   redzone_end = round_up((unsigned long)object + cache->object_size,
>   KASAN_GRANULE_SIZE);
> - tag = assign_tag(cache, object, false, keep_tag);
> + tag = assign_tag(cache, object, false, kmalloc);
>  
>   /* Tag is ignored in set_tag without CONFIG_KASAN_SW/HW_TAGS */
>   kasan_unpoison(set_tag(object, tag), size);
> @@ -431,7 +441,7 @@ static void *kasan_kmalloc(struct kmem_cache *cache, 
> const void *object,
>  KASAN_KMALLOC_REDZONE);
>  
>   if (kasan_stack

Re: [PATCH 02/12] kasan, mm: optimize kmalloc poisoning

2021-02-02 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> For allocations from kmalloc caches, kasan_kmalloc() always follows
> kasan_slab_alloc(). Currenly, both of them unpoison the whole object,
> which is unnecessary.
> 
> This patch provides separate implementations for both annotations:
> kasan_slab_alloc() unpoisons the whole object, and kasan_kmalloc()
> only poisons the redzone.
> 
> For generic KASAN, the redzone start might not be aligned to
> KASAN_GRANULE_SIZE. Therefore, the poisoning is split in two parts:
> kasan_poison_last_granule() poisons the unaligned part, and then
> kasan_poison() poisons the rest.
> 
> This patch also clarifies alignment guarantees of each of the poisoning
> functions and drops the unnecessary round_up() call for redzone_end.
> 
> With this change, the early SLUB cache annotation needs to be changed to
> kasan_slab_alloc(), as kasan_kmalloc() doesn't unpoison objects now.
> The number of poisoned bytes for objects in this cache stays the same, as
> kmem_cache_node->object_size is equal to sizeof(struct kmem_cache_node).
> 
> Signed-off-by: Andrey Konovalov 
> ---
>  mm/kasan/common.c | 93 +++
>  mm/kasan/kasan.h  | 43 +-
>  mm/kasan/shadow.c | 28 +++---
>  mm/slub.c |  3 +-
>  4 files changed, 119 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 374049564ea3..128cb330ca73 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -278,21 +278,11 @@ void __kasan_poison_object_data(struct kmem_cache 
> *cache, void *object)
>   *based on objects indexes, so that objects that are next to each other
>   *get different tags.
>   */
> -static u8 assign_tag(struct kmem_cache *cache, const void *object,
> - bool init, bool keep_tag)
> +static u8 assign_tag(struct kmem_cache *cache, const void *object, bool init)
>  {
>   if (IS_ENABLED(CONFIG_KASAN_GENERIC))
>   return 0xff;
>  
> - /*
> -  * 1. When an object is kmalloc()'ed, two hooks are called:
> -  *kasan_slab_alloc() and kasan_kmalloc(). We assign the
> -  *tag only in the first one.
> -  * 2. We reuse the same tag for krealloc'ed objects.
> -  */
> - if (keep_tag)
> - return get_tag(object);
> -
>   /*
>* If the cache neither has a constructor nor has SLAB_TYPESAFE_BY_RCU
>* set, assign a tag when the object is being allocated (init == false).
> @@ -325,7 +315,7 @@ void * __must_check __kasan_init_slab_obj(struct 
> kmem_cache *cache,
>   }
>  
>   /* Tag is ignored in set_tag() without CONFIG_KASAN_SW/HW_TAGS */
> - object = set_tag(object, assign_tag(cache, object, true, false));
> + object = set_tag(object, assign_tag(cache, object, true));
>  
>   return (void *)object;
>  }
> @@ -413,12 +403,46 @@ static void set_alloc_info(struct kmem_cache *cache, 
> void *object,
>   kasan_set_track(&alloc_meta->alloc_track, flags);
>  }
>  
> +void * __must_check __kasan_slab_alloc(struct kmem_cache *cache,
> + void *object, gfp_t flags)
> +{
> + u8 tag;
> + void *tagged_object;
> +
> + if (gfpflags_allow_blocking(flags))
> + kasan_quarantine_reduce();
> +
> + if (unlikely(object == NULL))
> + return NULL;
> +
> + if (is_kfence_address(object))
> + return (void *)object;
> +
> + /*
> +  * Generate and assign random tag for tag-based modes.
> +  * Tag is ignored in set_tag() for the generic mode.
> +  */
> + tag = assign_tag(cache, object, false);
> + tagged_object = set_tag(object, tag);
> +
> + /*
> +  * Unpoison the whole object.
> +  * For kmalloc() allocations, kasan_kmalloc() will do precise poisoning.
> +  */
> + kasan_unpoison(tagged_object, cache->object_size);
> +
> + /* Save alloc info (if possible) for non-kmalloc() allocations. */
> + if (kasan_stack_collection_enabled())
> + set_alloc_info(cache, (void *)object, flags, false);
> +
> + return tagged_object;
> +}
> +
>  static void *kasan_kmalloc(struct kmem_cache *cache, const void *object,
> - size_t size, gfp_t flags, bool kmalloc)
> + size_t size, gfp_t flags)
>  {
>   unsigned long redzone_start;
>   unsigned long redzone_end;
> - u8 tag;
>  
>   if (gfpflags_allow_blocking(flags))
>   kasan_quarantine_reduce();
> @@ -429,33 +453,41 @@ static void *kasan_kmalloc(struct kmem_cache 
> *cache, const void *object,
>   if (is_kfence_address(kasan_reset_tag(object)))
>   return (void *)object;
>  
> + /*
> +  * The object has already been unpoisoned by kasan_slab_alloc() for
> +  * kmalloc() or by ksize() for krealloc().
> +  */
> +
> + /*
> +  * The redzone has byte-level precision for the gen

Re: [PATCH 03/12] kasan: optimize large kmalloc poisoning

2021-02-02 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> Similarly to kasan_kmalloc(), kasan_kmalloc_large() doesn't need
> to unpoison the object as it as already unpoisoned by alloc_pages()
> (or by ksize() for krealloc()).
> 
> This patch changes kasan_kmalloc_large() to only poison the redzone.
> 
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

> ---
>  mm/kasan/common.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 128cb330ca73..a7eb553c8e91 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -494,7 +494,6 @@ EXPORT_SYMBOL(__kasan_kmalloc);
>  void * __must_check __kasan_kmalloc_large(const void *ptr, size_t size,
>   gfp_t flags)
>  {
> - struct page *page;
>   unsigned long redzone_start;
>   unsigned long redzone_end;
>  
> @@ -504,12 +503,23 @@ void * __must_check __kasan_kmalloc_large(const void 
> *ptr, size_t size,
>   if (unlikely(ptr == NULL))
>   return NULL;
>  
> - page = virt_to_page(ptr);
> + /*
> +  * The object has already been unpoisoned by kasan_alloc_pages() for
> +  * alloc_pages() or by ksize() for krealloc().
> +  */
> +
> + /*
> +  * The redzone has byte-level precision for the generic mode.
> +  * Partially poison the last object granule to cover the unaligned
> +  * part of the redzone.
> +  */
> + if (IS_ENABLED(CONFIG_KASAN_GENERIC))
> + kasan_poison_last_granule(ptr, size);
> +
> + /* Poison the aligned part of the redzone. */
>   redzone_start = round_up((unsigned long)(ptr + size),
>   KASAN_GRANULE_SIZE);
> - redzone_end = (unsigned long)ptr + page_size(page);
> -
> - kasan_unpoison(ptr, size);
> + redzone_end = (unsigned long)ptr + page_size(virt_to_page(ptr));
>   kasan_poison((void *)redzone_start, redzone_end - redzone_start,
>KASAN_PAGE_REDZONE);
>  
> -- 
> 2.30.0.365.g02bc693789-goog
> 


Re: [PATCH 04/12] kasan: clean up setting free info in kasan_slab_free

2021-02-02 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> Put kasan_stack_collection_enabled() check and kasan_set_free_info()
> calls next to each other.
> 
> The way this was previously implemented was a minor optimization that
> relied on the fact that kasan_stack_collection_enabled() is always
> true for generic KASAN. The confusion that this brings outweighs saving
> a few instructions.
> 
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

> ---
>  mm/kasan/common.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index a7eb553c8e91..086bb77292b6 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -350,13 +350,11 @@ static bool kasan_slab_free(struct kmem_cache 
> *cache, void *object,
>  
>   kasan_poison(object, cache->object_size, KASAN_KMALLOC_FREE);
>  
> - if (!kasan_stack_collection_enabled())
> - return false;
> -
>   if ((IS_ENABLED(CONFIG_KASAN_GENERIC) && !quarantine))
>   return false;
>  
> - kasan_set_free_info(cache, object, tag);
> + if (kasan_stack_collection_enabled())
> + kasan_set_free_info(cache, object, tag);
>  
>   return kasan_quarantine_put(cache, object);
>  }
> -- 
> 2.30.0.365.g02bc693789-goog
> 


Re: [PATCH 02/12] kasan, mm: optimize kmalloc poisoning

2021-02-02 Thread Marco Elver
On Tue, 2 Feb 2021 at 18:16, Andrey Konovalov  wrote:
>
> On Tue, Feb 2, 2021 at 5:25 PM Marco Elver  wrote:
> >
> > > +#ifdef CONFIG_KASAN_GENERIC
> > > +
> > > +/**
> > > + * kasan_poison_last_granule - mark the last granule of the memory range 
> > > as
> > > + * unaccessible
> > > + * @addr - range start address, must be aligned to KASAN_GRANULE_SIZE
> > > + * @size - range size
> > > + *
> > > + * This function is only available for the generic mode, as it's the 
> > > only mode
> > > + * that has partially poisoned memory granules.
> > > + */
> > > +void kasan_poison_last_granule(const void *address, size_t size);
> > > +
> > > +#else /* CONFIG_KASAN_GENERIC */
> > > +
> > > +static inline void kasan_poison_last_granule(const void *address, size_t 
> > > size) { }
>
> ^
>
> > > +
> > > +#endif /* CONFIG_KASAN_GENERIC */
> > > +
> > >  /*
> > >   * Exported functions for interfaces called from assembly or from 
> > > generated
> > >   * code. Declarations here to avoid warning about missing declarations.
>
> > > @@ -96,6 +92,16 @@ void kasan_poison(const void *address, size_t size, u8 
> > > value)
> > >  }
> > >  EXPORT_SYMBOL(kasan_poison);
> > >
> > > +#ifdef CONFIG_KASAN_GENERIC
> > > +void kasan_poison_last_granule(const void *address, size_t size)
> > > +{
> > > + if (size & KASAN_GRANULE_MASK) {
> > > + u8 *shadow = (u8 *)kasan_mem_to_shadow(address + size);
> > > + *shadow = size & KASAN_GRANULE_MASK;
> > > + }
> > > +}
> > > +#endif
> >
> > The function declaration still needs to exist in the dead branch if
> > !IS_ENABLED(CONFIG_KASAN_GENERIC). It appears in that case it's declared
> > (in kasan.h), but not defined.  We shouldn't get linker errors because
> > the optimizer should remove the dead branch. Nevertheless, is this code
> > generally acceptable?
>
> The function is defined as empty when !CONFIG_KASAN_GENERIC, see above.

I missed that, thanks.

Reviewed-by: Marco Elver 


Re: [PATCH net-next] net: fix up truesize of cloned skb in skb_prepare_for_shift()

2021-02-02 Thread Marco Elver
On Tue, 2 Feb 2021 at 18:59, Eric Dumazet  wrote:
>
> On Mon, Feb 1, 2021 at 5:04 PM Marco Elver  wrote:
> >
> > Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
> > cloning an skb, save and restore truesize after pskb_expand_head(). This
> > can occur if the allocator decides to service an allocation of the same
> > size differently (e.g. use a different size class, or pass the
> > allocation on to KFENCE).
> >
> > Because truesize is used for bookkeeping (such as sk_wmem_queued), a
> > modified truesize of a cloned skb may result in corrupt bookkeeping and
> > relevant warnings (such as in sk_stream_kill_queues()).
> >
> > Link: https://lkml.kernel.org/r/X9JR/j6dmmoy1...@elver.google.com
> > Reported-by: syzbot+7b99aafdcc2eedea6...@syzkaller.appspotmail.com
> > Suggested-by: Eric Dumazet 
> > Signed-off-by: Marco Elver 
>
> Signed-off-by: Eric Dumazet 

Thank you!


Re: [PATCH 01/12] kasan, mm: don't save alloc stacks twice

2021-02-02 Thread Marco Elver
On Tue, 2 Feb 2021 at 19:01, 'Andrey Konovalov' via kasan-dev
 wrote:
[...]
> > > @@ -83,6 +83,7 @@ static inline void kasan_disable_current(void) {}
> > >  struct kasan_cache {
> > >   int alloc_meta_offset;
> > >   int free_meta_offset;
> > > + bool is_kmalloc;
[...]
> > >   if (kasan_stack_collection_enabled())
> > > - set_alloc_info(cache, (void *)object, flags);
> > > + set_alloc_info(cache, (void *)object, flags, kmalloc);
> >
> > It doesn't bother me too much, but: 'bool kmalloc' shadows function
> > 'kmalloc' so this is technically fine, but using 'kmalloc' as the
> > variable name here might be confusing and there is a small chance it
> > might cause problems in a future refactor.
>
> Good point. Does "is_kmalloc" sound good?

Sure, that's also consistent with the new struct field.

Thanks,
-- Marco


Re: WARNING in sk_stream_kill_queues (5)

2021-02-03 Thread Marco Elver
On Mon, 14 Dec 2020 at 11:09, Marco Elver  wrote:
> On Thu, 10 Dec 2020 at 20:01, Marco Elver  wrote:
> > On Thu, 10 Dec 2020 at 18:14, Eric Dumazet  wrote:
> > > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver  wrote:
> > [...]
> > > > So I started putting gdb to work, and whenever I see an allocation
> > > > exactly like the above that goes through tso_fragment() a warning
> > > > immediately follows.
> > > >
> > > > Long story short, I somehow synthesized this patch that appears to fix
> > > > things, but I can't explain why exactly:
> > > >
> > > > | --- a/net/core/skbuff.c
> > > > | +++ b/net/core/skbuff.c
> > > > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int 
> > > > nhead, int ntail,
> > > > |
> > > > |   skb_metadata_clear(skb);
> > > > |
> > > > | - /* It is not generally safe to change skb->truesize.
> > > > | -  * For the moment, we really care of rx path, or
> > > > | -  * when skb is orphaned (not attached to a socket).
> > > > | -  */
> > > > | - if (!skb->sk || skb->destructor == sock_edemux)
> > > > | - skb->truesize += size - osize;
> > > > | -
> > > > |   return 0;
> > > > |
> > > > |  nofrags:
> > > >
> > > > Now, here are the breadcrumbs I followed:
> > > >
> > > >
> > > > 1.  Breakpoint on kfence_ksize() -- first allocation that matches 
> > > > the above:
> > > >
> > > > | #0  __kfence_ksize (s=18446612700164612096) at 
> > > > mm/kfence/core.c:726
> > > > | #1  0x816fbf30 in kfence_ksize 
> > > > (addr=0x888436856000) at mm/kfence/core.c:737
> > > > | #2  0x816217cf in ksize (objp=0x888436856000) at 
> > > > mm/slab_common.c:1178
> > > > | #3  0x84896911 in __alloc_skb (size=914710528, 
> > > > gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > > | #4  0x84d0ba73 in alloc_skb_fclone 
> > > > (priority=, size=) at 
> > > > ./include/linux/skbuff.h:1144
> > > > | #5  sk_stream_alloc_skb (sk=0x8881176cc000, size=0, 
> > > > gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > > > | #6  0x84d41c36 in tso_fragment (gfp=, 
> > > > mss_now=, len=,
> > > > | skb=, sk=) at 
> > > > net/ipv4/tcp_output.c:2124
> > > > | #7  tcp_write_xmit (sk=0x8881176cc000, mss_now=21950, 
> > > > nonagle=3096, push_one=-1996874776, gfp=0)
> > > > | at net/ipv4/tcp_output.c:2674
> > > > | #8  0x84d43e48 in __tcp_push_pending_frames 
> > > > (sk=0x8881176cc000, cur_mss=337, nonagle=0)
> > > > | at ./include/net/sock.h:918
> > > > | #9  0x84d3259c in tcp_push_pending_frames 
> > > > (sk=) at ./include/net/tcp.h:1864
> > > > | #10 tcp_data_snd_check (sk=) at 
> > > > net/ipv4/tcp_input.c:5374
> > > > | #11 tcp_rcv_established (sk=0x8881176cc000, skb=0x0 
> > > > ) at net/ipv4/tcp_input.c:5869
> > > > | #12 0x84d56731 in tcp_v4_do_rcv 
> > > > (sk=0x8881176cc000, skb=0x888117f52ea0) at 
> > > > net/ipv4/tcp_ipv4.c:1668
> > > > | [...]
> > > >
> > > > Set watchpoint on skb->truesize:
> > > >
> > > > | (gdb) frame 3
> > > > | #3  0x84896911 in __alloc_skb (size=914710528, 
> > > > gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > > | 217 size = SKB_WITH_OVERHEAD(ksize(data));
> > > > | (gdb) p &skb->truesize
> > > > | $5 = (unsigned int *) 0x888117f55f90
> > > > | (gdb) awatch *0x888117f55f90
> > > > | Hardware access (read/write) watchpoint 6: *0x888117f55f90
> > > >
> > > > 2.  Some time later, we see that the skb with kfence-allocated data
> > > > is cloned:
> > > >
> > > > | Thread 7 hit Hardware access (read/write) watchpoint 6: 
> > > > *0x888117f55f90
> > > > |
> > > > | Value = 1570
> > >

Re: [PATCH 05/12] kasan: unify large kfree checks

2021-02-03 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> Unify checks in kasan_kfree_large() and in kasan_slab_free_mempool()
> for large allocations as it's done for small kfree() allocations.
> 
> With this change, kasan_slab_free_mempool() starts checking that the
> first byte of the memory that's being freed is accessible.
> 
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

> ---
>  include/linux/kasan.h | 16 
>  mm/kasan/common.c | 36 ++--
>  2 files changed, 34 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/kasan.h b/include/linux/kasan.h
> index 2d5de4092185..d53ea3c047bc 100644
> --- a/include/linux/kasan.h
> +++ b/include/linux/kasan.h
> @@ -200,6 +200,13 @@ static __always_inline bool kasan_slab_free(struct 
> kmem_cache *s, void *object)
>   return false;
>  }
>  
> +void __kasan_kfree_large(void *ptr, unsigned long ip);
> +static __always_inline void kasan_kfree_large(void *ptr)
> +{
> + if (kasan_enabled())
> + __kasan_kfree_large(ptr, _RET_IP_);
> +}
> +
>  void __kasan_slab_free_mempool(void *ptr, unsigned long ip);
>  static __always_inline void kasan_slab_free_mempool(void *ptr)
>  {
> @@ -247,13 +254,6 @@ static __always_inline void * __must_check 
> kasan_krealloc(const void *object,
>   return (void *)object;
>  }
>  
> -void __kasan_kfree_large(void *ptr, unsigned long ip);
> -static __always_inline void kasan_kfree_large(void *ptr)
> -{
> - if (kasan_enabled())
> - __kasan_kfree_large(ptr, _RET_IP_);
> -}
> -
>  /*
>   * Unlike kasan_check_read/write(), kasan_check_byte() is performed even for
>   * the hardware tag-based mode that doesn't rely on compiler instrumentation.
> @@ -302,6 +302,7 @@ static inline bool kasan_slab_free(struct kmem_cache *s, 
> void *object)
>  {
>   return false;
>  }
> +static inline void kasan_kfree_large(void *ptr) {}
>  static inline void kasan_slab_free_mempool(void *ptr) {}
>  static inline void *kasan_slab_alloc(struct kmem_cache *s, void *object,
>  gfp_t flags)
> @@ -322,7 +323,6 @@ static inline void *kasan_krealloc(const void *object, 
> size_t new_size,
>  {
>   return (void *)object;
>  }
> -static inline void kasan_kfree_large(void *ptr) {}
>  static inline bool kasan_check_byte(const void *address)
>  {
>   return true;
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 086bb77292b6..9c64a00bbf9c 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -364,6 +364,31 @@ bool __kasan_slab_free(struct kmem_cache *cache, void 
> *object, unsigned long ip)
>   return kasan_slab_free(cache, object, ip, true);
>  }
>  
> +static bool kasan_kfree_large(void *ptr, unsigned long ip)
> +{
> + if (ptr != page_address(virt_to_head_page(ptr))) {
> + kasan_report_invalid_free(ptr, ip);
> + return true;
> + }
> +
> + if (!kasan_byte_accessible(ptr)) {
> + kasan_report_invalid_free(ptr, ip);
> + return true;
> + }
> +
> + /*
> +  * The object will be poisoned by kasan_free_pages() or
> +  * kasan_slab_free_mempool().
> +  */
> +
> + return false;
> +}
> +
> +void __kasan_kfree_large(void *ptr, unsigned long ip)
> +{
> + kasan_kfree_large(ptr, ip);
> +}
> +
>  void __kasan_slab_free_mempool(void *ptr, unsigned long ip)
>  {
>   struct page *page;
> @@ -377,10 +402,8 @@ void __kasan_slab_free_mempool(void *ptr, unsigned long 
> ip)
>* KMALLOC_MAX_SIZE, and kmalloc falls back onto page_alloc.
>*/
>   if (unlikely(!PageSlab(page))) {
> - if (ptr != page_address(page)) {
> - kasan_report_invalid_free(ptr, ip);
> + if (kasan_kfree_large(ptr, ip))
>   return;
> - }
>   kasan_poison(ptr, page_size(page), KASAN_FREE_PAGE);
>   } else {
>   kasan_slab_free(page->slab_cache, ptr, ip, false);
> @@ -539,13 +562,6 @@ void * __must_check __kasan_krealloc(const void *object, 
> size_t size, gfp_t flag
>   return kasan_kmalloc(page->slab_cache, object, size, flags);
>  }
>  
> -void __kasan_kfree_large(void *ptr, unsigned long ip)
> -{
> - if (ptr != page_address(virt_to_head_page(ptr)))
> - kasan_report_invalid_free(ptr, ip);
> - /* The object will be poisoned by kasan_free_pages(). */
> -}
> -
>  bool __kasan_check_byte(const void *address, unsigned long ip)
>  {
>   if (!kasan_byte_accessible(address)) {
> -- 
> 2.30.0.365.g02bc693789-goog
> 


Re: [PATCH 08/12] kasan, mm: optimize krealloc poisoning

2021-02-03 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> Currently, krealloc() always calls ksize(), which unpoisons the whole
> object including the redzone. This is inefficient, as kasan_krealloc()
> repoisons the redzone for objects that fit into the same buffer.
> 
> This patch changes krealloc() instrumentation to use uninstrumented
> __ksize() that doesn't unpoison the memory. Instead, kasan_krealloc()
> is changed to unpoison the memory excluding the redzone.
> 
> For objects that don't fit into the old allocation, this patch disables
> KASAN accessibility checks when copying memory into a new object instead
> of unpoisoning it.
> 
> Signed-off-by: Andrey Konovalov 
> ---
>  mm/kasan/common.c | 12 ++--
>  mm/slab_common.c  | 20 ++--
>  2 files changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 9c64a00bbf9c..a51d6ea580b0 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -476,7 +476,7 @@ static void *kasan_kmalloc(struct kmem_cache *cache, 
> const void *object,
>  
>   /*
>* The object has already been unpoisoned by kasan_slab_alloc() for
> -  * kmalloc() or by ksize() for krealloc().
> +  * kmalloc() or by kasan_krealloc() for krealloc().
>*/
>  
>   /*
> @@ -526,7 +526,7 @@ void * __must_check __kasan_kmalloc_large(const void 
> *ptr, size_t size,
>  
>   /*
>* The object has already been unpoisoned by kasan_alloc_pages() for
> -  * alloc_pages() or by ksize() for krealloc().
> +  * alloc_pages() or by kasan_krealloc() for krealloc().
>*/
>  
>   /*
> @@ -554,8 +554,16 @@ void * __must_check __kasan_krealloc(const void *object, 
> size_t size, gfp_t flag
>   if (unlikely(object == ZERO_SIZE_PTR))
>   return (void *)object;
>  
> + /*
> +  * Unpoison the object's data.
> +  * Part of it might already have been unpoisoned, but it's unknown
> +  * how big that part is.
> +  */
> + kasan_unpoison(object, size);
> +
>   page = virt_to_head_page(object);
>  
> + /* Piggy-back on kmalloc() instrumentation to poison the redzone. */
>   if (unlikely(!PageSlab(page)))
>   return __kasan_kmalloc_large(object, size, flags);
>   else
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index dad70239b54c..821f657d38b5 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1140,19 +1140,27 @@ static __always_inline void *__do_krealloc(const void 
> *p, size_t new_size,
>   void *ret;
>   size_t ks;
>  
> - if (likely(!ZERO_OR_NULL_PTR(p)) && !kasan_check_byte(p))
> - return NULL;
> -
> - ks = ksize(p);
> + /* Don't use instrumented ksize to allow precise KASAN poisoning. */
> + if (likely(!ZERO_OR_NULL_PTR(p))) {
> + if (!kasan_check_byte(p))
> + return NULL;
> + ks = __ksize(p);
> + } else
> + ks = 0;
>  

This unfortunately broke KFENCE:
https://syzkaller.appspot.com/bug?extid=e444e1006d07feef0ef3 + various
other false positives.

We need to use ksize() here, as __ksize() is unaware of KFENCE. Or
somehow add the same check here that ksize() uses to get the real object
size.
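
One way to do that while keeping the uninstrumented __ksize() would be
to special-case KFENCE explicitly, e.g. (sketch only; assumes
kfence_ksize() is available here and returns 0 for non-KFENCE
addresses):

	if (likely(!ZERO_OR_NULL_PTR(p))) {
		if (!kasan_check_byte(p))
			return NULL;
		/* __ksize() is unaware of KFENCE objects. */
		ks = kfence_ksize(p) ?: __ksize(p);
	} else
		ks = 0;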

> + /* If the object still fits, repoison it precisely. */
>   if (ks >= new_size) {
>   p = kasan_krealloc((void *)p, new_size, flags);
>   return (void *)p;
>   }
>  
>   ret = kmalloc_track_caller(new_size, flags);
> - if (ret && p)
> - memcpy(ret, p, ks);
> + if (ret && p) {
> + /* Disable KASAN checks as the object's redzone is accessed. */
> + kasan_disable_current();
> + memcpy(ret, kasan_reset_tag(p), ks);
> + kasan_enable_current();
> + }
>  
>   return ret;
>  }
> -- 
> 2.30.0.365.g02bc693789-goog
> 


Re: [PATCH 06/12] kasan: rework krealloc tests

2021-02-03 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> This patch reworks KASAN-KUnit tests for krealloc() to:
> 
> 1. Check both slab and page_alloc based krealloc() implementations.
> 2. Allow at least one full granule to fit between old and new sizes for
>each KASAN mode, and check accesses to that granule accordingly.
> 
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

> ---
>  lib/test_kasan.c | 91 ++--
>  1 file changed, 81 insertions(+), 10 deletions(-)
> 
> diff --git a/lib/test_kasan.c b/lib/test_kasan.c
> index 5699e43ca01b..2bb52853f341 100644
> --- a/lib/test_kasan.c
> +++ b/lib/test_kasan.c
> @@ -258,11 +258,14 @@ static void kmalloc_large_oob_right(struct kunit *test)
>   kfree(ptr);
>  }
>  
> -static void kmalloc_oob_krealloc_more(struct kunit *test)
> +static void krealloc_more_oob_helper(struct kunit *test,
> + size_t size1, size_t size2)
>  {
>   char *ptr1, *ptr2;
> - size_t size1 = 17;
> - size_t size2 = 19;
> + size_t middle;
> +
> + KUNIT_ASSERT_LT(test, size1, size2);
> + middle = size1 + (size2 - size1) / 2;
>  
>   ptr1 = kmalloc(size1, GFP_KERNEL);
>   KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
> @@ -270,15 +273,31 @@ static void kmalloc_oob_krealloc_more(struct kunit 
> *test)
>   ptr2 = krealloc(ptr1, size2, GFP_KERNEL);
>   KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
>  
> - KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x');
> + /* All offsets up to size2 must be accessible. */
> + ptr2[size1 - 1] = 'x';
> + ptr2[size1] = 'x';
> + ptr2[middle] = 'x';
> + ptr2[size2 - 1] = 'x';
> +
> + /* Generic mode is precise, so unaligned size2 must be inaccessible. */
> + if (IS_ENABLED(CONFIG_KASAN_GENERIC))
> + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2] = 'x');
> +
> + /* For all modes first aligned offset after size2 must be inaccessible. 
> */
> + KUNIT_EXPECT_KASAN_FAIL(test,
> + ptr2[round_up(size2, KASAN_GRANULE_SIZE)] = 'x');
> +
>   kfree(ptr2);
>  }
>  
> -static void kmalloc_oob_krealloc_less(struct kunit *test)
> +static void krealloc_less_oob_helper(struct kunit *test,
> + size_t size1, size_t size2)
>  {
>   char *ptr1, *ptr2;
> - size_t size1 = 17;
> - size_t size2 = 15;
> + size_t middle;
> +
> + KUNIT_ASSERT_LT(test, size2, size1);
> + middle = size2 + (size1 - size2) / 2;
>  
>   ptr1 = kmalloc(size1, GFP_KERNEL);
>   KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
> @@ -286,10 +305,60 @@ static void kmalloc_oob_krealloc_less(struct kunit 
> *test)
>   ptr2 = krealloc(ptr1, size2, GFP_KERNEL);
>   KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
>  
> - KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x');
> + /* Must be accessible for all modes. */
> + ptr2[size2 - 1] = 'x';
> +
> + /* Generic mode is precise, so unaligned size2 must be inaccessible. */
> + if (IS_ENABLED(CONFIG_KASAN_GENERIC))
> + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2] = 'x');
> +
> + /* For all modes first aligned offset after size2 must be inaccessible. 
> */
> + KUNIT_EXPECT_KASAN_FAIL(test,
> + ptr2[round_up(size2, KASAN_GRANULE_SIZE)] = 'x');
> +
> + /*
> +  * For all modes both middle and size1 should land in separate granules

middle, size1, and size2?

> +  * and thus be inaccessible.
> +  */
> + KUNIT_EXPECT_LE(test, round_up(size2, KASAN_GRANULE_SIZE),
> + round_down(middle, KASAN_GRANULE_SIZE));
> + KUNIT_EXPECT_LE(test, round_up(middle, KASAN_GRANULE_SIZE),
> + round_down(size1, KASAN_GRANULE_SIZE));
> + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[middle] = 'x');
> + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size1 - 1] = 'x');
> + KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size1] = 'x');
> +
>   kfree(ptr2);
>  }
>  
> +static void krealloc_more_oob(struct kunit *test)
> +{
> + krealloc_more_oob_helper(test, 201, 235);
> +}
> +
> +static void krealloc_less_oob(struct kunit *test)
> +{
> + krealloc_less_oob_helper(test, 235, 201);
> +}
> +
> +static void krealloc_pagealloc_more_oob(struct kunit *test)
> +{
> + /* page_alloc fallback in only implemented for SLUB. */
> + KASAN_TEST_NEEDS_CONFIG_ON(test, CONFIG_SLUB);
> +
> + krealloc_more_oob_helper(test

Re: [PATCH 07/12] kasan, mm: remove krealloc side-effect

2021-02-03 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> Currently, if krealloc() is called on a freed object with KASAN enabled,
> it allocates and returns a new object, but doesn't copy any memory from
> the old one as ksize() returns 0. This makes a caller believe that
> krealloc() succeeded (KASAN report is printed though).
>
> This patch adds an accessibility check into __do_krealloc(). If the check
> fails, krealloc() returns NULL. This check duplicates the one in ksize();
> this is fixed in the following patch.

I think "side-effect" is ambiguous, because either way behaviour of
krealloc differs from a kernel with KASAN disabled. Something like
"kasan, mm: fail krealloc on already freed object" perhaps?

> This patch also adds a KASAN-KUnit test to check krealloc() behaviour
> when it's called on a freed object.
> 
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

> ---
>  lib/test_kasan.c | 20 
>  mm/slab_common.c |  3 +++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/lib/test_kasan.c b/lib/test_kasan.c
> index 2bb52853f341..61bc894d9f7e 100644
> --- a/lib/test_kasan.c
> +++ b/lib/test_kasan.c
> @@ -359,6 +359,25 @@ static void krealloc_pagealloc_less_oob(struct kunit 
> *test)
>   KMALLOC_MAX_CACHE_SIZE + 201);
>  }
>  
> +/*
> + * Check that krealloc() detects a use-after-free, returns NULL,
> + * and doesn't unpoison the freed object.
> + */
> +static void krealloc_uaf(struct kunit *test)
> +{
> + char *ptr1, *ptr2;
> + int size1 = 201;
> + int size2 = 235;
> +
> + ptr1 = kmalloc(size1, GFP_KERNEL);
> + KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
> + kfree(ptr1);
> +
> + KUNIT_EXPECT_KASAN_FAIL(test, ptr2 = krealloc(ptr1, size2, GFP_KERNEL));
> + KUNIT_ASSERT_PTR_EQ(test, (void *)ptr2, NULL);
> + KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)ptr1);
> +}
> +
>  static void kmalloc_oob_16(struct kunit *test)
>  {
>   struct {
> @@ -1056,6 +1075,7 @@ static struct kunit_case kasan_kunit_test_cases[] = {
>   KUNIT_CASE(krealloc_less_oob),
>   KUNIT_CASE(krealloc_pagealloc_more_oob),
>   KUNIT_CASE(krealloc_pagealloc_less_oob),
> + KUNIT_CASE(krealloc_uaf),
>   KUNIT_CASE(kmalloc_oob_16),
>   KUNIT_CASE(kmalloc_uaf_16),
>   KUNIT_CASE(kmalloc_oob_in_memset),
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 39d1a8ff9bb8..dad70239b54c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1140,6 +1140,9 @@ static __always_inline void *__do_krealloc(const void 
> *p, size_t new_size,
>   void *ret;
>   size_t ks;
>  
> + if (likely(!ZERO_OR_NULL_PTR(p)) && !kasan_check_byte(p))
> + return NULL;
> +
>   ks = ksize(p);
>  
>   if (ks >= new_size) {
> -- 
> 2.30.0.365.g02bc693789-goog
> 


Re: [PATCH 09/12] kasan: ensure poisoning size alignment

2021-02-03 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> A previous change, d99f6a10c161 ("kasan: don't round_up too much"),
> attempted to simplify the code by adding a round_up(size) call into
> kasan_poison(). While this allows having fewer round_up() calls around
> the code, it results in round_up() being called multiple times.
> 
> This patch removes round_up() of size from kasan_poison() and ensures
> that all callers round_up() the size explicitly. This patch also adds
> WARN_ON() alignment checks for address and size to kasan_poison() and
> kasan_unpoison().
> 
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

> ---
>  mm/kasan/common.c |  9 ++---
>  mm/kasan/kasan.h  | 33 -
>  mm/kasan/shadow.c | 37 ++---
>  3 files changed, 48 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index a51d6ea580b0..5691cca69397 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -261,7 +261,8 @@ void __kasan_unpoison_object_data(struct kmem_cache 
> *cache, void *object)
>  
>  void __kasan_poison_object_data(struct kmem_cache *cache, void *object)
>  {
> - kasan_poison(object, cache->object_size, KASAN_KMALLOC_REDZONE);
> + kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE),
> + KASAN_KMALLOC_REDZONE);
>  }
>  
>  /*
> @@ -348,7 +349,8 @@ static bool kasan_slab_free(struct kmem_cache *cache, 
> void *object,
>   return true;
>   }
>  
> - kasan_poison(object, cache->object_size, KASAN_KMALLOC_FREE);
> + kasan_poison(object, round_up(cache->object_size, KASAN_GRANULE_SIZE),
> + KASAN_KMALLOC_FREE);
>  
>   if ((IS_ENABLED(CONFIG_KASAN_GENERIC) && !quarantine))
>   return false;
> @@ -490,7 +492,8 @@ static void *kasan_kmalloc(struct kmem_cache *cache, 
> const void *object,
>   /* Poison the aligned part of the redzone. */
>   redzone_start = round_up((unsigned long)(object + size),
>   KASAN_GRANULE_SIZE);
> - redzone_end = (unsigned long)object + cache->object_size;
> + redzone_end = round_up((unsigned long)(object + cache->object_size),
> + KASAN_GRANULE_SIZE);
>   kasan_poison((void *)redzone_start, redzone_end - redzone_start,
>  KASAN_KMALLOC_REDZONE);
>  
> diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
> index 6a2882997f23..2f7400a3412f 100644
> --- a/mm/kasan/kasan.h
> +++ b/mm/kasan/kasan.h
> @@ -321,30 +321,37 @@ static inline u8 kasan_random_tag(void) { return 0; }
>  
>  #ifdef CONFIG_KASAN_HW_TAGS
>  
> -static inline void kasan_poison(const void *address, size_t size, u8 value)
> +static inline void kasan_poison(const void *addr, size_t size, u8 value)
>  {
> - address = kasan_reset_tag(address);
> + addr = kasan_reset_tag(addr);
>  
>   /* Skip KFENCE memory if called explicitly outside of sl*b. */
> - if (is_kfence_address(address))
> + if (is_kfence_address(addr))
>   return;
>  
> - hw_set_mem_tag_range((void *)address,
> - round_up(size, KASAN_GRANULE_SIZE), value);
> + if (WARN_ON((u64)addr & KASAN_GRANULE_MASK))
> + return;
> + if (WARN_ON(size & KASAN_GRANULE_MASK))
> + return;
> +
> + hw_set_mem_tag_range((void *)addr, size, value);
>  }
>  
> -static inline void kasan_unpoison(const void *address, size_t size)
> +static inline void kasan_unpoison(const void *addr, size_t size)
>  {
> - u8 tag = get_tag(address);
> + u8 tag = get_tag(addr);
>  
> - address = kasan_reset_tag(address);
> + addr = kasan_reset_tag(addr);
>  
>   /* Skip KFENCE memory if called explicitly outside of sl*b. */
> - if (is_kfence_address(address))
> + if (is_kfence_address(addr))
>   return;
>  
> - hw_set_mem_tag_range((void *)address,
> - round_up(size, KASAN_GRANULE_SIZE), tag);
> + if (WARN_ON((u64)addr & KASAN_GRANULE_MASK))
> + return;
> + size = round_up(size, KASAN_GRANULE_SIZE);
> +
> + hw_set_mem_tag_range((void *)addr, size, tag);
>  }
>  
>  static inline bool kasan_byte_accessible(const void *addr)
> @@ -361,7 +368,7 @@ static inline bool kasan_byte_accessible(const void *addr)
>  /**
>   * kasan_poison - mark the memory range as unaccessible
>   * @addr - range start address, must be aligned to KASAN_GRANULE_SIZE
> - * @size - range size
> + * @size - range size, must be aligned to KASAN_

Re: [PATCH 11/12] kasan: always inline HW_TAGS helper functions

2021-02-03 Thread Marco Elver
On Mon, Feb 01, 2021 at 08:43PM +0100, Andrey Konovalov wrote:
> Mark all static functions in common.c and kasan.h that are used for
> hardware tag-based KASAN as __always_inline to avoid unnecessary
> function calls.
> 
> Signed-off-by: Andrey Konovalov 

Does objtool complain about any of these?

I'm not sure this is unconditionally a good idea. If there isn't a
quantifiable performance bug or case where we cannot call a function,
perhaps we can just let the compiler decide?

More comments below.

> ---
>  mm/kasan/common.c | 13 +++--
>  mm/kasan/kasan.h  |  6 +++---
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 5691cca69397..2004ecd6e43c 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -279,7 +279,8 @@ void __kasan_poison_object_data(struct kmem_cache *cache, 
> void *object)
>   *based on objects indexes, so that objects that are next to each other
>   *get different tags.
>   */
> -static u8 assign_tag(struct kmem_cache *cache, const void *object, bool init)
> +static __always_inline u8 assign_tag(struct kmem_cache *cache,
> + const void *object, bool init)

This function might be small enough that it's fine.

>  {
>   if (IS_ENABLED(CONFIG_KASAN_GENERIC))
>   return 0xff;
> @@ -321,8 +322,8 @@ void * __must_check __kasan_init_slab_obj(struct 
> kmem_cache *cache,
>   return (void *)object;
>  }
>  
> -static bool kasan_slab_free(struct kmem_cache *cache, void *object,
> -   unsigned long ip, bool quarantine)
> +static __always_inline bool kasan_slab_free(struct kmem_cache *cache,
> + void *object, unsigned long ip, bool quarantine)
>  {

Because kasan_slab_free() is tail-called by __kasan_slab_free() and
__kasan_slab_free_mempool(), there should never be a call (and if there
is, we need to figure out why). The additional code bloat and I-cache
pressure might be worse than just a jump. I'd let the compiler decide.

>   u8 tag;
>   void *tagged_object;
> @@ -366,7 +367,7 @@ bool __kasan_slab_free(struct kmem_cache *cache, void 
> *object, unsigned long ip)
>   return kasan_slab_free(cache, object, ip, true);
>  }
>  
> -static bool kasan_kfree_large(void *ptr, unsigned long ip)
> +static __always_inline bool kasan_kfree_large(void *ptr, unsigned long 
> ip)
>  {

This one is tail-called by __kasan_kfree_large(). The usage in
__kasan_slab_free_mempool() is in an unlikely branch.

>   if (ptr != page_address(virt_to_head_page(ptr))) {
>   kasan_report_invalid_free(ptr, ip);
> @@ -461,8 +462,8 @@ void * __must_check __kasan_slab_alloc(struct kmem_cache 
> *cache,
>   return tagged_object;
>  }
>  
> -static void *kasan_kmalloc(struct kmem_cache *cache, const void *object,
> - size_t size, gfp_t flags)
> +static __always_inline void *kasan_kmalloc(struct kmem_cache *cache,
> + const void *object, size_t size, gfp_t flags)
>  {

Also only tail-called.

>   unsigned long redzone_start;
>   unsigned long redzone_end;
> diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
> index 2f7400a3412f..d5fe72747a53 100644
> --- a/mm/kasan/kasan.h
> +++ b/mm/kasan/kasan.h
> @@ -321,7 +321,7 @@ static inline u8 kasan_random_tag(void) { return 0; }
>  
>  #ifdef CONFIG_KASAN_HW_TAGS
>  
> -static inline void kasan_poison(const void *addr, size_t size, u8 value)
> +static __always_inline void kasan_poison(const void *addr, size_t size, u8 
> value)
>  {
>   addr = kasan_reset_tag(addr);
>  
> @@ -337,7 +337,7 @@ static inline void kasan_poison(const void *addr, size_t 
> size, u8 value)
>   hw_set_mem_tag_range((void *)addr, size, value);
>  }
>  
> -static inline void kasan_unpoison(const void *addr, size_t size)
> +static __always_inline void kasan_unpoison(const void *addr, size_t size)
>  {

Not sure about these 2. They should be small, but it's hard to say what
is ideal on which architecture.

>   u8 tag = get_tag(addr);
>  
> @@ -354,7 +354,7 @@ static inline void kasan_unpoison(const void *addr, 
> size_t size)
>   hw_set_mem_tag_range((void *)addr, size, tag);
>  }
>  
> -static inline bool kasan_byte_accessible(const void *addr)
> +static __always_inline bool kasan_byte_accessible(const void *addr)

This function is essentially a macro, and if the compiler uninlined it
we could argue that's a compiler bug. But I'm not sure we need the
__always_inline, unless you've actually seen it uninlined.

>  {
>   u8 ptr_tag = get_tag(addr);
>   u8 mem_tag = hw_get_mem_tag((void *)addr);
> -- 
> 2.30.0.365.g02bc693789-goog
> 


Re: [PATCH mm] kasan: untag addresses for KFENCE

2021-02-01 Thread Marco Elver
On Fri, 29 Jan 2021 at 19:50, Andrey Konovalov  wrote:
>
> KFENCE annotations operate on untagged addresses.
>
> Untag addresses in KASAN runtime where they might be tagged.
>
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

Thank you!

> ---
>
> This can be squashed into:
>
> revert kasan-remove-kfence-leftovers
> kfence, kasan: make KFENCE compatible with KASA
>
> ---
>  mm/kasan/common.c |  2 +-
>  mm/kasan/kasan.h  | 12 +---
>  2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index a390fae9d64b..fe852f3cfa42 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -416,7 +416,7 @@ static void *kasan_kmalloc(struct kmem_cache *cache, 
> const void *object,
> if (unlikely(object == NULL))
> return NULL;
>
> -   if (is_kfence_address(object))
> +   if (is_kfence_address(kasan_reset_tag(object)))
> return (void *)object;
>
> redzone_start = round_up((unsigned long)(object + size),
> diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
> index 11c6e3650468..4fb8106f8e31 100644
> --- a/mm/kasan/kasan.h
> +++ b/mm/kasan/kasan.h
> @@ -320,22 +320,28 @@ static inline u8 kasan_random_tag(void) { return 0; }
>
>  static inline void kasan_poison(const void *address, size_t size, u8 value)
>  {
> +   address = kasan_reset_tag(address);
> +
> /* Skip KFENCE memory if called explicitly outside of sl*b. */
> if (is_kfence_address(address))
> return;
>
> -   hw_set_mem_tag_range(kasan_reset_tag(address),
> +   hw_set_mem_tag_range((void *)address,
> round_up(size, KASAN_GRANULE_SIZE), value);
>  }
>
>  static inline void kasan_unpoison(const void *address, size_t size)
>  {
> +   u8 tag = get_tag(address);
> +
> +   address = kasan_reset_tag(address);
> +
> /* Skip KFENCE memory if called explicitly outside of sl*b. */
> if (is_kfence_address(address))
> return;
>
> -   hw_set_mem_tag_range(kasan_reset_tag(address),
> -   round_up(size, KASAN_GRANULE_SIZE), get_tag(address));
> +   hw_set_mem_tag_range((void *)address,
> +   round_up(size, KASAN_GRANULE_SIZE), tag);
>  }
>
>  static inline bool kasan_byte_accessible(const void *addr)
> --
> 2.30.0.365.g02bc693789-goog
>


[PATCH net-next] net: fix up truesize of cloned skb in skb_prepare_for_shift()

2021-02-01 Thread Marco Elver
Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
cloning an skb, save and restore truesize after pskb_expand_head(). This
can occur if the allocator decides to service an allocation of the same
size differently (e.g. use a different size class, or pass the
allocation on to KFENCE).

Because truesize is used for bookkeeping (such as sk_wmem_queued), a
modified truesize of a cloned skb may result in corrupt bookkeeping and
relevant warnings (such as in sk_stream_kill_queues()).

Link: https://lkml.kernel.org/r/X9JR/j6dmmoy1...@elver.google.com
Reported-by: syzbot+7b99aafdcc2eedea6...@syzkaller.appspotmail.com
Suggested-by: Eric Dumazet 
Signed-off-by: Marco Elver 
---
 net/core/skbuff.c | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2af12f7e170c..3787093239f5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3289,7 +3289,19 @@ EXPORT_SYMBOL(skb_split);
  */
 static int skb_prepare_for_shift(struct sk_buff *skb)
 {
-   return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+   int ret = 0;
+
+   if (skb_cloned(skb)) {
+   /* Save and restore truesize: pskb_expand_head() may reallocate
+* memory where ksize(kmalloc(S)) != ksize(kmalloc(S)), but we
+* cannot change truesize at this point.
+*/
+   unsigned int save_truesize = skb->truesize;
+
+   ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+   skb->truesize = save_truesize;
+   }
+   return ret;
 }
 
 /**

base-commit: 14e8e0f6008865d823a8184a276702a6c3cbef3d
-- 
2.30.0.365.g02bc693789-goog



Re: [PATCH net-next] net: fix up truesize of cloned skb in skb_prepare_for_shift()

2021-02-01 Thread Marco Elver
On Mon, 1 Feb 2021 at 17:50, Christoph Paasch
 wrote:
> On Mon, Feb 1, 2021 at 8:09 AM Marco Elver  wrote:
> >
> > Avoid the assumption that ksize(kmalloc(S)) == ksize(kmalloc(S)): when
> > cloning an skb, save and restore truesize after pskb_expand_head(). This
> > can occur if the allocator decides to service an allocation of the same
> > size differently (e.g. use a different size class, or pass the
> > allocation on to KFENCE).
> >
> > Because truesize is used for bookkeeping (such as sk_wmem_queued), a
> > modified truesize of a cloned skb may result in corrupt bookkeeping and
> > relevant warnings (such as in sk_stream_kill_queues()).
> >
> > Link: https://lkml.kernel.org/r/X9JR/j6dmmoy1...@elver.google.com
> > Reported-by: syzbot+7b99aafdcc2eedea6...@syzkaller.appspotmail.com
> > Suggested-by: Eric Dumazet 
> > Signed-off-by: Marco Elver 
> > ---
> >  net/core/skbuff.c | 14 +-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 2af12f7e170c..3787093239f5 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3289,7 +3289,19 @@ EXPORT_SYMBOL(skb_split);
> >   */
> >  static int skb_prepare_for_shift(struct sk_buff *skb)
> >  {
> > -   return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > +   int ret = 0;
> > +
> > +   if (skb_cloned(skb)) {
> > +   /* Save and restore truesize: pskb_expand_head() may 
> > reallocate
> > +* memory where ksize(kmalloc(S)) != ksize(kmalloc(S)), but 
> > we
> > +* cannot change truesize at this point.
> > +*/
> > +   unsigned int save_truesize = skb->truesize;
> > +
> > +   ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > +   skb->truesize = save_truesize;
> > +   }
> > +   return ret;
>
> just a few days ago we found out that this also fixes a syzkaller
> issue on MPTCP (https://github.com/multipath-tcp/mptcp_net-next/issues/136).
> I confirmed that this patch fixes the issue for us as well:
>
> Tested-by: Christoph Paasch 

That's interesting, because according to your config you did not have
KFENCE enabled. Although it's hard to say what exactly caused the
truesize mismatch in your case, because it clearly can't be KFENCE
that caused ksize(kmalloc(S))!=ksize(kmalloc(S)) for you.

Thanks,
-- Marco


Re: KCSAN: data-race in blk_stat_add / blk_stat_timer_fn (5)

2021-02-05 Thread Marco Elver
On Fri, 5 Feb 2021 at 18:00, syzbot
 wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:2ab38c17 mailmap: remove the "repo-abbrev" comment
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=130e19b4d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=38728258f37833e3
> dashboard link: https://syzkaller.appspot.com/bug?extid=2b6452167d85a022bc6f
> compiler:   clang version 12.0.0 
> (https://github.com/llvm/llvm-project.git 
> 913f6005669cfb590c99865a90bc51ed0983d09d)
>
> Unfortunately, I don't have any reproducer for this issue yet.
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+2b6452167d85a022b...@syzkaller.appspotmail.com
>
> ==
> BUG: KCSAN: data-race in blk_stat_add / blk_stat_timer_fn
>
> write to 0xe8d35c80 of 8 bytes by interrupt on cpu 0:
>  blk_rq_stat_init block/blk-stat.c:24 [inline]
>  blk_stat_timer_fn+0x349/0x410 block/blk-stat.c:95
>  call_timer_fn+0x2e/0x240 kernel/time/timer.c:1417
>  expire_timers+0x116/0x260 kernel/time/timer.c:1462
>  __run_timers+0x338/0x3d0 kernel/time/timer.c:1731
>  run_timer_softirq+0x19/0x30 kernel/time/timer.c:1744
>  __do_softirq+0x13c/0x2c3 kernel/softirq.c:343
>  asm_call_irq_on_stack+0xf/0x20
>  __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
>  run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
>  do_softirq_own_stack+0x32/0x40 arch/x86/kernel/irq_64.c:77
>  invoke_softirq kernel/softirq.c:226 [inline]
>  __irq_exit_rcu+0xb4/0xc0 kernel/softirq.c:420
>  sysvec_apic_timer_interrupt+0x74/0x90 arch/x86/kernel/apic/apic.c:1096
>  asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
>
> read to 0xe8d35c80 of 8 bytes by interrupt on cpu 1:
>  blk_rq_stat_add block/blk-stat.c:46 [inline]
>  blk_stat_add+0x13d/0x230 block/blk-stat.c:74
>  __blk_mq_end_request+0x142/0x230 block/blk-mq.c:546
>  scsi_end_request+0x2a6/0x470 drivers/scsi/scsi_lib.c:604
>  scsi_io_completion+0x104/0xfb0 drivers/scsi/scsi_lib.c:969
>  scsi_finish_command+0x263/0x2b0 drivers/scsi/scsi.c:214
>  scsi_softirq_done+0xdf/0x440 drivers/scsi/scsi_lib.c:1449
>  blk_done_softirq+0x145/0x190 block/blk-mq.c:588
>  __do_softirq+0x13c/0x2c3 kernel/softirq.c:343
>  asm_call_irq_on_stack+0xf/0x20
>  __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
>  run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
>  do_softirq_own_stack+0x32/0x40 arch/x86/kernel/irq_64.c:77
>  invoke_softirq kernel/softirq.c:226 [inline]
>  __irq_exit_rcu+0xb4/0xc0 kernel/softirq.c:420
>  common_interrupt+0xb5/0x130 arch/x86/kernel/irq.c:239
>  asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:619
>  check_access kernel/kcsan/core.c:633 [inline]
>  __tsan_read1+0x156/0x180 kernel/kcsan/core.c:839
>  tomoyo_get_mode security/tomoyo/util.c:1003 [inline]
>  tomoyo_init_request_info+0xfc/0x160 security/tomoyo/util.c:1031
>  tomoyo_path_perm+0x8b/0x330 security/tomoyo/file.c:815
>  tomoyo_inode_getattr+0x18/0x20 security/tomoyo/tomoyo.c:123
>  security_inode_getattr+0x7f/0xd0 security/security.c:1280
>  vfs_getattr fs/stat.c:121 [inline]
>  vfs_fstat+0x45/0x390 fs/stat.c:146
>  __do_sys_newfstat fs/stat.c:386 [inline]
>  __se_sys_newfstat+0x35/0x240 fs/stat.c:383
>  __x64_sys_newfstat+0x2d/0x40 fs/stat.c:383
>  do_syscall_64+0x39/0x80 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 1 PID: 18199 Comm: modprobe Not tainted 5.11.0-rc5-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> ==

I've been looking at some data races in block/. For this one I was
wondering if there are any requirements for the stats counters? E.g.
do they have to be somewhat consistent, or does it not matter at all?

Because as-is, with concurrent update and aggregation (followed by
reinit) of the per-CPU counters, the values in blk_rq_stat can become
quite inconsistent.

I wanted to throw together a fix for this, but wasn't sure what level
of error is tolerable for these counters. I thought of 3 options:

1. Just add more data_race() around them and accept whatever
inaccuracies we get due to the data races.

2. Add a per-CPU spinlock (see the sketch after this list). This should
be uncontended unless the timer fires too often.

3. Use per-CPU seqlock. Not sure this buys us much because the timer
also resets the per-CPU counters and has to be serialized with other
potential updaters.
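
For illustration, option 2 could look roughly like the sketch below. This
is not against any particular tree, and the struct/function names are made
up; it only shows the idea of guarding each per-CPU entry with its own lock
so that the timer can aggregate and re-initialize without tearing:

	/* Sketch only: per-CPU stats entry protected by a per-CPU spinlock. */
	struct blk_rq_stat_pcpu {
		spinlock_t lock;	/* protects the counters below */
		u64 nr_samples;
		u64 total;		/* running sum, for computing the mean */
	};

	static void pcpu_stat_add(struct blk_rq_stat_pcpu __percpu *pstat, u64 value)
	{
		struct blk_rq_stat_pcpu *s = get_cpu_ptr(pstat);
		unsigned long flags;

		spin_lock_irqsave(&s->lock, flags);
		s->nr_samples++;
		s->total += value;
		spin_unlock_irqrestore(&s->lock, flags);
		put_cpu_ptr(pstat);
	}

The timer callback would take each CPU's lock while summing and
re-initializing that CPU's entry, so aggregation never observes a
half-updated value.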

Thanks,
-- Marco


[PATCH] blk-mq-debugfs: mark concurrent stats counters as data races

2021-02-05 Thread Marco Elver
KCSAN reports that several of the blk-mq debugfs stats counters are
updated concurrently. Because blk-mq-debugfs does not demand precise
stats counters, potential lossy updates due to data races can be
tolerated. Therefore, mark and comment the accesses accordingly.

Reported-by: syzbot+2c308b859c8c103aa...@syzkaller.appspotmail.com
Reported-by: syzbot+44f9b37d2de57637d...@syzkaller.appspotmail.com
Reported-by: syzbot+49a9bcf457723ecaf...@syzkaller.appspotmail.com
Reported-by: syzbot+b9914ed52d5b1d63f...@syzkaller.appspotmail.com
Signed-off-by: Marco Elver 
---
Note: These 4 data races are among the most frequently encountered by
syzbot:

  https://syzkaller.appspot.com/bug?id=7994761095b9677fb8bccaf41a77a82d5f444839
  https://syzkaller.appspot.com/bug?id=08193ca23b80ec0e9bcbefba039162cff4f5d7a3
  https://syzkaller.appspot.com/bug?id=7c51c15438f963024c4a4b3a6d7e119f4bdb2199
  https://syzkaller.appspot.com/bug?id=6436cb57d04e8c5d6f0f40926d7511232aa2b5d4
---
 block/blk-mq-debugfs.c | 22 --
 block/blk-mq-sched.c   |  3 ++-
 block/blk-mq.c |  9 ++---
 3 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 4de03da9a624..687d201f0d7b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -554,15 +554,16 @@ static int hctx_dispatched_show(void *data, struct 
seq_file *m)
struct blk_mq_hw_ctx *hctx = data;
int i;
 
-   seq_printf(m, "%8u\t%lu\n", 0U, hctx->dispatched[0]);
+   seq_printf(m, "%8u\t%lu\n", 0U, data_race(hctx->dispatched[0]));
 
for (i = 1; i < BLK_MQ_MAX_DISPATCH_ORDER - 1; i++) {
unsigned int d = 1U << (i - 1);
 
-   seq_printf(m, "%8u\t%lu\n", d, hctx->dispatched[i]);
+   seq_printf(m, "%8u\t%lu\n", d, data_race(hctx->dispatched[i]));
}
 
-   seq_printf(m, "%8u+\t%lu\n", 1U << (i - 1), hctx->dispatched[i]);
+   seq_printf(m, "%8u+\t%lu\n", 1U << (i - 1),
+  data_race(hctx->dispatched[i]));
return 0;
 }
 
@@ -573,7 +574,7 @@ static ssize_t hctx_dispatched_write(void *data, const char 
__user *buf,
int i;
 
for (i = 0; i < BLK_MQ_MAX_DISPATCH_ORDER; i++)
-   hctx->dispatched[i] = 0;
+   data_race(hctx->dispatched[i] = 0);
return count;
 }
 
@@ -581,7 +582,7 @@ static int hctx_queued_show(void *data, struct seq_file *m)
 {
struct blk_mq_hw_ctx *hctx = data;
 
-   seq_printf(m, "%lu\n", hctx->queued);
+   seq_printf(m, "%lu\n", data_race(hctx->queued));
return 0;
 }
 
@@ -590,7 +591,7 @@ static ssize_t hctx_queued_write(void *data, const char 
__user *buf,
 {
struct blk_mq_hw_ctx *hctx = data;
 
-   hctx->queued = 0;
+   data_race(hctx->queued = 0);
return count;
 }
 
@@ -598,7 +599,7 @@ static int hctx_run_show(void *data, struct seq_file *m)
 {
struct blk_mq_hw_ctx *hctx = data;
 
-   seq_printf(m, "%lu\n", hctx->run);
+   seq_printf(m, "%lu\n", data_race(hctx->run));
return 0;
 }
 
@@ -607,7 +608,7 @@ static ssize_t hctx_run_write(void *data, const char __user 
*buf, size_t count,
 {
struct blk_mq_hw_ctx *hctx = data;
 
-   hctx->run = 0;
+   data_race(hctx->run = 0);
return count;
 }
 
@@ -702,7 +703,8 @@ static int ctx_completed_show(void *data, struct seq_file 
*m)
 {
struct blk_mq_ctx *ctx = data;
 
-   seq_printf(m, "%lu %lu\n", ctx->rq_completed[1], ctx->rq_completed[0]);
+   seq_printf(m, "%lu %lu\n", data_race(ctx->rq_completed[1]),
+  data_race(ctx->rq_completed[0]));
return 0;
 }
 
@@ -711,7 +713,7 @@ static ssize_t ctx_completed_write(void *data, const char 
__user *buf,
 {
struct blk_mq_ctx *ctx = data;
 
-   ctx->rq_completed[0] = ctx->rq_completed[1] = 0;
+   data_race(ctx->rq_completed[0] = ctx->rq_completed[1] = 0);
return count;
 }
 
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index deff4e826e23..71a49835e89a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -332,7 +332,8 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx 
*hctx)
if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
return;
 
-   hctx->run++;
+   /* data race ok: hctx->run only for debugfs stats. */
+   data_race(hctx->run++);
 
/*
 * A return of -EAGAIN is an indication that hctx->dispatch is not
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f285a9123a8b..1d8970602032 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -341,7 +341,8 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
}
   

Re: [PATCH] kasan: fix stack traces dependency for HW_TAGS

2021-02-08 Thread Marco Elver
On Mon, 8 Feb 2021 at 19:40, Andrey Konovalov  wrote:
>
> Currently, whether the alloc/free stack traces collection is enabled by
> default for hardware tag-based KASAN depends on CONFIG_DEBUG_KERNEL.
> The intention for this dependency was to only enable collection on slow
> debug kernels due to a significant perf and memory impact.
>
> As it turns out, CONFIG_DEBUG_KERNEL is not considered a debug option
> and is enabled on many production kernels, including Android and Ubuntu.
> As a result, this dependency is pointless and only complicates the code
> and documentation.
>
> Having stack trace collection disabled by default would make the hardware
> mode work differently from the software ones, which is confusing.
>
> This change removes the dependency and enables stack traces collection
> by default.
>
> Looking into the future, this default might make sense for production
> kernels, assuming we implement a fast stack trace collection approach.
>
> Signed-off-by: Andrey Konovalov 

Reviewed-by: Marco Elver 

I'm in favor of this simplification.

The fact that CONFIG_DEBUG_KERNEL cannot be relied upon to determine
if we're running a debug kernel or not is a bit unfortunate though.

Thanks!

> ---
>  Documentation/dev-tools/kasan.rst | 3 +--
>  mm/kasan/hw_tags.c| 8 ++--
>  2 files changed, 3 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/dev-tools/kasan.rst 
> b/Documentation/dev-tools/kasan.rst
> index 1651d961f06a..a248ac3941be 100644
> --- a/Documentation/dev-tools/kasan.rst
> +++ b/Documentation/dev-tools/kasan.rst
> @@ -163,8 +163,7 @@ particular KASAN features.
>  - ``kasan=off`` or ``=on`` controls whether KASAN is enabled (default: 
> ``on``).
>
>  - ``kasan.stacktrace=off`` or ``=on`` disables or enables alloc and free 
> stack
> -  traces collection (default: ``on`` for ``CONFIG_DEBUG_KERNEL=y``, otherwise
> -  ``off``).
> +  traces collection (default: ``on``).
>
>  - ``kasan.fault=report`` or ``=panic`` controls whether to only print a KASAN
>report or also panic the kernel (default: ``report``).
> diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c
> index e529428e7a11..d558799b25b3 100644
> --- a/mm/kasan/hw_tags.c
> +++ b/mm/kasan/hw_tags.c
> @@ -134,12 +134,8 @@ void __init kasan_init_hw_tags(void)
>
> switch (kasan_arg_stacktrace) {
> case KASAN_ARG_STACKTRACE_DEFAULT:
> -   /*
> -* Default to enabling stack trace collection for
> -* debug kernels.
> -*/
> -   if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
> -   static_branch_enable(&kasan_flag_stacktrace);
> +   /* Default to enabling stack trace collection. */
> +   static_branch_enable(&kasan_flag_stacktrace);
> break;
> case KASAN_ARG_STACKTRACE_OFF:
> /* Do nothing, kasan_flag_stacktrace keeps its default value. 
> */
> --
> 2.30.0.478.g8a0d178c01-goog
>


[PATCH] bpf_lru_list: Read double-checked variable once without lock

2021-02-09 Thread Marco Elver
For double-checked locking in bpf_common_lru_push_free(), node->type is
read outside the critical section and then re-checked under the lock.
However, concurrent writes to node->type result in data races.

For example, the following concurrent access was observed by KCSAN:

  write to 0x88801521bc22 of 1 bytes by task 10038 on cpu 1:
   __bpf_lru_node_move_in kernel/bpf/bpf_lru_list.c:91
   __local_list_flush kernel/bpf/bpf_lru_list.c:298
   ...
  read to 0x88801521bc22 of 1 bytes by task 10043 on cpu 0:
   bpf_common_lru_push_free  kernel/bpf/bpf_lru_list.c:507
   bpf_lru_push_free kernel/bpf/bpf_lru_list.c:555
   ...

Fix the data races where node->type is read outside the critical section
(for double-checked locking) by marking the access with READ_ONCE() as
well as ensuring the variable is only accessed once.

Reported-by: syzbot+3536db46dfa58c573...@syzkaller.appspotmail.com
Reported-by: syzbot+516acdb03d3e27d91...@syzkaller.appspotmail.com
Signed-off-by: Marco Elver 
---
Detailed reports:

https://groups.google.com/g/syzkaller-upstream-moderation/c/PwsoQ7bfi8k/m/NH9Ni2WxAQAJ

https://groups.google.com/g/syzkaller-upstream-moderation/c/-fXQO9ehxSM/m/RmQEcI2oAQAJ
---
 kernel/bpf/bpf_lru_list.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
index 1b6b9349cb85..d99e89f113c4 100644
--- a/kernel/bpf/bpf_lru_list.c
+++ b/kernel/bpf/bpf_lru_list.c
@@ -502,13 +502,14 @@ struct bpf_lru_node *bpf_lru_pop_free(struct bpf_lru 
*lru, u32 hash)
 static void bpf_common_lru_push_free(struct bpf_lru *lru,
 struct bpf_lru_node *node)
 {
+   u8 node_type = READ_ONCE(node->type);
unsigned long flags;
 
-   if (WARN_ON_ONCE(node->type == BPF_LRU_LIST_T_FREE) ||
-   WARN_ON_ONCE(node->type == BPF_LRU_LOCAL_LIST_T_FREE))
+   if (WARN_ON_ONCE(node_type == BPF_LRU_LIST_T_FREE) ||
+   WARN_ON_ONCE(node_type == BPF_LRU_LOCAL_LIST_T_FREE))
return;
 
-   if (node->type == BPF_LRU_LOCAL_LIST_T_PENDING) {
+   if (node_type == BPF_LRU_LOCAL_LIST_T_PENDING) {
struct bpf_lru_locallist *loc_l;
 
loc_l = per_cpu_ptr(lru->common_lru.local_list, node->cpu);
-- 
2.30.0.478.g8a0d178c01-goog



[PATCH mm] kfence: make reporting sensitive information configurable

2021-02-09 Thread Marco Elver
We cannot rely on CONFIG_DEBUG_KERNEL to decide if we're running a
"debug kernel" where we can safely show potentially sensitive
information in the kernel log.

Therefore, add the option CONFIG_KFENCE_REPORT_SENSITIVE to decide if we
should add potentially sensitive information to KFENCE reports. The
default behaviour remains unchanged.

Signed-off-by: Marco Elver 
---
 Documentation/dev-tools/kfence.rst | 6 +++---
 lib/Kconfig.kfence | 8 
 mm/kfence/core.c   | 2 +-
 mm/kfence/kfence.h | 3 +--
 mm/kfence/report.c | 6 +++---
 5 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/Documentation/dev-tools/kfence.rst 
b/Documentation/dev-tools/kfence.rst
index 58a0a5fa1ddc..5280d644f826 100644
--- a/Documentation/dev-tools/kfence.rst
+++ b/Documentation/dev-tools/kfence.rst
@@ -89,7 +89,7 @@ A typical out-of-bounds access looks like this::
 The header of the report provides a short summary of the function involved in
 the access. It is followed by more detailed information about the access and
 its origin. Note that, real kernel addresses are only shown for
-``CONFIG_DEBUG_KERNEL=y`` builds.
+``CONFIG_KFENCE_REPORT_SENSITIVE=y`` builds.
 
 Use-after-free accesses are reported as::
 
@@ -184,8 +184,8 @@ invalidly written bytes (offset from the address) are 
shown; in this
 representation, '.' denote untouched bytes. In the example above ``0xac`` is
 the value written to the invalid address at offset 0, and the remaining '.'
 denote that no following bytes have been touched. Note that, real values are
-only shown for ``CONFIG_DEBUG_KERNEL=y`` builds; to avoid information
-disclosure for non-debug builds, '!' is used instead to denote invalidly
+only shown for ``CONFIG_KFENCE_REPORT_SENSITIVE=y`` builds; to avoid
+information disclosure otherwise, '!' is used instead to denote invalidly
 written bytes.
 
 And finally, KFENCE may also report on invalid accesses to any protected page
diff --git a/lib/Kconfig.kfence b/lib/Kconfig.kfence
index 78f50ccb3b45..141494a5f530 100644
--- a/lib/Kconfig.kfence
+++ b/lib/Kconfig.kfence
@@ -55,6 +55,14 @@ config KFENCE_NUM_OBJECTS
  pages are required; with one containing the object and two adjacent
  ones used as guard pages.
 
+config KFENCE_REPORT_SENSITIVE
+   bool "Show potentially sensitive information in reports"
+   default y if DEBUG_KERNEL
+   help
+ Show potentially sensitive information such as unhashed pointers,
+ context bytes on memory corruptions, as well as dump registers in
+ KFENCE reports.
+
 config KFENCE_STRESS_TEST_FAULTS
int "Stress testing of fault handling and error reporting" if EXPERT
default 0
diff --git a/mm/kfence/core.c b/mm/kfence/core.c
index cfe3d32ac5b7..5f7e02db5f53 100644
--- a/mm/kfence/core.c
+++ b/mm/kfence/core.c
@@ -648,7 +648,7 @@ void __init kfence_init(void)
schedule_delayed_work(&kfence_timer, 0);
pr_info("initialized - using %lu bytes for %d objects", 
KFENCE_POOL_SIZE,
CONFIG_KFENCE_NUM_OBJECTS);
-   if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
+   if (IS_ENABLED(CONFIG_KFENCE_REPORT_SENSITIVE))
pr_cont(" at 0x%px-0x%px\n", (void *)__kfence_pool,
(void *)(__kfence_pool + KFENCE_POOL_SIZE));
else
diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index 1accc840dbbe..48a8196b947b 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -16,8 +16,7 @@
 
 #include "../slab.h" /* for struct kmem_cache */
 
-/* For non-debug builds, avoid leaking kernel pointers into dmesg. */
-#ifdef CONFIG_DEBUG_KERNEL
+#ifdef CONFIG_KFENCE_REPORT_SENSITIVE
 #define PTR_FMT "%px"
 #else
 #define PTR_FMT "%p"
diff --git a/mm/kfence/report.c b/mm/kfence/report.c
index 901bd7ee83d8..5e2dbabbab1d 100644
--- a/mm/kfence/report.c
+++ b/mm/kfence/report.c
@@ -148,9 +148,9 @@ static void print_diff_canary(unsigned long address, size_t 
bytes_to_show,
for (cur = (const u8 *)address; cur < end; cur++) {
if (*cur == KFENCE_CANARY_PATTERN(cur))
pr_cont(" .");
-   else if (IS_ENABLED(CONFIG_DEBUG_KERNEL))
+   else if (IS_ENABLED(CONFIG_KFENCE_REPORT_SENSITIVE))
pr_cont(" 0x%02x", *cur);
-   else /* Do not leak kernel memory in non-debug builds. */
+   else /* Do not leak kernel memory. */
pr_cont(" !");
}
pr_cont(" ]");
@@ -242,7 +242,7 @@ void kfence_report_error(unsigned long address, bool 
is_write, struct pt_regs *r
 
/* Print report footer. */
pr_err("\n");
-   if (IS_ENABLED(CONFIG_DEBUG_KERNEL) && regs)
+   if (IS_ENABLED(CONFIG_KFENCE_REPORT_SENSITIVE) && regs)
show_regs(regs);
else
dump_stack_print_info(KERN_ERR);
-- 
2.30.0.478.g8a0d178c01-goog



Re: [PATCH][RESEND] lib/vsprintf: make-printk-non-secret printks all addresses as unhashed

2021-02-09 Thread Marco Elver
On Tue, Feb 02, 2021 at 03:36PM -0600, Timur Tabi wrote:
> If the make-printk-non-secret command-line parameter is set, then
> printk("%p") will print addresses as unhashed.  This is useful for
> debugging purposes.
> 
> A large warning message is displayed if this option is enabled,
> because unhashed addresses, while useful for debugging, expose
> kernel addresses, which can be a security risk.
> 
> Signed-off-by: Timur Tabi 
> ---
>  lib/vsprintf.c | 34 --
>  1 file changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 3b53c73580c5..b9f87084afb0 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -2090,6 +2090,30 @@ char *fwnode_string(char *buf, char *end, struct 
> fwnode_handle *fwnode,
>   return widen_string(buf, buf - buf_start, end, spec);
>  }
>  
> +/* Disable pointer hashing if requested */
> +static bool debug_never_hash_pointers __ro_after_init;

Would it be reasonable to make this non-static? Or somehow make it
possible to get this flag from other subsystems?

There are other places in the kernel that dump sensitive data such as
registers. We'd like to be able to use 'debug_never_hash_pointers' to
decide if our debugging tools can dump registers etc. What we really
need is to know whether the kernel is in debug mode, in which case we can
dump all kinds of sensitive info; debug_never_hash_pointers would be a
good enough proxy for that.
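
For illustration only (this assumes the flag were exported via some header,
which the patch above does not do; the function name is made up), a
debugging tool's report path could then do something like:

	extern bool debug_never_hash_pointers;

	static void report_show_context(struct pt_regs *regs)
	{
		/* Only dump raw registers if the user explicitly opted in. */
		if (debug_never_hash_pointers && regs)
			show_regs(regs);
		else
			dump_stack_print_info(KERN_ERR);
	}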

Thanks,
-- Marco


Re: PANIC: double fault in fixup_bad_iret

2020-06-01 Thread Marco Elver
On Sun, 31 May 2020 at 11:32, Dmitry Vyukov  wrote:
>
> On Fri, May 29, 2020 at 7:11 PM Peter Zijlstra  wrote:
> > > Like with KCSAN, we should blanket kill KASAN/UBSAN and friends (at the
> > > very least in arch/x86/) until they get that function attribute stuff
> > > sorted.
> >
> > Something like so.
> >
> > ---
> > diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> > index 00e378de8bc0..a90d32b87d7e 100644
> > --- a/arch/x86/Makefile
> > +++ b/arch/x86/Makefile
> > @@ -1,6 +1,14 @@
> >  # SPDX-License-Identifier: GPL-2.0
> >  # Unified Makefile for i386 and x86_64
> >
> > +#
> > +# Until such a time that __no_kasan and __no_ubsan work as expected (and 
> > are
> > +# made part of noinstr), don't sanitize anything.
> > +#
> > +KASAN_SANITIZE := n
> > +UBSAN_SANITIZE := n
> > +KCOV_INSTRUMENT := n
> > +
> >  # select defconfig based on actual architecture
> >  ifeq ($(ARCH),x86)
> >ifeq ($(shell uname -m),x86_64)
>
> +kasan-dev
> +Marco, please send a fix for this

I think Peter wanted to send a patch to add __no_kcsan to noinstr:
https://lkml.kernel.org/r/20200529170755.gn706...@hirez.programming.kicks-ass.net

In the same patch we can add __no_sanitize_address to noinstr. But:

- We're missing a definition for __no_sanitize_undefined and
__no_sanitize_coverage.

- Could optionally add __no_{kasan,ubsan,kcov}, to be consistent with
__no_kcsan, although I'd just keep __no_sanitize for the unambiguous
names (__no_kcsan is special because __no_sanitize_thread and TSAN
instrumentation is just an implementation detail of KCSAN, which !=
KTSAN).

- We still need the above blanket no-instrument for x86 because of
GCC. We could guard it with "ifdef CONFIG_CC_IS_GCC".
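
Concretely, the guard could be as simple as the following (untested sketch,
extending the x86 Makefile hunk quoted above):

	# Only GCC still needs the blanket no-instrument; with Clang the
	# necessary no_sanitize attributes are available.
	ifdef CONFIG_CC_IS_GCC
	KASAN_SANITIZE := n
	UBSAN_SANITIZE := n
	KCOV_INSTRUMENT := n
	endif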

Not sure what the best strategy is to minimize patch conflicts. For
now I could send just the patches to add missing definitions. If you'd
like me to send all patches (including modifying 'noinstr'), let me
know.

Thanks,
-- Marco


Re: [rcu] 2f08469563: BUG:kernel_reboot-without-warning_in_boot_stage

2020-05-19 Thread Marco Elver
On Mon, 18 May 2020 at 20:05, Marco Elver  wrote:
>
> On Mon, 18 May 2020, 'Nick Desaulniers' via kasan-dev wrote:
>
> > On Mon, May 18, 2020 at 7:34 AM Marco Elver  wrote:
> > >
> > > On Mon, 18 May 2020 at 14:44, Marco Elver  wrote:
> > > >
> > > > [+Cc clang-built-linux FYI]
> > > >
> > > > On Mon, 18 May 2020 at 12:11, Marco Elver  wrote:
> > > > >
> > > > > On Sun, 17 May 2020 at 05:47, Paul E. McKenney  
> > > > > wrote:
> > > > > >
> > > > > > On Sun, May 17, 2020 at 09:17:32AM +0800, kernel test robot wrote:
> > > > > > > Greeting,
> > > > > > >
> > > > > > > FYI, we noticed the following commit (built with clang-11):
> > > > > > >
> > > > > > > commit: 2f08469563550d15cb08a60898d3549720600eee ("rcu: Mark 
> > > > > > > rcu_state.ncpus to detect concurrent writes")
> > > > > > > https://git.kernel.org/cgit/linux/kernel/git/paulmck/linux-rcu.git
> > > > > > >  dev.2020.05.14c
> > > > > > >
> > > > > > > in testcase: boot
> > > > > > >
> > > > > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge 
> > > > > > > -smp 2 -m 8G
> > > > > > >
> > > > > > > caused below changes (please refer to attached dmesg/kmsg for 
> > > > > > > entire log/backtrace):
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > If you fix the issue, kindly add following tag
> > > > > > > Reported-by: kernel test robot 
> > > > > > >
> > > > > > >
> > > > > > > [0.054943] BRK [0x05204000, 0x05204fff] PGTABLE
> > > > > > > [0.061181] BRK [0x05205000, 0x05205fff] PGTABLE
> > > > > > > [0.062403] BRK [0x05206000, 0x05206fff] PGTABLE
> > > > > > > [0.065200] RAMDISK: [mem 0x7a247000-0x7fff]
> > > > > > > [0.067344] ACPI: Early table checksum verification disabled
> > > > > > > BUG: kernel reboot-without-warning in boot stage
> > > > > >
> > > > > > I am having some difficulty believing that this commit is at fault 
> > > > > > given
> > > > > > that the .config does not list CONFIG_KCSAN=y, but CCing Marco Elver
> > > > > > for his thoughts.  Especially given that I have never built with 
> > > > > > clang-11.
> > > > > >
> > > > > > But this does invoke ASSERT_EXCLUSIVE_WRITER() in early boot from
> > > > > > rcu_init().  Might clang-11 have objections to early use of this 
> > > > > > macro?
> > > > >
> > > > > The macro is a noop without KCSAN. I think the bisection went wrong.
> > > > >
> > > > > I am able to reproduce a reboot-without-warning when building with
> > > > > Clang 11 and the provided config. I did a bisect, starting with v5.6
> > > > > (good), and found this:
> > > > > - Since v5.6, first bad commit is
> > > > > 20e2aa812620439d010a3f78ba4e05bc0b3e2861 (Merge tag
> > > > > 'perf-urgent-2020-04-12' of
> > > > > git://git.kernel.org/pub/scm/linux/kernel//git/tip/tip)
> > > > > - The actual commit that introduced the problem is
> > > > > 2b3b76b5ec67568da4bb475d3ce8a92ef494b5de (perf/x86/intel/uncore: Add
> > > > > Ice Lake server uncore support) -- reverting it fixes the problem.
> > >
> > > Some more clues:
> > >
> > > 1. I should have noticed that this uses CONFIG_KASAN=y.
> >
> > Thanks for the report, testing, and bisection.  I don't see any
> > smoking gun in the code.
> > https://godbolt.org/z/qbK26r
>
> My guess is data layout and maybe some interaction with KASAN. I also
> played around with leaving icx_mmio_uncores empty, meaning none of the
> data it refers to ends up in the data section (presumably because it is
> optimized out), which made the bug disappear as well.
>
> > >
> > > 2. Something about function icx_uncore_mmio_init(). Making it a noop
> > > also makes the issue go away.
> > >
> > > 3. Leaving icx_uncore_mmio_init() a noop but

Re: [rcu] 2f08469563: BUG:kernel_reboot-without-warning_in_boot_stage

2020-05-19 Thread Marco Elver
On Tue, 19 May 2020 at 12:16, Marco Elver  wrote:
>
> On Mon, 18 May 2020 at 20:05, Marco Elver  wrote:
> >
> > On Mon, 18 May 2020, 'Nick Desaulniers' via kasan-dev wrote:
> >
> > > On Mon, May 18, 2020 at 7:34 AM Marco Elver  wrote:
> > > >
> > > > On Mon, 18 May 2020 at 14:44, Marco Elver  wrote:
> > > > >
> > > > > [+Cc clang-built-linux FYI]
> > > > >
> > > > > On Mon, 18 May 2020 at 12:11, Marco Elver  wrote:
> > > > > >
> > > > > > On Sun, 17 May 2020 at 05:47, Paul E. McKenney  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Sun, May 17, 2020 at 09:17:32AM +0800, kernel test robot wrote:
> > > > > > > > Greeting,
> > > > > > > >
> > > > > > > > FYI, we noticed the following commit (built with clang-11):
> > > > > > > >
> > > > > > > > commit: 2f08469563550d15cb08a60898d3549720600eee ("rcu: Mark 
> > > > > > > > rcu_state.ncpus to detect concurrent writes")
> > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/paulmck/linux-rcu.git
> > > > > > > >  dev.2020.05.14c
> > > > > > > >
> > > > > > > > in testcase: boot
> > > > > > > >
> > > > > > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu 
> > > > > > > > SandyBridge -smp 2 -m 8G
> > > > > > > >
> > > > > > > > caused below changes (please refer to attached dmesg/kmsg for 
> > > > > > > > entire log/backtrace):
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > If you fix the issue, kindly add following tag
> > > > > > > > Reported-by: kernel test robot 
> > > > > > > >
> > > > > > > >
> > > > > > > > [0.054943] BRK [0x05204000, 0x05204fff] PGTABLE
> > > > > > > > [0.061181] BRK [0x05205000, 0x05205fff] PGTABLE
> > > > > > > > [0.062403] BRK [0x05206000, 0x05206fff] PGTABLE
> > > > > > > > [0.065200] RAMDISK: [mem 0x7a247000-0x7fff]
> > > > > > > > [0.067344] ACPI: Early table checksum verification disabled
> > > > > > > > BUG: kernel reboot-without-warning in boot stage
> > > > > > >
> > > > > > > I am having some difficulty believing that this commit is at 
> > > > > > > fault given
> > > > > > > that the .config does not list CONFIG_KCSAN=y, but CCing Marco 
> > > > > > > Elver
> > > > > > > for his thoughts.  Especially given that I have never built with 
> > > > > > > clang-11.
> > > > > > >
> > > > > > > But this does invoke ASSERT_EXCLUSIVE_WRITER() in early boot from
> > > > > > > rcu_init().  Might clang-11 have objections to early use of this 
> > > > > > > macro?
> > > > > >
> > > > > > The macro is a noop without KCSAN. I think the bisection went wrong.
> > > > > >
> > > > > > I am able to reproduce a reboot-without-warning when building with
> > > > > > Clang 11 and the provided config. I did a bisect, starting with v5.6
> > > > > > (good), and found this:
> > > > > > - Since v5.6, first bad commit is
> > > > > > 20e2aa812620439d010a3f78ba4e05bc0b3e2861 (Merge tag
> > > > > > 'perf-urgent-2020-04-12' of
> > > > > > git://git.kernel.org/pub/scm/linux/kernel//git/tip/tip)
> > > > > > - The actual commit that introduced the problem is
> > > > > > 2b3b76b5ec67568da4bb475d3ce8a92ef494b5de (perf/x86/intel/uncore: Add
> > > > > > Ice Lake server uncore support) -- reverting it fixes the problem.
> > > >
> > > > Some more clues:
> > > >
> > > > 1. I should have noticed that this uses CONFIG_KASAN=y.
> > >
> > > Thanks for the report, testing, and bisection.  I don't see any
> > > smoking gun in the code.
> > > https://godbolt.org/z/qbK26r
> >
> > My guess is data lay

[PATCH] kasan: Disable branch tracing for core runtime

2020-05-19 Thread Marco Elver
During early boot, while KASAN is not yet initialized, it is possible to
enter the reporting code path and end up in kasan_report(). While
uninitialized, the branch there prevents generating any reports.
However, under certain circumstances, when branches are being traced
(TRACE_BRANCH_PROFILING), we may recurse deep enough to cause kernel
reboots without warning.

To prevent similar issues in the future, disable branch tracing for the
core runtime.

Link: https://lore.kernel.org/lkml/20200517011732.GE24705@shao2-debian/
Reported-by: kernel test robot 
Signed-off-by: Marco Elver 
---
 mm/kasan/Makefile  | 16 
 mm/kasan/generic.c |  1 -
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/mm/kasan/Makefile b/mm/kasan/Makefile
index 434d503a6525..de3121848ddf 100644
--- a/mm/kasan/Makefile
+++ b/mm/kasan/Makefile
@@ -15,14 +15,14 @@ CFLAGS_REMOVE_tags_report.o = $(CC_FLAGS_FTRACE)
 
 # Function splitter causes unnecessary splits in __asan_load1/__asan_store1
 # see: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63533
-CFLAGS_common.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
-CFLAGS_generic.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
-CFLAGS_generic_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
-CFLAGS_init.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
-CFLAGS_quarantine.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
-CFLAGS_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
-CFLAGS_tags.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
-CFLAGS_tags_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
+CFLAGS_common.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
+CFLAGS_generic.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
+CFLAGS_generic_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
+CFLAGS_init.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
+CFLAGS_quarantine.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
+CFLAGS_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
+CFLAGS_tags.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
+CFLAGS_tags_report.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) -DDISABLE_BRANCH_PROFILING
 
 obj-$(CONFIG_KASAN) := common.o init.o report.o
 obj-$(CONFIG_KASAN_GENERIC) += generic.o generic_report.o quarantine.o
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 56ff8885fe2e..098a7dbaced6 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -15,7 +15,6 @@
  */
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-#define DISABLE_BRANCH_PROFILING
 
 #include 
 #include 
-- 
2.26.2.761.g0e0b3e54be-goog



Re: [rcu] 2f08469563: BUG:kernel_reboot-without-warning_in_boot_stage

2020-05-19 Thread Marco Elver
On Tue, 19 May 2020 at 15:40, Marco Elver  wrote:
>
> On Tue, 19 May 2020 at 12:16, Marco Elver  wrote:
> >
> > On Mon, 18 May 2020 at 20:05, Marco Elver  wrote:
> > >
> > > On Mon, 18 May 2020, 'Nick Desaulniers' via kasan-dev wrote:
> > >
> > > > On Mon, May 18, 2020 at 7:34 AM Marco Elver  wrote:
> > > > >
> > > > > On Mon, 18 May 2020 at 14:44, Marco Elver  wrote:
> > > > > >
> > > > > > [+Cc clang-built-linux FYI]
> > > > > >
> > > > > > On Mon, 18 May 2020 at 12:11, Marco Elver  wrote:
> > > > > > >
> > > > > > > On Sun, 17 May 2020 at 05:47, Paul E. McKenney 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Sun, May 17, 2020 at 09:17:32AM +0800, kernel test robot 
> > > > > > > > wrote:
> > > > > > > > > Greeting,
> > > > > > > > >
> > > > > > > > > FYI, we noticed the following commit (built with clang-11):
> > > > > > > > >
> > > > > > > > > commit: 2f08469563550d15cb08a60898d3549720600eee ("rcu: Mark 
> > > > > > > > > rcu_state.ncpus to detect concurrent writes")
> > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/paulmck/linux-rcu.git
> > > > > > > > >  dev.2020.05.14c
> > > > > > > > >
> > > > > > > > > in testcase: boot
> > > > > > > > >
> > > > > > > > > on test machine: qemu-system-x86_64 -enable-kvm -cpu 
> > > > > > > > > SandyBridge -smp 2 -m 8G
> > > > > > > > >
> > > > > > > > > caused below changes (please refer to attached dmesg/kmsg for 
> > > > > > > > > entire log/backtrace):
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > If you fix the issue, kindly add following tag
> > > > > > > > > Reported-by: kernel test robot 
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [0.054943] BRK [0x05204000, 0x05204fff] PGTABLE
> > > > > > > > > [0.061181] BRK [0x05205000, 0x05205fff] PGTABLE
> > > > > > > > > [0.062403] BRK [0x05206000, 0x05206fff] PGTABLE
> > > > > > > > > [0.065200] RAMDISK: [mem 0x7a247000-0x7fff]
> > > > > > > > > [0.067344] ACPI: Early table checksum verification 
> > > > > > > > > disabled
> > > > > > > > > BUG: kernel reboot-without-warning in boot stage
> > > > > > > >
> > > > > > > > I am having some difficulty believing that this commit is at 
> > > > > > > > fault given
> > > > > > > > that the .config does not list CONFIG_KCSAN=y, but CCing Marco 
> > > > > > > > Elver
> > > > > > > > for his thoughts.  Especially given that I have never built 
> > > > > > > > with clang-11.
> > > > > > > >
> > > > > > > > But this does invoke ASSERT_EXCLUSIVE_WRITER() in early boot 
> > > > > > > > from
> > > > > > > > rcu_init().  Might clang-11 have objections to early use of 
> > > > > > > > this macro?
> > > > > > >
> > > > > > > The macro is a noop without KCSAN. I think the bisection went 
> > > > > > > wrong.
> > > > > > >
> > > > > > > I am able to reproduce a reboot-without-warning when building with
> > > > > > > Clang 11 and the provided config. I did a bisect, starting with 
> > > > > > > v5.6
> > > > > > > (good), and found this:
> > > > > > > - Since v5.6, first bad commit is
> > > > > > > 20e2aa812620439d010a3f78ba4e05bc0b3e2861 (Merge tag
> > > > > > > 'perf-urgent-2020-04-12' of
> > > > > > > git://git.kernel.org/pub/scm/linux/kernel//git/tip/tip)
> > > > > > > -

Re: [PATCH] READ_ONCE, WRITE_ONCE, kcsan: Perform checks in __*_ONCE variants

2020-05-19 Thread Marco Elver
On Tue, 19 May 2020 at 23:10, Qian Cai  wrote:
>
> On Tue, May 12, 2020 at 3:09 PM Peter Zijlstra  wrote:
> >
> > On Tue, May 12, 2020 at 08:38:39PM +0200, Marco Elver wrote:
> > > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > > index 741c93c62ecf..e902ca5de811 100644
> > > --- a/include/linux/compiler.h
> > > +++ b/include/linux/compiler.h
> > > @@ -224,13 +224,16 @@ void ftrace_likely_update(struct ftrace_likely_data 
> > > *f, int val,
> > >   * atomicity or dependency ordering guarantees. Note that this may result
> > >   * in tears!
> > >   */
> > > -#define __READ_ONCE(x)   (*(const volatile __unqual_scalar_typeof(x) 
> > > *)&(x))
> > > +#define __READ_ONCE(x)   
> > > \
> > > +({   \
> > > + kcsan_check_atomic_read(&(x), sizeof(x));   \
> > > + data_race((*(const volatile __unqual_scalar_typeof(x) *)&(x))); \
> > > +})
> >
> > NAK
> >
> > This will actively insert instrumentation into __READ_ONCE() and I need
> > it to not have any.
>
> Any way to move this forward? Due to linux-next commit 6bcc8f459fe7
> (locking/atomics: Flip fallbacks and instrumentation), it triggers a
> lot of KCSAN warnings because atomic ops are no longer marked.

We believe this is no longer the right solution, due to the various
requirements that Peter also mentioned. See the discussion here:

https://lkml.kernel.org/r/canpmjnogfqhtda9wwpxs2kztqssozbwsumo5bqqw0c0g0zg...@mail.gmail.com

The new solution is here:
https://lkml.kernel.org/r/20200515150338.190344-1-el...@google.com
It's a little inconvenient that we'll require Clang 11 (currently only
available by building it yourself from the LLVM repo), but until we get
GCC fixed (my patch there is still pending :-/), this is probably the
right solution going forward. If possible, please do test!

Thanks,
-- Marco

> For
> example,
> [  197.318288][ T1041] write to 0x9302764ccc78 of 8 bytes by task
> 1048 on cpu 47:
> [  197.353119][ T1041]  down_read_trylock+0x9e/0x1e0
> atomic_long_set(&sem->owner, val);
> __rwsem_set_reader_owned at kernel/locking/rwsem.c:205
> (inlined by) rwsem_set_reader_owned at kernel/locking/rwsem.c:213
> (inlined by) __down_read_trylock at kernel/locking/rwsem.c:1373
> (inlined by) down_read_trylock at kernel/locking/rwsem.c:1517
> [  197.374641][ T1041]  page_lock_anon_vma_read+0x19d/0x3c0
> [  197.398894][ T1041]  rmap_walk_anon+0x30e/0x620
>
> [  197.924695][ T1041] read to 0x9302764ccc78 of 8 bytes by task
> 1041 on cpu 43:
> [  197.959501][ T1041]  up_read+0xb8/0x41a
> arch_atomic64_read at arch/x86/include/asm/atomic64_64.h:22
> (inlined by) atomic64_read at include/asm-generic/atomic-instrumented.h:838
> (inlined by) atomic_long_read at include/asm-generic/atomic-long.h:29
> (inlined by) rwsem_clear_reader_owned at kernel/locking/rwsem.c:242
> (inlined by) __up_read at kernel/locking/rwsem.c:1433
> (inlined by) up_read at kernel/locking/rwsem.c:1574
> [  197.977728][ T1041]  rmap_walk_anon+0x2f2/0x620
> [  197.999055][ T1041]  rmap_walk+0xb5/0xe0


Re: [PATCH] READ_ONCE, WRITE_ONCE, kcsan: Perform checks in __*_ONCE variants

2020-05-20 Thread Marco Elver
On Wed, 20 May 2020 at 05:44, Nathan Chancellor
 wrote:
>
> On Tue, May 19, 2020 at 11:16:24PM -0400, Qian Cai wrote:
> > On Tue, May 19, 2020 at 10:47 PM Nathan Chancellor
> >  wrote:
> > >
> > > On Tue, May 19, 2020 at 10:28:41PM -0400, Qian Cai wrote:
> > > >
> > > >
> > > > > On May 19, 2020, at 6:05 PM, Thomas Gleixner  
> > > > > wrote:
> > > > >
> > > > > Yes, it's unfortunate, but we have to stop making major concessions 
> > > > > just
> > > > > because tools are not up to the task.
> > > > >
> > > > > We've done that way too much in the past and this particular problem
> > > > > clearly demonstrates that there are limits.
> > > > >
> > > > > Making brand new technology depend on sane tools is not asked too
> > > > > much. And yes, it's inconvenient, but all of us have to build tools
> > > > > every now and then to get our job done. It's not the end of the world.
> > > > >
> > > > > Building clang is trivial enough and pointing the make to the right
> > > > > compiler is not rocket science either.
> > > >
> > > > Yes, it all makes sense from that angle. On the other hand, I want to
> > > > focus on the kernel rather than compilers by using a stable and
> > > > rock-solid version. Not to mention the time lost compiling and properly
> > > > managing my own toolchain in an automated environment; using such a new
> > > > version of the compilers means that I inevitably have to deal with
> > > > compiler bugs occasionally. Anyway, it is just some more bugs I have to
> > > > deal with, and I don't have a better solution to offer right now.
> > >
> > > Hi Qian,
> > >
> > > Shameless plug but I have made a Python script to efficiently configure
> > > then build clang specifically for building the kernel (turn off a lot of
> > > different things that the kernel does not need).
> > >
> > > https://github.com/ClangBuiltLinux/tc-build
> > >
> > > I added an option '--use-good-revision', which uses an older master
> > > version (basically somewhere between clang-10 and current master) that
> > > has been qualified against the kernel. I currently update it every
> > > Linux release but I am probably going to start doing it every month as
> > > I have written a pretty decent framework to ensure that nothing is
> > > breaking on either the LLVM or kernel side.
> > >
> > > $ ./build-llvm.py --use-good-revision
> > >
> > > should be all you need to get off the ground and running if you wanted
> > > to give it a shot. The script is completely self contained by default so
> > > it won't mess with the rest of your system. Additionally, leaving off
> > > '--use-good-revision' will just use the master branch, which can
> > > definitely be broken but not as often as you would think (although I
> > > totally understand wanting to focus on kernel regressions only).
> >
> > Great, thanks. I'll try it in a bit.
>
> Please let me know if there are any issues!
>
> Do note that in order to get support for Marco's series, you will need
> to have a version of LLVM that includes [1], which the current
> --use-good-revision does not. You can checkout that revision exactly
> through the '-b' ('--branch') parameter:
>
> $ ./build-llvm.py -b 5a2c31116f412c3b6888be361137efd705e05814
>
> I also see another patch in LLVM that concerns KCSAN [2] but that does
> not appear used in Marco's series. Still might be worth having available
> in your version of clang.
>
> I'll try to bump the hash that '--use-good-revision' uses soon. I might
> wait until 5.7 final so that I can do both at the same time like I
> usually do but we'll see how much time I have.
>
> [1]: 
> https://github.com/llvm/llvm-project/commit/5a2c31116f412c3b6888be361137efd705e05814
> [2]: 
> https://github.com/llvm/llvm-project/commit/151ed6aa38a3ec6c01973b35f684586b6e1c0f7e

Thanks for sharing the script, this is very useful!

Note that [2] above is used, but optional:
https://lore.kernel.org/lkml/20200515150338.190344-5-el...@google.com/
It's not required for KCSAN to function correctly, but if it's
available it'll help find more data races with the default config.

Thanks,
-- Marco


[PATCH tip/locking/core v2 2/2] kcsan: Improve IRQ state trace reporting

2020-07-29 Thread Marco Elver
To improve the general usefulness of the IRQ state trace events with
KCSAN enabled, save and restore the trace information when entering and
exiting the KCSAN runtime as well as when generating a KCSAN report.

Without this, reporting the IRQ trace events (whether via a KCSAN report
or outside of KCSAN via a lockdep report) is rather useless due to
continuously being touched by KCSAN. This is because if KCSAN is
enabled, every instrumented memory access causes changes to IRQ trace
events (either by KCSAN disabling/enabling interrupts or taking
report_lock when generating a report).

Before "lockdep: Prepare for NMI IRQ state tracking", KCSAN avoided
touching the IRQ trace events via raw_local_irq_save/restore() and
lockdep_off/on().

Fixes: 248591f5d257 ("kcsan: Make KCSAN compatible with new IRQ state tracking")
Signed-off-by: Marco Elver 
---
v2:
* Use simple struct copy, now that the IRQ trace events are in a struct.

Depends on:  "lockdep: Prepare for NMI IRQ state tracking"
---
 include/linux/sched.h |  4 
 kernel/kcsan/core.c   | 23 +++
 kernel/kcsan/kcsan.h  |  7 +++
 kernel/kcsan/report.c |  3 +++
 4 files changed, 37 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 52e0fdd6a555..060e9214c8b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1184,8 +1184,12 @@ struct task_struct {
 #ifdef CONFIG_KASAN
unsigned intkasan_depth;
 #endif
+
 #ifdef CONFIG_KCSAN
struct kcsan_ctxkcsan_ctx;
+#ifdef CONFIG_TRACE_IRQFLAGS
+   struct irqtrace_events  kcsan_save_irqtrace;
+#endif
 #endif
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
diff --git a/kernel/kcsan/core.c b/kernel/kcsan/core.c
index 732623c30359..0fe068192781 100644
--- a/kernel/kcsan/core.c
+++ b/kernel/kcsan/core.c
@@ -291,6 +291,20 @@ static inline unsigned int get_delay(void)
0);
 }
 
+void kcsan_save_irqtrace(struct task_struct *task)
+{
+#ifdef CONFIG_TRACE_IRQFLAGS
+   task->kcsan_save_irqtrace = task->irqtrace;
+#endif
+}
+
+void kcsan_restore_irqtrace(struct task_struct *task)
+{
+#ifdef CONFIG_TRACE_IRQFLAGS
+   task->irqtrace = task->kcsan_save_irqtrace;
+#endif
+}
+
 /*
  * Pull everything together: check_access() below contains the performance
  * critical operations; the fast-path (including check_access) functions should
@@ -336,9 +350,11 @@ static noinline void kcsan_found_watchpoint(const volatile 
void *ptr,
flags = user_access_save();
 
if (consumed) {
+   kcsan_save_irqtrace(current);
kcsan_report(ptr, size, type, KCSAN_VALUE_CHANGE_MAYBE,
 KCSAN_REPORT_CONSUMED_WATCHPOINT,
 watchpoint - watchpoints);
+   kcsan_restore_irqtrace(current);
} else {
/*
 * The other thread may not print any diagnostics, as it has
@@ -396,6 +412,12 @@ kcsan_setup_watchpoint(const volatile void *ptr, size_t 
size, int type)
goto out;
}
 
+   /*
+* Save and restore the IRQ state trace touched by KCSAN, since KCSAN's
+* runtime is entered for every memory access, and potentially useful
+* information is lost if dirtied by KCSAN.
+*/
+   kcsan_save_irqtrace(current);
if (!kcsan_interrupt_watcher)
local_irq_save(irq_flags);
 
@@ -539,6 +561,7 @@ kcsan_setup_watchpoint(const volatile void *ptr, size_t 
size, int type)
 out_unlock:
if (!kcsan_interrupt_watcher)
local_irq_restore(irq_flags);
+   kcsan_restore_irqtrace(current);
 out:
user_access_restore(ua_flags);
 }
diff --git a/kernel/kcsan/kcsan.h b/kernel/kcsan/kcsan.h
index 763d6d08d94b..29480010dc30 100644
--- a/kernel/kcsan/kcsan.h
+++ b/kernel/kcsan/kcsan.h
@@ -9,6 +9,7 @@
 #define _KERNEL_KCSAN_KCSAN_H
 
 #include 
+#include 
 
 /* The number of adjacent watchpoints to check. */
 #define KCSAN_CHECK_ADJACENT 1
@@ -22,6 +23,12 @@ extern unsigned int kcsan_udelay_interrupt;
  */
 extern bool kcsan_enabled;
 
+/*
+ * Save/restore IRQ flags state trace dirtied by KCSAN.
+ */
+void kcsan_save_irqtrace(struct task_struct *task);
+void kcsan_restore_irqtrace(struct task_struct *task);
+
 /*
  * Initialize debugfs file.
  */
diff --git a/kernel/kcsan/report.c b/kernel/kcsan/report.c
index 6b2fb1a6d8cd..9d07e175de0f 100644
--- a/kernel/kcsan/report.c
+++ b/kernel/kcsan/report.c
@@ -308,6 +308,9 @@ static void print_verbose_info(struct task_struct *task)
if (!task)
return;
 
+   /* Restore IRQ state trace for printing. */
+   kcsan_restore_irqtrace(task);
+
pr_err("\n");
debug_show_held_locks(task);
print_irqtrace_events(task);
-- 
2.28.0.rc0.142.g3c755180ce-goog



[PATCH tip/locking/core v2 1/2] lockdep: Refactor IRQ trace events fields into struct

2020-07-29 Thread Marco Elver
Refactor the IRQ trace events fields, used for printing information
about the IRQ trace events, into a separate struct 'irqtrace_events'.

This improves readability by separating the information only used in
reporting, as well as enables (simplified) storing/restoring of
irqtrace_events snapshots.

No functional change intended.

Signed-off-by: Marco Elver 
---
v2:
* Introduce patch, as pre-requisite to "kcsan: Improve IRQ state trace
  reporting".
---
 include/linux/irqflags.h | 13 +
 include/linux/sched.h| 11 ++--
 kernel/fork.c| 16 ---
 kernel/locking/lockdep.c | 58 +---
 4 files changed, 50 insertions(+), 48 deletions(-)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 5811ee8a5cd8..bd5c55755447 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -33,6 +33,19 @@
 
 #ifdef CONFIG_TRACE_IRQFLAGS
 
+/* Per-task IRQ trace events information. */
+struct irqtrace_events {
+   unsigned intirq_events;
+   unsigned long   hardirq_enable_ip;
+   unsigned long   hardirq_disable_ip;
+   unsigned inthardirq_enable_event;
+   unsigned inthardirq_disable_event;
+   unsigned long   softirq_disable_ip;
+   unsigned long   softirq_enable_ip;
+   unsigned intsoftirq_disable_event;
+   unsigned intsoftirq_enable_event;
+};
+
 DECLARE_PER_CPU(int, hardirqs_enabled);
 DECLARE_PER_CPU(int, hardirq_context);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d1de021b315..52e0fdd6a555 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -980,17 +981,9 @@ struct task_struct {
 #endif
 
 #ifdef CONFIG_TRACE_IRQFLAGS
-   unsigned intirq_events;
+   struct irqtrace_events  irqtrace;
unsigned inthardirq_threaded;
-   unsigned long   hardirq_enable_ip;
-   unsigned long   hardirq_disable_ip;
-   unsigned inthardirq_enable_event;
-   unsigned inthardirq_disable_event;
u64 hardirq_chain_key;
-   unsigned long   softirq_disable_ip;
-   unsigned long   softirq_enable_ip;
-   unsigned intsoftirq_disable_event;
-   unsigned intsoftirq_enable_event;
int softirqs_enabled;
int softirq_context;
int irq_config;
diff --git a/kernel/fork.c b/kernel/fork.c
index 70d9d0a4de2a..56a640799680 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2035,17 +2035,11 @@ static __latent_entropy struct task_struct 
*copy_process(
seqcount_init(&p->mems_allowed_seq);
 #endif
 #ifdef CONFIG_TRACE_IRQFLAGS
-   p->irq_events = 0;
-   p->hardirq_enable_ip = 0;
-   p->hardirq_enable_event = 0;
-   p->hardirq_disable_ip = _THIS_IP_;
-   p->hardirq_disable_event = 0;
-   p->softirqs_enabled = 1;
-   p->softirq_enable_ip = _THIS_IP_;
-   p->softirq_enable_event = 0;
-   p->softirq_disable_ip = 0;
-   p->softirq_disable_event = 0;
-   p->softirq_context = 0;
+   memset(&p->irqtrace, 0, sizeof(p->irqtrace));
+   p->irqtrace.hardirq_disable_ip  = _THIS_IP_;
+   p->irqtrace.softirq_enable_ip   = _THIS_IP_;
+   p->softirqs_enabled = 1;
+   p->softirq_context  = 0;
 #endif
 
p->pagefault_disabled = 0;
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index c9ea05edce25..7b5800374c40 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3484,19 +3484,21 @@ check_usage_backwards(struct task_struct *curr, struct 
held_lock *this,
 
 void print_irqtrace_events(struct task_struct *curr)
 {
-   printk("irq event stamp: %u\n", curr->irq_events);
+   const struct irqtrace_events *trace = &curr->irqtrace;
+
+   printk("irq event stamp: %u\n", trace->irq_events);
printk("hardirqs last  enabled at (%u): [<%px>] %pS\n",
-   curr->hardirq_enable_event, (void *)curr->hardirq_enable_ip,
-   (void *)curr->hardirq_enable_ip);
+   trace->hardirq_enable_event, (void *)trace->hardirq_enable_ip,
+   (void *)trace->hardirq_enable_ip);
printk("hardirqs last disabled at (%u): [<%px>] %pS\n",
-   curr->hardirq_disable_event, (void *)curr->hardirq_disable_ip,
-   (void *)curr->hardirq_disable_ip);
+   trace->hardirq_disable_event, (void *)trace->hardirq_disable_ip,
+   (void *)trace->hard

[PATCH 0/5] kcsan: Cleanups, readability, and cosmetic improvements

2020-07-31 Thread Marco Elver
Cleanups, readability, and cosmetic improvements for KCSAN.

Marco Elver (5):
  kcsan: Simplify debugfs counter to name mapping
  kcsan: Simplify constant string handling
  kcsan: Remove debugfs test command
  kcsan: Show message if enabled early
  kcsan: Use pr_fmt for consistency

 kernel/kcsan/core.c |   8 ++-
 kernel/kcsan/debugfs.c  | 111 
 kernel/kcsan/report.c   |   4 +-
 kernel/kcsan/selftest.c |   8 +--
 4 files changed, 33 insertions(+), 98 deletions(-)

-- 
2.28.0.163.g6104cc2f0b6-goog



[PATCH 2/5] kcsan: Simplify constant string handling

2020-07-31 Thread Marco Elver
Simplify prefix checking and length calculation of constant strings.
For the former, the kernel provides str_has_prefix(), and for the latter
we should just use strlen(".."), because GCC and Clang optimize these
into constants.

No functional change intended.

Signed-off-by: Marco Elver 
---
 kernel/kcsan/debugfs.c | 8 
 kernel/kcsan/report.c  | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/kcsan/debugfs.c b/kernel/kcsan/debugfs.c
index 3a9566addeff..116bdd8f050c 100644
--- a/kernel/kcsan/debugfs.c
+++ b/kernel/kcsan/debugfs.c
@@ -300,16 +300,16 @@ debugfs_write(struct file *file, const char __user *buf, 
size_t count, loff_t *o
WRITE_ONCE(kcsan_enabled, true);
} else if (!strcmp(arg, "off")) {
WRITE_ONCE(kcsan_enabled, false);
-   } else if (!strncmp(arg, "microbench=", sizeof("microbench=") - 1)) {
+   } else if (str_has_prefix(arg, "microbench=")) {
unsigned long iters;
 
-   if (kstrtoul(&arg[sizeof("microbench=") - 1], 0, &iters))
+   if (kstrtoul(&arg[strlen("microbench=")], 0, &iters))
return -EINVAL;
microbenchmark(iters);
-   } else if (!strncmp(arg, "test=", sizeof("test=") - 1)) {
+   } else if (str_has_prefix(arg, "test=")) {
unsigned long iters;
 
-   if (kstrtoul(&arg[sizeof("test=") - 1], 0, &iters))
+   if (kstrtoul(&arg[strlen("test=")], 0, &iters))
return -EINVAL;
test_thread(iters);
} else if (!strcmp(arg, "whitelist")) {
diff --git a/kernel/kcsan/report.c b/kernel/kcsan/report.c
index d05052c23261..15add93ff12e 100644
--- a/kernel/kcsan/report.c
+++ b/kernel/kcsan/report.c
@@ -279,8 +279,8 @@ static int get_stack_skipnr(const unsigned long 
stack_entries[], int num_entries
 
cur = strnstr(buf, "kcsan_", len);
if (cur) {
-   cur += sizeof("kcsan_") - 1;
-   if (strncmp(cur, "test", sizeof("test") - 1))
+   cur += strlen("kcsan_");
+   if (!str_has_prefix(cur, "test"))
continue; /* KCSAN runtime function. */
/* KCSAN related test. */
}
-- 
2.28.0.163.g6104cc2f0b6-goog



[PATCH 1/5] kcsan: Simplify debugfs counter to name mapping

2020-07-31 Thread Marco Elver
Simplify counter ID to name mapping by using an array with designated
inits. This way, we can turn a run-time BUG() into a compile-time static
assertion failure if a counter name is missing.

No functional change intended.

Signed-off-by: Marco Elver 
---
 kernel/kcsan/debugfs.c | 33 +
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/kernel/kcsan/debugfs.c b/kernel/kcsan/debugfs.c
index 023e49c58d55..3a9566addeff 100644
--- a/kernel/kcsan/debugfs.c
+++ b/kernel/kcsan/debugfs.c
@@ -19,6 +19,18 @@
  * Statistics counters.
  */
 static atomic_long_t counters[KCSAN_COUNTER_COUNT];
+static const char *const counter_names[] = {
+   [KCSAN_COUNTER_USED_WATCHPOINTS]= "used_watchpoints",
+   [KCSAN_COUNTER_SETUP_WATCHPOINTS]   = "setup_watchpoints",
+   [KCSAN_COUNTER_DATA_RACES]  = "data_races",
+   [KCSAN_COUNTER_ASSERT_FAILURES] = "assert_failures",
+   [KCSAN_COUNTER_NO_CAPACITY] = "no_capacity",
+   [KCSAN_COUNTER_REPORT_RACES]= "report_races",
+   [KCSAN_COUNTER_RACES_UNKNOWN_ORIGIN]= "races_unknown_origin",
+   [KCSAN_COUNTER_UNENCODABLE_ACCESSES]= "unencodable_accesses",
+   [KCSAN_COUNTER_ENCODING_FALSE_POSITIVES]= "encoding_false_positives",
+};
+static_assert(ARRAY_SIZE(counter_names) == KCSAN_COUNTER_COUNT);
 
 /*
  * Addresses for filtering functions from reporting. This list can be used as a
@@ -39,24 +51,6 @@ static struct {
 };
 static DEFINE_SPINLOCK(report_filterlist_lock);
 
-static const char *counter_to_name(enum kcsan_counter_id id)
-{
-   switch (id) {
-   case KCSAN_COUNTER_USED_WATCHPOINTS:return "used_watchpoints";
-   case KCSAN_COUNTER_SETUP_WATCHPOINTS:   return "setup_watchpoints";
-   case KCSAN_COUNTER_DATA_RACES:  return "data_races";
-   case KCSAN_COUNTER_ASSERT_FAILURES: return "assert_failures";
-   case KCSAN_COUNTER_NO_CAPACITY: return "no_capacity";
-   case KCSAN_COUNTER_REPORT_RACES:return "report_races";
-   case KCSAN_COUNTER_RACES_UNKNOWN_ORIGIN:return "races_unknown_origin";
-   case KCSAN_COUNTER_UNENCODABLE_ACCESSES:return "unencodable_accesses";
-   case KCSAN_COUNTER_ENCODING_FALSE_POSITIVES:return "encoding_false_positives";
-   case KCSAN_COUNTER_COUNT:
-   BUG();
-   }
-   return NULL;
-}
-
 void kcsan_counter_inc(enum kcsan_counter_id id)
 {
atomic_long_inc(&counters[id]);
@@ -271,8 +265,7 @@ static int show_info(struct seq_file *file, void *v)
/* show stats */
seq_printf(file, "enabled: %i\n", READ_ONCE(kcsan_enabled));
for (i = 0; i < KCSAN_COUNTER_COUNT; ++i)
-   seq_printf(file, "%s: %ld\n", counter_to_name(i),
-  atomic_long_read(&counters[i]));
+   seq_printf(file, "%s: %ld\n", counter_names[i], atomic_long_read(&counters[i]));
 
/* show filter functions, and filter type */
spin_lock_irqsave(&report_filterlist_lock, flags);
-- 
2.28.0.163.g6104cc2f0b6-goog



[PATCH 3/5] kcsan: Remove debugfs test command

2020-07-31 Thread Marco Elver
Remove the debugfs test command, as it is no longer needed now that we
have the KUnit+Torture based kcsan-test module. This is to avoid
confusion around how KCSAN should be tested, as only the kcsan-test
module is maintained.

Signed-off-by: Marco Elver 
---
 kernel/kcsan/debugfs.c | 66 --
 1 file changed, 66 deletions(-)

diff --git a/kernel/kcsan/debugfs.c b/kernel/kcsan/debugfs.c
index 116bdd8f050c..de1da1b01aa4 100644
--- a/kernel/kcsan/debugfs.c
+++ b/kernel/kcsan/debugfs.c
@@ -98,66 +98,6 @@ static noinline void microbenchmark(unsigned long iters)
current->kcsan_ctx = ctx_save;
 }
 
-/*
- * Simple test to create conflicting accesses. Write 'test=' to KCSAN's
- * debugfs file from multiple tasks to generate real conflicts and show reports.
- */
-static long test_dummy;
-static long test_flags;
-static long test_scoped;
-static noinline void test_thread(unsigned long iters)
-{
-   const long CHANGE_BITS = 0xff00ff00ff00ff00L;
-   const struct kcsan_ctx ctx_save = current->kcsan_ctx;
-   cycles_t cycles;
-
-   /* We may have been called from an atomic region; reset context. */
-   memset(¤t->kcsan_ctx, 0, sizeof(current->kcsan_ctx));
-
-   pr_info("KCSAN: %s begin | iters: %lu\n", __func__, iters);
-   pr_info("test_dummy@%px, test_flags@%px, test_scoped@%px,\n",
-   &test_dummy, &test_flags, &test_scoped);
-
-   cycles = get_cycles();
-   while (iters--) {
-   /* These all should generate reports. */
-   __kcsan_check_read(&test_dummy, sizeof(test_dummy));
-   ASSERT_EXCLUSIVE_WRITER(test_dummy);
-   ASSERT_EXCLUSIVE_ACCESS(test_dummy);
-
-   ASSERT_EXCLUSIVE_BITS(test_flags, ~CHANGE_BITS); /* no report */
-   __kcsan_check_read(&test_flags, sizeof(test_flags)); /* no report */
-
-   ASSERT_EXCLUSIVE_BITS(test_flags, CHANGE_BITS); /* report */
-   __kcsan_check_read(&test_flags, sizeof(test_flags)); /* no report */
-
-   /* not actually instrumented */
-   WRITE_ONCE(test_dummy, iters);  /* to observe value-change */
-   __kcsan_check_write(&test_dummy, sizeof(test_dummy));
-
-   test_flags ^= CHANGE_BITS; /* generate value-change */
-   __kcsan_check_write(&test_flags, sizeof(test_flags));
-
-   BUG_ON(current->kcsan_ctx.scoped_accesses.prev);
-   {
-   /* Should generate reports anywhere in this block. */
-   ASSERT_EXCLUSIVE_WRITER_SCOPED(test_scoped);
-   ASSERT_EXCLUSIVE_ACCESS_SCOPED(test_scoped);
-   BUG_ON(!current->kcsan_ctx.scoped_accesses.prev);
-   /* Unrelated accesses. */
-   __kcsan_check_access(&cycles, sizeof(cycles), 0);
-   __kcsan_check_access(&cycles, sizeof(cycles), KCSAN_ACCESS_ATOMIC);
-   }
-   BUG_ON(current->kcsan_ctx.scoped_accesses.prev);
-   }
-   cycles = get_cycles() - cycles;
-
-   pr_info("KCSAN: %s end   | cycles: %llu\n", __func__, cycles);
-
-   /* restore context */
-   current->kcsan_ctx = ctx_save;
-}
-
 static int cmp_filterlist_addrs(const void *rhs, const void *lhs)
 {
const unsigned long a = *(const unsigned long *)rhs;
@@ -306,12 +246,6 @@ debugfs_write(struct file *file, const char __user *buf, size_t count, loff_t *o
if (kstrtoul(&arg[strlen("microbench=")], 0, &iters))
return -EINVAL;
microbenchmark(iters);
-   } else if (str_has_prefix(arg, "test=")) {
-   unsigned long iters;
-
-   if (kstrtoul(&arg[strlen("test=")], 0, &iters))
-   return -EINVAL;
-   test_thread(iters);
} else if (!strcmp(arg, "whitelist")) {
set_report_filterlist_whitelist(true);
} else if (!strcmp(arg, "blacklist")) {
-- 
2.28.0.163.g6104cc2f0b6-goog



[PATCH 4/5] kcsan: Show message if enabled early

2020-07-31 Thread Marco Elver
Show a message in the kernel log if KCSAN was enabled early.

Signed-off-by: Marco Elver 
---
 kernel/kcsan/core.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/kcsan/core.c b/kernel/kcsan/core.c
index e43a55643e00..23d0c4e4cd3a 100644
--- a/kernel/kcsan/core.c
+++ b/kernel/kcsan/core.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#define pr_fmt(fmt) "kcsan: " fmt
+
 #include 
 #include 
 #include 
@@ -442,7 +444,7 @@ kcsan_setup_watchpoint(const volatile void *ptr, size_t size, int type)
 
if (IS_ENABLED(CONFIG_KCSAN_DEBUG)) {
kcsan_disable_current();
-   pr_err("KCSAN: watching %s, size: %zu, addr: %px [slot: %d, 
encoded: %lx]\n",
+   pr_err("watching %s, size: %zu, addr: %px [slot: %d, encoded: 
%lx]\n",
   is_write ? "write" : "read", size, ptr,
   watchpoint_slot((unsigned long)ptr),
   encode_watchpoint((unsigned long)ptr, size, is_write));
@@ -601,8 +603,10 @@ void __init kcsan_init(void)
 * We are in the init task, and no other tasks should be running;
 * WRITE_ONCE without memory barrier is sufficient.
 */
-   if (kcsan_early_enable)
+   if (kcsan_early_enable) {
+   pr_info("enabled early\n");
WRITE_ONCE(kcsan_enabled, true);
+   }
 }
 
 /* === Exported interface === */
-- 
2.28.0.163.g6104cc2f0b6-goog



[PATCH 5/5] kcsan: Use pr_fmt for consistency

2020-07-31 Thread Marco Elver
Use the same pr_fmt throughout for consistency. [ The only exception is
report.c, where the format must be kept precisely as-is. ]
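
For reference, how the pr_fmt mechanism works (illustrative sketch only):

  /* Must be defined before the printk declarations are pulled in, hence it
   * sits at the very top of each file. */
  #define pr_fmt(fmt) "kcsan: " fmt

  #include <linux/printk.h>

  static void example(void)
  {
          /* Prints "kcsan: enabled early"; every pr_*() call in this file
           * gets the prefix without repeating it in each format string. */
          pr_info("enabled early\n");
  }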

Signed-off-by: Marco Elver 
---
 kernel/kcsan/debugfs.c  | 8 +---
 kernel/kcsan/selftest.c | 8 +---
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/kcsan/debugfs.c b/kernel/kcsan/debugfs.c
index de1da1b01aa4..6c4914fa2fad 100644
--- a/kernel/kcsan/debugfs.c
+++ b/kernel/kcsan/debugfs.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#define pr_fmt(fmt) "kcsan: " fmt
+
 #include 
 #include 
 #include 
@@ -80,7 +82,7 @@ static noinline void microbenchmark(unsigned long iters)
 */
WRITE_ONCE(kcsan_enabled, false);
 
-   pr_info("KCSAN: %s begin | iters: %lu\n", __func__, iters);
+   pr_info("%s begin | iters: %lu\n", __func__, iters);
 
cycles = get_cycles();
while (iters--) {
@@ -91,7 +93,7 @@ static noinline void microbenchmark(unsigned long iters)
}
cycles = get_cycles() - cycles;
 
-   pr_info("KCSAN: %s end   | cycles: %llu\n", __func__, cycles);
+   pr_info("%s end   | cycles: %llu\n", __func__, cycles);
 
WRITE_ONCE(kcsan_enabled, was_enabled);
/* restore context */
@@ -154,7 +156,7 @@ static ssize_t insert_report_filterlist(const char *func)
ssize_t ret = 0;
 
if (!addr) {
-   pr_err("KCSAN: could not find function: '%s'\n", func);
+   pr_err("could not find function: '%s'\n", func);
return -ENOENT;
}
 
diff --git a/kernel/kcsan/selftest.c b/kernel/kcsan/selftest.c
index d26a052d3383..d98bc208d06d 100644
--- a/kernel/kcsan/selftest.c
+++ b/kernel/kcsan/selftest.c
@@ -1,5 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#define pr_fmt(fmt) "kcsan: " fmt
+
 #include 
 #include 
 #include 
@@ -116,16 +118,16 @@ static int __init kcsan_selftest(void)
if (do_test()) \
++passed;  \
else   \
-   pr_err("KCSAN selftest: " #do_test " failed"); \
+   pr_err("selftest: " #do_test " failed");   \
} while (0)
 
RUN_TEST(test_requires);
RUN_TEST(test_encode_decode);
RUN_TEST(test_matching_access);
 
-   pr_info("KCSAN selftest: %d/%d tests passed\n", passed, total);
+   pr_info("selftest: %d/%d tests passed\n", passed, total);
if (passed != total)
-   panic("KCSAN selftests failed");
+   panic("selftests failed");
return 0;
 }
 postcore_initcall(kcsan_selftest);
-- 
2.28.0.163.g6104cc2f0b6-goog



Re: [tip: locking/kcsan] READ_ONCE: Use data_race() to avoid KCSAN instrumentation

2020-05-20 Thread Marco Elver
On Thu, 21 May 2020 at 00:17, Borislav Petkov  wrote:
>
> Hi,
>
> On Tue, May 12, 2020 at 02:36:53PM -, tip-bot2 for Will Deacon wrote:
> > The following commit has been merged into the locking/kcsan branch of tip:
> >
> > Commit-ID: cdd28ad2d8110099e43527e96d059c5639809680
> > Gitweb:
> > https://git.kernel.org/tip/cdd28ad2d8110099e43527e96d059c5639809680
> > Author:Will Deacon 
> > AuthorDate:Mon, 11 May 2020 21:41:49 +01:00
> > Committer: Thomas Gleixner 
> > CommitterDate: Tue, 12 May 2020 11:04:17 +02:00
> >
> > READ_ONCE: Use data_race() to avoid KCSAN instrumentation
> >
> > Rather then open-code the disabling/enabling of KCSAN across the guts of
> > {READ,WRITE}_ONCE(), defer to the data_race() macro instead.
> >
> > Signed-off-by: Will Deacon 
> > Signed-off-by: Thomas Gleixner 
> > Acked-by: Peter Zijlstra (Intel) 
> > Cc: Marco Elver 
> > Link: https://lkml.kernel.org/r/20200511204150.27858-18-w...@kernel.org
>
> so this commit causes a kernel build slowdown depending on the .config
> of between 50% and over 100%. I just bisected locking/kcsan and got
>
> NOT_OK: cdd28ad2d811 READ_ONCE: Use data_race() to avoid KCSAN instrumentation
> OK: 88f1be32068d kcsan: Rework data_race() so that it can be used by 
> READ_ONCE()
>
> with a simple:
>
> $ git clean -dqfx && mk defconfig
> $ time make -j
>
> I'm not even booting the kernels - simply checking out the above commits
> and building the target kernels. I.e., something in that commit is
> making gcc go nuts in the compilation phases.

This should be fixed when the series that includes this commit is applied:
https://lore.kernel.org/lkml/20200515150338.190344-9-el...@google.com/

Thanks,
-- Marco


Re: [tip: locking/kcsan] READ_ONCE: Use data_race() to avoid KCSAN instrumentation

2020-05-21 Thread Marco Elver
On Thu, 21 May 2020 at 09:26, Borislav Petkov  wrote:
>
> On Thu, May 21, 2020 at 12:30:39AM +0200, Marco Elver wrote:
> > This should be fixed when the series that includes this commit is applied:
> > https://lore.kernel.org/lkml/20200515150338.190344-9-el...@google.com/
>
> Yap, that fixes it.
>
> Thx.

Thanks for confirming. I think Peter also mentioned that nested
statement expressions caused issues.

This probably also means we shouldn't have a nested "data_race()"
macro, to avoid any kind of nested statement expressions where
possible.

I will send a v2 of the above series to add that patch.

Thanks,
-- Marco


Re: [PATCH -tip 08/10] READ_ONCE, WRITE_ONCE: Remove data_race() wrapping

2020-05-21 Thread Marco Elver
On Fri, 15 May 2020 at 17:04, Marco Elver  wrote:
>
> The volatile access no longer needs to be wrapped in data_race(),
> because we require compilers that emit instrumentation distinguishing
> volatile accesses.
>
> Signed-off-by: Marco Elver 
> ---
>  include/linux/compiler.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 17c98b215572..fce56402c082 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -229,7 +229,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, 
> int val,
>  #define __READ_ONCE_SCALAR(x)  \
>  ({ \
> typeof(x) *__xp = &(x); \
> -   __unqual_scalar_typeof(x) __x = data_race(__READ_ONCE(*__xp));  \
> +   __unqual_scalar_typeof(x) __x = __READ_ONCE(*__xp); \
> kcsan_check_atomic_read(__xp, sizeof(*__xp));   \

Some self-review: We don't need kcsan_check_atomic anymore, and this
should be removed.

I'll send v2 to address this (together with fix to data_race()
removing nested statement expressions).

> smp_read_barrier_depends(); \
> (typeof(x))__x; \
> @@ -250,7 +250,7 @@ do {  
>   \
>  do {   \
> typeof(x) *__xp = &(x); \
> kcsan_check_atomic_write(__xp, sizeof(*__xp));  \

Same.

> -   data_race(({ __WRITE_ONCE(*__xp, val); 0; }));  \
> +   __WRITE_ONCE(*__xp, val);   \
>  } while (0)
>
>  #define WRITE_ONCE(x, val) \
> --
> 2.26.2.761.g0e0b3e54be-goog
>


Re: [PATCH -tip 08/10] READ_ONCE, WRITE_ONCE: Remove data_race() wrapping

2020-05-21 Thread Marco Elver
On Thu, 21 May 2020 at 11:47, Marco Elver  wrote:
>
> On Fri, 15 May 2020 at 17:04, Marco Elver  wrote:
> >
> > The volatile access no longer needs to be wrapped in data_race(),
> > because we require compilers that emit instrumentation distinguishing
> > volatile accesses.
> >
> > Signed-off-by: Marco Elver 
> > ---
> >  include/linux/compiler.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > index 17c98b215572..fce56402c082 100644
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -229,7 +229,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, 
> > int val,
> >  #define __READ_ONCE_SCALAR(x)  \
> >  ({ \
> > typeof(x) *__xp = &(x); \
> > -   __unqual_scalar_typeof(x) __x = data_race(__READ_ONCE(*__xp));  \
> > +   __unqual_scalar_typeof(x) __x = __READ_ONCE(*__xp); \
> > kcsan_check_atomic_read(__xp, sizeof(*__xp));   \
>
> Some self-review: We don't need kcsan_check_atomic anymore, and this
> should be removed.
>
> I'll send v2 to address this (together with fix to data_race()
> removing nested statement expressions).

The other thing here is that we no longer require __xp, and can just
pass x into __READ_ONCE.

> > smp_read_barrier_depends(); \
> > (typeof(x))__x; \
> > @@ -250,7 +250,7 @@ do {
> > \
> >  do {   \
> > typeof(x) *__xp = &(x); \
> > kcsan_check_atomic_write(__xp, sizeof(*__xp));  \
>
> Same.

__xp can also be removed.

Note that this effectively aliases __WRITE_ONCE_SCALAR to
__WRITE_ONCE. To keep the API consistent with READ_ONCE, I assume we
want to keep __WRITE_ONCE_SCALAR, in case it is meant to change in
future?

> > -   data_race(({ __WRITE_ONCE(*__xp, val); 0; }));  \
> > +   __WRITE_ONCE(*__xp, val);   \
> >  } while (0)
> >
> >  #define WRITE_ONCE(x, val) \
> > --
> > 2.26.2.761.g0e0b3e54be-goog
> >


[PATCH -tip v2 00/11] Fix KCSAN for new ONCE (require Clang 11)

2020-05-21 Thread Marco Elver
This patch series is the conclusion to [1], where we determined that due
to various interactions with no_sanitize attributes and the new
{READ,WRITE}_ONCE(), KCSAN will require Clang 11 or later. Other
sanitizers are largely untouched, and only KCSAN now has a hard
dependency on Clang 11. To test, a recent Clang development version will
suffice [2]. While a little inconvenient for now, it is hoped that in
future we may be able to fix GCC and re-enable GCC support.

The patch "kcsan: Restrict supported compilers" contains a detailed list
of requirements that led to this decision.

Most of the patches are related to KCSAN, however, the first patch also
includes an UBSAN related fix and is a dependency for the remaining
ones. The last 2 patches clean up the attributes by moving them to the
right place, and fix KASAN's way of defining __no_kasan_or_inline,
making it consistent with KCSAN.

The series has been tested by running kcsan-test several times and
completed successfully.

[1] https://lkml.kernel.org/r/canpmjnogfqhtda9wwpxs2kztqssozbwsumo5bqqw0c0g0zg...@mail.gmail.com
[2] https://github.com/llvm/llvm-project

v2:
* Remove unnecessary kcsan_check_atomic in ONCE.
* Simplify __READ_ONCE_SCALAR and remove __WRITE_ONCE_SCALAR. This
  effectively restores Will Deacon's pre-KCSAN version:
  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/tree/include/linux/compiler.h?h=rwonce/cleanup#n202
* Introduce patch making data_race() a single statement expression in
  response to apparent issues that compilers are having with nested
  statement expressions.

Arnd Bergmann (1):
  ubsan, kcsan: don't combine sanitizer with kcov on clang

Marco Elver (10):
  kcsan: Avoid inserting __tsan_func_entry/exit if possible
  kcsan: Support distinguishing volatile accesses
  kcsan: Pass option tsan-instrument-read-before-write to Clang
  kcsan: Remove 'noinline' from __no_kcsan_or_inline
  kcsan: Restrict supported compilers
  kcsan: Update Documentation to change supported compilers
  READ_ONCE, WRITE_ONCE: Remove data_race() and unnecessary checks
  data_race: Avoid nested statement expression
  compiler.h: Move function attributes to compiler_types.h
  compiler_types.h, kasan: Use __SANITIZE_ADDRESS__ instead of
CONFIG_KASAN to decide inlining

 Documentation/dev-tools/kcsan.rst |  9 +-
 include/linux/compiler.h  | 53 ---
 include/linux/compiler_types.h| 32 +++
 kernel/kcsan/core.c   | 43 +
 lib/Kconfig.kcsan | 20 +++-
 lib/Kconfig.ubsan | 11 +++
 scripts/Makefile.kcsan| 15 -
 7 files changed, 126 insertions(+), 57 deletions(-)

-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 01/11] ubsan, kcsan: don't combine sanitizer with kcov on clang

2020-05-21 Thread Marco Elver
From: Arnd Bergmann 

Clang does not allow -fsanitize-coverage=trace-{pc,cmp} together
with -fsanitize=bounds or with ubsan:

clang: error: argument unused during compilation: '-fsanitize-coverage=trace-pc' [-Werror,-Wunused-command-line-argument]
clang: error: argument unused during compilation: '-fsanitize-coverage=trace-cmp' [-Werror,-Wunused-command-line-argument]

To avoid the warning, check whether clang can handle this correctly
or disallow ubsan and kcsan when kcov is enabled.

Link: https://bugs.llvm.org/show_bug.cgi?id=45831
Link: https://lore.kernel.org/lkml/20200505142341.1096942-1-a...@arndb.de
Acked-by: Marco Elver 
Signed-off-by: Arnd Bergmann 
Signed-off-by: Marco Elver 
---
This patch is already in -rcu tree, but since since the series is based
on -tip, to avoid conflict it is required for the subsequent patches.
---
 lib/Kconfig.kcsan | 11 +++
 lib/Kconfig.ubsan | 11 +++
 2 files changed, 22 insertions(+)

diff --git a/lib/Kconfig.kcsan b/lib/Kconfig.kcsan
index ea28245c6c1d..a7276035ca0d 100644
--- a/lib/Kconfig.kcsan
+++ b/lib/Kconfig.kcsan
@@ -3,9 +3,20 @@
 config HAVE_ARCH_KCSAN
bool
 
+config KCSAN_KCOV_BROKEN
+   def_bool KCOV && CC_HAS_SANCOV_TRACE_PC
+   depends on CC_IS_CLANG
+   depends on !$(cc-option,-Werror=unused-command-line-argument -fsanitize=thread -fsanitize-coverage=trace-pc)
+   help
+ Some versions of clang support either KCSAN or KCOV but not the
+ combination of the two.
+ See https://bugs.llvm.org/show_bug.cgi?id=45831 for the status
+ in newer releases.
+
 menuconfig KCSAN
bool "KCSAN: dynamic data race detector"
depends on HAVE_ARCH_KCSAN && DEBUG_KERNEL && !KASAN
+   depends on !KCSAN_KCOV_BROKEN
select STACKTRACE
help
  The Kernel Concurrency Sanitizer (KCSAN) is a dynamic
diff --git a/lib/Kconfig.ubsan b/lib/Kconfig.ubsan
index 48469c95d78e..3baea77bf37f 100644
--- a/lib/Kconfig.ubsan
+++ b/lib/Kconfig.ubsan
@@ -26,9 +26,20 @@ config UBSAN_TRAP
  the system. For some system builders this is an acceptable
  trade-off.
 
+config UBSAN_KCOV_BROKEN
+   def_bool KCOV && CC_HAS_SANCOV_TRACE_PC
+   depends on CC_IS_CLANG
+   depends on !$(cc-option,-Werror=unused-command-line-argument -fsanitize=bounds -fsanitize-coverage=trace-pc)
+   help
+ Some versions of clang support either UBSAN or KCOV but not the
+ combination of the two.
+ See https://bugs.llvm.org/show_bug.cgi?id=45831 for the status
+ in newer releases.
+
 config UBSAN_BOUNDS
bool "Perform array index bounds checking"
default UBSAN
+   depends on !UBSAN_KCOV_BROKEN
help
  This option enables detection of directly indexed out of bounds
  array accesses, where the array size is known at compile time.
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 05/11] kcsan: Remove 'noinline' from __no_kcsan_or_inline

2020-05-21 Thread Marco Elver
Some compilers incorrectly inline small __no_kcsan functions, which then
results in instrumenting the accesses. For this reason, the 'noinline'
attribute was added to __no_kcsan_or_inline. All known versions of GCC
are affected by this. Supported versions of Clang are unaffected, and
never inline a no_sanitize function.

However, the attribute 'noinline' in __no_kcsan_or_inline causes
unexpected code generation in functions that are __no_kcsan and call a
__no_kcsan_or_inline function.

In certain situations it is expected that the __no_kcsan_or_inline
function is actually inlined by the __no_kcsan function, and *no* calls
are emitted. By removing the 'noinline' attribute we give the compiler
the ability to inline and generate the expected code in __no_kcsan
functions.

Link: https://lkml.kernel.org/r/canpmjnnopjk0tprxkb_deinav_ummorf1-2uajlhnlwqq1h...@mail.gmail.com
Signed-off-by: Marco Elver 
---
 include/linux/compiler.h | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index e24cc3a2bc3e..17c98b215572 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -276,11 +276,9 @@ do {   \
 #ifdef __SANITIZE_THREAD__
 /*
  * Rely on __SANITIZE_THREAD__ instead of CONFIG_KCSAN, to avoid not inlining in
- * compilation units where instrumentation is disabled. The attribute 'noinline'
- * is required for older compilers, where implicit inlining of very small
- * functions renders __no_sanitize_thread ineffective.
+ * compilation units where instrumentation is disabled.
  */
-# define __no_kcsan_or_inline __no_kcsan noinline notrace __maybe_unused
+# define __no_kcsan_or_inline __no_kcsan notrace __maybe_unused
 # define __no_sanitize_or_inline __no_kcsan_or_inline
 #else
 # define __no_kcsan_or_inline __always_inline
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 03/11] kcsan: Support distinguishing volatile accesses

2020-05-21 Thread Marco Elver
In the kernel, volatile is used in various concurrent contexts, whether
in low-level synchronization primitives or for legacy reasons. If
supported by the compiler, we will assume that aligned volatile accesses
up to sizeof(long long) (matching compiletime_assert_rwonce_type()) are
atomic.

Recent versions of Clang [1] (GCC support is tentative [2]) can instrument
volatile accesses differently. Add the (required) option to enable the
instrumentation, and provide the necessary runtime functions. None of
the updated compilers are widely available yet (Clang 11 will be the
first release to support the feature).

[1] https://github.com/llvm/llvm-project/commit/5a2c31116f412c3b6888be361137efd705e05814
[2] https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544452.html

This patch allows removing any explicit checks in primitives such as
READ_ONCE() and WRITE_ONCE().
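
A rough sketch of what the instrumentation boils down to (illustrative;
assuming a compiler that emits the volatile-distinguishing calls defined
below):

  volatile int ready;

  void wait_for_ready(void)
  {
          /* Each read is emitted as __tsan_volatile_read4(&ready) instead of
           * __tsan_read4(&ready); since the access is aligned and
           * sizeof(int) <= sizeof(long long), KCSAN treats it as atomic. */
          while (!ready)
                  ;       /* spin */
  }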

Signed-off-by: Marco Elver 
---
v2:
* Reword Makefile comment.
---
 kernel/kcsan/core.c| 43 ++
 scripts/Makefile.kcsan |  5 -
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/kernel/kcsan/core.c b/kernel/kcsan/core.c
index a73a66cf79df..15f67949d11e 100644
--- a/kernel/kcsan/core.c
+++ b/kernel/kcsan/core.c
@@ -789,6 +789,49 @@ void __tsan_write_range(void *ptr, size_t size)
 }
 EXPORT_SYMBOL(__tsan_write_range);
 
+/*
+ * Use of explicit volatile is generally disallowed [1], however, volatile is
+ * still used in various concurrent context, whether in low-level
+ * synchronization primitives or for legacy reasons.
+ * [1] https://lwn.net/Articles/233479/
+ *
+ * We only consider volatile accesses atomic if they are aligned and would pass
+ * the size-check of compiletime_assert_rwonce_type().
+ */
+#define DEFINE_TSAN_VOLATILE_READ_WRITE(size)                                  \
+   void __tsan_volatile_read##size(void *ptr) \
+   {  \
+   const bool is_atomic = size <= sizeof(long long) &&\
+  IS_ALIGNED((unsigned long)ptr, size);   \
+   if (IS_ENABLED(CONFIG_KCSAN_IGNORE_ATOMICS) && is_atomic)  \
+   return;\
+   check_access(ptr, size, is_atomic ? KCSAN_ACCESS_ATOMIC : 0);  \
+   }  \
+   EXPORT_SYMBOL(__tsan_volatile_read##size); \
+   void __tsan_unaligned_volatile_read##size(void *ptr)   \
+   __alias(__tsan_volatile_read##size);   \
+   EXPORT_SYMBOL(__tsan_unaligned_volatile_read##size);   \
+   void __tsan_volatile_write##size(void *ptr)\
+   {  \
+   const bool is_atomic = size <= sizeof(long long) &&\
+  IS_ALIGNED((unsigned long)ptr, size);   \
+   if (IS_ENABLED(CONFIG_KCSAN_IGNORE_ATOMICS) && is_atomic)  \
+   return;\
+   check_access(ptr, size,\
+KCSAN_ACCESS_WRITE |  \
+(is_atomic ? KCSAN_ACCESS_ATOMIC : 0));   \
+   }  \
+   EXPORT_SYMBOL(__tsan_volatile_write##size);\
+   void __tsan_unaligned_volatile_write##size(void *ptr)  \
+   __alias(__tsan_volatile_write##size);  \
+   EXPORT_SYMBOL(__tsan_unaligned_volatile_write##size)
+
+DEFINE_TSAN_VOLATILE_READ_WRITE(1);
+DEFINE_TSAN_VOLATILE_READ_WRITE(2);
+DEFINE_TSAN_VOLATILE_READ_WRITE(4);
+DEFINE_TSAN_VOLATILE_READ_WRITE(8);
+DEFINE_TSAN_VOLATILE_READ_WRITE(16);
+
 /*
  * The below are not required by KCSAN, but can still be emitted by the
  * compiler.
diff --git a/scripts/Makefile.kcsan b/scripts/Makefile.kcsan
index 20337a7ecf54..75d2942b9437 100644
--- a/scripts/Makefile.kcsan
+++ b/scripts/Makefile.kcsan
@@ -9,7 +9,10 @@ else
 cc-param = --param -$(1)
 endif
 
+# Keep most options here optional, to allow enabling more compilers if absence
+# of some options does not break KCSAN nor causes false positive reports.
 CFLAGS_KCSAN := -fsanitize=thread \
-   $(call cc-option,$(call cc-param,tsan-instrument-func-entry-exit=0) -fno-optimize-sibling-calls)
+   $(call cc-option,$(call cc-param,tsan-instrument-func-entry-exit=0) -fno-optimize-sibling-calls) \
+   $(call cc-param,tsan-distinguish-volatile=1)
 
 endif # CONFIG_KCSAN
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 09/11] data_race: Avoid nested statement expression

2020-05-21 Thread Marco Elver
It appears that compilers have trouble with nested statement
expressions, so make the data_race() macro a single statement
expression. This will help us avoid potential problems in future as its
usage increases.
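
The shape of the change, reduced to its essence (the full version is in the
diff below; A() and B() stand in for the kcsan_{disable,enable} calls):

  /* before: one statement expression nested inside another */
  ({ A(); ({ T __v = (expr); B(); __v; }); })

  /* after: a single, flat statement expression */
  ({ T __v; A(); __v = (expr); B(); __v; })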

Link: https://lkml.kernel.org/r/20200520221712.ga21...@zn.tnic
Signed-off-by: Marco Elver 
---
v2:
* Add patch to series in response to above linked discussion.
---
 include/linux/compiler.h | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 7444f026eead..1f9bd9f35368 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -211,12 +211,11 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
  */
 #define data_race(expr) \
 ({ \
+   __unqual_scalar_typeof(({ expr; })) __v;\
__kcsan_disable_current();  \
-   ({  \
-   __unqual_scalar_typeof(({ expr; })) __v = ({ expr; });  \
-   __kcsan_enable_current();   \
-   __v;\
-   }); \
+   __v = ({ expr; });  \
+   __kcsan_enable_current();   \
+   __v;\
 })
 
 /*
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 04/11] kcsan: Pass option tsan-instrument-read-before-write to Clang

2020-05-21 Thread Marco Elver
Clang (unlike GCC) removes reads before writes with matching addresses
in the same basic block. This is an optimization for TSAN, since the
write will always cause a conflict if the preceding read would have.

However, KCSAN cannot rely on this optimization, because we apply
several special rules to writes, in particular when the
KCSAN_ASSUME_PLAIN_WRITES_ATOMIC option is selected. To avoid missing
potential data races, pass the -tsan-instrument-read-before-write option
to Clang if it is available [1].

[1] 
https://github.com/llvm/llvm-project/commit/151ed6aa38a3ec6c01973b35f684586b6e1c0f7e
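
The case this guards against, in a minimal sketch (illustrative only):

  int x;

  void bump(void)
  {
          int tmp = x;    /* plain read ...                           */
          x = tmp + 1;    /* ... then plain write to the same address */
  }                       /* in the same basic block                  */

With the default TSAN behaviour the read would not be instrumented (only the
write), so a race on the read could go unnoticed, e.g. when
KCSAN_ASSUME_PLAIN_WRITES_ATOMIC causes the write to be treated as atomic.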

Signed-off-by: Marco Elver 
---
 scripts/Makefile.kcsan | 1 +
 1 file changed, 1 insertion(+)

diff --git a/scripts/Makefile.kcsan b/scripts/Makefile.kcsan
index 75d2942b9437..bd4da1af5953 100644
--- a/scripts/Makefile.kcsan
+++ b/scripts/Makefile.kcsan
@@ -13,6 +13,7 @@ endif
 # of some options does not break KCSAN nor causes false positive reports.
 CFLAGS_KCSAN := -fsanitize=thread \
$(call cc-option,$(call cc-param,tsan-instrument-func-entry-exit=0) -fno-optimize-sibling-calls) \
+   $(call cc-option,$(call cc-param,tsan-instrument-read-before-write=1)) \
$(call cc-param,tsan-distinguish-volatile=1)
 
 endif # CONFIG_KCSAN
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 08/11] READ_ONCE, WRITE_ONCE: Remove data_race() and unnecessary checks

2020-05-21 Thread Marco Elver
The volatile accesses no longer need to be wrapped in data_race(),
because we require compilers that emit instrumentation distinguishing
volatile accesses. Consequently, we also no longer require the explicit
kcsan_check_atomic*(), since the compiler emits instrumentation
distinguishing the volatile accesses. Finally, simplify
__READ_ONCE_SCALAR and remove __WRITE_ONCE_SCALAR.

Signed-off-by: Marco Elver 
---
v2:
* Remove unnecessary kcsan_check_atomic*() in *_ONCE.
* Simplify __READ_ONCE_SCALAR and remove __WRITE_ONCE_SCALAR. This
  effectively restores Will Deacon's pre-KCSAN version:
  
https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/tree/include/linux/compiler.h?h=rwonce/cleanup#n202
---
 include/linux/compiler.h | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 17c98b215572..7444f026eead 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -228,9 +228,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val,
 
 #define __READ_ONCE_SCALAR(x)  \
 ({ \
-   typeof(x) *__xp = &(x); \
-   __unqual_scalar_typeof(x) __x = data_race(__READ_ONCE(*__xp));  \
-   kcsan_check_atomic_read(__xp, sizeof(*__xp));   \
+   __unqual_scalar_typeof(x) __x = __READ_ONCE(x); \
smp_read_barrier_depends(); \
(typeof(x))__x; \
 })
@@ -246,17 +244,10 @@ do {  \
*(volatile typeof(x) *)&(x) = (val);\
 } while (0)
 
-#define __WRITE_ONCE_SCALAR(x, val)\
-do {   \
-   typeof(x) *__xp = &(x); \
-   kcsan_check_atomic_write(__xp, sizeof(*__xp));  \
-   data_race(({ __WRITE_ONCE(*__xp, val); 0; }));  \
-} while (0)
-
 #define WRITE_ONCE(x, val) \
 do {   \
compiletime_assert_rwonce_type(x);  \
-   __WRITE_ONCE_SCALAR(x, val);\
+   __WRITE_ONCE(x, val);   \
 } while (0)
 
 #ifdef CONFIG_KASAN
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 07/11] kcsan: Update Documentation to change supported compilers

2020-05-21 Thread Marco Elver
Signed-off-by: Marco Elver 
---
 Documentation/dev-tools/kcsan.rst | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/Documentation/dev-tools/kcsan.rst b/Documentation/dev-tools/kcsan.rst
index f4b5766f12cc..ce4bbd918648 100644
--- a/Documentation/dev-tools/kcsan.rst
+++ b/Documentation/dev-tools/kcsan.rst
@@ -8,8 +8,7 @@ approach to detect races. KCSAN's primary purpose is to detect `data races`_.
 Usage
 -
 
-KCSAN is supported in both GCC and Clang. With GCC it requires version 7.3.0 or
-later. With Clang it requires version 7.0.0 or later.
+KCSAN requires Clang version 11 or later.
 
 To enable KCSAN configure the kernel with::
 
@@ -121,12 +120,6 @@ the below options are available:
 static __no_kcsan_or_inline void foo(void) {
 ...
 
-  Note: Older compiler versions (GCC < 9) also do not always honor the
-  ``__no_kcsan`` attribute on regular ``inline`` functions. If false positives
-  with these compilers cannot be tolerated, for small functions where
-  ``__always_inline`` would be appropriate, ``__no_kcsan_or_inline`` should be
-  preferred instead.
-
 * To disable data race detection for a particular compilation unit, add to the
   ``Makefile``::
 
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 02/11] kcsan: Avoid inserting __tsan_func_entry/exit if possible

2020-05-21 Thread Marco Elver
To avoid inserting __tsan_func_{entry,exit}, add the option to omit them
if supported by the compiler. Currently only Clang can be told not to
emit calls to these functions. It is safe to not emit them, since KCSAN
does not rely on them.

Note that, if we disable __tsan_func_{entry,exit}(), we need to disable
tail-call optimization in sanitized compilation units, as otherwise we
may skip frames in the stack trace; in particular, when the tail-called
function is one of KCSAN's runtime functions and a report is generated,
we might miss the function where the actual access occurred. Since
__tsan_func_{entry,exit}() insertion effectively disabled tail-call
optimization anyway, there should be no observable change. [This was
caught and confirmed with kcsan-test & UNWINDER_ORC.]
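
For reference, a sketch of the instrumentation being omitted (illustrative;
the exact emitted code is a compiler detail):

  void foo(void)
  {
          __tsan_func_entry(__builtin_return_address(0)); /* emitted at entry */
          /* ... function body ... */
          __tsan_func_exit();                             /* emitted at exit  */
  }

KCSAN does not rely on these hooks, so omitting the calls only removes
overhead; -fno-optimize-sibling-calls then stands in for the tail-call
suppression that the inserted calls used to provide as a side effect.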

Signed-off-by: Marco Elver 
---
 scripts/Makefile.kcsan | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/scripts/Makefile.kcsan b/scripts/Makefile.kcsan
index cafa28ae..20337a7ecf54 100644
--- a/scripts/Makefile.kcsan
+++ b/scripts/Makefile.kcsan
@@ -1,6 +1,15 @@
 # SPDX-License-Identifier: GPL-2.0
 ifdef CONFIG_KCSAN
 
-CFLAGS_KCSAN := -fsanitize=thread
+# GCC and Clang accept backend options differently. Do not wrap in cc-option,
+# because Clang accepts "--param" even if it is unused.
+ifdef CONFIG_CC_IS_CLANG
+cc-param = -mllvm -$(1)
+else
+cc-param = --param -$(1)
+endif
+
+CFLAGS_KCSAN := -fsanitize=thread \
+   $(call cc-option,$(call cc-param,tsan-instrument-func-entry-exit=0) 
-fno-optimize-sibling-calls)
 
 endif # CONFIG_KCSAN
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 06/11] kcsan: Restrict supported compilers

2020-05-21 Thread Marco Elver
The first version of Clang that supports -tsan-distinguish-volatile will
be able to support KCSAN. The first Clang release to do so will be
Clang 11. This is due to satisfying all of the following requirements:

1. Never emit calls to __tsan_func_{entry,exit}.

2. __no_kcsan functions should not call anything, not even
   kcsan_{enable,disable}_current(), when using __{READ,WRITE}_ONCE => Requires
   leaving them plain!

3. Support atomic_{read,set}*() with KCSAN, which rely on
   arch_atomic_{read,set}*() using __{READ,WRITE}_ONCE() => Because of
   #2, rely on Clang 11's -tsan-distinguish-volatile support. We will
   double-instrument atomic_{read,set}*(), but that's reasonable given
   it's still lower cost than the data_race() variant due to avoiding 2
   extra calls (kcsan_{en,dis}able_current() calls).

4. __always_inline functions inlined into __no_kcsan functions are never
   instrumented.

5. __always_inline functions inlined into instrumented functions are
   instrumented.

6. __no_kcsan_or_inline functions may be inlined into __no_kcsan functions =>
   Implies leaving 'noinline' off of __no_kcsan_or_inline.

7. Because of #6, __no_kcsan and __no_kcsan_or_inline functions should never be
   spuriously inlined into instrumented functions, causing the accesses of the
   __no_kcsan function to be instrumented.

Older versions of Clang do not satisfy #3. The latest GCC currently doesn't
support at least #1, #3, and #7.
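
Requirements #4 and #5, as a minimal sketch (illustrative only):

  static __always_inline void set_flag(int *p) { *p = 1; }

  static __no_kcsan void quiet_path(int *p)
  {
          set_flag(p);    /* #4: after inlining, must NOT be instrumented */
  }

  static void noisy_path(int *p)
  {
          set_flag(p);    /* #5: after inlining, must be instrumented */
  }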

Link: https://lkml.kernel.org/r/CANpmjNMTsY_8241bS7=xafqvzhflrvekv_um4aduwe_kh3r...@mail.gmail.com
Signed-off-by: Marco Elver 
---
 lib/Kconfig.kcsan | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/Kconfig.kcsan b/lib/Kconfig.kcsan
index a7276035ca0d..3f3b5bca7a8f 100644
--- a/lib/Kconfig.kcsan
+++ b/lib/Kconfig.kcsan
@@ -3,6 +3,12 @@
 config HAVE_ARCH_KCSAN
bool
 
+config HAVE_KCSAN_COMPILER
+   def_bool CC_IS_CLANG && $(cc-option,-fsanitize=thread -mllvm -tsan-distinguish-volatile=1)
+   help
+ For the list of compilers that support KCSAN, please see
+ .
+
 config KCSAN_KCOV_BROKEN
def_bool KCOV && CC_HAS_SANCOV_TRACE_PC
depends on CC_IS_CLANG
@@ -15,7 +21,8 @@ config KCSAN_KCOV_BROKEN
 
 menuconfig KCSAN
bool "KCSAN: dynamic data race detector"
-   depends on HAVE_ARCH_KCSAN && DEBUG_KERNEL && !KASAN
+   depends on HAVE_ARCH_KCSAN && HAVE_KCSAN_COMPILER
+   depends on DEBUG_KERNEL && !KASAN
depends on !KCSAN_KCOV_BROKEN
select STACKTRACE
help
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 11/11] compiler_types.h, kasan: Use __SANITIZE_ADDRESS__ instead of CONFIG_KASAN to decide inlining

2020-05-21 Thread Marco Elver
As is done for KCSAN, KASAN should also use __always_inline in
compilation units that have instrumentation disabled
(KASAN_SANITIZE_foo.o := n). Also add common documentation for KASAN and
KCSAN explaining the attribute.

Signed-off-by: Marco Elver 
---
 include/linux/compiler_types.h | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index b190a12e7089..5faf68eae204 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -167,7 +167,14 @@ struct ftrace_likely_data {
  */
 #define noinline_for_stack noinline
 
-#ifdef CONFIG_KASAN
+/*
+ * Sanitizer helper attributes: Because using __always_inline and
+ * __no_sanitize_* conflict, provide helper attributes that will either expand
+ * to __no_sanitize_* in compilation units where instrumentation is enabled
+ * (__SANITIZE_*__), or __always_inline in compilation units without
+ * instrumentation (__SANITIZE_*__ undefined).
+ */
+#ifdef __SANITIZE_ADDRESS__
 /*
  * We can't declare function 'inline' because __no_sanitize_address conflicts
  * with inlining. Attempt to inline it may cause a build failure.
@@ -182,10 +189,6 @@ struct ftrace_likely_data {
 
 #define __no_kcsan __no_sanitize_thread
 #ifdef __SANITIZE_THREAD__
-/*
- * Rely on __SANITIZE_THREAD__ instead of CONFIG_KCSAN, to avoid not inlining 
in
- * compilation units where instrumentation is disabled.
- */
 # define __no_kcsan_or_inline __no_kcsan notrace __maybe_unused
 # define __no_sanitize_or_inline __no_kcsan_or_inline
 #else
-- 
2.26.2.761.g0e0b3e54be-goog



[PATCH -tip v2 10/11] compiler.h: Move function attributes to compiler_types.h

2020-05-21 Thread Marco Elver
Cleanup and move the KASAN and KCSAN related function attributes to
compiler_types.h, where the rest of the same kind live.

No functional change intended.

Signed-off-by: Marco Elver 
---
 include/linux/compiler.h   | 29 -
 include/linux/compiler_types.h | 29 +
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 1f9bd9f35368..8d3d03f9d562 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -249,35 +249,6 @@ do {   \
__WRITE_ONCE(x, val);   \
 } while (0)
 
-#ifdef CONFIG_KASAN
-/*
- * We can't declare function 'inline' because __no_sanitize_address conflicts
- * with inlining. Attempt to inline it may cause a build failure.
- * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67368
- * '__maybe_unused' allows us to avoid defined-but-not-used warnings.
- */
-# define __no_kasan_or_inline __no_sanitize_address notrace __maybe_unused
-# define __no_sanitize_or_inline __no_kasan_or_inline
-#else
-# define __no_kasan_or_inline __always_inline
-#endif
-
-#define __no_kcsan __no_sanitize_thread
-#ifdef __SANITIZE_THREAD__
-/*
- * Rely on __SANITIZE_THREAD__ instead of CONFIG_KCSAN, to avoid not inlining 
in
- * compilation units where instrumentation is disabled.
- */
-# define __no_kcsan_or_inline __no_kcsan notrace __maybe_unused
-# define __no_sanitize_or_inline __no_kcsan_or_inline
-#else
-# define __no_kcsan_or_inline __always_inline
-#endif
-
-#ifndef __no_sanitize_or_inline
-#define __no_sanitize_or_inline __always_inline
-#endif
-
 static __no_sanitize_or_inline
 unsigned long __read_once_word_nocheck(const void *addr)
 {
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 6ed0612bc143..b190a12e7089 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -167,6 +167,35 @@ struct ftrace_likely_data {
  */
 #define noinline_for_stack noinline
 
+#ifdef CONFIG_KASAN
+/*
+ * We can't declare function 'inline' because __no_sanitize_address conflicts
+ * with inlining. Attempt to inline it may cause a build failure.
+ * https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67368
+ * '__maybe_unused' allows us to avoid defined-but-not-used warnings.
+ */
+# define __no_kasan_or_inline __no_sanitize_address notrace __maybe_unused
+# define __no_sanitize_or_inline __no_kasan_or_inline
+#else
+# define __no_kasan_or_inline __always_inline
+#endif
+
+#define __no_kcsan __no_sanitize_thread
+#ifdef __SANITIZE_THREAD__
+/*
+ * Rely on __SANITIZE_THREAD__ instead of CONFIG_KCSAN, to avoid not inlining in
+ * compilation units where instrumentation is disabled.
+ */
+# define __no_kcsan_or_inline __no_kcsan notrace __maybe_unused
+# define __no_sanitize_or_inline __no_kcsan_or_inline
+#else
+# define __no_kcsan_or_inline __always_inline
+#endif
+
+#ifndef __no_sanitize_or_inline
+#define __no_sanitize_or_inline __always_inline
+#endif
+
 #endif /* __KERNEL__ */
 
 #endif /* __ASSEMBLY__ */
-- 
2.26.2.761.g0e0b3e54be-goog



Re: [PATCH -tip 00/10] Fix KCSAN for new ONCE (require Clang 11)

2020-05-21 Thread Marco Elver
On Fri, 15 May 2020 at 17:03, Marco Elver  wrote:
>
> This patch series is the conclusion to [1], where we determined that due
> to various interactions with no_sanitize attributes and the new
> {READ,WRITE}_ONCE(), KCSAN will require Clang 11 or later. Other
> sanitizers are largely untouched, and only KCSAN now has a hard
> dependency on Clang 11. To test, a recent Clang development version will
> suffice [2]. While a little inconvenient for now, it is hoped that in
> future we may be able to fix GCC and re-enable GCC support.
>
> The patch "kcsan: Restrict supported compilers" contains a detailed list
> of requirements that led to this decision.
>
> Most of the patches are related to KCSAN, however, the first patch also
> includes an UBSAN related fix and is a dependency for the remaining
> ones. The last 2 patches clean up the attributes by moving them to the
> right place, and fix KASAN's way of defining __no_kasan_or_inline,
> making it consistent with KCSAN.
>
> The series has been tested by running kcsan-test several times and
> completed successfully.
>
> [1] 
> https://lkml.kernel.org/r/canpmjnogfqhtda9wwpxs2kztqssozbwsumo5bqqw0c0g0zg...@mail.gmail.com
> [2] https://github.com/llvm/llvm-project
>


Superseded by v2:
https://lkml.kernel.org/r/20200521110854.114437-1-el...@google.com


Re: [PATCH -tip v2 03/11] kcsan: Support distinguishing volatile accesses

2020-05-21 Thread Marco Elver
On Thu, 21 May 2020 at 15:18, Will Deacon  wrote:
>
> On Thu, May 21, 2020 at 01:08:46PM +0200, Marco Elver wrote:
> > In the kernel, volatile is used in various concurrent context, whether
> > in low-level synchronization primitives or for legacy reasons. If
> > supported by the compiler, we will assume that aligned volatile accesses
> > up to sizeof(long long) (matching compiletime_assert_rwonce_type()) are
> > atomic.
> >
> > Recent versions Clang [1] (GCC tentative [2]) can instrument volatile
> > accesses differently. Add the option (required) to enable the
> > instrumentation, and provide the necessary runtime functions. None of
> > the updated compilers are widely available yet (Clang 11 will be the
> > first release to support the feature).
> >
> > [1] 
> > https://github.com/llvm/llvm-project/commit/5a2c31116f412c3b6888be361137efd705e05814
> > [2] https://gcc.gnu.org/pipermail/gcc-patches/2020-April/544452.html
> >
> > This patch allows removing any explicit checks in primitives such as
> > READ_ONCE() and WRITE_ONCE().
> >
> > Signed-off-by: Marco Elver 
> > ---
> > v2:
> > * Reword Makefile comment.
> > ---
> >  kernel/kcsan/core.c| 43 ++
> >  scripts/Makefile.kcsan |  5 -
> >  2 files changed, 47 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/kcsan/core.c b/kernel/kcsan/core.c
> > index a73a66cf79df..15f67949d11e 100644
> > --- a/kernel/kcsan/core.c
> > +++ b/kernel/kcsan/core.c
> > @@ -789,6 +789,49 @@ void __tsan_write_range(void *ptr, size_t size)
> >  }
> >  EXPORT_SYMBOL(__tsan_write_range);
> >
> > +/*
> > + * Use of explicit volatile is generally disallowed [1], however, volatile 
> > is
> > + * still used in various concurrent context, whether in low-level
> > + * synchronization primitives or for legacy reasons.
> > + * [1] https://lwn.net/Articles/233479/
> > + *
> > + * We only consider volatile accesses atomic if they are aligned and would 
> > pass
> > + * the size-check of compiletime_assert_rwonce_type().
> > + */
> > +#define DEFINE_TSAN_VOLATILE_READ_WRITE(size)  
> > \
> > + void __tsan_volatile_read##size(void *ptr)
> >  \
> > + { 
> >  \
> > + const bool is_atomic = size <= sizeof(long long) &&   
> >  \
> > +IS_ALIGNED((unsigned long)ptr, size);  
> >  \
> > + if (IS_ENABLED(CONFIG_KCSAN_IGNORE_ATOMICS) && is_atomic) 
> >  \
> > + return;   
> >  \
> > + check_access(ptr, size, is_atomic ? KCSAN_ACCESS_ATOMIC : 0); 
> >  \
> > + } 
> >  \
> > + EXPORT_SYMBOL(__tsan_volatile_read##size);
> >  \
> > + void __tsan_unaligned_volatile_read##size(void *ptr)  
> >  \
> > + __alias(__tsan_volatile_read##size);  
> >  \
> > + EXPORT_SYMBOL(__tsan_unaligned_volatile_read##size);  
> >  \
> > + void __tsan_volatile_write##size(void *ptr)   
> >  \
> > + { 
> >  \
> > + const bool is_atomic = size <= sizeof(long long) &&   
> >  \
> > +IS_ALIGNED((unsigned long)ptr, size);  
> >  \
> > + if (IS_ENABLED(CONFIG_KCSAN_IGNORE_ATOMICS) && is_atomic) 
> >  \
> > + return;   
> >  \
> > + check_access(ptr, size,   
> >  \
> > +  KCSAN_ACCESS_WRITE | 
> >  \
> > +  (is_atomic ? KCSAN_ACCESS_ATOMIC : 0));  
> >  \
> > + } 
> >  \
> > + EXPORT_SYMBOL(__tsan_volatile_write##size);   
> >  \
> > + void __tsan_unaligned_volatile_write##size(void *ptr) 
> >  \
> > + __alias(__tsan_volatile_write##size); 
> >  \
> > + EXPORT_SYMBOL(__tsan_unaligned_volatile_write##size)
> > +
> > +DEFINE_TSAN_VOLATILE_READ_WRITE(1);
> > +DEFINE_TSAN_VOLATILE_READ_WRITE(2);
> > +DEFINE_TSAN_VOLATILE_READ_WRITE(4);
> > +DEFINE_TSAN_VOLATILE_READ_WRITE(8);
> > +DEFINE_TSAN_VOLATILE_READ_WRITE(16);
>
> Having a 16-byte case seems a bit weird to me, but I guess clang needs this
> for some reason?

Yes, the emitted fixed-size instrumentation is up to 16 bytes, so
we'll need it (for both volatile and non-volatile -- otherwise we'll
get linker errors). It doesn't mean we'll consider 16 byte volatile
accesses as atomic, because of the size check to compute is_atomic
above.
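
Spelled out against the macro from the patch (illustrative):

  /* In the expansion of DEFINE_TSAN_VOLATILE_READ_WRITE(16): */
  const bool is_atomic = 16 <= sizeof(long long) &&      /* false: 16 > 8 */
                         IS_ALIGNED((unsigned long)ptr, 16);
  /* is_atomic is always false here, so a 16-byte volatile access is checked
   * like any other plain access. */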

Thanks,
-- Marco


Re: [PATCH -tip v2 07/11] kcsan: Update Documentation to change supported compilers

2020-05-21 Thread Marco Elver
On Thu, 21 May 2020 at 15:33, Will Deacon  wrote:
>
> On Thu, May 21, 2020 at 01:08:50PM +0200, Marco Elver wrote:
> > Signed-off-by: Marco Elver 
> > ---
> >  Documentation/dev-tools/kcsan.rst | 9 +
> >  1 file changed, 1 insertion(+), 8 deletions(-)
>
> -ENOCOMMITMSG

Oops. Ok, then there will be a v3.

> Will
>


Re: [PATCH -tip v2 09/11] data_race: Avoid nested statement expression

2020-05-21 Thread Marco Elver
On Thu, 21 May 2020 at 15:31, Will Deacon  wrote:
>
> On Thu, May 21, 2020 at 01:08:52PM +0200, Marco Elver wrote:
> > It appears that compilers have trouble with nested statements
> > expressions, as such make the data_race() macro be only a single
> > statement expression. This will help us avoid potential problems in
> > future as its usage increases.
> >
> > Link: https://lkml.kernel.org/r/20200520221712.ga21...@zn.tnic
> > Signed-off-by: Marco Elver 
> > ---
> > v2:
> > * Add patch to series in response to above linked discussion.
> > ---
> >  include/linux/compiler.h | 9 -
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> > index 7444f026eead..1f9bd9f35368 100644
> > --- a/include/linux/compiler.h
> > +++ b/include/linux/compiler.h
> > @@ -211,12 +211,11 @@ void ftrace_likely_update(struct ftrace_likely_data 
> > *f, int val,
> >   */
> >  #define data_race(expr)
> >   \
> >  ({   \
> > + __unqual_scalar_typeof(({ expr; })) __v;\
> >   __kcsan_disable_current();  \
> > - ({  \
> > - __unqual_scalar_typeof(({ expr; })) __v = ({ expr; });  \
> > - __kcsan_enable_current();   \
> > - __v;\
> > - }); \
> > + __v = ({ expr; });  \
> > + __kcsan_enable_current();   \
> > + __v;\
>
> Hopefully it doesn't matter, but this will run into issues with 'const'
> non-scalar expressions.

Good point. We could move the kcsan_disable_current() into ({
__kcsan_disable_current(); expr; }).

Will fix for v3.
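
A sketch of what that could look like (my reading of the suggestion above,
not the actual v3 patch):

  #define data_race(expr)                                                \
  ({                                                                     \
          __unqual_scalar_typeof(({ expr; })) __v = ({                   \
                  __kcsan_disable_current();                             \
                  expr;                                                  \
          });                                                            \
          __kcsan_enable_current();                                      \
          __v;                                                           \
  })

This keeps the initialization at the point of declaration, so const
non-scalar expressions still work, while data_race() itself remains a single
statement expression.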

Thanks,
-- Marco

> Will
>


Re: [PATCH -tip v2 00/11] Fix KCSAN for new ONCE (require Clang 11)

2020-05-21 Thread Marco Elver
On Thu, 21 May 2020 at 15:36, Will Deacon  wrote:
>
> On Thu, May 21, 2020 at 01:08:43PM +0200, Marco Elver wrote:
> > This patch series is the conclusion to [1], where we determined that due
> > to various interactions with no_sanitize attributes and the new
> > {READ,WRITE}_ONCE(), KCSAN will require Clang 11 or later. Other
> > sanitizers are largely untouched, and only KCSAN now has a hard
> > dependency on Clang 11. To test, a recent Clang development version will
> > suffice [2]. While a little inconvenient for now, it is hoped that in
> > future we may be able to fix GCC and re-enable GCC support.
> >
> > The patch "kcsan: Restrict supported compilers" contains a detailed list
> > of requirements that led to this decision.
> >
> > Most of the patches are related to KCSAN, however, the first patch also
> > includes an UBSAN related fix and is a dependency for the remaining
> > ones. The last 2 patches clean up the attributes by moving them to the
> > right place, and fix KASAN's way of defining __no_kasan_or_inline,
> > making it consistent with KCSAN.
> >
> > The series has been tested by running kcsan-test several times and
> > completed successfully.
>
> I've left a few minor comments, but the only one that probably needs a bit
> of thought is using data_race() with const non-scalar expressions, since I
> think that's now prohibited by these changes. We don't have too many
> data_race() users yet, so probably not a big deal, but worth bearing in
> mind.

If you don't mind, I'll do a v3 with that fixed.

> Other than that,
>
> Acked-by: Will Deacon 

Thank you!

-- Marco

> Thanks!
>
> Will


[PATCH -tip v3 06/11] kcsan: Restrict supported compilers

2020-05-21 Thread Marco Elver
The first version of Clang that supports -tsan-distinguish-volatile will
be able to support KCSAN. The first Clang release to do so will be
Clang 11. This is due to satisfying all of the following requirements:

1. Never emit calls to __tsan_func_{entry,exit}.

2. __no_kcsan functions should not call anything, not even
   kcsan_{enable,disable}_current(), when using __{READ,WRITE}_ONCE => Requires
   leaving them plain!

3. Support atomic_{read,set}*() with KCSAN, which rely on
   arch_atomic_{read,set}*() using __{READ,WRITE}_ONCE() => Because of
   #2, rely on Clang 11's -tsan-distinguish-volatile support. We will
   double-instrument atomic_{read,set}*(), but that's reasonable given
   it's still lower cost than the data_race() variant due to avoiding 2
   extra calls (kcsan_{en,dis}able_current() calls).

4. __always_inline functions inlined into __no_kcsan functions are never
   instrumented.

5. __always_inline functions inlined into instrumented functions are
   instrumented.

6. __no_kcsan_or_inline functions may be inlined into __no_kcsan functions =>
   Implies leaving 'noinline' off of __no_kcsan_or_inline.

7. Because of #6, __no_kcsan and __no_kcsan_or_inline functions should never be
   spuriously inlined into instrumented functions, causing the accesses of the
   __no_kcsan function to be instrumented.

Older versions of Clang do not satisfy #3. The latest GCC currently doesn't
support at least #1, #3, and #7.

Link: 
https://lkml.kernel.org/r/CANpmjNMTsY_8241bS7=xafqvzhflrvekv_um4aduwe_kh3r...@mail.gmail.com
Acked-by: Will Deacon 
Signed-off-by: Marco Elver 
---
 lib/Kconfig.kcsan | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/Kconfig.kcsan b/lib/Kconfig.kcsan
index a7276035ca0d..3f3b5bca7a8f 100644
--- a/lib/Kconfig.kcsan
+++ b/lib/Kconfig.kcsan
@@ -3,6 +3,12 @@
 config HAVE_ARCH_KCSAN
bool
 
+config HAVE_KCSAN_COMPILER
+   def_bool CC_IS_CLANG && $(cc-option,-fsanitize=thread -mllvm -tsan-distinguish-volatile=1)
+   help
+ For the list of compilers that support KCSAN, please see
+ .
+
 config KCSAN_KCOV_BROKEN
def_bool KCOV && CC_HAS_SANCOV_TRACE_PC
depends on CC_IS_CLANG
@@ -15,7 +21,8 @@ config KCSAN_KCOV_BROKEN
 
 menuconfig KCSAN
bool "KCSAN: dynamic data race detector"
-   depends on HAVE_ARCH_KCSAN && DEBUG_KERNEL && !KASAN
+   depends on HAVE_ARCH_KCSAN && HAVE_KCSAN_COMPILER
+   depends on DEBUG_KERNEL && !KASAN
depends on !KCSAN_KCOV_BROKEN
select STACKTRACE
help
-- 
2.26.2.761.g0e0b3e54be-goog


