[PATCH RFC 1/1] doc/rcu: Add some more listRCU patterns in the kernel

2019-06-01 Thread Joel Fernandes (Google)
We keep the initially written audit examples and add to it, since the
code that audit has is still relevant even though slightly different in
the kernel.

Cc: r...@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) 
---
 Documentation/RCU/listRCU.txt | 154 +++---
 1 file changed, 144 insertions(+), 10 deletions(-)

diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt
index adb5a3782846..af5bf1bd689c 100644
--- a/Documentation/RCU/listRCU.txt
+++ b/Documentation/RCU/listRCU.txt
@@ -7,8 +7,54 @@ is that all of the required memory barriers are included for you in
 the list macros.  This document describes several applications of RCU,
 with the best fits first.
 
-
-Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates
+Example 1: Read-mostly list: Deferred Destruction
+
+A widely used use case for RCU lists in the kernel is lockless iteration over
+all processes in the system. task_struct::tasks is the list node that
+links all the processes. The list can be traversed in parallel to any list
+additions or removals.
+
+The traversal of the list is done using for_each_process(), which is defined
+by these two macros:
+
+#define next_task(p) \
+   list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
+
+#define for_each_process(p) \
+   for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+
+The code traversing the list of all processes typically looks like:
+rcu_read_lock();
+for_each_process(p) {
+   /* Do something with p */
+}
+rcu_read_unlock();
+
+The (simplified) code that removes a process from the task list is in
+release_task():
+
+void release_task(struct task_struct *p)
+{
+   write_lock(&tasklist_lock);
+   list_del_rcu(&p->tasks);
+   write_unlock(&tasklist_lock);
+   call_rcu(&p->rcu, delayed_put_task_struct);
+}
+
+When a process exits, release_task() calls list_del_rcu(&p->tasks) to remove
+the task from the list of all tasks, under tasklist_lock writer lock
+protection. The tasklist_lock prevents concurrent list adds/removes from
+corrupting the list. Readers using for_each_process() are not protected by
+the tasklist_lock. Because such readers may still hold references to the
+removed task, the task_struct object is freed only after one or more grace
+periods elapse (with the help of call_rcu()). This deferring of destruction
+ensures that any readers traversing the list will see valid p->tasks.next
+pointers, so deletion/freeing can happen in parallel with traversal of the
+list. This pattern is also sometimes called an "existence lock", since RCU
+ensures that the object remains in memory for as long as any reader might
+still be traversing it.
+
+
+Example 2: Read-Side Action Taken Outside of Lock, No In-Place Updates
 
 The best applications are cases where, if reader-writer locking were
 used, the read-side lock would be dropped before taking any action
@@ -32,7 +78,7 @@ implementation of audit_filter_task() might be as follows:
enum audit_state   state;
 
read_lock(&auditsc_lock);
-   /* Note: audit_netlink_sem held by caller. */
+   /* Note: audit_filter_mutex held by caller. */
list_for_each_entry(e, &audit_tsklist, list) {
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
read_unlock(&auditsc_lock);
@@ -56,7 +102,7 @@ This means that RCU can be easily applied to the read side, as follows:
enum audit_state   state;
 
rcu_read_lock();
-   /* Note: audit_netlink_sem held by caller. */
+   /* Note: audit_filter_mutex held by caller. */
list_for_each_entry_rcu(e, &audit_tsklist, list) {
if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
rcu_read_unlock();
@@ -139,7 +185,7 @@ Following are the RCU equivalents for these two functions:
 
 Normally, the write_lock() and write_unlock() would be replaced by
 a spin_lock() and a spin_unlock(), but in this case, all callers hold
-audit_netlink_sem, so no additional locking is required.  The auditsc_lock
+audit_filter_mutex, so no additional locking is required.  The auditsc_lock
 can therefore be eliminated, since use of RCU eliminates the need for
 writers to exclude readers.  Normally, the write_lock() calls would
 be converted into spin_lock() calls.
@@ -155,7 +201,7 @@ So, when readers can tolerate stale data and when entries are either added
 or deleted, without in-place modification, it is very easy to use RCU!
 
 
-Example 2: Handling In-Place Updates
+Example 3: Handling In-Place Updates
 
 The system-call auditing code does not update auditing rules in place.
 However, if it did, reader-writer-locked code to do so might look as
@@ -171,7 +217,7 @@ otherwise, the added fields would need to be filled in):
struct audit_newentry *ne;
 
write_lock(


Re: [PATCH RFC 1/1] doc/rcu: Add some more listRCU patterns in the kernel

2019-06-01 Thread Joel Fernandes
Forgot to CC +Neil Brown, will do in the next posting, thanks!


Re: [PATCH RFC] Rough draft document on merging and rebasing

2019-06-01 Thread Theodore Ts'o
On Thu, May 30, 2019 at 01:53:17PM -0600, Jonathan Corbet wrote:
> +Rebasing
> +
> +
> +"Rebasing" is the process of changing the history of a series of commits
> +within a repository.  At its simplest, a rebase could change the starting
> +point of a patch series from one point to another.  Other uses include
> +fixing (or deleting) broken commits, adding tags to commits, or changing
> +the order in which commits are applied.  Used properly, rebasing can yield
> +a cleaner and clearer development history; used improperly, it can obscure
> +that history and introduce bugs.

Rebasing is being used in two senses here.  The first is where the
diffs don't change (much), but the starting point of the changes is
being changed.  The second is where a broken commit is dropped, adding
a signed-off-by, etc.

Both have the property that they can be used for good or ill, but the
details of when this is an issue can change a bit.  More below...

> +There are a few rules of thumb that can help developers to avoid the worst
> +perils of rebasing:
> +
> + - History that has been exposed to the world beyond your private system
> +   should not be rebased.  Others may have pulled a copy of your tree and
> +   built on it; rebasing your tree will create pain for them.  If work is
> +   in need of rebasing, that is usually a sign that it is not yet ready to
> +   be committed to a public repository.

That seems to be a bit too categorical.  It's been recommended, and
some people do this, to push patches to a branch so the zero-day bot
will pick them up and test for potential problems.  And broken commits
*do* get dropped from candidate stable kernel releases.  And, of
course, there's linux-next, which is constantly getting rebased.

And there have been people who have pushed out RFC patch series onto
a git branch, and publicized it on LKML for the convenience of
reviewers.  (Perhaps because there is a patch which is so big it
exceeds the LKML size restrictions.)

I think it's more about whether people know that a branch is
considered unstable from a historical perspective or not.  No one
builds on top of linux-next because, well, that would be silly.

Finally, I'm a bit concerned about anything which states absolutes,
because there are people who tend to be real sticklers for the rules,
and if they see something stated in absolute terms, they fail to
understand that there are exceptions that are well understood, and in
use for years before the existence of the document which is trying to
codify best practices.

> + - Realize the rebasing a patch series changes the environment in which it
> +   was developed and, likely, invalidates much of the testing that was
> +   done.  A rebased patch series should, as a general rule, be treated like
> +   new code and retested from the beginning.

In this paragraph, "rebasing" is being used in the sense of changing
the base commit on top of which a series of changes was developed.
And it's what most people think of when they see the word "rebase".

However "git rebase" is also used to rewrite git history when dropping
a bad commit, or fixing things so the branch is bisectable, etc.  But
in the definitions above, the word "rebase" was defined to include
both "changing the base of a patch stack", and "rewriting git
history".  I wonder if that helps more than it hinders understanding.

At least in my mind, "rebasing" is much more often the wrong thing,
whereas "rewriting git history" for a leaf repository can be more
often justifiable.  And of course, if someone is working on a feature
for which the review and development cycle takes several kernel
releases, I'd claim that "rebasing" in that case always makes sense,
even if they *have* exposed their patch series via git for review /
commenting purposes.

Regards,

- Ted
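The two senses of "rebase" distinguished above can be shown with plain
git commands in a throwaway repository.  The repository layout, branch
names, and commit messages below are made up purely for illustration:

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo
git config user.email you@example.com
git config user.name "You"
base=$(git symbolic-ref --short HEAD)   # master or main, depending on git version

echo base > f.txt; git add f.txt; git commit -qm "base"
git checkout -qb feature
for c in a b c; do echo "$c" > "$c.txt"; git add "$c.txt"; git commit -qm "feature: $c"; done

# Sense 1: change the starting point of the series.  Mainline moves on,
# and the whole feature branch is replayed on top of the new tip.
git checkout -q "$base"
echo v2 >> f.txt; git commit -qam "mainline advances"
git checkout -q feature
git rebase -q "$base"

# Sense 2: rewrite the history itself, e.g. drop the broken middle
# commit ("feature: b").  --onto replays HEAD~1..HEAD (just "feature: c")
# on top of HEAD~2 ("feature: a").
git rebase -q --onto HEAD~2 HEAD~1

git log --oneline
```

After both steps the branch contains base, "mainline advances",
"feature: a", and "feature: c" -- the broken commit is gone and the
series sits on the new mainline tip.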


Re: [PATCH v3 2/3] kernel/ucounts: expose count of inotify watches in use

2019-06-01 Thread Albert Vaca Cintora
On Sat, Jun 1, 2019 at 2:00 AM Andrew Morton  wrote:
>
> On Fri, 31 May 2019 21:50:15 +0200 Albert Vaca Cintora  
> wrote:
>
> > Adds a readonly 'current_inotify_watches' entry to the user sysctl table.
> > The handler for this entry is a custom function that ends up calling
> > proc_dointvec. Said sysctl table already contains 'max_inotify_watches'
> > and it gets mounted under /proc/sys/user/.
> >
> > Inotify watches are a finite resource, in a similar way to available file
> > descriptors. The motivation for this patch is to be able to set up
> > monitoring and alerting before an application starts failing because
> > it runs out of inotify watches.
> >
> > ...
> >
> > --- a/kernel/ucount.c
> > +++ b/kernel/ucount.c
> > @@ -118,6 +118,26 @@ static void put_ucounts(struct ucounts *ucounts)
> >   kfree(ucounts);
> >  }
> >
> > +#ifdef CONFIG_INOTIFY_USER
> > +int proc_read_inotify_watches(struct ctl_table *table, int write,
> > +  void __user *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > + struct ucounts *ucounts;
> > + struct ctl_table fake_table;
>
> hmm.
>
> > + int count = -1;
> > +
> > + ucounts = get_ucounts(current_user_ns(), current_euid());
> > + if (ucounts != NULL) {
> > + count = atomic_read(&ucounts->ucount[UCOUNT_INOTIFY_WATCHES]);
> > + put_ucounts(ucounts);
> > + }
> > +
> > + fake_table.data = &count;
> > + fake_table.maxlen = sizeof(count);
> > + return proc_dointvec(&fake_table, write, buffer, lenp, ppos);
>
> proc_dointvec
> ->do_proc_dointvec
>   ->__do_proc_dointvec
> ->proc_first_pos_non_zero_ignore
>   ->warn_sysctl_write
> ->pr_warn_once(..., table->procname)
>
> and I think ->procname is uninitialized.
>
> That's from a cursory check.  Perhaps other uninitialized members of
> fake_table are accessed, dunno.
>
> we could do
>
> {
> struct ctl_table fake_table = {
> .data = &count,
> .maxlen = sizeof(count),
> };
>
> return proc_dointvec(&fake_table, write, buffer, lenp, ppos);
> }
>
> or whatever.  That will cause the pr_warn_once to print "(null)" but
> that's OK I guess.
>
> Are there other places in the kernel which do this temp ctl_table
> trick?  If so, what do they do?  If not, what is special about this
> code?
>
>

I copied this 'fake_table' trick from proc_do_entropy in
drivers/char/random.c exactly as it is. It is also used in other
places with slight variations.

Note that, since we are creating a read-only proc file,
proc_first_pos_non_zero_ignore is not called from __do_proc_dointvec,
so the uninitialized ->procname is not accessed.

Albert