Re: [PATCH -next v3 3/3] nfs make use of str_false_true helper
On Tue, Aug 27, 2024 at 10:45:17AM +0800, Hongbo Li wrote:
> The helper str_false_true is introduced to give back the "false/true"
> string literal. We can simplify this format by str_false_true.

This seems unnecessarily verbose & complex.  How about:

	dprintk("%s: link support=%s\n", __func__, strbool(*res != 0));

> -	dprintk("%s: link support=%s\n", __func__, *res == 0 ? "false" : "true");
> +	dprintk("%s: link support=%s\n", __func__, str_false_true(*res == 0));

(do we have a convention for the antonym of kstrtoX?)
Re: [PATCH] seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_cstr()
On Thu, Oct 26, 2023 at 10:54:26AM -0700, Kees Cook wrote:
> > >	do_something(seq_buf_cstr(s));
> >
> > Do we really need to call it _cstr? Why not just have seq_buf_str() ?
> >
> > I mean, this is C, do we need to state that in the name too?
>
> I'm fine either way. I did that just to make the distinction between our
> length-managed string of characters interface (seq_buf), and the
> %NUL-terminated string of characters (traditionally called "C String" in
> other languages). And it was still shorter than "seq_buf_terminate(s);
> s->buffer" ;)

'cstr' might be short for 'counted string' ...
Re: [PATCH v2] seq_buf: Introduce DECLARE_SEQ_BUF and seq_buf_str()
On Fri, Oct 27, 2023 at 06:54:51AM +0200, Christoph Hellwig wrote:
> > Instead, we can just return s->buffer directly after terminating it
> > in refactored seq_buf_terminate(), now known as seq_buf_str():
> >
> >	do_something(seq_buf_str(s));
>
> Looks good.  Btw, one typical do_something would be printing it,
> so adding a format specifier that's using this helper would also
> probably be very useful.

my hope is to get vsprintf.c completely refactored to use seq_buf
internally, and then printsb(&sb) would actually be a primitive we'd
have instead of printk("%pSB", &sb); this would btw let us get rid of
the entire %pFOO infrastructure.  and make dump_page() far less crap.
Re: Limited/Broken functionality of ASLR for Libs >= 2MB
On Mon, Jan 15, 2024 at 04:40:36PM +, Sam James wrote:
> m...@horotw.com writes:
> > Hey, I read that ASLR is currently (since kernel >=5.18) broken for
> > 32bit libs and reduced in effectiveness for 64bit libs... (the issue
> > only arises if a lib is over 2MB).
> > I confirmed this for myself but only for the 64bit case.
> >
> > I saw that this issue is being tracked by ubuntu
> > (https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1983357).
> > If this is the wrong place and I should instead report it elsewhere I
> > am very sorry.
>
> See also https://bugs.debian.org/1024149.  Unfortunately, I don't
> think the issue found its way upstream until now (thanks).
>
> CCing relevant maintainers (per the Debian bug).

You know, my email address is all over that commit and the doofus who
"discovered the vulnerability" didn't even have the courtesy to let me
know.  I've had several private emails about this over the last few
days and I just don't care.  Who's running 32-bit code and cares about
security?  32-bit kernels are known-vulnerable to all kinds of security
problems, and I think this is the least of your worries.

This was intended to happen, it's not a surprise.
Re: Limited/Broken functionality of ASLR for Libs >= 2MB
On Mon, Jan 15, 2024 at 07:21:19PM +0100, m...@horotw.com wrote:
> On 15.01.2024 17:52, Matthew Wilcox wrote:
> > On Mon, Jan 15, 2024 at 04:40:36PM +, Sam James wrote:
> > > m...@horotw.com writes:
> > > > Hey, I read that ASLR is currently (since kernel >=5.18) broken for
> > > > 32bit libs and reduced in effectiveness for 64bit libs... (the issue
> > > > only arises if a lib is over 2MB).
> > > > I confirmed this for myself but only for the 64bit case.
> > > >
> > > > I saw that this issue is being tracked by ubuntu
> > > > (https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1983357).
> > > > If this is the wrong place and I should instead report it elsewhere I
> > > > am very sorry.
> > >
> > > See also https://bugs.debian.org/1024149.  Unfortunately, I don't
> > > think the issue found its way upstream until now (thanks).
> > >
> > > CCing relevant maintainers (per the Debian bug).
> >
> > You know, my email address is all over that commit and the doofus who
> > "discovered the vulnerability" didn't even have the courtesy to let
> > me know.  I've had several private emails about this over the last few
> > days and I just don't care.  Who's running 32-bit code and cares about
> > security?  32-bit kernels are known-vulnerable to all kinds of security
> > problems, and I think this is the least of your worries.
> >
> > This was intended to happen, it's not a surprise.
>
> Hi,
> first of all I am very sorry, I didn't realize I should have contacted
> you first (I'm not the one who found the bug initially), I will do it
> differently in the future.

I'm not annoyed *at you*.  I'm annoyed at the guy who first "discovered"
it.  I'm annoyed at the people who are running around with their hair on
fire.  I'm annoyed at all the people who *didn't* contact me.

> Unfortunately, my knowledge is not sufficient to judge how bad it is
> that 32bit effectively has no ASLR support anymore.
>
> 64bit is also affected, even though there are probably more than enough
> bits left there?  I have since seen that both Arch and Ubuntu seem to
> have "patches" in place
> (https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/commit/3904bcb32cc58c10232fb618bf96c1b43b0bc9d7)
> in which they set the `CONFIG_ARCH_MMAP_RND_BITS=32` and
> `CONFIG_ARCH_MMAP_RND_COMPAT_BITS=16`, I'm not sure if this is a good
> result or if it will cause other problems.

Yeah, I don't know either.  Outside my scope of expertise.  I received
a suggestion off-list that we only do the PMD alignment on 64-bit,
which seems quite reasonable to me.  After all, I don't care about
performance on 32-bit just as much as I don't care about security on
32-bit.
Re: [PATCH] ARM: unwind: improve unwinders for noreturn case
On Wed, Mar 20, 2024 at 11:30:13AM +0800, Jiangfeng Xiao wrote:
> The checkpatch.pl script reports the "WARNING: printk() should
> include KERN_ facility level" warning.
>
> That's why I changed printk to pr_warn.
> I should change printk to printk(KERN_DEFAULT).

No, you should ignore checkpatch.  For bonus points, figure out why you
should ignore it specifically in this case.
Re: [PATCH v2] fs: Set file_handle::handle_bytes before referencing file_handle::f_handle
On Thu, Apr 04, 2024 at 02:12:15PM -0700, Kees Cook wrote:
> Since __counted_by(handle_bytes) was added to struct file_handle, we need
> to explicitly set it in the one place it wasn't yet happening prior to
> accessing the flex array "f_handle". For robustness also check for a
> negative value for handle_bytes, which is possible for an "int", but
> nothing appears to set.

Why not change handle_bytes from an int to a u32?

Also, what a grotty function.

	handle_dwords = f_handle.handle_bytes >> 2;
...
	handle_bytes = handle_dwords * sizeof(u32);

> Fixes: 1b43c4629756 ("fs: Annotate struct file_handle with __counted_by()
> and use struct_size()")
> Signed-off-by: Kees Cook
> ---
> Cc: Jan Kara
> Cc: Chuck Lever
> Cc: Gustavo A. R. Silva
> Cc: Christian Brauner
> Cc: Alexander Viro
> Cc: Jeff Layton
> Cc: Amir Goldstein
> Cc: linux-fsde...@vger.kernel.org
> Cc: linux-...@vger.kernel.org
> Cc: linux-hardening@vger.kernel.org
> v2: more bounds checking, add comments, dropped reviews since logic changed
> v1: https://lore.kernel.org/all/20240403215358.work.365-k...@kernel.org/
> ---
>  fs/fhandle.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fhandle.c b/fs/fhandle.c
> index 8a7f86c2139a..854f866eaad2 100644
> --- a/fs/fhandle.c
> +++ b/fs/fhandle.c
> @@ -40,6 +40,11 @@ static long do_sys_name_to_handle(const struct path *path,
>  			   GFP_KERNEL);
>  	if (!handle)
>  		return -ENOMEM;
> +	/*
> +	 * Since handle->f_handle is about to be written, make sure the
> +	 * associated __counted_by(handle_bytes) variable is correct.
> +	 */
> +	handle->handle_bytes = f_handle.handle_bytes;
>
>  	/* convert handle size to multiple of sizeof(u32) */
>  	handle_dwords = f_handle.handle_bytes >> 2;
> @@ -51,8 +56,8 @@ static long do_sys_name_to_handle(const struct path *path,
>  	handle->handle_type = retval;
>  	/* convert handle size to bytes */
>  	handle_bytes = handle_dwords * sizeof(u32);
> -	handle->handle_bytes = handle_bytes;
> -	if ((handle->handle_bytes > f_handle.handle_bytes) ||
> +	/* check if handle_bytes would have exceeded the allocation */
> +	if ((handle_bytes < 0) || (handle_bytes > f_handle.handle_bytes) ||
>  	    (retval == FILEID_INVALID) || (retval < 0)) {
>  		/* As per old exportfs_encode_fh documentation
>  		 * we could return ENOSPC to indicate overflow
> @@ -68,6 +73,8 @@ static long do_sys_name_to_handle(const struct path *path,
>  		handle_bytes = 0;
>  	} else
>  		retval = 0;
> +	/* the "valid" number of bytes may be fewer than originally allocated */
> +	handle->handle_bytes = handle_bytes;
>  	/* copy the mount id */
>  	if (put_user(real_mount(path->mnt)->mnt_id, mnt_id) ||
>  	    copy_to_user(ufh, handle,
> --
> 2.34.1
Re: [PATCH] alloc_tag: Tighten file permissions on /proc/allocinfo
On Thu, Apr 25, 2024 at 04:45:51PM -0400, Kent Overstreet wrote:
> On Thu, Apr 25, 2024 at 01:08:50PM -0700, Kees Cook wrote:
> > The /proc/allocinfo file exposes a tremendous amount of information
> > about kernel build details, memory allocations (obviously), and
> > potentially even image layout (due to ordering). As this is intended
> > to be consumed by system owners (like /proc/slabinfo), use the same
> > file permissions as there: 0400.
>
> Err...
>
> The side effect of locking down more and more reporting interfaces is
> that programs that consume those interfaces now have to run as root.

sudo cat /proc/allocinfo | analyse-that-file
Re: [PATCH] alloc_tag: Tighten file permissions on /proc/allocinfo
On Thu, Apr 25, 2024 at 08:58:34PM -0400, Kent Overstreet wrote:
> On Thu, Apr 25, 2024 at 05:43:33PM -0700, Kees Cook wrote:
> > All this said, I'm still not excited about any of these files living
> > in /proc at all -- we were supposed to use /sys for this kind of thing,
> > but its interface wasn't great for this kind of more "free-form" data,
> > and debugfs isn't good for production interfaces. /proc really should
> > only have pid information -- we end up exposing these top-level files to
> > every mount namespace with a /proc mount. :( But that's a yet-to-be-solved
> > problem...
>
> It really wouldn't be that hard to relax the 4k file limit in sysfs.

It's a lot harder to relax the GregKH opposition to multiple values per
file in sysfs.
Re: [PATCH v3] fs: fix unintentional arithmetic wraparound in offset calculation
On Fri, May 17, 2024 at 12:29:06AM +, Justin Stitt wrote:
> When running syzkaller with the newly reintroduced signed integer
> overflow sanitizer we encounter this report:

why do you keep saying it's unintentional?  it's clearly intended.
Re: [PATCH v3] fs: fix unintentional arithmetic wraparound in offset calculation
On Fri, May 17, 2024 at 02:26:47AM +0100, Al Viro wrote:
> On Fri, May 17, 2024 at 02:13:22AM +0100, Matthew Wilcox wrote:
> > On Fri, May 17, 2024 at 12:29:06AM +, Justin Stitt wrote:
> > > When running syzkaller with the newly reintroduced signed integer
> > > overflow sanitizer we encounter this report:
> >
> > why do you keep saying it's unintentional?  it's clearly intended.
>
> Because they are short on actual bugs to be found by their tooling
> and attempt to inflate the sound/noise rate; therefore, every time
> when overflow _IS_ handled correctly, it must have been an accident -
> we couldn't have possibly done the analysis correctly.  And if somebody
> insists that they _are_ capable of basic math, they must be dishonest.
> So... "unintentional" it's going to be.
>
> Math is hard, mmkay?
>
> Al, more than slightly annoyed by that aspect of the entire thing...

Yes, some of the patches I've seen floating past actually seem nice,
but the vast majority just seem like make-work.  And the tone is
definitely inappropriate.
Re: [RFC] Mitigating unexpected arithmetic overflow
On Wed, May 08, 2024 at 11:11:35PM -0700, Kees Cook wrote:
> Or even just simple math between u8s:
>
>	while (xas->xa_offset == 255) {
>		xas->xa_offset = xas->xa_node->offset - 1;
>
> Is it expecting to wrap (truncate)? How is it distinguished from
> the "num_elems++" case?

Hi ;-)

This looks to me like it's walking up the tree, looking to go to the
'prev' element.  So yes, the wrapping from 0 to 255 is intentional,
because once we find an offset that's not 0, we've set offset to the
right one.  Just to give more context, here's the few lines around it:

        if (xas->xa_offset != get_offset(xas->xa_index, xas->xa_node))
                xas->xa_offset--;

        while (xas->xa_offset == 255) {
                xas->xa_offset = xas->xa_node->offset - 1;
                xas->xa_node = xa_parent(xas->xa, xas->xa_node);
                if (!xas->xa_node)
                        return set_bounds(xas);
        }

So it's kind of nasty.  I don't want to write it as:

	while (xas->xa_offset == 0) {
		xas->xa_offset = xas->xa_node->offset;
		xas->xa_node = xa_parent(xas->xa, xas->xa_node);
	}
	xas->xa_offset--;

because there's that conditional subtraction before the loop starts.
The xarray test-suite is pretty good if anyone thinks they have a
clearer way to write this loop ...
Re: [PATCH v4 1/1] exec: seal system mappings
On Mon, Nov 25, 2024 at 08:20:21PM +, jef...@chromium.org wrote:
> +/*
> + * Kernel cmdline override for CONFIG_SEAL_SYSTEM_MAPPINGS
> + */
> +enum seal_system_mappings_type {
> +	SEAL_SYSTEM_MAPPINGS_DISABLED,
> +	SEAL_SYSTEM_MAPPINGS_ENABLED
> +};
> +
> +static enum seal_system_mappings_type seal_system_mappings_v __ro_after_init =
> +	IS_ENABLED(CONFIG_SEAL_SYSTEM_MAPPINGS) ? SEAL_SYSTEM_MAPPINGS_ENABLED :
> +	SEAL_SYSTEM_MAPPINGS_DISABLED;
> +
> +static const struct constant_table value_table_sys_mapping[] __initconst = {
> +	{ "no", SEAL_SYSTEM_MAPPINGS_DISABLED},
> +	{ "yes", SEAL_SYSTEM_MAPPINGS_ENABLED},
> +	{ }
> +};
> +
> +static int __init early_seal_system_mappings_override(char *buf)
> +{
> +	if (!buf)
> +		return -EINVAL;
> +
> +	seal_system_mappings_v = lookup_constant(value_table_sys_mapping,
> +						 buf, seal_system_mappings_v);
> +	return 0;
> +}
> +
> +early_param("exec.seal_system_mappings", early_seal_system_mappings_override);

Are you paid by the line?  This all seems ridiculously overcomplicated.
Look at (first example I found) kgdbwait:

static int __init opt_kgdb_wait(char *str)
{
	kgdb_break_asap = 1;

	kdb_init(KDB_INIT_EARLY);
	if (kgdb_io_module_registered &&
	    IS_ENABLED(CONFIG_ARCH_HAS_EARLY_DEBUG))
		kgdb_initial_breakpoint();

	return 0;
}

early_param("kgdbwait", opt_kgdb_wait);

I don't understand why you've created a new 'exec' namespace, and why
this feature fits in 'exec'.  That seems like an implementation detail.
I'd lose the "exec." prefix.
Re: [PATCH] mm: Handle compound pages better in __dump_page()
On Sat, Nov 16, 2024 at 09:52:44PM -0800, Kees Cook wrote:
> GCC 15's -Warray-bounds reports:
>
> In function 'page_fixed_fake_head',
>     inlined from '_compound_head' at ../include/linux/page-flags.h:251:24,
>     inlined from '__dump_page' at ../mm/debug.c:123:11:
> ../include/asm-generic/rwonce.h:44:26: warning: array subscript 9 is
> outside array bounds of 'struct page[1]' [-Warray-bounds=]

Thanks for bringing this back up.  I have a somewhat orphaned patch in
my tree that has a terrible commit message which was no help.  That
said, this patch is definitely wrong because it's unsafe to call
page_fixed_fake_head().

> (Not noted in this warning is that the code passes through page_folio()
> _Generic macro.)
>
> It may not be that "precise" is always 1 page, so accessing "page[1]"
> in either page_folio() or folio_test_large() may cause problems.

folio_test_large() does not touch page[1].  Look:

static inline bool folio_test_large(const struct folio *folio)
{
	return folio_test_head(folio);

static __always_inline bool folio_test_head(const struct folio *folio)
{
	return test_bit(PG_head, const_folio_flags(folio, FOLIO_PF_ANY));

#define FOLIO_PF_ANY		0

static const unsigned long *const_folio_flags(const struct folio *folio,
		unsigned n)
{
	const struct page *page = &folio->page;

	VM_BUG_ON_PGFLAGS(PageTail(page), page);
	VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags), page);
	return &page[n].flags;

so we only look at page[0].

> Instead, explicitly make precise 2 pages. Just open-coding page_folio()
> isn't sufficient to avoid the warning[1].

Why not?  What goes wrong?  I'm trying to get gcc-15 installed here
now ...
Re: [PATCH -next] mm: usercopy: add a debugfs interface to bypass the vmalloc check.
On Tue, Dec 03, 2024 at 10:31:59AM +0800, Ze Zuo wrote:
> The commit 0aef499f3172 ("mm/usercopy: Detect vmalloc overruns")
> introduced vmalloc check for usercopy. However, in subsystems like
> networking, when memory allocated using vmalloc or vmap is subsequently
> copied using functions like copy_to_iter/copy_from_iter, the check is
> triggered. This adds overhead in the copy path, such as the cost of
> searching the red-black tree, which increases the performance burden.
>
> We found that after merging this patch, network bandwidth performance
> in the XDP scenario significantly dropped from 25 Gbits/sec to
> 8 Gbits/sec, the hardened_usercopy is enabled by default.

What is "the XDP scenario", exactly?  Are these large or small packets?

What's taking the time in find_vmap_area()?  Is it lock contention?
Re: [PATCH -next] mm: usercopy: add a debugfs interface to bypass the vmalloc check.
On Tue, Dec 03, 2024 at 08:02:26PM +0100, Uladzislau Rezki wrote:

I think there are a few other things we can try here.  First, if the
copy is small (and I still don't have an answer to that ...), we can
skip the vmalloc lookup if the copy doesn't cross a page boundary.

Second, we could try storing this in a maple tree rather than an
rbtree.  That gives us RCU protected lookups rather than under a
spinlock.

It might even be worth going to a rwlock first, in case the problem is
that there's severe lock contention.

But I've asked for data on spinlock contention and not received an
answer on that either, so I don't know what to suggest.

Anyway, NACK to the original patch; that's just a horrible idea.
Re: [PATCH -next] mm: usercopy: add a debugfs interface to bypass the vmalloc check.
On Wed, Dec 04, 2024 at 09:51:07AM +0100, Uladzislau Rezki wrote:
> I think, when i have more free cycles, i will check it from performance
> point of view. Because i do not know how much a maple tree is efficient
> when it comes to lookups, insert and removing.

Maple tree has a fanout of around 8-12 at each level, while an rbtree
has a fanout of two (arguably 3, since we might find the node).  Let's
say you have 1000 vmalloc areas.  A perfectly balanced rbtree would
have 9 levels (and might well be 11+ levels if imperfectly balanced --
and part of the advantage of rbtrees over AVL trees is that they can be
less balanced so need fewer rotations).  A perfectly balanced maple
tree would have only 3 levels.

Addition/removal is more expensive.  We biased the implementation
heavily towards lookup, so we chose to keep it very compact.  Most
users (and particularly the VMA tree which was our first client) do
more lookups than modifications; a real application takes many more
pagefaults than it does calls to mmap/munmap/mprotect/etc.

> As an RCU safe data structure, yes, a searching is improved in a way
> there is no need in taking spinlock. As a noted earlier i do not know
> if a maple tree allows to find a data when instead of key, it is
> associated with, we pass something that is within a searchable area:
> [va_start:va_end].

That's what maple trees do; they store non-overlapping ranges.  So you
can look up any address in a range and it will return you the pointer
associated with that range.  Just like you'd want for a page fault ;-)
Re: [PATCH] inotify: Use strscpy() for event->name copies
On Mon, Dec 16, 2024 at 02:45:15PM -0800, Kees Cook wrote:
> Since we have already allocated "len + 1" space for event->name, make
> sure that name->name cannot ever accidentally cause a copy overflow by
> calling strscpy() instead of the unbounded strcpy() routine. This
> assists in the ongoing efforts to remove the unsafe strcpy() API[1]
> from the kernel.

Since a qstr can't contain a NUL before the length, why not just use
memcpy()?

>  	event->name_len = len;
>  	if (len)
> -		strcpy(event->name, name->name);
> +		strscpy(event->name, name->name, event->name_len + 1);
Re: [PATCH] mm: Handle compound pages better in __dump_page()
On Sun, Nov 17, 2024 at 08:46:45PM -0800, Kees Cook wrote:
> On Mon, Nov 18, 2024 at 04:10:52AM +0000, Matthew Wilcox wrote:
> > folio_test_large() does not touch page[1].  Look:
>
> It does, though. :( It's via the PageTail(), which calls
> page_is_fake_head():

Oh.  It shouldn't; that's unnecessary.

+++ b/include/linux/page-flags.h
@@ -306,7 +306,7 @@ static const unsigned long *const_folio_flags(const struct folio *folio,
 {
 	const struct page *page = &folio->page;
 
-	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+	VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
 	VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags), page);
 	return &page[n].flags;
 }
@@ -315,7 +315,7 @@ static unsigned long *folio_flags(struct folio *folio, unsigned n)
 {
 	struct page *page = &folio->page;
 
-	VM_BUG_ON_PGFLAGS(PageTail(page), page);
+	VM_BUG_ON_PGFLAGS(page->compound_head & 1, page);
 	VM_BUG_ON_PGFLAGS(n > 0 && !test_bit(PG_head, &page->flags), page);
 	return &page[n].flags;
 }

should fix that.
Re: [PATCH v4 1/1] exec: seal system mappings
On Thu, Jan 23, 2025 at 04:50:46PM -0500, enh wrote:
> yeah, at this point i should (a) drag in +cferris who may have actual
> experience of this and (b) admit that iirc i've never personally seen
> _evidence_ of this, just claims. most famously in the chrome source...
> if you `grep -r /proc/.*/maps` you'll find lots of examples, but
> something like
> https://chromium.googlesource.com/chromium/src/+/main/base/debug/proc_maps_linux.h#61
> is quite representative of the "folklore" in this area.

That folklore is 100% based on a true story!  I'm not sure that all of
the details are precisely correct, but it's true enough that I wouldn't
quibble with it.

In fact, we want to make it worse.  Because the mmap_lock is such a
huge point of contention, we want to read /proc/PID/maps protected only
by RCU.  That will relax the guarantees to:

a. If a VMA existed and was not modified during the duration of the
   read, it will definitely be returned.
b. If a VMA was added during the call, it might be returned.
c. If a VMA was removed during the call, it might be returned.
d. If an address was covered by a VMA before the call and that VMA was
   modified during the call, you might get the prior or posterior state
   of the VMA.  And you might get both!

What might be confusing:

e. If VMA A is added, then VMA B is added, your call might show you
   VMA B and not VMA A.
f. Similarly for deleted.
g. If you have, say, a VMA from (4000-9000) and you mprotect the region
   (5000-6000), you might see:

	4000-9000 oldA
   or
	4000-5000 newA
	4000-9000 oldA
   or
	4000-5000 newA
	5000-6000 newB
	4000-9000 oldA
   or
	4000-5000 newA
	5000-6000 newB
	6000-9000 newC

(it's possible other combinations might be visible; i'm not working on
the details of this right now)

We shouldn't be able to _skip_ a VMA.  That seems far worse than
returning duplicates; if your maps parser sees duplicates it can either
try to figure it out itself, or retry the whole read.
Re: Re: Re: [PATCH v2] treewide: const qualify ctl_tables where applicable
On Mon, Jan 27, 2025 at 04:55:58PM +0200, Jani Nikula wrote:
> You could have static const within functions too. You get the rodata
> protection and function local scope, best of both worlds?

timer_active is on the stack, so it can't be static const.

Does this really need to be cc'd to such a wide distribution list?
Re: [PATCH v4 1/1] exec: seal system mappings
On Mon, Jan 13, 2025 at 01:26:59PM -0800, Jeff Xu wrote:
> This patch is intended for ChromeOS and Android and is
> feature-complete from their perspective.

"I have everything I need from the Google point of view, so I will push
this feature into upstream".  No, thanks.
[PATCH 0/1] Put seq_buf on a diet
Prompted by the recent mails on ksummit, let's actually try to make this
work this time.  We need a container for manipulating strings easily,
and seq_buf is the closest thing we have to it.  The only problem I have
with it is the readpos that is only useful for the tracing code today.
So move it from the seq_buf to the tracing code.

We should go further with this patch series, including using seq_buf
within vsprintf, but if we can't get over this hurdle first, I'm not
going to waste my time on this again.

Matthew Wilcox (Oracle) (1):
  trace: Move readpos from seq_buf to trace_seq

 include/linux/seq_buf.h   |  5 +
 include/linux/trace_seq.h |  2 ++
 kernel/trace/trace.c      | 10 +-
 kernel/trace/trace_seq.c  |  6 +-
 lib/seq_buf.c             | 13 +
 5 files changed, 18 insertions(+), 18 deletions(-)

-- 
2.40.1
[PATCH 1/1] trace: Move readpos from seq_buf to trace_seq
To make seq_buf more lightweight as a string buf, move the readpos
member from seq_buf to its container, trace_seq.  That puts the
responsibility of maintaining the readpos entirely in the tracing code.
If some future users want to package up the readpos with a seq_buf, we
can define a new struct then.

Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/seq_buf.h   |  5 +
 include/linux/trace_seq.h |  2 ++
 kernel/trace/trace.c      | 10 +-
 kernel/trace/trace_seq.c  |  6 +-
 lib/seq_buf.c             | 13 +
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/linux/seq_buf.h b/include/linux/seq_buf.h
index 515d7fcb9634..a0fb013cebdf 100644
--- a/include/linux/seq_buf.h
+++ b/include/linux/seq_buf.h
@@ -14,19 +14,16 @@
  * @buffer:	pointer to the buffer
  * @size:	size of the buffer
  * @len:	the amount of data inside the buffer
- * @readpos:	The next position to read in the buffer.
  */
 struct seq_buf {
 	char		*buffer;
 	size_t		size;
 	size_t		len;
-	loff_t		readpos;
 };
 
 static inline void seq_buf_clear(struct seq_buf *s)
 {
 	s->len = 0;
-	s->readpos = 0;
 }
 
 static inline void
@@ -143,7 +140,7 @@ extern __printf(2, 0)
 int seq_buf_vprintf(struct seq_buf *s, const char *fmt, va_list args);
 extern int seq_buf_print_seq(struct seq_file *m, struct seq_buf *s);
 extern int seq_buf_to_user(struct seq_buf *s, char __user *ubuf,
-			   int cnt);
+			   size_t start, int cnt);
 extern int seq_buf_puts(struct seq_buf *s, const char *str);
 extern int seq_buf_putc(struct seq_buf *s, unsigned char c);
 extern int seq_buf_putmem(struct seq_buf *s, const void *mem, unsigned int len);
diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
index 6be92bf559fe..3691e0e76a1a 100644
--- a/include/linux/trace_seq.h
+++ b/include/linux/trace_seq.h
@@ -14,6 +14,7 @@
 struct trace_seq {
 	char			buffer[PAGE_SIZE];
 	struct seq_buf		seq;
+	size_t			readpos;
 	int			full;
 };
 
@@ -22,6 +23,7 @@
 trace_seq_init(struct trace_seq *s)
 {
 	seq_buf_init(&s->seq, s->buffer, PAGE_SIZE);
 	s->full = 0;
+	s->readpos = 0;
 }
 
 /**
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index abaaf516fcae..217cabd09c3e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1730,15 +1730,15 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt)
 {
 	int len;
 
-	if (trace_seq_used(s) <= s->seq.readpos)
+	if (trace_seq_used(s) <= s->readpos)
 		return -EBUSY;
 
-	len = trace_seq_used(s) - s->seq.readpos;
+	len = trace_seq_used(s) - s->readpos;
 	if (cnt > len)
 		cnt = len;
-	memcpy(buf, s->buffer + s->seq.readpos, cnt);
+	memcpy(buf, s->buffer + s->readpos, cnt);
 
-	s->seq.readpos += cnt;
+	s->readpos += cnt;
 	return cnt;
 }
 
@@ -7006,7 +7006,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf,
 	/* Now copy what we have to the user */
 	sret = trace_seq_to_user(&iter->seq, ubuf, cnt);
-	if (iter->seq.seq.readpos >= trace_seq_used(&iter->seq))
+	if (iter->seq.readpos >= trace_seq_used(&iter->seq))
 		trace_seq_init(&iter->seq);
 
 	/*
diff --git a/kernel/trace/trace_seq.c b/kernel/trace/trace_seq.c
index bac06ee3b98b..7be97229ddf8 100644
--- a/kernel/trace/trace_seq.c
+++ b/kernel/trace/trace_seq.c
@@ -370,8 +370,12 @@ EXPORT_SYMBOL_GPL(trace_seq_path);
  */
 int trace_seq_to_user(struct trace_seq *s, char __user *ubuf, int cnt)
 {
+	int ret;
 	__trace_seq_init(s);
-	return seq_buf_to_user(&s->seq, ubuf, cnt);
+	ret = seq_buf_to_user(&s->seq, ubuf, s->readpos, cnt);
+	if (ret > 0)
+		s->readpos += ret;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(trace_seq_to_user);
 
diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index 45c450f423fa..0d218f3835ae 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -340,7 +340,7 @@ int seq_buf_path(struct seq_buf *s, const struct path *path, const char *esc)
  *
  * Returns -EFAULT if the copy to userspace fails.
  */
-int seq_buf_to_user(struct seq_buf *s, char __user *ubuf, int cnt)
+int seq_buf_to_user(struct seq_buf *s, char __user *ubuf, size_t start, int cnt)
 {
 	int len;
 	int ret;
@@ -350,20 +350,17 @@ int seq_buf_to_user(struct seq_buf *s, char __user *ubuf, int cnt)
 	len = seq_buf_used(s);
 
-	if (len <= s->readpos)
+	if (len <= start)
 		return -EBUSY;
 
-	len -= s->readpos;
+	len -= start;
 	if (cnt > len)
 		cnt = len;
-
[PATCH v2 0/1] Put seq_buf on a diet
Prompted by the recent mails on ksummit, let's actually try to make this
work this time. We need a container for manipulating strings easily, and
seq_buf is the closest thing we have to it. The only problem I have with
it is the readpos that is only useful for the tracing code today. So
move it from the seq_buf to the tracing code.

We should go further with this patch series, including using seq_buf
within vsprintf, but if we can't get over this hurdle first, I'm not
going to waste my time on this again.

v2:
 - Add linux-trace-ker...@vger.kernel.org
 - Fix kernel-doc

Matthew Wilcox (Oracle) (1):
  trace: Move readpos from seq_buf to trace_seq

 include/linux/seq_buf.h   |  5 +
 include/linux/trace_seq.h |  2 ++
 kernel/trace/trace.c      | 10 +-
 kernel/trace/trace_seq.c  |  6 +-
 lib/seq_buf.c             | 22 ++
 5 files changed, 23 insertions(+), 22 deletions(-)

-- 
2.40.1
[PATCH v2 1/1] trace: Move readpos from seq_buf to trace_seq
To make seq_buf more lightweight as a string buf, move the readpos
member from seq_buf to its container, trace_seq. That puts the
responsibility of maintaining the readpos entirely in the tracing
code. If some future users want to package up the readpos with a
seq_buf, we can define a new struct then.

Signed-off-by: Matthew Wilcox (Oracle)
---
 include/linux/seq_buf.h   |  5 +
 include/linux/trace_seq.h |  2 ++
 kernel/trace/trace.c      | 10 +-
 kernel/trace/trace_seq.c  |  6 +-
 lib/seq_buf.c             | 22 ++
 5 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/include/linux/seq_buf.h b/include/linux/seq_buf.h
index 515d7fcb9634..a0fb013cebdf 100644
--- a/include/linux/seq_buf.h
+++ b/include/linux/seq_buf.h
@@ -14,19 +14,16 @@
  * @buffer:	pointer to the buffer
  * @size:	size of the buffer
  * @len:	the amount of data inside the buffer
- * @readpos:	The next position to read in the buffer.
  */
 struct seq_buf {
 	char			*buffer;
 	size_t			size;
 	size_t			len;
-	loff_t			readpos;
 };
 
 static inline void seq_buf_clear(struct seq_buf *s)
 {
 	s->len = 0;
-	s->readpos = 0;
 }
 
 static inline void
@@ -143,7 +140,7 @@ extern __printf(2, 0)
 int seq_buf_vprintf(struct seq_buf *s, const char *fmt, va_list args);
 extern int seq_buf_print_seq(struct seq_file *m, struct seq_buf *s);
 extern int seq_buf_to_user(struct seq_buf *s, char __user *ubuf,
-			   int cnt);
+			   size_t start, int cnt);
 extern int seq_buf_puts(struct seq_buf *s, const char *str);
 extern int seq_buf_putc(struct seq_buf *s, unsigned char c);
 extern int seq_buf_putmem(struct seq_buf *s, const void *mem, unsigned int len);
diff --git a/include/linux/trace_seq.h b/include/linux/trace_seq.h
index 6be92bf559fe..3691e0e76a1a 100644
--- a/include/linux/trace_seq.h
+++ b/include/linux/trace_seq.h
@@ -14,6 +14,7 @@
 struct trace_seq {
 	char			buffer[PAGE_SIZE];
 	struct seq_buf		seq;
+	size_t			readpos;
 	int			full;
 };
 
@@ -22,6 +23,7 @@ trace_seq_init(struct trace_seq *s)
 {
 	seq_buf_init(&s->seq, s->buffer, PAGE_SIZE);
 	s->full = 0;
+	s->readpos = 0;
 }
 
 /**
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index abaaf516fcae..217cabd09c3e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1730,15 +1730,15 @@ static ssize_t trace_seq_to_buffer(struct trace_seq *s, void *buf, size_t cnt)
 {
 	int len;
 
-	if (trace_seq_used(s) <= s->seq.readpos)
+	if (trace_seq_used(s) <= s->readpos)
 		return -EBUSY;
 
-	len = trace_seq_used(s) - s->seq.readpos;
+	len = trace_seq_used(s) - s->readpos;
 	if (cnt > len)
 		cnt = len;
-	memcpy(buf, s->buffer + s->seq.readpos, cnt);
+	memcpy(buf, s->buffer + s->readpos, cnt);
 
-	s->seq.readpos += cnt;
+	s->readpos += cnt;
 	return cnt;
 }
 
@@ -7006,7 +7006,7 @@ tracing_read_pipe(struct file *filp, char __user *ubuf,
 
 	/* Now copy what we have to the user */
 	sret = trace_seq_to_user(&iter->seq, ubuf, cnt);
-	if (iter->seq.seq.readpos >= trace_seq_used(&iter->seq))
+	if (iter->seq.readpos >= trace_seq_used(&iter->seq))
 		trace_seq_init(&iter->seq);
 
 	/*
diff --git a/kernel/trace/trace_seq.c b/kernel/trace/trace_seq.c
index bac06ee3b98b..7be97229ddf8 100644
--- a/kernel/trace/trace_seq.c
+++ b/kernel/trace/trace_seq.c
@@ -370,8 +370,12 @@ EXPORT_SYMBOL_GPL(trace_seq_path);
  */
 int trace_seq_to_user(struct trace_seq *s, char __user *ubuf, int cnt)
 {
+	int ret;
 	__trace_seq_init(s);
-	return seq_buf_to_user(&s->seq, ubuf, cnt);
+	ret = seq_buf_to_user(&s->seq, ubuf, s->readpos, cnt);
+	if (ret > 0)
+		s->readpos += ret;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(trace_seq_to_user);
diff --git a/lib/seq_buf.c b/lib/seq_buf.c
index 45c450f423fa..b7477aefff53 100644
--- a/lib/seq_buf.c
+++ b/lib/seq_buf.c
@@ -324,23 +324,24 @@ int seq_buf_path(struct seq_buf *s, const struct path *path, const char *esc)
  * seq_buf_to_user - copy the sequence buffer to user space
  * @s: seq_buf descriptor
  * @ubuf: The userspace memory location to copy to
+ * @start: The first byte in the buffer to copy
  * @cnt: The amount to copy
  *
  * Copies the sequence buffer into the userspace memory pointed to
- * by @ubuf. It starts from the last read position (@s->readpos)
- * and writes up to @cnt characters or till it reaches the end of
- * the content in the buffer (@s->len), which ever comes first.
+ * by @ubuf. It starts from @start and writes up to @cnt characters
+ * or unt
[PATCH] powerpc: Remove initialisation of readpos
While powerpc doesn't use the seq_buf readpos, it did explicitly
initialise it for no good reason.

Fixes: d0ed46b60396 ("tracing: Move readpos from seq_buf to trace_seq")
Signed-off-by: Matthew Wilcox (Oracle)
---
 arch/powerpc/kernel/setup-common.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index 2f1026fba00d..34975532e44c 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -601,7 +601,6 @@ struct seq_buf ppc_hw_desc __initdata = {
 	.buffer = ppc_hw_desc_buf,
 	.size = sizeof(ppc_hw_desc_buf),
 	.len = 0,
-	.readpos = 0,
 };
 
 static __init void probe_machine(void)
-- 
2.40.1