Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

2024-04-09 Thread Marco Elver
On Mon, 8 Apr 2024 at 20:24, Alexei Starovoitov
 wrote:
>
> On Mon, Apr 8, 2024 at 2:30 AM Marco Elver  wrote:
> >
> > On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko  
> > wrote:
> > >
> > > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver  wrote:
> > > >
> > > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > >  wrote:
> > [...]
> > > > > and the tasks can use mmaped array shared across all or unique to each
> > > > > process.
> > > > > And both bpf and user space can read/write them with a single 
> > > > > instruction.
> > > >
> > > > That's BPF_F_MMAPABLE, right?
> > > >
> > > > That does not work because the mmapped region is global. Our 
> > > > requirements are:
>
> It sounds not like "requirements", but a description of the proposed
> solution.

These requirements can also be implemented differently, e.g. with the
proposed task-local storage maps if done right (the devil appears to
be in the implementation details - details that are currently beyond
my knowledge of the BPF subsystem internals). They really are the bare
minimum requirements for the use case. The proposed solution probably
happens to be the simplest one, and mapping the requirements onto it
is relatively straightforward.

> Pls share the actual use case.
> This "tracing prog" sounds more like a ghost scheduler that
> wants to interact with known user processes.

It's tcmalloc - we have a BPF program providing a simpler variant of the
"thread re-scheduling" notifications discussed here:
https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3...@mail.gmail.com/

Various synchronization algorithms can be optimized if they know about
scheduling events. This has been proposed in [1] to implement an
adaptive mutex. However, we think there are many more possibilities,
though it really depends on the kind of scheduler state being exposed
("thread on CPU" as in [1], or more detail, such as which thread was
switched in, which was switched out, where the thread was migrated to,
etc.). Committing to such narrow use-case ABIs and extending the main
kernel with more and more of this information does not scale. Instead,
BPF easily allows exposing this information exactly where it's
required.

[1] 
https://lore.kernel.org/all/20230529191416.53955-1-mathieu.desnoy...@efficios.com/
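
To make this concrete, below is a rough sketch of what such a tracing
program could look like. It is illustrative only: it assumes the new
helper keeps bpf_probe_write_user()'s (dst, src, size) signature, and
that a loader program stashed the user-space address of each process's
prctl()-registered region in task-local storage; neither the map nor
the struct layout is taken from the series' selftests.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  struct sched_hint {
          __u64 nr_switches;      /* updated by the BPF program */
          __u64 user_state;       /* maintained by user space (e.g. tcmalloc) */
  };

  struct {
          __uint(type, BPF_MAP_TYPE_TASK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, __u64);   /* user-space address of the registered region */
  } region_addr SEC(".maps");

  SEC("tp_btf/sched_switch")
  int BPF_PROG(on_sched_switch, bool preempt, struct task_struct *prev,
               struct task_struct *next)
  {
          struct sched_hint hint;
          __u64 *addr;

          /* Address is assumed to have been stored here at prctl(REGISTER) time. */
          addr = bpf_task_storage_get(&region_addr, next, NULL, 0);
          if (!addr || !*addr)
                  return 0;

          if (bpf_probe_read_user(&hint, sizeof(hint), (void *)*addr))
                  return 0;
          hint.nr_switches++;
          /* Proposed helper: only succeeds if `next` registered this region. */
          bpf_probe_write_user_registered((void *)*addr, &hint, sizeof(hint));
          return 0;
  }

  char LICENSE[] SEC("license") = "GPL";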

> > > > 1. Single tracing BPF program.
> > > >
> > > > 2. Per-process (per VM) memory region (here it's per-thread, but each
> > > > thread just registers the same process-wide region).  No sharing
> > > > between processes.
> > > >
> > > > 3. From #2 it follows: exec unregisters the registered memory region;
> > > > fork gets a cloned region.
> > > >
> > > > 4. Unprivileged processes can do prctl(REGISTER). Some of them might
> > > > not be able to use the bpf syscall.
> > > >
> > > > The reason for #2 is that each user space process also writes to the
> > > > memory region (read by the BPF program to make updates depending on
> > > > what state it finds), and having shared state between processes
> > > > doesn't work here.
> > > >
> > > > Is there any reasonable BPF facility that can do this today? (If
> > > > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a
> > > > happy camper.)
> > >
> > > You could simulate something like this with multi-element ARRAY +
> > > BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of
> > > processes, so it's not an exact fit.
> >
> > Right, for production use this is infeasible.
>
> Last I heard, ghost agent and a few important tasks can mmap bpf array
> and share it with bpf prog.
> So quite feasible.

Nothing related to ghost.

It's tcmalloc, i.e. every process running everywhere.

> > > But what seems to be much closer is using BPF task-local storage, if
> > > we support mmap()'ing its memory into user-space. We've had previous
> > > discussions on how to achieve this (the simplest being that
> > > mmap(task_local_map_fd, ...) maps current thread's part of BPF task
> > > local storage). You won't get automatic cloning (you'd have to do that
> > > from the BPF program on fork/exec tracepoint, for example), and within
> > > the process you'd probably want to have just one thread (main?) to
> > > mmap() initially and just share the pointer across all relevant
> > > threads.
> >
> > In the way you imagine it, would that allow all threads sharing the
> > same memory, despite it being task-local? Presumably each task's local
> > storage would be mapped to just point to the same memory?
> >
> > > But this is a more generic building block, IMO. This relying
> > > on BPF map also means pinning is possible and all the other BPF map
> > > abstraction benefits.
> >
> > Deployment-wise it will make things harder because unprivileged
> > processes still have to somehow get the map's shared fd to
> > mmap() it. Not unsolvable, and in general what you describe looks
> > interesting, but I currently can't see how it will be simpler.
>
bpf map can be pinned into bpffs for any unpriv process
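
For reference, that flow boils down to something like the following on
the unprivileged process's side (the pin path and mapping size are
illustrative assumptions; the array map must have been created with
BPF_F_MMAPABLE and pinned into bpffs by a privileged loader):

  #include <bpf/bpf.h>
  #include <sys/mman.h>
  #include <stdio.h>

  int main(void)
  {
          /* Hypothetical pin path created by a privileged loader. */
          int fd = bpf_obj_get("/sys/fs/bpf/tcmalloc_hints");
          if (fd < 0) {
                  perror("bpf_obj_get");
                  return 1;
          }

          /* Map one page of the BPF_F_MMAPABLE array into this process. */
          void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (mem == MAP_FAILED) {
                  perror("mmap");
                  return 1;
          }

          /* Both this process and the BPF program now read/write `mem`. */
          return 0;
  }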

[PATCH v3 1/2] proc: restrict /proc/pid/mem access via param knobs

2024-04-09 Thread Adrian Ratiu
Prior to v2.6.39 write access to /proc/pid/mem was restricted,
after which it got allowed in commit 198214a7ee50 ("proc: enable
writing to /proc/pid/mem"). Famous last words from that patch:
"no longer a security hazard". :)

Afterwards, exploits started causing drama like [1]. Exploits
using /proc/*/mem can be rather sophisticated, like [2], which
installed an arbitrary payload from noexec storage into a running
process and then exec'd it; the payload itself could include an ELF
loader to run arbitrary code off noexec storage.

One of the well-known problems with /proc/*/mem writes is that they
ignore page permissions via FOLL_FORCE, as opposed to writes via
process_vm_writev, which respect page permissions. These writes can
also be used to bypass mode bits.

To harden against these types of attacks, distributions might want
to restrict /proc/pid/mem accesses, either entirely or partially,
e.g. to restrict FOLL_FORCE usage.

Known valid use-cases which still need these accesses are:

* Debuggers which also have ptrace permissions, so they can access
memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB
are designed to write /proc/pid/mem for basic functionality.

* Container supervisors using the seccomp notifier to intercept
syscalls and rewrite memory of calling processes by passing
around /proc/pid/mem file descriptors.

There might be more; that's why these params default to disabled.
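
As a rough illustration of the second use case, a supervisor that has
intercepted a syscall via the seccomp notifier might patch the caller's
memory like this (paths, addresses and the notification plumbing are
simplified assumptions; with restrict_proc_mem_write=ptracer the open()
below would be refused unless the supervisor also ptraces the target):

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/types.h>
  #include <unistd.h>

  static int patch_target(pid_t pid, off_t addr, const void *buf, size_t len)
  {
          char path[64];
          int fd;
          ssize_t ret;

          snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);

          fd = open(path, O_RDWR);
          if (fd < 0)
                  return -1;

          /* The file offset of /proc/<pid>/mem is the target's virtual address. */
          ret = pwrite(fd, buf, len, addr);
          close(fd);
          return ret == (ssize_t)len ? 0 : -1;
  }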

Regarding other mechanisms which can block these accesses:

* seccomp filters can be used to block mmap/mprotect calls with W|X
perms, but they often can't block open calls as daemons want to
read/write their runtime state and seccomp filters cannot check
file paths, so plain write calls can't be easily blocked.

* Since the mem file is part of the dynamic /proc/pid/ space, we
can't run chmod once at boot to restrict it (and trying to react
to every process and run chmod doesn't scale, and the kernel no
longer allows chmod on any of these paths).

* SELinux could be used with a rule to cover all /proc/*/mem files,
but even then having multiple ways to deny an attack is useful in
case one layer fails.

Thus we introduce three kernel parameters to restrict /proc/*/mem
access: read, write and foll_force. All three can be independently
set to the following values:

all => restrict all access unconditionally.
ptracer => restrict all access except for ptracer processes.

If left unset, the existing behaviour is preserved, i.e. access
is governed by basic file permissions.

Examples which can be passed by bootloaders:

restrict_proc_mem_foll_force=all
restrict_proc_mem_write=ptracer
restrict_proc_mem_read=ptracer

Each distribution needs to decide what restrictions to apply,
depending on its use-cases. Embedded systems might want to do
more, while general-purpose distros might want a more relaxed
policy, because e.g. foll_force=all and write=all both break
GDB, so it might be a bit excessive.

Based on an initial patch by Mike Frysinger .

Link: https://lwn.net/Articles/476947/ [1]
Link: https://issues.chromium.org/issues/40089045 [2]
Cc: Guenter Roeck 
Cc: Doug Anderson 
Cc: Kees Cook 
Cc: Jann Horn 
Cc: Andrew Morton 
Cc: Randy Dunlap 
Cc: Christian Brauner 
Co-developed-by: Mike Frysinger 
Signed-off-by: Mike Frysinger 
Signed-off-by: Adrian Ratiu 
---
 .../admin-guide/kernel-parameters.txt |  27 +
 fs/proc/base.c| 103 +-
 include/linux/jump_label.h|   5 +
 3 files changed, 133 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 6e62b8cb19c8d..d7f7db41369c7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5665,6 +5665,33 @@
reset_devices   [KNL] Force drivers to reset the underlying device
during initialization.
 
+   restrict_proc_mem_read= [KNL]
+   Format: {all | ptracer}
+   Allows restricting read access to /proc/*/mem files.
+   Depending on restriction level, open for reads returns -EACCES.
+   Can be one of:
+   - 'all' restricts all access unconditionally.
+   - 'ptracer' allows access only for ptracer processes.
+   If not specified, then basic file permissions continue to apply.
+
+   restrict_proc_mem_write= [KNL]
+   Format: {all | ptracer}
+   Allows restricting write access to /proc/*/mem files.
+   Depending on restriction level, open for writes returns -EACCES.
+   Can be one of:
+   - 'all' restricts all access unconditionally.
+   - 'ptracer' allows access only for ptracer processes.
+   If not specified, then basic file permissions continue to apply.

[PATCH v3 2/2] proc: add Kconfigs to restrict /proc/pid/mem access

2024-04-09 Thread Adrian Ratiu
Some systems might have difficulty changing their bootloaders
to enable the newly added restrict_proc_mem* params, e.g.
remote embedded devices doing OTA updates, so this provides a set of
Kconfigs to set /proc/pid/mem restrictions at build time.

The boot params take precedence over the Kconfig values. This
can be reversed, but doing it this way I think makes sense.

Another idea is to have a global bool Kconfig which can enable
or disable this mechanism in its entirety; however, it does not
seem necessary since all three knobs default to off, the branch
logic overhead is rather minimal, and I assume most systems
will want to restrict at least the use of FOLL_FORCE.
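
A build-time policy roughly equivalent to the boot parameter examples
from patch 1/2 could then look like this in a .config fragment (the
values are the strings the code checks for; which level to pick is of
course a per-distro decision):

  CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT="all"
  CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT="ptracer"
  CONFIG_SECURITY_PROC_MEM_READ_RESTRICT="ptracer"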

Cc: Guenter Roeck 
Cc: Doug Anderson 
Cc: Kees Cook 
Cc: Jann Horn 
Cc: Andrew Morton 
Cc: Randy Dunlap 
Cc: Christian Brauner 
Signed-off-by: Adrian Ratiu 
---
 fs/proc/base.c   | 33 +
 security/Kconfig | 42 ++
 2 files changed, 75 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index c733836c42a65..e8ee848fc4a98 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -889,6 +889,17 @@ static int __mem_open_check_access_restriction(struct file 
*file)
!__mem_open_current_is_ptracer(file))
return -EACCES;
 
+#ifdef CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT
+   /* Deny if writes are unconditionally disabled via Kconfig */
+   if (!strncmp(CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT, "all", 3))
+   return -EACCES;
+
+   /* Deny if writes are allowed only for ptracers via Kconfig */
+   if (!strncmp(CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT, 
"ptracer", 7) &&
+   !__mem_open_current_is_ptracer(file))
+   return -EACCES;
+#endif
+
} else if (file->f_mode & FMODE_READ) {
/* Deny if reads are unconditionally disabled via param */
if (static_branch_unlikely(&restrict_proc_mem[2]))
@@ -898,6 +909,17 @@ static int __mem_open_check_access_restriction(struct file 
*file)
if (static_branch_unlikely(&restrict_proc_mem[3]) &&
!__mem_open_current_is_ptracer(file))
return -EACCES;
+
+#ifdef CONFIG_SECURITY_PROC_MEM_READ_RESTRICT
+   /* Deny if reads are unconditionally disabled via Kconfig */
+   if (!strncmp(CONFIG_SECURITY_PROC_MEM_READ_RESTRICT, "all", 3))
+   return -EACCES;
+
+   /* Deny if reads are allowed only for ptracers via Kconfig */
+   if (!strncmp(CONFIG_SECURITY_PROC_MEM_READ_RESTRICT, "ptracer", 
7) &&
+   !__mem_open_current_is_ptracer(file))
+   return -EACCES;
+#endif
}
 
return 0;
@@ -930,6 +952,17 @@ static unsigned int __mem_rw_get_foll_force_flag(struct 
file *file)
!__mem_open_current_is_ptracer(file))
return 0;
 
+#ifdef CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT
+   /* Deny if FOLL_FORCE is disabled via Kconfig */
+   if (!strncmp(CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT, "all", 3))
+   return 0;
+
+   /* Deny if FOLL_FORCE is only allowed for ptracers via Kconfig */
+   if (!strncmp(CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT, "ptracer", 
7) &&
+   !__mem_open_current_is_ptracer(file))
+   return 0;
+#endif
+
return FOLL_FORCE;
 }
 
diff --git a/security/Kconfig b/security/Kconfig
index 412e76f1575d0..31a588cedec8d 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -19,6 +19,48 @@ config SECURITY_DMESG_RESTRICT
 
  If you are unsure how to answer this question, answer N.
 
+config SECURITY_PROC_MEM_READ_RESTRICT
+   string "Restrict read access to /proc/*/mem files"
+   depends on PROC_FS
+   default "none"
+   help
+ This option allows specifying a restriction level for read access
+ to /proc/*/mem files. Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+
+ This can also be set at boot with the "restrict_proc_mem_read=" param.
+
+ If unsure leave empty to continue using basic file permissions.
+
+config SECURITY_PROC_MEM_WRITE_RESTRICT
+   string "Restrict write access to /proc/*/mem files"
+   depends on PROC_FS
+   default "none"
+   help
+ This option allows specifying a restriction level for write access
+ to /proc/*/mem files. Can be one of:
+ - 'all' restricts all access unconditionally.
+ - 'ptracer' allows access only for ptracer processes.
+
+ This can also be set at boot with the "restrict_proc_mem_write=" 
param.
+
+ If unsure leave empty to continue using basic file permissions.
+
+config SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT
+   string "Restrict use of FOLL_FORCE for /proc/*/mem access"
+

[PATCH 0/2] TPM documentation updates

2024-04-09 Thread Jarkko Sakkinen
Re-send of the TPM documentation updates. Note that in this patch set the
patch versions bear no relation to the patch set version, as they
were previously managed as independent patches.

Cc: Alexander Steffen 
CC: Daniel P. Smith 
Cc: James Bottomley 
Cc: Jason Gunthorpe 
Cc: Jonathan Corbet 
Cc: Lino Sanfilippo 
Cc: Mimi Zohar 
Cc: Peter Huewe 
Cc: Randy Dunlap 
Cc: linux-doc@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-integr...@vger.kernel.org

v1:
- Collect the latest versions of patches sent earlier.

Jarkko Sakkinen (2):
  MAINTAINERS: Update URL's for KEYS/KEYRINGS_INTEGRITY and TPM DEVICE
DRIVER
  Documentation: tpm_tis

 Documentation/security/tpm/index.rst   |  1 +
 Documentation/security/tpm/tpm_tis.rst | 46 ++
 MAINTAINERS|  3 +-
 3 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/security/tpm/tpm_tis.rst

-- 
2.44.0




[PATCH 1/2] MAINTAINERS: Update URL's for KEYS/KEYRINGS_INTEGRITY and TPM DEVICE DRIVER

2024-04-09 Thread Jarkko Sakkinen
Add the TPM driver test suite URL to the MAINTAINERS file and move the wiki
URL to a more appropriate location.

Link: https://gitlab.com/jarkkojs/linux-tpmdd-test
Link: https://kernsec.org/wiki/index.php/Linux_Kernel_Integrity
Acked-by: Paul Menzel 
Signed-off-by: Jarkko Sakkinen 
---
v2:
- Typo fix: 
https://lore.kernel.org/all/eaa5107ac4f982b6fd6e80b522643a591e719dc9.ca...@hansenpartnership.com/
- Typo fix: 
https://lore.kernel.org/all/1ab10318-5e3d-417c-9984-7b17f4e38...@molgen.mpg.de/
---
 MAINTAINERS | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 1339918df52a..01dc4940fc06 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12015,6 +12015,7 @@ M:  Mimi Zohar 
 L: linux-integr...@vger.kernel.org
 L: keyri...@vger.kernel.org
 S: Supported
+W: https://kernsec.org/wiki/index.php/Linux_Kernel_Integrity
 F: security/integrity/platform_certs
 
 KFENCE
@@ -22344,7 +22345,7 @@ M:  Jarkko Sakkinen 
 R: Jason Gunthorpe 
 L: linux-integr...@vger.kernel.org
 S: Maintained
-W: https://kernsec.org/wiki/index.php/Linux_Kernel_Integrity
+W: https://gitlab.com/jarkkojs/linux-tpmdd-test
 Q: https://patchwork.kernel.org/project/linux-integrity/list/
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git
 F: drivers/char/tpm/
-- 
2.44.0




[PATCH 2/2] Documentation: tpm_tis

2024-04-09 Thread Jarkko Sakkinen
Based on recent discussions on LKML, provide preliminary documentation for the
tpm_tis_core dependent drivers. It includes only the bare essentials but can be
extended later on a case-by-case basis. This way some people may even want to
read it later on.

Signed-off-by: Jarkko Sakkinen 
---
v4:
- Extended the text to address some of Stefan's concerns with v2.
- Had to unfortunately remove Randy's reviewed-by because of this, given
  the amount of text added.
v3:
- Fixed incorrect buffer size:
  
https://lore.kernel.org/linux-integrity/d957dbd3-4975-48d7-abc5-1a01c0959...@linux.ibm.com/
v2:
- Fixed errors reported by Randy:
  
https://lore.kernel.org/all/aed28265-d677-491a-a045-24b351854...@infradead.org/
- Improved the text a bit to have a better presentation.
---
 Documentation/security/tpm/index.rst   |  1 +
 Documentation/security/tpm/tpm_tis.rst | 46 ++
 2 files changed, 47 insertions(+)
 create mode 100644 Documentation/security/tpm/tpm_tis.rst

diff --git a/Documentation/security/tpm/index.rst 
b/Documentation/security/tpm/index.rst
index fc40e9f23c85..f27a17f60a96 100644
--- a/Documentation/security/tpm/index.rst
+++ b/Documentation/security/tpm/index.rst
@@ -5,6 +5,7 @@ Trusted Platform Module documentation
 .. toctree::
 
tpm_event_log
+   tpm_tis
tpm_vtpm_proxy
xen-tpmfront
tpm_ftpm_tee
diff --git a/Documentation/security/tpm/tpm_tis.rst 
b/Documentation/security/tpm/tpm_tis.rst
new file mode 100644
index ..b448ea3db71d
--- /dev/null
+++ b/Documentation/security/tpm/tpm_tis.rst
@@ -0,0 +1,46 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+TPM FIFO interface driver
+=========================
+
+TCG PTP Specification defines two interface types: FIFO and CRB. The former is
+based on sequenced read and write operations,  and the latter is based on a
+buffer containing the full command or response.
+
+FIFO (First-In-First-Out) interface is used by the tpm_tis_core dependent
+drivers. Originally Linux had only a driver called tpm_tis, which covered
+memory mapped (aka MMIO) interface but it was later on extended to cover other
+physical interfaces supported by the TCG standard.
+
+For legacy compliance the original MMIO driver is called tpm_tis and the
+framework for FIFO drivers is named as tpm_tis_core. The postfix "tis" in
+tpm_tis comes from the TPM Interface Specification, which is the hardware
+interface specification for TPM 1.x chips.
+
+Communication is based on a 20 KiB buffer shared by the TPM chip through a
+hardware bus or memory map, depending on the physical wiring. The buffer is
+further split into five equal-size 4 KiB buffers, which provide equivalent
+sets of registers for communication between the CPU and TPM. These
+communication endpoints are called localities in the TCG terminology.
+
+When the kernel wants to send commands to the TPM chip, it first reserves
+locality 0 by setting the requestUse bit in the TPM_ACCESS register. The bit is
+cleared by the chip when the access is granted. Once it completes its
+communication, the kernel writes the TPM_ACCESS.activeLocality bit. This
+informs the chip that the locality has been relinquished.
+
+Pending localities are served in order by the chip in descending order, one at
+a time:
+
+- Locality 0 has the lowest priority.
+- Locality 5 has the highest priority.
+
+Further information on the purpose and meaning of the localities can be found
+in section 3.2 of the TCG PC Client Platform TPM Profile Specification.
+
+References
+==
+
+TCG PC Client Platform TPM Profile (PTP) Specification
+https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/
-- 
2.44.0
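
As an illustration of the locality handshake described in the document
above, here is a minimal bare-metal style sketch of requesting and
relinquishing locality 0 over MMIO. The register offsets and bit names
follow the TCG PTP specification; the fixed MMIO base, the restriction
to locality 0, and the timeout-free polling are simplifications, and
this is of course not how the kernel driver is structured:

  #include <stdint.h>

  #define TPM_MMIO_BASE               0xFED40000UL
  #define TPM_ACCESS(l)               (TPM_MMIO_BASE + ((unsigned long)(l) << 12))
  #define TPM_ACCESS_REQUEST_USE      0x02    /* requestUse */
  #define TPM_ACCESS_ACTIVE_LOCALITY  0x20    /* activeLocality */
  #define TPM_ACCESS_VALID            0x80    /* tpmRegValidSts */

  static volatile uint8_t *tpm_access(int locality)
  {
          return (volatile uint8_t *)TPM_ACCESS(locality);
  }

  /* Request locality 0 and busy-wait until the chip grants it. */
  static void tpm_request_locality0(void)
  {
          *tpm_access(0) = TPM_ACCESS_REQUEST_USE;

          while ((*tpm_access(0) &
                  (TPM_ACCESS_VALID | TPM_ACCESS_ACTIVE_LOCALITY)) !=
                 (TPM_ACCESS_VALID | TPM_ACCESS_ACTIVE_LOCALITY))
                  ;
  }

  /* Writing activeLocality back tells the chip we are done with it. */
  static void tpm_relinquish_locality0(void)
  {
          *tpm_access(0) = TPM_ACCESS_ACTIVE_LOCALITY;
  }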




[PATCH v1 00/18] mm: mapcount for large folios + page_mapcount() cleanups

2024-04-09 Thread David Hildenbrand
This series tracks the mapcount of large folios in a single value, so
it can be read efficiently and atomically, just like the mapcount of
small folios.

folio_mapcount() is then used in a couple more places, most notably to
reduce false negatives in folio_likely_mapped_shared(), and many users of
page_mapcount() are cleaned up (that's maybe why you got CCed on the
full series, sorry sh+xtensa folks! :) ).

The remaining s390x user and one KSM user of page_mapcount() are getting
removed separately on the list right now. I have patches to handle the
other KSM one, the khugepaged one and the kpagecount one; as they are not
as "obvious", I will send them out separately in the future. Once that is
all in place, I'm planning on moving page_mapcount() into
fs/proc/task_mmu.c, the remaining user for the time being (and we can
discuss at LSF/MM details on that :) ).
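
In other words, reading the mapcount of a large folio boils down to a
single atomic read of the new folio->_large_mapcount field instead of a
loop over all subpage mapcounts; this is essentially the accessor added
by patch 04/18 below:

  static inline int folio_large_mapcount(const struct folio *folio)
  {
          VM_WARN_ON_FOLIO(!folio_test_large(folio), folio);
          return atomic_read(&folio->_large_mapcount) + 1;
  }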

I proposed the mapcount for large folios (previously called total
mapcount) originally in part of [1] and I later included it in [2] where
it is a requirement. In the meantime, I changed the patch a bit so I
dropped all RB's. During the discussion of [1], Peter Xu correctly raised
that this additional tracking might affect the performance when
PMD->PTE remapping THPs. In the meantime, I addressed that by batching RMAP
operations during fork(), unmap/zap and when PMD->PTE remapping THPs.

Running some of my micro-benchmarks [3] (fork,munmap,cow-byte,remap) on 1
GiB of memory backed by folios with the same order, I observe the following
on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz tuned for reproducible
results as much as possible:

Standard deviation is mostly < 1%, except for order-9, where it's < 2% for
fork() and munmap().

(1) Small folios are not affected (< 1%) in all 4 microbenchmarks.
(2) Order-4 folios are not affected (< 1%) in all 4 microbenchmarks. A bit
weird compared to the other orders ...
(3) PMD->PTE remapping of order-9 THPs is not affected (< 1%)
(4) COW-byte (COWing a single page by writing a single byte) is not
affected for any order (< 1 %). The page copy_fault overhead dominates
everything.
(5) fork() is mostly not affected (< 1%), except order-2, where we have
a slowdown of ~4%. Already for order-3 folios, we're down to a slowdown
of < 1%.
(6) munmap() sees a slowdown by < 3% for some orders (order-5,
order-6, order-9), but less for others (< 1% for order-4 and order-8,
< 2% for order-2, order-3, order-7).

Especially the fork() and munmap() benchmarks are sensitive to each added
instruction and other system noise, so I suspect some of the change and
observed weirdness (order-4) is due to code layout changes and other
factors, but not really due to the added atomics.

So in the common case where we can batch, the added atomics don't really
make a big difference, especially in light of the improvements for
large folios that we recently gained due to batching. Surprisingly, for
some cases where we cannot batch (e.g., COW), the added atomics don't seem
to matter, because other overhead dominates.

My fork and munmap micro-benchmarks don't cover cases where we cannot
batch-process bigger parts of large folios. As this is not the common case,
I'm not worrying about that right now.

Future work is batching RMAP operations during swapout and folio
migration.

Not CCing everybody recommended by get_maintainers (e.g., cgroups folks
just because of the doc update), to reduce noise. Tested on
x86-64, compile-tested on a bunch of other archs. Will do more testing
in the upcoming days.

[1] https://lore.kernel.org/all/20230809083256.699513-1-da...@redhat.com/
[2] https://lore.kernel.org/all/20231124132626.235350-1-da...@redhat.com/
[3] 
https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads

Cc: Andrew Morton 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Peter Xu 
Cc: Ryan Roberts 
Cc: Yin Fengwei 
Cc: Yang Shi 
Cc: Zi Yan 
Cc: Jonathan Corbet 
Cc: Hugh Dickins 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: John Paul Adrian Glaubitz 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Muchun Song 
Cc: Miaohe Lin 
Cc: Naoya Horiguchi 
Cc: Richard Chang 

David Hildenbrand (18):
  mm: allow for detecting underflows with page_mapcount() again
  mm/rmap: always inline anon/file rmap duplication of a single PTE
  mm/rmap: add fast-path for small folios when
adding/removing/duplicating
  mm: track mapcount of large folios in single value
  mm: improve folio_likely_mapped_shared() using the mapcount of large
folios
  mm: make folio_mapcount() return 0 for small typed folios
  mm/memory: use folio_mapcount() in zap_present_folio_ptes()
  mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check
  mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()
  mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()
  mm/migrate: use folio_likely_mapped_shared() in
add_page_for_migration()
  sh/mm/cache: use folio_mapped() in copy_from_user_page()
  

[PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE

2024-04-09 Thread David Hildenbrand
As we grow the code, the compiler might make stupid decisions and
unnecessarily degrade fork() performance. Let's make sure to always inline
functions that operate on a single PTE so the compiler will always
optimize out the loop and avoid a function call.

This is a preparation for maintaining a total mapcount for large folios.

Signed-off-by: David Hildenbrand 
---
 include/linux/rmap.h | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9bf9324214fc..9549d78928bb 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -347,8 +347,12 @@ static inline void folio_dup_file_rmap_ptes(struct folio 
*folio,
 {
__folio_dup_file_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE);
 }
-#define folio_dup_file_rmap_pte(folio, page) \
-   folio_dup_file_rmap_ptes(folio, page, 1)
+
+static __always_inline void folio_dup_file_rmap_pte(struct folio *folio,
+   struct page *page)
+{
+   __folio_dup_file_rmap(folio, page, 1, RMAP_LEVEL_PTE);
+}
 
 /**
  * folio_dup_file_rmap_pmd - duplicate a PMD mapping of a page range of a folio
@@ -448,8 +452,13 @@ static inline int folio_try_dup_anon_rmap_ptes(struct 
folio *folio,
return __folio_try_dup_anon_rmap(folio, page, nr_pages, src_vma,
 RMAP_LEVEL_PTE);
 }
-#define folio_try_dup_anon_rmap_pte(folio, page, vma) \
-   folio_try_dup_anon_rmap_ptes(folio, page, 1, vma)
+
+static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
+   struct page *page, struct vm_area_struct *src_vma)
+{
+   return __folio_try_dup_anon_rmap(folio, page, 1, src_vma,
+RMAP_LEVEL_PTE);
+}
 
 /**
  * folio_try_dup_anon_rmap_pmd - try duplicating a PMD mapping of a page range
-- 
2.44.0




[PATCH v1 01/18] mm: allow for detecting underflows with page_mapcount() again

2024-04-09 Thread David Hildenbrand
Commit 53277bcf126d ("mm: support page_mapcount() on page_has_type()
pages") made it impossible to detect mapcount underflows by treating
any negative raw mapcount value as a mapcount of 0.

We perform such underflow checks in zap_present_folio_ptes() and
zap_huge_pmd(), which would currently no longer trigger.

Let's check against PAGE_MAPCOUNT_RESERVE instead by using
page_type_has_type(), like page_has_type() would, so we can still catch
some underflows.

Fixes: 53277bcf126d ("mm: support page_mapcount() on page_has_type() pages")
Signed-off-by: David Hildenbrand 
---
 include/linux/mm.h | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef34cf54c14f..0fb8a40f82dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1229,11 +1229,10 @@ static inline void page_mapcount_reset(struct page 
*page)
  */
 static inline int page_mapcount(struct page *page)
 {
-   int mapcount = atomic_read(&page->_mapcount) + 1;
+   int mapcount = atomic_read(&page->_mapcount);
 
/* Handle page_has_type() pages */
-   if (mapcount < 0)
-   mapcount = 0;
+   mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1;
if (unlikely(PageCompound(page)))
mapcount += folio_entire_mapcount(page_folio(page));
 
-- 
2.44.0




[PATCH v1 03/18] mm/rmap: add fast-path for small folios when adding/removing/duplicating

2024-04-09 Thread David Hildenbrand
Let's add a fast-path for small folios to all relevant rmap functions.
Note that only RMAP_LEVEL_PTE applies.

This is a preparation for tracking the mapcount of large folios in a
single value.

Signed-off-by: David Hildenbrand 
---
 include/linux/rmap.h | 13 +
 mm/rmap.c| 26 --
 2 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 9549d78928bb..327f1ca5a487 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -322,6 +322,11 @@ static __always_inline void __folio_dup_file_rmap(struct 
folio *folio,
 
switch (level) {
case RMAP_LEVEL_PTE:
+   if (!folio_test_large(folio)) {
+   atomic_inc(&page->_mapcount);
+   break;
+   }
+
do {
atomic_inc(&page->_mapcount);
} while (page++, --nr_pages > 0);
@@ -405,6 +410,14 @@ static __always_inline int 
__folio_try_dup_anon_rmap(struct folio *folio,
if (PageAnonExclusive(page + i))
return -EBUSY;
}
+
+   if (!folio_test_large(folio)) {
+   if (PageAnonExclusive(page))
+   ClearPageAnonExclusive(page);
+   atomic_inc(&page->_mapcount);
+   break;
+   }
+
do {
if (PageAnonExclusive(page))
ClearPageAnonExclusive(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 56b313aa2ebf..4bde6d60db6c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1172,15 +1172,18 @@ static __always_inline unsigned int 
__folio_add_rmap(struct folio *folio,
 
switch (level) {
case RMAP_LEVEL_PTE:
+   if (!folio_test_large(folio)) {
+   nr = atomic_inc_and_test(&page->_mapcount);
+   break;
+   }
+
do {
first = atomic_inc_and_test(&page->_mapcount);
-   if (first && folio_test_large(folio)) {
+   if (first) {
first = atomic_inc_return_relaxed(mapped);
-   first = (first < ENTIRELY_MAPPED);
+   if (first < ENTIRELY_MAPPED)
+   nr++;
}
-
-   if (first)
-   nr++;
} while (page++, --nr_pages > 0);
break;
case RMAP_LEVEL_PMD:
@@ -1514,15 +1517,18 @@ static __always_inline void __folio_remove_rmap(struct 
folio *folio,
 
switch (level) {
case RMAP_LEVEL_PTE:
+   if (!folio_test_large(folio)) {
+   nr = atomic_add_negative(-1, &page->_mapcount);
+   break;
+   }
+
do {
last = atomic_add_negative(-1, &page->_mapcount);
-   if (last && folio_test_large(folio)) {
+   if (last) {
last = atomic_dec_return_relaxed(mapped);
-   last = (last < ENTIRELY_MAPPED);
+   if (last < ENTIRELY_MAPPED)
+   nr++;
}
-
-   if (last)
-   nr++;
} while (page++, --nr_pages > 0);
break;
case RMAP_LEVEL_PMD:
-- 
2.44.0




[PATCH v1 04/18] mm: track mapcount of large folios in single value

2024-04-09 Thread David Hildenbrand
Let's track the mapcount of large folios in a single value. The mapcount of
a large folio currently corresponds to the sum of the entire mapcount and
all page mapcounts.

This sum is what we actually want to know in folio_mapcount() and it is
also sufficient for implementing folio_mapped().

With PTE-mapped THP becoming more important and more widely used, we want
to avoid looping over all pages of a folio just to obtain the mapcount
of large folios. The comment "In the common case, avoid the loop when no
pages mapped by PTE" in folio_total_mapcount() does no longer hold for
mTHP that are always mapped by PTE.

Further, we are planning on using folio_mapcount() more
frequently, and might even want to remove page mapcounts for large
folios in some kernel configs. Therefore, allow for reading the mapcount of
large folios efficiently and atomically without looping over any pages.

Maintain the mapcount also for hugetlb pages for simplicity. Use the new
mapcount to implement folio_mapcount() and folio_mapped(). Make
page_mapped() simply call folio_mapped(). We can now get rid of
folio_large_is_mapped().

_nr_pages_mapped is now only used in rmap code and for debugging
purposes. Keep folio_nr_pages_mapped() around, but document that its use
should be limited to rmap internals and debugging purposes.

This change implies one additional atomic add/sub whenever
mapping/unmapping (parts of) a large folio.

As we now batch RMAP operations for PTE-mapped THP during fork(),
during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust
the large mapcount for a PTE batch only once, the added overhead in the
common case is small. Only when unmapping individual pages of a large folio
(e.g., during COW), the overhead might be bigger in comparison, but it's
essentially one additional atomic operation.

Note that before the new mapcount would overflow, already our refcount
would overflow: each mapping requires a folio reference. Extend the
focumentation of folio_mapcount().

Signed-off-by: David Hildenbrand 
---
 Documentation/mm/transhuge.rst | 12 +-
 include/linux/mm.h | 44 --
 include/linux/mm_types.h   |  5 ++--
 include/linux/rmap.h   | 10 
 mm/debug.c |  3 ++-
 mm/hugetlb.c   |  4 ++--
 mm/internal.h  |  3 +++
 mm/khugepaged.c|  2 +-
 mm/page_alloc.c|  4 
 mm/rmap.c  | 34 +-
 10 files changed, 62 insertions(+), 59 deletions(-)

diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst
index 93c9239b9ebe..1ba0ad63246c 100644
--- a/Documentation/mm/transhuge.rst
+++ b/Documentation/mm/transhuge.rst
@@ -116,14 +116,14 @@ pages:
 succeeds on tail pages.
 
   - map/unmap of a PMD entry for the whole THP increment/decrement
-folio->_entire_mapcount and also increment/decrement
-folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount
-goes from -1 to 0 or 0 to -1.
+folio->_entire_mapcount, increment/decrement folio->_large_mapcount
+and also increment/decrement folio->_nr_pages_mapped by ENTIRELY_MAPPED
+when _entire_mapcount goes from -1 to 0 or 0 to -1.
 
   - map/unmap of individual pages with PTE entry increment/decrement
-page->_mapcount and also increment/decrement folio->_nr_pages_mapped
-when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts
-the number of pages mapped by PTE.
+page->_mapcount, increment/decrement folio->_large_mapcount and also
+increment/decrement folio->_nr_pages_mapped when page->_mapcount goes
+from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE.
 
 split_huge_page internally has to distribute the refcounts in the head
 page to the tail pages before clearing all PG_head/tail bits from the page
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0fb8a40f82dd..1862a216af15 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1239,16 +1239,26 @@ static inline int page_mapcount(struct page *page)
return mapcount;
 }
 
-int folio_total_mapcount(const struct folio *folio);
+static inline int folio_large_mapcount(const struct folio *folio)
+{
+   VM_WARN_ON_FOLIO(!folio_test_large(folio), folio);
+   return atomic_read(&folio->_large_mapcount) + 1;
+}
 
 /**
- * folio_mapcount() - Calculate the number of mappings of this folio.
+ * folio_mapcount() - Number of mappings of this folio.
  * @folio: The folio.
  *
- * A large folio tracks both how many times the entire folio is mapped,
- * and how many times each individual page in the folio is mapped.
- * This function calculates the total number of times the folio is
- * mapped.
+ * The folio mapcount corresponds to the number of present user page table
+ * entries that reference any part of a folio. Each such present user page
+ * table entry must be paired with exactly one folio reference.
+ *
+ * For or

[PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios

2024-04-09 Thread David Hildenbrand
We can now read the mapcount of large folios very efficiently. Use it to
improve our handling of partially-mappable folios, falling back
to making a guess only in case the folio is not "obviously mapped shared".

We can now better detect partially-mappable folios where the first page is
not mapped as "mapped shared", reducing "false negatives"; but false
negatives are still possible.

While at it, fixup a wrong comment (false positive vs. false negative)
for KSM folios.

Signed-off-by: David Hildenbrand 
---
 include/linux/mm.h | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1862a216af15..daf687f0e8e5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2183,7 +2183,7 @@ static inline size_t folio_size(struct folio *folio)
  *   indicate "mapped shared" (false positive) when two VMAs in the same MM
  *   cover the same file range.
  *#. For (small) KSM folios, the return value can wrongly indicate "mapped
- *   shared" (false negative), when the folio is mapped multiple times into
+ *   shared" (false positive), when the folio is mapped multiple times into
  *   the same MM.
  *
  * Further, this function only considers current page table mappings that
@@ -2200,7 +2200,22 @@ static inline size_t folio_size(struct folio *folio)
  */
 static inline bool folio_likely_mapped_shared(struct folio *folio)
 {
-   return page_mapcount(folio_page(folio, 0)) > 1;
+   int mapcount = folio_mapcount(folio);
+
+   /* Only partially-mappable folios require more care. */
+   if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio)))
+   return mapcount > 1;
+
+   /* A single mapping implies "mapped exclusively". */
+   if (mapcount <= 1)
+   return false;
+
+   /* If any page is mapped more than once we treat it "mapped shared". */
+   if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio))
+   return true;
+
+   /* Let's guess based on the first subpage. */
+   return atomic_read(&folio->_mapcount) > 0;
 }
 
 #ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
-- 
2.44.0




[PATCH v1 06/18] mm: make folio_mapcount() return 0 for small typed folios

2024-04-09 Thread David Hildenbrand
We already handle it properly for large folios. Let's also return "0"
for small typed folios, like page_mapcount() currently would.

Consequently, folio_mapcount() will never return negative values for
typed folios, but may return negative values for underflows.

Signed-off-by: David Hildenbrand 
---
 include/linux/mm.h | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index daf687f0e8e5..d453232bba62 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1260,12 +1260,19 @@ static inline int folio_large_mapcount(const struct 
folio *folio)
  * references the entire folio counts exactly once, even when such special
  * page table entries are comprised of multiple ordinary page table entries.
  *
+ * Will report 0 for pages which cannot be mapped into userspace, such as
+ * slab, page tables and similar.
+ *
  * Return: The number of times this folio is mapped.
  */
 static inline int folio_mapcount(const struct folio *folio)
 {
-   if (likely(!folio_test_large(folio)))
-   return atomic_read(&folio->_mapcount) + 1;
+   int mapcount;
+
+   if (likely(!folio_test_large(folio))) {
+   mapcount = atomic_read(&folio->_mapcount);
+   return page_type_has_type(mapcount) ? 0 : mapcount + 1;
+   }
return folio_large_mapcount(folio);
 }
 
-- 
2.44.0




[PATCH v1 07/18] mm/memory: use folio_mapcount() in zap_present_folio_ptes()

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. In zap_present_folio_ptes(), let's simply check
the folio mapcount(). If there is some issue, it will underflow at some
point either way when unmapping.

As indicated already in commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap
with PTE-mapped THP"), we already documented "If we ever have a cheap
folio_mapcount(), we might just want to check for underflows there.".

There is no change for small folios. For large folios, we'll now catch
more underflows when batch-unmapping, because instead of only testing
the mapcount of the first subpage, we'll test if the folio mapcount
underflows.

Signed-off-by: David Hildenbrand 
---
 mm/memory.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 78422d1c7381..178492efb4af 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1502,8 +1502,7 @@ static __always_inline void zap_present_folio_ptes(struct 
mmu_gather *tlb,
if (!delay_rmap) {
folio_remove_rmap_ptes(folio, page, nr, vma);
 
-   /* Only sanity-check the first page in a batch. */
-   if (unlikely(page_mapcount(page) < 0))
+   if (unlikely(folio_mapcount(folio) < 0))
print_bad_pte(vma, addr, ptent, page);
}
 
-- 
2.44.0




[PATCH v1 08/18] mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. Let's similarly check for folio_mapcount() underflows
instead of page_mapcount() underflows like we do in
zap_present_folio_ptes() now.

Instead of the VM_BUG_ON(), we should actually be doing something like
print_bad_pte(). For now, let's keep it simple and use WARN_ON_ONCE(),
performing that check independently of DEBUG_VM.

Signed-off-by: David Hildenbrand 
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d8d2ed80b0bf..68ac27d229ef 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1851,7 +1851,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 
folio = page_folio(page);
folio_remove_rmap_pmd(folio, page, vma);
-   VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+   WARN_ON_ONCE(folio_mapcount(folio) < 0);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
swp_entry_t entry;
-- 
2.44.0




[PATCH v1 09/18] mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. We can only unmap full folios; page_mapped(),
which we check here, is translated to folio_mapped() -- based on
folio_mapcount(). So let's print the folio mapcount instead.

Signed-off-by: David Hildenbrand 
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 88359a185c5f..ee2f4b8905ef 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1628,8 +1628,8 @@ static bool hwpoison_user_mappings(struct page *p, 
unsigned long pfn,
 
unmap_success = !page_mapped(p);
if (!unmap_success)
-   pr_err("%#lx: failed to unmap page (mapcount=%d)\n",
-  pfn, page_mapcount(p));
+   pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
+  pfn, folio_mapcount(page_folio(p)));
 
/*
 * try_to_unmap() might put mlocked page in lru cache, so call
-- 
2.44.0




[PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

For tracing purposes, we use page_mapcount() in
__alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds
strange: total_migrated and total_reclaimed would count each page only
once, not multiple times.

But then, isolate_migratepages_range() adds each folio only once to the
list. So for large folios, we would query the mapcount of the
first page of the folio, which doesn't make too much sense for large
folios.

Let's simply use folio_mapped() * folio_nr_pages(), which makes more
sense as nr_migratepages is also incremented by the number of pages in
the folio in case of successful migration.

Signed-off-by: David Hildenbrand 
---
 mm/page_alloc.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 393366d4a704..40fc0f60e021 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6389,8 +6389,12 @@ int __alloc_contig_migrate_range(struct compact_control 
*cc,
 
if (trace_mm_alloc_contig_migrate_range_info_enabled()) {
total_reclaimed += nr_reclaimed;
-   list_for_each_entry(page, &cc->migratepages, lru)
-   total_mapped += page_mapcount(page);
+   list_for_each_entry(page, &cc->migratepages, lru) {
+   struct folio *folio = page_folio(page);
+
+   total_mapped += folio_mapped(folio) *
+   folio_nr_pages(folio);
+   }
}
 
ret = migrate_pages(&cc->migratepages, alloc_migration_target,
-- 
2.44.0




[PATCH v1 11/18] mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. In add_page_for_migration(), we actually want to
check if the folio is mapped shared, to reject such folios. So let's
use folio_likely_mapped_shared() instead.

For small folios, fully mapped THP, and hugetlb folios, there is no change.
For partially mapped, shared THP, we should now do a better job at
rejecting such folios.

Signed-off-by: David Hildenbrand 
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 285072bca29c..d87ce32645d4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2140,7 +2140,7 @@ static int add_page_for_migration(struct mm_struct *mm, 
const void __user *p,
goto out_putfolio;
 
err = -EACCES;
-   if (page_mapcount(page) > 1 && !migrate_all)
+   if (folio_likely_mapped_shared(folio) && !migrate_all)
goto out_putfolio;
 
err = -EBUSY;
-- 
2.44.0




[PATCH v1 12/18] sh/mm/cache: use folio_mapped() in copy_from_user_page()

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

We're already using folio_mapped in copy_user_highpage() and
copy_to_user_page() for a similar purpose so ... let's also simply use
it for copy_from_user_page().

There is no change for small folios. Likely we won't stumble over many
large folios on sh in that code either way.

Signed-off-by: David Hildenbrand 
---
 arch/sh/mm/cache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c
index 9bcaa5619eab..d8be352e14d2 100644
--- a/arch/sh/mm/cache.c
+++ b/arch/sh/mm/cache.c
@@ -84,7 +84,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct 
page *page,
 {
struct folio *folio = page_folio(page);
 
-   if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) &&
+   if (boot_cpu_data.dcache.n_aliases && folio_mapped(folio) &&
test_bit(PG_dcache_clean, &folio->flags)) {
void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
memcpy(dst, vfrom, len);
-- 
2.44.0




[PATCH v1 13/18] mm/filemap: use folio_mapcount() in filemap_unaccount_folio()

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary.

Let's use folio_mapcount() instead of page_mapcount() in
filemap_unaccount_folio().

No functional change intended, because we're only dealing with small
folios.

Signed-off-by: David Hildenbrand 
---
 mm/filemap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c668e11cd6ef..d4aa82ad5b59 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -168,7 +168,7 @@ static void filemap_unaccount_folio(struct address_space 
*mapping,
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 
if (mapping_exiting(mapping) && !folio_test_large(folio)) {
-   int mapcount = page_mapcount(&folio->page);
+   int mapcount = folio_mapcount(folio);
 
if (folio_ref_count(folio) >= mapcount + 2) {
/*
-- 
2.44.0




[PATCH v1 14/18] mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. Let's convert migrate_vma_check_page() to work on
a folio internally so we can remove the page_mapcount() usage.

Note that we reject any large folios.

There is a lot more folio conversion to be had, but that has to wait for
another day. No functional change intended.

Signed-off-by: David Hildenbrand 
---
 mm/migrate_device.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d40b46ae9d65..b929b450b77c 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -324,6 +324,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
  */
 static bool migrate_vma_check_page(struct page *page, struct page *fault_page)
 {
+   struct folio *folio = page_folio(page);
+
/*
 * One extra ref because caller holds an extra reference, either from
 * isolate_lru_page() for a regular page, or migrate_vma_collect() for
@@ -336,18 +338,18 @@ static bool migrate_vma_check_page(struct page *page, 
struct page *fault_page)
 * check them than regular pages, because they can be mapped with a pmd
 * or with a pte (split pte mapping).
 */
-   if (PageCompound(page))
+   if (folio_test_large(folio))
return false;
 
/* Page from ZONE_DEVICE have one extra reference */
-   if (is_zone_device_page(page))
+   if (folio_is_zone_device(folio))
extra++;
 
/* For file back page */
-   if (page_mapping(page))
-   extra += 1 + page_has_private(page);
+   if (folio_mapping(folio))
+   extra += 1 + folio_has_private(folio);
 
-   if ((page_count(page) - extra) > page_mapcount(page))
+   if ((folio_ref_count(folio) - extra) > folio_mapcount(folio))
return false;
 
return true;
-- 
2.44.0




[PATCH v1 15/18] trace/events/page_ref: trace the raw page mapcount value

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. We already trace raw page->refcount, raw page->flags
and raw page->mapping, and don't involve any folios. Let's also trace the
raw mapcount value that does not consider the entire mapcount of large
folios, and we don't add "1" to it.

When dealing with typed folios, this makes a lot more sense. ... and
it's for debugging purposes only either way.

Signed-off-by: David Hildenbrand 
---
 include/trace/events/page_ref.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/page_ref.h b/include/trace/events/page_ref.h
index 8a99c1cd417b..fe33a255b7d0 100644
--- a/include/trace/events/page_ref.h
+++ b/include/trace/events/page_ref.h
@@ -30,7 +30,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_template,
__entry->pfn = page_to_pfn(page);
__entry->flags = page->flags;
__entry->count = page_ref_count(page);
-   __entry->mapcount = page_mapcount(page);
+   __entry->mapcount = atomic_read(&page->_mapcount);
__entry->mapping = page->mapping;
__entry->mt = get_pageblock_migratetype(page);
__entry->val = v;
@@ -79,7 +79,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_and_test_template,
__entry->pfn = page_to_pfn(page);
__entry->flags = page->flags;
__entry->count = page_ref_count(page);
-   __entry->mapcount = page_mapcount(page);
+   __entry->mapcount = atomic_read(&page->_mapcount);
__entry->mapping = page->mapping;
__entry->mt = get_pageblock_migratetype(page);
__entry->val = v;
-- 
2.44.0




[PATCH v1 16/18] xtensa/mm: convert check_tlb_entry() to sanity check folios

2024-04-09 Thread David Hildenbrand
We want to limit the use of page_mapcount() to the places where it is
absolutely necessary. So let's convert check_tlb_entry() to perform
sanity checks on folios instead of pages.

This essentially already happened: page_count() is mapped to
folio_ref_count(), and page_mapped() to folio_mapped() internally.
However, we would have printed the page_mapount(), which
does not really match what page_mapped() would have checked.

Let's simply print the folio mapcount to avoid using page_mapcount(). For
small folios there is no change.

Signed-off-by: David Hildenbrand 
---
 arch/xtensa/mm/tlb.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 4f974b74883c..d8b60d6e50a8 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -256,12 +256,13 @@ static int check_tlb_entry(unsigned w, unsigned e, bool 
dtlb)
dtlb ? 'D' : 'I', w, e, r0, r1, pte);
if (pte == 0 || !pte_present(__pte(pte))) {
struct page *p = pfn_to_page(r1 >> PAGE_SHIFT);
-   pr_err("page refcount: %d, mapcount: %d\n",
-   page_count(p),
-   page_mapcount(p));
-   if (!page_count(p))
+   struct folio *f = page_folio(p);
+
+   pr_err("folio refcount: %d, mapcount: %d\n",
+   folio_ref_count(f), folio_mapcount(f));
+   if (!folio_ref_count(f))
rc |= TLB_INSANE;
-   else if (page_mapcount(p))
+   else if (folio_mapped(f))
rc |= TLB_SUSPICIOUS;
} else {
rc |= TLB_INSANE;
-- 
2.44.0




[PATCH v1 17/18] mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()

2024-04-09 Thread David Hildenbrand
Let's simplify and only print the page mapcount: we already print the
large folio mapcount and the entire folio mapcount for large folios
separately; that should be sufficient to figure out what's happening.

While at it, print the page mapcount also if it had an underflow,
filtering out only typed pages.

Signed-off-by: David Hildenbrand 
---
 mm/debug.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/debug.c b/mm/debug.c
index d064db42af54..69e524c3e601 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -55,15 +55,10 @@ static void __dump_folio(struct folio *folio, struct page 
*page,
unsigned long pfn, unsigned long idx)
 {
struct address_space *mapping = folio_mapping(folio);
-   int mapcount = atomic_read(&page->_mapcount) + 1;
+   int mapcount = atomic_read(&page->_mapcount);
char *type = "";
 
-   /* Open-code page_mapcount() to avoid looking up a stale folio */
-   if (mapcount < 0)
-   mapcount = 0;
-   if (folio_test_large(folio))
-   mapcount += folio_entire_mapcount(folio);
-
+   mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1;
pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx 
pfn:%#lx\n",
folio_ref_count(folio), mapcount, mapping,
folio->index + idx, pfn);
-- 
2.44.0




[PATCH v1 18/18] Documentation/admin-guide/cgroup-v1/memory.rst: don't reference page_mapcount()

2024-04-09 Thread David Hildenbrand
Let's stop talking about page_mapcount().

Signed-off-by: David Hildenbrand 
---
 Documentation/admin-guide/cgroup-v1/memory.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst 
b/Documentation/admin-guide/cgroup-v1/memory.rst
index 46110e6a31bb..9cde26d33843 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -802,8 +802,8 @@ a page or a swap can be moved only when it is charged to 
the task's current
 |   | anonymous pages, file pages (and swaps) in the range mmapped by the task 
|
 |   | will be moved even if the task hasn't done page fault, i.e. they might   
|
 |   | not be the task's "RSS", but other task's "RSS" that maps the same file. 
|
-|   | And mapcount of the page is ignored (the page can be moved even if   
|
-|   | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to
|
+|   | The mapcount of the page is ignored (the page can be moved independent   
|
+|   | of the mapcount). You must enable Swap Extension (see 2.4) to
|
 |   | enable move of swap charges. 
|
 
+---+--+
 
-- 
2.44.0




Re: [PATCH v1 01/18] mm: allow for detecting underflows with page_mapcount() again

2024-04-09 Thread Zi Yan
On 9 Apr 2024, at 15:22, David Hildenbrand wrote:

> Commit 53277bcf126d ("mm: support page_mapcount() on page_has_type()
> pages") made it impossible to detect mapcount underflows by treating
> any negative raw mapcount value as a mapcount of 0.
>
> We perform such underflow checks in zap_present_folio_ptes() and
> zap_huge_pmd(), which would currently no longer trigger.
>
> Let's check against PAGE_MAPCOUNT_RESERVE instead by using
> page_type_has_type(), like page_has_type() would, so we can still catch
> some underflows.
>
> Fixes: 53277bcf126d ("mm: support page_mapcount() on page_has_type() pages")
> Signed-off-by: David Hildenbrand 
> ---
>  include/linux/mm.h | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef34cf54c14f..0fb8a40f82dd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1229,11 +1229,10 @@ static inline void page_mapcount_reset(struct page 
> *page)
>   */
>  static inline int page_mapcount(struct page *page)
>  {
> - int mapcount = atomic_read(&page->_mapcount) + 1;
> + int mapcount = atomic_read(&page->_mapcount);
>
>   /* Handle page_has_type() pages */
> - if (mapcount < 0)
> - mapcount = 0;
> + mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1;
>   if (unlikely(PageCompound(page)))
>   mapcount += folio_entire_mapcount(page_folio(page));
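For anyone following along, the arithmetic works out roughly as below; the
reserve value is illustrative rather than the exact kernel constant, only its
sign and rough magnitude matter:

	/* Illustrative sketch only; not the kernel's exact constants. */
	#define SKETCH_MAPCOUNT_RESERVE		(-128)

	static inline int sketch_mapcount(int raw)	/* raw _mapcount value */
	{
		/*
		 * raw == -1  : ordinary page, unmapped         -> reported 0
		 * raw == -2  : underflowed ordinary page       -> reported -1 (caught)
		 * raw <  -128: typed page (PG_table, PG_buddy) -> reported 0
		 *
		 * The previous "clamp anything negative to 0" collapsed the
		 * second case into the others and hid the underflow.
		 */
		return raw < SKETCH_MAPCOUNT_RESERVE ? 0 : raw + 1;
	}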

LGTM. This could be picked up separately. Reviewed-by: Zi Yan 

--
Best Regards,
Yan, Zi




Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value

2024-04-09 Thread Zi Yan
On 9 Apr 2024, at 15:22, David Hildenbrand wrote:

> Let's track the mapcount of large folios in a single value. The mapcount of
> a large folio currently corresponds to the sum of the entire mapcount and
> all page mapcounts.
>
> This sum is what we actually want to know in folio_mapcount() and it is
> also sufficient for implementing folio_mapped().
>
> With PTE-mapped THP becoming more important and more widely used, we want
> to avoid looping over all pages of a folio just to obtain the mapcount
> of large folios. The comment "In the common case, avoid the loop when no
> pages mapped by PTE" in folio_total_mapcount() no longer holds for
> mTHP that are always mapped by PTE.
>
> Further, we are planning on using folio_mapcount() more
> frequently, and might even want to remove page mapcounts for large
> folios in some kernel configs. Therefore, allow for reading the mapcount of
> large folios efficiently and atomically without looping over any pages.
>
> Maintain the mapcount also for hugetlb pages for simplicity. Use the new
> mapcount to implement folio_mapcount() and folio_mapped(). Make
> page_mapped() simply call folio_mapped(). We can now get rid of
> folio_large_is_mapped().
>
> _nr_pages_mapped is now only used in rmap code and for debugging
> purposes. Keep folio_nr_pages_mapped() around, but document that its use
> should be limited to rmap internals and debugging purposes.
>
> This change implies one additional atomic add/sub whenever
> mapping/unmapping (parts of) a large folio.
>
> As we now batch RMAP operations for PTE-mapped THP during fork(),
> during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust
> the large mapcount for a PTE batch only once, the added overhead in the
> common case is small. Only when unmapping individual pages of a large folio
> (e.g., during COW), the overhead might be bigger in comparison, but it's
> essentially one additional atomic operation.
>
> Note that before the new mapcount would overflow, already our refcount
> would overflow: each mapping requires a folio reference. Extend the
> focumentation of folio_mapcount().
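A rough sketch of the resulting read side; the _large_mapcount field name and
the off-by-one storage are assumptions based on the description above, not
code copied from the series:

	/*
	 * Sketch: a large folio keeps one counter that is adjusted on each
	 * (un)map operation (once per PTE batch), so the total mapcount can
	 * be read without looping over the folio's pages.
	 */
	static inline int sketch_folio_mapcount(struct folio *folio)
	{
		if (likely(!folio_test_large(folio)))
			/* small folio: per-page mapcount, stored off by one */
			return atomic_read(&folio->_mapcount) + 1;
		/* large folio: single counter for entire + per-page mappings */
		return atomic_read(&folio->_large_mapcount) + 1;
	}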

s/focumentation/documentation/  ;)

--
Best Regards,
Yan, Zi




Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()

2024-04-09 Thread Alexei Starovoitov
On Tue, Apr 9, 2024 at 1:11 AM Marco Elver  wrote:
>
> On Mon, 8 Apr 2024 at 20:24, Alexei Starovoitov
>  wrote:
> >
> > On Mon, Apr 8, 2024 at 2:30 AM Marco Elver  wrote:
> > >
> > > On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko  
> > > wrote:
> > > >
> > > > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver  wrote:
> > > > >
> > > > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov
> > > > >  wrote:
> > > [...]
> > > > > > and the tasks can use mmaped array shared across all or unique to 
> > > > > > each
> > > > > > process.
> > > > > > And both bpf and user space can read/write them with a single 
> > > > > > instruction.
> > > > >
> > > > > That's BPF_F_MMAPABLE, right?
> > > > >
> > > > > That does not work because the mmapped region is global. Our 
> > > > > requirements are:
> >
> > It sounds not like "requirements", but a description of the proposed
> > solution.
>
> These requirements can also be implemented differently, e.g. with the
> proposed task-local storage maps if done right (the devil appears to
> be in the implementation-details - these details are currently beyond
> my knowledge of the BPF subsystem internals). They really are the bare
> minimum requirements for the use case. The proposed solution probably
> happens to be the simplest one, and mapping requirements is relatively
> straightforward for it.
>
> > Pls share the actual use case.
> > This "tracing prog" sounds more like a ghost scheduler that
> > wants to interact with known user processes.
>
> It's tcmalloc - we have a BPF program provide a simpler variant of the
> "thread re-scheduling" notifications discussed here:
> https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3...@mail.gmail.com/
>
> Various synchronization algorithms can be optimized if they know about
> scheduling events. This has been proposed in [1] to implement an
> adaptive mutex. However, we think that there are many more
> possibilities, but it really depends on the kind of scheduler state
> being exposed ("thread on CPU" as in [1], or more details, like
> details about which thread was switched in, which was switched out,
> where was the thread migrated to, etc.). Committing to these narrow
> use case ABIs and extending the main kernel with more and more such
> information does not scale. Instead, BPF easily allows to expose this
> information where it's required.
>
> [1] 
> https://lore.kernel.org/all/20230529191416.53955-1-mathieu.desnoy...@efficios.com/

Thanks for the links. That's very helpful.
I was about to mention rseq.
I think it's a better fit, but Mathieu doesn't want non-uapi bits in it.
Ideally we could have added a pointer + size to rseq as a scratch area
that a bpf prog can read/write.
User space would register such an area, the kernel would pin its pages,
and bpf would access it directly.
Pretty much the same idea as bpf local storage, but without a bpf map.
rseq has its pros and cons.
Having a bpf local storage map manage such pinned pages provides
separation of concerns. And it's up to tcmalloc whether to use one
such map for all tasks or a map per tcmalloc instance.
Such scratch areas are per-task per-map.
rseq dictates per-task only, but has no setup concerns thanks to
glibc integration.
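To make that concrete, a sketch of what the existing (non-mmapable) task
local storage flavor looks like from the bpf side; the scratch layout and the
sched_switch attach point are made up for illustration:

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	struct scratch {
		__u64 switch_out_ns;
		__u32 on_cpu;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
		__uint(map_flags, BPF_F_NO_PREALLOC);
		__type(key, int);
		__type(value, struct scratch);
	} scratch_map SEC(".maps");

	SEC("tp_btf/sched_switch")
	int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
		     struct task_struct *next)
	{
		struct scratch *s;

		/* Record that @prev is going off-CPU. */
		s = bpf_task_storage_get(&scratch_map, prev, 0, 0);
		if (s) {
			s->on_cpu = 0;
			s->switch_out_ns = bpf_ktime_get_ns();
		}

		/* Create on demand for @next and mark it on-CPU. */
		s = bpf_task_storage_get(&scratch_map, next, 0,
					 BPF_LOCAL_STORAGE_GET_F_CREATE);
		if (s)
			s->on_cpu = 1;
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";

The missing piece for the use case above is exactly the mmap part: today
user space can only read/update these per-task values through the bpf
syscall, not through a shared mapping.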

> > The main issue with bpf_probe_read/write_user() is their non-determinism.
> > They will error when memory is swapped out.
>
> Right. Although, user space mlock'ing and prefaulting the memory
> solves it in the common case (some corner cases like after fork() are
> still tricky).

Yeah, it gets tricky indeed.
With a bpf local storage map managing the pinned pages, the map_fd's
auto-close property addresses unpinning. No need to hack into the
fork and exit paths, but tcmalloc would need to re-register the areas
into bpf local storage after fork... guessing.
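For completeness, the mlock-and-prefault mitigation mentioned above would
look roughly like this in user space (illustrative only):

	#include <string.h>
	#include <sys/mman.h>

	static void *map_resident_scratch(size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return NULL;
		if (mlock(p, len)) {		/* keep it from being paged out */
			munmap(p, len);
			return NULL;
		}
		memset(p, 0, len);		/* prefault every page now */
		return p;
	}

MAP_LOCKED | MAP_POPULATE at mmap() time is roughly equivalent, but memory
locks are not inherited across fork() and the child's pages are COW again,
so the child has to redo this (and re-register), which is the tricky part
noted above.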

> > These helpers are ok-ish for observability when consumers understand
> > that some events might be lost, but for 24/7 production use
> > losing reads becomes a problem that bpf prog cannot mitigate.
> > What do bpf prog suppose to do when this safer bpf_probe_write_user errors?
> > Use some other mechanism to communicate with user space?
>
> Right, for use cases where these errors aren't ok it does not work.
> But if the algorithm is tolerant to lossy reads/writes from the BPF
> program side, it's not an issue.

Reading Dmitry Vyukov's comments in that thread... looks like
he's also concerned with the 'probe' part of rseq api.

> > A mechanism with such builtin randomness in behavior is a footgun for
> > bpf users.
> > We have bpf_copy_from_user*() that don't have this non-determinism.
> > We can introduce bpf_copy_to_user(), but it will be usable
> > from sleepable bpf prog.
> > While it sounds you need it somewhere where scheduler makes decisions,
> > so I suspect bpf array or arena is a better fit.
>
> Correct, it can't sleep because it wants scheduler events.
>
> > Or something that extends bpf local storage map.
> > See long discussion:
> > https://lore.kernel.org/bpf/45878586-cc5f-435

Re: [PATCH v1 01/18] mm: allow for detecting underflows with page_mapcount() again

2024-04-09 Thread Matthew Wilcox
On Tue, Apr 09, 2024 at 09:22:44PM +0200, David Hildenbrand wrote:
> Commit 53277bcf126d ("mm: support page_mapcount() on page_has_type()
> pages") made it impossible to detect mapcount underflows by treating
> any negative raw mapcount value as a mapcount of 0.

Yes, but I don't think this is the right place to check for underflow.
We should be checking for that on modification, not on read. I think
it's more important for page_mapcount() to be fast than for it to serve
as a debugging aid.
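Something like the following would catch it at the source instead (ignoring
typed pages and the large folio paths for brevity; the helper name is made
up):

	static inline void sketch_page_unmap_one(struct page *page)
	{
		int val = atomic_dec_return(&page->_mapcount);

		/* -1 means "no mappings left"; anything lower is an underflow. */
		VM_WARN_ON_ONCE(val < -1);
	}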