Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()
On Mon, 8 Apr 2024 at 20:24, Alexei Starovoitov wrote: > > On Mon, Apr 8, 2024 at 2:30 AM Marco Elver wrote: > > > > On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko > > wrote: > > > > > > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver wrote: > > > > > > > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov > > > > wrote: > > [...] > > > > > and the tasks can use mmaped array shared across all or unique to each > > > > > process. > > > > > And both bpf and user space can read/write them with a single > > > > > instruction. > > > > > > > > That's BPF_F_MMAPABLE, right? > > > > > > > > That does not work because the mmapped region is global. Our > > > > requirements are: > > It sounds not like "requirements", but a description of the proposed > solution. These requirements can also be implemented differently, e.g. with the proposed task-local storage maps if done right (the devil appears to be in the implementation-details - these details are currently beyond my knowledge of the BPF subsystem internals). They really are the bare minimum requirements for the use case. The proposed solution probably happens to be the simplest one, and mapping requirements is relatively straightforward for it. > Pls share the actual use case. > This "tracing prog" sounds more like a ghost scheduler that > wants to interact with known user processes. It's tcmalloc - we have a BPF program provide a simpler variant of the "thread re-scheduling" notifications discussed here: https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3...@mail.gmail.com/ Various synchronization algorithms can be optimized if they know about scheduling events. This has been proposed in [1] to implement an adaptive mutex. However, we think that there are many more possibilities, but it really depends on the kind of scheduler state being exposed ("thread on CPU" as in [1], or more details, like details about which thread was switched in, which was switched out, where was the thread migrated to, etc.). Committing to these narrow use case ABIs and extending the main kernel with more and more such information does not scale. Instead, BPF easily allows to expose this information where it's required. [1] https://lore.kernel.org/all/20230529191416.53955-1-mathieu.desnoy...@efficios.com/ > > > > 1. Single tracing BPF program. > > > > > > > > 2. Per-process (per VM) memory region (here it's per-thread, but each > > > > thread just registers the same process-wide region). No sharing > > > > between processes. > > > > > > > > 3. From #2 it follows: exec unregisters the registered memory region; > > > > fork gets a cloned region. > > > > > > > > 4. Unprivileged processes can do prctl(REGISTER). Some of them might > > > > not be able to use the bpf syscall. > > > > > > > > The reason for #2 is that each user space process also writes to the > > > > memory region (read by the BPF program to make updates depending on > > > > what state it finds), and having shared state between processes > > > > doesn't work here. > > > > > > > > Is there any reasonable BPF facility that can do this today? (If > > > > BPF_F_MMAPABLE could do it while satisfying requirements 2-4, I'd be a > > > > happy camper.) > > > > > > You could simulate something like this with multi-element ARRAY + > > > BPF_F_MMAPABLE, though you'd need to pre-allocate up to max number of > > > processes, so it's not an exact fit. > > > > Right, for production use this is infeasible. > > Last I heard, ghost agent and a few important tasks can mmap bpf array > and share it with bpf prog. 
> So quite feasible. Nothing related to ghost. It's tcmalloc, i.e. every process running everywhere. > > > But what seems to be much closer is using BPF task-local storage, if > > > we support mmap()'ing its memory into user-space. We've had previous > > > discussions on how to achieve this (the simplest being that > > > mmap(task_local_map_fd, ...) maps current thread's part of BPF task > > > local storage). You won't get automatic cloning (you'd have to do that > > > from the BPF program on fork/exec tracepoint, for example), and within > > > the process you'd probably want to have just one thread (main?) to > > > mmap() initially and just share the pointer across all relevant > > > threads. > > > > In the way you imagine it, would that allow all threads sharing the > > same memory, despite it being task-local? Presumably each task's local > > storage would be mapped to just point to the same memory? > > > > > But this is a more generic building block, IMO. This relying > > > on BPF map also means pinning is possible and all the other BPF map > > > abstraction benefits. > > > > Deployment-wise it will make things harder because unprivileged > > processes still have to somehow get the map's shared fd somehow to > > mmap() it. Not unsolvable, and in general what you describe looks > > interesting, but I currently can't see how it will be simpler. > > bpf map can be pinned into bpffs for any unpriv proce
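(For illustration only -- not from the thread: a minimal user-space sketch of the BPF_F_MMAPABLE approach referenced above, where a BPF array map is mmap()ed so that both the tracing program and user space touch the same memory with plain loads and stores. libbpf's bpf_map_create() is assumed; the map name and the missing error handling are illustrative.)

#include <sys/mman.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>

int main(void)
{
	LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
	long page_size = sysconf(_SC_PAGESIZE);
	__u64 *val;
	int map_fd;

	map_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "shared_state",
				sizeof(__u32), sizeof(__u64), 1, &opts);
	if (map_fd < 0)
		return 1;

	/* The same pages are visible to the BPF program through the array map. */
	val = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   map_fd, 0);
	if (val == MAP_FAILED)
		return 1;

	*val = 42;	/* a single store, readable by the BPF program */
	return 0;
}

(This is exactly the "global region" limitation pointed out above: every process that maps the same map fd shares the same memory, which is why it does not satisfy requirement #2.)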
[PATCH v3 1/2] proc: restrict /proc/pid/mem access via param knobs
Prior to v2.6.39 write access to /proc/pid/mem was restricted, after which it got allowed in commit 198214a7ee50 ("proc: enable writing to /proc/pid/mem"). Famous last words from that patch: "no longer a security hazard". :) Afterwards exploits started causing drama like [1]. The exploits using /proc/*/mem can be rather sophisticated like [2] which installed an arbitrary payload from noexec storage into a running process then exec'd it, which itself could include an ELF loader to run arbitrary code off noexec storage. One of the well-known problems with /proc/*/mem writes is they ignore page permissions via FOLL_FORCE, as opposed to writes via process_vm_writev which respect page permissions. These writes can also be used to bypass mode bits. To harden against these types of attacks, distributions might want to restrict /proc/pid/mem accesses, either entirely or partially, e.g. to restrict FOLL_FORCE usage. Known valid use-cases which still need these accesses are: * Debuggers which also have ptrace permissions, so they can access memory anyway via PTRACE_POKEDATA & co. Some debuggers like GDB are designed to write /proc/pid/mem for basic functionality. * Container supervisors using the seccomp notifier to intercept syscalls and rewrite memory of calling processes by passing around /proc/pid/mem file descriptors. There might be more, that's why these params default to disabled. Regarding other mechanisms which can block these accesses: * seccomp filters can be used to block mmap/mprotect calls with W|X perms, but they often can't block open calls as daemons want to read/write their runtime state and seccomp filters cannot check file paths, so plain write calls can't be easily blocked. * Since the mem file is part of the dynamic /proc/pid/ space, we can't run chmod once at boot to restrict it (and trying to react to every process and run chmod doesn't scale, and the kernel no longer allows chmod on any of these paths). * SELinux could be used with a rule to cover all /proc/*/mem files, but even then having multiple ways to deny an attack is useful in case one layer fails. Thus we introduce three kernel parameters to restrict /proc/*/mem access: read, write and foll_force. All three can be independently set to the following values: all => restrict all access unconditionally. ptracer => restrict all access except for ptracer processes. If left unset, the existing behaviour is preserved, i.e. access is governed by basic file permissions. Examples which can be passed by bootloaders: restrict_proc_mem_foll_force=all restrict_proc_mem_write=ptracer restrict_proc_mem_read=ptracer Each distribution needs to decide what restrictions to apply, depending on its use-cases. Embedded systems might want to do more, while general-purpose distros might want a more relaxed policy, because e.g. foll_force=all and write=all both break GDB, so it might be a bit excessive. Based on an initial patch by Mike Frysinger .
Link: https://lwn.net/Articles/476947/ [1] Link: https://issues.chromium.org/issues/40089045 [2] Cc: Guenter Roeck Cc: Doug Anderson Cc: Kees Cook Cc: Jann Horn Cc: Andrew Morton Cc: Randy Dunlap Cc: Christian Brauner Co-developed-by: Mike Frysinger Signed-off-by: Mike Frysinger Signed-off-by: Adrian Ratiu --- .../admin-guide/kernel-parameters.txt | 27 + fs/proc/base.c| 103 +- include/linux/jump_label.h| 5 + 3 files changed, 133 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 6e62b8cb19c8d..d7f7db41369c7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5665,6 +5665,33 @@ reset_devices [KNL] Force drivers to reset the underlying device during initialization. + restrict_proc_mem_read= [KNL] + Format: {all | ptracer} + Allows restricting read access to /proc/*/mem files. + Depending on restriction level, open for reads return -EACCES. + Can be one of: + - 'all' restricts all access unconditionally. + - 'ptracer' allows access only for ptracer processes. + If not specified, then basic file permissions continue to apply. + + restrict_proc_mem_write= [KNL] + Format: {all | ptracer} + Allows restricting write access to /proc/*/mem files. + Depending on restriction level, open for writes return -EACCES. + Can be one of: + - 'all' restricts all access unconditionally. + - 'ptracer' allows access only for ptracer processes. + If not specified, then basic file permissions continu
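(Illustrative sketch, not part of the patch: the user-space access path these knobs restrict. Writing through /proc/<pid>/mem goes through the kernel with FOLL_FORCE, so it can succeed even where the target mapping is read-only, unlike process_vm_writev(); with restrict_proc_mem_write= set, the open() below fails with EACCES for non-ptracer callers. The helper name and error handling are illustrative.)

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

static int poke(pid_t pid, uintptr_t addr, const void *buf, size_t len)
{
	char path[64];
	ssize_t ret;
	int fd;

	snprintf(path, sizeof(path), "/proc/%d/mem", (int)pid);
	fd = open(path, O_RDWR);	/* denied with EACCES when writes are restricted */
	if (fd < 0)
		return -1;

	/* The file offset is the target's virtual address. */
	ret = pwrite(fd, buf, len, (off_t)addr);
	close(fd);
	return ret == (ssize_t)len ? 0 : -1;
}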
[PATCH v3 2/2] proc: add Kconfigs to restrict /proc/pid/mem access
Some systems might have difficulty changing their bootloaders to enable the newly added restrict_proc_mem* params, e.g. remote embedded systems doing OTA updates, so this provides a set of Kconfigs to set /proc/pid/mem restrictions at build-time. The boot params take precedence over the Kconfig values. This can be reversed, but doing it this way I think makes sense. Another idea is to have a global bool Kconfig which can enable or disable this mechanism in its entirety, however it does not seem necessary since all three knobs default to off, the branch logic overhead is rather minimal and I assume most systems will want to restrict at least the use of FOLL_FORCE. Cc: Guenter Roeck Cc: Doug Anderson Cc: Kees Cook Cc: Jann Horn Cc: Andrew Morton Cc: Randy Dunlap Cc: Christian Brauner Signed-off-by: Adrian Ratiu --- fs/proc/base.c | 33 + security/Kconfig | 42 ++ 2 files changed, 75 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index c733836c42a65..e8ee848fc4a98 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -889,6 +889,17 @@ static int __mem_open_check_access_restriction(struct file *file) !__mem_open_current_is_ptracer(file)) return -EACCES; +#ifdef CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT + /* Deny if writes are unconditionally disabled via Kconfig */ + if (!strncmp(CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT, "all", 3)) + return -EACCES; + + /* Deny if writes are allowed only for ptracers via Kconfig */ + if (!strncmp(CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT, "ptracer", 7) && + !__mem_open_current_is_ptracer(file)) + return -EACCES; +#endif + } else if (file->f_mode & FMODE_READ) { /* Deny if reads are unconditionally disabled via param */ if (static_branch_unlikely(&restrict_proc_mem[2])) @@ -898,6 +909,17 @@ static int __mem_open_check_access_restriction(struct file *file) if (static_branch_unlikely(&restrict_proc_mem[3]) && !__mem_open_current_is_ptracer(file)) return -EACCES; + +#ifdef CONFIG_SECURITY_PROC_MEM_READ_RESTRICT + /* Deny if reads are unconditionally disabled via Kconfig */ + if (!strncmp(CONFIG_SECURITY_PROC_MEM_READ_RESTRICT, "all", 3)) + return -EACCES; + + /* Deny if reads are allowed only for ptracers via Kconfig */ + if (!strncmp(CONFIG_SECURITY_PROC_MEM_READ_RESTRICT, "ptracer", 7) && + !__mem_open_current_is_ptracer(file)) + return -EACCES; +#endif } return 0; @@ -930,6 +952,17 @@ static unsigned int __mem_rw_get_foll_force_flag(struct file *file) !__mem_open_current_is_ptracer(file)) return 0; +#ifdef CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT + /* Deny if FOLL_FORCE is disabled via Kconfig */ + if (!strncmp(CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT, "all", 3)) + return 0; + + /* Deny if FOLL_FORCE is only allowed for ptracers via Kconfig */ + if (!strncmp(CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT, "ptracer", 7) && + !__mem_open_current_is_ptracer(file)) + return 0; +#endif + return FOLL_FORCE; } diff --git a/security/Kconfig b/security/Kconfig index 412e76f1575d0..31a588cedec8d 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -19,6 +19,48 @@ config SECURITY_DMESG_RESTRICT If you are unsure how to answer this question, answer N. +config SECURITY_PROC_MEM_READ_RESTRICT + string "Restrict read access to /proc/*/mem files" + depends on PROC_FS + default "none" + help + This option allows specifying a restriction level for read access + to /proc/*/mem files. Can be one of: + - 'all' restricts all access unconditionally. + - 'ptracer' allows access only for ptracer processes.
+ + This can also be set at boot with the "restrict_proc_mem_read=" param. + + If unsure leave empty to continue using basic file permissions. + +config SECURITY_PROC_MEM_WRITE_RESTRICT + string "Restrict write access to /proc/*/mem files" + depends on PROC_FS + default "none" + help + This option allows specifying a restriction level for write access + to /proc/*/mem files. Can be one of: + - 'all' restricts all access unconditionally. + - 'ptracer' allows access only for ptracer processes. + + This can also be set at boot with the "restrict_proc_mem_write=" param. + + If unsure leave empty to continue using basic file permissions. + +config SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT + string "Restrict use of FOLL_FORCE for /proc/*/mem access" +
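(For reference, a hypothetical .config fragment mirroring the boot-parameter example from patch 1/2; the string values are assumptions based on the strncmp() checks above, and the boot params still take precedence over these build-time defaults.)

CONFIG_SECURITY_PROC_MEM_FOLL_FORCE_RESTRICT="all"
CONFIG_SECURITY_PROC_MEM_WRITE_RESTRICT="ptracer"
CONFIG_SECURITY_PROC_MEM_READ_RESTRICT="ptracer"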
[PATCH 0/2] TPM documentation updates
Re-send of TPM documentation updates. Note that in this patch set the patch versions do not relate to the patch set version, as the patches were previously managed independently. Cc: Alexander Steffen CC: Daniel P. Smith Cc: James Bottomley Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Lino Sanfilippo Cc: Mimi Zohar Cc: Peter Huewe Cc: Randy Dunlap Cc: linux-doc@vger.kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-integr...@vger.kernel.org v1: - Collect the latest versions of patches sent earlier. Jarkko Sakkinen (2): MAINTAINERS: Update URLs for KEYS/KEYRINGS_INTEGRITY and TPM DEVICE DRIVER Documentation: tpm_tis Documentation/security/tpm/index.rst | 1 + Documentation/security/tpm/tpm_tis.rst | 46 ++ MAINTAINERS| 3 +- 3 files changed, 49 insertions(+), 1 deletion(-) create mode 100644 Documentation/security/tpm/tpm_tis.rst -- 2.44.0
[PATCH 1/2] MAINTAINERS: Update URLs for KEYS/KEYRINGS_INTEGRITY and TPM DEVICE DRIVER
Add the TPM driver test suite URL to the MAINTAINERS file and move the wiki URL to a more appropriate location. Link: https://gitlab.com/jarkkojs/linux-tpmdd-test Link: https://kernsec.org/wiki/index.php/Linux_Kernel_Integrity Acked-by: Paul Menzel Signed-off-by: Jarkko Sakkinen --- v2: - Typo fix: https://lore.kernel.org/all/eaa5107ac4f982b6fd6e80b522643a591e719dc9.ca...@hansenpartnership.com/ - Typo fix: https://lore.kernel.org/all/1ab10318-5e3d-417c-9984-7b17f4e38...@molgen.mpg.de/ --- MAINTAINERS | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 1339918df52a..01dc4940fc06 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12015,6 +12015,7 @@ M: Mimi Zohar L: linux-integr...@vger.kernel.org L: keyri...@vger.kernel.org S: Supported +W: https://kernsec.org/wiki/index.php/Linux_Kernel_Integrity F: security/integrity/platform_certs KFENCE @@ -22344,7 +22345,7 @@ M: Jarkko Sakkinen R: Jason Gunthorpe L: linux-integr...@vger.kernel.org S: Maintained -W: https://kernsec.org/wiki/index.php/Linux_Kernel_Integrity +W: https://gitlab.com/jarkkojs/linux-tpmdd-test Q: https://patchwork.kernel.org/project/linux-integrity/list/ T: git git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd.git F: drivers/char/tpm/ -- 2.44.0
[PATCH 2/2] Documentation: tpm_tis
Based on recent discussions on LKML, provide preliminary documentation for the tpm_tis_core dependent drivers. Includes only bare essentials but can be extended later on case by case. This way some people may even want to read it later on. Signed-off-by: Jarkko Sakkinen --- v4: - Extended the text to address some of Stefan's concerns with v2. - Had to unfortunately remove Randy's reviewed-by because of this, given the amount of text added. v3: - Fixed incorrect buffer size: https://lore.kernel.org/linux-integrity/d957dbd3-4975-48d7-abc5-1a01c0959...@linux.ibm.com/ v2: - Fixed errors reported by Randy: https://lore.kernel.org/all/aed28265-d677-491a-a045-24b351854...@infradead.org/ - Improved the text a bit to have a better presentation. --- Documentation/security/tpm/index.rst | 1 + Documentation/security/tpm/tpm_tis.rst | 46 ++ 2 files changed, 47 insertions(+) create mode 100644 Documentation/security/tpm/tpm_tis.rst diff --git a/Documentation/security/tpm/index.rst b/Documentation/security/tpm/index.rst index fc40e9f23c85..f27a17f60a96 100644 --- a/Documentation/security/tpm/index.rst +++ b/Documentation/security/tpm/index.rst @@ -5,6 +5,7 @@ Trusted Platform Module documentation .. toctree:: tpm_event_log + tpm_tis tpm_vtpm_proxy xen-tpmfront tpm_ftpm_tee diff --git a/Documentation/security/tpm/tpm_tis.rst b/Documentation/security/tpm/tpm_tis.rst new file mode 100644 index ..b448ea3db71d --- /dev/null +++ b/Documentation/security/tpm/tpm_tis.rst @@ -0,0 +1,46 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================= +TPM FIFO interface driver +========================= + +TCG PTP Specification defines two interface types: FIFO and CRB. The former is +based on sequenced read and write operations, and the latter is based on a +buffer containing the full command or response. + +FIFO (First-In-First-Out) interface is used by the tpm_tis_core dependent +drivers. Originally Linux had only a driver called tpm_tis, which covered +memory mapped (aka MMIO) interface but it was later on extended to cover other +physical interfaces supported by the TCG standard. + +For legacy compliance the original MMIO driver is called tpm_tis and the +framework for FIFO drivers is named as tpm_tis_core. The postfix "tis" in +tpm_tis comes from the TPM Interface Specification, which is the hardware +interface specification for TPM 1.x chips. + +Communication is based on a 20 KiB buffer shared by the TPM chip through a +hardware bus or memory map, depending on the physical wiring. The buffer is +further split into five equal-size 4 KiB buffers, which provide equivalent +sets of registers for communication between the CPU and TPM. These +communication endpoints are called localities in the TCG terminology. + +When the kernel wants to send commands to the TPM chip, it first reserves +locality 0 by setting the requestUse bit in the TPM_ACCESS register. The bit is +cleared by the chip when the access is granted. Once it completes its +communication, the kernel writes the TPM_ACCESS.activeLocality bit. This +informs the chip that the locality has been relinquished. + +Pending localities are served in order by the chip in descending order, one at +a time: + +- Locality 0 has the lowest priority. +- Locality 4 has the highest priority. + +Further information on the purpose and meaning of the localities can be found +in section 3.2 of the TCG PC Client Platform TPM Profile Specification. + +References +========== + +TCG PC Client Platform TPM Profile (PTP) Specification +https://trustedcomputinggroup.org/resource/pc-client-platform-tpm-profile-ptp-specification/ -- 2.44.0
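(Illustrative sketch, not part of the patch: the locality handshake described above, for the MMIO case. The TPM_ACCESS offsets and bit masks are taken from the TCG PTP specification as assumptions for this example; see drivers/char/tpm/tpm_tis_core.c for the real implementation, which also bounds the waits with timeouts and sleeps instead of spinning.)

#include <linux/io.h>

#define TPM_ACCESS(l)			(0x0000 | ((l) << 12))	/* one 4 KiB locality page each */
#define TPM_ACCESS_REQUEST_USE		0x02
#define TPM_ACCESS_ACTIVE_LOCALITY	0x20

static void request_locality(void __iomem *base, int l)
{
	iowrite8(TPM_ACCESS_REQUEST_USE, base + TPM_ACCESS(l));

	/* The chip clears requestUse once it has granted the locality. */
	while (ioread8(base + TPM_ACCESS(l)) & TPM_ACCESS_REQUEST_USE)
		cpu_relax();
}

static void relinquish_locality(void __iomem *base, int l)
{
	/* Writing activeLocality back tells the chip we are done. */
	iowrite8(TPM_ACCESS_ACTIVE_LOCALITY, base + TPM_ACCESS(l));
}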
[PATCH v1 00/18] mm: mapcount for large folios + page_mapcount() cleanups
This series tracks the mapcount of large folios in a single value, so it can be read efficiently and atomically, just like the mapcount of small folios. folio_mapcount() is then used in a couple more places, most notably to reduce false negatives in folio_likely_mapped_shared(), and many users of page_mapcount() are cleaned up (that's maybe why you got CCed on the full series, sorry sh+xtensa folks! :) ). The remaining s390x user and one KSM user of page_mapcount() are getting removed separately on the list right now. I have patches to handle the other KSM one, the khugepaged one and the kpagecount one; as they are not as "obvious", I will send them out separately in the future. Once that is all in place, I'm planning on moving page_mapcount() into fs/proc/task_mmu.c, the remaining user for the time being (and we can discuss at LSF/MM details on that :) ). I proposed the mapcount for large folios (previously called total mapcount) originally in part of [1] and I later included it in [2] where it is a requirement. In the meantime, I changed the patch a bit so I dropped all RB's. During the discussion of [1], Peter Xu correctly raised that this additional tracking might affect the performance when PMD->PTE remapping THPs. In the meantime, I addressed that by batching RMAP operations during fork(), unmap/zap and when PMD->PTE remapping THPs. Running some of my micro-benchmarks [3] (fork,munmap,cow-byte,remap) on 1 GiB of memory backed by folios with the same order, I observe the following on an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz tuned for reproducible results as much as possible: Standard deviation is mostly < 1%, except for order-9, where it's < 2% for fork() and munmap(). (1) Small folios are not affected (< 1%) in all 4 microbenchmarks. (2) Order-4 folios are not affected (< 1%) in all 4 microbenchmarks. A bit weird compared to the other orders ... (3) PMD->PTE remapping of order-9 THPs is not affected (< 1%) (4) COW-byte (COWing a single page by writing a single byte) is not affected for any order (< 1 %). The page copy_fault overhead dominates everything. (5) fork() is mostly not affected (< 1%), except order-2, where we have a slowdown of ~4%. Already for order-3 folios, we're down to a slowdown of < 1%. (6) munmap() sees a slowdown by < 3% for some orders (order-5, order-6, order-9), but less for others (< 1% for order-4 and order-8, < 2% for order-2, order-3, order-7). Especially the fork() and munmap() benchmarks are sensitive to each added instruction and other system noise, so I suspect some of the change and observed weirdness (order-4) is due to code layout changes and other factors, but not really due to the added atomics. So in the common case where we can batch, the added atomics don't really make a big difference, especially in light of the recent improvements for large folios that we recently gained due to batching. Surprisingly, for some cases where we cannot batch (e.g., COW), the added atomics don't seem to matter, because other overhead dominates. My fork and munmap micro-benchmarks don't cover cases where we cannot batch-process bigger parts of large folios. As this is not the common case, I'm not worrying about that right now. Future work is batching RMAP operations during swapout and folio migration. Not CCing everybody (e.g., cgroups folks just because of the doc update) recommended by get_maintainers, to reduce noise. Tested on x86-64, compile-tested on a bunch of other archs. Will do more testing in the upcoming days.
[1] https://lore.kernel.org/all/20230809083256.699513-1-da...@redhat.com/ [2] https://lore.kernel.org/all/20231124132626.235350-1-da...@redhat.com/ [3] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c?ref_type=heads Cc: Andrew Morton Cc: "Matthew Wilcox (Oracle)" Cc: Peter Xu Cc: Ryan Roberts Cc: Yin Fengwei Cc: Yang Shi Cc: Zi Yan Cc: Jonathan Corbet Cc: Hugh Dickins Cc: Yoshinori Sato Cc: Rich Felker Cc: John Paul Adrian Glaubitz Cc: Chris Zankel Cc: Max Filippov Cc: Muchun Song Cc: Miaohe Lin Cc: Naoya Horiguchi Cc: Richard Chang David Hildenbrand (18): mm: allow for detecting underflows with page_mapcount() again mm/rmap: always inline anon/file rmap duplication of a single PTE mm/rmap: add fast-path for small folios when adding/removing/duplicating mm: track mapcount of large folios in single value mm: improve folio_likely_mapped_shared() using the mapcount of large folios mm: make folio_mapcount() return 0 for small typed folios mm/memory: use folio_mapcount() in zap_present_folio_ptes() mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings() mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range() mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration() sh/mm/cache: use folio_mapped() in copy_from_user_page()
[PATCH v1 02/18] mm/rmap: always inline anon/file rmap duplication of a single PTE
As we grow the code, the compiler might make stupid decisions and unnecessarily degrade fork() performance. Let's make sure to always inline functions that operate on a single PTE so the compiler will always optimize out the loop and avoid a function call. This is a preparation for maintaining a total mapcount for large folios. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 9bf9324214fc..9549d78928bb 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -347,8 +347,12 @@ static inline void folio_dup_file_rmap_ptes(struct folio *folio, { __folio_dup_file_rmap(folio, page, nr_pages, RMAP_LEVEL_PTE); } -#define folio_dup_file_rmap_pte(folio, page) \ - folio_dup_file_rmap_ptes(folio, page, 1) + +static __always_inline void folio_dup_file_rmap_pte(struct folio *folio, + struct page *page) +{ + __folio_dup_file_rmap(folio, page, 1, RMAP_LEVEL_PTE); +} /** * folio_dup_file_rmap_pmd - duplicate a PMD mapping of a page range of a folio @@ -448,8 +452,13 @@ static inline int folio_try_dup_anon_rmap_ptes(struct folio *folio, return __folio_try_dup_anon_rmap(folio, page, nr_pages, src_vma, RMAP_LEVEL_PTE); } -#define folio_try_dup_anon_rmap_pte(folio, page, vma) \ - folio_try_dup_anon_rmap_ptes(folio, page, 1, vma) + +static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio, + struct page *page, struct vm_area_struct *src_vma) +{ + return __folio_try_dup_anon_rmap(folio, page, 1, src_vma, +RMAP_LEVEL_PTE); +} /** * folio_try_dup_anon_rmap_pmd - try duplicating a PMD mapping of a page range -- 2.44.0
[PATCH v1 01/18] mm: allow for detecting underflows with page_mapcount() again
Commit 53277bcf126d ("mm: support page_mapcount() on page_has_type() pages") made it impossible to detect mapcount underflows by treating any negative raw mapcount value as a mapcount of 0. We perform such underflow checks in zap_present_folio_ptes() and zap_huge_pmd(), which would currently no longer trigger. Let's check against PAGE_MAPCOUNT_RESERVE instead by using page_type_has_type(), like page_has_type() would, so we can still catch some underflows. Fixes: 53277bcf126d ("mm: support page_mapcount() on page_has_type() pages") Signed-off-by: David Hildenbrand --- include/linux/mm.h | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ef34cf54c14f..0fb8a40f82dd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1229,11 +1229,10 @@ static inline void page_mapcount_reset(struct page *page) */ static inline int page_mapcount(struct page *page) { - int mapcount = atomic_read(&page->_mapcount) + 1; + int mapcount = atomic_read(&page->_mapcount); /* Handle page_has_type() pages */ - if (mapcount < 0) - mapcount = 0; + mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1; if (unlikely(PageCompound(page))) mapcount += folio_entire_mapcount(page_folio(page)); -- 2.44.0
[PATCH v1 03/18] mm/rmap: add fast-path for small folios when adding/removing/duplicating
Let's add a fast-path for small folios to all relevant rmap functions. Note that only RMAP_LEVEL_PTE applies. This is a preparation for tracking the mapcount of large folios in a single value. Signed-off-by: David Hildenbrand --- include/linux/rmap.h | 13 + mm/rmap.c| 26 -- 2 files changed, 29 insertions(+), 10 deletions(-) diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 9549d78928bb..327f1ca5a487 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -322,6 +322,11 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio, switch (level) { case RMAP_LEVEL_PTE: + if (!folio_test_large(folio)) { + atomic_inc(&page->_mapcount); + break; + } + do { atomic_inc(&page->_mapcount); } while (page++, --nr_pages > 0); @@ -405,6 +410,14 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio, if (PageAnonExclusive(page + i)) return -EBUSY; } + + if (!folio_test_large(folio)) { + if (PageAnonExclusive(page)) + ClearPageAnonExclusive(page); + atomic_inc(&page->_mapcount); + break; + } + do { if (PageAnonExclusive(page)) ClearPageAnonExclusive(page); diff --git a/mm/rmap.c b/mm/rmap.c index 56b313aa2ebf..4bde6d60db6c 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1172,15 +1172,18 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio, switch (level) { case RMAP_LEVEL_PTE: + if (!folio_test_large(folio)) { + nr = atomic_inc_and_test(&page->_mapcount); + break; + } + do { first = atomic_inc_and_test(&page->_mapcount); - if (first && folio_test_large(folio)) { + if (first) { first = atomic_inc_return_relaxed(mapped); - first = (first < ENTIRELY_MAPPED); + if (first < ENTIRELY_MAPPED) + nr++; } - - if (first) - nr++; } while (page++, --nr_pages > 0); break; case RMAP_LEVEL_PMD: @@ -1514,15 +1517,18 @@ static __always_inline void __folio_remove_rmap(struct folio *folio, switch (level) { case RMAP_LEVEL_PTE: + if (!folio_test_large(folio)) { + nr = atomic_add_negative(-1, &page->_mapcount); + break; + } + do { last = atomic_add_negative(-1, &page->_mapcount); - if (last && folio_test_large(folio)) { + if (last) { last = atomic_dec_return_relaxed(mapped); - last = (last < ENTIRELY_MAPPED); + if (last < ENTIRELY_MAPPED) + nr++; } - - if (last) - nr++; } while (page++, --nr_pages > 0); break; case RMAP_LEVEL_PMD: -- 2.44.0
[PATCH v1 04/18] mm: track mapcount of large folios in single value
Let's track the mapcount of large folios in a single value. The mapcount of a large folio currently corresponds to the sum of the entire mapcount and all page mapcounts. This sum is what we actually want to know in folio_mapcount() and it is also sufficient for implementing folio_mapped(). With PTE-mapped THP becoming more important and more widely used, we want to avoid looping over all pages of a folio just to obtain the mapcount of large folios. The comment "In the common case, avoid the loop when no pages mapped by PTE" in folio_total_mapcount() does no longer hold for mTHP that are always mapped by PTE. Further, we are planning on using folio_mapcount() more frequently, and might even want to remove page mapcounts for large folios in some kernel configs. Therefore, allow for reading the mapcount of large folios efficiently and atomically without looping over any pages. Maintain the mapcount also for hugetlb pages for simplicity. Use the new mapcount to implement folio_mapcount() and folio_mapped(). Make page_mapped() simply call folio_mapped(). We can now get rid of folio_large_is_mapped(). _nr_pages_mapped is now only used in rmap code and for debugging purposes. Keep folio_nr_pages_mapped() around, but document that its use should be limited to rmap internals and debugging purposes. This change implies one additional atomic add/sub whenever mapping/unmapping (parts of) a large folio. As we now batch RMAP operations for PTE-mapped THP during fork(), during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust the large mapcount for a PTE batch only once, the added overhead in the common case is small. Only when unmapping individual pages of a large folio (e.g., during COW), the overhead might be bigger in comparison, but it's essentially one additional atomic operation. Note that before the new mapcount would overflow, already our refcount would overflow: each mapping requires a folio reference. Extend the focumentation of folio_mapcount(). Signed-off-by: David Hildenbrand --- Documentation/mm/transhuge.rst | 12 +- include/linux/mm.h | 44 -- include/linux/mm_types.h | 5 ++-- include/linux/rmap.h | 10 mm/debug.c | 3 ++- mm/hugetlb.c | 4 ++-- mm/internal.h | 3 +++ mm/khugepaged.c| 2 +- mm/page_alloc.c| 4 mm/rmap.c | 34 +- 10 files changed, 62 insertions(+), 59 deletions(-) diff --git a/Documentation/mm/transhuge.rst b/Documentation/mm/transhuge.rst index 93c9239b9ebe..1ba0ad63246c 100644 --- a/Documentation/mm/transhuge.rst +++ b/Documentation/mm/transhuge.rst @@ -116,14 +116,14 @@ pages: succeeds on tail pages. - map/unmap of a PMD entry for the whole THP increment/decrement -folio->_entire_mapcount and also increment/decrement -folio->_nr_pages_mapped by ENTIRELY_MAPPED when _entire_mapcount -goes from -1 to 0 or 0 to -1. +folio->_entire_mapcount, increment/decrement folio->_large_mapcount +and also increment/decrement folio->_nr_pages_mapped by ENTIRELY_MAPPED +when _entire_mapcount goes from -1 to 0 or 0 to -1. - map/unmap of individual pages with PTE entry increment/decrement -page->_mapcount and also increment/decrement folio->_nr_pages_mapped -when page->_mapcount goes from -1 to 0 or 0 to -1 as this counts -the number of pages mapped by PTE. +page->_mapcount, increment/decrement folio->_large_mapcount and also +increment/decrement folio->_nr_pages_mapped when page->_mapcount goes +from -1 to 0 or 0 to -1 as this counts the number of pages mapped by PTE. 
split_huge_page internally has to distribute the refcounts in the head page to the tail pages before clearing all PG_head/tail bits from the page diff --git a/include/linux/mm.h b/include/linux/mm.h index 0fb8a40f82dd..1862a216af15 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1239,16 +1239,26 @@ static inline int page_mapcount(struct page *page) return mapcount; } -int folio_total_mapcount(const struct folio *folio); +static inline int folio_large_mapcount(const struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_large(folio), folio); + return atomic_read(&folio->_large_mapcount) + 1; +} /** - * folio_mapcount() - Calculate the number of mappings of this folio. + * folio_mapcount() - Number of mappings of this folio. * @folio: The folio. * - * A large folio tracks both how many times the entire folio is mapped, - * and how many times each individual page in the folio is mapped. - * This function calculates the total number of times the folio is - * mapped. + * The folio mapcount corresponds to the number of present user page table + * entries that reference any part of a folio. Each such present user page + * table entry must be paired with exactly on folio reference. + * + * For or
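(Simplified before/after sketch, not verbatim kernel code, of what this buys for large folios: the old read side had to loop over every subpage, the new one is a single atomic read of the counter that the rmap add/remove paths now maintain.)

/* Before (roughly): sum the entire mapcount and every per-page _mapcount. */
static int folio_total_mapcount_old(struct folio *folio)
{
	int mapcount = folio_entire_mapcount(folio);
	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++)	/* O(nr_pages) atomic reads */
		mapcount += atomic_read(&folio_page(folio, i)->_mapcount) + 1;
	return mapcount;
}

/* After: one counter, read in O(1). */
static int folio_mapcount_new(struct folio *folio)
{
	if (!folio_test_large(folio))
		return atomic_read(&folio->_mapcount) + 1;
	return atomic_read(&folio->_large_mapcount) + 1;
}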
[PATCH v1 05/18] mm: improve folio_likely_mapped_shared() using the mapcount of large folios
We can now read the mapcount of large folios very efficiently. Use it to improve our handling of partially-mappable folios, falling back to making a guess only in case the folio is not "obviously mapped shared". We can now better detect partially-mappable folios where the first page is not mapped as "mapped shared", reducing "false negatives"; but false negatives are still possible. While at it, fixup a wrong comment (false positive vs. false negative) for KSM folios. Signed-off-by: David Hildenbrand --- include/linux/mm.h | 19 +-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 1862a216af15..daf687f0e8e5 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2183,7 +2183,7 @@ static inline size_t folio_size(struct folio *folio) * indicate "mapped shared" (false positive) when two VMAs in the same MM * cover the same file range. *#. For (small) KSM folios, the return value can wrongly indicate "mapped - * shared" (false negative), when the folio is mapped multiple times into + * shared" (false positive), when the folio is mapped multiple times into * the same MM. * * Further, this function only considers current page table mappings that @@ -2200,7 +2200,22 @@ static inline size_t folio_size(struct folio *folio) */ static inline bool folio_likely_mapped_shared(struct folio *folio) { - return page_mapcount(folio_page(folio, 0)) > 1; + int mapcount = folio_mapcount(folio); + + /* Only partially-mappable folios require more care. */ + if (!folio_test_large(folio) || unlikely(folio_test_hugetlb(folio))) + return mapcount > 1; + + /* A single mapping implies "mapped exclusively". */ + if (mapcount <= 1) + return false; + + /* If any page is mapped more than once we treat it "mapped shared". */ + if (folio_entire_mapcount(folio) || mapcount > folio_nr_pages(folio)) + return true; + + /* Let's guess based on the first subpage. */ + return atomic_read(&folio->_mapcount) > 0; } #ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE -- 2.44.0
[PATCH v1 06/18] mm: make folio_mapcount() return 0 for small typed folios
We already handle it properly for large folios. Let's also return "0" for small typed folios, like page_mapcount() currently would. Consequently, folio_mapcount() will never return negative values for typed folios, but may return negative values for underflows. Signed-off-by: David Hildenbrand --- include/linux/mm.h | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index daf687f0e8e5..d453232bba62 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1260,12 +1260,19 @@ static inline int folio_large_mapcount(const struct folio *folio) * references the entire folio counts exactly once, even when such special * page table entries are comprised of multiple ordinary page table entries. * + * Will report 0 for pages which cannot be mapped into userspace, such as + * slab, page tables and similar. + * * Return: The number of times this folio is mapped. */ static inline int folio_mapcount(const struct folio *folio) { - if (likely(!folio_test_large(folio))) - return atomic_read(&folio->_mapcount) + 1; + int mapcount; + + if (likely(!folio_test_large(folio))) { + mapcount = atomic_read(&folio->_mapcount); + return page_type_has_type(mapcount) ? 0 : mapcount + 1; + } return folio_large_mapcount(folio); } -- 2.44.0
[PATCH v1 07/18] mm/memory: use folio_mapcount() in zap_present_folio_ptes()
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In zap_present_folio_ptes(), let's simply check the folio mapcount(). If there is some issue, it will underflow at some point either way when unmapping. As indicated already in commit 10ebac4f95e7 ("mm/memory: optimize unmap/zap with PTE-mapped THP"), we already documented "If we ever have a cheap folio_mapcount(), we might just want to check for underflows there.". There is no change for small folios. For large folios, we'll now catch more underflows when batch-unmapping, because instead of only testing the mapcount of the first subpage, we'll test if the folio mapcount underflows. Signed-off-by: David Hildenbrand --- mm/memory.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 78422d1c7381..178492efb4af 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1502,8 +1502,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb, if (!delay_rmap) { folio_remove_rmap_ptes(folio, page, nr, vma); - /* Only sanity-check the first page in a batch. */ - if (unlikely(page_mapcount(page) < 0)) + if (unlikely(folio_mapcount(folio) < 0)) print_bad_pte(vma, addr, ptent, page); } -- 2.44.0
[PATCH v1 08/18] mm/huge_memory: use folio_mapcount() in zap_huge_pmd() sanity check
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's similarly check for folio_mapcount() underflows instead of page_mapcount() underflows like we do in zap_present_folio_ptes() now. Instead of the VM_BUG_ON(), we should actually be doing something like print_bad_pte(). For now, let's keep it simple and use WARN_ON_ONCE(), performing that check independently of DEBUG_VM. Signed-off-by: David Hildenbrand --- mm/huge_memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d8d2ed80b0bf..68ac27d229ef 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1851,7 +1851,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, folio = page_folio(page); folio_remove_rmap_pmd(folio, page, vma); - VM_BUG_ON_PAGE(page_mapcount(page) < 0, page); + WARN_ON_ONCE(folio_mapcount(folio) < 0); VM_BUG_ON_PAGE(!PageHead(page), page); } else if (thp_migration_supported()) { swp_entry_t entry; -- 2.44.0
[PATCH v1 09/18] mm/memory-failure: use folio_mapcount() in hwpoison_user_mappings()
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. We can only unmap full folios; page_mapped(), which we check here, is translated to folio_mapped() -- based on folio_mapcount(). So let's print the folio mapcount instead. Signed-off-by: David Hildenbrand --- mm/memory-failure.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 88359a185c5f..ee2f4b8905ef 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1628,8 +1628,8 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn, unmap_success = !page_mapped(p); if (!unmap_success) - pr_err("%#lx: failed to unmap page (mapcount=%d)\n", - pfn, page_mapcount(p)); + pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n", + pfn, folio_mapcount(page_folio(p))); /* * try_to_unmap() might put mlocked page in lru cache, so call -- 2.44.0
[PATCH v1 10/18] mm/page_alloc: use folio_mapped() in __alloc_contig_migrate_range()
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. For tracing purposes, we use page_mapcount() in __alloc_contig_migrate_range(). Adding that mapcount to total_mapped sounds strange: total_migrated and total_reclaimed would count each page only once, not multiple times. But then, isolate_migratepages_range() adds each folio only once to the list. So for large folios, we would query the mapcount of the first page of the folio, which doesn't make too much sense for large folios. Let's simply use folio_mapped() * folio_nr_pages(), which makes more sense as nr_migratepages is also incremented by the number of pages in the folio in case of successful migration. Signed-off-by: David Hildenbrand --- mm/page_alloc.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 393366d4a704..40fc0f60e021 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6389,8 +6389,12 @@ int __alloc_contig_migrate_range(struct compact_control *cc, if (trace_mm_alloc_contig_migrate_range_info_enabled()) { total_reclaimed += nr_reclaimed; - list_for_each_entry(page, &cc->migratepages, lru) - total_mapped += page_mapcount(page); + list_for_each_entry(page, &cc->migratepages, lru) { + struct folio *folio = page_folio(page); + + total_mapped += folio_mapped(folio) * + folio_nr_pages(folio); + } } ret = migrate_pages(&cc->migratepages, alloc_migration_target, -- 2.44.0
[PATCH v1 11/18] mm/migrate: use folio_likely_mapped_shared() in add_page_for_migration()
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. In add_page_for_migration(), we actually want to check if the folio is mapped shared, to reject such folios. So let's use folio_likely_mapped_shared() instead. For small folios, fully mapped THP, and hugetlb folios, there is no change. For partially mapped, shared THP, we should now do a better job at rejecting such folios. Signed-off-by: David Hildenbrand --- mm/migrate.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/migrate.c b/mm/migrate.c index 285072bca29c..d87ce32645d4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2140,7 +2140,7 @@ static int add_page_for_migration(struct mm_struct *mm, const void __user *p, goto out_putfolio; err = -EACCES; - if (page_mapcount(page) > 1 && !migrate_all) + if (folio_likely_mapped_shared(folio) && !migrate_all) goto out_putfolio; err = -EBUSY; -- 2.44.0
[PATCH v1 12/18] sh/mm/cache: use folio_mapped() in copy_from_user_page()
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. We're already using folio_mapped in copy_user_highpage() and copy_to_user_page() for a similar purpose so ... let's also simply use it for copy_from_user_page(). There is no change for small folios. Likely we won't stumble over many large folios on sh in that code either way. Signed-off-by: David Hildenbrand --- arch/sh/mm/cache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/sh/mm/cache.c b/arch/sh/mm/cache.c index 9bcaa5619eab..d8be352e14d2 100644 --- a/arch/sh/mm/cache.c +++ b/arch/sh/mm/cache.c @@ -84,7 +84,7 @@ void copy_from_user_page(struct vm_area_struct *vma, struct page *page, { struct folio *folio = page_folio(page); - if (boot_cpu_data.dcache.n_aliases && page_mapcount(page) && + if (boot_cpu_data.dcache.n_aliases && folio_mapped(folio) && test_bit(PG_dcache_clean, &folio->flags)) { void *vfrom = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK); memcpy(dst, vfrom, len); -- 2.44.0
[PATCH v1 13/18] mm/filemap: use folio_mapcount() in filemap_unaccount_folio()
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's use folio_mapcount() instead of page_mapcount() in filemap_unaccount_folio(). No functional change intended, because we're only dealing with small folios. Signed-off-by: David Hildenbrand --- mm/filemap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/filemap.c b/mm/filemap.c index c668e11cd6ef..d4aa82ad5b59 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -168,7 +168,7 @@ static void filemap_unaccount_folio(struct address_space *mapping, add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE); if (mapping_exiting(mapping) && !folio_test_large(folio)) { - int mapcount = page_mapcount(&folio->page); + int mapcount = folio_mapcount(folio); if (folio_ref_count(folio) >= mapcount + 2) { /* -- 2.44.0
[PATCH v1 14/18] mm/migrate_device: use folio_mapcount() in migrate_vma_check_page()
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. Let's convert migrate_vma_check_page() to work on a folio internally so we can remove the page_mapcount() usage. Note that we reject any large folios. There is a lot more folio conversion to be had, but that has to wait for another day. No functional change intended. Signed-off-by: David Hildenbrand --- mm/migrate_device.c | 12 +++- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/mm/migrate_device.c b/mm/migrate_device.c index d40b46ae9d65..b929b450b77c 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -324,6 +324,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate) */ static bool migrate_vma_check_page(struct page *page, struct page *fault_page) { + struct folio *folio = page_folio(page); + /* * One extra ref because caller holds an extra reference, either from * isolate_lru_page() for a regular page, or migrate_vma_collect() for @@ -336,18 +338,18 @@ static bool migrate_vma_check_page(struct page *page, struct page *fault_page) * check them than regular pages, because they can be mapped with a pmd * or with a pte (split pte mapping). */ - if (PageCompound(page)) + if (folio_test_large(folio)) return false; /* Page from ZONE_DEVICE have one extra reference */ - if (is_zone_device_page(page)) + if (folio_is_zone_device(folio)) extra++; /* For file back page */ - if (page_mapping(page)) - extra += 1 + page_has_private(page); + if (folio_mapping(folio)) + extra += 1 + folio_has_private(folio); - if ((page_count(page) - extra) > page_mapcount(page)) + if ((folio_ref_count(folio) - extra) > folio_mapcount(folio)) return false; return true; -- 2.44.0
[PATCH v1 15/18] trace/events/page_ref: trace the raw page mapcount value
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. We already trace raw page->refcount, raw page->flags and raw page->mapping, and don't involve any folios. Let's also trace the raw mapcount value that does not consider the entire mapcount of large folios, and we don't add "1" to it. When dealing with typed folios, this makes a lot more sense. ... and it's for debugging purposes only either way. Signed-off-by: David Hildenbrand --- include/trace/events/page_ref.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/trace/events/page_ref.h b/include/trace/events/page_ref.h index 8a99c1cd417b..fe33a255b7d0 100644 --- a/include/trace/events/page_ref.h +++ b/include/trace/events/page_ref.h @@ -30,7 +30,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_template, __entry->pfn = page_to_pfn(page); __entry->flags = page->flags; __entry->count = page_ref_count(page); - __entry->mapcount = page_mapcount(page); + __entry->mapcount = atomic_read(&page->_mapcount); __entry->mapping = page->mapping; __entry->mt = get_pageblock_migratetype(page); __entry->val = v; @@ -79,7 +79,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_and_test_template, __entry->pfn = page_to_pfn(page); __entry->flags = page->flags; __entry->count = page_ref_count(page); - __entry->mapcount = page_mapcount(page); + __entry->mapcount = atomic_read(&page->_mapcount); __entry->mapping = page->mapping; __entry->mt = get_pageblock_migratetype(page); __entry->val = v; -- 2.44.0
[PATCH v1 16/18] xtensa/mm: convert check_tlb_entry() to sanity check folios
We want to limit the use of page_mapcount() to the places where it is absolutely necessary. So let's convert check_tlb_entry() to perform sanity checks on folios instead of pages. This essentially already happened: page_count() is mapped to folio_ref_count(), and page_mapped() to folio_mapped() internally. However, we would have printed the page_mapcount(), which does not really match what page_mapped() would have checked. Let's simply print the folio mapcount to avoid using page_mapcount(). For small folios there is no change. Signed-off-by: David Hildenbrand --- arch/xtensa/mm/tlb.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c index 4f974b74883c..d8b60d6e50a8 100644 --- a/arch/xtensa/mm/tlb.c +++ b/arch/xtensa/mm/tlb.c @@ -256,12 +256,13 @@ static int check_tlb_entry(unsigned w, unsigned e, bool dtlb) dtlb ? 'D' : 'I', w, e, r0, r1, pte); if (pte == 0 || !pte_present(__pte(pte))) { struct page *p = pfn_to_page(r1 >> PAGE_SHIFT); - pr_err("page refcount: %d, mapcount: %d\n", - page_count(p), - page_mapcount(p)); - if (!page_count(p)) + struct folio *f = page_folio(p); + + pr_err("folio refcount: %d, mapcount: %d\n", + folio_ref_count(f), folio_mapcount(f)); + if (!folio_ref_count(f)) rc |= TLB_INSANE; - else if (page_mapcount(p)) + else if (folio_mapped(f)) rc |= TLB_SUSPICIOUS; } else { rc |= TLB_INSANE; -- 2.44.0
[PATCH v1 17/18] mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()
Let's simplify and only print the page mapcount: we already print the large folio mapcount and the entire folio mapcount for large folios separately; that should be sufficient to figure out what's happening. While at it, print the page mapcount also if it had an underflow, filtering out only typed pages. Signed-off-by: David Hildenbrand --- mm/debug.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/mm/debug.c b/mm/debug.c index d064db42af54..69e524c3e601 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -55,15 +55,10 @@ static void __dump_folio(struct folio *folio, struct page *page, unsigned long pfn, unsigned long idx) { struct address_space *mapping = folio_mapping(folio); - int mapcount = atomic_read(&page->_mapcount) + 1; + int mapcount = atomic_read(&page->_mapcount); char *type = ""; - /* Open-code page_mapcount() to avoid looking up a stale folio */ - if (mapcount < 0) - mapcount = 0; - if (folio_test_large(folio)) - mapcount += folio_entire_mapcount(folio); - + mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1; pr_warn("page: refcount:%d mapcount:%d mapping:%p index:%#lx pfn:%#lx\n", folio_ref_count(folio), mapcount, mapping, folio->index + idx, pfn); -- 2.44.0
[PATCH v1 18/18] Documentation/admin-guide/cgroup-v1/memory.rst: don't reference page_mapcount()
Let's stop talking about page_mapcount(). Signed-off-by: David Hildenbrand --- Documentation/admin-guide/cgroup-v1/memory.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 46110e6a31bb..9cde26d33843 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -802,8 +802,8 @@ a page or a swap can be moved only when it is charged to the task's current | | anonymous pages, file pages (and swaps) in the range mmapped by the task | | | will be moved even if the task hasn't done page fault, i.e. they might | | | not be the task's "RSS", but other task's "RSS" that maps the same file. | -| | And mapcount of the page is ignored (the page can be moved even if | -| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | +| | The mapcount of the page is ignored (the page can be moved independent | +| | of the mapcount). You must enable Swap Extension (see 2.4) to | | | enable move of swap charges. | +---+--+ -- 2.44.0
Re: [PATCH v1 01/18] mm: allow for detecting underflows with page_mapcount() again
On 9 Apr 2024, at 15:22, David Hildenbrand wrote: > Commit 53277bcf126d ("mm: support page_mapcount() on page_has_type() > pages") made it impossible to detect mapcount underflows by treating > any negative raw mapcount value as a mapcount of 0. > > We perform such underflow checks in zap_present_folio_ptes() and > zap_huge_pmd(), which would currently no longer trigger. > > Let's check against PAGE_MAPCOUNT_RESERVE instead by using > page_type_has_type(), like page_has_type() would, so we can still catch > some underflows. > > Fixes: 53277bcf126d ("mm: support page_mapcount() on page_has_type() pages") > Signed-off-by: David Hildenbrand > --- > include/linux/mm.h | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ef34cf54c14f..0fb8a40f82dd 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1229,11 +1229,10 @@ static inline void page_mapcount_reset(struct page > *page) > */ > static inline int page_mapcount(struct page *page) > { > - int mapcount = atomic_read(&page->_mapcount) + 1; > + int mapcount = atomic_read(&page->_mapcount); > > /* Handle page_has_type() pages */ > - if (mapcount < 0) > - mapcount = 0; > + mapcount = page_type_has_type(mapcount) ? 0 : mapcount + 1; > if (unlikely(PageCompound(page))) > mapcount += folio_entire_mapcount(page_folio(page)); LGTM. This could be picked up separately. Reviewed-by: Zi Yan -- Best Regards, Yan, Zi
Re: [PATCH v1 04/18] mm: track mapcount of large folios in single value
On 9 Apr 2024, at 15:22, David Hildenbrand wrote: > Let's track the mapcount of large folios in a single value. The mapcount of > a large folio currently corresponds to the sum of the entire mapcount and > all page mapcounts. > > This sum is what we actually want to know in folio_mapcount() and it is > also sufficient for implementing folio_mapped(). > > With PTE-mapped THP becoming more important and more widely used, we want > to avoid looping over all pages of a folio just to obtain the mapcount > of large folios. The comment "In the common case, avoid the loop when no > pages mapped by PTE" in folio_total_mapcount() does no longer hold for > mTHP that are always mapped by PTE. > > Further, we are planning on using folio_mapcount() more > frequently, and might even want to remove page mapcounts for large > folios in some kernel configs. Therefore, allow for reading the mapcount of > large folios efficiently and atomically without looping over any pages. > > Maintain the mapcount also for hugetlb pages for simplicity. Use the new > mapcount to implement folio_mapcount() and folio_mapped(). Make > page_mapped() simply call folio_mapped(). We can now get rid of > folio_large_is_mapped(). > > _nr_pages_mapped is now only used in rmap code and for debugging > purposes. Keep folio_nr_pages_mapped() around, but document that its use > should be limited to rmap internals and debugging purposes. > > This change implies one additional atomic add/sub whenever > mapping/unmapping (parts of) a large folio. > > As we now batch RMAP operations for PTE-mapped THP during fork(), > during unmap/zap, and when PTE-remapping a PMD-mapped THP, and we adjust > the large mapcount for a PTE batch only once, the added overhead in the > common case is small. Only when unmapping individual pages of a large folio > (e.g., during COW), the overhead might be bigger in comparison, but it's > essentially one additional atomic operation. > > Note that before the new mapcount would overflow, already our refcount > would overflow: each mapping requires a folio reference. Extend the > focumentation of folio_mapcount(). s/focumentation/documentation/ ;) -- Best Regards, Yan, Zi
Re: [PATCH bpf-next 1/2] bpf: Introduce bpf_probe_write_user_registered()
On Tue, Apr 9, 2024 at 1:11 AM Marco Elver wrote: > > On Mon, 8 Apr 2024 at 20:24, Alexei Starovoitov > wrote: > > > > On Mon, Apr 8, 2024 at 2:30 AM Marco Elver wrote: > > > > > > On Fri, 5 Apr 2024 at 22:28, Andrii Nakryiko > > > wrote: > > > > > > > > On Fri, Apr 5, 2024 at 1:28 AM Marco Elver wrote: > > > > > > > > > > On Fri, 5 Apr 2024 at 01:23, Alexei Starovoitov > > > > > wrote: > > > [...] > > > > > > and the tasks can use mmaped array shared across all or unique to > > > > > > each > > > > > > process. > > > > > > And both bpf and user space can read/write them with a single > > > > > > instruction. > > > > > > > > > > That's BPF_F_MMAPABLE, right? > > > > > > > > > > That does not work because the mmapped region is global. Our > > > > > requirements are: > > > > It sounds not like "requirements", but a description of the proposed > > solution. > > These requirements can also be implemented differently, e.g. with the > proposed task-local storage maps if done right (the devil appears to > be in the implementation-details - these details are currently beyond > my knowledge of the BPF subsystem internals). They really are the bare > minimum requirements for the use case. The proposed solution probably > happens to be the simplest one, and mapping requirements is relatively > straightforward for it. > > > Pls share the actual use case. > > This "tracing prog" sounds more like a ghost scheduler that > > wants to interact with known user processes. > > It's tcmalloc - we have a BPF program provide a simpler variant of the > "thread re-scheduling" notifications discussed here: > https://lore.kernel.org/all/CACT4Y+beLh1qnHF9bxhMUcva8KyuvZs7Mg_31SGK5xSoR=3...@mail.gmail.com/ > > Various synchronization algorithms can be optimized if they know about > scheduling events. This has been proposed in [1] to implement an > adaptive mutex. However, we think that there are many more > possibilities, but it really depends on the kind of scheduler state > being exposed ("thread on CPU" as in [1], or more details, like > details about which thread was switched in, which was switched out, > where was the thread migrated to, etc.). Committing to these narrow > use case ABIs and extending the main kernel with more and more such > information does not scale. Instead, BPF easily allows to expose this > information where it's required. > > [1] > https://lore.kernel.org/all/20230529191416.53955-1-mathieu.desnoy...@efficios.com/ Thanks for the links. That's very helpful. I was about to mention rseq. I think it's a better fit, but Mathieu doesn't want non-uapi bits in it. Ideally we could have added pointer + size to rseq as a scratch area that bpf prog can read/write. User space would register such area, kernel would pin pages, and bpf would directly access that. Pretty much the same idea of bpf local storage, but without bpf map. rseq has its pros and cons. Having a bpf local storage map to manage such pinned pages provides separation of concerns. And it's up to tcmalloc whether to use one such map for all tasks or map for every tcmalloc instance. Such scratch areas are per-task per-map. rseq dictates per-task only, but no concerns with setup thanks to glibc integration. > > The main issue with bpf_probe_read/write_user() is their non-determinism. > > They will error when memory is swapped out. > > Right. Although, user space mlock'ing and prefaulting the memory > solves it in the common case (some corner cases like after fork() are > still tricky). yeah. gets tricky indeed. 
With bpf local storage map managing pinned pages the map_fd with auto-close property addresses unpinning. No need to hack into fork-ing exit-ing paths, but tcmalloc would need to re-register the areas into bpf local storage after fork... guessing. > > These helpers are ok-ish for observability when consumers understand > > that some events might be lost, but for 24/7 production use > > losing reads becomes a problem that bpf prog cannot mitigate. > > What do bpf prog suppose to do when this safer bpf_probe_write_user errors? > > Use some other mechanism to communicate with user space? > > Right, for use cases where these errors aren't ok it does not work. > But if the algorithm is tolerant to lossy reads/writes from the BPF > program side, it's not an issue. Reading Dmitry Vyukov's comments in that thread... looks like he's also concerned with the 'probe' part of rseq api. > > A mechanism with such builtin randomness in behavior is a footgun for > > bpf users. > > We have bpf_copy_from_user*() that don't have this non-determinism. > > We can introduce bpf_copy_to_user(), but it will be usable > > from sleepable bpf prog. > > While it sounds you need it somewhere where scheduler makes decisions, > > so I suspect bpf array or arena is a better fit. > > Correct, it can't sleep because it wants scheduler events. > > > Or something that extends bpf local storage map. > > See long discussion: > > https://lore.kernel.org/bpf/45878586-cc5f-435
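(Illustrative user-space sketch of the mlock + prefault mitigation mentioned above, so that bpf_probe_read/write_user() on the registered region does not hit swapped-out pages in the common case. The registration step at the end refers to the prctl() proposed in this series and is only hinted at; error handling is trimmed.)

#include <string.h>
#include <sys/mman.h>

static void *register_scratch_area(size_t size)
{
	void *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED)
		return NULL;

	mlock(area, size);	/* keep the pages resident */
	memset(area, 0, size);	/* prefault every page */

	/* ... then register it with the kernel, e.g. via the proposed prctl(). */
	return area;
}

(As noted in the thread, corner cases like fork() remain tricky: the child's copy is copy-on-write and memory locks are not inherited.)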
Re: [PATCH v1 01/18] mm: allow for detecting underflows with page_mapcount() again
On Tue, Apr 09, 2024 at 09:22:44PM +0200, David Hildenbrand wrote: > Commit 53277bcf126d ("mm: support page_mapcount() on page_has_type() > pages") made it impossible to detect mapcount underflows by treating > any negative raw mapcount value as a mapcount of 0. Yes, but I don't think this is the right place to check for underflow. We should be checking for that on modification, not on read. I think it's more important for page_mapcount() to be fast than a debugging aid.