[PATCH v2 0/6] remove tlb_remove_page_ptdesc()

2025-02-24 Thread Qi Zheng
Changes in v2:
 - add [PATCH v2 2/6] (Peter Zijlstra)
 - remove [PATCH 4/5] and add [PATCH v2 5/6]
 - rebase onto the next-20250224

Hi all,

As suggested by Peter Zijlstra below [1], this series aims to remove
tlb_remove_page_ptdesc().

: Fundamentally tlb_remove_page() is about removing *pages* as from a PTE,
: there should not be a page-table anywhere near here *ever*.
:
: Yes, some architectures use tlb_remove_page() for page-tables too, but
: that is more or less an implementation detail that can be fixed.

After this series, all architectures use tlb_remove_table() or 
tlb_remove_ptdesc()
to remove the page table pages. In the future, once all architectures using
tlb_remove_table() have also converted to using struct ptdesc (eg. powerpc), it
may be possible to use only tlb_remove_ptdesc().

This series is based on next-20250224.

Comments and suggestions are welcome!

Thanks,
Qi

[1]. 
https://lore.kernel.org/linux-mm/20250103111457.gc22...@noisy.programming.kicks-ass.net/

Qi Zheng (6):
  mm: pgtable: make generic tlb_remove_table() use struct ptdesc
  mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct
ptdesc *
  mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
  riscv: pgtable: unconditionally use tlb_remove_ptdesc()
  x86: pgtable: convert to use tlb_remove_ptdesc()
  mm: pgtable: remove tlb_remove_page_ptdesc()

 arch/csky/include/asm/pgalloc.h  |  3 +--
 arch/hexagon/include/asm/pgalloc.h   |  3 +--
 arch/loongarch/include/asm/pgalloc.h |  3 +--
 arch/m68k/include/asm/sun3_pgalloc.h |  3 +--
 arch/mips/include/asm/pgalloc.h  |  3 +--
 arch/nios2/include/asm/pgalloc.h |  9 -
 arch/openrisc/include/asm/pgalloc.h  |  3 +--
 arch/riscv/include/asm/pgalloc.h | 26 --
 arch/sh/include/asm/pgalloc.h|  3 +--
 arch/um/include/asm/pgalloc.h|  9 +++--
 arch/x86/mm/pgtable.c|  8 
 include/asm-generic/tlb.h| 14 --
 12 files changed, 26 insertions(+), 61 deletions(-)

-- 
2.20.1




[PATCH v2 3/6] mm: pgtable: convert some architectures to use tlb_remove_ptdesc()

2025-02-24 Thread Qi Zheng
Now, the nine architectures of csky, hexagon, loongarch, m68k, mips,
nios2, openrisc, sh and um do not select CONFIG_MMU_GATHER_RCU_TABLE_FREE,
and just call pagetable_dtor() + tlb_remove_page_ptdesc() (the wrapper of
tlb_remove_page()). This is the same as the implementation of
tlb_remove_{ptdesc|table}() under !CONFIG_MMU_GATHER_TABLE_FREE, so
convert these architectures to use tlb_remove_ptdesc().

The ultimate goal is to make the architecture only use tlb_remove_ptdesc()
or tlb_remove_table() for page table pages.

Signed-off-by: Qi Zheng 
Suggested-by: Peter Zijlstra (Intel) 
---
 arch/csky/include/asm/pgalloc.h  | 3 +--
 arch/hexagon/include/asm/pgalloc.h   | 3 +--
 arch/loongarch/include/asm/pgalloc.h | 3 +--
 arch/m68k/include/asm/sun3_pgalloc.h | 3 +--
 arch/mips/include/asm/pgalloc.h  | 3 +--
 arch/nios2/include/asm/pgalloc.h | 9 -
 arch/openrisc/include/asm/pgalloc.h  | 3 +--
 arch/sh/include/asm/pgalloc.h| 3 +--
 arch/um/include/asm/pgalloc.h| 9 +++--
 9 files changed, 14 insertions(+), 25 deletions(-)

diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
index bf8400c28b5a3..9d2b50265a8d8 100644
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -63,8 +63,7 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #define __pte_free_tlb(tlb, pte, address)  \
 do {   \
-   pagetable_dtor(page_ptdesc(pte));   \
-   tlb_remove_page_ptdesc(tlb, page_ptdesc(pte));  \
+   tlb_remove_ptdesc((tlb), page_ptdesc(pte)); \
 } while (0)
 
 extern void pagetable_init(void);
diff --git a/arch/hexagon/include/asm/pgalloc.h 
b/arch/hexagon/include/asm/pgalloc.h
index 1ee5f5f157ca7..3d35d2bc42534 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -89,8 +89,7 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, 
pmd_t *pmd,
 
 #define __pte_free_tlb(tlb, pte, addr) \
 do {   \
-   pagetable_dtor((page_ptdesc(pte))); \
-   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
+   tlb_remove_ptdesc((tlb), page_ptdesc(pte)); \
 } while (0)
 
 #endif
diff --git a/arch/loongarch/include/asm/pgalloc.h 
b/arch/loongarch/include/asm/pgalloc.h
index 7211dff8c969e..ac026146e7e95 100644
--- a/arch/loongarch/include/asm/pgalloc.h
+++ b/arch/loongarch/include/asm/pgalloc.h
@@ -57,8 +57,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct 
*mm)
 
 #define __pte_free_tlb(tlb, pte, address)  \
 do {   \
-   pagetable_dtor(page_ptdesc(pte));   \
-   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
+   tlb_remove_ptdesc((tlb), page_ptdesc(pte)); \
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h 
b/arch/m68k/include/asm/sun3_pgalloc.h
index 80afc3a187249..ddc24812f1832 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -19,8 +19,7 @@ extern const char bad_pmd_string[];
 
 #define __pte_free_tlb(tlb, pte, addr) \
 do {   \
-   pagetable_dtor(page_ptdesc(pte));   \
-   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
+   tlb_remove_ptdesc((tlb), page_ptdesc(pte)); \
 } while (0)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t 
*pte)
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index 26c7a6ede983c..7e73d2f913dd4 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -50,8 +50,7 @@ extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 #define __pte_free_tlb(tlb, pte, address)  \
 do {   \
-   pagetable_dtor(page_ptdesc(pte));   \
-   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
+   tlb_remove_ptdesc((tlb), page_ptdesc(pte)); \
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index 12a536b7bfbd4..4b4a1766e2cc7 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -28,10 +28,9 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmd,
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr) \
-   do {\
-   pagetable_dtor(page_ptdesc(pte));   \
-   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \

[PATCH v2 1/6] mm: pgtable: make generic tlb_remove_table() use struct ptdesc

2025-02-24 Thread Qi Zheng
Now only arm will call tlb_remove_ptdesc()/tlb_remove_table() when
CONFIG_MMU_GATHER_TABLE_FREE is disabled. In this case, the type of the
table parameter is actually struct ptdesc * instead of struct page *.

Since struct ptdesc still overlaps with struct page and has not been
separated from it, forcing the table parameter to struct page * will not
cause any problems at this time. But this is definitely incorrect and
needs to be fixed. So just like the generic __tlb_remove_table(), let
generic tlb_remove_table() use struct ptdesc by default when
CONFIG_MMU_GATHER_TABLE_FREE is disabled.

Signed-off-by: Qi Zheng 
---
 include/asm-generic/tlb.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b35b36fa7aabf..54f579750c8e5 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -232,10 +232,10 @@ static inline void tlb_remove_page(struct mmu_gather 
*tlb, struct page *page);
  */
 static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
-   struct page *page = (struct page *)table;
+   struct ptdesc *ptdesc = (struct ptdesc *)table;
 
-   pagetable_dtor(page_ptdesc(page));
-   tlb_remove_page(tlb, page);
+   pagetable_dtor(ptdesc);
+   tlb_remove_page(tlb, ptdesc_page(ptdesc));
 }
 #endif /* CONFIG_MMU_GATHER_TABLE_FREE */
 
-- 
2.20.1




[PATCH v2 2/6] mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc *

2025-02-24 Thread Qi Zheng
All callers of tlb_remove_ptdesc() pass it a pointer of struct ptdesc, so
let's change the pt parameter from void * to struct ptdesc * to perform
a type safety check.

Signed-off-by: Qi Zheng 
Originally-by: Peter Zijlstra (Intel) 
---
 include/asm-generic/tlb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 54f579750c8e5..18bf499ef8801 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -506,7 +506,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, 
struct page *page)
return tlb_remove_page_size(tlb, page, PAGE_SIZE);
 }
 
-static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
+static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
 {
tlb_remove_table(tlb, pt);
 }
-- 
2.20.1




[PATCH v2 5/6] x86: pgtable: convert to use tlb_remove_ptdesc()

2025-02-24 Thread Qi Zheng
The x86 has already been converted to use struct ptdesc, so convert it to
use tlb_remove_ptdesc() instead of tlb_remove_table().

Signed-off-by: Qi Zheng 
---
 arch/x86/mm/pgtable.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index b1c1f72c1fd1b..f28ddac0f734a 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -45,7 +45,7 @@ early_param("userpte", setup_userpte);
 void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
paravirt_release_pte(page_to_pfn(pte));
-   tlb_remove_table(tlb, page_ptdesc(pte));
+   tlb_remove_ptdesc(tlb, page_ptdesc(pte));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
@@ -59,21 +59,21 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 #ifdef CONFIG_X86_PAE
tlb->need_flush_all = 1;
 #endif
-   tlb_remove_table(tlb, virt_to_ptdesc(pmd));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(pmd));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
 void ___pud_free_tlb(struct mmu_gather *tlb, pud_t *pud)
 {
paravirt_release_pud(__pa(pud) >> PAGE_SHIFT);
-   tlb_remove_table(tlb, virt_to_ptdesc(pud));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(pud));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 4
 void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 {
paravirt_release_p4d(__pa(p4d) >> PAGE_SHIFT);
-   tlb_remove_table(tlb, virt_to_ptdesc(p4d));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d));
 }
 #endif /* CONFIG_PGTABLE_LEVELS > 4 */
 #endif /* CONFIG_PGTABLE_LEVELS > 3 */
-- 
2.20.1




[PATCH v2 6/6] mm: pgtable: remove tlb_remove_page_ptdesc()

2025-02-24 Thread Qi Zheng
The tlb_remove_ptdesc()/tlb_remove_table() is specially designed for page
table pages, and now all architectures have been converted to use it to
remove page table pages. So let's remove tlb_remove_page_ptdesc(), it
currently has no users and should not be used for page table pages.

Signed-off-by: Qi Zheng 
Suggested-by: Peter Zijlstra (Intel) 
---
 include/asm-generic/tlb.h | 6 --
 1 file changed, 6 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 18bf499ef8801..dd292c6d3ce88 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -511,12 +511,6 @@ static inline void tlb_remove_ptdesc(struct mmu_gather 
*tlb, struct ptdesc *pt)
tlb_remove_table(tlb, pt);
 }
 
-/* Like tlb_remove_ptdesc, but for page-like page directories. */
-static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct 
ptdesc *pt)
-{
-   tlb_remove_page(tlb, ptdesc_page(pt));
-}
-
 static inline void tlb_change_page_size(struct mmu_gather *tlb,
 unsigned int page_size)
 {
-- 
2.20.1




[PATCH v2 4/6] riscv: pgtable: unconditionally use tlb_remove_ptdesc()

2025-02-24 Thread Qi Zheng
To support fast gup, the commit 69be3fb111e7 ("riscv: enable
MMU_GATHER_RCU_TABLE_FREE for SMP && MMU") did the following:

1) use tlb_remove_page_ptdesc() for those platforms which use IPI to
   perform TLB shootdown

2) use tlb_remove_ptdesc() for those platforms which use SBI to perform
   TLB shootdown

The tlb_remove_page_ptdesc() is the wrapper of the tlb_remove_page(). By
design, the tlb_remove_page() should be used to remove a normal page from
a page table entry, and should not be used for page table pages.

The tlb_remove_ptdesc() is the wrapper of the tlb_remove_table(), which is
designed specifically for freeing page table pages. If the
CONFIG_MMU_GATHER_TABLE_FREE is enabled, the tlb_remove_table() will use
semi RCU to free page table pages, that is:

 - batch table freeing: asynchronous free by RCU
 - single table freeing: IPI + synchronous free

If the CONFIG_MMU_GATHER_TABLE_FREE is disabled, the tlb_remove_table()
will fall back to pagetable_dtor() + tlb_remove_page().

For case 1), since we need to perform TLB shootdown before freeing the
page table page, the local_irq_save() in fast gup can block the freeing
and protect the fast gup page walker. Therefore we can ensure safety by
just using tlb_remove_page_ptdesc(). In addition, we can also the
tlb_remove_ptdesc()/tlb_remove_table() to achieve it, and it doesn't
matter whether CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected. And in
theory, the performance of freeing pages asynchronously via RCU will not
be lower than synchronous free.

For case 2), since local_irq_save() only disable S-privilege IPI irq but
not M-privilege's, which is used by the SBI implementation to perform TLB
shootdown, so we must select CONFIG_MMU_GATHER_RCU_TABLE_FREE and use
tlb_remove_ptdesc() to ensure safety. The riscv selects this config for
SMP && MMU, the CONFIG_RISCV_SBI is dependent on MMU. Therefore, only the
UP system may have the situation where CONFIG_MMU_GATHER_RCU_TABLE_FREE is
disabled but CONFIG_RISCV_SBI is enabled. But there is no freeing vs fast
gup race in the UP system.

So, in summary, we can use tlb_remove_ptdesc() to support fast gup in all
cases, and this interface is specifically designed for page table pages.
So let's use it unconditionally.

Signed-off-by: Qi Zheng 
Suggested-by: Peter Zijlstra (Intel) 
---
 arch/riscv/include/asm/pgalloc.h | 26 --
 1 file changed, 4 insertions(+), 22 deletions(-)

diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 3e2aebea6312d..770ce18a7328b 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -15,24 +15,6 @@
 #define __HAVE_ARCH_PUD_FREE
 #include 
 
-/*
- * While riscv platforms with riscv_ipi_for_rfence as true require an IPI to
- * perform TLB shootdown, some platforms with riscv_ipi_for_rfence as false use
- * SBI to perform TLB shootdown. To keep software pagetable walkers safe in 
this
- * case we switch to RCU based table free (MMU_GATHER_RCU_TABLE_FREE). See the
- * comment below 'ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE' in 
include/asm-generic/tlb.h
- * for more details.
- */
-static inline void riscv_tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
-{
-   if (riscv_use_sbi_for_rfence()) {
-   tlb_remove_ptdesc(tlb, pt);
-   } else {
-   pagetable_dtor(pt);
-   tlb_remove_page_ptdesc(tlb, pt);
-   }
-}
-
 static inline void pmd_populate_kernel(struct mm_struct *mm,
pmd_t *pmd, pte_t *pte)
 {
@@ -108,14 +90,14 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, 
pud_t *pud,
  unsigned long addr)
 {
if (pgtable_l4_enabled)
-   riscv_tlb_remove_ptdesc(tlb, virt_to_ptdesc(pud));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(pud));
 }
 
 static inline void __p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d,
  unsigned long addr)
 {
if (pgtable_l5_enabled)
-   riscv_tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(p4d));
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
@@ -143,7 +125,7 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
  unsigned long addr)
 {
-   riscv_tlb_remove_ptdesc(tlb, virt_to_ptdesc(pmd));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(pmd));
 }
 
 #endif /* __PAGETABLE_PMD_FOLDED */
@@ -151,7 +133,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, 
pmd_t *pmd,
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
  unsigned long addr)
 {
-   riscv_tlb_remove_ptdesc(tlb, page_ptdesc(pte));
+   tlb_remove_ptdesc(tlb, page_ptdesc(pte));
 }
 #endif /* CONFIG_MMU */
 
-- 
2.20.1




[PATCH 8/9] um: pass FD for memory operations when needed

2025-02-24 Thread Benjamin Berg
From: Benjamin Berg 

Instead of always sharing the FDs with the userspace process, only hand
over the FDs needed for mmap when required. The idea is that userspace
might be able to force the stub into executing an mmap syscall, however,
it will not be able to manipulate the control flow sufficiently to have
access to an FD that would allow mapping arbitrary memory.

Security wise, we need to be sure that only the expected syscalls are
executed after the kernel sends FDs through the socket. This is
currently not the case, as userspace can trivially jump to the
rt_sigreturn syscall instruction to execute any syscall that the stub is
permitted to do. With this, it can trick the kernel to send the FD,
which in turn allows userspace to freely map any physical memory.

As such, this is currently *not* secure. However, in principle the
approach should be fine with a more strict SECCOMP filter and a careful
review of the stub control flow (as userspace can prepare a stack). With
some care, it is likely possible to extend the security model to SMP if
desired.

Signed-off-by: Benjamin Berg 

---

v1:
* Add startup and runtime checks for close_range
---
 arch/um/include/shared/skas/mm_id.h |  11 +++
 arch/um/include/shared/skas/stub-data.h |   1 +
 arch/um/kernel/skas/mmu.c   |   3 +
 arch/um/kernel/skas/stub.c  |  87 ++--
 arch/um/kernel/skas/stub_exe.c  |  40 +---
 arch/um/os-Linux/skas/mem.c |  66 -
 arch/um/os-Linux/skas/process.c | 126 +---
 arch/um/os-Linux/start_up.c |  10 +-
 8 files changed, 284 insertions(+), 60 deletions(-)

diff --git a/arch/um/include/shared/skas/mm_id.h 
b/arch/um/include/shared/skas/mm_id.h
index 0654c57bb28e..f2d4c383c958 100644
--- a/arch/um/include/shared/skas/mm_id.h
+++ b/arch/um/include/shared/skas/mm_id.h
@@ -6,10 +6,21 @@
 #ifndef __MM_ID_H
 #define __MM_ID_H
 
+#ifdef CONFIG_UML_SECCOMP
+#define STUB_MAX_FDS 4
+#else
+#define STUB_MAX_FDS 0
+#endif
+
 struct mm_id {
int pid;
unsigned long stack;
int syscall_data_len;
+
+   /* Only used with SECCOMP mode */
+   int sock;
+   int syscall_fd_num;
+   int syscall_fd_map[STUB_MAX_FDS];
 };
 
 void __switch_mm(struct mm_id *mm_idp);
diff --git a/arch/um/include/shared/skas/stub-data.h 
b/arch/um/include/shared/skas/stub-data.h
index 675f1a0a1390..c261a77a32f6 100644
--- a/arch/um/include/shared/skas/stub-data.h
+++ b/arch/um/include/shared/skas/stub-data.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define FUTEX_IN_CHILD 0
 #define FUTEX_IN_KERN 1
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 0abc509e3f4c..b31d27e7810e 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -78,6 +78,9 @@ void destroy_context(struct mm_struct *mm)
mmu->id.pid = -1;
}
 
+   if (using_seccomp && mmu->id.sock)
+   os_close_file(mmu->id.sock);
+
free_pages(mmu->id.stack, ilog2(STUB_DATA_PAGES));
 
guard(spinlock_irqsave)(&mm_list_lock);
diff --git a/arch/um/kernel/skas/stub.c b/arch/um/kernel/skas/stub.c
index ec644cd82bbe..e92aefe06da8 100644
--- a/arch/um/kernel/skas/stub.c
+++ b/arch/um/kernel/skas/stub.c
@@ -7,24 +7,54 @@
 
 #ifdef CONFIG_UML_SECCOMP
 #include 
+#include 
 #include 
 #endif
 
-static __always_inline int syscall_handler(struct stub_data *d)
+/*
+ * Known security issues
+ *
+ * Userspace can jump to this address to execute *any* syscall that is
+ * permitted by the stub. As we will return afterwards, it can do
+ * whatever it likes, including:
+ * - Tricking the kernel into handing out the memory FD
+ * - Using this memory FD to read/write all physical memory
+ * - Running in parallel to the kernel processing a syscall
+ *   (possibly creating data races?)
+ * - Blocking e.g. SIGALRM to avoid time based scheduling
+ *
+ * To avoid this, the permitted location for each syscall needs to be
+ * checked for in the SECCOMP filter (which is reasonably simple). Also,
+ * more care will need to go into considerations how the code might be
+ * tricked by using a prepared stack (or even modifying the stack from
+ * another thread in case SMP support is added).
+ *
+ * As for the SIGALRM, the best counter measure will be to check in the
+ * kernel that the process is reporting back the SIGALRM in a timely
+ * fashion.
+ */
+static __always_inline int syscall_handler(int fd_map[STUB_MAX_FDS])
 {
+   struct stub_data *d = get_stub_data();
int i;
unsigned long res;
+   int fd;
 
for (i = 0; i < d->syscall_data_len; i++) {
struct stub_syscall *sc = &d->syscall_data[i];
 
switch (sc->syscall) {
case STUB_SYSCALL_MMAP:
+   if (fd_map)
+   fd = fd_map[sc->mem.fd];
+   else
+   fd = sc->mem.fd;
+

[PATCH 7/9] um: Implement kernel side of SECCOMP based process handling

2025-02-24 Thread Benjamin Berg
This adds the kernel side of the seccomp based process handling.

Co-authored-by: Johannes Berg 
Signed-off-by: Benjamin Berg 
Signed-off-by: Benjamin Berg 

---

v1:
- Fix FUTEX_WAIT EINTR handling
- Don't send fatal_sigsegv when waiting during child startup
---
 arch/um/include/shared/common-offsets.h|   2 +
 arch/um/include/shared/os.h|   2 +-
 arch/um/include/shared/skas/stub-data.h|   5 +-
 arch/um/kernel/skas/mmu.c  |   7 +-
 arch/um/kernel/skas/stub_exe.c | 141 +++-
 arch/um/os-Linux/internal.h|   5 +-
 arch/um/os-Linux/skas/mem.c|  37 +-
 arch/um/os-Linux/skas/process.c| 375 +++--
 arch/x86/um/shared/sysdep/kernel-offsets.h |   2 +
 arch/x86/um/tls_32.c   |  23 +-
 10 files changed, 460 insertions(+), 139 deletions(-)

diff --git a/arch/um/include/shared/common-offsets.h 
b/arch/um/include/shared/common-offsets.h
index 93e7097a2922..8ca66a1918c3 100644
--- a/arch/um/include/shared/common-offsets.h
+++ b/arch/um/include/shared/common-offsets.h
@@ -16,3 +16,5 @@ DEFINE(UM_NSEC_PER_SEC, NSEC_PER_SEC);
 DEFINE(UM_NSEC_PER_USEC, NSEC_PER_USEC);
 
 DEFINE(UM_KERN_GDT_ENTRY_TLS_ENTRIES, GDT_ENTRY_TLS_ENTRIES);
+
+DEFINE(UM_SECCOMP_ARCH_NATIVE, SECCOMP_ARCH_NATIVE);
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index cb213c9dbeff..4e53b96de296 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -283,7 +283,7 @@ int unmap(struct mm_id *mm_idp, unsigned long addr, 
unsigned long len);
 
 /* skas/process.c */
 extern int is_skas_winch(int pid, int fd, void *data);
-extern int start_userspace(unsigned long stub_stack);
+extern int start_userspace(struct mm_id *mm_id);
 extern void userspace(struct uml_pt_regs *regs);
 extern void new_thread(void *stack, jmp_buf *buf, void (*handler)(void));
 extern void switch_threads(jmp_buf *me, jmp_buf *you);
diff --git a/arch/um/include/shared/skas/stub-data.h 
b/arch/um/include/shared/skas/stub-data.h
index 81ac2cd12112..675f1a0a1390 100644
--- a/arch/um/include/shared/skas/stub-data.h
+++ b/arch/um/include/shared/skas/stub-data.h
@@ -17,6 +17,8 @@
 #define FUTEX_IN_KERN 1
 
 struct stub_init_data {
+   int seccomp;
+
unsigned long stub_start;
 
int stub_code_fd;
@@ -24,7 +26,8 @@ struct stub_init_data {
int stub_data_fd;
unsigned long stub_data_offset;
 
-   unsigned long segv_handler;
+   unsigned long signal_handler;
+   unsigned long signal_restorer;
 };
 
 #define STUB_NEXT_SYSCALL(s) \
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index f8ee5d612c47..0abc509e3f4c 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -38,14 +38,11 @@ int init_new_context(struct task_struct *task, struct 
mm_struct *mm)
scoped_guard(spinlock_irqsave, &mm_list_lock) {
/* Insert into list, used for lookups when the child dies */
list_add(&mm->context.list, &mm_list);
-
}
 
-   new_id->pid = start_userspace(stack);
-   if (new_id->pid < 0) {
-   ret = new_id->pid;
+   ret = start_userspace(new_id);
+   if (ret < 0)
goto out_free;
-   }
 
/* Ensure the new MM is clean and nothing unwanted is mapped */
unmap(new_id, 0, STUB_START);
diff --git a/arch/um/kernel/skas/stub_exe.c b/arch/um/kernel/skas/stub_exe.c
index 23c99b285e82..f40f2332b676 100644
--- a/arch/um/kernel/skas/stub_exe.c
+++ b/arch/um/kernel/skas/stub_exe.c
@@ -3,6 +3,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 void _start(void);
 
@@ -25,8 +28,6 @@ noinline static void real_init(void)
} sa = {
/* Need to set SA_RESTORER (but the handler never returns) */
.sa_flags = SA_ONSTACK | SA_NODEFER | SA_SIGINFO | 0x0400,
-   /* no need to mask any signals */
-   .sa_mask = 0,
};
 
/* set a nice name */
@@ -35,6 +36,9 @@ noinline static void real_init(void)
/* Make sure this process dies if the kernel dies */
stub_syscall2(__NR_prctl, PR_SET_PDEATHSIG, SIGKILL);
 
+   /* Needed in SECCOMP mode (and safe to do anyway) */
+   stub_syscall5(__NR_prctl, PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+
/* read information from STDIN and close it */
res = stub_syscall3(__NR_read, 0,
(unsigned long)&init_data, sizeof(init_data));
@@ -63,18 +67,133 @@ noinline static void real_init(void)
stack.ss_sp = (void *)init_data.stub_start + UM_KERN_PAGE_SIZE;
stub_syscall2(__NR_sigaltstack, (unsigned long)&stack, 0);
 
-   /* register SIGSEGV handler */
-   sa.sa_handler_ = (void *) init_data.segv_handler;
-   res = stub_syscall4(__NR_rt_sigaction, SIGSEGV, (unsigned long)&sa, 0,
-   sizeof(sa.sa_mask));
-   if (res != 0)
-   stub_syscall1(

[PATCH 6/9] um: Track userspace children dying in SECCOMP mode

2025-02-24 Thread Benjamin Berg
When in seccomp mode, we would hang forever on the futex if a child has
died unexpectedly. In contrast, ptrace mode will notice it and kill the
corresponding thread when it fails to run it.

Fix this issue using a new IRQ that is fired after a SIGCHLD and keeping
an (internal) list of all MMs. In the IRQ handler, find the affected MM
and set its PID to -1 as well as the futex variable to FUTEX_IN_KERN.

This, together with futex returning -EINTR after the signal is
sufficient to implement a race-free detection of a child dying.

Note that this also enables IRQ handling while starting a userspace
process. This should be safe and SECCOMP requires the IRQ in case the
process does not come up properly.

Signed-off-by: Benjamin Berg 
Signed-off-by: Benjamin Berg 

---

v1:
- Permit IRQs during startup to enable detection there

RFCv2:
- Use "struct list_head" for the list by placing it into the mm_context
---
 arch/um/include/asm/irq.h   |  5 +-
 arch/um/include/asm/mmu.h   |  3 ++
 arch/um/include/shared/irq_user.h   |  1 +
 arch/um/include/shared/os.h |  1 +
 arch/um/include/shared/skas/mm_id.h |  2 +
 arch/um/kernel/irq.c|  6 +++
 arch/um/kernel/skas/mmu.c   | 83 +++--
 arch/um/os-Linux/process.c  | 31 +++
 arch/um/os-Linux/signal.c   | 19 ++-
 9 files changed, 143 insertions(+), 8 deletions(-)

diff --git a/arch/um/include/asm/irq.h b/arch/um/include/asm/irq.h
index 749dfe8512e8..36dbedd1af48 100644
--- a/arch/um/include/asm/irq.h
+++ b/arch/um/include/asm/irq.h
@@ -13,17 +13,18 @@
 #define TELNETD_IRQ8
 #define XTERM_IRQ  9
 #define RANDOM_IRQ 10
+#define SIGCHLD_IRQ11
 
 #ifdef CONFIG_UML_NET_VECTOR
 
-#define VECTOR_BASE_IRQ(RANDOM_IRQ + 1)
+#define VECTOR_BASE_IRQ(SIGCHLD_IRQ + 1)
 #define VECTOR_IRQ_SPACE   8
 
 #define UM_FIRST_DYN_IRQ (VECTOR_IRQ_SPACE + VECTOR_BASE_IRQ)
 
 #else
 
-#define UM_FIRST_DYN_IRQ (RANDOM_IRQ + 1)
+#define UM_FIRST_DYN_IRQ (SIGCHLD_IRQ + 1)
 
 #endif
 
diff --git a/arch/um/include/asm/mmu.h b/arch/um/include/asm/mmu.h
index a3eaca41ff61..4d0e4239f3cc 100644
--- a/arch/um/include/asm/mmu.h
+++ b/arch/um/include/asm/mmu.h
@@ -6,11 +6,14 @@
 #ifndef __ARCH_UM_MMU_H
 #define __ARCH_UM_MMU_H
 
+#include "linux/types.h"
 #include 
 
 typedef struct mm_context {
struct mm_id id;
 
+   struct list_head list;
+
/* Address range in need of a TLB sync */
unsigned long sync_tlb_range_from;
unsigned long sync_tlb_range_to;
diff --git a/arch/um/include/shared/irq_user.h 
b/arch/um/include/shared/irq_user.h
index da0f6eea30d0..53a1f0651b96 100644
--- a/arch/um/include/shared/irq_user.h
+++ b/arch/um/include/shared/irq_user.h
@@ -16,6 +16,7 @@ enum um_irq_type {
 
 struct siginfo;
 extern void sigio_handler(int sig, struct siginfo *unused_si, struct 
uml_pt_regs *regs);
+extern void sigchld_handler(int sig, struct siginfo *unused_si, struct 
uml_pt_regs *regs);
 void sigio_run_timetravel_handlers(void);
 extern void free_irq_by_fd(int fd);
 extern void deactivate_fd(int fd, int irqnum);
diff --git a/arch/um/include/shared/os.h b/arch/um/include/shared/os.h
index 5babad8c5f75..cb213c9dbeff 100644
--- a/arch/um/include/shared/os.h
+++ b/arch/um/include/shared/os.h
@@ -198,6 +198,7 @@ extern int create_mem_file(unsigned long long len);
 extern void report_enomem(void);
 
 /* process.c */
+pid_t os_reap_child(void);
 extern void os_alarm_process(int pid);
 extern void os_kill_process(int pid, int reap_child);
 extern void os_kill_ptraced_process(int pid, int reap_child);
diff --git a/arch/um/include/shared/skas/mm_id.h 
b/arch/um/include/shared/skas/mm_id.h
index 140388c282f6..0654c57bb28e 100644
--- a/arch/um/include/shared/skas/mm_id.h
+++ b/arch/um/include/shared/skas/mm_id.h
@@ -14,4 +14,6 @@ struct mm_id {
 
 void __switch_mm(struct mm_id *mm_idp);
 
+void notify_mm_kill(int pid);
+
 #endif
diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c
index a4991746f5ea..7161bde0c8c1 100644
--- a/arch/um/kernel/irq.c
+++ b/arch/um/kernel/irq.c
@@ -689,3 +689,9 @@ void __init init_IRQ(void)
/* Initialize EPOLL Loop */
os_setup_epoll();
 }
+
+extern void sigchld_handler(int sig, struct siginfo *unused_si,
+   struct uml_pt_regs *regs)
+{
+   do_IRQ(SIGCHLD_IRQ, regs);
+}
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 0eb5a1d3ba70..f8ee5d612c47 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -19,6 +20,9 @@
 /* Ensure the stub_data struct covers the allocated area */
 static_assert(sizeof(struct stub_data) == STUB_DATA_PAGES * UM_KERN_PAGE_SIZE);
 
+spinlock_t mm_list_lock;
+struct list_head mm_list;
+
 int init_new_context(struct task_struct *task, struct mm_struct *mm)
 {
struct mm_id *new_id = 

[PATCH 0/9] SECCOMP based userspace for UML

2025-02-24 Thread Benjamin Berg
From: Benjamin Berg 

Hi all,

another version of the SECCOMP patchset. I think that this should now be
good enough for general consumption. Compared to the last RFC version
there is an important bugfix that caused a SIGSEGV loop and various
other small bugfixes and cleanups.

The patchset adds a new userspace handling mode to UML that is based on
a SECCOMP filter and trusted code within each userspace process.

The motivation the new SECCOMP mode is that it saves context switches
when handling pagefaults and for syscalls like mmap. The approach may
also permit SMP support in the future and might make it easier to port
UML to further host architectures.

Benjamin

v1:
- Remove explicit (and insufficient) kconfig.h includes
- Change commit order to move configuration to the end
- Fix futex wait race condition
- Also handle child dying during stub startup

RFCv2:
- Fix FP handling on i386
- Improved MM list for userspace sigchild handling
- Remove kconfig.h includes
- Minor cleanups

Benjamin Berg (9):
  um: Store full CSGSFS and SS register from mcontext
  um: Move faultinfo extraction into userspace routine
  um: Add stub side of SECCOMP/futex based process handling
  um: Add helper functions to get/set state for SECCOMP
  um: Add SECCOMP support detection and initialization
  um: Track userspace children dying in SECCOMP mode
  um: Implement kernel side of SECCOMP based process handling
  um: pass FD for memory operations when needed
  um: Add UML_SECCOMP configuration option

 arch/um/Kconfig|  19 +
 arch/um/include/asm/irq.h  |   5 +-
 arch/um/include/asm/mmu.h  |   3 +
 arch/um/include/shared/common-offsets.h|   4 +
 arch/um/include/shared/irq_user.h  |   1 +
 arch/um/include/shared/os.h|   3 +-
 arch/um/include/shared/skas/mm_id.h|  13 +
 arch/um/include/shared/skas/skas.h |   5 +
 arch/um/include/shared/skas/stub-data.h|  20 +-
 arch/um/kernel/irq.c   |   6 +
 arch/um/kernel/skas/mmu.c  |  89 +++-
 arch/um/kernel/skas/stub.c | 134 +-
 arch/um/kernel/skas/stub_exe.c | 159 ++-
 arch/um/os-Linux/internal.h|   5 +-
 arch/um/os-Linux/process.c |  31 ++
 arch/um/os-Linux/registers.c   |   4 +-
 arch/um/os-Linux/signal.c  |  19 +-
 arch/um/os-Linux/skas/mem.c| 103 -
 arch/um/os-Linux/skas/process.c| 485 +++--
 arch/um/os-Linux/start_up.c| 150 ++-
 arch/x86/um/os-Linux/mcontext.c| 223 +-
 arch/x86/um/ptrace.c   |  76 +++-
 arch/x86/um/shared/sysdep/kernel-offsets.h |   2 +
 arch/x86/um/shared/sysdep/mcontext.h   |   9 +
 arch/x86/um/shared/sysdep/stub-data.h  |  23 +
 arch/x86/um/shared/sysdep/stub.h   |   2 +
 arch/x86/um/shared/sysdep/stub_32.h|  13 +
 arch/x86/um/shared/sysdep/stub_64.h|  17 +
 arch/x86/um/tls_32.c   |  23 +-
 29 files changed, 1439 insertions(+), 207 deletions(-)
 create mode 100644 arch/x86/um/shared/sysdep/stub-data.h

-- 
2.48.1




[PATCH 1/9] um: Store full CSGSFS and SS register from mcontext

2025-02-24 Thread Benjamin Berg
Doing this allows using registers as retrieved from an mcontext to be
pushed to a process using PTRACE_SETREGS.

It is not entirely clear to me why CSGSFS was masked. Doing so creates
issues when using the mcontext as process state in seccomp and simply
copying the register appears to work perfectly fine for ptrace.

Signed-off-by: Benjamin Berg 
---
 arch/x86/um/os-Linux/mcontext.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/um/os-Linux/mcontext.c b/arch/x86/um/os-Linux/mcontext.c
index e80ab7d28117..1b0d95328b2c 100644
--- a/arch/x86/um/os-Linux/mcontext.c
+++ b/arch/x86/um/os-Linux/mcontext.c
@@ -27,7 +27,6 @@ void get_regs_from_mc(struct uml_pt_regs *regs, mcontext_t 
*mc)
COPY(RIP);
COPY2(EFLAGS, EFL);
COPY2(CS, CSGSFS);
-   regs->gp[CS / sizeof(unsigned long)] &= 0x;
-   regs->gp[CS / sizeof(unsigned long)] |= 3;
+   regs->gp[SS / sizeof(unsigned long)] = mc->gregs[REG_CSGSFS] >> 48;
 #endif
 }
-- 
2.48.1




[PATCH 5/9] um: Add SECCOMP support detection and initialization

2025-02-24 Thread Benjamin Berg
This detects seccomp support, sets the global using_seccomp variable and
initilizes the exec registers.

Signed-off-by: Benjamin Berg 
Signed-off-by: Benjamin Berg 
---
 arch/um/include/shared/skas/skas.h |   5 +
 arch/um/os-Linux/registers.c   |   4 +-
 arch/um/os-Linux/skas/process.c|   3 +
 arch/um/os-Linux/start_up.c| 146 -
 4 files changed, 154 insertions(+), 4 deletions(-)

diff --git a/arch/um/include/shared/skas/skas.h 
b/arch/um/include/shared/skas/skas.h
index 85c50122ab98..ff54aced05cc 100644
--- a/arch/um/include/shared/skas/skas.h
+++ b/arch/um/include/shared/skas/skas.h
@@ -8,6 +8,11 @@
 
 #include 
 
+#ifdef CONFIG_UML_SECCOMP
+extern int using_seccomp;
+#else
+#define using_seccomp 0
+#endif
 extern int userspace_pid[];
 
 extern void new_thread_handler(void);
diff --git a/arch/um/os-Linux/registers.c b/arch/um/os-Linux/registers.c
index d7ca148807b2..bfba2cbc9478 100644
--- a/arch/um/os-Linux/registers.c
+++ b/arch/um/os-Linux/registers.c
@@ -14,8 +14,8 @@
 
 /* This is set once at boot time and not changed thereafter */
 
-static unsigned long exec_regs[MAX_REG_NR];
-static unsigned long *exec_fp_regs;
+unsigned long exec_regs[MAX_REG_NR];
+unsigned long *exec_fp_regs;
 
 int init_pid_registers(int pid)
 {
diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index b9449f175684..2a492cfa5dd3 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -309,6 +309,9 @@ static int __init init_stub_exe_fd(void)
 }
 __initcall(init_stub_exe_fd);
 
+#ifdef CONFIG_UML_SECCOMP
+int using_seccomp;
+#endif
 int userspace_pid[NR_CPUS];
 
 /**
diff --git a/arch/um/os-Linux/start_up.c b/arch/um/os-Linux/start_up.c
index 93fc82c01aba..ab202ad430f6 100644
--- a/arch/um/os-Linux/start_up.c
+++ b/arch/um/os-Linux/start_up.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 /*
+ * Copyright (C) 2021 Benjamin Berg 
  * Copyright (C) 2000 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
  */
 
@@ -24,6 +25,15 @@
 #include 
 #include 
 #include 
+#ifdef CONFIG_UML_SECCOMP
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#endif
 #include 
 #include 
 #include "internal.h"
@@ -224,6 +234,128 @@ static void __init check_ptrace(void)
check_sysemu();
 }
 
+#ifdef CONFIG_UML_SECCOMP
+extern unsigned long host_fp_size;
+extern unsigned long exec_regs[MAX_REG_NR];
+extern unsigned long *exec_fp_regs;
+
+__initdata static struct stub_data *seccomp_test_stub_data;
+
+static void __init sigsys_handler(int sig, siginfo_t *info, void *p)
+{
+   ucontext_t *uc = p;
+
+   /* Stow away the location of the mcontext in the stack */
+   seccomp_test_stub_data->mctx_offset = (unsigned long)&uc->uc_mcontext -
+ (unsigned 
long)&seccomp_test_stub_data->sigstack[0];
+   exit(0);
+}
+
+static bool __init init_seccomp(void)
+{
+   int pid;
+   int status;
+   int n;
+
+   /* We check that we can install a seccomp filter and then exit(0)
+* from a trapped syscall.
+*
+* Note that we cannot verify that no seccomp filter already exists
+* for a syscall that results in the process/thread to be killed.
+*/
+
+   os_info("Checking that seccomp filters can be installed...");
+
+   seccomp_test_stub_data = mmap(0, sizeof(*seccomp_test_stub_data),
+ PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANON, 0, 0);
+
+   pid = fork();
+   if (pid == 0) {
+   static struct sock_filter filter[] = {
+   BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+   offsetof(struct seccomp_data, nr)),
+   BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 
__NR_clock_nanosleep, 1, 0),
+   BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
+   BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
+   };
+   static struct sock_fprog prog = {
+   .len = ARRAY_SIZE(filter),
+   .filter = filter,
+   };
+   struct sigaction sa;
+
+   set_sigstack(seccomp_test_stub_data->sigstack,
+sizeof(seccomp_test_stub_data->sigstack));
+
+   sa.sa_flags = SA_ONSTACK | SA_NODEFER | SA_SIGINFO;
+   sa.sa_sigaction = (void *) sigsys_handler;
+   sa.sa_restorer = NULL;
+   if (sigaction(SIGSYS, &sa, NULL) < 0)
+   exit(1);
+
+   prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+   if (syscall(__NR_seccomp, SECCOMP_SET_MODE_FILTER,
+   SECCOMP_FILTER_FLAG_TSYNC, &prog) != 0)
+   exit(2);
+
+   sleep(0);
+
+   /* Never reached. */
+   exit(3);
+   }
+
+   if (pid < 0)
+   fa

[PATCH 9/9] um: Add UML_SECCOMP configuration option

2025-02-24 Thread Benjamin Berg
Add the UML_SECCOMP configuration options.

Signed-off-by: Benjamin Berg 

---
v1:
- Move to the end

RFCv2:
- Remove "default n"
---
 arch/um/Kconfig | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index 18051b1cfce0..11ed4422593c 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -258,6 +258,25 @@ config KASAN_SHADOW_OFFSET
  set to a large value. On low-memory systems, try 0x7fff8000, as it 
fits
  into the immediate of most instructions, improving performance.
 
+config UML_SECCOMP
+   bool "SECCOMP based userspace"
+   help
+ With SECCOMP userspace processes work collaboratively with the kernel
+ instead of being traced using ptrace. All syscalls from the 
application
+ are caught and redirected using a signal. This signal handler in turn
+ is permitted to do the selected set of syscalls to communicate with
+ the UML kernel and do the required memory management.
+
+ This method is overall faster than the ptrace based userspace,
+ primarily because it reduces the number of context switches for
+ (minor) page faults.
+ However, the SECCOMP filter is not (yet) restrictive enough to prevent
+ userspace from reading and writing all physical memory. Userspace
+ processes could also trick the stub into disabling SIGALRM which
+ prevents it from being interrupted for scheduling purposes.
+
+ If in doubt say N, as the feature has security implications.
+
 endmenu
 
 source "arch/um/drivers/Kconfig"
-- 
2.48.1




[PATCH 2/9] um: Move faultinfo extraction into userspace routine

2025-02-24 Thread Benjamin Berg
The segv handler is called slightly differently depending on whether
PTRACE_FULL_FAULTINFO is set or not (32bit vs. 64bit). The only
difference is that we don't try to pass the registers and instruction
pointer to the segv handler.

It would be good to either document or remove the difference, but I do
not know why this difference exists. And, passing NULL can even result
in a crash.

Signed-off-by: Benjamin Berg 
---
 arch/um/os-Linux/skas/process.c | 17 ++---
 1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/arch/um/os-Linux/skas/process.c b/arch/um/os-Linux/skas/process.c
index e2f8f156402f..b9449f175684 100644
--- a/arch/um/os-Linux/skas/process.c
+++ b/arch/um/os-Linux/skas/process.c
@@ -163,12 +163,6 @@ static void get_skas_faultinfo(int pid, struct faultinfo 
*fi)
memcpy(fi, (void *)current_stub_stack(), sizeof(*fi));
 }
 
-static void handle_segv(int pid, struct uml_pt_regs *regs)
-{
-   get_skas_faultinfo(pid, ®s->faultinfo);
-   segv(regs->faultinfo, 0, 1, NULL);
-}
-
 static void handle_trap(int pid, struct uml_pt_regs *regs)
 {
if ((UPT_IP(regs) >= STUB_START) && (UPT_IP(regs) < STUB_END))
@@ -521,13 +515,14 @@ void userspace(struct uml_pt_regs *regs)
 
switch (sig) {
case SIGSEGV:
-   if (PTRACE_FULL_FAULTINFO) {
-   get_skas_faultinfo(pid,
-  ®s->faultinfo);
+   get_skas_faultinfo(pid, ®s->faultinfo);
+
+   if (PTRACE_FULL_FAULTINFO)
(*sig_info[SIGSEGV])(SIGSEGV, (struct 
siginfo *)&si,
 regs);
-   }
-   else handle_segv(pid, regs);
+   else
+   segv(regs->faultinfo, 0, 1, NULL);
+
break;
case SIGTRAP + 0x80:
handle_trap(pid, regs);
-- 
2.48.1




[PATCH 3/9] um: Add stub side of SECCOMP/futex based process handling

2025-02-24 Thread Benjamin Berg
This adds the stub side for the new seccomp process management code. In
this case we do register save/restore through the signal handler
mcontext.

Add special code for handling TLS, which for x86_64 means setting the
FS_BASE/GS_BASE registers while for i386 it means calling the
set_thread_area syscall.

Co-authored-by: Johannes Berg 
Signed-off-by: Benjamin Berg 
Signed-off-by: Benjamin Berg 

---

v1:
- Cleanup futex EINTR/EAGAIN handling

RFCv2:
- Add include guards into new architecture specific header file
---
 arch/um/include/shared/common-offsets.h |  2 +
 arch/um/include/shared/skas/stub-data.h | 14 +++
 arch/um/kernel/skas/stub.c  | 53 +
 arch/x86/um/shared/sysdep/stub-data.h   | 23 +++
 arch/x86/um/shared/sysdep/stub.h|  2 +
 arch/x86/um/shared/sysdep/stub_32.h | 13 ++
 arch/x86/um/shared/sysdep/stub_64.h | 17 
 7 files changed, 124 insertions(+)
 create mode 100644 arch/x86/um/shared/sysdep/stub-data.h

diff --git a/arch/um/include/shared/common-offsets.h 
b/arch/um/include/shared/common-offsets.h
index 73f3a4792ed8..93e7097a2922 100644
--- a/arch/um/include/shared/common-offsets.h
+++ b/arch/um/include/shared/common-offsets.h
@@ -14,3 +14,5 @@ DEFINE(UM_THREAD_SIZE, THREAD_SIZE);
 
 DEFINE(UM_NSEC_PER_SEC, NSEC_PER_SEC);
 DEFINE(UM_NSEC_PER_USEC, NSEC_PER_USEC);
+
+DEFINE(UM_KERN_GDT_ENTRY_TLS_ENTRIES, GDT_ENTRY_TLS_ENTRIES);
diff --git a/arch/um/include/shared/skas/stub-data.h 
b/arch/um/include/shared/skas/stub-data.h
index 81a4cace032c..81ac2cd12112 100644
--- a/arch/um/include/shared/skas/stub-data.h
+++ b/arch/um/include/shared/skas/stub-data.h
@@ -11,6 +11,10 @@
 #include 
 #include 
 #include 
+#include 
+
+#define FUTEX_IN_CHILD 0
+#define FUTEX_IN_KERN 1
 
 struct stub_init_data {
unsigned long stub_start;
@@ -52,6 +56,16 @@ struct stub_data {
/* 128 leaves enough room for additional fields in the struct */
struct stub_syscall syscall_data[(UM_KERN_PAGE_SIZE - 128) / 
sizeof(struct stub_syscall)] __aligned(16);
 
+   /* data shared with signal handler (only used in seccomp mode) */
+   short restart_wait;
+   unsigned int futex;
+   int signal;
+   unsigned short si_offset;
+   unsigned short mctx_offset;
+
+   /* seccomp architecture specific state restore */
+   struct stub_data_arch arch_data;
+
/* Stack for our signal handlers and for calling into . */
unsigned char sigstack[UM_KERN_PAGE_SIZE] __aligned(UM_KERN_PAGE_SIZE);
 };
diff --git a/arch/um/kernel/skas/stub.c b/arch/um/kernel/skas/stub.c
index 796fc266d3bb..ec644cd82bbe 100644
--- a/arch/um/kernel/skas/stub.c
+++ b/arch/um/kernel/skas/stub.c
@@ -5,6 +5,11 @@
 
 #include 
 
+#ifdef CONFIG_UML_SECCOMP
+#include 
+#include 
+#endif
+
 static __always_inline int syscall_handler(struct stub_data *d)
 {
int i;
@@ -57,3 +62,51 @@ stub_syscall_handler(void)
 
trap_myself();
 }
+
+#ifdef CONFIG_UML_SECCOMP
+void __section(".__syscall_stub")
+stub_signal_interrupt(int sig, siginfo_t *info, void *p)
+{
+   struct stub_data *d = get_stub_data();
+   ucontext_t *uc = p;
+   long res;
+
+   d->signal = sig;
+   d->si_offset = (unsigned long)info - (unsigned long)&d->sigstack[0];
+   d->mctx_offset = (unsigned long)&uc->uc_mcontext - (unsigned 
long)&d->sigstack[0];
+
+restart_wait:
+   d->futex = FUTEX_IN_KERN;
+   do {
+   res = stub_syscall3(__NR_futex, (unsigned long)&d->futex,
+   FUTEX_WAKE, 1);
+   } while (res == -EINTR);
+   do {
+   res = stub_syscall4(__NR_futex, (unsigned long)&d->futex,
+   FUTEX_WAIT, FUTEX_IN_KERN, 0);
+   } while (res == -EINTR || d->futex == FUTEX_IN_KERN);
+
+   if (res < 0 && res != -EAGAIN)
+   stub_syscall1(__NR_exit_group, 1);
+
+   /* Try running queued syscalls. */
+   if (syscall_handler(d) < 0 || d->restart_wait) {
+   /* Report SIGSYS if we restart. */
+   d->signal = SIGSYS;
+   d->restart_wait = 0;
+   goto restart_wait;
+   }
+
+   /* Restore arch dependent state that is not part of the mcontext */
+   stub_seccomp_restore_state(&d->arch_data);
+
+   /* Return so that the host modified mcontext is restored. */
+}
+
+void __section(".__syscall_stub")
+stub_signal_restorer(void)
+{
+   /* We must not have anything on the stack when doing rt_sigreturn */
+   stub_syscall0(__NR_rt_sigreturn);
+}
+#endif
diff --git a/arch/x86/um/shared/sysdep/stub-data.h 
b/arch/x86/um/shared/sysdep/stub-data.h
new file mode 100644
index ..82b1b7f8ac3d
--- /dev/null
+++ b/arch/x86/um/shared/sysdep/stub-data.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ARCH_STUB_DATA_H
+#define __ARCH_STUB_DATA_H
+
+#ifdef __i386__
+#include 
+#include 
+
+struct stub_data_arch {
+   int sync;
+   struct

[PATCH 4/9] um: Add helper functions to get/set state for SECCOMP

2025-02-24 Thread Benjamin Berg
When not using ptrace, we need to both save and restore registers
through the mcontext as provided by the host kernel to our signal
handlers.

Add corresponding functions to store the state to an mcontext and
helpers to access the mcontext of the subprocess through the stub data.

Signed-off-by: Benjamin Berg 
Signed-off-by: Benjamin Berg 

---

RFCv2:
- Proper FP register handling
---
 arch/x86/um/os-Linux/mcontext.c  | 220 ++-
 arch/x86/um/ptrace.c |  76 ++---
 arch/x86/um/shared/sysdep/mcontext.h |   9 ++
 3 files changed, 285 insertions(+), 20 deletions(-)

diff --git a/arch/x86/um/os-Linux/mcontext.c b/arch/x86/um/os-Linux/mcontext.c
index 1b0d95328b2c..84c4a1117b1a 100644
--- a/arch/x86/um/os-Linux/mcontext.c
+++ b/arch/x86/um/os-Linux/mcontext.c
@@ -1,9 +1,12 @@
 // SPDX-License-Identifier: GPL-2.0
-#include 
 #define __FRAME_OFFSETS
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
+#include 
 
 void get_regs_from_mc(struct uml_pt_regs *regs, mcontext_t *mc)
 {
@@ -17,6 +20,10 @@ void get_regs_from_mc(struct uml_pt_regs *regs, mcontext_t 
*mc)
COPY2(UESP, ESP); /* sic */
COPY(EBX); COPY(EDX); COPY(ECX); COPY(EAX);
COPY(EIP); COPY_SEG_CPL3(CS); COPY(EFL); COPY_SEG_CPL3(SS);
+#undef COPY2
+#undef COPY
+#undef COPY_SEG
+#undef COPY_SEG_CPL3
 #else
 #define COPY2(X,Y) regs->gp[X/sizeof(unsigned long)] = mc->gregs[REG_##Y]
 #define COPY(X) regs->gp[X/sizeof(unsigned long)] = mc->gregs[REG_##X]
@@ -28,5 +35,216 @@ void get_regs_from_mc(struct uml_pt_regs *regs, mcontext_t 
*mc)
COPY2(EFLAGS, EFL);
COPY2(CS, CSGSFS);
regs->gp[SS / sizeof(unsigned long)] = mc->gregs[REG_CSGSFS] >> 48;
+#undef COPY2
+#undef COPY
+#endif
+}
+
+#ifdef CONFIG_UML_SECCOMP
+/* Same thing, but the copy macros are turned around. */
+void get_mc_from_regs(struct uml_pt_regs *regs, mcontext_t *mc, int 
single_stepping)
+{
+#ifdef __i386__
+#define COPY2(X,Y) mc->gregs[REG_##Y] = regs->gp[X]
+#define COPY(X) mc->gregs[REG_##X] = regs->gp[X]
+#define COPY_SEG(X) mc->gregs[REG_##X] = regs->gp[X] & 0x;
+#define COPY_SEG_CPL3(X) mc->gregs[REG_##X] = (regs->gp[X] & 0x) | 3;
+   COPY_SEG(GS); COPY_SEG(FS); COPY_SEG(ES); COPY_SEG(DS);
+   COPY(EDI); COPY(ESI); COPY(EBP);
+   COPY2(UESP, ESP); /* sic */
+   COPY(EBX); COPY(EDX); COPY(ECX); COPY(EAX);
+   COPY(EIP); COPY_SEG_CPL3(CS); COPY(EFL); COPY_SEG_CPL3(SS);
+#else
+#define COPY2(X,Y) mc->gregs[REG_##Y] = regs->gp[X/sizeof(unsigned long)]
+#define COPY(X) mc->gregs[REG_##X] = regs->gp[X/sizeof(unsigned long)]
+   COPY(R8); COPY(R9); COPY(R10); COPY(R11);
+   COPY(R12); COPY(R13); COPY(R14); COPY(R15);
+   COPY(RDI); COPY(RSI); COPY(RBP); COPY(RBX);
+   COPY(RDX); COPY(RAX); COPY(RCX); COPY(RSP);
+   COPY(RIP);
+   COPY2(EFLAGS, EFL);
+   mc->gregs[REG_CSGSFS] = mc->gregs[REG_CSGSFS] & 0xl;
+   mc->gregs[REG_CSGSFS] |= (regs->gp[SS / sizeof(unsigned long)] & 
0x) << 48;
 #endif
+
+   if (single_stepping)
+   mc->gregs[REG_EFL] |= X86_EFLAGS_TF;
+   else
+   mc->gregs[REG_EFL] &= ~X86_EFLAGS_TF;
+}
+
+#ifdef CONFIG_X86_32
+struct _xstate_64 {
+   struct _fpstate_64  fpstate;
+   struct _header  xstate_hdr;
+   struct _ymmh_state  ymmh;
+   /* New processor state extensions go here: */
+};
+
+/* Not quite the right structures as these contain more information */
+int um_i387_from_fxsr(struct _fpstate_32 *i387,
+ const struct _fpstate_64 *fxsave);
+int um_fxsr_from_i387(struct _fpstate_64 *fxsave,
+ const struct _fpstate_32 *from);
+#else
+#define _xstate_64 _xstate
+#endif
+
+static struct _fpstate *get_fpstate(struct stub_data *data,
+   mcontext_t *mcontext,
+   int *fp_size)
+{
+   struct _fpstate *res;
+
+   /* Assume floating point registers are on the same page */
+   res = (void *)(((unsigned long)mcontext->fpregs &
+   (UM_KERN_PAGE_SIZE - 1)) +
+  (unsigned long)&data->sigstack[0]);
+
+   if ((void *)res + sizeof(struct _fpstate) >
+   (void *)data->sigstack + sizeof(data->sigstack))
+   return NULL;
+
+   if (res->sw_reserved.magic1 != FP_XSTATE_MAGIC1) {
+   *fp_size = sizeof(struct _fpstate);
+   } else {
+   char *magic2_addr;
+
+   magic2_addr = (void *)res;
+   magic2_addr += res->sw_reserved.extended_size;
+   magic2_addr -= FP_XSTATE_MAGIC2_SIZE;
+
+   /* We still need to be within our stack */
+   if ((void *)magic2_addr >
+   (void *)data->sigstack + sizeof(data->sigstack))
+   return NULL;
+
+   /* If we do not read MAGIC2, then we did something wrong */
+   if (*(__u32 *)m

RE: [PATCH 3/6] ceph: return the correct dentry on mkdir

2025-02-24 Thread Viacheslav Dubeyko
On Mon, 2025-02-24 at 13:15 +1100, NeilBrown wrote:
> On Fri, 21 Feb 2025, Viacheslav Dubeyko wrote:
> > On Fri, 2025-02-21 at 10:36 +1100, NeilBrown wrote:
> > > ceph already splices the correct dentry (in splice_dentry()) from the
> > > result of mkdir but does nothing more with it.
> > > 
> > > Now that ->mkdir can return a dentry, return the correct dentry.
> > > 
> > > Signed-off-by: NeilBrown 
> > > ---
> > >  fs/ceph/dir.c | 9 -
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > > index 39e0f240de06..c1a1c168bb27 100644
> > > --- a/fs/ceph/dir.c
> > > +++ b/fs/ceph/dir.c
> > > @@ -1099,6 +1099,7 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > > *idmap, struct inode *dir,
> > >   struct ceph_client *cl = mdsc->fsc->client;
> > >   struct ceph_mds_request *req;
> > >   struct ceph_acl_sec_ctx as_ctx = {};
> > > + struct dentry *ret = NULL;
> > 
> > I believe that it makes sense to initialize pointer by error here and always
> > return ret as output. If something goes wrong in the logic, then we already 
> > have
> > error.
> 
> I'm not certain that I understand, but I have made a change which seems
> to be consistent with the above and included it below.  Please let me
> know if it is what you intended.
> 
> > 
> > >   int err;
> > >   int op;
> > >  
> > > @@ -1166,14 +1167,20 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > > *idmap, struct inode *dir,
> > >   !req->r_reply_info.head->is_dentry)
> > >   err = ceph_handle_notrace_create(dir, dentry);
> > >  out_req:
> > > + if (!err && req->r_dentry != dentry)
> > > + /* Some other dentry was spliced in */
> > > + ret = dget(req->r_dentry);
> > >   ceph_mdsc_put_request(req);
> > >  out:
> > >   if (!err)
> > > + /* Should this use 'ret' ?? */
> > 
> > Could we make a decision should or shouldn't? :)
> > It looks not good to leave this comment instead of proper implementation. 
> > Do we
> > have some obstacles to make this decision?
> 
> I suspect we should use ret, but I didn't want to make a change which
> wasn't directly required by my needed.  So I highlighted this which
> looks to me like a possible bug, hoping that someone more familiar with
> the code would give an opinion.  Do you agree that 'ret' (i.e.
> ->r_dentry) should be used when ret is not NULL?
> 

I think if we are going to return ret as a dentry, then it makes sense to call
the ceph_init_inode_acls() for d_inode(ret). I don't see the point to call
ceph_init_inode_acls() for d_inode(dentry) then.

> > 
> > >   ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> > >   else
> > >   d_drop(dentry);
> > >   ceph_release_acl_sec_ctx(&as_ctx);
> > > - return ERR_PTR(err);
> > > + if (err)
> > > + return ERR_PTR(err);
> > > + return ret;
> > 
> > What's about this?
> > 
> > return err ? ERR_PTR(err) : ret;
> 
> We could do that, but you said above that you thought we should always
> return 'ret' - which does make some sense.
> 
> What do you think of the following alternate patch?
> 

Patch looks good to me. Thanks.

Reviewed-by: Viacheslav Dubeyko 

> Thanks,
> NeilBrown
> 
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 39e0f240de06..d2e5c557df83 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1099,6 +1099,7 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> *idmap, struct inode *dir,
>   struct ceph_client *cl = mdsc->fsc->client;
>   struct ceph_mds_request *req;
>   struct ceph_acl_sec_ctx as_ctx = {};
> + struct dentry *ret;
>   int err;
>   int op;
>  
> @@ -1116,32 +1117,32 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> *idmap, struct inode *dir,
> ceph_vinop(dir), dentry, dentry, mode);
>   op = CEPH_MDS_OP_MKDIR;
>   } else {
> - err = -EROFS;
> + ret = ERR_PTR(-EROFS);
>   goto out;
>   }
>  
>   if (op == CEPH_MDS_OP_MKDIR &&
>   ceph_quota_is_max_files_exceeded(dir)) {
> - err = -EDQUOT;
> + ret = ERR_PTR(-EDQUOT);
>   goto out;
>   }
>   if ((op == CEPH_MDS_OP_MKSNAP) && IS_ENCRYPTED(dir) &&
>   !fscrypt_has_encryption_key(dir)) {
> - err = -ENOKEY;
> + ret = ERR_PTR(-ENOKEY);
>   goto out;
>   }
>  
>  
>   req = ceph_mdsc_create_request(mdsc, op, USE_AUTH_MDS);
>   if (IS_ERR(req)) {
> - err = PTR_ERR(req);
> + ret = ERR_CAST(req);
>   goto out;
>   }
>  
>   mode |= S_IFDIR;
>   req->r_new_inode = ceph_new_inode(dir, dentry, &mode, &as_ctx);
>   if (IS_ERR(req->r_new_inode)) {
> - err = PTR_ERR(req->r_new_inode);
> + ret = ERR_CAST(req->r_new_inode);
>   req->r_new_inode = NULL;
>   goto out_req;
>   }
> @@ -1165,15 +1166,23 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> *idmap, s

RE: [PATCH 3/6] ceph: return the correct dentry on mkdir

2025-02-24 Thread NeilBrown
On Tue, 25 Feb 2025, Viacheslav Dubeyko wrote:
> On Mon, 2025-02-24 at 13:15 +1100, NeilBrown wrote:
> > On Fri, 21 Feb 2025, Viacheslav Dubeyko wrote:
> > > On Fri, 2025-02-21 at 10:36 +1100, NeilBrown wrote:
> > > > ceph already splices the correct dentry (in splice_dentry()) from the
> > > > result of mkdir but does nothing more with it.
> > > > 
> > > > Now that ->mkdir can return a dentry, return the correct dentry.
> > > > 
> > > > Signed-off-by: NeilBrown 
> > > > ---
> > > >  fs/ceph/dir.c | 9 -
> > > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > > > index 39e0f240de06..c1a1c168bb27 100644
> > > > --- a/fs/ceph/dir.c
> > > > +++ b/fs/ceph/dir.c
> > > > @@ -1099,6 +1099,7 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > > > *idmap, struct inode *dir,
> > > > struct ceph_client *cl = mdsc->fsc->client;
> > > > struct ceph_mds_request *req;
> > > > struct ceph_acl_sec_ctx as_ctx = {};
> > > > +   struct dentry *ret = NULL;
> > > 
> > > I believe that it makes sense to initialize pointer by error here and 
> > > always
> > > return ret as output. If something goes wrong in the logic, then we 
> > > already have
> > > error.
> > 
> > I'm not certain that I understand, but I have made a change which seems
> > to be consistent with the above and included it below.  Please let me
> > know if it is what you intended.
> > 
> > > 
> > > > int err;
> > > > int op;
> > > >  
> > > > @@ -1166,14 +1167,20 @@ static struct dentry *ceph_mkdir(struct 
> > > > mnt_idmap *idmap, struct inode *dir,
> > > > !req->r_reply_info.head->is_dentry)
> > > > err = ceph_handle_notrace_create(dir, dentry);
> > > >  out_req:
> > > > +   if (!err && req->r_dentry != dentry)
> > > > +   /* Some other dentry was spliced in */
> > > > +   ret = dget(req->r_dentry);
> > > > ceph_mdsc_put_request(req);
> > > >  out:
> > > > if (!err)
> > > > +   /* Should this use 'ret' ?? */
> > > 
> > > Could we make a decision should or shouldn't? :)
> > > It looks not good to leave this comment instead of proper implementation. 
> > > Do we
> > > have some obstacles to make this decision?
> > 
> > I suspect we should use ret, but I didn't want to make a change which
> > wasn't directly required by my needed.  So I highlighted this which
> > looks to me like a possible bug, hoping that someone more familiar with
> > the code would give an opinion.  Do you agree that 'ret' (i.e.
> > ->r_dentry) should be used when ret is not NULL?
> > 
> 
> I think if we are going to return ret as a dentry, then it makes sense to call
> the ceph_init_inode_acls() for d_inode(ret). I don't see the point to call
> ceph_init_inode_acls() for d_inode(dentry) then.

If the mkdir used the original dentry, then ->mkdir returns NULL so ret
is NULL.  If the mkdir used a different dentry it returns that, so ret
is not NULL.

I'll try to re-organise the code so that "dentry" is the correct dentry
on success, and "ret" is the returned dentry, which might be NULL.

Thanks,
NeilBrown


> 
> > > 
> > > > ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> > > > else
> > > > d_drop(dentry);
> > > > ceph_release_acl_sec_ctx(&as_ctx);
> > > > -   return ERR_PTR(err);
> > > > +   if (err)
> > > > +   return ERR_PTR(err);
> > > > +   return ret;
> > > 
> > > What's about this?
> > > 
> > > return err ? ERR_PTR(err) : ret;
> > 
> > We could do that, but you said above that you thought we should always
> > return 'ret' - which does make some sense.
> > 
> > What do you think of the following alternate patch?
> > 
> 
> Patch looks good to me. Thanks.
> 
> Reviewed-by: Viacheslav Dubeyko 
> 
> > Thanks,
> > NeilBrown
> > 
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index 39e0f240de06..d2e5c557df83 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -1099,6 +1099,7 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > *idmap, struct inode *dir,
> > struct ceph_client *cl = mdsc->fsc->client;
> > struct ceph_mds_request *req;
> > struct ceph_acl_sec_ctx as_ctx = {};
> > +   struct dentry *ret;
> > int err;
> > int op;
> >  
> > @@ -1116,32 +1117,32 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > *idmap, struct inode *dir,
> >   ceph_vinop(dir), dentry, dentry, mode);
> > op = CEPH_MDS_OP_MKDIR;
> > } else {
> > -   err = -EROFS;
> > +   ret = ERR_PTR(-EROFS);
> > goto out;
> > }
> >  
> > if (op == CEPH_MDS_OP_MKDIR &&
> > ceph_quota_is_max_files_exceeded(dir)) {
> > -   err = -EDQUOT;
> > +   ret = ERR_PTR(-EDQUOT);
> > goto out;
> > }
> > if ((op == CEPH_MDS_OP_MKSNAP) && IS_ENCRYPTED(dir) &&
> > !fscrypt_has_encryption_key(dir)) {
> > - 

Re: [PATCH 3/6] ceph: return the correct dentry on mkdir

2025-02-24 Thread Jeff Layton
On Mon, 2025-02-24 at 22:09 +, Viacheslav Dubeyko wrote:
> On Mon, 2025-02-24 at 13:15 +1100, NeilBrown wrote:
> > On Fri, 21 Feb 2025, Viacheslav Dubeyko wrote:
> > > On Fri, 2025-02-21 at 10:36 +1100, NeilBrown wrote:
> > > > ceph already splices the correct dentry (in splice_dentry()) from the
> > > > result of mkdir but does nothing more with it.
> > > > 
> > > > Now that ->mkdir can return a dentry, return the correct dentry.
> > > > 
> > > > Signed-off-by: NeilBrown 
> > > > ---
> > > >  fs/ceph/dir.c | 9 -
> > > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > > > index 39e0f240de06..c1a1c168bb27 100644
> > > > --- a/fs/ceph/dir.c
> > > > +++ b/fs/ceph/dir.c
> > > > @@ -1099,6 +1099,7 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > > > *idmap, struct inode *dir,
> > > > struct ceph_client *cl = mdsc->fsc->client;
> > > > struct ceph_mds_request *req;
> > > > struct ceph_acl_sec_ctx as_ctx = {};
> > > > +   struct dentry *ret = NULL;
> > > 
> > > I believe that it makes sense to initialize pointer by error here and 
> > > always
> > > return ret as output. If something goes wrong in the logic, then we 
> > > already have
> > > error.
> > 
> > I'm not certain that I understand, but I have made a change which seems
> > to be consistent with the above and included it below.  Please let me
> > know if it is what you intended.
> > 
> > > 
> > > > int err;
> > > > int op;
> > > >  
> > > > @@ -1166,14 +1167,20 @@ static struct dentry *ceph_mkdir(struct 
> > > > mnt_idmap *idmap, struct inode *dir,
> > > > !req->r_reply_info.head->is_dentry)
> > > > err = ceph_handle_notrace_create(dir, dentry);
> > > >  out_req:
> > > > +   if (!err && req->r_dentry != dentry)
> > > > +   /* Some other dentry was spliced in */
> > > > +   ret = dget(req->r_dentry);
> > > > ceph_mdsc_put_request(req);
> > > >  out:
> > > > if (!err)
> > > > +   /* Should this use 'ret' ?? */
> > > 
> > > Could we make a decision should or shouldn't? :)
> > > It looks not good to leave this comment instead of proper implementation. 
> > > Do we
> > > have some obstacles to make this decision?
> > 
> > I suspect we should use ret, but I didn't want to make a change which
> > wasn't directly required by my needed.  So I highlighted this which
> > looks to me like a possible bug, hoping that someone more familiar with
> > the code would give an opinion.  Do you agree that 'ret' (i.e.
> > ->r_dentry) should be used when ret is not NULL?
> > 
> 
> I think if we are going to return ret as a dentry, then it makes sense to call
> the ceph_init_inode_acls() for d_inode(ret). I don't see the point to call
> ceph_init_inode_acls() for d_inode(dentry) then.
> 

My assumption when looking at this was that they should point to the
same inode. That said, working with d_inode(ret) after that point is
less confusing to the casual reader.

> > > 
> > > > ceph_init_inode_acls(d_inode(dentry), &as_ctx);
> > > > else
> > > > d_drop(dentry);
> > > > ceph_release_acl_sec_ctx(&as_ctx);
> > > > -   return ERR_PTR(err);
> > > > +   if (err)
> > > > +   return ERR_PTR(err);
> > > > +   return ret;
> > > 
> > > What's about this?
> > > 
> > > return err ? ERR_PTR(err) : ret;
> > 
> > We could do that, but you said above that you thought we should always
> > return 'ret' - which does make some sense.
> > 
> > What do you think of the following alternate patch?
> > 
> 
> Patch looks good to me. Thanks.
> 
> Reviewed-by: Viacheslav Dubeyko 
> 
> > Thanks,
> > NeilBrown
> > 
> > diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> > index 39e0f240de06..d2e5c557df83 100644
> > --- a/fs/ceph/dir.c
> > +++ b/fs/ceph/dir.c
> > @@ -1099,6 +1099,7 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > *idmap, struct inode *dir,
> > struct ceph_client *cl = mdsc->fsc->client;
> > struct ceph_mds_request *req;
> > struct ceph_acl_sec_ctx as_ctx = {};
> > +   struct dentry *ret;
> > int err;
> > int op;
> >  
> > @@ -1116,32 +1117,32 @@ static struct dentry *ceph_mkdir(struct mnt_idmap 
> > *idmap, struct inode *dir,
> >   ceph_vinop(dir), dentry, dentry, mode);
> > op = CEPH_MDS_OP_MKDIR;
> > } else {
> > -   err = -EROFS;
> > +   ret = ERR_PTR(-EROFS);
> > goto out;
> > }
> >  
> > if (op == CEPH_MDS_OP_MKDIR &&
> > ceph_quota_is_max_files_exceeded(dir)) {
> > -   err = -EDQUOT;
> > +   ret = ERR_PTR(-EDQUOT);
> > goto out;
> > }
> > if ((op == CEPH_MDS_OP_MKSNAP) && IS_ENCRYPTED(dir) &&
> > !fscrypt_has_encryption_key(dir)) {
> > -   err = -ENOKEY;
> > +   ret = ERR_PTR(-ENOKEY);
> > goto out;
> > }
> >  
> >  
> > req = ceph_mds

Re: [PATCH 6/6] VFS: Change vfs_mkdir() to return the dentry.

2025-02-24 Thread Chuck Lever
On 2/23/25 9:51 PM, NeilBrown wrote:
> On Sat, 22 Feb 2025, Chuck Lever wrote:
>> On 2/20/25 6:36 PM, NeilBrown wrote:
> ...
>>> +   dchild = vfs_mkdir(&nop_mnt_idmap, dirp, dchild, iap->ia_mode);
>>> +   if (IS_ERR(dchild)) {
>>> +   host_err = PTR_ERR(dchild);
>>> +   } else if (d_is_negative(dchild)) {
>>> +   err = nfserr_serverfault;
>>> +   goto out;
>>> +   } else if (unlikely(dchild != resfhp->fh_dentry)) {
>>> dput(resfhp->fh_dentry);
>>> -   resfhp->fh_dentry = dget(d);
>>> -   err = fh_update(resfhp);
>>
>> Hi Neil, why is this fh_update() call no longer necessary?
>>
> 
> I tried to explain that in the commit message:
> 
> I removed the fh_update()
>   call as that is not needed and out-of-place.  A subsequent
>   nfsd_create_setattr() call will call fh_update() when needed.
> 
> I don't think the fh_update() was needed even when first added in 
> Commit 3819bb0d79f5 ("nfsd: vfs_mkdir() might succeed leaving dentry negative 
> unhashed")
> 
> as there was already an fh_update() call later in the function.

Thanks for the patch description verbiage, and sorry I missed it.

Even so, IMHO this belongs in a separate patch instead of buried in this
unrelated API change. This doesn't fix a bug nor is it necessary for
changing the return value of vfs_mkdir() AFAICT. At the very least, a
separate patch makes it possible to include a sensible reference to
3819bb0d79f5, which is helpful.

IME these tiny weird looking warts often have a purpose that is revealed
only once the code is made to look reasonable.

Make the fh_update() removal a pre-requisite clean-up to this patch,
maybe?


> Thanks,
> NeilBrown
> 
> 
> 
>>
>>> -   dput(dchild);
>>> -   dchild = d;
>>> -   if (err)
>>> -   goto out;
>>> +   resfhp->fh_dentry = dget(dchild);
>>> }
>>> break;
>>> case S_IFCHR:
>>> @@ -1530,7 +1517,8 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct 
>>> svc_fh *fhp,
>>> err = nfsd_create_setattr(rqstp, fhp, resfhp, attrs);
>>>  
>>>  out:
>>> -   dput(dchild);
>>> +   if (!IS_ERR(dchild))
>>> +   dput(dchild);
>>> return err;
>>>  
>>>  out_nfserr:
>>> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
>>> index 21c3aaf7b274..fe493f3ed6b6 100644
>>> --- a/fs/overlayfs/dir.c
>>> +++ b/fs/overlayfs/dir.c
>>> @@ -138,37 +138,6 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, 
>>> struct inode *dir,
>>> goto out;
>>>  }
>>>  
>>> -int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
>>> -  struct dentry **newdentry, umode_t mode)
>>> -{
>>> -   int err;
>>> -   struct dentry *d, *dentry = *newdentry;
>>> -
>>> -   err = ovl_do_mkdir(ofs, dir, dentry, mode);
>>> -   if (err)
>>> -   return err;
>>> -
>>> -   if (likely(!d_unhashed(dentry)))
>>> -   return 0;
>>> -
>>> -   /*
>>> -* vfs_mkdir() may succeed and leave the dentry passed
>>> -* to it unhashed and negative. If that happens, try to
>>> -* lookup a new hashed and positive dentry.
>>> -*/
>>> -   d = ovl_lookup_upper(ofs, dentry->d_name.name, dentry->d_parent,
>>> -dentry->d_name.len);
>>> -   if (IS_ERR(d)) {
>>> -   pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
>>> -   dentry, err);
>>> -   return PTR_ERR(d);
>>> -   }
>>> -   dput(dentry);
>>> -   *newdentry = d;
>>> -
>>> -   return 0;
>>> -}
>>> -
>>>  struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
>>>struct dentry *newdentry, struct ovl_cattr *attr)
>>>  {
>>> @@ -191,7 +160,8 @@ struct dentry *ovl_create_real(struct ovl_fs *ofs, 
>>> struct inode *dir,
>>>  
>>> case S_IFDIR:
>>> /* mkdir is special... */
>>> -   err =  ovl_mkdir_real(ofs, dir, &newdentry, attr->mode);
>>> +   newdentry =  ovl_do_mkdir(ofs, dir, newdentry, 
>>> attr->mode);
>>> +   err = PTR_ERR_OR_ZERO(newdentry);
>>> break;
>>>  
>>> case S_IFCHR:
>>> @@ -219,7 +189,8 @@ struct dentry *ovl_create_real(struct ovl_fs *ofs, 
>>> struct inode *dir,
>>> }
>>>  out:
>>> if (err) {
>>> -   dput(newdentry);
>>> +   if (!IS_ERR(newdentry))
>>> +   dput(newdentry);
>>> return ERR_PTR(err);
>>> }
>>> return newdentry;
>>> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
>>> index 0021e2025020..6f2f8f4cfbbc 100644
>>> --- a/fs/overlayfs/overlayfs.h
>>> +++ b/fs/overlayfs/overlayfs.h
>>> @@ -241,13 +241,14 @@ static inline int ovl_do_create(struct ovl_fs *ofs,
>>> return err;
>>>  }
>>>  
>>> -static inline int ovl_do_mkdir(struct ovl_fs *ofs,
>>> -  struc

Re: [PATCH 1/6] Change inode_operations.mkdir to return struct dentry *

2025-02-24 Thread Trond Myklebust
On Mon, 2025-02-24 at 14:09 +1100, NeilBrown wrote:
> On Mon, 24 Feb 2025, Al Viro wrote:
> > On Mon, Feb 24, 2025 at 12:34:06PM +1100, NeilBrown wrote:
> > > On Sat, 22 Feb 2025, Al Viro wrote:
> > > > On Fri, Feb 21, 2025 at 10:36:30AM +1100, NeilBrown wrote:
> > > > 
> > > > > +In general, filesystems which use d_instantiate_new() to
> > > > > install the new
> > > > > +inode can safely return NULL.  Filesystems which may not
> > > > > have an I_NEW inode
> > > > > +should use d_drop();d_splice_alias() and return the result
> > > > > of the latter.
> > > > 
> > > > IMO that's a bad pattern, _especially_ if you want to go for
> > > > "in-update"
> > > > kind of stuff later.
> > > 
> > > Agreed.  I have a draft patch to change d_splice_alias() and
> > > d_exact_alias() to work on hashed dentrys.  I thought it should
> > > go after
> > > these mkdir patches rather than before.
> > 
> > Could you give a braindump on the things d_exact_alias() is needed
> > for?
> > It's a recurring headache when doing ->d_name/->d_parent audits;
> > see e.g.
> > https://lore.kernel.org/all/20241213080023.GI3387508@ZenIV/ for
> > related
> > mini-rant from the latest iteration.
> > 
> > Proof of correctness is bloody awful; it feels like the primitive
> > itself
> > is wrong, but I'd never been able to write anything concise
> > regarding
> > the things we really want there ;-/
> > 
> 
> As I understand it, it is needed (or wanted) to handle the
> possibility
> of an inode becoming "stale" and then recovering.  This could happen,
> for example, with a temporarily misconfigured NFS server.
> 
> If ->d_revalidate gets a NFSERR_STALE from the server it will return
> '0'
> so lookup_fast() and others will call d_invalidate() which will
> d_drop()
> the dentry.  There are other paths on which -ESTALE can result in
> d_drop().
> 
> If a subsequent attempt to "open" the name successfully finds the
> same
> inode we want to reuse the old dentry rather than create a new one.
> 
> I don't really understand why.  This code was added 20 years ago
> before
> git.
> It was introduced by
> 
> commit 89a45174b6b32596ea98fa3f89a243e2c1188a01
> Author: Trond Myklebust 
> Date:   Tue Jan 4 21:41:37 2005 +0100
> 
>  VFS: Avoid dentry aliasing problems in filesystems like NFS,
> where
>   inodes may be marked as stale in one instance (causing the
> dentry
>   to be dropped) then re-enabled in the next instance.
>     
>  Signed-off-by: Trond Myklebust 
> 
> in history.git
> 
> Trond: do you have any memory of this?  Can you explain what the
> symptom
> was that you wanted to fix?
> 
> The original patch used d_add_unique() for lookup and atomic_open and
> readdir prime-dcache.  We now only use it for v4 atomic_open.  Maybe
> we
> don't need it at all?  Or maybe we need to restore it to those other
> callers? 
> 

2005? That looks like it was trying to deal with the userspace NFS
server. I can't remember when it was given the ability to use the inode
generation counter, but I'm pretty sure that in 2005 there were plenty
of setups out there that had the older version that reused filehandles
(due to inode number reuse). So you would get spurious ESTALE errors
sometimes due to inode number reuse, sometimes because the filehandle
fell out of the userspace NFS server's cache.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.mykleb...@hammerspace.com