[RFC] Type-Partitioned vmalloc (with sample *.ko code)

2025-02-28 Thread Maxwell Bland
Dear Linux Hardening, Security, and Memory Management Mailing Lists,

This is primarily an FYI and an RFC. I have some code, included below,
that could be dropped into a *.ko for the 6.1.X kernel, but really this
mail is to query about ideas for acceptable upstream changes.

Thank you ahead of time for reading! If the title alone of this email
sticks out and makes sense immediately, feel free to skip the
introduction below.

INTRODUCTION

For the past few months, I have been sparring with recent CVE PoCs in
the kernel, applying monkey patches to dynamic data structure
allocations, attempting to prevent data-only attacks which use write
gadgets to modify dynamically allocated struct fields otherwise declared
constant.

I wanted to share, briefly, what I feel is a reasonable and general
solution to the standard contemporary exploit procedure. For those
unfamiliar with recent PoCs, see a case study of recent exploits in Man
Yue Mo's article here:

https://github.blog/security/vulnerability-research/the-android-kernel-mitigations-obstacle-race/

Particularly, understanding the "Running arbitrary root commands using
ret2kworker(TM)" section will give a general idea of the issue.

Summarizing, there are thousands of dynamic data structures alloc'd and
free'd in the kernel all the time, for files, for processes, and so
forth, and it is elementary to manipulate any instance of data, but hard
to protect every single one of them. These range from trng device
pointers to kworker queues---everything passing through vmalloc.

The strawman approach presented here is for security engineers to read
CVE-XYZ-ABC PoC, identify the portion of the system being manipulated,
and patch the allocation handler to protect just that data at the
page-table layer, by:

- Reorganizing allocations of those structures so that they are on
the same 2MB hugepage, adjacently, as otherwise existing hardware
support to prevent their mutation (PTE flags) will trigger for unrelated
data allocated adjacently.

- Writing a handler to ensure non-malicious modifications, e.g.  keeping
"const" fields const, ensuring modifications to other fields happen at
the right physical PC values and the right pages, handling atomic
updates so that the exception fault on these values maintains ordering
under race conditions (maybe "doubling up" on atomic assembly operations
due to certain microarch issues at the chipset level, see below), and so
on, and so forth.
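
As a rough illustration of the kind of policy check the second bullet describes, here is a minimal user-space sketch. Everything in it is an assumption made for illustration: the names field_policy and write_allowed, the per-field PC ranges, and the default-deny rule are hypothetical and are not taken from the module in the appendix. A real handler would also need to decode the faulting instruction and deal with atomics and ordering, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-field write policy for one protected struct type. */
struct field_policy {
	size_t offset;      /* field offset within the object */
	size_t size;        /* field size in bytes */
	bool is_const;      /* never writable after initialization */
	uintptr_t pc_start; /* start of the only code range allowed to write */
	uintptr_t pc_end;   /* end of that range (exclusive) */
};

/*
 * Decide whether a faulting write of `len` bytes at `offset` into the
 * object, issued from program counter `pc`, should be allowed.
 * Default deny: a write that matches no field, or that straddles a
 * field boundary, is rejected.
 */
static bool write_allowed(const struct field_policy *tbl, size_t n,
			  size_t offset, size_t len, uintptr_t pc)
{
	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		const struct field_policy *f = &tbl[i];

		/* Skip fields that do not fully contain the write. */
		if (offset < f->offset || len > f->size ||
		    offset - f->offset > f->size - len)
			continue;
		if (f->is_const)
			return false;
		return pc >= f->pc_start && pc < f->pc_end;
	}
	return false;
}
```

The table-per-type shape is only one possible design; it makes the "thousand more places" problem visible, since every protected type needs its own table.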

Eventually, this Sisyphean task amounts to a mountain worth of
point-patches and encoded wisdom, valuable but absurd insofar as there
are a thousand more places for an exploit to manipulate instead of the
protected ones.

DATATYPE PARTITIONED VIRTUAL MEMORY ALLOCATION

The above process can be generalized by changing Linux's vmalloc to
behave more like seL4 (though not identically), by tying allocation
itself to the typing of an object:

https://docs.sel4.systems/Tutorials/untyped.html "objects should be

Without the caveat that objects must be "allocated in order of size,
largest first, to avoid wasting memory."
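
To make "tying allocation to the typing of an object" concrete, the following is a minimal user-space sketch, not how vmalloc or seL4 actually behave. It assumes one 2MB-aligned region per type and a simple bump allocator with no free path; the type tags, region_init, typed_alloc, and region_of are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define REGION_SIZE ((size_t)2 << 20) /* one 2MB hugepage-sized region per type */

/* Hypothetical type tags; a kernel would derive these from the allocation site. */
enum obj_type { TYPE_FILE, TYPE_WORKQUEUE, TYPE_COUNT };

struct type_region {
	unsigned char *base; /* 2MB-aligned start of this type's region */
	size_t used;         /* bytes handed out so far */
	size_t obj_size;     /* fixed object size for this type */
};

static struct type_region regions[TYPE_COUNT];

/* Reserve a dedicated, 2MB-aligned region for one object type. */
static int region_init(enum obj_type t, size_t obj_size)
{
	assert(obj_size > 0 && obj_size <= REGION_SIZE);
	regions[t].base = aligned_alloc(REGION_SIZE, REGION_SIZE);
	if (!regions[t].base)
		return -1;
	regions[t].used = 0;
	regions[t].obj_size = obj_size;
	return 0;
}

/* Bump-allocate the next object of type t from that type's own region. */
static void *typed_alloc(enum obj_type t)
{
	struct type_region *r = &regions[t];

	if (!r->base || REGION_SIZE - r->used < r->obj_size)
		return NULL;
	void *p = r->base + r->used;
	r->used += r->obj_size;
	return p;
}

/* The 2MB region an address falls in; objects of one type share it. */
static uintptr_t region_of(const void *p)
{
	return (uintptr_t)p & ~(uintptr_t)(REGION_SIZE - 1);
}
```

Because objects of one type are packed adjacently and never share a region with another type, a single page-table permission change covers only that type, which is the property the bullets in the introduction are after.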

I demonstrated something similar previously to prevent the intermixed
allocation of SECCOMP BPF code pages with data on ARM64's Android Kernel
here (with which you may be familiar):

https://lore.kernel.org/all/20240423095843.446565600-1-mbl...@motorola.com/

That said, the above patch does not do the same for other critical
dynamically allocated data.

So, for instance, to prevent struct file manipulation, I've written the
following code into an init-time loaded kernel (v6.1.x) module:

    filp_cachep_ind =
        (struct kmem_cache **)kallsyms_lookup_name_ind("filp_cachep");
    /* Just nix the existing file cache for one which is page-aligned */
    *filp_cachep_ind = kmem_cache_create(
        "filp", sizeof(struct file), PAGE_SIZE,
        SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);

I.e. aligning cache allocations to PAGE_SIZE. See the appendix for
associated module code.

Of course, this is a little insane since:

(1) I'm effectively double allocating the cache to change how
the structs are allocated, because I can't change the kernel's
init process (part of this has to do with Google's GKI).

(2) The kmem infrastructure needs to be also monkey patched so
that this "PAGE_SIZE" alignment actually indicates that objects
can still be allocated next to each other at the originally
set alignment, reducing dead space due to wasted bytes (not
implemented). And, most importantly,

(3) struct file is just one case of thousands.
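
The dead-space cost in point (2) can be estimated with a little arithmetic. The sketch below is a user-space helper, not kernel code; the 232-byte object size used in the note after it is an assumed figure for illustration only, since sizeof(struct file) depends on kernel version and configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* Padding bytes added to each object when objects are placed at `align`. */
static size_t padding_per_object(size_t obj_size, size_t align)
{
	return align_up(obj_size, align) - obj_size;
}

/* How many aligned objects fit in one page of `page_size` bytes. */
static size_t objects_per_page(size_t obj_size, size_t align, size_t page_size)
{
	return page_size / align_up(obj_size, align);
}
```

With the assumed 232-byte object and a 4096-byte page, 64-byte alignment pads 24 bytes per object and fits 16 objects per page, whereas PAGE_SIZE alignment pads 3864 bytes and fits one object per page.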

However, it seems fine for protecting a specific, given file allocation
targeted by something like:

https://github.com/chompie1337/s8_2019_2215_poc/blob/34f6481ed4ed4cff661b50ac465fc73655b82f64/poc/knox_bypass.c#L50

Given the appropriate protection handlers (see appendix below), this
works fine even without access to an HVCI system.

Hopefully the above reasoning is clear enough. If so, the proposal
(though it

Re: [RFC] Type-Partitioned vmalloc (with sample *.ko code)

2025-03-06 Thread Maxwell Bland
On Mon, Mar 03, 2025 at 10:26:16AM -0800, Kees Cook wrote:
> On Fri, Feb 28, 2025 at 02:57:40PM -0600, Maxwell Bland wrote:
> > Summarizing, there are thousands of dynamic data structures alloc'd and
> > free'd in the kernel all the time, for files, for processes, and so
> > forth, and it is elementary to manipulate any instance of data, but hard
> > to protect every single one of them. These range from trng device
> > pointers to kworker queues---everything passing through vmalloc.
> > 
> > - Reorganizing allocations of those structures so that they are on
> > the same 2MB hugepage, adjacently, as otherwise existing hardware
> > support to prevent their mutation (PTE flags) will trigger for unrelated
> > data allocated adjacently.
> 
> This sounds like the "write rarely" proposal:
> https://github.com/KSPP/linux/issues/130
> 
> which isolates chosen data structures into immutable memory, except for
> specific helpers which are allowed to write to the memory. This is
> needed most, by far, for page tables:
> https://lore.kernel.org/lkml/20250203101839.1223008-1-kevin.brod...@arm.com/

Thank you for this pointer and the others below. I spent a lot of time
the past two days thinking about your email and the links.

> It looks from your example at the end that you're depending on a
> hypervisor to perform the memory protection and fault handling? Or maybe
> I'm misunderstanding where the protection is happening?

Correct. I use the fault handler, proper, though, and optimize it
through careful management of protected vs. unprotected resources, which
pushes me up against the problem of determining specific policies for
each type of kmalloc.

> 
> > - Writing a handler to ensure non-malicious modifications, e.g.  keeping
> > "const" fields const, ensuring modifications to other fields happen at
> > the right physical PC values and the right pages, handling atomic
> > updates so that the exception fault on these values maintains ordering
> > under race conditions (maybe "doubling up" on atomic assembly operations
> > due to certain microarch issues at the chipset level, see below), and so
> > on, and so forth.
> 
> As I understand it, this depends on avoiding ROP attacks that will hijack
> those PC values. (This is generally true for the entire concept of
> "write rarely", though -- nothing specific to this implementation.)

I think a more general solution to this problem, leveraging the POE
mechanism (or just stage-2 translation tables), is to build something on
top of or around CFI.  This is natural since the protections already
assume CFI for the data tagging. I can imagine some GCC plugin or
compiler pass for functions, which can appropriately inject
unlock/relock calls around "critical" functions (part of the paciasp
instrumentation).

In fact, I rewrote the QCOM SMC handler to ensure the lock/unlock
semantics were inlined into the specific data operation context, to
prevent creation of a privilege escalation callgate given a CFI bypass.
I attached the code for this at the end.
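
For readers who want the shape of "lock/unlock inlined into the specific data operation", here is a user-space analogue. It is a sketch under loud assumptions: it uses mprotect() on a private page in place of POE or stage-2 permissions, and struct protected_cfg, cfg_create, and cfg_set_target are hypothetical names unrelated to the SMC handler code mentioned above.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical protected object: lives alone on its own read-only page. */
struct protected_cfg {
	unsigned long target;
};

/* Map one page, initialize the object, then drop write permission. */
static struct protected_cfg *cfg_create(unsigned long initial)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	struct protected_cfg *cfg = mmap(NULL, page, PROT_READ | PROT_WRITE,
					 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (cfg == MAP_FAILED)
		return NULL;
	cfg->target = initial;
	if (mprotect(cfg, page, PROT_READ) != 0) {
		munmap(cfg, page);
		return NULL;
	}
	return cfg;
}

/*
 * The only writer. Unlock, store, and relock are fused into one
 * operation, so there is no general-purpose "unlock" helper to reuse.
 * (In the kernel one would likely force inlining with __always_inline;
 * plain `static inline` here is only a hint to the compiler.)
 */
static inline int cfg_set_target(struct protected_cfg *cfg, unsigned long v)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	assert(cfg != NULL);
	if (mprotect(cfg, page, PROT_READ | PROT_WRITE) != 0)
		return -1;
	cfg->target = v;
	return mprotect(cfg, page, PROT_READ);
}
```

The design point is that the permission change and the single permitted store live in the same function body; a separate unlock() that any caller could reach would be exactly the callgate the paragraph above warns about.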

I will bring this and some other points up to Kevin.

> The current proposals use a method of gaining temporary write permission
> during the execution path of an approved writer (rather than doing it
> via the fault handler, which tends to be very expensive).

I've not found the fault handler approach to be too expensive, at least
for a system matching all current guarantees. Once we begin talking
about struct file's f_lock and every kmalloc, I am inclined to agree.

I think a fault handler based solution can still get a lot of distance
if frequently updated fields of structs were indirected through pointers
(and separate kmalloc calls).
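
A minimal sketch of that indirection, in user-space C with malloc standing in for kmalloc. The struct and field names (file_cold, file_hot, pos, lock_word) are hypothetical and do not reflect the real struct file layout; the sketch only shows the split, not the page protection itself.

```c
#include <assert.h>
#include <stdlib.h>

/* Frequently updated state, kept in its own ordinary allocation. */
struct file_hot {
	long pos;
	unsigned int lock_word;
};

/* Rarely written fields; this part would sit on a write-protected page. */
struct file_cold {
	const void *ops;      /* set once at creation */
	unsigned int mode;    /* set once at creation */
	struct file_hot *hot; /* the pointer never changes after creation */
};

static struct file_cold *file_create(const void *ops, unsigned int mode)
{
	struct file_cold *f = malloc(sizeof(*f));
	struct file_hot *h = calloc(1, sizeof(*h));

	if (!f || !h) {
		free(f);
		free(h);
		return NULL;
	}
	f->ops = ops;
	f->mode = mode;
	f->hot = h;
	return f;
}

/* Hot-path update: touches only the separate allocation. */
static void file_advance(struct file_cold *f, long n)
{
	assert(f != NULL && f->hot != NULL);
	f->hot->pos += n;
}

static void file_destroy(struct file_cold *f)
{
	if (f) {
		free(f->hot);
		free(f);
	}
}
```

The trade-off is that the hot part is left unprotected, so this only helps when the fields an attacker cares about are the rarely written ones.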

One issue I also see with POE and other solutions is a lack of
infrastructure for applying specific policies to updates on data
structures: it's one thing to lock the page table outside of
set_memory_rw, but another to ensure the arguments to that API are
not corrupted, e.g. overwriting plt->target here?

arch/arm64/net/bpf_jit_comp.c
2417:   if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target),

> > I demonstrated something similar previously to prevent the intermixed
> > allocation of SECCOMP BPF code pages with data on ARM64's Android Kernel
> > here (with which you may be familiar):
> > 
> > https://lore.kernel.org/all/20240423095843.446565600-1-mbl...@motorola.com/
> 
> Did this v4 go any further? I see earlier versions had some discussion
> around it.

No response, so I did not stress the issue (should I have?). I ended up
just hacking around Google's GKI, so upstreaming was no longer
necessary.

> in here... Do you mean to make a distinction between the