A few weeks ago a conversation about retguard (a diff for that is probably coming) caused me to re-consider & re-read the BROP paper
https://www.scs.stanford.edu/brop/bittau-brop.pdf

After lots of details, page 8 has a table summarizing the attack process. Step 5 contains the text "The attacker can now dump the entire binary to find more gadgets".

This diff hinders that step: it prevents a simple download of immutable text segments via system calls. This is valuable because BROP needs this to be a simple step; the write is the maximum tooling the attacker has at that moment.

There is a difficulty in the way of "oh, just make code segments non-readable". Most MMUs lack the ability to manage PROT_EXEC without PROT_READ; being able to read the code using data instructions is implicit in these architectures. There is a very short list of architectures and sub-architectures that could block reads of code regions if we wrote the uvm/pmap code. We really want this property for security reasons, but since most MMUs lack it we have not made a lot of progress. This elusive property, of blocking reads of code, is called "X-only" in our group.

So that means user processes can read their own code. Can't stop that.

The kernel also reads userland memory when you do a system call like write() or sendto(). It does this using copyin() against the userland region, which again uses the MMU, but the MMU lacks the ability we desire. Changing this lookup to instead inspect the higher-level virtual-memory data structures would introduce either races or some pretty strong locks, because the virtual memory layout can be changed by other threads, and that would hurt threading performance. So we cannot simply inspect the whole virtual-memory data structures for the region.

So I created a very small coherent cache of unpermitted regions which gets looked up before copyin(), and after a few iterations of coding, I managed to do it without locking! Depending on the binary type, this cache is an array of 2 to 4 text regions: main program text, ld.so text, signal trampoline, and libc.so text. Normally this would need management & updates when processes do mprotect/mmap/munmap, but a few weeks ago I introduced mimmutable(), and since all those text segments are now immutable we know they cannot change! So there is no need to update or delete entries in this cache. Once we know the table is complete, we don't need to lock lookups.

The kernel now prevents "write(fd, &main, length)" and "write(fd, &open, length)" by returning -1 EFAULT.

This protection is not made available to other libraries (like libcrypto). I've looked into doing this via a special system call, and via an implicit arrangement inside mimmutable(), but the changes are much more complicated and the security benefit is lower, so for now I am going to punt on that.

Someone is going to reply "but I'll copy libc text to other memory before I do the write operation!" Please show us at which step in the BROP procedure on page 8 this copy operation is done, and how. BROP is used when the attack tooling is insufficient for complicated sequences like "copy elsewhere + write"; BROP is a method to collect powerful gadgetry that you don't yet have, for a next-round attack sequence. In particular, BROP would be used to learn the layout of the randomly-relinked libc, especially to discover the location of all the syscall stubs.

There are unfinished pieces in here, but it seems to be working fine. The next step will be to find out if we have any software in ports which tries to perform output system calls with its text segment as the data source. Let me know.
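To make the effect concrete, here is a rough userland sketch (purely illustrative, not part of the diff) of what the new behaviour should look like on a kernel with this change applied:

/*
 * Illustrative test only: on a kernel with this diff, an output system
 * call whose source buffer lies inside an immutable text segment should
 * fail with EFAULT, while ordinary loads from text keep working.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char buf[64];
	int fd = open("/dev/null", O_WRONLY);
	ssize_t n;

	/* Reading our own text with load instructions still works... */
	memcpy(buf, (const void *)&main, sizeof(buf));

	/* ...but asking the kernel to copyin() straight from text does not. */
	n = write(fd, (const void *)&main, sizeof(buf));
	if (n == -1 && errno == EFAULT)
		printf("write from main text blocked: EFAULT\n");
	else
		printf("write from main text allowed: %zd bytes\n", n);

	close(fd);
	return 0;
}

Only the kernel's copyin() path is refused; the process itself can still read its code directly.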
Index: sys/kern/exec_elf.c
===================================================================
RCS file: /cvs/src/sys/kern/exec_elf.c,v
retrieving revision 1.177
diff -u -p -u -r1.177 exec_elf.c
--- sys/kern/exec_elf.c	5 Dec 2022 23:18:37 -0000	1.177
+++ sys/kern/exec_elf.c	20 Dec 2022 07:10:34 -0000
@@ -621,9 +621,11 @@ exec_elf_makecmds(struct proc *p, struct
 	} else
 		addr = ELF_NO_ADDR;
 
-	/* Permit system calls in specific main-programs */
+	/*
+	 * Permit system calls in main-text static binaries.
+	 * Also block the ld.so syscall-grant
+	 */
 	if (interp == NULL) {
-		/* statics. Also block the ld.so syscall-grant */
 		syscall = VMCMD_SYSCALL;
 		p->p_vmspace->vm_map.flags |= VM_MAP_SYSCALL_ONCE;
 	}
Index: sys/kern/exec_subr.c
===================================================================
RCS file: /cvs/src/sys/kern/exec_subr.c,v
retrieving revision 1.64
diff -u -p -u -r1.64 exec_subr.c
--- sys/kern/exec_subr.c	5 Dec 2022 23:18:37 -0000	1.64
+++ sys/kern/exec_subr.c	20 Dec 2022 06:39:40 -0000
@@ -215,6 +215,10 @@ vmcmd_map_pagedvn(struct proc *p, struct
 		if (cmd->ev_flags & VMCMD_IMMUTABLE)
 			uvm_map_immutable(&p->p_vmspace->vm_map,
 			    cmd->ev_addr, round_page(cmd->ev_addr + cmd->ev_len), 1);
+		if ((flags & UVM_FLAG_SYSCALL) ||
+		    ((cmd->ev_flags & VMCMD_IMMUTABLE) && (cmd->ev_prot & PROT_EXEC)))
+			uvm_map_xonly(&p->p_vmspace->vm_map,
+			    cmd->ev_addr, round_page(cmd->ev_addr + cmd->ev_len));
 	}
 
 	return (error);
Index: sys/kern/kern_sig.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_sig.c,v
retrieving revision 1.301
diff -u -p -u -r1.301 kern_sig.c
--- sys/kern/kern_sig.c	16 Oct 2022 16:27:02 -0000	1.301
+++ sys/kern/kern_sig.c	19 Dec 2022 18:17:28 -0000
@@ -1642,6 +1642,9 @@ coredump(struct proc *p)
 
 	atomic_setbits_int(&pr->ps_flags, PS_COREDUMP);
 
+	/* disable xonly checks, so we can write out text sections if needed */
+	p->p_vmspace->vm_map.xonly_count = 0;
+
 	/* Don't dump if will exceed file size limit. */
 	if (USPACE + ptoa(vm->vm_dsize + vm->vm_ssize) >= lim_cur(RLIMIT_CORE))
 		return (EFBIG);
Index: sys/kern/kern_subr.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_subr.c,v
retrieving revision 1.51
diff -u -p -u -r1.51 kern_subr.c
--- sys/kern/kern_subr.c	14 Aug 2022 01:58:27 -0000	1.51
+++ sys/kern/kern_subr.c	20 Dec 2022 01:29:43 -0000
@@ -43,6 +43,8 @@
 #include <sys/sched.h>
 #include <sys/malloc.h>
 #include <sys/queue.h>
+#include <uvm/uvm.h>
+#include <uvm/uvm_map.h>
 
 int
 uiomove(void *cp, size_t n, struct uio *uio)
@@ -78,8 +80,12 @@ uiomove(void *cp, size_t n, struct uio *
 		sched_pause(preempt);
 		if (uio->uio_rw == UIO_READ)
 			error = copyout(cp, iov->iov_base, cnt);
-		else
+		else {
+			if (uvm_map_xonly_check(uio->uio_procp,
+			    (vaddr_t)iov->iov_base, cnt))
+				return EFAULT;
 			error = copyin(iov->iov_base, cp, cnt);
+		}
 		if (error)
 			return (error);
 		break;
Index: sys/kern/subr_log.c
===================================================================
RCS file: /cvs/src/sys/kern/subr_log.c,v
retrieving revision 1.75
diff -u -p -u -r1.75 subr_log.c
--- sys/kern/subr_log.c	2 Jul 2022 08:50:42 -0000	1.75
+++ sys/kern/subr_log.c	20 Dec 2022 01:26:45 -0000
@@ -644,6 +644,8 @@ dosendsyslog(struct proc *p, const char
 	 */
 	len = MIN(nbyte, sizeof(pri));
 	if (sflg == UIO_USERSPACE) {
+//		if (uvm_map_xonly_check(p, buf, len))
+//			return (EFAULT);
 		if ((error = copyin(buf, pri, len)))
 			return (error);
 	} else
Index: sys/uvm/uvm_io.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_io.c,v
retrieving revision 1.30
diff -u -p -u -r1.30 uvm_io.c
--- sys/uvm/uvm_io.c	7 Oct 2022 14:59:39 -0000	1.30
+++ sys/uvm/uvm_io.c	19 Dec 2022 18:10:20 -0000
@@ -57,7 +57,7 @@ uvm_io(vm_map_t map, struct uio *uio, in
 	vsize_t chunksz, togo, sz;
 	struct uvm_map_deadq dead_entries;
 	int error, extractflags;
-
+	int save_xonly_count;
 	/*
 	 * step 0: sanity checks and set up for copy loop. start with a
 	 * large chunk size. if we have trouble finding vm space we will
@@ -84,8 +84,12 @@ uvm_io(vm_map_t map, struct uio *uio, in
 	error = 0;
 	extractflags = 0;
-	if (flags & UVM_IO_FIXPROT)
+	if (flags & UVM_IO_FIXPROT) {
 		extractflags |= UVM_EXTRACT_FIXPROT;
+		/* Disable xonly checks on this map */
+		save_xonly_count = map->xonly_count;
+		map->xonly_count = 0;
+	}
 
 	/*
 	 * step 1: main loop...  while we've got data to move
@@ -134,6 +138,10 @@ uvm_io(vm_map_t map, struct uio *uio, in
 		if (error)
 			break;
 	}
+
+	/* Restore xonly checks on this map */
+	if (flags & UVM_IO_FIXPROT)
+		map->xonly_count = save_xonly_count;
 
 	return (error);
 }
Index: sys/uvm/uvm_map.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_map.c,v
retrieving revision 1.305
diff -u -p -u -r1.305 uvm_map.c
--- sys/uvm/uvm_map.c	18 Dec 2022 23:41:17 -0000	1.305
+++ sys/uvm/uvm_map.c	20 Dec 2022 07:13:07 -0000
@@ -3472,6 +3472,7 @@ uvmspace_exec(struct proc *p, vaddr_t st
 		uvmspace_free(ovm);
 	}
+	p->p_vmspace->vm_map.xonly_count = 0;
 
 	/* Release dead entries */
 	uvm_unmap_detach(&dead_entries, 0);
 
@@ -4258,8 +4259,71 @@ uvm_map_syscall(struct vm_map *map, vadd
 		entry = RBT_NEXT(uvm_map_addr, entry);
 	}
 
+	/* Add libc's text segment to the XONLY list */
+	if (map->xonly_count < UVM_MAP_XONLY_MAX) {
+		//printf("%d xsysc %lx-%lx\n", map->xonly_count, start, end);
+		map->xonly[map->xonly_count].start = start;
+		map->xonly[map->xonly_count].end = end;
+		map->xonly_count++;
+	}
+
 	map->wserial++;
 	map->flags |= VM_MAP_SYSCALL_ONCE;
+	vm_map_unlock(map);
+	return (0);
+}
+
+/*
+ * uvm_map_xonly_check: if the address is in an x-only region, return EFAULT
+ */
+int
+uvm_map_xonly_check(struct proc *p, vaddr_t start, vsize_t len)
+{
+	struct vm_map *map = &p->p_vmspace->vm_map;
+	vaddr_t end = start + len;
+	int i, r = 0;
+
+	/*
+	 * When system calls are registered and msyscall(2) is blocked,
+	 * there are no new calls to setup xonly regions
+	 */
+	if ((map->flags & VM_MAP_SYSCALL_ONCE) == 0)
+		vm_map_lock(map);
+	for (i = 0; i < map->xonly_count; i++) {
+		vaddr_t s = map->xonly[i].start, e = map->xonly[i].end;
+
+		if ((start >= s && start < e) || (end >= s && end < e)) {
+			r = EFAULT;
+			break;
+		}
+	}
+	if ((map->flags & VM_MAP_SYSCALL_ONCE) == 0)
+		vm_map_unlock(map);
+	return (r);
+}
+
+/*
+ * uvm_map_xonly: remember regions which are X-only for uiomove()
+ *
+ * => map must be unlocked
+ */
+int
+uvm_map_xonly(struct vm_map *map, vaddr_t start, vaddr_t end)
+{
+	if (start > end)
+		return EINVAL;
+	start = MAX(start, map->min_offset);
+	end = MIN(end, map->max_offset);
+	if (start >= end)
+		return 0;
+
+	vm_map_lock(map);
+	if (map->xonly_count < UVM_MAP_XONLY_MAX) {
+		//printf("%d xonly %lx-%lx\n", map->xonly_count, start, end);
+		map->xonly[map->xonly_count].start = start;
+		map->xonly[map->xonly_count].end = end;
+		map->xonly_count++;
+	}
 	vm_map_unlock(map);
 	return (0);
 }
Index: sys/uvm/uvm_map.h
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_map.h,v
retrieving revision 1.81
diff -u -p -u -r1.81 uvm_map.h
--- sys/uvm/uvm_map.h	17 Nov 2022 23:26:07 -0000	1.81
+++ sys/uvm/uvm_map.h	20 Dec 2022 05:23:43 -0000
@@ -168,6 +168,12 @@ struct vm_map_entry {
 	vsize_t			fspace_augment;	/* max(fspace) in subtree */
 };
 
+struct uvm_xonly {
+	vaddr_t			start;
+	vaddr_t			end;
+};
+#define UVM_MAP_XONLY_MAX	10
+
 #define	VM_MAPENT_ISWIRED(entry)	((entry)->wired_count != 0)
 
 TAILQ_HEAD(uvm_map_deadq, vm_map_entry);	/* dead entry queue */
@@ -309,6 +315,9 @@ struct vm_map {
 	struct uvm_addr_state	*uaddr_any[4];	/* More selectors. */
 	struct uvm_addr_state	*uaddr_brk_stack;	/* Brk/stack selector. */
 
+	struct uvm_xonly	xonly[UVM_MAP_XONLY_MAX];
+	int			xonly_count;
+
 	/*
 	 * XXX struct mutex changes size because of compile options, so place
	 * place after fields which are inspected by libkvm / procmap(8)
@@ -354,6 +363,8 @@ int	uvm_map_extract(struct vm_map *, va
 struct vm_map *	uvm_map_create(pmap_t, vaddr_t, vaddr_t, int);
 vaddr_t		uvm_map_pie(vaddr_t);
 vaddr_t		uvm_map_hint(struct vmspace *, vm_prot_t, vaddr_t, vaddr_t);
+int		uvm_map_xonly(struct vm_map *, vaddr_t, vaddr_t);
+int		uvm_map_xonly_check(struct proc *, vaddr_t, vsize_t);
 int		uvm_map_syscall(struct vm_map *, vaddr_t, vaddr_t);
 int		uvm_map_immutable(struct vm_map *, vaddr_t, vaddr_t, int);
 int		uvm_map_inherit(struct vm_map *, vaddr_t, vaddr_t, vm_inherit_t);