On Sun, Dec 22, 2024 at 6:11 AM Diego Nieto Cid <dnie...@gmail.com> wrote:
> I just didn't understand the hard/soft limits. It's better described
> by the structure members and not the comments:
>
>     struct rlimit {
>         rlim_t rlim_cur;  /* Soft limit */
>         rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
>     };
>
> So `rlim_cur` is the limit that must be enforced and `rlim_max` is
> the maximum value an unprivileged process can set its `rlim_cur` to.
Hm, so now that I think of it, it could make sense to enforce soft limits in userland, if we care about "memory allocated" and not "size of address space" (because the latter is influenced by memory received in messages etc). You'd track the amount of memory allocated, increase it in mmap (when making anonymous or private mappings) and sbrk, and compare it with the soft value from _hurd_rlimits to reject the calls when they would exceed the limit; no gnumach changes needed. It's not a security mechanism, and nothing would prevent you from ignoring the limit should you want to, but it sounds like it would solve all the use cases mentioned (the zzuf testsuite, clueless programs that just try to malloc a lot of memory). Maybe I'm describing the same thing as RLIMIT_DATA? The Linux man page says Linux applies it to mmap as well as sbrk since 4.7.

> Currently, I have issues understanding how vm_map_copy_t works and how
> it affects the total memory of the process.

vm_map_copy_t is memory in transfer between maps; this is how a Mach message transferring memory is represented. By itself this memory is not owned by any particular map, and so shouldn't be accounted towards anyone's limit. But when you copy the copy object (vm_map_copy_t) *out* into a destination map (vm_map_copy_overwrite, vm_map_copyout, vm_map_copyout_page_list), the new memory appears in the destination map. vm_map_copyin* is how a copy object gets created, but that shouldn't increase the amount of address space used by anyone (but see about VM_PROT_NONE below).

> > Yes, with the host port being an optional parameter for the case when
> > the limit is getting requested to be increased.
>
> Great.

FWIW, this means that the caller would be potentially sending the host priv port to someone who's not necessarily the kernel.
That's fine if we're acting on mach_task_self (since if someone is interposing our task port, we can trust them), but not fine if we're a privileged server who's willing to raise the given task's memory allowance according to some policy.

> > > [vm_map] [task exec] map size: 0, requested size: 4294967296, hard
> > > limit: 2147483648
> >
> > It'd be useful to get a backtrace. You can make your grub use
> > /hurd/exec.static instead of /hurd/exec, and use kdb's trace/u command
> > to get the userland backtrace easily. You could also add mach_print()s
> > in exec.c. That is very likely from the 4 GB red zone, as Luca points out.
> > One thing is: it's a VM_PROT_NONE/VM_PROT_NONE area. We wouldn't really
> > want to make such an area account for RLIMIT_AS, as it is not meant to
> > store anything.
>
> This complicates the accounting a bit. I can keep a count of memory
> allocated with that protection. But I suppose I need to check for calls
> to `vm_protect` or its underlying implementation.

So, much like soft & hard limits, in Mach there is (current) protection and there is max protection. If you open a file with O_RDWR and mmap it with PROT_READ, you'd get a protection of VM_PROT_READ and a max protection of VM_PROT_READ | VM_PROT_WRITE | VM_PROT_EXECUTE, so you can then increase the protection with vm_protect. If you open the file with O_RDONLY however, you'd only get a max protection of VM_PROT_READ | VM_PROT_EXECUTE, and an attempt to vm_protect it to include VM_PROT_WRITE will fail with KERN_PROTECTION_FAILURE. (This KERN_PROTECTION_FAILURE should be mapped to the Unix EACCES, but I don't see glibc doing this.)

The 4 GB of memory that the exec server reserves is mapped with cur_protection = VM_PROT_NONE, max_protection = VM_PROT_NONE, so the protection can never be increased. This area is not usable as memory; it's a pure address space allocation, there to prevent other things from being allocated in this range of address space.
So Samuel is saying you could detect this case (of max_protection == VM_PROT_NONE) and avoid counting it towards the limit. But there's a complication: we still do want it to count towards map->size, so we may need yet another counter, or something. Similarly, when deallocating memory with max_protection == VM_PROT_NONE, you don't want to decrease the address space usage counter. And indeed, if vm_map_protect(new_prot = VM_PROT_NONE, set_max = TRUE) is called and it changes non-VM_PROT_NONE memory to VM_PROT_NONE, you'd want to subtract its size from the address space usage.

Sergey