A recent regression in gnome-keyring (perhaps only on systems that use dbus-x11, it isn't completely clear to me yet) has prompted me to look at how rlimits work in Debian. It isn't clear to me which package is or should be responsible for choosing what arbitrary limits we use in practice.

The kernel has some defaults, which it sets on pid 1. Some are hard-coded, but increasingly many seem to be dependent on system state (for example limiting memory sizes to a fixed fraction of system RAM). Traditionally, when the init system was extremely minimal and delegated the majority of its responsibilities to child processes (sysvinit or similar), these defaults would be inherited by pid 1's children, and recursively inherited by user processes.

In principle, the pam_limits.so module sets rlimits for user processes. However, by default it is unconfigured, and in the absence of configuration it needs to default to *something* - either inheriting from its parent process, or resetting the limits to something predictable. Inheriting from its parent process is problematic because the parent process might have reset its limits internally, and in a sysvinit world it might have been restarted by a sysadmin in an arbitrary execution environment, leading to unpredictable limits in user processes; but resetting the limits is also problematic, because it results in PAM having to second-guess the limits coming from the kernel, which presumably knows better.

Debian's PAM package currently carries a non-upstream patch to screen-scrape the rlimits of pid 1 and use them as a guess at what the kernel's defaults must have been. This makes perfect sense in a sysvinit world, where sysvinit hardly does anything (the real work of booting the system is all delegated to sysv-rc) and therefore is unlikely to need to raise its rlimits; but it doesn't really make sense under systemd, where pid 1 does a significant amount, and raises its rlimits accordingly.

systemd *also* has configurable default limits to be passed down to system services (see DefaultLimitMEMLOCK, etc. in /etc/systemd/system.conf).

How is this meant to work, and is it working as intended in practice?

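For anyone who wants to compare what pid 1 has with what their own session ended up with, here is a rough sketch of the kind of screen-scraping described above. It assumes the limits are read from /proc/<pid>/limits, which is my assumption about where the numbers come from rather than a description of what the PAM patch actually does; parse_proc_limits and read_limits are names I made up for illustration.

```python
import re


def parse_proc_limits(text):
    """Parse the text of /proc/<pid>/limits into {name: (soft, hard)}.

    Values are ints, or None where the kernel prints "unlimited".
    """
    limits = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        # Columns are padded with runs of spaces; limit names themselves
        # only contain single spaces, so split on two or more.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3:
            continue  # not a limit row
        name, soft, hard = fields[:3]  # a 4th field (units) may be absent
        limits[name] = tuple(
            None if value == "unlimited" else int(value)
            for value in (soft, hard)
        )
    return limits


def read_limits(pid):
    """Read and parse the limits of a running process (hypothetical helper)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_proc_limits(f.read())
```

Something like read_limits(1).get("Max locked memory") should then give pid 1's soft and hard RLIMIT_MEMLOCK, if /proc/1/limits is readable on the system in question. The calling process's own value is available from resource.getrlimit(resource.RLIMIT_MEMLOCK) in the standard library.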
If I'm understanding correctly, upstream it's meant to go something like this, with more-indented components selectively overriding less-indented components:

    kernel
     -> (kernel defaults)
        init
         -> (systemd's configuration, if using systemd)
            system service providing an entry point
             -> PAM stack, pam_limits.so
                 -> (pam_limits configuration, if used)
                    user sessions

but because sysadmins of sysvinit systems are expected to run "service foo restart" in an unknown execution environment, our patched PAM changes this to:

    kernel
     -> (kernel defaults)
        init
         -> (systemd's configuration, if using systemd)
            system service providing an entry point
             -> PAM stack, pam_limits.so
                 -> (PAM's best guess at what the limits *should have been*)
                    (pam_limits configuration, if used)
                    user sessions
            system service providing an entry point
             -> ...
                sysadmin's arbitrary login session...
                 -> system service restarted by sysadmin
                     -> PAM stack, pam_limits.so
                         -> (PAM's best guess at what the limits *should have been*)
                            (pam_limits configuration, if used)
                            user sessions

I wonder whether the solution ought to involve something like this:

* On non-systemd-booted systems, PAM continues to screen-scrape limits from pid 1 for compatibility with the "service foo restart" use-case;

* On systemd systems, PAM stops doing that, and inherits from the parent process by default, resulting in user processes getting the limits configured in pam_limits (if set), or if not set there, then the limits from systemd system.conf (if set), or if not set there either, the limits from the kernel.

Rationale: on sysvinit or runit systems, pid 1 is very simple and is unlikely to need to elevate any limits, but sysadmins are expected to restart system services in an unpredictable execution environment (certainly true for sysvinit, I'm not so sure for runit).

On systemd systems, pid 1 is more complex, but part of the value we get for that complexity is that even when sysadmins restart system services, the service receives a known and predictable execution environment, so it does not need to be robust against inheriting a wrong rlimit or other parameters.

See also #917374, #976373, #923312.

The reason I ask about this is that I want to make sure we are setting rlimits, and in particular RLIMIT_MEMLOCK, to a realistic value for 2021.

The wider context here is that gnome-keyring-daemon, GNOME's implementation of the org.freedesktop.Secrets interface, is currently setcap cap_ipc_lock=ep so that it can mlock(2) secrets and stop them from getting swapped out. This is ineffective on systems that can hibernate, at which point everything (even locked memory) has to be written to swap in any case, but it's better than nothing.

This filesystem capability results in gnome-keyring-daemon having elevated privileges (even though the privilege is a relatively minor one), which in principle means it should not trust the execution environment inherited from a less-privileged caller that might be trying to trick it into executing attacker-provided arbitrary code with the elevated privilege. Recent security-hardening changes in GLib made it distrust most environment variables to reduce the number of foot-shooting incidents (although authors of setuid or privileged components should note that GLib's maintainers still consider it to be the setuid program's responsibility to sanitize its own execution environment, since it is the setuid program that is setting up an unusual trust relationship).

However, in order to work as designed, gnome-keyring *has to* be able to trust environment variables that it inherits: either DBUS_SESSION_BUS_ADDRESS, or XDG_RUNTIME_DIR, or both. Otherwise, it cannot connect to the D-Bus session bus and provide its intended functionality.
(#981420, #981555)

Historically, gnome-keyring's RAM also contained GPG keys (although it now delegates GPG key handling to GnuPG's gpg-agent and does not ever see a decrypted GPG key itself), and also contained SSH keys (although if I understand correctly, it now delegates *those* to OpenSSH's ssh-agent, and does not ever see a decrypted SSH key itself). Those are obviously also desirable to lock into RAM, although I note that neither gpg-agent nor ssh-agent has CAP_IPC_LOCK, so presumably they find the default RLIMIT_MEMLOCK sufficient for their needs.

Empirically, it seems that:

* user processes on a Debian 11 system booted with systemd have an RLIMIT_MEMLOCK of 1/8 of RAM (which is definitely plenty, but perhaps too much: #976373);

* user processes on a Debian 11 system booted with sysvinit have an RLIMIT_MEMLOCK of 64K (perhaps not enough);

* user processes on a Debian 10 system booted with systemd had a fixed RLIMIT_MEMLOCK of 64M (perhaps more like the right value).

I would like to have gnome-keyring inherit some realistic RLIMIT_MEMLOCK that is enough for it to lock passwords into RAM without special privileges, without either having to rely on some lowest-common-denominator behaviour or being a denial-of-service vector. If that only happens under systemd, so be it - we can document in README.Debian that if sysvinit users want to lock passwords into RAM, they will have to configure pam_limits themselves - but I would hope that given the number of sysvinit advocates in the project, there is someone who can implement reasonable behaviour for sysvinit systems too.

We have seen a similar mess in the past with arbitrary file descriptor limits, where nobody is quite sure which component is responsible for setting the limit to be realistic for 2010s/2020s hardware that can certainly cope with managing more than 4K file descriptors at a time.

    smcv