Tony Finch wrote:
> Ludovic Gasc <gml...@gmail.com> wrote:
> >
> > 1. The list of minimal capabilities needed for bind to run correctly:
> > http://man7.org/linux/man-pages/man7/capabilities.7.html
>
> named already drops capabilities - have a look at the code around here:
> https://source.isc.org/cgi-bin/gitweb.cgi?p=bind9.git;a=blob;f=bin/named/unix/os.c;hb=v9_11_2#l234
>
> Note that it's a bit clever - the privileges are dropped in two stages,
> right at the start, and after the server has been configured.

I checked just now to see what that code actually ends up doing, and on
my system I ended up with:

    $ grep -h ^Cap /proc/$(pidof named)/**/status | sort | uniq -c
          6 CapAmb: 0000000000000000
          6 CapBnd: 0000003fffffffff
          6 CapEff: 0000000001000400
          6 CapInh: 0000000000000000
          6 CapPrm: 0000000001000400
    $

That decodes to:

- The effective and permitted capabilities sets were reduced to
  CAP_NET_BIND_SERVICE and CAP_SYS_RESOURCE.

- The ambient and inheritable capabilities sets were cleared.

- The capability bounding set was left completely open-ended.

It's not clear why CAP_SYS_RESOURCE needs to be retained past startup:

    /*
     * XXX We might want to add CAP_SYS_RESOURCE, though it's not
     * clear it would work right given the way linuxthreads work.
     * XXXDCL But since we need to be able to set the maximum number
     * of files, the stack size, data size, and core dump size to
     * support named.conf options, this is now being added to test.
     */
    SET_CAP(CAP_SYS_RESOURCE);

See commits 5e4b7294d88ab58371d8c98e05ea80086dcb67cd,
108490a7f8529aff50a0ac7897580b59a73d9845. "[T]o test"?

CAP_SYS_RESOURCE is documented as permitting:

    CAP_SYS_RESOURCE
        * Use reserved space on ext2 filesystems;
        * make ioctl(2) calls controlling ext3 journaling;
        * override disk quota limits;
        * increase resource limits (see setrlimit(2));
        * override RLIMIT_NPROC resource limit;
        * override maximum number of consoles on console allocation;
        * override maximum number of keymaps;
        * allow more than 64hz interrupts from the real-time clock;
        * raise msg_qbytes limit for a System V message queue above
          the limit in /proc/sys/kernel/msgmnb (see msgop(2) and
          msgctl(2));
        * allow the RLIMIT_NOFILE resource limit on the number of
          "in-flight" file descriptors to be bypassed when passing
          file descriptors to another process via a UNIX domain
          socket (see unix(7));
        * override the /proc/sys/fs/pipe-size-max limit when setting
          the capacity of a pipe using the F_SETPIPE_SZ fcntl(2)
          command.
        * use F_SETPIPE_SZ to increase the capacity of a pipe above
          the limit specified by /proc/sys/fs/pipe-max-size;
        * override /proc/sys/fs/mqueue/queues_max limit when creating
          POSIX message queues (see mq_overview(7));
        * employ the prctl(2) PR_SET_MM operation;
        * set /proc/[pid]/oom_score_adj to a value lower than the
          value last set by a process with CAP_SYS_RESOURCE.

I would guess that retaining CAP_NET_BIND_SERVICE and CAP_SYS_RESOURCE
during the process runtime permits open-ended reloading of the config at
runtime (e.g., binding to a new IP address on port 53 without needing to
restart the daemon). So even though BIND drops some capabilities, it's
still running with elevated privileges compared to a traditional
non-root user.

systemd permits a nice pattern for network daemons that want to run as
an unprivileged user, but bind to a privileged port (and without using
socket activation), without starting the process as root. Basically, you
put something like this in the unit file:

    [Service]
    User=…
    Group=…
    CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SYS_CHROOT CAP_SETPCAP
    AmbientCapabilities=CAP_NET_BIND_SERVICE CAP_SYS_CHROOT CAP_SETPCAP
    …

Any needed filesystem directories and permissions need to be set up
correctly beforehand. The service is started by the init system as the
unprivileged User/Group specified in the unit file, so there's no need
to change UID/GID. CAP_NET_BIND_SERVICE is then used to bind to a
privileged port, CAP_SYS_CHROOT is used to perform the chroot, and
CAP_SETPCAP is used to drop all remaining capabilities from the
capability sets and the capability bounding set, so you end up with a
completely unprivileged process at runtime.

(Alternatively you could keep CAP_NET_BIND_SERVICE and drop
CAP_SYS_CHROOT and CAP_SETPCAP, if you wanted to retain the capability
to perform privileged binds at runtime. Or you could eliminate
CAP_SYS_CHROOT and use other systemd functionality to make parts of the
filesystem inaccessible, etc.)
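
One way to check what a daemon really ends up with, whether that is
named as shown earlier or a service started from a unit file like the
one above, is to decode the Cap* hex masks from /proc/<pid>/status. What
follows is only a minimal sketch: the function name decode_caps is made
up for illustration, and the name table assumes the bit numbering from
<linux/capability.h> for bits 0 through 37 (up to CAP_AUDIT_READ), so
anything a newer kernel adds is reported as a bare bit number.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
# Assumption: only bits 0..37 are named here; higher bits are printed
# as "bit N" instead of being guessed at.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE",
    "CAP_AUDIT_CONTROL", "CAP_SETFCAP", "CAP_MAC_OVERRIDE",
    "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM", "CAP_BLOCK_SUSPEND",
    "CAP_AUDIT_READ",
]


def decode_caps(hex_mask):
    """Return the names of the capabilities set in a Cap* hex mask."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            if bit < len(CAP_NAMES):
                names.append(CAP_NAMES[bit])
            else:
                names.append("bit %d" % bit)
        mask >>= 1
        bit += 1
    return names


# The CapEff/CapPrm value reported for named earlier in this message.
print(decode_caps("0000000001000400"))
```

For a process that has dropped everything, CapEff and CapPrm should both
decode to an empty list. If libcap's capsh is installed,
"capsh --decode=0000000001000400" does the same job.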

This pattern might be a bit hard to retrofit into BIND at this point,
though, other than by adding more knobs.

-- 
Robert Edmonds