The commit is pushed to "branch-rh9-5.14.vz9.1.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh9-5.14.0-4.vz9.10.12
------>
commit 5f595f2018acd423666b40bc405db90954557b69
Author: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>
Date:   Wed Oct 20 11:39:30 2021 +0300

    ve/device_cgroup: Show all devices allowed in ct to fool docker

    We've seen that docker 20+ not only writes "a *:* rwm" to the device
    cgroup of a privileged docker container (as pre-19 versions did) but
    also checks the content after the write, and docker expects that all
    devices are allowed for a privileged docker container.

    In our VZCT we obviously can't afford to actually allow all devices,
    because the root device cgroup of the VZCT should restrict which devices
    may be read/modified/mknod'ed in the VZCT and which may not, and all
    nested cgroups inherit this.

    Before the patch, reading the devices list in a VZCT would show a
    whitelist where each allowed device is present:

    CT-101 /# cat /sys/fs/cgroup/devices/test/devices.list
    ...
    c 1:11 rwm
    c 10:200 rwm
    c 10:235 rwm
    c 10:229 rwm
    b 182:177568 rm
    b 182:177569 rm

    Docker expects to see "a *:* rwm", as if docker were on a bare host and
    nobody had touched the device cgroup before it.

    As a solution we can just show docker what it wants. The idea is to
    detect whether the content of the whitelist of the device cgroup to be
    shown is equal to the content of the whitelist of the root device cgroup
    of the VZCT, and if so always show "a *:* rwm".

    CT-101 /# cat /sys/fs/cgroup/devices/test/devices.list
    a *:* rwm

    If one changes the whitelist (even by just reordering it), this cgroup
    shows the full list of all allowed devices as before.

    This change of the output looks consistent enough: when you see
    "a *:* rwm" in your cgroup, it means that all devices of your VZCT are
    available to you. The only difference from mainstream behaviour is that
    when you prohibit some device via devices.deny you get not a blacklist
    but an inverse whitelist.

    Related task - a CRIU/vzctl task for devices cgroup migration support:
    https://jira.sw.ru/browse/PSBM-123668

    https://jira.sw.ru/browse/PSBM-123630
    Signed-off-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

    vz8 rebase:
     - introduced css_get_local_root() similar to cgroup_get_local_root()

    (cherry picked from vz7 commit a6dba9fbee35 ("ve/device_cgroup: show all
    devices allowed in ct to fool docker"))

    In the scope of https://jira.sw.ru/browse/PSBM-123743

    Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
    Reviewed-by: Pavel Tikhomirov <ptikhomi...@virtuozzo.com>

    +++
    kernel/cgroup: rename css_get_local_root

    css functions with _get_ wording usually take reference counters.
    Rename to css_local_root() to comply. Change all uses accordingly.

    https://jira.sw.ru/browse/PSBM-131253
    Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>
    Reviewed-by: Kirill Tkhai <ktk...@virtuozzo.com>

    Ported vz8 commit e06581026a84 ("ve/device_cgroup: Show all devices
    allowed in ct to fool docker").
    Use css_ve_root1().

    Signed-off-by: Nikita Yushchenko <nikita.yushche...@virtuozzo.com>
---
 security/device_cgroup.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/security/device_cgroup.c b/security/device_cgroup.c
index 448c5bef0996..3ccec8dda4d6 100644
--- a/security/device_cgroup.c
+++ b/security/device_cgroup.c
@@ -270,12 +270,44 @@ static void set_majmin(char *str, unsigned m)
 		sprintf(str, "%u", m);
 }
 
+struct dev_exception_item *dev_exeption_next(struct list_head *head)
+{
+	return list_entry_rcu(head->next, struct dev_exception_item, list);
+}
+
+static bool devcgroup_exceptions_equal(struct dev_cgroup *first_cgrp,
+				       struct dev_cgroup *second_cgrp)
+{
+	struct list_head *first = &first_cgrp->exceptions,
+			 *second = &second_cgrp->exceptions;
+	struct dev_exception_item *exf, *exs;
+
+	for (exf = dev_exeption_next(first->next),
+	     exs = dev_exeption_next(second->next);
+	     &exf->list != first && &exs->list != second;
+	     exf = dev_exeption_next(exf->list.next),
+	     exs = dev_exeption_next(exs->list.next)) {
+		/* Check that exceptions are equal */
+		if (exf->type != exs->type ||
+		    exf->major != exs->major ||
+		    exf->minor != exs->minor ||
+		    exf->access != exs->access)
+			return false;
+	}
+
+	if (&exf->list != first || &exs->list != second)
+		return false;
+
+	return true;
+}
+
 static int devcgroup_seq_show(struct seq_file *m, void *v)
 {
 	struct dev_cgroup *devcgroup = css_to_devcgroup(seq_css(m));
 	struct dev_exception_item *ex;
 	char maj[MAJMINLEN], min[MAJMINLEN], acc[ACCLEN];
 	short type, mask;
+	struct dev_cgroup *root_cgrp;
 
 	type = (short)seq_cft(m)->private;
 	mask = (type == DEVCG_EXTRA_LIST) ?
@@ -289,12 +321,24 @@ static int devcgroup_seq_show(struct seq_file *m, void *v)
 	 * This way, the file remains as a "whitelist of devices"
 	 */
 	if (devcgroup->behavior == DEVCG_DEFAULT_ALLOW) {
+allow_all:
 		set_access(acc, mask);
 		set_majmin(maj, ~0);
 		set_majmin(min, ~0);
 		seq_printf(m, "%c %s:%s %s\n", type_to_char(DEVCG_DEV_ALL),
 			   maj, min, acc);
 	} else {
+		/*
+		 * Fooling docker in CT again: if exceptions in ve are the same
+		 * as in ve root cgroup - show as if we allow everyting
+		 */
+		if (!ve_is_super(get_exec_env())) {
+			root_cgrp = css_to_devcgroup(css_ve_root1(seq_css(m)));
+			if (root_cgrp &&
+			    devcgroup_exceptions_equal(devcgroup, root_cgrp))
+				goto allow_all;
+		}
+
 		list_for_each_entry_rcu(ex, &devcgroup->exceptions, list) {
 			set_access(acc, ex->access & mask);
 			set_majmin(maj, ex->major);
_______________________________________________
Devel mailing list
Devel@openvz.org
https://lists.openvz.org/mailman/listinfo/devel
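
For illustration only, the rule described in the commit message (show "a *:* rwm" when a cgroup's whitelist is identical to the container root's whitelist, otherwise show the full list) can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the kernel code: `struct exception`, the `ACC_*` flags, `exceptions_equal()` and `show_devices()` are hypothetical names, a plain NULL-terminated singly linked list stands in for the kernel's RCU-protected `list_head`, and the `ve_is_super()` and `DEVCG_DEFAULT_ALLOW` checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical access bits, standing in for the kernel's access mask. */
enum { ACC_R = 1, ACC_W = 2, ACC_M = 4 };

/* Hypothetical user-space model of one whitelist entry. */
struct exception {
	char type;                    /* 'c' (char) or 'b' (block) */
	unsigned int major;
	unsigned int minor;
	unsigned int access;          /* combination of ACC_R, ACC_W, ACC_M */
	const struct exception *next; /* NULL terminates the list */
};

/*
 * Walk both lists in lockstep. The lists are equal only if every pair of
 * entries at the same position matches and both lists end together.
 */
static bool exceptions_equal(const struct exception *a,
			     const struct exception *b)
{
	for (; a && b; a = a->next, b = b->next) {
		if (a->type != b->type || a->major != b->major ||
		    a->minor != b->minor || a->access != b->access)
			return false;
	}
	return a == NULL && b == NULL;
}

/* Render an access mask as a string such as "rwm" or "rm". */
static void format_access(unsigned int access, char out[4])
{
	int i = 0;

	if (access & ACC_R)
		out[i++] = 'r';
	if (access & ACC_W)
		out[i++] = 'w';
	if (access & ACC_M)
		out[i++] = 'm';
	out[i] = '\0';
}

/*
 * Print the list the way devices.list would be shown under the rule above:
 * a single "a *:* rwm" line if the list matches the root list, otherwise
 * one line per entry. Returns the number of lines printed.
 */
static int show_devices(const struct exception *list,
			const struct exception *root, FILE *out)
{
	char acc[4];
	int lines = 0;

	assert(out != NULL);

	if (exceptions_equal(list, root)) {
		fprintf(out, "a *:* rwm\n");
		return 1;
	}

	for (; list; list = list->next) {
		format_access(list->access, acc);
		fprintf(out, "%c %u:%u %s\n",
			list->type, list->major, list->minor, acc);
		lines++;
	}
	return lines;
}
```

Because the comparison in this sketch is positional, a list holding the same entries in a different order compares unequal and is printed in full, which matches the commit message's note that even reordering the whitelist brings back the full list.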