From: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>

https://jira.sw.ru/browse/PSBM-34438

(This fix was adapted from PCS6.)

It is possible for a container to create lots of mount points, which may
make operations with them slower. As some of these operations take global
locks (namespace_sem, vfsmount_lock), this might affect other containers
as well.

Let us limit the maximum number of mount points a VE may create. The limit
can be customized via the /proc/sys/fs/ve-mount-nr knob.

Changes in v.3:
* Revisited VE-specific parts of the patch to reduce the impact on the
  generic code.

Changes in v.2:
* The situations where VE0 mounts something and another VE unmounts it
  seem to be of no concern. If so, it is OK not to alter struct mount:
  the mount counter for a VE may become unbalanced but this is acceptable
  here.
* The sysctl knob is now defined alongside other VE sysctls.

Signed-off-by: Evgenii Shatokhin <eshatok...@virtuozzo.com>
Acked-by: Stanislav Kinsburskiy <skinsbur...@virtuozzo.com>

Testing results:
This "mount bomb" ends with "cannot allocate memory" and does no harm:
https://github.com/avagin/ctb/blob/master/mounts/run.sh

=====================================
ve/fs: Print a message if a VE reaches its limit on mounts

This should help debugging the situations when alloc_vfsmnt() fails.
Rate-limiting is used to avoid a flood of such messages if the mount
limit is hit often.

v2: Print the name of the VE rather than its address.
Suggested in https://jira.sw.ru/browse/PSBM-42825

Signed-off-by: Evgenii Shatokhin <eshatok...@virtuozzo.com>

=====================================
ve/fs: Use comparison of signed integers in ve_mount_allowed()

ve->mnt_nr is a signed integer which may become negative in some cases:

* if the increments and decrements of ve->mnt_nr race against each other
  and the decrements win;
* if something is mounted by one VE but unmounted by another VE.

sysctl_ve_mount_nr is unsigned, so the comparison
(ve->mnt_nr < sysctl_ve_mount_nr) can actually be compiled as
((unsigned int)ve->mnt_nr < sysctl_ve_mount_nr). ve_mount_allowed() would
return 0 in that case and the mount operation would fail as a result.

This patch fixes the problem. (An illustrative userspace sketch of this
pitfall is included just before the diff below.)

https://jira.sw.ru/browse/PSBM-42825

Signed-off-by: Evgenii Shatokhin <eshatok...@virtuozzo.com>

==========================================
ve/fs: convert per-VE number of currently mounted fs to atomic_t

At the moment ve->mnt_nr is not atomic and not guarded by any lock,
so convert it to atomic_t in order to avoid races on updates (see the
second sketch before the diff for the kind of lost-update race this
prevents).

https://jira.sw.ru/browse/PSBM-69880

Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktk...@virtuozzo.com>

==========================================
ve/fs: limit "fs.ve-mount-nr" sysctl with INT_MAX

The sysctl "fs.ve-mount-nr" is an unsigned int and is cast to "int" when
the values are compared => if we set it to a value > INT_MAX, the VE won't
be able to mount anything after that.

=> set the max value for the sysctl to INT_MAX.

https://jira.sw.ru/browse/PSBM-69880

Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktk...@virtuozzo.com>

===========================================
ve/fs/mount: pin mnt to a VE for correct number of mounts accounting

If a mount is done in one VE context and the umount in another, the
ve->mnt_nr counters become unbalanced, and this can cause denial of new
mounts due to the per-VE limit on the number of mounts.

Additionally, move the related functions to fs/namespace.c to avoid
having to include fs/mount.h in ve.h:

  ve_mount_nr_inc()
  ve_mount_nr_dec()
  ve_mount_allowed()

Example: a mount is done from inside a CT, the umount from the host
=> ve->mnt_nr remains incorrectly increased for the CT.

https://jira.sw.ru/browse/PSBM-69880

Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>
Reviewed-by: Kirill Tkhai <ktk...@virtuozzo.com>

===========================================
VZ 8 rebase part
https://jira.sw.ru/browse/PSBM-127782

vz7 commit: 32ea6a7 ("ve/fs: add per-VE limit of mount points")

Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalit...@virtuozzo.com>

+++ ve/fs: Compilation fix: vz_fs_table[] initialization

kernel/ve/veowner.c:75:4: error: positional initialization of field in
'struct' declared with 'designated_init' attribute [-Werror=designated-init]
    { 0 }
    ^
kernel/ve/veowner.c:75:4: note: (near initialization for 'vz_fs_table[2]')

Fixes: 9872dca0f40c ("ve/fs: add per-VE limit of mount points")
Signed-off-by: Konstantin Khorenko <khore...@virtuozzo.com>

(cherry picked from vz8 commit 35c5509b64d5c177a07a51ce38a8d082428608cb)
Signed-off-by: Andrey Zhadchenko <andrey.zhadche...@virtuozzo.com>
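
For illustration only (not part of the patch): a minimal userspace sketch
of the signed/unsigned comparison pitfall fixed by the
"Use comparison of signed integers in ve_mount_allowed()" change above.
The variable names mirror the kernel ones, but the program itself is
hypothetical.

#include <stdio.h>

int main(void)
{
	int mnt_nr = -1;           /* a counter that went negative          */
	unsigned int limit = 4096; /* plays the role of sysctl_ve_mount_nr  */

	/*
	 * The usual arithmetic conversions turn -1 into UINT_MAX here,
	 * so this check prints 0 (the compiler may warn about the
	 * signed/unsigned comparison, which is exactly the point).
	 */
	printf("unsigned compare: %d\n", mnt_nr < limit);

	/* Casting the limit to int keeps the comparison signed: prints 1. */
	printf("signed compare:   %d\n", mnt_nr < (int)limit);

	return 0;
}

With the cast, as the patch does via "(int)sysctl_ve_mount_nr", a
transiently negative ve->mnt_nr no longer blocks new mounts.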
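
Likewise, a hypothetical userspace sketch (C11 atomics and pthreads
instead of the kernel's atomic_t) of the lost-update race that motivates
converting ve->mnt_nr to atomic_t:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

static int plain_nr;         /* like the old, non-atomic ve->mnt_nr */
static atomic_int atomic_nr; /* like the new atomic_t ve->mnt_nr    */

static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < ITERS; i++) {
		plain_nr++;                      /* racy read-modify-write */
		atomic_fetch_add(&atomic_nr, 1); /* like atomic_inc()      */
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, worker, NULL);
	pthread_create(&t2, NULL, worker, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);

	/* plain_nr usually ends up below 2*ITERS; atomic_nr never does. */
	printf("plain:  %d (expected %d)\n", plain_nr, 2 * ITERS);
	printf("atomic: %d\n", atomic_load(&atomic_nr));
	return 0;
}

Build with "cc -std=c11 -pthread". The plain counter typically loses
updates under contention, which is how an unprotected ve->mnt_nr could
drift away from the real number of mounts.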
---
 fs/mount.h          |  3 +++
 fs/namespace.c      | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/ve.h  |  4 ++++
 kernel/ve/ve.c      |  4 ++++
 kernel/ve/veowner.c | 19 +++++++++++++++++++
 5 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/mount.h b/fs/mount.h
index 0b6e08c..e19f732 100644
--- a/fs/mount.h
+++ b/fs/mount.h
@@ -68,6 +68,9 @@ struct mount {
 		struct hlist_node mnt_umount;
 	};
 	struct list_head mnt_umounting; /* list entry for umount propagation */
+#ifdef CONFIG_VE
+	struct ve_struct *ve_owner;	/* VE in which this mount was created */
+#endif /* CONFIG_VE */
 #ifdef CONFIG_FSNOTIFY
 	struct fsnotify_mark_connector __rcu *mnt_fsnotify_marks;
 	__u32 mnt_fsnotify_mask;
diff --git a/fs/namespace.c b/fs/namespace.c
index bc2819c..e5da1fa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -194,9 +194,22 @@ int mnt_get_count(struct mount *mnt)
 #endif
 }
 
+static inline int ve_mount_allowed(void);
+static inline void ve_mount_nr_inc(struct mount *mnt);
+static inline void ve_mount_nr_dec(struct mount *mnt);
+
 static struct mount *alloc_vfsmnt(const char *name)
 {
-	struct mount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
+	struct mount *mnt;
+
+	if (!ve_mount_allowed()) {
+		pr_warn_ratelimited(
+			"CT#%s reached the limit on mounts.\n",
+			ve_name(get_exec_env()));
+		return NULL;
+	}
+
+	mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
 	if (mnt) {
 		int err;
 
@@ -234,6 +247,7 @@ static struct mount *alloc_vfsmnt(const char *name)
 		INIT_LIST_HEAD(&mnt->mnt_umounting);
 		INIT_HLIST_HEAD(&mnt->mnt_stuck_children);
 		mnt->mnt.mnt_userns = &init_user_ns;
+		ve_mount_nr_inc(mnt);
 	}
 	return mnt;
 
@@ -565,6 +579,7 @@ static void free_vfsmnt(struct mount *mnt)
 	mnt_userns = mnt_user_ns(&mnt->mnt);
 	if (mnt_userns != &init_user_ns)
 		put_user_ns(mnt_userns);
+	ve_mount_nr_dec(mnt);
 	kfree_const(mnt->mnt_devname);
 #ifdef CONFIG_SMP
 	free_percpu(mnt->mnt_pcp);
@@ -2749,7 +2764,38 @@ int ve_devmnt_process(struct ve_struct *ve, dev_t dev, void **data_pp, int remou
 
 	return err;
 }
-#endif
+
+static inline int ve_mount_allowed(void)
+{
+	struct ve_struct *ve = get_exec_env();
+
+	return ve_is_super(ve) ||
+		atomic_read(&ve->mnt_nr) < (int)sysctl_ve_mount_nr;
+}
+
+static inline void ve_mount_nr_inc(struct mount *mnt)
+{
+	struct ve_struct *ve = get_exec_env();
+
+	mnt->ve_owner = get_ve(ve);
+	atomic_inc(&ve->mnt_nr);
+}
+
+static inline void ve_mount_nr_dec(struct mount *mnt)
+{
+	struct ve_struct *ve = mnt->ve_owner;
+
+	atomic_dec(&ve->mnt_nr);
+	put_ve(ve);
+	mnt->ve_owner = NULL;
+}
+
+#else /* CONFIG_VE */
+
+static inline int ve_mount_allowed(void) { return 1; }
+static inline void ve_mount_nr_inc(struct mount *mnt) { }
+static inline void ve_mount_nr_dec(struct mount *mnt) { }
+#endif /* CONFIG_VE */
 
 static int ve_prepare_mount_options(struct fs_context *fc, void *data)
 {
diff --git a/include/linux/ve.h b/include/linux/ve.h
index 30d110c..32c2ffb 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -67,6 +67,8 @@ struct ve_struct {
 
 	unsigned long		meminfo_val;
 
+	atomic_t		mnt_nr; /* number of present VE mounts */
+
 	struct kthread_worker	*kthreadd_worker;
 	struct task_struct	*kthreadd_task;
 
@@ -98,6 +100,8 @@ struct ve_devmnt {
 #define capable_setveid() \
 	(ve_is_super(get_exec_env()) && capable(CAP_SYS_ADMIN))
 
+extern unsigned int sysctl_ve_mount_nr;
+
 #ifdef CONFIG_VE
 extern struct ve_struct *get_ve(struct ve_struct *ve);
 extern void put_ve(struct ve_struct *ve);
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 9da8259..3a53e6d 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -55,6 +55,8 @@ struct ve_struct ve0 = {
 #else
 	2,
 #endif
+
+	.mnt_nr			= ATOMIC_INIT(0),
 	.meminfo_val		= VE_MEMINFO_SYSTEM,
 	.vdso_64		= (struct vdso_image*)&vdso_image_64,
 	.vdso_32		= (struct vdso_image*)&vdso_image_32,
@@ -649,6 +651,8 @@ static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_
 	INIT_LIST_HEAD(&ve->ve_list);
 	kmapset_init_key(&ve->sysfs_perms_key);
 
+	atomic_set(&ve->mnt_nr, 0);
+
 	INIT_LIST_HEAD(&ve->devmnt_list);
 	mutex_init(&ve->devmnt_mutex);
 
diff --git a/kernel/ve/veowner.c b/kernel/ve/veowner.c
index afe836c..b0aba35 100644
--- a/kernel/ve/veowner.c
+++ b/kernel/ve/veowner.c
@@ -46,7 +46,26 @@ static void prepare_proc(void)
  * ------------------------------------------------------------------------
  */
 
+/*
+ * Operations with a big amount of mount points can require a lot of time.
+ * These operations take the global lock namespace_sem, so they can affect
+ * other containers. Let us allow no more than sysctl_ve_mount_nr mount
+ * points for a VE.
+ */
+unsigned int sysctl_ve_mount_nr = 4096;
+static int ve_mount_nr_min = 0;
+static int ve_mount_nr_max = INT_MAX;
+
 static struct ctl_table vz_fs_table[] = {
+	{
+		.procname	= "ve-mount-nr",
+		.data		= &sysctl_ve_mount_nr,
+		.maxlen		= sizeof(sysctl_ve_mount_nr),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &ve_mount_nr_min,
+		.extra2		= &ve_mount_nr_max,
+	},
 	{ }
 };
-- 
1.8.3.1